## Notebook Topic: Modeling Techniques Continued

<ins>Learning Objectives</ins>

1. To learn basic machine learning methods

This is *NOT* a machine learning course.  We will only learn when to use each method, how to implement it in Python, and how to analyze the output.  

**Section IV: Clustering with K-Means**

K-Means is a machine learning method for clustering.

* Clustering is the act of grouping data based on similarities in certain **quantitative** attributes.  For example, perhaps you tell Tinder that you only want to see people who are at least 6 feet tall.  Tinder will obviously recommend people who are 6 feet or taller, but it might also recommend people who are between 5'9" and 6' because their heights are close to what you asked for.
* K-Means is a method (there are many others!) that creates $k$ groups based on distance between data points.  I will go over this method here.


In [16]:
# recall how to load the iris data set from Notebook 05
import ___ as sbn
iris_DF = __

# in this line, go ahead and check out the column names and the first few rows like we did in Notebook 05!


Notice that the iris data set has four quantitative variables for each iris.  A *quantitative* variable is data that has a unit of measurement that is either continuous, like inches, or discrete, like number of cars someone owns.
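
As a quick way to tell the two kinds of variables apart in pandas, here is a minimal sketch.  The tiny `flowers` table and its column names are made up for illustration (they are not the real iris data); the only library call used is `select_dtypes`, which filters columns by data type.

```python
import pandas as pd

# Hypothetical mini data set: two quantitative columns and one qualitative column
flowers = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],   # continuous measurement
    "n_petals": [3, 3, 3],             # discrete count
    "species": ["setosa", "versicolor", "virginica"],  # category, no unit
})

# Numeric columns are the quantitative ones; everything else is qualitative
quantitative_cols = flowers.select_dtypes(include="number").columns.tolist()
qualitative_cols = flowers.select_dtypes(exclude="number").columns.tolist()

print(quantitative_cols)
print(qualitative_cols)
```

Be careful with this shortcut on real data: a column of numeric codes (like zip codes) is stored as numbers but is still qualitative.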

Perhaps we want to cluster irises together based on "sepal length" and "petal length".  Let's look at a graph to see if this makes sense to do.

In [None]:
iris_DF.plot(kind = 'scatter', x = 'sepal_length', y = 'petal_length')

<span style="color:purple">Are there any clear groupings based on the graph?  If so, how many do you think there are?  Use this textbox to answer.</span>

K-means creates "circles" around groups in the data.  

* Initialize $k$ means, called $\bar{x}_i$, for $i = 1, \dots, k$.
* Assign every point to the group of its closest mean: a point joins the $i$-th group if $\bar{x}_i$ is nearer to it than any other mean.
* Re-calculate each of the $k$ means as the average of the data points now in its group.
* Start this process over again.  Do this up to $t$ times (the process can stop early once the groups no longer change).

We, as the data analysts, select 

* $k$ (the number of clusters to create), and 
* $t$ (the maximum number of times to repeat the process).
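
To make the steps above concrete, here is a minimal sketch of the loop in plain NumPy.  The function name `kmeans_sketch`, the `initial_means` argument, and the six toy points are all assumptions made for illustration; sklearn's `KMeans` chooses its own starting means and has extra safeguards, so treat this as a picture of the idea rather than what the library does internally.

```python
import numpy as np

def kmeans_sketch(points, initial_means, t):
    """Toy k-means: assign each point to its closest mean, then update the means."""
    means = np.array(initial_means, dtype=float)
    k = len(means)
    for _ in range(t):
        # distances[i, j] = distance from point i to mean j
        distances = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        # each point joins the group of its closest mean
        labels = distances.argmin(axis=1)
        # re-calculate every mean from the points currently in its group
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old mean if a group ends up empty
                means[j] = members.mean(axis=0)
    return labels, means

# Hypothetical toy data: three points near (1, 1) and three near (5, 5)
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])

# k = 2 starting means (one taken from each visible clump), t = 5 repetitions
labels, means = kmeans_sketch(points, initial_means=points[[0, 3]], t=5)
print(labels)
print(means)
```

The starting means matter: with an unlucky start, k-means can settle on poor groups, which is one reason libraries try several starts.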


<span style="color:purple">Explain the process above in your own words.  This will help you synthesize the information.  If it helps, use an example or try to explain to a classmate or draw a picture!  Insert any of this work here.</span>

In [None]:
# we use sklearn, the most common machine learning library
from sklearn.cluster import KMeans

k_clusters = 2
t_times = 10

kmeans = KMeans(n_clusters=k_clusters, max_iter = t_times)
predicted_labels = kmeans.fit_predict(X = iris_DF.loc[:,['sepal_length','petal_length']])

The results will give us the cluster number for each data point.  Let's check this out to see what we have.

In [None]:
import matplotlib.pyplot as plt

plt.scatter(x = iris_DF.loc[:,'sepal_length'], y = iris_DF.loc[:,'petal_length'], c = predicted_labels)

<span style="color:purple">Check out how the clusters change when you change the number of clusters and the number of iterations.  
Try it out in new code chunks and text boxes below.</span>

**Section V: Classification with K-Nearest Neighbors**

K-Nearest Neighbors is a machine learning method for classification.

* Classification is the act of grouping data based on some **qualitative** attribute.  For example, once you've watched *Mean Girls* on Netflix, it might recommend *Freaky Friday* to you next because \<lindsay lohan\>, \<teen\>, \<movie\>, \<romcom\> are all attributes of *Mean Girls* shared with *Freaky Friday*.
* K-Nearest Neighbors is a method that determines which group a data point belongs to based on the most common group among its $k$ nearest neighbors.  I will go over this method here.

We'll stick with the iris dataset!  We first split the data into two subsets: a larger subset of irises that the model learns from, and a smaller subset of irises that we'll try to classify correctly.  In that order, these are the

* training data 
* test data

This is quite common to do!  Usually you want about 68-85% of the original data as your training data and the remainder as the test data.

In [20]:
# sticking with sklearn
from sklearn.model_selection import train_test_split

# setting up the data for the function
X = iris_DF[['sepal_length', 'petal_length']]
y = iris_DF['species']

# split the data so that 82% is in the training set (test_size is the fraction held out for testing)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.18)
(n_train, m_train) = train_X.shape
n_test = len(test_y)
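
Because `train_test_split` shuffles the rows at random, every run gives a different split.  The sketch below, on made-up stand-in data (the `features` and `labels` arrays are hypothetical, not the iris data), shows that passing the optional `random_state` argument makes the split repeatable, which can help when you want to compare results between runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 50 rows, 2 features, 2 classes
features = np.arange(100).reshape(50, 2)
labels = np.array(["a", "b"] * 25)

# Same random_state -> same shuffle -> same split both times
split_one = train_test_split(features, labels, test_size=0.2, random_state=0)
split_two = train_test_split(features, labels, test_size=0.2, random_state=0)

train_features, test_features, train_labels, test_labels = split_one
print(train_features.shape, test_features.shape)
```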

K-Nearest Neighbors classifies new data.

* Determines the $k$ nearest neighbors to a new data point.
* Checks out how those neighbors are classified.
* Gives the new data point the classification most common for those $k$ neighbors.

As the data analysts, we only specify 

* $k$, the number of neighbors we should care about.
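
The three steps above can be sketched by hand in a few lines.  The function name `knn_predict`, the toy points, and the labels "small" and "large" are assumptions for illustration; sklearn's classifier does the same kind of vote but with faster neighbor searches and more options.

```python
import numpy as np
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k):
    """Toy k-nearest neighbors: the k closest training points vote on the class."""
    # distance from the new point to every training point
    distances = np.linalg.norm(train_points - new_point, axis=1)
    # positions of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # count how the k neighbors are classified and return the most common class
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: three "small" flowers and two "large" ones
train_points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                         [5.0, 5.0], [5.1, 4.9]])
train_labels = ["small", "small", "small", "large", "large"]

near_small = knn_predict(train_points, train_labels, np.array([1.1, 1.0]), k=3)
near_large = knn_predict(train_points, train_labels, np.array([4.9, 5.0]), k=3)
print(near_small, near_large)
```

For the second point, two of its three nearest neighbors are "large" and one is "small", so the vote still goes to "large".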



In [21]:
# sticking with sklearn
from sklearn.neighbors import KNeighborsClassifier

k_neighbors = 3
# train the model on the training data, letting the 3 nearest neighbors vote
knn_model = KNeighborsClassifier(n_neighbors = k_neighbors).fit(train_X, train_y)

<span style="color:purple">Explain the process above in your own words.  This will help you synthesize the information.  If it helps, use an example or try to explain to a classmate or draw a picture!  Insert any of this work here. </span>

Let's see how accurate the predicted iris classes for the data in *test_X* are by comparing them to the truth, which is stored in *test_y*.

In [None]:
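import numpy as np  # needed for np.sum below

# predict the species of each test iris, then compute the fraction predicted correctly
y_pred = knn_model.predict(test_X)
score = np.sum(y_pred == test_y)/n_test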

score
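
The score computed this way is usually called *accuracy*: the fraction of predictions that match the truth.  As a small check on made-up labels (the two arrays below are hypothetical, not real model output), the by-hand calculation agrees with sklearn's `accuracy_score` function.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted species for four irises (one mistake)
true_labels = np.array(["setosa", "virginica", "versicolor", "setosa"])
predicted = np.array(["setosa", "versicolor", "versicolor", "setosa"])

# by hand: count the matches and divide by the number of irises
manual_score = np.sum(predicted == true_labels) / len(true_labels)

# the same calculation using sklearn
library_score = accuracy_score(true_labels, predicted)

print(manual_score, library_score)
```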

We can also graph our results for K-Nearest Neighbors.  <span style="color:purple">Describe a visualization that would make the most sense for what KNN does.</span>  You do not need to create one, but bonus points if you do!

**Section VI: Prediction with Linear Regression**

Linear regression can be used to make predictions!  At the start, linear regression helped us to answer 

1. Is there really a linear relationship between the explanatory and response variables in the population (all objects) or might the pattern we see in the scatterplot plausibly arise just by chance?

2. What is the rate of change that relates the response variable to the explanatory variable in the population (all objects), including the margin of error for our estimate of the slope?

Here, we will answer different questions

3. Given a new data point $x = a$, what is the $y$-value predicted by the line?
4. What is the margin of error for this prediction?  That is, under a slightly different model, how much will the prediction change?


We'll revisit the motivating problem from notebook 08.

<ins>Motivating Problem</ins> "The bigger they are, the harder they fall."  
Below is the weight (kg) of 5 different objects and the force ($\text{kg} \cdot \text{m/s}^2$) with which they hit the ground (taking into consideration there is air resistance).

| weight | force |
| ------ | ----- |
| 45.3 | 443.94 |
| 22.6 | 221.48 |
| 34.5 | 338.10 |
| 0.91 | 8.82 |
| 38.6 | 378.29 |

We will work with the original data at the start of this very long notebook, *DF*.  And we'll be working with the same least squares model as earlier, ```lsq```.

In [None]:
import numpy as np
import pandas as pd
import statsmodels.regression.linear_model as lm
import seaborn as sbn

# create a pandas data frame of our data
DF = pd.DataFrame(data = np.array([[45.3, 22.6,34.5, 0.91, 38.6], [443.94, 221.48, 338.10, 8.82, 378.29]]).T, columns = ["weight", "force"])
lsq = lm.OLS(endog = DF['force'], exog = DF['weight']).fit()

DF.head()

In [None]:
lsq.summary()

We can use ```lsq``` to predict the force with which an object weighing 55.1 kg hits the ground.

In [None]:
lsq.get_prediction(55.1)

<span style="color:purple">Why doesn't this work?  Check out ```help(lsq.predict)```.</span>

In [None]:
help(lsq.predict)

In [None]:
new_data = pd.DataFrame(np.array([55.1]), columns = ['weight'], index = [0])
predict_y = lsq.get_prediction(new_data)

# predicted_mean holds the predicted force; older statsmodels versions do not have .predicted
predict_y.predicted_mean

<span style="color:purple">Does the result seem to accurately fit the relationship between weight and force?  Why or why not? Answer in this textbox. </span>

In [None]:
# alpha = 0.05 asks for 95% intervals
predict_y.summary_frame(alpha = 0.05)

The code chunk above produces a table with the columns mean, mean_se, mean_ci_lower, mean_ci_upper, obs_ci_lower, obs_ci_upper.

* ```get_prediction``` plugs $x=55.1$ into the fitted line.  Because ```lsq``` was fit with weight as its only explanatory column and no constant term, that line is $\hat{y} = m\cdot x$ (a line through the origin, with no intercept $b$), where the slope $m$ is found by the least squares method OLS.
* We also need to know how much variation there would be in the prediction, so we use ```summary_frame``` and look at obs_ci_lower and obs_ci_upper.  These tell us that if we randomly sampled another object that weighed 55.1 kg, we would expect it to hit the ground with a force between 539.81 and 540.15 about 95% of the time.
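
The predicted value can be double-checked by hand.  This sketch assumes the model is exactly the one fit earlier in this section: weight is the only explanatory column and no constant was added, so least squares fits a line through the origin with slope $m = \sum x_i y_i / \sum x_i^2$.  The variable names here are made up for illustration.

```python
import numpy as np

# the five (weight, force) pairs from the motivating problem
weight = np.array([45.3, 22.6, 34.5, 0.91, 38.6])
force = np.array([443.94, 221.48, 338.10, 8.82, 378.29])

# least squares slope for a line through the origin: sum(x*y) / sum(x^2)
slope = np.sum(weight * force) / np.sum(weight ** 2)

# plug the new weight into y-hat = m * x
predicted_force = slope * 55.1

print(slope, predicted_force)
```

The slope comes out very close to 9.8, and the prediction lands inside the interval from obs_ci_lower to obs_ci_upper quoted above.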


## End Quiz!

<span style="color:purple">Write as much as possible here</span>.  Your answers will guide you later during your projects.

1.   When do we want to cluster data?  What does it mean to cluster data?
2.   When do we want to classify data?  What does it mean to classify data?
3.   Why might we want to use the regression line to make predictions?