# Machine Learning Overview

We are going to cover a handful of main topics in this presentation, including **Linear Regression**, **Validation**, **Classification**, and **Unsupervised Learning**. We need to start with a broader question, though.

## What is Machine Learning?

The goal of this project is to use machine learning to teach the computer how to identify objects. So we'll start with my definition of machine learning, in particular of supervised machine learning. We use an algorithm that gives the computer the tools it needs to identify patterns in a set of data. Once we have those patterns, we can use them to make predictions: what we would expect to happen if we gather more data that may not be exactly the same as the data we learned from.

We'll start by looking at a very simple set of fake data that will help us cover the key ideas. Suppose that we just collected four data points. I've manually input them. Execute the following cell to see what the data look like.

In [None]:
import pandas as pd

fakedata1 = pd.DataFrame( 
       [[ 0.862,  2.264],
       [ 0.694,  1.847],
       [ 0.184,  0.705],
       [ 0.41 ,  1.246]], columns=['input','output'])

fakedata1.plot(x='input',y='output',kind='scatter')

It is pretty clear that there is a linear trend here. If I wanted to predict what would happen if we tried the input of `x=0.6`, it would be a good guess to pick something like `y=1.6` or so. Training the computer to do this is what we mean by _Machine Learning_. 
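As a rough check on that guess, here is a minimal sketch that fits a straight line to the same four points with NumPy's `polyfit` and evaluates it at `x=0.6`. It is only an illustration of the idea; the rest of this notebook uses scikit-learn instead, and the variable names below are my own.

```python
import numpy as np

# The same four fake data points as above
x = np.array([0.862, 0.694, 0.184, 0.41])
y = np.array([2.264, 1.847, 0.705, 1.246])

# Fit a degree-1 polynomial (a straight line): y = slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)

# Use the fitted line to predict the output at a new input
x_new = 0.6
y_new = slope * x_new + intercept
print("slope={:.3f}, intercept={:.3f}, prediction at x=0.6: {:.3f}".format(slope, intercept, y_new))
```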

To formalize this a little bit, it consists of four steps:

1. We start with relevant historical data. This is our input to the machine learning algorithm.
2. Choose an algorithm. There are a number of possibilities.
3. Train the model. This is where the computer learns the pattern.
4. Test the model. We now have to check to see how well the model works.

We then refine the model and repeat the process until we are happy with the results.

### The Testing Problem

There is a bit of a sticky point here. If we use our data to train the computer, what do we use to test the model to see how good it is? If we use the same data to test the model we will, most likely, get fantastic results! After all, we used that data to train the model, so it should (if the model worked at all) do a great job of predicting the results.

However, this doesn't tell us anything about how well the model will work with a _**new**_ data point. Even if we get a new data point, we won't necessarily know what it is _supposed_ to be, so we won't know how well the model is working. There is a way around all of this that works reasonably well. What we will do is set aside a part of our historical data as "test" data. We won't use that data to train the model. Instead, we will use it to test the model to see how well it works. This gives us a good idea of how the model will work with new data points. As a rule of thumb, we want to reserve about 20% of our data set as testing data.

The scikit-learn library provides a function called `train_test_split` that does this for us in Python. The documentation is here: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html.

One of the inputs we will use is the `random_state` option. By using the same number here we should all end up with the same results. If you change this number, you change the random distribution of the data and, thus, the end result.
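The following sketch illustrates what `random_state` does, using a small stand-in array rather than the notebook data (the array and variable names are just for illustration). Splitting twice with the same `random_state` should give identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: ten numbered samples (hypothetical, for illustration only)
samples = np.arange(10)

# Two splits with the same random_state
train_a, test_a = train_test_split(samples, test_size=0.2, random_state=23)
train_b, test_b = train_test_split(samples, test_size=0.2, random_state=23)

# The same seed reproduces the same split
print("Same split:", np.array_equal(test_a, test_b))
print("Test size:", len(test_a), "of", len(samples))
```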

In [None]:
from sklearn.model_selection import train_test_split

faketrain1, faketest1 = train_test_split(fakedata1, test_size=0.2, random_state=23)
faketrain1.plot(x='input',y='output',kind='scatter')
faketest1.plot(x='input',y='output',kind='scatter')

You can see that, with a 20% split, our small fake dataset doesn't have very many points. Really we shouldn't be working with fewer than 100 points for anything we do. Any fewer than that and the statistics just start breaking. Ideally we'd have tens of millions of data points. We'll talk later about how to get that much data, but we'll start small for now. We'll load in the `Class02_fakedata2.csv` file and split it into 80/20 training/testing datasets.

In [None]:
fakedata2 = pd.read_csv('mondaydata/Class02_fakedata2.csv')
faketrain2, faketest2 = train_test_split(fakedata2, test_size=0.2, random_state=23)
faketrain2.plot(x='input',y='output',kind='scatter')
faketest2.plot(x='input',y='output',kind='scatter')

# Linear Regression

We are now ready to train our linear model on the training part of this data. Remember that, from this point forward, we must "lock" the testing data and not use it to train our models. This takes two steps in Python. The first step is to define the model and set any model parameters (in this case we'll use the defaults). This is a Python object that will subsequently hold all the information about the model including fit parameters and other information about the fit. Again, take a look at the documentation: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html.

The second step is to actually fit the data. We need to reformat our data so that we can tell the computer what our inputs are and what our outputs are. We define two new variables called "features" and "labels". Note the use of the double square bracket in selecting data for the features. This will allow us to, in the future, select multiple columns as our input variables. In the meantime, it formats the data in the way that the fit algorithm needs it to be formatted.
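To see what the double square brackets change, this small sketch compares the shapes of the two selections on a tiny stand-in DataFrame (the values are made up for illustration). Scikit-learn estimators expect the features as a two-dimensional array with one row per sample and one column per feature.

```python
import pandas as pd

# Tiny stand-in DataFrame (made-up values)
df = pd.DataFrame({'input': [0.1, 0.2, 0.3], 'output': [1.0, 2.0, 3.0]})

# Double brackets select a DataFrame, so .values is 2-D: (samples, features)
features = df[['input']].values
# Single brackets select a Series, so .values is 1-D: (samples,)
labels = df['output'].values

print(features.shape)
print(labels.shape)
```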

In [None]:
faketrain2.head()

In [None]:
from sklearn.linear_model import LinearRegression

# Step 1: Create linear regression object
regr = LinearRegression()

# Step 2: Train the model using the training sets
features = faketrain2[['input']].values
labels = faketrain2['output'].values

regr.fit(features,labels)

We now want to see what this looks like!  We start by looking at the fit coefficient and intercept. When we have more than one input variable, there will be a coefficient corresponding to each feature.
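As a sketch of what the coefficients and intercept mean, the example below fits a model to made-up data with two features that follows `y = 1 + 2*x0 + 3*x1` exactly, then reproduces the model's prediction by hand. The data and names are assumptions for illustration, not part of the class data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data with two features that follows y = 1 + 2*x0 + 3*x1 exactly
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]], dtype=float)
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

demo_model = LinearRegression()
demo_model.fit(X, y)

# One coefficient per feature, plus a single intercept
print("Coefficients:", demo_model.coef_)
print("Intercept:", demo_model.intercept_)

# A prediction is the intercept plus the coefficient-weighted sum of the features
x_new = np.array([[3.0, 2.0]])
by_hand = demo_model.intercept_ + x_new @ demo_model.coef_
print("By hand:", by_hand, "Model:", demo_model.predict(x_new))
```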

In [None]:
print('Coefficients: \n', regr.coef_)
print('Intercept: \n', regr.intercept_)

That doesn't really tell us much. It would be better if we could compare the model to the test data. We will use the inputs from the test data and run them through the model. It will predict what the outputs should be. We can then compare them to the actual outputs. We'll plot the predictions as a line (since they will all lie on the same line due to our model being a linear regression).

In [None]:
testinputs = faketest2[['input']].values
predictions = regr.predict(testinputs)
actuals = faketest2['output'].values

import matplotlib.pyplot as plt
plt.scatter(testinputs, actuals, color='black', label='Actual')
plt.plot(testinputs, predictions, color='blue', linewidth=1, label='Prediction')

# We also add a legend to our plot. Note that we've added the 'label' option above. This will put those labels together in a single legend.
plt.legend(loc='upper left', shadow=False, scatterpoints=1)
plt.xlabel('input')
plt.ylabel('output')

In [None]:
plt.scatter(testinputs, (actuals-predictions), color='green', label=r'Residuals just because $\lambda$')
plt.xlabel('input')
plt.ylabel('residuals')
plt.legend(loc='upper left', shadow=False, scatterpoints=1)

This looks pretty good. We can go one step further and define a quantitative measure of the quality of the fit. We will take the difference between the prediction and the actual value for each point. We then square all of those and average them. Finally we take the square root of all of that. This is known as the RMS error (for Root Mean Squared).
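Written as a formula, with $\hat{y}_i$ the prediction and $y_i$ the actual value for test point $i$ out of $N$ test points (these symbols are my notation, not something defined earlier in the class):

$$\mathrm{RMS} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}$$

For example, predictions of 1, 2, 3 against actual values of 1, 2, 5 give squared differences of 0, 0, 4, a mean of 4/3, and an RMS error of about 1.155.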

In [None]:
import numpy as np
print("RMS Error: {0:.3f}".format( np.sqrt(np.mean((predictions - actuals) ** 2))))

# Using Multiple Inputs

We'll now move to a real-world data set (which means it is messy). We'll load in the diabetes data set from Class 01 and try training it. Our input will be the 'BMI' feature and the output is the 'Target' column.


In [None]:
diabetes = pd.read_csv('mondaydata/Class01_diabetes_data.csv')
diabetes.head()

I've put all the steps together in one cell and commented on each step.

In [None]:
# Step 1: Split off the test data
dia_train, dia_test = train_test_split(diabetes, test_size=0.2, random_state=23)

# Step 2: Create linear regression object
dia_model = LinearRegression()

# Step 3: Select the features and labels from the training set
features = dia_train[['BMI']].values
labels = dia_train['Target'].values

# Step 4: Fit the model
dia_model.fit(features,labels)

# Step 5: Get the predictions
testinputs = dia_test[['BMI']].values
predictions = dia_model.predict(testinputs)
actuals = dia_test['Target'].values

# Step 6: Plot the results
plt.scatter(testinputs, actuals, color='black', label='Actual')
plt.plot(testinputs, predictions, color='blue', linewidth=1, label='Prediction')
plt.xlabel('BMI') # Label the x axis
plt.ylabel('Target') # Label the y axis
plt.legend(loc='upper left', shadow=False, scatterpoints=1)

# Step 7: Get the RMS value
print("RMS Error: {0:.3f}".format( np.sqrt(np.mean((predictions - actuals) ** 2))))

Not too surprising that the RMS error isn't very good. This is the real world after all. However, we saw in Class 01 that there may be some dependence on some of the other variables like the LDL. We can try a linear regression with both of them as inputs. I have to change the code a little to do this. Compare this with the previous cell to see what needs to change.

In [None]:
# Step 2: Create linear regression object
dia_model2 = LinearRegression()

# Possible columns:
# 'Age', 'Sex', 'BMI', 'BP', 'TC', 'LDL', 'HDL', 'TCH', 'LTG', 'GLU'
#
inputcolumns = [ 'BMI', 'HDL']

# Step 3: Select the features and labels from the training set
features = dia_train[inputcolumns].values
labels = dia_train['Target'].values

# Step 4: Fit the model
dia_model2.fit(features,labels)

# Step 5: Get the predictions
testinputs = dia_test[inputcolumns].values
predictions = dia_model2.predict(testinputs)
actuals = dia_test['Target'].values

# Step 6: Plot the results
#
# Note the change here in how we plot the test inputs. We can only plot one variable, so we choose the first.
# Also, it no longer makes sense to plot the fit points as lines. They have more than one input, so we only visualize them as points.
#

plt.scatter(testinputs[:,0], actuals, color='black', label='Actual')
plt.scatter(testinputs[:,0], predictions, color='blue', label='Prediction')
plt.legend(loc='upper left', shadow=False, scatterpoints=1)

# Step 7: Get the RMS value
print("RMS Error: {0:.3f}".format( np.sqrt(np.mean((predictions - actuals) ** 2))))

## ML Models: Classifiers

We are now going to work with classifier models. We start with a sample dataset from Sebastian Thrun's Udacity Machine Learning course. Here's the scenario: we are building a self-driving car. We have mapped out the course we are taking and created a dataset that indicates, on a scale from 0 to 1, how bumpy the road is and, on the same scale, how steep the road is (measured in "grade"). For each road we need to know whether we should have the car drive "slow" or "fast". For example, we want to slow down for bumpy roads. But we may want to speed up when we are going up steep hills. I've created a sample dataset from fake data that maps this out. We start by loading and plotting the data.

In [None]:
import pandas as pd
import seaborn as sns
sns.set_style("white")

#Note the new use of the dtype option here. We can directly tell pandas to use the Speed column as a category in one step.
speeddf = pd.read_csv("mondaydata/Class04_speed_data.csv",dtype={'Speed':'category'})

lm = sns.lmplot(x='Grade', y='Bumpiness', data=speeddf, hue='Speed', fit_reg=False)
sns.despine(ax=lm.ax, top=False, right=False)

We will start with a subset of this data to illustrate what we are trying to do here. We use the `sample()` function to get a small piece of the data (we use the random_state option to make sure we use the same set of data every time, otherwise the data will change).

In [None]:
speedsub = speeddf.sample(16,random_state=55)
sns.lmplot(x='Grade', y='Bumpiness', data=speedsub, hue='Speed', fit_reg=False)
sns.despine(top=False, right=False)

What we want to do is have the computer learn where the boundary lies between the fast data points and the slow data points. That way we can input any grade and any bumpiness and the computer will tell us whether to go fast or slow. It looks like there is a region between the two sets of data where we could potentially put our boundary.

In [None]:
lm = sns.lmplot(x='Grade', y='Bumpiness', data=speedsub, hue='Speed', fit_reg=False)
sns.despine(ax=lm.ax, top=False, right=False)

from matplotlib.patches import Polygon
from matplotlib.collections import PatchCollection
patches=[]
polygon = Polygon([[.92,0],[1,0],[1,.24],[0,.9],[0,.67]], closed=True)
patches.append(polygon)
p = PatchCollection(patches, alpha=0.4)
lm.ax.add_collection(p)

How do we decide where in this region to put the boundary? There are a couple of different algorithms that will do the job for us. We're not going to spend time describing how they work - you can look them up if you are interested in the mathematics. Instead, we'll look at how to apply them and look at how well they work.

## Perceptron

The first algorithm is called the Perceptron (information on how it works is found on Wikipedia: https://en.wikipedia.org/wiki/Perceptron#Learning_algorithm). The documentation for the Scikit Learn Perceptron is found [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html). We'll use a syntax very similar to the pattern we used in Class02. First, we split the data into training and testing sets.

In [None]:
from sklearn.model_selection import train_test_split

trainsub, testsub = train_test_split(speedsub, test_size=0.2, random_state=23)

Now we import the model and train it, just like we did with the linear regression.

In [None]:
from sklearn.linear_model import Perceptron

# Step 1: Create Perceptron classifier object
model = Perceptron()

# Step 2: Train the model using the training sets
features = trainsub[['Grade','Bumpiness']].values
labels = trainsub['Speed'].values

model.fit(features,labels)
print("Model Coefficients: {}".format(model.coef_))
print("Model Intercept: {}".format(model.intercept_))

We would like to visualize the decision boundary between the two classes. There are a couple of ways we could do this. For linear models like the perceptron, we can get the coefficients from the model and then plot them as a line. There are a couple of other steps to this, but fortunately, there is [code](http://stackoverflow.com/questions/22294241/plotting-a-decision-boundary-separating-2-classes-using-matplotlibs-pyplot) to help us figure it out.
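The algebra behind that line is short. Assuming the usual linear decision rule, where $w_0$ and $w_1$ are the two entries of `model.coef_[0]` and $b$ is `model.intercept_[0]`, the boundary is the set of points where the decision function is zero:

$$w_0 x + w_1 y + b = 0 \quad\Longrightarrow\quad y = -\frac{w_0}{w_1}\,x - \frac{b}{w_1}$$

Here $x$ is the grade and $y$ is the bumpiness. This only works when $w_1 \neq 0$; a perfectly vertical boundary would need to be handled separately.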

In [None]:
import matplotlib.pyplot as plt
import numpy as np
w = model.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(0,1)
yy = a * xx - (model.intercept_[0]) / w[1]

# Plot the points
lm2 = sns.lmplot(x='Grade', y='Bumpiness', data=speedsub, hue='Speed', fit_reg=False)
sns.despine(ax=lm2.ax, top=False, right=False)

# Plot our range estimate
p2 = PatchCollection(patches, alpha=0.4)
lm2.ax.add_collection(p2)

# Plot the actual decision boundary
plt.plot(xx, yy, 'k-')



Note that the line isn't very good - remember that we only used a subset of the data to fit the decision boundary. But it still lies in the expected range.

There is another way we could plot this: we could split our figure into small boxes, then make a prediction for each box. We then plot all the decisions in two different colors, showing the prediction for each box. This gives us a more general tool for plotting not only linear boundaries, but any possible decision boundary.

In [None]:

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh
x_min = 0.0; x_max = 1.0 # Mesh x size
y_min = 0.0; y_max = 1.0  # Mesh y size
h = .01  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Now predict the results at each point and get the categorical values
Zpred = model.predict(np.c_[xx.ravel(), yy.ravel()])
Zseries = pd.Series(Zpred, dtype='category')
Zvalues = Zseries.cat.codes.values
Z = Zvalues.reshape(xx.shape)


# First plot our points
lm2 = sns.lmplot(x='Grade', y='Bumpiness', data=speedsub, hue='Speed', fit_reg=False)
sns.despine(ax=lm2.ax, top=False, right=False)

# Now add in the decision boundary
plt.pcolormesh(xx, yy, Z, cmap= plt.cm.cool, alpha=0.1)

At this point, let's go back to the entire dataset and fit the decision boundary for it. We'll also look at the out-of-sample performance by plotting the test data instead of the train data.

In [None]:
train, test = train_test_split(speeddf, test_size=0.2, random_state=23)

model2 = Perceptron()

features_train = train[['Grade','Bumpiness']].values
labels_train = train['Speed'].values
features_test = test[['Grade','Bumpiness']].values
labels_test = test['Speed'].values

model2.fit(features_train,labels_train)

Zpred = pd.Series(model2.predict(np.c_[xx.ravel(), yy.ravel()]), dtype='category').cat.codes.values
Z = Zpred.reshape(xx.shape)

# First plot our points
lm = sns.lmplot(x='Grade', y='Bumpiness', data=test, hue='Speed', fit_reg=False)
sns.despine(ax=lm.ax, top=False, right=False)
plt.pcolormesh(xx, yy, Z, cmap= plt.cm.cool, alpha=0.1)

So, there are a few things to note here. First, the Perceptron has given us a boundary that works fairly well. However, it isn't perfect. There are a few points that are labeled "fast" that will now be classified as "slow". It would be nice to have a way to quantify how well the classifier has performed. We'll look at a new set of tools to do that.

## Evaluation Metrics

First, we review the evaluation metric we've already seen: the RMS value for the linear regression. We calculated this by taking our model prediction, subtracting the actual value, squaring the difference, then averaging over all points in the test set. Finally, we took the square root of this to get the RMS: "[Square]Root [of the] Mean-Squared". A perfect fit would give an RMS of 0.0 and larger RMS values mean that the fit is not performing as well.

There are more ways to evaluate the performance of a classifier model. They all start with the confusion matrix, so we'll start there.

### The Confusion Matrix

The first thing we do is recognize that there are, for a binary, or two-state classifier, four possible outcomes when we evaluate each test point:
1. The prediction says "slow" and the actual label says "slow"
2. The prediction says "fast", but the actual label says "slow"
3. The prediction says "slow", but the actual label says "fast"
4. The prediction says "fast" and the actual label says "fast"


The first and last possibilities indicate that the prediction did a good job, but the other two mean there were problems. Let's make this into a table:


|            |      | Predicted | Predicted |
|:----------:|:----:|:---------:|:---------:|
|            |      | Slow      | Fast      |
| **Actual** | Slow | #1        | #2        |
| **Actual** | Fast | #3        | #4        |

Now we need to count how many of each possibility there were using the test data. There is, naturally, a tool to do this for us.


In [None]:
from sklearn.metrics import confusion_matrix
class_labels = ["slow", "fast"]
y_pred = model2.predict(features_test)
cnf_matrix = confusion_matrix(labels_test, y_pred,labels=class_labels)
print(cnf_matrix)

We can also visualize this as a graphic, showing a shade of color for each of the different values. This is especially useful when we have more than two classes. Because we'll use this again, we define a function that takes the class labels and confusion matrix as inputs and creates the plot.

In [None]:
def show_confusion_matrix(cnf_matrix, class_labels):
    plt.matshow(cnf_matrix,cmap=plt.cm.YlGn,alpha=0.7)
    ax = plt.gca()
    ax.set_xlabel('Predicted Label', fontsize=16)
    ax.set_xticks(range(0,len(class_labels)))
    ax.set_xticklabels(class_labels)
    ax.set_ylabel('Actual Label', fontsize=16, rotation=90)
    ax.set_yticks(range(0,len(class_labels)))
    ax.set_yticklabels(class_labels)
    ax.xaxis.set_label_position('top')
    ax.xaxis.tick_top()

    for row in range(len(cnf_matrix)):
        for col in range(len(cnf_matrix[row])):
            ax.text(col, row, cnf_matrix[row][col], va='center', ha='center', fontsize=16)
        
show_confusion_matrix(cnf_matrix,class_labels)

We can see now that the diagonal entries are what we want: the darker they are, the better we are doing. The off-diagonal terms (the slow-fast and fast-slow terms) are points that have been incorrectly identified. It would be nice if we could distill this matrix down into a single number. Unfortunately, there is no unique way of doing that. There are a couple of different metrics that people use. There is a nice summary [here](http://www.kdnuggets.com/2016/12/best-metric-measure-accuracy-classification-models.html) of some of the metrics and how people use them.
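As a sketch of how such single-number summaries are computed, here are three commonly used ones (accuracy, precision, and recall) worked out from a confusion matrix. The counts below are hypothetical, chosen only to make the arithmetic easy to follow, and treating "fast" as the positive class is my assumption.

```python
import numpy as np

# Hypothetical confusion matrix (rows = actual, columns = predicted),
# in the order ["slow", "fast"]. These counts are made up for illustration.
cm = np.array([[50, 5],
               [10, 35]])

# Accuracy: fraction of all points on the diagonal (correct predictions)
accuracy = np.trace(cm) / cm.sum()

# Treat "fast" (index 1) as the positive class (an assumption)
true_pos = cm[1, 1]
false_pos = cm[0, 1]   # actual slow, predicted fast
false_neg = cm[1, 0]   # actual fast, predicted slow

# Precision: of the points predicted fast, how many really were fast
precision = true_pos / (true_pos + false_pos)
# Recall: of the points that really were fast, how many we caught
recall = true_pos / (true_pos + false_neg)

print("Accuracy: {:.3f}".format(accuracy))
print("Precision: {:.3f}".format(precision))
print("Recall: {:.3f}".format(recall))
```

Which of these matters most depends on which kind of mistake is more costly for the problem at hand.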

The Perceptron is typically slow and not very flexible. With a large dataset it takes a long time to reach a solution. Although it is simple to implement, it isn't very good and isn't used much.

## ML Models: Clustering

Up to this point we have been working with *supervised* learning - we have a set of features and labels (or outputs in the case of a regression) that we used to teach the machine. What if we don't have labels? What if all we have is a set of unlabeled data points? It is still possible to do some types of *unsupervised* machine learning. We are interested in separating out groups of data or creating *clusters*. This can be useful if we are looking for patterns in our data. One pattern could be that the data clumps together around certain points. However, before we can check to see if there are data clusters, we have to know how many clusters to look for. Fortunately the **K-means clustering** algorithm works very quickly, so we should be able to try a variety of cluster numbers fairly quickly.

### Demo

Before we dive into working with our own data, there is an [excellent visualization tool](https://www.naftaliharris.com/blog/visualizing-k-means-clustering/) that shows how this algorithm works. We will explore this together as a class before we move on to the next step.

### Sample Data

I am going to follow [a tutorial](https://www.datascience.com/blog/introduction-to-k-means-clustering-algorithm-learn-data-science-tutorials) that does a good job of describing how k-means clustering works. The data are based on measurements of truck drivers. There are two features: the mean percentage of time a driver was >5 mph over the speed limit and the mean distance driven per day.

We'll use a [scikit tool](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster) to try the unsupervised clustering on the data. Naturally we'll start by loading the sample data and plotting it to see what we've got.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
df1=pd.read_csv('mondaydata/Class09_cluster_example1.csv',index_col=0)
df1.head()


Not knowing exactly what we're working with, let's get the minima and maxima of the two features. We'll use this later to create plots of our predictions.

In [None]:
df1.plot.scatter(x='Distance_Feature',y='Speeding_Feature',marker='.')
x_min, x_max = df1['Distance_Feature'].min() - 1, df1['Distance_Feature'].max() + 1
y_min, y_max = df1['Speeding_Feature'].min() - 1, df1['Speeding_Feature'].max() + 1
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)

There are a couple of ways we could try splitting up this data. Like we saw in the demonstration, we have to choose the number of clusters ($k$) before we start. We'll try a couple of values and then look at them to see how they map against the data. There isn't any point in splitting the data into training/testing subsets because we don't have a label to train on. So we'll fit all the data.

In [None]:
from sklearn.cluster import KMeans

kmeans2 = KMeans(n_clusters=2).fit(df1)

We will make a visualization like we've done with the classification algorithms: we want to map out which regions will be predicted to be which class. That will take a bit of work.

In [None]:
import numpy as np

# Step size of the mesh covering [x_min, x_max] x [y_min, y_max].
# Decrease to increase the quality of the plot.
h = 0.5

# Create the mesh for plotting the decision boundary
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans2.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Get the centroid (or center position) of each region so we can plot that, too.
centroids = kmeans2.cluster_centers_

# First plot the mesh that has the predictions for each point on the mesh.
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

# Now put the points on top
plt.plot(df1['Distance_Feature'], df1['Speeding_Feature'], 'k.', markersize=2)

# Plot the centroids as a white X
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
plt.title('K-means with 2 Clusters')

# And fix the plot limits and labels.
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.ylabel('Speeding_Feature')
plt.xlabel('Distance_Feature')


So this looks pretty good. There is a nice clear boundary between the two halves of the plot. The centroids (marked as white "X"s on the plot) look about right, too. I'm happy with this clustering. 

If we get a new point now, we can easily classify it as belonging to one of these two groups. For example, we could now create a feature based on this grouping and then use that for other machine learning.
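A minimal sketch of that idea, using made-up points rather than the truck driver data (all names and values here are illustrative): once the model is fit, `predict` assigns a new point to the nearest centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: two well-separated groups of points
group_a = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.6], [0.4, 0.4]])
group_b = np.array([[10.0, 10.0], [10.5, 9.8], [9.7, 10.3], [10.2, 10.1]])
points = np.vstack([group_a, group_b])

# n_init and random_state are set explicitly so the result is repeatable
demo_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Assign a new, unseen point to one of the existing clusters
new_point = np.array([[9.5, 10.2]])
new_label = demo_kmeans.predict(new_point)[0]

print("Cluster of new point:", new_label)
print("Clusters of group_b:", demo_kmeans.labels_[4:])
```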

## Cluster-data mismatches

Now let's see what happens if we pick 3 clusters from the start.

In [None]:
kmeans3 = KMeans(n_clusters=3).fit(df1)


Z = kmeans3.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
centroids = kmeans3.cluster_centers_

plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

plt.plot(df1['Distance_Feature'], df1['Speeding_Feature'], 'k.', markersize=2)
# Plot the centroids as a white X
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
plt.title('K-means with 3 Clusters')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.ylabel('Speeding_Feature')
plt.xlabel('Distance_Feature')

The algorithm does the best it can with what we gave it: it found three clusters of data. But the decision boundaries do not match the data very well. So let's try 4 instead.

In [None]:
kmeans4 = KMeans(n_clusters=4).fit(df1)

Z = kmeans4.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
centroids = kmeans4.cluster_centers_

plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

plt.plot(df1['Distance_Feature'], df1['Speeding_Feature'], 'k.', markersize=2)
# Plot the centroids as a white X
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
plt.title('K-means with 4 Clusters')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.ylabel('Speeding_Feature')
plt.xlabel('Distance_Feature')

That looks better. There are at least three good groups but the algorithm picks out the upper right-hand corner as the fourth grouping. I think we can work with this.
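Judging the fits by eye works here, but a number can help too. One common heuristic (not something this class has covered, so treat it as an aside) is to compare the `inertia_` attribute of each fit, which is the sum of squared distances from each point to its assigned centroid. The sketch below uses synthetic blobs from `make_blobs` rather than the truck driver data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three true clusters (for illustration only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit k-means for several values of k and record the inertia of each fit
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print("k={}: inertia={:.1f}".format(k, value))
```

Inertia generally keeps dropping as k grows, so people usually look for the point where the improvement levels off rather than for the minimum.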


## Using k-means groups as features

What if we want to add the groups to the dataframe to use it for other machine learning algorithms? We'll add in the feature column then plot the data using this new feature.


In [None]:
# Create the new column based on the labels from the kmeans fit
df1['kmeansgroup'] = kmeans4.labels_

# We group the data by this column
groups = df1.groupby('kmeansgroup')


# Then plot it
trainfig, ax = plt.subplots()
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
# The next step is to cycle through the groups (based on our categories) and plot each one on the same axis.
for name, group in groups:
    ax.plot(group['Distance_Feature'], group['Speeding_Feature'], marker='o', linestyle='', ms=6, label=name)
    ax.set_aspect(1)
ax.legend(bbox_to_anchor=(1.2,0.5))
ax.set_xlabel('Distance_Feature')
ax.set_ylabel('Speeding_Feature')

We didn't rename the clusters: they are simply labeled 0-3, but that's probably good enough for now.

## Weekly Assignment

The primary goal for the next couple of weeks is to expand your object detection model, train, and test it on real world images. You have two more weeks until final presentations. We hope everyone will have a trained and functioning model to show at the end!