# Illustrating a few types of Machine Learning
* Unsupervised: Clustering 
* Supervised: Classification
* Supervised: Regression

## Scikit-learn

<img src="images/scikit-learn.png" width=500>

https://scikit-learn.org/stable/index.html
<br>

In [None]:
# You may need to run the following to install packages
# !pip install scikit-learn ipywidgets plotly

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import ipywidgets
import plotly.graph_objects as go

import sklearn.datasets
import sklearn.cluster
import sklearn.linear_model

## Make some data

This creates 300 points in a 2D space, grouped into 3 clusters. `x` holds the coordinates of each point and `y` holds its cluster label (0, 1, or 2).

In [None]:
x, y = sklearn.datasets.make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

In [None]:
# first 5 elements of x
print(x[:5])

In [None]:
# first 5 elements of y (the labels)
y[:5]
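
As a quick sanity check on what `make_blobs` returns, the sketch below inspects the array shapes and counts the points per label. It repeats the call from above so that it runs on its own; the `_demo` names are only for illustration and are not used elsewhere in this notebook.

```python
import numpy as np
import sklearn.datasets

# Same call as above, repeated so this cell is self-contained
x_demo, y_demo = sklearn.datasets.make_blobs(
    n_samples=300, centers=3, cluster_std=0.60, random_state=0
)

print(x_demo.shape)  # one row per point, one column per feature
print(y_demo.shape)  # one label per point

# Count how many points carry each label
counts = np.bincount(y_demo)
print(counts)
```

With 300 samples and 3 centers, each label should appear equally often.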

In [None]:
# this is a list of colors to use for plotting
ourcolors = ['red','blue','black','green','yellow','magenta','orange','brown','grey','aqua']

In [None]:
plt.scatter(x[:,0],
            x[:,1],
            color=[ourcolors[i] for i in y])
plt.show()

For our ML examples:
* Unsupervised
    * We'll assume that we **do not** have access to the labels y
* Supervised
    * We'll assume that we **do** have access to the labels y
    * Discrete values of labels -> Classification
    * Continuous values of labels -> Regression
        * We need continuous-valued labels to showcase regression
        * Here we define $y_{4regression} = -3 + 2x_0 + 5x_1$

In [None]:
y_4regression = -3 + 2*x[:, 0] + 5*x[:, 1]

In [None]:
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(projection='3d')

ax.scatter(x[:, 0], x[:, 1], y_4regression)

There are 2 features here, $x_0$ and $x_1$.

Trends are hard to see in a static 3D plot. Let's:
* Make two 2D plots showing how y varies with each component of x
* Calculate the correlation coefficients
* Make a more interactive 3D plot using Plotly

In [None]:
plt.scatter(x[:,0], y_4regression)

In [None]:
plt.scatter(x[:,1], y_4regression)

In [None]:
np.corrcoef(x[:,0], y_4regression)

In [None]:
np.corrcoef(x[:,1], y_4regression)
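
`np.corrcoef` returns a full 2x2 correlation matrix rather than a single number. The sketch below, which rebuilds the data so it can run on its own, pulls out the off-diagonal entry, which is the Pearson correlation between the feature and the target. The `_demo` names are illustrative.

```python
import numpy as np
import sklearn.datasets

# Rebuild the data and the regression target defined above
x_demo, _ = sklearn.datasets.make_blobs(
    n_samples=300, centers=3, cluster_std=0.60, random_state=0
)
y_reg_demo = -3 + 2 * x_demo[:, 0] + 5 * x_demo[:, 1]

# Full matrix: diagonal is 1, off-diagonal entries are the correlation
corr_matrix = np.corrcoef(x_demo[:, 0], y_reg_demo)
print(corr_matrix)

# Keep just the single coefficient for each feature
r_x0 = corr_matrix[0, 1]
r_x1 = np.corrcoef(x_demo[:, 1], y_reg_demo)[0, 1]
print(f"corr(x0, y) = {r_x0:.3f}")
print(f"corr(x1, y) = {r_x1:.3f}")
```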

In [None]:
fig = go.Figure(data=[go.Scatter3d(x=x[:, 0],
                                   y=x[:, 1],
                                   z=y_4regression,
                                   mode='markers',
                                   marker=dict(
                                       size=2,
    ))])
fig.show()

## Unsupervised: Clustering

For unsupervised learning, we'll assume that we do not know any of the labels.
* Ignore `y`

We create a model object by calling `KMeans` with the number of clusters we want to find.

In [None]:
# Choose the model

model = sklearn.cluster.KMeans(n_clusters=3)

We then call the `fit` method and pass in the data we want to cluster.

In [None]:
# Train the model

model.fit(x)

The model now has an attribute `labels_` that stores the index of the cluster each point was assigned to.

In [None]:
model.labels_

Example x value:

In [None]:
x[[0]]

And an example of its predicted cluster:

In [None]:
model.predict(x[[0]])
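
For a fitted `KMeans` model, `predict` assigns a point to the cluster whose center is closest. The sketch below checks this by computing the distances by hand. It refits a separate model on regenerated data so it can run on its own; `n_init=10` and `random_state=0` are choices made here for reproducibility, and the `_demo` names are illustrative.

```python
import numpy as np
import sklearn.cluster
import sklearn.datasets

x_demo, _ = sklearn.datasets.make_blobs(
    n_samples=300, centers=3, cluster_std=0.60, random_state=0
)
kmeans_demo = sklearn.cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans_demo.fit(x_demo)

# One center per cluster, each with 2 coordinates
centers = kmeans_demo.cluster_centers_
print(centers)

# Distance from every point to every center: shape (300, 3)
distances = np.linalg.norm(x_demo[:, None, :] - centers[None, :, :], axis=2)

# Index of the nearest center for each point
nearest = distances.argmin(axis=1)
print(np.array_equal(nearest, kmeans_demo.predict(x_demo)))
```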

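We can plot the clusters that the model identified by coloring the points again.

The color now shows the **model's cluster label**, not the true label `y`. Cluster numbering is arbitrary, so the same group of points may get a different color than in the earlier plot.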

In [None]:
plt.scatter(x[:,0],
            x[:,1],
            color=[ourcolors[i] for i in model.labels_])
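
Because cluster indices are arbitrary, comparing `model.labels_` to `y` element by element is not meaningful. One way to compare the two groupings anyway is a permutation-invariant score such as the adjusted Rand index from `sklearn.metrics`. This is a sketch on regenerated data with a separately fitted model; using the true labels here is only for evaluation, since the clustering itself never sees them.

```python
import sklearn.cluster
import sklearn.datasets
from sklearn.metrics import adjusted_rand_score

x_demo, y_demo = sklearn.datasets.make_blobs(
    n_samples=300, centers=3, cluster_std=0.60, random_state=0
)
kmeans_demo = sklearn.cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans_demo.fit(x_demo)

# 1.0 means identical groupings (up to renaming); values near 0 mean chance level
ari = adjusted_rand_score(y_demo, kmeans_demo.labels_)
print(f"Adjusted Rand index: {ari:.3f}")
```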

The following widget lets us interactively change the number of clusters that KMeans looks for and see the resulting clusters.

In [None]:
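def plotblobs(n):
    # Refit KMeans with n clusters and color each point by its assigned cluster
    model = sklearn.cluster.KMeans(n_clusters=n)
    model.fit(x)
    plt.scatter(x[:, 0], x[:, 1], color=[ourcolors[i] for i in model.labels_])
    plt.show()

# Slider for n from 1 to 10 (ourcolors has 10 entries)
ipywidgets.interact(plotblobs, n=(1, 10));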
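
A non-interactive way to see the effect of the number of clusters is to print the fitted model's `inertia_`, the sum of squared distances from each point to its assigned center. This is a sketch on regenerated data; the range of `k` values, `n_init=10`, and `random_state=0` are choices made here for illustration.

```python
import sklearn.cluster
import sklearn.datasets

x_demo, _ = sklearn.datasets.make_blobs(
    n_samples=300, centers=3, cluster_std=0.60, random_state=0
)

inertias = []
for k in range(1, 7):
    kmeans_demo = sklearn.cluster.KMeans(n_clusters=k, n_init=10, random_state=0)
    kmeans_demo.fit(x_demo)
    # Lower inertia means points sit closer to their assigned centers
    inertias.append(kmeans_demo.inertia_)
    print(f"k={k}: inertia={kmeans_demo.inertia_:.1f}")
```

Inertia tends to keep shrinking as `k` grows, so a low value alone does not indicate the right number of clusters; what usually matters is where the improvement levels off.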

## Supervised: Classification

For supervised learning, we'll assume that we do know what the labels are.
* Include `y`

We will train a classification algorithm called Logistic Regression.

In [None]:
# Choose the model

model = sklearn.linear_model.LogisticRegression()

In [None]:
# Train the model

model.fit(x, y)

Training "learns" the parameters of the logistic model that define the classification boundaries in the $(x_0, x_1)$ space. With 3 classes, the model holds one intercept and one pair of coefficients per class.

In [None]:
model.intercept_

In [None]:
model.coef_
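
To make the role of `intercept_` and `coef_` concrete, the sketch below computes one linear score per class for each point and picks the class with the highest score, then checks that this matches `predict`. It fits a separate model on regenerated data so it can run on its own; the `_demo` names are illustrative.

```python
import numpy as np
import sklearn.datasets
import sklearn.linear_model

x_demo, y_demo = sklearn.datasets.make_blobs(
    n_samples=300, centers=3, cluster_std=0.60, random_state=0
)
clf_demo = sklearn.linear_model.LogisticRegression()
clf_demo.fit(x_demo, y_demo)

print(clf_demo.coef_.shape)       # (number of classes, number of features)
print(clf_demo.intercept_.shape)  # (number of classes,)

# One linear score per class for every point: shape (300, 3)
scores = x_demo @ clf_demo.coef_.T + clf_demo.intercept_

# The predicted class is the one with the largest score
manual_pred = clf_demo.classes_[scores.argmax(axis=1)]
print(np.array_equal(manual_pred, clf_demo.predict(x_demo)))
```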

In order to look at the predictions, let's make a 2D grid of points and plot the predicted values for each point on that grid.

In [None]:
dx = 0.04
dy = 0.04
x_min = x[:, 0].min() - 1
x_max = x[:, 0].max() + 1
y_min = x[:, 1].min() - 1
y_max = x[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, dx),
                      np.arange(y_min, y_max, dy))

Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

Z = Z.reshape(xx.shape)

plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, cmap="RdBu_r")
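
The grid construction above can be easier to follow on a tiny example. The sketch below builds a 3 by 2 grid, flattens it into the `(n_points, 2)` layout that `predict` expects, and shows that reshaping recovers the grid layout used for plotting. The `_demo` names and grid size are illustrative.

```python
import numpy as np

# 3 values along the first axis, 2 along the second
xx_demo, yy_demo = np.meshgrid(np.arange(0, 3), np.arange(0, 2))
print(xx_demo)
print(yy_demo)

# Flatten both grids and pair them up: one row per grid point
grid_points = np.c_[xx_demo.ravel(), yy_demo.ravel()]
print(grid_points)

# Any per-point result can be reshaped back to the grid layout
restored = grid_points[:, 0].reshape(xx_demo.shape)
print(np.array_equal(restored, xx_demo))
```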

We can plot the actual data points on top of this, colored by their true label `y`, to see how well the classifier did.

In [None]:
dx = 0.04
dy = 0.04
x_min = x[:, 0].min() - 1
x_max = x[:, 0].max() + 1
y_min = x[:, 1].min() - 1
y_max = x[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, dx),
                      np.arange(y_min, y_max, dy))

Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

Z = Z.reshape(xx.shape)

plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, cmap="RdBu_r")

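# Overlay the data, colored by true label with the same colormap as the background;
# black edges keep the points visible
plt.scatter(x[:, 0], x[:, 1], c=y, cmap="RdBu_r", edgecolors="k")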
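
A plot gives a visual impression; a number to go with it is the fraction of points the classifier labels correctly. The sketch below computes that accuracy on regenerated data with a separately fitted model. Note that this is accuracy on the same data the model was trained on, so it is likely an optimistic estimate of performance on new data.

```python
import numpy as np
import sklearn.datasets
import sklearn.linear_model

x_demo, y_demo = sklearn.datasets.make_blobs(
    n_samples=300, centers=3, cluster_std=0.60, random_state=0
)
clf_demo = sklearn.linear_model.LogisticRegression()
clf_demo.fit(x_demo, y_demo)

# Fraction of points whose predicted label equals the true label
accuracy_manual = np.mean(clf_demo.predict(x_demo) == y_demo)

# score() reports the same quantity for classifiers
accuracy = clf_demo.score(x_demo, y_demo)
print(f"Training accuracy: {accuracy:.3f}")
```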

## Supervised: Regression

Regression is also supervised learning, so we again need labels for our data.
* Regression is for continuous `y` values -> use `y_4regression`

We will train a linear regression algorithm.

In [None]:
# Choose the model

model = sklearn.linear_model.LinearRegression()

In [None]:
# Train the model

model.fit(x, y_4regression)

Did the model "learn" the coefficients of our equation? Compare the fitted intercept and coefficients below with the values we used:
* $y_{4regression} = -3 + 2x_0 + 5x_1$

In [None]:
model.intercept_

In [None]:
model.coef_
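
Since `y_4regression` was built from an exact linear formula with no noise added, a linear regression should recover the intercept and coefficients up to floating-point error. The sketch below checks this numerically on regenerated data with a separately fitted model; the `_demo` names are illustrative.

```python
import numpy as np
import sklearn.datasets
import sklearn.linear_model

x_demo, _ = sklearn.datasets.make_blobs(
    n_samples=300, centers=3, cluster_std=0.60, random_state=0
)
y_reg_demo = -3 + 2 * x_demo[:, 0] + 5 * x_demo[:, 1]

reg_demo = sklearn.linear_model.LinearRegression()
reg_demo.fit(x_demo, y_reg_demo)

# Compare the fitted parameters with the ones used to build the target
print(np.isclose(reg_demo.intercept_, -3))
print(np.allclose(reg_demo.coef_, [2, 5]))

# R^2 on the training data; 1.0 corresponds to a perfect fit
r2 = reg_demo.score(x_demo, y_reg_demo)
print(f"R^2 = {r2:.6f}")
```

With real, noisy data the fitted values would only approximate the underlying relationship.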

In [None]:
dx = 0.04
dy = 0.04
x_min = x[:, 0].min() - 1
x_max = x[:, 0].max() + 1
y_min = x[:, 1].min() - 1
y_max = x[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, dx),
                      np.arange(y_min, y_max, dy))

Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

Z = Z.reshape(xx.shape)
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(projection='3d')

ax.plot_surface(xx, yy, Z, cmap="RdBu_r")

ax.scatter(x[:, 0], x[:, 1], y_4regression)

In [None]:
fig = go.Figure(data=[go.Scatter3d(x=x[:, 0], 
                                   y=x[:, 1], 
                                   z=y_4regression, 
                                   mode='markers',
                                   marker=dict(
                                       size=2,
                                   )),
                      go.Surface(x=xx, y=yy, z=Z)])
fig.show()