# Assessing our models (in reverse order)
* Supervised: Regression
    * mean squared error (MSE) and coefficient of determination (R^2)
* Supervised: Classification
    * accuracy, precision, recall, confusion matrix
* Unsupervised: Clustering 
    * inertia, silhouette coefficient

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import sklearn.datasets
import sklearn.cluster
import sklearn.linear_model
import sklearn.metrics

## Make some data - same as before but streamlined

In [None]:
x, y = sklearn.datasets.make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)
y_4regression = -3 + 2*x[:, 0] + 5*x[:, 1]

## Supervised: Regression

Regression is supervised learning, so we need labels for our data.
* Regression is for continuous `y` values -> use `y_4regression`

We will train a linear regression algorithm.

In [None]:
# Choose the model

model = sklearn.linear_model.LinearRegression()

In [None]:
# Train the model

model.fit(x, y_4regression)

Did the model "learn" the coefficients of our equation?
* $y_{4regression} = -3 + 2x_0 + 5x_1$?

In [None]:
model.intercept_

In [None]:
model.coef_
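
For intuition, fitting a linear regression amounts to solving a least-squares problem. Below is a minimal NumPy sketch of that idea; `X_demo`, `y_demo` and the seed are made up for illustration and are not the blob data used above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
X_demo = rng.normal(size=(50, 2))  # hypothetical inputs
y_demo = -3 + 2 * X_demo[:, 0] + 5 * X_demo[:, 1]  # same equation as above, no noise

# Prepend a column of ones so the intercept is estimated with the slopes
design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least-squares solution: beta = [intercept, coefficient for x_0, coefficient for x_1]
beta, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
print(beta)
```

With noise-free targets, the recovered values should match the intercept and coefficients of the generating equation up to floating-point error.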

To assess the error, we need to compare the predicted y values against the dataset's actual y values.
* Make predictions with the model for all x

In [None]:
y_pred = model.predict(x)

In [None]:
# The mean squared error
print("Mean squared error: %.2f" % 
      sklearn.metrics.mean_squared_error(y_4regression, y_pred))

# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination R^2: %.2f" % 
      sklearn.metrics.r2_score(y_4regression, y_pred))

Those are some mighty fine values! Perfect, in fact, because `y_4regression` is an exact linear function of `x` with no noise added. More on this later.
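
To see what the two metrics measure, here is a sketch that computes them by hand, using $\mathrm{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ and $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. The arrays `y_true_demo` and `y_hat_demo` are invented toy values with deliberate errors, not output from the model above.

```python
import numpy as np

y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])  # hypothetical true values
y_hat_demo = np.array([2.5, 0.0, 2.0, 8.0])    # hypothetical predictions

residuals = y_true_demo - y_hat_demo

# Mean squared error: average of the squared residuals
mse = np.mean(residuals ** 2)

# R^2: one minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE: {mse:.3f}")
print(f"R^2: {r2:.3f}")
```

A perfect fit has zero residuals, which gives an MSE of 0 and an R^2 of 1.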

## Supervised: Classification

For supervised learning, we'll assume that we do know what the labels are.
* Include `y`

We will train a classification algorithm called Logistic Regression.

In [None]:
# Choose the model

model = sklearn.linear_model.LogisticRegression()

In [None]:
# Train the model

model.fit(x, y)

The model training "learns" the optimum parameters of the logistic equation to establish classification boundaries in the $(x_0, x_1)$ space.

In [None]:
model.intercept_

In [None]:
model.coef_
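
With three classes, `coef_` has one row per class and `intercept_` has one entry per class. The sketch below shows how such parameters could turn a point into a predicted class, assuming the multinomial (softmax) formulation; `coef_demo`, `intercept_demo` and `points_demo` are invented values, not the fitted ones.

```python
import numpy as np

# Hypothetical parameters: 3 classes, 2 features
coef_demo = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [-1.0, -1.0]])
intercept_demo = np.zeros(3)

points_demo = np.array([[2.0, 0.0],
                        [0.0, 2.0],
                        [-2.0, -2.0]])

# One linear score per class for each point
scores = points_demo @ coef_demo.T + intercept_demo

# Softmax turns scores into probabilities; subtracting the row maximum
# avoids overflow without changing the result
shifted = scores - scores.max(axis=1, keepdims=True)
proba = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# The predicted class is the one with the highest probability
predicted = proba.argmax(axis=1)
print(predicted)
```

The classification boundaries are the places in the $(x_0, x_1)$ plane where two class scores are equal.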

To assess the error, we need to compare the predicted y values against the dataset's actual y values.
* Make predictions with the model for all x

In [None]:
y_pred = model.predict(x)

For classification, we assess the confusion matrix:

In [None]:
cm = sklearn.metrics.confusion_matrix(y, y_pred)
print(cm)

Again perfect classification.

But to demonstrate additional quantitative metrics:

In [None]:
# classification_report returns a text summary, not a confusion matrix
report = sklearn.metrics.classification_report(y, y_pred)
print(report)

In [None]:
print(f"Accuracy: {sklearn.metrics.accuracy_score(y, y_pred):.2%}")
print(f"Precision: {sklearn.metrics.precision_score(y, y_pred, average='micro'):.2%}")
print(f"Recall: {sklearn.metrics.recall_score(y, y_pred, average='micro'):.2%}")

Unlike in binary classification, precision and recall (and related metrics) for multi-class problems can be computed in slightly different ways, depending on how the per-class results are averaged.

A macro-average will compute the metric independently for each class and then take the average (hence treating all classes equally), whereas a micro-average will aggregate the contributions of all classes to compute the average metric. 

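In a multi-class classification setup, micro-average is often suggested if you suspect there might be class imbalance (i.e., you may have many more examples of one class than of other classes), since every sample then counts equally. Keep in mind that this lets the largest classes dominate the score; a macro-average is the one that exposes poor performance on rare classes.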
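
The difference between the two averages is easiest to see on an imbalanced example. The sketch below starts from an invented confusion matrix `cm_demo` (rows are true classes, columns are predicted classes, as in scikit-learn) and computes both averages by hand.

```python
import numpy as np

# Hypothetical confusion matrix: class 0 is ten times larger than classes 1 and 2
cm_demo = np.array([[90, 5, 5],
                    [4, 5, 1],
                    [3, 1, 6]])

true_positives = np.diag(cm_demo)
predicted_totals = cm_demo.sum(axis=0)  # column sums: how often each class was predicted
actual_totals = cm_demo.sum(axis=1)     # row sums: how often each class really occurs

# Per-class metrics
precision_per_class = true_positives / predicted_totals
recall_per_class = true_positives / actual_totals

# Macro: average the per-class values, every class counts equally
macro_precision = precision_per_class.mean()
macro_recall = recall_per_class.mean()

# Micro: pool the counts over all classes first, every sample counts equally
micro_precision = true_positives.sum() / predicted_totals.sum()
micro_recall = true_positives.sum() / actual_totals.sum()

accuracy = true_positives.sum() / cm_demo.sum()

print(f"macro precision {macro_precision:.3f}, macro recall {macro_recall:.3f}")
print(f"micro precision {micro_precision:.3f}, micro recall {micro_recall:.3f}")
print(f"accuracy {accuracy:.3f}")
```

When every sample has exactly one label, micro-averaged precision and recall both equal the accuracy, while the macro averages are pulled down by the weaker results on the two small classes.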

## Unsupervised: Clustering

For unsupervised learning, we'll assume that we do not know what any labels are.
* Ignore `y`

We create an object for our model by calling `KMeans` with the number of clusters we want to look for.

In [None]:
# Choose the model

model = sklearn.cluster.KMeans(n_clusters=3)

We then call the fit method, and pass in the data in which to identify clusters.

In [None]:
# Train the model

model.fit(x)
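
To give a feel for what fitting involves, here is a sketch of the basic k-means (Lloyd) iteration in NumPy. It is not scikit-learn's implementation, which differs in initialisation and other details. The function `lloyd_kmeans`, the toy data and the choice of starting centers are all made up for illustration.

```python
import numpy as np

def lloyd_kmeans(points, init_centers, n_iter=20):
    """Alternate between assigning points and updating centers (illustrative only)."""
    centers = init_centers.astype(float).copy()
    for _ in range(n_iter):
        # Squared distance from every point to every center
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        # Assign each point to its nearest center
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its assigned points
        for k in range(len(centers)):
            members = points[labels == k]
            if len(members) > 0:  # leave the center alone if its cluster is empty
                centers[k] = members.mean(axis=0)
    return labels, centers

rng = np.random.default_rng(0)  # arbitrary seed
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
points_demo = np.vstack([blob_a, blob_b])

# Start from one point of each blob (a simplification for this demo)
labels_demo, centers_demo = lloyd_kmeans(points_demo, points_demo[[0, -1]])
print(centers_demo)
```

Each pass can only lower the within-cluster sum-of-squares or leave it unchanged, which is why that quantity is the natural score for this algorithm.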

The model now has an attribute `labels_` that stores the index of the cluster each point was assigned to.

In [None]:
model.labels_

To assess "goodness"... there are no labeled data to compare our predictions against.

What to do?
* measure the 'quality' of the model itself, for example:
  * Inertia
      * the within-cluster sum-of-squares
      * this is what k-means minimizes when it's being trained
  * Silhouette Coefficient
      * from scikit-learn docs: *"The score is bounded between -1 for incorrect clustering and +1 for highly dense clustering. Scores around zero indicate overlapping clusters.  The score is higher when clusters are dense and well separated, which relates to a standard concept of a cluster."*
* note that if ground-truth class labels happen to be available, a clustering can also be scored against them.
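
Both quantities in the list above can be worked out by hand on a toy example. The sketch below uses four invented points in two obvious clusters. For each point, the silhouette value is $(b - a) / \max(a, b)$, where $a$ is the mean distance to the other points in its own cluster and $b$ is the mean distance to the points of the nearest other cluster.

```python
import numpy as np

points_demo = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])  # hypothetical data
labels_demo = np.array([0, 0, 1, 1])                                      # hypothetical assignment
cluster_ids = np.unique(labels_demo)

# Inertia: sum of squared distances from each point to its cluster mean
centers = np.array([points_demo[labels_demo == k].mean(axis=0) for k in cluster_ids])
inertia = np.sum((points_demo - centers[labels_demo]) ** 2)

# Pairwise Euclidean distances between all points
dist = np.linalg.norm(points_demo[:, None, :] - points_demo[None, :, :], axis=2)

silhouette_values = []
for i in range(len(points_demo)):
    same = labels_demo == labels_demo[i]
    same[i] = False  # exclude the point itself
    a = dist[i, same].mean()
    b = min(dist[i, labels_demo == k].mean() for k in cluster_ids if k != labels_demo[i])
    silhouette_values.append((b - a) / max(a, b))

silhouette = float(np.mean(silhouette_values))
print(f"inertia: {inertia:.2f}")
print(f"silhouette: {silhouette:.3f}")
```

The silhouette comes out well above zero here because the two clusters are tight relative to the gap between them.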

In [None]:
sklearn.metrics.silhouette_score(x, model.labels_, metric='euclidean')

In [None]:
model.inertia_

Really, the inertia only makes sense as a relative value: on the same dataset, a lower value means a smaller within-cluster sum-of-squares, i.e., tighter clusters.

## Note

There are many, many metrics. See, for example, [the docs for `sklearn.metrics`](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics).

Regarding inertia, let's look at how it changes when we change the number of identified clusters:

In [None]:
inertia_scores = []
cluster_num = []
for n in range(1, 10):
    cluster_num.append(n)
    model = sklearn.cluster.KMeans(n_clusters=n)
    model.fit(x)
    inertia_scores.append(model.inertia_)

# Inertia versus number of clusters
plt.plot(cluster_num, inertia_scores, 'ko')
plt.xlabel('number of clusters')
plt.ylabel('inertia')

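We can use this as a guide for choosing an appropriate number of clusters: inertia keeps dropping as clusters are added, so look for the "elbow" where the improvement levels off.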
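
As a complementary check, the silhouette coefficient can be scanned over the number of clusters in the same way. The sketch below regenerates the blob data so that it runs on its own. The range of cluster counts and the `n_init=10` and `random_state=0` settings are choices made here for reproducibility, not part of the analysis above.

```python
import sklearn.cluster
import sklearn.datasets
import sklearn.metrics

# Same synthetic data as at the top of the notebook
x, _ = sklearn.datasets.make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=0)

silhouette_by_k = {}
for n in range(2, 7):  # the silhouette score needs at least 2 clusters
    model = sklearn.cluster.KMeans(n_clusters=n, n_init=10, random_state=0)
    labels = model.fit_predict(x)
    silhouette_by_k[n] = sklearn.metrics.silhouette_score(x, labels)

for n, score in silhouette_by_k.items():
    print(f"{n} clusters: silhouette = {score:.3f}")
```

Unlike inertia, the silhouette coefficient does not automatically improve as clusters are added, so its maximum is a candidate for the number of clusters.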