
# Evaluation

Machine learning always tries to generate 'artifacts' (e.g. predictions, decisions, clusters, ...).
Evaluation estimates the quality of a model from the generated artifacts, following one of two fundamental principles:
 - If the expected, correct outcome is known, we can calculate the **error**.
 - If the expected outcome is not known, we formalize the "ugliness" of the artifact, usually called the **cost function** or **stress**.

Both evaluation measures are often called the **score**.

## Error based scores

### Regression

Regression analysis estimates the relationships among variables. It tries to model the relationship between a dependent variable (the 'target' or 'criterion') and one or more independent variables (the 'features' or 'predictors'). This allows us to predict a target value from the given feature values.

Calculating the regression error is rather simple: summarize the differences between the predictions and the correct values.
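
As a rough illustration of what "summarizing the differences" can mean, the sketch below computes three common summaries on a handful of invented numbers. The arrays `y_true_toy` and `y_pred_toy` are hypothetical and not part of the notebook's data.

```python
import numpy as np

# Toy values, invented for illustration only
y_true_toy = np.array([1.0, 3.0, 5.0, 7.0])
y_pred_toy = np.array([2.0, 3.0, 4.0, 9.0])

# signed difference between prediction and correct value, per point
residuals = y_pred_toy - y_true_toy

mae = np.mean(np.abs(residuals))   # mean absolute error
mse = np.mean(residuals ** 2)      # mean squared error
rmse = np.sqrt(mse)                # root mean squared error, same unit as y

print("MAE: {:.3f}  MSE: {:.3f}  RMSE: {:.3f}".format(mae, mse, rmse))
```

Squaring penalizes large deviations more strongly than taking absolute values, so the choice of summary depends on how much outliers should count.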

_Let's generate and visualize some data:_

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# generate this many points
numPoints = 42
rng = np.random.RandomState()

# these variables are vectors of size numPoints
x = 10 * rng.rand(numPoints)
y = 2 * x - 1 + rng.randn(numPoints)

# what does the data look like?
print('X: {}'.format(x))
print('Y: {}'.format(y))
plt.scatter(x, y);

#### What's next?

In order to calculate an error, we first have to make some predictions. Let's start with the most basic statistical prediction, aptly named the "Expectation".

1. Calculate the average of the y values.
2. Our regression model then always returns this mean when someone asks for a prediction for an x value.
3. Plot the data and the regression.
4. Calculate the standard deviation as an error metric and print it.


***Hints:***

Mean = $\frac{\sum_{i=1}^{n} y_i}{n}$

Standard Deviation = $\sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n-1}}$, where $\hat{y}_i$ is the prediction for point $i$

`np.full(n, value)` creates a vector of size n which has the same value at every position
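
To make the hints concrete, here is a minimal sketch on four invented target values (the names `y_toy`, `mean_toy`, `baseline` and `error_toy` are hypothetical). It is meant as a worked example of the formulas, not as the required solution for the notebook's data.

```python
import numpy as np

y_toy = np.array([2.0, 4.0, 4.0, 6.0])     # hypothetical target values

mean_toy = np.sum(y_toy) / len(y_toy)       # average of the targets
baseline = np.full(len(y_toy), mean_toy)    # the same prediction for every point

# standard deviation with n-1 in the denominator, as in the hint
error_toy = np.sqrt(np.sum((baseline - y_toy) ** 2) / (len(y_toy) - 1))

# NumPy's np.std with ddof=1 computes the same quantity
same_as_numpy = np.isclose(error_toy, np.std(y_toy, ddof=1))

print("Mean: {}".format(mean_toy))
print("Error: {:.4f} (matches np.std: {})".format(error_toy, same_as_numpy))
```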

In [None]:

# Do something here (replace the None placeholders with your calculations):
mean = None

error = None

print("Mean: {}".format(mean))
print("Error: {}".format(error))

#### What's next?

To put this error value in perspective, we can compare it with better and worse prediction models. Try to formulate such models, plot them and calculate their error values.

***Hints:***
- Take the formula of the data generation in the first cell (without the randomness!) as a very good model
- Take some kind of inverse of that formula as a worse model
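
One possible way to set up such a comparison is sketched below. It regenerates data with a fixed seed so the numbers are repeatable, and the "bad" model (`-2x + 1`) is only one assumed interpretation of "some kind of inverse". The helper `sample_error` and the `*_demo` names are hypothetical.

```python
import numpy as np

rng_demo = np.random.RandomState(0)   # fixed seed, an assumption for repeatability
x_demo = 10 * rng_demo.rand(42)
y_demo = 2 * x_demo - 1 + rng_demo.randn(42)

def sample_error(y_pred, y_true):
    """Square root of the summed squared differences divided by n-1."""
    return np.sqrt(np.sum((y_pred - y_true) ** 2) / (len(y_true) - 1))

# three candidate models, each expressed as a vector of predictions
models = {
    "good": 2 * x_demo - 1,                            # generating formula without noise
    "mean": np.full(len(y_demo), np.mean(y_demo)),     # constant baseline
    "bad": -2 * x_demo + 1,                            # sign-flipped formula
}

errors = {name: sample_error(pred, y_demo) for name, pred in models.items()}
for name, err in errors.items():
    print("{:>4}: {:.2f}".format(name, err))
```

The baseline error only becomes meaningful next to the other two: a model is useful if it lands clearly below the mean baseline.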

### Classification

Classification tries to assign a class instead of a scalar value. The result is often a binary decision, so the error cannot be quantified as a difference; every prediction is either correct or wrong.
If we compare the prediction with the truth, there are four possible cases (see slides): true positive, false positive, true negative and false negative. Most other evaluation scores are based on these four cases.
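
For a binary decision, the four cases can be counted directly by comparing the two label vectors. The following sketch uses invented labels (`truth_bin`, `preds_bin`), with 1 standing for the positive class.

```python
import numpy as np

# Hypothetical binary labels: 1 = positive class, 0 = negative class
truth_bin = np.array([1, 1, 1, 0, 0, 0, 0, 1])
preds_bin = np.array([1, 0, 1, 0, 1, 0, 0, 1])

tp = int(np.sum((truth_bin == 1) & (preds_bin == 1)))  # predicted positive, is positive
fn = int(np.sum((truth_bin == 1) & (preds_bin == 0)))  # predicted negative, is positive
fp = int(np.sum((truth_bin == 0) & (preds_bin == 1)))  # predicted positive, is negative
tn = int(np.sum((truth_bin == 0) & (preds_bin == 0)))  # predicted negative, is negative

print("TP: {} FN: {} FP: {} TN: {}".format(tp, fn, fp, tn))
```

Every sample falls into exactly one of the four cases, so the counts always add up to the number of samples.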

_Let's generate and visualize some data:_

In [None]:
import random

numPoints = 100
lexicon = ["hund", "katze", "maus"]

# initialize with zeros, then set exactly one randomly chosen class to 1 for each "point"
truth       = np.zeros((numPoints,3), dtype=int)
predictions = np.zeros((numPoints,3), dtype=int)
for i in range(numPoints):
    index = random.randint(0,2)
    truth[i][index] = 1
    # the prediction is the true class shifted by a truncated Gaussian offset,
    # so it is often, but not always, correct
    predictions[i][((index+int(random.gauss(0,1.2))) % 3)] = 1

# visualize
order = np.arange(len(lexicon))
values_truth = np.sum(truth,axis=0).tolist()
values_preds = np.sum(predictions,axis=0).tolist()

plt.bar(order, values_truth)
plt.xticks(order, lexicon)
plt.ylabel('Count')
plt.title('truth')
plt.show()

plt.bar(order, values_preds)
plt.xticks(order, lexicon)
plt.ylabel('Count')
plt.title('predictions')
plt.show()

# ANSI escape codes for colored output
CEND   = '\33[0m'
CRED   = '\33[31m'
CGREEN = '\33[32m'
for i in range(numPoints):
    # the dot product of two one-hot rows is 1 only if both mark the same class
    color = CGREEN if np.dot(truth[i], predictions[i]) else CRED
    print(color + "Row: {} T: {} P: {}".format(i, truth[i], predictions[i]) + CEND, end=", ")

#### What's next?

If you have more than two classes, the generalization of the four possible outcome cases is the *confusion matrix*. You can calculate it by creating an empty *n x n* matrix, where *n* is the number of classes. The first dimension corresponds to the true class, the second dimension to the predicted class. Going through each prediction of the test set, you then increase the count in the respective cell `[truth][prediction]`.

***Hints:***
- dot products might be of help (`np.dot()`)
- scikit-learn also has a function to calculate the confusion matrix (`sklearn.metrics.confusion_matrix`), but you have to transform the data into the expected format first.
- To visualize the matrix you can use Matplotlib's `matshow()`
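
The counting procedure and the dot-product hint can be sketched on a tiny invented example. The arrays `truth_toy` and `preds_toy` are hypothetical one-hot rows in the same layout as `truth` and `predictions`; both variants below are illustrations, not the only possible solution.

```python
import numpy as np

classes = ["hund", "katze", "maus"]

# Hypothetical one-hot rows: one row per sample, one column per class
truth_toy = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]])
preds_toy = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])

# Variant 1: explicit counting, cell [true class][predicted class]
cm_loop = np.zeros((len(classes), len(classes)), dtype=int)
for t_row, p_row in zip(truth_toy, preds_toy):
    cm_loop[np.argmax(t_row)][np.argmax(p_row)] += 1

# Variant 2: the dot product of the transposed truth with the predictions
# sums, for every (true, predicted) pair, how often both one-hot flags are set
cm_dot = np.dot(truth_toy.T, preds_toy)

print(cm_dot)
```

For scikit-learn's `confusion_matrix`, the one-hot rows would first have to be turned into label vectors, for example with `np.argmax(truth_toy, axis=1)`.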

In [None]:
cm = np.zeros((len(lexicon), len(lexicon)), dtype=int)  # TODO: fill in the counts


# visualization is given
print(cm)

fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(cm, cmap=plt.cm.gray)
fig.colorbar(cax)

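# set explicit tick positions so every class label lines up with its row/column
ax.set_xticks(range(len(lexicon)))
ax.set_yticks(range(len(lexicon)))
ax.set_xticklabels(lexicon)
ax.set_yticklabels(lexicon)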

plt.show()

#### What's next?
Now we can calculate the derived evaluation metrics. We supply a function which takes a class index (0=hund, 1=katze, 2=maus) and calculates the true/false positive/negative counts for this class from the confusion matrix `cm`. You can use these values to calculate Accuracy, Recall, Precision and F1 Score for each class individually.

Think about how your predictor could cheat and easily produce optimal Recall or optimal Precision.
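
The sketch below spells out the four metrics in terms of the case counts and tries them on two invented "cheating" predictors. The helper `metrics`, the counts, and the convention of returning 0.0 when a denominator is zero are assumptions made for this illustration.

```python
def metrics(tp, tn, fp, fn):
    """Derive the standard scores from the four case counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # assumed convention: a score with an empty denominator is reported as 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}

# Hypothetical test set with 10 positive and 90 negative samples.
# Cheat 1: label everything positive, so no positive is ever missed
always_positive = metrics(tp=10, tn=0, fp=90, fn=0)

# Cheat 2: label only a single, very certain sample positive
one_sure_hit = metrics(tp=1, tn=90, fp=0, fn=9)

print("Always positive: {}".format(always_positive))
print("One sure hit:    {}".format(one_sure_hit))
```

The first predictor reaches perfect Recall and the second perfect Precision, yet both end up with a poor F1 Score, which is why the two values should be read together.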

In [None]:
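def get_cases(class_index):
    i = class_index
    # identity matrix with the class's own diagonal entry zeroed:
    # multiplying from both sides removes row i and column i from cm
    mask = np.identity(len(lexicon))
    mask[i][i] = 0
    return {
        'TP': cm[i][i],
        'TN': np.sum(np.dot(mask, np.dot(cm, mask))),
        'FN': np.sum(cm[i]) - cm[i][i],
        'FP': np.sum(cm, axis=0)[i] - cm[i][i]
    }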

for i in range(len(lexicon)):
    cases = get_cases(i)
    print("Class: {} Cases: {} -> ".format(lexicon[i],cases))

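    # add calculations here (replace the NaN placeholders):
    accuracy = float('nan')
    recall = float('nan')
    precision = float('nan')
    f1 = float('nan')
    print("Accuracy:  {0:.0%}".format(accuracy))
    print("Recall:    {0:.0%}".format(recall))
    print("Precision: {0:.0%}".format(precision))
    print("F1 Score : {0:.0%}".format(f1))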

## Stress-based Evaluation

We will defer this topic until we have learned more about clustering :)

But for completeness, in a nutshell: There is no ground truth but only desired qualities such as:
- good between-cluster separation (large distance)
- good within-cluster connectivity (small distance)
- and many more based on the application

In [None]:
from sklearn.datasets import make_blobs

# number of clusters to generate and to seek
k = 4

X, y_true = make_blobs(n_samples=300, centers=k,
                       cluster_std=0.20, random_state=0)

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=k)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);

In [None]:
from sklearn.metrics import pairwise_distances

# Divide the sum of distances by the number of distances
avg_separation = np.sum(pairwise_distances(centers))/k/(k-1)

# do the same for the members of each cluster
# create one list of members per cluster
groups = []
for i in range(k):
    groups.append([])
# assign the members to their groups
for i in range(len(X)):
    groups[y_kmeans[i]].append(X[i])

connectivity = []
for i in range(k):
    connectivity.append(np.sum(pairwise_distances(groups[i]))/len(groups[i])/(len(groups[i])-1))

# print results
print("Average separation: {}".format(avg_separation))
print("Average connectivity: {} ({})".format(np.average(connectivity),connectivity))

print("Ratio separation / connectivity: {} (higher is better)".format(avg_separation/np.average(connectivity)))

# visualize
plt.bar(range(3), [avg_separation,np.average(connectivity),avg_separation/np.average(connectivity)])
plt.xticks(range(3), ["Sep", "Con", "Ratio"])
plt.show()


### What's next?

You can now play with the data generation parameter *cluster_std* and see how the evaluation metric changes. Higher standard deviations should produce worse results, i.e. a lower separation/connectivity ratio.
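
A simplified version of this experiment is sketched below. To stay short and repeatable, it does not use `make_blobs` or `KMeans`: it places four centers by hand, draws Gaussian noise with NumPy and uses the known cluster labels. The helpers `mean_pairwise_distance` and `separation_connectivity_ratio` and the chosen centers are assumptions for illustration.

```python
import numpy as np

def mean_pairwise_distance(points):
    """Average Euclidean distance over all ordered pairs of distinct rows."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt(np.sum(diffs ** 2, axis=-1))
    n = len(points)
    return np.sum(dists) / (n * (n - 1))

def separation_connectivity_ratio(cluster_std, per_cluster=75, seed=0):
    """Ratio of between-cluster to within-cluster distance for synthetic blobs."""
    rng_blobs = np.random.RandomState(seed)
    # hand-placed centers (assumption, instead of make_blobs)
    true_centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0], [6.0, 6.0]])
    k_blobs = len(true_centers)
    labels = np.repeat(np.arange(k_blobs), per_cluster)
    noise = rng_blobs.randn(k_blobs * per_cluster, 2)
    points = true_centers[labels] + cluster_std * noise

    # cluster means stand in for the centers that k-means would report
    centroids = np.array([points[labels == c].mean(axis=0) for c in range(k_blobs)])
    separation = mean_pairwise_distance(centroids)
    connectivity = np.mean([mean_pairwise_distance(points[labels == c])
                            for c in range(k_blobs)])
    return separation / connectivity

stds = [0.2, 0.5, 1.0, 2.0]
ratios = [separation_connectivity_ratio(s) for s in stds]
for s, r in zip(stds, ratios):
    print("cluster_std: {:.1f} -> ratio: {:.2f}".format(s, r))
```

Because the clusters spread out while their centers stay roughly in place, the ratio shrinks as the standard deviation grows.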