> Reference:
+ [machinelearningmastery: evaluation metrics](http://machinelearningmastery.com/metrics-evaluate-machine-learning-algorithms-python/)
+ [machinelearningmastery: ROC](http://machinelearningmastery.com/assessing-comparing-classifier-performance-roc-curves-2/)
    - **TODO: read**
+ [scikit-learn: model evaluation](http://scikit-learn.org/stable/modules/model_evaluation.html)
    - **TODO: read**
+ [scikit-learn: Classification Report](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html#sklearn.metrics.precision_recall_fscore_support)
    
**NOTE**
+ Some evaluation metrics (like mean squared error) are errors, so the smallest value is best. The cross_validation.cross_val_score() function reports these as negative numbers so that a larger score is always better. This is important to note, because some scores will be reported as negative even though, by definition, the metric itself can never be negative.
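
The sign convention in the note can be illustrated without scikit-learn. The sketch below uses made-up error values and hypothetical model names (`model_a`, `model_b`); it only shows why negating an error lets a single rule, "larger is better", rank models for every metric.

```python
# Hypothetical mean squared errors for two models (smaller error is better).
mse_by_model = {"model_a": 34.7, "model_b": 20.1}

# Negating each error turns it into a score where larger is better,
# which is the convention described in the note.
score_by_model = {name: -mse for name, mse in mse_by_model.items()}

# The same max() rule now works for accuracy-like and error-like metrics.
best_model = max(score_by_model, key=score_by_model.get)

# Flip the sign again to recover the error for reporting.
best_mse = -score_by_model[best_model]

print(best_model, best_mse)
```

The model with the smallest error ends up with the largest (least negative) score.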

# Classification Metrics #
+ Classification Accuracy.
+ Logarithmic Loss.
+ Area Under ROC Curve.
+ Confusion Matrix.
+ Classification Report.

### 1: Classification: Accuracy ###

+ Classification accuracy is the number of correct predictions made as a ratio of all predictions made.
+ It is a widely misused metric. It is really only suitable when each class has roughly the same number of observations (which is rarely the case) and when all predictions and prediction errors are equally important (which is often not the case).
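
To see why accuracy misleads on imbalanced classes, here is a minimal plain-Python sketch. The data is made up (95 negatives, 5 positives) and the helper name `accuracy_ratio` is an assumption for illustration, not a scikit-learn function.

```python
def accuracy_ratio(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up imbalanced data: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * 100

majority_accuracy = accuracy_ratio(y_true, y_pred)

# Count how many positives were actually found by this "model".
positives_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(majority_accuracy, positives_found)
```

The accuracy is high even though no positive instance is ever detected, which is the failure mode the bullet above warns about.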


In [3]:
# Cross Validation Classification Accuracy
import pandas
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
num_folds = 10
num_instances = len(X)
seed = 7
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = LogisticRegression()
scoring = 'accuracy'
results = cross_validation.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("Accuracy: mean={:.3%}, std={:.3%}".format(results.mean(), results.std()))

Accuracy: mean=76.951%, std=4.841%


### 2: Classification: Logarithmic Loss (logloss) ###

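+ Logarithmic loss evaluates predicted probabilities of membership in a given class.
+ The predicted probability, a scalar between 0 and 1, can be seen as a measure of the algorithm's confidence in a prediction. Correct and incorrect predictions are rewarded or punished in proportion to that confidence.
+ Smaller logloss is better, with 0 representing a perfect logloss. Because it is a loss, cross_val_score() reports it as a negative number.
+ Commonly used in multinomial logistic regression and neural networks.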
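
The following is a minimal sketch of binary log loss in plain Python, assuming 0/1 labels and predicted probabilities for class 1. The helper name `binary_log_loss`, the clipping constant `eps`, and the sample probabilities are assumptions made for illustration.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log loss; p_pred holds the predicted P(class == 1)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        # Clip so that log(0) never occurs for fully confident predictions.
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]

# Same hard decisions at a 0.5 threshold would be right, unsure, or wrong.
confident_right = binary_log_loss(y_true, [0.9, 0.1, 0.9, 0.1])
unsure = binary_log_loss(y_true, [0.5, 0.5, 0.5, 0.5])
confident_wrong = binary_log_loss(y_true, [0.1, 0.9, 0.1, 0.9])

print(confident_right, unsure, confident_wrong)
```

Confident correct predictions give a loss near 0, while confident wrong predictions are penalized much more heavily than unsure ones.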

In [6]:
# Cross Validation Classification LogLoss
import pandas
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
num_folds = 10
num_instances = len(X)
seed = 7
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = LogisticRegression()
scoring = 'log_loss'
results = cross_validation.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("LogLoss: mean={:.3%}, std={:.3%}".format(results.mean(), results.std()))

LogLoss: mean=-49.250%, std=4.704%


### 3: Classification: Area Under ROC Curve (AUC) ###
+ Metric for binary classification that measures a model's ability to discriminate between the positive and negative classes.
+ An area of:
    - **1** means the model ranks every positive instance above every negative one (perfect predictions)
    - **0.5** means the predictions are no better than random
+ A binary classification problem is really a trade-off between sensitivity and specificity.
    - **Sensitivity** is the true positive rate, also called recall. It is the proportion of instances from the positive class that were predicted correctly.
    - **Specificity** is the true negative rate. It is the proportion of instances from the negative class that were predicted correctly.
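
As a rough illustration of these definitions, the sketch below computes sensitivity and specificity from hard predictions, and AUC as the fraction of (positive, negative) pairs that the scores rank correctly. The helper names and the four sample scores are assumptions; the pairwise loop is written for clarity, not speed.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (true positive rate, true negative rate) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, tn / n_neg

def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities

auc = pairwise_auc(y_true, scores)

# Thresholding the scores at 0.5 gives hard predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]
sens, spec = sensitivity_specificity(y_true, y_pred)

print(auc, sens, spec)
```

Moving the threshold trades sensitivity against specificity, while the AUC summarizes the ranking quality across all thresholds.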

In the example below, the AUC is relatively close to 1 and greater than 0.5, suggesting some skill in the predictions.

In [8]:
# Cross Validation Classification ROC AUC
import pandas
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
num_folds = 10
num_instances = len(X)
seed = 7
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = LogisticRegression()
scoring = 'roc_auc'
results = cross_validation.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("AUC: mean={:.3f}, std={:.3f}".format(results.mean(), results.std()))

AUC: mean=0.824, std=0.041


### 4: Classification: Confusion Matrix ###
+ The confusion matrix is a handy presentation of the accuracy of a model with two or more classes.
+ In the example below, the array is printed without headings: rows correspond to the actual classes and columns to the predicted classes. The majority of the predictions fall on the diagonal of the matrix (these are the correct predictions).
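
A small plain-Python sketch of how such a matrix is counted, assuming the rows-are-actual, columns-are-predicted layout. The helper name `count_confusion` and the five sample labels are made up; the second part reuses the matrix reported for the diabetes example in this section to recover the accuracy from the diagonal.

```python
def count_confusion(y_true, y_pred, labels):
    """Count predictions; rows are actual classes, columns are predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Made-up labels for a small check.
small = count_confusion([0, 0, 1, 1, 1], [0, 1, 1, 1, 0], labels=[0, 1])

# The matrix reported for the diabetes example in this section.
reported = [[141, 21],
            [41, 51]]

# Correct predictions sit on the diagonal.
correct = sum(reported[i][i] for i in range(len(reported)))
total = sum(sum(row) for row in reported)
accuracy_from_matrix = correct / total

print(small, correct, total, round(accuracy_from_matrix, 3))
```

The off-diagonal cells show which kind of mistake the model makes, which a single accuracy figure hides.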

In [9]:
# Cross Validation Classification Confusion Matrix
import pandas
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(X_train, Y_train)
predicted = model.predict(X_test)
matrix = confusion_matrix(Y_test, predicted)
print(matrix)

[[141  21]
 [ 41  51]]


### 5: Classification: Classification Report ###
+ The **precision** is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
+ The **recall** is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
+ The **F-beta** score can be interpreted as a weighted harmonic mean of the precision and recall; it reaches its best value at 1 and its worst at 0. The F-beta score weights recall more than precision by a factor of beta. beta == 1.0 means recall and precision are equally important, which gives the f1-score shown in the report.
+ The **support** is the number of occurrences of each class in y_true.
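
The definitions above can be checked by hand against the confusion matrix `[[141, 21], [41, 51]]` printed in the previous section. The sketch below is plain Python; the helper name `precision_recall_fbeta` is an assumption, and it does not guard against a zero denominator.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Precision, recall and F-beta from raw counts for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta

# Counts for class 1.0, read off the confusion matrix [[141, 21], [41, 51]]
# (rows are actual classes, columns are predicted classes).
tp, fp, fn = 51, 21, 41

precision, recall, f1 = precision_recall_fbeta(tp, fp, fn)
support = tp + fn  # number of true instances of class 1.0

print(round(precision, 2), round(recall, 2), round(f1, 2), support)
```

Rounded to two decimals, these values agree with the `1.0` row of the report in this section.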

In [15]:
# Cross Validation Classification Report
import pandas
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression()
model.fit(X_train, Y_train)
predicted = model.predict(X_test)
report = classification_report(Y_test, predicted)
print(report)

             precision    recall  f1-score   support

        0.0       0.77      0.87      0.82       162
        1.0       0.71      0.55      0.62        92

avg / total       0.75      0.76      0.75       254



# Regression Metrics #
+ Mean Absolute Error.
+ Mean Squared Error.
+ R^2.

### 1: Regression: Mean Absolute Error ###
+ It is the average of the absolute differences between predictions and actual values.
+ The measure gives an idea of the magnitude of the error, but no idea of the direction (e.g. over or under predicting).
+ A value of 0 indicates no error or perfect predictions. Like logloss, this metric is inverted by the cross_val_score() function.
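
A minimal plain-Python sketch of the metric, with made-up values; the helper name `mae` is an assumption. It also shows why the measure carries no direction: predicting consistently too high and consistently too low give the same result.

```python
def mae(y_true, y_pred):
    """Mean of the absolute differences between predictions and actuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.5, 7.0]

over = [4.0, 6.0, 3.5, 8.0]   # every prediction is 1.0 too high
under = [2.0, 4.0, 1.5, 6.0]  # every prediction is 1.0 too low

mae_over = mae(y_true, over)
mae_under = mae(y_true, under)
mae_perfect = mae(y_true, y_true)

print(mae_over, mae_under, mae_perfect)
```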

In [16]:
# Cross Validation Regression MAE
import pandas
from sklearn import cross_validation
from sklearn.linear_model import LinearRegression
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, delim_whitespace=True, names=names)
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
num_folds = 10
num_instances = len(X)
seed = 7
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = LinearRegression()
scoring = 'mean_absolute_error'
results = cross_validation.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("MAE: mean={:.3f}, std={:.3f}".format(results.mean(), results.std()))

MAE: mean=-4.005, std=2.084


### 2: Regression: Mean Squared Error (MSE or RMSE) ###
+ Like the mean absolute error, it provides a gross idea of the magnitude of the error, but squaring the differences penalizes large errors more heavily.
+ Taking the square root of the mean squared error (RMSE) converts the units back to the original units of the output variable and can be meaningful for description and presentation.
+ This metric too is inverted (negated) so that larger results are better. **Remember to take the absolute value before taking the square root if you are interested in calculating the RMSE**.
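
The sketch below computes MSE and RMSE in plain Python on made-up values, then applies the absolute-value step to the negated mean reported in this section (-34.705). The helper name `mse` is an assumption.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared differences between predictions and actuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.0, 5.0, 4.5, 7.0]

error = mse(y_true, y_pred)      # (1 + 0 + 4 + 0) / 4
root_error = math.sqrt(error)    # back in the units of the target

# A negated score, as reported by cross-validation in this section.
negated_mse = -34.705

# math.sqrt() raises ValueError on a negative number, so flip the sign first.
rmse_from_score = math.sqrt(abs(negated_mse))

print(error, round(root_error, 3), round(rmse_from_score, 3))
```

Note that the square root of the mean MSE over folds is not the same as the mean of the per-fold RMSE values; convert each fold's score first if the latter is wanted.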

In [18]:
# Cross Validation Regression MSE
import pandas
from sklearn import cross_validation
from sklearn.linear_model import LinearRegression
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, delim_whitespace=True, names=names)
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
num_folds = 10
num_instances = len(X)
seed = 7
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = LinearRegression()
scoring = 'mean_squared_error'
results = cross_validation.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("MSE: mean={:.3f}, std={:.3f}".format(results.mean(), results.std()))

MSE: mean=-34.705, std=45.574


### 3: Regression: R Squared (R^2) ###
+ It provides an indication of the goodness of fit of a set of predictions to the actual values. In statistical literature, this measure is called the coefficient of determination.
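+ A value of 1 indicates a perfect fit and 0 indicates a fit no better than always predicting the mean of the actual values. The score can also be negative when the predictions fit worse than that.
+ In the example, the predictions have a poor fit to the actual values, with a mean value close to zero and less than 0.5. The large standard deviation shows that the fit varies widely from fold to fold.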
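
A minimal plain-Python sketch of the coefficient of determination, computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared` and the sample values are assumptions. It shows three reference points: a perfect fit, a constant prediction of the mean, and predictions that are worse than the mean.

```python
def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, y_true)                # predictions equal actuals
baseline = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
worse = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])     # reversed trend

print(perfect, baseline, worse)
```

A score below zero on a fold means the model did worse there than simply predicting the average of that fold's actual values.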

In [21]:
# Cross Validation Regression R^2
import pandas
from sklearn import cross_validation
from sklearn.linear_model import LinearRegression
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = pandas.read_csv(url, delim_whitespace=True, names=names)
array = dataframe.values
X = array[:,0:13]
Y = array[:,13]
num_folds = 10
num_instances = len(X)
seed = 7
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = LinearRegression()
scoring = 'r2'
results = cross_validation.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("R^2: mean={:.3f}, std={:.3f}".format(results.mean(), results.std()))

R^2: mean=0.203, std=0.595
