# Python Basics Tutorial

#### Machine Learning Algorithm Performance Metrics

####  Machine Learning Mastery with Python
####  Jason Brownlee

### In this recipe:

- Classification metrics
- Regression metrics


- 10-fold cross-validation is used with either Logistic Regression or Linear Regression
- model_selection.cross_val_score is used to report performance
    - all scores are reported so that they can be sorted
    - the largest score is best, so error metrics such as MSE are reported negated
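
To make the 10-fold idea concrete, here is a minimal sketch of how consecutive, unshuffled folds can be cut from a dataset. The helper `kfold_indices` and the sample count of 25 are hypothetical and chosen only for illustration; this is not scikit-learn's code, just the same splitting idea in plain Python.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for consecutive, unshuffled folds."""
    # The first (n_samples % n_splits) folds receive one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# hypothetical dataset of 25 rows split into 10 folds
folds = list(kfold_indices(n_samples=25, n_splits=10))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)  # [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```

Each row lands in exactly one test fold, so every observation is used for evaluation once and for training nine times.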

## Classification Metrics

- Classification Accuracy
- Logistic Loss
- Area under ROC Curve
- Confusion Matrix
- Classification Report

### Classification Accuracy

- ratio: correct predictions to total predictions made
- only suitable when observation frequency is equal across classes
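
A small sketch of why accuracy can mislead when classes are unbalanced. The labels below are made up for illustration: a "model" that always predicts the majority class still scores 90%.

```python
def accuracy(y_true, y_pred):
    """Ratio of correct predictions to total predictions made."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# hypothetical imbalanced labels: 9 negatives, 1 positive
y_true = [0] * 9 + [1]

# a useless model that never predicts the positive class
always_negative = [0] * 10

acc = accuracy(y_true, always_negative)
print(acc)  # 0.9
```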

In [2]:
filename = 'pima-indians-diabetes.data.csv'
path = 'D:\\OneDrive - QJA\\My Files\\DataScience\\DataSets'

# name columns
names = ['preg', 'plas', 'pres', 'skin', 'test',
        'mass', 'pedi', 'age', 'class']



In [3]:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

dataframe = read_csv(path + '\\' + filename, names = names)

In [5]:
array = dataframe.values

X = array[:, 0:8]
Y = array[:, 8]

# object to store kfold specs
# (folds are unshuffled; random_state is only valid together with shuffle = True)
kfold = KFold(n_splits = 10)

# object to specify the model
model = LogisticRegression(solver = 'liblinear')

scoring = 'accuracy'

# run model
results = cross_val_score(model, X, Y, 
                          cv = kfold,
                          scoring = scoring)

print('Accuracy: %.3f, Stnd Dev: %.3f' % (results.mean() * 100.0,
                                          results.std() * 100.0))

Accuracy: 76.951, Stnd Dev: 4.841


### Logistic Loss  (Logloss)

- measures the quality of predicted probabilities (0-1) that an observation belongs to a given class
- each prediction is rewarded or penalized in proportion to its confidence
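
The reward/punish behaviour can be seen by computing binary log loss by hand. This is a minimal sketch with made-up probabilities; the `eps` clipping is an assumption added here to avoid `log(0)`.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood for binary labels (0 or 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log() is always defined
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1]

# hypothetical predicted probabilities of class 1
confident_right = log_loss(y_true, [0.9, 0.1, 0.9])
unsure = log_loss(y_true, [0.6, 0.4, 0.6])
confident_wrong = log_loss(y_true, [0.1, 0.9, 0.1])

print('%.3f %.3f %.3f' % (confident_right, unsure, confident_wrong))
```

Confident correct predictions give the smallest loss, and confident wrong predictions are punished most heavily.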

In [8]:
#from pandas import read_csv
#from sklearn.model_selection import KFold
#from sklearn.model_selection import cross_val_score
#from sklearn.linear_model import LogisticRegression

array = dataframe.values

X = array[:, 0:8]
Y = array[:, 8]

# unshuffled folds; random_state is only valid together with shuffle = True
kfold = KFold(n_splits = 10)

model = LogisticRegression(solver = 'liblinear')
scoring = 'neg_log_loss'

results = cross_val_score(model, X, Y,
                          cv = kfold,
                          scoring = scoring)

print('Logloss: %.3f, Stnd Dev: %.3f' % (results.mean(),
                                         results.std()))

## note: smaller logloss is better.  0 means perfect logloss
## result is negative because the measure is inverted when using
##    cross_val_score() (see Machine Learning Mastery with Python
##    ch10 for further explanation)

Logloss: -0.493, Stnd Dev: 0.047


### Area Under ROC Curve

- used for binary classification
- the curve plots the true positive rate against the false positive rate across the thresholds used to map predicted probabilities to class labels
- the area measures how well the model discriminates between the positive and negative classes
- area = 1 means the model made all predictions perfectly; area = 0.5 is no better than random
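
One way to build intuition for the area: it equals the fraction of (positive, negative) pairs in which the positive observation receives the higher score, with ties counted as half. The sketch below uses that pairwise definition on made-up scores; it is an illustration, not how scikit-learn computes the value internally.

```python
def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# hypothetical labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
print(auc)  # 0.75

# every positive outranks every negative: perfect discrimination
perfect = roc_auc(y_true, [0.1, 0.2, 0.8, 0.9])
print(perfect)  # 1.0
```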

In [9]:
#from pandas import read_csv
#from sklearn.model_selection import KFold
#from sklearn.model_selection import cross_val_score
#from sklearn.linear_model import LogisticRegression

array = dataframe.values

X = array[:, 0:8]
Y = array[:, 8]

# unshuffled folds; random_state is only valid together with shuffle = True
kfold = KFold(n_splits = 10)
model = LogisticRegression(solver = 'liblinear')

scoring = 'roc_auc'

results = cross_val_score(model, X, Y,
                          cv = kfold,
                          scoring = scoring)

print('AUC: %.3f, Stnd Dev: %.3f' % (results.mean()*100.0,
                                     results.std()*100.0))

AUC: 82.342, Stnd Dev: 4.071


### Confusion Matrix

- rows are true outcomes (y-axis) and columns are predictions (x-axis)

In [11]:
#from pandas import read_csv
#from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

array = dataframe.values

X = array[:, 0:8]
Y = array[:, 8]

test_size = 0.33
seed = 7

X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size = test_size,
                                                    random_state = seed)

# object to represent the logisticregression algo
model = LogisticRegression(solver = 'liblinear')

# fit on training data
model.fit(X_train, Y_train)

# predict on test data
predicted = model.predict(X_test)

# create matrix of Y_test compared to predicted values
matrix = confusion_matrix(Y_test, predicted)
print(matrix)

# correct predictions fall on the main diagonal of the matrix
# (true 0 predicted as 0, true 1 predicted as 1)

[[141  21]
 [ 41  51]]
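
As a quick check on how to read the matrix, the sketch below unpacks the four printed counts and recomputes accuracy from the diagonal. Only the numbers shown in the output above are used; the variable names are chosen here for illustration.

```python
# counts copied from the printed matrix: rows = true class, columns = predicted class
matrix = [[141, 21],
          [41, 51]]

tn, fp = matrix[0]  # true class 0: predicted 0, predicted 1
fn, tp = matrix[1]  # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total  # correct predictions sit on the diagonal

print(total)               # 254
print(round(accuracy, 3))  # 0.756
```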


### Classification Report

- displays precision, recall, F1-score, and support

In [12]:
#from pandas import read_csv
#from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

array = dataframe.values

X = array[:, 0:8]
Y = array[:, 8]

test_size = 0.33
seed = 7

X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                   test_size = test_size,
                                                   random_state = seed)

# object to hold log reg algo specs
model = LogisticRegression(solver = 'liblinear')

# fit model on train data
model.fit(X_train, Y_train)

# predict on test set
predicted = model.predict(X_test)

# create report
report = classification_report(Y_test, predicted)

print(report)

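## precision = TP / (TP + FP): share of predictions for a class that are correct
## recall    = TP / (TP + FN): share of actual members of a class that are found
## f1-score  = harmonic mean of precision and recall
## support   = number of true observations of each class in the test set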

              precision    recall  f1-score   support

         0.0       0.77      0.87      0.82       162
         1.0       0.71      0.55      0.62        92

    accuracy                           0.76       254
   macro avg       0.74      0.71      0.72       254
weighted avg       0.75      0.76      0.75       254
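
The report values can be reproduced by hand from the confusion matrix counts printed earlier in this recipe (141, 21, 41, 51), since both examples use the same split and model. A minimal sketch; the helper name `prf` is hypothetical.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one class from its counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# class 1.0: 51 found, 21 false alarms, 41 missed
class_1 = prf(tp=51, fp=21, fn=41)

# class 0.0: the roles of the two error counts swap
class_0 = prf(tp=141, fp=41, fn=21)

print('%.2f %.2f %.2f' % class_0)  # 0.77 0.87 0.82
print('%.2f %.2f %.2f' % class_1)  # 0.71 0.55 0.62
```

Support is simply the row total of the confusion matrix: 141 + 21 = 162 for class 0.0 and 41 + 51 = 92 for class 1.0.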



## Regression metrics

- Mean absolute error
- Mean squared error
- R squared

- common metrics for evaluating predictions on regression machine learning problems

### Mean Absolute Error

- mean of the absolute differences between predicted and actual values
- gives idea of magnitude of error (but not direction)
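
A hand computation on made-up values shows what "magnitude but not direction" means: over- and under-predictions partly cancel in the plain average of errors, but not in the MAE.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all observations."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# hypothetical actual and predicted values
actual = [24.0, 21.5, 30.0, 18.0]
predicted = [26.0, 20.5, 27.0, 18.0]

mae = mean_absolute_error(actual, predicted)
mean_signed = sum(t - p for t, p in zip(actual, predicted)) / len(actual)

print(mae)          # 1.5
print(mean_signed)  # 0.5
```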


In [18]:
# this example uses the Boston house price data

path = 'D:\\OneDrive - QJA\\My Files\\DataScience\\DataSets'
filename = 'housing.csv'
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 
         'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 
         'LSTAT', 'MEDV']



In [21]:
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

dataframe = read_csv(path + '\\' + filename, 
                     sep = r'\s+',   # whitespace-delimited file
                     names = names)

array = dataframe.values
X = array[:, 0:13]
Y = array[:, 13]

n_splits = 10

# object to specify kfold specs (unshuffled, so no random_state is set)
kfold = KFold(n_splits = n_splits)

# object to specify desired algo
model = LinearRegression()

# specify scoring method
scoring = 'neg_mean_absolute_error'

# run model
results = cross_val_score(model, X, Y,
                          cv = kfold,
                          scoring = scoring)

print('MAE: %.3f, Stnd Dev: %.3f' % (results.mean(),
                                     results.std()))

## note: MAE of 0 indicates perfect prediction
##    like logloss, the cross_val_score() function inverts the metric

MAE: -4.005, Stnd Dev: 2.084


### Mean Squared Error

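- mean of the squared differences between predicted and actual values
- like MAE, gives the magnitude of error but not the direction
- taking the square root (RMSE) converts the value back to the units of the output variable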
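
Squaring makes MSE more sensitive to large errors than MAE. The sketch below compares two made-up sets of errors with the same MAE but very different MSE.

```python
def mae(errors):
    """Mean of absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

def mse(errors):
    """Mean of squared errors."""
    return sum(e * e for e in errors) / len(errors)

# hypothetical prediction errors (actual - predicted)
even = [2.0, 2.0, 2.0, 2.0]    # every prediction is off by a little
spiky = [0.0, 0.0, 0.0, 8.0]   # one prediction is off by a lot

print(mae(even), mae(spiky))  # 2.0 2.0
print(mse(even), mse(spiky))  # 4.0 16.0
```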

In [23]:
#from pandas import read_csv
#from sklearn.model_selection import KFold
#from sklearn.model_selection import cross_val_score
#from sklearn.linear_model import LinearRegression

array = dataframe.values

X = array[:, 0:13]
Y = array[:, 13]

n_splits = 10

# object to hold kfold algo specs (unshuffled, so no random_state is set)
kfold = KFold(n_splits = n_splits)

model = LinearRegression()

scoring = 'neg_mean_squared_error'

results = cross_val_score(model, X, Y,
                          cv = kfold,
                          scoring = scoring)

print('MSE: %.3f, Stnd Dev: %.3f' % (results.mean(),
                                     results.std()))

## NOTE: metric is inverted (negated) so that larger results are better
## to calc root mean squared error, take the absolute value first, then the square root

MSE: -34.705, Stnd Dev: 45.574
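
A short sketch of the conversion to RMSE, using the mean score printed above. The per-fold scores in the second half are made up, and only illustrate that the root of the mean MSE is not the same as the mean of per-fold RMSE values.

```python
import math

# mean negated MSE as reported by cross_val_score above
neg_mse_mean = -34.705
rmse = math.sqrt(abs(neg_mse_mean))
print('RMSE: %.3f' % rmse)  # RMSE: 5.891

# hypothetical per-fold negated MSE scores
fold_scores = [-9.0, -16.0, -25.0]
per_fold_rmse = [math.sqrt(abs(s)) for s in fold_scores]
mean_of_rmse = sum(per_fold_rmse) / len(per_fold_rmse)
rmse_of_mean = math.sqrt(abs(sum(fold_scores) / len(fold_scores)))

print(mean_of_rmse)            # 4.0
print(round(rmse_of_mean, 3))  # 4.082
```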


### R squared Metric

- indication of "goodness of fit"
- also called 'coefficient of determination'
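- usually between 0 (no fit) and 1 (perfect fit); can be negative when the model predicts worse than the mean of the data, which can happen on held-out folds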
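
R squared compares the model's squared error with the squared error of always predicting the mean. The sketch below computes it by hand on made-up values and shows that a model worse than the mean gets a negative score.

```python
def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# hypothetical actual values and three sets of predictions
actual = [1.0, 2.0, 3.0, 4.0]
good = [1.1, 1.9, 3.2, 3.8]    # close to the actual values
mean_only = [2.5] * 4          # always predict the mean
bad = [4.0, 3.0, 2.0, 1.0]     # trend is reversed

print(round(r_squared(actual, good), 2))  # 0.98
print(r_squared(actual, mean_only))       # 0.0
print(r_squared(actual, bad))             # -3.0
```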

In [24]:
#from pandas import read_csv
#from sklearn.model_selection import KFold
#from sklearn.model_selection import cross_val_score
#from sklearn.linear_model import LinearRegression

array = dataframe.values

X = array[:, 0:13]
Y = array[:, 13]

n_splits = 10

# unshuffled folds, so no random_state is set
kfold = KFold(n_splits = n_splits)

model = LinearRegression()

scoring = 'r2'

results = cross_val_score(model, X, Y,
                          cv = kfold,
                          scoring = scoring)

print('R^2: %.3f, Stnd Dev: %.3f' % (results.mean(),
                                     results.std()))

R^2: 0.203, Stnd Dev: 0.595
