# Module 1: Classification

In this lab you will create a classification model on the same red wine quality dataset, then apply and practice the same training and validation methodology.
The classification model will be based on Gaussian Naive Bayes (`GaussianNB`) as provided by scikit-learn.

In [None]:
import os, sys
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

## Load Dataset

We will load the dataset from a file into a pandas DataFrame and investigate its structure.


In [None]:
# Dataset location
DATASET = '/dsa/data/all_datasets/wine-quality/winequality-red.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET, sep=';').sample(frac = 1).reset_index(drop=True)

# View some metadata of the dataset and see if that makes sense
print('dataset.shape', dataset.shape)

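# Features: five columns selected by position from all columns except the last one
X = np.array(dataset.iloc[:,:-1])[:, [1,2,6,9,10]]
# Labels: the quality column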
y = np.array(dataset.quality)

print('X', X.shape, 'y', y.shape)
print('Label distribution:', {i: np.sum(y==i) for i in np.unique(dataset.quality)})

Describe the dataset.

In [None]:
dataset.describe()

## Make the train/validation split and then train the model

In [None]:
# You have seen this before!
# If you are so inclined, you may want to tweak the test_size and see how the model performs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
model = GaussianNB()
model.fit(X_train, y_train)
# For a classifier, score() returns the mean accuracy on the given data
model.score(X_test, y_test)
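
To get a feel for how `test_size` affects the reported accuracy, the following is a minimal sketch on synthetic data. The two Gaussian blobs, the names `X_demo` and `y_demo`, and the fixed `random_state` are assumptions made for illustration only; they are not part of the wine dataset workflow above.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): two Gaussian blobs with 5 features each
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(200, 5)),
                    rng.normal(1.5, 1.0, size=(200, 5))])
y_demo = np.array([0] * 200 + [1] * 200)

# Train and score one model per test_size; random_state makes each split repeatable
scores = {}
for test_size in (0.1, 0.25, 0.5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=test_size, random_state=0)
    scores[test_size] = GaussianNB().fit(X_tr, y_tr).score(X_te, y_te)

for test_size, acc in scores.items():
    print(f'test_size={test_size:.2f}  accuracy={acc:.3f}')
```

A smaller test set leaves more data for training but makes the accuracy estimate noisier, so expect the numbers to move a little between runs when `random_state` is not fixed.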

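Optionally, you can print a sample of true labels next to the model's predictions and see for yourself how the classifier performs. Note that these rows come from the full shuffled dataset, so some of them may have been used for training.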

In [None]:
print(y[20:50])
print(model.predict(X[20:50]))

### Model Confusion
Again, we will judge the classification performance with a confusion matrix.

Please read about it here: https://en.wikipedia.org/wiki/Confusion_matrix  
You will note that from a confusion matrix, a large number of additional model performance metrics can be computed.


In [None]:
from sklearn.metrics import confusion_matrix
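# Compute the confusion matrix from the true labels and the predicted labels
# (rows are true classes, columns are predicted classes).
# GaussianNB already predicts integer class labels, so no rounding is needed.
confusion_matrix(y_test, model.predict(X_test))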
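
Reading a multiclass confusion matrix takes a little practice, so here is a small sketch on hand-made labels. The arrays `y_true_demo` and `y_pred_demo` are hypothetical values chosen for illustration, not output from the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted quality labels (illustration only)
y_true_demo = np.array([5, 5, 5, 6, 6, 6, 6, 7, 7, 5])
y_pred_demo = np.array([5, 5, 6, 6, 6, 5, 6, 6, 7, 5])

# Fix the class order so that row i / column j correspond to labels[i] / labels[j]
labels = [5, 6, 7]
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm)

# Correct predictions sit on the diagonal, so accuracy is trace / total
accuracy = np.trace(cm) / cm.sum()
print('accuracy:', accuracy)
```

Each row holds the samples of one true class, and each column holds the samples assigned to one predicted class; everything off the diagonal is a misclassification.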

### Beyond Confusion Matrix: Precision, Recall, and F1

Here we are going to look at a couple of additional measures.

First: 
  * _condition positive_ (P) is the number of real positive cases in the data
  * _condition negative_ (N) is the number of real negative cases in the data

Then: 
  * _true positive_ (TP) is a correct prediction of a class, equivalent to a hit in a Yes / No model
  * _true negative_ (TN) is a correct prediction of not a class, equivalent to a correct rejection in a Yes / No model
  * _false positive_ (FP) is an incorrect prediction of a class, equivalent to a false alarm in a Yes / No model, a **Type I error**
  * _false negative_ (FN) is an incorrect prediction of not a class, equivalent to a miss in a Yes / No model, a **Type II error**

Metrics:
  * Recall or True Positive Rate: $$ Recall = \frac{TP}{P} = \frac{TP}{TP+FN} $$
  * Precision or Positive Predictive Value: $$ Precision = \frac{TP}{TP+FP} $$
  * [F1 is the harmonic mean of precision and recall](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) (recall is also called sensitivity): $$ F_{1} = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} $$
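
The formulas above can be applied per class in a multiclass problem by treating one class as "positive" and all others as "negative". The following sketch does this by hand for a hypothetical 3-class confusion matrix; the matrix values and the names `cm`, `tp`, `fp`, and `fn` are assumptions for illustration.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class)
labels = [5, 6, 7]
cm = np.array([[3, 1, 0],
               [1, 3, 0],
               [0, 1, 1]])

tp = np.diag(cm)            # correctly predicted samples of each class
fp = cm.sum(axis=0) - tp    # predicted as the class, but actually another class
fn = cm.sum(axis=1) - tp    # actually the class, but predicted as another class

# Caveat: a class that is never predicted gives TP + FP = 0 and a division by zero here
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

for label, p, r, f in zip(labels, precision, recall, f1):
    print(f'class {label}: precision={p:.3f} recall={r:.3f} F1={f:.3f}')
```

For class 6, for example, TP = 3, FP = 2, and FN = 1, which gives a precision of 3/5, a recall of 3/4, and an F1 of 2/3.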
  
#### More details on scikit-learn model scoring:
http://scikit-learn.org/stable/modules/model_evaluation.html

In [None]:
from sklearn.metrics import f1_score
# Please read the documentation on the `average` parameter.
# GaussianNB already predicts integer class labels, so no rounding is needed.
f1_score(y_test, model.predict(X_test), average='micro')
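
As a hint for what the `average` parameter does, here is a small sketch comparing two of its options on hand-made labels. The arrays are hypothetical and chosen for illustration. In a single-label multiclass setting like this one, micro-averaging pools TP, FP, and FN over all classes, which makes the result equal to plain accuracy, while macro-averaging takes the unweighted mean of the per-class F1 scores.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true and predicted quality labels (illustration only)
y_true_demo = np.array([5, 5, 5, 6, 6, 6, 6, 7, 7, 5])
y_pred_demo = np.array([5, 5, 6, 6, 6, 5, 6, 6, 7, 5])

micro = f1_score(y_true_demo, y_pred_demo, average='micro')  # pooled over all classes
macro = f1_score(y_true_demo, y_pred_demo, average='macro')  # mean of per-class F1
accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(f'micro F1 = {micro:.3f}')
print(f'macro F1 = {macro:.3f}')
print(f'accuracy = {accuracy:.3f}')
```

Because macro-averaging weights every class equally, it tends to drop when rare classes are predicted poorly, which is worth keeping in mind given the uneven label distribution printed earlier.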