# Module 1: Regression

In this lab you will build a regression model on the same red wine quality dataset and practice the same training and validation methodology. The model is based on the linear regression implementation provided by scikit-learn.

#### scikit-learn

Read about scikit-learn as your time permits: http://scikit-learn.org/stable/


Relevant sklearn API references:
 * [sklearn.linear_model.LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
 * [sklearn.linear_model.LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
 * [sklearn.metrics.mean_squared_error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html)

In [None]:
%matplotlib inline
import os, sys
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

## Load Dataset

Load the dataset from file into a table (a pandas DataFrame) and understand its structure. 
Then check its shape, columns, and label distribution.

In [None]:
# Dataset location
DATASET = '/dsa/data/all_datasets/wine-quality/winequality-red.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET, sep=';').sample(frac = 1).reset_index(drop=True)

# View some metadata of the dataset and see if that makes sense
print('dataset.shape', dataset.shape)
dataset.describe()

In [None]:
X = np.array(dataset.iloc[:,:-1])[:, [1,2,6,9,10]]
y = np.array(dataset.quality)

print('X', X.shape, 'y', y.shape)
                                # Refresher: This is a dictionary comprehension
print('Label distribution:', {i: np.sum(y==i) for i in np.unique(dataset.quality)})
                                # For each unique value in the quality column, 
                                #     count the number of times it occurs 
                                #      and store it in the dictionary with the quality as the key
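As a side note to the refresher above, NumPy can return the counts directly through `np.unique(..., return_counts=True)`. The sketch below compares the two approaches on a small, made-up label array (`labels` is hypothetical and stands in for `dataset.quality`).

```python
import numpy as np

# Hypothetical quality labels standing in for dataset.quality
labels = np.array([5, 6, 5, 7, 6, 5, 3, 8, 6, 5])

# Dictionary comprehension, as in the cell above:
# one comparison pass over the data per unique value
counts_comprehension = {int(v): int(np.sum(labels == v)) for v in np.unique(labels)}

# np.unique can return the unique values and their counts in one call
values, counts = np.unique(labels, return_counts=True)
counts_unique = {int(v): int(c) for v, c in zip(values, counts)}

print(counts_comprehension)
print(counts_unique)
```

Both dictionaries hold the same counts; the second form avoids rescanning the array for every label.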

## Make the training/validation split, train the model, and validate it

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
model = LinearRegression()     # this is a blank model
model.fit(X_train, y_train)    # Train the model against the data
model.score(X_test, y_test)

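This score is the *coefficient of determination* (aka R squared) of the model on the held-out test data, 
which measures what proportion of the total variation in the target is explained by the model. A value of 1 is a perfect fit; values near 0 mean the model does little better than always predicting the mean.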
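To make the definition concrete, R squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses synthetic data rather than the wine dataset (`X_demo`, `y_demo`, and `model_demo` are illustrative names) and checks the manual value against `LinearRegression.score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration, not the wine dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Fit on the first 150 rows, hold out the last 50
model_demo = LinearRegression().fit(X_demo[:150], y_demo[:150])
X_held, y_held = X_demo[150:], y_demo[150:]
pred = model_demo.predict(X_held)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_held - pred) ** 2)           # variation left unexplained
ss_tot = np.sum((y_held - y_held.mean()) ** 2)  # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model_demo.score(X_held, y_held))
```

Because the ratio is measured against a predict-the-mean baseline, R squared on held-out data can be negative when a model does worse than that baseline.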

scikit-learn also provides a convenience function for computing mean squared error.

In [None]:
from sklearn.metrics import mean_squared_error

# Measure the model error from expected and predicted output.
# Use the held-out test split so the error is not flattered by training rows.
mean_squared_error(y_test, model.predict(X_test))  # also known as MSE
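For reference, mean squared error is the average of the squared differences between expected and predicted values, and its square root (RMSE) is in the same units as the target. The sketch below uses a handful of made-up values (`y_true` and `y_pred` are hypothetical) and compares the manual result with `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical expected and predicted quality scores
y_true = np.array([5, 6, 5, 7, 4])
y_pred = np.array([5.4, 5.7, 5.1, 6.2, 5.0])

# MSE: mean of squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# RMSE: square root of MSE, expressed in quality points
rmse = np.sqrt(mse_manual)

print(mse_manual, mean_squared_error(y_true, y_pred), rmse)
```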

Optionally you can print out a sample and see for yourself how the linear regression model performs.

In [None]:
print(y[20:55])
print(np.round(model.predict(X)[20:55]).astype('i4'))

You can also plot one of the input features against the regression response to visualize their relationship. (Five features were selected, so the columns of `X` are indexed from 0 to 4.)

In [None]:
# Pass x and y by keyword; recent seaborn releases no longer accept them positionally
sns.regplot(x=X[:, 0], y=y)


### Model Confusion
A useful tool to employ beyond single-number metrics is a confusion matrix.

Please read about it here: https://en.wikipedia.org/wiki/Confusion_matrix  
You will note that from a confusion matrix, a large number of additional model performance metrics can be computed.


In [None]:
from sklearn.metrics import confusion_matrix
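# Compute the confusion matrix from expected and predicted values (same argument order as mean_squared_error).
# The continuous regression outputs are rounded to the nearest integer quality first.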
confusion_matrix(y_test, np.round(model.predict(X_test)).astype('i4'))

The main diagonal, top left to bottom right, holds the counts of correct model predictions.
The off-diagonal counts are the errors. In scikit-learn's layout, rows are the true classes and columns are the predicted classes.
This gives a per-class-pair breakdown of the model performance.
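As an illustration of the additional metrics a confusion matrix supports, the sketch below derives overall accuracy and per-class recall and precision from the matrix alone. The label arrays are made up for the example, and `labels=[5, 6, 7]` is an assumption chosen to keep the matrix small.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted quality labels
y_true = np.array([5, 5, 5, 6, 6, 6, 6, 7, 7, 5])
y_pred = np.array([5, 5, 6, 6, 6, 5, 6, 6, 7, 5])
labels = [5, 6, 7]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Accuracy: correct predictions (main diagonal) over all predictions
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions over the row total (all true members)
recall = np.diag(cm) / cm.sum(axis=1)

# Precision per class: correct predictions over the column total (all predicted members)
precision = np.diag(cm) / cm.sum(axis=0)

print(cm)
print('accuracy', accuracy)
print('recall', dict(zip(labels, recall)))
print('precision', dict(zip(labels, precision)))
```

On real data a class may never be predicted, which makes its column total zero; the precision division would then need guarding.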

### <span style="background:yellow">NOTE:</span>
Though we think of wine **quality** as an ordinal value, we are attempting to use regression which predicts continuous values.
We are purposely looking at this problem with sub-optimal tools to illustrate various concepts and stimulate some contemplation.
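One consequence of this mismatch is that raw regression outputs must be converted to labels, and nothing stops them from landing outside the observed label range. The sketch below uses made-up predictions (`raw` is hypothetical) and assumes, for illustration, that valid quality labels run from 3 to 8.

```python
import numpy as np

# Hypothetical continuous outputs from a regression model
raw = np.array([2.4, 5.49, 5.5, 6.5, 8.7])

# np.round rounds halves to the nearest even integer, so 5.5 -> 6 and 6.5 -> 6
rounded = np.round(raw).astype('i4')

# Clip to the assumed valid label range [3, 8]
clipped = np.clip(rounded, 3, 8)

print(rounded)
print(clipped)
```

Rounding alone can yield labels such as 2 or 9 that never occur in the assumed range, which would add extra rows or columns to a confusion matrix.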


### Using an alternate form of regression
We can reduce this to a 2-class (binary) problem as we did in the previous lab.
Then a 2-class classifier built on regression can be used: logistic regression.  
You previously saw this model in the Statistical and Mathematical Foundations class.
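As a reminder of how the model works, logistic regression passes a linear score through the sigmoid function to obtain the probability of class 1, then thresholds that probability. The sketch below checks this against scikit-learn on synthetic data (`X_demo`, `y_demo`, and `clf` are illustrative names, not part of the lab).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data (an assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
noise = rng.normal(scale=0.5, size=300)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score: w . x + b
scores = X_demo @ clf.coef_.ravel() + clf.intercept_[0]

# Sigmoid maps the score to a probability of class 1
proba_manual = 1.0 / (1.0 + np.exp(-scores))
proba_sklearn = clf.predict_proba(X_demo)[:, 1]

# Predict class 1 when the probability exceeds 0.5 (equivalently, score > 0)
labels_manual = (proba_manual > 0.5).astype(int)

print(np.allclose(proba_manual, proba_sklearn))
print(np.array_equal(labels_manual, clf.predict(X_demo)))
```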

In [None]:
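# Binarize wine quality for simplicity: 0 for quality < 6, 1 for quality >= 6.
# Recomputing from the original column keeps this cell safe to re-run.
y = (np.array(dataset.quality) >= 6).astype(int)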
print('X', X.shape, 'y', y.shape)
print('Label distribution:', {0: np.sum(y==0), 1: np.sum(y==1)})

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
model = LogisticRegression()   # this is a new model
model.fit(X_train, y_train)
model.score(X_test, y_test)

**That number may look familiar!**
How does it compare to the Naive Bayes classifier in the previous lab?