# Machine Learning introduction

We introduce the machine learning approach using regression and classification algorithms. We consider multiple linear regression and, for classification, logistic regression, and show how both are used from the `sklearn` library.

## The data

Let's use a dataset from the UCI Machine Learning Repository that contains measurements for NACA 0012 airfoils of different sizes that were exposed to various wind tunnel speeds and angles of attack. The span of the airfoil and the observer position were the same in all of the experiments.

The dataset has the following attributes.
These are the inputs:
1. Frequency, in hertz.
2. Angle of attack, in degrees.
3. Chord length, in meters.
4. Free-stream velocity, in meters per second.
5. Suction side displacement thickness, in meters.

The only output is:
6. Scaled sound pressure level, in decibels. 

Source: [Airfoil Self-Noise Data Set](http://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise)

# Multiple Linear Regression

### The goal

The goal is to predict the sound pressure level, in decibels, from the given input variables using multiple linear regression.

### Import libraries

Import the necessary libraries.

In [None]:
from sklearn import linear_model # for linear regression modeling
from sklearn import preprocessing # for preprocessing like imputing missing values
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
plt.rc("font", size=14)
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)

In [None]:
%matplotlib inline 

### Load the data

In [None]:
data = pd.read_csv('airfoil_self_noise.dat', sep='\t', names=['Frequency(Hz)', 
                                                              'Angle(deg)', 
                                                              'Chord(m)', 
                                                              'Velocity(m/s)',
                                                              'Suction(m)',
                                                              'Pressure(dec)'])

In [None]:
data.head()

In [None]:
data.shape

### Explore the data

In [None]:
data.describe()

### Visualize the data

In [None]:
pd.DataFrame.hist(data, figsize = [15,15]);

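The attributes are measured on different scales, with `Frequency(Hz)` spanning by far the widest range. Ordinary least squares does not depend on the scale of the inputs, so there is no need to rescale the data for this model.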
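
As a rough check of why rescaling is not needed for ordinary least squares, the sketch below fits the same model on raw and on standardised inputs and compares the predictions. The data is synthetic and all names (`X_demo`, `y_demo`, `raw_pred`, `scaled_pred`) are assumptions made for illustration, not part of the airfoil notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption, not the airfoil data): two features on very different scales.
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(200, 20000, 100), rng.uniform(0, 1, 100)])
y_demo = 0.001 * X_demo[:, 0] + 5.0 * X_demo[:, 1] + rng.normal(0, 0.1, 100)

# Fit once on the raw features and once on standardised features.
raw_pred = LinearRegression().fit(X_demo, y_demo).predict(X_demo)
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_pred = LinearRegression().fit(X_scaled, y_demo).predict(X_scaled)

# The coefficients differ between the two fits, but the predictions should agree
# up to floating-point error.
print(np.allclose(raw_pred, scaled_pred))
```

This only holds for unregularised linear regression; models with a penalty term or distance-based models generally do benefit from rescaling.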

### Split data into training and test sets

First, divide the dataset into the predictor variables and the outcome variable.

In [None]:
X = data.drop(['Pressure(dec)'], axis=1).values # X holds the input (or independent) variables
y = data['Pressure(dec)'].values # y is the output (or dependent) variable

Now split the data into training and test sets. A rule of thumb is to split the data 80/20 or 70/30 between training and testing. To do this, first import the following function:

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# create training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
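
The split above is random, so the rows that end up in each set change from run to run. A minimal sketch, on a small made-up array, of how passing `random_state` makes an 80/20 split reproducible (the names `X_demo`, `y_demo`, `a_train`, `a_test`, `b_train`, `b_test` and the seed 42 are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same seed return the same partition.
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(a_train.shape, a_test.shape)      # 8 training rows and 2 test rows
print(np.array_equal(a_test, b_test))   # identical test sets
```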

### Fit the model

The `fit()` function fits a linear model. We'll fit the model on the training data.

In [None]:
lm = linear_model.LinearRegression()
model = lm.fit(X_train,y_train)
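
After fitting, the learned intercept and one coefficient per input variable are available as `intercept_` and `coef_`. The sketch below uses noise-free synthetic data with known coefficients (an assumption chosen for illustration; `X_demo`, `y_demo` and `demo_lm` are hypothetical names) to show that the fit recovers them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated from y = 3 + 2*x1 - 1*x2.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

demo_lm = LinearRegression().fit(X_demo, y_demo)

# The intercept should be close to 3 and the coefficients close to [2, -1].
print(demo_lm.intercept_, demo_lm.coef_)
```

On the airfoil data the same attributes on `lm` give one coefficient for each of the five input columns, in column order.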

### Make predictions

We'll use the fitted linear model to predict the dependent variable for the test dataset.

In [None]:
predictions = lm.predict(X_test)

In [None]:
print(predictions[0:5]) # print the first 5 predictions

### Plot the model

In [None]:
import matplotlib.pyplot as plt

Don't forget to import the plotting package. It was already imported at the top of the notebook, so repeating the import here is harmless.

Now plot the true values against the predictions.

In [None]:
plt.scatter(y_test, predictions)
plt.xlabel("True values")
plt.ylabel("Predictions")

### Determine model accuracy

In [None]:
print("Score:", model.score(X_test, y_test))

The `score` is the coefficient of determination $R^2$, also known as a goodness-of-fit measure. Put another way, $R^2$ is the proportion of the total variance in the dependent variable that is explained by the model. The best possible score is 1.0, which means the model predicts the dependent variable without error; the score can be negative when the model fits worse than always predicting the mean.
See [sklearn.linear_model.LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.score) for a description of how the coefficient of determination is calculated.
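
The coefficient of determination can be written as $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the true values from their mean. A minimal sketch of that calculation on four made-up values (the arrays and variable names are assumptions for illustration), checked against `sklearn.metrics.r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

# Both values should agree (about 0.949 for these numbers).
print(r2_manual, r2_score(y_true, y_hat))
```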

# Logistic Regression

We now look at how to build a classifier, here a binary classifier, that classifies data points into two classes, 0 or 1.

### The data

The data is experimental data that we'll use to determine room occupancy depending on the temperature, humidity, light and CO2 levels in a room. 
Source: [UCI Machine Learning Repository: Occupancy Detection Data Set](http://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+).

### The goal

Predict whether a room is occupied or not given the temperature, humidity, light, carbon dioxide and humidity ratio. We'll use logistic regression to solve this problem.

### Load the data

In [None]:
df = pd.read_csv('datatraining.txt')

### Explore the data

In [None]:
df.head()

In [None]:
df.shape

Because this is time series data and we're not particularly interested in analyses over time, drop the `date` column.

In [None]:
df.drop(['date'], axis=1, inplace=True); df.tail()

In [None]:
#pd.DataFrame.hist(df, figsize = [15,15]); # uncomment and run

In [None]:
df.groupby('Occupancy').count()

In [None]:
df.Occupancy.value_counts() # count the rows in each class

In [None]:
sns.countplot(x='Occupancy',data=df, palette='hls')


### Import the libraries

Import the logistic regression module and metrics module for evaluating.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

Put data into matrices.

In [None]:
X = df.drop(['Occupancy'], axis = 1).values # X are the input (or independent) variables
y = df['Occupancy'].values # Y is output (or dependent) variable

In [None]:
print(X[0:5])
print(y[0:5])

### Feature selection

You can use the recursive feature elimination algorithm (among many) to determine those features that are important for predicting the outcome of interest. According to the [sklearn source](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html), "the function selects features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a  feature_importances_ attribute. Then, the least important features are pruned from current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached."

Source: [sklearn.feature_selection.RFE (Recursive Feature Elimination)](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html)

In [None]:
from sklearn.feature_selection import RFE # import the package

In [None]:
logreg = LogisticRegression(solver='lbfgs') # create the model

In [None]:
selector = RFE(logreg, n_features_to_select=5) # defaults to half the features but we have 5 features, let's use them all
selector = selector.fit(X, y)
print(selector.support_)
print(selector.ranking_)

Because `n_features_to_select` equals the number of features here, RFE eliminates nothing: every feature is selected and receives rank 1. This does not compare the features with one another, so we simply use all the predictor features.
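
RFE only produces a ranking when it is asked to keep fewer features than are available. A minimal sketch on synthetic data, in which only the first two of five columns determine the class (the data, the choice of keeping two features, and the names `X_demo`, `y_demo`, `demo_selector` are assumptions for illustration):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: the class depends only on columns 0 and 1; columns 2-4 are noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Keep 2 of the 5 features; one feature is dropped per elimination step.
demo_selector = RFE(LogisticRegression(solver='lbfgs'), n_features_to_select=2)
demo_selector = demo_selector.fit(X_demo, y_demo)

print(demo_selector.support_)   # True for the two retained features
print(demo_selector.ranking_)   # 1 for retained features, larger numbers were dropped earlier
```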

### Split into training and test datasets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [None]:
print('Training data', len(X_train), len(y_train))
print('Test data', len(X_test), len(y_test))

We split the data 70/30; i.e., 70% for training and 30% for testing the model.

### Fit the model

In [None]:
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X_train, y_train)

See [sklearn.linear_model.LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for a detailed description of what each parameter and fitted attribute means.

### Predicting with the model

We predict using the test dataset.

In [None]:
y_pred = logreg.predict(X_test)

Then determine the accuracy of the predictions.

In [None]:
print('Accuracy score: {:.2f}'.format(logreg.score(X_test, y_test)))
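
To make the link between the fitted model and its 0/1 predictions concrete: for a binary problem, the model computes a linear score, passes it through the logistic (sigmoid) function to obtain the probability of class 1, and predicts 1 when that probability exceeds 0.5. The sketch below checks this on synthetic data; the data and the names `X_demo`, `y_demo`, `demo_clf`, `p_manual`, `p_model`, `labels_manual` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = (X_demo[:, 0] - X_demo[:, 1] + rng.normal(0, 0.5, 100) > 0).astype(int)

demo_clf = LogisticRegression(solver='lbfgs').fit(X_demo, y_demo)

z = demo_clf.decision_function(X_demo)          # linear score w.x + b
p_manual = 1.0 / (1.0 + np.exp(-z))             # sigmoid of the score
p_model = demo_clf.predict_proba(X_demo)[:, 1]  # probability of class 1 from the model
labels_manual = (p_manual > 0.5).astype(int)    # threshold at 0.5

print(np.allclose(p_manual, p_model))
print(np.array_equal(labels_manual, demo_clf.predict(X_demo)))
```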

### Cross validation

We use cross validation to check for overfitting, that is, to determine whether the model generalises well to data it was not trained on.

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

We use a method called k-fold cross validation. We divide the training set into $k$ subsets (folds). For each fold, build the model on the other $k-1$ folds and test it on the held-out fold, recording the score (or error) on that fold. Repeat until each of the $k$ folds has served as the test set. The average of the $k$ recorded values is the cross-validation score and is the performance metric.

In [None]:
cv = KFold(n_splits=10); cv

In [None]:
logreg = LogisticRegression(solver='lbfgs')
results = cross_val_score(logreg, X_train, y_train, cv=cv, scoring='accuracy')

In [None]:
print(results)

In [None]:
print("10-fold cross validation average accuracy: %.3f" % (results.mean()))

The result is close to the accuracy on the test set computed above, so we can say the model generalises well.
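
The k-fold procedure can also be written out as an explicit loop, which makes clear what `cross_val_score` does with a `KFold` splitter. This is a sketch on synthetic data with 5 folds; the data and the names `X_demo`, `y_demo`, `demo_cv`, `demo_model`, `manual_scores`, `library_scores` are assumptions for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic two-class data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

demo_cv = KFold(n_splits=5)
demo_model = LogisticRegression(solver='lbfgs')

# Train on k-1 folds, score on the held-out fold, and repeat for every fold.
manual_scores = []
for train_idx, test_idx in demo_cv.split(X_demo):
    fold_model = clone(demo_model).fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

# The library helper runs the same procedure in one call.
library_scores = cross_val_score(demo_model, X_demo, y_demo, cv=demo_cv, scoring='accuracy')

print(np.mean(manual_scores), library_scores.mean())
```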

### Confusion matrix

Why do we use a confusion matrix and what does a confusion matrix tell us? We use a confusion matrix to show how many data points were correctly classified and how many were misclassified.

See this guide for details on confusion matrices: [Simple guide to confusion matrix terminology](http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/)

In [None]:
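# metrics was imported above; calling metrics.confusion_matrix keeps the
# result from overwriting the function, so this cell can be re-run safely
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
print(confusion_matrix)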

In [None]:
ax= plt.subplot()
sns.heatmap(confusion_matrix, annot=True, fmt='d', ax=ax); # annot=True puts the counts in the cells; fmt='d' shows them as whole numbers
# labels, title and ticks
ax.set_xlabel('Predicted');ax.set_ylabel('Actual'); 
ax.set_title('Confusion Matrix'); 
ax.xaxis.set_ticklabels(['0', '1']); ax.yaxis.set_ticklabels(['0', '1']);

Two data points that are actually positive were classified incorrectly. Recall the size of the test dataset:

In [None]:
print('Test data', len(y_test))

The model or classifier correctly classified 1888 data points as occupancy = 0 and 526 as occupancy = 1. The other data points were misclassified.
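
To read the matrix, it helps to know the layout `sklearn` uses for a binary problem: rows are the actual classes and columns the predicted classes, so flattening the matrix gives true negatives, false positives, false negatives and true positives in that order. A minimal sketch on eight made-up labels (the arrays and the name `toy_cm` are assumptions for illustration):

```python
import numpy as np
from sklearn import metrics

# Made-up actual and predicted labels.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_hat = np.array([0, 1, 0, 1, 1, 0, 1, 0])

toy_cm = metrics.confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = toy_cm.ravel()  # row-major order: [[tn, fp], [fn, tp]]

print(toy_cm)
print(tn, fp, fn, tp)

# The diagonal holds the correctly classified points, so accuracy is its share of the total.
accuracy = (tn + tp) / toy_cm.sum()
print(accuracy)
```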

### Precision, Recall and F-1 score

Precision is the ratio of true positives (the cell labelled Actual 1, Predicted 1 in the confusion matrix, in which the model indicates the positive class and the data point is in fact positive) to all data points predicted as positive. That is, $$Precision = \dfrac{\text{true positive}}{\text{true positive + false positive}}.$$ This is the ability of the model not to label a negative data point as positive.

The recall rate is given as $$Recall = \dfrac{\text{true positive}}{\text{true positive + false negative}}.$$ This is the ability of the model to find all the positive data points (see reference 2).

The F-1 score is the harmonic mean of the precision and recall: $$F_1 = 2 \cdot \dfrac{\text{precision} \cdot \text{recall}}{\text{precision + recall}}.$$ See [Harmonic mean](https://en.wikipedia.org/wiki/Harmonic_mean).
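
The three measures can be computed by hand from the counts of true positives, false positives and false negatives. The sketch below does this for ten made-up labels and compares the results with the corresponding `sklearn.metrics` functions (the arrays and variable names are assumptions for illustration).

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 4 actual positives, 6 actual negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_hat == 1))  # predicted 1, actually 1
fp = np.sum((y_true == 0) & (y_hat == 1))  # predicted 1, actually 0
fn = np.sum((y_true == 1) & (y_hat == 0))  # predicted 0, actually 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(precision, precision_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
print(f1, f1_score(y_true, y_hat))
```

`classification_report` prints the same quantities for each class in turn, treating that class as the positive one.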

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

### ROC analysis

The receiver operating characteristic curve, or ROC curve, plots the true positive rate (or recall) against the false positive rate (also called fall-out) as the classification threshold is varied. The dashed red line indicates a random baseline model.

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X_train, y_train)
# score with the predicted probability of class 1 rather than the hard 0/1 labels,
# so that the reported area matches the plotted curve
y_score = logreg.predict_proba(X_test)[:, 1]
logit_roc_auc = roc_auc_score(y_test, y_score)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1], 'r--') # random baseline
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
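
As a small illustration of how the curve and its area are obtained from scores, the sketch below uses four made-up data points. The arrays and the names `fpr_demo`, `tpr_demo`, `thr_demo`, `area_from_scores`, `area_from_curve` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold gives one (false positive rate, true positive rate) point on the curve.
fpr_demo, tpr_demo, thr_demo = roc_curve(y_true, y_score)

# The area can be computed directly from the scores or by integrating the curve.
area_from_scores = roc_auc_score(y_true, y_score)
area_from_curve = auc(fpr_demo, tpr_demo)

print(fpr_demo, tpr_demo)
print(area_from_scores, area_from_curve)  # both 0.75 for these values
```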

### References

1. [Building A Logistic Regression in Python, Step by Step](https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8)

2. [Precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall#Recall)

3. [Receiver operating characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)