# Introduction to Machine Learning with Python 


## Module 2

### Learning Activity 1: Load the required libraries

In [None]:
import scipy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Extra plotting functionality 
import visplots 

from sklearn import preprocessing, metrics
# The splitting, cross-validation and search tools live in
# sklearn.model_selection (scikit-learn 0.18 and later)
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

%matplotlib inline

print("libraries all imported, ready to go")

### Learning Activity 2: Importing the data

The dataset we will be using throughout this workshop is an adapted version of the wine quality case study, available from the UCI Machine Learning repository at https://archive.ics.uci.edu/ml/datasets/Wine+Quality. The first thing you will need to do in order to work with the wine quality dataset is to read the contents from the provided `wine_quality.csv` data file using the `read_csv` command. You should also try to explore the first few rows of the imported wine DataFrame using the `head` function from the `pandas` package (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html):

In [None]:
# Import the data and explore the first few rows

To feed the data into scikit-learn's classification models, the imported wine quality DataFrame needs to be converted into a `numpy` array. For more information on numpy arrays, see http://scipy-lectures.github.io/intro/numpy/array_object.html. 

In addition, it is good practice to **always** check the dimensionality of the imported data using the `shape` attribute before constructing any classification model, to confirm that all the data has been imported and that it has been imported correctly (e.g. one common mistake is to get the separator wrong and end up with only one column). 

In [None]:
# Convert to numpy array and check the dimensionality
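
The import and conversion steps can be sketched as follows. This is a minimal sketch that uses a small inline CSV in place of `wine_quality.csv` so that it runs on its own; the column names and values are invented for illustration and are not taken from the workshop data file.

```python
import io

import pandas as pd

# Hypothetical stand-in for wine_quality.csv (invented columns and values)
csv_text = """fixed_acidity,alcohol,quality
7.4,9.4,low
7.8,9.8,low
11.2,12.1,high
"""

# Read the CSV with the default comma separator and look at the first rows
wine = pd.read_csv(io.StringIO(csv_text))
print(wine.head())

# Convert the DataFrame to a numpy array and check its dimensionality
npArray = wine.values
print(npArray.shape)

# Reading with the wrong separator collapses each row into a single column
wrong = pd.read_csv(io.StringIO(csv_text), sep=";")
print(wrong.shape)
```

Comparing the two shapes shows why the check is worthwhile: the second read succeeds without any error, yet produces only one column.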

### Learning Activity 3: Inspect your data by indexing and index slicing

To select elements in an array, you specify their indices with square bracket notation. For a two-dimensional array, the first index indicates the row number and the second index indicates the column number. Try selecting the values of the first and second columns of the first sample in the npArray:

In [None]:
# Print the 1st row and 1st column of npArray

In [None]:
# Print the 1st row and 2nd column of npArray

To select ranges of elements, we use "index slicing". Index slicing is the technical name for the syntax A[lower:upper], where lower refers to the lower bound index that is included, and upper refers to the upper bound index that is not included. Try selecting the first three samples (rows):

In [None]:
# Print the first 3 rows of npArray

and also the first three samples (rows) of the last column:

In [None]:
# Print the first 3 rows from the last column of npArray
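
The indexing and slicing rules above can be tried on a toy array first. In this sketch the 4x3 array is an invented stand-in for `npArray`, chosen so that the selected values are easy to read off.

```python
import numpy as np

# Toy 4x3 array standing in for npArray (values are invented)
npArray = np.arange(12).reshape(4, 3)

first_value = npArray[0, 0]       # 1st row, 1st column
second_value = npArray[0, 1]      # 1st row, 2nd column
first_rows = npArray[0:3]         # rows 0, 1 and 2 (upper bound excluded)
last_col_rows = npArray[0:3, -1]  # first three rows of the last column

print(first_value, second_value)
print(first_rows)
print(last_col_rows)
```

The index `-1` counts from the end, so it selects the last column regardless of how many columns the array has.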

### Learning Activity 4: Split the data into input features, X, and outputs, y

Next, we need to split our initial dataset into the data matrix X (independent variables) and the associated class vector y (dependent or target variable). The input features, _X_, are the variables that you use to predict the outcome. In this data set, there are ten input features stored in columns 1-10 (index 0-9, although the upper bound is not included so the range for indexing is 0:10), all of which have continuous values. The output label, _y_, holds the information of whether the wine has been rated as high or low quality, and is stored in the final (eleventh) column (index 10). To split the data, we need to assign the columns of the input features and the column of the output labels to different arrays:

In [None]:
# Split to input matrix X and class vector y

Try printing the size of the input matrix _X_ and class vector _y_ using the `shape` attribute:

In [None]:
# Print the dimensions of X and y
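
A minimal sketch of the split, assuming an array laid out as described above (ten numeric columns followed by one label column). The toy data is invented, and the `astype(float)` call is an assumption: it is only needed when the array holds mixed types and is therefore stored with `object` dtype.

```python
import numpy as np

# Toy array: 5 samples, 10 numeric features and a text label in column 10
rng = np.random.RandomState(0)
npArray = np.empty((5, 11), dtype=object)
npArray[:, 0:10] = rng.rand(5, 10)
npArray[:, 10] = ["high", "low", "low", "high", "low"]

# Columns 0-9 hold the input features, column 10 the class labels
X = npArray[:, 0:10].astype(float)
y = npArray[:, 10]

print(X.shape)
print(y.shape)
```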

## Exploratory Data Analysis

Exploratory data analysis (EDA) is the field dealing with the analysis of data sets as a means of summarising their main characteristics, often using visual methods.


### Learning Activity 5: Plot y frequencies 

An important thing to understand before applying any classification algorithms is how the output labels are distributed. Are they evenly distributed? Imbalances in distribution of labels can often lead to poor classification results for the minority class even if the classification results for the majority class are very good. 

In [None]:
# Print the y frequencies

In our current dataset, the _y_ values are categorical (i.e. they can only take one of a discrete set of values) and have a non-numeric representation, "high" vs. "low". This can be problematic for scikit-learn and plotting functions in Python, since they assume numerical values, so we need to map the text categories to numerical representations using `LabelEncoder`  and the `fit_transform` function from the `preprocessing` module:

In [None]:
# Convert the categorical to numeric values, and print the y frequencies
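
The counting and encoding steps can be sketched on a short invented label vector. `np.unique` with `return_counts=True` is used here as one way of obtaining the frequencies; it is a choice made for this sketch rather than the function the workshop necessarily intends.

```python
import numpy as np
from sklearn import preprocessing

# Invented labels standing in for the wine quality column
y = np.array(["high", "low", "low", "high", "low", "low"])

# Count how often each label occurs
classes, counts = np.unique(y, return_counts=True)
print(dict(zip(classes, counts)))

# Map the text labels to integers; classes are sorted alphabetically,
# so "high" becomes 0 and "low" becomes 1
le = preprocessing.LabelEncoder()
y_encoded = le.fit_transform(y)
print(le.classes_)
print(y_encoded)
```

The fitted encoder keeps the mapping in `classes_`, so `le.inverse_transform` can recover the original text labels later.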

Visualising the data in some way is a good way to get a feel for how the data is distributed. As a simple example, try plotting the frequencies of the class labels (held in _yFreq_), 1 and 0, and see how they are distributed using the function `bar()`:

In [None]:
# Display the y frequencies in a barplot

### Learning Activity 6: Scale the data

It is usually advisable to scale your data prior to fitting a classification model to avoid attributes with
greater numeric ranges dominating those with smaller numeric ranges. Boxplots are a powerful visual aid, commonly used
in order to investigate the differences in ranges of the input features. For example, try and plot the features of the _raw_ matrix _X_ using the script for the boxplots:

In [None]:
# Create a boxplot of the raw data

There are many ways of scaling but one common scaling mechanism is auto-scaling, where for each
column, the values are centred around the mean and divided by their standard deviation. This scaling
mechanism can be applied by calling the `scale()` function in scikit-learn’s `preprocessing` module.

In [None]:
# Auto-scale the data
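
As a quick check of what auto-scaling does, the sketch below scales two invented features with very different ranges and confirms that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.RandomState(0)
# Two invented features on very different scales
X = np.column_stack([rng.normal(50, 10, 200), rng.normal(0.5, 0.1, 200)])

# Centre each column on its mean and divide by its standard deviation
XScaled = preprocessing.scale(X)

print(XScaled.mean(axis=0))  # approximately 0 for every column
print(XScaled.std(axis=0))   # approximately 1 for every column
```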

If we re-run the previous plotting script, we can have a look at the outcome of the boxplot after scaling:

In [None]:
# Create a boxplot of the scaled data

### Learning Activity 7:  Plot pairs of input features X as scatter plots

You can visualise the relationship between two variables (features) using a simple scatter plot. This step can give you a good first indication of the ML model to apply and its complexity (linear vs. non-linear). At this stage, let’s plot the first two variables against each other:

In [None]:
# Create a scatter plot of the first two features

We can also relate associations between features to their y classifications by making the colour of
the points dependent on the corresponding y value:

In [None]:
# Create an enhanced scatter plot of the first two features

### Learning Activity 8: Bonus 1 - Try different combinations of f1 and f2 (in a grid if you can).

Hint: you may want to use nested loops, and the functions `subplot()` and `tight_layout()`

In [None]:
# Create a grid plot of scatterplots using a combination of features

### Learning Activity 9: Bonus 2 -  Try plotting different combinations of three features (f1, f2, f3) in the same plot.

Hint: you may want to use the `Axes3D` class from the `mpl_toolkits.mplot3d` package

In [None]:
# Create a 3D scatterplot using the first three features

### Learning Activity 10: Bonus 3 -  Create a correlation matrix and plot a heatmap of correlations between the input features in X

Often, the different features (variables) in X are not completely independent from each other. For example,
fixed acidity is related to volatile acidity. To quickly identify which features are related and to
what extent, it is useful to see how they are correlated. You can do this by creating a correlation matrix
from X using `corrcoef()` in the `numpy` module:

In [None]:
# Calculate the correlation coefficient
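
One detail worth noting is that `corrcoef()` treats each *row* as a variable by default, whereas in _X_ the features are stored in columns. A minimal sketch on invented data, passing `rowvar=False` so that columns are treated as variables:

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.rand(100)
# Three invented features: the second is almost a multiple of the first,
# the third is unrelated noise
X = np.column_stack([a, 2 * a + 0.01 * rng.rand(100), rng.rand(100)])

# rowvar=False tells corrcoef that each column is a variable
corr = np.corrcoef(X, rowvar=False)
print(corr.shape)
print(corr.round(2))
```

An equivalent call is `np.corrcoef(X.T)`, which transposes the matrix instead of changing the flag.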

To search for linear relationships between features across all pairs of features, you can use a heatmap
of correlations (directly from X), which is simply a matrix of subplots whose colours represent the
sizes of the correlations:

In [None]:
# Create a heatmap of the correlation coefficients

## Module 3

### Learning Activity 11: Split the data into training and test sets

Training and testing a classification model on the same dataset is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data (poor generalisation). To use different datasets for training and testing, we need to split the wine dataset into two disjoint sets: train and test (**Holdout method**) using the `train_test_split` function. <br/> 

In [None]:
# Split into training and test sets

XTrain and yTrain are the two arrays you use to train your model. XTest and yTest are the two arrays that you use to evaluate your model. By default, scikit-learn splits the data so that 25% of it is used for testing, but you can also specify the proportion of data you want to use for training and testing.

<br/>You can check the sizes of the different training and test sets by using the `shape` attribute:

In [None]:
# Print the dimensionality of the individual splits
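
The default and custom split proportions can be compared on a toy dataset. This sketch imports `train_test_split` from `sklearn.model_selection`, its location in scikit-learn 0.18 and later; the data and the `random_state` value are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features and alternating class labels
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# Default split: 25% of the samples are held out for testing
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)
print(XTrain.shape, XTest.shape)

# Custom split: hold out 40% of the samples instead
XTrain2, XTest2, yTrain2, yTest2 = train_test_split(
    X, y, test_size=0.4, random_state=1
)
print(XTrain2.shape, XTest2.shape)
```

Fixing `random_state` makes the split reproducible, which helps when comparing models across runs.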

You can also investigate how the class labels are distributed within the *yTest* vector by using the `itemfreq` function, as before:

In [None]:
# Calculate the frequency of classes in yTest

We can see that 129 random samples of class 0 (high quality) and 243 random samples of class 1 (low quality) are included in the yTest set.


### Learning Activity 12: Apply KNN classification algorithm with scikit-learn

To build KNN models using scikit-learn, you will be using the `KNeighborsClassifier` object, which allows you to set the value of K using the `n_neighbors` parameter (http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). The optimal choice for the value K is highly data-dependent: in general a larger K suppresses the effects of noise, but makes the classification boundaries less distinct. <br/>


For every classification model built with scikit-learn, we will follow four main steps: 1) **Building** the classification model (using either default, pre-defined or optimised parameters), 2) **Training** the model with data, 3) **Testing** the model, and 4) **Performance evaluation** using various metrics. <br/> <br/>

We are going to start by trying two arbitrarily chosen values of K and comparing their performance. Let us start with a small value of K, such as K=3.

In [None]:
# Build a KNN classifier with 3 nearest neighbors

Let us try a larger value of K, for instance K = 99 or another number of your own choice; remember, it is good practice to select an **odd** number for K in a binary classification problem to avoid ties. Can you generate the KNN model and print the overall performance for a larger K (such as K=99), using the previous example as guidance? 

In [None]:
# Build a KNN classifier with 99 nearest neighbors
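
The four steps (build, train, test, evaluate) can be sketched for both values of K in one loop. Synthetic data from `make_classification` is used here in place of the scaled wine features, so the accuracies it prints say nothing about the wine dataset.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for the scaled wine features
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

accuracies = {}
for k in (3, 99):
    knn = KNeighborsClassifier(n_neighbors=k)   # 1) build the model
    knn.fit(XTrain, yTrain)                     # 2) train it
    yPred = knn.predict(XTest)                  # 3) test it
    accuracies[k] = metrics.accuracy_score(yTest, yPred)  # 4) evaluate
    print("K = {}: accuracy = {:.3f}".format(k, accuracies[k]))
```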

### Learning Activity 13: Calculate validation metrics for your classifier

In a classification task, once you have created your predictive model, you will need to evaluate it. Evaluation functions help you to do this by reporting the performance of the model through four main performance metrics: precision, recall and specificity for the different classes, and overall accuracy. To understand these metrics, it is useful to create a _confusion matrix_, which records all the true positive, true negative, false positive and false negative values.

We can compute the confusion matrix for our classifier using the `confusion_matrix` function in the `metrics` module.


In [None]:
# Get the confusion matrix for your classifier using metrics.confusion_matrix


Because performance metrics are such an important step of model evaluation, scikit-learn offers a wrapper around these functions, `metrics.classification_report`, to facilitate their computation. It also offers the function `metrics.accuracy_score` that we tried before to compute the overall accuracy.


In [None]:
# Report the metrics using metrics.classification_report
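
To see how the metrics relate to the confusion matrix, it can help to work through a tiny hand-made example. The label vectors below are invented; class 1 is treated as the positive class when computing specificity, which is an assumption of this sketch.

```python
from sklearn import metrics

# Invented true and predicted labels for eight samples
yTest = [0, 0, 0, 1, 1, 1, 1, 1]
yPred = [0, 0, 1, 1, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(yTest, yPred)
print(cm)

# For a binary problem the flattened matrix is (TN, FP, FN, TP)
tn, fp, fn, tp = cm.ravel()
specificity = tn / float(tn + fp)
accuracy = metrics.accuracy_score(yTest, yPred)
print("Specificity: {:.3f}".format(specificity))
print("Accuracy: {:.3f}".format(accuracy))

# Precision, recall and F1 score for each class
print(metrics.classification_report(yTest, yPred))
```

Here six of the eight predictions are correct, giving an accuracy of 0.75, and two of the three true class-0 samples are recognised, giving a specificity of about 0.667.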

### Learning Activity 14: Plot the decision boundaries for different models

We can visualise the classification boundary created by the KNN classifier using the provided function `visplots.knnDecisionPlot`. For easier visualisation, only the test samples are depicted in the plot. Remember though that the decision boundary has been built using the _training_ data! <br/> 

In [None]:
# Check the arguments of the function

# Visualise the boundaries

**Answer: <br/> For smaller values of K, the decision boundaries present many "creases", and the models may overfit. For larger values of K, the decision boundaries are less distinct and tend towards linearity; these boundaries may be too simple to capture the structure of the data, leading to underfitting.**

### Learning Activity 15 - Bonus: Try different weight configurations

Under some circumstances, it is better to give more importance ("weight" in computing terms) to nearer neighbors. This can be accomplished through the `weights` parameter.  When `weights = 'distance'`, weights are assigned to the training data points in a way that is proportional to the inverse of the distance from the query point. In other words, nearer neighbors contribute more to the fit. <br/>

What if we use weights based on distance? Does it improve the overall performance?

In [None]:
# Build the classifier with two pre-defined parameters (n_neighbors and weights)

# Visualise the boundaries of a KNN model with weights equal to "distance"

## Module 4

### Learning Activity 16: Implement k-fold cross-validation

Let us estimate the accuracy of the classifier on the wine quality dataset by splitting the data 5 consecutive times (the parameter `cv` gives the number of folds the data is split into) using the `cross_val_score` function. For example, try to implement cross-validation for knn3, your KNN model with k=3:


In [None]:
# Implement cross-validation for knn3 
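
A minimal sketch of 5-fold cross-validation, again on synthetic data rather than the wine dataset. `cross_val_score` is imported from `sklearn.model_selection`, its location in scikit-learn 0.18 and later.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for the wine features
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

knn3 = KNeighborsClassifier(n_neighbors=3)

# cv=5 splits the data into 5 folds; each fold is used once for testing
scores = cross_val_score(knn3, X, y, cv=5)
print(scores)
print("Mean accuracy: {:.3f} (+/- {:.3f})".format(scores.mean(), scores.std()))
```

The function returns one score per fold, so reporting the mean together with the spread gives a more stable estimate than a single holdout split.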

### Parameter Tuning

### Learning Activity 17: Grid search on hyperparameters

Rather than trying one-by-one predefined values of K, we can automate this process. The scikit-learn library provides the grid search function `GridSearchCV` (http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html), which allows us to exhaustively search for the optimum combination of parameters by evaluating models trained with a particular algorithm with all provided parameter combinations. Further details and examples on grid search with scikit-learn can be found at http://scikit-learn.org/stable/modules/grid_search.html <br/>

You can use the `GridSearchCV` function with the validation technique of your choice (in this example, 10-fold cross-validation has been applied) to search for a parameterisation of the KNN algorithm that gives a better model:

In [None]:
# Conduct a grid search with 10-fold cross-validation using the dictionary of parameters

Now we can find and print the best parameter set:

In [None]:
# Print the optimal parameters

We can also graphically represent the results of the grid search using a heatmap:

In [None]:
# Visualise the grid search results using a heatmap

When evaluating the resulting model it is important to do it on held-out samples that were not seen during the grid search process (XTest). <Br/>
So, we are testing our independent XTest dataset using the optimised model:

In [None]:
# Build the classifier using the optimal parameters detected by grid search 
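
The whole grid search workflow (search, read off the best parameters, evaluate on held-out data) can be sketched as below. The data is synthetic and the parameter ranges are arbitrary choices made for this sketch, not the values the workshop prescribes.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for the wine features
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1)

# Dictionary of parameters: every combination is evaluated
parameters = {
    "n_neighbors": list(range(1, 31, 2)),
    "weights": ["uniform", "distance"],
}

# Grid search with 10-fold cross-validation on the training data only
gridCV = GridSearchCV(KNeighborsClassifier(), parameters, cv=10)
gridCV.fit(XTrain, yTrain)
print(gridCV.best_params_)

# Rebuild the classifier with the best parameters and test it on XTest
knn = KNeighborsClassifier(**gridCV.best_params_)
knn.fit(XTrain, yTrain)
accuracy = metrics.accuracy_score(yTest, knn.predict(XTest))
print("Test accuracy: {:.3f}".format(accuracy))
```

Because `GridSearchCV` refits the best model on the full training set by default, calling `gridCV.predict(XTest)` directly would give the same predictions as the rebuilt classifier.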

### Learning Activity 18: Randomized search on hyperparameters

Unlike `GridSearchCV`, `RandomizedSearchCV` does not exhaustively try all the parameter settings. Instead, it samples a fixed number of parameter settings based on the distributions you specify (e.g. you might specify that one parameter should be sampled uniformly while another is sampled following a Gaussian distribution). The number of parameter settings that are tried is given by `n_iter`. If all parameters are presented as a list, sampling without replacement is performed. If at least one parameter is given as a distribution, sampling with replacement is used. You should use continuous distributions for continuous parameters. Further details can be found at http://scikit-learn.org/stable/modules/grid_search.html

In [None]:
# Conduct a randomised search on hyperparameters

As with the previous example, we can print out the optimal parameters: 

In [None]:
# Print the optimal n_neighbors detected by randomised search

We can also graphically represent the results of the randomised search using a scatterplot:

In [None]:
# Visualise the randomised search results using a scatterplot

Finally, testing our independent XTest dataset using the optimised model: 

In [None]:
# Build the classifier using the optimal parameters detected by randomised search
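
A minimal sketch of a randomised search over `n_neighbors`, using `scipy.stats.randint` as the sampling distribution. The data is synthetic, and the range, `n_iter` and `cv` values are arbitrary choices for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for the wine features
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# randint(1, 50) samples integers uniformly from 1 to 49 (upper bound excluded)
param_dist = {"n_neighbors": randint(1, 50)}

# Try 20 sampled settings, each scored with 5-fold cross-validation
randomCV = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    random_state=0,
)
randomCV.fit(X, y)

print(randomCV.best_params_)
print("Best cross-validated accuracy: {:.3f}".format(randomCV.best_score_))
```

Only 20 settings are evaluated here, compared with the 49 an exhaustive search over the same range would need, which is the main appeal of the randomised approach when the grid is large.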

## Module 5

### Learning Activity 19:  Random Forests

The random forests model is an *ensemble method* since it aggregates a group of decision trees into an ensemble (http://scikit-learn.org/stable/modules/ensemble.html). Ensemble learning involves the combination of several models to solve a single prediction problem. It works by generating multiple classifiers/models which learn and make predictions independently. Those predictions are then combined into a single (mega) prediction that should be as good as or better than the prediction made by any one classifier. Unlike single decision trees, which are likely to suffer from high variance or high bias (depending on how they are tuned), Random Forests use averaging to find a natural balance between the two extremes. <br/> 

Let us start by building a simple Random Forest model which consists of 100 independently trained decision trees. For further details and examples on how to construct a Random Forest, see http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
# Build a Random Forest classifier with 100 decision trees

### Learning Activity 20: Visualising the RF accuracy

We can also investigate how the overall test accuracy changes as `n_estimators` (the number of decision trees) in our model increases. In order to do so, we can use the provided `rfAvgAcc` function from `visplots`:

In [None]:
# Visualise the average accuracy 

### Learning Activity 21: Feature Importance 

Random forests allow you to compute heuristics for determining how “important” a feature is in predicting a target. One such heuristic measures the change in prediction accuracy if you take a given feature and permute (scramble) it across the datapoints in the training set. The more the accuracy drops when the feature is permuted, the more “important” we can conclude the feature is.

We can use the `feature_importances_` attribute of the RF classifier to obtain the relative importance of each feature, which we can then visualise using a simple bar plot. Note that scikit-learn computes this attribute from the decrease in impurity that each feature brings across the trees, rather than from permutation.

In [None]:
# Display the importance of the features in a barplot
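
Before plotting, it can be useful to print the importances as a ranked list. This sketch trains a forest of 100 trees on synthetic data (with 3 informative features out of 10, an invented setup) and lists the features from most to least important.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 features, of which 3 carry signal
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# One value per feature; the values are normalised to sum to 1
importances = rf.feature_importances_

# Feature indices ordered from most to least important
ranking = importances.argsort()[::-1]
for idx in ranking:
    print("feature {}: {:.3f}".format(idx, importances[idx]))
```

The same `importances` array can be passed straight to a bar plot, with the feature indices or names on the x axis.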

### Learning Activity 22: Boundary visualisation

We can visualise the classification boundary created by the Random Forest using the `visplots.rfDecisionPlot` function. You can check the arguments passed in this function by using the `help` command. For easier visualisation, only the test samples have been included in the plot. And remember that the decision boundary has been built using the _training_ data!

In [None]:
# Check the arguments of the function

# Visualise the boundaries

### Learning Activity 23: Tuning Random Forests

Random forests offer several parameters that can be tuned, such as `n_estimators`, `max_features`, `max_depth` and `min_samples_leaf`. 

In [None]:
# View the list of arguments to be optimised

Create a dictionary of allowed parameter ranges for `n_estimators` and `max_depth` (or include more of the parameters you would like to tune) and conduct a grid search with cross validation using the `GridSearchCV` function as before:

In [None]:
# Conduct a grid search with 10-fold cross-validation using the dictionary of parameters

Now we can find and print the best parameter set:

In [None]:
# Print the optimal parameters

Finally, testing our independent XTest dataset using the optimised model: 

In [None]:
# Build the classifier using the optimal parameters detected by grid search

Bonus: We can also graphically represent the results of the grid search using a heatmap:

In [None]:
# Visualise the grid search results using a heatmap

### Learning Activity 24: Bonus - Parallelisation


The scikit-learn implementation of Random Forests also features the parallel construction of the trees and the parallel computation of the predictions through the `n_jobs` parameter.
If `n_jobs=k` then computations are partitioned into k jobs, and run on k cores of the machine.
If `n_jobs=-1` then all cores available on the machine are used.


In [None]:
# 1. Build a RF classification model using parallelisation
# 2. Try and tune its parameters using parallel processing
# 3. Import the `timeit` module and use the `default_timer` function to calculate the speedup from sequential to parallel processing
# 4. Can you plot the execution times with incremental number of processors?
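
A minimal sketch of the timing comparison, using `default_timer` from `timeit` around the `fit` call. The data is synthetic and deliberately small, so the parallel run may show little or no speedup here; the overhead of starting the jobs can outweigh the gain until the dataset or the forest is large enough.

```python
from timeit import default_timer

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data standing in for the wine features
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

timings = {}
for n_jobs in (1, 2):
    rf = RandomForestClassifier(n_estimators=100, n_jobs=n_jobs, random_state=0)
    start = default_timer()
    rf.fit(X, y)                       # training is the step being timed
    timings[n_jobs] = default_timer() - start
    print("n_jobs = {}: {:.3f} s".format(n_jobs, timings[n_jobs]))

# Ratio of sequential to parallel time (values above 1 mean a speedup)
print("Speedup: {:.2f}x".format(timings[1] / timings[2]))
```

Extending the tuple of `n_jobs` values and plotting `timings` against the number of jobs gives the execution-time plot asked for in step 4.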