<h1>Table of contents</h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#practical_plan">Practical and Data Overview</a></li>
        <li><a href="#reading_data">Preparation: importing packages and loading data </a></li>
        <li><a href="#naive_rf">A Naive RF Model</a></li>
        <li><a href="#good_rf">A Tuned RF Model  </a></li>
        <li><a href="#task">Your Task: Implement a Naive and a Tuned KNN Classifier</a></li>
    </ol>
</div>
<br>
<hr>

<h2 id="practical_plan">Practical and Data Overview</h2>

- This practical will examine the effect of hyperparameter tuning on model performance. 
- Models: we will train and evaluate two classification models: Random Forest and K-nearest Neighbours.
- Data: 
    - We will be using a well-known diabetes dataset, which is available from many ML repositories, e.g. UCI
        - UCI link: https://archive.ics.uci.edu/ml/support/diabetes
        - I personally used the GitHub link provided by Jason Brownlee (the author of the Machine Learning Mastery blog):         
            - The Dataset: https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv
            - Dataset Description: https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names
            


<h2 id="reading_data">Importing packages and reading data</h2>

- Data Download Instructions:             
    - Download the dataset and place it in your local working directory, the same location as your python file.
    - Save it with the filename: pima-indians-diabetes.csv

<b>Note: </b> the file contains data only: the column names are provided in a separate file (so that you can understand what each data column means). This makes it easy to load the data directly into a NumPy array (a 2D matrix) using the function loadtxt(), which is available from the numpy library. Hence, the following imports are needed: 


In [1]:
## Note: I'm only importing pprint because I'd like more 'decorative' printing options
## (it is used below to print the parameter dictionary one entry per line)
from pprint import pprint 
from numpy import loadtxt


from sklearn.ensemble import RandomForestClassifier  
from sklearn.neighbors import KNeighborsClassifier

## the hyperparameters will be selected through a cross-validation experiment, hence we need the following packages:
from sklearn.model_selection import GridSearchCV, cross_val_score,  train_test_split

from sklearn.metrics import classification_report

In [2]:
### load the dataset: 
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=',')
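
To see why a header-less file loads straight into a 2D array, here is a minimal sketch that runs `loadtxt` on an in-memory CSV. The two rows are made-up toy values (they are not taken from the dataset), and `io.StringIO` merely stands in for the file on disk.

```python
from io import StringIO

from numpy import loadtxt

# Two made-up rows in the same 9-column layout (8 features + class label).
# These values are illustrative only; they are not rows from the real dataset.
toy_csv = StringIO(
    "1,100,70,20,80,25.0,0.5,30,0\n"
    "2,150,80,30,120,32.5,0.7,45,1\n"
)

toy = loadtxt(toy_csv, delimiter=",")

print(toy.shape)  # (rows, columns)
print(toy.dtype)  # loadtxt returns floats unless told otherwise
```

A header row of text in the file would make this call fail unless it is skipped (for example with the `skiprows` argument of `loadtxt`), which is why keeping the column names in a separate file is convenient here.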


- The .names file shows that the dataset contains the following 9 columns: 
   1. Number of times pregnant
   2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
   3. Diastolic blood pressure (mm Hg)
   4. Triceps skin fold thickness (mm)
   5. 2-Hour serum insulin (mu U/ml)
   6. Body mass index (weight in kg/(height in m)^2)
   7. Diabetes pedigree function
   8. Age (years)
   9. Class variable (0 or 1)

- Column 9 is the class label. 
- Since the .csv file has no labels, and we did not load it into a dataframe, we have to treat it as an array. 
- Arrays (and matrices) are referenced by indices. 
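- dataset[:, 8] returns all rows (designated by :) of the 9th column (please note that Python indices start at 0).
- dataset[:, 0:7] returns all rows of columns 1-7: the end index of a slice is exclusive, so column 8 (Age) is left out. Use dataset[:, 0:8] to select all eight feature columns.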

In [3]:

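## note: 0:7 selects the first seven columns only (the slice end is exclusive); use 0:8 to include Age
X = dataset[:,0:7]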
y = dataset[:,8]

## check the types: both are numpy  arrays
print(type(X))
print(type(y))

X.shape

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


(768, 7)
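
Slice bounds are easy to get wrong because the stop index is exclusive. The small sketch below uses a toy array (not the diabetes data) with the same nine-column layout to show what each slice selects; the variable names are made up for the example.

```python
import numpy as np

# A toy 3x9 array standing in for the dataset (values are just 0..26).
toy = np.arange(27).reshape(3, 9)

first_seven = toy[:, 0:7]  # columns 0..6: the stop index 7 is excluded
all_eight = toy[:, 0:8]    # columns 0..7: all eight feature columns
labels = toy[:, 8]         # the 9th column only (a 1D array)

print(first_seven.shape)  # (3, 7)
print(all_eight.shape)    # (3, 8)
print(labels.shape)       # (3,)
```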

<h2 id="naive_rf">A Naive RF Model</h2>

- We will first implement a 'naive' RF classifier using the default parameters and evaluate the model using the regular hold-out method we have been using so far. Steps: 

    1. Split our data into training and testing samples.
    2. Initialise a RF classifier using all default parameters
    3. Train the RF using the .fit function and the training data+labels
    4. Extract the classifier's predictions on test data (X_test) using the .predict function
    5. Examine the classifier's performance on unseen data (y_test) by comparing with the classifier's predictions (y_pred)

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
naive_classifier = RandomForestClassifier()
naive_classifier.fit(X_train, y_train)  
y_pred= naive_classifier.predict(X_test)  
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.78      0.85      0.82       102
         1.0       0.65      0.54      0.59        52

    accuracy                           0.75       154
   macro avg       0.72      0.70      0.70       154
weighted avg       0.74      0.75      0.74       154



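- The split and the forest are both random (no random_state is set), so the results change from run to run: the run shown above reached 75% accuracy, while other runs I tried ranged from 79% to 81%.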
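
Neither `train_test_split` nor `RandomForestClassifier` is given a seed above, so each run draws a different split and grows a different forest. One way to make a run repeatable is to fix `random_state` in both places. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, so that it runs without the CSV file); the helper `run_once` and the seed value 42 are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 300 rows, 8 features, binary labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)


def run_once(seed):
    """Split, train and score once, with every source of randomness seeded."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))


first = run_once(42)
second = run_once(42)
print(first, second)  # the two numbers match because the seed is the same
```

Fixing the seed makes results comparable between runs, but a single seeded split is still only one sample of the model's performance.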


<h2 id="good_rf">A Tuned RF Model</h2>

- Now let’s tune our hyperparameters using cross-validation.
- Before that, let's examine what the parameters used in the classification were: 
    - You can find out what each parameter means by reading the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html


In [5]:
pprint(naive_classifier.get_params())

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}



The process of finding the right hyperparameters is still somewhat of a dark art. It can be performed using a three-way holdout (as seen in the lecture) or using cross-validation. In this practical, we will use cross-validation to find the optimal hyperparameters. The data used in the hyperparameter-tuning phase is X_train (and y_train), because X_test (and y_test) is reserved for evaluating the best model's performance, in order to obtain a 'final' estimate. 
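
As a rough illustration of what cross-validation on the training portion looks like, the sketch below scores a single hyperparameter setting with `cross_val_score`. It uses synthetic stand-in data from `make_classification` (an assumption, so the block runs on its own); the setting `min_samples_leaf=10` and the 5 folds are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption), split the same way as in the practical.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

candidate = RandomForestClassifier(
    n_estimators=30, min_samples_leaf=10, random_state=0
)

# Each fold is held out once while the model is trained on the remaining folds.
# Only the training portion is used; the test portion stays untouched.
fold_scores = cross_val_score(candidate, X_tr, y_tr, cv=5)

print(fold_scores)         # one accuracy value per fold
print(fold_scores.mean())  # the figure used to compare candidate settings
```

A grid search repeats this scoring for every candidate setting and keeps the one with the best mean score.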


#### Create the 'parameter grid'
 - Set up possible values of parameters to optimize over
 - The 'norm' is to use a dictionary object (see 5.5 of https://docs.python.org/3/tutorial/datastructures.html) to store 'a parameter grid'
 - A parameter grid contains all possible values of the hyperparameters we would like to 'tune'

In [6]:
parameter_grid = {
    'min_samples_leaf': [10, 20, 30],
    'n_estimators': [20, 60, 100],
    'max_features': [3, 5, 8],
}


#### Using cross validation to tune a RF classifier for a given hyperparameter grid. 

- The grid search runs over the Cartesian product of the hyperparameter sets: one 'set of parameters' is created for every possible combination of the min_samples_leaf, n_estimators and max_features values listed in the hyperparameter grid we've created, e.g. 
 
        - min_samples_leaf = 10, n_estimators = 20, max_features = 3
        - min_samples_leaf = 10, n_estimators = 20, max_features = 5
        - min_samples_leaf = 10, n_estimators = 60, max_features = 3
        - min_samples_leaf = 10, n_estimators = 60, max_features = 5
        - etc.

- Note: the code will take a while to return an output, since we are fitting the classifier for every combination of the parameters (this is what a grid search is): 3 x 3 x 3 = 27 combinations, each fitted once per cross-validation fold.



- Create a gridsearch object with the random forest classifier and the parameter candidates obtained from parameter_grid
- We are using 7-fold cross-validation. The choice is arbitrary: 10 folds is a common default, but it is more computationally expensive.
- Documentation of GridSearchCV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
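
To make the cost of the search concrete, the combinations can be enumerated with scikit-learn's `ParameterGrid` helper. The sketch below copies the grid values from this practical into a separate dictionary (named `parameter_grid_demo` only for the example) and counts the fits implied by 7 folds.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid in this practical, copied here so the block is self-contained.
parameter_grid_demo = {
    "min_samples_leaf": [10, 20, 30],
    "n_estimators": [20, 60, 100],
    "max_features": [3, 5, 8],
}

combinations = list(ParameterGrid(parameter_grid_demo))
n_folds = 7

print(len(combinations))            # 3 * 3 * 3 = 27 candidate settings
print(len(combinations) * n_folds)  # 189 model fits during the search

# Peek at a few of the candidate settings.
for params in combinations[:3]:
    print(params)
```

By default `GridSearchCV` also refits the best setting once on the full training data at the end (`refit=True`), so the total is one fit more than the count above.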

In [7]:
classifier_grid = GridSearchCV(estimator=RandomForestClassifier(), param_grid=parameter_grid, cv=7)

In [8]:
# Fit the cross validated grid search on the data 
classifier_grid.fit(X_train, y_train)

print(" The best parameters found are: ")
classifier_grid.best_params_


Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/anaconda3/lib/python3.8/site-packages/sklearn/ensemble/_forest.py", line 386, in fit
    trees = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
  File "/opt/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 1048, in __call__
    if self.dispatch_one_batch(iterator):
  File "/opt/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 866, in dispatch_one_batch
    self._dispatch(tasks)
  File "/opt/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 784, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/opt/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/opt/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backen

 The best parameters found are: 


{'max_features': 3, 'min_samples_leaf': 30, 'n_estimators': 60}

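- The traceback output above comes from individual fits that failed during the search. It is truncated, so the actual error message is not visible; a plausible cause is the candidate max_features = 8, which exceeds the 7 columns in X. The search still finishes and reports the best of the remaining combinations.
- Now that we have found the best parameters, we can evaluate the best model's performance.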

In [9]:
best_model = classifier_grid.best_estimator_
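
A fitted `GridSearchCV` object keeps more than `best_params_`: `best_score_` holds the mean cross-validated score of the winning setting, and `cv_results_` holds the scores of every candidate. The sketch below shows how these might be inspected, using synthetic stand-in data and a deliberately tiny grid (both are assumptions made so that the block runs quickly on its own).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption); in the practical this would be X_train, y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

small_grid = {"min_samples_leaf": [5, 20], "n_estimators": [10, 30]}
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated accuracy of the best setting

# One line per candidate: its parameters and its mean score across the folds.
for params, mean_score in zip(
    search.cv_results_["params"], search.cv_results_["mean_test_score"]
):
    print(params, round(mean_score, 3))
```

Looking at all the candidates, rather than only the winner, shows whether the best setting is clearly ahead or just marginally better than its neighbours.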

##### Evaluating the best model's performance
- Using the test set (X_test and its corresponding labels y_test), we evaluate the performance of the model built with the best hyperparameters (best_model). 

In [10]:
##accuracy_over_runs = cross_val_score(best_model, X_test, y_test, cv=3)
y_pred= best_model.predict(X_test)

In [13]:
#print(accuracy_over_runs)
#accuracy_over_runs.mean()
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.82      0.89      0.85       102
         1.0       0.74      0.62      0.67        52

    accuracy                           0.80       154
   macro avg       0.78      0.75      0.76       154
weighted avg       0.79      0.80      0.79       154



- Depending on the run, tuning gave a modest improvement in accuracy of 1-8 percentage points (from 0.75 to 0.80 in the run shown above).
- But remember: 
   - Depending on the application, this could be a significant benefit :) 
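
A fair comparison scores both models on the same held-out split. The sketch below shows one way to compute the change in percentage points with `accuracy_score`. It uses synthetic stand-in data, and the "tuned" settings are hand-picked for illustration rather than found by a search (both are assumptions), so the printed numbers say nothing about the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), with one fixed split shared by both models.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# "tuned_demo" uses hand-picked values purely for illustration; in the practical
# these would come from the grid search.
naive_demo = RandomForestClassifier(random_state=1).fit(X_tr, y_tr)
tuned_demo = RandomForestClassifier(
    n_estimators=60, min_samples_leaf=30, max_features=3, random_state=1
).fit(X_tr, y_tr)

naive_acc = accuracy_score(y_te, naive_demo.predict(X_te))
tuned_acc = accuracy_score(y_te, tuned_demo.predict(X_te))

# Difference in percentage points; it can be negative on some data.
print(f"naive: {naive_acc:.3f}  tuned: {tuned_acc:.3f}")
print(f"change: {100 * (tuned_acc - naive_acc):+.1f} percentage points")
```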

<h2 id="task">Implement a KNN Version of the Naive and Refined Models</h2>

- Use cross-validation to tune the hyperparameters of a KNN classifier on the same X, y data.  
- Guide:
    - You can view KNN's parameters using the function: my_knn_model.get_params()
    - You can also look up KNN in the scikit-learn API documentation. 
        - From our k-nearest neighbour lecture, we know that K is the most important parameter to specify. 
        - Try a wide range of values for K.
    - The needed library has already been imported for you (from sklearn.neighbors import KNeighborsClassifier), so you can jump right into building/tuning/training/testing the model. 
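
As a starting hint rather than a solution, the sketch below lists the parameters of an untrained `KNeighborsClassifier` created with all defaults. The variable name `my_knn_model` follows the guide above.

```python
from sklearn.neighbors import KNeighborsClassifier

my_knn_model = KNeighborsClassifier()  # all defaults

# Print every parameter and its current value, in alphabetical order.
params = my_knn_model.get_params()
for name in sorted(params):
    print(name, "=", params[name])

# The K from the lecture is called n_neighbors in scikit-learn.
print(params["n_neighbors"])
```

The parameter names printed here are the keys you would use in a parameter grid for the KNN classifier.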

In [14]:
### Your Solution Here ###


What do you think???