# Exploring Classical Machine Learning

Let's load in any libraries we will use in this notebook.

In [None]:
import sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st
import sklearn.model_selection

## Part 0: Loading in the Dataset and Normalising Data

We're going to be using a publicly available dataset -- the 'Maternal Health Risk Data', available from https://www.kaggle.com/datasets/csafrit2/maternal-health-risk-data

From the dataset website: "Data has been collected from different hospitals, community clinics, maternal health cares through the IoT based risk monitoring system.

* Age: Age in years when a woman is pregnant.
* SystolicBP: Upper value of Blood Pressure in mmHg, another significant attribute during pregnancy.
* DiastolicBP: Lower value of Blood Pressure in mmHg, another significant attribute during pregnancy.
* BS: Blood glucose levels is in terms of a molar concentration, mmol/L.
* HeartRate: A normal resting heart rate in beats per minute.
* Risk Level: Predicted Risk Intensity Level during pregnancy considering the previous attributes."

We're going to see if we can predict the Risk Level of a patient -- low risk, medium risk, or high risk -- based on the other variables provided.

Below, I'm going to load in the dataset and do some initial processing. There's nothing for you to change here, but I'll leave comments in case you're interested in what's going on.

In [None]:
all_data = pd.read_csv('Maternal Health Risk Data Set.csv')   #read the file into a pandas data frame

print(all_data.info())   #we can call this command to get some stats on the dataset, including the features we have, the number of data points for each category, and the data type for each category

Above, we can see that there are 1014 data points for a variety of features. 

We're interested in using features 0-5 to help us predict which risk level the patient has -- 'low risk', 'medium risk', or 'high risk'.

**Some important notes:** 
* we are predicting a category -- therefore this is a classification task
* all of our input features are in a numerical form currently -- we can see this from the 'Dtype'. If any were in a categorical form, we would need to convert them first.
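To make the second note concrete, here's a minimal sketch of converting a categorical feature into integers. The `Smoker` column and the tiny data frame are made up for illustration; nothing like it exists in the maternal health dataset.

```python
import pandas as pd

# Hypothetical toy data frame -- 'Smoker' is an invented categorical column
demo = pd.DataFrame({'Age': [25, 31, 40], 'Smoker': ['no', 'yes', 'no']})

# Map each category to an integer code so every feature is numerical
demo['Smoker'] = demo['Smoker'].map({'no': 0, 'yes': 1})

print(demo.dtypes)   # both columns should now report a numerical Dtype
```

For features with more than two unordered categories, one-hot encoding (e.g. `pd.get_dummies`) is often a safer choice than integer codes, because integer codes imply an ordering.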

Below, I'm going to separate the pandas dataframe into our input data, and our ground-truth output data. I'm also going to use a histogram to plot the occurrence of each risk level in the dataset.

Again, there's nothing for you to change here, but you should try to understand what's going on.

In [None]:
#converting both to numpy, as these will be easier to work with following on from here
input_data = all_data[['Age', 'SystolicBP', 'DiastolicBP', 'BS', 'BodyTemp', 'HeartRate']].to_numpy()

gt_output = all_data['RiskLevel'].to_numpy()

plt.hist(gt_output) #plots a histogram - google if you don't know what this is
plt.xlabel('Risk Level') #let's use good etiquette and label our graph axes
plt.ylabel('Count')
plt.show()

Looking at the histogram, some class imbalance is definitely present, but it doesn't appear major. We aren't going to change this at this point, but let's keep it in mind.

Below, I'm preparing the data a little further:
* If I needed to convert any non-numerical, categorical features into an integer, I would do that here. That's not necessary in this dataset.
* I'm normalizing each feature so that all features are on the same scale, between 0 and 1. I'm going to use min-max scaling -- again, Google this if you're not sure what it is.

In [None]:
print('Data sample prior to normalization:')
print(input_data[:3])  

#min-max scaling means (x-xmin)/(xmax-xmin)
min_features = np.min(input_data, axis = 0)
max_features = np.max(input_data, axis = 0)
input_data = (input_data-min_features)/(max_features-min_features)

print('Data sample after normalization:')
print(input_data[:3])
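
If you'd like to check the min-max formula by hand, here's a small worked sketch on a made-up 3x2 array (the numbers are invented, not taken from the dataset). Each column is scaled independently, so the smallest value in a column becomes 0 and the largest becomes 1.

```python
import numpy as np

# Invented toy data: 3 data points, 2 features on very different scales
toy = np.array([[20.0, 100.0],
                [30.0, 120.0],
                [40.0, 160.0]])

col_min = toy.min(axis=0)   # per-column minimum: [20, 100]
col_max = toy.max(axis=0)   # per-column maximum: [40, 160]

# (x - xmin) / (xmax - xmin), broadcast across the rows
scaled = (toy - col_min) / (col_max - col_min)
print(scaled)
```

One thing to watch for: a feature that has the same value for every data point gives xmax - xmin = 0, and the division breaks down. That isn't a problem here, but it's worth checking on other datasets.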

## Part 1: Training, Validation and Test Subsets 

### (a) split the total dataset into a train, val and test subset
There are many different ways to take a dataset, and split it into training, validation and test subsets. I'm going to introduce a method that will work nicely when we move on to the deep learning methods and image data.

We have 1014 data points, and I've identified in the code below that I want to split this data so that we have 50% for the training subset, and 25% each for the validation and test subsets. One important consideration when splitting data -- make sure you are randomly splitting the data, so that we are likely to end up with a more balanced selection and avoid any unintended ordering in the dataset file. For example, imagine if the entries in the dataset were ordered by the patient's age: an unshuffled split would put the youngest patients in one subset and the oldest in another.

If you're really stuck -- try googling this! **If you use code you find online, make sure you don't just copy and paste, but take time to understand how it works.** If you use code from online during the assessment, and it's wrong or you can't explain how it works, you won't get any marks for that category.

Some starting points:
* numpy split function -- https://numpy.org/doc/stable/reference/generated/numpy.split.html, just make sure that you are splitting the input_data and gt_output in the same way.
* sklearn train_test_split function -- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html, just be aware that this only splits the data into 2 portions. You'll need to think carefully about how to end up with train, val and test.
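
As a hint for the second starting point, here's a rough sketch of one way to get three subsets out of a function that only splits into two: split once, then split the leftover portion again. It runs on randomly generated stand-in data (the `X_demo` and `y_demo` arrays are invented), so you'll still need to adapt it to `input_data` and `gt_output` yourself. It's one possible approach, not the only correct one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented stand-in data: 1000 points, 6 features, 3 classes
rng = np.random.default_rng(0)
X_demo = rng.random((1000, 6))
y_demo = rng.choice(['low risk', 'medium risk', 'high risk'], size=1000)

# Stage 1: keep 50% for training; the other 50% is split again below
X_train, X_rest, y_train, y_rest = train_test_split(
    X_demo, y_demo, train_size=0.5, shuffle=True, random_state=0)

# Stage 2: halve the remainder -> 25% validation and 25% test of the total
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, shuffle=True, random_state=0)

print(X_train.shape, X_val.shape, X_test.shape)
```

Passing the inputs and labels to the same call is what keeps each input row paired with its own label. The `random_state` argument just makes the shuffle repeatable.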


In [None]:
train_portion = 0.5
val_portion = 0.25
test_portion = 0.25

##################################################################################################################################################################################################################
################################################     YOUR CODE GOES BELOW     ####################################################################################################################################
##################################################################################################################################################################################################################
# at the end of this block, you should have 6 variables created
#            - input_train = 50% of input_data
#            - gt_train = 50% of gt_output (should be the same 50% as was used for input_train)
#
#            - input_val = 25% of input_data
#            - gt_val = 25% of gt_output (should be the same 25% as was used for input_val)
#
#            - input_test = 25% of input_data
#            - gt_test = 25% of gt_output (should be the same 25% as was used for input_test)


Your total dataset should have an input shape of (1014, 6) and a ground-truth shape of (1014,).
This is because we have 1014 data points, the input has 6 features, and the output is a single label per data point (the risk-level category).

Your train, validation, and test subsets should also have 6 input features and a single output label per data point, but will have different numbers of data points.

In [None]:
print('Total Dataset shape:')
print(f'    Input shape: {input_data.shape}   GT shape: {gt_output.shape}')
print('Train Subset shape:')
print(f'    Input shape: {input_train.shape}   GT shape: {gt_train.shape}')
print('Validation Subset shape:')
print(f'    Input shape: {input_val.shape}   GT shape: {gt_val.shape}')
print('Test Subset shape:')
print(f'    Input shape: {input_test.shape}   GT shape: {gt_test.shape}')


### (b) visualise the class distribution across different dataset splits

If you're happy with your dataset splits, let's check how the class balance looks between these splits.

Use a histogram to visualise the distribution of the risk level classes in the train, validation, and test subset.

If the class balance looks very different from one data subset to another, you may want to go back to where you split your data and fix this.

Since you're using random shuffling (hopefully) and our imbalance wasn't too bad to start with, you can try re-running the cell and seeing if you get a more balanced split.

Otherwise, the sklearn train_test_split function has a stratify argument that may be helpful.
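
To see what the stratify argument does, here's a minimal sketch on invented, deliberately imbalanced labels (600 / 300 / 100). The helper `class_proportions` is my own illustrative function, not part of sklearn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented imbalanced labels: 60% low, 30% medium, 10% high
y_demo = np.array(['low'] * 600 + ['medium'] * 300 + ['high'] * 100)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)   # dummy single feature

# stratify=y_demo asks for the class proportions to be preserved in both halves
X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, train_size=0.5, stratify=y_demo, random_state=0)

def class_proportions(labels):
    """Return {class: fraction of labels} for a 1-D label array."""
    classes, counts = np.unique(labels, return_counts=True)
    return dict(zip(classes, counts / counts.sum()))

props_a = class_proportions(y_a)
props_b = class_proportions(y_b)
print(props_a)
print(props_b)
```

Both halves should show roughly the same 60/30/10 proportions as the full label array. Printing proportions like this is also a quick numerical check to go alongside your histograms.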

In [None]:
##################################################################################################################################################################################################################
################################################     YOUR CODE GOES BELOW     ####################################################################################################################################
##################################################################################################################################################################################################################

#if you need help, look above to where we visualised a histogram for gt_output
#HINT: using the argument density = True in the plt.hist() function might help compare the shape of the distribution, despite having different absolute numbers in each subset


## Part 2: Implementing a K Nearest Neighbour Model

### (a) An Example of a K=1 Nearest Neighbour
Below, I've shown an example of using the sklearn KNeighborsClassifier -- this uses a K Nearest Neighbour approach to classification.
You can read from the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier

Read through the below code, and understand the process that's being followed.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

#initialise the ML model
K = 1   # this is a hyperparameter
knn_model = KNeighborsClassifier(n_neighbors=K)

#fit the ML model to the training data
knn_model.fit(input_train, gt_train)

#test the ML model on the validation data
val_pred = knn_model.predict(input_val)

#Let's use the accuracy performance metric to find how good performance is on the validation data
correct = np.sum(val_pred == gt_val)
total = len(gt_val)
accuracy = correct/total

#Report the results
print(f'KNN with K={K}, Accuracy of {100.*accuracy}%')

### (b) Use the validation dataset to find K

Let's use the validation dataset to find the best value of K! You can adapt the code above to search through a range of K values, store the validation accuracy, and then store the best value of K in a variable called *K_best*.

It's also a good idea to plot the results you get, using something like plt.plot() -- see here: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html

Sometimes, you'll get similar performance with a high value of K and a low value of K -- remember: the lower value is usually the better choice in this case (see Occam's razor)
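
If you're unsure how to structure the search, here's a rough sketch of the pattern on synthetic data made with sklearn's `make_classification` (so the numbers it prints mean nothing for our maternal health problem). The range of 1 to 30 is an arbitrary choice of mine, and the variable names are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class stand-in data (not the maternal health dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0)

k_values = list(range(1, 31))   # candidate K values (arbitrary range)
val_accuracies = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    val_accuracies.append(np.mean(model.predict(X_va) == y_va))

# np.argmax returns the first maximum, so exact ties resolve to the smallest K
k_best_demo = k_values[int(np.argmax(val_accuracies))]
print(f'Best K on the synthetic validation set: {k_best_demo}')
```

Note that argmax only breaks exact ties. If two K values give nearly (but not exactly) the same accuracy, you'll need to look at your plot and make the call yourself.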

In [None]:
##################################################################################################################################################################################################################
################################################     YOUR CODE GOES BELOW     ####################################################################################################################################
##################################################################################################################################################################################################################

#Report the results
print(f'Best value of K was {K_best}')

### (c) Find KNN Performance on the Test Data 

Now that we've used our validation dataset to find the best value of K, let's use this value of K to create a model, and then test it on the test data to see the final 'real-world' performance.

Your turn -- try to implement this process. It'll be very similar to the approach from above -- you're still using a KNeighborsClassifier and fitting it to the training data subset. This time, use the K_best variable and test on the test data to find the accuracy of the model.

I got KNN with K=1, and an accuracy of 74.8% -- what about you?

Some differences are expected (we did use different data due to the random sampling), but your K value and test accuracy should be fairly close to that.

In [None]:
##################################################################################################################################################################################################################
################################################     YOUR CODE GOES BELOW     ####################################################################################################################################
##################################################################################################################################################################################################################


#Report the results
print(f'KNN with K={K_best} on the test data, Accuracy of {100.*accuracy}%')

### (d) Visualise performance with a confusion matrix

Create a Confusion Matrix based on the performance of the KNN model on the test dataset.

Looking at the Confusion Matrix, reflect on the following questions:
1. Is performance consistent across the classes, or is there a clear discrepancy for some classes? If there is, why do you think this might be?
2. Given the potential use of this ML model, are some types of errors worse or more dangerous than others? How does the KNN model perform for these types of errors? (e.g. if a patient is medium risk, is it better or worse for them to be misclassified as low risk or high risk?)

Sklearn has a useful function -- ConfusionMatrixDisplay.from_predictions() -- that creates a confusion matrix if given an array of predicted labels and an array of true labels. Read the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html#sklearn.metrics.ConfusionMatrixDisplay.from_predictions

Note: you may want to use the normalize argument in the above function to allow easy interpretation in the presence of class imbalance.
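
To show what the normalize argument changes, here's a small sketch on ten invented labels. It uses sklearn's `confusion_matrix` function rather than `ConfusionMatrixDisplay` so that it prints numbers instead of drawing a plot; both accept the same `normalize` options ('true', 'pred', 'all').

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ['low', 'medium', 'high']

# Invented, imbalanced ground truth: 6 low, 3 medium, 1 high
y_true_demo = np.array(['low'] * 6 + ['medium'] * 3 + ['high'] * 1)
y_pred_demo = np.array(['low'] * 5 + ['medium'] * 1      # 5 of 6 low correct
                       + ['medium'] * 2 + ['high'] * 1   # 2 of 3 medium correct
                       + ['high'] * 1)                   # 1 of 1 high correct

# Rows are true classes, columns are predicted classes
raw = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

# normalize='true' divides each row by its total, giving per-class rates
row_norm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels,
                            normalize='true')

print(raw)
print(np.round(row_norm, 2))
```

In the raw counts, the 'low' row dominates simply because there are more low-risk examples. After row normalization each row sums to 1, so the diagonal reads directly as the fraction of each true class that was predicted correctly.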

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

##################################################################################################################################################################################################################
################################################     YOUR CODE GOES BELOW     ####################################################################################################################################
##################################################################################################################################################################################################################



## Part 3: Implementing a Random Forest Classifier

### (a) Use the validation dataset to find the best number of estimators
In the cell below, implement the sklearn RandomForestClassifier and find the best value for the number of trees in the forest using the validation dataset. Adding more trees usually offers more robust performance, but it also comes at the cost of longer training and prediction times. You will likely follow these steps:
* Read the sklearn documentation on RandomForestClassifier to see how to implement -- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
* For each number in the num_trees list:
    * initialise a RandomForestClassifier with that number of trees
    * fit the classifier to the training data
    * find accuracy of the classifier on the validation dataset
    * record this accuracy and the number of trees currently being tested
* Create a plot to visualise the number of trees versus the accuracy, and choose the best value for the number of trees.
    * to pick the best number of trees, you'll need to weigh up speed versus accuracy.
    * Remember with Random Forest that the results are not deterministic and can change each time you fit the classifier. You can run this cell a number of times to get a sense of consistency of performance
    * You can also use the time.time() function to get an idea of run-time for each value -- see https://www.tutorialspoint.com/python/time_time.htm

*Note: the cell will take about 1-2 minutes to run through all values in num_trees*
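
Here's a rough sketch of the loop structure described in the steps above, run on synthetic data with only three small tree counts so it finishes quickly. The data, the `tree_counts` list and the variable names are all placeholders of mine. I've also used `time.perf_counter()` instead of `time.time()`; both work for this, but perf_counter is designed for measuring short durations.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class stand-in data (not the maternal health dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0)

tree_counts = [10, 50, 100]   # deliberately small so the sketch runs quickly
results = []                  # (number of trees, validation accuracy, seconds)
for n_trees in tree_counts:
    start = time.perf_counter()
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X_tr, y_tr)
    acc = forest.score(X_va, y_va)   # mean accuracy on the validation split
    elapsed = time.perf_counter() - start
    results.append((n_trees, acc, elapsed))
    print(f'{n_trees:4d} trees: accuracy {acc:.3f}, {elapsed:.2f} s')
```

Fixing `random_state` makes each forest repeatable. Leave it out if you want to see how much the accuracy varies from run to run.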

In [None]:
from sklearn.ensemble import RandomForestClassifier
import time

num_trees = list(range(50, 1500, 50)) #check from 50 to 1450 in increments of 50 (range excludes the stop value)

##################################################################################################################################################################################################################
################################################     YOUR CODE GOES BELOW     ####################################################################################################################################
##################################################################################################################################################################################################################


### (b) Find the performance on the test dataset with your selected best number of estimators

Fill in the best_trees variable with your chosen number of trees. I opted for best_trees = 250, as this seemed to perform pretty consistently when I tested and also didn't have too much computational cost.

Then write the code to (1) create a classifier with best_trees number of trees, (2) fit on the training dataset, (3) test on the test dataset and (4) report accuracy.

My accuracy was ~77.95%, better than the KNN model! Did you get the same?


In [None]:
##################################################################################################################################################################################################################
################################################     YOUR CODE GOES BELOW     ####################################################################################################################################
##################################################################################################################################################################################################################

best_trees = 



### (c) Visualise performance with a confusion matrix

Use the same approach as earlier.

How does performance compare with the KNN classifier? Does it have a similar distribution of errors, or different? 

In [None]:
##################################################################################################################################################################################################################
################################################     YOUR CODE GOES BELOW     ####################################################################################################################################
##################################################################################################################################################################################################################
