## FIT3077 Assignment 2
### Machine Learning bonus mark question
#### Written by Megan Ooi Jie Yi (30101670) & Hew Ye Zea

#### Introduction

According to the NHS [1], the United Kingdom's National Health Service, the main factors associated with high cholesterol include age, obesity, high blood pressure, and triglyceride, LDL and HDL levels.

We therefore collected these measurements for each patient from the FHIR server.

#### Import necessary libraries

In [191]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler 

#### Read csv file

In [235]:
data = pd.read_csv("patient_data.csv")
data = data.drop(data.columns[[0,1]], axis=1) # drop the first two columns (positions 0 and 1)
data.head() # display first 5 rows for checking

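| | id | Cholesterol | Age | BMI | Blood_Pressure | Triglycerides | LD_Lipoprotein | HD_Lipoprotein |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| 0 | 1 | 166.23 | 47 | 27.46 | 86.0 | 113.32 | 99.43 | 63.99 |
| 1 | 3689 | 175.0 | 85 | 27.29 | 87.0 | NaN | NaN | NaN |
| 2 | 5309 | 197.06 | 70 | 28.63 | 76.0 | 170.01 | 147.28 | 46.39 |
| 3 | 6959 | 218.62 | 51 | 30.54 | 87.0 | 175.21 | 149.21 | 44.85 |
| 4 | 9934 | 167.99 | 61 | 27.02 | 73.0 | 166.43 | 129.05 | 55.541 |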


In [236]:
data.shape # get number of rows and columns

(3293, 8)

#### Data wrangling

Some patients have missing information. We have to wrangle the data to ensure there are no NaN values.

In [237]:
data.BMI.isnull().values.sum() # 99 patients have missing BMI values

99

In [238]:
data.Blood_Pressure.isnull().values.sum() # 94 patients have missing blood pressure values

94

In [239]:
data.Triglycerides.isnull().values.sum() # 263 patients have missing triglyceride values

263

In [240]:
data.LD_Lipoprotein.isnull().values.sum() # 265 patients have missing low density lipoprotein values

265

In [241]:
data.HD_Lipoprotein.isnull().values.sum() # 270 patients have missing high density lipoprotein values

270
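The per-column checks above can also be done in a single call. A minimal sketch, assuming a small made-up frame (`toy` and its values are hypothetical, not the patient data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the patient data
toy = pd.DataFrame({
    "BMI": [27.46, np.nan, 28.63],
    "Blood_Pressure": [86.0, 87.0, np.nan],
    "Triglycerides": [113.32, np.nan, np.nan],
})

# Count the missing values in every column at once
missing_counts = toy.isnull().sum()
print(missing_counts)
```

Applied to `data`, the same call would list the missing-value count for each column in one output.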

We use k-nearest neighbours (KNN) imputation to fill in the missing values. `KNNImputer` returns a NumPy array, so we convert the result back to a pandas DataFrame and assign the column names again.

In [242]:
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2)
data = imputer.fit_transform(data)
data = pd.DataFrame(data)
data.columns = ['id','Cholesterol','Age','BMI','Blood_Pressure','Triglycerides','LDL','HDL']
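One possible variation, shown here only as a sketch, is to impute the feature columns alone so that the patient identifier does not take part in the neighbour distance, and to write the result back into a copy of the frame so the column names are kept. The frame `toy` and the list `feature_cols` are hypothetical and the values are made up:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "Age": [47.0, 85.0, 70.0, 51.0],
    "BMI": [27.0, np.nan, 29.0, 31.0],
})

# Impute only the feature columns so the identifier does not affect distances
feature_cols = ["Age", "BMI"]
imputer = KNNImputer(n_neighbors=2)

imputed = toy.copy()
imputed[feature_cols] = imputer.fit_transform(toy[feature_cols])
print(imputed)
```

Whether leaving out `id` (or the target column) changes the imputed values for the real data depends on the data itself; this sketch only shows the mechanics.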

#### Convert cholesterol to categorical data

Total cholesterol is a combined reading that includes triglycerides, LDL and HDL cholesterol [2]. According to the Cleveland Clinic, total cholesterol levels are grouped as follows:

| Category   | Total cholesterol (mg/dL) |
|:-----------|:-------------------------:|
| Low        | < 200                     |
| Borderline | 200-239                   |
| High       | >= 240                    |

In [243]:
data['CholesterolLvl'] = pd.cut(data.Cholesterol,
                               bins=[0,200,240,10000],
                               labels=['Low','Borderline','High'])
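How `pd.cut` treats values that fall exactly on a bin edge depends on its `right` parameter, which defaults to `True` (right-closed intervals). A minimal sketch with made-up values shows where 200 and 240 land under each setting:

```python
import pandas as pd

# Made-up readings chosen to sit on and next to the bin edges
values = pd.Series([199.9, 200.0, 239.9, 240.0])
bins = [0, 200, 240, 10000]
labels = ["Low", "Borderline", "High"]

# Default right=True: intervals are (0, 200], (200, 240], (240, 10000]
right_closed = pd.cut(values, bins=bins, labels=labels)

# right=False: intervals are [0, 200), [200, 240), [240, 10000)
left_closed = pd.cut(values, bins=bins, labels=labels, right=False)

print(list(right_closed))
print(list(left_closed))
```

With the default, a reading of exactly 200 is labelled `Low` and exactly 240 is labelled `Borderline`; with `right=False` they are labelled `Borderline` and `High` respectively.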

Split the data into dependent and independent variables. The dependent variable is the cholesterol level category; the independent variables are age, BMI, blood pressure, triglycerides, LDL and HDL.

In [244]:
x = data.iloc[:,[2,3,4,5,6,7]].values # variables
y = data.iloc[:, 8].values # result

Split the data into training (80%) and test (20%) sets.

In [245]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 5)
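When the classes are imbalanced, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both sets. This is not what the cell above does; it is a sketch of an alternative, using made-up labels (`toy_x` and `toy_y` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 Low, 30 Borderline, 10 High
toy_y = np.array(["Low"] * 60 + ["Borderline"] * 30 + ["High"] * 10)
toy_x = np.arange(len(toy_y)).reshape(-1, 1)  # placeholder feature column

# stratify=toy_y keeps the 60/30/10 proportions in both splits
tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.20, random_state=5, stratify=toy_y
)

classes, counts = np.unique(ty_test, return_counts=True)
test_counts = dict(zip(classes, counts))
print(test_counts)
```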

#### Decision tree
A decision tree is used because it can handle non-linear relationships and variable interactions, and it supports multiple input variables. It also supports categorical outcomes, which in this case are the cholesterol level categories.

In [246]:
# Reference: FIT1043 - Tutorial 7
# fitting decision tree to training data
from sklearn.tree import DecisionTreeClassifier 
classifier = DecisionTreeClassifier(criterion='entropy', random_state=5)
classifier.fit(x_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=5, splitter='best')
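To illustrate the point about non-linear relationships, here is a minimal sketch on made-up data where the label depends on the feature in a non-monotonic way (class 1 at both extremes, class 0 in the middle). The arrays `toy_x` and `toy_y` are hypothetical and unrelated to the patient data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up one-feature data: label is 1 when |x| > 1, otherwise 0
toy_x = np.array([[-2.0], [-1.5], [-0.5], [0.0], [0.5], [1.5], [2.0]])
toy_y = np.array([1, 1, 0, 0, 0, 1, 1])

# Same settings as the classifier above, fitted to the toy data
toy_tree = DecisionTreeClassifier(criterion='entropy', random_state=5)
toy_tree.fit(toy_x, toy_y)

toy_pred = toy_tree.predict(toy_x)
print(toy_pred)
```

No single threshold on `x` separates the two classes, but two successive splits do, which is what the tree learns here.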

In [247]:
# predict results for test data
y_pred = classifier.predict(x_test)

In [248]:
# confusion matrix
from sklearn.metrics import confusion_matrix 
cm = confusion_matrix(y_test, y_pred)
correct_predictions = cm[0][0] + cm[1][1] + cm[2][2]

The overall accuracy of all predictions based on the confusion matrix is 0.907.

In [249]:
total = cm.sum() # total number of test predictions
overall_accuracy = correct_predictions/total

In [251]:
overall_accuracy

0.9074355083459787
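Overall accuracy is the sum of the confusion matrix diagonal divided by the sum of all its entries, and it can be cross-checked against `accuracy_score`. A minimal sketch with made-up labels (`toy_true` and `toy_hat` are hypothetical):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels; 4 of the 6 predictions are correct
toy_true = ["Low", "Low", "High", "Borderline", "High", "Low"]
toy_hat = ["Low", "Borderline", "High", "Borderline", "Low", "Low"]

toy_cm = confusion_matrix(toy_true, toy_hat)

# Diagonal entries are the correct predictions for each class
acc_from_cm = np.trace(toy_cm) / toy_cm.sum()
acc_direct = accuracy_score(toy_true, toy_hat)

print(acc_from_cm, acc_direct)
```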

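Meanwhile, the per-class recall taken from row 2 of the confusion matrix, intended as the recall for high cholesterol levels, is 0.962. Note that `confusion_matrix` sorts the labels when `labels` is not passed, so with string labels the rows are most likely ordered `Borderline`, `High`, `Low`, in which case row 2 describes the `Low` class rather than `High`.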

In [254]:
high_cholesterol_accuracy = cm[2][2]/(cm[2][0]+cm[2][1]+cm[2][2])

In [255]:
high_cholesterol_accuracy

0.9621993127147767
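Per-class figures like this depend on which row of the confusion matrix belongs to which class. A minimal sketch with made-up labels, passing `labels` explicitly so the row order is known and cross-checking against `recall_score` (`class_order`, `toy_true` and `toy_hat` are hypothetical):

```python
from sklearn.metrics import confusion_matrix, recall_score

class_order = ["Low", "Borderline", "High"]

# Made-up labels: 3 true High samples, 2 of them predicted as High
toy_true = ["Low", "High", "High", "Borderline", "High", "Low"]
toy_hat = ["Low", "High", "Low", "Borderline", "High", "Low"]

# Without labels=, rows follow sorted order: Borderline, High, Low
default_cm = confusion_matrix(toy_true, toy_hat)

# With labels=, rows follow class_order, so the High row index is known
ordered_cm = confusion_matrix(toy_true, toy_hat, labels=class_order)
high_idx = class_order.index("High")
high_recall = ordered_cm[high_idx, high_idx] / ordered_cm[high_idx].sum()

# Same quantity computed directly
high_recall_direct = recall_score(
    toy_true, toy_hat, labels=["High"], average=None
)[0]

print(high_recall, high_recall_direct)
```

In the default ordering, row 2 of `default_cm` holds the `Low` samples, which is why fixing the order matters before indexing by position.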

#### References
[1] NHS: https://www.nhsinform.scot/illnesses-and-conditions/blood-and-lymph/high-cholesterol#causes-of-high-cholesterol

[2] Cleveland Clinic: https://my.clevelandclinic.org/health/articles/11920-cholesterol-numbers-what-do-they-mean