## FIT3077 Assignment 2
### Machine Learning bonus mark question
#### Written by Megan Ooi Jie Yi (30101670) & Hew Ye Zea (29035546)

#### Introduction

According to the NHS [1], the United Kingdom's National Health Service, the main risk factors for high cholesterol include age, obesity, high blood pressure and smoking.

This is why we have decided to collect this data for each patient from the FHIR server.

#### Import necessary libraries

In [1]:
import pandas as pd
import numpy as np

#### Read csv file

In [2]:
data = pd.read_csv("patient_data.csv")
data = data.drop(data.columns[[0]], axis=1) # remove first column
data.head() # display first 5 rows for checking

Unnamed: 0,id,Cholesterol,Age,BMI,Diastolic_BP,Systolic_BP,Smoking,Prediabetes,Diabetes,Hypertension,Heart_Disease,Obesity
0,5309,197.06,70,28.63,76.0,126.0,,,,,,active
1,6959,218.62,51,30.54,87.0,123.0,Never smoker,active,active,,,
2,9934,167.99,61,27.02,73.0,123.0,Former smoker,active,active,active,,
3,15886,225.33,59,29.13,84.0,110.0,Former smoker,active,active,active,,
4,18847,196.73,34,23.31,84.0,112.0,Never smoker,,,active,,


In [3]:
data = data.drop_duplicates(['id','Cholesterol']) # remove duplicated data 

In [4]:
data.shape # get number of rows and columns

(3125, 12)
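
As a small illustration of the de-duplication step above, the sketch below runs `drop_duplicates` with a column subset on a made-up frame. The toy values are hypothetical and are not taken from patient_data.csv.

```python
import pandas as pd

# Hypothetical toy data: patient 1 has the same cholesterol reading recorded twice.
toy = pd.DataFrame({
    "id": [1, 1, 2],
    "Cholesterol": [180.0, 180.0, 210.0],
    "Age": [40, 40, 55],
})

# Rows count as duplicates only if both id and Cholesterol match;
# the first occurrence of each pair is kept.
deduped = toy.drop_duplicates(["id", "Cholesterol"])
print(len(toy), "->", len(deduped))
```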

#### Data wrangling

Some patients have missing information. We have to wrangle the data to ensure there are no NaN values.

In [327]:
print("Missing values")
print("BMI: " + str(data.BMI.isnull().values.sum()))
print("Diastolic BP: " + str(data.Diastolic_BP.isnull().values.sum()))
print("Systolic BP: " + str(data.Systolic_BP.isnull().values.sum()))
print("Smoking: " + str(data.Smoking.isnull().values.sum()))
print("Prediabetes: " + str(data.Prediabetes.isnull().values.sum()))
print("Diabetes: " + str(data.Diabetes.isnull().values.sum()))
print("Hypertension: " + str(data.Hypertension.isnull().values.sum()))
print("Heart disease: " + str(data.Heart_Disease.isnull().values.sum()))

Missing values
BMI: 178
Diastolic BP: 156
Systolic BP: 156
Smoking: 137
Prediabetes: 1489
Diabetes: 1927
Hypertension: 1805
Heart disease: 2779
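
The same counts can be obtained for every column at once with `isnull().sum()`, and `isnull().mean()` gives the fraction missing, which may help when deciding whether a field is worth keeping. A minimal sketch on a made-up frame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in one numerical and one categorical column.
toy = pd.DataFrame({
    "BMI": [28.63, np.nan, 27.02, 29.13],
    "Heart_Disease": [None, None, "active", None],
})

missing_counts = toy.isnull().sum()     # number of missing values per column
missing_fraction = toy.isnull().mean()  # share of rows that are missing per column

print(missing_counts)
print(missing_fraction)
```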


The heart disease field is missing for 2779 of the 3125 rows (about 89%), so we remove it from the data set.

In [328]:
data = data.drop(['Heart_Disease'], axis=1)

Use KNN imputation to fill in the missing numerical values. `KNNImputer` returns a numpy array, so we convert the result back to a pandas DataFrame before writing it into `data`.

In [329]:
from sklearn.impute import KNNImputer

# numerical columns to impute; id is only an identifier, so it is left out of the distance calculation
num_cols = ['Cholesterol','Age','BMI','Diastolic_BP','Systolic_BP']

imputer = KNNImputer(n_neighbors=2) # initialise KNN imputer
numerical = imputer.fit_transform(data[num_cols]) # impute numerical data
numerical = pd.DataFrame(numerical, columns=num_cols, index=data.index) # convert np array back to data frame
# assign in place: a merge on these columns cannot match rows whose NaN values were imputed,
# so those rows would lose their categorical fields
data[num_cols] = numerical
data = data.reset_index(drop=True) # renumber rows after drop_duplicates
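
To make the imputation step more concrete, here is a minimal sketch of how `KNNImputer` fills a gap, using a made-up array rather than the patient data. The missing entry is replaced by the mean of that column over the two nearest rows, where distance is measured on the columns both rows have observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up readings: column 0 is fully observed, column 1 has one gap.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [10.0, 20.0],
])

imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(X)

# The two rows nearest to row 1 (judged on column 0) are rows 0 and 2,
# so the gap is filled with the mean of 2.0 and 6.0.
print(filled[1, 1])
```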

The categorical fields smoking, prediabetes, diabetes and hypertension also contain a large number of NaN values. We assume that a NaN value means the patient has no record of the condition, and that a patient with no record does not have it. We therefore fill missing smoking values with 'Never smoker' and missing prediabetes, diabetes and hypertension values with 'none'.

In [330]:
data['Smoking'] = data['Smoking'].fillna(value='Never smoker')

In [331]:
data[['Prediabetes','Diabetes','Hypertension']] = data[['Prediabetes','Diabetes','Hypertension']].fillna(value='none')

#### Convert cholesterol to categorical data

Total cholesterol is a combination of readings that includes triglycerides, LDL and HDL cholesterol [2]. We group the total cholesterol levels into two categories:

| Category | Total cholesterol (mg/dL) |
|:---------|:-------------------------:|
| Low      | < 200                     |
| High     | >= 200                    |

In [332]:
data['CholesterolLvl'] = pd.cut(data.Cholesterol,
                               bins=[0,200,10000],
                               labels=['Low','High'],
                               right=False) # left-closed bins so that exactly 200 counts as High
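
Whether a reading of exactly 200 mg/dL lands in Low or High depends on the `right` argument of `pd.cut`. The sketch below compares the two settings on made-up readings; the variable names are illustrative only.

```python
import pandas as pd

# Made-up readings just below, exactly at, and just above the threshold.
readings = pd.Series([199.9, 200.0, 200.1])

# Default (right=True): bins are (0, 200] and (200, 10000].
default_edges = pd.cut(readings, bins=[0, 200, 10000], labels=["Low", "High"])

# right=False: bins are [0, 200) and [200, 10000).
left_closed = pd.cut(readings, bins=[0, 200, 10000], labels=["Low", "High"], right=False)

print(list(default_edges))  # ['Low', 'Low', 'High']
print(list(left_closed))    # ['Low', 'High', 'High']
```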

#### Use BMI to determine if patient is obese

In [333]:
data['BMILvl'] = pd.cut(data.BMI,
                               bins=[0,25,30,1000],
                               labels=['Non obese','Overweight','Obese'],
                               right=False) # left-closed bins: exactly 25 is Overweight, exactly 30 is Obese

#### Convert patients' ages into different age groups

In [334]:
data['AgeLvl'] = pd.cut(data.Age,
                               bins=[0,18,35,55,200],
                               labels=['Child','Young Adult','Middle Age','Older Age'],
                               include_lowest=True) # keep age 0 in the Child group instead of NaN

#### Convert blood pressure into categorical data

Based on the blood pressure chart [3] from BloodPressureUK.org, the readings (in mmHg) are split into the following groups:

| Category | Diastolic (mmHg) | Systolic (mmHg) |
|:---------|:----------------:|:---------------:|
| Low      | 40-60            | 70-90           |
| Ideal    | 60-80            | 90-120          |
| Pre-high | 80-90            | 120-140         |
| High     | 90-100           | 140-190         |

In the code below, the Ideal and Pre-high groups are combined into a single Normal category.

In [335]:
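bp_labels = ['Low','Normal','High']
systolic_lvl = pd.cut(data.Systolic_BP,
                               bins=[0,90,140,250],
                               labels=bp_labels)

diastolic_lvl = pd.cut(data.Diastolic_BP,
                               bins=[0,60,90,150],
                               labels=bp_labels)

# assigning BPLvl twice would let the diastolic result overwrite the systolic one;
# assumption: each patient is given the higher of the two categories
data['BPLvl'] = pd.Categorical.from_codes(np.maximum(systolic_lvl.cat.codes, diastolic_lvl.cat.codes),
                               categories=bp_labels, ordered=True)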

### Test & train data

With the fields converted to categorical variables, we will train a decision tree on them.

Split the data into dependent and independent variables. The dependent variable is the cholesterol level. The independent variables are age, BMI, blood pressure, smoking, prediabetes, diabetes and hypertension; heart disease was dropped during data wrangling.

In [336]:
data.head()

Unnamed: 0,id,Cholesterol,Age,BMI,Diastolic_BP,Systolic_BP,Smoking,Prediabetes,Diabetes,Hypertension,Obesity,CholesterolLvl,BMILvl,AgeLvl,BPLvl
0,5309,197.06,70,28.63,76.0,126.0,Never smoker,none,none,none,active,Low,Overweight,Older Age,Normal
1,6959,218.62,51,30.54,87.0,123.0,Never smoker,active,active,none,,High,Obese,Middle Age,Normal
2,9934,167.99,61,27.02,73.0,123.0,Former smoker,active,active,active,,Low,Overweight,Older Age,Normal
3,15886,225.33,59,29.13,84.0,110.0,Former smoker,active,active,active,,High,Overweight,Older Age,Normal
4,18847,196.73,34,23.31,84.0,112.0,Never smoker,none,none,active,,Low,Non obese,Young Adult,Normal


In [337]:
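# select columns by name so the split does not depend on column positions
feature_cols = ['Age','BMI','Smoking','Prediabetes','Diabetes','Hypertension','BMILvl','AgeLvl','BPLvl']
x = data[feature_cols].values # variables
y = data['CholesterolLvl'].values # result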

Split the data into train and test sets: 80% for training and 20% for testing.

In [338]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 5)
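
A minimal sketch of what an 80/20 split produces, on made-up data. Note that `stratify=y` is an addition for illustration and is not used in the cell above; it keeps the Low/High proportions the same in both parts, which may be worth considering if the classes are unbalanced.

```python
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with a balanced Low/High label.
x = [[i] for i in range(10)]
y = ['Low'] * 5 + ['High'] * 5

# Same test_size and random_state as above; stratify=y is the illustrative extra.
x_tr, x_te, y_tr, y_te = train_test_split(
    x, y, test_size=0.20, random_state=5, stratify=y)

print(len(x_tr), len(x_te))  # 8 2
print(sorted(y_te))          # ['High', 'Low']
```

Fixing `random_state` makes the split reproducible, so the train and test files can be regenerated with identical contents.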

Write train and test data sets into 2 different csv files for the machine learning task.

In [339]:
import csv

train = []

train.append(['Age', 'BMI','Smoking','Prediabetes','Diabetes','Hypertension','BMILvl','AgeLvl','BPLvl','CholesterolLvl'])
for row in range(len(x_train)):
    train.append(np.append(x_train[row], y_train[row]))

# newline='' avoids blank lines on Windows; the with block closes (and flushes) the file
with open('train.csv','w', newline='') as train_file:
    writer = csv.writer(train_file)
    for row in train:
        writer.writerow(row)

In [340]:
import csv

test = []

test.append(['Age', 'BMI','Smoking','Prediabetes','Diabetes','Hypertension','BMILvl','AgeLvl','BPLvl','CholesterolLvl'])
for row in range(len(x_test)):
    test.append(np.append(x_test[row], y_test[row]))

# newline='' avoids blank lines on Windows; the with block closes (and flushes) the file
with open('test.csv','w', newline='') as test_file:
    writer = csv.writer(test_file)
    for row in test:
        writer.writerow(row)
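
As an alternative to building the rows by hand, the feature and label arrays can be combined in a DataFrame and written with `to_csv`. The sketch below is one possible approach, not the method used above; it uses a shortened, hypothetical column list and two sample rows, and writes to an in-memory buffer instead of a file.

```python
import io

import numpy as np
import pandas as pd

# Shortened column list for illustration; the last column is the label.
columns = ['Age', 'BMILvl', 'CholesterolLvl']
x_part = np.array([[70, 'Overweight'], [51, 'Obese']], dtype=object)
y_part = np.array(['Low', 'High'], dtype=object)

# Features first, then the label as the final column.
frame = pd.DataFrame(x_part, columns=columns[:-1])
frame[columns[-1]] = y_part

# index=False leaves out the row index so only the named columns are written.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file name such as `'train.csv'` instead of the buffer would write the same content to disk.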

The machine learning portion of the code is done in R. Refer to machineLearning.Rmd.

#### References
[1] NHS: https://www.nhsinform.scot/illnesses-and-conditions/blood-and-lymph/high-cholesterol#causes-of-high-cholesterol

[2] Cleveland Clinic: https://my.clevelandclinic.org/health/articles/11920-cholesterol-numbers-what-do-they-mean

[3] BloodPressureUK.org: http://www.bloodpressureuk.org/BloodPressureandyou/Thebasics/Bloodpressurechart