# Best Direct Oral Anticoagulant (DOAC) Prediction Based on Patient Data

__Author: Muhammad Ayub Ansari__ <br>
__Email: ayub_ansari@outlook.com__<br>
__LinkedIN: https://www.linkedin.com/in/muhammad-ayub-ansari-74301065/__<br>

<br>
This project aims to predict the most suitable direct oral anticoagulant (DOAC) and its dose for patients with various pre-existing conditions. The side effects of previously used DOACs and the patient's obesity level are also considered when predicting the most suitable and safe DOAC and dose.
<br>
The DOACs currently in use and their doses are shown in the table below.
<br>


| DOAC | Doses |
| :- | :- |
| Apixaban | 2.5 mg, 5 mg, 10 mg |
| Rivaroxaban | 10 mg, 15 mg, 20 mg |
| Edoxaban | 30 mg, 60 mg |
| Dabigatran | 75 mg, 110 mg, 150 mg |

<br>
The dataset was acquired from the NHS. The data is noisy and contains many erroneous and missing values. Moreover, the feature space is large and many features are not relevant to the task at hand. The data will be cleaned in several pre-processing steps, and the final features will be used for model training.

### Importing Libraries

In [57]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import sklearn
from sklearn import preprocessing

### Displaying the version of Libraries used

In [58]:
# Python version
print('Python: {}'.format(sys.version))
# pandas
print('pandas: {}'.format(pd.__version__))
# numpy
print('numpy: {}'.format(np.__version__))
# seaborn
print('seaborn: {}'.format(sns.__version__))
# scikit-learn
print('sklearn: {}'.format(sklearn.__version__))


Python: 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
pandas: 1.5.0
numpy: 1.23.3
seaborn: 0.12.0
sklearn: 1.1.2


# Data = DOAC2022.xlsx
__Reading data from the Excel file__

In [59]:
data = pd.read_excel("DOAC2022.xlsx")

In [60]:
data.head()

Unnamed: 0,MRN,Ethnic Group,PersonSK,EncounterSK,Length Of Stay,EventSk,Age at Given Date,EventType,AC AddedTo EPR,AC Stated PerformedTime,...,Emergency Admissions Upto 36 Months After PerformedTime,Mat Admissions Within 3 Years,Admit Method,Has Comorbidity Recorded on Spell,Has Bleeding Event (or Stroke) on Spell,Has Stroke on Spell,Has Thrombosis on Spell,Treatment_days,AC_END_DATE,AC_START_DATE
0,96,White - British,5859512,39337940,9,832948603,71,Rivaroxaban,2018-10-03 17:31:09,2018-10-03 17:25:00,...,3,0,Accident and Emergency,Yes,No,No,No,1117.0,2021-10-24,3/10/2018
1,96,White - British,5859512,39337940,9,834980349,71,Rivaroxaban,2018-10-04 16:36:44,2018-10-04 16:36:00,...,3,0,Accident and Emergency,Yes,No,No,No,1116.0,2021-10-24,4/10/2018
2,96,White - British,5859512,39337940,9,837113509,71,Rivaroxaban,2018-10-05 17:10:49,2018-10-05 17:10:00,...,3,0,Accident and Emergency,Yes,No,No,No,1115.0,2021-10-24,5/10/2018
3,96,White - British,5859512,39337940,9,838714390,71,Rivaroxaban,2018-10-06 16:58:47,2018-10-06 16:58:00,...,3,0,Accident and Emergency,Yes,No,No,No,1114.0,2021-10-24,6/10/2018
4,96,White - British,5859512,39337940,9,840199866,71,Rivaroxaban,2018-10-07 17:17:19,2018-10-07 17:17:00,...,3,0,Accident and Emergency,Yes,No,No,No,1113.0,2021-10-24,7/10/2018


In [61]:
# As the dataset has too many features, it cannot be fully displayed here. Let's print the columns in data.
print("Shape of data: ", data.shape)
print(data.columns)

Shape of data:  (183552, 50)
Index(['MRN', 'Ethnic Group', 'PersonSK', 'EncounterSK', 'Length Of Stay',
       'EventSk', 'Age at Given Date', 'EventType', 'AC AddedTo EPR',
       'AC Stated PerformedTime', 'AC Stated Performed Date',
       'AC Stated Performed Week', 'AC Stated Performed Month', 'AC Amount',
       'AC untits', 'EGFR Type', 'EGFR Stated Date', 'EGFR Result 60 Or Less',
       'EGFR Result', 'EGFR Units', 'Gender', 'Date Of Death',
       'Deceased in EPR', 'Clinical Display', 'Bleeding Risk',
       'Bleeding Risk Time', 'VTE Risk', 'VTE Risk Time', 'Height',
       'Height Time', 'Weight', 'Weight Time', 'Calculated BMI', 'Actual BMI',
       'Actual BMI Time', 'BMI Type', 'BMI Result', 'BMI Stated Date',
       'BMI Flag', 'Difference in BMIs',
       'Emergency Admissions Upto 36 Months After PerformedTime',
       'Mat Admissions Within 3 Years', 'Admit Method',
       'Has Comorbidity Recorded on Spell',
       'Has Bleeding Event (or Stroke) on Spell', 'Has St

__1 - Dropping irrelevant columns__
<br>
First, let's get rid of all the irrelevant features/columns.

In [62]:
data = data.drop(['MRN', 'PersonSK', 'EncounterSK','EventSk', 'AC AddedTo EPR','AC Stated PerformedTime', 
                  'AC Stated Performed Date','AC Stated Performed Week', 'AC Stated Performed Month', 
                  'AC untits', 'EGFR Type', 'EGFR Stated Date', 'EGFR Result 60 Or Less','EGFR Result', 
                  'EGFR Units',  'Date Of Death','Clinical Display', 'Bleeding Risk Time','VTE Risk Time',
                  'Height Time', 'Weight Time','Actual BMI Time', 'BMI Stated Date','Difference in BMIs',
                  'Mat Admissions Within 3 Years', 'Admit Method','AC_END_DATE','AC_START_DATE'],axis=1)

In [63]:
print("Shape of data: ", data.shape)
print(data.columns)

Shape of data:  (183552, 22)
Index(['Ethnic Group', 'Length Of Stay', 'Age at Given Date', 'EventType',
       'AC Amount', 'Gender', 'Deceased in EPR', 'Bleeding Risk', 'VTE Risk',
       'Height', 'Weight', 'Calculated BMI', 'Actual BMI', 'BMI Type',
       'BMI Result', 'BMI Flag',
       'Emergency Admissions Upto 36 Months After PerformedTime',
       'Has Comorbidity Recorded on Spell',
       'Has Bleeding Event (or Stroke) on Spell', 'Has Stroke on Spell',
       'Has Thrombosis on Spell', 'Treatment_days'],
      dtype='object')


In [64]:
data.head()

Unnamed: 0,Ethnic Group,Length Of Stay,Age at Given Date,EventType,AC Amount,Gender,Deceased in EPR,Bleeding Risk,VTE Risk,Height,...,Actual BMI,BMI Type,BMI Result,BMI Flag,Emergency Admissions Upto 36 Months After PerformedTime,Has Comorbidity Recorded on Spell,Has Bleeding Event (or Stroke) on Spell,Has Stroke on Spell,Has Thrombosis on Spell,Treatment_days
0,White - British,9,71,Rivaroxaban,15.0,Female,No,High,High,154.94,...,,BMI Score Measured,18.5 - 20 kg/m2,No,3,Yes,No,No,No,1117.0
1,White - British,9,71,Rivaroxaban,15.0,Female,No,High,High,158.0,...,20.63,BMI Score Measured,18.5 - 20 kg/m2,No,3,Yes,No,No,No,1116.0
2,White - British,9,71,Rivaroxaban,15.0,Female,No,High,High,158.0,...,20.63,BMI Score Measured,18.5 - 20 kg/m2,No,3,Yes,No,No,No,1115.0
3,White - British,9,71,Rivaroxaban,15.0,Female,No,High,High,158.0,...,20.63,BMI Score Measured,18.5 - 20 kg/m2,No,3,Yes,No,No,No,1114.0
4,White - British,9,71,Rivaroxaban,15.0,Female,No,High,High,158.0,...,20.63,BMI Score Measured,18.5 - 20 kg/m2,No,3,Yes,No,No,No,1113.0


Instead of 50 features, we are left with 22 relevant features. We will drop more features as required, but for the time being we need all of these.
<br>
Let's give our features meaningful names.


__Rename Columns__

In [65]:
data = data.rename(columns={'Ethnic Group': 'Ethnicity', 'Length Of Stay':'LOS', 'Age at Given Date':'Age',
        'Deceased in EPR':'Mortality', 'Bleeding Risk':'Bleeding_Risk', 'VTE Risk':'VTE_Risk',
       'Emergency Admissions Upto 36 Months After PerformedTime':'Emergency_visits',
       'Has Comorbidity Recorded on Spell':'Comorbidity','Has Bleeding Event (or Stroke) on Spell':'Bleeding', 
       'Has Stroke on Spell':'Stroke',
       'Has Thrombosis on Spell':'Thrombosis'})

In [66]:
print(data.columns)

Index(['Ethnicity', 'LOS', 'Age', 'EventType', 'AC Amount', 'Gender',
       'Mortality', 'Bleeding_Risk', 'VTE_Risk', 'Height', 'Weight',
       'Calculated BMI', 'Actual BMI', 'BMI Type', 'BMI Result', 'BMI Flag',
       'Emergency_visits', 'Comorbidity', 'Bleeding', 'Stroke', 'Thrombosis',
       'Treatment_days'],
      dtype='object')


__2 - Replacing "-" with N/A__

In [67]:
data = data.replace("-", "")
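The heading speaks of N/A, whereas replacing "-" with an empty string leaves cells that `isnull()` does not count as missing. A minimal sketch, assuming the intent is to mark "-" cells as missing; the toy frame and its values are hypothetical and stand in for the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real columns come from DOAC2022.xlsx.
toy = pd.DataFrame({"Height": [158.0, "-", 170.2], "Weight": ["-", 61.5, 80.0]})

# Replacing with an empty string keeps the cells non-null.
still_hidden = toy.replace("-", "").isnull().sum().sum()

# Replacing with NaN makes the cells visible to isnull() and dropna().
toy_clean = toy.replace("-", np.nan)
missing_per_column = toy_clean.isnull().sum()

print(still_hidden)
print(missing_per_column)
```

Whether this is appropriate depends on whether "-" really means "not recorded" in the source data, and it would change the missing-value counts reported in the next step.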

__3 - Missing values__

In [68]:
data.isnull( ).sum( )

Ethnicity               0
LOS                   971
Age                     0
EventType               0
AC Amount               4
Gender                  0
Mortality               0
Bleeding_Risk        3919
VTE_Risk             3867
Height              54351
Weight              47908
Calculated BMI      62037
Actual BMI          67282
BMI Type            58537
BMI Result          58537
BMI Flag                0
Emergency_visits        0
Comorbidity             0
Bleeding                0
Stroke                  0
Thrombosis              0
Treatment_days      17544
dtype: int64

>Length of stay and the bleeding and VTE risk features have some missing values. One of the most important features is the patient's BMI. The BMI value determines the patient's obesity level, and the medicine and dose recommendation depends significantly on it. We will try to calculate the correct BMI value, and hence the obesity level, from the height, weight, calculated BMI, actual BMI, BMI type and BMI flag features.
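As a rough illustration of deriving BMI and an obesity flag from height and weight, the sketch below uses the standard formula BMI = weight (kg) / height (m)². It assumes `Height` is recorded in centimetres and `Weight` in kilograms, and it uses the common BMI >= 30 cut-off for obesity; the toy values and the `BMI` and `Obese` column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; units are assumed to be cm and kg.
toy = pd.DataFrame({"Height": [158.0, 170.0], "Weight": [51.5, 95.0]})

# Convert height to metres, then apply BMI = kg / m^2.
height_m = toy["Height"] / 100.0
toy["BMI"] = (toy["Weight"] / height_m**2).round(2)

# Flag obesity with the commonly used BMI >= 30 threshold (an assumption here).
toy["Obese"] = toy["BMI"] >= 30

print(toy)
```

Rows where height or weight is missing would still yield a missing BMI, so this does not remove the need to decide how such rows are handled.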

In [69]:
#Drop rows with missing values
data = data.dropna()
print(data.isnull( ).sum( ))
print(data.info())

Ethnicity           0
LOS                 0
Age                 0
EventType           0
AC Amount           0
Gender              0
Mortality           0
Bleeding_Risk       0
VTE_Risk            0
Height              0
Weight              0
Calculated BMI      0
Actual BMI          0
BMI Type            0
BMI Result          0
BMI Flag            0
Emergency_visits    0
Comorbidity         0
Bleeding            0
Stroke              0
Thrombosis          0
Treatment_days      0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
Int64Index: 84380 entries, 1 to 183550
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Ethnicity         84380 non-null  object 
 1   LOS               84380 non-null  object 
 2   Age               84380 non-null  int64  
 3   EventType         84380 non-null  object 
 4   AC Amount         84380 non-null  float64
 5   Gender            84380 non-null  object 
 6   Mortality   

__4 - Case consistency__

> __Most of the features have the object dtype, so we convert all of them to lower-case strings for case consistency. Later we will convert the numeric features back from object to numeric data types.__

In [70]:
data = data.apply(lambda x: x.astype(str).str.lower())
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 84380 entries, 1 to 183550
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Ethnicity         84380 non-null  object
 1   LOS               84380 non-null  object
 2   Age               84380 non-null  object
 3   EventType         84380 non-null  object
 4   AC Amount         84380 non-null  object
 5   Gender            84380 non-null  object
 6   Mortality         84380 non-null  object
 7   Bleeding_Risk     84380 non-null  object
 8   VTE_Risk          84380 non-null  object
 9   Height            84380 non-null  object
 10  Weight            84380 non-null  object
 11  Calculated BMI    84380 non-null  object
 12  Actual BMI        84380 non-null  object
 13  BMI Type          84380 non-null  object
 14  BMI Result        84380 non-null  object
 15  BMI Flag          84380 non-null  object
 16  Emergency_visits  84380 non-null  object
 17  Comorbidity

In [71]:
data.head()

Unnamed: 0,Ethnicity,LOS,Age,EventType,AC Amount,Gender,Mortality,Bleeding_Risk,VTE_Risk,Height,...,Actual BMI,BMI Type,BMI Result,BMI Flag,Emergency_visits,Comorbidity,Bleeding,Stroke,Thrombosis,Treatment_days
1,white - british,9,71,rivaroxaban,15.0,female,no,high,high,158,...,20.63,bmi score measured,18.5 - 20 kg/m2,no,3,yes,no,no,no,1116.0
2,white - british,9,71,rivaroxaban,15.0,female,no,high,high,158,...,20.63,bmi score measured,18.5 - 20 kg/m2,no,3,yes,no,no,no,1115.0
3,white - british,9,71,rivaroxaban,15.0,female,no,high,high,158,...,20.63,bmi score measured,18.5 - 20 kg/m2,no,3,yes,no,no,no,1114.0
4,white - british,9,71,rivaroxaban,15.0,female,no,high,high,158,...,20.63,bmi score measured,18.5 - 20 kg/m2,no,3,yes,no,no,no,1113.0
5,white - british,9,71,rivaroxaban,15.0,female,no,high,high,158,...,20.63,bmi score measured,18.5 - 20 kg/m2,no,3,yes,no,no,no,1112.0


>__Let's convert the numeric features back to int or float data types.__

In [72]:
data["Age"] = pd.to_numeric(data["Age"])
data["Emergency_visits"] = pd.to_numeric(data["Emergency_visits"])
data["AC Amount"] = pd.to_numeric(data["AC Amount"])
data["Treatment_days"] = pd.to_numeric(data["Treatment_days"])
data["LOS"] = pd.to_numeric(data["LOS"])

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 84380 entries, 1 to 183550
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Ethnicity         84380 non-null  object 
 1   LOS               84380 non-null  int64  
 2   Age               84380 non-null  int64  
 3   EventType         84380 non-null  object 
 4   AC Amount         84380 non-null  float64
 5   Gender            84380 non-null  object 
 6   Mortality         84380 non-null  object 
 7   Bleeding_Risk     84380 non-null  object 
 8   VTE_Risk          84380 non-null  object 
 9   Height            84380 non-null  object 
 10  Weight            84380 non-null  object 
 11  Calculated BMI    84380 non-null  object 
 12  Actual BMI        84380 non-null  object 
 13  BMI Type          84380 non-null  object 
 14  BMI Result        84380 non-null  object 
 15  BMI Flag          84380 non-null  object 
 16  Emergency_visits  84380 non-null  int64

>__Now our data is free of missing values and consistent in terms of case.__
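Casting every column to `str` also turns numbers into text, which is why the numeric columns have to be converted back afterwards. A possible alternative, sketched on a hypothetical toy frame, lowercases only the non-numeric columns so the numeric dtypes never change:

```python
import pandas as pd

# Hypothetical toy frame with one text column and one numeric column.
toy = pd.DataFrame({"Gender": ["Female", "MALE"], "Age": [71, 64]})

# Pick the columns that are not numeric and lowercase only those.
text_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
toy[text_cols] = toy[text_cols].apply(lambda col: col.str.lower())

print(toy.dtypes)
print(toy)
```

This only helps for columns that pandas already reads as numeric; a column stored as text because of stray entries would still need `pd.to_numeric`.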

__5 - Label encoding__
<br>
The categorical features, such as ethnicity and gender, will be encoded as appropriate numeric labels.

In [73]:
data['Ethnicity'].unique()

array(['white - british', 'black or black british - caribbean',
       'white - any other white background', 'other - not stated',
       'white - irish', 'other - any other ethnic group',
       'asian or asian british - pakistani',
       'asian - any other asian background',
       'mixed - white and black caribbean',
       'black or black british - african',
       'asian or asian british - indian',
       'black - any other black background', 'other - chinese',
       'other - not known', 'mixed - white and asian',
       'asian or asian british - bangladeshi',
       'mixed - any other mixed background'], dtype=object)

In [74]:
# Instead of working with 17 different ethnicities, let's group them into three broader groups.
# Group 1: White (white patients from the UK or any other part of the world)
# Group 2: BAME (Black, Asian and mixed backgrounds)
# Group 3: Others (all remaining)
data['Ethnicity'] = data['Ethnicity'].replace(['white - british', 
              'white - any other white background', 
              'white - irish'], 1)
data['Ethnicity'] = data['Ethnicity'].replace(['black or black british - caribbean', 
              'asian or asian british - pakistani', 
              'asian - any other asian background',
              'mixed - white and black caribbean',
              'black or black british - african',
              'asian or asian british - indian',
              'black - any other black background',
              'mixed - white and asian',
              'asian or asian british - bangladeshi',
              'mixed - any other mixed background'], 2)
data['Ethnicity'] = data['Ethnicity'].replace(['other - not stated', 
              'other - any other ethnic group', 
              'other - chinese',
             'other - not known'], 3)

## Convert the data type from Object to numeric
data["Ethnicity"] = pd.to_numeric(data["Ethnicity"])

In [75]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 84380 entries, 1 to 183550
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Ethnicity         84380 non-null  int64  
 1   LOS               84380 non-null  int64  
 2   Age               84380 non-null  int64  
 3   EventType         84380 non-null  object 
 4   AC Amount         84380 non-null  float64
 5   Gender            84380 non-null  object 
 6   Mortality         84380 non-null  object 
 7   Bleeding_Risk     84380 non-null  object 
 8   VTE_Risk          84380 non-null  object 
 9   Height            84380 non-null  object 
 10  Weight            84380 non-null  object 
 11  Calculated BMI    84380 non-null  object 
 12  Actual BMI        84380 non-null  object 
 13  BMI Type          84380 non-null  object 
 14  BMI Result        84380 non-null  object 
 15  BMI Flag          84380 non-null  object 
 16  Emergency_visits  84380 non-null  int64

In [76]:
data['Ethnicity'].unique()

array([1, 2, 3], dtype=int64)

__The Ethnicity feature now has only three classes and has been converted to a numeric feature.__
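The same three-way grouping can be expressed through the part of each label before " - ". This is only a sketch under that assumption: it reproduces the grouping above (white = 1, other = 3, everything else = 2) for the labels listed, and the helper name `ethnicity_group` is hypothetical.

```python
import pandas as pd

def ethnicity_group(label: str) -> int:
    """Map a lower-cased ethnicity label to 1 (White), 2 (BAME) or 3 (Others)."""
    prefix = label.split(" - ")[0]
    if prefix == "white":
        return 1
    if prefix == "other":
        return 3
    # Remaining prefixes: black, asian, mixed and their "... british" variants.
    return 2

labels = pd.Series([
    "white - irish",
    "asian or asian british - indian",
    "mixed - white and asian",
    "other - chinese",
])
codes = labels.map(ethnicity_group)
print(codes.tolist())
```

A prefix rule silently assigns any unseen label to group 2, so an explicit list, as used above, is safer when new categories may appear.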

__EventType__

In [77]:
data['EventType'].unique()

array(['rivaroxaban', 'apixaban', 'dabigatran etexilate', 'dabigatran',
       'edoxaban'], dtype=object)

__The EventType feature represents the medicine. There are four different DOACs, so let's merge the two versions of dabigatran into one. We will then have four unique DOACs in the dataset.__

In [78]:
data['EventType'] = data['EventType'].replace(['dabigatran etexilate', 'dabigatran'], 'dabigatran')
data['EventType'].unique()

array(['rivaroxaban', 'apixaban', 'dabigatran', 'edoxaban'], dtype=object)

__AC Amount__
<br>Since our goal is to predict the DOAC and its correct dose, it is better to merge the two columns into one.

In [79]:
# Combine the EventType (medicine) and AC Amount (dose) columns into one label.
data["AC_Amount"] = data["EventType"] + data["AC Amount"].astype(str)
print(data["AC_Amount"].unique())

# Drop the EventType and AC Amount columns, as they are no longer required.
data = data.drop(['EventType', 'AC Amount'],axis=1)

['rivaroxaban15.0' 'apixaban5.0' 'dabigatran110.0' 'rivaroxaban20.0'
 'apixaban2.5' 'apixaban10.0' 'dabigatran150.0' 'edoxaban60.0'
 'edoxaban30.0' 'rivaroxaban10.0' 'apixaban1.0' 'dabigatran75.0'
 'rivaroxaban2.5']


In [80]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 84380 entries, 1 to 183550
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Ethnicity         84380 non-null  int64  
 1   LOS               84380 non-null  int64  
 2   Age               84380 non-null  int64  
 3   Gender            84380 non-null  object 
 4   Mortality         84380 non-null  object 
 5   Bleeding_Risk     84380 non-null  object 
 6   VTE_Risk          84380 non-null  object 
 7   Height            84380 non-null  object 
 8   Weight            84380 non-null  object 
 9   Calculated BMI    84380 non-null  object 
 10  Actual BMI        84380 non-null  object 
 11  BMI Type          84380 non-null  object 
 12  BMI Result        84380 non-null  object 
 13  BMI Flag          84380 non-null  object 
 14  Emergency_visits  84380 non-null  int64  
 15  Comorbidity       84380 non-null  object 
 16  Bleeding          84380 non-null  objec

__Gender__

In [81]:
# male = 1, female = 0
print("Unique values in gender: ", data['Gender'].unique())
data['Gender'] = data['Gender'].replace(['male','female'], [1,0])

## Convert the data type from Object to numeric
data["Gender"] = pd.to_numeric(data["Gender"])
print(data.info())
print("Unique values in gender: ", data['Gender'].unique())

Unique values in gender:  ['female' 'male']
<class 'pandas.core.frame.DataFrame'>
Int64Index: 84380 entries, 1 to 183550
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Ethnicity         84380 non-null  int64  
 1   LOS               84380 non-null  int64  
 2   Age               84380 non-null  int64  
 3   Gender            84380 non-null  int64  
 4   Mortality         84380 non-null  object 
 5   Bleeding_Risk     84380 non-null  object 
 6   VTE_Risk          84380 non-null  object 
 7   Height            84380 non-null  object 
 8   Weight            84380 non-null  object 
 9   Calculated BMI    84380 non-null  object 
 10  Actual BMI        84380 non-null  object 
 11  BMI Type          84380 non-null  object 
 12  BMI Result        84380 non-null  object 
 13  BMI Flag          84380 non-null  object 
 14  Emergency_visits  84380 non-null  int64  
 15  Comorbidity       84380 non-null  object 


__Mortality__

In [82]:
# yes = 1, no = 0
print("Unique values in Mortality: ", data['Mortality'].unique())
data['Mortality'] = data['Mortality'].replace(['yes','no'], [1,0])

## Convert the data type from Object to numeric
data["Mortality"] = pd.to_numeric(data["Mortality"])
print(data.info())
print("Unique values in Mortality: ", data['Mortality'].unique())

Unique values in Mortality:  ['no' 'yes']
<class 'pandas.core.frame.DataFrame'>
Int64Index: 84380 entries, 1 to 183550
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Ethnicity         84380 non-null  int64  
 1   LOS               84380 non-null  int64  
 2   Age               84380 non-null  int64  
 3   Gender            84380 non-null  int64  
 4   Mortality         84380 non-null  int64  
 5   Bleeding_Risk     84380 non-null  object 
 6   VTE_Risk          84380 non-null  object 
 7   Height            84380 non-null  object 
 8   Weight            84380 non-null  object 
 9   Calculated BMI    84380 non-null  object 
 10  Actual BMI        84380 non-null  object 
 11  BMI Type          84380 non-null  object 
 12  BMI Result        84380 non-null  object 
 13  BMI Flag          84380 non-null  object 
 14  Emergency_visits  84380 non-null  int64  
 15  Comorbidity       84380 non-null  object 
 1

__Bleeding_Risk__

In [84]:
# high = 1, low = 0
print("Unique values in Bleeding_Risk: ", data['Bleeding_Risk'].unique())
data['Bleeding_Risk'] = data['Bleeding_Risk'].replace(['high', 'low'], [1,0])

## Convert the data type from Object to numeric
data["Bleeding_Risk"] = pd.to_numeric(data["Bleeding_Risk"])
#print(data.info())
print("Unique values in Bleeding_Risk: ", data['Bleeding_Risk'].unique())

Unique values in Bleeding_Risk:  ['high' 'low']
Unique values in Bleeding_Risk:  [1 0]


__VTE_Risk__

In [85]:
# high = 1, low = 0
print("Unique values in VTE_Risk: ", data['VTE_Risk'].unique())
data['VTE_Risk'] = data['VTE_Risk'].replace(['high','low'], [1,0])

## Convert the data type from Object to numeric
data["VTE_Risk"] = pd.to_numeric(data["VTE_Risk"])
#print(data.info())
print("Unique values in VTE_Risk: ", data['VTE_Risk'].unique())

Unique values in VTE_Risk:  ['high' 'low']
Unique values in VTE_Risk:  [1 0]


In [88]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 84380 entries, 1 to 183550
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Ethnicity         84380 non-null  int64  
 1   LOS               84380 non-null  int64  
 2   Age               84380 non-null  int64  
 3   Gender            84380 non-null  int64  
 4   Mortality         84380 non-null  int64  
 5   Bleeding_Risk     84380 non-null  int64  
 6   VTE_Risk          84380 non-null  int64  
 7   Height            84380 non-null  object 
 8   Weight            84380 non-null  object 
 9   Calculated BMI    84380 non-null  object 
 10  Actual BMI        84380 non-null  object 
 11  BMI Type          84380 non-null  object 
 12  BMI Result        84380 non-null  object 
 13  BMI Flag          84380 non-null  object 
 14  Emergency_visits  84380 non-null  int64  
 15  Comorbidity       84380 non-null  int64  
 16  Bleeding          84380 non-null  int64

In [87]:
#Comorbidity
print("Unique values in Comorbidity: ", data['Comorbidity'].unique())
data['Comorbidity'] = data['Comorbidity'].replace(['yes','no'], [1,0])
## Convert the data type from Object to numeric
data["Comorbidity"] = pd.to_numeric(data["Comorbidity"])
print("Unique values in Comorbidity: ", data['Comorbidity'].unique())

#Bleeding
print("Unique values in Bleeding: ", data['Bleeding'].unique())
data['Bleeding'] = data['Bleeding'].replace(['yes','no'], [1,0])
## Convert the data type from Object to numeric
data["Bleeding"] = pd.to_numeric(data["Bleeding"])
print("Unique values in Bleeding: ", data['Bleeding'].unique())

#Stroke
print("Unique values in Stroke: ", data['Stroke'].unique())
data['Stroke'] = data['Stroke'].replace(['yes','no'], [1,0])
## Convert the data type from Object to numeric
data["Stroke"] = pd.to_numeric(data["Stroke"])
print("Unique values in Stroke: ", data['Stroke'].unique())

#Thrombosis
print("Unique values in Thrombosis: ", data['Thrombosis'].unique())
data['Thrombosis'] = data['Thrombosis'].replace(['yes','no'], [1,0])
## Convert the data type from Object to numeric
data["Thrombosis"] = pd.to_numeric(data["Thrombosis"])
print("Unique values in Thrombosis: ", data['Thrombosis'].unique())


Unique values in Comorbidity:  ['yes' 'no']
Unique values in Comorbidity:  [1 0]
Unique values in Bleeding:  ['no' 'yes']
Unique values in Bleeding:  [0 1]
Unique values in Stroke:  ['no' 'yes']
Unique values in Stroke:  [0 1]
Unique values in Thrombosis:  ['no' 'yes']
Unique values in Thrombosis:  [0 1]
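The four yes/no columns are encoded with the same steps, so the repetition can be folded into a loop. A minimal sketch on a hypothetical toy frame, assuming every value is exactly "yes" or "no" after lower-casing:

```python
import pandas as pd

# Hypothetical toy frame mirroring the yes/no columns above.
toy = pd.DataFrame({
    "Comorbidity": ["yes", "no"],
    "Bleeding": ["no", "no"],
    "Stroke": ["no", "yes"],
    "Thrombosis": ["no", "yes"],
})

yes_no = {"yes": 1, "no": 0}
for col in ["Comorbidity", "Bleeding", "Stroke", "Thrombosis"]:
    # map() returns NaN for unexpected values, so astype fails loudly on them.
    toy[col] = toy[col].map(yes_no).astype("int64")

print(toy)
```

Unlike `replace`, `map` does not pass unknown values through unchanged, which makes unexpected categories easier to notice.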


### Removing the "Warfarin" drug (an event type) from the data

In [None]:
data = data[data.EventType != "warfarin"]
data.shape

In [None]:
data1 = data

### Replace EGFR Results >90 with 100

In [None]:
data["EGFR Result"].unique()

In [None]:
# Map non-numeric eGFR entries to numbers: ">90" becomes 100, lab comments become 0.
egfr_replacements = {
    ">90": 100,
    "Not applicable": 0,
    "Probably contaminated with Potassium EDTA, suggest rpt.": 0,
    "No sample received": 0,
    "Comment": 0,
    "Not Available": 0,
    "** INSUFFICIENT SPECIMEN **": 0,
    "Grossly haemolysed": 0,
}
# Assign the result back instead of calling replace(..., inplace=True) on a column selection.
data1["EGFR Result"] = pd.to_numeric(data1["EGFR Result"].replace(egfr_replacements))
print(data1["EGFR Result"])

#### Merge dabigatran and dabigatran etexilate from eventtype

In [None]:
data1.loc[data1['EventType']=="dabigatran etexilate", 'EventType'] = "dabigatran"

In [None]:
data1.EventType.unique()

In [None]:
data1.to_csv ("clean_DOAC2022.csv", index = False, header=True, encoding='utf-8')


# Clean DOAC2022 CSV File

In [None]:
df = pd.read_csv("clean_DOAC2022.csv")

In [None]:
df.head()

In [None]:
print("Shape of data: ", df.shape)
print(df.columns)

In [None]:
df.isnull( ).sum( )

In [None]:
df.info()

### Length of Stay
Replace missing values with 0.

In [None]:
df["Length Of Stay"].unique()

In [None]:
# Assign back rather than using inplace=True on a column selection.
df["Length Of Stay"] = df["Length Of Stay"].fillna(0)

# Split the dataset into 4 sub datasets

In [None]:
obese_df = df[df['BMI Flag'] == 'yes']
not_obese_df = df[df['BMI Flag'] == 'no']
normal_kidney_df = df[df['EGFR Result 60 Or Less'] == 'no']
kidney_df = df[df['EGFR Result 60 Or Less'] == 'yes']

In [None]:
obese_df.to_csv ("obese_DOAC2022.csv", index = False, header=True, encoding='utf-8')
not_obese_df.to_csv ("non_obese_DOAC2022.csv", index = False, header=True, encoding='utf-8')
normal_kidney_df.to_csv ("normal_kidney_DOAC2022.csv", index = False, header=True, encoding='utf-8')
kidney_df.to_csv ("kidney_DOAC2022.csv", index = False, header=True, encoding='utf-8')

# 1. Obese Dataset

In [None]:
obese_df = pd.read_csv("obese_DOAC2022.csv")

In [None]:
print("Shape of data: ", obese_df.shape)
print(obese_df.columns)

In [None]:
obese_df.isnull( ).sum( )

In [None]:
obese_df.info()

# 1 - Missing Values

## Handling missing values in EGFR Results

In [None]:
print(obese_df["EGFR Result"].value_counts())

In [None]:
obese_df["EGFR Result"].describe()

#### All the missing values in "EGFR Result" have "no" in the corresponding "EGFR Result 60 Or Less" attribute.
#### It is therefore better to replace the missing values with
#### the average of only those records whose "EGFR Result 60 Or Less" value is "no", i.e. whose EGFR value is greater than 60.

In [None]:
temp_df = obese_df[["EGFR Result", "EGFR Result 60 Or Less"]]
print(temp_df.head())

tt = temp_df[temp_df['EGFR Result 60 Or Less'] == 'no']
print(tt.head())
print(tt.describe())

Replacing missing values in "EGFR Result" with the mean.

In [None]:
EGFR_mean = 82.13
print(EGFR_mean)

In [None]:
# Assign back rather than using inplace=True on a column selection.
obese_df["EGFR Result"] = obese_df["EGFR Result"].fillna(EGFR_mean)
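Rather than typing the mean in by hand, it could be computed from the rows whose "EGFR Result 60 Or Less" value is "no". A sketch on hypothetical toy values (the real mean reported above comes from the obese subset, not from this toy frame):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the third row has a missing eGFR value.
toy = pd.DataFrame({
    "EGFR Result": [80.0, 90.0, np.nan, 45.0],
    "EGFR Result 60 Or Less": ["no", "no", "no", "yes"],
})

# Mean over the rows flagged "no" (eGFR above 60); NaN is skipped by mean().
above_60 = toy["EGFR Result 60 Or Less"] == "no"
egfr_mean = toy.loc[above_60, "EGFR Result"].mean()

# Fill the gaps with that mean and assign the column back.
toy["EGFR Result"] = toy["EGFR Result"].fillna(egfr_mean)
print(egfr_mean)
print(toy)
```

Computing the value keeps the imputation in step with the data if the subset changes.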

## Handling missing values in Actual BMI

Replacing missing values in Actual BMI from calculated BMI column

In [None]:
# Assign back rather than using inplace=True on a column selection.
obese_df['Actual BMI'] = obese_df['Actual BMI'].fillna(obese_df['Calculated BMI'])

## Handling missing values in Bleeding risk and VTE risk
Drop the rows where 'VTE Risk' and 'Bleeding Risk' are both null.

In [None]:
#obese_df = obese_df[obese_df['Bleeding Risk'].notna() && obese_df['VTE Risk'].notna()]
col_lst = ['Bleeding Risk', 'VTE Risk']
obese_df.dropna(axis = 0, subset = col_lst, how = 'all', inplace = True)
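The `how = 'all'` argument is what restricts the drop to rows where both columns are missing. A small sketch on hypothetical toy rows contrasts it with `how = 'any'`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: complete, one value missing, both values missing.
toy = pd.DataFrame({
    "Bleeding Risk": ["high", np.nan, np.nan],
    "VTE Risk": ["low", "high", np.nan],
})
cols = ["Bleeding Risk", "VTE Risk"]

# how="all" drops a row only when every listed column is missing.
kept_all = toy.dropna(subset=cols, how="all")
# how="any" drops a row as soon as one listed column is missing.
kept_any = toy.dropna(subset=cols, how="any")

print(len(kept_all), len(kept_any))
```

Rows with only one of the two risks missing survive this step, so they are still affected by any later unconditional `dropna()`.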

## Handling missing values in AC Amount

In [None]:
obese_df = obese_df[obese_df['AC Amount'].notna()]

## Handling missing values in Bleeding risk, Height, weight and calculated BMI.

In [None]:
obese_df = obese_df.dropna()

In [None]:
obese_df.isnull( ).sum( )

In [None]:
obese_df.info()

In [None]:
obese_df.nunique( )

# 2 - Visualization

## Ethnic Group

In [None]:
kk = obese_df["Ethnic Group"].value_counts()
print(kk)
plt.figure();

kk.plot(kind="pie");

## Age at Given Date - bar

In [None]:
kk = obese_df['Age at Given Date'].value_counts()
#kk = pd.value_counts(obese_df['Age at Given Date'])
print(kk)
plt.figure(figsize=(18, 8), dpi=80);
plt.xlabel('Age at a given date')
plt.ylabel('count')
kk.plot(kind="bar");


## EventType - pie

In [None]:
kk = obese_df["EventType"].value_counts()
print(kk)
plt.figure();

kk.plot(kind="pie");

## AC Amount - bar

In [None]:
kk = obese_df["AC Amount"].value_counts()
print(kk)
plt.figure(figsize=(10, 5), dpi=80);
plt.xlabel('AC Amount')
plt.ylabel('count')
kk.plot(kind="bar");

## Bleeding Risk -pie

In [None]:
kk = obese_df["Bleeding Risk"].value_counts()
print(kk)
plt.figure();

kk.plot(kind="pie");

## VTE Risk  - pie

In [None]:
kk = obese_df["VTE Risk"].value_counts()
print(kk)
plt.figure();

kk.plot(kind="pie");

## Emergency Admissions Upto 36 Months After PerformedTime - bar

In [None]:
kk = obese_df["Emergency Admissions Upto 36 Months After PerformedTime"].value_counts()
print(kk)
plt.figure(figsize=(10, 5), dpi=80);
plt.xlabel('Emergency Admissions Upto 36 Months After PerformedTime')
plt.ylabel('count')
kk.plot(kind="bar");

In [None]:
obese_df.to_csv ("obese.csv", index = False, header=True, encoding='utf-8')

### convert nominal values to numerical
The material attribute is converted as
abs = 0, pla = 1
Infill patterns are converted into
grid = 0, honeycomb = 1

In [None]:
data.material = [0 if each == "abs" else 1 for each in data.material]
data.infill_pattern = [0 if each == "grid" else 1 for each in data.infill_pattern]

### convert numerical values to nominal ---> Numerical Binning
data.elongation = ['small' if each >=0 and each <1  else 'Big' if  each >=1 and each <3   else 'very_Big' for each in data.elongation]

'small' = [0,1], 'big' =[1,3], 'very big' = [3,4]

In [None]:
data['elongation'] = pd.cut(data['elongation'], bins=[0,1,3,4], labels=["small", "big", "very_big"])

In [None]:
X = data[['layer_height', 'wall_thickness', 'infill_density', 'infill_pattern','nozzle_temperature', 'bed_temperature', 'print_speed', 'material','fan_speed']]
y = data[['elongation']]

### Count of each class in class-attribute

In [None]:
data['elongation'].value_counts()

## Data Visualization

### Pair Plots

In [None]:
g = sns.pairplot(data, hue='EventType', markers='+')
g._legend.get_title().set_fontsize(20)
#plt.show()

In [None]:
g = sns.violinplot(y='elongation', x='layer_height', data=data, inner='quartile')
plt.show()
g = sns.violinplot(y='elongation', x='wall_thickness', data=data, inner='quartile')
plt.show()
g = sns.violinplot(y='elongation', x='infill_density', data=data, inner='quartile')
plt.show()
g = sns.violinplot(y='elongation', x='infill_pattern', data=data, inner='quartile')
plt.show()

### Correlation Analysis 

In [None]:
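# Restrict the correlation to numeric columns; newer pandas versions raise an error otherwise.
corr_matrix = data.corr(numeric_only=True)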
plt.figure(figsize=(12, 10)) ### Setting the size of the plot
sns.heatmap(data = corr_matrix,cmap='BrBG', annot=True, linewidths=0.2)

## Data Normalization
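Normalization here refers to rescaling each sample (row) to unit norm, which is what `preprocessing.normalize` does with its default L2 norm. Rescaling each attribute into the range 0 to 1 is a different operation (min-max scaling, available as `preprocessing.MinMaxScaler`).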

In [None]:
normalized_X = preprocessing.normalize(X)
print(normalized_X[0:2])
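Row-wise normalization and column-wise min-max scaling are easy to confuse, so here is a minimal sketch contrasting the two on a tiny made-up matrix (the values are an assumption for illustration only). `preprocessing.normalize` divides each row by its own L2 norm, while `MinMaxScaler` maps each column onto the range 0 to 1.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical feature matrix: two samples (rows), two features (columns).
demo = np.array([[3.0, 4.0],
                 [1.0, 0.0]])

# normalize() works row by row: each sample is divided by its own L2 norm.
row_normalized = preprocessing.normalize(demo)

# MinMaxScaler works column by column: each feature is mapped onto [0, 1].
column_scaled = preprocessing.MinMaxScaler().fit_transform(demo)

print(row_normalized)
print(column_scaled)
```

Which of the two is appropriate depends on the model; distance-based methods usually benefit from per-feature scaling rather than per-row normalization.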

## Data Standardization
Standardization refers to shifting the distribution of each attribute to have a mean of zero and a standard deviation of one (unit variance).

In [None]:
standardized_X = preprocessing.scale(X)
print(standardized_X[0:2])
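To confirm what `preprocessing.scale` does, the following sketch standardizes a small made-up matrix (hypothetical values) and checks that every column ends up with a mean of roughly zero and a standard deviation of roughly one.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical feature matrix with very different column scales.
demo = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 500.0]])

scaled = preprocessing.scale(demo)

# Each column should now have mean ~0 and (population) standard deviation ~1.
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print(col_means)
print(col_stds)
```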

# Feature Engineering

### Looking for any null/missing values in the data set

In [None]:
print(data.isnull().any())

No feature imputation is needed, as the check above reports no missing values.

## Outlier Detection

In [None]:
sns.boxplot(data=standardized_X, orient="v", palette="Set2")

# Classifiers

## 1. Decision Trees

In [None]:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn import metrics
from sklearn.tree import export_graphviz


In [None]:
X = data[['layer_height', 'wall_thickness', 'infill_density', 'infill_pattern','nozzle_temperature', 'bed_temperature', 'print_speed', 'material','fan_speed']]
y = data[['elongation']]
#Normalised Data
normalized_X = preprocessing.normalize(X)
# Standardised Data
standardized_X = preprocessing.scale(X)
#encode categorical data into digits
y = pd.get_dummies(y)
print(y.head())

In [None]:
# Train and evaluate the same decision tree on raw, normalised and standardised features.
for label, features in [("Data", X),
                        ("Normalised data", normalized_X),
                        ("Standardised", standardized_X)]:
    # train test split
    X_train, X_test, y_train, y_test = train_test_split(features, y, test_size=0.3, random_state=1)
    # decision tree construction
    dt = DecisionTreeClassifier(criterion='entropy', random_state=1)
    result_DT = dt.fit(X_train, y_train)
    # prediction
    y_pred = dt.predict(X_test)
    # accuracy
    data_accuracy = metrics.accuracy_score(y_test, y_pred)
    print(label + " Accuracy:", data_accuracy)
# After the loop, dt, result_DT, y_test and y_pred refer to the standardised run.

In [None]:

namez =  ['Small','Big','very_Big']
print(metrics.classification_report(y_test, y_pred, target_names=namez, digits=2,output_dict=False)) 


In [None]:
#confusion matrix
print('Confusion Matrix')
species = np.array(y_test).argmax(axis=1)
predictions = np.array(y_pred).argmax(axis=1)
print(confusion_matrix(species, predictions))

In [None]:
# Variable importance in classifier
print("Variable importance in the classifier.")
# Pair each importance with the feature it belongs to (the columns of X used for training).
pd.DataFrame({'variable': X.columns,
              'importance': result_DT.feature_importances_}).sort_values(by='importance', ascending=False)[:20]

### Tree Visualization 

In [None]:
######## Creating Trees and printing on console. 
from sklearn import tree
import graphviz 
##### setting environment variable of Path for Graphviz
import os
os.environ["PATH"] += os.pathsep + 'C:/ProgramData/Anaconda3/Library/bin/graphviz/'

input_features = ['layer_height', 'wall_thickness', 'infill_density', 'infill_pattern','nozzle_temperature', 'bed_temperature', 'print_speed', 'material','fan_speed']
output_features = 'elongation'
dot_data = tree.export_graphviz(result_DT, out_file=None, 
                      feature_names=input_features,  
                      class_names=output_features,  
                      filled=True, rounded=True,  
                      special_characters=True)  
graph = graphviz.Source(dot_data)  
graph 
######### Creating Tree and storing as PDF
#graph.render("DecisionTree")

#### Hyper Parameter Tuning

In [None]:
###############################################################################
###                Hyper parameter tuning and cross validation              ###
###############################################################################
from scipy.stats import randint
from sklearn.model_selection import GridSearchCV
import warnings
warnings.simplefilter("ignore")

##### 1 Setup the parameters
param_dist = {"max_depth":[1,9],
              "min_samples_leaf": [2,9],
              "criterion": ["gini","entropy"]}

#### 2 instantiate a decision tree
Dtree = DecisionTreeClassifier()

#### 3 instantiate the GridSearchCV object
tree_cv =  GridSearchCV(Dtree, param_dist, n_jobs=None, cv=10, verbose=0)

#### 4 fit to the data
Result = tree_cv.fit(X_train, y_train)
DT_result = Result
#### 5 print the tuned parameters and score
print("Best parameters are: {}".format(tree_cv.best_params_))
print("Best score is {} ".format(tree_cv.best_score_))

## 2. Bootstrap Aggregation (Bagging)

Bootstrap Aggregation (or Bagging for short), is a simple and very powerful ensemble method.

An ensemble method is a technique that combines the predictions from multiple machine learning algorithms together to make more accurate predictions than any individual model.

Bootstrap Aggregation is a general procedure that can be used to reduce the variance of algorithms that have high variance. Decision trees, such as classification and regression trees (CART), are a typical example of a high-variance algorithm.

Decision trees are sensitive to the specific data on which they are trained. If the training data is changed (e.g. a tree is trained on a subset of the training data) the resulting decision tree can be quite different and in turn the predictions can be quite different.

Bagging is the application of the Bootstrap procedure to a high-variance machine learning algorithm, typically decision trees.

Let’s assume we have a sample dataset of 1000 instances (x) and we are using the CART algorithm. Bagging of the CART algorithm would work as follows.

1. Create many (e.g. 100) random sub-samples of our dataset with replacement.
2. Train a CART model on each sample.
3. Given new data, combine the predictions of all models: average them for regression, or take the most frequent class for classification.

For example, if we had 5 bagged decision trees that made the following class predictions for an input sample: blue, blue, red, blue and red, we would take the most frequent class and predict blue.


Source: https://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/
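The three bagging steps described above can be sketched by hand. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`, not the notebook's dataset), the number of trees (25) is arbitrary, and the names `X_demo`, `votes` and `bagged_pred` are hypothetical. In practice `BaggingClassifier` does this work for you.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the notebook's dataset).
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=0)
X_fit, y_fit = X_demo[:800], y_demo[:800]
X_new = X_demo[800:]

rng = np.random.default_rng(0)
n_models = 25  # odd, so a two-class majority vote cannot tie
votes = []
for _ in range(n_models):
    # 1. Draw a bootstrap sample: same size as the training set, with replacement.
    idx = rng.integers(0, len(X_fit), size=len(X_fit))
    # 2. Train one CART-style tree on that sample.
    tree_model = DecisionTreeClassifier(random_state=0).fit(X_fit[idx], y_fit[idx])
    # 3. Collect this tree's predictions for the new data.
    votes.append(tree_model.predict(X_new))

# Majority vote across trees (classes are 0/1, so the mean vote is thresholded at 0.5).
votes = np.array(votes)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
print(bagged_pred[:10])
```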

In [None]:
"""
Source: 
https://machinelearningmastery.com/hyperparameters-for-classification-machine-learning-algorithms/
"""
# example of grid searching key hyperparameters for BaggingClassifier

from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold

X = data[['layer_height', 'wall_thickness', 'infill_density', 'infill_pattern','nozzle_temperature', 'bed_temperature', 'print_speed', 'material','fan_speed']]
y = data[['elongation']]

# define models and parameters
model = BaggingClassifier(random_state=1)
n_estimators = [10, 100,200, 300, 1000]
# define grid search
grid = dict(n_estimators=n_estimators)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X, y)

BA_result = grid_result
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

## 3. Random Forest

Random Forests are an improvement over bagged decision trees.

A problem with decision trees like CART is that they are greedy. They choose which variable to split on using a greedy algorithm that minimizes error. As such, even with Bagging, the decision trees can have a lot of structural similarities and in turn have high correlation in their predictions.

Combining predictions from multiple models in ensembles works better if the predictions from the sub-models are uncorrelated or at best weakly correlated.

Random forest changes the algorithm for the way that the sub-trees are learned so that the resulting predictions from all of the subtrees have less correlation.
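In scikit-learn, the change that decorrelates the sub-trees is controlled by `max_features`: each split considers only a random subset of the features. The sketch below, on synthetic data (an assumption, not the notebook's dataset), sets up bagged trees and a random forest side by side so their cross-validated accuracy can be compared. The scores depend on the data, and this example makes no claim about which one wins.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 16 features, 4 of them informative.
X_demo, y_demo = make_classification(n_samples=500, n_features=16,
                                     n_informative=4, random_state=0)

# Bagged trees: every split may consider all 16 features.
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                           n_estimators=50, random_state=0)

# Random forest: every split considers a random subset (sqrt(16) = 4 features).
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)

bagged_scores = cross_val_score(bagged, X_demo, y_demo, cv=5)
forest_scores = cross_val_score(forest, X_demo, y_demo, cv=5)
print("bagging:", bagged_scores.mean())
print("forest: ", forest_scores.mean())
```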

In [None]:
from sklearn.ensemble import RandomForestClassifier

#encode categorical data into digits
y = pd.get_dummies(y)
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3, random_state=1)
# Random Forests construction 
rf = RandomForestClassifier(criterion='entropy', 
                             n_estimators=1000,
                             min_samples_split=2,
                             min_samples_leaf=1,
                             max_features='sqrt',  # 'auto' is deprecated; 'sqrt' is its equivalent for classifiers
                             oob_score=True,
                             random_state=1,
                             n_jobs=-1,
                             verbose= 0)
result_RF = rf.fit(X_train, y_train)

# prediction
y_pred = rf.predict(X_test)
#accuracy
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
namez =  ['Small','Big','very_Big']
print(metrics.classification_report(y_test, y_pred, target_names=namez, digits=2,output_dict=False)) 

#confusion matrix
species = np.array(y_test).argmax(axis=1)
predictions = np.array(y_pred).argmax(axis=1)
print("Confusion Matrix")
print(confusion_matrix(species, predictions))



### Parameter Importance in Classifier

In [None]:
# Variable importance in classifier
print("Variable importance in the classifier.")
# Pair each importance with the feature it belongs to (the columns of X used for training).
print(pd.DataFrame({'variable': X.columns,
                    'importance': result_RF.feature_importances_}).sort_values(by='importance', ascending=False))


### Hyper-parameter Tuning

In [None]:
###############################################################################
###                Hyper parameter tuning and cross validation              ###
###############################################################################

y = pd.get_dummies(y)
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3, random_state=1)

##### 1 Setup the parameters
param_dist = {"max_depth":[None,9],
              "min_samples_leaf": [1,50],
              "max_features":['sqrt', 'log2'],
              "n_estimators":[10,1000],
              "criterion": ["gini","entropy"]
              }

#### 2 instantiate a random forest
RF_tree = RandomForestClassifier()

#### 3 instantiate the GridSearchCV object
tree_cv =  GridSearchCV(RF_tree, param_dist, n_jobs=-1, cv=5)

#### 4 fit to the data
Result = tree_cv.fit(X_train, y_train)

#### 5 print the tuned parameters and score
print("Best parameters are: {}".format(tree_cv.best_params_))
print("Best score is {} ".format(tree_cv.best_score_))
RF_result = Result

## 4. K-Nearest Neighbours

K-Nearest Neighbors, or KNN for short, is one of the simplest machine learning algorithms and is used in a wide array of institutions. KNN is a non-parametric, lazy learning algorithm. When we say a technique is non-parametric, it means that it does not make any assumptions about the underlying data. In other words, it makes its selection based off of the proximity to other data points regardless of what feature the numerical values represent. Being a lazy learning algorithm implies that there is little to no training phase. Therefore, we can immediately classify new data points as they present themselves.

Pros:

- No assumptions about data
- Simple algorithm: easy to understand
- Can be used for classification and regression

Cons:

- High memory requirement: all of the training data must be present in memory in order to calculate the closest K neighbors
- Sensitive to irrelevant features
- Sensitive to the scale of the data since we’re computing the distance to the closest K points

Source: https://towardsdatascience.com/k-nearest-neighbor-python-2fccc47d2a55
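The sensitivity to scale listed above can be shown with a handful of made-up points (all values are hypothetical). The second feature is measured in the thousands, so it dominates the raw Euclidean distance. After standardizing each feature, the nearest neighbour of the query changes.

```python
import numpy as np

# Hypothetical training points: feature 0 is small-scale, feature 1 is in the thousands.
train = np.array([[1.0, 1000.0],
                  [1.2, 3000.0],
                  [9.0, 1100.0],
                  [9.5, 2900.0]])
query = np.array([1.1, 1900.0])

def nearest_index(points, q):
    """Return the row index of the point closest to q (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))

# Raw distances are driven almost entirely by feature 1.
raw_nearest = nearest_index(train, query)

# Standardize with the training mean and standard deviation, then search again.
mean, std = train.mean(axis=0), train.std(axis=0)
scaled_nearest = nearest_index((train - mean) / std, (query - mean) / std)

print(raw_nearest, scaled_nearest)
```

This is why the grid search below is repeated on standardized and normalized copies of the features.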


In [None]:
"""
Source: 
https://machinelearningmastery.com/hyperparameters-for-classification-machine-learning-algorithms/
"""

# example of grid searching key hyperparameters for KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier


X = data[['layer_height', 'wall_thickness', 'infill_density', 'infill_pattern','nozzle_temperature', 'bed_temperature', 'print_speed', 'material','fan_speed']]
y = data[['elongation']]
# define models and parameters
model = KNeighborsClassifier()
n_neighbors = range(1, 21, 2)
weights = ['uniform', 'distance']
metric = ['euclidean', 'manhattan', 'minkowski']
# define grid search
grid = dict(n_neighbors=n_neighbors,weights=weights,metric=metric)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0, verbose=0)
######################################## Using raw data
grid_result = grid_search.fit(X, y)
KNN_result = grid_result
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
#for mean, stdev, param in zip(means, stds, params):
#    print("%f (%f) with: %r" % (mean, stdev, param))
    
########################################## Using Standardised Data
print("Using Standardised Data")
grid_result = grid_search.fit(standardized_X, y)
KNN_result_standard = grid_result
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
#for mean, stdev, param in zip(means, stds, params):
 #   print("%f (%f) with: %r" % (mean, stdev, param))

########################################## Using Normalised Data
print("Using Normalised Data")
grid_result = grid_search.fit(normalized_X, y)
KNN_result_normal = grid_result
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
#for mean, stdev, param in zip(means, stds, params):
 #   print("%f (%f) with: %r" % (mean, stdev, param))


# Linear Regression
The following are a set of methods intended for regression in which the target value is expected to be a linear combination of the features. In mathematical notation, if $\hat{y}$ is the predicted value:

$\hat{y}(w, x) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_p x_p$

Across the module, we designate the vector $w = (w_1, \dots, w_p)$ as `coef_` and $w_0$ as `intercept_`.
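To make the roles of `coef_` and `intercept_` concrete, here is a minimal sketch that fits `LinearRegression` to noise-free synthetic data generated from known weights (w0 = 2, w1 = 3, w2 = -1). The data and weights are assumptions chosen for illustration. The fitted attributes should recover them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated from y = 2 + 3*x1 - 1*x2.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 2.0 + 3.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

model = LinearRegression().fit(X_demo, y_demo)
print(model.intercept_)  # estimate of w0
print(model.coef_)       # estimates of (w1, w2)
```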


When we talk about Regression, we often end up discussing Linear and Logistic Regression. But, that’s not the end. Do you know there are 7 types of Regressions?

Linear and logistic regression is just the most loved members from the family of regressions.  Last week, I saw a recorded talk at NYC Data Science Academy from Owen Zhang, Chief Product Officer at DataRobot. He said, ‘if you are using regression without regularization, you have to be very special!’. I hope you get what a person of his stature referred to.

I understood it very well and decided to explore regularization techniques in detail.

In this article, I have explained the complex science behind ‘Ridge Regression‘ and ‘Lasso Regression‘ which are the most fundamental regularization techniques used in data science, sadly still not used by many.
Source: https://www.analyticsvidhya.com/blog/2016/01/ridge-lasso-regression-python-complete-tutorial/

## 5. Multiple Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np

data = pd.read_csv("data.csv", sep = ",")

data.material = [0 if each == "abs" else 1 for each in data.material]
data.infill_pattern = [0 if each == "grid" else 1 for each in data.infill_pattern]

X = data[['layer_height', 'wall_thickness', 'infill_density', 'infill_pattern','nozzle_temperature', 'bed_temperature', 'print_speed', 'material','fan_speed']]
y = data[['elongation']]
X.head()
#y.head()

In [None]:
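lnr_reg = LinearRegression()
# scikit-learn returns the negated MSE (higher is better), so flip the sign to report the actual MSE.
MSEs = -cross_val_score(lnr_reg, X, y, scoring='neg_mean_squared_error', cv=50)
print(MSEs)
print("Mean Squared Error", np.mean(MSEs))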

## 6. Ridge Regression

In [None]:
from sklearn.linear_model import Ridge

ridge_reg = Ridge()
# candidate regularization strengths (duplicates removed, sorted ascending)
params = {'alpha': [1e-15, 1e-10, 1e-8, 1e-5, 1e-3, 1e-2, 1e-1, 1, 2, 4, 5]}
grid_result = GridSearchCV(cv=50, scoring='neg_mean_squared_error', estimator=ridge_reg, param_grid=params)
grid_result = grid_result.fit(X, y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Ridge_result = grid_result

## 7. Lasso Regression

In [None]:
from sklearn.linear_model import Lasso

# material and infill_pattern were already encoded as 0/1 in section 5; repeating the
# encoding here would turn every value into 1, so it is intentionally not re-run.


X = data[['layer_height', 'wall_thickness', 'infill_density', 'infill_pattern','nozzle_temperature', 'bed_temperature', 'print_speed', 'material','fan_speed']]
y = data[['elongation']]
normalized_X = preprocessing.normalize(X)
standardized_X = preprocessing.scale(X)
lasso_reg = Lasso()
params = {'alpha': [1e-15, 1e-10, 1e-8, 1e-5, 1e-3, 1e-2, 1e-1, 1, 2, 4, 5]}
# Build a fresh grid search for Lasso (the misspelled name previously left the Ridge search in use).
grid_result = GridSearchCV(cv=50, scoring='neg_mean_squared_error', estimator=lasso_reg, param_grid=params)
########################################## Using raw data
grid_result = grid_result.fit(X, y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
LASSO_result = grid_result


## 8. Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

data['elongation'] = pd.cut(data['elongation'], bins=[0,1,3,4], labels=["small", "big", "very_big"])
X = data[['layer_height', 'wall_thickness', 'infill_density', 'infill_pattern','nozzle_temperature', 'bed_temperature', 'print_speed', 'material','fan_speed']]
y = data[['elongation']]

# define models and parameters
model = LogisticRegression()
solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['l2']
c_values = [100, 10, 1.0, 0.1, 0.01]
# define grid search
grid = dict(solver=solvers,penalty=penalty,C=c_values)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0, verbose=0)


grid_result = grid_search.fit(X, y)
LR_result = grid_result
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
    
########################################## Using Standardised Data
print("Using Standardised Data")
grid_result = grid_search.fit(standardized_X, y)
LR_result_standard = grid_result
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

########################################## Using Normalised Data
print("Using Normalised Data")
grid_result = grid_search.fit(normalized_X, y)
LR_result_normal = grid_result
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

## 9. Gradient Boosting Classifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

data = pd.read_csv("data.csv", sep = ",")
data.material = [0 if each == "abs" else 1 for each in data.material]
data.infill_pattern = [0 if each == "grid" else 1 for each in data.infill_pattern]
data['elongation'] = pd.cut(data['elongation'], bins=[0,1,3,4], labels=["small", "big", "very_big"])

X = data[['layer_height', 'wall_thickness', 'infill_density', 'infill_pattern','nozzle_temperature', 'bed_temperature', 'print_speed', 'material','fan_speed']]
y = data[['elongation']]

# define models and parameters
model = GradientBoostingClassifier()
n_estimators = [10, 100, 1000]
learning_rate = [0.001, 0.01, 0.1]
subsample = [0.5, 0.7, 1.0]
max_depth = [3, 7, 9]
# define grid search
grid = dict(learning_rate=learning_rate, n_estimators=n_estimators, subsample=subsample, max_depth=max_depth)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)


grid_result = grid_search.fit(X, y)
GBC_result = grid_result
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

## 10. Support Vector Machine

In [None]:
from sklearn.svm import SVC

data = pd.read_csv("data.csv", sep = ",")
data.material = [0 if each == "abs" else 1 for each in data.material]
data.infill_pattern = [0 if each == "grid" else 1 for each in data.infill_pattern]
data['elongation'] = pd.cut(data['elongation'], bins=[0,1,3,4], labels=["small", "big", "very_big"])

X = data[['layer_height', 'wall_thickness', 'infill_density', 'infill_pattern','nozzle_temperature', 'bed_temperature', 'print_speed', 'material','fan_speed']]
y = data[['elongation']]

# define model and parameters
model = SVC()
kernel = ['poly', 'rbf', 'sigmoid']
C = [50, 10, 1.0, 0.1, 0.01]
gamma = ['scale']
# define grid search
grid = dict(kernel=kernel,C=C,gamma=gamma)
cv = RepeatedStratifiedKFold(n_splits=20, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0, verbose=1)

grid_result = grid_search.fit(X, y)
SVM_result = grid_result
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

########################################## Using Standardised Data
print("Using Standardised Data")
grid_result = grid_search.fit(standardized_X, y)
SVM_result_standard = grid_result
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
########################################## Using Normalised Data
print("Using Normalised Data")
grid_result = grid_search.fit(normalized_X, y)
SVM_result_normal = grid_result
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

# Result Comparisons

In [None]:
# Compare Algorithms
import pandas
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# load dataset
# Reading data from csv file
data = pd.read_csv("data.csv", sep = ",")
# Looking for any null/missing values in the data set
print("Looking for any null/missing values in the data set")
print(data.isnull().any())
#convert nominal values to numerical
data.material = [0 if each == "abs" else 1 for each in data.material]                  # abs = 0, pla = 1
data.infill_pattern = [0 if each == "grid" else 1 for each in data.infill_pattern]     # grid = 0, honeycomb = 1
#convert numerical values to nominal ---> Numerical Binning
data['elongation'] = pd.cut(data['elongation'], bins=[0,1,3,4], labels=["small", "big", "very_big"])

X = data[['layer_height', 'wall_thickness', 'infill_density', 'infill_pattern','nozzle_temperature', 'bed_temperature', 'print_speed', 'material','fan_speed']]
Y = data[['elongation']]

# prepare configuration for cross validation test harness
seed = 7
# prepare models
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
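	# random_state only takes effect (and is only accepted by recent scikit-learn) when shuffle=True
	kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)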
	cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
	results.append(cv_results)
	names.append(name)
	msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
	print(msg)

'''    
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
'''