# AI/ML Challenge

### Arya Suneesh

---

A hospital in the province of Greenland has been trying to improve its care conditions by studying the historical survival of its patients. Staff examined the data but could not identify the main factors associated with high survival rates.

## Importing necessary modules

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

## Importing training and testing datasets

In [2]:
pharma_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/pharma_data/Training_set_begs.csv')

In [3]:
pharma_data.head()

Unnamed: 0,ID_Patient_Care_Situation,Diagnosed_Condition,Patient_ID,Treated_with_drugs,Patient_Age,Patient_Body_Mass_Index,Patient_Smoker,Patient_Rural_Urban,Patient_mental_condition,A,B,C,D,E,F,Z,Number_of_prev_cond,Survived_1_year
0,22374,8,3333,DX6,56,18.479385,YES,URBAN,Stable,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0
1,18164,5,5740,DX2,36,22.945566,YES,RURAL,Stable,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
2,6283,23,10446,DX6,48,27.510027,YES,RURAL,Stable,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
3,5339,51,12011,DX1,5,19.130976,NO,URBAN,Stable,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
4,33012,0,12513,,128,1.3484,Cannot say,RURAL,Stable,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1


In [4]:
test_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/pharma_data/Testing_set_begs.csv')

In [5]:
test_data.head()

Unnamed: 0,ID_Patient_Care_Situation,Diagnosed_Condition,Patient_ID,Treated_with_drugs,Patient_Age,Patient_Body_Mass_Index,Patient_Smoker,Patient_Rural_Urban,Patient_mental_condition,A,B,C,D,E,F,Z,Number_of_prev_cond
0,19150,40,3709,DX3,16,29.443894,NO,RURAL,Stable,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0
1,23216,52,986,DX6,24,26.836321,NO,URBAN,Stable,1.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0
2,11890,50,11821,DX4 DX5,63,25.52328,NO,RURAL,Stable,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0
3,7149,32,3292,DX6,42,27.171155,NO,URBAN,Stable,1.0,0.0,1.0,0.0,1.0,0.0,0.0,3.0
4,22845,20,9959,DX3,50,25.556192,NO,RURAL,Stable,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


## Preparation of Data

- Eliminating null values
- Removing redundant columns
- Making sure there are no discrepancies
- Separating columns with multiple values into individual columns
- Converting all object type columns into integer or float type columns
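
The steps listed above can be sketched on a small toy frame. This is a minimal illustration, not the notebook's actual cells: the values are made up, only the column names mirror the dataset, and `prepare` is a hypothetical helper introduced here for the example.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Treated_with_drugs": ["DX1 DX2", None, "dx6", "DX6"],
    "Patient_Smoker": ["YES", "NO", "Cannot say", "NO"],
    "Patient_mental_condition": ["Stable"] * 4,
    "A": [1.0, None, 0.0, 1.0],
})


def prepare(df):
    """Hypothetical helper that applies the listed steps to a copy of df."""
    df = df.copy()
    # Remove discrepancies: normalise casing and fold "Cannot say" into "NO".
    df["Treated_with_drugs"] = df["Treated_with_drugs"].str.upper()
    df["Patient_Smoker"] = df["Patient_Smoker"].replace("Cannot say", "NO")
    # Eliminate nulls by filling with each column's most frequent value.
    for col in ["Treated_with_drugs", "A"]:
        df[col] = df[col].fillna(df[col].mode()[0])
    # Remove redundant columns: a column with a single value carries no information.
    constant = [c for c in df.columns if df[c].nunique() == 1]
    df = df.drop(columns=constant)
    # Separate the space-delimited drug list into one indicator column per drug.
    drugs = df.pop("Treated_with_drugs").str.get_dummies(sep=" ")
    df = pd.concat([df, drugs], axis=1)
    # Convert the remaining object column into 0/1 indicator columns.
    return pd.get_dummies(df, columns=["Patient_Smoker"], dtype=int)


prepared = prepare(toy)
print(prepared.columns.tolist())
```

Working on a copy keeps the original frame unchanged, so the same helper could be applied to both the training and testing sets.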


In [6]:
pharma_data["Treated_with_drugs"].value_counts()

DX6                     8606
DX5                     1909
DX2                     1904
DX1                     1835
DX3                     1830
DX4                     1792
DX3 DX4                  448
DX1 DX2                  448
DX1 DX3                  424
DX4 DX5                  423
DX2 DX4                  419
DX1 DX4                  408
DX3 DX5                  407
DX1 DX5                  402
DX2 DX5                  400
DX2 DX3                  398
DX1 DX2 DX5              103
DX1 DX3 DX5              101
DX1 DX2 DX4               99
DX3 DX4 DX5               96
DX1 DX2 DX3               95
DX2 DX3 DX5               91
DX1 DX3 DX4               90
DX2 DX3 DX4               87
DX2 DX4 DX5               84
DX1 DX4 DX5               80
DX1 DX2 DX3 DX4           24
DX1 DX3 DX4 DX5           24
DX2 DX3 DX4 DX5           22
DX1 DX2 DX4 DX5           18
DX1 DX2 DX3 DX5           14
DX1 DX2 DX3 DX4 DX5        3
Name: Treated_with_drugs, dtype: int64

In [7]:
pharma_data.Treated_with_drugs.isnull().sum()

13

In [8]:
pharma_data['Treated_with_drugs'] = pharma_data['Treated_with_drugs'].str.upper()

In [9]:
pharma_data["Treated_with_drugs"] = pharma_data["Treated_with_drugs"].fillna(pharma_data["Treated_with_drugs"].mode()[0])

In [10]:
pharma_data.Treated_with_drugs.isnull().sum()

0

In [11]:
pharma_data.Patient_Smoker.value_counts()

NO            13246
YES            9838
Cannot say       13
Name: Patient_Smoker, dtype: int64

In [12]:
pharma_data.isnull().sum()

ID_Patient_Care_Situation       0
Diagnosed_Condition             0
Patient_ID                      0
Treated_with_drugs              0
Patient_Age                     0
Patient_Body_Mass_Index         0
Patient_Smoker                  0
Patient_Rural_Urban             0
Patient_mental_condition        0
A                            1235
B                            1235
C                            1235
D                            1235
E                            1235
F                            1235
Z                            1235
Number_of_prev_cond          1235
Survived_1_year                 0
dtype: int64

In [13]:
# Fill missing values in each indicator column with that column's most frequent value
for col in ["A", "B", "C", "D", "E", "F", "Z"]:
    pharma_data[col] = pharma_data[col].fillna(pharma_data[col].mode()[0])

In [14]:
pharma_data.isnull().sum()

ID_Patient_Care_Situation       0
Diagnosed_Condition             0
Patient_ID                      0
Treated_with_drugs              0
Patient_Age                     0
Patient_Body_Mass_Index         0
Patient_Smoker                  0
Patient_Rural_Urban             0
Patient_mental_condition        0
A                               0
B                               0
C                               0
D                               0
E                               0
F                               0
Z                               0
Number_of_prev_cond          1235
Survived_1_year                 0
dtype: int64

In [15]:
pharma_data["Number_of_prev_cond"] = pharma_data["Number_of_prev_cond"].fillna(pharma_data["Number_of_prev_cond"].mode()[0])

In [16]:
pharma_data.isnull().sum()

ID_Patient_Care_Situation    0
Diagnosed_Condition          0
Patient_ID                   0
Treated_with_drugs           0
Patient_Age                  0
Patient_Body_Mass_Index      0
Patient_Smoker               0
Patient_Rural_Urban          0
Patient_mental_condition     0
A                            0
B                            0
C                            0
D                            0
E                            0
F                            0
Z                            0
Number_of_prev_cond          0
Survived_1_year              0
dtype: int64

In [17]:
pharma_data.Patient_mental_condition.value_counts()

Stable    23097
Name: Patient_mental_condition, dtype: int64

In [18]:
pharma_data = pharma_data.drop("Patient_mental_condition", axis = 1)

In [19]:
# Use .loc so the assignment modifies pharma_data itself rather than a temporary copy
pharma_data.loc[pharma_data['Patient_Smoker'] == "Cannot say", 'Patient_Smoker'] = 'NO'


In [20]:
pharma_data.Patient_Smoker.value_counts()

NO     13259
YES     9838
Name: Patient_Smoker, dtype: int64

In [21]:
list_drugs = pharma_data["Treated_with_drugs"].str.get_dummies(sep = ' ')
pharma_data = pd.concat([pharma_data, list_drugs], axis = 1)
pharma_data = pharma_data.drop("Treated_with_drugs", axis = 1)


In [22]:
pharma_data.value_counts()

ID_Patient_Care_Situation  Diagnosed_Condition  Patient_ID  Patient_Age  Patient_Body_Mass_Index  Patient_Smoker  Patient_Rural_Urban  A    B    C    D    E    F    Z    Number_of_prev_cond  Survived_1_year  DX1  DX2  DX3  DX4  DX5  DX6
2                          32                   12157       32           18.622978                YES             RURAL                0.0  0.0  1.0  0.0  1.0  0.0  0.0  2.0                  0                0    1    1    0    0    0      1
22116                      32                   12425       27           19.059354                NO              RURAL                1.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0                  1                0    0    1    0    0    0      1
22114                      18                   7801        48           20.664038                YES             RURAL                1.0  1.0  0.0  0.0  0.0  0.0  0.0  2.0                  1                0    0    0    0    0    1      1
22112                      52        

In [23]:
pharma_data = pd.get_dummies(pharma_data, columns=['Patient_Smoker', 'Patient_Rural_Urban'])


In [24]:
pharma_data.head()

Unnamed: 0,ID_Patient_Care_Situation,Diagnosed_Condition,Patient_ID,Patient_Age,Patient_Body_Mass_Index,A,B,C,D,E,...,DX1,DX2,DX3,DX4,DX5,DX6,Patient_Smoker_NO,Patient_Smoker_YES,Patient_Rural_Urban_RURAL,Patient_Rural_Urban_URBAN
0,22374,8,3333,56,18.479385,1.0,0.0,0.0,0.0,1.0,...,0,0,0,0,0,1,0,1,0,1
1,18164,5,5740,36,22.945566,1.0,0.0,0.0,0.0,0.0,...,0,1,0,0,0,0,0,1,1,0
2,6283,23,10446,48,27.510027,1.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,1,0,1,1,0
3,5339,51,12011,5,19.130976,1.0,0.0,0.0,0.0,0.0,...,1,0,0,0,0,0,1,0,0,1
4,33012,0,12513,128,1.3484,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,1,1,0,1,0


## Splitting data

In [25]:
y = pharma_data["Survived_1_year"]
X = pharma_data.drop("Survived_1_year", axis = 1)

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

## Training Model

In [27]:
dt_clf = DecisionTreeClassifier()

In [28]:
dt_clf.fit(X_train, y_train)

DecisionTreeClassifier()
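
Since the stated goal is to find the factors behind survival, one option that this notebook does not use is the fitted tree's `feature_importances_` attribute. The sketch below runs on synthetic data, not the hospital data: `demo_X`, `demo_y`, and `demo_clf` are illustrative names, and the label is built to depend only on the `signal` column.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: the label depends only on "signal", not on "noise".
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "signal": rng.integers(0, 2, size=200),
    "noise": rng.normal(size=200),
})
demo_y = demo_X["signal"]

demo_clf = DecisionTreeClassifier(random_state=0).fit(demo_X, demo_y)

# feature_importances_ sums to 1; a larger value means the feature
# contributed more impurity reduction across the tree's splits.
importances = pd.Series(demo_clf.feature_importances_, index=demo_X.columns)
print(importances.sort_values(ascending=False))
```

The same two lines applied to `dt_clf` and `X_train.columns` would rank the real features, with the caveat that identifier columns left in `X` can receive importance without being meaningful.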

In [29]:
y_pred = dt_clf.predict(X_test)

## Checking accuracy

In [30]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [31]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)

Accuracy:  0.7688311688311689


In [32]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.68      0.70      0.69      2539
           1       0.82      0.81      0.82      4391

    accuracy                           0.77      6930
   macro avg       0.75      0.75      0.75      6930
weighted avg       0.77      0.77      0.77      6930



In [33]:
print(confusion_matrix(y_test, y_pred))

[[1771  768]
 [ 834 3557]]
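
The figures in the classification report can be recovered from the confusion matrix by hand. The sketch below copies the four counts printed above and recomputes accuracy and the class 1 metrics; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
cm = np.array([[1771, 768],
               [834, 3557]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # correct predictions over all predictions
precision_1 = tp / (tp + fp)      # of those predicted to survive, how many did
recall_1 = tp / (tp + fn)         # of those who survived, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

The rounded values agree with the accuracy score and the class 1 row of the report.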


## Cleaning test data

In [34]:
test_data.head()

Unnamed: 0,ID_Patient_Care_Situation,Diagnosed_Condition,Patient_ID,Treated_with_drugs,Patient_Age,Patient_Body_Mass_Index,Patient_Smoker,Patient_Rural_Urban,Patient_mental_condition,A,B,C,D,E,F,Z,Number_of_prev_cond
0,19150,40,3709,DX3,16,29.443894,NO,RURAL,Stable,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0
1,23216,52,986,DX6,24,26.836321,NO,URBAN,Stable,1.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0
2,11890,50,11821,DX4 DX5,63,25.52328,NO,RURAL,Stable,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0
3,7149,32,3292,DX6,42,27.171155,NO,URBAN,Stable,1.0,0.0,1.0,0.0,1.0,0.0,0.0,3.0
4,22845,20,9959,DX3,50,25.556192,NO,RURAL,Stable,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [35]:
test_data.Treated_with_drugs.value_counts()

DX6                     3462
DX4                      785
DX5                      782
DX1                      753
DX3                      747
DX2                      745
DX2 DX4                  181
DX2 DX3                  179
DX1 DX5                  166
DX2 DX5                  165
DX3 DX5                  161
DX1 DX2                  160
DX4 DX5                  157
DX1 DX4                  153
DX1 DX3                  152
DX3 DX4                  148
DX1 DX3 DX4               41
DX1 DX2 DX5               41
DX2 DX3 DX4               40
DX1 DX2 DX3               40
DX3 DX4 DX5               40
DX1 DX2 DX4               38
DX2 DX3 DX5               37
DX1 DX4 DX5               34
DX2 DX4 DX5               33
DX1 DX3 DX5               23
DX1 DX3 DX4 DX5           11
DX2 DX3 DX4 DX5            8
DX1 DX2 DX4 DX5            8
DX1 DX2 DX3 DX5            6
DX1 DX2 DX3 DX4            5
DX1 DX2 DX3 DX4 DX5        2
Name: Treated_with_drugs, dtype: int64

In [36]:
list_drugs_test = test_data["Treated_with_drugs"].str.get_dummies(sep = ' ')
test_data = pd.concat([test_data, list_drugs_test], axis = 1)
test_data = test_data.drop("Treated_with_drugs", axis = 1)

In [37]:
test_data.isnull().sum()

ID_Patient_Care_Situation    0
Diagnosed_Condition          0
Patient_ID                   0
Patient_Age                  0
Patient_Body_Mass_Index      0
Patient_Smoker               0
Patient_Rural_Urban          0
Patient_mental_condition     0
A                            0
B                            0
C                            0
D                            0
E                            0
F                            0
Z                            0
Number_of_prev_cond          0
DX1                          0
DX2                          0
DX3                          0
DX4                          0
DX5                          0
DX6                          0
dtype: int64

In [38]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9303 entries, 0 to 9302
Data columns (total 22 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   ID_Patient_Care_Situation  9303 non-null   int64  
 1   Diagnosed_Condition        9303 non-null   int64  
 2   Patient_ID                 9303 non-null   int64  
 3   Patient_Age                9303 non-null   int64  
 4   Patient_Body_Mass_Index    9303 non-null   float64
 5   Patient_Smoker             9303 non-null   object 
 6   Patient_Rural_Urban        9303 non-null   object 
 7   Patient_mental_condition   9303 non-null   object 
 8   A                          9303 non-null   float64
 9   B                          9303 non-null   float64
 10  C                          9303 non-null   float64
 11  D                          9303 non-null   float64
 12  E                          9303 non-null   float64
 13  F                          9303 non-null   float

In [39]:
test_data.Patient_mental_condition.value_counts()

Stable    9303
Name: Patient_mental_condition, dtype: int64

In [40]:
test_data = test_data.drop("Patient_mental_condition", axis = 1)

In [41]:
test_data = pd.get_dummies(test_data, columns=['Patient_Smoker', 'Patient_Rural_Urban'])


In [42]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9303 entries, 0 to 9302
Data columns (total 23 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   ID_Patient_Care_Situation  9303 non-null   int64  
 1   Diagnosed_Condition        9303 non-null   int64  
 2   Patient_ID                 9303 non-null   int64  
 3   Patient_Age                9303 non-null   int64  
 4   Patient_Body_Mass_Index    9303 non-null   float64
 5   A                          9303 non-null   float64
 6   B                          9303 non-null   float64
 7   C                          9303 non-null   float64
 8   D                          9303 non-null   float64
 9   E                          9303 non-null   float64
 10  F                          9303 non-null   float64
 11  Z                          9303 non-null   float64
 12  Number_of_prev_cond        9303 non-null   float64
 13  DX1                        9303 non-null   int64

## Predicting for test data

In [43]:
test_pred = dt_clf.predict(test_data)
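
`predict` expects the test frame to carry the same feature columns as the training features. Here both frames end up with the same 23 columns because the same drug and category values occur in each set. A defensive sketch, using made-up frames named `train_features` and `test_features`, shows how `reindex` could align them if a value were missing from one set:

```python
import pandas as pd

# Made-up frames: the test frame lacks the DX6 indicator and has a different column order.
train_features = pd.DataFrame({"Patient_Age": [56, 36], "DX1": [0, 1], "DX6": [1, 0]})
test_features = pd.DataFrame({"DX1": [1, 0], "Patient_Age": [16, 24]})

# Reorder to the training layout; indicator columns absent from the test set become 0.
aligned = test_features.reindex(columns=train_features.columns, fill_value=0)
print(aligned)
```

Filling with 0 is only appropriate for indicator columns, where 0 means the category is absent for that row.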

In [44]:
result = pd.DataFrame(test_pred)
result.index = test_data.index
result.columns = ["prediction"]

In [45]:
from google.colab import files
result.to_csv('prediction.csv', index=False)         
files.download('prediction.csv')
