<a href="https://colab.research.google.com/github/aishwaryajadhav/Python-programs/blob/master/Pharma_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Problem Statement:** 
To predict if a patient will survive after 1 year of treatment. 

Target variable: Survived_1_year contains the values 0/1, denoting whether the patient did not survive (0) or survived (1) after a year of treatment. This is a binary classification problem.

Solution steps:
1. Load data
2. Understand your data: EDA
3. Pre-process the data 
4. Prepare train and test datasets
5. Choose a model
6. Train your model
7. Evaluate the model (F1-score calculation)
8. Optimize: repeat steps 5 - 7


### **Load Libraries**

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split 
 
from sklearn.linear_model import LogisticRegression

from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import f1_score

### **Load Data**

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/pharma_data/Training_set_begs.csv')

### **EDA**

Primary screening: 
1. Look at the data, its columns, and the kinds of values they contain: data.head()
2. Step back for a column overview (count, dtypes, non-null counts): data.info()

In [None]:
data.head()

In [None]:
data.info()

First, we look at the distribution of the target variable to check whether the dataset is balanced.


In [None]:
sns.countplot(x='Survived_1_year', data=data)
plt.show()
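
The count plot shows balance visually; the same check can be made numerically. The following is a minimal sketch on a toy series standing in for `data['Survived_1_year']` (the values are illustrative, not taken from the dataset).

```python
import pandas as pd

# Toy stand-in for data['Survived_1_year']; illustrative values only
target = pd.Series([1, 1, 0, 1, 0, 1, 1, 0], name="Survived_1_year")

# Share of each class; shares far from 0.5 suggest an imbalanced target
class_share = target.value_counts(normalize=True).sort_index()
print(class_share)
```

On the real data, replacing `target` with `data['Survived_1_year']` should give the class shares directly.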

Next, we explore the numeric variables.

---



In [None]:
numeric_features = data.select_dtypes(include=[np.number])
numeric_features.columns

In [None]:
numeric_data = data[['Patient_Age', 'Patient_Body_Mass_Index', 'Number_of_prev_cond', 'Survived_1_year']]  # keep the target variable for analysis purposes
numeric_data.head()

In [None]:
numeric_data.info()

In [None]:
numeric_data.isnull().sum()

Number_of_prev_cond has many missing values. We will fill these with the mode (the most frequent value).

In [None]:
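# Fill missing values with the most frequent value, assigning the result back to the column
data['Number_of_prev_cond'] = data['Number_of_prev_cond'].fillna(data['Number_of_prev_cond'].mode()[0])
# Copy the filled column into the frame used for plotting
numeric_data = numeric_data.assign(Number_of_prev_cond=data['Number_of_prev_cond'])
numeric_data.isnull().sum()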
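
To make the mode imputation concrete, here is a small sketch on a toy frame (the values are made up for illustration). It shows that `mode()[0]` picks the most frequent value and that `fillna` leaves no missing entries.

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries; illustrative values only
toy = pd.DataFrame({"Number_of_prev_cond": [1.0, 2.0, np.nan, 1.0, np.nan, 3.0]})

# mode() returns a Series of the most frequent values; [0] takes the first
most_common = toy["Number_of_prev_cond"].mode()[0]

# Assign the filled column back instead of mutating in place
toy["Number_of_prev_cond"] = toy["Number_of_prev_cond"].fillna(most_common)
print(most_common, toy["Number_of_prev_cond"].isnull().sum())
```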

In [None]:
for feature in numeric_data.columns[:-1]:
  sns.boxplot(x='Survived_1_year', y=feature, data=numeric_data)
  sns.swarmplot(x='Survived_1_year', y=feature, data=numeric_data)
  plt.show()

### Missing values

In [None]:
data.isnull().sum()

ID_Patient_Care_Situation       0
Diagnosed_Condition             0
Patient_ID                      0
Treated_with_drugs             13
Patient_Age                     0
Patient_Body_Mass_Index         0
Patient_Smoker                  0
Patient_Rural_Urban             0
Patient_mental_condition        0
A                            1235
B                            1235
C                            1235
D                            1235
E                            1235
F                            1235
Z                            1235
Number_of_prev_cond          1235
Survived_1_year                 0
dtype: int64

In [None]:
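categorical_columns = data.select_dtypes(include = 'object').columns  # all string-valued columns, including Treated_with_drugs


In [None]:
# Fill missing categorical values with each column's most frequent value
data[categorical_columns] = data[categorical_columns].fillna(data[categorical_columns].mode().iloc[0])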


In [None]:
data.isnull().sum()

ID_Patient_Care_Situation       0
Diagnosed_Condition             0
Patient_ID                      0
Treated_with_drugs              0
Patient_Age                     0
Patient_Body_Mass_Index         0
Patient_Smoker                  0
Patient_Rural_Urban             0
Patient_mental_condition        0
A                            1235
B                            1235
C                            1235
D                            1235
E                            1235
F                            1235
Z                            1235
Number_of_prev_cond          1235
Survived_1_year                 0
dtype: int64

In [None]:
encoder = LabelEncoder()
encoder.fit(data['Treated_with_drugs'])
encoder.classes_

array(['DX1 ', 'DX1 DX2 ', 'DX1 DX2 DX3 ', 'DX1 DX2 DX3 DX4 ',
       'DX1 DX2 DX3 DX4 DX5 ', 'DX1 DX2 DX3 DX5 ', 'DX1 DX2 DX4 ',
       'DX1 DX2 DX4 DX5 ', 'DX1 DX2 DX5 ', 'DX1 DX3 ', 'DX1 DX3 DX4 ',
       'DX1 DX3 DX4 DX5 ', 'DX1 DX3 DX5 ', 'DX1 DX4 ', 'DX1 DX4 DX5 ',
       'DX1 DX5 ', 'DX2 ', 'DX2 DX3 ', 'DX2 DX3 DX4 ', 'DX2 DX3 DX4 DX5 ',
       'DX2 DX3 DX5 ', 'DX2 DX4 ', 'DX2 DX4 DX5 ', 'DX2 DX5 ', 'DX3 ',
       'DX3 DX4 ', 'DX3 DX4 DX5 ', 'DX3 DX5 ', 'DX4 ', 'DX4 DX5 ', 'DX5 ',
       'DX6'], dtype=object)

In [None]:
data['Treated_with_drugs'] = encoder.transform(data['Treated_with_drugs'])
data.head()

Unnamed: 0,ID_Patient_Care_Situation,Diagnosed_Condition,Patient_ID,Treated_with_drugs,Patient_Age,Patient_Body_Mass_Index,Patient_Smoker,Patient_Rural_Urban,Patient_mental_condition,A,B,C,D,E,F,Z,Number_of_prev_cond,Survived_1_year
0,22374,8,3333,31,56,18.479385,YES,URBAN,Stable,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0
1,18164,5,5740,16,36,22.945566,YES,RURAL,Stable,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
2,6283,23,10446,31,48,27.510027,YES,RURAL,Stable,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
3,5339,51,12011,0,5,19.130976,NO,URBAN,Stable,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
4,33012,0,12513,31,128,1.3484,Cannot say,RURAL,Stable,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1


In [None]:
encoder = LabelEncoder()
encoder.fit(data['Patient_Smoker'])
encoder.classes_

array(['Cannot say', 'NO', 'YES'], dtype=object)

In [None]:
data['Patient_Smoker'] = encoder.transform(data['Patient_Smoker'])

In [None]:
encoder = LabelEncoder()
encoder.fit(data['Patient_Rural_Urban'])
encoder.classes_

array(['RURAL', 'URBAN'], dtype=object)

In [None]:
data['Patient_Rural_Urban'] = encoder.transform(data['Patient_Rural_Urban'])

In [None]:
encoder = LabelEncoder()
encoder.fit(data['Patient_mental_condition'])
encoder.classes_

array(['Stable'], dtype=object)

In [None]:
data['Patient_mental_condition'] = encoder.transform(data['Patient_mental_condition'])
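
Two details of the encoding above are worth checking: `LabelEncoder` assigns integer codes in sorted order of the class labels, and a column with a single class (such as Patient_mental_condition, whose only value is 'Stable') becomes constant and carries no information for a model. A minimal sketch on a toy frame (illustrative rows, not the dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame mirroring two of the categorical columns; illustrative rows only
toy = pd.DataFrame({
    "Patient_Smoker": ["YES", "NO", "Cannot say", "NO"],
    "Patient_mental_condition": ["Stable", "Stable", "Stable", "Stable"],
})

encoder = LabelEncoder()
codes = encoder.fit_transform(toy["Patient_Smoker"])

# classes_ is sorted, so the position of each label is its integer code
mapping = dict(zip(encoder.classes_, range(len(encoder.classes_))))
print(mapping)

# Columns with a single distinct value are constant after encoding
constant_columns = [c for c in toy.columns if toy[c].nunique() == 1]
print(constant_columns)
```

Because the codes imply an ordering (0 < 1 < 2) that the categories may not have, one-hot encoding could be considered for a linear model; that is a design option, not something the notebook does.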

In [None]:
data.head()

Unnamed: 0,ID_Patient_Care_Situation,Diagnosed_Condition,Patient_ID,Treated_with_drugs,Patient_Age,Patient_Body_Mass_Index,Patient_Smoker,Patient_Rural_Urban,Patient_mental_condition,A,B,C,D,E,F,Z,Number_of_prev_cond,Survived_1_year
0,22374,8,3333,31,56,18.479385,2,1,0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0
1,18164,5,5740,16,36,22.945566,2,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
2,6283,23,10446,31,48,27.510027,2,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
3,5339,51,12011,0,5,19.130976,1,1,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1
4,33012,0,12513,31,128,1.3484,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1


In [None]:
data[data["B"].isnull()]

Unnamed: 0,ID_Patient_Care_Situation,Diagnosed_Condition,Patient_ID,Treated_with_drugs,Patient_Age,Patient_Body_Mass_Index,Patient_Smoker,Patient_Rural_Urban,Patient_mental_condition,A,B,C,D,E,F,Z,Number_of_prev_cond,Survived_1_year
27,4691,31,4500,16,44,27.935658,2,1,0,,,,,,,,,0
36,4869,28,6826,0,4,17.342624,1,1,0,,,,,,,,,0
41,32899,24,7275,23,12,20.994843,1,0,0,,,,,,,,,1
97,9311,13,7538,16,49,26.641499,1,0,0,,,,,,,,,1
105,13511,44,7903,28,41,28.079769,2,0,0,,,,,,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23002,6195,3,7812,16,26,26.548517,2,0,0,,,,,,,,,1
23012,20220,48,12318,28,15,18.849124,1,0,0,,,,,,,,,1
23024,25571,6,3423,16,13,24.343030,1,0,0,,,,,,,,,1
23038,644,30,8032,27,45,19.272509,1,0,0,,,,,,,,,1


In [None]:
# Fill the remaining numeric columns with their most frequent value.
# The result is assigned back to each column rather than using inplace=True
# on a column selection, which newer pandas versions warn about.
for column in ['A', 'B', 'C', 'D', 'E', 'F', 'Z', 'Number_of_prev_cond']:
    data[column] = data[column].fillna(data[column].mode()[0])

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23097 entries, 0 to 23096
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   ID_Patient_Care_Situation  23097 non-null  int64  
 1   Diagnosed_Condition        23097 non-null  int64  
 2   Patient_ID                 23097 non-null  int64  
 3   Treated_with_drugs         23097 non-null  int64  
 4   Patient_Age                23097 non-null  int64  
 5   Patient_Body_Mass_Index    23097 non-null  float64
 6   Patient_Smoker             23097 non-null  int64  
 7   Patient_Rural_Urban        23097 non-null  int64  
 8   Patient_mental_condition   23097 non-null  int64  
 9   A                          23097 non-null  float64
 10  B                          23097 non-null  float64
 11  C                          23097 non-null  float64
 12  D                          23097 non-null  float64
 13  E                          23097 non-null  flo

In [None]:
data.isnull().sum()

ID_Patient_Care_Situation    0
Diagnosed_Condition          0
Patient_ID                   0
Treated_with_drugs           0
Patient_Age                  0
Patient_Body_Mass_Index      0
Patient_Smoker               0
Patient_Rural_Urban          0
Patient_mental_condition     0
A                            0
B                            0
C                            0
D                            0
E                            0
F                            0
Z                            0
Number_of_prev_cond          0
Survived_1_year              0
dtype: int64

### Separate input and output

In [None]:
X = data.drop('Survived_1_year',axis = 1)
y = data['Survived_1_year']

### Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
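
If the target turns out to be imbalanced, a stratified split keeps the class ratio the same in both partitions. The sketch below uses toy arrays (a 70/30 class ratio chosen for illustration) and the `stratify` parameter of `train_test_split`; the notebook's own split above does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 70% positive class; illustrative only
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 70 + [0] * 30)

# stratify=y_toy preserves the 70/30 ratio in both train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```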

### Model Building

In [None]:
model = LogisticRegression(max_iter = 1000)
model.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

### Prediction

In [None]:
pred = model.predict(X_test)

In [None]:
print(f1_score(y_test,pred))

0.7515113935823903
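
For reference, the F1-score is the harmonic mean of precision and recall. The following sketch computes it by hand on a small toy example (labels invented for illustration) and compares the result with `f1_score`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels and predictions; illustrative only
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_hat = [1, 0, 1, 0, 1, 1, 0, 1]

precision = precision_score(y_true, y_hat)  # TP / (TP + FP)
recall = recall_score(y_true, y_hat)        # TP / (TP + FN)

# F1 = 2 * precision * recall / (precision + recall)
manual_f1 = 2 * precision * recall / (precision + recall)
print(manual_f1, f1_score(y_true, y_hat))
```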


### Random Forest and Boruta

In [None]:
from sklearn.ensemble import RandomForestClassifier  # f1_score is already imported above

In [None]:
forest = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=1)
 
forest.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=5, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

In [None]:
y_pred = forest.predict(X_test)

fscore = f1_score(y_test ,y_pred)
fscore

0.8347687400318979

### Boruta

In [None]:
!pip install Boruta

Collecting Boruta
  Downloading https://files.pythonhosted.org/packages/b2/11/583f4eac99d802c79af9217e1eff56027742a69e6c866b295cce6a5a8fc2/Boruta-0.3-py3-none-any.whl (56kB)
Installing collected packages: Boruta
Successfully installed Boruta-0.3


In [None]:
from boruta import BorutaPy

In [None]:
boruta_selector = BorutaPy(forest, n_estimators='auto', verbose=2, random_state=1)
boruta_selector.fit(np.array(X_train), np.array(y_train))

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	17
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	17
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	17
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	17
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	17
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	17
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	17
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	10
Tentative: 	3
Rejected: 	4
Iteration: 	9 / 100
Confirmed: 	10
Tentative: 	3
Rejected: 	4
Iteration: 	10 / 100
Confirmed: 	10
Tentative: 	3
Rejected: 	4
Iteration: 	11 / 100
Confirmed: 	10
Tentative: 	3
Rejected: 	4
Iteration: 	12 / 100
Confirmed: 	12
Tentative: 	1
Rejected: 	4
Iteration: 	13 / 100
Confirmed: 	12
Tentative: 	1
Rejected: 	4
Iteration: 	14 / 100
Confirmed: 	12
Tentative: 	1
Rejected: 	4
Iteration: 	15 / 100
Confirmed: 	12
Tentative: 	1
Rejected: 	4
Iteration: 	16 / 100
Confirmed: 	12
Tentative: 	1
Rejected: 	4
I

BorutaPy(alpha=0.05,
         estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                          class_weight=None, criterion='gini',
                                          max_depth=5, max_features='auto',
                                          max_leaf_nodes=None, max_samples=None,
                                          min_impurity_decrease=0.0,
                                          min_impurity_split=None,
                                          min_samples_leaf=1,
                                          min_samples_split=2,
                                          min_weight_fraction_leaf=0.0,
                                          n_estimators=101, n_jobs=-1,
                                          oob_score=False,
                                          random_state=RandomState(MT19937) at 0x7F4A322F7888,
                                          verbose=0, warm_start=False),
         max_iter=100, n_estimators='aut

In [None]:
print("Selected Features: ", boruta_selector.support_)
 

print("Ranking: ",boruta_selector.ranking_)

print("No. of significant features: ", boruta_selector.n_features_)

Selected Features:  [False  True  True  True  True  True  True  True False  True  True  True
  True False False False  True]
Ranking:  [2 1 1 1 1 1 1 1 6 1 1 1 1 3 4 5 1]
No. of significant features:  12


In [None]:
# Pair each feature name with its Boruta ranking (1 = confirmed)
boruta_rankings = pd.DataFrame({'Feature':list(X_train.columns),
                                'Ranking':boruta_selector.ranking_})
boruta_rankings.sort_values(by='Ranking')

Unnamed: 0,Feature,Ranking
16,Number_of_prev_cond,1
1,Diagnosed_Condition,1
2,Patient_ID,1
3,Treated_with_drugs,1
4,Patient_Age,1
5,Patient_Body_Mass_Index,1
6,Patient_Smoker,1
7,Patient_Rural_Urban,1
12,D,1
9,A,1
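
The boolean mask and ranking printed above are easier to read once mapped back to column names. The sketch below hard-codes the `support_` and `ranking_` values from the output shown earlier and assumes the feature order matches the `data.head()` header without the target column; in the notebook itself, `X_train.columns` and the `boruta_selector` attributes would be used instead.

```python
import numpy as np

# Assumed feature order: the data.head() header minus Survived_1_year
feature_names = [
    "ID_Patient_Care_Situation", "Diagnosed_Condition", "Patient_ID",
    "Treated_with_drugs", "Patient_Age", "Patient_Body_Mass_Index",
    "Patient_Smoker", "Patient_Rural_Urban", "Patient_mental_condition",
    "A", "B", "C", "D", "E", "F", "Z", "Number_of_prev_cond",
]

# Values copied from the printed boruta_selector.support_ and ranking_
support = np.array([False, True, True, True, True, True, True, True, False,
                    True, True, True, True, False, False, False, True])
ranking = np.array([2, 1, 1, 1, 1, 1, 1, 1, 6, 1, 1, 1, 1, 3, 4, 5, 1])

# Features confirmed by Boruta (ranking 1)
selected = [name for name, keep in zip(feature_names, support) if keep]

# Features that were not confirmed, with their rankings
not_confirmed = {name: int(r) for name, r in zip(feature_names, ranking) if r > 1}

print(selected)
print(not_confirmed)
```

Under these assumptions, the constant Patient_mental_condition column receives the worst ranking, while Patient_ID is confirmed. An identifier column being selected may be worth a second look before relying on it as a predictor.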


In [None]:
X_important_train = boruta_selector.transform(np.array(X_train))
X_important_test = boruta_selector.transform(np.array(X_test))

In [None]:
rf_important = RandomForestClassifier(n_estimators=10000, random_state=1, n_jobs=-1)


rf_important.fit(X_important_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10000,
                       n_jobs=-1, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

In [None]:
y_important_pred = rf_important.predict(X_important_test)
rf_imp_fscore = f1_score(y_test, y_important_pred)

In [None]:
print(rf_imp_fscore)

0.8528314682943371
