# Introduction

When a group of patients shares the same illness, their recorded data and their responses to particular medications can be used to build a prediction model. Our goal in this project is to create a model that finds the right medication for future patients with the same illness. The question we are trying to answer is: "Can we create a machine learning model that predicts the proper medication for patients with the same illness, based on features such as age and gender?"

# Contents

1. About the Data Set
2. Data Collection and Understanding
3. Data Exploration
4. Model Selecting and Set Up
5. Model Development
6. Prediction
7. Evaluation
8. Conclusion

## About the Data Set

The columns provided in the data set are outlined below:

- Age: Age of the patient
- Sex: Gender of the patient
- BP: Blood pressure level of the patient
- Cholesterol: Cholesterol level of the patient
- Drug: Drug each patient responded to (the target we want to predict)
- Na_to_K: Sodium-to-potassium ratio of the patient


Please note that all patients in the data set have the same illness.

## Data Collection and Understanding

In [1]:
# importing necessary libraries
import pandas as pd
import numpy as np

In [2]:
df=pd.read_csv('drug200.csv')
df.head(5)

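| | Age | Sex | BP | Cholesterol | Na_to_K | Drug |
|---|---|---|---|---|---|---|
| 0 | 23 | F | HIGH | HIGH | 25.355 | drugY |
| 1 | 47 | M | LOW | HIGH | 13.093 | drugC |
| 2 | 47 | M | LOW | HIGH | 10.114 | drugC |
| 3 | 28 | F | NORMAL | HIGH | 7.798 | drugX |
| 4 | 61 | F | LOW | HIGH | 18.043 | drugY |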


In [3]:
# summary of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
Age            200 non-null int64
Sex            200 non-null object
BP             200 non-null object
Cholesterol    200 non-null object
Na_to_K        200 non-null float64
Drug           200 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 9.5+ KB


We can see that the data set contains 1200 values in total: 200 rows by 6 columns. Each variable has an appropriate data type: Age is an integer, Na_to_K is a float, and Sex, BP, Cholesterol and Drug are objects (strings).

We can also see that Sex, BP and Cholesterol are categorical features, and that Drug is a categorical target.
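The size quoted above is simply rows times columns. As a minimal sketch, the snippet below retypes the five rows shown by `df.head(5)` into a small frame (the name `sample` is ours and is not part of the original notebook) and checks its shape, its size, and how often each level of one categorical column occurs.

```python
import pandas as pd

# Five rows copied by hand from the df.head(5) output; `sample` is an illustrative name.
sample = pd.DataFrame({
    "Age": [23, 47, 47, 28, 61],
    "Sex": ["F", "M", "M", "F", "F"],
    "BP": ["HIGH", "LOW", "LOW", "NORMAL", "LOW"],
    "Cholesterol": ["HIGH", "HIGH", "HIGH", "HIGH", "HIGH"],
    "Na_to_K": [25.355, 13.093, 10.114, 7.798, 18.043],
    "Drug": ["drugY", "drugC", "drugC", "drugX", "drugY"],
})

n_rows, n_cols = sample.shape
print(n_rows, n_cols)    # 5 6
print(sample.size)       # 30, i.e. 5 * 6

# Count how often each blood-pressure level occurs in the sample
bp_counts = sample["BP"].value_counts()
print(bp_counts["LOW"])  # 3
```

On the full data set the same arithmetic gives 200 * 6 = 1200.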

In [4]:
# looking to see if there are any missing values
missing_data=df.isnull()
missing_data.head(5)

| | Age | Sex | BP | Cholesterol | Na_to_K | Drug |
|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False |


In [5]:
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print('')

Age
False    200
Name: Age, dtype: int64

Sex
False    200
Name: Sex, dtype: int64

BP
False    200
Name: BP, dtype: int64

Cholesterol
False    200
Name: Cholesterol, dtype: int64

Na_to_K
False    200
Name: Na_to_K, dtype: int64

Drug
False    200
Name: Drug, dtype: int64



We can see that there are no missing values in our data set, so we can start analyzing it.
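The per-column loop above can usually be condensed into a single call. The sketch below is one possible shortcut, shown on a tiny made-up frame (`toy`, with one deliberately missing value) rather than on the real data, so that a non-zero count is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing Age value, for illustration only
toy = pd.DataFrame({
    "Age": [23, 47, np.nan],
    "Sex": ["F", "M", "M"],
})

# isnull() marks missing cells as True; summing counts them per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Total number of missing cells in the whole frame
total_missing = int(missing_per_column.sum())
print(total_missing)  # 1
```

Applied to our `df`, every per-column count would be 0, matching the output of the loop.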

## Data Exploration

In [6]:
df.describe()

| | Age | Na_to_K |
|---|---|---|
| count | 200.0 | 200.0 |
| mean | 44.315 | 16.084485 |
| std | 16.544315 | 7.223956 |
| min | 15.0 | 6.269 |
| 25% | 31.0 | 10.4455 |
| 50% | 45.0 | 13.9365 |
| 75% | 58.0 | 19.38 |
| max | 74.0 | 38.247 |


Based on the data set, the average age of our patients is about 44. The youngest patient is 15 and the oldest is 74 years old. Please keep in mind that all of the patients have the same illness.

In [7]:
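# correlation between the numeric columns only (string columns cannot be correlated)
df[['Age', 'Na_to_K']].corr()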

| | Age | Na_to_K |
|---|---|---|
| Age | 1.0 | -0.063119 |
| Na_to_K | -0.063119 | 1.0 |


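There is a very weak negative correlation (about -0.06) between age and the sodium-to-potassium ratio. A coefficient this close to zero means there is essentially no linear relationship between the two variables, so we cannot conclude that the ratio falls as age rises.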
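To make the coefficient more concrete, the sketch below computes a Pearson correlation by hand (the default method of pandas `corr()`) on the five rows shown by `df.head(5)`, and then squares the full-data coefficient reported above to see how much variance the two variables share. The variable names are ours, and five rows are far too few to say anything about the real relationship; the point is only the arithmetic.

```python
import numpy as np

# Age and Na_to_K values copied from the df.head(5) output (illustration only)
age = np.array([23, 47, 47, 28, 61], dtype=float)
na_to_k = np.array([25.355, 13.093, 10.114, 7.798, 18.043])

# Pearson r = sum of products of deviations / sqrt(product of sums of squared deviations)
age_c = age - age.mean()
nak_c = na_to_k - na_to_k.mean()
r_manual = (age_c * nak_c).sum() / np.sqrt((age_c ** 2).sum() * (nak_c ** 2).sum())

# NumPy's built-in gives the same number
r_numpy = np.corrcoef(age, na_to_k)[0, 1]
print(np.isclose(r_manual, r_numpy))  # True

# Coefficient reported by the notebook for all 200 rows
r_full = -0.063119
r_squared = r_full ** 2
print(round(r_squared, 4))  # 0.004
```

An r squared of roughly 0.004 means that age accounts for well under 1% of the variance in the ratio in a linear sense.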

## Model Selecting and Set Up

Even though the feature set contains categorical variables, we can use a Decision Tree to create a prediction model. scikit-learn's decision tree expects numeric input, so we first need to convert the categorical variables Sex, BP (blood pressure) and Cholesterol into numerical variables.

We define the feature matrix as X and the response vector, which is the target, as y.

In [11]:
# defining our feature matrix that will predict the target y value(drug)
X = df[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values
X[0:5]

array([[23, 'F', 'HIGH', 'HIGH', 25.355],
       [47, 'M', 'LOW', 'HIGH', 13.093],
       [47, 'M', 'LOW', 'HIGH', 10.113999999999999],
       [28, 'F', 'NORMAL', 'HIGH', 7.797999999999999],
       [61, 'F', 'LOW', 'HIGH', 18.043]], dtype=object)

In [9]:
# turn the categorical variables to numeric variables
from sklearn import preprocessing

In [12]:
le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F','M'])
X[:,1] = le_sex.transform(X[:,1]) 


le_BP = preprocessing.LabelEncoder()
le_BP.fit([ 'LOW', 'NORMAL', 'HIGH'])
X[:,2] = le_BP.transform(X[:,2])


le_Chol = preprocessing.LabelEncoder()
le_Chol.fit([ 'NORMAL', 'HIGH'])
X[:,3] = le_Chol.transform(X[:,3]) 

X[0:5]


array([[23, 0, 0, 0, 25.355],
       [47, 1, 1, 0, 13.093],
       [47, 1, 1, 0, 10.113999999999999],
       [28, 0, 2, 0, 7.797999999999999],
       [61, 0, 1, 0, 18.043]], dtype=object)
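It may look surprising that HIGH became 0 even though `le_BP` was fitted on the list `['LOW', 'NORMAL', 'HIGH']`. `LabelEncoder` sorts the class labels before assigning codes, so the order passed to `fit` does not matter. The short sketch below (with our own variable names) prints the resulting mapping.

```python
from sklearn.preprocessing import LabelEncoder

le_bp = LabelEncoder()
le_bp.fit(["LOW", "NORMAL", "HIGH"])

# classes_ holds the labels in sorted order; the position is the integer code
classes = [str(c) for c in le_bp.classes_]
print(classes)  # ['HIGH', 'LOW', 'NORMAL']

# Build an explicit label -> code dictionary for reference
codes = le_bp.transform(classes)
mapping = {label: int(code) for label, code in zip(classes, codes)}
print(mapping)  # {'HIGH': 0, 'LOW': 1, 'NORMAL': 2}
```

The numeric order of the codes is alphabetical and carries no clinical meaning. The same rule gives F = 0, M = 1 for Sex and HIGH = 0, NORMAL = 1 for Cholesterol.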

In [13]:
# defining the y target value
y=df['Drug']
y[0:5]

0    drugY
1    drugC
2    drugC
3    drugX
4    drugY
Name: Drug, dtype: object

We have selected a Decision Tree as our predictive model and defined our feature matrix and target. We can now set up the decision tree by splitting the data into a training set and a test set.

In [15]:
# importing necessary libraries
from sklearn.model_selection import train_test_split

In [16]:
X_trainset, X_testset, y_trainset, y_testset=train_test_split(X, y, test_size=0.3, random_state=3)

In [17]:
X_trainset[0:5]

array([[26, 0, 0, 1, 19.160999999999998],
       [41, 0, 2, 1, 22.905],
       [28, 0, 2, 0, 19.675],
       [19, 0, 0, 0, 13.312999999999999],
       [50, 1, 2, 1, 15.79]], dtype=object)
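With `test_size=0.3` and 200 rows, the split should leave 140 rows for training and 60 for testing. The sketch below checks this on placeholder arrays of the same shape as our data (the names `X_demo` and `y_demo` and their values are arbitrary; only the row counts matter).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the real feature matrix (200 x 5)
X_demo = np.arange(200 * 5).reshape(200, 5)
y_demo = np.arange(200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=3
)

print(X_tr.shape, X_te.shape)  # (140, 5) (60, 5)

# No row appears in both the training and the test set
overlap = set(y_tr.tolist()) & set(y_te.tolist())
print(len(overlap))  # 0
```

Fixing `random_state` makes the split reproducible, so rerunning the cell yields the same training and test rows.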

## Model Development

In [19]:
# creating the Decision Tree Classifier instance
from sklearn.tree import DecisionTreeClassifier

In [20]:
drugTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
drugTree # displays the classifier with the parameters we set and the defaults for the rest

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [21]:
# fit (train) the tree on the training set: X_trainset and y_trainset
drugTree.fit(X_trainset,y_trainset)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
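The tree above is built with `criterion="entropy"`, which means candidate splits are scored by how much they reduce the entropy of the class labels. As a rough illustration of the quantity involved, the sketch below computes Shannon entropy for the five Drug labels shown by `df.head(5)`; the helper `entropy` is our own and is not a scikit-learn function.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# Drug labels of the first five patients: 2 x drugY, 2 x drugC, 1 x drugX
head_labels = ["drugY", "drugC", "drugC", "drugX", "drugY"]
print(round(entropy(head_labels), 3))  # 1.522

# A node that contains a single class is pure and has zero entropy
print(entropy(["drugX"] * 4) == 0)     # True
```

A split is attractive when the weighted entropy of the child nodes is much lower than that of the parent, which is the information gain the criterion maximizes.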

## Prediction

In [22]:
# the model is trained, so we can predict the drug for each patient in the test set
predTree = drugTree.predict(X_testset)

In [23]:
print(predTree[0:5])
print(y_testset[0:5])

['drugY' 'drugX' 'drugX' 'drugX' 'drugX']
40     drugY
51     drugX
139    drugX
197    drugX
170    drugX
Name: Drug, dtype: object


## Evaluation

In [24]:
# with predictions in hand, we can check the accuracy of the model using metrics from sklearn
from sklearn import metrics
import matplotlib.pyplot as plt
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_testset, predTree))

DecisionTrees's Accuracy:  0.9833333333333333


The accuracy score is the fraction of test samples whose predicted label exactly matches the true label. Our test set holds 60 patients (30% of 200), so a score of about 0.983 corresponds to 59 correct predictions out of 60.
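Accuracy can also be computed by hand as the share of matching labels. The sketch below does this for the five predictions and true labels printed in the Prediction section and compares the result with `metrics.accuracy_score`; the array names are ours.

```python
import numpy as np
from sklearn import metrics

# First five predictions and true labels, copied from the Prediction output
y_pred = np.array(["drugY", "drugX", "drugX", "drugX", "drugX"])
y_true = np.array(["drugY", "drugX", "drugX", "drugX", "drugX"])

# Element-wise comparison gives True/False; the mean of that is the accuracy
manual_acc = float(np.mean(y_pred == y_true))
sklearn_acc = metrics.accuracy_score(y_true, y_pred)
print(manual_acc, sklearn_acc)  # 1.0 1.0

# On a 60-row test set, one wrong prediction gives the score reported above
print(59 / 60)  # 0.9833333333333333
```

Accuracy alone does not show which drug was misclassified; `metrics.confusion_matrix(y_testset, predTree)` would be one way to inspect that on the real test set.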

## Conclusion

Based on our analysis, using Age, Sex, Blood Pressure, Cholesterol and the sodium-to-potassium ratio as the feature matrix, we created a machine learning model that predicts the proper medication for a patient, reaching about 98% accuracy on the test set.