<a href="https://colab.research.google.com/github/andydaehn/Product_Lab_Results/blob/main/Product_Lab_Results.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab Report Analysis <a name="top"></a>

## Table of Contents
* [Abstract](#abstract)
* [Exploratory Data Analysis](#exploratory_data_analysis)
    * [Univariate Analysis](#univariate)
    * [Bivariate Analysis](#bivariate)
    * [Multivariate Analysis](#multivariate)
* [Data Preprocessing](#preprocess)
* [Split and Encode the Data](#tt_split)
* [Train and Test the Data on Random Forest Classifier](#train_rfc)
    * [Results for Random Forest Classifier](#rfc_results)
* [Train and Test the Data on Decision Tree Classifier](#train_dtc)
    * [Results for Decision Tree Classifier](#results_dtc)

## Abstract <a class="anchor" id="abstract"></a>


Information has been collected about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of five medications: Drug A, Drug B, Drug C, Drug X, or Drug Y.

The goal is to build a model that predicts which drug might be appropriate for a future patient with the same illness. The features in this dataset are the Age, Sex, Blood Pressure, Cholesterol, and sodium-to-potassium ratio (Na_to_K) of each patient, and the target is the drug that each patient responded to.

Using medical data from 200 patients, the analysis classifies each patient into a drug profile to help predict the best drug for that patient.

Comparing results from the Random Forest and Decision Tree classifiers, the Decision Tree performed best, with a training accuracy of 83.12% and a test accuracy of 85.0%, against 72.5% and 57.5% for the Random Forest.

This project is based on a Kaggle dataset by [Pratham Tripathi](https://www.kaggle.com/datasets/prathamtripathi/drug-classification).

In [1]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import plotly.express as px

# Ensure visualizations can be viewed by all
colorblind_seq = ['#8856a7', '#9ebcda','#de2d26']

# Load the dataset
explore_df = pd.read_csv('train.csv')

## Exploratory Data Analysis <a class="anchor" id="exploratory_data_analysis"></a>
<a href="#top">Back to top of page</a>

The first 5 rows of the dataset

In [2]:
explore_df.head()

| | id | product_code | loading | attribute_0 | attribute_1 | attribute_2 | attribute_3 | measurement_0 | measurement_1 | measurement_2 | ... | measurement_9 | measurement_10 | measurement_11 | measurement_12 | measurement_13 | measurement_14 | measurement_15 | measurement_16 | measurement_17 | failure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | A | 80.1 | material_7 | material_8 | 9 | 5 | 7 | 8 | 4 | ... | 10.672 | 15.859 | 17.594 | 15.193 | 15.029 | NaN | 13.034 | 14.684 | 764.1 | 0 |
| 1 | 1 | A | 84.89 | material_7 | material_8 | 9 | 5 | 14 | 3 | 3 | ... | 12.448 | 17.947 | 17.915 | 11.755 | 14.732 | 15.425 | 14.395 | 15.631 | 682.057 | 0 |
| 2 | 2 | A | 82.43 | material_7 | material_8 | 9 | 5 | 12 | 1 | 5 | ... | 12.715 | 15.607 | NaN | 13.798 | 16.711 | 18.631 | 14.094 | 17.946 | 663.376 | 0 |
| 3 | 3 | A | 101.07 | material_7 | material_8 | 9 | 5 | 13 | 2 | 6 | ... | 12.471 | 16.346 | 18.377 | 10.02 | 15.25 | 15.562 | 16.154 | 17.172 | 826.282 | 0 |
| 4 | 4 | A | 188.06 | material_7 | material_8 | 9 | 5 | 9 | 2 | 8 | ... | 10.337 | 17.082 | 19.932 | 12.428 | 16.182 | 12.76 | 13.153 | 16.412 | 579.885 | 0 |


Summary statistics for the numerical columns

In [3]:
explore_df.describe()

| | id | loading | attribute_2 | attribute_3 | measurement_0 | measurement_1 | measurement_2 | measurement_3 | measurement_4 | measurement_5 | ... | measurement_9 | measurement_10 | measurement_11 | measurement_12 | measurement_13 | measurement_14 | measurement_15 | measurement_16 | measurement_17 | failure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 26570.0 | 26320.0 | 26570.0 | 26570.0 | 26570.0 | 26570.0 | 26570.0 | 26189.0 | 26032.0 | 25894.0 | ... | 25343.0 | 25270.0 | 25102.0 | 24969.0 | 24796.0 | 24696.0 | 24561.0 | 24460.0 | 24286.0 | 26570.0 |
| mean | 13284.5 | 127.826233 | 6.754046 | 7.240459 | 7.415883 | 8.232518 | 6.256568 | 17.791528 | 11.731988 | 17.127804 | ... | 11.430725 | 16.117711 | 19.172085 | 11.702464 | 15.652904 | 16.048444 | 14.995554 | 16.460727 | 701.269059 | 0.212608 |
| std | 7670.242662 | 39.03002 | 1.471852 | 1.456493 | 4.11669 | 4.199401 | 3.309109 | 1.0012 | 0.996085 | 0.996414 | ... | 0.999137 | 1.405978 | 1.520785 | 1.488838 | 1.155247 | 1.491923 | 1.549226 | 1.708935 | 123.304161 | 0.40916 |
| min | 0.0 | 33.16 | 5.0 | 5.0 | 0.0 | 0.0 | 0.0 | 13.968 | 8.008 | 12.073 | ... | 7.537 | 9.323 | 12.461 | 5.167 | 10.89 | 9.14 | 9.104 | 9.701 | 196.787 | 0.0 |
| 25% | 6642.25 | 99.9875 | 6.0 | 6.0 | 4.0 | 5.0 | 4.0 | 17.117 | 11.051 | 16.443 | ... | 10.757 | 15.209 | 18.17 | 10.703 | 14.89 | 15.057 | 13.957 | 15.268 | 618.9615 | 0.0 |
| 50% | 13284.5 | 122.39 | 6.0 | 8.0 | 7.0 | 8.0 | 6.0 | 17.787 | 11.733 | 17.132 | ... | 11.43 | 16.127 | 19.2115 | 11.717 | 15.6285 | 16.04 | 14.969 | 16.436 | 701.0245 | 0.0 |
| 75% | 19926.75 | 149.1525 | 8.0 | 8.0 | 10.0 | 11.0 | 8.0 | 18.469 | 12.41 | 17.805 | ... | 12.102 | 17.025 | 20.207 | 12.709 | 16.374 | 17.082 | 16.018 | 17.628 | 784.09025 | 0.0 |
| max | 26569.0 | 385.86 | 9.0 | 9.0 | 29.0 | 29.0 | 24.0 | 21.499 | 16.484 | 21.425 | ... | 15.412 | 22.479 | 25.64 | 17.663 | 22.713 | 22.303 | 21.626 | 24.094 | 1312.794 | 1.0 |


Basic column information

In [4]:
explore_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26570 entries, 0 to 26569
Data columns (total 26 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              26570 non-null  int64  
 1   product_code    26570 non-null  object 
 2   loading         26320 non-null  float64
 3   attribute_0     26570 non-null  object 
 4   attribute_1     26570 non-null  object 
 5   attribute_2     26570 non-null  int64  
 6   attribute_3     26570 non-null  int64  
 7   measurement_0   26570 non-null  int64  
 8   measurement_1   26570 non-null  int64  
 9   measurement_2   26570 non-null  int64  
 10  measurement_3   26189 non-null  float64
 11  measurement_4   26032 non-null  float64
 12  measurement_5   25894 non-null  float64
 13  measurement_6   25774 non-null  float64
 14  measurement_7   25633 non-null  float64
 15  measurement_8   25522 non-null  float64
 16  measurement_9   25343 non-null  float64
 17  measurement_10  25270 non-null 
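
The non-null counts above show gaps in `loading` and in the measurement columns from `measurement_3` onward. As a minimal sketch, the missing values per column can be counted directly. The toy frame below is hypothetical: it borrows two column names from the dataset, but its values are illustrative and not read from `train.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (illustrative values, not taken from train.csv)
toy_df = pd.DataFrame({
    'loading': [80.1, np.nan, 82.43, 101.07],
    'measurement_3': [17.5, 18.2, np.nan, np.nan],
})

# Count missing values per column and express them as a share of all rows
missing_count = toy_df.isna().sum()
missing_share = toy_df.isna().mean()
print(missing_count)
print(missing_share)
```

Applied to `explore_df`, the same two calls would give the per-column gaps that `info()` only shows indirectly.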

In [5]:
explore_df['product_code'].unique()

array(['A', 'B', 'C', 'D', 'E'], dtype=object)

In [6]:
explore_df['attribute_0'].unique()

array(['material_7', 'material_5'], dtype=object)

In [21]:
# Strip every non-digit character so that 'material_7' becomes '7';
# regex=True is passed explicitly because the default differs across pandas versions
explore_df['attribute_0'] = explore_df['attribute_0'].str.replace(r'\D', '', regex=True)

explore_df['attribute_0'].unique()


array(['7', '5'], dtype=object)

In [24]:
explore_df['attribute_0'] = explore_df['attribute_0'].astype(int)
explore_df['attribute_0'].unique()

array([7, 5])
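
The two cells above first strip the non-digit characters and then cast to `int`. As a hedged alternative, the digits could be captured in one step with `Series.str.extract`. This is a sketch on a hypothetical toy Series, not the notebook's own method.

```python
import pandas as pd

# Hypothetical sample of material labels in the same format as attribute_0
materials = pd.Series(['material_7', 'material_5', 'material_8'])

# Capture the run of digits and convert it to an integer code
codes = materials.str.extract(r'(\d+)', expand=False).astype(int)
print(codes.tolist())
```

Capturing the digits instead of deleting everything else keeps the pattern explicit about what is retained.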

In [7]:
explore_df['attribute_1'].unique()

array(['material_8', 'material_5', 'material_6'], dtype=object)

In [25]:
# Keep only the digits of 'material_N' and convert them to integers
explore_df['attribute_1'] = explore_df['attribute_1'].str.replace(r'\D', '', regex=True).astype(int)
explore_df['attribute_1'].unique()


array([8, 5, 6])

In [8]:
explore_df['attribute_2'].unique()

array([9, 8, 5, 6])

In [9]:
explore_df['attribute_3'].unique()

array([5, 8, 6, 9])

In [10]:
explore_df['measurement_0'].unique()

array([ 7, 14, 12, 13,  9, 11,  4, 10,  6,  8, 21, 15, 17, 18, 19, 16,  5,
       25,  3,  1, 23, 20, 22,  2, 26, 24,  0, 29, 27])

In [11]:
explore_df['measurement_1'].unique()

array([ 8,  3,  1,  2,  4,  6,  0,  9,  5,  7, 10, 12, 11, 13, 17, 14, 16,
       15, 18, 20, 24, 22, 21, 19, 23, 27, 25, 26, 29, 28])

In [26]:
explore_df.head()

| | id | product_code | loading | attribute_0 | attribute_1 | attribute_2 | attribute_3 | measurement_0 | measurement_1 | measurement_2 | ... | measurement_9 | measurement_10 | measurement_11 | measurement_12 | measurement_13 | measurement_14 | measurement_15 | measurement_16 | measurement_17 | failure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | A | 80.1 | 7 | 8 | 9 | 5 | 7 | 8 | 4 | ... | 10.672 | 15.859 | 17.594 | 15.193 | 15.029 | NaN | 13.034 | 14.684 | 764.1 | 0 |
| 1 | 1 | A | 84.89 | 7 | 8 | 9 | 5 | 14 | 3 | 3 | ... | 12.448 | 17.947 | 17.915 | 11.755 | 14.732 | 15.425 | 14.395 | 15.631 | 682.057 | 0 |
| 2 | 2 | A | 82.43 | 7 | 8 | 9 | 5 | 12 | 1 | 5 | ... | 12.715 | 15.607 | NaN | 13.798 | 16.711 | 18.631 | 14.094 | 17.946 | 663.376 | 0 |
| 3 | 3 | A | 101.07 | 7 | 8 | 9 | 5 | 13 | 2 | 6 | ... | 12.471 | 16.346 | 18.377 | 10.02 | 15.25 | 15.562 | 16.154 | 17.172 | 826.282 | 0 |
| 4 | 4 | A | 188.06 | 7 | 8 | 9 | 5 | 9 | 2 | 8 | ... | 10.337 | 17.082 | 19.932 | 12.428 | 16.182 | 12.76 | 13.153 | 16.412 | 579.885 | 0 |


In [27]:

# Define the one-hot encoder
encoder = OneHotEncoder()
# OneHotEncoder expects 2-D input, so select the column with double brackets
# (a 1-D Series raises a ValueError)
onehot = encoder.fit_transform(explore_df[['product_code']])
explore_df.head()
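
`OneHotEncoder` works on 2-D input, which is why a single column has to be passed as a one-column frame, not as a Series. The following is a minimal sketch on a hypothetical toy frame; the product codes are illustrative, and `.toarray()` is used only to make the default sparse result printable.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with a single categorical column
toy_df = pd.DataFrame({'product_code': ['A', 'B', 'C', 'A']})

encoder = OneHotEncoder()
# Double brackets keep the selection 2-D (shape (4, 1)), as the encoder requires
onehot = encoder.fit_transform(toy_df[['product_code']])
onehot_dense = onehot.toarray()

print(encoder.categories_)  # categories learned for each input column
print(onehot_dense)         # one row per sample, one column per category
```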

## Univariate Analysis <a class="anchor" id="univariate"></a>
<a href="#top">Back to top of page</a>

In [None]:
# Histograms for univariate analysis
for col in explore_df:
    fig = px.histogram(x = explore_df[col],
                       title='Histogram for {}'.format(col),
                       labels={'x':col},
                       nbins=30,
                       color_discrete_sequence=colorblind_seq,
                       width=700,
                       height=500)

    fig.show()

**Inference:** Na_to_K has a positive skew (a longer right tail).

In [None]:
# Skewness
print("Skewness of Na_to_K: %f" % explore_df['Na_to_K'].skew())

Skewness of Na_to_K: 1.039341
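
For reference, `Series.skew()` reports the bias-corrected sample skewness. The sketch below recomputes it by hand on a hypothetical right-skewed sample (the values are illustrative, not taken from the dataset) and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample standing in for Na_to_K
values = pd.Series([7.8, 9.4, 10.1, 13.1, 15.0, 18.0, 25.4, 38.2])

n = len(values)
deviations = values - values.mean()
m2 = (deviations ** 2).mean()  # second central moment
m3 = (deviations ** 3).mean()  # third central moment
g1 = m3 / m2 ** 1.5            # uncorrected skewness

# Bias-corrected skewness, the estimator pandas uses
manual_skew = np.sqrt(n * (n - 1)) / (n - 2) * g1
print(manual_skew, values.skew())
```

A value above zero means the right tail is longer, which matches the inference drawn from the histogram.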


**Inference:** Drug Y is used more often than any of the other drugs.

## Bivariate Analysis <a class="anchor" id="bivariate"></a>
<a href="#top">Back to top of page</a>

In [None]:
# Scatter plots of each numerical feature against the target
numerical_cols = ['Age', 'Na_to_K']
for col in numerical_cols:
    fig = px.scatter(x=explore_df['Drug'],
                     y=explore_df[col],
                     title='Scatter for {}'.format(col),
                     labels={'x': 'Drug', 'y': col},
                     color_discrete_sequence=colorblind_seq,
                     width=700,
                     height=500)
    fig.show()

## Multivariate Analysis <a class="anchor" id="multivariate"></a>
<a href="#top">Back to top of page</a>

In [None]:
# Grouped histograms of the target, colored by each categorical feature
category_cols = ['Sex', 'BP', 'Cholesterol']
for col in category_cols:
    fig = px.histogram(x=explore_df['Drug'],
                       title='Histogram for {}'.format(col),
                       barmode='group',
                       color=explore_df[col],
                       labels={'x': 'Drug', 'color': col},
                       nbins=30,
                       color_discrete_sequence=colorblind_seq,
                       width=700,
                       height=500)
    fig.show()

## Data Preprocessing <a class="anchor" id="preprocess"></a>
<a href="#top">Back to top of page</a>

Create 7 groups out of the 'Age' column

In [None]:
# Make age into 7 groups
bin_age = [0, 19, 29, 39, 49, 59, 69, 80]
category_age = ['<20s', '20s', '30s', '40s', '50s', '60s', '>70s']
explore_df['Age_binned'] = pd.cut(explore_df['Age'], bins=bin_age, labels=category_age)
explore_df.Age_binned.unique()

['20s', '40s', '60s', '30s', '>70s', '50s', '<20s']
Categories (7, object): ['<20s' < '20s' < '30s' < '40s' < '50s' < '60s' < '>70s']
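
By default `pd.cut` builds right-inclusive intervals, so with the edges above an age of 19 falls in `'<20s'` and an age of 20 falls in `'20s'`. A minimal sketch on hypothetical ages, reusing the bin edges and labels from the cell above:

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([15, 19, 20, 45, 74])

bin_age = [0, 19, 29, 39, 49, 59, 69, 80]
category_age = ['<20s', '20s', '30s', '40s', '50s', '60s', '>70s']

# Intervals are (0, 19], (19, 29], ..., (69, 80]
age_binned = pd.cut(ages, bins=bin_age, labels=category_age)
print(age_binned.tolist())

# Values outside the outermost edges are not assigned to any bin
out_of_range = pd.cut(pd.Series([85]), bins=bin_age, labels=category_age)
print(out_of_range.isna().all())
```

Any age above 80 would become `NaN` under these edges, so the upper edge should be at least the maximum age in the data.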

Combine BP and Cholesterol into one column

In [None]:
# Create new column that combines both BP and Cholesterol
explore_df['BP_Chol'] = explore_df['BP'].str.cat(explore_df['Cholesterol'],sep=" ")
explore_df.BP_Chol.unique()

array(['HIGH HIGH', 'LOW HIGH', 'NORMAL HIGH', 'LOW NORMAL',
       'HIGH NORMAL', 'NORMAL NORMAL'], dtype=object)

Create 4 groups out of Na_to_K

In [None]:
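# Make Na_to_K into 4 groups; the bin edges match the labels, and right=False
# makes each bin include its lower edge, e.g. [10, 20)
bin_NatoK = [0, 10, 20, 30, 50]
category_NatoK = ['<10', '10-20', '20-30', '>30']
explore_df['Na_to_K_binned'] = pd.cut(explore_df['Na_to_K'], bins=bin_NatoK, labels=category_NatoK, right=False)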
explore_df.Na_to_K_binned.unique()

['20-30', '10-20', '<10', '>30']
Categories (4, object): ['<10' < '10-20' < '20-30' < '>30']

## Split and Encode the Data <a class="anchor" id="tt_split"></a>
<a href="#top">Back to top of page</a>

Split the data

In [None]:
# Split data into features and target
y_df = explore_df['Drug']
X_df = explore_df.drop(['Drug'],axis=1)

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2, random_state=22)

Encode the data

In [None]:
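# Encode the data; align the test columns to the training columns so that
# both frames share the same dummy features
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test).reindex(columns=X_train.columns, fill_value=0)
y_train, y_test = [pd.get_dummies(df) for df in [y_train, y_test]]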
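
Encoding the train and test frames separately can produce different dummy columns when a category is missing from one of the splits. The sketch below shows one way to align them on a hypothetical toy split; the `BP` values mirror the dataset's categories, but the rows are illustrative.

```python
import pandas as pd

# Hypothetical split in which the test set lacks the LOW and NORMAL categories
toy_train = pd.DataFrame({'BP': ['HIGH', 'LOW', 'NORMAL']})
toy_test = pd.DataFrame({'BP': ['HIGH', 'HIGH']})

train_enc = pd.get_dummies(toy_train)
# Reindex on the training columns: missing dummies are added and filled with 0
test_enc = pd.get_dummies(toy_test).reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc.columns.tolist())
```

Without the alignment step, a model fitted on three `BP` columns would receive a single-column test frame.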

In [None]:
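# classification_report returns 0.0 and emits a warning when there is no
# F1 score to calculate for a label; silence only that warning
import warnings
from sklearn.exceptions import UndefinedMetricWarning
warnings.filterwarnings('ignore', category=UndefinedMetricWarning)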

## Train and Test the Data on Random Forest Classifier <a class="anchor" id="train_rfc"></a>
<a href="#top">Back to top of page</a>

In [None]:
# Feed pipeline into GridSearchCV
pipeline = Pipeline([('scaler' ,StandardScaler()),
                     ('rfc',RandomForestClassifier())])

param_grid = {
    'rfc__max_depth': [4, 5, 10],
    'rfc__max_features': [2, 3],
    'rfc__min_samples_leaf': [3, 4, 5],
    'rfc__n_estimators': [100, 200, 300]}

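# Initialize; the plain 'f1' scorer only supports binary targets, so use the
# macro-averaged F1 for the multi-class drug target
grid_pipeline = GridSearchCV(pipeline, param_grid, n_jobs=-1, verbose=1, cv=5, scoring='f1_macro')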
# Fit
grid_pipeline.fit(X_train,y_train)
grid_pipeline.best_params_

Fitting 5 folds for each of 54 candidates, totalling 270 fits


{'rfc__max_depth': 4,
 'rfc__max_features': 2,
 'rfc__min_samples_leaf': 3,
 'rfc__n_estimators': 100}

In [None]:
# Train the data on the Random Forest Classifier and test the accuracy
rfc = RandomForestClassifier(max_depth=4, max_features=2, min_samples_leaf=3, n_estimators=100)
rfc.fit(X_train,y_train)

# Making predictions
y_train_pred = rfc.predict(X_train)
y_test_pred = rfc.predict(X_test)

Report results for Random Forest Classifier <a class="anchor" id="rfc_results"></a>

In [None]:
# Get Scores
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Results for Random Forest Classifier')
print('The training accuracy is',round((train_accuracy*100),2),'%')
print('The test accuracy is', round((test_accuracy*100),2),'%')

Results for Random Forest Classifier
The training accuracy is 72.5 %
The test accuracy is 57.5 %


In [None]:
print(classification_report(y_test, y_test_pred, labels=np.unique(y_test_pred)))

              precision    recall  f1-score   support

           0       1.00      0.90      0.95        20
           1       0.00      0.00      0.00         7

   micro avg       1.00      0.67      0.80        27
   macro avg       0.50      0.45      0.47        27
weighted avg       0.74      0.67      0.70        27
 samples avg       0.45      0.45      0.45        27
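
The report above lists only labels `0` and `1` because the target was one-hot encoded. As a sketch of the alternative, `classification_report` can be called on 1-D class labels, which gives one row per class. The label names and predictions below are hypothetical and are not the model's output.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true and predicted class labels (1-D, not one-hot encoded)
y_true = ['drugY', 'drugX', 'drugA', 'drugY']
y_pred = ['drugY', 'drugX', 'drugY', 'drugY']

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 reports 0.0 without a warning for a class that is never predicted
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

print(accuracy)
print(report['drugA'])
```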



## Train and Test the Data on Decision Tree Classifier <a class="anchor" id="train_dtc"></a>
<a href="#top">Back to top of page</a>

In [None]:
# Feed pipeline into GridSearchCV
pipeline = Pipeline([('scaler', StandardScaler()),
                     ('dtc', DecisionTreeClassifier())])

param_grid = {
    'dtc__min_samples_leaf':[5,10,15],
    'dtc__criterion':['gini', 'entropy'],
    'dtc__max_depth':[2,4,6,8,10,12]}
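# Initialize; the plain 'f1' scorer only supports binary targets, so use the
# macro-averaged F1 for the multi-class drug target
grid_pipeline = GridSearchCV(pipeline, param_grid, n_jobs=-1, verbose=1, cv=5, scoring='f1_macro')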
# Fit
grid_pipeline.fit(X_train,y_train)
grid_pipeline.best_params_

Fitting 5 folds for each of 36 candidates, totalling 180 fits


{'dtc__criterion': 'gini', 'dtc__max_depth': 2, 'dtc__min_samples_leaf': 5}
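
In scikit-learn the `'f1'` scorer is defined for binary targets, while averaged variants such as `'f1_macro'` also handle multi-class targets. The sketch below runs a small grid search with macro F1 on synthetic data from `make_classification`. The data, the reduced grid, and the `toy_` names are assumptions for illustration, not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class problem standing in for the drug target
X_toy, y_toy = make_classification(n_samples=150, n_features=6, n_informative=4,
                                   n_redundant=0, n_classes=3, random_state=0)

toy_pipeline = Pipeline([('scaler', StandardScaler()),
                         ('dtc', DecisionTreeClassifier(random_state=0))])
toy_grid = {'dtc__max_depth': [2, 4],
            'dtc__min_samples_leaf': [5, 10]}

# Macro F1 averages the per-class F1 scores, so it is defined for three classes
toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=3, scoring='f1_macro')
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)
print(toy_search.best_score_)
```

If every candidate scores `NaN`, the search cannot rank them, so checking `best_score_` is a quick way to confirm that the scorer fits the target type.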

In [None]:
# Train the data on the Decision Tree Classifier and test the accuracy
dtc = DecisionTreeClassifier(criterion='gini', max_depth=2, min_samples_leaf=5)
dtc.fit(X_train,y_train)

# Making predictions
y_train_pred = dtc.predict(X_train)
y_test_pred = dtc.predict(X_test)

Report results for Decision Tree Classifier <a class="anchor" id="results_dtc"></a>

In [None]:
# Get Scores
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Results for Decision Tree Classifier')
print('The training accuracy is',round((train_accuracy*100),2),'%')
print('The test accuracy is', round((test_accuracy*100),2),'%')

Results for Decision Tree Classifier
The training accuracy is 83.12 %
The test accuracy is 85.0 %


In [None]:
print(classification_report(y_test, y_test_pred, labels=np.unique(y_test_pred)))

              precision    recall  f1-score   support

           0       0.95      1.00      0.98        20
           1       0.64      1.00      0.78         7

   micro avg       0.84      1.00      0.92        27
   macro avg       0.79      1.00      0.88        27
weighted avg       0.87      1.00      0.92        27
 samples avg       0.68      0.68      0.68        27

