<a href="https://colab.research.google.com/github/andydaehn/Product_Lab_Results/blob/main/Product_Lab_Results.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab Report Analysis <a name="top"></a>

## Table of Contents
* [Abstract](#abstract)
* [Exploratory Data Analysis](#exploratory_data_analysis)
    * [Univariate Analysis](#univariate)
    * [Bivariate Analysis](#bivariate)
    * [Multivariate Analysis](#multivariate)
* [Data Preprocessing](#preprocess)
* [Split and Encode the Data](#tt_split)
* [Train and Test the Data on Random Forest Classifier](#train_rfc)
    * [Results for Random Forest Classifier](#rfc_results)
* [Train and Test the Data on Decision Tree Classifier](#train_dtc)
    * [Results for Decision Tree Classifier](#dtc_results)

## Abstract <a class="anchor" id="abstract"></a>


Information has been collected about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of five medications, Drug a, Drug b, Drug c, Drug x and y.

The goal is to build a model to find out which drug might be appropriate for a future patient with the same illness. The feature sets of this dataset are Age, Sex, Blood Pressure, and Cholesterol of patients, and the target is the drug that each patient responded to.

This analysis will help predict the best drug to use for a particular patient, based on medical data from 200 patients, by classifying the patient to the drug profile.

Comparing results from the Random Forest and Decision Tree Classifiers, The Decision Tree performed the best at 83.12%

This project is based off a Kaggle dataset by [Pratham Tripathi](https://www.kaggle.com/datasets/prathamtripathi/drug-classification)

In [1]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import plotly.express as px
import plotly.graph_objects as go

# Ensure visualizations can be viewed by all
colorblind_seq = ['#8856a7', '#9ebcda','#de2d26']

# Load the dataset
train_df = pd.read_csv('train.csv')
X_test = pd.read_csv('test.csv')

In [2]:
y_train = train_df['failure']
X_train = train_df.drop(['failure'],axis=1)

In [4]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26570 entries, 0 to 26569
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              26570 non-null  int64  
 1   product_code    26570 non-null  object 
 2   loading         26320 non-null  float64
 3   attribute_0     26570 non-null  object 
 4   attribute_1     26570 non-null  object 
 5   attribute_2     26570 non-null  int64  
 6   attribute_3     26570 non-null  int64  
 7   measurement_0   26570 non-null  int64  
 8   measurement_1   26570 non-null  int64  
 9   measurement_2   26570 non-null  int64  
 10  measurement_3   26189 non-null  float64
 11  measurement_4   26032 non-null  float64
 12  measurement_5   25894 non-null  float64
 13  measurement_6   25774 non-null  float64
 14  measurement_7   25633 non-null  float64
 15  measurement_8   25522 non-null  float64
 16  measurement_9   25343 non-null  float64
 17  measurement_10  25270 non-null 

In [5]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20775 entries, 0 to 20774
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              20775 non-null  int64  
 1   product_code    20775 non-null  object 
 2   loading         20552 non-null  float64
 3   attribute_0     20775 non-null  object 
 4   attribute_1     20775 non-null  object 
 5   attribute_2     20775 non-null  int64  
 6   attribute_3     20775 non-null  int64  
 7   measurement_0   20775 non-null  int64  
 8   measurement_1   20775 non-null  int64  
 9   measurement_2   20775 non-null  int64  
 10  measurement_3   20446 non-null  float64
 11  measurement_4   20366 non-null  float64
 12  measurement_5   20267 non-null  float64
 13  measurement_6   20151 non-null  float64
 14  measurement_7   20055 non-null  float64
 15  measurement_8   19929 non-null  float64
 16  measurement_9   19871 non-null  float64
 17  measurement_10  19708 non-null 

In [6]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26570 entries, 0 to 26569
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              26570 non-null  int64  
 1   product_code    26570 non-null  object 
 2   loading         26320 non-null  float64
 3   attribute_0     26570 non-null  object 
 4   attribute_1     26570 non-null  object 
 5   attribute_2     26570 non-null  int64  
 6   attribute_3     26570 non-null  int64  
 7   measurement_0   26570 non-null  int64  
 8   measurement_1   26570 non-null  int64  
 9   measurement_2   26570 non-null  int64  
 10  measurement_3   26189 non-null  float64
 11  measurement_4   26032 non-null  float64
 12  measurement_5   25894 non-null  float64
 13  measurement_6   25774 non-null  float64
 14  measurement_7   25633 non-null  float64
 15  measurement_8   25522 non-null  float64
 16  measurement_9   25343 non-null  float64
 17  measurement_10  25270 non-null 

In [7]:
# Format to be the same as the other attributes
X_train['attribute_0'] = X_train['attribute_0'].str.replace(r'\D', '').astype(int)
X_train['attribute_1'] = X_train['attribute_1'].str.replace(r'\D', '').astype(int)

  
  This is separate from the ipykernel package so we can avoid doing imports until


In [8]:
X_test['attribute_0'] = X_test['attribute_0'].str.replace(r'\D', '').astype(int)
X_test['attribute_1'] = X_test['attribute_1'].str.replace(r'\D', '').astype(int)

  """Entry point for launching an IPython kernel.
  


In [9]:
for x in range(18):
    X_train[f'measurement_{x}'].fillna(value=X_train[f'measurement_{x}'].mean(), inplace=True)
X_train['loading'].fillna(value=X_train['loading'].mean(), inplace=True)

In [10]:
for x in range(18):
    X_test[f'measurement_{x}'].fillna(value=X_test[f'measurement_{x}'].mean(), inplace=True)
X_test['loading'].fillna(value=X_test['loading'].mean(), inplace=True)

In [11]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26570 entries, 0 to 26569
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              26570 non-null  int64  
 1   product_code    26570 non-null  object 
 2   loading         26570 non-null  float64
 3   attribute_0     26570 non-null  int64  
 4   attribute_1     26570 non-null  int64  
 5   attribute_2     26570 non-null  int64  
 6   attribute_3     26570 non-null  int64  
 7   measurement_0   26570 non-null  int64  
 8   measurement_1   26570 non-null  int64  
 9   measurement_2   26570 non-null  int64  
 10  measurement_3   26570 non-null  float64
 11  measurement_4   26570 non-null  float64
 12  measurement_5   26570 non-null  float64
 13  measurement_6   26570 non-null  float64
 14  measurement_7   26570 non-null  float64
 15  measurement_8   26570 non-null  float64
 16  measurement_9   26570 non-null  float64
 17  measurement_10  26570 non-null 

In [12]:
y_train.size

26570

In [13]:
y_train = y_train.drop([-1])
y_train.size()

KeyError: ignored

In [17]:
#y_train = y_train.to_frame()

In [24]:
y_train.unique()

AttributeError: ignored

In [10]:
X_test.info()
# fig = px.histogram(x = X_train['loading'],
#                    #title='Histogram for {}'.format(col),
#                    #labels={'x':col},
#                    nbins=30,
#                    color_discrete_sequence=colorblind_seq,
#                    width=700,
#                    height=500)

# fig.show()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20775 entries, 0 to 20774
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              20775 non-null  int64  
 1   product_code    20775 non-null  object 
 2   loading         20775 non-null  float64
 3   attribute_0     20775 non-null  int64  
 4   attribute_1     20775 non-null  int64  
 5   attribute_2     20775 non-null  int64  
 6   attribute_3     20775 non-null  int64  
 7   measurement_0   20775 non-null  int64  
 8   measurement_1   20775 non-null  int64  
 9   measurement_2   20775 non-null  int64  
 10  measurement_3   20775 non-null  float64
 11  measurement_4   20775 non-null  float64
 12  measurement_5   20775 non-null  float64
 13  measurement_6   20775 non-null  float64
 14  measurement_7   20775 non-null  float64
 15  measurement_8   20775 non-null  float64
 16  measurement_9   20775 non-null  float64
 17  measurement_10  20775 non-null 

## Exploratory Data Analysis <a class="anchor" id="exploratory_data_analysis"></a>
<a href="#top">Back to top of page</a>

The first 5 rows of the dataset

In [12]:
# df_corr = train_df.corr() # Generate correlation matrix

# fig = go.Figure()
# fig.add_trace(
#     go.Heatmap(
#         x = df_corr.columns,
#         y = df_corr.index,
#         z = np.array(df_corr)
#     )
# )

In [13]:

# # define one hot encoding
# encoder = OneHotEncoder()
# # transform data
# onehot = encoder.fit_transform(train_df['product_code'])
# train_df.head()

## Split and Encode the Data <a class="anchor" id="tt_split"></a>
<a href="#top">Back to top of page</a>

Split the data

Encode the data

In [18]:
# Encode the data
X_train, X_test = [pd.get_dummies(df) for df in [X_train, X_test]]
y_train = [pd.get_dummies(df) for df in [y_train]]

In [19]:
''' Reports will return 0.0 (warning) when 
    there is not a f1 score to calculate for a label '''
import warnings
warnings.filterwarnings('ignore')

## Train and Test the Data on Random Forest Classifier <a class="anchor" id="train_rfc"></a>
<a href="#top">Back to top of page</a>

## Train and Test the Data on Decision Tree Classifier <a class="anchor" id="train_dtc"></a>
<a href="#top">Back to top of page</a>

In [20]:
# Feed pipeline into GridSearchCV
pipeline = Pipeline([('scaler', StandardScaler()),
                     ('dtc', DecisionTreeClassifier())])

param_grid = {
    'dtc__min_samples_leaf':[5,10,15],
    'dtc__criterion':['gini', 'entropy'],
    'dtc__max_depth':[2,4,6,8,10,12]}
# Initialize
grid_pipeline = GridSearchCV(pipeline, param_grid, n_jobs=-1, verbose=1, cv=5, scoring='f1')
# Fit
grid_pipeline.fit(X_train,y_train)
grid_pipeline.best_params_

ValueError: ignored

In [11]:
# Train the data on the Decision Tree Classifier and test the accuracy
dtc = DecisionTreeClassifier(criterion='gini', max_depth=2, min_samples_leaf=5)
dtc.fit(X_train,y_train)

# Making predictions
y_train_pred = dtc.predict(X_train)
y_test_pred = dtc.predict(X_test)

ValueError: ignored

Report results for Decision Tree Classifier <a class="anchor" id="results_dtc"></a>

In [None]:
# Get Scores
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Results for Decision Tree Classifier')
print('The training accuracy is',round((train_accuracy*100),2),'%')
print('The test accuracy is', round((test_accuracy*100),2),'%')

Results for Decision Tree Classifier
The training accuracy is 83.12 %
The test accuracy is 85.0 %


In [None]:
print(classification_report(y_test, y_test_pred, labels=np.unique(y_test_pred)))

              precision    recall  f1-score   support

           0       0.95      1.00      0.98        20
           1       0.64      1.00      0.78         7

   micro avg       0.84      1.00      0.92        27
   macro avg       0.79      1.00      0.88        27
weighted avg       0.87      1.00      0.92        27
 samples avg       0.68      0.68      0.68        27

