# Model Experimentation

* This notebook contains the implementation of various machine learning models on our dataset
* First we will start with data preprocessing, then we'll build and train various models
* The loss we are trying to minimise is cross-entropy (`log_loss`), and the metric we will use to measure the quality of predictions is `accuracy`

In [1]:
# Import required libraries
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import log_loss

from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

# Remove any warnings we are getting
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Importing the modified data
train = pd.read_csv('../data/train_modified.csv')
test = pd.read_csv('../data/test_modified.csv')

In [3]:
# View the first 5 rows of the training data
train.head()

Unnamed: 0,id,Drug,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage,Status,N_Days_Years
0,0,D-penicillamine,58,M,N,N,N,N,2.3,316.0,3.35,172.0,1601.0,179.8,63.0,394.0,9.7,3.0,D,2.7
1,1,Placebo,52,F,N,N,N,N,0.9,364.0,3.54,63.0,1440.0,134.85,88.0,361.0,11.0,3.0,C,7.1
2,2,Placebo,37,F,N,Y,Y,Y,3.3,299.0,3.55,131.0,1029.0,119.35,50.0,199.0,11.7,4.0,D,9.4
3,3,Placebo,50,F,N,N,N,N,0.6,256.0,3.5,58.0,1653.0,71.3,96.0,269.0,10.7,3.0,C,7.1
4,4,Placebo,45,F,N,Y,N,N,1.1,346.0,3.65,63.0,1181.0,125.55,96.0,298.0,10.6,4.0,C,2.2


In [4]:
# View the first 5 rows of the test dataset
test.head()

Unnamed: 0,id,Drug,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage,N_Days_Years
0,7905,D-penicillamine,54,F,N,Y,N,N,1.2,546.0,3.37,65.0,1636.0,151.9,90.0,430.0,10.6,2.0,10.5
1,7906,D-penicillamine,41,F,N,N,N,N,1.1,660.0,4.22,94.0,1257.0,151.9,155.0,227.0,10.0,2.0,6.8
2,7907,Placebo,36,F,N,Y,N,Y,2.0,151.0,2.96,46.0,961.0,69.75,101.0,213.0,13.0,4.0,0.1
3,7908,D-penicillamine,56,F,N,N,N,N,0.6,293.0,3.85,40.0,554.0,125.55,56.0,270.0,10.6,2.0,6.4
4,7909,D-penicillamine,60,F,N,Y,N,N,1.4,277.0,2.97,121.0,1110.0,125.0,126.0,221.0,9.8,1.0,4.4


## Data Preprocessing

This section involves splitting the data into training and validation sets, converting categorical data into numerical data, and scaling the features so that they are on a comparable scale and the solvers need less computation to converge.

In [5]:
# Checking for missing values
train.isnull().sum()

id               0
Drug             0
Age              0
Sex              0
Ascites          0
Hepatomegaly     0
Spiders          0
Edema            0
Bilirubin        0
Cholesterol      0
Albumin          0
Copper           0
Alk_Phos         0
SGOT             0
Tryglicerides    0
Platelets        0
Prothrombin      0
Stage            0
Status           0
N_Days_Years     0
dtype: int64

In [6]:
# Check for duplicate entries
train.duplicated().sum()

0

In [7]:
# Split the data into features and target
features = train.drop(columns=['Status'])
target = train['Status']

# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(features, target, test_size=0.1, random_state=42)

len(X_train), len(y_train), len(X_val), len(y_val)

(7114, 7114, 791, 791)

In [8]:
# Split the columns in the dataset into numerical and categorical cols
numeric_features = list(features.select_dtypes(exclude="object").columns)
categorical_features = list(features.select_dtypes(include="object").columns)

numeric_features, categorical_features

(['id',
  'Age',
  'Bilirubin',
  'Cholesterol',
  'Albumin',
  'Copper',
  'Alk_Phos',
  'SGOT',
  'Tryglicerides',
  'Platelets',
  'Prothrombin',
  'Stage',
  'N_Days_Years'],
 ['Drug', 'Sex', 'Ascites', 'Hepatomegaly', 'Spiders', 'Edema'])

In [9]:
# Create a transformer for numerical cols
num_scaler = StandardScaler()

numeric_transformer = Pipeline(
    steps=[
        ("num_scaler", num_scaler)
    ]
)

In [10]:
# Create a transformer for categorical cols
one_hot_encoder = OneHotEncoder(sparse_output=False)
cat_scaler = StandardScaler()

categorical_transformer = Pipeline(
    steps=[
        ("one_hot", one_hot_encoder),
        ("cat_scaler", cat_scaler)
    ]
)

In [11]:
# Combine the transformers together
preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numeric_features),
        ('categorical', categorical_transformer, categorical_features)
    ]
)

In [12]:
X_train = preprocessor.fit_transform(X_train)
X_val = preprocessor.transform(X_val)
test_arr = preprocessor.transform(test)
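
The cell above fits the preprocessor once on the training split, so the folds used later in cross validation all share the same scaling statistics. A common alternative is to put the preprocessor and the model into a single `Pipeline`, so that the transformers are refit inside every fold. The sketch below illustrates the idea on a small synthetic frame; `toy`, `toy_target`, `toy_preprocessor` and `model_pipeline` are hypothetical names, and the values are random stand-ins rather than the real data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real data (random values, illustration only)
rng = np.random.default_rng(42)
n = 300
toy = pd.DataFrame({
    "Age": rng.integers(30, 70, size=n),
    "Bilirubin": rng.gamma(2.0, 1.0, size=n),
    "Sex": rng.choice(["F", "M"], size=n),
    "Drug": rng.choice(["Placebo", "D-penicillamine"], size=n),
})
toy_target = rng.choice(["C", "CL", "D"], size=n, p=[0.6, 0.1, 0.3])

# Scale numeric columns and one-hot encode categorical columns
toy_preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", StandardScaler(), ["Age", "Bilirubin"]),
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Drug"]),
    ]
)

# Preprocessing and model in one estimator: both are refit on each training fold
model_pipeline = Pipeline(
    steps=[
        ("preprocess", toy_preprocessor),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)

pipeline_scores = cross_val_score(model_pipeline, toy, toy_target, cv=5)
print(f"mean accuracy over 5 folds: {pipeline_scores.mean():.3f}")
```

With this layout the raw (untransformed) frame is passed to `cross_val_score`, and the validation fold never influences the scaler or the encoder.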

## Model Experimentation

In this section, we are going to try various machine learning methods to classify our data and select the best performing model. We will use cross-validation to check whether each model generalizes.

The machine learning algorithms we will use are:
1. `LogisticRegression(multi_class="multinomial")`
2. `DecisionTreeClassifier()`
3. `GaussianNB()`
4. `LinearSVC(multi_class="crammer_singer")`
5. `RandomForestClassifier()`

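We are using `log_loss()` to measure the error; lower is better. Note that scikit-learn's `neg_log_loss` scorer, which we use below, reports the negated value, so scores closer to zero are better.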
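
As a small worked example of what log loss measures, the sketch below computes it by hand as the mean of the negative log of the probability assigned to the true class, and compares the result with `sklearn.metrics.log_loss`. The labels match our `Status` classes, but the probabilities are made-up numbers for illustration only.

```python
import numpy as np
from sklearn.metrics import log_loss

# Class order used for the probability columns (illustrative values only)
labels = ["C", "CL", "D"]
y_true = ["C", "D", "C"]
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.1, 0.7],
    [0.5, 0.2, 0.3],
])

# Manual computation: pick the probability given to the true class in each row
true_idx = [labels.index(c) for c in y_true]
manual_ll = -np.mean(np.log(y_prob[np.arange(len(y_true)), true_idx]))

# scikit-learn computation; `labels` is passed because "CL" is absent from y_true
sklearn_ll = log_loss(y_true, y_prob, labels=labels)

print(f"manual: {manual_ll:.4f}, sklearn: {sklearn_ll:.4f}")
```

A confident correct prediction contributes almost nothing to the loss, while a confident wrong prediction is penalised heavily.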


Let's write a function that prints the cross-validation scores for whichever model we pass in.

In [13]:
def cross_validate(model, x_train: np.ndarray = X_train, y_train: np.ndarray = y_train, cv: int = 5) -> None:
    """Print per-fold accuracy and negated log loss, with their mean and standard deviation."""
    scores_acc = cross_val_score(model, x_train, y_train, cv=cv)
    scores_ll = cross_val_score(model, x_train, y_train, cv=cv, scoring="neg_log_loss")
    print(f"scores (acc): {scores_acc}")
    print(f"mean acc= {scores_acc.mean()}")
    print(f"std of acc = {scores_acc.std()}")
    print("-" * 35)
    print(f"scores (log_loss): {scores_ll}")
    print(f"mean log loss = {scores_ll.mean()}")
    print(f"std of log loss = {scores_ll.std()}")
    print("-" * 35)
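
The helper above runs cross validation twice, once per metric. As an alternative sketch, scikit-learn's own `sklearn.model_selection.cross_validate` accepts several scorers in one call, so each fold is fitted only once. The function name `summarize_cv`, the alias `sk_cross_validate` and the synthetic data are assumptions made for this illustration; they are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate as sk_cross_validate

# Synthetic 3-class problem standing in for the preprocessed training data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

def summarize_cv(model, X, y, cv=5):
    """Return mean/std of accuracy and log loss from a single cross-validation run."""
    results = sk_cross_validate(
        model, X, y, cv=cv,
        scoring={"acc": "accuracy", "nll": "neg_log_loss"},
    )
    return {
        "mean_acc": results["test_acc"].mean(),
        "std_acc": results["test_acc"].std(),
        # Flip the sign so the reported log loss is positive (lower is better)
        "mean_log_loss": -results["test_nll"].mean(),
        "std_log_loss": results["test_nll"].std(),
    }

summary = summarize_cv(LogisticRegression(max_iter=1000), X_demo, y_demo)
print(summary)
```

Returning a dictionary rather than printing also makes it easier to collect the results of several models into one table.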

1. `LogisticRegression()`

In [14]:
log_reg = LogisticRegression(multi_class="multinomial")
cross_validate(log_reg)

scores (acc): [0.80042164 0.7962052  0.79761068 0.7962052  0.78973277]
mean acc= 0.7960350994758603
std of acc = 0.003507176975741483
-----------------------------------
scores (log_loss): [-0.51359069 -0.51440008 -0.52400864 -0.54901188 -0.52808354]
mean log loss = -0.5258189657367464
std of log loss = 0.012854871768910594
-----------------------------------


2. `DecisionTreeClassifier()`

In [15]:
tree = DecisionTreeClassifier()
cross_validate(tree)

scores (acc): [0.74279691 0.74349965 0.72101195 0.74490513 0.75457103]
mean acc= 0.7413569319784572
std of acc = 0.011025290970969631
-----------------------------------
scores (log_loss): [ -9.72646725  -9.39718581 -10.18239541  -9.14389239  -9.1503227 ]
mean log loss = -9.520052711292616
std of log loss = 0.3935982606392738
-----------------------------------


3. `GaussianNB()`

In [16]:
naive_bayes = GaussianNB()
cross_validate(naive_bayes)

scores (acc): [0.73366128 0.72663387 0.73787772 0.72312017 0.72151899]
mean acc= 0.7285624060417908
std of acc = 0.006256525800805681
-----------------------------------
scores (log_loss): [-2.46489811 -2.65405146 -2.64019825 -2.69466623 -2.43062246]
mean log loss = -2.5768873016517504
std of log loss = 0.1074890152661279
-----------------------------------


4. `LinearSVC(multi_class="crammer_singer")`

In [17]:
svc = LinearSVC(multi_class="crammer_singer")
cross_validate(svc)

scores (acc): [0.79831342 0.79479972 0.79901616 0.79479972 0.78902954]
mean acc= 0.7951917118110843
std of acc = 0.003539808839786178
-----------------------------------
scores (log_loss): [nan nan nan nan nan]
mean log loss = nan
std of log loss = nan
-----------------------------------
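
The `nan` log loss scores above most likely arise because `LinearSVC` does not implement `predict_proba`, which the `neg_log_loss` scorer needs; with warnings silenced, the failure only shows up as `nan`. One possible workaround, sketched below on synthetic data, is to wrap the classifier in `CalibratedClassifierCV`, which adds calibrated probability estimates. This is an assumption about how the gap could be closed, not something this notebook ran, and the calibrated model is not the same estimator as the plain `LinearSVC` scored above.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic 3-class problem standing in for the preprocessed training data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# The wrapper fits the SVC on internal folds and learns a mapping from
# decision scores to probabilities, which exposes predict_proba
calibrated_svc = CalibratedClassifierCV(LinearSVC(max_iter=5000), cv=3)

calibrated_scores = cross_val_score(
    calibrated_svc, X_demo, y_demo, cv=5, scoring="neg_log_loss"
)
print(f"all scores finite: {np.isfinite(calibrated_scores).all()}")
print(f"mean log loss: {-calibrated_scores.mean():.4f}")
```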


5. `RandomForestClassifier()`

In [18]:
rf = RandomForestClassifier()
cross_validate(rf)

scores (acc): [0.82853127 0.82782853 0.82009838 0.82150387 0.82348805]
mean acc= 0.8242900194019687
std of acc = 0.003361098578715117
-----------------------------------
scores (log_loss): [-0.48908514 -0.49855331 -0.5152573  -0.60532155 -0.56524541]
mean log loss = -0.5346925431811499
std of log loss = 0.04401304994645107
-----------------------------------


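**As a conclusion of our experiment, logistic regression has the best mean log loss (0.526, against 0.535 for the random forest), although the random forest has the higher accuracy (0.824 against 0.796).**
Since log loss is the metric we are minimising, let's fit logistic regression on the full training split.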

In [19]:
log_reg = LogisticRegression(multi_class="multinomial")
log_reg.fit(X_train, y_train)

In [23]:
y_preds = log_reg.predict_proba(test_arr)

In [24]:
y_preds[:5]

array([[0.83427209, 0.01340058, 0.15232733],
       [0.87892187, 0.04416433, 0.0769138 ],
       [0.05621209, 0.02066657, 0.92312134],
       [0.91395925, 0.01192822, 0.07411253],
       [0.81180543, 0.01638079, 0.17181378]])

In [28]:
# Read in the submissions file
submissions = pd.read_csv("../data/sample_submission.csv")
submissions.head()

Unnamed: 0,id,Status_C,Status_CL,Status_D
0,7905,0.628084,0.034788,0.337128
1,7906,0.628084,0.034788,0.337128
2,7907,0.628084,0.034788,0.337128
3,7908,0.628084,0.034788,0.337128
4,7909,0.628084,0.034788,0.337128


In [30]:
submissions['Status_C'] = y_preds[:, 0]
submissions['Status_CL'] = y_preds[:, 1]
submissions['Status_D'] = y_preds[:, 2]
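
The assignment above relies on the columns of `predict_proba` being in the order `C`, `CL`, `D`. scikit-learn orders these columns according to the fitted model's `classes_` attribute, which holds the sorted class labels. The sketch below, on random toy data with hypothetical names (`X_toy`, `y_toy`, `clf`, `proba_df`), shows one way to build the column names from `classes_` instead of hard-coding positions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Random toy data with the same three class labels (illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = np.array(["C", "CL", "D"] * 20)

clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy[:5])

# Name each probability column after the class it belongs to
proba_df = pd.DataFrame(proba, columns=[f"Status_{c}" for c in clf.classes_])
print(proba_df.columns.tolist())
```

Deriving the names from `classes_` keeps the mapping correct even if the set of labels changes.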

In [31]:
submissions.head()

Unnamed: 0,id,Status_C,Status_CL,Status_D
0,7905,0.834272,0.013401,0.152327
1,7906,0.878922,0.044164,0.076914
2,7907,0.056212,0.020667,0.923121
3,7908,0.913959,0.011928,0.074113
4,7909,0.811805,0.016381,0.171814


In [33]:
# Save the submissions file
submissions.to_csv('../data/log_reg_predictions.csv', index=False, header=True)

The above file scored 0.52556 (log loss) after submission, with rank 531.