# IF3170 Artificial Intelligence | Praktikum

This notebook serves as a template for the assignment. Please create a copy of this notebook to complete your work. You can add more code blocks, markdown blocks, or new sections if needed.


Group Number: xx

Group Members:
- Name (NIM)
- Name (NIM)

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt

# Import other libraries if needed

## Import Dataset

In [None]:
# Write your code here
df_train = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')

# 1. Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a crucial step in the data analysis process that involves examining and visualizing data sets to uncover patterns, trends, anomalies, and insights. It is the first step before applying more advanced statistical and machine learning techniques. EDA helps you to gain a deep understanding of the data you are working with, allowing you to make informed decisions and formulate hypotheses for further analysis.
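As a minimal sketch of a first EDA pass, the cell below computes the share of missing values per column, the class balance of the target, and the skewness of one numeric feature. The frame `toy` is a hypothetical stand-in for `df_train`, and its values are made up for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_train; the values are invented for illustration.
toy = pd.DataFrame({
    "Bilirubin": [0.7, 1.1, 3.4, np.nan, 0.9, 14.5],
    "Sex": ["F", "F", "M", "F", None, "F"],
    "Status": ["C", "C", "D", "C", "CL", "D"],
})

# Share of missing values per column, in percent
missing_pct = toy.isna().mean() * 100

# Class balance of the target
class_share = toy["Status"].value_counts(normalize=True)

# Skewness of a numeric feature (a large positive value suggests a long right tail)
bilirubin_skew = toy["Bilirubin"].skew()

print(missing_pct.round(2))
print(class_share.round(2))
print(f"Bilirubin skew: {bilirubin_skew:.2f}")
```

The same three calls can be run on `df_train` directly once it is loaded.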

In [None]:
df_train.drop('id', axis=1, inplace=True)

In [None]:
# Write your code here
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   N_Days         15000 non-null  float64
 1   Drug           8450 non-null   object 
 2   Age            15000 non-null  float64
 3   Sex            15000 non-null  object 
 4   Ascites        8453 non-null   object 
 5   Hepatomegaly   8448 non-null   object 
 6   Spiders        8441 non-null   object 
 7   Edema          15000 non-null  object 
 8   Bilirubin      15000 non-null  float64
 9   Cholesterol    6626 non-null   float64
 10  Albumin        15000 non-null  float64
 11  Copper         8340 non-null   float64
 12  Alk_Phos       8444 non-null   float64
 13  SGOT           8441 non-null   float64
 14  Tryglicerides  6575 non-null   float64
 15  Platelets      14416 non-null  float64
 16  Prothrombin    14984 non-null  float64
 17  Stage          15000 non-null  float64
 18  Status

# 2. Split Training Set and Validation Set

Splitting the data into a training set and a validation set gives an early diagnostic of how well the trained model generalizes. The split is done before the preprocessing steps to **avoid data leakage between the sets**. If you want to use k-fold cross-validation, split the data later and do the cleaning and preprocessing separately for each split.

Note: For training, you should use the data contained in the `train.csv` given by the TA. The `test.csv` data is only used for the Kaggle submission.
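The leakage point can be illustrated with a small sketch on made-up data: the split happens first, and the scaler learns its statistics from the training rows only. The `stratify=y` argument is an assumption added here to keep class proportions equal in both parts; the split used in this notebook does not set it.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features and labels, invented for illustration
X = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
y = pd.Series(["C"] * 60 + ["D"] * 30 + ["CL"] * 10)

# Split first; stratify=y (an assumption) preserves the class proportions
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # statistics come from training rows only
X_va_scaled = scaler.transform(X_va)      # validation rows reuse those statistics
```

Calling `fit_transform` on the validation rows instead would let validation statistics shape the features, which is the leakage described above.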

In [None]:
target_columns = ['Status']
# Stage is stored as a float but is treated as a categorical feature
cat_columns = [c for c in df_train.select_dtypes(include='object').columns
               if c not in target_columns] + ['Stage']
num_columns = [x for x in df_train.columns if x not in cat_columns and x not in target_columns]

In [None]:
X_train = df_train.drop(target_columns, axis=1)
y_train = df_train[target_columns]

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# 3. Data Cleaning and Preprocessing

This step comes first once a data scientist has a general understanding of the data. Raw data is **seldom ready for training**, so it must be cleaned and formatted before a machine learning model can interpret it.

By performing data cleaning and preprocessing, you ensure that your dataset is ready for model training, leading to more accurate and reliable machine learning results. These steps are essential for transforming raw data into a format that machine learning algorithms can effectively learn from and make predictions.

For each step you perform, **please explain why you did it. Write the explanation in a markdown cell under the code cell you wrote.**
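As one example of a cleaning step and its justification, the sketch below imputes a skewed numeric column with the median learned from the training rows. The two small frames are hypothetical; the point is that the median is less affected by the extreme value than the mean would be, and that the validation rows reuse the value learned from training.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and validation rows with one extreme value (900)
train = pd.DataFrame({"Platelets": [150.0, 200.0, np.nan, 250.0, 900.0]})
val = pd.DataFrame({"Platelets": [np.nan, 180.0]})

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)  # learns the median of the training rows
val_filled = imputer.transform(val)          # fills with the training median

print(imputer.statistics_)  # median of 150, 200, 250, 900
print(val_filled.ravel())
```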

In [None]:
len(X_train[X_train.isna().any(axis=1)]) * 100 /len(X_train)

58.35

In [None]:
missing_column = []
for column in num_columns:
    missing_percentage = X_train[column].isna().sum() / len(X_train) * 100
    if missing_percentage > 5:
        missing_column.append(column)
    print(f"Column {column} - {missing_percentage:.2f}% missing values")

Column N_Days - 0.00% missing values
Column Age - 0.00% missing values
Column Bilirubin - 0.00% missing values
Column Cholesterol - 56.03% missing values
Column Albumin - 0.00% missing values
Column Copper - 44.47% missing values
Column Alk_Phos - 43.82% missing values
Column SGOT - 43.84% missing values
Column Tryglicerides - 56.42% missing values
Column Platelets - 3.92% missing values
Column Prothrombin - 0.09% missing values


In [None]:
for column in cat_columns:
    missing_percentage = X_train[column].isna().sum() / len(X_train) * 100
    if missing_percentage > 5:
        missing_column.append(column)
    print(f"Column {column} - {missing_percentage:.2f}% missing values")

Column Drug - 43.81% missing values
Column Sex - 0.00% missing values
Column Ascites - 43.77% missing values
Column Hepatomegaly - 43.82% missing values
Column Spiders - 43.84% missing values
Column Edema - 0.00% missing values
Column Stage - 0.00% missing values


In [None]:
print('Platelets', ":", df_train['Platelets'].skew())

Platelets : 2.340967199225039


In [None]:
len(missing_column)

9

In [None]:
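class MissingValueHandler(BaseEstimator, TransformerMixin):
    """Drop columns with at least 10% missing values and impute the rest."""

    def __init__(self, isNum = False):
        self.isNum = isNum

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        # Decide which columns to keep using the data passed to fit only
        missing_ratio = X.isna().mean()
        self.keep_columns_ = list(missing_ratio[missing_ratio < 0.1].index)
        # Median for numeric columns, most frequent value for categorical ones
        strategy = 'median' if self.isNum else 'most_frequent'
        self.imputer_ = SimpleImputer(strategy=strategy)
        self.imputer_.fit(X[self.keep_columns_])
        return self

    def transform(self, X, y=None):
        X = pd.DataFrame(X)
        # Reuse the columns and fill values learned in fit; do not refit here
        imputed = self.imputer_.transform(X[self.keep_columns_])
        result = pd.DataFrame(imputed, columns=self.keep_columns_, index=X.index)
        return result.infer_objects()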

missing_value_handler = MissingValueHandler()
mo_X_train = missing_value_handler.fit_transform(X_train)
# mo_X_train_df = pd.DataFrame(mo_X_train, columns=[x for x in (list(cat_columns) + list(num_columns)) if x not in missing_column])

In [None]:
X_train['Age'].loc[X_train['Age']/365 > 70]

4026     28018.0
6104     26580.0
11129    25568.0
1732     25772.0
582      25568.0
          ...   
14737    25569.0
3073     25569.0
6235     25568.0
8433     26580.0
10583    28650.0
Name: Age, Length: 372, dtype: float64

In [None]:
from category_encoders import TargetEncoder
from sklearn.model_selection import train_test_split



In [None]:
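class AgeTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, isToCat = False):
        self.isToCat = isToCat

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Work on a copy so the caller's DataFrame is not modified
        X = X.copy()
        if not self.isToCat:
            # clip returns a new Series, so the result must be assigned back
            X['Age'] = X['Age'].clip(upper = 70 * 365)

        return X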

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Note: You can add or delete preprocessing components from this pipeline


In [None]:
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.datasets import make_classification

transformed_df = missing_value_handler.fit_transform(X_train[cat_columns])
X_train[transformed_df.columns] = transformed_df
# Train a Random Forest Classifier
X_train.columns
# rf = RandomForestClassifier(random_state=42, n_estimators=100)
# rf.fit(X_train, y_train)

Index(['N_Days', 'Drug', 'Age', 'Sex', 'Ascites', 'Hepatomegaly', 'Spiders',
       'Edema', 'Bilirubin', 'Cholesterol', 'Albumin', 'Copper', 'Alk_Phos',
       'SGOT', 'Tryglicerides', 'Platelets', 'Prothrombin', 'Stage'],
      dtype='object')

In [None]:
# # Get feature importances
# importances = rf.feature_importances_
# importance_df = pd.DataFrame({
#     'Feature': feature_names,
#     'Importance': importances
# }).sort_values(by='Importance', ascending=False)
# print(importance_df)

In [None]:
# import matplotlib.pyplot as plt

# # Plot feature importances
# plt.figure(figsize=(10, 6))
# plt.barh(importance_df['Feature'], importance_df['Importance'], color='skyblue')
# plt.gca().invert_yaxis()
# plt.title('Feature Importances')
# plt.xlabel('Importance')
# plt.show()

# 4. Compile Preprocessing Pipeline

All of the preprocessing classes or functions defined earlier will be compiled in this step.

If you use sklearn to create preprocessing classes, you can list your preprocessing classes in the Pipeline object sequentially, and then fit and transform your data.
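A minimal sketch of this pattern, using only built-in scikit-learn transformers on a hypothetical two-column frame, is shown below. The column names, the values, and the choices of `StandardScaler`, `handle_unknown="ignore"`, and `sparse_threshold=0` are assumptions made for illustration, not the pipeline used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data; "X" in the validation rows is a category unseen in training
train = pd.DataFrame({"Age": [50.0, 61.0, np.nan, 45.0],
                      "Sex": ["F", "M", "F", np.nan]})
val = pd.DataFrame({"Age": [np.nan, 70.0], "Sex": ["M", "X"]})

num_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # Unseen categories are encoded as all zeros instead of raising an error
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
# sparse_threshold=0 forces a dense array so the result is easy to inspect
preprocess = ColumnTransformer(
    [("num", num_steps, ["Age"]), ("cat", cat_steps, ["Sex"])],
    sparse_threshold=0,
)

train_prepared = preprocess.fit_transform(train)  # fit on training rows only
val_prepared = preprocess.transform(val)          # reuse the fitted steps
print(train_prepared.shape, val_prepared.shape)
```

Because every step is fitted once and then reused, the training and validation matrices are guaranteed to have the same columns in the same order.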

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PowerTransformer

# Note: You can add or delete preprocessing components from this pipeline
cat_pipeline = Pipeline([
    ('imputer', MissingValueHandler(isNum=False)),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

num_pipeline = Pipeline([
    ('imputer', MissingValueHandler(isNum=True)),
    ('transformer', PowerTransformer(method='yeo-johnson'))
])

num_cat_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_columns),
    ("cat", cat_pipeline, cat_columns)
])

final_pipeline = Pipeline([
    ('ageHandler', AgeTransformer()),
    ('num_cat', num_cat_pipeline)
])

# Fit the preprocessing on the training split only, then reuse it on the validation split
X_train_prepared = num_cat_pipeline.fit_transform(X_train, y_train)
X_test_prepared = num_cat_pipeline.transform(X_val)

In [None]:
X_train[cat_columns]

Unnamed: 0,Drug,Sex,Ascites,Hepatomegaly,Spiders,Edema,Stage
9839,D-penicillamine,F,N,Y,N,N,3.0
9680,Placebo,F,N,N,N,N,2.0
7093,,F,,,,N,3.0
11293,D-penicillamine,F,N,N,N,N,3.0
820,D-penicillamine,M,N,N,N,N,3.0
...,...,...,...,...,...,...,...
5191,,F,,,,N,3.0
13418,,F,,,,N,4.0
5390,,F,,,,N,4.0
860,D-penicillamine,F,N,N,N,N,3.0


In [None]:
# # Your code should work up until this point
# train_set = pipe.fit_transform(train_set)
# val_set = pipe.transform(val_set)

Or create your own here:

In [None]:
# Write your code here

# 5. Modeling and Validation

Modeling is the process of building machine learning models to solve a specific problem; in this assignment, that means predicting the probability of each class of the `Status` target (`Status_C`, `Status_CL`, `Status_D`). Validation is the process of evaluating the trained model on the validation set or with cross-validation, and reporting metrics that help you decide what to do in the next iteration of development.
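Since the models here are scored on predicted class probabilities, it may help to see how multi-class log loss is computed. The sketch below uses made-up labels and probabilities: the loss is the negative mean of the log of the probability assigned to each row's true class, and the manual value is compared with `sklearn.metrics.log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities, columns ordered as in `classes`
classes = ["C", "CL", "D"]
y_true = ["C", "D", "C", "CL"]
proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.1, 0.7],
    [0.6, 0.1, 0.3],
    [0.3, 0.4, 0.3],
])

# Probability that each row assigns to its true class
true_idx = [classes.index(label) for label in y_true]
p_true = proba[np.arange(len(y_true)), true_idx]

# Log loss is the negative mean log-probability of the true class
manual = -np.mean(np.log(p_true))
reference = log_loss(y_true, proba, labels=classes)
print(manual, reference)
```

A model that assigns a probability of exactly 0 to the true class is penalized very heavily under this metric, which is worth keeping in mind for models that output only a few distinct probability values.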

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import LabelBinarizer

# Initialize the LabelBinarizer
lb = LabelBinarizer()

# Fit and transform y_val to one-hot encoding
y_val = lb.fit_transform(y_val)

## KNN

In [None]:
# Type your code here
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_prepared, y_train)

# Predict
y_pred = knn.predict_proba(X_test_prepared)



In [None]:
y_pred

array([[1.        , 0.        , 0.        ],
       [0.66666667, 0.33333333, 0.        ],
       [0.66666667, 0.        , 0.33333333],
       ...,
       [1.        , 0.        , 0.        ],
       [0.        , 0.        , 1.        ],
       [1.        , 0.        , 0.        ]])

In [None]:
loss = log_loss(y_val, y_pred)
print(f'Log Loss: {loss}')

Log Loss: 3.2483402999805886


## Naive Bayes

In [None]:
# Type your code here
gaussiannb = GaussianNB()
gaussiannb.fit(X_train_prepared, y_train)

# Predict
# Use the fitted Naive Bayes model here, not the earlier KNN model
y_pred = gaussiannb.predict_proba(X_test_prepared)



In [None]:
y_val

array([[1, 0, 0],
       [0, 0, 1],
       [1, 0, 0],
       ...,
       [1, 0, 0],
       [0, 0, 1],
       [1, 0, 0]])

In [None]:
y_pred

array([[1.        , 0.        , 0.        ],
       [0.66666667, 0.33333333, 0.        ],
       [0.66666667, 0.        , 0.33333333],
       ...,
       [1.        , 0.        , 0.        ],
       [0.        , 0.        , 1.        ],
       [1.        , 0.        , 0.        ]])

In [None]:
loss = log_loss(y_val, y_pred)
print(f'Log Loss: {loss}')

Log Loss: 3.2483402999805886


## ID3

In [None]:
# Type your code here
dtree = DecisionTreeClassifier()
dtree.fit(X_train_prepared, y_train)

# Predict
y_pred = dtree.predict_proba(X_test_prepared)

loss = log_loss(y_val, y_pred)
print(f'Log Loss: {loss}')

Log Loss: 8.470258546442532


## SVM

## Logistic Regression

In [None]:
# Type your code here
logisticregression = LogisticRegression()
logisticregression.fit(X_train_prepared, y_train)

# Predict
y_pred = logisticregression.predict_proba(X_test_prepared)

loss = log_loss(y_val, y_pred)
print(f'Log Loss: {loss}')

Log Loss: 0.41823251905238945




In [None]:
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the y_train labels to numerical labels
y_train = label_encoder.fit_transform(y_train)



## Notes for improvements

- **Visualize the model evaluation result**

This helps you understand your model's performance in more detail. From the visualization, you can see whether your model favors one class over the others. (Hint: confusion matrix, ROC-AUC curve, etc.)
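As a small illustration of the confusion matrix hint, the sketch below builds one from made-up true and predicted labels and derives the per-class recall from it. The label arrays are hypothetical and do not come from the models in this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

classes = ["C", "CL", "D"]
# Made-up true and predicted labels, for illustration only
y_true = np.array(["C", "C", "D", "CL", "D", "C"])
y_hat = np.array(["C", "D", "D", "C", "D", "C"])

# Rows are true classes, columns are predicted classes, both ordered as `classes`
cm = confusion_matrix(y_true, y_hat, labels=classes)

# Recall per class: correct predictions divided by the number of true members
recall = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(dict(zip(classes, recall.round(2))))
```

In this toy case the `CL` row has no correct predictions, which is the kind of class-level weakness a single aggregate score can hide.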

- **Explore the hyperparameters of your models**

Each model has its own hyperparameters, and each hyperparameter affects the model's behaviour differently. You can optimize model performance by searching for a good set of hyperparameters, a process called **hyperparameter tuning**. (Hint: grid search, random search, Bayesian optimization)
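A grid search can be sketched as follows. The data come from `make_classification` rather than from this assignment, and the candidate values for `n_neighbors` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data standing in for the prepared training set
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Arbitrary candidate values; each one is evaluated with 5-fold cross-validation
param_grid = {"n_neighbors": [3, 15, 45]}
search = GridSearchCV(
    KNeighborsClassifier(), param_grid, scoring="neg_log_loss", cv=5
)
search.fit(X, y)

# scikit-learn maximizes scores, so log loss is reported with a negative sign
print(search.best_params_, -search.best_score_)
```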

- **Cross-validation**

Cross-validation is a critical technique in machine learning and data science for evaluating and validating the performance of predictive models. It provides a more **robust** and **reliable** evaluation than hold-out validation (a single train-validation split). However, it requires more time and computing power because the model is trained once per fold. (Hint: k-fold cross-validation, stratified k-fold cross-validation, etc.)
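A stratified k-fold evaluation might look like the sketch below, again on synthetic data. Wrapping the scaler and the model in one pipeline means the scaler is refitted on the training part of each fold, so no statistics leak from the held-out part. The choice of `StandardScaler` and `LogisticRegression` is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the raw training set
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Preprocessing and model in one object, refitted from scratch in every fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Each fold keeps roughly the same class proportions as the full data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_log_loss")

print(f"Log loss: {-scores.mean():.3f} (std {scores.std():.3f})")
```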

- **Ensemble methods**

Ensemble methods are powerful machine learning techniques that combine the predictions of multiple models (often referred to as base learners or weak learners) to create a stronger, more accurate predictive model. The idea behind ensemble methods is that by aggregating the opinions of multiple models, you can reduce the impact of individual model errors and improve overall prediction performance. (Hint: bagging, boosting, stacking, voting)
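The voting hint can be sketched with scikit-learn's `VotingClassifier`. With `voting="soft"` the ensemble averages the class probabilities of its members, which suits a probability-based metric such as log loss. The data are synthetic and the three base models are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the prepared training set
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

# Soft voting averages the predicted probabilities of the base models
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_tr, y_tr)
proba = ensemble.predict_proba(X_va)
print(proba[:3].round(3))
```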

- **Model interpretation**

Model interpretation is the process of understanding and explaining the inner workings of a machine learning model, particularly its decision-making process. Interpretation helps data scientists, stakeholders, and end-users gain insights into why a model makes certain predictions or classifications. Model interpretation is crucial for building trust in machine learning systems, identifying biases, and extracting actionable information from models. (Hint: feature importance, PDP, SHAP values, etc.)
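One model-agnostic way to estimate feature importance is permutation importance: shuffle one feature at a time on held-out data and measure how much the score degrades. The sketch below uses synthetic data in which, by construction (`shuffle=False`), the first three columns carry the signal; the model and the number of repeats are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: columns 0-2 are informative, columns 3-5 are noise
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, shuffle=False, random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Shuffle each feature several times and record the average drop in the score
result = permutation_importance(
    model, X_va, y_va, scoring="neg_log_loss", n_repeats=5, random_state=0
)
for i, mean_drop in enumerate(result.importances_mean):
    print(f"feature {i}: {mean_drop:.4f}")
```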

- **Explore other models**

There are many ML models you can use for this use case. Explore them and try them on this problem.

## Submission
To predict the test set target feature and submit the results to the Kaggle competition platform, do the following:
1. Create a new pipeline instance identical to the one in Data Preprocessing.
2. With the pipeline, apply `fit_transform` to the original training set (before splitting), then apply only `transform` to the test set.
3. Retrain the model on the preprocessed training set.
4. Predict the test set.
5. Make sure the submission contains the `id`, `Status_C`, `Status_CL`, and `Status_D` columns.
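The steps above can be sketched end to end on made-up data. The frames `full_train` and `test`, the feature names `f1` and `f2`, and the choice of model are all hypothetical; the sketch only shows the order of operations and how to name the probability columns from the fitted model's `classes_`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical full training set and test set
full_train = pd.DataFrame({
    "f1": rng.normal(size=90),
    "f2": rng.normal(size=90),
    "Status": ["C", "CL", "D"] * 30,
})
test = pd.DataFrame({
    "id": range(1000, 1005),
    "f1": rng.normal(size=5),
    "f2": rng.normal(size=5),
})
features = ["f1", "f2"]

# Steps 1-3: a fresh pipeline fitted on the full training set
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(full_train[features], full_train["Status"])

# Step 4: the pipeline only transforms the test rows before predicting
proba = model.predict_proba(test[features])

# Step 5: name the columns from the fitted classes so the order cannot drift
columns = [f"Status_{c}" for c in model[-1].classes_]
submission = pd.DataFrame(proba, columns=columns)
submission.insert(0, "id", test["id"].to_numpy())
print(submission.head())
```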

In [None]:
# Reuse the preprocessing fitted on the training data; only transform the test set
X_test_prepared = num_cat_pipeline.transform(df_test)

logisticregression = LogisticRegression()
logisticregression.fit(X_train_prepared, y_train)

# Predict
y_pred = logisticregression.predict_proba(X_test_prepared)
df_submission = pd.DataFrame(y_pred, columns=['Status_C', 'Status_CL', 'Status_D'])
df_submission['id'] = df_test['id']
df_submission = df_submission[['id', 'Status_C', 'Status_CL', 'Status_D']]
df_submission

Unnamed: 0,id,Status_C,Status_CL,Status_D
0,15000,0.140185,0.016349,0.843466
1,15001,0.775622,0.005845,0.218533
2,15002,0.935002,0.003103,0.061895
3,15003,0.901499,0.019647,0.078854
4,15004,0.881654,0.008954,0.109391
...,...,...,...,...
9995,24995,0.908816,0.019327,0.071856
9996,24996,0.634035,0.015644,0.350321
9997,24997,0.124723,0.007695,0.867582
9998,24998,0.010880,0.001661,0.987458


In [None]:
# Write the DataFrame to a CSV file
df_submission.to_csv('submission.csv', index=False)

Unexpected exception formatting exception. Falling back to standard exception


Traceback (most recent call last):
  File "C:\Users\abdul\AppData\Roaming\Python\Python310\site-packages\IPython\core\interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "C:\Users\abdul\AppData\Local\Temp\ipykernel_46948\220050157.py", line 2, in <module>
    df_submission.to_csv('submission.csv', index=False)
  File "c:\Users\abdul\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\util\_decorators.py", line 211, in wrapper
    raise TypeError(msg)
  File "c:\Users\abdul\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\generic.py", line 3720, in to_csv
    str or None
  File "c:\Users\abdul\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\util\_decorators.py", line 211, in wrapper
    raise TypeError(msg)
  File "c:\Users\abdul\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\io\formats\format.py", line 1162, in to_csv
  File "c:\Users\abdul\AppData\Local\Programs\P

# 6. Error Analysis

Based on everything you have done up to the modeling and evaluation step, write an analysis that supports each step you took to solve this problem. Write the analysis in a markdown block. Some questions that may help you write the analysis:

- Does my model perform better in predicting one class than the other? If so, why is that?
- Of the models I have tried, which performs best, and what could be the reason?
- Is it better for me to impute or drop the missing data? Why?
- Does feature scaling help improve my model performance?
- etc...

`Provide your analysis here`