# **Part IV: Modelling**

In this notebook, we build several classification models and compare them.

## 1. Preparations
- Load Modules and Datasets
- Preprocessing

I originally planned to integrate the preprocessing step into the pipeline, but I have changed my mind, because we first need to drop the irrelevant features from the original data (although that step could perhaps also be integrated into the pipeline).
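One possible way to fold the column-dropping step into the pipeline is sketched below. It relies on scikit-learn's `ColumnTransformer` with `remainder="drop"`, which discards every column that is not explicitly listed. The toy rows, the column choices, and the names `toy`, `column_step`, and `pipeline` are assumptions made for illustration; this is not the preprocessing implemented in `src.features`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows with Titanic-style column names (made-up values, illustration only).
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Name": ["a", "b", "c", "d", "e", "f"],
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Age": [22.0, 38.0, None, 35.0, 54.0, 27.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 51.86, 11.13],
    "Survived": [0, 1, 1, 0, 0, 1],
})
X_toy = toy.drop(columns="Survived")
y_toy = toy["Survived"]

# Only the listed columns are transformed and kept; remainder="drop"
# removes everything else (here PassengerId and Name).
column_step = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), ["Age", "Fare"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    ],
    remainder="drop",
)

pipeline = Pipeline([
    ("columns", column_step),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_toy, y_toy)

# Inspect what the model actually receives: 2 numeric + 2 one-hot columns.
transformed = pipeline.named_steps["columns"].transform(X_toy)
print(transformed.shape)
```

With this layout the raw frame can be passed straight to `fit` and `predict`, and the list of kept columns lives in one place.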

## 2. Modeling
- Model building
  - Tree Model
    - Decision Tree
  - Linear Model
    - Logistic Regression
  - Advanced Models with advanced preprocessing techniques
  - If the model is complex (maybe self-designed), we might need to write models in different files and import them
- Model comparison
  - method
    - cross validation (K-fold)
      - train/validation split
        - for biased situation
    - bootstrap
  - metrics (for classification problem)
    - accuracy
    - precision / recall
    - f1-score
    - auc
    - confusion matrix
- Model validation
  - Sanity Check
  - Sensitivity check ...

## 3. Prediction
- generate results

# 1. Preparations

We start by loading the necessary **modules** (and, optionally, setting up Google Colab). Then we import the Titanic **dataset**.

In [24]:
# import modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

import sys, os
sys.path.append(os.path.abspath(".."))
from src.features import FeaturePreprocessor_v1

In [26]:
# import dataset

train = pd.read_csv("../data/raw/train.csv")
train_copy = train.copy()
test = pd.read_csv("../data/raw/test.csv")

In [None]:
# evaluation metrics for classification problem

scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision',
    'recall': 'recall',
    'f1': 'f1',
    'roc_auc': 'roc_auc'
}
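As a hedged sketch of how a `scoring` dictionary like the one above might be used, the block below passes it to scikit-learn's `cross_validate` with a stratified K-fold split and compares a decision tree against logistic regression. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, `candidates`, and `summary` are assumptions so the example runs on its own; in the notebook the processed Titanic features would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): replaces the processed Titanic X and y.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Same metric names as the notebook's scoring dictionary.
scoring = {
    "accuracy": "accuracy",
    "precision": "precision",
    "recall": "recall",
    "f1": "f1",
    "roc_auc": "roc_auc",
}

candidates = {
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Stratified folds keep the class ratio roughly constant in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

summary = {}
for name, estimator in candidates.items():
    results = cross_validate(estimator, X_demo, y_demo, cv=cv, scoring=scoring)
    # cross_validate reports each metric under the key "test_<metric name>".
    summary[name] = {metric: results[f"test_{metric}"].mean() for metric in scoring}

for name, scores in summary.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Averaging over folds gives one number per metric and model, which is usually a steadier basis for comparison than a single train/validation split.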

# 2. Modeling

Different models may require different make_features functions, so the mapping between models and feature sets is many-to-many rather than one-to-one. I believe the most confusing part is feature selection: it can only be completed after inspecting ``print(train_processed.columns)``, so it cannot be finished inside make_features alone. One idea is to write a feature_selection function alongside it so that the steps together form a complete pipeline.
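The many-to-many idea could be sketched as a small registry of named feature sets plus a selection helper. Everything here is hypothetical: `FEATURE_SETS`, `select_features`, the `v1_basic` set, and the toy frame are illustration only, while the `v1_full` list mirrors the columns chosen later in this notebook.

```python
import pandas as pd

# Hypothetical registry: any model can be paired with any named feature set.
FEATURE_SETS = {
    "v1_basic": ["Age", "Fare", "Sex_male"],
    "v1_full": [
        "Age", "Fare", "Cabin_missing_indicator", "Sex_male",
        "Pclass_2", "Pclass_3", "Embarked_Q", "Embarked_S",
    ],
}


def select_features(frame, feature_set, target="Survived"):
    """Return (X, y) restricted to a named feature set.

    Raises KeyError if the processed frame lacks any requested column,
    so a mismatch between preprocessor and feature set surfaces early.
    """
    columns = FEATURE_SETS[feature_set]
    missing = [col for col in columns if col not in frame.columns]
    if missing:
        raise KeyError(f"Columns missing from processed frame: {missing}")
    return frame[columns], frame[target]


# Toy processed frame (made-up values) containing only a subset of columns.
processed_demo = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Name": ["x", "y", "z"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
    "Sex_male": [1, 0, 0],
})

X_basic, y_basic = select_features(processed_demo, "v1_basic")
print(list(X_basic.columns))
```

Keeping the feature lists in one dictionary means a new model only needs a new key, and the same preprocessor output can serve several feature sets.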

## 2.1 Data Preprocessing version I

In [28]:
# make_features may produce many features, but only the important ones are kept

preprocessor_v1 = FeaturePreprocessor_v1()
train_processed = preprocessor_v1.fit_transform(train)
print(train_processed.columns)

# I was thinking, maybe after this I could go back to complete features.py (but what if same preprocessor used for )

Index(['PassengerId', 'Survived', 'Name', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Cabin_missing_indicator', 'Sex_male', 'Pclass_2',
       'Pclass_3', 'Embarked_Q', 'Embarked_S', 'Embarked_nan', 'Fare_log'],
      dtype='object')
0.0


In [None]:
# Feature Selection
Features_selected_v1 = ['Age', 'Fare', 'Cabin_missing_indicator', 'Sex_male', 'Pclass_2', 'Pclass_3', 'Embarked_Q', 'Embarked_S']
X = train_processed[Features_selected_v1]
y = train_processed['Survived']

# train & validation split


## 2.2 Logistic Regression

In [32]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_val)
print("Accuracy:", accuracy_score(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))


Accuracy: 0.776536312849162
[[95 15]
 [25 44]]
              precision    recall  f1-score   support

           0       0.79      0.86      0.83       110
           1       0.75      0.64      0.69        69

    accuracy                           0.78       179
   macro avg       0.77      0.75      0.76       179
weighted avg       0.77      0.78      0.77       179
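To make the link between the confusion matrix and the classification report explicit, the short check below recomputes the headline numbers by hand from the four counts printed above. It assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, so the matrix unpacks as TN, FP, FN, TP. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = true class (0, 1), columns = predicted class (0, 1).
cm = np.array([[95, 15],
               [25, 44]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()

# Class 1 (survived)
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Class 0 (did not survive)
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)

print(f"accuracy          = {accuracy:.4f}")
print(f"class 1: precision = {precision_pos:.2f}, recall = {recall_pos:.2f}, f1 = {f1_pos:.2f}")
print(f"class 0: precision = {precision_neg:.2f}, recall = {recall_neg:.2f}, f1 = {f1_neg:.2f}")
```

The rounded values agree with the report: recall for class 1 (0.64) is the weakest figure, meaning 25 of the 69 survivors in the validation set were predicted as non-survivors.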



# 3. Baseline Model

A baseline model is selected after 