## Introduction to this notebook

After todays lession about 'Feature Engeneering' using 'ColumnTransformer()' and 'Pipline()' I would like to achive similar accuracy results like in the 3_WP notebook by using these functions in shorter time and with more concise code.

## 1. Load data and some basic EDA

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import math

# models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

from sklearn.model_selection import train_test_split

# new utils
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, KBinsDiscretizer
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn import set_config

# to visualize the column transformer and pipeline
set_config(display='diagram')

In [2]:
df = pd.read_csv("./data/Titanic/train.csv")

In [None]:
df.head()

In [None]:
df.info()

In [None]:
sns.heatmap(df.isna());

## 3. Train-Test Split

In [3]:
y = df["Survived"]
X = df.loc[:, df.columns != "Survived"]

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 85)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((712, 11), (179, 11), (712,), (179,))

In [None]:
X.head()

## 3. Define ColumnTransformers

### 3.1 Problem: How to use the simple Python Functions with the FunctionTransformer()

In [1]:
# If i is np.isnan() check class and replace with median age of the class

def impute_age_class(df, column_1, column_2):
    for i in range(len(df)):
        if np.isnan(df[column_1].iloc[i]):
            if df[column_2].iloc[i] == 1:
                df[column_1].iloc[i] = 38
            elif df[column_2].iloc[i] == 2:
                df[column_1].iloc[i] = 30
            else:
                df[column_1].iloc[i] = 25
    return df

age_pipeline = Pipeline(steps = 
                         [("impute_age_class", FunctionTransformer(impute_age_class, validate=False, kw_args={'column_1': 'Age', 'column_2': 'Pclass'})),
                          ("Scale", StandardS)
])

NameError: name 'Pipeline' is not defined

In [6]:
# Merge the columns parent/children and sibling/spouse together, create a new column "Family" and bin the values

def merge_family(df, column_1, column_2):
    df["Family"] = df[column_1] + df[column_2]
    return df

family_pipeline = Pipeline(steps = 
                         [("create_family", FunctionTransformer(merge_family, validate=False, kw_args={'column_1': 'SibSp', 'column_2': 'Parch'})),
                          ("ohe", OneHotEncoder(handle_unknown="ignore"))
])

In [7]:
# Extract the title from "Name" and create a new column 

def extract_title(df, column_1):
    df["Title"] = df[column_1].map(lambda name:name.split(',')[1].split(".")[0].strip())
    return df

title_pipeline = Pipeline(steps = 
                         [("create_family", FunctionTransformer(extract_title, validate=False, kw_args={'column_1': 'Name'})),
                          ("ohe", OneHotEncoder(handle_unknown="ignore"))
])

In [8]:
numeric_features = ["Fare"]
numeric_transformer = StandardScaler()

categorical_features = ["Sex", "Pclass"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

embarked_transformer = Pipeline(steps=
                        [("imputer", SimpleImputer(strategy="most_frequent")), 
                       ("ohe", OneHotEncoder(handle_unknown="ignore"))
])

In [9]:
# Define the preprocessor

preprocessor = [
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
        ("embarked", embarked_transformer, ["Embarked"]),
        (("age_pipline", age_pipeline, ["Age", "Pclass"])),
        (('family_pipeline', family_pipeline, ["SibSp", "Parch"])),
        (('title_pipeline', title_pipeline, ["Name"]))
    ]

In [10]:
column_transformer = ColumnTransformer(preprocessor,
                                        remainder = 'drop')

In [11]:
column_transformer

## 4. Train ML models

### 4.1 Logistic Regression

### 4.1.1 Normal Model

In [12]:
log_reg_pipeline = Pipeline(steps = 
                        [('column_transformer', column_transformer),
                         ('log_reg', LogisticRegression(max_iter = 1000, class_weight = 'balanced'))
                        ])

In [13]:
log_reg_pipeline.fit(X_train, y_train)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 25
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 38
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 30


In [14]:
print(f"""The train accuracy of log_reg_pipeline is: {round(log_reg_pipeline.score(X_train,y_train),2)}
The test accuracy of log_reg_pipeline is: {round(log_reg_pipeline.score(X_test,y_test),2)}""")

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 25
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 38
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 30
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 25
A value is trying to be set on a copy of a slice from a DataFram

The train accuracy of log_reg_pipeline is: 0.89
The test accuracy of log_reg_pipeline is: 0.81


### 4.1.2 Evaluating classifiers

In [None]:
from sklearn.metrics import accuracy_score 

ypred = log_reg_pipeline.predict(X_train)
print(f"Accuracy: {round(accuracy_score(y_train, ypred),2)}")

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

print(f"""Precision = {round(precision_score(y_train,ypred),2)} 
Recall = {round(recall_score(y_train,ypred),2)}
F1 = {round(f1_score(y_train,ypred),2)}""")

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import plot_confusion_matrix

conf = confusion_matrix(y_train, ypred)
conf

In [None]:
plot_confusion_matrix(log_reg_pipeline, X_train, y_train, normalize=None)

In [None]:
disp = ConfusionMatrixDisplay(confusion_matrix=conf, display_labels=log_reg_pipeline.classes_)
disp.plot()
plt.show()

### 4.2 Random Forest

In [15]:
forest_pipeline = Pipeline(steps = 
                        [('column_transformer', column_transformer),
                         ('forest', RandomForestClassifier(n_estimators = 35, max_depth = 3))
                        ])

In [16]:
forest_pipeline.fit(X_train, y_train)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 25
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 38
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 30


In [None]:
X_train.isna().sum()

In [17]:
print(f"""The train accuracy of forest_pipeline is: {round(forest_pipeline.score(X_train,y_train),2)}
The test accuracy of forest_pipeline is: {round(forest_pipeline.score(X_test,y_test),2)}""")

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 25
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 38
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 30
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 25
A value is trying to be set on a copy of a slice from a DataFram

The train accuracy of forest_pipeline is: 0.8
The test accuracy of forest_pipeline is: 0.8


### 4.3 Support Vector Model

In [18]:
svc_pipeline = Pipeline(steps = 
                        [('column_transformer', column_transformer),
                         ('svc', SVC(kernel= "poly", C=1))
                        ])

In [19]:
svc_pipeline.fit(X_train, y_train)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 25
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 38
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 30


In [20]:
print(f"""The train accuracy of svc_pipeline is: {round(svc_pipeline.score(X_train,y_train),2)}
The test accuracy of svc_pipeline is: {round(svc_pipeline.score(X_test,y_test),2)}""")

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 25
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 38
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 30
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[column_1].iloc[i] = 25
A value is trying to be set on a copy of a slice from a DataFram

The train accuracy of svc_pipeline is: 0.87
The test accuracy of svc_pipeline is: 0.82


In [None]:
predictions = svc_pipeline.predict(X_test)

In [None]:
predictions