## Introduction to this notebook

With a very basic logistic regression we were able to achieve an accuracy of **0.68**. In this notebook we try to increase the accuracy through feature engineering (partly with scikit-learn's `ColumnTransformer`). Afterwards we fit a logistic regression and a random forest model to check whether the accuracy has improved.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
fulldf = pd.read_csv("./data/Titanic/train.csv")

In [3]:
fulldf.shape

(891, 12)

In [4]:
fulldf.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
fulldf.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [6]:
fulldf.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

## 1. Feature Engineering without "ColumnTransformers"

In [7]:
# Create a copy of the original dataset before starting the FE
# Saving the PassengerID column
df = fulldf.copy()
passengerId = df["PassengerId"]

### 1.1.1 Imputing the 177 missing ages with the group mean (by "Survived" and "Pclass")

In [8]:
# Calculate the mean age for each combination of "Survived" and "Pclass"
# (select "Age" before aggregating so the text columns are not averaged)
df.groupby(["Survived", "Pclass"])["Age"].mean().round(0)

Survived  Pclass
0         1         44.0
          2         34.0
          3         27.0
1         1         35.0
          2         26.0
          3         21.0
Name: Age, dtype: float64

In [9]:
# Fill the missing ages with the mean age of the matching "Survived"/"Pclass" group
df["Age"] = df["Age"].fillna(df.groupby(["Survived", "Pclass"])["Age"].transform("mean").round(0))
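The group-wise imputation above can be illustrated on a small toy frame. This is only a sketch: the values are made up, and it groups by "Pclass" alone, because grouping by "Survived" is only possible for rows where the label is known.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the Titanic file)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# transform("mean") returns one value per row: the mean age of that row's group
group_mean = toy.groupby("Pclass")["Age"].transform("mean").round(0)

# fillna only takes a value from group_mean where Age is missing
toy["Age_filled"] = toy["Age"].fillna(group_mean)
print(toy)
```

The existing ages stay untouched; only the two missing entries receive the mean of their class.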

### 1.1.2 Creating a new column with four bins for "Age"

In [10]:
df["Age_bins"] = pd.cut(df["Age"], bins=[0,20,40,60,81], labels = ("minor", "young adult", "adult", "elder"))
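A short sketch of how `pd.cut` assigns the labels may help here. By default the intervals are closed on the right, so the bins above are (0, 20], (20, 40], (40, 60] and (60, 81]. The ages below are illustrative values only.

```python
import pandas as pd

# Illustrative ages, chosen to sit on and around the bin edges
ages = pd.Series([0.42, 20.0, 20.5, 40.0, 60.0, 80.0])

age_bins = pd.cut(
    ages,
    bins=[0, 20, 40, 60, 81],
    labels=["minor", "young adult", "adult", "elder"],
)

# An age of exactly 20 still counts as "minor" because the right edge is included
print(age_bins.tolist())
```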

### 1.2.1 Combine the number of family members into one column

In [60]:
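# Sum the "SibSp" (siblings/spouses) and "Parch" (parents/children) columns into a new column "Family".
# Note: this cell must run before "SibSp" and "Parch" are dropped in section 3;
# re-running it afterwards raises the KeyError shown below.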
df["Family"] = df["SibSp"] + df["Parch"]
df["Family"].value_counts()

KeyError: 'SibSp'

### 1.2.2 Bin the size of the family into three categories

In [12]:
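# The bins are (0, 0.5], (0.5, 5] and (5, 10]. A family size of 0 lies outside the first bin,
# so pd.cut returns NaN for it; np.where then labels those rows "No family".
df["Family_size"] = pd.cut(df["Family"], bins=[0, 0.5, 5, 10], labels=["No family", "Small family", "Big family"])
df["Family_size"] = np.where(df["Family_size"].isnull(), "No family", df["Family_size"])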
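The following sketch shows why the second line is needed with these bin edges, and one possible alternative that avoids it. The alternative bin edges `[-1, 0, 5, 10]` are an assumption for illustration, not part of the original notebook.

```python
import pandas as pd

# Illustrative family sizes covering every category
family = pd.Series([0, 1, 5, 6, 10])
labels = ["No family", "Small family", "Big family"]

# Bins as used above: 0 is not inside (0, 0.5], so it becomes NaN
original = pd.cut(family, bins=[0, 0.5, 5, 10], labels=labels)

# Alternative (assumption): start below zero so that (-1, 0] captures family size 0
alternative = pd.cut(family, bins=[-1, 0, 5, 10], labels=labels)

print(original.tolist())
print(alternative.tolist())
```

Both variants assign the same labels once the NaN values of the first one are filled with "No family".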

### 1.3 Retrieve the title from the "Name" column and add it to a new column

In [13]:
# Names look like "Braund, Mr. Owen Harris": take the text between the comma and the first period
df["Title"] = df["Name"].map(lambda name: name.split(",")[1].split(".")[0].strip())
df["Title"].value_counts()

Mr              517
Miss            182
Mrs             125
Master           40
Dr                7
Rev               6
Mlle              2
Major             2
Col               2
the Countess      1
Capt              1
Ms                1
Sir               1
Lady              1
Mme               1
Don               1
Jonkheer          1
Name: Title, dtype: int64
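To make the extraction step easier to follow, here is a small sketch that applies the same split logic to three names from the dataset and compares it with a vectorized regular expression. The regex variant is an alternative offered for illustration, not the method used in this notebook.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
])

# Same logic as above: text after the comma, up to the first period
titles_split = names.map(lambda name: name.split(",")[1].split(".")[0].strip())

# Alternative: capture everything between ", " and the next "."
titles_regex = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles_split.tolist())
print(titles_regex.tolist())
```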

### 1.4 Create a new column "Cabin_Status" based on the information in the "Cabin" column

In [14]:
# We assume that passengers with NaN in the "Cabin" column did not have a cabin
# but slept in dorms or on deck. Hence we create a binary column "Cabin_Status" (1 = cabin recorded, 0 = none).
df["Cabin_Status"] = np.where(df["Cabin"].isna(), 0, 1)

### 1.5 Imputing the "Embarked" column

In [15]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(df[["Embarked"]])        # learn the most frequent value
df["Embarked"] = imputer.transform(df[["Embarked"]]).ravel()  # transform returns shape (n, 1); flatten it to one column
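As a minimal sketch of what the imputer does, the toy example below fills missing ports with the most frequent value. The data is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with two missing entries; "S" is the most frequent value
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

imputer = SimpleImputer(strategy="most_frequent")

# fit_transform expects 2-D input and returns a 2-D array of shape (n_samples, 1)
filled = imputer.fit_transform(toy[["Embarked"]])

# Flatten to 1-D before writing the values back into a single column
toy["Embarked"] = filled.ravel()
print(toy["Embarked"].tolist())
```

The learned fill value is stored in `imputer.statistics_`, so the same imputer can later be applied to new data with `transform`.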

In [16]:
df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_bins,Family,Family_size,Title,Cabin_Status
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,young adult,1,Small family,Mr,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,young adult,1,Small family,Mrs,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,young adult,0,No family,Miss,0


#### Next Steps:

* OneHotEncode the following categorical features:
    * Pclass
    * Sex
    * Embarked
    * Age_bins
    * Family_size
    * Title
    * Cabin_Status

* Scale the following features:
    * Age
    * Fare

* Retrieve the length of the "Name" column

* Drop the following columns:
    * SibSp
    * Parch
    * Ticket
    * Cabin

## 2. Feature Engineering with Column Transformers

In [17]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [18]:
# Split the columns into numeric and categorical features and define the column transformers
# "Age" and "Fare": scale (the missing ages were already imputed in section 1)
# "Embarked", "Age_bins", "Family_size", "Cabin_Status", "Sex", "Pclass" and "Title": one-hot encode
# "Name": calculate the length

numeric_features = ["Age", "Fare"]
numeric_transformer = StandardScaler()

categorical_features = ["Embarked", "Age_bins", "Family_size", "Cabin_Status", "Sex", "Pclass", "Title"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

In [19]:
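def name_length(df):
    # Return the string length of the first (only) column as a 2-D array of shape (n_samples, 1),
    # so that ColumnTransformer can stack it next to the other feature blocks
    length = df[df.columns[0]].str.len()
    return length.values.reshape(-1, 1)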
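The helper can be tried on its own before it goes into the `ColumnTransformer`. The sketch below wraps an equivalent function in a `FunctionTransformer` and applies it to two names from the dataset; the parameter name `frame` is chosen here only to avoid shadowing the notebook's `df`.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def name_length(frame):
    # Length of each string in the first column, reshaped to (n_samples, 1)
    length = frame[frame.columns[0]].str.len()
    return length.values.reshape(-1, 1)

toy = pd.DataFrame({"Name": ["Braund, Mr. Owen Harris", "Dooley, Mr. Patrick"]})

# FunctionTransformer passes the DataFrame through unchanged (validate=False by default)
lengths = FunctionTransformer(name_length).fit_transform(toy)
print(lengths)
```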

In [20]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
        ('name', FunctionTransformer(name_length), ['Name'])
    ]
)

In [21]:
# Define the pipeline with the logistic regression model

logreg = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression(max_iter=1000))]
)

## 3. Logistic Regression

In [22]:
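# Drop the columns that are no longer needed ("Name" stays because its length is used as a feature)
df = df.drop(["PassengerId", "Ticket", "Cabin", "SibSp", "Parch"], axis=1)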

In [23]:
X = df.drop("Survived", axis=1)
y = df["Survived"]

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [25]:
logreg.fit(X_train, y_train);

In [59]:
# For a classifier pipeline, score() returns the mean accuracy
print(f"Logistic Regression Results\nTest score: {round(logreg.score(X_test, y_test), 2)}\nTrain score: {round(logreg.score(X_train, y_train), 2)}")

Logistic Regression Results
Test score: 0.82
Train score: 0.83


## 4. Random Forest

In [47]:
# Define the pipeline with the random forest classifier
# Note: no random_state is set, so the scores can vary slightly between runs

rf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", RandomForestClassifier(n_estimators=40, max_depth=3))]
)

In [48]:
rf.fit(X_train,y_train);

In [53]:
y_test_pred = rf.predict(X_test)

In [None]:
y_train_pred = rf.predict(X_train)

In [57]:
# accuracy_score expects (y_true, y_pred)
print(f"Random Forest Results\nTest score: {round(accuracy_score(y_test, y_test_pred), 2)}\nTrain score: {round(accuracy_score(y_train, y_train_pred), 2)}")

Random Forest Results
Test score: 0.79
Train score: 0.8

### Result: With feature engineering, the test accuracy of the logistic regression increased from 0.68 to 0.82; the random forest reached 0.79.