<p align="center">
  <img src="data/titanic.jpg" width="700" height="300">
</p>

# Titanic Survival Prediction

# 1. Introduction

* Brief overview of the Titanic survival competition at Kaggle.  
* Purpose of studying this dataset, objective of this notebook, and intended audience.  
* Methodology followed.  
* Key findings and observations.  
* Challenges, areas for potential improvements.

**Dataset**  
[Kaggle Competition and dataset link](https://www.kaggle.com/c/titanic)

**Contact**  
Reach me for more information or discussion on [LinkedIn](https://www.linkedin.com/in/fatih-calik-469961237/), [GitHub](https://github.com/fatih-ml) or [Kaggle](https://www.kaggle.com/fatihkgg).

# 2. Initial Data Exploration and Preprocessing

## Importing the dependencies

In [17]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import data.fatih_eda as fc
import warnings
warnings.filterwarnings('ignore')

np.set_printoptions(suppress=True)

In [18]:
df = pd.read_csv('data/train.csv')
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## First Observations

In [19]:
df.shape

(891, 12)

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [21]:
categorical_features = df.select_dtypes(include=['object', 'category']).columns
numerical_features = df.select_dtypes(include='number').columns

In [22]:
df[categorical_features].describe().T

| | count | unique | top | freq |
|---|---|---|---|---|
| Name | 891 | 891 | Braund, Mr. Owen Harris | 1 |
| Sex | 891 | 2 | male | 577 |
| Ticket | 891 | 681 | 347082 | 7 |
| Cabin | 204 | 147 | B96 B98 | 4 |
| Embarked | 889 | 3 | S | 644 |


In [23]:
df[numerical_features].describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| PassengerId | 891.0 | 446.0 | 257.353842 | 1.0 | 223.5 | 446.0 | 668.5 | 891.0 |
| Survived | 891.0 | 0.383838 | 0.486592 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| Pclass | 891.0 | 2.308642 | 0.836071 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 |
| Age | 714.0 | 29.699118 | 14.526497 | 0.42 | 20.125 | 28.0 | 38.0 | 80.0 |
| SibSp | 891.0 | 0.523008 | 1.102743 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 |
| Parch | 891.0 | 0.381594 | 0.806057 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 |
| Fare | 891.0 | 32.204208 | 49.693429 | 0.0 | 7.9104 | 14.4542 | 31.0 | 512.3292 |


__Before moving on to plots and exploratory data analysis, one step is required:__  
* Drop `PassengerId`: it is only a running row identifier (1 to 891).  

__Additionally:__  
* The `Name`, `Ticket` and `Cabin` features need preprocessing or dropping. The summary statistics show why: `Name` is unique for every row, `Ticket` has 681 distinct values, and `Cabin` is mostly missing.
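
The observation about `Name`, `Ticket` and `Cabin` can be made concrete by dividing the number of unique values by the number of non-null values. The sketch below copies both numbers from the `describe()` output above; the 0.5 cut-off is an arbitrary assumption for illustration, not a rule used elsewhere in this notebook.

```python
# (non-null count, unique count) copied from the describe() output above
summary = {
    "Name": (891, 891),
    "Sex": (891, 2),
    "Ticket": (891, 681),
    "Cabin": (204, 147),
    "Embarked": (889, 3),
}

THRESHOLD = 0.5  # assumed cut-off for "too many distinct values"

# share of distinct values among the non-null entries of each column
unique_ratio = {col: unique / count for col, (count, unique) in summary.items()}

# columns whose values are mostly distinct need engineering or dropping
high_cardinality = [col for col, ratio in unique_ratio.items() if ratio > THRESHOLD]
print(high_cardinality)  # ['Name', 'Ticket', 'Cabin']
```

`Sex` and `Embarked` stay far below the cut-off, which is why they can be one-hot encoded directly later on.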


In [24]:
df.drop(columns=['PassengerId'], inplace=True)

## Missing Values

In [25]:
df.isnull().sum()

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [26]:
# truncate fractional ages to whole years (Age is float64, not str); NaNs stay untouched
df['Age'] = df['Age'][df['Age'].notnull()].astype(int)

# fill missing ages with the mean age
df['Age'] = df['Age'].fillna(df['Age'].mean())

# fill missing Embarked values with the most frequent port ('S')
df['Embarked'] = df['Embarked'].fillna('S')
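
The fill values above can also be derived from the data instead of being hard-coded. The following is a minimal sketch on a toy frame (the `toy` and `filled` names are illustrative and not part of the notebook), assuming the same strategy of mean for `Age` and most frequent value for `Embarked`.

```python
import pandas as pd

# toy frame standing in for the Titanic columns with missing values
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0],
    "Embarked": ["S", None, "S"],
})

filled = toy.assign(
    # mean() skips NaN by default, so the fill value here is 24.0
    Age=toy["Age"].fillna(toy["Age"].mean()),
    # mode() returns a Series of the most frequent values; take the first one
    Embarked=toy["Embarked"].fillna(toy["Embarked"].mode()[0]),
)

print(filled.isnull().sum())
```

Assigning the result back, rather than calling `fillna(..., inplace=True)` on a selected column, avoids relying on chained in-place modification.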

# 3. Exploratory Data Analysis (EDA) and Feature Engineering

## Categorical Variables

**Transforming Names into Titles**

In [27]:
print(df['Name'].nunique(), ' Unique values\n')
df['Name'].sample(5)

891  Unique values



613                Horgan, Mr. John
108                 Rekic, Mr. Tido
382              Tikkanen, Mr. Juho
606               Karaic, Mr. Milan
172    Johnson, Miss. Eleanor Ileen
Name: Name, dtype: object

**Every name contains some sort of title, so the `Name` column can be summarized into a handful of title groups.**

In [28]:
# create an empty title column aligned with df's index
df['Title'] = pd.Series(index=df.index, dtype='object')

# we observed these titles with string manipulation methods
titles = ['Mr.', 'Mrs.', 'Miss.',  'Ms.', 'Master.', 'Mlle.', 'Mme.']
titles_special = ['Dr.', 'Sir', 'Col.', 'Capt.', 'Don.', 'Major.', 'Jonkheer.', 'Rev.', 'Countess.', 'Lady.']

# fill the title column (df.loc[i, ...] relies on the default RangeIndex)
for i, name in enumerate(df['Name']):
    for title_s in titles_special:
        if title_s in name:
            df.loc[i, 'Title'] = title_s
    # common titles are checked last, so they override a special-title match
    for title in titles:
        if title in name:
            df.loc[i, 'Title'] = title

# regroup the titles
for i, title in enumerate(df['Title']):
    if title in titles_special:
        df.loc[i, 'Title'] = 'Special_title'
    elif title in ['Ms.', 'Miss.', 'Mlle.', 'Mme.']:
        df.loc[i, 'Title'] = 'Miss'

# drop name column
df.drop(columns=['Name'], inplace=True)
df['Title'].value_counts()

Mr.              517
Miss             186
Mrs.             125
Master.           40
Special_title     23
Name: Title, dtype: int64
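
As an alternative to substring matching, the title can be pulled out with a regular expression, since each name follows the pattern `Surname, Title. Given names`. This is only a sketch: `extract_title` and `TITLE_PATTERN` are hypothetical helpers, not part of the notebook, and the sample names are taken from the outputs above.

```python
import re

# text between the comma and the first period, e.g. "Mr" in "Braund, Mr. Owen Harris"
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")

def extract_title(name):
    """Return the title found in a passenger name, or None if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None

sample_names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
]
print([extract_title(n) for n in sample_names])  # ['Mr', 'Miss', 'Mrs']
```

On the DataFrame this could be applied with `df['Name'].map(extract_title)` before `Name` is dropped. Note that the extracted titles carry no trailing period, so the grouping lists would need to be adjusted accordingly.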

**Transforming Tickets into Decks**

Let's understand the structure of the `Ticket` column.

In [59]:
print(df['Ticket'].nunique(), ' Unique values\n')

681  Unique values



In [60]:
df['Ticket'].sample(5)

439    C.A. 18723
100        349245
831         29106
768        371110
122        237736
Name: Ticket, dtype: object

In [64]:
ticket_no_only = {}
deck_space_ticket = {}
other = {}

for i, ticket in enumerate(df['Ticket']):
    if ticket.isdigit():
        ticket_no_only[i] = int(ticket)
    elif (' ' in ticket) and ticket[0].isalpha():
        deck_space_ticket[i] = ticket
    else:
        other[i] = ticket

**I am planning to create two columns**:  
1. `Deck` (assign `No_deck` when it cannot be derived)
2. `Ticket_no` (0 when it cannot be derived)
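
One way this plan might be implemented is sketched below. `split_ticket` is a hypothetical helper, and treating the text before the last space as the `Deck` value is an assumption based on the three ticket groups identified above (purely numeric, prefix plus number, and everything else).

```python
def split_ticket(ticket):
    """Split a raw ticket string into a (deck, ticket_no) pair."""
    if ticket.isdigit():
        # purely numeric ticket: no prefix available
        return "No_deck", int(ticket)
    if " " in ticket and ticket[0].isalpha():
        # prefix and number are separated by the last space
        prefix, number = ticket.rsplit(" ", 1)
        return prefix, (int(number) if number.isdigit() else 0)
    # anything else (e.g. "LINE") yields neither part
    return "No_deck", 0

for raw in ["C.A. 18723", "349245", "STON/O2. 3101282", "LINE"]:
    print(raw, "->", split_ticket(raw))
```

The pairs could then be unpacked into the two columns, for example with `parsed = df['Ticket'].map(split_ticket)` followed by `df['Deck'] = parsed.str[0]` and `df['Ticket_no'] = parsed.str[1]`.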

In [69]:
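# empty placeholder columns aligned with df's index
df['Ticket_no'] = pd.Series(index=df.index, dtype='float64')
df['Deck'] = pd.Series(index=df.index, dtype='object')

# print the purely numeric tickets
for i, ticket in enumerate(df['Ticket']):
    if i in ticket_no_only:
        print(ticket)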

113803
373450
330877
17463
349909
347742
237736
113783
347082
350406
248706
382652
244373
345763
2649
239865
248698
330923
113788
349909
347077
2631
19950
330959
349216
335677
113789
2677
345764
2651
7546
11668
349253
330958
370371
14311
2662
349237
3101295
2926
113509
19947
2697
2669
113572
36973
347088
2661
3101281
315151
2680
1601
348123
349208
374746
248738
364516
345767
345779
330932
113059
3101278
19950
343275
343276
347466
364500
374910
231919
244367
349245
349215
35281
7540
3101276
349207
343120
312991
349249
371110
110465
2665
324669
4136
2627
370369
11668
347082
237736
27267
35281
2651
370372
2668
347061
349241
228414
11752
113803
7534
2678
347081
365222
231945
350043
230080
244310
113776
35851
315037
371362
347068
315093
3101295
363291
113505
347088
1601
111240
382652
347742
17764
350404
4133
250653
347077
230136
315153
113767
370365
111428
364849
349247
234604
28424
350046
230080
368703
4579
370370
248747
345770
3101264
2628
347054
3101278
2699
367231
112277
250646
367229
3

In [55]:
counter_numeric_ticker = 0
counter_start_alpha = 0

for ticket in df['Ticket']:
    if ticket.isdigit():
        counter_numeric_ticker +=1
    elif ticket[0].isalpha() and (' ' in ticket):
        counter_start_alpha +=1
print(len(df))
print(counter_numeric_ticker, counter_start_alpha)

891
661 226


In [58]:
for ticket in df['Ticket']:
    if (' ' not in ticket) and (not ticket.isdigit()):
        print(ticket)

LINE
LINE
LINE
LINE


## Numerical Variables

# 4. Data Encoding and Preparation

In [85]:
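# the (891, 12) shape below implies Ticket and Cabin were dropped before this cell
categorical_features = df.select_dtypes(include=['object', 'category']).columns
# pass the columns by keyword: the second positional argument of get_dummies is `prefix`
dummies_df = pd.get_dummies(df, columns=categorical_features, drop_first=True)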

In [86]:
X = dummies_df.drop(columns=['Survived'])
y = df['Survived']

print(X.shape)
print(y.shape)

(891, 12)
(891,)
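
To illustrate how `drop_first=True` shapes the encoded frame, here is a small sketch on a toy frame (the `toy` and `encoded` names are illustrative). A feature with k levels produces k-1 columns because its first level in sorted order is dropped.

```python
import pandas as pd

toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# numeric columns pass through; each listed column is one-hot encoded
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)
print(list(encoded.columns))  # ['Pclass', 'Sex_male', 'Embarked_Q', 'Embarked_S']
```

The same arithmetic is consistent with the 12 columns above, assuming `Ticket` and `Cabin` are no longer in `df`: 5 numeric features, 1 column for `Sex`, 2 for `Embarked` and 4 for `Title`.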


# 5. Model Building

In [87]:
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

In [88]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [89]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train = X_train_scaled
X_test = X_test_scaled

In [90]:
# Initialize the models
logistic_regression = LogisticRegression()
random_forest = RandomForestClassifier()
xgboost = XGBClassifier()
gboost = GradientBoostingClassifier()

# Train the models
logistic_regression.fit(X_train, y_train)
random_forest.fit(X_train, y_train)
xgboost.fit(X_train, y_train)
gboost.fit(X_train, y_train)

In [91]:
# Make predictions
lr_predictions = logistic_regression.predict(X_test)
rf_predictions = random_forest.predict(X_test)
xgb_predictions = xgboost.predict(X_test)
gb_predictions = gboost.predict(X_test)

# Evaluate models
lr_accuracy = accuracy_score(y_test, lr_predictions)
rf_accuracy = accuracy_score(y_test, rf_predictions)
xgb_accuracy = accuracy_score(y_test, xgb_predictions)
gb_accuracy = accuracy_score(y_test, gb_predictions)

print("Logistic Regression Accuracy:", lr_accuracy)
print("Random Forest Accuracy:", rf_accuracy)
print("XGBoost Accuracy:", xgb_accuracy)
print("GBoost Accuracy:", gb_accuracy)

Logistic Regression Accuracy: 0.8100558659217877
Random Forest Accuracy: 0.8435754189944135
XGBoost Accuracy: 0.8379888268156425
GBoost Accuracy: 0.8324022346368715
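
With `test_size=0.2` on 891 rows the test set holds 179 passengers, so the gap between 0.8436 and 0.8380 corresponds to a single prediction. Cross-validation gives a steadier comparison, and the already imported `Pipeline` and `cross_val_score` could be combined roughly as follows. This is a sketch on synthetic data from `make_classification` (a stand-in for `X` and `y`), and `max_iter=1000` is an assumption, not a setting used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the encoded Titanic features (12 columns)
X_demo, y_demo = make_classification(n_samples=200, n_features=12, random_state=42)

# the scaler is refit on the training part of every fold, so no test-fold
# statistics leak into the scaling step
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

Swapping the `"model"` step for any of the other classifiers gives a mean and spread per model, rather than one number from one split.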


# Comparing the Results for Each Meaningful Iteration - Feature Engineering

In [92]:
# First iteration - default model parameters
# no feature engineering
# 'PassengerId', 'Name', 'Ticket', 'Cabin' dropped
# basic imputation performed
# X = (891, 8)

![image.png](attachment:ab91b994-7c0f-4475-99ef-fe1bab4a00cb.png)

In [94]:
# Second iteration - names are transformed into titles
# X = (891, 12)

![image.png](attachment:6225e070-7a57-436b-954e-aa645e609c43.png)

# Hyperparameter Tuning

# 6. Final Model and Deployment