## How can we predict survival on the Titanic?

Predict survival on the Titanic using passenger features like age, fare, class, and more. This project includes exploratory data analysis (EDA), feature engineering, and testing multiple classification models.

### Guiding Questions
- Which features are correlated with survival?
- Can we accurately predict survival with a simple model?
- Which model performs best on this dataset?

1. Import Statements

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score  # not used below; scoring is handled by cross_val_score

import warnings

2. Load and Preview Data

In [None]:
train_df = pd.read_csv('titanic/train.csv')
test_df = pd.read_csv('titanic/test.csv')
print(f"Passengers in train set: {train_df.shape[0]}\nPassengers in test set: {test_df.shape[0]}")

We have 891 rows in the training set and 418 in the test set — a roughly 7:3 ratio.

Next, we explore the dtype of each column, look at example Embarked and Cabin values to see how to engineer features from them, and check for null values.

In [None]:
print(train_df.dtypes)

### Dataset description

| Column | dtype | Description |
|--------|-------|-------------|
| PassengerId | int64 | unique passenger identifier |
| Survived | int64 | survival outcome (0 = died, 1 = survived) |
| Pclass | int64 | ticket class (1, 2, 3) |
| Name | object | passenger name as a string; in feature engineering we derive the title and name lengths from it |
| Sex | object | sex ("male", "female") |
| Age | float64 | passenger's age in years |
| SibSp | int64 | number of siblings/spouses aboard |
| Parch | int64 | number of parents/children aboard |
| Ticket | object | ticket number |
| Fare | float64 | passenger fare; right-skewed, so we add a log-transformed copy (Fare_log) |
| Cabin | object | cabin identifier of the form letter + number (e.g. C85, C123, B42, C148); from research, the letter corresponds to the deck, from which we engineer a Deck feature |
| Embarked | object | port of embarkation ('S' = Southampton, 'C' = Cherbourg, 'Q' = Queenstown) |

In [None]:
# Print both results; a notebook cell only displays its last expression
print(train_df["Embarked"].str[0].unique())
print(train_df["Cabin"].str[0].unique())

In [None]:
train_df.head()

In [None]:
train_df.info()

In [None]:
train_df.isna().sum()

In [None]:
test_df.head()

In [None]:
test_df.info()

In [None]:
test_df.isna().sum()

We see that Age and Cabin have null values in both the train and test datasets, Embarked has nulls only in the train set, and Fare has a null only in the test set. These need to be dealt with during cleaning.
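To compare missingness across columns at a glance, the raw `isna().sum()` counts can be turned into a small table with percentages. The sketch below is only an illustration: `missing_summary` is a hypothetical helper name, and the toy frame stands in for `train_df` with made-up values.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(1),
    })
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy stand-in for train_df; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
toy_missing = missing_summary(toy)
print(toy_missing)
```

Calling `missing_summary(train_df)` and `missing_summary(test_df)` side by side should make it easier to see which columns need attention in each set.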

Next, let's look at summary statistics for both datasets.

In [None]:
train_df.describe()

In [None]:
test_df.describe()

The train and test sets have similar summary statistics. At first glance, all columns have reasonable means and min/max values: Pclass is always 1-3, Age is in a plausible range, and SibSp, Parch, and Fare show no outright impossible values or outliers, though Fare looks very right-skewed.

We will look at each variable's distribution in more detail to see whether any of them should be preprocessed or cleaned.

### Data Cleaning & Feature Engineering

Examples:
- Visualize survival by class, sex, age, family, fare
- Show correlations (heatmap, groupby stats)
- Fill missing Age and Fare with the mean or median (the cells below use the mean).
- Create binary columns from Sex, Embarked, Deck.
- Engineer features like Fare_log, Has_Cabin (derived from cabin), or grouped titles (ended up not using grouped titles)

In [None]:
# Overall count plots for the categorical attributes: Survived, Sex, Embarked
# TODO: also plot each attribute split by died/survived

fig, axes = plt.subplots(figsize=(10, 3), nrows=1, ncols=3)
sns.countplot(x="Survived", data=train_df, ax=axes[0])
sns.countplot(x="Sex", data=train_df, ax=axes[1])
sns.countplot(x="Embarked", data=train_df, ax=axes[2])
fig.tight_layout()

We see that the majority of passengers in the train set perished, there were more men than women aboard, and most embarked at 'S' (Southampton).

In [None]:
# Distributions of the numerical attributes: Pclass, Age, SibSp, Parch, Fare
# (count plots for the discrete columns, histograms for the continuous ones)

fig1, axes1 = plt.subplots(figsize=(8, 3), nrows=1, ncols=3)
fig2, axes2 = plt.subplots(figsize=(8, 3), nrows=1, ncols=2)
sns.countplot(x="Pclass", data=train_df, ax=axes1[0])
sns.histplot(x="Age", data=train_df, ax=axes1[1])
sns.countplot(x="SibSp", data=train_df, ax=axes1[2])
sns.countplot(x="Parch", data=train_df, ax=axes2[0])
sns.histplot(x="Fare", data=train_df, ax=axes2[1])
fig1.tight_layout()
fig2.tight_layout()

Most passengers were in 3rd class, followed by 1st, then 2nd. Age looks roughly bell-shaped, though this has not been checked formally. SibSp and Parch are similarly right-skewed, with most passengers at 0. Fare is strongly right-skewed, so we normalize it with a log transform.

In [None]:
train_df['Fare_log'] = np.log1p(train_df['Fare'])  # log1p avoids log(0) errors
sns.histplot(x="Fare_log", data=train_df)
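The histogram gives a visual impression of the effect of the transform; a numeric check is to compare the sample skewness before and after `log1p`. This is a minimal sketch on made-up fares rather than the real column, so the printed numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up fares with a long right tail, standing in for train_df["Fare"].
fares = pd.Series([0.0, 7.25, 7.9, 8.05, 13.0, 26.0, 31.0, 71.3, 263.0, 512.3])

skew_raw = fares.skew()            # sample skewness of the raw values
skew_log = np.log1p(fares).skew()  # sample skewness after the log1p transform

print(f"raw skew: {skew_raw:.2f}, log1p skew: {skew_log:.2f}")
```

On the real data, the same two lines with `train_df["Fare"]` would show how much of the right skew the transform removes.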


Cabin is a string column, so we fill its missing values with "Unknown". We can also engineer a feature from it: the first letter is the deck the passenger stayed on. So:
- fill missing Cabin values with "Unknown"
- create a "Deck" column from the first letter of "Cabin" (so "U" marks an unknown deck).

In [None]:
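# Fill missing cabins with "Unknown", so the deck letter becomes "U"
# (assign back instead of a chained inplace fillna, which newer pandas versions warn about)
train_df['Cabin'] = train_df['Cabin'].fillna('Unknown')
train_df["Deck"] = train_df["Cabin"].str[0] # U = unknown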
train_df.head()
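The feature list at the top of this section mentions Has_Cabin, which is not built in the cells here. A minimal sketch of how it could be derived alongside Deck, on toy cabin values (the frame name `cabins` is just for illustration):

```python
import numpy as np
import pandas as pd

# Toy cabin values; NaN means the cabin was not recorded.
cabins = pd.DataFrame({"Cabin": ["C85", np.nan, "B42", np.nan, "E46"]})

# Has_Cabin: 1 when a cabin is recorded, 0 otherwise.
# This has to be computed before the missing values are filled.
cabins["Has_Cabin"] = cabins["Cabin"].notna().astype(int)

# Deck: first letter of the cabin, with "U" standing for unknown.
cabins["Deck"] = cabins["Cabin"].fillna("Unknown").str[0]

print(cabins)
```

Note the ordering: once Cabin has been filled with "Unknown", `notna()` is true everywhere, so Has_Cabin would need to be derived from `Deck != "U"` instead.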

Feature engineering for Name. Title and Surname Length were later dropped because they were not helpful; First Name Length is kept in the final feature set.
- Each name has the format "[surname], [title]. [first and middle names/nicknames]", from which we can derive surname length, title, and first-name length. I expect these to be more meaningful features than the raw name, so I extract the surname length and the title, then compute the length of the remaining first name.

In [None]:
train_df["Surname Length"] = train_df["Name"].str.split(',').str[0].str.len()
train_df["Title"] = train_df["Name"].str.split(',').str[1].str.split(' ').str[1]
train_df["First Name Length"] = train_df["Name"].str.split(',').str[1].str.split(". ", regex=False).str[1].str.len()
train_df.head()

In [None]:
train_df["Title"].value_counts()

In [None]:
train_df[train_df["Title"] == "the"]

Next, we group the titles into meaningful categories that take into account marital status (which could relate to class) as well as how rare the titles are (e.g. "the" comes from "the Countess").

In [None]:
title_mapping = {
    "Mr.": "Mr",
    "Mrs.": "Mrs",
    "Miss.": "Miss",
    "Ms.": "Miss",       # Does not indicate marital status; grouped with Miss
    "Mlle.": "Miss",     # French for Miss
    "Mme.": "Mrs",       # French for Mrs
    "Master.": "Master", # Usually boys under 12
    "Dr.": "Rare",       # Ambiguous: could be male or female
    "Rev.": "Rare",
    "Major.": "Rare",
    "Col.": "Rare",
    "Capt.": "Rare",
    "Sir.": "Rare",
    "Lady.": "Rare",
    "Don.": "Rare",
    "Jonkheer.": "Rare",
    "the": "Rare"
}

In [None]:
train_df["Title"] = train_df["Title"].map(title_mapping)
train_df.head()
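Splitting on spaces leaves multi-word titles truncated (hence the "the" entry), and `map` turns any title missing from the dictionary into NaN. One alternative, sketched here under the assumption that every name follows the "surname, title. rest" pattern, is a regular expression plus a "Rare" fallback. The shortened `common_titles` dictionary is hypothetical and its keys have no trailing period.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "Oliva y Ocana, Dona. Fermina",
])

# Capture everything between the comma and the first period.
raw_titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Hypothetical, shortened mapping for illustration.
common_titles = {"Mr": "Mr", "Mrs": "Mrs", "Miss": "Miss", "Master": "Master"}

# Titles missing from the mapping become "Rare" instead of NaN.
grouped_titles = raw_titles.map(common_titles).fillna("Rare")

print(pd.DataFrame({"raw": raw_titles, "grouped": grouped_titles}))
```

The fallback matters mostly when the same code is later applied to data containing titles the mapping has never seen.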

Encode Sex as a binary feature (male = 0, female = 1).

In [None]:
train_df['Sex'] = train_df['Sex'].map({'male':0, 'female':1})

Next, apply `pd.get_dummies` before the correlation heatmap to see whether Embarked, Deck, or Title carry any signal.

Then drop the Name, Ticket, and Cabin columns.

In [None]:
sns.countplot(x='Deck', hue='Survived', data=train_df)
plt.ylim(top=50)

Decision: bucket the decks. (This plot may fit better after the correlation heatmap.)

Then add dummy columns for categorical features such as Embarked.

In [None]:
train_df = pd.get_dummies(train_df, columns=["Deck", "Embarked", "Title"])
train_df = train_df.drop(columns=['Name', 'Ticket', 'Cabin'])
print(train_df.columns)
train_df.head(1)
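One thing to watch with `pd.get_dummies` is that it only creates columns for categories present in the frame it is given, so train and test can end up with different dummy columns. A possible way to handle this, sketched on toy frames (`train_toy` and `test_toy` are illustrative names, not part of the notebook), is to reindex the test dummies onto the training columns:

```python
import pandas as pd

train_toy = pd.DataFrame({"Deck": ["C", "U", "T"], "Embarked": ["S", "C", "Q"]})
test_toy = pd.DataFrame({"Deck": ["U", "B"], "Embarked": ["S", "S"]})

train_dummies = pd.get_dummies(train_toy, columns=["Deck", "Embarked"])
test_dummies = pd.get_dummies(test_toy, columns=["Deck", "Embarked"])

# Force the test frame onto the training columns: categories missing from the
# test data are filled with 0, and categories unseen in training are dropped.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_aligned)
```

Here Deck_B exists only in the toy test data and is dropped, while Deck_T exists only in the toy training data and is added as a column of zeros.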

In [None]:
# Correlation plot to answer 'Which features are correlated with survival?'

corr = train_df.corr()
plt.subplots(figsize=(15,10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
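With this many dummy columns the heatmap gets crowded. A complementary view is to pull out only the Survived column of the correlation matrix and rank features by absolute correlation. The sketch below uses a small made-up frame rather than the real data, so the values themselves mean nothing.

```python
import pandas as pd

# Made-up rows, only to show the mechanics.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Sex":      [0, 1, 1, 0, 1, 0, 1, 1],
    "Pclass":   [3, 1, 2, 3, 1, 3, 3, 2],
    "SibSp":    [1, 1, 0, 0, 1, 0, 1, 0],
})

# Correlation of every feature with the target, strongest first by magnitude.
target_corr = (
    toy.corr()["Survived"]
    .drop("Survived")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(target_corr)
```

Sorting by absolute value keeps strongly negative features such as Pclass near the top instead of at the bottom of the list.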

Dropping:

**Embarked_Q** → near-zero correlation

**Deck_A, Deck_G, Deck_T** → too rare or no predictive value, so dropping (Deck_F is also dropped in the next cell)

**Title_\*** → only Mr, Miss, and Mrs were correlated, which is redundant with Sex

**Surname Length** → zero correlation

**Deck_U** → not used as a model feature

Adding/tweaking:

**Decks B-E** were high-survival decks → add a bucket feature

In [None]:
# Bucket decks B-E into a single 0/1 feature (cast to int to match the test-set version)
train_df['High Survival Deck'] = (train_df['Deck_B'] | train_df['Deck_C'] | train_df['Deck_D'] | train_df['Deck_E']).astype(int)
train_df.drop(axis=1, columns=['Deck_A', 'Deck_B', 'Deck_C', 'Deck_D', 'Deck_E', 'Deck_F', 'Deck_G', 'Deck_T', 'Embarked_Q', 'Title_Master', 'Title_Miss', 'Title_Mr', 'Title_Mrs', 'Title_Rare', 'Surname Length'], inplace=True)
train_df

In [None]:
# Re-plot the correlations after dropping and bucketing features

corr = train_df.corr()
plt.subplots(figsize=(10,5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")

Given our EDA, Age is not strongly correlated with survival, so it is probably fine to fill missing values with the mean of train_df. Alternatives to try: omitting Age, or filling with the median.

In [None]:
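# Fill missing Age values with the training mean
# (assign back instead of a chained inplace fillna, which newer pandas versions warn about)
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].mean()) # I tried both median and mean; mean fared better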
train_df.isna().sum()
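The mean-versus-median comparison can also be run inside cross-validation, so the fill value is computed from each training fold only. This is a sketch of that idea using scikit-learn's `SimpleImputer` in a pipeline; it is not how the comparison above was done, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data: 200 rows, 3 features, roughly 20% of column 0 missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 1] + 0.5 * X[:, 2] > 0).astype(int)
X[rng.random(200) < 0.2, 0] = np.nan

impute_scores = {}
for strategy in ("mean", "median"):
    # The imputer is re-fit on each training fold, so the fill value
    # never sees the validation rows.
    pipe = make_pipeline(
        SimpleImputer(strategy=strategy),
        RandomForestClassifier(n_estimators=50, random_state=0),
    )
    impute_scores[strategy] = cross_val_score(pipe, X, y, cv=5).mean()

print(impute_scores)
```

With the real features, `X` and `y` would be replaced by the unfilled training columns and `train_df['Survived']`.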

The training set is all set! Now let's run it through the models and compare cross-validated accuracy to find the best one.

### Modeling

- 5 models: Logistic Regression, Decision Tree, Random Forest, KNN, Naive Bayes
- Optionally: SVM, Gradient Boosting, or Perceptron
- Compare accuracy, precision, recall, AUC
- Pick a best model + explain why
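For the accuracy, precision, recall, and AUC comparison in the list above, scikit-learn's `cross_validate` accepts several scorers in one call. A minimal sketch on synthetic data (`X_demo` and `y_demo` are placeholders, not the Titanic features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data as a stand-in.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

metrics = ["accuracy", "precision", "recall", "roc_auc"]
cv_results = cross_validate(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5, scoring=metrics
)

# cross_validate stores per-fold scores under "test_<metric>".
mean_scores = {m: cv_results[f"test_{m}"].mean() for m in metrics}
for name, value in mean_scores.items():
    print(f"{name}: {value:.3f}")
```

The same call works for any of the estimators compared below, which would allow a table of metrics per model rather than accuracy alone.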

In [None]:
# general code to get CV accuracy of model

def evaluate_model(model, X, y, cv=5, scoring='accuracy'):
    """Evaluate a model using cross-validation."""
    scores = cross_val_score(model, X, y, cv=cv, scoring=scoring)
    print(f"{model.__class__.__name__} mean {scoring}: {np.mean(scores):.4f}")
    return scores

In [None]:
X_train = train_df[['PassengerId', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Fare_log', 'First Name Length', 'Embarked_C', 'Embarked_S', 'High Survival Deck']]
y_train = train_df['Survived']

In [None]:
# Define models
logreg = LogisticRegression(max_iter=1000)
rf = RandomForestClassifier()
dectree = DecisionTreeClassifier()
knn = KNeighborsClassifier()
nb = GaussianNB()
svm = SVC()
gradboost = GradientBoostingClassifier()
percep = Perceptron()

# Suppress warnings before evaluation so they do not clutter the output
warnings.filterwarnings("ignore")

# Evaluate
evaluate_model(logreg, X_train, y_train)
evaluate_model(rf, X_train, y_train)
evaluate_model(dectree, X_train, y_train)
evaluate_model(knn, X_train, y_train)
evaluate_model(nb, X_train, y_train)
evaluate_model(svm, X_train, y_train)
evaluate_model(gradboost, X_train, y_train)
evaluate_model(percep, X_train, y_train)


Random Forest performed best in cross-validation, so we go with it.
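One caveat on the comparison: KNN, SVM, and Perceptron are sensitive to feature scale, while tree-based models are not, and the features here are on very different scales (PassengerId and Fare versus 0/1 flags). The sketch below is an assumption about what a fairer comparison could look like, not something run in this notebook. It uses synthetic data with one deliberately inflated noise feature to show the effect of adding a `StandardScaler`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# With shuffle=False the last columns are pure noise features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=2,
    shuffle=False, random_state=0,
)
X_demo[:, 5] *= 1000  # inflate a noise feature so it dominates distances

knn_raw = cross_val_score(KNeighborsClassifier(), X_demo, y_demo, cv=5).mean()
knn_scaled = cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier()), X_demo, y_demo, cv=5
).mean()

print(f"KNN unscaled: {knn_raw:.3f}, KNN scaled: {knn_scaled:.3f}")
```

Whether scaling changes the ranking on the Titanic features would need to be checked by wrapping the scale-sensitive models in the same kind of pipeline.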

Next, we need to get test_df into shape for running the model.

In [None]:
# Prepare the test features (copy so test_df itself is left untouched)
y_train = train_df['Survived']
X_test = test_df.copy()

In [None]:
X_test['Fare_log'] = np.log1p(X_test['Fare'])  # log1p avoids log(0) errors
X_test['Fare_log'] = X_test['Fare_log'].fillna(X_test['Fare_log'].mean()) # I tried both median and mean; mean fared better
# Age and Fare also have missing values in the test set; fill them so that
# models without native NaN support can still predict
X_test['Fare'] = X_test['Fare'].fillna(X_test['Fare'].mean())
X_test['Age'] = X_test['Age'].fillna(train_df['Age'].mean())  # same fill value as the training set
X_test['Cabin'] = X_test['Cabin'].fillna('Unknown')
X_test["Deck"] = X_test["Cabin"].str[0] # U = unknown
X_test["Surname Length"] = X_test["Name"].str.split(',').str[0].str.len()
X_test["First Name Length"] = X_test["Name"].str.split(',').str[1].str.split(". ", regex=False).str[1].str.len()
X_test['Sex'] = X_test['Sex'].map({'male':0, 'female':1})
X_test = pd.get_dummies(X_test, columns=["Embarked"])
X_test['High Survival Deck'] = (X_test['Deck'].isin(['B', 'C', 'D', 'E'])).astype(int)
# Select the same columns, in the same order, as X_train
X_test = X_test[['PassengerId', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Fare_log', 'First Name Length', 'Embarked_C', 'Embarked_S', 'High Survival Deck']]

In [None]:
# Requires scikit-learn in the environment (pip install scikit-learn)
# Fit the chosen Random Forest on the full training set and predict on the test set

model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred


In [None]:
submission = pd.read_csv('titanic/gender_submission.csv')
submission['Survived'] = y_pred
submission # answer to 'Can we accurately predict survival with a simple model?': yes, got to 77.x%!

In [None]:
submission.to_csv('submissions/final_test.csv', index=False)
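The cells above reuse `gender_submission.csv` as a template for the PassengerId column. An alternative that does not depend on that file is to build the frame directly from the ids and predictions. This is a sketch with toy values; in the notebook the inputs would be the PassengerId column of the test set and `y_pred`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the test-set ids and the model predictions.
passenger_ids = pd.Series([892, 893, 894])
predictions = np.array([0, 1, 0])

submission_demo = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predictions.astype(int),
})

# index=False keeps the row index out of the file.
csv_text = submission_demo.to_csv(index=False)
print(csv_text)
```

Passing a path to `to_csv` instead of capturing the string writes the same content to disk.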

### 1) Logistic Regression model

- description high level w figs (read articles / watch vids)
- pros and cons in general and for this dataset

[writeup]

### 2) Decision Tree model

- description high level w figs (read articles / watch vids)
- pros and cons in general and for this dataset

[writeup]

### 3) Random Forest

- description high level w figs (read articles / watch vids)
- pros and cons in general and for this dataset

[writeup]

### 4) KNN model

- description high level w figs (read articles / watch vids)
- pros and cons in general and for this dataset

[writeup]

### 5) Naive Bayes model

- description high level w figs (read articles / watch vids)
- pros and cons in general and for this dataset

[writeup]

### 6) SVM model

- description high level w figs (read articles / watch vids)
- pros and cons in general and for this dataset

[writeup]

### 7) Gradient Boosting model

- description high level w figs (read articles / watch vids)
- pros and cons in general and for this dataset

[writeup]

### 8) Perceptron model

- description high level w figs (read articles / watch vids)
- pros and cons in general and for this dataset

[writeup]

### Final Evaluation

- Confusion matrix, F1, ROC curve
- Feature importance chart
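As a starting point for this section, the metrics listed above can be computed on a held-out split. The sketch below uses synthetic data and a hypothetical 75/25 split; none of the numbers it prints are results for the Titanic model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training features and labels.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_fit, y_fit)
val_pred = clf.predict(X_val)
val_proba = clf.predict_proba(X_val)[:, 1]  # probability of the positive class

cm = confusion_matrix(y_val, val_pred)   # rows: true class, columns: predicted class
f1 = f1_score(y_val, val_pred)
auc = roc_auc_score(y_val, val_proba)    # area under the ROC curve
importances = clf.feature_importances_   # one value per feature, summing to 1

print(cm)
print(f"F1: {f1:.3f}, AUC: {auc:.3f}")
print(importances.round(3))
```

For the charts, `sklearn.metrics.RocCurveDisplay.from_estimator` and a bar plot of `importances` against the feature names would be natural next steps.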

In [None]:
# final eval

In [None]:
# to answer 'Which model performs best on this dataset?'

[write up too]

### Wrap-up
- Final thoughts, takeaways
- What you’d do next with more time/data
- References or inspiration sources

[writeup]

### References
- [Titanic - Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic/data)
- [Titanic (Wikipedia)](https://en.wikipedia.org/wiki/Titanic)
- [Cleaning data (video)](https://www.youtube.com/watch?v=cWf08xuSqdU&ab_channel=DataGeekismyname)