## How can we predict survival on the Titanic?

To-do list for self:
- 1 apply fcc principles
- 2 apply [https://www.linkedin.com/pulse/what-i-learned-analyzing-famous-titanic-dateset-murilo-gustineli/]
- 3 apply [https://python.plainenglish.io/revitalizing-cyclistic-bike-share-program-an-in-depth-data-exploration-556b52512bf8] (different dataset, but the approach should still carry over)
- 4 apply others? [https://www.kaggle.com/code/startupsci/titanic-data-science-solutions]

### Guiding Questions
- Which features are correlated with survival?
- Can we accurately predict survival with a simple model?
- Which model performs best on this dataset?


In [None]:
# import statements
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# load the train and test datasets and report their row counts
train_df = pd.read_csv('titanic/train.csv')
test_df = pd.read_csv('titanic/test.csv')
print(f"Passengers in train set: {train_df.shape[0]}\nPassengers in test set: {test_df.shape[0]}")

In [None]:
train_df["Cabin"].str[0].unique()

The row counts are 891:418, which is roughly a 7:3 train:test split (before dropping any training rows, if that turns out to be needed).
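As a quick check of the "roughly 7:3" claim, the arithmetic can be written out. This is only a sketch: the two counts are hard-coded from the printout above, and the variable names are made up for illustration.

```python
# Row counts reported by the shape printout above.
n_train, n_test = 891, 418

# Share of all passengers (train + test files) that ends up in each file.
train_share = n_train / (n_train + n_test)
test_share = 1 - train_share

print(f"train share: {train_share:.3f}, test share: {test_share:.3f}")
```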

In [None]:
print(train_df.dtypes)

### Dataset description

| Column | dtype | Description |
|----------|----------|----------|
| PassengerId  | int64  | unique passenger  |
| Survived  | int64  | binary value of survival outcome (0, 1)  |
| Pclass  | int64  | class (1, 2, 3)  |
| Name  | object  | passenger name as a string; for feature engineering we derive the title and the name lengths from it  |
| Sex  | object  | sex ("male", "female")  |
| Age  | float64  | passenger's age in years  |
| SibSp  | int64  | number of siblings/spouses aboard  |
| Parch  | int64  | number of parents/children aboard  |
| Ticket  | object  | ticket number  |
| Fare  | float64  | ticket cost  |
| Cabin  | object  | cabin identifier of format letter + number (e.g. C85, C123, B42, C148); from research, the letter corresponds to the deck, which we use to engineer our own Deck feature  |
| Embarked  | object  | port the passenger boarded from ('S', 'C', 'Q', which correspond to Southampton, Cherbourg, Queenstown)  |

NOTE - variations that might help the algorithm?
- handle the missing Fare, Cabin, and Age values differently from one another!
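One way the "different treatment per column" idea could look is sketched below. Everything here is an assumption for illustration: the toy rows are made up, and the per-group median for Age is just one common option, not a choice made elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic columns that have gaps (values are made up).
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3, 2],
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Age": [38.0, np.nan, 22.0, np.nan, 30.0, np.nan],
    "Fare": [71.3, 53.1, 7.25, np.nan, 8.05, 13.0],
    "Cabin": ["C85", "C123", None, None, None, None],
})

# Age: median within each (Pclass, Sex) group, then the overall median as a
# fallback for groups that have no observed age at all.
group_median = toy.groupby(["Pclass", "Sex"])["Age"].transform("median")
toy["Age"] = toy["Age"].fillna(group_median).fillna(toy["Age"].median())

# Fare: a single global median, on the assumption that few fares are missing.
toy["Fare"] = toy["Fare"].fillna(toy["Fare"].median())

# Cabin: keep the missingness as information instead of inventing a value.
toy["Cabin"] = toy["Cabin"].fillna("Unknown")

print(toy)
```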

In [None]:
train_df.head()

In [None]:
train_df.tail()

In [None]:
train_df.info()

In [None]:
train_df.isna().sum()

We see that Age, Cabin, and Embarked have null values that need to be cleaned in the train dataset.

In [None]:
test_df.head()

In [None]:
test_df.info()

In [None]:
test_df.isna().sum()

We see that Age, Fare, and Cabin have null values that need to be cleaned in the test dataset.

Next, let's look at summary statistics for both datasets.

In [None]:
train_df.describe()

In [None]:
test_df.describe()

We see that the train and test sets have similar summary statistics. At first glance, all columns have reasonable means and min/max values: Pclass is always 1-3, Age is in a plausible range, and SibSp, Parch, and Fare show no outright impossible values or outliers, though Fare seems to be very right-skewed.

We will look into each variable's distribution in more detail to see whether any single variable needs to be preprocessed or cleaned.

In [None]:
# sanity checks
print(f"Minimum age: {train_df['Age'].min()}. Maximum age: {train_df['Age'].max()}.")
print(f"Duplicate PassengerIds? {train_df.duplicated('PassengerId').sum()}")

First, let's figure out how to deal with the missing Age and Cabin values, of which there are a lot in both train and test. For now we only look at the train set (open question: is that the right call?).

## Visualizing each variable (before cleaning missing values)

In [None]:
# sns.countplot for each categorical attribute (overall counts)
# TODO: repeat with hue="Survived" to split each count by died/survived
# attributes: Survived, Sex, Embarked

fig, axes = plt.subplots(figsize=(10, 3), nrows=1, ncols=3)
sns.countplot(x="Survived", data=train_df, ax=axes[0])
sns.countplot(x="Sex", data=train_df, ax=axes[1])
sns.countplot(x="Embarked", data=train_df, ax=axes[2])
fig.tight_layout()

In the train set, we see that the majority of passengers perished, there were more males than females aboard, and most passengers embarked from 'S' (Southampton).

In [None]:
# sns.countplot / sns.histplot for numerical attributes
# Pclass, Age, SibSp, Parch, Fare

fig1, axes1 = plt.subplots(figsize=(8, 3), nrows=1, ncols=3)
fig2, axes2 = plt.subplots(figsize=(8, 3), nrows=1, ncols=2)
sns.countplot(x="Pclass", data=train_df, ax=axes1[0])
sns.histplot(x="Age", data=train_df, ax=axes1[1])
sns.countplot(x="SibSp", data=train_df, ax=axes1[2])
sns.countplot(x="Parch", data=train_df, ax=axes2[0])
sns.histplot(x="Fare", data=train_df, ax=axes2[1])
fig1.tight_layout()
fig2.tight_layout()

The majority of passengers were in 3rd class, then 1st, then 2nd. Age looks roughly normally distributed (still to be verified). SibSp and Parch are similarly right-skewed, with most passengers at 0. Fare is very right-skewed, so it may need a transform to reduce the skew; the example I followed used a log transform, but other methods may work too.
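A log transform of a right-skewed column could be sketched as follows. The fares below are made up, and `np.log1p` is my choice here (an assumption, not something the referenced example is confirmed to use) because it maps a fare of 0 to 0 instead of negative infinity.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed fares: many cheap tickets, a few very expensive ones.
fare = pd.Series([7.25, 7.9, 8.05, 8.05, 10.5, 13.0, 26.0, 31.3, 71.3, 263.0, 512.3])

# log1p computes log(1 + x), so zero fares stay finite.
log_fare = np.log1p(fare)

# Sample skewness before and after; a value closer to 0 means a more symmetric shape.
print(f"skew before: {fare.skew():.2f}, after: {log_fare.skew():.2f}")
```

On the real data the same call would be applied to `train_df["Fare"]`, and identically to the test set.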

To handle Cabin: it is a string, and one feature we can engineer from it is its first letter, which is the "deck" a passenger stayed on. So:
- we will create a "Deck" column, which is the first letter of "Cabin"
- missing Deck values will be filled with "U" (unknown)

In [None]:
train_df["Deck"] = train_df["Cabin"].str[0] # first letter = deck; missing cabins stay NaN until filled with "U" below
train_df.head()

Feature engineering for NAME
- The "Name" column has the format "[surname], [title]. [first and middle names/nicknames]". From it we can derive "Surname Length", "Title", and "First Name Length", which I imagine are more meaningful features than the raw name. So I will extract the surname and title, and then calculate the length of the remaining first name.

TICKET
- Will also be dropped because it seems unhelpful/unstandardized.

In [None]:
train_df["Surname Length"] = train_df["Name"].str.split(',').str[0].str.len()
train_df["Title"] = train_df["Name"].str.split(',').str[1].str.split(' ').str[1]
train_df["First Name Length"] = train_df["Name"].str.split(',').str[1].str.split(". ", regex=False).str[1].str.len()
# train_df.drop(axis=1, columns=['Name', 'Cabin', 'Ticket'], inplace=True)
train_df.head()

In [None]:
train_df["Title"].value_counts()

In [None]:
train_df[train_df["Title"] == "the"]

Next, we group the titles into meaningful categories that take into account marital status (which could also be a proxy for class) as well as how rare the titles are (e.g. "the" comes from "the Countess").

In [None]:
title_mapping = {
    "Mr.": "Mr",
    "Mrs.": "Mrs",
    "Miss.": "Miss",
    "Ms.": "Miss",       # Unmarried woman (modern)
    "Mlle.": "Miss",     # French for Miss
    "Mme.": "Mrs",       # French for Mrs
    "Master.": "Master", # Usually boys under 12
    "Dr.": "Rare",       # Ambiguous — could be male or female
    "Rev.": "Rare",
    "Major.": "Rare",
    "Col.": "Rare",
    "Capt.": "Rare",
    "Sir.": "Rare",
    "Lady.": "Rare",
    "Don.": "Rare",
    "Jonkheer.": "Rare",
    "the": "Rare"
}

In [None]:
train_df["Title"] = train_df["Title"].map(title_mapping)
train_df.head()
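One thing to watch with `.map` and a dictionary: any title that is not a key becomes NaN. A minimal sketch of a fallback is below; the shortened dictionary and the "Dona." example are illustrative assumptions, standing in for any title the full mapping does not list.

```python
import pandas as pd

# Hypothetical raw titles; "Dona." stands in for a title missing from the dictionary.
titles = pd.Series(["Mr.", "Mlle.", "Dr.", "Dona."])

# Shortened copy of the mapping idea, for illustration only.
small_mapping = {"Mr.": "Mr", "Mlle.": "Miss", "Dr.": "Rare"}

mapped = titles.map(small_mapping)  # unknown keys become NaN
mapped = mapped.fillna("Rare")      # fall back to the catch-all bucket

print(mapped.tolist())
```

This matters most when the same mapping is later applied to the test set, which may contain titles that never appear in the train set.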

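Fill missing Deck values with "U" (unknown)

In [None]:
# assign the result back; fillna(inplace=True) on a selected column may not update train_df in newer pandas
train_df['Deck'] = train_df['Deck'].fillna('U')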

Make Sex a binary feature (male = 0, female = 1)

In [None]:
train_df['Sex'] = train_df['Sex'].map({'male':0, 'female':1})

Next, apply `pd.get_dummies` before the correlation heatmap to see whether Embarked, Deck, or Title carry any signal,

and then drop the Name, Ticket, and Cabin columns.

In [None]:
train_df.head(1)

In [None]:
# survival counts per deck (y-axis capped at 50 so the small decks stay readable next to the large "U" bars)

sns.countplot(x='Deck', hue='Survived', data=train_df)
plt.ylim(top=50)


Then add dummy (one-hot) columns for categorical features such as Embarked.

In [None]:
train_df = pd.get_dummies(train_df, columns=["Deck", "Embarked", "Title"])
print(train_df.columns)
train_df.drop(axis=1, columns=['Name', 'Ticket', 'Cabin'], inplace=True)
train_df.head(1)

In [None]:
train_df

In [None]:
# to answer 'Which features are correlated with survival?' correlation plot

corr = train_df.corr()
plt.subplots(figsize=(15,10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")

DROPPING

**Embarked_Q** → near-zero correlation with survival

**Deck_A, Deck_G, Deck_T** → too rare or no predictive value, so dropping

**Deck_U** → yes, it’s correlated, but that might reflect missingness bias (people without a recorded deck were more likely to die); consider turning it into a binary column like Have_Deck

**Title_\*** → only Mr, Miss, and Mrs were correlated, which is redundant with Sex

ADDING IN/TWEAKING

**Decks B-E** were high-survival decks --> add a bucket feature
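The `Have_Deck` idea and the B-E bucket could both be derived straight from the raw Cabin column, before any one-hot encoding. This is a sketch on made-up rows; it is an alternative to combining the `Deck_*` dummy columns, not the approach the next cell takes.

```python
import numpy as np
import pandas as pd

# Made-up cabins: two in the B-E range, one missing, one on deck A.
toy = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", "A10"]})

# 1 when a cabin was recorded, 0 when it is missing.
toy["Have_Deck"] = toy["Cabin"].notna().astype(int)

# Deck letter, with "U" for unknown.
toy["Deck"] = toy["Cabin"].str[0].fillna("U")

# 1 when the deck is one of B, C, D, E.
toy["High Survival Deck"] = toy["Deck"].isin(list("BCDE")).astype(int)

print(toy)
```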

In [None]:
train_df['High Survival Deck'] = (train_df['Deck_B']) | (train_df['Deck_C']) | (train_df['Deck_D']) | (train_df['Deck_E'])
# note the comma between 'Deck_U' and 'Embarked_Q'; without it the two strings merge into one invalid column name
train_df.drop(axis=1, columns=['Deck_A', 'Deck_B', 'Deck_C', 'Deck_D', 'Deck_E', 'Deck_F', 'Deck_G', 'Deck_T', 'Deck_U', 'Embarked_Q', 'Title_Master', 'Title_Miss', 'Title_Mr', 'Title_Mrs', 'Title_Rare'], inplace=True)
train_df

In [None]:
# recompute the correlation heatmap with the reduced feature set

corr = train_df.corr()
plt.subplots(figsize=(15,10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")

In [None]:
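# deck plot: the 'Deck' column no longer exists after get_dummies, so plot the bucketed feature instead

sns.barplot(x='High Survival Deck', y='Survived', data=train_df)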


In [None]:
#pivot tables

### EDA

- Visualize survival by class, sex, age, family, fare
- Show correlations (heatmap, groupby stats)
- Write observations inline
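The "groupby stats" and pivot tables mentioned above could be sketched like this. The rows are made up; on the real data the same calls would run on `train_df` while `Sex` and `Pclass` are still present.

```python
import pandas as pd

# Made-up passengers, only the columns needed for the summary.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the survival rate of each group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

# Two-way view: survival rate for every (Pclass, Sex) combination.
rate_table = toy.pivot_table(index="Pclass", columns="Sex", values="Survived", aggfunc="mean")

print(rate_by_sex)
print(rate_table)
```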

### Data Cleaning

- describe data
- Of the estimated 2,224 passengers and crew aboard, approximately 1,500 died (estimates vary) [https://en.wikipedia.org/wiki/Titanic]
- 891 entries in the training set
- 418 in the test set
- 1309 total, meaning a roughly 7:3 split for train:test


- Inspect nulls
- Drop/recode columns
- Feature engineering (like 'is_alone', deck extraction, etc.)
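The `is_alone` feature from the list above could be built from SibSp and Parch as sketched here. `FamilySize` is a hypothetical helper column name, and the rows are made up.

```python
import pandas as pd

# Made-up sibling/spouse and parent/child counts.
toy = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

# Family size counts the passenger plus all relatives aboard.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# A passenger is alone when nobody else from the family is aboard.
toy["is_alone"] = (toy["FamilySize"] == 1).astype(int)

print(toy)
```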

In [None]:
# clean as needed
# maybe remove outliers?


train_df.info()


In [None]:
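# fill age na values (assign back instead of chained inplace, which newer pandas may ignore)
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].median())

# drop cabin column if it is still present (the get_dummies cell above already drops it)
train_df.drop(columns='Cabin', errors='ignore', inplace=True)

# drop embarked na rows (only possible while the raw Embarked column still exists)
if 'Embarked' in train_df.columns:
    train_df.dropna(subset=['Embarked'], inplace=True)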

In [None]:
train_df.info()


Optional stuff to refine it later! The snippets below are percentile-based outlier filters written for other datasets (columns like `ap_lo`, `height`, `weight`, `views`), kept here as templates.

    # 11
    pressure_mask = df["ap_lo"] <= df["ap_hi"]
    short_mask = df["height"] >= df["height"].quantile(0.025)
    tall_mask = df["height"] <= df["height"].quantile(0.975)
    low_weight_mask = df["weight"] >= df["weight"].quantile(0.025)
    high_weight_mask = df["weight"] <= df["weight"].quantile(0.975)

    df_heat = df[pressure_mask & short_mask & tall_mask & low_weight_mask & high_weight_mask]

    # Clean data
    df = df[(df['views'] >= df['views'].quantile(0.025)) & (df['views'] <= df['views'].quantile(0.975))] # 1304 -> 1176
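Adapted to this dataset, the same 2.5th/97.5th percentile filter might be applied to Fare. This is a sketch on synthetic values (1 to 100), and whether trimming fares actually helps is an open question here, not a result.

```python
import pandas as pd

# Synthetic fares 1.0 .. 100.0 so the cut-offs are easy to verify by hand.
toy = pd.DataFrame({"Fare": [float(v) for v in range(1, 101)]})

# Percentile cut-offs (linear interpolation between neighbouring values).
low = toy["Fare"].quantile(0.025)
high = toy["Fare"].quantile(0.975)

# Keep only rows inside [low, high]; between() is inclusive on both ends.
trimmed = toy[toy["Fare"].between(low, high)]

print(f"kept {len(trimmed)} of {len(toy)} rows, cut-offs {low:.3f} .. {high:.3f}")
```

A filter like this should only ever drop training rows; every test row still needs a prediction.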


In [None]:
# column dtypes (train_df.dtypes lists the type of every column)
train_df.dtypes



In [None]:
train_df.head()

### Modeling

- 5 models: Logistic Regression, Decision Tree, Random Forest, KNN, Naive Bayes
- Optionally: SVM, Gradient Boosting, or Perceptron
- Compare accuracy, precision, recall, AUC
- Pick a best model + explain why
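A comparison loop for the five listed models could be sketched with cross-validation. The data below is synthetic (`make_classification`), standing in for the cleaned feature matrix, and all hyperparameters are scikit-learn defaults or arbitrary picks, so the printed scores say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned features and labels (not Titanic data).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
}

# Mean 5-fold cross-validated accuracy per model.
cv_accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

# Print the models from best to worst mean accuracy.
for name, score in sorted(cv_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {score:.3f}")
```

Swapping `scoring="accuracy"` for `"precision"`, `"recall"`, or `"roc_auc"` gives the other metrics from the list with the same loop.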

### 1) Logistic Regression model

- description high level w figs (read articles / watch vids)
- pros and cons in general and for this dataset

[writeup]

In [None]:
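# log reg code
# Build the baseline features from a fresh copy of the raw CSV: train_df was one-hot
# encoded and had Name/Ticket/Cabin/Embarked removed in the feature-engineering cells above.
raw_train = pd.read_csv('titanic/train.csv').dropna(subset=['Embarked'])
X_train = raw_train.drop(columns=['Name', 'Ticket', 'Cabin', 'Survived'])
X_train['Age'] = X_train['Age'].fillna(X_train['Age'].median())
y_train = raw_train['Survived']
X_test = test_df.drop(columns=['Name', 'Ticket', 'Cabin'])
X_test['Age'] = X_test['Age'].fillna(X_test['Age'].median())
X_test['Fare'] = X_test['Fare'].fillna(X_test['Fare'].median())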

In [None]:
X_train.info()

In [None]:
X_test.info()

In [None]:
# make all numerical

X_train['Sex'] = X_train['Sex'].astype(str).map({"male": 0, "female": 1})
X_test['Sex'] = X_test['Sex'].astype(str).map({"male": 0, "female": 1})
X_train['Embarked'] = X_train['Embarked'].astype(str).map({'Q': 0, 'S': 1, 'C': 2})
X_test['Embarked'] = X_test['Embarked'].astype(str).map({'Q': 0, 'S': 1, 'C': 2})

In [None]:
X_train.info()

In [None]:
# !pip install scikit-learn
# log reg!

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred


In [None]:
submission = pd.read_csv('titanic/gender_submission.csv')
submission['Survived'] = y_pred
submission # 75% --> answer to 'Can we accurately predict survival with a simple model?'

In [None]:
submission.to_csv('logreg_submission.csv', index=False)
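The `train_test_split`, `accuracy_score`, `confusion_matrix`, and `classification_report` imports above are not used yet. A local hold-out check with them might look like the sketch below; it runs on synthetic data, and the 80/20 split and all variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_train / y_train (not Titanic data).
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# Hold back 20% of the labelled rows; stratify keeps the class balance in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
val_pred = clf.predict(X_val)

val_accuracy = accuracy_score(y_val, val_pred)
val_confusion = confusion_matrix(y_val, val_pred)  # rows = true class, columns = predicted

print(f"validation accuracy: {val_accuracy:.3f}")
print(val_confusion)
print(classification_report(y_val, val_pred))
```

This gives an accuracy estimate without submitting, since the test file has no Survived column to score against.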

### 2) Decision Tree model

- description high level w figs (read articles / watch vids)
- pros and cons in general and for this dataset

[writeup]

In [None]:
# decision tree code

### 3) Random Forest

- description high level w figs (read articles / watch vids)
- pros and cons in general and for this dataset

[writeup]

In [None]:
# random forest code

### 4) KNN model

- description high level w figs (read articles / watch vids)
- pros and cons in general and for this dataset

[writeup]

In [None]:
# KNN code

### 5) Naive Bayes model

- description high level w figs (read articles / watch vids)
- pros and cons in general and for this dataset

[writeup]

In [None]:
# Naive Bayes code

### 6) SVM model

- description high level w figs (read articles / watch vids)
- pros and cons in general and for this dataset

[writeup]

In [None]:
# SVM code

### 7) Gradient Boosting model

- description high level w figs (read articles / watch vids)
- pros and cons in general and for this dataset

[writeup]

In [None]:
# Gradient Boosting code

### 8) Perceptron model

- description high level w figs (read articles / watch vids)
- pros and cons in general and for this dataset

[writeup]

In [None]:
# Perceptron code

### Final Evaluation

- Confusion matrix, F1, ROC curve
- Feature importance chart
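The F1, ROC, and feature-importance items could be computed as sketched below. The data is synthetic and a random forest is used only because it exposes `feature_importances_`; this is not a claim about which model ends up best.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned features and labels (not Titanic data).
X, y = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=1)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_fit, y_fit)

# Predicted probability of the positive class, needed for ROC/AUC.
val_proba = forest.predict_proba(X_val)[:, 1]

val_f1 = f1_score(y_val, forest.predict(X_val))
val_auc = roc_auc_score(y_val, val_proba)
fpr, tpr, thresholds = roc_curve(y_val, val_proba)  # points of the ROC curve

# Impurity-based importances, one value per feature, summing to 1.
importances = forest.feature_importances_

print(f"F1: {val_f1:.3f}, AUC: {val_auc:.3f}")
print("importances:", importances.round(3))
```

The `fpr` and `tpr` arrays can be passed to `plt.plot(fpr, tpr)` for the ROC curve, and `importances` to a bar chart for the feature importance figure.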

In [None]:
# final eval

In [None]:
# to answer 'Which model performs best on this dataset?'

best_model!

[write up too]

### Wrap-up
- Final thoughts, takeaways
- What you’d do next with more time/data
- References or inspiration sources

[writeup]

# References
- Titanic - Machine Learning from Disaster (Kaggle) [https://www.kaggle.com/competitions/titanic/data]
- Titanic (Wikipedia) [https://en.wikipedia.org/wiki/Titanic]
- Cleaning data (YouTube) [https://www.youtube.com/watch?v=cWf08xuSqdU&ab_channel=DataGeekismyname]