# 1 - Understanding & Planning

**Author:** M. Görkem Ulutürk

**Date:** December, 2025

## The Problem

Regardless of the workflow a data scientist chooses, a project should start
with a question. Many questions will be asked and answered along the way, but
the first one should spark our curiosity and guide us through the project: it
is the ultimate question we're trying to answer. Thus, we state our question:

> Did people survive the Titanic incident out of pure luck, or did social
constructs give certain groups a better chance of survival?

Recall that one reason so many people died in this incident was the lack of
lifeboats. People on board therefore had to make choices about who would be
saved and who would not. We want to know whether a person's traits, such as
age, wealth, and sex, played a role in their survival, and if so, which
groups were more likely to survive.

## The Data

For this project, we've been given three files:

1. `train.csv`: Contains the passenger information for training the ML model
2. `test.csv`: Used for evaluating the model performance for submission
3. `gender_submission.csv`: Example submission data

### Data Dictionary <a name="data-dictionary"></a>

`train.csv` contains

- **891 rows**
- **12 columns** (the 11 variables listed below, plus `PassengerId`)

Variable | Dtype | Definition | Key
---------|-------|------------|-----
`Survived` | `int64` | Survival | 0 = No, 1 = Yes
`Pclass` | `int64` | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd
`Name` | `object` | Name | 
`Sex` | `object` | Sex |
`Age` | `float64` | Age in years |
`SibSp` | `int64` | The number of siblings / spouses aboard the Titanic |
`Parch` | `int64` | The number of parents / children aboard the Titanic |
`Ticket` | `object` | Ticket number |
`Fare` | `float64` | Passenger fare |
`Cabin` | `object` | Cabin number |
`Embarked` | `object` | Port of Embarkation | C = Cherbourg,<br>Q = Queenstown,<br>S = Southampton

**Notes:**

- `Pclass`: A proxy for socio-economic status (SES), 1st = Upper,
2nd = Middle, 3rd = Lower
- `Age`: Age is fractional if less than 1. If the age is estimated, it is in
the form of xx.5
- `SibSp`: The dataset defines family relations in this way...
    - Sibling = brother, sister, stepbrother, stepsister
    - Spouse = husband, wife (mistresses and fiancés were ignored)
- `Parch`: The dataset defines family relations in this way...
    - Parent = mother, father
    - Child = daughter, son, stepdaughter, stepson
    - Some children travelled only with a nanny, therefore parch=0 for them.
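
The `Age` convention in the notes can be turned into a small check. The
sketch below is only an illustration: `is_estimated_age` is a hypothetical
helper (not part of the dataset or of any library), and it assumes that the
xx.5 rule applies only to ages of 1 and above, since ages under 1 are
fractional for a different reason. The value `28.5` is a made-up example.

```python
def is_estimated_age(age: float) -> bool:
    """Return True if `age` follows the xx.5 pattern used for estimated ages.

    Assumption: the rule only applies to ages of 1 or more; ages below 1 are
    fractional because they are recorded as a fraction of a year.
    """
    return age >= 1 and age % 1 == 0.5


# 28.5 is a hypothetical estimated age; 22.0 and 0.42 are ordinary values.
for age in (22.0, 0.42, 28.5):
    print(age, is_estimated_age(age))
```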

## Inspecting the Data

### Importing the Data

Let us start by importing the required packages.

In [1]:
import pandas as pd

Let's now import the data.

In [2]:
df = pd.read_csv("../data/raw/train.csv", encoding="utf-8")
df.head(10)

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


### Initial Data Wrangling

The `PassengerId` column only identifies rows. It is used together with the
`test.csv` data for submission purposes, so we will not need it for model
training.

Now, let's check for missing values and data types.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The only columns that contain missing values are `Age`, `Cabin`, and
`Embarked`. We'll deal with these missing values in the data wrangling
section of the project. For now, having an informed view of the data is
beneficial for our planning purposes.
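
To put these gaps in proportion, the non-null counts reported by `df.info()`
can be converted into missing shares. The snippet below is a small sketch
that copies those counts by hand rather than reading the file; on the loaded
frame, `df.isna().mean()` should give the same fractions.

```python
# Non-null counts copied from the `df.info()` output.
n_rows = 891
non_null = {"Age": 714, "Cabin": 204, "Embarked": 889}

# Missing count and missing share per column.
missing = {column: n_rows - count for column, count in non_null.items()}
missing_share = {column: count / n_rows for column, count in missing.items()}

for column in non_null:
    print(f"{column}: {missing[column]} missing ({missing_share[column]:.1%})")
# Age: 177 missing (19.9%)
# Cabin: 687 missing (77.1%)
# Embarked: 2 missing (0.2%)
```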

Let's also check for duplicates.

In [4]:
df.duplicated(keep="first").sum()

np.int64(0)

We don't have any duplicates we need to deal with.

**Takeaways**

- We'll drop the column `PassengerId` as it's not needed for training
- We've found no duplicates in the data
- The data contains some missing values, especially in the `Cabin` column.

### Initial EDA

To make reasonable plans, we'll briefly inspect the data.

In [5]:
df.describe()

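|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|-------------|----------|--------|-----|-------|-------|------|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |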


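We can deduce that

- The majority did not survive (`Survived` is a binary variable with mean
0.38)
- Upper-class passengers (`Pclass` = 1) were a minority (the 25th percentile
of `Pclass` is already 2)
- Most passengers were younger than middle-aged (the 75th percentile of `Age`
is 38)
- At least half of the passengers had no siblings or spouses aboard (the
median of `SibSp` is 0)
- At least three quarters had no parents or children aboard (the 75th
percentile of `Parch` is 0)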

Let's also take a look at the `Sex` variable.

In [6]:
df['Sex'].value_counts()

Sex
male      577
female    314
Name: count, dtype: int64

In [7]:
df.groupby('Sex')['Survived'].agg('sum')

Sex
female    233
male      109
Name: Survived, dtype: int64

We see that although the majority of the passengers were male, females were
the majority among survivors. We'll also take other variables like `Age` into
consideration during the EDA, but for now, this information, together with
our intuition, is enough to suspect that `Sex` is correlated with the target
variable.
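
Raw counts hide the different group sizes, so it helps to turn them into
rates. The following is a quick sketch that uses only the counts printed
above; on the loaded frame, `df.groupby('Sex')['Survived'].mean()` should
return the same two rates.

```python
# Counts copied from the `value_counts()` and `groupby(...).agg('sum')` outputs.
passengers = {"female": 314, "male": 577}
survivors = {"female": 233, "male": 109}

# Survival rate within each group, and across all passengers.
rates = {sex: survivors[sex] / passengers[sex] for sex in passengers}
overall = sum(survivors.values()) / sum(passengers.values())

print({sex: round(rate, 3) for sex, rate in rates.items()})
# {'female': 0.742, 'male': 0.189}
print(round(overall, 3))  # 0.384, the mean of `Survived` in `df.describe()`
```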

### Performance Targets

Lastly, we need a performance benchmark. Let's create a baseline prediction:
since most females survived (233 survivors out of 314 in total), a model that
predicts survival whenever the passenger is female should be right most of
the time.

Concretely, the model predicts "survived" if the passenger is female and
"did not survive" if the passenger is male. Let's check its accuracy.

In [8]:
from sklearn.metrics import accuracy_score

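def predict(X: pd.DataFrame) -> pd.Series:
    # Predict 1 (survived) for females and 0 (did not survive) for males,
    # matching the 0/1 encoding of `Survived`
    return (X['Sex'] == 'female').astype(int)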

print(accuracy_score(df['Survived'], predict(df)))

0.7867564534231201
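
The same figure can be reproduced by hand from the counts printed earlier,
which makes it clear where the rule succeeds and where it fails. A minimal
sketch, with the counts copied from the outputs above:

```python
# Counts copied from the earlier outputs.
total = 891
females, males = 314, 577
female_survivors, male_survivors = 233, 109

# The rule is right for every surviving female and every non-surviving male.
correct = female_survivors + (males - male_survivors)
accuracy = correct / total

print(correct)             # 701
print(round(accuracy, 4))  # 0.7868
```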


A model that predicts that all females survived and all males did not has an
accuracy of about 78.7%. Let's also see what a baseline decision tree
classifier is able to achieve.

In [9]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (recall_score, precision_score, f1_score)

df = pd.read_csv("../data/raw/train.csv", encoding="utf-8")

df.drop(['Cabin', 'Embarked', 'Name', 'PassengerId', 'Ticket'],
        axis=1, inplace=True)

# Fill missing values; only the `Age` column contains NaNs after dropping the
# columns above, so we fill that column explicitly with its mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
df.reset_index(inplace=True, drop=True)

df = pd.get_dummies(data=df)

y = df['Survived']
X = df.drop('Survived', axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.20, random_state = 40
)

dt = DecisionTreeClassifier(random_state = 40)
dt.fit(X_train, y_train)
pred = dt.predict(X_test)

print(f"Accuracy score: {accuracy_score(y_test, pred)}")
print(f"Recall score: {recall_score(y_test, pred)}")
print(f"Precision score: {precision_score(y_test, pred)}")
print(f"F1 score: {f1_score(y_test, pred)}")

Accuracy score: 0.7988826815642458
Recall score: 0.7368421052631579
Precision score: 0.7777777777777778
F1 score: 0.7567567567567568


Without any feature engineering, a baseline decision tree trained after
dropping the columns `Cabin`, `Embarked`, `Name`, `PassengerId`, and `Ticket`
and filling in missing `Age` values with the mean achieves an accuracy of
79.9% and an F1-score of 75.7% on the held-out 20% split. We'll therefore set
our target at around 85% for both accuracy and F1-score, which would beat
both the all-female-survivor rule and the baseline decision tree.
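
As a reminder of what the F1-score measures, it is the harmonic mean of
precision and recall. The short check below recomputes it from the two scores
printed above, with the values copied by hand:

```python
# Scores copied from the decision tree output.
precision = 0.7777777777777778
recall = 0.7368421052631579

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.7568
```

Because the harmonic mean is pulled toward the smaller of the two values,
an F1-score of around 85% requires both precision and recall to be high.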

Let's also `pickle` this model for future reference.

In [10]:
import pickle

with open("../models/base_dt.pkl", "wb") as f:
    pickle.dump(dt, f)

# Sanity check: reload the model and confirm the score is unchanged
with open("../models/base_dt.pkl", "rb") as f:
    base_dt = pickle.load(f)

print(accuracy_score(y_test, base_dt.predict(X_test)))

0.7988826815642458


## Next Steps

Recall that the problem we're trying to solve is to infer whether a passenger
survived, based on the features present in the dataset. These features are
listed in the [data dictionary](#data-dictionary). The target variable is the
binary variable `Survived`.

We can make some initial, informed plans. First of all, for the data
wrangling part, we'll need to deal with the missing values. We've discovered
that the `Age`, `Cabin`, and `Embarked` variables contain missing values.
Based on the data dictionary and our purpose, we may choose to

- Impute missing `Age` values because, intuitively, this column is probably
correlated with the target variable `Survived`
- Discuss the relevance of `Cabin` column and potentially drop it
- Discuss the relevance of `Embarked` column and potentially drop it
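
For the `Age` imputation, one option is a mean fill whose statistic is
computed on the training split only, so that no information from held-out
rows leaks into training. The sketch below is an illustration under stated
assumptions, not the method chosen for this project: it uses the `Age` values
of the ten rows shown by `df.head(10)` and a hypothetical 7/3 split.

```python
import pandas as pd

# `Age` values of the ten rows shown by `df.head(10)`; row 5 is missing.
ages = pd.Series(
    [22.0, 38.0, 26.0, 35.0, 35.0, None, 54.0, 2.0, 27.0, 14.0], name="Age"
)

# Hypothetical split for illustration: first seven rows train, last three test.
train_ages, test_ages = ages.iloc[:7], ages.iloc[7:]

# Compute the statistic on the training rows only (NaNs are skipped by default).
train_mean = train_ages.mean()

# Reuse the same training statistic for both splits.
train_filled = train_ages.fillna(train_mean)
test_filled = test_ages.fillna(train_mean)

print(train_mean)             # 35.0
print(train_filled.tolist())  # [22.0, 38.0, 26.0, 35.0, 35.0, 35.0, 54.0]
```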

Other features that are probably correlated with the target variable are

- `Pclass`
- `Sex`
- `Fare`

Of course, we'll conduct a detailed analysis for each feature on whether
they're correlated or not with the target variable. We'll also conduct
analysis to reveal relationships between feature variables. We'll discuss
this topic more in the EDA phase.

Additionally, we may choose to drop columns such as `Name` or `SibSp`
after EDA if we fail to find any relation to the target variable.
At the very least, we expect to drop the `Name` column after feature
extraction. The `Name` itself shouldn't be significant except for determining
sex, family members, or title (Mr., Mrs., etc.).
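
As an illustration of what such feature extraction could look like, the
sketch below pulls the title out of a few names taken from the `df.head(10)`
output. The helper `extract_title` and the pattern are hypothetical; they
assume every name follows the "Surname, Title. Given names" layout seen in
those rows, which would need to be verified on the full data.

```python
import re
from typing import Optional

# Names copied from the `df.head(10)` output.
names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
]

# Assumption: the title sits between the first comma and the next period.
TITLE_PATTERN = re.compile(r",\s*([^.,]+)\.")


def extract_title(name: str) -> Optional[str]:
    """Return the title found after the surname, or None if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None


titles = [extract_title(name) for name in names]
print(titles)  # ['Mr', 'Miss', 'Master', 'Mrs']
```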

Lastly, the model choice will be clearer after data wrangling and EDA;
however, an educated guess would be that a tree-based classifier is probably
the best fit for the case. We'll uncover more in the upcoming sections. We've
set a performance target of around 85% in both accuracy and F1 scores for
the model.