# Session 1: Data Preprocessing with the Titanic Dataset

In this notebook, we will:
- Load and explore the Titanic dataset  
- Identify and handle missing values  
- Encode categorical variables  
- Scale numerical features  
- Split the data into training and testing sets  

Dataset source: `seaborn.load_dataset("titanic")`

### Titanic Dataset: Variable Description

The Titanic dataset contains information about passengers on the Titanic and whether they survived the disaster.  
Below is a short description of each column:

| Column | Type | Description |
|---------|------|-------------|
| **survived** | int (0 = No, 1 = Yes) | Whether the passenger survived. This will be our target variable in later sessions. |
| **pclass** | int (1, 2, 3) | Passenger class — a proxy for socioeconomic status (1st = upper, 2nd = middle, 3rd = lower). |
| **sex** | category | Gender of the passenger (`male`, `female`). |
| **age** | float | Age of the passenger in years. May contain missing values. |
| **sibsp** | int | Number of siblings or spouses aboard the Titanic. |
| **parch** | int | Number of parents or children aboard the Titanic. |
| **fare** | float | Ticket fare paid by the passenger (in British pounds). |
| **embarked** | category | Port of embarkation (`C` = Cherbourg, `Q` = Queenstown, `S` = Southampton). May contain missing values. |
| **class** | category | Duplicate of `pclass`, but stored as a categorical variable (`First`, `Second`, `Third`). |
| **who** | category | Simplified description of the passenger (`man`, `woman`, `child`). |
| **adult_male** | bool | Whether the passenger is an adult male (True/False). |
| **deck** | category | Deck level where the passenger’s cabin was located. Many values are missing. |
| **embark_town** | category | Full name of the embarkation town (`Cherbourg`, `Queenstown`, `Southampton`). Similar to `embarked`. |
| **alive** | category | Text version of survival status (`yes` / `no`). |
| **alone** | bool | Whether the passenger traveled alone. |


**Note:**  
For our preprocessing exercises, we will mainly use the following variables:
- `age`, `fare`, `sibsp`, `parch` → numerical features  
- `sex`, `class`, `embarked` → categorical features  
- `survived` → target variable for later modeling


```python
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder, MinMaxScaler
from sklearn.impute import SimpleImputer

# Load the Titanic dataset bundled with seaborn
df = sns.load_dataset("titanic")

# Display the first five rows
df.head()
```

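|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|----------|--------|-----|-----|-------|-------|------|----------|-------|-----|------------|------|-------------|-------|-------|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |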


## Exploring the Titanic Dataset

Let's start by inspecting the structure of our data:
- What are the column names?
- What types of variables do we have?
- Are there any missing values?
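
The three questions above map onto a few pandas calls. The sketch below uses a small toy frame with Titanic-style column names (the values are made up for illustration) so that it runs without downloading anything; on the real data you would call the same methods on `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in with Titanic-style columns; values are illustrative, not real records.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.93, 53.10],
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", None, "S"],
})

print(toy.columns.tolist())         # column names
print(toy.dtypes)                   # variable types
missing_counts = toy.isna().sum()   # number of missing values per column
print(missing_counts)
```

`df.info()` gives a compact summary of the same information (column names, dtypes, and non-null counts) in one call.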


### Exercise 1

1. Check how many unique values each column has.  
2. Identify which columns are categorical and which are numerical.  
3. Plot one or two distributions (e.g., `age`, `fare`) using `sns.histplot`.

*Hint:* use `df.nunique()` and `sns.histplot(df['age'])`.

**1. Check how many unique values each column has**

**2. Identify which columns are categorical and which are numerical.**

**3. Plot one or two distributions (e.g., `age`, `fare`) using `sns.histplot`.**

## Handling Missing Values

Let's take a closer look at the missing data:
- The `age` column has some missing entries.
- The `embarked` column also contains missing values.

We can:
1. Drop rows or columns with too many NaNs.
2. Fill (impute) missing values with mean, median, or most frequent values.
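
As a rough illustration of both options, the sketch below drops a mostly empty column and imputes the rest. The toy values and the 50% cutoff are assumptions chosen for the example, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (made-up values) with the same kinds of gaps as the Titanic columns.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "embarked": ["S", "C", np.nan, "S", "S"],
    "deck": [np.nan, "C", np.nan, np.nan, np.nan],
})

# Option 1: drop columns where more than half of the values are missing.
# The 0.5 threshold is an arbitrary choice for this example.
mostly_missing = toy.columns[toy.isna().mean() > 0.5]
reduced = toy.drop(columns=mostly_missing)

# Option 2: impute the remaining gaps.
age_imputer = SimpleImputer(strategy="median")            # numeric: median
reduced[["age"]] = age_imputer.fit_transform(reduced[["age"]])
most_common_port = reduced["embarked"].mode()[0]          # categorical: most frequent
reduced["embarked"] = reduced["embarked"].fillna(most_common_port)

print(reduced)
```

The median is often preferred over the mean for skewed variables because it is less affected by extreme values.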


### Exercise 2

Try to:
1. Inspect the `deck` and `embark_town` columns.  
2. Decide how you would handle their missing values.  
3. Apply your chosen strategy (drop or fill; if you fill, decide which value to fill with).

*Discuss your reasoning:* When is it better to drop vs impute?


**1. Inspect the `deck` and `embark_town` columns.**

**2. Decide how you would handle their missing values.**  

**3. Apply your chosen strategy (drop or fill; if you fill, decide which value to fill with).**

**Bonus exercise: try imputing numerical variables with `interpolate`.**
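
To show what `interpolate` does, here is a minimal sketch on a made-up series. Linear interpolation fills each gap from its neighbouring rows, so it presumes that row order carries meaning (as in a time series); whether that holds for a passenger list is worth discussing.

```python
import numpy as np
import pandas as pd

# Made-up series with gaps; not taken from the Titanic data.
s = pd.Series([10.0, np.nan, 30.0, np.nan, np.nan, 60.0])

# Each NaN is replaced by a value on the straight line between its known neighbours.
filled = s.interpolate(method="linear")
print(filled.tolist())
```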

## Encoding Categorical Variables

Most machine learning models, including scikit-learn estimators, require numerical input.
We therefore need to **encode** categorical variables such as `sex`, `class`, and `embarked`.

Common approaches:
- **Label encoding:** assign an integer ID to each category (suited to ordinal categories, as long as the integers follow the category order)
- **One-Hot encoding:** create binary columns for each category (good for nominal categories)


### Exercise 3 – Encoding Categorical Variables

So far, we’ve seen how to use **One-Hot Encoding** to convert categorical features into binary variables.  
Now let’s go deeper and compare it with **Label Encoding**.

1. Identify which categorical columns are **ordinal** (with a meaningful order) and which are **nominal** (no inherent order).   
2. Apply **Label Encoding** to the ordinal feature(s).  
3. Apply **One-Hot Encoding** to the nominal feature(s).  
4. Combine all encoded columns into a single processed dataset with the numerical variables.  
5. Compare the two encoding methods — what are the pros and cons of each?

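*Hint:*  
You can use `sklearn.preprocessing.LabelEncoder` for label encoding and `OneHotEncoder` for one-hot encoding. Keep in mind that `LabelEncoder` assigns integers in sorted (alphabetical) order and is designed for target labels; for feature columns, `OrdinalEncoder` lets you specify the category order explicitly.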


**1. Identify which categorical columns are ordinal (with a meaningful order) and which are nominal (no inherent order).**

**2. Apply Label Encoding to the ordinal feature(s).**

**3. Apply One-Hot Encoding to the nominal feature(s).**

**4. Combine all encoded columns into a single processed dataset with the numerical variables.**

**5. Compare the two encoding methods — what are the pros and cons of each?**

## Feature Scaling

Many ML algorithms (e.g., k-NN, SVM, gradient descent-based methods) are sensitive to feature scale.  
Common techniques:
- **Standardization (Z-score):** subtract mean, divide by std.
- **Min–Max scaling:** rescale to [0, 1].
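
Both techniques are simple formulas, which the sketch below checks by hand against the scikit-learn scalers. It uses the five `fare` values from the `df.head()` output; note that `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# The five fare values from the df.head() output, as a single-column 2D array.
fare = np.array([[7.25], [71.2833], [7.925], [53.1], [8.05]])

standardized = StandardScaler().fit_transform(fare)
minmax = MinMaxScaler().fit_transform(fare)

# The same results computed directly from the definitions.
manual_z = (fare - fare.mean()) / fare.std()                    # z = (x - mean) / std
manual_mm = (fare - fare.min()) / (fare.max() - fare.min())     # (x - min) / (max - min)

print(np.allclose(standardized, manual_z))
print(np.allclose(minmax, manual_mm))
```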


### Exercise 4 – Compare Feature Scaling Methods

So far, we’ve applied **StandardScaler**, which rescales features to have a mean of 0 and standard deviation of 1.

1. Apply **MinMaxScaler** to the same dataset.  
2. Compare the results between StandardScaler and MinMaxScaler.  
3. Visualize the effect of both scaling methods on a feature (e.g., `fare`).  
4. Discuss: when might one method be preferred over the other?

*Hints:*
- Use `from sklearn.preprocessing import MinMaxScaler`.
- The Min–Max formula rescales features to a fixed range, typically [0, 1].


**1. Apply MinMaxScaler to the same dataset.**

**2. Compare the results between StandardScaler and MinMaxScaler.**

**3. Visualize the effect of both scaling methods on a feature (e.g., `fare`).**

**4. Discuss: when might one method be preferred over the other?**

## Train/Test Split

We now have a clean, encoded, and scaled dataset.
Let’s split it into **training** and **testing** subsets to prepare for modeling.

- `train_test_split` holds out part of the data so that we can estimate how well a model generalizes to examples it has not seen.  
- Typical ratio: 80% training / 20% testing.
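
A minimal sketch of an 80/20 split on placeholder arrays is shown below. The `stratify=y` argument is an optional addition, not something the text above requires: it keeps the class proportions of the target similar in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and a balanced binary target (not Titanic data).
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 20% of the rows go to the test set
    random_state=42,   # fixed seed so the split is reproducible
    stratify=y,        # optional: preserve class balance in both subsets
)
print(X_train.shape, X_test.shape)
```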


### Exercise 5

1. Repeat the split using `test_size=0.3` and then `0.5`.  
2. Observe how the size of the training set changes.  
3. What happens if we change the `random_state` parameter?
4. Discuss: what are the trade-offs between a larger training set and a larger test set?


**1. Repeat the split using `test_size=0.3` and then `0.5`.**

**2. Observe how the size of the training set changes.**

**3. What happens if we change the `random_state` parameter?**

**4. Discuss: what are the trade-offs between a larger training set and a larger test set?**

## Advanced Exercises

### Exercise 6

Visualize missing-data patterns.

Use `sns.heatmap()` to show where missing values occur.

Decide whether the missingness appears random or systematic (e.g., ages missing more often for passengers in certain classes).
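
One way to back up a visual impression with numbers is to compute the share of missing values per group. The sketch below does this on made-up toy data, where the pattern is systematic by construction.

```python
import numpy as np
import pandas as pd

# Toy data (made-up): ages are missing only for third-class passengers.
toy = pd.DataFrame({
    "class": ["First", "First", "Third", "Third", "Third", "Third"],
    "age": [38.0, 35.0, np.nan, 22.0, np.nan, np.nan],
})

missing_mask = toy.isna()   # True where a value is missing
age_missing_rate = missing_mask["age"].groupby(toy["class"]).mean()
print(age_missing_rate)
```

A boolean mask like this is also what you would pass to the heatmap, for example `sns.heatmap(df.isna(), cbar=False)`, where each highlighted cell marks a missing value.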

### Exercise 7

Explore outliers before and after scaling to understand how scaling interacts with outliers.

Identify potential outliers with boxplots and compare before/after scaling (StandardScaler, MinMaxScaler).
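
Alongside the boxplots, a numeric check can help. The sketch below flags outliers with the 1.5 × IQR rule (the same rule boxplot whiskers use by default) before and after scaling; the values and the helper name `iqr_outliers` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up values with one extreme point.
values = np.array([[7.0], [8.0], [8.5], [9.0], [10.0], [500.0]])

def iqr_outliers(x):
    """Flag points more than 1.5 * IQR below Q1 or above Q3."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

raw_flags = iqr_outliers(values)
std_flags = iqr_outliers(StandardScaler().fit_transform(values))
mm_scaled = MinMaxScaler().fit_transform(values)
mm_flags = iqr_outliers(mm_scaled)

print(raw_flags.ravel(), std_flags.ravel(), mm_flags.ravel())
print(mm_scaled.ravel())
```

Both scalers are linear transformations, so the same point is flagged every time. With Min–Max scaling the single extreme value also compresses all other points into a narrow band near 0.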

### Exercise 8

Explore the correlation between variables.

Compute the correlation matrix of the numeric variables and plot it with `sns.heatmap()`.

Which features are most strongly correlated? Could that affect model performance?
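
A possible starting point, sketched on a toy frame with made-up values: compute the matrix with `DataFrame.corr`, then mask the diagonal to find the strongest pair. The `numeric_only=True` argument (available in recent pandas versions) skips non-numeric columns such as `sex`.

```python
import numpy as np
import pandas as pd

# Toy data with Titanic-style columns; values are illustrative.
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "fare": [7.25, 71.28, 7.93, 53.10, 51.86],
    "sibsp": [1, 1, 0, 1, 0],
    "sex": ["male", "female", "female", "female", "male"],
})

corr = toy.corr(numeric_only=True)   # Pearson correlation by default

# Hide the diagonal (always 1.0) and find the largest absolute correlation.
off_diagonal = corr.where(~np.eye(len(corr), dtype=bool))
strongest_pair = off_diagonal.abs().unstack().idxmax()
print(corr.round(2))
print(strongest_pair)
```

The resulting `corr` frame can be passed directly to `sns.heatmap(corr, annot=True)` for the plot.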

### Exercise 9

Build a preprocessing pipeline that consolidates all preprocessing steps so that you can reuse them on new datasets.
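
One possible structure, offered as a sketch rather than a reference solution, combines scikit-learn's `Pipeline` and `ColumnTransformer`. The column choices, imputation strategies, and toy values below are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "fare"]
categorical_cols = ["sex", "embarked"]

# Numeric columns: fill gaps with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

# Toy data (made-up values) to show the pipeline running end to end.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.93, 53.10],
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", np.nan, "S"],
})
features = preprocess.fit_transform(toy)
print(features.shape)
```

Calling `fit_transform` on the training split only, and `transform` on the test split, keeps test-set statistics out of the imputers and scalers.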

### Exercise 10

Apply all the steps (or use the pipeline you built in *Exercise 9*) to preprocess a new dataset.

Proposed dataset: `sns.load_dataset("penguins")`

### Bonus: Practice on a custom dataset

Apply what you have learned to your own research data.

1. Bring a small CSV dataset from your own project.
2. Load it using `pd.read_csv()` and repeat:
    - Missing value handling
    - Encoding categorical variables
    - Scaling
    - Train/test split