# Lab 9: Pre-Processing with Scikit-Learn and Pandas

The objective of this notebook is to learn about **pre-processing** with the **Scikit-Learn** and **Pandas** libraries, and then to train a simple binary classifier on the pre-processed dataset.

In this lab, we will train a binary classification model that predicts which **passengers survived** the **Titanic shipwreck** (<a href="https://www.kaggle.com/c/titanic" >link</a>).

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this notebook, you are asked to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (e.g., name, age, gender, socio-economic class). For now, we focus only on the pre-processing part, but you can come back here after the laboratory on classification.

You can find a detailed **tutorial** <a href="https://datasciencewithchris.com/kaggle-titanic-data-cleaning-and-preprocessing/" >here</a>.

## Outline

- [1. Load Dataset](#1)
- [2. Data pre-processing](#2)
- [3. Model training](#3) (Following the workshop on classification)


First, run the following cell to import some useful libraries to complete this Lab. If you have not already done so, install them in your virtual environment.

In [None]:
import pandas as pd

pd.options.display.max_columns = 50
pd.options.display.max_rows = None

import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml

from sklearn.impute import SimpleImputer

<a id='1'></a>
## 1. Load dataset

Firstly, you will load the **Titanic** dataset used in this lab into a DataFrame `df`. 

**Scikit-Learn** can download the **Titanic dataset** from the OpenML repository through `fetch_openml`. The next cell fetches the dataset (an internet connection is needed the first time) and stores it in a Pandas DataFrame.

In [None]:
df, y = fetch_openml('titanic', version=1, as_frame=True, parser='auto', return_X_y=True)
df["survived"] = y

In [None]:
df.to_csv("lab6_data.csv")

Run the next cell to look at the first 5 rows of the dataset.

In [None]:
df.head()

In [None]:
print("Number of samples:", len(df))

In [None]:
df.columns

The dataset is composed of 1309 samples. Each row describes one passenger. Specifically, the dataset contains the following attributes:

- **pclass**: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- **name**: Passenger name
- **sex**: Passenger sex
- **age**: Passenger age
- **sibsp**: Number of Siblings/Spouses Aboard
- **parch**: Number of Parents/Children Aboard
- **ticket**: Ticket Number
- **fare**: Passenger Fare
- **cabin**: Cabin
- **embarked**: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- **boat**: Lifeboat (if survived)
- **body**: Body number (if did not survive and body was recovered). It could be another target.
- **home.dest**: Destination
- **survived** (target): Survival (0 = No; 1 = Yes)

Note that **boat** and **body** must be removed from the input features because they provide information about the target variable (i.e., **boat** has a value only if the passenger survived, and **body** only if the passenger did not).

In [None]:
df.info()

In [None]:
df.describe()

<a id='2'></a>
## 2. Data pre-processing

Firstly, you will perform the pre-processing of the dataset.


### 2.1 Train and Test splitting with Stratification

In [None]:
df["survived"].value_counts()

The dataset is slightly **imbalanced**.

In [None]:
df.head()

#### Exercise 2.1.1

Extract the input features in `X` and the target values in `y`.

In [None]:
#### START CODE HERE (~2 lines) ####

X = None
y = None

#### END CODE HERE ####

In [None]:
X.head()

#### Exercise 2.1.2

Split the dataset into **train** and **test**. In this case, the dataset is **imbalanced**. Therefore, it is recommended to split using stratification (i.e., the class label distribution will be preserved during the splitting).

Split with 80% for training and 20% for testing. Shuffle the dataset before splitting.
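
As a hedged illustration of what stratification does, the sketch below splits a small made-up dataset and checks that the class ratio is the same in both parts. The names `toy_X` and `toy_y` are hypothetical and not part of the lab.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 70% class 0 and 30% class 1
toy_X = pd.DataFrame({"feature": np.arange(100)})
toy_y = pd.Series([0] * 70 + [1] * 30, name="label")

# stratify=toy_y keeps the 70/30 class ratio in both splits
toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, shuffle=True, random_state=0, stratify=toy_y
)

# Fraction of class 1 in each split
print(toy_y_tr.mean(), toy_y_te.mean())
```

Without `stratify`, the two fractions would generally differ from split to split.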

In [None]:
#### START CODE HERE (~1 line) ####

X_train, X_test, y_train, y_test = train_test_split(None, None, test_size=None, shuffle=None, random_state=42, stratify=None)

#### END CODE HERE ####

In [None]:
print(f"Number of training examples: {len(X_train)}")
print(f"Number of testing examples: {len(X_test)}")

### 2.2 Handling missing values

#### Exercise 2.2.1
Count the number of **null values** in each column of the training and test sets, and store them in the variables `nan_count_train` and `nan_count_test`.
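
One possible way to count missing entries per column, shown here on a made-up DataFrame (`toy` is a hypothetical name), is to chain `isnull()` and `sum()`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, np.nan, 8.05],
    "sex": ["male", "female", "female", "male"],
})

# isnull() marks missing cells as True; sum() counts them column by column
nan_per_column = toy.isnull().sum()
print(nan_per_column)
```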

In [None]:
#### START CODE HERE (~2 lines) ####

nan_count_train = None
nan_count_test = None

#### END CODE HERE ####

In [None]:
print("Train")
print(nan_count_train)

In [None]:
print("Test")
print(nan_count_test)

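Sometimes **missing values** are not stored as *nan* (for example, they may be encoded with a placeholder such as `?` or `-1`).

The next cell prints, for each column with missing entries, the value used to represent them.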

In [None]:
print('Data types of missing values')
for col in X_train.columns[X_train.isnull().any()]:
    print(col, X_train[col][X_train[col].isnull()].values[0])

In this case, all missing values are stored as *nan*.

#### Exercise 2.2.2

Fill **null values** in the column `age` with the **mean** of the column `age` in the training and test sets. Please compute the mean only on the training set!
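
The reason for using training statistics only is to avoid leaking information from the test set into the pre-processing. A minimal sketch on made-up data (the `toy_train` and `toy_test` frames are hypothetical):

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"age": [np.nan, 50.0]})

# mean() skips NaN values: (20 + 30 + 40) / 3 = 30
train_mean = toy_train["age"].mean()

# The same training statistic fills the gaps in both sets
toy_train["age"] = toy_train["age"].fillna(train_mean)
toy_test["age"] = toy_test["age"].fillna(train_mean)

print(toy_test)
```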

In [None]:
print(f'Number of null values in Train before pre-processing: {X_train.age.isnull().sum()}/{len(X_train)}')
print(f'Number of null values in Test before pre-processing: {X_test.age.isnull().sum()}/{len(X_test)}')

#### START CODE HERE (~2 lines) ####

None
None

#### END CODE HERE ####

print(f'Number of null values in Train after pre-processing: {X_train.age.isnull().sum()}/{len(X_train)}')
print(f'Number of null values in Test after pre-processing: {X_test.age.isnull().sum()}/{len(X_test)}')

#### Exercise 2.2.3

Fill **null values** in the column `fare` with the **median** of the column `fare` in the training and test sets. Please compute the median only on the training set!

In [None]:
print(f'Number of null values in Train before pre-processing: {X_train.fare.isnull().sum()}/{len(X_train)}')
print(f'Number of null values in Test before pre-processing: {X_test.fare.isnull().sum()}/{len(X_test)}')

#### START CODE HERE (~2 lines) ####

None
None

#### END CODE HERE ####

print(f'Number of null values in Train after pre-processing: {X_train.fare.isnull().sum()}/{len(X_train)}')
print(f'Number of null values in Test after pre-processing: {X_test.fare.isnull().sum()}/{len(X_test)}')

#### Exercise 2.2.4

Fill **null values** in the column `embarked` with the **most frequent value** of the column `embarked`. Please compute the most frequent value only on the training set!
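
As a hedged sketch of how `SimpleImputer` can be used for a categorical column, the example below works on made-up data (the `port` column and the `imp_demo` name are hypothetical). Note that the imputer expects a 2D input and returns a 2D array.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; dtype=object keeps strings and NaN in the same column
toy_train = pd.DataFrame({"port": ["S", "S", "C", np.nan]}, dtype=object)
toy_test = pd.DataFrame({"port": [np.nan, "Q"]}, dtype=object)

imp_demo = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
imp_demo.fit(toy_train[["port"]])  # double brackets give a 2D input; fit on training only

# transform returns a 2D array, so flatten it before assigning to a single column
toy_train["port"] = imp_demo.transform(toy_train[["port"]]).ravel()
toy_test["port"] = imp_demo.transform(toy_test[["port"]]).ravel()

print(toy_test)
```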

In [None]:
print(f'Number of null values in Train before pre-processing: {X_train.embarked.isnull().sum()}/{len(X_train)}')
print(f'Number of null values in Test before pre-processing: {X_test.embarked.isnull().sum()}/{len(X_test)}')

#### START CODE HERE (~3 lines) ####

imp = SimpleImputer(missing_values=np.nan, strategy=None)
X_train['embarked'] = None
X_test['embarked'] = None

#### END CODE HERE ####

print(f'Number of null values in Train after pre-processing: {X_train.embarked.isnull().sum()}/{len(X_train)}')
print(f'Number of null values in Test after pre-processing: {X_test.embarked.isnull().sum()}/{len(X_test)}')

### 2.3 Feature selection

#### Exercise 2.3.1
Remove the columns *cabin*, *body*, *boat*, and *home.dest* from the train and test sets. *boat* and *body* contain information about the target variable (i.e., the model could "cheat" by predicting the target label from them), while *cabin* and *home.dest* have many missing values.

In [None]:
#### START CODE HERE (~2 lines) ####

X_train = None
X_test = None

#### END CODE HERE ####

X_train.head()

#### Exercise 2.3.2

Remove other columns that you think are useless features in predicting which people were more likely to survive.

In [None]:
#### START CODE HERE (~2 lines) ####

X_train = None
X_test = None

#### END CODE HERE ####

X_train.head()

The next cell plots the **correlation heatmap** using `Seaborn` and `df.corr()`. You will probably need to install `Seaborn` using the command `pip install seaborn`, and then restart your kernel.

In [None]:
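import seaborn as sns
%config InlineBackend.figure_format = 'svg'

df_show = pd.concat([X_train[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare']], y_train], axis=1)

# corr() needs numeric columns: encode 'sex' as 0/1 and cast the categorical target to int
df_show['sex'] = (df_show['sex'] == 'female').astype(int)  # female = 1, male = 0
df_show[y_train.name] = df_show[y_train.name].astype(int)

g = sns.heatmap(df_show.corr(),
                annot=True,
                cmap="coolwarm")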

### 2.4 Feature engineering (optional)

#### Exercise 2.4.1

If you want, you can create new columns here from the ones available.

In [None]:
#### START CODE HERE ####

#### END CODE HERE ####

X_train.head()

### 2.5 Discretization

The next cell performs the **discretization** of the age column with **fixed-intervals**. 
You can learn more about **discretization** <a href="https://trainindata.medium.com/variable-discretization-in-machine-learning-7b09009915c2" >here</a>.
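
To make the interval semantics concrete, here is a small sketch of `pd.cut` on made-up ages (the shortened labels are hypothetical). By default each bin is open on the left and closed on the right.

```python
import pandas as pd

ages = pd.Series([5, 14, 15, 24, 50, 80])
labels = ["Child", "Young", "Adult", "Senior"]

# Bins are right-closed by default: (0, 14], (14, 24], (24, 50], (50, 100]
age_groups = pd.cut(x=ages, bins=[0, 14, 24, 50, 100], labels=labels)
print(age_groups.tolist())
```

Values that fall outside the outer edges (for instance an age of exactly 0, or above 100) are not assigned to any bin and become *nan*.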

In [None]:
age_category = ['Child (0-14]', 'Young (14-24]', 'Adults (24-50]', 'Senior (50+]']
 
X_train['age_disc']=pd.cut(x=X_train['age'], bins=[0,14,24,50,100],labels=age_category)
X_train = X_train.drop(columns=['age']) # Remove the old age column

X_test['age_disc']=pd.cut(x=X_test['age'], bins=[0,14,24,50,100],labels=age_category)
X_test = X_test.drop(columns=['age']) # Remove the old age column

In [None]:
X_train.head()

In [None]:
X_test.head()

### 2.6 One-hot encoding

The following cells perform the **one-hot encoding** of the categorical features using the `OneHotEncoder` of the **Scikit-Learn** library. You can also use a similar approach using the `get_dummies` function of **Pandas**.

You can learn the differences between `OneHotEncoder` and `get_dummies` <a href="https://pythonsimplified.com/difference-between-onehotencoder-and-get_dummies/" >here</a>.

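When building the `OneHotEncoder` object, the `handle_unknown` parameter is set to `'ignore'`, so that a category that was not seen during `fit` is encoded as all zeros instead of raising an error.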


In [None]:
X_train.head()

In [None]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore') 

In [None]:
categorical_columns = ['sex', 'embarked']

In [None]:
ohe.fit(X_train[categorical_columns]) # Fit on training data

temp_df = pd.DataFrame(data=ohe.transform(X_train[categorical_columns]).toarray(), 
                       columns=ohe.get_feature_names_out()) # Create a new DataFrame with only the one-hot encoded cols

X_train = X_train.drop(columns=categorical_columns) # Remove the old categorical columns from the original data
X_train = pd.concat([X_train.reset_index(drop=True), temp_df], axis=1)

X_train.head()

In [None]:
temp_df = pd.DataFrame(data=ohe.transform(X_test[categorical_columns]).toarray(), 
                       columns=ohe.get_feature_names_out()) # Not fit on test data!

X_test = X_test.drop(columns=categorical_columns)
X_test = pd.concat([X_test.reset_index(drop=True), temp_df], axis=1)

X_test.head()

### 2.7 Ordinal Encoding

When a categorical feature is ordinal, we can use ordinal encoding. Since the order among the categories is important, the encoding should reflect that order.
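
As a hedged illustration on made-up data (the `size` column is hypothetical), passing the categories explicitly lets the integer codes follow the intended order:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["medium", "small", "large", "small"]})

# Explicit order: small < medium < large
# (without the categories argument, the categories would be sorted alphabetically)
enc_demo = OrdinalEncoder(categories=[["small", "medium", "large"]])
codes = enc_demo.fit_transform(sizes)

print(codes.ravel())
```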

In [None]:
age_category

In [None]:
from sklearn.preprocessing import OrdinalEncoder

ord_enc = OrdinalEncoder(categories=[age_category]) # Should be a list because you can specify the categories for multiple columns


ord_enc.fit(X_train.loc[:, ["age_disc"]]) # Fit on training data



In [None]:
X_train["age_disc_enc"] = ord_enc.transform(X_train.loc[:, ["age_disc"]])

X_train.head()

In [None]:
X_train = X_train.drop(columns=["age_disc"])

X_train.head()

In [None]:
X_test["age_disc_enc"] = ord_enc.transform(X_test.loc[:, ["age_disc"]])
X_test = X_test.drop(columns=["age_disc"])

X_test.head()

### 2.8 Normalization/Standardization

#### Exercise 2.8.1 

Perform **Min-Max** normalization of the *numerical features*. Remember to **fit** on the training set and not on the test set. Note that `age_disc_enc` is an ordinal-encoded categorical feature in this case, but it can be normalized too.
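
Min-Max normalization maps each feature to `(x - min) / (max - min)`, where `min` and `max` are learned during `fit`. The sketch below uses made-up values (`toy_train`, `toy_test`, and `scaler_demo` are hypothetical names) to show what happens when the scaler is fitted on training data only.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

toy_train = pd.DataFrame({"fare": [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({"fare": [15.0, 40.0]})

scaler_demo = MinMaxScaler()
scaler_demo.fit(toy_train)  # learns min = 10 and max = 30 from the training data only

train_scaled = scaler_demo.transform(toy_train)  # (x - 10) / (30 - 10)
test_scaled = scaler_demo.transform(toy_test)    # test values may fall outside [0, 1]

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the test value 40 is larger than the training maximum, it is mapped above 1. This is expected when the scaler is not refitted on the test set.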

In [None]:
from sklearn.preprocessing import MinMaxScaler

numerical_features = ["pclass", "sibsp", "parch", "fare", "age_disc_enc"]

#### START CODE HERE (~4 lines) ####

minmax_s = None

minmax_s.fit(None) 

X_train[numerical_features] = None
X_test[numerical_features] = None

#### END CODE HERE ####

In [None]:
X_train.head()

In [None]:
X_test.head()

### 2.9 Feature Reduction

Now we fit PCA on the normalized training data and compute the cumulative explained variance. This tells us how much variance is captured as we increase the number of components.

👉 Plot the cumulative variance curve and add a red line at 90% to guide component selection.
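
A hedged sketch of how the cumulative explained variance can be obtained, using synthetic data rather than the lab dataset (`toy_data` and `pca_demo` are hypothetical names):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 5 features with decreasing spread
toy_data = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

pca_demo = PCA()  # no n_components given: keep all components
pca_demo.fit(toy_data)

# Running total of the per-component variance ratios; the last value is 1
cumulative_variance = np.cumsum(pca_demo.explained_variance_ratio_)
print(cumulative_variance)
```

An array like `cumulative_variance` can then be passed to `plt.plot`, and `plt.axhline(y=0.9, color='r')` is one way to draw the threshold line.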

In [None]:

from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt

#start code


#end code

plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by PCA Components')
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()



Based on the plot above, select the minimum number of components needed to reach a certain threshold (e.g. 90% of the total variance).

👉 Compute the number of components programmatically.
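
One possible approach, sketched here on made-up cumulative ratios (all names ending in `_demo` are hypothetical), is to find the first position where the cumulative variance reaches the threshold:

```python
import numpy as np

cumulative_demo = np.array([0.55, 0.80, 0.93, 0.98, 1.00])  # made-up cumulative ratios
threshold_demo = 0.9

# argmax returns the first index where the condition is True;
# adding 1 converts the zero-based index into a number of components
n_components_demo = int(np.argmax(cumulative_demo >= threshold_demo)) + 1
print(n_components_demo)
```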

In [None]:
threshold = 0.9

#start code

#end code

print(f"Number of components to reach {threshold*100:.0f}% variance: {n_components}")


Now we re-fit PCA using only the selected number of components, and project both training and test data into this reduced space.

👉 Use `.fit_transform()` on the training set and `.transform()` on the test set (no refit!).
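
The difference between the two calls can be sketched on synthetic data as follows (the arrays and `pca_demo` are hypothetical, and 2 components is an arbitrary choice):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
toy_train = rng.normal(size=(100, 5))
toy_test = rng.normal(size=(20, 5))

pca_demo = PCA(n_components=2)
train_reduced = pca_demo.fit_transform(toy_train)  # learns the components and projects
test_reduced = pca_demo.transform(toy_test)        # reuses the same components, no refit

print(train_reduced.shape, test_reduced.shape)
```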



In [None]:
#start code



#end code

The loadings show how much each original feature contributes to each principal component.

👉 Compute the loadings and visualize the contributions of each feature to PC₁ and PC₂ using a horizontal bar chart.
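
A hedged sketch on synthetic data: here the loadings are taken to be the entries of `pca.components_`, which is an assumption, since some texts additionally scale them by the square root of the explained variance. The names `toy_data`, `pca_demo`, and `loadings_demo` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
toy_data = rng.normal(size=(100, 4))
feature_names = [f"feature_{i}" for i in range(toy_data.shape[1])]

pca_demo = PCA(n_components=2).fit(toy_data)

# components_ has shape (n_components, n_features); transpose to get one row per feature
loadings_demo = pd.DataFrame(
    pca_demo.components_.T, index=feature_names, columns=["PC1", "PC2"]
)
print(loadings_demo)
```

A DataFrame in this layout can be drawn as a horizontal bar chart with `loadings_demo.plot.barh()`.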

In [None]:
#start code
# as index use: features = [f"feature_{i}" for i in range(X_train.shape[1])]


#end code

plt.xlabel('Loadings')
plt.ylabel('Feature')
plt.title('Contribution of features to PC$_1$ and PC$_2$')
plt.tight_layout()
plt.show()


<a id='3'></a>
## 3. Model Training and Evaluation

Now, you can **train** and **evaluate** a **binary classification** model on the pre-processed dataset. 

### 3.1 Training

### 3.2 Evaluation