# **Lab 0 - Explainable and Trustworthy AI**


---



**Teaching Assistant**: *Salvatore Greco*

## **Lab 0:** Machine Learning pipeline with Pandas and Scikit-learn

In this lab, you will learn about **pre-processing** and **model training** in Machine Learning (ML) with [Pandas](https://pandas.pydata.org/) and [Scikit-Learn](https://scikit-learn.org/stable/) libraries.


[Pandas](https://pandas.pydata.org/) is a Python library for handling and analyzing structured data, particularly two-dimensional tables and time series (i.e., data associated with time). It provides useful data structures (e.g., Series and DataFrames) to manage data effectively. The library provides tools for selecting data, transforming data with grouping and pivoting operations, managing missing data, and computing statistics and plotting charts. Pandas is built on [NumPy](https://numpy.org/) arrays.

[Scikit-Learn](https://scikit-learn.org/stable/) is a Python library that implements many machine learning algorithms, and it is built on [NumPy](https://numpy.org/), [SciPy](https://scipy.org/) and [Matplotlib](https://matplotlib.org/). Scikit-Learn offers both *unsupervised* algorithms (e.g., the K-Means and DBSCAN clustering algorithms) and *supervised* algorithms for *regression* and *classification* tasks. It also provides useful functions for data pre-processing, feature extraction, feature selection, and dimensionality reduction.

A typical **machine learning pipeline** involves the following steps:
  1. **Data Collection**: Gather your data. - *(not covered in this lab)*
  2. **Data Exploration**: Perform exploratory data analysis to understand patterns, distributions, and correlations in the data. - *(not covered in this lab)*
  3. **Data Splitting**: Split the dataset into training, validation (optional), and test sets.
  4. **Data Cleaning**: Handle missing values, remove duplicates, and correct errors.
  5. **Feature Selection**: Select relevant features and remove redundant ones.
  6. **Data Transformation**: Normalization, standardization, and encoding.
  7. **Feature Engineering**: Create new features or modify existing ones (e.g., discretization).
  8. **Data Augmentation**: Augment the training set to increase its size and variability (if possible). Apply techniques like oversampling, undersampling, or [SMOTE](https://medium.com/@corymaklin/synthetic-minority-over-sampling-technique-smote-7d419696b88c) to handle imbalanced data. - *(not covered in this lab)*
  9. **Model Selection and Training**: Choose and train the model using the pre-processed training set.
  10. **Hyperparameter Tuning**: Explore various hyperparameter configurations to improve upon the baseline model's performance. Evaluate each configuration using a validation set or cross-validation. - *(not covered in this lab)*
  11. **Model Evaluation**: Evaluate the model's performance on the preprocessed test set using appropriate metrics.

You can also create pre-processing pipelines that automate all the pre-processing steps.
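
As a minimal sketch of such a pipeline, Scikit-Learn's `Pipeline` class can chain pre-processing steps and a model so that every transformation is fitted on the training data only. The toy data and the choice of `LogisticRegression` below are assumptions made for illustration and are not part of this lab:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric data with one missing value (illustrative only)
X_train = np.array([[1.0, 20.0], [2.0, np.nan], [3.0, 40.0], [4.0, 50.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[2.5, np.nan]])

# Each step is fitted on the training data, then reused unchanged on new data
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)

# The missing test value is filled with the mean learned from the training set
predictions = pipe.predict(X_test)
print(predictions)
```

Calling `pipe.predict` applies the imputer and the scaler with the statistics learned during `fit`, so the test data never influences the pre-processing.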

The previous steps are just a general list, and the steps you actually need depend on the model you want to train. For example, tree-based algorithms such as decision trees and random forests can, depending on the implementation, handle categorical data natively. Thus, they may not require the encoding of categorical features, and they do not need normalization/standardization.

Note that it is recommended to split the dataset early in the process and to use *only* the training set to derive any data-specific insights or transformations. These practices are fundamental to prevent data leakage and to ensure that the model generalizes to new data. This approach keeps the test set as an unbiased assessment of the model's performance.

---

## **Exercise 1: Titanic Survival Prediction**

In this exercise, you will train a binary classification model that predicts which **passengers survived** the **Titanic shipwreck** <a href="https://www.kaggle.com/c/titanic" >link</a>.

The sinking of the Titanic is one of the most famous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While some element of luck was involved in surviving, it seems some groups of people were more likely to survive than others.

In this exercise, you are asked to build a predictive model that answers the question: *“What sorts of people were more likely to survive?”* using passenger data (i.e., name, age, gender, socio-economic class, etc).

You can find two detailed **tutorials** in the following links: [tutorial1](https://datasciencewithchris.com/kaggle-titanic-data-cleaning-and-preprocessing/) and [tutorial2](https://medium.com/@melodyyip_/titanic-survival-prediction-using-machine-learning-89a779656113).

Run the next cell to import the required libraries for this exercise.

In [None]:
# Import the required libraries for this exercise

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn import tree

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

### 1.1 Load dataset

Firstly, you will load the **Titanic** dataset used in this lab into a DataFrame `df`.

**Scikit-Learn** provides the `fetch_openml` function, which downloads datasets from [OpenML](https://www.openml.org/), including one for the **Titanic survival prediction** task. The next cell loads the Titanic dataset with this function and stores it in a Pandas [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

In [None]:
# Load input features and target variable
df, y = fetch_openml('titanic', version=1, as_frame=True, parser='auto', return_X_y=True)

# The "survived" column contains the target variable
df["survived"] = y

# Print the number of samples in the dataset
print(f"Number of samples in the dataset: {len(df)}")

Pandas [DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) have useful methods and attributes to manipulate and analyze data efficiently.

Some methods and attributes are useful for getting a quick overview of your data. Some examples include:
- `df.head()`: This method returns the first *n* rows of the DataFrame, where *n* is a parameter that you can specify. If you do not specify n, it defaults to 5. This is particularly useful for quickly inspecting the beginning of your dataset to understand its structure and the type of data it contains.
- `df.info()`: This method provides a concise summary of the DataFrame, including the number of non-null entries in each column, the data type of each column, the memory usage, the number of columns, and the range index. It can be useful for getting a quick overview of the DataFrame's structure and to identify columns with missing values.
- `df.columns`: This attribute can be used to view or modify the column names. For example, you can use `df.columns.tolist()` to get a list of all column names.
- `df.describe()`: This method generates descriptive statistics that summarize the central tendency, dispersion, and shape of the dataset’s distribution, excluding *NaN* values. By default, it summarizes only the numeric columns, providing count, mean, standard deviation, minimum, maximum, and quartile values. With `include='object'` or `include='all'`, it also reports count, unique, top, and frequency for object data (e.g., strings).



In [None]:
df.head()

In [None]:
print(f"Dataset columns: {df.columns.tolist()}")

In [None]:
df.info()

In [None]:
df.describe()

The `"survived"` column contains the target variable (i.e., the variable that you want to predict).

Some datasets contain a **balanced** number of samples for each label. Thus, each category of data is equally represented. However, many real-world datasets are **imbalanced**, meaning that one or more classes have disproportionately more samples than the others.

Highly **imbalanced** datasets can cause the model to become biased towards the more frequently represented class(es), thereby reducing the model's ability to generalize well across all categories. In such cases, the model trained may perform well on the majority class(es) but poorly on the minority class(es), because it has not had enough data to learn from for the underrepresented categories. Imbalance can significantly affect the performance and fairness of predictive models, leading to misleadingly high accuracy scores that do not accurately reflect the model's ability to predict less frequent classes.
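
To make the point about misleading accuracy concrete, the following sketch uses toy labels (an assumption for illustration, not the Titanic data) and a baseline that always predicts the majority class. The `DummyClassifier` and the balanced accuracy metric are used here only to expose the effect:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy imbalanced labels: 90 negatives and 10 positives (illustrative only)
y_true = np.array([0] * 90 + [1] * 10)
X_dummy = np.zeros((len(y_true), 1))  # features are irrelevant for this baseline

# A "model" that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)

# Accuracy looks high, while balanced accuracy (mean per-class recall) does not
acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
print(f"Accuracy: {acc:.2f}")
print(f"Balanced accuracy: {bal_acc:.2f}")
```

The baseline never identifies a single minority sample, yet its accuracy equals the share of the majority class.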

Run the next cell to count the number of samples for each class label. This is useful to verify if the dataset is **balanced** or **imbalanced**.

In [None]:
df["survived"].value_counts()

In this case, the dataset is slightly **imbalanced**. The *non-survived* class (0) is more frequent than the *survived* class (1).

The next cell counts the number of duplicate rows.

In [None]:
# check for duplicate rows
duplicate_rows = df.duplicated(keep=False).sum()
print(f"Number of duplicate rows: {duplicate_rows}")

There are no duplicate rows in this dataset. However, in Pandas, you can remove duplicate rows using `df.drop_duplicates()`. You can also remove duplicates based on a specific column with `df.drop_duplicates(subset='column_name')`.
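
The difference between the two calls can be sketched on a toy DataFrame (the column names and values are hypothetical):

```python
import pandas as pd

# Toy DataFrame with one fully duplicated row (illustrative only)
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cid"],
    "ticket": ["A1", "B2", "A1", "B2"],
})

# Remove rows that are identical in every column (the first occurrence is kept)
no_full_dupes = toy.drop_duplicates()

# Remove rows that share the same value in the "ticket" column only
no_ticket_dupes = toy.drop_duplicates(subset="ticket")

print(len(toy), len(no_full_dupes), len(no_ticket_dupes))
```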

### 1.2 Train and Test splitting with Stratification

The first step involves splitting your dataset into distinct subsets to ensure that your model can generalize well to unseen data. This step is crucial for evaluating the performance of your model in an unbiased manner.

Datasets are usually split into the following subsets:
- **Training Set**: Subset of your data used to train your model. It is the largest portion from which your model learns the underlying patterns to perform accurate predictions.

- **Validation Set**: (Optional but highly recommended) Subset used to tune the model's hyperparameters and to compare models or configurations in order to select the best-performing one. It acts as a proxy for the test set during the development phase.

- **Test Set**: Subset used to evaluate the final model's performance after it has been trained and validated. It provides an assessment of how well your model has learned to generalize from the training data to new, unseen data.

In this lab, we will only use the training and test sets for simplicity, and due to the low number of samples in the dataset.
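
For reference, when a validation set is needed, a three-way split can be sketched by calling `train_test_split` twice. The toy arrays, the 60/20/20 proportions, and the `random_state` value below are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with imbalanced labels (illustrative only)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 70 + [1] * 30)

# First split off the test set (20%), preserving the label distribution
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, shuffle=True, random_state=42
)

# Then split the remainder into training (60% of total) and validation (20% of total)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())
```

With stratification, the fraction of positive labels stays the same in all three subsets.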


**Exercise:** Split the dataset into **train** and **test** datasets. In this case, the dataset is **imbalanced**. Therefore, it is recommended to split using stratification (i.e., the class label distribution is preserved in both subsets).

**Split** with 80% of samples for training and 20% of samples for testing. **Shuffle** the dataset before splitting, and perform the **stratification** by label. Replace `None` with your code.


<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>
<ul>
    <li> To split a dataset you can use the <code>train_test_split</code> function of Scikit-learn (<a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html" >link</a>).</li>
</ul>
</p>
</details>

In [None]:
#### START CODE HERE (~ 1 line) ####

df_train, df_test = None

#### END CODE HERE ####

In [None]:
print(f"Number of samples in the training set {len(df_train)}")
print(df_train["survived"].value_counts())

In [None]:
print(f"Number of samples in the test set {len(df_test)}")
print(df_test["survived"].value_counts())

### 1.3 Handling missing values

Machine learning algorithms require that all the input values are in a **numerical** format. However, real-world datasets are often "dirty". For instance, they can contain missing values for some columns and records. Before training your ML models, you should handle missing values.

You should first check if **null** values are present in your dataset. Pandas DataFrames have many useful methods to check for null values in your dataset.
- `df.isnull()` or `df.isna()`: They return a DataFrame with the same shape as the input DataFrame, but containing boolean values (True or False) indicating the presence of null values.
- `df.notnull()` or `df.notna()`: The opposite of isnull() and isna().


**Exercise**: Count the number of **null values** in training and test, and store them in the variables `nan_count_train` and `nan_count_test`.
Replace `None` with your code.

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>

<ul>
    <li> Remember that boolean values can also be interpreted as 0 (False) and 1 (True). Thus, you can exploit the <code>df.sum()</code> method to count the number of ones.</li>
</ul>
</p>
</details>

In [None]:
#### START CODE HERE (~ 2 line) ####

nan_count_train = None
nan_count_test = None

#### END CODE HERE ####

In [None]:
print("Train")
print(nan_count_train)

In [None]:
print("Test")
print(nan_count_test)

In several columns of the dataset, missing values are present, specified with `NaN` (i.e., not a number).

There are several strategies for handling missing data, some examples include:
1. **Deletion**: Discard entire rows/columns containing missing values.
2. **Imputation**: Replace missing values with some imputed values (e.g., mean, median, constant, etc.).
3. **Inference**: Use other data points to train a model that can predict the missing values.

> #### 1. Discarding missing values
  * You can **remove** rows or columns containing missing values using the `df.dropna(axis=)` method of Pandas DataFrames. If you specify `axis=0`, it will remove *rows* containing missing values. In contrast, if you specify `axis=1`, it will remove the *columns* containing missing values.
  * You can also remove rows containing missing values in a specific column specifying the `subset` parameter (e.g., `df.dropna(subset = ["column_name"])`). In this case,  all rows containing a missing value in the `column_name` column are removed.
  * Note that `df.dropna()` returns a new DataFrame. Therefore, you should re-assign the new DataFrame to `df` (e.g., `df = df.dropna()`) or set the `inplace` parameter to `True` (e.g., `df.dropna(inplace=True)`).
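
The three variants of `dropna` described above can be compared on a toy DataFrame (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy DataFrame with missing values (illustrative only)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
    "c": [np.nan, 8.0, 9.0],
})

rows_dropped = toy.dropna(axis=0)          # keep only rows without any NaN
cols_dropped = toy.dropna(axis=1)          # keep only columns without any NaN
subset_dropped = toy.dropna(subset=["a"])  # drop rows where "a" is NaN

print(rows_dropped.shape, cols_dropped.shape, subset_dropped.shape)
```

Each call returns a new DataFrame, so `toy` itself is left unchanged.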

> #### 2. Imputing missing values
  * You can impute missing data in Pandas with the `df.fillna()` method by specifying the new value that will replace the `NaN` values. The `df.fillna()` method returns a new DataFrame in which the null values are replaced with the specified value. For instance, you can replace `NaN` values in numeric columns with the column mean using `df.fillna(df.mean(numeric_only=True))`.
  * You can also use Scikit-Learn to impute missing data with `sklearn.impute.SimpleImputer`. The [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) can replace missing values using a descriptive statistic (e.g., mean, median, or most frequent) along each column, or using a constant value.
    - `"mean"`: replace missing values using the *mean* along each column (only for numeric data).
    - `"median"`: replace missing values using the *median* along each column (only for numeric data).
    - `"most_frequent"`: replace missing values using the *most frequent value* along each column (for both strings and numeric data).
  * An example of usage is shown below:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# X_train and X_test are assumed to be numeric Pandas DataFrames

# Instantiate a SimpleImputer object specifying the descriptive statistic
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# Compute the mean by fitting on the training data (important: do not fit on the test data)
imp_mean.fit(X_train.values)

# Replace missing values in the training set
X_train = imp_mean.transform(X_train.values)
# Replace missing values in the test set
X_test = imp_mean.transform(X_test.values)
```

> #### 3. Predicting missing values
Using models to predict the missing values is not covered in this lab. However, the idea is simply to train a machine learning model (e.g., linear regression) to predict the missing values from the other features. If you are interested, you can read more about it [here](https://medium.com/machine-learning-mastery/5-ways-to-handle-missing-values-in-python-4fe6a625e251).
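
As a rough sketch of this idea, the example below predicts a missing value in one column from another column with a linear regression. The toy data and the choice of predictor are assumptions for illustration; in a real setting, the regressor should be fitted on training rows only:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: "age" is missing for one row and is predicted from "fare" (illustrative only)
toy = pd.DataFrame({
    "fare": [10.0, 20.0, 30.0, 40.0],
    "age": [20.0, 30.0, np.nan, 50.0],
})

known = toy[toy["age"].notna()]
missing = toy[toy["age"].isna()]

# Fit a regressor on the rows where the value is known
reg = LinearRegression()
reg.fit(known[["fare"]], known["age"])

# Predict and fill in the missing entries
toy.loc[missing.index, "age"] = reg.predict(missing[["fare"]])
print(toy)
```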

**Exercise**: Fill **null values** in the column `age` with the **mean** of the column `age` in the training and test set.

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>

<ul>
    <li> Remember that all the statistics must be computed on the training set only. Therefore, the null values in the test set must be replaced by the mean of the training set.</li>
</ul>
</p>
</details>

In [None]:
print(f'Number of null values in Train before pre-processing: {df_train.age.isnull().sum()}/{len(df_train)}')
print(f'Number of null values in Test before pre-processing: {df_test.age.isnull().sum()}/{len(df_test)}')

#### START CODE HERE (~ 2 line) ####


#### END CODE HERE ####

print(f'Number of null values in Train after pre-processing: {df_train.age.isnull().sum()}/{len(df_train)}')
print(f'Number of null values in Test after pre-processing: {df_test.age.isnull().sum()}/{len(df_test)}')

**Exercise**: Fill **null values** in the column `fare` with the **median** of the column `fare` in the training and test set.


<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>

<ul>
    <li> Remember that all the statistics must be computed on the training set only. Therefore, the null values in the test set must be replaced by the median of the training set.</li>
</ul>
</p>
</details>

In [None]:
print(f'Number of null values in Train before pre-processing: {df_train.fare.isnull().sum()}/{len(df_train)}')
print(f'Number of null values in Test before pre-processing: {df_test.fare.isnull().sum()}/{len(df_test)}')

#### START CODE HERE (~ 2 line) ####


#### END CODE HERE ####

print(f'Number of null values in Train after pre-processing: {df_train.fare.isnull().sum()}/{len(df_train)}')
print(f'Number of null values in Test after pre-processing: {df_test.fare.isnull().sum()}/{len(df_test)}')

**Exercise**: Fill **null values** in the column `embarked` with the **most frequent value** of the column `embarked` in the training and test set.


<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>

<ul>
    <li> Remember that all the statistics must be computed on the training set only. Therefore, the null values in the test set must be replaced by the most frequent value of the training set.</li>
</ul>
</p>
</details>

In [None]:
print(f'Number of null values in Train before pre-processing: {df_train.embarked.isnull().sum()}/{len(df_train)}')
print(f'Number of null values in Test before pre-processing: {df_test.embarked.isnull().sum()}/{len(df_test)}')

#### START CODE HERE (~ 3 line) ####


#### END CODE HERE ####

print(f'Number of null values in Train after pre-processing: {df_train.embarked.isnull().sum()}/{len(df_train)}')
print(f'Number of null values in Test after pre-processing: {df_test.embarked.isnull().sum()}/{len(df_test)}')

### 1.4 Feature selection
Feature selection is a critical step in the machine learning pipeline, as it involves choosing the most relevant features (or variables) that contribute to the predictive power of a model. The goal of feature selection is not only to improve the model's performance but also to reduce the computational complexity and enhance the interpretability of the model. The main advantages of effective feature selection are the following:
- **Improves Model Performance**: By removing irrelevant or redundant features, it can increase the accuracy of the model and reduce the risk of overfitting.
- **Reduces Training Time**: It can reduce training time by reducing the complexity of the inputs, which is particularly beneficial when dealing with large datasets.
- **Increases Model Interpretability**: Models with fewer features are easier to understand and explain, making the results more accessible to non-experts.

Identifiers, unique codes, etc., usually carry no predictive information and should be removed.

You can learn more about advanced feature selection techniques [here](https://nathanrosidi.medium.com/feature-selection-techniques-in-machine-learning-82c2123bd548).

In this exercise, you will just remove features based on domain knowledge. Specifically, you will remove features that are useless or that contain explicit information related to the target variable (i.e., by using such a feature, the model has access to the actual label). However, data visualization and exploratory data analysis can help in identifying relationships between features and the target variable, as well as spotting redundant features. In this lab, you will also optionally exploit a correlation matrix to remove redundant features.


**Exercise**: Remove columns `cabin`, `body`, `boat`, and `home.dest` from the train and test sets because they contain info about the target variable (i.e., the model could "cheat" predicting the target label based on the info in these attributes).


<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>

<ul>
    <li> You may find the <code>df.drop()</code> method useful.</li>
</ul>
</p>
</details>

In [None]:
#### START CODE HERE (~ 2 line) ####


#### END CODE HERE ####

df_train.head()


**Exercise**: Remove other columns that you think are useless features in predicting which people were more likely to survive.

In [None]:
#### START CODE HERE (~ 2 line) ####


#### END CODE HERE ####

df_train.head()

The next cell plots a **correlation** matrix of the input features with respect to the target variable. The **correlation matrix** is a powerful tool in the data pre-processing phase, especially when you're trying to understand the relationships between your input features and the target variable. Specifically, the **correlation matrix** can be used to:
- **Identify Relationships**: It helps in identifying the linear relationship between the input features and the target variable. A high positive or negative correlation indicates a strong relationship, whereas a correlation close to zero suggests no linear relationship.
- **Feature Selection**: By analyzing the correlation matrix, you can identify and eliminate features that are highly correlated with each other but not with the target variable. This is because highly correlated features contribute redundant information, which can lead to overfitting.
- **Insights for Feature Engineering**: Understanding the relationships between features can also provide insights for feature engineering, such as creating new features that are combinations of existing ones.

In [None]:
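# Build a numeric copy of the data: corr() requires numeric columns
df_corr = df_train[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'survived']].copy()
df_corr['sex'] = (df_corr['sex'] == 'female').astype(int)  # 1 = female, 0 = male
df_corr['survived'] = df_corr['survived'].astype(int)

g = sns.heatmap(df_corr.corr(),
                annot=True,
                cmap="coolwarm")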
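
Beyond reading the heatmap by eye, highly correlated feature pairs can also be listed programmatically. The following is a minimal sketch on synthetic data; the feature names and the `threshold` value are hypothetical and should be adapted to your dataset:

```python
import numpy as np
import pandas as pd

# Toy numeric features (illustrative only): "b" is nearly a scaled copy of "a"
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so that each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # hypothetical cut-off, to be tuned per dataset
high_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > threshold
]
print(high_pairs)
```

For each pair found, one of the two features is a candidate for removal, ideally the one that is less correlated with the target.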

**Exercise (optional)**: Remove or combine highly correlated features based on the correlation matrix.

In [None]:
#### START CODE HERE (~ 1 line) ####

#### END CODE HERE ####

df_train.head()

### 1.5 Feature engineering

Another crucial pre-processing step in the machine learning pipeline is **feature engineering**, which involves creating new features or modifying existing ones to improve the performance of a machine learning model. Specifically, it can be useful to:
- **Improve model accuracy**: Well-designed features can capture essential information, making it easier for models to learn.
- **Improve model's generalizability**: By capturing the underlying patterns in the data more effectively, feature engineering can help models perform better on unseen data.
- **Reduce the need for complex models**: Simpler models with the right features can outperform complex models with a raw set of features.

#### Discretization
Discretization is a pre-processing step of machine learning that involves transforming continuous features into discrete or categorical ones. This process can be particularly useful for certain models that work better with categorical data, or when looking to simplify the patterns in the data, making them more interpretable for analysis. Discretization can also be beneficial for handling outliers and can improve the performance of some models by creating bins or categories that group continuous data points. The main advantages of using discretization can be summarized in the following:
- **Improves model interpretability**: By categorizing continuous features, discretization can potentially make the model's decisions easier to understand.
- **Handles outliers**: Outliers can have less impact when the data is divided into bins, as they will fall into the upper or lower bins along with other extreme values.
- **Reduces Complexity**: Discretization can act as a form of dimensionality reduction, simplifying the model by reducing the number of unique input values.

You can learn more about **discretization** <a href="https://trainindata.medium.com/variable-discretization-in-machine-learning-7b09009915c2" >here</a>.
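
As a small illustration of discretization with Pandas, the sketch below bins a toy continuous feature with `pd.cut`. The feature, the bin edges, and the labels are hypothetical:

```python
import pandas as pd

# Toy continuous feature (illustrative only)
income = pd.Series([12_000, 25_000, 48_000, 90_000, 150_000])

# Bin edges define right-closed intervals: (0, 20000], (20000, 60000], (60000, inf]
bins = [0, 20_000, 60_000, float("inf")]
labels = ["low", "medium", "high"]

income_disc = pd.cut(income, bins=bins, labels=labels)
print(income_disc.tolist())
```

Note that `pd.cut` needs one more bin edge than labels, and that values outside the outermost edges become `NaN`.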

**Exercise**: Discretize the `age` column in the training and test sets into the following categories: `['Child (0-14]', 'Young (14-24]', 'Adults (24-50]', 'Senior (50+]']`. The new discretized age column must be named `age_disc`. The discretized age categories are provided in the `age_category` list. Once you have performed the discretization, remove the old `age` column from the training and test sets.

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>

<ul>
    <li> To segment data into bins you can use the `pd.cut` function. Specify the correct bins values and the labels. You should cut on the `age` column. </li>
    <li>To remove a column you can use `df.drop()`</li>
</ul>
</p>
</details>

In [None]:
age_category = ['Child (0-14]', 'Young (14-24]', 'Adults (24-50]', 'Senior (50+]']

#### START CODE HERE (~ 4 line) ####


#### END CODE HERE ####

In [None]:
df_train.head()

### 1.6 Feature encoding

Machine learning algorithms operate on numerical data, making it essential to convert any categorical input features into a numerical format before training your model. This process is known as **feature encoding**. Proper encoding of input features ensures that the algorithm can interpret the data correctly, leading to more accurate models. Note that this step depends on your algorithm. For instance, decision trees and their ensembles (e.g., random forests) can handle categorical data natively (depending on the implementation), but many models (such as linear regression, logistic regression, and neural networks) require numerical input.

Common encoding techniques include:
- **One-Hot Encoding**: For each unique category value, a new binary column is created. A category value is represented by a 1 in its corresponding column and 0s in all others. This method avoids implying an ordinal relationship but increases the feature space. For instance, a variable color containing three possible values, e.g., `red`, `green`, and `blue`, will create three additional columns: `color_red`, `color_green`, and `color_blue`. To represent the color green, you will represent the input with the following vector `[0, 1, 0]`. The main drawback of one-hot encoding is that it can significantly increase the dataset's sparsity (i.e., the number of zeros). Another possible drawback is that, if data has a natural order to the categories (e.g., low, medium, high) one-hot encoding can lose this information (use ordinal encoding in this case).
- **Label Encoding**: Each unique category value is assigned an integer value. This method is straightforward but implies an ordinal relationship between categories, which may not always be appropriate.
- **Ordinal Encoding**: Similar to label encoding, but the integer assignments are made based on the order specified by the user, making it suitable for ordinal data.

You can learn more about all the encoding techniques [here](https://medium.com/anolytics/all-you-need-to-know-about-encoding-techniques-b3a0af68338b).

#### One-hot encoding

Scikit-Learn makes it easy to perform the one-hot encoding with the [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

You can follow a similar approach with Pandas, which provides the function `pd.get_dummies()` to perform the one-hot encoding ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)).

The two approaches are similar. The main difference is that `get_dummies` does not store the categories seen in the training data. Hence, it may produce inconsistent features between the training and test data. You can learn the differences between `OneHotEncoder` and `get_dummies` [here](https://pythonsimplified.com/difference-between-onehotencoder-and-get_dummies/).

The next cell performs the one-hot encoding on the **training set** for the `sex` and `embarked` columns using the `OneHotEncoder`. Then, it removes the old columns. The new encoded training set is stored in a new DataFrame `df_train_encoded`.

In [None]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore')

categorical_columns = ['sex', 'embarked']

# Fit the one-hot encoder on training data
ohe.fit(df_train[categorical_columns])

# Create a new DataFrame with only the one-hot encoded columns
temp_df_train = pd.DataFrame(data=ohe.transform(df_train[categorical_columns]).toarray(),
                             columns=ohe.get_feature_names_out())

# Create a copy of the DataFrame
df_train_encoded = df_train.copy()

# Remove the old categorical columns from the original data
df_train_encoded.drop(columns=categorical_columns, axis=1, inplace=True)
df_train_encoded = pd.concat([df_train_encoded.reset_index(drop=True), temp_df_train], axis=1)

Now, look at the differences between the original raw training DataFrame and the encoded training DataFrame.

In [None]:
df_train.head()

In [None]:
df_train_encoded.head()

You can see that a new column is created for each distinct category of the encoded columns `sex` and `embarked`.

**Exercise**: Perform the same one-hot encoding on the test set. Create a new DataFrame `df_test_encoded`.

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>

<ul>
    <li> Remember not to fit another OneHotEncoder. Instead, use the one fitted on the training set.</li>
</ul>
</p>
</details>

In [None]:
#### START CODE HERE (~ 4 lines) ####


#### END CODE HERE ####

In [None]:
df_test.head()

In [None]:
df_test_encoded.head()

#### Ordinal encoding

With Scikit-Learn you can perform the ordinal encoding with the [OrdinalEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html).

You previously discretized the `age` column into bins, creating a new column `age_disc`. This column must be encoded as well. However, in this case, the categories have an explicit order. Therefore, ordinal encoding is more suitable.

The next cells perform the ordinal encoding of the `age_disc` column on the training set: they fit the OrdinalEncoder on the training data, transform the training column, and delete the old column.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Instantiate the OrdinalEncoder specifying the list of the categories
ord_enc = OrdinalEncoder(categories=[age_category])  # A list of lists, because you can specify the categories for multiple columns


# Fit the OrdinalEncoder on training data
ord_enc.fit(df_train_encoded.loc[:, ["age_disc"]])

ord_enc

In [None]:
# Transform the training data column 'age_disc' into the encoded version 'age_disc_enc'
df_train_encoded["age_disc_enc"] = ord_enc.transform(df_train_encoded.loc[:, ["age_disc"]])

In [None]:
df_train_encoded.head()

You can see that the new column `age_disc_enc` is represented with an incremental number. Therefore, the order is preserved.

In [None]:
# Delete the old 'age_disc' column
df_train_encoded.drop(columns=["age_disc"], inplace=True)

df_train_encoded.head()

**Exercise**: Perform the same ordinal encoding on the test set, and remove the old `age_disc` column.

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>

<ul>
    <li> Remember not to fit another OrdinalEncoder. Instead, use the one fitted on the training set.
</ul>
</p>

In [None]:
#### START CODE HERE (~ 2 lines) ####


#### END CODE HERE ####

df_test_encoded.head()

### 1.7 Normalization and Standardization


Some machine learning algorithms require input features to be **normalized** or **standardized** to work correctly, as this can significantly impact the performance of the model, especially in algorithms that rely on distance computation or gradient descent optimization.

**Normalization** and **standardization** are two fundamental pre-processing steps that bring the data onto a common scale, making it easier for an algorithm to process. Both rescale the data, but they differ in how they do it and in what they are suited for. The choice between them depends on the specific requirements of the algorithm and the nature of the data.





A common **normalization** technique is **Min-Max** normalization. It is a simple technique that rescales each feature into the range `[0, 1]`. This is particularly useful when the feature values have to be bounded within a fixed range. It is also useful in algorithms that compute distances between data points, where it ensures that each feature contributes comparably to the result.

The formula for **Min-Max normalization** is:

$$ x_{norm} = \frac{(x - x_{min})}{(x_{max} - x_{min})} $$

Where:
- $x$ is the original value.
- $x_{min}$ and $x_{max}$ are the minimum and the maximum of the feature.
- $x_{norm}$ is the normalized value.

Scikit-Learn provides a useful class to perform **Min-Max** normalization, namely [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html).
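
As a minimal sketch of the formula above, the following toy example applies `MinMaxScaler` to a single made-up feature. The arrays `X_train_toy` and `X_test_toy` and their values are purely illustrative and are not part of the lab dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values): one feature, a few training and test samples
X_train_toy = np.array([[10.0], [20.0], [30.0]])
X_test_toy = np.array([[15.0], [40.0]])

# Fit on the training data only: the scaler learns x_min = 10 and x_max = 30
mm_scaler = MinMaxScaler()
X_train_scaled = mm_scaler.fit_transform(X_train_toy)

# Reuse the training statistics to transform the test data
X_test_scaled = mm_scaler.transform(X_test_toy)

print(X_train_scaled.ravel())  # values 0, 0.5 and 1
print(X_test_scaled.ravel())   # values 0.25 and 1.5
```

Note that a test value larger than the training maximum is mapped above 1, because the scaler only knows the range seen during fitting.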

<br><br>

A widely used **standardization** technique is **Z-score** normalization. This method rescales each feature so that it has zero mean ($\mu = 0$) and unit standard deviation ($\sigma = 1$), like a standard normal distribution. Note that it shifts and scales the values but does not change the shape of their distribution. Standardization is particularly appropriate when the data is approximately Gaussian and when the algorithm assumes that the data is centered around zero.

The formula for **Z-score normalization** is:
$$ x_{standardized} = \frac{(x - \mu)}{\sigma}$$

Where:
- $x$ is the original value.
- $\mu$ is the mean of the feature values.
- $\sigma$ is the standard deviation of the feature values.
- $x_{standardized}$ is the standardized value.

Unlike Min-Max normalization, standardization does not bound values to a specific range, which makes it more suitable for features with outliers or with very different variances. Algorithms like Support Vector Machines, Linear Regression, and Logistic Regression benefit significantly from standardization because it helps their optimization algorithms converge.

Scikit-Learn provides a useful class to perform **Z-score** normalization, namely [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).
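
The following toy example is a small sketch of Z-score standardization with `StandardScaler`. The arrays and their values are made up for illustration only. It also shows that the mean and standard deviation are learned from the training data and then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): one feature
X_train_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test_toy = np.array([[3.0], [6.0]])

# Fit on the training data only: learns the mean and the standard deviation
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train_toy)

# Transform the test data with the training statistics
X_test_std = std_scaler.transform(X_test_toy)

print("Learned mean:", std_scaler.mean_)   # mean of the training feature
print("Learned std:", std_scaler.scale_)   # population standard deviation (ddof=0)
print("Train mean after scaling:", X_train_std.mean())
print("Train std after scaling:", X_train_std.std())
print("Test values after scaling:", X_test_std.ravel())
```

After scaling, the training feature has mean 0 and standard deviation 1, while the test values are not bounded to any fixed range.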

**Exercise**: Perform **Min-Max** normalization of the *numerical features* specified in the `numerical_features` variable for both training and test sets. Remember to **fit** on the training set and not on the test set. Note that `age_disc_enc` is an encoded categorical feature in this case, but it can be normalized too.

<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<p>

<ul>
    <li> Use the MinMaxScaler provided by Scikit-Learn.
    <li> You must create the MinMaxScaler object, fit on the training encoded DataFrame, and transform the training and test encoded DataFrames.
    <li> You must use indexing to correctly access only the numerical features of your DataFrames.
</ul>
</p>

In [None]:
from sklearn.preprocessing import MinMaxScaler

numerical_features = ["pclass", "sibsp", "parch", "fare", "age_disc_enc"]

#### START CODE HERE (~ 4 lines) ####


#### END CODE HERE ####

In [None]:
df_train_encoded.head()

You can see that the numerical features are now rescaled into `[0, 1]`.

In [None]:
df_test_encoded.head()

### 2.1 Models training and evaluation

Scikit-Learn offers a wide range of pre-implemented classification algorithms. You can explore the available Scikit-Learn classification algorithms [here](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning).

Training a Scikit-Learn model typically involves the following steps:

- **Instantiating the model**: Select the model and create the model object by setting its parameters.
- **Training the model**: Fit your model to the training data using the `.fit()` method.
- **Evaluating the model**: Once the model is trained, use the `.predict()` method to get predictions on the test set, then assess the model's performance using metrics such as accuracy, precision, recall, and the confusion matrix.
- **Tuning the parameters**: Optionally, use cross-validation and grid search to find the best model parameters.

You can learn more about Scikit-Learn evaluation metrics [here](https://scikit-learn.org/stable/modules/model_evaluation.html).

Scikit-Learn also provides useful functions for [cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html).
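
As a hedged sketch of the optional tuning step, the example below runs 5-fold cross-validation and a small grid search on a synthetic dataset generated with `make_classification`. The dataset, the grid of `C` values, and the `f1_macro` scoring choice are assumptions made for illustration, not settings prescribed by this lab.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic binary classification data (illustrative only)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# 5-fold cross-validation: one score per fold
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=5, scoring="f1_macro"
)
print("Fold scores:", cv_scores)
print("Mean CV score:", cv_scores.mean())

# Grid search over the regularization strength C (hypothetical grid)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="f1_macro"
)
grid.fit(X_toy, y_toy)

print("Best parameters:", grid.best_params_)
print("Best mean CV score:", grid.best_score_)
```

In a real pipeline, both calls would be run on the training set only, so that the test set remains untouched until the final evaluation.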

The next cells train and evaluate a **LogisticRegression** model.

In [None]:
# Extract target variable and input features for the training data
y_train = df_train_encoded['survived']               # Target variable training set
X_train = df_train_encoded.drop('survived', axis=1)  # Features training set


# Extract target variable and input features for the testing data
y_test = df_test_encoded['survived']                 # Target variable test set
X_test = df_test_encoded.drop('survived', axis=1)    # Features test set


In [None]:
from sklearn.linear_model import LogisticRegression

# Initialize the model
lr_model = LogisticRegression(max_iter=1000)  # Increase max_iter if a convergence warning occurs

# Train the model
lr_model.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

# Make predictions
y_test_pred_lr = lr_model.predict(X_test)

# Evaluate the model
lr_acc = accuracy_score(y_test, y_test_pred_lr)
lr_f1 = f1_score(y_test, y_test_pred_lr, average='macro')

# Print accuracy and F1 Score
print(f"Accuracy: {lr_acc:.2f}")
print(f"F1: {lr_f1:.2f}")

Remember that, when the dataset is **imbalanced**, `F1 score` and `recall` are more useful metrics than `accuracy`.
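
To see why accuracy can be misleading, consider this small sketch with made-up labels: 95 negative samples, 5 positive samples, and a trivial classifier that always predicts the majority class. The arrays are hypothetical and unrelated to the lab dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical imbalanced ground truth: 95 negatives and 5 positives
y_true_toy = np.array([0] * 95 + [1] * 5)

# A trivial "classifier" that always predicts the majority class
y_pred_toy = np.zeros(100, dtype=int)

acc = accuracy_score(y_true_toy, y_pred_toy)
rec = recall_score(y_true_toy, y_pred_toy, zero_division=0)
f1 = f1_score(y_true_toy, y_pred_toy, zero_division=0)

print(f"Accuracy: {acc:.2f}")  # high, although no positive is ever found
print(f"Recall:   {rec:.2f}")  # zero: all positives are missed
print(f"F1:       {f1:.2f}")   # zero as well
```

The accuracy is 0.95 even though the model never detects the positive class, while recall and F1 for the positive class are both 0.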

Scikit-Learn provides a useful function to compute several evaluation metrics, namely [classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html).

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

labels = ["Not-Survived", "Survived"]

classification_report_lr = classification_report(y_test, y_test_pred_lr, target_names=labels)
print(classification_report_lr)

The next cell plots the confusion matrix. The confusion matrix is a useful tool for evaluating the performance of classification models.
It provides a visual summary of how well the model predicts across different classes, allowing you to see not just the overall accuracy but also more specific details about where the model is making errors.

In this case the classification task is binary, so the confusion matrix is not very indicative. Nevertheless, the code is given to show how quickly it can be plotted using Scikit-Learn.
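
The four entries of a binary confusion matrix can also be read numerically. The sketch below uses made-up labels (not the lab predictions) to show how true negatives, false positives, false negatives, and true positives relate to precision and recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions for a binary task
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_toy = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # fraction of predicted positives that are correct
recall = tp / (tp + fn)     # fraction of actual positives that are found

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```

Here there are 5 true negatives, 1 false positive, 1 false negative, and 3 true positives, so precision and recall are both 0.75.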

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

cmd = ConfusionMatrixDisplay.from_predictions(y_test, y_test_pred_lr, cmap=plt.cm.Blues)
ax = cmd.ax_
ax.set_title('Confusion Matrix')
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)
plt.show()

**Exercise**: Train a [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and evaluate its performance. Compute the classification report and store it in a variable `classification_report_rf`.

In [None]:
from sklearn.ensemble import RandomForestClassifier

#### START CODE HERE (~ 4 lines) ####


#### END CODE HERE ####

print(classification_report_rf)

The next cells train and evaluate a Support Vector Machine and a simple neural network model.

In [None]:
from sklearn.svm import SVC

svm_model = SVC(gamma=1.5, kernel="rbf", probability=True)

svm_model.fit(X_train, y_train)

y_test_pred_svm = svm_model.predict(X_test)

classification_report_svm = classification_report(y_test, y_test_pred_svm, target_names=labels)
print(classification_report_svm)

In [None]:
from sklearn.neural_network import MLPClassifier

mlp_model = MLPClassifier(hidden_layer_sizes=(256, 32), max_iter=500).fit(X_train, y_train)

y_test_pred_mlp = mlp_model.predict(X_test)

classification_report_mlp = classification_report(y_test, y_test_pred_mlp, target_names=labels)
print(classification_report_mlp)

---

## **Exercise 2: Diabetes prediction**

In this exercise, you will train machine learning models to predict diabetes in patients based on their medical history and demographic information, using the **Diabetes prediction dataset**.

The **Diabetes prediction dataset** is a collection of medical and demographic data records from patients, and their diabetes status (positive or negative).


This is an example of a real-world medical application. Indeed, this model can be useful for healthcare professionals in identifying patients who may be at risk of developing diabetes and in developing personalized treatment plans.


Each record includes several features, such as:
- **age**
- **gender**
- **body mass index** (BMI)
- **hypertension**
- **heart disease**
- **smoking history**
- **HbA1c level**
- **blood glucose level**  

In [None]:
# Import the required libraries for this exercise

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn import tree

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# If your dataset is stored on Google Drive, mount the drive before reading it
# from google.colab import drive
# drive.mount('/content/drive')

Before running the next cell, upload the dataset to Colab.

In [None]:
!unzip diabetes-dataset.zip

In [None]:
df = pd.read_csv('/content/diabetes_prediction_dataset.csv')

In [None]:
df.head()

In [None]:
# Check if the dataset is balanced
df.diabetes.value_counts()

**Exercise**: Now you will implement the **pre-processing pipeline**, and **train** and **evaluate** a **binary classifier** on the target variable.

The following steps are recommended to complete the task. However, it is up to you to make specific choices about the pre-processing to be performed.

Steps:

1. Perform the pre-processing:
  * Split into **train** and **test** sets (80% train and 20% test).
  * **Remove** useless or redundant features.
  * **Combine features** to create new features.
  * Handling **missing values**.
  * Perform **discretization** of features if necessary.
  * Encode **categorical features**.
  * Perform **normalization** or **standardization** of input features.
  * **Encode the target** variable if necessary.



2. Train one or more **binary classifiers** to predict the diabetes status of patients. Use appropriate evaluation metrics to identify the best-performing model.

***Hints*** :

>
* When performing the pre-processing steps, compute the statistics on the training set and transform the test data accordingly.
* All the categorical features must be properly encoded.
* The dataset is highly imbalanced. F1 score and recall are more appropriate metrics for this task.
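
The first hint can be sketched on toy data as follows. This is only an illustration of the fit-on-train, transform-on-test pattern under stated assumptions: the DataFrame `toy`, its columns `feature` and `target`, the stratified split, and the median imputation strategy are all hypothetical choices, not the required solution.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data with a few missing values
toy = pd.DataFrame({
    "feature": np.arange(20, dtype=float),
    "target": [0] * 16 + [1] * 4,
})
toy.loc[[2, 7, 18], "feature"] = np.nan

# 80/20 split; stratify keeps the class proportions similar in both sets
train_toy, test_toy = train_test_split(
    toy, test_size=0.2, stratify=toy["target"], random_state=42
)
train_toy, test_toy = train_toy.copy(), test_toy.copy()

# Compute the imputation statistic on the training set only
imputer = SimpleImputer(strategy="median")
train_toy[["feature"]] = imputer.fit_transform(train_toy[["feature"]])

# Reuse the training statistic on the test set
test_toy[["feature"]] = imputer.transform(test_toy[["feature"]])

print("Median learned on training data:", imputer.statistics_)
print(test_toy)
```

The same pattern applies to encoders and scalers: call `fit` (or `fit_transform`) on the training data and only `transform` on the test data.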

---

This time the exercise is **open-ended**, so it is up to you to write all the code to carry out these steps.