
# Data Preprocessing

A key step in any machine learning project is data preprocessing, which prepares raw data for model training and evaluation. Proper preprocessing can significantly improve model performance and robustness, while poor preprocessing can lead to misleading results or even model failure.

A few common preprocessing tasks include:

1.  **Handling missing values**: Missing data can skew results or cause algorithms to fail. Common strategies for handling missing values include:
    -   Removing rows/columns with missing values
    -   Imputing missing values using mean, median, or mode
    -   Using advanced techniques like KNN imputation
2.  **Feature scaling**: Many algorithms are sensitive to the scale of input features. Common methods include:
    -   Standardization (z-score normalization): Rescales features to have mean 0 and standard deviation 1
    -   Min-Max scaling: Rescales features to a fixed range, typically [0, 1]
3.  **Encoding categorical variables**: Machine learning algorithms typically require numerical input. Common techniques include:
    -   One-hot encoding: Converts each category of a categorical variable into its own binary column
    -   Label encoding: Assigns a unique integer to each category
4.  **Removing duplicates**: Duplicate rows can bias model training and evaluation.
5.  **Feature engineering**: Creating new features from existing ones can improve model performance. Examples include:
    -   Polynomial features
    -   Interaction terms
6.  **Pipeline creation**: Using pipelines to chain preprocessing steps and model fitting can simplify the workflow and ensure consistent preprocessing across training and testing data.
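
Several of the tasks listed above can be chained into a single scikit-learn pipeline. The following is a minimal sketch on hypothetical toy data: the column names `amount` and `group`, and variables such as `toy_X`, are illustrative assumptions and not part of the Titanic demonstration below.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numerical and one categorical feature
toy_X = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0, 50.0, 20.0, 40.0],
    "group": ["a", "b", "a", "b", "a", "b"],
})
toy_y = np.array([0, 1, 0, 1, 0, 1])

# Numerical column: impute missing values, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Route each column to the preprocessing it needs
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["amount"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["group"]),
])

# Chain preprocessing and the model into one estimator
clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])

clf.fit(toy_X, toy_y)
toy_transformed = clf.named_steps["preprocess"].transform(toy_X)
print(toy_transformed.shape)  # one scaled column plus two one-hot columns
toy_predictions = clf.predict(toy_X)
```

Because every step is fitted inside `clf.fit`, the same learned statistics are reused whenever `clf.predict` is called on new data.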

## Practical Demonstration

We will demonstrate data preprocessing using the `pandas` library for data manipulation and `scikit-learn` for machine learning tasks. The dataset we will use is the Titanic dataset, which contains information about passengers on the Titanic and whether they survived.

### Loading the data

-   Load the Titanic dataset

In [None]:
from sklearn.datasets import fetch_openml

# Fetch the Titanic dataset from OpenML via scikit-learn
data = fetch_openml('titanic', version=1, as_frame=True)
print(data['DESCR'])

-   Convert the dataset to a pandas DataFrame and display the first few rows

In [None]:
df = data.frame
print(df.info())
df.head()

### Exploratory Data Analysis (EDA)

We'll start with a bit of EDA to understand the dataset better, beginning with the distribution of the target variable `survived`:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x='survived', data=df)
plt.title('Survival Distribution')
plt.xlabel('Survived')
plt.ylabel('Count')
plt.show()

As a first guess, we might hypothesize that the survival rate depends on socio-economic status, which we can infer from the `fare` column. Let's visualize the distribution of fares for survivors and non-survivors:

In [None]:
sns.boxplot(x='survived', y='fare', data=df, log_scale=True)
plt.title('Fare Distribution by Survival')
plt.xlabel('Survived')
plt.ylabel('Fare')
plt.show()

This seems to be borne out by the data: survivors tend to have paid higher fares than non-survivors. Another indicator of socio-economic status is the passenger class, which we can visualize as follows:

In [None]:
sns.countplot(x='pclass', hue='survived', data=df)
plt.title('Passenger Class vs Survival')
plt.xlabel('Passenger Class')
plt.ylabel('Count')
plt.legend(title='Survived', loc='upper right')
plt.show()

Another hypothesis is that the survival rate depends on the gender of the passengers, e.g. "Women and children first". Let's visualize the distribution of survival by gender:

In [None]:
sns.countplot(x='sex', hue='survived', data=df)
plt.title('Gender vs Survival')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.legend(title='Survived', loc='upper right')
plt.show()

To keep the example small, we will keep only two of the features explored above, `fare` and `sex`, together with the target variable `survived`.

In [None]:
df = df[['fare', 'sex', 'survived']]
X = df.drop(columns=data.target.name)
y = df[data.target.name]

### Train-Test Split

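We split the data into a training set (80%) and a test set (20%) before any preprocessing, so that imputation and scaling statistics are learned from the training data only.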

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
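
As an optional variation, not used in the cell above, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both splits. A minimal sketch on hypothetical toy labels (70 negatives, 30 positives):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 30% belonging to class 1
toy_features = np.arange(100).reshape(-1, 1)
toy_labels = np.array([0] * 70 + [1] * 30)

f_train, f_test, l_train, l_test = train_test_split(
    toy_features, toy_labels,
    test_size=0.2,        # hold out 20% of the rows for testing
    random_state=42,      # make the shuffle reproducible
    stratify=toy_labels,  # preserve the class ratio in both splits
)

print(len(f_train), len(f_test))
print(l_train.mean(), l_test.mean())  # share of class 1 in each split
```

Stratification matters most when one class is rare, since a purely random split could otherwise leave very few of its examples in the test set.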

### Data Preprocessing Steps

#### Handling Missing Values

Missing values are common in real-world datasets. In the Titanic dataset, the `fare` column has some missing values. We will handle these missing values by imputing them with the mean fare.

First, let's check for missing values in the training and test sets:

In [None]:
print(X_train.isnull().sum())
print(X_test.isnull().sum())
df.info()

We can see that the `fare` column has a single missing value. We will fill it with the mean fare of the training set using the `SimpleImputer` class from `scikit-learn`.

In [None]:
# Fill missing values in the 'fare' column with the mean
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_train[['fare']] = imputer.fit_transform(X_train[['fare']])
X_test[['fare']] = imputer.transform(X_test[['fare']])
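
To make the fit/transform distinction concrete, here is a minimal sketch on hypothetical toy values (the `toy_` names are illustrative). The imputer learns the mean from the training column only and reuses it for the test column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy fares with one missing value in each split
toy_train = pd.DataFrame({"fare": [10.0, 20.0, np.nan, 30.0]})
toy_test = pd.DataFrame({"fare": [np.nan, 100.0]})

toy_imputer = SimpleImputer(strategy="mean")
toy_train_filled = toy_imputer.fit_transform(toy_train)  # learns the training mean
toy_test_filled = toy_imputer.transform(toy_test)        # reuses that same mean

print(toy_imputer.statistics_)  # mean of the non-missing training values
print(toy_test_filled.ravel())
```

The missing test value is filled with the training mean, not with the mean of the test column, so no information leaks from the test set into preprocessing.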

#### Feature Scaling

Feature scaling is important for algorithms that are sensitive to the scale of input features, such as logistic regression, k-nearest neighbors, and support vector machines. We will scale the `fare` feature using standardization (z-score normalization), which rescales the feature to have mean 0 and standard deviation 1.

In [None]:
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler instance
scaler = StandardScaler()

# Fit and transform the training data
X_train[['fare']] = scaler.fit_transform(X_train[['fare']])

# Transform the test data
X_test[['fare']] = scaler.transform(X_test[['fare']])
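
Standardization replaces each value x with (x - mean) / std, where the mean and standard deviation are computed on the training data. A minimal sketch on hypothetical toy values, checking `StandardScaler` against the manual formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column with four values
toy_values = np.array([[1.0], [2.0], [3.0], [4.0]])

toy_scaler = StandardScaler().fit(toy_values)
scaled = toy_scaler.transform(toy_values)

# Manual z-score; np.std uses the population standard deviation (ddof=0),
# which is also what StandardScaler uses
manual = (toy_values - toy_values.mean()) / toy_values.std()

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())

# New data is scaled with the statistics learned from the fitted data
new_scaled = toy_scaler.transform(np.array([[2.5]]))
print(new_scaled)
```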

#### Encoding Categorical Variables

Categorical variables need to be converted into numerical format for machine learning algorithms. The appropriate encoding depends on whether the variable is nominal or ordinal. For a nominal categorical variable, whose categories have no natural order, we can use **one-hot encoding**. For an ordinal categorical variable, whose categories have a natural order, we can use **label encoding**, which maps each category to an integer (in scikit-learn, `OrdinalEncoder` does this for feature columns).

In the reduced Titanic dataset, the only categorical variable is `sex`, which is a nominal variable with two categories, `male` and `female`. We will use one-hot encoding to convert it into numerical format. One-hot encoding creates one binary column per category, indicating the presence or absence of that category in each row.

In [None]:
X_train.head()

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Create a OneHotEncoder instance
one_hot_encoder = OneHotEncoder()

# Encode the 'sex' column
X_train_encoded = one_hot_encoder.fit_transform(X_train[['sex']]).toarray()
X_test_encoded = one_hot_encoder.transform(X_test[['sex']]).toarray()
# Create a DataFrame with the encoded columns
import pandas as pd
X_train_encoded_df = pd.DataFrame(X_train_encoded, columns=one_hot_encoder.get_feature_names_out(['sex']))
X_test_encoded_df = pd.DataFrame(X_test_encoded, columns=one_hot_encoder.get_feature_names_out(['sex']))
# Concatenate the encoded columns with the original DataFrame
X_train = pd.concat([X_train.reset_index(drop=True), X_train_encoded_df], axis=1)
X_test = pd.concat([X_test.reset_index(drop=True), X_test_encoded_df], axis=1)

# Drop the original 'sex' column
X_train = X_train.drop(columns=['sex'])
X_test = X_test.drop(columns=['sex'])

We have now preprocessed the dataset, and it is ready for training a machine learning model. Here's what the preprocessed training data looks like:

In [None]:
X_train.head()

We've only kept two features: `fare` (numerical, now standardized) and `sex` (categorical, now represented by the one-hot columns `sex_female` and `sex_male`). The target variable is `survived`, which we want to predict.

### Training a Machine Learning Model

Now that we have preprocessed the data, we can train a machine learning model. We are dealing with a binary classification problem (predicting survival), so we will use logistic regression as our model. Logistic regression is a simple yet effective algorithm for binary classification tasks, and we will discuss it in more detail in a later session.

In [None]:
from sklearn.linear_model import LogisticRegression

# Create a logistic regression model
model = LogisticRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

### Evaluating the Model

After training the model, we need to evaluate its performance on the test data. We will use the accuracy metric, which measures the proportion of correct predictions made by the model.

In [None]:
# Evaluate the model on the test data
accuracy = model.score(X_test, y_test)
print(f'Model accuracy: {accuracy:.2f}')
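
Accuracy is simply the fraction of predictions that match the true labels. A minimal sketch on hypothetical toy labels, comparing a manual computation with `accuracy_score` from scikit-learn:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for eight samples
toy_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
toy_guess = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Fraction of positions where the prediction equals the true label
manual_accuracy = (toy_true == toy_guess).mean()
library_accuracy = accuracy_score(toy_true, toy_guess)

print(manual_accuracy, library_accuracy)
```

Accuracy alone can be misleading when the classes are imbalanced, which is one reason to also inspect the confusion matrix and per-class metrics.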

In [None]:
# Make predictions on the test data
y_pred = model.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['did not survive', 'survived']))

In [None]:
# Plot the normalized confusion matrix
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
plt.figure(figsize=(8, 6))
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues',
            xticklabels=['Not Survived', 'Survived'],
            yticklabels=['Not Survived', 'Survived'])
plt.title('Normalized Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
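
The row normalization above divides each row of the confusion matrix by the number of true samples in that class, so each row sums to 1. As a sketch on hypothetical toy labels, the same result can be obtained with the `normalize='true'` option of `confusion_matrix`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions
toy_actual = np.array([0, 0, 0, 0, 1, 1])
toy_predicted = np.array([0, 0, 0, 1, 1, 0])

toy_cm = confusion_matrix(toy_actual, toy_predicted)

# Manual normalization: divide each row by its row sum
toy_cm_manual = toy_cm.astype(float) / toy_cm.sum(axis=1)[:, np.newaxis]

# Built-in normalization over the true labels (rows)
toy_cm_builtin = confusion_matrix(toy_actual, toy_predicted, normalize="true")

print(toy_cm)
print(toy_cm_manual)
print(np.allclose(toy_cm_manual, toy_cm_builtin))
```

With this normalization, the diagonal entries are the per-class recall values.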

## Hands-on Exercises

In this hands-on exercise, you will apply the preprocessing steps from above to the Iris dataset, with a twist: instead of using the species as the target variable, you will use the petal length. The goal is to predict the petal length from the other features (sepal length, sepal width, petal width, and species).

-   Load the Iris dataset
-   Extract the features and target variable
    -   Features: sepal length, sepal width, petal width, species
    -   Target variable: petal length
-   Perform a train-test split
-   Preprocess the data:
    -   Handle missing values (if any)
    -   Scale the numerical features (sepal length, sepal width, petal width)
    -   Encode the categorical variable (species) using one-hot encoding
    -   Remove any unnecessary columns
-   Train a linear regression model to predict the petal length using the preprocessed data