# Machine Learning 2023-2024 - UMONS

# Introduction to regression and classification with Scikit-Learn

This notebook is an introduction to the library [scikit-learn](https://scikit-learn.org/stable/), which provides numerous tools to easily perform machine learning tasks. 

In this lab, we will experiment with two of the most frequently encountered tasks in machine learning: 
  - **Regression**, for a continuous outcome.
  - **Classification**, for a discrete outcome.

To give you a feel for the general pipeline of a machine learning task, we will cover:
- Data splitting
- Linear regression
- One-hot encoding
- $K$-nearest neighbors classification
- Basic metrics for regression and classification
- Confusion matrices

We will start with an example of linear regression.

**Import the necessary libraries**

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder

np.set_printoptions(precision=2)

**Load the 'Pokemon.csv' dataset as a pandas DataFrame**

In [2]:
df = pd.read_csv('data/Pokemon.csv')
df.head()

**Change the 'Type 1', 'Type 2', 'Generation' and 'Legendary' variables to categorical**

In [3]:
print(df.dtypes)
df = df.astype({'Type 1': 'category', 'Type 2': 'category', 'Generation': 'category', 'Legendary': 'category'})
df.dtypes

**Create a variable `X` containing the predictor 'Attack' and a variable `y` containing the target variable 'HP'.**

In [4]:
X = df[['Attack']]
y = df['HP']
sns.scatterplot(x='Attack', y='HP', data=df)

**Split the dataset into training and test sets following an 80%/20% partition.**

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, test_size=0.2, shuffle=True, random_state=0
)

(X_train.shape, y_train.shape), (X_test.shape, y_test.shape)

**Build a linear regression model and fit it to the training data.**

In [6]:
model = LinearRegression(fit_intercept=True)
model.fit(X_train, y_train)

**Compute the mean squared error (MSE) on both the training and test sets.**
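
For reference, the MSE returned by `mean_squared_error` with its default arguments is the average squared difference between the observed values $y_i$ and the predictions $\hat{y}_i$ over the $n$ observations of a set:

$$
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2
$$

As a small made-up example (not taken from the dataset), with $y = (3, 5)$ and $\hat{y} = (2, 7)$ we get $\mathrm{MSE} = \frac{(3-2)^2 + (5-7)^2}{2} = \frac{1 + 4}{2} = 2.5$.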

In [7]:
# Make predictions for both the training and the test sets.
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Compute the mean squared error on both sets.
MSE_train = mean_squared_error(y_train, y_pred_train)
MSE_test = mean_squared_error(y_test, y_pred_test)

print(f'MSE on training set: {MSE_train:.2f}')
print(f'MSE on test set: {MSE_test:.2f}')

**Plot the regression line**

In [8]:
# Generate predictions from the fitted model over a grid of 'Attack' values.
x_plot = pd.DataFrame(np.linspace(0, 200, 100), columns=['Attack'])
y_plot = model.predict(x_plot)

# Plot the regression line on top of the data, on the axes created here.
fig, ax = plt.subplots()
sns.scatterplot(x='Attack', y='HP', data=df, ax=ax)
ax.plot(x_plot['Attack'], y_plot, label='Regression line', color='red')
ax.legend();

## Regression task with a linear regression model

**1) Your turn! Create a variable `X` containing the predictors 'Attack' and 'Defense' and a variable `y` containing the target variable 'HP'.**

**2) Split the dataset into a training set and a test set. Follow an 80%/20% partition, and make sure the dataset is shuffled.**

**3) Fit a linear regression model to the training data.** 

**4) What is the expression of the fitted model? You need to access the model's parameters using `.coef_` and `.intercept_` to answer this question.**
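
As a hedged illustration of how `.coef_` and `.intercept_` map to a model expression, here is a minimal sketch on synthetic, noise-free data. The data and the names `X_toy`, `y_toy` and `toy_model` are made up for this example and are unrelated to the Pokemon dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (hypothetical, not the Pokemon dataset):
# y = 5 + 2 * x1 + 3 * x2
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(50, 2))
y_toy = 5 + 2 * X_toy[:, 0] + 3 * X_toy[:, 1]

toy_model = LinearRegression().fit(X_toy, y_toy)

# coef_ holds one weight per column of X, intercept_ holds the constant term.
b0 = toy_model.intercept_
b1, b2 = toy_model.coef_
print(f'y_hat = {b0:.2f} + {b1:.2f} * x1 + {b2:.2f} * x2')
```

Because the toy data is noise-free, the fitted parameters should match the generating values 5, 2 and 3 up to numerical precision. With real data they are estimates.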

**5) Using the fitted model, predict the values of the target variable 'HP' in the training and test sets.**

**6) For both the training and test sets, evaluate the model's predictions using the Mean Squared Error (MSE). What do you observe?**

**7) Can you implement the mean squared error computation yourself on the test set and verify that the result matches the one obtained using scikit-learn?**

**8) Consider the variable 'Generation' as an additional predictor. We will treat it as a categorical variable and encode it using one-hot encoding.**

We will not feed 'Generation' as is to the model. 
Instead, we'll use the `OneHotEncoder` class to preprocess it, which will create a new binary variable 
(also called 'dummy variable') for each of the $K$ categories of 'Generation'.

Here, 'Generation' has $K=6$ categories, so the one-hot encoding will create 6 binary variables. 
For each dummy variable, a '1' means that the observation belongs to that category, while a '0' 
means it does not. Since each observation belongs to exactly one category, exactly one of the 6 
dummy variables takes the value '1' and the others are '0'.

**We can retrieve the categories of the variable 'Generation' using the attribute `Series.cat.categories`.**

In [17]:
df['Generation'].cat.categories

**8.1) Create a variable `X` containing the predictors 'Attack', 'Defense', and 'Generation', and a variable `y` containing the target variable 'HP'.**

**8.2) Split the dataset into training and test sets following an 80%/20% partition.**

**We create a one-hot encoding for the variable 'Generation' using the `OneHotEncoder` class. With `handle_unknown='ignore'`, a category that was not seen when fitting the encoder does not raise an error: all of its dummy columns are set to 0.**
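
The effect of `handle_unknown='ignore'` can be sketched on made-up data in which one category appears only after fitting. The names `toy_train`, `toy_test` and `toy_ohe_ignore` are hypothetical, and `.toarray()` is used here simply to print a dense array.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: generation 4 appears only in the "test" part.
toy_train = pd.DataFrame({'Generation': [1, 2, 3, 1]})
toy_test = pd.DataFrame({'Generation': [2, 4]})

toy_ohe_ignore = OneHotEncoder(handle_unknown='ignore')
toy_ohe_ignore.fit(toy_train)  # categories learned from the training part: 1, 2, 3

# .toarray() converts the default sparse output to a dense array.
toy_test_encoded = toy_ohe_ignore.transform(toy_test).toarray()
print(toy_test_encoded)
```

The first row has a single '1' in the column for generation 2, while the row for the unseen generation 4 contains only zeros.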

In [20]:
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Function to remove the column to be encoded and add the encoded columns with correct feature names
def encode_df(df, ohe):
    encoding = ohe.transform(df[['Generation']])
    return pd.concat([
        df.drop(columns='Generation').reset_index(drop=True),
        pd.DataFrame(encoding, columns=ohe.get_feature_names_out())
    ], axis=1)

**8.3) Fit the one-hot encoder exclusively to the training set for the variable 'Generation'. Then, remove the column to be encoded and add the encoded columns with the correct feature names. You can use the function `encode_df` defined above.**

Any pre-processing step must be fitted to the training data only. Fitting it to the full dataset would cause data leakage (i.e. the model having access to information from the test set during training). Once fitted to the training set, the pre-processing step can then be applied to the test set.

This also applies when you replace missing values with a column statistic (e.g. mean, median, max or min).
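
As a minimal sketch of this principle for missing values, the example below uses scikit-learn's `SimpleImputer`. That class is our choice for illustration and is not used elsewhere in this notebook. The arrays `X_toy_train` and `X_toy_test` are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy feature with missing values (not the real dataset).
X_toy_train = np.array([[1.0], [2.0], [np.nan], [3.0]])
X_toy_test = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(strategy='median')
imputer.fit(X_toy_train)  # the median is computed from the training rows only

X_toy_train_filled = imputer.transform(X_toy_train)
X_toy_test_filled = imputer.transform(X_toy_test)  # reuses the training median

print(imputer.statistics_)        # median learned from the training set
print(X_toy_test_filled.ravel())  # missing test value replaced by that median
```

The missing test value is filled with the training median (2), not with a statistic that involves the test value 10.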

**8.4) Fit the linear regression model to the training set, and get the model's coefficients. How is the model written now?**

**8.5) Compute the MSE on the training and test sets.**

## Classification task with a KNN classifier

**9) Using the function `sns.histplot`, plot three histograms showing the distributions of the 'HP', 'Attack' and 'Defense' variables, using the `hue` parameter to distinguish legendary from non-legendary Pokemon. What do you observe?**

**10) Use the `KNeighborsClassifier` class of scikit-learn with 5 neighbors to predict whether a Pokemon is legendary or not, using the variables 'Attack', 'Defense', and 'HP' as features. To this end, apply the following steps:**
- **Select the features and the target variable.**
- **Split your dataset into a training set and a test set following an 80%/20% partition.**
- **Fit the model to the training set, and predict the variable 'Legendary' on the training and test sets.**
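
For intuition about how a $K$-nearest neighbors classifier assigns labels, here is a minimal sketch on made-up one-dimensional data. The names `X_toy`, `y_toy`, `toy_knn` and `toy_pred` are hypothetical, and 3 neighbors are used only because the toy set is tiny.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D toy data: small values are class 0, large values are class 1.
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# 3 neighbors here because the toy set is tiny; the exercise asks for 5.
toy_knn = KNeighborsClassifier(n_neighbors=3)
toy_knn.fit(X_toy, y_toy)

# Each new point receives the majority class among its 3 nearest training points.
toy_pred = toy_knn.predict(np.array([[1.5], [10.5]]))
print(toy_pred)
```

The point 1.5 has the training points 0, 1 and 2 as its nearest neighbors and is assigned class 0. The point 10.5 is closest to 10, 11 and 12 and is assigned class 1.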

**11) Compute the accuracy score of the model's predictions on the training and test sets using the function `accuracy_score`.** 

**12) Can you implement the accuracy computation yourself on the test set and verify that the result matches the one obtained using scikit-learn?**

**13) Look at the distribution of the variable 'Legendary' in the test dataset using `sns.countplot`. What do you observe?**

**14) Get the confusion matrix of the predictions on the test set using the `confusion_matrix` function. What do you observe, and how do your observations relate to the accuracy of the model?**
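
As a reading aid, the sketch below shows how a confusion matrix is laid out in scikit-learn and how accuracy can be recovered from it. The labels and predictions are made up for illustration and are not results from the Pokemon dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions (not results from the Pokemon dataset).
y_true_toy = np.array([0, 0, 0, 0, 1, 1])
y_pred_toy = np.array([0, 0, 0, 1, 1, 0])

# Rows correspond to true classes, columns to predicted classes.
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
print(cm_toy)

# Correct predictions lie on the diagonal, so accuracy = trace / total count.
acc_from_cm = np.trace(cm_toy) / cm_toy.sum()
print(acc_from_cm, accuracy_score(y_true_toy, y_pred_toy))
```

In this toy case, 4 of the 6 predictions lie on the diagonal, so both ways of computing the accuracy give 4/6. The off-diagonal entries show which class each error comes from, which the accuracy alone does not reveal.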