<img src="images\Logo_UCLL_ENG_RGB.png" style="background-color:white;" />

# Data Analytics & Machine learning

Lecturers: Aimée Lynn Backiel, Chidi Nweke, Daan Nijs

Academic year 2023-2024

## Lab 7: Machine learning, part 3

### Lecture outline

1. Recap of previous weeks
2. Automating machine learning pipelines with scikit-learn
3. Model evaluation 

The focus of the last lab was on using scikit-learn's tools to build models. On top of that, we saw several important performance metrics for regression and how to use them in scikit-learn. Our overarching goal remains the same: we want to try out various approaches and select the best one afterwards. This lecture continues that work.

### Recap of last lecture(s)

#### Lab 1

1. We ensured we had a valid Python installation.
2. We learnt what a virtual environment is:
   * Isolated Python executable and packages.
   * We created a virtual environment.
3. Absolute path vs relative path recap.
4. Recap of data structures in Python

#### Lab 2
1. Installed Pandas
2. Learnt how to read data
3. Learnt how to calculate mean, mode, median etc.
4. Basic exploration of the 4 variables

#### Lab 3
1. Wrapped up computing summary statistics (mean, median, mode, ...)
2. Learnt how to deal with outliers 
3. Focused on exploration of data

#### Lab 4
1. Univariate data visualization using Matplotlib
   1. Figures and axes
   2. Histograms
   3. Box plots
   4. Bar charts
2. Multivariate data visualization using Seaborn
   1. Scatter plots
   2. Small multiples
   3. Color coding

#### Lab 5
1. Intro to machine learning using scikit-learn
   1. Preprocessing
      1. One Hot encoding
      2. Scaling
      3. Outliers
   2. Regression

#### Lab 6
1. Preprocessing with scikit-learn
   1. ColumnTransformer: Apply a transformation to specific columns.
   2. Pipeline: Do several transformations after each other
2. Evaluation:
   1. Why the mean of the error is a bad idea
   2. Mean absolute error
   3. Mean squared error

### The case

Ada Turing Travelogue, or as everyone calls her, Ada, just started working part time at her parents' travel agency. She has a keen understanding of and interest in everything related to applied computer science, ranging from server & system management to full stack software development. Through database foundations she already understands how to query data, and programming 1 and 2 covered the essentials of the Python programming language. Recently she decided to start learning about data analytics & machine learning as well.

She uses her skills to connect to the travel agency's database, where she finds many normalized tables. Ada recalls what she learnt in database foundations and performs all the correct joins. Afterwards she saves the data in the `data/` folder.


She finds the following dataset:

| Column Name          | Description                                                                                       |
| -------------------- | ------------------------------------------------------------------------------------------------- |
| SalesID              | Unique identifier for each sale.                                                                  |
| Age                  | Age of the traveler.                                                                              |
| Country              | Country of origin of the traveler.                                                                |
| Membership_Status    | Membership level of the traveler in the booking system; could be 'standard', 'silver', or 'gold'. |
| Previous_Purchases   | Number of previous bookings made by the traveler.                                                 |
| Destination          | Travel destination chosen by the traveler.                                                        |
| Stay_length          | Duration of stay at the destination.                                                              |
| Guests               | Number of guests traveling (including the primary traveler).                                             |
| Travel_month         | Month in which the travel is scheduled.                                                           |
| Months_before_travel | Number of months prior to travel that the booking was made.                                       |
| Earlybird_discount   | Boolean flag indicating whether the traveler received an early bird discount.                     |
| Package_Type         | Type of travel package chosen by the traveler.                                                    |
| Cost                 | Calculated cost of the travel package.                                                            |
| Margin               | The cost (for the traveler) minus what the travel agency pays.                                    |
| Additional_Services_Cost | The amount spent on additional services (towels, car rentals, room service, ...) during the trip. |


#### Our challenge

Before getting into harder use cases we will start off by predicting the cost of a given stay. Right now Ada's parents do this manually, so automating this task would already be a big help to their business.

<center>
<img src="https://www.datascience-pm.com/wp-content/uploads/2021/02/CRISP-DM.png" style="background-color:white;width:50%">
</center>

### Machine learning with sci-kit learn


<center>
<img src="https://scikit-learn.org/stable/_images/grid_search_workflow.png" style="background-color:white;width:50%">
</center>

##### ❓ What have we done so far of the image above? What stages have we completed?


We have done train test splitting and we have built a few models on the training set. 

We will continue from last lecture. We covered two cornerstones of sklearn namely:

1. `make_column_transformer`, this allows you to specify preprocessing you want to apply for different columns. E.g., scaling for numeric columns and one hot encoding for categorical.
2. `make_pipeline`, this allows you to compose several steps. So our first step could be preprocessing and the second our ML model. A pipeline is an end-to-end object that lets you go from raw data to a model.

##### model evaluation using sci-kit learn (Summary of last lab's code)

In [1]:
import pandas as pd # by convention
pd.options.display.float_format = '{:.2f}'.format
from sklearn.model_selection import train_test_split
import plotly.express as px
import numpy  as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error


In [None]:
travel_dataset = pd.read_csv("data/lab_7_dataset.csv")
X = travel_dataset.drop(columns="cost") 
y = travel_dataset["cost"] 
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

In [None]:
cat_columns = ["package_Type", "destination", "country"]
numeric_columns = ["guests", "age", "stay_length"]

preprocessing = make_column_transformer(
    (StandardScaler(), numeric_columns),
    (OneHotEncoder(sparse_output=False), cat_columns),
    remainder="drop"
)

In [None]:
lin_reg_pipe = make_pipeline(preprocessing, LinearRegression())

 ##### Answer to previous lab's question (New material as from here)
 
 ❓ Use the pipeline approach discussed above and the mean_absolute_error and mean_squared_error for the following models: RandomForestRegressor, HistGradientBoostingRegressor, DecisionTreeRegressor and LinearRegression.



This is the answer to last session's final question. It also serves as a summary of all the new approaches covered there:

1. `model_name_pair` contains a list of tuples with different models inside. We loop over them and try them one-by-one.
2. `make_pipeline` takes the preprocessing (scaling numeric columns and one hot encoding categorical columns) and immediately places the model behind it. Both `.fit` and `.predict` will now apply the preprocessing and the model in one go.
3. Predictions are made on the training and the test set.
4. The results are evaluated by means of the MAE and MSE.

💡 for those interested: `n_jobs = 7` instructs the model to use 7 CPU cores while training. This is a considerable speed-up for random forest.

In [None]:
model_name_pair = [
    ("random_forest", RandomForestRegressor(n_jobs=7)),
    ("gradient boosting", HistGradientBoostingRegressor()),
    ("decision tree", DecisionTreeRegressor()),
    ("linear regression", LinearRegression()),
]
results = []
for name, model in model_name_pair:
    print(f"STARTING {name}")
    # Chain the preprocessing and the model into one end-to-end estimator
    pipe = make_pipeline(preprocessing, model)
    pipe.fit(X_train, y_train)
    predictions_train = pipe.predict(X_train)
    predictions_test = pipe.predict(X_test)
    print("-"*20)
    print(f"The MAE on the training set is {mean_absolute_error(y_train, predictions_train)} and on the test set it is {mean_absolute_error(y_test, predictions_test)}")
    # Note: squared=False returns the square root of the MSE (the RMSE).
    # Recent scikit-learn releases replace this argument with root_mean_squared_error.
    print(f"The mse on the training set is {mean_squared_error(y_train, predictions_train, squared=False)} and on the test set it is {mean_squared_error(y_test, predictions_test, squared=False)}\n")

STARTING random_forest


--------------------
The MAE on the training set is 91.4326020995687 and on the test set it is 235.0808060883703
The mse on the training set is 124.22135738060179 and on the test set it is 314.0951110477195

STARTING gradient boosting
--------------------
The MAE on the training set is 215.27566294564866 and on the test set it is 221.9883392863986
The mse on the training set is 270.5222135184237 and on the test set it is 282.95933184951315

STARTING decision tree
--------------------
The MAE on the training set is 12.197864719583333 and on the test set it is 277.8483461916666
The mse on the training set is 51.045890849136754 and on the test set it is 408.31580508143423

STARTING linear regression
--------------------
The MAE on the training set is 307.0123379074999 and on the test set it is 302.02662900999997
The mse on the training set is 415.36793401262753 and on the test set it is 406.0298174069804



##### ❓ Comment on the behavior of the models. Are they overfitting? 

It seems like the decision tree and random forest are both overfitting: there is a large gap between their performance on the training set and on the test set.
Linear regression is underfitting: its performance on test and train is similarly bad.
Gradient boosting is neither underfitting nor overfitting. The difference between test and train is minimal and, on top of that, its MAE and MSE on the test set are better than all the alternatives we have tried.
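One rough way to make the "gap" concrete is to divide the test error by the training error. The sketch below only reuses the MAE values printed above (rounded to two decimals); the ratio is an illustrative heuristic, not an official overfitting metric.

```python
# MAE values copied (rounded) from the output above: (train, test)
mae = {
    "random forest": (91.43, 235.08),
    "gradient boosting": (215.28, 221.99),
    "decision tree": (12.20, 277.85),
    "linear regression": (307.01, 302.03),
}

# A ratio close to 1 means train and test errors are similar;
# a ratio far above 1 hints at overfitting.
gap_ratio = {name: test / train for name, (train, test) in mae.items()}

for name, ratio in sorted(gap_ratio.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<18} test MAE is {ratio:.2f}x the training MAE")
```

A ratio near 1 does not by itself mean the model is good: linear regression has a ratio near 1 but a high error on both sets.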

#### Finding the best model: feature engineering

Recall in the previous session we made a number of observations:

* If you're visiting the country you're from, the trip seemed to be cheaper.
* If you're traveling within approximately the same continent, it's also cheaper.
* There might be an interaction between age and month.
* Maybe we should look at age in groups.

In [None]:
destination_to_country = {
        "New York": "USA",
        "Rome": "Italy",
        "Paris": "France",
        "Tokyo": "Japan",
        "Cairo": "Egypt",
        "Sydney": "Australia",
        "Rio": "Brazil",
        "Cape Town": "South Africa",
    }
country_to_continent = {
        "USA": "America",
        "UK": "EMEA",
        "France": "EMEA",
        "Canada": "America",
        "Australia": "Asia",
        "Germany": "EMEA",
        "Spain": "EMEA",
        "Italy": "EMEA",
    }
destination_to_continent = {
        "New York": "America",
        "Rome": "EMEA",
        "Paris": "EMEA",
        "Tokyo": "Asia",
        "Cairo": "EMEA",
        "Sydney": "Asia",
        "Rio": "America",
        "Cape Town": "Africa",
    }

##### ❓ Use Pandas to make variables that indicate if they traveled to the same country and then to the same continent. 

##### HINT1: Look at [the map method for Pandas series](https://pandas.pydata.org/docs/reference/api/pandas.Series.map.html). 

##### HINT2: Remember, a series is simply a column, so `df[column]` gives you a series.

In [None]:
X_train["country"] == X_train["destination"].map(destination_to_country)

39087    False
30893    False
45278    False
16398    False
13653    False
         ...  
11284     True
44732    False
38158    False
860      False
15795    False
Length: 40000, dtype: bool

In [None]:
X_train["country"].map(country_to_continent) == X_train["destination"].map(destination_to_continent)

39087    False
30893    False
45278    False
16398    False
13653     True
         ...  
11284     True
44732    False
38158     True
860      False
15795    False
Length: 40000, dtype: bool
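One detail of `Series.map` with a dictionary is worth knowing: values that are missing from the dictionary become `NaN`, and a comparison with `NaN` evaluates to `False`. The toy data and the `toy_` dictionaries below are invented for illustration; they are not the lab's dataset.

```python
import pandas as pd

# Toy data; "Japan" is deliberately missing from the country mapping below.
toy = pd.DataFrame({
    "country": ["Italy", "USA", "Japan"],
    "destination": ["Rome", "Rome", "Tokyo"],
})
toy_country_to_continent = {"Italy": "EMEA", "USA": "America"}
toy_destination_to_continent = {"Rome": "EMEA", "Tokyo": "Asia"}

origin_continent = toy["country"].map(toy_country_to_continent)
destination_continent = toy["destination"].map(toy_destination_to_continent)

# Rows whose country was not found in the dictionary
print(origin_continent.isna().tolist())

# Unmapped rows silently compare as "not the same continent"
same_continent = origin_continent == destination_continent
print(same_continent.tolist())
```

Checking `.isna()` after mapping is a quick way to spot countries or destinations that the dictionaries do not cover.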

When trying to add these variables there is some friction: they don't fit nicely into our scikit-learn `Pipeline` workflow. We could add them to our entire dataset before splitting, but adding variables to the entire dataset is not risk-free. Doing so may lead to the methodological error we spoke about previously: **data leakage**. In the scope of this course it's fine to use this approach to *add* variables. We'll briefly show you the more principled way, but you don't need to know this for the exam. It involves creating a custom `Transformer` which we can then compose in our pipeline as usual.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from dataclasses import dataclass

@dataclass
class CountryMapping(TransformerMixin, BaseEstimator):
    country_to_continent: dict[str, str]
    destination_to_country: dict[str, str]
    destination_to_continent: dict[str, str]

    def fit(self, X, y=None):
        return self
    
    def transform(self, X: pd.DataFrame):
        _X = X.copy() # so we don't change our input data frame
        _X['country_match'] = _X['country'] == _X['destination'].map(self.destination_to_country)
        _X['continent_match'] = _X['country'].map(self.country_to_continent) == _X['destination'].map(self.destination_to_continent)
        return _X

In [None]:
country_map = CountryMapping(country_to_continent, destination_to_country, destination_to_continent)

In [None]:
country_map.fit_transform(X_train)

Unnamed: 0,sales_id,age,country,membership_status,previous_purchases,destination,stay_length,guests,travel_month,months_before_travel,earlybird_discount,package_Type,rating,margin,additional_services_cost,country_match,continent_match
39087,39088,24,Germany,silver,2,Tokyo,4,2,12,5,True,Adventure,5,1404.79,103.43,False,False
30893,30894,34,USA,standard,2,Paris,4,2,8,1,False,Cultural,5,1135.69,345.63,False,False
45278,45279,32,Italy,standard,2,Rio,4,3,2,4,True,Relaxation,5,199.89,86.47,False,False
16398,16399,53,Germany,silver,6,Rio,2,2,1,1,False,Adventure,5,64.28,-55.23,False,False
13653,13654,36,Germany,silver,3,Cairo,2,1,2,4,True,Relaxation,5,352.40,-75.80,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11284,11285,30,Italy,gold,6,Rome,2,1,6,3,False,Cultural,7,94.40,21.36,True,True
44732,44733,29,Germany,standard,1,Cape Town,2,2,2,3,False,Adventure,7,-157.90,109.68,False,False
38158,38159,33,Germany,standard,2,Cairo,6,2,2,3,False,Adventure,6,665.88,166.64,False,True
860,861,43,Germany,standard,4,Cape Town,1,2,2,2,False,Cultural,5,-271.69,136.77,False,False


We can later add this as a new preprocessing step to our column transformer.

💡 We briefly make a transformer called `do_nothing_to`. The semantics aren't important as this is something you will rarely need in practice and definitely not on the exam. All it does is make an anonymous function (a function without a name) that takes an input and returns that as output so it effectively does nothing. We make it a transformer by giving it to `FunctionTransformer`.

💡 The reason why this is necessary is that our column transformer is configured to drop all unused columns. We need to "use" a column for it to stay in our dataset.

In [None]:
from sklearn.preprocessing import FunctionTransformer  # its documented home is sklearn.preprocessing


do_nothing_to = FunctionTransformer(lambda x: x)

In [None]:
preprocessing_country = make_column_transformer(
    (StandardScaler(), numeric_columns),
    (OneHotEncoder(sparse_output=False), cat_columns),
    (do_nothing_to,  ["country_match", "continent_match"]),
    remainder="drop"
)
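As an aside, scikit-learn's column transformers also accept the string `"passthrough"` in place of a transformer, which keeps the listed columns unchanged. The sketch below uses a made-up three-row data frame (the `toy` names are not part of the lab) to show the idea; it is an alternative to `do_nothing_to`, not a replacement you need to know.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

# Made-up data purely for illustration
toy = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "country_match": [True, False, True],
    "unused": ["a", "b", "c"],
})

toy_preprocessing = make_column_transformer(
    (StandardScaler(), ["age"]),
    ("passthrough", ["country_match"]),  # kept as-is, no transformation
    remainder="drop",                    # "unused" is dropped
)

transformed = toy_preprocessing.fit_transform(toy)
print(transformed.shape)  # one scaled column plus one passed-through column
```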

##### Interaction terms

<img src="https://www.jmp.com/en_se/statistics-knowledge-portal/what-is-multiple-regression/mlr-with-interactions/_jcr_content/par/styledcontainer_2069/par/lightbox_3be9/lightboxImage.img.png/1548351208495.png">

Let's consider an alternative approach to the problem above using interaction terms. Interaction terms are created by multiplying two variables together, effectively creating a new variable. This is a mathematical way of capturing the 'AND' condition in our data. For example, after one-hot encoding categorical variables like 'Country' and 'Destination', we can generate interaction terms to explore the combined effect of these two features.

Suppose we have 'New York', 'Rome', 'Tokyo', and 'Cairo' as categories for 'City' (the destination) and 'USA', 'Italy', 'Japan', and 'Egypt' for 'Country' (the traveler's origin). If we create interaction terms for 'City' and 'Country', we end up with additional columns such as 'New York x USA', 'Rome x Italy', and so on. Each of these new columns will have a value of 1 only if both contributing variables (e.g., 'City' is 'New York' AND 'Country' is 'USA') are 1; otherwise, the value will be 0. This new variable thus answers the question: "Is the traveler from country X AND going to city Y?"

By including interaction terms, we allow our model to consider the combined influence of two variables, which can be particularly insightful when the effect of one variable on the outcome depends on the level of another variable.

In scikit-learn, interaction effects between variables can be encoded using the `PolynomialFeatures` transformer. Interaction effects are valuable in a model when the relationship between two features can affect the outcome in a way that is not simply additive.

For example, consider two binary features, A and B. Individually, they might have a certain effect on the target variable Y. However, when both A and B occur together (i.e., A=1 and B=1), their combined effect on Y could be different from the sum of their individual effects. This is where interaction terms come into play.

The PolynomialFeatures transformer can not only generate polynomial features, which are features raised to a power (like $x^{2}$ or $x^{3}$), but also interaction features, which are products of features (like $x_{1} * x_{2}$).

 Note: polynomial is called "veelterm" in Dutch.

We will start off by generating interactions between all our numeric columns.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(interaction_only=True)

numeric_preprocessing = make_pipeline(StandardScaler(), poly)

preprocessing_interactions = make_column_transformer(
    (numeric_preprocessing, numeric_columns),
    (do_nothing_to,  ["country_match", "continent_match"]),
    (OneHotEncoder(sparse_output=False), cat_columns),
    remainder="drop"
)


##### Binning

<img src="https://miro.medium.com/max/2000/1*LGTAObYYj2-fdBMFLz30rw.jpeg" style="width:50%">

Binning is a technique that involves segmenting a continuous variable into several intervals, or 'bins'. Similar to the way we categorize data when creating histograms, binning transforms a continuous variable into an ordered categorical variable. One hot encoding these bins allows us to introduce non-linear effects into our linear models, which ordinarily would interpret the data as having a constant slope.

Take temperature as an example: people generally enjoy mild increases in weather warmth, but there's a threshold beyond which higher temperatures become unpleasant. Binning would let us model this non-linear relationship. Instead of treating temperature as a single continuous predictor with a constant effect, we could divide temperatures into ranges (e.g., 0-10°C, 10-20°C, 20-30°C, etc.) and treat each range as a separate category. By one hot encoding these categories, we enable our model to capture the varying effects of different temperature ranges on people's comfort levels. This approach can reveal more complex patterns in how the predictor variable (in this case, temperature) influences the outcome variable (such as people's reported happiness).
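The temperature example can be sketched with `KBinsDiscretizer`. The temperatures below are made up, and `encode="ordinal"` with `strategy="uniform"` is chosen here only so that the equal-width bin numbers are easy to read; the lab code further down uses the defaults instead.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Made-up temperatures in degrees Celsius, one column
temperatures = np.array([[0.0], [5.0], [12.0], [18.0], [25.0], [30.0]])

# Three equal-width bins; "ordinal" returns the bin number instead of one hot columns
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
bin_index = binner.fit_transform(temperatures)

print(binner.bin_edges_[0])  # edges of the three bins
print(bin_index.ravel())     # bin number assigned to each temperature
```

With one hot encoding (the default `encode="onehot"`), each bin becomes its own column, so a linear model can learn a separate effect per temperature range.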

In [None]:
from sklearn.preprocessing import KBinsDiscretizer


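# encode="onehot-dense" and sparse_output=False keep every block dense,
# so the combined output is a regular array that all our models accept.
preprocessing_bins = make_column_transformer(
    (StandardScaler(), numeric_columns),
    (KBinsDiscretizer(encode="onehot-dense"),  ["age"]),
    (do_nothing_to,  ["country_match", "continent_match"]),
    (OneHotEncoder(sparse_output=False), cat_columns),
    remainder="drop"
)

# Same as above, but the numeric columns also get interaction terms
preprocessing_bins_interactions = make_column_transformer(
    (numeric_preprocessing, numeric_columns),
    (KBinsDiscretizer(encode="onehot-dense"), ["age"]),
    (do_nothing_to,  ["country_match", "continent_match"]),
    (OneHotEncoder(sparse_output=False), cat_columns),
    remainder="drop"
)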

#### Finding the best model: cross validation

As we explore various models and preprocessing techniques, we encounter a dilemma: how do we identify the best model without biasing our selection? 

##### ❓ What is the weakness of the train-test split approach? Think about this before continuing.

Using just one test set might lead us to choose a model that excels on that particular subset of data by sheer luck. What if a different shuffle of the data leads to a different 'best' model?

##### ❓ Think about a solution for this before we continue.

To resolve this, let's consider a more robust method. Imagine if we could test each model not on one but multiple randomized slices of our data. This is where cross-validation comes into play.


Here's how it works: We divide our training set into smaller sections, say 20% chunks. We train our model on 80% of the data, then validate it on the remaining 20%. We repeat this process five times, each time with a different 20% held out for validation. This technique, known as k-fold cross-validation (with k being the number of chunks or 'folds' we create), allows each model a fair shot at proving itself across the entirety of our data.

By averaging the performance across these folds, we obtain a more reliable measure of a model's quality. This thorough approach increases our confidence that we're selecting the best model, not by chance, but by consistent performance.
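The splitting itself can be sketched with `KFold` on ten made-up samples. This only shows which rows end up in which fold; no model is trained here.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # 10 toy samples

# 5 folds: each fold holds out a different 20% for validation
folds = list(KFold(n_splits=5).split(X_toy))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train on {len(train_idx)} rows, validate on rows {val_idx.tolist()}")
```

Every row is used for validation exactly once and for training in the other four folds.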

<center><img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" style="background-color:white"></center>

##### ❓ Sanity check: if we do K-fold cross validation, how many models have we trained? Answer for 2-fold, 3-fold, 5-fold and K-fold.

2, 3, 5 and K. You make K splits so you train K models.

##### ❓ What is the downside of K-fold cross validation?

Since you're training a lot of models it could take a prohibitively long time to complete.

As usual, sci-kit learn makes it easy to do cross validation. 

We supply `cross_val_score` with 3 arguments:

* The machine learning model (or pipeline)
* `X_train`
* `y_train`
  
Additionally you can use `cv` to specify how many folds you want, and you can pick a `scoring` parameter, which is the metric that will be reported to you as the performance on each fold.

In [None]:
from sklearn.model_selection import cross_val_score

lin_reg_cv_results = cross_val_score(lin_reg_pipe, X_train, y_train, cv=5, scoring= "neg_root_mean_squared_error")
lin_reg_cv_results

array([-407.54493892, -430.31872512, -415.4150392 , -416.00540587,
       -408.91429495])

In [None]:
np.mean(lin_reg_cv_results)

-415.639680810938
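The scores are negative because scikit-learn's scorers follow the convention that higher is better, so error metrics are reported with a flipped sign. A small sketch of converting them back, using the fold scores printed above (copied by hand into `neg_rmse_per_fold`):

```python
import numpy as np

# Fold scores copied from the linear regression output above
neg_rmse_per_fold = np.array(
    [-407.54493892, -430.31872512, -415.4150392, -416.00540587, -408.91429495]
)

# Flip the sign to get ordinary RMSE values, where lower is better
rmse_per_fold = -neg_rmse_per_fold
print(f"mean RMSE: {rmse_per_fold.mean():.2f}, spread (std): {rmse_per_fold.std():.2f}")
```

Looking at the spread across folds as well as the mean gives a feel for how much the score depends on the particular split.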

In [None]:
# Use the full column transformer: numeric_preprocessing alone cannot handle the text columns in X_train
decision_tree_pipe = make_pipeline(preprocessing, DecisionTreeRegressor())

In [None]:
decision_tree_cv_results = cross_val_score(decision_tree_pipe, X_train, y_train, cv=5, scoring= "neg_root_mean_squared_error")
decision_tree_cv_results

array([-411.71180971, -423.54799952, -413.5144439 , -407.63445237,
       -412.44588413])

In [None]:
np.mean(decision_tree_cv_results)

-413.7709179244554

##### ❓ Interpret these values. Which model performs better?

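`neg_root_mean_squared_error` is the negated RMSE, so values closer to zero are better. The mean scores are about -415.6 for linear regression and -413.8 for the decision tree: the two models are practically tied, with the decision tree ahead by less than 2.

##### ❓ Contrast the performance seen in the beginning of this notebook of these two models. Think about overfitting and so on. (Key Insight!)

When we evaluated the decision tree on the training set it performed really well. Once we added the test set into the mix it became clear the model was overfitting. Cross validation exposes the same thing **without** having to use our test set: on held-out folds the decision tree's very low training error disappears and it does not do meaningfully better than plain linear regression.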

##### ❓ Your turn: experiment with different models and setups. Try out the new preprocessing pipelines with a variety of machine learning models and cross validation. Feel free to reuse some of the evaluation code from the beginning of the lecture. 

In [None]:
preprocessing_country 
preprocessing_interactions
preprocessing_bins
preprocessing_bins_interactions;

# Your pipeline should look like this: pipe = make_pipeline(country_map, prep, model)
# Deviations from this will likely cause an error.

In [None]:
model_name_pair = [
    ("random_forest", RandomForestRegressor(n_jobs=7)),
    ("gradient boosting", HistGradientBoostingRegressor()),
    ("decision tree", DecisionTreeRegressor()),
    ("linear regression", LinearRegression()),
]
# A separate name, so the `preprocessing` column transformer defined earlier is not overwritten
preprocessing_options = [
    ("same country and continent", preprocessing_country),
    ("interactions", preprocessing_interactions),
    ("bins", preprocessing_bins),
    ("bins and interactions", preprocessing_bins_interactions),
]
results = []
for name, model in model_name_pair:
    for prep_name, prep in preprocessing_options:
        print(f"STARTING {name} with preprocessing = {prep_name}")
        # country_map adds country_match and continent_match before the column transformer selects them
        pipe = make_pipeline(country_map, prep, model)
        pipe.fit(X_train, y_train)
        predictions_train = pipe.predict(X_train)
        predictions_test = pipe.predict(X_test)
        print("-"*20)
        print(f"The MAE on the training set is {mean_absolute_error(y_train, predictions_train)} and on the test set it is {mean_absolute_error(y_test, predictions_test)}")
        # squared=False returns the square root of the MSE (the RMSE)
        print(f"The mse on the training set is {mean_squared_error(y_train, predictions_train, squared=False)} and on the test set it is {mean_squared_error(y_test, predictions_test, squared=False)}\n")

STARTING random_forest with preprocessing = same country and continent
--------------------
The MAE on the training set is 90.08049818885435 and on the test set it is 231.75422228752436
The mse on the training set is 122.5409192023597 and on the test set it is 308.03027639237615

STARTING random_forest with preprocessing = interactions
--------------------
The MAE on the training set is 90.2064913094094 and on the test set it is 231.37146928853224
The mse on the training set is 122.51715614838321 and on the test set it is 307.448897817142

STARTING random_forest with preprocessing = bins
--------------------
The MAE on the training set is 90.2778500380723 and on the test set it is 231.16272497215812
The mse on the training set is 122.50225850425151 and on the test set it is 307.0998906937639

STARTING random_forest with preprocessing = bins and interactions
--------------------
The MAE on the training set is 89.9867915547324 and on the test set it is 231.37971949728416
The mse on the t

##### ❓ Is this an improvement over before?

For gradient boosting it's not a noticeable improvement. For linear regression, on the other hand, it did improve the MSE by approximately 20. This makes sense because linear regression is a very simple model that considers the effect of each of the variables on the target separately. By adding non-linearities such as interactions and binning, the model is able to improve.

Gradient boosting, on the other hand, is a decision tree algorithm. Specifically, it is an ensemble of decision trees where each new tree is trained to reduce the remaining error of all the previous trees. Trees are capable of handling complex non-linearities by themselves: their splits effectively perform binning and capture interactions.

#### Finding the best model: can we do more?

We have already tried different methods like creating new features and grouping data. These methods have sometimes made our models better. But, there's more we can do.

Machine learning is a process that goes in circles, as shown in the CRISP-DM image. Once we build models, there are ways to look more closely at how they work.

1. Comparing Predicted and Actual Results
   
The first step we did was to compare what our model predicted with the real results. This helps us see where the model does well and where it needs more work. It also shows anomalies, for instance certain models were predicting negative costs which should not be possible.

2. Looking at Residuals
   
Here, we look at the errors - the difference between the real values (y_true) and what the model predicted (y_pred). We do this for each input in our model. The main idea is to look for patterns in these errors. For example, if we make charts showing errors for each country and notice that errors are bigger for certain countries, it tells us that we might need to improve our features. This is like doing analysis on the results of our model.
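A minimal sketch of such a residual analysis with Pandas is shown below. The numbers in `toy_results` are invented purely to illustrate the mechanics; in the lab you would use `y_test` and your pipeline's predictions instead.

```python
import pandas as pd

# Made-up true values and predictions, purely to illustrate the mechanics
toy_results = pd.DataFrame({
    "country": ["Germany", "Germany", "Italy", "Italy", "USA", "USA"],
    "y_true": [1000.0, 1200.0, 800.0, 900.0, 1500.0, 1700.0],
    "y_pred": [950.0, 1250.0, 790.0, 905.0, 1100.0, 2100.0],
})

# Residual = real value minus prediction
toy_results["residual"] = toy_results["y_true"] - toy_results["y_pred"]

# Mean absolute residual per country: large values point at groups the model struggles with
mae_per_country = toy_results["residual"].abs().groupby(toy_results["country"]).mean()
print(mae_per_country.sort_values(ascending=False))
```

In this toy example the errors for one country are far larger than for the others, which would be a reason to look for features that describe that group better.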

##### Parameters and hyperparameters (only theory)

Another thing we can do is called **hyperparameter tuning**. Hyperparameters are simply the "settings" of the machine learning model, chosen before training. Parameters are what the model learns from the data and uses to make predictions. So for linear regression the parameters are the coefficients and for decision trees they are the splits it makes. 

Most machine learning models have hyperparameters (settings) as well. Typically they are used to limit the complexity of the model. For a decision tree a hyperparameter is, for instance, the number of splits it is allowed to make. For random forest and gradient boosting the number of trees is a hyperparameter. For linear regression the regularization constant (a penalty if the coefficients get large) is also a hyperparameter. **Hyperparameters help us combat overfitting.**
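The distinction can be sketched on made-up data: the coefficients of a linear regression are learned during `.fit` (parameters), while `max_depth` of a decision tree is something we choose ourselves beforehand (a hyperparameter). The data and the value `max_depth=2` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Made-up data following y = 2x + 1 exactly
X_toy = np.arange(20, dtype=float).reshape(-1, 1)
y_toy = 2.0 * X_toy.ravel() + 1.0

# Parameters: learned from the data during .fit
lin_reg = LinearRegression().fit(X_toy, y_toy)
print(lin_reg.coef_, lin_reg.intercept_)

# Hyperparameter: max_depth is set by us before fitting and limits the tree's complexity
shallow_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, y_toy)
print(shallow_tree.get_depth(), shallow_tree.get_n_leaves())
```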


Typically complex models have many different hyperparameters you can tune. We encourage you to look at the number of hyperparameters for the models we have used so far to get an idea:

* [Linear regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
* [Gradient boosting](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html)
* [Random forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
* [Decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)


Some models are very sensitive to their choice of hyperparameters. With their defaults they typically overfit; this is the case for random forest and decision trees (see above).

To identify the best hyperparameters, techniques like grid search and random search are used. Grid search exhaustively tries every combination in a predefined grid of candidate values, offering a thorough but computationally intensive approach. Random search, on the other hand, randomly samples a predetermined number of combinations, which is quicker but may miss the best combination.

<center>
<img src="https://www.researchgate.net/publication/341691661/figure/fig2/AS:896464364507139@1590745168758/Comparison-between-a-grid-search-and-b-random-search-for-hyper-parameter-tuning-The.png"></center>


##### ❓ Hyperparameters can be sensitive to specific parts of the data. We don't want hyperparameters that do well on one specific part of the data. What can we do to make sure we select hyperparameters that work well for the entire dataset?

For each hyperparameter combination we do K-fold cross validation. This means if we have 2 hyperparameters with 10 candidate values each, we have 100 combinations which each need K models, so 100 × K models in total, assuming you are doing grid search. This is a very expensive procedure.
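As this part is theory only, the following is just a small sketch of what grid search with cross validation looks like in scikit-learn, using `GridSearchCV` on synthetic data from `make_regression`. The grid values are arbitrary assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, purely for illustration
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 3 values x 2 values = 6 combinations
param_grid = {
    "max_depth": [2, 4, 8],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=3,  # every combination is evaluated with 3-fold cross validation
    scoring="neg_root_mean_squared_error",
)
search.fit(X_toy, y_toy)

n_combinations = len(search.cv_results_["params"])
print(f"{n_combinations} combinations x {search.n_splits_} folds = {n_combinations * search.n_splits_} fits")
print(search.best_params_)
```

Even this tiny grid needs one fit per combination per fold, which shows how quickly the cost grows with more hyperparameters or more candidate values.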

**The key question is if manually doing grid search or random search is worth it.** The answer, as usual, is it depends.

Counter-arguments: 

1. For linear regression, scikit-learn offers regularized variants that perform hyperparameter tuning automatically:

* [RidgeCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html)
* [ElasticNetCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNetCV.html)
   
Their names come from the specific regularization they perform; covering that is out of scope for this course.

The CV stands for "cross validation". These models perform hyperparameter tuning with cross validation by default. For linear regression this is fine as this is a very inexpensive model to train.

2. It is very time consuming.
3. The difference can be negligible. You also don't know ahead of time if it will matter.
4. It requires you to somewhat know the internal details of the models.
5. Gradient boosting is extremely powerful with the default settings, it may not even require tuning.
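To illustrate the first counter-argument, here is a minimal sketch of `RidgeCV` on synthetic data. The list `candidate_alphas` is an arbitrary assumption; after fitting, the regularization constant that was selected through cross validation is available as `alpha_`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic data, purely for illustration
X_toy, y_toy = make_regression(n_samples=200, n_features=10, noise=20.0, random_state=0)

# Candidate regularization constants to choose from (arbitrary values)
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

# Fitting both tunes the regularization constant and trains the final model
ridge = RidgeCV(alphas=candidate_alphas).fit(X_toy, y_toy)
print(ridge.alpha_)
```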


Arguments for:

1. In some cases a 1 % difference absolutely makes a difference.

This is related to counter-argument 3: whether a difference is negligible depends on the scale. For instance, if you have a ML model to predict how much food will be thrown out for a small grocery store, a difference of 1 % does not really matter. If this is on the scale of the entire franchise, then it is a worthwhile investment.

### Summary

Revisiting the complex image we began with, after our lab today, the process should be clearer:

<center>
<img src="https://scikit-learn.org/stable/_images/grid_search_workflow.png" style="background-color:white;width:50%">
</center>


1. You start by dividing your dataset into two parts: training and test data.
2. The training data is then used for cross-validation. This technique is a reliable way to assess model performance and can be paired with hyperparameter tuning, although that's not mandatory.
3. Choose the model or models that perform best in the cross-validation phase.
4. These top models are then trained with the entire set of training data.
5. Finally, we test these models on the test data. The results from this step give us our final evaluation metrics.


Remember, this method, aside from the optional step of hyperparameter tuning, is crucial for your exam. Evaluating and training on the same data is a practice we want to avoid because it can lead to overfitting, which is why we test our models on unseen data.