---
title: "Introduction to Predictive Models"
subtitle: "IN2004B: Generation of Value with Data Analytics"
author: 
  - name: Alan R. Vazquez
    affiliations:
      - name: Department of Industrial Engineering
format: 
  revealjs:
    chalkboard: false
    multiplex: false
    footer: "Tecnologico de Monterrey"
    logo: IN2004B_logo.png
    css: style.css
    slide-number: True
    html-math-method: mathjax
editor: visual
jupyter: python3
---


## Load the libraries

</br>

Before we start, let's import the data science libraries into Python.


In [None]:
#| echo: true
#| output: false

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

Here, we use specific functions from the **pandas**, **matplotlib**, **seaborn**, and **scikit-learn** (`sklearn`) libraries in Python.

## Main data science problems

</br>

[**Regression Problems**]{style="color:green;"}. The response is numerical. For example, a person's income, the value of a house, or a patient's blood pressure.

[**Classification Problems**]{style="color:blue;"}. The response is categorical and involves *K* different categories. For example, the brand of a product purchased (A, B, C) or whether a person defaults on a debt (yes or no).

The predictors ($\boldsymbol{X}$) can be *numerical* or *categorical*.

## Main data science problems

</br>

[**Regression Problems**]{style="color:green;"}. The response is numerical. For example, a person's income, the value of a house, or a patient's blood pressure.

[**Classification Problems**. The response is categorical and involves *K* different categories. For example, the brand of a product purchased (A, B, C) or whether a person defaults on a debt (yes or no).]{style="color:gray;"}

The predictors ($\boldsymbol{X}$) can be *numerical* or *categorical*.

## Regression problem

</br>

**Goal**: Find the best function $f(\boldsymbol{X})$ of the predictors $\boldsymbol{X} = (X_1, \ldots, X_p)$ that predicts the response $Y$.

In mathematical terms, we want to establish the following relationship:

$$Y = f(\boldsymbol{X}) + \epsilon$$

-   Where $\epsilon$ is a natural (random) error.
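
As a minimal sketch of this relationship, the snippet below simulates a response from a made-up function plus random error. The function `f`, its coefficients, and the error spread are assumptions chosen only for illustration; in practice $f$ is unknown.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

def f(x):
    # Hypothetical "true" function, chosen only for illustration
    return 7.0 + 0.05 * x

X = rng.uniform(0, 300, size=200)         # one numerical predictor
epsilon = rng.normal(0.0, 3.0, size=200)  # random error centered at zero
Y = f(X) + epsilon                        # observed response

# The gap between the response and f(X) is the random error
print(np.allclose(Y - f(X), epsilon))
```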

## How to find the shape of $f(\boldsymbol{X})$?

</br>

Using training data.

![](images/TrainVal1.png){fig-align="center"}

## How to find the shape of $f(\boldsymbol{X})$?

</br>

Using training data.

![](images/TrainVal2.png){fig-align="center"}

## How to evaluate the quality of the candidate function $\hat{f}(\boldsymbol{X})$?

:::::: center
::::: columns
::: {.column width="40%"}
Using validation data.
:::

::: {.column width="60%"}
![](images/TrainVal3.png){fig-align="center"}
:::
:::::
::::::

## How to evaluate the quality of the candidate function $\hat{f}(\boldsymbol{X})$?

:::::: center
::::: columns
::: {.column width="40%"}
Using validation data.
:::

::: {.column width="60%"}
![](images/TrainVal4.png){fig-align="center"}
:::
:::::
::::::

## Moreover...

</br>

:::::: center
::::: columns
::: {.column width="40%"}
We can use [***test data***]{style="color:darkgreen;"} for a final evaluation of the model.

Test data comes from the same process that generated the training data.

Test data is independent of the training data.
:::

::: {.column width="60%"}
</br>

![](images/TrainVal5.png){fig-align="center"}
:::
:::::
::::::

## Linear regression model

A common candidate function for predicting a response is the linear regression model. It has the mathematical form:

$$\hat{Y}_i = \hat{f}(X_i) = \hat{\beta}_0 + \hat{\beta}_1 X_i.$$

-   Where $i = 1, \ldots, n_t$ indexes the $n_t$ observations in the training data.

-   $\hat{Y}_i$ is the prediction of the actual value of the response $Y_i$ associated with a predictor value equal to $X_i$.

-   $\hat{\beta}_0$ and $\hat{\beta}_1$ are called the [*estimated coefficients*]{style="color:#4682B4;"} of the model.

## 

The values of $\hat{\beta}_0$ and $\hat{\beta}_1$ are obtained using the training dataset and the **least squares method**.

This method finds the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the error made by the model $\hat{f}(X_i)$ when trying to predict the responses ($Y_i$) of the training dataset.

</br>

. . .

Technically, the least squares method finds the $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the following expression

::: {style="font-size: 75%;"}
$$(Y_1 - (\hat{\beta}_0 + \hat{\beta}_1 X_1 ))^2 + (Y_2 - (\hat{\beta}_0 + \hat{\beta}_1 X_2 ))^2 + \cdots + (Y_{n_t} - (\hat{\beta}_0 + \hat{\beta}_1 X_{n_t} ))^2 $$
:::

For the $n_t$ observations in the [training data]{style="color:blue;"}!
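
For a single predictor, the minimizers have a well-known closed form: $\hat{\beta}_1 = \sum_i (X_i - \bar{X})(Y_i - \bar{Y}) / \sum_i (X_i - \bar{X})^2$ and $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$. The sketch below applies these formulas to a small made-up dataset and checks that moving either coefficient away from the estimates increases the sum of squares. The data and the helper `sum_of_squares` are hypothetical.

```python
import numpy as np

# Small made-up training set, for illustration only
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least squares estimates for one predictor
x_bar, y_bar = X.mean(), Y.mean()
beta1_hat = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

def sum_of_squares(b0, b1):
    # The expression that the least squares method minimizes
    return np.sum((Y - (b0 + b1 * X)) ** 2)

# Nudging either coefficient away from the estimates increases the sum
best = sum_of_squares(beta0_hat, beta1_hat)
print(best < sum_of_squares(beta0_hat + 0.1, beta1_hat))
print(best < sum_of_squares(beta0_hat, beta1_hat + 0.1))
```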

## The idea in two dimensions

![](images/Modulo%203%20-%20Modelos%20predictivos%20y%20series%20de%20tiempo%20copy.012.jpeg){fig-align="center"}

## Example

</br>

We use the dataset called "Advertising.xlsx" in Canvas.

-   TV: Money spent on TV ads for a product (\$).
-   Sales: Sales generated from the product (\$).
-   200 markets


In [None]:
#| echo: true
#| output: true

# Load the data into Python
Ads_data = pd.read_excel('Advertising.xlsx')

## 

</br>


In [None]:
#| echo: true
#| output: true

Ads_data.head()

## 

</br></br>

Now, let's choose our predictor and response.


In [None]:
#| echo: true
#| output: true

# Choose the predictor.
X_full = Ads_data.filter(['TV'])

# Set the response.
Y_full = Ads_data.filter(['Sales'])

## Create training and validation data

</br>

To evaluate a model's performance on unobserved data, we split the current dataset into a training dataset and a validation dataset. To do this, we use `train_test_split()`.


In [None]:
#| echo: true
#| output: true

X_train, X_valid, Y_train, Y_valid = train_test_split(X_full, Y_full, 
                                                      test_size = 0.25,
                                                      random_state = 301655)

We use 75% of the data for training and the rest for validation.
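
To see what the split produces, here is a small sketch on stand-in data with 200 rows. The `toy` DataFrame and the `_toy`, `_tr`, and `_va` names are hypothetical; only the call to `train_test_split()` mirrors the one above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data with 200 rows, mimicking the size of the advertising dataset
toy = pd.DataFrame({"TV": range(200), "Sales": range(200)})
X_toy = toy.filter(["TV"])
Y_toy = toy.filter(["Sales"])

X_tr, X_va, Y_tr, Y_va = train_test_split(X_toy, Y_toy,
                                          test_size=0.25,
                                          random_state=301655)

# Number of rows and columns in each part
print(X_tr.shape, X_va.shape)

# Rows keep their original labels, so predictors and responses stay aligned
print(X_tr.index.equals(Y_tr.index))
```

Because the rows are shuffled before splitting, the row labels in each part are no longer consecutive.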

## Fit a linear regression model in Python

</br>

In Python, we use the `LinearRegression()` class and its `.fit()` method from **scikit-learn** to fit a linear regression model.


In [None]:
#| echo: true
#| output: false

# 1. Create the linear regression model
LRmodel = LinearRegression()

# 2. Fit the model.
LRmodel.fit(X_train, Y_train)

## 

The following commands show the estimated coefficients of the model.


In [None]:
#| echo: true
#| output: true

print("Coefficients:", LRmodel.coef_)

We can also show the estimated intercept.


In [None]:
#| echo: true
#| output: true

print("Intercept:", LRmodel.intercept_)

</br>

. . .

The estimated model thus is

$$\hat{Y}_i = 6.69 + 0.051 X_i.$$
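
When the response is passed as a one-column DataFrame, `coef_` and `intercept_` are arrays rather than plain numbers. The sketch below, on made-up data that follows an exact line, shows one way to pull out the two scalars. The `toy` data and variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data that follows an exact line: Sales = 2 + 0.5 * TV
toy = pd.DataFrame({"TV": [10.0, 20.0, 30.0, 40.0]})
toy["Sales"] = 2.0 + 0.5 * toy["TV"]

model = LinearRegression().fit(toy.filter(["TV"]), toy.filter(["Sales"]))

# With a one-column DataFrame as the response, both attributes are arrays
beta0_hat = model.intercept_[0]   # intercept_ has shape (1,)
beta1_hat = model.coef_[0, 0]     # coef_ has shape (1, 1)

print(f"Y_hat = {beta0_hat:.2f} + {beta1_hat:.2f} X")
```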

## Model assumptions

</br>

To use the regression model, the model errors $e_i = Y_i - \hat{Y}_i$ obtained on the [training data]{style="color:blue;"} must meet three conditions:

1.  On average, they must be equal to 0.
2.  They must have the same dispersion or variability.
3.  They must be independent of each other.

These assumptions are evaluated using a *graphical analysis of residuals* (model errors).

## In Python

For a residual analysis, we need to compute the residuals and predictions using the training dataset. Additionally, we place these objects together with the ID of the observations in a pandas dataframe for further processing.


In [None]:
#| echo: true
#| output: true

# Fitted values.
fitted = LRmodel.predict(X_train) + 0*Y_train

# Residuals
residuals = Y_train - fitted

# Construct a pandas DataFrame
residual_data = pd.DataFrame()
residual_data["Fitted"] = fitted
residual_data["Residuals"] = residuals
residual_data["ID"] = residuals.index

## A technical Python note

Note that we saved the model's predictions on the training dataset in an unusual way: we added `0*Y_train`.

</br>


In [None]:
#| echo: true
#| output: true

# Fitted values.
fitted = LRmodel.predict(X_train) + 0*Y_train

</br>

The addition of `0*Y_train` does not impact the predicted responses. Instead, it allows the `fitted` object to have the same indices as the `residuals` object in the previous section.

In this way, these two objects can be put together in a pandas dataframe and visualized using **seaborn**.
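
A possible alternative, sketched below on made-up data, is to wrap the NumPy output of `.predict()` in a DataFrame and pass the row labels explicitly. The toy data and the names `fitted_trick` and `fitted_explicit` are hypothetical; both options should give the same result.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training data with shuffled row labels, as after a random split
X_toy = pd.DataFrame({"TV": [10.0, 20.0, 30.0, 40.0]}, index=[7, 2, 9, 4])
Y_toy = pd.DataFrame({"Sales": [3.0, 4.5, 7.5, 9.0]}, index=[7, 2, 9, 4])
model = LinearRegression().fit(X_toy, Y_toy)

# Option used in these slides: let pandas attach the row labels via 0*Y
fitted_trick = model.predict(X_toy) + 0 * Y_toy

# Alternative: build the DataFrame with an explicit index and column name
fitted_explicit = pd.DataFrame(model.predict(X_toy),
                               index=Y_toy.index,
                               columns=Y_toy.columns)

# Check that both objects hold the same values and labels
print(fitted_trick.equals(fitted_explicit))
```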

## 

To validate assumptions 1 and 2, we use a "residuals versus fitted values" plot.


In [None]:
#| echo: true
#| output: true
#| code-fold: true
#| fig-align: center

# Residual vs Fitted Values Plot
plt.figure(figsize=(8, 5))
sns.scatterplot(data = residual_data, x = "Fitted", y = "Residuals")
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals vs Fitted Values')
plt.xlabel('Fitted (predicted) Values', fontsize = 14)
plt.ylabel('Residuals', fontsize = 14)
plt.show()

## 

To validate assumption 3, we use the "residuals versus time" plot. Here, we use the ID as a time variable.


In [None]:
#| echo: true
#| output: true
#| code-fold: true
#| fig-align: center

# Residual vs ID Plot
plt.figure(figsize=(8, 5))
sns.scatterplot(data = residual_data, x = "ID", y = "Residuals")
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals vs Time')
plt.xlabel('ID', fontsize = 14)
plt.ylabel('Residuals', fontsize = 14)
plt.show()

## Prediction error

After estimating and validating the linear regression model, we can check the quality of its predictions on [**unobserved**]{style="color:darkred;"} data. That is, on the data in the [validation set]{style="color:orange;"}.

</br>

One metric for this is the **mean squared error** (MSE$_v$):

::: {style="font-size: 65%;"}
$$\text{MSE}_v = \frac{(Y_1 - (\hat{\beta}_0 + \hat{\beta}_1 X_1 ))^2 + (Y_2 - (\hat{\beta}_0 + \hat{\beta}_1 X_2 ))^2 + \cdots + (Y_{n_v} - (\hat{\beta}_0 + \hat{\beta}_1 X_{n_v} ))^2}{n_v}$$
:::

-   For the $n_v$ observations in the validation data!

The smaller $\text{MSE}_v$, the better the predictions.
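
As a small worked sketch, the formula can be computed by hand and compared with scikit-learn's `mean_squared_error()`. The four validation responses and predictions below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up validation responses and predictions, for illustration only
y_valid = np.array([10.0, 12.0, 9.0, 15.0])
y_pred = np.array([11.0, 11.0, 10.0, 13.0])

# Average of the squared prediction errors, as in the formula
mse_manual = np.mean((y_valid - y_pred) ** 2)

# Same quantity computed by scikit-learn
mse_sklearn = mean_squared_error(y_valid, y_pred)

print(mse_manual, mse_sklearn)
```

Here the squared errors are 1, 1, 1, and 4, so both values equal 7/4 = 1.75.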

## 

</br>

In practice, we use the square root of the mean squared error:

$$\text{RMSE}_v = \sqrt{\text{MSE}_v}.$$

The advantage of $\text{RMSE}_v$ is that it is in the same units as the response $Y$, which facilitates its interpretation.

For example, if $\text{RMSE}_v = 1$, then a prediction of $\hat{Y} = 5$ is typically off by about 1 unit of the response, in either direction.

::: notes
Prediction of tesla car next month, 1000 USD. Prediction of ON running shoes next month, 20 USD.
:::

## In Python

</br>

To evaluate the model's performance, we use the validation dataset.

</br>

First, we predict the responses using the predictor matrix in `X_valid` and our fitted model in `LRmodel`. To this end, we use the `.predict()` method.


In [None]:
#| echo: true
#| output: true
#| fig-align: center
#| code-fold: false

Y_pred = LRmodel.predict(X_valid)

## 

</br>

Next, we compute the *mean squared error* with the `mean_squared_error()` function. Recall that the responses from the validation dataset are in `Y_valid`, and the model predictions are in `Y_pred`.


In [None]:
#| echo: true
#| output: true
#| fig-align: center
#| code-fold: false

mse = mean_squared_error(Y_valid, Y_pred)  # Mean Squared Error (MSE)
print(round(mse, 2))

</br>

To obtain the **root mean squared error (RMSE)**, we simply take the square root of the MSE.


In [None]:
#| echo: true
#| output: true
#| fig-align: center
#| code-fold: false

print(round(mse**(1/2), 2))

## Mini-Activity (*solo* mode)

</br></br>

1.  Consider the Advertising.xlsx dataset in Canvas.

2.  Use a model to predict Sales that includes the Radio predictor (money spent on radio ads for a product (\$)). What is the $\text{RMSE}_v$?

3.  Now, use a model to predict Sales that includes two predictors: TV and Radio. What is the $\text{RMSE}_v$?

4.  Which model do you prefer?

## Other candidate functions

The linear regression model is one of the most common models for predicting a response. It is simple and easy to calculate and interpret. However, it can be limited for complex problems.

For such problems, there are other, more advanced candidate functions $\hat{f}(\boldsymbol{X})$, such as:

-   Decision or *regression* trees.
-   Ensemble methods (bagging and random forests).
-   *K* nearest neighbors.
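
In scikit-learn, these candidates follow the same `.fit()` and `.predict()` pattern as the linear regression model. The sketch below compares three of them on synthetic nonlinear data. The data, the hyperparameter values, and the `rmse_v` dictionary are assumptions for illustration, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear data, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=200)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

# Hyperparameter values are arbitrary placeholders
models = {
    "Regression tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "K nearest neighbors": KNeighborsRegressor(n_neighbors=5),
}

# Fit each model on the training part and score it on the validation part
rmse_v = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_va, model.predict(X_va))
    rmse_v[name] = mse ** 0.5
    print(name, round(rmse_v[name], 2))
```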

# [Return to main page](https://alanrvazquez.github.io/TEC-IN2004B/)