<a href="https://colab.research.google.com/github/alessiodevoto/notebooks/blob/main/deepcamp_lab2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deepcamp: Codelab 2
---

**In this tutorial we will cover**:

- Outliers and what to do with them

**Author**:
- Alessio Devoto (alessio.devoto@uniroma1.it)

**Duration**: 30 mins


# Sales prediction


- Your company collected a dataset containing information about investments in commercials via TV, Radio and Newspapers.
- Given the amount of money invested in different advertisement media (TV, Radio, Newspaper) predict sales.

In this notebook we are going to meet **outliers** 😦 for the first time!



What are we going to do? 🤔

1. Data import, analysis & preprocessing ⚙️
2. Train an ML model
3. Treat outliers
4. Check performance degradation due to outliers

But wait... what actually is an outlier?

![image](https://raw.githubusercontent.com/alessiodevoto/deepers/main/images/outliers_example.jpg)

From Wikipedia: *in statistics, an outlier is a data point that differs significantly from other observations.*
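
As a quick illustration with made-up numbers (not taken from the advertising dataset), a single extreme value is enough to pull the mean away from the bulk of the data, while the median barely moves:

```python
import statistics

# Hypothetical budgets, chosen only for illustration
budgets = [38.0, 41.0, 44.0, 45.0, 47.0]
# Same data plus one value that is far from all the others
budgets_with_outlier = budgets + [600.0]

mean_clean = statistics.mean(budgets)
mean_dirty = statistics.mean(budgets_with_outlier)
median_clean = statistics.median(budgets)
median_dirty = statistics.median(budgets_with_outlier)

print("mean:  ", mean_clean, "->", mean_dirty)
print("median:", median_clean, "->", median_dirty)
```

This is why statistics (and models) based on averages and squared errors tend to be sensitive to outliers.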

First, we import the necessary libraries as usual...

In [None]:
!pip install pandas
!pip install scikit-learn
!pip install plotly

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression

... and download the data

In [None]:
!wget https://raw.githubusercontent.com/alessiodevoto/deepers/main/data/Advertising_outliers.csv

--2023-05-12 16:03:28--  https://raw.githubusercontent.com/alessiodevoto/deepers/main/data/Advertising_outliers.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4019 (3.9K) [text/plain]
Saving to: ‘Advertising_outliers.csv’


2023-05-12 16:03:28 (37.9 MB/s) - ‘Advertising_outliers.csv’ saved [4019/4019]



## 1. ⬇️ Data import & analysis

After the first codelab we should be quite good at this 😀

In [None]:
# read dataframe from csv
sales_data = pd.read_csv('Advertising_outliers.csv')

In [None]:
sales_data.head()

Unnamed: 0,TV,Radio,Newspaper,Sales
0,517.0,37.8,69.2,22.1
1,616.0,39.3,45.1,10.4
2,668.0,45.9,69.3,9.3
3,775.0,41.3,58.5,18.5
4,658.0,10.8,58.4,12.9


In [None]:
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   TV         200 non-null    float64
 1   Radio      200 non-null    float64
 2   Newspaper  200 non-null    float64
 3   Sales      200 non-null    float64
dtypes: float64(4)
memory usage: 6.4 KB


Let us explore the correlation between each of the three features and the target.

In [None]:
px.scatter(sales_data, x=['TV', 'Radio', 'Newspaper'], y='Sales')

It looks like there is a fairly clear ***linear correlation*** between the money spent on TV advertisement and the sales, except for all those awful points on the right-hand side.

**Our goal**: use the amount of money invested in TV commercials to predict the sales (i.e. we don't care about Radio and Newspaper for now).

Let's have a look at the distribution of each independent variable via a boxplot.

In [None]:
fig = px.box(sales_data, y=["TV", "Radio", 'Newspaper'])
fig.show()

Bad news: we were already suspecting it but it is clear now, there are some ***outliers*** in the TV feature: exactly the one we had chosen for our model! 😞
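
The isolated points a boxplot draws beyond its whiskers are commonly defined with the 1.5 × IQR rule. The sketch below applies that rule by hand to hypothetical values (not the advertising data); note that quartile conventions differ slightly between libraries, so the fences may not match a given plot exactly.

```python
import numpy as np

# Hypothetical values; 300.0 is far from the rest
values = np.array([10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 20.0, 300.0])

# First and third quartiles, and the interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points outside these fences are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (values < lower_fence) | (values > upper_fence)
print("fences:", lower_fence, upper_fence)
print("outliers:", values[is_outlier])
```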

Well, let's try and train our model, maybe we'll get reasonable performance despite the outliers!

## 2. Train an ML model

Before we start: keep in mind we are training our model on **dirty data** which will probably affect the model's prediction.

### 2.1 Linear Regression

[Linear regression](http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm) is a simple method that looks for the line that best fits the data by minimizing the mean squared error.
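
To make "the line that minimizes the mean squared error" concrete, here is a minimal sketch that solves the least-squares problem directly with NumPy. The points are hypothetical and lie exactly on a known line, so the recovered slope and intercept are easy to check; this is an illustration of the idea, not necessarily how scikit-learn computes it internally.

```python
import numpy as np

# Hypothetical points lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix: one column for x, one column of ones for the intercept
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution of A @ [slope, intercept] ~= y
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

# Mean squared error of the fitted line on the same points
mse = np.mean((slope * x + intercept - y) ** 2)
print(slope, intercept, mse)
```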

In [None]:
# do the usual train test split

sales_data_feat, sales_data_labels = sales_data.drop(columns='Sales'), sales_data['Sales']

X_train, X_test, y_train, y_test = train_test_split(sales_data_feat, sales_data_labels, train_size=0.75)

We can use any of the three columns, but in this case we only keep TV, as we saw it has a strong linear correlation with the sales.


In [None]:
# this way we can pick which columns we should use
use_columns = ['TV']

# fit the model
lr = LinearRegression().fit(X=X_train[use_columns].values, y=y_train.values)


Let's see what error we are getting...

In [None]:
print('Score:', lr.score(X_test[use_columns].values, y_test))
y_pred = lr.predict(X_test[use_columns].values)
print("mean_absolute_error:    ", metrics.mean_absolute_error(y_test, y_pred))
print("mean_squared_error:    ",metrics.mean_squared_error(y_test, y_pred))
# the square root of the MSE is the root mean squared error (RMSE)
print("root_mean_squared_error:   ",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Score: 0.07856899406694329
mean_absolute_error:     4.084282010494524
mean_squared_error:     24.141894099364674
root_mean_squared_error:    4.913440149158701
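
For reference, the `Score` returned by a scikit-learn regressor is the coefficient of determination R², so a value close to 0 means the model does barely better than always predicting the mean. The sketch below computes the same four metrics by hand on hypothetical numbers (not the notebook's predictions):

```python
import numpy as np

# Hypothetical true values and predictions
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_hat = np.array([11.0, 11.0, 15.0, 17.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```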


Seems to be quite bad 😢, let's try and visualize what's happening.

The result of linear regression is just a line when there is a single feature (with N features it is a hyperplane in an (N+1)-dimensional space)!

We can retrieve the line's equation and plot it!

In [None]:
print('Coefficients: ', lr.coef_)
print('Intercept: ', lr.intercept_)

px.line(x=sales_data['TV'], y=sales_data['TV'] * lr.coef_ + lr.intercept_)


Coefficients:  [0.01470421]
Intercept:  11.51163455961833
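
With these two numbers a prediction is plain arithmetic. A minimal sketch, using the coefficient and intercept printed above and a hypothetical TV budget of 100 chosen only for illustration (`predict_sales` is a helper name introduced here, not part of scikit-learn):

```python
# Values copied from the printed output of the fitted model
coef = 0.01470421
intercept = 11.51163455961833

def predict_sales(tv_budget: float) -> float:
    """Evaluate the fitted line: sales = coef * TV + intercept."""
    return coef * tv_budget + intercept

tv_budget = 100.0  # hypothetical budget
predicted = predict_sales(tv_budget)
print(predicted)
```

This should agree, up to rounding of the printed coefficient, with `lr.predict([[100.0]])` on the same fitted model.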


Well, that's just a line...
Now we can visualize how well the line fits the data.

In [None]:
trace1= go.Scatter(mode='markers', x=sales_data['TV'], y=sales_data['Sales'])
trace2 = go.Scatter(x=sales_data['TV'], y=sales_data['TV'] * lr.coef_ + lr.intercept_)

fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(trace1)
fig.add_trace(trace2,secondary_y=True)
fig['layout'].update(height = 600, width = 800,xaxis=dict(tickangle=-90))
fig.show()

We can see that the linear model is being misled by the outliers...

![outliers_lr](https://raw.githubusercontent.com/alessiodevoto/deepers/main/images/outliers.jpg)

### Exercise: 🏋 Decision Tree

Perform the same task with a decision tree for regression (`sklearn.tree.DecisionTreeRegressor()`) and plot the results.


In [None]:
from sklearn import tree
tree_reg = None  # your code here

In [None]:
#@title Peek solution
from sklearn import tree
# this way we can pick which features we should use
use_columns = ['TV']

# fit the model
tree_reg = tree.DecisionTreeRegressor().fit(X=X_train[use_columns].values, y=y_train.values)

In [None]:
print('Score:', tree_reg.score(X_test[use_columns].values, y_test))
y_pred = tree_reg.predict(X_test[use_columns].values)
print("mean_absolute_error:    ", metrics.mean_absolute_error(y_test, y_pred))
print("mean_squared_error:    ",metrics.mean_squared_error(y_test, y_pred))
# the square root of the MSE is the root mean squared error (RMSE)
print("root_mean_squared_error:   ",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Score: 0.12405083640592862
mean_absolute_error:     3.495
mean_squared_error:     22.950249999999997
root_mean_squared_error:    4.79064191940913


In [None]:
y_pred = tree_reg.predict(X_test[use_columns].values)
df = pd.DataFrame({'x':X_test[use_columns].values.squeeze(), 'y': y_pred}).sort_values(by='x')
# px.line(df, x='x', y='y')

trace1= go.Scatter(mode='markers', x=sales_data['TV'], y=sales_data['Sales'])
trace2 = go.Scatter(x=df['x'], y=df['y'])

fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(trace1)
fig.add_trace(trace2,secondary_y=True)
fig['layout'].update(height = 600, width = 800,xaxis=dict(tickangle=-90))
fig.show()

## 3. Treat outliers & retrain

We can assume outliers are mainly responsible for our bad results.

In order to handle the outliers we should first:

1. Find how many outliers we have and where they are
2. Decide what to do with them

There are a few [methods](https://regenerativetoday.com/a-complete-guide-for-detecting-and-dealing-with-outliers/) to handle outliers: z-score, Quartiles ...



#### Z-Score

Let's write a simple function to find out which samples are out of distribution.

We first compute the distribution's mean and standard deviation, and then we check how many 'standard deviations' each point is from the mean.

$\zeta(x) = \frac{x - \mu}{\sigma}$

where $\mu$ is the mean and $\sigma$ is the standard deviation.

We can then drop points which are too far from the mean, i.e.



```
if abs(Z_score) > threshold:
  drop sample
```
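
The same rule can be sketched in vectorized form with NumPy. The values below are hypothetical (not the TV column), and the threshold of 2.5 is just one common choice. Keep in mind that the mean and standard deviation are themselves inflated by the outliers, so the z-score can under-flag extreme points.

```python
import numpy as np

# Hypothetical values; the last one is far from the rest
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 12.0, 11.0, 100.0])
threshold = 2.5

# Number of standard deviations each point lies from the mean
z_scores = (values - values.mean()) / values.std()

# Positions of the points whose absolute z-score exceeds the threshold
outlier_positions = np.flatnonzero(np.abs(z_scores) > threshold)

# Keep only the points within the threshold
cleaned = values[np.abs(z_scores) <= threshold]

print("outlier positions:", outlier_positions)
print("cleaned:", cleaned)
```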


In [None]:

# a simple function to detect outliers in a single column
def Zscore_outlier(series: pd.Series, threshold: float):
    out = []
    m = np.mean(series)  # mean
    sd = np.std(series)  # std deviation

    # iterate over the column, keeping each sample's index label
    for idx, value in series.items():
        z = (value - m) / sd
        if np.abs(z) > threshold:
            out.append(idx)
    print("Outliers:", out)
    return out

# apply function to TV column and get indexes of outliers
outliers_idx = Zscore_outlier(sales_data['TV'], threshold=2.5)

Outliers: [1, 2, 3, 4, 5, 6, 7, 8, 9]


In [None]:
# let's just drop the outliers and retrain

sales_data = sales_data.drop(index=outliers_idx)

In [None]:
sales_data_feat, sales_data_labels = sales_data.drop(columns='Sales'), sales_data['Sales']

X_train, X_test, y_train, y_test = train_test_split(sales_data_feat, sales_data_labels, train_size=0.75)

In [None]:
use_columns = ['TV']

# fit the model
lr = LinearRegression().fit(X=X_train[use_columns].values, y=y_train.values)

In [None]:
print('Score:', lr.score(X_test[use_columns].values, y_test))
y_pred = lr.predict(X_test[use_columns].values)
print("mean_absolute_error:    ", metrics.mean_absolute_error(y_test, y_pred))
print("mean_squared_error:    ",metrics.mean_squared_error(y_test, y_pred))
# the square root of the MSE is the root mean squared error (RMSE)
print("root_mean_squared_error:   ",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Score: 0.589997273643232
mean_absolute_error:     2.841946224777112
mean_squared_error:     13.012417039604875
root_mean_squared_error:    3.6072727980574015


Looks like we got quite a good increase in score, from about 0.08 to about 0.59! 🍾

---


1. 🚫 Simply ***dropping the outliers*** is not a smart way to deal with them.
In the vast majority of cases, you should conduct a deeper analysis of the outliers' distribution, maybe ask a *domain expert*, or replace the outliers with another value.

2. 🚫 Some ML methods, like Linear Regression, are more sensitive to outliers. Other methods, like decision trees, are more robust. The choice of the ML algorithm you use will affect the robustness of your model!
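
As an alternative to dropping, here is a minimal sketch of two ways to replace outliers with another value. The data is hypothetical, and which option is appropriate (if any) depends on the domain; neither is presented as the recommended treatment for this dataset.

```python
import numpy as np

# Hypothetical values; the last one is far from the rest
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 12.0, 11.0, 100.0])
threshold = 2.5

z_scores = (values - values.mean()) / values.std()
is_outlier = np.abs(z_scores) > threshold
inliers = values[~is_outlier]

# Option A: replace each outlier with the median of the remaining points
median_of_inliers = np.median(inliers)
replaced = np.where(is_outlier, median_of_inliers, values)

# Option B: cap every value to the range spanned by the inliers
capped = np.clip(values, inliers.min(), inliers.max())

print("replaced:", replaced)
print("capped:  ", capped)
```

Both options keep the number of samples unchanged, which matters when the other columns of the same rows still carry useful information.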


In [None]:
trace1= go.Scatter(mode='markers', x=sales_data['TV'], y=sales_data['Sales'])
trace2 = go.Scatter(x=sales_data['TV'], y=sales_data['TV'] * lr.coef_ + lr.intercept_)

fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(trace1)
fig.add_trace(trace2,secondary_y=True)
fig['layout'].update(height = 600, width = 800,xaxis=dict(
      tickangle=-90
    ))
fig.show()

# Final Exercises 🔥

We probably won't have time but you can do this at home just for fun 😀

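## 1. Perform linear regression on the dirty dataset with all three columns

Note: `sales_data` was overwritten when we dropped the outliers above, so reload `Advertising_outliers.csv` and redo the train/test split if you want to work on the dirty data.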

In [None]:
# your code here

In [None]:
#@title Peek solution 👀

use_columns = ['TV', 'Radio', 'Newspaper']

# fit the model
lr = LinearRegression().fit(X=X_train[use_columns].values, y=y_train.values)

print('Score:', lr.score(X_test[use_columns].values, y_test))
y_pred = lr.predict(X_test[use_columns].values)
print("mean_absolute_error:    ", metrics.mean_absolute_error(y_test, y_pred))
print("mean_squared_error:    ",metrics.mean_squared_error(y_test, y_pred))
# the square root of the MSE is the root mean squared error (RMSE)
print("root_mean_squared_error:   ",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))


Score: 0.843567025979476
mean_absolute_error:     1.1421037681213264
mean_squared_error:     3.3529669837143548
root_mean_squared_error:    1.831110860574628


## 2. Use support vector regression (SVR) instead of linear regression

Hint: the model is called `sklearn.svm.SVR`.


In [None]:
from sklearn import svm

# this way we can pick which features we should use
use_columns = ['TV']

# fit the model
svr = svm.SVR().fit(X=X_train[use_columns].values, y=y_train.values)

In [None]:
print('Score:', svr.score(X_test[use_columns].values, y_test))
y_pred = svr.predict(X_test[use_columns].values)
print("mean_absolute_error:    ", metrics.mean_absolute_error(y_test, y_pred))
print("mean_squared_error:    ",metrics.mean_squared_error(y_test, y_pred))
# the square root of the MSE is the root mean squared error (RMSE)
print("root_mean_squared_error:   ",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Score: 0.5638541333971796
mean_absolute_error:     2.43187847158925
mean_squared_error:     9.34830204411301
root_mean_squared_error:    3.0574993122015597


In [None]:
# let's plot and see what it looks like

y_svr = svr.predict(X_test[use_columns].values)
df = pd.DataFrame({'x':X_test[use_columns].values.squeeze(), 'y': y_svr}).sort_values(by='x')


trace1= go.Scatter(mode='markers', x=sales_data['TV'], y=sales_data['Sales'])
trace2 = go.Scatter(x=df['x'], y=df['y'])

fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(trace1)
fig.add_trace(trace2,secondary_y=True)
fig['layout'].update(height = 600, width = 800,xaxis=dict(tickangle=-90))
fig.show()