# Deepcamp: Codelab 2

**In this tutorial we will cover**:

- Outliers and what to do with them

**Author**:
- Alessio Devoto (alessio.devoto@uniroma1.it)

**Duration**: 30 mins 


# Sales prediction 

Given the amount of money invested in different advertising media (TV, Radio, Newspaper), predict sales.

What are we going to do? 🤔

1. Data import, analysis & preprocessing 🔍
2. Train an ML model 
3. Add outliers 
4. Check performance degradation due to outliers

But wait... what actually is an outlier?

![image](https://raw.githubusercontent.com/alessiodevoto/deepers/main/images/outliers_example.png)
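As a rough illustration of the idea, an outlier is a value that lies far from the bulk of the data. The sketch below uses made-up budgets (not taken from the dataset) to show how a single extreme value shifts the mean much more than the median.

```python
import numpy as np

# Hypothetical TV budgets; the last value is deliberately extreme.
budgets = np.array([100.0, 120.0, 110.0, 130.0, 115.0, 700.0])

# Statistics computed with the extreme value included
mean_with = budgets.mean()
median_with = np.median(budgets)

# Same statistics after leaving the extreme value out
mean_without = budgets[:-1].mean()
median_without = np.median(budgets[:-1])

print("mean:  ", mean_without, "->", mean_with)
print("median:", median_without, "->", median_with)
```

The mean nearly doubles while the median barely moves, which is why statistics and models based on averages tend to be sensitive to outliers.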

First, we import the necessary libraries as usual...

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression



... and download the data

In [2]:
!wget https://raw.githubusercontent.com/iamontheinet/datascience/master/Databricks_ML/Advertising.csv

--2023-04-07 15:30:23--  https://raw.githubusercontent.com/iamontheinet/datascience/master/Databricks_ML/Advertising.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5166 (5.0K) [text/plain]
Saving to: 'Advertising.csv'


2023-04-07 15:30:23 (23.0 MB/s) - 'Advertising.csv' saved [5166/5166]



## 1. Data import & analysis

In [19]:
sales_data = pd.read_csv('Advertising_outliers.csv')

In [20]:
sales_data.head()

Unnamed: 0,TV,Radio,Newspaper,Sales
0,517.0,37.8,69.2,22.1
1,616.0,39.3,45.1,10.4
2,668.0,45.9,69.3,9.3
3,775.0,41.3,58.5,18.5
4,658.0,10.8,58.4,12.9


In [21]:
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   TV         200 non-null    float64
 1   Radio      200 non-null    float64
 2   Newspaper  200 non-null    float64
 3   Sales      200 non-null    float64
dtypes: float64(4)
memory usage: 6.4 KB


Let us explore the correlation between each of the three features and the target.

In [22]:
px.scatter(sales_data, x=['TV', 'Radio', 'Newspaper'], y='Sales')

It looks like there is a fairly clear linear correlation between the money spent on TV advertisement and the sales.
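The visual impression from a scatter plot can also be put into numbers. A minimal sketch, assuming a synthetic stand-in for the dataset (the column names match the notebook, but the values are generated here and `toy` is a hypothetical name), using pandas `DataFrame.corr` to get the Pearson correlation of each feature with the target:

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: sales depend on TV, not on Radio.
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, size=200)
radio = rng.uniform(0, 50, size=200)
sales = 7.0 + 0.05 * tv + rng.normal(0, 1.5, size=200)

toy = pd.DataFrame({'TV': tv, 'Radio': radio, 'Sales': sales})

# Pearson correlation of every feature with the target column
corr_with_sales = toy.corr()['Sales'].drop('Sales')
print(corr_with_sales)
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or none. On the real data the same call would be `sales_data.corr()['Sales']`.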

Let's have a look at the distribution of each independent variable via a boxplot.

In [16]:
# inject artificial outliers: overwrite the TV budget of the first 10 rows
# with random values far above the normal range
for i in range(10):
    sales_data.loc[i, 'TV'] = np.random.randint(500, 800)

In [23]:
fig = px.box(sales_data, y=["TV", "Radio", 'Newspaper'])
fig.show()

Bad news: there are some outliers in the TV feature, which is exactly the one we had chosen for our linear regression!

Well, let's try and train our model anyway, maybe we'll get reasonable performance.

## 2. Train an ML model

### 2.1 Linear Regression

We do the usual train/test split.

In [24]:
sales_data_feat, sales_data_labels = sales_data.drop(columns='Sales'), sales_data['Sales']

X_train, X_test, y_train, y_test = train_test_split(sales_data_feat, sales_data_labels, train_size=0.75)
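Note that `train_test_split` shuffles the rows randomly, so two runs can produce different splits and slightly different scores. A small sketch on toy arrays showing that passing `random_state` (the value 42 is an arbitrary choice here) makes the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples with a single feature
X = np.arange(20).reshape(-1, 1)
y = np.arange(20)

# Two calls with the same seed return exactly the same partition
split_a = train_test_split(X, y, train_size=0.75, random_state=42)
split_b = train_test_split(X, y, train_size=0.75, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same_split)
print("train size:", len(split_a[0]), "test size:", len(split_a[1]))
```

Fixing the seed is useful when comparing models before and after a preprocessing step, since differences in the score then cannot come from a different split.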

Linear regression is a simple method that looks for the line that best fits the data by minimizing the mean squared error.

In [25]:
# this way we can pick which features we should use
use_columns = ['TV']

# fit the model
lr = LinearRegression().fit(X=X_train[use_columns].values, y=y_train.values)
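To make "minimizing the mean squared error" concrete, here is a minimal sketch that solves the same least-squares problem directly with NumPy on synthetic data. The data and the helper `mse` are assumptions for illustration; this is not how the notebook fits its model, just the same idea written out.

```python
import numpy as np

# Synthetic data: a noisy line with slope 0.05 and intercept 7
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=100)
y = 7.0 + 0.05 * x + rng.normal(0, 1.0, size=100)

# Design matrix: one column for the feature, one column of ones for the intercept
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution: the (slope, intercept) pair with the smallest squared error
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

def mse(a, b):
    """Mean squared error of the line y = a * x + b on the data above."""
    return np.mean((y - (a * x + b)) ** 2)

print("slope:", slope, "intercept:", intercept)
print("MSE at the solution:      ", mse(slope, intercept))
print("MSE with a perturbed slope:", mse(slope + 0.01, intercept))
```

Any change to the fitted slope or intercept increases the mean squared error, which is what "best fit" means here.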


Let's see what error we are getting...

In [26]:
# for regressors, score() returns the coefficient of determination R^2
print('Score:', lr.score(X_test[use_columns].values, y_test))
y_pred = lr.predict(X_test[use_columns].values)
print("mean_absolute_error:    ", metrics.mean_absolute_error(y_test, y_pred))
print("mean_squared_error:    ", metrics.mean_squared_error(y_test, y_pred))
print("root_mean_squared_error:   ", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Score: 0.1651216972148648
mean_absolute_error:     3.889897561146561
mean_squared_error:     23.157456960581957
root_mean_squared_error:    4.812219546174297
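For reference, the four numbers above can be computed by hand. A small sketch on made-up values (not the notebook's predictions) that mirrors what `score` and the `metrics` functions compute:

```python
import numpy as np

# Made-up targets and predictions for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, same unit as the target

# R^2: 1 minus the ratio of residual variation to total variation around the mean
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "R^2:", r2)
```

An R² of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the targets.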


The result of linear regression is just a line (a hyperplane when there is more than one feature), described by one coefficient per feature plus an intercept!

In [27]:
print('Coefficients: ', lr.coef_)
print('Intercept: ', lr.intercept_)

Coefficients:  [0.01287383]
Intercept:  11.605910758122558
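With a single feature, these two numbers fully describe the model: a prediction is `coefficient * TV + intercept`. A tiny sketch using the values printed above (the function name `predict_sales` and the budget of 100 are hypothetical, chosen only for illustration):

```python
# Values copied from the output above
coef = 0.01287383
intercept = 11.605910758122558

def predict_sales(tv_budget):
    """Apply the fitted line by hand: sales = coef * TV + intercept."""
    return coef * tv_budget + intercept

pred = predict_sales(100.0)
print("Predicted sales for a TV budget of 100:", pred)
```

This is the same computation that `lr.predict` performs for a single-feature model.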


We can visualize how well the line fits the data...

In [28]:
trace1= go.Scatter(mode='markers', x=sales_data['TV'], y=sales_data['Sales'])
trace2 = go.Scatter(x=sales_data['TV'], y=sales_data['TV'] * lr.coef_ + lr.intercept_)

fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(trace1)
fig.add_trace(trace2,secondary_y=True)
fig['layout'].update(height = 600, width = 800,xaxis=dict(tickangle=-90))
fig.show()

We can see that the linear model is being misled by the outliers...

![outliers_lr](https://raw.githubusercontent.com/alessiodevoto/deepers/main/images/outliers.jpg)
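The effect can be reproduced on synthetic data. The sketch below is an assumption-laden toy example (generated values, not the Advertising dataset): it fits a line before and after overwriting the first 10 feature values with numbers between 500 and 800, mimicking the corruption applied earlier in this notebook.

```python
import numpy as np

# Synthetic data: sales grow linearly with the TV budget (slope 0.05)
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, size=200)
sales = 7.0 + 0.05 * tv + rng.normal(0, 1.5, size=200)

# Fit on clean data
slope_clean, intercept_clean = np.polyfit(tv, sales, 1)

# Corrupt the first 10 budgets while leaving the sales untouched
tv_noisy = tv.copy()
tv_noisy[:10] = rng.integers(500, 800, size=10)

# Fit again on the corrupted data
slope_noisy, intercept_noisy = np.polyfit(tv_noisy, sales, 1)

print("slope without outliers:", slope_clean)
print("slope with outliers:   ", slope_noisy)
```

Because squared errors penalize large deviations heavily, a handful of far-away points is enough to flatten the fitted line.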

### 2.2 (Optional) SVR

In [29]:
from sklearn import svm

# this way we can pick which features we should use
use_columns = ['TV']

# fit the model
svr = svm.SVR().fit(X=X_train[use_columns].values, y=y_train.values)

In [30]:
print('Score:', svr.score(X_test[use_columns].values, y_test))
y_pred = svr.predict(X_test[use_columns].values)
print("mean_absolute_error:    ", metrics.mean_absolute_error(y_test, y_pred))
print("mean_squared_error:    ", metrics.mean_squared_error(y_test, y_pred))
print("root_mean_squared_error:   ", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Score: 0.5101585512944369
mean_absolute_error:     2.7692729799130635
mean_squared_error:     13.58698893966533
root_mean_squared_error:    3.6860533012512624


In [31]:
y_svr = svr.predict(X_test[use_columns].values)
df = pd.DataFrame({'x':X_test[use_columns].values.squeeze(), 'y': y_svr}).sort_values(by='x')
# px.line(df, x='x', y='y')

trace1= go.Scatter(mode='markers', x=sales_data['TV'], y=sales_data['Sales'])
trace2 = go.Scatter(x=df['x'], y=df['y'])

fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(trace1)
fig.add_trace(trace2,secondary_y=True)
fig['layout'].update(height = 600, width = 800,xaxis=dict(tickangle=-90))
fig.show()

### 2.3 (Optional) Decision Tree

In [32]:
from sklearn import tree
# this way we can pick which features we should use
use_columns = ['TV']

# fit the model
tree_reg = tree.DecisionTreeRegressor().fit(X=X_train[use_columns].values, y=y_train.values)

In [33]:
print('Score:', tree_reg.score(X_test[use_columns].values, y_test))
y_pred = tree_reg.predict(X_test[use_columns].values)
print("mean_absolute_error:    ", metrics.mean_absolute_error(y_test, y_pred))
print("mean_squared_error:    ", metrics.mean_squared_error(y_test, y_pred))
print("root_mean_squared_error:   ", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Score: 0.1948380107755835
mean_absolute_error:     3.4999999999999996
mean_squared_error:     22.333199999999998
root_mean_squared_error:    4.725801519319236


In [34]:
y_tree = tree_reg.predict(X_test[use_columns].values)
df = pd.DataFrame({'x':X_test[use_columns].values.squeeze(), 'y': y_tree}).sort_values(by='x')
# px.line(df, x='x', y='y')

trace1= go.Scatter(mode='markers', x=sales_data['TV'], y=sales_data['Sales'])
trace2 = go.Scatter(x=df['x'], y=df['y'])

fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(trace1)
fig.add_trace(trace2,secondary_y=True)
fig['layout'].update(height = 600, width = 800,xaxis=dict(tickangle=-90))
fig.show()

## 3. Treat outliers & retrain

To handle the outliers, we first need to:
1. Find them
2. Decide what to do with them

There are a few methods to detect outliers: z-score, quartiles, ...

In [37]:
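# find outliers with the z-score: z = (value - mean) / standard deviation

def Zscore_outlier(df):
    """Return the positions of the values in `df` whose |z-score| exceeds 2."""
    out = []
    m = np.mean(df)
    sd = np.std(df)
    for idx, i in enumerate(df):
        z = (i - m) / sd  # distance from the mean, in standard deviations
        if np.abs(z) > 2:
            out.append(idx)
    print("Outliers:", out)
    return out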

outliers_idx = Zscore_outlier(sales_data['TV'])

Outliers: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
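The quartile-based alternative mentioned above can be sketched in a similar way. This is a minimal version of the common 1.5 × IQR rule (the same rule boxplots typically use for their whiskers); the function name `iqr_outliers`, the multiplier default, and the sample values are assumptions for illustration.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the positions of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1  # interquartile range: spread of the middle 50% of the data
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return np.flatnonzero((values < lower) | (values > upper)).tolist()

# Made-up sample with one obviously extreme value at the end
sample = [10, 12, 11, 13, 12, 11, 95]
outlier_positions = iqr_outliers(sample)
print("Outliers:", outlier_positions)
```

Unlike the z-score, this rule relies on quartiles rather than the mean and standard deviation, so the outliers themselves have less influence on the thresholds.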


In [38]:
sales_data = sales_data.drop(index=outliers_idx)
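Dropping rows is only one possible answer to "decide what to do with them". A minimal sketch of one alternative, capping (clipping) extreme values with pandas `Series.clip`; the toy values and the cap of 300 are assumptions for illustration, not values derived from the dataset.

```python
import pandas as pd

# Made-up TV budgets; the last two are far above the rest
tv = pd.Series([50.0, 120.0, 200.0, 280.0, 650.0, 720.0])

# Hypothetical upper bound chosen by hand for this toy example
upper_cap = 300.0

# Values above the cap are replaced by the cap; all rows are kept
tv_capped = tv.clip(upper=upper_cap)
print(tv_capped.tolist())
```

Capping keeps every sample, which can matter on small datasets, but it is only sensible when the extreme values are exaggerated measurements rather than corrupted entries. In this notebook the outliers were injected at random, so dropping them is a reasonable choice.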

In [39]:
sales_data_feat, sales_data_labels = sales_data.drop(columns='Sales'), sales_data['Sales']

X_train, X_test, y_train, y_test = train_test_split(sales_data_feat, sales_data_labels, train_size=0.75)

In [40]:
# this way we can pick which features we should use
use_columns = ['TV']

# fit the model
lr = LinearRegression().fit(X=X_train[use_columns].values, y=y_train.values)

In [41]:
print('Score:', lr.score(X_test[use_columns].values, y_test))
y_pred = lr.predict(X_test[use_columns].values)
print("mean_absolute_error:    ", metrics.mean_absolute_error(y_test, y_pred))
print("mean_squared_error:    ", metrics.mean_squared_error(y_test, y_pred))
print("root_mean_squared_error:   ", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Score: 0.4953031715652353
mean_absolute_error:     2.2563381589673526
mean_squared_error:     8.870517722493217
root_mean_squared_error:    2.978341438198988


In [42]:
trace1= go.Scatter(mode='markers', x=sales_data['TV'], y=sales_data['Sales'])
trace2 = go.Scatter(x=sales_data['TV'], y=sales_data['TV'] * lr.coef_ + lr.intercept_)

fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(trace1)
fig.add_trace(trace2,secondary_y=True)
fig['layout'].update(height = 600, width = 800,xaxis=dict(tickangle=-90))
fig.show()

This looks way better!