# Lecture 2
## Introduction to Sklearn
### Outlier Detection in sklearn

<ol>
<li> Data used: Telephone data set from P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection; Wiley, page 26, table 2.
<li> Notebook Goal: Learn how to detect outlying points in a data set using an isolation forest model. Observe how removing outliers can substantially improve the fit of a model.
<li> Extra Exercise: Yes, theoretical and coding.
</ol>

![SegmentLocal](../Pictures/IsolationForest.png "segment")

In [15]:
#Necessary Imports
from sklearn.ensemble import IsolationForest
import pandas as pd
import numpy as np
from plotly.offline import iplot
import plotly.graph_objects as go
import plotly.offline as pyo
pyo.init_notebook_mode(connected=True)
from sklearn.linear_model import LinearRegression

We start by loading the data. 

In [16]:
df = pd.read_csv('../Data/Phones.csv', index_col=0)

We then use the IsolationForest model to detect the outliers. From the documentation of fit_predict (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html), we see that outliers are labeled -1 and inliers are labeled 1.

In [17]:
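#Fit the isolation forest on the number of calls only (a single feature)
IsoFo = IsolationForest(n_estimators=100, contamination='auto')
#labels is 1 for inliers and -1 for outliers
labels = IsoFo.fit_predict(np.array(df.calls).reshape(-1,1))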
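The labeling convention can be checked on a small example. The sketch below uses synthetic values as a stand-in for `df.calls` (the numbers and the names `values`, `iso_demo`, `labels_demo` and `scores_demo` are made up for illustration and are not taken from the Telephone data set). It also fixes `random_state`, which the cell above does not, so that repeated runs give the same labels.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical stand-in for df.calls: 20 ordinary values plus 4 extreme ones
values = np.concatenate([rng.normal(10, 1, size=20), [150.0, 160.0, 170.0, 180.0]])
X_demo = values.reshape(-1, 1)  # sklearn expects a 2D array of shape (n_samples, n_features)

# random_state is an addition for reproducibility, not part of the notebook cell
iso_demo = IsolationForest(n_estimators=100, contamination='auto', random_state=0)
labels_demo = iso_demo.fit_predict(X_demo)

# fit_predict returns 1 for inliers and -1 for outliers
print("distinct labels:", np.unique(labels_demo))
print("flagged as outliers:", values[labels_demo == -1])

# score_samples gives the underlying anomaly score; lower means more anomalous
scores_demo = iso_demo.score_samples(X_demo)
print("mean score, ordinary values:", scores_demo[:20].mean())
print("mean score, extreme values:", scores_demo[20:].mean())
```

Inspecting `score_samples` next to the hard labels can help when deciding whether the default `contamination='auto'` threshold is reasonable for a given data set.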

Using only the inliers (as predicted by the Isolation Forest model), we refit the OLS regression and compare it with the OLS regression on the full data set that we performed in the introduction.

In [18]:
#Only including the inliers
calls_filtered = df.calls[labels == 1]
date_filtered = np.array(df.year[labels == 1]).reshape(-1,1)
OLS_filtered = LinearRegression()
OLS_filtered.fit(date_filtered, calls_filtered)

#Full data set
X = np.array(df.year).reshape(-1,1)
y = df.calls

OLS = LinearRegression()
OLS.fit(X,y)
y_hat = OLS.predict(X)


fig = go.Figure()
#Four-digit year labels (not used in the traces below)
df['year_full'] = ['19' + str(x) for x in df['year']]
fig.add_trace(go.Scatter(x=df.year, y=df.calls,name='Calls', mode='markers'))
fig.add_trace(go.Scatter(x=df.year, y=OLS_filtered.predict(X),name='Filtered OLS'))
fig.add_trace(go.Scatter(x=df.year, y=y_hat, name='Linear Regression'))

fig.update_layout(title='Number of calls in the data set',title_x=0.5)
fig.update_layout(xaxis_title = 'Year', yaxis_title='Number of calls')
fig.update_xaxes(tickangle=-45)
iplot(fig)

We see that the outliers were detected and removed. The filtered OLS fit appears to capture the underlying trend in the data successfully.
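The visual comparison can also be put into numbers by comparing the fitted slopes. The following is a minimal sketch on synthetic data, not on the Telephone data set: the trend, the size of the inflated values and the hand-written mask `is_inlier` (which stands in for `labels == 1`) are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: an exact linear trend with slope 2 ...
year_demo = np.arange(50, 74).reshape(-1, 1)
calls_demo = 2.0 * year_demo.ravel() - 90.0

# ... in which five consecutive observations are inflated
calls_demo[14:19] += 100.0

# Hand-written inlier mask; in the notebook this role is played by labels == 1
is_inlier = np.ones(len(calls_demo), dtype=bool)
is_inlier[14:19] = False

# OLS on all points versus OLS on the inliers only
ols_all = LinearRegression().fit(year_demo, calls_demo)
ols_in = LinearRegression().fit(year_demo[is_inlier], calls_demo[is_inlier])

print("slope, all points:  ", ols_all.coef_[0])
print("slope, inliers only:", ols_in.coef_[0])
```

In this constructed example the inlier-only fit recovers the slope of the underlying trend, while the fit on all points is pulled upward by the inflated values. On real data the same comparison can be made with `OLS.coef_` and `OLS_filtered.coef_`.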

__Exercise__: 

IsolationForest is an ensemble method. What is meant by this? What does the 'n_estimators' hyperparameter mean in these models?

__Exercise__: 

Choose your favorite outlier detection model that is implemented in sklearn. Use this model to find outliers in the Telephone data set. Does your model return the same predictions as the one used above?

In [13]:
#Exercise