# Exercise: Automatic outlier removal on meteorological data

## Instructions
Here you will continue to work with the meteorological data from the previous exercise. This time you will be trying out two "automatic" (multivariate) outlier detection methods: [local outlier factor](https://scikit-learn.org/stable/auto_examples/neighbors/plot_lof_outlier_detection.html) and [isolation forest](https://scikit-learn.org/stable/auto_examples/ensemble/plot_isolation_forest.html). 

1. Load data from the CSV file. 
2. A technical report on the data mentions that sensor 1's measurements of temperature and humidity are more reliable than sensor 2's. Drop the temperature and humidity columns for sensor 2 (`Temp2` and `hum2`).
3. Make a pair plot of the data. At many points in time one of the sensors was malfunctioning, and the pair plot may reveal suspicious data points.
4. This time we will try to filter out outliers on all columns at once. For each method:
    - Apply the method to the data. 
    - Vary the hyperparameters indicated in the instructions for each method.
    - Plot the correlation matrix and compare the measured correlations before and after cleaning.
    - Repeat the pair plot and note the differences before and after cleaning.
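
The before/after comparison of correlations in the last step can be sketched as follows. This is only an illustration on synthetic data: the column names `temp` and `hum`, the `-999` error code and the helper `correlation_change` are assumptions made for the example, not part of the station dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the station data (hypothetical columns).
rng = np.random.default_rng(0)
n = 500
temp = rng.normal(15, 5, n)
hum = 80 - 2 * temp + rng.normal(0, 3, n)  # humidity falls as temperature rises
toy = pd.DataFrame({"temp": temp, "hum": hum})

# Mimic a malfunctioning sensor: the first 10 rows hold an error code.
bad_rows = toy.index[:10]
toy.loc[bad_rows, "temp"] = -999.0
toy.loc[bad_rows, "hum"] = -999.0


def correlation_change(before, inlier_mask):
    """Return the correlation matrices before and after cleaning, and their difference."""
    after = before[inlier_mask]
    corr_before = before.corr()
    corr_after = after.corr()
    return corr_before, corr_after, corr_after - corr_before


# Here the mask is known by construction; in the exercise it comes from a detector.
inlier_mask = ~toy.index.isin(bad_rows)
corr_before, corr_after, corr_diff = correlation_change(toy, inlier_mask)
print(corr_before.round(2))
print(corr_after.round(2))
```

In this toy case a handful of corrupted rows is enough to flip the sign of the measured correlation, which is why the exercise asks for the comparison.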

# Imports

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

import seaborn as sns

import plotly.express as px
# increase font size of all elements
sns.set(font_scale=1.5)


# Load the data

In [None]:
# read csv file
full_data = (pd.read_csv('data/donnee_Station_Meteo.csv', sep=';',
                       index_col=1,
                       parse_dates=[1])
            .dropna(how='all')
            .drop(columns=['id',' '])
            )
# info() prints its summary and returns None, so it does not need display()
full_data.info()
display(full_data.head())

# Select columns of interest

In [None]:
data = full_data.drop(columns=['Temp2', 'hum2'])

# Visualize data points

In [None]:
px.scatter(data_frame=data)

# Pair plot


In [None]:
sns.pairplot(data, diag_kws=dict(stat='proportion', kde=True), aspect=5)

# Method 1: Local Outlier Factor

Check out the documentation: [`sklearn.neighbors.LocalOutlierFactor`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html)

Try different values for the following parameters:
- `n_neighbors`: try 20 (default) and 40
- `contamination`: try `'auto'`, 0.1, 0.01

Then check their effects on the max and min values for each column. Organize your results in a table or plot.

Finally, plot the correlation matrix obtained after cleaning. Do you get a different result than with the simple univariate methods?


In [None]:
from sklearn.neighbors import LocalOutlierFactor

estimator = LocalOutlierFactor()

estimator.fit(data)
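
One possible way to organize the hyperparameter sweep is sketched below. It runs on synthetic data rather than the station file, so the columns `a`, `b`, `c` and the injected outliers are assumptions for illustration. It relies on `fit_predict`, which in the default (non-novelty) mode of `LocalOutlierFactor` returns `1` for inliers and `-1` for outliers.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: a standard normal cloud plus 5 far-away rows (hypothetical).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
toy.iloc[:5] = rng.normal(50, 5, size=(5, 3))

rows = []
for n_neighbors in (20, 40):
    for contamination in ("auto", 0.1, 0.01):
        lof = LocalOutlierFactor(n_neighbors=n_neighbors,
                                 contamination=contamination)
        labels = lof.fit_predict(toy)  # 1 = inlier, -1 = outlier
        kept = toy[labels == 1]
        rows.append({
            "n_neighbors": n_neighbors,
            "contamination": contamination,
            "n_removed": int((labels == -1).sum()),
            "overall_min": kept.min().min(),
            "overall_max": kept.max().max(),
        })

# One row per hyperparameter combination.
lof_summary = pd.DataFrame(rows)
print(lof_summary)
```

For the exercise itself, `kept.agg(["min", "max"])` gives the per-column extremes instead of the overall ones. Note that `LocalOutlierFactor` does not accept missing values, so any remaining `NaN` would have to be handled first.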

# Method 2: Isolation Forest

Check out the documentation: [`sklearn.ensemble.IsolationForest`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html)

Try different values for the following parameters:
- `n_estimators`: try 100 (default) and 300
- `contamination`: try `'auto'`, 0.1, 0.01

Then check their effects on the max and min values for each column. Organize your results in a table or plot.

Finally, plot the correlation matrix obtained after cleaning. Do you get a different result than with the simple univariate methods?


In [None]:
from sklearn.ensemble import IsolationForest

estimator = IsolationForest(random_state=42)

estimator.fit(data)
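
The same sweep can be sketched for the isolation forest, this time tracking how one correlation coefficient changes after cleaning. Again this is a hedged illustration on synthetic data: the columns `temp` and `hum` and the corrupted rows near `-999` are assumptions, not values taken from the station file.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic data: negatively correlated columns plus 5 corrupted rows (hypothetical).
rng = np.random.default_rng(1)
n = 300
temp = rng.normal(15, 5, n)
hum = 80 - 2 * temp + rng.normal(0, 3, n)
toy = pd.DataFrame({"temp": temp, "hum": hum})
toy.iloc[:5] = rng.normal(-999, 5, size=(5, 2))

corr_raw = toy.corr().loc["temp", "hum"]

rows = []
for n_estimators in (100, 300):
    for contamination in ("auto", 0.1, 0.01):
        forest = IsolationForest(n_estimators=n_estimators,
                                 contamination=contamination,
                                 random_state=42)
        labels = forest.fit_predict(toy)  # 1 = inlier, -1 = outlier
        kept = toy[labels == 1]
        rows.append({
            "n_estimators": n_estimators,
            "contamination": contamination,
            "n_removed": int((labels == -1).sum()),
            "corr_temp_hum": kept.corr().loc["temp", "hum"],
        })

iforest_summary = pd.DataFrame(rows)
print(f"raw correlation: {corr_raw:.2f}")
print(iforest_summary)
```

A small `contamination` may leave some corrupted rows in place, while a large one also trims legitimate extreme values, so it is worth reading `n_removed` next to the correlation.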