# Exercise: Simple outlier removal on meteorological data
## Instructions
1. Load the data from the csv file
2. We are going to look at data retrieved from "sensor 2", which measures temperature, pressure and humidity (columns `Temp2`, `Pression` and `hum2`). Create a new dataframe containing only these columns.
3. Use `plotly.express` to plot the points in each column. Use `px.scatter` to see only the available points without connecting lines. Zoom in and out and notice how the time frequency of the measurements varies. Note that many dates have no measurements registered. Do you see hints of outliers?
4. Plot histograms and use the `describe` function to visualize some statistics. Do you see hints of outlier presence?
5. Plot each column using boxplots (using `pandas.DataFrame.plot(kind='box',...)`). Do you see any outliers in the plots?
6. Try changing the whisker limits using the parameter `whis`, and observe the difference in the number of outliers detected. Repeat the plot for the following thresholds:
    - ± 3 IQR
    - 5th / 95th percentiles
    - 1st / 99th percentiles
    - 0.5th / 99.5th percentiles
7. Based on the previous thresholds, use a simple method to remove some of the outliers.
    - Select appropriate thresholds for each column.
    - You can either remove them (with a boolean mask) or use winsorizing (`pandas.DataFrame.clip`).
    - Save the clean data to a **new** dataframe, **do not overwrite the previous one**.
8. To illustrate the effect of removing outliers, you are going to plot a heatmap representing the correlation matrix of these columns.
    - Repeat the plot for the data with and without outliers.
    - Did you observe major changes in the correlation coefficients?
    - You can use the following code to plot the correlation heatmap:
        ``` python
        sns.heatmap(data.corr(),
                    vmin=-1, vmax=1,
                    annot=True, fmt=".2f", cmap='coolwarm',
                    mask=np.tri(data.shape[1], k=-1).T)
        ```



## Useful documentation
- [Plotly express `scatter`](https://plotly.com/python-api-reference/generated/plotly.express.scatter.html)
- [Pandas `DataFrame.plot(kind=...)`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html)
- [Matplotlib `hist`](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.hist.html)
- [Seaborn `histplot`](https://seaborn.pydata.org/generated/seaborn.histplot.html)
- [Matplotlib `boxplot`](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.boxplot.html)
- [Seaborn `boxplot`](https://seaborn.pydata.org/generated/seaborn.boxplot.html)
- [Pandas `DataFrame.clip`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.clip.html)

# Imports

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

import seaborn as sns

import plotly.express as px
# Increase the font size of all plot elements
sns.set(font_scale=1.5)


# Load the data

In [2]:
import urllib.request

url = 'https://raw.githubusercontent.com/EPF-MDE/data-cleaning/main/Outliers/data/donnee_Station_Meteo.csv'
filename = 'meteo_data.csv'

# Download the file and save it to the local disk
urllib.request.urlretrieve(url, filename)

# Load the file into a Pandas DataFrame
full_data = pd.read_csv(filename, sep=';', index_col=1, parse_dates=[1])
full_data.info()  # info() prints its summary itself and returns None, so it needs no display()
display(full_data.head())

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3036 entries, 2018-06-22 12:00:52 to 2022-01-31 07:08:40
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             3036 non-null   int64  
 1   Temp1          3036 non-null   float64
 2   hum1           3036 non-null   float64
 3   Pression       3036 non-null   float64
 4   Temp2          3036 non-null   float64
 5   hum2           3036 non-null   float64
 6   Conc_Stand_1   3036 non-null   int64  
 7   Conc_Stand_25  3036 non-null   int64  
 8   Conc_Stand_10  3036 non-null   int64  
 9   Conc_Env_1     3036 non-null   int64  
 10  Conc_Env_25    3036 non-null   int64  
 11  Conc_Env_10    3036 non-null   int64  
 12  Part_03        3036 non-null   int64  
 13  Part_05        3036 non-null   int64  
 14  Part_1         3036 non-null   int64  
 15  Part_25        3036 non-null   int64  
 16  Part_5         3036 non-null   int64  
 17  Part_50        3

time,id,Temp1,hum1,Pression,Temp2,hum2,Conc_Stand_1,Conc_Stand_25,Conc_Stand_10,Conc_Env_1,Conc_Env_25,Conc_Env_10,Part_03,Part_05,Part_1,Part_25,Part_5,Part_50,Lum,Unnamed: 20
2018-06-22 12:00:52,1,28.7,41.45,1011.88,29.07,50.52,6,10,22,6,10,22,0,0,0,0,0,0,51,
2018-06-22 12:05:59,2,26.67,44.21,1011.9,26.63,53.92,6,9,9,6,9,9,1164,366,50,3,0,0,54,
2018-06-22 12:11:05,3,25.63,46.44,1011.81,26.34,54.95,6,8,10,6,8,10,1164,340,37,5,2,0,22,
2018-06-22 12:16:12,4,24.88,50.0,1011.77,26.07,57.26,7,8,9,7,8,9,1245,367,45,3,1,0,9,
2018-06-22 12:21:19,5,24.95,49.12,1011.67,26.01,56.83,4,6,6,4,6,6,1182,332,28,2,0,0,11,


# Select columns of interest

In [3]:
data_interest = full_data[['Temp2', 'Pression', 'hum2']].copy()
data_interest

time,Temp2,Pression,hum2
2018-06-22 12:00:52,29.07,1011.88,50.52
2018-06-22 12:05:59,26.63,1011.90,53.92
2018-06-22 12:11:05,26.34,1011.81,54.95
2018-06-22 12:16:12,26.07,1011.77,57.26
2018-06-22 12:21:19,26.01,1011.67,56.83
...,...,...,...
2022-01-23 15:23:36,12.63,1023.85,100.00
2022-01-23 15:24:06,12.60,1023.82,100.00
2022-01-24 14:45:46,15.50,1025.13,100.00
2022-01-26 15:42:28,129.37,369.48,0.00


# Visualize data points

# Plot histograms
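
A minimal sketch of this step, assuming `pandas.DataFrame.hist` is acceptable here (the documentation list also mentions Matplotlib `hist` and Seaborn `histplot`). The `toy` frame is a hypothetical stand-in made of random values plus the suspicious final reading from the table above; in the notebook, `data_interest` would be used instead.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Toy stand-in (assumption) so the sketch runs on its own; use `data_interest` in the notebook.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(
    {
        "Temp2": np.append(rng.normal(15, 5, n), 129.37),
        "Pression": np.append(rng.normal(1013, 8, n), 369.48),
        "hum2": np.append(rng.uniform(30, 100, n), 0.00),
    }
)

# One histogram per column, side by side.
# A single extreme value stretches the x-axis and squeezes the bulk of the data into a few bins.
axes = toy.hist(bins=50, layout=(1, 3), figsize=(15, 4))
plt.tight_layout()
```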


# Print basic statistics with `describe`
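
As a sketch, `describe` can also report the percentiles used later as candidate thresholds through its `percentiles` parameter. The `toy` frame is a hypothetical stand-in (random values plus the suspicious final reading from the table above); in the notebook, `data_interest` would be used instead.

```python
import numpy as np
import pandas as pd

# Toy stand-in (assumption) so the sketch runs on its own; use `data_interest` in the notebook.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(
    {
        "Temp2": np.append(rng.normal(15, 5, n), 129.37),
        "Pression": np.append(rng.normal(1013, 8, n), 369.48),
        "hum2": np.append(rng.uniform(30, 100, n), 0.00),
    }
)

# Summary statistics, extended with the percentiles used as thresholds in the exercise.
stats = toy.describe(percentiles=[0.005, 0.01, 0.05, 0.5, 0.95, 0.99, 0.995])
print(stats)
```

A large gap between `min`/`max` and the neighbouring percentiles is a hint that a few extreme values are present.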


# Plot box-plots
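
A minimal sketch of the default box plots, assuming one subplot per column so that each variable keeps its own scale. The `toy` frame is a hypothetical stand-in (random values plus the suspicious final reading from the table above); in the notebook, `data_interest` would be used instead.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Toy stand-in (assumption) so the sketch runs on its own; use `data_interest` in the notebook.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(
    {
        "Temp2": np.append(rng.normal(15, 5, n), 129.37),
        "Pression": np.append(rng.normal(1013, 8, n), 369.48),
        "hum2": np.append(rng.uniform(30, 100, n), 0.00),
    }
)

# Default whiskers (1.5 IQR); points drawn beyond the whiskers are the candidate outliers.
toy.plot(kind="box", subplots=True, layout=(1, 3), figsize=(15, 4))
fig = plt.gcf()
fig.tight_layout()
```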


### Cut-out at +- 1.5 IQR
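
The sketch below counts, for a given multiplier `k`, the points that a box plot with `whis=k` would draw as outliers, and then plots the boxes for `k = 1.5` and `k = 3`. The helper `count_iqr_outliers` and the `toy` frame are hypothetical names introduced for illustration; in the notebook, `data_interest` would take the place of `toy`.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Toy stand-in (assumption) so the sketch runs on its own; use `data_interest` in the notebook.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(
    {
        "Temp2": np.append(rng.normal(15, 5, n), 129.37),
        "Pression": np.append(rng.normal(1013, 8, n), 369.48),
        "hum2": np.append(rng.uniform(30, 100, n), 0.00),
    }
)


def count_iqr_outliers(frame, k):
    """Count, per column, the points lying more than k * IQR beyond the quartiles."""
    q1, q3 = frame.quantile(0.25), frame.quantile(0.75)
    iqr = q3 - q1
    outside = frame.lt(q1 - k * iqr, axis="columns") | frame.gt(q3 + k * iqr, axis="columns")
    return outside.sum()


for k in (1.5, 3.0):
    print(f"whis={k}:", count_iqr_outliers(toy, k).to_dict())
    toy.plot(kind="box", whis=k, subplots=True, layout=(1, 3), figsize=(15, 4))
```

A wider multiplier can only flag the same points or fewer, which makes the counts easy to compare across thresholds.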

### Cut-out at 5%/95%
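
For percentile-based whiskers, `whis` accepts a pair of percentiles such as `(5, 95)`. The sketch below counts the points outside each pair and draws the 5%/95% plot; the same call with `(1, 99)` or `(0.5, 99.5)` would cover the other two thresholds. The helper `count_percentile_outliers` and the `toy` frame are hypothetical names introduced for illustration; in the notebook, `data_interest` would take the place of `toy`.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Toy stand-in (assumption) so the sketch runs on its own; use `data_interest` in the notebook.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(
    {
        "Temp2": np.append(rng.normal(15, 5, n), 129.37),
        "Pression": np.append(rng.normal(1013, 8, n), 369.48),
        "hum2": np.append(rng.uniform(30, 100, n), 0.00),
    }
)


def count_percentile_outliers(frame, low, high):
    """Count, per column, the points below the `low` or above the `high` percentile."""
    lower, upper = frame.quantile(low / 100), frame.quantile(high / 100)
    outside = frame.lt(lower, axis="columns") | frame.gt(upper, axis="columns")
    return outside.sum()


for low, high in [(5, 95), (1, 99), (0.5, 99.5)]:
    print(f"whis=({low}, {high}):", count_percentile_outliers(toy, low, high).to_dict())

# Whiskers placed at the 5th and 95th percentiles.
toy.plot(kind="box", whis=(5, 95), subplots=True, layout=(1, 3), figsize=(15, 4))
```

Note that percentile whiskers always flag a fixed share of the data, even when no value is actually abnormal.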

### Cut-out at 1%/99%

### Cut-out at 0.5%/99.5%

# Use a simple method to remove outliers
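
The two options from the instructions can be sketched as follows. This example assumes the same 1%/99% thresholds for every column, which is only a placeholder choice since the exercise asks for thresholds selected per column. The names `toy`, `data_masked` and `data_clipped` are hypothetical; in the notebook, `data_interest` would take the place of `toy`.

```python
import numpy as np
import pandas as pd

# Toy stand-in (assumption) so the sketch runs on its own; use `data_interest` in the notebook.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(
    {
        "Temp2": np.append(rng.normal(15, 5, n), 129.37),
        "Pression": np.append(rng.normal(1013, 8, n), 369.48),
        "hum2": np.append(rng.uniform(30, 100, n), 0.00),
    }
)

# Per-column thresholds (assumed here: 1st and 99th percentiles for every column).
lower = toy.quantile(0.01)
upper = toy.quantile(0.99)

# Option 1: drop every row in which at least one column falls outside its thresholds.
inside = (toy.ge(lower, axis="columns") & toy.le(upper, axis="columns")).all(axis=1)
data_masked = toy[inside].copy()

# Option 2: winsorize, i.e. pull extreme values back to the thresholds and keep all rows.
data_clipped = toy.clip(lower=lower, upper=upper, axis=1)

print(len(toy), len(data_masked), len(data_clipped))
```

Both results are new dataframes, so the original data stays untouched. Masking shortens the series, while clipping keeps its length but creates repeated values at the thresholds.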

# Plot column correlation with outliers
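
The heatmap code from the instructions can be applied as in the sketch below, which also spells out what the `mask` argument does. The `toy` frame is a hypothetical stand-in (random values plus the suspicious final reading from the table above); in the notebook, `data_interest` would be used instead.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Toy stand-in (assumption) so the sketch runs on its own; use `data_interest` in the notebook.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(
    {
        "Temp2": np.append(rng.normal(15, 5, n), 129.37),
        "Pression": np.append(rng.normal(1013, 8, n), 369.48),
        "hum2": np.append(rng.uniform(30, 100, n), 0.00),
    }
)

corr = toy.corr()

# np.tri(n, k=-1) has ones strictly below the diagonal; transposing moves them above it.
# Cells where the mask is True are hidden, so only the diagonal and lower triangle are drawn.
mask = np.tri(corr.shape[0], k=-1).T.astype(bool)

ax = sns.heatmap(corr, vmin=-1, vmax=1, annot=True, fmt=".2f", cmap="coolwarm", mask=mask)
```

Hiding the upper triangle removes redundant cells, since a correlation matrix is symmetric.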

# Plot column correlation after removing outliers
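
To compare the two heatmaps numerically, the correlation matrices can be subtracted, as in this sketch. It assumes winsorizing at the 1st/99th percentiles as the cleaning method; `toy`, `toy_clean` and `change` are hypothetical names, and in the notebook the original and cleaned dataframes would be used instead.

```python
import numpy as np
import pandas as pd

# Toy stand-in (assumption) so the sketch runs on its own; use the notebook dataframes instead.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(
    {
        "Temp2": np.append(rng.normal(15, 5, n), 129.37),
        "Pression": np.append(rng.normal(1013, 8, n), 369.48),
        "hum2": np.append(rng.uniform(30, 100, n), 0.00),
    }
)

# Cleaned copy (assumed method: winsorizing at the 1st/99th percentiles of each column).
toy_clean = toy.clip(lower=toy.quantile(0.01), upper=toy.quantile(0.99), axis=1)

corr_raw = toy.corr()
corr_clean = toy_clean.corr()

# Difference between the two matrices: large entries show where outliers drove the correlation.
change = corr_clean - corr_raw
print(change.round(2))
```

In this toy data the columns are generated independently, so the strong correlation between `Temp2` and `Pression` before cleaning comes from the single extreme row alone.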