# Lecture 2
## Introduction to Sklearn
### Scalers in Sklearn!

<ol>
<li> Data used: Simulated data (generated in this notebook)
<li> Notebook Goal: Learn how scalers work in Sklearn, what the StandardScaler does and how to make your own custom scaler!
<li> Extra Exercise: Yes, see below.
</ol>

In [19]:
# Necessary imports
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot
from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler

# Enable inline plotly figures in the notebook
init_notebook_mode(connected=True)

We start by drawing 9000 samples of a bivariate normal random vector $X$ with

$$ X \sim \mathcal{N}(\vec{\mu}, \Sigma), \quad \text{where } \vec{\mu} = \begin{pmatrix} 0.5 \\ 14 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 0.25 & 0.3 \\ 0.3 & 7 \end{pmatrix}. $$

This can be done using the _random_ module from the numpy package ( https://stackoverflow.com/questions/8674832/sampling-from-bivariate-normal-in-python ).

In [20]:
means = [0.5, 14]
cov_matrix = [[0.25, 0.3], [0.3, 7]]
X = np.random.multivariate_normal(means, cov_matrix, 9000)
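
As a quick sanity check, the empirical mean and covariance of such a sample should be close to the parameters passed to the sampler. The sketch below redraws the sample with a seeded generator so the numbers are reproducible; the seed and the names `rng`, `X_check`, `sample_mean` and `sample_cov` are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np

means = [0.5, 14]
cov_matrix = [[0.25, 0.3], [0.3, 7]]

# Seeded generator: an assumption made here for reproducibility
rng = np.random.default_rng(0)
X_check = rng.multivariate_normal(means, cov_matrix, 9000)

# Column-wise mean and sample covariance
# (rowvar=False: each column is a variable, each row an observation)
sample_mean = X_check.mean(axis=0)
sample_cov = np.cov(X_check, rowvar=False)

print(sample_mean)
print(sample_cov)
```

With 9000 samples, both estimates should agree with `means` and `cov_matrix` up to small sampling noise.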

The data cloud is as follows.

In [21]:
fig = go.Figure()
fig.add_traces(go.Scatter(x=X[:,0], y=X[:,1],mode='markers'))
fig.update_layout(title='Randomly Generated Bivariate Normal Data', title_x = 0.5)
fig.update_layout(width=750, height=500, autosize=False)
iplot(fig)

We will now scale the data using the StandardScaler class. Just like the other models we have seen so far, the actual fitting and transforming is barely any work.

In fact, in cases like this we can actually fit and transform _at the same time_ using the fit\_transform method!

In [22]:
StS = StandardScaler()
X_scaled = StS.fit_transform(X)
# Without an axis argument, np.mean and np.std aggregate over both columns at once
print(f'The mean of the scaled sample is {np.mean(X_scaled)} and the standard deviation is {str(np.std(X_scaled))[:4]}')

The mean of the scaled sample is -9.752593794271686e-15 and the standard deviation is 0.99


The mean of the scaled sample is zero up to floating-point rounding error: its magnitude (about $10^{-14}$) is only a few dozen times the machine epsilon.

In [23]:
print(np.finfo(float).eps)

2.220446049250313e-16


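Standardizing shifts and rescales each axis separately, so the point cloud keeps its overall shape but is now centered at the origin with unit variance along both axes.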

In [24]:

fig = go.Figure()
fig.add_traces(go.Scatter(x=X_scaled[:,0], y=X_scaled[:,1],mode='markers'))
fig.update_layout(title='Standardized Data', title_x=0.5)
fig.update_layout(width=750, height=500, autosize=False)
iplot(fig)
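
To make explicit what the StandardScaler does, the following sketch reproduces its output by hand: subtract the column mean and divide by the column standard deviation. The data is redrawn with a seeded generator, and the names `X_demo`, `X_sk` and `X_manual` are assumptions chosen for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Seeded sample with the same parameters as in the sampling cell (seed is an assumption)
rng = np.random.default_rng(0)
X_demo = rng.multivariate_normal([0.5, 14], [[0.25, 0.3], [0.3, 7]], 9000)

scaler = StandardScaler()
X_sk = scaler.fit_transform(X_demo)

# Manual standardization, column by column. NumPy's default ddof=0
# (population standard deviation) matches what StandardScaler uses.
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_sk, X_manual))           # do both approaches agree?
print(scaler.mean_, scaler.scale_)           # statistics learned during fit
print(X_sk.mean(axis=0), X_sk.std(axis=0))   # per-column mean and std after scaling
```

The fitted attributes `mean_` and `scale_` hold the per-column statistics, which is why a fitted scaler can later transform new data with the same shift and scale.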

## Making your own scaler

In this section, we briefly discuss how to write your own scaler. To do so, we define a class that inherits from BaseEstimator in sklearn.base.

We will construct a scaler that maps all values into the interval [0,10] by clipping: values outside the interval are mapped to the nearest boundary, while values inside it remain unchanged.
Strictly speaking this is clipping rather than rescaling, but it uses the same fit/transform interface as any other scaler.

In order to make this scaler, we have to define the fit and transform methods of our new class. The code is given below.

In [25]:
class CutOffScaler(BaseEstimator):
    """Clip every value of a DataFrame to the interval [min_cut, max_cut]."""

    def __init__(self, min_cut=0, max_cut=10):
        self.min_cut = min_cut
        self.max_cut = max_cut

    def fit(self, X, y=None):
        # Nothing to learn: the cut-offs are fixed at construction time
        return self

    def transform(self, X, y=None):
        # Expects a pandas DataFrame; column names and index are preserved
        clipped = np.clip(X.values, self.min_cut, self.max_cut)
        return pd.DataFrame(data=clipped, columns=X.columns, index=X.index)

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X, y)

We now apply this class to a toy data set.

In [26]:
df = pd.DataFrame({
    'A' : [-5,2,1,9,-2,-1,12],
    'B' : [131,-55,4,12,0,1,2]
})
df

|   | A | B |
|---|---|---|
| 0 | -5 | 131 |
| 1 | 2 | -55 |
| 2 | 1 | 4 |
| 3 | 9 | 12 |
| 4 | -2 | 0 |
| 5 | -1 | 1 |
| 6 | 12 | 2 |


In [27]:
CO = CutOffScaler()
CO.transform(df)

|   | A | B |
|---|---|---|
| 0 | 0 | 10 |
| 1 | 2 | 0 |
| 2 | 1 | 4 |
| 3 | 9 | 10 |
| 4 | 0 | 0 |
| 5 | 0 | 1 |
| 6 | 10 | 2 |
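
As a possible variation, sklearn.base also provides TransformerMixin, which supplies a default fit\_transform built from fit and transform, so that method does not have to be written by hand. The sketch below uses a hypothetical class name `ClipScaler` to avoid confusion with the class above; it is one way to structure such a transformer, not the notebook's own implementation.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ClipScaler(TransformerMixin, BaseEstimator):
    """Hypothetical variant of the cut-off scaler that inherits fit_transform."""

    def __init__(self, min_cut=0, max_cut=10):
        self.min_cut = min_cut
        self.max_cut = max_cut

    def fit(self, X, y=None):
        # Stateless: there is nothing to estimate from the data
        return self

    def transform(self, X):
        # Expects a pandas DataFrame and returns one with the same labels
        clipped = np.clip(X.values, self.min_cut, self.max_cut)
        return pd.DataFrame(clipped, columns=X.columns, index=X.index)


df = pd.DataFrame({
    'A': [-5, 2, 1, 9, -2, -1, 12],
    'B': [131, -55, 4, 12, 0, 1, 2]
})

# fit_transform comes from TransformerMixin, get_params from BaseEstimator
clipped_df = ClipScaler(max_cut=5).fit_transform(df)
params = ClipScaler(max_cut=5).get_params()
print(clipped_df)
print(params)
```

Storing the constructor arguments unchanged under the same attribute names is what allows BaseEstimator to report them through get\_params.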


__Exercise__:

Write a class called MinMaxCustom that 
<ol>
<li> Maps the minimum of the data set to 0
<li> Maps the maximum of the data set to 1
<li> Maps any value $x$ to

$$ x_{\text{new}} = \frac{x - x_\text{min}}{x_\text{max}-x_\text{min}} $$

</ol>
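
To check your solution, it can help to see the formula applied directly with pandas and compared against sklearn's MinMaxScaler. The sketch below assumes the formula is applied to each column separately and that no column is constant (otherwise the denominator would be zero); the names `col_min`, `col_max`, `df_minmax` and `reference` are chosen for this illustration. It deliberately does not define the MinMaxCustom class itself.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    'A': [-5, 2, 1, 9, -2, -1, 12],
    'B': [131, -55, 4, 12, 0, 1, 2]
})

# Per-column minimum and maximum (assumption: scaling is done column by column)
col_min = df.min(axis=0)
col_max = df.max(axis=0)

# x_new = (x - x_min) / (x_max - x_min), broadcast over the columns
df_minmax = (df - col_min) / (col_max - col_min)

# Reference result from sklearn, returned as a NumPy array
reference = MinMaxScaler().fit_transform(df)

print(df_minmax)
print(np.allclose(df_minmax.values, reference))
```

A finished MinMaxCustom class should compute `col_min` and `col_max` in fit and apply the formula in transform, so that new data is scaled with the statistics of the training data.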