## *Data Transformation*
- This notebook showcases the data transformation techniques that I have gathered.

### <font color='yellow'>Encoding</font>
>This technique is used for discrete data types. **Example:** Converting categories into numbers.

###  <font color='yellow'>Normalisation</font>

>This technique is applied to continuous data. **Example:** Scaling values into the range $[0, 1]$.


In [49]:
import pandas as pd
import numpy as np 
from IPython.display import display

# For Numeric Attributes Preprocessing
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, QuantileTransformer

# For Discrete Attributes Preprocessing
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

## Loading a numeric Dataset

In [50]:
data = {
    'X1': [25, 18, 22],
    'X2': [40, 44, 35],
    'X3': [100, 200, 300]
}

df = pd.DataFrame(data)
print('list of numeric values: {}'.format(list(df.describe(include = np.number))))
display(df)

list of numeric values: ['X1', 'X2', 'X3']


Unnamed: 0,X1,X2,X3
0,25,40,100
1,18,44,200
2,22,35,300



# 🌟 Normalisation 🌟
#### <font color='cyan'>This technique is used for <u>Numeric Attributes Preprocessing</u></font>
---

## `MinMaxScaler()`

#### *Rescales numeric values to the range $[0, 1]$*
- **<font color='green'>Optimal for uniform distribution😀</font>**
- **<font color='green'>Preserves the distribution shape😀</font>**
- **<font color='red'>Sensitive to outliers.🤬</font>**
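The scaling rule can be checked by hand: each column is mapped with $x' = (x - \min) / (\max - \min)$. The following is a minimal sketch, assuming the same three-row dataset used in this notebook; the names `X`, `manual` and `scaled` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Same values as the numeric dataset of this notebook (columns X1, X2, X3)
X = np.array([[25, 40, 100],
              [18, 44, 200],
              [22, 35, 300]], dtype=float)

# Column-wise min-max formula: (x - min) / (max - min)
col_min = X.min(axis=0)
col_max = X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)

# The library result should agree with the hand-written formula
scaled = MinMaxScaler().fit_transform(X)

print(np.round(manual, 6))
print(np.allclose(manual, scaled))
```

Because the formula depends only on the column minimum and maximum, a single extreme value stretches the denominator and squeezes every other value towards 0, which is why this scaler is sensitive to outliers.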


In [51]:
df_aux = df.copy() # using an auxiliary dataframe to avoid changes in the original dataframe
print('df before applying MinMaxScaler()')
display(df_aux)

min_max_scaler = MinMaxScaler()
arr = min_max_scaler.fit_transform(df_aux)
df_aux = pd.DataFrame(arr, columns=['X1', 'X2', 'X3'])

print('df after applying MinMaxScaler()')
display(df_aux) # display the scaled copy, not the original df

df before applying MinMaxScaler()


Unnamed: 0,X1,X2,X3
0,25,40,100
1,18,44,200
2,22,35,300


df after applying MinMaxScaler()


Unnamed: 0,X1,X2,X3
0,1.0,0.555556,0.0
1,0.0,1.0,0.5
2,0.571429,0.0,1.0


## `StandardScaler()`

#### *Standardises numeric values to zero mean and unit variance; the result is unbounded,* $(-\infty, +\infty)$
- **<font color='green'>Optimal for symmetric distribution</font>**
- **<font color='green'>Preserves the distribution shape</font>**
- **<font color='red'>Sensitive to outliers</font>**
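Standardisation can be reproduced with the z-score formula $z = (x - \mu) / \sigma$ applied per column. The sketch below assumes the same three-row dataset as the rest of the notebook and uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same values as the numeric dataset of this notebook (columns X1, X2, X3)
X = np.array([[25, 40, 100],
              [18, 44, 200],
              [22, 35, 300]], dtype=float)

# Column-wise z-score: subtract the mean, divide by the population std
mean = X.mean(axis=0)
std = X.std(axis=0)  # NumPy's default ddof=0 is the population std
manual = (X - mean) / std

# Compare with the library implementation
scaled = StandardScaler().fit_transform(X)

print(np.round(manual, 6))
print(np.allclose(manual, scaled))
```

After the transformation every column has mean 0 and standard deviation 1, but the values themselves are not confined to a fixed interval.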


In [52]:
df_aux = df.copy() # using an auxiliary dataframe to avoid changes in the original dataframe

print('df Before applying StandardScaler()')
display(df_aux)

standard_scaler = StandardScaler()
arr = standard_scaler.fit_transform(df_aux)
df_aux = pd.DataFrame(arr, columns=['X1', 'X2', 'X3'])

print('df After applying StandardScaler()')
display(df_aux)


df Before applying StandardScaler()


Unnamed: 0,X1,X2,X3
0,25,40,100
1,18,44,200
2,22,35,300


df After applying StandardScaler()


Unnamed: 0,X1,X2,X3
0,1.162476,0.090536,-1.224745
1,-1.278724,1.176965,0.0
2,0.116248,-1.2675,1.224745


## `RobustScaler()`
---
#### *Transforms numeric values to an unbounded range*
 $(-\infty, +\infty)$
#### *Transformed values are based on <u>the median and the interquartile range (IQR)</u>, not on the mean and standard deviation*
---
- **<font color='green'>Optimal for asymmetric distribution😀</font>**
- **<font color='green'>Preserves the distribution shape😀</font>**
- **<font color='green'>not sensitive to outliers😀</font>**
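With its default settings, `RobustScaler` centres each column on its median and divides by the interquartile range (the 75th percentile minus the 25th). The sketch below reproduces that by hand, assuming the same three-row dataset used in this notebook; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Same values as the numeric dataset of this notebook (columns X1, X2, X3)
X = np.array([[25, 40, 100],
              [18, 44, 200],
              [22, 35, 300]], dtype=float)

# Column-wise median and interquartile range (IQR = Q3 - Q1)
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
iqr = q3 - q1
manual = (X - median) / iqr

# Compare with the library implementation (default quantile_range is 25 to 75)
scaled = RobustScaler().fit_transform(X)

print(np.round(manual, 6))
print(np.allclose(manual, scaled))
```

The median and the quartiles barely move when a few extreme values are added, which is where the robustness to outliers comes from.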


In [53]:
df_aux = df.copy() # using an auxiliary dataframe to avoid changes in the original dataframe
print('df before applying RobustScaler()')
display(df_aux)

robust_scaler = RobustScaler()
arr = robust_scaler.fit_transform(df_aux)
df_aux = pd.DataFrame(arr)

print('df after applying RobustScaler()')
display(df_aux)



df before applying RobustScaler()


Unnamed: 0,X1,X2,X3
0,25,40,100
1,18,44,200
2,22,35,300


df after applying RobustScaler()


Unnamed: 0,0,1,2
0,0.857143,0.0,-1.0
1,-1.142857,0.888889,0.0
2,0.0,-1.111111,1.0


## `QuantileTransformer()`
---
#### *Maps each feature onto a target distribution through its quantiles: $[0, 1]$ for the default uniform output, $(-\infty, +\infty)$ for `output_distribution='normal'`*
---
- **<font color='green'>Optimal for asymmetric & Multimodal distribution😀</font>**

In [54]:
df_aux = df.copy() # using an auxiliary dataframe to avoid changes in the original dataframe
print('df before applying QuantileTransformer()')
display(df_aux)

quantile_transformer = QuantileTransformer(
    output_distribution = 'normal',
    n_quantiles = len(df_aux))
arr = quantile_transformer.fit_transform(df_aux)
df_aux = pd.DataFrame(arr)

print('df after applying QuantileTransformer()')
display(df_aux)

df before applying QuantileTransformer()


Unnamed: 0,X1,X2,X3
0,25,40,100
1,18,44,200
2,22,35,300


df after applying QuantileTransformer()


Unnamed: 0,0,1,2
0,5.199338,0.0,-5.199338
1,-5.199338,5.199338,0.0
2,0.0,-5.199338,5.199338
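The $\pm 5.199338$ entries above appear to be the smallest and largest value of each column: with a normal target their theoretical image would be $\pm\infty$, so the transformer clips them to a finite bound. The sketch below contrasts the default uniform target with the normal one, assuming the same three-row dataset; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Same values as the numeric dataset of this notebook (columns X1, X2, X3)
X = np.array([[25, 40, 100],
              [18, 44, 200],
              [22, 35, 300]], dtype=float)

# Default target: uniform distribution, so outputs lie in [0, 1]
uniform = QuantileTransformer(n_quantiles=len(X)).fit_transform(X)

# Normal target: outputs are unbounded in principle, extremes are clipped
normal = QuantileTransformer(
    output_distribution='normal',
    n_quantiles=len(X)).fit_transform(X)

print(uniform)
print(np.round(normal, 6))
```

In both cases only the rank of each value within its column matters, so the original spacing between values is not preserved.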



# 🌟 Encoding 🌟
#### <font color='cyan'>This technique is used for <u>Categorical Attributes Preprocessing</u></font>
---

## Loading a Categorical Dataset

In [55]:
data = {
    'Sex': ['M', 'F', 'M', 'F', 'F', 'F', 'M', 'F'],
    'Season': ['Summer', 'Spring', 'Winter', 'Autumn', 'Summer', 'Spring', 'Winter', 'Autumn'],
}

df = pd.DataFrame(data)
print('list of categorical values: {}'.format(list(df.describe(exclude = np.number))))

display(df)

list of categorical values: ['Sex', 'Season']


Unnamed: 0,Sex,Season
0,M,Summer
1,F,Spring
2,M,Winter
3,F,Autumn
4,F,Summer
5,F,Spring
6,M,Winter
7,F,Autumn


## `OneHotEncoder() & get_dummies()`
---
#### *Replaces each categorical descriptor with one binary column per category, so n descriptors become k binary descriptors (k ≥ n)*
---

In [56]:
df_aux = df.copy() # using an auxiliary dataframe to avoid changes in the original dataframe
print("df before applying OneHotEncoder to 'Sex'")
display(df_aux)

one_hot_enc = OneHotEncoder(
    sparse_output = False,
    categories = [['M', 'F']])

df_aux[['S=M', 'S=F']] = one_hot_enc.fit_transform(df_aux.loc[:, ['Sex']])
df_aux.drop(labels = ['Sex'], axis = 1, inplace = True)

print("df after applying OneHotEncoder() to 'Sex'")
display(df_aux)

df before applying OneHotEncoder to 'Sex'


Unnamed: 0,Sex,Season
0,M,Summer
1,F,Spring
2,M,Winter
3,F,Autumn
4,F,Summer
5,F,Spring
6,M,Winter
7,F,Autumn


df after applying OneHotEncoder() to 'Sex'


Unnamed: 0,Season,S=M,S=F
0,Summer,1.0,0.0
1,Spring,0.0,1.0
2,Winter,1.0,0.0
3,Autumn,0.0,1.0
4,Summer,0.0,1.0
5,Spring,0.0,1.0
6,Winter,1.0,0.0
7,Autumn,0.0,1.0
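The heading also names `get_dummies()`, the pandas counterpart of `OneHotEncoder()`. The following is a minimal sketch on the same categorical dataset; the `prefix='S'` and `prefix_sep='='` arguments are chosen here only to mimic the `S=M` / `S=F` column names used above.

```python
import pandas as pd

# Same categorical dataset as in this notebook
df = pd.DataFrame({
    'Sex': ['M', 'F', 'M', 'F', 'F', 'F', 'M', 'F'],
    'Season': ['Summer', 'Spring', 'Winter', 'Autumn',
               'Summer', 'Spring', 'Winter', 'Autumn'],
})

# One binary column per category of 'Sex'; 'Season' is left untouched
dummies = pd.get_dummies(
    df,
    columns=['Sex'],
    prefix='S',
    prefix_sep='=',
    dtype=float)

print(dummies)
```

Unlike `OneHotEncoder(categories=[['M', 'F']])`, `get_dummies()` orders the new columns by sorted category, so `S=F` comes before `S=M`. It also does not remember the categories it has seen, so `OneHotEncoder` is usually the safer choice when the same encoding must be reapplied to new data.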


## `OrdinalEncoder() & LabelEncoder()`
---
#### *Transforms string-valued descriptors into integer codes; by default the categories are numbered in sorted (alphabetical) order*
---
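`LabelEncoder()` performs the same kind of mapping, but on a single 1-D array; scikit-learn intends it for target labels rather than for feature columns. The following is a minimal sketch, using the `Season` values of this notebook as if they were a label column.

```python
from sklearn.preprocessing import LabelEncoder

# 'Season' values of this notebook, treated here as a 1-D label array
seasons = ['Summer', 'Spring', 'Winter', 'Autumn',
           'Summer', 'Spring', 'Winter', 'Autumn']

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(seasons)

# classes_ holds the sorted categories; a code is an index into it
print(list(label_encoder.classes_))
print(codes.tolist())

# inverse_transform maps the integer codes back to the original strings
print(list(label_encoder.inverse_transform(codes[:3])))
```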



In [57]:
df_aux = df.copy() # using an auxiliary dataframe to avoid changes in the original dataframe
print("df before applying OrdinalEncoder() to ['Sex', 'Season']")
display(df_aux)

ordinal_encoder = OrdinalEncoder()

df_aux[['Sex', 'Season']] = ordinal_encoder.fit_transform(df_aux.loc[:, ['Sex', 'Season']])

print("df after applying OrdinalEncoder() to ['Sex', 'Season']")
display(df_aux)

df before applying OrdinalEncoder() to ['Sex', 'Season']


Unnamed: 0,Sex,Season
0,M,Summer
1,F,Spring
2,M,Winter
3,F,Autumn
4,F,Summer
5,F,Spring
6,M,Winter
7,F,Autumn


df after applying OrdinalEncoder() to ['Sex', 'Season']


Unnamed: 0,Sex,Season
0,1.0,2.0
1,0.0,1.0
2,1.0,3.0
3,0.0,0.0
4,0.0,2.0
5,0.0,1.0
6,1.0,3.0
7,0.0,0.0
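In the output above the codes follow alphabetical order (`Autumn` = 0, `Spring` = 1, `Summer` = 2, `Winter` = 3), which does not match any natural order of the seasons. When the order matters, it can be passed explicitly through the `categories` parameter. The sketch below assumes one possible ordering, starting at winter; the `season_order` list and the `Season_code` column are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# 'Season' column of the categorical dataset used in this notebook
df = pd.DataFrame({
    'Season': ['Summer', 'Spring', 'Winter', 'Autumn',
               'Summer', 'Spring', 'Winter', 'Autumn'],
})

# Assumed ordering for illustration: the position in the list becomes the code
season_order = ['Winter', 'Spring', 'Summer', 'Autumn']

ordinal_encoder = OrdinalEncoder(categories=[season_order])
df['Season_code'] = ordinal_encoder.fit_transform(df[['Season']]).ravel()

print(df)
```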
