# ML Workflow Intro

![Image](./img/scikit_learn.png)

## Installing scikit-learn

```$ conda create -n sklearn-env -c conda-forge scikit-learn```

```$ conda activate sklearn-env```

```$ conda install ipykernel```

```$ conda install pandas```

## [API Reference](https://scikit-learn.org/stable/modules/classes.html#)

- Datasets
- Impute
- Preprocessing and Normalization
- Model Selection
- Metrics
- Linear Models
- Ensemble Methods
- Clustering

---

# ML Data Preparation

- Missing values

- Encoding

- Scaling

- Data imbalance

In [3]:
# imports

import numpy as np
import pandas as pd

from sklearn.impute import SimpleImputer

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

---

## Missing values

__scikit-learn__ estimators assume that all values in an array are numerical, and that all have and hold meaning!
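A quick way to see this requirement in action is to hand an estimator an array that contains a `NaN`. The sketch below uses a tiny made-up array (not the Titanic data) and `LogisticRegression` as an arbitrary example of an estimator that does not accept missing values; the variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature matrix (hypothetical values) with one missing entry
X = np.array([[22.0, 7.25],
              [38.0, 71.28],
              [np.nan, 7.92],
              [35.0, 53.10]])
y = np.array([0, 1, 1, 1])

try:
    LogisticRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # Input validation rejects the NaN before any training happens
    fit_failed = True
    print("fit rejected the input:", err)
```

This is why the missing values have to be dropped or imputed before modelling.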

In [4]:
# loading a classic!!!

titanic = pd.read_csv('./datasets/titanic.csv')
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


__Dataset Features:__

- PassengerId - Numerical primary key (1 to 891)

- Survived - Survival (0 = No; 1 = Yes)

- Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)

- Name - Name of the passenger

- Sex - Sex of the passenger

- Age - Age of the passenger

- SibSp - Number of Siblings/Spouses Aboard

- Parch - Number of Parents/Children Aboard

- Ticket - Ticket Number

- Fare - Passenger Fare

- Cabin - Cabin Number

- Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

In [5]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [6]:
# numeric features

titanic.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [7]:
# categorical features

cols = ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

cat_list = []
for col in cols:
    cat = titanic[col].unique()
    cat_num = len(cat)
    cat_dict = {"categorical_variable":col,
                "number_of_possible_values":cat_num,
                "values":cat}
    cat_list.append(cat_dict)
    
categories = pd.DataFrame(cat_list).sort_values(by="number_of_possible_values",
                                                ascending=False).reset_index(drop=True)
categories

Unnamed: 0,categorical_variable,number_of_possible_values,values
0,Name,891,"[Braund, Mr. Owen Harris, Cumings, Mrs. John B..."
1,Ticket,681,"[A/5 21171, PC 17599, STON/O2. 3101282, 113803..."
2,Cabin,148,"[nan, C85, C123, E46, G6, C103, D56, A6, C23 C..."
3,Embarked,4,"[S, C, Q, nan]"
4,Sex,2,"[male, female]"


In [8]:
# missing values

titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [9]:
# missing values percentage function

def missing_percentage(df):
    percent_missing = df.isnull().sum() * 100 / len(df)
    missing_values_df = pd.DataFrame({'column_name': df.columns,'percent_missing': percent_missing})
    return missing_values_df

In [10]:
# missing values percentage

missing_percentage(titanic)

Unnamed: 0,column_name,percent_missing
PassengerId,PassengerId,0.0
Survived,Survived,0.0
Pclass,Pclass,0.0
Name,Name,0.0
Sex,Sex,0.0
Age,Age,19.86532
SibSp,SibSp,0.0
Parch,Parch,0.0
Ticket,Ticket,0.0
Fare,Fare,0.0


---

### Delete missing values

In [11]:
# drop columns
no_nan_col = ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Fare']
titanic_no_nan_col = titanic[no_nan_col]
titanic_no_nan_col

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,1,0,A/5 21171,7.2500
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,1,0,PC 17599,71.2833
2,3,1,3,"Heikkinen, Miss. Laina",female,0,0,STON/O2. 3101282,7.9250
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,1,0,113803,53.1000
4,5,0,3,"Allen, Mr. William Henry",male,0,0,373450,8.0500
...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,0,0,211536,13.0000
887,888,1,1,"Graham, Miss. Margaret Edith",female,0,0,112053,30.0000
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,1,2,W./C. 6607,23.4500
889,890,1,1,"Behr, Mr. Karl Howell",male,0,0,111369,30.0000


In [12]:
missing_percentage(titanic_no_nan_col)

Unnamed: 0,column_name,percent_missing
PassengerId,PassengerId,0.0
Survived,Survived,0.0
Pclass,Pclass,0.0
Name,Name,0.0
Sex,Sex,0.0
SibSp,SibSp,0.0
Parch,Parch,0.0
Ticket,Ticket,0.0
Fare,Fare,0.0


In [13]:
# drop rows

titanic_no_nan_rows = titanic.dropna()
titanic_no_nan_rows

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


In [14]:
missing_percentage(titanic_no_nan_rows)

Unnamed: 0,column_name,percent_missing
PassengerId,PassengerId,0.0
Survived,Survived,0.0
Pclass,Pclass,0.0
Name,Name,0.0
Sex,Sex,0.0
Age,Age,0.0
SibSp,SibSp,0.0
Parch,Parch,0.0
Ticket,Ticket,0.0
Fare,Fare,0.0


---

### Imputation of missing values

In [15]:
# we make a copy

titanic_input = titanic.copy()
titanic_input

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [16]:
# Using pandas -> Numeric continuous values

titanic_input['Age'] = titanic_input['Age'].fillna(titanic_input['Age'].mean())
#titanic_input['Age'] = titanic_input['Age'].replace(np.nan, titanic_input['Age'].mean())
missing_percentage(titanic_input)

Unnamed: 0,column_name,percent_missing
PassengerId,PassengerId,0.0
Survived,Survived,0.0
Pclass,Pclass,0.0
Name,Name,0.0
Sex,Sex,0.0
Age,Age,0.0
SibSp,SibSp,0.0
Parch,Parch,0.0
Ticket,Ticket,0.0
Fare,Fare,0.0


In [17]:
# Using pandas -> Categorical values

titanic_input['Embarked'] = titanic_input['Embarked'].fillna(titanic_input['Embarked'].value_counts().index[0])
missing_percentage(titanic_input)

Unnamed: 0,column_name,percent_missing
PassengerId,PassengerId,0.0
Survived,Survived,0.0
Pclass,Pclass,0.0
Name,Name,0.0
Sex,Sex,0.0
Age,Age,0.0
SibSp,SibSp,0.0
Parch,Parch,0.0
Ticket,Ticket,0.0
Fare,Fare,0.0


In [18]:
# we make another copy

titanic_input = titanic.copy()
titanic_input

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [20]:
# Using sklearn univariate feature imputation -> Numeric continuous values

imputer = SimpleImputer(strategy='mean', missing_values=np.nan)
imputer

SimpleImputer()

#### [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)

![Image](./img/imputer_methods.JPG)

In [24]:
imputer = imputer.fit(titanic_input[['Age']])
imputer.get_params(deep=True)

{'add_indicator': False,
 'copy': True,
 'fill_value': None,
 'missing_values': nan,
 'strategy': 'mean',
 'verbose': 0}

In [25]:
titanic_input['Age'] = imputer.transform(titanic_input[['Age']])
missing_percentage(titanic_input)

Unnamed: 0,column_name,percent_missing
PassengerId,PassengerId,0.0
Survived,Survived,0.0
Pclass,Pclass,0.0
Name,Name,0.0
Sex,Sex,0.0
Age,Age,0.0
SibSp,SibSp,0.0
Parch,Parch,0.0
Ticket,Ticket,0.0
Fare,Fare,0.0


In [26]:
# Using sklearn univariate feature imputation -> Categorical values

imputer = SimpleImputer(strategy='most_frequent', missing_values=np.nan)
imputer = imputer.fit(titanic_input[['Embarked']])
titanic_input['Embarked'] = imputer.transform(titanic_input[['Embarked']])
missing_percentage(titanic_input)

Unnamed: 0,column_name,percent_missing
PassengerId,PassengerId,0.0
Survived,Survived,0.0
Pclass,Pclass,0.0
Name,Name,0.0
Sex,Sex,0.0
Age,Age,0.0
SibSp,SibSp,0.0
Parch,Parch,0.0
Ticket,Ticket,0.0
Fare,Fare,0.0


#### Other options:

- Last observation carried forward: `.ffill()` (the older spelling `.fillna(method='ffill')` is deprecated in recent pandas versions)
- Interpolation of the variable between the observations before and after a timestamp: `.interpolate(method='linear', limit_direction='forward', axis=0)`
- Using algorithms that support missing values natively (not available for most scikit-learn estimators)
- Missing values prediction (Machine Learning for Machine Learning)
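The options in this list can be sketched on toy data as follows. The series and data frame are made up for illustration, and `KNNImputer` is used here only as one possible stand-in for "predicting" missing values from the other columns; it is not necessarily the approach the list has in mind.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy time-ordered series with a gap (hypothetical values)
s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Last observation carried forward
carried = s.ffill()

# Linear interpolation between the surrounding observations
interpolated = s.interpolate(method='linear', limit_direction='forward', axis=0)

print(carried.tolist())
print(interpolated.tolist())

# Model-based imputation: fill 'a' using rows that are close in 'b'
toy = pd.DataFrame({'a': [1.0, 2.0, np.nan, 4.0],
                    'b': [10.0, 20.0, 30.0, 40.0]})
knn = KNNImputer(n_neighbors=2)
toy_filled = pd.DataFrame(knn.fit_transform(toy), columns=toy.columns)
print(toy_filled)
```

Forward fill and interpolation only make sense when the row order carries meaning (for example, a time series), which is not the case for the Titanic passengers.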

---

## Encoding Categorical Data

Again, __scikit-learn__ estimators assume that all values in an array are numerical, and that all have and hold meaning!

__Very Important:__ Ordinal Data vs. Nominal Data
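The distinction matters because an integer code implies an order. A minimal sketch with made-up data: `ticket_class` is treated as ordinal (first < second < third) and encoded with `OrdinalEncoder` using an explicit category order, while `port` is treated as nominal and expanded into indicator columns. The column names and values are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({'ticket_class': ['third', 'first', 'second', 'third'],
                    'port': ['S', 'C', 'Q', 'S']})

# Ordinal: the order is meaningful, so state it explicitly
ordinal = OrdinalEncoder(categories=[['first', 'second', 'third']])
toy['ticket_class_code'] = ordinal.fit_transform(toy[['ticket_class']]).ravel()

# Nominal: no order exists, so use one indicator column per category
port_dummies = pd.get_dummies(toy['port'], prefix='port')

print(toy)
print(port_dummies)
```

Encoding a nominal variable such as the port with integers would tell a linear model that one port is "greater" than another, which has no meaning.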


In [27]:
# first we get the categorical data
cat_cols = ['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
titanic_enconded = titanic[cat_cols]
titanic_enconded

Unnamed: 0,Pclass,Name,Sex,Ticket,Cabin,Embarked
0,3,"Braund, Mr. Owen Harris",male,A/5 21171,,S
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,PC 17599,C85,C
2,3,"Heikkinen, Miss. Laina",female,STON/O2. 3101282,,S
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,113803,C123,S
4,3,"Allen, Mr. William Henry",male,373450,,S
...,...,...,...,...,...,...
886,2,"Montvila, Rev. Juozas",male,211536,,S
887,1,"Graham, Miss. Margaret Edith",female,112053,B42,S
888,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,W./C. 6607,,S
889,1,"Behr, Mr. Karl Howell",male,111369,C148,C


In [28]:
cat_list = []
for col in cat_cols:
    cat = titanic[col].unique()
    cat_num = len(cat)
    cat_dict = {"categorical_variable":col,
                "number_of_possible_values":cat_num,
                "values":cat}
    cat_list.append(cat_dict)
    
cat_df = pd.DataFrame(cat_list).sort_values(by="number_of_possible_values",
                                                ascending=False).reset_index(drop=True)
cat_df

Unnamed: 0,categorical_variable,number_of_possible_values,values
0,Name,891,"[Braund, Mr. Owen Harris, Cumings, Mrs. John B..."
1,Ticket,681,"[A/5 21171, PC 17599, STON/O2. 3101282, 113803..."
2,Cabin,148,"[nan, C85, C123, E46, G6, C103, D56, A6, C23 C..."
3,Embarked,4,"[S, C, Q, nan]"
4,Pclass,3,"[3, 1, 2]"
5,Sex,2,"[male, female]"


---

### Label encoding

In [29]:
encoding = {'S':1, 'C':2, 'Q':3}
def ordinal_encoding(x):
    # returns None for values without a mapping (e.g. NaN)
    return encoding.get(x)

In [30]:
titanic_enconded['Embarked_num'] = titanic_enconded['Embarked'].apply(ordinal_encoding)
titanic_enconded

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,Pclass,Name,Sex,Ticket,Cabin,Embarked,Embarked_num
0,3,"Braund, Mr. Owen Harris",male,A/5 21171,,S,1.0
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,PC 17599,C85,C,2.0
2,3,"Heikkinen, Miss. Laina",female,STON/O2. 3101282,,S,1.0
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,113803,C123,S,1.0
4,3,"Allen, Mr. William Henry",male,373450,,S,1.0
...,...,...,...,...,...,...,...
886,2,"Montvila, Rev. Juozas",male,211536,,S,1.0
887,1,"Graham, Miss. Margaret Edith",female,112053,B42,S,1.0
888,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,W./C. 6607,,S,1.0
889,1,"Behr, Mr. Karl Howell",male,111369,C148,C,2.0


---

### One-hot encoding

In [31]:
# One-hot encoding https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

cat_cols = ['Name', 'Pclass', 'Sex', 'Embarked']
titanic_one_hot_encoding = pd.get_dummies(titanic[cat_cols], 
                                          columns=['Sex', 'Pclass'], 
                                          drop_first=True)
titanic_one_hot_encoding

Unnamed: 0,Name,Embarked,Sex_male,Pclass_2,Pclass_3
0,"Braund, Mr. Owen Harris",S,1,0,1
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",C,0,0,0
2,"Heikkinen, Miss. Laina",S,0,0,1
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",S,0,0,0
4,"Allen, Mr. William Henry",S,1,0,1
...,...,...,...,...,...
886,"Montvila, Rev. Juozas",S,1,1,0
887,"Graham, Miss. Margaret Edith",S,0,0,0
888,"Johnston, Miss. Catherine Helen ""Carrie""",S,0,0,1
889,"Behr, Mr. Karl Howell",C,1,0,0


__You can also use the [scikit-learn method](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) for it__
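A minimal sketch of the scikit-learn route, on a four-row toy frame rather than the full dataset. `drop='first'` plays the same role as `drop_first=True` in `pd.get_dummies`; the result is converted with `.toarray()` because the encoder returns a sparse matrix by default.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical rows) with the same two columns encoded above
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male'],
                    'Pclass': [3, 1, 3, 2]})

encoder = OneHotEncoder(drop='first')
encoded = encoder.fit_transform(toy[['Sex', 'Pclass']]).toarray()

# Categories learned per column; the first of each is the dropped one
print(encoder.categories_)
# Remaining columns: Sex=male, Pclass=2, Pclass=3
print(encoded)
```

Unlike `pd.get_dummies`, the fitted encoder remembers the categories, so the same transformation can later be applied to new data with `encoder.transform`.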

---

## Feature Scaling

Scaling of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance.

__Why is it important?__ If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
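To make the point concrete, the sketch below compares how much each feature contributes to a squared Euclidean distance before and after standardization. The three rows (an age and an income-like amount) are invented for illustration and do not come from the Titanic data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: age (tens) and income (tens of thousands), hypothetical values
X = np.array([[25.0, 30000.0],
              [60.0, 31000.0],
              [26.0, 60000.0]])

def income_share(a, b):
    """Fraction of the squared distance between a and b due to the 2nd feature."""
    sq = (a - b) ** 2
    return sq[1] / sq.sum()

# Rows 0 and 1 differ by 35 years but only 1000 in income
raw_share = income_share(X[0], X[1])

X_scaled = StandardScaler().fit_transform(X)
scaled_share = income_share(X_scaled[0], X_scaled[1])

print(f"income share of the distance, raw:    {raw_share:.4f}")
print(f"income share of the distance, scaled: {scaled_share:.4f}")
```

On the raw data the income column accounts for almost all of the distance simply because of its units; after scaling, the large age gap is what separates the two rows.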

![Image](./img/scaling.jpg)

[Here](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html) you may find a comparison between different approaches.

In [32]:
# Sample data

sample_data = titanic[['Age', 'Pclass']]
sample_data.describe()
#sample_data = sample_data.to_numpy()


Unnamed: 0,Age,Pclass
count,714.0,891.0
mean,29.699118,2.308642
std,14.526497,0.836071
min,0.42,1.0
25%,20.125,2.0
50%,28.0,3.0
75%,38.0,3.0
max,80.0,3.0


---

### [Standardization](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

The result of standardization (or Z-score normalization) is that each feature is rescaled to have mean μ=0 and standard deviation σ=1, like a standard normal distribution (the shape of the distribution itself is not changed). Here μ is the mean (average) and σ is the standard deviation from the mean; standard scores (also called z scores) of the samples are calculated as follows:

![Image](./img/standarization.JPG)
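In formula form the z score is z = (x − μ) / σ. A minimal sketch, on four made-up ages, checking that this manual computation agrees with `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([[22.0], [38.0], [26.0], [35.0]])  # toy values

mu = ages.mean(axis=0)
sigma = ages.std(axis=0)          # population standard deviation (ddof=0)
manual = (ages - mu) / sigma

from_sklearn = StandardScaler().fit_transform(ages)

print(manual.ravel())
print(from_sklearn.ravel())
```

`StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas `describe()` reports the sample standard deviation (`ddof=1`). That is why `describe()` on standardized data shows a `std` slightly above 1 instead of exactly 1.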

In [33]:
# Using scikit-learn .StandardScaler()

scaler = StandardScaler()
scaled_data = scaler.fit_transform(sample_data)
scaled_data

array([[-0.53037664,  0.82737724],
       [ 0.57183099, -1.56610693],
       [-0.25482473,  0.82737724],
       ...,
       [        nan,  0.82737724],
       [-0.25482473, -1.56610693],
       [ 0.15850313,  0.82737724]])

In [34]:
scaled_df = pd.DataFrame(scaled_data, columns=['Age', 'Pclass'])
scaled_df.describe()

Unnamed: 0,Age,Pclass
count,714.0,891.0
mean,2.174187e-16,-2.031048e-16
std,1.000701,1.000562
min,-2.016979,-1.566107
25%,-0.6595416,-0.3693648
50%,-0.1170488,0.8273772
75%,0.571831,0.8273772
max,3.465126,0.8273772


---

### [MinMax Scaling or Normalization](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)

In this approach, the data is scaled to a fixed range - usually 0 to 1. The cost of having this bounded range - in contrast to standardization - is that we will end up with smaller standard deviations, which can suppress the effect of outliers. A Min-Max scaling is typically done via the following equation:

![Image](./img/normalization.JPG)
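Written out, the transformation is x_scaled = (x − min) / (max − min). A minimal sketch comparing the manual formula with `MinMaxScaler` on a toy column that reuses the minimum (0.42) and maximum (80) ages reported by `describe()` earlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[0.42], [22.0], [38.0], [80.0]])  # toy column

col_min = ages.min(axis=0)
col_max = ages.max(axis=0)
manual = (ages - col_min) / (col_max - col_min)

from_sklearn = MinMaxScaler().fit_transform(ages)

print(manual.ravel())
print(from_sklearn.ravel())
```

The smallest value maps to 0 and the largest to 1; everything else lands proportionally in between.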

In [35]:
# Using scikit-learn .MinMaxScaler()

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(sample_data)
scaled_data

array([[0.27117366, 1.        ],
       [0.4722292 , 0.        ],
       [0.32143755, 1.        ],
       ...,
       [       nan, 1.        ],
       [0.32143755, 0.        ],
       [0.39683338, 1.        ]])

In [36]:
scaled_df = pd.DataFrame(scaled_data, columns=['Age', 'Pclass'])
scaled_df.describe()

Unnamed: 0,Age,Pclass
count,714.0,891.0
mean,0.367921,0.654321
std,0.18254,0.418036
min,0.0,0.0
25%,0.247612,0.5
50%,0.346569,1.0
75%,0.472229,1.0
max,1.0,1.0


---

### [Robust Scaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html)

This scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and interquartile range are then stored to be used on later data using the transform method.
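The fit-then-transform behaviour described above can be sketched as follows: the median and IQR are learned from a training array and then reused on later data. Both arrays are made up, and the training set includes one deliberate outlier (100).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # toy data with an outlier
later = np.array([[3.0], [5.0]])                          # data seen after fitting

scaler = RobustScaler().fit(train)
later_scaled = scaler.transform(later)

# Same computation by hand, using the statistics of the training set only
median = np.median(train, axis=0)
q25, q75 = np.percentile(train, [25, 75], axis=0)
manual = (later - median) / (q75 - q25)

print(scaler.center_, scaler.scale_)   # stored median and IQR
print(later_scaled.ravel())
print(manual.ravel())
```

Because the median and the quartiles barely react to the outlier, the stored statistics stay close to what they would be without it, which is the motivation for this scaler.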

In [37]:
# Using scikit-learn .RobustScaler()

scaler = RobustScaler()
scaled_data = scaler.fit_transform(sample_data)
scaled_data

array([[-0.33566434,  0.        ],
       [ 0.55944056, -2.        ],
       [-0.11188811,  0.        ],
       ...,
       [        nan,  0.        ],
       [-0.11188811, -2.        ],
       [ 0.22377622,  0.        ]])

In [38]:
scaled_df = pd.DataFrame(scaled_data, columns=['Age', 'Pclass'])
scaled_df.describe()

Unnamed: 0,Age,Pclass
count,714.0,891.0
mean,0.095056,-0.691358
std,0.812671,0.836071
min,-1.542937,-2.0
25%,-0.440559,-1.0
50%,0.0,0.0
75%,0.559441,0.0
max,2.909091,0.0


[Here](https://sebastianraschka.com/Articles/2014_about_feature_scaling.html) you may find more info about scaling.



---

## Data imbalance (mainly classification)

- Get more data

- Resampling: Under-sampling and Over-sampling

[Imbalanced-learn](https://imbalanced-learn.org/stable/references/index.html) is an open source, MIT-licensed library relying on scikit-learn that provides tools when dealing with classification with imbalanced classes.

`conda install -c conda-forge imbalanced-learn`

In [39]:
# Over-sampling example: original data

X, y = make_classification(n_classes=2, 
                           class_sep=2, 
                           weights=[0.1, 0.9], 
                           n_informative=3, 
                           n_redundant=1, 
                           flip_y=0, 
                           n_features=20, 
                           n_clusters_per_class=1, 
                           n_samples=1000, 
                           random_state=42)

print(X.shape, y.shape, Counter(y))

(1000, 20) (1000,) Counter({1: 900, 0: 100})


In [40]:
# Using SMOTE (Synthetic Minority Over-sampling Technique)

sm = SMOTE(random_state=42)

X_res, y_res = sm.fit_resample(X, y)

print(X_res.shape, y_res.shape, Counter(y_res))

(1800, 20) (1800,) Counter({1: 900, 0: 900})
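SMOTE covers the over-sampling side. For the other direction mentioned in the list, under-sampling, here is a minimal sketch that randomly discards majority-class rows until both classes have the same size. It uses `sklearn.utils.resample` instead of imbalanced-learn purely to keep the example dependency-free; the variable names are illustrative.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

X, y = make_classification(n_classes=2,
                           weights=[0.1, 0.9],
                           flip_y=0,
                           n_samples=1000,
                           random_state=42)

counts = Counter(y)
minority = min(counts, key=counts.get)
n_minority = counts[minority]

keep = []
for label in counts:
    idx = np.flatnonzero(y == label)
    if label != minority:
        # Randomly keep only as many majority rows as there are minority rows
        idx = resample(idx, replace=False, n_samples=n_minority, random_state=42)
    keep.append(idx)
keep = np.concatenate(keep)

X_under, y_under = X[keep], y[keep]
print(X_under.shape, y_under.shape, Counter(y_under))
```

Under-sampling throws information away, so it tends to be more attractive when the majority class is large enough that the discarded rows are not missed.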


---