# <center>Welcome to Supervised Learning</center>
## <center>Part 2: How to prepare your data for supervised machine learning</center>
## <center>Instructor: Andras Zsom</center>
### <center>https://github.com/azsom/Supervised-Learning<center>

## The topic of the course series: supervised Machine Learning (ML)
- how to build an ML pipeline from beginning to deployment
- we assume you already performed data cleaning
- this is the first course out of 6 courses
    - Part 1: Introduction to machine learning and the bias-variance tradeoff
    - **Part 2: How to prepare your data for supervised machine learning**
    - Part 3: Evaluation metrics in supervised machine learning
    - Part 4: SVMs, Random Forests, XGBoost
    - Part 5: Missing data in supervised ML
    - Part 6: Interpretability
- you can complete the courses in sequence or complete individual courses based on your interest

### Structured data
| X|feature_1|feature_2|...|feature_j|...|feature_m|<font color='red'>Y</font>|
|-|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|__data_point_1__|x_11|x_12|...|x_1j|...|x_1m|__<font color='red'>y_1</font>__|
|__data_point_2__|x_21|x_22|...|x_2j|...|x_2m|__<font color='red'>y_2</font>__|
|__...__|...|...|...|...|...|...|__<font color='red'>...</font>__|
|__data_point_i__|x_i1|x_i2|...|x_ij|...|x_im|__<font color='red'>y_i</font>__|
|__...__|...|...|...|...|...|...|__<font color='red'>...</font>__|
|__data_point_n__|x_n1|x_n2|...|x_nj|...|x_nm|__<font color='red'>y_n</font>__|

We focus on the feature matrix (X) in this course.

### Learning objectives of this course

By the end of the course, you will be able to
- describe why data splitting is necessary in machine learning
- summarize the properties of IID data
- list examples of non-IID datasets
- apply IID splitting techniques
- apply non-IID splitting techniques
- identify when a custom splitting strategy is necessary
- describe the two motivating concepts behind preprocessing
- apply various preprocessors to categorical and continuous features
- perform preprocessing with a sklearn pipeline and ColumnTransformer


# Module 1: Split IID data
### Learning objectives of this module:
- describe why data splitting is necessary in machine learning
- summarize the properties of IID data
- apply IID splitting techniques

## Why do we split the data?
- we want to find the best hyper-parameters of our ML algorithms
   - fit models to training data
   - evaluate each model on validation set
   - we find hyper-parameter values that optimize the validation score
- we want to know how the model will perform on previously unseen data
   - apply our final model on the test set
   
### We need to split the data into three parts!

## How should we split the data into train/validation/test?

- data is **Independent and Identically Distributed** (iid)
   - all samples stem from the same generative process and the generative process is assumed to have no memory of past generated samples
   - identify cats and dogs on images
   - predict the house price
   - predict if someone's salary is above or below 50k
- examples of not iid data:
   - data generated by time-dependent processes
   - data has group structure (samples collected from e.g., different subjects, experiments, measurement devices)

## Splitting strategies for iid data: basic approach
- 60% train, 20% validation, 20% test for small datasets
- 98% train, 1% validation, 1% test for large datasets
    - if you have 1 million points, you still have 10000 points in validation and test which is plenty to assess model performance


### Let's work with the adult data!

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split 

df = pd.read_csv('data/adult_data.csv')

# let's separate the feature matrix X, and target variable y
y = df['gross-income'] # remember, we want to predict who earns more than 50k or less than 50k
X = df.loc[:, df.columns != 'gross-income'] # all other columns are features
print(y)
print(X.head())


0         <=50K
1         <=50K
2         <=50K
3         <=50K
4         <=50K
          ...  
32556     <=50K
32557      >50K
32558     <=50K
32559     <=50K
32560      >50K
Name: gross-income, Length: 32561, dtype: object
   age          workclass  fnlwgt   education  education-num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital-status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Mal

In [2]:
help(train_test_split)

Help on function train_test_split in module sklearn.model_selection._split:

train_test_split(*arrays, **options)
    Split arrays or matrices into random train and test subsets
    
    Quick utility that wraps input validation and
    ``next(ShuffleSplit().split(X, y))`` and application to input data
    into a single call for splitting (and optionally subsampling) data in a
    oneliner.
    
    Read more in the :ref:`User Guide <cross_validation>`.
    
    Parameters
    ----------
    *arrays : sequence of indexables with same length / shape[0]
        Allowed inputs are lists, numpy arrays, scipy-sparse
        matrices or pandas dataframes.
    
    test_size : float or int, default=None
        If float, should be between 0.0 and 1.0 and represent the proportion
        of the dataset to include in the test split. If int, represents the
        absolute number of test samples. If None, the value is set to the
        complement of the train size. If ``train_size`` is also Non

In [3]:
random_state = 42

# first split to separate out the training set
X_train, X_other, y_train, y_other = train_test_split(X,y,train_size = 0.6,random_state=random_state)
print('training set:',X_train.shape, y_train.shape) # 60% of points are in train
print(X_other.shape, y_other.shape) # 40% of points are in other

# second split to separate out the validation and test sets
X_val, X_test, y_val, y_test = train_test_split(X_other,y_other,train_size = 0.5,random_state=random_state)
print('validation set:',X_val.shape, y_val.shape) # 20% of points are in validation
print('test set:',X_test.shape, y_test.shape) # 20% of points are in test

training set: (19536, 14) (19536,)
(13025, 14) (13025,)
validation set: (6512, 14) (6512,)
test set: (6513, 14) (6513,)


## Randomness due to splitting
- the model performance, validation and test scores will change depending on which points are in train, val, test
    - inherent randomness or uncertainty of the ML pipeline
- change the random state a couple of times and repeat the whole ML pipeline to assess how much the random splitting affects your test score
    - you would expect a similar uncertainty when the model is deployed

## Splitting strategies for iid data: k-fold splitting

<center><img src="figures/grid_search_cross_validation.png" width="600"></center>


In [4]:
from sklearn.model_selection import KFold
help(KFold)

Help on class KFold in module sklearn.model_selection._split:

class KFold(_BaseKFold)
 |  KFold(n_splits=5, *, shuffle=False, random_state=None)
 |  
 |  K-Folds cross-validator
 |  
 |  Provides train/test indices to split data in train/test sets. Split
 |  dataset into k consecutive folds (without shuffling by default).
 |  
 |  Each fold is then used once as a validation while the k - 1 remaining
 |  folds form the training set.
 |  
 |  Read more in the :ref:`User Guide <cross_validation>`.
 |  
 |  Parameters
 |  ----------
 |  n_splits : int, default=5
 |      Number of folds. Must be at least 2.
 |  
 |      .. versionchanged:: 0.22
 |          ``n_splits`` default value changed from 3 to 5.
 |  
 |  shuffle : bool, default=False
 |      Whether to shuffle the data before splitting into batches.
 |      Note that the samples within each split will not be shuffled.
 |  
 |  random_state : int or RandomState instance, default=None
 |      When `shuffle` is True, `random_state` af

In [5]:
random_state = 42

# first split to separate out the test set
X_other, X_test, y_other, y_test = train_test_split(X,y,test_size = 0.2,random_state=random_state)
print(X_other.shape,y_other.shape)
print('test set:',X_test.shape,y_test.shape)

# do KFold split on other
kf = KFold(n_splits=5,shuffle=True,random_state=random_state)
for train_index, val_index in kf.split(X_other,y_other):
    X_train = X_other.iloc[train_index]
    y_train = y_other.iloc[train_index]
    X_val = X_other.iloc[val_index]
    y_val = y_other.iloc[val_index]
    print('   training set:',X_train.shape, y_train.shape) 
    print('   validation set:',X_val.shape, y_val.shape) 
    # the validation set contains different points in each iteration
    print(X_val[['age','workclass','education']].head())
    

(26048, 14) (26048,)
test set: (6513, 14) (6513,)
   training set: (20838, 14) (20838,)
   validation set: (5210, 14) (5210,)
       age   workclass      education
27240   38     Private      Bachelors
4       28     Private      Bachelors
14242   34     Private        HS-grad
16461   58     Private   Some-college
2209    49   Local-gov        HS-grad
   training set: (20838, 14) (20838,)
   validation set: (5210, 14) (5210,)
       age   workclass   education
5514    33   Local-gov   Bachelors
32240   21     Private   Assoc-voc
8615    33     Private        10th
7743    20     Private     HS-grad
20097   39     Private   Assoc-voc
   training set: (20838, 14) (20838,)
   validation set: (5210, 14) (5210,)
       age          workclass      education
9876    27            Private   Some-college
5455    44            Private      Bachelors
29805   62   Self-emp-not-inc      Bachelors
15081   20            Private        HS-grad
13770   40            Private     Assoc-acdm
   training se

## How many splits should I create?
- tough question, 3-5 is most common
- if you do n splits, n models will be trained, so the larger the n, the most computationally intensive it will be to train the models
- KFold is usually better suited to small datasets
- KFold is good to estimate uncertainty due to random splitting of train and val, but it is not perfect
    - the test set remains the same

### Why shuffling iid data is important?
- by default, data is not shuffled by Kfold which can introduce errors!
<center><img src="figures/kfold.png" width="600"></center>


## Imbalanced data
- imbalanced data: only a small fraction of the points are in one of the classes, usually ~5% or less but there is no hard limit here
- examples:
    - people visit a bank's website. do they sign up for a new credit card?
        - most customers just browse and leave the page
        - usually 1% or less of the customers get a credit card (class 1), the rest leaves the page without signing up (class 0).
    - fraud detection
        - only a tiny fraction of credit card payments are fraudulent
    - rare disease diagnosis
- the issue with imbalanced data:
    - if you apply train_test_split or KFold, you might not have class 1 points in one of your sets by chance
    - this is what we need to fix

## Solution: stratified splits

In [6]:
random_state = 42

X_train, X_other, y_train, y_other = train_test_split(X,y,train_size = 0.6,random_state=random_state)
X_val, X_test, y_val, y_test = train_test_split(X_other,y_other,train_size = 0.5,random_state=random_state)

print('**balance without stratification:**')
# a variation on the order of 1% which would be too much for imbalanced data!
print(y_train.value_counts(normalize=True))
print(y_val.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))

X_train, X_other, y_train, y_other = train_test_split(X,y,train_size = 0.6,stratify=y,random_state=random_state)
X_val, X_test, y_val, y_test = train_test_split(X_other,y_other,train_size = 0.5,stratify=y_other,random_state=random_state)
print('**balance with stratification:**')
# very little variation (in the 4th decimal point only) which is important if the problem is imbalanced
print(y_train.value_counts(normalize=True))
print(y_val.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))

**balance without stratification:**
 <=50K    0.758855
 >50K     0.241145
Name: gross-income, dtype: float64
 <=50K    0.75476
 >50K     0.24524
Name: gross-income, dtype: float64
 <=50K    0.764625
 >50K     0.235375
Name: gross-income, dtype: float64
**balance with stratification:**
 <=50K    0.759214
 >50K     0.240786
Name: gross-income, dtype: float64
 <=50K    0.759214
 >50K     0.240786
Name: gross-income, dtype: float64
 <=50K    0.759097
 >50K     0.240903
Name: gross-income, dtype: float64


## Stratified folds
<center><img src="figures/stratified_kfold.png" width="600"></center>


In [7]:
from sklearn.model_selection import StratifiedKFold
help(StratifiedKFold)

Help on class StratifiedKFold in module sklearn.model_selection._split:

class StratifiedKFold(_BaseKFold)
 |  StratifiedKFold(n_splits=5, *, shuffle=False, random_state=None)
 |  
 |  Stratified K-Folds cross-validator
 |  
 |  Provides train/test indices to split data in train/test sets.
 |  
 |  This cross-validation object is a variation of KFold that returns
 |  stratified folds. The folds are made by preserving the percentage of
 |  samples for each class.
 |  
 |  Read more in the :ref:`User Guide <cross_validation>`.
 |  
 |  Parameters
 |  ----------
 |  n_splits : int, default=5
 |      Number of folds. Must be at least 2.
 |  
 |      .. versionchanged:: 0.22
 |          ``n_splits`` default value changed from 3 to 5.
 |  
 |  shuffle : bool, default=False
 |      Whether to shuffle each class's samples before splitting into batches.
 |      Note that the samples within each split will not be shuffled.
 |  
 |  random_state : int or RandomState instance, default=None
 |     

In [8]:
# what we did before: variance in balance on the order of 1%
random_state = 42

X_other, X_test, y_other, y_test = train_test_split(X,y,test_size = 0.2,random_state=random_state)
print('test balance:',y_test.value_counts(normalize=True))

# do KFold split on other
kf = KFold(n_splits=5,shuffle=True,random_state=random_state)
for train_index, val_index in kf.split(X_other,y_other):
    X_train = X_other.iloc[train_index]
    y_train = y_other.iloc[train_index]
    X_val = X_other.iloc[val_index]
    y_val = y_other.iloc[val_index]
    print('train balance:')
    print(y_train.value_counts(normalize=True))
    print('val balance:')
    print(y_val.value_counts(normalize=True))

test balance:  <=50K    0.75879
 >50K     0.24121
Name: gross-income, dtype: float64
train balance:
 <=50K    0.756982
 >50K     0.243018
Name: gross-income, dtype: float64
val balance:
 <=50K    0.768522
 >50K     0.231478
Name: gross-income, dtype: float64
train balance:
 <=50K    0.757702
 >50K     0.242298
Name: gross-income, dtype: float64
val balance:
 <=50K    0.765643
 >50K     0.234357
Name: gross-income, dtype: float64
train balance:
 <=50K    0.761014
 >50K     0.238986
Name: gross-income, dtype: float64
val balance:
 <=50K    0.752399
 >50K     0.247601
Name: gross-income, dtype: float64
train balance:
 <=50K    0.758866
 >50K     0.241134
Name: gross-income, dtype: float64
val balance:
 <=50K    0.760991
 >50K     0.239009
Name: gross-income, dtype: float64
train balance:
 <=50K    0.761889
 >50K     0.238111
Name: gross-income, dtype: float64
val balance:
 <=50K    0.748896
 >50K     0.251104
Name: gross-income, dtype: float64


In [9]:
# stratified K Fold: variation in balance is very small (4th decimal point)
random_state = 42

# stratified train-test split
X_other, X_test, y_other, y_test = train_test_split(X,y,test_size = 0.2,stratify=y,random_state=random_state)
print('test balance:',y_test.value_counts(normalize=True))

# do StratifiedKFold split on other
kf = StratifiedKFold(n_splits=5,shuffle=True,random_state=random_state)
for train_index, val_index in kf.split(X_other,y_other):
    X_train = X_other.iloc[train_index]
    y_train = y_other.iloc[train_index]
    X_val = X_other.iloc[val_index]
    y_val = y_other.iloc[val_index]
    print('train balance:')
    print(y_train.value_counts(normalize=True))
    print('val balance:')
    print(y_val.value_counts(normalize=True))

test balance:  <=50K    0.759251
 >50K     0.240749
Name: gross-income, dtype: float64
train balance:
 <=50K    0.75919
 >50K     0.24081
Name: gross-income, dtype: float64
val balance:
 <=50K    0.759117
 >50K     0.240883
Name: gross-income, dtype: float64
train balance:
 <=50K    0.75919
 >50K     0.24081
Name: gross-income, dtype: float64
val balance:
 <=50K    0.759117
 >50K     0.240883
Name: gross-income, dtype: float64
train balance:
 <=50K    0.75919
 >50K     0.24081
Name: gross-income, dtype: float64
val balance:
 <=50K    0.759117
 >50K     0.240883
Name: gross-income, dtype: float64
train balance:
 <=50K    0.759154
 >50K     0.240846
Name: gross-income, dtype: float64
val balance:
 <=50K    0.759263
 >50K     0.240737
Name: gross-income, dtype: float64
train balance:
 <=50K    0.759154
 >50K     0.240846
Name: gross-income, dtype: float64
val balance:
 <=50K    0.759263
 >50K     0.240737
Name: gross-income, dtype: float64


# Module 2: Split non-IID data
### Learning objectives of this module:
- list examples of non-IID datasets
- apply non-IID splitting techniques
- identify when a custom splitting strategy is necessary

## Examples of non-iid data
- if there is any sort of time or group structure in your data, it is likely non-iid
    - group structure:
        - each point is someone's visit to the ER and some people visited the ER multiple times
        - each point is stats of a youtube video and the stats are collected weekly, one of the stats is whether it is featured
        - each point is a customer's visit to CVS and customers tend to return regularly
    - time structure
        - each point is the stocks price at a given time
        - eahc point is a person's health or activity status
        

## Ask yourself these questions!
- What is the intended use of the model? What is it supposed to do/predict?
- What data do you have available at the time of prediction?
- Your split must mimic the intended use of the model only then will you accurately estimate how well the model will perform on previously unseen points (generalization error).
- two examples:
    - if you want to predict the outcome of a new patient's visit to the ER:
        - your test score must be based on patients not included in training and validation
        - your validation score must be based on patients not included in training
        - points of one patient should not be distributed over multiple sets because your generalization error will be off
    - a youtube video was released 4 weeks ago and you want to predict if it will be featured a week from now, your training data should only contain info that will available upon predictions (stuff you know 4 weeks after release)
        - split data based on youtube vid ID
        - use info that's available 4 weeks after release
        - your classification label will be whether it was featured or not 5 weeks after release

## Group-based split: GroupShuffleSplit
<center><img src="figures/groupshufflesplit.png" width="600"></center>


In [10]:
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
X = np.ones(shape=(8, 2))
y = np.ones(shape=(8, 1))
groups = np.array([1, 1, 2, 2, 2, 3, 3, 3])

gss = GroupShuffleSplit(n_splits=10, train_size=.8, random_state=42)

for train_idx, test_idx in gss.split(X, y, groups):
    print("TRAIN:", train_idx, "TEST:", test_idx)


TRAIN: [2 3 4 5 6 7] TEST: [0 1]
TRAIN: [0 1 5 6 7] TEST: [2 3 4]
TRAIN: [2 3 4 5 6 7] TEST: [0 1]
TRAIN: [0 1 5 6 7] TEST: [2 3 4]
TRAIN: [2 3 4 5 6 7] TEST: [0 1]
TRAIN: [0 1 5 6 7] TEST: [2 3 4]
TRAIN: [0 1 5 6 7] TEST: [2 3 4]
TRAIN: [0 1 2 3 4] TEST: [5 6 7]
TRAIN: [2 3 4 5 6 7] TEST: [0 1]
TRAIN: [0 1 2 3 4] TEST: [5 6 7]


## Group-based split: GroupKFold
<center><img src="figures/groupkfold.png" width="600"></center>


In [11]:
from sklearn.model_selection import GroupKFold

group_kfold = GroupKFold(n_splits=3)

for train_index, test_index in group_kfold.split(X, y, groups):
    print("TRAIN:", train_index, "TEST:", test_index)


TRAIN: [0 1 2 3 4] TEST: [5 6 7]
TRAIN: [0 1 5 6 7] TEST: [2 3 4]
TRAIN: [2 3 4 5 6 7] TEST: [0 1]


In [12]:
help(GroupKFold)

Help on class GroupKFold in module sklearn.model_selection._split:

class GroupKFold(_BaseKFold)
 |  GroupKFold(n_splits=5)
 |  
 |  K-fold iterator variant with non-overlapping groups.
 |  
 |  The same group will not appear in two different folds (the number of
 |  distinct groups has to be at least equal to the number of folds).
 |  
 |  The folds are approximately balanced in the sense that the number of
 |  distinct groups is approximately the same in each fold.
 |  
 |  Parameters
 |  ----------
 |  n_splits : int, default=5
 |      Number of folds. Must be at least 2.
 |  
 |      .. versionchanged:: 0.22
 |          ``n_splits`` default value changed from 3 to 5.
 |  
 |  Examples
 |  --------
 |  >>> import numpy as np
 |  >>> from sklearn.model_selection import GroupKFold
 |  >>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
 |  >>> y = np.array([1, 2, 3, 4])
 |  >>> groups = np.array([0, 0, 2, 2])
 |  >>> group_kfold = GroupKFold(n_splits=2)
 |  >>> group_kfold.get_n_splits

## Data leakage in time series data is similar!
- do NOT use information in validation or test which will not be available once your model is deployed
   - don't use future information!
   
<center><img src="figures/timeseriessplit.png" width="600"></center>


In [13]:
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])
tscv = TimeSeriesSplit()
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]


TRAIN: [0] TEST: [1]
TRAIN: [0 1] TEST: [2]
TRAIN: [0 1 2] TEST: [3]
TRAIN: [0 1 2 3] TEST: [4]
TRAIN: [0 1 2 3 4] TEST: [5]


# Module 3: Preprocess continuous and categorical features
### Learning objectives of this module:
- describe the two motivating concepts behind preprocessing
- apply various preprocessors to categorical and continuous features
- perform preprocessing with a sklearn pipeline and ColumnTransformer

### Data almost never comes in a format that's directly usable in ML
- ML works with numerical data but some columns can be text (e.g., home country, educational level, gender, race)
    - some ML algorithms accept (and prefer) a non-numerical feature matrix (like [CatBoost](https://catboost.ai/) ) but that's not standard
    - sklearn throws an error message if the feature matrix contains non-numerical elements
- the order of magnitude of numerical features can vary greatly which is not good for most ML algorithms (e.g., salary in USD, age in years, time spent on the site in sec)
    - many ML algorithms are distance-based and they perform better and converge faster if the features are standardized (features have a mean of 0 and the same standard deviation, usually 1)
        - Lasso and Ridge regression because of the penalty term, K Nearest Neightbors, SVM, linear models if you want to use the coefficients to measure feature importance (more on this in part 6), neural networks
    - tree-based methods don't require standardization 
    - check out part 1 to learn more about linear and logistic regression, Lasso and Ridge
    - check out part 4 to learn more about SVMs, tree-based methods, and K Nearest Neighbors

### scikit-learn transformers to the rescue!

Preprocessing is done with various transformers. All transformes have three methods:
- **fit** method: estimates parameters necessary to do the transformation,
- **transform** method: transforms the data based on the estimated parameters,
- **fit_transform** method: both steps are performed at once, this can be faster than doing the steps separately.

### Transformers we cover 
- **OrdinalEncoder** - converts categorical features into an integer array
- **OneHotEncoder** - converts categorical features into dummy arrays
- **StandardScaler** - standardizes continuous features by removing the mean and scaling to unit variance

## Ordered categorical data: OrdinalEncoder

Let's assume we have a categorical feature and training and test sets

The cateogies can be ordered or ranked

E.g., educational level in the adult dataset

In [14]:
import pandas as pd

train_edu = {'educational level':['Bachelors','Masters','Bachelors','Doctorate','HS-grad','Masters']} 
test_edu = {'educational level':['HS-grad','Masters','Masters','College','Bachelors']}

X_train = pd.DataFrame(train_edu)
X_test = pd.DataFrame(test_edu)

In [15]:
from sklearn.preprocessing import OrdinalEncoder
help(OrdinalEncoder)

Help on class OrdinalEncoder in module sklearn.preprocessing._encoders:

class OrdinalEncoder(_BaseEncoder)
 |  OrdinalEncoder(*, categories='auto', dtype=<class 'numpy.float64'>)
 |  
 |  Encode categorical features as an integer array.
 |  
 |  The input to this transformer should be an array-like of integers or
 |  strings, denoting the values taken on by categorical (discrete) features.
 |  The features are converted to ordinal integers. This results in
 |  a single column of integers (0 to n_categories - 1) per feature.
 |  
 |  Read more in the :ref:`User Guide <preprocessing_categorical_features>`.
 |  
 |  .. versionadded:: 0.20
 |  
 |  Parameters
 |  ----------
 |  categories : 'auto' or a list of array-like, default='auto'
 |      Categories (unique values) per feature:
 |  
 |      - 'auto' : Determine categories automatically from the training data.
 |      - list : ``categories[i]`` holds the categories expected in the ith
 |        column. The passed categories should no

In [16]:
# initialize the encoder
cats = ['HS-grad','Bachelors','Masters','Doctorate']

enc = OrdinalEncoder(categories = [cats]) # The ordered list of 
# categories need to be provided. By default, the categories are alphabetically ordered!

# fit the training data
enc.fit(X_train)
# print the categories - not really important because we manually gave the ordered list of categories
print(enc.categories_)
# transform X_train. We could have used enc.fit_transform(X_train) to combine fit and transform
X_train_oe = enc.transform(X_train)
print(X_train_oe)
# transform X_test
X_test_oe = enc.transform(X_test) # OrdinalEncoder always throws an error message if 
                                  # it encounters an unknown category in test
print(X_test_oe)

[array(['HS-grad', 'Bachelors', 'Masters', 'Doctorate'], dtype=object)]
[[1.]
 [2.]
 [1.]
 [3.]
 [0.]
 [2.]]


ValueError: Found unknown categories ['College'] in column 0 during transform

## Unordered categorical data: one-hot encoder

some categories cannot be ordered. e.g., workclass, relationship status

first feature: gender (male, female, unknown)

second feature: browser  used 

these categories cannot be ordered

In [17]:
train = {'gender':['Male','Female','Unknown','Male','Female','Female'],\
         'browser':['Safari','Safari','Internet Explorer','Chrome','Chrome','Internet Explorer']}
test = {'gender':['Female','Male','Unknown','Female'],'browser':['Chrome','Firefox','Internet Explorer','Safari']}

X_train = pd.DataFrame(train)
X_test = pd.DataFrame(test)

In [18]:
# How do we convert this to numerical features?
from sklearn.preprocessing import OneHotEncoder

help(OneHotEncoder)

Help on class OneHotEncoder in module sklearn.preprocessing._encoders:

class OneHotEncoder(_BaseEncoder)
 |  OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')
 |  
 |  Encode categorical features as a one-hot numeric array.
 |  
 |  The input to this transformer should be an array-like of integers or
 |  strings, denoting the values taken on by categorical (discrete) features.
 |  The features are encoded using a one-hot (aka 'one-of-K' or 'dummy')
 |  encoding scheme. This creates a binary column for each category and
 |  returns a sparse matrix or dense array (depending on the ``sparse``
 |  parameter)
 |  
 |  By default, the encoder derives the categories based on the unique values
 |  in each feature. Alternatively, you can also specify the `categories`
 |  manually.
 |  
 |  This encoding is needed for feeding categorical data to many scikit-learn
 |  estimators, notably linear models and SVMs with the standard ker

In [19]:
# initialize the encoder
enc = OneHotEncoder(sparse=False) # by default, OneHotEncoder returns a sparse matrix. sparse=False returns a 2D array
# fit the training data
enc.fit(X_train)
print('categories:',enc.categories_)
print('feature names:',enc.get_feature_names())
# transform X_train
X_train_ohe = enc.transform(X_train)
#print(X_train_ohe)
# do all of this in one step
X_train_ohe = enc.fit_transform(X_train)
#print(X_train_ohe)

# transform X_test
X_test_ohe = enc.transform(X_test)
print('X_test transformed')
print(X_test_ohe)

categories: [array(['Female', 'Male', 'Unknown'], dtype=object), array(['Chrome', 'Internet Explorer', 'Safari'], dtype=object)]
feature names: ['x0_Female' 'x0_Male' 'x0_Unknown' 'x1_Chrome' 'x1_Internet Explorer'
 'x1_Safari']


ValueError: Found unknown categories ['Firefox'] in column 1 during transform

## Continuous features: StandardScaler

In [20]:
train = {'salary':[50_000,75_000,40_000,1_000_000,30_000,250_000,35_000,45_000]}
test = {'salary':[25_000,55_000,1_500_000,60_000]}

X_train = pd.DataFrame(train)
X_test = pd.DataFrame(test)

In [21]:
from sklearn.preprocessing import StandardScaler
help(StandardScaler)

Help on class StandardScaler in module sklearn.preprocessing._data:

class StandardScaler(sklearn.base.TransformerMixin, sklearn.base.BaseEstimator)
 |  StandardScaler(*, copy=True, with_mean=True, with_std=True)
 |  
 |  Standardize features by removing the mean and scaling to unit variance
 |  
 |  The standard score of a sample `x` is calculated as:
 |  
 |      z = (x - u) / s
 |  
 |  where `u` is the mean of the training samples or zero if `with_mean=False`,
 |  and `s` is the standard deviation of the training samples or one if
 |  `with_std=False`.
 |  
 |  Centering and scaling happen independently on each feature by computing
 |  the relevant statistics on the samples in the training set. Mean and
 |  standard deviation are then stored to be used on later data using
 |  :meth:`transform`.
 |  
 |  Standardization of a dataset is a common requirement for many
 |  machine learning estimators: they might behave badly if the
 |  individual features do not more or less look like s

In [22]:
scaler = StandardScaler()
print(scaler.fit_transform(X_train))
print(scaler.transform(X_test))

[[-0.44873188]
 [-0.36895732]
 [-0.4806417 ]
 [ 2.58270127]
 [-0.51255153]
 [ 0.18946457]
 [-0.49659661]
 [-0.46468679]]
[[-0.52850644]
 [-0.43277697]
 [ 4.1781924 ]
 [-0.41682206]]


## How and when to do preprocessing in the ML pipeline?
- **SPLIT YOUR DATA FIRST!**
- **APPLY TRANSFORMER.FIT ONLY ON YOUR TRAINING DATA!** Then transform the validation and test sets.
- One of the most common mistake practitioners make is leaking statistics!
     - fit_transform is applied to the whole dataset, then the data is split into train/validation/test
         - this is wrong because the test set statistics impacts how the training and validation sets are transformed
         - but the test set must be separated by train and val, and val must be separated by train
     - or fit_transform is applied to the train, then fit_transform is applied to the validation set, and fit_transform is applied to the test set
         - this is wrong because the relative position of the points change
<center><img src="figures/no_separate_scaling.png" width="1200"></center>


## Scikit-learn's pipelines

- The steps in the ML pipleine can be chained together into a scikit-learn pipeline which consists of transformers and one final estimator which is usually your classifier or regression model.
- It neatly combines the preprocessing steps and it helps to avoid leaking statistics.

https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html


In [23]:
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split

np.random.seed(0)

df = pd.read_csv('data/adult_data.csv')

# let's separate the feature matrix X, and target variable y
y = df['gross-income'] # remember, we want to predict who earns more than 50k or less than 50k
X = df.loc[:, df.columns != 'gross-income'] # all other columns are features

random_state = 42

# first split to separate out the training set
X_train, X_other, y_train, y_other = train_test_split(X,y,train_size = 0.6,random_state=random_state)

# second split to separate out the validation and test sets
X_val, X_test, y_val, y_test = train_test_split(X_other,y_other,train_size = 0.5,random_state=random_state)


In [29]:
# collect which encoder to use on each feature
# needs to be done manually
ordinal_ftrs = ['education'] 
ordinal_cats = [[' Preschool',' 1st-4th',' 5th-6th',' 7th-8th',' 9th',' 10th',' 11th',' 12th',' HS-grad',\
                ' Some-college',' Assoc-voc',' Assoc-acdm',' Bachelors',' Masters',' Prof-school',' Doctorate']]
onehot_ftrs = ['workclass','marital-status','occupation','relationship','race','sex','native-country']
std_ftrs = ['capital-gain','capital-loss','age','hours-per-week']

# collect all the encoders
preprocessor = ColumnTransformer(
    transformers=[
        ('ord', OrdinalEncoder(categories = ordinal_cats), ordinal_ftrs),
        ('onehot', OneHotEncoder(sparse=False,handle_unknown='ignore'), onehot_ftrs),
        ('std', StandardScaler(), std_ftrs)])

# for now we only preprocess, later on we will add other steps here
# note the final scaler which is a standard scaler
# the ordinal and one hot encoded features do not have a mean of 0 and an std of 1
# the final scaler standardizes those features
clf = Pipeline(steps=[('preprocessor', preprocessor),('final scaler',StandardScaler())]) 

X_train_prep = clf.fit_transform(X_train)
X_val_prep = clf.transform(X_val)
X_test_prep = clf.transform(X_test)

print(X_train.shape)
print(X_train_prep.shape)

print(np.mean(X_train_prep,axis=0))
print(np.std(X_train_prep,axis=0))
print(np.mean(X_val_prep,axis=0))
print(np.std(X_val_prep,axis=0))
print(np.mean(X_test_prep,axis=0))
print(np.std(X_test_prep,axis=0))


(19536, 14)
(19536, 91)
[ 7.25395719e-16  7.34820908e-16  9.11623470e-16 -5.87390724e-17
  3.34518904e-16  1.04880160e-15  9.01044641e-16  1.06839644e-16
  4.91874377e-16 -3.95981179e-16  7.93386650e-16 -9.68342606e-16
  4.51465692e-16 -1.75428707e-16 -7.22463312e-16  3.37539397e-17
  1.93160962e-16 -8.81736785e-16 -1.07378957e-15 -1.00853427e-15
  8.40492704e-16  1.04219800e-16  5.15376257e-16  2.54946953e-15
 -1.59131967e-15  7.07187516e-16 -3.34244701e-16  5.09164782e-16
  1.24403115e-15  4.97617008e-16 -7.04658599e-17 -6.75902823e-17
  6.09110996e-16  1.25116043e-16  7.13211454e-16  8.29678032e-16
  3.64686896e-16 -1.15906522e-15 -9.56086948e-17 -2.50416782e-16
  3.05817115e-16  2.09246579e-16 -8.60286453e-16  4.12548783e-16
 -4.12548783e-16  1.94979225e-15 -1.22182037e-15  4.58788185e-16
  1.63516583e-15 -1.81151660e-15  6.48220415e-16  9.90599563e-16
 -2.07177627e-16 -1.02138984e-15 -9.51633638e-16  9.36854036e-16
  5.97379946e-16 -2.73804719e-15  6.95594278e-18 -8.41395584e-16
 