# Feature Engineering

+ Feature engineering transforms the data so that its information content is easily exposed to the model.
+ This statement can mean many things and depends heavily on what exactly "the model" is.
+ As we have seen, we combine many tools to manipulate data. Thus far, we have encountered pandas, Dask, and sklearn in this course, but there are many more (PySpark, SQL, DAX, M, R, etc.).
+ It is important to discuss which tools are the right ones, specifically in the context of data leakage.

## Transform using Pandas, Dask, SQL, or Scikit-Learn?

+ Most joins and filtering should be done close to the source, such as a database, Spark, or Databricks.
+ Use data manipulation tools like Pandas, Dask, or PySpark:
    * Rename columns.
    * Column transforms that do not require sampling.
    * Time-series manipulation such as adding lags and contemporaneous features.
    * Parallel computation.
+ Use ML pipelines with sklearn or PyTorch:
    * Add features that are sample-dependent like scaling and normalization, one-hot encoding, tokenization, and vectorization.
    * Model-dependent transformations: PCA, embeddings, iterative/knn imputation, etc.

+ Decisions must be guided by optimization criteria (time and resources) while avoiding data leakage.
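
To make the data leakage concern concrete, the sketch below contrasts a scaler fit on all rows with a scaler fit inside a pipeline. The synthetic data and all variable names are assumptions for illustration only; they are not part of the course code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, skewed data (an assumption made for this illustration).
rng = np.random.default_rng(0)
X = rng.lognormal(size=(500, 3))
y = (X[:, 0] + rng.normal(size=500) > 1.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Leaky: the scaling statistics are computed on all rows, test rows included.
leaky_scaler = StandardScaler().fit(X)

# Safe: the scaler is a pipeline step, so fit() only sees the training rows.
pipe = Pipeline([
    ('standardizer', StandardScaler()),
    ('model', LogisticRegression()),
])
pipe.fit(X_train, y_train)
safe_scaler = pipe.named_steps['standardizer']

print("Means learned from all rows:     ", leaky_scaler.mean_)
print("Means learned from training rows:", safe_scaler.mean_)
print("Test accuracy:", pipe.score(X_test, y_test))
```

Because the scaler lives inside the pipeline, `cross_validate` refits it on each training fold, so no statistic from the held-out fold reaches the model.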

## Example Transforms in sklearn

The list below is from [scikit-learn's documentation](https://scikit-learn.org/stable/modules/preprocessing.html), which also includes convenience interfaces for the classes listed below.

Work with categorical variables:

+ `preprocessing.Binarizer(*[, threshold, copy])`: Binarize data (set feature values to 0 or 1) according to a threshold.
+ `preprocessing.KBinsDiscretizer([n_bins, ...])`:  Bin continuous data into intervals.
+ `preprocessing.LabelBinarizer(*[, neg_label, ...])`: Binarize labels in a one-vs-all fashion.
+ `preprocessing.LabelEncoder()`: Encode target labels with value between 0 and n_classes-1.
+ `preprocessing.MultiLabelBinarizer(*[, ...])`:  Transform between iterable of iterables and a multilabel format.
+ `preprocessing.OneHotEncoder(*[, categories, ...])`: Encode categorical features as a one-hot numeric array.
+ `preprocessing.OrdinalEncoder(*[, ...])`: Encode categorical features as an integer array.

Scale and normalize:

+ `preprocessing.StandardScaler(*[, copy, ...])`: Standardize features by removing the mean and scaling to unit variance.
+ `preprocessing.MaxAbsScaler(*[, copy])`: Scale each feature by its maximum absolute value.
+ `preprocessing.MinMaxScaler([feature_range, ...])`: Transform features by scaling each feature to a given range.
+ `preprocessing.Normalizer([norm, copy])`:  Normalize samples individually to unit norm.
+ `preprocessing.RobustScaler(*[, ...])`: Scale features using statistics that are robust to outliers.


Nonlinear transforms:

+ `preprocessing.FunctionTransformer([func, ...])`: Constructs a transformer from an arbitrary callable.
+ `preprocessing.KernelCenterer()`: Center an arbitrary kernel matrix.
+ `preprocessing.PolynomialFeatures([degree, ...])`: Generate polynomial and interaction features.
+ `preprocessing.PowerTransformer([method, ...])`: Apply a power transform featurewise to make data more Gaussian-like.
+ `preprocessing.QuantileTransformer(*[, ...])`: Transform features using quantiles information.
+ `preprocessing.SplineTransformer([n_knots, ...])`: Generate univariate B-spline bases for features.
+ `preprocessing.TargetEncoder([categories, ...])`: Target Encoder for regression and classification targets.

## What are we doing?

<div>
<img src="./images/04_column_transform_1.png" width="75%">
</div>

### The Objectives

Build a pipeline that:

+ Adds indicators:

    - A subject-matter expert (SME) indicated that a debt ratio above 100% is too high.
    - Missing-value indicators for `monthly_income` and `num_dependents`.

+ Imputes missing values, where required.
+ Standardizes variables.
+ Evaluates whether a transform (Yeo-Johnson or Box-Cox) of selected variables (`debt_ratio`, `monthly_income`, and `revolving_unsecured_line_utilization`) is beneficial.
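
One possible way to express the indicator and imputation objectives is sketched below. It is not the course's solution: the toy data, the helper `high_debt_flag`, and the step names are assumptions, and the threshold of 1.0 assumes that the debt ratio is stored as a fraction.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy data (hypothetical values) using the renamed columns.
toy = pd.DataFrame({
    'debt_ratio': [0.3, 1.5, 0.8, 2.0],
    'monthly_income': [5000.0, np.nan, 3200.0, 4100.0],
    'num_dependents': [0.0, 2.0, np.nan, 1.0],
})


def high_debt_flag(X):
    """Return 1.0 where the debt ratio exceeds 1.0 (that is, 100%), else 0.0."""
    return (np.asarray(X) > 1.0).astype(float)


# Impute with the median, then standardize.
impute_and_scale = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('standardizer', StandardScaler()),
])

ct = ColumnTransformer([
    ('high_debt', FunctionTransformer(high_debt_flag), ['debt_ratio']),
    ('missing', MissingIndicator(features='all'),
     ['monthly_income', 'num_dependents']),
    ('numeric', impute_and_scale,
     ['debt_ratio', 'monthly_income', 'num_dependents']),
])

# Columns: high-debt flag, two missing-value flags, three scaled numerics.
features = ct.fit_transform(toy)
print(features)
```

Keeping the indicators in their own branches leaves them as 0/1 flags instead of passing them through the scaler.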

Feature selection:

+ We are looking for informative features: their contribution to prediction is valuable.
+ We prefer parsimonious models.
+ We want to retain evidence of our work and ensure reproducibility.

# Data Source

+ For this example, we will use [Give Me Some Credit from Kaggle](https://www.kaggle.com/c/GiveMeSomeCredit/data), a widely referenced example.
+ To run the examples below, download the data set and extract `cs-training.csv` to `../05_src/data/credit/`.
 

## Our data




In [5]:
# Load environment variables
%load_ext dotenv
%dotenv 
%run update_path.py

import os

# Standard libraries
import pandas as pd
import numpy as np


# Load data
ft_file = os.getenv("CREDIT_DATA")
df_raw = pd.read_csv(ft_file)



In [6]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 12 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   Unnamed: 0                            150000 non-null  int64  
 1   SeriousDlqin2yrs                      150000 non-null  int64  
 2   RevolvingUtilizationOfUnsecuredLines  150000 non-null  float64
 3   age                                   150000 non-null  int64  
 4   NumberOfTime30-59DaysPastDueNotWorse  150000 non-null  int64  
 5   DebtRatio                             150000 non-null  float64
 6   MonthlyIncome                         120269 non-null  float64
 7   NumberOfOpenCreditLinesAndLoans       150000 non-null  int64  
 8   NumberOfTimes90DaysLate               150000 non-null  int64  
 9   NumberRealEstateLoansOrLines          150000 non-null  int64  
 10  NumberOfTime60-89DaysPastDueNotWorse  150000 non-null  int64  
 11  

In [7]:
df = df_raw.drop(columns = ["Unnamed: 0"]).rename(
    columns = {
        'SeriousDlqin2yrs': 'delinquency',
        'RevolvingUtilizationOfUnsecuredLines': 'revolving_unsecured_line_utilization', 
        'age': 'age',
        'NumberOfTime30-59DaysPastDueNotWorse': 'num_30_59_days_late', 
        'DebtRatio': 'debt_ratio', 
        'MonthlyIncome': 'monthly_income',
        'NumberOfOpenCreditLinesAndLoans': 'num_open_credit_loans', 
        'NumberOfTimes90DaysLate':  'num_90_days_late',
        'NumberRealEstateLoansOrLines': 'num_real_estate_loans', 
        'NumberOfTime60-89DaysPastDueNotWorse': 'num_60_89_days_late',
        'NumberOfDependents': 'num_dependents'
    }
)

## Manual Solution

To get some insights into the task, first approach it manually.

In [8]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.linear_model import LogisticRegression

num_cols = ['revolving_unsecured_line_utilization', 'age',
       'num_30_59_days_late', 'debt_ratio', 'monthly_income',
       'num_open_credit_loans', 'num_90_days_late', 'num_real_estate_loans',
       'num_60_89_days_late', 'num_dependents'
       ]

pipe_num_simple = Pipeline([
    ('imputer', SimpleImputer(strategy = 'median')),
    ('standardizer', StandardScaler())
])

ctransform_simple= ColumnTransformer([
    ('numeric_simple', pipe_num_simple, num_cols),
], remainder='passthrough')

pipe_simple = Pipeline([
    ('preprocess', ctransform_simple),
    ('model', LogisticRegression())
])
pipe_simple


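Parameters shown in the estimator display of `pipe_simple`:

| Estimator | Parameter | Value |
|---|---|---|
| `Pipeline` | `steps` | `[('preprocess', ...), ('model', ...)]` |
| `Pipeline` | `transform_input` | |
| `Pipeline` | `memory` | |
| `Pipeline` | `verbose` | `False` |
| `ColumnTransformer` | `transformers` | `[('numeric_simple', ...)]` |
| `ColumnTransformer` | `remainder` | `'passthrough'` |
| `ColumnTransformer` | `sparse_threshold` | `0.3` |
| `ColumnTransformer` | `n_jobs` | |
| `ColumnTransformer` | `transformer_weights` | |
| `ColumnTransformer` | `verbose` | `False` |
| `ColumnTransformer` | `verbose_feature_names_out` | `True` |
| `ColumnTransformer` | `force_int_remainder_cols` | `'deprecated'` |
| `SimpleImputer` | `missing_values` | |
| `SimpleImputer` | `strategy` | `'median'` |
| `SimpleImputer` | `fill_value` | |
| `SimpleImputer` | `copy` | `True` |
| `SimpleImputer` | `add_indicator` | `False` |
| `SimpleImputer` | `keep_empty_features` | `False` |
| `StandardScaler` | `copy` | `True` |
| `StandardScaler` | `with_mean` | `True` |
| `StandardScaler` | `with_std` | `True` |
| `LogisticRegression` | `penalty` | `'l2'` |
| `LogisticRegression` | `dual` | `False` |
| `LogisticRegression` | `tol` | `0.0001` |
| `LogisticRegression` | `C` | `1.0` |
| `LogisticRegression` | `fit_intercept` | `True` |
| `LogisticRegression` | `intercept_scaling` | `1` |
| `LogisticRegression` | `class_weight` | |
| `LogisticRegression` | `random_state` | |
| `LogisticRegression` | `solver` | `'lbfgs'` |
| `LogisticRegression` | `max_iter` | `100` |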


## Cross-Validation of Simple Pipeline

In [9]:
X = df.drop(columns = 'delinquency')
Y = df['delinquency']

scoring = ['neg_log_loss', 'roc_auc', 'f1', 'accuracy', 'precision', 'recall']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42)



In [10]:
res_simple_dict = cross_validate(pipe_simple, X_train, Y_train, cv = 5, scoring = scoring)
res_simple = pd.DataFrame(res_simple_dict).assign(experiment = 1)
res_simple


| fold | fit_time | score_time | test_neg_log_loss | test_roc_auc | test_f1 | test_accuracy | test_precision | test_recall | experiment |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.131551 | 0.015829 | -0.225973 | 0.697979 | 0.072093 | 0.9335 | 0.584906 | 0.038414 | 1 |
| 1 | 0.105803 | 0.014266 | -0.2248 | 0.697751 | 0.095726 | 0.933875 | 0.595745 | 0.052045 | 1 |
| 2 | 0.1016 | 0.01404 | -0.226357 | 0.693513 | 0.087256 | 0.93375 | 0.59375 | 0.047088 | 1 |
| 3 | 0.097206 | 0.014523 | -0.225342 | 0.707628 | 0.082664 | 0.933417 | 0.5625 | 0.04461 | 1 |
| 4 | 0.101484 | 0.014205 | -0.2254 | 0.699981 | 0.088151 | 0.933625 | 0.578947 | 0.047708 | 1 |


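On average, we obtain a log-loss of about 0.23. Note that scikit-learn reports the negated value (`test_neg_log_loss`), so that a higher score is always better.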

In [11]:
res_simple.mean()

fit_time             0.107529
score_time           0.014573
test_neg_log_loss   -0.225574
test_roc_auc         0.699370
test_f1              0.085178
test_accuracy        0.933633
test_precision       0.583170
test_recall          0.045973
experiment           1.000000
dtype: float64

## Alternative Pipeline

- The pipeline below is more complex.
- Treat selected numeric variables with a [Yeo-Johnson transformation](https://feature-engine.trainindata.com/en/latest/user_guide/transformation/YeoJohnsonTransformer.html).
- Treat the other numeric variables with imputation and scaling only.

In [12]:
num_cols = ['age',
       'num_30_59_days_late', 'num_open_credit_loans', 'num_90_days_late', 'num_real_estate_loans',
       'num_60_89_days_late', 'num_dependents', 
       ]

num_cols_transform = ['revolving_unsecured_line_utilization', 'debt_ratio', 'monthly_income',]

pipe_num_simple = Pipeline([
    ('imputer', SimpleImputer(strategy = 'median')),
    ('standardizer', StandardScaler())
])

pipe_num_yj = Pipeline([
    ('imputer', SimpleImputer(strategy = 'median')),
    ('standardizer', StandardScaler()),
    ('transform', PowerTransformer(method='yeo-johnson'))
])

# Apply each numeric pipeline to its own group of columns
ctransform_yj = ColumnTransformer([
    ('numeric_std', pipe_num_simple, num_cols),
    ('numeric_yj', pipe_num_yj, num_cols_transform),
], remainder='passthrough')

pipe_yj = Pipeline([
    ('preprocess', ctransform_yj),
    ('clf', LogisticRegression())
])
pipe_yj

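Parameters shown in the estimator display of `pipe_yj`:

| Estimator | Parameter | Value |
|---|---|---|
| `Pipeline` | `steps` | `[('preprocess', ...), ('clf', ...)]` |
| `Pipeline` | `transform_input` | |
| `Pipeline` | `memory` | |
| `Pipeline` | `verbose` | `False` |
| `ColumnTransformer` | `transformers` | `[('numeric_std', ...), ('numeric_yj', ...)]` |
| `ColumnTransformer` | `remainder` | `'passthrough'` |
| `ColumnTransformer` | `sparse_threshold` | `0.3` |
| `ColumnTransformer` | `n_jobs` | |
| `ColumnTransformer` | `transformer_weights` | |
| `ColumnTransformer` | `verbose` | `False` |
| `ColumnTransformer` | `verbose_feature_names_out` | `True` |
| `ColumnTransformer` | `force_int_remainder_cols` | `'deprecated'` |
| `SimpleImputer` (`numeric_std`) | `missing_values` | |
| `SimpleImputer` (`numeric_std`) | `strategy` | `'median'` |
| `SimpleImputer` (`numeric_std`) | `fill_value` | |
| `SimpleImputer` (`numeric_std`) | `copy` | `True` |
| `SimpleImputer` (`numeric_std`) | `add_indicator` | `False` |
| `SimpleImputer` (`numeric_std`) | `keep_empty_features` | `False` |
| `StandardScaler` (`numeric_std`) | `copy` | `True` |
| `StandardScaler` (`numeric_std`) | `with_mean` | `True` |
| `StandardScaler` (`numeric_std`) | `with_std` | `True` |
| `SimpleImputer` (`numeric_yj`) | `missing_values` | |
| `SimpleImputer` (`numeric_yj`) | `strategy` | `'median'` |
| `SimpleImputer` (`numeric_yj`) | `fill_value` | |
| `SimpleImputer` (`numeric_yj`) | `copy` | `True` |
| `SimpleImputer` (`numeric_yj`) | `add_indicator` | `False` |
| `SimpleImputer` (`numeric_yj`) | `keep_empty_features` | `False` |
| `StandardScaler` (`numeric_yj`) | `copy` | `True` |
| `StandardScaler` (`numeric_yj`) | `with_mean` | `True` |
| `StandardScaler` (`numeric_yj`) | `with_std` | `True` |
| `PowerTransformer` (`numeric_yj`) | `method` | `'yeo-johnson'` |
| `PowerTransformer` (`numeric_yj`) | `standardize` | `True` |
| `PowerTransformer` (`numeric_yj`) | `copy` | `True` |
| `LogisticRegression` | `penalty` | `'l2'` |
| `LogisticRegression` | `dual` | `False` |
| `LogisticRegression` | `tol` | `0.0001` |
| `LogisticRegression` | `C` | `1.0` |
| `LogisticRegression` | `fit_intercept` | `True` |
| `LogisticRegression` | `intercept_scaling` | `1` |
| `LogisticRegression` | `class_weight` | |
| `LogisticRegression` | `random_state` | |
| `LogisticRegression` | `solver` | `'lbfgs'` |
| `LogisticRegression` | `max_iter` | `100` |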


In [13]:
res_yj_dict = cross_validate(pipe_yj, X_train, Y_train, cv = 5, scoring = scoring)
res_yj = pd.DataFrame(res_yj_dict).assign(experiment = 2)
res_yj

| fold | fit_time | score_time | test_neg_log_loss | test_roc_auc | test_f1 | test_accuracy | test_precision | test_recall | experiment |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.269703 | 0.019382 | -0.212495 | 0.794176 | 0.082161 | 0.932042 | 0.447853 | 0.045229 | 2 |
| 1 | 0.229449 | 0.01954 | -0.216061 | 0.780469 | 0.109695 | 0.933042 | 0.518325 | 0.061338 | 2 |
| 2 | 0.236719 | 0.019794 | -0.216864 | 0.777062 | 0.096089 | 0.932583 | 0.488636 | 0.053284 | 2 |
| 3 | 0.246401 | 0.018452 | -0.215289 | 0.787657 | 0.093126 | 0.931833 | 0.442105 | 0.052045 | 2 |
| 4 | 0.221241 | 0.018458 | -0.215241 | 0.783011 | 0.097424 | 0.932833 | 0.505814 | 0.053903 | 2 |


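We obtain an average log-loss of about 0.22, slightly lower than the 0.23 of the simple pipeline, and the ROC AUC rises from about 0.70 to about 0.78. The Yeo-Johnson transform therefore improves these two metrics, although precision drops from about 0.58 to about 0.48.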

In [14]:
res_yj.mean()

fit_time             0.240703
score_time           0.019125
test_neg_log_loss   -0.215190
test_roc_auc         0.784475
test_f1              0.095699
test_accuracy        0.932467
test_precision       0.480547
test_recall          0.053160
experiment           2.000000
dtype: float64
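
To compare experiments side by side, the per-fold results can be stacked and averaged by experiment. The sketch below is self-contained, so it uses synthetic data and two small pipelines (both are assumptions for illustration). In the notebook, the same `concat` and `groupby` pattern could be applied to `res_simple` and `res_yj`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Synthetic skewed data (an assumption; not the credit data set).
rng = np.random.default_rng(42)
X_demo = rng.lognormal(mean=0.0, sigma=1.5, size=(1000, 3))
y_demo = (np.log(X_demo[:, 0]) + rng.normal(size=1000) > 0).astype(int)

# Two candidate pipelines, keyed by experiment number.
pipelines = {
    1: Pipeline([
        ('standardizer', StandardScaler()),
        ('model', LogisticRegression()),
    ]),
    2: Pipeline([
        ('transform', PowerTransformer(method='yeo-johnson')),
        ('model', LogisticRegression()),
    ]),
}

scoring = ['neg_log_loss', 'roc_auc']

# One row per fold and experiment.
results = pd.concat(
    [
        pd.DataFrame(
            cross_validate(pipe, X_demo, y_demo, cv=5, scoring=scoring)
        ).assign(experiment=label)
        for label, pipe in pipelines.items()
    ],
    ignore_index=True,
)

# Average the metrics over folds for each experiment.
summary = results.groupby('experiment')[
    ['test_neg_log_loss', 'test_roc_auc']
].mean()
print(summary)
```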

# Reflection

+ We are currently evaluating two feature engineering procedures using the same classifier. 

    - However, feature engineering is classifier-dependent: each classifier is a specialized tool to learn a certain type of hypothesis. 
    - Different classifiers will benefit from different types of engineered features (see, for example, [Kuhn and Silge's recommendations on TMWR.org](https://www.tmwr.org/pre-proc-table)).

+ We are producing data from our experiments.

    - The data that we produced is more or less structured: we are using standard performance metrics, for instance.
    - Each preprocessing pipeline will be different and may accept different configuration parameters.
    - Likewise, classifiers will tend to have different configuration parameters. 
    
+ We modify code to produce experiments:

    - Our experiment results will be a function of our algorithm's logic, its implementation (code), and our data.
    - Code tracking is done with Git.
    - Data tracking is in development.
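
Before adopting a dedicated tool, the idea can be sketched with the standard library alone: each run appends one record holding the code version, a fingerprint of the data, the parameters, and the metrics. The function names, the file layout, and the parameter and metric values below are hypothetical.

```python
import hashlib
import json
import subprocess
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b''):
            digest.update(chunk)
    return digest.hexdigest()


def current_git_commit():
    """Return the current Git commit hash, or None if it cannot be read."""
    try:
        out = subprocess.run(
            ['git', 'rev-parse', 'HEAD'],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def record_experiment(log_path, data_path, params, metrics):
    """Append one experiment record to a JSON Lines file and return it."""
    record = {
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'git_commit': current_git_commit(),
        'data_sha256': file_sha256(data_path),
        'params': params,
        'metrics': metrics,
    }
    with open(log_path, 'a', encoding='utf-8') as handle:
        handle.write(json.dumps(record) + '\n')
    return record


# Usage with a throwaway data file and placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / 'toy.csv'
    data_file.write_text('x,y\n1,0\n2,1\n', encoding='utf-8')
    log_file = Path(tmp) / 'experiments.jsonl'
    record = record_experiment(
        log_file,
        data_file,
        params={'imputer_strategy': 'median'},
        metrics={'neg_log_loss': -0.5},  # placeholder, not a real result
    )
    logged_lines = log_file.read_text(encoding='utf-8').splitlines()

print(record)
```

A hand-rolled log like this becomes hard to query and share as experiments multiply, which is one reason to move to dedicated tracking software.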

**It is generally a good idea to use software for experiment tracking once you move out of the Proof of Concept stage.** Some solutions include:

- [MLflow](https://mlflow.org/).
- [Weights & Biases](https://wandb.ai/site).
- [Sacred](https://sacred.readthedocs.io/en/stable/).

# MLflow

+ MLflow is a software tool that automates tasks related to experiment tracking:

    - Keep track of experiment parameters.
    - Save configurations for individual experiment runs in files or databases.
    - Store models and other artifacts in an object store.

+ A few features that may be useful:

    - Keep track of code and artifacts associated with the experiment.
    - Store experiment run times and system characteristics.
    - Work with different [backend stores](https://mlflow.org/docs/latest/tracking/backend-stores).

## Our First Experiment

Continuing with our example, the following setup will track an experiment to measure the performance of a model pipeline. The main file for this experiment is `./05_src/credit/exp__logistic_simple.py`. You can run this experiment from the `05_src/` folder using `python -m credit.exp__logistic_simple`.
        

After running the experiment, take a look at MLflow by navigating to [http://localhost:5001](http://localhost:5001).
