# Feature Engineering

+ Feature engineering transforms data so that its information content is readily exposed to the model.
+ What this means in practice depends heavily on what exactly "the model" is.
+ As we have seen, we combine many tools to manipulate data. Thus far in this course we have encountered pandas, Dask, and sklearn, but there are many more (PySpark, SQL, DAX, M, R, etc.).
+ It is important to discuss which tools are the right ones for each task, specifically in the context of data leakage.

## Transform using pandas/Dask/SQL or sklearn?

+ Depending on the perspective, the answer could be neither, pandas, or sklearn:

    - Neither:
        * Most joins and filters should be done closer to the source, using a database or a parquet/Dask operation.
        * Map-Reduce and Group-by-Aggregate ("data warehousing") operations.
        * Indexing and reshuffling.
    - Pandas, Dask, or PySpark:
        * Renaming tasks.
        * Use Python libraries like pandas, Dask, or PySpark to add contemporaneous features, manipulate time series (for example, adding lags), and run parallel computations (using Dask or PySpark).
        * Do not use these libraries for sample-dependent features, that is, features whose parameters are estimated from the data sample.
    - sklearn or PyTorch:
        * Use Python libraries like sklearn or PyTorch to add sample-dependent features, such as scaling and normalization, one-hot encoding, tokenization, and vectorization.
        * Model-dependent transformations: PCA, embeddings, iterative/kNN imputation, etc.
+ Decisions must be guided by optimization criteria (time and resources) while avoiding data leakage.
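
The distinction matters because a sample-dependent transform learns its parameters from the rows it is fitted on. The following is a minimal sketch, on synthetic data with hypothetical step names, of how placing a scaler inside an sklearn `Pipeline` keeps test rows out of the learned statistics:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only (not the course data set).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = (X[:, 0] + rng.normal(size=200) > 5.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler is a pipeline step, so fit() learns its mean and scale
# from the training rows only.
pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression()),
    ]
)
pipe.fit(X_train, y_train)

# The learned mean matches the training mean, not the full-sample mean.
learned_mean = pipe.named_steps["scaler"].mean_
train_mean = X_train.mean(axis=0)
full_mean = X.mean(axis=0)

# Test rows are transformed with the training statistics.
test_probabilities = pipe.predict_proba(X_test)
```

Standardizing the whole table in pandas before splitting would instead use `full_mean`, which lets information from the test rows leak into training.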

## Example Transforms in sklearn

The list below is taken from [scikit-learn's documentation](https://scikit-learn.org/stable/modules/preprocessing.html), which also describes convenience interfaces for these classes.

Work with categorical variables:

+ `preprocessing.Binarizer(*[, threshold, copy])`: Binarize data (set feature values to 0 or 1) according to a threshold.
+ `preprocessing.KBinsDiscretizer([n_bins, ...])`:  Bin continuous data into intervals.
+ `preprocessing.LabelBinarizer(*[, neg_label, ...])`: Binarize labels in a one-vs-all fashion.
+ `preprocessing.LabelEncoder()`: Encode target labels with value between 0 and n_classes-1.
+ `preprocessing.MultiLabelBinarizer(*[, ...])`:  Transform between iterable of iterables and a multilabel format.
+ `preprocessing.OneHotEncoder(*[, categories, ...])`: Encode categorical features as a one-hot numeric array.
+ `preprocessing.OrdinalEncoder(*[, ...])`: Encode categorical features as an integer array.
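
As a brief illustration of the encoders above, the sketch below fits a `OneHotEncoder` on a hypothetical `region` column (not part of the course data) and then transforms a row containing a category that was not seen during fitting:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column, for illustration only.
train = pd.DataFrame({"region": ["north", "south", "south", "east"]})
new = pd.DataFrame({"region": ["south", "west"]})  # "west" was never seen

# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Categories are learned from the training sample and sorted.
categories = list(encoder.categories_[0])

# The default output is sparse; convert to a dense array for inspection.
encoded_new = encoder.transform(new).toarray()
```

The learned categories are a sample-dependent parameter, which is why the encoder should be fitted on training data only.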

Scale and normalize:

+ `preprocessing.StandardScaler(*[, copy, ...])`: Standardize features by removing the mean and scaling to unit variance.
+ `preprocessing.MaxAbsScaler(*[, copy])`: Scale each feature by its maximum absolute value.
+ `preprocessing.MinMaxScaler([feature_range, ...])`: Transform features by scaling each feature to a given range.
+ `preprocessing.Normalizer([norm, copy])`:  Normalize samples individually to unit norm.
+ `preprocessing.RobustScaler(*[, ...])`: Scale features using statistics that are robust to outliers.
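
One point that is easy to miss in the list above is that `StandardScaler` operates on columns (features), whereas `Normalizer` operates on rows (samples). A small sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Made-up values, for illustration only.
X = np.array([[3.0, 4.0], [6.0, 8.0], [0.0, 5.0]])

# StandardScaler works per column (feature): zero mean, unit variance.
X_standardized = StandardScaler().fit_transform(X)

# Normalizer works per row (sample): each row is rescaled to unit L2 norm.
X_normalized = Normalizer(norm="l2").fit_transform(X)
```

Because `Normalizer` uses only the values within each row, it learns nothing from the sample, while `StandardScaler` stores a mean and a scale per column.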


Nonlinear transforms:

+ `preprocessing.FunctionTransformer([func, ...])`: Constructs a transformer from an arbitrary callable.
+ `preprocessing.KernelCenterer()`: Center an arbitrary kernel matrix.
+ `preprocessing.PolynomialFeatures([degree, ...])`: Generate polynomial and interaction features.
+ `preprocessing.PowerTransformer([method, ...])`: Apply a power transform featurewise to make data more Gaussian-like.
+ `preprocessing.QuantileTransformer(*[, ...])`: Transform features using quantiles information.
+ `preprocessing.SplineTransformer([n_knots, ...])`: Generate univariate B-spline bases for features.
+ `preprocessing.TargetEncoder([categories, ...])`: Target Encoder for regression and classification targets.
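
For example, `FunctionTransformer` can wrap an ordinary NumPy function so that it can be used as a pipeline step. The sketch below uses made-up values and chooses `np.log1p` purely for illustration:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Made-up right-skewed values, for illustration only.
X = np.array([[0.0], [9.0], [99.0], [999.0]])

# Wrap a NumPy function as a transformer; inverse_func enables
# inverse_transform.
log_transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

X_log = log_transformer.fit_transform(X)
X_back = log_transformer.inverse_transform(X_log)
```

This transformer is stateless: `fit` learns no parameters, so it carries no risk of leakage by itself.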


## What are we doing?

<div>
<img src="./images/04_column_transform_1.png" width="75%">
</div>

### The Objectives

Build a pipeline that:

+ Adds indicators:

    - A subject-matter expert (SME) indicated that a debt ratio (`debt_ratio`) above 100% is too high.
    - Missing-value indicators for `monthly_income` and `num_dependents`.

+ Imputes missing values, where required.
+ Standardizes variables.
+ Evaluates whether a transform (Yeo-Johnson or Box-Cox) of selected variables (`debt_ratio`, `monthly_income`, and `revolving_unsecured_line_utilization`) is beneficial.
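
One possible way to express the first three objectives is with a `ColumnTransformer`. The sketch below is not the course's implementation: the data frame is a small synthetic stand-in, `high_debt_ratio` is a hypothetical helper, and the threshold of 1.0 assumes the debt ratio is stored as a fraction rather than a percentage.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic stand-in for the credit data; values are illustrative only.
df = pd.DataFrame(
    {
        "debt_ratio": [0.2, 1.5, 0.8, 3.0, 0.1, 0.6],
        "monthly_income": [5000.0, np.nan, 3200.0, 4100.0, np.nan, 7000.0],
        "num_dependents": [0.0, 2.0, np.nan, 1.0, 0.0, 3.0],
    }
)


def high_debt_ratio(X):
    """Flag rows where the debt ratio exceeds 100% (assumed threshold: 1.0)."""
    return (np.asarray(X) > 1.0).astype(float)


numeric_cols = ["debt_ratio", "monthly_income", "num_dependents"]
missing_cols = ["monthly_income", "num_dependents"]

preprocessor = ColumnTransformer(
    transformers=[
        # Indicator suggested by the SME.
        ("high_debt", FunctionTransformer(high_debt_ratio), ["debt_ratio"]),
        # One missing-value flag per listed column.
        ("missing", MissingIndicator(features="all"), missing_cols),
        # Median imputation followed by standardization.
        (
            "numeric",
            Pipeline(
                steps=[
                    ("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler()),
                ]
            ),
            numeric_cols,
        ),
    ]
)

# Output columns: 1 debt indicator, 2 missing flags, 3 scaled numerics.
features = preprocessor.fit_transform(df)
```

Keeping the indicators in separate branches means they are passed to the model as 0/1 flags and are not standardized along with the numeric columns.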

Feature selection:

+ We are looking for informative features, that is, features whose contribution to prediction is valuable.
+ We prefer parsimonious models.
+ We want to retain evidence of our work and support reproducibility.

# Data Source

+ For this example, we will use [Give Me Some Credit from Kaggle](https://www.kaggle.com/c/GiveMeSomeCredit/data), a widely referenced example.
+ To run the examples below, download the data set and extract `cs-training.csv` to `../05_src/data/credit/`.
 

## Our data




## Manual Solution

+ To get deeper insights into the task, first approach it manually.

## Cross-validation of simple pipeline

On average, we obtain a log-loss of about 0.362.
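
A cross-validated log-loss of this kind could be computed along the following lines. This is a sketch on synthetic data with a hypothetical three-step pipeline, so it will not reproduce the figure quoted above:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration; scores say nothing about the credit data.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=300) > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan  # inject some missing values

pipe = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("clf", LogisticRegression()),
    ]
)

# sklearn maximizes scores, so log-loss is exposed as its negative.
cv_results = cross_validate(pipe, X, y, cv=5, scoring="neg_log_loss")
mean_log_loss = -cv_results["test_score"].mean()
```

Because the whole pipeline is passed to `cross_validate`, the imputer and scaler are refitted on the training portion of every fold.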

## Alternative Pipeline

+ The pipeline below is more complex:

    - Treat selected numericals using [Yeo-Johnson transformation](https://feature-engine.trainindata.com/en/latest/user_guide/transformation/YeoJohnsonTransformer.html).
    - Treat other numericals with scaling only.
    - Do not treat booleans.

We obtained a higher log-loss of 0.443 (compared with 0.362 for the simple pipeline), so the additional transformation is not beneficial.
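
A pipeline with this structure might look like the sketch below. Note the substitution: it uses sklearn's `PowerTransformer(method="yeo-johnson")` rather than the feature-engine `YeoJohnsonTransformer` linked above. The data are synthetic, and the column groupings (including `age` and `high_debt_ratio`) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Synthetic stand-in for the credit data; values are illustrative only.
rng = np.random.default_rng(7)
n = 200
df = pd.DataFrame(
    {
        "debt_ratio": rng.lognormal(mean=-1.0, sigma=1.0, size=n),
        "monthly_income": rng.lognormal(mean=8.0, sigma=0.5, size=n),
        "age": rng.integers(21, 80, size=n).astype(float),
        "high_debt_ratio": rng.integers(0, 2, size=n).astype(bool),
    }
)
# Made-up target so that the pipeline can be fitted.
y = (df["debt_ratio"] > df["debt_ratio"].median()).astype(int)

skewed_cols = ["debt_ratio", "monthly_income"]  # assumed grouping
other_numeric_cols = ["age"]                    # assumed grouping
bool_cols = ["high_debt_ratio"]                 # assumed grouping

preprocessor = ColumnTransformer(
    transformers=[
        # Yeo-Johnson; standardize=True (the default) also rescales the output.
        ("yeo_johnson", PowerTransformer(method="yeo-johnson"), skewed_cols),
        # Other numerical features: scaling only.
        ("scale", StandardScaler(), other_numeric_cols),
        # Booleans are left untouched.
        ("bool", "passthrough", bool_cols),
    ]
)

pipe = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        ("clf", LogisticRegression()),
    ]
)
pipe.fit(df, y)

transformed = pipe.named_steps["preprocess"].transform(df)
probabilities = pipe.predict_proba(df)
```

To compare it with a simpler pipeline, both should be scored with the same cross-validation folds and the same metric.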

# Reflection

+ We are currently evaluating two feature engineering procedures using the same classifier. 

    - However, feature engineering is classifier-dependent: each classifier is a specialized tool to learn a certain type of hypothesis. 
    - Different classifiers will benefit from different types of engineered features (see, for example, [Kuhn and Silge's recommendations on TMWR.org](https://www.tmwr.org/pre-proc-table)).

+ We are producing data from our experiments.

    - The data that we produced is more or less structured: we are using standard performance metrics, for instance.
    - Each preprocessing pipeline will be different and may accept different configuration parameters.
    - Likewise, classifiers will tend to have different configuration parameters. 
    
+ We modify code to produce experiments:

    - Our experiment results will be a function of our algorithm's logic, its implementation (code), and our data.
    - Code tracking is done with Git.
    - Data tracking is in development.

**It is generally a good idea to use software for experiment tracking once you move out of the Proof of Concept stage.** Some solutions include:

- [MLflow](https://mlflow.org/).
- [Weights & Biases](https://wandb.ai/site).
- [Sacred](https://sacred.readthedocs.io/en/stable/).

# MLflow

+ MLflow is a software tool that automates tasks related to experiment tracking:

    - Keep track of experiment parameters.
    - Save configurations for individual experiment runs in files or databases.
    - Store models and other artifacts to an object store.

+ A few features that may be useful:

    - Keep track of code and artifacts associated with each experiment.
    - Store experiment run times and system characteristics.
    - Work with different [backend stores](https://mlflow.org/docs/latest/tracking/backend-stores).


## Our First Experiment

Continuing with our example, the following setup will track an experiment to measure the performance of a model pipeline. The main file for this experiment is `./05_src/credit/exp__logistic_simple.py`. You can run this experiment from the `05_src/` folder using `python -m credit.exp__logistic_simple`.
        

After running the experiment, take a look at the MLflow UI by navigating to [http://localhost:5001](http://localhost:5001).
