
# Common issues
1. Missing data
2. Categorical variables
3. Imbalanced data
4. Linear assumptions
5. Distributions
6. Outliers
7. Feature scale

## 1. Imputing missing values

### Numerical
- Mean/Median imputation
- Arbitrary
- End of tail

### Categorical
- Mode
- Add "Missing" category

### Numerical and Categorical
- Complete Case Analysis
- Add a missing indicator
- Random sample imputation

In [None]:
# Before applying feature engineering, split your dataset into train and test sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=.25,
                                                    random_state=42)

In [None]:
# Numerical - Mean/Median Imputation

from sklearn.impute import SimpleImputer

# Pick ONE of the two strategies below (the second assignment overrides the first)

# for approximately normal distributions
imputer = SimpleImputer(strategy='mean')

# for skewed distributions
imputer = SimpleImputer(strategy='median')

# Fit and transform
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

#### Assumptions
- Data is missing at random
- Missing data would look like most of your observations

#### Pros
- Easy to implement
- Fast
- Can be used in production

#### Cons
- Distorts: distributions, variance and covariance
- The more missing values the higher the distortion
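To make the variance distortion concrete, here is a minimal sketch on a hypothetical toy column (the `age` values are made up for illustration). It compares the sample variance of the observed values with the variance after mean imputation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column with 40% of the values missing
toy = pd.DataFrame(
    {"age": [20.0, 30.0, 40.0, np.nan, 50.0, np.nan, 60.0, np.nan, 70.0, np.nan]}
)

imputer = SimpleImputer(strategy="mean")
imputed = imputer.fit_transform(toy)

# pandas skips NaN by default, so this is the variance of the observed values
var_before = toy["age"].var()
# sample variance once every NaN has been replaced by the mean
var_after = imputed[:, 0].var(ddof=1)

print(f"mean used for imputation: {imputer.statistics_[0]:.1f}")
print(f"variance before: {var_before:.1f}, after: {var_after:.1f}")
```

Every imputed value sits exactly at the mean, so the spread shrinks as the share of missing values grows.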

In [None]:
# Numerical - Arbitrary Imputation

from sklearn.impute import SimpleImputer

# arbitrary value, chosen to lie outside the observed range of the data
imputer = SimpleImputer(strategy='constant', fill_value=999)

# Fit and transform
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

#### Assumptions
- Data is missing at random

#### Pros
- Easy to implement
- Fast
- Can be used in production
- Captures the importance of a value being missing

#### Cons
- Distorts: distributions, variance and covariance
- The more missing values the higher the distortion
- It can mask or create outliers
- Be careful not to use values that are too similar to mean/median

In [None]:
# Numerical - End of tail imputation

from feature_engine.imputation import EndTailImputer

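# Pick ONE of the two methods below (the second assignment overrides the first).
# `tail` accepts 'right' or 'left'.

# for normal distributions
imputer = EndTailImputer(imputation_method='gaussian', tail='right')

# for skewed distributions
imputer = EndTailImputer(imputation_method='iqr', tail='right')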

# Fit and transform
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

#### Assumptions
- Data is missing at random

#### Pros
- Easy to implement
- Fast
- Can be used in production

#### Cons
- Distorts: distributions, variance and covariance
- The more missing values the higher the distortion

In [None]:
# Categorical - Frequency/Mode imputation

imputer = SimpleImputer(strategy='most_frequent')

# Fit and transform
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

#### Assumptions
- Data is missing at random
- Missing observations most likely look like the majority

#### Pros
- Easy to implement
- Fast
- Can be used in production

#### Cons
- Distorts the relation between the most frequent value and other variables
- Overrepresentation of the mode if there are many missing values

In [None]:
# Categorical - "Missing" new category

imputer = SimpleImputer(strategy='constant', fill_value="Missing")

# Fit and transform
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

#### Pros
- Easy to implement
- Fast
- Can be used in production
- Captures the importance of missing data
- Makes no assumption about whether data is missing at random

#### Cons
- If the number of missing values is small you can end up with a rare category

In [None]:
# Numerical and Categorical - Complete Case Analysis

import pandas as pd

df.dropna(inplace=True)

#### Assumptions
- Data is missing at random

#### Pros
- Easy to implement
- No data manipulation required
- Preserves distributions

#### Cons
- A lot of observations can be discarded if there is a significant amount of missing data
- Can create a biased dataset when the complete cases differ from the original data
- Can't be used in production

#### When to use CCA:
- Data is completely missing at random
- No more than 5% of observations will be discarded

In [None]:
# Numerical and Categorical - Missing Indicator

import pandas as pd
from sklearn.impute import MissingIndicator

indicator = MissingIndicator(features='missing-only')

# Fit
indicator.fit(X_train)

# Print and get columns with missing indicator
print(X_train.columns[indicator.features_])
temp = indicator.transform(X_train)

# Create columns for each new indicator (keep the original index so rows stay aligned)
indicator_columns = [column + "_NA_IND" for column in X_train.columns[indicator.features_]]
indicator_df = pd.DataFrame(temp, columns=indicator_columns, index=X_train.index)

# Concat columns with indicators and rest of training data
X_train = pd.concat([X_train, indicator_df], axis=1)

# Same for Test
temp_test = indicator.transform(X_test)
test_indicator_df = pd.DataFrame(temp_test, columns=indicator_columns, index=X_test.index)

X_test = pd.concat([X_test, test_indicator_df], axis=1)

#### Assumptions
- Data is NOT missing at random
- Missing data can be predicted

#### Pros
- Easy to implement
- Can be integrated in production
- Captures the importance of missing data

#### Cons
- Expands the feature set
- The original variable still has to be imputed
- Many missing indicators may be very highly correlated
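Since the original variable still has to be imputed, both steps can be combined. The sketch below uses the `add_indicator` option of scikit-learn's `SimpleImputer` on a hypothetical toy array; it is one possible way to do this, not the only one.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy matrix: only the first feature has a missing value
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, 30.0]])

# Impute with the median and append a binary indicator
# for each feature that had missing values during fit
imputer = SimpleImputer(strategy="median", add_indicator=True)
out = imputer.fit_transform(X_toy)

print(out)
```

The output keeps the two imputed features and appends one indicator column for the first feature.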

In [None]:
# Numerical and Categorical - Random Sample Imputation

from feature_engine.imputation import RandomSampleImputer

imputer = RandomSampleImputer(random_state=42)

# Fit and transform
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

#### Assumptions
- Data is missing at random
- Missing values are replaced with other values drawn from the distribution of the original variable

#### Pros
- Easy to implement
- Can be integrated in production
- It preserves distributions

#### Cons
- Randomness
- If there are many missing values, the relationships between imputed variables and other variables may be affected
- Memory allocation in production, because the original training data must be stored to sample from during imputation
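The idea behind random sample imputation can be sketched by hand with NumPy and pandas. The series below is hypothetical, and the fixed seed is only there to make the draw reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical toy variable with two missing values
train = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])

# Pool of observed values to draw replacements from
observed = train.dropna()
n_missing = int(train.isna().sum())

# Draw one observed value per missing entry, with replacement
sampled = rng.choice(observed.to_numpy(), size=n_missing, replace=True)

imputed = train.copy()
imputed.loc[train.isna()] = sampled

print(imputed.tolist())
```

Because replacements come from the observed values, the imputed variable keeps roughly the same distribution, at the cost of introducing randomness.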

## 2. Categorical Variables

### Classic techniques
- One-hot encoding
- Frequency encoding
- Ordinal / Label encoding

### Monotonic relationships
- Ordered label encoding
- Mean encoding
- Weight of Evidence
- Probability Ratio

### Other techniques
- Rare encoding
- Binary encoding
- Decision Tree encoding

In [None]:
# Other - Rare Label encoding (first encoder to apply)

from feature_engine.encoding import RareLabelEncoder

encoder = RareLabelEncoder()

# Fit and transform
encoder.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)


In [None]:
# Classic techniques - One-hot encoder

from feature_engine.encoding import OneHotEncoder

encoder = OneHotEncoder()

# Fit and transform
encoder.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

#### Pros
- Doesn't assume distributions
- Retains all categorical variable information
- Works very well with linear models

#### Cons
- Expands the feature space
- Doesn't add any extra information while encoding
- Adds sparsity
- Some dummy variables may be identical
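For comparison, here is a minimal pandas-only sketch of one-hot encoding into k-1 dummies, with the test set aligned to the columns learned on the train set. The `color` data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data; "purple" never appears in the train set
train = pd.DataFrame({"color": ["red", "blue", "green", "red"]})
test = pd.DataFrame({"color": ["blue", "purple"]})

# k-1 dummies: the first category (alphabetically, "blue") becomes the baseline
train_ohe = pd.get_dummies(train, columns=["color"], drop_first=True)

# Encode the test set, then keep exactly the columns seen in train
test_ohe = pd.get_dummies(test, columns=["color"]).reindex(
    columns=train_ohe.columns, fill_value=0
)

print(train_ohe.columns.tolist())
print(test_ohe)
```

Note that with this alignment the unseen category ends up as all zeros, which makes it indistinguishable from the baseline category.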

In [None]:
# Classic techniques - Frequency encoding

from feature_engine.encoding import CountFrequencyEncoder

encoder = CountFrequencyEncoder(encoding_method='frequency')

# Fit and transform
encoder.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

#### Pros
- Easy to implement
- Feature space remains the same size
- Works well with tree-based algorithms

#### Cons
- Limitations with linear models
- Does not handle new categories
- If two or more categories share the same count/frequency, they become indistinguishable and information is lost
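The new-category limitation is easy to see in a hand-written version of frequency encoding. This is a sketch with hypothetical data, not the library's implementation.

```python
import pandas as pd

# Hypothetical toy data; "d" only appears in the test set
train = pd.Series(["a", "a", "b", "c", "a", "b"])
test = pd.Series(["a", "d"])

# Relative frequency of each category, learned on the train set
freq_map = train.value_counts(normalize=True)

train_enc = train.map(freq_map)
test_enc = test.map(freq_map)  # unseen categories have no entry in the map

print(test_enc.tolist())
```

The unseen category maps to NaN, so a fallback (for example a frequency of 0) has to be decided explicitly.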

In [None]:
# Classic techniques - Label encoding

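# Note: sklearn's LabelEncoder expects a 1-D target (y), not a feature matrix.
# For features, an ordinal encoder with arbitrary ordering does the same job.
from feature_engine.encoding import OrdinalEncoder

encoder = OrdinalEncoder(encoding_method='arbitrary')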

# Fit and transform
encoder.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

#### Pros
- Easy to implement
- Feature space remains the same size
- Works well with tree-based algorithms

#### Cons
- Limitations with linear models
- Does not handle new categories
- Doesn't add any extra valuable information while encoding
- Creates an ordered relationship between the categories

In [None]:
# Monotonic techniques - Ordered encoding

from feature_engine.encoding import OrdinalEncoder

encoder = OrdinalEncoder(encoding_method='ordered')

# Fit and transform
encoder.fit(X_train, y_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

#### Pros
- Easy to implement
- Feature space remains the same size
- Creates a monotonic relationship with the target variable
- Works very well for regression problems

#### Cons
- Can overfit models
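Ordered encoding can be sketched by hand: rank the categories by their mean target value and assign integers in that order. The `city` and `target` values below are hypothetical.

```python
import pandas as pd

# Hypothetical toy training data
train = pd.DataFrame(
    {"city": ["A", "A", "B", "B", "C"], "target": [1, 0, 1, 1, 0]}
)

# Sort categories by mean target, lowest first
order = train.groupby("city")["target"].mean().sort_values().index

# Assign 0, 1, 2, ... following that order
ordinal_map = {category: rank for rank, category in enumerate(order)}

train["city_encoded"] = train["city"].map(ordinal_map)

print(ordinal_map)
```

Because the integers follow the target mean, the encoded variable is monotonic with respect to the target on the train set, which is also why it can overfit.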

In [None]:
# Monotonic - Mean Encoder

from feature_engine.encoding import MeanEncoder

encoder = MeanEncoder()

# Fit and transform
encoder.fit(X_train, y_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

#### Pros
- Easy to implement
- Feature space remains the same size
- Creates a monotonic relationship with the target variable
- Works very well for regression problems

#### Cons
- Can overfit models
- If two or more categories have the same target mean, they receive the same encoding and information is lost
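A hand-written version of mean encoding, assuming a hypothetical `city` feature and a binary `target`, looks like this:

```python
import pandas as pd

# Hypothetical toy training data
train = pd.DataFrame(
    {"city": ["A", "A", "B", "B", "C"], "target": [1, 0, 1, 1, 0]}
)
test = pd.DataFrame({"city": ["B", "C"]})

# Mean of the target per category, learned on the train set only
mean_map = train.groupby("city")["target"].mean()

train["city_encoded"] = train["city"].map(mean_map)
test["city_encoded"] = test["city"].map(mean_map)

print(mean_map.to_dict())
```

Categories with very few rows (such as `C` here) get an encoding based on a single observation, which is where the overfitting risk comes from.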

In [None]:
# Monotonic - Weight of Evidence (binary classification only)

from feature_engine.encoding import WoEEncoder

encoder = WoEEncoder()

# Fit and transform
encoder.fit(X_train, y_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

#### Pros
- Easy to implement
- Feature space remains the same size
- Creates a monotonic relationship with the target variable
- Orders the categories on a log scale
- Works great for analysis as it is easy to compare the transformed variables to determine which one is a better predictor

#### Cons
- Can overfit models
- Undefined when the denominator is 0
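As a sketch of the computation, Weight of Evidence for a category can be written as the log of the share of positives in that category divided by the share of negatives in that category. The `grade` and `y` values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a binary target
toy = pd.DataFrame({
    "grade": ["A", "A", "A", "B", "B", "B", "B", "C", "C", "C"],
    "y":     [1,   1,   0,   1,   0,   0,   0,   1,   0,   1],
})

# Count of each target class per category
counts = pd.crosstab(toy["grade"], toy["y"])

# Share of all negatives (column 0) and all positives (column 1) in each category
dist = counts / counts.sum()

# WoE = ln( P(category | y=1) / P(category | y=0) )
woe = np.log(dist[1] / dist[0])

print(woe.round(3).to_dict())
```

A category with no negatives (or no positives) would produce a division by zero or a log of zero here, which is the undefined case listed above.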

In [None]:
# Monotonic - Probability Ratios encoding (binary classification only)

from feature_engine.encoding import PRatioEncoder

encoder = PRatioEncoder()

# Fit and transform
encoder.fit(X_train, y_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

#### Pros
- Easy to implement
- Feature space remains the same size
- Creates a monotonic relationship with the target variable
- Works well with linear models (like every other monotonic approach)
- Works great for analysis as it is easy to compare the transformed variables to determine which one is a better predictor

#### Cons
- Can overfit models
- Undefined when the denominator is 0

In [None]:
# Other - Binary encoder

from category_encoders import BinaryEncoder

encoder = BinaryEncoder()

# Fit and transform
encoder.fit(X_train, y_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

#### Pros
- Easy to implement
- Feature space remains ALMOST the same size

#### Cons
- Difficult to interpret
- Potential loss of information during encoding

## 4. Linear assumptions (Transformations)

- Logarithmic (right skewness)
- Square root (right skewness)
- Reciprocal (both)
- Exponential (power) (left skewness)
- Box-Cox
- Yeo-Johnson
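The simpler transformations in the list above can be applied directly with NumPy. The sketch below uses a synthetic right-skewed sample (an assumption for illustration) and compares skewness before and after the log and square root transformations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed, strictly positive sample
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

skew_raw = x.skew()
skew_log = np.log(x).skew()    # log requires strictly positive values
skew_sqrt = np.sqrt(x).skew()  # sqrt requires non-negative values

print(f"raw: {skew_raw:.2f}, log: {skew_log:.2f}, sqrt: {skew_sqrt:.2f}")
```

Both transformations pull in the right tail; the log is the stronger of the two on this kind of data.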

In [None]:
# Box-Cox (power transformation with automatic search for lambda; requires strictly positive values)

from feature_engine.transformation import BoxCoxTransformer

transformer = BoxCoxTransformer()

# Fit and transform
transformer.fit(X_train)
X_train = transformer.transform(X_train)
X_test = transformer.transform(X_test)

In [None]:
# Yeo-Johnson (power transformation with automatic search for lambda; also handles zero and negative values)

from feature_engine.transformation import YeoJohnsonTransformer

transformer = YeoJohnsonTransformer()

# Fit and transform
transformer.fit(X_train)
X_train = transformer.transform(X_train)
X_test = transformer.transform(X_test)

## 7. Feature scales

- Mean normalization
- Standardization
- Robust Scaling
- Min Max
- Absolute max
- Unit norm
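Two of the methods above are easy to sketch on a toy array: mean normalization computed by hand, and absolute max scaling with scikit-learn's `MaxAbsScaler`. The arrays are hypothetical, and all statistics are learned on the train set only.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical toy data
X_train_toy = np.array([[1.0, -10.0],
                        [2.0, 0.0],
                        [3.0, 20.0]])
X_test_toy = np.array([[4.0, 10.0]])

# Mean normalization: (x - mean) / (max - min), per feature
mean = X_train_toy.mean(axis=0)
value_range = X_train_toy.max(axis=0) - X_train_toy.min(axis=0)
X_train_mn = (X_train_toy - mean) / value_range
X_test_mn = (X_test_toy - mean) / value_range

# Absolute max scaling: x / max(|x|), per feature
scaler = MaxAbsScaler().fit(X_train_toy)
X_train_ma = scaler.transform(X_train_toy)
X_test_ma = scaler.transform(X_test_toy)

print(X_train_mn)
print(X_train_ma)
```

Test values can fall outside the train range (the first test feature does here), because the scaling parameters come from the train set.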

In [None]:
# Standardization

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()


# Fit and transform
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# Robust Scaler

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()


# Fit and transform
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# MinMax Scaler

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 10))

# Fit and transform
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# Unit Norm Scaler

from sklearn.preprocessing import Normalizer

scaler = Normalizer()

# Fit and transform
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
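Unlike the other scalers, `Normalizer` rescales each row (sample) rather than each column (feature), and it learns nothing during `fit`. A minimal sketch on a hypothetical toy array:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical toy data: two samples, two features
X_toy = np.array([[3.0, 4.0],
                  [1.0, 0.0]])

# Scale every row to unit L2 norm
out = Normalizer(norm="l2").fit_transform(X_toy)

# Each row now has Euclidean length 1
row_norms = np.linalg.norm(out, axis=1)

print(out)
print(row_norms)
```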