# Common issues
1. Missing data
2. Categorical variables
3. Imbalanced data
4. Linear assumptions
5. Outliers
6. Feature scale
7. High dimensionality (many features/variables)

## 1. Imputing missing values

### Numerical
- Mean/Median imputation
- Arbitrary
- End of tail

### Categorical
- Mode
- Add "Missing" category

### Numerical and Categorical
- Complete Case Analysis
- Add a missing indicator
- Random sample imputation

In [1]:
# Before applying feature engineering, split your dataset into train and test sets
# (X and y are assumed to be already defined)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=.25,
                                                    random_state=42)


In [None]:
# Numerical - Mean/Median Imputation

# With sklearn

from sklearn.impute import SimpleImputer

# Choose ONE strategy: the second assignment below overrides the first

# for normal distributions
imputer = SimpleImputer(strategy='mean')

# for skewed distributions
imputer = SimpleImputer(strategy='median')

# Fit and transform
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

In [None]:
%pip install feature-engine -q
import pandas as pd
from feature_engine.imputation import MeanMedianImputer


# Load the Titanic dataset

titanic_df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
mmi = MeanMedianImputer(imputation_method='mean')
mmi.fit(titanic_df)

In [None]:
mmi.imputer_dict_

In [None]:
mmm = MeanMedianImputer(imputation_method='median')
mmm.fit(titanic_df)

In [None]:
mmm.imputer_dict_

In [None]:
# Numerical - Mean/Median Imputation

# With feature_engine

from feature_engine.imputation import MeanMedianImputer

# Choose ONE method: the second assignment below overrides the first

# for normal distributions
mmi = MeanMedianImputer(imputation_method='mean')

# for skewed distributions
mmi = MeanMedianImputer(imputation_method='median')

# Fit and transform

mmi.fit(X_train)
X_train = mmi.transform(X_train)
X_test = mmi.transform(X_test)

***Assumptions***
- Data is missing at random
- Missing data would look like most of your observations

**Pros**
- Easy to implement
- Fast
- Can be used in production

**Cons**
- Distorts: distributions, variance and covariance
- The more missing values the higher the distortion
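
The variance distortion listed above can be checked with a small simulation. This is a minimal sketch on synthetic data: the 30% missing rate and the variable names are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=50, scale=10, size=1000))

# Hypothetical setup: blank out 30% of the values completely at random
with_gaps = values.copy()
missing_idx = rng.choice(values.index, size=300, replace=False)
with_gaps.loc[missing_idx] = np.nan

# Mean imputation: every gap receives the same value
imputed = with_gaps.fillna(with_gaps.mean())

var_observed = with_gaps.var()  # variance of the observed values only
var_imputed = imputed.var()     # variance after filling the gaps with the mean

print(f"observed variance: {var_observed:.2f}")
print(f"imputed variance:  {var_imputed:.2f}")
```

The mean stays the same, but the imputed variance is smaller because the filled-in values add no spread. The gap widens as the share of missing values grows.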

In [None]:
# Numerical - Arbitrary Imputation with sklearn

from sklearn.impute import SimpleImputer

# Replace missing values with an arbitrary constant far from the observed range
imputer = SimpleImputer(strategy='constant', fill_value=999)

# Fit and transform
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

In [None]:
# Numerical - Arbitrary Imputation with feature engine

from feature_engine.imputation import ArbitraryNumberImputer
arbitrary_imputer = ArbitraryNumberImputer(
    arbitrary_number=-999,
    )

# Fit and transform
arbitrary_imputer.fit(X_train)
X_train = arbitrary_imputer.transform(X_train)
X_test = arbitrary_imputer.transform(X_test)

***Assumptions***
- Data is not missing at random

**Pros**
- Easy to implement
- Fast
- Can be used in production
- Captures the importance of a value being missing

**Cons**
- Distorts: distributions, variance and covariance
- The more missing values the higher the distortion
- It can mask or create outliers
- Be careful not to use values that are too similar to mean/median

In [None]:
# Numerical - End of tail imputation

from feature_engine.imputation import EndTailImputer

# Choose ONE method: the second assignment below overrides the first
# tail accepts 'right' or 'left'

# for normal distributions
imputer = EndTailImputer(imputation_method='gaussian', tail='right')

# for skewed distributions
imputer = EndTailImputer(imputation_method='iqr', tail='right')

# Fit and transform
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

***Assumptions***
- Data is not missing at random

**Pros**
- Easy to implement
- Fast
- Can be used in production

**Cons**
- Distorts: distributions, variance and covariance
- The more missing values the higher the distortion
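
To make the end-of-tail idea concrete, the replacement values can be computed by hand with pandas. This is a sketch under the assumption that the right tail is used with a factor of 3 (mean + 3 standard deviations for the Gaussian rule, 75th percentile + 3 × IQR for the IQR rule); the sample series is hypothetical.

```python
import pandas as pd

# Hypothetical right-skewed feature with two missing entries
s = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 8.0, 20.0, None, None])

# Gaussian rule (right tail): mean + 3 * standard deviation
gaussian_value = s.mean() + 3 * s.std()

# IQR rule (right tail): 75th percentile + 3 * interquartile range
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr_value = q3 + 3 * (q3 - q1)

# Fill the gaps with the value that suits a skewed variable
s_imputed = s.fillna(iqr_value)

print(gaussian_value, iqr_value)
```

Both values sit at the far end of the distribution, which is what lets a model tell imputed observations apart from regular ones.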

In [None]:
# Categorical - Frequency/Mode imputation

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='most_frequent')

# Fit and transform
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

***Assumptions***
- Data is missing at random
- Missing observations most likely look like the majority

**Pros**
- Easy to implement
- Fast
- Can be used in production

**Cons**
- Distorts the relationship between the most frequent value and other variables
- Overrepresents the mode if there are many missing values

In [None]:
# Categorical - "Missing" new category

imputer = SimpleImputer(strategy='constant', fill_value="Missing")

# Fit and transform
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

**Pros**
- Easy to implement
- Fast
- Can be used in production
- Captures the importance of missing data
- Makes no assumption about whether data is missing at random

**Cons**
- If the number of missing values is small you can end up with a rare category

In [None]:
# Numerical and Categorical - Complete Case Analysis

import pandas as pd

df.dropna(inplace=True)

***Assumptions***
- Data is missing at random

**Pros**
- Easy to implement
- No data manipulation required
- Preserves distributions

**Cons**
- A lot of observations can be discarded if there is a significant amount of missing data
- Can create a biased dataset when the complete cases differ from the original data
- Can't be used in production

**When to use CCA:**
- Data is completely missing at random
- No more than 5% of observations will be discarded
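
The 5% guideline can be checked before dropping anything. The following is a minimal sketch on a hypothetical DataFrame; the column names and the fallback branch are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 rows, 3 of them with a missing value
df = pd.DataFrame({"age": np.arange(100, dtype=float), "fare": np.ones(100)})
df.loc[[5, 40, 77], "age"] = np.nan

# Fraction of rows that complete case analysis would discard
dropped_fraction = 1 - len(df.dropna()) / len(df)

# Apply CCA only when the loss stays within the 5% guideline
if dropped_fraction <= 0.05:
    df_cca = df.dropna()
else:
    df_cca = df  # keep all rows and use an imputation method instead

print(f"rows discarded: {dropped_fraction:.1%}")
```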

In [None]:
# Numerical and Categorical - Missing Indicator

import pandas as pd
from sklearn.impute import MissingIndicator

indicator = MissingIndicator(features='missing-only')

# Fit
indicator.fit(X_train)

# Print and get columns with missing indicator
print(X_train.columns[indicator.features_])
temp = indicator.transform(X_train)

# Create columns for each new indicator
indicator_columns = [column + "_NA_IND" for column in X_train.columns[indicator.features_]]
indicator_df = pd.DataFrame(temp, columns=indicator_columns)

# Concat columns with indicators and rest of training data
# drop=True keeps the old index from being added as an extra column
X_train = pd.concat([X_train.reset_index(drop=True), indicator_df], axis=1)

# Same for Test
temp_test = indicator.transform(X_test)
test_indicator_df = pd.DataFrame(temp_test, columns=indicator_columns)

X_test = pd.concat([X_test.reset_index(drop=True), test_indicator_df], axis=1)

***Assumptions***
- Data is NOT missing at random
- Missing data can be predicted

**Pros**
- Easy to implement
- Can be integrated in production
- Captures the importance of missing data

**Cons**
- Expands the feature set
- The original variable still has to be imputed
- Many missing indicators may be very highly correlated
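
Because the original variable still has to be imputed, it can be convenient to do both steps at once. One possible approach, sketched here on a small hypothetical array, is the `add_indicator=True` option of scikit-learn's `SimpleImputer`, which appends indicator columns for the features that had missing values during fit.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: the first column has a gap, the second is complete
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])
X_test = np.array([[np.nan, 40.0]])

# Impute with the median and append a missing indicator in one step
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_train_t = imputer.fit_transform(X_train)
X_test_t = imputer.transform(X_test)

# Columns: imputed feature 0, imputed feature 1, indicator for feature 0
print(X_train_t)
print(X_test_t)
```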

In [None]:
# Numerical and Categorical - Random Sample Imputation

from feature_engine.imputation import RandomSampleImputer

imputer = RandomSampleImputer(random_state=42)

# Fit and transform
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

***Assumptions***
- Data is missing at random
- Missing values are replaced with other values drawn from the distribution of the original variable

**Pros**
- Easy to implement
- Can be integrated in production
- It preserves distributions

**Cons**
- Randomness: imputed values change between runs unless the seed is fixed
- If there are many missing values the relationships between imputed variables and other variables may be affected
- Memory allocation in production due to the need to store the original training data to sample from
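
The mechanism behind random sample imputation can be sketched with plain pandas. This is an illustration on a hypothetical series, not the library's implementation: values are drawn with replacement from the observed training values and placed into the gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical training feature with two gaps
train = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, np.nan], name="feature")

observed = train.dropna()
n_missing = int(train.isna().sum())

# Draw replacements from the observed training values (with replacement)
sampled = observed.sample(n=n_missing, replace=True, random_state=42)

# Re-index the draws so they line up with the positions of the gaps
sampled.index = train.index[train.isna()]

train_imputed = train.fillna(sampled)
print(train_imputed)
```

The observed values have to be kept around to impute new data later, which is the memory cost mentioned in the cons.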

## 2. Categorical Variables

### Classic techniques
- One-hot encoding
- Frequency encoding
- Ordinal / Label encoding

### Monotonic relationships
- Ordered label encoding (regression)
- Mean encoding (regression)
- Weight of Evidence (binary classification)
- Probability Ratio (binary classification)

### Other techniques
- Rare encoding
- Binary encoding
- Decision Tree encoding

In [None]:
# Other - Rare Label encoding (first encoder to apply)

from feature_engine.encoding import RareLabelEncoder

encoder = RareLabelEncoder()

# Fit and transform
encoder.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)


In [None]:
# Classic techniques - One-hot encoder

from feature_engine.encoding import OneHotEncoder

encoder = OneHotEncoder()

# Fit and transform
encoder.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

**Pros**
- Doesn't assume distributions
- Retains all categorical variable information
- Works very well with linear models

**Cons**
- Expands the feature space
- Doesn't add any extra information while encoding
- Adds sparsity
- Some dummy variables may be identical

In [None]:
# Classic techniques - Frequency encoding

from feature_engine.encoding import CountFrequencyEncoder

encoder = CountFrequencyEncoder(encoding_method='frequency')

# Fit and transform
encoder.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

**Pros**
- Easy to implement
- Feature space remains the same size
- Works well with tree-based algorithms

**Cons**
- Limited usefulness with linear models
- Does not handle new categories
- If two or more categories have the same count/frequency, they receive the same encoding and information can be lost
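
Two of the cons above are easy to reproduce by hand. The sketch below builds a frequency map with pandas on hypothetical data: categories `A` and `B` collide because they are equally frequent, and the unseen category `D` in the test set has no mapping.

```python
import pandas as pd

# Hypothetical train and test sets
train = pd.DataFrame({"city": ["A", "A", "B", "B", "C"]})
test = pd.DataFrame({"city": ["A", "C", "D"]})

# Learn the relative frequency of each category from the training set only
freq_map = train["city"].value_counts(normalize=True)

train["city_freq"] = train["city"].map(freq_map)
test["city_freq"] = test["city"].map(freq_map)  # unseen "D" becomes NaN

print(freq_map)
print(test)
```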

In [None]:
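# Classic techniques - Label encoding

# LabelEncoder is meant for a 1-D target; OrdinalEncoder is the equivalent for feature columns
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()

# Fit and transform (X_train is assumed to hold only categorical columns)
encoder.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)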

**Pros**
- Easy to implement
- Feature space remains the same size
- Works well with tree-based algorithms

**Cons**
- Limited usefulness with linear models
- Does not handle new categories
- Doesn't add any extra valuable information while encoding
- Creates an ordered relationship between the categories

In [None]:
# Monotonic techniques - Ordered encoding

from feature_engine.encoding import OrdinalEncoder

encoder = OrdinalEncoder(encoding_method='ordered')

# Fit and transform
encoder.fit(X_train, y_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

**Pros**
- Easy to implement
- Feature space remains the same size
- Creates a monotonic relationship with the target variable
- Works very well for regression problems

**Cons**
- Can overfit models

In [None]:
# Monotonic - Mean Encoder

from feature_engine.encoding import MeanEncoder

encoder = MeanEncoder()

# Fit and transform
encoder.fit(X_train, y_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

**Pros**
- Easy to implement
- Feature space remains the same size
- Creates a monotonic relationship with the target variable
- Works very well for regression problems

**Cons**
- Can overfit models
- If two or more categories have the same target mean, they receive the same encoding and information can be lost
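
Mean encoding amounts to a group-by on the training target. The sketch below uses hypothetical data in which `red` and `green` happen to share the same target mean, so both end up with the same encoded value.

```python
import pandas as pd

# Hypothetical training data for a regression problem
X_train = pd.DataFrame({"color": ["red", "red", "blue", "blue", "green"]})
y_train = pd.Series([10.0, 20.0, 30.0, 50.0, 15.0])

# Mean of the target per category, learned from the training set only
target_means = y_train.groupby(X_train["color"]).mean()

# Replace each category with its target mean
X_train_encoded = X_train["color"].map(target_means)

print(target_means)
print(X_train_encoded.tolist())
```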

In [None]:
# Monotonic - Weight of Evidence (binary classification only)

from feature_engine.encoding import WoEEncoder

encoder = WoEEncoder()

# Fit and transform
encoder.fit(X_train, y_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

**Pros**
- Easy to implement
- Feature space remains the same size
- Creates a monotonic relationship with the target variable
- Orders the categories in a log scale
- Works great for analysis as it is easy to compare the transformed variables to determine which one is a better predictor

**Cons**
- Can overfit models
- Undefined when the denominator is 0
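
The zero-denominator problem is easier to see when Weight of Evidence is computed by hand. The sketch below assumes the common definition: the natural log of the share of positives in a category divided by the share of negatives in that category. The data is hypothetical, and category `c` contains no negatives.

```python
import numpy as np
import pandas as pd

# Hypothetical binary classification data; category "c" has no negatives
df = pd.DataFrame({
    "cat":    ["a", "a", "a", "b", "b", "b", "b", "c"],
    "target": [1,   1,   0,   1,   0,   0,   0,   1],
})

# Share of all positives that fall in each category
pos_share = df.groupby("cat")["target"].sum() / df["target"].sum()

# Share of all negatives that fall in each category
negatives = 1 - df["target"]
neg_share = negatives.groupby(df["cat"]).sum() / negatives.sum()

# Weight of Evidence: log of the ratio of the two shares
woe = np.log(pos_share / neg_share)

print(woe)  # "c" is infinite because its denominator is 0
```

In practice such categories are usually grouped with others (for example with rare label encoding) before applying the encoder.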

In [None]:
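# Monotonic - Probability Ratios encoding (binary classification only)

# Note: PRatioEncoder may not be available in recent feature_engine releases; check your installed version
from feature_engine.encoding import PRatioEncoder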

encoder = PRatioEncoder()

# Fit and transform
encoder.fit(X_train, y_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

**Pros**
- Easy to implement
- Feature space remains the same size
- Creates a monotonic relationship with the target variable
- Works well with linear models (as every other monotonic approach)
- Works great for analysis as it is easy to compare the transformed variables to determine which one is a better predictor

**Cons**
- Can overfit models
- Undefined when the denominator is 0

In [None]:
# Other - Binary encoder

from category_encoders import BinaryEncoder

encoder = BinaryEncoder()

# Fit and transform
encoder.fit(X_train, y_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

**Pros**
- Easy to implement
- Feature space remains ALMOST the same size

**Cons**
- Difficult to interpret
- Potential loss of information during encoding
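
To show why the feature space stays almost the same size, here is a hand-rolled sketch of the idea behind binary encoding: each category gets an integer code, and the code is written out as binary digits, one column per digit. The column names are hypothetical and the exact layout produced by `category_encoders` may differ.

```python
import pandas as pd

# Hypothetical feature with 5 distinct categories
s = pd.Series(["a", "b", "c", "d", "e", "a"])

# Step 1: integer code per category, starting at 1
codes = pd.factorize(s)[0] + 1

# Step 2: number of binary digits needed for the largest code
n_bits = int(codes.max()).bit_length()

# Step 3: one column per binary digit, most significant digit first
bits = pd.DataFrame(
    {f"cat_{i}": (codes >> (n_bits - 1 - i)) & 1 for i in range(n_bits)}
)

print(bits)
```

Five categories fit in three columns, whereas one-hot encoding would need five.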

## 4. Linear assumptions (Transformations)

- Logarithmic (right skewness)
- Square root (right skewness)
- Reciprocal (both)
- Exponential (power) (left skewness)
- Box-Cox
- Yeo-Johnson

### Logarithmic Transformation

**Description**

The logarithmic transformation is commonly used to stabilize variance when data exhibits right skewness. It reduces the range of data and helps make the distribution more symmetric.

**Pros**

*	Reduces right skewness effectively.
*	Helps in handling outliers by compressing the data range.
*	Useful in making multiplicative relationships additive.

**Cons**

*	Cannot be used with non-positive values (zeros or negatives).
*	May not perform well if the data contains many zeros or very small positive values.

**When to Use**

*	When the data is positively skewed and you need to stabilize the variance.
*	Suitable for data that follows a multiplicative process.


In [None]:
from feature_engine.transformation import LogTransformer
import pandas as pd

# Sample data
data = pd.DataFrame({'feature': [10, 20, 30, 40, 50]})

# Apply Logarithmic Transformation
transformer = LogTransformer(variables=['feature'])
data_transformed = transformer.fit_transform(data)

print(data_transformed)

### Square Root Transformation

**Description**

The square root transformation is another technique used to stabilize variance for right-skewed data. It is less aggressive than the logarithmic transformation.

**Pros**

* Reduces right skewness while preserving more of the data's original structure.
* Can handle zero values (but not negative values).

**Cons**

* Cannot be used with negative values.
*	Less effective than logarithmic transformation for extremely skewed data.

**When to Use**

* When data is moderately skewed to the right.
*	When logarithmic transformation is too strong but variance still needs to be stabilized.

In [None]:
from feature_engine.transformation import PowerTransformer
import pandas as pd

# Sample data
data = pd.DataFrame({'feature': [10, 20, 30, 40, 50]})

# Apply Square Root Transformation
transformer = PowerTransformer(variables=['feature'], exp=0.5)
data_transformed = transformer.fit_transform(data)

print(data_transformed)

### Reciprocal Transformation

**Description**

The reciprocal transformation is useful for both right and left-skewed data. It involves transforming the data to its reciprocal (1/x).

**Pros**

* Effective for both right and left skewness.
*	Can handle large ranges of data.

**Cons**

*	Cannot be used with zero or negative values.
*	Highly sensitive to very small values.

**When to Use**

*	When data exhibits either right or left skewness.
*	Useful for data that spans several orders of magnitude.

In [None]:
from feature_engine.transformation import ReciprocalTransformer
import pandas as pd

# Sample data
data = pd.DataFrame({'feature': [10, 20, 30, 40, 50]})

# Apply Reciprocal Transformation
transformer = ReciprocalTransformer(variables=['feature'])
data_transformed = transformer.fit_transform(data)

print(data_transformed)

### Exponential (Power) Transformation

**Description**

The exponential transformation is typically used for left-skewed data. It involves raising the data to a specified power.

**Pros**

*	Effective for reducing left skewness.
*	Can handle a wide range of transformations by adjusting the power.

**Cons**

*	Choice of power requires careful consideration and may need experimentation.
*	Can increase skewness if the wrong power is chosen.

**When to Use**

*	When data is negatively skewed (left-skewed).
*	When a more flexible transformation is needed.


In [None]:
from feature_engine.transformation import PowerTransformer
import pandas as pd

# Sample data
data = pd.DataFrame({'feature': [-5, -3, -1, 1, 3]})

# Apply Exponential (Power) Transformation with power=2
transformer = PowerTransformer(variables=['feature'], exp=2)
data_transformed = transformer.fit_transform(data)

print(data_transformed)

### Box-Cox Transformation

**Description**

The Box-Cox transformation is a family of power transformations parameterized by lambda. It is useful for stabilizing variance and making the data more normally distributed.

**Pros**

*	Can handle positive data and find an optimal transformation parameter (lambda).
*	Versatile and widely used for various types of data.

**Cons**

*	Cannot be used with non-positive values.
*	Requires estimation of the lambda parameter, which can be computationally intensive.

**When to Use**

*	When data needs to be normalized and variance stabilized.
*	Suitable for continuous, positive data.
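
The lambda estimation can also be inspected directly with SciPy. The sketch below is an illustration on synthetic log-normal data (an assumption, chosen because it is strictly positive and right-skewed). It fits lambda with `scipy.stats.boxcox` and then reapplies the Box-Cox formula by hand for a non-zero lambda.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data: strictly positive and right-skewed
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# Fit lambda by maximum likelihood and transform in one call
x_bc, fitted_lambda = stats.boxcox(x)

# Box-Cox by hand for lambda != 0: (x**lambda - 1) / lambda
x_manual = (x**fitted_lambda - 1) / fitted_lambda

print(f"fitted lambda: {fitted_lambda:.3f}")
print(f"skewness before: {stats.skew(x):.2f}, after: {stats.skew(x_bc):.2f}")
```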

In [None]:
from feature_engine.transformation import BoxCoxTransformer
import pandas as pd

# Sample data
data = pd.DataFrame({'feature': [10, 20, 30, 40, 50]})

# Apply Box-Cox Transformation
transformer = BoxCoxTransformer(variables=['feature'])
data_transformed = transformer.fit_transform(data)

print(data_transformed)

### Yeo-Johnson Transformation

**Description**

The Yeo-Johnson transformation is a modification of the Box-Cox transformation that can handle both positive and negative values.

**Pros**

* Can handle both positive and negative values.
*	Finds an optimal transformation parameter (lambda) automatically.

**Cons**

*	Requires estimation of the lambda parameter.
*	Computationally more intensive than simpler transformations.

**When to Use**

*	When data includes both positive and negative values and variance needs to be stabilized.
*	Suitable for normalizing a wide variety of data distributions.


In [None]:
from feature_engine.transformation import YeoJohnsonTransformer
import pandas as pd

# Sample data
data = pd.DataFrame({'feature': [-10, -5, 0, 5, 10]})

# Apply Yeo-Johnson Transformation
transformer = YeoJohnsonTransformer(variables=['feature'])
data_transformed = transformer.fit_transform(data)

print(data_transformed)

## 6. Feature scales

- Mean normalization
- Standardization
- Robust Scaling
- Min Max
- Absolute max
- Unit norm
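
Mean normalization and absolute max scaling appear in the list above but have no cell of their own, so here is a minimal sketch of both. It assumes mean normalization is defined as (x - mean) / (max - min) and uses scikit-learn's `MaxAbsScaler` for absolute max scaling. The arrays are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical train and test data with two features
X_train = np.array([[1.0, -10.0], [2.0, 0.0], [3.0, 5.0]])
X_test = np.array([[4.0, 20.0]])

# Mean normalization: (x - mean) / (max - min), statistics from the training set
mean_ = X_train.mean(axis=0)
range_ = X_train.max(axis=0) - X_train.min(axis=0)
X_train_mn = (X_train - mean_) / range_
X_test_mn = (X_test - mean_) / range_

# Absolute max scaling: x / max(|x|), learned from the training set
max_abs = MaxAbsScaler().fit(X_train)
X_train_ma = max_abs.transform(X_train)
X_test_ma = max_abs.transform(X_test)

print(X_train_mn)
print(X_train_ma)
```

Test values can fall outside the training range, as the second feature of `X_test` does here.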

In [None]:
# Standardization

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()


# Fit and transform
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# Robust Scaler

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()


# Fit and transform
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
#  MinMax Scaler

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 10))

# Fit and transform
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# Unit Norm Scaler

from sklearn.preprocessing import Normalizer

scaler = Normalizer()

# Fit and transform
# Note: Normalizer rescales each row (sample) to unit norm, not each column; fit() learns nothing
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

## 3. Imbalanced data

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE, ADASYN

In [None]:
# Create imbalanced dataset

X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

In [None]:
def evaluate_model(X_train, y_train, X_test, y_test):
    model = LogisticRegression(random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))

In [None]:
print('Original dataset:')
evaluate_model(X_train, y_train, X_test, y_test)

In [None]:
# Oversampling using SMOTE

smote = SMOTE(random_state=42)
# fit_resample fits the sampler and returns the resampled training data in one call
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print('SMOTE dataset:')
evaluate_model(X_train_smote, y_train_smote, X_test, y_test)

In [None]:
# Oversampling using ADASYN

adasyn = ADASYN(random_state=42)
X_train_adasyn, y_train_adasyn = adasyn.fit_resample(X_train, y_train)

print('ADASYN dataset:')
evaluate_model(X_train_adasyn, y_train_adasyn, X_test, y_test)

In [None]:
! pip install feature_engine

In [None]:
import pandas as pd
import numpy as np