# **Data Cleaning and Feature Engineering**

## Objectives

* Missing data:
    * Observe any missing data.
    * Clean the dataset.
* Perform feature engineering on the dataset.
* Define a data cleaning and feature engineering pipeline.
* Split the data into Train and Test datasets.

## Inputs

* inputs/datasets/loan_data.csv

## Outputs

* Summary and conclusions at the end of each step.
* Standard code for the feature engineering pipeline.
* Standard code for the data split.
* Save the cleaned Train and Test datasets in:
  * `outputs/datasets/collection/cleaned/TrainSet.csv`
  * `outputs/datasets/collection/cleaned/TestSet.csv`
* Save the feature-engineered Train and Test datasets in:
  * `outputs/datasets/collection/feature_engineered/TrainSet.csv`
  * `outputs/datasets/collection/feature_engineered/TestSet.csv`

---

## **Change working directory**

* Change the working directory from its current folder to its parent folder.

In [None]:
import os
current_dir = os.getcwd()
current_dir

* Make the parent of the current directory the new current directory.

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

* Confirm the new current directory.

In [None]:
current_dir = os.getcwd()
current_dir

## **Dataset Loading**

In [None]:
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/row/LoanDefaultDataset.csv"))
df.head(3)

## **Missing Data**

* Observe the variable types and decide whether any transformations are needed.
* Observe any missing cells in the variables of the dataset.

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
if len(vars_with_missing_data) == 0:
    print("There are no variables with missing data")
else:
    print(f"There are {len(vars_with_missing_data)} variables with missing data: {vars_with_missing_data}")

> Result:

- There is no missing data in any of the variables in the dataset.
- No further action is required for the Missing Data step.
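
If missing values do appear in a later version of the data, a per-column breakdown is more informative than a single count. The sketch below is one possible way to produce it. `missing_data_report` is a hypothetical helper that is not part of this notebook, and the toy DataFrame only stands in for `df`.

```python
import numpy as np
import pandas as pd


def missing_data_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns with gaps."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
        "dtype": df.dtypes.astype(str),
    })
    # Keep only columns that actually have missing cells, worst first
    report = report[report["missing_count"] > 0]
    return report.sort_values("missing_count", ascending=False)


# Toy data standing in for the loan dataset (values are made up)
toy = pd.DataFrame({
    "person_age": [22, 35, np.nan, 41],
    "loan_amnt": [1000, 2500, 4000, 1200],
    "loan_intent": ["EDUCATION", None, None, "MEDICAL"],
})
report = missing_data_report(toy)
print(report)
```

An empty report corresponds to the "no variables with missing data" branch in the cell above.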

### Cleaned Dataset Split

- Split Train and Test Set

In [None]:
from sklearn.model_selection import train_test_split


TrainSet, TestSet, _, __ = train_test_split(
                                        df,
                                        df['loan_status'],
                                        test_size=0.2,
                                        random_state=0)

print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")
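
The split above passes the target only to discard it again (`_` and `__`). As a hedged sketch on made-up data, the example below shows that passing the DataFrame alone with the same `random_state` selects the same rows, and that the optional `stratify` argument keeps the class ratio equal in both halves. The `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the loan dataset (values are made up)
toy = pd.DataFrame({
    "loan_amnt": range(100),
    "loan_status": [0, 0, 0, 1] * 25,
})

# Same call pattern as the notebook: frame plus target, target halves discarded
train_a, test_a, _, __ = train_test_split(
    toy, toy["loan_status"], test_size=0.2, random_state=0)

# Passing only the frame: one array in, two halves out
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=0)

# Optional variant: stratify keeps the class ratio equal in both halves
train_s, test_s = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["loan_status"])

print(train_a.index.equals(train_b.index))
print(train_s["loan_status"].mean(), test_s["loan_status"].mean())
```

Whether stratification is worthwhile depends on how imbalanced `loan_status` is in the real data, which this sketch does not assess.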

### Save the Output Cleaned Datasets

In [7]:
import os
try:
  # exist_ok=True: re-running the cell still saves the files when the folder already exists
  os.makedirs(name='outputs/datasets/collection/cleaned', exist_ok=True)
  TrainSet.to_csv("outputs/datasets/collection/cleaned/TrainSet.csv", index=False)
  TestSet.to_csv("outputs/datasets/collection/cleaned/TestSet.csv", index=False)
except Exception as e:
  print(e)

## **Feature Engineering**

* Define a custom function that analyses each variable and applies candidate transformations.

In [8]:
%matplotlib inline

import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import warnings
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder
sns.set(style="whitegrid")
warnings.filterwarnings('ignore')

def calculate_skew_kurtosis(df,col):
  print(f"{col} with skewness: {df[col].skew().round(2)} and kurtosis: {df[col].kurtosis().round(2)}")

def FeatureEngineeringAnalysis(df, analysis_type=None):
    """
    - used for quick feature engineering on numerical and categorical variables
    to decide which transformation can better transform the distribution shape
    - Once transformed, use a reporting tool, like ydata-profiling, to evaluate distributions
    """
    check_missing_values(df)
    allowed_types = ['numerical', 'ordinal_encoder', 'outlier_winsorizer']
    check_user_entry_on_analysis_type(analysis_type, allowed_types)
    list_column_transformers = define_list_column_transformers(analysis_type)

    # Loop in each variable and engineer the data according to the analysis type
    df_feat_eng = pd.DataFrame([])
    for column in df.columns:
        # create additional columns (column_method) to apply the methods
        df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
        for method in list_column_transformers:
            df_feat_eng[f"{column}_{method}"] = df[column]

        # Apply transformers in respective column_transformers
        df_feat_eng, list_applied_transformers = apply_transformers(
            analysis_type, df_feat_eng, column)

        # For each variable, assess how the transformations perform
        transformer_evaluation(
            column, list_applied_transformers, analysis_type, df_feat_eng)

    return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
    """ Check analysis type """
    if analysis_type is None:
        raise SystemExit(
            f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
    if analysis_type not in allowed_types:
        raise SystemExit(
            f"analysis_type argument should be one of these options: {allowed_types}")


def check_missing_values(df):
    if df.isna().sum().sum() != 0:
        raise SystemExit(
            f"There is a missing value in your dataset. Please handle that before getting into feature engineering.")


def define_list_column_transformers(analysis_type):
    """ Set suffix columns according to analysis_type"""
    if analysis_type == 'numerical':
        list_column_transformers = [
            "log_e", "log_10", "reciprocal", "power", "box_cox", "yeo_johnson"]

    elif analysis_type == 'ordinal_encoder':
        list_column_transformers = ["ordinal_encoder"]

    elif analysis_type == 'outlier_winsorizer':
        list_column_transformers = ['iqr']

    return list_column_transformers


def apply_transformers(analysis_type, df_feat_eng, column):
    for col in df_feat_eng.select_dtypes(include='category').columns:
        df_feat_eng[col] = df_feat_eng[col].astype('object')

    if analysis_type == 'numerical':
        df_feat_eng, list_applied_transformers = FeatEngineering_Numerical(
            df_feat_eng, column)

    elif analysis_type == 'outlier_winsorizer':
        df_feat_eng, list_applied_transformers = FeatEngineering_OutlierWinsorizer(
            df_feat_eng, column)

    elif analysis_type == 'ordinal_encoder':
        df_feat_eng, list_applied_transformers = FeatEngineering_CategoricalEncoder(
            df_feat_eng, column)

    return df_feat_eng, list_applied_transformers


def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
    # For each variable, assess how the transformations perform
    print(f"* Variable Analyzed: {column}")
    print(f"* Applied transformation: {list_applied_transformers} \n")
    for col in [column] + list_applied_transformers:

        if analysis_type != 'ordinal_encoder':
            calculate_skew_kurtosis(df_feat_eng,col)
            DiagnosticPlots_Numerical(df_feat_eng, col)

        else:
            if col == column:
                DiagnosticPlots_Categories(df_feat_eng, col)
            else:
                DiagnosticPlots_Numerical(df_feat_eng, col)

        print("\n")


def DiagnosticPlots_Categories(df_feat_eng, col):
    plt.figure(figsize=(4, 3))
    # A single bar colour is passed with `color`; `palette` is meant for use with `hue`
    sns.countplot(data=df_feat_eng, x=col, color='#432371',
                  order=df_feat_eng[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.suptitle(f"{col}", fontsize=30, y=1.05)
    plt.show()
    print("\n")


def DiagnosticPlots_Numerical(df, variable):
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    sns.histplot(data=df, x=variable, kde=True, element="step", ax=axes[0])
    stats.probplot(df[variable], dist="norm", plot=axes[1])
    sns.boxplot(x=df[variable], ax=axes[2])

    axes[0].set_title('Histogram')
    axes[1].set_title('QQ Plot')
    axes[2].set_title('Boxplot')
    fig.suptitle(f"{variable}", fontsize=30, y=1.05)
    plt.tight_layout()
    plt.show()


def FeatEngineering_CategoricalEncoder(df_feat_eng, column):
    list_methods_worked = []
    try:
        encoder = OrdinalEncoder(encoding_method='arbitrary', variables=[
                                 f"{column}_ordinal_encoder"])
        df_feat_eng = encoder.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_ordinal_encoder")

    except Exception:
        df_feat_eng.drop([f"{column}_ordinal_encoder"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng, column):
    list_methods_worked = []

    # Winsorizer iqr
    try:
        disc = Winsorizer(
            capping_method='iqr', tail='both', fold=1.5, variables=[f"{column}_iqr"])
        df_feat_eng = disc.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_iqr")
    except Exception:
        df_feat_eng.drop([f"{column}_iqr"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_Numerical(df_feat_eng, column):
    list_methods_worked = []

    # LogTransformer base e
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_e"])
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_e")
    except Exception:
        df_feat_eng.drop([f"{column}_log_e"], axis=1, inplace=True)

    # LogTransformer base 10
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_10"], base='10')
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_10")
    except Exception:
        df_feat_eng.drop([f"{column}_log_10"], axis=1, inplace=True)

    # ReciprocalTransformer
    try:
        rt = vt.ReciprocalTransformer(variables=[f"{column}_reciprocal"])
        df_feat_eng = rt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_reciprocal")
    except Exception:
        df_feat_eng.drop([f"{column}_reciprocal"], axis=1, inplace=True)

    # PowerTransformer
    try:
        pt = vt.PowerTransformer(variables=[f"{column}_power"])
        df_feat_eng = pt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_power")
    except Exception:
        df_feat_eng.drop([f"{column}_power"], axis=1, inplace=True)

    # BoxCoxTransformer
    try:
        bct = vt.BoxCoxTransformer(variables=[f"{column}_box_cox"])
        df_feat_eng = bct.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_box_cox")
    except Exception:
        df_feat_eng.drop([f"{column}_box_cox"], axis=1, inplace=True)

    # YeoJohnsonTransformer
    try:
        yjt = vt.YeoJohnsonTransformer(variables=[f"{column}_yeo_johnson"])
        df_feat_eng = yjt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_yeo_johnson")
    except Exception:
        df_feat_eng.drop([f"{column}_yeo_johnson"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked
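
The skewness and kurtosis values printed by `calculate_skew_kurtosis` come from pandas, whose `kurtosis()` reports excess kurtosis (Fisher's definition), so a normally distributed variable scores close to 0 on both measures. The sketch below illustrates this on synthetic samples; the data and column names are assumptions and are unrelated to the loan dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic samples (not from the loan dataset)
samples = pd.DataFrame({
    "normal": rng.normal(loc=0.0, scale=1.0, size=20_000),
    "lognormal": rng.lognormal(mean=0.0, sigma=1.0, size=20_000),
})
# A log transform maps the log-normal sample back to a normal shape
samples["lognormal_log_e"] = np.log(samples["lognormal"])

for col in samples.columns:
    skew = round(samples[col].skew(), 2)
    kurt = round(samples[col].kurtosis(), 2)
    print(f"{col} with skewness: {skew} and kurtosis: {kurt}")
```

When comparing candidate transformations, the one that brings both numbers closest to 0 is usually the one whose histogram and QQ plot look most normal.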

### Categorical Encoding

- Identify categorical variables for the encoder

In [None]:
categorical_variables_names = df.columns[df.dtypes=='object'].to_list()
categorical_variables_names

> Results:

- categorical_variables_names = `['person_gender',
 'person_education',
 'person_home_ownership',
 'loan_intent',
 'previous_loan_defaults_on_file']`

- Build a DataFrame consisting of the selected variables for the analysis.

In [None]:
categorical_variables = df[categorical_variables_names].copy()
categorical_variables.head(3)

- Apply the FeatureEngineeringAnalysis function to evaluate the transformation of each variable.

In [None]:
categorical_variables = FeatureEngineeringAnalysis(df=categorical_variables, analysis_type='ordinal_encoder')

> Results:

- Transformed variables with ordinal_encoder  = `['person_gender',
 'person_education',
 'person_home_ownership',
 'loan_intent',
 'previous_loan_defaults_on_file']`

- Apply the transformation on the categorical_variables using OrdinalEncoder.

In [None]:
# the steps are: 
# 1 - create a transformer
# 2 - fit_transform into df
encoder = OrdinalEncoder(encoding_method='arbitrary', variables = categorical_variables_names)
df = encoder.fit_transform(df)

print("* Categorical encoding - ordinal transformation done!")
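
To make the idea of an arbitrary ordinal encoding concrete, the sketch below maps each category to an integer in order of first appearance. It is a plain-pandas approximation for illustration, not feature_engine's implementation; `fit_arbitrary_mapping` and the toy columns are hypothetical.

```python
import pandas as pd


def fit_arbitrary_mapping(series: pd.Series) -> dict:
    """Map each category to an integer in order of first appearance."""
    return {category: code for code, category in enumerate(series.unique())}


# Toy training and test columns (values are made up)
train_col = pd.Series(["RENT", "OWN", "RENT", "MORTGAGE"])
test_col = pd.Series(["OWN", "MORTGAGE", "OTHER"])

mapping = fit_arbitrary_mapping(train_col)   # learned from training data only
train_encoded = train_col.map(mapping)
test_encoded = test_col.map(mapping)         # unseen "OTHER" becomes NaN

print(mapping)
print(test_encoded.tolist())
```

The integers carry no ranking between categories, which is why this encoding is called arbitrary.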

### Numerical Transformation

- Identify numerical variables for potential transformation.

In [None]:
numerical_variables_names = df.columns[(df.dtypes =='float64') | (df.dtypes == 'int64')].to_list()
numerical_variables_names

> Results:

- numerical_variables_names = `['person_age',
 'person_gender',
 'person_education',
 'person_income',
 'person_emp_exp',
 'person_home_ownership',
 'loan_amnt',
 'loan_intent',
 'loan_int_rate',
 'loan_percent_income',
 'cb_person_cred_hist_length',
 'credit_score',
 'previous_loan_defaults_on_file',
 'loan_status']`
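
The list above also contains the ordinal-encoded categorical variables and the target `loan_status`, because all of them are integer-typed after encoding. A minimal sketch of one way to restrict the analysis to the remaining numerical variables is shown below; the toy frame, the `encoded_categoricals` list and the `target` name are assumptions for illustration.

```python
import pandas as pd

# Toy frame mimicking the state after ordinal encoding (values are made up)
toy = pd.DataFrame({
    "person_income": [30000.0, 52000.0, 41000.0],
    "person_gender": [0, 1, 0],          # ordinal-encoded categorical
    "credit_score": [610, 700, 655],
    "loan_status": [0, 1, 0],            # target
})

encoded_categoricals = ["person_gender"]  # assumption: kept from the encoding step
target = "loan_status"

# select_dtypes(include="number") covers every integer and float width
all_numeric = toy.select_dtypes(include="number").columns.to_list()
continuous_candidates = [
    col for col in all_numeric
    if col not in encoded_categoricals and col != target
]
print(continuous_candidates)
```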

- Build a DataFrame consisting of the selected variables for the analysis.

In [None]:
numerical_variables = df[numerical_variables_names].copy()
numerical_variables.head(3)

- Apply the FeatureEngineeringAnalysis function to evaluate the transformation of each variable.

In [None]:
numerical_variables = FeatureEngineeringAnalysis(df=numerical_variables, analysis_type='numerical')

#### **Summary**

Based on the skewness and kurtosis values and the diagnostic plots, the Yeo-Johnson transformation proved effective on the following variables:

- person_income:
  - Histogram and QQ plot resemble normal distribution with mild deviation
  - The median is in the middle of the Interquartile Range (IQR) of the Boxplot
  - Skewness: -0.02
  - Kurtosis: 1.01
- loan_amnt:
  - Histogram and QQ plot resemble normal distribution with mild deviation
  - The median is in the middle of the Interquartile Range (IQR) of the Boxplot
  - Skewness: -0.02
  - Kurtosis: -0.38
- loan_percent_income:
  - Histogram and QQ plot resemble normal distribution with mild deviation
  - The median is in the middle of the Interquartile Range (IQR) of the Boxplot
  - Skewness: 0.09
  - Kurtosis: -0.76
- credit_score:
  - Histogram and QQ plot resemble normal distribution with mild deviation
  - The median is in the middle of the Interquartile Range (IQR) of the Boxplot
  - Skewness: -0.04
  - Kurtosis: -0.32

> **Conclusion**:

- Yeo Johnson transformation to be applied on the following variables: `['person_income','loan_amnt','loan_percent_income','credit_score']`.

In [16]:
variables_names_for_yeo_johnson= ['person_income','loan_amnt','loan_percent_income','credit_score']

- Apply the transformation using Yeo Johnson Transformer.

In [None]:
# the steps are: 
# 1 - create a transformer
# 2 - fit_transform into df
yeo_transformer = vt.YeoJohnsonTransformer(variables = variables_names_for_yeo_johnson)
df = yeo_transformer.fit_transform(df)

print("* Numerical encoding - Yeo Johnson transformation done!")
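
For reference, the Yeo-Johnson transform for a fixed parameter λ can be written in a few lines. The sketch below is a pure-Python illustration of the published formula for a single value, assuming λ is already known; the transformer used above estimates λ for each variable when it is fitted. The function name `yeo_johnson` is hypothetical.

```python
import math


def yeo_johnson(x: float, lmbda: float) -> float:
    """Yeo-Johnson transform of a single value for a fixed lambda."""
    if x >= 0:
        if lmbda == 0:
            return math.log1p(x)
        return ((x + 1) ** lmbda - 1) / lmbda
    if lmbda == 2:
        return -math.log1p(-x)
    return -(((-x + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)


# lambda = 1 leaves values unchanged; smaller lambdas compress large values
for lam in (1.0, 0.5, 0.0):
    print(lam, [round(yeo_johnson(v, lam), 3) for v in (-3.0, 0.0, 3.0, 100.0)])
```

Unlike the log and Box-Cox transforms, this formula is defined for zero and negative inputs, which is one reason it can succeed where the other candidates are dropped.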

### SmartCorrelatedSelection Variables

In [None]:
smart_correlated_variables = df.copy()
smart_correlated_variables.head(3)

- Apply Smart Correlated Selection to the dataset.

In [None]:
from feature_engine.selection import SmartCorrelatedSelection
corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.7, selection_method="variance")

corr_sel.fit_transform(smart_correlated_variables)
corr_sel.correlated_feature_sets_

- List the features to be dropped. 

In [None]:
corr_sel.features_to_drop_

- Drop the selected features.

In [None]:
from sklearn.pipeline import Pipeline
from feature_engine.selection import DropFeatures

pipeline = Pipeline([
      ( 'drop_features', DropFeatures(features_to_drop = corr_sel.features_to_drop_))
])

df_transformed = pipeline.fit_transform(df)
df_transformed.head()

- As per the SmartCorrelatedSelection results, the following variables are dropped: `['person_age', 'cb_person_cred_hist_length']`
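
The selection rests on pairwise Spearman correlations compared against the 0.7 threshold. The sketch below reproduces that first step with plain pandas on synthetic data, so the groups can be inspected by hand. The data are made up (credit history length is constructed to grow with age) and the code does not replicate how SmartCorrelatedSelection then chooses which feature of each group to keep.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.integers(20, 70, size=500)

# Synthetic data: credit history length grows with age, income is independent
toy = pd.DataFrame({
    "person_age": age,
    "cb_person_cred_hist_length": age - 18 + rng.integers(0, 3, size=500),
    "person_income": rng.normal(50_000, 10_000, size=500),
})

threshold = 0.7
corr = toy.corr(method="spearman").abs()

# Keep each pair once by looking only above the diagonal
correlated_pairs = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(correlated_pairs)
```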

#### **Summary**

The following transformations are identified as beneficial for the data pipeline:

- Categorical encoding - ordinal transformation on the following variables:
   `['person_gender','person_education','person_home_ownership','loan_intent','previous_loan_defaults_on_file']`
- Numerical transformation - Yeo-Johnson transformation on the following variables:
   `['person_income','loan_amnt','loan_percent_income','credit_score']`
- SmartCorrelatedSelection to drop variables: `['person_age', 'cb_person_cred_hist_length']`

### Feature Engineered Dataset Split

- Split Train and Test Set

In [None]:
from sklearn.model_selection import train_test_split


TrainSet, TestSet, _, __ = train_test_split(
                                        df_transformed,
                                        df_transformed['loan_status'],
                                        test_size=0.2,
                                        random_state=0)

print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")

### Save the Output Feature Engineered Datasets

In [23]:
import os
try:
  # exist_ok=True: re-running the cell still saves the files when the folder already exists
  os.makedirs(name='outputs/datasets/collection/feature_engineered', exist_ok=True)
  TrainSet.to_csv("outputs/datasets/collection/feature_engineered/TrainSet.csv", index=False)
  TestSet.to_csv("outputs/datasets/collection/feature_engineered/TestSet.csv", index=False)
except Exception as e:
  print(e)

## **Conclusion**

In this notebook, data cleaning, transformation and feature engineering are carried out. The aim is to create two standard datasets, one cleaned and one feature engineered, each saved in its respective folder. The standard code for the data split and for the feature engineering pipeline is provided below for reuse in other notebooks.

- The typical **data split code** to be used across the rest of the notebooks of this project is presented below:

```
from sklearn.model_selection import train_test_split


TrainSet, TestSet, _, __ = train_test_split(
                                        df_transformed,
                                        df_transformed['loan_status'],
                                        test_size=0.2,
                                        random_state=0)

print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")

```
- The typical **feature engineering pipeline** is summarized in the code below:
```

from sklearn.pipeline import Pipeline

# Feature Engineering
from feature_engine.selection import SmartCorrelatedSelection
from feature_engine.encoding import OrdinalEncoder
from feature_engine import transformation as vt


def PipelineDataCleaningAndFeatureEngineering():
    pipeline_base = Pipeline([
        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
                                                     variables=[
                                                        'person_gender',
                                                        'person_education',
                                                        'person_home_ownership',
                                                        'loan_intent',
                                                        'previous_loan_defaults_on_file',
                                                        ])),
        ("YeoJohnsonTransformer", vt.YeoJohnsonTransformer(
            variables = [
                'person_income',
                'loan_amnt',
                'loan_percent_income',
                'credit_score'
                ])),

        ("SmartCorrelatedSelection", SmartCorrelatedSelection(variables=None,
         method="spearman", threshold=0.7, selection_method="variance")), # to be dropped = ['person_age', 'cb_person_cred_hist_length'].
    ])

    return pipeline_base


PipelineDataCleaningAndFeatureEngineering()

```
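
The function above returns an unfitted pipeline. One common way to use such a pipeline is to fit it on the training rows and then apply the fitted steps to the test rows. The sketch below shows that pattern on synthetic data. To stay self-contained it uses scikit-learn's `PowerTransformer` as a stand-in for the feature_engine steps, so the pipeline, column names and values here are assumptions and not the project's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)

# Synthetic, right-skewed feature plus a binary target (not the loan data)
toy = pd.DataFrame({
    "skewed_feature": rng.lognormal(mean=1.0, sigma=0.5, size=1_000),
    "loan_status": rng.integers(0, 2, size=1_000),
})
X = toy.drop(columns="loan_status")
y = toy["loan_status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Stand-in pipeline: scikit-learn's Yeo-Johnson step instead of feature_engine's
pipeline = Pipeline([
    ("yeo_johnson", PowerTransformer(method="yeo-johnson", standardize=False)),
])

# Parameters (here, lambda) are learned from the training rows only
X_train_fe = pipeline.fit_transform(X_train)
# The test rows are transformed with the already-fitted parameters
X_test_fe = pipeline.transform(X_test)

print(X_train_fe.shape, X_test_fe.shape)
```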

