# Feature Engineering Notebook

## Objectives

* Engineer features for the regression model

## Inputs

* outputs/datasets/cleaned/TrainSetCleaned.csv
* outputs/datasets/cleaned/TestSetCleaned.csv

## Outputs

* Generate a list of variables to engineer

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 


---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Load the cleaned Train and Test Sets

Train Set

In [None]:
import pandas as pd
train_set_path = "outputs/datasets/cleaned/TrainSetCleaned.csv"
TrainSet = pd.read_csv(train_set_path)
TrainSet.head(3)

Test Set

In [None]:
test_set_path = "outputs/datasets/cleaned/TestSetCleaned.csv"
TestSet = pd.read_csv(test_set_path)
TestSet.head(3)

---

# Revisit Data Exploration

In feature engineering, we are interested in evaluating which transformations could be used to engineer the features further.
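
As a lightweight numeric complement to a visual report, skewness and kurtosis per column can hint at which variables are candidates for a transformation. The sketch below uses a hypothetical `toy` DataFrame with made-up values rather than the cleaned train set, so the numbers it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a cleaned data set (hypothetical values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=50, scale=5, size=500),
    "right_skewed": rng.lognormal(mean=3, sigma=0.8, size=500),
})

# Skewness near 0 suggests a roughly symmetric distribution;
# large positive values flag right-skewed candidates for transformation.
summary = toy.agg(["skew", "kurt"]).T
print(summary.round(2))
```

The same `agg(["skew", "kurt"])` call should work on the numerical columns of any DataFrame, for example after `select_dtypes(include="number")`.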

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=TrainSet, minimal=True)
pandas_report.to_notebook_iframe()

## Correlation and PPS Analysis of Cleaned Train Set
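
To make the difference between the two correlation measures used in this section concrete, here is a minimal sketch on toy data. The `toy` frame and its values are assumptions for illustration and are not taken from the data set. A relationship that is monotonic but strongly non-linear receives a perfect Spearman score, while its Pearson score is clearly lower.

```python
import numpy as np
import pandas as pd

# y grows exponentially with x: monotonic, but far from linear.
x = np.linspace(1, 10, 50)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

This is one reason to look at both heatmaps: a feature may rank low on Pearson and still carry a strong monotonic signal.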

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ppscore as pps


def heatmap_corr(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[np.triu_indices_from(mask)] = True
        mask[abs(df) < threshold] = True

        fig, axes = plt.subplots(figsize=figsize)
        sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                    mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=axes,
                    linewidth=0.5
                    )
        axes.set_yticklabels(df.columns, rotation=0)
        plt.ylim(len(df.columns), 0)
        plt.show()


def heatmap_pps(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[abs(df) < threshold] = True
        fig, ax = plt.subplots(figsize=figsize)
        ax = sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                         mask=mask, cmap='rocket_r', annot_kws={"size": font_annot},
                         linewidth=0.05, linecolor='grey')
        plt.ylim(len(df.columns), 0)
        plt.show()


def CalculateCorrAndPPS(df):
    df_corr_spearman = df.corr(method="spearman")
    df_corr_pearson = df.corr(method="pearson")

    pps_matrix_raw = pps.matrix(df)
    pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

    pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
    print("PPS threshold - check PPS score IQR to decide threshold for heatmap \n")
    print(pps_score_stats.round(3))

    return df_corr_pearson, df_corr_spearman, pps_matrix


def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix, CorrThreshold, PPS_Threshold,
                      figsize=(20, 12), font_annot=8):

    print("\n")
    print("* Analyse how the target variable for your ML models is correlated with other variables (features and target)")
    print("* Analyse multicollinearity, that is, how the features are correlated among themselves")

    print("\n")
    print("*** Heatmap: Spearman Correlation ***")
    print("It evaluates monotonic relationship \n")
    heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Pearson Correlation ***")
    print("It evaluates the linear relationship between two continuous variables \n")
    heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Predictive Power Score (PPS) ***")
    print(f"PPS detects linear or non-linear relationships between two columns.\n"
          f"The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
    heatmap_pps(df=pps_matrix, threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)

Calculate Correlations and Predictive Power Score

In [None]:
ts_corr_pearson, ts_corr_spearman, pps_matrix = CalculateCorrAndPPS(TrainSet)

Display in heatmaps

In [None]:
DisplayCorrAndPPS(df_corr_pearson = ts_corr_pearson,
                  df_corr_spearman = ts_corr_spearman, 
                  pps_matrix = pps_matrix,
                  CorrThreshold = 0.4, PPS_Threshold =0.2,
                  figsize=(12,10), font_annot=10)

# Feature Engineering

We will use the custom function from the feature engineering lesson from CI to help us assess the feasibility and usefulness of various engineering techniques.

In [None]:
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import warnings
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder
sns.set(style="whitegrid")
warnings.filterwarnings('ignore')


def FeatureEngineeringAnalysis(df, analysis_type=None):
    """
    - used for quick feature engineering on numerical and categorical variables
    to decide which transformation can better transform the distribution shape
    - Once transformed, use a reporting tool, like pandas-profiling, to evaluate distributions
    """
    check_missing_values(df)
    allowed_types = ['numerical', 'ordinal_encoder', 'outlier_winsorizer']
    check_user_entry_on_analysis_type(analysis_type, allowed_types)
    list_column_transformers = define_list_column_transformers(analysis_type)

    # Loop in each variable and engineer the data according to the analysis type
    df_feat_eng = pd.DataFrame([])
    for column in df.columns:
        # create additional columns (column_method) to apply the methods
        df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
        for method in list_column_transformers:
            df_feat_eng[f"{column}_{method}"] = df[column]

        # Apply transformers in respective column_transformers
        df_feat_eng, list_applied_transformers = apply_transformers(
            analysis_type, df_feat_eng, column)

        # For each variable, assess how the transformations perform
        transformer_evaluation(
            column, list_applied_transformers, analysis_type, df_feat_eng)

    return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
    """ Check analysis type """
    if analysis_type is None:
        raise SystemExit(
            f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
    if analysis_type not in allowed_types:
        raise SystemExit(
            f"analysis_type argument should be one of these options: {allowed_types}")


def check_missing_values(df):
    if df.isna().sum().sum() != 0:
        raise SystemExit(
            f"There is a missing value in your dataset. Please handle that before getting into feature engineering.")


def define_list_column_transformers(analysis_type):
    """ Set suffix columns according to analysis_type"""
    if analysis_type == 'numerical':
        list_column_transformers = [
            "log_e", "log_10", "reciprocal", "power", "box_cox", "yeo_johnson"]

    elif analysis_type == 'ordinal_encoder':
        list_column_transformers = ["ordinal_encoder"]

    elif analysis_type == 'outlier_winsorizer':
        list_column_transformers = ['iqr']

    return list_column_transformers


def apply_transformers(analysis_type, df_feat_eng, column):
    for col in df_feat_eng.select_dtypes(include='category').columns:
        df_feat_eng[col] = df_feat_eng[col].astype('object')

    if analysis_type == 'numerical':
        df_feat_eng, list_applied_transformers = FeatEngineering_Numerical(
            df_feat_eng, column)

    elif analysis_type == 'outlier_winsorizer':
        df_feat_eng, list_applied_transformers = FeatEngineering_OutlierWinsorizer(
            df_feat_eng, column)

    elif analysis_type == 'ordinal_encoder':
        df_feat_eng, list_applied_transformers = FeatEngineering_CategoricalEncoder(
            df_feat_eng, column)

    return df_feat_eng, list_applied_transformers


def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
    # For each variable, assess how the transformations perform
    print(f"* Variable Analyzed: {column}")
    print(f"* Applied transformation: {list_applied_transformers} \n")
    for col in [column] + list_applied_transformers:

        if analysis_type != 'ordinal_encoder':
            DiagnosticPlots_Numerical(df_feat_eng, col)

        else:
            if col == column:
                DiagnosticPlots_Categories(df_feat_eng, col)
            else:
                DiagnosticPlots_Numerical(df_feat_eng, col)

        print("\n")


def DiagnosticPlots_Categories(df_feat_eng, col):
    plt.figure(figsize=(4, 3))
    sns.countplot(data=df_feat_eng, x=col, palette=[
                  '#432371'], order=df_feat_eng[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.suptitle(f"{col}", fontsize=30, y=1.05)
    plt.show()
    print("\n")


def DiagnosticPlots_Numerical(df, variable):
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    sns.histplot(data=df, x=variable, kde=True, element="step", ax=axes[0])
    stats.probplot(df[variable], dist="norm", plot=axes[1])
    sns.boxplot(x=df[variable], ax=axes[2])

    axes[0].set_title('Histogram')
    axes[1].set_title('QQ Plot')
    axes[2].set_title('Boxplot')
    fig.suptitle(f"{variable}", fontsize=30, y=1.05)
    plt.tight_layout()
    plt.show()


def FeatEngineering_CategoricalEncoder(df_feat_eng, column):
    list_methods_worked = []
    try:
        encoder = OrdinalEncoder(encoding_method='arbitrary', variables=[
                                 f"{column}_ordinal_encoder"])
        df_feat_eng = encoder.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_ordinal_encoder")

    except Exception:
        df_feat_eng.drop([f"{column}_ordinal_encoder"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng, column):
    list_methods_worked = []

    # Winsorizer iqr
    try:
        disc = Winsorizer(
            capping_method='iqr', tail='both', fold=1.5, variables=[f"{column}_iqr"])
        df_feat_eng = disc.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_iqr")
    except Exception:
        df_feat_eng.drop([f"{column}_iqr"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_Numerical(df_feat_eng, column):
    list_methods_worked = []

    # LogTransformer base e
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_e"])
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_e")
    except Exception:
        df_feat_eng.drop([f"{column}_log_e"], axis=1, inplace=True)

    # LogTransformer base 10
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_10"], base='10')
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_10")
    except Exception:
        df_feat_eng.drop([f"{column}_log_10"], axis=1, inplace=True)

    # ReciprocalTransformer
    try:
        rt = vt.ReciprocalTransformer(variables=[f"{column}_reciprocal"])
        df_feat_eng = rt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_reciprocal")
    except Exception:
        df_feat_eng.drop([f"{column}_reciprocal"], axis=1, inplace=True)

    # PowerTransformer
    try:
        pt = vt.PowerTransformer(variables=[f"{column}_power"])
        df_feat_eng = pt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_power")
    except Exception:
        df_feat_eng.drop([f"{column}_power"], axis=1, inplace=True)

    # BoxCoxTransformer
    try:
        bct = vt.BoxCoxTransformer(variables=[f"{column}_box_cox"])
        df_feat_eng = bct.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_box_cox")
    except Exception:
        df_feat_eng.drop([f"{column}_box_cox"], axis=1, inplace=True)

    # YeoJohnsonTransformer
    try:
        yjt = vt.YeoJohnsonTransformer(variables=[f"{column}_yeo_johnson"])
        df_feat_eng = yjt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_yeo_johnson")
    except Exception:
        df_feat_eng.drop([f"{column}_yeo_johnson"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked

We will test four possible types of feature engineering methods:

- Categorical Encoding
- Numerical Transformation
- Handling of outliers using the Winsorizer
- Smart Correlated Selection

At each step, we will apply the feature engineering method and assess its effects. If we decide that it helps to make the data more suitable for machine learning, we keep track of the step in a table at the end of this notebook, in order to build it into our feature engineering pipeline in the next step of the ML model development.

## Categorical Encoding

**Ordinal Encoding**: replaces categories with ordinal numbers
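
The idea can be sketched with plain pandas. This is an illustration only: the category values are hypothetical, and assigning integers in order of first appearance is an assumption that may differ from how the library's 'arbitrary' method numbers the categories.

```python
import pandas as pd

# Hypothetical category values, for illustration only.
train = pd.Series(["Gd", "TA", "Ex", "TA", "Gd"], name="KitchenQual")
test = pd.Series(["TA", "Ex", "Fa"], name="KitchenQual")

# Learn the mapping on the train data only: each category gets an integer
# in order of first appearance.
mapping = {category: code for code, category in enumerate(pd.unique(train))}

train_encoded = train.map(mapping)
# Categories not seen during fitting ("Fa") have no code and become NaN here.
test_encoded = test.map(mapping)

print(mapping)
print(train_encoded.tolist())
print(test_encoded.tolist())
```

The sketch also shows why the mapping is learned on the train set and then reused: the test set must be encoded with exactly the same codes.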

Step 1: Select the categorical variables, which are currently encoded using strings in the data set.

In [None]:
variables_engineering= ['BsmtExposure','BsmtFinType1','GarageFinish','KitchenQual']
variables_engineering

Step 2: Create a separate DataFrame with the variables to be engineered.

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3).info()

Step 3: Create engineered variables by applying the transformation, assess the engineered variables distribution and select the most suitable method for each variable.

In [None]:
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='ordinal_encoder')

The encoding has been successful.

Step 4: Apply the transformation to the Train and Test sets.

In [None]:
from sklearn.pipeline import Pipeline
# the steps are:
# 1. create the transformer
# 2. fit_transform the TrainSet
# 3. transform the TestSet
# encoder = OrdinalEncoder(encoding_method='arbitrary', variables=variables_engineering)
pipeline = Pipeline([
    ('ordinal_encoder', OrdinalEncoder(encoding_method='arbitrary', variables=variables_engineering))
])

TrainSet = pipeline.fit_transform(TrainSet)
TestSet = pipeline.transform(TestSet)

print("* Categorical encoding - ordinal transformation - done!")

## Numerical Transformation

The custom function above tests several numerical transformations: "log_e", "log_10", "reciprocal", "power", "box_cox" and "yeo_johnson". The goal is to bring the distribution of each variable closer to a normal distribution. Many machine learning models, linear models in particular, tend to be more effective when the features passed to them are approximately normally distributed.
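
To illustrate what "closer to a normal distribution" means numerically, the following sketch applies rough NumPy/SciPy equivalents of these transformations to synthetic right-skewed data and compares the skewness. The data, the exponent 0.5 for the power transformation, and the use of `scipy.stats` instead of the transformers used in this notebook are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, strictly positive, right-skewed sample (not the real data).
rng = np.random.default_rng(42)
raw = pd.Series(rng.lognormal(mean=4, sigma=0.6, size=1000))

candidates = {
    "raw": raw,
    "log_e": np.log(raw),
    "log_10": np.log10(raw),
    "reciprocal": 1 / raw,
    "power": raw ** 0.5,  # exponent 0.5 is an assumed choice
    # boxcox / yeojohnson return (transformed values, fitted lambda)
    "box_cox": pd.Series(stats.boxcox(raw.to_numpy())[0]),
    "yeo_johnson": pd.Series(stats.yeojohnson(raw.to_numpy())[0]),
}

# Skewness close to 0 indicates a more symmetric distribution.
skewness = pd.Series({name: values.skew() for name, values in candidates.items()})
print(skewness.round(2))
```

Note that the logarithm, the reciprocal and Box-Cox require strictly positive values, which is why some transformations may fail for variables that contain zeros.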

Step 1: Select the variable to be transformed. Because the data set contains a lot of numerical data, we start with only those data that are measured in feet or square feet.

In [None]:
variables_engineering= ['1stFlrSF','2ndFlrSF','GrLivArea','BsmtFinSF1','BsmtUnfSF','TotalBsmtSF',
                        'GarageArea','LotArea','LotFrontage','OpenPorchSF','MasVnrArea']
variables_engineering

Step 2: Create a separate DataFrame with the selected variables.

In [None]:
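# Assess transformations on the train set only, so that no information from the test set leaks into the choice
df_engineering = TrainSet[variables_engineering].copy()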
df_engineering.head(3)

Step 3: Apply the transformation to the selected variables using the custom function above and use the output to assess which transformations are most suitable for which variables. We keep track of the most appropriate transformations in the table at the end of this notebook.

In [None]:
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='numerical')

Next, assess the numerical transformations for the remaining numerical variables, repeating the same steps as before.

In [None]:
variables_engineering= ['BedroomAbvGr','GarageYrBlt','YearBuilt','YearRemodAdd', 'SalePrice']
variables_engineering

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3)

In [None]:
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='numerical')

In [None]:
from sklearn.pipeline import Pipeline

variables_engineering_log= ['1stFlrSF','GrLivArea']
variables_engineering_yeo_johnson= ['BsmtUnfSF','GarageArea','TotalBsmtSF','SalePrice']
variables_engineering_power= ['LotArea']

# the steps are:
# 1. create the transformer
# 2. fit_transform the TrainSet
# 3. transform the TestSet

pipeline = Pipeline([
    ('log_transformer', vt.LogTransformer(variables=variables_engineering_log,base='e')),
    ('power_transformer', vt.PowerTransformer(variables=variables_engineering_power)),
    ('yeo_johnson_transformer', vt.YeoJohnsonTransformer(variables=variables_engineering_yeo_johnson))
])

TrainSet = pipeline.fit_transform(TrainSet)
TestSet = pipeline.transform(TestSet)

print("* Numerical transformations done!")

## Assessing outliers



In [None]:
df_engineering = TrainSet.copy()
df_engineering.head(3)

In [None]:
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='outlier_winsorizer')

Analysis of the winsorized data shows that the only variable that benefits from handling the outliers is 'GrLivArea', so we will apply the Winsorizer transformation to this variable only.
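
For reference, IQR-based capping with a fold of 1.5 can be sketched by hand as follows. The values are made up, and the sketch only mirrors the general technique (cap at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR); it is not the library's implementation.

```python
import pandas as pd

# Made-up values; 95 is an obvious outlier.
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are replaced by the nearest bound.
capped = values.clip(lower=lower, upper=upper)
print(f"bounds: [{lower}, {upper}]")
print(capped.tolist())
```

In a train/test setting the bounds would be computed on the train set and then reused unchanged for the test set.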

In [None]:
from sklearn.pipeline import Pipeline

variables_engineering= ['GrLivArea']

# the steps are:
# 1. create the transformer
# 2. fit_transform the TrainSet
# 3. transform the TestSet

pipeline = Pipeline([
    ('winsorizer', Winsorizer(capping_method='iqr', tail='both', fold=1.5, variables=variables_engineering)),
])

TrainSet = pipeline.fit_transform(TrainSet)
TestSet = pipeline.transform(TestSet)

print("* Outlier transformations done!")

## Smart Correlated Selection Variables

This transformer does not require a manual selection of variables; all features are fed into it. It finds groups of correlated features (here: Pearson correlation above 0.6) and keeps one feature from each group according to the chosen selection method (here: the feature with the highest variance), dropping the others. The target variable 'SalePrice' should not be included in this analysis, since it must not be dropped.
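
The principle can be sketched in a few lines of pandas. This is a simplified illustration on synthetic data: the helper `correlated_groups` is hypothetical, its greedy grouping is an assumption, and the result may differ in detail from what the transformer does.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=300)
toy = pd.DataFrame({
    "a": base * 10,                               # high variance
    "b": base + rng.normal(scale=0.1, size=300),  # strongly correlated with "a"
    "c": rng.normal(size=300),                    # independent of both
})


def correlated_groups(df, threshold):
    """Greedy grouping (hypothetical helper): each unassigned column collects
    the unassigned columns whose absolute correlation with it exceeds the threshold."""
    corr = df.corr(method="pearson").abs()
    groups, assigned = [], set()
    for col in corr.columns:
        if col in assigned:
            continue
        partners = [other for other in corr.columns
                    if other != col and other not in assigned
                    and corr.loc[col, other] > threshold]
        if partners:
            group = [col] + partners
            groups.append(group)
            assigned.update(group)
    return groups


groups = correlated_groups(toy, threshold=0.6)
variances = toy.var()
# Within each group keep the column with the highest variance, drop the rest.
to_drop = [col for group in groups for col in group
           if col != variances[group].idxmax()]
print(groups, to_drop)
```

Selecting by variance depends on the scale of the features, so the outcome can change if features are rescaled before this step.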

In [None]:
df_engineering = TrainSet.copy().drop(['SalePrice'],axis=1)
df_engineering.head(3)

In [None]:
from feature_engine.selection import SmartCorrelatedSelection
corr_sel = SmartCorrelatedSelection(variables=None, method="pearson", threshold=0.6, selection_method="variance")

corr_sel.fit_transform(df_engineering)
print("* The following groups of correlated variables have been identified:")
corr_sel.correlated_feature_sets_

In [None]:
print("* The Smart Correlation Selection has determined to drop the following features:")
corr_sel.features_to_drop_

# Conclusion: Proposed transformations

- The table below lists the transformations that have been determined to be applicable given the features present in the data set.

- The Smart Correlated Selection has determined that the following variables should be dropped: '1stFlrSF', 'GarageYrBlt', 'GrLivArea', and 'YearRemodAdd'.

- These transformations will be added to the ML Pipeline.

| Variable | Proposed Transformation |
| ------ | ------ |
| BsmtExposure | OrdinalEncoder, SmartCorrelatedSelection |
| BsmtFinType1 | OrdinalEncoder, SmartCorrelatedSelection |
| GarageFinish | OrdinalEncoder, SmartCorrelatedSelection |
| KitchenQual | OrdinalEncoder, SmartCorrelatedSelection |
| OverallCond | SmartCorrelatedSelection |
| OverallQual | SmartCorrelatedSelection |
| 1stFlrSF | Numerical - log_e, SmartCorrelatedSelection |
| 2ndFlrSF | SmartCorrelatedSelection |
| BsmtFinSF1 | SmartCorrelatedSelection |
| BsmtUnfSF | Numerical - yeo_johnson, SmartCorrelatedSelection |
| GarageArea | Numerical - yeo_johnson, SmartCorrelatedSelection |
| GrLivArea | Numerical - log_e, Winsorizer, SmartCorrelatedSelection |
| LotArea | Numerical - power, SmartCorrelatedSelection |
| LotFrontage | SmartCorrelatedSelection |
| MasVnrArea | SmartCorrelatedSelection |
| OpenPorchSF | SmartCorrelatedSelection |
| TotalBsmtSF | Numerical - yeo_johnson, SmartCorrelatedSelection |
| BedroomAbvGr | SmartCorrelatedSelection |
| GarageYrBlt | SmartCorrelatedSelection |
| YearBuilt | SmartCorrelatedSelection |
| YearRemodAdd | SmartCorrelatedSelection |
| SalePrice | Numerical - yeo_johnson |


---

# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so that the otherwise empty try block is valid Python
except Exception as e:
  print(e)
