# Feature Engineering Notebook

## Objectives

* Drop features that are correlated with other features

## Inputs

* outputs/datasets/collection/HousePricesRecords.csv

## Outputs

* No output

## Conclusions

* After dropping `EnclosedPorch` and `WoodDeckSF`, SmartCorrelatedSelection identifies six more features that can be dropped, leaving 15 of the original 23 features.


---

## Change working directory

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/milestone-project-heritage-housing-issues/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspace/milestone-project-heritage-housing-issues'
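Re-running the `os.chdir` cell moves one level further up each time, after which the relative path used to load the data no longer resolves. A minimal sketch of a guard, assuming the notebooks always live in a folder named `jupyter_notebooks` (as in the path shown above); the helper name `project_root` is hypothetical:

```python
import os

def project_root(path, notebooks_dir="jupyter_notebooks"):
    """Return the parent of `path` only when `path` is the notebooks folder."""
    if os.path.basename(path) == notebooks_dir:
        return os.path.dirname(path)
    return path  # already at the project root (or somewhere unexpected)

# Safe to run repeatedly: a second call leaves the directory unchanged.
os.chdir(project_root(os.getcwd()))
```

With this guard the cell can be executed any number of times without drifting above the project folder.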

---

## Load Data

In [5]:
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/HousePricesRecords.csv"))
df.head(3)

| | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | GarageFinish | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856.0 | 854.0 | 3.0 | No | 706.0 | GLQ | 150.0 | 0.0 | 548.0 | RFn | ... | 65.0 | 196.0 | 61.0 | 5 | 7 | 856.0 | 0.0 | 2003 | 2003 | 208500.0 |
| 1 | 1262.0 | 0.0 | 3.0 | Gd | 978.0 | ALQ | 284.0 | NaN | 460.0 | RFn | ... | 80.0 | 0.0 | 0.0 | 8 | 6 | 1262.0 | NaN | 1976 | 1976 | 181500.0 |
| 2 | 920.0 | 866.0 | 3.0 | Mn | 486.0 | GLQ | 434.0 | 0.0 | 608.0 | RFn | ... | 68.0 | 162.0 | 42.0 | 5 | 7 | 920.0 | NaN | 2001 | 2002 | 223500.0 |


---

## SmartCorrelatedSelection Variables

In [9]:
from sklearn.pipeline import Pipeline

### Drop features
from feature_engine.selection import DropFeatures

### Median Imputer
from feature_engine.imputation import MeanMedianImputer

### Correlation selection
from feature_engine.selection import SmartCorrelatedSelection

In [6]:
### Custom Encoder from DataCleaning notebook
from sklearn.base import BaseEstimator, TransformerMixin
class MyCustomEncoder(BaseEstimator, TransformerMixin):

  def __init__(self, variables, dic):
    if not isinstance(variables, list): 
      self.variables = [variables]
    else: self.variables = variables
    self.dic = dic

  def fit(self, X, y=None):    
    return self

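  def transform(self, X):
    # Work on a copy so the caller's DataFrame is not modified in place
    X = X.copy()
    for col in self.variables:
      if X[col].dtype == 'object':
        # Use the mapping stored on the instance, not a global `dic`
        X[col] = X[col].replace(self.dic[col])
      else:
        print(f"Warning: {col} data type should be object to use MyCustomEncoder()")

    return X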
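
Conceptually, the encoder performs one dictionary lookup per value. The sketch below mirrors that idea with plain lists, using the `GarageFinish` mapping from this notebook. The helper `encode` is hypothetical and the sample values are made up for illustration. Values without an entry in the mapping, such as missing values, pass through unchanged, so imputation still has to happen in a later step.

```python
# Mapping copied from the `dic` used in this notebook
garage_finish_map = {'Fin': 3, 'RFn': 2, 'Unf': 1, 'None': 0}

def encode(values, mapping):
    """Replace each value by its ordinal code; leave unmapped values as they are."""
    return [mapping.get(value, value) for value in values]

# Made-up sample; the Python object None stands in for a missing value
sample = ['RFn', 'Unf', None, 'Fin']
encoded = encode(sample, garage_finish_map)
print(encoded)
```

Note that the string `'None'` in the mapping is a category label, which is different from a missing value.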

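* Add SmartCorrelatedSelection to the pipeline to drop features that are strongly correlated with other features and therefore add little new information. Here, features whose Spearman correlation exceeds 0.6 are grouped, and the feature with the highest variance in each group is kept.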
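
To make the selection step more concrete, here is a rough standard-library sketch of the underlying idea: rank-transform each column, compute the Pearson correlation of the ranks (which is the Spearman correlation), and, for each pair above the threshold, drop the column with the lower variance. This is a simplified pairwise illustration, not feature-engine's actual implementation; the function names and the toy columns `a`, `b`, `c` are assumptions made for the example.

```python
from statistics import mean, pvariance

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes neither column is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def features_to_drop(columns, threshold=0.6):
    """For each correlated pair, drop the column with the lower variance."""
    dropped = set()
    names = list(columns)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if first in dropped or second in dropped:
                continue
            if abs(spearman(columns[first], columns[second])) > threshold:
                dropped.add(min(first, second, key=lambda n: pvariance(columns[n])))
    return sorted(dropped)

# Toy data: `a` and `b` rise together, `c` is only weakly related to either
toy = {
    "a": [1, 2, 3, 4, 5],
    "b": [10, 20, 30, 45, 70],
    "c": [3, 1, 4, 1, 5],
}
print(features_to_drop(toy))
```

In the toy data, `a` and `b` are perfectly rank-correlated, and `a` is dropped because its variance is lower.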

In [10]:
# dic and vars_with_missing_data are needed for data cleaning - see the DataCleaning notebook
df2 = df.copy().drop('SalePrice', axis=1)
dic = {'BsmtExposure': {'Gd': 4, 'Av': 3, 'Mn': 2, 'No': 1, 'None': 0}, 'BsmtFinType1': {'GLQ': 6, 'ALQ': 5, 'BLQ': 4, 'Rec': 3, 'LwQ': 2, 'Unf': 1, 'None': 0}, 'GarageFinish': {'Fin': 3, 'RFn': 2, 'Unf': 1, 'None': 0}, 'KitchenQual': {'Ex': 4, 'Gd': 3, 'TA': 2, 'Fa': 1, 'Po': 0}}
vars_with_missing_data = ['2ndFlrSF', 'BedroomAbvGr', 'BsmtFinType1', 'GarageFinish', 'GarageYrBlt', 'LotFrontage', 'MasVnrArea']

pipeline = Pipeline([
      ('drop_features', DropFeatures(features_to_drop = ['EnclosedPorch', 'WoodDeckSF'])),
      ('custom_encoder', MyCustomEncoder(variables=['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'], dic=dic)),
      ('median_imputer',  MeanMedianImputer(imputation_method='median', variables=vars_with_missing_data)),
      ('corr_sel', SmartCorrelatedSelection(method="spearman", threshold=0.6, selection_method="variance"))
])

df_transformed = pipeline.fit_transform(df2) 
df_transformed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   BedroomAbvGr  1460 non-null   float64
 1   BsmtExposure  1460 non-null   int64  
 2   BsmtFinSF1    1460 non-null   float64
 3   BsmtUnfSF     1460 non-null   float64
 4   GarageArea    1460 non-null   float64
 5   GarageFinish  1460 non-null   float64
 6   GrLivArea     1460 non-null   float64
 7   LotArea       1460 non-null   float64
 8   LotFrontage   1460 non-null   float64
 9   MasVnrArea    1460 non-null   float64
 10  OpenPorchSF   1460 non-null   float64
 11  OverallCond   1460 non-null   int64  
 12  OverallQual   1460 non-null   int64  
 13  TotalBsmtSF   1460 non-null   float64
 14  YearBuilt     1460 non-null   int64  
dtypes: float64(11), int64(4)
memory usage: 171.2 KB


* Show the groups with correlated features

In [11]:
pipeline['corr_sel'].correlated_feature_sets_

[{'1stFlrSF', 'TotalBsmtSF'},
 {'2ndFlrSF', 'GrLivArea'},
 {'BsmtFinSF1', 'BsmtFinType1'},
 {'GarageYrBlt', 'YearBuilt', 'YearRemodAdd'},
 {'KitchenQual', 'OverallQual'}]

* Show the removed features

In [12]:
pipeline['corr_sel'].features_to_drop_

['1stFlrSF',
 '2ndFlrSF',
 'BsmtFinType1',
 'GarageYrBlt',
 'KitchenQual',
 'YearRemodAdd']
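
As a quick sanity check, the two outputs above can be cross-referenced to see which feature survives in each correlated group and to confirm the feature count stated in the conclusions. The sketch below copies the values shown above into plain Python objects; the variable names are chosen for illustration only.

```python
# Values copied from the outputs of the two cells above
correlated_feature_sets = [
    {'1stFlrSF', 'TotalBsmtSF'},
    {'2ndFlrSF', 'GrLivArea'},
    {'BsmtFinSF1', 'BsmtFinType1'},
    {'GarageYrBlt', 'YearBuilt', 'YearRemodAdd'},
    {'KitchenQual', 'OverallQual'},
]
features_dropped = ['1stFlrSF', '2ndFlrSF', 'BsmtFinType1',
                    'GarageYrBlt', 'KitchenQual', 'YearRemodAdd']

# Exactly one feature should remain in every correlated group
kept_per_group = [sorted(group - set(features_dropped))
                  for group in correlated_feature_sets]
print(kept_per_group)

# 23 original features, minus the 2 removed by DropFeatures, minus the correlated ones
remaining = 23 - 2 - len(features_dropped)
print(remaining)
```

Each group keeps a single representative, and every one of them appears in the 15 columns listed by `df_transformed.info()`.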