# Feature Engineering Notebook

## Objectives

- Engineer features for Regression and Decision Tree models

## Inputs

- outputs/datasets/cleaned/TrainSetCleaned.csv
- outputs/datasets/cleaned/TestSetCleaned.csv

## Outputs

- Generate a list of variables to engineer

---

## Change working directory
Change the current working directory to its parent folder so that the relative paths used below resolve from the project root.

In [1]:
import os 
cwd = os.getcwd()
cwd

'/workspaces/heritage-housing/jupyter_notebooks'

In [2]:
os.chdir(os.path.dirname(cwd))
print("You set a new current working directory")

You set a new current working directory


In [3]:
cwd = os.getcwd()
cwd

'/workspaces/heritage-housing'
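
Re-running cell 2 moves up one more level each time, which can quietly break the relative paths used later. A minimal sketch of a guard, assuming the notebooks always live in a folder named `jupyter_notebooks` (taken from the path printed above); the helper name `project_root_from` is hypothetical:

```python
import os

def project_root_from(cwd: str, notebook_dir: str = "jupyter_notebooks") -> str:
    """Return the parent of cwd only when cwd is the notebook folder."""
    path = cwd.rstrip("/\\")
    if os.path.basename(path) == notebook_dir:
        return os.path.dirname(path)
    return cwd  # already at the project root (or elsewhere): leave unchanged

# In the notebook this would be used as: os.chdir(project_root_from(os.getcwd()))
```

Because the helper only computes a path, the cell can be run any number of times without climbing past the project root.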

---

## Load cleaned data

In [5]:
import pandas as pd
train_set_path = "outputs/datasets/cleaned/TrainSetCleaned.csv"
TrainSet = pd.read_csv(train_set_path)
TrainSet.head(3)

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,GarageArea,GarageFinish,GrLivArea,...,TotalBsmtSF,HouseAge,RemodAge,GarageAge,TotalSF,AboveGradeSF,IsRemodeled,Has2ndFlr,HasPorch,HasDeck
0,1828,0.0,3.0,Av,48,Missing,1774,774,Unf,1828,...,1822,18,18,18.0,3650.0,1828.0,0,0,0,0
1,894,0.0,2.0,No,0,Unf,894,308,Missing,894,...,894,63,63,63.0,1788.0,894.0,0,0,0,0
2,964,0.0,2.0,No,713,ALQ,163,432,Unf,964,...,876,104,19,104.0,1840.0,964.0,1,0,0,0


In [6]:
test_set_path = "outputs/datasets/cleaned/TestSetCleaned.csv"
TestSet = pd.read_csv(test_set_path)
TestSet.head(3)

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,GarageArea,GarageFinish,GrLivArea,...,TotalBsmtSF,HouseAge,RemodAge,GarageAge,TotalSF,AboveGradeSF,IsRemodeled,Has2ndFlr,HasPorch,HasDeck
0,2515,0.0,4.0,No,1219,Rec,816,484,Missing,2515,...,2035,68,50,50.0,4550.0,2515.0,1,0,0,0
1,958,620.0,3.0,No,403,BLQ,238,240,Unf,1578,...,806,84,75,84.0,2384.0,1578.0,1,1,0,0
2,979,224.0,3.0,No,185,LwQ,524,352,Unf,1203,...,709,75,75,75.0,1912.0,1203.0,0,1,0,0
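
The `Unnamed: 0` column in both previews is typically a row index that an earlier `to_csv` call wrote out and `read_csv` then read back as data. A minimal sketch of two ways to keep it out of the feature set, using a small in-memory CSV with made-up values in place of the real files:

```python
import io
import pandas as pd

# Toy stand-in for a CSV saved with its index (values are illustrative only).
csv_text = ",1stFlrSF,GrLivArea\n0,1828,1828\n1,894,894\n"

# Option 1: tell pandas that the first column is the index.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: read as usual, then drop the stray column if it is present.
df_dropped = pd.read_csv(io.StringIO(csv_text)).drop(
    columns="Unnamed: 0", errors="ignore"
)

print(list(df_indexed.columns))
print(list(df_dropped.columns))
```

Writing the cleaned files with `index=False` would avoid the extra column at the source.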


---

## Feature Engineering

### Ordinal Encoding 

In [18]:
variables_to_encode = ['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']
df_encode = TrainSet[variables_to_encode].copy()

Define explicit mappings for the ordinal variables. A higher code means a better rating, and 'Missing', where present, gets the lowest code.

In [19]:
kitchen_mapping = {'Ex': 4, 'Gd': 3, 'TA': 2, 'Fa': 1}
bsmt_exposure_mapping = {'Gd': 5, 'Av': 4, 'Mn': 3, 'No': 2, 'Missing': 1}
bsmt_fin_mapping = {'GLQ': 7, 'ALQ': 6, 'BLQ': 5, 'Rec': 4, 'LwQ': 3, 'Unf': 2, 'Missing': 1}
garage_finish_mapping = {'Fin': 4, 'RFn': 3, 'Unf': 2, 'Missing': 1}


Apply the mappings to the copied columns

In [20]:
df_encode['KitchenQual'] = df_encode['KitchenQual'].map(kitchen_mapping)
df_encode['BsmtExposure'] = df_encode['BsmtExposure'].map(bsmt_exposure_mapping)
df_encode['BsmtFinType1'] = df_encode['BsmtFinType1'].map(bsmt_fin_mapping)
df_encode['GarageFinish'] = df_encode['GarageFinish'].map(garage_finish_mapping)
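
`Series.map` returns `NaN` for any label that is not a key in the mapping, so a category that appears only in the test set would slip through unnoticed. A sketch of a reusable helper that applies the same mappings to any frame and fails loudly on unmapped labels; the function name `encode_ordinals` and the toy frames are illustrative and not part of the original notebook:

```python
import pandas as pd

def encode_ordinals(df: pd.DataFrame, mappings: dict) -> pd.DataFrame:
    """Return a copy of df with each column in `mappings` replaced by its ordinal code."""
    out = df.copy()
    for col, mapping in mappings.items():
        # Labels present in the data but absent from the mapping would become NaN.
        unmapped = set(out[col].dropna().unique()) - set(mapping)
        if unmapped:
            raise ValueError(f"{col}: labels without a code: {sorted(unmapped)}")
        out[col] = out[col].map(mapping)
    return out

mappings = {
    "KitchenQual": {"Ex": 4, "Gd": 3, "TA": 2, "Fa": 1},
    "GarageFinish": {"Fin": 4, "RFn": 3, "Unf": 2, "Missing": 1},
}

# Toy frames standing in for TrainSet and TestSet.
train = pd.DataFrame({"KitchenQual": ["Gd", "TA"], "GarageFinish": ["Unf", "Missing"]})
test = pd.DataFrame({"KitchenQual": ["Ex", "Fa"], "GarageFinish": ["Fin", "RFn"]})

train_encoded = encode_ordinals(train, mappings)
test_encoded = encode_ordinals(test, mappings)
print(train_encoded)
```

Keeping the mappings in one dictionary keyed by column name means the train and test sets are always encoded with identical codes.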


In [22]:
print(df_encode.head())
print(df_encode.describe())

   BsmtExposure  BsmtFinType1  GarageFinish  KitchenQual
0             4             1             2            3
1             2             2             1            2
2             2             6             2            2
3             2             7             3            3
4             2             2             3            3
       BsmtExposure  BsmtFinType1  GarageFinish  KitchenQual
count   1168.000000   1168.000000   1168.000000  1168.000000
mean       2.625000      4.252568      2.528253     2.501712
std        1.061014      2.226369      1.000029     0.662366
min        1.000000      1.000000      1.000000     1.000000
25%        2.000000      2.000000      2.000000     2.000000
50%        2.000000      4.000000      2.000000     2.000000
75%        3.000000      7.000000      3.000000     3.000000
max        5.000000      7.000000      4.000000     4.000000


In [23]:
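import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

# TrainSet as loaded above has no 'SalePrice' column, so check before plotting.
target = 'SalePrice'
if target not in TrainSet.columns:
    print(f"'{target}' not found in TrainSet; add the target column before plotting.")
else:
    # The raw category labels are plotted, not the encoded values in df_encode.
    for var in variables_to_encode:
        plt.figure(figsize=(8, 5))
        sns.boxplot(x=TrainSet[var], y=TrainSet[target], palette="viridis")
        plt.title(f"{target} vs {var}")
        plt.xlabel(var)
        plt.ylabel(target)
        plt.show()

'SalePrice' not found in TrainSet; add the target column before plotting.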
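
Once the target is available alongside the features, the raw labels can be shown in their ordinal order rather than alphabetically by deriving that order from the mapping. A sketch on made-up data (the prices are invented for illustration and are not taken from the dataset); the resulting `order` list could be passed to the `order` argument of `sns.boxplot`:

```python
import pandas as pd

kitchen_mapping = {"Ex": 4, "Gd": 3, "TA": 2, "Fa": 1}

# Labels sorted from the lowest to the highest ordinal code.
order = sorted(kitchen_mapping, key=kitchen_mapping.get)

# Toy data: SalePrice values are invented for illustration.
toy = pd.DataFrame({
    "KitchenQual": ["TA", "Gd", "Ex", "Fa", "Gd", "TA"],
    "SalePrice": [140000, 200000, 320000, 100000, 220000, 150000],
})

# Median price per level, reported in ordinal order.
medians = toy.groupby("KitchenQual")["SalePrice"].median().reindex(order)
print(order)
print(medians)
```

A median that rises steadily across the ordered levels would support treating the variable as ordinal rather than nominal.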

In [24]:
TrainSet.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,GarageArea,GarageFinish,GrLivArea,...,TotalBsmtSF,HouseAge,RemodAge,GarageAge,TotalSF,AboveGradeSF,IsRemodeled,Has2ndFlr,HasPorch,HasDeck
0,1828,0.0,3.0,Av,48,Missing,1774,774,Unf,1828,...,1822,18,18,18.0,3650.0,1828.0,0,0,0,0
1,894,0.0,2.0,No,0,Unf,894,308,Missing,894,...,894,63,63,63.0,1788.0,894.0,0,0,0,0
2,964,0.0,2.0,No,713,ALQ,163,432,Unf,964,...,876,104,19,104.0,1840.0,964.0,1,0,0,0
3,1689,0.0,3.0,No,1218,GLQ,350,857,RFn,1689,...,1568,23,23,23.0,3257.0,1689.0,0,0,0,0
4,1541,0.0,3.0,No,0,Unf,1541,843,RFn,1541,...,1541,24,23,24.0,3082.0,1541.0,1,0,0,0
