# **Feature Engineering Notebook**

## Objectives

* Engineer features for *Regression* models

## Inputs

* <code>outputs/datasets/cleaned/TrainSetCleaned.csv</code>
* <code>outputs/datasets/cleaned/TestSetCleaned.csv</code>

## Outputs

* Generate a list of variables to engineer.

## Additional Comments - TBD

* Feature Engineering Transformers:
    * Ordinal categorical encoding:
    * Numerical transformation:
    * Outlier winsoriser:
    * Smart Correlation selection:

---

# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing-issues/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory:

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/heritage-housing-issues'
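
Re-running the directory cell above moves one more level up each time. A minimal sketch of a re-runnable variant follows, assuming the notebooks folder is named `jupyter_notebooks` as in the output above; the helper name `project_root` is hypothetical.

```python
import os
from pathlib import Path


def project_root(cwd, notebooks_dir="jupyter_notebooks"):
    """Return the parent of `cwd` if `cwd` is the notebooks folder, else `cwd` unchanged."""
    cwd = Path(cwd)
    return cwd.parent if cwd.name == notebooks_dir else cwd


# Safe to re-run: once we are in the project root, the directory no longer changes.
os.chdir(project_root(os.getcwd()))
print(os.getcwd())
```

The check on the folder name is the design choice here: the cell only moves up when it is actually inside the notebooks folder.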

---

# Load Cleaned Data

Train Set:

In [4]:
import pandas as pd
train_set = "outputs/datasets/cleaned/TrainSetCleaned.csv"
TrainSet = pd.read_csv(train_set)
TrainSet.head(5)

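| Unnamed: 0 | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | GarageArea | GarageFinish | GarageYrBlt | ... | LotArea | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1828 | 0.0 | 3.0 | Av | 48 | Unf | 1774 | 774 | Unf | 2007.0 | ... | 11694 | 90.0 | 452.0 | 108 | 5 | 9 | 1822 | 2007 | 2007 | 314813 |
| 1 | 894 | 0.0 | 2.0 | No | 0 | Unf | 894 | 308 | Unf | 1962.0 | ... | 6600 | 60.0 | 0.0 | 0 | 5 | 5 | 894 | 1962 | 1962 | 109500 |
| 2 | 964 | 0.0 | 2.0 | No | 713 | ALQ | 163 | 432 | Unf | 1921.0 | ... | 13360 | 80.0 | 0.0 | 0 | 7 | 5 | 876 | 1921 | 2006 | 163500 |
| 3 | 1689 | 0.0 | 3.0 | No | 1218 | GLQ | 350 | 857 | RFn | 2002.0 | ... | 13265 | 69.0 | 148.0 | 59 | 5 | 8 | 1568 | 2002 | 2002 | 271000 |
| 4 | 1541 | 0.0 | 3.0 | No | 0 | Unf | 1541 | 843 | RFn | 2001.0 | ... | 13704 | 118.0 | 150.0 | 81 | 5 | 7 | 1541 | 2001 | 2002 | 205000 |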


Check for missing values:

In [5]:
missing_vars_train = TrainSet.columns[TrainSet.isna().sum() > 0].to_list()
missing_vars_train

[]

Test Set:

In [6]:
test_set = "outputs/datasets/cleaned/TestSetCleaned.csv"
TestSet = pd.read_csv(test_set)
TestSet.head(5)

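| Unnamed: 0 | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | GarageArea | GarageFinish | GarageYrBlt | ... | LotArea | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2515 | 0.0 | 4.0 | No | 1219 | Rec | 816 | 484 | Unf | 1975.0 | ... | 32668 | 69.0 | 0.0 | 0 | 3 | 6 | 2035 | 1957 | 1975 | 200624 |
| 1 | 958 | 620.0 | 3.0 | No | 403 | BLQ | 238 | 240 | Unf | 1941.0 | ... | 9490 | 79.0 | 0.0 | 0 | 7 | 6 | 806 | 1941 | 1950 | 133000 |
| 2 | 979 | 224.0 | 3.0 | No | 185 | LwQ | 524 | 352 | Unf | 1950.0 | ... | 7015 | 69.0 | 161.0 | 0 | 4 | 5 | 709 | 1950 | 1950 | 110000 |
| 3 | 1156 | 866.0 | 4.0 | No | 392 | BLQ | 768 | 505 | Fin | 1977.0 | ... | 10005 | 83.0 | 299.0 | 117 | 5 | 7 | 1160 | 1977 | 1977 | 192000 |
| 4 | 525 | 0.0 | 3.0 | No | 0 | Unf | 525 | 264 | Unf | 1971.0 | ... | 1680 | 21.0 | 381.0 | 0 | 5 | 6 | 525 | 1971 | 1971 | 88000 |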


Check for missing values:

In [7]:
missing_vars_test = TestSet.columns[TestSet.isna().sum() > 0].to_list()
missing_vars_test

[]

Neither the <code>Train</code> nor the <code>Test</code> set contains missing values.
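
The same check is applied to both sets, so it can be wrapped in a small helper. This is a minimal sketch: the function name `columns_with_missing` is hypothetical, and the toy frames use made-up values in place of the real `TrainSet` and `TestSet`.

```python
import numpy as np
import pandas as pd


def columns_with_missing(df):
    """Return the names of columns that contain at least one NaN."""
    return df.columns[df.isna().any()].to_list()


# Toy frames standing in for TrainSet and TestSet (values are illustrative only).
toy_train = pd.DataFrame({"LotArea": [11694, 6600], "LotFrontage": [90.0, 60.0]})
toy_test = pd.DataFrame({"LotArea": [32668, 9490], "LotFrontage": [69.0, np.nan]})

for name, df in {"train": toy_train, "test": toy_test}.items():
    print(name, columns_with_missing(df))
```

In the notebook, the helper would be called with `TrainSet` and `TestSet` instead of the toy frames.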

---

# Data Exploration

We use the *ProfileReport* to inspect each variable and decide which transformers to apply:

In [8]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=TrainSet, minimal=True)
pandas_report.to_notebook_iframe()

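
As a lightweight cross-check of the report, the share of zeros and the skewness of each numeric variable can be computed with pandas alone. This is a sketch under assumptions: the helper name `numeric_summary` is hypothetical, and the toy frame holds illustrative values rather than the real train set.

```python
import pandas as pd


def numeric_summary(df):
    """Return the percentage of zeros and the skewness for each numeric column."""
    numeric = df.select_dtypes(include="number")
    summary = pd.DataFrame({
        "zeros_pct": (numeric == 0).mean() * 100,
        "skew": numeric.skew(),
    })
    return summary.sort_values("zeros_pct", ascending=False)


# Toy frame standing in for TrainSet (values are illustrative only).
toy = pd.DataFrame({
    "2ndFlrSF": [0.0, 0.0, 0.0, 866.0],
    "GarageArea": [774, 308, 432, 857],
    "BsmtExposure": ["Av", "No", "No", "No"],
})
print(numeric_summary(toy))
```

Categorical columns such as `BsmtExposure` are excluded by `select_dtypes`, so they do not appear in the summary.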

#### Findings:

* *7* variables contain zeros. We are already aware of this, so we take no action on it.
* Categorical encoder:
    * <code>BsmtExposure</code>
    * <code>BsmtFinType1</code>
    * <code>GarageFinish</code>
    * <code>KitchenQual</code>
* Outlier winsoriser:
    * <code>GarageArea</code>
    * <code>LotArea</code>
    * <code>LotFrontage</code>
    * <code>MasVnrArea</code>
    * <code>OpenPorchSF</code>
    * <code>1stFlrSF</code>
    * <code>2ndFlrSF</code>
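
The categorical encoding listed above can be sketched in plain pandas. This is a stand-in for an ordinal encoder transformer, not the transformer itself. The orderings in `ORDERINGS` are assumptions and should be confirmed against the data dictionary; the helper name `ordinal_encode` is hypothetical.

```python
import pandas as pd

# Assumed orderings, worst to best; confirm against the data dictionary before use.
ORDERINGS = {
    "BsmtExposure": ["None", "No", "Mn", "Av", "Gd"],
    "GarageFinish": ["None", "Unf", "RFn", "Fin"],
}


def ordinal_encode(df, orderings):
    """Return a copy of `df` with each listed column replaced by its integer rank."""
    encoded = df.copy()
    for col, order in orderings.items():
        cat = pd.Categorical(encoded[col], categories=order, ordered=True)
        encoded[col] = cat.codes  # -1 marks a label that is missing from `order`
    return encoded


# Toy frame standing in for TrainSet (values are illustrative only).
toy = pd.DataFrame({
    "BsmtExposure": ["Av", "No", "No"],
    "GarageFinish": ["Unf", "RFn", "Fin"],
})
print(ordinal_encode(toy, ORDERINGS))
```

An explicit ordering keeps the integer codes meaningful for a regression model, whereas an arbitrary ordering would depend on the order in which labels appear.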

---

# Feature Engineering
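
The outlier handling identified in the findings can be sketched as an IQR-based cap written in plain pandas. This is a stand-in for a winsoriser transformer such as feature_engine's `Winsorizer`, not that transformer's implementation. The helper names `fit_iqr_caps` and `apply_caps` and the `fold=1.5` default are assumptions, and the toy values are illustrative only.

```python
import pandas as pd


def fit_iqr_caps(train, columns, fold=1.5):
    """Learn lower and upper caps per column from the train set only."""
    caps = {}
    for col in columns:
        q1, q3 = train[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        caps[col] = (q1 - fold * iqr, q3 + fold * iqr)
    return caps


def apply_caps(df, caps):
    """Return a copy of `df` with each capped column clipped to its learned bounds."""
    capped = df.copy()
    for col, (lower, upper) in caps.items():
        capped[col] = capped[col].clip(lower=lower, upper=upper)
    return capped


# Toy frames standing in for TrainSet and TestSet.
toy_train = pd.DataFrame({"LotArea": [6600, 11694, 13265, 13360, 13704]})
toy_test = pd.DataFrame({"LotArea": [1680, 10005, 32668]})

caps = fit_iqr_caps(toy_train, ["LotArea"])
print(caps)
print(apply_caps(toy_test, caps))
```

The caps are learned from the train set and then reused on the test set, so no information from the test set leaks into the transformation.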

---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should be run top-down. Do not create a workflow in which you need to go back to a previous cell to execute a task, such as refreshing the content of a variable.

---

# Push files to Repo

* If you do not need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the otherwise empty try block is valid Python
except Exception as e:
  print(e)
