# Feature Engineering

After analyzing the data, the next step is to engineer the features. This is still part of pre-processing: we fill in variables that contain NaNs, discard rare labels, fix the distributions of variables, and split the data into training and testing sets.

In [47]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [48]:
data = pd.read_csv("train.csv")
data.head()

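|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |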


## Splitting the data into testing and training sets 

We split the data before any further pre-processing. The test set must not be used to learn the transform parameters (fill values, scaling ranges, and so on); it has to behave like completely new data.

In [49]:
x_train, x_test, y_train, y_test = train_test_split(data, data.SalePrice, test_size=0.1, random_state=0)
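To illustrate why the split comes first, here is a minimal sketch on made-up data (the column name `LotArea` is borrowed from the dataset, but the values are invented). The scaler's minimum and maximum are learned from the training rows only and then reused on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {"LotArea": [8450, 9600, 11250, 9550, 14260, 7000, 12000, 10500, 9000, 13000]}
)

toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)

scaler = MinMaxScaler()
scaler.fit(toy_train)                      # min/max come from the training rows only
train_scaled = scaler.transform(toy_train)
test_scaled = scaler.transform(toy_test)   # test rows reuse the training min/max

print(scaler.data_min_, scaler.data_max_)
```

Because the test rows are scaled with the training range, a test value can land slightly outside [0, 1]. That is expected and is the price of treating the test set as unseen data.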

## Filling up the missing (NaN) values from the categorical and continuous variables

Some variables contain NaN values, and we want to replace them. The treatment differs slightly between categorical and numerical variables.

### Categorical variables 

In [50]:
# Keep the categorical (object-dtype) columns that have more than one missing value in the training set
cat_vars = [var for var in data.columns if x_train[var].dtypes == 'O' and x_train[var].isnull().sum() > 1 ]
cat_vars

['Alley',
 'MasVnrType',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'FireplaceQu',
 'GarageType',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PoolQC',
 'Fence',
 'MiscFeature']

In [51]:
for var in cat_vars:
    # isnull().mean() is a fraction, so scale by 100 to report a percentage
    print(var, "has", np.round(100 * x_train[var].isnull().mean(), 1), "% values missing")

Alley has 93.8 % values missing
MasVnrType has 0.5 % values missing
BsmtQual has 2.4 % values missing
BsmtCond has 2.4 % values missing
BsmtExposure has 2.5 % values missing
BsmtFinType1 has 2.4 % values missing
BsmtFinType2 has 2.5 % values missing
FireplaceQu has 47.3 % values missing
GarageType has 5.6 % values missing
GarageFinish has 5.6 % values missing
GarageQual has 5.6 % values missing
GarageCond has 5.6 % values missing
PoolQC has 99.5 % values missing
Fence has 81.4 % values missing
MiscFeature has 96.1 % values missing
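The same kind of summary can be computed without a loop. The following is a minimal sketch on a small made-up frame (the values are invented; only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, "Pave", np.nan],
    "Fence": ["MnPrv", np.nan, np.nan, "GdWo"],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})

# isnull().mean() is the fraction of missing rows per column; times 100 gives a percentage
missing_pct = toy.isnull().mean().mul(100).round(1)

# Keep only the columns that actually have missing values, largest first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Sorting the result makes it easy to spot variables that are almost entirely empty.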


In [52]:
def fill_cat_na(data, variables):
    """Return a copy of `data` with NaNs in `variables` replaced by the label "Missing"."""
    df = data.copy()
    df[variables] = df[variables].fillna("Missing")
    return df

x_train = fill_cat_na(x_train, cat_vars)
x_test = fill_cat_na(x_test, cat_vars)

# Confirm that no missing values remain in these columns
x_train[cat_vars].isnull().sum()

Alley           0
MasVnrType      0
BsmtQual        0
BsmtCond        0
BsmtExposure    0
BsmtFinType1    0
BsmtFinType2    0
FireplaceQu     0
GarageType      0
GarageFinish    0
GarageQual      0
GarageCond      0
PoolQC          0
Fence           0
MiscFeature     0
dtype: int64
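Once filled, "Missing" behaves like any other category, so it can be counted and grouped on. A minimal sketch with invented values (only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    "FireplaceQu": ["Gd", np.nan, "TA", np.nan, "Gd"],
    "SalePrice": [208500, 140000, 223500, 118000, 250000],
})

toy["FireplaceQu"] = toy["FireplaceQu"].fillna("Missing")

# "Missing" now shows up as its own label in counts and group statistics
label_counts = toy["FireplaceQu"].value_counts()
median_price = toy.groupby("FireplaceQu")["SalePrice"].median()
print(label_counts)
print(median_price)
```

This keeps the information that a value was absent, which may itself be related to the target.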

### Numerical Variables

For these variables, rather than inserting a "Missing" label, we replace NaNs with a value estimated from the data. Common choices are the mean, the median, or the mode; here we use the mode.

In [53]:
num_vars = [var for var in data.columns if data[var].dtypes != 'O' and data[var].isnull().sum() > 1]
num_vars

['LotFrontage', 'MasVnrArea', 'GarageYrBlt']

In [54]:
# Percentage of missing values for each variable
for var in num_vars:
    # isnull().mean() is a fraction, so scale by 100 to report a percentage
    print(var, "has", np.round(100 * data[var].isnull().mean(), 1), "% values missing")

LotFrontage has 17.7 % values missing
MasVnrArea has 0.5 % values missing
GarageYrBlt has 5.5 % values missing


In [55]:
for var in num_vars:
    mode = x_train[var].mode()[0]  # most frequent value, learned from the training set only
    # Assign the result back instead of calling fillna(inplace=True) on a column selection
    x_train[var] = x_train[var].fillna(mode)
    x_test[var] = x_test[var].fillna(mode)
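To make the choice of fill value concrete, here is a minimal sketch on invented numbers (the column name `LotFrontage` is from the dataset, the values are not). All candidate statistics are computed on the training rows only, and the chosen one is applied to both sets.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy_train = pd.DataFrame({"LotFrontage": [60.0, 60.0, 65.0, 80.0, np.nan, 120.0]})
toy_test = pd.DataFrame({"LotFrontage": [np.nan, 70.0]})

# Candidate fill values, all computed on the training rows (NaNs are skipped)
candidates = {
    "mean": toy_train["LotFrontage"].mean(),
    "median": toy_train["LotFrontage"].median(),
    "mode": toy_train["LotFrontage"].mode()[0],
}
print(candidates)

# Use the training mode for both sets, mirroring the loop above
fill_value = candidates["mode"]
toy_train["LotFrontage"] = toy_train["LotFrontage"].fillna(fill_value)
toy_test["LotFrontage"] = toy_test["LotFrontage"].fillna(fill_value)
```

On skewed variables the three statistics can differ noticeably (here 77, 65, and 60), so it is worth checking which one best represents a typical value.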

## Making numerical variables normally distributed

Transforming the numerical variables that contain no zeros so that they are closer to a Gaussian distribution can help linear models.
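The transform itself is not shown in this part of the notebook. A logarithm is one common choice and is assumed in the sketch below, which uses invented, right-skewed values to show how the skewness changes.

```python
import numpy as np
import pandas as pd

# Invented, right-skewed, strictly positive values (illustration only)
toy = pd.DataFrame({"LotArea": [5000, 7000, 8000, 8500, 9000, 9500, 11000, 16000, 50000]})

# np.log is only defined for positive inputs, hence the "no zeros" condition
toy["LotArea_log"] = np.log(toy["LotArea"])

# Sample skewness before and after; values nearer 0 indicate a more symmetric shape
skew_before = toy["LotArea"].skew()
skew_after = toy["LotArea_log"].skew()
print(skew_before, skew_after)
```

The logarithm compresses large values more than small ones, which pulls in the long right tail. It does not guarantee a Gaussian shape, so plotting a histogram afterwards is still advisable.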

In [56]:
fnum_vars = [var for var in data.columns if data[var].dtypes != 'O' and var not in cat_vars+['Id'] and "Yr" not in var and "Year" not in var]

In [62]:
# DataFrame.isin() needs a list-like argument; a bare int such as isin(0) raises a TypeError
x_train[fnum_vars].isin([0])
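One way to find the variables that contain no zeros is to compare the whole frame against 0 and reduce per column. This is a minimal sketch on invented values; `has_zero` and `no_zero_vars` are hypothetical names, not ones used in the notebook.

```python
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "MasVnrArea": [196.0, 0.0, 162.0],
    "PoolArea": [0, 0, 0],
    "SalePrice": [208500, 181500, 223500],
})

# Boolean Series: True for columns that contain at least one zero
has_zero = (toy == 0).any()

# Columns with no zeros are the candidates for a log-type transform
no_zero_vars = has_zero[~has_zero].index.tolist()
print(no_zero_vars)
```

`toy.isin([0]).any()` gives the same Series; the comparison form is simply shorter when only one value is being tested.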

In [60]:
new

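|   | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 930 | 20 | 73.0 | 8925 | 8 | 5 | 0.0 | 16 | 0 | 1450 | 1466 | ... | 610 | 100 | 18 | 0 | 0 | 0 | 0 | 0 | 7 | 201000 |
| 656 | 20 | 72.0 | 10007 | 5 | 7 | 54.0 | 806 | 0 | 247 | 1053 | ... | 312 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 145500 |
| 45 | 120 | 61.0 | 7658 | 9 | 5 | 412.0 | 456 | 0 | 1296 | 1752 | ... | 576 | 196 | 82 | 0 | 0 | 0 | 0 | 0 | 2 | 319900 |
| 1348 | 20 | 60.0 | 16196 | 7 | 5 | 0.0 | 1443 | 0 | 39 | 1482 | ... | 514 | 402 | 25 | 0 | 0 | 0 | 0 | 0 | 8 | 215000 |
| 55 | 20 | 100.0 | 10175 | 6 | 5 | 272.0 | 490 | 0 | 935 | 1425 | ... | 576 | 0 | 0 | 0 | 407 | 0 | 0 | 0 | 7 | 180500 |
| 1228 | 120 | 65.0 | 8769 | 9 | 5 | 766.0 | 1540 | 0 | 162 | 1702 | ... | 1052 | 0 | 72 | 0 | 0 | 224 | 0 | 0 | 10 | 367294 |
| 963 | 20 | 122.0 | 11923 | 9 | 5 | 0.0 | 0 | 0 | 1800 | 1800 | ... | 702 | 288 | 136 | 0 | 0 | 0 | 0 | 0 | 5 | 239000 |
| 921 | 90 | 67.0 | 8777 | 5 | 7 | 0.0 | 1084 | 0 | 188 | 1272 | ... | 0 | 0 | 70 | 0 | 0 | 0 | 0 | 0 | 9 | 145900 |
| 458 | 70 | 60.0 | 5100 | 8 | 7 | 0.0 | 0 | 0 | 588 | 588 | ... | 228 | 192 | 63 | 0 | 0 | 0 | 0 | 0 | 6 | 161000 |
| 1386 | 60 | 80.0 | 16692 | 7 | 5 | 184.0 | 790 | 469 | 133 | 1392 | ... | 564 | 0 | 112 | 0 | 0 | 440 | 519 | 2000 | 7 | 250000 |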
