# Data Preperation

* This notebook contains the detailed working and testing for data preparation.
* All the contents of the summarised steps are included in the modelling workbook
* Further data features may have been added in the modelling phase. This was just the cleaning and set up I thought was necessary as a starting point prior to modelling.
<br><br><br>
Overall steps for data preparation will be: 

0. Import modules and initialise data frame
1. Deal with any null values
2. Create additional bespoke data features
3. Create manual OneHotEncoding
4. Design code for target_encoded columns
5. Design code for ordinal_encoded columns
6. Design code for onehot encoded columns
7. Run individual code sets and expected modelling data set (noting params in pipeline that may change)
<br><br>

Originally had a step:
*Extract file for use in model pipeline (enables target encoding parameters to be manipulated)*  

Decided to remove this step since I thought it would just complicate adding further features once I was in the modelling phase.

## 0. Import modules and data set, adjust pandas settings


In [1]:
import numpy as np
import pandas as pd
import category_encoders as ce
import sklearn.pipeline as pipeline

In [2]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [3]:
df_orig = pd.read_csv(r"C:\Users\Jonat\ga\Material\Unit 3\homework\data\iowa_full.csv")

In [4]:
df = df_orig.copy()

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [6]:
df.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


***

## 1. Deal with any null values

Below section steps through logic and checks. See summary at the end for all adjustments in a single point.


In [7]:
# Use function to add in indicators for presence of null values

In [6]:
def denote_null_values(df):
    """Denotes whether or not there are null values or not"""
    empty_cols_query = df.isnull().sum() > 0
    empty_df_cols = df.loc[:, empty_cols_query].columns.tolist()
    for col in empty_df_cols:
        col_name = f"{col}_missing"
        df[col_name] = pd.isnull(df[col])
    return df

In [7]:
df = denote_null_values(df)

In [8]:
df.info()
# This shwos an additional 19 "_missing" columns so the function work properly.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 100 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Id                    1460 non-null   int64  
 1   MSSubClass            1460 non-null   int64  
 2   MSZoning              1460 non-null   object 
 3   LotFrontage           1201 non-null   float64
 4   LotArea               1460 non-null   int64  
 5   Street                1460 non-null   object 
 6   Alley                 91 non-null     object 
 7   LotShape              1460 non-null   object 
 8   LandContour           1460 non-null   object 
 9   Utilities             1460 non-null   object 
 10  LotConfig             1460 non-null   object 
 11  LandSlope             1460 non-null   object 
 12  Neighborhood          1460 non-null   object 
 13  Condition1            1460 non-null   object 
 14  Condition2            1460 non-null   object 
 15  BldgType            

***

In [8]:
# LotFrontage - replace nulls using average for the neighbourhood.
# get a DF to join to the data set as a new column
lotfrontage_neighborhood_mean = df.groupby(by=['Neighborhood'])[['LotFrontage']].mean().reset_index()
lotfrontage_neighborhood_mean.columns = ['Neighborhood','LotFrontage_Neighborhood_Mean']
lotfrontage_neighborhood_mean

Unnamed: 0,Neighborhood,LotFrontage_Neighborhood_Mean
0,Blmngtn,47.142857
1,Blueste,24.0
2,BrDale,21.5625
3,BrkSide,57.509804
4,ClearCr,83.461538
5,CollgCr,71.68254
6,Crawfor,71.804878
7,Edwards,68.217391
8,Gilbert,79.877551
9,IDOTRR,62.5


In [9]:
df = df.merge(lotfrontage_neighborhood_mean,how='left',left_on='Neighborhood',right_on='Neighborhood')

In [10]:
df['LotFrontage'] = df['LotFrontage'].fillna(df.LotFrontage_Neighborhood_Mean)

In [11]:
df.drop('LotFrontage_Neighborhood_Mean',axis=1,inplace=True)

In [12]:
def LotFrontage_na_calc(training_df):
    lotfrontage_neighborhood_mean = training_df.groupby(by=['Neighborhood'])[['LotFrontage']].mean().reset_index()
    lotfrontage_neighborhood_mean.columns = ['Neighborhood','LotFrontage_Neighborhood_Mean']
    return lotfrontage_neighborhood_mean

def LotFrontage_na_apply(training_df, testing_df, validation_df=None):
    # Calc mean based on training data
    lnm = LotFrontage_na_calc(training_df)
    
    # Apply mean to training data - for neighbourhood
    # Reset LotFrontage NaN in case they have been filled in a prior run
    training_df['LotFrontage'] = np.where(training_df['LotFrontage_missing']==True,np.nan,training_df['LotFrontage'])
    training_df = training_df.merge(lnm,how='left',left_on='Neighborhood',right_on='Neighborhood')
    training_df['LotFrontage'] = training_df['LotFrontage'].fillna(training_df.LotFrontage_Neighborhood_Mean)
    training_df.drop('LotFrontage_Neighborhood_Mean',axis=1,inplace=True)
    
    # Apply mean to testing data
    # Reset LotFrontage NaN in case they have been filled in a prior run
    testing_df['LotFrontage'] = np.where(testing_df['LotFrontage_missing']==True,np.nan,testing_df['LotFrontage'])
    testing_df = testing_df.merge(lnm,how='left',left_on='Neighborhood',right_on='Neighborhood')
    testing_df['LotFrontage'] = testing_df['LotFrontage'].fillna(testing_df.LotFrontage_Neighborhood_Mean)
    testing_df.drop('LotFrontage_Neighborhood_Mean',axis=1,inplace=True)
    # Fill the training sample mean if a specific neighborhood is missing from the training sample
    testing_df['LotFrontage'] = testing_df['LotFrontage'].fillna(training_df['LotFrontage'].mean())

    if validation_df is None:
        return training_df, testing_df
    else:
        # Apply mean to validation data set
        validation_df['LotFrontage'] = np.where(validation_df['LotFrontage_missing']==True,np.nan,validation_df['LotFrontage'])
        validation_df = validation_df.merge(lnm,how='left',left_on='Neighborhood',right_on='Neighborhood')
        validation_df['LotFrontage'] = validation_df['LotFrontage'].fillna(validation_df.LotFrontage_Neighborhood_Mean)
        validation_df.drop('LotFrontage_Neighborhood_Mean',axis=1,inplace=True)        
        validation_df['LotFrontage'] = validation_df['LotFrontage'].fillna(training_df['LotFrontage'].mean())
        return training_df, testing_df,validation_df

In [13]:
# Test the functions above
train = df.sample(frac=0.3,random_state=743)
test = df.drop(train.index)
train,val = train.iloc[:-100],train.iloc[-100:]

In [14]:
print(f"train size {train.shape[0]} and test size {test.shape[0]} and val size {val.shape[0]}")
print(f"total size {df.shape[0]} and check size {train.shape[0] + test.shape[0] + val.shape[0]}")

train size 338 and test size 1022 and val size 100
total size 1460 and check size 1460


In [129]:
train,test, val = LotFrontage_na_apply(train, test, val)

In [116]:
float(9.00000).is_integer()

True

In [15]:
# Exclude the numbers that end evenly (i.e. original data), and look at results
# Then compare with same code for the test set
# Realised after could have just used LotFrontage_missing!; Probably simpler and clearer
# train[~(train['LotFrontage'].apply(lambda x: x.is_integer()))].groupby(by='Neighborhood')['LotFrontage'].value_counts()
train[(train.LotFrontage_missing==True)].groupby(by='Neighborhood')['LotFrontage'].value_counts()

Neighborhood  LotFrontage
Blmngtn       47.142857       1
ClearCr       83.461538       6
CollgCr       71.682540       3
Crawfor       71.804878       4
Edwards       68.217391       2
Gilbert       79.877551       4
IDOTRR        62.500000       1
MeadowV       27.800000       1
Mitchel       70.083333       6
NAmes         76.462366       7
NWAmes        81.288889       4
SWISU         58.913043       1
Sawyer        74.437500      12
SawyerW       71.500000       1
Somerst       64.666667       5
StoneBr       62.700000       1
Name: LotFrontage, dtype: int64

In [16]:
train[(train['Neighborhood'] == 'BrkSide')]['LotFrontage'].mean()

60.07142857142857

In [17]:
#test[~(test['LotFrontage'].apply(lambda x: x.is_integer()))].groupby(by='Neighborhood')['LotFrontage'].value_counts()
test[(test.LotFrontage_missing==True)].groupby(by='Neighborhood')['LotFrontage'].value_counts()

Neighborhood  LotFrontage
Blmngtn       47.142857       2
BrkSide       57.509804       7
ClearCr       83.461538       8
CollgCr       71.682540      19
Crawfor       71.804878       5
Edwards       68.217391       5
Gilbert       79.877551      25
IDOTRR        62.500000       2
MeadowV       27.800000       1
Mitchel       70.083333       6
NAmes         76.462366      31
NPkVill       32.285714       2
NWAmes        81.288889      20
NoRidge       91.878788       8
NridgHt       81.881579       1
OldTown       62.788991       4
SWISU         58.913043       1
Sawyer        74.437500      12
SawyerW       71.500000       7
Somerst       64.666667       3
StoneBr       62.700000       3
Timber        80.133333       8
Veenker       59.714286       4
Name: LotFrontage, dtype: int64

In [18]:
#val[~(val['LotFrontage'].apply(lambda x: x.is_integer()))].groupby(by='Neighborhood')['LotFrontage'].value_counts()
val[(val.LotFrontage_missing==True)].groupby(by='Neighborhood')['LotFrontage'].value_counts()

Neighborhood  LotFrontage
ClearCr       83.461538      1
CollgCr       71.682540      2
Crawfor       71.804878      1
Edwards       68.217391      1
Gilbert       79.877551      1
Mitchel       70.083333      1
NAmes         76.462366      1
NWAmes        81.288889      4
Sawyer        74.437500      2
SawyerW       71.500000      1
StoneBr       62.700000      1
Name: LotFrontage, dtype: int64

In [19]:
train[['LotFrontage','LotFrontage_missing']]

Unnamed: 0,LotFrontage,LotFrontage_missing
1139,98.0,False
412,64.666667,True
1186,107.0,False
1173,138.0,False
395,68.0,False
266,70.0,False
507,75.0,False
21,57.0,False
724,86.0,False
817,70.083333,True


In [20]:
train['LotFrontage'] = np.where(train['LotFrontage_missing']==True,np.nan,train['LotFrontage'])

***

In [21]:
# Create AlleyAccess_Flag
df['Alley'].value_counts()

Grvl    50
Pave    41
Name: Alley, dtype: int64

In [16]:
# ?np.where

In [22]:
df['AlleyAccess_Flag'] = np.where(df['Alley'].isnull(),0,1)

In [18]:
df.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,LotFrontage_missing,Alley_missing,MasVnrType_missing,MasVnrArea_missing,BsmtQual_missing,BsmtCond_missing,BsmtExposure_missing,BsmtFinType1_missing,BsmtFinType2_missing,Electrical_missing,FireplaceQu_missing,GarageType_missing,GarageYrBlt_missing,GarageFinish_missing,GarageQual_missing,GarageCond_missing,PoolQC_missing,Fence_missing,MiscFeature_missing,AlleyAccess_Flag
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,True,True,0
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0


In [23]:
df[(df['AlleyAccess_Flag']==1)].head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,LotFrontage_missing,Alley_missing,MasVnrType_missing,MasVnrArea_missing,BsmtQual_missing,BsmtCond_missing,BsmtExposure_missing,BsmtFinType1_missing,BsmtFinType2_missing,Electrical_missing,FireplaceQu_missing,GarageType_missing,GarageYrBlt_missing,GarageFinish_missing,GarageQual_missing,GarageCond_missing,PoolQC_missing,Fence_missing,MiscFeature_missing,AlleyAccess_Flag
21,22,45,RM,57.0,7449,Pave,Grvl,Reg,Bnk,AllPub,Inside,Gtl,IDOTRR,Norm,Norm,1Fam,1.5Unf,7,7,1930,1950,Gable,CompShg,Wd Sdng,Wd Sdng,,0.0,TA,TA,PConc,TA,TA,No,Unf,0,Unf,0,637,637,GasA,Ex,Y,FuseF,1108,0,0,1108,0,0,1,0,3,1,Gd,6,Typ,1,Gd,Attchd,1930.0,Unf,1,280,TA,TA,N,0,0,205,0,0,0,,GdPrv,,0,6,2007,WD,Normal,139400,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,1
30,31,70,C (all),50.0,8500,Pave,Pave,Reg,Lvl,AllPub,Inside,Gtl,IDOTRR,Feedr,Norm,1Fam,2Story,4,4,1920,1950,Gambrel,CompShg,BrkFace,BrkFace,,0.0,TA,Fa,BrkTil,TA,TA,No,Unf,0,Unf,0,649,649,GasA,TA,N,SBrkr,649,668,0,1317,0,0,1,0,3,1,TA,6,Typ,0,,Detchd,1920.0,Unf,1,250,TA,Fa,N,0,54,172,0,0,0,,MnPrv,,0,7,2008,WD,Normal,40000,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,True,1
56,57,160,FV,24.0,2645,Pave,Pave,Reg,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,Twnhs,2Story,8,5,1999,2000,Gable,CompShg,MetalSd,MetalSd,BrkFace,456.0,Gd,TA,PConc,Gd,TA,No,GLQ,649,Unf,0,321,970,GasA,Ex,Y,SBrkr,983,756,0,1739,1,0,2,1,3,1,Gd,7,Typ,0,,Attchd,1999.0,Fin,2,480,TA,TA,Y,115,0,0,0,0,0,,,,0,8,2009,WD,Abnorml,172500,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,True,True,1
79,80,50,RM,60.0,10440,Pave,Grvl,Reg,Lvl,AllPub,Corner,Gtl,OldTown,Norm,Norm,1Fam,2Story,5,6,1910,1981,Gable,CompShg,Wd Sdng,Wd Sdng,,0.0,TA,TA,PConc,TA,TA,No,Unf,0,Unf,0,440,440,GasA,Gd,Y,SBrkr,682,548,0,1230,0,0,1,1,2,1,TA,5,Typ,0,,Detchd,1966.0,Unf,2,440,TA,TA,Y,74,0,128,0,0,0,,MnPrv,,0,5,2009,WD,Normal,110000,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,True,1
87,88,160,FV,40.0,3951,Pave,Pave,Reg,Lvl,AllPub,Corner,Gtl,Somerst,Norm,Norm,TwnhsE,2Story,6,5,2009,2009,Gable,CompShg,VinylSd,VinylSd,Stone,76.0,Gd,TA,PConc,Gd,TA,Av,Unf,0,Unf,0,612,612,GasA,Ex,Y,SBrkr,612,612,0,1224,0,0,2,1,2,1,Gd,4,Typ,0,,Detchd,2009.0,RFn,2,528,TA,TA,Y,0,234,0,0,0,0,,,,0,6,2009,New,Partial,164500,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,True,True,1


In [24]:
df['Alley'] = df['Alley'].fillna('no_access')

In [25]:
df['MasVnrType'].value_counts()

None       864
BrkFace    445
Stone      128
BrkCmn      15
Name: MasVnrType, dtype: int64

In [26]:
df['MasVnrType'] = df['MasVnrType'].fillna('None')

In [27]:
df['MasVnrArea'] = df['MasVnrArea'].fillna(0)

***

In [28]:
df[(df.BsmtQual_missing==True)]

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,LotFrontage_missing,Alley_missing,MasVnrType_missing,MasVnrArea_missing,BsmtQual_missing,BsmtCond_missing,BsmtExposure_missing,BsmtFinType1_missing,BsmtFinType2_missing,Electrical_missing,FireplaceQu_missing,GarageType_missing,GarageYrBlt_missing,GarageFinish_missing,GarageQual_missing,GarageCond_missing,PoolQC_missing,Fence_missing,MiscFeature_missing,AlleyAccess_Flag
17,18,90,RL,72.0,10791,Pave,no_access,Reg,Lvl,AllPub,Inside,Gtl,Sawyer,Norm,Norm,Duplex,1Story,4,5,1967,1967,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,Slab,,,,,0,,0,0,0,GasA,TA,Y,SBrkr,1296,0,0,1296,0,0,2,0,2,2,TA,6,Typ,0,,CarPort,1967.0,Unf,2,516,TA,TA,Y,0,0,0,0,0,0,,,Shed,500,10,2006,WD,Normal,90000,False,True,False,False,True,True,True,True,True,False,True,False,False,False,False,False,True,True,False,0
39,40,90,RL,65.0,6040,Pave,no_access,Reg,Lvl,AllPub,Inside,Gtl,Edwards,Norm,Norm,Duplex,1Story,4,5,1955,1955,Gable,CompShg,AsbShng,Plywood,,0.0,TA,TA,PConc,,,,,0,,0,0,0,GasA,TA,N,FuseP,1152,0,0,1152,0,0,2,0,2,2,Fa,6,Typ,0,,,,,0,0,,,N,0,0,0,0,0,0,,,,0,6,2008,WD,AdjLand,82000,False,True,False,False,True,True,True,True,True,False,True,True,True,True,True,True,True,True,True,0
90,91,20,RL,60.0,7200,Pave,no_access,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,4,5,1950,1950,Gable,CompShg,BrkFace,Wd Sdng,,0.0,TA,TA,Slab,,,,,0,,0,0,0,GasA,TA,Y,FuseA,1040,0,0,1040,0,0,1,0,2,1,TA,4,Typ,0,,Detchd,1950.0,Unf,2,420,TA,TA,Y,0,29,0,0,0,0,,,,0,7,2006,WD,Normal,109900,False,True,False,False,True,True,True,True,True,False,True,False,False,False,False,False,True,True,True,0
102,103,90,RL,64.0,7018,Pave,no_access,Reg,Bnk,AllPub,Inside,Gtl,SawyerW,Norm,Norm,Duplex,1Story,5,5,1979,1979,Gable,CompShg,HdBoard,HdBoard,,0.0,TA,Fa,Slab,,,,,0,,0,0,0,GasA,TA,Y,SBrkr,1535,0,0,1535,0,0,2,0,4,2,TA,8,Typ,0,,Attchd,1979.0,Unf,2,410,TA,TA,Y,0,0,0,0,0,0,,,,0,6,2009,WD,Alloca,118964,False,True,False,False,True,True,True,True,True,False,True,False,False,False,False,False,True,True,True,0
156,157,20,RL,60.0,7200,Pave,no_access,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,7,1950,1950,Hip,CompShg,Wd Sdng,Wd Sdng,,0.0,TA,TA,CBlock,,,,,0,,0,0,0,GasA,TA,Y,FuseF,1040,0,0,1040,0,0,1,0,2,1,TA,5,Typ,0,,Detchd,1950.0,Unf,2,625,TA,TA,Y,0,0,0,0,0,0,,,,0,6,2006,WD,Normal,109500,False,True,False,False,True,True,True,True,True,False,True,False,False,False,False,False,True,True,True,0
182,183,20,RL,60.0,9060,Pave,no_access,Reg,Lvl,AllPub,Inside,Gtl,Edwards,Artery,Norm,1Fam,1Story,5,6,1957,2006,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,98.0,TA,TA,PConc,,,,,0,,0,0,0,GasA,Ex,Y,SBrkr,1340,0,0,1340,0,0,1,0,3,1,TA,7,Typ,1,Gd,Attchd,1957.0,RFn,1,252,TA,TA,Y,116,0,0,180,0,0,,MnPrv,,0,6,2007,WD,Normal,120000,False,True,False,False,True,True,True,True,True,False,False,False,False,False,False,False,True,False,True,0
259,260,20,RM,70.0,12702,Pave,no_access,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,1Fam,1Story,5,5,1956,1956,Gable,CompShg,BrkFace,BrkFace,,0.0,TA,TA,PConc,,,,,0,,0,0,0,GasA,Gd,Y,FuseA,882,0,0,882,0,0,1,0,2,1,TA,4,Typ,0,,Detchd,1956.0,Unf,1,308,TA,TA,Y,0,45,0,0,0,0,,,,0,12,2008,WD,Normal,97000,False,True,False,False,True,True,True,True,True,False,True,False,False,False,False,False,True,True,True,0
342,343,90,RL,76.462366,8544,Pave,no_access,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,Duplex,1Story,3,4,1949,1950,Gable,CompShg,Stucco,Stucco,BrkFace,340.0,TA,TA,Slab,,,,,0,,0,0,0,Wall,Fa,N,FuseA,1040,0,0,1040,0,0,2,0,2,2,TA,6,Typ,0,,Detchd,1949.0,Unf,2,400,TA,TA,Y,0,0,0,0,0,0,,,,0,5,2006,WD,Normal,87500,True,True,False,False,True,True,True,True,True,False,True,False,False,False,False,False,True,True,True,0
362,363,85,RL,64.0,7301,Pave,no_access,Reg,Lvl,AllPub,Corner,Gtl,Edwards,Norm,Norm,1Fam,SFoyer,7,5,2003,2003,Gable,CompShg,HdBoard,HdBoard,BrkFace,500.0,Gd,TA,Slab,,,,,0,,0,0,0,GasA,Ex,Y,SBrkr,495,1427,0,1922,0,0,3,0,4,1,Gd,7,Typ,1,Ex,BuiltIn,2003.0,RFn,2,672,TA,TA,Y,0,0,177,0,0,0,,,,0,7,2009,ConLD,Normal,198500,False,True,False,False,True,True,True,True,True,False,False,False,False,False,False,False,True,True,True,0
371,372,50,RL,80.0,17120,Pave,no_access,Reg,Lvl,AllPub,Inside,Gtl,ClearCr,Feedr,Norm,1Fam,1.5Fin,4,4,1959,1959,Gable,CompShg,WdShing,Plywood,,0.0,TA,TA,CBlock,,,,,0,,0,0,0,GasA,TA,Y,SBrkr,1120,468,0,1588,0,0,2,0,4,1,TA,7,Min2,1,Gd,Detchd,1991.0,Fin,2,680,TA,TA,N,0,59,0,0,0,0,,,,0,7,2008,WD,Normal,134432,False,True,False,False,True,True,True,True,True,False,False,False,False,False,False,False,True,True,True,0


In [29]:
df.BsmtCond.value_counts()

TA    1311
Gd      65
Fa      45
Po       2
Name: BsmtCond, dtype: int64

In [30]:
df['BsmtQual'] = df['BsmtQual'].fillna('NA')
df['BsmtCond'] = df['BsmtCond'].fillna('NA')
df['BsmtExposure'] = df['BsmtExposure'].fillna('NA')
df['BsmtFinType1'] = df['BsmtFinType1'].fillna('NA')
df['BsmtFinType2'] = df['BsmtFinType2'].fillna('NA')

***

In [31]:
df[(df.Electrical_missing==True)]['Utilities']
# Given the record shows electricity is present, replace with typical electrical system from dataset

1379    AllPub
Name: Utilities, dtype: object

In [32]:
df.Electrical.value_counts()

SBrkr    1334
FuseA      94
FuseF      27
FuseP       3
Mix         1
Name: Electrical, dtype: int64

In [33]:
df['Electrical'] = df['Electrical'].fillna('SBrkr')

***

In [34]:
df[(df.FireplaceQu_missing == True)]['Fireplaces'].sum()
# Doesn't look there are any fireplaces in places with fireplaces missing

0

In [35]:
df['FireplaceQu'] = df['FireplaceQu'].fillna('NA')

***

In [67]:
df[(df.GarageType_missing == True)][['GarageType','GarageYrBlt','GarageFinish','GarageCars','GarageArea','GarageQual','GarageCond']]
# Doesn't look like there are any cases where there is garage relevant data

Unnamed: 0,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond
39,,0.0,,0,0,,
48,,0.0,,0,0,,
78,,0.0,,0,0,,
88,,0.0,,0,0,,
89,,0.0,,0,0,,
99,,0.0,,0,0,,
108,,0.0,,0,0,,
125,,0.0,,0,0,,
127,,0.0,,0,0,,
140,,0.0,,0,0,,


In [36]:
df['GarageType'] = df['GarageType'].fillna('NA')
df['GarageYrBlt'] = df['GarageYrBlt'].fillna(0)
df['GarageFinish'] = df['GarageFinish'].fillna('NA')
df['GarageQual'] = df['GarageQual'].fillna('NA')
df['GarageCond'] = df['GarageCond'].fillna('NA')

****

In [37]:
df[df.PoolQC_missing == True]['PoolArea'].sum()
# Check if any areas without pool data recorded have a pool in the mix

0

In [38]:
df['PoolQC'] = df['PoolQC'].fillna('NA')

***

In [39]:
df['Fence'] = df['Fence'].fillna('NA')

In [40]:
df['MiscFeature'] = df['MiscFeature'].fillna('NO_MISC_FEATURE_RECORDED')

### 1 Summary: Capture all adjustments in a single step

In [66]:
# Capture all adjustments to deal with NaN values.
def denote_null_values(df):
    """Denotes whether or not there are null values or not"""
    empty_cols_query = df.isnull().sum() > 0
    empty_df_cols = df.loc[:, empty_cols_query].columns.tolist()
    for col in empty_df_cols:
        col_name = f"{col}_missing"
        df[col_name] = pd.isnull(df[col])
    return df

df = denote_null_values(df)

# LotFrontage Functions to populate training, test and validation
def LotFrontage_na_calc(training_df):
    lotfrontage_neighborhood_mean = training_df.groupby(by=['Neighborhood'])[['LotFrontage']].mean().reset_index()
    lotfrontage_neighborhood_mean.columns = ['Neighborhood','LotFrontage_Neighborhood_Mean']
    return lotfrontage_neighborhood_mean

def LotFrontage_na_apply(training_df, testing_df, validation_df=None):
    # Calc mean based on training data
    lnm = LotFrontage_na_calc(training_df)
    
    # Apply mean to training data - for neighbourhood
    # Reset LotFrontage NaN in case they have been filled in a prior run
    training_df['LotFrontage'] = np.where(training_df['LotFrontage_missing']==True,np.nan,training_df['LotFrontage'])
    training_df = training_df.merge(lnm,how='left',left_on='Neighborhood',right_on='Neighborhood')
    training_df['LotFrontage'] = training_df['LotFrontage'].fillna(training_df.LotFrontage_Neighborhood_Mean)
    training_df.drop('LotFrontage_Neighborhood_Mean',axis=1,inplace=True)
    
    # Apply mean to testing data
    # Reset LotFrontage NaN in case they have been filled in a prior run
    testing_df['LotFrontage'] = np.where(testing_df['LotFrontage_missing']==True,np.nan,testing_df['LotFrontage'])
    testing_df = testing_df.merge(lnm,how='left',left_on='Neighborhood',right_on='Neighborhood')
    testing_df['LotFrontage'] = testing_df['LotFrontage'].fillna(testing_df.LotFrontage_Neighborhood_Mean)
    testing_df.drop('LotFrontage_Neighborhood_Mean',axis=1,inplace=True)
    # Fill the training sample mean if a specific neighborhood is missing from the training sample
    testing_df['LotFrontage'] = testing_df['LotFrontage'].fillna(training_df['LotFrontage'].mean())

    if validation_df is None:
        return training_df, testing_df
    else:
        # Apply mean to validation data set
        validation_df['LotFrontage'] = np.where(validation_df['LotFrontage_missing']==True,np.nan,validation_df['LotFrontage'])
        validation_df = validation_df.merge(lnm,how='left',left_on='Neighborhood',right_on='Neighborhood')
        validation_df['LotFrontage'] = validation_df['LotFrontage'].fillna(validation_df.LotFrontage_Neighborhood_Mean)
        validation_df.drop('LotFrontage_Neighborhood_Mean',axis=1,inplace=True)        
        validation_df['LotFrontage'] = validation_df['LotFrontage'].fillna(training_df['LotFrontage'].mean())
        return training_df, testing_df,validation_df


# Other fills don't rely on knowledge of full sample to update
df['AlleyAccess_Flag'] = np.where(df['Alley'].isnull(),0,1)
df['MasVnrType'] = df['MasVnrType'].fillna('None')
df['MasVnrArea'] = df['MasVnrArea'].fillna(0)
df['BsmtQual'] = df['BsmtQual'].fillna('NA')
df['BsmtCond'] = df['BsmtCond'].fillna('NA')
df['BsmtExposure'] = df['BsmtExposure'].fillna('NA')
df['BsmtFinType1'] = df['BsmtFinType1'].fillna('NA')
df['BsmtFinType2'] = df['BsmtFinType2'].fillna('NA')
df['Electrical'] = df['Electrical'].fillna('SBrkr')
df['FireplaceQu'] = df['FireplaceQu'].fillna('NA')
df['GarageType'] = df['GarageType'].fillna('NA')
df['GarageYrBlt'] = df['GarageYrBlt'].fillna(0)
df['GarageFinish'] = df['GarageFinish'].fillna('NA')
df['GarageQual'] = df['GarageQual'].fillna('NA')
df['GarageCond'] = df['GarageCond'].fillna('NA')
df['PoolQC'] = df['PoolQC'].fillna('NA')
df['Fence'] = df['Fence'].fillna('NA')
df['MiscFeature'] = df['MiscFeature'].fillna('no_misc_feature_recorded')

## 2. Create additional bespoke data features

In [None]:
# Created df['AlleyAccess_Flag'] above

***

In [41]:
df['BsmtFinSF_Total'] = df['BsmtFinSF1']+df['BsmtFinSF2']

In [42]:
df['BsmtFinSF_Total'].isnull().sum()

0

***

In [43]:
df['Functional'].value_counts()



Typ     1360
Min2      34
Min1      31
Mod       15
Maj1      14
Maj2       5
Sev        1
Name: Functional, dtype: int64

In [44]:
np.where(df['Functional']=='Typ',1,0).sum()

1360

In [45]:
df['Functional_Typical_flag']=np.where(df['Functional']=='Typ',1,0)
df.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,LotFrontage_missing,Alley_missing,MasVnrType_missing,MasVnrArea_missing,BsmtQual_missing,BsmtCond_missing,BsmtExposure_missing,BsmtFinType1_missing,BsmtFinType2_missing,Electrical_missing,FireplaceQu_missing,GarageType_missing,GarageYrBlt_missing,GarageFinish_missing,GarageQual_missing,GarageCond_missing,PoolQC_missing,Fence_missing,MiscFeature_missing,AlleyAccess_Flag,BsmtFinSF_Total,Functional_Typical_flag
0,1,60,RL,65.0,8450,Pave,no_access,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,2,2008,WD,Normal,208500,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,True,True,0,706,1
1,2,20,RL,80.0,9600,Pave,no_access,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,5,2007,WD,Normal,181500,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0,978,1
2,3,60,RL,68.0,11250,Pave,no_access,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,9,2008,WD,Normal,223500,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0,486,1
3,4,70,RL,60.0,9550,Pave,no_access,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,2,2006,WD,Abnorml,140000,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0,216,1
4,5,60,RL,84.0,14260,Pave,no_access,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,12,2008,WD,Normal,250000,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0,655,1


***

In [46]:
df['PorchSF_Total'] = (df['WoodDeckSF']+df['OpenPorchSF']+df['EnclosedPorch']+df['3SsnPorch']+df['ScreenPorch'])
df.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,LotFrontage_missing,Alley_missing,MasVnrType_missing,MasVnrArea_missing,BsmtQual_missing,BsmtCond_missing,BsmtExposure_missing,BsmtFinType1_missing,BsmtFinType2_missing,Electrical_missing,FireplaceQu_missing,GarageType_missing,GarageYrBlt_missing,GarageFinish_missing,GarageQual_missing,GarageCond_missing,PoolQC_missing,Fence_missing,MiscFeature_missing,AlleyAccess_Flag,BsmtFinSF_Total,Functional_Typical_flag,PorchSF_Total
0,1,60,RL,65.0,8450,Pave,no_access,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,2,2008,WD,Normal,208500,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,True,True,0,706,1,61
1,2,20,RL,80.0,9600,Pave,no_access,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,5,2007,WD,Normal,181500,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0,978,1,298
2,3,60,RL,68.0,11250,Pave,no_access,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,9,2008,WD,Normal,223500,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0,486,1,42
3,4,70,RL,60.0,9550,Pave,no_access,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,2,2006,WD,Abnorml,140000,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0,216,1,307
4,5,60,RL,84.0,14260,Pave,no_access,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,12,2008,WD,Normal,250000,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0,655,1,276


In [47]:
df['HasPorch_flag']=np.where(df['PorchSF_Total']>0,1,0)
df.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,LotFrontage_missing,Alley_missing,MasVnrType_missing,MasVnrArea_missing,BsmtQual_missing,BsmtCond_missing,BsmtExposure_missing,BsmtFinType1_missing,BsmtFinType2_missing,Electrical_missing,FireplaceQu_missing,GarageType_missing,GarageYrBlt_missing,GarageFinish_missing,GarageQual_missing,GarageCond_missing,PoolQC_missing,Fence_missing,MiscFeature_missing,AlleyAccess_Flag,BsmtFinSF_Total,Functional_Typical_flag,PorchSF_Total,HasPorch_flag
0,1,60,RL,65.0,8450,Pave,no_access,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,2,2008,WD,Normal,208500,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,True,True,0,706,1,61,1
1,2,20,RL,80.0,9600,Pave,no_access,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,5,2007,WD,Normal,181500,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0,978,1,298,1
2,3,60,RL,68.0,11250,Pave,no_access,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,9,2008,WD,Normal,223500,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0,486,1,42,1
3,4,70,RL,60.0,9550,Pave,no_access,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,2,2006,WD,Abnorml,140000,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0,216,1,307,1
4,5,60,RL,84.0,14260,Pave,no_access,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,12,2008,WD,Normal,250000,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0,655,1,276,1


In [48]:
df[(df['HasPorch_flag']==0)].head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,LotFrontage_missing,Alley_missing,MasVnrType_missing,MasVnrArea_missing,BsmtQual_missing,BsmtCond_missing,BsmtExposure_missing,BsmtFinType1_missing,BsmtFinType2_missing,Electrical_missing,FireplaceQu_missing,GarageType_missing,GarageYrBlt_missing,GarageFinish_missing,GarageQual_missing,GarageCond_missing,PoolQC_missing,Fence_missing,MiscFeature_missing,AlleyAccess_Flag,BsmtFinSF_Total,Functional_Typical_flag,PorchSF_Total,HasPorch_flag
10,11,20,RL,70.0,11200,Pave,no_access,Reg,Lvl,AllPub,Inside,Gtl,Sawyer,Norm,Norm,1Fam,1Story,5,5,1965,1965,Hip,CompShg,HdBoard,HdBoard,,0.0,TA,TA,CBlock,TA,TA,No,Rec,906,Unf,0,134,1040,GasA,Ex,Y,SBrkr,1040,0,0,1040,1,0,1,0,3,1,TA,5,Typ,0,,Detchd,1965.0,Unf,1,384,TA,TA,Y,0,0,0,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,2,2008,WD,Normal,129500,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,True,True,0,906,1,0,0
16,17,20,RL,76.462366,11241,Pave,no_access,IR1,Lvl,AllPub,CulDSac,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,7,1970,1970,Gable,CompShg,Wd Sdng,Wd Sdng,BrkFace,180.0,TA,TA,CBlock,TA,TA,No,ALQ,578,Unf,0,426,1004,GasA,Ex,Y,SBrkr,1004,0,0,1004,1,0,1,0,2,1,TA,5,Typ,1,TA,Attchd,1970.0,Fin,2,480,TA,TA,Y,0,0,0,0,0,0,,,Shed,700,3,2010,WD,Normal,149000,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False,0,578,1,0,0
17,18,90,RL,72.0,10791,Pave,no_access,Reg,Lvl,AllPub,Inside,Gtl,Sawyer,Norm,Norm,Duplex,1Story,4,5,1967,1967,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,Slab,,,,,0,,0,0,0,GasA,TA,Y,SBrkr,1296,0,0,1296,0,0,2,0,2,2,TA,6,Typ,0,,CarPort,1967.0,Unf,2,516,TA,TA,Y,0,0,0,0,0,0,,,Shed,500,10,2006,WD,Normal,90000,False,True,False,False,True,True,True,True,True,False,True,False,False,False,False,False,True,True,False,0,0,1,0,0
19,20,20,RL,70.0,7560,Pave,no_access,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,6,1958,1965,Hip,CompShg,BrkFace,Plywood,,0.0,TA,TA,CBlock,TA,TA,No,LwQ,504,Unf,0,525,1029,GasA,TA,Y,SBrkr,1339,0,0,1339,0,0,1,0,3,1,TA,6,Min1,0,,Attchd,1958.0,Unf,1,294,TA,TA,Y,0,0,0,0,0,0,,MnPrv,NO_MISC_FEATURE_RECORDED,0,5,2009,COD,Abnorml,139000,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,True,0,504,0,0,0
37,38,20,RL,74.0,8532,Pave,no_access,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,6,1954,1990,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,650.0,TA,TA,CBlock,TA,TA,No,Rec,1213,Unf,0,84,1297,GasA,Gd,Y,SBrkr,1297,0,0,1297,0,1,1,0,3,1,TA,5,Typ,1,TA,Attchd,1954.0,Fin,2,498,TA,TA,Y,0,0,0,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,10,2009,WD,Normal,153000,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0,1213,1,0,0


***

In [151]:
df['PoolQC'].value_counts()

NA    1453
Gd       3
Ex       2
Fa       2
Name: PoolQC, dtype: int64

In [50]:
np.where(df['PoolQC']!='NA',1,0).sum()

7

In [51]:
df['HasPool_flag']=np.where(df['PoolQC']!='NA',1,0)

In [52]:
df[(df['HasPool_flag']==1)].head(10)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,LotFrontage_missing,Alley_missing,MasVnrType_missing,MasVnrArea_missing,BsmtQual_missing,BsmtCond_missing,BsmtExposure_missing,BsmtFinType1_missing,BsmtFinType2_missing,Electrical_missing,FireplaceQu_missing,GarageType_missing,GarageYrBlt_missing,GarageFinish_missing,GarageQual_missing,GarageCond_missing,PoolQC_missing,Fence_missing,MiscFeature_missing,AlleyAccess_Flag,BsmtFinSF_Total,Functional_Typical_flag,PorchSF_Total,HasPorch_flag,HasPool_flag
197,198,75,RL,174.0,25419,Pave,no_access,Reg,Lvl,AllPub,Corner,Gtl,NAmes,Artery,Norm,1Fam,2Story,8,4,1918,1990,Gable,CompShg,Stucco,Stucco,,0.0,Gd,Gd,PConc,TA,TA,No,GLQ,1036,LwQ,184,140,1360,GasA,Gd,Y,SBrkr,1360,1360,392,3112,1,1,2,0,4,1,Gd,8,Typ,1,Ex,Detchd,1918.0,Unf,2,795,TA,TA,Y,0,16,552,0,0,512,Ex,GdPrv,NO_MISC_FEATURE_RECORDED,0,3,2006,WD,Abnorml,235000,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,0,1220,1,568,1,1
810,811,20,RL,78.0,10140,Pave,no_access,Reg,Lvl,AllPub,Inside,Gtl,NWAmes,Norm,Norm,1Fam,1Story,6,6,1974,1999,Hip,CompShg,HdBoard,HdBoard,BrkFace,99.0,TA,TA,CBlock,TA,TA,No,ALQ,663,LwQ,377,0,1040,GasA,Fa,Y,SBrkr,1309,0,0,1309,1,0,1,1,3,1,Gd,5,Typ,1,Fa,Attchd,1974.0,RFn,2,484,TA,TA,Y,265,0,0,0,0,648,Fa,GdPrv,NO_MISC_FEATURE_RECORDED,0,1,2006,WD,Normal,181000,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,0,1040,1,265,1,1
1170,1171,80,RL,76.0,9880,Pave,no_access,Reg,Lvl,AllPub,Inside,Gtl,Mitchel,Norm,Norm,1Fam,SLvl,6,6,1977,1977,Gable,CompShg,Plywood,Plywood,,0.0,TA,TA,CBlock,TA,TA,Av,ALQ,522,Unf,0,574,1096,GasA,TA,Y,SBrkr,1118,0,0,1118,1,0,1,0,3,1,TA,6,Typ,1,Po,Attchd,1977.0,Fin,1,358,TA,TA,Y,203,0,0,0,0,576,Gd,GdPrv,NO_MISC_FEATURE_RECORDED,0,7,2008,WD,Normal,171000,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,0,522,1,203,1,1
1182,1183,60,RL,160.0,15623,Pave,no_access,IR1,Lvl,AllPub,Corner,Gtl,NoRidge,Norm,Norm,1Fam,2Story,10,5,1996,1996,Hip,CompShg,Wd Sdng,ImStucc,,0.0,Gd,TA,PConc,Ex,TA,Av,GLQ,2096,Unf,0,300,2396,GasA,Ex,Y,SBrkr,2411,2065,0,4476,1,0,3,1,4,1,Ex,10,Typ,2,TA,Attchd,1996.0,Fin,3,813,TA,TA,Y,171,78,0,0,0,555,Ex,MnPrv,NO_MISC_FEATURE_RECORDED,0,7,2007,WD,Abnorml,745000,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,0,2096,1,249,1,1
1298,1299,60,RL,313.0,63887,Pave,no_access,IR3,Bnk,AllPub,Corner,Gtl,Edwards,Feedr,Norm,1Fam,2Story,10,5,2008,2008,Hip,ClyTile,Stucco,Stucco,Stone,796.0,Ex,TA,PConc,Ex,TA,Gd,GLQ,5644,Unf,0,466,6110,GasA,Ex,Y,SBrkr,4692,950,0,5642,2,0,2,1,3,1,Ex,12,Typ,3,Gd,Attchd,2008.0,Fin,2,1418,TA,TA,Y,214,292,0,0,0,480,Gd,,NO_MISC_FEATURE_RECORDED,0,1,2008,New,Partial,160000,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,0,5644,1,506,1,1
1386,1387,60,RL,80.0,16692,Pave,no_access,IR1,Lvl,AllPub,Inside,Gtl,NWAmes,RRAn,Norm,1Fam,2Story,7,5,1978,1978,Gable,CompShg,Plywood,Plywood,BrkFace,184.0,TA,TA,CBlock,Gd,TA,No,BLQ,790,LwQ,469,133,1392,GasA,TA,Y,SBrkr,1392,1392,0,2784,1,0,3,1,5,1,Gd,12,Typ,2,TA,Attchd,1978.0,RFn,2,564,TA,TA,Y,0,112,0,0,440,519,Fa,MnPrv,TenC,2000,7,2006,WD,Normal,250000,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,0,1259,1,552,1,1
1423,1424,80,RL,68.217391,19690,Pave,no_access,IR1,Lvl,AllPub,CulDSac,Gtl,Edwards,Norm,Norm,1Fam,SLvl,6,7,1966,1966,Flat,Tar&Grv,Plywood,Plywood,,0.0,Gd,Gd,CBlock,Gd,TA,Av,Unf,0,Unf,0,697,697,GasA,TA,Y,SBrkr,1575,626,0,2201,0,0,2,0,4,1,Gd,8,Typ,1,Gd,Attchd,1966.0,Unf,2,432,Gd,Gd,Y,586,236,0,0,0,738,Gd,GdPrv,NO_MISC_FEATURE_RECORDED,0,8,2006,WD,Alloca,274970,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,0,0,1,822,1,1


***

## Section 2 summary - all code in one step

In [53]:

# Additional data features to tidy things up; potentially drop some others
df['BsmtFinSF_Total'] = df['BsmtFinSF1']+df['BsmtFinSF2']
df['Functional_Typical_flag']=np.where(df['Functional']=='Typ',1,0)
df['PorchSF_Total'] = (df['WoodDeckSF']+df['OpenPorchSF']+df['EnclosedPorch']+df['3SsnPorch']+df['ScreenPorch'])
df['HasPorch_flag']=np.where(df['PorchSF_Total']>0,1,0)
df['HasPool_flag']=np.where(df['PoolQC']!='NA',1,0)

# 3. Create manual OneHotEncoding

This is required for 6 columns in the data, each of which contain multiple pieces of information
* Condition1 & Condition2
* Exterior1st & Exterior2nd
* BsmtFinType1 & BsmtFinType2

This will be set up as 3 functions that put in place the coding for a data frame.

All info will be combined in a summary in a final cell.

In [157]:
df['Condition1'].value_counts()

Norm      1260
Feedr       81
Artery      48
RRAn        26
PosN        19
RRAe        11
PosA         8
RRNn         5
RRNe         2
Name: Condition1, dtype: int64

In [158]:
df['Condition2'].value_counts()

Norm      1445
Feedr        6
PosN         2
Artery       2
RRNn         2
PosA         1
RRAn         1
RRAe         1
Name: Condition2, dtype: int64

In [20]:
df['Condition1']

0         Norm
1        Feedr
2         Norm
3         Norm
4         Norm
5         Norm
6         Norm
7         PosN
8       Artery
9       Artery
10        Norm
11        Norm
12        Norm
13        Norm
14        Norm
15        Norm
16        Norm
17        Norm
18        RRAe
19        Norm
20        Norm
21        Norm
22        Norm
23        Norm
24        Norm
25        Norm
26        Norm
27        Norm
28        Norm
29       Feedr
30       Feedr
31        Norm
32        Norm
33        Norm
34        Norm
35        Norm
36        Norm
37        Norm
38        Norm
39        Norm
40        Norm
41        Norm
42        Norm
43        Norm
44        Norm
45        Norm
46        Norm
47        Norm
48        Norm
49        Norm
50        Norm
51        Norm
52        RRNn
53        Norm
54        Norm
55        Norm
56        Norm
57        Norm
58        Norm
59        Norm
60        RRAe
61        Norm
62        Norm
63        RRAn
64        Norm
65        Norm
66        

In [54]:
def ManualOneHotEncoding(df,column_list,ohc_prefix):
    # Identify values for new one hot encoded columns
    
    unique_col_vals = []
    
    for i,col in enumerate(column_list):
        if i == 0:
            unique_col_vals = df[col].unique().tolist()
        else:
            [unique_col_vals.append(j) for j in df[col].unique().tolist()]

    # Limit to unique values to generate columns
    unique_col_vals_set = set(unique_col_vals)
    new_cols = sorted(list(unique_col_vals_set))
    
    # Create and populate columns for data set
    for col in new_cols:
        new_col = ohc_prefix + '_' + col
        df[new_col] = 0 #Create new columns and set to 0
        onehot_target = col
        for i,target_cols in enumerate(column_list):
            if i == 0:
                where_conditions = (df[target_cols] == onehot_target) 
            else:
                where_conditions = where_conditions | (df[target_cols] == onehot_target) 
        # Populate with 0s & 1s
        df[new_col] = np.where(where_conditions,1,0)
        
    return df

In [55]:
df_test = df_orig.copy()
df_test1 = ManualOneHotEncoding(df_test,['Condition1','Condition2'],'Condition')
df_test1.head(10)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,Condition_Artery,Condition_Feedr,Condition_Norm,Condition_PosA,Condition_PosN,Condition_RRAe,Condition_RRAn,Condition_RRNe,Condition_RRNn
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500,0,0,1,0,0,0,0,0,0
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500,0,1,1,0,0,0,0,0,0
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500,0,0,1,0,0,0,0,0,0
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000,0,0,1,0,0,0,0,0,0
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000,0,0,1,0,0,0,0,0,0
5,6,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Mitchel,Norm,Norm,1Fam,1.5Fin,5,5,1993,1995,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,Wood,Gd,TA,No,GLQ,732,Unf,0,64,796,GasA,Ex,Y,SBrkr,796,566,0,1362,1,0,1,1,1,1,TA,5,Typ,0,,Attchd,1993.0,Unf,2,480,TA,TA,Y,40,30,0,320,0,0,,MnPrv,Shed,700,10,2009,WD,Normal,143000,0,0,1,0,0,0,0,0,0
6,7,20,RL,75.0,10084,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Somerst,Norm,Norm,1Fam,1Story,8,5,2004,2005,Gable,CompShg,VinylSd,VinylSd,Stone,186.0,Gd,TA,PConc,Ex,TA,Av,GLQ,1369,Unf,0,317,1686,GasA,Ex,Y,SBrkr,1694,0,0,1694,1,0,2,0,3,1,Gd,7,Typ,1,Gd,Attchd,2004.0,RFn,2,636,TA,TA,Y,255,57,0,0,0,0,,,,0,8,2007,WD,Normal,307000,0,0,1,0,0,0,0,0,0
7,8,60,RL,,10382,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NWAmes,PosN,Norm,1Fam,2Story,7,6,1973,1973,Gable,CompShg,HdBoard,HdBoard,Stone,240.0,TA,TA,CBlock,Gd,TA,Mn,ALQ,859,BLQ,32,216,1107,GasA,Ex,Y,SBrkr,1107,983,0,2090,1,0,2,1,3,1,TA,7,Typ,2,TA,Attchd,1973.0,RFn,2,484,TA,TA,Y,235,204,228,0,0,0,,,Shed,350,11,2009,WD,Normal,200000,0,0,1,0,1,0,0,0,0
8,9,50,RM,51.0,6120,Pave,,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Artery,Norm,1Fam,1.5Fin,7,5,1931,1950,Gable,CompShg,BrkFace,Wd Shng,,0.0,TA,TA,BrkTil,TA,TA,No,Unf,0,Unf,0,952,952,GasA,Gd,Y,FuseF,1022,752,0,1774,0,0,2,0,2,2,TA,8,Min1,2,TA,Detchd,1931.0,Unf,2,468,Fa,TA,Y,90,0,205,0,0,0,,,,0,4,2008,WD,Abnorml,129900,1,0,1,0,0,0,0,0,0
9,10,190,RL,50.0,7420,Pave,,Reg,Lvl,AllPub,Corner,Gtl,BrkSide,Artery,Artery,2fmCon,1.5Unf,5,6,1939,1950,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,BrkTil,TA,TA,No,GLQ,851,Unf,0,140,991,GasA,Ex,Y,SBrkr,1077,0,0,1077,1,0,1,0,2,2,TA,5,Typ,2,TA,Attchd,1939.0,RFn,1,205,Gd,TA,Y,0,4,0,0,0,0,,,,0,1,2008,WD,Normal,118000,1,0,0,0,0,0,0,0,0


In [59]:
df_test1[(df_test1['Condition_PosA']==1)][['Condition1','Condition2','Condition_Artery','Condition_PosA']]

Unnamed: 0,Condition1,Condition2,Condition_Artery,Condition_PosA
66,PosA,Norm,0,1
293,PosA,Norm,0,1
446,PosA,Norm,0,1
471,PosA,Norm,0,1
583,Artery,PosA,1,1
859,PosA,Norm,0,1
934,PosA,Norm,0,1
997,PosA,Norm,0,1
1310,PosA,Norm,0,1


**Up to figuring out how to implement column checks for manual onehot encoding**


In [22]:
column_list = ['Condition1','Condition2'] # ['a','b','c','d']

for i, col in enumerate(column_list):
    print(f"{i}: {col}")


0: Condition1
1: Condition2


In [60]:
df[((df['Condition1'] == 'Artery') | (df['Condition2'] == 'Artery'))].head(10)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
8,9,50,RM,51.0,6120,Pave,,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Artery,Norm,1Fam,1.5Fin,7,5,1931,1950,Gable,CompShg,BrkFace,Wd Shng,,0.0,TA,TA,BrkTil,TA,TA,No,Unf,0,Unf,0,952,952,GasA,Gd,Y,FuseF,1022,752,0,1774,0,0,2,0,2,2,TA,8,Min1,2,TA,Detchd,1931.0,Unf,2,468,Fa,TA,Y,90,0,205,0,0,0,,,,0,4,2008,WD,Abnorml,129900
9,10,190,RL,50.0,7420,Pave,,Reg,Lvl,AllPub,Corner,Gtl,BrkSide,Artery,Artery,2fmCon,1.5Unf,5,6,1939,1950,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,BrkTil,TA,TA,No,GLQ,851,Unf,0,140,991,GasA,Ex,Y,SBrkr,1077,0,0,1077,1,0,1,0,2,2,TA,5,Typ,2,TA,Attchd,1939.0,RFn,1,205,Gd,TA,Y,0,4,0,0,0,0,,,,0,1,2008,WD,Normal,118000
68,69,30,RM,47.0,4608,Pave,,Reg,Lvl,AllPub,Corner,Gtl,OldTown,Artery,Norm,1Fam,1Story,4,6,1945,1950,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,Gd,CBlock,TA,TA,No,Unf,0,Unf,0,747,747,GasA,TA,Y,SBrkr,747,0,0,747,0,0,1,0,2,1,TA,4,Typ,0,,Attchd,1945.0,Unf,1,220,TA,TA,Y,0,0,0,0,0,0,,,,0,6,2010,WD,Normal,80000
108,109,50,RM,85.0,8500,Pave,,Reg,Lvl,AllPub,Corner,Gtl,IDOTRR,Artery,Norm,1Fam,1.5Fin,5,7,1919,2005,Gable,CompShg,CemntBd,CmentBd,,0.0,TA,TA,CBlock,TA,TA,No,Unf,0,Unf,0,793,793,GasW,TA,N,FuseF,997,520,0,1517,0,0,2,0,3,1,Fa,7,Typ,0,,,,,0,0,,,N,0,0,144,0,0,0,,,,0,8,2007,WD,Normal,115000
142,143,50,RL,71.0,8520,Pave,,Reg,Lvl,AllPub,Corner,Gtl,NAmes,Artery,Norm,1Fam,1.5Fin,5,4,1952,1952,Gable,CompShg,BrkFace,Wd Sdng,,0.0,TA,Fa,CBlock,TA,TA,No,Rec,507,Unf,0,403,910,GasA,Fa,Y,SBrkr,910,475,0,1385,0,0,2,0,4,1,TA,6,Typ,0,,Detchd,2000.0,Unf,2,720,TA,TA,Y,0,0,0,0,0,0,,MnPrv,,0,6,2010,WD,Normal,166000
155,156,50,RL,60.0,9600,Pave,,Reg,Lvl,AllPub,Corner,Gtl,Edwards,Artery,Norm,1Fam,1.5Fin,6,5,1924,1950,Gable,CompShg,Wd Sdng,Wd Sdng,,0.0,TA,TA,BrkTil,TA,TA,No,Unf,0,Unf,0,572,572,Grav,Fa,N,FuseF,572,524,0,1096,0,0,1,0,2,1,TA,5,Typ,0,,,,,0,0,,,N,0,8,128,0,0,0,,,,0,4,2008,WD,Normal,79000
182,183,20,RL,60.0,9060,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Edwards,Artery,Norm,1Fam,1Story,5,6,1957,2006,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,98.0,TA,TA,PConc,,,,,0,,0,0,0,GasA,Ex,Y,SBrkr,1340,0,0,1340,0,0,1,0,3,1,TA,7,Typ,1,Gd,Attchd,1957.0,RFn,1,252,TA,TA,Y,116,0,0,180,0,0,,MnPrv,,0,6,2007,WD,Normal,120000
185,186,75,RM,90.0,22950,Pave,,IR2,Lvl,AllPub,Inside,Gtl,OldTown,Artery,Norm,1Fam,2.5Fin,10,9,1892,1993,Gable,WdShngl,Wd Sdng,Wd Sdng,,0.0,Gd,Gd,BrkTil,TA,TA,Mn,Unf,0,Unf,0,1107,1107,GasA,Ex,Y,SBrkr,1518,1518,572,3608,0,0,2,1,4,1,Ex,12,Typ,2,TA,Detchd,1993.0,Unf,3,840,Ex,TA,Y,0,260,0,0,410,0,,GdPrv,,0,6,2006,WD,Normal,475000
197,198,75,RL,174.0,25419,Pave,,Reg,Lvl,AllPub,Corner,Gtl,NAmes,Artery,Norm,1Fam,2Story,8,4,1918,1990,Gable,CompShg,Stucco,Stucco,,0.0,Gd,Gd,PConc,TA,TA,No,GLQ,1036,LwQ,184,140,1360,GasA,Gd,Y,SBrkr,1360,1360,392,3112,1,1,2,0,4,1,Gd,8,Typ,1,Ex,Detchd,1918.0,Unf,2,795,TA,TA,Y,0,16,552,0,0,512,Ex,GdPrv,,0,3,2006,WD,Abnorml,235000
202,203,50,RL,50.0,7000,Pave,,Reg,Lvl,AllPub,Corner,Gtl,OldTown,Artery,Norm,1Fam,1.5Fin,6,6,1924,1950,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,Gd,BrkTil,Fa,TA,No,LwQ,617,Unf,0,0,617,GasA,Gd,Y,SBrkr,865,445,0,1310,0,0,2,0,2,1,TA,6,Min1,0,,Attchd,1924.0,Unf,1,398,TA,TA,Y,0,0,126,0,0,0,,,,0,5,2006,COD,Normal,112000


In [48]:
col1 = 'Condition1'
col2 = 'Condition2'
onehot_target = 'Artery'

col_cond = (df[col1] == onehot_target) 
col_cond_a = col_cond
col_cond = col_cond | (df[col2] == onehot_target)
col_cond_b = col_cond
np.where(cond,1,0)

array([0, 0, 0, ..., 0, 0, 0])

***
### Section 3 Summary - All Code in one step

In [72]:
def ManualOneHotEncoding(df,column_list,ohc_prefix):
    # Identify values for new one hot encoded columns
    
    unique_col_vals = []
    
    for i,col in enumerate(column_list):
        if i == 0:
            unique_col_vals = df[col].unique().tolist()
        else:
            [unique_col_vals.append(j) for j in df[col].unique().tolist()]

    # Limit to unique values to generate columns
    unique_col_vals_set = set(unique_col_vals)
    new_cols = sorted(list(unique_col_vals_set))
    
    # Create and populate columns for data set
    for col in new_cols:
        new_col = ohc_prefix + '_' + col
        df[new_col] = 0 #Create new columns and set to 0
        onehot_target = col
        for i,target_cols in enumerate(column_list):
            if i == 0:
                where_conditions = (df[target_cols] == onehot_target) 
            else:
                where_conditions = where_conditions | (df[target_cols] == onehot_target) 
        # Populate with 0s & 1s
        df[new_col] = np.where(where_conditions,1,0)
        
    return df

# Populate OneHotEncoded Columns
df = ManualOneHotEncoding(df,['Condition1','Condition2'],'Conditions')
df = ManualOneHotEncoding(df,['Exterior1st','Exterior2nd'],'Exterior')
df = ManualOneHotEncoding(df,['BsmtFinType1','BsmtFinType2'],'BsmtFinType')

# Drop OneHotEncoded Columns
df.drop('Condition1',axis=1,inplace=True)
df.drop('Condition2',axis=1,inplace=True)
df.drop('Exterior1st',axis=1,inplace=True)
df.drop('Exterior2nd',axis=1,inplace=True)
df.drop('BsmtFinType1',axis=1,inplace=True)
df.drop('BsmtFinType2',axis=1,inplace=True)

In [73]:
df.head(10)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,AlleyAccess_Flag,BsmtFinSF_Total,Functional_Typical_flag,PorchSF_Total,HasPorch_flag,HasPool_flag,Conditions_Artery,Conditions_Feedr,Conditions_Norm,Conditions_PosA,Conditions_PosN,Conditions_RRAe,Conditions_RRAn,Conditions_RRNe,Conditions_RRNn,Exterior_AsbShng,Exterior_AsphShn,Exterior_Brk Cmn,Exterior_BrkComm,Exterior_BrkFace,Exterior_CBlock,Exterior_CemntBd,Exterior_CmentBd,Exterior_HdBoard,Exterior_ImStucc,Exterior_MetalSd,Exterior_Other,Exterior_Plywood,Exterior_Stone,Exterior_Stucco,Exterior_VinylSd,Exterior_Wd Sdng,Exterior_Wd Shng,Exterior_WdShing,BsmtFinType_ALQ,BsmtFinType_BLQ,BsmtFinType_GLQ,BsmtFinType_LwQ,BsmtFinType_NA,BsmtFinType_Rec,BsmtFinType_Unf
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,1Fam,2Story,7,5,2003,2003,Gable,CompShg,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,706,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,2,2008,WD,Normal,208500,0,706,1,61,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,1Fam,1Story,6,8,1976,1976,Gable,CompShg,,0.0,TA,TA,CBlock,Gd,TA,Gd,978,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,5,2007,WD,Normal,181500,0,978,1,298,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,1Fam,2Story,7,5,2001,2002,Gable,CompShg,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,486,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,9,2008,WD,Normal,223500,0,486,1,42,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,1Fam,2Story,7,5,1915,1970,Gable,CompShg,,0.0,TA,TA,BrkTil,TA,Gd,No,216,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,2,2006,WD,Abnorml,140000,0,216,1,307,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0,1
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,1Fam,2Story,8,5,2000,2000,Gable,CompShg,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,655,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,12,2008,WD,Normal,250000,0,655,1,276,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1
5,6,50,RL,85.0,14115,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Mitchel,1Fam,1.5Fin,5,5,1993,1995,Gable,CompShg,,0.0,TA,TA,Wood,Gd,TA,No,732,0,64,796,GasA,Ex,Y,SBrkr,796,566,0,1362,1,0,1,1,1,1,TA,5,Typ,0,,Attchd,1993.0,Unf,2,480,TA,TA,Y,40,30,0,320,0,0,,MnPrv,Shed,700,10,2009,WD,Normal,143000,0,732,1,390,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1
6,7,20,RL,75.0,10084,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Somerst,1Fam,1Story,8,5,2004,2005,Gable,CompShg,Stone,186.0,Gd,TA,PConc,Ex,TA,Av,1369,0,317,1686,GasA,Ex,Y,SBrkr,1694,0,0,1694,1,0,2,0,3,1,Gd,7,Typ,1,Gd,Attchd,2004.0,RFn,2,636,TA,TA,Y,255,57,0,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,8,2007,WD,Normal,307000,0,1369,1,312,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1
7,8,60,RL,,10382,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NWAmes,1Fam,2Story,7,6,1973,1973,Gable,CompShg,Stone,240.0,TA,TA,CBlock,Gd,TA,Mn,859,32,216,1107,GasA,Ex,Y,SBrkr,1107,983,0,2090,1,0,2,1,3,1,TA,7,Typ,2,TA,Attchd,1973.0,RFn,2,484,TA,TA,Y,235,204,228,0,0,0,,,Shed,350,11,2009,WD,Normal,200000,0,891,1,667,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0
8,9,50,RM,51.0,6120,Pave,,Reg,Lvl,AllPub,Inside,Gtl,OldTown,1Fam,1.5Fin,7,5,1931,1950,Gable,CompShg,,0.0,TA,TA,BrkTil,TA,TA,No,0,0,952,952,GasA,Gd,Y,FuseF,1022,752,0,1774,0,0,2,0,2,2,TA,8,Min1,2,TA,Detchd,1931.0,Unf,2,468,Fa,TA,Y,90,0,205,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,4,2008,WD,Abnorml,129900,0,0,0,295,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1
9,10,190,RL,50.0,7420,Pave,,Reg,Lvl,AllPub,Corner,Gtl,BrkSide,2fmCon,1.5Unf,5,6,1939,1950,Gable,CompShg,,0.0,TA,TA,BrkTil,TA,TA,No,851,0,140,991,GasA,Ex,Y,SBrkr,1077,0,0,1077,1,0,1,0,2,2,TA,5,Typ,2,TA,Attchd,1939.0,RFn,1,205,Gd,TA,Y,0,4,0,0,0,0,,,NO_MISC_FEATURE_RECORDED,0,1,2008,WD,Normal,118000,0,851,1,4,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1


***
## 4. Set up target encoding parameters

In [88]:
targ_enc_cols = [
    'MSSubClass',
    'MSZoning',
    'LandContour',
    'Neighborhood',
    'BldgType',
    'HouseStyle',
    'RoofStyle',
    'RoofMatl',
    'MasVnrType',
    'Foundation',
    'Heating',
    'Electrical',
    'Functional',
    'GarageType',
    'Fence',
    'SaleType',
    'SaleCondition',
]
target_enc = ce.TargetEncoder(verbose=1,cols=targ_enc_cols,min_samples_leaf=5,smoothing=0.1)
target_enc.get_params()

# Keep min_samples_leaf / smoothing in order to enable these variables to be adjusted as test different model pipelines

{'cols': ['MSSubClass',
  'MSZoning',
  'LandContour',
  'Neighborhood',
  'BldgType',
  'HouseStyle',
  'RoofStyle',
  'RoofMatl',
  'MasVnrType',
  'Foundation',
  'Heating',
  'Electrical',
  'Functional',
  'GarageType',
  'Fence',
  'SaleType',
  'SaleCondition'],
 'drop_invariant': False,
 'handle_missing': 'value',
 'handle_unknown': 'value',
 'min_samples_leaf': 5,
 'return_df': True,
 'smoothing': 0.1,
 'verbose': 1}

In [89]:
df_te = target_enc.fit_transform(df.drop('SalePrice',axis=1),df['SalePrice'])

  elif pd.api.types.is_categorical(cols):


In [90]:
df_te.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,AlleyAccess_Flag,BsmtFinSF_Total,Functional_Typical_flag,PorchSF_Total,HasPorch_flag,HasPool_flag,Conditions_Artery,Conditions_Feedr,Conditions_Norm,Conditions_PosA,Conditions_PosN,Conditions_RRAe,Conditions_RRAn,Conditions_RRNe,Conditions_RRNn,Exterior_AsbShng,Exterior_AsphShn,Exterior_Brk Cmn,Exterior_BrkComm,Exterior_BrkFace,Exterior_CBlock,Exterior_CemntBd,Exterior_CmentBd,Exterior_HdBoard,Exterior_ImStucc,Exterior_MetalSd,Exterior_Other,Exterior_Plywood,Exterior_Stone,Exterior_Stucco,Exterior_VinylSd,Exterior_Wd Sdng,Exterior_Wd Shng,Exterior_WdShing,BsmtFinType_ALQ,BsmtFinType_BLQ,BsmtFinType_GLQ,BsmtFinType_LwQ,BsmtFinType_NA,BsmtFinType_Rec,BsmtFinType_Unf
0,1,239948.501672,191004.994787,65.0,8450,Pave,,Reg,180183.746758,AllPub,Inside,Gtl,197965.773333,185763.807377,210051.764045,7,5,2003,2003,171483.956179,179803.679219,204691.87191,196.0,Gd,TA,225230.44204,Gd,TA,No,706,0,150,856,182021.195378,Ex,Y,186810.637453,856,854,0,1710,1,0,2,1,3,1,Gd,8,183429.147059,0,,202892.656322,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,187596.837998,NO_MISC_FEATURE_RECORDED,0,2,2008,173401.836622,175202.219533,0,706,1,61,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1
1,2,185224.811567,191004.994787,80.0,9600,Pave,,Reg,180183.746758,AllPub,FR2,Gtl,238772.727273,185763.807377,175985.477961,6,8,1976,1976,171483.956179,179803.679219,156958.243119,0.0,TA,TA,149805.714511,Gd,TA,Gd,978,0,284,1262,182021.195378,Ex,Y,186810.637453,1262,0,0,1262,0,1,2,0,3,1,TA,6,183429.147059,1,TA,202892.656322,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,187596.837998,NO_MISC_FEATURE_RECORDED,0,5,2007,173401.836622,175202.219533,0,978,1,298,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
2,3,239948.501672,191004.994787,68.0,11250,Pave,,IR1,180183.746758,AllPub,Inside,Gtl,197965.773333,185763.807377,210051.764045,7,5,2001,2002,171483.956179,179803.679219,204691.87191,162.0,Gd,TA,225230.44204,Gd,TA,Mn,486,0,434,920,182021.195378,Ex,Y,186810.637453,920,866,0,1786,1,0,2,1,3,1,Gd,6,183429.147059,1,TA,202892.656322,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,187596.837998,NO_MISC_FEATURE_RECORDED,0,9,2008,173401.836622,175202.219533,0,486,1,42,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1
3,4,166772.416667,191004.994787,60.0,9550,Pave,,IR1,180183.746758,AllPub,Corner,Gtl,210624.72549,185763.807377,210051.764045,7,5,1915,1970,171483.956179,179803.679219,156958.243119,0.0,TA,TA,132291.075342,TA,Gd,No,216,0,540,756,182021.195378,Gd,Y,186810.637453,961,756,0,1717,1,0,1,0,3,1,Gd,7,183429.147059,1,Gd,134091.162791,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,187596.837998,NO_MISC_FEATURE_RECORDED,0,2,2006,173401.836622,146526.623762,0,216,1,307,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0,1
4,5,239948.501672,191004.994787,84.0,14260,Pave,,IR1,180183.746758,AllPub,FR2,Gtl,335295.317073,185763.807377,210051.764045,8,5,2000,2000,171483.956179,179803.679219,204691.87191,350.0,Gd,TA,225230.44204,Gd,TA,Av,655,0,490,1145,182021.195378,Ex,Y,186810.637453,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,183429.147059,1,TA,202892.656322,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,187596.837998,NO_MISC_FEATURE_RECORDED,0,12,2008,173401.836622,175202.219533,0,655,1,276,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1


***
## 4. Set up Ordinal encoding parameters

In [85]:
ordenc_cols = [
'LotShape',
'Utilities',
'LotConfig',
'LandSlope',
'ExterQual',
'ExterCond',
'BsmtQual',
'BsmtCond',
'BsmtExposure',
'HeatingQC',
'KitchenQual',
'FireplaceQu',
'GarageFinish',
'GarageQual',
'GarageCond',
'PavedDrive',
'PoolQC',
]

ordenc_maps = [
{'col':'LotShape', 'mapping':{"Reg":0,"IR1":1,"IR2":2,"IR3":3}},
{'col':'Utilities', 'mapping':{"AllPub":0,"NoSwer":1,"NoSeWa":2,"ELO":3}},
{'col':'LotConfig', 'mapping':{'Gtl':1,'Mod':2,'Sev':3,}},
{'col':'LandSlope', 'mapping':{'Gtl':1,'Mod':2,'Sev':3,}},
{'col':'ExterQual', 'mapping':{'Ex':1,'Gd':2,'TA':3,'Fa':4,'Po':5,}},
{'col':'ExterCond', 'mapping':{'Ex':1,'Gd':2,'TA':3,'Fa':4,'Po':5,}},
{'col':'BsmtQual', 'mapping':{'NA':0,'Ex':1,'Gd':2,'TA':3,'Fa':4,'Po':5,}},
{'col':'BsmtCond', 'mapping':{'NA':0,'Ex':1,'Gd':2,'TA':3,'Fa':4,'Po':5,}},
{'col':'BsmtExposure', 'mapping':{'Gd':1,'Av':2,'Mn':3,'No':4,'NA':5,}},
{'col':'HeatingQC', 'mapping':{'NA':0,'Ex':1,'Gd':2,'TA':3,'Fa':4,'Po':5,}},
{'col':'KitchenQual', 'mapping':{'NA':0,'Ex':1,'Gd':2,'TA':3,'Fa':4,'Po':5,}},
{'col':'FireplaceQu', 'mapping':{'NA':0,'Ex':1,'Gd':2,'TA':3,'Fa':4,'Po':5,}},
{'col':'GarageFinish', 'mapping':{'Fin':1,'RFn':2,'Unf':3,'NA':4,}},
{'col':'GarageQual', 'mapping':{'NA':0,'Ex':1,'Gd':2,'TA':3,'Fa':4,'Po':5,}},
{'col':'GarageCond', 'mapping':{'NA':0,'Ex':1,'Gd':2,'TA':3,'Fa':4,'Po':5,}},
{'col':'PavedDrive', 'mapping':{'Y':1,'P':2,'N':3}},
{'col':'PoolQC', 'mapping':{'NA':0,'Ex':1,'Gd':2,'TA':3,'Fa':4,'Po':5,}},
]

ordinal_enc = ce.OrdinalEncoder(cols=ordenc_cols,mapping=ordenc_maps,verbose=1)
ordinal_enc.get_params()


{'cols': ['LotShape',
  'Utilities',
  'LotConfig',
  'LandSlope',
  'ExterQual',
  'ExterCond',
  'BsmtQual',
  'BsmtCond',
  'BsmtExposure',
  'HeatingQC',
  'KitchenQual',
  'FireplaceQu',
  'GarageFinish',
  'GarageQual',
  'GarageCond',
  'PavedDrive',
  'PoolQC'],
 'drop_invariant': False,
 'handle_missing': 'value',
 'handle_unknown': 'value',
 'mapping': [{'col': 'LotShape',
   'mapping': {'Reg': 0, 'IR1': 1, 'IR2': 2, 'IR3': 3}},
  {'col': 'Utilities',
   'mapping': {'AllPub': 0, 'NoSwer': 1, 'NoSeWa': 2, 'ELO': 3}},
  {'col': 'LotConfig', 'mapping': {'Gtl': 1, 'Mod': 2, 'Sev': 3}},
  {'col': 'LandSlope', 'mapping': {'Gtl': 1, 'Mod': 2, 'Sev': 3}},
  {'col': 'ExterQual',
   'mapping': {'Ex': 1, 'Gd': 2, 'TA': 3, 'Fa': 4, 'Po': 5}},
  {'col': 'ExterCond',
   'mapping': {'Ex': 1, 'Gd': 2, 'TA': 3, 'Fa': 4, 'Po': 5}},
  {'col': 'BsmtQual',
   'mapping': {'NA': 0, 'Ex': 1, 'Gd': 2, 'TA': 3, 'Fa': 4, 'Po': 5}},
  {'col': 'BsmtCond',
   'mapping': {'NA': 0, 'Ex': 1, 'Gd': 2, 'TA': 3

In [91]:
df_oe = ordinal_enc.fit_transform(df.drop('SalePrice',axis=1),df['SalePrice'])

  elif pd.api.types.is_categorical(cols):


In [92]:
df_oe.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,AlleyAccess_Flag,BsmtFinSF_Total,Functional_Typical_flag,PorchSF_Total,HasPorch_flag,HasPool_flag,Conditions_Artery,Conditions_Feedr,Conditions_Norm,Conditions_PosA,Conditions_PosN,Conditions_RRAe,Conditions_RRAn,Conditions_RRNe,Conditions_RRNn,Exterior_AsbShng,Exterior_AsphShn,Exterior_Brk Cmn,Exterior_BrkComm,Exterior_BrkFace,Exterior_CBlock,Exterior_CemntBd,Exterior_CmentBd,Exterior_HdBoard,Exterior_ImStucc,Exterior_MetalSd,Exterior_Other,Exterior_Plywood,Exterior_Stone,Exterior_Stucco,Exterior_VinylSd,Exterior_Wd Sdng,Exterior_Wd Shng,Exterior_WdShing,BsmtFinType_ALQ,BsmtFinType_BLQ,BsmtFinType_GLQ,BsmtFinType_LwQ,BsmtFinType_NA,BsmtFinType_Rec,BsmtFinType_Unf
0,1,60,RL,65.0,8450,Pave,,0,Lvl,0,-1.0,1,CollgCr,1Fam,2Story,7,5,2003,2003,Gable,CompShg,BrkFace,196.0,2,3,PConc,2,3,4,706,0,150,856,GasA,1,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,2,8,Typ,0,0,Attchd,2003.0,2,2,548,3,3,1,0,61,0,0,0,0,0,,NO_MISC_FEATURE_RECORDED,0,2,2008,WD,Normal,0,706,1,61,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1
1,2,20,RL,80.0,9600,Pave,,0,Lvl,0,-1.0,1,Veenker,1Fam,1Story,6,8,1976,1976,Gable,CompShg,,0.0,3,3,CBlock,2,3,1,978,0,284,1262,GasA,1,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,3,6,Typ,1,3,Attchd,1976.0,2,2,460,3,3,1,298,0,0,0,0,0,0,,NO_MISC_FEATURE_RECORDED,0,5,2007,WD,Normal,0,978,1,298,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
2,3,60,RL,68.0,11250,Pave,,1,Lvl,0,-1.0,1,CollgCr,1Fam,2Story,7,5,2001,2002,Gable,CompShg,BrkFace,162.0,2,3,PConc,2,3,3,486,0,434,920,GasA,1,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,2,6,Typ,1,3,Attchd,2001.0,2,2,608,3,3,1,0,42,0,0,0,0,0,,NO_MISC_FEATURE_RECORDED,0,9,2008,WD,Normal,0,486,1,42,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1
3,4,70,RL,60.0,9550,Pave,,1,Lvl,0,-1.0,1,Crawfor,1Fam,2Story,7,5,1915,1970,Gable,CompShg,,0.0,3,3,BrkTil,3,2,4,216,0,540,756,GasA,2,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,2,7,Typ,1,2,Detchd,1998.0,3,3,642,3,3,1,0,35,272,0,0,0,0,,NO_MISC_FEATURE_RECORDED,0,2,2006,WD,Abnorml,0,216,1,307,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0,1
4,5,60,RL,84.0,14260,Pave,,1,Lvl,0,-1.0,1,NoRidge,1Fam,2Story,8,5,2000,2000,Gable,CompShg,BrkFace,350.0,2,3,PConc,2,3,2,655,0,490,1145,GasA,1,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,2,9,Typ,1,3,Attchd,2000.0,2,3,836,3,3,1,192,84,0,0,0,0,0,,NO_MISC_FEATURE_RECORDED,0,12,2008,WD,Normal,0,655,1,276,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1


***
## 5. Set up OneHot encoding parameters

In [94]:
onehot_enc = ce.OneHotEncoder(verbose=1,cols=['Street','Alley','CentralAir','MiscFeature'],use_cat_names=True)
onehot_enc.get_params()

{'cols': ['Street', 'Alley', 'CentralAir', 'MiscFeature'],
 'drop_invariant': False,
 'handle_missing': 'value',
 'handle_unknown': 'value',
 'return_df': True,
 'use_cat_names': True,
 'verbose': 1}

In [95]:
df_onehot = onehot_enc.fit_transform(df.drop('SalePrice',axis=1),df['SalePrice'])

  elif pd.api.types.is_categorical(cols):


In [96]:
df_onehot.head(5)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street_Pave,Street_Grvl,Alley_nan,Alley_Grvl,Alley_Pave,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir_Y,CentralAir_N,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature_NO_MISC_FEATURE_RECORDED,MiscFeature_Shed,MiscFeature_Gar2,MiscFeature_Othr,MiscFeature_TenC,MiscVal,MoSold,YrSold,SaleType,SaleCondition,AlleyAccess_Flag,BsmtFinSF_Total,Functional_Typical_flag,PorchSF_Total,HasPorch_flag,HasPool_flag,Conditions_Artery,Conditions_Feedr,Conditions_Norm,Conditions_PosA,Conditions_PosN,Conditions_RRAe,Conditions_RRAn,Conditions_RRNe,Conditions_RRNn,Exterior_AsbShng,Exterior_AsphShn,Exterior_Brk Cmn,Exterior_BrkComm,Exterior_BrkFace,Exterior_CBlock,Exterior_CemntBd,Exterior_CmentBd,Exterior_HdBoard,Exterior_ImStucc,Exterior_MetalSd,Exterior_Other,Exterior_Plywood,Exterior_Stone,Exterior_Stucco,Exterior_VinylSd,Exterior_Wd Sdng,Exterior_Wd Shng,Exterior_WdShing,BsmtFinType_ALQ,BsmtFinType_BLQ,BsmtFinType_GLQ,BsmtFinType_LwQ,BsmtFinType_NA,BsmtFinType_Rec,BsmtFinType_Unf
0,1,60,RL,65.0,8450,1,0,1,0,0,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,1Fam,2Story,7,5,2003,2003,Gable,CompShg,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,706,0,150,856,GasA,Ex,1,0,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,1,0,0,0,0,0,2,2008,WD,Normal,0,706,1,61,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1
1,2,20,RL,80.0,9600,1,0,1,0,0,Reg,Lvl,AllPub,FR2,Gtl,Veenker,1Fam,1Story,6,8,1976,1976,Gable,CompShg,,0.0,TA,TA,CBlock,Gd,TA,Gd,978,0,284,1262,GasA,Ex,1,0,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,1,0,0,0,0,0,5,2007,WD,Normal,0,978,1,298,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1
2,3,60,RL,68.0,11250,1,0,1,0,0,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,1Fam,2Story,7,5,2001,2002,Gable,CompShg,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,486,0,434,920,GasA,Ex,1,0,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,1,0,0,0,0,0,9,2008,WD,Normal,0,486,1,42,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1
3,4,70,RL,60.0,9550,1,0,1,0,0,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,1Fam,2Story,7,5,1915,1970,Gable,CompShg,,0.0,TA,TA,BrkTil,TA,Gd,No,216,0,540,756,GasA,Gd,1,0,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,1,0,0,0,0,0,2,2006,WD,Abnorml,0,216,1,307,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0,0,1
4,5,60,RL,84.0,14260,1,0,1,0,0,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,1Fam,2Story,8,5,2000,2000,Gable,CompShg,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,655,0,490,1145,GasA,Ex,1,0,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,1,0,0,0,0,0,12,2008,WD,Normal,0,655,1,276,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1


# 8. Run all code in sequence and review expected modelling data set

In [151]:
df_orig = pd.read_csv('../data/iowa_full.csv')

In [152]:
df = df_orig.copy()

In [153]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [155]:
# Drop the row ID column as this is not something that should impart any information.
df.drop('Id',axis=1,inplace=True)

# Capture all adjustments to deal with NaN values.
def denote_null_values(df):
    """Denotes whether or not there are null values or not"""
    empty_cols_query = df.isnull().sum() > 0
    empty_df_cols = df.loc[:, empty_cols_query].columns.tolist()
    for col in empty_df_cols:
        col_name = f"{col}_missing"
        df[col_name] = pd.isnull(df[col])
    return df

df = denote_null_values(df)

# LotFrontage Functions to populate training, test and validation
def LotFrontage_na_calc(training_df):
    lotfrontage_neighborhood_mean = training_df.groupby(by=['Neighborhood'])[['LotFrontage']].mean().reset_index()
    lotfrontage_neighborhood_mean.columns = ['Neighborhood','LotFrontage_Neighborhood_Mean']
    return lotfrontage_neighborhood_mean

def LotFrontage_na_apply(training_df, testing_df, validation_df=None):
    # Calc mean based on training data
    lnm = LotFrontage_na_calc(training_df)
    
    # Apply mean to training data - for neighbourhood
    # Reset LotFrontage NaN in case they have been filled in a prior run
    training_df['LotFrontage'] = np.where(training_df['LotFrontage_missing']==True,np.nan,training_df['LotFrontage'])
    training_df = training_df.merge(lnm,how='left',left_on='Neighborhood',right_on='Neighborhood')
    training_df['LotFrontage'] = training_df['LotFrontage'].fillna(training_df.LotFrontage_Neighborhood_Mean)
    training_df.drop('LotFrontage_Neighborhood_Mean',axis=1,inplace=True)
    
    # Apply mean to testing data
    # Reset LotFrontage NaN in case they have been filled in a prior run
    testing_df['LotFrontage'] = np.where(testing_df['LotFrontage_missing']==True,np.nan,testing_df['LotFrontage'])
    testing_df = testing_df.merge(lnm,how='left',left_on='Neighborhood',right_on='Neighborhood')
    testing_df['LotFrontage'] = testing_df['LotFrontage'].fillna(testing_df.LotFrontage_Neighborhood_Mean)
    testing_df.drop('LotFrontage_Neighborhood_Mean',axis=1,inplace=True)
    # Fill the training sample mean if a specific neighborhood is missing from the training sample
    testing_df['LotFrontage'] = testing_df['LotFrontage'].fillna(training_df['LotFrontage'].mean())

    if validation_df is None:
        return training_df, testing_df
    else:
        # Apply mean to validation data set
        validation_df['LotFrontage'] = np.where(validation_df['LotFrontage_missing']==True,np.nan,validation_df['LotFrontage'])
        validation_df = validation_df.merge(lnm,how='left',left_on='Neighborhood',right_on='Neighborhood')
        validation_df['LotFrontage'] = validation_df['LotFrontage'].fillna(validation_df.LotFrontage_Neighborhood_Mean)
        validation_df.drop('LotFrontage_Neighborhood_Mean',axis=1,inplace=True)        
        validation_df['LotFrontage'] = validation_df['LotFrontage'].fillna(training_df['LotFrontage'].mean())
        return training_df, testing_df,validation_df


# Other fills don't rely on knowledge of full sample to update
df['AlleyAccess_Flag'] = np.where(df['Alley'].isnull(),0,1)
df['MasVnrType'] = df['MasVnrType'].fillna('None')
df['MasVnrArea'] = df['MasVnrArea'].fillna(0)
df['BsmtQual'] = df['BsmtQual'].fillna('NA')
df['BsmtCond'] = df['BsmtCond'].fillna('NA')
df['BsmtExposure'] = df['BsmtExposure'].fillna('NA')
df['BsmtFinType1'] = df['BsmtFinType1'].fillna('NA')
df['BsmtFinType2'] = df['BsmtFinType2'].fillna('NA')
df['Electrical'] = df['Electrical'].fillna('SBrkr')
df['FireplaceQu'] = df['FireplaceQu'].fillna('NA')
df['GarageType'] = df['GarageType'].fillna('NA')
df['GarageYrBlt'] = df['GarageYrBlt'].fillna(0)
df['GarageFinish'] = df['GarageFinish'].fillna('NA')
df['GarageQual'] = df['GarageQual'].fillna('NA')
df['GarageCond'] = df['GarageCond'].fillna('NA')
df['PoolQC'] = df['PoolQC'].fillna('NA')
df['Fence'] = df['Fence'].fillna('NA')
df['MiscFeature'] = df['MiscFeature'].fillna('no_misc_feature_recorded')


# Additional data features to tidy things up; potentially drop some others
df['BsmtFinSF_Total'] = df['BsmtFinSF1']+df['BsmtFinSF2']
df['Functional_Typical_flag']=np.where(df['Functional']=='Typ',1,0)
df['PorchSF_Total'] = (df['WoodDeckSF']+df['OpenPorchSF']+df['EnclosedPorch']+df['3SsnPorch']+df['ScreenPorch'])
df['HasPorch_flag']=np.where(df['PorchSF_Total']>0,1,0)
df['HasPool_flag']=np.where(df['PoolQC']!='NA',1,0)

In [156]:
def ManualOneHotEncoding(df,column_list,ohc_prefix):
    # Identify values for new one hot encoded columns
    
    unique_col_vals = []
    
    for i,col in enumerate(column_list):
        if i == 0:
            unique_col_vals = df[col].unique().tolist()
        else:
            [unique_col_vals.append(j) for j in df[col].unique().tolist()]

    # Limit to unique values to generate columns
    unique_col_vals_set = set(unique_col_vals)
    new_cols = sorted(list(unique_col_vals_set))
    
    # Create and populate columns for data set
    for col in new_cols:
        new_col = ohc_prefix + '_' + col
        df[new_col] = 0 #Create new columns and set to 0
        onehot_target = col
        for i,target_cols in enumerate(column_list):
            if i == 0:
                where_conditions = (df[target_cols] == onehot_target) 
            else:
                where_conditions = where_conditions | (df[target_cols] == onehot_target) 
        # Populate with 0s & 1s
        df[new_col] = np.where(where_conditions,1,0)
        
    return df

# Populate OneHotEncoded Columns
df = ManualOneHotEncoding(df,['Condition1','Condition2'],'Conditions')
df = ManualOneHotEncoding(df,['Exterior1st','Exterior2nd'],'Exterior')
df = ManualOneHotEncoding(df,['BsmtFinType1','BsmtFinType2'],'BsmtFinType')

# Drop OneHotEncoded Columns
df.drop('Condition1',axis=1,inplace=True)
df.drop('Condition2',axis=1,inplace=True)
df.drop('Exterior1st',axis=1,inplace=True)
df.drop('Exterior2nd',axis=1,inplace=True)
df.drop('BsmtFinType1',axis=1,inplace=True)
df.drop('BsmtFinType2',axis=1,inplace=True)

In [157]:
# Train/test sets
train = df.sample(frac=0.8,random_state=743)
test = df.drop(train.index)
train,val = train.iloc[:-100],train.iloc[-100:]

train,test,val = LotFrontage_na_apply(train, test, val)

X_train, y_train = train.drop('SalePrice',axis=1), train['SalePrice']


In [158]:
# Set up encoders

targ_enc_cols = [
    'MSSubClass',
    'MSZoning',
    'LandContour',
    'Neighborhood',
    'BldgType',
    'HouseStyle',
    'RoofStyle',
    'RoofMatl',
    'MasVnrType',
    'Foundation',
    'Heating',
    'Electrical',
    'Functional',
    'GarageType',
    'Fence',
    'SaleType',
    'SaleCondition',
]
target_enc = ce.TargetEncoder(verbose=1,cols=targ_enc_cols,min_samples_leaf=5,smoothing=0.1)

ordenc_cols = [
'LotShape',
'Utilities',
'LotConfig',
'LandSlope',
'ExterQual',
'ExterCond',
'BsmtQual',
'BsmtCond',
'BsmtExposure',
'HeatingQC',
'KitchenQual',
'FireplaceQu',
'GarageFinish',
'GarageQual',
'GarageCond',
'PavedDrive',
'PoolQC',
]

ordenc_maps = [
{'col':'LotShape', 'mapping':{"Reg":0,"IR1":1,"IR2":2,"IR3":3}},
{'col':'Utilities', 'mapping':{"AllPub":0,"NoSwer":1,"NoSeWa":2,"ELO":3}},
{'col':'LotConfig', 'mapping':{'Gtl':1,'Mod':2,'Sev':3,}},
{'col':'LandSlope', 'mapping':{'Gtl':1,'Mod':2,'Sev':3,}},
{'col':'ExterQual', 'mapping':{'Ex':1,'Gd':2,'TA':3,'Fa':4,'Po':5,}},
{'col':'ExterCond', 'mapping':{'Ex':1,'Gd':2,'TA':3,'Fa':4,'Po':5,}},
{'col':'BsmtQual', 'mapping':{'NA':0,'Ex':1,'Gd':2,'TA':3,'Fa':4,'Po':5,}},
{'col':'BsmtCond', 'mapping':{'NA':0,'Ex':1,'Gd':2,'TA':3,'Fa':4,'Po':5,}},
{'col':'BsmtExposure', 'mapping':{'Gd':1,'Av':2,'Mn':3,'No':4,'NA':5,}},
{'col':'HeatingQC', 'mapping':{'NA':0,'Ex':1,'Gd':2,'TA':3,'Fa':4,'Po':5,}},
{'col':'KitchenQual', 'mapping':{'NA':0,'Ex':1,'Gd':2,'TA':3,'Fa':4,'Po':5,}},
{'col':'FireplaceQu', 'mapping':{'NA':0,'Ex':1,'Gd':2,'TA':3,'Fa':4,'Po':5,}},
{'col':'GarageFinish', 'mapping':{'Fin':1,'RFn':2,'Unf':3,'NA':4,}},
{'col':'GarageQual', 'mapping':{'NA':0,'Ex':1,'Gd':2,'TA':3,'Fa':4,'Po':5,}},
{'col':'GarageCond', 'mapping':{'NA':0,'Ex':1,'Gd':2,'TA':3,'Fa':4,'Po':5,}},
{'col':'PavedDrive', 'mapping':{'Y':1,'P':2,'N':3}},
{'col':'PoolQC', 'mapping':{'NA':0,'Ex':1,'Gd':2,'TA':3,'Fa':4,'Po':5,}},
]

ordinal_enc = ce.OrdinalEncoder(cols=ordenc_cols,mapping=ordenc_maps,verbose=1)

onehot_enc = ce.OneHotEncoder(verbose=1,cols=['Street','Alley','CentralAir','MiscFeature'],use_cat_names=True)


In [159]:
df_step1 = target_enc.fit_transform(X_train, y_train)
df_step2 = ordinal_enc.fit_transform(df_step1, y_train)
df_step3 = onehot_enc.fit_transform(df_step2, y_train)


  elif pd.api.types.is_categorical(cols):
  elif pd.api.types.is_categorical(cols):


In [160]:
df_step3.head(10)

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street_Pave,Street_Grvl,Alley_nan,Alley_Grvl,Alley_Pave,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir_Y,CentralAir_N,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature_no_misc_feature_recorded,MiscFeature_Shed,MiscFeature_Gar2,MiscFeature_Othr,MiscVal,MoSold,YrSold,SaleType,SaleCondition,LotFrontage_missing,Alley_missing,MasVnrType_missing,MasVnrArea_missing,BsmtQual_missing,BsmtCond_missing,BsmtExposure_missing,BsmtFinType1_missing,BsmtFinType2_missing,Electrical_missing,FireplaceQu_missing,GarageType_missing,GarageYrBlt_missing,GarageFinish_missing,GarageQual_missing,GarageCond_missing,PoolQC_missing,Fence_missing,MiscFeature_missing,AlleyAccess_Flag,BsmtFinSF_Total,Functional_Typical_flag,PorchSF_Total,HasPorch_flag,HasPool_flag,Conditions_Artery,Conditions_Feedr,Conditions_Norm,Conditions_PosA,Conditions_PosN,Conditions_RRAe,Conditions_RRAn,Conditions_RRNe,Conditions_RRNn,Exterior_AsbShng,Exterior_AsphShn,Exterior_Brk Cmn,Exterior_BrkComm,Exterior_BrkFace,Exterior_CBlock,Exterior_CemntBd,Exterior_CmentBd,Exterior_HdBoard,Exterior_ImStucc,Exterior_MetalSd,Exterior_Other,Exterior_Plywood,Exterior_Stone,Exterior_Stucco,Exterior_VinylSd,Exterior_Wd Sdng,Exterior_Wd Shng,Exterior_WdShing,BsmtFinType_ALQ,BsmtFinType_BLQ,BsmtFinType_GLQ,BsmtFinType_LwQ,BsmtFinType_NA,BsmtFinType_Rec,BsmtFinType_Unf
0,97967.666667,192614.530539,98.0,8731,1,0,1,0,0,1,180684.979296,0,-1.0,1,120962.209302,187045.7789,176522.089184,5,5,1920,1950,171668.63494,180787.831107,157191.138629,0.0,3,4,132585.221154,3,3,4,645,0,270,915,182555.83445,3,1,0,187072.650307,1167,0,0,1167,0,0,1,0,3,1,3,6,164627.777778,1,2,134114.645833,1972.0,3,2,495,3,3,1,0,0,216,0,126,0,0,188584.208525,1,0,0,0,0,5,2007,173511.866304,174916.383562,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0,645,0,342,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1
1,185822.979747,210581.255319,65.344828,4403,1,0,1,0,0,2,180684.979296,0,-1.0,1,225365.276923,187045.7789,176522.089184,7,5,2009,2009,171668.63494,180787.831107,269900.775281,432.0,1,3,225716.297352,1,3,2,578,0,892,1470,182555.83445,1,1,0,187072.650307,1478,0,0,1478,1,0,2,1,2,1,2,7,183868.233233,1,2,203598.768371,2009.0,1,2,484,3,3,1,0,144,0,0,0,0,0,188584.208525,1,0,0,0,0,6,2010,275480.43299,272147.3,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0,578,1,144,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1
2,129726.923077,192614.530539,107.0,10615,1,0,1,0,0,1,139859.897436,0,-1.0,2,128957.207792,129726.923077,211460.292035,3,5,1900,1970,171668.63494,180787.831107,157191.138629,0.0,3,3,147927.199552,4,3,3,440,0,538,978,182555.83445,3,1,0,187072.650307,1014,685,0,1699,1,0,2,0,3,2,3,7,183868.233233,0,0,117946.047767,1920.0,3,2,420,4,4,1,0,74,0,0,0,0,0,188584.208525,1,0,0,0,0,8,2009,173511.866304,145799.514706,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,True,True,0,440,1,74,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1
3,139694.875,192614.530539,138.0,18030,1,0,1,0,0,1,139859.897436,0,-1.0,1,205753.565217,187045.7789,139901.490566,5,6,1946,1994,171668.63494,180787.831107,157191.138629,0.0,3,3,147927.199552,3,3,4,152,469,977,1598,182555.83445,3,1,0,187072.650307,1636,971,479,3086,0,0,3,0,3,1,1,12,164627.777778,1,2,103106.451613,0.0,4,0,0,0,0,1,122,0,0,0,0,0,0,145299.747826,1,0,0,0,0,3,2007,173511.866304,174916.383562,False,True,False,False,False,False,False,False,False,False,False,True,True,True,True,True,True,False,True,0,621,0,122,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0
4,185822.979747,192614.530539,68.0,9571,1,0,1,0,0,0,180684.979296,0,-1.0,1,125633.571429,187045.7789,176522.089184,5,6,1956,1956,171668.63494,180787.831107,157191.138629,0.0,3,3,147927.199552,3,3,2,739,0,405,1144,182555.83445,3,1,0,187072.650307,1144,0,0,1144,1,0,1,0,3,1,3,6,183868.233233,0,0,203598.768371,1956.0,3,1,596,3,3,1,44,0,0,0,0,0,0,188584.208525,1,0,0,0,0,6,2010,173511.866304,174916.383562,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,True,True,0,739,1,44,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1
5,244609.669683,192614.530539,70.0,11207,1,0,1,0,0,1,228300.567568,0,-1.0,1,196097.803279,187045.7789,211460.292035,6,5,1997,1997,171668.63494,180787.831107,157191.138629,0.0,3,3,225716.297352,2,3,2,714,0,88,802,182555.83445,2,1,0,187072.650307,802,709,0,1511,1,0,2,1,3,1,3,8,183868.233233,1,3,203598.768371,1997.0,1,2,413,3,3,1,95,75,0,0,0,0,0,188584.208525,1,0,0,0,0,6,2006,173511.866304,174916.383562,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0,714,1,170,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1
6,185822.979747,210581.255319,75.0,7862,1,0,1,0,0,1,180684.979296,0,-1.0,1,225365.276923,187045.7789,176522.089184,6,5,2009,2009,171668.63494,180787.831107,157191.138629,0.0,2,3,225716.297352,2,3,4,27,0,1191,1218,182555.83445,1,1,0,187072.650307,1218,0,0,1218,0,0,2,0,2,1,2,4,183868.233233,0,0,203598.768371,2009.0,1,2,676,3,3,1,0,102,0,0,0,0,0,188584.208525,1,0,0,0,0,9,2009,275480.43299,272147.3,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,True,True,0,27,1,102,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1
7,109862.5,125602.530488,57.0,7449,1,0,0,1,0,0,139859.897436,0,-1.0,1,99358.235294,187045.7789,111790.0,7,7,1930,1950,171668.63494,180787.831107,157191.138629,0.0,3,3,225716.297352,3,3,4,0,0,637,637,182555.83445,1,1,0,109344.619048,1108,0,0,1108,0,0,1,0,3,1,2,6,183868.233233,1,2,203598.768371,1930.0,3,1,280,3,3,3,0,0,205,0,0,0,0,176940.697674,1,0,0,0,0,6,2007,173511.866304,174916.383562,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,1,0,1,205,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1
8,185822.979747,192614.530539,86.0,13286,1,0,1,0,0,1,180684.979296,0,-1.0,1,125633.571429,187045.7789,176522.089184,9,5,2007,2008,220561.588785,180787.831107,269900.775281,340.0,1,3,225716.297352,1,3,4,1234,0,464,1698,182555.83445,1,1,0,187072.650307,1698,0,0,1698,1,0,2,0,3,1,1,8,183868.233233,1,2,203598.768371,2007.0,1,3,768,3,3,1,327,64,0,0,0,0,0,188584.208525,1,0,0,0,0,2,2009,173511.866304,174916.383562,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0,1234,1,391,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1
9,185822.979747,192614.530539,72.346154,13265,1,0,1,0,0,1,180684.979296,0,-1.0,1,153276.315789,187045.7789,176522.089184,8,5,2002,2002,220561.588785,180787.831107,206212.831804,148.0,2,3,225716.297352,2,3,4,1218,0,350,1568,182555.83445,1,1,0,187072.650307,1689,0,0,1689,1,0,2,0,3,1,2,7,183868.233233,2,2,203598.768371,2002.0,2,3,857,3,3,1,150,59,0,0,0,0,0,188584.208525,1,0,0,0,0,7,2008,173511.866304,174916.383562,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,True,0,1218,1,209,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1


In [161]:
# Check for any null values
df_step3.isnull().sum().sum()

0