# Data Cleaning and Feature Engineering

## Import Data

In [20]:
import pandas as pd
import numpy as np

In [36]:
# Import data

# Import train data
train_data_url = 'https://raw.githubusercontent.com/cal-dortiz/W207_Applied-_Machine_Learning/main/Final_Project/Data/train.csv'
df_train = pd.read_csv(train_data_url)

# Import test data
#test_data_url = 'https://raw.githubusercontent.com/cal-dortiz/W207_Applied-_Machine_Learning/main/Final_Project/Data/test.csv'
#df_test = pd.read_csv(test_data_url)

## Data Cleaning

### Data Removal

Based on the exploratory data analysis (EDA), attributes that have a large share of missing data and low impact are removed from the data set.

In [37]:
df_train = df_train.drop(columns=['PoolQC','MiscFeature','Alley','Fence','Id'])
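The missing-data criterion described above could be checked with a short sketch like the one below. It uses a small made-up frame in place of `df_train`, and the `MISSING_THRESHOLD` cutoff is an assumption for illustration; the notebook does not state which threshold was applied.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the values are illustrative only.
toy = pd.DataFrame({
    'PoolQC': [np.nan, np.nan, np.nan, 'Gd'],
    'Alley': [np.nan, 'Grvl', np.nan, np.nan],
    'LotArea': [8000, 9500, 11000, 9000],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Hypothetical cutoff; not taken from the original analysis.
MISSING_THRESHOLD = 0.5
sparse_columns = missing_share[missing_share > MISSING_THRESHOLD].index.tolist()

print(missing_share)
print(sparse_columns)
```

Running the same two lines against `df_train` before the `drop` call would show whether the removed columns are in fact the sparsest ones.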

### Cleaning Housing SqFt

The EDA confirms the assumption that larger houses are correlated with higher prices. This section reviews all attributes that measure the size of the house.

The basement area is not counted, as basements may not be used in assessing property value. The correlation between 'TotalBsmtSF' and price may be due to the correlation between the size of the foundation and the size of the first floor. Keeping 'TotalBsmtSF' in the model may lead to collinearity.

In [39]:
# Remove 'TotalBsmtSF'
df_train = df_train.drop(columns=['TotalBsmtSF'])
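One way to look at the collinearity concern before dropping the column is to compute the pairwise correlation between basement and first-floor area. The sketch below uses made-up numbers rather than the real training data, so the resulting coefficient says nothing about the actual data set.

```python
import pandas as pd

# Made-up values for illustration; not rows from train.csv.
toy = pd.DataFrame({
    'TotalBsmtSF': [800, 1200, 900, 700, 1100],
    '1stFlrSF': [850, 1250, 900, 950, 1150],
})

# Pearson correlation between the two size attributes.
corr = toy['TotalBsmtSF'].corr(toy['1stFlrSF'])
print(f"corr(TotalBsmtSF, 1stFlrSF) = {corr:.3f}")
```

A coefficient close to 1 on the real data would support the reasoning that the two attributes carry largely the same information.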

Since the square footage of the house is highly correlated with its price, let's calculate a single total square footage attribute.

The data set breaks square footage and room data into basement, first floor, and second floor. We believe combining the first and second floor room and square footage data into a single dimension will reduce the risk of collinearity between two attributes in the model and increase the power of the combined attribute.

We will leave the basement data separate, as we do not know whether basement attributes are allowed to be used in housing assessments.

In [18]:
# Combine SqFt
df_train['TotSqFt'] = df_train['1stFlrSF'] + df_train['2ndFlrSF']
df_train = df_train.drop(columns=['1stFlrSF','2ndFlrSF'])
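Because the cell above drops its own inputs, running it a second time raises a `KeyError`, and the execution counts in this notebook suggest the cells were not run strictly in order. A minimal sketch of a re-runnable version follows; `combine_floor_area` is a hypothetical helper that is not part of the original notebook, shown here on a toy frame.

```python
import pandas as pd

def combine_floor_area(df, parts=('1stFlrSF', '2ndFlrSF'), target='TotSqFt'):
    """Return a copy with `parts` summed into `target` and the parts dropped.

    Hypothetical helper for illustration; safe to call more than once.
    """
    if any(col not in df.columns for col in parts):
        # Inputs already combined (or absent): return the frame unchanged.
        return df.copy()
    out = df.copy()
    # skipna=False propagates NaN, matching the behaviour of the `+` operator.
    out[target] = out[list(parts)].sum(axis=1, skipna=False)
    return out.drop(columns=list(parts))

# Toy data for illustration only.
toy = pd.DataFrame({'1stFlrSF': [850, 1250], '2ndFlrSF': [850, 0]})
combined = combine_floor_area(toy)
combined_again = combine_floor_area(combined)  # no error on a second call
print(combined)
```

Returning a copy instead of mutating in place keeps the original frame available for comparison, at the cost of some extra memory.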

In [41]:
df_train.columns

Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'Heating', 'HeatingQC', 'CentralAir',
       'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
       'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional',
       'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'WoodDeckSF', 'OpenPorchSF'