# Data Wrangling

From the EDA, we can see some columns that can be dropped, but we also saw a good number of NaN values. Part of this comes from the fact that "NA" is a valid category for many variables, and pandas by default reads it as np.nan when it loads the data. For numerical variables, however, NaN really does mean the value is missing, so those values must be filled, or the affected column or row must be dropped.

To fix the NA/nan issue, I load the data set, select all columns where NA _should_ be read as nan, and replace the missing values there with N/A. In the remaining columns, where NA is a valid category, I restore the string NA. Then I write the result back to a file and read it with na_values = ['N/A'] and keep_default_na = False, so that NA as a valid category is distinguished from N/A as missing data. Because the result is written to a file, I can load the data in other notebooks without having to redo the fillna() steps.
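As a minimal sketch of the round trip described above, the toy example below uses an in-memory CSV instead of the real train.csv. The three-row data and the variable names are made up for illustration. It also uses `to_csv(na_rep="N/A")` to write the missing-data marker, which is a swapped-in shortcut for the second fillna() call, not the code that was actually run.

```python
import io
import pandas as pd

# Toy CSV: "NA" in Alley means "no alley access" (a valid category),
# while the empty LotFrontage cell is genuinely missing.
raw = io.StringIO(
    "Id,Alley,LotFrontage\n"
    "1,NA,65.0\n"
    "2,Grvl,\n"
    "3,NA,80.0\n"
)

# Default parsing: both "NA" and the empty cell become NaN.
df = pd.read_csv(raw, index_col=0)

# Restore the valid category in the columns where "NA" is meaningful.
keep_na_cols = ["Alley"]
restored = df.fillna({col: "NA" for col in keep_na_cols})

# Write everything that is still NaN as the marker "N/A".
cleaner = io.StringIO()
restored.to_csv(cleaner, na_rep="N/A")
cleaner.seek(0)

# Re-read: only "N/A" counts as missing now, "NA" stays a plain string.
clean = pd.read_csv(cleaner, index_col=0, na_values=["N/A"], keep_default_na=False)
print(clean)
```

After the second read, Alley has no missing values, and LotFrontage is still a float column with one NaN.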

In [1]:
import numpy as np
import pandas as pd
from IPython.display import Markdown, display
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

This code was run once to generate the easier-to-handle train_cleaner file, which uses different values for truly missing data and for data with a valid NA entry.
```
# Columns where "NA" is a valid category (no alley, no basement, no garage, ...)
keep_na_cols = ['Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 
'BsmtFinType2', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 
'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']
keep_na_dict = {col : "NA" for col in keep_na_cols}
# Default parsing turns every "NA" into NaN
df = pd.read_csv("../train.csv", index_col=0)
# In all other columns, NaN means the value is truly missing
change_na_cols = [col for col in df.columns if col not in keep_na_cols]
change_na_dict = {col : 'N/A' for col in change_na_cols}
```
```
# Mark truly missing values as 'N/A', then restore the valid "NA" category
df.fillna(change_na_dict, inplace=True)
df.fillna(keep_na_dict, inplace=True)
df.to_csv("../train_cleaner.csv")
```


In [151]:
df = pd.read_csv("../train_cleaner.csv", na_values='N/A', keep_default_na=False)

In [152]:
df[df.columns[df.isna().any()].tolist()].count().apply(lambda x: 1460 - x)

LotFrontage    259
MasVnrType       8
MasVnrArea       8
Electrical       1
GarageYrBlt     81
dtype: int64
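The same per-column count can be obtained without hard-coding the row count of 1460. The sketch below shows the idea on a small made-up frame; the column values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in two of three columns.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "Electrical": ["SBrkr", None, "SBrkr", "FuseA"],
})

# Count NaN per column, then keep only columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0]
print(missing)
```

Since `isna().sum()` counts missing entries directly, it keeps working if rows are dropped later.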

Masonry veneer type is missing for 8 rows; fill it with None (and the corresponding MasVnrArea with 0).

In [147]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['MasVnrType'] = df['MasVnrType'].fillna('None')
df['MasVnrArea'] = df['MasVnrArea'].fillna(0.0)




# Data cleaning
The first step is to remove the columns and rows that we flagged during EDA.

One assumption that I make is that the remodel date is less important than how old the house was when it was remodeled. So I replace YearRemodAdd with "AgeRemodAdd", which is YearRemodAdd - YearBuilt.

In [153]:
yr_remod = df['YearRemodAdd'] - df['YearBuilt']
yr_remod

0        0
1        0
2        1
3       55
4        0
        ..
1455     1
1456    10
1457    65
1458    46
1459     0
Length: 1460, dtype: int64

In [155]:

df.insert(80, 'AgeRemodAdd', yr_remod)
df.drop('YearRemodAdd', axis = 1, inplace=True)
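The cell above appends the new column at position 80, which relies on the frame having exactly 81 columns. As a hedged alternative, the sketch below looks up the position of YearRemodAdd with `columns.get_loc` so the new column takes its place. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame: one house never remodeled, one remodeled 55 years after construction.
toy = pd.DataFrame({
    "YearBuilt": [2003, 1915],
    "YearRemodAdd": [2003, 1970],
    "SalePrice": [208500, 140000],
})

# Age of the house at the time of remodeling.
age_at_remodel = toy["YearRemodAdd"] - toy["YearBuilt"]

# Insert the new column where YearRemodAdd currently sits, then drop the old one.
position = toy.columns.get_loc("YearRemodAdd")
toy.insert(position, "AgeRemodAdd", age_at_remodel)
toy = toy.drop(columns="YearRemodAdd")
print(toy)
```

A value of 0 means the remodel year equals the build year.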

In [156]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          1460 non-null   object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

We also want to drop the row with Id 945 and the column 'Utilities'.

In [157]:
df.drop(df.index[df['Id'] == 945][0], inplace=True)
df.drop('Utilities', axis=1, inplace=True)
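Because the frame was read without index_col, the row labels start at 0 while Id starts at 1, so the row with Id 945 carries a different label. The sketch below illustrates on a made-up three-row frame that dropping by looked-up label and filtering with a boolean mask give the same result.

```python
import pandas as pd

# Toy frame with a default RangeIndex (labels 0, 1, 2) and a separate Id column.
toy = pd.DataFrame({
    "Id": [944, 945, 946],
    "Utilities": ["AllPub", "AllPub", "AllPub"],
    "LotArea": [9000, 14375, 10000],
})

# Approach used above: look up the row label for Id 945, then drop it.
row_label = toy.index[toy["Id"] == 945][0]
by_label = toy.drop(row_label).drop(columns="Utilities")

# Equivalent filter: keep every row whose Id is not 945.
by_mask = toy[toy["Id"] != 945].drop(columns="Utilities")
print(by_mask)
```

The mask version also does nothing harmful if the Id is absent, whereas indexing `[0]` on an empty result raises an IndexError.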

Drop the GarageCars column; GarageArea should be enough.

In [158]:
df.drop('GarageCars', axis=1, inplace=True)

## Fill N/A values in numerical fields
For LotFrontage, it makes sense to fill each missing value with the mean of its neighborhood.

In [159]:


df['LotFrontage'] = df.groupby('Neighborhood')['LotFrontage'].transform(lambda x : x.fillna(x.mean()))
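To see what the groupby/transform call does, here is a small sketch on made-up data. The neighborhood names and frontage values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data: two neighborhoods with partial data, one with no frontage at all.
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown", "Blueste"],
    "LotFrontage": [60.0, np.nan, 80.0, 50.0, np.nan, np.nan],
})

# Within each neighborhood, replace NaN with that neighborhood's mean.
toy["LotFrontage"] = (
    toy.groupby("Neighborhood")["LotFrontage"]
       .transform(lambda s: s.fillna(s.mean()))
)
print(toy)
```

The NAmes gap becomes 70.0 and the OldTown gap becomes 50.0. A neighborhood with no observed values at all has a NaN mean, so its rows stay NaN and would need a separate fallback.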


Electrical has a single missing value; fill it with SBrkr.

In [160]:
df['Electrical'] = df['Electrical'].fillna('SBrkr')

The last variable with NaN values is GarageYrBlt, but when a home has no garage there is no meaningful fill value. For now, let's try dropping the column.

In [161]:
df.drop('GarageYrBlt', axis=1, inplace=True)

This concludes the drops/fills we had to do in order to clean up the data in a minimal way. For now, let's write it to a file so we can pull identical data elsewhere, where we may want to drop/augment the data in various ways.

Before writing the file, convert MSSubClass to str.

In [162]:
df['MSSubClass'] = df['MSSubClass'].astype(str)

In [163]:
df.to_csv("../base_train.csv")
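One caveat worth keeping in mind: a CSV file does not store dtypes, so the string conversion of MSSubClass is not preserved when the file is read back with default settings. The sketch below shows this on a made-up two-row frame, using an in-memory buffer and `index=False` as simplifying assumptions.

```python
import io
import pandas as pd

# Toy frame where MSSubClass has already been converted to strings.
toy = pd.DataFrame({"Id": [1, 2], "MSSubClass": ["60", "20"]})

buf = io.StringIO()
toy.to_csv(buf, index=False)

# Default read: the digits are inferred as integers again.
buf.seek(0)
default_read = pd.read_csv(buf)

# Passing dtype keeps the column as strings.
buf.seek(0)
typed_read = pd.read_csv(buf, dtype={"MSSubClass": str})

print(default_read.dtypes)
print(typed_read.dtypes)
```

Any notebook that loads base_train.csv would need a similar `dtype` argument if it expects MSSubClass to be categorical.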