### Iowa Housing Lab -- Data Encoding

Welcome! This lab is a more advanced version of last class. We again build a regression model to predict housing prices, but this time the dataset has a more interesting mix of numeric and categorical data, as well as some missing values.

**Important:** A summary of each of the columns in this dataset, and what their values mean, can be found here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

**Step 1).  Load in your data set**

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

In [2]:
df = pd.read_csv('../../data/iowa_train2.csv')

**Step 2).  There are missing values throughout this dataset.  Fill them in appropriately**

We already covered this in class, but as a reminder:

 - Are the values missing at random, or does the missingness itself carry information?
 - If possible, encode missingness explicitly instead of silently filling it in
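
One rough way to probe the first question is to compare the target when a value is missing versus present. The sketch below is only an illustration: the `toy` frame holds made-up values, and `missingness_vs_target` is a hypothetical helper, not part of the lab's code.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Iowa dataset (values are made up for illustration).
toy = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd", np.nan, "Attchd", "Attchd"],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0, np.nan],
    "SalePrice": [208500, 90000, 140000, 85000, 250000, 143000],
})

def missingness_vs_target(df, target):
    """Hypothetical helper: for each column with nulls, compare the mean
    target when the value is missing vs. when it is present."""
    rows = {}
    for col in df.columns[df.isnull().any()]:
        if col == target:
            continue
        is_missing = df[col].isnull()
        rows[col] = {
            "share_missing": is_missing.mean(),
            "target_mean_if_missing": df.loc[is_missing, target].mean(),
            "target_mean_if_present": df.loc[~is_missing, target].mean(),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

report = missingness_vs_target(toy, "SalePrice")
print(report)
```

A large gap between the two means hints that the missingness is not random, which is a good reason to keep an explicit indicator column. On a sample this small it is only a hint, not a test.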

In [3]:
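def denote_null_values(df):
    """Add a boolean `<col>_missing` column for every column that has nulls."""
    # Boolean mask over columns: True where the column has at least one null
    empty_cols_query = df.isnull().sum() > 0
    empty_df_cols = df.loc[:, empty_cols_query].columns.tolist()
    for col in empty_df_cols:
        col_name = f"{col}_missing"
        df[col_name] = pd.isnull(df[col])
    # df is modified in place and also returned
    return df

df = denote_null_values(df)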

In [7]:
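# Split columns by dtype; the boolean *_missing flags fall into neither group
num_cols = df.select_dtypes(include=np.number).columns.tolist()
# `object` selects the string (categorical) columns
cat_cols = df.select_dtypes(include=object).columns.tolist()
# The *_missing flags already record where the nulls were, so fill with placeholders
df[num_cols] = df[num_cols].fillna(0)
df[cat_cols] = df[cat_cols].fillna('None')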

**Step 3): Encode Your Categorical Data**

For now, choose whichever encoding technique you prefer. Later on you'll go back and check whether the choice made a large difference.

In [8]:
df[cat_cols] = df[cat_cols].astype('category')
for col in cat_cols:
    df[col] = df[col].cat.codes
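
To see what `.cat.codes` does in the cell above, here is a minimal sketch on a made-up Series (the values are illustrative, not taken from the dataset). Categories are sorted, each one is mapped to its position in that order, and a null gets the code `-1`.

```python
import numpy as np
import pandas as pd

# Made-up garage types, including one null
s = pd.Series(["Detchd", "Attchd", np.nan, "BuiltIn", "Attchd"]).astype("category")

categories = list(s.cat.categories)
codes = s.cat.codes.tolist()

print(categories)  # ['Attchd', 'BuiltIn', 'Detchd']
print(codes)       # [2, 0, -1, 1, 0]
```

In the lab the nulls were already filled with `'None'` before this step, so `-1` should not appear there. Note that the integer order is alphabetical and therefore arbitrary; tree-based models often cope with that, but it is worth keeping in mind when comparing encodings.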

**Step 4):  Declare X & y, and fit your model**

In [9]:
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']
gbm = GradientBoostingRegressor()

**Step 5):  Score your model, and look at your feature importances** 

In [10]:
gbm.fit(X, y)
gbm.score(X, y)

0.9429450175233424
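
The cell above covers the scoring half of this step (`score` on a regressor returns R², here computed on the same data the model was fit on). For the feature importances, a minimal sketch follows. It uses synthetic data from `make_regression` so that it runs on its own; `X_demo`, `y_demo`, and `model` are stand-in names, and in the lab you would use your own `X`, `y`, and `gbm` instead.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the lab's X and y (assumption: not the Iowa data)
X_arr, y_demo = make_regression(
    n_samples=300, n_features=6, n_informative=3, noise=10.0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

model = GradientBoostingRegressor(random_state=0)
model.fit(X_demo, y_demo)

# Pair each importance with its column name and sort, largest first
importances = pd.Series(
    model.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

The importances are normalized to sum to 1, so each value can be read as a share of the model's total split gain.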

**Step 6):  (Time Permitting) Re-encode your categorical variables using the opposite technique, and observe whether it made a difference**

In [None]:
# your code here
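
If Step 3 used integer codes, the "opposite" technique is presumably one-hot encoding. One possible sketch uses `pd.get_dummies` on a made-up frame; the column names mirror the dataset, but the values are illustrative only.

```python
import pandas as pd

# Made-up rows; nulls are assumed to be filled with 'None' already, as in Step 2
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL", "FV"],
    "GarageType": ["Attchd", "None", "Detchd", "Attchd"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# One indicator column per category level; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["MSZoning", "GarageType"])
print(encoded.columns.tolist())
```

In the lab this would be applied to the original string columns (for example `pd.get_dummies(df, columns=cat_cols)`) before they are converted to codes, so the categorical data would need to be reloaded or copied first.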

If you've made it this far, you can stop.  We'll discuss step 7 as a way to wrap up the class and head into the next session.

**Step 7):  Score your model on your validation set**

How much did your results change?

In [None]:
# your answer here

In [76]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1985)

In [77]:
X_train.head()

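| Unnamed: 0 | Id | MSSubClass | MSZoning | LotArea | Neighborhood | OverallQual | OverallCond | YearBuilt | GrLivArea | 1stFlrSF | ... | GrLivArea.1 | FullBath | HalfBath | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageType_missing | GarageYrBlt_missing | GarageFinish_missing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 461 | 462 | 70 | 3 | 7200 | 18 | 7 | 9 | 1936 | 1135 | 575 | ... | 1135 | 1 | 0 | 5 | 1971.0 | 2 | 2 | False | False | False |
| 373 | 374 | 20 | 3 | 10634 | 12 | 5 | 6 | 1953 | 1319 | 1319 | ... | 1319 | 1 | 0 | 1 | 1953.0 | 3 | 1 | False | False | False |
| 1271 | 1272 | 20 | 3 | 9156 | 14 | 6 | 7 | 1968 | 1489 | 1489 | ... | 1489 | 2 | 0 | 1 | 1968.0 | 2 | 2 | False | False | False |
| 634 | 635 | 90 | 3 | 6979 | 17 | 6 | 5 | 1980 | 1056 | 1056 | ... | 1056 | 0 | 0 | 5 | 1980.0 | 3 | 2 | False | False | False |
| 1245 | 1246 | 80 | 3 | 12090 | 14 | 6 | 7 | 1984 | 1868 | 1140 | ... | 1868 | 3 | 1 | 3 | 1984.0 | 0 | 2 | False | False | False |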

In [85]:
gbm.fit(X_train, y_train)
gbm.score(X_test, y_test)

0.885304666734425

In [87]:
X_train.shape, y_train.shape

((1168, 21), (1168,))

In [89]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1985)

In [90]:
gbm.fit(X_train, y_train)

GradientBoostingRegressor()

In [91]:
gbm.score(X_val, y_val)

0.6919964683748443
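
The cells above split the data twice and score each piece separately. As a compact way to see the three numbers side by side, here is a sketch that does the same train/validation/test split and reports R² on each part. It runs on synthetic data so that it is self-contained; the `_demo` and `_rest` names are stand-ins, and the scores it prints will not match the lab's.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab's X and y (assumption: not the Iowa data)
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)

# First hold out a test set, then carve a validation set out of what remains
X_rest, X_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1985
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=1985
)

# Fit on the training portion only
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

scores = {
    "train": model.score(X_train, y_train),
    "validation": model.score(X_val, y_val),
    "test": model.score(X_test, y_test),
}
for name, r2 in scores.items():
    print(f"{name:>10}: R^2 = {r2:.3f}")
```

Using separate names for the intermediate split (`X_rest` here) avoids overwriting `X_train`, which makes it easier to rerun individual cells without shrinking the training set by accident.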