### Iowa Housing Lab -- Solutions

Welcome! This lab is a more advanced version of yesterday's class. We again build a regression model to predict housing prices, but this time with a dataset that has a more interesting mix of data: ordinal and nominal features, as well as some missing values.

**Important:** A summary of each of the columns in this dataset, and what their values mean, can be found here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

**Step 1).  Load in both your training & test sets**

In [3]:
# your code here
import pandas as pd
train = pd.read_csv('../data/iowa_housing/train.csv')
test  = pd.read_csv('../data/iowa_housing/test.csv')


In [None]:
train.info()

Also, when you're cleaning training & test sets, it's usually a good idea to separate the column you're trying to predict from everything else.

For now, declare `y` to be the `SalePrice` column, and then remove it from the training set entirely.  You can drop the `Id` column too.

In [254]:
# your answer here
import numpy as np

# this solution models log(SalePrice); np.exp() turns predictions back into prices
y = np.log(train['SalePrice'])
train.drop(['SalePrice', 'Id'], axis=1, inplace=True)
test.drop('Id', axis=1, inplace=True)

**Step 2).  There are missing values throughout this dataset.  For the time being, let's try and do a few things:**

 - Are these missing values likely to occur at random, or are they likely encoding something else?

If missing values encode something else, they are usually highly correlated with missing values in related columns, or they represent a particular level in a hierarchy, e.g. 'None', 0, 'Other', etc.

Take a look at the column descriptions and see what you think they might be.

 - When you've made your decisions, fill in the remaining missing values in each column with the mean if the column is numeric, and the mode if it's categorical.
 - Don't forget to look at missing values in your test set too!

*Make sure to perform this operation on both the training and test sets, using values from the training set for imputation.*
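
A minimal sketch of that rule on a made-up frame (the column names and values below are illustrative, not taken from the dataset): learn one fill value per column from the training set only, then apply those same values to both sets.

```python
import pandas as pd

# toy data, for illustration only
toy_train = pd.DataFrame({
    'area': [1000.0, None, 2000.0],
    'zone': ['RL', 'RL', None],
})
toy_test = pd.DataFrame({
    'area': [None, 1500.0],
    'zone': [None, 'RM'],
})

# learn one fill value per column from the training set only
fill_values = {}
for col in toy_train.columns:
    if pd.api.types.is_numeric_dtype(toy_train[col]):
        fill_values[col] = toy_train[col].mean()      # numeric -> mean
    else:
        fill_values[col] = toy_train[col].mode()[0]   # categorical -> mode

# apply the same training-set values to both sets
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)
```

The test set never contributes to `fill_values`, so nothing about it leaks into the preprocessing.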

In [255]:
# your code here
train_empty = train.loc[:, train.isnull().sum() > 0]

In [256]:
# the missing values in these columns are perfectly correlated (1.0)
# all three describe the garage, so a missing value almost certainly means "no garage"
train_empty.isnull().astype(int).corr()

              GarageType  GarageYrBlt  GarageFinish
GarageType           1.0          1.0           1.0
GarageYrBlt          1.0          1.0           1.0
GarageFinish         1.0          1.0           1.0


In [272]:
# a missing value here means "no garage"; 'NA' or 'Other' would work as labels too
garage_cat_cols = ['GarageType', 'GarageFinish']
train[garage_cat_cols] = train[garage_cat_cols].fillna('None')
test[garage_cat_cols]  = test[garage_cat_cols].fillna('None')

In [273]:
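# assign the result back: fillna(inplace=True) on a selected column may not update the frame in newer pandas
train['GarageYrBlt'] = train['GarageYrBlt'].fillna(0)
test['GarageYrBlt']  = test['GarageYrBlt'].fillna(0)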

In [274]:
# a few test-set columns still have missing values; we'll impute these
# using values from the training set
test.isnull().sum()

Id              0
MSSubClass      0
MSZoning        4
LotArea         0
Neighborhood    0
OverallQual     0
OverallCond     0
YearBuilt       0
GrLivArea       0
1stFlrSF        0
2ndFlrSF        0
GrLivArea.1     0
FullBath        0
HalfBath        0
GarageType      0
GarageYrBlt     0
GarageFinish    0
GarageCars      1
dtype: int64

In [275]:
# finding the values to use in the training set
ms_mode   = train['MSZoning'].mode()[0]
gcarsmean = train['GarageCars'].mean()

In [276]:
# and applying them to the test set (assigning back rather than using inplace=True)
test['MSZoning']   = test['MSZoning'].fillna(ms_mode)
test['GarageCars'] = test['GarageCars'].fillna(gcarsmean)

**Step 3): Ordinal vs Categorical Columns**

There are a number of categorical columns in this dataset, and each one holds either ordinal or nominal data.

Take a look at their descriptions, and decide which columns belong to which group.
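
One way to start, sketched on a toy frame (the values below are illustrative): print the distinct values of every non-numeric column and compare them against the data dictionary.

```python
import pandas as pd

# toy frame standing in for the training set (values are illustrative)
toy = pd.DataFrame({
    'GarageFinish': ['Unf', 'RFn', 'Fin', 'Unf'],
    'Neighborhood': ['CollgCr', 'Edwards', 'Gilbert', 'Edwards'],
    'LotArea': [8450, 9600, 11250, 9550],
})

# collect the distinct values of every non-numeric column
levels = {
    col: sorted(toy[col].dropna().unique())
    for col in toy.select_dtypes(exclude='number').columns
}

for col, values in levels.items():
    print(col, values)
```

Here the `GarageFinish` levels have a natural order (unfinished, rough finished, finished), which suggests an ordinal column, while the `Neighborhood` labels have no inherent ranking and are nominal.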

In [None]:
# your answer here (no real code required for this one)

**Step 4):  Go Ahead and Change Your Ordinal Variables To Their Appropriate Values**

In [277]:
# your code here
# we'll assume GarageFinish is ordinal, i.e. finished garage > unfinished garage
garage_mapping = {
    'None': 0, # no garage
    'Unf' : 1, # unfinished garage
    'RFn' : 2, # rough finished garage
    'Fin' : 3, # finished garage
}

train['GarageFinish'] = train['GarageFinish'].map(garage_mapping)
test['GarageFinish']  = test['GarageFinish'].map(garage_mapping)
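
`Series.map` with a dictionary turns any value that is missing from the dictionary into `NaN`, so a quick check after mapping can catch a typo or an unexpected category. A small sketch on a toy column (the `'Partial'` label is made up for the example):

```python
import pandas as pd

finish_order = {'None': 0, 'Unf': 1, 'RFn': 2, 'Fin': 3}

# toy column with one value ('Partial') that is not in the mapping
finish = pd.Series(['Fin', 'Unf', 'Partial', 'None'])
mapped = finish.map(finish_order)

# values that were present before mapping but are NaN afterwards
unmapped = list(finish[mapped.isnull() & finish.notnull()].unique())
print(unmapped)
```

An empty list means every category was covered by the mapping.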

**Step 5):  Now, OneHot Encode Your Dataset For Your Remaining Categorical Columns** 

**Note:** You want your training and test sets attached (concatenated) for this one.  Detach them when you're finished.

**2nd Note:** Some columns are categorical even though they're encoded as numbers.  `MSSubClass`, for example, is essentially a dwelling-type category.

By default, `pd.get_dummies` one-hot encodes every column with an object or category dtype, but if you want to specify the exact columns to use, there is a `columns` argument that you can pass in.

A good idea here would be to get the list of categorical columns using the `select_dtypes` method, store it as a list (using the `tolist()` method), then append `MSSubClass` to it and pass that list as the `columns` argument of `pd.get_dummies`.
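
A toy sketch of why the sets are attached first (values are illustrative, and `exclude='number'` is used here as one way to pick out the text columns): a category that appears in only one set still gets a column in both once they are encoded together.

```python
import pandas as pd

# toy sets: 'RH' appears only in the test set (illustrative values)
toy_train = pd.DataFrame({'MSZoning': ['RL', 'RM'], 'MSSubClass': [20, 60]})
toy_test = pd.DataFrame({'MSZoning': ['RL', 'RH'], 'MSSubClass': [20, 20]})

# text columns, plus the numeric-coded MSSubClass
cat_cols = toy_train.select_dtypes(exclude='number').columns.tolist()
cat_cols.append('MSSubClass')

# attach, encode once, then detach by row count
combined = pd.concat([toy_train, toy_test], ignore_index=True)
encoded = pd.get_dummies(combined, columns=cat_cols)

n_train = len(toy_train)
enc_train = encoded.iloc[:n_train]
enc_test = encoded.iloc[n_train:]
```

Both halves end up with the same five dummy columns, including `MSZoning_RH`, which the training rows never use.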

In [262]:
import numpy as np
# np.object was removed in NumPy 1.24; the string 'object' selects the same columns
cat_cols = train.select_dtypes(include='object').columns.tolist()

In [278]:
# MSSubClass is really a category, more so than a true number,
# so we'll add it to the list of columns to be encoded
cat_cols.append('MSSubClass')

In [279]:
# concatenate and encode
master = pd.concat([train, test])
master = pd.get_dummies(master, columns=cat_cols)

In [280]:
# and split back apart (the first len(y) rows are the training set)
train  = master.iloc[:len(y)]
test   = master.iloc[len(y):]

In [282]:
train.isnull().sum()

1stFlrSF                0
2ndFlrSF                0
FullBath                0
GarageCars              0
GarageFinish            0
GrLivArea               0
GrLivArea.1             0
HalfBath                0
Id                      0
LotArea                 0
OverallCond             0
OverallQual             0
SalePrice               0
YearBuilt               0
MSZoning_C (all)        0
MSZoning_FV             0
MSZoning_RH             0
MSZoning_RL             0
MSZoning_RM             0
Neighborhood_Blmngtn    0
Neighborhood_Blueste    0
Neighborhood_BrDale     0
Neighborhood_BrkSide    0
Neighborhood_ClearCr    0
Neighborhood_CollgCr    0
Neighborhood_Crawfor    0
Neighborhood_Edwards    0
Neighborhood_Gilbert    0
Neighborhood_IDOTRR     0
Neighborhood_MeadowV    0
                       ..
MSSubClass_40           0
MSSubClass_45           0
MSSubClass_50           0
MSSubClass_60           0
MSSubClass_70           0
MSSubClass_75           0
MSSubClass_80           0
MSSubClass_8

**Step 6): Standardize Your Data On Your Training and Test Sets**

**Remember:** Use the values from your training set to standardize your test set!  

Ask me if you have any questions on how to do this.
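
As a small illustration with made-up numbers: `fit` learns the mean and standard deviation from the training set, and `transform` reuses those values on the test set. The by-hand version below should give the same result.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy numeric data (illustrative values)
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])

# fit learns the mean and standard deviation of the training set only
scaler = StandardScaler().fit(toy_train)
test_scaled = scaler.transform(toy_test)

# the same result by hand, using training-set statistics
by_hand = (toy_test - toy_train.mean(axis=0)) / toy_train.std(axis=0)
print(np.allclose(test_scaled, by_hand))
```

Calling `fit_transform` on the test set instead would rescale it with its own statistics, so the two sets would no longer be on the same scale.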

In [283]:
from sklearn.preprocessing import StandardScaler

In [284]:
sc = StandardScaler()

# standardize the training set
train = sc.fit_transform(train)
# and the test set - notice how we're using the training set here?
test = sc.transform(test)

**Step 7):  Create a validation set out of your training set**

Since there is no time-based component, random shuffling is fine.

In [285]:
# your answer here
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_val, y_train, y_val = train_test_split(train, y, random_state=2020)

**Step 8): Fit Linear Regression on your training set, and score it on your validation set to get a feel for how you did.**

In [286]:
# your answer here
lreg = LinearRegression()
lreg.fit(X_train, y_train)
lreg.score(X_val, y_val)

-1.3428885317822715e+26
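
For a regressor, `.score` returns R², which is 1 minus the ratio of the residual sum of squares to the total sum of squares. It has no lower bound: predictions that are worse than always guessing the mean of the validation targets give a negative score. A toy check with made-up numbers:

```python
import numpy as np
from sklearn.metrics import r2_score

# toy targets and deliberately bad predictions (illustrative values)
y_true = np.array([1.0, 2.0, 3.0])
y_bad = np.array([3.0, 2.0, 1.0])

# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_bad) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_by_hand = 1 - ss_res / ss_tot

print(r2_by_hand, r2_score(y_true, y_bad))
```

A score as extreme as the one above often points to numerically unstable coefficients, for example from redundant one-hot columns, though that is a guess rather than something verified in this notebook.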

**Step 9):  Finally, go ahead and make your predictions on your test set.**

Save the following columns to a csv file: the ID of each row in your test set, as well as your prediction.

In [287]:
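# your answer here
preds = pd.DataFrame()
# Id was dropped earlier; this assumes the test rows keep Kaggle's Ids, 1461 onward
preds['ID'] = np.arange(1461, 1461 + len(test))
# the model was fit on log(SalePrice), so convert predictions back to prices
preds['prediction'] = np.exp(lreg.predict(test))
# 'predictions.csv' is a placeholder file name
preds.to_csv('predictions.csv', index=False)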


**Bonus:** Can you improve your score?

The first part of this lab was meant to be a walkthrough of the basics of prepping a dataset and getting it ready for modeling.

However, there's a lot that could be improved upon!  

Using validation scores as your guide, you could try and look at some of the following:

 - Removing outliers from the target variable, or using log transformations to make skewed data more symmetric
 - There are lots of highly correlated variables in this dataset.  Do the 4 different columns about the fireplace really tell you something that different from one another?  You can try averaging multiple columns into one if they're highly correlated, or removing some entirely to see if it improves anything.
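
For the correlated-columns idea, one possible starting point is to list every pair of numeric columns whose absolute correlation exceeds some cutoff. The sketch below uses a toy frame, and the 0.9 threshold is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# toy numeric features; 'b' is an exact multiple of 'a' (illustrative values)
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.0],
    'c': [1.0, -1.0, 1.0, -1.0],
})

corr = toy.corr().abs()

# keep only the upper triangle so each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # arbitrary cutoff chosen for this sketch
pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]
print(pairs)
```

Each pair in the list is a candidate for dropping one column or averaging the two; the validation score tells you whether the change actually helps.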

In [None]:
# your answer here