### Iowa Housing Lab

Welcome! This lab is a more advanced version of yesterday's class. We again build a regression model to predict housing prices, but this time with a dataset that has a more interesting mix of data: ordinal and nominal features, as well as some missing values.

**Important:** A summary of each of the columns in this dataset, and what their values mean, can be found here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

**Step 1).  Load in both your training & test sets**

In [15]:
import pandas as pd
import numpy as np

train = pd.read_csv('/Users/aoifeduna/AoifeRepo/aoiferepo/Lectures/Unit3/data/train-small.csv')
test = pd.read_csv('/Users/aoifeduna/AoifeRepo/aoiferepo/Lectures/Unit3/data/test-small.csv')

**Step 2).  There are missing values throughout this dataset.  For the time being, let's try and do a few things:**

 - Are these missing values likely to occur at random, or do they encode something else?
 
If missing values encode something else, they usually either correlate strongly with missing values in related columns, or they stand for a particular level in a hierarchy, e.g. 'None', 0, 'Other', etc.

Take a look at the column descriptions and see what you think they might be.

 - When you've made your decisions, fill in the missing values in each column with the mean if the column is numeric, or with the mode if it's categorical.
 
*Make sure to perform this operation on both the training and test sets, using values from the training set for imputation.*
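
As a minimal sketch of that rule, the snippet below computes fill values from a training frame only and applies them to both sets. The frames `toy_train` and `toy_test` and their values are made up for illustration; they are not the lab's data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the lab's train/test frames (values are made up).
toy_train = pd.DataFrame({
    "GarageCars": [2.0, 1.0, np.nan, 3.0],
    "MSZoning": ["RL", "RL", "RM", None],
})
toy_test = pd.DataFrame({
    "GarageCars": [np.nan, 2.0],
    "MSZoning": [None, "RM"],
})

# Compute every fill value from the training set only.
fill_values = {}
for col in toy_train.columns:
    if pd.api.types.is_numeric_dtype(toy_train[col]):
        fill_values[col] = toy_train[col].mean()     # numeric: mean
    else:
        fill_values[col] = toy_train[col].mode()[0]  # categorical: mode

# Apply the same training-derived values to both sets.
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)
```

Because the test set never contributes to `fill_values`, no information from it leaks into preprocessing.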

In [16]:
test.isnull().sum().sort_values(ascending=False)

GarageYrBlt     78
GarageFinish    78
GarageType      76
MSZoning         4
GarageCars       1
OverallCond      0
MSSubClass       0
LotArea          0
Neighborhood     0
OverallQual      0
GrLivArea        0
YearBuilt        0
1stFlrSF         0
2ndFlrSF         0
GrLivArea.1      0
FullBath         0
HalfBath         0
Id               0
dtype: int64

In [17]:
train_empty = train.loc[:, train.isnull().sum() > 0]
train_empty

Unnamed: 0,GarageType,GarageYrBlt,GarageFinish
0,Attchd,2003.0,RFn
1,Attchd,1976.0,RFn
2,Attchd,2001.0,RFn
3,Detchd,1998.0,Unf
4,Attchd,2000.0,RFn
...,...,...,...
1455,Attchd,1999.0,RFn
1456,Attchd,1978.0,Unf
1457,Attchd,1941.0,RFn
1458,Attchd,1950.0,Unf


In [19]:
train_empty.isnull().astype(int).corr()

Unnamed: 0,GarageType,GarageYrBlt,GarageFinish
GarageType,1.0,1.0,1.0
GarageYrBlt,1.0,1.0,1.0
GarageFinish,1.0,1.0,1.0


In [21]:
cols = train_empty.columns.tolist()
cols

['GarageType', 'GarageYrBlt', 'GarageFinish']

In [23]:
train[['GarageType', 'GarageFinish']] = train[['GarageType', 'GarageFinish']].fillna('None')
test[['GarageType', 'GarageFinish']] = test[['GarageType', 'GarageFinish']].fillna('None')

In [25]:
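# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
train['GarageYrBlt'] = train['GarageYrBlt'].fillna(0)
test['GarageYrBlt'] = test['GarageYrBlt'].fillna(0)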

In [27]:
test.isnull().sum()

Id              0
MSSubClass      0
MSZoning        4
LotArea         0
Neighborhood    0
OverallQual     0
OverallCond     0
YearBuilt       0
GrLivArea       0
1stFlrSF        0
2ndFlrSF        0
GrLivArea.1     0
FullBath        0
HalfBath        0
GarageType      0
GarageYrBlt     0
GarageFinish    0
GarageCars      1
dtype: int64

In [29]:
ms_mode = train['MSZoning'].mode()[0]
gcarsmean = train['GarageCars'].mean()

In [30]:
# Fill the test set with statistics computed on the training set
test['MSZoning'] = test['MSZoning'].fillna(ms_mode)
test['GarageCars'] = test['GarageCars'].fillna(gcarsmean)

In [31]:
test.isnull().sum()
# Woohooo the nulls are all gone!

Id              0
MSSubClass      0
MSZoning        0
LotArea         0
Neighborhood    0
OverallQual     0
OverallCond     0
YearBuilt       0
GrLivArea       0
1stFlrSF        0
2ndFlrSF        0
GrLivArea.1     0
FullBath        0
HalfBath        0
GarageType      0
GarageYrBlt     0
GarageFinish    0
GarageCars      0
dtype: int64

**Step 3): Ordinal vs Nominal Columns**

There are a number of categorical columns in this dataset, and each could represent either ordinal or nominal data.  

Take a look at their descriptions, and decide which columns are ordinal and which are nominal.

In [33]:
num_cols = train.select_dtypes(include=np.number).columns.tolist()
num_cols
# These are the numeric columns

['Id',
 'MSSubClass',
 'LotArea',
 'OverallQual',
 'OverallCond',
 'YearBuilt',
 'GrLivArea',
 '1stFlrSF',
 '2ndFlrSF',
 'GrLivArea.1',
 'FullBath',
 'HalfBath',
 'GarageYrBlt',
 'GarageCars',
 'SalePrice']

In [35]:
# All non-numeric columns: a mix of ordinal and nominal, still to be sorted out
cat_cols = train.select_dtypes(exclude=np.number).columns.tolist()
cat_cols

['MSZoning', 'Neighborhood', 'GarageType', 'GarageFinish']

**Step 4):  Go Ahead and Convert Your Ordinal Variables To Appropriate Numeric Values**

In [None]:
# your code here
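
One possible approach, as a minimal sketch: map each ordinal level to an integer with a dictionary. The ordering below for `GarageFinish` (no garage, unfinished, rough finished, finished) is an assumption based on the column description, and the frame `toy` is a made-up stand-in for the lab's data.

```python
import pandas as pd

# Assumed ordering, from "no garage" up to "finished" (a judgment call).
garage_finish_order = {"None": 0, "Unf": 1, "RFn": 2, "Fin": 3}

# Toy stand-in for the lab's frame; in the lab, apply the same map to train and test.
toy = pd.DataFrame({"GarageFinish": ["RFn", "Unf", "None", "Fin"]})
toy["GarageFinish"] = toy["GarageFinish"].map(garage_finish_order)
```

Note that `.map` returns `NaN` for any value missing from the dictionary, so it is worth checking for nulls afterwards.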

**Step 5):  Now, One-Hot Encode Your Dataset For Your Remaining Categorical Columns**

**Note:** You want your training and test sets concatenated for this one, so that both end up with the same dummy columns.  Split them apart again when you're finished.

In [None]:
# your code here
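
A minimal sketch of the concatenate, encode, and split pattern, assuming `pd.get_dummies` for the encoding. The frames and their values are made up; the point is that a category seen only in the test set (`FV` here) still gets a column in both halves.

```python
import pandas as pd

# Toy stand-ins for the lab's frames (values are made up).
toy_train = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "LotArea": [8000, 9500, 11000],
    "SalePrice": [200000, 180000, 220000],
})
toy_test = pd.DataFrame({"MSZoning": ["RL", "FV"], "LotArea": [9000, 14000]})

n_train = len(toy_train)
target = toy_train["SalePrice"]

# Stack the feature columns so both sets are encoded with the same categories.
combined = pd.concat(
    [toy_train.drop(columns="SalePrice"), toy_test], ignore_index=True
)
encoded = pd.get_dummies(combined, columns=["MSZoning"])

# Split back apart by row position and reattach the target to the training half.
train_enc = encoded.iloc[:n_train].copy()
test_enc = encoded.iloc[n_train:].reset_index(drop=True)
train_enc["SalePrice"] = target.values
```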

**Step 6): Standardize Your Data On Your Training and Test Sets**

**Remember:** Use the values from your training set to standardize your test set!  

Ask me if you have any questions on how to do this.

In [None]:
# your answer here
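
A minimal sketch of standardizing with training statistics, written in plain pandas on made-up data. scikit-learn's `StandardScaler` (fit on the training set, then `transform` on both) is the more common route and follows the same idea.

```python
import pandas as pd

# Toy stand-ins for the lab's numeric features (values are made up).
toy_train = pd.DataFrame({
    "LotArea": [8000.0, 10000.0, 12000.0],
    "GrLivArea": [1000.0, 1500.0, 2000.0],
})
toy_test = pd.DataFrame({
    "LotArea": [10000.0, 14000.0],
    "GrLivArea": [1500.0, 2500.0],
})

# Statistics come from the training set only.
train_mean = toy_train.mean()
train_std = toy_train.std(ddof=0)  # population std, the convention StandardScaler uses

# The same mean and std are applied to both sets.
train_scaled = (toy_train - train_mean) / train_std
test_scaled = (toy_test - train_mean) / train_std
```

The scaled test columns will generally not have mean 0 and standard deviation 1, and that is expected: they are measured on the training set's scale.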

**Step 7):  Create a validation set out of your training set**

Since there is no time-based component, random shuffling is fine.

In [None]:
# your answer here
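
A minimal sketch of a random hold-out split using only pandas, on a made-up frame. The 20% validation fraction and the `random_state` value are arbitrary choices; scikit-learn's `train_test_split` is a common alternative.

```python
import pandas as pd

# Toy stand-in for the prepared training frame (values are made up).
toy = pd.DataFrame({"GrLivArea": range(10), "SalePrice": range(100, 110)})

# Randomly hold out 20% of the rows; random_state makes the shuffle reproducible.
val_part = toy.sample(frac=0.2, random_state=42)

# Everything not sampled stays in the training portion.
train_part = toy.drop(val_part.index)
```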

**Step 8): Fit Linear Regression on your training set, and score it on your validation set to get a feel for how you did.**

In [None]:
# your answer here
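
A minimal sketch of fitting and scoring, assuming scikit-learn's `LinearRegression`. The data is synthetic (a known linear relationship plus a little noise) so the example runs on its own; in the lab you would pass your prepared training and validation features instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3*x0 - 2*x1 + 5 plus small noise (not the lab's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=100)

# Simple positional split: first 80 rows to train, last 20 to validate.
X_train, X_val = X[:80], X[80:]
y_train, y_val = y[:80], y[80:]

model = LinearRegression()
model.fit(X_train, y_train)

# For regressors, .score returns R^2 on the data passed in.
val_r2 = model.score(X_val, y_val)
```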

**Step 9):  Finally, go ahead and make your predictions on your test set.**

Save the following columns to a csv file: the ID of each row in your test set, and your prediction for that row.

In [None]:
# your answer here
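
A minimal sketch of building the two-column output. The IDs and predictions are made up, and naming the prediction column `SalePrice` is an assumption. The example writes to an in-memory buffer so it leaves no file behind; passing a filename to `to_csv` instead writes to disk.

```python
import io
import pandas as pd

# Made-up IDs and predictions standing in for test['Id'] and the model's output.
toy_ids = [1461, 1462, 1463]
toy_preds = [120000.0, 155000.5, 180250.0]

submission = pd.DataFrame({"Id": toy_ids, "SalePrice": toy_preds})

# index=False keeps the DataFrame's row index out of the file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```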

**Bonus:** Can you improve your score?

The first part of this lab was meant to be a walkthrough of the basics of prepping a dataset and getting it ready for modeling.

However, there's a lot that could be improved upon!  

Using validation scores as your guide, you could try and look at some of the following:

 - Removing outliers from the target variable, or using a log transformation to make its distribution less skewed
 - There are lots of highly correlated variables in this dataset.  Do the four different columns about the fireplace really tell you something that different from one another?  You can try averaging multiple columns into one if they're highly correlated, or removing some entirely to see if it improves anything.

In [None]:
# your answer here
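
As one illustration of the log-transformation idea, the sketch below uses `np.log1p` on made-up prices and `np.expm1` to map values back. If you train on a log-transformed target, predictions need the inverse transform before you report or save them.

```python
import numpy as np

# Made-up prices with one large value to mimic a right-skewed target.
prices = np.array([100000.0, 150000.0, 200000.0, 750000.0])

# log1p computes log(1 + x) and compresses the large values.
log_prices = np.log1p(prices)

# expm1 is the exact inverse, returning values to the original price scale.
recovered = np.expm1(log_prices)
```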