### Iowa Housing Lab -- Solutions

Welcome!! This lab is going to be a bit more of an advanced version of yesterday's class, where we build a regression model to predict housing prices, but this time do so with a dataset that has a more interesting mix of data -- ordinal and nominal features, as well as some missing values.

**Important:** A summary of each of the columns in this dataset, and what their values mean, can be found here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

**Step 1).  Load in both your training & test sets**

In [323]:
import pandas as pd
import numpy as np

train = pd.read_csv('/Users/aoifeduna/AoifeRepo/aoiferepo/Lectures/Unit3/data/train-small.csv')
test = pd.read_csv('/Users/aoifeduna/AoifeRepo/aoiferepo/Lectures/Unit3/data/test-small.csv')

In [324]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 19 columns):
Id              1460 non-null int64
MSSubClass      1460 non-null int64
MSZoning        1460 non-null object
LotArea         1460 non-null int64
Neighborhood    1460 non-null object
OverallQual     1460 non-null int64
OverallCond     1460 non-null int64
YearBuilt       1460 non-null int64
GrLivArea       1460 non-null int64
1stFlrSF        1460 non-null int64
2ndFlrSF        1460 non-null int64
GrLivArea.1     1460 non-null int64
FullBath        1460 non-null int64
HalfBath        1460 non-null int64
GarageType      1379 non-null object
GarageYrBlt     1379 non-null float64
GarageFinish    1379 non-null object
GarageCars      1460 non-null int64
SalePrice       1460 non-null int64
dtypes: float64(1), int64(14), object(4)
memory usage: 216.8+ KB


In [325]:
y = train['SalePrice']
train.drop('SalePrice', axis=1, inplace=True)
train.drop('Id', axis=1, inplace=True)
test.drop('Id', axis=1, inplace=True)
# If you declare this separately you don't have to worry about doing these operations for Y
# The inplace argument ACTUALLY drops the data. If you don't have this, it will just show it in the view.
# You can also skip the inplace and set the whole thing equal to itself, which is basically 'resetting' the dataframe

Also....when you're cleaning training & test sets, it's usually a good idea to separate the column you're trying to predict from everything else.  

For now, declare `y` to be the `SalePrice` column, and then remove it from the training set entirely.  You can drop the `ID` column too, since it encodes nothing meaningful.

**Step 2).  There are missing values throughout this dataset.  For the time being, let's try and do a few things:**

 - were these missing values likely to be randomly occurring, or are they likely encoding for something else?  
 
If values are encoding for something else, there are usually either high correlations with missing values in similar columns, and/or they could potentially represent a particular rank in a hierarchy -- ie, 'None', 0, 'Other', etc.  Ie, the missing values basically are encoding for something specific, it's just not mentioned.

Take a look at the column descriptions, see what you think they might be.

 - if you think they are missing at random, fill in the missing values with their mean(numeric columns) or mode(categorical columns)
 - if you think they are **not** missing at random, then go ahead and fill them in with a value to encode what they are (0, 'Other', and 'None' are common choices)
 
**Hint:** You can try encoding null & non-null values to 0 and 1, respectively, and use the corr() method on that. 
 
*If filling in missing values, make sure to perform this operation on the training and test set, using values from the training set for imputation.*

In [326]:
# Generally, you want to take some sort of value from the training set and use that to fill nulls on the test set.
# Mean or mode. Categorical takes mode.
test.isnull().sum().sort_values(ascending=False)

GarageYrBlt     78
GarageFinish    78
GarageType      76
MSZoning         4
GarageCars       1
YearBuilt        0
LotArea          0
Neighborhood     0
OverallQual      0
OverallCond      0
1stFlrSF         0
GrLivArea        0
2ndFlrSF         0
GrLivArea.1      0
FullBath         0
HalfBath         0
MSSubClass       0
dtype: int64

In [327]:
train_empty = train.loc[:, train.isnull().sum() > 0]
# This is selecting for the columns that have missing values
train_empty
# All of these columns have to do with the garage, and they all have the exact same number of missing values

Unnamed: 0,GarageType,GarageYrBlt,GarageFinish
0,Attchd,2003.0,RFn
1,Attchd,1976.0,RFn
2,Attchd,2001.0,RFn
3,Detchd,1998.0,Unf
4,Attchd,2000.0,RFn
...,...,...,...
1455,Attchd,1999.0,RFn
1456,Attchd,1978.0,Unf
1457,Attchd,1941.0,RFn
1458,Attchd,1950.0,Unf


In [328]:
train_empty.isnull().astype(int).corr()
# See if they're correlated
# Yep they are
# In this case the 0 encodes for "nothing"

Unnamed: 0,GarageType,GarageYrBlt,GarageFinish
GarageType,1.0,1.0,1.0
GarageYrBlt,1.0,1.0,1.0
GarageFinish,1.0,1.0,1.0


In [329]:
cols = train_empty.columns.tolist()
cols

['GarageType', 'GarageYrBlt', 'GarageFinish']

In [330]:
train[['GarageType', 'GarageFinish']] = train[['GarageType', 'GarageFinish']].fillna('None')
test[['GarageType', 'GarageFinish']] = test[['GarageType', 'GarageFinish']].fillna('None')

In [331]:
train['GarageYrBlt'].fillna(0, inplace=True)
test['GarageYrBlt'].fillna(0, inplace=True)
# Just fill it in with 0 because it doesn't exist

In [332]:
test.isnull().sum()

MSSubClass      0
MSZoning        4
LotArea         0
Neighborhood    0
OverallQual     0
OverallCond     0
YearBuilt       0
GrLivArea       0
1stFlrSF        0
2ndFlrSF        0
GrLivArea.1     0
FullBath        0
HalfBath        0
GarageType      0
GarageYrBlt     0
GarageFinish    0
GarageCars      1
dtype: int64

In [333]:
ms_mode   = train['MSZoning'].mode()[0]
# This way you grab the value rather than the series
gcarsmean = train['GarageCars'].mean()
# You use the values on the training set to fill in on the test set

In [334]:
test['MSZoning'].fillna(ms_mode, inplace=True)
test['GarageCars'].fillna(gcarsmean, inplace=True)

In [335]:
test.isnull().sum()
# Woohooo the nulls are all gone!

MSSubClass      0
MSZoning        0
LotArea         0
Neighborhood    0
OverallQual     0
OverallCond     0
YearBuilt       0
GrLivArea       0
1stFlrSF        0
2ndFlrSF        0
GrLivArea.1     0
FullBath        0
HalfBath        0
GarageType      0
GarageYrBlt     0
GarageFinish    0
GarageCars      0
dtype: int64

**Step 3): Ordinal vs Nominal Data**

There are a number of categorical columns in this dataset, and they could represent both ordinal data(data that has a rank) or nominal data (data that doesn't have a rank).  

There is a file called `data_description.txt` that contains descriptions of all the values in each and what they mean if you want to do a little bit of research.

You can also find a brief description here:  https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data.

In [336]:
num_cols = train.select_dtypes(include=np.number).columns.tolist()
num_cols
# These are the numeric values

['MSSubClass',
 'LotArea',
 'OverallQual',
 'OverallCond',
 'YearBuilt',
 'GrLivArea',
 '1stFlrSF',
 '2ndFlrSF',
 'GrLivArea.1',
 'FullBath',
 'HalfBath',
 'GarageYrBlt',
 'GarageCars']

In [337]:
ord_cols = train.select_dtypes(exclude=np.number).columns.tolist()
ord_cols
# MSZoning and Neighborhood are actually more likely to be nominal

['MSZoning', 'Neighborhood', 'GarageType', 'GarageFinish']

**Step 4):  Go Ahead and Change Your Ordinal Variables To Their Appropriate Values, if they exist.**

**Hint:** The `map` method is useful for this.  

It goes like this:  `mapping = {'oldColVal1': 'NewColVal1',`
                                 `oldColVal2': 'NewColVal2', etc}`
                                
`df['Col'] = df['Col'].map(mapping)`

In [338]:
train.GarageType.unique()

array(['Attchd', 'Detchd', 'BuiltIn', 'CarPort', 'None', 'Basment',
       '2Types'], dtype=object)

In [339]:
GarageTypeMapping ={
    'Attchd': 5,
    'Detchd': 4,
    'BuiltIn': 2,
    'CarPort': 3,
    'None': 0,
    'Basment': 1,
    '2Types': 6
}

train['GarageType'] = train['GarageType'].map(GarageTypeMapping)
test['GarageType'] = test['GarageType'].map(GarageTypeMapping)

In [340]:
train.GarageFinish.unique()

array(['RFn', 'Unf', 'Fin', 'None'], dtype=object)

In [341]:
GarageFinishMapping ={
    'RFn': 3,
    'Unf': 1,
    'Fin': 2,
    'None': 0
}

train['GarageFinish'] = train['GarageFinish'].map(GarageFinishMapping)
test['GarageFinish'] = test['GarageFinish'].map(GarageFinishMapping)

In [342]:
train.GarageType.unique()

array([5, 4, 2, 3, 0, 1, 6])

In [343]:
train.head()

Unnamed: 0,MSSubClass,MSZoning,LotArea,Neighborhood,OverallQual,OverallCond,YearBuilt,GrLivArea,1stFlrSF,2ndFlrSF,GrLivArea.1,FullBath,HalfBath,GarageType,GarageYrBlt,GarageFinish,GarageCars
0,60,RL,8450,CollgCr,7,5,2003,1710,856,854,1710,2,1,5,2003.0,3,2
1,20,RL,9600,Veenker,6,8,1976,1262,1262,0,1262,2,0,5,1976.0,3,2
2,60,RL,11250,CollgCr,7,5,2001,1786,920,866,1786,2,1,5,2001.0,3,2
3,70,RL,9550,Crawfor,7,5,1915,1717,961,756,1717,1,0,4,1998.0,1,3
4,60,RL,14260,NoRidge,8,5,2000,2198,1145,1053,2198,2,1,5,2000.0,3,3


**Step 5):  Now, OneHot Encode Your Dataset For Your Remaining Categorical Columns** 

**Note:** You want your training and your test sets attached for this one.  Detach them when you're finished.

In [344]:
train['MSSubClass'] = train['MSSubClass'].astype(str)
test['MSSubClass'] = test['MSSubClass'].astype(str)
# This is a categorical variable as well, even though it's numeric
# Just encoding them as strings so they are also included in the encoding

In [345]:
master = pd.concat([train, test], sort=True)
master = pd.get_dummies(master)
# One hot encoding these variables

In [346]:
master.drop('MSSubClass_150', axis=1, inplace=True)
# Dropping this because it's not present in the dataset at all

In [347]:
train.shape
# This is the size of the data set where we need to set the split

(1460, 17)

In [348]:
train = master.iloc[:1460].copy()
test  = master.iloc[1460:].copy()

In [349]:
train
# See, now it's one hot encoded!

Unnamed: 0,1stFlrSF,2ndFlrSF,FullBath,GarageCars,GarageFinish,GarageType,GarageYrBlt,GrLivArea,GrLivArea.1,HalfBath,...,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker
0,856,854,2,2.0,3,5,2003.0,1710,1710,1,...,0,0,0,0,0,0,0,0,0,0
1,1262,0,2,2.0,3,5,1976.0,1262,1262,0,...,0,0,0,0,0,0,0,0,0,1
2,920,866,2,2.0,3,5,2001.0,1786,1786,1,...,0,0,0,0,0,0,0,0,0,0
3,961,756,1,3.0,1,4,1998.0,1717,1717,0,...,0,0,0,0,0,0,0,0,0,0
4,1145,1053,2,3.0,3,5,2000.0,2198,2198,1,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,953,694,2,2.0,3,5,1999.0,1647,1647,1,...,0,0,0,0,0,0,0,0,0,0
1456,2073,0,2,2.0,1,5,1978.0,2073,2073,0,...,0,0,0,0,0,0,0,0,0,0
1457,1188,1152,2,1.0,3,5,1941.0,2340,2340,0,...,0,0,0,0,0,0,0,0,0,0
1458,1078,0,1,1.0,1,5,1950.0,1078,1078,0,...,0,0,0,0,0,0,0,0,0,0


**Step 6): Standardize Your Data On Your Training and Test Sets**

**Remember:** Use the values from your training set to standardize your test set!  

Ask me if you have any questions on how to do this.

In [360]:
train.mean()

1stFlrSF                 1162.626712
2ndFlrSF                  346.992466
FullBath                    1.565068
GarageCars                  1.767123
GarageFinish                1.763699
GarageType                  4.216438
GarageYrBlt              1868.739726
GrLivArea                1515.463699
GrLivArea.1              1515.463699
HalfBath                    0.382877
LotArea                 10516.828082
OverallCond                 5.575342
OverallQual                 6.099315
YearBuilt                1971.267808
MSSubClass_120              0.059589
MSSubClass_160              0.043151
MSSubClass_180              0.006849
MSSubClass_190              0.020548
MSSubClass_20               0.367123
MSSubClass_30               0.047260
MSSubClass_40               0.002740
MSSubClass_45               0.008219
MSSubClass_50               0.098630
MSSubClass_60               0.204795
MSSubClass_70               0.041096
MSSubClass_75               0.010959
MSSubClass_80               0.039726
M

In [361]:
train.std()

1stFlrSF                 386.587738
2ndFlrSF                 436.528436
FullBath                   0.550916
GarageCars                 0.747315
GarageFinish               0.932792
GarageType                 1.348623
GarageYrBlt              453.697295
GrLivArea                525.480383
GrLivArea.1              525.480383
HalfBath                   0.502885
LotArea                 9981.264932
OverallCond                1.112799
OverallQual                1.382997
YearBuilt                 30.202904
MSSubClass_120             0.236805
MSSubClass_160             0.203266
MSSubClass_180             0.082505
MSSubClass_190             0.141914
MSSubClass_20              0.482186
MSSubClass_30              0.212268
MSSubClass_40              0.052289
MSSubClass_45              0.090317
MSSubClass_50              0.298267
MSSubClass_60              0.403690
MSSubClass_70              0.198580
MSSubClass_75              0.104145
MSSubClass_80              0.195382
MSSubClass_85              0

In [350]:
train_means = train.mean()
train_stds = train.std()

In [351]:
train_std = train - train_means
train_std /= train_stds
# Subtract the mean and divide by the standard deviation

In [352]:
test -= train_means
test /= train_stds

In [353]:
train
# Now it's standardized

Unnamed: 0,1stFlrSF,2ndFlrSF,FullBath,GarageCars,GarageFinish,GarageType,GarageYrBlt,GrLivArea,GrLivArea.1,HalfBath,...,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker
0,856,854,2,2.0,3,5,2003.0,1710,1710,1,...,0,0,0,0,0,0,0,0,0,0
1,1262,0,2,2.0,3,5,1976.0,1262,1262,0,...,0,0,0,0,0,0,0,0,0,1
2,920,866,2,2.0,3,5,2001.0,1786,1786,1,...,0,0,0,0,0,0,0,0,0,0
3,961,756,1,3.0,1,4,1998.0,1717,1717,0,...,0,0,0,0,0,0,0,0,0,0
4,1145,1053,2,3.0,3,5,2000.0,2198,2198,1,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,953,694,2,2.0,3,5,1999.0,1647,1647,1,...,0,0,0,0,0,0,0,0,0,0
1456,2073,0,2,2.0,1,5,1978.0,2073,2073,0,...,0,0,0,0,0,0,0,0,0,0
1457,1188,1152,2,1.0,3,5,1941.0,2340,2340,0,...,0,0,0,0,0,0,0,0,0,0
1458,1078,0,1,1.0,1,5,1950.0,1078,1078,0,...,0,0,0,0,0,0,0,0,0,0


**Step 7):  Create a validation set out of your training set**

Since there is no time based component, random shuffling is fine.  (You can use `train_test_split` for this, although homespun methods usually work equally as well).

In [354]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(train, y, random_state=2020)

**Step 8): Fit Linear Regression on your training set, and score it on your validation set to get a feel for how you did.**

In [355]:
from sklearn.linear_model import LinearRegression

In [356]:
lreg = LinearRegression()
lreg.fit(X_train, y_train)
lreg.score(X_val, y_val)
# FIT on the training dat, SCORE on the validation data

0.8537689474719947

**Step 9):  Finally, go ahead and make your predictions on your test set.**

Save to a csv file the following the following columns: ID of of each row in your test set, as well as your prediction.

In [357]:
predictions = pd.DataFrame()
# Creating an empty data frame
predictions['ID'] = np.arange(1461, 1461+1459)
predictions['Predictions'] = lreg.predict(test)
predictions

Unnamed: 0,ID,Predictions
0,1461,-9.062400e+05
1,1462,-9.198931e+05
2,1463,-9.804421e+05
3,1464,-9.641554e+05
4,1465,-3.549384e+05
...,...,...
1454,2915,-1.168616e+06
1455,2916,-1.160400e+06
1456,2917,-9.670118e+05
1457,2918,-9.182397e+05


In [358]:
predictions.to_csv('Predictions', index=False)
# The index is false because you don't want to write those to the file

**Bonus:** Can you improve your score?

The first part of this lab was meant to be a walk through of the basics of prepping a data set and getting it ready.

However, there's a lot that could be improved upon!  

Using validation scores as your guide, you could try and look at some of the following:

 - Removing outliers from the target variable, or using log transformations to make the data smoother
 - There are lots of highly correlated variables in this dataset.  Do the 4 different columns about the fireplace really tell you something that different from one another?  You can try averaging multiple columns into one if they're highly correlated, or removing some entirely to see if it improves anything.

In [359]:
# your answer here