<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Regression with the Ames Housing Data

---

In [1]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

sns.set_style('whitegrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

<img src="http://imgur.com/l5NasQj.png" style="float: left; margin: 25px 15px 0px 0px; height: 25px">

## Estimate the value of homes.

---

Though [the housing data for this project](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) is very rich, it is not trivial to clean and prepare for modeling. Good EDA, cleaning/imputation, and feature engineering will be critical to the success of your models. The computer can't be the researcher for you!

**Your goals:**
1. You work for a real estate company interested in using data science to determine the best properties to buy and re-sell. Perform any cleaning, feature engineering, and EDA you deem necessary to train a model and evaluate its ability to predict sale price on the test houses.
    * Remove any houses that are not residential from the dataset.
    * Evaluate your predictions with root mean squared error.
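
The two sub-goals above can be sketched as a pair of small helpers. This is a minimal illustration, not a prescribed solution: `RESIDENTIAL_ZONES`, `keep_residential`, and `rmse` are hypothetical names, the set of residential `MSZoning` codes is an assumption to check against `data_description.txt`, and the toy rows are made up rather than taken from `housing.csv`.

```python
import numpy as np
import pandas as pd

# Assumption: these MSZoning codes are the residential ones described in
# data_description.txt; verify against your copy of the file.
RESIDENTIAL_ZONES = {"FV", "RH", "RL", "RP", "RM"}


def keep_residential(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows whose MSZoning code is in RESIDENTIAL_ZONES."""
    return df[df["MSZoning"].isin(RESIDENTIAL_ZONES)].copy()


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy rows, made up for illustration only.
toy = pd.DataFrame({
    "MSZoning": ["RL", "C (all)", "RM", "FV"],
    "SalePrice": [200000, 90000, 150000, 250000],
})
toy_res = keep_residential(toy)
print(toy_res["MSZoning"].tolist())

# Each prediction is off by 10, so the RMSE is 10.
print(rmse([100, 200], [110, 190]))
```

Because RMSE is in the same units as the target, an RMSE on `SalePrice` reads directly as a typical error in dollars.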

You have a dataset of housing sales with a large number of features describing different aspects of each house. The data and the full description of its features are in two separate files:

    housing.csv
    data_description.txt
    
> **Note:** The EDA component of this project is not trivial! Always think critically and creatively. Justify your actions! Use the data description file!
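
A common first EDA step is to tabulate how much of each column is missing before deciding how to clean or impute it. The sketch below is one way to do that; `missing_summary` is a hypothetical helper and the toy frame is made up for illustration. Keep in mind that in this dataset a missing value is not always an error: the data description file defines `NA` in `Alley` as "No alley access", for example, so such columns may be better filled with an explicit category than dropped.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values per column, largest first.

    Columns without any missing values are left out of the result.
    """
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "frac_missing": counts / len(df),
    })
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("n_missing", ascending=False)


# Toy frame, made up for illustration only.
toy = pd.DataFrame({
    "Alley": [np.nan, "Grvl", np.nan, np.nan],
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "LotArea": [8450, 9600, 11250, 9550],
})
report = missing_summary(toy)
print(report)
```

On the real data the call would be `missing_summary(house)`.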

In [2]:
# Load the data
house = pd.read_csv('./housing.csv')

In [3]:
house.iloc[:1,:14]

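|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm |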


In [4]:
house.iloc[:1,14:26]

|   | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace |


In [5]:
house.iloc[:1,26:38]

|   | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 196.0 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 |


In [6]:
house.iloc[:1,38:51]

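|   | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 |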


In [7]:
house.iloc[:1,51:62]

|   | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | Gd | 8 | Typ | 0 | NaN | Attchd | 2003.0 | RFn | 2 |


In [8]:
house.iloc[:1,62:74]

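|   | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | NaN | NaN |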


In [9]:
house.iloc[:1,74:]

|   | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|
| 0 | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |


In [10]:
# Fix the random state so everyone gets the same 80% training sample;
# the remaining 20% of houses form the test set.
house_train = house.sample(frac=0.8, random_state=2).copy()
house_test = house[~house.Id.isin(house_train.Id.unique())].copy()

In [11]:
house_test.shape

(292, 81)
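
Before modeling, it can be worth confirming that a split built this way is exhaustive and leak-free. The sketch below repeats the same `sample` / `isin` pattern on a hypothetical 1,460-row stand-in frame (not the real data, and the row count is an assumption) and checks that the two parts share no `Id` and together cover every row.

```python
import pandas as pd

# Hypothetical stand-in: one unique Id per row, 1,460 rows.
toy = pd.DataFrame({"Id": range(1, 1461)})

# Same pattern as the split above: sample 80%, keep the rest for testing.
train = toy.sample(frac=0.8, random_state=2).copy()
test = toy[~toy["Id"].isin(train["Id"])].copy()

# No house should appear in both parts, and none should be lost.
overlap = set(train["Id"]) & set(test["Id"])
print(len(train), len(test), len(overlap))

# With a unique index, dropping the sampled rows selects the same test set.
test_alt = toy.drop(train.index)
print(test_alt["Id"].tolist() == test["Id"].tolist())
```

The `isin` approach relies on `Id` being unique per house; if it were not, dropping by index would be the safer of the two.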

In [13]:
house.KitchenQual.unique()

array(['Gd', 'TA', 'Ex', 'Fa'], dtype=object)
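
These four codes are ordered quality grades, so one feature-engineering option is to map them to integers instead of one-hot encoding them. The sketch below assumes the ordering given in the data description file (`Po` < `Fa` < `TA` < `Gd` < `Ex`); `QUALITY_SCALE` and `encode_quality` are hypothetical names, and the equal spacing between grades is a modeling assumption rather than something the data guarantees.

```python
import pandas as pd

# Assumption: quality codes ordered as in data_description.txt.
# "Po" is included for completeness although it does not appear above.
QUALITY_SCALE = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}


def encode_quality(series: pd.Series) -> pd.Series:
    """Map quality codes to integers; unknown codes become NaN."""
    return series.map(QUALITY_SCALE)


kitchen = pd.Series(["Gd", "TA", "Ex", "Fa"], name="KitchenQual")
encoded = encode_quality(kitchen)
print(encoded.tolist())  # [4, 3, 5, 2]
```

On the real data this would be `encode_quality(house["KitchenQual"])`; the same scale could plausibly be reused for other quality columns such as `ExterQual`, after checking their codes in the description file.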