The goal of this competition is to predict the sale price ("SalePrice") of each house in the test dataset. Submissions are scored by root-mean-squared error (RMSE), computed between the logarithm of the predicted price and the logarithm of the actual price. Taking logs puts all house prices on an even scoring field: a 10% miss on a cheap house costs as much as a 10% miss on an expensive one.
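To make the scoring rule concrete, here is a minimal sketch of the metric as described above. The function name `rmse_of_logs` and the prices are made up for illustration; this is not the competition's official scoring code.

```python
import numpy as np

def rmse_of_logs(actual, predicted):
    """Root-mean-squared error between log(predicted) and log(actual)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((np.log(predicted) - np.log(actual)) ** 2)))

# Made-up prices: both predictions are 10% too high
cheap_error = rmse_of_logs([100_000], [110_000])
expensive_error = rmse_of_logs([500_000], [550_000])

# The two errors match because only the ratio predicted/actual matters
print(cheap_error, expensive_error)
```

Because the error depends only on the ratio of predicted to actual price, a $10,000 miss on a $100,000 house is penalized the same as a $50,000 miss on a $500,000 house.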

Reading through the [data description](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data), a couple of things stand out about the fields.
* The data might have a lot of missing or NA values. For example, not all houses have a basement, so fields like "BsmtQual," "BsmtCond," etc. won't have anything there. I'm going to need to clean it up or account for it in my models.
* Some or all of this data might have been entered by hand and may not have been cleaned up yet, so I'll need to look out for outliers, spelling errors, etc. and correct them.
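One possible way to account for the basement fields, sketched on toy data: treat a missing rating as its own category instead of dropping the row. The rows below are made up, and the `"None"` label is a hypothetical choice, not something the data description prescribes.

```python
import numpy as np
import pandas as pd

# Toy rows with made-up values; only the column names come from the dataset
toy = pd.DataFrame({
    "BsmtQual": ["Gd", np.nan, "TA"],
    "BsmtCond": ["TA", np.nan, "TA"],
    "LotFrontage": [80.0, np.nan, 74.0],
})

# A missing basement rating most likely means "no basement",
# so give it an explicit label ("None" is a hypothetical choice)
basement_cols = ["BsmtQual", "BsmtCond"]
cleaned = toy.copy()
cleaned[basement_cols] = cleaned[basement_cols].fillna("None")

print(cleaned)
```

Numeric fields such as LotFrontage are left alone here; they would need a different strategy, such as imputing a typical value.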

Let's start exploring!

In [2]:
import pandas as pd
import matplotlib.pyplot as plt  # the plotting functions live in the pyplot submodule

df = pd.read_csv("./data/train.csv",index_col="Id")

df.head()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,Inside,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,Corner,...,0,0,,,Gar2,12500,6,2010,WD,Normal
1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,MnPrv,,0,3,2010,WD,Normal
1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,,,0,6,2010,WD,Normal
1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,Inside,...,144,0,,,,0,1,2010,WD,Normal


In [3]:
df.tail()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
2915,160,RM,21.0,1936,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,6,2006,WD,Normal
2916,160,RM,21.0,1894,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,4,2006,WD,Abnorml
2917,20,RL,160.0,20000,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,9,2006,WD,Abnorml
2918,85,RL,62.0,10441,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,MnPrv,Shed,700,7,2006,WD,Normal
2919,60,RL,74.0,9627,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,11,2006,WD,Normal


Since I think there might be missing values, I'm going to find out how many observations we should have.

Then I'm going to describe each column in the DataFrame. Hopefully the count for each column matches the total number of rows; any column with a lower count has missing values. `describe()` will provide other exploratory information, too, like the percentiles. We'll also find out how pandas determined the data types when it read in the csv. *Note that in the output, each field's name is listed below its statistics, not above them.*
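The count comparison can also be summarized in a single table. This is a sketch on a toy DataFrame with made-up values, assuming only standard pandas calls (`count()`, `dtypes`); it is an alternative view, not a replacement for the `describe()` loop.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; the column names come from the dataset
toy = pd.DataFrame({
    "LotArea": [11622, 14267, 13830, 9978],
    "LotFrontage": [80.0, np.nan, 74.0, np.nan],
    "Alley": [np.nan, "Grvl", np.nan, np.nan],
})

n_rows = len(toy)        # how many observations we should have
non_null = toy.count()   # how many each column actually has

summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "non_null": non_null,
    "missing": n_rows - non_null,
})

print(summary)
```

Sorting `summary` by the `missing` column would put the least complete fields at the top.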

In [4]:
len(df)

1459

In [5]:
col_list = list(df.columns.values)

for col in col_list:
    print(df[col].describe())

count    1459.000000
mean       57.378341
std        42.746880
min        20.000000
25%        20.000000
50%        50.000000
75%        70.000000
max       190.000000
Name: MSSubClass, dtype: float64
count     1455
unique       5
top         RL
freq      1114
Name: MSZoning, dtype: object
count    1232.000000
mean       68.580357
std        22.376841
min        21.000000
25%              NaN
50%              NaN
75%              NaN
max       200.000000
Name: LotFrontage, dtype: float64
count     1459.000000
mean      9819.161069
std       4955.517327
min       1470.000000
25%       7391.000000
50%       9399.000000
75%      11517.500000
max      56600.000000
Name: LotArea, dtype: float64
count     1459
unique       2
top       Pave
freq      1453
Name: Street, dtype: object
count      107
unique       2
top       Grvl
freq        70
Name: Alley, dtype: object
count     1459
unique       4
top        Reg
freq       934
Name: LotShape, dtype: object
count     1459
unique       4
top   



count    1457.000000
mean        0.065202
std         0.252468
min         0.000000
25%              NaN
50%              NaN
75%              NaN
max         2.000000
Name: BsmtHalfBath, dtype: float64
count    1459.000000
mean        1.570939
std         0.555190
min         0.000000
25%         1.000000
50%         2.000000
75%         2.000000
max         4.000000
Name: FullBath, dtype: float64
count    1459.000000
mean        0.377656
std         0.503017
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         2.000000
Name: HalfBath, dtype: float64
count    1459.000000
mean        2.854010
std         0.829788
min         0.000000
25%         2.000000
50%         3.000000
75%         3.000000
max         6.000000
Name: BedroomAbvGr, dtype: float64
count    1459.000000
mean        1.042495
std         0.208472
min         0.000000
25%         1.000000
50%         1.000000
75%         1.000000
max         2.000000
Name: KitchenAbvGr, dtype: f

There are a few things I'm noticing in this dataset:
- It's mostly complete, but some fields are missing a few values (LotFrontage, some of the fields about garages). Others are missing most of their observations (Alley, PoolQC, Fence, MiscFeature). FireplaceQu is missing about half.
- Some of the quality fields are scored on a scale of 1-10. Others use a 5-value scale from Ex (Excellent) to Po (Poor). I might handle this by converting the 5-value fields to a 1-5 scale.
    - Figuring this out required some help from the American Statistical Association's documentation on the [Ames Housing Data](https://ww2.amstat.org/publications/jse/v19n3/decock/datadocumentation.txt).
- There are categorical descriptors I'll need to manage somehow; this will be a learning experience for me.
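The 1-5 conversion mentioned in the list above could look something like this. It is a sketch on made-up rows: the `quality_scale` mapping and the `_score` column name are my own choices, and ExterQual is just one of the fields that uses the Ex-to-Po scale.

```python
import numpy as np
import pandas as pd

# Ordered from worst to best: Poor, Fair, Typical/Average, Good, Excellent
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Toy rows with made-up values
toy = pd.DataFrame({"ExterQual": ["Gd", "TA", "Ex", np.nan, "Po"]})

# map() looks each label up in the dictionary; labels it cannot find
# (including missing values) become NaN, so they stay visible as gaps
toy["ExterQual_score"] = toy["ExterQual"].map(quality_scale)

print(toy)
```

Missing ratings stay missing after the mapping, so whether they should become 0 or be imputed remains a separate decision.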