The goal of this competition is to predict the sale price ("SalePrice") of each house in the test dataset. The scoring criterion is root-mean-square error (RMSE), computed between the log of the predicted price and the log of the actual price. Taking logs puts all house prices on an even scoring field, so errors on expensive and cheap houses affect the score equally.
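
To make the metric concrete, here is a minimal sketch of RMSE on log prices. The function name `rmse_of_logs` and the prices are made up for illustration, and this is my reading of the scoring rule rather than the competition's own scoring code.

```python
import numpy as np


def rmse_of_logs(actual, predicted):
    """Root-mean-square error between log(predicted) and log(actual)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Difference of logs equals the log of the price ratio.
    log_diff = np.log(predicted) - np.log(actual)
    return float(np.sqrt(np.mean(log_diff ** 2)))


# Hypothetical prices: a cheap house and an expensive one,
# each predicted 10% too high.
actual = [100_000, 1_000_000]
predicted = [110_000, 1_100_000]

score = rmse_of_logs(actual, predicted)
print(score)
```

Both houses contribute the same error here because each prediction is off by the same ratio, even though the dollar errors differ by a factor of ten.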

Reading through the [data description](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data), a few things stand out about the fields:
* The data might have a lot of missing or NA values. For example, not all houses have a basement, so fields like "BsmtQual," "BsmtCond," etc. won't have anything there. I'm going to need to clean it up or account for it in my models.
* Some or all of this data might have been entered by hand and may not have been cleaned up yet, so I'll need to look out for outliers, spelling errors, etc. and correct them.
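
One way to check the first concern is to count missing values per column. The sketch below uses a tiny made-up frame (`toy`) in place of the real training data, so the numbers mean nothing; only the pattern matters.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; values are invented for illustration.
toy = pd.DataFrame({
    "BsmtQual": ["Gd", np.nan, "TA", np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count missing entries per column, keep only columns that have any,
# and list the worst offenders first.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)

# Share of rows missing in each of those columns.
missing_share = missing / len(toy)
print(missing_share)
```

Keep in mind that `pd.read_csv` treats the string "NA" as missing by default, so a field where NA means "this house has no basement" will show up in this count alongside values that are truly unknown.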

Let's start exploring!

In [8]:
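import pandas as pd
import matplotlib.pyplot as plt  # pyplot is the plotting interface; bare matplotlib has no plot functions

# Use the "Id" column as the row index instead of a default 0..n-1 index.
df = pd.read_csv("./data/train.csv", index_col="Id")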

df.head()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000


In [9]:
df.tail()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,8,2007,WD,Normal,175000
1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,Inside,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,Inside,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,4,2010,WD,Normal,142125
1460,20,RL,75.0,9937,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,6,2008,WD,Normal,147500


Since I think there might be missing values, I'm going to find out how many observations we should have.

Then I'm going to describe each column in the DataFrame. Hopefully the count matches the number of observations for each of them. `describe()` will provide other exploratory information, too, like the percentiles. The `dtype` line also hints at how pandas typed each column when it read in the csv, although numeric summaries always print `float64` whether the column holds integers or floats. *Note that each field's name is printed below its summary, on the `Name:` line.*

In [10]:
len(df)

1460

In [11]:
# Summarize every column in turn; describe() adapts to numeric vs. text columns.
for col in df.columns:
    print(df[col].describe())

count    1460.000000
mean       56.897260
std        42.300571
min        20.000000
25%        20.000000
50%        50.000000
75%        70.000000
max       190.000000
Name: MSSubClass, dtype: float64
count     1460
unique       5
top         RL
freq      1151
Name: MSZoning, dtype: object
count    1201.000000
mean       70.049958
std        24.284752
min        21.000000
25%              NaN
50%              NaN
75%              NaN
max       313.000000
Name: LotFrontage, dtype: float64
count      1460.000000
mean      10516.828082
std        9981.264932
min        1300.000000
25%        7553.500000
50%        9478.500000
75%       11601.500000
max      215245.000000
Name: LotArea, dtype: float64
count     1460
unique       2
top       Pave
freq      1454
Name: Street, dtype: object
count       91
unique       2
top       Grvl
freq        50
Name: Alley, dtype: object
count     1460
unique       4
top        Reg
freq       925
Name: LotShape, dtype: object
count     1460
unique       



count    1460.000000
mean       94.244521
std       125.338794
min         0.000000
25%         0.000000
50%         0.000000
75%       168.000000
max       857.000000
Name: WoodDeckSF, dtype: float64
count    1460.000000
mean       46.660274
std        66.256028
min         0.000000
25%         0.000000
50%        25.000000
75%        68.000000
max       547.000000
Name: OpenPorchSF, dtype: float64
count    1460.000000
mean       21.954110
std        61.119149
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max       552.000000
Name: EnclosedPorch, dtype: float64
count    1460.000000
mean        3.409589
std        29.317331
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max       508.000000
Name: 3SsnPorch, dtype: float64
count    1460.000000
mean       15.060959
std        55.757415
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max       480.000000
Name: ScreenPorch, dtype:
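
A caveat on reading types from this output: the `dtype` printed under each numeric summary is the type of the summary itself, not of the column. A more direct check is `df.dtypes`. The sketch below shows the difference on a small made-up frame (`toy`), not the real data.

```python
import pandas as pd

# Toy stand-in for the training data; values are invented for illustration.
toy = pd.DataFrame({
    "MSSubClass": [60, 20, 60],
    "LotFrontage": [65.0, None, 68.0],
    "MSZoning": ["RL", "RL", "RM"],
})

# The column types as pandas actually stored them.
column_dtypes = toy.dtypes
print(column_dtypes)

# describe() on an integer column still returns a float summary,
# because the mean and std are floats.
summary_dtype = toy["MSSubClass"].describe().dtype
print(summary_dtype)
```

This is why MSSubClass, which holds whole-number codes, appears as `float64` in the printed summaries above.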

There are a few things I'm noticing in this dataset:
- It's mostly complete, but some fields are missing a few values (LotFrontage, some of the fields about garages). Others are missing most of their values (Alley, PoolQC, Fence, MiscFeature). FireplaceQu is missing about half.
- Some of the fields about quality are scored on a scale of 1-10. Others are scored on a 5-value scale from Ex (Excellent) to Po (Poor). I might handle the second group by converting it to a 1-5 scale.
    - Figuring this out required some help from the American Statistical Association's documentation on the [Ames Housing Data](https://ww2.amstat.org/publications/jse/v19n3/decock/datadocumentation.txt).
- There are text descriptor fields that I'll need to turn into something a model can use; this will be a learning experience for me.
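
The 1-5 conversion could look something like the sketch below. The codes Ex, Gd, TA, Fa, and Po come from the Ames data documentation; the name `quality_scale`, the choice of 5 for Excellent, and the toy values are my own assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Assumed ordering: Po (Poor) = 1 up to Ex (Excellent) = 5.
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Toy stand-in for two quality columns; values are invented for illustration.
toy = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Ex", "Fa"],
    "FireplaceQu": [np.nan, "TA", "Gd", np.nan],
})

# Map each code to its number, column by column.
# Missing entries stay missing rather than being given a score.
encoded = toy.apply(lambda col: col.map(quality_scale))
print(encoded)
```

Leaving the missing entries as NaN keeps the "no fireplace" question separate from the scoring; whether to fill them with 0 or something else is a later modeling decision.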