# House Prices Kaggle Competition

This notebook explores the House Prices training data, focusing on which features have missing values and how they might be treated.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

Import and inspect the data

In [2]:
df = pd.read_csv('../data/train.csv')
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [3]:
df.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


In [4]:
featuresWithNullValues = df.isnull().sum()
print(featuresWithNullValues[featuresWithNullValues > 0])

LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64


# Dealing with Missing Data

The **train.csv** dataset has 1460 records and 81 features, 19 of which have missing data that will need a variety of treatment strategies. Ranked from most missing data to least:
- PoolQC = 99.52%
- MiscFeature = 96.30%
- Alley = 93.77%
- Fence = 80.75%
- FireplaceQu = 47.26%
- LotFrontage = 17.74%
- GarageType, GarageYrBlt, GarageFinish, GarageQual, GarageCond = 5.55%
- BsmtExposure, BsmtFinType2 = 2.60%
- BsmtQual, BsmtCond, BsmtFinType1 = 2.53%
- MasVnrType, MasVnrArea = 0.55%
- Electrical = 0.07%
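
The percentages above can be computed rather than worked out by hand. The following is a minimal sketch; `missing_percentages` is a hypothetical helper and the toy frame only stands in for `df`, with made-up values.

```python
import numpy as np
import pandas as pd


def missing_percentages(frame):
    """Return the percentage of missing values per column, largest first.

    Hypothetical helper; columns without missing values are dropped.
    """
    pct = frame.isnull().mean() * 100
    return pct[pct > 0].sort_values(ascending=False).round(2)


# Toy frame standing in for df (values are made up for illustration).
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Ex"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})
print(missing_percentages(toy))
```

On the real data the call would be `missing_percentages(df)`.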

### PoolQC (99.52%)

PoolQC is presumably the pool quality rating, so it is probably related to PoolArea, which has no missing data but many 0s, likely meaning there is no pool. Let's check whether the missing PoolQC values line up with the PoolArea values that are 0.

In [7]:
df.PoolArea.value_counts().head(3)

0      1453
512       1
648       1
Name: PoolArea, dtype: int64

There are 1453 PoolArea values equal to 0, which matches the number of missing PoolQC values. Are they the same records?

In [8]:
len(df[(df.PoolArea==0) & df.PoolQC.isnull()])

1453

Yes, they coincide exactly. The data description allows 'NA' (no pool) as a PoolQC value, so let's replace NaN with 'NA'.

In [9]:
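# Assign the result back; an inplace fillna on a selected column may not update df in newer pandas
df['PoolQC'] = df['PoolQC'].fillna('NA')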
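
The NaN values here are probably an artifact of loading: by default `pd.read_csv` parses the literal string `NA` as a missing value, so a category that the data description spells `NA` arrives as NaN. The sketch below uses a small inline CSV (not the real file) to show the effect and one way to keep the literal label.

```python
import io

import pandas as pd

# Toy CSV standing in for train.csv; only two columns, values made up.
csv_text = "PoolArea,PoolQC\n0,NA\n512,Ex\n0,NA\n"

# Default behaviour: the string "NA" is parsed as a missing value.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df.PoolQC.isnull().sum())

# With the default NA markers disabled, "NA" stays an ordinary string.
literal_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(literal_df.PoolQC.value_counts())
```

Note that `keep_default_na=False` applies to every column, so numeric features with truly missing values, such as LotFrontage, would then need separate handling.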

### MiscFeature (96.30%)

MiscFeature is a miscellaneous feature not covered in other categories. The related MiscVal feature has no missing data. My guess is that wherever MiscFeature is missing, MiscVal will be $0.

In [10]:
df.MiscVal.value_counts().head(3)

0      1408
400      11
500       8
Name: MiscVal, dtype: int64

In [11]:
df.MiscFeature.value_counts()

Shed    49
Gar2     2
Othr     2
TenC     1
Name: MiscFeature, dtype: int64

There are 1408 records with a MiscVal of 0 but only 1406 with a missing MiscFeature, a difference of 2. Is this because the 'Othr' miscellaneous features have a value of 0?

In [13]:
# Keep only the two columns being compared
temp_df = df[['MiscFeature', 'MiscVal']]

# One frame per MiscFeature category
shed_df = temp_df[temp_df['MiscFeature'] == 'Shed']
gar2_df = temp_df[temp_df['MiscFeature'] == 'Gar2']
othr_df = temp_df[temp_df['MiscFeature'] == 'Othr']
tenc_df = temp_df[temp_df['MiscFeature'] == 'TenC']

othr_df.head()

Unnamed: 0,MiscFeature,MiscVal
705,Othr,3500
873,Othr,0


Only partly: one 'Othr' record (index 873) has a MiscVal of 0. The other 0 belongs to one of the sheds (index 1200).
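
A cross-tabulation makes the same comparison for every category at once instead of one filtered frame per category. This is a sketch on a made-up toy frame, not the real data; the `zero`/`nonzero` labels are introduced here only for readability.

```python
import pandas as pd

# Toy stand-in for the MiscFeature and MiscVal columns (values are made up).
toy = pd.DataFrame({
    "MiscFeature": ["Shed", "Shed", None, None, "Othr", "Othr"],
    "MiscVal": [400, 0, 0, 0, 3500, 0],
})

# Label each record by whether its MiscVal is 0.
is_zero = toy.MiscVal.eq(0).map({True: "zero", False: "nonzero"})

# Rows: feature label (missing shown as 'NA'); columns: zero vs. nonzero MiscVal.
table = pd.crosstab(toy.MiscFeature.fillna("NA"), is_zero)
print(table)
```

Any category other than 'NA' with a count in the `zero` column is a record that has a feature but no recorded value.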

I'm not going to worry about those 2 extra records. Where MiscFeature is NaN, I will set it to 'NA' as defined in the data description.

In [14]:
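# Assign the result back; an inplace fillna on a selected column may not update df in newer pandas
df['MiscFeature'] = df['MiscFeature'].fillna('NA')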

### Alley (93.77%)

Type of alley access to property

- Grvl	Gravel
- Pave	Paved
- NA 	No alley access

Some possibly related features:

- Utilities: Type of utilities available - limited utilities access might imply no alley access.
- BldgType: Type of dwelling - townhouses likely have no alley access.

### Fence (80.75%)

Fence quality
		
- GdPrv	Good Privacy
- MnPrv	Minimum Privacy
- GdWo	Good Wood
- MnWw	Minimum Wood/Wire
- NA	No Fence

### FireplaceQu (47.26%)

Fireplace quality

- Ex	Excellent - Exceptional Masonry Fireplace
- Gd	Good - Masonry Fireplace in main level
- TA	Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
- Fa	Fair - Prefabricated Fireplace in basement
- Po	Poor - Ben Franklin Stove
- NA	No Fireplace

A possibly related feature:

- Fireplaces: Number of fireplaces
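
If FireplaceQu behaves like PoolQC, it should be missing exactly where Fireplaces is 0. The sketch below checks that in both directions; `missing_matches_zero` is a hypothetical helper and the toy frame is made up.

```python
import numpy as np
import pandas as pd


def missing_matches_zero(frame, quality_col, count_col):
    """True when quality_col is missing on exactly the rows where count_col is 0.

    Hypothetical helper, not part of the original notebook.
    """
    return bool((frame[quality_col].isnull() == frame[count_col].eq(0)).all())


# Toy frame standing in for df (values are made up for illustration).
toy = pd.DataFrame({
    "Fireplaces": [0, 1, 2, 0],
    "FireplaceQu": [np.nan, "Gd", "TA", np.nan],
})
print(missing_matches_zero(toy, "FireplaceQu", "Fireplaces"))
```

On the real data the call would be `missing_matches_zero(df, 'FireplaceQu', 'Fireplaces')`.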

### LotFrontage (17.74%)

Linear feet of street connected to property.

Some possibly related features:
- LotConfig: Lot configuration
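
One way to explore the link to LotConfig is to compare the share of missing LotFrontage values within each lot configuration. This is a sketch on a made-up toy frame; a clearly uneven share on the real data would suggest the missing values are not random.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df (values are made up for illustration).
toy = pd.DataFrame({
    "LotConfig": ["Inside", "Inside", "Corner", "CulDSac", "CulDSac", "Corner"],
    "LotFrontage": [65.0, np.nan, 80.0, np.nan, np.nan, 84.0],
})

# Share of missing LotFrontage values within each lot configuration.
missing_rate = toy.LotFrontage.isnull().groupby(toy.LotConfig).mean()
print(missing_rate.sort_values(ascending=False))
```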

### GarageType, GarageYrBlt, GarageFinish, GarageQual, GarageCond (5.55%)

All five features are missing 81 values. Are the same 81 records missing all of them, or are the missing values spread across different records?
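
Counting how many of the garage columns are missing in each record answers this directly. The sketch below uses a made-up toy frame in place of `df`.

```python
import numpy as np
import pandas as pd

garage_cols = ["GarageType", "GarageYrBlt", "GarageFinish", "GarageQual", "GarageCond"]

# Toy frame: rows 1 and 3 have no garage information at all (values made up).
toy = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd", np.nan],
    "GarageYrBlt": [2003.0, np.nan, 1998.0, np.nan],
    "GarageFinish": ["RFn", np.nan, "Unf", np.nan],
    "GarageQual": ["TA", np.nan, "TA", np.nan],
    "GarageCond": ["TA", np.nan, "TA", np.nan],
})

# Number of garage columns missing in each record.
nulls_per_row = toy[garage_cols].isnull().sum(axis=1)
print(nulls_per_row.value_counts())
```

If only the counts 0 and 5 appear, the same records are missing all five features; any count in between means the missing values are spread around.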

### BsmtExposure, BsmtFinType2 (2.60%)

Both features are missing 38 values. Are the same records missing both, or are the missing values spread across different records?
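
The counts of 38 here against 37 for the other basement features suggest that at least one record is missing only some of the basement columns. A sketch for isolating such records, using a made-up toy frame in place of `df`:

```python
import numpy as np
import pandas as pd

bsmt_cols = ["BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2"]

# Toy frame: row 1 has no basement data; rows 2 and 3 are only partly missing.
toy = pd.DataFrame({
    "BsmtQual": ["Gd", np.nan, "TA", "Gd"],
    "BsmtCond": ["TA", np.nan, "TA", "TA"],
    "BsmtExposure": ["No", np.nan, np.nan, "Av"],
    "BsmtFinType1": ["GLQ", np.nan, "Unf", "ALQ"],
    "BsmtFinType2": ["Unf", np.nan, "Unf", np.nan],
})

is_null = toy[bsmt_cols].isnull()
# Records missing at least one basement column but not all of them.
partial = toy[is_null.any(axis=1) & ~is_null.all(axis=1)]
print(partial)
```

Records that show up in `partial` probably do have a basement and need a different treatment from the records where every basement column is missing.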

### BsmtQual, BsmtCond, BsmtFinType1 (2.53%)

All three features are missing 37 values. Are the same 37 records missing all of them, or are the missing values spread across different records?

### MasVnrType, MasVnrArea (0.55%)

Both features are missing 8 values. Are the same 8 records missing both, or are the missing values spread across different records?

### Electrical (0.07%)


