In [1]:
import pandas as pd

In [3]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

What do we have in the training data?

In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n

The training data has 1460 rows and 81 columns: 3 float columns, 35 integer columns, and 43 string columns (at least according to pandas; some columns may have been read in with the wrong type because of NA values). The first integer column is the Id of the house and the last integer column is its sale price, so we actually have 79 independent variables. Several of the columns have missing data, so let's get a clearer picture of how many values are missing.
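The type tally above can also be computed rather than read off the `info()` output. The following is a minimal sketch on a toy frame (the frame and the helper name `count_dtype_kinds` are illustrative assumptions, not part of the original notebook); on the real data the same helper would be called on `train`.

```python
import pandas as pd

# Toy stand-in for `train`; the real frame has 81 columns.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotFrontage": [65.0, None, 68.0],
    "MSZoning": ["RL", "RL", "RM"],
    "SalePrice": [208500, 181500, 223500],
})

def count_dtype_kinds(df):
    """Count columns that are integer, float, or anything else (e.g. strings)."""
    kinds = {"int": 0, "float": 0, "other": 0}
    for dtype in df.dtypes:
        if pd.api.types.is_integer_dtype(dtype):
            kinds["int"] += 1
        elif pd.api.types.is_float_dtype(dtype):
            kinds["float"] += 1
        else:
            kinds["other"] += 1
    return kinds

dtype_counts = count_dtype_kinds(toy)
print(dtype_counts)
```

Note that an integer column containing even one missing value is stored as float by default, which is one way NA values can change the type pandas reports.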

In [13]:
missing_train_values = train.isnull().sum()
missing_train_values[missing_train_values > 0]

LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64

19 of our independent variables have missing values. However, when looking at the data description file, we can see that for several of these variables NA is an acceptable value. This applies to:

* Alley
* BsmtQual
* BsmtCond
* BsmtExposure
* BsmtFinType1
* BsmtFinType2
* FireplaceQu
* GarageType
* GarageFinish
* GarageQual
* GarageCond
* PoolQC
* Fence
* MiscFeature
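One way to keep these "feature absent" cases distinct from truly missing data is to replace the nulls in those columns with an explicit label. This is only a sketch: the helper name `label_absent_features`, the label `"NA"`, and the toy frame are assumptions made for illustration, and the column names follow the `isnull()` output above.

```python
import pandas as pd

# Columns where the data description lists NA as a valid category.
NA_IS_CATEGORY = [
    "Alley", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1",
    "BsmtFinType2", "FireplaceQu", "GarageType", "GarageFinish",
    "GarageQual", "GarageCond", "PoolQC", "Fence", "MiscFeature",
]

def label_absent_features(df, columns=NA_IS_CATEGORY, label="NA"):
    """Return a copy where nulls in the given columns become an explicit label."""
    out = df.copy()
    present = [c for c in columns if c in out.columns]
    out[present] = out[present].fillna(label)
    return out

# Toy frame with invented values; LotFrontage is deliberately left alone.
toy = pd.DataFrame({
    "Alley": [None, "Grvl", None],
    "PoolQC": [None, None, "Gd"],
    "LotFrontage": [65.0, None, 68.0],
})
labeled = label_absent_features(toy)
print(labeled.isnull().sum())
```

After this step, any nulls that remain are in columns where NA is not a documented category.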

Similarly, we would expect GarageYrBlt to be null for rows where there is no garage, and GarageYrBlt does have the same number of missing values (81) as GarageType.
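Matching counts suggest, but do not prove, that the nulls fall on the same rows. A row-level comparison settles it. The sketch below uses a toy frame with invented values; on the real data the same two masks would be built from `train`.

```python
import pandas as pd

toy = pd.DataFrame({
    "GarageType": ["Attchd", None, "Detchd", None],
    "GarageYrBlt": [2003.0, None, 1998.0, None],
})

no_garage = toy["GarageType"].isnull()
no_year = toy["GarageYrBlt"].isnull()

# True only if the two columns are missing on exactly the same rows.
same_rows = bool((no_garage == no_year).all())
print(same_rows)
```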

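BsmtExposure and BsmtFinType2 each have 38 missing values, one more than the other basement columns (37), so each likely has one genuinely missing value. Electrical also has a single missing value.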
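To locate such rows, we can look for houses that have a basement according to one column but a null in another. A minimal sketch on a toy frame (values invented, and BsmtQual chosen here as the "has a basement" indicator purely as an assumption):

```python
import pandas as pd

toy = pd.DataFrame({
    "BsmtQual": ["Gd", None, "TA", "Gd"],
    "BsmtExposure": ["No", None, None, "Av"],
    "BsmtFinType2": ["Unf", None, "Unf", None],
})

has_basement = toy["BsmtQual"].notnull()
missing_detail = toy["BsmtExposure"].isnull() | toy["BsmtFinType2"].isnull()

# Rows with a basement but no recorded value: likely true gaps.
gap_rows = toy[has_basement & missing_detail]
print(gap_rows)
```

The row where every basement column is null is treated as "no basement" and is not flagged.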

LotFrontage has many missing values and is a column of floats. We will need to see whether this is actual missing data or whether it makes sense in the context of the particular houses. LotFrontage is the linear feet of street connected to the property, and we can certainly conceive of properties that are not connected to a street.
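One way to probe this is to check whether the missing LotFrontage values cluster within particular kinds of lot. The sketch below groups by LotConfig, which is only an assumption about where a pattern might show up, and uses invented toy values.

```python
import pandas as pd

toy = pd.DataFrame({
    "LotConfig": ["Inside", "Inside", "CulDSac", "CulDSac", "Corner"],
    "LotFrontage": [65.0, 80.0, None, None, 70.0],
})

# Share of missing LotFrontage values within each LotConfig group.
missing_share = toy["LotFrontage"].isnull().groupby(toy["LotConfig"]).mean()
print(missing_share)
```

A share near 1.0 for one group would hint that the nulls are structural, while an even spread would point toward ordinary missing data.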

The last variables are MasVnrType and MasVnrArea. MasVnrType has an acceptable value of None, so let's see if that is what is causing these 8 missing values (we would expect MasVnrArea to be empty if MasVnrType is None).

In [11]:
print(train['MasVnrType'].value_counts())
print(train['MasVnrType'].value_counts().sum())

None       864
BrkFace    445
Stone      128
BrkCmn      15
Name: MasVnrType, dtype: int64
1452


In [9]:
train['MasVnrArea'].value_counts()

0.0       861
72.0        8
180.0       8
108.0       8
120.0       7
16.0        7
80.0        6
200.0       6
106.0       6
340.0       6
170.0       5
132.0       5
360.0       5
84.0        5
320.0       5
100.0       4
196.0       4
246.0       4
216.0       4
160.0       4
183.0       4
178.0       4
270.0       4
300.0       4
210.0       4
268.0       4
252.0       4
168.0       4
336.0       4
220.0       4
         ... 
14.0        1
53.0        1
24.0        1
127.0       1
365.0       1
115.0       1
562.0       1
259.0       1
378.0       1
219.0       1
161.0       1
247.0       1
109.0       1
278.0       1
375.0       1
225.0       1
604.0       1
762.0       1
290.0       1
299.0       1
202.0       1
731.0       1
167.0       1
309.0       1
1129.0      1
651.0       1
337.0       1
415.0       1
293.0       1
621.0       1
Name: MasVnrArea, Length: 327, dtype: int64

In [12]:
train['MasVnrArea'].value_counts().sum()

1452

The MasVnrType counts sum to 1452, and 864 of them are the literal category None, so None is being read as an ordinary value and the 8 nulls are indeed missing data. Interestingly, MasVnrArea has only 861 values of 0.0, which implies that at least 3 properties have a None MasVnrType but a MasVnrArea that is not 0.0. Or maybe this is a consequence of the missing data. It is something we'll certainly look into.
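The mismatch between the two columns can be checked row by row. This is a sketch on a toy frame with invented values, built directly rather than read from CSV so that the string "None" is unambiguously a category:

```python
import pandas as pd

toy = pd.DataFrame({
    "MasVnrType": ["None", "BrkFace", "None", None, "Stone"],
    "MasVnrArea": [0.0, 196.0, 288.0, None, 0.0],
})

# dropna=False lists the null count alongside the real categories.
type_counts = toy["MasVnrType"].value_counts(dropna=False)
print(type_counts)

is_none_type = toy["MasVnrType"] == "None"
has_real_type = toy["MasVnrType"].notnull() & ~is_none_type

# Rows labeled "None" that still report a positive veneer area.
none_with_area = toy[is_none_type & (toy["MasVnrArea"] > 0)]

# Rows with a real veneer type but an area of zero.
type_with_zero = toy[has_real_type & (toy["MasVnrArea"] == 0)]

print(none_with_area)
print(type_with_zero)
```

Both directions matter: rows of the second kind add to the count of 0.0 areas, so the difference of 3 between the two totals is only a lower bound on rows of the first kind.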

Now, what about the test data?

In [15]:
test.shape

(1459, 80)

In [14]:
missing_test_values = test.isnull().sum()
missing_test_values[missing_test_values > 0]

MSZoning           4
LotFrontage      227
Alley           1352
Utilities          2
Exterior1st        1
Exterior2nd        1
MasVnrType        16
MasVnrArea        15
BsmtQual          44
BsmtCond          45
BsmtExposure      44
BsmtFinType1      42
BsmtFinSF1         1
BsmtFinType2      42
BsmtFinSF2         1
BsmtUnfSF          1
TotalBsmtSF        1
BsmtFullBath       2
BsmtHalfBath       2
KitchenQual        1
Functional         2
FireplaceQu      730
GarageType        76
GarageYrBlt       78
GarageFinish      78
GarageCars         1
GarageArea         1
GarageQual        78
GarageCond        78
PoolQC          1456
Fence           1169
MiscFeature     1408
SaleType           1
dtype: int64

More columns have missing data in the test set (33) than in the training set (19). Seems a bit unfair (though probably more realistic), but we'll happily deal with it. Many of the columns with missing data are the same as above. New columns with actual missing data are:

* MSZoning
* Utilities
* Exterior1st
* Exterior2nd
* BsmtFinSF1
* BsmtFinSF2
* BsmtUnfSF
* TotalBsmtSF
* BsmtFullBath
* BsmtHalfBath
* KitchenQual
* Functional
* GarageCars
* GarageArea
* SaleType
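The list above can be derived by comparing the index of the two missing-value summaries. The sketch below uses shortened stand-ins for `missing_train_values` and `missing_test_values` (the `_toy` names are hypothetical; the counts are copied from the outputs above):

```python
import pandas as pd

# Shortened versions of the two missing-value summaries.
missing_train_toy = pd.Series({"LotFrontage": 259, "Alley": 1369, "Electrical": 1})
missing_test_toy = pd.Series(
    {"MSZoning": 4, "LotFrontage": 227, "Alley": 1352, "SaleType": 1}
)

# Columns with missing values in one data set but not the other.
only_in_test = missing_test_toy.index.difference(missing_train_toy.index)
only_in_train = missing_train_toy.index.difference(missing_test_toy.index)

print(list(only_in_test))
print(list(only_in_train))
```

Running the same comparison in the other direction also shows columns, such as Electrical, that have missing values only in the training data.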

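Luckily, there aren't many missing values in any of these columns. We also have what looks like a few genuinely missing values in the columns covered above. For example, BsmtCond is missing 45 values against 44 for BsmtQual, and GarageYrBlt, GarageFinish, GarageQual and GarageCond are each missing 78 values against 76 for GarageType.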

We'll have to figure out a strategy for filling in these values. Now that we have a quick glimpse of the data, let's dive into looking at each variable in the single_variable directory.
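As a placeholder for that strategy, here is one common baseline, not necessarily the approach used later: fill numeric columns with the training median and string columns with the training mode, then apply the same values to the test data. The helper name `fit_fill_values` and the toy frames are assumptions made for illustration.

```python
import pandas as pd

def fit_fill_values(train_df, columns):
    """Pick one fill value per column using the training data only."""
    fills = {}
    for col in columns:
        if pd.api.types.is_numeric_dtype(train_df[col]):
            fills[col] = train_df[col].median()
        else:
            fills[col] = train_df[col].mode().iloc[0]
    return fills

# Toy frames with invented values.
train_toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RM"],
    "GarageCars": [2.0, 1.0, 2.0],
})
test_toy = pd.DataFrame({
    "MSZoning": ["RM", None],
    "GarageCars": [None, 3.0],
})

fills = fit_fill_values(train_toy, ["MSZoning", "GarageCars"])
test_filled = test_toy.fillna(value=fills)
print(test_filled)
```

Computing the fill values from the training data alone keeps information from the test set out of the preprocessing step.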