# Prepare Data - Ames Housing

Contents
 - start
   - packages
   - directories and paths
 - data manipulation
   - variables to use
   - dealing with missings and outliers
 - save data

Sources:
http://ww2.amstat.org/publications/jse/v19n3/decock.pdf

Copyright (C) 2018 Alan Chalk  
Please do not distribute or publish without permission.

## Start

**Import any packages needed** 

In [1]:
import os
import numpy as np
import pandas as pd
import pickle

from sklearn import preprocessing

from matplotlib import pyplot as plt
%matplotlib inline

In [2]:
# AC using an image (27-25-2019) with sklearn 0.20.3

**Set directories and paths**

In [3]:
# Set directories
print(os.getcwd())
dirRawData = "../input/"
dirPData   = "../PData/"
dirPOutput = "../POutput/"

/home/jovyan/Projects/AmesHousing/PCode


**Load data**

In [4]:
#store = pd.HDFStore(dirPData + '01_df_all.h5')
#df_all = pd.read_hdf(store, 'df_all')
#store.close()

f_name = dirPData + '01_df_all.pickle'

with open(f_name, "rb") as f: #read binary as f
    dict_ = pickle.load(f)

df_all = dict_['df_all']
del f_name, dict_


### Data Manipulation

In [5]:
df_all.head()

| | order | pid | ms_subclass | ms_zoning | lot_frontage | lot_area | street | alley | lot_shape | land_contour | ... | screen_porch | pool_area | fence | misc_feature | misc_val | mo_sold | yr_sold | sale_type | sale_condition | saleprice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 526301100 | MS_20 | RL | 141.0 | 31770.0 | Pave | DoesNotHaveOne | IR1 | Lvl | ... | 0.0 | 0.0 | DoesNotHaveOne | DoesNotHaveOne | 0.0 | 5.0 | 2010.0 | WD | Normal | 215000.0 |
| 1 | 2 | 526350040 | MS_20 | RH | 80.0 | 11622.0 | Pave | DoesNotHaveOne | Reg | Lvl | ... | 120.0 | 0.0 | MnPrv | DoesNotHaveOne | 0.0 | 6.0 | 2010.0 | WD | Normal | 105000.0 |
| 2 | 3 | 526351010 | MS_20 | RL | 81.0 | 14267.0 | Pave | DoesNotHaveOne | IR1 | Lvl | ... | 0.0 | 0.0 | DoesNotHaveOne | DoesNotHaveOne | 12500.0 | 6.0 | 2010.0 | WD | Normal | 172000.0 |
| 3 | 4 | 526353030 | MS_20 | RL | 93.0 | 11160.0 | Pave | DoesNotHaveOne | Reg | Lvl | ... | 0.0 | 0.0 | DoesNotHaveOne | DoesNotHaveOne | 0.0 | 4.0 | 2010.0 | WD | Normal | 244000.0 |
| 4 | 5 | 527105010 | MS_60 | RL | 74.0 | 13830.0 | Pave | DoesNotHaveOne | IR1 | Lvl | ... | 0.0 | 0.0 | MnPrv | DoesNotHaveOne | 0.0 | 3.0 | 2010.0 | WD | Normal | 189900.0 |


**Variables to use**

Define which variables will and will not be used. In this project, order and pid are record identifiers rather than predictors, so they are put in the list "vars_notToUse".

Create the following variables:
 - vars_all: an np.ndarray of column names
 - var_dep: a list containing the dependent variable ('saleprice')
 - vars_notToUse: a list of variables not to use ('order' and 'pid')
 - vars_ind: a list of variables to use being all the variables in vars_all except vars_notToUse and var_dep i.e. the independent vars


In [6]:
vars_all = df_all.columns.values
var_dep = ['saleprice']

vars_notToUse = ['order','pid']
#use list comprehension (see below examples)
vars_ind = [var for var in vars_all if var not in (vars_notToUse+var_dep)] #as in second list comprehension example

In [7]:
###example of list comprehensions
numbers = range(10)
print([number for number in numbers]) #gives list
print([number for number in numbers if number % 2 == 0]) #gives list excluding odd nums

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 2, 4, 6, 8]


In [8]:
###another example
lst = ['apple','pear','rabbit']
[word for word in lst if 't' not in word]

['apple', 'pear']

Create:
 - vars_ind_numeric: A list of the numeric independent variables

In [9]:
# df_all[vars_ind]

In [10]:
vars_ind_numeric = [var for var in vars_ind if df_all[var].dtype != 'object']
# df_all[vars_ind_numeric]
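
As a cross-check on the dtype test above, pandas can also select numeric columns directly. The sketch below uses a small made-up DataFrame (`toy_df` and `toy_vars` are hypothetical, not the Ames data) and assumes, as in this notebook, that the non-numeric features are stored with dtype `object`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_all.
# dtype=object mirrors how the string features are stored in this notebook.
toy_df = pd.DataFrame({
    "lot_area": [31770.0, 11622.0, 14267.0],
    "street": pd.Series(["Pave", "Pave", "Grvl"], dtype=object),
    "yr_sold": [2010, 2010, 2009],
})
toy_vars = ["lot_area", "street", "yr_sold"]

# Approach used above: keep every variable whose dtype is not object
by_dtype = [var for var in toy_vars if toy_df[var].dtype != "object"]

# Alternative: let pandas pick out the numeric columns
by_select = list(toy_df[toy_vars].select_dtypes(include=[np.number]).columns)

print(by_dtype)
print(by_select)
```

The two lists agree whenever every non-numeric column is of dtype `object`; they could differ if the data contained other dtypes such as booleans, datetimes or categoricals.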

**Deal with missing data**

Dealing with missing data is a major topic in machine learning. We will take the simplest approach: delete it.

If a feature has a substantial number of missing values, we will simply delete the feature (i.e. that column of the data).

If a feature is mostly populated and only one or two records have missing values, we will delete those records.

TODO 

Carry out the following:
 - print the number of rows and columns in the data
 - find the number of missings of each feature (df_all.isnull() flags the missing values; .sum(axis=0) then counts them per column)

In [11]:
print(df_all.shape)
# isnull() flags each missing value; sum(axis=0) collapses the rows,
# giving a Series with the number of missings per column
srs_missing = df_all.isnull().sum(axis=0)
print(srs_missing[srs_missing > 0])  # show only the features that have missing values



(2930, 80)
lot_frontage      490
mas_vnr_area       23
bsmtfin_sf_1        1
bsmtfin_sf_2        1
bsmt_unf_sf         1
total_bsmt_sf       1
bsmt_full_bath      2
bsmt_half_bath      2
garage_yr_blt     159
garage_cars         1
garage_area         1
dtype: int64
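
Raw counts are easier to judge as a share of the rows; for example, lot_frontage is missing in 490 of 2930 rows, roughly 17%. The following is a minimal sketch on made-up data (`toy_df` is hypothetical) showing how `isnull().mean()` gives the fraction missing per column alongside the count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "a" is half missing, "b" has one missing, "c" has none
toy_df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, 3.0, np.nan],
    "c": [1.0, 2.0, 3.0, 4.0],
})

n_missing = toy_df.isnull().sum(axis=0)      # count of missings per column
frac_missing = toy_df.isnull().mean(axis=0)  # share of rows missing per column

summary = pd.DataFrame({"n_missing": n_missing, "frac_missing": frac_missing})
# Keep only the columns with at least one missing value, worst first
summary = summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(summary)
```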


TODO 

- You should have found above that ['lot_frontage', 'garage_yr_blt', 'mas_vnr_area'] have the most missings (490, 159 and 23 respectively). Drop these columns (inplace).
- Remove these 3 variables from vars_ind and vars_ind_numeric
- Then delete any remaining examples with missing features
- Check the number of rows and columns of the remaining data - is it what you expect?

In [12]:
###DROP VARIABLES
vars_toDrop = ['lot_frontage', 'garage_yr_blt', 'mas_vnr_area']
###FOR DEBUGGING, check indiv. data types of vars to drop
# print(df_all['lot_frontage'].dtype)
# print(df_all['garage_yr_blt'].dtype)
# print(df_all['mas_vnr_area'].dtype)
###ALTERNATIVELY, check all at once
# [df_all[var].dtype for var in vars_toDrop]
df_all.drop(labels=vars_toDrop,
            axis=1,
            inplace=True)
df_all.shape

(2930, 77)

In [13]:
###REMOVE DROPPED VARS FROM vars_ind AND vars_ind_numeric
print(len(vars_ind))
print(len(vars_ind_numeric))
###EASIEST WAY: redefine both lists according to updated df_all
vars_ind = [var for var in vars_ind if var in df_all]
vars_ind_numeric = [var for var in vars_ind if df_all[var].dtype!='object']
###ALTERNATIVELY: use set differences
#first turn list into set, then use differences, and turn it back into a list
# list(set(vars_ind).difference(set(vars_toDrop)))
# list(set(vars_ind_numeric).difference(set(vars_toDrop)))
print(len(vars_ind))
print(len(vars_ind_numeric))


77
35
74
32
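
One caveat with the set-difference alternative mentioned in the comments: a Python set has no defined order, so the resulting list is not guaranteed to follow the column order of the data. A small sketch with made-up variable names (`toy_vars` and `toy_drop` are hypothetical):

```python
# Hypothetical variable names, in the order they would appear in the data
toy_vars = ["lot_area", "lot_frontage", "street", "garage_yr_blt", "yr_sold"]
toy_drop = ["lot_frontage", "garage_yr_blt"]

# List comprehension: keeps the original order of toy_vars
kept_ordered = [var for var in toy_vars if var not in toy_drop]

# Set difference: same members, but the order is not guaranteed
kept_set = list(set(toy_vars).difference(toy_drop))

print(kept_ordered)
print(sorted(kept_set) == sorted(kept_ordered))
```

If the order of the variables matters later (for example, to match the columns of a design matrix), the list comprehension is the safer of the two.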


In [15]:
# now drop the NA
df_all.dropna(axis=0,how='any', inplace=True)
temp = df_all.isnull().sum(axis=0) 
print('Empty series?', temp[temp>0].empty) #double check all missings are gone (expect empty series)
print(df_all.shape) #check no. of cols and rows to see if match what is expected
del temp #no need to keep variable

Empty series? True
(2927, 77)
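
To decide whether the new row count is what you expect, count rows rather than missing cells. In the output above, the eight remaining features with missings report 10 missing cells in total, yet only 3 rows are removed (2930 to 2927), because a single row can be missing several features. A minimal sketch on made-up data (`toy_df` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 missing cells, but spread over only 2 rows
toy_df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0, 5.0],
    "y": [1.0, np.nan, 3.0, np.nan, 5.0],
})

n_missing_cells = int(toy_df.isnull().sum().sum())
# A row with several missing values is still only one row to drop
n_rows_with_missing = int(toy_df.isnull().any(axis=1).sum())
n_expected = len(toy_df) - n_rows_with_missing

toy_clean = toy_df.dropna(axis=0, how="any")
print(n_missing_cells, n_rows_with_missing, n_expected, toy_clean.shape)
```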


**Remove known outliers**

See http://ww2.amstat.org/publications/jse/v19n3/decock.pdf which states:
        
> I would recommend removing any houses with more than 4000 square feet from the data set (which eliminates these five unusual observations) before assigning it to students.

TODO 

Remove the 5 examples where gr_liv_area > 4000 and check the number of rows again.

In [16]:
df_all = df_all[df_all['gr_liv_area']<=4000]

In [17]:
print(df_all.shape)

(2922, 77)
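
The same filter can be written with an explicit boolean mask, which makes it easy to count the outliers before removing them. This is a sketch on made-up values (`toy_df` and its numbers are hypothetical, not taken from the Ames data). The `.copy()` is an optional addition: it makes the result an independent DataFrame, which in some pandas versions avoids a SettingWithCopyWarning when the result is later modified in place.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy_df = pd.DataFrame({
    "gr_liv_area": [1500, 4500, 2100, 5600, 4000],
    "saleprice": [200000, 700000, 250000, 180000, 400000],
})

threshold = 4000
is_outlier = toy_df["gr_liv_area"] > threshold  # strictly more than 4000
n_outliers = int(is_outlier.sum())

# Keep the rows that are not outliers, as an independent copy
toy_kept = toy_df[~is_outlier].copy()

print(n_outliers, toy_kept.shape)
```

A house of exactly 4000 square feet is kept, matching the `<= 4000` condition used above.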


**Care is needed with the index**

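- Run the line of code below and note that the index for examples (rows) is not contiguous. Why not?

Ans: dropna() and the filter on gr_liv_area removed rows but left the index labels of the remaining rows unchanged, so the labels of the dropped rows are now missing from the index.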


In [392]:
print(np.where(np.diff(df_all.index.values, 1) != 1))
df_all.index.values[1339:1345]

(array([1340, 1495, 1756, 1762, 2174, 2228]),)


array([1339, 1340, 1342, 1343, 1344, 1345])

- What is the advantage of leaving it like this?

Ans: The gaps in the index make it easy to see which examples have been dropped.

- Reset the index (inplace)

In [393]:
df_all.reset_index(drop=True, inplace=True)
# print(np.where(np.diff(df_all.index.values, 1) != 1))
# df_all.index.values[1339:1345]
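
The reason care is needed: while the index has gaps, label-based access (`.loc`) and position-based access (`.iloc`) can return different rows. A minimal sketch on made-up data (`toy_df` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in row 1
toy_df = pd.DataFrame({"x": [10.0, np.nan, 30.0, 40.0, 50.0]})

toy_clean = toy_df.dropna()
print(list(toy_clean.index))  # the label of the dropped row is missing

# Positions where consecutive index labels do not differ by exactly 1
gaps = np.where(np.diff(toy_clean.index.values) != 1)[0]

# Label 2 and position 2 now refer to different rows
by_label = toy_clean.loc[2, "x"]
by_position = toy_clean["x"].iloc[2]
print(by_label, by_position)

# After resetting, labels and positions coincide again
toy_reset = toy_clean.reset_index(drop=True)
print(list(toy_reset.index), toy_reset.loc[2, "x"])
```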

Check that missings are no longer an issue

In [396]:
#NOTE I ALREADY DID THIS ABOVE, BUT LEAVE PROF's CODE IN HERE ANYWAY
srs_missing = df_all.isnull().sum(axis=0)
print(srs_missing[srs_missing > 0])
del srs_missing

Series([], dtype: int64)


### Store the dataset and relevant variables

In [398]:
# Note: if you have run the store commands and for some reason they failed, you may need to run: store.close()
# Note: When running the first time you do not need to use: store.remove()
#store = pd.HDFStore(dirPData + '02_df_all.h5')
#store.remove('df_all')
#df_all.to_hdf(store, 'df_all')
#store.close()

dict_ = {'df_all': df_all}

f_name = dirPData + '02_df.pickle'
with open(f_name, "wb") as f: #open file, write binary
    pickle.dump(dict_, f)
del f_name, dict_

###ALSO creating a dictionary with info about variables (which not to use, which are numeric and which is dependent)
dict_ = {'vars_ind_numeric': vars_ind_numeric,
        'vars_notToUse': vars_notToUse,
        'var_dep': var_dep}
###STORE in separate file called 02_vars
f_name = dirPData + '02_vars.pickle'
with open(f_name, "wb") as f:
    pickle.dump(dict_, f)
del f_name, dict_
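
A quick way to confirm that a pickle file was written correctly is to read it straight back and compare. The sketch below does this on made-up data in a temporary directory (`toy_df`, `dict_out` and the file name `toy.pickle` are hypothetical), so it does not touch the files saved above.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical toy objects standing in for df_all and the variable lists
toy_df = pd.DataFrame({"gr_liv_area": [1500, 2100], "saleprice": [200000, 250000]})
dict_out = {"df_all": toy_df, "var_dep": ["saleprice"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    f_name = os.path.join(tmp_dir, "toy.pickle")
    with open(f_name, "wb") as f:  # write binary
        pickle.dump(dict_out, f)
    with open(f_name, "rb") as f:  # read binary
        dict_in = pickle.load(f)

# The DataFrame and the list should come back unchanged
round_trip_ok = (
    dict_in["df_all"].equals(toy_df)
    and dict_in["var_dep"] == dict_out["var_dep"]
)
print(round_trip_ok)
```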