<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Data-Dictionary-(Main),-EDA-Comment,-and-Imputation-Strategy" data-toc-modified-id="Data-Dictionary-(Main),-EDA-Comment,-and-Imputation-Strategy-0.0.0.1"><span class="toc-item-num">0.0.0.1&nbsp;&nbsp;</span>Data Dictionary (Main), EDA Comment, and Imputation Strategy</a></span></li></ul></li></ul></li></ul></li><li><span><a href="#Read-the-data" data-toc-modified-id="Read-the-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Read the data</a></span></li><li><span><a href="#Dealing-with-missing-values" data-toc-modified-id="Dealing-with-missing-values-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Dealing with missing values</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Imputation-for-geo-value-with-KNN" data-toc-modified-id="Imputation-for-geo-value-with-KNN-2.0.1"><span class="toc-item-num">2.0.1&nbsp;&nbsp;</span>Imputation for geo value with KNN</a></span></li><li><span><a href="#Filling-missing-values" data-toc-modified-id="Filling-missing-values-2.0.2"><span class="toc-item-num">2.0.2&nbsp;&nbsp;</span>Filling missing values</a></span></li><li><span><a href="#Drop-duplicate-columns" data-toc-modified-id="Drop-duplicate-columns-2.0.3"><span class="toc-item-num">2.0.3&nbsp;&nbsp;</span>Drop duplicate columns</a></span></li><li><span><a href="#Check-point:-Save-data-frame" data-toc-modified-id="Check-point:-Save-data-frame-2.0.4"><span class="toc-item-num">2.0.4&nbsp;&nbsp;</span>Check point: Save data frame</a></span></li></ul></li></ul></li></ul></div>

In [1]:
import pandas as pd
import numpy as np
import os

from datetime import datetime

import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

#### Data Dictionary (Main), EDA Comment, and Imputation Strategy
| Feature | Description | Feature range | Comment | Imputation status |
| -- | -- | -- | -- | -- |
| 'airconditioningtypeid' | Type of cooling system present in the home | nan, 1-13 | amenities | Fill with 5, as suggested in the data dictionary |
| 'architecturalstyletypeid' | Architectural style of the home (i.e. ranch, colonial, split-level, etc.) | nan, 1-27 | exterior design | Fill with 19, as suggested in the data dictionary |
| 'basementsqft' | Finished living area below or partially below ground level | nan, 20-8516 | sqft; no basement may be recorded as null | median |
| 'bathroomcnt' | Number of bathrooms in home including fractional bathrooms | nan, 0-20 | room cnt | median |
| 'bedroomcnt' | Number of bedrooms in home | 0-20 | room cnt | median |
| 'buildingqualitytypeid' | Overall assessment of condition of the building from best (lowest) to worst (highest) | nan, 1-12 | quality | Needs imputation for nan; could be zero, maybe median |
| 'buildingclasstypeid' | The building framing type (steel frame, wood frame, concrete/brick) | nan, 1-5 | structural frames | Drop |
| 'calculatedbathnbr' | Number of bathrooms in home including fractional bathroom | nan, 0-20 | room cnt, may be same as bathroomcnt | Duplicate of bathroomcnt. Drop |
| 'decktypeid' | Type of deck (if any) present on parcel | nan or 66 | - | Drop |
| 'threequarterbathnbr' | Number of 3/4 bathrooms in house (shower + sink + toilet) | nan, 1-4 | amenities cnt | Fill 1 |
| 'finishedfloor1squarefeet' | Size of the finished living area on the first (entry) floor of the home | 0-31303 | sqft | Duplicate of finishedsquarefeet50. Drop |
| 'calculatedfinishedsquarefeet' | Calculated total finished living area of the home | nan, 1-952576 | sqft | Duplicate of finishedsquarefeet50. Drop |
| 'finishedsquarefeet6' | Base unfinished and finished area | nan, 117-952576 | sqft | Duplicate of finishedsquarefeet50. Drop |
| 'finishedsquarefeet12' | Finished living area | - | sqft | Duplicate of finishedsquarefeet50. Drop |
| 'finishedsquarefeet13' | Perimeter living area | - | sqft | Duplicate of finishedsquarefeet50. Drop |
| 'finishedsquarefeet15' | Total area | - | sqft | Duplicate of finishedsquarefeet50. Drop |
| 'finishedsquarefeet50' | Size of the finished living area on the first (entry) floor of the home | - | sqft | median |
| 'fips' | Federal Information Processing Standard code | may be same as zip code | Location | Drop; duplicates zip code info |
| 'fireplacecnt' | Number of fireplaces in a home (if any) | nan, 1-9 | amenities | Fill 0 if null |
| 'fireplaceflag' | Is a fireplace present in this home | nan, True | Binary | Replace with 1 and 0 |
| 'fullbathcnt' | Number of full bathrooms (sink, shower + bathtub, and toilet) present in home | nan, 1-20 | room cnt | Same as bathroomcnt |
| 'garagecarcnt' | Total number of garages on the lot including an attached garage | nan, 0-25 | room cnt | Fill 0 if null |
| 'garagetotalsqft' | Total number of square feet of all garages on lot including an attached garage | nan, 0-7749 | sqft | Fill 0 if null |
| 'hashottuborspa' | Does the home have a hot tub or spa | nan, True | Binary | Replace with 1 and 0 |
| 'heatingorsystemtypeid' | Type of home heating system | nan, 1-24 | amenities | None is 13 in the data dictionary |
| 'latitude' | Latitude of the middle of the parcel multiplied by 10e6 | 'loc_latitude' | Location | mean if null |
| 'longitude' | Longitude of the middle of the parcel multiplied by 10e6 | 'loc_longitude' | Location | mean if null |
| 'lotsizesquarefeet' | Area of the lot in square feet | 'sqft_total_lot', 100-3282638 | sqft | median |
| 'numberofstories' | Number of stories or levels the home has | nan, 1-6 | cnt | np.random.randint(1,3) |
| 'parcelid' | Unique identifier for parcels (lots) | 'ParcelId' | ID | unique identifier |
| 'poolcnt' | Number of pools on the lot (if any) | nan, 1.0 | Binary | Replace with 1 and 0 |
| 'poolsizesum' | Total square footage of all pools on property | nan, 19-17410 | amenities sqft | If pool count has a value and the pool size is missing, fill with median |
| 'pooltypeid10' | Spa or Hot Tub | nan, 1.0 | Binary | Replace with 1 and 0 |
| 'pooltypeid2' | Pool with Spa/Hot Tub | nan, 1.0 | Binary | Replace with 1 and 0 |
| 'pooltypeid7' | Pool without hot tub | nan, 1.0 | Binary | Replace with 1 and 0 |
| 'propertycountylandusecode' | County land use code i.e. its zoning at the county level | object type of zoning code, with nan | Land Use | mode |
| 'propertylandusetypeid' | Type of land use the property is zoned for | Object | landuse zoning, with nan | mode |
| 'propertyzoningdesc' | Description of the allowed land uses (zoning) for that property | text zoning, with nan | Land Use | Drop row if median doesn't fill |
| 'rawcensustractandblock' | Census tract and block ID combined - also contains blockgroup assignment by extension | value | census | mode |
| 'censustractandblock' | Census tract and block ID combined - also contains blockgroup assignment by extension | value, with nan | census | mode |
| 'regionidcounty' | County in which the property is located | 1286.0, 2061.0, 3101.0, nan | Location; may drop because of duplicate info | median |
| 'regionidcity' | City in which the property is located (if any) | values | Location | median |
| 'regionidzip' | Zip code in which the property is located | 'loc_zipcode' | Location | mode |
| 'regionidneighborhood' | Neighborhood in which the property is located | value, with nan | Location | median |
| 'roomcnt' | Total number of rooms in the principal residence | nan, 0-96 | room cnt | median |
| 'storytypeid' | Type of floors in a multi-story house (i.e. basement and main level, split-level, attic, etc.). See tab for details. | nan, 7.0, but dict shows 1-35 | exterior design, 7 is Basement | Only nan or 7 appears in the data set and the missing rate is 99%, so imputation would not be representative. Drop |
| 'typeconstructiontypeid' | What type of construction material was used to construct the home | nan, 4.0, 6, 10, 11, 13, but records show 1-18 | exterior design | Drop: the missing rate is 99% and imputation would not be representative |
| 'unitcnt' | Number of units the structure is built into (i.e. 2 = duplex, 3 = triplex, etc.) | nan, 1-997 | amenities cnt | median |
| 'yardbuildingsqft17' | Patio in yard | nan, 10-7983 | sqft | 0 if null |
| 'yardbuildingsqft26' | Storage shed/building in yard | nan, 10-6141 | sqft | 0 if null |
| 'yearbuilt' | The year the principal residence was built | nan, 1801-2015 | compute a new feature for the age of the building | median |
| 'taxvaluedollarcnt' | The total tax assessed value of the parcel | 'value_tax_total' | liability | median |
| 'structuretaxvaluedollarcnt' | The assessed value of the built structure on the parcel | 'value_tax_building' | liability | median |
| 'landtaxvaluedollarcnt' | The assessed value of the land area of the parcel | 'value_tax_lot' | liability | median |
| 'taxamount' | The total property tax assessed for that assessment year | numerical value | liability | median |
| 'assessmentyear' | The year of the property tax assessment | 'date_year_of_tax' | liability | median |
| 'taxdelinquencyflag' | Property taxes for this parcel are past due as of 2015 | nan, Y | liability | Replace with 1 and 0 |
| 'taxdelinquencyyear' | Year for which the unpaid property taxes were due | numerical value | liability | Replace with median if tax flag is true |
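
The constant-fill rows of the table could be collected into one mapping and applied in a single call. This is only a minimal sketch on a toy frame: `constant_fills` and `toy` are illustrative names that do not appear in the notebook, and only a handful of the table's columns are included.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping built from some "fill with a constant" rows of the table.
constant_fills = {
    'airconditioningtypeid': 5,      # value suggested in the data dictionary
    'architecturalstyletypeid': 19,  # value suggested in the data dictionary
    'heatingorsystemtypeid': 13,     # "None" in the data dictionary
    'garagecarcnt': 0,
    'garagetotalsqft': 0,
    'fireplacecnt': 0,
}

# Toy data standing in for the merged training frame.
toy = pd.DataFrame({
    'airconditioningtypeid': [1.0, np.nan],
    'garagecarcnt': [np.nan, 2.0],
    'bathroomcnt': [np.nan, 2.0],    # not in the mapping, so left untouched
})

# fillna accepts a dict of column -> value; keys absent from the frame are skipped.
filled = toy.fillna(value=constant_fills)
print(filled)
```

Keeping the rules in one dictionary makes it easier to compare the code against the table row by row.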

# Read the data

- A dataset of this size is slow to load. Following the implementations in [Faster data loading time in Python](https://www.kaggle.com/c/zillow-prize-1/discussion/37261) and [Reducing DataFrame memory size by ~65%](https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65), the data is serialized to pickle files on the first run, so subsequent runs load the pickles instead of re-parsing the csv files.

- The Training Data dataset contains the log error and transaction dates for 90,275 homes sold during 2016.

- The Property Data dataset contains 58 different features. Unlike the Training Data, this dataset features information on all homes - not just ones that have been sold. We will see that many of the homes in this dataset are missing information.
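
The caching idea from the first bullet can be sketched as one small helper. This is an illustration under assumptions, not the notebook's own loader: `load_cached` and the toy file names are hypothetical, and the round trip runs in a temporary directory.

```python
import os
import tempfile
import pandas as pd

def load_cached(csv_path, pickle_path, **read_csv_kwargs):
    """Return the pickled frame if it exists; otherwise parse the CSV and cache it."""
    if os.path.exists(pickle_path):
        return pd.read_pickle(pickle_path)
    frame = pd.read_csv(csv_path, **read_csv_kwargs)
    frame.to_pickle(pickle_path)  # write to the same path that is checked above
    return frame

# Toy round trip in a temporary directory (file names are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'toy.csv')
    pickle_path = os.path.join(tmp, 'toy_p')
    pd.DataFrame({'parcelid': [1, 2], 'logerror': [0.01, -0.02]}).to_csv(csv_path, index=False)

    first = load_cached(csv_path, pickle_path)   # parses the CSV and writes the pickle
    cache_written = os.path.exists(pickle_path)
    second = load_cached(csv_path, pickle_path)  # served from the pickle
```

Passing the pickle path in once means the existence check and the write cannot drift apart.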



In [2]:
def load_data():
    # Pickled versions of Data Sets
    train2016_p = './data/train2016_p'
    train2017_p = './data/train2017_p'
    prop2016_p = './data/prop2016_p'
    prop2017_p = './data/prop2017_p'
    sample_p = './data/sample_p'

    # If pickled train2016 exists, load it; else load train_2016_v2.csv to df and pickle it
    if os.path.exists(train2016_p):
        train2016 = pd.read_pickle(train2016_p)
    else:
        # load data to df
        train2016 = pd.read_csv('./data/train_2016_v2.csv',parse_dates=['transactiondate'])
        # create pickled file for storage
        train2016.to_pickle('./data/train2016_p')

    # If pickled train2017 exists, load it; else load train_2017.csv to df and pickle it
    if os.path.exists(train2017_p):
        train2017 = pd.read_pickle(train2017_p)
    else:
        # load data to df
        train2017 = pd.read_csv('./data/train_2017.csv',parse_dates=['transactiondate'])
        # create pickled file for storage
        train2017.to_pickle('./data/train2017_p')

    # If pickled prop2016 exists, load it; else load properties_2016.csv to df and pickle it
    if os.path.exists(prop2016_p):
        prop2016 = pd.read_pickle(prop2016_p)
    else:
        prop2016 = pd.read_csv('./data/properties_2016.csv')
        # write to the same path that is checked above, so the cache is found on the next run
        prop2016.to_pickle(prop2016_p)

    # If pickled prop2017 exists, load it; else load properties_2017.csv to df and pickle it
    if os.path.exists(prop2017_p):
        prop2017 = pd.read_pickle(prop2017_p)
    else:
        prop2017 = pd.read_csv('./data/properties_2017.csv')
        # write to the same path that is checked above, so the cache is found on the next run
        prop2017.to_pickle(prop2017_p)

    # If pickled sample exists, load it; else load sample_submission.csv to df and pickle it
    if os.path.exists(sample_p):
        sample = pd.read_pickle(sample_p)
    else:
        sample = pd.read_csv('./data/sample_submission.csv')
        sample.to_pickle('./data/sample_p')
    return prop2016, prop2017, train2016, train2017, sample

In [3]:
#Load Datasets
properties_2016, properties_2017, train_2016_v2, train_2017, sample = load_data()

In [4]:
# Print row and columns of each dataset
print(properties_2016.shape)
print(properties_2017.shape)
print(train_2016_v2.shape)
print(train_2017.shape)

(2985217, 58)
(2985217, 58)
(90275, 3)
(77613, 3)


In [5]:
#Merge transaction table with the properties table by parcelid
train_df_2016 = train_2016_v2.merge(properties_2016, on='parcelid', how='left')
train_df_2017 = train_2017.merge(properties_2017, on='parcelid', how='left')

print(train_df_2016.shape)
print(train_df_2017.shape)

# Stack the two years row-wise; pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat([train_df_2016, train_df_2017], ignore_index=True)

print(df.shape)

(90275, 60)
(77613, 60)
(167888, 60)
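
A left merge keeps every transaction even when no property row matches, so a quick check on the join can be useful. The sketch below uses toy frames; `transactions`, `properties`, and `unmatched` are illustrative names, while `validate` and `indicator` are standard `merge` parameters.

```python
import pandas as pd

# Toy stand-ins for the transaction and property tables.
transactions = pd.DataFrame({'parcelid': [10, 11, 11, 12],
                             'logerror': [0.1, 0.0, -0.1, 0.2]})
properties = pd.DataFrame({'parcelid': [10, 11], 'bedroomcnt': [3, 2]})

merged = transactions.merge(
    properties,
    on='parcelid',
    how='left',
    validate='many_to_one',  # raises MergeError if parcelid repeats in `properties`
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Transactions with no matching property row.
unmatched = (merged['_merge'] == 'left_only').sum()
print(len(merged), unmatched)
```

With `many_to_one` validation the merged frame has the same number of rows as the transaction table, which matches the shapes printed above.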


In [6]:
# Count missing values per column, then express the count as a percentage of all rows
missing_value = df.isnull().sum(axis=0).sort_values(ascending=False).to_frame('missing value')
missing_value['missing value pct'] = missing_value['missing value'] / df.shape[0] * 100
missing_value

| | missing value | missing value pct |
| -- | -- | -- |
| buildingclasstypeid | 167857 | 99.981535 |
| finishedsquarefeet13 | 167813 | 99.955327 |
| basementsqft | 167795 | 99.944606 |
| storytypeid | 167795 | 99.944606 |
| yardbuildingsqft26 | 167723 | 99.90172 |
| fireplaceflag | 167494 | 99.76532 |
| architecturalstyletypeid | 167420 | 99.721243 |
| typeconstructiontypeid | 167366 | 99.689078 |
| finishedsquarefeet6 | 167081 | 99.519322 |
| decktypeid | 166616 | 99.242352 |
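
A missing-rate table like this one can also be used to shortlist candidate columns for dropping. The following is a small sketch on toy data; the 70% `threshold` is a hypothetical cut-off chosen for the example, not a value taken from this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],
    'complete': [1, 2, 3, 4],
    'half_missing': [np.nan, 2.0, np.nan, 4.0],
})

# isnull().mean() gives the fraction of missing values per column.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)

threshold = 70  # hypothetical cut-off, in percent
sparse_columns = missing_pct[missing_pct > threshold].index.tolist()
print(sparse_columns)
```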


# Dealing with missing values

[KNN filling - Carefully-dealing-with-missing-values](https://www.kaggle.com/nikunjm88/carefully-dealing-with-missing-values)

[Creating-additional-features](https://www.kaggle.com/nikunjm88/creating-additional-features)

### Imputation for geo value with KNN 

In [7]:
# Fill the 34 missing latitude and longitude values with the column mean
df['latitude'] = df['latitude'].fillna(df['latitude'].mean())
df['longitude'] = df['longitude'].fillna(df['longitude'].mean())

In [8]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn import neighbors

import warnings
warnings.filterwarnings("ignore")

def fillna_knn( df, base, target, fraction = 1, threshold = 10, n_neighbors = 5 ):
    # base: list/array of predictor columns; target: name of the column to impute
    assert isinstance( base, ( list, np.ndarray ) ) and isinstance( target, str )
    whole = [ target ] + list( base )

    miss = df[target].isnull()
    notmiss = ~miss
    nummiss = miss.sum()

    # Fit on a random sample (frac = fraction) of the rows where the target is known
    enc = OneHotEncoder()
    X_target = df.loc[ notmiss, whole ].sample( frac = fraction )

    enc.fit( X_target[ target ].unique().reshape( (-1,1) ) )

    Y = enc.transform( X_target[ target ].values.reshape((-1,1)) ).toarray()
    X = X_target[ base ]

    print( 'fitting' )
    clf = neighbors.KNeighborsClassifier( n_neighbors, weights = 'uniform' )
    clf.fit( X, Y )

    # active_features_ was removed in scikit-learn 0.22; categories_[0] holds the encoded values
    categories = enc.categories_[0]
    print( 'the shape of active features: ' , categories.shape )

    print( 'predicting' )
    Z = clf.predict(df.loc[miss, base])

    numunperdicted = Z[:,0].sum()
    if numunperdicted / nummiss *100 < threshold :
        print( 'writing result to df' )
        # Map each one-hot prediction back to its category value
        df.loc[ miss, target ]  = np.dot( Z , categories )
        print( 'num of unperdictable data: ', numunperdicted )
        return enc
    else:
        print( 'out of threshold: {}% > {}%'.format( numunperdicted / nummiss *100 , threshold ) )

# function to deal with variables that are actually string/categories
def zoningcode2int( df, target ):
    storenull = df[ target ].isnull()
    enc = LabelEncoder( )
    df[ target ] = df[ target ].astype( str )

    print('fit and transform')
    df[ target ]= enc.fit_transform( df[ target ].values )
    print( 'num of categories: ', enc.classes_.shape  )
    df.loc[ storenull, target ] = np.nan
    print('recover the nan value')
    return enc
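
The encode-then-restore-NaN pattern in `zoningcode2int` can also be written with pandas alone. This is an alternative sketch, not the notebook's method: it uses `pd.factorize`, and the zoning labels in `zoning` are made-up toy values.

```python
import numpy as np
import pandas as pd

zoning = pd.Series(['LAR1', np.nan, 'LBR1N', 'LAR1', np.nan])

# factorize assigns -1 to missing values by default; sort=True orders the codes by label.
codes, labels = pd.factorize(zoning, sort=True)

encoded = pd.Series(codes, index=zoning.index).astype(float)
encoded.loc[encoded < 0] = np.nan  # restore the missing entries
print(encoded.tolist(), list(labels))
```

Unlike the string cast in `zoningcode2int`, this never turns NaN into the category `'nan'`, so the number of categories excludes the missing marker.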

In [9]:
import warnings
warnings.filterwarnings("ignore")

zoningcode2int( df = df,
                            target = 'propertycountylandusecode' )

fillna_knn( df = df,
                  base = [ 'latitude', 'longitude' ] ,
                  target = 'propertycountylandusecode', fraction = 0.15, n_neighbors = 5 )

zoningcode2int( df = df,
                            target = 'propertyzoningdesc' )

fillna_knn( df = df,
                  base = [ 'latitude', 'longitude' ] ,
                  target = 'propertyzoningdesc', fraction = 0.15, n_neighbors = 5 )



fit and transform
num of categories:  (91,)
recover the nan value
fitting
the shape of active features:  (58,)
predicting
writing result to df
num of unperdictable data:  0.0
fit and transform
num of categories:  (2347,)
recover the nan value
fitting
the shape of active features:  (1348,)
predicting
writing result to df
num of unperdictable data:  0.0


OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)

In [10]:
# regionidcity, regionidneighborhood & regionidzip - assume it is the same as the nearest property.
# As mentioned above, this is ok if there's a property very nearby to the one with missing values (I leave it up to the reader to check if this is the case!)
fillna_knn( df = df,
                  base = [ 'latitude', 'longitude' ] ,
                  target = 'regionidcity', fraction = 0.15, n_neighbors = 5 )

fillna_knn( df = df,
                  base = [ 'latitude', 'longitude' ] ,
                  target = 'regionidneighborhood', fraction = 0.15, n_neighbors = 5 )

fillna_knn( df = df,
                  base = [ 'latitude', 'longitude' ] ,
                  target = 'regionidzip', fraction = 0.15, n_neighbors = 5 )

# unitcnt - the number of units the structure is built into.
# It could be assumed to match the nearest properties. If the property with missing values is in a block of flats or in a terrace street then this is probably ok - but again I leave it up to the reader to check if this is the case!

# lot size square feet - nearest neighbours would be another option (assume the same lot size as the closest property).
# In this notebook both unitcnt and lotsizesquarefeet are filled with the median in the next section.


fitting
the shape of active features:  (172,)
predicting
writing result to df
num of unperdictable data:  2.0
fitting
the shape of active features:  (415,)
predicting
writing result to df
num of unperdictable data:  301.0
fitting
the shape of active features:  (385,)
predicting
writing result to df
num of unperdictable data:  0.0


OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)
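
The comment above leaves it to the reader to check whether a labelled property really lies close to each imputed one. One way to do that check is sketched below on toy coordinates. It fits `KNeighborsClassifier` directly on the labels rather than on one-hot columns as `fillna_knn` does, and `max_dist` is a hypothetical tolerance.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

toy = pd.DataFrame({
    'latitude':  [0.0, 0.1, 0.2, 5.0, 5.1, 0.05, 20.0],
    'longitude': [0.0, 0.1, 0.0, 5.0, 5.1, 0.05, 20.0],
    'regionidzip': [1.0, 1.0, 1.0, 2.0, 2.0, np.nan, np.nan],
})
coords = ['latitude', 'longitude']
known = toy['regionidzip'].notnull()

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(toy.loc[known, coords].to_numpy(),
        toy.loc[known, 'regionidzip'].astype(int).to_numpy())

missing_xy = toy.loc[~known, coords].to_numpy()
predicted = clf.predict(missing_xy)

# Distance from each imputed row to its single nearest labelled neighbour.
nearest_dist, _ = clf.kneighbors(missing_xy, n_neighbors=1)
nearest_dist = nearest_dist.ravel()

max_dist = 1.0  # hypothetical tolerance, in the same units as the coordinates
trusted = nearest_dist <= max_dist
print(predicted, trusted)
```

In the toy data the last row still receives a prediction, but its nearest labelled neighbour is far away, so it is flagged as untrusted.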

### Filling missing values 

In [11]:
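# Most stories are between 1 and 2 as the data suggested; draw 1 or 2 at random for each missing row
# (.loc avoids chained assignment, and size= gives every row its own draw instead of one shared value)
stories_missing = df['numberofstories'].isnull()
df.loc[stories_missing, 'numberofstories'] = np.random.randint(1, 3, size=stories_missing.sum())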

# A three-quarter or 3/4 bath, is generally one with a toilet, sink and shower, but not a tub.
df['threequarterbathnbr'] = df['threequarterbathnbr'].fillna(0)

# A null garage count is taken to mean there is no garage, so the garage sqft is set to 0 as well
df['garagecarcnt'] = df['garagecarcnt'].fillna(0)   
df['garagetotalsqft']=  df['garagetotalsqft'].fillna(0)
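
As an alternative to drawing story counts from a fixed range, missing values could be sampled from the observed ones so that imputed rows follow the empirical distribution. This is a sketch of that different technique, not what the cell above does; the seed and the toy `stories` series are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the imputation is reproducible

stories = pd.Series([1.0, 1.0, 2.0, np.nan, 1.0, np.nan, 3.0])
mask = stories.isnull()

# Draw one value per missing row from the observed (non-null) values.
observed = stories[~mask].to_numpy()
stories.loc[mask] = rng.choice(observed, size=mask.sum())
print(stories.tolist())
```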

In [12]:
# From the data dictionary, none is listed as 5 
df['airconditioningtypeid']= df['airconditioningtypeid'].fillna(5)

# From the data dictionary, Other is 19
df['architecturalstyletypeid'] = df['architecturalstyletypeid'].fillna(19)

# From the data dictionary, none is listed as 13
df['heatingorsystemtypeid'] = df['heatingorsystemtypeid'].fillna(13)

In [13]:
# Replace nan value with 0 in the binary object columns
df['hashottuborspa'] = df['hashottuborspa'].apply(lambda x: 1 if x == True else 0)
df['fireplaceflag'] = df['fireplaceflag'].apply(lambda x: 1 if x == True else 0)
df['taxdelinquencyflag'] = df['taxdelinquencyflag'].apply(lambda x: 1 if x=='Y' else 0)

# Fill NaN with 0 in the count, pool-type flag, and yard-building sqft columns
for i in ('fireplacecnt', 'poolcnt', 'pooltypeid10','pooltypeid2','pooltypeid7',
          'yardbuildingsqft17','yardbuildingsqft26'):
    df[i] = df[i].fillna(0)

# Fill census, land-use, and region ID columns with the mode (most frequent value)
for i in ('rawcensustractandblock','censustractandblock', 'fips',
          'propertylandusetypeid',
         'regionidcounty',
         'regionidzip',
         'regionidcity'):
    df[i] = df[i].fillna(df[i].mode()[0])
    
# room count, sqft, and attributes filled with median
for i in ('bathroomcnt','bedroomcnt', 'roomcnt','unitcnt','lotsizesquarefeet', 'calculatedfinishedsquarefeet',
          'buildingqualitytypeid',
          'basementsqft', 'yearbuilt',
          'finishedsquarefeet50'):
    df[i] = df[i].fillna(df[i].median())

# Fill in median for the tax liability
for i in ('assessmentyear','taxvaluedollarcnt',
          'landtaxvaluedollarcnt','structuretaxvaluedollarcnt','taxamount'):
    df[i] = df[i].fillna(df[i].median())

# If the delinquency flag is set but the year is missing, fill with the median year of the flagged rows.
# If taxdelinquencyflag is zero, no tax is owed, so the delinquency year is set to 0
df.loc[(df['taxdelinquencyflag'] > 0) & (df['taxdelinquencyyear'].isnull()), 'taxdelinquencyyear'] = df.loc[df['taxdelinquencyflag'] > 0, 'taxdelinquencyyear'].median()
df.loc[(df['taxdelinquencyflag'] == 0), 'taxdelinquencyyear'] = 0

# Fill null values with conditions. If pool count has value and missing the poolsize, fill with median. 
df.loc[(df['poolcnt'] > 0) & (df['poolsizesum'].isnull()), 'poolsizesum'] = df.loc[df['poolcnt'] > 0, 'poolsizesum'].median()
df.loc[(df['poolcnt'] == 0), 'poolsizesum'] = 0
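
The conditional fill used for the pool size can be traced on a toy frame. The values below are made up for illustration; the column names follow the cell above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'poolcnt':     [1.0, 1.0, 1.0, 0.0, 0.0],
    'poolsizesum': [400.0, 600.0, np.nan, np.nan, np.nan],
})

has_pool = toy['poolcnt'] > 0
pool_median = toy.loc[has_pool, 'poolsizesum'].median()  # median over homes with a pool

# Homes with a pool but no recorded size get the pool-only median ...
toy.loc[has_pool & toy['poolsizesum'].isnull(), 'poolsizesum'] = pool_median
# ... and homes without a pool get a size of 0.
toy.loc[~has_pool, 'poolsizesum'] = 0
print(toy)
```

Computing the median over pool owners only keeps the many pool-free homes from pulling the fill value toward zero.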

In [14]:
missingcount = df.isnull().sum(axis=0)
missingcount[missingcount>0]

buildingclasstypeid         167857
calculatedbathnbr             1832
decktypeid                  166616
finishedfloor1squarefeet    154995
finishedsquarefeet12          8369
finishedsquarefeet13        167813
finishedsquarefeet15        161297
finishedsquarefeet6         167081
fullbathcnt                   1832
storytypeid                 167795
typeconstructiontypeid      167366
dtype: int64

### Drop duplicate columns

In [16]:
# Remove variables that duplicate finishedsquarefeet50 (total area) to avoid multicollinearity
# Variables with over 90% missing values are hard to impute reliably and are dropped as well

drop_columns= ['finishedsquarefeet12',
               'finishedsquarefeet13', 
               'finishedsquarefeet15',
               'finishedsquarefeet6',
               'finishedfloor1squarefeet',
               'buildingclasstypeid',
               'calculatedbathnbr',
               'typeconstructiontypeid',
               'fullbathcnt',# same as bathroomcnt
               'decktypeid',# too many missing values, as for the attribute below
               'storytypeid']

df = df.drop(columns=drop_columns)
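
Before dropping a column as a duplicate, its agreement with the column it is meant to duplicate can be measured. This is a minimal sketch on made-up values; `agreement` and `correlation` are illustrative names.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'bathroomcnt':       [1.0, 2.0, 2.5, 3.0, 0.0],
    'calculatedbathnbr': [1.0, 2.0, 2.5, 3.0, np.nan],
})

# Compare only rows where both columns are present.
both = toy[['bathroomcnt', 'calculatedbathnbr']].dropna()
agreement = (both['bathroomcnt'] == both['calculatedbathnbr']).mean()
correlation = both['bathroomcnt'].corr(both['calculatedbathnbr'])
print(len(both), agreement, correlation)
```

An agreement rate close to 1 on the shared rows supports treating the second column as redundant.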

In [17]:
missingcount = df.isnull().sum(axis=0)
missingcount[missingcount>0]

Series([], dtype: int64)

### Check point: Save data frame 

In [18]:
df.to_pickle('./data/df_wip')

Proportion of rows kept after dropping rows (167,046 of 167,888), followed by the percentage dropped

In [2]:
167046/167888

0.9949847517392547

In [5]:
(1- (167046/167888))*100

0.5015248260745286
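
The same arithmetic can be written with named values so that the number of dropped rows is explicit. The two totals are the ones used in the cells above; the variable names are illustrative.

```python
total_rows = 167888
kept_rows = 167046

dropped_rows = total_rows - kept_rows
dropped_pct = dropped_rows / total_rows * 100
print(dropped_rows, round(dropped_pct, 2))
```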