## Data Processing

In this notebook, we will create a preprocessor class that does the following:

1. Remove certain columns
2. Add month and year columns derived from the transaction date
3. Deal with missing values
4. One-hot encode categorical columns

##### The DataProcessor class follows the fit/transform template so that it can be used in machine learning pipelines
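
To make the fit/transform contract concrete, here is a minimal sketch. It is not the DataProcessor built in this notebook: `MedianFiller` and the toy frames are hypothetical names used only for illustration. The idea is that `fit` learns state from the training data, and `transform` reuses that state on any data passed to it.

```python
import pandas as pd


class MedianFiller:
    """Hypothetical transformer: learns column medians in fit, applies them in transform."""

    def fit(self, X, y=None):
        # State is learned from the training data only
        self.medians_ = X.median(numeric_only=True)
        return self

    def transform(self, X, y=None):
        # The learned medians are reused for both train and test data
        return X.fillna(self.medians_)

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


# Toy data, made up for this example
train = pd.DataFrame({"sqft": [1000.0, None, 3000.0]})
test = pd.DataFrame({"sqft": [None, 1500.0]})

filler = MedianFiller()
train_filled = filler.fit_transform(train)
test_filled = filler.transform(test)  # filled with the median learned from train
```

Because the test data is only ever passed to `transform`, nothing is learned from it, which is the property a pipeline relies on.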

In [40]:
import numpy as np
import pandas as pd
import sys
import inspect

# Add the scripts directory to the sys path
sys.path.append("../src/data")

# Show all rows and columns in the display
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

from make_dataset import get_data

In [9]:
print(inspect.getsource(get_data))

def get_data(data_string):
    """
    Read the train/test dataset and merge with properties data set and remove duplicate parcelid's in train
    
    Keyword Arguments:
    data_string -- "train" or "test" 
    
    Returns:
    X, y -- a tuple of dataframe X and Series y
    
    
    """         
    year = 2016 if data_string == "train" else 2017
        
    train = read_data("train_{0}".format(year))
    properties = read_data("properties_{0}".format(year))
    merged = pd.merge(train, properties, on="parcelid", how="left")
                      
    if data_string == "train":
        merged = remove_duplicate_parcels(merged)
                          
    y = merged["logerror"]                          
    merged = merged.drop(columns=["logerror"], axis=1)     
    
    return merged, y



In [10]:
train_X, train_y = get_data(data_string="train")



In [18]:
train_X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90150 entries, 0 to 12731
Data columns (total 59 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   parcelid                      90150 non-null  int64  
 1   transactiondate               90150 non-null  object 
 2   airconditioningtypeid         28748 non-null  float64
 3   architecturalstyletypeid      260 non-null    float64
 4   basementsqft                  43 non-null     float64
 5   bathroomcnt                   90150 non-null  float64
 6   bedroomcnt                    90150 non-null  float64
 7   buildingclasstypeid           16 non-null     float64
 8   buildingqualitytypeid         57284 non-null  float64
 9   calculatedbathnbr             88974 non-null  float64
 10  decktypeid                    658 non-null    float64
 11  finishedfloor1squarefeet      6850 non-null   float64
 12  calculatedfinishedsquarefeet  89492 non-null  float64
 13  f

In [12]:
train_y

0        0.0276
1       -0.1684
2       -0.0040
3        0.0218
4       -0.0050
          ...  
40955   -0.0823
32484    0.0516
2055     0.0488
38053    0.0227
12731    0.1621
Name: logerror, Length: 90150, dtype: float64

### Drop certain columns

We will remove the columns below before going ahead with the analysis:

1. parcelid - an identifier that is not required for model training
2. propertyzoningdesc, rawcensustractandblock, regionidneighborhood, regionidzip, censustractandblock - these features have very high cardinality. We will keep regionidcity (177 unique groups) instead, as we need one feature to distinguish the regions of the properties.
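
One way to back up the cardinality argument is to count distinct values per column. The snippet below is a rough sketch on a made-up frame: the column names mirror the notebook, but the values and the `max_unique` threshold are assumptions chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for the real data; values are made up
df = pd.DataFrame({
    "regionidzip": [96001, 96002, 96003, 96004, 96005, 96006],
    "regionidcity": [12447, 12447, 12447, 25218, 25218, 25218],
})

# Number of distinct non-null values per column, highest first
cardinality = df.nunique().sort_values(ascending=False)
print(cardinality)

# Hypothetical threshold, picked only for this toy example
max_unique = 3
high_cardinality_cols = cardinality[cardinality > max_unique].index.tolist()
```

On the real data, the same `nunique()` call on `train_X` would show how the dropped columns compare with regionidcity.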

In [16]:
class DataProcessor:
    
    def __init__(self, cols_to_remove=None):
        self.cols_to_remove = cols_to_remove
    
    
    def fit(self, X, y=None):
        """Fit the processor on the training data (nothing to learn yet)."""
        
        return self
        
    def transform(self, X, y=None):
        """Apply the processing steps to the train/test data."""
        
        # drop() raises if columns=None, so fall back to an empty list
        X_new = X.drop(columns=self.cols_to_remove or [])
        
        return X_new
        
        
    def fit_transform(self, X, y=None):
        """Fit and transform."""
        
        return self.fit(X, y).transform(X)

In [20]:
dp = DataProcessor(cols_to_remove=["parcelid", "propertyzoningdesc", "rawcensustractandblock", "regionidneighborhood", "regionidzip", "censustractandblock"])
train_X_new = dp.transform(train_X)

In [22]:
train_X_new.info()  # 6 columns are removed (59 -> 53)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90150 entries, 0 to 12731
Data columns (total 53 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   transactiondate               90150 non-null  object 
 1   airconditioningtypeid         28748 non-null  float64
 2   architecturalstyletypeid      260 non-null    float64
 3   basementsqft                  43 non-null     float64
 4   bathroomcnt                   90150 non-null  float64
 5   bedroomcnt                    90150 non-null  float64
 6   buildingclasstypeid           16 non-null     float64
 7   buildingqualitytypeid         57284 non-null  float64
 8   calculatedbathnbr             88974 non-null  float64
 9   decktypeid                    658 non-null    float64
 10  finishedfloor1squarefeet      6850 non-null   float64
 11  calculatedfinishedsquarefeet  89492 non-null  float64
 12  finishedsquarefeet12          85485 non-null  float64
 13  f

### Add month and year columns and remove the transaction date column

We need month and year columns because predictions are required for 6 time points defined by month and year 
(October 2016, November 2016, December 2016, October 2017, November 2017, December 2017).
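
The date handling can be sketched in isolation as follows. The toy dates are made up, and the ISO `YYYY-MM-DD` string format is an assumption about how transactiondate is stored.

```python
import pandas as pd

# Toy frame; the date strings are made up and assumed to be ISO formatted
df = pd.DataFrame({"transactiondate": ["2016-01-15", "2016-10-03", "2017-12-31"]})

# Parse once, then derive both parts from the parsed dates
dates = pd.to_datetime(df["transactiondate"])
df["transactiondate_month"] = dates.dt.month
df["transactiondate_year"] = dates.dt.year

# The raw date column is no longer needed once month and year exist
df = df.drop(columns="transactiondate")
```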

In [33]:
class DataProcessor:
    
    def __init__(self, cols_to_remove=None, datecol=None):
        self.cols_to_remove = cols_to_remove
        self.datecol = datecol
    
    
    def fit(self, X, y=None):
        """Fit the processor on the training data (nothing to learn yet)."""
        
        return self
        
    def transform(self, X, y=None):
        """Apply the processing steps to the train/test data."""
                
        # drop() raises if columns=None, so fall back to an empty list
        X_new = X.drop(columns=self.cols_to_remove or [])
        
        if self.datecol:
            # Parse the date column once and derive month and year from it
            dates = pd.to_datetime(X_new[self.datecol])
            X_new[self.datecol + "_month"] = dates.dt.month
            X_new[self.datecol + "_year"] = dates.dt.year
            X_new = X_new.drop(columns=self.datecol)
        
        return X_new
        
        
    def fit_transform(self, X, y=None):
        """Fit and transform."""
        
        return self.fit(X, y).transform(X)

In [34]:
dp = DataProcessor(cols_to_remove=["parcelid", "propertyzoningdesc", "rawcensustractandblock", "regionidneighborhood", "regionidzip", "censustractandblock"], 
                  datecol="transactiondate")
train_X_new = dp.transform(train_X)

In [35]:
train_X_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90150 entries, 0 to 12731
Data columns (total 54 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   airconditioningtypeid         28748 non-null  float64
 1   architecturalstyletypeid      260 non-null    float64
 2   basementsqft                  43 non-null     float64
 3   bathroomcnt                   90150 non-null  float64
 4   bedroomcnt                    90150 non-null  float64
 5   buildingclasstypeid           16 non-null     float64
 6   buildingqualitytypeid         57284 non-null  float64
 7   calculatedbathnbr             88974 non-null  float64
 8   decktypeid                    658 non-null    float64
 9   finishedfloor1squarefeet      6850 non-null   float64
 10  calculatedfinishedsquarefeet  89492 non-null  float64
 11  finishedsquarefeet12          85485 non-null  float64
 12  finishedsquarefeet13          33 non-null     float64
 13  f

### Deal with missing values

As we saw in the exploratory data analysis notebook, many columns have missing values. We cannot remove those records or columns, as we would lose information that helps with predictions. Hence, let's take a look at imputing these values.

We could impute with median/mean values, but in many columns most of the records are missing, so the median/mean may not be meaningful there. 
Let's look at imputing with a negative value instead, provided the columns contain no negative values.

In [41]:
train_X_new.describe()

Unnamed: 0,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,calculatedbathnbr,decktypeid,finishedfloor1squarefeet,calculatedfinishedsquarefeet,finishedsquarefeet12,finishedsquarefeet13,finishedsquarefeet15,finishedsquarefeet50,finishedsquarefeet6,fips,fireplacecnt,fullbathcnt,garagecarcnt,garagetotalsqft,heatingorsystemtypeid,latitude,longitude,lotsizesquarefeet,poolcnt,poolsizesum,pooltypeid10,pooltypeid2,pooltypeid7,propertylandusetypeid,regionidcity,regionidcounty,roomcnt,storytypeid,threequarterbathnbr,typeconstructiontypeid,unitcnt,yardbuildingsqft17,yardbuildingsqft26,yearbuilt,numberofstories,structuretaxvaluedollarcnt,taxvaluedollarcnt,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyyear,transactiondate_month,transactiondate_year
count,28748.0,260.0,43.0,90150.0,90150.0,16.0,57284.0,88974.0,658.0,6850.0,89492.0,85485.0,33.0,3555.0,6850.0,419.0,90150.0,9597.0,88974.0,29897.0,29897.0,56005.0,90150.0,90150.0,80014.0,17876.0,966.0,1159.0,1204.0,16672.0,90150.0,88349.0,90150.0,90150.0,43.0,11996.0,298.0,58271.0,2645.0,95.0,89397.0,20540.0,89772.0,90149.0,90150.0,90149.0,90144.0,1775.0,90150.0,90150.0
mean,1.815222,7.230769,713.581395,2.279545,3.031936,4.0,5.565393,2.309175,66.0,1347.720146,1773.096869,1745.415079,1404.545455,2380.880731,1355.299416,2293.069212,6048.872812,1.187871,2.241172,1.812055,345.539619,3.926399,34005470.0,-118198900.0,29120.64,1.0,519.688406,1.0,1.0,1.0,261.833478,33762.57523,2525.456961,1.47858,7.0,1.008753,6.010067,1.110244,310.204915,311.694737,1968.539761,1.440798,180103.2,457637.9,2015.0,278287.9,5983.070888,13.40169,5.849917,2016.0
std,2.972108,2.721397,437.434198,1.004133,1.156114,0.0,1.900443,0.976137,0.0,652.039469,928.136339,909.947071,110.108211,1068.859229,673.376506,1341.830257,20.667459,0.484253,0.963106,0.608865,267.038335,3.683889,264979.2,360632.1,121790.9,0.0,155.116033,0.0,0.0,0.0,5.183186,46683.848298,805.657943,2.819802,0.0,0.100884,0.43797,0.797389,216.738757,346.35485,23.763165,0.544482,209076.9,554853.2,0.0,400504.0,6838.506814,2.720397,2.812363,0.0
min,1.0,2.0,100.0,0.0,0.0,4.0,1.0,1.0,66.0,44.0,2.0,2.0,1056.0,560.0,44.0,257.0,6037.0,1.0,1.0,0.0,0.0,1.0,33339300.0,-119447900.0,167.0,1.0,28.0,1.0,1.0,1.0,31.0,3491.0,1286.0,0.0,7.0,1.0,4.0,1.0,25.0,18.0,1885.0,1.0,100.0,22.0,2015.0,22.0,49.08,6.0,1.0,2016.0
25%,1.0,7.0,407.5,2.0,2.0,4.0,4.0,2.0,66.0,938.0,1184.0,1172.0,1392.0,1648.0,938.0,1109.0,6037.0,1.0,2.0,2.0,0.0,2.0,33811660.0,-118411700.0,5704.0,1.0,420.0,1.0,1.0,1.0,261.0,12447.0,1286.0,0.0,7.0,1.0,6.0,1.0,180.0,100.0,1953.0,1.0,81271.75,199056.0,2015.0,82285.0,2873.26,13.0,4.0,2016.0
50%,1.0,7.0,616.0,2.0,3.0,4.0,7.0,2.0,66.0,1244.0,1540.0,1518.0,1440.0,2104.0,1248.0,1986.0,6037.0,1.0,2.0,2.0,433.0,2.0,34021530.0,-118173400.0,7200.0,1.0,500.0,1.0,1.0,1.0,261.0,25218.0,3101.0,0.0,7.0,1.0,6.0,1.0,260.0,159.0,1970.0,1.0,132057.0,342931.0,2015.0,193000.0,4543.1,14.0,6.0,2016.0
75%,1.0,7.0,872.0,3.0,4.0,4.0,7.0,3.0,66.0,1614.0,2095.0,2056.0,1440.0,2863.0,1617.0,3405.5,6059.0,1.0,3.0,2.0,484.0,7.0,34172780.0,-117921600.0,11681.75,1.0,600.0,1.0,1.0,1.0,266.0,45457.0,3101.0,0.0,7.0,1.0,6.0,1.0,384.0,361.0,1987.0,2.0,210538.0,540589.0,2015.0,345384.0,6900.165,15.0,8.0,2016.0
max,13.0,21.0,1555.0,20.0,16.0,4.0,12.0,20.0,66.0,7625.0,22741.0,20013.0,1584.0,22741.0,8352.0,7224.0,6111.0,5.0,20.0,24.0,7339.0,24.0,34816010.0,-117554900.0,6971010.0,1.0,1750.0,1.0,1.0,1.0,275.0,396556.0,3101.0,18.0,7.0,4.0,13.0,143.0,2678.0,1366.0,2015.0,4.0,9948100.0,27750000.0,2015.0,24500000.0,321936.09,99.0,12.0,2016.0


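As the min row above shows, longitude is the only column with negative values, and it has no missing values (count of 90150). None of the numeric columns that need imputing contain negative values, hence let's impute the missing data with -1.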
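
Rather than reading the min row by eye, the same check can be done programmatically. This is a small sketch on a made-up frame (column names mirror the notebook, values are invented); it flags only the columns where -1 would be ambiguous, that is, columns with both real negative values and gaps to fill.

```python
import numpy as np
import pandas as pd

# Toy frame; values are made up for illustration
df = pd.DataFrame({
    "longitude": [-118.2, -118.4, -117.9],
    "poolcnt": [1.0, np.nan, np.nan],
    "basementsqft": [np.nan, 616.0, np.nan],
})

# describe() only summarizes numeric columns, so restrict the check to them
numeric = df.select_dtypes(include="number")
has_negative = numeric.min() < 0
has_missing = numeric.isna().any()

# -1 is ambiguous only where a column has both negatives and missing values
mask = has_negative & has_missing
conflicts = mask[mask].index.tolist()

# With no conflicts, -1 can act as the "missing" marker
filled = df.fillna(-1)
```

Here longitude is negative but complete, so `conflicts` is empty and -1 remains unambiguous in the filled columns.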

In [43]:
class DataProcessor:
    
    def __init__(self, cols_to_remove=None, datecol=None):
        self.cols_to_remove = cols_to_remove
        self.datecol = datecol
    
    
    def fit(self, X, y=None):
        """Fit the processor on the training data (nothing to learn yet)."""
        
        return self
        
    def transform(self, X, y=None):
        """Apply the processing steps to the train/test data."""
                
        # drop() raises if columns=None, so fall back to an empty list
        X_new = X.drop(columns=self.cols_to_remove or [])
        
        if self.datecol:
            # Parse the date column once and derive month and year from it
            dates = pd.to_datetime(X_new[self.datecol])
            X_new[self.datecol + "_month"] = dates.dt.month
            X_new[self.datecol + "_year"] = dates.dt.year
            X_new = X_new.drop(columns=self.datecol)
            
        # Mark missing values with the sentinel -1
        X_new = X_new.fillna(-1)
        
        return X_new
        
        
    def fit_transform(self, X, y=None):
        """Fit and transform."""
        
        return self.fit(X, y).transform(X)

In [44]:
dp = DataProcessor(cols_to_remove=["parcelid", "propertyzoningdesc", "rawcensustractandblock", "regionidneighborhood", "regionidzip", "censustractandblock"], 
                  datecol="transactiondate")
train_X_new = dp.transform(train_X)

In [45]:
train_X_new.info()  # no null values

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90150 entries, 0 to 12731
Data columns (total 54 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   airconditioningtypeid         90150 non-null  float64
 1   architecturalstyletypeid      90150 non-null  float64
 2   basementsqft                  90150 non-null  float64
 3   bathroomcnt                   90150 non-null  float64
 4   bedroomcnt                    90150 non-null  float64
 5   buildingclasstypeid           90150 non-null  float64
 6   buildingqualitytypeid         90150 non-null  float64
 7   calculatedbathnbr             90150 non-null  float64
 8   decktypeid                    90150 non-null  float64
 9   finishedfloor1squarefeet      90150 non-null  float64
 10  calculatedfinishedsquarefeet  90150 non-null  float64
 11  finishedsquarefeet12          90150 non-null  float64
 12  finishedsquarefeet13          90150 non-null  float64
 13  f