## Data Processing

In this notebook, we will create a DataProcessor class that does the following:

1. Remove certain columns
2. Add some columns related to date time
3. Deal with missing values
4. One hot encoding

##### The DataProcessor class follows the fit/transform template so that it can be used in machine learning pipelines
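
As a minimal sketch of the fit/transform contract (this is not the DataProcessor built below): `fit` learns state from the training data only, and `transform` applies that stored state to any data set. The `MedianFiller` class and the toy frames are hypothetical and exist only for illustration.

```python
import pandas as pd


class MedianFiller:
    """Hypothetical transformer: learns column medians in fit, fills NaN in transform."""

    def fit(self, X, y=None):
        # state is learned only from the data passed to fit (the training set)
        self.medians_ = X.median(numeric_only=True)
        return self

    def transform(self, X, y=None):
        # the stored training medians are reused, so test data never influences them
        return X.fillna(self.medians_)

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)


# made-up data
toy_train = pd.DataFrame({"sqft": [1000.0, 1500.0, None, 2000.0]})
toy_test = pd.DataFrame({"sqft": [None, 3000.0]})

filler = MedianFiller()
toy_train_filled = filler.fit_transform(toy_train)
toy_test_filled = filler.transform(toy_test)
print(toy_test_filled)
```

Keeping the learned state in `fit` is what allows the same object to be applied unchanged to the test set later.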

In [None]:
import sys
import inspect
import warnings

In [None]:
import numpy as np
import pandas as pd

In [23]:
# Add the scripts directory to the sys path
sys.path.append("../src/data")

from make_dataset import get_data

In [24]:
# suppress warnings
warnings.filterwarnings("ignore")

# Show all rows and columns in the display
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

In [26]:
print(inspect.getsource(get_data))

def get_data(data_string):
    """
    Read the train/test dataset and merge with properties data set and remove duplicate parcelid's in train
    
    Parameters:
    data_string -- "train" or "test" 
    
    Returns:
    X, y -- a tuple of dataframe X and Series y
    
    
    """         
    year = 2016 if data_string == "train" else 2017
        
    train = read_data("train_{0}".format(year))
    properties = read_data("properties_{0}".format(year))
    merged = pd.merge(train, properties, on="parcelid", how="left")
                      
    if data_string == "train":
        merged = remove_duplicate_parcels(merged)
                          
    y = merged["logerror"]                          
    merged = merged.drop(columns=["logerror"], axis=1) 
    
    id_col = ["parcelid"]
    cat_cols = ["airconditioningtypeid", "architecturalstyletypeid", "buildingclasstypeid", 
                "buildingqualitytypeid", "decktypeid", "fips", "fireplaceflag", 
                "hashottu

In [27]:
X_train, y_train = get_data(data_string="train")

In [28]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90150 entries, 0 to 12731
Data columns (total 59 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   parcelid                      90150 non-null  object        
 1   transactiondate               90150 non-null  datetime64[ns]
 2   airconditioningtypeid         28748 non-null  object        
 3   architecturalstyletypeid      260 non-null    object        
 4   basementsqft                  43 non-null     float64       
 5   bathroomcnt                   90150 non-null  float64       
 6   bedroomcnt                    90150 non-null  float64       
 7   buildingclasstypeid           16 non-null     object        
 8   buildingqualitytypeid         57284 non-null  object        
 9   calculatedbathnbr             88974 non-null  float64       
 10  decktypeid                    658 non-null    object        
 11  finishedfloor1squarefeet    

In [29]:
len(y_train)

90150

### Drop certain columns

We will remove the columns below before going ahead with the analysis:

1. parcelid - an identifier, not required for model training
2. propertyzoningdesc, rawcensustractandblock, regionidneighborhood, regionidzip, censustractandblock - these features have very high cardinality. We will keep regionidcity (177 unique groups) as the one feature that distinguishes the regions of the properties.
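
A quick way to compare cardinality across candidate columns is `DataFrame.nunique()`. The sketch below uses a small made-up frame (the values are illustrative, not taken from the actual data) and a hypothetical threshold to flag high-cardinality columns.

```python
import pandas as pd

# toy data: column names mirror the notebook, values are made up
toy = pd.DataFrame(
    {
        "regionidcity": ["a", "a", "b", "b", "a", "b"],
        "regionidzip": ["z1", "z2", "z3", "z4", "z5", "z5"],
    }
)

# number of distinct non-null values per column
cardinality = toy.nunique()

# hypothetical cut-off: flag columns with more than 3 distinct values
max_unique = 3
high_cardinality_cols = cardinality[cardinality > max_unique].index.tolist()
print(cardinality)
print(high_cardinality_cols)
```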

In [30]:
class DataProcessor:
    def __init__(self, cols_to_remove=None):
        self.cols_to_remove = cols_to_remove

    def fit(self, X, y=None):
        """fit the process on the training data"""

        return self

    def transform(self, X, y=None):
        """transform the process on the train/test data """

        X_new = X.drop(columns=self.cols_to_remove, axis=1)

        return X_new

    def fit_transform(self, X, y=None):
        """fit and transform"""

        return self.fit(X).transform(X)

In [31]:
dp = DataProcessor(
    cols_to_remove=[
        "parcelid",
        "propertyzoningdesc",
        "rawcensustractandblock",
        "regionidneighborhood",
        "regionidzip",
        "censustractandblock",
    ]
)
X_train_new = dp.transform(X_train)

In [32]:
X_train_new.info()  # 6 columns are removed

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90150 entries, 0 to 12731
Data columns (total 53 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   transactiondate               90150 non-null  datetime64[ns]
 1   airconditioningtypeid         28748 non-null  object        
 2   architecturalstyletypeid      260 non-null    object        
 3   basementsqft                  43 non-null     float64       
 4   bathroomcnt                   90150 non-null  float64       
 5   bedroomcnt                    90150 non-null  float64       
 6   buildingclasstypeid           16 non-null     object        
 7   buildingqualitytypeid         57284 non-null  object        
 8   calculatedbathnbr             88974 non-null  float64       
 9   decktypeid                    658 non-null    object        
 10  finishedfloor1squarefeet      6850 non-null   float64       
 11  calculatedfinishedsquarefeet

### Add month and year columns and remove the transaction date column

We need month and year columns because predictions are required for six time points defined by month and year
(October 2016, November 2016, December 2016, October 2017, November 2017, December 2017).
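
As a small illustration of these date features, the `.dt` accessor exposes month and year directly on a datetime column. The dates below are made up.

```python
import pandas as pd

toy = pd.DataFrame({"transactiondate": ["2016-10-03", "2017-12-15"]})

# parse once, then derive both features from the parsed column
dates = pd.to_datetime(toy["transactiondate"])
toy["transactiondate_month"] = dates.dt.month
toy["transactiondate_year"] = dates.dt.year

# the raw date is no longer needed once the features exist
toy = toy.drop(columns="transactiondate")
print(toy)
```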

In [33]:
class DataProcessor:
    def __init__(self, cols_to_remove=None, datecol=None):
        self.cols_to_remove = cols_to_remove
        self.datecol = datecol

    def fit(self, X, y=None):
        """fit the process on the training data"""

        return self

    def transform(self, X, y=None):
        """transform the process on the train/test data """

        X_new = X.drop(columns=self.cols_to_remove, axis=1)

        if self.datecol:
            X_new[self.datecol + "_month"] = pd.to_datetime(
                X_new[self.datecol]
            ).dt.month
            X_new[self.datecol + "_year"] = pd.to_datetime(X_new[self.datecol]).dt.year
            X_new = X_new.drop(columns=self.datecol, axis=1)

        return X_new

    def fit_transform(self, X, y=None):
        """fit and transform"""

        return self.fit(X).transform(X)

In [34]:
dp = DataProcessor(
    cols_to_remove=[
        "parcelid",
        "propertyzoningdesc",
        "rawcensustractandblock",
        "regionidneighborhood",
        "regionidzip",
        "censustractandblock",
    ],
    datecol="transactiondate",
)
X_train_new = dp.transform(X_train)

In [35]:
X_train_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90150 entries, 0 to 12731
Data columns (total 54 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   airconditioningtypeid         28748 non-null  object 
 1   architecturalstyletypeid      260 non-null    object 
 2   basementsqft                  43 non-null     float64
 3   bathroomcnt                   90150 non-null  float64
 4   bedroomcnt                    90150 non-null  float64
 5   buildingclasstypeid           16 non-null     object 
 6   buildingqualitytypeid         57284 non-null  object 
 7   calculatedbathnbr             88974 non-null  float64
 8   decktypeid                    658 non-null    object 
 9   finishedfloor1squarefeet      6850 non-null   float64
 10  calculatedfinishedsquarefeet  89492 non-null  float64
 11  finishedsquarefeet12          85485 non-null  float64
 12  finishedsquarefeet13          33 non-null     float64
 13  f

### Deal with missing values

As we saw in the exploratory data analysis notebook, many columns have missing values. We cannot simply remove those records or columns, as we would lose information that helps with predictions. Hence let's look at imputing these values.

We could impute with median/mean values, but in several columns most records are missing, so a median/mean estimated from the few observed values may not be meaningful.
Let's look at imputing with a negative value, provided the columns contain no negative values.

In [36]:
X_train_new.describe()

Unnamed: 0,basementsqft,bathroomcnt,bedroomcnt,calculatedbathnbr,finishedfloor1squarefeet,calculatedfinishedsquarefeet,finishedsquarefeet12,finishedsquarefeet13,finishedsquarefeet15,finishedsquarefeet50,finishedsquarefeet6,fireplacecnt,fullbathcnt,garagecarcnt,garagetotalsqft,latitude,longitude,lotsizesquarefeet,poolcnt,poolsizesum,roomcnt,threequarterbathnbr,unitcnt,yardbuildingsqft17,yardbuildingsqft26,numberofstories,structuretaxvaluedollarcnt,taxvaluedollarcnt,landtaxvaluedollarcnt,taxamount,transactiondate_month,transactiondate_year
count,43.0,90150.0,90150.0,88974.0,6850.0,89492.0,85485.0,33.0,3555.0,6850.0,419.0,9597.0,88974.0,29897.0,29897.0,90150.0,90150.0,80014.0,17876.0,966.0,90150.0,11996.0,58271.0,2645.0,95.0,20540.0,89772.0,90149.0,90149.0,90144.0,90150.0,90150.0
mean,713.581395,2.279545,3.031936,2.309175,1347.720146,1773.096869,1745.415079,1404.545455,2380.880731,1355.299416,2293.069212,1.187871,2.241172,1.812055,345.539619,34005470.0,-118198900.0,29120.64,1.0,519.688406,1.47858,1.008753,1.110244,310.204915,311.694737,1.440798,180103.2,457637.9,278287.9,5983.070888,5.849917,2016.0
std,437.434198,1.004133,1.156114,0.976137,652.039469,928.136339,909.947071,110.108211,1068.859229,673.376506,1341.830257,0.484253,0.963106,0.608865,267.038335,264979.2,360632.1,121790.9,0.0,155.116033,2.819802,0.100884,0.797389,216.738757,346.35485,0.544482,209076.9,554853.2,400504.0,6838.506814,2.812363,0.0
min,100.0,0.0,0.0,1.0,44.0,2.0,2.0,1056.0,560.0,44.0,257.0,1.0,1.0,0.0,0.0,33339300.0,-119447900.0,167.0,1.0,28.0,0.0,1.0,1.0,25.0,18.0,1.0,100.0,22.0,22.0,49.08,1.0,2016.0
25%,407.5,2.0,2.0,2.0,938.0,1184.0,1172.0,1392.0,1648.0,938.0,1109.0,1.0,2.0,2.0,0.0,33811660.0,-118411700.0,5704.0,1.0,420.0,0.0,1.0,1.0,180.0,100.0,1.0,81271.75,199056.0,82285.0,2873.26,4.0,2016.0
50%,616.0,2.0,3.0,2.0,1244.0,1540.0,1518.0,1440.0,2104.0,1248.0,1986.0,1.0,2.0,2.0,433.0,34021530.0,-118173400.0,7200.0,1.0,500.0,0.0,1.0,1.0,260.0,159.0,1.0,132057.0,342931.0,193000.0,4543.1,6.0,2016.0
75%,872.0,3.0,4.0,3.0,1614.0,2095.0,2056.0,1440.0,2863.0,1617.0,3405.5,1.0,3.0,2.0,484.0,34172780.0,-117921600.0,11681.75,1.0,600.0,0.0,1.0,1.0,384.0,361.0,2.0,210538.0,540589.0,345384.0,6900.165,8.0,2016.0
max,1555.0,20.0,16.0,20.0,7625.0,22741.0,20013.0,1584.0,22741.0,8352.0,7224.0,5.0,20.0,24.0,7339.0,34816010.0,-117554900.0,6971010.0,1.0,1750.0,18.0,4.0,143.0,2678.0,1366.0,4.0,9948100.0,27750000.0,24500000.0,321936.09,12.0,2016.0


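As the min row above shows, the only column with negative values is longitude, and it has no missing values (its count is 90150). No column that needs imputing contains negative values, so let's impute the missing data with -1.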
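
A sentinel such as -1 is only unambiguous in columns where it cannot occur as a real value. One way to check this programmatically, sketched here on a made-up frame, is to look for columns that contain both missing values and negative values.

```python
import numpy as np
import pandas as pd

# toy data: longitude is negative but complete; poolcnt has gaps but no negatives
toy = pd.DataFrame(
    {
        "longitude": [-118.2, -117.9, -119.4],
        "poolcnt": [1.0, np.nan, np.nan],
    }
)

numeric = toy.select_dtypes(include="number")
has_missing = numeric.isna().any()
has_negative = (numeric < 0).any()

# columns where a -1 fill value could be confused with real data
ambiguous = has_missing & has_negative
ambiguous_cols = ambiguous[ambiguous].index.tolist()

filled = toy.fillna(-1)
print(ambiguous_cols)
```

An empty list means no column needs imputing and also holds negative values, so the sentinel stays distinguishable.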

In [37]:
class DataProcessor:
    def __init__(self, cols_to_remove=None, datecol=None):
        self.cols_to_remove = cols_to_remove
        self.datecol = datecol

    def fit(self, X, y=None):
        """fit the process on the training data"""

        return self

    def transform(self, X, y=None):
        """transform the process on the train/test data """

        X_new = X.drop(columns=self.cols_to_remove, axis=1)

        if self.datecol:
            X_new[self.datecol + "_month"] = pd.to_datetime(
                X_new[self.datecol]
            ).dt.month
            X_new[self.datecol + "_year"] = pd.to_datetime(X_new[self.datecol]).dt.year
            X_new = X_new.drop(columns=self.datecol, axis=1)

        X_new = X_new.fillna(-1)

        return X_new

    def fit_transform(self, X, y=None):
        """fit and transform"""

        return self.fit(X).transform(X)

In [38]:
dp = DataProcessor(
    cols_to_remove=[
        "parcelid",
        "propertyzoningdesc",
        "rawcensustractandblock",
        "regionidneighborhood",
        "regionidzip",
        "censustractandblock",
    ],
    datecol="transactiondate",
)
X_train_new = dp.transform(X_train)

In [39]:
X_train_new.info()  # no null values

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90150 entries, 0 to 12731
Data columns (total 54 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   airconditioningtypeid         90150 non-null  float64
 1   architecturalstyletypeid      90150 non-null  float64
 2   basementsqft                  90150 non-null  float64
 3   bathroomcnt                   90150 non-null  float64
 4   bedroomcnt                    90150 non-null  float64
 5   buildingclasstypeid           90150 non-null  float64
 6   buildingqualitytypeid         90150 non-null  float64
 7   calculatedbathnbr             90150 non-null  float64
 8   decktypeid                    90150 non-null  float64
 9   finishedfloor1squarefeet      90150 non-null  float64
 10  calculatedfinishedsquarefeet  90150 non-null  float64
 11  finishedsquarefeet12          90150 non-null  float64
 12  finishedsquarefeet13          90150 non-null  float64
 13  f

### One-Hot Encoding

We have a lot of categorical variables that need to be encoded as numbers for modeling. One-hot encoding is preferred over label encoding here because label encoding imposes an arbitrary order on the categories, which regression-type models would treat as meaningful.

Before one-hot encoding, let's also classify the columns as numerical or categorical; the class below treats every column of dtype object as categorical.
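
The main complication with one-hot encoding is that train and test can contain different category levels, so their dummy columns differ. A minimal sketch of one way to align them, using `pd.get_dummies` and `DataFrame.reindex` on made-up data:

```python
import pandas as pd

# made-up data: "6111" appears only in train, "9999" only in test
toy_train = pd.DataFrame({"fips": ["6037", "6059", "6111"], "bedroomcnt": [3, 2, 4]})
toy_test = pd.DataFrame({"fips": ["6037", "9999"], "bedroomcnt": [1, 2]})

train_dummies = pd.get_dummies(toy_train, columns=["fips"])
test_dummies = pd.get_dummies(toy_test, columns=["fips"])

# keep exactly the training columns, in the training order:
# levels missing from test become 0, levels unseen in train are dropped
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```

Storing the training columns at fit time and reindexing at transform time keeps the feature matrix shape stable across data sets.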

In [40]:
class DataProcessor:
    def __init__(self, cols_to_remove=None, datecol=None):
        self.cols_to_remove = cols_to_remove
        self.datecol = datecol
        self.was_fit = False

    def fit(self, X, y=None):
        """fit the process on the training data"""

        self.was_fit = True

        # remove the columns
        X_new = X.drop(columns=self.cols_to_remove, axis=1)

        # get the categorical features
        self.categorical_features = X_new.dtypes[X_new.dtypes == "object"].index

        # dummy encoding
        dummy_df = pd.get_dummies(
            X_new, columns=self.categorical_features, dummy_na=True
        )
        self.allcols = dummy_df.columns

        return self

    def transform(self, X, y=None):
        """transform the process on the train/test data """

        if not self.was_fit:
            # Error is not a built-in name; raise a standard exception instead
            raise RuntimeError("Fit the DataProcessor first")

        # remove the columns
        X_new = X.drop(columns=self.cols_to_remove, axis=1)

        # dummy encoding, reusing the categorical features identified in fit
        X_new = pd.get_dummies(X_new, columns=self.categorical_features, dummy_na=True)

        # align with the training columns: dummy columns present in train but
        # missing here are added as 0, and columns not seen in train are dropped
        X_new = X_new.reindex(columns=self.allcols, fill_value=0)

        # Create month and year columns for the transactiondate and drop transactiondate
        if self.datecol:
            X_new[self.datecol + "_month"] = pd.to_datetime(
                X_new[self.datecol]
            ).dt.month
            X_new[self.datecol + "_year"] = pd.to_datetime(X_new[self.datecol]).dt.year
            X_new = X_new.drop(columns=self.datecol, axis=1)

        # fill NaN with -1
        X_new = X_new.fillna(-1)

        return X_new

    def fit_transform(self, X, y=None):
        """fit and transform"""

        return self.fit(X).transform(X)

In [41]:
dp = DataProcessor(
    cols_to_remove=[
        "parcelid",
        "propertyzoningdesc",
        "rawcensustractandblock",
        "regionidneighborhood",
        "regionidzip",
        "censustractandblock",
    ],
    datecol="transactiondate",
)
dp.fit(X_train)

<__main__.DataProcessor at 0x1f657f05048>

In [42]:
X_train_transformed = dp.transform(X_train)

In [43]:
X_train_transformed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90150 entries, 0 to 12731
Columns: 514 entries, basementsqft to transactiondate_year
dtypes: float64(30), int64(2), uint8(482)
memory usage: 64.1 MB


In [44]:
# Get the test data set and transform using dp object
X_test, y_test = get_data(data_string="test")

In [45]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 77613 entries, 0 to 77612
Data columns (total 59 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   parcelid                      77613 non-null  object        
 1   transactiondate               77613 non-null  datetime64[ns]
 2   airconditioningtypeid         25007 non-null  object        
 3   architecturalstyletypeid      207 non-null    object        
 4   basementsqft                  50 non-null     float64       
 5   bathroomcnt                   77579 non-null  float64       
 6   bedroomcnt                    77579 non-null  float64       
 7   buildingclasstypeid           15 non-null     object        
 8   buildingqualitytypeid         49809 non-null  object        
 9   calculatedbathnbr             76963 non-null  float64       
 10  decktypeid                    614 non-null    object        
 11  finishedfloor1squarefeet    

In [46]:
len(y_test)

77613

In [47]:
X_test_transformed = dp.transform(X_test)

In [48]:
X_test_transformed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 77613 entries, 0 to 77612
Columns: 514 entries, basementsqft to transactiondate_year
dtypes: float64(30), int64(28), uint8(456)
memory usage: 68.7 MB


In [49]:
assert all(X_train_transformed.isna().sum() == 0), "NaN values present"
# No NaN values present after the transformation

In [50]:
assert all(X_test_transformed.isna().sum() == 0), "NaN values present"
# No NaN values present after the transformation

In [51]:
assert set(X_train_transformed.columns) == set(
    X_test_transformed.columns
), "train and test columns don't match"
# columns are same in train and test
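
Note that comparing sets ignores column order, which many models depend on. A stricter check, sketched here on made-up frames, compares the column lists directly:

```python
import pandas as pd

# made-up frames with the same columns in a different order
toy_train = pd.DataFrame({"a": [1], "b": [2]})
toy_test = pd.DataFrame({"b": [2], "a": [1]})

same_names = set(toy_train.columns) == set(toy_test.columns)
same_order = list(toy_train.columns) == list(toy_test.columns)

# reorder test to match train, then compare again
toy_test = toy_test[toy_train.columns]
same_order_after = list(toy_train.columns) == list(toy_test.columns)
print(same_names, same_order, same_order_after)
```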