## 2 Pipeline Building

### 2.1 Row Removal
We remove a few rows (e.g. outliers) before the actual pipeline because they would degrade training. We also need to remove rows with no price from the holdout data set.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from math import sqrt

In [2]:
import warnings
warnings.filterwarnings('ignore') # seaborn shows a lot of ugly warnings, let's suppress these for now

We remove rows with missing values in required columns, as well as some outliers that seem to distort the model. In the cell below this amounts to dropping the rows that have no ```PRICE```.

In [3]:
df = pd.read_csv('data/dc_housing/DC_Properties_training.csv', index_col=0, low_memory=False)
df = df.dropna(subset=["PRICE"])

In [4]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df.drop('PRICE', axis=1), 
                                                    df.loc[:,['PRICE']], 
                                                    test_size=0.2, 
                                                    random_state=10)

### 2.2 Evaluation Function / Libraries

In [5]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestRegressor
from day_18_challenge_pipeline_classes import compare_predictions
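
`compare_predictions` comes from a local module whose source is not shown in this notebook. Judging only from the output it prints in section 2.5, it might look roughly like the following sketch. The name `compare_predictions_sketch`, its parameter names and its return value are assumptions, not the module's actual code.

```python
from math import sqrt

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def compare_predictions_sketch(x, y, model, lazy_value):
    """Print RMSE, MAE and R^2 for a constant baseline and for `model`.

    Hypothetical stand-in for `compare_predictions`; `lazy_value` is the
    constant that the baseline predicts for every row (e.g. the mean price).
    """
    y_true = np.asarray(y, dtype=float).ravel()
    y_lazy = np.full_like(y_true, lazy_value)
    y_pred = np.asarray(model.predict(x), dtype=float).ravel()

    def scores(pred):
        # RMSE is the square root of the mean squared error
        return (sqrt(mean_squared_error(y_true, pred)),
                mean_absolute_error(y_true, pred),
                r2_score(y_true, pred))

    baseline, fitted = scores(y_lazy), scores(y_pred)
    for name, base, fit in zip(("RMSE", "MAE", "R^2"), baseline, fitted):
        print(f"{name}: lazy predictor {base:.3f}, model {fit:.3f}")
    return y_pred
```

A baseline that always predicts the mean of `y` has an R^2 of 0 by definition, which matches the "R^2 Lazy Predictor 0.0" lines in the outputs.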

### 2.3 Pipeline Preparation

In [6]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import StandardScaler
from category_encoders.ordinal import OrdinalEncoder
from category_encoders.one_hot import OneHotEncoder
from sklearn.model_selection import train_test_split

In [7]:
df.columns

Index(['BATHRM', 'HF_BATHRM', 'HEAT', 'AC', 'NUM_UNITS', 'ROOMS', 'BEDRM',
       'AYB', 'YR_RMDL', 'EYB', 'STORIES', 'SALEDATE', 'PRICE', 'QUALIFIED',
       'SALE_NUM', 'GBA', 'BLDG_NUM', 'STYLE', 'STRUCT', 'GRADE', 'CNDTN',
       'EXTWALL', 'ROOF', 'INTWALL', 'KITCHENS', 'FIREPLACES', 'USECODE',
       'LANDAREA', 'GIS_LAST_MOD_DTTM', 'SOURCE', 'CMPLX_NUM', 'LIVING_GBA',
       'FULLADDRESS', 'CITY', 'STATE', 'ZIPCODE', 'NATIONALGRID', 'LATITUDE',
       'LONGITUDE', 'ASSESSMENT_NBHD', 'ASSESSMENT_SUBNBHD', 'CENSUS_TRACT',
       'CENSUS_BLOCK', 'WARD', 'SQUARE', 'X', 'Y', 'QUADRANT'],
      dtype='object')

In [8]:
cols_num = ['BATHRM','HF_BATHRM','ROOMS','BEDRM','FIREPLACES','YEAR','AYB','EYB','GBA','LANDAREA']
cols_ord = ['GRADE','HEAT','ZIPCODE','ASSESSMENT_NBHD', 'CENSUS_TRACT']
cols_cat = ['AC','SOURCE','QUALIFIED']

cols_all = cols_num + cols_ord + cols_cat

In [9]:
x_train.loc[:,cols_all + ['LIVING_GBA', 'SALEDATE']].isnull().any()

BATHRM             False
HF_BATHRM          False
ROOMS              False
BEDRM              False
FIREPLACES         False
YEAR                True
AYB                 True
EYB                False
GBA                 True
LANDAREA           False
GRADE               True
HEAT               False
ZIPCODE            False
ASSESSMENT_NBHD    False
CENSUS_TRACT       False
AC                 False
SOURCE             False
QUALIFIED          False
LIVING_GBA          True
SALEDATE            True
dtype: bool

All is as expected: we do have null values for ```GBA``` and ```LIVING_GBA``` (which we will merge into one column), ```AYB``` (which we will replace with the mean of the training data), ```GRADE```, which is null for all condominiums (we will replace these with a standard value), and ```YEAR```, which we will derive from ```SALEDATE``` (replacing any missing values with the mean).

Let's start with defining all the classes we'll need in the pipeline. We will test these right after the definition in the same order as we are using them in the pipeline (see section 2.4 to see the definition and order of the pipeline).

#### 2.3.1 Merge Columns

In [10]:
class MergeColumns(TransformerMixin):
    def __init__(self, column_one, column_two):
        self.column_one = column_one
        self.column_two = column_two
    
    def fit(self, x, y= None):
        return self
    
    def transform(self, x):
        x[self.column_one] = x[self.column_one].fillna(0) + x[self.column_two].fillna(0)
        x = x.drop(self.column_two, axis=1)
        return x

In [11]:
x_train.loc[:,['SOURCE','GBA','LIVING_GBA']].sample(3, random_state=1)

| index | SOURCE | GBA | LIVING_GBA |
|---|---|---|---|
| 2045 | Residential | 3879.0 | NaN |
| 33979 | Residential | 3466.0 | NaN |
| 112120 | Condominium | NaN | 409.0 |


In [12]:
merge_columns = MergeColumns('GBA', 'LIVING_GBA')
df_merge_columns = merge_columns.fit_transform(x_train)
x_train.loc[:,['SOURCE','GBA']].sample(3, random_state=1)

| index | SOURCE | GBA |
|---|---|---|
| 2045 | Residential | 3879.0 |
| 33979 | Residential | 3466.0 |
| 112120 | Condominium | 409.0 |
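
Note that `transform` assigns into `x` directly, so the `x_train` passed in is itself modified; that is why the cell above can display the merged `GBA` column from `x_train` rather than from `df_merge_columns`. If that side effect is not wanted, a variant that works on a copy could look like the sketch below. The class name `MergeColumnsCopy` and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MergeColumnsCopy(BaseEstimator, TransformerMixin):
    """Add column_two into column_one and drop column_two, leaving the input untouched."""

    def __init__(self, column_one, column_two):
        self.column_one = column_one
        self.column_two = column_two

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        merged = x[self.column_one].fillna(0) + x[self.column_two].fillna(0)
        # assign() and drop() both return new DataFrames, so `x` is not modified
        return x.assign(**{self.column_one: merged}).drop(columns=self.column_two)


toy = pd.DataFrame({"GBA": [3879.0, np.nan], "LIVING_GBA": [np.nan, 409.0]})
merged_toy = MergeColumnsCopy("GBA", "LIVING_GBA").fit_transform(toy)
```

After this runs, `merged_toy` holds the merged column while `toy` still has both original columns.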


#### 2.3.2 Impute Values

In [13]:
class ImputeValue(TransformerMixin):
    def __init__(self, column, value):
        self.column = column
        self.value = value
    
    def fit(self, x, y= None):
        return self
    
    def transform(self, x):
        # assign the result back; a chained inplace fillna may not update x in newer pandas
        x[self.column] = x[self.column].fillna(self.value)
        
        return x

In [14]:
x_train.loc[x_train['AYB'].isnull()].loc[:,['AYB']].head(3)

| index | AYB |
|---|---|
| 86611 | NaN |
| 57303 | NaN |
| 61724 | NaN |


In [15]:
impute_ayb = ImputeValue('AYB', 1940)
impute_ayb.fit_transform(x_train)
x_train.loc[x_train['AYB'].isnull()].loc[:,['AYB']].head(3)

| index | AYB |
|---|---|
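
The constant 1940 is hard-coded here. To take the replacement value from the training data instead, the mean could be learned in `fit` and reused in `transform`. The following is only a sketch of that idea; `ImputeMean` and its `mean_` attribute are hypothetical names that do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ImputeMean(BaseEstimator, TransformerMixin):
    """Fill missing values in one column with the mean seen during fit."""

    def __init__(self, column):
        self.column = column

    def fit(self, x, y=None):
        # Series.mean() skips NaN by default
        self.mean_ = x[self.column].mean()
        return self

    def transform(self, x):
        x = x.copy()  # leave the caller's frame untouched
        x[self.column] = x[self.column].fillna(self.mean_)
        return x


train = pd.DataFrame({"AYB": [1920.0, np.nan, 1960.0]})
new = pd.DataFrame({"AYB": [np.nan, 2000.0]})
imputer = ImputeMean("AYB").fit(train)
filled = imputer.transform(new)
```

Because the mean is computed only from the data passed to `fit`, test or holdout rows do not influence the value used to fill them.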


#### 2.3.3 Convert Zero Values for AC into 'N'

In [16]:
class ConvertZeroToN(TransformerMixin):
    def __init__(self, column):
        self.column = column
    
    def fit(self, x, y= None):
        return self
    
    def transform(self, x):
        # a single .loc assignment avoids pandas' chained-assignment pitfalls
        x.loc[x[self.column] == '0', self.column] = 'N'
        
        return x

In [17]:
x_train.loc[x_train['AC'] == '0'].loc[:,['AC']].head(3)

| index | AC |
|---|---|
| 138953 | 0 |
| 8465 | 0 |
| 104206 | 0 |


In [18]:
convert_zero_to_n = ConvertZeroToN('AC')
convert_zero_to_n.fit_transform(x_train)
x_train.loc[x_train['AC'] == '0'].loc[:,['AC']].head(3)

| index | AC |
|---|---|


#### 2.3.4 Convert Date from String to Year

In [19]:
class ConvertStringDateToYear(TransformerMixin):
    def __init__(self, column):
        self.column = column
    
    def fit(self, x, y= None):
        return self
    
    def transform(self, x):
        x[self.column] = pd.to_datetime(x[self.column], format='%Y-%m-%d', errors='coerce')
        x['YEAR'] = x[self.column].dt.year
        
        return x       

In [20]:
convert_string_date_to_year = ConvertStringDateToYear('SALEDATE')
convert_string_date_to_year.fit_transform(x_train)
x_train.loc[:,['YEAR']].head(3)

| index | YEAR |
|---|---|
| 144430 | 2000.0 |
| 62281 | 2010.0 |
| 2366 | 2012.0 |


In [21]:
x_train.loc[x_train['YEAR'].isnull()].loc[:,['YEAR']].head(3)

| index | YEAR |
|---|---|
| 94768 | NaN |
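
Because of `errors='coerce'`, any `SALEDATE` that is missing or does not match the format becomes `NaT`, and its `YEAR` becomes `NaN`. This also explains why `YEAR` is displayed as a float. A toy illustration with invented input values:

```python
import pandas as pd

dates = pd.Series(["2000-05-01", "not a date", None])

# unparseable or missing entries turn into NaT instead of raising an error
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# .dt.year yields NaN for NaT, which forces a float dtype
years = parsed.dt.year
```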


In [22]:
impute_year = ImputeValue('YEAR', 2004)
impute_year.fit_transform(x_train)
x_train.loc[x_train['YEAR'].isnull()].loc[:,['YEAR']].head(3)

| index | YEAR |
|---|---|


#### 2.3.5 Select Numeric Columns

In [23]:
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns
    
    def fit(self, x, y = None):
        return self
    
    def transform(self, x):
        return x.loc[:, self.columns]

Test the column selection for the numeric columns (the scaling step in the cell after it is currently commented out):

In [24]:
col_sel_num = ColumnSelector(cols_num)
x_train_num = col_sel_num.fit_transform(x_train)
x_train_num.head(3)

| index | BATHRM | HF_BATHRM | ROOMS | BEDRM | FIREPLACES | YEAR | AYB | EYB | GBA | LANDAREA |
|---|---|---|---|---|---|---|---|---|---|---|
| 144430 | 2 | 0 | 4 | 2 | 0 | 2000.0 | 2000.0 | 2000 | 822.0 | 1095 |
| 62281 | 2 | 1 | 6 | 3 | 0 | 2010.0 | 1925.0 | 1964 | 1558.0 | 1643 |
| 2366 | 1 | 0 | 4 | 2 | 1 | 2012.0 | 1953.0 | 1962 | 702.0 | 609 |


In [25]:
#scaler = StandardScaler()
#x_train_num_np = scaler.fit_transform(x_train_num)

#x_train_num = pd.DataFrame(x_train_num_np, index=x_train_num.index, columns=x_train_num.columns)

#x_train_num.head(3)

#### 2.3.6 Encode Ordinal Columns

In [26]:
col_sel_ord = ColumnSelector(cols_ord)
x_train_ord = col_sel_ord.fit_transform(x_train)
x_train_ord.head(3)

| index | GRADE | HEAT | ZIPCODE | ASSESSMENT_NBHD | CENSUS_TRACT |
|---|---|---|---|---|---|
| 144430 | NaN | Ht Pump | 20012.0 | Brightwood | 1702.0 |
| 62281 | Average | Forced Air | 20002.0 | Eckington | 8701.0 |
| 2366 | Average | Warm Cool | 20037.0 | Foggy Bottom | 5600.0 |


In [27]:
mapping_ord = [{'col': 'GRADE','mapping': [(False, 0),
                                       ('Low Quality', 1),
                                       ('Fair Quality', 2),
                                       ('Average', 3),
                                       ('Above Average', 4),
                                       ('Good Quality', 5),
                                       ('Very Good', 6),
                                       ('Excellent', 7),
                                       ('Superior', 8),
                                       ('Exceptional-A', 9),
                                       ('Exceptional-B', 10),
                                       ('No Data', 11),
                                       ('Exceptional-D', 12),
                                       ('Exceptional-C', 13)]}]
#ord_encoder = OrdinalEncoder(cols=['GRADE'], mapping=mapping_ord) # this would probably help for non RF estimators. Let's check.
#x_train_ord = ord_encoder.fit_transform(x_train_ord)

ord_encoder = OrdinalEncoder()
x_train_ord = ord_encoder.fit_transform(x_train_ord)

x_train_ord.head(3)

| index | GRADE | HEAT | ZIPCODE | ASSESSMENT_NBHD | CENSUS_TRACT |
|---|---|---|---|---|---|
| 144430 | 1 | 1 | 20012.0 | 1 | 1702.0 |
| 62281 | 2 | 2 | 20002.0 | 2 | 8701.0 |
| 2366 | 2 | 3 | 20037.0 | 3 | 5600.0 |
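
The output above suggests that, with default settings, the encoder simply numbers the categories in order of first appearance, so the integers carry no quality ranking. The ranking intended by `mapping_ord` can be illustrated without the encoder, using a plain dictionary and `Series.map`. This is a pandas-only sketch on made-up rows, not the call the pipeline uses, and it reproduces only part of the mapping.

```python
import numpy as np
import pandas as pd

# Subset of the GRADE ranking defined in mapping_ord; 0 stands for a missing grade
grade_rank = {"Low Quality": 1, "Fair Quality": 2, "Average": 3,
              "Above Average": 4, "Good Quality": 5, "Very Good": 6}

grades = pd.Series(["Average", np.nan, "Very Good", "Low Quality"])

# values without an entry in the dictionary (here: NaN) map to NaN, then to 0
encoded = grades.map(grade_rank).fillna(0).astype(int)
```

An ordered encoding like this mainly matters for estimators that interpret the numeric distance between codes; tree-based models such as the random forest used here only split on thresholds.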


#### 2.3.7 One-Hot-Encode Categorical Columns

In [28]:
col_sel_cat = ColumnSelector(cols_cat)
x_train_cat = col_sel_cat.fit_transform(x_train)
x_train_cat.head(3)

| index | AC | SOURCE | QUALIFIED |
|---|---|---|---|
| 144430 | Y | Condominium | Q |
| 62281 | Y | Residential | Q |
| 2366 | Y | Residential | Q |


In [29]:
one_hot_encoder = OneHotEncoder(drop_invariant=True)
x_train_cat = one_hot_encoder.fit_transform(x_train_cat)
x_train_cat.head(3)

| index | AC_1 | AC_2 | SOURCE_1 | SOURCE_2 | QUALIFIED_1 | QUALIFIED_2 |
|---|---|---|---|---|---|---|
| 144430 | 1 | 0 | 1 | 0 | 1 | 0 |
| 62281 | 1 | 0 | 0 | 1 | 1 | 0 |
| 2366 | 1 | 0 | 0 | 1 | 1 | 0 |


#### 2.3.8 Union Columns and Check for Missing Values

In [30]:
x_train = pd.concat([x_train_num, x_train_ord, x_train_cat], axis=1, sort=False)

In [31]:
x_train.isnull().any()

BATHRM             False
HF_BATHRM          False
ROOMS              False
BEDRM              False
FIREPLACES         False
YEAR               False
AYB                False
EYB                False
GBA                False
LANDAREA           False
GRADE              False
HEAT               False
ZIPCODE            False
ASSESSMENT_NBHD    False
CENSUS_TRACT       False
AC_1               False
AC_2               False
SOURCE_1           False
SOURCE_2           False
QUALIFIED_1        False
QUALIFIED_2        False
dtype: bool

In [32]:
x_train.head()

| index | BATHRM | HF_BATHRM | ROOMS | BEDRM | FIREPLACES | YEAR | AYB | EYB | GBA | LANDAREA | ... | HEAT | ZIPCODE | ASSESSMENT_NBHD | CENSUS_TRACT | AC_1 | AC_2 | SOURCE_1 | SOURCE_2 | QUALIFIED_1 | QUALIFIED_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 144430 | 2 | 0 | 4 | 2 | 0 | 2000.0 | 2000.0 | 2000 | 822.0 | 1095 | ... | 1 | 20012.0 | 1 | 1702.0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 62281 | 2 | 1 | 6 | 3 | 0 | 2010.0 | 1925.0 | 1964 | 1558.0 | 1643 | ... | 2 | 20002.0 | 2 | 8701.0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 2366 | 1 | 0 | 4 | 2 | 1 | 2012.0 | 1953.0 | 1962 | 702.0 | 609 | ... | 3 | 20037.0 | 3 | 5600.0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 53410 | 2 | 0 | 6 | 3 | 0 | 2005.0 | 1914.0 | 1957 | 1545.0 | 1873 | ... | 4 | 20001.0 | 2 | 3400.0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 40380 | 4 | 0 | 7 | 4 | 1 | 2013.0 | 1951.0 | 1975 | 3193.0 | 11500 | ... | 2 | 20015.0 | 4 | 1500.0 | 1 | 0 | 0 | 1 | 1 | 0 |


The data cleaning steps all seem to work and remove our null values. In a later step we should also write unit tests to verify them.
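
Such tests could start out as plain assertions on small hand-made frames. The sketch below re-declares a minimal `ImputeValue` with the same behaviour so that it runs on its own; the test function names and the toy data are assumptions.

```python
import numpy as np
import pandas as pd


class ImputeValue:
    """Minimal re-declaration of the transformer from section 2.3.2, for the tests only."""

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        x[self.column] = x[self.column].fillna(self.value)
        return x


def test_impute_value_fills_missing():
    frame = pd.DataFrame({"AYB": [1900.0, np.nan]})
    result = ImputeValue("AYB", 1940).fit(frame).transform(frame)
    assert result["AYB"].tolist() == [1900.0, 1940.0]


def test_impute_value_leaves_other_columns_alone():
    frame = pd.DataFrame({"AYB": [np.nan], "EYB": [np.nan]})
    result = ImputeValue("AYB", 1940).fit(frame).transform(frame)
    assert result["EYB"].isnull().all()


# run directly here; a test runner such as pytest would collect the test_ functions itself
test_impute_value_fills_missing()
test_impute_value_leaves_other_columns_alone()
```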

### 2.4 Load Data and Define Pipeline

We need to reset ```x_train``` and ```y_train``` before using them in the pipeline, because the test cells above overwrote ```x_train``` with the transformed data. Let's set up the transformation pipeline using the classes we defined above.

In [33]:
x_train, x_test, y_train, y_test = train_test_split(df.drop('PRICE', axis=1), 
                                                    df.loc[:,['PRICE']],
                                                    test_size=0.2,
                                                    random_state=5)

In [34]:
processing_pipeline = make_pipeline(
    
    MergeColumns('GBA', 'LIVING_GBA'),
    ImputeValue('AYB', 1940),
    ConvertZeroToN('AC'),
    ConvertStringDateToYear('SALEDATE'),
    ImputeValue('YEAR', 2004),
    make_union(
        make_pipeline(ColumnSelector(cols_num),
                      #StandardScaler()
        ),
        make_pipeline(ColumnSelector(cols_ord),
                      OrdinalEncoder()
        ),
        make_pipeline(ColumnSelector(cols_cat),
                      OneHotEncoder()
        )
    )
)
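
`make_union` applies each branch to the same input and stacks the results side by side, so the number of output columns is the sum over the branches. A toy sketch with an invented three-column frame and a re-declared `ColumnSelector`:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline, make_union


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Same selector as in section 2.3.5, repeated so the sketch is self-contained."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return x.loc[:, self.columns]


toy = pd.DataFrame({"ROOMS": [4, 6], "BEDRM": [2, 3], "EYB": [2000, 1964]})
union = make_union(
    make_pipeline(ColumnSelector(["ROOMS", "BEDRM"])),  # branch with 2 columns
    make_pipeline(ColumnSelector(["EYB"])),             # branch with 1 column
)
stacked = union.fit_transform(toy)  # 2 rows, 2 + 1 = 3 columns
```

Deriving custom transformers from `BaseEstimator` as well as `TransformerMixin` also gives them `get_params` and `set_params`, which a later grid search relies on.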

In [35]:
pipeline = make_pipeline(
    processing_pipeline,
    RandomForestRegressor(random_state=1, n_jobs=-1, n_estimators=100)
)

### 2.5 Fit Pipeline and Evaluate Accuracy

In [36]:
pipeline.fit(x_train, y_train['PRICE'].values)  # 1-D array of prices (Series.ravel is deprecated in newer pandas)

Pipeline(memory=None,
     steps=[('pipeline', Pipeline(memory=None,
     steps=[('mergecolumns', <__main__.MergeColumns object at 0x000001A6CD638208>), ('imputevalue-1', <__main__.ImputeValue object at 0x000001A6CD638DD8>), ('convertzeroton', <__main__.ConvertZeroToN object at 0x000001A6CDE285F8>), ('convertstringdatetoyear'...stimators=100, n_jobs=-1,
           oob_score=False, random_state=1, verbose=0, warm_start=False))])

In [37]:
pred_train = compare_predictions(x_train, y_train, pipeline, y_train['PRICE'].mean())

RMSE Lazy Predictor 7129763.675459661
MAE Lazy Predictor 964376.7917577944
R^2 Lazy Predictor 0.0

RMSE 873994.5174397157
MAE 179626.10685042062
R^2 0.984973177831752

RMSE Improvement: 6255769.158019945
MAE Inprovement: 784750.6849073739
R^2 Improvement: 0.984973177831752


In [38]:
pred_test = compare_predictions(x_test, y_test, pipeline, y_test['PRICE'].mean())

RMSE Lazy Predictor 6634982.491976388
MAE Lazy Predictor 865667.6296156339
R^2 Lazy Predictor 0.0

RMSE 875903.293613276
MAE 92067.56251801067
R^2 0.9825725937004337

RMSE Improvement: 5759079.198363112
MAE Inprovement: 773600.0670976231
R^2 Improvement: 0.9825725937004337


### 2.6 Fit with Complete Data and Evaluate with Holdout Data (Before Grid Search)

In [39]:
x_train = df.drop('PRICE', axis=1)
y_train = df.loc[:,['PRICE']]

In [40]:
pipeline.fit(x_train, y_train['PRICE'].values)  # 1-D array of prices (Series.ravel is deprecated in newer pandas)

Pipeline(memory=None,
     steps=[('pipeline', Pipeline(memory=None,
     steps=[('mergecolumns', <__main__.MergeColumns object at 0x000001A6CD638208>), ('imputevalue-1', <__main__.ImputeValue object at 0x000001A6CD638DD8>), ('convertzeroton', <__main__.ConvertZeroToN object at 0x000001A6CDE285F8>), ('convertstringdatetoyear'...stimators=100, n_jobs=-1,
           oob_score=False, random_state=1, verbose=0, warm_start=False))])

In [41]:
df_test = pd.read_csv('data/dc_housing/holdout_test_data.csv', index_col=0, low_memory=False)
df_test = df_test.dropna(subset=['PRICE'])  # keep only holdout rows with a known price

x_test = df_test.drop('PRICE', axis=1)
y_test = df_test.loc[:,['PRICE']]

In [42]:
pred_test = compare_predictions(x_test, y_test, pipeline, y_test['PRICE'].mean())

RMSE Lazy Predictor 9550662.233815953
MAE Lazy Predictor 1661050.334916615
R^2 Lazy Predictor 0.0

RMSE 3267544.72995681
MAE 229341.42856115213
R^2 0.882948735302248

RMSE Improvement: 6283117.503859144
MAE Inprovement: 1431708.9063554627
R^2 Improvement: 0.882948735302248
