# Contents
1. [Introduction](#intro)   
2. [Quantitative features](#quants)  
3. [Ordinal features](#ordinal)  
4. [Categorical features](#cats)  
5. [Text features](#text)  
6. [The full preprocessing and modeling pipeline](#pipeline)  
7. [Model performance](#performance)   

# 1. Introduction
<a id='intro'></a>

In this notebook, we build a pipeline to preprocess features and train an XGBoost model to predict property price. We will combine sub-pipelines for different kinds of features with different preprocessing requirements.

In [136]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import mean_squared_error
from sklearn.decomposition import TruncatedSVD
from xgboost import XGBRegressor

In [137]:
X_train = pd.read_csv('data/listings_train.csv', index_col='id', low_memory=False)

Because of the issues we saw in the price data in the `listings.csv` file, we'll instead use price data from the `calendar.csv` file as our target. For now, we'll pick a fixed weekend date that isn't too far beyond when the calendars were scraped.

In [138]:
calendar_train = pd.read_csv('data/calendar_train_price.csv')
y_train = calendar_train[calendar_train.date == '2019-08-03'].set_index('listing_id').loc[X_train.index].price.values
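The `.loc[X_train.index]` step is what keeps the target aligned with the feature rows: it reorders the prices so they follow the row order of `X_train`. Here is a minimal sketch of that alignment, using made-up listing IDs and prices rather than the real data (the names `listings`, `calendar`, and `target` are hypothetical):

```python
import pandas as pd

# Toy stand-ins for the listings and calendar tables (all values are made up)
listings = pd.DataFrame({'accommodates': [2, 4, 1]},
                        index=pd.Index([30, 10, 20], name='id'))
calendar = pd.DataFrame({
    'listing_id': [10, 10, 20, 20, 30, 30],
    'date': ['2019-08-02', '2019-08-03'] * 3,
    'price': [100.0, 120.0, 50.0, 55.0, 200.0, 210.0],
})

# Keep a single date, index by listing, then reorder to match the feature rows
one_day = calendar[calendar.date == '2019-08-03'].set_index('listing_id')
target = one_day.loc[listings.index].price.values
print(target)  # prices in the order 30, 10, 20
```

In recent pandas versions, `.loc` raises a `KeyError` if a listing has no calendar row for the chosen date, so a mismatch between the two files should surface here rather than silently misaligning the target.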

# 2. Quantitative features
<a id='quants'></a>

Convert percent strings to floats:

In [139]:
def pct_to_float(pct_column):
    """Strip the percent sign from each value and convert it to a float"""
    # str(nan) is 'nan', which float() parses back to NaN, so missing values survive
    return [float(str(pct).replace('%', '')) for pct in pct_column]

In [140]:
X_train.host_response_rate = pct_to_float(X_train.host_response_rate)
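As a quick check of what the conversion does, here is a small sketch on a made-up sample that includes a missing value. The helper is repeated so the cell runs on its own, and the names `sample` and `converted` are hypothetical:

```python
import numpy as np
import pandas as pd

def pct_to_float(pct_column):
    """Same helper as above, repeated so this cell is self-contained"""
    return [float(str(pct).replace('%', '')) for pct in pct_column]

# Made-up response rates, including a missing value
sample = pd.Series(['95%', '100%', np.nan])
converted = pct_to_float(sample)
print(converted)  # [95.0, 100.0, nan]
```

Missing values come through as NaN, which leaves them for the imputer in the pipeline to handle.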

Create `days_as_host` feature from `host_since` column, which is the date the host joined the site:

In [141]:
X_train['days_as_host'] = (pd.to_datetime('2019-07-14') - pd.to_datetime(X_train.host_since)).dt.days
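The date arithmetic can be checked on a few made-up join dates. This is only an illustration; `host_since` and `days` below are toy variables, not columns from the real data:

```python
import pandas as pd

# Made-up join dates, including a missing one
host_since = pd.Series(['2019-07-04', '2018-07-14', None])
reference_date = pd.to_datetime('2019-07-14')

# Subtracting datetimes gives timedeltas; .dt.days extracts whole days
days = (reference_date - pd.to_datetime(host_since)).dt.days
print(days.tolist())  # [10.0, 365.0, nan]
```

A missing `host_since` becomes NaN rather than raising an error, so it can be imputed along with the other quantitative features.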

In [142]:
quant_features = ['days_as_host', 'host_response_rate', 'host_listings_count', 'accommodates', 'bathrooms', 'bedrooms',
                  'beds', 'guests_included', 'minimum_nights', 'number_of_reviews', 'review_scores_rating',
                  'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin',
                  'review_scores_communication', 'review_scores_location', 'review_scores_value']

Create a transformer to select the quantitative data that can be used in a pipeline:

In [143]:
get_quant_features = FunctionTransformer(lambda x: x[quant_features], validate=False)
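To see what such a selector does, here is a minimal sketch on a toy dataframe. The column list `toy_quant` and the dataframe `toy` are hypothetical stand-ins for `quant_features` and `X_train`:

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Toy dataframe with two numeric columns and one text column
toy = pd.DataFrame({'beds': [1, 2], 'bedrooms': [1, 1], 'name': ['a', 'b']})
toy_quant = ['beds', 'bedrooms']  # hypothetical column list

# validate=False passes the dataframe to the function unchanged
select = FunctionTransformer(lambda x: x[toy_quant], validate=False)
selected = select.fit_transform(toy)
print(list(selected.columns))
```

Wrapping the selection in a transformer means the same column choice is applied automatically to whatever dataframe is later passed to `fit` or `predict`.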

# 3. Ordinal features
<a id='ordinal'></a>

Recode levels of ordinal features to preserve their order (treating binary features as ordinal features with only two levels):

In [144]:
ordinal_features = ['host_is_superhost', 'instant_bookable', 'host_response_time', 'cancellation_policy', 'room_type']

In [145]:
ordinal_mapping = {'host_is_superhost': {'t': 1, 'f': 0},
                   'instant_bookable': {'t': 1, 'f': 0},
                   'host_response_time':
                   {'within an hour': 4, 'within a few hours': 3, 'within a day': 2, 'a few days or more': 1},
                   'cancellation_policy':
                   {'super_strict_60': 1, 'super_strict_30': 2, 'strict': 3, 'strict_14_with_grace_period': 3,
                    'moderate': 4, 'flexible': 5},
                   'room_type': {'Entire home/apt': 3, 'Private room': 2, 'Shared room': 1}}

Create a transformer to recode ordinal variables in a pipeline:

In [146]:
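def _recode_ordinals(df):
    """Select the ordinal columns and map their levels to integers"""
    # Selecting first limits the replacement to the columns named in the mapping
    return df[ordinal_features].replace(ordinal_mapping)
recode_ordinals = FunctionTransformer(_recode_ordinals, validate=False)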
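The recoding relies on `DataFrame.replace` accepting a nested dictionary of the form `{column: {old: new}}`. A small sketch on made-up rows, with a hypothetical two-column mapping `toy_mapping`, shows the effect:

```python
import numpy as np
import pandas as pd

# Hypothetical two-column subset of the mapping used above
toy_mapping = {'host_is_superhost': {'t': 1, 'f': 0},
               'room_type': {'Entire home/apt': 3, 'Private room': 2, 'Shared room': 1}}

# Made-up rows, including a missing superhost flag
toy = pd.DataFrame({'host_is_superhost': ['t', 'f', np.nan],
                    'room_type': ['Private room', 'Entire home/apt', 'Shared room']})

# Each inner dict is applied only to the column it is keyed by
recoded = toy.replace(toy_mapping)
print(recoded)
```

Levels that are missing stay missing after recoding, which is why the ordinal sub-pipeline follows this step with an imputer.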

# 4. Categorical features
<a id='cats'></a>

We will simply use one-hot encoding for categorical features that have no obvious ordering. We can do this by converting these features to dictionaries and then using `DictVectorizer()` later in the pipeline.

In [147]:
cat_features = ['neighbourhood_group_cleansed', 'property_type']
cat_to_dict = FunctionTransformer(lambda x: x[cat_features].to_dict('records'), validate=False)
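To illustrate how the dictionary route produces a one-hot encoding, here is a minimal sketch on a made-up `property_type` column (the variables `toy`, `records`, `vec`, and `encoded` are hypothetical):

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Made-up categorical column
toy = pd.DataFrame({'property_type': ['House', 'Apartment', 'House']})

# One dict per row, e.g. {'property_type': 'House'}
records = toy.to_dict('records')

# String values become one indicator column per "feature=value" pair
vec = DictVectorizer()
encoded = vec.fit_transform(records)
print(vec.feature_names_)
print(encoded.toarray())
```

`DictVectorizer` returns a sparse matrix by default, which keeps the encoding compact when a feature has many levels.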

# 5. Text features
<a id='text'></a>

We will do some simple preprocessing of the text data, first combining all of the text columns from the dataframe into a single string per listing and then later using `TfidfVectorizer()` in the pipeline.

In [148]:
def combine_text(df):
    """Combine text columns into a single text feature"""    
    text_columns = ['name', 'summary', 'space', 'notes', 'amenities',
                    'description', 'neighborhood_overview']
    text_df = df[text_columns].replace(np.nan, '')
    text_feature = text_df.apply(lambda x: ' '.join(x), axis=1)
    return text_feature

In [149]:
get_text_features = FunctionTransformer(combine_text, validate=False)
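The combining step can be sketched on two made-up listings with a reduced, hypothetical set of text columns:

```python
import numpy as np
import pandas as pd

# Made-up listings; the second one has no summary
toy = pd.DataFrame({'name': ['Sunny loft', 'Quiet room'],
                    'summary': ['Near the park', np.nan]})

# Replace missing text with empty strings, then join the columns row by row
text = toy.replace(np.nan, '').apply(lambda x: ' '.join(x), axis=1)
print(text.tolist())
```

Each listing ends up with one string, which is the input format `TfidfVectorizer` expects. A missing column leaves only an extra space, which the vectorizer's default tokenization ignores.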

# 6. The full preprocessing and modeling pipeline
<a id='pipeline'></a>

Below is the full preprocessing and modeling pipeline, which uses `FeatureUnion()` to combine sub-pipelines for the different data types:

In [150]:
pipe = Pipeline([
        ('union', FeatureUnion(
            transformer_list = [
                ('quant_features', Pipeline([
                    ('selector', get_quant_features),
                    ('quant_imp', SimpleImputer())
                ])),
                ('ordinal_features', Pipeline([
                    ('recode_ordinals', recode_ordinals),
                    ('ordinal_imp', SimpleImputer(strategy='most_frequent'))
                ])),
                ('cat_features', Pipeline([
                    ('to_dict', cat_to_dict),
                    ('dict_vectorizer', DictVectorizer())
                ])),
                ('text_features', Pipeline([
                    ('combine_text', get_text_features),
                    ('vectorizer', TfidfVectorizer(ngram_range=(1,2))),
                    ('dim_red', TruncatedSVD(100))
                ]))
             ]
        )),
        ('reg', XGBRegressor(n_estimators=50))
    ])
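`FeatureUnion` runs each sub-pipeline on the same input and stacks their outputs side by side. The sketch below shows this on a toy dataframe with one quantitative and one categorical column; the step names and data are hypothetical, and the regressor is left out so that only the feature matrix is shown:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Made-up data with a missing bed count
toy = pd.DataFrame({'beds': [1.0, np.nan, 3.0],
                    'property_type': ['House', 'Apartment', 'House']})

union = FeatureUnion([
    ('quant', Pipeline([
        ('select', FunctionTransformer(lambda x: x[['beds']], validate=False)),
        ('impute', SimpleImputer()),  # default strategy fills NaN with the mean
    ])),
    ('cat', Pipeline([
        ('to_dict', FunctionTransformer(
            lambda x: x[['property_type']].to_dict('records'), validate=False)),
        ('vectorize', DictVectorizer()),
    ])),
])

# One imputed numeric column followed by two one-hot columns
features = union.fit_transform(toy)
dense = features.toarray()  # the union is sparse because one branch is sparse
print(dense)
```

The column order follows the order of the sub-pipelines in the list, so the first column is the imputed `beds` value and the remaining two are the indicators for `Apartment` and `House`.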

Fitting the pipeline preprocesses the training data and trains the XGBoost model in a single call:

In [None]:
pipe.fit(X_train, y_train)

# 7. Model performance
<a id='performance'></a>

Finally, we'll evaluate our model's performance by seeing how well it does on the test data. Since our pipeline takes care of most of the preprocessing, we will have to do very little to the raw test data before we feed it into the pipeline:

In [None]:
X_test = pd.read_csv('data/listings_test.csv', index_col='id', low_memory=False)
X_test.host_response_rate = pct_to_float(X_test.host_response_rate)
X_test['days_as_host'] = (pd.to_datetime('2019-07-14') - pd.to_datetime(X_test.host_since)).dt.days

In [None]:
calendar_test = pd.read_csv('data/calendar_test.csv')
y_test = calendar_test[calendar_test.date == '2019-08-03'].set_index('listing_id').loc[X_test.index].price.values

In [None]:
y_pred = pipe.predict(X_test)

In [None]:
rmse = mean_squared_error(y_test, y_pred) ** .5
print('RMSE of the model: {0:.1f}'.format(rmse))

For reference, we can compare this to a baseline model that always predicts the mean training-set price, regardless of the property:

In [None]:
rmse_baseline = np.mean((np.mean(y_train) - y_test) ** 2) ** .5
print('Baseline RMSE: {0:.1f}'.format(rmse_baseline))
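To make the two error calculations concrete, here is a worked example on made-up prices. All arrays are toy values, not results from the model:

```python
import numpy as np

# Made-up prices and predictions
y_train_toy = np.array([100.0, 200.0])   # training mean is 150
y_test_toy = np.array([120.0, 180.0])
y_pred_toy = np.array([130.0, 170.0])

# RMSE: square the errors, average them, take the square root
toy_rmse_model = np.mean((y_pred_toy - y_test_toy) ** 2) ** .5

# The baseline predicts the training mean for every test property
toy_rmse_baseline = np.mean((np.mean(y_train_toy) - y_test_toy) ** 2) ** .5

print(toy_rmse_model, toy_rmse_baseline)  # 10.0 30.0
```

Both numbers are in the same units as the price, so a model is only useful to the extent that its RMSE falls below the baseline's.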