## Airbnb Price Prediction - Optimizing Listings
#### Patrick Huston & Filippos Lymperopoulos | Spring 2016

*This notebook aims to document our process in creating an ML pipeline to predict Airbnb listing prices given an input set of features. A major focus of this exploration is to write modular, well-designed components that could easily be taken and applied to a different modeling situation.*

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import pickle
from sklearn import preprocessing
from datetime import datetime
import binaryHelper as be
import sumHelper as se
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score
from sklearn.cross_validation import StratifiedKFold
from sklearn.externals import joblib

from sklearn import linear_model
from sklearn import svm

### The Data

Airbnb provides an expansive listings dataset for public use. It includes all listings for major cities around the world. For the purpose of our exploration, we'll start off by using well-known US cities - Boston, San Francisco, Los Angeles, Washington DC, and Seattle. For each listing, Airbnb provides a large number of features - check [here](https://github.com/flymperopoulos/DataScience16CTW/blob/master/report/listings_features.md) to see them all listed out.

#### Load in the Listings Data

In [2]:
listingsBoston = pd.read_csv('../data/listingsBoston.csv')
listingsSF = pd.read_csv('../data/listingsSF.csv')
listingsLA = pd.read_csv('../data/listingsLA.csv')
listingsDC = pd.read_csv('../data/listingsDC.csv')
listingsSeattle = pd.read_csv('../data/listingsSeattle.csv')

frames = [listingsBoston, listingsSF, listingsLA, listingsDC, listingsSeattle]

listingsAll = pd.concat(frames)

#### Load in the Calendar Data

In [3]:
calendarBoston = pd.read_csv('../data/calendarBoston.csv')
calendarSF = pd.read_csv('../data/calendarSF.csv')
calendarLA = pd.read_csv('../data/calendarLA.csv')
calendarDC = pd.read_csv('../data/calendarDC.csv')
calendarSeattle = pd.read_csv('../data/calendarSeattle.csv')

frames = [calendarBoston, calendarSF, calendarLA, calendarDC, calendarSeattle]

calendarAll = pd.concat(frames)

### Cleaning Convenience

To facilitate the process of cleaning, we've defined a cleaning helper and cleaning processor class below. 

### Feature Engineering 

The Airbnb dataset is composed of both numerical features - data like the number of beds, the number of bedrooms, and the number of bathrooms - and categorical features - data like the neighborhood, the type of room, and the type of the bed. 

Both types of features will be important in predicting the price of a given listing. While the numerical features can be used directly, we'll have to take some additional processing and encoding steps to get the categorical data in a representation that can be used as a feature in our models.

#### Numerical Features

The numerical features we plan on using directly are the following:
- `bedrooms` - The number of bedrooms included in the listing
- `beds` - The number of beds included in the listing
- `bathrooms` - The number of bathrooms included in the listing
- `accommodates` - The number of people the listing accommodates

Other non-categorical features we'll be extracting are:
- `num_amenities` - Using the 'amenities' feature, we can extract the number of amenities offered
- `days_host` - Using the 'host_since' feature, we can compute the number of days the host has been a host, potentially a measure of experience
- `price` - We must clean the 'price' feature to extract a numerical (rather than string) value for the price of the listing

#### Categorical Features

There are several pertinent categorical features that intuitively seem to have great significance in the price of the listing. To use them in a model, however, we'll need to take some steps to encode them numerically. Listed here are the features we'll be using in the model:

- `neighbourhood` - The neighbourhood the property is in
- `property_type` - The type of property (e.g. apartment, bed and breakfast, house)
- `room_type` - The type of room (e.g. Shared room, Entire home/apt, or Private room)
- `bed_type` - The type of bed offered (e.g. Real Bed, Futon, Couch, Pull-out Sofa, Airbed)


There are several options for encoding categorical features.

##### Ordinal Encoding

In ordinal encoding, each categorical value is mapped to an integer. The advantage of ordinal encoding is that it doesn't add extra columns to the feature matrix, which could dilute the other features included. The major drawback is that it imposes an ordering on the categories, and with it an implied relationship between each category's integer value and the dependent variable (price, in this case). In some situations, this may be appropriate (e.g. for the `bed_type` feature, where there is a natural ordering between the type of bed and the price of the listing).

##### One-Hot (Dummy) Encoding

In a one-hot encoding scheme, each category is represented as its own new feature that takes on a binary value - 1 if the input fits into the given category, and 0 otherwise. For a category with *n* levels, this means adding *n* new columns to our feature matrix. For low values of *n*, this may be okay, but higher values of *n* risk blowing the dimensionality of the feature matrix way out of proportion.
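
To make the scheme concrete, here is a minimal pure-Python sketch of one-hot encoding on a few `room_type` values. The function name `one_hot_encode` and the sample rows are ours, for illustration only; in the notebook itself this step is done with `pd.get_dummies`.

```python
def one_hot_encode(values):
    """Return (column names, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

# Illustrative sample rows, not taken from the dataset
room_types = ["Private room", "Entire home/apt", "Shared room", "Private room"]
columns, rows = one_hot_encode(room_types)

print(columns)
for row in rows:
    print(row)
```

Each row contains exactly one 1, and the number of columns grows with the number of distinct categories.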

##### Binary Encoding

Binary encoding is a cool alternative to one-hot encoding for the representation of categorical values. First, each category is assigned an integer, as in an ordinal encoding. Next, that integer is written as a binary number. Finally, the binary number is split into its individual bits, and each bit is inserted into the feature matrix as its own column. A category with *n* levels therefore needs only about log2(*n*) columns rather than *n*.
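
The three steps can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the implementation in `binaryHelper`: the function name `binary_encode`, the bit order (most significant bit first), and the choice of `bed_type` as the example column are all assumptions of ours.

```python
def binary_encode(ordinal, width):
    """Return the `width` lowest bits of `ordinal`, most significant bit first."""
    return [(ordinal >> shift) & 1 for shift in range(width - 1, -1, -1)]

# Step 1: ordinal values for the five bed types, in alphabetical order
bed_type_ordinals = {"Airbed": 0, "Couch": 1, "Futon": 2,
                     "Pull-out Sofa": 3, "Real Bed": 4}

# Steps 2 and 3: five categories fit in three bits, so each value
# becomes three 0/1 columns
encoded = {name: binary_encode(ordinal, 3)
           for name, ordinal in bed_type_ordinals.items()}

for name in sorted(encoded):
    print("{:<14} {}".format(name, encoded[name]))
```

Unlike one-hot rows, a binary-encoded row can contain several 1s, so individual columns no longer correspond to a single category.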

### Encoding Choices

Armed with a wealth of information on categorical encoding, we made the following decisions for each of the categorical features in the dataset.

#### Neighbourhood
Due to its high dimensionality (420 possibilities), we chose a binary encoding to represent the neighbourhood feature. An ordinal encoding might also be possible, but in the research we've done, a binary encoding almost always seemed to outperform ordinal encodings.

#### Property Type
Property type was another category with many levels (26 options), so we again chose a binary encoding. We initially attempted sum encoding; however, that did not yield improved results. Binary encoding reduced the number of columns needed to represent this feature from 26 to 5.
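
As a quick check on these column counts, the sketch below computes how many bits are needed to represent *n* ordinal values. The helper name `bits_needed` is ours, and we assume the encoder uses the minimum number of bits; the results agree with the nine `neighbourhood_binary_*` and five `property_binary_encoded_*` columns in the feature list later in this notebook.

```python
def bits_needed(n_categories):
    """Number of binary columns needed for ordinals 0 .. n_categories - 1."""
    return max(1, (n_categories - 1).bit_length())

for name, n_categories in [("neighbourhood", 420), ("property_type", 26)]:
    print("{}: {} one-hot columns vs {} binary columns".format(
        name, n_categories, bits_needed(n_categories)))
```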

#### Room Type & Bed Type
Based on the results of the previous encodings, we decided to try a different approach for these two features, combining one-hot and ordinal encoding: each feature is included both as a set of one-hot columns and as a single ordinal column. This method significantly aided our model.

In [4]:
def create_ordinal_mapping(data, column):
    '''Maps each unique value in data[column] to its index in sorted order'''
    mapping = {}
    
    # Use a separate loop variable so the column name is not shadowed
    for index, value in enumerate(sorted(data[column].unique())):
        mapping[value] = index
    
    return mapping

neighbourhood_mapping = create_ordinal_mapping(listingsAll, "neighbourhood_cleansed")
bed_type_mapping = create_ordinal_mapping(listingsAll, "bed_type")
room_type_mapping = create_ordinal_mapping(listingsAll, "room_type")
property_type_mapping = create_ordinal_mapping(listingsAll, "property_type")

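# Swap the ordinals of "Shared room" and "Entire home/apt" so that the encoding
# follows room quality: Shared room (0) < Private room (1) < Entire home/apt (2)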
shared_room_ord = room_type_mapping["Shared room"] 
entire_apt_ord = room_type_mapping["Entire home/apt"]
room_type_mapping["Shared room"] = entire_apt_ord
room_type_mapping["Entire home/apt"] = shared_room_ord

#### Helper

The class `Helper` exposes a set of helpful functions that each deal with processing an individual feature in the dataset. For example, `amenities_to_list` takes in an individual row from the `amenities` column - represented as a string - and converts it into a more useful list format. 

In [5]:
class Helper():
    def __init__(self, data_attrs):
        self.data_attrs = data_attrs
    
    # Converts string representation of amenities list 
    def amenities_to_list(self, amenities):
        amenities = amenities.replace('{', '')
        amenities = amenities.replace('}', '')
        amenities = amenities.replace('\"', '')
        return amenities.split(',')

    # Creates new feature out of the number of amenities
    def num_amenities(self, amenitiesList):
        return len(amenitiesList)

    # Converts string representation of price to float 
    def price_to_int(self, price):
        price = price.replace(',', '')
        price = price.replace('$', '')
        return float(price)
    
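    # Converts a date string to the number of days between it and the reference
    # date in data_attrs['date_min'] (set below to the latest host_since date)
    def date_to_days(self, startDay):
        dStart = datetime.strptime(startDay, "%Y-%m-%d")
        dEnd = datetime.strptime(self.data_attrs['date_min'], "%Y-%m-%d")
        return abs((dEnd - dStart).days)
    
    # Looks up the ordinal value of a category in the named mapping
    def row_to_ordinal(self, row, mapping):
        return self.data_attrs[mapping][row]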

    
helper = Helper({'date_min': listingsAll.host_since.dropna().max(),
                 'neighbourhood_mapping': neighbourhood_mapping,
                 'bed_type_mapping': bed_type_mapping,
                 'room_type_mapping': room_type_mapping,
                 'property_type_mapping': property_type_mapping
                })
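
As a quick sanity check of the parsing logic, the sketch below re-implements three of the helper methods as standalone functions and runs them on hypothetical raw values. The sample strings are made up for illustration and are not rows from the dataset; `price_to_float` and the explicit `reference_day` argument are our own naming.

```python
from datetime import datetime

def amenities_to_list(amenities):
    """Strip braces and double quotes, then split on commas."""
    for char in '{}"':
        amenities = amenities.replace(char, '')
    return amenities.split(',')

def price_to_float(price):
    """Turn a string such as '$1,250.00' into a float."""
    return float(price.replace(',', '').replace('$', ''))

def date_to_days(start_day, reference_day):
    """Absolute number of days between two 'YYYY-MM-DD' date strings."""
    fmt = "%Y-%m-%d"
    delta = datetime.strptime(reference_day, fmt) - datetime.strptime(start_day, fmt)
    return abs(delta.days)

# Hypothetical raw values in the same format as the listings columns
print(amenities_to_list('{TV,"Wireless Internet",Kitchen}'))
print(price_to_float('$1,250.00'))
print(date_to_days('2015-09-02', '2015-10-02'))
```

One thing to watch: an empty amenities string such as `'{}'` splits into `['']`, so a listing with no amenities would be counted as having one.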

#### Clean Processor
The class `cleanProcessor` handles two things - it applies the methods defined in `Helper` to its dataset and performs some additional processing like null-filling.

In [32]:
class cleanProcessor():
    def __init__(self, data):
        self.data = data
        
    def clean_listings(self):
        df_clean = self.data.copy()
        
        # TODO: What? - Better techniques for filling nulls
        df_clean.loc[df_clean.review_scores_rating.isnull(), 'review_scores_rating'] = 90
        df_clean.loc[df_clean.host_since.isnull(), 'host_since'] = '2015-10-02'
        df_clean.loc[df_clean.bedrooms.isnull(), 'bedrooms'] = 0
        df_clean.loc[df_clean.bathrooms.isnull(), 'bathrooms'] = 0
        df_clean.loc[df_clean.beds.isnull(), 'beds'] = 0
        df_clean.loc[df_clean.property_type.isnull(), 'property_type'] = "Other"
        
        df_clean['amenities'] = df_clean['amenities'].apply(helper.amenities_to_list)
        df_clean['num_amenities'] = df_clean['amenities'].apply(helper.num_amenities)
        df_clean['price'] = df_clean['price'].apply(helper.price_to_int)
        df_clean['days_host'] = df_clean['host_since'].apply(helper.date_to_days)
        df_clean['neighbourhood_binary'] = df_clean['neighbourhood_cleansed'].apply(helper.row_to_ordinal, args=("neighbourhood_mapping",))
        df_clean['property_binary_encoded'] = df_clean['property_type'].apply(helper.row_to_ordinal, args=("property_type_mapping",))
        
        # Binary encoding for neighborhoods
        encoder = be.BinaryEncoder(cols=['neighbourhood_binary'])
        binary_neighbourhoods = encoder.transform(df_clean)
        
        # Binary encoding for property type
        property_encoder = be.BinaryEncoder(cols=['property_binary_encoded'])
        binary_properties = property_encoder.transform(df_clean)
        
        # One-hot and Ordinal Encoding for bed_type and room_type
        bed_type = pd.get_dummies(df_clean.bed_type)    
        df_clean["bed_type"] = df_clean["bed_type"].apply(helper.row_to_ordinal, args=("bed_type_mapping",))
        print df_clean["bed_type"]
        room_type = pd.get_dummies(df_clean.room_type)    
        df_clean["room_type"] = df_clean["room_type"].apply(helper.row_to_ordinal, args=("room_type_mapping",))

        # One-hot encoding for cancellation policy
        cancellation_policy = pd.get_dummies(df_clean.cancellation_policy)
        # rename returns a new DataFrame, so the result must be assigned back
        cancellation_policy = cancellation_policy.rename(columns={
            'strict': 'cancellation_strict',
            'flexible': 'cancellation_flexible',
            'moderate': 'cancellation_moderate'})
                
        data = pd.concat([df_clean, cancellation_policy, binary_neighbourhoods, bed_type, room_type, binary_properties], axis=1)

        return data

    def clean_calendar(self):
        # Keep only the dates on which the listing is available
        df_clean = self.data[self.data.available == 't'].copy()
        df_clean['price'] = df_clean['price'].apply(helper.price_to_int)
        return df_clean

clean_processor_listings = cleanProcessor(listingsAll)
clean_processor_calendar = cleanProcessor(calendarAll)

#### Now, let's use our clean_processor to clean our listings data.

In [33]:
listingsClean = clean_processor_listings.clean_listings()

0       4
1       4
2       4
3       4
4       4
5       4
6       4
7       4
8       4
9       4
10      4
11      4
12      4
13      4
14      4
15      4
16      4
17      4
18      4
19      4
20      4
21      4
22      4
23      4
24      4
25      4
26      4
27      4
28      4
29      4
       ..
3788    4
3789    4
3790    4
3791    4
3792    4
3793    4
3794    4
3795    4
3796    4
3797    2
3798    4
3799    4
3800    4
3801    4
3802    4
3803    4
3804    4
3805    4
3806    4
3807    4
3808    4
3809    4
3810    4
3811    4
3812    4
3813    4
3814    4
3815    4
3816    4
3817    4
Name: bed_type, dtype: int64


In [8]:
print '{} -- {} '.format('Private', listingsClean[listingsClean.room_type == 1].price.median()) 
print '{} -- {} '.format('Shared', listingsClean[listingsClean.room_type == 0].price.median()) 
print '{} -- {} '.format('Entire', listingsClean[listingsClean.room_type == 2].price.median()) 

Private -- 80.0 
Shared -- 47.0 
Entire -- 159.0 


In [9]:
print bed_type_mapping

{'Real Bed': 4, 'Futon': 2, 'Couch': 1, 'Pull-out Sofa': 3, 'Airbed': 0}


In [10]:
print '{} -- {} '.format('Real', listingsClean[listingsClean.bed_type == 4].price.median())
print '{} -- {} '.format('Pull-out', listingsClean[listingsClean.bed_type == 3].price.median()) 
print '{} -- {} '.format('Futon', listingsClean[listingsClean.bed_type == 2].price.median()) 
print '{} -- {} '.format('Air', listingsClean[listingsClean.bed_type == 0].price.median()) 
print '{} -- {} '.format('Couch', listingsClean[listingsClean.bed_type == 1].price.median()) 

Real -- 120.0 
Pull-out -- 79.5 
Futon -- 70.0 
Air -- 65.0 
Couch -- 55.0 


### Choosing Models

Now that we've done a good amount of initial preprocessing on our data, let's move towards some predictive modeling. Our goal is to develop a model that will predict a price given listing parameters. We'll experiment with including different features, and see which combination of features and models produces the best results.

After some initial research we've decided to start off by trying four different models - 

1. Linear Lasso
       The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer parameter values, effectively reducing the number of variables upon which the given solution is dependent.
       
2. ElasticNet 
       ElasticNet is a linear regression model trained with L1 and L2 prior as regularizer. This combination allows for learning a sparse model where few of the weights are non-zero like Lasso, while still maintaining the regularization properties of Ridge.
       
3. Support Vector Regression
       Support Vector Regression is an implementation of regression that uses the SVM approach. In this case, we'll be using the linear kernel.
            
4. Ridge Regression
       Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of coefficients. The ridge coefficients minimize a penalized residual sum of squares.
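
For reference, these are the objectives the three linear models minimize, as given in the scikit-learn documentation. Here $w$ is the coefficient vector, $X$ the feature matrix, $y$ the vector of prices, $n$ the number of samples, $\alpha$ the regularization strength (`alpha`), and $\rho$ the `l1_ratio`; the intercept is omitted for brevity.

$$\text{Lasso:}\quad \min_w \; \frac{1}{2n}\lVert Xw - y\rVert_2^2 + \alpha\lVert w\rVert_1$$

$$\text{Ridge:}\quad \min_w \; \lVert Xw - y\rVert_2^2 + \alpha\lVert w\rVert_2^2$$

$$\text{ElasticNet:}\quad \min_w \; \frac{1}{2n}\lVert Xw - y\rVert_2^2 + \alpha\rho\lVert w\rVert_1 + \frac{\alpha(1-\rho)}{2}\lVert w\rVert_2^2$$

Setting $\rho = 1$ in the ElasticNet objective recovers the Lasso objective.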
       



In [26]:
# Linear Lasso
lasso = linear_model.Lasso(alpha = .01)

# ElasticNet
elasticNet = linear_model.ElasticNet(alpha = 0.1, l1_ratio=0.7)

# Ridge Regression
ridgeRegression = linear_model.Ridge(alpha = .5)

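# Support Vector Regression (linear kernel, as described above)
svr = svm.SVR(kernel='linear', C=1.0, epsilon=0.2)

# Note: svr is defined above but is not included in the models dict
models = {'lasso': lasso, 'elasticNet': elasticNet, 'ridgeRegression': ridgeRegression }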

### Testing Models

Now, let's define `ModelHelper`, a class that will facilitate the process of testing our models.

#### A quick note on testing

To test our models, we'll be using the `cross_val_score` method provided by scikit-learn, coupled with a `StratifiedKFold` splitting step that stratifies on neighbourhood. This step ensures that each neighbourhood is represented in roughly the same proportion in every training and test fold. This is necessary because the data is sorted by neighbourhood: an unshuffled, unstratified split would likely give us training and test sets where a given neighbourhood appears in only one or the other, essentially nullifying the predictive power of the neighbourhood as a feature.
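
The toy example below illustrates the problem on data sorted by neighbourhood. It is only a sketch: the round-robin assignment in `stratified_folds` is our own simplification and not how scikit-learn's `StratifiedKFold` assigns rows internally, and the 18 sample labels are made up.

```python
from collections import defaultdict

def contiguous_folds(labels, n_folds):
    """Cut the row indices into n_folds consecutive chunks (no shuffling)."""
    bounds = [i * len(labels) // n_folds for i in range(n_folds + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(n_folds)]

def stratified_folds(labels, n_folds):
    """Deal the rows of each label round-robin across the folds."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_label.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds

def labels_in(labels, fold):
    """Sorted list of the distinct labels that appear in a fold."""
    return sorted(set(labels[i] for i in fold))

# Toy data, sorted by neighbourhood like the listings
labels = ["Allston"] * 6 + ["Back Bay"] * 6 + ["Brighton"] * 6

for name, make_folds in [("contiguous", contiguous_folds),
                         ("stratified", stratified_folds)]:
    folds = make_folds(labels, 3)
    print("{}: {}".format(name, [labels_in(labels, fold) for fold in folds]))
```

With contiguous folds, each test fold contains a single neighbourhood that the model never saw during training; with stratified folds, every fold contains all three.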

In [34]:
class ModelHelper():
    ''' 
    ModelHelper exposes a set of functions aimed at facilitating the dataset
    manipulation (train-test splitting) and model testing process
    '''
    
    def __init__(self, X, y, features):
        self.X = X
        self.y = y
        self.XFeat = X[features]
    
    def cross_validate(self, model, cv=3):
        '''Cross-validates model using 'cv' folds (default 3), stratified by neighbourhood'''
        folds = StratifiedKFold(self.X.neighbourhood_cleansed, n_folds=cv)
        return cross_val_score(model, self.XFeat, self.y, cv=folds).mean()

    def train_test_splitter(self, model, train_size=0.6, save=False):
        '''Performs train-test split on data and trains on train; returns the split data and the fitted model'''
        X_train, X_test, y_train, y_test = train_test_split(self.XFeat, self.y, train_size=train_size)
        model.fit(X_train, y_train)
        return X_train, X_test, y_train, y_test, model

    def test_models(self, models):
        '''Iterates over all models and prints each one's mean cross-validation score'''
        for modelName, model in models.iteritems():
            print '{} : {}'.format(modelName, self.cross_validate(model))
    
    def save_model(self, fitted_model, filename="model.pkl"):
        '''Persists model to disk for later use in backend API'''
        joblib.dump(fitted_model, filename) 

    def save_mappings(self, filename="mappings.pkl"):
        '''Merges the four ordinal mappings into one dict and returns it (saving to disk is currently disabled)'''
        # Merge into a fresh dict so the global bed_type_mapping is not mutated
        all_mappings = {}
        all_mappings.update(bed_type_mapping)
        all_mappings.update(property_type_mapping)
        all_mappings.update(room_type_mapping)
        all_mappings.update(neighbourhood_mapping)
        # joblib.dump(all_mappings, filename)
        return all_mappings
        

In [18]:
binary_encoded_properties = ["property_binary_encoded_{}".format(i) for i in range(5)]

In [30]:
# TODO: add property_type to features --> encoding
features = ['num_amenities', 'bedrooms', 'beds',
            'neighbourhood_binary_0', 'neighbourhood_binary_1', 
            'neighbourhood_binary_2', 'neighbourhood_binary_3',
            'neighbourhood_binary_4', 'neighbourhood_binary_5',
            'neighbourhood_binary_6', 'neighbourhood_binary_7',
            'neighbourhood_binary_8', 'bathrooms', 'accommodates',
            'bed_type', 'Airbed', 'Couch', 'Futon', 'Pull-out Sofa',
            'Real Bed', 'room_type', 'Entire home/apt', 'Private room',
            'Shared room']

allFeatures = features + binary_encoded_properties

print allFeatures

# Use a context manager so the file is closed after writing
with open('../app/utils/modelFeatures.pkl', 'wb') as f:
    pickle.dump(allFeatures, f)

['num_amenities', 'bedrooms', 'beds', 'neighbourhood_binary_0', 'neighbourhood_binary_1', 'neighbourhood_binary_2', 'neighbourhood_binary_3', 'neighbourhood_binary_4', 'neighbourhood_binary_5', 'neighbourhood_binary_6', 'neighbourhood_binary_7', 'neighbourhood_binary_8', 'bathrooms', 'accommodates', 'bed_type', 'Airbed', 'Couch', 'Futon', 'Pull-out Sofa', 'Real Bed', 'room_type', 'Entire home/apt', 'Private room', 'Shared room', 'property_binary_encoded_0', 'property_binary_encoded_1', 'property_binary_encoded_2', 'property_binary_encoded_3', 'property_binary_encoded_4']


In [35]:
mHelper = ModelHelper(listingsClean, listingsClean.price, allFeatures)

# mHelper.test_models(models)

In [36]:
# save dict mappings
mHelper.save_mappings()

{nan: 0,
 'Adams': 0,
 'Adams-Normandie': 1,
 'Agoura Hills': 2,
 'Airbed': 0,
 'Alhambra': 3,
 'Alki': 4,
 'Allston': 5,
 'Alondra Park': 6,
 'Altadena': 7,
 'Angeles Crest': 8,
 'Apartment': 1,
 'Arbor Heights': 9,
 'Arcadia': 10,
 'Arleta': 11,
 'Arlington Heights': 12,
 'Artesia': 13,
 'Athens': 14,
 'Atlantic': 15,
 'Atwater Village': 16,
 'Avocado Heights': 17,
 'Azusa': 18,
 'Back Bay': 19,
 'Baldwin Hills/Crenshaw': 20,
 'Baldwin Park': 21,
 'Bay Village': 22,
 'Bayview': 23,
 'Beacon Hill': 24,
 'Bed & Breakfast': 2,
 'Bel-Air': 25,
 'Bell': 26,
 'Bellflower': 27,
 'Belltown': 28,
 'Bernal Heights': 29,
 'Beverly Crest': 30,
 'Beverly Grove': 31,
 'Beverly Hills': 32,
 'Beverlywood': 33,
 'Bitter Lake': 34,
 'Boat': 3,
 'Boyle Heights': 35,
 'Bradbury': 36,
 'Brentwood': 37,
 'Briarcliff': 38,
 'Brighton': 39,
 'Brightwood Park, Crestwood, Petworth': 40,
 'Broadview': 41,
 'Broadway': 42,
 'Broadway-Manchester': 43,
 'Brookland, Brentwood, Langdon': 44,
 'Bryant': 45,
 'Bungal