![alt text](http://www.onedropnirvana.com/wp-content/uploads/2014/07/airbnb.gif)

## Optimizing Your Airbnb Pricing
#### Patrick Huston & Filippos Lymperopoulos | Spring 2016

Whether an apartment for a night, a castle for a week, or a villa for a month, Airbnb connects people to unique travel experiences, at any price point, in more than 34,000 cities and 190 countries. And with world-class customer service and a growing community of users, Airbnb is the easiest way for people to monetize their extra space and showcase it to an audience of millions. But how can hosts make sure they price their property realistically enough to attract guests? How should they adjust a listing's price for the neighborhood it is in, the number of beds, the types of room and bed on offer, and other such parameters?

An approach to the aforementioned questions was developed as a 2-week project by Patrick Huston and Filippos Lymperopoulos for the [Data Science](https://sites.google.com/site/datascience16/) class at Olin College of Engineering, Spring 2016.

------

## Introduction
This notebook aims to document our process in creating an ML pipeline to predict Airbnb listing prices given an input set of features. A major focus of this exploration is to write modular, well-designed components that can easily be used in different modeling situations. At the same time, we are using the model developed in this notebook to provide the backend infrastructure of a web interface that will serve as a guide for potential hosts to price their listings.

## Dataset
We were able to gain access to Airbnb data via [Inside Airbnb](http://insideairbnb.com/get-the-data.html). This resource provided us with information on listings across different cities, *listings.csv*, as well as the relevant dates of availability of the listings presented in the listings datasheet, as given in *calendar.csv*. We ended up not utilizing the *calendar.csv* data, as we focused our exploration more on the features that affect price on different listings on the service.

## Libraries
To run our scripts, we need to import a series of standard libraries.

In [14]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import pickle
from sklearn import preprocessing
from datetime import datetime
import binaryHelper as be
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score
from sklearn.cross_validation import StratifiedKFold
from sklearn.externals import joblib

from sklearn import linear_model
from sklearn import svm

## Load in the Listings Data
We loaded the different listings for individual cities and concatenated them into a single dataframe.

In [17]:
# Load the data
listingsBoston = pd.read_csv('../data/listingsBoston.csv')
listingsSF = pd.read_csv('../data/listingsSF.csv')
listingsLA = pd.read_csv('../data/listingsLA.csv')
listingsDC = pd.read_csv('../data/listingsDC.csv')
listingsSeattle = pd.read_csv('../data/listingsSeattle.csv')

# List of city dataframes
frames = [listingsBoston, listingsSF, listingsLA, listingsDC, listingsSeattle]

# Dataframe with all listings
listingsAll = pd.concat(frames)

## Feature Engineering 

The Airbnb dataset is composed of both numerical features - data like the number of beds, the number of bedrooms, and the number of bathrooms - and categorical features - data like the neighborhood, the type of room, the type of property listed and the type of beds offered. 

Both types of features are important in predicting the price of a given listing. While the numerical features can be used directly, we'll have to apply some additional processing and encoding steps to get the categorical data in a representation that can be used as a feature in our models.

### Numerical Features

The numerical features we plan on using directly are the following:
- `bedrooms` - The number of bedrooms included in the listing
- `beds` - The number of beds included in the listing
- `bathrooms` - The number of bathrooms included in the listing
- `accommodates` - The number of people the listing accommodates

Other non-categorical features we'll be extracting are:
- `num_amenities` - Using the 'amenities' feature, we can extract the number of amenities offered
- `days_host` - Using the 'host_since' feature, we can compute the number of days the host has been a host, potentially a measure of experience
- `price` - We must clean the 'price' feature to extract a numerical (rather than string) value for the price of the listing

### Categorical Features

There are several pertinent categorical features that intuitively seem to have great significance in the price of the listing. To use them in a model, however, we'll need to take some steps to encode them numerically. Listed here are the features we'll be using in the model:

- `neighbourhood` - The neighbourhood the property is in
- `property_type` - The type of property (e.g. apartment, bed and breakfast, house)
- `room_type` - The type of room (e.g. Shared room, Entire home/apt, or Private room)
- `bed_type` - The type of bed offered (e.g. Real Bed, Futon, Couch, Pull-out Sofa, Airbed)


### Encoding Methodologies
There are several options for encoding categorical features. 

##### Ordinal Encoding

In ordinal encoding, each categorical value is mapped to an integer. Ordinal encoding adds no extra columns to the feature matrix, so it does not dilute the other features included. Its major drawback is that it imposes an ordering on the categories, and with it an implied relationship between that ordering and the dependent variable (price, in this case). In some situations this may be appropriate (e.g. for the bed_type feature, there is a natural ordering/relationship between the bed_type and the price of the listing). In all ordinal mappings we have assumed a constant increment between consecutive categories.
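
To make the idea concrete, here is a minimal sketch of an ordinal encoding with pandas. The toy rows and the quality ranking in `bed_rank` are assumptions made up for illustration; the notebook's own `create_ordinal_mapping` assigns integers in sorted (alphabetical) order instead.

```python
import pandas as pd

# Toy listings (made up); the bed types are the categories named in the text.
toy = pd.DataFrame({"bed_type": ["Real Bed", "Futon", "Airbed", "Real Bed", "Couch"]})

# Hypothetical quality ranking: a higher integer stands for a "better" bed.
bed_rank = {"Airbed": 0, "Couch": 1, "Futon": 2, "Pull-out Sofa": 3, "Real Bed": 4}

# A single integer column stands in for the categorical one; no columns are added.
toy["bed_type_ordinal"] = toy["bed_type"].map(bed_rank)
```

Note that the step between any two consecutive categories is always 1, which is the constant-increment assumption mentioned above.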

##### One-Hot (Dummy) Encoding

In a one-hot encoding scheme, each category is represented as its own new feature that takes on a binary value: 1 if the input fits into the given category and 0 otherwise. For a feature with *n* levels, this means adding *n* new columns to our feature matrix. For low values of *n* this may be okay, but higher values of *n* risk blowing the dimensionality of the feature matrix way out of proportion. This can be hazardous, as features that occupy far fewer than *n* columns may not get to affect the prediction as much.
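
As a small illustration, `pd.get_dummies` produces exactly this kind of layout. The four toy rows below are made up; only the room type names come from the dataset description.

```python
import pandas as pd

# Toy column with the three room types named in the text.
toy = pd.DataFrame({"room_type": ["Private room", "Entire home/apt",
                                  "Shared room", "Private room"]})

# One new indicator column per category (n = 3 levels gives 3 columns).
one_hot = pd.get_dummies(toy["room_type"])

# Each row has exactly one active indicator.
active_per_row = one_hot.astype(int).sum(axis=1)
```

With 420 neighbourhoods instead of 3 room types, the same call would add 420 columns, which is the dimensionality concern raised above.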

##### Binary Encoding

Binary encoding is a compact alternative to one-hot encoding for representing categorical values. First, each category is assigned an integer, as in an ordinal encoding. Each integer is then written as a binary number. Finally, this binary number is split into its individual bits, and each bit is inserted into the feature matrix as a new column. A big advantage over one-hot encoding is that the number of resulting feature columns is significantly smaller (roughly log2 of *n* rather than *n*), which addresses the dimensionality issue we alluded to previously.
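
The three steps can be sketched from scratch as follows. This is not the `binaryHelper` module the notebook imports; the function `binary_encode`, the most-significant-bit-first column order, and the toy neighbourhood names are all assumptions for illustration.

```python
import math
import numpy as np
import pandas as pd

def binary_encode(series, prefix):
    """Binary-encode a categorical Series into ceil(log2(n)) 0/1 columns."""
    # Step 1: ordinal code per category (sorted order, as in the notebook's mapping).
    categories = sorted(series.unique())
    ordinal = series.map({cat: i for i, cat in enumerate(categories)}).to_numpy()
    # Step 2: number of bits needed to write the largest code in binary.
    n_bits = max(1, math.ceil(math.log2(len(categories))))
    # Step 3: split each code into bits, most significant bit first (an assumption).
    shifts = np.arange(n_bits - 1, -1, -1)
    bits = (ordinal[:, None] >> shifts) & 1
    columns = ["{}_{}".format(prefix, k) for k in range(n_bits)]
    return pd.DataFrame(bits, columns=columns, index=series.index)

# Toy values: four distinct categories need only two bit columns.
toy = pd.Series(["Allston", "Back Bay", "Fenway", "Roxbury", "Back Bay"])
encoded = binary_encode(toy, "neighbourhood_binary")
```

Under this scheme, `math.ceil(math.log2(420))` is 9 and `math.ceil(math.log2(26))` is 5, which matches the column counts reported for the neighbourhood and property type features.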

### Encoding Choices

Armed with a wealth of information on categorical encoding, we made the following decisions for each of the categorical features in the dataset.

#### Neighbourhood
Due to its high cardinality (420 possibilities), we chose a binary encoding to represent the neighbourhood feature. An ordinal encoding might also be possible, but in the research we've done, binary encodings almost always seemed to outperform ordinal encodings. We ended up reducing the "represented" number of features from 420 to 9.

#### Property Type
Property type was another category with many levels (26 options), so we decided to encode it with the binary method as well. We initially attempted sum encoding; however, that did not yield improved results. We ended up reducing the "represented" number of features from 26 to 5.

#### Room Type & Bed Type
After observing the results of the previous encoding procedures, we decided to encode these two features differently, using a combination of one-hot and ordinal encoding: each feature keeps its ordinal column and also gets one indicator column per category. This method significantly aided our model.

### Ordinal Mapping Implementation

As a first step, we map every categorical feature to ordinal values; the binary encodings are later built on top of these mappings. For the room_type mapping, we used our prior knowledge to "rank" the categories.

In [19]:
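# Method for linear ordinal encoding
def create_ordinal_mapping(data, parameter):
    """Maps each unique value of a column to an integer, in sorted order"""
    parameter_mapping = {}
    
    # Use a distinct loop variable so the column name is not shadowed
    for index, value in enumerate(sorted(data[parameter].unique())):
        parameter_mapping[value] = index
    
    return parameter_mapping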

# Ordinal mappings on neighbourhood, bed, room and property type
neighbourhood_mapping = create_ordinal_mapping(listingsAll, "neighbourhood_cleansed")
bed_type_mapping = create_ordinal_mapping(listingsAll, "bed_type")
room_type_mapping = create_ordinal_mapping(listingsAll, "room_type")
property_type_mapping = create_ordinal_mapping(listingsAll, "property_type")

# Swap two ordinals so the ranking follows our "prior knowledge" of room quality:
# Shared room < Private room < Entire home/apt
shared_room_ord = room_type_mapping["Shared room"]
entire_apt_ord = room_type_mapping["Entire home/apt"]
room_type_mapping["Shared room"] = entire_apt_ord
room_type_mapping["Entire home/apt"] = shared_room_ord

## Processing 

To facilitate the cleaning process, we defined a `Helper` class and a `cleanProcessor` class that take care of constructing the appropriate fields in the dataframe used for our model.

### Helper

The class `Helper` exposes a set of helpful functions that each deal with processing an individual feature in the dataset. For example, `amenities_to_list` takes in an individual row from the `amenities` column - represented as a string - and converts it into a more useful list format. 

In [35]:
class Helper():
    
    ''' 
    Helper exposes a set of functions aimed at processing individual 
    features in the dataset.
    '''

    def __init__(self, data_attrs):
        self.data_attrs = data_attrs
    
    def amenities_to_list(self, amenities):
        """Converts string representation of amenities list"""
        amenities = amenities.replace('{', '')
        amenities = amenities.replace('}', '')
        amenities = amenities.replace('\"', '')
        return amenities.split(',')

    def num_amenities(self, amenitiesList):
        """Creates new feature out of the number of amenities"""
        return len(amenitiesList)
 
    def price_to_int(self, price):
        """Converts string representation of price to float"""
        price = price.replace(',', '')
        price = price.replace('$', '')
        return float(price)
    
    def date_to_days(self, startDay):
        """Returns the number of days an individual has been a host"""
        dStart = datetime.strptime(startDay, "%Y-%m-%d")
        dEnd = datetime.strptime(self.data_attrs['date_min'], "%Y-%m-%d")
        return abs((dEnd - dStart).days)
    
    def row_to_ordinal(self, row, mapping):
        """Returns mapped ordinal values"""
        return self.data_attrs[mapping][row]

helper = Helper({'date_min': listingsAll.host_since.dropna().max(),
                 'neighbourhood_mapping': neighbourhood_mapping,
                 'bed_type_mapping': bed_type_mapping,
                 'room_type_mapping': room_type_mapping,
                 'property_type_mapping': property_type_mapping
                })

### Clean Processor
The class `cleanProcessor` handles two things: it applies the methods defined in `Helper` to its dataset, and it performs some additional processing. That additional processing consists of null-fitting (filling in missing values), applying the encodings, and concatenating the results into the final dataframe.

In [36]:
class cleanProcessor():

    ''' 
    cleanProcessor exposes a set of functions aimed at cleaning up the dataframe
    in question and addressing null exceptions and mappings.
    '''
    
    def __init__(self, data):
        self.data = data
        
    def clean_listings(self):
        """ 
        Returns a clean dataframe after null values are filled in and the
        appropriate encodings are applied
        """
        df_clean = self.data.copy()
        
        # Null-fitting 
        df_clean.loc[df_clean.review_scores_rating.isnull(), 'review_scores_rating'] = 90
        df_clean.loc[df_clean.host_since.isnull(), 'host_since'] = '2015-10-02'
        df_clean.loc[df_clean.bedrooms.isnull(), 'bedrooms'] = 0
        df_clean.loc[df_clean.bathrooms.isnull(), 'bathrooms'] = 0
        df_clean.loc[df_clean.beds.isnull(), 'beds'] = 0
        df_clean.loc[df_clean.property_type.isnull(), 'property_type'] = "Other"
        
        # Preprocessing data using Helper instance
        df_clean['amenities'] = df_clean['amenities'].apply(helper.amenities_to_list)
        df_clean['num_amenities'] = df_clean['amenities'].apply(helper.num_amenities)
        df_clean['price'] = df_clean['price'].apply(helper.price_to_int)
        df_clean['days_host'] = df_clean['host_since'].apply(helper.date_to_days)

        # Mapping implementations for neighbourhood and property
        df_clean['neighbourhood_binary'] = df_clean['neighbourhood_cleansed'].apply(helper.row_to_ordinal, args=("neighbourhood_mapping",))
        df_clean['property_binary_encoded'] = df_clean['property_type'].apply(helper.row_to_ordinal, args=("property_type_mapping",))
        
        # Binary encoding for neighborhoods
        encoder = be.BinaryEncoder(cols=['neighbourhood_binary'])
        binary_neighbourhoods = encoder.transform(df_clean)
        
        # Binary encoding for property_type
        property_encoder = be.BinaryEncoder(cols=['property_binary_encoded'])
        binary_properties = property_encoder.transform(df_clean)
        
        # One-hot and Ordinal Encoding for bed_type and room_type
        bed_type = pd.get_dummies(df_clean.bed_type)    
        df_clean["bed_type"] = df_clean["bed_type"].apply(helper.row_to_ordinal, args=("bed_type_mapping",))
        room_type = pd.get_dummies(df_clean.room_type)    
        df_clean["room_type"] = df_clean["room_type"].apply(helper.row_to_ordinal, args=("room_type_mapping",))

        # One-hot encoding for cancellation policy and renaming of resulting columns
        cancellation_policy = pd.get_dummies(df_clean.cancellation_policy)
        # rename() returns a new dataframe, so the result must be assigned back
        cancellation_policy = cancellation_policy.rename(columns={
            'strict': 'cancellation_strict',
            'flexible': 'cancellation_flexible',
            'moderate': 'cancellation_moderate'})
                
        data = pd.concat([df_clean, cancellation_policy, binary_neighbourhoods, bed_type, room_type, binary_properties], axis=1)

        return data

clean_processor_listings = cleanProcessor(listingsAll)

We will now use our `clean_processor_listings` instance to clean our listings data.

In [23]:
listingsClean = clean_processor_listings.clean_listings()

## Model Development

### Choosing Models

Now that we've done a good amount of initial preprocessing on our data, let's move towards some predictive modeling. Our goal is to develop a model that will predict a price given listing parameters. We'll experiment with including different features, and see which combination of features and models produces the best results.

After some initial research we've decided to start off by trying four different models:

1. Linear Lasso

   The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of variables upon which the given solution depends.

2. ElasticNet

   ElasticNet is a linear regression model trained with both L1 and L2 penalties as regularizers. This combination allows for learning a sparse model where few of the weights are non-zero, like Lasso, while still maintaining the regularization properties of Ridge.

3. Support Vector Regression

   Support Vector Regression is an implementation of regression that uses the SVM approach. In this case, we'll be using the linear kernel.

4. Ridge Regression

   Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of coefficients. The ridge coefficients minimize a penalized residual sum of squares.
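
The difference between the L1 and L2 penalties described in the list above can be seen on synthetic data. The sketch below is illustrative only: the data are randomly generated, the `alpha` values are arbitrary, and none of it uses the Airbnb dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, but only the first two influence the target.
rng = np.random.RandomState(0)
X_demo = rng.randn(200, 10)
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.1 * rng.randn(200)

lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)

# The L1 penalty drives irrelevant coefficients to exactly zero;
# the L2 penalty only shrinks them towards zero.
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
```

Comparing `n_zero_lasso` with `n_zero_ridge` shows the sparsity that the Lasso and ElasticNet descriptions refer to.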
       



In [38]:
# Linear Lasso
lasso = linear_model.Lasso(alpha = .01)

# ElasticNet
elasticNet = linear_model.ElasticNet(alpha = 0.1, l1_ratio=0.7)

# Ridge Regression
ridgeRegression = linear_model.Ridge(alpha = .5)

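# Support Vector Regression (linear kernel, as described above)
svr = svm.SVR(kernel='linear', C=1.0, epsilon=0.2)

# Dictionary of models to compare (svr is defined above but not included here)
models = {'lasso': lasso, 'elasticNet': elasticNet, 'ridgeRegression': ridgeRegression}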

### Testing Models

Now, let's define `ModelHelper`, a class that will facilitate the process of testing our models.

#### Methodology for Testing

To test our models, we'll be using the `cross_val_score` method provided by `scikit-learn`, coupled with a StratifiedKFold splitting step. Stratifying on neighbourhood ensures that each neighbourhood is represented proportionally in both the training and test folds. This is necessary because the data is sorted by neighbourhood. A default, unshuffled K-fold split would likely give us training and test sets where a given neighbourhood only appears in one or the other, essentially nullifying the predictive power of the neighbourhood as a feature.
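
The effect of stratifying on sorted data can be checked on a toy example. This sketch uses the newer `sklearn.model_selection` API rather than the `sklearn.cross_validation` module imported in this notebook, and the neighbourhood labels `A`, `B`, `C` and the helper `fold_neighbourhoods` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy data sorted by neighbourhood: 6 listings in each of 3 areas.
neighbourhoods = np.repeat(["A", "B", "C"], 6)
X_toy = np.arange(len(neighbourhoods)).reshape(-1, 1)

def fold_neighbourhoods(splitter):
    """Return the set of neighbourhoods present in each test fold."""
    return [set(neighbourhoods[test_idx])
            for _, test_idx in splitter.split(X_toy, neighbourhoods)]

# Unshuffled K-fold cuts the sorted data into contiguous blocks.
plain = fold_neighbourhoods(KFold(n_splits=3))

# Stratified K-fold spreads every neighbourhood across all folds.
stratified = fold_neighbourhoods(StratifiedKFold(n_splits=3))
```

In `plain`, each test fold contains a single neighbourhood that the model never saw during training, whereas every fold in `stratified` contains all three.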

In [40]:
class ModelHelper():
    
    ''' 
    ModelHelper exposes a set of functions aimed at facilitating the dataset
    manipulation (train-test splitting) and model testing process
    '''
    
    def __init__(self, X, y, features):
        self.X = X
        self.y = y
        self.XFeat = X[features]
    
    def cross_validate(self, model, cv=3):
        '''Cross-validates model using 'cv' folds stratified by neighbourhood - default value of 3'''
        # Keep the fold count in 'cv' instead of overwriting it
        folds = StratifiedKFold(self.X.neighbourhood_cleansed, n_folds=cv)
        return cross_val_score(model, self.XFeat, self.y, cv=folds).mean()

    def train_test_splitter(self, model, train_size=0.6, save=False):
        '''Performs train-test split on data, fits model on the training set, returns the split data and the fitted model'''
        X_train, X_test, y_train, y_test = train_test_split(self.XFeat, self.y, train_size=train_size)
        model.fit(X_train, y_train)
        return X_train, X_test, y_train, y_test, model

    def test_models(self, models):
        '''Iterates over all models and prints the mean cross-validation score of each'''
        for modelName, model in models.items():
            print('{} : {}'.format(modelName, self.cross_validate(model)))
    
    def save_model(self, model, filename="model.pkl"):
        '''Persists model to disk for later use in backend API'''
        model.fit(self.XFeat, self.y)
        joblib.dump(model, filename) 

    def save_mappings(self, filename="newmapping.pkl"):
        '''Persists the combined category mappings to disk for later use in backend API'''
        # Merge into a fresh dict so the global bed_type_mapping is not mutated
        combined_mapping = bed_type_mapping.copy()
        combined_mapping.update(property_type_mapping)
        combined_mapping.update(room_type_mapping)
        combined_mapping.update(neighbourhood_mapping)
        
        with open(filename, 'wb') as f:
            pickle.dump(combined_mapping, f)
        

#### Customizing feature columns
As noted above, we applied the appropriate encodings to the potential input features of our model. A binary encoding transformation on the property_type column generates 5 indexed columns that we then pass into our `allFeatures` list, as shown below.

In [10]:
binary_encoded_properties = ["property_binary_encoded_{}".format(i) for i in range(5)]

In [41]:
features = ['bedrooms', 'beds', 'bathrooms', 'accommodates',
            'neighbourhood_binary_0', 'neighbourhood_binary_1', 
            'neighbourhood_binary_2', 'neighbourhood_binary_3',
            'neighbourhood_binary_4', 'neighbourhood_binary_5',
            'neighbourhood_binary_6', 'neighbourhood_binary_7',
            'neighbourhood_binary_8', 'bed_type', 'Airbed', 
            'Couch', 'Futon', 'Pull-out Sofa', 'Real Bed', 
            'room_type','Shared room','Private room', 'Entire home/apt']

# All features added
allFeatures = features + binary_encoded_properties

# Dump features to pickle file for later use
pickle.dump(allFeatures, open( '../app/utils/modelFeatures.pkl', 'wb' ))

#### Prediction outputs

Using our `ModelHelper` class, we can test multiple models and save a specific one for future use.

In [43]:
# Model instance with clean listings, price and features
mHelper = ModelHelper(listingsClean, listingsClean.price, allFeatures)

# Test multiple models
mHelper.test_models(models)

# Save the lasso model (refit on the full dataset) for later use
mHelper.save_model(lasso)

elasticNet : 0.274001414485
ridgeRegression : 0.274696027367
lasso : 0.274703106196


### Model Results Analysis
We tested a series of different models and, as the results above show, they produce very similar scores. All of them are linear regression models. The score we report is the coefficient of determination, $R^{2}$ (R-squared), of the prediction. R-squared is a statistical measure of how close the data are to the fitted regression line.

In essence, this output gives us a coefficient of determination for the predicted price of a listing. Our `Lasso Model` iteration does a decent job for a first pass: with a score of about 0.27, it explains roughly 27% of the variability of the response data around its mean, which leaves considerable room for improvement. The closer the score is to unity, the more of the variability the model explains; a score of exactly 1 would mean the model explains *all* the variability present.
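
For reference, R-squared is defined as $R^{2} = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the true values from their mean. The sketch below computes it by hand; the function name `r_squared` and the four toy prices are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

prices = [80.0, 120.0, 150.0, 250.0]

# Perfect predictions explain all of the variability.
perfect = r_squared(prices, prices)

# Always predicting the mean price explains none of it.
baseline = r_squared(prices, [np.mean(prices)] * len(prices))
```

A score of 0.27 therefore sits between the always-predict-the-mean baseline (0) and a perfect fit (1).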

### Prediction and Node Integration

The final component of our project was that of actually predicting a price for a given listing. We implemented a NodeJS backend that exposes the prediction power of our model as an API. A frontend ReactJS application with D3.js maps interacts with this API to create a visual price prediction tool for Airbnb.

A prediction example is shown below. The input matrix, `sample`, represents a single listing with all 28 features properly encoded numerically, in the same order as `allFeatures`.

In [49]:
# save_model() above wrote the fitted model to its default filename, model.pkl
lassomodel = joblib.load('model.pkl')

sample = np.array([2, 2, 2, 2, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 1])
sample = sample.reshape(1,-1)

lassomodel.predict(sample)[0]
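
Writing the 28 values by position is easy to get wrong, so one option is to assemble the row by feature name. The sketch below is an assumption, not part of the project's backend: `build_sample` is a hypothetical helper, `feature_names` re-creates the order of `allFeatures` defined earlier, and the values in `listing` are made up.

```python
import numpy as np

# Same order as the allFeatures list defined earlier in the notebook.
feature_names = (
    ['bedrooms', 'beds', 'bathrooms', 'accommodates']
    + ['neighbourhood_binary_{}'.format(i) for i in range(9)]
    + ['bed_type', 'Airbed', 'Couch', 'Futon', 'Pull-out Sofa', 'Real Bed',
       'room_type', 'Shared room', 'Private room', 'Entire home/apt']
    + ['property_binary_encoded_{}'.format(i) for i in range(5)]
)

def build_sample(values, names=feature_names):
    """Return a (1, n_features) array; features not given default to 0."""
    unknown = set(values) - set(names)
    if unknown:
        raise KeyError("Unknown feature(s): {}".format(sorted(unknown)))
    row = [values.get(name, 0) for name in names]
    return np.array(row).reshape(1, -1)

# Hypothetical listing: two bedrooms, a real bed, an entire home.
listing = {'bedrooms': 2, 'beds': 2, 'bathrooms': 2, 'accommodates': 2,
           'bed_type': 4, 'Real Bed': 1,
           'room_type': 2, 'Entire home/apt': 1}
sample_row = build_sample(listing)
```

The resulting `sample_row` has shape `(1, 28)` and could be passed to a fitted model's `predict` method in place of the hand-written array.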

### Reflection and Future Work


This project was a real eye-opener to the world of Data Science and Visualization. The former field allowed us to explore routes pertaining to preprocessing, feature engineering and model development of our system, while the latter, combining full-stack development libraries with powerful visualization tools, gave us the chance to see an integrated user-friendly final product come to life.

As shown in multiple iPython Notebooks under the model_exploration directory, when we started this project we tried to investigate correlations between different categories and price. This helped us a lot in initially understanding how such relationships are formed, and aided us in the further development of a model with a set of carefully engineered features.

One way to improve our model would certainly be to include more features and to investigate different ways of encoding certain categorical features. Tuning the parameters that affect the behavior of our models would also be worth investigating in the future.