## Airbnb Price Prediction - Optimizing Listings
#### Patrick Huston & Filippos Lymperopoulos | Spring 2016

*This notebook documents our process for building an ML pipeline that predicts Airbnb listing prices from a set of input features. A major focus of this exploration is writing modular, well-designed components that can easily be reused in a different modeling situation.*

In [146]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
from sklearn import preprocessing
from datetime import datetime
import category_encoders as ce

from sklearn import linear_model
from sklearn import svm

### The Data

Airbnb provides an expansive listings dataset for public use. It includes all listings for major cities around the world. For the purpose of our exploration, we'll start off by using well-known US cities: Boston, San Francisco, Los Angeles, Washington DC, and Seattle. For each listing, Airbnb provides a large number of features; see [the full feature list](https://github.com/flymperopoulos/DataScience16CTW/blob/master/report/listings_features.md).

#### Load in the Listings Data

In [72]:
listingsBoston = pd.read_csv('./data/listingsBoston.csv')
listingsSF = pd.read_csv('./data/listingsSF.csv')
listingsLA = pd.read_csv('./data/listingsLA.csv')
listingsDC = pd.read_csv('./data/listingsDC.csv')
listingsSeattle = pd.read_csv('./data/listingsSeattle.csv')

frames = [listingsBoston, listingsSF, listingsLA, listingsDC, listingsSeattle]

listingsAll = pd.concat(frames)

#### Load in the Calendar Data

In [71]:
calendarBoston = pd.read_csv('./data/calendarBoston.csv')
calendarSF = pd.read_csv('./data/calendarSF.csv')
calendarLA = pd.read_csv('./data/calendarLA.csv')
calendarDC = pd.read_csv('./data/calendarDC.csv')
calendarSeattle = pd.read_csv('./data/calendarSeattle.csv')

frames = [calendarBoston, calendarSF, calendarLA, calendarDC, calendarSeattle]

calendarAll = pd.concat(frames)

### Cleaning Convenience

To facilitate the process of cleaning, we've defined a cleaning helper and cleaning processor class below. 

In [133]:
def create_neighbourhood_mapping(data):
    neighbourhood_mapping = {}
    
    for index, neighbourhood in enumerate(sorted(data.neighbourhood_cleansed.unique())):
        neighbourhood_mapping[neighbourhood] = index
    
    return neighbourhood_mapping

neighbourhood_mapping = create_neighbourhood_mapping(listingsAll)
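To make the mapping concrete, here is a minimal sketch of the same idea on a toy list. The function name `build_ordinal_mapping` and the sample neighbourhood names are illustrative assumptions, not part of the notebook's data.

```python
def build_ordinal_mapping(values):
    """Map each distinct value to its position in sorted order."""
    return {value: index for index, value in enumerate(sorted(set(values)))}

# Hypothetical sample values, chosen only for illustration
sample = ['Mission', 'Back Bay', 'Mission', 'Capitol Hill']
mapping = build_ordinal_mapping(sample)
print(mapping)
```

Note that the resulting integers carry an arbitrary (alphabetical) order, so a linear model would treat them as if neighbouring codes were similar. That is one reason to encode them further before modeling.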

#### Helper

The class `Helper` exposes a set of helpful functions that each deal with processing an individual feature in the dataset. For example, `amenities_to_list` takes in an individual row from the `amenities` column - represented as a string - and converts it into a more useful list format. 

In [134]:
class Helper():
    def __init__(self, data_attrs):
        self.data_attrs = data_attrs
    
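    # Converts the brace-delimited string representation of the amenities to a list
    def amenities_to_list(self, amenities):
        amenities = amenities.replace('{', '')
        amenities = amenities.replace('}', '')
        amenities = amenities.replace('"', '')
        # Drop empty strings so that an empty set '{}' yields [] rather than ['']
        return [amenity for amenity in amenities.split(',') if amenity]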

    # Creates new feature out of the number of amenities
    def num_amenities(self, amenitiesList):
        return len(amenitiesList)

    # Converts string representation of price to float 
    def price_to_int(self, price):
        price = price.replace(',', '')
        price = price.replace('$', '')
        return float(price)
    
    # Converts a 'YYYY-MM-DD' date to the number of days between it and the reference date
    def date_to_days(self, startDay):
        dStart = datetime.strptime(startDay, "%Y-%m-%d")
        dEnd = datetime.strptime(self.data_attrs['date_min'], "%Y-%m-%d")
        
        return abs((dEnd - dStart).days)
    
    # Looks up the integer code assigned to a neighbourhood
    def neighbourhood_to_ordinal(self, neighbourhood):
        return self.data_attrs['neighbourhood_mapping'][neighbourhood]
    
# Despite its name, 'date_min' holds the most recent host_since date in the data
helper = Helper({
    'date_min': listingsAll.host_since.dropna().max(),
    'neighbourhood_mapping': neighbourhood_mapping,
})
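As a quick illustration of what these helpers do to individual values, the sketch below re-implements three of them as standalone functions and runs them on sample inputs. The function names and the sample strings are assumptions made for the example; the real columns may contain formats this naive parsing does not handle (for instance, an amenity name that itself contains a comma).

```python
from datetime import datetime

def parse_amenities(raw):
    """Turn a brace-delimited string into a list of amenity names."""
    cleaned = raw.strip('{}').replace('"', '')
    return [item for item in cleaned.split(',') if item]

def parse_price(raw):
    """Turn a price string such as '$1,250.00' into a float."""
    return float(raw.replace('$', '').replace(',', ''))

def days_between(day, reference_day, fmt="%Y-%m-%d"):
    """Absolute number of days between two date strings."""
    delta = datetime.strptime(reference_day, fmt) - datetime.strptime(day, fmt)
    return abs(delta.days)

# Hypothetical sample values
amenities = parse_amenities('{TV,"Wireless Internet",Kitchen}')
price = parse_price('$1,250.00')
tenure = days_between('2015-01-01', '2015-10-02')

print(amenities)
print(price)
print(tenure)
```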

#### Clean Processor
The class `cleanProcessor` handles two things: it applies the methods defined in `Helper` to its dataset, and it performs some additional processing such as null-filling.

In [136]:
class cleanProcessor():
    def __init__(self, data):
        self.data = data
        
    def clean_listings(self):
        df_clean = self.data.copy()
        
        # TODO: Find better techniques for filling nulls than these hard-coded defaults
        df_clean.loc[df_clean.review_scores_rating.isnull(), 'review_scores_rating'] = 90
        df_clean.loc[df_clean.host_since.isnull(), 'host_since'] = '2015-10-02'
        df_clean.loc[df_clean.bedrooms.isnull(), 'bedrooms'] = 0
        
        df_clean['amenities'] = df_clean['amenities'].apply(helper.amenities_to_list)
        df_clean['num_amenities'] = df_clean['amenities'].apply(helper.num_amenities)
        df_clean['price'] = df_clean['price'].apply(helper.price_to_int)
        df_clean['days_host'] = df_clean['host_since'].apply(helper.date_to_days)
        df_clean['neighbourhood_ordinal'] = df_clean['neighbourhood_cleansed'].apply(helper.neighbourhood_to_ordinal)
        
        # Binary encoding for neighborhoods
#         encoder = ce.BinaryEncoder(cols=['neighbourhood'])
        
        # One-hot encoding for cancellation policy
        cancellation_policy = pd.get_dummies(df_clean.cancellation_policy)
        # rename returns a new DataFrame, so the result must be assigned back
        cancellation_policy = cancellation_policy.rename(columns={
            'strict': 'cancellation_strict',
            'flexible': 'cancellation_flexible',
            'moderate': 'cancellation_moderate',
        })
        
        data = pd.concat([df_clean, cancellation_policy], axis=1)

        return data

    def clean_calendar(self):
        df_clean = self.data.copy()
        # Keep only the rows marked as available
        df_clean = df_clean[df_clean.available == 't'].copy()
        df_clean['price'] = df_clean['price'].apply(helper.price_to_int)
        return df_clean

clean_processor_listings = cleanProcessor(listingsAll)
clean_processor_calendar = cleanProcessor(calendarAll)
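The TODO in `clean_listings` asks for better null-filling techniques. One common option, sketched here on a made-up toy frame, is to fill numeric gaps with the column median instead of a hard-coded constant. This is an alternative we have not evaluated on the Airbnb data, not the method used above. The sketch also shows `pd.get_dummies` with its `prefix` argument, which names the one-hot columns directly.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; column names mirror the listings dataset
toy = pd.DataFrame({
    'review_scores_rating': [95.0, np.nan, 80.0, 90.0],
    'bedrooms': [1.0, 2.0, np.nan, 1.0],
    'cancellation_policy': ['strict', 'flexible', 'moderate', 'strict'],
})

filled = toy.copy()
for column in ['review_scores_rating', 'bedrooms']:
    # Replace missing values with the median of the observed values
    filled[column] = filled[column].fillna(filled[column].median())

# One-hot encode the policy; prefix yields names like 'cancellation_strict'
dummies = pd.get_dummies(filled['cancellation_policy'], prefix='cancellation')
result = pd.concat([filled.drop(columns='cancellation_policy'), dummies], axis=1)

print(result)
```

If the median is used in a real pipeline, it should be computed on the training split only and then applied to the test split, so that no information from the test data leaks into training.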

In [168]:
listingsClean = clean_processor_listings.clean_listings()

print(type(listingsClean))

print(listingsClean['neighbourhood_ordinal'].max() == 0)


encoder = ce.BinaryEncoder(cols=['neighbourhood_ordinal'])

encoder.fit(listingsClean)

encoder.transform(listingsClean)

<class 'pandas.core.frame.DataFrame'>
False
            id                            listing_url       scrape_id  \
0      1810172   https://www.airbnb.com/rooms/1810172  20151002231814   
1         6976      https://www.airbnb.com/rooms/6976  20151002231814   
2      3075044   https://www.airbnb.com/rooms/3075044  20151002231814   
3      4283698   https://www.airbnb.com/rooms/4283698  20151002231814   
4      4085362   https://www.airbnb.com/rooms/4085362  20151002231814   
5       225834    https://www.airbnb.com/rooms/225834  20151002231814   
6      7252607   https://www.airbnb.com/rooms/7252607  20151002231814   
7      1936861   https://www.airbnb.com/rooms/1936861  20151002231814   
8       225979    https://www.airbnb.com/rooms/225979  20151002231814   
9      2583074   https://www.airbnb.com/rooms/2583074  20151002231814   
10     6933545   https:

Unnamed: 0,neighbourhood_ordinal_0,neighbourhood_ordinal_1,neighbourhood_ordinal_2,neighbourhood_ordinal_3,neighbourhood_ordinal_4,neighbourhood_ordinal_5,neighbourhood_ordinal_6,neighbourhood_ordinal_7,neighbourhood_ordinal_8
0,1,0,0,1,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0
2,1,0,0,1,0,0,0,0,0
3,1,0,0,1,0,0,0,0,0
4,1,0,0,1,0,0,0,0,0
5,1,0,0,1,0,0,0,0,0
6,1,0,0,1,0,0,0,0,0
7,1,0,0,1,0,0,0,0,0
8,1,0,0,1,0,0,0,0,0
9,1,0,0,1,0,0,0,0,0
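
Binary encoding writes each category's integer code in base 2 and spreads the bits across columns, so a feature with many categories needs far fewer columns than one-hot encoding would. The following is a minimal pure-Python sketch of that idea. The function `binary_encode` is hypothetical, and `category_encoders.BinaryEncoder` may assign codes and order columns differently, so this is not meant to reproduce its exact output.

```python
def binary_encode(ordinals):
    """Encode non-negative integers as fixed-width lists of bits (most significant first)."""
    width = max(1, max(ordinals).bit_length())
    return [[(value >> shift) & 1 for shift in range(width - 1, -1, -1)]
            for value in ordinals]

codes = binary_encode([0, 1, 2, 5])
for row in codes:
    print(row)
```

With this scheme, k distinct categories need only about log2(k) columns.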


### Choosing Models

Now that we've done a good amount of initial preprocessing on our data, let's move towards some predictive modeling. Our goal is to develop a model that will predict a price given listing parameters. We'll experiment with including different features, and see which combination of features and models produces the best results.

After some initial research, we've decided to start off by trying four different models:

1. Linear Lasso

   The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of variables upon which the given solution depends.

2. ElasticNet

   ElasticNet is a linear regression model trained with both L1 and L2 regularization. This combination allows for learning a sparse model where few of the weights are non-zero, like Lasso, while still maintaining the regularization properties of Ridge.

3. Support Vector Regression

   Support Vector Regression is an implementation of regression that uses the SVM approach. In this case, we'll be using the linear kernel.

4. Ridge Regression

   Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of coefficients. The ridge coefficients minimize a penalized residual sum of squares.
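
For reference, the objectives minimized by the three penalized linear models can be written as follows, in the form given in the scikit-learn documentation. Here $X$ is the feature matrix, $y$ the prices, $w$ the coefficient vector, $n$ the number of samples, $\alpha$ the regularization strength (`alpha`), and $\rho$ the L1/L2 mixing parameter (`l1_ratio`).

$$
\begin{aligned}
\text{Lasso:} \quad & \min_{w} \; \frac{1}{2n} \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Ridge:} \quad & \min_{w} \; \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{ElasticNet:} \quad & \min_{w} \; \frac{1}{2n} \lVert Xw - y \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
$$

The L1 term is what drives some coefficients exactly to zero, while the L2 term only shrinks them.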
       



In [35]:
# Linear Lasso
lasso = linear_model.Lasso(alpha=0.1)

# ElasticNet
elasticNet = linear_model.ElasticNet(alpha=0.1, l1_ratio=0.7)

# Ridge Regression
ridgeRegression = linear_model.Ridge(alpha=0.5)

# Support Vector Regression (linear kernel; SVR would otherwise default to 'rbf')
svr = svm.SVR(kernel='linear', C=1.0, epsilon=0.2)

models = {'lasso': lasso, 'elasticNet': elasticNet, 'ridgeRegression': ridgeRegression, 'svr': svr}

### Testing Models

Now, let's define `ModelHelper`, a class that will facilitate the process of testing our models.

In [43]:
# In scikit-learn releases before 0.18 these functions live in sklearn.cross_validation
from sklearn.model_selection import cross_val_score, train_test_split

class ModelHelper():
    '''
    ModelHelper exposes a set of functions aimed at facilitating dataset
    manipulation (train-test splitting) and the model testing process
    '''

    def __init__(self, X, y):
        self.X = X
        self.y = y

    def cross_validate(self, model, cv=3):
        '''Cross-validates model on the stored data with 'cv' folds (default 3); returns the mean score'''
        return cross_val_score(model, self.X, self.y, cv=cv).mean()

    def train_test_splitter(self, model, train_size=0.5):
        '''Performs train-test split on data, trains on train, returns the split data and the fitted model'''
        X_train, X_test, y_train, y_test = train_test_split(self.X, self.y, train_size=train_size)
        model.fit(X_train, y_train)
        return X_train, X_test, y_train, y_test, model

    def test_models(self, models):
        '''Iterates over all models and prints each one's test score from train_test_splitter'''
        for modelName, model in models.items():
            print(modelName)
            self.test_model(model)

    def test_model(self, model):
        '''Tests one specific model with train_test_splitter; prints and returns its score on the test set'''
        X_train, X_test, y_train, y_test, model = self.train_test_splitter(model, train_size=0.5)
        score = model.score(X_test, y_test)
        print(score)
        return score
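
To show the two evaluation strategies that `ModelHelper` wraps, here is a small self-contained sketch on synthetic data. The data, the coefficients, and the `random_state` values are invented for the example and say nothing about how these models perform on the Airbnb listings.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression problem: the third feature has no effect on the target
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = X.dot(np.array([50.0, 20.0, 0.0])) + 100.0 + rng.normal(scale=5.0, size=200)

# Hold-out evaluation: fit on one half, score (R^2) on the other half
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=0)
model = Lasso(alpha=0.1).fit(X_train, y_train)
holdout_r2 = model.score(X_test, y_test)

# 3-fold cross-validation: average R^2 over three train/test partitions
cv_r2 = cross_val_score(Lasso(alpha=0.1), X, y, cv=3).mean()

print(holdout_r2)
print(cv_r2)
```

A single hold-out split is cheap but depends on which rows land in the test half. Cross-validation averages over several splits and usually gives a steadier estimate.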

In [93]:
# TODO: Define list of features

# print listingsClean[listingsClean.beds.isnull()].count()

features = ['days_host', 'num_amenities', 'bedrooms']

mHelper = ModelHelper(listingsClean[features], listingsClean.price)

mHelper.test_model(lasso)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
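
The `ValueError` above means that at least one value passed to the model is missing or non-finite. We have not yet identified which column is responsible. One way to narrow it down, sketched here on a made-up toy frame, is to count the NaN and infinite entries per numeric column before fitting. The helper name `report_non_finite` is hypothetical.

```python
import numpy as np
import pandas as pd

def report_non_finite(frame):
    """Count NaN or infinite entries in each numeric column of a DataFrame."""
    numeric = frame.select_dtypes(include=[np.number])
    bad = ~np.isfinite(numeric.values.astype(float))
    return pd.Series(bad.sum(axis=0), index=numeric.columns)

# Toy data invented for illustration; column names mirror the feature list
toy = pd.DataFrame({
    'days_host': [10.0, np.nan, 30.0],
    'num_amenities': [5, 7, 9],
    'bedrooms': [1.0, 2.0, np.inf],
})
counts = report_non_finite(toy)
print(counts)
```

The same check should be run on the target (`price`) as well as the features, since scikit-learn validates both.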