# RentHop: Rental Listing Inquiries

url = https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/

### Understanding the Question

Given a set of features for a rental listing, we are to predict how much interest (low, medium, or high) the listing will receive.  The training data is labeled.  Per the competition rules, predictions must be submitted as class probabilities.

This is a supervised classification problem.
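Because the predictions are class probabilities, the models later in this notebook are scored with log loss (scikit-learn's 'neg_log_loss').  The following is a minimal sketch of that metric on made-up numbers; the helper name multiclass_log_loss and the toy probabilities are illustrative assumptions, not part of the competition code.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: integer class indices, shape (n,)
    probs:  predicted class probabilities, shape (n, k)
    """
    # Clip so that log(0) can never occur
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    # Renormalise so each row still sums to 1 after clipping
    probs = probs / probs.sum(axis=1, keepdims=True)
    rows = np.arange(len(y_true))
    return -np.mean(np.log(probs[rows, y_true]))

# Toy example: three listings, columns ordered (high, low, medium)
y_true = np.array([1, 2, 0])
probs = np.array([
    [0.1, 0.7, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
])
loss = multiclass_log_loss(y_true, probs)

# Baseline: predicting 1/3 for every class gives ln(3)
uniform = multiclass_log_loss(y_true, np.full((3, 3), 1 / 3))
```

A model that always predicts equal probabilities scores ln(3), about 1.0986, so any useful model should land well below that.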

### Getting Started - Load & Inspect Data

The data is available on Kaggle at https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/data.  We are given 14 features in our data set, and the label column is called 'interest_level'.

In [1]:
import pandas as pd
import numpy as np

train_df = pd.read_json('train.json')
train_df.head()

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,interest_level,latitude,listing_id,longitude,manager_id,photos,price,street_address
10,1.5,3,53a5b119ba8f7b61d4e010512e0dfc85,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,[],medium,40.7145,7211212,-73.9425,5ba989232d0489da1b5f2c45f6688adc,[https://photos.renthop.com/2/7211212_1ed4542e...,3000,792 Metropolitan Avenue
10000,1.0,2,c5c8a357cba207596b04d1afd1e4f130,2016-06-12 12:19:27,,Columbus Avenue,"[Doorman, Elevator, Fitness Center, Cats Allow...",low,40.7947,7150865,-73.9667,7533621a882f71e25173b27e3139d83d,[https://photos.renthop.com/2/7150865_be3306c5...,5465,808 Columbus Avenue
100004,1.0,1,c3ba40552e2120b0acfc3cb5730bb2aa,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,"[Laundry In Building, Dishwasher, Hardwood Flo...",high,40.7388,6887163,-74.0018,d9039c43983f6e564b1482b273bd7b01,[https://photos.renthop.com/2/6887163_de85c427...,2850,241 W 13 Street
100007,1.0,1,28d9ad350afeaab8027513a3e52ac8d5,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,"[Hardwood Floors, No Fee]",low,40.7539,6888711,-73.9677,1067e078446a7897d2da493d2f741316,[https://photos.renthop.com/2/6888711_6e660cee...,3275,333 East 49th Street
100013,1.0,4,0,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,[Pre-War],low,40.8241,6934781,-73.9493,98e13ad4b495b9613cef886d79a6291f,[https://photos.renthop.com/2/6934781_1fa4b41a...,3350,500 West 143rd Street


In [2]:
train_df.shape

(49352, 15)

In [3]:
train_df.dtypes

bathrooms          float64
bedrooms             int64
building_id         object
created             object
description         object
display_address     object
features            object
interest_level      object
latitude           float64
listing_id           int64
longitude          float64
manager_id          object
photos              object
price                int64
street_address      object
dtype: object

In [4]:
#Check for NaNs
train_df.isnull().sum()

bathrooms          0
bedrooms           0
building_id        0
created            0
description        0
display_address    0
features           0
interest_level     0
latitude           0
listing_id         0
longitude          0
manager_id         0
photos             0
price              0
street_address     0
dtype: int64

### Feature Engineering

Below I iteratively develop a location box for the listings.  This code then serves as the basis for my function add_price_vs_loc_avg().

In [53]:
#Create a location zone based on same rounded latitude and longitude
train_df['lat_round'] = train_df.apply(lambda x : round(x['latitude'],2), axis=1)
train_df['lon_round'] = train_df.apply(lambda x : round(x['longitude'],2), axis=1)
train_df['loc'] = train_df.apply(lambda x : tuple([x['lat_round'], x['lon_round']]), axis=1)
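Rounding to two decimal places groups listings into cells of 0.01 degrees, which is roughly 1.1 km north to south.  The same zones can probably be built without row-wise apply calls; the sketch below uses a small hypothetical frame named toy, with coordinates copied from the first rows shown earlier.

```python
import pandas as pd

# Toy coordinates taken from the first rows of the listing data shown earlier
toy = pd.DataFrame({
    'latitude':  [40.7145, 40.7947, 40.7388],
    'longitude': [-73.9425, -73.9667, -74.0018],
})

# Series.round works on the whole column at once, no row-wise apply needed
toy['lat_round'] = toy['latitude'].round(2)
toy['lon_round'] = toy['longitude'].round(2)

# Pair the rounded values into hashable (lat, lon) tuples
toy['loc'] = list(zip(toy['lat_round'], toy['lon_round']))
```

On the full training set this should produce the same 'loc' values as the apply version while running considerably faster.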

In [55]:
print(len(train_df['loc'].unique()))
train_df['loc'].value_counts()

460


(40.78, -73.95)    3052
(40.71, -74.01)    2522
(40.74, -73.98)    1949
(40.77, -73.95)    1841
(40.73, -73.98)    1591
(40.73, -73.99)    1547
(40.76, -73.99)    1467
(40.75, -73.97)    1443
(40.76, -74.0)     1412
(40.75, -73.98)    1366
(40.77, -73.99)    1271
(40.77, -73.96)    1258
(40.76, -73.96)    1212
(40.74, -74.0)     1183
(40.73, -74.0)     1177
(40.78, -73.98)    1118
(40.74, -73.99)     909
(40.79, -73.97)     869
(40.8, -73.97)      864
(40.75, -73.99)     813
(40.72, -73.99)     774
(40.76, -73.98)     753
(40.76, -73.97)     739
(40.8, -73.96)      681
(40.75, -74.0)      634
(40.77, -73.98)     620
(40.72, -73.98)     602
(40.72, -74.0)      592
(40.74, -73.97)     559
(40.79, -73.98)     491
                   ... 
(41.75, -87.61)       1
(40.72, -73.83)       1
(40.63, -74.09)       1
(40.63, -73.95)       1
(44.88, -93.27)       1
(40.68, -73.88)       1
(40.83, -73.82)       1
(40.82, -73.93)       1
(40.71, -73.79)       1
(40.81, -73.83)       1
(40.66, -74.02) 

In [61]:
#Get Average Price for each location
'''
This code runs slowly.  One way to speed it up is to build a hash table with the averages calculated once, and then just
look them up with this lambda function.

for each loc in locations:
    loc_dict[loc] = avg of price
lambda x : loc_dict[x['loc']]
'''
train_df['AvgLocPrice'] = train_df.apply(lambda x: train_df['price'][train_df['loc']==x['loc']].mean(), axis=1)
train_df.head()

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,interest_level,latitude,listing_id,longitude,manager_id,photos,price,street_address,lat_round,lon_round,loc,AvgLocPrice
10,1.5,3,53a5b119ba8f7b61d4e010512e0dfc85,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,[],medium,40.7145,7211212,-73.9425,5ba989232d0489da1b5f2c45f6688adc,[https://photos.renthop.com/2/7211212_1ed4542e...,3000,792 Metropolitan Avenue,40.71,-73.94,"(40.71, -73.94)",2583.075
10000,1.0,2,c5c8a357cba207596b04d1afd1e4f130,2016-06-12 12:19:27,,Columbus Avenue,"[Doorman, Elevator, Fitness Center, Cats Allow...",low,40.7947,7150865,-73.9667,7533621a882f71e25173b27e3139d83d,[https://photos.renthop.com/2/7150865_be3306c5...,5465,808 Columbus Avenue,40.79,-73.97,"(40.79, -73.97)",4723.904488
100004,1.0,1,c3ba40552e2120b0acfc3cb5730bb2aa,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,"[Laundry In Building, Dishwasher, Hardwood Flo...",high,40.7388,6887163,-74.0018,d9039c43983f6e564b1482b273bd7b01,[https://photos.renthop.com/2/6887163_de85c427...,2850,241 W 13 Street,40.74,-74.0,"(40.74, -74.0)",3925.238377
100007,1.0,1,28d9ad350afeaab8027513a3e52ac8d5,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,"[Hardwood Floors, No Fee]",low,40.7539,6888711,-73.9677,1067e078446a7897d2da493d2f741316,[https://photos.renthop.com/2/6888711_6e660cee...,3275,333 East 49th Street,40.75,-73.97,"(40.75, -73.97)",3988.125433
100013,1.0,4,0,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,[Pre-War],low,40.8241,6934781,-73.9493,98e13ad4b495b9613cef886d79a6291f,[https://photos.renthop.com/2/6934781_1fa4b41a...,3350,500 West 143rd Street,40.82,-73.95,"(40.82, -73.95)",2470.606452
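The lookup-table idea from the comment in the cell above maps naturally onto a pandas groupby.  This is a sketch on a hypothetical five-row frame (toy), not the code that produced the table above: transform('mean') computes each zone's average once and broadcasts it back to every row in that zone.

```python
import pandas as pd

toy = pd.DataFrame({
    'lat_round': [40.71, 40.71, 40.79, 40.79, 40.79],
    'lon_round': [-73.94, -73.94, -73.97, -73.97, -73.97],
    'price':     [3000, 2000, 5000, 4000, 6000],
})

# Compute each zone's mean once and broadcast it back to every row in the zone
toy['AvgLocPrice'] = (
    toy.groupby(['lat_round', 'lon_round'])['price'].transform('mean')
)

# Ratio of each listing's price to its zone average
toy['PriceVsLocAvg'] = toy['price'] / toy['AvgLocPrice']
```

Grouping on the two rounded columns avoids depending on the tuple-valued 'loc' column, and replaces one full scan of the frame per row with a single pass.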


In [63]:
#Create feature comparing price to its location's average price
#This could be made more accurate by using a mean price per room per location
train_df['PriceVsLocAvg'] = train_df['price'] / train_df['AvgLocPrice']
#Prices above the location average give a ratio > 1.0; prices below it give a ratio < 1.0
train_df.head()

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,interest_level,latitude,listing_id,longitude,manager_id,photos,price,street_address,lat_round,lon_round,loc,AvgLocPrice,PriceVsLocAvg
10,1.5,3,53a5b119ba8f7b61d4e010512e0dfc85,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,[],medium,40.7145,7211212,-73.9425,5ba989232d0489da1b5f2c45f6688adc,[https://photos.renthop.com/2/7211212_1ed4542e...,3000,792 Metropolitan Avenue,40.71,-73.94,"(40.71, -73.94)",2583.075,1.161406
10000,1.0,2,c5c8a357cba207596b04d1afd1e4f130,2016-06-12 12:19:27,,Columbus Avenue,"[Doorman, Elevator, Fitness Center, Cats Allow...",low,40.7947,7150865,-73.9667,7533621a882f71e25173b27e3139d83d,[https://photos.renthop.com/2/7150865_be3306c5...,5465,808 Columbus Avenue,40.79,-73.97,"(40.79, -73.97)",4723.904488,1.156882
100004,1.0,1,c3ba40552e2120b0acfc3cb5730bb2aa,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,"[Laundry In Building, Dishwasher, Hardwood Flo...",high,40.7388,6887163,-74.0018,d9039c43983f6e564b1482b273bd7b01,[https://photos.renthop.com/2/6887163_de85c427...,2850,241 W 13 Street,40.74,-74.0,"(40.74, -74.0)",3925.238377,0.726071
100007,1.0,1,28d9ad350afeaab8027513a3e52ac8d5,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,"[Hardwood Floors, No Fee]",low,40.7539,6888711,-73.9677,1067e078446a7897d2da493d2f741316,[https://photos.renthop.com/2/6888711_6e660cee...,3275,333 East 49th Street,40.75,-73.97,"(40.75, -73.97)",3988.125433,0.821188
100013,1.0,4,0,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,[Pre-War],low,40.8241,6934781,-73.9493,98e13ad4b495b9613cef886d79a6291f,[https://photos.renthop.com/2/6934781_1fa4b41a...,3350,500 West 143rd Street,40.82,-73.95,"(40.82, -73.95)",2470.606452,1.355942


In [64]:
def add_price_vs_loc_avg(df, per_room=True, per_listing=False):
    '''(DataFrame, bool, bool) -> DataFrame
    per_room will use 'PricePerRoom' as base (the column must already exist) and add 'PricePerRoomVsLocAvg'
    per_listing will use 'price' as base and add 'PriceVsLocAvg'
    
    Will add the selected ratio columns to the current DataFrame.
    '''
    #Build Location
    df['lat_round'] = df.apply(lambda x : round(x['latitude'],2), axis=1)
    df['lon_round'] = df.apply(lambda x : round(x['longitude'],2), axis=1)
    df['loc'] = df.apply(lambda x : tuple([x['lat_round'], x['lon_round']]), axis=1)
    if per_room:
        df['AvgLocPricePerRoom'] = df.apply(lambda x: df['PricePerRoom'][df['loc']==x['loc']].mean(), axis=1)
        df['PricePerRoomVsLocAvg'] = df['PricePerRoom'] / df['AvgLocPricePerRoom']
    if per_listing:
        df['AvgLocPrice'] = df.apply(lambda x: df['price'][df['loc']==x['loc']].mean(), axis=1)
        df['PriceVsLocAvg'] = df['price'] / df['AvgLocPrice']
       
    return df

In [65]:
def add_features(df):
    '''(DataFrame) -> DataFrame
    
    Will add new features to the current DataFrame.
    '''
    #Create # of Photos Column
    df['NumPhotos'] = df.photos.str.len()
    #Create # of Features Column
    df['NumFeatures'] = df.features.str.len()
    df['NumDescription'] = df.description.str.len()
    #Total Rooms
    df['TotalRooms'] = df['bathrooms'] + df['bedrooms']
    #Room / Price
    #Add one to all - assume every apartment has at least 1 room (studios)
    df['PricePerRoom'] = df['price'] / (df['TotalRooms'] + 1.0)
    df['PricePerBedRoom'] = df['price'] / (df['bedrooms'] + 1.0)
    #Add Price vs Loc
    df = add_price_vs_loc_avg(df)
    return df
    
train_df = add_features(train_df)
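To see what the count and per-room features look like, here is a small sketch on a hypothetical two-listing frame (toy, with made-up values).  It mirrors the column definitions in add_features() but leaves out the location step.

```python
import pandas as pd

toy = pd.DataFrame({
    'photos': [['a.jpg', 'b.jpg'], []],
    'features': [['Doorman', 'Elevator', 'No Fee'], ['Pre-War']],
    'description': ['Sunny studio', ''],
    'bathrooms': [1.0, 1.5],
    'bedrooms': [0, 3],
    'price': [2000, 3000],
})

# .str.len() returns the length of each element, for lists as well as strings
toy['NumPhotos'] = toy['photos'].str.len()
toy['NumFeatures'] = toy['features'].str.len()
toy['NumDescription'] = toy['description'].str.len()

toy['TotalRooms'] = toy['bathrooms'] + toy['bedrooms']
# +1 keeps the denominator non-zero for studios (0 bedrooms)
toy['PricePerBedRoom'] = toy['price'] / (toy['bedrooms'] + 1.0)
toy['PricePerRoom'] = toy['price'] / (toy['TotalRooms'] + 1.0)
```

The studio in the first row has zero bedrooms, so without the added 1.0 its PricePerBedRoom would be a division by zero.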

### Prepare Data for ML & Transform Test Features


#### Apply same transforms to test features

In [66]:
from sklearn.preprocessing import LabelEncoder

#Load test data
test_df = pd.read_json('test.json')
#Add engineered features
test_df = add_features(test_df)

#ENCODE TEXT FEATURES
#Combine the train and test columns
manager_combo = pd.concat([train_df['manager_id'], test_df['manager_id']])
building_combo = pd.concat([train_df['building_id'], test_df['building_id']])
loc_combo = pd.concat([train_df['loc'], test_df['loc']])
#Encode building_id
le_building = LabelEncoder()
le_building.fit(building_combo)
#Transform Train & Test set
train_df['BuildingID'] = le_building.transform(train_df['building_id'])
test_df['BuildingID'] = le_building.transform(test_df['building_id'])
#Encode manager_id
le_manager = LabelEncoder()
le_manager.fit(manager_combo)
#Transform Train & Test set
train_df['ManagerID'] = le_manager.transform(train_df['manager_id'])
test_df['ManagerID'] = le_manager.transform(test_df['manager_id'])
#Encode loc
le_loc = LabelEncoder()
le_loc.fit(loc_combo)
#Transform Train & Test set
train_df['LocID'] = le_loc.transform(train_df['loc'])
test_df['LocID'] = le_loc.transform(test_df['loc'])

#Inspect to verify
test_df.head()

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,listing_id,longitude,...,PricePerRoom,PricePerBedRoom,lat_round,lon_round,loc,AvgLocPricePerRoom,PricePerRoomVsLocAvg,BuildingID,ManagerID,LocID
0,1.0,1,79780be1514f645d7e6be99a3de696c5,2016-06-11 05:29:41,Large with awesome terrace--accessible via bed...,Suffolk Street,"[Elevator, Laundry in Building, Laundry in Uni...",40.7185,7142618,-73.9865,...,983.333333,1475.0,40.72,-73.99,"(40.72, -73.99)",1096.100837,0.897119,5535,3076,264
1,1.0,2,0,2016-06-24 06:36:34,Prime Soho - between Bleecker and Houston - Ne...,Thompson Street,"[Pre-War, Dogs Allowed, Cats Allowed]",40.7278,7210040,-74.0,...,712.5,950.0,40.73,-74.0,"(40.73, -74.0)",1213.44046,0.587173,0,3593,292
100,1.0,1,3dbbb69fd52e0d25131aa1cd459c87eb,2016-06-03 04:29:40,New York chic has reached a new level ...,101 East 10th Street,"[Doorman, Elevator, No Fee]",40.7306,7103890,-73.989,...,1252.666667,1879.0,40.73,-73.99,"(40.73, -73.99)",1162.180848,1.077859,2813,2677,293
1000,1.0,2,783d21d013a7e655bddc4ed0d461cc5e,2016-06-11 06:17:35,Step into this fantastic new Construction in t...,South Third Street\r,"[Roof Deck, Balcony, Elevator, Laundry in Buil...",40.7109,7143442,-73.9571,...,825.0,1100.0,40.71,-73.96,"(40.71, -73.96)",816.200305,1.010781,5477,201,235
100000,2.0,2,6134e7c4dd1a98d9aee36623c9872b49,2016-04-12 05:24:17,"~Take a stroll in Central Park, enjoy the ente...","Midtown West, 8th Ave","[Common Outdoor Space, Cats Allowed, Dogs Allo...",40.765,6860601,-73.9845,...,980.0,1633.333333,40.77,-73.98,"(40.77, -73.98)",2060.54958,0.475601,4428,3157,384
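The encoders above are fit on the combined train and test columns because LabelEncoder rejects values it did not see during fit.  A minimal sketch with made-up manager IDs (the names train_ids, test_ids, and unseen_ok are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_ids = pd.Series(['m1', 'm2', 'm1'])
test_ids = pd.Series(['m2', 'm3'])   # 'm3' never appears in train

# Fitting on train alone leaves 'm3' unknown to the encoder
le_train_only = LabelEncoder().fit(train_ids)
try:
    le_train_only.transform(test_ids)
    unseen_ok = True
except ValueError:
    unseen_ok = False

# Fitting on the union of both sets gives every ID a code
le_both = LabelEncoder().fit(pd.concat([train_ids, test_ids]))
train_codes = le_both.transform(train_ids)
test_codes = le_both.transform(test_ids)
```

Note that the resulting integers are arbitrary identifiers assigned in sorted order; their numeric order carries no meaning about the managers or buildings themselves.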


In [166]:
#Select Features
feature_cols = ['price', 'PricePerRoom', 'PricePerRoomVsLocAvg', 'BuildingID', 'NumDescription', 'ManagerID', 'NumPhotos',
               'NumFeatures', 'latitude', 'longitude', 'bedrooms', 'bathrooms']

#Prepare data for ML
X_train = train_df[feature_cols].values
X_test = test_df[feature_cols].values

#Encode 'interest_level' to numerical
le_interest = LabelEncoder()
train_df['IL'] = le_interest.fit_transform(train_df['interest_level'])
#Set Train Y
Y = train_df['IL'].values
#Inspect to verify
Y [:10]

array([2, 1, 0, 1, 1, 2, 1, 1, 2, 1], dtype=int64)

In [167]:
#Find important features
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(X_train, Y)
importances = list(zip(model.feature_importances_, feature_cols))
importances = pd.DataFrame(importances)
print(importances.sort_values(0, ascending=False))

           0                     1
1   0.115857          PricePerRoom
0   0.112657                 price
3   0.106013            BuildingID
2   0.105115  PricePerRoomVsLocAvg
4   0.097008        NumDescription
5   0.088695             ManagerID
8   0.082950              latitude
6   0.082042             NumPhotos
9   0.080702             longitude
7   0.078841           NumFeatures
10  0.033062              bedrooms
11  0.017060             bathrooms


In [9]:
#Get Label encodings for reference later
le_interest.classes_

array([u'high', u'low', u'medium'], dtype=object)
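LabelEncoder stores its classes in sorted order, which is why 'high' maps to 0, 'low' to 1, and 'medium' to 2.  A small sketch of recovering that mapping and decoding predictions back to labels (the variable names here are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['medium', 'low', 'high', 'low'])

# classes_ is sorted, so position i is the label that was encoded as i
label_to_code = {label: int(code) for code, label in enumerate(le.classes_)}

# inverse_transform turns integer codes back into the original labels
decoded = le.inverse_transform(codes)
```

The same ordering determines the column order of predict_proba later on, so it is worth checking before building a submission.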

In [143]:
train_df[feature_cols].corr()

Unnamed: 0,price,PricePerRoom,PricePerRoomVsLocAvg,BuildingID,NumDescription,ManagerID,NumPhotos,NumFeatures,latitude,longitude,bedrooms,bathrooms
price,1.0,0.987904,0.835022,0.006233,0.009144,0.005894,0.004559,0.024273,-0.000707,-8.7e-05,0.051788,0.069661
PricePerRoom,0.987904,1.0,0.883964,0.00439,-5.7e-05,0.004326,-0.004742,0.01088,-0.000162,-0.000976,-0.019012,0.003441
PricePerRoomVsLocAvg,0.835022,0.883964,1.0,-0.003796,0.004985,0.006219,0.004858,0.015879,-5e-06,-4.7e-05,-0.056443,0.004193
BuildingID,0.006233,0.00439,-0.003796,1.0,0.062359,0.012295,0.069775,0.125244,-0.004357,2.9e-05,0.032185,0.022526
NumDescription,0.009144,-5.7e-05,0.004985,0.062359,1.0,-0.023366,0.212694,0.436028,-0.003696,-0.000811,0.11141,0.150461
ManagerID,0.005894,0.004326,0.006219,0.012295,-0.023366,1.0,0.012403,-0.012124,-0.005478,0.004074,0.006979,0.022955
NumPhotos,0.004559,-0.004742,0.004858,0.069775,0.212694,0.012403,1.0,0.158999,-0.008221,0.00574,0.154515,0.14798
NumFeatures,0.024273,0.01088,0.015879,0.125244,0.436028,-0.012124,0.158999,1.0,0.000833,-0.008338,0.129996,0.230389
latitude,-0.000707,-0.000162,-5e-06,-0.004357,-0.003696,-0.005478,-0.008221,0.000833,1.0,-0.966807,-0.004745,-0.009657
longitude,-8.7e-05,-0.000976,-4.7e-05,2.9e-05,-0.000811,0.004074,0.00574,-0.008338,-0.966807,1.0,0.006892,0.010393


### Test ML Algorithm

#### Random Forest Classifier

In [169]:
#RandomForest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

#Initialize Model
rf = RandomForestClassifier(n_estimators=10000, min_samples_split=20, criterion='entropy', n_jobs=-1)
#Create KFold (no shuffling, so the folds are deterministic and random_state is not needed)
kfold = KFold(n_splits=5)
cross_val_results = cross_val_score(rf, X_train, Y, cv=kfold, scoring='neg_log_loss')
print(cross_val_results.mean())

-0.589088190786
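The score is negative because scikit-learn scorers follow a "higher is better" convention, so 'neg_log_loss' reports the log loss with its sign flipped.  The sketch below illustrates this on synthetic data from make_classification, which stands in for the listing features; the dataset sizes and model settings are arbitrary assumptions chosen to run quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic 3-class data standing in for the listing features
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

model = RandomForestClassifier(n_estimators=50, min_samples_split=20,
                               random_state=0)
# shuffle=True is what makes random_state meaningful for KFold
folds = KFold(n_splits=5, shuffle=True, random_state=5)
scores = cross_val_score(model, X, y, cv=folds, scoring='neg_log_loss')

# Flip the sign to read the result as an ordinary log loss (lower is better)
mean_log_loss = -scores.mean()
```

Read this way, the cross-validated result above corresponds to a log loss of about 0.589.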


#### Grid Search for best parameters

In [149]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'min_samples_split' : [2, 4, 10, 20, 40],
    'criterion' : ['gini', 'entropy'],
    'max_features' : ['auto', 'log2', None]
}

rf100 = RandomForestClassifier(n_estimators=100, n_jobs=-1)
grid_search = GridSearchCV(estimator=rf100, param_grid=param_grid, cv=5, scoring='neg_log_loss')
grid_search.fit(X_train, Y)

print("Best Score: %f" % grid_search.best_score_)
print(grid_search.best_params_)

Best Score: -0.592775
{'max_features': 'auto', 'min_samples_split': 20, 'criterion': 'entropy'}
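Beyond the single best combination, the fitted search object keeps the score of every combination in cv_results_, which helps show how close the runners-up were.  A sketch on synthetic data with a deliberately tiny grid (small_grid and the dataset are assumptions for illustration, not the grid used above):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

small_grid = {'min_samples_split': [2, 20], 'criterion': ['gini', 'entropy']}
search = GridSearchCV(RandomForestClassifier(n_estimators=20, random_state=0),
                      param_grid=small_grid, cv=3, scoring='neg_log_loss')
search.fit(X, y)

# One row per parameter combination, best (least negative) score first
results = (pd.DataFrame(search.cv_results_)
           [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
           .sort_values('rank_test_score'))
```

If the top few rows differ by less than their std_test_score, the choice between them is unlikely to matter much.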


### Train Model & Make First Submission


In [170]:
#Random Forest
rf = RandomForestClassifier(n_estimators=1000, min_samples_split=20, criterion='entropy', n_jobs=-1)
rf.fit(X_train, Y)
prediction_probabilites = rf.predict_proba(X_test)

In [171]:
#Check out feature importance
importances = list(zip(rf.feature_importances_, feature_cols))
importances = pd.DataFrame(importances)
importances

Unnamed: 0,0,1
0,0.129942,price
1,0.13925,PricePerRoom
2,0.125578,PricePerRoomVsLocAvg
3,0.119336,BuildingID
4,0.088121,NumDescription
5,0.072249,ManagerID
6,0.05984,NumPhotos
7,0.060354,NumFeatures
8,0.086112,latitude
9,0.082111,longitude


In [172]:
#Submission must be - listing_id, high, medium, low
#The column index of our probabilities comes from the label encoder earlier (0=high, 1=low, 2=medium)
submission_df = pd.DataFrame({'listing_id':test_df['listing_id'], 'high':prediction_probabilites[:, 0],
                             'medium':prediction_probabilites[:, 2], 'low':prediction_probabilites[:, 1]})
#Re-Order Columns for submission
cols = ['listing_id', 'high', 'medium', 'low']
submission_df = submission_df[cols]
submission_df.head()

Unnamed: 0,listing_id,high,medium,low
0,7142618,0.075037,0.44087,0.484094
1,7210040,0.037448,0.132005,0.830546
100,7103890,0.02418,0.137406,0.838414
1000,7143442,0.072835,0.310256,0.616909
100000,6860601,0.059734,0.270873,0.669393
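Hard-coding the column indices works, but it is easy to swap 'low' and 'medium' by mistake.  One alternative, sketched here with stand-in arrays (classes, proba, and listing_ids are illustrative, with the two listing IDs copied from the table above), is to label the probability columns by class name and then reorder them.

```python
import numpy as np
import pandas as pd

# Stand-ins for le_interest.classes_ and the output of predict_proba
classes = np.array(['high', 'low', 'medium'])
proba = np.array([
    [0.10, 0.60, 0.30],
    [0.05, 0.80, 0.15],
])
listing_ids = [7142618, 7210040]

# Label the probability columns by class name, then reorder for submission
sub = pd.DataFrame(proba, columns=classes)
sub.insert(0, 'listing_id', listing_ids)
sub = sub[['listing_id', 'high', 'medium', 'low']]

# Each row of class probabilities should sum to 1
row_sums = sub[['high', 'medium', 'low']].sum(axis=1)
```

Checking that every row sums to 1 is a cheap sanity test to run before writing the CSV.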


In [173]:
#Verify all is well (no NaNs)
submission_df.isnull().sum()

listing_id    0
high          0
medium        0
low           0
dtype: int64

In [174]:
#Write to CSV for submission
submission_df.to_csv('rf.csv', index=False)

### Conclusion

This random forest scored 0.59337 on Kaggle, in line with the cross-validated log loss of roughly 0.589 above.
