# RentHop: Rental Listing Inquiries

url = https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/

### Understanding the Question

Given a set of features for a rental listing, we are to predict how much interest (low, medium, high) a rental listing will receive.  We are given labels for our data.  Our predictions should be represented as class probability (as per the competition rules).

This is a supervised classification problem.

### Getting Started - Load & Inspect Data

The data is available on kaggle at https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/data.  We are given 14 features in our data set and the label column is called 'interest_level'.

In [34]:
import pandas as pd
import numpy as np

train_df = pd.read_json('train.json')
train_df.head()

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,interest_level,latitude,listing_id,longitude,manager_id,photos,price,street_address
10,1.5,3,53a5b119ba8f7b61d4e010512e0dfc85,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,[],medium,40.7145,7211212,-73.9425,5ba989232d0489da1b5f2c45f6688adc,[https://photos.renthop.com/2/7211212_1ed4542e...,3000,792 Metropolitan Avenue
10000,1.0,2,c5c8a357cba207596b04d1afd1e4f130,2016-06-12 12:19:27,,Columbus Avenue,"[Doorman, Elevator, Fitness Center, Cats Allow...",low,40.7947,7150865,-73.9667,7533621a882f71e25173b27e3139d83d,[https://photos.renthop.com/2/7150865_be3306c5...,5465,808 Columbus Avenue
100004,1.0,1,c3ba40552e2120b0acfc3cb5730bb2aa,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,"[Laundry In Building, Dishwasher, Hardwood Flo...",high,40.7388,6887163,-74.0018,d9039c43983f6e564b1482b273bd7b01,[https://photos.renthop.com/2/6887163_de85c427...,2850,241 W 13 Street
100007,1.0,1,28d9ad350afeaab8027513a3e52ac8d5,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,"[Hardwood Floors, No Fee]",low,40.7539,6888711,-73.9677,1067e078446a7897d2da493d2f741316,[https://photos.renthop.com/2/6888711_6e660cee...,3275,333 East 49th Street
100013,1.0,4,0,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,[Pre-War],low,40.8241,6934781,-73.9493,98e13ad4b495b9613cef886d79a6291f,[https://photos.renthop.com/2/6934781_1fa4b41a...,3350,500 West 143rd Street


In [35]:
train_df.shape

(49352, 15)

In [36]:
train_df.dtypes

bathrooms          float64
bedrooms             int64
building_id         object
created             object
description         object
display_address     object
features            object
interest_level      object
latitude           float64
listing_id           int64
longitude          float64
manager_id          object
photos              object
price                int64
street_address      object
dtype: object

In [37]:
#Check for NaNs
train_df.isnull().sum()

bathrooms          0
bedrooms           0
building_id        0
created            0
description        0
display_address    0
features           0
interest_level     0
latitude           0
listing_id         0
longitude          0
manager_id         0
photos             0
price              0
street_address     0
dtype: int64

### Feature Engineering

Here I develop various features from the data set.

##### Date Features - Month, Day, Hour
Perhaps the timing of the listing can predict how popular it will be.  Leases tend to start on the 1st of the month maybe the day of the listing will matter.  Also summers tend to see more transitions as school ends and students graduate into new jobs or internships.

##### Price Vs Location Avg
This rounds off the latitude and longitude to make a location 'box'.  Then it calculates the average price per room of all listings in that area.  We then divide the actual price per room by the average price per room in the area to give a ratio of over or under priced for that location.

##### Manager Skill
Heavily inspired from den3b's notebook @ https://www.kaggle.com/den3b81/two-sigma-connect-rental-listing-inquiries/improve-perfomances-using-manager-features.  This is a score for all managers in the training set with at least 30 listings.  The score is based on the % of their listings that are at the 3 interest levels.


In [38]:
#Date Features
def add_date_features(df):
    '''(DataFrame) -> DataFrame
    
    Will add some specific columns based on the date
    the listing was created.
    '''
    #Convert to datetime to make extraction easier
    df['created'] = pd.to_datetime(df['created'])
    #Extract features
    df['created_month'] = df['created'].dt.month
    df['created_day'] = df['created'].dt.day
    df['created_hour'] = df['created'].dt.hour
    return df

In [39]:
def compute_manager_skill(train_df):
    '''(DataFrame) -> DataFrame
    
    Given the training data, build a column for manager skill.
    Return this dataframe with manager skill.  Only compute skill
    for managers with 30+ listings.
    '''
    #Get dummies creates new binary columns for the categories in 'interest level' 
    #This creates 3 new cols = low, medium, high, with value 0 or 1
    dummies = pd.get_dummies(train_df['interest_level'])
    #Build new temporary dataframe
    man_skill = pd.concat([train_df['manager_id'], dummies], axis=1)
    #Get mean and total count for each manager
    man_skill = pd.concat([man_skill.groupby('manager_id').mean(), man_skill.groupby('manager_id').count()], axis=1).iloc[:,:-2] #remove extra count cols
    man_skill.columns = ['low', 'medium', 'high', 'count']
    man_skill = man_skill.sort_values(by='count', ascending=False)
    #Using man_skill['count'].describe(percentiles=[.8, .9, .95])
    #looks like 10% about have 30 or more listings, that seems like a fair sample size to judge a managers skill
    man_skill = man_skill[man_skill['count'] >= 30]
    #Compute skill as average * weighting -> 0 for low, 1 for medium, 2 for high
    #This inspired from den3b's notebook @
    #https://www.kaggle.com/den3b81/two-sigma-connect-rental-listing-inquiries/improve-perfomances-using-manager-features
    man_skill['skill'] = man_skill['medium']*1 + man_skill['high']*2
    man_skill['manager_id'] = man_skill.index

    return man_skill

In [40]:
#Manager Skill
def add_manager_skill(data_df, man_skill_df):
    '''(DataFrame, DataFrame) -> DataFrame
    
    Will add the skill columns to testing/ training sets
    only for managers that in the training set have over 30 listings.
    This info is passed from the man_skill_df
    '''
    #Now add Man_skill to train set
    data_df = data_df.merge(man_skill, how='left', left_on='manager_id', right_on='manager_id')
    data_df = data_df.drop('low', 1)
    data_df = data_df.drop('medium', 1)
    data_df = data_df.drop('high', 1)
    data_df.fillna(0, inplace=True)
    
    return data_df

In [41]:
#Price vs Location Avg
def add_price_vs_loc_avg(df, per_room=True, per_listing=False):
    '''(DataFrame, bool, bool) -> DataFrame
    per_room will use the price per room as base
    per_listing will use the 'price' as base
    
    Will add 'PriceVsLocAvg' to the current DataFrame.
    '''
    #Build Location
    df['lat_round'] = df.apply(lambda x : round(x['latitude'],2), axis=1)
    df['lon_round'] = df.apply(lambda x : round(x['longitude'],2), axis=1)
    df['loc'] = df.apply(lambda x : tuple([x['lat_round'], x['lon_round']]), axis=1)
    if per_room:
        df['AvgLocPricePerRoom'] = df.apply(lambda x: df['PricePerRoom'][df['loc']==x['loc']].mean(), axis=1)
        df['PricePerRoomVsLocAvg'] = df['PricePerRoom'] / df['AvgLocPricePerRoom']
    if per_listing:
        df['AvgLocPrice'] = df.apply(lambda x: df['price'][df['loc']==x['loc']].mean(), axis=1)
        df['PriceVsLocAvg'] = df['price'] / df['AvgLocPrice']
       
    return df

In [42]:
def add_features(df):
    '''(DataFrame) -> DataFrame
    
    Will add new features to the current DataFrame.
    '''
    #Create # of Photos Column
    df['NumPhotos'] = df.photos.str.len()
    #Create # of Features Column
    df['NumFeatures'] = df.features.str.len()
    df['NumDescription'] = df.description.str.len()
    #Total Rooms
    df['TotalRooms'] = df['bathrooms'] + df['bedrooms']
    #Room / Price
    #Add one too all -assume every apartment is at least 1 room (studios)
    df['PricePerRoom'] = df['price'] / (df['TotalRooms'] + 1.0)
    df['PricePerBedRoom'] = df['price'] / (df['bedrooms'] + 1.0)
    #Add Price vs Loc
    df = add_price_vs_loc_avg(df)
    #Add Date Features
    df = add_date_features(df)
    return df
    
#Add features to Training Data

train_df = add_features(train_df)
man_skill = compute_manager_skill(train_df)
train_df = add_manager_skill(train_df, man_skill)
train_df.head()

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,interest_level,latitude,listing_id,...,lat_round,lon_round,loc,AvgLocPricePerRoom,PricePerRoomVsLocAvg,created_month,created_day,created_hour,count,skill
0,1.5,3,53a5b119ba8f7b61d4e010512e0dfc85,2016-06-24 07:54:24,A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ...,Metropolitan Avenue,[],medium,40.7145,7211212,...,40.71,-73.94,"(40.71, -73.94)",765.314223,0.71272,6,24,7,90.0,1.255556
1,1.0,2,c5c8a357cba207596b04d1afd1e4f130,2016-06-12 12:19:27,,Columbus Avenue,"[Doorman, Elevator, Fitness Center, Cats Allow...",low,40.7947,7150865,...,40.79,-73.97,"(40.79, -73.97)",1133.218273,1.205637,6,12,12,86.0,1.011628
2,1.0,1,c3ba40552e2120b0acfc3cb5730bb2aa,2016-04-17 03:26:41,"Top Top West Village location, beautiful Pre-w...",W 13 Street,"[Laundry In Building, Dishwasher, Hardwood Flo...",high,40.7388,6887163,...,40.74,-74.0,"(40.74, -74.0)",1205.296462,0.788188,4,17,3,134.0,1.30597
3,1.0,1,28d9ad350afeaab8027513a3e52ac8d5,2016-04-18 02:22:02,Building Amenities - Garage - Garden - fitness...,East 49th Street,"[Hardwood Floors, No Fee]",low,40.7539,6888711,...,40.75,-73.97,"(40.75, -73.97)",1078.049624,1.012631,4,18,2,191.0,1.057592
4,1.0,4,0,2016-04-28 01:32:41,Beautifully renovated 3 bedroom flex 4 bedroom...,West 143rd Street,[Pre-War],low,40.8241,6934781,...,40.82,-73.95,"(40.82, -73.95)",573.579717,0.973419,4,28,1,0.0,0.0


In [43]:
#Add features to Test Data
#Load test data
test_df = pd.read_json('test.json')
#Add engineered features
test_df = add_features(test_df)
test_df = add_manager_skill(test_df, man_skill)
test_df.head()

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,listing_id,longitude,...,lat_round,lon_round,loc,AvgLocPricePerRoom,PricePerRoomVsLocAvg,created_month,created_day,created_hour,count,skill
0,1.0,1,79780be1514f645d7e6be99a3de696c5,2016-06-11 05:29:41,Large with awesome terrace--accessible via bed...,Suffolk Street,"[Elevator, Laundry in Building, Laundry in Uni...",40.7185,7142618,-73.9865,...,40.72,-73.99,"(40.72, -73.99)",1096.100837,0.897119,6,11,5,0.0,0.0
1,1.0,2,0,2016-06-24 06:36:34,Prime Soho - between Bleecker and Houston - Ne...,Thompson Street,"[Pre-War, Dogs Allowed, Cats Allowed]",40.7278,7210040,-74.0,...,40.73,-74.0,"(40.73, -74.0)",1213.44046,0.587173,6,24,6,0.0,0.0
2,1.0,1,3dbbb69fd52e0d25131aa1cd459c87eb,2016-06-03 04:29:40,New York chic has reached a new level ...,101 East 10th Street,"[Doorman, Elevator, No Fee]",40.7306,7103890,-73.989,...,40.73,-73.99,"(40.73, -73.99)",1162.180848,1.077859,6,3,4,0.0,0.0
3,1.0,2,783d21d013a7e655bddc4ed0d461cc5e,2016-06-11 06:17:35,Step into this fantastic new Construction in t...,South Third Street\r,"[Roof Deck, Balcony, Elevator, Laundry in Buil...",40.7109,7143442,-73.9571,...,40.71,-73.96,"(40.71, -73.96)",816.200305,1.010781,6,11,6,61.0,1.032787
4,2.0,2,6134e7c4dd1a98d9aee36623c9872b49,2016-04-12 05:24:17,"~Take a stroll in Central Park, enjoy the ente...","Midtown West, 8th Ave","[Common Outdoor Space, Cats Allowed, Dogs Allo...",40.765,6860601,-73.9845,...,40.77,-73.98,"(40.77, -73.98)",2060.54958,0.475601,4,12,5,72.0,1.236111


### Prepare Data for ML & Transform Features


#### Apply same transforms to test features

In [44]:
from sklearn.preprocessing import LabelEncoder


#ENCODE TEXT FEATURES
#Combine the train and test columns
manager_combo = train_df['manager_id'].append(test_df['manager_id'])
building_combo = train_df['building_id'].append(test_df['building_id'])
loc_combo = train_df['loc'].append(test_df['loc'])
#Encode building_id
le_building = LabelEncoder()
le_building.fit(building_combo)
#Transform Train & Test set
train_df['BuildingID'] = le_building.transform(train_df['building_id'])
test_df['BuildingID'] = le_building.transform(test_df['building_id'])
#Encode manager_id
le_manager = LabelEncoder()
le_manager.fit(manager_combo)
#Transform Train & Test set
train_df['ManagerID'] = le_manager.transform(train_df['manager_id'])
test_df['ManagerID'] = le_manager.transform(test_df['manager_id'])
#Encode loc
le_loc = LabelEncoder()
le_loc.fit(loc_combo)
#Transform Train & Test set
train_df['LocID'] = le_loc.transform(train_df['loc'])
test_df['LocID'] = le_loc.transform(test_df['loc'])

#Inspect to verify
test_df.head()

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,listing_id,longitude,...,AvgLocPricePerRoom,PricePerRoomVsLocAvg,created_month,created_day,created_hour,count,skill,BuildingID,ManagerID,LocID
0,1.0,1,79780be1514f645d7e6be99a3de696c5,2016-06-11 05:29:41,Large with awesome terrace--accessible via bed...,Suffolk Street,"[Elevator, Laundry in Building, Laundry in Uni...",40.7185,7142618,-73.9865,...,1096.100837,0.897119,6,11,5,0.0,0.0,5535,3076,264
1,1.0,2,0,2016-06-24 06:36:34,Prime Soho - between Bleecker and Houston - Ne...,Thompson Street,"[Pre-War, Dogs Allowed, Cats Allowed]",40.7278,7210040,-74.0,...,1213.44046,0.587173,6,24,6,0.0,0.0,0,3593,292
2,1.0,1,3dbbb69fd52e0d25131aa1cd459c87eb,2016-06-03 04:29:40,New York chic has reached a new level ...,101 East 10th Street,"[Doorman, Elevator, No Fee]",40.7306,7103890,-73.989,...,1162.180848,1.077859,6,3,4,0.0,0.0,2813,2677,293
3,1.0,2,783d21d013a7e655bddc4ed0d461cc5e,2016-06-11 06:17:35,Step into this fantastic new Construction in t...,South Third Street\r,"[Roof Deck, Balcony, Elevator, Laundry in Buil...",40.7109,7143442,-73.9571,...,816.200305,1.010781,6,11,6,61.0,1.032787,5477,201,235
4,2.0,2,6134e7c4dd1a98d9aee36623c9872b49,2016-04-12 05:24:17,"~Take a stroll in Central Park, enjoy the ente...","Midtown West, 8th Ave","[Common Outdoor Space, Cats Allowed, Dogs Allo...",40.765,6860601,-73.9845,...,2060.54958,0.475601,4,12,5,72.0,1.236111,4428,3157,384


In [45]:
#Pickle for easy backup
train_df.to_pickle('train_df.pickle')
test_df.to_pickle('test_df.pickle')

### Select Features for Model

In [84]:
#Select Features
feature_cols = ['price', 'PricePerRoom', 'PricePerRoomVsLocAvg', 'BuildingID', 'NumDescription', 'ManagerID', 'NumPhotos',
               'NumFeatures', 'latitude', 'longitude', 'bedrooms', 'bathrooms', 'created_month', 'created_day', 'created_hour',
               'skill']

#Prepare data for ML
X_train = train_df[feature_cols].values
X_test = test_df[feature_cols].values

#Encode 'interest_level' to numerical
le_interest = LabelEncoder()
train_df['IL'] = le_interest.fit_transform(train_df['interest_level'])
#Set Train Y
Y = train_df['IL'].values

In [85]:
#Find important features
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(X_train, Y)
importances = zip(model.feature_importances_, feature_cols)
importances = pd.DataFrame(importances)
print importances.sort_values(0, ascending=False)

           0                     1
1   0.087919          PricePerRoom
0   0.082722                 price
3   0.078318            BuildingID
2   0.077875  PricePerRoomVsLocAvg
14  0.070911          created_hour
15  0.070178                 skill
13  0.069552           created_day
4   0.068600        NumDescription
8   0.063100              latitude
9   0.061070             longitude
7   0.059877           NumFeatures
5   0.058974             ManagerID
6   0.058770             NumPhotos
12  0.037824         created_month
10  0.036804              bedrooms
11  0.017506             bathrooms


In [49]:
#Get Label encodings for reference later
le_interest.classes_

array([u'high', u'low', u'medium'], dtype=object)

### Test ML Algorithim

#### Random Forest Classifier

In [86]:
#RandomForest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

#Initialize Model
rf = RandomForestClassifier(n_estimators=100, min_samples_split=20, criterion='entropy', n_jobs=-1)
#Create KFold
kfold = KFold(n_splits=5, random_state=5)
cross_val_results = cross_val_score(rf, X_train, Y, cv=kfold, scoring='neg_log_loss')
print cross_val_results.mean()

-0.581278515186


#### Grid Search for best RF parameters

In [52]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'min_samples_split' : [2, 4, 10, 20, 40],
    'criterion' : ['gini', 'entropy'],
    'max_features' : ['auto', 'log2', None]
}

rf100 = RandomForestClassifier(n_estimators=100, n_jobs=-1)
grid_search = GridSearchCV(estimator=rf100, param_grid=param_grid, cv=5, scoring='neg_log_loss')
grid_search.fit(X_train, Y)

print "Best Score: %f" % grid_search.best_score_
print grid_search.best_params_

Best Score: -0.580035
{'max_features': 'log2', 'min_samples_split': 20, 'criterion': 'entropy'}


#### XGB

In [87]:
from xgboost import XGBClassifier

#Initialize Model
xgb = XGBClassifier(objective='multi:softprob', max_depth=8, subsample=0.7)
#Create cross validation generator
kfold = KFold(n_splits=5, random_state=5)
#Train & Test model
cross_val_results = cross_val_score(xgb, X_train, Y, cv=kfold, scoring='neg_log_loss')
print cross_val_results.mean()

-0.564649263704


#### GridSearch for Best XGB Parameters

In [None]:
parameters = {'learning_rate': [0.1, 0.3],
              'min_child_weight': [3, 5, 8],
              'subsample' : [0.6, 0.7, 0.8],
              'max_depth' : [3, 5, 10]}

xgb = XGBClassifier()
grid_search = GridSearchCV(xgb, parameters, n_jobs=-1, cv=10, scoring='neg_log_loss')
grid_search.fit(X_train, Y)
print "Best Score: %f" % grid_search.best_score_
print grid_search.best_params_

### Train Model & Make Submission


In [73]:
#XGB
xgb = XGBClassifier(objective='multi:softprob', max_depth=8, subsample=0.7, n_estimators=100)
xgb.fit(X_train, Y)
predictions = xgb.predict_proba(X_test)

In [74]:
#Submission must be - listing_id, high, medium, low
#The index of our probabilties is from the label encoder earlier (0=high, 1=low, medium=2)
submission_df = pd.DataFrame({'listing_id':test_df['listing_id'], 'high':predictions[:, 0],
                             'medium':predictions[:, 2], 'low':predictions[:, 1]})
#Re-Order Columns for submission
cols = ['listing_id', 'high', 'medium', 'low']
submission_df = submission_df[cols]
submission_df.head()

Unnamed: 0,listing_id,high,medium,low
0,7142618,0.096804,0.49052,0.412676
1,7210040,0.081878,0.083418,0.834704
2,7103890,0.025219,0.104039,0.870742
3,7143442,0.057672,0.364056,0.578271
4,6860601,0.072051,0.355544,0.572405


In [75]:
#Write to CSV for submission
submission_df.to_csv('xgb.csv', index=False)

#### Make an RF Submission

In [88]:
#Random Forest
rf = RandomForestClassifier(n_estimators=1000, min_samples_split=20, criterion='entropy', n_jobs=-1)
rf.fit(X_train, Y)
prediction_probabilites = rf.predict_proba(X_test)

In [89]:
#Checkout feature importance
importances = zip(rf.feature_importances_, feature_cols)
importances = pd.DataFrame(importances)
importances

Unnamed: 0,0,1
0,0.117786,price
1,0.122826,PricePerRoom
2,0.108902,PricePerRoomVsLocAvg
3,0.095866,BuildingID
4,0.067968,NumDescription
5,0.051013,ManagerID
6,0.044346,NumPhotos
7,0.046239,NumFeatures
8,0.068551,latitude
9,0.064987,longitude


In [90]:
#Submission must be - listing_id, high, medium, low
#The index of our probabilties is from the label encoder earlier (0=high, 1=low, medium=2)
submission_df = pd.DataFrame({'listing_id':test_df['listing_id'], 'high':prediction_probabilites[:, 0],
                             'medium':prediction_probabilites[:, 2], 'low':prediction_probabilites[:, 1]})
#Re-Order Columns for submission
cols = ['listing_id', 'high', 'medium', 'low']
submission_df = submission_df[cols]
submission_df.head()

Unnamed: 0,listing_id,high,medium,low
0,7142618,0.073706,0.43042,0.495873
1,7210040,0.138937,0.203632,0.657431
2,7103890,0.020215,0.127631,0.852154
3,7143442,0.078329,0.263707,0.657964
4,6860601,0.089804,0.366559,0.543637


In [91]:
#Verify all is well (no NaNs)
submission_df.isnull().sum()

listing_id    0
high          0
medium        0
low           0
dtype: int64

In [92]:
#Write to CSV for submission
submission_df.to_csv('rf.csv', index=False)

### Conclusion

This XGB scored a 0.56831 and the random forest scored a 0.58048 on Kaggle.
