# Two Sigma Rentals

Estimate the **interest level** (high, medium, low) for rental advertisements.

Data size: 
* Train: 67mb, 49k records
* Test: 101mb, 74k records

Data provided on each apartment:
* Size (beds, baths)
* Building (potential for initial aggregate building model)
* Time period
* Ad content (description, features)
* Location (lat, lon, address) 
* Manager (potential for initial aggregate manager model)
* Price
* Photo link (potential for a photo model with a neural network)
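
The aggregate building and manager models mentioned in the list could start as a simple per-group share of high-interest listings. The following is a minimal sketch on made-up data, not the notebook's method: `add_group_rate`, the `is_high` column, and the `prior_weight` smoothing parameter are hypothetical names introduced only for illustration.

```python
import pandas as pd

def add_group_rate(full_df, train_df, group_col, target_col, prior_weight=5.0):
    """Return a Series with a smoothed per-group mean of target_col.

    Rates are learned from train_df only and mapped onto full_df;
    groups never seen in training fall back to the overall training mean.
    """
    prior = train_df[target_col].mean()
    stats = train_df.groupby(group_col)[target_col].agg(['sum', 'count'])
    # Shrink small groups toward the overall mean.
    smoothed = (stats['sum'] + prior_weight * prior) / (stats['count'] + prior_weight)
    return full_df[group_col].map(smoothed).fillna(prior)

# Toy data (made up for illustration).
toy_train = pd.DataFrame({
    'manager_id': ['a', 'a', 'a', 'b', 'b', 'c'],
    'is_high':    [1,   1,   0,   0,   0,   1],
})
toy_full = pd.DataFrame({'manager_id': ['a', 'b', 'c', 'd']})

toy_full['manager_high_rate'] = add_group_rate(
    toy_full, toy_train, 'manager_id', 'is_high')
print(toy_full)
```

If such a rate is used as a feature on the training rows themselves, it would probably need to be computed out-of-fold so that a listing's own label does not leak into its feature.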


## Plan of attack

Start with a few basic models on basic features to benchmark predictive power and to understand the main relationships in general terms.
* [basic] Evaluate the relationship between price and high / low interest levels.
* [basic] Evaluate bedrooms and baths against interest levels.
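
One simple way to look at these relationships is a normalized cross-tabulation: bucket the feature, then compute the share of each interest level per bucket. The sketch below uses a made-up toy frame; the column names mirror the data, but the values and the choice of four buckets are assumptions for illustration.

```python
import pandas as pd

# Toy listings (made up); in the notebook this would be the training
# frame joined with interest_level.
toy = pd.DataFrame({
    'price': [1800, 2100, 2500, 2900, 3300, 3800, 4500, 6000],
    'interest_level': ['high', 'high', 'medium', 'low',
                       'medium', 'low', 'low', 'low'],
})

# Split price into equal-sized buckets, then compute the share of each
# interest level inside every bucket (each row sums to 1).
toy['price_bucket'] = pd.qcut(toy['price'], 4, labels=False)
shares = pd.crosstab(toy['price_bucket'], toy['interest_level'],
                     normalize='index')
print(shares)
```

For bedrooms or bathrooms the bucketing step can be skipped, since those columns already take a small number of distinct values.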

Develop k-fold cross-validation logic, since the training set is small (49k records).
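
Because the classes are imbalanced, it may help to stratify the folds so that each one keeps roughly the same class mix. scikit-learn's `StratifiedKFold` is the usual tool for this; the numpy-only sketch below just illustrates the idea. `stratified_kfold_indices` is a hypothetical helper and the labels are made up.

```python
import numpy as np

def stratified_kfold_indices(labels, n_folds=5, seed=0):
    """Yield (train_idx, valid_idx) pairs with class balance kept per fold."""
    labels = np.asarray(labels)
    rng = np.random.RandomState(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        rng.shuffle(idx)
        # Deal the shuffled members of this class round-robin across folds.
        fold_of[idx] = np.arange(len(idx)) % n_folds
    for k in range(n_folds):
        yield np.where(fold_of != k)[0], np.where(fold_of == k)[0]

# Toy labels (made up), skewed toward 'low'.
toy_y = np.array(['low'] * 14 + ['medium'] * 4 + ['high'] * 2)
folds = list(stratified_kfold_indices(toy_y, n_folds=2))
for train_idx, valid_idx in folds:
    print('%d train rows, %d validation rows' % (len(train_idx), len(valid_idx)))
```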

Evaluate the relationship between buildings and interest levels.

### Import libraries

In [1]:
import pandas as pd
import numpy as np
import json
from collections import namedtuple

# modeling
import statsmodels.api as sm
from patsy import dmatrices

# viz
import seaborn as sns
import matplotlib
%matplotlib inline

In [2]:
# Parse each JSON file line by line and merge the pieces into one dict
# keyed by column name.
raw_train = {}
with open('Data/train.json','r') as r:
    for line in r:
        raw_train.update(json.loads(line))

raw_test = {}
with open('Data/test.json','r') as r:
    for line in r:
        raw_test.update(json.loads(line))

In [3]:
raw_train.keys()

[u'listing_id',
 u'building_id',
 u'display_address',
 u'description',
 u'created',
 u'price',
 u'bedrooms',
 u'longitude',
 u'photos',
 u'manager_id',
 u'latitude',
 u'bathrooms',
 u'interest_level',
 u'street_address',
 u'features']

In [4]:
raw_test.keys()

[u'listing_id',
 u'display_address',
 u'description',
 u'created',
 u'price',
 u'bedrooms',
 u'longitude',
 u'photos',
 u'manager_id',
 u'latitude',
 u'bathrooms',
 u'building_id',
 u'street_address',
 u'features']

In [5]:
y_col = 'interest_level'

### Merge datasets

In [6]:
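# Index both frames by listing_id and flag which set each row came from.
raw_train_df = pd.DataFrame(raw_train)
raw_train_df.index = raw_train_df.listing_id
raw_train_df['is_train'] = 1

raw_test_df = pd.DataFrame(raw_test)
raw_test_df.index = raw_test_df.listing_id
raw_test_df['is_train'] = 0

# Keep the target separately; it exists only in the training set.
raw_train_y = raw_train_df[y_col]

# Stack train and test on the columns they share (everything but the target).
fullcols = raw_test_df.columns
raw_full_df = pd.concat((raw_train_df[fullcols],
                         raw_test_df[fullcols]))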

In [29]:
np.unique(raw_train_y)

array([u'high', u'low', u'medium'], dtype=object)

### Explore data

In [7]:
raw_train_df.shape

(49352, 16)

In [8]:
raw_test_df.shape

(74659, 15)

In [9]:
raw_full_df.shape

(124011, 15)

In [10]:
np.unique(raw_train_y.values, return_counts=True)

(array([u'high', u'low', u'medium'], dtype=object),
 array([ 3839, 34284, 11229]))
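
The counts show a strong imbalance: roughly 8% high, 69% low and 23% medium. One way to turn this into a benchmark, assuming the models are eventually scored with multi-class log loss (an assumption; the notebook does not state a metric), is to compute the loss of a model that predicts these class shares for every listing.

```python
import numpy as np

# Class counts copied from the output above, in the order high, low, medium.
counts = np.array([3839, 34284, 11229], dtype=float)
priors = counts / counts.sum()

# Predicting the class shares for every row gives a log loss equal to the
# entropy of the class distribution (natural log).
baseline_logloss = -np.sum(priors * np.log(priors))
print(priors.round(4))
print(round(baseline_logloss, 4))
```

Any model built on the listing features would need to beat this number to show that it adds information beyond the class shares.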

In [11]:
raw_full_df.head()

| listing_id (index) | bathrooms | bedrooms | building_id | created | description | display_address | features | latitude | listing_id | longitude | manager_id | photos | price | street_address | is_train |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7211212 | 1.5 | 3 | 53a5b119ba8f7b61d4e010512e0dfc85 | 2016-06-24 07:54:24 | A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... | Metropolitan Avenue | [] | 40.7145 | 7211212 | -73.9425 | 5ba989232d0489da1b5f2c45f6688adc | [https://photos.renthop.com/2/7211212_1ed4542e... | 3000 | 792 Metropolitan Avenue | 1 |
| 7150865 | 1.0 | 2 | c5c8a357cba207596b04d1afd1e4f130 | 2016-06-12 12:19:27 |  | Columbus Avenue | [Doorman, Elevator, Fitness Center, Cats Allow... | 40.7947 | 7150865 | -73.9667 | 7533621a882f71e25173b27e3139d83d | [https://photos.renthop.com/2/7150865_be3306c5... | 5465 | 808 Columbus Avenue | 1 |
| 6887163 | 1.0 | 1 | c3ba40552e2120b0acfc3cb5730bb2aa | 2016-04-17 03:26:41 | Top Top West Village location, beautiful Pre-w... | W 13 Street | [Laundry In Building, Dishwasher, Hardwood Flo... | 40.7388 | 6887163 | -74.0018 | d9039c43983f6e564b1482b273bd7b01 | [https://photos.renthop.com/2/6887163_de85c427... | 2850 | 241 W 13 Street | 1 |
| 6888711 | 1.0 | 1 | 28d9ad350afeaab8027513a3e52ac8d5 | 2016-04-18 02:22:02 | Building Amenities - Garage - Garden - fitness... | East 49th Street | [Hardwood Floors, No Fee] | 40.7539 | 6888711 | -73.9677 | 1067e078446a7897d2da493d2f741316 | [https://photos.renthop.com/2/6888711_6e660cee... | 3275 | 333 East 49th Street | 1 |
| 6934781 | 1.0 | 4 | 0 | 2016-04-28 01:32:41 | Beautifully renovated 3 bedroom flex 4 bedroom... | West 143rd Street | [Pre-War] | 40.8241 | 6934781 | -73.9493 | 98e13ad4b495b9613cef886d79a6291f | [https://photos.renthop.com/2/6934781_1fa4b41a... | 3350 | 500 West 143rd Street | 1 |


In [12]:
raw_full_df.bathrooms.value_counts()

1.0      99086
2.0      19230
3.0       1861
1.5       1642
0.0        787
2.5        702
4.0        364
3.5        164
4.5         83
5.0         60
5.5         12
6.0         11
6.5          3
20.0         2
7.0          1
112.0        1
7.5          1
10.0         1
Name: bathrooms, dtype: int64

In [13]:
bad_bathrooms = raw_full_df.bathrooms==112
raw_full_df[bad_bathrooms]

| listing_id (index) | bathrooms | bedrooms | building_id | created | description | display_address | features | latitude | listing_id | longitude | manager_id | photos | price | street_address | is_train |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7120577 | 112.0 | 3 | 33fa7be8ea2ffc6353af117cab78f569 | 2016-06-07 05:22:55 | This is a pretty, charming, prime location 3 b... | East 75th Street | [Hardwood Floors] | 40.7693 | 7120577 | -73.9529 | 3e1edc05ca35eaecc90766629d22d078 | [https://photos.renthop.com/2/7120577_dea70af4... | 3700 | 433 East 75th Street | 0 |


In [14]:
raw_full_df = raw_full_df[~bad_bathrooms]
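
Note that the dropped listing is a test row (`is_train` is 0), so removing it also removes a listing that may still need a prediction. An alternative, sketched here on made-up data, is to cap the value and keep a flag instead of dropping the row. `MAX_BATHROOMS` and `bathrooms_capped` are hypothetical names, and the cap of 10 is chosen only for illustration.

```python
import pandas as pd

# Toy frame (made up) with one implausible bathroom count.
toy = pd.DataFrame({'bathrooms': [1.0, 1.5, 2.0, 112.0],
                    'is_train':  [1,   1,   0,   0]},
                   index=[101, 102, 103, 104])

MAX_BATHROOMS = 10.0  # hypothetical cap, chosen only for illustration

# Keep a flag so a model can still see that the value was implausible,
# then cap the value itself; no rows are removed.
toy['bathrooms_capped'] = toy['bathrooms'] > MAX_BATHROOMS
toy['bathrooms'] = toy['bathrooms'].clip(upper=MAX_BATHROOMS)
print(toy)
```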

In [15]:
raw_full_df.bedrooms.value_counts()

1    39608
2    37114
0    23564
3    18148
4     4887
5      569
6      112
7        6
8        2
Name: bedrooms, dtype: int64

In [16]:
raw_full_df.building_id.value_counts()[:10]

0                                   20664
96274288c84ddd7d5c5d8e425ee75027      705
11e1dec9d14b1a9e528386a2504b3afc      546
bb8658a3e432fb62a440615333376345      522
80a120d6bc3aba97f40fee8c2204524b      510
ce6d18bf3238e668b2bf23f4110b7b67      459
f68bf347f99df026f4faad43cc604048      457
c94301249b8c09429d329864d58e5b82      410
ea9045106c4e1fe52853b6af941f1c69      397
128d4af0683efc5e1eded8dc8044d5e3      385
Name: building_id, dtype: int64

In [17]:
raw_full_df[raw_full_df.building_id=='96274288c84ddd7d5c5d8e425ee75027'].head()

| listing_id (index) | bathrooms | bedrooms | building_id | created | description | display_address | features | latitude | listing_id | longitude | manager_id | photos | price | street_address | is_train |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6839190 | 1.0 | 1 | 96274288c84ddd7d5c5d8e425ee75027 | 2016-04-07 05:22:17 | This deal won't last long! Fantastic corner 1 ... | W 37th St. | [Roof Deck, Balcony, Doorman, Elevator, Fitnes... | 40.7568 | 6839190 | -73.9982 | 39af186286605963b1d75543e1492c61 | [https://photos.renthop.com/2/6839190_6976106d... | 3500 | 505 W 37th St. | 1 |
| 6855104 | 1.0 | 2 | 96274288c84ddd7d5c5d8e425ee75027 | 2016-04-11 02:45:09 | WANT A LUXURY BUILDING IN NYC WITH A SOUTH BEA... | West 37th Street | [Roof Deck, Doorman, Elevator, Fitness Center,... | 40.7568 | 6855104 | -73.9982 | 537e06890f6a86dbb70c187db5be4d55 | [https://photos.renthop.com/2/6855104_aad48410... | 3100 | 505 West 37th Street | 1 |
| 6942565 | 1.0 | 0 | 96274288c84ddd7d5c5d8e425ee75027 | 2016-04-29 05:25:04 | 100% NO FEE!!!!This Amazing LUXURY highrise X... | West 37th Street | [Roof Deck, Dining Room, Balcony, Doorman, Ele... | 40.7568 | 6942565 | -73.9982 | 1fb46c4a72bcf764ac35fc23f394760d | [https://photos.renthop.com/2/6942565_38f5b22b... | 2400 | 505 West 37th Street | 1 |
| 6814881 | 2.0 | 3 | 96274288c84ddd7d5c5d8e425ee75027 | 2016-04-02 03:11:16 | 100%NO BROKER FEE &ONE MONTH FREE! INCREDIBLE ... | W 37 St. | [Roof Deck, Doorman, Elevator, Fitness Center,... | 40.7568 | 6814881 | -73.9982 | b531b97b2c0b72472307b38b55a6d5b5 | [https://photos.renthop.com/2/6814881_da8a3ac1... | 4400 | 505 W 37 St. | 1 |
| 6925010 | 1.0 | 0 | 96274288c84ddd7d5c5d8e425ee75027 | 2016-04-26 02:51:47 | AMAZING LOCATION!! building located in Midtown... | W 37 St. | [Roof Deck, Doorman, Elevator, Fitness Center,... | 40.7568 | 6925010 | -73.9982 | e6472c7237327dd3903b3d6f6a94515a | [https://photos.renthop.com/2/6925010_24d46f89... | 2500 | 505 W 37 St. | 1 |


In [18]:
raw_full_df['building_id_iszero'] = raw_full_df.building_id == '0'

In [19]:
raw_full_df.building_id_iszero.value_counts()

False    103346
True      20664
Name: building_id_iszero, dtype: int64

In [20]:
raw_full_df[['building_id_iszero','is_train']].groupby('is_train').mean()

| is_train | building_id_iszero |
|---|---|
| 0 | 0.165796 |
| 1 | 0.167896 |


In [21]:
m_df = pd.merge(raw_full_df, pd.DataFrame(raw_train_y), left_index=True, right_index=True)

In [22]:
m_df['interest_level_ishigh'] = (m_df.interest_level == 'high')*1
m_df['interest_level_islow'] = (m_df.interest_level == 'low')*1

In [23]:
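def run_model(formula):
    # Fit OLS on m_df from a patsy formula and print the summary.
    # With a 0/1 outcome this is a linear probability model.
    y, X = dmatrices(formula, data=m_df, return_type='dataframe')
    mod = sm.OLS(y, X)
    res = mod.fit()
    print(res.summary())  # parenthesized form runs on Python 2 and 3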

In [24]:
run_model('interest_level_ishigh ~ building_id_iszero')

                              OLS Regression Results                             
Dep. Variable:     interest_level_ishigh   R-squared:                       0.008
Model:                               OLS   Adj. R-squared:                  0.008
Method:                    Least Squares   F-statistic:                     412.0
Date:                   Sat, 01 Apr 2017   Prob (F-statistic):           3.19e-91
Time:                           13:07:34   Log-Likelihood:                -4807.3
No. Observations:                  49352   AIC:                             9619.
Df Residuals:                      49350   BIC:                             9636.
Df Model:                              1                                         
Covariance Type:               nonrobust                                         
                                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------
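
With a single 0/1 regressor, the OLS fit above is a linear probability model: the intercept is the share of high-interest listings where the indicator is 0, and the slope is the difference in that share between the two groups. The sketch below checks this on made-up data with plain numpy; the arrays are toy values, not results from the notebook.

```python
import numpy as np

# Toy data (made up): x is a 0/1 indicator such as building_id_iszero,
# y is a 0/1 outcome such as interest_level_ishigh.
x = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
y = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0, 1], dtype=float)

# Ordinary least squares with an intercept, via the normal equations.
X = np.column_stack([np.ones_like(x), x])
intercept, slope = np.linalg.solve(X.T.dot(X), X.T.dot(y))

mean_when_0 = y[x == 0].mean()
mean_when_1 = y[x == 1].mean()
print('intercept %.4f vs group-0 mean %.4f' % (intercept, mean_when_0))
print('slope %.4f vs difference in means %.4f' % (slope, mean_when_1 - mean_when_0))
```

The same comparison could be made on `m_df` with a `groupby` on the indicator, which is a quick sanity check on the regression coefficients.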

In [27]:
run_model('interest_level_islow ~ building_id_iszero + bedrooms')

                             OLS Regression Results                             
Dep. Variable:     interest_level_islow   R-squared:                       0.047
Model:                              OLS   Adj. R-squared:                  0.047
Method:                   Least Squares   F-statistic:                     1222.
Date:                  Sat, 01 Apr 2017   Prob (F-statistic):               0.00
Time:                          19:57:13   Log-Likelihood:                -30570.
No. Observations:                 49352   AIC:                         6.115e+04
Df Residuals:                     49349   BIC:                         6.117e+04
Df Model:                             2                                         
Covariance Type:              nonrobust                                         
                                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------------------
