We couldn't find any models for predicting the review rating.

Our Accuracy : 96% 

Code from internet : 79 lines


Our Code : 134

In [2]:
import numpy as np
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import tokenize
from nltk.corpus import stopwords   
from nltk import wordpunct_tokenize
from sklearn.model_selection import cross_val_predict, GridSearchCV,cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import MinMaxScaler
import math

#### Key Feature Engineering
1. ```comment (reviews)```: We extensively used this feature in our analysis. The dataset contained reviews in multiple languages such as Chinese, Spanish, and English which made it difficult for it to be analyzed. We subsetted the data to include only the reviews that were in English and performed text filtering to remove common stop words and phrases that do not significantly contribute to the meaning of the review.

2.``` price (listings, calendar)```: The price column contained data in string format with the currency symbol ‘$’ and comma separator ‘,’ attached to it. This column was manipulated to contain integer values for time-series and other analysis.

3. ```date (calendar, listings, reviews)```: The date was contained in mm-dd-yyyy format. It was transformed multiple times during the analysis to obtain weekly, monthly or yearly insights.


4. ```rating (listings)```: There are several ratings that hosts receive including ‘location rating’, ‘cleanliness rating’ and ‘overall rating’. Values in these columns comprised of percentages, integers, and char string. Data were standardized and transformed to a similar scale.

#### Dealing with Missing Values
The data also had null values. To preserve all the information, we imputed or dropped the rows and columns containing null values while conducting exploratory analysis that made use of these features.


The ```comments``` feature in the ```reviews``` datasets constituted of more than 1 million values. To conduct a textual analysis on the review, it had to be split into a bag of words which would contain more than 20 million words.


In [65]:
listings = pd.read_csv("../data/listings.csv")

`host_response_time` is a categorical variable so we are replacing the text with scalar values

In [66]:
host_response_dict = {'within an hour': 1, 
                      'within a few hours': 2, 
                      'within a day': 3,
                     'a few days or more': 4}

listings.host_response_time = listings.host_response_time.map(host_response_dict)
listings.dropna(axis=0, how='any', subset=['host_response_time'], inplace=True)

In [67]:
listings.host_response_time.head()

0    2.0
1    1.0
2    1.0
4    3.0
5    3.0
Name: host_response_time, dtype: float64

`host_response_rate` is in string it needs to converted to float as this feature will used later for model building

In [68]:
listings['host_response_rate'].fillna(0, inplace=True)
listings.host_response_rate[:5]

0    100%
1    100%
2    100%
4     93%
5     93%
Name: host_response_rate, dtype: object

In [69]:
def per_to_numbers(price_string): 
    """converting the percentage to float 
    """
    per_to_number = float(str(price_string).split('%')[0])
    return per_to_number

In [70]:
listings['host_response_rate'] = listings['host_response_rate'].apply(per_to_numbers)
listings.host_response_rate.head()

0    100.0
1    100.0
2    100.0
4     93.0
5     93.0
Name: host_response_rate, dtype: float64

In [71]:
listings.dropna(axis=0, how='any', subset=['host_is_superhost'], inplace=True)
listings['host_is_superhost'].value_counts()

f    3578
t    1339
Name: host_is_superhost, dtype: int64

In [72]:
listings['host_is_superhost'].isnull().sum()

0

In [73]:
listings.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,3781,https://www.airbnb.com/rooms/3781,20190209175027,2019-02-09,HARBORSIDE-Walk to subway,Fully separate apartment in a two apartment bu...,This is a totally separate apartment located o...,Fully separate apartment in a two apartment bu...,none,"Mostly quiet ( no loud music, no crowed sidewa...",...,f,f,super_strict_30,f,f,1,1,0,0,0.32
1,5506,https://www.airbnb.com/rooms/5506,20190209175027,2019-02-09,**$79 Special ** Private! Minutes to center!,This is a private guest room with private bath...,**THE BEST Value in BOSTON!!*** PRIVATE GUEST ...,This is a private guest room with private bath...,none,"Peacful, Architecturally interesting, historic...",...,t,f,strict_14_with_grace_period,f,f,6,6,0,0,0.66
2,6695,https://www.airbnb.com/rooms/6695,20190209175027,2019-02-09,$99 Special!! Home Away! Condo,,** WELCOME *** FULL PRIVATE APARTMENT In a His...,** WELCOME *** FULL PRIVATE APARTMENT In a His...,none,"Peaceful, Architecturally interesting, histori...",...,t,f,strict_14_with_grace_period,f,f,6,6,0,0,0.73
4,8789,https://www.airbnb.com/rooms/8789,20190209175027,2019-02-09,Curved Glass Studio/1bd facing Park,"Bright, 1 bed with curved glass windows facing...",Fully Furnished studio with enclosed bedroom. ...,"Bright, 1 bed with curved glass windows facing...",none,Beacon Hill is a historic neighborhood filled ...,...,f,f,strict_14_with_grace_period,f,f,10,10,0,0,0.4
5,8792,https://www.airbnb.com/rooms/8792,20190209175027,2019-02-09,Large 1 Bed facing State House,Fully furnished 1bed facing the State House. ...,"Fully furnished, spacious one bed (king) unit ...",Fully furnished 1bed facing the State House. ...,none,Beacon Hill is a historic neighborhood filled ...,...,f,f,strict_14_with_grace_period,f,f,10,10,0,0,0.21


In [74]:
listings['host_listings_count'].fillna(0, inplace=True)

Added a new column which describe the level of verification.

`host_verifications` column has string data

In [75]:
listings['host_verification_level'] = listings.host_verifications.apply(lambda s: len(s.split(', ')))

In [76]:
listings['host_verification_level'].value_counts()

6     1285
4      799
5      685
8      492
7      462
3      452
9      307
2      267
10      92
1       70
11       6
Name: host_verification_level, dtype: int64

In [77]:
listings.dropna(axis=0, how='any', subset=['host_identity_verified'], inplace=True)
listings['host_identity_verified'].value_counts()

f    3245
t    1672
Name: host_identity_verified, dtype: int64

In [78]:
listings['is_location_exact'].fillna(0, inplace=True)
listings['is_location_exact'].value_counts()

t    4043
f     874
Name: is_location_exact, dtype: int64

There are many types of property types in our dataset we are just interested about top 10 types

In [79]:
propertyTypes = listings['property_type'].value_counts()[10:len(listings['property_type'].value_counts())].index.tolist()
def change_prop_type(label):
    if label in propertyTypes:
        label='Other'
    return label

In [80]:
listings.loc[:,'property_type']=listings.loc[:,'property_type'].apply(change_prop_type)
listings['property_type'].value_counts()

Apartment             3263
House                  722
Condominium            394
Serviced apartment     171
Townhouse              112
Other                   83
Guest suite             68
Bed and breakfast       43
Loft                    42
Boutique hotel          19
Name: property_type, dtype: int64

In [81]:
#checking nulls in room_type
listings['room_type'].value_counts()
listings['room_type'].isnull().sum()

0

In [82]:
#checking nulls in bed_type
listings['bed_type'].value_counts()
listings['bed_type'].isnull().sum()

0

In [83]:
listings.host_response_time.head()

0    2.0
1    1.0
2    1.0
4    3.0
5    3.0
Name: host_response_time, dtype: float64

`amenities` column is a string. We are dividing the amenities into seperate columns for the further analysis.

We are assuming that *amenities* play an important role on predicting review rating 

In [84]:
def get_amenities(column):
    am_list=[]
    for am in column:
        am=am.replace('"','')
        am=am.replace('{','')
        am=am.replace('}','')
        am_list += am.split(',')

    #am_list=am_list.remove('translation missing: en.hosting_amenity_50')    
    #am_list=am_list.remove('translation missing: en.hosting_amenity_49')
    
    am_list2=pd.DataFrame(am_list) #Transform list into a data frame

    am_list2.rename(columns={0:'amenities'},inplace=True) #replace name to categories

    am_list2=am_list2.groupby('amenities')['amenities'].count().sort_values(ascending=False) #group by category and count.
    am_list2.rename(columns={'amenities':'count'},inplace=True) #Rename column to count to then drop one level
    am_list2.reset_index(level=0)
    return am_list2

In [85]:
amenities=get_amenities(listings.amenities) #Total

amenities = amenities.drop(['translation missing: en.hosting_amenity_50'],axis=0)

amenities = amenities.drop(['translation missing: en.hosting_amenity_49'],axis=0)

amenities

amenities/amenities.sum()*100 #Percentage of total tags

(amenities[0:35]/amenities.sum()*100).sum()

amenities=amenities[0:10]

In [86]:
amenities.index

Index(['Wifi', 'Heating', 'Essentials', 'Smoke detector', 'Kitchen', 'Hangers',
       'Carbon monoxide detector', 'Air conditioning', 'Shampoo',
       'Hair dryer'],
      dtype='object', name='amenities')

In [87]:
am= listings['amenities'].map(
    lambda amns: "|".join([amn.replace("}", "").replace("{", "").replace('"', "")\
                           for amn in amns.split(",")]))

##Create the list of unique amenities

#am_list=np.unique(np.concatenate(am.map(lambda amns: amns.split("|"))))[1:]

##Split amenities

am_s=am.map(lambda amns: amns.split("|"))

##Create an array with true or false for each listing

amenity_array = np.array([am_s.map(lambda amns: amn in amns) for amn in amenities.index])
amenity_array = np.transpose(amenity_array)



In [88]:
listings_new = pd.concat([listings, pd.DataFrame(data=amenity_array, columns=amenities.index)], axis=1)
listings_new.dropna(axis = 0 ,how='any', subset=['id'], inplace=True)
listings_new.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,Wifi,Heating,Essentials,Smoke detector,Kitchen,Hangers,Carbon monoxide detector,Air conditioning,Shampoo,Hair dryer
0,3781.0,https://www.airbnb.com/rooms/3781,20190210000000.0,2019-02-09,HARBORSIDE-Walk to subway,Fully separate apartment in a two apartment bu...,This is a totally separate apartment located o...,Fully separate apartment in a two apartment bu...,none,"Mostly quiet ( no loud music, no crowed sidewa...",...,True,True,True,True,True,True,True,True,True,True
1,5506.0,https://www.airbnb.com/rooms/5506,20190210000000.0,2019-02-09,**$79 Special ** Private! Minutes to center!,This is a private guest room with private bath...,**THE BEST Value in BOSTON!!*** PRIVATE GUEST ...,This is a private guest room with private bath...,none,"Peacful, Architecturally interesting, historic...",...,True,True,True,True,False,True,True,True,True,True
2,6695.0,https://www.airbnb.com/rooms/6695,20190210000000.0,2019-02-09,$99 Special!! Home Away! Condo,,** WELCOME *** FULL PRIVATE APARTMENT In a His...,** WELCOME *** FULL PRIVATE APARTMENT In a His...,none,"Peaceful, Architecturally interesting, histori...",...,True,True,True,True,True,True,True,True,True,True
4,8789.0,https://www.airbnb.com/rooms/8789,20190210000000.0,2019-02-09,Curved Glass Studio/1bd facing Park,"Bright, 1 bed with curved glass windows facing...",Fully Furnished studio with enclosed bedroom. ...,"Bright, 1 bed with curved glass windows facing...",none,Beacon Hill is a historic neighborhood filled ...,...,True,True,True,True,True,True,False,True,False,True
5,8792.0,https://www.airbnb.com/rooms/8792,20190210000000.0,2019-02-09,Large 1 Bed facing State House,Fully furnished 1bed facing the State House. ...,"Fully furnished, spacious one bed (king) unit ...",Fully furnished 1bed facing the State House. ...,none,Beacon Hill is a historic neighborhood filled ...,...,True,True,True,True,True,True,True,True,True,False


In [89]:
listings_new.dropna(axis=0, how='any', subset=['review_scores_rating'], inplace=True)

In [90]:
def prices_to_numbers(price_string):
    '''
    Converts USD prices from string to numeric format
    
    Args:
        price_string (string): USD price in string format (e.g., '$123,456.00')
    
    Returns:
        price_numeric (float): USD price in numeric format (e.g., 123456.00)
    '''
    
    price_numeric = float(str(price_string).replace(',', '').split('$')[-1])
    return price_numeric

In [91]:
listings_new['price'] = listings_new['price'].apply(prices_to_numbers)

In [92]:
listings_new['price'].value_counts()

150.0    151
99.0     110
175.0     95
100.0     91
125.0     90
50.0      89
75.0      84
200.0     81
250.0     63
65.0      63
199.0     61
60.0      53
70.0      50
149.0     50
80.0      49
229.0     45
120.0     44
90.0      43
95.0      43
195.0     43
85.0      42
40.0      42
45.0      41
55.0      40
180.0     40
160.0     39
49.0      39
225.0     37
185.0     35
190.0     35
        ... 
436.0      1
251.0      1
206.0      1
244.0      1
15.0       1
296.0      1
204.0      1
152.0      1
392.0      1
20.0       1
356.0      1
256.0      1
136.0      1
19.0       1
456.0      1
17.0       1
304.0      1
294.0      1
793.0      1
515.0      1
589.0      1
451.0      1
271.0      1
328.0      1
236.0      1
388.0      1
212.0      1
353.0      1
247.0      1
829.0      1
Name: price, Length: 362, dtype: int64

In [93]:
listings_new['security_deposit'].isnull().sum()
listings_new['security_deposit'].fillna(0, inplace=True)
listings_new['security_deposit'] = listings_new['security_deposit'].apply(prices_to_numbers)
listings_new['security_deposit'].value_counts()

0.0       2085
100.0      386
400.0      373
500.0      267
200.0      228
300.0      185
150.0      141
250.0      130
1000.0      56
350.0       35
450.0       11
125.0        9
600.0        9
120.0        8
750.0        6
2000.0       5
3000.0       4
1500.0       4
700.0        4
180.0        3
299.0        3
550.0        3
1200.0       3
800.0        2
375.0        2
399.0        2
275.0        2
650.0        2
115.0        2
227.0        1
259.0        1
5000.0       1
160.0        1
4000.0       1
240.0        1
95.0         1
699.0        1
425.0        1
499.0        1
799.0        1
295.0        1
2500.0       1
199.0        1
249.0        1
475.0        1
175.0        1
353.0        1
119.0        1
105.0        1
999.0        1
Name: security_deposit, dtype: int64

In [94]:
listings_new['cleaning_fee'].isnull().sum()
listings_new['cleaning_fee'].fillna(0, inplace=True)
listings_new['cleaning_fee'] = listings_new['cleaning_fee'].apply(prices_to_numbers)

In [95]:
listings_new['extra_people'] = listings_new['extra_people'].apply(prices_to_numbers)
listings_new['extra_people'].value_counts()

0.0      1805
5.0       369
25.0      362
20.0      292
10.0      248
15.0      223
30.0      188
50.0      167
35.0       85
40.0       76
100.0      32
45.0       14
75.0       13
12.0       13
8.0        12
19.0       11
60.0        6
18.0        6
300.0       6
65.0        6
11.0        6
13.0        4
49.0        4
80.0        4
22.0        4
29.0        3
9.0         3
17.0        3
200.0       3
39.0        3
28.0        2
85.0        2
70.0        2
150.0       2
7.0         2
6.0         2
120.0       1
16.0        1
89.0        1
111.0       1
99.0        1
38.0        1
59.0        1
34.0        1
Name: extra_people, dtype: int64

In [96]:
listings_new['has_availability'].value_counts()

t    3991
Name: has_availability, dtype: int64

In [97]:
listings_new['number_of_reviews'].value_counts()

1.0      304
2.0      202
3.0      175
4.0      137
5.0      122
6.0      117
7.0       93
10.0      90
8.0       87
9.0       77
13.0      68
11.0      65
17.0      62
14.0      57
18.0      57
15.0      55
12.0      51
23.0      48
21.0      44
22.0      44
20.0      42
28.0      39
39.0      39
19.0      39
30.0      38
25.0      37
16.0      36
32.0      35
29.0      35
31.0      34
        ... 
346.0      1
485.0      1
354.0      1
367.0      1
371.0      1
626.0      1
200.0      1
240.0      1
0.0        1
217.0      1
507.0      1
265.0      1
332.0      1
306.0      1
458.0      1
413.0      1
489.0      1
431.0      1
477.0      1
166.0      1
206.0      1
335.0      1
348.0      1
386.0      1
444.0      1
213.0      1
398.0      1
283.0      1
513.0      1
295.0      1
Name: number_of_reviews, Length: 298, dtype: int64

In [98]:
listings_new['number_of_reviews_ltm'].value_counts()

1.0      367
2.0      230
0.0      227
3.0      178
4.0      142
5.0      125
6.0      122
7.0      114
8.0      106
10.0      98
12.0      84
13.0      83
11.0      79
14.0      79
9.0       78
15.0      63
16.0      59
17.0      58
19.0      54
18.0      48
23.0      47
34.0      47
20.0      44
22.0      43
30.0      42
29.0      42
32.0      41
25.0      41
27.0      41
39.0      40
        ... 
101.0      2
120.0      2
107.0      2
100.0      2
131.0      2
116.0      2
93.0       2
114.0      2
117.0      2
109.0      1
128.0      1
138.0      1
150.0      1
80.0       1
108.0      1
137.0      1
141.0      1
143.0      1
99.0       1
149.0      1
94.0       1
156.0      1
105.0      1
102.0      1
119.0      1
113.0      1
145.0      1
134.0      1
104.0      1
133.0      1
Name: number_of_reviews_ltm, Length: 135, dtype: int64

In [99]:
for index,row in listings_new.iterrows():
    if math.isnan(row['review_scores_accuracy']):
        row['review_scores_accuracy'] = row['review_scores_rating']/10
    if math.isnan(row['review_scores_cleanliness']):
        row['review_scores_cleanliness'] = row['review_scores_rating']/10
    if math.isnan(row['review_scores_checkin']):
        row['review_scores_checkin'] = row['review_scores_rating']/10
    if math.isnan(row['review_scores_communication']):
        row['review_scores_communication'] = row['review_scores_rating']/10
    if math.isnan(row['review_scores_location']):
        row['review_scores_location'] = row['review_scores_rating']/10
    if math.isnan(row['review_scores_value']):
        row['review_scores_value'] = row['review_scores_rating']/10

In [100]:
listings_new.dropna(axis=0, how='any', subset=['review_scores_accuracy','review_scores_cleanliness','review_scores_checkin','review_scores_communication','review_scores_location','review_scores_value'], inplace=True)

In [101]:
listings_new['instant_bookable'].value_counts()

t    2163
f    1824
Name: instant_bookable, dtype: int64

In [102]:
listings_new['cancellation_policy'].value_counts()

strict_14_with_grace_period    2234
moderate                       1030
flexible                        592
super_strict_30                  73
strict                           50
super_strict_60                   8
Name: cancellation_policy, dtype: int64

In [103]:
boolean_cols = ['host_is_superhost',
               'host_identity_verified',
               'is_location_exact',
               'instant_bookable',
               'require_guest_profile_picture',
               'require_guest_phone_verification']

In [104]:
def booleans_to_numbers(s):
    '''
    Converts "first letter boolean" strings to integers
    
    Args:
        bool_string (string): 't', 'f' or other
    
    Returns:
        bool_number (int): 1, 0 or None
    '''
    
    if(s == 'f'):
        return 0
    elif (s == 't'):
        return 1
    return None

In [105]:
for col in boolean_cols:
    listings_new[col] = listings_new[col].apply(booleans_to_numbers)

In [106]:
listings_new['reviews_per_month'].fillna(0, inplace=True)

## Model Building

selecting the features for building the model

In [107]:
required_cols = ['host_response_time','host_response_rate','host_is_superhost','host_verification_level','neighbourhood_cleansed',
               'host_identity_verified',
               'is_location_exact',
               'instant_bookable','property_type','room_type','bed_type','Wifi', 'Heating', 'Essentials', 'Smoke detector', 'Kitchen',
       'Carbon monoxide detector', 'Hangers', 'Air conditioning', 'Shampoo',
       'Hair dryer','price','security_deposit','cleaning_fee','extra_people','number_of_reviews','review_scores_rating','review_scores_cleanliness','review_scores_checkin','review_scores_communication','review_scores_location','cancellation_policy','reviews_per_month']

In [108]:
listings_final = listings_new[required_cols]
listings_final

Unnamed: 0,host_response_time,host_response_rate,host_is_superhost,host_verification_level,neighbourhood_cleansed,host_identity_verified,is_location_exact,instant_bookable,property_type,room_type,...,cleaning_fee,extra_people,number_of_reviews,review_scores_rating,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,cancellation_policy,reviews_per_month
0,2.0,100.0,1,3.0,East Boston,0,1,0,Apartment,Entire home/apt,...,75.0,0.0,14.0,99.0,10.0,10.0,10.0,10.0,super_strict_30,0.32
1,1.0,100.0,1,4.0,Roxbury,1,1,1,Guest suite,Entire home/apt,...,40.0,0.0,80.0,95.0,10.0,10.0,10.0,9.0,strict_14_with_grace_period,0.66
2,1.0,100.0,1,4.0,Roxbury,1,1,1,Condominium,Entire home/apt,...,70.0,8.0,85.0,97.0,10.0,10.0,10.0,9.0,strict_14_with_grace_period,0.73
4,3.0,93.0,0,8.0,Downtown,0,1,0,Apartment,Entire home/apt,...,250.0,0.0,22.0,92.0,9.0,10.0,10.0,10.0,strict_14_with_grace_period,0.40
5,3.0,93.0,0,8.0,Downtown,0,1,0,Apartment,Entire home/apt,...,250.0,0.0,24.0,93.0,9.0,10.0,10.0,10.0,strict_14_with_grace_period,0.21
6,4.0,33.0,0,6.0,South End,1,1,0,Apartment,Entire home/apt,...,75.0,0.0,9.0,89.0,10.0,10.0,9.0,10.0,super_strict_30,0.10
7,4.0,33.0,0,6.0,Back Bay,1,1,0,Serviced apartment,Entire home/apt,...,0.0,0.0,23.0,79.0,9.0,10.0,9.0,10.0,super_strict_30,0.25
8,4.0,33.0,0,6.0,Downtown,1,1,0,Serviced apartment,Entire home/apt,...,150.0,0.0,8.0,96.0,10.0,9.0,9.0,9.0,super_strict_30,0.08
10,4.0,33.0,0,6.0,Back Bay,1,1,0,Apartment,Entire home/apt,...,0.0,0.0,25.0,88.0,10.0,10.0,10.0,10.0,super_strict_30,0.28
12,4.0,33.0,0,6.0,Back Bay,1,1,0,Serviced apartment,Entire home/apt,...,0.0,0.0,3.0,87.0,10.0,10.0,10.0,10.0,super_strict_30,0.03


In [109]:
listings_final.to_csv('../data/cleanListings.csv')

In [110]:
listings_dum = pd.get_dummies(listings_final)

checking nulls before passing the dependant variables to the randrom forest regressor

In [111]:
listings_dum.isnull().sum()

host_response_time                                 0
host_response_rate                                 0
host_is_superhost                                  0
host_verification_level                            0
host_identity_verified                             0
is_location_exact                                  0
instant_bookable                                   0
price                                              0
security_deposit                                   0
cleaning_fee                                       0
extra_people                                       0
number_of_reviews                                  0
review_scores_rating                               0
review_scores_cleanliness                          0
review_scores_checkin                              0
review_scores_communication                        0
review_scores_location                             0
reviews_per_month                                  0
neighbourhood_cleansed_Allston                

In [112]:
labels = np.array(listings_dum['review_scores_rating'])

# Remove the labels from the features, so we have one table for independent variables 
# (axis 1 refers to columns)
listings_dum = listings_dum.drop('review_scores_rating', axis = 1)

# Saving feature names for later use
feature_list = list(listings_dum.columns)

# Convert to numpy array
listings_dum = np.array(listings_dum)

In [113]:
# Using Skicit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
train_features, test_features, train_labels, test_labels = train_test_split(listings_dum, labels, test_size = 0.25,
                                                                           random_state = 42)

In [114]:
print('Training Features Shape:', train_features.shape)
print('Training Labels Shape:', train_labels.shape)
print('Testing Features Shape:', test_features.shape)
print('Testing Labels Shape:', test_labels.shape)

Training Features Shape: (2990, 86)
Training Labels Shape: (2990,)
Testing Features Shape: (997, 86)
Testing Labels Shape: (997,)


In [115]:
from sklearn.ensemble import RandomForestRegressor

# Instantiate model 
rf = RandomForestRegressor(n_estimators= 1000, random_state=42)

# Train the model on training data
rf.fit(train_features, train_labels);

In [116]:
rf.get_params

<bound method BaseEstimator.get_params of RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False)>

In [117]:
# Use the forest's predict method on the test data
predictions = rf.predict(test_features)

# Calculate the absolute errors
errors = abs(predictions - test_labels)

# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2))

Mean Absolute Error: 2.84


In [None]:
# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / test_labels)

# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')

Accuracy: 96.72 %.


Whoops!! there a chance of overfitting as the accuracy is ~ 97%

But also the values of review score rating in our dataset are in mostly in the range of 90 - 100. so getting a Mean Absolute error of 2.84 is considerable

In [None]:
# Get numerical feature importances
importances = list(rf.feature_importances_)

# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]

# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)

# Print out the feature and importances 
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];

Variable: review_scores_communication Importance: 0.32
Variable: review_scores_cleanliness Importance: 0.31
Variable: number_of_reviews    Importance: 0.04
Variable: reviews_per_month    Importance: 0.04
Variable: price                Importance: 0.03
Variable: review_scores_checkin Importance: 0.03
Variable: review_scores_location Importance: 0.03
Variable: host_response_rate   Importance: 0.02
Variable: cleaning_fee         Importance: 0.02
Variable: neighbourhood_cleansed_Downtown Importance: 0.02
Variable: host_response_time   Importance: 0.01
Variable: host_is_superhost    Importance: 0.01
Variable: host_verification_level Importance: 0.01
Variable: security_deposit     Importance: 0.01
Variable: extra_people         Importance: 0.01
Variable: neighbourhood_cleansed_West Roxbury Importance: 0.01
Variable: host_identity_verified Importance: 0.0
Variable: is_location_exact    Importance: 0.0
Variable: instant_bookable     Importance: 0.0
Variable: neighbourhood_cleansed_Allston Impo

Based on the given information about data `review_score_cleanliness` is the rating given for cleanliness that listing.`review_scores_communication` is tells about the interaction with the host.

The variable importance of random forest also tells that the features review_score_cleanliness, review_scores_communication are important 

Performing GridSearchCV for Hyperparamter tuning

In [None]:
gsc = GridSearchCV(
        estimator=RandomForestRegressor(),
        param_grid={
            'max_depth': range(3,10),
            'n_estimators': (10, 50, 100, 1000),
        },
        cv=5, scoring='neg_mean_squared_error', verbose=0, n_jobs=-1)
    
grid_result = gsc.fit(listings_dum, labels)

In [None]:
best_params = grid_result.best_params_
rfr = RandomForestRegressor(max_depth=best_params["max_depth"], n_estimators=best_params["n_estimators"],                          random_state=False, verbose=False)
# Perform 10-Fold CV 
scores = cross_val_score(rfr, listings_dum, labels, cv=10, scoring='neg_mean_squared_error')

In [None]:
best_params

In [None]:
rfr.fit(train_features,train_labels)

In [None]:
print(rfr.get_params)

In [None]:
predictions_cv = rfr.predict(test_features)

In [None]:
# Calculate the absolute errors
errors = abs(predictions_cv - test_labels)

# Print out the mean absolute error (mae)
print('Mean Absolute Error:', round(np.mean(errors), 2))

In [None]:
pd.DataFrame(grid_result.cv_results_)

In [None]:
predictions = cross_val_predict(rfr, listings_dum, labels, cv = 10)

In [None]:
len(predictions)

In [None]:
listings_final['neighbourhood_cleansed']