## 4) Machine learning 

### (a) Machine learning question

The goal is now to predict Airbnb prices using the prices of neighbouring Airbnb listings, aggregated in different ways, together with the other variables that were discussed in the EDA section and selected as potentially helpful. Prices from the housing market and the crime data did not show large, significant effects in the EDA and are therefore not considered in any model.

Predicting Airbnb prices is a regression problem. The following models are implemented because they usually perform well on tabular data:

1. Linear models such as linear regression, Huber, Lasso and Ridge
2. Random Forest (a bagging approach)
3. Boosting algorithms such as `XGBoost` and `AdaBoost`

The idea is to test smaller and larger models. The baseline model is a linear model.

All algorithms are trained on the training set and evaluated on the test set. Grid search is used to find the hyperparameters. Random sampling (`train_test_split`) is used to create the training, validation and test sets. A single validation set is used instead of cross-validation so that the aggregated price information of one set does not leak into another. Performance is measured by R², MAE (mean absolute error) and RMSE (root mean square error).
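
As a reminder of how the three performance measures are defined, here is a minimal sketch that computes them by hand with NumPy. The function name `regression_metrics` and the numbers are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R², MAE and RMSE for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(residuals)),
        "rmse": np.sqrt(np.mean(residuals ** 2)),
    }

# Made-up log10 prices and predictions (not taken from the dataset)
y_valid = [1.9, 2.1, 2.3, 1.7]
y_hat = [2.0, 2.0, 2.2, 1.8]
metrics = regression_metrics(y_valid, y_hat)
```

Because the target is the log10 price, MAE and RMSE computed this way are on the log scale.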

### (b) Training, validation and test data

The preprocessing and the aggregations need to be done separately on the training, validation and test sets.

*Renaming variables*

In [221]:
# clean some variables
airbnb_prepro_df = air.prepro_dataset(airbnb_df, restrict=None)
airbnb_prepro_df.head(1)

Unnamed: 0,Air_id,Air_listing_url,Air_scrape_id,Air_last_scraped,Air_name,Air_summary,Air_space,Air_description,Air_experiences_offered,Air_neighborhood_overview,...,Air_license,Air_jurisdiction_names,Air_instant_bookable,Air_is_business_travel_ready,Air_cancellation_policy,Air_require_guest_profile_picture,Air_require_guest_phone_verification,Air_calculated_host_listings_count,Air_reviews_per_month,Air_smart_location_cleaned
0,9835,https://www.airbnb.com/rooms/9835,20181207034809,2018-12-07,Beautiful Room & House,,"House: Clean, New, Modern, Quite, Safe. 10Km f...","House: Clean, New, Modern, Quite, Safe. 10Km f...",none,Very safe! Family oriented. Older age group.,...,,,f,f,strict_14_with_grace_period,f,f,1.0,0.04,bulleen


*Log airbnb prices*

In [222]:
airbnb_prepro_df['Air_log_price'] = np.log10(airbnb_prepro_df.Air_price)
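
A small sketch of the transform and its inverse, using made-up prices. It illustrates that a price of 0 has no finite log10, which is one reason zero prices need to be handled before modelling; masking them with `where` is an assumption made for this example only.

```python
import numpy as np
import pandas as pd

# Made-up prices for illustration
prices = pd.Series([0.0, 68.0, 90.0, 250.0], name="Air_price")

# log10(0) is -inf, so zero prices are masked (set to NaN) before the transform
log_price = np.log10(prices.where(prices > 0))

# Predictions on the log scale are converted back to prices with 10 ** x
recovered = 10 ** log_price
```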

*Other adaptions* 

*Split into a training, validation and testing set*

In [223]:
train, test = train_test_split(airbnb_prepro_df, test_size=0.2, random_state=1)
train, valid = train_test_split(train, test_size=0.2, random_state=1)
train.head(1)

Unnamed: 0,Air_id,Air_listing_url,Air_scrape_id,Air_last_scraped,Air_name,Air_summary,Air_space,Air_description,Air_experiences_offered,Air_neighborhood_overview,...,Air_jurisdiction_names,Air_instant_bookable,Air_is_business_travel_ready,Air_cancellation_policy,Air_require_guest_profile_picture,Air_require_guest_phone_verification,Air_calculated_host_listings_count,Air_reviews_per_month,Air_smart_location_cleaned,Air_log_price
9718,18132900,https://www.airbnb.com/rooms/18132900,20181207034809,2018-12-07,PRIVATE BEDROOM & BATHROOM in SOUTHBANK,This is a 2 bedroom 2 bathroom/toilet apartmen...,Private room with queen size bed Own bathroom/...,This is a 2 bedroom 2 bathroom/toilet apartmen...,none,"Not far to Crown Casino, Flinders Station, Fed...",...,,f,f,moderate,f,f,1.0,7.42,southbank,1.954243


In [224]:
valid.head(1)

Unnamed: 0,Air_id,Air_listing_url,Air_scrape_id,Air_last_scraped,Air_name,Air_summary,Air_space,Air_description,Air_experiences_offered,Air_neighborhood_overview,...,Air_jurisdiction_names,Air_instant_bookable,Air_is_business_travel_ready,Air_cancellation_policy,Air_require_guest_profile_picture,Air_require_guest_phone_verification,Air_calculated_host_listings_count,Air_reviews_per_month,Air_smart_location_cleaned,Air_log_price
14141,22286586,https://www.airbnb.com/rooms/22286586,20181207034809,2018-12-07,墨尔本CBD 两室合租 独立房间 The fifth,"交通便利,楼下有多条市区免费巴士,距离Southen cross station 步行仅有三...",,"交通便利,楼下有多条市区免费巴士,距离Southen cross station 步行仅有三...",none,,...,,t,f,flexible,f,f,1.0,,melbourne,1.832509
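
Two successive 20 % splits leave roughly 64 % of the rows for training, 16 % for validation and 20 % for testing. The calls above do not pass `stratify`. If a stratified split is wanted, one possible approach is to bin the target and pass the bins to `train_test_split`, as sketched below on synthetic data. The column `price_bin` and the choice of quartiles are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic log prices, only to demonstrate the mechanics
rng = np.random.default_rng(1)
df = pd.DataFrame({"Air_log_price": rng.normal(2.0, 0.3, size=1000)})

# Hypothetical stratification variable: quartiles of the log price
df["price_bin"] = pd.qcut(df["Air_log_price"], q=4, labels=False)

train, test = train_test_split(df, test_size=0.2, random_state=1,
                               stratify=df["price_bin"])
train, valid = train_test_split(train, test_size=0.2, random_state=1,
                                stratify=train["price_bin"])

# Share of the full dataset that ends up in each part
shares = {name: len(part) / len(df)
          for name, part in [("train", train), ("valid", valid), ("test", test)]}
```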


*Remove outliers and implausible values*

In [225]:
# Implausible values
# ------------------
# Train
impl_obs_train = train.loc[(train.Air_price >= train.Air_weekly_price) |
                                (train.Air_price >= train.Air_monthly_price), 
                                ['Air_price', 'Air_weekly_price', 'Air_monthly_price']]

# Validation
impl_obs_valid = valid.loc[(valid.Air_price >= valid.Air_weekly_price) |
                                (valid.Air_price >= valid.Air_monthly_price), 
                                ['Air_price', 'Air_weekly_price', 'Air_monthly_price']]

# Test
impl_obs_test = test.loc[(test.Air_price >= test.Air_weekly_price) |
                                (test.Air_price >= test.Air_monthly_price), 
                                ['Air_price', 'Air_weekly_price', 'Air_monthly_price']]

# Outliers
# --------
# Train
quantiles = train.Air_log_price.quantile([.25, 0.75])
iqr = quantiles.iloc[1] - quantiles.iloc[0]
upper_bounds = quantiles.iloc[1] + 1.5*iqr
lower_bounds = quantiles.iloc[0] - 1.5*iqr

train_wo_out = train.loc[(train.Air_log_price >= lower_bounds) & (train.Air_log_price <= upper_bounds),:]
#print(train_wo_out.shape)

# Validation
quantiles = valid.Air_log_price.quantile([.25, 0.75])
iqr = quantiles.iloc[1] - quantiles.iloc[0]
upper_bounds = quantiles.iloc[1] + 1.5*iqr
lower_bounds = quantiles.iloc[0] - 1.5*iqr

valid_wo_out = valid.loc[(valid.Air_log_price >= lower_bounds) & (valid.Air_log_price <= upper_bounds),:]
#print(valid_wo_out.shape)

# Test
quantiles = test.Air_log_price.quantile([.25, 0.75])
iqr = quantiles.iloc[1] - quantiles.iloc[0]
upper_bounds = quantiles.iloc[1] + 1.5*iqr
lower_bounds = quantiles.iloc[0] - 1.5*iqr

test_wo_out = test.loc[(test.Air_log_price >= lower_bounds) & (test.Air_log_price <= upper_bounds),:]
#print(test_wo_out.shape)

# Implementation
# --------------
# Train
train_cleaned = train_wo_out.loc[(train_wo_out.Air_price != 0) &
                                          ~train_wo_out.index.isin(impl_obs_train.index), :]
print(train_cleaned.shape)

# Validation
valid_cleaned = valid_wo_out.loc[(valid_wo_out.Air_price != 0) &
                                          ~valid_wo_out.index.isin(impl_obs_valid.index), :]
print(valid_cleaned.shape)

# Test
test_cleaned = test_wo_out.loc[(test_wo_out.Air_price != 0) &
                                          ~test_wo_out.index.isin(impl_obs_test.index), :]
print(test_cleaned.shape)

(14296, 98)
(3574, 98)
(4465, 98)
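
The outlier rule used above (keep values within 1.5 times the interquartile range of the quartiles) is applied three times with identical code. A possible way to express it once is sketched below; the helper name `drop_iqr_outliers` and the toy data are hypothetical.

```python
import pandas as pd

def drop_iqr_outliers(df, column, k=1.5):
    """Keep rows whose `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df.loc[df[column].between(q1 - k * iqr, q3 + k * iqr)]

# Toy data: one value far above the rest
toy = pd.DataFrame({"Air_log_price": [1.8, 1.9, 2.0, 2.1, 2.2, 4.0]})
toy_wo_out = drop_iqr_outliers(toy, "Air_log_price")
```

Called once per split, for example `drop_iqr_outliers(train, 'Air_log_price')`, this keeps the bounds specific to each dataset, as in the cell above.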


#### Aggregated prices

We aggregate the information about neighbouring observations.

In [226]:
train_cleaned.index = range(train_cleaned.shape[0])
valid_cleaned.index = range(valid_cleaned.shape[0])
test_cleaned.index = range(test_cleaned.shape[0])

**Suburbs**

*Mean*

In [227]:
# Train
train_airbnb_agg = air.aggregate_data(train_cleaned, 
                                            ['Air_accommodates', 'Air_bathrooms', 'Air_bedrooms',
                                            'Air_beds', 'Air_guests_included', 'Air_log_price'], 
                                            ['Air_neighbourhood_cleansed'], agg_fun ='mean',
                                            classifier='Air_room_type')
train_airbnb_agg;

# Validation
valid_airbnb_agg = air.aggregate_data(valid_cleaned, 
                                            ['Air_accommodates', 'Air_bathrooms', 'Air_bedrooms',
                                            'Air_beds', 'Air_guests_included', 'Air_log_price'], 
                                            ['Air_neighbourhood_cleansed'], agg_fun ='mean',
                                            classifier='Air_room_type')
valid_airbnb_agg;

# Test
test_airbnb_agg = air.aggregate_data(test_cleaned, 
                                            ['Air_accommodates', 'Air_bathrooms', 'Air_bedrooms',
                                            'Air_beds', 'Air_guests_included', 'Air_log_price'], 
                                            ['Air_neighbourhood_cleansed'], agg_fun ='mean',
                                            classifier='Air_room_type')
test_airbnb_agg;

*Sum*

In [228]:
# Train
train_airbnb_agg_sum = air.aggregate_data(train_cleaned, 
                                            ['Air_accommodates', 'Air_bathrooms', 'Air_bedrooms',
                                            'Air_beds', 'Air_guests_included', 'Air_log_price'], 
                                            ['Air_neighbourhood_cleansed'], agg_fun ='sum',
                                            classifier='Air_room_type')

# Validation
valid_airbnb_agg_sum = air.aggregate_data(valid_cleaned, 
                                            ['Air_accommodates', 'Air_bathrooms', 'Air_bedrooms',
                                            'Air_beds', 'Air_guests_included', 'Air_log_price'], 
                                            ['Air_neighbourhood_cleansed'], agg_fun ='sum',
                                            classifier='Air_room_type')

# Test
test_airbnb_agg_sum = air.aggregate_data(test_cleaned, 
                                            ['Air_accommodates', 'Air_bathrooms', 'Air_bedrooms',
                                            'Air_beds', 'Air_guests_included', 'Air_log_price'], 
                                            ['Air_neighbourhood_cleansed'], agg_fun ='sum',
                                            classifier='Air_room_type')

*Count*

In [229]:
# Train
train_airbnb_agg_count = air.aggregate_data(train_cleaned, 
                                            ['Air_accommodates', 'Air_bathrooms', 'Air_bedrooms',
                                            'Air_beds', 'Air_guests_included', 'Air_log_price'], 
                                            ['Air_neighbourhood_cleansed'], agg_fun ='count',
                                            classifier='Air_room_type')

# Validation
valid_airbnb_agg_count = air.aggregate_data(valid_cleaned, 
                                            ['Air_accommodates', 'Air_bathrooms', 'Air_bedrooms',
                                            'Air_beds', 'Air_guests_included', 'Air_log_price'], 
                                            ['Air_neighbourhood_cleansed'], agg_fun ='count',
                                            classifier='Air_room_type')

# Test
test_airbnb_agg_count = air.aggregate_data(test_cleaned, 
                                            ['Air_accommodates', 'Air_bathrooms', 'Air_bedrooms',
                                            'Air_beds', 'Air_guests_included', 'Air_log_price'], 
                                            ['Air_neighbourhood_cleansed'], agg_fun ='count',
                                            classifier='Air_room_type')

In [230]:
train_airbnb_agg.head(5)

Unnamed: 0,Air_neighbourhood_cleansed,Air_accommodates_Entire home/apt,Air_accommodates_Private room,Air_accommodates_Shared room,Air_bathrooms_Entire home/apt,Air_bathrooms_Private room,Air_bathrooms_Shared room,Air_bedrooms_Entire home/apt,Air_bedrooms_Private room,Air_bedrooms_Shared room,Air_beds_Entire home/apt,Air_beds_Private room,Air_beds_Shared room,Air_guests_included_Entire home/apt,Air_guests_included_Private room,Air_guests_included_Shared room,Air_log_price_Entire home/apt,Air_log_price_Private room,Air_log_price_Shared room
0,Banyule,4.537313,2.265625,2.0,1.358209,1.1875,1.166667,2.0,1.1875,1.0,2.955224,1.578125,2.0,2.477612,1.28125,1.0,2.093758,1.718233,1.817773
1,Bayside,4.688742,1.901408,2.428571,1.692053,1.232394,1.571429,2.205298,1.056338,1.0,2.735099,1.239437,2.0,1.801325,1.239437,1.285714,2.23954,1.874226,1.683638
2,Boroondara,4.283582,1.892857,1.75,1.400498,1.139896,1.0,1.930348,1.040816,1.0,2.437811,1.205128,1.0,1.920398,1.132653,1.25,2.154398,1.780887,1.68158
3,Brimbank,8.066667,2.0,,1.566667,1.163462,,2.6,1.173077,,5.733333,1.25,,3.466667,1.519231,,2.040816,1.647705,
4,Cardinia,4.878049,2.46875,2.0,1.390244,1.046875,1.0,2.0,1.34375,1.0,2.815789,1.375,2.0,1.926829,1.125,1.0,2.268909,1.939552,1.60206


In [231]:
train_airbnb_agg_sum.head(5)

Unnamed: 0,Air_neighbourhood_cleansed,Air_accommodates_Entire home/apt,Air_accommodates_Private room,Air_accommodates_Shared room,Air_bathrooms_Entire home/apt,Air_bathrooms_Private room,Air_bathrooms_Shared room,Air_bedrooms_Entire home/apt,Air_bedrooms_Private room,Air_bedrooms_Shared room,Air_beds_Entire home/apt,Air_beds_Private room,Air_beds_Shared room,Air_guests_included_Entire home/apt,Air_guests_included_Private room,Air_guests_included_Shared room,Air_log_price_Entire home/apt,Air_log_price_Private room,Air_log_price_Shared room
0,Banyule,304.0,145.0,6.0,91.0,76.0,3.5,134.0,76.0,3.0,198.0,101.0,6.0,166.0,82.0,3.0,140.281813,109.966911,5.453318
1,Bayside,708.0,135.0,17.0,255.5,87.5,11.0,333.0,75.0,7.0,413.0,88.0,14.0,272.0,88.0,9.0,338.170479,133.070053,11.785464
2,Boroondara,861.0,371.0,7.0,281.5,220.0,3.0,388.0,204.0,4.0,490.0,235.0,3.0,386.0,222.0,5.0,433.03397,349.053947,6.72632
3,Brimbank,121.0,104.0,,23.5,60.5,,39.0,61.0,,86.0,65.0,,52.0,79.0,,30.612233,85.680662,
4,Cardinia,200.0,79.0,2.0,57.0,33.5,1.0,82.0,43.0,1.0,107.0,44.0,2.0,79.0,36.0,1.0,93.025282,62.065663,1.60206


In [232]:
train_airbnb_agg_count.head(5)

Unnamed: 0,Air_neighbourhood_cleansed,Air_accommodates_Entire home/apt,Air_accommodates_Private room,Air_accommodates_Shared room,Air_bathrooms_Entire home/apt,Air_bathrooms_Private room,Air_bathrooms_Shared room,Air_bedrooms_Entire home/apt,Air_bedrooms_Private room,Air_bedrooms_Shared room,Air_beds_Entire home/apt,Air_beds_Private room,Air_beds_Shared room,Air_guests_included_Entire home/apt,Air_guests_included_Private room,Air_guests_included_Shared room,Air_log_price_Entire home/apt,Air_log_price_Private room,Air_log_price_Shared room
0,Banyule,67.0,64.0,3.0,67.0,64.0,3.0,67.0,64.0,3.0,67.0,64.0,3.0,67.0,64.0,3.0,67.0,64.0,3.0
1,Bayside,151.0,71.0,7.0,151.0,71.0,7.0,151.0,71.0,7.0,151.0,71.0,7.0,151.0,71.0,7.0,151.0,71.0,7.0
2,Boroondara,201.0,196.0,4.0,201.0,193.0,3.0,201.0,196.0,4.0,201.0,195.0,3.0,201.0,196.0,4.0,201.0,196.0,4.0
3,Brimbank,15.0,52.0,,15.0,52.0,,15.0,52.0,,15.0,52.0,,15.0,52.0,,15.0,52.0,
4,Cardinia,41.0,32.0,1.0,41.0,32.0,1.0,41.0,32.0,1.0,38.0,32.0,1.0,41.0,32.0,1.0,41.0,32.0,1.0
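
The helper `air.aggregate_data` is not shown in this section. Judging from its arguments and the tables above, it appears to compute one aggregate per suburb and room type and to name the columns `<variable>_<room type>`. A minimal sketch of such an aggregation with `pivot_table` follows; the function `aggregate_by_group` and the toy data are hypothetical and may differ from the actual implementation.

```python
import pandas as pd

def aggregate_by_group(df, value_cols, group_cols, agg_fun="mean", classifier=None):
    """Aggregate value_cols per group, with one output column per classifier level."""
    if classifier is None:
        return df.groupby(group_cols)[value_cols].agg(agg_fun).reset_index()
    wide = df.pivot_table(index=group_cols, columns=classifier,
                          values=value_cols, aggfunc=agg_fun)
    # Flatten the (variable, level) column MultiIndex to "variable_level"
    wide.columns = [f"{var}_{level}" for var, level in wide.columns]
    return wide.reset_index()

# Toy listings (made-up values)
toy = pd.DataFrame({
    "Air_neighbourhood_cleansed": ["Banyule", "Banyule", "Banyule", "Bayside"],
    "Air_room_type": ["Entire home/apt", "Entire home/apt",
                      "Private room", "Private room"],
    "Air_log_price": [2.0, 2.2, 1.7, 1.9],
    "Air_beds": [2, 4, 1, 1],
})
suburb_mean = aggregate_by_group(toy, ["Air_log_price", "Air_beds"],
                                 ["Air_neighbourhood_cleansed"],
                                 agg_fun="mean", classifier="Air_room_type")
```

Combinations that do not occur in the data (here an entire home in Bayside) come out as missing values, which matches the empty cells for Brimbank in the tables above.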


Save to disk

In [233]:
# Train
pickle.dump(train_airbnb_agg, open("train_airbnb_agg.p", "wb"))
pickle.dump(train_airbnb_agg_sum, open("train_airbnb_agg_sum.p", "wb"))
pickle.dump(train_airbnb_agg_count, open("train_airbnb_agg_count.p", "wb"))

# Validation
pickle.dump(valid_airbnb_agg, open("valid_airbnb_agg.p", "wb"))
pickle.dump(valid_airbnb_agg_sum, open("valid_airbnb_agg_sum.p", "wb"))
pickle.dump(valid_airbnb_agg_count, open("valid_airbnb_agg_count.p", "wb"))

# Test
pickle.dump(test_airbnb_agg, open("test_airbnb_agg.p", "wb"))
pickle.dump(test_airbnb_agg_sum, open("test_airbnb_agg_sum.p", "wb"))
pickle.dump(test_airbnb_agg_count, open("test_airbnb_agg_count.p", "wb"))

Load files

In [234]:
# Train
train_airbnb_agg = pickle.load(open("train_airbnb_agg.p", "rb"))
train_airbnb_agg_sum = pickle.load(open("train_airbnb_agg_sum.p", "rb"))
train_airbnb_agg_count = pickle.load(open("train_airbnb_agg_count.p", "rb"))

# Validation
valid_airbnb_agg = pickle.load(open("valid_airbnb_agg.p", "rb"))
valid_airbnb_agg_sum = pickle.load(open("valid_airbnb_agg_sum.p", "rb"))
valid_airbnb_agg_count = pickle.load(open("valid_airbnb_agg_count.p", "rb"))

# Test
test_airbnb_agg = pickle.load(open("test_airbnb_agg.p", "rb"))
test_airbnb_agg_sum = pickle.load(open("test_airbnb_agg_sum.p", "rb"))
test_airbnb_agg_count = pickle.load(open("test_airbnb_agg_count.p", "rb"))
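
The save and load cells open files without closing them explicitly. A possible alternative is a pair of small helpers that use a context manager, so that each file handle is closed as soon as the object has been written or read. The names `save_pickle` and `load_pickle` are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

def save_pickle(obj, path):
    """Write obj to path; the file is closed when the block ends."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)

def load_pickle(path):
    """Read a pickled object from path."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

# Round trip in a temporary directory; the dictionary stands in for a DataFrame
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "train_airbnb_agg.p"
    save_pickle({"Banyule": 2.093758}, target)
    restored = load_pickle(target)
```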

**50, 100 and 500 meters**

*These steps are time-consuming, so the cells are set to Raw and are not executed.*

Open the files

In [235]:
# Train
train_airbnb_obs_50m = pickle.load(open("train_airbnb_obs_50m.p", "rb"))
train_airbnb_obs_100m = pickle.load(open("train_airbnb_obs_100m.p", "rb"))
train_airbnb_obs_500m = pickle.load(open("train_airbnb_obs_500m.p", "rb"))

# Validation
valid_airbnb_obs_50m = pickle.load(open("valid_airbnb_obs_50m.p", "rb"))
valid_airbnb_obs_100m = pickle.load(open("valid_airbnb_obs_100m.p", "rb"))
valid_airbnb_obs_500m = pickle.load(open("valid_airbnb_obs_500m.p", "rb"))

# Test
test_airbnb_obs_50m = pickle.load(open("test_airbnb_obs_50m.p", "rb"))
test_airbnb_obs_100m = pickle.load(open("test_airbnb_obs_100m.p", "rb"))
test_airbnb_obs_500m = pickle.load(open("test_airbnb_obs_500m.p", "rb"))

Extract the values

*These steps are time-consuming, so the cells are set to Raw and are not executed.*

Save the results

Open the files

In [236]:
# Train
train_airbnb_agg_50m = pickle.load(open("train_airbnb_agg_50m.p", "rb"))
train_airbnb_agg_100m = pickle.load(open("train_airbnb_agg_100m.p", "rb"))
train_airbnb_agg_500m = pickle.load(open("train_airbnb_agg_500m.p", "rb"))

# Validation
valid_airbnb_agg_50m = pickle.load(open("valid_airbnb_agg_50m.p", "rb"))
valid_airbnb_agg_100m = pickle.load(open("valid_airbnb_agg_100m.p", "rb"))
valid_airbnb_agg_500m = pickle.load(open("valid_airbnb_agg_500m.p", "rb"))

# Test
test_airbnb_agg_50m = pickle.load(open("test_airbnb_agg_50m.p", "rb"))
test_airbnb_agg_100m = pickle.load(open("test_airbnb_agg_100m.p", "rb"))
test_airbnb_agg_500m = pickle.load(open("test_airbnb_agg_500m.p", "rb"))
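
The cells that select and aggregate the listings within 50, 100 and 500 meters are not shown. The sketch below illustrates one way such a radius aggregation could work, using the haversine formula on a full distance matrix. It assumes that a listing is not counted as its own neighbour, which may differ from the original code; all function names and coordinates are hypothetical.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres

def haversine_matrix(lat, lon):
    """Pairwise great-circle distances in metres between all points."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

def radius_mean_and_count(lat, lon, values, radius_m):
    """Mean and count of `values` over all other points within radius_m."""
    values = np.asarray(values, dtype=float)
    within = haversine_matrix(lat, lon) <= radius_m
    np.fill_diagonal(within, False)  # assumption: a listing is not its own neighbour
    counts = within.sum(axis=1)
    sums = within.astype(float) @ values
    means = np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)
    return means, counts

# Three made-up points: two about 22 m apart, one roughly 1.1 km further south
lat = [-37.8136, -37.8138, -37.8236]
lon = [144.9631, 144.9631, 144.9631]
log_price = [2.0, 1.8, 2.4]
means, counts = radius_mean_and_count(lat, lon, log_price, radius_m=50)
```

The full distance matrix needs memory proportional to the square of the number of listings, so for the complete dataset a spatial index such as scikit-learn's `BallTree` with `metric='haversine'` would be a more practical choice.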

**Nearest neighbour**

In [237]:
# Train
train_lat_long = train_cleaned.loc[:, ['Air_latitude', 'Air_longitude', 'Air_id']]
train_lat_long_type = train_cleaned.loc[:, ['Air_latitude', 'Air_longitude', 
                                            'Air_room_type', 'Air_property_type_2',
                                            'Air_id']]

# Validation
valid_lat_long = valid_cleaned.loc[:, ['Air_latitude', 'Air_longitude', 'Air_id']]
valid_lat_long_type = valid_cleaned.loc[:, ['Air_latitude', 'Air_longitude', 
                                            'Air_room_type', 'Air_property_type_2',
                                            'Air_id']]

# Test
test_lat_long = test_cleaned.loc[:, ['Air_latitude', 'Air_longitude', 'Air_id']]
test_lat_long_type = test_cleaned.loc[:, ['Air_latitude', 'Air_longitude', 
                                          'Air_room_type', 'Air_property_type_2',
                                          'Air_id']]

First, we select the observations ...

*These steps are time-consuming, so the cells are set to Raw and are not executed.*

Save the results

Load the results

In [238]:
train_airbnb_obs_nearest = pickle.load(open("train_airbnb_obs_nearest.p", "rb"))
valid_airbnb_obs_nearest = pickle.load(open("valid_airbnb_obs_nearest.p", "rb"))
test_airbnb_obs_nearest = pickle.load(open("test_airbnb_obs_nearest.p", "rb"))

... then we select the values.

In [239]:
# Train
train_airbnb_agg_nearest = air.the_nearest_obs_prices(train_cleaned, 
                                                      train_cleaned,
                                                      train_airbnb_obs_nearest, 
                                                      ['Air_log_price', 'Air_accommodates', 
                                                       'Air_bathrooms', 'Air_bedrooms',
                                                       'Air_beds', 'Air_guests_included'],
                                                      classifier='Air_room_type',
                                                      distance=True)

In [240]:
# Validation
valid_airbnb_agg_nearest = air.the_nearest_obs_prices(valid_cleaned, 
                                                      valid_cleaned,
                                                      valid_airbnb_obs_nearest, 
                                                      ['Air_log_price', 'Air_accommodates', 
                                                       'Air_bathrooms', 'Air_bedrooms',
                                                       'Air_beds', 'Air_guests_included'],
                                                      classifier='Air_room_type',
                                                      distance=True)

In [241]:
# Test
test_airbnb_agg_nearest = air.the_nearest_obs_prices(test_cleaned, 
                                                     test_cleaned,
                                                     test_airbnb_obs_nearest, 
                                                     ['Air_log_price', 'Air_accommodates', 
                                                     'Air_bathrooms', 'Air_bedrooms',
                                                     'Air_beds', 'Air_guests_included'],
                                                     classifier='Air_room_type',
                                                     distance=True)

Save to disk

In [242]:
pickle.dump(train_airbnb_agg_nearest, open("train_airbnb_agg_nearest.p", "wb"))
pickle.dump(valid_airbnb_agg_nearest, open("valid_airbnb_agg_nearest.p", "wb"))
pickle.dump(test_airbnb_agg_nearest, open("test_airbnb_agg_nearest.p", "wb"))

Load the file

In [243]:
train_airbnb_agg_nearest = pickle.load(open("train_airbnb_agg_nearest.p", "rb"))
valid_airbnb_agg_nearest = pickle.load(open("valid_airbnb_agg_nearest.p", "rb"))
test_airbnb_agg_nearest = pickle.load(open("test_airbnb_agg_nearest.p", "rb"))
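
The nearest-neighbour selection itself is not shown, and `air.the_nearest_obs_prices` is not defined in this section. As a rough illustration of the idea only, the sketch below finds the nearest other listing for each listing and returns its index and distance. It uses a local planar (equirectangular) approximation, which is usually adequate over a city-sized area; the function name `nearest_other` and the coordinates are hypothetical.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres

def nearest_other(lat, lon):
    """Index of and distance (m) to the nearest other point, for every point."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    # Local planar projection (equirectangular approximation)
    x = EARTH_RADIUS_M * lon * np.cos(lat.mean())
    y = EARTH_RADIUS_M * lat
    dist = np.hypot(x[:, None] - x[None, :], y[:, None] - y[None, :])
    np.fill_diagonal(dist, np.inf)  # exclude the point itself
    idx = dist.argmin(axis=1)
    return idx, dist[np.arange(len(idx)), idx]

# Made-up coordinates and log prices
lat = [-37.8136, -37.8138, -37.8236]
lon = [144.9631, 144.9631, 144.9631]
log_price = np.array([2.0, 1.8, 2.4])

idx, dist_m = nearest_other(lat, lon)
nearest_log_price = log_price[idx]  # value of the nearest neighbour per listing
```

Restricting the search to listings of the same room type, as the `classifier` argument above suggests, could be done by running the same function once per room type.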

**Merge**

We first load the files.

In [244]:
# 50, 100 and 500 meters
train_airbnb_agg_50m = pickle.load(open("train_airbnb_agg_50m.p", "rb"))
train_airbnb_agg_100m = pickle.load(open("train_airbnb_agg_100m.p", "rb"))
train_airbnb_agg_500m = pickle.load(open("train_airbnb_agg_500m.p", "rb"))

valid_airbnb_agg_50m = pickle.load(open("valid_airbnb_agg_50m.p", "rb"))
valid_airbnb_agg_100m = pickle.load(open("valid_airbnb_agg_100m.p", "rb"))
valid_airbnb_agg_500m = pickle.load(open("valid_airbnb_agg_500m.p", "rb"))

test_airbnb_agg_50m = pickle.load(open("test_airbnb_agg_50m.p", "rb"))
test_airbnb_agg_100m = pickle.load(open("test_airbnb_agg_100m.p", "rb"))
test_airbnb_agg_500m = pickle.load(open("test_airbnb_agg_500m.p", "rb"))

# nearest
train_airbnb_agg_nearest = pickle.load(open("train_airbnb_agg_nearest.p", "rb"))
valid_airbnb_agg_nearest = pickle.load(open("valid_airbnb_agg_nearest.p", "rb"))
test_airbnb_agg_nearest = pickle.load(open("test_airbnb_agg_nearest.p", "rb"))

# suburbs
train_airbnb_agg = pickle.load(open("train_airbnb_agg.p", "rb"))
train_airbnb_agg_sum = pickle.load(open("train_airbnb_agg_sum.p", "rb"))
train_airbnb_agg_count = pickle.load(open("train_airbnb_agg_count.p", "rb"))

valid_airbnb_agg = pickle.load(open("valid_airbnb_agg.p", "rb"))
valid_airbnb_agg_sum = pickle.load(open("valid_airbnb_agg_sum.p", "rb"))
valid_airbnb_agg_count = pickle.load(open("valid_airbnb_agg_count.p", "rb"))

test_airbnb_agg = pickle.load(open("test_airbnb_agg.p", "rb"))
test_airbnb_agg_sum = pickle.load(open("test_airbnb_agg_sum.p", "rb"))
test_airbnb_agg_count = pickle.load(open("test_airbnb_agg_count.p", "rb"))

In [245]:
# Train
train_airbnb_agg_50m_mean = train_airbnb_agg_50m['mean']
train_airbnb_agg_100m_mean = train_airbnb_agg_100m['mean']
train_airbnb_agg_500m_mean = train_airbnb_agg_500m['mean']

train_airbnb_agg_50m_count = train_airbnb_agg_50m['count']
train_airbnb_agg_100m_count = train_airbnb_agg_100m['count']
train_airbnb_agg_500m_count = train_airbnb_agg_500m['count']

# Validation
valid_airbnb_agg_50m_mean = valid_airbnb_agg_50m['mean']
valid_airbnb_agg_100m_mean = valid_airbnb_agg_100m['mean']
valid_airbnb_agg_500m_mean = valid_airbnb_agg_500m['mean']

valid_airbnb_agg_50m_count = valid_airbnb_agg_50m['count']
valid_airbnb_agg_100m_count = valid_airbnb_agg_100m['count']
valid_airbnb_agg_500m_count = valid_airbnb_agg_500m['count']

# Test
test_airbnb_agg_50m_mean = test_airbnb_agg_50m['mean']
test_airbnb_agg_100m_mean = test_airbnb_agg_100m['mean']
test_airbnb_agg_500m_mean = test_airbnb_agg_500m['mean']

test_airbnb_agg_50m_count = test_airbnb_agg_50m['count']
test_airbnb_agg_100m_count = test_airbnb_agg_100m['count']
test_airbnb_agg_500m_count = test_airbnb_agg_500m['count']

In [246]:
train_airbnb_agg_50m_mean.head(1)

Unnamed: 0,Air_group,Air_accommodates_Entire home/apt,Air_accommodates_Private room,Air_accommodates_Shared room,Air_bathrooms_Entire home/apt,Air_bathrooms_Private room,Air_bathrooms_Shared room,Air_bedrooms_Entire home/apt,Air_bedrooms_Private room,Air_bedrooms_Shared room,Air_beds_Entire home/apt,Air_beds_Private room,Air_beds_Shared room,Air_guests_included_Entire home/apt,Air_guests_included_Private room,Air_guests_included_Shared room,Air_log_price_Entire home/apt,Air_log_price_Private room,Air_log_price_Shared room
0,0,5.0,2.0,,1.333333,1.0,,2.0,1.0,,2.666667,1.0,,2.0,1.5,,2.274274,1.864316,


In [247]:
train_airbnb_agg_100m_mean.head(1)

Unnamed: 0,Air_group,Air_accommodates_Entire home/apt,Air_accommodates_Private room,Air_accommodates_Shared room,Air_bathrooms_Entire home/apt,Air_bathrooms_Private room,Air_bathrooms_Shared room,Air_bedrooms_Entire home/apt,Air_bedrooms_Private room,Air_bedrooms_Shared room,Air_beds_Entire home/apt,Air_beds_Private room,Air_beds_Shared room,Air_guests_included_Entire home/apt,Air_guests_included_Private room,Air_guests_included_Shared room,Air_log_price_Entire home/apt,Air_log_price_Private room,Air_log_price_Shared room
0,0,4.647059,1.928571,1.0,1.411765,1.214286,1.0,1.823529,1.071429,1.0,2.588235,1.071429,1.0,2.176471,1.5,1.0,2.237612,1.849264,1.665714


In [248]:
train_airbnb_agg_500m_mean.head(1)

Unnamed: 0,Air_group,Air_accommodates_Entire home/apt,Air_accommodates_Private room,Air_accommodates_Shared room,Air_bathrooms_Entire home/apt,Air_bathrooms_Private room,Air_bathrooms_Shared room,Air_bedrooms_Entire home/apt,Air_bedrooms_Private room,Air_bedrooms_Shared room,Air_beds_Entire home/apt,Air_beds_Private room,Air_beds_Shared room,Air_guests_included_Entire home/apt,Air_guests_included_Private room,Air_guests_included_Shared room,Air_log_price_Entire home/apt,Air_log_price_Private room,Air_log_price_Shared room
0,0,4.693237,1.848684,1.222222,1.426329,1.078947,1.166667,1.864734,1.019737,1.0,2.620773,1.072848,1.222222,2.702899,1.203947,1.111111,2.263383,1.887373,1.629927


We rename the variables so that we can distinguish between them.

In [249]:
# Train
# -----
# 50m, 100m and 500m - mean
cols = [e + '_50m' for e in train_airbnb_agg_50m_mean.columns if e != 'Air_group']
train_airbnb_agg_50m_mean.columns = ['Air_group'] + cols
cols = [e + '_100m' for e in train_airbnb_agg_100m_mean.columns if e != 'Air_group']
train_airbnb_agg_100m_mean.columns = ['Air_group'] + cols
cols = [e + '_500m' for e in train_airbnb_agg_500m_mean.columns if e != 'Air_group']
train_airbnb_agg_500m_mean.columns = ['Air_group'] + cols

# 50m, 100m and 500m - count
cols = [e + '_50m_count' for e in train_airbnb_agg_50m_count.columns if e != 'Air_group']
train_airbnb_agg_50m_count.columns = ['Air_group'] + cols
cols = [e + '_100m_count' for e in train_airbnb_agg_100m_count.columns if e != 'Air_group']
train_airbnb_agg_100m_count.columns = ['Air_group'] + cols
cols = [e + '_500m_count' for e in train_airbnb_agg_500m_count.columns if e != 'Air_group']
train_airbnb_agg_500m_count.columns = ['Air_group'] + cols

# nearest
cols = [e + '_nearest' for e in train_airbnb_agg_nearest.columns if e != 'Air_group']
train_airbnb_agg_nearest.columns = ['Air_group'] + cols

# suburbs - sum
cols = [e + '_suburb_sum' for e in train_airbnb_agg_sum.columns if e != 'Air_neighbourhood_cleansed']
train_airbnb_agg_sum.columns = ['Air_neighbourhood_cleansed'] + cols

# suburbs - count
cols = [e + '_suburb_count' for e in train_airbnb_agg_count.columns if e != 'Air_neighbourhood_cleansed']
train_airbnb_agg_count.columns = ['Air_neighbourhood_cleansed'] + cols


# Validation
# ----------
# 50m, 100m and 500m - mean
cols = [e + '_50m' for e in valid_airbnb_agg_50m_mean.columns if e != 'Air_group']
valid_airbnb_agg_50m_mean.columns = ['Air_group'] + cols
cols = [e + '_100m' for e in valid_airbnb_agg_100m_mean.columns if e != 'Air_group']
valid_airbnb_agg_100m_mean.columns = ['Air_group'] + cols
cols = [e + '_500m' for e in valid_airbnb_agg_500m_mean.columns if e != 'Air_group']
valid_airbnb_agg_500m_mean.columns = ['Air_group'] + cols

# 50m, 100m and 500m - count
cols = [e + '_50m_count' for e in valid_airbnb_agg_50m_count.columns if e != 'Air_group']
valid_airbnb_agg_50m_count.columns = ['Air_group'] + cols
cols = [e + '_100m_count' for e in valid_airbnb_agg_100m_count.columns if e != 'Air_group']
valid_airbnb_agg_100m_count.columns = ['Air_group'] + cols
cols = [e + '_500m_count' for e in valid_airbnb_agg_500m_count.columns if e != 'Air_group']
valid_airbnb_agg_500m_count.columns = ['Air_group'] + cols

# nearest
cols = [e + '_nearest' for e in valid_airbnb_agg_nearest.columns if e != 'Air_group']
valid_airbnb_agg_nearest.columns = ['Air_group'] + cols

# suburbs - sum
cols = [e + '_suburb_sum' for e in valid_airbnb_agg_sum.columns if e != 'Air_neighbourhood_cleansed']
valid_airbnb_agg_sum.columns = ['Air_neighbourhood_cleansed'] + cols

# suburbs - count
cols = [e + '_suburb_count' for e in valid_airbnb_agg_count.columns if e != 'Air_neighbourhood_cleansed']
valid_airbnb_agg_count.columns = ['Air_neighbourhood_cleansed'] + cols


# Test
# ----
# 50m, 100m and 500m - mean
cols = [e + '_50m' for e in test_airbnb_agg_50m_mean.columns if e != 'Air_group']
test_airbnb_agg_50m_mean.columns = ['Air_group'] + cols
cols = [e + '_100m' for e in test_airbnb_agg_100m_mean.columns if e != 'Air_group']
test_airbnb_agg_100m_mean.columns = ['Air_group'] + cols
cols = [e + '_500m' for e in test_airbnb_agg_500m_mean.columns if e != 'Air_group']
test_airbnb_agg_500m_mean.columns = ['Air_group'] + cols

# 50m, 100m and 500m - count
cols = [e + '_50m_count' for e in test_airbnb_agg_50m_count.columns if e != 'Air_group']
test_airbnb_agg_50m_count.columns = ['Air_group'] + cols
cols = [e + '_100m_count' for e in test_airbnb_agg_100m_count.columns if e != 'Air_group']
test_airbnb_agg_100m_count.columns = ['Air_group'] + cols
cols = [e + '_500m_count' for e in test_airbnb_agg_500m_count.columns if e != 'Air_group']
test_airbnb_agg_500m_count.columns = ['Air_group'] + cols

# nearest
cols = [e + '_nearest' for e in test_airbnb_agg_nearest.columns if e != 'Air_group']
test_airbnb_agg_nearest.columns = ['Air_group'] + cols

# suburbs - sum
cols = [e + '_suburb_sum' for e in test_airbnb_agg_sum.columns if e != 'Air_neighbourhood_cleansed']
test_airbnb_agg_sum.columns = ['Air_neighbourhood_cleansed'] + cols

# suburbs - count
cols = [e + '_suburb_count' for e in test_airbnb_agg_count.columns if e != 'Air_neighbourhood_cleansed']
test_airbnb_agg_count.columns = ['Air_neighbourhood_cleansed'] + cols
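
Because the same renaming pattern is repeated for many tables, a small helper can reduce copy-and-paste mistakes. The sketch below appends a suffix to every column except the merge key and, unlike the cell above, does not rely on the key being the first column. The name `add_suffix_except` and the toy table are hypothetical.

```python
import pandas as pd

def add_suffix_except(df, suffix, key):
    """Return a copy of df with `suffix` appended to every column except `key`."""
    return df.rename(columns={c: c + suffix for c in df.columns if c != key})

# Toy aggregate table (made-up values)
toy_50m = pd.DataFrame({"Air_group": [0, 1],
                        "Air_log_price_Private room": [1.86, 1.91]})
renamed = add_suffix_except(toy_50m, "_50m", "Air_group")
```

A loop over `(table, suffix)` pairs could then replace the repeated blocks, for example `add_suffix_except(train_airbnb_agg_50m_mean, '_50m', 'Air_group')`.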

Let's check one dataset.

In [250]:
train_airbnb_agg_50m_mean.head(1)

Unnamed: 0,Air_group,Air_accommodates_Entire home/apt_50m,Air_accommodates_Private room_50m,Air_accommodates_Shared room_50m,Air_bathrooms_Entire home/apt_50m,Air_bathrooms_Private room_50m,Air_bathrooms_Shared room_50m,Air_bedrooms_Entire home/apt_50m,Air_bedrooms_Private room_50m,Air_bedrooms_Shared room_50m,Air_beds_Entire home/apt_50m,Air_beds_Private room_50m,Air_beds_Shared room_50m,Air_guests_included_Entire home/apt_50m,Air_guests_included_Private room_50m,Air_guests_included_Shared room_50m,Air_log_price_Entire home/apt_50m,Air_log_price_Private room_50m,Air_log_price_Shared room_50m
0,0,5.0,2.0,,1.333333,1.0,,2.0,1.0,,2.666667,1.0,,2.0,1.5,,2.274274,1.864316,


Now, the datasets can be merged.

In [251]:
train_cleaned['Air_group'] = range(train_cleaned.shape[0])
valid_cleaned['Air_group'] = range(valid_cleaned.shape[0])
test_cleaned['Air_group'] = range(test_cleaned.shape[0])

Let's print the shapes of the different datasets again so that we can check whether the merges are performed correctly.

In [252]:
print(train_cleaned.shape)
print(valid_cleaned.shape)
print(test_cleaned.shape)

(14296, 99)
(3574, 99)
(4465, 99)
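
A left merge on a key that is unique in the right-hand table should leave the number of rows unchanged and only add columns. Besides printing the shapes, this can be checked automatically. The sketch below uses the `validate` argument of `DataFrame.merge` together with an explicit row-count check; the helper name `checked_left_merge` and the toy tables are hypothetical.

```python
import pandas as pd

def checked_left_merge(left, right, on):
    """Left merge that fails loudly if `on` is duplicated in `right`."""
    merged = left.merge(right, on=on, how="left", validate="many_to_one")
    assert len(merged) == len(left), "left merge changed the number of rows"
    return merged

# Toy tables (made-up values); listing 2 has no neighbours within 50 m
listings = pd.DataFrame({"Air_group": [0, 1, 2],
                         "Air_log_price": [1.95, 1.83, 2.10]})
agg_50m = pd.DataFrame({"Air_group": [0, 1],
                        "Air_log_price_Private room_50m": [1.86, 1.91]})
merged = checked_left_merge(listings, agg_50m, on="Air_group")
```

Listings without a match keep their row and receive missing values in the new columns.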


In [253]:
# Train
merged_train = train_cleaned.merge(train_airbnb_agg_50m_mean, left_on='Air_group', right_on='Air_group',
                                 how='left')
print(merged_train.shape)
merged_train = merged_train.merge(train_airbnb_agg_100m_mean, left_on='Air_group', right_on='Air_group',
                     how='left')
print(merged_train.shape)
merged_train = merged_train.merge(train_airbnb_agg_500m_mean, left_on='Air_group', right_on='Air_group',
                     how='left')
print(merged_train.shape)
merged_train = merged_train.merge(train_airbnb_agg_50m_count, left_on='Air_group', right_on='Air_group',
                     how='left')
print(merged_train.shape)
merged_train = merged_train.merge(train_airbnb_agg_100m_count, left_on='Air_group', right_on='Air_group',
                     how='left')
print(merged_train.shape)
merged_train = merged_train.merge(train_airbnb_agg_500m_count, left_on='Air_group', right_on='Air_group',
                     how='left')
print(merged_train.shape)

merged_train = merged_train.merge(train_airbnb_agg_nearest, left_on='Air_group', right_on='Air_group',
                     how='left')
print(merged_train.shape)

merged_train = merged_train.merge(train_airbnb_agg, left_on='Air_neighbourhood_cleansed', 
                     right_on='Air_neighbourhood_cleansed', how='left')
print(merged_train.shape)
merged_train = merged_train.merge(train_airbnb_agg_sum, left_on='Air_neighbourhood_cleansed', 
                     right_on='Air_neighbourhood_cleansed', how='left')
print(merged_train.shape)
merged_train = merged_train.merge(train_airbnb_agg_count, left_on='Air_neighbourhood_cleansed', 
                     right_on='Air_neighbourhood_cleansed', how='left')
print(merged_train.shape)


# Validation
# ----------
merged_valid = valid_cleaned.merge(valid_airbnb_agg_50m_mean, left_on='Air_group', right_on='Air_group',
                                 how='left')
print(merged_valid.shape)
merged_valid = merged_valid.merge(valid_airbnb_agg_100m_mean, left_on='Air_group', right_on='Air_group',
                     how='left')
print(merged_valid.shape)
merged_valid = merged_valid.merge(valid_airbnb_agg_500m_mean, left_on='Air_group', right_on='Air_group',
                     how='left')
print(merged_valid.shape)
merged_valid = merged_valid.merge(valid_airbnb_agg_50m_count, left_on='Air_group', right_on='Air_group',
                     how='left')
print(merged_valid.shape)
merged_valid = merged_valid.merge(valid_airbnb_agg_100m_count, left_on='Air_group', right_on='Air_group',
                     how='left')
print(merged_valid.shape)
merged_valid = merged_valid.merge(valid_airbnb_agg_500m_count, left_on='Air_group', right_on='Air_group',
                     how='left')
print(merged_valid.shape)

merged_valid = merged_valid.merge(valid_airbnb_agg_nearest, left_on='Air_group', right_on='Air_group',
                     how='left')
print(merged_valid.shape)

merged_valid = merged_valid.merge(valid_airbnb_agg, left_on='Air_neighbourhood_cleansed', 
                     right_on='Air_neighbourhood_cleansed', how='left')
print(merged_valid.shape)
merged_valid = merged_valid.merge(valid_airbnb_agg_sum, left_on='Air_neighbourhood_cleansed', 
                     right_on='Air_neighbourhood_cleansed', how='left')
print(merged_valid.shape)
merged_valid = merged_valid.merge(valid_airbnb_agg_count, left_on='Air_neighbourhood_cleansed', 
                     right_on='Air_neighbourhood_cleansed', how='left')
print(merged_valid.shape)


# Test
# ----
merged_test = test_cleaned.merge(test_airbnb_agg_50m_mean, left_on='Air_group', right_on='Air_group',
                                 how='left')
print(merged_test.shape)
merged_test = merged_test.merge(test_airbnb_agg_100m_mean, left_on='Air_group', right_on='Air_group',
                     how='left')
print(merged_test.shape)
merged_test = merged_test.merge(test_airbnb_agg_500m_mean, left_on='Air_group', right_on='Air_group',
                     how='left')
print(merged_test.shape)
merged_test = merged_test.merge(test_airbnb_agg_50m_count, left_on='Air_group', right_on='Air_group',
                     how='left')
print(merged_test.shape)
merged_test = merged_test.merge(test_airbnb_agg_100m_count, left_on='Air_group', right_on='Air_group',
                     how='left')
print(merged_test.shape)
merged_test = merged_test.merge(test_airbnb_agg_500m_count, left_on='Air_group', right_on='Air_group',
                     how='left')
print(merged_test.shape)

merged_test = merged_test.merge(test_airbnb_agg_nearest, left_on='Air_group', right_on='Air_group',
                     how='left')
print(merged_test.shape)

merged_test = merged_test.merge(test_airbnb_agg, left_on='Air_neighbourhood_cleansed', 
                     right_on='Air_neighbourhood_cleansed', how='left')
print(merged_test.shape)
merged_test = merged_test.merge(test_airbnb_agg_sum, left_on='Air_neighbourhood_cleansed', 
                     right_on='Air_neighbourhood_cleansed', how='left')
print(merged_test.shape)
merged_test = merged_test.merge(test_airbnb_agg_count, left_on='Air_neighbourhood_cleansed', 
                     right_on='Air_neighbourhood_cleansed', how='left')
print(merged_test.shape)

(14296, 117)
(14296, 135)
(14296, 153)
(14296, 171)
(14296, 189)
(14296, 207)
(14296, 228)
(14296, 246)
(14296, 264)
(14296, 282)
(3574, 117)
(3574, 135)
(3574, 153)
(3574, 171)
(3574, 189)
(3574, 207)
(3574, 228)
(3574, 246)
(3574, 264)
(3574, 282)
(4465, 117)
(4465, 135)
(4465, 153)
(4465, 171)
(4465, 189)
(4465, 207)
(4465, 228)
(4465, 246)
(4465, 264)
(4465, 282)
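
The printed shapes show that the row counts stay constant while the columns grow with every merge. As a hedged sketch of how the same chain of left merges could be written more compactly, the helper below (the name `merge_all` and the toy frames are assumptions for illustration, not the notebook's code) folds a list of aggregate tables onto a base table and checks the row count after each step. The `validate='many_to_one'` argument asks pandas to raise an error if the key is duplicated in an aggregate table, which would otherwise silently multiply rows.

```python
from functools import reduce

import pandas as pd


def merge_all(base, tables, key):
    """Left-merge every table in `tables` onto `base` on `key`."""
    def _merge(left, right):
        merged = left.merge(right, on=key, how='left', validate='many_to_one')
        # A left merge on a unique right-hand key must not change the row count
        assert merged.shape[0] == left.shape[0]
        return merged
    return reduce(_merge, tables, base)


# Toy data standing in for a cleaned dataset and two aggregates
base = pd.DataFrame({'Air_group': [0, 1, 2], 'Air_log_price': [2.0, 1.8, 2.3]})
agg_50m = pd.DataFrame({'Air_group': [0, 1],
                        'Air_log_price_Private room_50m': [1.9, 2.1]})
agg_100m = pd.DataFrame({'Air_group': [0, 1, 2],
                         'Air_log_price_Private room_100m': [1.95, 2.0, 2.2]})

merged_toy = merge_all(base, [agg_50m, agg_100m], key='Air_group')
print(merged_toy.shape)
```

Listings without a neighbour in an aggregate (here `Air_group` 2 at 50 m) simply receive `NaN` in the corresponding columns.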


**Additional variables**

Now, we have to define the additional variables as we did in the EDA part.

In [254]:
# Train
# -----
# AIRBNB SUBURBS
# Averages
merged_train['Air_log_price_suburb_same_room_type'] = np.nan
merged_train['Air_accommodates_suburb_same_room_type'] = np.nan
merged_train['Air_bathrooms_suburb_same_room_type'] = np.nan
merged_train['Air_bedrooms_suburb_same_room_type'] = np.nan
merged_train['Air_beds_suburb_same_room_type'] = np.nan
merged_train['Air_guests_included_suburb_same_room_type'] = np.nan

# Sums (needed to correct the suburbs averages later)
merged_train['Air_log_price_suburb_same_room_type_sum'] = np.nan
merged_train['Air_accommodates_suburb_same_room_type_sum'] = np.nan
merged_train['Air_bathrooms_suburb_same_room_type_sum'] = np.nan
merged_train['Air_bedrooms_suburb_same_room_type_sum'] = np.nan
merged_train['Air_beds_suburb_same_room_type_sum'] = np.nan
merged_train['Air_guests_included_suburb_same_room_type_sum'] = np.nan

# Counts
merged_train['Air_suburb_same_room_type_count'] = np.nan

# AIRBNB OTHER FORMS OF AGGREGATION
# Averages
merged_train['Air_log_price_50m_same_room_type'] = np.nan
merged_train['Air_accommodates_50m_same_room_type'] = np.nan
merged_train['Air_bathrooms_50m_same_room_type'] = np.nan
merged_train['Air_bedrooms_50m_same_room_type'] = np.nan
merged_train['Air_beds_50m_same_room_type'] = np.nan
merged_train['Air_guests_included_50m_same_room_type'] = np.nan

merged_train['Air_log_price_100m_same_room_type'] = np.nan
merged_train['Air_accommodates_100m_same_room_type'] = np.nan
merged_train['Air_bathrooms_100m_same_room_type'] = np.nan
merged_train['Air_bedrooms_100m_same_room_type'] = np.nan
merged_train['Air_beds_100m_same_room_type'] = np.nan
merged_train['Air_guests_included_100m_same_room_type'] = np.nan

merged_train['Air_log_price_500m_same_room_type'] = np.nan
merged_train['Air_accommodates_500m_same_room_type'] = np.nan
merged_train['Air_bathrooms_500m_same_room_type'] = np.nan
merged_train['Air_bedrooms_500m_same_room_type'] = np.nan
merged_train['Air_beds_500m_same_room_type'] = np.nan
merged_train['Air_guests_included_500m_same_room_type'] = np.nan

merged_train['Air_log_price_nearest_same_room_type'] = np.nan

# Counts
merged_train['Air_50m_same_room_type_count'] = np.nan
merged_train['Air_100m_same_room_type_count'] = np.nan
merged_train['Air_500m_same_room_type_count'] = np.nan

merged_train['Air_nearest_same_room_type_count'] = np.nan


# Validation
# ----------
# AIRBNB SUBURBS
# Averages
merged_valid['Air_log_price_suburb_same_room_type'] = np.nan
merged_valid['Air_accommodates_suburb_same_room_type'] = np.nan
merged_valid['Air_bathrooms_suburb_same_room_type'] = np.nan
merged_valid['Air_bedrooms_suburb_same_room_type'] = np.nan
merged_valid['Air_beds_suburb_same_room_type'] = np.nan
merged_valid['Air_guests_included_suburb_same_room_type'] = np.nan

# Sums (needed to correct the suburbs averages later)
merged_valid['Air_log_price_suburb_same_room_type_sum'] = np.nan
merged_valid['Air_accommodates_suburb_same_room_type_sum'] = np.nan
merged_valid['Air_bathrooms_suburb_same_room_type_sum'] = np.nan
merged_valid['Air_bedrooms_suburb_same_room_type_sum'] = np.nan
merged_valid['Air_beds_suburb_same_room_type_sum'] = np.nan
merged_valid['Air_guests_included_suburb_same_room_type_sum'] = np.nan

# Counts
merged_valid['Air_suburb_same_room_type_count'] = np.nan

# AIRBNB OTHER FORMS OF AGGREGATION
# Averages
merged_valid['Air_log_price_50m_same_room_type'] = np.nan
merged_valid['Air_accommodates_50m_same_room_type'] = np.nan
merged_valid['Air_bathrooms_50m_same_room_type'] = np.nan
merged_valid['Air_bedrooms_50m_same_room_type'] = np.nan
merged_valid['Air_beds_50m_same_room_type'] = np.nan
merged_valid['Air_guests_included_50m_same_room_type'] = np.nan

merged_valid['Air_log_price_100m_same_room_type'] = np.nan
merged_valid['Air_accommodates_100m_same_room_type'] = np.nan
merged_valid['Air_bathrooms_100m_same_room_type'] = np.nan
merged_valid['Air_bedrooms_100m_same_room_type'] = np.nan
merged_valid['Air_beds_100m_same_room_type'] = np.nan
merged_valid['Air_guests_included_100m_same_room_type'] = np.nan

merged_valid['Air_log_price_500m_same_room_type'] = np.nan
merged_valid['Air_accommodates_500m_same_room_type'] = np.nan
merged_valid['Air_bathrooms_500m_same_room_type'] = np.nan
merged_valid['Air_bedrooms_500m_same_room_type'] = np.nan
merged_valid['Air_beds_500m_same_room_type'] = np.nan
merged_valid['Air_guests_included_500m_same_room_type'] = np.nan

merged_valid['Air_log_price_nearest_same_room_type'] = np.nan

# Counts
merged_valid['Air_50m_same_room_type_count'] = np.nan
merged_valid['Air_100m_same_room_type_count'] = np.nan
merged_valid['Air_500m_same_room_type_count'] = np.nan

merged_valid['Air_nearest_same_room_type_count'] = np.nan


# Test
# ----
# AIRBNB SUBURBS
# Averages
merged_test['Air_log_price_suburb_same_room_type'] = np.nan
merged_test['Air_accommodates_suburb_same_room_type'] = np.nan
merged_test['Air_bathrooms_suburb_same_room_type'] = np.nan
merged_test['Air_bedrooms_suburb_same_room_type'] = np.nan
merged_test['Air_beds_suburb_same_room_type'] = np.nan
merged_test['Air_guests_included_suburb_same_room_type'] = np.nan

# Sums (needed to correct the suburbs averages later)
merged_test['Air_log_price_suburb_same_room_type_sum'] = np.nan
merged_test['Air_accommodates_suburb_same_room_type_sum'] = np.nan
merged_test['Air_bathrooms_suburb_same_room_type_sum'] = np.nan
merged_test['Air_bedrooms_suburb_same_room_type_sum'] = np.nan
merged_test['Air_beds_suburb_same_room_type_sum'] = np.nan
merged_test['Air_guests_included_suburb_same_room_type_sum'] = np.nan

# Counts
merged_test['Air_suburb_same_room_type_count'] = np.nan

# AIRBNB OTHER FORMS OF AGGREGATION
# Averages
merged_test['Air_log_price_50m_same_room_type'] = np.nan
merged_test['Air_accommodates_50m_same_room_type'] = np.nan
merged_test['Air_bathrooms_50m_same_room_type'] = np.nan
merged_test['Air_bedrooms_50m_same_room_type'] = np.nan
merged_test['Air_beds_50m_same_room_type'] = np.nan
merged_test['Air_guests_included_50m_same_room_type'] = np.nan

merged_test['Air_log_price_100m_same_room_type'] = np.nan
merged_test['Air_accommodates_100m_same_room_type'] = np.nan
merged_test['Air_bathrooms_100m_same_room_type'] = np.nan
merged_test['Air_bedrooms_100m_same_room_type'] = np.nan
merged_test['Air_beds_100m_same_room_type'] = np.nan
merged_test['Air_guests_included_100m_same_room_type'] = np.nan

merged_test['Air_log_price_500m_same_room_type'] = np.nan
merged_test['Air_accommodates_500m_same_room_type'] = np.nan
merged_test['Air_bathrooms_500m_same_room_type'] = np.nan
merged_test['Air_bedrooms_500m_same_room_type'] = np.nan
merged_test['Air_beds_500m_same_room_type'] = np.nan
merged_test['Air_guests_included_500m_same_room_type'] = np.nan

merged_test['Air_log_price_nearest_same_room_type'] = np.nan

# Counts
merged_test['Air_50m_same_room_type_count'] = np.nan
merged_test['Air_100m_same_room_type_count'] = np.nan
merged_test['Air_500m_same_room_type_count'] = np.nan

merged_test['Air_nearest_same_room_type_count'] = np.nan
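
The placeholder columns follow a regular naming scheme, so they can also be generated in a loop. The following is a minimal sketch under that assumption. The function `add_placeholder_columns` and the constants `VARIABLES` and `RADII` are hypothetical names introduced here, while the generated column names mirror the ones used in the cell above.

```python
import numpy as np
import pandas as pd

VARIABLES = ['log_price', 'accommodates', 'bathrooms', 'bedrooms', 'beds',
             'guests_included']
RADII = ['50m', '100m', '500m']


def add_placeholder_columns(df):
    """Add the NaN-initialised 'same room type' columns to `df` in place."""
    new_cols = []
    # Suburb averages, sums and count
    new_cols += [f'Air_{v}_suburb_same_room_type' for v in VARIABLES]
    new_cols += [f'Air_{v}_suburb_same_room_type_sum' for v in VARIABLES]
    new_cols.append('Air_suburb_same_room_type_count')
    # Distance-based averages
    for radius in RADII:
        new_cols += [f'Air_{v}_{radius}_same_room_type' for v in VARIABLES]
    new_cols.append('Air_log_price_nearest_same_room_type')
    # Distance-based counts
    new_cols += [f'Air_{radius}_same_room_type_count' for radius in RADII]
    new_cols.append('Air_nearest_same_room_type_count')
    for col in new_cols:
        df[col] = np.nan
    return df


toy = add_placeholder_columns(pd.DataFrame({'Air_room_type': ['Private room']}))
print(toy.shape)
```

Calling the function once per dataset (training, validation, test) would keep the three column sets identical by construction.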

In [255]:
# Train
# -----

# AIRBNB SUBURBS
# Sums
# ----
# log price
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_log_price_suburb_same_room_type_sum'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_log_price_Entire home/apt_suburb_sum']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_log_price_suburb_same_room_type_sum'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_log_price_Private room_suburb_sum']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_log_price_suburb_same_room_type_sum'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_log_price_Shared room_suburb_sum']
# accommodates
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_accommodates_suburb_same_room_type_sum'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_accommodates_Entire home/apt_suburb_sum']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_accommodates_suburb_same_room_type_sum'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_accommodates_Private room_suburb_sum']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_accommodates_suburb_same_room_type_sum'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_accommodates_Shared room_suburb_sum']
# bathrooms
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_bathrooms_suburb_same_room_type_sum'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_bathrooms_Entire home/apt_suburb_sum']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_bathrooms_suburb_same_room_type_sum'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_bathrooms_Private room_suburb_sum']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_bathrooms_suburb_same_room_type_sum'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_bathrooms_Shared room_suburb_sum']
# bedrooms
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_bedrooms_suburb_same_room_type_sum'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_bedrooms_Entire home/apt_suburb_sum']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_bedrooms_suburb_same_room_type_sum'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_bedrooms_Private room_suburb_sum']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_bedrooms_suburb_same_room_type_sum'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_bedrooms_Shared room_suburb_sum']
# beds
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_beds_suburb_same_room_type_sum'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_beds_Entire home/apt_suburb_sum']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_beds_suburb_same_room_type_sum'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_beds_Private room_suburb_sum']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_beds_suburb_same_room_type_sum'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_beds_Shared room_suburb_sum']
# guests_included
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_guests_included_suburb_same_room_type_sum'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_guests_included_Entire home/apt_suburb_sum']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_guests_included_suburb_same_room_type_sum'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_guests_included_Private room_suburb_sum']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_guests_included_suburb_same_room_type_sum'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_guests_included_Shared room_suburb_sum']

# Counts
# ------
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_suburb_same_room_type_count'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_log_price_Entire home/apt_suburb_count']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_suburb_same_room_type_count'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_log_price_Private room_suburb_count']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_suburb_same_room_type_count'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_log_price_Shared room_suburb_count']

# Means
# -----
# log price
merged_train['Air_log_price_suburb_same_room_type'] = (merged_train.Air_log_price_suburb_same_room_type_sum - \
                                                 merged_train.Air_log_price) / \
                                                (merged_train.Air_suburb_same_room_type_count - 1)
# accommodates
merged_train['Air_accommodates_suburb_same_room_type'] = (merged_train.Air_accommodates_suburb_same_room_type_sum - \
                                                    merged_train.Air_accommodates) / \
                                                   (merged_train.Air_suburb_same_room_type_count - 1)
# bathrooms
merged_train['Air_bathrooms_suburb_same_room_type'] = (merged_train.Air_bathrooms_suburb_same_room_type_sum - \
                                                 merged_train.Air_bathrooms) / \
                                                (merged_train.Air_suburb_same_room_type_count - 1)
# bedrooms
merged_train['Air_bedrooms_suburb_same_room_type'] = (merged_train.Air_bedrooms_suburb_same_room_type_sum - \
                                                merged_train.Air_bedrooms) / \
                                               (merged_train.Air_suburb_same_room_type_count - 1)
# beds
merged_train['Air_beds_suburb_same_room_type'] = (merged_train.Air_beds_suburb_same_room_type_sum - \
                                            merged_train.Air_beds) / \
                                           (merged_train.Air_suburb_same_room_type_count - 1)
# guests_included
merged_train['Air_guests_included_suburb_same_room_type'] = (merged_train.Air_guests_included_suburb_same_room_type_sum - \
                                                       merged_train.Air_guests_included) / \
                                                      (merged_train.Air_suburb_same_room_type_count - 1)


# Validation
# ----------

# AIRBNB SUBURBS
# Sums
# ----
# log price
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_log_price_suburb_same_room_type_sum'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_log_price_Entire home/apt_suburb_sum']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_log_price_suburb_same_room_type_sum'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_log_price_Private room_suburb_sum']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_log_price_suburb_same_room_type_sum'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_log_price_Shared room_suburb_sum']
# accommodates
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_accommodates_suburb_same_room_type_sum'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_accommodates_Entire home/apt_suburb_sum']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_accommodates_suburb_same_room_type_sum'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_accommodates_Private room_suburb_sum']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_accommodates_suburb_same_room_type_sum'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_accommodates_Shared room_suburb_sum']
# bathrooms
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_bathrooms_suburb_same_room_type_sum'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_bathrooms_Entire home/apt_suburb_sum']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_bathrooms_suburb_same_room_type_sum'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_bathrooms_Private room_suburb_sum']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_bathrooms_suburb_same_room_type_sum'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_bathrooms_Shared room_suburb_sum']
# bedrooms
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_bedrooms_suburb_same_room_type_sum'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_bedrooms_Entire home/apt_suburb_sum']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_bedrooms_suburb_same_room_type_sum'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_bedrooms_Private room_suburb_sum']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_bedrooms_suburb_same_room_type_sum'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_bedrooms_Shared room_suburb_sum']
# beds
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_beds_suburb_same_room_type_sum'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_beds_Entire home/apt_suburb_sum']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_beds_suburb_same_room_type_sum'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_beds_Private room_suburb_sum']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_beds_suburb_same_room_type_sum'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_beds_Shared room_suburb_sum']
# guests_included
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_guests_included_suburb_same_room_type_sum'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_guests_included_Entire home/apt_suburb_sum']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_guests_included_suburb_same_room_type_sum'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_guests_included_Private room_suburb_sum']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_guests_included_suburb_same_room_type_sum'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_guests_included_Shared room_suburb_sum']

# Counts
# ------
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_suburb_same_room_type_count'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_log_price_Entire home/apt_suburb_count']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_suburb_same_room_type_count'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_log_price_Private room_suburb_count']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_suburb_same_room_type_count'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_log_price_Shared room_suburb_count']

# Means
# -----
# log price
merged_valid['Air_log_price_suburb_same_room_type'] = (merged_valid.Air_log_price_suburb_same_room_type_sum - \
                                                 merged_valid.Air_log_price) / \
                                                (merged_valid.Air_suburb_same_room_type_count - 1)
# accommodates
merged_valid['Air_accommodates_suburb_same_room_type'] = (merged_valid.Air_accommodates_suburb_same_room_type_sum - \
                                                    merged_valid.Air_accommodates) / \
                                                   (merged_valid.Air_suburb_same_room_type_count - 1)
# bathrooms
merged_valid['Air_bathrooms_suburb_same_room_type'] = (merged_valid.Air_bathrooms_suburb_same_room_type_sum - \
                                                 merged_valid.Air_bathrooms) / \
                                                (merged_valid.Air_suburb_same_room_type_count - 1)
# bedrooms
merged_valid['Air_bedrooms_suburb_same_room_type'] = (merged_valid.Air_bedrooms_suburb_same_room_type_sum - \
                                                merged_valid.Air_bedrooms) / \
                                               (merged_valid.Air_suburb_same_room_type_count - 1)
# beds
merged_valid['Air_beds_suburb_same_room_type'] = (merged_valid.Air_beds_suburb_same_room_type_sum - \
                                            merged_valid.Air_beds) / \
                                           (merged_valid.Air_suburb_same_room_type_count - 1)
# guests_included
merged_valid['Air_guests_included_suburb_same_room_type'] = (merged_valid.Air_guests_included_suburb_same_room_type_sum - \
                                                       merged_valid.Air_guests_included) / \
                                                      (merged_valid.Air_suburb_same_room_type_count - 1)


# Test
# ----

# AIRBNB SUBURBS
# Sums
# ----
# log price
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_log_price_suburb_same_room_type_sum'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_log_price_Entire home/apt_suburb_sum']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_log_price_suburb_same_room_type_sum'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_log_price_Private room_suburb_sum']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_log_price_suburb_same_room_type_sum'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_log_price_Shared room_suburb_sum']
# accommodates
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_accommodates_suburb_same_room_type_sum'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_accommodates_Entire home/apt_suburb_sum']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_accommodates_suburb_same_room_type_sum'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_accommodates_Private room_suburb_sum']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_accommodates_suburb_same_room_type_sum'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_accommodates_Shared room_suburb_sum']
# bathrooms
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_bathrooms_suburb_same_room_type_sum'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_bathrooms_Entire home/apt_suburb_sum']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_bathrooms_suburb_same_room_type_sum'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_bathrooms_Private room_suburb_sum']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_bathrooms_suburb_same_room_type_sum'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_bathrooms_Shared room_suburb_sum']
# bedrooms
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_bedrooms_suburb_same_room_type_sum'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_bedrooms_Entire home/apt_suburb_sum']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_bedrooms_suburb_same_room_type_sum'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_bedrooms_Private room_suburb_sum']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_bedrooms_suburb_same_room_type_sum'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_bedrooms_Shared room_suburb_sum']
# beds
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_beds_suburb_same_room_type_sum'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_beds_Entire home/apt_suburb_sum']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_beds_suburb_same_room_type_sum'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_beds_Private room_suburb_sum']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_beds_suburb_same_room_type_sum'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_beds_Shared room_suburb_sum']
# guests_included
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_guests_included_suburb_same_room_type_sum'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_guests_included_Entire home/apt_suburb_sum']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_guests_included_suburb_same_room_type_sum'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_guests_included_Private room_suburb_sum']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_guests_included_suburb_same_room_type_sum'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_guests_included_Shared room_suburb_sum']

# Counts
# ------
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_suburb_same_room_type_count'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_log_price_Entire home/apt_suburb_count']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_suburb_same_room_type_count'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_log_price_Private room_suburb_count']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_suburb_same_room_type_count'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_log_price_Shared room_suburb_count']

# Means
# -----
# log price
merged_test['Air_log_price_suburb_same_room_type'] = (merged_test.Air_log_price_suburb_same_room_type_sum - \
                                                 merged_test.Air_log_price) / \
                                                (merged_test.Air_suburb_same_room_type_count - 1)
# accommodates
merged_test['Air_accommodates_suburb_same_room_type'] = (merged_test.Air_accommodates_suburb_same_room_type_sum - \
                                                    merged_test.Air_accommodates) / \
                                                   (merged_test.Air_suburb_same_room_type_count - 1)
# bathrooms
merged_test['Air_bathrooms_suburb_same_room_type'] = (merged_test.Air_bathrooms_suburb_same_room_type_sum - \
                                                 merged_test.Air_bathrooms) / \
                                                (merged_test.Air_suburb_same_room_type_count - 1)
# bedrooms
merged_test['Air_bedrooms_suburb_same_room_type'] = (merged_test.Air_bedrooms_suburb_same_room_type_sum - \
                                                merged_test.Air_bedrooms) / \
                                               (merged_test.Air_suburb_same_room_type_count - 1)
# beds
merged_test['Air_beds_suburb_same_room_type'] = (merged_test.Air_beds_suburb_same_room_type_sum - \
                                            merged_test.Air_beds) / \
                                           (merged_test.Air_suburb_same_room_type_count - 1)
# guests_included
merged_test['Air_guests_included_suburb_same_room_type'] = (merged_test.Air_guests_included_suburb_same_room_type_sum - \
                                                       merged_test.Air_guests_included) / \
                                                      (merged_test.Air_suburb_same_room_type_count - 1)
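
The two steps in the cell above (pick the aggregate that matches the listing's own room type, then turn sum and count into a leave-one-out mean) can be sketched on a toy frame. This is only an illustration under assumptions: the helper names `pick_same_room_type` and `leave_one_out_mean` and the toy values are hypothetical and not part of the notebook, and the explicit guard for a count of 1 is an addition that the original cell does not have.

```python
import numpy as np
import pandas as pd

ROOM_TYPES = ["Entire home/apt", "Private room", "Shared room"]


def pick_same_room_type(df, source_template, target):
    """Copy, per row, the aggregate of the listing's own room type into `target`.

    `source_template` contains a `{room_type}` placeholder (hypothetical helper).
    """
    df[target] = np.nan
    for room_type in ROOM_TYPES:
        mask = df["Air_room_type"] == room_type
        source = source_template.format(room_type=room_type)
        df.loc[mask, target] = df.loc[mask, source]
    return df


def leave_one_out_mean(total, own, count):
    """Mean of the other listings: (sum - own value) / (count - 1).

    Assumption: a listing without any neighbour (count <= 1) gets NaN.
    """
    denominator = (count - 1).where(count > 1)
    return (total - own) / denominator


# Toy data: four listings in one suburb; sums and counts include the listing itself.
toy = pd.DataFrame({
    "Air_room_type": ["Entire home/apt", "Entire home/apt", "Private room", "Shared room"],
    "Air_log_price": [5.0, 4.0, 3.0, 2.0],
    "Air_log_price_Entire home/apt_suburb_sum": [9.0] * 4,
    "Air_log_price_Private room_suburb_sum": [3.0] * 4,
    "Air_log_price_Shared room_suburb_sum": [2.0] * 4,
    "Air_log_price_Entire home/apt_suburb_count": [2] * 4,
    "Air_log_price_Private room_suburb_count": [1] * 4,
    "Air_log_price_Shared room_suburb_count": [1] * 4,
})

toy = pick_same_room_type(toy, "Air_log_price_{room_type}_suburb_sum",
                          "Air_log_price_suburb_same_room_type_sum")
toy = pick_same_room_type(toy, "Air_log_price_{room_type}_suburb_count",
                          "Air_suburb_same_room_type_count")
toy["Air_log_price_suburb_same_room_type"] = leave_one_out_mean(
    toy["Air_log_price_suburb_same_room_type_sum"],
    toy["Air_log_price"],
    toy["Air_suburb_same_room_type_count"],
)
print(toy[["Air_room_type", "Air_log_price", "Air_log_price_suburb_same_room_type"]])
```

The two entire homes each receive the other's log price as their feature, while the single private room and the single shared room have no neighbour of the same type and stay missing. Looping over the room types in one helper also avoids the copy-and-paste slips that are easy to make when every variable and radius is written out by hand.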

In [256]:
# Train
# -----

# AIRBNB OTHER FORMS OF AGGREGATION
# Means
# #####
# Same room type - 50m
# --------------------
# log price
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_log_price_50m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_log_price_Entire home/apt_50m']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_log_price_50m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_log_price_Private room_50m']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_log_price_50m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_log_price_Shared room_50m']
# accommodates
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_accommodates_50m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_accommodates_Entire home/apt_50m']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_accommodates_50m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_accommodates_Private room_50m']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_accommodates_50m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_accommodates_Shared room_50m']
# bathrooms
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_bathrooms_50m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_bathrooms_Entire home/apt_50m']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_bathrooms_50m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_bathrooms_Private room_50m']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_bathrooms_50m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_bathrooms_Shared room_50m']
# bedrooms
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_bedrooms_50m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_bedrooms_Entire home/apt_50m']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_bedrooms_50m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_bedrooms_Private room_50m']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_bedrooms_50m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_bedrooms_Shared room_50m']
# beds
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_beds_50m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_beds_Entire home/apt_50m']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_beds_50m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_beds_Private room_50m']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_beds_50m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_beds_Shared room_50m']
# guests_included
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_guests_included_50m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_guests_included_Entire home/apt_50m']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_guests_included_50m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_guests_included_Private room_50m']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_guests_included_50m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_guests_included_Shared room_50m']

# Same room type - 100m
# ---------------------
# log price
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_log_price_100m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_log_price_Entire home/apt_100m']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_log_price_100m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_log_price_Private room_100m']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_log_price_100m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_log_price_Shared room_100m']
# accommodates
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_accommodates_100m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_accommodates_Entire home/apt_100m']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_accommodates_100m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_accommodates_Private room_100m']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_accommodates_100m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_accommodates_Shared room_100m']
# bathrooms
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_bathrooms_100m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_bathrooms_Entire home/apt_100m']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_bathrooms_100m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_bathrooms_Private room_100m']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_bathrooms_100m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_bathrooms_Shared room_100m']
# bedrooms
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_bedrooms_100m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_bedrooms_Entire home/apt_100m']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_bedrooms_100m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_bedrooms_Private room_100m']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_bedrooms_100m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_bedrooms_Shared room_100m']
# beds
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_beds_100m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_beds_Entire home/apt_100m']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_beds_100m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_beds_Private room_100m']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_beds_100m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_beds_Shared room_100m']
# guests_included
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_guests_included_100m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_guests_included_Entire home/apt_100m']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_guests_included_100m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_guests_included_Private room_100m']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_guests_included_100m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_guests_included_Shared room_100m']

# Same room type - 500m
# ---------------------
# log price
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_log_price_500m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_log_price_Entire home/apt_500m']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_log_price_500m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_log_price_Private room_500m']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_log_price_500m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_log_price_Shared room_500m']
# accommodates
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_accommodates_500m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_accommodates_Entire home/apt_500m']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_accommodates_500m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_accommodates_Private room_500m']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_accommodates_500m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_accommodates_Shared room_500m']
# bathrooms
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_bathrooms_500m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_bathrooms_Entire home/apt_500m']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_bathrooms_500m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_bathrooms_Private room_500m']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_bathrooms_500m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_bathrooms_Shared room_500m']
# bedrooms
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_bedrooms_500m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_bedrooms_Entire home/apt_500m']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_bedrooms_500m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_bedrooms_Private room_500m']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_bedrooms_500m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_bedrooms_Shared room_500m']
# beds
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_beds_500m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_beds_Entire home/apt_500m']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_beds_500m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_beds_Private room_500m']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_beds_500m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_beds_Shared room_500m']
# guests_included
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_guests_included_500m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_guests_included_Entire home/apt_500m']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_guests_included_500m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room','Air_guests_included_Private room_500m']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_guests_included_500m_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_guests_included_Shared room_500m']

# Same room type - nearest
# ------------------------
# log price
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_log_price_nearest_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_log_price_Entire home/apt_nearest']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_log_price_nearest_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_log_price_Private room_nearest']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_log_price_nearest_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_log_price_Shared room_nearest']
# accommodates
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_accommodates_nearest_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_accommodates_Entire home/apt_nearest']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_accommodates_nearest_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_accommodates_Private room_nearest']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_accommodates_nearest_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_accommodates_Shared room_nearest']
# bathrooms
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_bathrooms_nearest_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_bathrooms_Entire home/apt_nearest']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_bathrooms_nearest_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_bathrooms_Private room_nearest']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_bathrooms_nearest_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_bathrooms_Shared room_nearest']
# bedrooms
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_bedrooms_nearest_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_bedrooms_Entire home/apt_nearest']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_bedrooms_nearest_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_bedrooms_Private room_nearest']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_bedrooms_nearest_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_bedrooms_Shared room_nearest']
# beds
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_beds_nearest_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_beds_Entire home/apt_nearest']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_beds_nearest_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_beds_Private room_nearest']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_beds_nearest_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_beds_Shared room_nearest']
# guests_included
merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_guests_included_nearest_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Entire home/apt', 'Air_guests_included_Entire home/apt_nearest']
merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_guests_included_nearest_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Private room', 'Air_guests_included_Private room_nearest']
merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_guests_included_nearest_same_room_type'] = \
            merged_train.loc[merged_train.Air_room_type == 'Shared room', 'Air_guests_included_Shared room_nearest']



# Validation
# ----------

# AIRBNB OTHER FORMS OF AGGREGATION
# Means
# #####
# Same room type - 50m
# --------------------
# log price
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_log_price_50m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_log_price_Entire home/apt_50m']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_log_price_50m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_log_price_Private room_50m']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_log_price_50m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_log_price_Shared room_50m']
# accommodates
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_accommodates_50m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_accommodates_Entire home/apt_50m']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_accommodates_50m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_accommodates_Private room_50m']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_accommodates_50m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_accommodates_Shared room_50m']
# bathrooms
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_bathrooms_50m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_bathrooms_Entire home/apt_50m']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_bathrooms_50m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_bathrooms_Private room_50m']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_bathrooms_50m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_bathrooms_Shared room_50m']
# bedrooms
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_bedrooms_50m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_bedrooms_Entire home/apt_50m']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_bedrooms_50m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_bedrooms_Private room_50m']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_bedrooms_50m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_bedrooms_Shared room_50m']
# beds
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_beds_50m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_beds_Entire home/apt_50m']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_beds_50m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_beds_Private room_50m']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_beds_50m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_beds_Shared room_50m']
# guests_included
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_guests_included_50m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_guests_included_Entire home/apt_50m']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_guests_included_50m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_guests_included_Private room_50m']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_guests_included_50m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_guests_included_Shared room_50m']

# Same room type - 100m
# ---------------------
# log price
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_log_price_100m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_log_price_Entire home/apt_100m']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_log_price_100m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_log_price_Private room_100m']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_log_price_100m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_log_price_Shared room_100m']
# accommodates
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_accommodates_100m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_accommodates_Entire home/apt_100m']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_accommodates_100m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_accommodates_Private room_100m']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_accommodates_100m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_accommodates_Shared room_100m']
# bathrooms
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_bathrooms_100m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_bathrooms_Entire home/apt_100m']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_bathrooms_100m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_bathrooms_Private room_100m']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_bathrooms_100m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_bathrooms_Shared room_100m']
# bedrooms
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_bedrooms_100m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_bedrooms_Entire home/apt_100m']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_bedrooms_100m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_bedrooms_Private room_100m']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_bedrooms_100m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_bedrooms_Shared room_100m']
# beds
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_beds_100m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_beds_Entire home/apt_100m']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_beds_100m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_beds_Private room_100m']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_beds_100m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_beds_Shared room_100m']
# guests_included
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_guests_included_100m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_guests_included_Entire home/apt_100m']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_guests_included_100m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_guests_included_Private room_100m']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_guests_included_100m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_guests_included_Shared room_100m']

# Same room type - 500m
# ---------------------
# log price
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_log_price_500m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_log_price_Entire home/apt_500m']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_log_price_500m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_log_price_Private room_500m']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_log_price_500m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_log_price_Shared room_500m']
# accommodates
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_accommodates_500m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_accommodates_Entire home/apt_500m']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_accommodates_500m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_accommodates_Private room_500m']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_accommodates_500m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_accommodates_Shared room_500m']
# bathrooms
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_bathrooms_500m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_bathrooms_Entire home/apt_500m']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_bathrooms_500m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_bathrooms_Private room_500m']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_bathrooms_500m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_bathrooms_Shared room_500m']
# bedrooms
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_bedrooms_500m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_bedrooms_Entire home/apt_500m']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_bedrooms_500m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_bedrooms_Private room_500m']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_bedrooms_500m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_bedrooms_Shared room_500m']
# beds
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_beds_500m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_beds_Entire home/apt_500m']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_beds_500m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_beds_Private room_500m']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_beds_500m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_beds_Shared room_500m']
# guests_included
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_guests_included_500m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_guests_included_Entire home/apt_500m']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_guests_included_500m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room','Air_guests_included_Private room_500m']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_guests_included_500m_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_guests_included_Shared room_500m']

# Same room type - nearest
# ------------------------
# log price
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_log_price_nearest_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_log_price_Entire home/apt_nearest']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_log_price_nearest_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_log_price_Private room_nearest']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_log_price_nearest_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_log_price_Shared room_nearest']
# accommodates
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_accommodates_nearest_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_accommodates_Entire home/apt_nearest']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_accommodates_nearest_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_accommodates_Private room_nearest']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_accommodates_nearest_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_accommodates_Shared room_nearest']
# bathrooms
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_bathrooms_nearest_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_bathrooms_Entire home/apt_nearest']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_bathrooms_nearest_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_bathrooms_Private room_nearest']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_bathrooms_nearest_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_bathrooms_Shared room_nearest']
# bedrooms
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_bedrooms_nearest_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_bedrooms_Entire home/apt_nearest']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_bedrooms_nearest_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_bedrooms_Private room_nearest']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_bedrooms_nearest_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_bedrooms_Shared room_nearest']
# beds
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_beds_nearest_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_beds_Entire home/apt_nearest']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_beds_nearest_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_beds_Private room_nearest']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_beds_nearest_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_beds_Shared room_nearest']
# guests_included
merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_guests_included_nearest_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Entire home/apt', 'Air_guests_included_Entire home/apt_nearest']
merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_guests_included_nearest_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Private room', 'Air_guests_included_Private room_nearest']
merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_guests_included_nearest_same_room_type'] = \
            merged_valid.loc[merged_valid.Air_room_type == 'Shared room', 'Air_guests_included_Shared room_nearest']


# Test
# ----

# AIRBNB OTHER FORMS OF AGGREGATION
# Means
# #####
# Same room type - 50m
# --------------------
# log price
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_log_price_50m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_log_price_Entire home/apt_50m']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_log_price_50m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_log_price_Private room_50m']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_log_price_50m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_log_price_Shared room_50m']
# accommodates
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_accommodates_50m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_accommodates_Entire home/apt_50m']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_accommodates_50m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_accommodates_Private room_50m']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_accommodates_50m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_accommodates_Shared room_50m']
# bathrooms
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_bathrooms_50m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_bathrooms_Entire home/apt_50m']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_bathrooms_50m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_bathrooms_Private room_50m']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_bathrooms_50m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_bathrooms_Shared room_50m']
# bedrooms
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_bedrooms_50m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_bedrooms_Entire home/apt_50m']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_bedrooms_50m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_bedrooms_Private room_50m']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_bedrooms_50m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_bedrooms_Shared room_50m']
# beds
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_beds_50m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_beds_Entire home/apt_50m']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_beds_50m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_beds_Private room_50m']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_beds_50m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_beds_Shared room_50m']
# guests_included
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_guests_included_50m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_guests_included_Entire home/apt_50m']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_guests_included_50m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_guests_included_Private room_50m']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_guests_included_50m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_guests_included_Shared room_50m']

# Same room type - 100m
# ---------------------
# log price
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_log_price_100m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_log_price_Entire home/apt_100m']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_log_price_100m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_log_price_Private room_100m']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_log_price_100m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_log_price_Shared room_100m']
# accommodates
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_accommodates_100m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_accommodates_Entire home/apt_100m']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_accommodates_100m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_accommodates_Private room_100m']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_accommodates_100m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_accommodates_Shared room_100m']
# bathrooms
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_bathrooms_100m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_bathrooms_Entire home/apt_100m']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_bathrooms_100m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_bathrooms_Private room_100m']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_bathrooms_100m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_bathrooms_Shared room_100m']
# bedrooms
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_bedrooms_100m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_bedrooms_Entire home/apt_100m']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_bedrooms_100m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_bedrooms_Private room_100m']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_bedrooms_100m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_bedrooms_Shared room_100m']
# beds
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_beds_100m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_beds_Entire home/apt_100m']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_beds_100m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_beds_Private room_100m']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_beds_100m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_beds_Shared room_100m']
# guests_included
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_guests_included_100m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_guests_included_Entire home/apt_100m']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_guests_included_100m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_guests_included_Private room_100m']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_guests_included_100m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_guests_included_Shared room_100m']

# Same room type - 500m
# ---------------------
# log price
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_log_price_500m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_log_price_Entire home/apt_500m']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_log_price_500m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_log_price_Private room_500m']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_log_price_500m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_log_price_Shared room_500m']
# accommodates
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_accommodates_500m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_accommodates_Entire home/apt_500m']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_accommodates_500m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_accommodates_Private room_500m']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_accommodates_500m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_accommodates_Shared room_500m']
# bathrooms
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_bathrooms_500m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_bathrooms_Entire home/apt_500m']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_bathrooms_500m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_bathrooms_Private room_500m']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_bathrooms_500m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_bathrooms_Shared room_500m']
# bedrooms
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_bedrooms_500m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_bedrooms_Entire home/apt_500m']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_bedrooms_500m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_bedrooms_Private room_500m']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_bedrooms_500m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_bedrooms_Shared room_500m']
# beds
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_beds_500m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_beds_Entire home/apt_500m']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_beds_500m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_beds_Private room_500m']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_beds_500m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_beds_Shared room_500m']
# guests_included
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_guests_included_500m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_guests_included_Entire home/apt_500m']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_guests_included_500m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room','Air_guests_included_Private room_500m']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_guests_included_500m_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_guests_included_Shared room_500m']

# Same room type - nearest
# ------------------------
# log price
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_log_price_nearest_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_log_price_Entire home/apt_nearest']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_log_price_nearest_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_log_price_Private room_nearest']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_log_price_nearest_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_log_price_Shared room_nearest']
# accommodates
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_accommodates_nearest_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_accommodates_Entire home/apt_nearest']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_accommodates_nearest_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_accommodates_Private room_nearest']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_accommodates_nearest_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_accommodates_Shared room_nearest']
# bathrooms
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_bathrooms_nearest_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_bathrooms_Entire home/apt_nearest']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_bathrooms_nearest_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_bathrooms_Private room_nearest']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_bathrooms_nearest_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_bathrooms_Shared room_nearest']
# bedrooms
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_bedrooms_nearest_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_bedrooms_Entire home/apt_nearest']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_bedrooms_nearest_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_bedrooms_Private room_nearest']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_bedrooms_nearest_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_bedrooms_Shared room_nearest']
# beds
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_beds_nearest_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_beds_Entire home/apt_nearest']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_beds_nearest_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_beds_Private room_nearest']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_beds_nearest_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_beds_Shared room_nearest']
# guests_included
merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_guests_included_nearest_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Entire home/apt', 'Air_guests_included_Entire home/apt_nearest']
merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_guests_included_nearest_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Private room', 'Air_guests_included_Private room_nearest']
merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_guests_included_nearest_same_room_type'] = \
            merged_test.loc[merged_test.Air_room_type == 'Shared room', 'Air_guests_included_Shared room_nearest']
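
The assignments above follow one pattern: for every variable and every aggregation, each listing receives the aggregate computed for its own room type. A minimal sketch of that pattern as a loop is given below. The helper name `add_same_room_type_columns` and the three constant lists are illustrative and not part of the `air` module; the column naming scheme is taken from the cells above.

```python
import numpy as np
import pandas as pd

ROOM_TYPES = ['Entire home/apt', 'Private room', 'Shared room']
VARIABLES = ['log_price', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'guests_included']
AGGREGATIONS = ['50m', '100m', '500m', 'nearest']


def add_same_room_type_columns(df, variables=VARIABLES, aggregations=AGGREGATIONS,
                               room_types=ROOM_TYPES):
    """Hypothetical helper: fill Air_<var>_<agg>_same_room_type from the
    room-type-specific columns Air_<var>_<room type>_<agg>."""
    for var in variables:
        for agg in aggregations:
            target = f'Air_{var}_{agg}_same_room_type'
            if target not in df.columns:
                df[target] = np.nan  # listings with another room type stay missing
            for room in room_types:
                source = f'Air_{var}_{room}_{agg}'
                mask = df['Air_room_type'] == room
                df.loc[mask, target] = df.loc[mask, source]
    return df


# Toy data with one variable and one aggregation, for illustration only
demo = pd.DataFrame({
    'Air_room_type': ['Entire home/apt', 'Private room', 'Shared room'],
    'Air_beds_Entire home/apt_50m': [2.0, 2.5, 3.0],
    'Air_beds_Private room_50m': [1.0, 1.5, 1.2],
    'Air_beds_Shared room_50m': [4.0, 5.0, 6.0],
})
demo = add_same_room_type_columns(demo, variables=['beds'], aggregations=['50m'])
```

Because source and target names are built from the same `var` and `agg`, a loop of this kind reduces the risk of copy-paste slips between variables. It could be called once per split, for example on `merged_train`, `merged_valid` and `merged_test`.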

*Distance to CBD*

In [257]:
# Train
merged_train['latitude_CBD'] = -37.814
merged_train['longitude_CBD'] = 144.963

merged_train['Air_distance_to_CBD'] = air._haversine_np(merged_train.Air_longitude, merged_train.Air_latitude,
                                                merged_train.longitude_CBD, merged_train.latitude_CBD)

# Validation
merged_valid['latitude_CBD'] = -37.814
merged_valid['longitude_CBD'] = 144.963

merged_valid['Air_distance_to_CBD'] = air._haversine_np(merged_valid.Air_longitude, merged_valid.Air_latitude,
                                                merged_valid.longitude_CBD, merged_valid.latitude_CBD)

# Test
merged_test['latitude_CBD'] = -37.814
merged_test['longitude_CBD'] = 144.963

merged_test['Air_distance_to_CBD'] = air._haversine_np(merged_test.Air_longitude, merged_test.Air_latitude,
                                                merged_test.longitude_CBD, merged_test.latitude_CBD)
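
The implementation of `air._haversine_np` is not shown in this notebook. The sketch below assumes it computes the great-circle (haversine) distance in kilometres with the argument order `(lon1, lat1, lon2, lat2)`, which is consistent with the calls above. The function name `haversine_km` and the Earth radius of 6371 km are assumptions made for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption; the original helper may use another value)


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between points given in decimal degrees.

    Accepts scalars, NumPy arrays or pandas Series; inputs are broadcast.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# CBD coordinates as used in the cell above
CBD_LAT, CBD_LON = -37.814, 144.963

# Distances from the CBD to itself and to a point one degree of latitude further north
d_same = haversine_km(CBD_LON, CBD_LAT, CBD_LON, CBD_LAT)
d_one_degree = haversine_km(CBD_LON, CBD_LAT, CBD_LON, CBD_LAT + 1.0)
```

Since the inputs are broadcast, a function of this form can take the CBD coordinates as scalars, so the constant `latitude_CBD` and `longitude_CBD` columns would not strictly be required.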

*Property type*

In [258]:
# Train
merged_train['Air_property_type_2'] = 'Other'
merged_train.loc[merged_train.Air_property_type.isin(['House', 'Cottage', 'Villa']), 'Air_property_type_2'] = 'House_Cottage_Villa'
merged_train.loc[merged_train.Air_property_type.isin(['Apartment', 'Condominium']), 'Air_property_type_2'] = 'Apartment_Condominium'
merged_train.loc[merged_train.Air_property_type.isin(['Townhouse']), 'Air_property_type_2'] = 'Townhouse'

# Validate
merged_valid['Air_property_type_2'] = 'Other'
merged_valid.loc[merged_valid.Air_property_type.isin(['House', 'Cottage', 'Villa']), 'Air_property_type_2'] = 'House_Cottage_Villa'
merged_valid.loc[merged_valid.Air_property_type.isin(['Apartment', 'Condominium']), 'Air_property_type_2'] = 'Apartment_Condominium'
merged_valid.loc[merged_valid.Air_property_type.isin(['Townhouse']), 'Air_property_type_2'] = 'Townhouse'

# Test
merged_test['Air_property_type_2'] = 'Other'
merged_test.loc[merged_test.Air_property_type.isin(['House', 'Cottage', 'Villa']), 'Air_property_type_2'] = 'House_Cottage_Villa'
merged_test.loc[merged_test.Air_property_type.isin(['Apartment', 'Condominium']), 'Air_property_type_2'] = 'Apartment_Condominium'
merged_test.loc[merged_test.Air_property_type.isin(['Townhouse']), 'Air_property_type_2'] = 'Townhouse'
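
The grouping above is identical for all three splits, so it can also be written once as a lookup table. The following is a minimal sketch; the names `PROPERTY_GROUPS` and `group_property_type` are illustrative and do not exist in the notebook.

```python
import pandas as pd

# Mapping from raw property type to grouped category (same groups as in the cell above)
PROPERTY_GROUPS = {
    'House': 'House_Cottage_Villa',
    'Cottage': 'House_Cottage_Villa',
    'Villa': 'House_Cottage_Villa',
    'Apartment': 'Apartment_Condominium',
    'Condominium': 'Apartment_Condominium',
    'Townhouse': 'Townhouse',
}


def group_property_type(series):
    """Hypothetical helper: map raw property types to groups; anything else
    (including missing values) becomes 'Other'."""
    return series.map(PROPERTY_GROUPS).fillna('Other')


# Toy example
raw = pd.Series(['House', 'Apartment', 'Townhouse', 'Loft', None])
grouped = group_property_type(raw)
```

Applied to the notebook's data, this would read `merged_train['Air_property_type_2'] = group_property_type(merged_train.Air_property_type)`, and likewise for the validation and test sets.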

*Cancellation policy*

In [259]:
# Train
merged_train['Air_cancellation_policy_2'] = merged_train.Air_cancellation_policy
merged_train.replace(to_replace={'Air_cancellation_policy_2': {'super_strict_60': 'strict_tmp',
                                                              'super_strict_30': 'strict_tmp',
                                                              'strict_14_with_grace_period': 'strict_tmp'}}, inplace=True)
merged_train.replace(to_replace={'Air_cancellation_policy_2': {'strict_tmp': 'strict'}}, inplace=True)

# Valid
merged_valid['Air_cancellation_policy_2'] = merged_valid.Air_cancellation_policy
merged_valid.replace(to_replace={'Air_cancellation_policy_2': {'super_strict_60': 'strict_tmp',
                                                              'super_strict_30': 'strict_tmp',
                                                              'strict_14_with_grace_period': 'strict_tmp'}}, inplace=True)
merged_valid.replace(to_replace={'Air_cancellation_policy_2': {'strict_tmp': 'strict'}}, inplace=True)

# Test
merged_test['Air_cancellation_policy_2'] = merged_test.Air_cancellation_policy
merged_test.replace(to_replace={'Air_cancellation_policy_2': {'super_strict_60': 'strict_tmp',
                                                              'super_strict_30': 'strict_tmp',
                                                              'strict_14_with_grace_period': 'strict_tmp'}}, inplace=True)
merged_test.replace(to_replace={'Air_cancellation_policy_2': {'strict_tmp': 'strict'}}, inplace=True)
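
The intermediate `strict_tmp` label does not appear to be needed for the result: the three strict variants can be mapped to `strict` in one step. A minimal sketch with `Series.replace` follows; the names `STRICT_VARIANTS` and `collapse_cancellation_policy` are illustrative.

```python
import pandas as pd

# All strict variants collapse into a single 'strict' category
STRICT_VARIANTS = {
    'super_strict_60': 'strict',
    'super_strict_30': 'strict',
    'strict_14_with_grace_period': 'strict',
}


def collapse_cancellation_policy(series):
    """Hypothetical helper: return a copy with strict variants merged; other values unchanged."""
    return series.replace(STRICT_VARIANTS)


# Toy example
policies = pd.Series(['flexible', 'moderate', 'strict_14_with_grace_period', 'super_strict_30'])
collapsed = collapse_cancellation_policy(policies)
```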

**Variables of interest**

Next, we select the target (`Air_log_price`) and the candidate predictors for the models from each of the three datasets.

In [260]:
# Train
train = merged_train.loc[:, ['Air_log_price', 
                      'Air_log_price_suburb_same_room_type',
                      'Air_log_price_50m_same_room_type', 
                      'Air_log_price_100m_same_room_type',
                      'Air_log_price_500m_same_room_type', 
                      'Air_log_price_nearest_same_room_type',
                      'Air_accommodates_50m_same_room_type', 
                      'Air_accommodates_100m_same_room_type', 
                      'Air_accommodates_500m_same_room_type',
                      'Air_bathrooms_50m_same_room_type', 
                      'Air_bathrooms_100m_same_room_type', 
                      'Air_bathrooms_500m_same_room_type',
                      'Air_bedrooms_50m_same_room_type',
                      'Air_bedrooms_100m_same_room_type',
                      'Air_bedrooms_500m_same_room_type',
                      'Air_beds_50m_same_room_type', 
                      'Air_beds_100m_same_room_type', 
                      'Air_beds_500m_same_room_type',
                      'Air_guests_included_50m_same_room_type', 
                      'Air_guests_included_100m_same_room_type', 
                      'Air_guests_included_500m_same_room_type',
                      'Air_suburb_same_room_type_count',
                      'Air_50m_same_room_type_count', 
                      'Air_100m_same_room_type_count',
                      'Air_500m_same_room_type_count', 
                      'Air_property_type_2', 
                      'Air_room_type', 
                      'Air_cancellation_policy_2',
                      'Air_calculated_host_listings_count', 
                      'Air_host_total_listings_count', 
                      'Air_host_listings_count',
                      'Air_neighbourhood_cleansed', 
                      'Air_bathrooms', 
                      'Air_beds', 
                      'Air_bedrooms', 
                      'Air_accommodates', 
                      'Air_extra_people',
                      'Air_guests_included',
                      'Air_distance_to_CBD']]
train.head(5)

Unnamed: 0,Air_log_price,Air_log_price_suburb_same_room_type,Air_log_price_50m_same_room_type,Air_log_price_100m_same_room_type,Air_log_price_500m_same_room_type,Air_log_price_nearest_same_room_type,Air_accommodates_50m_same_room_type,Air_accommodates_100m_same_room_type,Air_accommodates_500m_same_room_type,Air_bathrooms_50m_same_room_type,...,Air_host_total_listings_count,Air_host_listings_count,Air_neighbourhood_cleansed,Air_bathrooms,Air_beds,Air_bedrooms,Air_accommodates,Air_extra_people,Air_guests_included,Air_distance_to_CBD
0,1.954243,1.875238,1.864316,1.849264,1.887373,1.892095,2.0,1.928571,1.848684,1.0,...,1.0,1.0,Melbourne,1.0,1.0,1.0,2.0,0.0,1.0,1.302894
1,1.812913,1.784687,1.80618,1.792166,1.774898,1.80618,2.0,2.5,2.333333,,...,3.0,3.0,Whitehorse,,1.0,1.0,1.0,0.0,1.0,17.143772
2,2.176091,2.239963,,,2.327473,2.255273,,,4.5,,...,2.0,2.0,Bayside,2.0,2.0,2.0,4.0,0.0,1.0,15.83771
3,1.662758,1.875496,1.840866,1.877888,1.887697,1.897627,1.625,2.047619,1.863946,1.0,...,1.0,1.0,Melbourne,1.0,1.0,1.0,2.0,0.0,1.0,1.146271
4,1.94939,1.737162,,2.09864,1.700186,2.021189,,2.0,1.6,,...,1.0,1.0,Moreland,1.0,1.0,2.0,2.0,0.0,1.0,4.437853


In [261]:
# Validation
valid = merged_valid.loc[:, ['Air_log_price', 
                      'Air_log_price_suburb_same_room_type',
                      'Air_log_price_50m_same_room_type', 
                      'Air_log_price_100m_same_room_type',
                      'Air_log_price_500m_same_room_type', 
                      'Air_log_price_nearest_same_room_type',
                      'Air_accommodates_50m_same_room_type', 
                      'Air_accommodates_100m_same_room_type', 
                      'Air_accommodates_500m_same_room_type',
                      'Air_bathrooms_50m_same_room_type', 
                      'Air_bathrooms_100m_same_room_type', 
                      'Air_bathrooms_500m_same_room_type',
                      'Air_bedrooms_50m_same_room_type',
                      'Air_bedrooms_100m_same_room_type',
                      'Air_bedrooms_500m_same_room_type',
                      'Air_beds_50m_same_room_type', 
                      'Air_beds_100m_same_room_type', 
                      'Air_beds_500m_same_room_type',
                      'Air_guests_included_50m_same_room_type', 
                      'Air_guests_included_100m_same_room_type', 
                      'Air_guests_included_500m_same_room_type',
                      'Air_suburb_same_room_type_count',
                      'Air_50m_same_room_type_count', 
                      'Air_100m_same_room_type_count',
                      'Air_500m_same_room_type_count', 
                      'Air_property_type_2', 
                      'Air_room_type', 
                      'Air_cancellation_policy_2',
                      'Air_calculated_host_listings_count', 
                      'Air_host_total_listings_count', 
                      'Air_host_listings_count',
                      'Air_neighbourhood_cleansed', 
                      'Air_bathrooms', 
                      'Air_beds', 
                      'Air_bedrooms', 
                      'Air_accommodates', 
                      'Air_extra_people',
                      'Air_guests_included',
                      'Air_distance_to_CBD']]
valid.head(5)

Unnamed: 0,Air_log_price,Air_log_price_suburb_same_room_type,Air_log_price_50m_same_room_type,Air_log_price_100m_same_room_type,Air_log_price_500m_same_room_type,Air_log_price_nearest_same_room_type,Air_accommodates_50m_same_room_type,Air_accommodates_100m_same_room_type,Air_accommodates_500m_same_room_type,Air_bathrooms_50m_same_room_type,...,Air_host_total_listings_count,Air_host_listings_count,Air_neighbourhood_cleansed,Air_bathrooms,Air_beds,Air_bedrooms,Air_accommodates,Air_extra_people,Air_guests_included,Air_distance_to_CBD
0,1.832509,1.852932,1.80103,1.831385,1.848968,1.90309,1.5,1.666667,1.725,1.0,...,1.0,1.0,Melbourne,1.0,1.0,0.0,1.0,0.0,1.0,0.731065
1,1.812913,2.207816,,,,,,,,,...,1.0,1.0,Monash,1.0,1.0,1.0,2.0,20.0,1.0,14.734709
2,2.037426,2.175507,,,2.200705,2.290035,,,3.533333,,...,5.0,5.0,Melbourne,1.0,1.0,1.0,3.0,10.0,2.0,2.005208
3,2.217484,2.175296,2.165798,2.130782,2.151418,2.037426,4.5,5.065217,4.836842,1.333333,...,40.0,40.0,Melbourne,2.0,3.0,2.0,6.0,15.0,4.0,0.9206
4,2.187521,2.175331,2.13971,2.125597,2.178881,2.09691,5.0,3.5,4.1,1.333333,...,22.0,22.0,Melbourne,2.0,2.0,0.0,4.0,0.0,1.0,1.838892


In [262]:
# Test
test = merged_test.loc[:, ['Air_log_price', 
                      'Air_log_price_suburb_same_room_type',
                      'Air_log_price_50m_same_room_type', 
                      'Air_log_price_100m_same_room_type',
                      'Air_log_price_500m_same_room_type', 
                      'Air_log_price_nearest_same_room_type',
                      'Air_accommodates_50m_same_room_type', 
                      'Air_accommodates_100m_same_room_type', 
                      'Air_accommodates_500m_same_room_type',
                      'Air_bathrooms_50m_same_room_type', 
                      'Air_bathrooms_100m_same_room_type', 
                      'Air_bathrooms_500m_same_room_type',
                      'Air_bedrooms_50m_same_room_type',
                      'Air_bedrooms_100m_same_room_type',
                      'Air_bedrooms_500m_same_room_type',
                      'Air_beds_50m_same_room_type', 
                      'Air_beds_100m_same_room_type', 
                      'Air_beds_500m_same_room_type',
                      'Air_guests_included_50m_same_room_type', 
                      'Air_guests_included_100m_same_room_type', 
                      'Air_guests_included_500m_same_room_type',
                      'Air_suburb_same_room_type_count',
                      'Air_50m_same_room_type_count', 
                      'Air_100m_same_room_type_count',
                      'Air_500m_same_room_type_count', 
                      'Air_property_type_2', 
                      'Air_room_type', 
                      'Air_cancellation_policy_2',
                      'Air_calculated_host_listings_count', 
                      'Air_host_total_listings_count', 
                      'Air_host_listings_count',
                      'Air_neighbourhood_cleansed', 
                      'Air_bathrooms', 
                      'Air_beds', 
                      'Air_bedrooms', 
                      'Air_accommodates', 
                      'Air_extra_people',
                      'Air_guests_included',
                      'Air_distance_to_CBD']]
test.head(5)

Unnamed: 0,Air_log_price,Air_log_price_suburb_same_room_type,Air_log_price_50m_same_room_type,Air_log_price_100m_same_room_type,Air_log_price_500m_same_room_type,Air_log_price_nearest_same_room_type,Air_accommodates_50m_same_room_type,Air_accommodates_100m_same_room_type,Air_accommodates_500m_same_room_type,Air_bathrooms_50m_same_room_type,...,Air_host_total_listings_count,Air_host_listings_count,Air_neighbourhood_cleansed,Air_bathrooms,Air_beds,Air_bedrooms,Air_accommodates,Air_extra_people,Air_guests_included,Air_distance_to_CBD
0,2.133539,2.150811,,2.065043,2.126423,2.0,,2.0,2.925,,...,89.0,89.0,Stonnington,1.0,1.0,0.0,2.0,0.0,1.0,4.412457
1,1.544068,1.752372,,,1.69897,1.69897,,,2.0,,...,1.0,1.0,Whitehorse,1.0,1.0,1.0,2.0,0.0,1.0,18.900217
2,2.0,2.18132,2.298853,2.149427,2.231913,2.298853,5.0,3.5,4.717391,2.0,...,1.0,1.0,Melbourne,1.0,1.0,1.0,2.0,0.0,1.0,1.764125
3,2.176091,2.132502,,,,,,,,,...,1.0,1.0,Whitehorse,2.0,3.0,3.0,6.0,10.0,2.0,21.697192
4,2.161368,2.181161,,2.060698,2.107812,2.060698,,3.0,3.916667,,...,6.0,6.0,Melbourne,1.0,3.0,2.0,6.0,15.0,1.0,1.88617
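
The same column list is written out three times above. One way to keep the three selections in sync is to define the list once and reuse it. The sketch below uses a shortened list and toy data; `select_model_columns` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def select_model_columns(df, columns):
    """Hypothetical helper: return a copy of df restricted to `columns`,
    failing early with a clear message if any column is absent."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f'Missing columns: {missing}')
    return df.loc[:, columns].copy()


# Shortened, illustrative column list; the notebook's full list would go here
MODEL_COLUMNS = ['Air_log_price', 'Air_beds', 'Air_distance_to_CBD']

# Toy frame standing in for merged_train / merged_valid / merged_test
toy = pd.DataFrame({
    'Air_log_price': [1.95, 1.81],
    'Air_beds': [1.0, 2.0],
    'Air_distance_to_CBD': [1.3, 17.1],
    'Air_name': ['a', 'b'],
})
toy_selected = select_model_columns(toy, MODEL_COLUMNS)
```

With the full list in `MODEL_COLUMNS`, the three cells above would reduce to one call per split, for example `train = select_model_columns(merged_train, MODEL_COLUMNS)`.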


**Dataset preparation**

*One-hot encoding for categorical variables*

We use one-hot encoding for categorical variables when there is no meaningful order.

In [263]:
train_one_hot = pd.get_dummies(train.loc[:, ['Air_property_type_2', 'Air_room_type', 
                                      'Air_cancellation_policy_2', 
                                      'Air_neighbourhood_cleansed']],
                              drop_first=True)
train_one_hot.head(5)

valid_one_hot = pd.get_dummies(valid.loc[:, ['Air_property_type_2', 'Air_room_type', 
                                      'Air_cancellation_policy_2', 
                                      'Air_neighbourhood_cleansed']],
                              drop_first=True)
valid_one_hot.head(5)

test_one_hot = pd.get_dummies(test.loc[:, ['Air_property_type_2', 'Air_room_type', 
                                      'Air_cancellation_policy_2',
                                      'Air_neighbourhood_cleansed']],
                             drop_first=True)
test_one_hot.head(5)

Unnamed: 0,Air_property_type_2_House_Cottage_Villa,Air_property_type_2_Other,Air_property_type_2_Townhouse,Air_room_type_Private room,Air_room_type_Shared room,Air_cancellation_policy_2_moderate,Air_cancellation_policy_2_strict,Air_neighbourhood_cleansed_Bayside,Air_neighbourhood_cleansed_Boroondara,Air_neighbourhood_cleansed_Brimbank,...,Air_neighbourhood_cleansed_Moonee Valley,Air_neighbourhood_cleansed_Moreland,Air_neighbourhood_cleansed_Nillumbik,Air_neighbourhood_cleansed_Port Phillip,Air_neighbourhood_cleansed_Stonnington,Air_neighbourhood_cleansed_Whitehorse,Air_neighbourhood_cleansed_Whittlesea,Air_neighbourhood_cleansed_Wyndham,Air_neighbourhood_cleansed_Yarra,Air_neighbourhood_cleansed_Yarra Ranges
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [264]:
print(train_one_hot.shape)
print(valid_one_hot.shape)
print(test_one_hot.shape)

(14296, 36)
(3574, 36)
(4465, 36)
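Because each split is encoded separately, the three column sets match here (36 columns each) only if every category level occurs in all three splits. The following is a minimal sketch of a safeguard, not part of the original notebook: it uses toy data and a hypothetical helper `align_to_train` that encodes another split with all levels kept and then reindexes it to the training columns.

```python
import pandas as pd

def align_to_train(train_dummies, other_raw, columns):
    """Hypothetical helper: encode other_raw and give it the training dummy columns."""
    # Keep all levels here; the reference level is removed by the reindex below
    other_dummies = pd.get_dummies(other_raw[columns], dtype=int)
    # Columns unseen in this split are added and filled with 0,
    # columns unknown to the training set are discarded
    return other_dummies.reindex(columns=train_dummies.columns, fill_value=0)

# Toy data (made up): the validation split lacks two of the three room types
train = pd.DataFrame({'Air_room_type': ['Entire home/apt', 'Private room', 'Shared room']})
valid = pd.DataFrame({'Air_room_type': ['Private room', 'Private room']})

cols = ['Air_room_type']
train_one_hot = pd.get_dummies(train[cols], drop_first=True, dtype=int)
valid_one_hot = align_to_train(train_one_hot, valid, cols)

print(list(valid_one_hot.columns) == list(train_one_hot.columns))
```

Encoding the other split without `drop_first` matters in this sketch: if the first level were absent from that split, `drop_first=True` would drop a different level there than in the training set.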


#### Datasets

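Next, the datasets for the modelling part are created. Each one combines the target `Air_log_price`, the numeric listing features and the one-hot encoded variables; the datasets differ only in which aggregated neighbouring price (suburb, 500m, 100m, 50m, or none for the baseline) is included.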

*Aggregation by suburbs*

In [265]:
# Target, suburb-level aggregated price and numeric listing features
cols_suburb = ['Air_log_price',
               'Air_log_price_suburb_same_room_type',
               'Air_calculated_host_listings_count',
               'Air_bathrooms',
               'Air_beds',
               'Air_bedrooms',
               'Air_accommodates',
               'Air_extra_people',
               'Air_guests_included',
               'Air_distance_to_CBD']

# Append the one-hot encoded variables to each split
train_suburb = pd.concat([merged_train.loc[:, cols_suburb], train_one_hot], axis=1)
valid_suburb = pd.concat([merged_valid.loc[:, cols_suburb], valid_one_hot], axis=1)
test_suburb = pd.concat([merged_test.loc[:, cols_suburb], test_one_hot], axis=1)

# Only the last expression of a cell is displayed
test_suburb.head(5)

Unnamed: 0,Air_log_price,Air_log_price_suburb_same_room_type,Air_calculated_host_listings_count,Air_bathrooms,Air_beds,Air_bedrooms,Air_accommodates,Air_extra_people,Air_guests_included,Air_distance_to_CBD,...,Air_neighbourhood_cleansed_Moonee Valley,Air_neighbourhood_cleansed_Moreland,Air_neighbourhood_cleansed_Nillumbik,Air_neighbourhood_cleansed_Port Phillip,Air_neighbourhood_cleansed_Stonnington,Air_neighbourhood_cleansed_Whitehorse,Air_neighbourhood_cleansed_Whittlesea,Air_neighbourhood_cleansed_Wyndham,Air_neighbourhood_cleansed_Yarra,Air_neighbourhood_cleansed_Yarra Ranges
0,2.133539,2.150811,62.0,1.0,1.0,0.0,2.0,0.0,1.0,4.412457,...,0,0,0,0,1,0,0,0,0,0
1,1.544068,1.752372,1.0,1.0,1.0,1.0,2.0,0.0,1.0,18.900217,...,0,0,0,0,0,1,0,0,0,0
2,2.0,2.18132,1.0,1.0,1.0,1.0,2.0,0.0,1.0,1.764125,...,0,0,0,0,0,0,0,0,0,0
3,2.176091,2.132502,1.0,2.0,3.0,3.0,6.0,10.0,2.0,21.697192,...,0,0,0,0,0,1,0,0,0,0
4,2.161368,2.181161,5.0,1.0,3.0,2.0,6.0,15.0,1.0,1.88617,...,0,0,0,0,0,0,0,0,0,0


*Constructed aggregations - 500m*

In [266]:
# Target, 500m aggregated price and numeric listing features
cols_500m = ['Air_log_price',
             'Air_log_price_500m_same_room_type',
             'Air_calculated_host_listings_count',
             'Air_bathrooms',
             'Air_beds',
             'Air_bedrooms',
             'Air_accommodates',
             'Air_extra_people',
             'Air_guests_included',
             'Air_distance_to_CBD']

# Append the one-hot encoded variables to each split
train_500m = pd.concat([merged_train.loc[:, cols_500m], train_one_hot], axis=1)
valid_500m = pd.concat([merged_valid.loc[:, cols_500m], valid_one_hot], axis=1)
test_500m = pd.concat([merged_test.loc[:, cols_500m], test_one_hot], axis=1)

# Only the last expression of a cell is displayed
test_500m.head(5)

Unnamed: 0,Air_log_price,Air_log_price_500m_same_room_type,Air_calculated_host_listings_count,Air_bathrooms,Air_beds,Air_bedrooms,Air_accommodates,Air_extra_people,Air_guests_included,Air_distance_to_CBD,...,Air_neighbourhood_cleansed_Moonee Valley,Air_neighbourhood_cleansed_Moreland,Air_neighbourhood_cleansed_Nillumbik,Air_neighbourhood_cleansed_Port Phillip,Air_neighbourhood_cleansed_Stonnington,Air_neighbourhood_cleansed_Whitehorse,Air_neighbourhood_cleansed_Whittlesea,Air_neighbourhood_cleansed_Wyndham,Air_neighbourhood_cleansed_Yarra,Air_neighbourhood_cleansed_Yarra Ranges
0,2.133539,2.126423,62.0,1.0,1.0,0.0,2.0,0.0,1.0,4.412457,...,0,0,0,0,1,0,0,0,0,0
1,1.544068,1.69897,1.0,1.0,1.0,1.0,2.0,0.0,1.0,18.900217,...,0,0,0,0,0,1,0,0,0,0
2,2.0,2.231913,1.0,1.0,1.0,1.0,2.0,0.0,1.0,1.764125,...,0,0,0,0,0,0,0,0,0,0
3,2.176091,,1.0,2.0,3.0,3.0,6.0,10.0,2.0,21.697192,...,0,0,0,0,0,1,0,0,0,0
4,2.161368,2.107812,5.0,1.0,3.0,2.0,6.0,15.0,1.0,1.88617,...,0,0,0,0,0,0,0,0,0,0


*Constructed aggregations - 100m*

In [267]:
# Target, 100m aggregated price and numeric listing features
cols_100m = ['Air_log_price',
             'Air_log_price_100m_same_room_type',
             'Air_calculated_host_listings_count',
             'Air_bathrooms',
             'Air_beds',
             'Air_bedrooms',
             'Air_accommodates',
             'Air_extra_people',
             'Air_guests_included',
             'Air_distance_to_CBD']

# Append the one-hot encoded variables to each split
train_100m = pd.concat([merged_train.loc[:, cols_100m], train_one_hot], axis=1)
valid_100m = pd.concat([merged_valid.loc[:, cols_100m], valid_one_hot], axis=1)
test_100m = pd.concat([merged_test.loc[:, cols_100m], test_one_hot], axis=1)

# Only the last expression of a cell is displayed
test_100m.head(5)

Unnamed: 0,Air_log_price,Air_log_price_100m_same_room_type,Air_calculated_host_listings_count,Air_bathrooms,Air_beds,Air_bedrooms,Air_accommodates,Air_extra_people,Air_guests_included,Air_distance_to_CBD,...,Air_neighbourhood_cleansed_Moonee Valley,Air_neighbourhood_cleansed_Moreland,Air_neighbourhood_cleansed_Nillumbik,Air_neighbourhood_cleansed_Port Phillip,Air_neighbourhood_cleansed_Stonnington,Air_neighbourhood_cleansed_Whitehorse,Air_neighbourhood_cleansed_Whittlesea,Air_neighbourhood_cleansed_Wyndham,Air_neighbourhood_cleansed_Yarra,Air_neighbourhood_cleansed_Yarra Ranges
0,2.133539,2.065043,62.0,1.0,1.0,0.0,2.0,0.0,1.0,4.412457,...,0,0,0,0,1,0,0,0,0,0
1,1.544068,,1.0,1.0,1.0,1.0,2.0,0.0,1.0,18.900217,...,0,0,0,0,0,1,0,0,0,0
2,2.0,2.149427,1.0,1.0,1.0,1.0,2.0,0.0,1.0,1.764125,...,0,0,0,0,0,0,0,0,0,0
3,2.176091,,1.0,2.0,3.0,3.0,6.0,10.0,2.0,21.697192,...,0,0,0,0,0,1,0,0,0,0
4,2.161368,2.060698,5.0,1.0,3.0,2.0,6.0,15.0,1.0,1.88617,...,0,0,0,0,0,0,0,0,0,0


*Constructed aggregations - 50m*

In [268]:
# Target, 50m aggregated price and numeric listing features
cols_50m = ['Air_log_price',
            'Air_log_price_50m_same_room_type',
            'Air_calculated_host_listings_count',
            'Air_bathrooms',
            'Air_beds',
            'Air_bedrooms',
            'Air_accommodates',
            'Air_extra_people',
            'Air_guests_included',
            'Air_distance_to_CBD']

# Append the one-hot encoded variables to each split
train_50m = pd.concat([merged_train.loc[:, cols_50m], train_one_hot], axis=1)
valid_50m = pd.concat([merged_valid.loc[:, cols_50m], valid_one_hot], axis=1)
test_50m = pd.concat([merged_test.loc[:, cols_50m], test_one_hot], axis=1)

# Only the last expression of a cell is displayed
test_50m.head(5)

Unnamed: 0,Air_log_price,Air_log_price_50m_same_room_type,Air_calculated_host_listings_count,Air_bathrooms,Air_beds,Air_bedrooms,Air_accommodates,Air_extra_people,Air_guests_included,Air_distance_to_CBD,...,Air_neighbourhood_cleansed_Moonee Valley,Air_neighbourhood_cleansed_Moreland,Air_neighbourhood_cleansed_Nillumbik,Air_neighbourhood_cleansed_Port Phillip,Air_neighbourhood_cleansed_Stonnington,Air_neighbourhood_cleansed_Whitehorse,Air_neighbourhood_cleansed_Whittlesea,Air_neighbourhood_cleansed_Wyndham,Air_neighbourhood_cleansed_Yarra,Air_neighbourhood_cleansed_Yarra Ranges
0,2.133539,,62.0,1.0,1.0,0.0,2.0,0.0,1.0,4.412457,...,0,0,0,0,1,0,0,0,0,0
1,1.544068,,1.0,1.0,1.0,1.0,2.0,0.0,1.0,18.900217,...,0,0,0,0,0,1,0,0,0,0
2,2.0,2.298853,1.0,1.0,1.0,1.0,2.0,0.0,1.0,1.764125,...,0,0,0,0,0,0,0,0,0,0
3,2.176091,,1.0,2.0,3.0,3.0,6.0,10.0,2.0,21.697192,...,0,0,0,0,0,1,0,0,0,0
4,2.161368,,5.0,1.0,3.0,2.0,6.0,15.0,1.0,1.88617,...,0,0,0,0,0,0,0,0,0,0
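The previews above contain missing values in the aggregated price features, and in the five rows shown they become more frequent as the radius shrinks. The sketch below quantifies this for those five rows only; the values are copied from the outputs above, and the result says nothing reliable about the full datasets. Estimators such as scikit-learn's `LinearRegression` do not accept missing values, so these would have to be imputed or dropped before fitting; how the notebook handles this is not shown in this section.

```python
import numpy as np
import pandas as pd

# Aggregated price features of the five preview rows (copied from the outputs above)
preview = pd.DataFrame({
    'Air_log_price_suburb_same_room_type': [2.150811, 1.752372, 2.18132, 2.132502, 2.181161],
    'Air_log_price_500m_same_room_type': [2.126423, 1.69897, 2.231913, np.nan, 2.107812],
    'Air_log_price_100m_same_room_type': [2.065043, np.nan, 2.149427, np.nan, 2.060698],
    'Air_log_price_50m_same_room_type': [np.nan, np.nan, 2.298853, np.nan, np.nan],
})

# Share of missing values per aggregation level
missing_share = preview.isna().mean()
print(missing_share)
```

Applied to the real data, the same `isna().mean()` call on `merged_train` would show how much of each aggregated feature is actually available.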


*Baseline model*

In [269]:
# Target and numeric listing features (no aggregated price)
cols_base = ['Air_log_price',
             'Air_calculated_host_listings_count',
             'Air_bathrooms',
             'Air_beds',
             'Air_bedrooms',
             'Air_accommodates',
             'Air_extra_people',
             'Air_guests_included',
             'Air_distance_to_CBD']

# Append the one-hot encoded variables to each split
train_base = pd.concat([merged_train.loc[:, cols_base], train_one_hot], axis=1)
valid_base = pd.concat([merged_valid.loc[:, cols_base], valid_one_hot], axis=1)
test_base = pd.concat([merged_test.loc[:, cols_base], test_one_hot], axis=1)

# Only the last expression of a cell is displayed
test_base.head(5)

Unnamed: 0,Air_log_price,Air_calculated_host_listings_count,Air_bathrooms,Air_beds,Air_bedrooms,Air_accommodates,Air_extra_people,Air_guests_included,Air_distance_to_CBD,Air_property_type_2_House_Cottage_Villa,...,Air_neighbourhood_cleansed_Moonee Valley,Air_neighbourhood_cleansed_Moreland,Air_neighbourhood_cleansed_Nillumbik,Air_neighbourhood_cleansed_Port Phillip,Air_neighbourhood_cleansed_Stonnington,Air_neighbourhood_cleansed_Whitehorse,Air_neighbourhood_cleansed_Whittlesea,Air_neighbourhood_cleansed_Wyndham,Air_neighbourhood_cleansed_Yarra,Air_neighbourhood_cleansed_Yarra Ranges
0,2.133539,62.0,1.0,1.0,0.0,2.0,0.0,1.0,4.412457,0,...,0,0,0,0,1,0,0,0,0,0
1,1.544068,1.0,1.0,1.0,1.0,2.0,0.0,1.0,18.900217,1,...,0,0,0,0,0,1,0,0,0,0
2,2.0,1.0,1.0,1.0,1.0,2.0,0.0,1.0,1.764125,0,...,0,0,0,0,0,0,0,0,0,0
3,2.176091,1.0,2.0,3.0,3.0,6.0,10.0,2.0,21.697192,0,...,0,0,0,0,0,1,0,0,0,0
4,2.161368,5.0,1.0,3.0,2.0,6.0,15.0,1.0,1.88617,0,...,0,0,0,0,0,0,0,0,0,0
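The five dataset families above share one pattern: the target, an optional aggregated price column, the same numeric features, and the one-hot columns. As a minimal sketch, this pattern could be captured in a single function. The helper `build_dataset`, the constant `NUMERIC_FEATURES` and the toy frames are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

# Numeric listing features shared by all datasets (copied from the cells above)
NUMERIC_FEATURES = [
    'Air_calculated_host_listings_count', 'Air_bathrooms', 'Air_beds',
    'Air_bedrooms', 'Air_accommodates', 'Air_extra_people',
    'Air_guests_included', 'Air_distance_to_CBD',
]

def build_dataset(merged, one_hot, agg_col=None, target='Air_log_price'):
    """Hypothetical helper: target, optional aggregate, numeric features, dummies."""
    cols = [target] + ([agg_col] if agg_col else []) + NUMERIC_FEATURES
    # pd.concat(axis=1) aligns rows on the index, so both frames must share it
    return pd.concat([merged.loc[:, cols], one_hot], axis=1)

# Toy stand-ins for merged_train and train_one_hot (two rows, made-up values)
agg = 'Air_log_price_500m_same_room_type'
toy_merged = pd.DataFrame({c: [1.0, 2.0] for c in ['Air_log_price', agg] + NUMERIC_FEATURES})
toy_one_hot = pd.DataFrame({'Air_room_type_Private room': [0, 1]})

toy_500m = build_dataset(toy_merged, toy_one_hot, agg_col=agg)
toy_base = build_dataset(toy_merged, toy_one_hot)
print(toy_500m.shape, toy_base.shape)
```

Since the concatenation aligns on the index, the merged frames and the one-hot frames need identical indices; otherwise rows would be mismatched or padded with missing values.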


#### Features and targets

For each dataset, the feature matrix `X_` (all columns except `Air_log_price`) and the target vector `y_` (`Air_log_price`) are extracted as NumPy arrays.

In [270]:
# Suburb
X_train_suburb = train_suburb.loc[:, ~train_suburb.columns.isin(['Air_log_price'])].values
y_train_suburb = train_suburb.loc[:, 'Air_log_price'].values

X_valid_suburb = valid_suburb.loc[:, ~valid_suburb.columns.isin(['Air_log_price'])].values
y_valid_suburb = valid_suburb.loc[:, 'Air_log_price'].values

X_test_suburb = test_suburb.loc[:, ~test_suburb.columns.isin(['Air_log_price'])].values
y_test_suburb = test_suburb.loc[:, 'Air_log_price'].values

# 500m
X_train_500m = train_500m.loc[:, ~train_500m.columns.isin(['Air_log_price'])].values
y_train_500m = train_500m.loc[:, 'Air_log_price'].values

X_valid_500m = valid_500m.loc[:, ~valid_500m.columns.isin(['Air_log_price'])].values
y_valid_500m = valid_500m.loc[:, 'Air_log_price'].values

X_test_500m = test_500m.loc[:, ~test_500m.columns.isin(['Air_log_price'])].values
y_test_500m = test_500m.loc[:, 'Air_log_price'].values

# 100m
X_train_100m = train_100m.loc[:, ~train_100m.columns.isin(['Air_log_price'])].values
y_train_100m = train_100m.loc[:, 'Air_log_price'].values

X_valid_100m = valid_100m.loc[:, ~valid_100m.columns.isin(['Air_log_price'])].values
y_valid_100m = valid_100m.loc[:, 'Air_log_price'].values

X_test_100m = test_100m.loc[:, ~test_100m.columns.isin(['Air_log_price'])].values
y_test_100m = test_100m.loc[:, 'Air_log_price'].values

# 50m
X_train_50m = train_50m.loc[:, ~train_50m.columns.isin(['Air_log_price'])].values
y_train_50m = train_50m.loc[:, 'Air_log_price'].values

X_valid_50m = valid_50m.loc[:, ~valid_50m.columns.isin(['Air_log_price'])].values
y_valid_50m = valid_50m.loc[:, 'Air_log_price'].values

X_test_50m = test_50m.loc[:, ~test_50m.columns.isin(['Air_log_price'])].values
y_test_50m = test_50m.loc[:, 'Air_log_price'].values

# baseline
X_train_base = train_base.loc[:, ~train_base.columns.isin(['Air_log_price'])].values
y_train_base = train_base.loc[:, 'Air_log_price'].values

X_valid_base = valid_base.loc[:, ~valid_base.columns.isin(['Air_log_price'])].values
y_valid_base = valid_base.loc[:, 'Air_log_price'].values

X_test_base = test_base.loc[:, ~test_base.columns.isin(['Air_log_price'])].values
y_test_base = test_base.loc[:, 'Air_log_price'].values
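The same two lines are repeated for every dataset and split in the cell above. A minimal sketch of a more compact variant follows; the helper `split_features_target` and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd

def split_features_target(df, target='Air_log_price'):
    """Hypothetical helper: return (X, y) as NumPy arrays."""
    X = df.drop(columns=target).values  # all columns except the target
    y = df[target].values
    return X, y

# Toy stand-in for one of the datasets (made-up values)
toy = pd.DataFrame({
    'Air_log_price': [2.0, 2.1, 1.9],
    'Air_beds': [1.0, 2.0, 1.0],
    'Air_bathrooms': [1.0, 1.0, 2.0],
})

# Apply the helper to several frames at once
datasets = {'toy_train': toy, 'toy_valid': toy.iloc[:2]}
arrays = {name: split_features_target(df) for name, df in datasets.items()}

X_toy_train, y_toy_train = arrays['toy_train']
print(X_toy_train.shape, y_toy_train.shape)
```

Keeping the arrays in a dictionary keyed by dataset name would also make it easier to loop over the aggregation levels when fitting and comparing models.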