# Task 1

It is 2022, and companies all over the world are still reeling from the impact that COVID-19 has had on them. Airbnb is one of these businesses, and the hosts that list their spare rooms or extra homes are trying their hardest to make their places look as appealing as possible to tempt travellers to spend their hard-earned money.

In this project I will aim to answer the following question:

### "What can hosts do to maximise their chances of rental to increase income? How does this differ across Europe, if at all?"

This can be broken down into several sub-questions:

1. Which host behaviours or profile attributes influence Airbnb guests' reviews across Europe?
 
2. What words should hosts include in listings?
 
3. What features should hosts focus on to maximise booking potential?
 

## The dataset

The dataset used for this project comes from [insideairbnb.com](http://insideairbnb.com/), an investigatory/watchdog website launched by Murray Cox in 2016. It reports and visualizes scraped data on the property rental marketplace company Airbnb, focusing on highlighting illegal renting on the site and gentrification caused by landlords buying properties to rent on Airbnb.


The data is quite messy, and has some limitations. The major one is that it only includes the advertised price (sometimes called the 'sticker' price). The sticker price is the overall nightly price that is advertised to potential guests, rather than the actual average amount paid per night by previous guests. The advertised prices can be set to any arbitrary amount by the host, and hosts that are less experienced with Airbnb will often set these to very low (e.g. £0) or very high (e.g. £10,000) amounts.

Nevertheless, this dataset can be used as a proof of concept. A more accurate version could be built using data on the actual average nightly rates paid, e.g. from sites like [AirDNA](https://www.airdna.co/) that sell higher quality Airbnb data.

# Task 2

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy.stats as stats
from pylab import *

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import  r2_score, mean_squared_error
from sklearn import metrics
from sklearn.preprocessing import StandardScaler 
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import statsmodels.api as sm
from statsmodels.formula.api import ols

import folium
from wordcloud import WordCloud, ImageColorGenerator
from gensim.parsing.preprocessing import remove_stopwords
import collections
from collections import Counter
import string

import warnings
warnings.filterwarnings('ignore')

In [None]:
#setting seaborn size parameters
sns.set(rc={'figure.figsize':(15,8)})

First, I will undertake some basic preprocessing of the data and carry out some exploratory data analysis for both cities. This will allow us to get some insight into the data before attempting to answer the question set in Task 1.

# Exploratory Data Analysis (Amsterdam)

In [None]:
#reading in Amsterdam data
ams = pd.read_csv('airbnb_amsterdam/listings.csv')
ams.set_index('id',inplace=True)
ams.head()

In [None]:
ams.shape

In [None]:
num_listings = len(ams)
num_hosts = len(ams['host_id'].unique())

print(f'The Amsterdam data contains information about {num_listings} Airbnb listings from {num_hosts} hosts.')

In [None]:
ams.describe()

In [None]:
ams.dtypes

Most of the features are numeric, with some continuous floats such as `latitude` and `longitude`, and some integer variables such as `price`, `minimum_nights` and `availability_365`.



In [None]:
#dropping unnecessary columns
ams.drop(['host_name','last_review', 'neighbourhood_group'], axis=1, inplace=True)
# Visualize the first 5 rows
ams.head()

In [None]:
#checking for null values
ams.isnull().sum()

The missing values in the `reviews_per_month` column correspond to rows where there are no reviews. As there have been no reviews left for this listing, the number of reviews per month cannot be calculated. This means that we can fill the null values with 0.
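
The claim that every missing `reviews_per_month` belongs to a listing without reviews can be checked before filling. A minimal sketch, using a small made-up frame (the `toy` data is an assumption; only the column names come from the listings file):

```python
import numpy as np
import pandas as pd

# Made-up listings: reviews_per_month is missing exactly where number_of_reviews is 0
toy = pd.DataFrame({
    'number_of_reviews': [0, 12, 0, 3],
    'reviews_per_month': [np.nan, 1.5, np.nan, 0.2],
})

# Rows with a missing rate, and rows with no reviews at all
missing_rate = toy['reviews_per_month'].isnull()
no_reviews = toy['number_of_reviews'] == 0

# True only if the two masks agree on every row
nulls_match_zero_reviews = bool((missing_rate == no_reviews).all())
print(nulls_match_zero_reviews)

# If the masks agree, filling the missing rates with 0 loses no information
filled = toy.fillna({'reviews_per_month': 0})
```

If the check came back `False` on the real data, some missing rates would belong to reviewed listings and would need separate handling.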

In [None]:
#replacing null values in reviews_per_month column with 0
ams.fillna({'reviews_per_month':0}, inplace=True)
#replacing null values in name column with an empty string
ams['name'] = ams['name'].fillna('')

In [None]:
#examining the dataset
(ams[['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365']]
 .describe())

From the table above, we can see that the minimum price for a room is 0. This is obviously an error, so the Amsterdam data will be restricted to listings with a nightly price above 0.

In [None]:
#only include listings where price >0 
ams = ams.loc[ams['price'] > 0]

Similarly, the maximum value for `minimum_nights` is 1001, which equates to nearly three years as a minimum stay! For the purpose of this project, we will keep only listings where `minimum_nights` is less than 31.

In [None]:
ams = ams.loc[ams['minimum_nights'] < 31]

In [None]:
#creating a dataset for the correlation matrix, removing unnecessary variables
corr_ams = ams[['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365']]

In [None]:
Q1 = corr_ams.quantile(q=.25)
Q3 = corr_ams.quantile(q=.75)
IQR = corr_ams.apply(stats.iqr)

ams_no_outliers = corr_ams[~((corr_ams < (Q1-1.5*IQR)) | (corr_ams > (Q3+1.5*IQR))).any(axis=1)]
ams_no_outliers.head()
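
The cell above applies the usual 1.5 × IQR rule: a row is dropped if any of its values lies below Q1 - 1.5·IQR or above Q3 + 1.5·IQR for that column. A minimal sketch on made-up numbers (the `toy` frame and the helper name `drop_iqr_outliers` are illustrative assumptions, not part of the notebook):

```python
import pandas as pd

def drop_iqr_outliers(df, k=1.5):
    """Keep rows whose values all lie within [Q1 - k*IQR, Q3 + k*IQR] per column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # True wherever a value falls outside the fences of its own column
    outside = (df < (q1 - k * iqr)) | (df > (q3 + k * iqr))
    return df[~outside.any(axis=1)]

# Made-up listings with one implausibly expensive row
toy = pd.DataFrame({
    'price': [80, 90, 100, 110, 120, 5000],
    'minimum_nights': [1, 2, 2, 3, 3, 2],
})

trimmed = drop_iqr_outliers(toy)
print(trimmed)
```

Because the rule is applied per column and combined with `any`, a listing is removed as soon as one of its variables is extreme, which can discard many rows when several columns are heavily skewed.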

We can now use some basic Natural Language Processing (NLP) to create a word cloud of the most common words and phrases used in listing names for places in Amsterdam.

In [None]:
#creating a series of Amsterdam listing names
ams_name = ams['name']

In [None]:
text = " ".join(str(each) for each in ams_name)
# Create and generate a word cloud image:
wordcloud = WordCloud(max_words=200, background_color="white").generate(text)
plt.figure(figsize=(15,10))
# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Unsurprisingly, "Amsterdam" is one of the most commonly mentioned words in the dataset. Other words include city center/centre, spacious, beautiful, and apartment. Later on we can see if the common use of the word 'apartment' corresponds to the number of apartment listings in the Amsterdam data.

In [None]:
#removing punctuation
text = text.translate(str.maketrans('', '', string.punctuation))
#removing numbers
text = ''.join([i for i in text if not i.isdigit()])
#making all text lower case
text = text.lower()
#removing stop words from text to include only words that are relevant
text = remove_stopwords(text)

In [None]:
#assigning the Counter instance 'most_common' call to a variable
word_frequency = Counter(text.split()).most_common(10)

#'most_common' returns a list of (word, count) tuples
words = [word for word, _ in word_frequency]
counts = [count for _, count in word_frequency]

#creating plot

plt.bar(words, counts, color = '#ff8882')
plt.title("10 most frequent tokens in listing names")
plt.ylabel("Frequency")
plt.xlabel("Words")
plt.xticks(rotation=45)
plt.show()
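
The cleaning and counting steps can be illustrated end to end on a few made-up listing names. This sketch uses only the standard library; the tiny `stop_words` set is a stand-in assumption for gensim's `remove_stopwords`, not its actual stop word list.

```python
import string
from collections import Counter

# Made-up listing names, for illustration only
names = [
    "Spacious apartment in the city centre!",
    "Beautiful apartment near Vondelpark (2 rooms)",
    "Cosy room in the centre",
]

# Hypothetical, very small stop word list
stop_words = {"in", "the", "near"}

text = " ".join(names)
# Strip punctuation and digits, then lower-case
text = text.translate(str.maketrans('', '', string.punctuation))
text = ''.join(ch for ch in text if not ch.isdigit())
text = text.lower()

# Drop stop words and count the remaining tokens
tokens = [word for word in text.split() if word not in stop_words]
top_words = Counter(tokens).most_common(2)
print(top_words)
```

Counting single tokens treats "city" and "centre" separately; phrases such as "city centre" would need bigrams, which this sketch does not attempt.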

In [None]:
f, ax = plt.subplots(figsize=(16, 12))

#creating distplots for each variable
subplot(2,3,1)
sns.distplot(ams['price'])

subplot(2,3,2)
sns.distplot(ams['minimum_nights'])

subplot(2,3,3)
sns.distplot(ams['number_of_reviews'])

subplot(2,3,4)
sns.distplot(ams['reviews_per_month'])

subplot(2,3,5)
sns.distplot(ams['availability_365'])

plt.tight_layout() # avoid overlap of plots
plt.draw()

In [None]:
f, ax = plt.subplots(figsize=(16, 12))

#creating boxplots for each variable
subplot(2,3,1)
sns.boxplot(y = ams['price']) 

subplot(2,3,2)
sns.boxplot(y = ams['minimum_nights'])

subplot(2,3,3)
sns.boxplot(y = ams['number_of_reviews'])

subplot(2,3,4)
sns.boxplot(y = ams['reviews_per_month'])

subplot(2,3,6)
sns.boxplot(y = ams['availability_365'])

plt.tight_layout() # avoid overlap of plots
plt.draw()

What's immediately evident from these boxplots is the number of outliers for each variable. This matches the distribution plots above, which are strongly right-skewed (positively skewed), with the mean lying to the right of the median.

Another thing that is noticeable is how positively skewed the boxplot for `availability_365` is. The outliers take up nearly 85% of the whole chart.
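
Right skew can also be checked numerically: for a right-skewed variable the mean sits above the median and the sample skewness is positive. A small sketch on made-up prices (the values are assumptions for illustration):

```python
import pandas as pd

# Made-up nightly prices with a long right tail
prices = pd.Series([60, 80, 90, 100, 120, 150, 900])

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()  # positive for a right-skewed distribution

print(f'mean={mean_price:.1f}, median={median_price:.1f}, skew={skewness:.2f}')
```

A single very expensive listing is enough to pull the mean well above the median, which is why the median is often the safer summary for prices.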

We can also create a bar chart to see the counts of values for `room_type`.

In [None]:
title = 'Properties per Room Type'
sns.countplot(x=ams['room_type'])
plt.title(title)
plt.ioff()

The bar chart clearly shows that entire homes and apartments are by far the most popular room type in Amsterdam, with almost 16,000 listings. Shared rooms are few and far between, with the bar not visible.

We can also look at the count of listings for each neighbourhood in Amsterdam.

In [None]:
#creating a countplot
title = 'Properties per Neighbourhood'
ax = sns.countplot(x=ams['neighbourhood'])

#setting font size and rotation
ax.set_xticklabels(ax.get_xticklabels(), fontsize=14)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")

plt.title(title)
plt.ioff()

We can see that neighbourhoods such as De Baarsjes - Oud West, De Pijp - Rivierenbuurt and Centrum West are three of the most popular in Amsterdam, whereas the Bijlmer-Oost, Bijlmer-Centrum and Gaasperdam-Driemond neighbourhoods are much less popular.

In [None]:
#creating a dataset for the correlation matrix, removing unnecessary variables
corr_ams = ams[['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365']]

In [None]:
plt.figure(figsize=(20,10))
title = 'Correlation matrix of numerical variables'
sns.heatmap(corr_ams.corr(), annot=True, square=True, cmap='Reds')
plt.title(title)
plt.ioff()

The above correlation matrix shows the correlation between the numerical values in the dataset. The following conclusions can be reached from this:

* None of the variables are particularly negatively correlated (all correlations are above -0.1)

* The strongest positive correlation is between `reviews_per_month` and `number_of_reviews` (0.66), which makes sense as the more reviews a listing has, the more reviews it will have in a month.

* The `availability_365` variable is positively correlated with several other variables. For example, the higher the availability of a room, the more reviews the listing tends to have (again, this makes sense as the room is available for more days so more people have the opportunity to review it). Additionally, the higher the availability of a room, the higher its price tends to be.
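
Because the correlation matrix is symmetric, it can be easier to read the strongest relationships from a ranked list of variable pairs. A minimal sketch with made-up data (the `toy` frame is an assumption); note that a correlation describes association only and does not show that one variable drives another.

```python
import numpy as np
import pandas as pd

# Made-up values for three of the numerical columns
toy = pd.DataFrame({
    'number_of_reviews': [0, 5, 10, 20, 40],
    'reviews_per_month': [0.0, 0.4, 1.0, 1.5, 3.2],
    'availability_365': [10, 300, 50, 200, 120],
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once, then rank by strength
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)
print(pairs)

strongest_pair = pairs.index[0]
```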

Next, we can see where listings are in relation to their latitude and longitude. Using a scatterplot will create a map-like image for us to see where neighbourhoods are positioned.

In [None]:
title = 'Neighbourhood Location'
plt.figure(figsize=(10,6))
#creating scatterplot
sns.scatterplot(x=ams.longitude, y=ams.latitude, hue=ams.neighbourhood).set_title(title)
#moving legend
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)
plt.ioff()

We can look at this in more detail by setting the `hue` as `room_type`.

In [None]:
title = 'Room type location per Neighbourhood'
plt.figure(figsize=(10,6))
sns.scatterplot(x=ams.longitude, y=ams.latitude, hue=ams.room_type).set_title(title)
plt.ioff()

From this, we can see there is no particular pattern of where the type of room would be located; the shared rooms, for example, are equally spread across the city.

Let's see what our data points look like on an interactive map.

In [None]:
m = folium.Map(
    location = [52.377956, 4.897070],
    tiles = 'Stamen Terrain',
    zoom_start = 12             
              )
ams.apply(lambda x: folium.Circle([x.latitude, x.longitude], 50, fill=True).add_to(m), axis = 1)

m

We can use a boxplot to look at the distribution of listing prices between neighbourhoods.

In [None]:
x= 'neighbourhood'
y= 'price'
title = 'Price per Neighbourhood'



f, ax = plt.subplots(figsize=(8, 6))
ax = sns.boxplot(x=x, y=y, data=ams)
ax.set_xticklabels(ax.get_xticklabels(), fontsize=14)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.title(title)
plt.ioff()

Again, the many outliers mean that the boxplots are quite compressed. The majority of listings seem to be under $300, so we can focus on this in the next set of charts.

## Price in Relation to Neighbourhood

In [None]:
x = 'neighbourhood'
y = 'price'

title = 'Price per neighbourhood for properties under $300'
ams_filtered = ams.loc[ams['price'] < 300]
f, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x=x, y=y, data=ams_filtered, notch=True, showmeans=True,
           meanprops={"marker":"s","markerfacecolor":"white", "markeredgecolor":"black"}, ax=ax)
#styling the tick labels after plotting, so that the neighbourhood names are kept
ax.set_xticklabels(ax.get_xticklabels(), fontsize=14, rotation=40, ha="right")
plt.title(title)
plt.ioff()

title = 'Price per neighbourhood for properties more than $300'
ams_filtered = ams.loc[ams['price'] > 300]
f, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x=x, y=y, data=ams_filtered, notch=False, showmeans=True,
           meanprops={"marker":"s","markerfacecolor":"white", "markeredgecolor":"black"}, ax=ax)
#styling the tick labels after plotting, so that the neighbourhood names are kept
ax.set_xticklabels(ax.get_xticklabels(), fontsize=14, rotation=40, ha="right")
plt.title(title)
plt.ioff()

The white square on each box plot denotes the mean. From this, we can see that Bijlmer-Centrum is the cheapest neighbourhood to stay in according to the mean, while Oud-Oost is one of the most expensive.
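
The neighbourhood means marked on the box plots can be read off exactly with a `groupby`. A minimal sketch on made-up listings (the prices are assumptions, not values from the dataset):

```python
import pandas as pd

# Made-up listings for three neighbourhoods
toy = pd.DataFrame({
    'neighbourhood': ['Bijlmer-Centrum', 'Bijlmer-Centrum', 'Oud-Oost', 'Oud-Oost', 'Centrum-West'],
    'price': [70, 90, 160, 180, 150],
})

# Mean nightly price per neighbourhood, cheapest first
mean_price = toy.groupby('neighbourhood')['price'].mean().sort_values()
print(mean_price)

cheapest = mean_price.index[0]
most_expensive = mean_price.index[-1]
```

On the real data, the same call on `ams_filtered` would give the ranking for the chosen price band.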

## Price in Relation to Room Type

In [None]:
title = 'Price per Room Type for Properties under $300'
ams_filtered = ams.loc[ams['price'] < 300]
f, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x='room_type', y='price', data=ams_filtered, notch=True, showmeans=True,
           meanprops={"marker":"s","markerfacecolor":"white", "markeredgecolor":"black"})
plt.title(title)
plt.ioff()

title = 'Price per Room Type for Properties more than $300'
ams_filtered = ams.loc[ams['price'] > 300]
f, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x='room_type', y='price', data=ams_filtered, notch=False, showmeans=True,
           meanprops={"marker":"s","markerfacecolor":"white", "markeredgecolor":"black"})
plt.title(title)
plt.ioff()

From this, we can note that there are very few shared rooms that cost more than \$300, and that entire homes/apartments are both the most expensive and have the largest spread of data points. For properties under \$300, the mean price of an entire home/apartment is nearly \$150, compared to about \$80 for a shared room.

## Price in Relation to Number of Reviews per Month

In [None]:
x = 'reviews_per_month'
y = 'price'

title = 'Price relation to number of reviews per month for Properties under $300'
ams_filtered = ams.loc[(ams['price'] < 300)]
f, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(x=x, y=y, data=ams_filtered)
plt.title(title)
plt.ioff()

title = 'Price relation to number of reviews per month for Properties more than $300'
ams_filtered = ams.loc[ams['price'] > 300]
f, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(x=x, y=y, data=ams_filtered)
plt.title(title)
plt.ioff()

## Price in Relation to Number of Reviews and Room Type

In [None]:
x = 'number_of_reviews'
y = 'price'

title = 'Price relation to number of reviews and Room Type for Properties under $300'
ams_filtered = ams.loc[ams['price'] < 300]
f, ax = plt.subplots(figsize=(8, 6))
#colouring the points by room type
sns.scatterplot(x=x, y=y, hue='room_type', data=ams_filtered)
plt.title(title)
plt.ioff()

title = 'Price relation to number of reviews and Room Type for Properties more than $300'
ams_filtered = ams.loc[ams['price'] > 300]
f, ax = plt.subplots(figsize=(8, 6))
#colouring the points by room type
sns.scatterplot(x=x, y=y, hue='room_type', data=ams_filtered)
plt.title(title)
plt.ioff()

## Price in Relation to Minimum Nights

In [None]:
x = 'minimum_nights'
y = 'price'

title = 'Price relation to minimum_nights for Properties under $300'
ams_filtered = ams.loc[ams['price'] < 300]
f, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(x=x, y=y, data=ams_filtered)
plt.title(title)
plt.ioff()

title = 'Price relation to minimum_nights for Properties more than $300'
ams_filtered = ams.loc[ams['price'] > 300]
f, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(x=x, y=y, data=ams_filtered)
plt.title(title)
plt.ioff()

## Price in Relation to Availability

In [None]:
x = 'availability_365'
y = 'price'

title = 'Price relation to availability for Properties under $300'
ams_filtered = ams.loc[ams['price'] < 300]
f, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(x=x, y=y, data=ams_filtered)
plt.title(title)
plt.ioff()

title = 'Price relation to availability for Properties more than $300'
ams_filtered = ams.loc[ams['price'] > 300]
f, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(x=x, y=y, data=ams_filtered)
plt.title(title)
plt.ioff()

# Exploratory Data Analysis (London)

In [None]:
# reading in London data
ldn = pd.read_csv('airbnb_london/listings_summary.csv')
ldn.set_index('id',inplace=True)
ldn.head()

In [None]:
ldn.shape

In [None]:
num_listings = len(ldn)
num_hosts = len(ldn['host_id'].unique())

print(f'The London data contains information about {num_listings} Airbnb listings from {num_hosts} hosts.')

In [None]:
ldn.describe()

In [None]:
ldn.dtypes

Like the Amsterdam data, most of the features are numeric, with some continuous floats such as `latitude` and `longitude`, and some integer variables such as `price`, `minimum_nights` and `availability_365`.

In [None]:
# dropping unnecessary columns
ldn.drop(['host_name','last_review', 'neighbourhood_group'], axis=1, inplace=True)
# Visualize the first 5 rows
ldn.head()

In [None]:
ldn.isnull().sum()

The missing values in the `reviews_per_month` column correspond to rows where there are no reviews. As there have been no reviews left for this listing, the number of reviews per month cannot be calculated. This means that we can fill the null values with 0.

In [None]:
#replacing null values in reviews_per_month column with 0
ldn.fillna({'reviews_per_month':0}, inplace=True)
#replacing null values in name column with an empty string
ldn['name'] = ldn['name'].fillna('')

In [None]:
#examine the dataset
(ldn[['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365']]
 .describe())

From the table above, we can see that the minimum price for a room is 0, as in the Amsterdam data. We will only use rows where the price is more than 0. Similarly, we have a maximum value of 1125 for `minimum_nights`, so we will cut the dataset so it only shows rows where the value for `minimum_nights` is less than 31.

In [None]:
#only include listings where price > 0
ldn = ldn.loc[ldn['price'] > 0]
#only include listings where minimum_nights < 31
ldn = ldn.loc[ldn['minimum_nights'] < 31]

In [None]:
ldn.describe()

In [None]:
#creating a dataset for the correlation matrix, removing unnecessary variables
corr_ldn = ldn[['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365']]

In [None]:
Q1 = corr_ldn.quantile(q=.25)
Q3 = corr_ldn.quantile(q=.75)
IQR = corr_ldn.apply(stats.iqr)

ldn_no_outliers = corr_ldn[~((corr_ldn < (Q1-1.5*IQR)) | (corr_ldn > (Q3+1.5*IQR))).any(axis=1)]
ldn_no_outliers.head()

We can now use some basic Natural Language Processing (NLP) to create a word cloud of the most common words and phrases used in listing names for places in London.

In [None]:
ldn_name = ldn['name']

In [None]:
text = " ".join(str(each) for each in ldn_name)
# Create and generate a word cloud image:
wordcloud = WordCloud(max_words=200, background_color="white").generate(text)
plt.figure(figsize=(15,10))
# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

From this word cloud, it appears that listing names most often mention Central London, along with double rooms and bedrooms. The words 'apartment' and 'beautiful' are prominent in this word cloud, as they were in the Amsterdam word cloud.

A couple of place names are mentioned here, for example Hyde Park, Canary Wharf, Shoreditch and Notting Hill - it seems hosts are keen to advertise their closeness to these popular destinations.

In [None]:
text = text.translate(str.maketrans('', '', string.punctuation))
text = ''.join([i for i in text if not i.isdigit()])
text = text.lower()
text = remove_stopwords(text)

In [None]:
# Assign the Counter instance `most_common` call to a variable:
word_frequency = Counter(text.split()).most_common(10)

# `most_common` returns a list of (word, count) tuples
words = [word for word, _ in word_frequency]
counts = [count for _, count in word_frequency]

plt.bar(words, counts, color = "#ff8882")
plt.title("10 most frequent tokens in listing names")
plt.ylabel("Frequency")
plt.xlabel("Words")
plt.xticks(rotation=45)
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(8, 6))

subplot(2,3,1)
sns.distplot(ldn['price'])

subplot(2,3,2)
sns.distplot(ldn['minimum_nights'])

subplot(2,3,3)
sns.distplot(ldn['number_of_reviews'])

subplot(2,3,4)
sns.distplot(ldn['reviews_per_month'])

subplot(2,3,5)
sns.distplot(ldn['calculated_host_listings_count'])

subplot(2,3,6)
sns.distplot(ldn['availability_365'])


plt.tight_layout() # avoid overlap of plots
plt.draw()

In [None]:
f, ax = plt.subplots(figsize=(8, 6))



subplot(2,3,1)
sns.boxplot(y = ldn['price']) 

subplot(2,3,2)
sns.boxplot(y = ldn['minimum_nights'])

subplot(2,3,3)
sns.boxplot(y = ldn['number_of_reviews'])

subplot(2,3,4)
sns.boxplot(y = ldn['reviews_per_month'])

subplot(2,3,6)
sns.boxplot(y = ldn['availability_365'])

plt.tight_layout() # avoid overlap of plots
plt.draw()

Similar to the Amsterdam boxplots, there is a large number of outliers present for the London data. The box plots for `price` and `number_of_reviews` are very squashed, so it is quite hard to read data from these. One major difference between the Amsterdam and London boxplots is that there are no outliers present on the `availability_365` box plot for London. Amsterdam's median for `availability_365` was about 10, whereas for London it is about 60. Likewise, Amsterdam's upper quartile is about 70, compared to London's, which is much higher at about 250. This means that hosts in London tend to have a much higher availability out of the 365 days in the year compared to Amsterdam.
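
The quartile comparison can be made without reading values off the charts. The sketch below uses two made-up availability series standing in for the Amsterdam and London columns; the numbers are assumptions chosen only to show the mechanics.

```python
import pandas as pd

# Made-up availability values (days per year) for each city
availability = pd.DataFrame({
    'amsterdam': [0, 0, 5, 10, 30, 70, 365],
    'london': [0, 20, 40, 60, 150, 250, 365],
})

# Lower quartile, median and upper quartile side by side
summary = availability.quantile([0.25, 0.5, 0.75])
print(summary)

# Difference between the two medians
median_gap = summary.loc[0.5, 'london'] - summary.loc[0.5, 'amsterdam']
```

With the real data, `ams['availability_365']` and `ldn['availability_365']` have different lengths, so the quantiles would be computed on each series separately rather than in one frame.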

We can also create a bar chart to see the counts of values for `room_type`.

In [None]:
title = 'Properties per Room Type'
sns.countplot(x=ldn['room_type'])
plt.title(title)
plt.ioff()

Another room type has appeared that wasn't present in the Amsterdam data; hosts have offered hotel rooms. However, these and shared rooms take up a very small proportion of the total listings - entire homes/apartments and private rooms are much more popular. Like Amsterdam, entire homes/apartments are the most popular, with close to 5000 being listed on Airbnb in London.

We can also look at the count of listings for each neighbourhood in London.

In [None]:
title = 'Properties per Neighbourhood'
#setting the figure size before plotting so that it applies to this chart
plt.figure(figsize=(25, 15))
ax = sns.countplot(x=ldn['neighbourhood'])

ax.set_xticklabels(ax.get_xticklabels(), fontsize=14, rotation=40, ha="right")
plt.title(title)
plt.ioff()

We can see that the three most popular neighbourhoods to stay in within London are Westminster, Tower Hamlets and Hackney, whereas the three least popular are Havering, Bexley and Sutton.
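
The most and least listed neighbourhoods can be extracted directly with `value_counts`, rather than read from the bar heights. A minimal sketch on made-up data (the listing counts are assumptions):

```python
import pandas as pd

# Made-up neighbourhood column with a different number of listings per borough
neighbourhoods = pd.Series(
    ['Westminster'] * 5 + ['Tower Hamlets'] * 4 + ['Hackney'] * 3 + ['Sutton'] * 1 + ['Bexley'] * 2
)

# value_counts sorts from most to least frequent
counts = neighbourhoods.value_counts()
most_listed = counts.head(3).index.tolist()
least_listed = counts.tail(1).index.tolist()
print(most_listed, least_listed)
```

Note that these counts measure how many listings exist in each borough, which is a proxy for popularity with hosts rather than with guests.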

In [None]:
plt.figure(figsize=(20,10))
title = 'Correlation matrix of numerical variables'
sns.heatmap(corr_ldn.corr(), annot=True, square=True, cmap='RdBu')
plt.title(title)
plt.ioff()

This correlation matrix shares quite a few things in common with the Amsterdam one. Again, none of the variables are particularly negatively correlated (nothing lower than -0.2). The most negative correlation is between `reviews_per_month` and `minimum_nights`. Like the Amsterdam dataset, there is a positive correlation of roughly 0.6 between `reviews_per_month` and `number_of_reviews`. Also, several other positive correlations can be seen between `availability_365` and other variables such as `price` and `number_of_reviews`, but less so in this matrix compared with the Amsterdam one.

Next, we can see where listings are in relation to their latitude and longitude. Using a scatterplot will create a map-like image for us to see where neighbourhoods are positioned.

In [None]:
title = 'Neighbourhood Location'
plt.figure(figsize=(10,6))
sns.scatterplot(x=ldn.longitude, y=ldn.latitude, hue=ldn.neighbourhood).set_title(title)
# moving legend
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)
plt.ioff()

We can look at this in more detail by setting the hue as `room_type`.

In [None]:
title = 'Room type location per Neighbourhood'
plt.figure(figsize=(10,6))
sns.scatterplot(x=ldn.longitude, y=ldn.latitude, hue=ldn.room_type).set_title(title)
plt.ioff()

There is slightly more of a pattern seen here than for the Amsterdam data. There are no hotel rooms or shared rooms in central London, and although both entire homes and private rooms are listed in abundance across London, entire homes seem to be more prominent in central London.

Now we can see what our data points look like on an interactive map.

In [None]:
m = folium.Map(
    location = [51.509865, -0.118092],
    tiles = 'Stamen Terrain',
    zoom_start = 12             
              )
ldn.apply(lambda x: folium.Circle([x.latitude, x.longitude], 50, fill=True).add_to(m), axis = 1)
m

In [None]:
x= 'neighbourhood'
y= 'price'
title = 'Price per Neighbourhood'



f, ax = plt.subplots(figsize=(8, 6))
ax = sns.boxplot(x=x, y=y, data=ldn)
ax.set_xticklabels(ax.get_xticklabels(), fontsize=14)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.title(title)
plt.ioff()

Again, the many outliers mean that the boxplots are very compressed. The majority of listings seem to be under $200, so we can focus on this in the next set of charts.

## Price in Relation to Neighbourhood

In [None]:
x = 'neighbourhood'
y = 'price'

title = 'Price per neighbourhood for properties under $200'
ldn_filtered = ldn.loc[ldn['price'] < 200]
f, ax = plt.subplots(figsize=(10, 8))
sns.boxplot(x=x, y=y, data=ldn_filtered, notch=True, showmeans=True,
           meanprops={"marker":"s","markerfacecolor":"white", "markeredgecolor":"black"}, ax=ax)
# styling the tick labels after plotting, so that the neighbourhood names are kept
ax.set_xticklabels(ax.get_xticklabels(), fontsize=14, rotation=40, ha="right")
plt.title(title)
plt.ioff()

title = 'Price per neighbourhood for properties more than $200'
ldn_filtered = ldn.loc[ldn['price'] > 200]
f, ax = plt.subplots(figsize=(10, 8))
sns.boxplot(x=x, y=y, data=ldn_filtered, notch=False, showmeans=True,
           meanprops={"marker":"s","markerfacecolor":"white", "markeredgecolor":"black"}, ax=ax)
# styling the tick labels after plotting, so that the neighbourhood names are kept
ax.set_xticklabels(ax.get_xticklabels(), fontsize=14, rotation=40, ha="right")
plt.title(title)
plt.ioff()

The white square on each box plot denotes the mean. From this, we can see that if someone was looking for a listing under \$200, the City of London is by far the most expensive borough (neighbourhood) to stay in, with a mean of around \$130 per night - over double the mean cost of a night in Croydon (\$60). When looking at the boxplots for listings over \$200, we can see in more detail that boroughs like Camden and Westminster have many outliers. For example, the most expensive listing in Westminster is \$10,000 a night. Whether this is a mistake or a genuine listing price, we can't be sure.
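
Extreme listings such as the one in Westminster can be located by taking the maximum price per borough. A minimal sketch with made-up rows (the values are assumptions; only the 10,000 figure echoes the text):

```python
import pandas as pd

# Made-up listings, including one extreme advertised price
toy = pd.DataFrame({
    'neighbourhood': ['Westminster', 'Westminster', 'Camden', 'Croydon'],
    'price': [250, 10000, 400, 60],
})

# Highest advertised price in each borough
max_price = toy.groupby('neighbourhood')['price'].max()

# Row of the single most expensive listing
top_listing = toy.loc[toy['price'].idxmax()]
print(max_price)
print(top_listing)
```

Inspecting the full row of such a listing (name, room type, minimum nights) is one way to judge whether the price looks like a placeholder or a genuine luxury rental.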

## Prices in Relation to Room Type

In [None]:
title = 'Price per Room Type for Properties under $200'
ldn_filtered = ldn.loc[ldn['price'] < 200]
f, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x='room_type', y='price', data=ldn_filtered, notch=True, showmeans=True,
           meanprops={"marker":"s","markerfacecolor":"white", "markeredgecolor":"black"})
plt.title(title)
plt.ioff()

title = 'Price per Room Type for Properties more than $200'
ldn_filtered = ldn.loc[ldn['price'] > 200]
f, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x='room_type', y='price', data=ldn_filtered, notch=False, showmeans=True,
           meanprops={"marker":"s","markerfacecolor":"white", "markeredgecolor":"black"})
plt.title(title)
plt.ioff()

From this, we can note that, again, shared rooms tend to be the cheapest type of room to rent. For listings under \$200, the mean for entire homes/apartments and hotel rooms is almost exactly the same (around \$110), but hotel rooms have a larger spread of prices and a slightly higher median than entire homes/apartments.

## Price in Relation to Reviews per Month

In [None]:
x = 'reviews_per_month'
y = 'price'

title = 'Price relation to number of reviews per month for Properties under $175'
ldn_filtered = ldn.loc[(ldn['price'] < 175) & (ldn['reviews_per_month'] < 30)]
f, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(x=x, y=y, data=ldn_filtered)
plt.title(title)
plt.ioff()

title = 'Price relation to number of reviews per month for Properties more than $175'
ldn_filtered = ldn.loc[ldn['price'] > 175]
f, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(x=x, y=y, data=ldn_filtered)
plt.title(title)
plt.ioff()

## Price in Relation to Number of Reviews and Room Type


In [None]:
x = 'number_of_reviews'
y = 'price'

title = 'Price relation to number of reviews and Room Type for Properties under $175'
ldn_filtered = ldn.loc[ldn['price'] < 175]
f, ax = plt.subplots(figsize=(8, 6))
# colouring the points by room type
sns.scatterplot(x=x, y=y, hue='room_type', data=ldn_filtered)
plt.title(title)
plt.ioff()

title = 'Price relation to number of reviews and Room Type for Properties more than $175'
ldn_filtered = ldn.loc[ldn['price'] > 175]
f, ax = plt.subplots(figsize=(8, 6))
# colouring the points by room type
sns.scatterplot(x=x, y=y, hue='room_type', data=ldn_filtered)
plt.title(title)
plt.ioff()

## Price in Relation to Minimum Nights

In [None]:
x = 'minimum_nights'
y = 'price'

title = 'Price relation to minimum_nights for Properties under $175'
ldn_filtered = ldn.loc[ldn['price'] < 175]
f, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(x=x, y=y, data=ldn_filtered)
plt.title(title)
plt.ioff()

title = 'Price relation to minimum_nights for Properties more than $175'
ldn_filtered = ldn.loc[ldn['price'] > 175]
f, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(x=x, y=y, data=ldn_filtered)
plt.title(title)
plt.ioff()

## Price in Relation to Availability


In [None]:
x = 'availability_365'
y = 'price'

title = 'Price relation to availability for Properties under $175'
ldn_filtered = ldn.loc[ldn['price'] < 175]
f, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(x=x, y=y, data=ldn_filtered)
plt.title(title)
plt.ioff()

title = 'Price relation to availability for Properties more than $175'
ldn_filtered = ldn.loc[ldn['price'] > 175]
f, ax = plt.subplots(figsize=(8, 6))
sns.scatterplot(x=x, y=y, data=ldn_filtered)
plt.title(title)
plt.ioff()

# Which host behaviours or profile attributes influence Airbnb guests' reviews in both London and Amsterdam?

In [None]:
df_ldn = pd.read_csv('/Users/georginadangerfield/Downloads/assignment/airbnb_london/listings.csv')
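# NOTE: this path points at the airbnb_madrid folder although the frame is named and used as the Amsterdam data; check it against the intended file
df_ams = pd.read_csv('/Users/georginadangerfield/Downloads/assignment/airbnb_madrid/listings_detailed.csv')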

num_rows_ldn = df_ldn.shape[0]
num_cols_ldn = df_ldn.shape[1]

num_rows_ams = df_ams.shape[0]
num_cols_ams = df_ams.shape[1]

most_missing_cols_ldn = set(df_ldn.columns[df_ldn.isnull().mean() > 0.75])
most_missing_cols_ams = set(df_ams.columns[df_ams.isnull().mean() > 0.75])

print(num_rows_ldn, num_cols_ldn, num_rows_ams, num_cols_ams)
print(most_missing_cols_ldn, most_missing_cols_ams)

In [None]:
# Basic Data Cleaning function for London
def clean_dataset_ldn(df):
    '''
    INPUT
    df - pandas dataframe containing the raw London listings data
    
    OUTPUT
    df - cleaned dataframe in which:
    1. constant, URL and otherwise unused columns are dropped;
    2. price and percentage strings are converted to numbers;
    3. the original review columns are replaced by a single new_review_metric
    '''
    
    useless_columns = ['access', 'interaction', 'house_rules','name', 'host_name', 'square_feet', 'id', 'host_id','summary', 'space', 'description', 'neighborhood_overview', 'notes', 
                       'host_since', 'host_location', 'host_about', 'host_neighbourhood', 'host_total_listings_count', 'street', 'neighbourhood', 
                       'minimum_nights', 'maximum_nights', 'city', 'zipcode', 'smart_location', 'latitude', 
                       'longitude', 'is_location_exact', 'weekly_price', 'monthly_price', 'require_guest_profile_picture', 
                       'require_guest_phone_verification', 'calculated_host_listings_count', 'availability_30', 'availability_60', 'availability_90', 
                       'availability_365', 'calendar_updated','transit', 'medium_url', 'xl_picture_url']
    
    # drop columns that hold a single constant value, contain URL links, or are listed as useless
    # (collecting the names first avoids dropping the same column twice, which raises a KeyError)
    cols_to_drop = [col for col in df.columns
                    if df[col].nunique(dropna=False) == 1 or 'url' in col or col in useless_columns]
    df = df.drop(columns=cols_to_drop)
    
    # generate review columns
    review_columns = []
    for col in df:
        if 'review' in col:
            review_columns.append(col)
    
    
    #convert all related 'price' columns values from string to number
    # regex=True is required in pandas 2.x, where str.replace treats the pattern literally by default
    df['price'] = df['price'].astype(str).str.replace("[$, ]", "", regex=True).astype("float")
    df['security_deposit'] = df['security_deposit'].astype(str).str.replace("[$, ]", "", regex=True).astype("float")
    df['cleaning_fee'] = df['cleaning_fee'].astype(str).str.replace("[$, ]", "", regex=True).astype("float")
    df['extra_people'] = df['extra_people'].astype(str).str.replace("[$, ]", "", regex=True).astype("float")
    #convert all percentage columns values to float number
    df['host_response_rate'] = df['host_response_rate'].astype(str).str.replace("[%, ]", "", regex=True).astype("float")/100
    #generate new review metric
    df['new_review_metric'] = df['reviews_per_month'] * df['review_scores_rating']/100
    #drop original review columns
    df = df.drop(review_columns, axis=1)
    
    return df

In [None]:
# Basic Data Cleaning function for Amsterdam
def clean_dataset_ams(df):
    '''
    INPUT
    df - pandas dataframe containing the raw Amsterdam listings data
    
    OUTPUT
    df - cleaned dataframe in which:
    1. constant, URL and otherwise unused columns are dropped;
    2. price and percentage strings are converted to numbers;
    3. the original review columns are replaced by a single new_review_metric
    '''
    
    useless_columns = ['access', 'interaction', 'house_rules','name', 'host_name', 'square_feet', 'id', 'host_id','summary', 'space', 'description', 'neighborhood_overview', 'notes', 
                       'host_since', 'host_location', 'host_about', 'host_neighbourhood', 'host_total_listings_count', 'street', 'neighbourhood', 
                       'minimum_nights', 'maximum_nights', 'city', 'zipcode', 'smart_location', 'latitude', 
                       'longitude', 'is_location_exact', 'weekly_price', 'monthly_price', 'require_guest_profile_picture', 
                       'require_guest_phone_verification', 'calculated_host_listings_count', 'availability_30', 'availability_60', 'availability_90', 
                       'availability_365', 'transit', 'medium_url', 'xl_picture_url',
                       'host_acceptance_rate']
    
    # drop columns that hold a single constant value, contain URL links, or are listed as useless
    # (collecting the names first avoids dropping the same column twice, which raises a KeyError)
    cols_to_drop = [col for col in df.columns
                    if df[col].nunique(dropna=False) == 1 or 'url' in col or col in useless_columns]
    df = df.drop(columns=cols_to_drop)
    
    # generate review columns
    review_columns = []
    for col in df:
        if 'review' in col:
            review_columns.append(col)
    
    
    #convert all related 'price' columns values from string to number
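    # regex=True is required in pandas 2.x, where str.replace treats the pattern literally by default
    df['price'] = df['price'].astype(str).str.replace("[$, ]", "", regex=True).astype("float")
    #convert all percentage columns values to float number
    df['host_response_rate'] = df['host_response_rate'].astype(str).str.replace("[%, ]", "", regex=True).astype("float")/100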
    #generate new review metric
    df['new_review_metric'] = df['reviews_per_month'] * df['review_scores_rating']/100
    #drop original review columns
    df = df.drop(review_columns, axis=1)
    
    return df

In [None]:
# Apply data cleaning functions above to clean dataset
clean_df_ldn = clean_dataset_ldn(df_ldn)
clean_df_ams = clean_dataset_ams(df_ams)
clean_df_ldn.drop('state', axis=1, inplace = True)

In [None]:
def element_len(df, colname):
    '''Replace a list-like string column (e.g. amenities) with the number of comma-separated items it holds.'''
    # strip brackets, braces, quotes and spaces, then count the comma-separated entries
    cleaned = df[colname].str.replace(r"[\[\]'\"{} ]", '', regex=True)
    df[colname] = cleaned.str.split(',').str.len()
    return df

def create_dummy_df(df, dummy_na):
    '''
    INPUT:
    df - pandas dataframe with categorical variables you want to dummy
    dummy_na - Bool holding whether you want to dummy NA vals of categorical columns or not
    
    OUTPUT:
    df - a new dataframe that has the following characteristics:
            1. contains all columns that are not listed in cat_cols (defined in the function body)
            2. removes all the original columns in cat_cols
            3. dummy columns for each of the categorical columns in cat_cols
            4. if dummy_na is True - it also contains dummy columns for the NaN values
            5. uses the column name as a prefix, separated from the level by an underscore (_)
    '''
    # Dummy the categorical variables
    cat_cols = ['host_response_time', 'host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 'instant_bookable', 'cancellation_policy']

    for col in cat_cols:
        try:
            # for each cat add dummy var, drop original column
            df = pd.concat([df.drop(col, axis=1), pd.get_dummies(df[col], prefix=col, prefix_sep='_', drop_first=True, dummy_na=dummy_na)], axis=1)
        except KeyError:
            # skip categorical columns that are not present in this dataset
            continue
    return df
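
The encoding step described in the docstring above can be illustrated on a tiny frame. This is only a sketch: the `toy` dataframe and its values are made up for illustration and are not taken from the listings data.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one categorical column with a missing value.
toy = pd.DataFrame({
    'price': [50.0, 80.0, 120.0],
    'host_is_superhost': ['t', 'f', None],
})

# drop_first=True drops the first level ('f'), so only a 't' indicator column remains.
dummies = pd.get_dummies(toy['host_is_superhost'], prefix='host_is_superhost',
                         prefix_sep='_', drop_first=True, dummy_na=False)
encoded = pd.concat([toy.drop('host_is_superhost', axis=1), dummies], axis=1)

# With dummy_na=True an extra indicator column is added for the missing values.
dummies_na = pd.get_dummies(toy['host_is_superhost'], prefix='host_is_superhost',
                            prefix_sep='_', drop_first=True, dummy_na=True)

print(list(encoded.columns))
print(dummies_na.shape)
```

Dropping the first level avoids perfectly collinear dummy columns, which matters for the linear models fitted later in the notebook.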


In [None]:
clean_df_ldn = element_len(clean_df_ldn, 'amenities')
clean_df_ldn = element_len(clean_df_ldn, 'host_verifications')

In [None]:
clean_df_ams = element_len(clean_df_ams, 'amenities')
clean_df_ams = element_len(clean_df_ams, 'host_verifications')

In [None]:
clean_df_ldn = create_dummy_df(clean_df_ldn, dummy_na=False)
clean_df_ams = create_dummy_df(clean_df_ams, dummy_na=False)

In [None]:
for col in clean_df_ams:
    if col not in clean_df_ldn:
        print(col)

In [None]:
# Generate a new behavior_review dataframe for analysis
behavior_review_ldn_cols =  ['host_response_rate',
                        'host_response_time_within a day',
                        'host_response_time_within a few hours',
                        'host_response_time_within an hour',
                        'host_has_profile_pic_t', 
                        'host_identity_verified_t', 
                        'host_is_superhost_t', 
                        'instant_bookable_t', 
                        'cancellation_policy_moderate',
                        'cancellation_policy_strict',
                        'cancellation_policy_super_strict_30',
                        'amenities',
                        'host_verifications',
                        'guests_included', 'extra_people', 'price']

behavior_review_ldn = clean_df_ldn[behavior_review_ldn_cols].copy()

In [None]:
# Generate a new behavior_review dataframe for analysis
behavior_review_ams_cols =  ['host_response_rate',
                        'host_response_time_within a day',
                        'host_response_time_within a few hours',
                        'host_response_time_within an hour',
                        'host_has_profile_pic_t', 
                        'host_identity_verified_t', 
                        'host_is_superhost_t', 
                        'instant_bookable_t',
                        'amenities',
                        'host_verifications','price']

behavior_review_ams = clean_df_ams[behavior_review_ams_cols].copy()

In [None]:


corr = behavior_review_ldn.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool was removed in NumPy 1.24
plt.rcParams['figure.figsize'] = [11, 9]
sns.heatmap(corr, mask=mask, annot = True, fmt='.2f')

In [None]:


corr = behavior_review_ams.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool was removed in NumPy 1.24
plt.rcParams['figure.figsize'] = [11, 9]
sns.heatmap(corr, mask=mask, annot = True, fmt=".2f")

# Machine Learning

I will be performing the following in order to answer the research questions for this project:

- Multiple linear regression
- Random forest
- OLS Regression

## Multiple Linear Regression (Amsterdam)

In [None]:
ams.drop(['name'], axis=1, inplace=True)

In [None]:
# log10 transformation
ams.minimum_nights += 0.000000001
ams['minimum_nights'] = np.log10(ams['minimum_nights'])
ams.number_of_reviews += 0.000000001
ams['number_of_reviews'] = np.log10(ams['number_of_reviews'])
ams.reviews_per_month += 0.000000001
ams['reviews_per_month'] = np.log10(ams['reviews_per_month'])
ams.calculated_host_listings_count += 0.000000001
ams['calculated_host_listings_count'] = np.log10(ams['calculated_host_listings_count'])
ams.availability_365 += 0.000000001
ams['availability_365'] = np.log10(ams['availability_365'])

In [None]:
# Encoding categorical data
ams = pd.get_dummies(ams, columns=['room_type'], drop_first=True)
ams = pd.get_dummies(ams, columns=['neighbourhood'], drop_first=True)

In [None]:
# Filter the dataset for prices more than $300
ams_filtered_high = ams.loc[(ams['price'] > 300)]
# Filter the dataset for prices less than $300 (listings priced at exactly $300 fall in neither subset)
ams_filtered_low = ams.loc[(ams['price'] < 300)]

### Modelling lower price dataset

In [None]:
X = ams_filtered_low.drop('price', axis=1).values
y = ams_filtered_low['price'].values
y = np.log10(y)

In [None]:
# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

# Fitting Multiple Linear Regression to the Training set
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predicting the Test set results
y_pred = lr.predict(X_test)

In [None]:
df = pd.DataFrame({'Actual': np.round(10 ** y_test, 0), 
                   'Predicted': np.round(10 ** y_pred, 0)})
df.head(10)

In [None]:
print('Price mean:', np.round(np.mean(y), 2))  
print('Price std:', np.round(np.std(y), 2))
print('RMSE:', np.round(np.sqrt(metrics.mean_squared_error(y_test, lr.predict(X_test))), 2))
print('R2 score train:', np.round(r2_score(y_train, lr.predict(X_train), multioutput='variance_weighted'), 2))
print('R2 score test:', np.round(r2_score(y_test, lr.predict(X_test), multioutput='variance_weighted'), 2))

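The RMSE is close to 0, which at first suggests high accuracy. Note, however, that it is measured in log10(price) units, so small values are to be expected.

The R2 score is not very close to 1, which suggests the accuracy is not as good as the RMSE alone implies (Malekinezhad et al., 2020).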
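
To make the point about units concrete, here is a minimal sketch with made-up prices (the arrays are hypothetical, not taken from the Amsterdam data). It compares the RMSE computed on the log10 scale with the RMSE computed after converting back to prices, assuming a model that overestimates every listing by 50%.

```python
import numpy as np

# Hypothetical nightly prices, not from the dataset.
actual_price = np.array([40.0, 75.0, 120.0, 200.0, 290.0])
# Assume every prediction is 50% too high.
predicted_price = actual_price * 1.5

# RMSE in log10 space, the scale used by the modelling cells.
log_rmse = np.sqrt(np.mean((np.log10(actual_price) - np.log10(predicted_price)) ** 2))

# RMSE after converting back to the price scale.
price_rmse = np.sqrt(np.mean((actual_price - predicted_price) ** 2))

print('RMSE (log10 units):', round(log_rmse, 2))
print('RMSE (price units):', round(price_rmse, 2))
```

A 50% error on every listing still gives a log-scale RMSE well below 1, so the log-scale RMSE is best read alongside R2 or a price-scale error.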

### Modelling higher price dataset

In [None]:
X = ams_filtered_high.drop('price', axis=1).values
y = ams_filtered_high['price'].values
y = np.log10(y)

In [None]:
# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

# Fitting Multiple Linear Regression to the Training set
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predicting the Test set results
y_pred = lr.predict(X_test)

In [None]:
df = pd.DataFrame({'Actual': np.round(10 ** y_test, 0), 
                   'Predicted': np.round(10 ** y_pred, 0)})
df.head(10)

In [None]:
print('Price mean:', np.round(np.mean(y), 2))  
print('Price std:', np.round(np.std(y), 2))
print('RMSE:', np.round(np.sqrt(metrics.mean_squared_error(y_test, lr.predict(X_test))), 2))
print('R2 score train:', np.round(r2_score(y_train, lr.predict(X_train), multioutput='variance_weighted'), 2))
print('R2 score test:', np.round(r2_score(y_test, lr.predict(X_test), multioutput='variance_weighted'), 2))

The RMSE (in log10 units) is close to 0, which suggests high accuracy, but the R2 score suggests otherwise.

## Multiple Linear Regression (London)

In [None]:
ldn.drop(['name'], axis=1, inplace=True)

In [None]:
# log10 transformation
ldn.minimum_nights += 0.000000001
ldn['minimum_nights'] = np.log10(ldn['minimum_nights'])
ldn.number_of_reviews += 0.000000001
ldn['number_of_reviews'] = np.log10(ldn['number_of_reviews'])
ldn.reviews_per_month += 0.000000001
ldn['reviews_per_month'] = np.log10(ldn['reviews_per_month'])
ldn.calculated_host_listings_count += 0.000000001
ldn['calculated_host_listings_count'] = np.log10(ldn['calculated_host_listings_count'])
ldn.availability_365 += 0.000000001
ldn['availability_365'] = np.log10(ldn['availability_365'])

In [None]:
# Encoding categorical data
ldn = pd.get_dummies(ldn, columns=['room_type'], drop_first=True)
ldn = pd.get_dummies(ldn, columns=['neighbourhood'], drop_first=True)

In [None]:
# Filter the dataset for prices more than $300
ldn_filtered_high = ldn.loc[(ldn['price'] > 300)]
# Filter the dataset for prices less than $300 (listings priced at exactly $300 fall in neither subset)
ldn_filtered_low = ldn.loc[(ldn['price'] < 300)]

### Modelling lower price dataset

In [None]:
X = ldn_filtered_low.drop('price', axis=1).values
y = ldn_filtered_low['price'].values
y = np.log10(y)

In [None]:
# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

# Fitting Multiple Linear Regression to the Training set
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predicting the Test set results
y_pred = lr.predict(X_test)

In [None]:
df = pd.DataFrame({'Actual': np.round(10 ** y_test, 0), 
                   'Predicted': np.round(10 ** y_pred, 0)})
df.head(10)

In [None]:
print('Price mean:', np.round(np.mean(y), 2))  
print('Price std:', np.round(np.std(y), 2))
print('RMSE:', np.round(np.sqrt(metrics.mean_squared_error(y_test, lr.predict(X_test))), 2))
print('R2 score train:', np.round(r2_score(y_train, lr.predict(X_train), multioutput='variance_weighted'), 2))
print('R2 score test:', np.round(r2_score(y_test, lr.predict(X_test), multioutput='variance_weighted'), 2))

The model shows relatively high accuracy.

### Modelling higher price dataset

In [None]:
X = ldn_filtered_high.drop('price', axis=1).values
y = ldn_filtered_high['price'].values
y = np.log10(y)

In [None]:
# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

# Fitting Multiple Linear Regression to the Training set
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predicting the Test set results
y_pred = lr.predict(X_test)

In [None]:
df = pd.DataFrame({'Actual': np.round(10 ** y_test, 0), 
                   'Predicted': np.round(10 ** y_pred, 0)})
df.head(10)

In [None]:
print('Price mean:', np.round(np.mean(y), 2))  
print('Price std:', np.round(np.std(y), 2))
print('RMSE:', np.round(np.sqrt(metrics.mean_squared_error(y_test, lr.predict(X_test))), 2))
print('R2 score train:', np.round(r2_score(y_train, lr.predict(X_train), multioutput='variance_weighted'), 2))
print('R2 score test:', np.round(r2_score(y_test, lr.predict(X_test), multioutput='variance_weighted'), 2))

The model shows poor accuracy.

# Random Forest Regression (Amsterdam)

### Random Forest - lower price dataset

In [None]:
# Split the dataset
X = ams_filtered_low.drop('price', axis=1).values
y = ams_filtered_low['price'].values
y = np.log10(y)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [None]:
rfr = RandomForestRegressor(max_depth=8, n_estimators = 100, random_state = 0)
rfr.fit(X_train, y_train)

# Predicting the Test set results
y_pred = rfr.predict(X_test)

In [None]:
df = pd.DataFrame({'Actual': np.round(10 ** y_test, 0), 
                   'Predicted': np.round(10 ** y_pred, 0)})
df.head(10)

In [None]:
print('Price mean:', np.round(np.mean(y), 2))  
print('Price std:', np.round(np.std(y), 2))
print('RMSE:', np.round(np.sqrt(metrics.mean_squared_error(y_test, rfr.predict(X_test))), 2))
print('R2 score train:', np.round(r2_score(y_train, rfr.predict(X_train), multioutput='variance_weighted'), 2))
print('R2 score test:', np.round(r2_score(y_test, rfr.predict(X_test), multioutput='variance_weighted'), 2))

The model shows good accuracy.

### Random Forest - higher price dataset

In [None]:
# Split the dataset
X = ams_filtered_high.drop('price', axis=1)
y = ams_filtered_high['price']
y = np.log10(y)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [None]:
rfr = RandomForestRegressor(max_depth=8, n_estimators = 100, random_state = 0)
rfr.fit(X_train, y_train)

# Predicting the Test set results
y_pred = rfr.predict(X_test)

In [None]:
df = pd.DataFrame({'Actual': np.round(10 ** y_test, 0), 
                   'Predicted': np.round(10 ** y_pred, 0)})
df.head(10)

In [None]:
print('Price mean:', np.round(np.mean(y), 2))  
print('Price std:', np.round(np.std(y), 2))
print('RMSE:', np.round(np.sqrt(metrics.mean_squared_error(y_test, rfr.predict(X_test))), 2))
print('R2 score train:', np.round(r2_score(y_train, rfr.predict(X_train), multioutput='variance_weighted'), 2))
print('R2 score test:', np.round(r2_score(y_test, rfr.predict(X_test), multioutput='variance_weighted'), 2))

# Random Forest Regression (London)

### Random Forest - lower price dataset

In [None]:
# Split the dataset
X = ldn_filtered_low.drop('price', axis=1).values
y = ldn_filtered_low['price'].values
y = np.log10(y)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [None]:
rfr = RandomForestRegressor(max_depth=8, n_estimators = 100, random_state = 0)
rfr.fit(X_train, y_train)

# Predicting the Test set results
y_pred = rfr.predict(X_test)

In [None]:
df = pd.DataFrame({'Actual': np.round(10 ** y_test, 0), 
                   'Predicted': np.round(10 ** y_pred, 0)})
df.head(10)

In [None]:
print('Price mean:', np.round(np.mean(y), 2))  
print('Price std:', np.round(np.std(y), 2))
print('RMSE:', np.round(np.sqrt(metrics.mean_squared_error(y_test, rfr.predict(X_test))), 2))
print('R2 score train:', np.round(r2_score(y_train, rfr.predict(X_train), multioutput='variance_weighted'), 2))
print('R2 score test:', np.round(r2_score(y_test, rfr.predict(X_test), multioutput='variance_weighted'), 2))

### Random Forest - higher price dataset

In [None]:
# Split the dataset
X = ldn_filtered_high.drop('price', axis=1)
y = ldn_filtered_high['price']
y = np.log10(y)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [None]:
rfr = RandomForestRegressor(max_depth=8, n_estimators = 100, random_state = 0)
rfr.fit(X_train, y_train)

# Predicting the Test set results
y_pred = rfr.predict(X_test)

In [None]:
df = pd.DataFrame({'Actual': np.round(10 ** y_test, 0), 
                   'Predicted': np.round(10 ** y_pred, 0)})
df.head(10)

In [None]:
print('Price mean:', np.round(np.mean(y), 2))  
print('Price std:', np.round(np.std(y), 2))
print('RMSE:', np.round(np.sqrt(metrics.mean_squared_error(y_test, rfr.predict(X_test))), 2))
print('R2 score train:', np.round(r2_score(y_train, rfr.predict(X_train), multioutput='variance_weighted'), 2))
print('R2 score test:', np.round(r2_score(y_test, rfr.predict(X_test), multioutput='variance_weighted'), 2))

In [None]:
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1000, num = 20)]
max_features = [1.0, 'sqrt']  # 1.0 uses all features; 'auto' was removed in scikit-learn 1.3
max_depth = [int(x) for x in np.linspace(10, 150, num = 11)]
min_samples_split = [2, 5, 10, 20]
min_samples_leaf = [1, 2, 4, 10, 20]
bootstrap = [True, False]

parametrs = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
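
The grid above is defined but is not passed to a search in the cells shown here. One possible way to use such a grid is scikit-learn's `RandomizedSearchCV`. The sketch below is an assumption about how that could look, not the method used in this project: `X_demo`, `y_demo` and `demo_grid` are hypothetical stand-ins, and the grid is deliberately tiny so that it runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (hypothetical); in the notebook this would be X_train / y_train.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)

# A deliberately small grid so the sketch finishes in seconds.
demo_grid = {'n_estimators': [20, 50],
             'max_depth': [5, 10],
             'min_samples_leaf': [1, 4]}

# Sample 4 of the 8 possible combinations and score each with 3-fold cross-validation.
search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                            param_distributions=demo_grid,
                            n_iter=4, cv=3,
                            scoring='neg_root_mean_squared_error',
                            random_state=0)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

A randomized search samples only `n_iter` combinations, which keeps the run time manageable when the full grid is as large as the one defined above.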

# Statsmodels OLS (Amsterdam)

In [None]:
#set response column and df to use
response_col = 'price'
df_to_use_ols = ams

# set X matrix and y 
X = df_to_use_ols.drop(response_col, axis=1)
X = sm.add_constant(X)
y = df_to_use_ols[response_col]

# fit and predict
est = sm.OLS(y.astype(float), X.astype(float)).fit()
ypred = est.predict(X)

# evaluate
rmse = np.sqrt(mean_squared_error(y, ypred))
print(rmse)

# show stats summary
est.summary()

# Statsmodels OLS (London)

In [None]:
#set response column and df to use
response_col = 'price'
df_to_use_ols = ldn

# set X matrix and y 
X = df_to_use_ols.drop(response_col, axis=1)
X = sm.add_constant(X)
y = df_to_use_ols[response_col]

# fit and predict
est = sm.OLS(y.astype(float), X.astype(float)).fit()
ypred = est.predict(X)

# evaluate
rmse = np.sqrt(mean_squared_error(y, ypred))
print(rmse)

# show stats summary
est.summary()

# Using Random Forest Regressor for feature extraction

In [None]:
# Amsterdam

In [None]:
listings_ams = pd.read_csv('/Users/georginadangerfield/Downloads/airbnb_amsterdam/listings_details.csv')

In [None]:
listings_ams['price'] = listings_ams['price'].str.replace(',', '', regex=False)
listings_ams['price'] = listings_ams['price'].str.replace('$', '', regex=False)
listings_ams['price'] = listings_ams['price'].astype(float)
listings_ams = listings_ams.loc[(listings_ams.price <= 600) & (listings_ams.price > 0)]

In [None]:
listings_ams.amenities = listings_ams.amenities.str.replace("[{}]", "", regex=True).str.replace('"', "", regex=False)

In [None]:
listings_ams.amenities.head()

In [None]:
count_vectorizer =  CountVectorizer(tokenizer=lambda x: x.split(','))
amenities = count_vectorizer.fit_transform(listings_ams['amenities'])
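# reuse the listings index so the amenity rows stay aligned with the filtered listings when concatenated later
df_amenities_ams = pd.DataFrame(amenities.toarray(), columns=count_vectorizer.get_feature_names_out(),
                                index=listings_ams.index)
df_amenities_ams = df_amenities_ams.drop(columns='')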

In [None]:
columns =  ['host_is_superhost', 'host_identity_verified', 'host_has_profile_pic',
                   'is_location_exact', 'requires_license', 'instant_bookable',
                   'require_guest_profile_picture', 'require_guest_phone_verification']
for c in columns:
    listings_ams[c] = listings_ams[c].replace('f',0,regex=True)
    listings_ams[c] = listings_ams[c].replace('t',1,regex=True)

In [None]:
listings_ams['security_deposit'] = listings_ams['security_deposit'].fillna(value=0)
listings_ams['security_deposit'] = listings_ams['security_deposit'].replace(r'[\$,)]', '', regex=True).astype(float)
listings_ams['cleaning_fee'] = listings_ams['cleaning_fee'].fillna(value=0)
listings_ams['cleaning_fee'] = listings_ams['cleaning_fee'].replace(r'[\$,)]', '', regex=True).astype(float)

In [None]:
listings_new_ams = listings_ams[['host_is_superhost', 'host_identity_verified', 'host_has_profile_pic','is_location_exact', 
                         'requires_license', 'instant_bookable', 'require_guest_profile_picture', 
                         'require_guest_phone_verification', 'security_deposit', 'cleaning_fee', 
                         'host_listings_count', 'host_total_listings_count', 'minimum_nights',
                     'bathrooms', 'bedrooms', 'guests_included', 'number_of_reviews','review_scores_rating', 'price']].copy()

In [None]:
for col in listings_new_ams.columns[listings_new_ams.isnull().any()]:
    print(col)

In [None]:
for col in listings_new_ams.columns[listings_new_ams.isnull().any()]:
    listings_new_ams[col] = listings_new_ams[col].fillna(listings_new_ams[col].median())

In [None]:
for cat_feature in ['zipcode', 'property_type', 'room_type', 'cancellation_policy', 'neighbourhood_cleansed', 'bed_type']:
    listings_new_ams = pd.concat([listings_new_ams, pd.get_dummies(listings_ams[cat_feature])], axis=1)

In [None]:
listings_new_ams = pd.concat([listings_new_ams, df_amenities_ams], axis=1, join='inner')

In [None]:
y = listings_new_ams['price']
x = listings_new_ams.drop('price', axis =1)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state=1)
rf = RandomForestRegressor(n_estimators=500, 
                               criterion='squared_error',  # 'mse' was removed in scikit-learn 1.2
                               random_state=3, 
                               n_jobs=-1)
rf.fit(X_train, y_train)
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)
rmse_rf= (mean_squared_error(y_test,y_test_pred))**(1/2)

print('RMSE test: %.3f' % rmse_rf)
print('R^2 test: %.3f' % (r2_score(y_test, y_test_pred)))

### Feature Importance of Random Forest

In [None]:
coefs_df = pd.DataFrame()

coefs_df['est_int'] = X_train.columns
coefs_df['coefs'] = rf.feature_importances_
coefs_df.sort_values('coefs', ascending=False).head(20)

In [None]:
# London

In [None]:
listings_ldn = pd.read_csv('/Users/georginadangerfield/Downloads/assignment/airbnb_london/listings.csv')

In [None]:
listings_ldn.head()

In [None]:
listings_ldn['price'] = listings_ldn['price'].str.replace(',', '', regex=False)
listings_ldn['price'] = listings_ldn['price'].str.replace('$', '', regex=False)
listings_ldn['price'] = listings_ldn['price'].astype(float)
listings_ldn = listings_ldn.loc[(listings_ldn.price <= 600) & (listings_ldn.price > 0)]

In [None]:
listings_ldn.amenities = listings_ldn.amenities.str.replace("[{}]", "", regex=True).str.replace('"', "", regex=False)

In [None]:
listings_ldn.amenities.head()

In [None]:
count_vectorizer =  CountVectorizer(tokenizer=lambda x: x.split(','))
amenities = count_vectorizer.fit_transform(listings_ldn['amenities'])
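# reuse the listings index so the amenity rows stay aligned with the filtered listings when concatenated later
df_amenities_ldn = pd.DataFrame(amenities.toarray(), columns=count_vectorizer.get_feature_names_out(),
                                index=listings_ldn.index)
df_amenities_ldn = df_amenities_ldn.drop(columns='')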

In [None]:
columns =  ['host_is_superhost', 'host_identity_verified', 'host_has_profile_pic',
                   'is_location_exact', 'requires_license', 'instant_bookable',
                   'require_guest_profile_picture', 'require_guest_phone_verification']
for c in columns:
    listings_ldn[c] = listings_ldn[c].replace('f',0,regex=True)
    listings_ldn[c] = listings_ldn[c].replace('t',1,regex=True)

In [None]:
listings_ldn['security_deposit'] = listings_ldn['security_deposit'].fillna(value=0)
listings_ldn['security_deposit'] = listings_ldn['security_deposit'].replace(r'[\$,)]', '', regex=True).astype(float)
listings_ldn['cleaning_fee'] = listings_ldn['cleaning_fee'].fillna(value=0)
listings_ldn['cleaning_fee'] = listings_ldn['cleaning_fee'].replace(r'[\$,)]', '', regex=True).astype(float)

In [None]:
listings_new_ldn = listings_ldn[['host_is_superhost', 'host_identity_verified', 'host_has_profile_pic','is_location_exact', 
                         'requires_license', 'instant_bookable', 'require_guest_profile_picture', 
                         'require_guest_phone_verification', 'security_deposit', 'cleaning_fee', 
                         'host_listings_count', 'host_total_listings_count', 'minimum_nights',
                     'bathrooms', 'bedrooms', 'guests_included', 'number_of_reviews','review_scores_rating', 'price']].copy()

In [None]:
for col in listings_new_ldn.columns[listings_new_ldn.isnull().any()]:
    print(col)

In [None]:
for col in listings_new_ldn.columns[listings_new_ldn.isnull().any()]:
    listings_new_ldn[col] = listings_new_ldn[col].fillna(listings_new_ldn[col].median())

In [None]:
for cat_feature in ['zipcode', 'property_type', 'room_type', 'cancellation_policy', 'neighbourhood_cleansed', 'bed_type']:
    listings_new_ldn = pd.concat([listings_new_ldn, pd.get_dummies(listings_ldn[cat_feature])], axis=1)

In [None]:
listings_new_ldn = pd.concat([listings_new_ldn, df_amenities_ldn], axis=1, join='inner')

In [None]:
len(listings_new_ldn)

In [None]:
# Take a random sample of listings_new_ldn: on the full London data the model had still not finished
# after nearly 2 hours (the Amsterdam model only took a few minutes). random_state makes the sample reproducible.
listings_new_ldn = listings_new_ldn.sample(n=10000, random_state=0)

In [None]:
len(listings_new_ldn)

In [None]:
y = listings_new_ldn['price']
x = listings_new_ldn.drop('price', axis =1)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state=1)

In [None]:
rf.fit(X_train, y_train)

In [None]:
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)
rmse_rf= (mean_squared_error(y_test,y_test_pred))**(1/2)

In [None]:
print('RMSE test: %.3f' % rmse_rf)
print('R^2 test: %.3f' % (r2_score(y_test, y_test_pred)))

### Feature Importance of Random Forest

In [None]:
coefs_df = pd.DataFrame()

coefs_df['est_int'] = X_train.columns
coefs_df['coefs'] = rf.feature_importances_
coefs_df.sort_values('coefs', ascending=False).head(20)