In [1]:
import pandas as pd
import numpy as np
import scipy
import sklearn
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
from PIL import Image
import nltk
from nltk import word_tokenize
import collections
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words  = set(stopwords.words('english'))
nltk.download('averaged_perceptron_tagger')

# Introduction

[Airbnb](https://www.airbnb.com/) allows property owners to rent out their homes for short periods to travelers from around the world. Because tourists have many options to choose from, property owners would benefit from being able to predict how popular their homes will be before listing them, and from knowing which factors make a property more popular. The purpose of this project is to analyze Airbnb data in order to identify trends and to train models that predict how popular a property may be.

# ETL / Data Wrangling

## Data

The data was accessed via [OpenDataSoft](https://public.opendatasoft.com/explore/dataset/airbnb-ratings/table/?disjunctive.city&disjunctive.neighbourhood_cleansed&sort=number_of_reviews). The original dataset contains 520,440 observations of 94 variables. For easier analysis, we used only data from New York City, New York. Additionally, we dropped several string variables. This subset contains 20,807 observations. Notable variables include:

- Reviews per month - our metric for measuring the popularity of a property
- Number of reviews - the total number of reviews since the property was listed
- Price - the price per night of the unit (not including fees such as cleaning)
- Bedrooms - the number of bedrooms in the unit
- Bathrooms - the number of bathrooms in the unit

For our predictive models, we use Reviews per month as the response variable.

In [4]:
# reading in data from website 
df = pd.read_csv("https://public.opendatasoft.com/explore/dataset/airbnb-ratings/download/?format=csv&disjunctive.city=true&disjunctive.neighbourhood_cleansed=true&refine.city=New+York&timezone=America/Chicago&use_labels_for_header=true&csv_separator=%3B", sep = ';')

## Choosing Variables

Not all of the columns in the dataset are meaningful in predicting reviews per month. The following code selects the columns we want to use from the data frame that was loaded above.

In [6]:
#columns that we want to keep
col_names = ["Description", "Review scores rating", "Review scores value", "Reviews per month", "Review scores location", "Review scores communication", "Review scores checkin", "Review scores cleanliness", "Review scores accuracy", "Host since", "Host response time", "Host response rate", "Host is superhost", "Host listings count", "Host has profile pic", "Host identity verified", "City", "State", "Room type", "Accommodates", "Bathrooms", "Bedrooms", "Beds", "Price", "Weekly price", "Security deposit", "Cleaning fee", "Guests included", "Extra people", "Minimum nights", "Maximum nights", "Availability 30", "Availability 60", "Availability 90", "Availability 365", "Requires license"]
#subsetting data to include only columns that we want
df = df.loc[:, col_names]

In [7]:
#first 5 observations of dataset
df.head(5)

Unnamed: 0,Description,Review scores rating,Review scores value,Reviews per month,Review scores location,Review scores communication,Review scores checkin,Review scores cleanliness,Review scores accuracy,Host since,Host response time,Host response rate,Host is superhost,Host listings count,Host has profile pic,Host identity verified,City,State,Room type,Accommodates,Bathrooms,Bedrooms,Beds,Price,Weekly price,Security deposit,Cleaning fee,Guests included,Extra people,Minimum nights,Maximum nights,Availability 30,Availability 60,Availability 90,Availability 365,Requires license
0,"Professionally cleaned, newly furnished studio...",94.0,9.0,2.09,10.0,10.0,10.0,10.0,10.0,2014-10-02,within an hour,100%,True,1.0,True,True,New York,NY,Entire home/apt,4.0,1.0,0.0,2.0,236.0,625.558337,500.0,150.0,1.0,0.0,1.0,1125.0,0.0,0.0,0.0,0.0,False
1,Our cool charming studio in an elevator bld is...,97.0,9.0,3.84,10.0,10.0,10.0,9.0,10.0,2012-06-10,within an hour,100%,True,1.0,True,True,New York,NY,Entire home/apt,4.0,1.0,1.0,1.0,120.0,625.558337,200.0,150.0,3.0,15.0,1.0,1125.0,1.0,10.0,21.0,295.0,False
2,Cute 2 bedroom in the heart of the East Villag...,85.0,9.0,2.79,10.0,9.0,9.0,9.0,9.0,2015-11-06,within an hour,93%,False,3.0,True,True,New York,NY,Entire home/apt,6.0,1.0,2.0,3.0,219.0,625.558337,243.460718,100.0,1.0,0.0,1.0,1125.0,3.0,15.0,32.0,286.0,False
3,A serene space to rest and recharge while you'...,100.0,10.0,1.2,10.0,10.0,10.0,10.0,10.0,2013-01-23,within a few hours,100%,False,1.0,True,True,New York,NY,Private room,1.0,1.0,1.0,1.0,225.0,625.558337,0.0,25.0,1.0,0.0,1.0,4.0,0.0,7.0,7.0,7.0,False
4,"Beautiful Apartment, 1st Flr, BrownStone build...",90.0,9.0,4.0,9.0,9.0,10.0,9.0,9.0,2015-10-21,within an hour,100%,False,1.0,True,True,New York,NY,Entire home/apt,6.0,1.5,2.0,3.0,325.0,625.558337,243.460718,150.0,4.0,25.0,2.0,1125.0,9.0,22.0,36.0,292.0,False


## Missing Values

The dataset contains NaN values that are unusable. The table below shows the percentage of missing entries for each column of the data frame.

In [9]:
# Show percentage of missing values in each column
percent_missing = df.isnull().sum().to_frame('Percent missing') * 100 / len(df)
percent_missing.sort_values(by=['Percent missing'], inplace = True, ascending = False)
pd.DataFrame(percent_missing)

| Column | Percent missing |
| --- | --- |
| Weekly price | 89.825539 |
| Security deposit | 49.531408 |
| Host response rate | 32.666891 |
| Host response time | 32.666891 |
| Cleaning fee | 27.159129 |
| Review scores location | 23.477676 |
| Review scores value | 23.468064 |
| Review scores checkin | 23.424809 |
| Review scores accuracy | 23.246984 |
| Review scores communication | 23.222954 |


In order to address the issue of missing values in the data, we make use of the SimpleImputer class from the scikit-learn library. For numerical variables, we replace missing values with the mean of the known values. For categorical variables, we replace the missing values with the most frequent value among the known values.

In [11]:
from sklearn.impute import SimpleImputer
# replace missing values with mean of that column
imp_mean = SimpleImputer(missing_values = np.nan, strategy = "mean")
# replace missing values with mode of that column
imp_freq = SimpleImputer(missing_values = np.nan, strategy = "most_frequent")
# list of categorical features
cat_cols = ["Host since", "Host response time", "Host response rate", "Host is superhost", "Host has profile pic", "Host identity verified", "City", "State", "Room type", "Requires license"]
# list of numeric features
num_cols = ["Review scores rating", "Review scores value", "Reviews per month", "Review scores location", "Review scores communication", "Review scores checkin", "Review scores cleanliness", "Review scores accuracy", "Host listings count", "Accommodates", "Bathrooms", "Bedrooms", "Beds", "Price", "Weekly price", "Security deposit", "Cleaning fee", "Guests included", "Extra people", "Minimum nights", "Maximum nights", "Availability 30", "Availability 60", "Availability 90", "Availability 365"]

# Replacing missing values in categorical columns with most frequent values
imp_freq.fit(df.loc[:, cat_cols])
df.loc[:, cat_cols] = imp_freq.transform(df.loc[:, cat_cols])

# Replacing missing values in numerical columns with mean values
imp_mean.fit(df.loc[:, num_cols])
df.loc[:, num_cols] = imp_mean.transform(df.loc[:, num_cols])
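
To make the two imputation strategies concrete, here is a minimal sketch on a tiny, made-up data frame. The `toy` frame and its values are hypothetical and are not taken from the Airbnb data; only `SimpleImputer` and its `strategy` argument come from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, not taken from the Airbnb dataset
toy = pd.DataFrame({
    "Price": [100.0, np.nan, 200.0],
    "Room type": pd.Series(["Private room", "Private room", np.nan], dtype=object),
})

mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# Numeric column: the missing price becomes the mean of the known prices
toy[["Price"]] = mean_imputer.fit_transform(toy[["Price"]])
# Categorical column: the missing room type becomes the most frequent known value
toy[["Room type"]] = mode_imputer.fit_transform(toy[["Room type"]])

print(toy)
```

Note that mean imputation places every missing value at the same point, so heavily imputed columns (such as Weekly price) carry little information after this step.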

## Data Visualizations

In [13]:
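tokenizer = nltk.RegexpTokenizer(r'\w+')

def get_adj_and_noun(row):
  """Return the nouns and bigrams found in one listing description."""
  # lowercase before filtering so capitalized stop words (e.g. "The") are removed too
  tokens = [t.lower() for t in tokenizer.tokenize(row)]
  filtered = [t for t in tokens if t not in stop_words]
  tagged = nltk.pos_tag(filtered)
  bigrams = list(nltk.bigrams(filtered))
  # keep the tokens tagged as nouns (singular, plural, and proper)
  nouns = [word for word, tag in tagged if tag in ['NN', 'NNS', 'NNP', 'NNPS']]
  return {'nns': nouns, 'bigrams': bigrams}

# descriptions may be missing, so substitute an empty string before tokenizing
m = [get_adj_and_noun(s) for s in df['Description'].fillna('')]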

In [14]:
frequent_nnps = collections.Counter([w for d in m for w in d['nns']]).most_common(20)
plt.figure(figsize=(10,10))
plt.bar(range(len(frequent_nnps)), [x[1] for x in frequent_nnps], tick_label=[x[0] for x in frequent_nnps])
plt.xticks(rotation=90)
plt.title('Top 20 frequent nouns used by hosts in Description')
# plt.show() returns None, so it is called directly rather than wrapped in display()
plt.show()

In [15]:
frequent_bigram = collections.Counter([w for d in m for w in d['bigrams']]).most_common(30)
plt.figure(figsize=(15,15))
# each entry of frequent_bigram is ((first_word, second_word), count)
plt.bar(range(len(frequent_bigram)), [count for bigram, count in frequent_bigram], tick_label=['{} {}'.format(*bigram) for bigram, count in frequent_bigram])
plt.xticks(rotation=90)
plt.title('Top 30 frequent bigrams used by hosts in Description')
plt.show()
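
The bigram counting above can be illustrated without NLTK. The following is a dependency-free sketch: the stop-word set `STOP`, the two sample descriptions, and the helper `bigrams` are all hypothetical stand-ins chosen for illustration. As in the notebook, stop words are removed before pairing, so a bigram may join two words that were not adjacent in the original sentence.

```python
import re
from collections import Counter

# Hypothetical stop-word list and descriptions, for illustration only
STOP = {"a", "the", "is", "to", "from", "in", "and"}
descriptions = [
    "The living room is a 5 minutes walk from the park",
    "Cozy living room and a short minutes walk to the subway",
]

def bigrams(text):
    """Lowercase, drop stop words, then pair each token with its successor."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    tokens = [t for t in tokens if t not in STOP]
    return list(zip(tokens, tokens[1:]))

# Count bigrams across all descriptions and keep the two most frequent
counts = Counter(bg for d in descriptions for bg in bigrams(d))
top = counts.most_common(2)
print(top)
```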

In [16]:
# Replacing missing values with string "NA"
df["Description"] = df["Description"].fillna("NA")
# joining all descriptions into a single string
text = " ".join(description for description in df.Description)
# adding "NA" (which indicates missing value) into list of stopwords
# (named wc_stopwords so it does not shadow nltk.corpus.stopwords imported above)
wc_stopwords = ['NA'] + list(STOPWORDS)
# generate wordcloud of descriptions
wordcloud = WordCloud(stopwords=wc_stopwords, background_color="white").generate(text)
fig = plt.figure(figsize = (8,8))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
display(fig)

In [17]:
fig, ax = plt.subplots(figsize=(6,5))
# draw the histogram on an explicit axes, then label it
df['Review scores rating'].plot(kind='hist', bins=50, ax=ax)
ax.set_xlabel("Review scores rating")
ax.set_title("Distribution of Ratings")
display(fig)

Most user ratings fall at the high end of the scale, which is consistent with Airbnb being a popular option in New York.

In [19]:
#graph on price vs number of reviews grouped by room types 
fig, ax = plt.subplots()
# set figure size
fig.set_figheight(10)
fig.set_figwidth(10)
# color map for groups
colors = {'Entire home/apt':'red', 'Private room':'blue', 'Shared room':'green'}
# generate scatterplot
grouped = df.groupby('Room type')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='Price', y='Reviews per month', label=key, color=colors[key])
plt.title("Relationship between Price and Reviews")
plt.xlabel("Price")
plt.ylabel("Number of Reviews Per Month")
display(fig)

In [20]:
# Heat map to show correlation between features
import seaborn as sns
fig = plt.figure(figsize=(15,15))
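# restrict to the numeric columns; recent pandas versions raise an error if corr() meets string columns
display(sns.heatmap(df[num_cols].corr()))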
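
A heatmap makes strong correlations visible, but it can be hard to read off exact pairs. As a hedged sketch, the snippet below lists the pairs whose absolute correlation exceeds a threshold. The `toy` frame is synthetic (it only borrows column names from the dataset), and the 0.9 cutoff is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a few numeric columns; the values are made up
rng = np.random.default_rng(0)
base = rng.integers(0, 30, size=200).astype(float)
toy = pd.DataFrame({
    "Availability 30": base,
    "Availability 60": base * 2 + rng.normal(0, 1, size=200),
    "Price": rng.normal(150, 40, size=200),
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is skipped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
# 0.9 is an arbitrary illustrative threshold
high = pairs[pairs > 0.9]
print(high)
```

On the real data, the same steps could be applied to `df[num_cols]` in place of `toy`.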

# Modeling

Two modeling techniques were explored and validated in order to predict the number of reviews a property receives per month. To detect redundant features, we apply Recursive Feature Elimination (RFE) to a Linear Regression model. We then train a Random Forest Regression model on the feature set that excludes these redundant features.

In [22]:
from sklearn.model_selection import train_test_split

feature_names = ["Review scores rating", "Review scores value", "Review scores location", "Review scores communication", "Review scores checkin", "Review scores cleanliness", "Review scores accuracy", "Host is superhost", "Host listings count", "Host has profile pic", "Host identity verified", "Accommodates", "Bathrooms", "Bedrooms", "Beds", "Price", "Weekly price", "Security deposit", "Cleaning fee", "Guests included", "Extra people", "Minimum nights", "Maximum nights","Availability 30", "Availability 60", "Availability 90", "Availability 365"]

X = df[feature_names]
y = df['Reviews per month']
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.25,random_state=42)
# convert the targets to plain NumPy arrays (Series.ravel() is deprecated in recent pandas)
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()
print('Training dataset shape:', X_train.shape, y_train.shape)
print('Testing dataset shape:', X_test.shape, y_test.shape)

In [23]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

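# candidate numbers of features to keep (1 through 26)
nof_list = np.arange(1, 27)
high_score = 0
# variable to store the optimum number of features
nof = 0
score_list = []
# the split does not depend on the number of features, so it is done once outside the loop
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
for n_features in nof_list:
    model = LinearRegression()
    # n_features_to_select must be passed by keyword in current scikit-learn releases
    rfe = RFE(model, n_features_to_select = int(n_features))
    X_train_rfe = rfe.fit_transform(X_train, y_train)
    X_test_rfe = rfe.transform(X_test)
    # refit on the reduced feature set and record the R^2 score on the held-out data
    model.fit(X_train_rfe, y_train)
    score = model.score(X_test_rfe, y_test)
    score_list.append(score)
    if score > high_score:
        high_score = score
        nof = n_features
print("Optimum number of features: %d" % nof)
print("Score with %d features: %f" % (nof, high_score))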

We notice that the highly correlated features from the earlier heatmap (Availability 30, 60, and 90) are dropped by RFE with Linear Regression. This makes sense: highly correlated predictors carry largely redundant information, so keeping all of them adds little to the prediction of the target variable.

We now feed the features selected by RFE into a Random Forest Regressor model.
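
To see which features RFE keeps, the fitted selector exposes a boolean mask in its `support_` attribute. The sketch below uses synthetic data with hypothetical feature names (`signal_a`, `signal_b`, `noise_col`) rather than the Airbnb columns. With a linear model, RFE ranks features by the magnitude of their coefficients, so the result depends on how the features are scaled.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: two informative features and one pure-noise feature
rng = np.random.default_rng(0)
n = 300
signal_a = rng.normal(size=n)
signal_b = rng.normal(size=n)
noise_col = rng.normal(size=n)
X_demo = np.column_stack([signal_a, signal_b, noise_col])
y_demo = 3.0 * signal_a - 2.0 * signal_b + rng.normal(scale=0.1, size=n)
names = np.array(["signal_a", "signal_b", "noise_col"])

# Keep the two features with the largest coefficient magnitudes
rfe_demo = RFE(LinearRegression(), n_features_to_select=2)
rfe_demo.fit(X_demo, y_demo)

kept = names[rfe_demo.support_].tolist()
dropped = names[~rfe_demo.support_].tolist()
print("kept:", kept)
print("dropped:", dropped)
```

In the notebook, the same mask could be applied to `X_train.columns` after fitting `rfe` to list the retained column names.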

In [25]:
selected_features = ["Review scores rating", "Review scores value", "Review scores location", "Review scores communication", "Review scores checkin", "Review scores cleanliness", "Review scores accuracy", "Host is superhost", "Host listings count", "Host has profile pic", "Host identity verified", "Accommodates", "Bathrooms", "Bedrooms", "Beds", "Price", "Weekly price", "Security deposit", "Cleaning fee", "Guests included", "Extra people", "Minimum nights"]
X = df[selected_features]
y = df['Reviews per month']
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.25,random_state=42)
# convert the targets to plain NumPy arrays (Series.ravel() is deprecated in recent pandas)
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()
print('Training dataset shape:', X_train.shape, y_train.shape)
print('Testing dataset shape:', X_test.shape, y_test.shape)

In [26]:
from sklearn.ensemble import RandomForestRegressor
# Build RF regressor
clf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
clf.fit(X_train, y_train)

# Use the forest's predict method on the test data
predictions = clf.predict(X_test)

# Print out the Root Mean Square Error
print('RMSE :', np.sqrt(((predictions - y_test) ** 2).mean()))

## Evaluation

Our model was evaluated using Root Mean Squared Error (RMSE).

The RMSE of the Random Forest model on the test set is approximately 1.16 reviews per month, which we consider a low error relative to the range of the response variable. This suggests that the model performs well with the features we selected.
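
One way to judge whether an RMSE is low is to compare it with a naive baseline that always predicts the training mean. The following is a minimal sketch with made-up numbers; the `*_demo` lists and the `rmse` helper are hypothetical and do not reproduce the notebook's results.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values, for illustration only
y_train_demo = [1.0, 2.0, 3.0, 6.0]
y_test_demo = [1.0, 3.0, 5.0]
model_preds_demo = [1.5, 2.5, 4.0]

# Baseline: predict the training mean for every test observation
train_mean = sum(y_train_demo) / len(y_train_demo)
baseline_preds = [train_mean] * len(y_test_demo)

baseline_rmse = rmse(y_test_demo, baseline_preds)
model_rmse = rmse(y_test_demo, model_preds_demo)
print("baseline RMSE:", baseline_rmse)
print("model RMSE:", model_rmse)
```

A model is only informative to the extent that its RMSE is clearly below the baseline's. In the notebook, the same comparison could use `y_train.mean()` as the constant prediction for `y_test`.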

In [28]:
# Plot of predicted vs actual values for Random Forest Regressor
fig, ax = plt.subplots()
fig.set_figheight(10)
fig.set_figwidth(10)
plt.title("Predicted vs Actual for Random Forest Model")
plt.scatter(predictions, y_test)
plt.plot(range(16), range(16),c = "black")
plt.xlabel("Predicted")
plt.ylabel("Actual")
display(fig)

# Discussion

We used natural language processing to tokenize each Description and to perform part-of-speech tagging on the tokens. We then identified the most frequent nouns to see what hosts tend to include in their descriptions to make their listings popular. We also extracted the 30 most frequent bigrams from the descriptions. The bar charts show that hosts tend to mention what the apartment offers, such as "living room", "size bed", and "ac". They also mention attractions, such as Times Square, and describe how close the listing is to them with phrases such as "minutes walk" and "blocks away". This makes sense because New York is one of the biggest tourist cities, and tourists like to stay close to attractions.

We performed feature selection using Recursive Feature Elimination with a Linear Regression model. Using the selected features, we built a Random Forest model to predict reviews per month. The Random Forest model has a low test RMSE (about 1.16), which suggests that it fits the data well.

Using average reviews per month as our popularity metric may be flawed. People may be more likely to leave reviews when they have grievances. Cheaper homes may receive more reviews simply because of complaints, and more expensive homes might not live up to expectations. If there were a variable such as 'Bookings per month', it would be a much better response variable to predict.

# Appendix

## Data Dictionary

* Reviews per month - Average reviews per month, our metric for popularity.
* Number of reviews - Total number of reviews since the property was listed.
* Host since - Date the host registered with Airbnb.
* Host response time - Categorical value for the average time it takes the host to respond to messages.
* Host response rate - What percent of messages the host responds to.
* Host is superhost - Logical value whether or not the host is a 'superhost'.
* Host listings count - The number of listings this host has.
* Host has profile pic - Logical value whether or not the host has a profile picture.
* Host identity verified - Logical value whether or not the host has verified their identity.
* City - The city of the listing.
* State - The State of the listing.
* Room type - The type of listing: 'Entire home/apt', 'Private room', 'Shared room'.
* Accommodates - The number of guests the listing accommodates.
* Bathrooms - The number of bathrooms in the listing.
* Bedrooms - The number of bedrooms in the listing.
* Beds - The number of beds in the listing (could be multiple beds per room).
* Price - The nightly price of the listing.
* Weekly price - The weekly price of the listing.
* Security deposit - The amount for the required security deposit.
* Cleaning fee - The cleaning fee.
* Guests included - The number of guests covered by the listed price.
* Extra people - The fee charged for each additional guest beyond those included.
* Minimum nights - The minimum number of nights required to rent the listed property.
* Maximum nights - The maximum number of nights allowed when renting the listed property.
* Availability 30 - Number of days available in a 30 day period.
* Availability 60 - Number of days available in a 60 day period.
* Availability 90 - Number of days available in a 90 day period.
* Availability 365 - Number of days available in a 365 day period.
* Requires license - Logical value whether or not a license is required to rent the listed property.
* Review scores rating - The average rating given for the overall experience.
* Review scores value - The average rating given for the financial value of the property.
* Review scores accuracy - The average rating given for the accuracy of the listed amenities and price.
* Review scores location - The average rating given for the location of the property.
* Review scores communication - The average rating given for the communication with the host.
* Review scores checkin - The average rating given for the check-in process.
* Review scores cleanliness - The average rating given for the cleanliness of the property.