# Predicting Average Ratings of Edinburgh Airbnbs through Review Texts Analysis

## Notebook 1: Data Cleaning and EDA

In this notebook, we first outline the objectives of the project and then begin the foundational steps of data cleaning and basic exploratory data analysis.

## Introduction

#### Problem Statement

This analysis focuses on improving the hosting experience on Airbnb, particularly in the city of Edinburgh. As Airbnb continues to gain popularity as an accommodation choice for travelers, delivering high-quality experiences becomes increasingly important for hosts.

We aim to uncover valuable insights by analyzing review texts. Our primary goal is to predict average star ratings based on these reviews, providing hosts with actionable feedback on areas of improvement. By extracting key features from reviews, such as cleanliness, communication, and amenities, we can offer guidance to both new hosts seeking to establish themselves and experienced hosts looking to enhance their offerings. 

Ultimately, this analysis aims to empower hosts by providing useful insights and actionable recommendations to enhance their performance and maximize their ratings. Furthermore, we aim for these insights to not only benefit hosts in Edinburgh but also serve as valuable guidance for hosts in other cities facing similar challenges in the competitive Airbnb landscape.

#### Data Collection

The data used in this project consists of two datasets downloaded from [Inside Airbnb](http://insideairbnb.com/get-the-data). The first dataset contains information about Edinburgh Airbnb Listings, and the second dataset contains the reviews of these listings.

The data dictionary can be downloaded [here](https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit#gid=1322284596).

## Data import and data cleaning

#### Import libraries

In [1]:
# Import libraries

# Main libraries
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

#### Data Dictionaries

In [2]:
# Uncomment the lines below to display all rows and the full text of each cell in a dataframe.

#pd.set_option('display.max_rows', None)
#pd.set_option('display.max_colwidth', None)

In [3]:
# Import raw data dictionary csv
df_dictionary = pd.read_csv('data/Data Dictionary.csv')

# Drop the first 6 unnecessary rows
df_dictionary_clean=df_dictionary.drop(range(6), axis=0)

# Reset index
df_dictionary_clean.reset_index(drop=True, inplace=True)

# Use the first row as column names
df_dictionary_clean.columns=list(df_dictionary_clean.loc[0])

# Remove the first row
df_dictionary_clean.drop(0, inplace=True, axis=0)

# Reset index
df_dictionary_clean.reset_index(inplace=True, drop = True)

# Drop the 'Calculated' and 'Reference' columns
df_dictionary_clean.drop(['Calculated', 'Reference'], axis=1, inplace=True)

# Show the dataframe
df_dictionary_clean

Unnamed: 0,Field,Type,Description
0,id,integer,Airbnb's unique identifier for the listing
1,listing_url,text,
2,scrape_id,bigint,"Inside Airbnb ""Scrape"" this was part of"
3,last_scraped,datetime,"UTC. The date and time this listing was ""scrap..."
4,source,text,"One of ""neighbourhood search"" or ""previous scr..."
...,...,...,...
74,reviews_per_month,numeric,The number of reviews the listing has over the...
75,,,
76,Change control,,
77,Field,Change,
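
The same clean-up can also be done while reading the file. The following is a minimal sketch, assuming the real header row is preceded by a fixed number of preamble lines. The inline CSV text and the `n_preamble` value are made up for illustration and do not reproduce the actual data dictionary file.

```python
import io
import pandas as pd

# Hypothetical stand-in for 'data/Data Dictionary.csv': a few preamble
# lines, then the real header row, then the field definitions.
raw_csv = """Inside Airbnb Data Dictionary,,,,
Preamble line,,,,
Field,Type,Calculated,Description,Reference
id,integer,,Airbnb's unique identifier for the listing,
listing_url,text,y,,
"""

n_preamble = 2  # assumption: number of lines before the real header row

# skiprows discards the preamble, so the next line is used as the header
df_dict = pd.read_csv(io.StringIO(raw_csv), skiprows=n_preamble)

# Drop the columns that are not needed, as in the cell above
df_dict = df_dict.drop(columns=['Calculated', 'Reference'])
print(df_dict)
```

With the real file, `n_preamble` would have to be checked against the raw CSV, and any trailing non-field rows (such as the change-control rows visible in the output above) would still need to be removed separately.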


#### Import datasets

In [4]:
# Import datasets
df_listings= pd.read_csv('data/listings.csv')
df_reviews=pd.read_csv('data/reviews.csv')

#### Check the dimensions of the datasets

In [5]:
# Show the first row of the listing dataframe
# As there are too many columns to display side by side, we transpose the row to show all fields
df_listings.head(1).T

Unnamed: 0,0
id,15420
listing_url,https://www.airbnb.com/rooms/15420
scrape_id,20231217045056
last_scraped,2023-12-17
source,city scrape
...,...
calculated_host_listings_count,1
calculated_host_listings_count_entire_homes,1
calculated_host_listings_count_private_rooms,0
calculated_host_listings_count_shared_rooms,0


In [6]:
# Show the shape of the listing dataframe
df_listings.shape

(7049, 75)

In [7]:
# Show the first five rows of the reviews dataframe
df_reviews.head(5)

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,15420,171793,2011-01-18,186358,Nels,My wife and I stayed at this beautiful apartme...
1,15420,176350,2011-01-31,95218,Gareth,Charlotte couldn't have been a more thoughtful...
2,15420,232149,2011-04-19,429751,Guido,I went to Edinburgh for the second time on Apr...
3,15420,236073,2011-04-23,420830,Mariah,This flat was incredible. As other guests have...
4,15420,263713,2011-05-15,203827,Linda,Fantastic host and the apartment was perfect. ...


In [8]:
# Show the shape of the reviews dataframe
df_reviews.shape

(535577, 6)

In [9]:
# Show the columns of the listings dataframe
df_listings.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
       'description', 'neighborhood_overview', 'picture_url', 'host_id',
       'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'ca

There are a lot of features in the listings data. Our next step is to narrow down the feature set. We first remove the URL and scraping-related columns, then remove the columns that contain no values.

In [10]:
# Remove url and scrape related columns
ls = [i for i in df_listings.columns if ('url' in i) or ('scrape' in i)]
df_listings_clean = df_listings.drop(ls, axis=1)

# Remove the 'source' column, which also describes the scrape
df_listings_clean.drop('source', axis=1, inplace=True)

In [11]:
# Remove columns that contain no values
for i in df_listings_clean.columns:
    if df_listings_clean.isnull().sum()[i] == df_listings_clean.shape[0]:
        df_listings_clean.drop(i, axis=1, inplace=True)    
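
The two clean-up steps above can be tried out on a small frame first. This is only a sketch: the toy values are made up, and `dropna(axis=1, how='all')` is shown as a more compact alternative to the loop, not as the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mimic the listings data
toy = pd.DataFrame({
    'id': [1, 2],
    'listing_url': ['a', 'b'],
    'last_scraped': ['2023-12-17', '2023-12-17'],
    'bathrooms': [np.nan, np.nan],   # completely empty column
    'beds': [1.0, np.nan],           # partly empty column, should be kept
})

# Step 1: drop columns whose name contains 'url' or 'scrape'
to_drop = [c for c in toy.columns if 'url' in c or 'scrape' in c]
toy_clean = toy.drop(columns=to_drop)

# Step 2: drop columns in which every value is null
toy_clean = toy_clean.dropna(axis=1, how='all')

print(list(toy_clean.columns))  # ['id', 'beds']
```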

In [12]:
df_listings_clean.columns

Index(['id', 'name', 'neighborhood_overview', 'host_id', 'host_name',
       'host_since', 'host_location', 'host_about', 'host_response_time',
       'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'latitude', 'longitude', 'property_type',
       'room_type', 'accommodates', 'bathrooms_text', 'beds', 'amenities',
       'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'has_availability', 'availability_30',
       'availability_60', 'availability_90', 'availability_365',
       'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
       'first_review', 'last_review'

##### Host columns

We can look through the columns by type and remove the irrelevant ones. First, we check and deal with the host-related columns.

In [13]:
# Create a list which has host related column names
host_columns=[]
for i in df_listings_clean.columns:
    if 'host' in i:
        host_columns.append(i)

In [14]:
# Show the host related column names
print(host_columns)

['host_id', 'host_name', 'host_since', 'host_location', 'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_neighbourhood', 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'calculated_host_listings_count', 'calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms']


We will remove the 'host_name' column, as names are irrelevant to our analysis and contain many duplicate values. The 'host_neighbourhood' column is dropped because it duplicates information in the 'host_location' column. We will also remove 'host_total_listings_count' and 'host_listings_count': the check in the next cell confirms that 'calculated_host_listings_count' matches the number of listings per host in this dataset.

In [15]:
# Create a series that contains the number of listings per host
host_listing_num = df_listings_clean.groupby('host_id')['id'].count()

# Check if the number of listings per host matches the 'calculated_host_listings_count' column
for host_id, n_listings in host_listing_num.items():
    counts = df_listings_clean.loc[df_listings_clean['host_id'] == host_id, 'calculated_host_listings_count']
    # Flag the host if any of its rows disagrees with the recomputed count
    if (counts != n_listings).any():
        print(host_id, 'This one is wrong')
# Nothing is printed, so the column is good.
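
For larger data, the same consistency check can be written without a Python loop. The sketch below uses made-up toy values and `groupby(...).transform('count')` to broadcast the per-host count back to every row. It is an alternative formulation, not the check that was run above.

```python
import pandas as pd

# Toy data with made-up values: host 10 has two listings, host 20 has one.
# The last row deliberately carries a wrong count.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'host_id': [10, 10, 20],
    'calculated_host_listings_count': [2, 2, 5],
})

# Number of listings per host, broadcast back to every row
listings_per_host = toy.groupby('host_id')['id'].transform('count')

# Rows where the provided column disagrees with the recomputed count
mismatches = toy[toy['calculated_host_listings_count'] != listings_per_host]
print(mismatches['host_id'].tolist())  # [20]
```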

In [16]:
# Remove 'host_name', 'host_neighbourhood', 'host_total_listings_count' and 'host_listings_count' columns
df_listings_clean.drop(['host_name','host_neighbourhood', 'host_total_listings_count', 'host_listings_count'],
                       inplace=True, axis=1)

##### Other columns

Then we will look through the columns and remove other irrelevant columns.

In [17]:
# Remove the 'amenities' column
df_listings_clean.drop('amenities', axis=1, inplace=True)

We arrive at a dataframe containing potentially useful columns.

Let's check for duplicate rows and for rows that contain null values in the current dataframe.

In [18]:
# Check duplicates
df_listings_clean[df_listings_clean.duplicated()]

Unnamed: 0,id,name,neighborhood_overview,host_id,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month


Our next step is to investigate whether each listing has enough reviews for us to perform review text analysis. We will then remove the listings that have no reviews.

In [19]:
# Check if there are any null values in the reviews dataframe
df_reviews.isnull().sum()

listing_id        0
id                0
date              0
reviewer_id       0
reviewer_name     0
comments         46
dtype: int64

In [20]:
# Create a dataframe that contains the reviews with null comments
null_reviews = df_reviews[df_reviews['comments'].isnull()]

In [21]:
# Check if the listings in the null_reviews dataframe have no reviews at all in the entire dataframe
df_reviews.groupby('listing_id').count().loc[null_reviews['listing_id']]

Unnamed: 0_level_0,id,date,reviewer_id,reviewer_name,comments
listing_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1012100,233,233,233,233,232
2220535,329,329,329,329,328
3088183,360,360,360,360,359
4915970,924,924,924,924,923
5559720,664,664,664,664,663
4358584,160,160,160,160,159
4362294,906,906,906,906,905
6517915,410,410,410,410,409
6857638,375,375,375,375,374
10142027,212,212,212,212,211


In [22]:
# Remove these rows in the reviews dataframe
df_reviews_clean=df_reviews.drop(null_reviews.index, axis=0)

In [23]:
# Reset index
df_reviews_clean=df_reviews_clean.reset_index(drop=True)

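In the output above, each listing with a null comment still has other reviews, so dropping the null rows does not remove any of these listings from the reviews data. Therefore, we can use all the unique listing ids in the reviews dataframe to filter our listings dataframe.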

In [24]:
# Extract listing ids that have reviews in the review dataframe
listing_has_reviews= df_reviews_clean.groupby('listing_id').count().index

In [25]:
# Select 'id' column as dataframe index
df_listings_clean.set_index('id', inplace=True)

In [26]:
# Remove rows that have ids not in the generated list and reset index.
df_listings_clean=df_listings_clean.loc[listing_has_reviews].reset_index()
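
Selecting by label with `.loc` assumes that every reviewed listing id is present in the listings dataframe. If that cannot be guaranteed, a boolean mask built with `isin` is a more forgiving option. The sketch below uses made-up toy ids and is an alternative, not the approach taken in this notebook.

```python
import pandas as pd

# Toy data with made-up ids: listing 3 has no reviews,
# and review listing_id 4 has no matching listing.
toy_listings = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
toy_reviews = pd.DataFrame({'listing_id': [1, 1, 2, 4]})

# Unique ids of listings that have at least one review
ids_with_reviews = toy_reviews['listing_id'].unique()

# The boolean mask keeps only the reviewed listings; unlike .loc with a
# list of labels, isin does not raise when an id is absent from the listings
toy_filtered = toy_listings[toy_listings['id'].isin(ids_with_reviews)].reset_index(drop=True)
print(toy_filtered['id'].tolist())  # [1, 2]
```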

In [27]:
# Create temporary series object that contains the number of null values for each column in the listing dataframe
temp_null_series=df_listings_clean.isnull().sum()

# Use a boolean mask to show only the columns that have null values
temp_null_series[temp_null_series != 0]

neighborhood_overview    1857
host_location            1038
host_about               2671
host_response_time       1617
host_response_rate       1617
host_acceptance_rate      705
host_is_superhost          32
neighbourhood            1857
bathrooms_text              6
beds                       55
price                    1370
has_availability         1370
license                  5746
dtype: int64

We need to deal with these columns one by one.

The 'neighborhood_overview' column describes the neighborhood of the listing, the 'host_location' column contains the location of the host, and the 'host_about' column holds the host's introduction. As these columns contain text, we can fill their null values with the string 'None'.

In [28]:
# Fill null values in the 'neighborhood_overview', 'host_location' and 'host_about' columns with 'None'
df_listings_clean[['neighborhood_overview', 'host_location', 'host_about']] = df_listings_clean[['neighborhood_overview', 'host_location', 'host_about']].fillna(value='None')

df_listings_clean.isnull().sum()[['neighborhood_overview','host_location','host_about']]

neighborhood_overview    0
host_location            0
host_about               0
dtype: int64

In [29]:
# Check the values in the 'host_response_time' column
df_listings_clean['host_response_time'].value_counts()

host_response_time
within an hour        3787
within a few hours     621
within a day           311
a few days or more      88
Name: count, dtype: int64

In [30]:
# Check the values in the 'host_response_rate' column
df_listings_clean['host_response_rate'].value_counts()

host_response_rate
100%    3851
99%      278
90%      146
98%       80
0%        50
80%       34
96%       28
50%       23
93%       21
97%       21
94%       20
92%       18
83%       17
67%       17
60%       16
86%       16
75%       13
89%       12
71%       12
88%       12
70%       12
81%       12
57%       11
84%       11
95%       11
25%        7
91%        7
20%        7
82%        6
63%        6
29%        5
33%        4
40%        4
78%        4
30%        3
66%        3
58%        2
34%        2
10%        2
13%        1
87%        1
43%        1
Name: count, dtype: int64

In [31]:
# Fill the null values in 'host_response_time' with 'Not provided'.
df_listings_clean['host_response_time'] = df_listings_clean['host_response_time'].fillna('Not provided')

In [32]:
# Convert the response rate from percentage strings to floats (nulls become NaN), then fill null values with 0.
df_listings_clean['host_response_rate'] = df_listings_clean['host_response_rate'].apply(lambda value: float(str(value).replace("%", "")))
df_listings_clean['host_response_rate'] = df_listings_clean['host_response_rate'].fillna(value=0)
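
The percentage conversion can also be expressed with pandas' vectorised string methods. The sketch below runs on made-up toy values. The `rate_missing` flag is a hypothetical extra, included only because a filled 0 is otherwise indistinguishable from a genuine '0%' response rate, which does occur in the value counts above.

```python
import numpy as np
import pandas as pd

# Toy column with made-up values in the same '<number>%' string format
rates = pd.Series(['100%', '0%', np.nan, '67%'], name='host_response_rate')

# Flag the rows that were missing before filling (optional, hypothetical)
rate_missing = rates.isnull()

# Strip the percent sign with the string accessor, then convert to numbers
rates_num = pd.to_numeric(rates.str.rstrip('%'))

# Fill the missing values with 0, as in the cell above
rates_filled = rates_num.fillna(0)

print(rates_filled.tolist())   # [100.0, 0.0, 0.0, 67.0]
print(rate_missing.tolist())   # [False, False, True, False]
```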