# Predicting Star Ratings of Edinburgh Airbnbs through Review Texts Analysis

# Notebook 3: Pre-Processing

## Introduction

In this notebook, our objective is to preprocess the listing and review data using a range of cleaning methods, in preparation for future modelling tasks.

- First, we will preprocess the listing dataset so that the listing-related columns are of numerical type.
- Next, we will utilize natural language processing (NLP) and machine learning techniques to preprocess the review data; in this step, we specifically remove all non-English reviews. By doing so, we aim to transform the raw data into a format suitable for analysis and modelling, laying the foundation for extracting valuable insights and building predictive models.
- Finally, we will merge the cleaned listing data and the review data using two methods, producing two different review datasets: **collapsed review data with the corresponding listing details** (one row per listing), and **individual reviews with the corresponding listing details attached** (one row per review).

***

# Table of Contents

1. [**Pre-processing Listing Data**](#a1) <br>
    1.1 [**Remove Irrelevant columns**](#a1.1) <br>
    1.2 [**Clean current non-numerical columns**](#a1.2) <br>
2. [**Pre-process Review Data**](#a2) <br>
    2.1 [**Remove Non-English texts**](#a2.1) <br>
3. [**Merge Reviews and Listing data**](#a3) <br>
    3.1 [**Obtain all reviews individually**](#a3.1) <br>
    3.2 [**Obtain collapsed reviews by listings**](#a3.2) <br>
4. [**Extract merged data**](#a4) <br>
5. [**Summary**](#a5) <br>


***

#### Import Libraries

In [6]:
# Main Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

# Text analysis libraries
from langdetect import detect

#### Loading Relevant Data

In [7]:
df_reviews_clean=joblib.load('data/df_reviews_clean.pkl')
df_listings_clean_senti=joblib.load('data/df_listings_clean_senti.pkl')

***

# Pre-process objectives

- Clean and process Listing data by minimizing the number of non-numerical columns
- Clean and process review data to prepare for text vectorizing
- Merge both datasets to prepare for modelling

***

# Pre-process Listing Data <a id="a1"></a>

We will first pre-process the listing data by converting as many non-numerical columns as possible to numerical ones and by removing columns that are irrelevant to the later review analysis. We will then turn to the review data.

### Create copy of current listing data

In [8]:
df_listings_processed = df_listings_clean_senti.copy()

In [9]:
pd.set_option('display.max_rows', 100)
df_listings_clean_senti.head(1).T

Unnamed: 0,0
host_id,60423
id,15420
neighborhood_overview,"The neighbourhood is in the historic New Town,..."
host_since,2009-12-06
host_location,"Edinburgh, United Kingdom"
host_about,"I have a background in property, having worked..."
host_response_time,within a few hours
host_response_rate,100.0
host_acceptance_rate,92.0
host_is_superhost,1


In [10]:
# Show current listing data columns
df_listings_processed.columns

Index(['host_id', 'id', 'neighborhood_overview', 'host_since', 'host_location',
       'host_about', 'host_response_time', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified',
       'neighbourhood_cleansed', 'latitude', 'longitude', 'property_type',
       'room_type', 'accommodates', 'bathroom_num', 'beds', 'price',
       'minimum_nights', 'maximum_nights', 'has_availability',
       'number_of_reviews', 'number_of_reviews_ltm', 'first_review',
       'last_review', 'review_scores_Overall', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'instant_bookable',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_room

## Remove Irrelevant columns <a id="a1.1"></a>

Here we will remove irrelevant columns from the listing dataset. These include:

- host_id: We can use the listing id to match guest reviews from the review data. The host_id is irrelevant for our analysis at the moment.
- neighbourhood_cleansed: The neighbourhood details of the listing can be analysed using `neighborhood_overview`, and the exact location characteristics can be analysed using the latitude and longitude columns.
- all types of review scores: Since the sentiment scores have already been derived from the review ratings, the rating scores themselves are of no further use for the current analysis.
- two selected binary columns: `host_has_profile_pic` and `host_identity_verified`. These two columns contain highly imbalanced data, so we drop them from our analysis.
- property_type: `property_type` gives a more specific description of the listing. However, `room_type` already captures the information we need from the current dataframe.

In [11]:
df_listings_processed.drop([
    'host_id',
    'neighbourhood_cleansed',
    'review_scores_Overall',
    'review_scores_accuracy',
    'review_scores_cleanliness',
    'review_scores_checkin',
    'review_scores_communication',
    'review_scores_location',
    'review_scores_value',
    'host_has_profile_pic',
    'host_identity_verified',
    'property_type'
], axis=1, inplace=True)

## Clean current non-numerical columns <a id="a1.2"></a>

In [12]:
# Extract current non numerical columns
df_listings_obj=df_listings_processed.select_dtypes(include='object')

# Check column contents
df_listings_obj.head(1).T

Unnamed: 0,0
neighborhood_overview,"The neighbourhood is in the historic New Town,..."
host_since,2009-12-06
host_location,"Edinburgh, United Kingdom"
host_about,"I have a background in property, having worked..."
host_response_time,within a few hours
host_verifications,"['email', 'phone']"
room_type,Entire home/apt
first_review,2011-01-18
last_review,2023-12-11


Among the current non-numerical columns, we will isolate the `neighborhood_overview` and `host_about` columns, as they contain free-text information about the listing that can be analyzed later using text transformation methods. We can potentially use them to determine whether this information serves as a key indicator of certain sub-ratings.

Apart from these two columns, we can categorize the remaining features into different groups:

- **Dates**: `host_since`, `first_review`, and `last_review`. These columns will be transformed into year and month features, named with the original column as a prefix.
- **Host related columns**: `host_location`, `host_response_time`, and `host_verifications`. We will examine the contents of these columns and convert them into binary features using encoding strategies.
- **Property related columns**: `room_type` (`property_type` was already dropped in the previous step). Similar to the host-related columns, we will check the contents of this column and convert it into binary features.

#### Extract `neighborhood_overview` and `host_about`

In [13]:
# Extract listing description columns 'neighborhood_overview' and 'host_about'
Listing_descriptions = df_listings_processed[['neighborhood_overview', 'host_about']]

# Drop these columns from the processed dataframe
df_listings_processed.drop(['neighborhood_overview', 'host_about'], axis=1, inplace=True)

#### Extract year and month information from date related columns

In [14]:
# List of date-related column names
date_columns= ['host_since', 'first_review', 'last_review']

# For each date-related column, generate new features holding its year and month
for i in date_columns:
    # Convert columns to datetime format
    df_listings_processed[i] = pd.to_datetime(df_listings_processed[i])
    # Extract year and month information
    temp_year_list = df_listings_processed[i].dt.year
    temp_month_list = df_listings_processed[i].dt.month
    # Add new features to original dataframe
    df_listings_processed[i+'_year'] = temp_year_list
    df_listings_processed[i+'_month'] = temp_month_list

# Drop original date columns
df_listings_processed.drop(date_columns, axis=1, inplace=True)
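
As a minimal sketch of the same idea on a toy frame (the values below are made up for illustration), we can also guard against missing or unparseable dates. The `errors="coerce"` argument and the nullable `Int64` dtype are our own additions and are not part of the cell above. They keep the year and month as whole numbers even when a date is missing, whereas plain `.dt.year` would fall back to floats with `NaN`.

```python
import pandas as pd

# Toy frame: column names mirror the notebook, values are illustrative only
toy = pd.DataFrame({
    "host_since": ["2009-12-06", None],
    "first_review": ["2011-01-18", "2020-07-01"],
})

date_columns = ["host_since", "first_review"]

for col in date_columns:
    # errors="coerce" turns missing or unparseable entries into NaT
    parsed = pd.to_datetime(toy[col], errors="coerce")
    # Nullable Int64 keeps integer years/months alongside missing values
    toy[f"{col}_year"] = parsed.dt.year.astype("Int64")
    toy[f"{col}_month"] = parsed.dt.month.astype("Int64")

# Drop the original date columns once the parts are extracted
toy = toy.drop(columns=date_columns)
print(toy)
```

Whether missing dates should stay missing or be imputed is a modelling decision that this sketch leaves open.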

#### Binarize `host_location`

In [15]:
# Check host_location contents
df_listings_processed['host_location'].value_counts()

Edinburgh, United Kingdom         4143
None                               902
Scotland, United Kingdom           257
London, United Kingdom             100
North Berwick, United Kingdom       55
                                  ... 
Haugh of Urr, United Kingdom         1
Sedbergh, United Kingdom             1
Sheffield, United Kingdom            1
Bergen, Norway                       1
Little Waltham, United Kingdom       1
Name: host_location, Length: 214, dtype: int64

We observe that most of the hosts are located in Edinburgh, as our data was specifically chosen for Edinburgh Airbnb listings. Therefore, we can create a binary column to distinguish whether the host is from Edinburgh or not.

In [16]:
# Create a binary column: 1 if the host location mentions Edinburgh, 0 otherwise (including missing values)
df_listings_processed['host_from_Edinburgh']=df_listings_processed['host_location'].apply(lambda x: 1 if 'Edinburgh' in str(x) else 0)

# Drop host_location column
df_listings_processed.drop('host_location', axis=1, inplace=True)

#### Binarize `host_response_time`

In [17]:
# Check host_response_time contents
df_listings_processed['host_response_time'].value_counts()

within an hour        3595
Not provided          1417
within a few hours     574
within a day           265
a few days or more      59
Name: host_response_time, dtype: int64

We observe that the `host_response_time` column takes one of several categorical values. Therefore, we can use one-hot encoding to convert this column into several binary columns, dropping one category as the reference level.

In [18]:
# Convert host_response_time to binary columns
response_time_dummies = pd.get_dummies(df_listings_processed['host_response_time'], prefix='host_response_time')

# Concatenate the dummy columns with the original processed listing DataFrame
df_listings_processed = pd.concat([df_listings_processed, response_time_dummies], axis=1)

# Drop the 'Not provided' dummy (used as the reference category, to avoid the dummy variable trap) and the original column
df_listings_processed.drop(['host_response_time_Not provided', 'host_response_time'], axis=1, inplace=True)
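
The encoding step above can be sketched on a small toy column. This is an illustrative sketch rather than the exact cell: the sample values are made up, and `dtype=int` is an optional extra that we assume is wanted so that the dummies are stored as 0/1 integers rather than booleans.

```python
import pandas as pd

# Toy column with three of the categories seen in the notebook output
toy = pd.DataFrame({"host_response_time": [
    "within an hour", "Not provided", "within a day", "within an hour",
]})

# One binary column per category, stored as integers
dummies = pd.get_dummies(
    toy["host_response_time"], prefix="host_response_time", dtype=int
)

# Drop the reference category explicitly, so a row of all zeros means 'Not provided'
dummies = dummies.drop(columns="host_response_time_Not provided")

# Replace the original column with the remaining dummy columns
toy = pd.concat([toy.drop(columns="host_response_time"), dummies], axis=1)
print(toy)
```

Dropping a named category, rather than relying on `drop_first=True`, makes it explicit which level acts as the baseline.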

#### Binarize `host_verifications`

In [19]:
# Check host_verifications contents
df_listings_processed['host_verifications'].value_counts()

['email', 'phone']                  4599
['email', 'phone', 'work_email']     896
['phone']                            394
['phone', 'work_email']               12
['email']                              5
[]                                     4
Name: host_verifications, dtype: int64

In [20]:
# Create binary columns using contents in host_verifications 
host_verification_email=[]
host_verification_phone=[]
host_verification_work_email=[]

# We will use the unique values to filter each option
unique_verification_options = list(df_listings_processed['host_verifications'].value_counts().index)

# Append different options in created lists
# Iterate over the values directly so the loop does not depend on a default integer index
for temp_value in df_listings_processed['host_verifications']:
    if temp_value == "['email', 'phone']":
        host_verification_email.append(1)
        host_verification_phone.append(1)
        host_verification_work_email.append(0)
    elif temp_value =="['email', 'phone', 'work_email']":
        host_verification_email.append(1)
        host_verification_phone.append(1)
        host_verification_work_email.append(1)
    elif temp_value == "['phone']":
        host_verification_email.append(0)
        host_verification_phone.append(1)
        host_verification_work_email.append(0)
    elif temp_value == "['phone', 'work_email']":
        host_verification_email.append(0)
        host_verification_phone.append(1)
        host_verification_work_email.append(1)
    elif temp_value == "['email']":
        host_verification_email.append(1)
        host_verification_phone.append(0)
        host_verification_work_email.append(0)
    else:
        host_verification_email.append(0)
        host_verification_phone.append(0)
        host_verification_work_email.append(0)

# Add extracted lists to dataframe
df_listings_processed['host_verifications_email'] = host_verification_email
df_listings_processed['host_verifications_phone'] = host_verification_phone
df_listings_processed['host_verifications_work_email'] = host_verification_work_email

# Drop original column
df_listings_processed.drop('host_verifications', axis=1, inplace=True)
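
A more compact alternative to the `if`/`elif` chain is sketched below on toy data. It assumes, as the string comparisons above suggest, that each entry is the string representation of a Python list. Parsing it with `ast.literal_eval` and testing exact list membership also avoids a pitfall of substring matching, where `'email'` would match inside `'work_email'`.

```python
import ast
import pandas as pd

# Toy column: each value is the *string* form of a list, as in the notebook output
toy = pd.DataFrame({"host_verifications": [
    "['email', 'phone']", "['phone', 'work_email']", "[]",
]})

# Safely parse each string into a real Python list
parsed = toy["host_verifications"].apply(ast.literal_eval)

# One binary column per verification option, using exact membership in the list
for option in ["email", "phone", "work_email"]:
    toy[f"host_verifications_{option}"] = parsed.apply(
        lambda options, opt=option: int(opt in options)
    )

toy = toy.drop(columns="host_verifications")
print(toy)
```

This version also handles any combination of the three options without needing one branch per combination.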

#### Binarize `room_type`

In [21]:
# Check the room_type column content
df_listings_processed['room_type'].value_counts()

Entire home/apt    4102
Private room       1765
Hotel room           28
Shared room          15
Name: room_type, dtype: int64

In [22]:
# Convert room_type to binary columns
room_type_dummies = pd.get_dummies(df_listings_processed['room_type'], prefix='room_type')

# Concatenate the dummy columns with the original processed listing DataFrame
df_listings_processed = pd.concat([df_listings_processed, room_type_dummies], axis=1)

# Drop the 'Shared room' dummy (used as the reference category, to avoid the dummy variable trap) and the original column
df_listings_processed.drop(['room_type', 'room_type_Shared room'], axis=1, inplace=True)

***

# Pre-process Review Data <a id="a2"></a>

Before we merge the review data with our listing data, we need to prepare the review texts for vectorization. First, the text will be cleaned into an appropriate format so that errors and bias in the corpus are reduced.

## Remove Non-English texts <a id="a2.1"></a>

In [24]:
# Function to detect language
def detect_language(text):
    try:
        return detect(text) == 'en'
    except Exception:
        # Return False if language detection fails (e.g. empty or symbol-only text)
        return False

# Filter out non-English texts and reset the index
# (reset_index returns a new DataFrame, which avoids modifying a filtered slice in place)
df_reviews_processed = df_reviews_clean[df_reviews_clean['comments'].apply(detect_language)].reset_index(drop=True)
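
The filtering logic can be sketched independently of any particular detector. In the hypothetical helper below, `keep_english` and `toy_detect` are our own illustrative names: `toy_detect` is a stand-in so the example runs without extra dependencies, and in the notebook one would pass `langdetect.detect` instead. Since `langdetect` is not deterministic by default, setting `DetectorFactory.seed = 0` beforehand should make repeated runs agree.

```python
import pandas as pd

def keep_english(comments, detect_fn):
    """Return a boolean mask that is True where detect_fn labels the text 'en'."""
    def is_english(text):
        # Treat missing or whitespace-only text as non-English
        if not isinstance(text, str) or not text.strip():
            return False
        try:
            return detect_fn(text) == "en"
        except Exception:
            # Detection can fail on text with no usable features
            return False

    # Run the (slow) detector once per distinct text, then map the verdicts back
    verdicts = {text: is_english(text) for text in comments.dropna().unique()}
    return comments.map(lambda text: verdicts.get(text, False)).astype(bool)

def toy_detect(text):
    # Stand-in detector for illustration only, NOT a real language model
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no features in text")
    return "fr" if "très" in text else "en"

toy = pd.Series(["Great flat", "Appartement très propre", "12345", None, "Great flat"])
mask = keep_english(toy, toy_detect)
english_only = toy[mask].reset_index(drop=True)
print(english_only.tolist())
```

Detecting each distinct text only once can save time when many reviews are identical, although the gain depends on how repetitive the corpus is.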

In [25]:
print(f'The current dataset contains {df_reviews_processed.shape[0]} reviews.')

The current dataset contains 470695 reviews.


After removing non-English texts, we are left with **470,695** reviews. Next, we will merge the reviews with Airbnb listings. This will be done in two ways:

- We can merge the listing information, including sentiment scores, to all reviews individually. This may affect our model accuracy as we do not have the sentiment score for each review. However, since the sentiment score was determined by the average rating scores, we can infer that the reviews are reflective of the listing sentiment.

- We can collapse all review texts for the same listing into one corpus. This transformation can provide us with a more accurate sentiment score overall. However, it leaves us with fewer than 6,000 data points, one per listing.

The concatenated data for both options will be generated and saved. We will perform further preparation for the implementation of NLP and modelling in the next notebook.
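
The difference between the two options can be seen on a toy example. The frames below are made up for illustration, and the `validate` argument is an optional safeguard that we add here to assert the expected relationship between the two tables.

```python
import pandas as pd

# Toy listing and review tables (illustrative values only)
listings = pd.DataFrame({"listing_id": [1, 2], "beds": [2, 1]})
reviews = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "comments": ["Lovely stay.", "Very central.", "Cosy room."],
})

# Option 1: one row per review, with the listing details repeated on each row
per_review = listings.merge(
    reviews, on="listing_id", how="inner", validate="one_to_many"
)

# Option 2: join all reviews of a listing into one text, giving one row per listing
collapsed = reviews.groupby("listing_id")["comments"].agg(" ".join).reset_index()
per_listing = listings.merge(
    collapsed, on="listing_id", how="inner", validate="one_to_one"
)

print(per_review.shape, per_listing.shape)
```

Option 1 keeps every review as a data point but repeats the listing-level values, while option 2 has exactly one row per listing.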

***

# Merge Reviews and Listing data <a id="a3"></a>

## Obtain all reviews individually <a id="a3.1"></a>

Now the review and listing data will be merged using the first method listed above. We will retain the numerical columns in the listing dataset and remove irrelevant columns in the review dataset after merging.

In [26]:
# Rename listing id in listing dataframe
df_listings_processed.rename(columns={'id': 'listing_id'}, inplace=True)

# Merge listing and review datasets
df_reviews_by_listing = pd.merge(df_listings_processed, 
                                 df_reviews_processed,
                                on ='listing_id',
                                how='inner')
# Remove irrelevant columns
drop_columns=['id', 'date', 'reviewer_id', 'reviewer_name', 'price']
df_reviews_by_listing.drop(drop_columns, axis=1, inplace=True)

## Obtain collapsed reviews by listings <a id="a3.2"></a>

In [27]:
# Collapse all reviews to one corpus for each unique listing
df_collapsed_reviews = df_reviews_processed.groupby('listing_id')['comments'].apply(lambda x: ' '.join(x)).reset_index()

# Keep only the listings that have at least one remaining (English) review
temp_list = list(df_reviews_processed['listing_id'].unique())
df_listings_processed = df_listings_processed[df_listings_processed['listing_id'].isin(temp_list)]

# Merge the collapsed review data and the listing data
df_collapsed_reviews_by_listing = pd.merge(df_listings_processed, 
                                 df_collapsed_reviews,
                                on ='listing_id',
                                how='inner')
# Remove irrelevant columns
df_collapsed_reviews_by_listing.drop(['listing_id', 'price'], axis=1, inplace=True)

***

# Extract merged data <a id="a4"></a>

In [28]:
# Save the collapsed (one row per listing) data as a pickle file in the data folder
joblib.dump(df_collapsed_reviews_by_listing, 'data/df_collapsed_reviews_by_listing.pkl')

# Save the individual (one row per review) data as a pickle file in the data folder
joblib.dump(df_reviews_by_listing, 'data/df_reviews_by_listing.pkl')

['data/df_reviews_by_listing.pkl']

***

# Summary <a id="a5"></a>

In this notebook, we performed the first stage of data preprocessing: we converted the non-numerical listing columns into numerical features, removed non-English reviews, and merged the processed review texts with the listing data using two separate methods. In the next notebook, we will perform further text processing using Natural Language Processing as well as other machine learning techniques. Furthermore, we will use different machine learning models to try to extract word features that can help us identify an outstanding Airbnb listing.