# Cleaning TripAdvisor Reviews

Now that raw data has been downloaded for 20+ cities, it needs to be cleaned. The goal here is to remove hotels with 20 or fewer reviews, separate the reviews into pre-pandemic and pandemic reviews, discard extraneous data, and combine everything into a single dataframe for easier NLP processing.

In [2]:
# Import required libraries
import json
import datetime 
import pandas as pd
from dateutil.parser import parse
import os
import re
import langdetect

In [3]:
# Set Directory for OS
directory = "data/Trip Advisor"
len(os.listdir(directory)) # Checking to ensure all files are present

27

In [4]:
# Function to keep only hotels with more than 20 reviews (hotels with no reviews are dropped too)
def hotel_filter(df):
    filtered_data = []
    for hotel in df:
        if hotel['reviews'] and len(hotel['reviews']) > 20:
            filtered_data.append(hotel)
    return filtered_data
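
As a quick sanity check, the filter can be exercised on a few made-up hotel records. The records below are hypothetical and carry only the `reviews` key that `hotel_filter` reads; the function is restated in compact form so the snippet runs on its own.

```python
def hotel_filter(hotels):
    """Keep hotels whose 'reviews' list has more than 20 entries."""
    return [h for h in hotels if h['reviews'] and len(h['reviews']) > 20]

# Hypothetical records, not taken from the scraped data
sample = [
    {'name': 'A', 'reviews': []},                      # no reviews
    {'name': 'B', 'reviews': [{'text': 'ok'}] * 20},   # exactly 20 reviews
    {'name': 'C', 'reviews': [{'text': 'ok'}] * 21},   # 21 reviews
]

kept_names = [h['name'] for h in hotel_filter(sample)]
print(kept_names)
```

Because the comparison is strict, a hotel with exactly 20 reviews is dropped along with the empty one.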

#### Quick Exploration

In [5]:
#Loading an example dataset
with open('data/Trip Advisor/dallas_hotels.json', 'r', encoding = 'utf8') as read_file:
    data = json.load(read_file)

In [6]:
data[0]

{'id': '98650',
 'type': 'HOTEL',
 'name': 'Crowne Plaza Hotel Dallas Downtown',
 'awards': [],
 'rankingPosition': '210',
 'priceLevel': '$$',
 'category': 'hotel',
 'rating': '3.5',
 'hotelClass': '3.0',
 'hotelClassAttribution': 'This property is classified according to Giata.',
 'phone': '+1 214-742-5678',
 'address': '1015 Elm St, Dallas, TX 75202-3126',
 'email': 'guestservices@crownedallas.com',
 'amenities': [],
 'prices': [],
 'latitude': '32.780964',
 'longitude': '-96.80347',
 'webUrl': 'https://www.tripadvisor.com/Hotel_Review-g55711-d98650-Reviews-Crowne_Plaza_Hotel_Dallas_Downtown-Dallas_Texas.html',
 'website': 'https://www.ihg.com/crowneplaza/hotels/us/en/dallas/dalem/hoteldetail',
 'rankingString': '#210 of 228 hotels in Dallas',
 'numberOfReviews': '11',
 'rankingDenominator': '228',
 'reviews': [{'text': 'Wouldn’t recommend this place to my worst enemy. Rooms smell of smoke. 3 different rooms and no type of heat or proper ventilation. Stairway are filled with human f

The JSON contains a lot of extraneous data. We are interested in the *reviews*, *date*, and *priceLevel*, along with the city and ZIP code of each hotel. The strategy is to append each of these to a separate list and then create a *pandas* dataframe from the lists.

In [34]:
%%time
reviews_list = []
date = []
city = []
zip_codes = []
price_level = []
for f in os.listdir(directory):
    city_name = re.split('_hotel', f)[0]  # Get city name from file name
    with open(os.path.join(directory, f), 'r', encoding='utf8') as read_file:
        data = json.load(read_file)
    print(city_name + ' raw # of hotels: ', len(data))
    cleaned_data = hotel_filter(data)
    print(city_name + ' filtered # of hotels: ', len(cleaned_data))
    for hotel in cleaned_data:
        # First five-digit run in the address, taken to be the ZIP code
        zipcode = re.search(r'(\d{5})[- ]?', hotel['address']).group(1)
        for review in hotel['reviews']:
            reviews_list.append(review['title'] + ' ' + review['text'])
            # Reviews without a stay date get None here
            if review['stayDate']:
                date.append(parse(review['stayDate']))
            else:
                date.append(None)
            city.append(city_name)
            price_level.append(hotel['priceLevel'])
            zip_codes.append(zipcode)


asheville raw # of hotels:  157
asheville filtered # of hotels:  144
austin raw # of hotels:  290
austin filtered # of hotels:  226
boston raw # of hotels:  332
boston filtered # of hotels:  320
cambridge raw # of hotels:  21
cambridge filtered # of hotels:  19
chicago raw # of hotels:  389
chicago filtered # of hotels:  321
columbus raw # of hotels:  265
columbus filtered # of hotels:  209
dallas raw # of hotels:  503
dallas filtered # of hotels:  340
denver raw # of hotels:  216
denver filtered # of hotels:  156
fort_lauderdale raw # of hotels:  365
fort_lauderdale filtered # of hotels:  302
hawaii raw # of hotels:  177
hawaii filtered # of hotels:  119
jersey_city raw # of hotels:  26
jersey_city filtered # of hotels:  20
LosAngeles raw # of hotels:  2
LosAngeles filtered # of hotels:  2
MinnStPaul raw # of hotels:  497
MinnStPaul filtered # of hotels:  375
nashville raw # of hotels:  421
nashville filtered # of hotels:  370
new_orleans raw # of hotels:  351
new_orleans filtered # o
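
The ZIP code lookup in the loop takes the first five-digit run in the address and calls `.group(1)` directly, so it would raise `AttributeError` on an address with no such run and could pick up a five-digit street number instead. A minimal sketch of a more defensive variant, assuming US-style addresses that end in a ZIP or ZIP+4 code; the helper name `extract_zip` and the last two sample addresses are hypothetical:

```python
import re

# Five digits, optional +4 suffix, anchored at the end of the address
ZIP_AT_END = re.compile(r'(\d{5})(?:-\d{4})?\s*$')

def extract_zip(address):
    """Return the trailing 5-digit ZIP code, or None if there is none."""
    match = ZIP_AT_END.search(address or '')
    return match.group(1) if match else None

zip_plus_four = extract_zip('1015 Elm St, Dallas, TX 75202-3126')
long_street_number = extract_zip('12345 Main St, Dallas, TX 75201')
no_zip = extract_zip('No postal code listed')
print(zip_plus_four, long_street_number, no_zip)
```

Returning `None` rather than raising lets the missing values be counted and dropped later, the same way missing dates are handled.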

A quick check to make sure all lists are the same length:

In [7]:
len(price_level)

1981164

In [36]:
len(zip_codes)

1981164

In [8]:
len(date)

1981164

In [9]:
len(reviews_list)

1981164

In [10]:
len(city)

1981164
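
The separate length checks can also be collapsed into a single assertion. A small sketch, using short stand-in lists in place of the real ones:

```python
# Stand-in lists; in the notebook these would be the five lists built in the loop
city = ['asheville', 'asheville']
zip_codes = ['28803', '28803']
date = [None, None]
price_level = ['$$$', '$$$']
reviews_list = ['first review', 'second review']

columns = [city, zip_codes, date, price_level, reviews_list]
lengths = {len(col) for col in columns}  # set of distinct lengths

# Fails with a readable message if any list is out of step
assert len(lengths) == 1, f'lists differ in length: {sorted(lengths)}'
```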

Now we can build out the dataframe using the five lists.

In [41]:
df = pd.DataFrame(list(zip(city, zip_codes, date, price_level, reviews_list)), columns = ['City', 'zip_codes', 'Date', 'Price', 'Review'])

In [42]:
df.head()

Unnamed: 0,City,zip_codes,Date,Price,Review
0,asheville,28803,2021-03-31 00:00:00,$$$,"Wonderful Hotel, Great time! We just returned ..."
1,asheville,28803,2021-03-31 00:00:00,$$$,Great stop - clean and new Perfect place for o...
2,asheville,28803,2021-02-28 00:00:00,$$$,Clean with Spectacular Service This hotel was ...
3,asheville,28803,2021-01-31 00:00:00,$$$,Awesome Hotel! My boyfriend and I had a weeken...
4,asheville,28803,2021-02-28 00:00:00,$$$,"Great hotel Decided on a trip to Asheville, it..."


In [43]:
df['Date'].isnull().sum() #Check to see how many reviews did not have a date

207

In [44]:
df.dropna(axis = 0, inplace = True) # Drop rows with any missing value, which removes the reviews with no date

In [65]:
# Keep reviews dated from 2018-01-01 up to (but not including) 2021-03-11
df_filtered = df[(df['Date'] >= parse('2018-01-01')) & (df['Date'] < parse('2021-03-11'))]

In [50]:
pandemic_date = parse('2020-03-11') #Date of WHO Pandemic Declaration

In [72]:
pre_pandemic_reviews = len(df_filtered[df_filtered['Date'] < pandemic_date])

In [73]:
pre_pandemic_reviews 

1812808

In [74]:
pandemic_reviews = len(df_filtered[df_filtered['Date'] >= pandemic_date])

In [75]:
pandemic_reviews

109986

In [76]:
print(f'There are {round(pre_pandemic_reviews/pandemic_reviews, 2)} times as many pre-pandemic reviews as pandemic reviews')

There are 16.48 times as many pre-pandemic reviews as pandemic reviews
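
The counts compare the two periods, but the rows themselves carry no label. One way to make the split explicit for later NLP work is to derive a label column from the date comparison. A minimal sketch on a three-row stand-in frame; the column name `Period` and its label strings are assumptions, not something the notebook defines:

```python
import numpy as np
import pandas as pd

pandemic_date = pd.Timestamp('2020-03-11')  # Date of WHO pandemic declaration

# Stand-in for df_filtered, with hypothetical rows
df_filtered = pd.DataFrame({
    'Date': pd.to_datetime(['2019-06-30', '2020-03-11', '2020-12-31']),
    'Review': ['before', 'on the day', 'after'],
})

# Label each review by comparing its stay date with the declaration date
df_labeled = df_filtered.assign(
    Period=np.where(df_filtered['Date'] < pandemic_date,
                    'pre-pandemic', 'pandemic')
)
print(df_labeled['Period'].value_counts())
```

Using the same `<` and `>=` boundary as the counts keeps the label consistent with them: a review dated exactly on the declaration date is treated as a pandemic review.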


In [30]:
#Write dataframe to CSV
df_filtered.to_csv('export_dataframe.csv', index = False, header = True)
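
One thing to watch when this CSV is read back: `pd.read_csv` infers column types, so ZIP codes with a leading zero (Boston's `02108`, for example) would be parsed as integers and lose that zero, and `Date` would come back as plain strings. A sketch of reading the file with explicit types, using an in-memory buffer with hypothetical rows in place of `export_dataframe.csv`:

```python
import io
import pandas as pd

# In-memory stand-in for export_dataframe.csv (hypothetical rows)
csv_text = (
    'City,zip_codes,Date,Price,Review\n'
    'boston,02108,2019-06-30,$$$,Great stay\n'
    'asheville,28803,2020-12-31,$$,Clean rooms\n'
)

df_loaded = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'zip_codes': str},   # keep ZIP codes as text, leading zeros intact
    parse_dates=['Date'],       # convert the Date column back to datetimes
)
print(df_loaded.dtypes)
```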