### Hotel Review Classifier (working title)

*Flatiron School Data Science Bootcamp*

Capstone Project - NLP Sentiment/Ratings Analysis

Anna D'Angela | [annaadangela@gmail.com](mailto:annaadangela@gmail.com)

[Return to GitHub](https://github.com/anna-dang/mod05-capstone)


# TODO

- continue gathering to reach 25,000 reviews
- try the scraping function with shorter Selenium wait times to speed it up
- update the URL function to handle mid-range page numbers: if the URL contains "-or(any # of digits)-", extract the number, convert it to an int, and use it as the start number in the range
- upgrade plot design
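
The URL item in the list above could be sketched as follows. This is a minimal sketch, not the function in `collection_functions`: the names `extract_review_offset` and `page_offsets`, the example URL extensions, and the assumption of 5 reviews per page (taken from the 800 pages / 4000 reviews run) are all hypothetical.

```python
import re

# Assumed pattern: a paginated review URL carries an offset such as "-or25-"
OFFSET_PATTERN = re.compile(r"-or(\d+)-")


def extract_review_offset(url: str) -> int:
    """Return the review offset embedded in the URL, or 0 if there is none."""
    match = OFFSET_PATTERN.search(url)
    return int(match.group(1)) if match else 0


def page_offsets(url: str, n_pages: int, per_page: int = 5) -> list:
    """List the offsets to scrape, starting from the offset found in the URL."""
    start = extract_review_offset(url)
    return list(range(start, start + n_pages * per_page, per_page))


# Hypothetical URL extensions, for illustration only
first_page = "/Hotel_Review-g0-d0-Reviews-Example_Hotel.html"
mid_page = "/Hotel_Review-g0-d0-Reviews-or25-Example_Hotel.html"

print(extract_review_offset(first_page))
print(page_offsets(mid_page, n_pages=3))
```

Returning 0 when no offset is present lets the same function handle a hotel's first page and a resumed scrape.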

In [None]:
# Auto - reload custom function library
%load_ext autoreload
%autoreload 2

In [None]:
# Import libraries
from re import compile, split
from bs4 import BeautifulSoup
from selenium import webdriver
from time import sleep, time

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Custom functions
import collection_functions as cf

# Data Collection

The Trip Advisor home page URLs of the selected hotels are stored separately in a text file to keep this notebook tidy.
I will scrape 1,000 reviews from each of 10 hotels in the Denver, Colorado metro area. I manually confirmed that the hotels had at least that many reviews as I collected their URLs.

How I selected the hotels:

Mid-range hotels with a standard average nightly rate under $200 (at time of scrape, January 2021). I selected both highly rated and low-rated properties to try to gather a balanced distribution of rating classes.

Because of the built-in wait times used to avoid Selenium timeout exceptions, I observed long runtimes:

- 5 minutes per 50 pages
- 8-10 minutes per 100 pages
- 12 minutes per 150 pages
- 18 minutes per 200 pages
- 60 minutes per 800 pages (4,000 reviews)
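
As a rough planning aid, the timings above can be turned into a per-page rate and extrapolated to the 25,000 review target from the TODO list. This is only a sketch under stated assumptions: it treats runtime as linear in page count, takes 9 minutes as the midpoint of the 8-10 minute entry, and derives 5 reviews per page from the 800 page run. The function and variable names are hypothetical.

```python
# Observed timings from the list above: pages scraped -> minutes taken.
# The 100-page entry uses 9, the midpoint of the reported 8-10 minutes.
observed = {50: 5, 100: 9, 150: 12, 200: 18, 800: 60}

# 800 pages yielded 4000 reviews, so assume 5 reviews per page
REVIEWS_PER_PAGE = 4000 / 800

# Minutes per page for each observed run
rates = {pages: minutes / pages for pages, minutes in observed.items()}
fastest = min(rates.values())
slowest = max(rates.values())


def estimate_minutes(n_reviews: int, rate: float) -> float:
    """Linear estimate of scrape time for n_reviews at a given minutes-per-page rate."""
    pages = n_reviews / REVIEWS_PER_PAGE
    return pages * rate


target = 25_000
low = estimate_minutes(target, fastest)
high = estimate_minutes(target, slowest)
print(f"Estimated runtime: {low / 60:.1f} to {high / 60:.1f} hours")
```

The estimate ignores the pause between hotels and any retries, so it is best read as a lower bound.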

### Scrape website

In [None]:
# Import hotel homepage extensions as a list 
with open("./data/denver_urls.txt", "r") as scrape_1_urls:
    hotel_homepage_urls = scrape_1_urls.read().splitlines()

# Drop empty strings left by blank lines in the file
hotel_homepage_urls = [i for i in hotel_homepage_urls if i]

# Check format and length
print(len(hotel_homepage_urls))
hotel_homepage_urls[10:13]

In [None]:
# Create a base df to populate with hotel information
base_df = pd.DataFrame(columns = ['Location', 'Hotel', 'Title', 'Review', 'Rating']) 
base_df

In [None]:
# Iterate through each hotel in list, record runtimes
start_time = time()

for i, hotel_url in enumerate(hotel_homepage_urls[10:13]):
    
    print("--- Hotel ", i+1, " ---")
    
    start_run = time()
    
    # Scrape each hotel for specified number of pages
    hotel_df = cf.scrape_hotel(hotel_url, n=200)
    
    # Add to base df
    base_df = pd.concat([base_df, hotel_df], ignore_index=True)
    
    # Calculate runtime
    run_time = round((time() - start_run)/60, 2)
    
    print(f"--- Run {i+1} complete:", run_time, "minutes ---")
    
    # Avoid being booted
    sleep(3)

# Total run time
full_run_time = time() - start_time

# Copy to a working df so base_df stays as a backup (plain assignment would only alias it)
df = base_df.copy()

In [None]:
# Total scrape runtime
print(round(full_run_time/60, 2), "minutes")

### Check data

In [None]:
# Check full populated df
base_df.head()

In [None]:
# Preview review composition (ensure that they have been fully expanded)
for r in df['Review'][0:5]:
    print(r, "\n")

In [None]:
# Check df
print("Reviews:", df.shape[0])
print("Missing values?", df.isna().sum().sum())
print("Hotels:", df['Hotel'].nunique())
list(df['Hotel'].unique())

In [None]:
# Examine rating distribution
df['Rating'].value_counts(normalize=True).sort_index().plot(kind='bar', rot=0)
plt.title("Distribution of Ratings")
plt.xlabel('Rating')
plt.ylabel("Proportion of Reviews")
plt.show()

### Save data

In [None]:
# Save
df.to_csv("./data/test_scrape_6.csv", index=False)

In [None]:
# Check load
check_df = pd.read_csv("./data/test_scrape_6.csv")
check_df.head()

In [None]:
# Looks good!

# Concat many scrapes

- test_data: 2,500 Detroit hotel reviews gathered during function building
    - Motor City Casino
    - The Siren
    - aLoft
    - Westin Book Cadillac
    - The Foundation Hotel

Denver reviews:

- test_scrape_1: 300 reviews from URL 2 (working on bugs)
    - Clarion Hotel
- test_scrape_2: 500 reviews from URL 3 (testing timeouts)
    - La Quinta
- test_scrape_3: 750 reviews from URL 4 (correcting timeouts)
    - The Crawford Hotel
- test_scrape_4: 2,000 reviews from URLs 5-6 (cruisin'! green light)
    - Baymont by Wyndham Denver International Airport
    - Grand Hyatt Denver Downtown
- test_scrape_5: 3,995 reviews from URLs 7-10
    - Warwick Denver Hotel
    - The Westin Denver International Airport
    - Hyatt Place Denver/Cherry Creek
    - Microtel Inn & Suites by Wyndham Denver
- test_scrape_6: 3,930 reviews from URLs 1-3
    - Best Western Plus Denver International Airport Inn & Suites
    - Clarion Hotel Denver Central
    - La Quinta Inn & Suites by Wyndham Denver Airport Dia
    - The Crawford Hotel
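
Several of the scrapes listed above cover the same hotels (test_scrape_6 revisits hotels from test_scrape_1 through test_scrape_3), so the combined data will likely contain duplicate reviews. A minimal sketch of the concatenate, de-duplicate, and empty-check steps follows. The two small frames are made-up stand-ins for the saved scrape files, which would in practice be loaded with `pd.read_csv`; the variable names are hypothetical.

```python
import pandas as pd

columns = ["Location", "Hotel", "Title", "Review", "Rating"]

# Made-up rows standing in for two separate scrape files
scrape_a = pd.DataFrame(
    [
        ["Denver", "Hotel A", "Great", "Clean room.", 5],
        ["Denver", "Hotel A", "Bad", "", 1],  # empty review text
    ],
    columns=columns,
)
scrape_b = pd.DataFrame(
    [
        ["Denver", "Hotel A", "Great", "Clean room.", 5],  # repeat of a row in scrape_a
        ["Denver", "Hotel B", "Fine", "Decent stay.", 3],
    ],
    columns=columns,
)

# Stack the scrapes into one frame with a fresh index
combined = pd.concat([scrape_a, scrape_b], ignore_index=True)

# Count, then drop, rows that are identical in every column
n_duplicates = combined.duplicated().sum()
combined = combined.drop_duplicates(ignore_index=True)

# Treat missing or whitespace-only reviews as empty
empty_mask = combined["Review"].isna() | (combined["Review"].str.strip() == "")
n_empty = empty_mask.sum()
cleaned = combined[~empty_mask].reset_index(drop=True)

print("Duplicates:", n_duplicates)
print("Empty reviews:", n_empty)
print("Remaining reviews:", len(cleaned))
```

Checking for whitespace-only strings matters here because `isna()` alone does not flag an empty review that was scraped as `""`.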
    

In [None]:
# load all 

In [None]:
# concat
# check df

In [None]:
# duplicates?

In [None]:
# empty strings / Null ?

In [None]:
# Examine rating distribution (raw counts)
df['Rating'].value_counts().sort_index().plot(kind='bar', rot=0)
plt.title("Distribution of Ratings")
plt.xlabel('Rating')
plt.ylabel("Count")
plt.show()

### Export final data

In [None]:
# save as one file

In [None]:
# check load