<h1><center>HW2 Scrape Hotel Reviews</center></h1>

Choose one hotel at tripadvisor.com (e.g. https://www.tripadvisor.com/Hotel_Review-g60763-d23448880-Reviews-Motto_by_Hilton_New_York_City_Chelsea-New_York_City_New_York.html)

- Q1. Write a function to scrape all **reviews on the first page**, including:
    - **username** (see (1) in Figure)
    - **review date** (see (2) in Figure)
    - **rating** (see (3) in Figure)
    - **title** (see (4) in Figure)
    - **review text** (see (5) in Figure; for each review, get the **complete text**)
    - **date of stay** (see (6) in Figure)
    - If a field, e.g. rating, is missing, use `None` to indicate it.
- `Function Input`: hotel page URL
- `Function Output`: a DataFrame holding all reviews, with one column per field. E.g., for the given URL, you can get 10 reviews.
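
The "use `None` for a missing field" rule can be illustrated with the rating field. The sketch below is only an illustration: `parse_rating` is a hypothetical helper, and the `bubble_NN` class naming (e.g. `bubble_50` for five bubbles) is an assumption about the page markup, not something the assignment states.

```python
import re
from typing import Optional, Sequence


def parse_rating(css_classes: Optional[Sequence[str]]) -> Optional[float]:
    """Return the rating encoded in a list of CSS class names, or None.

    Assumption: the rating span carries a class such as "bubble_50",
    where the two digits are the rating multiplied by 10.
    """
    if not css_classes:
        # the rating span (or its class attribute) is missing entirely
        return None
    for name in css_classes:
        match = re.fullmatch(r"bubble_(\d{2})", name)
        if match:
            return int(match.group(1)) / 10
    # a class list was present, but none of the names encodes a rating
    return None
```

A scraper could call this on the class list of the rating span (for a BeautifulSoup tag, `tag.get("class")`) and store the result directly, so a review without a rating ends up with `None` instead of raising an error.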

    
![alt text](tripadvisor.png "TripAdvisor")

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import re

from selenium.webdriver.firefox.service import Service as FirefoxService
from webdriver_manager.firefox import GeckoDriverManager


# use webdriver-manager to install GeckoDriver, then launch Firefox through it
executable = FirefoxService(GeckoDriverManager().install())

driver = webdriver.Firefox(service=executable)



In [2]:
# go to next page
# next_page = driver.find_element(by=By.XPATH, value=".//div/div/div[3]/div[13]/div/a")
# next_page.click()

In [3]:
def getReviews(page_url):
    # load webpage
    driver.get(page_url)   
    driver.implicitly_wait(10)
    
#     review_data = {"username":[], "review date":[], "rating":[], 
#                "title":[], "review text":[], "date of stay":[]}

    #select css container for review cards and make empty df
    select_reviews = '[data-test-target="HR_CC_CARD"]'
    review_cards = driver.find_elements(By.CSS_SELECTOR, select_reviews)
    
    df = pd.DataFrame(index = range(len(review_cards)), 
              columns=["username", "review date", "rating", "title", 
                       "review text", "date of stay"])

    for i, review in enumerate(review_cards):
        # look for review[i]
        soup = None
        max_try = 10  # max number of scroll-downs
        cnt = 0

        while soup is None and cnt < max_try:
            try:
                # Try to parse the inner HTML of the selected review card
                soup = BeautifulSoup(review.get_attribute('innerHTML'), 'lxml')

            except Exception:     # card not ready yet, scroll down and retry
                driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
                cnt += 1

        if soup is None:   # card never loaded: leave this row empty
            continue

        soup_children = list(soup.children)

        html_children = list(soup_children[0].children)
        # select body of the review
        body = html_children[0]

        # go down from the first body section to the text span
        name_post_date = list(body.children)[0]
        name_post_date = list(list(name_post_date.children)[1].children)[1]
        
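        # separate username and post date strings
        for child in name_post_date:
            n_str = child.get_text().split("wrote a review")

            # .loc avoids chained assignment, which may silently fail to write
            df.loc[i, 'username'] = n_str[0].strip()
            # use None when the post date is missing
            df.loc[i, 'review date'] = n_str[1].strip() if len(n_str) > 1 else None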

        # turn html of the rest of the review card into list of children
        everything_else = list(body.children)[-1]
        text_body = list(everything_else.children)
        
        # find the first div of the review that holds the rating bubbles
        r = 0
        rating = text_body[r].find('span', {"class": "ui_bubble_rating"})
        while rating is None and r + 1 < len(text_body):
            r += 1
            rating = text_body[r].find('span', {"class": "ui_bubble_rating"})

        if rating is None:
            # no rating found: store None and do not guess the title position
            df.loc[i, 'rating'] = None
            title = None
        else:
            # the last class name ends in the rating times 10; convert to float
            df.loc[i, 'rating'] = int(rating.get("class")[-1][-2:]) / 10
            title = text_body[r + 1].text   # the title div follows the rating div
        df.loc[i, 'title'] = title
        
        # go to the review text section and find the text span;
        # the slice drops the last 10 characters (presumably the trailing "Read more" link text)
        b = text_body[-1].children
        rev_text = list(b)[0].text[:-10]

        df.loc[i, "review text"] = rev_text

        # the second child of the review text section holds the "Date of stay" line
        stay = list(list(text_body[-1].children)[1].children)[0].text

        # split the "Date of stay: ..." text and keep only the date
        date_of_stay = stay.split("Date of stay: ")[-1]
        df.loc[i, 'date of stay'] = date_of_stay
        
    return df

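The username and review date in `getReviews` come from one string that is split on `"wrote a review"`. As a minimal sketch of that step in isolation, the hypothetical helper `split_header` below (not part of the original solution) returns `None` for whichever part is missing, following the assignment's rule for missing fields.

```python
from typing import Optional, Tuple


def split_header(text: str,
                 marker: str = "wrote a review") -> Tuple[Optional[str], Optional[str]]:
    """Split a review header into (username, review date).

    Either part is None when it is missing or empty.
    """
    name, sep, date = text.partition(marker)
    username = name.strip() or None
    if sep:
        # the marker was found, so whatever follows it is the review date
        review_date = date.strip() or None
    else:
        # no marker: the header has no recognizable date part
        review_date = None
    return username, review_date
```

Using `str.partition` instead of `split(...)[1]` means a header without the marker yields `None` for the date rather than an `IndexError`.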
In [4]:
test_url = 'https://www.tripadvisor.com/Hotel_Review-g60763-d7891458-Reviews-Arlo_SoHo-New_York_City_New_York.html'
getReviews(test_url)

| | username | review date | rating | title | review text | date of stay |
|---|---|---|---|---|---|---|
| 0 | Ken M | Yesterday | 5.0 | Great place to stay | Front end/reception was excellent. They really... | February 2023 |
| 1 | goldnjumbo | Jul 2022 | 5.0 | Excellent hotel, everything you need | I needed a well priced, downtown hotel near my... | July 2022 |
| 2 | Barry R | Feb 21 | 5.0 | Fantastic place to stay….in particular, locati... | Great location.. Rooms small but very comforta... | February 2023 |
| 3 | Aashka A | Feb 20 | 5.0 | Best of SOHO | The beds are super comfy, we want the same for... | February 2023 |
| 4 | uk | Feb 20 | 2.0 | An honest review | We booked Arlo Soho for a 6 night stay based o... | February 2023 |
| 5 | Christopher L | Feb 20 | 5.0 | NY Trip | I had such a great experience here. Everything... | February 2023 |
| 6 | Stella D | Feb 19 | 5.0 | Robert enhanced our stay! | I came here Feb 19th and have tried many diffe... | February 2023 |
| 7 | Taylor W | Feb 19 | 5.0 | Excellent Hotel, Fantastic Staff | The Arlo Soho is well located, walkable and cl... | February 2023 |
| 8 | Road20126263866 | Feb 18 | 5.0 | Great Place to Stay! | High rate for cleanliness. the staff and servi... | May 2022 |
| 9 | Zeeba K | Feb 17 | 5.0 | A fabulous week | Great team working at the Arlo and at the hote... | February 2023 |


In [5]:
page_url = 'https://www.tripadvisor.com/Hotel_Review-g60763-d23448880-Reviews-Motto_by_Hilton_New_York_City_Chelsea-New_York_City_New_York.html'
getReviews(page_url)


| | username | review date | rating | title | review text | date of stay |
|---|---|---|---|---|---|---|
| 0 | angiethenewfie1 | Yesterday | 5.0 | Beautiful hotel and excellent service | I have been to the hotel about 10 times and it... | February 2023 |
| 1 | Alice L | Feb 2022 | 5.0 | Perfect for Us | We recently chose Motto for an overnight in NY... | February 2022 |
| 2 | Ishea Young-Taylor | Feb 21 | 5.0 | Uh Freaking Amazing!!!! | Hotel is so cute. As soon as u step in to the ... | February 2023 |
| 3 | uk | Feb 20 | 4.0 | The saviour of our NYC trip | We originally had booked to stay at a differen... | February 2023 |
| 4 | Jonathan F | Feb 20 | 5.0 | Pleasant, value hotel smack in the middle of N... | Just a really great hotel. Achieved a lot with... | February 2023 |
| 5 | Kristina D | Feb 20 | 5.0 | Very friendly staff | I stayed at motto for 6 nights! The rooms were... | February 2023 |
| 6 | scott w | Feb 19 | 5.0 | Great room, great service, great location | Room was fantastic and the service was outstan... | February 2023 |
| 7 | Louis M | Feb 19 | 5.0 | Motto NYC Chelsea | We had an excellent one-night stay at the Mott... | February 2023 |
| 8 | Brady | Feb 19 | 3.0 | Whats in a name? | Its the Hilton name so booked on that assumpti... | February 2023 |
| 9 | Robsen | Feb 17 | 5.0 | Just perfect! | At first the arrival there was amazing. Ina fr... | February 2023 |


In [6]:
driver.quit()

### Bonus point
* Modify the function you defined in Q1 to scrape **all reviews on the first five pages**.
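
One possible way to reach the later pages is to build their URLs directly instead of clicking the "next" link. The sketch below assumes the site paginates by inserting an offset segment such as `-or10-` after `-Reviews-`, with 10 reviews per page; this URL pattern is an assumption that is not stated in the assignment and should be checked against the link behind the page's "next" button. `page_urls` is a hypothetical helper name.

```python
from typing import List


def page_urls(first_page_url: str, n_pages: int = 5, per_page: int = 10) -> List[str]:
    """Build the URLs of the first n_pages review pages.

    Assumption: page k (counting from 0) is reached by inserting
    "or<k * per_page>-" right after "-Reviews-" in the first-page URL.
    """
    marker = "-Reviews-"
    if marker not in first_page_url:
        raise ValueError("URL does not contain the expected '-Reviews-' segment")
    if n_pages < 1:
        return []

    head, tail = first_page_url.split(marker, 1)
    urls = [first_page_url]  # the first page carries no offset segment
    for page in range(1, n_pages):
        offset = page * per_page
        urls.append(f"{head}{marker}or{offset}-{tail}")
    return urls
```

With these URLs, the Q1 function could be called once per page and the resulting frames combined with `pd.concat(frames, ignore_index=True)`. Note that the driver is closed in cell 6, so a new one would have to be created before scraping again.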


In [7]:
def getReviews2(page_url):
    
    # add your code
    
    
    return reviews_tb

    