<a href="https://colab.research.google.com/github/ZiyaoShang/HyperDenseNet_pytorch/blob/master/scraper/MBA742_2022_Class11_Review_Scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MBA 742 Data Science and AI in Marketing**
## *Spring 2022*  
Daniel M. Ringel  
Kenan-Flagler Business School  
*The University of North Carolina at Chapel Hill*  
dmr@unc.edu  


## Class 11 - **Launchpad Team Assignment**: Scraping Hotel Reviews from Google and TripAdvisor
*February 16th, 2022*  
Version 1.0

![Report: 78% of All Online Hotel Reviews Come From the Top Four Sites ](https://www.revinate.com/wp-content/uploads/marriott-tripadvisor-1140x411.jpg "Source: www.revinate.com")

# Today's Agenda

1. **Welcome Graduate Hotels: Kayla Sherwyn (Marketing Manager)**

2. **Team Work: Plan your Team Assignment**
    
3. **How to Scrape Hotel Reviews on YOUR computer**

## Prep-Check:
- Set-up Python on your computer: See [Anaconda Set-up Instructions](https://kenan-flagler.instructure.com/courses/3376078/pages/installing-python-on-your-computer) on Canvas
- Read the Team Assignment on Canvas and prepare questions for Graduate Hotels

# 1. Welcome Graduate Hotels

- Zoom with Kayla Sherwyn
- Learn about Graduate Hotels
- Ask Questions

https://kenan-flagler.zoom.us/my/danielringel

**IMPORTANT** 
- You may not share any part of this project outside of this course. 
- Do not post about your analysis and findings for Graduate Hotels on social media (e.g., LinkedIn, Twitter, Facebook, etc.). 
- I will write a post on LinkedIn at the end of the course that you are all welcome to comment on. 

# 2. Team Work: Plan your Team Assignment
- What questions will you answer?
- Why and how will your answers and findings help Graduate Hotels?
- What data will you use? How will you collect it?
- What analyses will need to be done?
- What insights do you anticipate?
- How will these insights inform your recommendation to Graduate Hotels?
- What is your project plan? (Timeline, team member roles, inputs, meetings, milestones, deliverables)

![Data Science Pipeline (source: Wolfram)](https://mapxp.app/busi488/datasciencepipeline.png)

# 3. How to Scrape Hotel Reviews on YOUR computer
- Each team appoints one member to work in-class with us on getting their computers set-up to scrape reviews
- You need to have Anaconda installed on your computer: see [Anaconda Set-up Instructions](https://kenan-flagler.instructure.com/courses/3376078/pages/installing-python-on-your-computer) on Canvas for instructions.



## 3.1 Step 1: Install Some Software


1. Install Google's Chrome web browser on your computer (if you don't already have it installed): https://www.google.com/chrome/   


2. Create a Folder (Directory) on your computer where you plan to do your scraping (i.e., where you run the scraper from and save scraping results to). As an example, my directory is: 
  ```
  "/Users/Daniel/OneDrive - UNC Kenan-Flagler Business School/Teaching/2022-Spring/MBA742/ReviewScraper"
  ```

3. Download Chromedriver from https://chromedriver.chromium.org/downloads to the directory that you plan to scrape from/to (see 2. above).
    - <font color='red'>Make sure to select the version of chromedriver that matches your Chrome Browser installation</font>

    - Check your Chrome version from the Chrome Browser's menu:
        * upper-right corner --> Help --> About Google Chrome, ***or*** 
        * upper-left corner --> Chrome --> About Google Chrome  
      
    - Download the version that is appropriate for your operating system. I have a MacBook with an Intel processor and with Google Chrome version 98 installed, so I will download ***Chrome version 98 --> chromedriver_mac64.zip***  
      
      
4. Go to your folder (where you downloaded the Chromedriver to), unpack it, and double-click the file to run it once. A command prompt / terminal will pop up. Wait until it prints a success message before closing it. (If you are a Mac user and it is the first time you open it, you will see an error message not allowing you to open it. Go to System Preferences on Mac --> Security & Privacy --> General --> Open Anyway.)
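
The version match in step 3 comes down to the major version number, i.e. the part before the first dot (98 in my case). As a minimal sketch, assuming you copy the two version strings by hand, the check looks like this. The helper names and the version strings below are made up for illustration:

```python
def major_version(version: str) -> int:
    """Return the major version (the number before the first dot)."""
    return int(version.strip().split(".")[0])


def same_major_version(chrome_version: str, driver_version: str) -> bool:
    """True if Chrome and chromedriver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


# Hypothetical version strings: read yours from "About Google Chrome"
# and from the chromedriver download page.
print(same_major_version("98.0.4758.102", "98.0.4758.80"))  # True
print(same_major_version("98.0.4758.102", "97.0.4692.71"))  # False
```

If the second case applies to you, download the chromedriver build that matches your Chrome version before continuing.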

## 3.2 Step 2: Create Python Environment and Install Python Libraries to it

We need three libraries to scrape hotel reviews with this notebook:
- **Selenium**: Selenium is a suite of tools for automating web browsers that is widely used for cross-browser testing of web applications. It cannot automate desktop applications; it only works in browsers. It supports the popular web browsers, which is why we can use it to control Chrome for scraping.   
  

- **beautifulsoup4**: Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.  

  
- **Pandas**: pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

Before we get started, we need to create a python environment on our computer. See the [Tutorial](https://kenan-flagler.instructure.com/courses/3376078/pages/installing-python-on-your-computer) on Canvas.

### 3.2.1 Create Python Environment


1. You should have already created a folder on your computer where you want the scraped reviews to be stored. Chromedriver should also be located in this folder. For example: 

  ```
  "/Users/Daniel/OneDrive - UNC Kenan-Flagler Business School/Teaching/2022-Spring/MBA742/ReviewScraper"
  ```

2. Open your Terminal (Mac) or Anaconda/Command Prompt (Windows) and navigate to that folder:
  ```
  cd "OneDrive - UNC Kenan-Flagler Business School/Teaching/2022-Spring/MBA742/ReviewScraper"
  ```

3. Make sure your Python and Anaconda are up to date:

  ```
  conda update conda --yes
  conda update anaconda --yes
  conda update python --yes
  conda update --all --yes
  ```

4. If you have not already done so, create a new Python environment by cloning your base environment (this can take a moment):

  ```
  conda create --name Reviews --clone base
  conda activate Reviews
  python -m ipykernel install --user --name Reviews --display-name "Reviews"
  ```

5. Make sure you have activated your new environment:

  ```
  conda activate Reviews
  ```

### 3.2.2 Install Libraries to New Environment

In your terminal (Anaconda / command prompt), type the following to install the required libraries:

  ```
  conda activate Reviews
  ```
  
  ```
  pip install pandas
  ```
  
  ```
  pip install selenium
  ```
  
  ```
  pip install beautifulsoup4
  ```
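
To confirm that the three installs worked, you can ask Python whether it can find each library. The following is a minimal sketch using only the standard library; the function name `missing_packages` is hypothetical. Note that beautifulsoup4 is imported under the name `bs4`.

```python
import importlib.util


def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]


# beautifulsoup4 is installed with pip under that name but imported as "bs4"
required = ["pandas", "selenium", "bs4"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

If a name is listed as missing, activate the `Reviews` environment and run the matching `pip install` command again.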

### 3.2.3 Reload Notebook and Change Kernel
To use our new environment, we need to reload the notebook and change the kernel
- Click on File > Close and Halt
- Go back to the Jupyter tab in your browser and load the notebook again
- Click on Kernel > Change Kernel > Reviews

You should see "Reviews" in the top right-hand corner of your notebook (right under the "Logout" button)
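
You can also check from inside the notebook which environment the kernel is running in. As a sketch, assuming conda's default layout in which a named environment lives in a folder of the same name, the last part of `sys.prefix` should read `Reviews`. The helper `env_name_from_prefix` is made up for this example:

```python
import sys
from pathlib import Path


def env_name_from_prefix(prefix: str) -> str:
    """Return the last folder of an environment path (conda's environment name by default)."""
    return Path(prefix).name


# sys.executable is the Python interpreter behind the current kernel
print("Python executable:", sys.executable)
print("Environment folder:", env_name_from_prefix(sys.prefix))
```

If the environment folder shows something other than `Reviews` (for example the name of your Anaconda installation folder), the notebook is still running on the base kernel.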

## 3.3 Step 3: Import some Libraries

In [None]:
from selenium import webdriver
import numpy as np
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.chrome.service import Service
import time
import random
import datetime
import pandas as pd
import os
import json
import pickle
import warnings

## 3.4 The ReviewScraper Code

- We wrote this scraper specifically for this course so that you can use it in your team assignment.
- You don't need to understand the code in detail. That is beyond the scope of this course.
- You will need to use the code if you want to collect reviews from the internet.
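
One detail of the scraper's `scrape` method (defined further down) is worth knowing before you run it: results are written to a CSV file whose name ends in a running number, starting with `final0.csv`, and the number is increased when a file of that name already exists, so earlier results are not overwritten. The stand-alone sketch below illustrates the naming idea only; the function `next_free_csv_name` is hypothetical and not part of the scraper.

```python
import os
import tempfile


def next_free_csv_name(csv_dir, final_name="final"):
    """Return the first '<final_name><n>.csv' that does not exist yet in csv_dir."""
    n = 0
    while os.path.isfile(os.path.join(csv_dir, f"{final_name}{n}.csv")):
        n += 1
    return f"{final_name}{n}.csv"


# Demonstration in a temporary folder so nothing on your computer is touched
with tempfile.TemporaryDirectory() as tmp:
    print(next_free_csv_name(tmp))  # final0.csv
    open(os.path.join(tmp, "final0.csv"), "w").close()
    print(next_free_csv_name(tmp))  # final1.csv
```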

Please run through the next three cells, which include the review scraper class and helper functions

In [None]:
class ReviewScraper:
    def __init__(self, path_to_driver, site):
        """
        :param path_to_driver: the path to your ChromeDriver. 
        :param site: the review site to scrape, either "Google_review" or "tpadvisor".
        """
        self.path_to_driver = path_to_driver
        self.site = site
        
    def scrapeGoogle(self,
                url, 
                driver, 
                scroll=True,
                slptime=2.25,
                creview = "Svr5cf bKhjM",
                treview = 'div',
                ctime_site = "iUtr1",
                ttime_site = 'span',
                cname = "DHIhE" ,
                cnameNotInGoogle = "faBUBf",
                tname = 'a',
                tnameNotInGoogle = 'span',
                crating = "MfbzKb",
                trating = 'div',
                ctripType = "VURE3b",
                ttripType = 'span',
                crsl = "dA5Vzb",
                trsl = 'div',
                ctext = 'K7oBsc',
                ttext = 'div',
                chtname = 'CQYfx hAP9Pd gEBR9d',
                thtname = 'a',
                calthtname = 'nCqM5e',
                talthtname = 'div',
                creply = 'n7uVJf',
                treply = 'div'):
        
        """
        Scrapes results from a hotel review page from Google Travel
        :param url: the url to the review site
        :param driver: the web driver object
        :param scroll: whether or not to scroll down until all the reviews are available for scraping. 
        :param slptime: time in seconds to wait after each scroll
        
        # HTML tag information
        :param creview: class name of the tag containing the whole review section  
        :param treview: type of the tag containing the whole review section  
        
        :param ctime_site: class name of tag containing time and site
        :param ttime_site: type of tag containing time and site
        
        :param cname: class name of the tag containing reviewer's name if the review is from Google
        :param cnameNotInGoogle: class name of the tag containing reviewer's name if the review is not from Google
        :param tname: type of tag containing reviewer's name if the review is from Google
        :param tnameNotInGoogle: type of tag containing reviewer's name if the review is not from Google
        
        :param crating: class name of tag containing overall rating 
        :param trating: type of tag containing overall rating 

        :param ctripType: class name of tag containing trip type 
        :param ttripType: type of tag containing trip type 

        # the tags containing ratings for rooms, services, and location (r,s,l) all have the same class and type, so we scrape them together 
        :param crsl: class name of tag containing r,s,l
        :param trsl: type of tag containing r,s,l

        :param ctext: class name of tag containing review text. There may be multiple components within each review
        :param ttext: type of tag containing review text. There may be multiple components within each review
        
        :param chtname: class name of tag containing the hotel name.
        :param thtname: type of tag containing the hotel name.
        
        # The alternative hotel name is at the bottom of the page
        :param calthtname: class of tag containing the alternative location of the hotel name (in case it does not load using the previous tag)
        :param talthtname: type of tag containing the alternative location of the hotel name (in case it does not load using the previous tag)
        
        :param creply: class of tag containing all the hotel's reply
        :param treply: type of tag containing all the hotel's reply
        
        """

        print("Collecting data from specified URL ...")
        driver.get(url)  
        time.sleep(5)
        driver.set_window_size(1050,660)

        if scroll:
            while True:  
                print("Scrolling down to get all information... \n\n**** Do not manipulate the chrome browser window in any way ****\n\n")
                print("Depending on the total number of reviews, this may take a long time.")
                print("Do not allow your computer or screen to go to sleep during the scraping process.")
                scroll_down(driver, sleepTime=slptime)
                time.sleep(slptime)
                lctime = time.time()
                slptime += 1
                scroll_down(driver, sleepTime=slptime)
                if time.time() - lctime > slptime+2:
                    print("WARNING: Your sleeptime may be too slow, adjusting and restarting scroll...")
                    continue
                break
                

        print("Starting data extraction (parsing HTML code) ...")
        content = driver.page_source

        months, names, ratings, sites, tripTypes, rooms, locations, services, ids, texts, replys = (list() for l in range(11))
        
        soup = BeautifulSoup(content)
        
        # get hotel name         
        htnameTag = soup.find(thtname, class_=chtname)
        if htnameTag is None:
            htnameTag = soup.find(talthtname, class_=calthtname)
            assert htnameTag is not None, "Hotel name cannot be found in any of the two possible spots"
        htname = htnameTag.get_text()
        print('Collecting data for: ' + htname)

        # parse each review
        for review in soup.findAll(treview, class_=creview):
            soup2 = BeautifulSoup(str(review))

            # get name of each reviewer
            name = soup2.find(tname, class_=cname)
            if name is None:
                names.append(soup2.find(tnameNotInGoogle, class_=cnameNotInGoogle).get_text())
            else:
                names.append(name.get_text())

            # get review time and site 
            time_site = soup2.find(ttime_site, class_=ctime_site).get_text()
            month = ' '.join(time_site.split(' ')[:-3])
            site = time_site.split(' ')[-1]
            months.append(month)
            sites.append(site)

            # get overall rating 
            ratings.append(soup2.find(trating, class_=crating).get_text())

            # get trip type
            tripType = soup2.find(ttripType, class_=ctripType)
            tripTypes.append('N/A' if tripType is None else tripType.get_text())

            # get ratings for rooms, services, and locations (any number of any of these may appear)
            
            # TODO: simplify this block
            rsl = soup2.findAll(trsl, class_=crsl)
            if len(rsl) == 0:
                rooms.append('N/A')
                services.append('N/A')
                locations.append('N/A')
            else:
                hasRoom = False
                hasServices = False
                hasLocations = False
                for rt in rsl:
                    txt = rt.get_text()
                    if 'Room' in txt:
                        rooms.append(txt[-3:])
                        hasRoom = True
                    elif 'Serv' in txt:
                        services.append(txt[-3:])
                        hasServices = True
                    elif 'Loc' in txt:
                        locations.append(txt[-3:])
                        hasLocations = True
                    else:
                        assert False, "rsl tag found, but none of rsl appeared"
                if not hasRoom:
                    rooms.append("N/A")
                if not hasServices:
                    services.append("N/A")
                if not hasLocations:
                    locations.append("N/A")
                assert len(rooms) == len(services) == len(locations), "wrong array length after appending rsl ratings"        

            # get the review text (there may be multiple segments; the last one is used)
            textSegs = soup2.findAll(ttext, class_=ctext)
#             print(textSegs[-1].get_text())
#             allText = list()
#             for segment in textSegs:
#                 if len(textSegs) > 1:
#                     print(segment.get_text())
#                 allText.append(segment.get_text())
#             text = ' '.join(allText)
#             print(len(textSegs))
            texts.append(textSegs[-1].get_text() if len(textSegs)!=0 else "N/A")
    
    
            reps = soup2.findAll(treply, class_=creply)
            assert len(reps)<2, "text segment error, please contact Ziyao"
            replys.append(reps[0].get_text() if len(reps)==1 else 'N/A')

        # generate id for all ratings for each hotel
        ids = np.arange(len(services))   
        htnames = [htname] * len(services)

        assert len(replys) == len(names) == len(months) == len(ratings) == len(sites) == len(tripTypes) == len(rooms) == len(locations) == len(services) == len(texts) == len(ids) == len(htnames), 'different number of inputs for some review features'

        # convert each review feature list into numpy arrays
        months = np.array(months)
        names = np.array(names)
        ratings = np.array(ratings).astype('str')
        sites = np.array(sites)
        tripTypes = np.array(tripTypes)
        rooms = np.array(rooms)
        locations = np.array(locations)
        services = np.array(services)
        ids = np.array(ids)
        texts = np.array(texts)
        htnames = np.array(htnames)
        replys = np.array(replys)

        # vertically stack each feature together
        allInfo = np.vstack([ids, htnames, names,months,ratings,sites,tripTypes,rooms,locations,services,texts,replys])
        #print('shape of final array:')
        #print(allInfo.shape)

        return allInfo    

    
   
    def scrapeAdvisor(self, url, driver, scroll=True):
        """
        Scrapes results from a hotel review page from TripAdvisor
        :param url: the url to the review site
        :param driver: the web driver object
        :param scroll: whether or not to press "next page" until all the reviews are gathered 
        """
        
        # class and type of tag containing both the reviewer username and the review time
        cname_time = 'bcaHz'
        tname_time = 'div'
        
        # class and type of the tag containing the reviewer username (unverified/verified)
        cname = "ui_header_link bPvDb"
        tname = 'a'
        cnameverified = 'ui_header_link bPvDb verified'
        tnameverified = 'a'
        
        # class and type of the tag containing each review block
        creview = "cWwQK MC R2 Gi z Z BB dXjiy"
        treview = 'div'

        # class and type of the tag containing the rating, we use the tag containing the span 
        crating = "emWez F1"
        trating = 'div'

        # class and type of the tag containing the review title, we use the tag containing the span
        ctitle = "fCitC"
        ttitle = 'a'

        # class and type of tag containing all other optional ratings 
        coptional = "cnFkU Me f"
        toptional = 'div'

        # class and type of tag of the review text
        ctext = "XllAv H4 _a"
        ttext = 'q'

        # class of tag of expand button
        cexpand = 'eljVo _S Z'
        
        # class and type of tag of hotel name
        chtname = 'fkWsC b d Pn'
        thtname = 'h1'
        
        # class and type of the number of contributions as well as the number of upvotes (same)
        ccontributions = 'ckXjS'
        tcontributions = 'span'
        
        # class and type of the time of stay
        ctimestay = "euPKI _R Me S4 H3"
        ttimestay = "span"
        
        # class and type of reviewer location 
        creviewerloc = "default ShLyt small"
        treviewerloc = "span"
        
        # class and type of the hotel's reply
        creply = "eBsXT _a"
        treply = 'span'
        
        #class and type of travel type 
        ctptype = 'eHSjO _R Me'
        ttptype = 'span'
        
        #sleep time for DOM reload
        sleepTime = 2.5

        htnames, times, names, ratings, titles, values, rooms, locations, cleanliness, services, slpqualitys, texts, numcontributions, timestays, reviewerlocs, numupvotes, replys, tptypes = (list() for i in range(18))
        
        print("getting url...")
        driver.get(url) 
        time.sleep(5)
        
        # get hotel name   
        content = driver.page_source
        soup = BeautifulSoup(content)
        htnameTag = soup.find(thtname, class_=chtname)
        assert htnameTag is not None, "Hotel name not found"
        htname = htnameTag.get_text()
        print('Collecting data for: ' + htname)
        
        
        while True:
            content = driver.page_source
            soup = BeautifulSoup(content)
            all_reviews = soup.findAll(treview, class_=creview)

            for review in all_reviews:
                soup2 = BeautifulSoup(str(review))
                name_time = soup2.findAll(tname_time, class_=cname_time)

                # add review time 
                times.append(str(name_time).split("wrote a review ")[1].split('</')[0])

                #add reviewer name
                reviewer = soup2.find(tname, class_=cname)
                if reviewer is None:
                    reviewer = soup2.find(tnameverified, class_=cnameverified)
                names.append(reviewer.get_text())

                #add overall rating
                ratings.append(str(soup2.find(trating, class_=crating)).split("ui_bubble_rating bubble_")[1][0])

                # add review title
                titles.append(soup2.find(ttitle, class_=ctitle).get_text())

                # add all optional ratings
                optionals = soup2.find(toptional, class_=coptional)
                if optionals is None:
#                     print("None")
                    values.append('N/A') 
                    rooms.append('N/A')
                    locations.append('N/A')
                    cleanliness.append('N/A')
                    services.append('N/A')
                    slpqualitys.append('N/A')
                else:
                    op = [j for s in str(optionals).split('</span></div>') for j in s.split('</span></span><span>')]
                    values.append(op[op.index('Value')-1][-4] if 'Value' in op else "N/A")
                    rooms.append(op[op.index('Rooms')-1][-4] if 'Rooms' in op else "N/A")
                    locations.append(op[op.index('Location')-1][-4] if 'Location' in op else "N/A")
                    cleanliness.append(op[op.index('Cleanliness')-1][-4] if 'Cleanliness' in op else "N/A")
                    services.append(op[op.index('Service')-1][-4] if 'Service' in op else "N/A")
                    slpqualitys.append(op[op.index('Sleep Quality')-1][-4] if 'Sleep Quality' in op else "N/A")

                #add review text 
                texts.append(soup2.find(ttext, class_=ctext).get_text())
                
                #add number of contributions
                c = soup2.findAll(tcontributions, class_=ccontributions)
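                if len(c) == 1:
                    numcontributions.append(c[0].get_text())
                    numupvotes.append("N/A")
                elif len(c) == 2:
                    numcontributions.append(c[0].get_text())
                    numupvotes.append(c[1].get_text())
                else:
                    # no (or an unexpected number of) matching tags: keep all lists the same length
                    numcontributions.append("N/A")
                    numupvotes.append("N/A")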
            
                
                #add the time of stay ttimestay  
                t = soup2.find(ttimestay, class_=ctimestay)
                timestays.append(t.get_text() if t is not None else "N/A")
                
                #add the location of the reviewer 
                l = soup2.find(treviewerloc, class_=creviewerloc)
                reviewerlocs.append(l.get_text() if l is not None else "N/A")
                
                # add the reply from the hotel
                r = soup2.find(treply, class_=creply)
                replys.append(r.get_text() if r is not None else "N/A")
                
                tp = soup2.find(ttptype, class_=ctptype)
                tptypes.append(tp.get_text()[11:] if tp is not None else "N/A")

    
            if not scroll:
                break
                
            # go to the next page by clicking on "next"
            next_buttons = driver.find_elements(By.XPATH, "//*[@class='{next_page}']".format(next_page = "ui_button nav next primary "))
            if (len(next_buttons)) == 1:
#                 print("Going to the next page...")
#                 next_buttons[0].click()
                driver.execute_script("arguments[0].click();", next_buttons[0])


                # wait until page is completely loaded
                time.sleep(random.uniform(sleepTime,sleepTime + 0.25))
                while str(driver.execute_script("return document.readyState")) != "complete" or not (driver.find_elements(By.XPATH, "//*[@class='{expand}']".format(expand = cexpand))[0].is_displayed()):
                    time.sleep(random.uniform(0,0.25))
                    
                # expand reviews 
                expands = driver.find_elements(By.XPATH, "//*[@class='{expand}']".format(expand = cexpand))
                assert expands[0].is_displayed(), "expand url not properly loaded"
#                 expands[0].click()
                driver.execute_script("arguments[0].click();", expands[0])
            else:
#                 print("Reached the last page")
                break 

        ids = np.arange(len(texts))
        htnames = np.array([htname] * len(services))
        
        
        assert len(tptypes) == len(replys) == len(reviewerlocs) == len(numupvotes) == len(timestays) == len(numcontributions) == len(htnames) == len(times) == len(names) == len(ratings) == len(titles) == len(values) == len(rooms) == len(locations) == len(cleanliness) == len(services) == len(slpqualitys) == len(texts)

        ids = np.array(ids)
        texts = np.array(texts).astype('str')
        values = np.array(values)
        rooms = np.array(rooms)
        locations = np.array(locations)
        cleanliness = np.array(cleanliness)
        services = np.array(services)
        slpqualitys = np.array(slpqualitys)
        times = np.array(times)
        names = np.array(names)
        ratings = np.array(ratings)
        titles = np.array(titles)
        numcontributions = np.array(numcontributions)
        numupvotes = np.array(numupvotes)
        timestays = np.array(timestays)
        reviewerlocs = np.array(reviewerlocs)
        replys = np.array(replys)
        tptypes = np.array(tptypes).astype('str')
        

        allInfo = np.vstack([ids,htnames,texts,values,rooms,locations,cleanliness,services,slpqualitys,times,names,ratings,titles,numcontributions, numupvotes, timestays,reviewerlocs, replys, tptypes])
        print("shape of final array: " + str(allInfo.shape))
        
        return allInfo



        
    def scrape(self, 
               url, 
               csv_dir,
               npy_dir,
               final_name="final",
               scroll=False):
        """
        Scrapes the results from a url or an array of urls.
        
        :param url: the url to the review site (Google or TripAdvisor). It must be either a String or a list/array of Strings.  
        :param npy_dir: a numpy array containing the scraped reviews for each hotel will be saved in this directory.
        :param csv_dir: the final report containing all the scraped results will be saved in this directory.
        :param final_name: the name the final file will be saved as.
        # since Google only releases ten reviews each time it reloads, scrolling may take a long time for hotels with a large number of comments
        :param scroll: whether or not to continue scrolling down until all the results are shown in the window and can be scraped; for TripAdvisor, whether or not to click "next page" until the last page. 
        
        """
        
        assert (csv_dir is not None) and (npy_dir is not None), "both csv_dir and npy_dir must be provided" 
#         driver = webdriver.Chrome(self.path_to_driver)

        driver = webdriver.Chrome(service=Service(self.path_to_driver), options=webdriver.ChromeOptions())
        
        if (self.site == "Google_review"):
            if isinstance(url, str):
                res = self.scrapeGoogle(url, driver, scroll=scroll)
                
                numReviews = res.shape[1]

                #add timestamp
                timestamp = np.array([str(datetime.datetime.now())]*numReviews)
                finalArr = np.vstack([res, timestamp])

                #save numpy array
                np.savez_compressed(os.path.join(csv_dir, 'temp.npz').replace("\\","/"), a=finalArr)

                # generate final report
                headers = np.array(['id_hotel', 'hotel', 'name', 'time', 'rating', 'site', 'trip type', 'room rating', 'location rating', 'service rating', 'text', 'hotel reply', 'timestamp']) 
                hotel = np.load(os.path.join(csv_dir, 'temp.npz').replace("\\","/"))["a"].T

                # save as string and add indexes
                df = pd.DataFrame(hotel, columns = headers).astype(str)
                # append a running number so that existing files are never overwritten
                duplicate = 0
                base_name = final_name
                final_name = base_name + str(duplicate)
                while os.path.isfile(os.path.join(csv_dir, final_name + '.csv')):
                    duplicate += 1
                    final_name = base_name + str(duplicate)

                df.to_csv(os.path.join(csv_dir, final_name + '.csv'), encoding='utf-8-sig', index=True)
                print("Scraped " + str(numReviews) + " reviews")
                
            elif isinstance(url, (list, tuple, np.ndarray)):
                headers = np.array(['id_hotel', 'hotel', 'name', 'time', 'rating', 'site', 'trip type', 'room rating', 'location rating', 'service rating', 'text', 'hotel reply', 'timestamp']) 
                all_hotels = np.zeros(len(headers)) 
                np.savez_compressed(os.path.join(csv_dir, 'temp.npz').replace("\\","/"), a=all_hotels)
                first = True
                for index in range(len(url)):
                    print('Scraping Target: ' + url[index])
                    res = self.scrapeGoogle(url[index], driver, scroll=scroll)
                    numReviews = res.shape[1]
                    print("Scraped " + str(numReviews) + " reviews")

                    #add timestamp
                    timestamp = np.array([str(datetime.datetime.now())]*numReviews)
                    finalArr = np.vstack([res, timestamp])
                    
                    #load temp array
                    hotels = np.load(os.path.join(csv_dir, 'temp.npz').replace("\\","/"))["a"]
                    if first:
                        all_hotels = np.vstack([hotels, finalArr.T]).T
                        first = False
                    else:
                        all_hotels = np.hstack([hotels, finalArr])
                    
                    #save temp array
                    np.savez_compressed(os.path.join(csv_dir, 'temp.npz').replace("\\","/"), a=all_hotels)
                   
                    
                # generate final report 
                all_hotels = np.load(os.path.join(csv_dir, 'temp.npz').replace("\\","/"))["a"].T
                df = pd.DataFrame(all_hotels[1:, :], columns = headers).astype(str)
                # append a running number so that existing files are never overwritten
                duplicate = 0
                base_name = final_name
                final_name = base_name + str(duplicate)
                while os.path.isfile(os.path.join(csv_dir, final_name + '.csv')):
                    duplicate += 1
                    final_name = base_name + str(duplicate)

                df.to_csv(os.path.join(csv_dir, final_name + '.csv').replace("\\","/"), encoding='utf-8-sig', index=True)
                
                
                
        elif (self.site == "tpadvisor"):
            if isinstance(url, str):
                res = self.scrapeAdvisor(url, driver, scroll=scroll)
                numReviews = res.shape[1]
                
                #add timestamp
                timestamp = np.array([str(datetime.datetime.now())]*numReviews)      
                finalArr = np.vstack([res, timestamp])

                #save numpy array
                np.savez_compressed(os.path.join(csv_dir, 'temp.npz').replace("\\","/"), a=finalArr)

                # generate final report   added: numcontributions, numupvotes, timestays,reviewerlocs, replys, tptypes
                headers = np.array(['ID','hotel name','review text','value rating','room rating','location rating','cleanliness rating','service rating','sleep quality','date','rater name','overall rating','rating title', 'number of contributions','number of upvotes', 'time of stay','reviewer location', 'hotel reply', 'trip type', 'timestamp']) 
                hotel = np.load(os.path.join(csv_dir, 'temp.npz').replace("\\","/"))["a"].T

                # save as string and add indexes
                df = pd.DataFrame(hotel, columns = headers).astype(str)
                duplicate = 1
                final_name += '0'
                while True: 
                    if os.path.isfile(os.path.join(csv_dir, final_name+'.csv').replace("\\","/")):
                        final_name = final_name[:-1] + str(duplicate)
                        duplicate += 1 
                        continue
                    break

                df.to_csv(os.path.join(csv_dir, final_name+'.csv'), encoding='utf-8-sig', index=True)
                print("Scraped " + str(numReviews) + " reviews")

            elif isinstance(url, (list, tuple, np.ndarray)):
                headers = np.array(['ID','hotel name','review text','value rating','room rating','location rating','cleanliness rating','service rating','sleep quality','date','rater name','overall rating','rating title', 'number of contributions','number of upvotes', 'time of stay','reviewer location', 'hotel reply', 'trip type', 'timestamp']) 
                all_hotels = np.arange(len(headers)) 
                np.savez_compressed(os.path.join(csv_dir, 'temp.npz').replace("\\","/"), a=all_hotels)
                first = True
                for index in range(len(url)):
                    print('Scraping Target: ' + url[index])
                    res = self.scrapeAdvisor(url[index], driver, scroll=scroll)
                    numReviews = res.shape[1]
                    print("Scraped " + str(numReviews) + " reviews")

                    #add timestamp
                    timestamp = np.array([str(datetime.datetime.now())]*numReviews)
                    finalArr = np.vstack([res, timestamp])
                    
                    #load temp array
                    hotels = np.load(os.path.join(csv_dir, 'temp.npz').replace("\\","/"))["a"]

                    if first:
                        all_hotels = np.vstack([hotels, finalArr.T]).T
                        first = False
                    else:
                        all_hotels = np.hstack([hotels, finalArr])
                    
                    #save temp array
                    np.savez_compressed(os.path.join(csv_dir, 'temp.npz').replace("\\","/"), a=all_hotels)
                    
                # generate final report 
                all_hotels = np.load(os.path.join(csv_dir, 'temp.npz').replace("\\","/"))["a"].T
                df = pd.DataFrame(all_hotels[1:, :], columns = headers).astype(str)
                
                duplicate = 1
                final_name += '0'
                while True: 
                    if os.path.isfile(os.path.join(csv_dir, final_name+'.csv').replace("\\","/")):
                        final_name = final_name[:-1] + str(duplicate)
                        duplicate += 1 
                        continue
                    break

                df.to_csv(os.path.join(csv_dir, final_name+'.csv').replace("\\","/"), encoding='utf-8-sig', index=True)
        else:
            print("We are sorry, this website is currently not supported")
            return
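
Each branch of scrape() above repeats the same loop to find a file name that is not taken yet. One way to factor that loop out is sketched below. `next_free_csv_path` is a hypothetical helper, not part of the ReviewScraper class. It counts upward from 0, whereas the loop above swaps only the last character of the name, so after `name9` it moves on to `name10` and then `name111`.

```python
import os
import tempfile


def next_free_csv_path(csv_dir, final_name):
    """Return csv_dir/final_name<k>.csv for the smallest k >= 0 that does not exist yet.

    Hypothetical helper for illustration; the scraper itself does not define it.
    """
    k = 0
    while True:
        candidate = os.path.join(csv_dir, final_name + str(k) + ".csv")
        if not os.path.isfile(candidate):
            # Same slash normalization as in the scraper code
            return candidate.replace("\\", "/")
        k += 1


# Usage with a throwaway folder
demo_dir = tempfile.mkdtemp()
first = next_free_csv_path(demo_dir, "report")
open(first, "w").close()          # pretend the first report was written
second = next_free_csv_path(demo_dir, "report")
print(os.path.basename(first), os.path.basename(second))
```

Each call site could then pass the returned path straight to `df.to_csv(...)`.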
                
        

In [None]:
def scroll_down(driver, scrollDistanceToBottom=700, sleepTime=2.25):
    """
    Scroll the page down repeatedly until no new content loads.
    :param driver: the web driver object.

    # Tune these parameters according to your browser and screen
    :param scrollDistanceToBottom: Google does not load more results if you scroll directly to the very bottom. This parameter is the margin (in pixels) between the position we scroll to and the bottom of the page.
    :param sleepTime: the time (in seconds) we wait after each scroll for the page to finish loading. Scrolling stops when the page height has not changed after this wait.

    """

    # Get scroll height.
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:

        # Scroll down to the bottom.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight-"+ str(scrollDistanceToBottom) + ");")

        # Wait to load the page.
        time.sleep(random.uniform(sleepTime,sleepTime+0.25))

        # Calculate new scroll height and compare with last scroll height.
        new_height = driver.execute_script("return document.body.scrollHeight")

        if new_height == last_height:
            break

        last_height = new_height
        
# this scrolling function is inspired by Ratmir Asanov's answer at https://stackoverflow.com/questions/48850974/selenium-scroll-to-end-of-page-in-dynamically-loading-webpage
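
The stopping rule of scroll_down() can be tried out without opening a browser. The sketch below repeats the same loop against `FakePage`, a made-up stand-in for the Selenium driver that grows a fixed number of times and then stops. The class, its numbers, and the function name `scroll_until_stable` are assumptions for illustration only.

```python
import time


class FakePage:
    """Made-up stand-in for a Selenium driver: every scroll loads more content until it runs out."""

    def __init__(self, start_height=1000, step=800, loads=3):
        self.height = start_height
        self.step = step
        self.loads_left = loads
        self.scroll_calls = 0

    def execute_script(self, script):
        if script.startswith("return"):
            # "return document.body.scrollHeight" asks for the current page height
            return self.height
        # Any other script is treated as a scroll, which may trigger one more load
        self.scroll_calls += 1
        if self.loads_left > 0:
            self.height += self.step
            self.loads_left -= 1
        return None


def scroll_until_stable(driver, margin=700, sleep_time=0.0):
    """Same loop as scroll_down(): stop once a scroll no longer changes the page height."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script(
            "window.scrollTo(0, document.body.scrollHeight-" + str(margin) + ");"
        )
        time.sleep(sleep_time)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return new_height
        last_height = new_height


page = FakePage()
final_height = scroll_until_stable(page)
print(final_height, page.scroll_calls)
```

Note that the loop always needs one extra scroll that loads nothing before it stops. With a real page, a sleep time that is too short makes that "nothing loaded" check fire too early.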

In [None]:
def loadCheckpoint(checkpoint_path, csv_dir, final_name, site):
    """converts a checkpoint npz file into a csv report"""
    temp = np.load(checkpoint_path)["a"]
    headers=[]
    if site=="tpadvisor":
        headers = np.array(['ID','hotel name','review text','value rating','room rating','location rating','cleanliness rating','service rating','sleep quality','date','rater name','overall rating','rating title', 'number of contributions','number of upvotes', 'time of stay','reviewer location', 'hotel reply', 'trip type', 'timestamp']) 
    elif site=="Google_review":
        headers = np.array(['id_hotel', 'hotel', 'name', 'time', 'rating', 'site', 'trip type', 'room rating', 'location rating', 'service rating', 'text', 'hotel reply', 'timestamp'])
    else: 
        print("site not supported")
        return
    
    print("Loading checkpoint...")
    # generate final report; the first row after transposing is the placeholder written at the start of a multi-URL run
    all_hotels = np.load(checkpoint_path)["a"].T
    df = pd.DataFrame(all_hotels[1:, :], columns = headers).astype(str)

    duplicate = 1
    final_name += '0'
    while True: 
        if os.path.isfile(os.path.join(csv_dir, final_name+'.csv').replace("\\","/")):
            final_name = final_name[:-1] + str(duplicate)
            duplicate += 1 
            continue
        break

    df.to_csv(os.path.join(csv_dir, final_name+'.csv').replace("\\","/"), encoding='utf-8-sig', index=True)
    print("Success!")
    
    
    

## 3.5 Let's Scrape some Reviews with our ReviewScraper from Google

We need to provide some information to the ReviewScraper:

* **path_to_driver**: The path to your chromedriver.exe file (String).  


* **csv_dir**: where to save the final CSV file containing all the scraped content (String, path to a folder).  


* **npy_dir**: where the temp.npz checkpoint will be saved (String, path to a folder). This file contains all previously scraped records in case of errors or internet problems. Note that this file is overwritten each time the scrape() function is run with the same npy_dir. If you want to convert the checkpoint into a CSV report, please use the loadCheckpoint() function (more details later).  


* **site**: The review site you want to scrape from. (String, It must be either "Google_review", or "tpadvisor").  


* **final_name**: The file name to which the scraped data is to be saved as (String).  


* **scroll**: whether or not to get **all** the results available for each hotel (Boolean). If scroll=True, the scraper will repeatedly scroll down or press "next page" until every review is available for scraping. This may take a very long time, since we need to wait for the page to load every time new reviews are requested. If scroll=False, the scraper will not attempt to load any reviews beyond those present on the first page of the website.


* **url**: The URL of the hotel's review webpage (String or list of Strings: either a single URL for a single hotel property, or a list of URLs). Make sure that the URLs are from the review website (Google vs. TripAdvisor) that you specified above in ***site***.  
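
Before starting a long run, it can help to check these inputs up front. The sketch below is one possible pre-flight check. `preflight` is a hypothetical helper (ReviewScraper does not provide it), and the host fragments used to match URLs to sites are assumptions.

```python
import os
import tempfile
from urllib.parse import urlparse

# Assumption: a URL belongs to a site if its host name contains this fragment
SITE_HOSTS = {"Google_review": "google.", "tpadvisor": "tripadvisor."}


def preflight(path_to_driver, csv_dir, npy_dir, site, url):
    """Return a list of problems found in the scraper inputs (empty list = looks fine)."""
    problems = []
    if not os.path.isfile(path_to_driver):
        problems.append("chromedriver not found: " + path_to_driver)
    for label, folder in (("csv_dir", csv_dir), ("npy_dir", npy_dir)):
        if not os.path.isdir(folder):
            problems.append(label + " is not an existing folder: " + folder)
    if site not in SITE_HOSTS:
        problems.append('site must be "Google_review" or "tpadvisor", got: ' + repr(site))
        return problems
    # Accept a single URL or a list of URLs, just like scrape()
    urls = [url] if isinstance(url, str) else list(url)
    for u in urls:
        host = urlparse(u).netloc.lower()
        if SITE_HOSTS[site] not in host:
            problems.append("URL does not look like a " + site + " page: " + u)
    return problems


# Usage with a throwaway folder and an empty file standing in for chromedriver
demo_dir = tempfile.mkdtemp()
demo_driver = os.path.join(demo_dir, "chromedriver.exe")
open(demo_driver, "w").close()
tripadvisor_url = "https://www.tripadvisor.com/Hotel_Review-g38020-d91957-Reviews-Graduate_Iowa_City-Iowa_City_Iowa.html"

ok = preflight(demo_driver, demo_dir, demo_dir, "tpadvisor", tripadvisor_url)
mismatch = preflight(demo_driver, demo_dir, demo_dir, "Google_review", [tripadvisor_url])
print(ok)
print(mismatch)
```

The check only looks at file paths and host names. It does not open a browser or confirm that the page actually shows reviews.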


In [None]:
# 1 Initialize File Locations

# 1a Please specify the path to your chromedriver file here (include filename and extension of chromedriver!)
chromedrive_path = r'C:\Users\zs\Desktop\forIntern\chromedriver.exe' # replace with your local folder

# 1b Please enter the paths where you want the scraped data to be stored (do not include filenames here!)
csv_dir=r'C:\Users\zs\Desktop\forIntern\junk' # replace with your local folder
npy_dir=r'C:\Users\zs\Desktop\forIntern\junk' # replace with your local folder 

### <font color='red'>Important:</font>
    
1. When you run the cell below, you will see a Chrome browser window pop up. Look at it and make sure that the page is scrolling down repeatedly for more information.   
    
    
2. Apart from watching it, ***do not interact with the Chrome browser window in any other way***. Clicking, scrolling, or resizing may cause the scraper to not work properly.  
    
    
3. If you are running code that takes a long time, ***please make sure that your computer does not go to sleep*** and that your screen does not turn off in the middle of the run. Either will cause problems. 

In [None]:
%%time

# 2 Define a single url
url='https://www.google.com/travel/hotels/graduate%20hotel%20chapel%20hill/entity/CgsInqmaiMnBi4yIARAB/reviews?g2lb=4649665,4640247,4258168,4596364,4718399,4270442,4716126,4317915,4605861,4306835,4641139,4707947,4401769,2502548,2503781,4597339,4647135,2503771,4270859,4291517,4284970&hl=en-US&gl=us&ssta=1&grf=EmQKLAgOEigSJnIkKiIKBwjmDxACGBsSBwjmDxACGBwgADAeQMoCSgcI5g8QAhgQCjQIDBIwEi6yASsSKQonCiUweDg5YWNjMmUxMjFjNTk5NWY6MHg4ODE4MmUwYzkxMDY5NDll&q=graduate+hotel+chapel+hill&rp=EJ6pmojJwYuMiAEQnqmaiMnBi4yIATgCQABIAcABAg&ictx=1&sa=X&ved=2ahUKEwiWqMDxu4T2AhWYpnIEHYwvAbUQ4gl6BAgfEAU'

# 2a Instantiate ReviewScraper
# warnings.filterwarnings("ignore", category=DeprecationWarning)
scraper = ReviewScraper(chromedrive_path, site="Google_review")

# 2b Scrape all possible results for this hotel (this will take some time)
scraper.scrape(url=url, 
               csv_dir=csv_dir, 
               npy_dir=npy_dir,
               final_name='singlegoogleurl',
               scroll=True)

#### **Once the scraping process has finished, you should have collected more than 300 reviews.**

The reviews are saved to a CSV file in the folder that you specified (csv_dir folder). 

***If you scraped fewer than 300 reviews for Graduate Hotel Chapel Hill, look into these possible causes:***
1. The browser loads the new reviews too slowly, so the scraper thinks that there are no more reviews and stops scrolling down. To find out whether this is the problem, rerun the above cell and check whether you get a different number of reviews each time. You could fix this problem by modifying the "slptime" variable in the function "scrapeGoogle" in the ReviewScraper class. (ReviewScraper is able to auto-adjust the sleeping time if the loading time isn't __too__ slow. Whenever the sleeping time is adjusted, you will get the "your sleeptime may be too slow" message.)
2. The three problems specified in <font color='red'>"Important:"</font> in the previous instruction cell.
3. Random website or internet outages.
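
The rerun check from cause 1 can be done by counting the rows in each report. Because scrape() appends a running number to final_name (singlegoogleurl0.csv, singlegoogleurl1.csv, and so on), repeated runs end up side by side in csv_dir. The sketch below counts the reviews per file. `count_reviews` and `counts_for_runs` are hypothetical helpers, and the demo files are made up.

```python
import csv
import glob
import os
import tempfile


def count_reviews(csv_path):
    """Count data rows in a report. csv.reader keeps multi-line review texts in one row."""
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        return max(sum(1 for _ in csv.reader(f)) - 1, 0)  # minus the header row


def counts_for_runs(csv_dir, final_name):
    """Map each report file written under final_name to its number of reviews."""
    pattern = glob.escape(os.path.join(csv_dir, final_name)) + "[0-9]*.csv"
    return {os.path.basename(p): count_reviews(p) for p in sorted(glob.glob(pattern))}


# Demo with two made-up reports of different length
demo_dir = tempfile.mkdtemp()
for run, n_rows in enumerate([3, 2]):
    path = os.path.join(demo_dir, "singlegoogleurl%d.csv" % run)
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["", "hotel", "text"])
        for i in range(n_rows):
            writer.writerow([i, "Demo Hotel", "line one\nline two"])

counts = counts_for_runs(demo_dir, "singlegoogleurl")
stable = len(set(counts.values())) <= 1
print(counts, "stable" if stable else "counts differ between runs")
```

If the counts differ between runs, slow page loading is the likely cause, and a longer sleep time is worth trying.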

#### It is good practice to verify your scraping results before you scrape a large number of reviews.

- Inspect the scraped data in the csv file and compare reviews to those you find when you go to the website you scraped off in your browser. 
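
For the manual comparison, a handful of randomly chosen reviews is usually enough. The sketch below draws a small random sample from a report so that you can look each one up on the website. `sample_reviews` is a hypothetical helper. The column names 'name', 'rating' and 'text' are taken from the Google headers in the scraper, and the demo rows are made up.

```python
import csv
import os
import random
import tempfile


def sample_reviews(csv_path, n=5, seed=0):
    """Return up to n randomly chosen rows (as dicts) from a scraped report."""
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        rows = list(csv.DictReader(f))
    rng = random.Random(seed)  # fixed seed so the same sample can be re-checked later
    return rng.sample(rows, min(n, len(rows)))


# Demo with a tiny made-up report
demo_path = os.path.join(tempfile.mkdtemp(), "demo0.csv")
with open(demo_path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["", "name", "rating", "text"])
    writer.writerow([0, "Reviewer A", "5/5", "Made-up review one"])
    writer.writerow([1, "Reviewer B", "3/5", "Made-up review two"])
    writer.writerow([2, "Reviewer C", "4/5", "Made-up review three"])

picked = sample_reviews(demo_path, n=2)
for row in picked:
    print(row["name"], "|", row["rating"], "|", row["text"][:80])
```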

#### If there are inconsistencies, the following might have occurred:
1. Google or TripAdvisor updated its website. In the ReviewScraper class, the scrapeAdvisor() and scrapeGoogle() functions contain variables that identify the HTML tags holding the information we need to scrape. These variables need to be updated (see Appendix).


## 3.6 Scrape a list of URLs from Google

The URLs below are review pages on Google Travel for several Graduate Hotels. You can replace them with other review URLs:
1. Enter the **exact full** name of the hotel in a Google search bar and search.
2. Click on "**google reviews**" (see below)
![Google Review](https://mapxp.app/MBA742/google-review.png)
3. Copy the URL of the current page

In [None]:
# 1 Specify list of URLS for several hotel reviews on Google
urls = ['https://www.google.com/travel/hotels/entity/CgoIyZLQmLyR4uJREAE/reviews?g2lb=2503771%2C4644488%2C4596364%2C4597339%2C4419364%2C4317915%2C4270442%2C4371335%2C4645479%2C4306835%2C4641139%2C4605861%2C4401769%2C4624411%2C4258168%2C4671810%2C2503781%2C2502548%2C4640247%2C4672717%2C4284970%2C4270859%2C4291517&hl=en-US&gl=us&ssta=1&rp=EMmS0Ji8keLiURDJktCYvJHi4lE4AkAASAHAAQI&ictx=1&sa=X&sqi=2&ved=0CAAQ5JsGahcKEwi4lq39xrf0AhUAAAAAHQAAAAAQAg&utm_campaign=sharing&utm_medium=link&utm_source=htls&ts=CAESABpJCisSJzIlMHg4ODNjYWUzZmYxMmIzNjYxOjB4NTFjNTg4OGJjMzE0MDk0ORoAEhoSFAoHCOUPEAsYHRIHCOUPEAsYHhgBMgIQACoJCgU6A1VTRBoA',
       'https://www.google.com/travel/hotels/entity/CgsIi5HS6P-dzdyXARAB/reviews?g2lb=4672717%2C4640247%2C2503781%2C2502548%2C4671810%2C4258168%2C4401769%2C4624411%2C4605861%2C4306835%2C4641139%2C4371335%2C4645479%2C4270442%2C4317915%2C4597339%2C4419364%2C4644488%2C4596364%2C2503771%2C4291517%2C4270859%2C4284970&hl=en-US&gl=us&ssta=1&rp=EIuR0uj_nc3clwEQi5HS6P-dzdyXATgCQABIAcABAg&ictx=1&sa=X&ved=0CAAQ5JsGahcKEwio8_Cnx7f0AhUAAAAAHQAAAAAQAg&utm_campaign=sharing&utm_medium=link&utm_source=htls&ts=CAESABpJCisSJzIlMHg4OWI3ZjdlMTUzNmY1MzI1OjB4OTdiOTM0ZWZmZDE0ODg4YhoAEhoSFAoHCOUPEAwYChIHCOUPEAwYDBgCMgIQACoJCgU6A1VTRBoA',
       'https://www.google.com/travel/hotels/entity/CgoIv5fgkq67qdNmEAE/reviews?g2lb=2503781%2C2502548%2C4672717%2C4640247%2C4258168%2C4401769%2C4624411%2C4671810%2C4371335%2C4645479%2C4270442%2C4317915%2C4605861%2C4306835%2C4641139%2C4644488%2C4596364%2C2503771%2C4597339%2C4419364%2C4270859%2C4291517%2C4284970&hl=en-US&gl=us&ssta=1&rp=EL-X4JKuu6nTZhC_l-CSrrup02Y4AkAASAHAAQI&ictx=1&sa=X&ved=0CAAQ5JsGahcKEwjYmuzox7f0AhUAAAAAHQAAAAAQAg&utm_campaign=sharing&utm_medium=link&utm_source=htls&ts=CAESCgoCCAMKAggDEAEaSQorEicyJTB4ODhmNjZjZDllNWE3MTgzOToweDY2YTZhNWRhZTI1ODBiYmYaABIaEhQKBwjlDxAMGA0SBwjlDxAMGA4YATICEAAqCQoFOgNVU0QaAA']

In [None]:
%%time

# 2 Instantiate ReviewScraper 
# warnings.filterwarnings("ignore", category=DeprecationWarning)
scraper = ReviewScraper(chromedrive_path, "Google_review")

# 3 Start scraping
print("Depending on the total number of URLs and Reviews, this may take a long time.")
print("Do not allow your computer or screen to go to sleep during the scraping process.")
scraper.scrape(url=urls, 
               csv_dir=csv_dir, 
               npy_dir=npy_dir,
               final_name='multiplegoogleurls',
               scroll=False) # "scroll" is set to False in order to save time! If you want to collect all reviews, set this to True

- After the execution, a CSV file containing all the results, as well as a temp.npz file, will be saved in the respective directories. The CSV should contain about 30 entries.  
  
- If we set scroll=True in the cell above, we will be scraping every Google review for every hotel in the URL list. The execution will take a long time and you should end up with all reviews available online.

## 3.7 Scrape Reviews for a single Hotel from TripAdvisor 

**Note** Scraping TripAdvisor is a lot slower!

In [None]:
%%time

# 1 Set URL to scrape
url = "https://www.tripadvisor.com/Hotel_Review-g38020-d91957-Reviews-Graduate_Iowa_City-Iowa_City_Iowa.html"

# 2 Instantiate ReviewScraper
# warnings.filterwarnings("ignore", category=DeprecationWarning)
scraper = ReviewScraper(chromedrive_path, site="tpadvisor")

# 3 Scrape reviews (this will take some time)
scraper.scrape(url=url, 
               csv_dir=csv_dir, 
               npy_dir=npy_dir,
               final_name='singletripadvisorurl',
               scroll=True)

After executing, you should get 220+ results. The sleep time in this case (default = 2.5s) is the "sleepTime" variable in the "scrapeAdvisor" function of the ReviewScraper class. Please adjust it if needed.

## 3.8 Scrape Reviews from a List of Hotels from TripAdvisor

#### Just as with Google reviews, you could replace these urls with your own:

1. Go to https://www.tripadvisor.com/
2. In the search bar, enter the **exact full** name of the hotel.
3. Click on the search result for that particular hotel (see below)
![TripAdvisor Review](https://mapxp.app/MBA742/tripadvisor-review.png)
4. Copy the current url

In [None]:
# 1 Define list of URLs for multiple hotels on TripAdvisor 
urls2 = ['https://www.tripadvisor.com/Hotel_Review-g29556-d89935-Reviews-Graduate_Ann_Arbor-Ann_Arbor_Michigan.html',
        'https://www.tripadvisor.com/Hotel_Review-g29494-d89319-Reviews-Graduate_Annapolis-Annapolis_Maryland.html',
        'https://www.tripadvisor.com/Hotel_Review-g29209-d242422-Reviews-Graduate_Athens-Athens_Georgia.html']

In [None]:
%%time
# 2 Instantiate ReviewScraper for TripAdvisor
warnings.filterwarnings("ignore", category=DeprecationWarning)
scraper = ReviewScraper(chromedrive_path, site="tpadvisor")

# 3 Scrape Reviews (here, we do not scroll to save time)
scraper.scrape(url=urls2, 
               csv_dir=csv_dir, 
               npy_dir=npy_dir,
               final_name='multipletripadvisorurls',
               scroll=False) # (set to false to save time - set to True to get all reviews)

- After the execution, a csv file containing all scraped reviews as well as a temp.npz file will be saved in respective directories.  
  
- The final csv file should contain about 15 entries.  
  
- If we set scroll=True in the cell above, we will be scraping every TripAdvisor review for each hotel in the URL list. The execution will take a long time and you should end up with many more reviews! 

- Don't forget to keep the Chrome browser window that pops up open, do not manipulate it, and do not allow your computer to go to sleep!

## 3.9 Continue Scraping after an early Termination

- Sometimes, the ReviewScraper might terminate early (due to internet problems, errors, etc.). In this case, you don't have all the data collected and no CSV file was written.

- Use the loadCheckpoint() function to convert the temp.npz checkpoint file into a CSV report, so that the reviews collected before the termination are not lost and you don't need to start over from the very beginning.

In [None]:
loadCheckpoint(checkpoint_path=npy_dir+"/temp.npz", # the path to the checkpoint file
               csv_dir=csv_dir, # the folder to save the generated report 
               final_name="loaded_checkpoint", # the name of the generated report 
               site="tpadvisor") # specify the review website 

# **Looking Ahead:**  

#### **Next Class:** Monday, February 21st, 2022  

#### ***Movie Recommender Systems*** via Latent Feature Discovery using Neural Networks


# Appendix: Fixing the scraper in the case of website updates

At the beginning of the scrapeAdvisor() and scrapeGoogle() functions, I have specified the identifiers for all the tags we need. Variables whose names start with 't' contain the type of the tag, and variables whose names start with 'c' contain the class of the tag. 

For example, the tag that contains the overall rating of a Google review is a div with the class "MfbzKb", so we have crating = "MfbzKb" and trating = 'div'. 
You may need to change these identifiers if Google or TripAdvisor updates its website.

### A light introduction to the scraper
1. A web scraper extracts information from the HTML source of a website. To follow along, please go to [this page](https://www.google.com/travel/hotels/graduate%20hotel%20chapel%20hill/entity/CgsInqmaiMnBi4yIARAB/reviews?g2lb=4597339%2C2503946%2C2503781%2C2502548%2C2503771%2C4647135%2C4306835%2C4641139%2C4605861%2C4401769%2C4270442%2C4317915%2C4258168%2C4640247%2C4649665%2C4596364%2C4284970%2C4291517%2C4270859&hl=en-US&gl=us&ssta=1&q=graduate%20hotel%20chapel%20hill&rp=EJ6pmojJwYuMiAEQnqmaiMnBi4yIATgCQABIAcABApoCAggA&ictx=1&sa=X&ved=0CAAQ5JsGahcKEwiQrKSTwYP2AhUAAAAAHQAAAAAQAg&utm_campaign=sharing&utm_medium=link&utm_source=htls&ts=CAESABpJCisSJzIlMHg4OWFjYzJlMTIxYzU5OTVmOjB4ODgxODJlMGM5MTA2OTQ5ZRoAEhoSFAoHCOYPEAIYGxIHCOYPEAIYHBgBMgIQACoJCgU6A1VTRBoA).
2. Hover your mouse over the overall rating of the first review, right-click, and choose "Inspect".
3. A panel will open on the right. It contains the HTML source of the website, and the selected portion is the HTML tag of the rating. ![TripAdvisor Review](https://mapxp.app/MBA742/css-scraper.png) The BeautifulSoup package allows us to look for this tag and extract the information within. This scraper uses "MfbzKb" as the identifier for the tag.
4. However, Google does not want people scraping its websites, so the identifiers may be modified from time to time. If that happens, the scraper will no longer work.
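
When the scraper suddenly returns nothing, a quick first step is to check which identifiers no longer appear in the page source. The sketch below does this with Python's built-in html.parser instead of BeautifulSoup, so it runs without extra packages. `missing_identifiers` is a hypothetical helper, and the sample HTML and the "hypothetical-text-class" identifier are made up. Only crating = "MfbzKb" and trating = 'div' come from the scraper.

```python
from html.parser import HTMLParser


class TagClassCollector(HTMLParser):
    """Collect (tag type, set of classes) for every opening tag in a page."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        self.seen.append((tag, set(classes.split())))


def missing_identifiers(html, identifiers):
    """Return the labels whose (tag type, class) pair does not occur in the HTML."""
    collector = TagClassCollector()
    collector.feed(html)
    missing = []
    for label, (tag, cls) in identifiers.items():
        wanted = set(cls.split())
        if not any(t == tag and wanted <= c for t, c in collector.seen):
            missing.append(label)
    return missing


# Made-up page source: the rating tag is still there, the text tag was renamed
sample_html = '<div class="MfbzKb">5/5</div><div class="renamed">Nice hotel</div>'
identifiers = {
    "rating": ("div", "MfbzKb"),                 # trating, crating from the scraper
    "text": ("div", "hypothetical-text-class"),  # made-up identifier for illustration
}
outdated = missing_identifiers(sample_html, identifiers)
print(outdated)
```

With a live browser session, you could pass driver.page_source as the html argument. Every label that is reported as missing points to an identifier that needs a fresh look with "Inspect".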

In [None]:
###############################################for testing purposes#############################################################

In [None]:

#parsing a single tripadvisor page
# def scrapeAdvisor(driver, url):

# # class and type of tag containing both the reviewer username and the review time
# cname_time = 'bcaHz'
# tname_time = 'div'

# # class and type tag containing of reviewer username (unverified/verified)
# cname = "ui_header_link bPvDb"
# tname = 'a'
# cnameverified = 'ui_header_link bPvDb verified'
# tnameverified = 'a'

# # class and type of the tag containing each review block
# creview = "cWwQK MC R2 Gi z Z BB dXjiy"
# treview = 'div'

# # class and type of the tag containing the rating, we use the tag containing the span 
# crating = "emWez F1"
# trating = 'div'

# # class and type of the tag containing the review title, we use the tag containing the span
# ctitle = "fCitC"
# ttitle = 'a'

# # class and type of tag containing all other optional ratings 
# coptional = "cnFkU Me f"
# toptional = 'div'

# # class and type of tag of the review text
# ctext = "XllAv H4 _a"
# ttext = 'q'

# # class of tag of expand button
# cexpand = 'eljVo _S Z'

# # class and type of tag of hotel name
# chtname = 'fkWsC b d Pn'
# thtname = 'h1'

# # class and type of the number of contributions
# ccontributions = 'ckXjS'
# tcontributions = 'span'

# # class and type of the time of stay
# ctimestay = "euPKI _R Me S4 H3"
# ttimestay = "span"

# # class and type of reviewer location 
# creviewerloc = "default ShLyt small"
# creviewerloc = "span"

# #sleep time for DOM reload
# sleepTime = 2.5

# htnames, times, names, ratings, titles, values, rooms, locations, cleanliness, services, slpqualitys, texts, numcontributions, timestays, reviewerlocs = (list() for i in range(15))

# print("getting url...")
# driver.get(url) 
# time.sleep(5)

# # get hotel name   
# content = driver.page_source
# soup = BeautifulSoup(content)
# htnameTag = soup.find(thtname, class_=chtname)
# assert htnameTag is not None, "Hotel name not found"
# htname = htnameTag.get_text()
# print('Collecting data for: ' + htname)


# while True:
#     content = driver.page_source
#     soup = BeautifulSoup(content)
#     all_reviews = soup.findAll(treview, class_=creview)

#     for review in all_reviews:
#         soup2 = BeautifulSoup(str(review))
#         name_time = soup2.findAll(tname_time, class_=cname_time)

#         # add review time 
#         times.append(str(name_time).split("wrote a review ")[1].split('</')[0])

#         #add reviewer name
#         reviewer = soup2.find(tname, class_=cname)
#         if reviewer is None:
#             reviewer = soup2.find(tnameverified, class_=cnameverified)
#         names.append(reviewer.get_text())

#         #add overall rating
#         ratings.append(str(soup2.find(trating, class_=crating)).split("ui_bubble_rating bubble_")[1][0])

#         # add review title
#         titles.append(soup2.find(ttitle, class_=ctitle).get_text())

#         # add all optional ratings
#         optionals = soup2.find(toptional, class_=coptional)
#         if optionals is None:
# #                     print("None")
#             values.append('N/A') 
#             rooms.append('N/A')
#             locations.append('N/A')
#             cleanliness.append('N/A')
#             services.append('N/A')
#             slpqualitys.append('N/A')
#         else:
#             op = [j for s in str(optionals).split('</span></div>') for j in s.split('</span></span><span>')]
#             values.append(op[op.index('Value')-1][-4] if 'Value' in op else "N/A")
#             rooms.append(op[op.index('Rooms')-1][-4] if 'Rooms' in op else "N/A")
#             locations.append(op[op.index('Location')-1][-4] if 'Location' in op else "N/A")
#             cleanliness.append(op[op.index('Cleanliness')-1][-4] if 'Cleanliness' in op else "N/A")
#             services.append(op[op.index('Service')-1][-4] if 'Service' in op else "N/A")
#             slpqualitys.append(op[op.index('Sleep Quality')-1][-4] if 'Sleep Quality' in op else "N/A")

#         #add review text 
#         texts.append(soup2.find(ttext, class_=ctext).get_text())

#         #add number of contributions
#         numcontributions.append(soup2.find(tcontributions, class_=ccontributions).get_text())

#         #add the time of stay ttimestay  
#         timestays.append(soup2.find(ttimestay, class_=ctimestay).get_text())

#         #add the location of the reviewer 
#         reviewerlocs.append(soup2.find(treviewerloc, class_=creviewerloc).get_text())





#     if not scroll:
#         break

#     # go to the next page by clicking on "next"
#     next_buttons = driver.find_elements(By.XPATH, "//*[@class='{next_page}']".format(next_page = "ui_button nav next primary "))
#     if (len(next_buttons)) == 1:
#         print("Going to the next page...")
# #                 next_buttons[0].click()
#         driver.execute_script("arguments[0].click();", next_buttons[0])


#         # wait until page is completely loaded
#         time.sleep(random.uniform(sleepTime,sleepTime + 0.25))
#         while str(driver.execute_script("return document.readyState")) != "complete" or not (driver.find_elements(By.XPATH, "//*[@class='{expand}']".format(expand = cexpand))[0].is_displayed()):
#             time.sleep(random.uniform(0,0.25))

#         # expand reviews 
#         expands = driver.find_elements(By.XPATH, "//*[@class='{expand}']".format(expand = cexpand))
#         assert expands[0].is_displayed(), "expand url not properly loaded"
# #                 expands[0].click()
#         driver.execute_script("arguments[0].click();", expands[0])
#     else:
# #                 print("Reached the last page")
#         break 

# ids = np.arange(len(texts))
# htnames = np.array([htname] * len(services))

# assert len(timestays) == len(numcontributions) == len(htnames) == len(times) == len(names) == len(ratings) == len(titles) == len(values) == len(rooms) == len(locations) == len(cleanliness) == len(services) == len(slpqualitys) == len(texts)

# ids = np.array(ids)
# texts = np.array(texts).astype('str')
# values = np.array(values)
# rooms = np.array(rooms)
# locations = np.array(locations)
# cleanliness = np.array(cleanliness)
# services = np.array(services)
# slpqualitys = np.array(slpqualitys)
# times = np.array(times)
# names = np.array(names)
# ratings = np.array(ratings)
# titles = np.array(titles)
# numcontributions = np.array(numcontributions)
# timestays = np.array(timestays)
# reviewerlocs = np.array(reviewerlocs)

# print(ids[:5])
# print(texts[:5])
# print(values[:5])
# print(rooms[:5])
# print(locations[:5])
# print(cleanliness[:5])
# print(services[:5])
# print(slpqualitys[:5])
# print(times[:5])
# print(names[:5])
# print(ratings[:5])
# print(titles[:5])

# allInfo = np.vstack([ids,texts,values,rooms,locations,cleanliness,services,slpqualitys,times,names,ratings,titles])
# print("shape of final array: " + str(allInfo.shape))


# npy_dir = r"C:\Users\zs\Desktop\forIntern\advisor_test"
# csv_dir = r"C:\Users\zs\Desktop\forIntern\advisor_test"

# #add timestamp
# timestamp = np.array([str(datetime.datetime.now())]*(allInfo.shape[1]))      
# finalArr = np.vstack([allInfo, timestamp])

# #save numpy array
# np.save(os.path.join(npy_dir, 'hotel_'+ str(0) + '.npy'), finalArr)

# # generate final report
# headers = np.array(['ID','review text','value rating','room rating','location rating','cleanliness rating','service rating','sleep quality','date','rater name','overall rating','rating title','timestamp']) 
# hotel = np.load(os.path.join(npy_dir, 'hotel_'+ str(0) + '.npy')).T

# # save as string and add indexes
# df = pd.DataFrame(hotel, columns = headers).astype(str)
# df.to_csv(os.path.join(csv_dir, 'final.csv'), encoding='utf-8-sig', index=True)

In [None]:
# url = 'https://www.google.com/travel/hotels/entity/CgoIyZLQmLyR4uJREAE/reviews?g2lb=2503771%2C4644488%2C4596364%2C4597339%2C4419364%2C4317915%2C4270442%2C4371335%2C4645479%2C4306835%2C4641139%2C4605861%2C4401769%2C4624411%2C4258168%2C4671810%2C2503781%2C2502548%2C4640247%2C4672717%2C4284970%2C4270859%2C4291517&hl=en-US&gl=us&ssta=1&rp=EMmS0Ji8keLiURDJktCYvJHi4lE4AkAASAHAAQI&ictx=1&sa=X&sqi=2&ved=0CAAQ5JsGahcKEwi4lq39xrf0AhUAAAAAHQAAAAAQAg&utm_campaign=sharing&utm_medium=link&utm_source=htls&ts=CAESABpJCisSJzIlMHg4ODNjYWUzZmYxMmIzNjYxOjB4NTFjNTg4OGJjMzE0MDk0ORoAEhoSFAoHCOUPEAsYHRIHCOUPEAsYHhgBMgIQACoJCgU6A1VTRBoA'
# driver = webdriver.Chrome('C:/Users/zs/Desktop/forIntern/chromedriver')
# driver2 = webdriver.Chrome('C:/Users/zs/Desktop/forIntern/chromedriver')

# driver.get(url)
# driver2.get(url)

# driver.get(url)
# WebDriverWait(driver,10000).until(EC.visibility_of_element_located((By.TAG_NAME,'body')))
# body = driver.find_element(By.TAG_NAME,'body')
# body.click()
# body.send_keys((Keys.chord(Keys.CONTROL, "t")))
# ActionChains(driver).key_down(Keys.CONTROL).send_keys('t').key_up(Keys.CONTROL).perform()




# driver = webdriver.Chrome('C:/Users/zs/Desktop/forIntern/chromedriver')
# driver.get('https://www.tripadvisor.com/Hotel_Review-g60993-d123322-Reviews-or450-Graduate_Cincinnati-Cincinnati_Ohio.html')

In [None]:
# conda list

In [None]:
# arr = os.listdir(r'C:/Users/zs/Desktop/forIntern/junk')
# arr