# Scraper for user reviews
- [Trip by Skyscanner website](https://www.trip.skyscanner.com/leaderboard/region/2000000000399)

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Scraper-for-user-reviews" data-toc-modified-id="Scraper-for-user-reviews-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Scraper for user reviews</a></span></li><li><span><a href="#Import-Modules-and-packages" data-toc-modified-id="Import-Modules-and-packages-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import Modules and packages</a></span><ul class="toc-item"><li><span><a href="#Load-in-.csv-file-of-top-200-users" data-toc-modified-id="Load-in-.csv-file-of-top-200-users-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Load in .csv file of top 200 users</a></span></li></ul></li><li><span><a href="#Define-scraping-function-to-loop-through-urls" data-toc-modified-id="Define-scraping-function-to-loop-through-urls-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Define scraping function to loop through urls</a></span></li><li><span><a href="#Use-scraper-function-to-loop-through-urls" data-toc-modified-id="Use-scraper-function-to-loop-through-urls-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Use scraper function to loop through urls</a></span></li><li><span><a href="#Save-dataframe-to-csv" data-toc-modified-id="Save-dataframe-to-csv-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Save dataframe to csv</a></span></li></ul></div>

# Import Modules and packages

In [8]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
from time import sleep 
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

## Load in .csv file of top 200 users

In [5]:
user_name_df = pd.read_csv('./atx_leaderboards_top200.csv', index_col=0)
user_name_list = user_name_df['id'].tolist()

In [6]:
user_name_list[0:50] # view first 50 usernames

['what-moms-think',
 'yelena-konetchy',
 'bailey-k_2',
 'ramona-flume',
 'patty-drew',
 'nina-pena_2',
 'tiffany-z',
 'meg-brooker',
 'brent-wistrom',
 'stephanie-asmus',
 'amit-anandwala',
 'mark-payne_3',
 'melody-lowe',
 'elisa-regulski',
 'ryan-onyxhotels',
 'stephen-andears',
 'alice-chase',
 'mary-beth-c_2',
 'justine_14',
 'jen-knoedl',
 'travis-katz',
 'broke-girls-guide',
 'mary-lou-t_2',
 'persis-ratouis',
 'laraye-rushing',
 'netta-drimer_2',
 'paula_19',
 'diane-austin',
 'suzanne-wyble',
 'audrey-lo-travelingaudrey_2',
 'libby_3',
 'connie-chang',
 'patrick-keltner',
 'van-le',
 'baxter-jackson',
 'steve-deangelo',
 'susan-l_19',
 'jessica-abramson',
 'alicia-moylan',
 'nina-h_3',
 'keane',
 'derrick-sison',
 'deborah-peacock_2',
 'rebecca-goglia-beccabandit',
 'rie_4',
 'frances-nguyen-ha',
 'dale-c_3',
 'april-mccormick-miner',
 'katherine_9',
 'gmack']

In [5]:
base = 'https://www.trip.skyscanner.com/user/' # base url
user_id = user_name_list[0]                    # starting username
passport = '/passport/austin'                  # remainder of url
url = base+user_id+passport                    # combined url

In [6]:
url

'https://www.trip.skyscanner.com/user/what-moms-think/passport/austin'

In [7]:
# create list of urls to loop through
urls = []              
for name in user_name_list:
    urls.append(base+name+passport)
urls

['https://www.trip.skyscanner.com/user/what-moms-think/passport/austin',
 'https://www.trip.skyscanner.com/user/yelena-konetchy/passport/austin',
 'https://www.trip.skyscanner.com/user/bailey-k_2/passport/austin',
 'https://www.trip.skyscanner.com/user/ramona-flume/passport/austin',
 'https://www.trip.skyscanner.com/user/patty-drew/passport/austin',
 'https://www.trip.skyscanner.com/user/nina-pena_2/passport/austin',
 'https://www.trip.skyscanner.com/user/tiffany-z/passport/austin',
 'https://www.trip.skyscanner.com/user/meg-brooker/passport/austin',
 'https://www.trip.skyscanner.com/user/brent-wistrom/passport/austin',
 'https://www.trip.skyscanner.com/user/stephanie-asmus/passport/austin',
 'https://www.trip.skyscanner.com/user/amit-anandwala/passport/austin',
 'https://www.trip.skyscanner.com/user/mark-payne_3/passport/austin',
 'https://www.trip.skyscanner.com/user/melody-lowe/passport/austin',
 'https://www.trip.skyscanner.com/user/elisa-regulski/passport/austin',
 'https://www.tr
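The URL list above can also be built with a small helper. This is only a sketch: the function name `build_passport_urls` and the percent-encoding step are assumptions added here, not part of the original notebook. The usernames in the CSV appear to be URL-safe already, so the encoding is purely defensive.

```python
from urllib.parse import quote

BASE = "https://www.trip.skyscanner.com/user/"  # base url, as in the notebook
PASSPORT = "/passport/austin"                   # remainder of url, as in the notebook


def build_passport_urls(user_ids, base=BASE, passport=PASSPORT):
    """Return one passport URL per user id (hypothetical helper).

    quote() percent-encodes characters that are not URL-safe; hyphens and
    underscores, which the usernames above contain, pass through unchanged.
    """
    return [f"{base}{quote(user_id, safe='')}{passport}" for user_id in user_ids]


urls = build_passport_urls(["what-moms-think", "bailey-k_2"])
print(urls[0])
```

Passing `user_name_list` instead of the two sample names would produce the same list as the loop in the cell above.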

# Define scraping function to loop through urls

In [42]:
def scrape_test(url): # scrape every review on the page currently loaded in the global driver; rows are appended to the global df
    last_height = driver.execute_script("return document.body.scrollHeight") # once the page has loaded, record its current height
    while True: # keep scrolling until the page stops growing

        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") # scroll down to the bottom
        sleep(15)   # wait for more reviews to load

        new_height = driver.execute_script("return document.body.scrollHeight") # measure the height again and compare with the last one
        if new_height == last_height:     # nothing new loaded, so this is the real bottom of the page

            break
        last_height = new_height          # the page grew, so keep scrolling and repeat

    soup = BeautifulSoup(driver.page_source, 'lxml')                         # parse the fully loaded page source from the Selenium driver
    user_name = url.split('/')[4]                                            # the username is the fifth '/'-separated piece of the url
    for review in soup.find_all('div', {'class': 'moment'}):
        # find each field and save it; when a tag is missing, find() returns None
        # and the attribute access raises AttributeError, so a placeholder is saved instead
        try:
            place_name = review.find('a', {'class': 'place_name'}).text
        except AttributeError:
            place_name = 'no place'
        try:
            category = review.find('small', {'class': 'place_type'}).text
        except AttributeError:
            category = 'no category'
        try:
            rating = review.find('div', {'aria-label': True}).attrs['aria-label']
        except AttributeError:
            rating = 'no rating'
        try:
            review_text = review.find('div', {'class': 'row review'}).text
        except AttributeError:
            review_text = 'no text'
        df.loc[len(df)] = [place_name, category, rating, user_name, review_text]  # add the row to the end of the dataframe
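The scroll loop in the function above runs until the page height stops changing, so a page that keeps loading content would never exit. A minimal sketch of the same loop with an upper bound is shown here. The name `scroll_to_bottom` and the `max_scrolls` and `wait` parameters are assumptions for illustration; the only driver method used is `execute_script`, as in the notebook.

```python
from time import sleep


def scroll_to_bottom(driver, pause=15, max_scrolls=50, wait=sleep):
    """Scroll until the page height stops growing or max_scrolls is reached.

    Returns the number of scrolls performed. `driver` is any object with an
    execute_script(script) method, such as a Selenium WebDriver.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for scrolls in range(1, max_scrolls + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        wait(pause)  # give the page time to load more reviews
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # nothing new loaded: bottom reached
            return scrolls
        last_height = new_height       # the page grew: scroll again
    return max_scrolls                 # gave up after the cap
```

Taking the driver as an argument rather than reading a global also makes the loop easy to try out against a stand-in object before pointing it at a real browser.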
            

# Use scraper function to loop through urls

In [43]:
# scraper in process
df = pd.DataFrame(columns=['place_name', 'category', 'rating', 'user_name', 'review_text']) # create new dataframe with specific columns
driver = webdriver.Chrome(executable_path="./chromedriver/macos/chromedriver")              # Load Selenium driver
driver.get(url)   # sends browser to desired url
sleep(4)          # this pause matters: the website will not let the scraper get information if the driver moves too quickly
cookies = driver.find_element_by_xpath('//*[@id="cookie-banner-root"]/section/div/div/div/button')    # use selenium to find the "I accept" button on the cookie banner
cookies.click()
sleep(4)

for url in urls[150:200]:   # I scraped in 4 waves of 50 urls each
    sleep(2)
    driver.get(url)         # have selenium fetch information from the url
    scrape_test(url)        # carry url to the function to scrape the pages information
driver.quit()               # close the browser and end the driver session after use

In [44]:
df.head() # view dataframe

| | place_name | category | rating | user_name | review_text |
|---|---|---|---|---|---|
| 0 | Austin | City | 5 stars | mzmessynessy | Just stepping fresh outta zilker botanical gar... |
| 1 | Boggy Creek Greenbelt | Attraction | 5 stars | mzmessynessy | So this quaint little park offers a nice littl... |
| 2 | Zilker Botanical Garden | Attraction | 5 stars | mzmessynessy | Oh be still my beating heart! Beautiful! Broke... |
| 3 | Walmart | Attraction | 5 stars | mzmessynessy | Ok so my shopping experience at this Austin lo... |
| 4 | Star Seeds Cafe | Restaurant | 5 stars | mzmessynessy | First spot I stopped to just soak up the Austi... |
| 5 | Austin | City | no rating | mela-mcgary | no text |
| 6 | Blanton Museum of Art | Attraction | 5 stars | mela-mcgary | Had a really great time here. Enjoyed the gift... |
| 7 | Stevie Ray Vaughan Statue | Attraction | 5 stars | mela-mcgary | Very cool to see an SRV statue. We went at nig... |
| 8 | Four Seasons Hotel Austin | Hotel | 5 stars | mela-mcgary | no text |
| 9 | Alamo Drafthouse - South Lamar | Attraction | 5 stars | mela-mcgary | no text |
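The `rating` column holds the raw `aria-label` text, such as `5 stars`, or the placeholder `no rating`. A possible way to turn these into numbers is sketched below. The helper `parse_rating` is hypothetical, and it assumes every real label starts with a number; whether the site also uses fractional ratings is not known from the output shown here.

```python
import re


def parse_rating(label):
    """Return the leading number of a label like '5 stars', or None.

    Placeholders such as 'no rating' contain no leading number and
    therefore map to None.
    """
    match = re.match(r"\s*(\d+(?:\.\d+)?)", label)
    return float(match.group(1)) if match else None


for label in ["5 stars", "no rating"]:
    print(label, "->", parse_rating(label))
```

On the dataframe this could be applied with `df['rating'].apply(parse_rating)`, which leaves the missing ratings as `None` for later handling.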


In [45]:
df['user_name'].value_counts() # check that each username's row count matches their review count; some users have deleted posts, so the numbers may be slightly off

lee-a_3                      45
maria-coco-m                 37
robin-j                      34
holly-w_4                    32
forrest-b                    27
melissa-s_16                 22
lillie-s                     21
andi-w                       20
eliot-e                      20
bibi-javed                   19
jayne-g_2                    15
andrew-w_3                   15
sean-r                       15
sarah_80                     14
steve-atwater                11
brenda-g_14                  11
cindy-chow_2                 10
tory-k                       10
teresa-d_4                   10
robert-rodriguez_8           10
dominique-d                  10
loren-r_3                    10
vivian-c                     10
david-k_25                   10
pam-k_2                      10
kristen-kachotravels         10
scott-c_5                    10
bradley-jackson              10
leslie-g_4                   10
hannah-shirley               10
sharon-bubz-z                10
lorraine
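`value_counts()` only lists users who have at least one row, so a user whose page returned nothing is invisible in the output above. The sketch below counts rows per user and also reports which expected users are missing. The function name `review_counts` and the sample data are made up for illustration.

```python
from collections import Counter


def review_counts(user_names, expected_users):
    """Count rows per user and list expected users with no rows at all."""
    counts = Counter(user_names)
    # Counter returns 0 for keys it has never seen, without inserting them
    missing = [user for user in expected_users if counts[user] == 0]
    return counts, missing


# made-up sample: two users were scraped, a third produced no rows
scraped = ["lee-a_3", "lee-a_3", "lee-a_3", "robin-j", "robin-j"]
counts, missing = review_counts(scraped, ["lee-a_3", "robin-j", "sean-r"])
print(counts.most_common(1))
print(missing)
```

With the notebook's own objects, `scraped` would be `df['user_name']` and the expected users would be the names taken from the URLs in the current wave.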

# Save dataframe to csv

In [46]:
df.to_csv('./text_reviews_150-200.csv')
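The scrape above was run in four waves of 50 URLs, with the slice bounds edited by hand and each wave saved to its own file. A small generator can produce the slices and matching file names in one place. This is a sketch: the name `waves` is hypothetical, and the file name pattern simply follows `text_reviews_150-200.csv` from the cell above.

```python
def waves(items, size=50):
    """Yield (start, end, batch) for consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        batch = items[start:start + size]
        yield start, start + len(batch), batch


# stand-in for the 200 passport urls
example_urls = [f"url-{i}" for i in range(200)]
for start, end, batch in waves(example_urls):
    print(f"./text_reviews_{start}-{end}.csv", len(batch))
```

Each `batch` could then be passed through the scraping loop before the dataframe for that wave is written to the printed file name.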