
# Web Scraping with Selenium
*Author: Douglas Strodtman (SaMo)*

## Learning Objectives

1. Scraping JavaScript pages
2. Manipulating objects on a page
3. Automating logins
4. Passing cookies to `requests`

## Installs

#### Required Software

- Google Chrome
- [Xpath Helper](https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en)
- [Chromedriver](http://chromedriver.chromium.org/downloads) (Download the Chromedriver for your OS. Unzip and move the `chromedriver` file to the directory containing this notebook.)

#### Python 
- scrapy
- selenium
- beautiful soup

Uncomment and run the install for any of the packages you're missing below.

In [5]:
# !pip install scrapy
# !pip install selenium 
# !pip install beautifulsoup4 

In [6]:
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
from scrapy.selector import Selector
import re
import time
import csv
from itertools import islice

## Using Selenium to Automate Logins

To avoid typing my password in the browser, I created a hidden file with the variable

```PASSWORD = 'myPassw0rd'``` 

defined. I add this file to my `.gitignore` so it doesn't accidentally get shared when I push to GitHub. **This is not a security best practice, but it's a simple safeguard.**

In [7]:
%run .password.py

ERROR:root:File `'.password.py'` not found.
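`%run` is an IPython magic, so it only works inside a notebook or IPython session. As a rough sketch of the same idea in plain Python, the standard library's `runpy.run_path` executes a file and returns the globals it defined. The `load_password` helper and the temporary file below are hypothetical stand-ins for the real `.password.py`, not part of my original workflow.

```python
import os
import runpy
import tempfile


def load_password(path):
    """Run a small Python file and return the PASSWORD variable it defines."""
    namespace = runpy.run_path(path)
    return namespace['PASSWORD']


# Demonstration with a throwaway file standing in for the real `.password.py`
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, '.password.py')
    with open(demo_path, 'w') as f:
        f.write("PASSWORD = 'myPassw0rd'\n")
    demo_password = load_password(demo_path)

print(demo_password == 'myPassw0rd')
```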


### Headless Browsing
One of the great options available with Selenium is the ability to automate web browsing without opening a visible browser window. This is as simple as adding these options when you create the driver:

```python
options = webdriver.ChromeOptions()
options.add_argument('headless')
```

### Logging in to GHE
Every use of Selenium is specific to the site being automated. If you want to log in to a site, build a custom function that relies on that site's structure and functionality and returns the logged-in driver.

In [3]:
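def login_to_github(username, user_pass, headless=False, repo='DSI-US-4/course-info'):
    # Written against the Selenium 3 API; Selenium 4 replaces the
    # find_element_by_* helpers with find_element(By.<strategy>, ...)
    if headless:
        options = webdriver.ChromeOptions()
        options.add_argument('headless')
        driver = webdriver.Chrome('./chromedriver', chrome_options=options)
    else:
        driver = webdriver.Chrome('./chromedriver')
    # Load the repo page; the login form is expected here when not signed in
    driver.get(f'https://git.generalassemb.ly/{repo}')

    # The first text input on the page is the username field
    user = driver.find_elements_by_css_selector('input[type=text]')[0]
    user.send_keys(username)

    password = driver.find_element_by_css_selector('input[type=password]')
    password.send_keys(user_pass)

    # Submit the form by clicking the first element with the .btn class
    button = driver.find_element_by_css_selector('.btn')
    button.click()

    return driver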

In [4]:
driver = login_to_github('dstrodtman', PASSWORD)

NameError: name 'PASSWORD' is not defined

Notice that this opens a new Chrome browser window. I can use XPath (or other locator strategies, such as CSS selectors) to select elements of this page.

In [5]:
insights = driver.find_element_by_xpath("//a[@class='js-selected-navigation-item reponav-item'][3]")

In [6]:
insights.text

'Insights'

I can now automate a click to navigate.

In [7]:
insights.click()

Notice that I've now navigated to a different page in my browser.

In [16]:
driver.quit()

In [17]:
del driver

### Passing Cookies to Requests
While Selenium is powerful, it can be much slower than `requests`. Whenever possible, speed up your scraping by fetching the HTML source directly instead of rendering the JavaScript in a browser. You can pass the cookies that the browser session generated back to `requests` to mimic browser behavior.

In [8]:
def get_cookie_jar(driver):
    cookies = driver.get_cookies()
    cookie_jar = {x['name']:x['value'] for x in cookies}
    
    return cookie_jar
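To make the reshaping concrete, here is a small sketch that runs without a browser. It assumes that `driver.get_cookies()` returns a list of dicts with at least `name` and `value` keys (plus extras such as `domain` and `path`); the sample cookies and the `flatten_cookies` name are made up for illustration.

```python
def flatten_cookies(cookies):
    """Collapse a list of cookie dicts into a {name: value} mapping."""
    return {cookie['name']: cookie['value'] for cookie in cookies}


# Hypothetical sample shaped like the list Selenium returns; all values are made up
sample_cookies = [
    {'name': 'logged_in', 'value': 'yes', 'domain': '.example.com', 'path': '/'},
    {'name': 'tz', 'value': 'America%2FLos_Angeles', 'domain': '.example.com', 'path': '/'},
]

sample_jar = flatten_cookies(sample_cookies)
print(sample_jar)
```

The flattened dict drops each cookie's domain and path, which is fine when every request goes to the same site.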

In [9]:
cookie_jar = get_cookie_jar(driver)

In [10]:
cookie_jar

{'_fi_sess': 'eyJsYXN0X3dyaXRlIjoxNTI5NDI1NTA2MTk4LCJzZXNzaW9uX2lkIjoiYThjY2Q0OGU1NDE2ZDFjZWMwMmIyNDY0ODk2Y2JjZmQiLCJsYXN0X3JlYWRfZnJvbV9yZXBsaWNhcyI6MTUyOTQyNTUyMTE3NSwic3B5X3JlcG8iOiJEU0ktVVMtNC9jb3Vyc2UtaW5mbyIsInNweV9yZXBvX2F0IjoxNTI5NDI1NTIwfQ%3D%3D--a524b550112b743f242a2c915ed6682455d48e53',
 '_gh_render': 'BAh7B0kiD3Nlc3Npb25faWQGOgZFVEkiRTJhYzRjNGRhZWQyYWU3ZDVkNzky%0AYWQ4YmY4Yzc5YTVlNjdlNmVkODgyZTM5MDc3MDM0Yjg4NWU2YmRmNjMyOTkG%0AOwBGSSIPdXNlcl9sb2dpbgY7AEZJIg9kc3Ryb2R0bWFuBjsAVA%3D%3D%0A--307ecb8caa23da7a65643a4ba1bf0cc598edd689',
 'tz': 'America%2FLos_Angeles',
 'dotcom_user': 'dstrodtman',
 'logged_in': 'yes',
 '__Host-user_session_same_site': 'EoNbv2WD56BErIcQIL_716Ia5e33-Fop4uSYD3CYkY40MhNY',
 'user_session': 'EoNbv2WD56BErIcQIL_716Ia5e33-Fop4uSYD3CYkY40MhNY'}

I pass these cookies back to requests using the `cookies` arg.

In [11]:
page = requests.get('https://git.generalassemb.ly/DSI-US-4/course-info', 
                    cookies=cookie_jar)

In [12]:
soup = BeautifulSoup(page.text, 'html.parser')

In this simple example, I'll just extract the total number of commits shown on this page. (Here, I use `Selector` from `scrapy` to select by XPath so that my process for using `requests` is more similar to that for `selenium`.)

In [13]:
commits = Selector(text=page.text).xpath("//li[@class='commits']/a/span").extract()[0]

In [14]:
commits

'<span class="num text-emphasized">\n                526\n              </span>'

As you can see, my result isn't as clean as I might like, but with some quick `BeautifulSoup` and regex parsing, I can clean things up. **In general**, my preference is to get as much data as I can in a single call and then clean it using the same reusable functions.

In [15]:
re.sub(r'\s+', '', BeautifulSoup(commits, 'html.parser').text)

'526'
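As one possible shape for such a reusable cleaning function, here is a minimal sketch that uses only the standard library. It swaps `BeautifulSoup` for a simple regex that strips tags, which is only reasonable for small, well-formed fragments like this one; `clean_scraped_text` is a hypothetical name.

```python
import re


def clean_scraped_text(fragment):
    """Strip tags and all whitespace from a small HTML fragment."""
    without_tags = re.sub(r'<[^>]+>', '', fragment)
    return re.sub(r'\s+', '', without_tags)


# The same fragment that the XPath query returned above
raw_commits = '<span class="num text-emphasized">\n                526\n              </span>'
print(clean_scraped_text(raw_commits))
```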

## Using Selenium to Load Javascript Pages

The functions below are specific to boardgamegeek.com, the website I chose to scrape for my capstone project. My approach was to do as much navigation as I could with `requests` and use `selenium` only to load dynamically generated pages so that I could capture the data from their page source. There are **many** other ways to approach these problems, and I'm not suggesting that my solution is the best, just that it is workable.

#### This function defines my driver for a single page
I choose to open a new driver for each page that I visit, rather than automating button clicks.

In [36]:
def connect_to_bgg(glink, curr_page, headless=True):
    if headless:
        options = webdriver.ChromeOptions()
        options.add_argument('headless')
        driver = webdriver.Chrome('./chromedriver', chrome_options=options)
    else:
        driver = webdriver.Chrome('./chromedriver')
    driver.get(f'https://boardgamegeek.com{glink}/ratings?pageid={curr_page}&rated=1')

    return driver

#### This function allows me to check that my page has finished loading
I kept running into problems when I tried to scrape before the page had finished loading. At various points in my troubleshooting this produced either outright errors or (in an earlier iteration of this approach) results from the previous page I had visited. While this function is all error handling, I'm programming around expected error behavior based on what I know about the page.

In [39]:
def check_page_loaded(driver, last_page_el):
    # XPath to the first rater's username link on the ratings page
    xpath_first_rater = ("//ratings-module//li[@class='summary-item summary-rating-item'][1]"
                         "/div[@class='comment-header']/div/div/a")
    page_el = []
    # Poll until the first rater has text and differs from the previous page's first rater
    while not page_el or page_el == last_page_el:
        try:
            page_el = driver.find_element_by_xpath(xpath_first_rater).text
        except Exception:
            # Element not rendered yet (or went stale); try again after the pause
            pass
        if not page_el or page_el == last_page_el:
            time.sleep(.5)

    return page_el
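The polling in `check_page_loaded` has no upper bound on how long it waits. The sketch below shows the same polling idea with a timeout, using only the standard library so it runs without a browser. The `wait_for` helper and the fake sequence of responses are hypothetical; with Selenium, `get_value` might be a lambda that wraps the element lookup.

```python
import time


def wait_for(get_value, previous=None, timeout=30.0, interval=0.5):
    """Poll get_value() until it returns something truthy that differs from previous.

    Raises TimeoutError if that does not happen within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            value = get_value()
        except Exception:
            value = None  # treat lookup errors as "not ready yet"
        if value and value != previous:
            return value
        time.sleep(interval)
    raise TimeoutError(f'no new value after {timeout} seconds')


# Stand-in for a page that still shows the previous page's first rater on the first two checks
responses = iter(['old_user', 'old_user', 'new_user'])
result = wait_for(lambda: next(responses), previous='old_user', timeout=5, interval=0.01)
print(result)
```

Raising an error after the deadline lets the caller log the failure and move on to the next page instead of hanging.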

#### Setting up the lists of items that I need to collect
I had previously scraped the top 10k board games using `requests` from ranked lists that returned simple HTML. I was able to capture a unique identifier number (`gid`), the specific URL path to that game (`glink`), and the total number of users who had rated that game (`numrating`).

In [24]:
def read_gid_link_numratings(start=0, stop=10000):
    gids = []
    glinks = []
    numratings = []
    with open('data/numratings', 'r') as f:
        reader = csv.reader(islice(f, start, stop+1))
        for row in reader:
            gids.append(row[0])
            glinks.append(row[1])
            numratings.append(row[2])
    return gids, glinks, numratings
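To see which rows a given `start` and `stop` select, here is a small sketch that applies the same `islice` call to an in-memory file. The rows are made up and only follow the assumed column order of `gid`, `glink`, `numrating`.

```python
import csv
import io
from itertools import islice

# Made-up rows in the assumed column order: gid, glink, numrating
sample_file = io.StringIO(
    '1,/boardgame/1/game-a,120\n'
    '2,/boardgame/2/game-b,75\n'
    '3,/boardgame/3/game-c,50\n'
    '4,/boardgame/4/game-d,10\n'
)

start, stop = 1, 2
# islice(f, start, stop + 1) yields the lines at indices start..stop, inclusive
rows = list(csv.reader(islice(sample_file, start, stop + 1)))
sample_gids = [row[0] for row in rows]
print(sample_gids)
```

Because of the `stop + 1`, the row at index `stop` is included in the result.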

#### Define max pages
To avoid having to automate clicks, I chose to leverage the design of the site to iterate through pages. Here, I find the number of pages of reviews that I should expect for each game.

In [29]:
def set_max_pages(numrating):
    numrating = int(numrating)
    if numrating%50 != 0:
        max_pages = numrating//50 + 1
    else:
        max_pages = int(numrating/50)
    return max_pages
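As a quick check of the page arithmetic, the sketch below rewrites the calculation as a single ceiling division and builds the per-page URLs from the pattern used in `connect_to_bgg`. The 50 ratings per page comes from the function above; the `ratings_urls` helper and the example `glink` are hypothetical.

```python
def set_max_pages(numrating, per_page=50):
    """Ceiling division: number of pages needed to show numrating ratings."""
    numrating = int(numrating)
    return -(-numrating // per_page)


def ratings_urls(glink, numrating):
    """Build one ratings URL per page for a game."""
    return [
        f'https://boardgamegeek.com{glink}/ratings?pageid={page}&rated=1'
        for page in range(1, set_max_pages(numrating) + 1)
    ]


# Hypothetical game path; 120 ratings at 50 per page should need 3 pages
urls = ratings_urls('/boardgame/0000/example-game', '120')
for url in urls:
    print(url)
```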

#### Create user tuples
Because I knew that I wanted my data to end up in a normalized PostgreSQL database, I chose to include all the info needed to define my table here. Both `user_name` and `gid` are unique identifiers, so I use these to index ratings.

In [31]:
def make_user_rows(user_names, gid, ratings):
    return list(zip(user_names, [gid]*len(user_names), ratings))
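Here is a small illustration of the rows this produces, using made-up user names and ratings. The function is repeated so the block runs on its own.

```python
def make_user_rows(user_names, gid, ratings):
    return list(zip(user_names, [gid] * len(user_names), ratings))


# Made-up values: two raters of the game with gid '1514'
sample_rows = make_user_rows(['user_a', 'user_b'], '1514', ['8', '6.5'])
print(sample_rows)

# zip stops at the shortest input, so a missing rating silently drops rows
short_rows = make_user_rows(['user_a', 'user_b', 'user_c'], '1514', ['8'])
print(short_rows)
```

Since `zip` truncates to the shortest list, it may be worth comparing `len(user_names)` and `len(ratings)` before building the rows.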

#### Write tuples to CSV
I write out my data after **every** query into a file for each game. Reasons:
- For my scrape, I knew that I needed to wait 2s between calls so as not to overload the site, so the time spent writing doesn't slow anything down
- This avoids the possibility of filling up my RAM and crashing my instance
- An unexpected error costs me at most a single page query's worth of data
- My files stay a manageable size (my max `numratings` for a game is only around 70k, so my files won't get huge)

In [53]:
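def write_user_rows(gid, user_rows):
    # Append so each page's rows are added to the game's existing file;
    # newline='' lets the csv module control line endings itself
    with open(f'data/{gid}_users', 'a', newline='') as f:
        csv.writer(f).writerows(user_rows)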
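The sketch below shows the append behavior across two calls, writing to a temporary directory instead of `data/`. The `append_rows` helper and the sample rows are hypothetical.

```python
import csv
import os
import tempfile


def append_rows(path, rows):
    """Append rows to a CSV file, creating it on first use."""
    with open(path, 'a', newline='') as f:
        csv.writer(f).writerows(rows)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, '1514_users')
    append_rows(path, [('user_a', '1514', '8')])    # rows from the first page
    append_rows(path, [('user_b', '1514', '6.5')])  # rows from the second page
    with open(path, newline='') as f:
        saved_rows = list(csv.reader(f))

print(saved_rows)
```

Each call adds to the end of the file, so rerunning a page that was already written would duplicate its rows.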

#### Bring it all together
Here I combine these smaller functions into one that grabs all the ratings for a single game. Note that I both write to a log file and print my progress along the way. This helps me know when my attempts fail (as they certainly will).

In [71]:
def get_game_raters(gid, glink, max_pages, curr_page=1):

    last_page_el = []

    while curr_page <= max_pages:
        start = time.time()

        driver = connect_to_bgg(glink, curr_page)
        
        last_page_el = check_page_loaded(driver, last_page_el)
        
        html = driver.page_source
        
        driver.quit()

        xpath_user_names = "//ratings-module//div[@class='comment-header']/div/div/a/text()"
        xpath_user_links = "//ratings-module//div[@class='comment-header']/div/div/a/@href"
        xpath_ratings = "//ratings-module//li/div[@class='summary-item-callout']/div/text()"

        user_names = Selector(text=html).xpath(xpath_user_names).extract()
        user_links = Selector(text=html).xpath(xpath_user_links).extract()
        dirty_ratings = Selector(text=html).xpath(xpath_ratings).extract()

        ratings = []
        for rating in dirty_ratings:
            ratings.append(re.sub(r'\s', '', rating))

        user_rows = make_user_rows(user_names, gid, ratings)

        write_user_rows(gid,user_rows)
        
        print(f'Scraped {gid} page {curr_page} of {max_pages} in {time.time()-start}s')
        
        curr_page += 1

    with open('get_users_log', 'a+') as f:
        f.write(f'{time.time()} {gid} finished\n')    

#### Here, we'll grab all the ratings for a single game
My `gids` are ordered by `numratings`, so I choose a game a little further in so this doesn't take forever.

In [66]:
gids, glinks, numratings = read_gid_link_numratings()

glink = glinks[8000]
gid = gids[8000]
max_pages = set_max_pages(numratings[8000])

In [72]:
get_game_raters(gid, glink, max_pages)

Scraped 1514 page 1 of 3 in 6.446454048156738s
Scraped 1514 page 2 of 3 in 5.369781017303467s
Scraped 1514 page 3 of 3 in 4.921979188919067s


#### A final wrapper function
Here I build a final function that wraps all of my inner functions and iterates through a set number of games.

**NOTE**: It will take roughly two weeks for this scrape to complete. I managed to cut this time to around 4 days by splitting the work across 10 AWS instances.

**In addition**, some of the paths in my `glinks` are out of date because of changes to board game names on the site, and these will result in infinite while loops with my current code.

In [None]:
def get_user_ratings(start=0, stop=10000):
    gids, glinks, numratings = read_gid_link_numratings(start, stop)
    for gid, glink, numrating in zip(gids, glinks, numratings):
        max_pages = set_max_pages(numrating)
        get_game_raters(gid, glink, max_pages)
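One way to divide the work between machines is to give each a contiguous block of rows. The sketch below is an assumption about how such a split could look, not a record of how I actually split the scrape; `split_ranges` is a hypothetical helper. Because `read_gid_link_numratings` slices with `stop+1`, its `stop` is inclusive, so the ranges here are inclusive too and do not overlap.

```python
def split_ranges(total, workers):
    """Split row indices 0..total-1 into contiguous inclusive (start, stop) ranges."""
    base, extra = divmod(total, workers)
    ranges = []
    start = 0
    for i in range(workers):
        # The first `extra` workers take one additional row each
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue
        ranges.append((start, start + size - 1))
        start += size
    return ranges


chunks = split_ranges(10000, 10)
print(chunks[0], chunks[-1])
```

Each machine would then call `get_user_ratings(start, stop)` with its own pair.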