# Lab 5.2 -- Scraping IMDB

Our goal is to scrape [IMDB](imdb.com) user reviews for *Borat Subsequent Moviefilm*.  Unfortunately, the user reviews page shows only a limited number of reviews at a time, and there is no link to the additional pages.  `selenium` to the rescue! In this lab, we will combine our two approaches to web scraping by

1. Using `selenium` to load the page and click the *Load More* button until we have all the reviews.
2. Creating a `BeautifulSoup` instance for the complete page and parsing the results.

### Task 1 -- Load the reviews.

Explore IMDB to find the web link for the user reviews for *Borat Subsequent Moviefilm* and load this page in Python with `selenium`.

In [1]:
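from selenium import webdriver
from selenium.webdriver.chrome.service import Service

DRIVER_PATH = r'/mnt/c/Users/kg3597wc/Desktop/chromefolder/chromedriver.exe'
# Selenium 4 takes the driver path through a Service object rather than executable_path
driver = webdriver.Chrome(service=Service(executable_path=DRIVER_PATH))
driver.get('https://www.imdb.com/title/tt13143964/reviews')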

### Task 2 -- Figure out how to click the *Load More* button.

To load all of the user reviews, we need to click the *Load More* button multiple times.  First, find the corresponding WebElement and verify that clicking this button loads another page of results.

In [2]:
html_before = driver.page_source

In [3]:
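from selenium.webdriver.common.by import By

# find_element(By.CLASS_NAME, ...) replaces find_element_by_class_name, which Selenium 4 removed
more_button = driver.find_element(By.CLASS_NAME, "ipl-load-more__button")
more_button.click()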

In [4]:
html_after = driver.page_source
html_before != html_after

True

### Task 3 -- Click *Load More* until you have all the results.

Now you need to write code that will keep clicking the *Load More* button when you find it.  **Hint:** We can think of this as an example of an *unfold* process, meaning you should use a `while` loop combined with a [try-and-except statement](https://pythonbasics.org/try-except/) to keep trying to click the button.  To make sure you don't get an infinite loop, use a variable to identify and hold the stopping condition/state.
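
The unfold pattern from the hint can be tried out without a browser. The following is a minimal sketch, assuming a hypothetical `FakeButton` class that stands in for the *Load More* WebElement and raises an exception once nothing is left to load. Neither `FakeButton` nor `click_until_done` is part of `selenium`.

```python
class FakeButton:
    """Hypothetical stand-in for the Load More button (not a selenium class)."""

    def __init__(self, pages_available):
        self.pages_available = pages_available

    def click(self):
        # Mimic a button that disappears once every page has been loaded
        if self.pages_available == 0:
            raise RuntimeError("button is gone")
        self.pages_available -= 1


def click_until_done(button, max_clicks=250):
    """Click until the button fails or max_clicks is reached; return the click count."""
    clicks = 0
    done = False  # holds the stopping state
    while not done and clicks < max_clicks:
        try:
            button.click()
            clicks += 1
        except RuntimeError:
            done = True  # the button could not be clicked, so stop unfolding
    return clicks


print(click_until_done(FakeButton(3)))      # 3
print(click_until_done(FakeButton(10), 4))  # 4
```

The `max_clicks` cap plays the same role as a page limit: even if the button never fails, the loop still ends.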

In [5]:
# While new reviews load in, the button is not clickable.
# Rather than sleeping for a fixed time, selenium can wait until the button is clickable again.
# Based on https://selenium-python.readthedocs.io/waits.html
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

max_pages = 250 # used to cut off after a certain number of pages
pages = max_pages
while pages > 0:
    try:
        more_button.click()
        pages -= 1
        more_button = WebDriverWait(driver, 5).until(
                    EC.element_to_be_clickable((By.CLASS_NAME, "ipl-load-more__button")))

    except TimeoutException: # Button did not become clickable within 5 seconds - done
        print("Finished")
        break

Finished


### Task 4 -- Load the results in a `BeautifulSoup` object.

Since `bs4` has better tools for parsing HTML, we will now switch to this module to parse the results.  Recall that you can access the HTML of the current page from the `selenium` driver using `driver.page_source`.  You can use this attribute to make a `soup` object for the page using

> soup = BeautifulSoup(driver.page_source, 'html.parser')

In [7]:
from bs4 import BeautifulSoup
reviews_soup = BeautifulSoup(driver.page_source, 'html.parser')

### Task 5 -- Extract the information

Now extract the following data to a CSV file.

1. Title
2. Score
3. User
4. Date
5. Text (replace commas with semi-colons!)
6. Two columns, X and Y, taken from `"X out of Y found this helpful"`
7. Permanent link to the review.


In [8]:
# Your code here
review_blocks = reviews_soup.find_all("div", class_="imdb-user-review")

In [9]:
titles = [review.find("a", class_="title").text.strip() for review in review_blocks]
titles[:5]

['Borat Make a Number 2',
 'Excellent. And this is from a non Sasha Cohen Baron fan. REAL REVIEW.',
 'Laugh Out Loud Funny S#!^!',
 'Cohen is a genius',
 "The 10's are 10's & The 1's are 10's!"]

In [10]:
# Some reviews have no score, so fall back to None for those
def get_score(review):
    rating = review.find("span", class_="rating-other-user-rating")
    return rating.span.text if rating else None

scores = [get_score(review) for review in review_blocks]
scores[-20:-10]

['6', '3', '3', '2', '4', '2', '2', '2', '6', '6']

In [12]:
users = [review.find("span", class_="display-name-link").a.text for review in review_blocks]
users[:5]

['MissCzarChasm',
 'lvanka',
 'YourSonsDad',
 'WindsOfWintergreen',
 'AnaAnaBanana']

In [13]:
dates = [review.find("span", class_="review-date").text for review in review_blocks]
dates[:5]

['29 October 2020',
 '30 October 2020',
 '28 October 2020',
 '27 October 2020',
 '27 October 2020']

In [14]:
# Some review texts are hidden behind spoiler warnings or cut off on the visible page because of their length.
# In the HTML, however, the complete text can always be found in a div whose classes include "show-more__control"
texts = [review.find("div", class_="show-more__control").text for review in review_blocks]
texts[:5]

['Borat Make a *Glorious* #2! Subsequent Moviefilm: Delivery of Prodigious Bribe to American Regime for Make Benefit Once Glorious Nation of Kazakhstan is very naiiice!America Mayor Rudolph Giuliani say he not like film.America Mayor Rudolph Giuliani say he very much LIE\ndown to fix pants like in nation of Kazakhstan where we not stand up to tuck the shirt. Much success.You watch.Chin qui',
 'My husband loved SCB in all his incarnations (Ali G., Borat, Bruno, and that guy from Who is America). He\'d quote parts of Bruno ("But first, more dancing with Bruno!") as he\'d dance around me in the kitchen. He\'d quote parts of an interview SCB did with Dick Cheney, as I rolled my eyes. Every couple of years or so, he\'d put on Bruno or Borat, and laugh, and laugh, while I just shook my head. In short, he reverted from a mature 35 year old man, into a a teen, because of this one comedian. I kind of hated it and really didn\'t see the appeal of Cohen\'s comedy which I found crude and sometimes

In [17]:
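import re
# Allow thousands separators such as "1,234" in the helpfulness counts
x_out_of_y = re.compile(r"(\d[\d,]*) out of (\d[\d,]*)")
nums_txt = [review.find("div", class_="actions text-muted").text for review in review_blocks]
helpful_nums = [x_out_of_y.search(txt).groups() for txt in nums_txt]
# Drop the separators so the CSV columns hold plain digits
xes = [nums[0].replace(",", "") for nums in helpful_nums]
ys = [nums[1].replace(",", "") for nums in helpful_nums]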
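
The helpfulness line can also be parsed on its own, which makes the pattern easy to check against sample strings. This is a minimal sketch, assuming the text has the form "X out of Y found this helpful" and that large counts may contain thousands separators (an assumption about IMDB's formatting). `parse_helpful` is a hypothetical helper name, and the sample strings are made up.

```python
import re

# A digit followed by more digits or thousands separators, e.g. "1,234"
HELPFUL_PATTERN = re.compile(r"(\d[\d,]*) out of (\d[\d,]*)")


def parse_helpful(text):
    """Return (X, Y) as integers, or (None, None) if the pattern is absent."""
    match = HELPFUL_PATTERN.search(text)
    if match is None:
        return (None, None)
    x, y = (int(group.replace(",", "")) for group in match.groups())
    return (x, y)


print(parse_helpful("12 out of 20 found this helpful."))        # (12, 20)
print(parse_helpful("1,234 out of 1,500 found this helpful."))  # (1234, 1500)
print(parse_helpful("Was this review helpful?"))                # (None, None)
```

Returning `(None, None)` instead of raising keeps one malformed review from stopping the whole extraction.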

In [16]:
links = [review.find(text="Permalink").parent['href'] for review in review_blocks]
links[:5]

['/review/rw6217081/?ref_=tt_urv',
 '/review/rw6219436/?ref_=tt_urv',
 '/review/rw6213611/?ref_=tt_urv',
 '/review/rw6210276/?ref_=tt_urv',
 '/review/rw6211296/?ref_=tt_urv']
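
The scraped permalinks are relative to the site root. If absolute URLs are preferred in the CSV, one option is `urljoin` from the standard library. A minimal sketch, assuming the reviews page URL used earlier in this lab as the base:

```python
from urllib.parse import urljoin

# Page the relative links were scraped from
BASE_URL = "https://www.imdb.com/title/tt13143964/reviews"

relative_links = [
    "/review/rw6217081/?ref_=tt_urv",
    "/review/rw6219436/?ref_=tt_urv",
]

# A leading slash makes urljoin resolve the link against the site root
absolute_links = [urljoin(BASE_URL, link) for link in relative_links]
print(absolute_links[0])  # https://www.imdb.com/review/rw6217081/?ref_=tt_urv
```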

In [21]:
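import csv

# Task 5 asks for commas in the review text to be replaced with semi-colons
clean_texts = [text.replace(",", ";") for text in texts]

# newline="" stops the csv module from writing blank rows on Windows
with open("imdb.csv", "w", newline="", encoding="utf-8") as outfile:
    writer = csv.writer(outfile)
    for record in zip(titles, scores, users, dates, clean_texts, xes, ys, links):
        writer.writerow(record)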
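
Because `csv.writer` quotes any field that contains a comma, replacing commas in the review text is a convention of this lab rather than a requirement of the format. The following is a minimal sketch using an in-memory buffer and made-up rows (not scraped data); it shows a field with a comma, and a missing score, surviving a round trip.

```python
import csv
import io

# Made-up example rows, not real scraped reviews
rows = [
    ("Great, really", "9", "some_user"),
    ("No commas here", None, "another_user"),
]

buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(("title", "score", "user"))  # header row
writer.writerows(rows)

# Rewind the buffer and read the rows back
buffer.seek(0)
read_back = list(csv.reader(buffer))
print(read_back[1])  # ['Great, really', '9', 'some_user']
print(read_back[2])  # ['No commas here', '', 'another_user']
```

Note that `None` is written as an empty field, so reviews without a score show up as blank cells when the file is read back.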