# Lab 5.2 -- Scraping IMDB

Our goal is to scrape [IMDB](imdb.com) user reviews for *Borat Subsequent Moviefilm*.  Unfortunately, the user reviews page shows only a limited number of reviews at first, and the remaining reviews cannot be reached by following a link to another page.  `selenium` to the rescue! In this lab, we will combine our two approaches to web scraping by

1. Using `selenium` to load the page and click the *Load More* button until we have all the reviews.
2. Creating a `BeautifulSoup` instance for the complete page and parsing the results.

### Task 1 -- Load the reviews.

Explore IMDB to find the web link for the user reviews for *Borat Subsequent Moviefilm* and load this page in Python with `selenium`.

In [73]:
from composablesoup import find, find_all, get_text, has_attr
from composable.sequence import slice, head
from composable.strict import map, filter
from composable.string import replace,split
from composable import from_toolz as tlz
from composable import pipeable
from composablesoup.soup import find_parent, parents, children, find_previous_sibling, find_previous_siblings, find_next_sibling, find_next_siblings
import re
import requests
from bs4 import BeautifulSoup

In [2]:
!pip install selenium

Collecting selenium
  Downloading selenium-3.141.0-py2.py3-none-any.whl (904 kB)
Installing collected packages: selenium
Successfully installed selenium-3.141.0


In [94]:
from selenium import webdriver

DRIVER_PATH = '/mnt/c/Users/vn6415dw/Desktop/chromedriver.exe'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://www.imdb.com/title/tt13143964/reviews?ref_=tt_ov_rt')

### Task 2 -- Figure out how to click the *Load More* button.

To load all of the user reviews, we need to click the *Load More* button multiple times.  First, find the corresponding WebElement and verify that clicking this button loads another page of results.
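One way to verify that a click really loads more results is to count the reviews on the page before and after clicking. The sketch below makes no claim about IMDB's markup: `count_reviews` and `click` are hypothetical callables you would supply yourself (for example, a function that counts the elements matching whatever selector you find for a single review, and `load_more.click`). The `FakePage` class and its batch size of 25 are made up so the sketch runs without a browser.

```python
import time


def click_adds_reviews(count_reviews, click, timeout=10.0, poll=0.5):
    """Return True if calling click() makes count_reviews() grow within `timeout` seconds."""
    before = count_reviews()
    click()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if count_reviews() > before:
            return True
        time.sleep(poll)  # give the page time to fetch the next batch
    return count_reviews() > before


class FakePage:
    """Stand-in for a browser page: each click adds 25 reviews (made-up number)."""

    def __init__(self):
        self.n_reviews = 25

    def click(self):
        self.n_reviews += 25

    def count(self):
        return self.n_reviews


page = FakePage()
loaded_more = click_adds_reviews(page.count, page.click, timeout=1.0, poll=0.1)
print(loaded_more)
```

Polling with a deadline, rather than checking once immediately after the click, allows for the delay between the click and the new reviews appearing in the page.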

In [95]:
load_more = driver.find_element_by_id('load-more-trigger')
load_more

<selenium.webdriver.remote.webelement.WebElement (session="527b180533205688b3322731b7e14c48", element="e302d762-5846-4e32-b76d-0e90bf83a03d")>

In [96]:
load_more.click()

In [97]:
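# Naive attempt: click repeatedly without waiting for the next batch to load.
keep_running = True
i = 0
while keep_running and i < 1000:  # cap the number of attempts
    try:
        i = i + 1
        load_more.click()
        print('Click Number {0}'.format(i))
    except Exception:
        # Any failure to click (e.g. the button is not yet clickable) ends the loop.
        keep_running = False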

Click Number 1


### Task 3 -- Click *Load More* until you have all the results.

Now you need to write code that keeps clicking the *Load More* button for as long as it can be found.  **Hint:** We can think of this as an example of an *unfold* process, meaning you should use a `while` loop combined with a [try-and-except statement](https://pythonbasics.org/try-except/) to keep trying to click the button.  To make sure you don't get an infinite loop, use a variable that holds the stopping state.
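The unfold pattern from the hint can be sketched without a browser. In this minimal example, `unfold_clicks` and `FakeButton` are hypothetical names introduced for illustration; in the lab, `click` would be the button's `click` method and the exception would come from `selenium`.

```python
def unfold_clicks(click, max_clicks=1000):
    """Call click() until it raises or max_clicks is reached; return the number of successful clicks."""
    clicks = 0
    keep_running = True  # the stopping state the hint asks for
    while keep_running and clicks < max_clicks:
        try:
            click()
            clicks += 1
        except Exception:  # in the lab this would be a selenium exception
            keep_running = False
    return clicks


class FakeButton:
    """Stand-in for the Load More button: it disappears after `pages` clicks."""

    def __init__(self, pages):
        self.remaining = pages

    def click(self):
        if self.remaining == 0:
            raise RuntimeError("button is no longer on the page")
        self.remaining -= 1


button = FakeButton(pages=3)
n_clicks = unfold_clicks(button.click)
print(n_clicks)
```

The `max_clicks` cap is a second safeguard: even if the click never fails, the loop still terminates.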

In [98]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [99]:
wait = WebDriverWait(driver,10)
load_more = wait.until(EC.element_to_be_clickable((By.ID, 'load-more-trigger')))

In [100]:
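keep_running = True
i = 0
while keep_running and i < 1000:  # cap the number of attempts
    try:
        i = i + 1
        load_more.click()
        print('Click Number {0}'.format(i))
        # Wait up to 10 seconds for the button to become clickable again.
        load_more = wait.until(EC.element_to_be_clickable((By.ID, 'load-more-trigger')))
    except Exception:
        # A failed click or a timed-out wait means there is nothing more to load.
        keep_running = False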

Click Number 1
Click Number 2
Click Number 3
Click Number 4
Click Number 5
Click Number 6
Click Number 7
Click Number 8
Click Number 9
Click Number 10
Click Number 11
Click Number 12
Click Number 13
Click Number 14
Click Number 15
Click Number 16
Click Number 17
Click Number 18
Click Number 19
Click Number 20
Click Number 21
Click Number 22
Click Number 23
Click Number 24
Click Number 25
Click Number 26
Click Number 27
Click Number 28
Click Number 29
Click Number 30
Click Number 31
Click Number 32
Click Number 33
Click Number 34
Click Number 35


In [102]:
driver.page_source



### Task 4 -- Load the results in a `BeautifulSoup` object.

Since `bs4` has better tools for parsing HTML, we will now switch to this module to parse the results.  Recall that you can access the content of the current page from the `selenium` driver using `driver.page_source`.  You can use this attribute to make a `soup` object for the page using

> soup = BeautifulSoup(driver.page_source, 'html.parser')

In [104]:
review_soup = BeautifulSoup(driver.page_source, 'html.parser')
review_soup

<html class="scriptsOn" style="--ipt-focus-outline-on-base:none; --ipt-focus-outline-on-baseAlt:none;" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#"><head><script async="" crossorigin="anonymous" src="https://images-na.ssl-images-amazon.com/images/I/31YXrY93hfL.js"></script><script async="" crossorigin="anonymous" src="https://m.media-amazon.com/images/G/01/imdbads/custom/test/index/js/ad-plugins/showadv2.js"></script>
<script type="text/javascript">var ue_t0=ue_t0||+new Date();</script>
<script type="text/javascript">
window.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;
if (window.ue_ihb === 1) {

var ue_csm = window,
    ue_hob = +new Date();
(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.slice.call(arguments),e.d(),d.ue_id])};b[a].replay=function(b){for(var a;a=c.shift();)b(a[0],a[1],a[2])};b[a].isStub=1}};e.exec=func

### Task 5 -- Extract the information

Now extract the following data to a CSV file.

1. Title
2. Score
3. User
4. Date
5. Text (replace commas with semi-colons!)
6. Two columns for X and Y, taken from the text `"X out of Y found this helpful"`
7. Permanent link to the review.
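Items 5 and 6 can be handled with the standard library alone. The sketch below is one possible approach, not a reference solution: `parse_helpful` and `clean_text` are hypothetical helpers, the column names are arbitrary, and the `review` dictionary is made-up data standing in for the values you would pull out of the soup.

```python
import csv
import io
import re

# Allow thousands separators such as "1,204" in either count.
HELPFUL_RE = re.compile(r"([\d,]+) out of ([\d,]+) found this helpful")


def parse_helpful(text):
    """Return (X, Y) as ints from 'X out of Y found this helpful', or (None, None)."""
    match = HELPFUL_RE.search(text)
    if match is None:
        return None, None
    x, y = (int(group.replace(",", "")) for group in match.groups())
    return x, y


def clean_text(text):
    """Replace commas with semicolons, as item 5 asks."""
    return text.replace(",", ";")


# Made-up review used only to exercise the helpers.
review = {
    "title": "Example title",
    "score": "7",
    "user": "example_user",
    "date": "1 January 2021",
    "text": "Funny, sharp, and uncomfortable.",
    "helpful": "1,204 out of 1,500 found this helpful.",
    "link": "/review/rw0000000/",
}

found, total = parse_helpful(review["helpful"])
row = [review["title"], review["score"], review["user"], review["date"],
       clean_text(review["text"]), found, total, review["link"]]

buffer = io.StringIO()  # swap for open('reviews.csv', 'w', newline='') to write a file
writer = csv.writer(buffer)
writer.writerow(["title", "score", "user", "date", "text", "helpful_x", "helpful_y", "link"])
writer.writerow(row)
print(buffer.getvalue())
```

Using `csv.writer` rather than joining strings by hand means any field that still contains a comma or a quote is escaped correctly.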


In [1]:
# Your code here