# Lab 5.2 -- Scraping IMDB

Our goal is to scrape [IMDB](imdb.com) user reviews for *Borat Subsequent Moviefilm*.  Unfortunately, the user reviews page shows only a limited number of reviews, and you can't reach the rest through a link to another page.  `selenium` to the rescue! In this lab, we will combine our two approaches to web scraping by

1. Using `selenium` to load the page and click the *Load More* button until we have all the reviews.
2. Creating a `BeautifulSoup` instance for the complete page and parsing the results.

### Task 1 -- Load the reviews.

Explore IMDB to find the web link for the user reviews for *Borat Subsequent Moviefilm* and load this page in Python with `selenium`.

In [42]:
from selenium import webdriver


DRIVER_PATH = '/mnt/c/Users/Dillon McDaniel/Documents/chromedriver.exe'
url = 'https://www.imdb.com/'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get(url)

In [43]:
input_field = driver.find_element_by_id('suggestion-search')
input_button = driver.find_element_by_id('suggestion-search-button')
input_field.send_keys('Borat Subsequent Moviefilm')
input_button.click()

In [44]:
result_field = driver.find_element_by_class_name('result_text')
result_link = result_field.find_element_by_tag_name('a')
result_link.click()

In [45]:
result_field = driver.find_element_by_class_name('titleReviewbarItemBorder')
result_link = result_field.find_element_by_tag_name('a')
result_link.click()

### Task 2 -- Figure out how to click the *Load More* button.

To load all of the user reviews, we need to click the *Load More* button multiple times.  First, find the corresponding WebElement and verify that clicking this button loads another page of results.

In [36]:
input_button = driver.find_element_by_id('load-more-trigger')
input_button.click()

Done


### Task 3 -- Click *Load More* until you have all the results.

Now you need to write code that will keep clicking the *Load More* button when you find it.  **Hint:** We can think of this as an example of an *unfold* process, meaning you should use a `while` loop combined with a [try-and-except statement](https://pythonbasics.org/try-except/) to keep trying to click the button.  To make sure you don't get an infinite loop, use a variable to identify and hold the stopping condition/state.
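The control flow in the hint can be sketched without a browser. The example below is only an illustration: `FakeButton` is a hypothetical stand-in for the *Load More* WebElement (it is not part of `selenium`), and `max_clicks` is an assumed safety cap that plays the role of the stopping variable.

```python
class FakeButton:
    """Hypothetical stand-in for the Load More WebElement; fails after `pages` clicks."""

    def __init__(self, pages):
        self.remaining = pages

    def click(self):
        if self.remaining == 0:
            # A real driver would raise a selenium exception here instead.
            raise RuntimeError("button no longer available")
        self.remaining -= 1


def click_until_gone(button, max_clicks=1000):
    """Click `button` until it raises or `max_clicks` is reached; return the click count."""
    clicks = 0
    keep_running = True
    while keep_running and clicks < max_clicks:
        try:
            button.click()
            clicks += 1
        except RuntimeError:
            keep_running = False  # stopping condition: the click failed
    return clicks


print(click_until_gone(FakeButton(pages=3)))
```

The loop ends in one of two ways: the click raises, or the cap is reached. The second case is what prevents an infinite loop if the button never disappears.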

In [47]:
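import time

keep_running = True
i = 0  # click counter; the cap below guards against an infinite loop
while keep_running and i < 1000:
    try:
        input_button = driver.find_element_by_id('load-more-trigger')
        input_button.click()
        time.sleep(4)  # give the next batch of reviews time to load
        i = i + 1
    except Exception:
        # The button is missing or no longer clickable; assume all reviews are loaded.
        keep_running = False
        print("Done")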

Done


### Task 4 -- Load the results in a `BeautifulSoup` object.

Since `bs4` has better tools for parsing HTML, we will now switch to using this module to parse the results.  Recall that you can access the HTML of the current page from the `selenium` driver using `driver.page_source`.  You can use this attribute to make a `soup` object for the page using

> soup = BeautifulSoup(driver.page_source, 'html.parser')

In [48]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
soup

<html class="scriptsOn" style="--ipt-focus-outline-on-base:none; --ipt-focus-outline-on-baseAlt:none;" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#"><head><script async="" crossorigin="anonymous" src="https://images-na.ssl-images-amazon.com/images/I/31YXrY93hfL.js"></script><script async="" crossorigin="anonymous" src="https://m.media-amazon.com/images/G/01/imdbads/custom/test/index/js/ad-plugins/showadv2.js"></script>
<script type="text/javascript">var ue_t0=ue_t0||+new Date();</script>
<script type="text/javascript">
window.ue_ihb = (window.ue_ihb || window.ueinit || 0) + 1;
if (window.ue_ihb === 1) {

var ue_csm = window,
    ue_hob = +new Date();
(function(d){var e=d.ue=d.ue||{},f=Date.now||function(){return+new Date};e.d=function(b){return f()-(b?0:d.ue_t0)};e.stub=function(b,a){if(!b[a]){var c=[];b[a]=function(){c.push([c.slice.call(arguments),e.d(),d.ue_id])};b[a].replay=function(b){for(var a;a=c.shift();)b(a[0],a[1],a[2])};b[a].isStub=1}};e.exec=func

### Task 5 -- Extract the information

Now extract the following data to a csv file.

1. Title
2. Score
3. User
4. Date
5. Text (replace commas with semi-colons!)
6. Two columns for X and Y, where `"X out of Y found this helpful"`
7. Permanent link to the review.
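For item 6, one possible way to pull X and Y out of the helpfulness text is a single regular expression with two groups. This is a sketch under assumptions: the function name `parse_helpful` is made up for illustration, and the pattern assumes the wording `"X out of Y found this helpful"`, possibly with thousands separators in the counts.

```python
import re

# Assumed format: "<X> out of <Y> found this helpful", optionally with thousands separators.
HELPFUL_PATTERN = re.compile(r"([\d,]+) out of ([\d,]+) found this helpful")


def parse_helpful(text):
    """Return (X, Y) as strings without separators, or ("", "") if the pattern is absent."""
    match = HELPFUL_PATTERN.search(text)
    if match is None:
        return "", ""
    x, y = (group.replace(",", "") for group in match.groups())
    return x, y


print(parse_helpful("167 out of 246 found this helpful."))
print(parse_helpful("1,234 out of 1,500 found this helpful."))
```

Removing the separators inside the counts also keeps stray commas out of the CSV columns.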


In [138]:
from composable import pipeable
from composable.strict import map, filter
from composablesoup import find, find_all, get_text, has_attr
from composablesoup.soup import find_parent, parents, children, find_previous_sibling, find_previous_siblings, find_next_sibling, find_next_siblings
from composable.sequence import to_list, head
from composable.string import strip
from composable import from_toolz as tlz
import re


get_X = pipeable(lambda l: "" if re.search(r"(\d+)", l) is None else re.search(r"(\d+)", l).group(0))
get_Y = pipeable(lambda l: "" if re.search(r" (\d+)", l) is None else re.search(r" (\d+)", l).group(0).strip())
safe_getText = pipeable(lambda l: l.text.strip() if l is not None else "")
replace_Commas = pipeable(lambda l: l.replace(",",";"))
safe_getLink = pipeable(lambda l: "".join(["www.imdb.com",l["href"]]) if l is not None and l.get("href") is not None else "")
clear_newLines = pipeable(lambda l: l.replace("\n"," "))

In [154]:
test_review = soup.find('div', attrs={'class': re.compile(r'lister-item mode-detail\simdb-user-review')})
test_review

<div class="lister-item mode-detail imdb-user-review collapsable" data-initialized="true" data-review-id="rw6217081" data-vote-url="/title/tt13143964/review/rw6217081/vote/interesting">
<div class="review-container" style="max-height: none;">
<div class="lister-item-content">
<div class="ipl-ratings-bar">
<span class="rating-other-user-rating">
<svg class="ipl-icon ipl-star-icon" fill="#000000" height="24" viewbox="0 0 24 24" width="24" xmlns="http://www.w3.org/2000/svg">
<path d="M0 0h24v24H0z" fill="none"></path>
<path d="M12 17.27L18.18 21l-1.64-7.03L22 9.24l-7.19-.61L12 2 9.19 8.63 2 9.24l5.46 4.73L5.82 21z"></path>
<path d="M0 0h24v24H0z" fill="none"></path>
</svg>
<span>10</span><span class="point-scale">/10</span>
</span>
</div>
<a class="title" href="/review/rw6217081/?ref_=tt_urv"> Borat Make a Number 2
</a> <div class="display-name-date">
<span class="display-name-link"><a href="/user/ur0649363/?ref_=tt_urv">MissCzarChasm</a></span><span class="review-date">29 October 2020</s

In [62]:
get_title = pipeable(lambda soup: 
                     (soup
                      >>find('a', attrs={'class': 'title'})
                      >>safe_getText
                      >>replace_Commas
                     )
                    )

get_title(test_review)

'Borat Make a Number 2'

In [158]:
get_score = pipeable(lambda soup: 
                     (soup
                      >>find('span', attrs={'class': 'point-scale'})
                      >>find_previous_sibling
                      >>safe_getText
                      >>replace_Commas
                     ) if soup.find('span', attrs={'class': 'point-scale'}) is not None else "NA"
                    )
get_score(test_review)

'10'

In [64]:
get_username = pipeable(lambda soup: 
                     (soup
                      >>find('span', attrs={'class': 'display-name-link'})
                      >>safe_getText
                      >>replace_Commas
                     )
                    )
get_username(test_review)

'MissCzarChasm'

In [65]:
get_date = pipeable(lambda soup: 
                     (soup
                      >>find('span', attrs={'class': 'review-date'})
                      >>safe_getText
                      >>replace_Commas
                     )
                    )
get_date(test_review)

'29 October 2020'

In [141]:
get_comment = pipeable(lambda soup: 
                     (soup
                      >>find('div', attrs={'class': 'content'})
                      >>find('div')
                      >>safe_getText
                      >>replace_Commas
                      >>clear_newLines
                     )
                    )
get_comment(test_review)

'Borat Make a *Glorious* #2! Subsequent Moviefilm: Delivery of Prodigious Bribe to American Regime for Make Benefit Once Glorious Nation of Kazakhstan is very naiiice!America Mayor Rudolph Giuliani say he not like film.America Mayor Rudolph Giuliani say he very much LIE down to fix pants like in nation of Kazakhstan where we not stand up to tuck the shirt. Much success.You watch.Chin qui'

In [113]:
get_helpfulX = pipeable(lambda soup: 
                     (soup
                      >>find('div', attrs={'class': 'actions text-muted'})
                      >>safe_getText
                      >>get_X
                     )
                    )
get_helpfulX(test_review)

'167'

In [114]:
get_helpfulY = pipeable(lambda soup: 
                     (soup
                      >>find('div', attrs={'class': 'actions text-muted'})
                      >>safe_getText
                      >>get_Y
                     )
                    )
get_helpfulY(test_review)

'246'

In [123]:
get_link = pipeable(lambda soup: 
                     (soup
                      >>find('div', attrs={'class': 'actions text-muted'})
                      >>find('span')
                      >>find_next_sibling
                      >>find_next_sibling
                      >>safe_getLink
                     ) if soup.find('div', attrs={'class': 'actions text-muted'}) is not None else ""
                    )
get_link(test_review)

'www.imdb.com/review/rw6217081/?ref_=tt_urv'
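The link above has no scheme, so some tools will not treat it as an absolute URL. A minimal sketch of one alternative uses the standard library's `urllib.parse`. The `https` scheme, the helper name `absolute_permalink`, and the choice to drop the `ref_` query string are all assumptions for illustration, not part of the lab's requirements.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE_URL = "https://www.imdb.com"  # assumed scheme and host


def absolute_permalink(href):
    """Join a site-relative href onto BASE_URL and drop the query string."""
    full = urljoin(BASE_URL, href)
    parts = urlsplit(full)
    # Rebuild the URL with an empty query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(absolute_permalink("/review/rw6217081/?ref_=tt_urv"))
```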

In [155]:

def get_info(soup):
    
    reviews = soup.find_all('div', attrs={'class': re.compile(r'lister-item mode-detail\simdb-user-review')})
    titles = map(get_title,reviews)
    scores = map(get_score,reviews)
    users = map(get_username,reviews)
    dates = map(get_date,reviews)
    comments = map(get_comment,reviews)
    helpfulXs = map(get_helpfulX,reviews)
    helpfulYs = map(get_helpfulY,reviews)
    links = map(get_link,reviews)
    
    data = [list(a) for a in zip(titles,scores,users,dates,comments,helpfulXs,helpfulYs,links)]
    lines = [",".join(r) for r in data]
    content = "\n".join(lines)
    return content
    
get_info(soup)



In [161]:
with open('imdb_data.csv','w') as outfile:
    outfile.write("titles,scores,users,dates,comments,helpfulXs,helpfulYs,link\n")


In [162]:
content = get_info(soup)
with open('imdb_data.csv','a') as outfile:  
    outfile.write(content)
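The rows above are joined by hand, which is why commas in the review text had to be replaced with semi-colons. As a hedged alternative, the standard library's `csv` module quotes any field that contains a comma or newline, so the original text can be kept. The sketch below writes to an in-memory buffer; `rows_to_csv` and the sample row are made up for illustration.

```python
import csv
import io

HEADER = ["titles", "scores", "users", "dates", "comments", "helpfulXs", "helpfulYs", "link"]


def rows_to_csv(rows):
    """Serialize rows to CSV text; fields containing commas or newlines are quoted."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up row for illustration only.
sample = [["Great, really", "10", "someuser", "29 October 2020",
           "Very nice", "1", "2", "www.imdb.com/review/rw0000000/"]]
text = rows_to_csv(sample)
print(text)

# Reading the text back recovers the comma inside the first field.
parsed = list(csv.reader(io.StringIO(text)))
```

To write to disk instead, the same `csv.writer` can wrap a file opened with `open('imdb_data.csv', 'w', newline='')`.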