# Lab 5.2 -- Scraping IMDB

Our goal is to scrape [IMDB](https://www.imdb.com) user reviews for *Borat Subsequent Moviefilm*.  Unfortunately, the user reviews page shows only a limited number of reviews at a time, and you can't reach the remaining reviews through a link.  `selenium` to the rescue! In this lab, we will combine our two approaches to web scraping by

1. Using `selenium` to load the page and click the *Load More* button until we have all the reviews.
2. Creating a `BeautifulSoup` instance for the complete page and parsing the results.

In [1]:
from selenium import webdriver
from composable import pipeable

### Task 1 -- Load the reviews.

Explore IMDB to find the web link for the user reviews for *Borat Subsequent Moviefilm* and load this page in Python with `selenium`.

In [2]:
# Your code here
DRIVER_PATH = '/mnt/c/Users/le7858ey/Desktop/chromedriver.exe'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://www.imdb.com/title/tt13143964/reviews?ref_=tt_ql_3')

### Task 2 -- Figure out how to click the *Load More* button.

To load all of the user reviews, we need to click the *Load More* button multiple times.  First, find the corresponding WebElement and verify that clicking this button loads another page of results.

In [20]:
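# Your code here
from selenium.webdriver.common.by import By

# find_element(By.CLASS_NAME, ...) replaces find_element_by_class_name, which newer selenium releases no longer provide.
more_button = driver.find_element(By.CLASS_NAME, 'ipl-load-more__button')
more_button.click()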

### Task 3 -- Click *Load More* until you have all the results.

Now you need to write code that will keep clicking the *Load More* button when you find it.  **Hint:** We can think of this as an example of an *unfold* process, meaning you should use a `while` loop combined with a [try-and-except statement](https://pythonbasics.org/try-except/) to keep trying to click the button.  To make sure you don't get an infinite loop, use a variable to identify and hold the stopping condition/state.
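The unfold pattern from the hint can be sketched without a browser. In this minimal sketch, `FakeButton` and `click_until_gone` are hypothetical names that stand in for the `selenium` element and your own loop; the `max_clicks` cap and the `more_to_load` flag together play the role of the stopping condition/state.

```python
class FakeButton:
    """Hypothetical stand-in for a Load More button that disappears after a few clicks."""

    def __init__(self, pages_left):
        self.pages_left = pages_left

    def click(self):
        # Mimic a driver raising an exception once the button is gone.
        if self.pages_left == 0:
            raise RuntimeError('button not found')
        self.pages_left -= 1


def click_until_gone(button, max_clicks=200):
    """Click until the button raises or the safety cap is reached; return the click count."""
    clicks = 0
    more_to_load = True  # stopping state: flipped to False by the first failed click
    while more_to_load and clicks < max_clicks:
        try:
            button.click()
            clicks += 1
        except Exception:
            more_to_load = False
    return clicks


print(click_until_gone(FakeButton(pages_left=3)))
```

The cap guarantees the loop ends even if the button never disappears, which is the infinite-loop guard the hint asks for.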

In [3]:
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [5]:
# Your code here
all_clear = True  # stopping state: becomes False once the button can no longer be clicked
i = 0
while all_clear and i < 200:
    try:
        more_button = driver.find_element(By.CLASS_NAME, 'ipl-load-more__button')
        more_button.click()
        i += 1
    except Exception:
        all_clear = False
        print('No more')

No more


In [10]:
# Wait up to 10 seconds for the Load More button to become clickable.
wait = WebDriverWait(driver, 10)
get_more = wait.until(EC.element_to_be_clickable((By.ID, 'load-more-trigger')))
all_clear = True
i = 0
while all_clear and i < 200:
    try:
        i += 1
        get_more.click()
        print('Clicked {0} times.'.format(i))
        # Wait for the button to come back; a timeout here means there is nothing left to load.
        get_more = wait.until(EC.element_to_be_clickable((By.ID, 'load-more-trigger')))
    except Exception:
        all_clear = False
        print('No more')

Clicked 1 times.
Clicked 2 times.
Clicked 3 times.
Clicked 4 times.
Clicked 5 times.
Clicked 6 times.
Clicked 7 times.
Clicked 8 times.
Clicked 9 times.
Clicked 10 times.
Clicked 11 times.
Clicked 12 times.
Clicked 13 times.
Clicked 14 times.
Clicked 15 times.
Clicked 16 times.
Clicked 17 times.
Clicked 18 times.
Clicked 19 times.
Clicked 20 times.
Clicked 21 times.
Clicked 22 times.
Clicked 23 times.
Clicked 24 times.
Clicked 25 times.
Clicked 26 times.
Clicked 27 times.
Clicked 28 times.
Clicked 29 times.
Clicked 30 times.
Clicked 31 times.
Clicked 32 times.
Clicked 33 times.
No more


### Task 4 -- Load the results in a `BeautifulSoup` object.

Since `bs4` has better tools for parsing html, we will now switch to using this module to parse the results.  Recall that you can access the content of the current page from the `selenium` driver using `driver.page_source`.  You can use this attribute to make a `soup` object for the page using 

> soup = BeautifulSoup(driver.page_source, 'html.parser')

In [12]:
from composable import pipeable
from composable.strict import map, filter
from composablesoup import find, find_all, get_text, has_attr
from composablesoup.soup import find_parent, parents, children, find_previous_sibling, find_previous_siblings, find_next_sibling, find_next_siblings
from composable.sequence import to_list, head
from composable.string import strip
from composable import from_toolz as tlz
import requests
from bs4 import BeautifulSoup

In [14]:
# Your code here
borat_reviews = BeautifulSoup(driver.page_source, 'html.parser')

### Task 5 -- Extract the information

Now extract the following data to a csv file.

1. Title
2. Score
3. User
4. Date
5. Text (replace commas with semi-colons!)
6. Two columns for X and Y, where `"X out of Y found this helpful"`
7. Permanent link to the review.
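For item 6, both numbers can be captured with a single regular expression. This is a minimal sketch using only the standard library; `parse_helpful` is a hypothetical helper name, and the handling of thousands separators (such as `1,024`) is an assumption about how large counts might be displayed.

```python
import re

# Pattern for text such as "167 out of 243 found this helpful."
HELPFUL = re.compile(r'([\d,]+) out of ([\d,]+) found this helpful')


def parse_helpful(text):
    """Return (X, Y) as integers, or (None, None) when the text does not match."""
    match = HELPFUL.search(text)
    if match is None:
        return (None, None)
    # Drop any thousands separators before converting to int (assumed format).
    x, y = (int(group.replace(',', '')) for group in match.groups())
    return (x, y)


print(parse_helpful('167 out of 243 found this helpful.'))
```

Using `search` instead of `match` lets the pattern be found even when other text surrounds it, and returning `(None, None)` avoids an exception on reviews that have no helpfulness line.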


In [1]:
# Your code here

In [23]:
strip = pipeable(lambda s: s.replace(',',';').strip())

get_title = pipeable(lambda soup: soup
>> find_all('div', attrs={'class':'lister-item'})
>> map(find('a', attrs={'class':'title'}))
>> map(get_text)
>> map(strip)
)

In [39]:
score_text = pipeable(lambda s: (s >> find('div', attrs={'class':'ipl-ratings-bar'}) >> get_text >> strip) if (s >> find('div', attrs={'class':'ipl-ratings-bar'})) is not None else '')

get_score = pipeable(lambda soup: soup
>> find_all('div', attrs={'class':'lister-item'})
>> map(score_text)
)

In [41]:
get_user = pipeable(lambda soup: soup
>> find_all('div', attrs={'class':'lister-item'})
>> map(find('span', attrs={'class':'display-name-link'}))
>> map(get_text)
>> map(strip)
)

In [44]:
get_date = pipeable(lambda soup: soup
>> find_all('div', attrs={'class':'lister-item'})
>> map(find('span', attrs={'class':'review-date'}))
>> map(get_text)
>> map(strip)
)

In [91]:
remove_extras = pipeable(lambda s: s.replace('\n',' ').replace('\\',''))
get_review = pipeable(lambda soup: soup
>> find_all('div', attrs={'class':'lister-item'})
>> map(find('div', attrs={'class':'text'}))
>> map(get_text)
>> map(remove_extras)
>> map(strip)
)

In [67]:
import re as r
first_number = r.compile(r'(\d+) out of \d+ .*')
get_first_number = pipeable(lambda s: first_number.match(s).groups()[0])
get_first_helpful = pipeable(lambda soup: soup
>> find_all('div', attrs={'class':'lister-item'})
>> map(find('div', attrs={'class':'actions'}))
>> map(get_text)
>> map(strip)
>> map(get_first_number)
)

In [68]:
second_number = r.compile(r'\d+ out of (\d+) .*')
get_second_number = pipeable(lambda s: second_number.match(s).groups()[0])
get_second_helpful = pipeable(lambda soup: soup
>> find_all('div', attrs={'class':'lister-item'})
>> map(find('div', attrs={'class':'actions'}))
>> map(get_text)
>> map(strip)
>> map(get_second_number)
)

In [82]:
get_link = pipeable(lambda s: s['href'])

get_permalink = pipeable(lambda soup: soup
>> find_all('div', attrs={'class':'lister-item'})
>> map(find('div', attrs={'class':'actions'}))
>> map(find_all('a'))
>> map(tlz.get(1))
>> map(get_link)
)

In [100]:
def get_all_reviews(soup):
    """
    Collect one tuple per review from the parsed reviews page:
    (title, score, user, date, text, helpful votes, total votes, permalink).
    """
    title = soup >> get_title
    score = soup >> get_score
    user = soup >> get_user
    date = soup >> get_date
    review = soup >> get_review
    first = soup >> get_first_helpful
    second = soup >> get_second_helpful
    permalink = soup >> get_permalink
    # zip pairs the i-th entry of every column into a single row.
    output = list(zip(title, score, user, date, review, first, second, permalink))
    return output



In [101]:
all_reviews = get_all_reviews(borat_reviews)
all_reviews[:2]


[('Borat Make a Number 2',
  '10/10',
  'MissCzarChasm',
  '29 October 2020',
  'Borat Make a *Glorious* #2! Subsequent Moviefilm: Delivery of Prodigious Bribe to American Regime for Make Benefit Once Glorious Nation of Kazakhstan is very naiiice!America Mayor Rudolph Giuliani say he not like film.America Mayor Rudolph Giuliani say he very much LIE down to fix pants like in nation of Kazakhstan where we not stand up to tuck the shirt. Much success.You watch.Chin qui',
  '167',
  '243',
  '/review/rw6217081/?ref_=tt_urv'),
 ('Excellent. And this is from a non Sasha Cohen Baron fan. REAL REVIEW.',
  '10/10',
  'lvanka',
  '30 October 2020',
  'My husband loved SCB in all his incarnations (Ali G.; Borat; Bruno; and that guy from Who is America). He\'d quote parts of Bruno ("But first; more dancing with Bruno!") as he\'d dance around me in the kitchen. He\'d quote parts of an interview SCB did with Dick Cheney; as I rolled my eyes. Every couple of years or so; he\'d put on Bruno or Borat; 

In [125]:
each_review = list(all_reviews)
# Join the eight fields of each review into one comma-separated line.
contents = [','.join(row) for row in each_review]

In [128]:
results = '\n'.join(contents)
results

 quite indicative of the misplaced appreciation some seem to have for this series. Cohen sabotaged the key moment of his own movie. How asinine.,1,14,/review/rw6231049/?ref_=tt_urv\nI give credit to the actress,2/10,Pukeonthestreet,2 November 2020,She was great. But making a joke about shootings in a time when they\'re actively happening is just f\'d. Honestly I started off laughing at a few things then it just got dark and gross and the jewish jokes made me wish I never watched it. Upsetting.,1,14,/review/rw6229582/?ref_=tt_urv\nOld hat,4/10,stircrazysos,2 November 2020,As a Sasha baron Cohen fan. I\'m was disappointed. He always pushes the boundaries; but this was "old hat". From Ali g to the spy. The guy is genius. This film though was tripe. Same jokes; just rehashed. You can\'t play characters like Borat; ali g or bruno. More than once; the public won\'t fall for it. Such ashame; because the first borat film was great. Hence 4/10 come on Sasha you can do better.,1,15,/review/rw623

In [129]:
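# An explicit encoding keeps non-ASCII characters in reviews from depending on the platform default.
with open('borat_reviews.csv', 'w', encoding='utf-8') as outfile: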
    outfile.write(results)
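As an alternative to joining the fields by hand, the standard-library `csv` module quotes any field that contains commas or quotation marks, so the review text could keep its commas. This is a minimal sketch, assuming rows shaped like the tuples in `all_reviews`; the sample row, the header names, and the file name `borat_reviews_quoted.csv` are illustrative only.

```python
import csv

# Hypothetical sample row in the same field order as all_reviews.
rows = [
    ('A title, with a comma', '10/10', 'some_user', '29 October 2020',
     'Review text, with "quotes" and commas.', '167', '243',
     '/review/rw0000000/'),
]
header = ['title', 'score', 'user', 'date', 'text',
          'helpful_yes', 'helpful_total', 'permalink']

# newline='' lets the csv module control line endings.
with open('borat_reviews_quoted.csv', 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(header)
    writer.writerows(rows)

# Read the file back to confirm the fields survive unchanged.
with open('borat_reviews_quoted.csv', newline='', encoding='utf-8') as infile:
    read_back = list(csv.reader(infile))
```

Reading the file back with `csv.reader` returns the original strings, commas included, which is the main advantage over manual joining.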