<div class="alert alert-block alert-warning">Each assignment must be completed independently. Never copy others' work or let someone copy your solution (even with minor modifications, e.g. changing variable names). Anti-plagiarism software will be used to check all submissions. No last-minute extensions of the due date will be granted. Be sure to start working on it ASAP!</div>

## Q1. Collecting Movie Reviews 



Write a function `getReviews(url)` to scrape all **reviews on the first page**, collecting the following fields for each review:
- **title** (see (1) in Figure)
- **reviewer's name** (see (2) in Figure)
- **date** (see (3) in Figure)
- **rating** (see (4) in Figure)
- **review content** (see (5) in Figure; for each review, get the **complete text**)
- **helpful** (see (6) in Figure)


Requirements:
- `Function Input`: URL of the movie review page
- `Function Output`: return all reviews as a DataFrame with columns (`title, reviewer, rating, date, review, helpful`). For the given URL, you can get 24 reviews.
- If a field (e.g. rating) is missing, use `None` to indicate it.

    
![alt text](IMDB.png "IMDB")

In [1]:
import requests
from bs4 import BeautifulSoup  
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time


In [2]:
def getElement(start_node, css_selector):
    
    result = None
    
    nodes = start_node.select(css_selector)
    if len(nodes) > 0:
        result = nodes[0].get_text().strip() 
    
    return result

def getReviews(page_url):
    
    reviews=[]
    page = requests.get(page_url)    # send a get request to the web page

        # status_code 200 indicates success;
        # any other status code is treated as a failure here
    if page.status_code==200:        
        soup = BeautifulSoup(page.content, 'html.parser')

        # each review sits in its own div with class 'lister-item';
        # select all of them, then extract the fields from each one
        divs=soup.select("div.lister-item")
        #print(len(divs))
        #print(divs)
        
        for div in divs:
            
            title = getElement(div, "a.title")
            
            user = getElement(div, "span.display-name-link")
            
            date = getElement(div, "span.review-date")
            
            rating = getElement(div, "span.rating-other-user-rating span")
            
            review = getElement(div, "div.text.show-more__control")
            
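            helpful = getElement(div, "div.actions.text-muted")
            # keep only the first line ("X out of Y found this helpful.");
            # skip the split when the block is missing so the field stays None
            if helpful is not None:
                helpful = helpful.split("\n")[0].strip()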
            
            #print([title, user, date, rating, review, helpful])
            reviews.append([title, user, date, rating, review, helpful])
            
            
    reviews_tb = pd.DataFrame(reviews, columns = ['title', 'user',  'date', 'rating','review','helpful'])   
    
    return reviews_tb


In [3]:
page_url = 'https://www.imdb.com/title/tt1745960/reviews?sort=totalVotes&dir=desc&ratingFilter=0'
reviews = getReviews(page_url)

print(len(reviews))
reviews.head()

25


| | title | user | date | rating | review | helpful |
|---|---|---|---|---|---|---|
| 0 | This is slightly different to the other review... | scottedwards-87359 | 26 May 2022 | 10 | If you were a late teen or in your early twent... | 4,546 out of 4,762 found this helpful. |
| 1 | The truly epic blockbuster we needed. | Top_Dawg_Critic | 23 May 2022 | 10 | Wow. The first Top Gun is a classic, and as we... | 2,418 out of 2,690 found this helpful. |
| 2 | Let me just say... | lovefalloutkindagamer | 26 May 2022 | 10 | I was reluctantly dragged into the theater, th... | 1,414 out of 1,564 found this helpful. |
| 3 | Best Sequel yet | goshamorrell | 25 May 2022 | 10 | In one of the more memorable lines in the orig... | 988 out of 1,156 found this helpful. |
| 4 | The real cinema experience! | alexglimbergwindh | 30 May 2022 | 10 | If there's any movie that deserves to be seen ... | 827 out of 941 found this helpful. |
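
The `helpful` column above is stored as a sentence such as "4,546 out of 4,762 found this helpful." If the two counts are needed as numbers later, they could be pulled out with a regular expression. The following is a minimal sketch, not part of the assignment requirements; `parse_helpful` and `HELPFUL_PATTERN` are hypothetical names introduced here, and the pattern assumes the "X out of Y" wording shown in the output.

```python
import re

# Hypothetical helper (not required by the assignment): extract the two
# counts from a string like "4,546 out of 4,762 found this helpful."
HELPFUL_PATTERN = re.compile(r"([\d,]+) out of ([\d,]+)")

def parse_helpful(text):
    """Return (helpful_votes, total_votes) as ints, or None if text is missing or malformed."""
    if text is None:
        return None
    match = HELPFUL_PATTERN.search(text)
    if match is None:
        return None
    # strip thousands separators before converting to int
    helpful_votes, total_votes = (int(g.replace(",", "")) for g in match.groups())
    return helpful_votes, total_votes

print(parse_helpful("4,546 out of 4,762 found this helpful."))  # (4546, 4762)
print(parse_helpful("827 out of 941 found this helpful."))      # (827, 941)
print(parse_helpful(None))                                      # None
```

Returning `None` for a missing or unexpected string keeps the same "missing field" convention used for the other columns.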


In [4]:
reviews.iloc[0]["review"]

"If you were a late teen or in your early twenties in the mid 1980's the world was very different. No computers, no mobile phones, no internet, no DVD's. We had cars though, and bikes, and we loved them, and we loved films too. The original Top Gun captured this moment in time perfectly, and gave us a thrilling ride like we had never seen before. The humour, the games, the bikes, the aircraft and my word, those flying scenes. We went back to the cinema to see it again and again, and spent the following decades quoting the movie. As time went on, it remained like a static snapshot in time to perfectly represent that magical point in our lives for so many of us.Now, 36 years later, we are a generation that has lost our parents, we've had our own children who have moved on themselves, and we now approach the end of our own careers and our young selves are gone forever.This film is the missing bookend to that whole generation. The original was there for the start of our young adult lives, 

## Q2 (Bonus). Scrape Dynamic Content


Write a function `get_N_review(url, N)` to scrape **at least 100 reviews** by clicking the "Load More" button 5 times through Selenium WebDriver.
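
For a general `N`, the number of "Load More" clicks could be derived from the page size rather than hard-coded. The sketch below is illustrative only: `clicks_needed` is a hypothetical helper, and it assumes that the first page shows 25 reviews (the count printed in Q1) and that each click loads another batch of the same size.

```python
import math

def clicks_needed(n_reviews, per_page=25):
    """Number of "Load More" clicks needed to have at least n_reviews loaded.

    per_page=25 is an assumption based on the first-page count printed in Q1;
    each click is assumed to load another per_page reviews.
    """
    if n_reviews <= per_page:
        return 0  # the first page already has enough reviews
    return math.ceil((n_reviews - per_page) / per_page)

print(clicks_needed(100))  # 3
print(clicks_needed(150))  # 5
print(clicks_needed(20))   # 0
```

Under these assumptions, 5 clicks load 150 reviews, which comfortably covers the "at least 100" requirement.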

In [73]:
def getElement(start_node, css_selector):
    
    result = None
    
    nodes = start_node.select(css_selector)
    if len(nodes) > 0:
        result = nodes[0].get_text().strip() 
    
    return result

def getReviews(page_url, driver):
    
    reviews=[]
    driver.get(page_url)
    
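    # click "Load More" 5 times; each click appends another batch of reviews
    for i in range(5):
        
        # scroll to the bottom so the button is in view
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            
        # wait (up to 10 seconds) for the button to become clickable, then click it
        button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="load-more-trigger"]')))
        button.click()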


    page_source=driver.page_source
    soup = BeautifulSoup(page_source, 'html.parser')

    divs=soup.select("div.lister-item")
    #print(len(divs))
    #print(divs)
        
    for div in divs:

        title = getElement(div, "a.title")

        user = getElement(div, "span.display-name-link")

        date = getElement(div, "span.review-date")

        rating = getElement(div, "span.rating-other-user-rating span")

        review = getElement(div, "div.text.show-more__control")

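        helpful = getElement(div, "div.actions.text-muted")
        # keep only the first line ("X out of Y found this helpful.");
        # skip the split when the block is missing so the field stays None
        if helpful is not None:
            helpful = helpful.split("\n")[0].strip()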

        #print([title, user, date, rating, review, helpful])
        reviews.append([title, user, date, rating, review, helpful])

            
    reviews_tb = pd.DataFrame(reviews, columns = ['title', 'user',  'date', 'rating','review','helpful'])   
    
    return reviews_tb

In [74]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
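from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.service import Service

# for Firefox browser, do the following
# (1) find the path where you save the webdriver 
executable_path = '../notes/Web_Scraping/driver/geckodriver'
# (2) initialize the driver (Selenium 4 takes the driver path via a Service object;
#     passing executable_path directly is deprecated)
driver = webdriver.Firefox(service=Service(executable_path))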



page_url = 'https://www.imdb.com/title/tt1745960/reviews?sort=totalVotes&dir=desc&ratingFilter=0'
reviews = getReviews(page_url, driver)
driver.quit()

print(len(reviews))
reviews.head()

150


| | title | user | date | rating | review | helpful |
|---|---|---|---|---|---|---|
| 0 | This is slightly different to the other review... | scottedwards-87359 | 26 May 2022 | 10 | If you were a late teen or in your early twent... | 4,501 out of 4,713 found this helpful. |
| 1 | The truly epic blockbuster we needed. | Top_Dawg_Critic | 23 May 2022 | 10 | Wow. The first Top Gun is a classic, and as we... | 2,390 out of 2,657 found this helpful. |
| 2 | Let me just say... | lovefalloutkindagamer | 26 May 2022 | 10 | I was reluctantly dragged into the theater, th... | 1,394 out of 1,541 found this helpful. |
| 3 | Best Sequel yet | goshamorrell | 25 May 2022 | 10 | In one of the more memorable lines in the orig... | 981 out of 1,146 found this helpful. |
| 4 | The real cinema experience! | alexglimbergwindh | 30 May 2022 | 10 | If there's any movie that deserves to be seen ... | 803 out of 915 found this helpful. |


In [5]:
reviews

| | title | user | date | rating | review | helpful |
|---|---|---|---|---|---|---|
| 0 | This is slightly different to the other review... | scottedwards-87359 | 26 May 2022 | 10 | If you were a late teen or in your early twent... | 4,546 out of 4,762 found this helpful. |
| 1 | The truly epic blockbuster we needed. | Top_Dawg_Critic | 23 May 2022 | 10 | Wow. The first Top Gun is a classic, and as we... | 2,418 out of 2,690 found this helpful. |
| 2 | Let me just say... | lovefalloutkindagamer | 26 May 2022 | 10 | I was reluctantly dragged into the theater, th... | 1,414 out of 1,564 found this helpful. |
| 3 | Best Sequel yet | goshamorrell | 25 May 2022 | 10 | In one of the more memorable lines in the orig... | 988 out of 1,156 found this helpful. |
| 4 | The real cinema experience! | alexglimbergwindh | 30 May 2022 | 10 | If there's any movie that deserves to be seen ... | 827 out of 941 found this helpful. |
| 5 | This is why we go to the movies | dtucker86 | 27 May 2022 | 10 | This is one sequel that looked like it would n... | 789 out of 924 found this helpful. |
| 6 | Fake Imdb reviews artificially upping the rati... | imseeg | 29 May 2022 | 5 | Almost 90 percent of all the reviews have a 8,... | 127 out of 751 found this helpful. |
| 7 | Flying High | DarkVulcan29 | 27 May 2022 | 10 | Top Gun (1986) made Tom Cruise a star, and now... | 533 out of 671 found this helpful. |
| 8 | What an excellent sequel | r96sk | 25 May 2022 | 9 | What an excellent sequel - I, in fact, like it... | 510 out of 644 found this helpful. |
| 9 | TOM CRUISE YOU LEGEND! | nihal-38544 | 25 May 2022 | 9 | This is one of the best theatrical experiences... | 498 out of 608 found this helpful. |
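
Because `get_text()` returns strings, the scraped `rating` and `date` values are text (e.g. `"10"` and `"26 May 2022"`). If typed values are wanted for sorting or filtering, a conversion step along these lines could be applied afterwards. This is only a sketch: `to_typed` is a hypothetical helper, and the `"%d %B %Y"` format is an assumption based on the dates shown in the output above.

```python
from datetime import datetime

def to_typed(rating, date):
    """Convert scraped strings to (int or None, datetime.date or None).

    Missing fields (None) are passed through unchanged.
    """
    rating_value = int(rating) if rating is not None else None
    # "%d %B %Y" matches dates such as "26 May 2022" (assumed format)
    date_value = datetime.strptime(date, "%d %B %Y").date() if date is not None else None
    return rating_value, date_value

print(to_typed("10", "26 May 2022"))  # (10, datetime.date(2022, 5, 26))
print(to_typed(None, "29 May 2022"))  # (None, datetime.date(2022, 5, 29))
```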


In [20]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
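from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.service import Service

# for Firefox browser, do the following
# (1) find the path where you save the webdriver 
executable_path = '../notes/Web_Scraping/driver/geckodriver'
# (2) initialize the driver (Selenium 4 takes the driver path via a Service object;
#     passing executable_path directly is deprecated)
driver = webdriver.Firefox(service=Service(executable_path))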



page_url = 'https://www.imdb.com/title/tt1745960/reviews?sort=totalVotes&dir=desc&ratingFilter=0'

driver.get(page_url)

contents = driver.find_elements(By.CSS_SELECTOR, "div.content div.text.show-more__control" ) 
for i, content in enumerate(contents):
    time.sleep(1)
    print(i, "\t", content.get_attribute("innerHTML"))

driver.quit()



0 	 If you were a late teen or in your early twenties in the mid 1980's the world was very different. No computers, no mobile phones, no internet, no DVD's. We had cars though, and bikes, and we loved them, and we loved films too. The original Top Gun captured this moment in time perfectly, and gave us a thrilling ride like we had never seen before. The humour, the games, the bikes, the aircraft and my word, those flying scenes. We went back to the cinema to see it again and again, and spent the following decades quoting the movie. As time went on, it remained like a static snapshot in time to perfectly represent that magical point in our lives for so many of us.<br><br>Now, 36 years later, we are a generation that has lost our parents, we've had our own children who have moved on themselves, and we now approach the end of our own careers and our young selves are gone forever.<br><br>This film is the missing bookend to that whole generation. The original was there for the start of our 