### Data Super Heroes - Python, Web Scraping

1. Install Selenium and Scrapy

-   Selenium: to load the reviews.
-   Scrapy: to extract the relevant information.

In [None]:
! pip install selenium
! pip install scrapy

2. Dowload Chrome Driver

3. Importing dependencies

In [18]:
import numpy as np
import pandas as pd
from scrapy.selector import Selector
from selenium import webdriver 
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

4. Selenium test

In [20]:
driver = webdriver.Chrome()
url = 'https://www.imdb.com/title/tt15398776/reviews/?ref_=tt_ql_2'
time.sleep(1)
driver.get(url)
time.sleep(1)
print(driver.title)
time.sleep(1)
body = driver.find_element(By.CSS_SELECTOR, 'body')
body.send_keys(Keys.PAGE_DOWN)
time.sleep(1)
body.send_keys(Keys.PAGE_DOWN)
time.sleep(1)
body.send_keys(Keys.PAGE_DOWN)

Oppenheimer (2023) - Oppenheimer (2023) - User Reviews - IMDb


5. Extract the review count

In [22]:
# use scrapy Selector to extract this information
sel = Selector(text = driver.page_source)
# pass the HTML code of the page to Scrapy Selector and extract the total review count.
review_counts = sel.css('.lister .header span::text').extract_first().replace('.','').split(' ')[0]

review_counts

'3192'

6. Load all reviews

In [26]:
more_review_pages = int(int(review_counts)/25)

counter = 0

for i in tqdm(range(more_review_pages)):
    try:
        css_selector = 'load-more-trigger'
        driver.find_element(By.ID, css_selector).click()
        counter = counter + 1
    except:
        counter = counter + 1
        pass
    
counter

100%|██████████| 127/127 [00:04<00:00, 29.33it/s]


127

7. Extract Review from HTML

In [10]:
# We can now extract the review information from the page source.
sel2 = Selector(text = driver.page_source)
rating = sel2.css('.review-container .rating-other-user-rating span::text').extract_first().strip()
review = sel2.css('.text.show-more__control::text').extract_first().strip()
review_date = sel2.css('.review-date::text').extract_first().strip()
review_title = sel2.css('a.title::text').extract_first().strip()
review_url = sel2.css('a.title::attr(href)').extract_first().strip()

print('nreview_title:',review_title)
print('nreview_rating:',rating)
print('nreview_date:',review_date)
print('nreview:',review)


nreview_title: Murphy is exceptional
nreview_rating: 9
nreview_date: 19 July 2023
nreview: You'll have to have your wits about you and your brain fully switched on watching Oppenheimer as it could easily get away from a nonattentive viewer. This is intelligent filmmaking which shows it's audience great respect. It fires dialogue packed with information at a relentless pace and jumps to very different times in Oppenheimer's life continuously through it's 3 hour runtime. There are visual clues to guide the viewer through these times but again you'll have to get to grips with these quite quickly. This relentlessness helps to express the urgency with which the US attacked it's chase for the atomic bomb before Germany could do the same. An absolute career best performance from (the consistenly brilliant) Cillian Murphy anchors the film. This is a nailed on Oscar performance. In fact the whole cast are fantastic (apart maybe for the sometimes overwrought Emily Blunt performance). RDJ is also

In [28]:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from scrapy import Selector
from tqdm import tqdm

# Initialize the WebDriver
driver = webdriver.Chrome()
url = 'https://www.imdb.com/title/tt0241527/reviews?ref_=tt_sa_3'
driver.get(url)

# Scroll down to load more reviews
for _ in range(3):
    driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.PAGE_DOWN)
    time.sleep(1)

# Extract the review count and calculate the number of pages
sel = Selector(text=driver.page_source)
review_counts = int(sel.css('.lister .header span::text').re_first(r'\d+').replace(',', ''))
more_review_pages = (review_counts // 25) + 1

# Click the "Load More" button to load additional reviews
for i in tqdm(range(more_review_pages)):
    try:
        driver.find_element(By.CLASS_NAME, 'load-more-trigger').click()
        time.sleep(2)  # Add a slight delay after clicking the button
    except:
        break  # Break if there are no more reviews to load

# Extract the review content
reviews = sel.css('.lister-item-content .text::text').extract()

# Print the first few reviews as an example
len(reviews)

# Close the WebDriver
driver.quit()


  0%|          | 0/1 [00:00<?, ?it/s]


There's nothing like the first in a series, is there? The introduction to the characters, the immersion into the fictional world, the first time you laugh, cry, care, and fear for someone's safety can never be repeated. No matter how many Harry Potter movies they crank out, or if they ever remake them in the future, none will come close to the wonderful first film, Harry Potter and the Sorcerer's Stone.
I'm sure everyone has their own childhood memories of reading the Harry Potter books that they'll tell their grandkids about, but I'll never forget going to see the first movie in the theaters. The lights dimmed, John Williams's perfect theme played its first notes as Richard Harris walked down Privet Drive, and everyone in the theater was transported to another world. John Williams's numerous themes, all wonderful and a personification of the wizarding world, took the early movies to another level. As other composers tried their hands at the later films, that quality was missing. There

8. Extract info for all the reviews

In [11]:

rating_list = []
review_date_list = []
review_title_list = []
author_list = []
review_list = []
review_url_list = []
error_url_list = []
error_msg_list = []
reviews = driver.find_elements(By.CSS_SELECTOR, 'div.review-container')
for d in tqdm(reviews):
    try:
        sel2 = Selector(text = d.get_attribute('innerHTML'))
        try:
            rating = sel2.css('.rating-other-user-rating span::text').extract_first()
        except:
            rating = np.NaN
        try:
            review = sel2.css('.text.show-more__control::text').extract_first()
        except:
            review = np.NaN
        try:
            review_date = sel2.css('.review-date::text').extract_first()
        except:
            review_date = np.NaN    
        try:
            author = sel2.css('.display-name-link a::text').extract_first()
        except:
            author = np.NaN    
        try:
            review_title = sel2.css('a.title::text').extract_first()
        except:
            review_title = np.NaN
        try:
            review_url = sel2.css('a.title::attr(href)').extract_first()
        except:
            review_url = np.NaN
        rating_list.append(rating)
        review_date_list.append(review_date)
        review_title_list.append(review_title)
        author_list.append(author)
        review_list.append(review)
        review_url_list.append(review_url)
    except Exception as e:
        error_url_list.append(url)
        error_msg_list.append(e)

100%|██████████| 150/150 [00:02<00:00, 66.25it/s]


9. Create a pandas DataFrame with each extracted review

In [13]:
# Create pandas dataframe from the lists of reviews, ratings, titles and dates.

review_df = pd.DataFrame({
    'Review_Date':review_date_list,
    'Rating':rating_list,
    'Review_Title':review_title_list,
    'Review_Url':review_url_list,
    'Review':review_list
    })

review_df.head()

Unnamed: 0,Review_Date,Rating,Review_Title,Review_Url,Review
0,19 July 2023,9,Murphy is exceptional\n,/review/rw9199470/?ref_=tt_urv,You'll have to have your wits about you and yo...
1,20 July 2023,8,"A challenging watch to be sure, but a worthwh...",/review/rw9202448/?ref_=tt_urv,One of the most anticipated films of the year ...
2,20 July 2023,10,A brilliantly layered examination of a man th...,/review/rw9202246/?ref_=tt_urv,"""Oppenheimer"" is a biographical thriller film ..."
3,20 July 2023,10,Nolan delivers a powerfull biopic that shows ...,/review/rw9202357/?ref_=tt_urv,This movie is just... wow! I don't think I hav...
4,26 July 2023,8,"Nolan touches greatness, falls slightly short\n",/review/rw9216587/?ref_=tt_urv,I was familiar with the Manhattan project and ...


10. Save the content of the DataFrame in a .csv file

In [14]:
# Save the dataframe to a csv file.

review_df.to_csv('movie_reviews.csv', index=False)

In [15]:
# Check if there are nan values in the dataframe.

review_df.isna().sum()

Review_Date     0
Rating          7
Review_Title    0
Review_Url      0
Review          0
dtype: int64

In [58]:
# Check if there are duplicate reviews in the dataframe.

review_df.duplicated().sum()

0

In [60]:
# Check the data types of the dataframe.
review_df.dtypes

Review_Date     object
Rating          object
Review_Title    object
Review_Url      object
Review          object
dtype: object

In [61]:
changes_df = pd.read_csv('movie_reviews.csv')