# Rotten Tomatoes Web Scraper for Top 100 Movies Between 2000 and 2021

This is currently a work in progress. I am using it to practice data collection, one of the core data science skills. I wanted a dataset to analyze, and I felt I needed to build my own extensive one.

Unfortunately (or fortunately, for the person who already did it), a Rotten Tomatoes dataset just like the one I am trying to make here already exists. I did not come across it until I was past the early stages of this project, but I link it here for anyone interested. Its creator also has other movie datasets on Kaggle and seems like an interesting guy. Check him out: https://www.kaggle.com/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset

Anyway, it was reassuring to learn that the creator of the above dataset also had to wait a day or so for all the data to be collected.

Below you will find a test I ran on Interstellar, my favorite movie, which somehow did not make the top 100 list for 2014.

Below that I have defined some functions to help collect review data for the top 100 movies of each year between 2000 and 2021. Collecting all this data takes a *long* time. Somewhere in the code comments you will find me complaining about my poor internet connection, which is why I settled for scraping only the year 2021.

In [1]:
import re
import time

import numpy as np
from bs4 import BeautifulSoup
from selenium import webdriver #for dynamically loaded page elements that requests won't read in
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

#if you want to run this, replace the path below with the path to your own chromedriver
service = Service('C:/Users/cbarg/Downloads/chromedriver_win32/chromedriver.exe')

In [2]:
driver = webdriver.Chrome(service=service)
driver.get('https://www.rottentomatoes.com/m/interstellar_2014/reviews')
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser') #grab the information on the initial page (naming the parser avoids bs4's parser-guessing warning)

#grab all the reviews on current page
review_rows = soup.find_all('div', class_ = 'row review_table_row')

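#keep clicking the "next" button and collecting reviews until the button can no longer be found or clicked
while True:
    try:
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, '//*[@id="content"]/div/div/div/nav[1]/button[2]'))
        )
        element.click()
        time.sleep(1)
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')
        review_rows.extend(soup.find_all('div', class_ = 'row review_table_row'))
    except Exception:
        #a timeout or a failed click is treated as the end of the reviews
        print('Task exited successfully.')
        driver.quit()
        break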
        
critic = []
date = []
reviews = []
score = []
is_fresh = []
is_rotten = []
is_top_critic = []

for review in review_rows:
    critic.append(review.find('a', class_ = 'unstyled bold articleLink').text) #critic's name
    date.append(review.find('div', class_ = 'review-date subtle small').text.strip()) #date of review
    reviews.append(review.find('div', class_ = 'the_review').text.strip()) #comments by reviewer
    
    #review scores are inconsistent: even after this cleaning, critics use different scales
    foo = review.find('div', class_ = 'small subtle review-link').text.strip().split('|')
    if len(foo) >= 2:
        score.append(foo[1].split(' ')[3])
    else:
        score.append(None)
        
    #check if a critic rated the movie as fresh or not
    if review.find('div', class_ = 'review_icon icon small fresh'):
        is_fresh.append(1)
    else:
        is_fresh.append(0)
        
    #check to see if a critic rated the movie as rotten or not
    if review.find('div', class_ = 'review_icon icon small rotten'):
        is_rotten.append(1)
    else:
        is_rotten.append(0)
    
    #check to see if the review was posted by a top critic
    if review.find('rt-icon-top-critic', class_ = 'small'): 
        is_top_critic.append(1)
    else:
        is_top_critic.append(0)

Task exited successfully.


In [3]:
import pandas as pd

df_dict = {'critic':critic, 'date':date, 'reviews':reviews, 'score':score, 'is_fresh':is_fresh, 'is_rotten':is_rotten, 'is_top_critic':is_top_critic}

#build the DataFrame from the dictionary of lists; the keys become the column names
df = pd.DataFrame(df_dict)
df.head(15)

,critic,date,reviews,score,is_fresh,is_rotten,is_top_critic
0,Therese Lacson,"Oct 9, 2021","The inherent message of the film brings hope, ...",3/5,1,0,0
1,Kip Mooney,"Aug 10, 2021",The film is indeed a sight to behold -- and on...,B,1,0,0
2,Richard Crouse,"Feb 2, 2021",Nolan reaches for the stars with beautifully c...,3/5,1,0,0
3,Mike Massie,"Dec 4, 2020",Audiences are sure to lose their suspensions o...,5/10,0,1,0
4,David Nusair,"Sep 20, 2020",...an often insanely ambitious science fiction...,4/4,1,0,0
5,Richard Propes,"Sep 12, 2020","Scientists will debate, theologians will conte...",3.5/4.0,1,0,0
6,Siddhant Adlakha,"Sep 3, 2020",A big-budget reprise of ideas Nolan has been e...,,1,0,0
7,Stephen A. Russell,"Aug 26, 2020",None of these characters feel fully-fledged......,2.5/5,0,1,0
8,Kelechi Ehenulo,"Jul 16, 2020",Interstellar is not Christopher Nolan's best f...,4/5,1,0,0
9,Brent McKnight,"Jul 7, 2020",As spectacular as it is flawed.,B,1,0,0
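
The score column above mixes several scales (`3/5`, `5/10`, `3.5/4.0`, letter grades, and missing values). Below is a minimal sketch of one way to put them on a common 0 to 1 scale. The helper name `normalize_score` and the letter-grade mapping (A = 4 through F = 0, with 0.3 added or subtracted for `+` and `-`) are my own assumptions for illustration, not anything Rotten Tomatoes defines.

```python
import re

# Assumed grade-point mapping; this is an illustrative choice, not a Rotten Tomatoes rule.
LETTER_GRADES = {'A': 4.0, 'B': 3.0, 'C': 2.0, 'D': 1.0, 'F': 0.0}


def normalize_score(raw):
    """Map a raw critic score to a float in [0, 1], or None if it cannot be parsed."""
    if raw is None:
        return None
    text = str(raw).strip()

    # Fractions such as '3/5', '5/10' or '3.5/4.0'
    match = re.fullmatch(r'(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)', text)
    if match:
        value, scale = float(match.group(1)), float(match.group(2))
        if scale == 0 or value > scale:
            return None
        return value / scale

    # Letter grades such as 'B', 'B+' or 'A-'
    match = re.fullmatch(r'([A-Fa-f])([+-]?)', text)
    if match:
        base = LETTER_GRADES.get(match.group(1).upper())
        if base is None:  # e.g. 'E', which is not in the assumed mapping
            return None
        adjust = {'+': 0.3, '-': -0.3, '': 0.0}[match.group(2)]
        points = min(max(base + adjust, 0.0), 4.0)
        return points / 4.0

    return None
```

With pandas this could be applied as `df['score'].map(normalize_score)`. Whether a `B` from one critic really equals a `3/4` from another is a judgment call, so treat the result as a rough comparison aid.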


In [4]:
#returns a dictionary of years and the links to the associated top 100 movies of that year on rotten tomatoes
def get_years_url_dict(tomatoes='https://www.rottentomatoes.com/top/'):
    driver = webdriver.Chrome(service=service)
    driver.get(tomatoes)
    time.sleep(1)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    driver.quit() #quit (rather than close) also shuts down the chromedriver process
    
    #the table of years is located by its inline style, so this will break if the page layout changes
    table = soup.find('div', style='position: absolute; left: 366px; top: 733px;')
    entries = table.find_all('td', class_='rank_col')
    years = [year.text.strip() for year in entries]
    links = [link.find('a', href=True)['href'] for link in entries]
    links = [tomatoes[:-5] + link for link in links] #drop the trailing '/top/' so the relative links attach to the site root
    dictionary = dict(zip(years, links))
    return dictionary

url_by_years = get_years_url_dict()
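
The function above builds absolute URLs by slicing the trailing `/top/` off the base address with `tomatoes[:-5]`, which only works while the base URL ends in exactly those five characters. As a hedged alternative, the standard library's `urljoin` resolves a relative link against any base URL. The helper name `absolute_links` is hypothetical.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve each (possibly relative) href against base_url."""
    return [urljoin(base_url, href) for href in hrefs]


# An href that starts with '/' replaces the whole path of the base URL.
print(absolute_links('https://www.rottentomatoes.com/top/', ['/m/interstellar_2014']))
```

The same call also leaves links that are already absolute unchanged, so it would not matter whether the page emits relative or full URLs.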

In [5]:
#returns a dictionary of all the movie titles and the link to their main rotten tomatoes page
#for all the movies that are in the top 100 lists for years 2000-2021
def get_movie_urls(dictionary):
    driver = webdriver.Chrome(service=service)
    top_links = {}
    for year_url in dictionary.values():
        driver.get(str(year_url))
        time.sleep(1)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        
        top_movies_section = soup.find('section', id='top_movies_main')
        table = top_movies_section.find('tbody')
        rows = table.find_all('tr')
        links = [row.find('a', href=True)['href'] for row in rows]
        links = ['https://www.rottentomatoes.com' + link for link in links]
        titles = [row.find('a', href=True).text.strip() for row in rows]
        top_links.update(zip(titles, links)) #a title that appears in two lists keeps only its last link
    driver.quit() #quit (rather than close) also shuts down the chromedriver process
    return top_links

all_movie_links = get_movie_urls(url_by_years)
print(list(all_movie_links)[:50]) #print the first 50 movies in the dictionary. There are 2100 of them so I'm not printing them all

['Nomadland (2021)', 'Judas and the Black Messiah (2021)', 'The Father (2021)', 'In the Heights (2021)', 'Summer of Soul (...Or, When the Revolution Could Not Be Televised) (2021)', 'Pig (2021)', 'CODA (2021)', 'Raya and the Last Dragon (2021)', 'A Quiet Place Part II (2021)', 'The Mitchells vs. The Machines (2021)', 'Shang-Chi and the Legend of the Ten Rings (2021)', 'The Suicide Squad (2021)', 'MLK/FBI (2021)', 'Shiva Baby (2021)', 'Luca (2021)', 'Quo Vadis, Aida? (2021)', 'Slalom (2021)', 'Becoming Cousteau (Cousteau) (2021)', 'Beyond the Infinite Two Minutes (2021)', 'Paper Spiders (2021)', 'Woodlands Dark and Days Bewitched: A History of Folk Horror (2021)', 'Sabaya (2021)', 'The Worst Person In The World (Verdens Verste Menneske) (2021)', 'Luzzu (2021)', 'Two of Us (Deux) (2021)', 'The Velvet Underground (2021)', 'The Sparks Brothers (2021)', 'Riders of Justice (2021)', 'Mass (2021)', "I'm Your Man (2021)", 'I Carry You with Me (2021)', 'The Paper Tigers (2021)', 'Identifying Fea
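
With this many titles in `all_movie_links`, it can help to scrape one year at a time. Every title in the output above ends with its year in parentheses, so a small helper can pick out a single year. This is a sketch under that assumption, and `filter_by_year` is a hypothetical name.

```python
import re


def filter_by_year(movie_links, year):
    """Keep only the entries whose title ends with '(year)'."""
    suffix = re.compile(r'\(%d\)\s*$' % int(year))
    return {title: url for title, url in movie_links.items() if suffix.search(title)}


sample = {
    'Interstellar (2014)': 'https://www.rottentomatoes.com/m/interstellar_2014',
    'A Quiet Place Part II (2021)': 'https://www.rottentomatoes.com/m/a_quiet_place_part_ii',
}
print(filter_by_year(sample, 2021))
```

Anchoring the match at the end of the title avoids picking up a movie that merely has the year somewhere else in its name.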

In [8]:
def obtain_url_data(dictionary):
    titles = []
    tomatometer_scores_avg = []
    audience_scores_avg = []
    critic = []
    date = []
    reviews = []
    score = []
    is_fresh = []
    is_rotten = []
    is_top_critic = []
    year_gen_len = []
    directors = []
    grosses = []
    keys = dictionary.keys()
    
    #this for loop goes through every movie in the dictionary to extract its reviews and other data
    #I first tried to go through all 2100 movies at once, but my internet is sooooo slow that it would crap out
    #once I got to around 2004 or so and the process wouldn't finish. So just doing one year was easier:
    #the commented-out loop header below restricts the run to a single year. If you had good internet
    #you could use this function to collect reviews from all the top 100 movies on rotten tomatoes between 2000 and 2021
    #for key in [key for key in keys if '(2021)' in key]:
    
    #the loop below runs over every key again, which is what the small test dictionary further down needs
    for key in keys:
        driver = webdriver.Chrome(service=service) #define the driver
        
        #going to the main movie page to extract information
        driver.get(dictionary[key])
        time.sleep(1)
        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')
        
        try:
            #get tomatometer score
            tomatometer_score = soup.find('score-board', tomatometerscore=True)['tomatometerscore']
            #get audience score
            audience_score = soup.find('score-board', audiencescore=True)['audiencescore']
            #get year, genre, and length of movie
            year_genre_length = soup.find('p', class_='scoreboard__info').text.strip()
            #get director's name
            director = soup.find('a', attrs={'data-qa':'movie-info-director'}).text.strip()
            #get film gross revenue
            box_office_tags = soup.find_all('div', class_ = 'meta-value')
            gross = '0'
            for tag in box_office_tags:
                if '$' in tag.text.strip():
                    gross = tag.text.strip()
                    #print(tag.text.strip())
                    break
            print(gross) #print the movie's gross to make sure it was obtained


            #going to the reviews page to extract information
            driver.get(dictionary[key] + '/reviews')
            html = driver.page_source
            soup = BeautifulSoup(html, 'html.parser') #grab the information on the initial page

            #grab all the reviews on current page
            review_rows = soup.find_all('div', class_ = 'row review_table_row')
        except Exception:
            #quit this movie's browser before moving on so failed pages don't leave Chrome windows open
            driver.quit()
            print('Unsuccessful collection for {}.'.format(key))
            continue

        #now we want to start navigating through all the reviews
        #this while loop clicks through every page of reviews and stores the review rows in a list
        #if an error is encountered then it stops collecting reviews and quits the driver

        while True:
            try:
                element = WebDriverWait(driver, 10).until(
                    EC.presence_of_element_located((By.XPATH, '//*[@id="content"]/div/div/div/nav[1]/button[2]'))
                )
                element.click()
                time.sleep(0.5)
                html = driver.page_source
                soup = BeautifulSoup(html, 'html.parser')
                review_rows.extend(soup.find_all('div', class_ = 'row review_table_row'))
            except Exception:
                #a timeout or a failed click is treated as the end of the reviews
                print('Successfully collected reviews for {}.'.format(key))
                driver.quit()
                break
        
        #goes through the list of reviews and splits each entry up into the individual information chunks
        for review in review_rows:

            try:
                critic_name = review.find('a', class_ = 'unstyled bold articleLink').text #critic's name

                review_date = review.find('div', class_ = 'review-date subtle small').text.strip() #date of review

                critic_comments = review.find('div', class_ = 'the_review').text.strip() #comments by the reviewer

                #review scores are inconsistent: even after this cleaning, critics use different scales
                foo = review.find('div', class_ = 'small subtle review-link').text.strip().split('|')
                if len(foo) >= 2:
                    review_score = foo[1].split(' ')[3]
                else:
                    review_score = None

                #check if a critic rated the movie as fresh or not
                fresh = 1 if review.find('div', class_ = 'review_icon icon small fresh') else 0

                #check to see if a critic rated the movie as rotten or not
                rotten = 1 if review.find('div', class_ = 'review_icon icon small rotten') else 0

                #check to see if the review was posted by a top critic
                top_critic = 1 if review.find('rt-icon-top-critic', class_ = 'small') else 0
            except Exception:
                #the driver was already quit after the loop above, so there is nothing left to close here
                #(calling driver.close() on a quit driver raises its own error)
                break

            #all of the data is added to the lists at the same time at the end of the loop
            #this is because if an error has occurred during an earlier part of the loop
            #we don't want to quit mid-loop and have one of the lists be longer than another
            #that would cause us to receive an error when creating the dataframe
            titles.append(key)
            tomatometer_scores_avg.append(tomatometer_score)
            audience_scores_avg.append(audience_score)
            year_gen_len.append(year_genre_length)
            directors.append(director)
            critic.append(critic_name)
            date.append(review_date)
            reviews.append(critic_comments)
            grosses.append(gross)
            score.append(review_score)
            is_fresh.append(fresh)
            is_rotten.append(rotten)
            is_top_critic.append(top_critic)

                
    df_dict = {'title':titles, 'director':directors, 'gross (usd)':grosses, 'tomatometer_scores_avg':tomatometer_scores_avg, 'audience_scores_avg':audience_scores_avg, 'year_genre_length':year_gen_len, 'critic':critic, 'date_of_review':date, 'reviews':reviews, 'score':score, 'is_fresh':is_fresh, 'is_rotten':is_rotten, 'is_top_critic':is_top_critic}
    #print(df_dict)
    #build the DataFrame from the dictionary of lists
    df = pd.DataFrame(df_dict)
    return df
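
Because a full run over every year is slow and can die partway through on a bad connection, one option is to save progress after each movie so that a restart skips what is already done. The sketch below is only an illustration of that idea using the standard library: `scrape_with_checkpoint`, `scrape_one`, and the checkpoint file are all hypothetical names, and `scrape_one` stands in for whatever function collects one movie's data and returns something JSON-serializable.

```python
import json
from pathlib import Path


def scrape_with_checkpoint(movie_links, scrape_one, checkpoint_path):
    """Scrape each movie once, saving results to disk after every success.

    movie_links: dict of title -> url
    scrape_one: hypothetical callable taking a url and returning JSON-serializable data
    checkpoint_path: JSON file used to remember finished movies between runs
    """
    path = Path(checkpoint_path)
    results = json.loads(path.read_text()) if path.exists() else {}

    for title, url in movie_links.items():
        if title in results:
            continue  # already collected in an earlier run
        try:
            results[title] = scrape_one(url)
        except Exception as error:
            # Stop here; everything collected so far is already on disk.
            print('Stopped at {}: {}'.format(title, error))
            break
        path.write_text(json.dumps(results))

    return results
```

Calling the function again with the same checkpoint file picks up at the first movie that has not been saved yet.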

In [9]:
#all_movie_links.update({'Interstellar (2014)':'https://www.rottentomatoes.com/m/interstellar_2014'})

#I should explain this test dictionary. I included Donnie Darko's Director's Cut because there is simply no info
#on it on rotten tomatoes. It shows up in the top movies' links for 2004, but its page might as well not exist at all.
#It's placed in here to see how my function handles it.
#Interstellar is my favorite movie ever, so that's why that's here.
#A Quiet Place Part II was just a more recent, successful movie.
test_dictionary = {"Donnie Darko: The Director's Cut (2004)": 'https://www.rottentomatoes.com/m/donnie_darko_the_directors_cut',
                  'Interstellar (2014)':'https://www.rottentomatoes.com/m/interstellar_2014',
                  'A Quiet Place Part II (2021)': 'https://www.rottentomatoes.com/m/a_quiet_place_part_ii'}
df = obtain_url_data(test_dictionary)

Unsuccessful collection for Donnie Darko: The Director's Cut (2004).
$188.0M
Successfully collected reviews for Interstellar (2014).
$160.2M
Successfully collected reviews for A Quiet Place Part II (2021).


In [10]:
df.to_csv('rotten_tomatoes_review_data_test_dict.csv', sep='\t', index=False) #note: tab-separated despite the .csv extension, so read it back with sep='\t'

In [11]:
df.tail()

,title,director,gross (usd),tomatometer_scores_avg,audience_scores_avg,year_genre_length,critic,date_of_review,reviews,score,is_fresh,is_rotten,is_top_critic
830,A Quiet Place Part II (2021),John Krasinski,$160.2M,91,92,"2021, Mystery & thriller/Horror, 1h 37m",Jacob Oller,"May 18, 2021",A Quiet Place Part II's technical merits mostl...,7.0/10,1,0,0
831,A Quiet Place Part II (2021),John Krasinski,$160.2M,91,92,"2021, Mystery & thriller/Horror, 1h 37m",David Rooney,"May 18, 2021",The intimacy of the storytelling tugs relentle...,,1,0,1
832,A Quiet Place Part II (2021),John Krasinski,$160.2M,91,92,"2021, Mystery & thriller/Horror, 1h 37m",Peter Bradshaw,"May 18, 2021","What a pleasure to see a big, brash picture li...",4/5,1,0,1
833,A Quiet Place Part II (2021),John Krasinski,$160.2M,91,92,"2021, Mystery & thriller/Horror, 1h 37m",Joey Magidson,"May 18, 2021",A Quiet Place Part II manages to take a really...,3.5/4,1,0,0
834,A Quiet Place Part II (2021),John Krasinski,$160.2M,91,92,"2021, Mystery & thriller/Horror, 1h 37m",Kate Erbland,"May 18, 2021",As his chops as an action and horror director ...,B+,1,0,1
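
The `gross (usd)` column holds strings such as `$160.2M`, which cannot be summed or plotted directly. Here is a minimal sketch of turning them into numbers. The name `parse_gross` is hypothetical, and the `K` and `B` suffixes are my assumption, since only `M` appears in the data shown here.

```python
import re

# Only 'M' appears in the data above; 'K' and 'B' are assumed for completeness.
_MULTIPLIERS = {'K': 1e3, 'M': 1e6, 'B': 1e9}


def parse_gross(text):
    """Convert a string like '$188.0M' to a float number of dollars, or None."""
    match = re.fullmatch(
        r'\$\s*(\d[\d,]*(?:\.\d+)?)\s*([KMB]?)',
        str(text).strip(),
        flags=re.IGNORECASE,
    )
    if not match:
        return None
    number = float(match.group(1).replace(',', ''))
    return number * _MULTIPLIERS.get(match.group(2).upper(), 1.0)
```

The scraper stores `'0'` when no dollar amount is found on the page. That placeholder has no `$`, so this helper returns `None` for it rather than a real zero.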


In [12]:
df.head()

,title,director,gross (usd),tomatometer_scores_avg,audience_scores_avg,year_genre_length,critic,date_of_review,reviews,score,is_fresh,is_rotten,is_top_critic
0,Interstellar (2014),Christopher Nolan,$188.0M,72,86,"2014, Sci-fi/Adventure, 2h 45m",Therese Lacson,"Oct 9, 2021","The inherent message of the film brings hope, ...",3/5,1,0,0
1,Interstellar (2014),Christopher Nolan,$188.0M,72,86,"2014, Sci-fi/Adventure, 2h 45m",Kip Mooney,"Aug 10, 2021",The film is indeed a sight to behold -- and on...,B,1,0,0
2,Interstellar (2014),Christopher Nolan,$188.0M,72,86,"2014, Sci-fi/Adventure, 2h 45m",Richard Crouse,"Feb 2, 2021",Nolan reaches for the stars with beautifully c...,3/5,1,0,0
3,Interstellar (2014),Christopher Nolan,$188.0M,72,86,"2014, Sci-fi/Adventure, 2h 45m",Mike Massie,"Dec 4, 2020",Audiences are sure to lose their suspensions o...,5/10,0,1,0
4,Interstellar (2014),Christopher Nolan,$188.0M,72,86,"2014, Sci-fi/Adventure, 2h 45m",David Nusair,"Sep 20, 2020",...an often insanely ambitious science fiction...,4/4,1,0,0
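
The `year_genre_length` column packs three fields into one string, for example `2014, Sci-fi/Adventure, 2h 45m`. The sketch below splits it into a year, a list of genres, and a runtime in minutes. It assumes exactly two commas and a runtime written as hours and/or minutes, as in the rows above, and `parse_year_genre_length` is a hypothetical name.

```python
import re


def parse_year_genre_length(text):
    """Split 'YEAR, Genre/Genre, Xh Ym' into a dict, or return None if it does not fit."""
    parts = [part.strip() for part in str(text).split(',')]
    if len(parts) != 3:
        return None
    year_text, genre_text, length_text = parts
    if not year_text.isdigit():
        return None

    # Runtime such as '2h 45m', '1h' or '97m'
    match = re.fullmatch(r'(?:(\d+)h)?\s*(?:(\d+)m)?', length_text)
    if not match or not (match.group(1) or match.group(2)):
        return None
    minutes = int(match.group(1) or 0) * 60 + int(match.group(2) or 0)

    return {
        'year': int(year_text),
        'genres': [genre.strip() for genre in genre_text.split('/')],
        'minutes': minutes,
    }
```

Strings that do not match this shape come back as `None`, so unexpected formats show up as missing values instead of raising an error in the middle of a long run.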
