Although even the use of the Little Man is the subject of current controversy and debate.

Gene Siskel believed in a binary metric, the thumbs-up or thumbs-down, as Ebert recalled him saying: "What's the first thing people ask you? Should I see this movie? They don't want a speech on the director's career. Thumbs up--yes. Thumbs down--no."

In the past, the most influential film critics were the ones employed by major newspapers in important markets, or the ones with nationally syndicated television shows. Today, websites that aggregate these reviews and generate scores are dominant. The two best-known aggregator websites are Rotten Tomatoes and Metacritic, both of which summarize all of the reviews for a film (or TV show or video game) as a score from 0 to 100. However, these two websites employ very different methodologies to derive the score. Rotten Tomatoes assesses whether each review is positive or negative, with no middle ground, and calculates the percentage of positive reviews for a film out of all reviews. Metacritic assigns a score to each review denoting the degree to which that review is positive towards the movie, then takes the mean across all scores to generate its overall metric.

These two approaches can differ in notable ways. Suppose a movie is universally reviewed as pretty good, but not outstanding. If all of the reviews are positive, then the Rotten Tomatoes score will be 100, a score that suggests an all-time excellent movie. But if each of these reviews is only slightly positive, Metacritic might assess the average rating across these reviews to be somewhere around 60 to 70.
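To make the difference concrete, here is a minimal sketch of the two aggregation rules applied to the same set of reviews. The review scores and the cutoff of 60 for counting a single review as positive are made-up assumptions for illustration; this is not either website's actual procedure, only the percent-positive versus mean contrast described above.

```python
from statistics import mean

# Hypothetical per-review scores on a 0-100 scale (illustrative only).
review_scores = [62, 65, 70, 60, 68, 65]

# Assumed cutoff for calling a single review "positive"; an illustration,
# not Rotten Tomatoes' actual rule.
POSITIVE_CUTOFF = 60


def percent_positive(scores, cutoff=POSITIVE_CUTOFF):
    """Rotten Tomatoes-style score: share of reviews counted as positive."""
    n_positive = sum(score >= cutoff for score in scores)
    return 100 * n_positive / len(scores)


def mean_score(scores):
    """Metacritic-style score: average of the per-review scores."""
    return mean(scores)


print(percent_positive(review_scores))  # every review clears the cutoff
print(mean_score(review_scores))        # the average sits in the mid-60s
```

Every review here is mildly positive, so the percent-positive rule reports the maximum score while the mean stays in the "pretty good" range.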

In addition, both Metacritic and Rotten Tomatoes compile ratings and short reviews from their websites' users, leading to some movies having significantly lower audience scores than critic scores, sometimes for overtly sexist and bigoted reasons.

In this live coding exercise, we will be web-scraping data from the Rotten Tomatoes pages for all movies that are currently in theaters:

https://www.rottentomatoes.com/browse/movies_in_theaters/sort:a_z?page=8

Our goals are:

* To check the robots.txt file for Rotten Tomatoes, to make sure web-scraping this webpage is permitted.
* To use BeautifulSoup to get the following features for each movie:
    * The title
    * The audience score
    * The critics' score
    * Whether or not a movie is reviewed positive overall ("certified fresh") or negative
* To build a spider that finds the URL for each film's Rotten Tomatoes page, and extract the following additional features:
    * Rating (G, PG, PG-13, R)
    * Genre
    * Original Language
    * Director
    * Producer
    * Writer
    * Release Date (Theaters and Streaming)
    * Box Office (Gross USA)
    * Runtime
    * Distributor
    * Sound Mix
    * Aspect Ratio
    * The text that summarizes each review

While the point of this exercise is to practice the techniques involved in web-scraping, we can imagine some things we could do with the resulting dataframe. We could correlate the scores with budgets, rating, or genre; see whether certain compositions of the cast lead to greater disparities between the critic and audience scores; mine the reviews for sentiment and compare that index against the critics' score; and so on.

In [1]:
import numpy as np
import pandas as pd
import os
import sys

In [9]:
import requests
from bs4 import BeautifulSoup
import json

In [4]:
# find the robots.txt file
url = 'https://www.rottentomatoes.com/robots.txt'

# "User-agent: *" means the rules below apply to all user agents:
# do not crawl the /search and /user/id/ directories
'''
User-agent: *
Disallow: /search
Disallow: /user/id/
Sitemap: https://www.rottentomatoes.com/sitemaps/sitemap.xml
'''

'\nUser-agent: *\nDisallow: /search\nDisallow: /user/id/\nSitemap: https://www.rottentomatoes.com/sitemaps/sitemap.xml\n'
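The cell above reads the rules by eye. As a hedged alternative, Python's standard-library `urllib.robotparser` can apply the same rules programmatically. The sketch below parses the robots.txt text quoted above from a string, so it runs without a network call; the helper name `is_allowed` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules quoted in the cell above.
ROBOTS_TXT = """
User-agent: *
Disallow: /search
Disallow: /user/id/
Sitemap: https://www.rottentomatoes.com/sitemaps/sitemap.xml
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


browse_url = 'https://www.rottentomatoes.com/browse/movies_in_theaters/sort:a_z'
search_url = 'https://www.rottentomatoes.com/search'

print(is_allowed(ROBOTS_TXT, 'python-requests', browse_url))  # not under a Disallow rule
print(is_allowed(ROBOTS_TXT, 'python-requests', search_url))  # under "Disallow: /search"
```

To check the live file instead, `RobotFileParser` also offers `set_url(...)` followed by `read()`, which downloads robots.txt before `can_fetch` is called.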

## Get our User Agent Specific to Python & requests

In [10]:
url = 'https://httpbin.org/user-agent'
r = requests.get(url)
user_agent = json.loads(r.text)['user-agent']
user_agent

'python-requests/2.27.1'

## Get Content From Rotten Tomatoes

In [19]:
# getting the big block of html text from the webpage
url = 'https://www.rottentomatoes.com/browse/movies_in_theaters/sort:a_z?page=5'

headers = {'User-Agent': user_agent}
print(headers)
r = requests.get(url, headers=headers)
r


{'User-Agent': 'python-requests/2.27.1'}


<Response [200]>

In [21]:
#r.text

In [22]:
rotten = BeautifulSoup(r.text, 'html') # 'html' tells BeautifulSoup to parse r.text (the page source) as HTML

We want to find:
```
<span class="p--small" data-qa="discovery-media-list-item-title">
    The Integrity of Joseph Chambers
</span>
```

In [30]:

# find every <a> tag that has the class 'js-tile-link' and an href attribute
a_s = rotten.find_all('a', class_='js-tile-link', href=True)
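Because the anchors were selected with `href=True`, each one carries a link that a spider could follow later. The sketch below assumes the `href` values are site-relative paths (an assumption; the actual values are not shown here) and uses `urllib.parse.urljoin` to turn them into absolute URLs. The two paths and the helper `absolute_url` are hypothetical placeholders.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.rottentomatoes.com'

# Hypothetical href values, standing in for [a['href'] for a in a_s].
hrefs = ['/m/example_movie', 'https://www.rottentomatoes.com/m/another_example']


def absolute_url(href, base=BASE_URL):
    """Resolve href against base; absolute URLs pass through unchanged."""
    return urljoin(base, href)


movie_urls = [absolute_url(h) for h in hrefs]
print(movie_urls)
```

Using `urljoin` rather than string concatenation means the same code handles both relative and already-absolute links.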

In [38]:
a_s[0].find('span', class_='p--small').text.strip()

'80 for Brady'

Yay, we got the title!

Now use a list comprehension to get all the titles.

In [40]:
titles = [m.find('span', class_='p--small').text.strip() for m in a_s]
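The comprehension above assumes every tile contains the title `<span>`. If one does not, `find` returns `None` and `.text` raises an `AttributeError`. Here is a minimal defensive sketch, run on a made-up HTML fragment that imitates the tile structure shown earlier. The helper `tile_title`, the `href` values, and the second, title-less tile are all hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the page's tiles; the second tile has no title span.
html = '''
<a class="js-tile-link" href="/m/example_movie">
  <span class="p--small" data-qa="discovery-media-list-item-title">
    The Integrity of Joseph Chambers
  </span>
</a>
<a class="js-tile-link" href="/m/untitled_example"></a>
'''


def tile_title(tile):
    """Return the stripped title text of a tile, or None if the span is missing."""
    span = tile.find('span', class_='p--small')
    return span.text.strip() if span is not None else None


soup = BeautifulSoup(html, 'html.parser')
tiles = soup.find_all('a', class_='js-tile-link', href=True)
sample_titles = [tile_title(t) for t in tiles]
print(sample_titles)
```

Returning `None` for a missing title keeps the list the same length as the list of tiles, so it still lines up with the other scraped columns.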

In [41]:
titles

['80 for Brady',
 '88',
 'A Man Called Otto',
 'A Radiant Girl',
 'All the Beauty and the Bloodshed',
 'Amigos',
 'Among the Beasts',
 'Ant-Man and The Wasp: Quantumania',
 'Avatar: The Way of Water',
 'Baby Ruby',
 'Bakasuran',
 'Black Panther: Wakanda Forever',
 'Broker',
 'Brotherhood of the Wolf',
 'Cat Daddies',
 'Cinema Sabaya',
 'Consecration',
 'Corsage',
 'Dada',
 'Daughter',
 "Devil's Peak",
 'EO',
 'Eating Miss Campbell',
 'Emily',
 'Enkilum Chandrike',
 'Facing the Laughter: Minnie Pearl',
 'Fear',
 'Framing Agnes',
 'Fulbari',
 'Full Time',
 'Golgappe',
 'Hannah Ha Ha',
 'Heart of a Champion',
 'Hidden Blade',
 'House Party',
 'Huesera: The Bone Woman',
 'Infinity Pool',
 'Irreversible: Straight Cut',
 'Irréversible',
 'Kaguya-sama: Love Is War - The First Kiss that Never Ends',
 'Kasethan Kadavulada',
 'Knock at the Cabin',
 'Leonor Will Never Die',
 'Let It Be Morning',
 'Life Upside Down',
 'Living',
 'Lonesome',
 'M3GAN',
 'Made in Chittagong',
 "Magic Mike's Last Danc

In [44]:
start_date = [m.find('span', class_='smaller').text.strip() for m in a_s]
start_date

['Opened Feb 03, 2023',
 'Opens Feb 17, 2023',
 'Opened Jan 13, 2023',
 'Opens Feb 17, 2023',
 'Opened Nov 23, 2022',
 'Opened Feb 10, 2023',
 'Opened Feb 10, 2023',
 'Opens Feb 17, 2023',
 'Opened Dec 16, 2022',
 'Opened Feb 03, 2023',
 'Opens Feb 17, 2023',
 'Opened Nov 11, 2022',
 'Opened Dec 26, 2022',
 'Opened Jan 25, 2002',
 'Opened Oct 14, 2022',
 'Opened Feb 10, 2023',
 'Opened Feb 10, 2023',
 'Opened Dec 23, 2022',
 'Opened Feb 10, 2023',
 'Opened Feb 10, 2023',
 'Opens Feb 17, 2023',
 'Opened Nov 18, 2022',
 'Opened Feb 16, 2023',
 'Opens Feb 17, 2023',
 'Opens Feb 17, 2023',
 'Opened Feb 06, 2023',
 'Opened Jan 27, 2023',
 'Opened Dec 02, 2022',
 'Opens Feb 17, 2023',
 'Opened Feb 03, 2023',
 'Opens Feb 17, 2023',
 'Opened Feb 10, 2023',
 'Opens Feb 17, 2023',
 'Opens Feb 17, 2023',
 'Opened Jan 13, 2023',
 'Opened Feb 10, 2023',
 'Opened Jan 27, 2023',
 'Opened Feb 10, 2023',
 'Opened Mar 14, 2003',
 'Opened Feb 14, 2023',
 'Opened Feb 10, 2023',
 'Opened Feb 03, 2023',
 'O
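The `start_date` strings mix a status word with a date. As a sketch of one way to separate them, the hypothetical helper `parse_start_date` below splits off the first word and parses the rest with `datetime.strptime`. It assumes every entry follows the 'Opened Feb 03, 2023' pattern seen in the output above.

```python
from datetime import datetime


def parse_start_date(text):
    """Split e.g. 'Opened Feb 03, 2023' into a status word and a date object."""
    status, date_text = text.split(' ', 1)
    # %b matches the abbreviated month name (English locale assumed).
    return status, datetime.strptime(date_text, '%b %d, %Y').date()


for raw in ['Opened Feb 03, 2023', 'Opens Feb 17, 2023', 'Opened Jan 25, 2002']:
    status, opened_on = parse_start_date(raw)
    print(status, opened_on.isoformat())
```

Keeping the status word separate makes it easy to tell films already in theaters ('Opened') from upcoming releases ('Opens').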

In [45]:
aud_score = [m.find('score-pairs')['audiencescore'] for m in a_s]
aud_score

In [47]:
aud_sent = [m.find('score-pairs')['audiencesentiment'] for m in a_s]
aud_sent

['positive',
 '',
 'positive',
 '',
 'negative',
 'positive',
 '',
 '',
 'positive',
 '',
 '',
 'positive',
 'positive',
 'positive',
 'positive',
 '',
 'negative',
 'negative',
 '',
 'positive',
 '',
 'positive',
 '',
 '',
 '',
 'positive',
 'negative',
 '',
 '',
 'positive',
 '',
 '',
 '',
 '',
 'positive',
 '',
 'negative',
 '',
 'positive',
 'positive',
 '',
 'positive',
 'positive',
 '',
 'positive',
 'positive',
 '',
 'positive',
 '',
 'positive',
 '',
 'negative',
 'positive',
 'positive',
 '',
 'positive',
 'positive',
 'positive',
 '',
 'positive',
 'positive',
 'positive',
 'negative',
 'negative',
 'positive',
 'negative',
 '',
 '',
 'negative',
 '',
 'positive',
 'negative',
 'positive',
 'positive',
 '',
 '',
 '',
 '',
 'negative',
 'positive',
 'positive',
 'positive',
 '',
 'positive',
 'positive',
 'positive',
 'positive',
 '',
 'positive',
 'positive',
 'positive',
 'positive',
 '',
 'positive',
 '',
 'positive',
 'negative',
 'positive']

In [49]:
critic_score = [m.find('score-pairs')['criticsscore'] for m in a_s]
critic_score

['61',
 '67',
 '70',
 '92',
 '95',
 '',
 '',
 '51',
 '76',
 '68',
 '',
 '84',
 '94',
 '73',
 '82',
 '88',
 '37',
 '85',
 '',
 '92',
 '',
 '96',
 '',
 '91',
 '',
 '',
 '22',
 '79',
 '',
 '97',
 '',
 '',
 '',
 '',
 '28',
 '98',
 '86',
 '83',
 '58',
 '',
 '',
 '68',
 '91',
 '87',
 '5',
 '96',
 '100',
 '94',
 '',
 '49',
 '',
 '23',
 '86',
 '100',
 '90',
 '92',
 '100',
 '83',
 '',
 '76',
 '95',
 '95',
 '94',
 '44',
 '61',
 '67',
 '',
 '',
 '71',
 '',
 '96',
 '93',
 '98',
 '92',
 '',
 '86',
 '100',
 '100',
 '72',
 '30',
 '85',
 '78',
 '',
 '65',
 '88',
 '',
 '72',
 '',
 '95',
 '91',
 '',
 '96',
 '',
 '73',
 '',
 '91',
 '33',
 '']

In [51]:

df = pd.DataFrame({'title': titles, 'start_date': start_date, 'aud_score': aud_score, 'aud_sent': aud_sent, 'critic_score': critic_score})

In [54]:
df_non_empty = df[df['aud_score'] != '']
df_non_empty = df_non_empty[df_non_empty['critic_score'] != '']

In [55]:
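# Correlation between audience score and critic score
# Keep only rows where both scores are present; .copy() avoids a SettingWithCopyWarning
df_non_empty = df[(df['aud_score'] != '') & (df['critic_score'] != '')].copy()
df_non_empty['aud_score'] = df_non_empty['aud_score'].astype(int)
df_non_empty['critic_score'] = df_non_empty['critic_score'].astype(int)
# Select the numeric columns explicitly; recent pandas versions raise an error if corr() sees text columns
df_non_empty[['aud_score', 'critic_score']].corr()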

              aud_score  critic_score
aud_score      1.0        0.487053
critic_score   0.487053   1.0
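An alternative to filtering out empty strings before casting is `pd.to_numeric` with `errors='coerce'`, which turns unparseable entries such as `''` into `NaN`; `DataFrame.corr` then skips incomplete pairs. Here is a minimal sketch on a made-up five-row frame (the values are illustrative, not scraped):

```python
import pandas as pd

# Made-up scores in the same string format the scraper returns ('' = missing).
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'aud_score': ['90', '', '40', '70', '55'],
    'critic_score': ['80', '67', '', '75', '50'],
})

score_cols = ['aud_score', 'critic_score']
# errors='coerce' converts '' (and anything else non-numeric) to NaN.
toy[score_cols] = toy[score_cols].apply(pd.to_numeric, errors='coerce')

# corr() uses pairwise-complete observations, so rows B and C drop out.
toy_corr = toy[score_cols].corr()
print(toy_corr)
```

This keeps every movie in the dataframe, which can be useful if later steps need the titles of films that are missing one of the two scores.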
