# Best of Rotten Tomatoes

## Gather

In [19]:
import pandas as pd
import zipfile
import os
import requests
import glob

### 1. Files on hand

#### 1st source: files on hand

In [20]:
df = pd.read_csv('bestofrt.tsv', sep='\t')
df

Unnamed: 0,ranking,critic_score,title,number_of_critic_ratings
0,1,99,The Wizard of Oz (1939),110
1,2,100,Citizen Kane (1941),75
2,3,100,The Third Man (1949),77
3,4,99,Get Out (2017),282
4,5,97,Mad Max: Fury Road (2015),370
...,...,...,...,...
95,96,100,Man on Wire (2008),156
96,97,97,Jaws (1975),74
97,98,100,Toy Story (1995),78
98,99,97,"The Godfather, Part II (1974)",72


### 2. Web scraping & accessing HTML

The quickest way to get HTML data is to save the HTML file to your computer manually, by clicking Save in your browser.

**Programmatic Access**

Programmatic access is preferred for scalability and reproducibility. Two options include:

 1. Downloading the HTML file programmatically. We'll explore this code in more detail later.

In [21]:
import requests
url = 'https://www.rottentomatoes.com/m/et_the_extraterrestrial'
response = requests.get(url)

# Save HTML to file

with open("et_the_extraterrestrial.html", mode='wb') as file:
    file.write(response.content)

 2. Working with the response content live in your computer's memory using the BeautifulSoup HTML parser

In [22]:
# Work with HTML in memory

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'lxml')

Alternatively, we can make the soup from the saved file: open the HTML file to get a filehandle, then pass that filehandle into the Beautiful Soup constructor along with a parser. We're using lxml, a fast and widely used parser:

In [23]:
with open("et_the_extraterrestrial.html") as file:
    soup = BeautifulSoup(file, 'lxml')

This result looks exactly like an HTML document and we can use methods in the Beautiful Soup library to easily find and extract data from this HTML.

`find()` is one of the most commonly used Beautiful Soup methods. Like the find feature in a text editor, it returns the first element that matches.

In [24]:
soup.find('title')

<title>E.T. the Extra-Terrestrial - Rotten Tomatoes</title>

We get the `title` element of the webpage, and not the title of the movie.

To get the movie title only, we'll need to do some string slicing. We can use `.contents` to return a list of the tag's children. Because the `title` tag contains a single piece of text, the list is one item long, so we can access it using the index `0` and then slice off the trailing ` - Rotten Tomatoes`:

In [25]:
soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]

'E.T. the Extra-Terrestrial'
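The slice above assumes that every page title ends with the site suffix. As a minimal sketch, under the assumption (not stated in the lesson) that a title without the suffix should be returned unchanged, the same step could be wrapped in a small helper. `strip_site_suffix` is a hypothetical name, not part of Beautiful Soup.

```python
SITE_SUFFIX = ' - Rotten Tomatoes'

def strip_site_suffix(page_title, suffix=SITE_SUFFIX):
    """Return page_title without the trailing suffix, if it is present."""
    if page_title.endswith(suffix):
        return page_title[:-len(suffix)]
    # Assumption: leave titles that lack the suffix untouched
    return page_title

movie_title = strip_site_suffix('E.T. the Extra-Terrestrial - Rotten Tomatoes')
print(movie_title)
```

The guard matters because slicing with `[:-len(suffix)]` removes the last 18 characters regardless of what they are.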

If a returned title contains `\xa0`, that is the Unicode non-breaking space, which we would need to deal with later in the cleaning step (the title shown above happens not to contain one). We won't do any cleaning in this lesson, so we can ignore this.

#### 2nd source: web scraping

Extract the Audience Score metric, number of audience ratings and the movie title from Rotten Tomatoes pages.

In [26]:
# Extract all contents from zip file
with zipfile.ZipFile('rt-html.zip', 'r') as myzip:
    myzip.extractall()

In [27]:
# List of dictionaries to build file by file and later convert to a DataFrame
df_list = []
folder = 'rt_html'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        # Parse each saved page and pull out the three fields
        # Note: the full loop may take ~15 seconds to run
        soup = BeautifulSoup(file, 'lxml')
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        audience_score = soup.find('div', class_='audience-score meter').find('span').contents[0][:-1]
        num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor')
        num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',', '')
        # Append to list of dictionaries
        df_list.append({'title': title,
                        'audience_score': int(audience_score),
                        'number_of_audience_ratings': int(num_audience_ratings)})
df = pd.DataFrame(df_list, columns = ['title', 'audience_score', 'number_of_audience_ratings'])

df

Unnamed: 0,title,audience_score,number_of_audience_ratings
0,Man on Wire (2008),87,29827
1,Laura (1944),91,10481
2,The Big Sick (2017),90,23391
3,The Wages of Fear (1953),95,8536
4,Argo (2012),90,207373
...,...,...,...
95,Baby Driver (2017),89,48114
96,M (1931),95,35778
97,Skyfall (2012),86,372497
98,Jaws (1975),90,942217
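The scraping loop relies on two small string conversions: dropping the trailing `%` from the score and removing the thousands separators from the ratings count. The sketch below isolates them so they can be checked without any HTML. The helper names and the raw input strings are illustrative assumptions, shaped to match the first row of the table above.

```python
def parse_score(raw):
    """Turn a percentage string such as '87%' into the integer 87."""
    return int(raw.strip().rstrip('%'))

def parse_count(raw):
    """Turn a count string such as ' 29,827 ' into the integer 29827."""
    return int(raw.strip().replace(',', ''))

# Hypothetical raw strings as they might appear inside the tags
score = parse_score('87%')
count = parse_count('\n 29,827 ')
print(score, count)
```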


At this point, we've gathered enough data to produce a scatter plot with audience score on the horizontal axis and critics score on the vertical axis.

We'll skip the assessing and cleaning steps of the data wrangling process, which include joining our two DataFrames, and flash forward to the final product: the visualization.

![title](rt-scatter.png)

**Key Features of the Scatter Plot**

- Audience score on the horizontal axis ranges from 70% to 100%.
- Critics score on the vertical axis ranges from 91% to 100%.
- A vertical reference line marks the median audience score (90%).
- A horizontal reference line marks the median critics score (98%).
- The shade of blue encodes the number of audience ratings: the darker the circle, the more ratings.
- Circle size encodes the number of critic ratings: a larger circle means more critic ratings.

**Quadrants**

- Top right quadrant has the universally loved movies, with high audience scores and high critics scores.
- Bottom right quadrant has the critically underrated movies, with audience scores above the median and critics scores below the median for this top 100 list.
- Top left quadrant has the critically overrated movies, which audiences didn't like as much as critics did.
- Bottom left quadrant has the movies that didn't have particularly high critic or audience scores relative to the other movies on this list.
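The quadrant logic can be sketched as a comparison against the two medians. This is only an illustration: `quadrant` is a hypothetical helper, and treating a score exactly equal to the median as belonging to the upper or right side is an assumption, since the lesson does not say how ties are handled.

```python
def quadrant(audience_score, critic_score, audience_median=90, critic_median=98):
    """Name the scatter plot quadrant for one movie."""
    # Assumption: a score equal to the median counts as the higher side
    horizontal = 'right' if audience_score >= audience_median else 'left'
    vertical = 'top' if critic_score >= critic_median else 'bottom'
    return vertical + ' ' + horizontal

# Man on Wire (2008): audience score 87, critic score 100 (from the tables above)
print(quadrant(87, 100))  # top left
```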

This Best of Rotten Tomatoes critic versus audience score visualization required gathering data from two different sources: accessing files on hand and scraping data from web pages. The data was also in two different formats, a flat file (TSV), and HTML.

### 3. Download files from the Internet

An HTTP `GET` request asks the server for a resource and returns its contents in the response. For us, that resource is a text file, which we can then save to disk.

Use the `os` module to check whether a folder exists and create it if it doesn't.

```
# Make directory if it doesn't already exist
folder_name = 'new_folder'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)
```
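As an alternative sketch, `os.makedirs` accepts `exist_ok=True`, which folds the existence check and the creation into one call. The example runs inside a temporary directory (an assumption made only so that it leaves nothing behind).

```python
import os
import tempfile

# Work inside a throwaway directory so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as base:
    folder_name = os.path.join(base, 'new_folder')
    os.makedirs(folder_name, exist_ok=True)  # creates the folder
    os.makedirs(folder_name, exist_ok=True)  # second call is a no-op, no error
    created = os.path.isdir(folder_name)

print(created)
```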

Define the URL and call `requests.get`:

```
url = 'https://data-source.com'
response = requests.get(url)
```

If the request is successful, the server returns a `200` response, which is the HTTP status code for success. At that point, all the text in our text file is held in the computer's working memory, in the body of the response.
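A small guard can make the status check explicit before anything is written to disk. The sketch below uses stand-in objects rather than a live request, so it runs without network access. `body_if_ok` is a hypothetical helper, and the stand-ins only mimic the `status_code` and `content` attributes of a real Requests response.

```python
from types import SimpleNamespace

def body_if_ok(response):
    """Return response.content, or raise if the status code is not 200."""
    if response.status_code != 200:
        raise RuntimeError('Request failed with status ' + str(response.status_code))
    return response.content

# Stand-ins with the same two attributes a Requests response exposes
ok = SimpleNamespace(status_code=200, content=b'review text')
missing = SimpleNamespace(status_code=404, content=b'')

print(body_if_ok(ok))
```

With a real response object, calling `response.raise_for_status()` serves a similar purpose: it raises an exception for 4xx and 5xx status codes.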

Use the Requests `.content` attribute and some basic file I/O to save this file to our computer. We open the file in `wb` mode, which stands for write binary, because `response.content` is in bytes, not text. When we open these files in a text editor or in pandas later, the bytes will be decoded and rendered as human-readable text.

Next, we write to the filehandle we've opened with `file.write(response.content)`:

```
with open(os.path.join(folder_name, url.split('/')[-1]), mode = 'wb') as file:
    file.write(response.content)
```
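The file name above comes from `url.split('/')[-1]`, which takes everything after the last slash. That works for the URLs in this lesson. As a hedged sketch, a variant built on `urllib.parse` also ignores a query string, should a URL carry one. `filename_from_url` is a hypothetical helper, and the `?download=1` suffix is made up for the demonstration.

```python
import posixpath
from urllib.parse import urlsplit

def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)

url = ('https://d17h27t6h515a5.cloudfront.net/topher/2017/September/'
       '59ad9900_1-the-wizard-of-oz-1939-film/1-the-wizard-of-oz-1939-film.txt')

simple_name = url.split('/')[-1]
robust_name = filename_from_url(url + '?download=1')  # hypothetical query string
print(simple_name)
print(robust_name)
```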

#### 3rd source: downloading files from the Internet

Next, we start on the Roger Ebert review word cloud. We'll need the text of each of his reviews for the movies on the Rotten Tomatoes Top 100 Movies of All Time list that have a review on his website. Lucky for you, I've pre-gathered all of this text in the form of 88 `.txt` files that you can download programmatically.

In [28]:
# Make directory if it doesn't already exist
folder_name = 'ebert_reviews'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

In [29]:
ebert_review_urls = ['https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9900_1-the-wizard-of-oz-1939-film/1-the-wizard-of-oz-1939-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9901_2-citizen-kane/2-citizen-kane.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9901_3-the-third-man/3-the-third-man.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_4-get-out-film/4-get-out-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_5-mad-max-fury-road/5-mad-max-fury-road.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_6-the-cabinet-of-dr.-caligari/6-the-cabinet-of-dr.-caligari.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_7-all-about-eve/7-all-about-eve.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_8-inside-out-2015-film/8-inside-out-2015-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_9-the-godfather/9-the-godfather.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_10-metropolis-1927-film/10-metropolis-1927-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_11-e.t.-the-extra-terrestrial/11-e.t.-the-extra-terrestrial.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_12-modern-times-film/12-modern-times-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_14-singin-in-the-rain/14-singin-in-the-rain.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_15-boyhood-film/15-boyhood-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_16-casablanca-film/16-casablanca-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_17-moonlight-2016-film/17-moonlight-2016-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_18-psycho-1960-film/18-psycho-1960-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_19-laura-1944-film/19-laura-1944-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_20-nosferatu/20-nosferatu.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_21-snow-white-and-the-seven-dwarfs-1937-film/21-snow-white-and-the-seven-dwarfs-1937-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_22-a-hard-day27s-night-film/22-a-hard-day27s-night-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_23-la-grande-illusion/23-la-grande-illusion.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_25-the-battle-of-algiers/25-the-battle-of-algiers.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_26-dunkirk-2017-film/26-dunkirk-2017-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_27-the-maltese-falcon-1941-film/27-the-maltese-falcon-1941-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_29-12-years-a-slave-film/29-12-years-a-slave-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_30-gravity-2013-film/30-gravity-2013-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_31-sunset-boulevard-film/31-sunset-boulevard-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_32-king-kong-1933-film/32-king-kong-1933-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_33-spotlight-film/33-spotlight-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_34-the-adventures-of-robin-hood/34-the-adventures-of-robin-hood.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_35-rashomon/35-rashomon.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_36-rear-window/36-rear-window.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_37-selma-film/37-selma-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_38-taxi-driver/38-taxi-driver.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_39-toy-story-3/39-toy-story-3.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_40-argo-2012-film/40-argo-2012-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_41-toy-story-2/41-toy-story-2.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_42-the-big-sick/42-the-big-sick.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_43-bride-of-frankenstein/43-bride-of-frankenstein.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_44-zootopia/44-zootopia.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_45-m-1931-film/45-m-1931-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_46-wonder-woman-2017-film/46-wonder-woman-2017-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_48-alien-film/48-alien-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_49-bicycle-thieves/49-bicycle-thieves.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_50-seven-samurai/50-seven-samurai.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_51-the-treasure-of-the-sierra-madre-film/51-the-treasure-of-the-sierra-madre-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_52-up-2009-film/52-up-2009-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_53-12-angry-men-1957-film/53-12-angry-men-1957-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_54-the-400-blows/54-the-400-blows.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9911_55-logan-film/55-logan-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9911_57-army-of-shadows/57-army-of-shadows.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9912_58-arrival-film/58-arrival-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9912_59-baby-driver/59-baby-driver.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_60-a-streetcar-named-desire-1951-film/60-a-streetcar-named-desire-1951-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_61-the-night-of-the-hunter-film/61-the-night-of-the-hunter-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_62-star-wars-the-force-awakens/62-star-wars-the-force-awakens.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_63-manchester-by-the-sea-film/63-manchester-by-the-sea-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_64-dr.-strangelove/64-dr.-strangelove.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_66-vertigo-film/66-vertigo-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_67-the-dark-knight-film/67-the-dark-knight-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_68-touch-of-evil/68-touch-of-evil.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_69-the-babadook/69-the-babadook.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_72-rosemary27s-baby-film/72-rosemary27s-baby-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9916_73-finding-nemo/73-finding-nemo.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9916_74-brooklyn-film/74-brooklyn-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9917_75-the-wrestler-2008-film/75-the-wrestler-2008-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9917_77-l.a.-confidential-film/77-l.a.-confidential-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_78-gone-with-the-wind-film/78-gone-with-the-wind-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_79-the-good-the-bad-and-the-ugly/79-the-good-the-bad-and-the-ugly.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_80-skyfall/80-skyfall.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_82-tokyo-story/82-tokyo-story.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_83-hell-or-high-water-film/83-hell-or-high-water-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_84-pinocchio-1940-film/84-pinocchio-1940-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_85-the-jungle-book-2016-film/85-the-jungle-book-2016-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991a_86-la-la-land-film/86-la-la-land-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991b_87-star-trek-film/87-star-trek-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991b_89-apocalypse-now/89-apocalypse-now.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_90-on-the-waterfront/90-on-the-waterfront.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_91-the-wages-of-fear/91-the-wages-of-fear.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_92-the-last-picture-show/92-the-last-picture-show.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_93-harry-potter-and-the-deathly-hallows-part-2/93-harry-potter-and-the-deathly-hallows-part-2.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_94-the-grapes-of-wrath-film/94-the-grapes-of-wrath-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_96-man-on-wire/96-man-on-wire.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_97-jaws-film/97-jaws-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_98-toy-story/98-toy-story.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_99-the-godfather-part-ii/99-the-godfather-part-ii.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_100-battleship-potemkin/100-battleship-potemkin.txt']

In [30]:
for url in ebert_review_urls:
    response = requests.get(url)
    with open(os.path.join(folder_name, url.split('/')[-1]), mode = 'wb') as file:
        file.write(response.content)

In [31]:
len(os.listdir(folder_name))

88

Of the top 100 Rotten Tomatoes movies, 12 did not have a review on Roger Ebert's website, which is why only 88 files were downloaded.
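Because every file name starts with the movie's ranking, the gaps can be found with a set difference. The sketch below is one possible way to do it. `missing_rankings` is a hypothetical helper, and the sample URLs are shortened, made-up stand-ins in the same `<ranking>-<slug>.txt` shape.

```python
def missing_rankings(urls, total=100):
    """Return the rankings in 1..total that have no matching file name."""
    present = {int(url.split('/')[-1].split('-')[0]) for url in urls}
    return sorted(set(range(1, total + 1)) - present)

# Made-up URLs for illustration only
sample_urls = ['https://example.com/a/1-first.txt',
               'https://example.com/b/2-second.txt',
               'https://example.com/c/4-fourth.txt']

print(missing_rankings(sample_urls, total=5))  # [3, 5]
```

Calling `missing_rankings(ebert_review_urls)` on the list above should return the 12 rankings without a review.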

Next, we store the reviews in a DataFrame.

In [32]:
df_list = []
for ebert_review in glob.glob('ebert_reviews/*.txt'):
    with open(ebert_review, encoding='utf-8') as file:
        title = file.readline()[:-1]
        review_url = file.readline()[:-1]
        review_text = file.read()
        
        df_list.append({'title': title,
                        'review_url': review_url,
                        'review_text': review_text})
        
df = pd.DataFrame(df_list, columns = ['title','review_url','review_text'])
df

Unnamed: 0,title,review_url,review_text
0,Dr. Strangelove Or How I Learned to Stop Worry...,http://www.rogerebert.com/reviews/great-movie-...,"Every time you see a great film, you find new ..."
1,The Night of the Hunter (1955),http://www.rogerebert.com/reviews/great-movie-...,"Charles Laughton's ""The Night of the Hunter” (..."
2,The Jungle Book (2016),http://www.rogerebert.com/reviews/the-jungle-b...,"I saw the newest Disney version of ""The Jungle..."
3,Dunkirk (2017),http://www.rogerebert.com/reviews/dunkirk-2017,"Lean and ambitious, unsentimental and bombasti..."
4,Brooklyn (2015),http://www.rogerebert.com/reviews/brooklyn-2015,Colm Tóibín’s 2009 novel “Brooklyn” is one of ...
...,...,...,...
83,A Hard Day's Night (1964),http://www.rogerebert.com/reviews/great-movie-...,"When it opened in September, 1964, ""A Hard Day..."
84,Jaws (1975),http://www.rogerebert.com/reviews/great-movie-...,"""You're going to need a bigger boat.""\n\nSo th..."
85,Rashômon (1951),http://www.rogerebert.com/reviews/great-movie-...,"Shortly before filming was to begin on ""Rashom..."
86,Hell or High Water (2016),http://www.rogerebert.com/reviews/hell-or-high...,"After a summer filled with retreads, ripoffs a..."
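The loop above depends on the layout of each review file: the title on the first line, the review URL on the second, and the review text in everything that follows. The sketch below shows the same parsing on an in-memory file. The sample content and the `example.com` URL are invented for illustration, and `rstrip('\n')` is used instead of `[:-1]` so that a line without a trailing newline does not lose its last character.

```python
import io

def parse_review(file):
    """Split a review file into title, URL, and review text."""
    title = file.readline().rstrip('\n')       # line 1: movie title
    review_url = file.readline().rstrip('\n')  # line 2: review URL
    review_text = file.read()                  # the rest: review body
    return {'title': title, 'review_url': review_url, 'review_text': review_text}

# Invented sample in the assumed three-part layout
sample = io.StringIO('Jaws (1975)\nhttp://example.com/jaws\n'
                     'First paragraph.\n\nSecond paragraph.')
record = parse_review(sample)
print(record['title'])
```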


### 4. APIs (Application Programming Interfaces)

**Getting the Movie Poster Images**

We could scrape the image URL from the HTML. But a better way is to access them through an API (Application Programming Interface). Each movie has its poster on its Wikipedia page, so we can use Wikipedia's API.

APIs give you relatively easy access to data from the Internet. Twitter, Facebook, Instagram all have APIs and there are many open-source APIs.

In this lesson we'll be using the API provided by [MediaWiki](https://www.mediawiki.org/wiki/MediaWiki), the open-source software that runs Wikipedia.

**Why Can't We Use the Rotten Tomatoes API?**

The Rotten Tomatoes API does provide audience scores, which means we could have hit the API instead of scraping them off the Rotten Tomatoes web pages earlier in the lesson. But this API doesn't provide posters and images.

In addition, the Rotten Tomatoes API requires you to apply for access before using it.

But that is fine because this API wasn't going to be scalable enough for use in this course anyway.

**When Given a Choice, Pick API over Scraping**

Scraping is brittle and breaks with web layout redesigns because the underlying HTML has changed.

APIs and their access libraries let programmers access data in a simple manner. `rtsimple` is a Python access library for Rotten Tomatoes. If we had permission to use the Rotten Tomatoes API, we could import `rtsimple`, set our API key, create an object for each movie, and access the ratings data directly from the movie object.

```
import rtsimple as rt
rt.API_Key = 'YOUR API KEY HERE'
movie = rt.Movies('10489')
movie.ratings['audience_score']
```

**MediaWiki API**

MediaWiki has a great [tutorial](https://www.mediawiki.org/wiki/API:Tutorial) on their website on how their API calls are structured. It's a nice and simple example and they explain the various moving parts:

- The endpoint (important takeaway: there is nothing special about this URL!)
- The format
- The action
- Action-specific parameters

Go and read that example and then come back to the classroom.

Done reading? Great! Though they say that is a "simple example," it could definitely be simpler! This is where access libraries, also known as client libraries or even just libraries (as in "Twitter API libraries"), come into play and make our lives easier.
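For reference, the four parts listed earlier (endpoint, format, action, and action-specific parameters) map onto a raw request URL roughly as follows. This sketch only builds the URL and sends nothing. The parameter values are chosen for illustration and are not taken from the lesson.

```python
from urllib.parse import urlencode

# The endpoint: an ordinary URL
endpoint = 'https://en.wikipedia.org/w/api.php'

params = {
    'action': 'query',                       # the action
    'format': 'json',                        # the format
    'titles': 'E.T._the_Extra-Terrestrial',  # action-specific parameter
    'prop': 'images',                        # action-specific parameter
}

request_url = endpoint + '?' + urlencode(params)
print(request_url)
```

An access library hides this assembly step, along with the parsing of the JSON that comes back.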

**wptools Library**

There are a bunch of different access libraries for MediaWiki to satisfy the variety of programming languages that exist. Here is a [list](https://www.mediawiki.org/wiki/API:Client_code#Python) for Python. This is pretty standard for most APIs. Some libraries are better than others, which, again, is standard. For MediaWiki, the most up-to-date and human-readable one in Python is called [wptools](https://github.com/siznax/wptools). The relationship is analogous to the one between the Twitter API and tweepy:

- MediaWiki API → wptools
- Twitter API → tweepy

*wptools* has an even simpler [tutorial](https://github.com/siznax/wptools/wiki/Usage#page-usage) on their GitHub page using the [Mahatma Gandhi Wikipedia page](https://en.wikipedia.org/wiki/Mahatma_Gandhi) as a working example.

To get a `page` object, the usage is as follows:

```
page = wptools.page('Mahatma_Gandhi')
```

where 'Mahatma_Gandhi' is the last part of the Wikipedia URL for that page (https://en.wikipedia.org/wiki/Mahatma_Gandhi). This `page` object has methods that can get us various pieces of data about that Wikipedia page, including all of the images on the page. To get all of the data, simply call `get()` on a page; it will automagically fetch extracts, images, infobox data, wikidata, and other metadata via the MediaWiki, Wikidata, and RESTBase APIs.

```
page = wptools.page('Mahatma_Gandhi').get()
```

Or if you already have a page object assigned to `page`:

```
page.get()
```

`page` now has the following attributes, which are stored in the `page.data` dictionary and can be accessed by key:

![title](gandhi.png)

`page.data['image']`, for example, would return a list of data for six images on this specific Wikipedia page.

In [33]:
import wptools

In [34]:
page = wptools.page('E.T._the_Extra-Terrestrial').get()

en.wikipedia.org (query) E.T._the_Extra-Terrestrial
en.wikipedia.org (query) E.T. the Extra-Terrestrial (&plcontinue=...
en.wikipedia.org (parse) 73441
www.wikidata.org (wikidata) Q11621
www.wikidata.org (labels) P1981|Q499789|P373|Q168383|P162|P2465|P...
www.wikidata.org (labels) Q76757691|P18|Q488651|Q787131|Q10218003...
www.wikidata.org (labels) Q258064|Q981030|P1874|Q24909800|Q676094...
www.wikidata.org (labels) P2408|P4969|Q131520|P31|Q457893|P1712|P...
www.wikidata.org (labels) P344|Q1270715|P750|P6398|P1258|P136|P25...
en.wikipedia.org (restbase) /page/summary/E.T._the_Extra-Terrestrial
en.wikipedia.org (imageinfo) File:ET logo 3.svg|File:E t the extr...
E.T. the Extra-Terrestrial (en) data
{
  aliases: <list(2)> E.T., ET
  assessments: <dict(4)> United States, Film, Science Fiction, Lib...
  claims: <dict(127)> P1562, P57, P272, P345, P31, P161, P373, P48...
  description: 1982 American film
  exhtml: <str(485)> <p><i><b>E.T. the Extra-Terrestrial</b></i> i...
  exrest: <str(46

Accessing the first image of the image attribute:

In [35]:
page.data['image'][0]

{'kind': 'parse-image',
 'file': 'File:E t the extra terrestrial ver3.jpg',
 'orig': 'E t the extra terrestrial ver3.jpg',
 'timestamp': '2016-06-04T10:30:46Z',
 'size': 83073,
 'width': 253,
 'height': 394,
 'url': 'https://upload.wikimedia.org/wikipedia/en/6/66/E_t_the_extra_terrestrial_ver3.jpg',
 'descriptionurl': 'https://en.wikipedia.org/wiki/File:E_t_the_extra_terrestrial_ver3.jpg',
 'descriptionshorturl': 'https://en.wikipedia.org/w/index.php?curid=7419503',
 'title': 'File:E t the extra terrestrial ver3.jpg',
 'metadata': {'DateTime': {'value': '2016-06-04 10:30:46',
   'source': 'mediawiki-metadata',
   'hidden': ''},
  'ObjectName': {'value': 'E t the extra terrestrial ver3',
   'source': 'mediawiki-metadata',
   'hidden': ''},
  'CommonsMetadataExtension': {'value': 1.2,
   'source': 'extension',
   'hidden': ''},
  'Categories': {'value': 'All non-free media|E.T. the Extra-Terrestrial|Fair use images of film posters|Files with no machine-readable author|Noindexed pages|Wik

Accessing the director key of the infobox attribute:

In [36]:
page.data['infobox']['director']

'[[Steven Spielberg]]'
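The double square brackets are wiki link markup that comes back as part of the infobox value. If plain text is needed later, a regular expression is one way to strip it. This is a minimal sketch rather than anything wptools provides: `strip_wikilinks` is a hypothetical helper, and the piped-link example is invented.

```python
import re

# Matches [[Target]] and [[Target|Label]], capturing the visible text
WIKILINK = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]')

def strip_wikilinks(text):
    """Replace each wiki link with its visible text."""
    return WIKILINK.sub(r'\1', text)

director = strip_wikilinks('[[Steven Spielberg]]')
print(director)
```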

#### 4th source: API

In [44]:
from PIL import Image
from io import BytesIO

In [45]:
title_list = [
 'The_Wizard_of_Oz_(1939_film)',
 'Citizen_Kane',
 'The_Third_Man',
 'Get_Out_(film)',
 'Mad_Max:_Fury_Road',
 'The_Cabinet_of_Dr._Caligari',
 'All_About_Eve',
 'Inside_Out_(2015_film)',
 'The_Godfather',
 'Metropolis_(1927_film)',
 'E.T._the_Extra-Terrestrial',
 'Modern_Times_(film)',
 'It_Happened_One_Night',
 "Singin'_in_the_Rain",
 'Boyhood_(film)',
 'Casablanca_(film)',
 'Moonlight_(2016_film)',
 'Psycho_(1960_film)',
 'Laura_(1944_film)',
 'Nosferatu',
 'Snow_White_and_the_Seven_Dwarfs_(1937_film)',
 "A_Hard_Day%27s_Night_(film)",
 'La_Grande_Illusion',
 'North_by_Northwest',
 'The_Battle_of_Algiers',
 'Dunkirk_(2017_film)',
 'The_Maltese_Falcon_(1941_film)',
 'Repulsion_(film)',
 '12_Years_a_Slave_(film)',
 'Gravity_(2013_film)',
 'Sunset_Boulevard_(film)',
 'King_Kong_(1933_film)',
 'Spotlight_(film)',
 'The_Adventures_of_Robin_Hood',
 'Rashomon',
 'Rear_Window',
 'Selma_(film)',
 'Taxi_Driver',
 'Toy_Story_3',
 'Argo_(2012_film)',
 'Toy_Story_2',
 'The_Big_Sick',
 'Bride_of_Frankenstein',
 'Zootopia',
 'M_(1931_film)',
 'Wonder_Woman_(2017_film)',
 'The_Philadelphia_Story_(film)',
 'Alien_(film)',
 'Bicycle_Thieves',
 'Seven_Samurai',
 'The_Treasure_of_the_Sierra_Madre_(film)',
 'Up_(2009_film)',
 '12_Angry_Men_(1957_film)',
 'The_400_Blows',
 'Logan_(film)',
 'All_Quiet_on_the_Western_Front_(1930_film)',
 'Army_of_Shadows',
 'Arrival_(film)',
 'Baby_Driver',
 'A_Streetcar_Named_Desire_(1951_film)',
 'The_Night_of_the_Hunter_(film)',
 'Star_Wars:_The_Force_Awakens',
 'Manchester_by_the_Sea_(film)',
 'Dr._Strangelove',
 'Frankenstein_(1931_film)',
 'Vertigo_(film)',
 'The_Dark_Knight_(film)',
 'Touch_of_Evil',
 'The_Babadook',
 'The_Conformist_(film)',
 'Rebecca_(1940_film)',
 "Rosemary%27s_Baby_(film)",
 'Finding_Nemo',
 'Brooklyn_(film)',
 'The_Wrestler_(2008_film)',
 'The_39_Steps_(1935_film)',
 'L.A._Confidential_(film)',
 'Gone_with_the_Wind_(film)',
 'The_Good,_the_Bad_and_the_Ugly',
 'Skyfall',
 'Rome,_Open_City',
 'Tokyo_Story',
 'Hell_or_High_Water_(film)',
 'Pinocchio_(1940_film)',
 'The_Jungle_Book_(2016_film)',
 'La_La_Land_(film)',
 'Star_Trek_(film)',
 'High_Noon',
 'Apocalypse_Now',
 'On_the_Waterfront',
 'The_Wages_of_Fear',
 'The_Last_Picture_Show',
 'Harry_Potter_and_the_Deathly_Hallows_–_Part_2',
 'The_Grapes_of_Wrath_(film)',
 'Roman_Holiday',
 'Man_on_Wire',
 'Jaws_(film)',
 'Toy_Story',
 'The_Godfather_Part_II',
 'Battleship_Potemkin'
]
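Two entries in this list write the apostrophe as `%27`, while `"Singin'_in_the_Rain"` uses the literal character. A title that is already percent-encoded is likely to be encoded a second time when the request is built (`%27` becoming `%2527`), which the API then rejects as a bad title. A hedged sketch of one way to normalize such titles with the standard library before passing them to wptools:

```python
from urllib.parse import unquote

encoded_titles = ['A_Hard_Day%27s_Night_(film)',
                  'Rosemary%27s_Baby_(film)',
                  'Citizen_Kane']

# Decode percent-escapes; titles without any are returned unchanged
decoded_titles = [unquote(title) for title in encoded_titles]
print(decoded_titles)
```

Whether the decoded titles resolve to the intended Wikipedia pages would still need to be checked against the live API.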

In [46]:
folder_name = 'bestofrt_posters'
# Make directory if it doesn't already exist
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

In [48]:
# List of dictionaries to build and convert to a DataFrame later
df_list = []
image_errors = {}
for ranking, title in enumerate(title_list, start=1):
    # Reset so the except block never records a previous movie's images
    images = None
    try:
        # This cell is slow, so print the ranking to gauge time remaining
        print(ranking)
        page = wptools.page(title, silent=True)
        images = page.get().data['image']
        # First image is usually the poster
        first_image_url = images[0]['url']
        r = requests.get(first_image_url)
        # Download movie poster image
        i = Image.open(BytesIO(r.content))
        image_file_format = first_image_url.split('.')[-1]
        i.save(os.path.join(folder_name, str(ranking) + "_" + title + '.' + image_file_format))
        # Append to list of dictionaries
        df_list.append({'ranking': ranking,
                        'title': title,
                        'poster_url': first_image_url})

    # Not best practice to catch all exceptions but fine for this short script
    except Exception as e:
        print(str(ranking) + "_" + title + ": " + str(e))
        image_errors[str(ranking) + "_" + title] = images

1
2
3
3_The_Third_Man: cannot identify image file <_io.BytesIO object at 0x7eff20831b30>
4
5
6
6_The_Cabinet_of_Dr._Caligari: cannot identify image file <_io.BytesIO object at 0x7eff1f4ec290>
7
7_All_About_Eve: cannot identify image file <_io.BytesIO object at 0x7eff2013fc50>
8
9
10
11
12
13
13_It_Happened_One_Night: cannot identify image file <_io.BytesIO object at 0x7eff20146fb0>
14
14_Singin'_in_the_Rain: cannot identify image file <_io.BytesIO object at 0x7eff1ed6cd70>
15
15_Boyhood_(film): 'image'
16
17
18
18_Psycho_(1960_film): cannot identify image file <_io.BytesIO object at 0x7eff201b2fb0>
19
19_Laura_(1944_film): cannot identify image file <_io.BytesIO object at 0x7eff1ed6cd70>
20
21
22


API error: {'code': 'invalidtitle', 'info': 'Bad title "A_Hard_Day%27s_Night_(film)".', 'docref': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes.'}


22_A_Hard_Day%27s_Night_(film): https://en.wikipedia.org/w/api.php?action=parse&formatversion=2&contentmodel=text&disableeditsection=&disablelimitreport=&disabletoc=&prop=text|iwlinks|parsetree|wikitext|displaytitle|properties&redirects&page=A_Hard_Day%2527s_Night_%28film%29
23
23_La_Grande_Illusion: cannot identify image file <_io.BytesIO object at 0x7eff1fe36cb0>
24
24_North_by_Northwest: cannot identify image file <_io.BytesIO object at 0x7eff1efe3b30>
25
26
27
27_The_Maltese_Falcon_(1941_film): cannot identify image file <_io.BytesIO object at 0x7eff1f35db90>
28
28_Repulsion_(film): cannot identify image file <_io.BytesIO object at 0x7eff20146590>
29
30
31
31_Sunset_Boulevard_(film): cannot identify image file <_io.BytesIO object at 0x7eff1efe3cb0>
32
32_King_Kong_(1933_film): cannot identify image file <_io.BytesIO object at 0x7eff2016e350>
33
34
34_The_Adventures_of_Robin_Hood: cannot identify image file <_io.BytesIO object at 0x7eff1fcd2ad0>
35
35_Rashomon: cannot identify image

API error: {'code': 'invalidtitle', 'info': 'Bad title "Rosemary%27s_Baby_(film)".', 'docref': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes.'}


72_Rosemary%27s_Baby_(film): https://en.wikipedia.org/w/api.php?action=parse&formatversion=2&contentmodel=text&disableeditsection=&disablelimitreport=&disabletoc=&prop=text|iwlinks|parsetree|wikitext|displaytitle|properties&redirects&page=Rosemary%2527s_Baby_%28film%29
73
74
75
76
77
78
79
80
81
81_Rome,_Open_City: cannot identify image file <_io.BytesIO object at 0x7eff20233710>
82
82_Tokyo_Story: cannot identify image file <_io.BytesIO object at 0x7eff444fed10>
83
84
85
86
87
88
88_High_Noon: cannot identify image file <_io.BytesIO object at 0x7eff2013f590>
89
90
90_On_the_Waterfront: cannot identify image file <_io.BytesIO object at 0x7eff202338f0>
91
92
93
94
94_The_Grapes_of_Wrath_(film): cannot identify image file <_io.BytesIO object at 0x7eff2016e350>
95
95_Roman_Holiday: cannot identify image file <_io.BytesIO object at 0x7eff2034fb30>
96
97
98
99
100
100_Battleship_Potemkin: cannot identify image file <_io.BytesIO object at 0x7eff1f28bb30>
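
The repeated `cannot identify image file` messages mean the downloaded bytes were not something the image library could parse. One possible explanation, which the log above does not confirm, is that the server answered with an error page rather than an image. The sketch below shows one way to check. `sniff_image_format` and `describe_download` are hypothetical helpers, the payloads are made up, and only three common image signatures are covered.

```python
# Hypothetical helpers (not part of the original notebook): classify a downloaded
# payload by its leading "magic" bytes before handing it to an image library.
IMAGE_SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_format(body):
    """Return the image format implied by the first bytes, or None."""
    for signature, name in IMAGE_SIGNATURES.items():
        if body.startswith(signature):
            return name
    return None


def describe_download(status_code, content_type, body):
    """Summarise a response so failed poster downloads are easier to diagnose."""
    image_format = sniff_image_format(body)
    if status_code == 200 and image_format is not None:
        return "ok: " + image_format
    return "not an image: status={}, content-type={}, first bytes={!r}".format(
        status_code, content_type, body[:20])


# Made-up payloads standing in for the status code, Content-Type header,
# and body of a real response.
print(describe_download(200, "image/jpeg", b"\xff\xd8\xff\xe0" + b"\x00" * 16))
print(describe_download(403, "text/html", b"<!DOCTYPE html><html>"))
```

In the download loop, the three arguments would come from `r.status_code`, `r.headers.get('Content-Type')`, and `r.content`.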


In [49]:
for key in image_errors.keys():
    print(key)

3_The_Third_Man
6_The_Cabinet_of_Dr._Caligari
7_All_About_Eve
13_It_Happened_One_Night
14_Singin'_in_the_Rain
15_Boyhood_(film)
18_Psycho_(1960_film)
19_Laura_(1944_film)
22_A_Hard_Day%27s_Night_(film)
23_La_Grande_Illusion
24_North_by_Northwest
27_The_Maltese_Falcon_(1941_film)
28_Repulsion_(film)
31_Sunset_Boulevard_(film)
32_King_Kong_(1933_film)
34_The_Adventures_of_Robin_Hood
35_Rashomon
36_Rear_Window
43_Bride_of_Frankenstein
47_The_Philadelphia_Story_(film)
50_Seven_Samurai
51_The_Treasure_of_the_Sierra_Madre_(film)
53_12_Angry_Men_(1957_film)
56_All_Quiet_on_the_Western_Front_(1930_film)
60_A_Streetcar_Named_Desire_(1951_film)
61_The_Night_of_the_Hunter_(film)
65_Frankenstein_(1931_film)
66_Vertigo_(film)
68_Touch_of_Evil
70_The_Conformist_(film)
71_Rebecca_(1940_film)
72_Rosemary%27s_Baby_(film)
81_Rome,_Open_City
82_Tokyo_Story
88_High_Noon
90_On_the_Waterfront
94_The_Grapes_of_Wrath_(film)
95_Roman_Holiday
100_Battleship_Potemkin
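
Two of the failures are `invalidtitle` API errors rather than image problems: both titles contain a percent-encoded apostrophe (`%27`), and the request URLs in the log show `%2527`, which suggests the already-encoded text was encoded a second time. A minimal sketch of decoding such titles with the standard library's `urllib.parse.unquote` follows. Whether `wptools` then resolves the decoded titles is an assumption that was not tested here.

```python
from urllib.parse import unquote

# Two titles copied from the error log above; both contain a percent-encoded apostrophe.
encoded_titles = ["A_Hard_Day%27s_Night_(film)", "Rosemary%27s_Baby_(film)"]

# unquote() turns %27 back into an apostrophe and leaves the other characters alone.
decoded_titles = [unquote(title) for title in encoded_titles]
print(decoded_titles)
# → ["A_Hard_Day's_Night_(film)", "Rosemary's_Baby_(film)"]
```

Decoding the titles before the loop runs would also keep the saved file names free of `%27`.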


In [None]:
"""
# Inspect unidentifiable images and download them individually
manual_urls = {
    '22_A_Hard_Day%27s_Night_(film)': 'https://upload.wikimedia.org/wikipedia/en/4/47/A_Hard_Days_night_movieposter.jpg',
    '53_12_Angry_Men_(1957_film)': 'https://upload.wikimedia.org/wikipedia/en/9/91/12_angry_men.jpg',
    '72_Rosemary%27s_Baby_(film)': 'https://upload.wikimedia.org/wikipedia/en/e/ef/Rosemarys_baby_poster.jpg',
    '93_Harry_Potter_and_the_Deathly_Hallows_–_Part_2': 'https://upload.wikimedia.org/wikipedia/en/d/df/Harry_Potter_and_the_Deathly_Hallows_%E2%80%93_Part_2.jpg',
}
for rank_title, images in image_errors.items():
    # Skip titles without a hand-picked URL instead of reusing a stale or undefined one
    if rank_title not in manual_urls:
        continue
    url = manual_urls[rank_title]
    # Split on the first underscore so rankings of any length are handled
    ranking, title = rank_title.split('_', 1)
    df_list.append({'ranking': int(ranking),
                    'title': title,
                    'poster_url': url})
    r = requests.get(url)
    # Download movie poster image
    i = Image.open(BytesIO(r.content))
    image_file_format = url.split('.')[-1]
    i.save(folder_name + "/" + rank_title + '.' + image_file_format)
"""

In [51]:
# Create DataFrame from list of dictionaries
df = pd.DataFrame(df_list, columns = ['ranking', 'title', 'poster_url'])
df = df.sort_values('ranking').reset_index(drop=True)
df

Unnamed: 0,ranking,title,poster_url
0,1,The_Wizard_of_Oz_(1939_film),https://upload.wikimedia.org/wikipedia/commons...
1,2,Citizen_Kane,https://upload.wikimedia.org/wikipedia/commons...
2,4,Get_Out_(film),https://upload.wikimedia.org/wikipedia/en/a/a3...
3,5,Mad_Max:_Fury_Road,https://upload.wikimedia.org/wikipedia/en/6/6e...
4,8,Inside_Out_(2015_film),https://upload.wikimedia.org/wikipedia/en/0/0a...
...,...,...,...
56,93,Harry_Potter_and_the_Deathly_Hallows_–_Part_2,https://upload.wikimedia.org/wikipedia/en/d/df...
57,96,Man_on_Wire,https://upload.wikimedia.org/wikipedia/en/5/54...
58,97,Jaws_(film),https://upload.wikimedia.org/wikipedia/en/e/eb...
59,98,Toy_Story,https://upload.wikimedia.org/wikipedia/en/1/13...


We've now gathered the data needed for our second goal visualization, the Roger Ebert review word clouds!

Like the last flashforward, we'll skip the assessing and cleaning steps of the data wrangling process and hop straight to the second final product, the word clouds.

![title](wordclouds.jpg)

These word clouds required gathering data from two different sources: files downloaded from the internet (the Roger Ebert review text files) and data accessed through an API (the movie poster URLs). The data also came in two formats: `.txt` and JSON.
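
A word cloud sizes each word by how often it appears, so the review text ultimately gets reduced to word counts. The sketch below illustrates that step with the standard library only. The sample sentence and the tiny stop-word list are made up for illustration and are not taken from the actual review files or from the code that produced the image above.

```python
import re
from collections import Counter

# Made-up sample text standing in for one downloaded review file.
review_text = "A great film. A truly great film about a great shark."

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
stop_words = {"a", "about", "the", "truly"}

# Lowercase the text, pull out the words, and count those that are not stop words.
words = re.findall(r"[a-z']+", review_text.lower())
frequencies = Counter(word for word in words if word not in stop_words)
print(frequencies.most_common(3))
# → [('great', 3), ('film', 2), ('shark', 1)]
```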

Data visualization can be informative, but it can also be art.

## Assess

-

## Clean

-