# Data Wrangling

Data wrangling is the process of gathering your data, assessing its quality and structure, and cleaning it before you analyse it, visualise it, or build predictive models with machine learning.

![](images/data_wrangling.png)

Sometimes it's as simple as downloading a file, spotting a few typos, and fixing them. Other times your data really isn't clean, i.e. it has missing records, duplicates, and inaccurate values. And sometimes the data itself is fine but its structure makes it difficult to work with. Taking care of all this is necessary; otherwise you risk making mistakes, missing insights, and wasting time.

Wrangling means to round up, herd, or take charge of livestock, like horses or sheep. Let's focus in on the sheep example.

A shepherd's main goals are to get their sheep to their pastures to let them graze, guide them to market to shear them, and put them in the barn to sleep. Before any of that, though, the sheep must be rounded up in a nice, organized group. The consequences if they're not? These tasks take longer. If the sheep are scattered, some could run off and get lost. A wolf could even sneak into the flock and feast on a few of them.

An Analogy:
<br>
The same idea of organizing before acting is true for those who are shepherds of data. We need to wrangle our data for good outcomes, otherwise there could be consequences. If we analyze, visualize, or model our data before we wrangle it, our consequences could be making mistakes, missing out on cool insights, and wasting time. So best practices say wrangle. Always.

The development of Python and its libraries has made wrangling easier.

### Gather (Intro)
Gathering data is always the first step in data wrangling. The idea is that before gathering we have no data, and after it we do. This is sometimes called acquiring or collecting your data. It often takes a bit of scurrying around, and you should expect a few surprises. Depending on where you find your data and what format it's in, the steps of the gathering process vary. If the data is in a file, gathering often means downloading the file and importing it into your programming environment, such as a Jupyter notebook. In the workplace, you'll usually collect data from files and databases. You can also scrape data off a website or get it from an API (application programming interface). APIs let us programmatically access data from applications like Twitter and Facebook.

### Assess (Intro)
After gathering data, we need to assess it to determine what's clean and what else we might need to gather if some data is missing. We're not exploring our dataset at this stage. We just want to make sure our data is in a form that makes our analysis easier later on. So what are we assessing? What is dirty data? What is messy data? We're looking for two main things: our data's quality and its tidiness. Low quality data is dirty. Untidy data is messy.

#### Quality
Low quality data is commonly referred to as dirty data. Dirty data has issues with its content.

Imagine you had a table with two columns: Name and Height, like below:

![](images/table1.png)

Common data quality issues include:
* missing data, like the missing height value for Juan.
* invalid data, like a cell having an impossible value, e.g. the negative height value for Kwasi. Having "inches" and "centimetres" in the height entries is technically invalid as well, since the datatype for height becomes a string when those are present. The datatype for height should be integer or float.
* inaccurate data, like Jane actually being 58 inches tall, not 55 inches tall.
* inconsistent data, like using different units for height (inches and centimetres).
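
Some of these issues can be checked with a few lines of pandas. The sketch below is only an illustration: the `people` table and its values are made up to resemble the Name/Height example, and the regular expression assumes every entry looks like "number unit".

```python
import pandas as pd

# Hypothetical stand-in for the Name/Height table; the values are made up.
people = pd.DataFrame({
    'Name': ['Jane', 'Juan', 'Kwasi', 'Mei'],
    'Height': ['55 inches', None, '-150 centimetres', '160 centimetres'],
})

# Missing data: rows where Height is absent.
missing = people[people['Height'].isnull()]

# Split each text entry into a numeric value and a unit.
parts = people['Height'].str.extract(r'^(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>\w+)$')
parts['value'] = parts['value'].astype(float)

# Invalid data: heights that are zero or negative.
invalid = people[parts['value'] <= 0]

# Inconsistent data: more than one unit appears in the same column.
units = parts['unit'].dropna().unique()

print(missing)
print(invalid)
print(units)
```

Inaccurate data, like Jane's height being recorded as 55 instead of 58, cannot be caught this way. It needs a trusted source to compare against.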

Data quality is a perception or an assessment of data's fitness to serve its purpose in a given context. Unfortunately, that’s a bit of an evasive definition but it gets to something important: there are no hard and fast rules for data quality. One dataset may be high enough quality for one application but not for another.

#### Tidiness
_Untidy data_ is commonly referred to as _"messy" data_. Messy data has issues with its structure. Tidy data is a relatively new concept coined by statistician, professor, and all-round data expert Hadley Wickham. I’m going to take a quote from his excellent paper on the subject:

> It is often said that 80% of data analysis is spent on the cleaning and preparing data. And it’s not just a first step, but it must be repeated many times over the course of analysis as new problems come to light or new data is collected. To get a handle on the problem, this paper focuses on a small, but important, aspect of data cleaning that I call data tidying: structuring datasets to facilitate analysis.

A dataset is messy or tidy depending on how rows, columns, and tables are matched up with observations, variables, and types. In tidy data:
* Each variable forms a column.
* Each observation forms a row.
* Each type of observational unit forms a table.

![](images/tidy_data.png)
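
As a rough illustration of the three rules, the sketch below tidies a small made-up table in which one variable (the year) is spread across column headers. It uses pandas' `melt`; the table, its column names, and the choice of `melt` are assumptions for this example.

```python
import pandas as pd

# Hypothetical messy table: the "year" variable lives in the column headers.
messy = pd.DataFrame({
    'name': ['Jane', 'Juan'],
    '2016': [150, 160],
    '2017': [152, 161],
})

# Turn the year columns into rows so each variable forms a column
# and each observation (one person in one year) forms a row.
tidy = messy.melt(id_vars=['name'], var_name='year', value_name='height_cm')
tidy['year'] = tidy['year'].astype(int)

print(tidy)
```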

#### Types of Assessment
* Visual Assessment - Visual assessment is simple. Open your data in your favorite software application (Google Sheets, Excel, a text editor, etc.) and scroll through it, looking for quality and tidiness issues.
* Programmatic Assessment - Programmatic assessment tends to be more efficient than visual assessment. One simple example is pandas' info method, which gives us basic information about a DataFrame, like the number of entries, the number of columns, the type of each column, and whether there are missing values. Other pandas methods and attributes used for programmatic assessment include:
    - head
    - tail
    - shape
    - value_counts
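
Here is a minimal sketch of these calls on a small DataFrame. The five rows are copied from the `bestofrt.tsv` preview shown later in this notebook; building the DataFrame by hand is just a convenience for the example.

```python
import pandas as pd

# First five rows of the Rotten Tomatoes list, typed in by hand for illustration.
df = pd.DataFrame({
    'ranking': [1, 2, 3, 4, 5],
    'critic_score': [99, 100, 100, 99, 97],
    'title': ['The Wizard of Oz (1939)', 'Citizen Kane (1941)',
              'The Third Man (1949)', 'Get Out (2017)',
              'Mad Max: Fury Road (2015)'],
    'number_of_critic_ratings': [110, 75, 77, 282, 370],
})

df.info()                                  # column types and non-null counts
print(df.head(2))                          # first rows
print(df.tail(2))                          # last rows
print(df.shape)                            # (rows, columns); an attribute, not a method
print(df['critic_score'].value_counts())   # how often each score occurs
```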

### Clean (Intro)
Now we've gathered the data and made a few assessments. The word "few" is important: we don't need to identify every issue right from the start, because we can iterate. So now we can start cleaning. Cleaning means acting on the assessments we made to improve quality and tidiness.

#### Improving Quality
Improving quality doesn't mean changing the data to make it say something different. That would be data fraud. Instead, we're talking about correcting data when it's inaccurate, removing data when it's wrong or irrelevant, replacing data, like filling in missing values, or merging data, like combining gathered datasets that were split up.<br>
Consider the animals DataFrame, which has headers for name, body weight (in kilograms), and brain weight (in grams). The last five rows of this DataFrame are displayed below:

![](images/animals_df.png)

Examples of improving quality include:
* Correcting when inaccurate, like correcting the mouse's body weight to 0.023 kg instead of 230 kg.
* Removing when irrelevant, like removing the row with "Apple" since an apple is a fruit and not an animal.
* Replacing when missing, like filling in the missing value for brain weight for Brachiosaurus.
* Combining, like concatenating the missing rows in the more_animals DataFrame displayed below

![](images/more_animals_df.png)

All of this stuff can be done manually, but it's most efficiently done using code that minimises repetition.
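
The four kinds of fix listed above might look like this in pandas. This is a sketch only: the two small DataFrames are placeholders for `animals` and `more_animals`, the column names are assumed, and every number except the mouse's corrected body weight is illustrative.

```python
import pandas as pd

# Placeholder rows standing in for the animals DataFrame; values are illustrative.
animals = pd.DataFrame({
    'name': ['Mouse', 'Apple', 'Brachiosaurus'],
    'body_weight_kg': [230.0, 0.1, 87000.0],
    'brain_weight_g': [0.4, 0.0, None],
})
# Placeholder for the more_animals DataFrame.
more_animals = pd.DataFrame({
    'name': ['Pig'],
    'body_weight_kg': [192.0],
    'brain_weight_g': [180.0],
})

clean = animals.copy()

# Correct: the mouse's body weight should be 0.023 kg, not 230 kg.
clean.loc[clean['name'] == 'Mouse', 'body_weight_kg'] = 0.023

# Remove: an apple is a fruit, not an animal.
clean = clean[clean['name'] != 'Apple'].copy()

# Replace: fill in the missing brain weight
# (154.5 is a placeholder for a value you would look up).
clean.loc[clean['name'] == 'Brachiosaurus', 'brain_weight_g'] = 154.5

# Combine: concatenate the rows that were stored separately.
clean = pd.concat([clean, more_animals], ignore_index=True)

print(clean)
```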

#### Improving Tidiness
Improving tidiness means transforming the dataset so that each variable is a column, each observation is a row, and each type of observational unit is a table. There are special functions in pandas that help us do that. We'll dive deeper into those later in this notebook.

#### Programmatic Data Cleaning Process
The programmatic data cleaning process:
1. Define
2. Code
3. Test

**Defining** means defining a data cleaning plan in writing, where we turn our assessments into defined cleaning tasks. This plan will also serve as an instruction list so others (or us in the future) can look at our work and reproduce it.

**Coding** means translating these definitions to code and executing that code.

**Testing** means testing our dataset, often using code, to make sure our cleaning operations worked.
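
As a small sketch of one pass through define, code, and test, consider a made-up two-row table whose heights are stored as text. The table and the cleaning task are hypothetical.

```python
import pandas as pd

# Hypothetical table with heights stored as text.
df = pd.DataFrame({'name': ['Jane', 'Juan'],
                   'height': ['58 inches', '61 inches']})

# Define: strip the " inches" suffix from the height column
# and convert the column to integers.

# Code: work on a copy so the original stays available for comparison.
df_clean = df.copy()
df_clean['height'] = (df_clean['height']
                      .str.replace(' inches', '', regex=False)
                      .astype(int))

# Test: confirm the cleaning operation did what the definition says.
assert df_clean['height'].tolist() == [58, 61]
assert pd.api.types.is_integer_dtype(df_clean['height'])
```

Writing the definition as a comment next to the code keeps the plan and its implementation together, which helps when someone reproduces the work later.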

We've gathered, assessed, and just cleaned our data. Are we done? No. After cleaning, we always reassess and then iterate on any of the steps if we need to. If we're happy with the quality and tidiness of our data, we can end our wrangling process and move on to storing our clean data, or analysing, visualising, or modeling it. Sometimes we realise we need to gather more data. Sometimes we miss assessments. It's hard to catch everything on the first go, and it's also very common to find new issues as you're fixing the ones you've already identified. Sometimes our cleaning operations don't work as we intended. Once we've gone through each step once, we can revisit any step in the process at any time, even after we've finished wrangling and moved on to analysis, visualisation, or modeling.

![](images/reasses_iterate.png)

### Gathering Data
It's Sunday night. You just had a nice meal, maybe some dessert, and you're on the couch with some popcorn and you want to watch a movie.

![](images/sunday_night.png)

How do you pick what to watch? Being a data driven person, I love to check the ratings and reviews on Rotten Tomatoes, which is a popular movie rating website. Their [top 100 movies of all time list](https://www.rottentomatoes.com/top/bestofrt/) is so useful. Note: the current list may be different from the archived list used here.

As written here, movies with 40 or more critic reviews are eligible for this list. That's critic reviews, so ratings from regular movie watchers like you and me aren't included in these calculations. The list is sorted by an adjusted score, a metric that combines the rating and the number of critic reviews for each movie. The idea is that movies with fewer critic ratings have an easier path to getting a 100% score or close to it.

Let's scroll a bit here. Okay, so E.T., Steven Spielberg's 1982 masterpiece, I've never actually seen despite hearing lots about it. It's currently 11th on this list with a 98% rating and 114 critic reviews. Let's check it out by clicking on it. Okay, so there's that 98% rating. This number is a bit deceiving though. It's actually the Tomatometer score, which is the percentage of approved Tomatometer critics who have given the movie a positive review. But there's also this audience score here, 72%. That metric wasn't included in the top 100 list. That's a pretty massive gap. Maybe this movie isn't that amazing.

Wouldn't it be cool to compare the critics score and the audience score for all of these movies to see which movie really, truly is the best? And to find the worst best movie, if you will? I'm picturing a scatterplot with four quadrants, like this one, but with different axis labels and values.

![](images/Scatter_plot.png)

We'd have audience score on the horizontal axis and critics score on the vertical one. That would put amazing movies with high audience and critics scores in the top right quadrant, then critically underrated movies in the bottom right quadrant with high audience scores and low critics scores, and critically overrated movies in the top left with low audience scores and high critics scores. Also, I don't trust every single reviewer necessarily. One person I do though, the late great [Roger Ebert](https://www.rogerebert.com). For lots of people, Roger Ebert's was the only review they needed because he explained the movie in such a way that they would know whether they would like it or not, his opinion notwithstanding. Wouldn't it be neat if we had a word cloud for each of the movies in that top 100 list with Roger Ebert's review text in that word cloud for each movie? [Andreas Mueller: Word Cloud Generator](https://amueller.github.io/word_cloud/) in Python can help us out. Having a word cloud for each movie would give us a snapshot of what makes each movie great. This word cloud library also has the ability to use stencils, so we could take the movie poster for each movie and surround it with a stencil and then put the word cloud around the poster image. That'd be really cool. So, to create both of these visualizations, the data is in a few different spots and it will require some craftiness to gather it all.

#### Files on hand
So the inspiration for our little project here was this Rotten Tomatoes top 100 movies of all time list. How do we get it? Rotten Tomatoes doesn't give us a file to download, unfortunately. For this lesson, I'm just going to give it to you. Gathering can be complicated, but it can also be extremely easy. Sometimes you are simply given a file. I've pre-gathered this dataset with its four columns (rank, title, rating, and number of reviews) and its 101 rows, one for the headers and one for each of the top 100 movies. I put the data in a flat file called a TSV file, which stands for tab-separated values. Picture this as me e-mailing this file to you, or giving it to you on a thumb drive, or you accessing the file from your company's file storage system. We don't need to gather it programmatically in this case, because we'll assume that this data is internal and can't be downloaded from the Internet.

Flat files contain tabular data in plain text format with one data record per line and each record or line having one or more fields. These fields are separated by delimiters, like commas, tabs, or colons.
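
To make the idea of records, fields, and delimiters concrete, here is a small sketch that parses tab-separated text with Python's built-in `csv` module. The two records mirror the columns shown in the `bestofrt.tsv` preview below; that the real file is laid out exactly this way is an assumption.

```python
import csv
import io

# One header line plus two records, with fields separated by tabs.
tsv_text = (
    "ranking\tcritic_score\ttitle\tnumber_of_critic_ratings\n"
    "1\t99\tThe Wizard of Oz (1939)\t110\n"
    "2\t100\tCitizen Kane (1941)\t75\n"
)

# DictReader uses the first line as field names and splits each
# following line on the delimiter.
reader = csv.DictReader(io.StringIO(tsv_text), delimiter='\t')
records = list(reader)

for record in records:
    print(record['ranking'], record['title'])
```

Every field comes back as a string here. Converting types is left to the reader of the file, which is one reason pandas' `read_csv` is convenient for analysis work.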

**Advantages of flat files** include:
* They're text files and therefore human readable.
* Lightweight.
* Simple to understand.
* Software that can read/write text files is ubiquitous, like text editors.
* Great for small datasets.

**Disadvantages of flat files**, in comparison to relational databases, for example, include:

* Lack of standards.
* Data redundancy.
* Sharing data can be cumbersome.
* Not great for large datasets.

In [1]:
# Importing the TSV file ('bestofrt.tsv') into a pandas DataFrame
# pandas has one main function for parsing flat files and it is read_csv
import pandas as pd

df = pd.read_csv('bestofrt.tsv', sep='\t')
df.head()

| | ranking | critic_score | title | number_of_critic_ratings |
|---|---|---|---|---|
| 0 | 1 | 99 | The Wizard of Oz (1939) | 110 |
| 1 | 2 | 100 | Citizen Kane (1941) | 75 |
| 2 | 3 | 100 | The Third Man (1949) | 77 |
| 3 | 4 | 99 | Get Out (2017) | 282 |
| 4 | 5 | 97 | Mad Max: Fury Road (2015) | 370 |


#### Source: Web Scraping

All right. So we want to grab this audience score here to add to our data set and number of audience reviews to match our number of critic reviews.

![](images/E_T_Rotten_Tomatoes.png)

Unfortunately, Rotten Tomatoes hasn't made this easily accessible for us. One way we can get these pieces of data, though, is web scraping. Web scraping is a fancy way of saying extracting data from websites using code. Behind the scenes, web scraping is actually quite simple. The data that lives on web pages is written in HTML, HyperText Markup Language. It's made up of things called tags, which give the web page structure. Because HTML code is just text, these tags and the content within them can be accessed using parsers, and in Python there is an awesome one called Beautiful Soup. We can download HTML and access it offline, or we can do it in real time over the Internet. For this lesson, we're going to download the HTML files and parse them using Beautiful Soup.

The two main ways to work with HTML files are:
* Saving the HTML file to your computer (using the Requests library, for example) and reading that file into a BeautifulSoup constructor
* Reading the HTML response content directly into a BeautifulSoup constructor (again using the Requests library for example)
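
Both ways end with the same BeautifulSoup call. The sketch below shows them side by side without touching the network: the HTML string is a made-up page, the tag and class names are hypothetical, and Python's built-in `html.parser` is used here instead of `lxml` so that nothing extra needs installing.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Made-up page standing in for a downloaded movie page.
html = ("<html><head><title>E.T. - Example Site</title></head>"
        "<body><span class='score'>72%</span></body></html>")

# Way 1: save the HTML to disk, then read the file into the constructor.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'movie.html')
with open(path, 'w', encoding='utf-8') as f:
    f.write(html)
with open(path, encoding='utf-8') as f:
    soup_from_file = BeautifulSoup(f, 'html.parser')

# Way 2: pass the HTML content straight to the constructor.
# With Requests, this would be the body of the response.
soup_from_content = BeautifulSoup(html, 'html.parser')

# Tags are found by name, optionally narrowed by attributes such as class.
title = soup_from_file.find('title').get_text()
score = soup_from_content.find('span', class_='score').get_text()
print(title, score)
```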

For this case study, you’re going to do neither of these. I've downloaded all of the Rotten Tomatoes HTML files for you and put them in a folder called rt_html.

The rt_html folder contains the Rotten Tomatoes HTML for each of the Top 100 Movies of All Time as the list stood at the time of this case study. I'm giving you these historical files because the ratings will change over time and there will be inconsistencies over time. Also, a web page's HTML is known to change over time. Scraping code can break easily when web redesigns occur, which makes scraping brittle and not recommended for projects with longevity. So just use these HTML files provided to you and pretend like you saved them yourself with one of the methods described above.

In [2]:
from bs4 import BeautifulSoup
import os

In [3]:
# List of dictionaries to build file by file and later convert to a DataFrame
df_list = []
folder = 'rt_html'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        # Note: parsing every file in the folder may take ~15 seconds to run
        soup = BeautifulSoup(file, 'lxml')
        # The <title> text ends with ' - Rotten Tomatoes'; slice that suffix off
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        # Drop the last character of the score text (the percent sign)
        audience_score = soup.find('div', class_='audience-score meter').find('span').contents[0][:-1]
        # The ratings count sits in the second <div> of the audience info block;
        # strip whitespace and thousands separators so it can be cast to int
        num_audience_ratings = soup.find('div', class_='audience-info hidden-xs superPageFontColor')
        num_audience_ratings = num_audience_ratings.find_all('div')[1].contents[2].strip().replace(',', '')

        # Append to list of dictionaries
        df_list.append({'title': title,
                        'audience_score': int(audience_score),
                        'number_of_audience_ratings': int(num_audience_ratings)})
# Convert the list of dictionaries to a DataFrame with a fixed column order
df = pd.DataFrame(df_list, columns=['title', 'audience_score', 'number_of_audience_ratings'])

In [4]:
df

| | title | audience_score | number_of_audience_ratings |
|---|---|---|---|
| 0 | Zootopia (2016) | 92 | 98633 |
| 1 | The Treasure of the Sierra Madre (1948) | 93 | 25627 |
| 2 | All Quiet on the Western Front (1930) | 89 | 17768 |
| 3 | Rear Window (1954) | 95 | 149458 |
| 4 | Selma (2015) | 86 | 60533 |
| 5 | Citizen Kane (1941) | 90 | 157274 |
| 6 | Casablanca (1942) | 95 | 355952 |
| 7 | Inside Out (2015) | 89 | 133558 |
| 8 | Gravity (2013) | 80 | 301261 |
| 9 | Dr. Strangelove Or How I Learned to Stop Worry... | 94 | 208215 |


#### Flash Forward 1

So at this point, you've actually gathered enough data to produce one of the goal visualizations: the scatterplot with quadrants, with audience score on the horizontal axis and critics score on the vertical axis. Since this case study in particular is on gathering data, let's skip past the assessing and cleaning steps of the data wrangling process, which include joining our two DataFrames, and flash forward to the final product, the visualization. So here's what the DataFrame looks like with your newly gathered audience score and number of audience ratings.

![](images/Flash_fwd_1.png)
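
The joining step that was skipped could look roughly like the sketch below, which merges a few rows copied from the two previews above on the shared `title` column. The variable names are made up for the example. It assumes titles match exactly in both tables, which real data may only satisfy after cleaning.

```python
import pandas as pd

# A few rows from the flat file preview.
critics = pd.DataFrame({
    'ranking': [1, 2],
    'critic_score': [99, 100],
    'title': ['The Wizard of Oz (1939)', 'Citizen Kane (1941)'],
    'number_of_critic_ratings': [110, 75],
})
# A few rows from the scraped audience data.
audience = pd.DataFrame({
    'title': ['Citizen Kane (1941)', 'Casablanca (1942)'],
    'audience_score': [90, 95],
    'number_of_audience_ratings': [157274, 355952],
})

# An inner join keeps only the titles present in both tables.
combined = critics.merge(audience, on='title', how='inner')
print(combined)
```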

You can use matplotlib to create a simple visualization like this one, a pretty basic scatterplot, but this is where something interactive like Tableau really shines.

![](images/visualization1.png)

And here's what it looks like in Tableau. Pretty amazing. You've got audience score on the horizontal axis ranging from 70% to 100%, and critics score on the vertical axis ranging from 91% to 100% (well, technically 101, but that's just for visual purposes). Then you've got these reference lines. The vertical one is the median audience score, which is 90%, and the horizontal one is the median critics score, 98%. In the top right corner of the screen, we've even got number of audience ratings and number of critic ratings represented visually, too. Lighter shades of blue mean a smaller number of audience ratings and darker shades mean a larger number. Smaller circles mean a smaller number of critic ratings and larger circles mean a larger number.

In the top right quadrant, we've got universally loved movies: high audience scores, high critics scores. One of them is The Godfather (1972); if we hover over its dot, we can see the 99% critics score and the 98% audience score. In the bottom right quadrant, we've got critically underrated movies: audience scores above the median audience score for this top 100 list and critics scores below the median critics score. One of them is The Dark Knight (2008), with an audience score and critics score of 94%. In the top left, we have critically overrated movies. Audiences didn't like these movies as much as critics did, basically. The most notable point is Pinocchio (1940): a 100% critics score over 45 reviews but only a 72% audience score. And in the bottom left quadrant, we've got movies that didn't have particularly high critic or audience scores relative to the other movies on this list. One notable one is La La Land (2016), with an audience score of 82% and a critics score of 92%, both well below the median scores.

And here's E.T., right on the median critics score line, with a 98% critics score and that low 72% audience score. So this Best of Rotten Tomatoes critic versus audience score visualization required gathering data from two different sources, accessing files on hand and scraping data from web pages, and that data was in two different formats: a flat file (a TSV file specifically) and HTML.

![](images/tableau.png)
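
A bare-bones version of the quadrant layout can be sketched in matplotlib using only the five movies and the two medians quoted above. Treating a score that sits exactly on a median line as top or right is an assumption made for this example, as are the helper name `quadrant` and the output file name.

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# (title, audience score, critics score) as quoted in the walkthrough above.
movies = [
    ('The Godfather (1972)', 98, 99),
    ('The Dark Knight (2008)', 94, 94),
    ('Pinocchio (1940)', 72, 100),
    ('La La Land (2016)', 82, 92),
    ('E.T. (1982)', 72, 98),
]
AUDIENCE_MEDIAN = 90  # medians quoted above for the full top 100 list
CRITICS_MEDIAN = 98


def quadrant(audience, critics):
    """Name the quadrant a movie falls in relative to the two medians."""
    horizontal = 'right' if audience >= AUDIENCE_MEDIAN else 'left'
    vertical = 'top' if critics >= CRITICS_MEDIAN else 'bottom'
    return f'{vertical} {horizontal}'


fig, ax = plt.subplots()
ax.scatter([m[1] for m in movies], [m[2] for m in movies])
for title, audience, critics in movies:
    ax.annotate(title, (audience, critics))
# Reference lines at the medians split the plot into four quadrants.
ax.axvline(AUDIENCE_MEDIAN, linestyle='--')
ax.axhline(CRITICS_MEDIAN, linestyle='--')
ax.set_xlabel('Audience score (%)')
ax.set_ylabel('Critics score (%)')
fig.savefig('critics_vs_audience.png')

for title, audience, critics in movies:
    print(title, '->', quadrant(audience, critics))
```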

#### Source: Downloading Files from the Internet
Okay, so now we can start on the Roger Ebert review word cloud. The first thing you need is the text from each of his reviews for each of the movies on the Rotten Tomatoes Top 100 Movies of All Time list. They live on his website. Lucky for you, I've pre-gathered all of this text in the form of 100 .txt files. I put them on a Udacity-hosted web page, and we're going to download them all programmatically. Yes, you can point and click and download each file manually, but that would take forever, and it's a burden on anyone who wants to reproduce your analysis. If they want to check your work, for example, they can instead download all of the files by executing a few lines of code. So downloading files from the Internet programmatically is best for scalability and reproducibility. In practice, you really only need to know one thing to download files like this: Python's Requests library. Knowing a bit of HTTP, a.k.a. Hypertext Transfer Protocol, will help you understand what's going on under the hood here.

**HTTP (Hypertext Transfer Protocol)**

HTTP, the Hypertext Transfer Protocol, is the language that web browsers (like Chrome or Safari) and web servers (basically computers where the contents of a website are stored) speak to each other. Every time you open a web page, or download a file, or watch a video, it's HTTP that makes it possible.

HTTP is a request/response protocol:
* Your computer, a.k.a. the client, sends a request to a server for some file. For this lesson: "Get me the file **1-the-wizard-of-oz-1939-film.txt**", for example. GET is the name of the HTTP request method (of which there are multiple) used for retrieving data.
* The web server sends back a response. If the request is valid: "Here is the file you asked for:", then followed by the contents of the **1-the-wizard-of-oz-1939-film.txt** file itself.

![](images/http.png)
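
The request/response exchange can be tried out without an Internet connection. The sketch below starts a throwaway web server on the local machine and sends it a GET request. It uses only the standard library (`http.server` and `urllib`) rather than Requests so that it runs with no extra installs. The file name and its contents are placeholders.

```python
import http.server
import os
import tempfile
import threading
import urllib.request
from functools import partial

# Create a folder with one small text file for the server to hand out.
served_folder = tempfile.mkdtemp()
with open(os.path.join(served_folder, 'review.txt'), 'w', encoding='utf-8') as f:
    f.write('placeholder review text')

# Start a local web server; port 0 lets the operating system pick a free port.
handler = partial(http.server.SimpleHTTPRequestHandler, directory=served_folder)
server = http.server.HTTPServer(('127.0.0.1', 0), handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Ignore any system proxy settings, since the server is local.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

# The client sends a GET request for the file and reads the server's response.
url = f'http://127.0.0.1:{server.server_port}/review.txt'
with opener.open(url) as response:
    status = response.status               # 200 means the request succeeded
    body = response.read().decode('utf-8')

server.shutdown()
server.server_close()

print(status, body)
```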

In [5]:
import requests

In [6]:
# Make directory if it doesn't already exist
folder_name = 'ebert_reviews'
os.makedirs(folder_name, exist_ok=True)

In [7]:
ebert_review_urls = ['https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9900_1-the-wizard-of-oz-1939-film/1-the-wizard-of-oz-1939-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9901_2-citizen-kane/2-citizen-kane.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9901_3-the-third-man/3-the-third-man.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_4-get-out-film/4-get-out-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_5-mad-max-fury-road/5-mad-max-fury-road.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_6-the-cabinet-of-dr.-caligari/6-the-cabinet-of-dr.-caligari.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_7-all-about-eve/7-all-about-eve.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_8-inside-out-2015-film/8-inside-out-2015-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_9-the-godfather/9-the-godfather.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_10-metropolis-1927-film/10-metropolis-1927-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_11-e.t.-the-extra-terrestrial/11-e.t.-the-extra-terrestrial.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_12-modern-times-film/12-modern-times-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_14-singin-in-the-rain/14-singin-in-the-rain.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_15-boyhood-film/15-boyhood-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_16-casablanca-film/16-casablanca-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_17-moonlight-2016-film/17-moonlight-2016-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_18-psycho-1960-film/18-psycho-1960-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_19-laura-1944-film/19-laura-1944-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_20-nosferatu/20-nosferatu.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_21-snow-white-and-the-seven-dwarfs-1937-film/21-snow-white-and-the-seven-dwarfs-1937-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_22-a-hard-day27s-night-film/22-a-hard-day27s-night-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_23-la-grande-illusion/23-la-grande-illusion.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_25-the-battle-of-algiers/25-the-battle-of-algiers.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_26-dunkirk-2017-film/26-dunkirk-2017-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_27-the-maltese-falcon-1941-film/27-the-maltese-falcon-1941-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_29-12-years-a-slave-film/29-12-years-a-slave-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_30-gravity-2013-film/30-gravity-2013-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_31-sunset-boulevard-film/31-sunset-boulevard-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_32-king-kong-1933-film/32-king-kong-1933-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_33-spotlight-film/33-spotlight-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_34-the-adventures-of-robin-hood/34-the-adventures-of-robin-hood.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_35-rashomon/35-rashomon.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_36-rear-window/36-rear-window.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_37-selma-film/37-selma-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_38-taxi-driver/38-taxi-driver.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_39-toy-story-3/39-toy-story-3.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_40-argo-2012-film/40-argo-2012-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_41-toy-story-2/41-toy-story-2.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_42-the-big-sick/42-the-big-sick.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_43-bride-of-frankenstein/43-bride-of-frankenstein.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_44-zootopia/44-zootopia.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_45-m-1931-film/45-m-1931-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_46-wonder-woman-2017-film/46-wonder-woman-2017-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_48-alien-film/48-alien-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_49-bicycle-thieves/49-bicycle-thieves.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_50-seven-samurai/50-seven-samurai.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_51-the-treasure-of-the-sierra-madre-film/51-the-treasure-of-the-sierra-madre-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_52-up-2009-film/52-up-2009-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_53-12-angry-men-1957-film/53-12-angry-men-1957-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_54-the-400-blows/54-the-400-blows.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9911_55-logan-film/55-logan-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9911_57-army-of-shadows/57-army-of-shadows.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9912_58-arrival-film/58-arrival-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9912_59-baby-driver/59-baby-driver.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_60-a-streetcar-named-desire-1951-film/60-a-streetcar-named-desire-1951-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_61-the-night-of-the-hunter-film/61-the-night-of-the-hunter-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_62-star-wars-the-force-awakens/62-star-wars-the-force-awakens.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_63-manchester-by-the-sea-film/63-manchester-by-the-sea-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_64-dr.-strangelove/64-dr.-strangelove.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_66-vertigo-film/66-vertigo-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_67-the-dark-knight-film/67-the-dark-knight-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_68-touch-of-evil/68-touch-of-evil.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_69-the-babadook/69-the-babadook.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_72-rosemary27s-baby-film/72-rosemary27s-baby-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9916_73-finding-nemo/73-finding-nemo.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9916_74-brooklyn-film/74-brooklyn-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9917_75-the-wrestler-2008-film/75-the-wrestler-2008-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9917_77-l.a.-confidential-film/77-l.a.-confidential-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_78-gone-with-the-wind-film/78-gone-with-the-wind-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_79-the-good-the-bad-and-the-ugly/79-the-good-the-bad-and-the-ugly.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_80-skyfall/80-skyfall.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_82-tokyo-story/82-tokyo-story.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_83-hell-or-high-water-film/83-hell-or-high-water-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_84-pinocchio-1940-film/84-pinocchio-1940-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_85-the-jungle-book-2016-film/85-the-jungle-book-2016-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991a_86-la-la-land-film/86-la-la-land-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991b_87-star-trek-film/87-star-trek-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991b_89-apocalypse-now/89-apocalypse-now.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_90-on-the-waterfront/90-on-the-waterfront.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_91-the-wages-of-fear/91-the-wages-of-fear.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_92-the-last-picture-show/92-the-last-picture-show.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_93-harry-potter-and-the-deathly-hallows-part-2/93-harry-potter-and-the-deathly-hallows-part-2.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_94-the-grapes-of-wrath-film/94-the-grapes-of-wrath-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_96-man-on-wire/96-man-on-wire.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_97-jaws-film/97-jaws-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_98-toy-story/98-toy-story.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_99-the-godfather-part-ii/99-the-godfather-part-ii.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_100-battleship-potemkin/100-battleship-potemkin.txt']

In [8]:
# Your code here
# Implement the code in the video above in a for loop for all Ebert reviews

for url in ebert_review_urls:
    response = requests.get(url)
    with open(os.path.join(folder_name, url.split('/')[-1]), mode='wb') as file:
        file.write(response.content)

In [9]:
# Check contents of the folder
len(os.listdir(folder_name))

88

There should be 88. 12 movies in the top 100 Rotten Tomatoes list didn't have reviews on Roger Ebert's site.

Gathering data from text files in Python means opening and reading from files. If you're using pandas like we are here, this also means storing the text data you just read in a pandas DataFrame. We have 88 Roger Ebert reviews to open and read, so we'll need a loop to iterate through all of the files in this folder and open and read each one.

There are two main ways of doing this: one uses the os library and the other uses a library called glob. We've been using os's `listdir` in this lesson so far. This is good if you're sure you want to open every file in the folder, as in our case here, where every file in this folder is a Roger Ebert review text file. But let's switch it up and use glob instead. The glob library allows for Unix-style pathname pattern expansion, which is a fancy way of saying that it uses glob patterns to specify sets of filenames. These glob patterns use wildcard characters.

In the [documentation](https://docs.python.org/3/library/glob.html), focus on one thing for now: `glob.glob`. It returns a list of path names that match `pathname`, the string parameter you pass in, which is where the glob pattern goes. How can we use this? We want all file names that end in `.txt`, which in this Ebert reviews folder is all of them. And because `glob.glob` returns a list, we can loop through it directly.
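
To make the difference between the two approaches concrete, here is a minimal sketch that runs against a throwaway folder rather than the real `ebert_reviews` folder. The file names are made up for illustration and are not part of the lesson's data.

```python
import glob
import os
import tempfile

# Build a temporary folder with hypothetical file names so the sketch is self-contained
with tempfile.TemporaryDirectory() as folder:
    for name in ['1-review.txt', '2-review.txt', 'notes.csv']:
        with open(os.path.join(folder, name), 'w', encoding='utf-8') as f:
            f.write('placeholder\n')

    # os.listdir returns every entry in the folder, as bare file names
    all_names = sorted(os.listdir(folder))

    # glob.glob returns only the paths that match the pattern;
    # the * wildcard matches any run of characters
    txt_paths = sorted(glob.glob(os.path.join(folder, '*.txt')))
    txt_names = [os.path.basename(path) for path in txt_paths]

print(all_names)
print(txt_names)
```

Both calls return results in arbitrary order, which is why the sketch sorts them before printing.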

In [10]:
import glob

In [11]:
# List of dictionaries to build file by file and later convert to a DataFrame
df_list = []
for ebert_review in glob.glob('ebert_reviews/*.txt'):
    with open(ebert_review, encoding='utf-8') as file:
        title = file.readline()[:-1]
        # Your code here
        review_url = file.readline()[:-1]
        review_text = file.read()

        # Append to list of dictionaries
        df_list.append({'title': title,
                        'review_url': review_url,
                        'review_text': review_text})
df = pd.DataFrame(df_list, columns = ['title', 'review_url', 'review_text'])

In [12]:
df

| | title | review_url | review_text |
|---|---|---|---|
| 0 | Dunkirk (2017) | http://www.rogerebert.com/reviews/dunkirk-2017 | Lean and ambitious, unsentimental and bombasti... |
| 1 | Army of Shadows (L'Armée des ombres) (1969) | http://www.rogerebert.com/reviews/great-movie-... | Jean-Pierre Melville's "Army of Shadows" is ab... |
| 2 | Alien (1979) | http://www.rogerebert.com/reviews/great-movie-... | At its most fundamental level, "Alien" is a mo... |
| 3 | The Bride of Frankenstein (1935) | http://www.rogerebert.com/reviews/great-movie-... | To a new world of gods and monsters.\n\nSo int... |
| 4 | The 400 Blows (Les Quatre cents coups) (1959) | http://www.rogerebert.com/reviews/great-movie-... | I demand that a film express either the joy of... |
| 5 | Manchester by the Sea (2016) | http://www.rogerebert.com/reviews/manchester-b... | "Manchester by the Sea," about a self-punishin... |
| 6 | Mad Max: Fury Road (2015) | http://www.rogerebert.com/reviews/mad-max-fury... | George Miller’s “Mad Max” films didn’t just ma... |
| 7 | Wonder Woman (2017) | http://www.rogerebert.com/reviews/wonder-woman... | Ever since William Moulton Marston created her... |
| 8 | Bicycle Thieves (Ladri di biciclette) (1949) | http://www.rogerebert.com/reviews/great-movie-... | "The Bicycle Thief" is so well-entrenched as a... |
| 9 | Laura (1944) | http://www.rogerebert.com/reviews/great-movie-... | I've seen Otto Preminger's “Laura” three or fo... |


#### Source: APIs (Application Programming Interfaces)
Okay, so now for the cherry on top. Getting each movie's poster to add to our word cloud. So how are you going to get these images? You could scrape the image URL from the HTML. But a better way to access them is through an API or application programming interface. Since each movie has its poster on its Wikipedia page, you can use Wikipedia's API.

![](images/poster.png)

At their simplest, APIs let you access data from the Internet in a reasonably easy manner. Twitter, Facebook, and Instagram all have APIs. It doesn't have to be a company-led thing though. There are tons of open source APIs. MediaWiki, which provides a popular API for Wikipedia, is open source. This is the API we'll be using.

The goal is to get these movie poster images somehow. The one displayed here is for E.T. Rotten Tomatoes actually does have an API, and it does provide audience scores, which means we could have hit the API instead of scraping them off the Rotten Tomatoes web page earlier. Unfortunately, this API doesn't provide posters or images. Even if it did, it is one that requires you to apply for access via a proposal form before using it. This isn't uncommon. This API wasn't going to be scalable enough for use in our case anyway. When given a choice, the rule of thumb is to always pick the API over scraping. As this answer on Stack Overflow suggests, scraping is brittle and breaks when a web layout redesign changes the underlying HTML.

![](images/stackoverflow1.png)
![](images/stackoverflow2.png)

Imagine for a second that we did have access to the Rotten Tomatoes API though. APIs, shown on the left, and their access libraries, an example of which is on the right, allow programmers to access data in a super simple manner.

![](images/API_vs_AccessLibrary.png)

Here's an example of how simple it would have been to get the audience score data using the Rotten Tomatoes API and a Rotten Tomatoes API access library called rtsimple. As written here, rtsimple is a wrapper written in Python for the Rotten Tomatoes API. First, we import the rtsimple library under the alias rt, then we set our API key. This is what Rotten Tomatoes is hiding behind that proposal form vetting process. We create this movie object with this simple bit of code and the movie ID, which for E.T. is 10489, and then we access the ratings using .ratings and the audience_score in these brackets. If we had an API key and were to run this cell, the audience_score rating of 72 for E.T. would have been printed out below here.

![](images/using_API.png)

So compared to scraping, this is less brittle and, despite Beautiful Soup being pretty simple, this is more intuitive code. Because we can't use the Rotten Tomatoes API for this case study, enter Wikipedia, and specifically, MediaWiki. MediaWiki is the software behind Wikipedia, and its API gives access to all of the Wikipedia data.

#### MediaWiki API

MediaWiki has a great [tutorial](https://www.mediawiki.org/wiki/API:Main_page#A_simple_example) on their website on how their API calls are structured.

It's a nice and simple example and they explain the various moving parts:

* The endpoint (important takeaway: there is nothing special about this URL!)
* The format
* The action
* Action-specific parameters

Done reading? Great! Though they say that is a "simple example," it could definitely be simpler! This is where access libraries, also known as client libraries or even just libraries (as in "Twitter API libraries"), come into play and make our lives easier.
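
The moving parts listed above can be sketched as a plain URL built in Python. This only assembles the request string and does not call the API. The endpoint is the English Wikipedia one that appears in the error logs later in this lesson; the specific parameter values (`query`, `info`, and the page title) are chosen here purely for illustration.

```python
from urllib.parse import urlencode

# The endpoint is an ordinary URL; this one is English Wikipedia's
endpoint = 'https://en.wikipedia.org/w/api.php'

params = {
    'action': 'query',                       # the action: what we want the API to do
    'format': 'json',                        # the format of the response
    'titles': 'E.T._the_Extra-Terrestrial',  # action-specific parameter: which page
    'prop': 'info',                          # action-specific parameter: which properties
}

# Join the endpoint and the encoded parameters into one request URL
request_url = endpoint + '?' + urlencode(params)
print(request_url)
```

An access library such as wptools hides this assembly step and the parsing of the response behind a few method calls.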

#### wptools Library

There are a bunch of different access libraries for MediaWiki to satisfy the variety of programming languages that exist. Here is a [list](https://www.mediawiki.org/wiki/API:Client_code#Python) for Python. This is pretty standard for most APIs. Some libraries are better than others, which again, is standard. For MediaWiki, the most up-to-date and human-readable one in Python is called [wptools](https://github.com/siznax/wptools). The analogous relationship for Twitter is:

* MediaWiki API - wptools
* Twitter API - tweepy

wptools has an even simpler tutorial on their GitHub page using the [Mahatma Gandhi Wikipedia page](https://en.wikipedia.org/wiki/Mahatma_Gandhi) as a working example. To get a page object, the usage is as follows:

`page = wptools.page('Mahatma_Gandhi')`

...where 'Mahatma_Gandhi' is the last bit of the Wikipedia URL for that page (https://en.wikipedia.org/wiki/Mahatma_Gandhi). This `page` object has methods that can get us various pieces of data about that Wikipedia page, including all of the images on the page. To get all of the data:

Simply calling get() on a page will automagically fetch extracts, images, infobox data, wikidata, and other metadata via the MediaWiki, Wikidata, and RESTBase APIs.

`page = wptools.page('Mahatma_Gandhi').get()`

Or if you already have a page object assigned to `page`:

`page.get()`

`page` now has the following attributes, which can be accessed by key through `page.data`:

![](images/mahatma_gandhi.png)

`page.data['image']`, for example, would return a list of data for six images on this specific Wikipedia page.

Getting the page object for the E.T. the Extra-Terrestrial Wikipedia page:

In [13]:
import wptools

In [14]:
# Your code here: get the E.T. page object
# This cell may take a few seconds to run
page = wptools.page('E.T._the_Extra-Terrestrial').get()

en.wikipedia.org (query) E.T._the_Extra-Terrestrial
en.wikipedia.org (parse) 73441
www.wikidata.org (wikidata) Q11621
www.wikidata.org (labels) Q229009|Q1757366|P915|Q2143665|Q65|P494...
www.wikidata.org (labels) P161|P2704|Q139184|P214|Q471839|Q8555|P...
www.wikidata.org (labels) P3203|P3135|Q8877|Q3897561|Q103360|P921...
www.wikidata.org (labels) Q499789|Q488645|P2061|Q1748409|P1237|P2...
en.wikipedia.org (restbase) /page/summary/E.T._the_Extra-Terrestrial
en.wikipedia.org (imageinfo) File:ET logo 3.svg|File:E t the extr...
E.T. the Extra-Terrestrial (en) data
{
  aliases: <list(2)> E.T., ET
  assessments: <dict(4)> United States, Film, Science Fiction, Lib...
  claims: <dict(90)> P1562, P57, P272, P345, P31, P161, P373, P480...
  description: <str(63)> 1982 American science fiction film direct...
  exhtml: <str(569)> <p><i><b>E.T. the Extra-Terrestrial</b></i> i...
  exrest: <str(548)> E.T. the Extra-Terrestrial is a 1982 American...
  extext: <str(1784)> _**E.T. the Extra-Terrestri

In [15]:
# Accessing the image attribute will return the images for this page
page.data['image']

[{'descriptionshorturl': 'https://en.wikipedia.org/w/index.php?curid=7419503',
  'descriptionurl': 'https://en.wikipedia.org/wiki/File:E_t_the_extra_terrestrial_ver3.jpg',
  'file': 'File:E t the extra terrestrial ver3.jpg',
  'height': 394,
  'kind': 'parse-image',
  'metadata': {'Assessments': {'hidden': '',
    'source': 'commons-categories',
    'value': ''},
   'Categories': {'hidden': '',
    'source': 'commons-categories',
    'value': 'All non-free media|E.T. the Extra-Terrestrial|Fair use images of movie posters|Files with no machine-readable author|Files with no machine-readable description|Files with no machine-readable license|Files with no machine-readable source|Noindexed pages|Non-free images for NFUR review|Non-free posters'},
   'CommonsMetadataExtension': {'hidden': '',
    'source': 'extension',
    'value': 1.2},
   'DateTime': {'hidden': '',
    'source': 'mediawiki-metadata',
    'value': '2016-06-04 10:30:46'},
   'NonFree': {'hidden': '', 'source': 'commons-desc

#### JSON File Structure

Most data from APIs comes in JSON or XML format. JSON stands for JavaScript Object Notation and XML stands for Extensible Markup Language. They both have their use cases, but in this case study we will focus on JSON. In many situations, we're limited in what we can represent in tabular data. Sometimes we have data with fields that have multiple entries, like "produced by" here, which has two entries: Kathleen Kennedy and Steven Spielberg.

![](images/json.png)

And sometimes fields have subfields, like release date, which has a date and also a location of release: May 26, 1982 in Cannes, for example, then also June 11, 1982 in the States. Release date actually has multiple entries and multiple fields. Representing this data in tabular form would be weird. We'd need something like this. It's quite unnatural.

![](images/weird_tabular.png)

As displayed here, JSON is especially great for representing and accessing complicated data hierarchies, like this Wikipedia info box.

![](images/json_structure.png)

JSON is built on two key structures. The first is the JSON object, which is a collection of key-value pairs. Objects are surrounded by curly braces, so this whole chunk of code here is one object, as displayed on the right. The left side of the screen is what the JSON code looks like, and the right side is a helpful collapsible interpretation of this code. Back to the key-value pairs: "Directed by" would be one key and "Steven Spielberg" would be the value for that key. After Spielberg we have a comma separating the key-value pairs. In Python, JSON objects are interpreted as dictionaries and you can access them like you would a standard Python dict.

The second key structure is the JSON array, which is an ordered list of values. Here's an array in the value for the "Produced by" key. Square brackets denote an array, which makes sense because in Python, JSON arrays are interpreted as lists and can be accessed as such.

While JSON object keys must be strings (you'll notice that every key on the left side here is a string), the values in both JSON objects and arrays can be any valid JSON data type: a string, number, object, array, Boolean, or null. There's a string, Steven Spielberg again, and there's a number, such as the box office payday in dollars or the running time in minutes. And here's a JSON array as the value for the "Starring" key. Take a look at the release key here. We have the string "Released" as a key, then a JSON array, which contains two JSON objects. When objects and arrays are combined like this, it is called nesting.
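
As a small illustration of objects, arrays, and nesting, the sketch below parses a hand-written JSON string with Python's built-in `json` module. The key names and layout are assumptions modelled on the infobox described above; they are not the exact JSON that any API returns.

```python
import json

# A hand-written JSON document: one object whose values are a string,
# an array of strings, and an array of nested objects
json_text = '''
{
  "Directed by": "Steven Spielberg",
  "Produced by": ["Kathleen Kennedy", "Steven Spielberg"],
  "Released": [
    {"Date": "May 26, 1982", "Location": "Cannes"},
    {"Date": "June 11, 1982", "Location": "United States"}
  ]
}
'''

# json.loads turns JSON objects into dicts and JSON arrays into lists
infobox = json.loads(json_text)

print(type(infobox).__name__)               # the outer JSON object
print(infobox['Directed by'])               # string value, accessed by key
print(infobox['Produced by'][0])            # array element, accessed by index
print(infobox['Released'][1]['Location'])   # nested: object inside an array inside an object
```

Chaining a key lookup and an index in this way is the same pattern used on the wptools `page.data` attributes in the cells that follow.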

Before you download the movie poster images to add to the word cloud as described at the end of the video above, let's first get comfortable with accessing wptools page object attributes. Let's inspect the wptools `page` object for the [E.T. the Extra-Terrestrial Wikipedia page](https://en.wikipedia.org/wiki/E.T._the_Extra-Terrestrial).

In [16]:
page = wptools.page('E.T._the_Extra-Terrestrial').get()

en.wikipedia.org (query) E.T._the_Extra-Terrestrial
en.wikipedia.org (parse) 73441
www.wikidata.org (wikidata) Q11621
www.wikidata.org (labels) Q229009|Q1757366|P915|Q2143665|Q65|P494...
www.wikidata.org (labels) P161|P2704|Q139184|P214|Q471839|Q8555|P...
www.wikidata.org (labels) P3203|P3135|Q8877|Q3897561|Q103360|P921...
www.wikidata.org (labels) Q499789|Q488645|P2061|Q1748409|P1237|P2...
en.wikipedia.org (restbase) /page/summary/E.T._the_Extra-Terrestrial
en.wikipedia.org (imageinfo) File:ET logo 3.svg|File:E t the extr...
E.T. the Extra-Terrestrial (en) data
{
  aliases: <list(2)> E.T., ET
  assessments: <dict(4)> United States, Film, Science Fiction, Lib...
  claims: <dict(90)> P1562, P57, P272, P345, P31, P161, P373, P480...
  description: <str(63)> 1982 American science fiction film direct...
  exhtml: <str(569)> <p><i><b>E.T. the Extra-Terrestrial</b></i> i...
  exrest: <str(548)> E.T. the Extra-Terrestrial is a 1982 American...
  extext: <str(1784)> _**E.T. the Extra-Terrestri

In [17]:
# Access the first image in the images attribute, which is a JSON array.
page.data['image'][0]

{'descriptionshorturl': 'https://en.wikipedia.org/w/index.php?curid=7419503',
 'descriptionurl': 'https://en.wikipedia.org/wiki/File:E_t_the_extra_terrestrial_ver3.jpg',
 'file': 'File:E t the extra terrestrial ver3.jpg',
 'height': 394,
 'kind': 'parse-image',
 'metadata': {'Assessments': {'hidden': '',
   'source': 'commons-categories',
   'value': ''},
  'Categories': {'hidden': '',
   'source': 'commons-categories',
   'value': 'All non-free media|E.T. the Extra-Terrestrial|Fair use images of movie posters|Files with no machine-readable author|Files with no machine-readable description|Files with no machine-readable license|Files with no machine-readable source|Noindexed pages|Non-free images for NFUR review|Non-free posters'},
  'CommonsMetadataExtension': {'hidden': '',
   'source': 'extension',
   'value': 1.2},
  'DateTime': {'hidden': '',
   'source': 'mediawiki-metadata',
   'value': '2016-06-04 10:30:46'},
  'NonFree': {'hidden': '', 'source': 'commons-desc-page', 'value': '

In [18]:
# Access the director key of the infobox attribute, which is a JSON object.
page.data['infobox']['director']

'[[Steven Spielberg]]'

#### Mashup: APIs, Downloading Files Programmatically, and JSON

With APIs, downloading files programmatically from the internet, and JSON under your belt, we now have all of the knowledge to download all of the movie poster images for the Roger Ebert review word clouds. This is our next task.

There are two key things to be aware of before we begin:
1.  **Wikipedia Page Titles**<br>
To access Wikipedia page data via the MediaWiki API with wptools (phew, that was a mouthful), you need each movie's Wikipedia page title, i.e., what comes after the last slash in en.wikipedia.org/wiki/ in the URL. For this lesson, I've compiled all of these titles for each of the movies in the Top 100 for you.

![](images/wikipedia_page_titles.png)

2. **Downloading Image Files**<br>
Downloading images may seem tricky from a reading and writing perspective, in comparison to text files which you can read line by line, for example. But in reality, image files aren't special—they're just binary files. To interact with them, you don't need special software (like Photoshop or something) that "understands" images. You can use regular file opening, reading, and writing techniques, like this:
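```
import requests

# url, folder_name, and filename are assumed to be defined already
r = requests.get(url)
# 'wb' writes the response body to disk as raw bytes
with open(folder_name + '/' + filename, 'wb') as f:
    f.write(r.content)
```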

But this technique can be error-prone. It will work most of the time, but sometimes the file you write to will be damaged, as shown below.

![](images/damaged_file.png)

Though you may still encounter a similar file error, opening the downloaded bytes with PIL's `Image.open(BytesIO(r.content))`, as the download loop further below does, will at least warn us with an error message, at which point we can manually download the problematic images.
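
One way to get such a warning is to open the response bytes with Pillow before saving them, which is what the download loop later in this lesson does. The following is a minimal sketch of that idea: `open_image` is a hypothetical helper name, and the tiny in-memory PNG stands in for the `r.content` you would get from `requests.get(url)`.

```python
from io import BytesIO

from PIL import Image


def open_image(content):
    """Return a PIL image built from raw bytes, or None if the bytes are not a readable image."""
    try:
        return Image.open(BytesIO(content))
    except OSError as e:
        # Pillow raises an OSError subclass when it cannot identify the image
        print('Could not read image: ' + str(e))
        return None


# Build a tiny valid PNG in memory to stand in for a successful download
buffer = BytesIO()
Image.new('RGB', (2, 2)).save(buffer, format='PNG')
good_bytes = buffer.getvalue()

good = open_image(good_bytes)
bad = open_image(b'this is not an image')

print(good.size)
print(bad)
```

The damaged download is reported at the moment it is read instead of being silently written to disk.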

Let's gather the last piece of data for the Roger Ebert review word clouds now: the movie poster image files. Let's also keep each image's URL to add to the master DataFrame later.

Though we're going to use a loop to minimize repetition, here's how the major parts inside that loop will work, in order:
1. We're going to query the MediaWiki API using wptools to get a movie poster URL via each page object's `image` attribute.
2. Using that URL, we'll programmatically download that image into a folder called bestofrt_posters.

The code cells below contain code that:

1. Contains title_list, which is a list of all of the Wikipedia page titles for each movie in the Rotten Tomatoes Top 100 Movies of All Time list. This list is in the same order as the Top 100.
2. Creates an empty list, df_list, to which dictionaries will be appended. This list of dictionaries will eventually be converted to a pandas DataFrame (this is the most efficient way of building a DataFrame row by row).
3. Creates an empty folder, bestofrt_posters, to store the downloaded movie poster image files.
4. Creates an empty dictionary, image_errors, to fill to keep track of movie poster image URLs that don't work.
5. Loops through the Wikipedia page titles in *title_list* and:
    * Stores the ranking of that movie in the Top 100 list based on its position in *title_list*. Ranking is needed so we can join this DataFrame with the master DataFrame later. We can't join on title because the titles of the Rotten Tomatoes pages and the Wikipedia pages differ.
    * Uses `try` and `except` blocks to attempt to query MediaWiki for a movie poster image URL and to attempt to download that image. If the attempt fails and an error is encountered, the offending movie is documented in *image_errors*.
    * Appends a dictionary with *ranking*, *title*, and *poster_url* as the keys and the extracted values for each as the values to *df_list*.
6. Inspects the images that caused errors and downloads the correct image individually (either via another URL in the `image` attribute's list or a URL from Google Images)
7. Creates a DataFrame called *df* by converting *df_list* using the `pd.DataFrame` constructor.

In [19]:
import pandas as pd
import wptools
import os
import requests
from PIL import Image
from io import BytesIO

In [20]:
title_list = [
 'The_Wizard_of_Oz_(1939_film)',
 'Citizen_Kane',
 'The_Third_Man',
 'Get_Out_(film)',
 'Mad_Max:_Fury_Road',
 'The_Cabinet_of_Dr._Caligari',
 'All_About_Eve',
 'Inside_Out_(2015_film)',
 'The_Godfather',
 'Metropolis_(1927_film)',
 'E.T._the_Extra-Terrestrial',
 'Modern_Times_(film)',
 'It_Happened_One_Night',
 "Singin'_in_the_Rain",
 'Boyhood_(film)',
 'Casablanca_(film)',
 'Moonlight_(2016_film)',
 'Psycho_(1960_film)',
 'Laura_(1944_film)',
 'Nosferatu',
 'Snow_White_and_the_Seven_Dwarfs_(1937_film)',
 "A_Hard_Day%27s_Night_(film)",
 'La_Grande_Illusion',
 'North_by_Northwest',
 'The_Battle_of_Algiers',
 'Dunkirk_(2017_film)',
 'The_Maltese_Falcon_(1941_film)',
 'Repulsion_(film)',
 '12_Years_a_Slave_(film)',
 'Gravity_(2013_film)',
 'Sunset_Boulevard_(film)',
 'King_Kong_(1933_film)',
 'Spotlight_(film)',
 'The_Adventures_of_Robin_Hood',
 'Rashomon',
 'Rear_Window',
 'Selma_(film)',
 'Taxi_Driver',
 'Toy_Story_3',
 'Argo_(2012_film)',
 'Toy_Story_2',
 'The_Big_Sick',
 'Bride_of_Frankenstein',
 'Zootopia',
 'M_(1931_film)',
 'Wonder_Woman_(2017_film)',
 'The_Philadelphia_Story_(film)',
 'Alien_(film)',
 'Bicycle_Thieves',
 'Seven_Samurai',
 'The_Treasure_of_the_Sierra_Madre_(film)',
 'Up_(2009_film)',
 '12_Angry_Men_(1957_film)',
 'The_400_Blows',
 'Logan_(film)',
 'All_Quiet_on_the_Western_Front_(1930_film)',
 'Army_of_Shadows',
 'Arrival_(film)',
 'Baby_Driver',
 'A_Streetcar_Named_Desire_(1951_film)',
 'The_Night_of_the_Hunter_(film)',
 'Star_Wars:_The_Force_Awakens',
 'Manchester_by_the_Sea_(film)',
 'Dr._Strangelove',
 'Frankenstein_(1931_film)',
 'Vertigo_(film)',
 'The_Dark_Knight_(film)',
 'Touch_of_Evil',
 'The_Babadook',
 'The_Conformist_(film)',
 'Rebecca_(1940_film)',
 "Rosemary%27s_Baby_(film)",
 'Finding_Nemo',
 'Brooklyn_(film)',
 'The_Wrestler_(2008_film)',
 'The_39_Steps_(1935_film)',
 'L.A._Confidential_(film)',
 'Gone_with_the_Wind_(film)',
 'The_Good,_the_Bad_and_the_Ugly',
 'Skyfall',
 'Rome,_Open_City',
 'Tokyo_Story',
 'Hell_or_High_Water_(film)',
 'Pinocchio_(1940_film)',
 'The_Jungle_Book_(2016_film)',
 'La_La_Land_(film)',
 'Star_Trek_(film)',
 'High_Noon',
 'Apocalypse_Now',
 'On_the_Waterfront',
 'The_Wages_of_Fear',
 'The_Last_Picture_Show',
 'Harry_Potter_and_the_Deathly_Hallows_–_Part_2',
 'The_Grapes_of_Wrath_(film)',
 'Roman_Holiday',
 'Man_on_Wire',
 'Jaws_(film)',
 'Toy_Story',
 'The_Godfather_Part_II',
 'Battleship_Potemkin'
]

In [21]:
folder_name = 'bestofrt_posters'
# Make directory if it doesn't already exist
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

#### Note: the cell below, if correctly implemented, will likely take ~5 minutes to run.

In [22]:
# List of dictionaries to build and convert to a DataFrame later
df_list = []
image_errors = {}
# enumerate gives each title its ranking without a repeated list lookup
for ranking, title in enumerate(title_list, start=1):
    # Reset so a failure before the lookup doesn't record the previous movie's images
    images = None
    try:
        # This cell is slow so print ranking to gauge time remaining
        print(ranking)
        page = wptools.page(title, silent=True)
        # Your code here (three lines)
        images = page.get().data['image']
        # First image is usually the poster
        first_image_url = images[0]['url']
        r = requests.get(first_image_url)
        # Download movie poster image
        i = Image.open(BytesIO(r.content))
        image_file_format = first_image_url.split('.')[-1]
        i.save(folder_name + "/" + str(ranking) + "_" + title + '.' + image_file_format)
        # Append to list of dictionaries
        df_list.append({'ranking': int(ranking),
                        'title': title,
                        'poster_url': first_image_url})

    # Not best practice to catch all exceptions but fine for this short script
    except Exception as e:
        print(str(ranking) + "_" + title + ": " + str(e))
        image_errors[str(ranking) + "_" + title] = images

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22


API error: {'code': 'invalidtitle', 'info': 'Bad title "A_Hard_Day%27s_Night_(film)".', 'docref': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce&gt; for notice of API deprecations and breaking changes.'}


22_A_Hard_Day%27s_Night_(film): https://en.wikipedia.org/w/api.php?action=parse&formatversion=2&contentmodel=text&disableeditsection=&disablelimitreport=&disabletoc=&prop=text|iwlinks|parsetree|wikitext|displaytitle|properties&redirects&page=A_Hard_Day%2527s_Night_%28film%29
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
53_12_Angry_Men_(1957_film): cannot identify image file <_io.BytesIO object at 0x10fd974c0>
54
55
55_Logan_(film): 'image'
56
57
58
59
60
61
62
63
64
64_Dr._Strangelove: cannot identify image file <_io.BytesIO object at 0x10f8cfca8>
65
66
67
68
69
70
71
72


API error: {'code': 'invalidtitle', 'info': 'Bad title "Rosemary%27s_Baby_(film)".', 'docref': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce&gt; for notice of API deprecations and breaking changes.'}


72_Rosemary%27s_Baby_(film): https://en.wikipedia.org/w/api.php?action=parse&formatversion=2&contentmodel=text&disableeditsection=&disablelimitreport=&disabletoc=&prop=text|iwlinks|parsetree|wikitext|displaytitle|properties&redirects&page=Rosemary%2527s_Baby_%28film%29
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100


In [23]:
for key in image_errors.keys():
    print(key)

22_A_Hard_Day%27s_Night_(film)
53_12_Angry_Men_(1957_film)
55_Logan_(film)
64_Dr._Strangelove
72_Rosemary%27s_Baby_(film)
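
The two "Bad title" errors in the log above involve titles that contain `%27`, which is the percent-encoded form of an apostrophe. One possible fix, which is not part of the original notebook, is to decode the titles with the standard library before handing them to wptools. Whether wptools then finds the pages is an assumption that this sketch does not verify.

```python
from urllib.parse import unquote

# Two of the titles that failed, plus one that needs no decoding
encoded_titles = ["A_Hard_Day%27s_Night_(film)",
                  "Rosemary%27s_Baby_(film)",
                  "Citizen_Kane"]

# unquote replaces percent-escapes such as %27 with the characters they stand for
decoded_titles = [unquote(title) for title in encoded_titles]
print(decoded_titles)
```

Titles without percent-escapes pass through unchanged, so the same call could be applied to every entry in the list.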


In [24]:
# Inspect unidentifiable images and download them individually
# Manually found poster URLs, keyed by the rank_title used in image_errors
manual_urls = {
    '22_A_Hard_Day%27s_Night_(film)': 'https://upload.wikimedia.org/wikipedia/en/4/47/A_Hard_Days_night_movieposter.jpg',
    '53_12_Angry_Men_(1957_film)': 'https://upload.wikimedia.org/wikipedia/en/9/91/12_angry_men.jpg',
    '72_Rosemary%27s_Baby_(film)': 'https://upload.wikimedia.org/wikipedia/en/e/ef/Rosemarys_baby_poster.jpg',
    '93_Harry_Potter_and_the_Deathly_Hallows_–_Part_2': 'https://upload.wikimedia.org/wikipedia/en/d/df/Harry_Potter_and_the_Deathly_Hallows_%E2%80%93_Part_2.jpg',
}
for rank_title in image_errors:
    url = manual_urls.get(rank_title)
    if url is None:
        # No replacement URL recorded; skip rather than reuse another movie's URL
        print('No manual URL for ' + rank_title)
        continue
    # Split on the first underscore only, so any number of rank digits works
    ranking, title = rank_title.split('_', 1)
    df_list.append({'ranking': int(ranking),
                    'title': title,
                    'poster_url': url})
    r = requests.get(url)
    # Download movie poster image
    i = Image.open(BytesIO(r.content))
    image_file_format = url.split('.')[-1]
    i.save(folder_name + "/" + rank_title + '.' + image_file_format)

In [25]:
# Create DataFrame from list of dictionaries
df = pd.DataFrame(df_list, columns = ['ranking', 'title', 'poster_url'])
df = df.sort_values('ranking').reset_index(drop=True)
df

| | ranking | title | poster_url |
|---|---|---|---|
| 0 | 1 | The_Wizard_of_Oz_(1939_film) | https://upload.wikimedia.org/wikipedia/commons... |
| 1 | 2 | Citizen_Kane | https://upload.wikimedia.org/wikipedia/en/c/ce... |
| 2 | 3 | The_Third_Man | https://upload.wikimedia.org/wikipedia/en/2/21... |
| 3 | 4 | Get_Out_(film) | https://upload.wikimedia.org/wikipedia/en/a/a3... |
| 4 | 5 | Mad_Max:_Fury_Road | https://upload.wikimedia.org/wikipedia/en/6/6e... |
| 5 | 6 | The_Cabinet_of_Dr._Caligari | https://upload.wikimedia.org/wikipedia/commons... |
| 6 | 7 | All_About_Eve | https://upload.wikimedia.org/wikipedia/en/2/22... |
| 7 | 8 | Inside_Out_(2015_film) | https://upload.wikimedia.org/wikipedia/en/0/0a... |
| 8 | 9 | The_Godfather | https://upload.wikimedia.org/wikipedia/en/1/1c... |
| 9 | 10 | Metropolis_(1927_film) | https://upload.wikimedia.org/wikipedia/en/0/06... |


#### Flash Forward 2

So now you've gathered the data to produce our second goal visualization, the Roger Ebert Review Wordcloud. Next, we'd have to assess and clean the data, the second and third steps in the data wrangling process. This case study is specifically about gathering, though, so both of those steps were taken care of behind the scenes. The result is this master DataFrame.

![](images/master_dataframe.png)

<br>
Let's hop straight to the second final product, the word clouds themselves. Here's a sample of them: on the right, E.T.; on the left, Inside Out.
<br><br>

![](images/flash_forward2.png)

<br>
E.T. was ranked 11th and Inside Out was ranked 8th on that top 100 list. I think this is so cool and I hope you do too. Each word cloud is a unique snapshot of the movie's plot and what makes it great, according to arguably the best film critic that ever lived. The movie posters pasted within the stencil of each word cloud really do add some flash. Data visualization can be informative, but it can also be art. These word clouds required gathering data from two different sources: downloading files from the internet (the Roger Ebert review text files) and accessing data from an API (the movie poster URLs). This data came in two formats, .txt and JSON.

#### Storing Data

Storing is usually done after cleaning, but not always, which is why it isn't considered a core part of the data wrangling process. Sometimes you just analyze and visualize and leave it at that, without saving your new data.

Given the size of this dataset and that it likely won't be shared often, saving to a flat file like a CSV is probably the best solution. With pandas, saving your gathered data to a CSV file is easy. The `to_csv` [DataFrame method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html) is all you need, and the only parameter required to save a file on your computer is the file path where you want to save it. Specifying `index=False` is often necessary too, if you don't want the DataFrame index showing up as a column in your stored dataset. If you had a DataFrame, *df*, and wanted to save it to a file named *dataset.csv* with no index column:

```
df.to_csv('dataset.csv', index=False)
```
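
To see why `index=False` matters, here is a minimal sketch that writes a tiny DataFrame to CSV text both ways and reads it back. The two-row DataFrame and the in-memory `StringIO` buffers are assumptions made for illustration; they stand in for the lesson's real data and for files on disk:

```python
from io import StringIO

import pandas as pd

# Tiny illustrative DataFrame (not the lesson's master dataset)
df = pd.DataFrame({'ranking': [1, 2],
                   'title': ['The_Wizard_of_Oz_(1939_film)', 'Citizen_Kane']})

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_with_index = df.to_csv()
csv_without_index = df.to_csv(index=False)

# Read both back, as you would with a stored file
with_index_cols = list(pd.read_csv(StringIO(csv_with_index)).columns)
without_index_cols = list(pd.read_csv(StringIO(csv_without_index)).columns)

print(with_index_cols)     # includes an extra 'Unnamed: 0' column
print(without_index_cols)  # only the original columns
```

The saved index has no header, so `read_csv` names it `Unnamed: 0` on the way back in. Passing `index=False` when saving avoids that extra column.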


Imagine this notebook contains all of the gathering code from this entire lesson, plus the assessing and cleaning code done behind the scenes, and that the final product is a merged master DataFrame called `df`.

In [26]:
df = pd.read_csv('gathered_assessed_cleaned.csv')

In [27]:
# Your code here
# Save the master DataFrame to a file called 'bestofrt_master.csv'
# Hint: watch out for the index!
df.to_csv('bestofrt_master.csv', index=False)