# Web Scraping Part 1 — Workbook Solutions

*Inspired by web scraping lessons from [Lauren Klein](https://github.com/laurenfklein/emory-qtm340/blob/master/notebooks/class4-web-scraping-complete.ipynb) and [Allison Parrish](https://github.com/aparrish/dmep-python-intro/blob/master/scraping-html.ipynb)*

*Don't forget to rename this notebook if you want to save changes!*

In this lesson, we'll introduce how to "scrape" data from the internet with the Python libraries `requests` and `BeautifulSoup`.

😺 Kittens toy website: http://static.decontextualize.com/kittens.html

## Responses and Requests

To programmatically access the text data attached to every URL, we can use a Python library called [requests](https://requests.readthedocs.io/en/master/).

When you type a URL into your browser's address bar, you're sending an HTTP **request** for a web page, and the server that stores that web page sends back a **response**: the web page data that your browser will render.

## Import Requests 

In [1]:
import requests

## Get HTML Data

With the `.get()` method, we can request to "get" web page data for a specific URL, which we will store in a variable called `response`.

In [6]:
response = requests.get("https://www.dailyscript.com/scripts/Juno.txt")

## HTTP Status Code

If we check out `response`, it will simply tell us its [HTTP response code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status), which indicates whether the request was successful or not.

"200" is a successful response, while "404" is a common "Page Not Found" error.

In [7]:
response

<Response [200]>

Let's see what happens if we change the title of the movie from *Juno* to *Guno* in the URL...

In [8]:
bad_response = requests.get("http://www.scifiscripts.com/scripts/Guno.txt")

In [9]:
bad_response

<Response [404]>
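
The number in brackets is the response's status code, which `requests` also exposes as the integer `response.status_code`. As a minimal sketch of what those integers mean, the block below uses Python's standard-library `http.HTTPStatus` to look up the standard phrase for a code. `describe_status` is a hypothetical helper name chosen for illustration; it is not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short, human-readable description of an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase} ({outcome})"

print(describe_status(200))
print(describe_status(404))
```

With a real response, the same helper could be called as `describe_status(response.status_code)`.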

### Extract Text From Web Page

To actually get at the text data in the response, we need to use `.text`, which we will save in a variable called `html_string`. For most web pages, this text is formatted in the HTML markup language, which we will talk more about in the BeautifulSoup section below; this particular screenplay happens to be a plain text file.

In [10]:
html_string = response.text
print(html_string)
















                                          "JUNO"

                                      By Diablo Cody








               REVISED PINK                         -- FEBRUARY 06, 2007 
               FULL BLUE                            --  JANUARY 22, 2007 
               PRODUCTION WHITE                     --  JANUARY 12, 2007




                                                    Production Office: 
                                                    Dancing Elk Pictures Ltd. 
                                                    214-2400 Boundary Road 
                                                    Burnaby, BC V5M 3Z3

                

               EXT. CENTENNIAL LANE - DUSK

               JUNO MacGUFF stands on a placid street in a nondescript 
               subdivision, facing the curb. It's FALL. Juno is sixteen 
               years old, an artfully bedraggled burnout kid. She winces 
               and shields her eyes from the glare of the sun. The o

## Extract Text From Multiple Web Pages

In [11]:
urls = ['https://www.dailyscript.com/scripts/Juno.txt',
        'https://www.dailyscript.com/scripts/Titanic.txt',
        'http://www.scifiscripts.com/scripts/Ghostbusters.txt']

We can use a `for` loop to iterate through the list of screenplay URLs, `urls`, and print out the first 500 characters of each screenplay.

In [12]:
for url in urls:
    response = requests.get(url)
    html_string = response.text
    print(html_string[:500])
















                                          "JUNO"

                                      By Diablo Cody








               REVISED PINK                         -- FEBRUARY 06, 2007 
               FULL BLUE                            --  JANUARY 22, 2007 
               PRODUCTION WHITE                     --  JANUARY 12, 2007




                                                    Production Office: 
                                                    Dancing Elk Pictures Ltd.
                                        "TITANIC"

                                      Screenplay by

                                      James Cameron

                

               BLACKNESS

               Then two faint lights appear, close together... growing 
               brighter. They resolve into two DEEP SUBMERSIBLES, free-
               falling toward us like express elevators.

               One is ahead of the other, and passes close enough to FILL 
               FRAME,
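
When looping over several URLs, it can help to check each status code before using the text, so that a "Page Not Found" page is not mistaken for a screenplay. The sketch below is one possible way to do that. The names `preview_pages` and `fake_fetch` are hypothetical, and the `example.com` URLs are made up; `fake_fetch` stands in for `requests.get` so that the block runs without a network connection.

```python
from types import SimpleNamespace

def preview_pages(urls, fetch, length=500):
    """Fetch each URL with `fetch` and keep a short preview of every successful response."""
    previews = {}
    for url in urls:
        response = fetch(url)
        if response.status_code != 200:
            # Skip anything that is not a plain success, such as a 404
            print(f"Skipping {url} (status {response.status_code})")
            continue
        previews[url] = response.text[:length]
    return previews

def fake_fetch(url):
    """Offline stand-in for requests.get: returns an object with .status_code and .text."""
    if url.endswith("Guno.txt"):
        return SimpleNamespace(status_code=404, text="")
    return SimpleNamespace(status_code=200, text="FADE IN: " + url)

sample_urls = ["https://example.com/Juno.txt", "https://example.com/Guno.txt"]
previews = preview_pages(sample_urls, fake_fetch, length=20)
print(previews)
```

To run this against real pages, the call would presumably look like `preview_pages(urls, requests.get)`.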

## HTML & BeautifulSoup

Not all web pages will be as easy to scrape as these screenplay files, however. If web pages are messy and complicated, how can we extract just the things that we want?

Well, we can use a Python library called BeautifulSoup, but first we need to learn a little about how web pages are written.

Poet and professor Allison Parrish made a toy website called "Kittens and the TV Shows They Love." It can be found at the following URL: http://static.decontextualize.com/kittens.html

If we use our requests library on this Kittens TV website, this is what we get:

In [13]:
response = requests.get("http://static.decontextualize.com/kittens.html")
html_string = response.text
print(html_string)

<!doctype html>
<html>
	<head>
		<title>Kittens!</title>
		<style type="text/css">
			span.lastcheckup { font-family: "Courier", fixed; font-size: 11px; }
		</style>
	</head>
	<body>
		<h1>Kittens and the TV Shows They Love</h1>
		<div class="kitten">
			<h2>Fluffy</h2>
			<div><img src="http://placekitten.com/120/120"></div>
			<ul class="tvshows">
				<li>
					<a href="http://www.imdb.com/title/tt0106145/">Deep Space Nine</a>
				</li>
				<li>
					<a href="http://www.imdb.com/title/tt0088576/">Mr. Belvedere</a>
				</li>
			</ul>
			Last check-up: <span class="lastcheckup">2014-01-17</span>
		</div>
		<div class="kitten">
			<h2>Monsieur Whiskeurs</h2>
			<div><img src="http://placekitten.com/110/110"></div>
			<ul class="tvshows">
				<li>
					<a href="http://www.imdb.com/title/tt0106179/">The X-Files</a>
				</li>
				<li>
					<a href="http://www.imdb.com/title/tt0098800/">Fresh Prince</a>
				</li>
			</ul>
			Last check-up: <span class="lastcheckup">2013-11-02</span>
		</div

This is an HTML document. HTML stands for HyperText Markup Language. It is the standard language for writing web page documents. The most important thing you need to know about HTML is that the language uses HTML "tags" to represent different elements, such as a main header `<h1>`. 

| HTML Tag | Explanation |
|----------|-------------|
| `<!DOCTYPE>` | Defines document type |
| `<html>` | Defines HTML document |
| `<head>` | Main information about document |
| `<title>` | Title for document |
| `<body>` | Document body |
| `<h1>` to `<h6>` | Headings |
| `<p>` | Paragraph |
| `<br>` | Line break |
| `<!-- comment here -->` | Comment |
| `<img>` | Image |
| `<a>` | Hyperlink |
| `<ul>` | Unordered list |
| `<ol>` | Ordered list |
| `<li>` | List item |
| `<style>` | Style information for a document |
| `<div>` | Section in a document |
| `<span>` | Inline section in a document |
| `class=` | Attribute naming a certain kind of element; can apply to multiple elements |
| `id=` | Attribute giving a unique identifier for an element |

HTML tags often, but not always, require a "closing" tag. For example, the main header "Kittens and the TV Shows They Love" will be surrounded by `<h1>` (opening tag) and `</h1>` (closing tag) on either side: `<h1>Kittens and the TV Shows They Love</h1>`
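
To see opening and closing tags in action, here is a small sketch that uses Python's standard-library `html.parser.HTMLParser` (not BeautifulSoup) to record each tag as it is encountered. `TagLogger` is a hypothetical class name, and the HTML snippet is a shortened piece of the kittens page. Notice that `<h1>` has a matching closing tag while `<img>` does not.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record every opening tag, closing tag, and piece of text in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.events.append(f"</{tag}>")

    def handle_data(self, data):
        # Ignore whitespace-only text between tags
        if data.strip():
            self.events.append(data.strip())

logger = TagLogger()
logger.feed('<h1>Kittens and the TV Shows They Love</h1><img src="http://placekitten.com/120/120">')
logger.close()
print(logger.events)
```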

## Extract HTML Elements

To use BeautifulSoup, we need to import it.

In [14]:
from bs4 import BeautifulSoup

To make a BeautifulSoup document, we call `BeautifulSoup()` with two arguments: the `html_string` from our HTTP request and [the kind of parser](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use) that we want to use, which will always be `"html.parser"` for our purposes.

In [15]:
response = requests.get("http://static.decontextualize.com/kittens.html")
html_string = response.text

document = BeautifulSoup(html_string, "html.parser")

We can use the `.find()` method to find and extract certain elements, such as a main header.

In [16]:
document.find("h1")

<h1>Kittens and the TV Shows They Love</h1>

If we want only the text contained between those tags, we can use `.text`.

In [17]:
document.find("h1").text

'Kittens and the TV Shows They Love'

Find the HTML element that contains an image.

In [18]:
document.find("img")

<img src="http://placekitten.com/120/120"/>

You can also extract multiple HTML elements at a time with `.find_all()`:

In [19]:
document.find_all("img")

[<img src="http://placekitten.com/120/120"/>,
 <img src="http://placekitten.com/110/110"/>]

You can extract elements that are only of a certain `class`:

In [20]:
document.find_all("div", attrs={"class": "kitten"})

[<div class="kitten">
 <h2>Fluffy</h2>
 <div><img src="http://placekitten.com/120/120"/></div>
 <ul class="tvshows">
 <li>
 <a href="http://www.imdb.com/title/tt0106145/">Deep Space Nine</a>
 </li>
 <li>
 <a href="http://www.imdb.com/title/tt0088576/">Mr. Belvedere</a>
 </li>
 </ul>
 			Last check-up: <span class="lastcheckup">2014-01-17</span>
 </div>,
 <div class="kitten">
 <h2>Monsieur Whiskeurs</h2>
 <div><img src="http://placekitten.com/110/110"/></div>
 <ul class="tvshows">
 <li>
 <a href="http://www.imdb.com/title/tt0106179/">The X-Files</a>
 </li>
 <li>
 <a href="http://www.imdb.com/title/tt0098800/">Fresh Prince</a>
 </li>
 </ul>
 			Last check-up: <span class="lastcheckup">2013-11-02</span>
 </div>]
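
Each element returned by `.find_all()` can itself be searched with `.find()` and `.find_all()`, which is one reason to pull out elements by `class` first. The sketch below assumes a trimmed-down copy of the kittens page, pasted in as a string so that it runs without a network request. The variable names `soup` and `shows_by_kitten` are chosen here for illustration.

```python
from bs4 import BeautifulSoup

# A trimmed-down copy of the kittens page (images and check-up dates omitted)
html = """
<div class="kitten">
  <h2>Fluffy</h2>
  <ul class="tvshows">
    <li><a href="http://www.imdb.com/title/tt0106145/">Deep Space Nine</a></li>
    <li><a href="http://www.imdb.com/title/tt0088576/">Mr. Belvedere</a></li>
  </ul>
</div>
<div class="kitten">
  <h2>Monsieur Whiskeurs</h2>
  <ul class="tvshows">
    <li><a href="http://www.imdb.com/title/tt0106179/">The X-Files</a></li>
    <li><a href="http://www.imdb.com/title/tt0098800/">Fresh Prince</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

shows_by_kitten = {}
for kitten in soup.find_all("div", attrs={"class": "kitten"}):
    # Search only inside this kitten's <div>, not the whole document
    name = kitten.find("h2").text
    shows = []
    for link in kitten.find_all("a"):
        shows.append(link.text)
    shows_by_kitten[name] = shows

print(shows_by_kitten)
```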

### Your Turn!
Find the name of **one** of the kittens and then return the text of the name (either "Fluffy" or "Monsieur Whiskeurs").

To do so, open the web page (http://static.decontextualize.com/kittens.html) and then use your Developer Tools to find the HTML tag associated with the kitten names.

In [21]:
document.find('h2').text

'Fluffy'

### Extract Multiple HTML Elements

Let's try to extract the text from all the `<h2>` elements:

In [22]:
document.find_all("h2").text

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

Uh oh. That didn't work! In order to extract text data from multiple HTML elements, we need a `for` loop and some list-building.

In [23]:
all_h2_headers = document.find_all("h2")

In [24]:
all_h2_headers

[<h2>Fluffy</h2>, <h2>Monsieur Whiskeurs</h2>]

First we will make an empty list called `h2_headers`.

We will loop through the headers, grab the `.text`, put it into a variable called `header_contents`, then `.append()` it to our `h2_headers` list.

In [25]:
h2_headers = []
for header in all_h2_headers:
    header_contents = header.text
    h2_headers.append(header_contents)

In [26]:
h2_headers

['Fluffy', 'Monsieur Whiskeurs']

🚨 Heads up! New Python concept!🚨

You can also use something called a [list comprehension](https://www.w3schools.com/python/python_lists_comprehension.asp) to make a new Python list in a single line of code. 

In [27]:
h2_headers = [header.text for header in all_h2_headers]
h2_headers

['Fluffy', 'Monsieur Whiskeurs']
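
To make the connection between the two approaches concrete, here is a small sketch with plain strings instead of BeautifulSoup elements. The variable names are made up for illustration. It shows a `for` loop and a list comprehension that build the same list, plus a comprehension with an optional `if` condition that keeps only some items.

```python
names = ["Fluffy", "Monsieur Whiskeurs"]

# The for-loop version: start with an empty list and append to it
upper_loop = []
for name in names:
    upper_loop.append(name.upper())

# The list comprehension version: same result in one line
upper_comprehension = [name.upper() for name in names]

# A comprehension can also filter with an `if` at the end
long_names = [name for name in names if len(name) > 6]

print(upper_loop)
print(upper_comprehension)
print(long_names)
```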

## Your Turn!

Ok so now we've learned a little bit about how to use BeautifulSoup to parse HTML documents. So how would we apply what we've learned to extract Missy Elliott lyrics?

In [28]:
response = requests.get("https://genius.com/Missy-elliott-work-it-lyrics")
html_str = response.text

document = BeautifulSoup(html_str, "html.parser")

In [29]:
document


<!DOCTYPE html>

<html class="snarly apple_music_player--enabled bagon_song_page--enabled song_stories_public_launch--enabled react_forums--disabled" lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml">
<head>
<base href="//genius.com/" target="_top"/>
<script type="text/javascript">
//<![CDATA[

  var _sf_startpt=(new Date()).getTime();
  if (window.performance && performance.mark) {
    window.performance.mark('parse_start');
  }

//]]>
</script>
<title>Missy Elliott – Work It Lyrics | Genius Lyrics</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<meta content="app-id=709482991" name="apple-itunes-app"/>
<link href="https://assets.genius.com/images/apple-touch-icon.png?1617291191" rel="apple-touch-icon"/>
<link href="https://assets.genius.com/images/apple-touch-icon.png?1617291191" rel="apple-touch-icon"/>
<!-- Mobile IE allows us to

https://genius.com/Missy-elliott-work-it-lyrics

## Extract just the song lyrics for Missy Elliott's "Work It"

In [31]:
document.find('p').text

"[Intro]\nDJ, please pick up your phone, I'm on the request line\nThis is a Missy Elliott one-time exclusive, come on\n\n[Chorus]\nIs it worth it? Let me work it\nI put my thing down, flip it and reverse it\nTi esrever dna ti pilf, nwod gniht ym tup\nTi esrever dna ti pilf, nwod gniht ym tup\nIf you got a big *elephant trumpet*, let me search ya\nAnd find out how hard I gotta work ya\nTi esrever dna ti pilf, nwod gniht ym tup\nTi esrever dna ti pilf, nwod gniht ym tup\nC'mon\n\n[Verse 1]\nI'd like to get to know ya so I could show ya\nPut the pussy on ya like I told ya\nGive me all your numbers so I can phone ya\nYour girl acting stank, then call me over\nNot on the bed, lay me on your sofa\nCall before you come, I need to shave my chocha\nYou do or you don't or you will or won't ya?\nGo downtown and eat it like a vulture\nSee my hips and my tips, don't ya?\nSee my ass and my lips, don't ya?\nLost a few pounds and my waist for ya\nThis the kinda beat that go ra-ta-ta\nRa-ta-ta-ta-ta-ta

## Extract just the song title for Missy Elliott's "Work It"

In [33]:
document.find('h1').text

'Work It'

## Extract some of the metadata (producer, label, release date) for Missy Elliott's "Work It"

In [36]:
document.find_all('span', attrs={'class': 'metadata_unit-label'})

[<span class="metadata_unit-label">Produced by</span>,
 <span class="metadata_unit-label">Album</span>,
 <span class="metadata_unit-label">Written By</span>,
 <span class="metadata_unit-label">Recording Engineer</span>,
 <span class="metadata_unit-label">Assistant Recording Engineer</span>,
 <span class="metadata_unit-label">Mixing Engineer</span>,
 <span class="metadata_unit-label">Assistant Mixing Engineer</span>,
 <span class="metadata_unit-label">Mastering Engineer</span>,
 <span class="metadata_unit-label">Label</span>,
 <span class="metadata_unit-label">Copyright ©</span>,
 <span class="metadata_unit-label">Phonographic Copyright ℗</span>,
 <span class="metadata_unit-label">Performance Rights</span>,
 <span class="metadata_unit-label">Publisher</span>,
 <span class="metadata_unit-label">Mixed At</span>,
 <span class="metadata_unit-label">Recorded At</span>,
 <span class="metadata_unit-label">Release Date</span>,
 <span class="metadata_unit-label">Samples</span>,
 <span class="met

In [44]:
all_metadata_labels = document.find_all('span', attrs={'class': 'metadata_unit-label'})

In [49]:
document.find_all('span', attrs={'class': 'metadata_unit-info'})

[<span class="metadata_unit-info">
 <a href="https://genius.com/artists/Timbaland">Timbaland</a> &amp; <a href="https://genius.com/artists/Missy-elliott">Missy Elliott</a>
 </span>,
 <span class="metadata_unit-info"><a href="https://genius.com/albums/Missy-elliott/Under-construction">Under Construction</a></span>,
 <span class="metadata_unit-info">
 <a href="https://genius.com/artists/Missy-elliott">Missy Elliott</a> &amp; <a href="https://genius.com/artists/Timbaland">Timbaland</a>
 </span>,
 <span class="metadata_unit-info">
 <a href="https://genius.com/artists/Carlos-bedoya">Carlos Bedoya</a> &amp; <a href="https://genius.com/artists/Jimmy-douglass">Jimmy Douglass</a>
 </span>,
 <span class="metadata_unit-info">
 <a href="https://genius.com/artists/Demacio-demo-castellon">Demacio “Demo” Castellon</a> &amp; <a href="https://genius.com/artists/Marc-lee">Marc Lee</a>
 </span>,
 <span class="metadata_unit-info">
 <a href="https://genius.com/artists/Jimmy-douglass">Jimmy Douglass</a> &am

In [42]:
all_metadata_info = document.find_all('span', attrs={'class': 'metadata_unit-info'})

The `zip()` function pairs up two lists item by item, so each label is matched with the info at the same position.

In [48]:
for label, info in zip(all_metadata_labels, all_metadata_info):
    print(f"{label.text}: {info.text}")

Produced by: 
Timbaland & Missy Elliott

Album: Under Construction
Written By: 
Missy Elliott & Timbaland

Recording Engineer: 
Carlos Bedoya & Jimmy Douglass

Assistant Recording Engineer: 
Demacio “Demo” Castellon & Marc Lee

Mixing Engineer: 
Jimmy Douglass & Timbaland

Assistant Mixing Engineer: 
Steamy

Mastering Engineer: 
Herb Powers

Label: 
The Goldmind & Elektra Records

Copyright ©: 
Elektra Entertainment Group & Warner Music Group

Phonographic Copyright ℗: 
Elektra Entertainment Group & Warner Music Group

Performance Rights: 
ASCAP

Publisher: 
Mass Confusion Music, Virginia Beach Music & Warner Music Group

Mixed At: 
Manhattan Center Studios, New York, NY

Recorded At: The Hit Factory Criteria (Miami)
Release Date: September 1, 2002
Samples: 

Peter Piper by Run-DMC


Take Me To The Mardi Gras by Bob James


Request Line by Rock Master Scott & the Dynamic Three


Sampled In: 

The Brainstream by Canibus


History of Rap 1 by Jimmy Fallon (Ft. Justin Timberlake)


No Pau
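
The blank lines in the output above come from newline characters inside some of the `info.text` strings. Here is a minimal sketch of how `zip()` pairs items and how the extra whitespace could be trimmed. It uses a few plain strings copied from the output above in place of the BeautifulSoup elements, and the variable names are chosen for illustration.

```python
labels = ["Produced by", "Album", "Release Date"]
infos = ["\nTimbaland & Missy Elliott\n", "Under Construction", "September 1, 2002"]

# zip() matches the first label with the first info, the second with the second, and so on
pairs = list(zip(labels, infos))

cleaned = {}
for label, info in pairs:
    # split() breaks on any run of whitespace; join() glues the words back with single spaces
    cleaned[label] = " ".join(info.split())

for label, info in cleaned.items():
    print(f"{label}: {info}")
```

Note that `zip()` stops as soon as the shorter list runs out, so labels and info only line up correctly when both lists are in the same order.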