A lot of data aren't accessible through data sets or APIs. They may exist on the Internet as Web pages, though. One way to access the data without waiting for the provider to create an API is to use a technique called Web scraping.

Web scraping allows us to load a Web page into Python and extract the information we want. We can then work with the data using standard analysis tools like pandas and numpy.

Before we can do Web scraping, we need to understand the structure of the Web page we're working with, then find a way to extract parts of that structure in a sensible way.

We'll use the requests library heavily as we learn about Web scraping. This library enables us to download a Web page. We'll also use the BeautifulSoup library to extract the relevant parts of the Web page.

Web pages use HyperText Markup Language (HTML). HTML isn't a programming language like Python. It's a markup language with its own syntax and rules. When a Web browser like Chrome or Firefox downloads a Web page, it reads the HTML to determine how to render it and display it to us.

HTML consists of tags. Anything in between the opening and closing of a tag is the content of that tag. We can nest tags to create complex formatting rules.

HTML documents contain a few major sections. The head section contains information that's useful to the Web browser that's rendering the page; the user doesn't see it. The body section contains the bulk of the content the user interacts with on the page.

Different tags have different purposes. For example, the title tag tells the Web browser what page title to display at the top of our tab. The p tag indicates that the content inside it is a single paragraph.

In this file, we'll make a GET request to http://dataquestio.github.io/web-scraping-pages/simple.html.

In [1]:
import requests

response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
print(response.status_code)

200


In [2]:
content = response.content
content

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'

Downloading the page is the easy part. Let's say that we want to get the text in the first paragraph. Now we need to parse the page and extract the information we want.

We'll use the BeautifulSoup library to parse the Web page with Python. This library allows us to extract tags from an HTML document.

We can think of HTML documents as "trees," and the nested tags as "branches" (similar to a family tree). BeautifulSoup works the same way.

If we look at [this page](http://dataquestio.github.io/web-scraping-pages/simple.html), for example, the root of the "tree" is the html tag.

The html tag contains two "branches," head and body. head contains one "branch," title. body contains one branch, p. Drilling down through these multiple branches is one way to parse a Web page.
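The tree described above can be made visible with a few lines of code. This is a minimal sketch that uses Python's built-in `html.parser` module rather than BeautifulSoup, and an inline copy of the page so it runs without a network connection. The `TagOutline` class and its attribute names are our own, chosen only for illustration.

```python
from html.parser import HTMLParser

# Inline copy of the simple example page, so the sketch runs offline.
HTML = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""


class TagOutline(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # Store the tag at the current depth, then step one level deeper.
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag takes us back up one level.
        self.depth -= 1


outliner = TagOutline()
outliner.feed(HTML)

# Print one tag per line, indented by its depth in the tree.
for depth, tag in outliner.outline:
    print("    " * depth + tag)
# html
#     head
#         title
#     body
#         p
```

The indentation mirrors the "branches": head and body hang off html, and title and p sit one level further down.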

To extract the text inside the p tag, we would first need to get the body element, then the p element, and then finally the text inside the p element.

In [3]:
# Parse the downloaded page so we can navigate its tags

from bs4 import BeautifulSoup

# Initialize the parser, and pass in the content variable

parser = BeautifulSoup(content, "html.parser")
parser

<!DOCTYPE html>

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>

In [4]:
# Get the body tag from the document.
# Since we passed in the top level of the document to the parser, 
# we need to pick a branch off of the root.
# With BeautifulSoup, we can access branches by using tag types as attributes

body = parser.body
body

<body>
<p>Here is some simple content for this page.</p>
</body>

In [5]:
# Get the p tag from the body.
p = body.p
p

<p>Here is some simple content for this page.</p>
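The last step, getting the text inside the p element, can be sketched as follows. This is a minimal example under the assumption that we work from an inline copy of the page instead of downloading it. The variable names `soup`, `paragraph_text`, and `missing` are ours.

```python
from bs4 import BeautifulSoup

# Inline copy of the simple example page, so the sketch runs offline.
html = b"""<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# Chain the tag names as attributes: root -> body -> p -> text.
paragraph_text = soup.body.p.text
print(paragraph_text)  # Here is some simple content for this page.

# Asking for a tag that isn't there returns None rather than raising,
# so a further .text on the result would fail with an AttributeError.
missing = soup.body.table
print(missing)  # None
```

The second lookup hints at why attribute access can be fragile: a missing tag only shows up as an error one step later.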

While it's nice to use the tag type as a property, it's not always a very robust way to parse a document. It's usually better to be more explicit by using the find_all method. This method will find all occurrences of a tag in the current element, and return a list.

If we only want the first occurrence of an item, we'll need to index the list to get it. Aside from this difference, it behaves the same way as passing in the tag type as an attribute.

In [6]:
parser = BeautifulSoup(content, "html.parser")
body = parser.find_all("body")
body

[<body>
 <p>Here is some simple content for this page.</p>
 </body>]

In [7]:
p = body[0].find_all("p")
p

[<p>Here is some simple content for this page.</p>]

In [8]:
p[0].text

'Here is some simple content for this page.'

In [9]:
title = parser.find_all("title")
title[0].text

'A simple example page'
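When we only need the first occurrence, BeautifulSoup also offers a find method that saves us from indexing the list. The sketch below compares the two on a small inline document. The HTML and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document used only for this comparison.
html = """<body>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body>"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching element, or None if nothing matches.
first_p = soup.find("p")
print(first_p.text)  # First paragraph.

# find_all returns a list-like collection of every match.
all_p = soup.find_all("p")
print(len(all_p))  # 2

# With no matches, find gives None and find_all gives an empty list,
# so indexing find_all(...)[0] would raise an IndexError here.
missing_one = soup.find("table")
missing_all = soup.find_all("table")
print(missing_one, len(missing_all))  # None 0
```

Either way, it's worth checking for the "no match" case before reading .text from the result.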

HTML allows elements to have IDs. Because an ID is meant to be unique within a page, we can use it to refer to a specific element.

HTML uses the div tag to create a divider that splits the page into logical units. We can think of a divider as a "box" that contains content. For example, different dividers hold a Web page's footer, sidebar, and horizontal menu.

In [10]:
# Get the page content and set up a new parser.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_ids.html")

content = response.content
parser = BeautifulSoup(content, "html.parser")
parser

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p id="first">
                First paragraph.
            </p>
</div>
<p id="second">
<b>
                Second paragraph.
            </b>
</p>
</body>
</html>

In [11]:
print(parser.find_all("p", id="first")[0])

<p id="first">
                First paragraph.
            </p>


In [12]:
print(parser.find_all("p", id="second")[0].text)



                Second paragraph.
            



In HTML, elements can also have classes. Classes aren't globally unique. In other words, many different elements can belong to the same class, usually because they share a common purpose or characteristic.

For example, we may want to create three dividers to display three of our photographs. We can create a common look and feel for these dividers, such as a border and caption style.

This is where classes come into play. We could create a class called "gallery," define a style for it once using CSS, and then apply that class to all of the dividers we'll use to display photos. One element can even have multiple classes.

We can use find_all to select elements by class. We'll just need to pass in the class_ parameter.
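Here is a minimal sketch of the gallery idea, assuming a small hypothetical document in which one divider carries two classes. It also uses get_text(strip=True), which trims the whitespace that surrounds the text in the raw HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two "gallery" dividers (one with a second class)
# and one unrelated divider.
html = """
<div class="gallery featured"><p>  Photo one  </p></div>
<div class="gallery"><p>  Photo two  </p></div>
<div class="sidebar"><p>  Links  </p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches any element that has "gallery" among its classes.
gallery = soup.find_all("div", class_="gallery")

# get_text(strip=True) removes the leading and trailing whitespace.
captions = [div.get_text(strip=True) for div in gallery]
print(captions)  # ['Photo one', 'Photo two']

# An element's classes are exposed as a list.
print(gallery[0]["class"])  # ['gallery', 'featured']
```

The parameter is spelled class_ with a trailing underscore because class is a reserved word in Python.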

In [13]:
# Get the page that contains classes:
# http://dataquestio.github.io/web-scraping-pages/simple_classes.html
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_classes.html")
content = response.content
content

b'<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <div>\n            <p class="inner-text">\n                First paragraph.\n            </p>\n            <p class="inner-text">\n                Second paragraph.\n            </p>\n        </div>\n        <p class="outer-text">\n            <b>\n                First outer paragraph.\n            </b>\n        </p>\n        <p class="outer-text">\n            <b>\n                Second outer paragraph.\n            </b>\n        </p>\n    </body>\n</html>'

In [14]:
parser = BeautifulSoup(content, "html.parser")
print(parser.find_all("p", class_="inner-text")[0].text)
print(parser.find_all("p", class_="inner-text")[1].text)


                First paragraph.
            

                Second paragraph.
            


In [15]:
print(parser.find_all("p", class_="outer-text")[0].text)
print(parser.find_all("p", class_="outer-text")[1].text)



                First outer paragraph.
            



                Second outer paragraph.
            



Cascading Style Sheets, or CSS, is a language for adding styles to HTML pages. We can use selectors to add background colors, text colors, borders, padding, and many other style choices to the elements on HTML pages.

This CSS will make all of the text inside all paragraphs red:

p{
    color: red
 }

This CSS will change the text color to red for any paragraphs that have the class inner-text. We select classes with the period or dot symbol (.):

p.inner-text{
    color: red
 }

This CSS will change the text color to red for any paragraphs that have the ID first. We select IDs with the pound or hash symbol (#):

p#first{
    color: red
 }

We can also style IDs and classes without using any specific tags. For example, this CSS will make the element with the ID first red (not just paragraphs):

#first{
    color: red
 }
 
This CSS will make any element with the class inner-text red:

.inner-text{
    color: red
 }
 
In the examples above, we used CSS selectors to select one or more elements, then apply styles to only those elements. CSS selectors are very powerful and flexible.

Perhaps not surprisingly, we also use CSS selectors to select elements when we do Web scraping.

We can use BeautifulSoup's .select method to work with CSS selectors. 
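As a quick illustration, the selectors from the CSS examples above can be passed to .select unchanged. This is a minimal sketch on a small hypothetical document. The HTML and the `counts` dictionary are ours, not part of the pages used in this file.

```python
from bs4 import BeautifulSoup

# Hypothetical document with one ID and two classes.
html = """<body>
<div>
<p class="inner-text" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text">Third paragraph.</p>
</body>"""

soup = BeautifulSoup(html, "html.parser")

# Count how many elements each selector matches.
selectors = ["p", "p.inner-text", "p#first", "#first", ".inner-text"]
counts = {selector: len(soup.select(selector)) for selector in selectors}
print(counts)
# {'p': 3, 'p.inner-text': 2, 'p#first': 1, '#first': 1, '.inner-text': 2}

# select_one returns just the first match (or None if there is none).
first_text = soup.select_one("#first").text
print(first_text)  # First paragraph.
```

Like find_all, .select always returns a list, even when the selector can match only one element.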

Here's the HTML we'll be working with: http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html


In [16]:
# Get the website that contains classes and IDs.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
content = response.content
parser = BeautifulSoup(content, "html.parser")
parser

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>

In [17]:
# Select all of the elements that have the class outer-text, and print the text of the first one.
print(parser.select(".outer-text")[0].text)



                First outer paragraph.
            



In [18]:
# Select the element that has the ID second, and print its text.
print(parser.select("#second")[0].text)



                First outer paragraph.
            



We can nest CSS selectors similar to the way HTML nests tags. For example, we could use selectors to find all of the paragraphs inside the body tag. Nesting is a very powerful technique that enables us to use CSS to do complex Web scraping tasks.

This selector will target any paragraph inside a div tag:

div p

This selector will target any item inside a div tag that has the class first-item:

div .first-item

This one is even more specific. It selects any item that's inside a div tag inside a body tag, but only if it also has the ID first:

body div #first

This selector zeroes in on any items with the ID first that are inside any items with the class first-item:

.first-item #first

As we can see, we can nest CSS selectors in many different ways, which allows us to extract data from websites with complex layouts. Because it's easy to write a selector that doesn't work the way we expect, it's a good idea to test selectors with the .select method as we write them. Nested selectors work with the same .select method we used for our simpler CSS selectors.
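To see how the four nested selectors above behave, we can run them against an inline copy of the ids_and_classes page. This is a minimal sketch. The whitespace inside the tags is trimmed compared to the real page, and the `matches` dictionary is our own name.

```python
from bs4 import BeautifulSoup

# Inline copy of the ids_and_classes page (whitespace trimmed).
html = """<html>
<head><title>A simple example page</title></head>
<body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

selectors = ["div p", "div .first-item", "body div #first", ".first-item #first"]
matches = {selector: len(soup.select(selector)) for selector in selectors}

for selector, count in matches.items():
    print(f"{selector!r}: {count} match(es)")
# 'div p': 2 match(es)
# 'div .first-item': 1 match(es)
# 'body div #first': 1 match(es)
# '.first-item #first': 0 match(es)
```

The last selector matches nothing on this page: the element with the ID first is itself the element with the class first-item, not something nested inside one. This is exactly the kind of surprise that testing selectors as we write them helps us catch.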

In [19]:
response = requests.get("http://dataquestio.github.io/web-scraping-pages/2014_super_bowl.html")
content = response.content
parser = BeautifulSoup(content, "html.parser")

# Total Plays for the New England Patriots
print(parser.select("#total-plays")[0].select("td")[2].text)

72


In [91]:
# Find the Total Yards for the Seahawks.
# The same lookup can be written with find_all, with select, or with a mix of the two.
# print(parser.find_all(id="total-yards")[0].find_all("td")[1].text)
print(parser.select("#total-yards")[0].select("td")[1].text)
print(parser.find_all(id="total-yards")[0].select("td")[1].text)

396
396


We've covered the basics of HTML and how to select elements, which are key foundational blocks.

We might be wondering why Web scraping is useful, given that in most of our examples, we could easily have found the answer by looking at the page. The real power of Web scraping lies in getting information from a large number of pages very quickly.

Let's say we wanted to find the total number of yards each NFL team gained in every single NFL game over an entire season. We could do this manually, but it would take days of boring drudgery. We could write a script to automate this in a couple of hours instead, and have a lot more fun doing it.
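Such a script could be sketched roughly as follows. This is only an illustration under several assumptions: the inline HTML strings stand in for pages we would normally download with requests, the yard values are made-up placeholders, and we assume every page uses a row with the ID total-yards whose first cell is a label, as in the Super Bowl page above. The `team_stat` helper and the `pages` dictionary are hypothetical names.

```python
from bs4 import BeautifulSoup


def team_stat(html, row_id, column):
    """Return the integer in the given cell of the row with ID row_id, or None."""
    soup = BeautifulSoup(html, "html.parser")
    row = soup.select_one("#" + row_id)
    if row is None:
        return None  # This page has no such row.
    cells = row.select("td")
    if column >= len(cells):
        return None  # The row is shorter than expected.
    return int(cells[column].get_text(strip=True))


# Placeholder pages with made-up numbers. In a real script, each value
# would be the content of a response returned by requests.get(url).
pages = {
    "game-1": '<table><tr id="total-yards"><td>Total Yards</td><td>300</td><td>250</td></tr></table>',
    "game-2": '<table><tr id="total-yards"><td>Total Yards</td><td>410</td><td>320</td></tr></table>',
}

# Extract the same statistic from every page with a single loop.
yards = {name: team_stat(html, "total-yards", 1) for name, html in pages.items()}
total = sum(yards.values())

print(yards)  # {'game-1': 300, 'game-2': 410}
print(total)  # 710
```

Once the extraction logic lives in one function, covering a whole season is mostly a matter of supplying more pages to the loop.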