# Introduction

A lot of data isn't available as downloadable data sets or through APIs, but it may still be published on the Internet as Web pages. One way to access that data without waiting for the provider to create an API is to use a technique called Web scraping.

Web scraping allows us to load a Web page into Python and extract the information we want. We can then work with the data using standard analysis tools like pandas and numpy.

We'll use the requests library heavily as we learn about Web scraping. This library enables us to download a Web page. We'll also use the BeautifulSoup library (imported from the bs4 package) to extract the relevant parts of the Web page.

This project was part of the DataQuest Data Science for Python track. The examples scrape simple Web pages that DataQuest created for this purpose.

## Downloading the Webpage

In [3]:
import requests

response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
content = response.content
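
The cell above assumes the request succeeds. As a minimal sketch of a more defensive download, the helper below adds a timeout and a status check. The name `fetch_html` and the 10-second default are assumptions made for illustration; they are not part of the original project.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes (hypothetical helper)."""
    # A timeout keeps the call from hanging indefinitely on a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError on a 4xx or 5xx status code.
    response.raise_for_status()
    return response.content


# Example usage (requires network access):
# content = fetch_html("http://dataquestio.github.io/web-scraping-pages/simple.html")
```

Failing early on a bad status code avoids parsing an error page as if it were the data we wanted.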

## Extracting Information

Now we need to parse the page and extract the information we want. We'll use the BeautifulSoup library to parse the Web page with Python. This library allows us to extract tags from an HTML document.

In [4]:
from bs4 import BeautifulSoup

# Initialize the parser, and pass in the content we grabbed earlier.
parser = BeautifulSoup(content, 'html.parser')

# Get the body tag from the document.
# Since we passed in the top level of the document to the parser, we need to pick a branch off of the root.
# With BeautifulSoup, we can access branches by using tag types as attributes.
body = parser.body

# Get the p tag from the body.
p = body.p

# Print the text inside the p tag.
# Text is a property that gets the inside text of a tag.
print(p.text)

# Get the text of the title tag inside the head in the same way.
head = parser.head
title = head.title
title_text = title.text

Here is some simple content for this page.


Accessing a tag type as an attribute returns only the first matching tag, so it is not always a robust way to parse a document. It's usually better to be more explicit by using the find_all method. This method finds all occurrences of a tag in the current element and returns them as a list.

In [5]:
parser = BeautifulSoup(content, 'html.parser')

# Get a list of all occurrences of the body tag in the element.
body = parser.find_all("body")

# Get the paragraph tag.
p = body[0].find_all("p")

# Get the text.
print(p[0].text)

# Get the text of the title tag inside the head.
head = parser.find_all("head")
title = head[0].find_all("title")
title_text = title[0].text

Here is some simple content for this page.
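
The difference between the two approaches only shows up when a page has more than one matching tag. The sketch below uses a small made-up HTML string (not one of the course pages) to compare them.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two paragraphs; not one of the course pages.
html = "<html><body><p>First</p><div><p>Second</p></div></body></html>"
parser = BeautifulSoup(html, "html.parser")

# Attribute access returns only the first matching tag.
first_only = parser.body.p.text

# find_all returns every match, including tags nested deeper in the tree.
all_paragraphs = [p.text for p in parser.find_all("p")]

print(first_only)
print(all_paragraphs)
```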


## Scraping with ID attributes

HTML allows elements to have IDs. Because an ID should be unique within a page, we can use it to refer to a specific element. Here we print the text of the first paragraph and store the text of the second.

In [7]:
# Get the page content and set up a new parser.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_ids.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

# Pass in the ID attribute to only get the element with that specific ID.
first_paragraph = parser.find_all("p", id="first")[0]
print(first_paragraph.text)
second_paragraph_text = parser.find_all("p", id="second")[0].text


                First paragraph.
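
Indexing `[0]` on the result of find_all raises an IndexError when nothing matches. Since an ID should match at most one element, the find method is a possible alternative: it returns the first match, or None if there is none. The sketch below assumes a made-up HTML string whose IDs mirror the ones used above.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the IDs mirror those used above.
html = '<p id="first">First paragraph.</p><p id="second">Second paragraph.</p>'
parser = BeautifulSoup(html, "html.parser")

# find returns the first match, or None when nothing matches.
second = parser.find("p", id="second")
missing = parser.find("p", id="third")

# Guard against None before reading the text.
second_text = second.text if second is not None else None
missing_text = missing.text if missing is not None else None

print(second_text)
print(missing_text)
```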
            


## Scraping with Class attributes

In HTML, elements can also have classes. Unlike IDs, classes aren't globally unique: many different elements can belong to the same class, usually because they share a common purpose or characteristic.

In [8]:
# Get the website that contains classes.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/simple_classes.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

# Get the first inner paragraph.
# Find all the paragraph tags with the class inner-text.
# Then, take the first element in that list.
first_inner_paragraph = parser.find_all("p", class_="inner-text")[0]
print(first_inner_paragraph.text)

# Get the text of the second inner paragraph.
second_inner_paragraph_text = parser.find_all("p", class_="inner-text")[1].text

# Get the text of the first outer paragraph.
first_outer_paragraph_text = parser.find_all("p", class_="outer-text")[0].text


                First paragraph.
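
The keyword argument is spelled `class_` because `class` is a reserved word in Python. The sketch below, on a made-up HTML string, shows that the same search can also be written with the `attrs` dictionary, and that an element carrying several classes still matches when one of them is the class we asked for.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the class names mirror those used above.
html = (
    '<p class="inner-text">One</p>'
    '<p class="inner-text highlighted">Two</p>'
    '<p class="outer-text">Three</p>'
)
parser = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because class is a reserved word in Python.
inner = [p.text for p in parser.find_all("p", class_="inner-text")]

# The same search expressed through the attrs dictionary.
inner_attrs = [p.text for p in parser.find_all("p", attrs={"class": "inner-text"})]

print(inner)
print(inner_attrs)
```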
            


## Scraping with CSS

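We can use BeautifulSoup's .select method to work with CSS selectors. In a CSS selector, a leading period selects by class (for example, .first-item) and a leading # selects by ID (for example, #second). The method returns a list of the matching elements.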

In [9]:
# Get the website that contains classes and IDs.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

# Select all of the elements that have the first-item class.
first_items = parser.select(".first-item")

# Print the text of the first paragraph (the first element with the first-item class).
print(first_items[0].text)

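# Get the text of the first element with the outer-text class.
first_outer_text = parser.select(".outer-text")[0].text
# Get the text of the element with the ID second.
second_text = parser.select("#second")[0].text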


                First paragraph.
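
CSS selectors can also be combined. The sketch below runs a few common combinations against a made-up HTML string; the class and ID names mirror the ones used above, but the markup itself is an assumption and not the course page.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; class and ID names mirror those used above.
html = """
<div class="outer-text"><p class="first-item" id="first">A</p></div>
<div class="inner-text"><p class="first-item" id="second">B</p></div>
"""
parser = BeautifulSoup(html, "html.parser")

# Tag and class combined: p elements that have the first-item class.
items = [p.text for p in parser.select("p.first-item")]

# Descendant selector: p elements inside an element with the outer-text class.
nested = [p.text for p in parser.select(".outer-text p")]

# select_one returns the first match (or None) instead of a list.
second = parser.select_one("#second").text

print(items, nested, second)
```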
            


In [13]:
# Get the Super Bowl box score data.
response = requests.get("http://dataquestio.github.io/web-scraping-pages/2014_super_bowl.html")
content = response.content
parser = BeautifulSoup(content, 'html.parser')

# Find the number of turnovers the Seahawks committed.
turnovers = parser.select("#turnovers")[0]
seahawks_turnovers = turnovers.select("td")[1]
seahawks_turnovers_count = seahawks_turnovers.text
print("Seahawks turnovers count: {}".format(seahawks_turnovers_count))

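# In each row, the cell at index 1 holds the Seahawks value and the cell at index 2 holds the Patriots value.
patriots_total_plays_count = parser.select("#total-plays")[0].select("td")[2].text
seahawks_total_yards_count = parser.select("#total-yards")[0].select("td")[1].text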

print("Patriots total plays count: {}".format(patriots_total_plays_count))
print("Seahawks total yards count: {}".format(seahawks_total_yards_count))

Seahawks turnovers count: 1
Patriots total plays count: 72
Seahawks total yards count: 396
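
The values extracted above are strings, so they need converting before any arithmetic. As a rough sketch, the hypothetical helper `row_values` below reads one row by ID and returns its numeric cells as integers. The table is invented for illustration (the numbers are placeholders, not the real box score) and assumes each row starts with a label cell, as the code above implies.

```python
from bs4 import BeautifulSoup

# Invented table for illustration; the numbers are placeholders,
# not the real box score.
html = """
<table>
  <tr id="turnovers"><td>Turnovers</td><td>1</td><td>2</td></tr>
  <tr id="total-plays"><td>Total plays</td><td>50</td><td>70</td></tr>
</table>
"""
parser = BeautifulSoup(html, "html.parser")


def row_values(soup, row_id):
    """Return the numeric cells of the row with the given ID as integers."""
    cells = soup.select("#{} td".format(row_id))
    # Skip the first cell, which holds the row label.
    return [int(cell.text.strip()) for cell in cells[1:]]


turnovers = row_values(parser, "turnovers")
total_plays = row_values(parser, "total-plays")
print(turnovers, total_plays)
```

With integers in hand, the results can go straight into tools like pandas or numpy for analysis.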
