# 11.5.1 Performing an Automated Web Scrape

In [1]:
# Set Up Your Code to Use Your Tools
# In this section, you'll set up your web scraping code so that it can use the tools you need.

# To begin, complete the following steps:

# Create a new folder, and then use the terminal to navigate to it.

# Activate Jupyter Notebook.

# Create a new Practice.ipynb file.
# You'll practice using DevTools to inspect a website and using Beautiful Soup to parse HTML code.


In [2]:
# Next, in the first cell of Practice.ipynb, enter the following code:

from splinter import Browser
from bs4 import BeautifulSoup as soup
from webdriver_manager.chrome import ChromeDriverManager


In [3]:
# In the next cell, set the executable path and initialize a browser by entering the following code:

# Set up Splinter
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)


In [4]:
# Scrape the Title
# At last, we can scrape the "Top Ten tags" title. To begin, in the next cell in your Jupyter notebook, enter and run the following code:

# Visit the Quotes to Scrape site

url = 'http://quotes.toscrape.com/'
browser.visit(url)

    

In [5]:
# Next, we parse all the HTML code on the page. Once the page is parsed, Beautiful Soup has examined the components on the page and can access them. Specifically, Beautiful Soup parses the HTML text and stores it as an object.

# In the following code, we use html.parser to parse the information, but other options also exist.

# Parse the HTML
html = browser.html
html_soup = soup(html, 'html.parser')
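
# As a minimal offline sketch of the same parsing step, the following code parses a small HTML string instead of browser.html. The sample_html and sample_soup names are hypothetical stand-ins that aren't part of the lesson code.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical stand-in for browser.html so that the example runs offline
sample_html = """
<html>
  <head><title>Quotes to Scrape</title></head>
  <body>
    <h2>Top Ten tags</h2>
  </body>
</html>
"""

# 'html.parser' is Python's built-in parser. Beautiful Soup also supports
# 'lxml' and 'html5lib' if those packages are installed.
sample_soup = soup(sample_html, 'html.parser')

# The result is a BeautifulSoup object that we can search like a tree
print(type(sample_soup).__name__)
print(sample_soup.title.text)
```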

In [6]:
# Next, we want to find the title and extract it. To do so, in the next cell, enter and run the following code:

# Scrape the Title

h2 = html_soup.find('h2')
title = h2.text
print(h2)
print(title)

<h2>Top Ten tags</h2>
Top Ten tags


In [7]:
# In the preceding code, the first print statement displays <h2>Top Ten tags</h2>, and the second displays Top Ten tags. To understand how we got those results, let's go over the first two lines of code:

# On the h2 = html_soup.find('h2') line, we use the html_soup object that we created earlier and call its find method to search for the first <h2 /> tag on the page. The method returns the whole element, so printing the result produces <h2>Top Ten tags</h2>.

# On the title = h2.text line, we extract just the text from the <h2 /> tag by adding .text to the end of the code. This reads the element's text attribute, so printing it produces only the title text, Top Ten tags.
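
# To make the difference between the element and its text concrete, here's a small sketch that runs on a hand-written HTML string. The sample_html content, including the second heading, is an assumption made for illustration.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical HTML with two headings, so we can see which one find returns
sample_html = "<div><h2>Top Ten tags</h2><h2>Another heading</h2></div>"
sample_soup = soup(sample_html, 'html.parser')

first_h2 = sample_soup.find('h2')   # find returns only the first match

print(type(first_h2).__name__)      # the element is a Tag object
print(first_h2)                     # the whole element, tags included
print(first_h2.text)                # only the text inside the element
```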



In [8]:
# We could also access the title text directly by chaining the two calls:

title = html_soup.find('h2').text
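
# One caveat with chaining: find returns None when nothing matches, so .text would raise an AttributeError on a page without an <h2 /> tag. The following sketch guards against that case. The first_h2_text helper and the sample pages are hypothetical.

```python
from bs4 import BeautifulSoup as soup


def first_h2_text(parsed_page):
    """Return the text of the first <h2>, or None if the page has none."""
    h2 = parsed_page.find('h2')
    return h2.text if h2 is not None else None


# Hypothetical sample pages: one with a heading and one without
page_with_h2 = soup("<h2>Top Ten tags</h2>", 'html.parser')
page_without_h2 = soup("<p>No heading here</p>", 'html.parser')

print(first_h2_text(page_with_h2))
print(first_h2_text(page_without_h2))
```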

In [9]:
# In DevTools, inspect the division that contains the “Top Ten tags” title. Notice that its opening tag is <div class="col-md-4 tags-box">. This means that this division has two classes: col-md-4 and tags-box.

# An HTML element can belong to multiple classes.
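
# Beautiful Soup reflects this by storing the class attribute as a list. The following sketch uses a hand-written copy of the opening tag (an assumption, not the full page) to show that searching by either class finds the same element.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical one-element page that mirrors the tag from DevTools
sample_html = '<div class="col-md-4 tags-box"><h2>Top Ten tags</h2></div>'
sample_soup = soup(sample_html, 'html.parser')

div = sample_soup.find('div')
print(div['class'])   # each class is a separate list item

# Either class name is enough to match the element
by_grid = sample_soup.find('div', class_='col-md-4')
by_box = sample_soup.find('div', class_='tags-box')
print(by_grid is by_box)
```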

# The col-md-4 class is a Bootstrap feature. Bootstrap is an HTML and CSS library that simplifies building websites. It uses a grid system to divide a page into 12 columns of equal width. In this case, col-md-4 means that the box containing “Top Ten tags” takes up four columns. The main quotes section takes up the remaining eight columns. Websites that use Bootstrap commonly use this class, but many other Bootstrap classes exist.

# The other class, tags-box, seems specific to this website, but we want to confirm that. To do so, use the DevTools Find box to search for it.

# Notice that searching for “tags-box” returns only one result: our <div class="col-md-4 tags-box"> tag. This means that tags-box is unique in the HTML code, so we can use it to find specific data.

# Next, we want to examine the content of this div element. To do so, in DevTools, expand the <div class="col-md-4 tags-box"> line.

# Notice that we get the <h2>Top Ten tags</h2> line followed by a list of <span /> tags, each with a class of tag-item. Expand some of the span elements to review their contents. If you observe <a /> tags that contain the names of the Top Ten tags, you're in the right place.

# Because the list that displays on the webpage contains 10 items, let's use the DevTools search function to verify the list item count. This is one more way to check that we’ll scrape the correct data. To do so, search for “tag-item” (without the quotation marks), and then note the number of returned results. This number should indeed be 10.

# Use the DevTools Find box to verify the number of instances of an element
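
# The same checks can be sketched in code by counting matches with find_all. The sample_html below is a trimmed, hand-written stand-in for the tags box with only three items, so the counts here are illustrative. On the live page, compare item_count with the number that the DevTools Find box reports.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical, shortened copy of the tags box (three items, not ten)
sample_html = """
<div class="col-md-4 tags-box">
  <h2>Top Ten tags</h2>
  <span class="tag-item"><a class="tag">love</a></span>
  <span class="tag-item"><a class="tag">inspirational</a></span>
  <span class="tag-item"><a class="tag">life</a></span>
</div>
"""
sample_soup = soup(sample_html, 'html.parser')

# Count the elements that carry each class
box_count = len(sample_soup.find_all(class_='tags-box'))
item_count = len(sample_soup.find_all('span', class_='tag-item'))

print(box_count)    # one box means that tags-box is unique
print(item_count)   # number of list items found in this sample
```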

# Splendid! We can now scrape all the tags by using a for loop. To do so, in the next cell of your notebook, enter and run the following code:





In [10]:
# Scrape the top ten tags

tag_box = html_soup.find('div', class_='tags-box')

# Within that box, find every <a /> tag that has a class of tag
tags = tag_box.find_all('a', class_='tag')

for tag in tags:
    word = tag.text
    print(word)


love
inspirational
life
humor
books
reading
friendship
friends
truth
simile


In [11]:
# Let's go over the preceding code:

# The first line, tag_box = html_soup.find('div', class_='tags-box'), assigns the results of a search to a new variable named tag_box. The search is for <div /> tags that have a class of tags-box.

# This search occurs in the HTML code that we parsed and stored in the html_soup variable earlier.

# The class_ argument contains an underscore (_). That’s because class is a reserved keyword in Python, so it can't be used as an argument name.
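
# As a side note, class_ isn't the only way to search by class. The following sketch, which runs on a hypothetical one-element page, shows two equivalent alternatives: passing an attrs dictionary and using a CSS selector with select_one.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical one-element page for illustration
sample_html = '<div class="col-md-4 tags-box"><h2>Top Ten tags</h2></div>'
sample_soup = soup(sample_html, 'html.parser')

# Three ways to find the same <div>
box_a = sample_soup.find('div', class_='tags-box')            # keyword with underscore
box_b = sample_soup.find('div', attrs={'class': 'tags-box'})  # attrs dictionary
box_c = sample_soup.select_one('div.tags-box')                # CSS selector

print(box_a is box_b)
print(box_a is box_c)
```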

# The second line, tags = tag_box.find_all('a', class_='tag'), drills down further into the data in tag_box.

# This line uses the find_all method to search within tag_box. This time, we search through the parsed results that are stored in our tag_box variable to find all the anchor (a) elements that belong to the tag class.

# We use find_all because we want to capture all the results rather than a single or specific one.

# The Beautiful Soup find method returns the first search result. The find_all method returns all the results.

# Finally, we added a for loop. This loop cycles through the elements in the tags variable, extracts the text from each one, and then prints only that text.
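
# To compare find and find_all side by side, here's a minimal sketch on a shortened, hand-written copy of the tags box (three items rather than ten, which is an assumption for brevity). It also collects the tag names in a list instead of printing them, which is often handier for later processing.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical, shortened copy of the tags box
sample_html = """
<div class="col-md-4 tags-box">
  <h2>Top Ten tags</h2>
  <span class="tag-item"><a class="tag">love</a></span>
  <span class="tag-item"><a class="tag">inspirational</a></span>
  <span class="tag-item"><a class="tag">life</a></span>
</div>
"""
sample_soup = soup(sample_html, 'html.parser')
tag_box = sample_soup.find('div', class_='tags-box')

first_link = tag_box.find('a', class_='tag')      # a single Tag: the first match
all_links = tag_box.find_all('a', class_='tag')   # a list-like ResultSet of every match

# Collect the text of every link in a plain Python list
tag_names = [link.text for link in all_links]

print(first_link.text)
print(tag_names)
```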




In [12]:
# Close the automated browser and end the session
browser.quit()
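
# If a scraping step raises an error, a separate quit cell might never run, which leaves the browser window open. One possible pattern, sketched below, wraps the scrape in try/finally so that quit always runs. FakeBrowser and scrape_title are hypothetical names. The fake class only imitates the visit, html, and quit members that this lesson uses, so the sketch runs without Chrome.

```python
from bs4 import BeautifulSoup as soup


class FakeBrowser:
    """Hypothetical stand-in for a Splinter browser (offline, no Chrome needed)."""

    def __init__(self):
        self.closed = False
        self.html = "<h2>Top Ten tags</h2>"

    def visit(self, url):
        pass  # a real browser would load the page here

    def quit(self):
        self.closed = True


def scrape_title(browser, url):
    """Visit the page, return the first <h2> text, and always quit the browser."""
    try:
        browser.visit(url)
        return soup(browser.html, 'html.parser').find('h2').text
    finally:
        browser.quit()  # runs whether or not the scrape succeeded


fake_browser = FakeBrowser()
title = scrape_title(fake_browser, 'http://quotes.toscrape.com/')
print(title)
print(fake_browser.closed)
```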