# Processing Webpages with BeautifulSoup

Welcome! This module is a walkthrough of processing web data with the popular Python package BeautifulSoup.

BeautifulSoup has a terrific [webpage for documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) that has in-depth installation instructions.

### Under the hood of a webpage

![](images/toscrape_screenshot.png)

**Figure 1**: Every part of a webpage is generated from the underlying HTML. BeautifulSoup makes it easy to get this data and do cool things with it.

#### Fortunately
You don't need to know HTML to use BeautifulSoup, but it certainly helps.
For example, you may know what you want from looking at the webpage without understanding how the HTML works underneath, which will limit your efficiency with BeautifulSoup. The more HTML you know, the more effective a tool BeautifulSoup will be.
<br><br><br>
### Creating a soup object

In [2]:
# import necessary packages
from bs4 import BeautifulSoup
import requests

In [4]:
# create the soup object 
url = 'http://quotes.toscrape.com'
r  = requests.get(url)
data = r.text
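# name the parser explicitly so the result doesn't depend on which parsers are installed
soup = BeautifulSoup(data, 'html.parser')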

In [9]:
# printing the soup will print the full soup object [very long if completely printed][not pretty]
print(str(soup)[:200])

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Quotes to Scrape</title>
<link href="/static/bootstrap.min.css" rel="stylesheet"/>
<link href="/static/main.css" rel="stylesheet"


This output looks pretty similar to what you see at the top of the inspect element console in Figure 1.<br>
Additionally, you may notice that there is no formatting involved. When the soup object is printed like a string, each line simply starts at the left margin, with no indentation to highlight the nested structure of the HTML.
<br>
<br>
Let's try to make this a little prettier.

In [11]:
# second try [also long if completely printed][prettier]
print(soup.prettify()[:200])

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" 


Calling the `prettify()` method returns the soup as a string indented to show its nested hierarchy.

In [15]:
# find_all returns a list we can loop over; let's print every p tag
for tag in soup.find_all('p'):
    print(tag.prettify())
    print('-------')

<p>
 <a href="/login">
  Login
 </a>
</p>

-------
<p class="text-muted">
 Quotes by:
 <a href="https://www.goodreads.com/quotes">
  GoodReads.com
 </a>
</p>

-------
<p class="copyright">
 Made with
 <span class="sh-red">
  ❤
 </span>
 by
 <a href="https://scrapinghub.com">
  Scrapinghub
 </a>
</p>

-------


The `find_all` method recursively searches all of the tags in the soup object. You can pass a string with the name of a tag to find every tag of that kind in the soup. I passed 'p' to find all 'p' tags in the soup.<br> 

I printed the line of hyphens to visualize where tags ended. <br>
All three tags in the webpage were printed using the prettify function to highlight any child tags that belonged to the p tags.
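
As a small offline sketch of the same idea, the snippet below runs `find_all` on a made-up HTML string. The markup and the `demo_soup` name are illustrative assumptions, not part of quotes.toscrape.com. It also shows two optional arguments that can be handy: a list of tag names, and `limit`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
html = """
<div>
  <h1>Title</h1>
  <p>first</p>
  <section><p>second (nested)</p></section>
  <p>third</p>
</div>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

# find_all searches the whole tree, so the nested <p> is found too
all_p = demo_soup.find_all('p')
print(len(all_p))

# a list of names matches any of them, in document order
headings_and_p = demo_soup.find_all(['h1', 'p'])
print([tag.name for tag in headings_and_p])

# limit stops the search after the given number of matches
first_two = demo_soup.find_all('p', limit=2)
print([tag.get_text() for tag in first_two])
```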

In [17]:
# get a specific tag
h1_tag = soup.find('h1')
print(h1_tag)

<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>


The `find()` method takes the same arguments as `find_all()` but returns only the first matching tag.<br>
Accessing the tag name as an attribute of the soup, as in the next cell, is a shorthand for the same call.

In [18]:
soup.h1

<h1>
<a href="/" style="text-decoration: none">Quotes to Scrape</a>
</h1>

As the cell above shows, you can ask for a tag as if it were an attribute of the soup object.
<br><br>
If the tag you're looking for isn't in the soup, or you misspell its name, you will get `None` back.

In [22]:
print(soup.h7) # h7 is a nonexistent tag.
print(type(soup.h7))

None
<class 'NoneType'>
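
Because a missing tag comes back as `None`, calling a method on the result raises an `AttributeError`. One way to guard against that is sketched below. The `tag_text` helper and the sample markup are hypothetical names introduced for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
html = '<html><body><h1>Quotes</h1></body></html>'
demo_soup = BeautifulSoup(html, 'html.parser')

def tag_text(soup, name, default=''):
    """Return the text of the first `name` tag, or `default` if it is absent."""
    tag = soup.find(name)
    if tag is None:
        return default
    return tag.get_text()

print(tag_text(demo_soup, 'h1'))
print(tag_text(demo_soup, 'h7', default='(missing)'))
```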


Let's take a second to understand some BeautifulSoup objects. This can help with troubleshooting down the line.

In [25]:
# check out the object types
a_tag = soup.a # first a tag

print('soup type:',type(soup),'\n')
print('tag type:',type(a_tag),'\n')
print('tag attributes:',a_tag.attrs,'\n')
print('tag text type',type(a_tag.string))

soup type: <class 'bs4.BeautifulSoup'> 

tag type: <class 'bs4.element.Tag'> 

tag attributes: {'href': '/', 'style': 'text-decoration: none'} 

tag text type <class 'bs4.element.NavigableString'>


The soup object and tag object are unsurprisingly of type `BeautifulSoup` and `Tag`, respectively.<br>
A tag's attributes are stored in a Python dictionary, and the text of a tag is a `NavigableString`.<br>
Documentation for these BeautifulSoup objects can be found on the documentation site linked at the top of this notebook.
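
To make these types a bit more concrete, here is a minimal sketch of reading attributes and text from a tag. The one-line markup mirrors the `a` tag printed above, while the `link` variable name is an assumption made for this example.

```python
from bs4 import BeautifulSoup

html = '<a href="/" style="text-decoration: none">Quotes to Scrape</a>'
demo_soup = BeautifulSoup(html, 'html.parser')
link = demo_soup.a

# dictionary-style access; raises KeyError if the attribute is missing
print(link['href'])

# .get returns None when the attribute is absent
print(link.get('title'))

# NavigableString is a subclass of str, so string methods work on it
print(link.string.upper())
```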
<br><br><br>
### A new example
Say we want to pull all of the div tags that contain quotes.
<br><br>
After inspecting the source HTML, you can see that each of these tags has a class attribute equal to "quote". You can tell `find_all()` to look for a specific tag with a specific class.<br>
Be careful, though. Other tags on the webpage may share the class value you're searching for, which can return unwanted tags. I checked for this example, and the only div tags with a class value of "quote" are the ones we care about.
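
The effect of combining a tag name with a class can be sketched on a made-up snippet (the markup below is hypothetical, not the real page). It also shows the `class_` keyword, which is an alternative to the dictionary form used in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
html = """
<div class="quote">keep me</div>
<div class="quote featured">keep me too</div>
<span class="quote">not a div</span>
<div class="tags">not a quote</div>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

# class is a reserved word in Python, so the keyword argument is class_
by_keyword = demo_soup.find_all('div', class_='quote')
# equivalent dictionary form
by_dict = demo_soup.find_all('div', {'class': 'quote'})
print([tag.get_text() for tag in by_keyword])

# omitting the tag name widens the search to every tag with that class
any_tag = demo_soup.find_all(class_='quote')
print(len(any_tag))
```

Note that a tag with several classes, such as `class="quote featured"`, still matches a search for `quote`.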

In [36]:
# pull all of the div tags containing quotes
tag_list = soup.find_all('div',{"class": "quote"})
 
first_div = tag_list[0]
print(first_div.prettify())

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
 <span class="text" itemprop="text">
  “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
 </span>
 <span>
  by
  <small class="author" itemprop="author">
   Albert Einstein
  </small>
  <a href="/author/Albert-Einstein">
   (about)
  </a>
 </span>
 <div class="tags">
  Tags:
  <meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
  <a class="tag" href="/tag/change/page/1/">
   change
  </a>
  <a class="tag" href="/tag/deep-thoughts/page/1/">
   deep-thoughts
  </a>
  <a class="tag" href="/tag/thinking/page/1/">
   thinking
  </a>
  <a class="tag" href="/tag/world/page/1/">
   world
  </a>
 </div>
</div>



We can see that our expedition was successful (for the first tag, at least).
<br>
This div tag has all quote data that we care about in it.
<br>
<br>
Now, let's see what happens when we call `get_text()` on a tag that has child tags.

In [39]:
print(first_div.get_text())


“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
by Albert Einstein
(about)


            Tags:
            
change
deep-thoughts
thinking
world




We can see that the text of every tag within this div, whitespace included, is concatenated when we call `get_text()`.
<br>
Let's isolate the quote followed by the author's name.

In [46]:
print(first_div.span.get_text())
print('---')
print(first_div.small.get_text())

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
---
Albert Einstein


Nice
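
If the stray whitespace from `get_text()` gets in the way, the `separator` and `strip` arguments can tidy it up. The following is a minimal sketch on a simplified, made-up version of a quote div.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified quote markup used only for this illustration
html = """
<div class="quote">
  <span class="text">A quote.</span>
  <span>by <small class="author">Some Author</small></span>
</div>
"""
div = BeautifulSoup(html, 'html.parser').div

# separator goes between text fragments; strip=True trims each fragment
# and drops fragments that are only whitespace
cleaned = div.get_text(separator=' | ', strip=True)
print(cleaned)

# stripped_strings yields the same cleaned fragments one at a time
fragments = list(div.stripped_strings)
print(fragments)
```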
<br><br>
### Moving down a quote
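Let's use some of BeautifulSoup's versatile built-in attributes to navigate the tree.<br>
By reading the `next_sibling` attribute of a tag, we can get the next node at the same level of the tree. That node is often the whitespace between two tags rather than a tag, which is why the cell below reads `next_sibling` twice.<br>
Let's do that on our Albert Einstein quote to retrieve the quote that follows it.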

In [57]:
next_div = first_div.next_sibling.next_sibling
print(next_div)
#print(next_div.span.get_text())
#print('---')
#print(next_div.small.get_text())

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
<span>by <small class="author" itemprop="author">J.K. Rowling</small>
<a href="/author/J-K-Rowling">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="abilities,choices" itemprop="keywords"/>
<a class="tag" href="/tag/abilities/page/1/">abilities</a>
<a class="tag" href="/tag/choices/page/1/">choices</a>
</div>
</div>
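
The double `next_sibling` is easier to see on a tiny example. The sketch below uses made-up markup and also shows `find_next_sibling`, which considers only tags and so skips the whitespace for us.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two divs separated by a newline
html = """<div class="quote">first</div>
<div class="quote">second</div>"""
demo_soup = BeautifulSoup(html, 'html.parser')
first = demo_soup.div

# the immediate sibling is the newline between the two tags, not a tag
print(repr(first.next_sibling))

# reading next_sibling twice steps over that whitespace node
print(first.next_sibling.next_sibling.get_text())

# find_next_sibling looks only at tags
print(first.find_next_sibling('div').get_text())
```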


In [None]:
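# any tag works here; we reuse the first quote div from above
example_tag = first_div

# move sideways
example_tag.next_sibling
example_tag.previous_sibling

# move up
print(type(example_tag.parent))

# move down (children is an iterator over the direct child nodes)
print(type(example_tag.children))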

In [None]:
# remove the J.K. Rowling quotes from a fresh soup
url = 'http://quotes.toscrape.com/'
r  = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

for tag in soup.find_all('small'):
    if tag.get_text() == 'J.K. Rowling':
        try:
            tag.parent.parent.decompose()
            print('successfully decomposed tag')
        except Exception as e:
            print(e,'could not decompose tag')
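
The same removal can be tried offline. This sketch uses a made-up two-quote document with hypothetical author names, and checks what is left after `decompose()`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the div > span > small nesting of the page
html = """
<div class="quote"><span>Quote A</span><span><small class="author">Author One</small></span></div>
<div class="quote"><span>Quote B</span><span><small class="author">Author Two</small></span></div>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

# find_all builds its list of results up front, then we loop over that list
for small in demo_soup.find_all('small'):
    if small.get_text() == 'Author Two':
        # small -> enclosing span -> enclosing quote div
        small.parent.parent.decompose()

remaining = [s.get_text() for s in demo_soup.find_all('small')]
print(remaining)
```

`decompose()` removes the tag and everything inside it from the tree, so later searches on the soup no longer see it.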

In [None]:
# pull tags based on attributes
for tag in soup.find_all():
    if 'class' in tag.attrs and 'author' in tag.attrs['class']:
        print(tag.get_text())
        
    # don't compare with ==: bs4 stores multi-valued attributes such as class as a list
    #if 'class' in tag.attrs and tag.attrs['class'] == 'author':
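
Here is a minimal sketch of why the equality check fails. It uses a made-up `small` tag that carries two classes, where `highlight` is a hypothetical class name.

```python
from bs4 import BeautifulSoup

html = '<small class="author highlight" itemprop="author">Albert Einstein</small>'
tag = BeautifulSoup(html, 'html.parser').small

# class can hold several values, so it is stored as a list of strings
print(tag['class'])
# itemprop holds a single value and stays a plain string
print(tag['itemprop'])

# comparing the list with a string is False even though 'author' is present
print(tag['class'] == 'author')
# a membership test is the reliable check
print('author' in tag['class'])
```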

In [None]:
# pull the same tag on every scrape by specifying a CSS selector
# (select returns a list of matches, even when there is only one)
tags = soup.select('body > div > div:nth-child(2) > div.col-md-4.tags-box > span:nth-child(2) > a')
print(tags)

In [None]:
# error handling

# check to ensure the site isn't down (a status code of 200 means the request succeeded)
print(r.status_code)

# when in doubt, check the data type

![](images/http_responses.jpg)
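
One possible way to wrap these checks is sketched below. The `fetch_html` helper and its 10-second default timeout are assumptions made for this example. It relies on `raise_for_status()`, which raises an exception for 4xx and 5xx responses.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page's HTML as a string, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # turn 4xx/5xx status codes into exceptions
        response.raise_for_status()
    except requests.RequestException as error:
        print('request failed:', error)
        return None
    return response.text

# a malformed URL fails before any network traffic, so this is safe to run offline
print(fetch_html('not-a-url'))
```

With a working connection, `fetch_html('http://quotes.toscrape.com')` should return the HTML that can then be passed to `BeautifulSoup`.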

### An important distinction

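`soup.find_all()` returns a list of every matching tag (empty if there are none)<br>
`soup.find()` returns the first matching tag, or `None` if there is none<br>
`soup.select_one()` returns the first tag matching a CSS selector, or `None` if there is none<br>
`soup.select()` returns a list of every tag matching a CSS selector (empty if there are none)<br>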
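
The four calls can be compared side by side on a small made-up list of links (the markup is hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
html = """
<ul>
  <li class="tag"><a href="/tag/love/">love</a></li>
  <li class="tag"><a href="/tag/life/">life</a></li>
</ul>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

print(demo_soup.find('a').get_text())
print([a.get_text() for a in demo_soup.find_all('a')])
print(demo_soup.select_one('li.tag > a').get_text())
print([a['href'] for a in demo_soup.select('li.tag > a')])

# when nothing matches, the single-result calls give None
# and the multi-result calls give empty lists
print(demo_soup.find('table'), demo_soup.select_one('table'))
print(demo_soup.find_all('table'), demo_soup.select('table'))
```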