# Processing Webpages with BeautifulSoup

Welcome! This module walks through processing web data with the popular Python package BeautifulSoup.

BeautifulSoup has a terrific [webpage for documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) that has in-depth installation instructions.

### Under the hood of a webpage

![](images/python_reddit.png)

Every part of a webpage is generated from the underlying HTML. BeautifulSoup parses that HTML into Python objects so you can search it and pull out the data you want.

#### Fortunately
You don't need to know HTML to use BeautifulSoup, but it certainly helps.
For example, you may know what you want from looking at the webpage but not understand how the HTML works underneath, which will limit your efficiency with BeautifulSoup. The more HTML you know, the more effective a tool BeautifulSoup will be.
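
To make that concrete, here is a minimal sketch on a small hand-written HTML snippet (the snippet and the name `demo_soup` are made up for illustration). It shows the three pieces of HTML that matter most when scraping: a tag's name, its attributes, and its text.

```python
from bs4 import BeautifulSoup

# hypothetical HTML snippet, written by hand for illustration
demo_html = '<p class="intro" id="first">Hello, <b>soup</b>!</p>'
demo_soup = BeautifulSoup(demo_html, 'html.parser')

p_tag = demo_soup.find('p')
print(p_tag.name)        # the tag name
print(p_tag.attrs)       # the attributes, as a dictionary
print(p_tag.get_text())  # the text of the tag and everything nested in it
```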
<br><br><br>
### Creating a soup object

In [None]:
# import necessary packages
from bs4 import BeautifulSoup
import requests

In [None]:
# create the soup object
url = 'https://www.reddit.com/r/python'
r = requests.get(url)
data = r.text
# name the parser explicitly so results don't depend on which parsers are installed
soup = BeautifulSoup(data, 'html.parser')

In [None]:
# printing the soup will print the full soup object [very long][not pretty]
print(soup)

In [None]:
# second try [also long][significantly prettier]
print(soup.prettify())

In [None]:
# soup object can be iterated
for tag in soup.find_all():
    print(tag)

In [None]:
# get a specific tag
h1_tag = soup.find('h1')
print(h1_tag)

In [None]:
# check out the object types
print('soup type:',type(soup),'\n')
print('tag type:',type(h1_tag),'\n')
print('tag attributes:',h1_tag.attrs,'\n')
print('tag text type:',type(h1_tag.string))

In [None]:
# pull all of the a tags (links)
a_tag_list = soup.find_all('a')

a_tag_list[1]

In [None]:
for tag in a_tag_list:
    print(tag.get_text())
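
Link text is often only half of what you want from an `a` tag; the destination lives in its `href` attribute. The sketch below uses hand-written HTML (the markup and the name `link_soup` are assumptions for illustration) to show two ways of reading it.

```python
from bs4 import BeautifulSoup

# hypothetical HTML, written by hand for illustration
link_html = '''
<a href="/r/python/wiki">Wiki</a>
<a href="https://www.python.org">Python.org</a>
<a>No destination</a>
'''
link_soup = BeautifulSoup(link_html, 'html.parser')

# tag.get() returns None instead of raising when the attribute is missing
hrefs = [tag.get('href') for tag in link_soup.find_all('a')]
print(hrefs)

# passing href=True keeps only the tags that have an href attribute
with_href = [tag['href'] for tag in link_soup.find_all('a', href=True)]
print(with_href)
```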

In [None]:
# it looks like there's a predictable structure in reddit's webpage
# let's iterate and look for div tags that contain the word 'python'
tag_list = []
for tag in soup.find_all('div'):
    if 'python' in tag.get_text().lower():
        tag_list.append(tag)

example_tag = tag_list[0]

example_tag # pretty ugly
example_tag.prettify() # still terrible
example_tag.get_text() # still terrible somehow

for tag in example_tag.find_all():
    if 'python' in tag.get_text().lower() and len(str(tag)) < 200:
        print(tag)

In [None]:
# move sideways
example_tag.next_sibling
example_tag.previous_sibling

# move up (parent) and down (children)
print(type(example_tag.parent))
print(type(example_tag.children))
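
One thing that often surprises people when moving sideways is that the whitespace between tags counts as a sibling too. Here is a small sketch on hand-written HTML (the markup and the name `nav_soup` are assumptions for illustration) showing the difference between `.next_sibling` and `find_next_sibling()`.

```python
from bs4 import BeautifulSoup

# hypothetical HTML, written by hand for illustration
nav_html = '''<ul>
<li>first</li>
<li>second</li>
</ul>'''
nav_soup = BeautifulSoup(nav_html, 'html.parser')
first_li = nav_soup.find('li')

# the immediate sibling is the newline between the two <li> tags
print(repr(first_li.next_sibling))

# find_next_sibling() skips text nodes and returns the next matching tag
print(first_li.find_next_sibling('li'))

# .parent is a single tag; .children is an iterator, so wrap it in list()
print(first_li.parent.name)
print(len(list(first_li.parent.children)))
```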

In [None]:
# get rid of jk rowling quotes
url = 'http://quotes.toscrape.com/'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

for tag in soup.find_all('small'):
    if tag.get_text() == 'J.K. Rowling':
        try:
            tag.parent.parent.decompose()
            print('successfully decomposed tag')
        except Exception as e:
            print(e,'could not decompose tag')
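
To check that `decompose()` really removes the whole block and not just the tag you found, it helps to try it on something small. This sketch uses hand-written markup loosely modeled on a list of quotes (the markup, author names, and `quote_soup` are made up for illustration).

```python
from bs4 import BeautifulSoup

# hypothetical markup, written by hand for illustration
quote_html = '''
<div class="quote"><span>Quote one</span><small>Author A</small></div>
<div class="quote"><span>Quote two</span><small>Author B</small></div>
'''
quote_soup = BeautifulSoup(quote_html, 'html.parser')

for small in quote_soup.find_all('small'):
    if small.get_text() == 'Author A':
        # remove the enclosing quote block, not just the author tag
        small.parent.decompose()

# only the quotes that were not decomposed are left in the soup
remaining = [s.get_text() for s in quote_soup.find_all('small')]
print(remaining)
```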

In [None]:
# pull tags based on attributes
for tag in soup.find_all():
    if 'class' in tag.attrs and 'author' in tag.attrs['class']:
        print(tag.get_text())
        
    # don't compare against a string: 'class' can hold several values,
    # so BeautifulSoup stores it as a list
    #if 'class' in tag.attrs and tag.attrs['class'] == 'author':
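
A shorter way to filter on class is the `class_` keyword of `find_all()`. The sketch below runs on a hand-written tag (the markup and the name `attr_soup` are assumptions for illustration) and also shows why the string comparison above fails.

```python
from bs4 import BeautifulSoup

# hypothetical HTML, written by hand for illustration
attr_html = '<small class="author bold" itemprop="author">Jane Austen</small>'
attr_soup = BeautifulSoup(attr_html, 'html.parser')
small = attr_soup.find('small')

print(small['class'])     # multi-valued attribute: a list of class names
print(small['itemprop'])  # ordinary attribute: a plain string

# class_ matches any tag that has 'author' among its classes
authors = [t.get_text() for t in attr_soup.find_all(class_='author')]
print(authors)
```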

In [None]:
# pull the same tag every scrape by specifying a selector
# select() returns a list of matches, even when only one tag matches
tags = soup.select('body > div > div:nth-child(2) > div.col-md-4.tags-box > span:nth-child(2) > a')
print(tags)
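
Long copied selectors like the one above tend to break when the page layout shifts, so a shorter selector built from class names is often easier to maintain. Here is a sketch on hand-written markup (the markup, class names, and `sel_soup` are assumptions for illustration, not the real structure of the quotes site).

```python
from bs4 import BeautifulSoup

# hypothetical markup, written by hand for illustration
sel_html = '''
<div class="tags-box">
  <span class="tag-item"><a href="/tag/love/">love</a></span>
  <span class="tag-item"><a href="/tag/life/">life</a></span>
</div>
'''
sel_soup = BeautifulSoup(sel_html, 'html.parser')

# '>' means direct child; select() returns every match as a list
links = sel_soup.select('div.tags-box > span.tag-item > a')
print([a.get_text() for a in links])

# a space means any descendant; select_one() returns only the first match
first = sel_soup.select_one('div.tags-box a')
print(first['href'])
```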

In [None]:
# error handling

# check to ensure the site isn't down
print(str(r))  # shows the status code, e.g. <Response [200]> on success

# when in doubt, check the data type
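
One way to fold these checks into the request itself is a small helper that catches failures. This is only a sketch: the function name `fetch_html` and the 10 second timeout are assumptions, not part of the original notebook.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page HTML as a string, or None if the request fails.

    fetch_html and the default timeout are hypothetical choices for this sketch.
    """
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()  # raises for 4xx and 5xx status codes
    except requests.RequestException as e:
        print('request failed:', e)
        return None
    return r.text

# a malformed URL fails before any network traffic, so this prints None
print(fetch_html('not-a-url'))
```

Calling `fetch_html('http://quotes.toscrape.com/')` instead should return the page text when the site is up, which can then be passed to `BeautifulSoup`.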

![](images/http_responses.jpg)

### An important distinction

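`soup.find_all()` returns a list of every matching tag<br>
`soup.find()` returns the first matching tag, or `None` if nothing matches<br>
`soup.select_one()` returns the first tag matching the CSS selector, or `None` if nothing matches<br>
`soup.select()` returns a list of every tag matching the CSS selector<br>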
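
The four methods can be compared side by side on a tiny hand-written list (the markup and the name `cmp_soup` are made up for illustration). The last two lines show what each kind of method gives back when nothing matches, which is the case most likely to cause errors in a scraper.

```python
from bs4 import BeautifulSoup

# hypothetical HTML, written by hand for illustration
cmp_html = '<ul><li>a</li><li>b</li></ul>'
cmp_soup = BeautifulSoup(cmp_html, 'html.parser')

print(len(cmp_soup.find_all('li')))               # every match
print(cmp_soup.find('li').get_text())             # first match only
print(len(cmp_soup.select('ul > li')))            # every match for the selector
print(cmp_soup.select_one('ul > li').get_text())  # first match for the selector

# single-result methods return None when nothing matches,
# multi-result methods return an empty list
print(cmp_soup.find('table'))
print(cmp_soup.find_all('table'))
```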