### Adding BS4 to get info from all HTML files

Practice

In [4]:
from bs4 import BeautifulSoup

In [5]:
htmltxt = "<p>Hello World</p>"
soup = BeautifulSoup(htmltxt, 'lxml')

print("+++++++++++++++++++++++++++++")

type(soup)

+++++++++++++++++++++++++++++


bs4.BeautifulSoup

In [6]:
soup.text

'Hello World'

In [7]:
soup = BeautifulSoup("""<h1>Hello</h1><p>World</p>""", 'lxml')
soup.text

'HelloWorld'

In [8]:
mytxt = """
<h1>Hello World</h1>
<p>This is a <a href="http://example.com">link</a></p>"""

soup = BeautifulSoup(mytxt, 'lxml')

soup.text

'Hello World\nThis is a link'

In [10]:
soup.find('a')

<a href="http://example.com">link</a>

In [11]:
type(soup.find('a'))

bs4.element.Tag

In [12]:
soup.find('a').text

'link'

In [13]:
from bs4 import BeautifulSoup
mytxt = """
<h1>Hello World</h1>
<p>This is a <a href="http://example.com">link</a></p>
"""
soup = BeautifulSoup(mytxt, 'lxml')
mylink = soup.find('a')

In [14]:
type(mylink.attrs)

dict

In [15]:
mylink.attrs

{'href': 'http://example.com'}

In [16]:
mylink.attrs['href']

'http://example.com'

What about the other tags in our HTML snippet? They have no attributes and thus will have blank dictionaries for their attrs attributes:
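
As a minimal sketch of that claim, the snippet below looks up the `attrs` of the `<h1>` and `<p>` tags from the earlier snippet. It assumes Python's built-in `'html.parser'` instead of `'lxml'` so that it runs without an extra install; the names `h1_attrs` and `p_attrs` are only illustrative.

```python
from bs4 import BeautifulSoup

mytxt = """
<h1>Hello World</h1>
<p>This is a <a href="http://example.com">link</a></p>
"""

# Assumption: 'html.parser' (standard library) stands in for 'lxml' here.
soup = BeautifulSoup(mytxt, 'html.parser')

# Neither tag carries any HTML attributes, so attrs is an empty dict.
h1_attrs = soup.find('h1').attrs
p_attrs = soup.find('p').attrs

print(h1_attrs)
print(p_attrs)
```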

In [17]:
moretxt = """
<p>Visit the <a href='http://www.nytimes.com'>New York Times</a></p>
<p>Visit the <a href='http://www.wsj.com'>Wall Street Journal</a></p>
"""

In [18]:
soup = BeautifulSoup(moretxt, 'lxml')
tags = soup.find_all('a')
type(tags)

bs4.element.ResultSet

#### A ResultSet acts very much like other kinds of Python sequences, such as a list:

In [21]:
len(tags)

2

In [19]:
tags

[<a href="http://www.nytimes.com">New York Times</a>,
 <a href="http://www.wsj.com">Wall Street Journal</a>]

In [22]:
tags[0]

<a href="http://www.nytimes.com">New York Times</a>

In [23]:
tags[0].attrs['href']

'http://www.nytimes.com'

In [24]:
for t in tags:
    print(t.text, t.attrs['href'])

New York Times http://www.nytimes.com
Wall Street Journal http://www.wsj.com


#### However, be careful not to treat the ResultSet as if it were a Tag. Try to understand why the following doesn't make much sense (never mind that it results in an error):

In [25]:
tags.attrs['href']

AttributeError: ResultSet object has no attribute 'attrs'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

#### The HTML attributes exist at a per-tag level. What would you expect attrs to return for a collection of tags? BeautifulSoup has no sensible answer to that, hence the error message.

If what you want is the href value of each tag, you have to collect them the old-fashioned way, with a for-loop:

In [27]:
hrefs = []
for t in tags:
    hrefs.append(t.attrs['href'])

print(hrefs)

['http://www.nytimes.com', 'http://www.wsj.com']
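
The same collection of href values can also be sketched as a list comprehension, which is just a compact way of writing the for-loop. This sketch rebuilds the soup from `moretxt` so that it is self-contained, and it assumes the built-in `'html.parser'` rather than `'lxml'`.

```python
from bs4 import BeautifulSoup

moretxt = """
<p>Visit the <a href='http://www.nytimes.com'>New York Times</a></p>
<p>Visit the <a href='http://www.wsj.com'>Wall Street Journal</a></p>
"""

# Assumption: 'html.parser' (standard library) stands in for 'lxml' here.
soup = BeautifulSoup(moretxt, 'html.parser')

# One href string per <a> tag, in document order.
hrefs = [t.attrs['href'] for t in soup.find_all('a')]

print(hrefs)
```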


## Finding nested elements

What happens when a page contains several groups of link tags, but we want only one of them? In the snippet below, the <a> tags we care about are nested within <h1> tags:

In [29]:
evenmoretxt = """
<h1><a href="http://www.a.com">Awesome</a></h1>
<h1><a href="http://www.b.com">Really Awesome</a></h1>

<div><a href="http://na.com">Ignore me</a></div>
<div><a href="http://127.0.0.1">Ignore me again</a></div>
"""

In [30]:
soup = BeautifulSoup(evenmoretxt, 'lxml')

In [31]:
heds = soup.find_all('h1')

In [32]:
heds

[<h1><a href="http://www.a.com">Awesome</a></h1>,
 <h1><a href="http://www.b.com">Really Awesome</a></h1>]

Each member of heds is a Tag object, and each Tag object has a find() method, which we can use to select just the nested <a> tag:

In [33]:
links = []

In [34]:
for h in heds:
    a = h.find('a')
    links.append(a)

In [35]:
print(links)

[<a href="http://www.a.com">Awesome</a>, <a href="http://www.b.com">Really Awesome</a>]


A more compact version of the same loop:

In [36]:
links = []

In [37]:
for h in heds:
    links.append(h.find('a'))

In [38]:
print(links)

[<a href="http://www.a.com">Awesome</a>, <a href="http://www.b.com">Really Awesome</a>]
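
One thing to keep in mind with this pattern: `find()` returns `None` when a heading contains no `<a>` tag, so reading `.attrs` on the result would raise an error. The sketch below guards against that case. The snippet with a link-less `<h1>` is made up for illustration, and `'html.parser'` is assumed in place of `'lxml'`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second <h1> has no nested link.
snippet = """
<h1><a href="http://www.a.com">Awesome</a></h1>
<h1>No link here</h1>
<div><a href="http://na.com">Ignore me</a></div>
"""

# Assumption: 'html.parser' (standard library) stands in for 'lxml' here.
soup = BeautifulSoup(snippet, 'html.parser')

links = []
for h in soup.find_all('h1'):
    a = h.find('a')
    # find() gives back None when nothing matches, so skip those headings.
    if a is not None:
        links.append(a.attrs['href'])

print(links)
```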


### Real-world example: example.com
Parsing our own hand-constructed HTML is not much fun. So let's get a "real" HTML document from the web.

This part should be familiar:

In [39]:
import requests

In [40]:
resp = requests.get("http://example.com")

In [45]:
txt = resp.text
print(txt)

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 50px;
        background-color: #fff;
        border-radius: 1em;
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        body {
            background-color: #fff;
        }
        div {
            width: auto;
            margin: 0 auto;
            border-radius: 0;
            padding: 1em;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This doma

In [44]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'lxml')

In [46]:
len(soup.find_all('p'))

2

In [47]:
soup.find_all('p')[0].text

'This domain is established to be used for illustrative examples in documents. You may use this\n    domain in examples without prior coordination or asking for permission.'

In [48]:
len(soup.find('h1').text)

14

In [49]:
soup.find('a').text

'More information...'

In [50]:
soup.find('a').attrs['href']

'http://www.iana.org/domains/example'

## Extracting individual press briefings URLs from the White House press briefings list

Now see if you can extract each press briefing URL from this sample White House press briefings page:

http://stash.compjour.org/samples/webpages/whitehouse-press-briefings-page-50.html

#### Processing the press briefings page as soup

Let's turn this convoluted HTML into soup. See if you can remember the steps for downloading the webpage and converting it to a soup object well enough to type them from memory:

In [51]:
import requests
from bs4 import BeautifulSoup

url = 'http://stash.compjour.org/samples/webpages/whitehouse-press-briefings-page-50.html'

resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'lxml')


In [52]:
len(soup.find_all('a'))

263

In [53]:
len(soup.find_all('h3'))

10

In [54]:
# urls = []
# for h in soup.find_all('h3'):
#     a = h.find('a')
#     urls.append(a.attrs['href'])

urls = []
for h in soup.find_all('h3'):
    urls.append(h.find('a').attrs['href'])

In [55]:
urls

['https://www.whitehouse.gov/the-press-office/2013/12/06/press-briefing-press-secretary-jay-carney-1262013',
 'https://www.whitehouse.gov/the-press-office/2013/12/05/daily-briefing-press-secretary-1252013',
 'https://www.whitehouse.gov/the-press-office/2013/12/05/press-briefing-senior-administration-officials-fact-sheet-strengthening-',
 'https://www.whitehouse.gov/the-press-office/2013/12/04/press-briefing-press-secretary-1232013',
 'https://www.whitehouse.gov/the-press-office/2013/12/02/press-briefing-press-secretary-jay-carney-1222013',
 'https://www.whitehouse.gov/the-press-office/2013/11/26/press-gaggle-principal-deputy-press-secretary-josh-earnest-los-angeles-c',
 'https://www.whitehouse.gov/the-press-office/2013/11/25/press-gaggle-principal-deputy-press-secretary-josh-earnest-aboard-air-fo',
 'https://www.whitehouse.gov/the-press-office/2013/11/22/daily-briefing-press-secretary-112213',
 'https://www.whitehouse.gov/the-press-office/2013/11/21/briefing-principal-deputy-press-secr

## All together

To extract the URLs from the canned sample webpage, here's all the code:

In [56]:
import requests
from bs4 import BeautifulSoup as bs

url = 'http://stash.compjour.org/samples/webpages/whitehouse-press-briefings-page-50.html'
resp = requests.get(url)
soup = bs(resp.text, 'lxml')


urls = []

for h in soup.find_all('h3'):
    a = h.find('a')
    urls.append(a.attrs['href'])
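
If you want to reuse this logic, one possible sketch is to wrap the parsing step in a function that takes an HTML string, so it can be tried out without downloading anything. The function name `extract_heading_links`, its `heading` parameter, and the `sample_html` snippet are all hypothetical; the snippet only mimics the `<h3><a>` structure of the briefings page. `'html.parser'` is assumed in place of `'lxml'`.

```python
from bs4 import BeautifulSoup


def extract_heading_links(html, heading='h3'):
    """Return the href of the first <a> inside each heading tag (hypothetical helper)."""
    # Assumption: 'html.parser' (standard library) stands in for 'lxml' here.
    soup = BeautifulSoup(html, 'html.parser')
    urls = []
    for h in soup.find_all(heading):
        a = h.find('a')
        # Skip headings with no link, or links with no href attribute.
        if a is not None and 'href' in a.attrs:
            urls.append(a.attrs['href'])
    return urls


# Made-up HTML that imitates the structure of the sample page.
sample_html = """
<h3><a href="https://example.com/briefing-1">Briefing 1</a></h3>
<h3><a href="https://example.com/briefing-2">Briefing 2</a></h3>
<h3>Heading without a link</h3>
<p><a href="https://example.com/other">Not a briefing</a></p>
"""

urls = extract_heading_links(sample_html)
print(urls)
```

With the real page, you would pass `resp.text` from the `requests.get(url)` call above instead of `sample_html`.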