# Next Steps with HTML Parsing

In our [Introduction to HTML Parsing](./intro-to-html-parsing.ipynb) notebook, we learned the basics of extracting text data from HTML pages. This notebook builds on that foundation by exploring some of BeautifulSoup's more advanced features and some of the harder use cases where it is useful.

## Parsing XML

We've been focusing on "HTML" documents so far, but we can also use BeautifulSoup to parse "XML" documents. For example, the following snippet parses the ECCO TCP's XML version of David Garrick's "Ode on Dedicating a Building":

In [13]:
import bs4

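# read in the xml file; the with block closes the file once parsing is done
with open('Ode.xml') as f:
    soup = bs4.BeautifulSoup(f, 'html.parser')

# get the text content inside the "EEBO" tag
# ('html.parser' lowercases tag names, so we search for 'eebo')
text = soup.find('eebo').get_text()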

# print the text
print(text)



T042012
CW3310254864
0154100200





ADVERTISEMENT.
COULD some gentlemen of approved ability have been prevailed
upon to do justice to the subject of the following Ode, the present
apology would have been unnecessary;—but as it was requisite to
produce something of this kind upon the occasion, and the lot
having unluckily fallen on the person perhaps the least qualified to
succeed in the attempt, it is hoped the candour of the public will
esteem the performance rather as an act of duty, than vanity in the
author.
As some news-paper writers have illiberally endeavoured to shake
the poetic character of our immortal bard (too deeply indeed rooted
in the heart to be affected by them) it is recommended to those
who are not sufficiently established in their dramatic faith, to peruse
a work lately published, called, An Essay on the Writings and Genius
of SHAKESPEARE, by which they will with much satisfaction be
convinced, that England may justly boast the honour of producing
the greatest dr
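
One detail in the snippet above is easy to miss: the comment refers to an "EEBO" tag, yet the code searches for `'eebo'`. Python's built-in `'html.parser'` converts tag names to lowercase while parsing, so the lowercase spelling is the one that matches. Here is a minimal sketch of that behavior. The markup string is made up for illustration and is not taken from "Ode.xml":

```python
import bs4

# a made-up XML fragment with uppercase tag names (illustrative only)
sample = '<EEBO><TEXT>An example line.</TEXT></EEBO>'

soup = bs4.BeautifulSoup(sample, 'html.parser')

# the parser stores tag names in lowercase, so the lowercase search succeeds
lower_match = soup.find('eebo')

# searching with the original uppercase spelling finds nothing
upper_match = soup.find('EEBO')

print(lower_match.get_text())
print(upper_match)
```

The first `print` shows the text inside the tag, and the second shows `None`. If you need tag names to keep their original case, BeautifulSoup also accepts `'xml'` as the parser name, though that option requires the separate `lxml` package to be installed.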

<h2 style='color:green'>Reviewing XML Parsing</h2>

See if you can use the pattern displayed above to read in and then print the text within "Rom.xml". Note that this file does not contain an "eebo" tag.

## Filtering Selections

Sometimes a selection returns a mixture of elements we wish to process and others we wish to skip altogether. For example, suppose a document has multiple `div1` tags, and we only wish to parse some of them. In that case, we can use a conditional to ensure we only process the ones we care about. Let's see this in action:

In [15]:
import bs4

# read in the xml file; the with block closes the file once parsing is done
with open('Ode.xml') as f:
    soup = bs4.BeautifulSoup(f, 'html.parser')

# get a list of the div1 tags
elems = soup.find_all('div1')

# iterate over the div1 tags in soup
for elem in elems:

    # only proceed if the current tag has the attribute type="ode";
    # .get() returns None rather than raising KeyError when a tag has no type
    if elem.get('type') == 'ode':

        # print the text content of this div1 element
        print(elem.get_text())



ODE.

TO what blest genius of the isle,
Shall Gratitude her tribute pay,
Decree the festive day,
Erect the statue, and devote the pile?


Do not your sympathetic hearts accord,
To own the "bosom's lord?"
'Tis he! 'tis he!—that demi-god!
Who Avon's flow'ry margin trod,
While sportive Fancy round him flew,
Where Nature led him by the hand,
Instructed him in all she knew,
And gave him absolute command!
'Tis he! 'tis he!
" The god of our idolatry!"

To him the song, the Edifice we raise,
He merits all our wonder, all our praise!
Yet ere impatient joy break forth,
In sounds that lift the soul from earth;
And to our spell-bound minds impart
Some faint idea of his magic art;
Let awful silence still the air!
From the dark cloud, the hidden light
Bursts tenfold bright!
Prepare! prepare! prepare!
Now swell the choral song,
Roll the full tide of harmony along;
Let Rapture sweep the trembling strings,
And Fame expanding all her wings,
With all her trumpet-tongues proclaim,
The lov'd, rever'd, im

If an HTML or XML document contains multiple tags that match a selection, and you only wish to work with a subset of them, you can use an `if` statement like the one above to filter the matched elements.
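
When the condition is a simple attribute match, `find_all` can also do the filtering itself through its `attrs` argument. The sketch below compares the two approaches on a small made-up fragment. The markup is hypothetical and only loosely modeled on the `div1` structure discussed above:

```python
import bs4

# a made-up fragment (illustrative only); note the last div1 has no type
sample = """
<text>
  <div1 type="advertisement"><p>ADVERTISEMENT.</p></div1>
  <div1 type="ode"><p>ODE.</p></div1>
  <div1><p>No type attribute here.</p></div1>
</text>
"""

soup = bs4.BeautifulSoup(sample, 'html.parser')

# approach 1: select every div1, then keep the ones whose type is "ode"
filtered = [d for d in soup.find_all('div1') if d.get('type') == 'ode']

# approach 2: ask find_all to match on the attribute directly
direct = soup.find_all('div1', attrs={'type': 'ode'})

# both approaches select the same single element
for elem in direct:
    print(elem.get_text())
```

The explicit `if` is still the more flexible choice when the condition is more involved than a single attribute comparison, for example when it depends on an element's text or on several attributes at once.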

<h2 style='color:green'>Practicing Selection Filtering</h2>

Let's practice some selection filtering operations by processing "Farce.xml". See if you can print only the text content within the prologue. To do so, you will need to inspect "Farce.xml" to understand its structure!