2010-06-04-parsing-invalid-xml-with-beautiful-soup.html
---
title: Parsing invalid XML with Beautiful Soup
tags: xml python beautifulsoup
---
<p>
Sometimes, bad XML happens to good people. In my case, I was getting a text stream back from a web-service call that purported to be XML but was not actually well-formed: it had unescaped ampersands in its element text.
</p>
<pre name="code" class="xml">
<?xml version="1.0" encoding='UTF-8'?>
<resume>
<title>Developer & Manager</title>
...
</resume>
</pre>
<p>
This led to the following error when parsing with Python's minidom:
</p>
<pre name="code" class="python">
Traceback (most recent call last):
...
response = minidom.parseString(xml)
File "/usr/lib/python2.6/xml/dom/minidom.py", line 1928, in parseString
return expatbuilder.parseString(string)
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 940, in parseString
return builder.parseString(string)
File "/usr/lib/python2.6/xml/dom/expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
ExpatError: not well-formed (invalid token): line 369, column 2025
</pre>
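<p>
The failure is easy to reproduce in isolation. The snippet below is a minimal sketch: <code>bad_xml</code> is a made-up stand-in for the web-service response (the real payload is much longer), and <code>try_parse</code> is a hypothetical helper name, not part of minidom.
</p>

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical stand-in for the web-service response.
bad_xml = "<resume><title>Developer & Manager</title></resume>"


def try_parse(text):
    """Return (document, None) on success or (None, error message) on failure."""
    try:
        return minidom.parseString(text), None
    except ExpatError as exc:
        # expat rejects the bare ampersand as an invalid token
        return None, str(exc)


doc, error = try_parse(bad_xml)
print(error)
```

<p>
Catching <code>ExpatError</code> like this at least lets the caller decide what to do with a bad response instead of crashing.
</p>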
<p>
Virtually all XML parsers will rightly balk at this input, because <a href="http://articles.techrepublic.com.com/5100-10878_11-5032714.html">it's not well-formed</a>. You could easily work around this issue by replacing all ampersands with the entity &amp;, but the real issue is that the web service is obviously not using an XML library to create the document. It's likely building the document by hand, which means that further cases of invalid XML are quite likely.
</p>
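<p>
For reference, the ampersand workaround might look something like the sketch below. The pattern and the name <code>escape_bare_ampersands</code> are my own assumptions, not anything the web service or minidom provides. It leaves ampersands that already start an entity or character reference alone, so correctly escaped text is not escaped twice. It does nothing for any other kind of malformed markup.
</p>

```python
import re
from xml.dom import minidom

# Match an ampersand that does not already start an entity or character
# reference such as &amp;, &#38; or &#x26;.
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(text):
    """Replace each bare ampersand with &amp; and leave existing references alone."""
    return BARE_AMPERSAND.sub("&amp;", text)


# Illustrative input: one bare ampersand and one that is already escaped.
sample = "<resume><title>Developer & Manager</title><note>R&amp;D</note></resume>"
doc = minidom.parseString(escape_bare_ampersands(sample))
print(doc.getElementsByTagName("title")[0].firstChild.data)
```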
<p>
A more robust solution is to use a lenient parser like <a href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a>, which is actually an HTML parser. Even though it treats the input as HTML rather than XML, it's enough for basic parsing. Beautiful Soup will allow bare ampersands (which HTML tolerates anyway), unclosed tags, bad encodings, or virtually anything else. It's designed to make a best effort no matter what. It's also easy to use.
</p>
<pre name="code" class="python">
from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3, Python 2

# xml is the raw response text from the web service
response = BeautifulSoup(xml)
print response.find("title").string
</pre>
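<p>
To illustrate why an HTML-style parser copes with this input, here is a minimal sketch that uses Python 3's standard-library <code>html.parser</code> as a stand-in for Beautiful Soup. It is not how Beautiful Soup is implemented, and <code>TagTextCollector</code>, <code>tag_text</code> and the sample string are hypothetical names made up for this example.
</p>

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text that appears inside one tag name (hypothetical helper)."""

    def __init__(self, wanted_tag):
        super().__init__()
        self.wanted_tag = wanted_tag.lower()
        self.depth = 0      # how many wanted tags are currently open
        self.chunks = []    # text pieces seen while inside the wanted tag

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.wanted_tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def tag_text(markup, tag):
    """Return the concatenated text found inside every occurrence of tag."""
    collector = TagTextCollector(tag)
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks).strip()


# Illustrative sample modelled on the malformed response in this post.
sample = "<resume>\n<title>Developer & Manager</title>\n</resume>"
print(tag_text(sample, "title"))
```

<p>
The HTML parser passes the bare ampersand through as ordinary text, which is the same leniency that makes Beautiful Soup useful here.
</p>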