[Node 57: SAX Parser](http://www-static.etp.physik.uni-muenchen.de/kurs/Computing/python2/node57.html)

## SAX parser
 

SAX (Simple API for XML) is a widely used, event-driven interface for parsing XML. Implementations exist for Perl, Python, Java, ...

SAX processes a document as a <font color=#008000>*stream*</font>: the parser reads it piece by piece and reports each element as it is encountered, without holding the whole document in memory.

A common alternative is a DOM (Document Object Model) parser, which reads the entire document and keeps it in memory as a tree $\Rightarrow$ well suited for complex tree-like XML structures.
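
For contrast with the streaming approach, here is a minimal DOM sketch using the standard-library module `xml.dom.minidom`. The inline document is a shortened, hypothetical version of the article example used further down; the variable names are chosen for illustration only.

```python
from xml.dom.minidom import parseString

# Hypothetical inline document, shortened from the article example
XML = ('<webArticle category="news">'
       '<header title="NASA Builds Warp Drive"/>'
       '<body>Seattle, WA</body>'
       '</webArticle>')

dom = parseString(XML)             # the whole document becomes a tree in memory
root = dom.documentElement         # the <webArticle> element

header = root.getElementsByTagName("header")[0]
body = root.getElementsByTagName("body")[0]

category = root.getAttribute("category")
title = header.getAttribute("title")
body_text = body.firstChild.data   # the text node below <body>

print(category, "|", title, "|", body_text)
```

Because the complete tree is available, elements can be visited in any order and as often as needed, at the cost of keeping the full document in memory.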


SAX is a good example of applied object-oriented programming: it relies on inheritance and method overriding.

Use:
* Define a <font color=#008000>*handler class*</font> $\Rightarrow$ it inherits from <font color=#0000e6>``xml.sax.handler.ContentHandler``</font>
* Override the default methods of the class where necessary: <font color=#0000e6>``startElement, endElement, ...``</font>
* These are called <font color=#008000>*automatically*</font> while the document is parsed (similar to callback functions in a Tkinter GUI)
* An instance of the handler class is registered with the SAX parser before the document is parsed

### XML example (article.xml)
```xml
<?xml version="1.0"?>
<webArticle category="news" subcategory="technical">
    <header title="NASA Builds Warp Drive"
           length="3k"
           author="Joe Reporter"
           distribution="all"/>
    <body>Seattle, WA - Today an anonymous individual
           announced that NASA has completed building a
           Warp Drive and has parked a ship which uses
           the drive in his back yard.  This individual
           claims that although he hasn't been contacted by
           NASA concerning the parked space vessel, he assumes
           that he will be launching it later this week to
           mount an exhibition to the Andromeda Galaxy.
    </body>
</webArticle>
```


Handler class for the XML example (simplehandler.py):


In [None]:
from xml.sax import make_parser
from xml.sax.handler import ContentHandler


class ArticleHandler(ContentHandler):
    """A handler to deal with articles in XML"""

    def startElement(self, name, attrs):
        # Called by the parser for every opening tag
        print("Start element:", name)

    def endElement(self, name):
        # Called by the parser for every closing tag
        print("End element:", name)


def main():
    ch = ArticleHandler()
    saxparser = make_parser()

    # Register the handler first; the callbacks fire during parse()
    saxparser.setContentHandler(ch)
    saxparser.parse("data/xml_article.xml")


if __name__ == '__main__':
    main()
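
To try the callback mechanism without the data file, the following sketch parses a shortened inline document with `xml.sax.parseString` and records the callbacks in a list instead of printing them. The class name `RecordingHandler` and the inline XML are assumptions made for illustration.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler


class RecordingHandler(ContentHandler):
    """Hypothetical handler that records callbacks instead of printing them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # attrs maps attribute names to values; copy it into a plain dict
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name, None))


# Hypothetical inline document, shortened from the article example
XML = (b'<webArticle category="news">'
       b'<header title="NASA Builds Warp Drive"/>'
       b'<body>text</body>'
       b'</webArticle>')

handler = RecordingHandler()
parseString(XML, handler)

for event in handler.events:
    print(event)
```

The recorded list shows that start and end events arrive in document order and are properly nested; the empty `<header .../>` tag produces a start event immediately followed by an end event.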


### A little more XML processing

* Extend the handler with tag-specific processing
* Add the method <font color=#0000e6>``characters()``</font> to process the actual text content of an element



In [None]:
from xml.sax import make_parser
from xml.sax.handler import ContentHandler

MAXLEN = 800  # maximum number of body characters to keep


class ArticleHandler(ContentHandler):
    """A handler to deal with articles in XML"""

    def __init__(self):
        super().__init__()
        # Per-instance parser state (instead of shared class attributes)
        self.inArticle = False
        self.inBody = False
        self.isMatch = False
        self.title = ""
        self.body = ""

    def startElement(self, name, attrs):
        if name == "webArticle":
            # Only articles whose subcategory contains "tech" are of interest
            subcat = attrs.get("subcategory", "")
            if "tech" in subcat:
                self.inArticle = True
                self.isMatch = True
        elif self.inArticle:
            if name == "header":
                self.title = attrs.get("title", "")
            elif name == "body":
                self.inBody = True

    def characters(self, content):
        # May be called several times for one text node, so accumulate
        if self.inBody:
            self.body += content
            if len(self.body) > MAXLEN:
                # Truncate to exactly MAXLEN characters, ellipsis included
                self.body = self.body[:MAXLEN - 3] + "..."
                self.inBody = False

    def endElement(self, name):
        if name == "body":
            self.inBody = False
        elif name == "webArticle":
            self.inArticle = False


def main():
    ch = ArticleHandler()
    saxparser = make_parser()

    saxparser.setContentHandler(ch)
    saxparser.parse("data/xml_article.xml")

    if ch.isMatch:
        print("News Item!")
        print("Title:", ch.title)
        print("Body:", ch.body)


if __name__ == '__main__':
    main()
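
A SAX parser is free to deliver the text of one element in several `characters()` calls, so a handler should accumulate the pieces rather than assume a single call. The following sketch collects the chunks in a list and joins them when the closing tag arrives; the class name `BodyTextHandler`, the whitespace normalisation, and the inline document are assumptions made for illustration.

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler


class BodyTextHandler(ContentHandler):
    """Hypothetical handler that collects the text content of <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self._chunks = []
        self.body = ""

    def startElement(self, name, attrs):
        if name == "body":
            self.in_body = True
            self._chunks = []

    def characters(self, content):
        # Text may arrive in several pieces; just store each piece
        if self.in_body:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "body":
            self.in_body = False
            # Join the pieces and collapse line breaks and indentation
            self.body = " ".join("".join(self._chunks).split())


# Hypothetical inline document, shortened from the article example
XML = b"""<webArticle>
  <body>Seattle, WA - Today an anonymous
        individual announced a Warp Drive.
  </body>
</webArticle>"""

handler = BodyTextHandler()
parseString(XML, handler)
print(handler.body)
```

Joining only in `endElement` keeps the handler independent of how the parser happens to split the text.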


## Element tree parser
Another simple alternative for plain XML documents is the <font color=#0000ff>**ElementTree**</font> parser; see the documentation and tutorial at https://docs.python.org/3/library/xml.etree.elementtree.html

In the simplest case the entire XML document is read into memory, and the individual elements are then accessible as **nested iterables**: iterating over an element yields its child elements.

In [None]:
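# Applied to the example above
import xml.etree.ElementTree as ET

tree = ET.parse('data/xml_article.xml')   # reads the whole document
root = tree.getroot()
print(root)          # the <webArticle> element
print(root.attrib)   # its attributes as a dict
print(root[0])       # first child element: <header>
for el in root:      # iterating over an element yields its children
    print('Element:', el.tag)
    if el.tag == 'body':
        print(el.text[:200])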
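
Besides positional access and iteration, elements can also be looked up by tag name. The sketch below uses `ET.fromstring` on a shortened inline copy of the article (an assumption, so that it runs without the data file) together with `find`, `findtext`, `get` and `iter`.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline document, shortened from the article example
XML = ('<webArticle category="news" subcategory="technical">'
       '<header title="NASA Builds Warp Drive" author="Joe Reporter"/>'
       '<body>Seattle, WA</body>'
       '</webArticle>')

root = ET.fromstring(XML)             # returns the root element directly

header = root.find("header")          # first direct child with this tag
title = header.get("title")           # attribute lookup, None if missing
body_text = root.findtext("body")     # text of the first matching child
tags = [el.tag for el in root.iter()] # root and all descendants, document order

print(title)
print(body_text)
print(tags)
```

Note that `find` and `findtext` with a plain tag name only search the direct children of the element, whereas `iter` walks the whole subtree.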