## This notebook shows a simple way to convert an XML file to JSON in Python using the xmltodict package

In [1]:
import xml.etree.ElementTree as ET
import xmltodict
import json

Step 1: read the XML text from a file

In [19]:
# Use a context manager so the file handle is closed after reading
with open('library/library.xml') as f:
    xml_string = f.read()
# Alternative input: 'expenseReport/expenses.xml'
print(xml_string)

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE library SYSTEM "library.dtd">
<library location="Bremen">
	<author name="Henry Wise">
	   <book title="Artificial Intelligence"/>
	   <book title="Modern Web Services"/>
	   <book title="Theory of Computation"/>
	</author>
	<author name="William Smart">
		<book title="Artificial Intelligence"/>
	</author>
	<author name="Cynthia Singleton">
	   <book title="The Semantic Web"/>
	   <book title="Browser Technology Revised"/>
	</author>
</library>



Step 2: use xmltodict to convert it to a Python dict

In [20]:
xml_dict = xmltodict.parse(xml_string)
print(xml_dict)

OrderedDict([('library', OrderedDict([('@location', 'Bremen'), ('author', [OrderedDict([('@name', 'Henry Wise'), ('book', [OrderedDict([('@title', 'Artificial Intelligence')]), OrderedDict([('@title', 'Modern Web Services')]), OrderedDict([('@title', 'Theory of Computation')])])]), OrderedDict([('@name', 'William Smart'), ('book', OrderedDict([('@title', 'Artificial Intelligence')]))]), OrderedDict([('@name', 'Cynthia Singleton'), ('book', [OrderedDict([('@title', 'The Semantic Web')]), OrderedDict([('@title', 'Browser Technology Revised')])])])])]))])


Step 3: use the json package to convert the OrderedDict into a JSON-compatible Python object (plain dicts, lists and strings)

In [21]:
xml_json = json.loads(json.dumps(xml_dict))
print(xml_json)

{'library': {'@location': 'Bremen', 'author': [{'@name': 'Henry Wise', 'book': [{'@title': 'Artificial Intelligence'}, {'@title': 'Modern Web Services'}, {'@title': 'Theory of Computation'}]}, {'@name': 'William Smart', 'book': {'@title': 'Artificial Intelligence'}}, {'@name': 'Cynthia Singleton', 'book': [{'@title': 'The Semantic Web'}, {'@title': 'Browser Technology Revised'}]}]}}
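
The round trip through `json.dumps` and `json.loads` changes only the container types, not the content. A minimal sketch of that effect, using a hand-built `OrderedDict` as a stand-in for a small piece of the xmltodict output above (the variable names are illustrative, not part of the notebook). Recent xmltodict releases may already return plain dicts, in which case the step is harmless but unnecessary.

```python
import json
from collections import OrderedDict

# Stand-in for a small piece of the xmltodict output shown above
nested = OrderedDict([('author', OrderedDict([('@name', 'William Smart')]))])

# Serialize to JSON text, then parse it back into plain Python objects
plain = json.loads(json.dumps(nested))

print(type(nested['author']).__name__)  # OrderedDict
print(type(plain['author']).__name__)   # dict
print(plain == nested)                  # True: same keys and values
```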


Step 4: see what its native JSON format looks like when pretty printed

In [22]:
print(json.dumps(xml_json, indent=2))

{
  "library": {
    "@location": "Bremen",
    "author": [
      {
        "@name": "Henry Wise",
        "book": [
          {
            "@title": "Artificial Intelligence"
          },
          {
            "@title": "Modern Web Services"
          },
          {
            "@title": "Theory of Computation"
          }
        ]
      },
      {
        "@name": "William Smart",
        "book": {
          "@title": "Artificial Intelligence"
        }
      },
      {
        "@name": "Cynthia Singleton",
        "book": [
          {
            "@title": "The Semantic Web"
          },
          {
            "@title": "Browser Technology Revised"
          }
        ]
      }
    ]
  }
}
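
Note that William Smart's single `book` comes out as an object while the other authors' books are arrays: by default xmltodict builds a list only when a tag repeats under the same parent. If a uniform shape is preferable, the `force_list` argument of `xmltodict.parse` is one option. A minimal sketch, assuming a shortened inline copy of the library data rather than the file itself:

```python
import xmltodict

# Shortened inline copy of the library data, for illustration only
xml_snippet = """
<library location="Bremen">
  <author name="Henry Wise">
    <book title="Artificial Intelligence"/>
    <book title="Modern Web Services"/>
  </author>
  <author name="William Smart">
    <book title="Artificial Intelligence"/>
  </author>
</library>
"""

# Default behaviour: a tag that occurs once becomes a dict, not a list
default_doc = xmltodict.parse(xml_snippet)
smart_default = default_doc['library']['author'][1]['book']
print(type(smart_default).__name__)

# force_list names the tags that should always be wrapped in a list
forced_doc = xmltodict.parse(xml_snippet, force_list=('book',))
smart_forced = forced_doc['library']['author'][1]['book']
print(type(smart_forced).__name__, len(smart_forced))
```

With `force_list`, code that consumes the result can loop over `book` without first checking whether it is a dict or a list. The same caveat applies to `author` if a library ever contains only one.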


The package prefixes keys that started out as XML attributes with '@' and stores an element's text under the key '#text'. One reason for this is that the package also has a method to convert a dict back to XML. If we don't like these defaults, we can change them by passing additional arguments to the xmltodict parse method. We can also pass an optional argument to process namespaces.
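
Before changing the defaults, here is a minimal sketch of why they exist, using a made-up one-element document (the `<note>` element is an assumption for illustration, not part of the notebook's data). It shows the '@' and '#text' keys and how `xmltodict.unparse` relies on the prefix to tell attributes from child elements.

```python
import xmltodict

# Hypothetical one-element document, used only for illustration
xml_snippet = '<note lang="en">hello</note>'

# Default conventions: attributes get '@', element text is stored under '#text'
default_doc = xmltodict.parse(xml_snippet)
print(default_doc)

# With the defaults, unparse() rebuilds equivalent XML, so parsing it again
# gives back the same structure
round_trip = xmltodict.parse(xmltodict.unparse(default_doc))
print(round_trip == default_doc)

# Without the prefix, 'lang' looks like a child element to unparse(),
# which still expects '@' on attribute keys
bare_doc = xmltodict.parse(xml_snippet, attr_prefix='')
bare_xml = xmltodict.unparse(bare_doc)
print(bare_xml)
```

So dropping the prefix gives cleaner JSON keys at the cost of a faithful conversion back to XML.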

In [25]:
def xml2json(infile, attr='', ns=True):
    """Return a JSON-compatible Python object built from the contents of an XML file."""
    with open(infile) as f:
        parsed = xmltodict.parse(f.read(), attr_prefix=attr, process_namespaces=ns)
    # Round-trip through json to turn OrderedDicts into plain dicts and lists
    return json.loads(json.dumps(parsed))

In [26]:
print(json.dumps(xml2json('library/library.xml'), indent=2))

{
  "library": {
    "location": "Bremen",
    "author": [
      {
        "name": "Henry Wise",
        "book": [
          {
            "title": "Artificial Intelligence"
          },
          {
            "title": "Modern Web Services"
          },
          {
            "title": "Theory of Computation"
          }
        ]
      },
      {
        "name": "William Smart",
        "book": {
          "title": "Artificial Intelligence"
        }
      },
      {
        "name": "Cynthia Singleton",
        "book": [
          {
            "title": "The Semantic Web"
          },
          {
            "title": "Browser Technology Revised"
          }
        ]
      }
    ]
  }
}
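
The library file has no namespaces, so the `process_namespaces` argument had no visible effect above. A minimal sketch of what it does, assuming a made-up document and namespace URI (`http://example.com/x` is a placeholder): prefixed names are rewritten with the full namespace URI, and the optional `namespaces` mapping can shorten them again.

```python
import xmltodict

# Hypothetical document and namespace URI, used only for illustration
xml_snippet = '<root xmlns:x="http://example.com/x"><x:item>1</x:item></root>'

# Default: the prefix is kept exactly as written in the source
as_written = xmltodict.parse(xml_snippet)
print(list(as_written['root'].keys()))

# process_namespaces=True: the prefix is replaced by the full namespace URI
expanded = xmltodict.parse(xml_snippet, process_namespaces=True)
print(expanded['root']['http://example.com/x:item'])

# Optional: map a URI to a short name of our choosing to keep keys readable
short = xmltodict.parse(
    xml_snippet,
    process_namespaces=True,
    namespaces={'http://example.com/x': 'ex'},
)
print(short['root']['ex:item'])
```

Expanding to the URI makes keys independent of whichever prefix a particular file happens to use, at the cost of longer key names.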


Let's look at a more complex XML example with attributes, entity values and namespaces:

In [27]:
print(json.dumps(xml2json('expenseReport/expenses.xml'), indent=2))

{
  "expense-report": {
    "http://www.w3.org/2001/XMLSchema-instance:noNamespaceSchemaLocation": "ExpReport.xsd",
    "currency": "USD",
    "detailed": "false",
    "total-sum": "556.9",
    "xmlns": {
      "xsi": "http://www.w3.org/2001/XMLSchema-instance"
    },
    "Person": {
      "First": "Fred",
      "Last": "Landis",
      "Title": "Project Manager",
      "Phone": "123-456-7890",
      "Email": "f.landis@nanonull.com"
    },
    "expense-item": [
      {
        "type": "Lodging",
        "expto": "Sales",
        "Date": "2003-01-01",
        "expense": "122.11"
      },
      {
        "type": "Lodging",
        "expto": "Development",
        "Date": "2003-01-02",
        "expense": "122.12",
        "description": "Played penny arcade"
      },
      {
        "type": "Lodging",
        "expto": "Marketing",
        "Date": "2003-01-02",
        "expense": "299.45",
        "description": "Treated Clients"
      },
      {
        "type": "Entertainment",
        "exp