_HDS5210 - Programming for Health Data Scientists_

# Week 9 - Data Structures - XML

XML is the abbreviation for eXtensible Markup Language.

In this part of the lecture, we'll be working on reading / processing / writing XML.  You can see the sample file that we'll be working with here: https://www.hl7.org/fhir/patient-example-f201-roel.xml.html

The Python manual for the xml module can be found here: https://docs.python.org/3.6/library/xml.html

An element with no content can be written in two equivalent ways. The self-closing form

 <id value="f201"/>

is the same as

 <id value="f201"></id>
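As a quick check of that equivalence, the sketch below parses both spellings and compares the results. The variable names are illustrative only.

```python
import xml.etree.ElementTree as xml

# Both spellings describe an empty <id> element with one attribute.
self_closing = xml.fromstring('<id value="f201"/>')
open_close = xml.fromstring('<id value="f201"></id>')

same_tag = self_closing.tag == open_close.tag
same_attrib = self_closing.attrib == open_close.attrib
no_children = len(self_closing) == 0 and len(open_close) == 0
print(same_tag, same_attrib, no_children)
```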

In [1]:
import xml.etree.ElementTree as xml

In [15]:
x = """<?xml version="1.0"?>
<start a="1" b="2">My Value</start>
"""

In [16]:
x

'<?xml version="1.0"?>\n<start a="1" b="2">My Value</start>\n'

In [17]:
root = xml.fromstring(x)

In [18]:
root.tag

'start'

In [19]:
root.attrib

{'a': '1', 'b': '2'}

In [20]:
root.text

'My Value'

In [22]:
hds5210 = """<?xml version="1.0"?>
<class name="hds5210" >
This class is about programming in Python.
    <instructor>Paul Boal</instructor>
    <instructor>Eric Westhus</instructor>
</class>
"""

In [23]:
c = xml.fromstring(hds5210)

In [24]:
c.text

'\nThis class is about programming in Python.\n    '

In [25]:
c.tag

'class'

In [26]:
c.attrib

{'name': 'hds5210'}

In [27]:
for child in c:
    print(child.tag, child.text)

instructor Paul Boal
instructor Eric Westhus


In [28]:
len(c)

2
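Note that `c.text` above only returned the text that appears before the first child element. Text that follows a child's closing tag is stored on that child as `tail`. The sketch below uses a made-up `<note>` snippet (not part of the lecture data) to show where each piece of text ends up, and uses `itertext()` to collect all of it.

```python
import xml.etree.ElementTree as xml

# Hypothetical snippet with text both before and after a child element.
doc = xml.fromstring('<note>Before <b>bold</b> after</note>')

child = doc[0]
print(repr(doc.text))     # text before the first child
print(repr(child.text))   # text inside <b>
print(repr(child.tail))   # text after </b>, stored on the child

# itertext() walks the element and yields every text and tail in order.
all_text = ''.join(doc.itertext())
print(all_text)
```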

## Parsing an XML file

In `/data/patient-example-f201-roel.xml` we have an XML version of another sample of FHIR.


In [29]:
%%bash
cat /data/patient-example-f201-roel.xml

<Patient xmlns="http://hl7.org/fhir">
  <id value="f201"/> 
  <text> <status value="generated"/> <div xmlns="http://www.w3.org/1999/xhtml"><p> <b> Generated Narrative with Details</b> </p> <p> <b> id</b> : f201</p> <p> <b> identifier</b> : BSN = 123456789 (OFFICIAL), BSN = 123456789 (OFFICIAL)</p> <p> <b> active</b> : true</p> <p> <b> name</b> : Roel(OFFICIAL)</p> <p> <b> telecom</b> : ph: +31612345678(MOBILE), ph: +31201234567(HOME)</p> <p> <b> gender</b> : male</p> <p> <b> birthDate</b> : 13/03/1960</p> <p> <b> deceased</b> : false</p> <p> <b> address</b> : Bos en Lommerplein 280 Amsterdam 1055RW NLD (HOME)</p> <p> <b> maritalStatus</b> : Legally married <span> (Details : {SNOMED CT code '36629006' = 'Legal marriage', given as 'Legally married'};
           {http://hl7.org/fhir/v3/MaritalStatus code 'M' = 'Married)</span> </p> <p> <b> multipleBirth</b> : false</p> <p> <b> photo</b> : </p> <h3> Contacts</h3> <table> <tr> <td> -</td> <td> <b> Relationship</b> </td> <td> <b> Name</b> </

In [31]:
tree = xml.parse('/data/patient-example-f201-roel.xml')
root = tree.getroot()

Each element in an XML tree has a `tag`, a dictionary of attributes (`attrib`), a body of `text`, and zero or more child elements.
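One way to see all of these parts at once is to walk the tree recursively. The helper below, `describe`, is a hypothetical name introduced for illustration; it is a minimal sketch run against a small inline document rather than the patient file.

```python
import xml.etree.ElementTree as xml

def describe(element, depth=0):
    """Return one indented line per element: tag, attributes, stripped text."""
    text = (element.text or '').strip()
    lines = ['{}{} {} {!r}'.format('  ' * depth, element.tag, element.attrib, text)]
    for child in element:
        # Iterating over an element yields its direct children.
        lines.extend(describe(child, depth + 1))
    return lines

sample = xml.fromstring(
    '<class name="hds5210"><instructor>Paul Boal</instructor></class>'
)
outline = describe(sample)
print('\n'.join(outline))
```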

In [33]:
root.tag

'{http://hl7.org/fhir}Patient'
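The tag includes the namespace URI in curly braces because the file declares `xmlns="http://hl7.org/fhir"` on the root element. If only the local name is needed, it can be split off with plain string handling. `split_tag` below is a hypothetical helper, not part of the `xml` module.

```python
def split_tag(tag):
    """Split '{namespace}local' into (namespace, local); namespace is None if absent."""
    if tag.startswith('{'):
        namespace, _, local = tag[1:].partition('}')
        return namespace, local
    return None, tag

print(split_tag('{http://hl7.org/fhir}Patient'))
print(split_tag('start'))
```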

In [34]:
root.attrib

{}

In [35]:
root.text

'\n  '

We can loop through the children...

In [37]:
for child in root:
    print("{:60s} {:15s} {:d}".format(child.tag, str(child.attrib.get('value')), len(child)))

{http://hl7.org/fhir}id                                      f201            0
{http://hl7.org/fhir}text                                    None            2
{http://hl7.org/fhir}identifier                              None            4
{http://hl7.org/fhir}identifier                              None            4
{http://hl7.org/fhir}active                                  true            0
{http://hl7.org/fhir}name                                    None            6
{http://hl7.org/fhir}telecom                                 None            3
{http://hl7.org/fhir}telecom                                 None            3
{http://hl7.org/fhir}gender                                  male            0
{http://hl7.org/fhir}birthDate                               1960-03-13      0
{http://hl7.org/fhir}deceasedBoolean                         false           0
{http://hl7.org/fhir}address                                 None            5
{http://hl7.org/fhir}maritalStatus                  

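Or, using specific search criteria, we can search for elements that match certain tag or attribute criteria. Note that `findall` with a plain tag name only matches direct children of the element; prefix the path with `.//` to search all descendants.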

In [38]:
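# Prefix-to-URI map passed to find()/findall() so paths can use 'fhir:'
ns = { 'fhir': 'http://hl7.org/fhir' }
# Affects serialization only: output is written with the 'fhir:' prefix
xml.register_namespace('fhir','http://hl7.org/fhir')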

In [39]:
for telecom in root.findall('fhir:telecom', ns):
    print(telecom.attrib, len(telecom))

{} 3
{} 3
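The difference between matching direct children and matching at any depth is easiest to see on a small document. The sketch below uses a cut-down, hypothetical stand-in for the patient file (its structure is an assumption made for illustration) with one `telecom` directly under `Patient` and one nested inside `contact`.

```python
import xml.etree.ElementTree as xml

ns = {'fhir': 'http://hl7.org/fhir'}

# Cut-down, hypothetical stand-in for the Patient file.
patient = xml.fromstring(
    '<Patient xmlns="http://hl7.org/fhir">'
    '<telecom><system value="phone"/><value value="+31612345678"/></telecom>'
    '<contact><telecom><system value="phone"/>'
    '<value value="+31201234567"/></telecom></contact>'
    '</Patient>'
)

direct = patient.findall('fhir:telecom', ns)       # children of <Patient> only
anywhere = patient.findall('.//fhir:telecom', ns)  # any depth below <Patient>

# Paths can be chained: every <value> that sits inside any <telecom>.
numbers = [v.get('value')
           for v in patient.findall('.//fhir:telecom/fhir:value', ns)]
print(len(direct), len(anywhere), numbers)
```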


In [40]:
for nm in root.findall('fhir:name', ns):
    for a in nm:
        print("{:s} --> {:s}".format(str(a.tag), str(a.attrib["value"])))

{http://hl7.org/fhir}use --> official
{http://hl7.org/fhir}text --> Roel
{http://hl7.org/fhir}family --> Bor
{http://hl7.org/fhir}given --> Roelof Olaf
{http://hl7.org/fhir}prefix --> Drs.
{http://hl7.org/fhir}suffix --> PDEng.
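For further processing, it is often handier to collect these tag/value pairs into a dictionary. The sketch below does that for an inline copy of part of the `name` element, assuming each child tag appears only once (a repeated tag would overwrite the earlier value). It uses `.get('value')`, which returns `None` instead of raising `KeyError` when the attribute is missing.

```python
import xml.etree.ElementTree as xml

# Inline copy of part of the <name> element shown above.
name = xml.fromstring(
    '<name xmlns="http://hl7.org/fhir">'
    '<use value="official"/><family value="Bor"/><given value="Roelof Olaf"/>'
    '</name>'
)

# Strip the '{namespace}' prefix and key each value by its local tag name.
name_parts = {child.tag.split('}', 1)[-1]: child.get('value') for child in name}
print(name_parts)
```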


In [41]:
for nm in root.findall('{http://hl7.org/fhir}name'):
    print(xml.tostring(nm))

b'<fhir:name xmlns:fhir="http://hl7.org/fhir"> \n  \n    <fhir:use value="official" /> \n    <fhir:text value="Roel" /> \n    <fhir:family value="Bor" /> \n    <fhir:given value="Roelof Olaf" /> \n    <fhir:prefix value="Drs." /> \n    <fhir:suffix value="PDEng." /> \n  </fhir:name> \n  '
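`xml.tostring` returns `bytes` by default, which is why the output above starts with `b'`. To round out the reading / processing / writing theme, the sketch below builds a small tree by hand, serializes it to a `str`, and writes it through `ElementTree.write`. The element content is made up for illustration, and an in-memory buffer stands in for a real file.

```python
import io
import xml.etree.ElementTree as xml

# Build a small element tree by hand (hypothetical content).
course = xml.Element('class', {'name': 'hds5210'})
instructor = xml.SubElement(course, 'instructor')
instructor.text = 'Paul Boal'

# encoding='unicode' returns str rather than bytes.
as_text = xml.tostring(course, encoding='unicode')
print(as_text)

# ElementTree.write accepts a filename or a file-like object.
buffer = io.BytesIO()
xml.ElementTree(course).write(buffer, encoding='utf-8', xml_declaration=True)

# Parse the written bytes back in to confirm the round trip.
round_trip = xml.fromstring(buffer.getvalue())
print(round_trip.find('instructor').text)
```

Passing a path such as `'output.xml'` to `write` in place of the buffer would save the document to disk.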
