<center>
<table>
  <tr>
    <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

        
<center>
<h1><font color= "blue" size="+3">ASTG Python Courses</font></h1>
</center>

---
<center>
<H1 style="color:red">
Parsing XML Documents with Python
</H1>
</center>

## Reference Documents

* [What is XML?](https://aws.amazon.com/what-is/xml/) from Amazon.
* [Understanding XML, Its Elements, and Benefits](https://www.spiceworks.com/tech/tech-general/articles/what-is-xml/) from spiceworks.com.

## <font color="red"> What is XML?</font>

- Extensible Markup Language (XML) is a markup language that provides rules to define any data.
- XML is similar to HTML, but without predefined tags to use. Instead, you define your own tags designed specifically for your needs.
- Unlike HTML, which is primarily geared toward how data looks (presentation), XML focuses on what data is (content).
- XML lets you define and store data in a shareable manner.
- XML supports information exchange between computer systems such as websites, databases, and third-party applications.
- An XML file is a text-based document, typically saved with the `.xml` extension.
- XML cannot perform computations on its own. Instead, it relies on programming languages or other software to process the data.
- XML enables you to create or define your own language.
   - For example, XHTML, MathML (Mathematical Markup Language), and SVG are all languages defined with XML.

### <font color="blue">Some benefits of using XML</font>

- __Maintain data integrity__: You transfer data along with the data’s description, preventing the loss of data integrity.
- __Improve search efficiency__: Computer programs like search engines can sort and categorize XML files more efficiently and precisely than other types of documents.

### <font color="blue"> XML Tags</font>

- XML uses markup symbols, called tags, to define data.
- Tags label each piece of data so that different systems can exchange and interpret it consistently.
- Software can use these tags to decide how to process the data in the document.
- A tag is a construct that starts with `<` and ends with `>`.
- A tag marks the beginning or end of an element.
   - For example, `<center>` is an opening tag, and `</center>` is a closing tag.
- Tags are case-sensitive.
   - `<center>` and `<Center>` are considered different.
- An element with no content can be written as a single empty tag: `<empty_tag />`.
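
The tag rules above can be checked with Python's standard `xml.etree.ElementTree` module. The snippet below is only a small sketch, and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Matching opening and closing tags parse without error.
ok = ET.fromstring("<center>Goddard</center>")
print(ok.tag, ok.text)

# Tags are case-sensitive: <Center> is not closed by </center>.
try:
    ET.fromstring("<Center>Goddard</center>")
    mismatch_rejected = False
except ET.ParseError as err:
    mismatch_rejected = True
    print("ParseError:", err)

# An empty tag produces an element with no text and no children.
empty = ET.fromstring("<empty_tag />")
print(empty.tag, empty.text, len(empty))
```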

### <font color="blue">XML Syntax</font>

- XML documents embody a "tree" structure, composed of elements arranged in a hierarchical relationship. 
- The first line of an XML document should be a declaration that this is an XML document, including the version of XML being used and the type of encoding (optional).


```xml
<?xml version="1.0" encoding="UTF-8"?>
```

- The main content of an XML document consists entirely of XML __elements__.
- An __element__ is made up of an opening tag, content, and a closing tag.
- Elements can contain text, attributes, and other XML elements.
  - XML elements may be nested; an XML element may have other XML elements as its content.
- An XML document must have a __single root element__, which contains all other XML elements in the document.


- A comment in XML is anything between the delimiters `<!--` and `-->`.
- For readability, it is recommended to indent the contents of each element.
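
The syntax rules above (single root, comments, indentation) can be observed from Python. This is a minimal sketch that assumes Python 3.9 or later, since it relies on `ET.indent`; the element names `root` and `item` are invented for the example.

```python
import xml.etree.ElementTree as ET

# A document with two top-level elements has no single root, so parsing fails.
try:
    ET.fromstring("<a>1</a><b>2</b>")
    two_roots_rejected = False
except ET.ParseError:
    two_roots_rejected = True

# Comments are skipped by the default parser; only elements remain.
root = ET.fromstring("<root><!-- a comment --><item>1</item><item>2</item></root>")
children = [child.tag for child in root]
print(children)

# ET.indent (Python 3.9+) adds whitespace so nested elements are indented.
ET.indent(root, space="    ")
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```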

#### XML attributes

- Attributes provide additional information about XML elements.
- They are always defined within the start tag of an element using the name–value pair syntax, in the format `name="value"`.
   - The attribute value must always be enclosed in quotes, either single (`'`) or double (`"`).
- They are designed to hold data that is related to the specific element. 

```xml
    <center id="1">
        <name>Goddard Space Flight Center</name>
        <state>Maryland</state>
        <location>8800 Greenbelt Road, Greenbelt</location>
    </center>
```
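
As a quick illustration of how such attributes might be read from Python, the sketch below parses a shortened copy of the element above with the standard `xml.etree.ElementTree` module. Attribute values always come back as strings.

```python
import xml.etree.ElementTree as ET

snippet = """
<center id="1">
    <name>Goddard Space Flight Center</name>
    <state>Maryland</state>
</center>
"""
center = ET.fromstring(snippet)

print(center.attrib)          # all attributes as a dictionary
print(center.get("id"))       # one attribute by name
print(center.get("acronym"))  # a missing attribute returns None
```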

#### XML content

- Data embedded within XML files are referred to as XML content.


#### XML schema

- An XML schema defines the allowed structure of an XML document.
- It expresses rules and constraints that the XML document must obey.
- It acts as a ‘definition’ of the XML document.
   - It specifies how the file is structured, which content and data types are allowed, and how these rules interact within the document.
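
The standard `xml.etree.ElementTree` module does not validate documents against a schema. To give a feel for the kind of rule a schema expresses, here is a hand-written check, not real schema validation. The rule set `REQUIRED_CHILDREN` and the helper `missing_children` are hypothetical names introduced only for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule: every <center> must contain these child elements.
REQUIRED_CHILDREN = ("name", "state", "location")

def missing_children(center):
    """Return the required child tags that a <center> element lacks."""
    return [tag for tag in REQUIRED_CHILDREN if center.find(tag) is None]

doc = """
<nasa_centers>
    <center id="1">
        <name>Goddard Space Flight Center</name>
        <state>Maryland</state>
        <location>8800 Greenbelt Road, Greenbelt</location>
    </center>
    <center id="2">
        <name>Stennis Space Center</name>
    </center>
</nasa_centers>
"""
root = ET.fromstring(doc)

# Map each center id to the list of children it is missing.
report = {c.get("id"): missing_children(c) for c in root.findall("center")}
print(report)
```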

#### Disallowed characters

XML uses entity references to represent special characters, such as `<` and `&`, that would otherwise be interpreted as markup when the document is parsed. Entities can also stand in for large blocks of repeated data.

| Character | Description | Entity reference |
| --- | --- | --- |
| `<` | Less than sign | `&lt;` |
| `>` | Greater than sign |  `&gt;` |
| `&` | Ampersand | `&amp;` |
| `'` | Apostrophe | `&apos;` |
| `"` | Quotation mark | `&quot;` |
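
In Python these substitutions rarely need to be typed by hand. The sketch below uses `escape` and `unescape` from the standard `xml.sax.saxutils` module and shows that `ElementTree` applies the same conversion when writing and reading text. The sample string is made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

raw = "Ames & Glenn: 2 < 3"

# escape() replaces &, < and > by default.
escaped = escape(raw)
print(escaped)

# unescape() reverses the substitution.
print(unescape(escaped))

# ElementTree escapes text automatically when serializing...
elem = ET.Element("note")
elem.text = raw
serialized = ET.tostring(elem, encoding="unicode")
print(serialized)

# ...and converts entity references back when parsing.
print(ET.fromstring(serialized).text)
```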

#### Sample XML document

- We define a simple document with a tree structure that starts at the root and branches to the lowest level of the tree.
- After the XML declaration and a comment, the document opens with the root element: `<nasa_centers>`.
- The root has two child elements (`<center>` `</center>`), and each of them has three subelements (`name`, `state`, `location`).
- Each `<center>` has an `id` attribute.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Names and locations of NASA Centers -->
<nasa_centers>
    <center id="1">
        <name>Goddard Space Flight Center</name>
        <state>Maryland</state>
        <location>8800 Greenbelt Road, Greenbelt</location>
    </center>
    <center id="2">
        <name>Stennis Space Center</name>
        <state>Mississippi</state>
        <location>John C. Stennis Space Center</location>
    </center>
</nasa_centers>
```

## <font color="red">Parsing XML documents with Python</font>

Python's standard library includes the `xml.etree.ElementTree` module, which provides functions to read and manipulate XML documents.

In [None]:
import xml.etree.ElementTree as ET

In [None]:
xmldoc = """
<!-- Names and locations of NASA Centers -->
<nasa_centers>
    <center acronym="GSFC">
        <name>Goddard Space Flight Center</name>
        <state>Maryland</state>
        <location>8800 Greenbelt Road, Greenbelt</location>
    </center>
    <center acronym="SSC">
        <name>Stennis Space Center</name>
        <state>Mississippi</state>
        <location>Bay St. Louis</location>
    </center>
    <center acronym="JSC">
        <name>Johnson Space Center</name>
        <state>Texas</state>
        <location>Houston</location>
    </center>
    <center acronym="ARC">
        <name>Ames Research Center</name>
        <state>California</state>
        <location>Moffett Field</location>
    </center>
    <center acronym="GRC">
        <name>Glenn Research Center</name>
        <state>Ohio</state>
        <location>Cleveland</location>
    </center>
</nasa_centers>
"""

In [None]:
xml_url = "nasa_centers.xml"
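
The file name above is assumed to refer to an existing file in the working directory. If the file is not available, one option is to write the XML string to disk first. The sketch below uses a shortened stand-in for `xmldoc` and a temporary directory; `xml_path` is an illustrative name.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Shortened stand-in for the `xmldoc` string defined earlier.
xmldoc = """<nasa_centers>
    <center acronym="GSFC">
        <name>Goddard Space Flight Center</name>
    </center>
</nasa_centers>
"""

# Write the string to a file (a temporary directory is used here as an example).
xml_path = os.path.join(tempfile.mkdtemp(), "nasa_centers.xml")
with open(xml_path, "w", encoding="utf-8") as f:
    f.write(xmldoc)

# ET.parse reads the file and returns an ElementTree object.
tree = ET.parse(xml_path)
print(tree.getroot().tag)
```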

In [None]:
tree = ET.parse(xml_url)

In [None]:
tree

In [None]:
root = tree.getroot()

__In case the XML document is a string, use the following:__

```python
root = ET.fromstring(xmldoc)
```

`fromstring()` parses XML from a string directly into an `Element`, which is the root element of the parsed tree. 
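
Because `fromstring()` returns an `Element` rather than an `ElementTree`, cells that call methods on `tree` need a tree object. One possible approach, sketched here with a tiny made-up document, is to wrap the root element in `ET.ElementTree`.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<nasa_centers><center acronym='GSFC'/></nasa_centers>")

# fromstring() returns an Element, not an ElementTree.
print(type(root).__name__)

# Wrap the root element to get an ElementTree with methods such as iter() and findall().
tree = ET.ElementTree(root)
print(tree.getroot() is root)

tags = [elem.tag for elem in tree.iter()]
print(tags)
```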

As an `Element`, `root` has a tag and a dictionary of attributes:

In [None]:
root.tag

In [None]:
root.attrib

__List the child elements of the root and their attributes__

In [None]:
for child in root:
    print(f"Tag: {child.tag} --> Attributes: {child.attrib}")

In [None]:
for child in root:
    # Print every attribute of the element as a name: value pair
    for key, value in child.attrib.items():
        print(f"Tag: {child.tag} --> {key}: {value}")

__Children are nested, and we can access specific child nodes by index:__

In [None]:
root[0][0].text

In [None]:
root[0][1].text

In [None]:
root[0][2].text
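
Index-based access depends on the order of the elements. As an alternative, elements can be looked up by tag name or attribute value with `find()` and `findtext()`. The sketch below uses a shortened copy of the document so that it runs on its own.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<nasa_centers>
    <center acronym="GSFC">
        <name>Goddard Space Flight Center</name>
        <state>Maryland</state>
    </center>
    <center acronym="JSC">
        <name>Johnson Space Center</name>
        <state>Texas</state>
    </center>
</nasa_centers>
""")

# find() returns the first matching child; findtext() returns its text directly.
first = root.find("center")
print(first.findtext("name"))

# ElementTree's limited XPath support can select by attribute value.
jsc = root.find("center[@acronym='JSC']")
print(jsc.findtext("state"))

# findtext() accepts a default for children that do not exist.
print(jsc.findtext("location", default="unknown"))
```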

__List the tags of all elements__

In [None]:
list_tags = [elem.tag for elem in tree.iter()]
list_tags

__List unique tags__

In [None]:
# dict.fromkeys drops duplicates while preserving first-seen order
unique_tags = list(dict.fromkeys(list_tags))

unique_tags

__Create a list of dictionaries with center information__

In [None]:
data = []
for center in tree.findall("center"):
    # Start from the element's attributes (here, the acronym)
    d = dict(center.attrib)
    # Add the text of each subelement (name, state, location)
    for tag in unique_tags:
        if tag not in ("nasa_centers", "center"):
            d[tag] = center.find(tag).text
    data.append(d)
data
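
The loop above calls `center.find(tag).text`, which raises an `AttributeError` if a center lacks one of the subelements. A possible alternative, sketched below on a shortened document where the second center deliberately has no `state`, is to iterate over the children that are actually present. The names `records` and `record` are illustrative.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<nasa_centers>
    <center acronym="GSFC">
        <name>Goddard Space Flight Center</name>
        <state>Maryland</state>
    </center>
    <center acronym="SSC">
        <name>Stennis Space Center</name>
    </center>
</nasa_centers>
""")

records = []
for center in root.findall("center"):
    record = dict(center.attrib)       # copy all attributes
    for child in center:               # then add each child's tag and text
        record[child.tag] = child.text
    records.append(record)
print(records)
```

When such a list is passed to `pd.DataFrame`, keys missing from a record show up as `NaN` in the corresponding row.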

__Create a Pandas DataFrame with center information__

In [None]:
import pandas as pd

In [None]:
center_df = pd.DataFrame(data)
center_df