### What does XML stand for?

XML stands for Extensible Markup Language.

### What is a markup language?

According to _Wikipedia_:

> In computer text processing, a markup language is a system for annotating a document in a way that is syntactically distinguishable from the text, meaning when the document is processed for display, the markup language is not shown, and is only used to format the text.

### What is the purpose of XML?

- XML is used for sharing data between systems.
- It carries that data in a structured format.

### Basic Terminology

#### Tag

Generally, a string that begins with `<` and ends with `>`.

#### Types of tags

- start-tag, such as `<section>`

- end-tag, such as `</section>`


- empty-element tag, such as `<line-break />` 

#### Element

A logical document component that either:

- begins with a start-tag and ends with a matching end-tag, or

- consists only of an empty-element tag.


##### Examples

- `<greeting>Hello, world!</greeting>`.

- `<line-break />`.
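
As a rough illustration, Python's `xml.etree.ElementTree` module exposes an element's name as `.tag` and its character content as `.text`. The sketch below parses the two examples above; it is a minimal demonstration, not a full parsing recipe.

```python
import xml.etree.ElementTree as ET

# An element with a start-tag, text content, and a matching end-tag.
greeting = ET.fromstring('<greeting>Hello, world!</greeting>')
print(greeting.tag)   # greeting
print(greeting.text)  # Hello, world!

# An empty-element tag forms a complete element with no content.
line_break = ET.fromstring('<line-break />')
print(line_break.tag)   # line-break
print(line_break.text)  # None
```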

#### Attribute

- A name-value pair that exists within a start-tag or empty-element tag.

##### Examples

- `<img src="foo.jpg" alt="foo" />`

<!-- illustrates an attribute with a list of values -->
- `<div class="inner outer-box"> FooBar</div>`
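
A short sketch of how these attributes look once parsed with `xml.etree.ElementTree`: they arrive as a plain dictionary of strings. Note that XML treats `class="inner outer-box"` as a single string value; reading it as a list of values is a convention of the consuming application, which the `split()` call below only imitates.

```python
import xml.etree.ElementTree as ET

img = ET.fromstring('<img src="foo.jpg" alt="foo" />')
print(img.attrib)        # {'src': 'foo.jpg', 'alt': 'foo'}
print(img.get('src'))    # foo.jpg
print(img.get('title'))  # None, because the attribute is absent

div = ET.fromstring('<div class="inner outer-box"> FooBar</div>')
# The parser returns one string; splitting on whitespace recovers
# the individual names.
print(div.get('class').split())  # ['inner', 'outer-box']
```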

### Example
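
A minimal sketch of a complete document, held in a Python string and parsed with `xml.etree.ElementTree`. The document is hypothetical: its `country` and `neighbor` element names are chosen only to line up with the tags queried later in these notes, and all values are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the names and values are made up.
sample_xml = """\
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbor name="Austria" direction="E" />
        <neighbor name="Switzerland" direction="W" />
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <neighbor name="Malaysia" direction="N" />
    </country>
</data>
"""

root = ET.fromstring(sample_xml)

# Walk one level down: each <country> is a child element of <data>.
for country in root:
    neighbors = [n.get('name') for n in country.findall('neighbor')]
    print(country.get('name'), country.find('rank').text, neighbors)
```

The sample combines the pieces from the terminology section: nested elements, text content (`<rank>`), attributes (`name`, `direction`), and empty-element tags (`<neighbor ... />`).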

### Basic Parsing

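In [12]:
import xml.etree.ElementTree as ET

# a small inline document so that `val` is defined (contents are made up)
val = '<data><country name="Panama"><neighbor name="Colombia" /></country></data>'

# for parsing a document from a string
root = ET.fromstring(val)

# alternatively, for parsing from a file
# (assumes 'file.xml' exists in the working directory)
root = ET.parse('file.xml').getroot()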

#### Getting Interesting Elements

In [None]:
for child in root.iter():
    print(child.tag, child.attrib)

In [13]:
for neighbor in root.iter('neighbor'):
    print(neighbor.tag, neighbor.attrib)

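
Besides `iter()`, which searches the whole subtree, `ElementTree` also provides `findall()` and `find()`, which accept a limited XPath-like path syntax. A minimal sketch, assuming a made-up document with `country` and `neighbor` elements:

```python
import xml.etree.ElementTree as ET

# Made-up document for illustration only.
root = ET.fromstring(
    '<data>'
    '<country name="Panama">'
    '<neighbor name="Costa Rica" direction="W" />'
    '<neighbor name="Colombia" direction="E" />'
    '</country>'
    '</data>'
)

# iter() searches the whole subtree, at any depth.
names = [n.get('name') for n in root.iter('neighbor')]
print(names)  # ['Costa Rica', 'Colombia']

# findall() only looks at direct children unless given a path.
print(root.findall('neighbor'))               # []
print(len(root.findall('country/neighbor')))  # 2

# The supported XPath subset includes attribute predicates.
east = root.find(".//neighbor[@direction='E']")
print(east.get('name'))  # Colombia
```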

### More realistic example

In [60]:
URL = 'https://www.hackadda.com/latest/feed/'

# fetch feed from this URL and extract some interesting data
# like, find urls that contain 'django' in them.
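
One possible way to approach this exercise is sketched below. It assumes the feed is RSS 2.0, where each entry is an `<item>` with a `<link>` child; the actual feed layout is not confirmed here, and an Atom feed would need different tag names. The helpers `fetch_feed` and `links_containing` are hypothetical names, and the sample feed content is made up so the example runs without network access.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch_feed(url, timeout=10):
    """Download the feed and return its raw bytes (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def links_containing(feed_xml, keyword):
    """Return the <link> text of every <item> whose link contains keyword."""
    root = ET.fromstring(feed_xml)
    links = []
    # Assumes the RSS 2.0 layout: <rss><channel><item><link>...</link></item>
    for item in root.iter('item'):
        link = item.findtext('link', default='')
        if keyword in link.lower():
            links.append(link)
    return links


# Hypothetical feed content used instead of a live request.
sample_feed = b"""<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><link>https://example.com/post/django-signals/</link></item>
    <item><link>https://example.com/post/python-typing/</link></item>
  </channel>
</rss>"""

print(links_containing(sample_feed, 'django'))
# For the real feed: links_containing(fetch_feed(URL), 'django')
```

Keeping the download and the parsing in separate functions means the parsing logic can be checked against a local string before any request is made.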



#### Iterating through every element

In [63]:
import xml.etree.ElementTree as ET


tree = ET.iterparse('data.xml')

# each item is an (event, element) pair; by default only 'end' events are reported
for _, ele in tree:
    print(ele.tag, ele.attrib)
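
A slightly fuller sketch of `ET.iterparse`, showing the optional `events` argument and the common pattern of clearing elements once they have been handled. It reads from an in-memory `io.BytesIO` object standing in for a file such as `data.xml`; the document contents are made up.

```python
import io
import xml.etree.ElementTree as ET

# In-memory stand-in for a file on disk (contents are made up).
source = io.BytesIO(
    b'<data>'
    b'<country name="Panama"><rank>68</rank></country>'
    b'<country name="Singapore"><rank>4</rank></country>'
    b'</data>'
)

seen = []
for event, elem in ET.iterparse(source, events=('start', 'end')):
    # 'start' fires at the opening tag, before children and text are read;
    # 'end' fires once the element is complete.
    if event == 'end' and elem.tag == 'country':
        seen.append((elem.get('name'), elem.findtext('rank')))
        elem.clear()  # discard the element's contents to free memory

print(seen)  # [('Panama', '68'), ('Singapore', '4')]
```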

### SAX (Simple API for XML)

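- SAX is an event-driven API: the parser reads the document as a stream and calls handler methods as it meets each start-tag, end-tag, and run of text.

- It never builds a tree, so memory use stays low, but handler code tends to be verbose; `ET.iterparse` is usually the more convenient choice.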
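
A minimal sketch of the SAX style using the standard library's `xml.sax` module: the parser calls methods on a handler object as it moves through the document. The `TagCounter` class and the input document are hypothetical examples.

```python
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts how often each tag opens."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called by the parser for every start-tag or empty-element tag.
        self.counts[name] = self.counts.get(name, 0) + 1


handler = TagCounter()
xml.sax.parseString(
    b'<data><country name="Panama" /><country name="Singapore" /></data>',
    handler,
)
print(handler.counts)  # {'data': 1, 'country': 2}
```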

### DOM (Document Object Model) API

- DOM is a cross-language API from the W3C (World Wide Web Consortium) for accessing and modifying XML documents.

In [65]:
from xml.dom.minidom import parse

# path to an XML file; assumed to contain <country name="..."> elements
source = 'data.xml'

tree = parse(source)

countries = tree.getElementsByTagName('country')

for country in countries:
    tag = country.tagName
    children = country.childNodes
    print(tag, 'name:', country.getAttribute('name'), 'child nodes:', len(children))

#### Side notes

- A DOM tree generally consumes a lot of memory, so `xml.etree.ElementTree` is usually preferred over it.

### Security Considerations

From the official Python documentation:

<div class="alert alert-danger">
    
> Warning: The XML modules are not secure against erroneous or maliciously constructed data. If you need to parse untrusted or unauthenticated data see the [XML vulnerabilities](https://docs.python.org/3/library/xml.html#xml-vulnerabilities) and The [defusedxml](https://docs.python.org/3/library/xml.html#defusedxml-package) Package sections. 

</div>

### Other useful parsing libraries

- [`xmltodict`](https://github.com/martinblech/xmltodict)
    - converts XML to a JSON-like object (nested dictionaries)

- [`untangle`](https://github.com/stchris/untangle)  
    - converts XML to a Python object