# XML

In [7]:
import numpy as np
import pandas as pd
import requests
from lxml import html

## XML Tree

* The root node has an incoming edge
* All the edges are labeled and all the leaf nodes are labeled
* A non-leaf node does not have a label. However, it can have some attributes and values, all of which are string types
* The order of the children of a parent node is important, like a list
* The tree represents the actual data (labels and attributes) and meta info (labels of edges)

![title](./pic/xmlTree.png)


```xml
<?xml version="1.0" standalone="yes"?>
<MovieData>
    <Star id="es" movie="lala">
        <Name>Emma Stone</Name>
        <Address>222 Sunset Blvd. Hollywood</Address>
    </Star>
    <Star id="rg" movie="lala">
        <Name>Ryan Gosling</Name>
        <Address>
            <City>Los Angeles</City>
            <Zip>90210</Zip>
        </Address>
    </Star>
    <Movie id="lala" year="2016" actors="rg es">
        <Name>La La Land</Name>
    </Movie>
</MovieData>
```
`standalone="yes"` means the document does not rely on any external markup declarations (such as an external DTD)

Version can be 1.0 or 1.1
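
To make the tree model concrete, here is a minimal sketch that parses the MovieData document with `lxml.etree` and walks it in document order. The helper name `describe` is hypothetical, introduced only for illustration.

```python
from lxml import etree

# The MovieData document from above, as bytes so the XML declaration is accepted.
XML_DOC = b"""<?xml version="1.0" standalone="yes"?>
<MovieData>
    <Star id="es" movie="lala">
        <Name>Emma Stone</Name>
        <Address>222 Sunset Blvd. Hollywood</Address>
    </Star>
    <Star id="rg" movie="lala">
        <Name>Ryan Gosling</Name>
        <Address>
            <City>Los Angeles</City>
            <Zip>90210</Zip>
        </Address>
    </Star>
    <Movie id="lala" year="2016" actors="rg es">
        <Name>La La Land</Name>
    </Movie>
</MovieData>
"""

root = etree.fromstring(XML_DOC)


def describe(node, depth=0):
    """Hypothetical helper: one line per element with its tag, attributes, and leaf text."""
    line = f"{'  ' * depth}{node.tag} {dict(node.attrib)}"
    text = (node.text or "").strip()
    if len(node) == 0 and text:  # leaf node: show its string value
        line += f" -> {text!r}"
    lines = [line]
    for child in node:  # children are ordered, like a list
        lines.extend(describe(child, depth + 1))
    return lines


print("\n".join(describe(root)))

# Information taken from the XML declaration.
info = root.getroottree().docinfo
print(info.xml_version, info.standalone)
```

Attribute values and leaf text both come back as strings, and iterating over an element yields its children in the order they appear in the file.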

## Example DTD with IDs

```xml
<!DOCTYPE MovieData [
    <!ELEMENT MovieData (Star*, Movie*)>
    <!ELEMENT Star (Name,Address+)>
        <!ATTLIST Star
            id ID #REQUIRED
            movie IDREFS #IMPLIED
        >
    <!ELEMENT Name (#PCDATA)>
    <!ELEMENT Address (#PCDATA | City | Zip)*>
    <!ELEMENT City (#PCDATA)>
    <!ELEMENT Zip (#PCDATA)>
    <!ELEMENT Movie (Name)>
        <!ATTLIST Movie
            id ID #REQUIRED
            actors IDREFS #IMPLIED
            year CDATA #IMPLIED
        >
]>
```
* `+` means one or more occurrences of the element, `*` means zero or more
* An `ID` attribute value must be unique within the document, and every name in an `IDREFS` attribute must match an existing `ID`
* `#PCDATA` (parsed character data) means the element contains text
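
As a rough check of how these rules behave, the sketch below validates two small documents with lxml's `etree.DTD` class. The DTD text is written as an external subset (no `<!DOCTYPE ... [ ]>` wrapper) and uses a mixed-content `Address`, a `Movie` element containing only `Name`, and `year CDATA #IMPLIED`, so that lxml accepts the declarations. The names `DTD_TEXT`, `good`, and `bad` are illustrative, and both documents are trimmed versions of the MovieData example.

```python
from io import StringIO
from lxml import etree

# Declarations only; this form is an assumption made for the sketch.
DTD_TEXT = """
<!ELEMENT MovieData (Star*, Movie*)>
<!ELEMENT Star (Name, Address+)>
<!ATTLIST Star
    id ID #REQUIRED
    movie IDREFS #IMPLIED>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Address (#PCDATA | City | Zip)*>
<!ELEMENT City (#PCDATA)>
<!ELEMENT Zip (#PCDATA)>
<!ELEMENT Movie (Name)>
<!ATTLIST Movie
    id ID #REQUIRED
    actors IDREFS #IMPLIED
    year CDATA #IMPLIED>
"""
dtd = etree.DTD(StringIO(DTD_TEXT))

# Conforms: Star has an id, a Name, and at least one Address.
good = etree.fromstring(
    '<MovieData>'
    '<Star id="es" movie="lala"><Name>Emma Stone</Name>'
    '<Address>222 Sunset Blvd. Hollywood</Address></Star>'
    '<Movie id="lala" year="2016" actors="es"><Name>La La Land</Name></Movie>'
    '</MovieData>'
)

# Violates the DTD: Star is missing the required id and has no Address.
bad = etree.fromstring(
    '<MovieData><Star><Name>Emma Stone</Name></Star></MovieData>'
)

good_ok = dtd.validate(good)
bad_ok = dtd.validate(bad)
print(good_ok, bad_ok)
print(dtd.error_log.filter_from_errors())  # reasons the last validation failed
```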

## Namespaces

Namespaces disambiguate tags when two XML vocabularies use the same tag name

```xml
<md:MovieData xmlns:md = "http://www.example.org/movies">
    <md:Star ...>

    </md:Star>
</md:MovieData>
```
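The `md` prefix binds these tags to the namespace identified by "http://www.example.org/movies". The URI only serves as a unique name; nothing is downloaded from it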
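
A small sketch of how a namespaced document looks from Python, assuming lxml. The element content (one `Star` with a `Name`) is filled in only for illustration, and the prefix `m` used in the query is deliberately different from `md` to show that only the URI matters.

```python
from lxml import etree

MOVIES_NS = "http://www.example.org/movies"

doc = etree.fromstring(
    '<md:MovieData xmlns:md="http://www.example.org/movies">'
    '<md:Star id="es"><md:Name>Emma Stone</md:Name></md:Star>'
    '</md:MovieData>'
)

# lxml stores the expanded name: {namespace-uri}local-name
print(doc.tag)

# XPath needs a prefix-to-URI mapping; the prefix itself is arbitrary.
star_ids = doc.xpath('//m:Star/@id', namespaces={"m": MOVIES_NS})

# An unprefixed name only matches elements in no namespace, so this is empty.
no_match = doc.xpath('//Star')

# QName splits an expanded tag back into its parts.
qname = etree.QName(doc)
print(star_ids, no_match, qname.namespace, qname.localname)
```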

## XPath

Language for searching in the tree by specifying a path

* Axes and path steps:
    * `//` = all descendants
    * `.` = self
    * `..` = parent
    * `/` = child
    * `/*/` = any element one level down (skips exactly one level)
* Use `@id` to refer to the attribute `id`
* The result can be a forest (a list of XML subtrees)
* Use lxml's `text_content()` (available on `lxml.html` elements) to get the text of a node
* Use `[]` to specify a condition, and `text()` to get the value of a leaf node inside a condition
* `[]` can also select one element by position when there are several (1-based indexing)
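
The path steps above can be tried against the MovieData example. This is a minimal sketch using the `xpath()` method of `lxml.etree` elements; the variable names are illustrative.

```python
from lxml import etree

root = etree.fromstring(b"""<MovieData>
    <Star id="es" movie="lala">
        <Name>Emma Stone</Name>
        <Address>222 Sunset Blvd. Hollywood</Address>
    </Star>
    <Star id="rg" movie="lala">
        <Name>Ryan Gosling</Name>
        <Address>
            <City>Los Angeles</City>
            <Zip>90210</Zip>
        </Address>
    </Star>
    <Movie id="lala" year="2016" actors="rg es">
        <Name>La La Land</Name>
    </Movie>
</MovieData>""")

# `//`: every Name element anywhere in the tree, in document order.
all_names = root.xpath('//Name/text()')

# `/` and `@`: the id attribute of each Star child of the root.
star_ids = root.xpath('/MovieData/Star/@id')

# `/*/`: Name elements exactly two levels below the root, whatever the parent is.
one_level = root.xpath('/MovieData/*/Name/text()')

# `[]` with text(): condition on the value of a leaf node.
city = root.xpath('//Star[Name/text()="Ryan Gosling"]/Address/City/text()')

# `[]` with a number: the second Star (indexing starts at 1).
second_star = root.xpath('/MovieData/Star[2]/Name/text()')

# `..`: climb from City to Address to Star, then read the id.
owner = root.xpath('//City/../../@id')

# `[]` with an attribute condition.
movies_2016 = root.xpath('//Movie[@year="2016"]/Name/text()')

for result in (all_names, star_ids, one_level, city, second_star, owner, movies_2016):
    print(result)
```

Every query returns a list, even when only one node matches.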

## Scraping Web Pages/Loading XML Files

In [9]:
pageContent = requests.get('https://en.wikipedia.org/wiki/List_of_Olympic_medalists_in_judo')
tree = html.fromstring(pageContent.content)
gold = tree.xpath('//table/tbody/tr/td[2]/a[1]/text()')
silver = tree.xpath('//table/tbody/tr/td[3]/a[1]/text()')
silver.insert(21,'not awarded')
bronze = tree.xpath('//table/tbody/tr/td[not(@rowspan="2")]/a[1]/text()')
bronze1 = bronze[0::2]
bronze2 = bronze[1::2]
games = tree.xpath('//table/tbody/tr/td[1][@rowspan="2"]/a[1]/text()')
games = list(filter(lambda x: x != '2020 Tokyo', games))
df = pd.DataFrame({'games':games,'gold':gold,'silver':silver,'bronze1':bronze1,'bronze2':bronze2})
df = df.set_index('games')
display(df)

| games | gold | silver | bronze1 | bronze2 |
| --- | --- | --- | --- | --- |
| 1980 Moscow | Thierry Rey | José Rodríguez | Tibor Kincses | Aramby Emizh |
| 1984 Los Angeles | Shinji Hosokawa | Kim Jae-yup | Neil Eckersley | Edward Liddie |
| 1988 Seoul | Kim Jae-yup | Kevin Asano | Shinji Hosokawa | Amiran Totikashvili |
| 1992 Barcelona | Nazim Huseynov | Yoon Hyun | Tadanori Koshino | Richard Trautmann |
| 1996 Atlanta | Tadahiro Nomura | Girolamo Giovinazzo | Dorjpalamyn Narmandakh | Richard Trautmann |
| ... | ... | ... | ... | ... |
| 2000 Sydney | Yuan Hua | Daima Beltrán | Kim Seon-Young | Mayumi Yamashita |
| 2004 Athens | Maki Tsukada | Daima Beltrán | Tea Donguzashvili | Sun Fuming |
| 2008 Beijing | Tong Wen | Maki Tsukada | Lucija Polavder | Idalys Ortiz |
| 2012 London | Idalys Ortiz | Mika Sugimoto | Karina Bryant | Tong Wen |
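
Because the live page can change, the same extraction pattern can be tried offline. The sketch below runs comparable XPath queries on a small hand-written table. The HTML and all names in it ("Games A", "Gold A", and so on) are placeholders, not real results, and the table has no `<tbody>`, so the paths omit that step.

```python
from lxml import html

# Placeholder page: each event spans two rows because two bronze medals are awarded.
PAGE = """<html><body>
<table>
  <tr><th>Games</th><th>Gold</th><th>Silver</th><th>Bronze</th></tr>
  <tr>
    <td rowspan="2"><a href="#">Games A</a></td>
    <td rowspan="2"><a href="#">Gold A</a></td>
    <td rowspan="2"><a href="#">Silver A</a></td>
    <td><a href="#">Bronze A1</a></td>
  </tr>
  <tr><td><a href="#">Bronze A2</a></td></tr>
  <tr>
    <td rowspan="2"><a href="#">Games B</a></td>
    <td rowspan="2"><a href="#">Gold B</a></td>
    <td rowspan="2"><a href="#">Silver B</a></td>
    <td><a href="#">Bronze B1</a></td>
  </tr>
  <tr><td><a href="#">Bronze B2</a></td></tr>
</table>
</body></html>"""

tree = html.fromstring(PAGE)

# Cells that span two rows hold games, gold, and silver; the rest hold bronze.
games = tree.xpath('//table/tr/td[1][@rowspan="2"]/a[1]/text()')
gold = tree.xpath('//table/tr/td[2]/a[1]/text()')
silver = tree.xpath('//table/tr/td[3]/a[1]/text()')
bronze = tree.xpath('//table/tr/td[not(@rowspan="2")]/a[1]/text()')

# Bronze medalists alternate between the first and second row of each event.
rows = [
    {"games": g, "gold": go, "silver": s, "bronze1": b1, "bronze2": b2}
    for g, go, s, b1, b2 in zip(games, gold, silver, bronze[0::2], bronze[1::2])
]
for row in rows:
    print(row)

# text_content() returns all text inside an lxml.html element.
first_cell = tree.xpath('//table/tr[2]/td[1]')[0].text_content()
```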
