### XML basics
XML (Extensible Markup Language) is a markup language for describing structured data. It does nothing on its own; its purpose is to store and exchange data in a form that both machines and humans can read.
Below is an example XML file, which we will use in the following cells to demonstrate how to access field values.

In [None]:
!cat "./example_xml.xml"
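
The contents of `example_xml.xml` are not reproduced in this text, so here is a minimal sketch of what such a file might look like. The element names (`students`, `student`, `firstName`, `lastName`, `scores`, `homework1`, `homework2`, `project`) are inferred from the queries used later in this section; the root element name and all values are assumptions made up for illustration.

```python
from lxml import etree

# Hypothetical stand-in for example_xml.xml; names and scores are invented.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<students>
  <student id="1">
    <firstName>Ada</firstName>
    <lastName>Lovelace</lastName>
    <scores>
      <homework1>45</homework1>
      <homework2>50</homework2>
      <project>60</project>
    </scores>
  </student>
  <student id="2">
    <firstName>Alan</firstName>
    <lastName>Turing</lastName>
    <scores>
      <homework1>30</homework1>
      <homework2>55</homework2>
      <project>70</project>
    </scores>
  </student>
</students>
"""

# Parse from bytes, because the text carries its own encoding declaration.
root = etree.fromstring(SAMPLE_XML)

print(root.tag)                       # name of the root element
print([child.tag for child in root])  # tags of its direct children
print(dict(root[0].attrib))           # attributes of the first student
```

An element carries a tag, optional attributes (such as `id`), and either text or nested child elements.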

### XPath
[XPath](https://www.w3.org/TR/1999/REC-xpath-19991116/) is a query language for navigating through the nodes of an XML document. [lxml](https://lxml.de/) is a Python package that provides a data structure for storing the XML document tree and supports XPath queries.

Below are a few examples of accessing the nodes of an XML file in Python.

In [None]:
# Make sure lxml is installed
!pip install lxml

In [None]:
# Parse xml file into lxml's element tree
from lxml import etree


def print_xml(tree, declaration: bool = False):
    print(etree.tostring(tree, encoding='UTF-8', xml_declaration=declaration, pretty_print=True).decode())

xml_tree = etree.parse("./example_xml.xml")

# Let's print
print(xml_tree)
print_xml(xml_tree)

Below is a short example of XPath usage with lxml's ElementTree.

In [None]:
"""
    Select nodes and values
"""
students = xml_tree.xpath("//student")

# We obtain a list of elements
students

In [None]:
# We can explore content of elements

for student in students:
    print(f"Tag: {student.tag}")
    print(f"Attributes: {student.attrib}")
    print(f"Tree:\n{etree.tostring(student, encoding='UTF-8', xml_declaration=True, pretty_print=True).decode()}")


In [None]:
for student in students:
    # Extract attribute value
    student_id = student.get("id")
    # find textual values of children nodes
    first_name = student.xpath("//firstName/text()")
    last_name = student.xpath("//lastName/text()")
    print(f"ID: {student_id}\tName: {' '.join(first_name)} {' '.join(last_name)}")

Note that in the example above the XPath expression `//firstName/text()` starts at the document root, so it finds the first (and last) names of all students, even though it is called on a single student element. To select nodes relative to the current element, start the XPath with `./`.

In [None]:
for student in students:
    student_id = student.get("id")
    first_name = student.xpath("./firstName/text()")
    last_name = student.xpath("./lastName/text()")
    print(f"ID: {student_id}\tName: {' '.join(first_name)} {' '.join(last_name)}")
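
The difference between the two forms can be checked on a small inline document. The sketch below uses an invented two-student document rather than the tutorial's file, and also shows `.//`, which searches all descendants of the current element.

```python
from lxml import etree

# Hypothetical two-student document, invented for illustration.
root = etree.fromstring(
    "<students>"
    "<student id='1'><firstName>Ada</firstName></student>"
    "<student id='2'><firstName>Alan</firstName></student>"
    "</students>"
)
first_student = root.xpath("//student")[0]

# "//" starts at the document root, even when called on a single element.
absolute = first_student.xpath("//firstName/text()")
# "./" starts at the element the query is called on (direct children here).
relative = first_student.xpath("./firstName/text()")
# ".//" searches all descendants of that element, at any depth.
descendants = first_student.xpath(".//firstName/text()")

print(absolute)     # names from every student in the document
print(relative)     # name of this student only
print(descendants)  # name of this student only, at any nesting depth
```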

We can filter nodes by the values of their children and attributes using predicates in `[]`.

In [None]:
"""
    XPath and conditioning on attributes and values
"""

# Students who scored at least 40 points on both homeworks and the project have passed
students_that_passed = xml_tree.xpath("//student[./scores/homework1 >= 40 and ./scores/homework2 >= 40 and ./scores/project >= 40]")
for student in students_that_passed:
    print_xml(student)

In [None]:
# Select all students with id > 1. The id attribute is a string,
# but XPath converts it to a number for the ">" comparison.
students_with_required_id = xml_tree.xpath("./student[@id > 1]")
for student in students_with_required_id:
    print_xml(student)
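
To make the type conversion in predicates concrete, here is a small sketch on an invented document (the ids and scores are assumptions, not taken from `example_xml.xml`). Relational operators such as `>` and `<` compare numbers, while `=` against a quoted literal compares strings.

```python
from lxml import etree

# Hypothetical document; ids and scores are invented for illustration.
root = etree.fromstring(
    "<students>"
    "<student id='1'><scores><project>35</project></scores></student>"
    "<student id='2'><scores><project>70</project></scores></student>"
    "<student id='10'><scores><project>40</project></scores></student>"
    "</students>"
)

# ">" converts both sides to numbers, so the string "10" counts as 10.
numeric = root.xpath("./student[@id > 1]/@id")
# "=" against a quoted literal compares strings, so "10" does not match '1'.
by_string = root.xpath("./student[@id = '1']/@id")
# Conditions on attributes and child values can be combined with "and" / "or".
combined = root.xpath("./student[@id < 10 and ./scores/project >= 40]/@id")

print(numeric, by_string, combined)
```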

### Namespaces, schemas and CMDI

An XML document can combine multiple schemas and namespaces, which helps with aggregating information from multiple sources or providing it in parallel formats. However, when namespaces are present, XPath queries require an explicit mapping from prefixes to namespace Uniform Resource Identifiers (URIs).

Let's look at a real-world example of the metadata we will come across later in this tutorial: the [CMDI](https://www.clarin.eu/content/component-metadata) metadata description of the available OCR data from the Latvian newspaper Valmieras Ziņojumu Lapa.

In [None]:
# Parse the CMDI record and print it together with its XML declaration
xml_tree = etree.parse("./example_cmdi.xml")
print_xml(xml_tree, True)

Let's focus on the XML declaration and the root element:
```xml
<?xml version='1.0' encoding='UTF-8'?>
<cmd:CMD xmlns:cmd="http://www.clarin.eu/cmd/1" xmlns="http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1639731773869" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" CMDVersion="1.2" xsi:schemaLocation="http://www.clarin.eu/cmd/1         http://infra.clarin.eu/CMDI/1.x/xsd/cmd-envelop.xsd         http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1639731773869         https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.x/profiles/clarin.eu:cr1:p_1639731773869/xsd">
```

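An `xmlns:prefix` attribute declares a namespace and binds it to a prefix (here `cmd` and `xsi`). A plain `xmlns` attribute declares the default namespace, which applies to all elements written without a prefix.

XPath has no notion of a default namespace: an unprefixed name in an XPath expression matches only elements that are in no namespace. To query an ElementTree that uses namespaces, we need to pass a prefix-to-URI mapping explicitly.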

In [None]:
print(f"That does not work: {xml_tree.xpath('//Language/code/text()')}")

# Bind the document's default namespace to a prefix of our choosing ("dft")
nsmap = {"cmd": "http://www.clarin.eu/cmd/1",
         "dft": "http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1639731773869"}

print(f"That does work: {xml_tree.xpath('//dft:Language/dft:code/text()', namespaces=nsmap)}")
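
The same behaviour can be reproduced on a tiny inline document. This is a sketch under stated assumptions: the `example.org` URIs and the `e` and `p` prefixes are placeholders, not real CLARIN namespaces. It also shows that lxml stores the default namespace under the key `None` in `nsmap`, which `xpath()` does not accept, so that key has to be renamed before the mapping is passed on.

```python
from lxml import etree

# Hypothetical miniature document with a prefixed and a default namespace.
doc = etree.fromstring(
    '<e:envelope xmlns:e="http://example.org/envelope" '
    'xmlns="http://example.org/profile">'
    "<Language><code>lv</code></Language>"
    "</e:envelope>"
)

# The prefix used in the query is arbitrary; only the URI has to match.
ns = {"p": "http://example.org/profile"}

# Unprefixed names look for elements in no namespace: nothing is found.
unprefixed = doc.xpath("//Language/code/text()")
# Prefixed names resolve through the mapping and match.
prefixed = doc.xpath("//p:Language/p:code/text()", namespaces=ns)
# Namespace-agnostic alternative: compare local names only.
by_local_name = doc.xpath("//*[local-name() = 'code']/text()")

# doc.nsmap keeps the default namespace under None; rename it to "dft".
derived = {(prefix or "dft"): uri for prefix, uri in doc.nsmap.items()}
from_derived = doc.xpath("//dft:code/text()", namespaces=derived)

print(unprefixed, prefixed, by_local_name, from_derived)
```

Matching on `local-name()` is convenient for exploration, but it ignores namespaces entirely, so it can return elements from unrelated vocabularies that happen to share a name.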

Let's explore the collection metadata.

In [None]:
print(xml_tree.xpath('//dft:Description/dft:description/text()', namespaces=nsmap))

In [None]:
resource_list = xml_tree.xpath('//cmd:ResourceProxyList', namespaces=nsmap)

In [None]:
resource_list = resource_list[0]
print_xml(resource_list, True)

We can access metadata about resources by year of the issue.

In [None]:
resource_id = resource_list.xpath("./cmd:ResourceProxy[./cmd:ResourceType/text() = 'Metadata']/@id", namespaces=nsmap)
print(resource_id)

In [None]:
_id = resource_id[0]
print(f"Ref ID: {_id}")
ref_1900 = resource_list.xpath("./cmd:ResourceProxy[@id = '_1900']/cmd:ResourceRef/text()", namespaces=nsmap)
print(f"Ref link: {ref_1900}")
ref_1900 = ref_1900[0]
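
The lookup above hard-codes the id `_1900` and indexes the result with `[0]`, which raises an `IndexError` when nothing matches. A possible alternative, sketched below, passes the id as an XPath variable (lxml binds keyword arguments of `xpath()` to `$name` variables) and returns `None` for unknown ids. The helper `resolve_ref`, the proxy list contents, and the `example.org` URLs are all hypothetical.

```python
from lxml import etree

CMD_NS = "http://www.clarin.eu/cmd/1"
ns = {"cmd": CMD_NS}

# Hypothetical ResourceProxyList; the ids and URLs are invented.
proxy_list = etree.fromstring(
    f'<cmd:ResourceProxyList xmlns:cmd="{CMD_NS}">'
    '<cmd:ResourceProxy id="_1900">'
    "<cmd:ResourceType>Metadata</cmd:ResourceType>"
    "<cmd:ResourceRef>https://example.org/1900.xml</cmd:ResourceRef>"
    "</cmd:ResourceProxy>"
    '<cmd:ResourceProxy id="_1901">'
    "<cmd:ResourceType>Metadata</cmd:ResourceType>"
    "<cmd:ResourceRef>https://example.org/1901.xml</cmd:ResourceRef>"
    "</cmd:ResourceProxy>"
    "</cmd:ResourceProxyList>"
)


def resolve_ref(proxies, proxy_id):
    """Return the ResourceRef text for a proxy id, or None if it is absent."""
    # $pid is an XPath variable bound through the keyword argument below.
    refs = proxies.xpath(
        "./cmd:ResourceProxy[@id = $pid]/cmd:ResourceRef/text()",
        namespaces=ns,
        pid=proxy_id,
    )
    return str(refs[0]) if refs else None


print(resolve_ref(proxy_list, "_1900"))
print(resolve_ref(proxy_list, "_1850"))  # no such proxy in this sketch
```

Binding the value as a variable also avoids quoting problems that can occur when an id is interpolated directly into the XPath string.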

We can resolve referenced metadata in order to familiarise ourselves with how metadata links to the underlying bibliographic resources. 

In [None]:
import requests

response = requests.get(ref_1900)
response.raise_for_status()  # stop early if the download failed
xml_tree_1900 = etree.fromstring(response.content)
print_xml(xml_tree_1900)

Let's investigate the Subresource elements of the XML tree.

In [None]:
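# The yearly record uses a different profile, so "dft" must point to its namespace
nsmap["dft"] = "http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1633000337997"
subresources = xml_tree_1900.xpath("//dft:Subresource", namespaces=nsmap)

# Iterating over an element yields its children: print each child of the fourth Subresource
for child in subresources[3]:
    print_xml(child)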

Let's select all issues from January 1900 and retrieve their download links.

In [None]:
import re  # standard-library regular expressions

In [None]:
# Keep the subresources whose temporal coverage label contains "-01-" (January)
january_subresources = [
    subresource for subresource in subresources
    if subresource.xpath(
        "./dft:SubresourceDescription/dft:TemporalCoverage/dft:label[contains(text(), '-01-')]",
        namespaces=nsmap)
]

# Each Subresource points to its ResourceProxy through the cmd:ref attribute
cmd_refs = [subresource.xpath("./@cmd:ref", namespaces=nsmap)[0] for subresource in january_subresources]

# Look up the download link of each referenced ResourceProxy
direct_data_links = [xml_tree_1900.xpath(f"//cmd:ResourceProxy[@id='{cmd_ref}']/cmd:ResourceRef/text()", namespaces=nsmap)[0] for cmd_ref in cmd_refs]
print(direct_data_links)
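
The link between a `Subresource` and its `ResourceProxy` can also be followed offline. The sketch below builds a miniature record that mirrors only the element names used in the queries above; the overall layout, the profile URI, the ids, and the URLs are assumptions made for illustration. Instead of running one XPath query per reference, it builds an id-to-link dictionary once.

```python
from lxml import etree

ns = {
    "cmd": "http://www.clarin.eu/cmd/1",
    "dft": "http://example.org/hypothetical-profile",  # placeholder URI
}

# Hypothetical miniature record; ids, dates and URLs are invented.
record = etree.fromstring(
    '<cmd:CMD xmlns:cmd="http://www.clarin.eu/cmd/1" '
    'xmlns="http://example.org/hypothetical-profile">'
    "<cmd:Resources><cmd:ResourceProxyList>"
    '<cmd:ResourceProxy id="r1">'
    "<cmd:ResourceRef>https://example.org/1900-01-05.zip</cmd:ResourceRef>"
    "</cmd:ResourceProxy>"
    '<cmd:ResourceProxy id="r2">'
    "<cmd:ResourceRef>https://example.org/1900-02-02.zip</cmd:ResourceRef>"
    "</cmd:ResourceProxy>"
    "</cmd:ResourceProxyList></cmd:Resources>"
    "<cmd:Components>"
    '<Subresource cmd:ref="r1"><SubresourceDescription><TemporalCoverage>'
    "<label>1900-01-05</label>"
    "</TemporalCoverage></SubresourceDescription></Subresource>"
    '<Subresource cmd:ref="r2"><SubresourceDescription><TemporalCoverage>'
    "<label>1900-02-02</label>"
    "</TemporalCoverage></SubresourceDescription></Subresource>"
    "</cmd:Components>"
    "</cmd:CMD>"
)

# Build the id -> link lookup once.
links_by_id = {
    proxy.get("id"): proxy.findtext("cmd:ResourceRef", namespaces=ns)
    for proxy in record.xpath("//cmd:ResourceProxy", namespaces=ns)
}

# Subresources whose coverage label falls in January.
january = record.xpath(
    "//dft:Subresource[.//dft:label[contains(text(), '-01-')]]",
    namespaces=ns,
)

# Namespaced attributes are addressed in Clark notation: {uri}name.
ref_attr = f"{{{ns['cmd']}}}ref"
january_links = [links_by_id[sub.get(ref_attr)] for sub in january]
print(january_links)
```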

More concrete examples of data retrieval will follow in Chapter 2. Hopefully, you now have a grasp of how the metadata links to the textual resources.