In [None]:
# TODO: factor out this preamble
nsmap = {"cmd": "http://www.clarin.eu/cmd/1",
         "cmdp_text": "http://www.clarin.eu/cmd/1/profiles/clarin.eu:cr1:p_1633000337997"}

# Exercise 2: working with metadata and other files

In this exercise you will learn how to extract information from metadata and how to work with files in general. If you think you already have these skills, we recommend that you run all cells, review their contents and output, and then move on to [Exercise 3](./Exercise3.ipynb).

----
Before we can run the cells in this exercise, we need to install and 'import' some packages. Run the following cell. Don't worry about understanding everything that is happening here for the moment, but verify that it completes without errors in the output.

Note that this cell might take a bit longer to run. While `[*]` is shown next to a cell, the cell is either being processed or waiting to be processed and has not yet completed.

In [None]:
!pip install lxml requests

from lxml import etree
import requests

## Exercise 2.1
In this exercise we will take a real metadata file and read some content out of it.

We will start by downloading a metadata record that we have [found in a catalogue](https://vlo.clarin.eu/record/https_58__47__47_europeana-oai.clarin.eu_47_metadata_47_fulltext-aggregation_47_9200357_47_Dziennik__l_ski_1915.xml).

**Read the following cell and try to understand what it does.** 

Note that the lines starting with a `#` character are so-called comments, which are used to document code. In this case they explain what is happening on each line.

In [None]:
# The next line gets a metadata record from a URL and stores the result in memory. The result gets the name `remote_metadata` for use and reference later on
remote_metadata = requests.get('https://europeana-oai.clarin.eu/metadata/fulltext-aggregation/9200357/Dziennik__l_ski_1915.xml')

# After the next line, the field `content` (the raw bytes of the response) inside the retrieved data can be referenced using the name `metadata_content`
metadata_content = remote_metadata.content

# The final line yields a number, which represents the length of the metadata record in bytes. Because it is the last line of the cell,
# it is considered the "result" of the cell and is therefore also shown after the cell has finished running
len(metadata_content)

We have now loaded the metadata into memory. The next step is to 'parse' it, that is, to read and process the structure of the document so that we can easily query it for the values we are interested in.
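
To get a feel for what parsing produces, here is a minimal sketch on a tiny hand-written document. The names `sample_bytes` and `sample_root` and the XML itself are made up for illustration; they are not part of the real metadata record, which is much larger and uses namespaces.

```python
from lxml import etree

# Hypothetical stand-in for downloaded bytes (not the real record)
sample_bytes = b'<record><title>Example title</title><year>1915</year></record>'

# Parsing turns the raw bytes into a tree of elements
sample_root = etree.fromstring(sample_bytes)

# The root element and its children can now be inspected directly
root_tag = sample_root.tag
child_tags = [child.tag for child in sample_root]
title_text = sample_root.find('title').text

print(root_tag, child_tags, title_text)
```

The point of the sketch is that, after parsing, we no longer deal with one long sequence of bytes but with elements that have a name, children and text.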

In [None]:
# First we have to do the parsing. This results in a so-called 'XML tree', a hierarchical representation of the metadata document.
metadata_tree = etree.fromstring(metadata_content)

# We can now use the metadata tree variable to see if we can look up the title of the record
metadata_tree.xpath('//cmdp_text:TitleInfo/cmdp_text:title/text()', namespaces=nsmap)

As you may have noticed, we use the `xpath` function which is available when we have a parsed XML tree (`metadata_tree` in the cell above). The name of the function refers to **XPath**, which is a special language that is very commonly used to query XML documents. Its basic functions are easy to use, as can be seen above, but it also has many advanced features. To learn more about XPath, have a look at the [XPath page on Wikipedia](https://en.wikipedia.org/wiki/XPath) or follow [this tutorial](https://www.w3schools.com/xml/xpath_intro.asp).
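
The role of the `namespaces` argument can be seen more easily on a small example. The following is a sketch only: the namespace URI `http://example.org/ns`, the prefix `ex` and the document are hypothetical and unrelated to the metadata record used in this exercise.

```python
from lxml import etree

# Hypothetical namespace mapping and document, for illustration only
sample_nsmap = {"ex": "http://example.org/ns"}
sample_xml = b'''<ex:library xmlns:ex="http://example.org/ns">
    <ex:book><ex:name>First book</ex:name></ex:book>
    <ex:book><ex:name>Second book</ex:name></ex:book>
</ex:library>'''

sample_tree = etree.fromstring(sample_xml)

# '//' searches at any depth; 'text()' selects the text inside the matched elements
names = sample_tree.xpath('//ex:book/ex:name/text()', namespaces=sample_nsmap)

# xpath() returns a list, even when there is only one match (or none at all)
first_name = names[0]
print(names)
```

Each prefix used in the XPath expression (here `ex`) is looked up in the mapping passed as `namespaces`. What has to match the document is the namespace URI, not the prefix itself. Because the result is a list, a single value is obtained by indexing, as with `names[0]` above.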

----
The following snippet shows a part of the XML document that we have loaded (we encourage you to also have a look at the [full file](https://europeana-oai.clarin.eu/metadata/fulltext-aggregation/9200357/Dziennik__l_ski_1915.xml)):
```xml
<cmd:CMD>
    ...
    <cmd:Components>
        <cmdp_text:TextResource>
            <cmdp_text:TitleInfo>
                <cmdp_text:title>Dziennik Śląski - 1915</cmdp_text:title>
            </cmdp_text:TitleInfo>
            <cmdp_text:Description>
                <cmdp_text:description>
                    Full text content aggregated from Europeana. Title: "Dziennik Śląski". Year: 1915.
                </cmdp_text:description>
            </cmdp_text:Description>
                ...
        </cmdp_text:TextResource>
    </cmd:Components>
</cmd:CMD>
```

**In the next cell, write a line that returns the _description_ contained in the metadata.** Run the cell and verify that the value matches the description found in the metadata snippet listed above.

Note: you can use the `metadata_tree` variable, since it has been defined in the previous cell. Variables remain available to use in following cells once they are defined.

In [None]:
# Write the line below that looks up and returns the value of the description as found in the metadata


## Exercise 2.2


----
Next: [Exercise 3](./Exercise3.ipynb)