# Exercise 2: Working with Metadata and other Files

In this exercise you will learn how to extract information from metadata, and how to work with files in general. If you think you already have these skills, we recommend running all the cells, reviewing the contents and output, and using this knowledge to move on to [Exercise 3](./Exercise3.ipynb).

----
Before we can run the cells in this exercise, we need to install and 'import' some packages. Run the following cell. Don't worry about understanding everything that is happening here for the moment, but verify that it completes without errors in the output.

Note that this cell might take a while to run. While `[*]` is shown next to a cell, it is being processed or waiting to be processed and has not yet completed.

In [None]:
!pip install lxml

from lxml import etree
import requests

%run ../common.py

----
## Exercise 2.1
In this exercise we will take a real metadata file and read some content out of it.

We will start by downloading a metadata record that we have [found in a catalogue](https://vlo.clarin.eu/record/https_58__47__47_europeana-oai.clarin.eu_47_metadata_47_fulltext-aggregation_47_9200357_47_Dziennik__l_ski_1916.xml).

**Read the following cell and try to understand what it does.** 

Note that the lines starting with a `#` character are so-called comments, which are used to document code. In this case they are used to explain what is happening on each line.

In [None]:
# The next line gets a metadata record from a URL and stores the result in memory. The result gets the name `remote_metadata` for use and reference later on
remote_metadata = requests.get('https://europeana-oai.clarin.eu/metadata/fulltext-aggregation/9200357/Dziennik__l_ski_1916.xml')

# After the next line, the field `content` inside the retrieved data can be referenced using the name `metadata_content`
metadata_content = remote_metadata.content

# The final line yields a number, which represents the length of the metadata record in bytes. Because it is the last line of the cell, it is considered
# the "result" of the cell and is therefore also shown after the cell has finished running
len(metadata_content)

We have now loaded the metadata locally as a file in memory. The next step is to 'parse' it - that is, read and process the structure of the document so that we can easily query it for values that we are interested in.

In [None]:
# First we have to do the parsing. This results in a so-called 'XML tree', a hierarchical representation of the metadata document.
metadata_tree = etree.fromstring(metadata_content)

# We can now use the metadata tree variable to see if we can look up the title of the record
metadata_tree.xpath('//cmdp_text:TitleInfo/cmdp_text:title/text()', namespaces=nsmap)

As you may have noticed, we use the `xpath` function which is available when we have a parsed XML tree (`metadata_tree` in the cell above). The name of the function refers to **XPath**, which is a special language that is very commonly used to query XML documents. Its basic functions are easy to use, as can be seen above, but it also has many advanced features. To learn more about XPath, have a look at the [XPath page on Wikipedia](https://en.wikipedia.org/wiki/XPath) or follow [this tutorial](https://www.w3schools.com/xml/xpath_intro.asp).
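
The `namespaces` argument in the call above takes a mapping from short prefixes (such as `cmdp_text`) to full namespace addresses; in this notebook that mapping, `nsmap`, is presumably defined by `common.py`. As a rough, self-contained sketch of the idea, the following cell queries a small made-up XML document. The document, the `ex` prefix and its address are invented for illustration. To stay independent of the notebook's setup it uses Python's built-in `xml.etree.ElementTree`, which supports only a subset of XPath, rather than `lxml`.

```python
import xml.etree.ElementTree as ET

# A made-up document; the 'ex' prefix and its address are purely illustrative
sample_xml = """<ex:library xmlns:ex="http://example.org/ns">
    <ex:book><ex:name>First book</ex:name></ex:book>
    <ex:book><ex:name>Second book</ex:name></ex:book>
</ex:library>"""

# The prefix map tells the query which namespace each prefix stands for
example_nsmap = {'ex': 'http://example.org/ns'}

# Parse the text into a tree, just like we did with the metadata record
sample_tree = ET.fromstring(sample_xml)

# Find every 'name' element inside a 'book' element, anywhere below the root
name_elements = sample_tree.findall('.//ex:book/ex:name', example_nsmap)

# Collect the text inside each element that was found
book_names = [element.text for element in name_elements]
print(book_names)
```

Without the prefix map, the query would not know what `ex:` refers to and would fail, which is why every query in this exercise passes `namespaces=nsmap`.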

----
The following snippet shows a part of the XML document that we have loaded (we encourage you to also have a look at the [full file](https://europeana-oai.clarin.eu/metadata/fulltext-aggregation/9200357/Dziennik__l_ski_1916.xml)):
```xml
<cmd:CMD>
    ...
    <cmd:Components>
        <cmdp_text:TextResource>
            <cmdp_text:TitleInfo>
                <cmdp_text:title>Dziennik Śląski - 1916</cmdp_text:title>
            </cmdp_text:TitleInfo>
            <cmdp_text:Description>
                <cmdp_text:description>
                    Full text content aggregated from Europeana. Title: "Dziennik Śląski". Year: 1916.
                </cmdp_text:description>
            </cmdp_text:Description>
                ...
        </cmdp_text:TextResource>
    </cmd:Components>
</cmd:CMD>
```

**In the next cell, write a line that returns the _description_ contained in the metadata.**
Run the cell and verify that the value matches the description found in the metadata snippet listed above.

Note: you can use the `metadata_tree` variable since it has been defined in the previous cell. Variables remain available to use in following cells once they are defined.

In [None]:
# Below, write the line that looks up and returns the value of the description as found in the metadata


Since we will probably want to get different kinds of information from different metadata records later on, it would be nice to have simpler, shorter commands for doing so. Python makes this possible through **functions**. Functions make it easier to write complex code without repeating yourself, which reduces the chance of errors and makes improvements and other changes easier.

A function is defined by writing `def <name_of_function>(<parameter1>, .., <parameterN>):` followed by an indented block containing the logic that the function encapsulates. Values for the parameters have to be provided when the function is used, and inside the function the parameters can be used in the same way as variables.
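
As a minimal illustration of this pattern (the function name `describe_record` and its parameters are made up for this example and are not used elsewhere in the exercise):

```python
def describe_record(title, year):
    # Inside the function, the parameters behave like ordinary variables
    return title + ' (' + str(year) + ')'

# Calling the function: values are passed in the same order as the parameters
label = describe_record('Example newspaper', 1916)
print(label)
```

The `return` line determines the value that the function hands back to the code that called it.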

**Run the next cell to have some functions defined, thus becoming available for use directly afterwards.**

Note that putting text between triple quotes (`"""`) is another way to document code, useful for longer descriptions. When such a text is the first statement in a function, it is called a 'docstring'.

In [None]:
def get_title_from_metadata(metadata_tree):
    """
        This function returns the title from the text resource metadata.
    """
    # Get all the values from the xpath
    titles = metadata_tree.xpath('//cmdp_text:TextResource/cmdp_text:TitleInfo/cmdp_text:title/text()', namespaces=nsmap)
    # Check if there is an actual value
    if len(titles) > 0:
        # Return the first (assuming only) value
        return titles[0]

def get_description_from_metadata(metadata_tree):
    """
        This function returns the description from the text resource metadata.
    """
    descriptions = metadata_tree.xpath('//cmdp_text:TextResource/cmdp_text:Description/cmdp_text:description/text()', namespaces=nsmap)
    if len(descriptions) > 0:
        return descriptions[0]
    

def get_resource_ids_from_metadata(metadata_tree):
    """
        This function returns all resource identifiers from the resource metadata.
    """
    ids = metadata_tree.xpath('//cmdp_text:SubresourceDescription/cmdp_text:IdentificationInfo/cmdp_text:identifier/text()', namespaces=nsmap)
    # The result can be any number of identifiers. We do want to filter the values a bit: only the numeric identifiers are useful 
    # to us so we use the special syntax below to make a new list by picking only the matching values from the query results list
    return [identifier for identifier in ids if identifier.isnumeric()]

In the following cell we will use ('call') the functions for the first time. Run it and see if you understand what is happening.

In [None]:
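# Let's try it out!
# The f'...' syntax inserts the value between the braces into the text; unlike '+', this also works if a function returned nothing
print(f'Title: {get_title_from_metadata(metadata_tree)}')
print(f'Description: {get_description_from_metadata(metadata_tree)}')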
print('Resource identifiers: ')
print(get_resource_ids_from_metadata(metadata_tree))

----
## Exercise 2.2
In this exercise we will use the resource identifiers that we can get from the metadata file to find the text, and load it so that we can use the content.

We will start by defining a function that helps us to get from a numeric identifier, like the ones we managed to extract from the metadata previously, to a file that we have available locally. Note that _in the current environment_ we can safely make the following assumptions:

- we have a variable `data_dir` available that gives us the base location (path) where we have our text files
- this data directory is organised into sets; we know the identifier of the set that we have metadata for
- the directory for each set contains a file called `id_file_map.json` which we can load with a special library called `json`, giving us an in-memory map that we can query to get the file name for a given identifier
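
To give a feeling for what such a map looks like once loaded, here is a small sketch that assumes the file holds a simple JSON object with identifiers as keys and file names as values. The identifiers and file names below are made up, and the real file may contain more than this.

```python
import json

# Made-up stand-in for the contents of an `id_file_map.json` file
example_json = '{"1001": "first_text.txt", "1002": "second_text.txt"}'

# json.loads parses JSON text into a Python dictionary
example_map = json.loads(example_json)

# Look up the file name for a known identifier
known_file = example_map['1001']

# .get() returns None instead of raising an error for an unknown identifier
missing_file = example_map.get('9999')

print(known_file)
print(missing_file)
```

Note that `json.loads` reads from a text value, while `json.load` (without the `s`) reads from an opened file.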

In [None]:
set_id = '9200357'

"""
    Here we prepare the look-up of resource files based on the resource identifier
"""
import json
with open(f'{data_dir}/{set_id}/id_file_map.json', 'r') as id_filename_map_file:
    id_filename_map = json.load(id_filename_map_file)
    
def get_resource_file(identifier):
    """
        A function that looks up the full path to a text resource file based on an identifier
        within the current set. If there is no mapping to a file, it returns nothing.
    """
    if identifier in id_filename_map:
        filename = id_filename_map[identifier]
        return f'{data_dir}/{set_id}/{filename}'

**a)** It would be a good start to have a variable that holds the list of identifiers for the metadata record (= year of newspaper issues) that we are interested in. **Define and set this variable in the next cell.**

In [None]:
# In the next line, delete 'None' and complete the line to populate this variable with the list of identifiers 
# contained in our `metadata_tree` (using the variable and function that we defined in part 2.1)
identifiers = None

# keep the following lines as they are so that the number of identifiers is printed upon running the cell
# (a value inside an `if` block is not shown automatically, so we print it explicitly)
if identifiers:
    print(len(identifiers))

The output of the cell above should read `21`.

Now all that we need to do is to 'transform' the list of identifiers into a list of files. This is done in the following cell by means of a special syntax called [list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions). The way we use it here, it just applies a function to each item in the identifiers list and returns the results as a new list.
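
A small stand-alone sketch of the same technique, using made-up names and a hypothetical helper function `add_extension`:

```python
def add_extension(name):
    # Hypothetical helper: turn a bare name into a text file name
    return name + '.txt'

names = ['alpha', 'beta', '42']

# Apply the function to each item and collect the results in a new list
file_names = [add_extension(name) for name in names]

# An optional `if` at the end keeps only the items that pass a test
numeric_names = [name for name in names if name.isnumeric()]

print(file_names)
print(numeric_names)
```

The second form, with an `if` at the end, is the one used in `get_resource_ids_from_metadata` to keep only the numeric identifiers.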

In [None]:
# if all went well, we already did this before :)
identifiers = get_resource_ids_from_metadata(metadata_tree)

# now we can make a list of files
files = [get_resource_file(identifier) for identifier in identifiers]

# show only the first five files to limit the amount of output
files[0:5]

Note that in the last line, we use another new technique: we select a range from the list so that we keep only part of the original. This does not change the original list, but gives us a 'slice' that we can use on its own. In this case, we just used it to display only the first five files to keep the size of the output to a reasonable limit.
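
The following sketch, using a made-up list of letters, shows a few common ways of taking a slice:

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

first_three = letters[0:3]   # items at positions 0, 1 and 2 (position 3 is excluded)
same_three = letters[:3]     # the start position can be left out
last_two = letters[-2:]      # negative positions count from the end

print(first_three)
print(same_three)
print(last_two)
print(len(letters))          # the original list still has all its items
```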

Now we are ready to get into the content of the file. Let's define a function that counts the occurrences of a word or other text fragment in a file. Read and try to roughly understand the code in the following block, and run it.

In [None]:
def count_occurences(file_path, target):
    # We open the file using the `with` technique; the file is available in the block below it, and it is
    # closed automatically when the block ends
    with open(file_path, 'r') as file:
        # This is the counter we will increment on each find
        count = 0
        # Read the file line by line
        # Note that 'while True' creates an endless loop! Fortunately we will be able to break out of this when needed
        while True:
            line = file.readline()
            if line:
                # Search for the target - first we convert both the line and the target to lower case to 
                # effectively make the search case-insensitive
                count += line.lower().count(target.lower())
            else:
                # No line: end of file, break out of the "endless loop"
                break
    return count
        
count_occurences('/home/jovyan/data/9200357/BibliographicResource_3000095247423.txt', 'front')
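
To see what the counting step inside the function does on its own, here is a small sketch on a made-up line of text:

```python
line = 'Front line news: the front moved, Frontier towns were quiet.'
target = 'front'

# Without lowering, only exact-case matches are counted
case_sensitive = line.count(target)

# Lowering both strings makes the comparison case-insensitive
case_insensitive = line.lower().count(target.lower())

print(case_sensitive)
print(case_insensitive)
```

Note that the case-insensitive count also includes 'Frontier': the function counts text fragments, not whole words.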

**b)** In the next cell, count the number of occurrences of the word 'front' in all the text that we have available via our `files` variable.

In [None]:
total_count = 0
# Implement count of the text 'front' over all text in `files`
# <your implementation here>

# Make the final value the result of the cell
total_count

The cell above should output a total of `275`.

----
Next: [Exercise 3](./Exercise3.ipynb)