# Exercise 3: data filtering and running NLP tasks

In this exercise you will first learn how to use metadata properties for filtering data. We will then apply this to an example NLP pipeline.

----
As in the previous exercise, we first need to install and import some packages. Run the following cell to get everything in place. Note that this cell may take a little longer to run. While `[*]` is shown next to a cell, the cell is still being processed or is waiting to be processed.

In [92]:
!pip install lxml
!pip install -i https://pypi.clarin-pl.eu lpmn_client

from lxml import etree
import requests
from lpmn_client import download_file, upload_file
from lpmn_client import Task

%run ../common.py

Looking in indexes: https://pypi.clarin-pl.eu


## Exercise 3.1

For many research questions, we want to analyse one or more specific segments of the data, selected by some criteria. For this exercise, we assume that we want to find the names of persons, organisations, places and so on in texts that mention the city of Kraków and were published in the first week of World War I.

In the next cells, create the list of file paths of resources that meet these criteria:
- publication date between 28 July 1914 and 4 August 1914
- the resource file contains the text 'Kraków'
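
The two criteria can be checked independently of each other. The following is a minimal sketch, not the predefined solution: the helper names `in_date_range` and `file_contains` are hypothetical, both boundary dates are assumed to be inclusive, and the resource file is assumed to be readable as UTF-8 text.

```python
import datetime

# Boundaries of the date range from the criteria above (assumed inclusive)
START = datetime.date(1914, 7, 28)
END = datetime.date(1914, 8, 4)

def in_date_range(issue_date, start=START, end=END):
    """Return True if issue_date lies between start and end, both inclusive."""
    return start <= issue_date <= end

def file_contains(path, text):
    """Return True if the file at `path` contains `text` (assumes UTF-8 content)."""
    with open(path, encoding='utf-8') as f:
        return text in f.read()
```

Inside the filtering loop, an issue would then qualify only if both `in_date_range(issue_date)` and `file_contains(path, 'Kraków')` are true for its date and resource file.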

In [93]:
# Retrieve the metadata. We put this in its own cell so that we can run the filtering process separately
set_metadata_dir = unpack_metadata(set_id, metadata_dir)

set_metadata_dir

Retrieving https://europeana-oai.clarin.eu/metadata/fulltext-aggregation/9200357.zip
Extracting content in /home/jovyan/temp/metadata/9200357
Done


'/home/jovyan/temp/metadata/9200357'

In [94]:
# We provide two helper functions that get all issue identifiers and the associated dates out of
# a metadata record. You can use these as is.

def get_issues_id_and_date(metadata_tree):
    """
        Returns a list of tuples (id, date) for all issues in the metadata tree.
        The 'date' part is a date object that supports retrieval of the date parts, i.e.
        `date.year`, `date.month`, `date.day`
    """
    issue_descriptions = metadata_tree.xpath('//cmdp_text:SubresourceDescription', namespaces=nsmap)
    issues = [get_id_and_date_from_description(description) for description in issue_descriptions if description is not None]
    return [issue for issue in issues if issue is not None]

def get_id_and_date_from_description(description_element):
    """
        Helper that gets the identifier and date for a single issue. Returns None
        if the identifier and date information are not both present.
    """
    issue_ids = description_element.findall('./cmdp_text:IdentificationInfo/cmdp_text:identifier', namespaces=nsmap)
    issue_dates_start = description_element.find('./cmdp_text:TemporalCoverage/cmdp_text:Start/cmdp_text:date', namespaces=nsmap)
    if len(issue_ids) > 0 and issue_dates_start is not None:
        for issue_id in issue_ids:
            # Skip empty identifier elements, whose text is None
            if issue_id.text and issue_id.text.isnumeric():
                return (issue_id.text, datetime.date.fromisoformat(issue_dates_start.text))


files = []
for metadata_file in os.listdir(set_metadata_dir):
    full_path = f'{set_metadata_dir}/{metadata_file}'
    tree = etree.parse(full_path)
    for info in get_issues_id_and_date(tree):
        (issue_id, issue_date) = info
        # We now have a numeric identifier `issue_id` and a date `issue_date`
        # from which we can get the year, month, day through `issue_date.year`,
        # `issue_date.month`, `issue_date.day`. Use this to decide whether to include 
        # this issue.
        
        # If the date matches the desired range, we need to look at the resource itself
        # to see if the target text appears.
        # If you want, you can make use of the provided function get_resource_file(issue_id)
        # to determine the path to the file.
        #
        # In exercise set 2 we explored how to open a file and look for text inside it.
        
        # If both criteria match, we only need to add the file path to the list, for example
        # with `files.append(get_resource_file(issue_id))`
        

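In the next cell we compare the result to a predefined solution. The following cells use the outcome of the predefined solution, so you can move on even if the outcomes do not fully match. Note that the call to the predefined solution below uses an end date of 11 August 1914 and the search text 'w Polsce', so its count can differ from a result based on the criteria listed above.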

In [95]:
my_result = files
print(f'Number of files in our own result: {len(my_result)}')

# now we run the predefined solution
predefined_result = ex3_filter_by_date_and_content(set_metadata_dir,
                                       date.fromisoformat('1914-07-28'), 
                                       date.fromisoformat('1914-08-11'), 'w Polsce')
print(f'Number of files in result from predefined solution: {len(predefined_result)}')

if len(my_result) == len(predefined_result):
    print('The counts match!')
else:
    print('The counts do not match :(')

Number of files in our own result: 0
Number of files in result from predefined solution: 7
The counts do not match :(
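
Equal counts do not guarantee that the two results contain the same files. As a minimal sketch, assuming both results are lists holding the same kind of value (for example file paths), a set comparison shows exactly which entries differ. The function name `compare_results` and the example file names are hypothetical.

```python
def compare_results(mine, reference):
    """Return the entries missing from `mine` and the entries not in `reference`."""
    mine_set, reference_set = set(mine), set(reference)
    return {
        'missing': sorted(reference_set - mine_set),  # in the reference only
        'extra': sorted(mine_set - reference_set),    # in our own result only
    }

# Made-up file names, for illustration only
diff = compare_results(['a.xml', 'b.xml'], ['b.xml', 'c.xml'])
print(diff)  # → {'missing': ['c.xml'], 'extra': ['a.xml']}
```

In the notebook this could be called as `compare_results(my_result, predefined_result)`.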


## Exercise 3.2
In this exercise we will submit our selection of files to an NLP pipeline for named entity recognition (NER).

In [98]:
resource_files = predefined_result

# Here we define a function that submits files to the Liner2 NER pipeline through the lpmn client, with a check on the task size

def liner2_NER(resources, names=None):
    """
        Wrapper around the CLARIN-PL lpmn client that checks the task size in order to avoid jamming the task queue on the server side
        
        :param list resources: list of paths to the resources to be processed
        :param list names: optional list of output file names, one per resource
        :returns list: list of paths to the output zip files
    """
    
    # Size check
    _check_task_size(resources)
    # Upload resources to the task queue
    job_ids = [upload_file(resource_file) for resource_file in resources]
    # Specify pipeline 
    t = Task("any2txt|wcrft2|liner2")
    # Run uploaded tasks with pipeline
    output_file_ids = [t.run(job_id, verbose=True) for job_id in job_ids]
    if names:
        liner2_output = [download_file(output_file_id, output_file, filename) for output_file_id, filename in zip(output_file_ids, names)]
    else:
        liner2_output = [download_file(output_file_id, output_file) for output_file_id in output_file_ids]
    return liner2_output

output = liner2_NER(resource_files)

100%|██████████| 100.0/100 [00:02<00:00, 40.44it/s]
100%|██████████| 100.0/100 [00:07<00:00, 12.52it/s]
100%|██████████| 100.0/100 [00:04<00:00, 24.85it/s]
100%|██████████| 100.0/100 [00:03<00:00, 25.07it/s]
100%|██████████| 100.0/100 [00:04<00:00, 20.85it/s]
100%|██████████| 100.0/100 [00:02<00:00, 41.48it/s]
100%|██████████| 100.0/100 [00:04<00:00, 24.55it/s]
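
The size check that `liner2_NER` calls is provided by the notebook's helper code and is not shown here. Purely as an illustration of what such a guard might look like, the sketch below rejects a batch that has too many files or too many bytes in total. The function name `check_task_size` and both default limits are assumptions made up for this example; they do not describe the actual helper or any limit of the CLARIN-PL service.

```python
import os

def check_task_size(paths, max_files=20, max_total_bytes=5 * 1024 * 1024):
    """
    Hypothetical size check: raise ValueError if the batch has too many files
    or too many bytes in total. The default limits are made up for illustration.
    Returns the total size in bytes when the batch is acceptable.
    """
    if len(paths) > max_files:
        raise ValueError(f'Too many files: {len(paths)} > {max_files}')
    total_bytes = sum(os.path.getsize(path) for path in paths)
    if total_bytes > max_total_bytes:
        raise ValueError(f'Batch too large: {total_bytes} bytes > {max_total_bytes}')
    return total_bytes
```

Running a check like this before any upload means an oversized selection fails early, before any task has been placed on the server queue.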


In [97]:
output

['/home/jovyan/output/6be59c9d-6615-4cb4-8dc3-5bd251a7f456.zip',
 '/home/jovyan/output/ccf2c814-824d-43ef-9da8-ba788b4135c5.zip',
 '/home/jovyan/output/c707f80c-db37-4c6d-9859-e20bf3459401.zip',
 '/home/jovyan/output/3979126f-f03a-4669-96a9-3d5a95d86ebc.zip',
 '/home/jovyan/output/13b31721-6040-4754-8dc9-6d4e2073834d.zip',
 '/home/jovyan/output/4940cf2e-7a0f-4ef7-b7f1-eb7510f9468b.zip',
 '/home/jovyan/output/db34271d-cb3c-4d03-ba80-55506d1a0544.zip']
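
Each entry in `output` is the path of a zip archive. As a minimal sketch using only Python's standard `zipfile` module, the members of such an archive can be listed without extracting it. The helper name `list_zip_contents` is hypothetical, and no assumption is made here about which files the pipeline places inside the archives.

```python
import zipfile

def list_zip_contents(zip_path):
    """Return the names of all members of the zip archive at `zip_path`."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()
```

In the notebook, `[list_zip_contents(path) for path in output]` would give one list of member names per processed resource.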

In [None]:
resource_files