# Exercise 3: data filtering and running NLP tasks

In this exercise you will first learn how to use metadata properties for filtering data. We will then apply this to an example NLP pipeline.

----
As in the previous exercise, we first need to install and import some packages. Run the following cell to get everything in place. This cell may take a bit longer to run. While `[*]` is shown next to a cell, the cell is still being processed or is waiting to be processed.

In [None]:
!pip install lxml
!pip install -i https://pypi.clarin-pl.eu lpmn_client

from lxml import etree
import requests

%run ../common.py

Looking in indexes: https://pypi.clarin-pl.eu


## Exercise 3.1

For many research questions, we want to analyse one or more specific data segments selected by some criteria. For this exercise we assume that we are interested in finding names of persons, organisations, places and so on in texts that mention the city of Kraków and were published in the first week of World War I.

In the next cells, create the list of file paths of resources that meet these criteria:
- publication date between 28 July 1914 and 4 August 1914
- the resource file contains the text 'Kraków'
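
The two criteria can be checked independently of each other. The following is a minimal sketch and not the predefined solution: the helper names `in_date_range` and `file_contains` are hypothetical, the end date is assumed to be inclusive, and the resource files are assumed to be UTF-8 text.

```python
from datetime import date
from pathlib import Path

# Date range from the exercise text (both ends treated as inclusive, an assumption)
START = date(1914, 7, 28)
END = date(1914, 8, 4)

def in_date_range(issue_date, start=START, end=END):
    """Return True if issue_date lies between start and end, inclusive."""
    return start <= issue_date <= end

def file_contains(path, needle, encoding="utf-8"):
    """Return True if the text file at `path` contains the string `needle`."""
    return needle in Path(path).read_text(encoding=encoding)
```

An issue would then be kept only if both `in_date_range(issue_date)` and `file_contains(path, 'Kraków')` are true for it.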

In [None]:
# Retrieve the metadata. We put this in its own cell so that we can run the filtering process separately
set_metadata_dir = unpack_metadata(set_id, metadata_dir)

set_metadata_dir

Retrieving https://europeana-oai.clarin.eu/metadata/fulltext-aggregation/9200357.zip
Extracting content in /home/jovyan/temp/metadata/9200357
Done


'/home/jovyan/temp/metadata/9200357'

In [None]:
# We provide two helper functions that get all issue identifiers and the associated dates out of
# a metadata record. You can use these as is.

def get_issues_id_and_date(metadata_tree):
    """
        Returns a list of tuples (id, date) for all issues in the metadata tree.
        The 'date' part is a date object that supports retrieval of the date parts, i.e.
        `date.year`, `date.month`, `date.day`
    """
    issue_descriptions = metadata_tree.xpath('//cmdp_text:SubresourceDescription', namespaces=nsmap)
    issues = [get_id_and_date_from_description(description) for description in issue_descriptions  if description is not None]
    return [issue for issue in issues if issue is not None]

def get_id_and_date_from_description(description_element):
    """
        Helper that gets the identifier and date for a single issue. Returns None
        if identifier and date information are not both present.
    """
    issue_ids = description_element.findall('./cmdp_text:IdentificationInfo/cmdp_text:identifier', namespaces=nsmap)
    issue_dates_start = description_element.find('./cmdp_text:TemporalCoverage/cmdp_text:Start/cmdp_text:date', namespaces=nsmap)
    if len(issue_ids) > 0 and issue_dates_start is not None:
        for issue_id in issue_ids:
            if issue_id.text.isnumeric():
                return (issue_id.text, datetime.date.fromisoformat(issue_dates_start.text))


files = []
for metadata_file in os.listdir(set_metadata_dir):
    full_path = f'{set_metadata_dir}/{metadata_file}'
    tree = etree.parse(full_path)
    for info in get_issues_id_and_date(tree):
        (issue_id, issue_date) = info
        # We now have a numeric identifier `issue_id` and a date `issue_date`
        # from which we can get the year, month, day through `issue_date.year`,
        # `issue_date.month`, `issue_date.day`. Use this to decide whether to include 
        # this issue.
        
        # If the date matches the desired range, we need to look at the resource itself
        # to see if the target text appears.
        # If you want, you can make use of the provided function get_resource_file(issue_id)
        # to determine the path to the file.
        #
        # In exercise set 2 we explored how to open a file and look for text inside
        
        # If both criteria match, we only need to add the file path to the list, which is
        # done with `files.append(get_resource_file(issue_id))`
        

In the next cell we compare the result to a predefined solution. In the following cells we will use the outcome of the predefined solution, so don't worry about moving on even if the outcomes do not fully match.

In [None]:
my_result = files
print(f'Number of files in our own result: {len(my_result)}')

# now we run the predefined solution
predefined_result = ex3_filter_by_date_and_content(set_metadata_dir,
                                       date.fromisoformat('1914-07-28'), 
                                       date.fromisoformat('1914-08-11'), 'w Polsce')
print(f'Number of files in result from predefined solution: {len(predefined_result)}')

if len(my_result) == len(predefined_result):
    print('The counts match!')
else:
    print('The counts do not match :(')

Number of files in our own result: 0
Number of files in result from predefined solution: 7
The counts do not match :(


## Exercise 3.2
In this exercise we will investigate the effect of applying contemporary spellchecking to archival textual data. Because spellchecking takes a long time, we provide a mapping to already processed files. Your job is to run the NER pipeline on the raw and the spell-corrected textual data and to compare the number of tokens, the annotations and their spans.
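
One way to compare the annotations of the two runs is to count them and split the counts into shared and run-specific parts. The sketch below assumes annotations are keyed as `type|tokens`; the function name `compare_annotation_counts` is hypothetical and the example data is made up, not real pipeline output.

```python
from collections import Counter

def compare_annotation_counts(raw: Counter, corrected: Counter):
    """Split annotation counts into shared, raw-only and corrected-only parts."""
    shared = raw & corrected          # per key, the smaller of the two counts
    only_raw = raw - corrected        # occurrences lost after spellchecking
    only_corrected = corrected - raw  # occurrences that appear only after spellchecking
    return shared, only_raw, only_corrected

# Made-up example data for illustration only
raw = Counter({"nam_loc|Kraków": 3, "nam_liv|Jan Kowalski": 1})
corrected = Counter({"nam_loc|Kraków": 2, "nam_org|Rada Miejska": 1})

shared, only_raw, only_corrected = compare_annotation_counts(raw, corrected)
print(shared)
print(only_raw)
print(only_corrected)
```

Counter subtraction drops keys whose count would be zero or negative, so each of the three results lists only annotations that actually occur in that part.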

In [None]:
"""
    Functions for tasking lpmn client with Liner2 NER pipeline with task size control
"""

from lpmn_client import download_file, upload_file
from lpmn_client import Task

def lpmn_client_task(resources: list, task: str, names: list = None):
    """
        Wrapper around the CLARIN-PL lpmn client that controls the task size in order to avoid jamming the task queue on the server side
        
        :param list resources: list of paths to the resources to be processed
        :param str task: string defining pipeline, e.g. "speller2" or ""
        :param list names: optional list of names for output files, has to be same length as resources
        :returns list: list of paths to the output zip files
    """
    
    # Size check
    _check_task_size(resources)
    # Upload resources to task queue
    job_ids = [upload_file(resource_file) for resource_file in resources]
    # Specify pipeline 
    t = Task(task)
    # Run uploaded tasks with pipeline
    output_file_ids = [t.run(job_id, verbose=True) for job_id in job_ids]
    if names:
        output = [download_file(output_file_id, output_file, filename) 
                         for output_file_id, filename in zip(output_file_ids, names)]
    else:
        output = [download_file(output_file_id, output_file, os.path.basename(resource)) 
                         for output_file_id, resource in zip(output_file_ids, resources)]
    return output
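
The size check called above is not shown in this notebook. A minimal sketch of what such a guard could look like follows; the function name `check_task_size` and both limits are assumptions for illustration, and the limits of the workshop's own `_check_task_size` may differ.

```python
import os

# Hypothetical limits, chosen only for illustration
MAX_FILES = 20
MAX_TOTAL_BYTES = 5 * 1024 * 1024

def check_task_size(resources, max_files=MAX_FILES, max_total_bytes=MAX_TOTAL_BYTES):
    """Raise ValueError if the batch is too large; otherwise return its size in bytes."""
    if len(resources) > max_files:
        raise ValueError(f"Too many files: {len(resources)} > {max_files}")
    total = sum(os.path.getsize(path) for path in resources)
    if total > max_total_bytes:
        raise ValueError(f"Batch too large: {total} bytes > {max_total_bytes} bytes")
    return total
```

Failing before any upload keeps an oversized batch from reaching the server-side queue at all.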



In [None]:
resource_files = predefined_result

# We already ran the spellchecker for you (see the line below the commented block). If you are running this
# outside the workshop, uncomment that block and comment out the last line instead
# resource_files_spellchecked = lpmn_client_task(resource_files, "speller2")
# resource_files_spellchecked = [unzip_file(r) for r in resource_files_spellchecked]
resource_files_spellchecked = get_spellchecked_resources()

In [13]:
"""
    Function for extracting annotations from Liner2 xml output
"""
from collections import Counter

# test_path = get_resource_file("3000095236729")
# print(test_path)
# lpmn_client_task([test_path], "speller2", ["test_speller.zip"])

def liner2_xml_to_annotation(path_to_xml):
    """
        Converts xml doc into list of annotations and tokens
        
        :param str path_to_xml: path to .xml Liner2 output file
        :returns list: list of tuples (annotation_type, [tokens])
    """
    # etree.parse opens the file itself, so no explicit open() is needed
    xml_tree = etree.parse(path_to_xml)
    sentences = xml_tree.xpath("//sentence")
    annotated_tokens = [sentence.xpath("./tok[./ann!=0]") for sentence in sentences]
    # Prune empty lists
    annotated_tokens = [sentence for sentence in annotated_tokens if sentence]
    # Flatten the per-sentence annotation lists into one list of tuples
    return [annotation for sentence in annotated_tokens for annotation in _chain_annotations(sentence)]

def _chain_annotations(sentence: list):
    """
        Collects every annotation that has its head in the sentence as (annotation_type, [tokens])
    """
    annotations = []
    for token in sentence:
        for annotation_head in token.xpath("./ann[@head]"):
            annotation_type = annotation_head.get("chan")
            annotation_number = annotation_head.xpath("./text()")[0]
            # Tokens belong to the annotation if they carry the same number in the same channel
            annotation_tokens = [tok.xpath("./orth/text()")[0] for tok in sentence
                                 if tok.xpath(f"./ann[@chan='{annotation_type}' and text()={annotation_number}]")]
            annotations.append((annotation_type, annotation_tokens))
    return annotations

/home/jovyan/data/9200357/BibliographicResource_3000095236729.txt
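
To see how annotation heads and channel numbers group tokens into spans, the following self-contained sketch parses a small hand-written sample. The sample is a simplified assumption about the XML layout (real Liner2 output carries more elements per token), the channel names are made up for illustration, and the standard library parser is used instead of `lxml` so that the sketch runs anywhere.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified sample; not real pipeline output
SAMPLE = """
<chunkList>
 <chunk>
  <sentence>
   <tok><orth>Jan</orth><ann chan="nam_liv" head="1">1</ann><ann chan="nam_loc">0</ann></tok>
   <tok><orth>Kowalski</orth><ann chan="nam_liv">1</ann><ann chan="nam_loc">0</ann></tok>
   <tok><orth>odwiedził</orth><ann chan="nam_liv">0</ann><ann chan="nam_loc">0</ann></tok>
   <tok><orth>Kraków</orth><ann chan="nam_liv">0</ann><ann chan="nam_loc" head="1">1</ann></tok>
  </sentence>
 </chunk>
</chunkList>
"""

def annotations_in_sentence(sentence):
    """Return (channel, [words]) for every annotation head found in the sentence."""
    tokens = sentence.findall("./tok")
    found = []
    for token in tokens:
        for head in token.findall("./ann[@head]"):
            channel, number = head.get("chan"), head.text.strip()
            # A token is part of the span if it has the same number in the same channel
            words = [tok.findtext("./orth") for tok in tokens
                     if any(ann.get("chan") == channel and ann.text.strip() == number
                            for ann in tok.findall("./ann"))]
            found.append((channel, words))
    return found

root = ET.fromstring(SAMPLE.strip())
result = [a for s in root.iter("sentence") for a in annotations_in_sentence(s)]
print(result)
```

Matching on both the channel and the number matters: the same number can occur in several channels within one sentence, so matching on the number alone would merge unrelated spans.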




In [19]:
test_files = unzip_file(f"{output_file}/test_speller.zip")
print(test_files)

['/home/jovyan/output/test_speller/home%jovyan%data%9200357%BibliographicResource_3000095236729.txt']


In [20]:
"""
    Run NER task on both raw and spellchecked input
"""

resource_files = predefined_result

output_files_raw = lpmn_client_task(resource_files, 'any2txt|wcrft2|liner2({"model":"top9"})')
print("NER pipeline over raw resources finished")
output_files_spellchecked = lpmn_client_task(test_files, 'any2txt|wcrft2|liner2({"model":"top9"})')
print("NER pipeline over spellchecked resources finished")




goes through first





goes through second
['/home/jovyan/output/BibliographicResource_3000095243392.txt', '/home/jovyan/output/BibliographicResource_3000095244058.txt', '/home/jovyan/output/BibliographicResource_3000095243952.txt', '/home/jovyan/output/BibliographicResource_3000095243514.txt', '/home/jovyan/output/BibliographicResource_3000095243250.txt', '/home/jovyan/output/BibliographicResource_3000095242404.txt', '/home/jovyan/output/BibliographicResource_3000095236729.txt']
['/home/jovyan/output/home%jovyan%data%9200357%BibliographicResource_3000095236729.txt']


In [22]:
"""
    Functions for parsing output and basic stats
"""
from collections import Counter

def count_tokens(ner_output_tree):
    # Count every <tok> element in the parsed NER output
    return len(ner_output_tree.xpath("//tok"))

def list_annotations(path_to_xml):
    # liner2_xml_to_annotation expects a file path, not a parsed tree
    return liner2_xml_to_annotation(path_to_xml)

def count_annotations(path_to_xml):
    return Counter(f"{annotation_type}|{' '.join(annotation_tokens)}" for annotation_type, annotation_tokens in list_annotations(path_to_xml))

In [None]:
"""
    Let's investigate the difference in the number of parsed tokens in raw and spellchecked data
"""
# The loop variable is not called `output_file`, because that name holds the output
# directory used by lpmn_client_task
token_nb_raw = 0
for ner_output_path in output_files_raw:
    print(f"Processing {ner_output_path}")
    xml_tree = etree.parse(ner_output_path)
    token_nb_raw += count_tokens(xml_tree)
print(f"Raw data has {token_nb_raw} tokens after the pipeline")

token_nb_spellchecked = 0
for ner_output_path in output_files_spellchecked:
    xml_tree = etree.parse(ner_output_path)
    token_nb_spellchecked += count_tokens(xml_tree)
print(f"Spellchecked data has {token_nb_spellchecked} tokens after the pipeline")



Processing /home/jovyan/output/BibliographicResource_3000095243392.txt
