# Exercise 3: data filtering and running NLP tasks

In this exercise you will first learn how to use metadata properties for filtering data. We will then apply this to an example NLP pipeline.

----
As in the previous exercise, we first need to install and import some packages. Run the following cell to get everything in place. Note that this cell may take a little longer to run. While `[*]` is shown next to a cell, it is either being processed or waiting to be processed.

In [None]:
"""
    Install lxml and import it
"""
!pip install lxml

from lxml import etree

"""
    Install lpmn client and import it
"""
!pip install -i https://pypi.clarin-pl.eu lpmn_client

from lpmn_client import download_file, upload_file
from lpmn_client import Task

# We will also be using some core libraries
from datetime import date
import os
import requests

# Globals
from common import align_resources, _check_task_size, data_dir, metadata_dir, nsmap, set_id, print_xml, output_file, unpack_metadata, unzip_file, zip_file
# Getters
from common import ex3_filter_by_date_and_content, get_date_from_metadata, get_resource_ids_from_metadata, get_resource_file, get_spellchecked_resources_ex3

## Exercise 3.1

For many research questions, we want to analyse one or more specific data segments selected by some criteria. For this exercise we assume that we are interested in finding the names of persons, organisations, places, etcetera in texts published in the first two weeks of World War I that contain the phrase _w Polsce_ (_"in Poland"_).

In the next cells, **create the list of file paths of resources that meet these criteria**:
- publication date between 28 July 1914 and 11 August 1914
- the resource file contains the text `w Polsce`
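
The two criteria above can be checked independently of the metadata handling. The following is a minimal sketch, not the predefined solution: the function name `matches_criteria` and the sample text are made up for illustration, and it assumes that the resource text has already been read into a string.

```python
from datetime import date

# Boundaries of the date range used in this exercise (both inclusive)
START = date(1914, 7, 28)
END = date(1914, 8, 11)


def matches_criteria(issue_date, text, phrase="w Polsce"):
    """Hypothetical helper: True if the date is in range and the phrase occurs in the text."""
    in_range = START <= issue_date <= END
    return in_range and phrase in text


sample_text = "Wojna w Polsce i za granicą"  # made-up sample text
print(matches_criteria(date(1914, 8, 1), sample_text))
print(matches_criteria(date(1914, 9, 1), sample_text))
```

Because `date` objects support comparison operators, the range check does not need to look at the year, month and day separately.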

We will start with some preparations. There is no need to change the next cell.

In [None]:
# Retrieve the metadata. We put this in its own cell so that we can run the filtering process separately
set_metadata_dir = unpack_metadata(set_id, metadata_dir)

set_metadata_dir

In [None]:
# We provide two helper functions that get all issue identifiers and the associated dates out of
# a metadata record. You can use these as they are.

def get_issues_id_and_date(metadata_tree):
    """
        Returns a list of tuples (id, date) for all issues in the metadata tree.
        The 'date' part is a date object that supports retrieval of the date parts, i.e.
        `date.year`, `date.month`, `date.day`
    """
    issue_descriptions = metadata_tree.xpath('//cmdp_text:SubresourceDescription', namespaces=nsmap)
    issues = [get_id_and_date_from_description(description) for description in issue_descriptions  if description is not None]
    return [issue for issue in issues if issue is not None]

def get_id_and_date_from_description(description_element):
    """
        Helper that gets the identifier and date for a single issue. Returns None
        if the identifier and date information are not both present.
    """
    issue_ids = description_element.findall('./cmdp_text:IdentificationInfo/cmdp_text:identifier', namespaces=nsmap)
    issue_dates_start = description_element.find('./cmdp_text:TemporalCoverage/cmdp_text:Start/cmdp_text:date', namespaces=nsmap)
    if len(issue_ids) > 0 and issue_dates_start is not None:
        for issue_id in issue_ids:
            if issue_id.text.isnumeric():
                return (issue_id.text, date.fromisoformat(issue_dates_start.text))

Now we have to **put everything together to filter the metadata and collect the files containing text for issues that match our criteria**:

In [None]:
files = []
for metadata_file in os.listdir(set_metadata_dir):
    full_path = f'{set_metadata_dir}/{metadata_file}'
    tree = etree.parse(full_path)
    for info in get_issues_id_and_date(tree):
        (issue_id, issue_date) = info
        # We now have a numeric identifier `issue_id` and a date `issue_date`
        # from which we can get the year, month, day through `issue_date.year`,
        # `issue_date.month`, `issue_date.day`. Use this to decide whether to include 
        # this issue.
        
        # If the date matches the desired range, we need to look at the resource itself
        # to see if the target text appears.
        # If you want, you can make use of the provided function get_resource_file(issue_id)
        # to determine the path to the file.
        #
        # In exercise set 2 we explored how to open a file and look for text inside
        
        # If both criteria match, we only need to add the file path to the list, for
        # example with `files.append(get_resource_file(issue_id))`

In the next cell we **compare the result to a predefined solution**. In the following cells we will use the outcome of the predefined solution, so don't worry about moving on even if the outcomes do not fully match.

In [None]:
my_result = files
print(f'Number of files in our own result: {len(my_result)}')

# now we run the predefined solution
predefined_result = ex3_filter_by_date_and_content(set_metadata_dir,
                                       date.fromisoformat('1914-07-28'), 
                                       date.fromisoformat('1914-08-11'), 'w Polsce')
print(f'Number of files in result from predefined solution: {len(predefined_result)}')

if len(my_result) == len(predefined_result):
    print('The counts match!')
else:
    print('The counts do not match :(')
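
Equal counts do not guarantee that the same files were selected. As a sketch, two result lists can also be compared as sets to see which entries are missing or unexpected. The file names below are invented for illustration, and the comparison assumes that both lists hold the same kind of value (for example, both hold file paths).

```python
def compare_results(mine, expected):
    """Return (missing, unexpected): entries only in `expected` and only in `mine`."""
    mine_set, expected_set = set(mine), set(expected)
    missing = sorted(expected_set - mine_set)
    unexpected = sorted(mine_set - expected_set)
    return missing, unexpected


# Invented example values, not actual file names from the data set
mine = ["a.txt", "b.txt", "d.txt"]
expected = ["a.txt", "b.txt", "c.txt"]

missing, unexpected = compare_results(mine, expected)
print(f"Missing from our result: {missing}")
print(f"Not in the predefined result: {unexpected}")
```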

## Exercise 3.2
In this exercise we will investigate the effect of applying contemporary spellchecking to archival textual data. Because spellchecking takes a long time, we provide a mapping to files that have already been processed. Your job will be to **run the NER pipeline on raw and spell-corrected textual data, and compare the number of tokens and of annotations found**.

First we define a function that calls the CLARIN-PL NLP service with a specific task for a set of resources. You don't have to change anything in the next cell, but read the code and try to understand what is happening.
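
One detail worth noticing in that function is how the names of the downloaded zip files are chosen: from the optional `names` list if it is given, otherwise from the resource paths. The sketch below isolates only that naming logic. `output_names` is a hypothetical helper, the paths are invented, and no call to the NLP service is made.

```python
import os


def output_names(resources, names=None):
    """Hypothetical helper: derive one zip file name per resource."""
    # Fall back to the resource paths when no explicit names are given
    sources = names if names else resources
    if len(sources) != len(resources):
        raise ValueError("names must have the same length as resources")
    return [f"{os.path.basename(source)}.zip" for source in sources]


# Invented paths for illustration
resources = ["/data/set/100.txt", "/data/set/200.txt"]
print(output_names(resources))
print(output_names(resources, ["100.txt_raw", "200.txt_raw"]))
```

The explicit length check is a design choice of this sketch: pairing two lists with `zip` stops at the shorter one, so a `names` list that is too short would otherwise go unnoticed.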


In [None]:
"""
    Function for tasking lpmn client with Liner2 NER pipeline with task size control 
"""

def lpmn_client_task(resources, task, names=None):
    """
        Wrap over CLARIN-PL lpmn client with control of the task size in order to avoid jamming the task queue on the server side
        
        :param list resources: list of paths to the resources to be processed
        :param str task: string defining pipeline, e.g. "speller2" or ""
        :param list names: optional list of names for output files, has to be same length as resources
        :returns list: list of paths to the output zip files
    """
    
    # Size check
    _check_task_size(resources)
    # Upload resources to task queue
    job_ids = [upload_file(resource_file) for resource_file in resources]
    # Specify pipeline 
    t = Task(task)
    # Run uploaded tasks with pipeline
    output_file_ids = [t.run(job_id, verbose=True) for job_id in job_ids]

    if names:
        # Names were provided, use these when unpacking
        output = [download_file(output_file_id, output_file, f"{os.path.basename(filename)}.zip") 
                         for output_file_id, filename in zip(output_file_ids, names)]
    else:
        output = [download_file(output_file_id, output_file, f"{os.path.basename(resource)}.zip") 
                         for output_file_id, resource in zip(output_file_ids, resources)]
    return output

resource_files_spellchecked = None
resource_files_raw = predefined_result # using the output we know to be correct from exercise 3.1

Run the next cell to use pre-defined spellchecked resources if they are available to save time.

In [None]:
"""
    Get resources for the pipeline
"""

# If you are running this outside the workshop, this may not work, in which case the next cell will run the pipeline
resource_files_spellchecked = get_spellchecked_resources_ex3()

The next cell triggers a pipeline run to obtain spellchecked resources if necessary. If the predefined result is available, it will only print a message that running of the pipeline is skipped.

In [None]:
if resource_files_spellchecked:
    print('Pre-defined resources found, skipping running of the pipeline.')
else:
    print('Pre-defined resources not found, will run the pipeline to obtain them. This may take a while!')
    resource_files_spellchecked = lpmn_client_task(resource_files_raw, "speller2")
    resource_files_spellchecked = [unzip_file(r) for r in resource_files_spellchecked]

In the next cell, **we must call the NLP service with a pipeline that includes the Liner2 NER tool**
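
In the pipeline strings used in this exercise, the steps are separated by `|` and options are passed to a step as a JSON object in parentheses. The following is a minimal sketch of composing such a string, assuming that this is all there is to the syntax. The helper `build_pipeline` is hypothetical and not part of `lpmn_client`.

```python
import json


def build_pipeline(steps):
    """Hypothetical helper: join (tool, options) pairs into a pipeline string."""
    parts = []
    for tool, options in steps:
        if options:
            # Compact JSON, without spaces after the separators
            parts.append(f"{tool}({json.dumps(options, separators=(',', ':'))})")
        else:
            parts.append(tool)
    return "|".join(parts)


pipeline_example = build_pipeline([
    ("any2txt", None),
    ("wcrft2", None),
    ("liner2", {"model": "top9"}),
])
print(pipeline_example)
```

Generating the options with `json.dumps` avoids quoting mistakes when a step takes more than one option.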

In [None]:
"""
    Run NER task on raw (un-spellchecked) input
"""
# Keep the following two lines as they are, you will need to pass these to the NLP pipeline
pipeline = 'any2txt|wcrft2|liner2({"model":"top9"})'
resource_files_raw_names = [f"{os.path.basename(r)}_raw" for r in resource_files_raw]

# Uncomment the line below and complete it to run the pipeline
# output_files_raw = lpmn_client_task(resource_files_raw, .............

print("NER pipeline over raw resources finished")

Now do the same thing for the spellchecked resources

In [None]:
"""
    Run NER task on spellchecked data
"""

pipeline = 'liner2({"model":"top9"})'
resource_files_spellchecked_names = [f"{os.path.basename(r)}_spellchecked" for r in resource_files_spellchecked]

# Complete this cell so that we will have a variable 'output_files_spellchecked' that allows us to compare 
# the NER output for raw and spellchecked data

Simply run the following cell for some required organisation of the output data

In [None]:
"""
    Unpacking and file name changes in order to avoid clashes
"""

output_files_raw_fixed = [f"{o.replace('home%jovyan%data%9200357%', '')}" for o in output_files_raw]
output_files_raw_fixed = [f"{unzip_file(o)[0]}" for o in output_files_raw_fixed]
for o in output_files_raw_fixed:
    os.rename(o, f"{o.replace('home%jovyan%data%9200357%', '')}_raw")

output_files_raw_fixed = [f"{o.replace('home%jovyan%data%9200357%', '')}_raw" for o in output_files_raw_fixed]

output_files_spellchecked_fixed = [f"{o.replace('home%jovyan%data%9200357%', '')}" for o in output_files_spellchecked]
output_files_spellchecked_fixed = [f"{unzip_file(o)[0]}" for o in output_files_spellchecked_fixed]
for o in output_files_spellchecked_fixed:
    os.rename(o, f"{o.replace('home%jovyan%data%9200357%', '')}_spellchecked") 
    
output_files_spellchecked_fixed = [f"{o.replace('home%jovyan%data%9200357%', '')}_spellchecked" for o in output_files_spellchecked_fixed]

## Exercise 3.3
We will now **carry out some analysis** of how the output of the two runs (raw and spellchecked data) differs. For this, we first define some functions that make it easy to count tokens and annotations in the pipeline output.
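
To make the structure of the pipeline output more concrete, here is a small sketch on a hand-written fragment. The fragment is an assumption about the layout (sentences containing `tok` elements, each with `ann` children whose text is `0` when the token is not part of an annotation), inferred from the XPath expressions used in this exercise. It uses the standard library's `xml.etree.ElementTree` rather than `lxml` so that it runs on its own.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment; the real output may contain more elements and attributes
SAMPLE = """
<chunkList>
  <chunk>
    <sentence>
      <tok><orth>Jan</orth><lex><base>Jan</base></lex><ann chan="nam_liv" head="1">1</ann></tok>
      <tok><orth>Kowalski</orth><lex><base>Kowalski</base></lex><ann chan="nam_liv">1</ann></tok>
      <tok><orth>pisze</orth><lex><base>pisać</base></lex><ann chan="nam_liv">0</ann></tok>
    </sentence>
  </chunk>
</chunkList>
"""

root = ET.fromstring(SAMPLE)

# All tokens in the document
tokens = root.findall(".//tok")

# Tokens that belong to an annotation: at least one <ann> with a non-zero value
annotated = [
    tok for tok in tokens
    if any(ann.text.strip() != "0" for ann in tok.findall("ann"))
]

print(f"{len(tokens)} tokens, {len(annotated)} annotated")
print([tok.findtext("lex/base") for tok in annotated])
```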



In [None]:
"""
    Functions for parsing output and basic stats
"""

from collections import Counter

def count_tokens(ner_output_tree):
    return len(ner_output_tree.xpath("//tok"))

def list_annotations(ner_output_tree):
    return liner2_xml_to_annotation(ner_output_tree)

def count_annotations(ner_output_tree) -> Counter:
    return Counter(f"{annotation_type}|{' '.join(annotation_tokens)}" for annotation_type, annotation_tokens in list_annotations(ner_output_tree))

def liner2_xml_to_annotation(ner_output_tree):
    """
        Converts xml doc into list of annotations and tokens
        
        :param ElementTree ner_output_tree: lxml instance of ET of NER output xml
        :returns list: list of tuples (annotation_type, [tokens])
    """
    sentences = ner_output_tree.xpath("//sentence")
    annotated_tokens = [sentence.xpath("./tok[./ann!=0]") for sentence in sentences]
    # Prune empty lists
    annotated_tokens = [annotated_token for annotated_token in annotated_tokens if annotated_token]
    # A sentence can hold several annotations, so flatten the per-sentence lists
    return [annotation for sentence in annotated_tokens for annotation in _chain_annotations(sentence)]
        
def _chain_annotations(sentence: list):
    """
        Returns one tuple (annotation_type, [tokens]) per annotation head found in the sentence
    """
    annotations = []
    for head_token in sentence:
        for annotation_head in head_token.xpath("./ann[@head]"):
            annotation_channel = annotation_head.xpath("./text()")[0]
            annotation_type = annotation_head.get("chan")
            # Collect the tokens that carry the same number in the same channel
            annotation_tokens = [token.xpath("./lex/base/text()")[0] for token in sentence
                                 if token.xpath(f"./ann[@chan='{annotation_type}' and text()={annotation_channel}]")]
            annotations.append((annotation_type, annotation_tokens))
    return annotations

The next cell demonstrates how to count tokens by comparing the number of tokens in the outputs for raw and spellchecked resources.

In [None]:
"""
    Let's investigate the difference in the number of parsed tokens in raw and spellchecked data
"""

# Raw
token_nb_raw = 0
for o in output_files_raw_fixed:
    print(f"Processing {o}")
    xml_tree = etree.parse(o)
    token_nb_raw += count_tokens(xml_tree)
print(f"Raw data has {token_nb_raw} tokens")

# Spellchecked
token_nb_spellchecked = 0
for o in output_files_spellchecked_fixed:
    print(f"Processing {o}")
    xml_tree = etree.parse(o)
    token_nb_spellchecked += count_tokens(xml_tree)
print(f"Spellchecked data has {token_nb_spellchecked} tokens")


And now, a demonstration of how to get the most common annotations (= named entities) out of the output. The function that we have made to count annotations returns a [Counter](https://www.guru99.com/python-counter-collections-example.html), which is a very handy data structure that allows for comparison between different instances.
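
As a brief illustration of the `Counter` operations used in the next cell, the sketch below accumulates counts with `+=` and lists the most frequent keys. The keys are invented and only mimic the `type|tokens` format produced by `count_annotations`.

```python
from collections import Counter

# Invented per-file counts in the "type|tokens" key format
file_a = Counter({"nam_loc|Polska": 3, "nam_liv|Jan Kowalski": 1})
file_b = Counter({"nam_loc|Polska": 2, "nam_loc|Warszawa": 4})

total = Counter()
for counts in (file_a, file_b):
    total += counts  # adds the counts key by key

print(total.most_common(2))
```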

In [None]:
"""
    Let's check how annotation counts differ over the entire output
"""
# Raw
annotation_nb_raw = Counter()
for o in output_files_raw_fixed:
    print(f"Processing {o}")
    xml_tree = etree.parse(o)
    annotation_nb_raw += count_annotations(xml_tree)
print(f"Raw data annotations: {annotation_nb_raw.most_common(10)}")

# Spellchecked
annotation_nb_spellchecked = Counter()
for o in output_files_spellchecked_fixed:
    print(f"Processing {o}")
    xml_tree = etree.parse(o)
    annotation_nb_spellchecked += count_annotations(xml_tree)
print(f"Spellchecked data annotations: {annotation_nb_spellchecked.most_common(10)} ")
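
Subtracting one `Counter` from another with `-` keeps only the keys whose result is positive, so a single subtraction shows what one side has more of, not the full difference. The following sketch, with invented counts, shows how the two directions can be combined.

```python
from collections import Counter

# Invented annotation counts for one file in its two versions
raw = Counter({"nam_loc|Polska": 3, "nam_liv|Jan": 1})
spellchecked = Counter({"nam_loc|Polska": 5, "nam_loc|Warszawa": 2})

only_more_in_raw = raw - spellchecked  # zero and negative results are dropped
only_more_in_spellchecked = spellchecked - raw

total_difference = sum((only_more_in_raw + only_more_in_spellchecked).values())
print(only_more_in_raw)
print(only_more_in_spellchecked)
print(total_difference)
```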

In [None]:
""" 
    Now, your job is to investigate how the number of annotations differs per file 
"""
import matplotlib.pyplot as plt
from typing import List

def align_resources(resources_raw, resources_spelled):
    ret = []
    for rraw in resources_raw:
        for rspelled in resources_spelled:
            if os.path.basename(rraw).split(".")[0] == os.path.basename(rspelled).split(".")[0]:
                ret.append((rraw, rspelled))
    return ret

# We will obtain a list of tuples (raw, spellchecked) of file paths, so that we can build and compare a Counter for each version
aligned_resources = align_resources(output_files_raw_fixed, output_files_spellchecked_fixed)

"""
    Fill in for loop body
"""

diff_counts: List[int] = []

for rraw, rspelled in aligned_resources:
    # Sum all differences in the number of annotation occurrences. Note that 
    # a) you can use `+` and `-` on Counter instances
    # b) the result of `-` on Counters keeps only positive counts (zero and negative
    #    counts are discarded), so you can try: 
    #        diff_in_occur = sum(((counterA - counterB) + (counterB - counterA)).values())
    pass  # placeholder so the cell parses; replace with code that appends to diff_counts

In [None]:
"""
    Plot the differences
"""
    
fig = plt.figure()
plt.rcParams["figure.autolayout"] = True
ax = fig.add_axes([0,0,1,1])
resource_names = [os.path.basename(rraw).replace("_raw", "") for rraw, _ in aligned_resources]
plt.xticks(rotation=90)
ax.bar(resource_names, diff_counts)
plt.show()

In [None]:
"""
    SANDBOX
    
    Congratulations, you have finished our tutorial. Liner2 models come with different levels of granularity of annotations. 
    Here, we invite you to investigate n82, which introduces an additional level of granularity to the annotations we presented 
    in previous sections
"""
predefined_resource = predefined_result[0]

lpmn_client_task([predefined_resource], 'any2txt|wcrft2|liner2({"model":"n82"})', [f"{predefined_resource}_n82"])