# Retrieving Incremental Outputs

Here we use bidirectional models to perform sequence labelling incrementally on the texts of the reading corpora: each model re-processes a growing prefix of the text, one token at a time.
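
To make the setup concrete, here is a minimal sketch of this prefix-by-prefix processing. The `toy_tagger` function and its labels are hypothetical stand-ins for a real bidirectional model; only the loop over growing prefixes mirrors what the cells below do.

```python
def toy_tagger(tokens):
    """Hypothetical non-incremental tagger that sees the whole prefix at once.

    'her' is tagged PRON when it is the last token and DET otherwise,
    mimicking a model that changes its mind once more context arrives.
    """
    labels = []
    for position, token in enumerate(tokens):
        is_last = position == len(tokens) - 1
        if token == "her":
            labels.append("PRON" if is_last else "DET")
        elif token == "saw":
            labels.append("VERB")
        else:
            labels.append("NOUN")
    return labels


def incremental_outputs(tokens, tagger):
    """Re-run the tagger on every prefix and collect the partial outputs."""
    return [tagger(tokens[: i + 1]) for i in range(len(tokens))]


sentence = ["Mary", "saw", "her", "dog"]
partial_outputs = incremental_outputs(sentence, toy_tagger)
for step, labels in enumerate(partial_outputs):
    print(step, labels)
```

In this toy run the label of `her` changes from `PRON` to `DET` when `dog` is added, which is the kind of change the rest of the notebook records as a revision.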

We checked what each participant actually saw (single sentences or full texts) and feed the same units to the models, even though this may differ from the models' training setting.
According to the papers, participants saw 'texts', as encoded in the identifiers. For Nicenboim, these are actually sentences.

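We extract three kinds of revisions: normal, convenient and effective. A revision occurs when adding a token changes the labels already assigned to the previous prefix. It is convenient if the previous prefix differed from the final output, and effective if it increases the number of labels that agree with the final output.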
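
The three revision types can be illustrated on toy label sequences. This is a condensed sketch under the assumption that it follows the same logic as the helper functions defined later in this notebook; `classify_step` and the example labels are hypothetical and not part of the original code.

```python
NA_LABEL = 0  # label used for convenient/effective when no revision occurred


def classify_step(previous, current, final):
    """Return (revision, convenient, effective) flags for one time step.

    All three arguments are label lists of equal length: the previous
    output, and the current and final outputs cut to that length.
    """
    if previous == current:
        return 0, NA_LABEL, NA_LABEL
    # convenient: the previous prefix was not yet identical to the final one
    convenient = int(previous != final)
    # effective: more labels agree with the final output than before
    correct_before = sum(p == f for p, f in zip(previous, final))
    correct_after = sum(c == f for c, f in zip(current, final))
    effective = int(correct_after > correct_before)
    return 1, convenient, effective


final = ["NOUN", "VERB", "DET"]
fixes_error = classify_step(["NOUN", "NOUN", "PRON"], ["NOUN", "VERB", "PRON"], final)
adds_error = classify_step(["NOUN", "VERB", "PRON"], ["NOUN", "NOUN", "PRON"], final)
breaks_final = classify_step(["NOUN", "VERB", "DET"], ["NOUN", "VERB", "PRON"], final)
no_change = classify_step(["NOUN", "VERB", "DET"], ["NOUN", "VERB", "DET"], final)

print(fixes_error)   # (1, 1, 1)
print(adds_error)    # (1, 1, 0)
print(breaks_final)  # (1, 0, 0)
print(no_change)     # (0, 0, 0)
```

The second case shows that a revision can be convenient without being effective: the previous prefix needed a change, but the change that was made moved it further from the final output.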

In [1]:
from itertools import groupby
from pathlib import Path

import flair
import json
import numpy as np
import pandas as pd
import spacy
import stanza
import torch

from flair.models import SequenceTagger
from flair.data import Sentence
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

In [2]:
PATH_MODEL_DATA = Path('preprocessed/model_data/')
PATH_MODEL_OUTPUTS = Path('preprocessed/model_outputs/')

In [3]:
corpora = ['rastros_ptbr', 'potec_de', 'provo_en', 'nicenboim_es', 'mecol1_du', 'mecol2_enl2']

texts = {}

for corpus in corpora:
    PATH_TEXTS = Path(f'preprocessed/texts/{corpus}.json')
    with open(PATH_TEXTS, 'r') as file:
        texts[corpus] = json.load(file)
    # replace the non-breaking spaces (two cases) that the encoding left behind
    for text_id, text_dic in texts[corpus].items():
        for token_id, token in text_dic.items():
            if '\xa0' in token:
                print(corpus, token)
                texts[corpus][text_id][token_id] = token.replace('\xa0', ' ')

rastros_ptbr Egito usava
rastros_ptbr fragrância à
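
The loading cell above relies on each corpus file mapping text identifiers to dictionaries of tokens keyed by string positions. The following sketch shows that layout and the non-breaking-space clean-up on an in-memory example. The miniature JSON string is hypothetical; only the token `Egito usava` is taken from the printed output.

```python
import json

# Hypothetical miniature of one corpus file: text id -> token id -> token.
# The first token contains a non-breaking space (U+00A0).
raw = '{"1": {"0": "Egito\\u00a0usava", "1": "perfumes"}}'
texts_demo = json.loads(raw)

for text_id, tokens in texts_demo.items():
    for token_id, token in tokens.items():
        if "\xa0" in token:
            # overwrite the value in place; keys are unchanged, so iterating is safe
            tokens[token_id] = token.replace("\xa0", " ")

print(texts_demo)
```

Note that JSON object keys are always strings, which is why the prefix helper below indexes tokens with `str(j)`.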


In [4]:
# Label for convenient and effective revisions when no revision occurred (0 or NaN)
NA_LABEL = 0

In [5]:
def is_revision(previous, current):
    """Return 1 if a revision occurred, else 0.

    A revision occurs when the current prefix differs from the previous one.
    """
    assert len(previous) == len(current)
    if previous == current:
        return 0
    return 1


def is_convenient_revision(previous, current, final):
    """Return 1 if a convenient revision occurred, 0 if it was not convenient.

    A revision is convenient if the previous prefix differed from the final
    one. Returns NA_LABEL if no revision occurred.
    """
    assert len(previous) == len(final)
    if is_revision(previous, current) == 0:
        return NA_LABEL
    if previous == final:
        return 0
    return 1


def is_effective_revision(previous, current, final):
    """Return 1 if an effective revision occurred, 0 if it was not effective.

    A revision is effective if it increases the number of labels that match
    the final prefix. Returns NA_LABEL if no revision occurred.
    """
    assert len(previous) == len(final) and len(final) == len(current)
    if is_revision(previous, current) == 0:
        return NA_LABEL
    n_correct_before = sum(np.array(previous) == np.array(final))
    n_correct_after = sum(np.array(current) == np.array(final))
    if n_correct_after <= n_correct_before:
        return 0
    return 1


def get_partial_input(tokens, i):
    """Get prefix up to token i."""
    return " ".join([tokens[str(j)] for j in range(0, i+1)])


def get_prefixes(outputs, text_idx, i):
    """Return previous output, current prefix and final prefix up to label i."""
    previous_prefix = outputs[text_idx][i-1]
    # sometimes one interest area has more than one token (e.g. due to punctuation)
    # this will result in the current output prefix being extended with more 
    # than one label in one time step
    # still, we use only the length of the previous prefix for comparison
    # and consider everything that was added as 'one label' here
    previous_len = len(previous_prefix)
    current_prefix = outputs[text_idx][i][:previous_len]
    final_prefix = outputs[text_idx][-1][:previous_len]
    return previous_prefix, current_prefix, final_prefix


def index_df(df, texts):
    """Fill the Token column in a dataframe with tokens."""
    # add tokens to the df, the rest will be filled cell by cell
    for text_idx, tokens in texts.items():
        for i, token in tokens.items():
            # index row and column in a single .loc call to avoid chained assignment
            df.loc[f'text_{text_idx}_token_{i}', 'Token'] = token


def initialise_df(texts, tasks):
    """Create a dataframe with the standard structure."""
    columns = ['Token'] + [f'{prefix}revision:{name}' for name in tasks for prefix in ['', 'convenient-', 'effective-']]
    index = [f'text_{text_idx}_token_{i}' for text_idx, tokens in texts.items() for i in tokens]
    revisions = pd.DataFrame(columns=columns, index=index)
    index_df(revisions, texts)
    return revisions


def get_revised_signal(outputs, text_idx, i):
    """Return the labels for the revision dataframe."""
    # the first token by definition does not cause a revision
    if i == 0:
        revised, conv_revised, effec_revised = 0, NA_LABEL, NA_LABEL
    else:
        # check whether/which revisions occurred
        previous, current, final = get_prefixes(outputs, text_idx, i)
        revised = is_revision(previous, current)
        conv_revised = is_convenient_revision(previous, current, final)
        effec_revised = is_effective_revision(previous, current, final)
    return revised, conv_revised, effec_revised
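
The prefix helpers above can be exercised on a toy text in which one interest area yields more than one label. The sketch below re-declares simplified versions of `get_partial_input` and `get_prefixes` so that it runs on its own (here `get_prefixes` takes the outputs of a single text); the tokens and labels are hypothetical.

```python
def get_partial_input(tokens, i):
    """Join tokens 0..i into the input prefix."""
    return " ".join(tokens[str(j)] for j in range(i + 1))


def get_prefixes(outputs, i):
    """Cut the current and final outputs to the length of the previous output."""
    previous = outputs[i - 1]
    n = len(previous)
    return previous, outputs[i][:n], outputs[-1][:n]


tokens = {"0": "Yes,", "1": "she", "2": "left."}

# Hypothetical outputs: punctuation attached to a token receives its own label,
# so each step may add more than one label.
outputs = [
    ["NOUN", "PUNCT"],
    ["INTJ", "PUNCT", "PRON"],
    ["INTJ", "PUNCT", "PRON", "VERB", "PUNCT"],
]

print(get_partial_input(tokens, 1))
step_1 = get_prefixes(outputs, 1)
step_2 = get_prefixes(outputs, 2)
print(step_1)
print(step_2)
```

At step 1 the first label changes from `NOUN` to `INTJ`, so a revision is detected. At step 2 the two newly added labels fall outside the compared span and the first three labels are unchanged, so no revision is detected.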

## Explosion Pretrained Transformers

We extract outputs from the pretrained models at [Explosion's model hub in Hugging Face](https://huggingface.co/explosion). They are described in [this blogpost](https://explosion.ai/blog/ud-benchmarks-v3-2), where it says that:

> Aside from the tokenizer, the pipeline components are trained with a single transformer component using xlm-roberta-base, similar to Trankit Base. [...] The tokenizer is trained separately and the remaining components are trained sharing the same transformer component using multi-task learning.

The authors state that these models are meant only for benchmarking, but since they are comparable across many languages, we inspect their incremental outputs for the available tasks:

- POS-tagging with XPOS (UPOS does not seem to be available for all languages)
- Dependency parsing (predicting the heads and predicting the relations, treated as two separate tasks)

We retrieve the labels from the token attributes described in the [spaCy token documentation](https://spacy.io/api/token).

The model cards of the models we use are:

- [in Portuguese](https://huggingface.co/explosion/pt_udv25_portuguesebosque_trf)
- [in German](https://huggingface.co/explosion/de_udv25_germanhdt_trf)
- [in English](https://huggingface.co/explosion/en_udv25_englishewt_trf)
- [in Spanish](https://huggingface.co/explosion/es_udv25_spanishancora_trf)
- [in Dutch](https://huggingface.co/explosion/nl_udv25_dutchalpino_trf)

spaCy also provides its own [models](https://spacy.io/models), but transformer pipelines are not available for all languages. The repository is [here](https://github.com/explosion/spacy-models).

In [6]:
spacy.prefer_gpu()

True

In [7]:
model_family = 'hf-trf'

corpus_model_names = [
    ('rastros_ptbr', 'pt_udv25_portuguesebosque_trf'),
    ('potec_de', 'de_udv25_germanhdt_trf'),
    ('provo_en', 'en_udv25_englishewt_trf'),
    ('nicenboim_es', 'es_udv25_spanishancora_trf'),
    ('mecol1_du', 'nl_udv25_dutchalpino_trf'),
    ('mecol2_enl2', 'en_udv25_englishewt_trf')
]

In [8]:
HF_TASKS = ['pos', 'deprel', 'head']

def get_outputs(seq, attribute):
    """Return the label prefix."""
    return [getattr(token, attribute) for token in seq]


def get_head(seq):
    """Return the heads prefix."""
    return [str(token.head.i) for token in seq]


def create_hf_outputs_dic(corpus_name):
    """Initialise outputs dictionary to be filled with a list of increasing prefixes."""
    return {task: {idx: [] for idx in texts[corpus_name]} for task in HF_TASKS}
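
The extraction helpers above only read three attributes of each parsed token: `pos_`, `dep_` and `head.i`. The sketch below mimics that access pattern with plain stand-in objects so that it runs without downloading a model. The `make_doc` helper and the example annotations are hypothetical and are not spaCy's API.

```python
from types import SimpleNamespace


def make_doc(words, pos, deps, heads):
    """Build stand-in tokens exposing the attributes read in this notebook."""
    tokens = [
        SimpleNamespace(text=word, pos_=p, dep_=d, i=i)
        for i, (word, p, d) in enumerate(zip(words, pos, deps))
    ]
    for token, head_index in zip(tokens, heads):
        token.head = tokens[head_index]  # the root points to itself
    return tokens


doc = make_doc(
    words=["She", "left"],
    pos=["PRON", "VERB"],
    deps=["nsubj", "ROOT"],
    heads=[1, 1],
)

pos_labels = [getattr(token, "pos_") for token in doc]
deprel_labels = [getattr(token, "dep_") for token in doc]
head_labels = [str(token.head.i) for token in doc]

print(pos_labels, deprel_labels, head_labels)
```

Heads are stored as stringified token positions, so a change in attachment shows up as a changed label in the same way as a changed tag.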

In [9]:
for corpus_name, model_name in corpus_model_names:

    outputs = create_hf_outputs_dic(corpus_name)
    revisions = initialise_df(texts[corpus_name], HF_TASKS)
    model = spacy.load(model_name)

    for text_idx, tokens in tqdm(texts[corpus_name].items()):
        # first, get the sequence of partial outputs
        for i in range(len(tokens)):
            partial_input = get_partial_input(tokens, i)
            parsed = model(partial_input)
            
            outputs_pos = get_outputs(parsed, 'pos_')
            outputs['pos'][text_idx].append(outputs_pos)
            
            outputs_deprel = get_outputs(parsed, 'dep_')
            outputs['deprel'][text_idx].append(outputs_deprel)

            outputs_head = get_head(parsed)
            outputs['head'][text_idx].append(outputs_head)
                 
        # now, loop over the sequence of partial outputs and check for revisions and edits
        for task in HF_TASKS:
            for i in range(len(tokens)):
                # fill in the revisions dataframe (single .loc call avoids chained assignment)
                revised, conv_revised, effec_revised = get_revised_signal(outputs[task], text_idx, i)
                identifier = f'text_{text_idx}_token_{i}'
                revisions.loc[identifier, f'revision:{task}'] = revised
                revisions.loc[identifier, f'convenient-revision:{task}'] = conv_revised
                revisions.loc[identifier, f'effective-revision:{task}'] = effec_revised

    # save everything
    revisions.to_csv(PATH_MODEL_DATA / f'{corpus_name}_{model_family}_revisions.tsv', sep='\t')
    for task in HF_TASKS:
        with open(PATH_MODEL_OUTPUTS / f'{corpus_name}_{model_family}_{task}.json', 'w') as file:
            json.dump(outputs[task], file)

100%|██████████| 50/50 [01:35<00:00,  1.91s/it]
100%|██████████| 12/12 [01:43<00:00,  8.62s/it]
100%|██████████| 55/55 [01:44<00:00,  1.89s/it]
100%|██████████| 48/48 [00:26<00:00,  1.80it/s]
100%|██████████| 12/12 [02:06<00:00, 10.56s/it]
100%|██████████| 12/12 [01:22<00:00,  6.86s/it]


## Stanza Pretrained BiLSTM

Stanza's models use BiLSTMs, according to [their paper](https://aclanthology.org/2020.acl-demos.14/). Their repository is [here](https://github.com/stanfordnlp/stanza) and the models are listed [here](https://huggingface.co/stanfordnlp). The official website is [here](https://stanfordnlp.github.io/stanza/). Tutorials showing how to get each type of token annotation are [here](https://stanfordnlp.github.io/stanza/tutorials.html).

In [10]:
spacy.require_cpu()

#stanza.download('en') 
#stanza.download('pt') 
#stanza.download('es') 
#stanza.download('nl') 
#stanza.download("de")

True

In [11]:
model_family = 'stanza-bilstm'

all_tasks = ['upos', 'xpos', 'ner', 'deprel', 'head']
pt_tasks = ['upos', 'deprel', 'head']

corpus_model_names = [
    ('rastros_ptbr', 'pt', pt_tasks),
    ('potec_de', 'de', all_tasks),
    ('provo_en', 'en', all_tasks),
    ('nicenboim_es', 'es', all_tasks),
    ('mecol1_du', 'nl', all_tasks),
    ('mecol2_enl2', 'en', all_tasks)
]

In [12]:
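def get_stanza_outputs(doc, attribute):
    """Return the label prefix for one attribute of a Stanza document."""
    # NER tags are stored on tokens; all other annotations are stored on words
    if attribute == 'ner':
        return [str(getattr(token, attribute)) for sent in doc.sentences for token in sent.tokens]
    return [str(getattr(word, attribute)) for sent in doc.sentences for word in sent.words]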
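
The function above reads `ner` from tokens and every other attribute from words. In Stanza these two levels can differ in length when a multi-word token is expanded into several syntactic words. The sketch below illustrates this with plain stand-in objects instead of a real pipeline. The example sentence and its labels are hypothetical.

```python
from types import SimpleNamespace

# Stand-in for a one-sentence document: Spanish "del" is a single token
# that expands into the two words "de" and "el".
sentence = SimpleNamespace(
    tokens=[
        SimpleNamespace(text="Vengo", ner="O"),
        SimpleNamespace(text="del", ner="O"),
        SimpleNamespace(text="Perú", ner="S-LOC"),
    ],
    words=[
        SimpleNamespace(text="Vengo", upos="VERB"),
        SimpleNamespace(text="de", upos="ADP"),
        SimpleNamespace(text="el", upos="DET"),
        SimpleNamespace(text="Perú", upos="PROPN"),
    ],
)
doc = SimpleNamespace(sentences=[sentence])


def get_labels(doc, attribute):
    """Read NER labels from tokens and all other labels from words."""
    level = "tokens" if attribute == "ner" else "words"
    return [
        str(getattr(unit, attribute))
        for sent in doc.sentences
        for unit in getattr(sent, level)
    ]


ner_labels = get_labels(doc, "ner")
upos_labels = get_labels(doc, "upos")
print(len(ner_labels), ner_labels)
print(len(upos_labels), upos_labels)
```

Because prefixes are compared within one task at a time, the different lengths across tasks do not interfere with the revision computation.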

In [13]:
for corpus_name, model, tasks in corpus_model_names:

    revisions = initialise_df(texts[corpus_name], tasks)
    outputs = {task: {idx: [] for idx in texts[corpus_name]} for task in tasks}
    tagger = stanza.Pipeline(model, use_gpu=True) 

    for text_idx, tokens in tqdm(texts[corpus_name].items()):
        for i in range(len(tokens)):
            partial_input = get_partial_input(tokens, i)
            doc = tagger(partial_input)
            for task in tasks:
                output_tags = get_stanza_outputs(doc, task)
                outputs[task][text_idx].append(output_tags)
        
        for i in range(len(tokens)):
            for task in tasks:
                revised, conv_revised, effec_revised = get_revised_signal(outputs[task], text_idx, i)
                identifier = f'text_{text_idx}_token_{i}'
                # single .loc call per cell avoids chained assignment
                revisions.loc[identifier, f'revision:{task}'] = revised
                revisions.loc[identifier, f'convenient-revision:{task}'] = conv_revised
                revisions.loc[identifier, f'effective-revision:{task}'] = effec_revised

    for task in tasks:        
        with open(PATH_MODEL_OUTPUTS / f'{corpus_name}_{model_family}_{task}.json', 'w') as file:
            json.dump(outputs[task], file)

    revisions.to_csv(PATH_MODEL_DATA / f'{corpus_name}_{model_family}_revisions.tsv', sep='\t')

2023-06-27 16:16:04 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2023-06-27 16:16:05 INFO: Loading these models for language: pt (Portuguese):
| Processor    | Package |
--------------------------
| tokenize     | bosque  |
| mwt          | bosque  |
| pos          | bosque  |
| lemma        | bosque  |
| constituency | cintil  |
| depparse     | bosque  |

2023-06-27 16:16:05 INFO: Using device: cuda
2023-06-27 16:16:05 INFO: Loading: tokenize
2023-06-27 16:16:05 INFO: Loading: mwt
2023-06-27 16:16:05 INFO: Loading: pos
2023-06-27 16:16:06 INFO: Loading: lemma
2023-06-27 16:16:06 INFO: Loading: constituency
2023-06-27 16:16:06 INFO: Loading: depparse
2023-06-27 16:16:07 INFO: Done loading processors!
100%|██████████| 50/50 [07:32<00:00,  9.05s/it]
2023-06-27 16:23:40 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2023-06-27 16:23:42 INFO: Loading these models for language: de (German):
| Processor | Package      |
----------------------------
| tokenize  | gsd          |
| mwt       | gsd          |
| pos       | gsd          |
| lemma     | gsd          |
| depparse  | gsd          |
| sentiment | sb10k        |
| ner       | germeval2014 |

2023-06-27 16:23:42 INFO: Using device: cuda
2023-06-27 16:23:42 INFO: Loading: tokenize
2023-06-27 16:23:42 INFO: Loading: mwt
2023-06-27 16:23:42 INFO: Loading: pos
2023-06-27 16:23:42 INFO: Loading: lemma
2023-06-27 16:23:42 INFO: Loading: depparse
2023-06-27 16:23:43 INFO: Loading: sentiment
2023-06-27 16:23:43 INFO: Loading: ner
2023-06-27 16:23:45 INFO: Done loading processors!
100%|██████████| 12/12 [07:42<00:00, 38.54s/it]
2023-06-27 16:31:28 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2023-06-27 16:31:30 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| constituency | wsj       |
| depparse     | combined  |
| sentiment    | sstplus   |
| ner          | ontonotes |

2023-06-27 16:31:30 INFO: Using device: cuda
2023-06-27 16:31:30 INFO: Loading: tokenize
2023-06-27 16:31:30 INFO: Loading: pos
2023-06-27 16:31:31 INFO: Loading: lemma
2023-06-27 16:31:31 INFO: Loading: constituency
2023-06-27 16:31:32 INFO: Loading: depparse
2023-06-27 16:31:32 INFO: Loading: sentiment
2023-06-27 16:31:33 INFO: Loading: ner
2023-06-27 16:31:34 INFO: Done loading processors!
100%|██████████| 55/55 [10:36<00:00, 11.57s/it]
2023-06-27 16:42:10 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2023-06-27 16:42:12 INFO: Loading these models for language: es (Spanish):
| Processor    | Package  |
---------------------------
| tokenize     | ancora   |
| mwt          | ancora   |
| pos          | ancora   |
| lemma        | ancora   |
| constituency | combined |
| depparse     | ancora   |
| sentiment    | tass2020 |
| ner          | conll02  |

2023-06-27 16:42:12 INFO: Using device: cuda
2023-06-27 16:42:12 INFO: Loading: tokenize
2023-06-27 16:42:12 INFO: Loading: mwt
2023-06-27 16:42:12 INFO: Loading: pos
2023-06-27 16:42:13 INFO: Loading: lemma
2023-06-27 16:42:13 INFO: Loading: constituency
2023-06-27 16:42:14 INFO: Loading: depparse
2023-06-27 16:42:14 INFO: Loading: sentiment
2023-06-27 16:42:15 INFO: Loading: ner
2023-06-27 16:42:16 INFO: Done loading processors!
100%|██████████| 48/48 [02:15<00:00,  2.83s/it]
2023-06-27 16:44:32 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=No

2023-06-27 16:44:33 INFO: Loading these models for language: nl (Dutch):
| Processor | Package |
-----------------------
| tokenize  | alpino  |
| pos       | alpino  |
| lemma     | alpino  |
| depparse  | alpino  |
| ner       | conll02 |

2023-06-27 16:44:33 INFO: Using device: cuda
2023-06-27 16:44:33 INFO: Loading: tokenize
2023-06-27 16:44:33 INFO: Loading: pos
2023-06-27 16:44:34 INFO: Loading: lemma
2023-06-27 16:44:34 INFO: Loading: depparse
2023-06-27 16:44:35 INFO: Loading: ner
2023-06-27 16:44:36 INFO: Done loading processors!
100%|██████████| 12/12 [07:09<00:00, 35.83s/it]
2023-06-27 16:51:46 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2023-06-27 16:51:48 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| constituency | wsj       |
| depparse     | combined  |
| sentiment    | sstplus   |
| ner          | ontonotes |

2023-06-27 16:51:48 INFO: Using device: cuda
2023-06-27 16:51:48 INFO: Loading: tokenize
2023-06-27 16:51:48 INFO: Loading: pos
2023-06-27 16:51:49 INFO: Loading: lemma
2023-06-27 16:51:49 INFO: Loading: constituency
2023-06-27 16:51:50 INFO: Loading: depparse
2023-06-27 16:51:50 INFO: Loading: sentiment
2023-06-27 16:51:51 INFO: Loading: ner
2023-06-27 16:51:52 INFO: Done loading processors!
100%|██████████| 12/12 [08:59<00:00, 44.98s/it]
