# Token-Based Unification

Unification, in its most rudimentary form, works by either merging or 
overwriting annotations, so long as their token lists align. We provide a 
common segmentation and tokenization strategy to all NLP-Lab packages via the 
[syntok](https://github.com/fnl/syntok) library. Let's see how all this works.
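
To make the alignment requirement concrete, here is a minimal sketch of what "aligned" could mean for two token lists: same length, and the same text and character offsets at each position. The helper `token_lists_align` is hypothetical and is not part of pyjsonnlp; the field names are taken from the token output shown later in this notebook.

```python
from typing import Dict, List

Token = Dict[str, object]


def token_lists_align(a: List[Token], b: List[Token]) -> bool:
    """Return True when both lists hold the same tokens at the same offsets.

    Hypothetical helper: an assumption about what "aligned" means here,
    not the check that pyjsonnlp itself performs.
    """
    if len(a) != len(b):
        return False
    keys = ("text", "characterOffsetBegin", "characterOffsetEnd")
    return all(
        all(tok_a.get(k) == tok_b.get(k) for k in keys)
        for tok_a, tok_b in zip(a, b)
    )


# Toy token lists that mimic the JSON-NLP token fields used below
spacy_tokens = [
    {"id": 1, "text": "Mary", "characterOffsetBegin": 0, "characterOffsetEnd": 4},
    {"id": 2, "text": "asked", "characterOffsetBegin": 5, "characterOffsetEnd": 10},
]
flair_tokens = [dict(t) for t in spacy_tokens]

print(token_lists_align(spacy_tokens, flair_tokens))      # True
print(token_lists_align(spacy_tokens, flair_tokens[:1]))  # False
```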

## Remote Pipelines (Spacy)

To work easily with remote microservices, we will wrap them in the `RemotePipeline`
proxy class:
 

In [34]:
from pyjsonnlp.pipeline import RemotePipeline

sample_text = "Mary asked John to buy some apples. He bought them yesterday."
spacy_pipeline = RemotePipeline('https://api.linguistic.technology/spacy')

spacy_json = spacy_pipeline.process(sample_text, coreferences=True, dependencies=True)

Here we instantiate a pipeline and call it, requesting only coreference
and dependency annotations. Let's look at what's inside.

#### Token List

In [16]:
import json

mary = spacy_json['documents'][1]['tokenList'][1]
asked = spacy_json['documents'][1]['tokenList'][2]
print(json.dumps(mary, indent=2))
print(json.dumps(asked, indent=2))

{
  "id": 1,
  "text": "Mary",
  "lemma": "Mary",
  "xpos": "NNP",
  "upos": "PROPN",
  "entity_iob": "B",
  "characterOffsetBegin": 0,
  "characterOffsetEnd": 4,
  "lang": "en",
  "features": {
    "Overt": "Yes",
    "Stop": "No",
    "Alpha": "Yes",
    "NounType": "Prop",
    "Number": "Sing",
    "Foreign": "No"
  },
  "misc": {
    "SpaceAfter": "Yes"
  },
  "shape": "Xxxx",
  "entity": "PERSON"
}
{
  "id": 2,
  "text": "asked",
  "lemma": "ask",
  "xpos": "VBD",
  "upos": "VERB",
  "entity_iob": "O",
  "characterOffsetBegin": 5,
  "characterOffsetEnd": 10,
  "lang": "en",
  "features": {
    "Overt": "Yes",
    "Stop": "No",
    "Alpha": "Yes",
    "VerbForm": "Fin",
    "Tense": "Past",
    "Foreign": "No"
  },
  "misc": {
    "SpaceAfter": "Yes"
  },
  "shape": "xxxx"
}


#### Coreferences
Here we see the coreference chains for these two sentences. Note
that tokens are indexed at the document level and are referred
to by their `id`, which is also their key in the `tokenList` dictionary.

In [17]:
coref = spacy_json['documents'][1]['coreferences']
print(json.dumps(coref, indent=2))


[
  {
    "id": 0,
    "representative": {
      "tokens": [
        3
      ],
      "head": 3
    },
    "referents": [
      {
        "tokens": [
          9
        ],
        "head": 9
      }
    ]
  },
  {
    "id": 1,
    "representative": {
      "tokens": [
        6,
        7
      ],
      "head": 7
    },
    "referents": [
      {
        "tokens": [
          11
        ],
        "head": 11
      }
    ]
  }
]


Let's replace the token ids in the first chain with the text of the tokens they point to:


In [19]:
from typing import List
import copy

tokens = spacy_json['documents'][1]['tokenList']

def id_list_to_text(id_list: List[int]) -> List[str]:
    # Look up each token by its id and return its surface text
    return [tokens[t_id]['text'] for t_id in id_list]

coref = copy.deepcopy(spacy_json['documents'][1]['coreferences'][0])
coref['representative']['tokens'] = id_list_to_text(coref['representative']['tokens'])
coref['representative']['head'] = tokens[coref['representative']['head']]['text']
for referent in coref['referents']:
    referent['tokens'] = id_list_to_text(referent['tokens'])
    referent['head'] = tokens[referent['head']]['text']

print(json.dumps(coref, indent=2))


{
  "id": 0,
  "representative": {
    "tokens": [
      "John"
    ],
    "head": "John"
  },
  "referents": [
    {
      "tokens": [
        "He"
      ],
      "head": "He"
    }
  ]
}


In future releases, this helper code will be encapsulated in the pyjsonnlp library.
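
Until then, the same lookup can be wrapped in a small reusable function that resolves every chain at once. This is only a sketch: `resolve_chain` is a hypothetical name, and the toy token dictionary below stands in for the real `tokenList`, assuming ids start at 1 as in the output above.

```python
import copy
from typing import Dict, List


def resolve_chain(chain: dict, tokens: Dict[int, dict]) -> dict:
    """Return a copy of a coreference chain with token ids replaced by text."""
    resolved = copy.deepcopy(chain)  # leave the original annotation untouched
    mentions = [resolved["representative"], *resolved["referents"]]
    for mention in mentions:
        mention["tokens"] = [tokens[t_id]["text"] for t_id in mention["tokens"]]
        mention["head"] = tokens[mention["head"]]["text"]
    return resolved


# Toy stand-in for the tokenList: a dictionary keyed by token id, starting at 1
words = "Mary asked John to buy some apples . He bought them yesterday .".split()
toy_tokens = {i: {"id": i, "text": w} for i, w in enumerate(words, start=1)}

# The two chains from the coreference output above
chains: List[dict] = [
    {"id": 0,
     "representative": {"tokens": [3], "head": 3},
     "referents": [{"tokens": [9], "head": 9}]},
    {"id": 1,
     "representative": {"tokens": [6, 7], "head": 7},
     "referents": [{"tokens": [11], "head": 11}]},
]

resolved = [resolve_chain(chain, toy_tokens) for chain in chains]
print(resolved[1]["representative"]["tokens"])  # ['some', 'apples']
print(resolved[1]["referents"][0]["head"])      # them
```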
## Flair
Now let's get some annotations from [Flair](https://github.com/zalandoresearch/flair).
We follow the same process:

In [20]:
flair_pipeline = RemotePipeline('https://api.linguistic.technology/flair')
flair_json = flair_pipeline.process(sample_text, expressions=True)
flair_mary = flair_json['documents'][1]['tokenList'][1]
print(json.dumps(flair_mary, indent=2))

{
  "id": 1,
  "text": "Mary",
  "characterOffsetBegin": 0,
  "characterOffsetEnd": 4,
  "features": {
    "Overt": "Yes"
  },
  "scores": {
    "upos": 0.9905551075935364,
    "xpos": 0.9998297691345215,
    "entity": 0.9995647072792053
  },
  "misc": {
    "SpaceAfter": "Yes"
  },
  "upos": "PROPN",
  "xpos": "NNP",
  "entity": "S-PER",
  "entity_iob": "B"
}


Let's compare this with Spacy's annotation of the same token:


In [21]:
print(json.dumps(mary, indent=2))

{
  "id": 1,
  "text": "Mary",
  "lemma": "Mary",
  "xpos": "NNP",
  "upos": "PROPN",
  "entity_iob": "B",
  "characterOffsetBegin": 0,
  "characterOffsetEnd": 4,
  "lang": "en",
  "features": {
    "Overt": "Yes",
    "Stop": "No",
    "Alpha": "Yes",
    "NounType": "Prop",
    "Number": "Sing",
    "Foreign": "No"
  },
  "misc": {
    "SpaceAfter": "Yes"
  },
  "shape": "Xxxx",
  "entity": "PERSON"
}


We see here that while most annotations align, some are present
in Spacy only, some are present in Flair only, and some, such as `entity`,
follow different schemes. Scheme unification is an ongoing project for NLP-Lab.
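
The comparison can also be done programmatically. The following sketch uses plain dictionaries, abbreviated from the two outputs above, and a hypothetical helper `compare_tokens` to list the keys that appear on only one side and the shared keys whose values differ.

```python
def compare_tokens(a: dict, b: dict) -> dict:
    """Summarize how two token annotations differ, key by key."""
    shared = a.keys() & b.keys()
    return {
        "only_a": sorted(a.keys() - b.keys()),
        "only_b": sorted(b.keys() - a.keys()),
        "differing": sorted(k for k in shared if a[k] != b[k]),
    }


# Abbreviated copies of the two "Mary" tokens shown above
spacy_mary = {"text": "Mary", "lemma": "Mary", "upos": "PROPN",
              "entity": "PERSON", "shape": "Xxxx"}
flair_mary = {"text": "Mary", "upos": "PROPN", "entity": "S-PER",
              "scores": {"entity": 0.9995647072792053}}

diff = compare_tokens(spacy_mary, flair_mary)
print(diff["only_a"])     # ['lemma', 'shape']
print(diff["only_b"])     # ['scores']
print(diff["differing"])  # ['entity']
```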

Let's see what we can do so far.

## Unification
First, we'll instantiate a unification object:

In [22]:
from pyjsonnlp.unification import Unifier
unifier = Unifier()

There are two modes, `add` and `overwrite`. They differ in how they handle
keys that are present in both objects, for example `entity`: `add` keeps the
existing value, whereas `overwrite` replaces it. We'll start with a
non-destructive unification of just the tokens.
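
The two behaviours can be sketched with plain dictionaries. This is an illustration only, not the `Unifier` implementation: `add_from` and `overwrite_with` are hypothetical names, and the merge is shallow, so the library's handling of nested keys such as `features` may differ.

```python
import copy


def add_from(a: dict, b: dict) -> dict:
    """Non-destructive: copy keys from b only where a has no value yet."""
    merged = copy.deepcopy(a)
    for key, value in b.items():
        merged.setdefault(key, copy.deepcopy(value))
    return merged


def overwrite_with(a: dict, b: dict) -> dict:
    """Destructive: values from b replace values in a for shared keys."""
    merged = copy.deepcopy(a)
    merged.update(copy.deepcopy(b))
    return merged


# Toy tokens, abbreviated from the annotations above
a_token = {"text": "Mary", "entity": "PERSON", "shape": "Xxxx"}
b_token = {"text": "Mary", "entity": "S-PER", "scores": {"entity": 0.99}}

added = add_from(a_token, b_token)
overwritten = overwrite_with(a_token, b_token)
print(added["entity"])        # PERSON
print(overwritten["entity"])  # S-PER
```

In both cases keys that exist on only one side (`shape`, `scores`) survive; only the treatment of shared keys changes.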
### Non-destructive token unification

In [26]:
non_destructive = unifier.add_annotation_to_a_from_b(a=spacy_json, b=flair_json, annotation='tokens')
non_destructive_mary = non_destructive['documents'][1]['tokenList'][1]
print(json.dumps(non_destructive_mary, indent=2))

{
  "id": 1,
  "text": "Mary",
  "lemma": "Mary",
  "xpos": "NNP",
  "upos": "PROPN",
  "entity_iob": "B",
  "characterOffsetBegin": 0,
  "characterOffsetEnd": 4,
  "lang": "en",
  "features": {
    "Overt": "Yes",
    "Stop": "No",
    "Alpha": "Yes",
    "NounType": "Prop",
    "Number": "Sing",
    "Foreign": "No"
  },
  "misc": {
    "SpaceAfter": "Yes"
  },
  "shape": "Xxxx",
  "entity": "PERSON",
  "scores": {
    "upos": 0.9905551075935364,
    "xpos": 0.9998297691345215,
    "entity": 0.9995647072792053
  }
}


Note that spacy's `PERSON` entity annotation is preserved, while flair's `scores`
annotation is added. Now we'll take a look at the alternative.
### Destructive token unification

In [27]:
destructive = unifier.overwrite_annotation_from_a_with_b(a=spacy_json, b=flair_json, annotation='tokens')
destructive_mary = destructive['documents'][1]['tokenList'][1]
print(json.dumps(destructive_mary, indent=2))

{
  "id": 1,
  "text": "Mary",
  "lemma": "Mary",
  "xpos": "NNP",
  "upos": "PROPN",
  "entity_iob": "B",
  "characterOffsetBegin": 0,
  "characterOffsetEnd": 4,
  "lang": "en",
  "features": {
    "Overt": "Yes",
    "Stop": "No",
    "Alpha": "Yes",
    "NounType": "Prop",
    "Number": "Sing",
    "Foreign": "No"
  },
  "misc": {
    "SpaceAfter": "Yes"
  },
  "shape": "Xxxx",
  "entity": "S-PER",
  "scores": {
    "upos": 0.9905551075935364,
    "xpos": 0.9998297691345215,
    "entity": 0.9995647072792053
  }
}


Note that the `entity` field is now `S-PER`, as per flair.
### Annotations available for unification
We currently support:
- tokens
- coreference
- expressions

Coreferences are merged intelligently by matching heads.
Expressions are either appended or overwritten entirely.
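
One plausible reading of "matching heads" is sketched below: two chains are treated as the same chain when their representative mentions share a head token, and referents from the second source are added if their head is not already present. This is an assumption made for illustration; `merge_chains_by_head` is a hypothetical function, the sequential `id` assignment is also assumed, and the actual `Unifier` logic may differ.

```python
import copy
from typing import List


def merge_chains_by_head(a: List[dict], b: List[dict]) -> List[dict]:
    """Merge coreference chains from b into a copy of a, keyed on the
    representative head token id (an assumed matching rule)."""
    merged = copy.deepcopy(a)
    by_head = {chain["representative"]["head"]: chain for chain in merged}
    for chain in b:
        head = chain["representative"]["head"]
        if head in by_head:
            # Same chain: add only referents whose head is not yet known
            target = by_head[head]
            known = {ref["head"] for ref in target["referents"]}
            for ref in chain["referents"]:
                if ref["head"] not in known:
                    target["referents"].append(copy.deepcopy(ref))
                    known.add(ref["head"])
        else:
            # New chain: append it with the next sequential id (assumption)
            new_chain = copy.deepcopy(chain)
            new_chain["id"] = len(merged)
            merged.append(new_chain)
            by_head[head] = new_chain
    return merged


chains_a = [
    {"id": 0, "representative": {"tokens": [3], "head": 3},
     "referents": [{"tokens": [9], "head": 9}]},
]
chains_b = [
    {"id": 0, "representative": {"tokens": [3], "head": 3},
     "referents": [{"tokens": [9], "head": 9}]},
    {"id": 1, "representative": {"tokens": [6, 7], "head": 7},
     "referents": [{"tokens": [11], "head": 11}]},
]

merged_chains = merge_chains_by_head(chains_a, chains_b)
print(len(merged_chains))                   # 2
print(len(merged_chains[0]["referents"]))   # 1
```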

Let's add flair's expressions to spacy, with a new example:
### Expression Unification

In [33]:
sample_text = "John loves New York City."
flair_json = flair_pipeline.process(sample_text, expressions=True)
spacy_json = spacy_pipeline.process(sample_text)
unified = unifier.overwrite_annotation_from_a_with_b(a=spacy_json, b=flair_json, annotation='expressions')
expressions = unified['documents'][1]['expressions']
print(json.dumps(expressions, indent=2))

[
  {
    "type": "NP",
    "scores": {
      "type": 0.8759699861208597
    },
    "tokens": [
      3,
      4,
      5
    ]
  }
]
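
For expressions, the two modes reduce to list operations: appending keeps both sources' entries, while overwriting replaces the list entirely. The sketch below uses a hypothetical helper `unify_expressions` and a made-up first-source expression to show the contrast; it is not the library's code.

```python
import copy
from typing import List


def unify_expressions(a: List[dict], b: List[dict], overwrite: bool) -> List[dict]:
    """Append b's expressions to a's, or replace a's list with b's."""
    if overwrite:
        return copy.deepcopy(b)
    return copy.deepcopy(a) + copy.deepcopy(b)


# Toy data: the first entry is invented for illustration,
# the second mirrors the token span in the output above
a_expressions = [{"type": "NP", "tokens": [1]}]
b_expressions = [{"type": "NP", "tokens": [3, 4, 5]}]

appended = unify_expressions(a_expressions, b_expressions, overwrite=False)
replaced = unify_expressions(a_expressions, b_expressions, overwrite=True)
print(len(appended))  # 2
print(replaced)       # [{'type': 'NP', 'tokens': [3, 4, 5]}]
```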


Stay tuned for more!
