#  🪜 Data Augmentation Step-by-Step (NER)

The goal of this experiment is to demonstrate how we can improve pretrained models like `en_core_web_lg` via data augmentation and domain adaptation. **Our task is to detect mentions of people in tweets, without an explicit @**. We will be using a Twitter dump from the [TweetEval benchmark](https://github.com/cardiffnlp/tweeteval), a domain vastly different from the datasets on which `en_core_web_lg` was trained.

We will look into techniques like substitution, weak supervision, and good old-fashioned data cleaning to train a more robust model. The outline of this notebook will be the following:

- [**Loading the dataset**](#loading-the-dataset): here, we download the assets from the spaCy projects repository. We'll also do a few cursory checks on the entities predicted by our baseline model (`en_core_web_lg`).
- [**Data augmentation**](#data-augmentation)
    - [via weak supervision](#weak-supervision): we will attempt to obtain better annotations for our training dataset using weak supervision. Instead of annotating our training data by hand, we will use a library called [skweak](https://github.com/NorskRegnesentral/skweak) ([paper](https://arxiv.org/abs/2104.09683)).
    - [via token substitution](#token-substitution): in addition to improving our annotations, we will increase the size of our training data by substituting a few words, or rephrasing tweets in another form.
- [**Evaluation**](#evaluation)

## Loading the dataset

First, we need to obtain all our assets from the repo. Our input dataset is based on the [TweetEval benchmark](https://github.com/cardiffnlp/tweeteval). You can run the following command to download them:

In [1]:
!spacy project assets


⚠ No assets specified in project.yml



Upon running this command, you'll get the `train.spacy` and `test.spacy` files in the `assets/` directory. 
The file `assets/test.spacy` was annotated using [Prodigy](https://prodi.gy/). We will be using this held-out test set to compare our baseline `en_core_web_lg` spaCy model to the one trained using augmented data.

Let's have a quick look at the training dataset and see how `en_core_web_lg` fares in detecting `PER(SON)` entities.

In [2]:
import spacy
from spacy.tokens import DocBin

In [3]:
spacy.__version__  # should be >= 3.1.3

'3.1.3'

In [4]:
db = DocBin()
nlp = spacy.load("en_core_web_lg")
docs = list(db.from_disk("assets/train.spacy").get_docs(nlp.vocab))

Let's check what the baseline NER output looks like:

In [5]:
#for doc in docs[:3]:
#    print("==================================")
#    print(f"Tweet: {doc.text}")
#    for ent in doc.ents:
#        print(f"{ent.text}, {ent.label_}")

As you may notice, `en_core_web_lg`'s NER misses some entities, probably because the domain it was trained on differs from how tweets are usually structured. We will therefore fine-tune this model with a better dataset.

## Data augmentation

In this section, we will perform some data augmentation tasks to improve our dataset. As outlined above, we will use weak supervision to improve our training data annotations, and token substitution, along with other business rules, to increase the size of our training data.

### Weak supervision

Weak supervision involves designing labelling functions to annotate our dataset. These functions act as heuristics. For example, to annotate `MONEY`, I might look for tokens that start with a `$`. Such heuristics aren't perfect, because they can miss a lot of nuance (e.g., "50 bucks", "30 grand", etc.).
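To make the `MONEY` example concrete, here is a minimal sketch of such a heuristic using only the standard library. The regular expression and the `label_money` helper are hypothetical and are not part of `skweak`; they only illustrate how a simple rule catches the obvious cases and misses the rest.

```python
import re

# Hypothetical pattern: a dollar sign, digits with optional thousands
# separators, and an optional decimal part.
DOLLAR_AMOUNT = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")


def label_money(text):
    """Return (start, end, label) character spans for dollar amounts."""
    return [(m.start(), m.end(), "MONEY") for m in DOLLAR_AMOUNT.finditer(text)]


text = "Paid $1,200.50 for the laptop and 50 bucks for the case"
spans = label_money(text)
for start, end, label in spans:
    print(text[start:end], label)
```

The rule finds `$1,200.50` but says nothing about "50 bucks", which is exactly the kind of gap that motivates combining several weak annotators.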

Of course, there are also probabilistic approaches to writing heuristics, and we will apply them here. Later on, once we have enough labelling functions, we can **pool** them into a single annotator that combines all these weak annotators. **We will be using `skweak` for this task**:

In [6]:
import skweak
skweak.__version__

'0.2.13'

We will use three types of annotators:
- **Model-based:** we can use other NER models that were trained on other distinct corpora to improve our dataset. Because `skweak` is well-integrated into spaCy, it is possible to reuse some of our baseline models to guide the augmentation.
- **Gazetteers:** a gazetteer is a standard lookup annotator where you simply provide a list of entities that you want to be annotated. This is useful if you want to zero in on a specific, well-defined, and finite set of groups (e.g. board of directors of your company, all publicly-listed companies, etc.).
- **Heuristic patterns:** of course, it's still possible to add rule-based patterns that may be useful given your domain. For `skweak`, these patterns should be defined on the `spaCy` document itself.

For each annotator, we will create an instance and apply it to a sample tweet just to see how it works. We will use the `skweak.utils.display_entities` function to display partial results.

In [7]:
# a sample tweet for demonstration
tweet = "So disappointed in wwe summerslam! I want to see john cena wins his 16th title"
doc = nlp(tweet)

#### Model-based annotators

Instead of providing handcrafted annotations, we can tap a model that was trained on another corpus to aid us in labelling.
We will use the following model-based annotators:
- A spaCy base model (`en_core_web_lg`). We can reuse this model to guide how our dataset is augmented.
- Another model trained on the [Broad Twitter Corpus (BTC)](https://github.com/GateNLP/broad_twitter_corpus). The model file is too big to include in this repository, but you can download it from [`skweak`'s Release assets (btc.tar.gz)](https://github.com/NorskRegnesentral/skweak/releases/tag/0.2.8).

In [8]:
# a spaCy base model (en_core_web_lg)
core_web_annotator = skweak.spacy.ModelAnnotator(name="core_web_lg", model_path="en_core_web_lg")
skweak.utils.display_entities(doc=core_web_annotator(doc), layer="core_web_lg")

In [9]:
# a model trained from the Broad Twitter Corpus
btc_annotator = skweak.spacy.ModelAnnotator(name="btc", model_path="assets/data/btc")
skweak.utils.display_entities(doc=btc_annotator(doc), layer="btc")



#### Gazetteer-based annotators

Aside from using probabilistic models, we can also take advantage of other large datasets. In the weak supervision literature, labelling functions based on such lookup lists are called gazetteers. We will use the following datasets:
- Wikipedia: extracted from the NECKar dataset. This gives us names of famous personalities to aid us in NER.
- Crunchbase: from the Open Data Map of Crunchbase. This gives us names of various business personalities.

First, we have to load the JSON data into a [trie](https://en.wikipedia.org/wiki/Trie), a data structure for efficient search, and create an annotator based on it.

In [10]:
# a wikipedia based annotator
tries = skweak.gazetteers.extract_json_data("assets/wikidata_tokenised.json")
wiki_annotator = skweak.gazetteers.GazetteerAnnotator(name="wiki_uncased", tries=tries, case_sensitive=False)
skweak.utils.display_entities(doc=wiki_annotator(doc), layer="wiki_uncased")

Extracting data from assets/wikidata_tokenised.json
Populating trie for class PERSON (number: 2621131)
Populating trie for class LOC (number: 47104)
Populating trie for class GPE (number: 601419)
Populating trie for class ORG (number: 295449)
Populating trie for class PRODUCT (number: 12457)


In [11]:
# a crunchbase based annotator
#
# @note: the tokenizer for this gazetteer seems to rely on en_core_web_md.
# you can download this by running:
# python -m spacy download en_core_web_md
tries = skweak.gazetteers.extract_json_data("assets/crunchbase.json")
crunchbase_annotator = skweak.gazetteers.GazetteerAnnotator(name="crunchbase_uncased", tries=tries, case_sensitive=False)
skweak.utils.display_entities(doc=crunchbase_annotator(doc), layer="crunchbase_uncased")

Extracting data from assets/crunchbase.json
Populating trie for class COMPANY (number: 788714)
Populating trie for class ORG (number: 261)
Populating trie for class PERSON (number: 1062669)
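
To give some intuition for why a trie suits gazetteer lookup, here is a minimal sketch of a token-level trie with case-insensitive, longest-match search. It is a simplified stand-in written for illustration, not `skweak`'s implementation; `build_trie`, `find_matches`, and the tiny entry list are all hypothetical.

```python
END = object()  # sentinel key marking that a complete entry ends at this node


def build_trie(entries):
    """Build a nested-dict trie from tokenised entries such as ("john", "cena")."""
    root = {}
    for entry in entries:
        node = root
        for token in entry:
            node = node.setdefault(token.lower(), {})
        node[END] = True
    return root


def find_matches(tokens, trie):
    """Return (start, end) token spans, keeping the longest entry at each position."""
    matches = []
    i = 0
    while i < len(tokens):
        node = trie
        longest_end = None
        j = i
        # Walk down the trie for as long as the next token is a known child.
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if END in node:
                longest_end = j
        if longest_end is not None:
            matches.append((i, longest_end))
            i = longest_end
        else:
            i += 1
    return matches


trie = build_trie([("john", "cena"), ("john",), ("summerslam",)])
tokens = "I want to see john cena wins his 16th title".split()
matches = find_matches(tokens, trie)
print([tokens[start:end] for start, end in matches])
```

Each lookup only walks as deep as the longest matching entry, so the cost per token position does not grow with the number of names stored, which matters with millions of entries.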


#### Heuristic patterns

Lastly, we can apply business rules and other domain-specific functions to improve our annotations. We can create annotators for detecting person names, for checking cases, and for identifying proper nouns. When using `skweak` and `spaCy`, you can define a labelling function that **takes a spaCy `Doc` as input and generates text spans with a label**.

In [12]:
# General annotator for proper nouns
proper_noun_annotator = skweak.heuristics.TokenConstraintAnnotator(
    name="proper", constraint=skweak.utils.is_likely_proper, label="ENT"
)

# Another annotator that considers other name formulations
NAME_PREFIXES = {
    "-",
    "von",
    "van",
    "de",
    "di",
    "le",
    "la",
    "het",
    "'t'",
    "dem",
    "der",
    "den",
    "d'",
    "ter",
}
proper_noun_annotator_name_prefix = skweak.heuristics.TokenConstraintAnnotator(
    "proper_prefix", skweak.utils.is_likely_proper, "ENT"
)
proper_noun_annotator_name_prefix.add_gap_tokens(NAME_PREFIXES)

# Since these are similar in nature, let's just combine them
combined = skweak.base.CombinedAnnotator()
for annotator in [proper_noun_annotator, proper_noun_annotator_name_prefix]:
    annotator.add_gap_tokens(["'s", "-"])
    combined.add_annotator(annotator)

skweak.utils.display_entities(combined(doc), "proper")
skweak.utils.display_entities(combined(doc), "proper_prefix")
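
The idea of gap tokens can be sketched without spaCy. The function below is a simplified, hypothetical stand-in for the annotators above: it uses `str.istitle()` in place of `skweak.utils.is_likely_proper` (which applies more elaborate checks) and assumes that a gap token may sit inside a span but may not start or end one.

```python
def proper_noun_spans(tokens, gap_tokens=frozenset()):
    """Group runs of title-cased tokens into (start, end) spans.

    Tokens listed in gap_tokens (e.g. "van", "de") are allowed inside a
    span, but a span always starts and ends on a title-cased token.
    """
    spans = []
    i = 0
    while i < len(tokens):
        if not tokens[i].istitle():
            i += 1
            continue
        start = i
        end = i + 1  # exclusive end, just past the last title-cased token
        j = i + 1
        while j < len(tokens):
            if tokens[j].istitle():
                end = j + 1
            elif tokens[j] not in gap_tokens:
                break
            j += 1
        spans.append((start, end))
        i = end
    return spans


tokens = "we saw Ludwig van Beethoven in Bonn".split()
without_gaps = proper_noun_spans(tokens)
with_gaps = proper_noun_spans(tokens, gap_tokens={"van"})
print([tokens[s:e] for s, e in without_gaps])
print([tokens[s:e] for s, e in with_gaps])
```

Without gap tokens the name is split into two separate spans; with `"van"` allowed as a gap, the full name is kept together as one span.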

It's also possible to write up some heuristics for detecting full names. Below is a `FullNameDetector` implemented in the `skweak` tutorials that we will adapt into this project:

In [13]:
from spacy.tokens import Span
import json

# Full name annotator from https://github.com/NorskRegnesentral/skweak/blob/main/examples/ner/conll2003_ner.py
class FullNameDetector:
    """Search for occurrences of full person names (first name followed by at least one title token)"""

    def __init__(self):
        with open("assets/first_names.json") as f:
            self.first_names = set(json.load(f))

    def __call__(self, span: Span) -> bool:
        # We assume full names are between 2 and 5 tokens        
        if len(span) < 2 or len(span) > 5:
            return False

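        # The first token must be a known first name, and the last token
        # must be an alphabetic, title-cased word.
        return (
            span[0].text in self.first_names
            and span[-1].is_alpha
            and span[-1].is_title
        )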


full_name_detector = skweak.heuristics.SpanConstraintAnnotator(
    name="full_name_detector",
    other_name="proper",  # where the detector will base its tokens from
    constraint=FullNameDetector(),
    label="PERSON",
)
skweak.utils.display_entities(
    full_name_detector(doc), layer="full_name_detector"
)

### Interlude: Standardizing output labels

You might notice that labelling is inconsistent across some of our annotators. For example, the BTC annotator uses `PER` for "person," whereas our heuristic annotators use `PERSON`. Let's correct this with another annotator that applies explicit rules to rewrite the spans under a common label set.

In [14]:
class Standardizer(skweak.base.SpanAnnotator):
    def __init__(self):
        super(Standardizer, self).__init__("")

    def __call__(self, doc):
        for source in doc.spans:
            new_spans = []
            for span in doc.spans[source]:
                if "\n" in span.text:
                    continue
                elif span.label_ == "PERSON":
                    new_spans.append(Span(doc, span.start, span.end, label="PER"))
                elif span.label_ in ["ORGANIZATION", "ORGANISATION", "COMPANY"]:
                    new_spans.append(Span(doc, span.start, span.end, label="ORG"))
                elif span.label_ in ["GPE"]:
                    new_spans.append(Span(doc, span.start, span.end, label="LOC"))
                # fmt: off
                elif span.label_ in ["EVENT", "FAC", "LANGUAGE", "LAW", "NORP", "PRODUCT", "WORK_OF_ART"]:
                # fmt: on
                    new_spans.append(Span(doc, span.start, span.end, label="MISC"))
                else:
                    new_spans.append(span)
            doc.spans[source] = new_spans
        return doc


standardizer = Standardizer()

### Combining all models together 

Now that we have created a variety of annotators, we can **combine them all together to estimate a label model**. The aggregation step is done by a hidden Markov model (HMM). We will use this unified annotator to label our training dataset.

In [15]:
unified_annotator = skweak.base.CombinedAnnotator()
annotators = [
    core_web_annotator,
    btc_annotator,
    wiki_annotator,
    crunchbase_annotator,
    combined,  # proper_noun_annotator + proper_noun_annotator_name_prefix
    full_name_detector,
    standardizer,
]

for annot in annotators:
    unified_annotator.add_annotator(annot)
    
print(f"Total number of annotators: {len(unified_annotator.annotators)}")

Total number of annotators: 7


We will then use the combined annotator to annotate our training set, and fit a hidden Markov model afterwards.

In [16]:
%%time
# Running this may take some time
docs_annotated = list(unified_annotator.pipe(docs))

CPU times: user 7min 51s, sys: 597 ms, total: 7min 52s
Wall time: 7min 52s


Now, we start fitting a hidden Markov model. Under the hood, we take the annotations provided by all our annotators and estimate a statistical model that generalizes over them. This is hard to do directly because we have no knowledge of the true (ground-truth) labels, so we use the [Baum-Welch algorithm](https://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm) for estimation.

In [17]:
%%time
label_model = skweak.aggregation.HMM("hmm", ["LOC", "MISC", "ORG", "PER"])
label_model.add_underspecified_label("ENT", ["LOC", "MISC", "ORG", "PER"])
label_model.fit(docs_annotated)

Starting iteration 1
problem found for token \""""Vow\ in hmm
CPU times: user 43 s, sys: 30.1 ms, total: 43.1 s
Wall time: 43 s
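
For intuition about what aggregation means, here is a much simpler stand-in: per-token majority voting over the labels proposed by each annotator. This is not what the HMM above does. Roughly speaking, the HMM also estimates how reliable each annotator is and models label sequences, whereas this sketch treats every vote equally. The `majority_vote` function and the example votes are hypothetical.

```python
from collections import Counter


def majority_vote(token_votes):
    """Pick one label per token from the labels proposed by several annotators.

    token_votes holds one list per token; each list contains the labels
    voted for that token, with "O" meaning "not an entity".
    """
    result = []
    for votes in token_votes:
        if not votes:
            result.append("O")  # no annotator said anything about this token
            continue
        # most_common(1) gives [(label, count)]; ties go to the label seen first.
        label, _ = Counter(votes).most_common(1)[0]
        result.append(label)
    return result


# Votes from three imaginary annotators for the tokens "john", "cena", "wins"
votes = [
    ["PER", "PER", "O"],
    ["PER", "ORG", "PER"],
    ["O", "O", "O"],
]
aggregated = majority_vote(votes)
print(aggregated)
```

Majority voting breaks down when many weak annotators share the same blind spot, which is one reason to estimate a label model instead.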


In [18]:
label_model.save("assets/data/hmm_sample.pkl")

In [19]:
%%time
# Running this may take some time
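# The annotators add their spans to the docs in place, so these are the same
# Doc objects as in `docs`, now carrying every annotation layer.
docs_annotated_hmm = list(label_model.pipe(docs_annotated))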

  forward_lattice.max(axis=1)[:, np.newaxis]


CPU times: user 37.5 s, sys: 10.1 ms, total: 37.6 s
Wall time: 37.5 s


In [20]:
for i in range(100):
    skweak.utils.display_entities(docs[i], "hmm")

### Token substitution