# Introduction to augmenters

This notebook provides a short introduction to some of the augmentation tools included in `DaCy`. For information on how to conduct robustness tests of your models, please see `dacy-robustness.ipynb`.

Let's start by seeing how different augmenters change your text. The augmenters included in `DaCy` work on the `Example` class from spaCy, so let's write a small helper function that converts a `Doc` to an `Example`, and write some text to test on.

In [2]:
import spacy
from spacy.training import Example
from typing import List, Callable, Iterator

In [3]:
def doc_to_example(doc):
    # Use the same Doc as both the predicted and the reference side of the Example
    return Example(doc, doc)


In [4]:
nlp = spacy.load("da_core_news_sm")
doc = nlp("Peter Schmeichel mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England.")
example = doc_to_example(doc)

Let's see how some of the simple augmenters transform the text.

In [5]:
from spacy.training.augment import create_lower_casing_augmenter
from dacy.augmenters import (create_keyboard_augmenter, create_pers_augmenter,
                             create_spacing_augmenter, create_æøå_augmenter)
from dacy.datasets import danish_names

In [6]:
lower_aug = create_lower_casing_augmenter(level=1)
keyboard_05 = create_keyboard_augmenter(doc_level=1, char_level=0.05, keyboard="QWERTY_DA")
keyboard_15 = create_keyboard_augmenter(doc_level=1, char_level=0.15, keyboard="QWERTY_DA")
space_aug = create_spacing_augmenter(doc_level=1, spacing_level=0.4)


`lower_aug` will change all text to lowercase, `keyboard_05` and `keyboard_15` will change 5% or 15% of all characters to a character on a neighbouring key on a Danish QWERTY keyboard (replace `DA` with `EN` for English), and `space_aug` will remove 40% of all whitespace characters. The augmenters modify both the reference and the predicted `Doc`s in the `Example` and make sure that the annotations for NER, POS etc. remain correct. Let's see how the text looks.

In [7]:
lower = next(lower_aug(nlp, example))
key_05 = next(keyboard_05(nlp, example))
key_15 = next(keyboard_15(nlp, example))
space = next(space_aug(nlp, example))

In [8]:
for text in [lower.y.text, key_05.y.text, key_15.y.text, space.y.text]:
    print(text)


peter schmeichel mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod england.
Perer Schmeichel mener også, at det danzke landshold anno 2021 tilhører verdenstoppej og kan vimde den kommende kamp mod Englandæ
Pe5er Sfymeicheæ menee også, at det dajske landsgolx qjno 2+21 tilh-rer verxenstopæen og ksn vinde den kommende ma,p mod Wngland.
PeterSchmeichel mener også, atdet danskelandshold anno 2021tilhørerverdenstoppen og kanvinde den kommende kamp modEngland.


Pretty neat, right? 
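For intuition about what the character-level augmenters do, here is a minimal plain-Python sketch that works on raw strings. It is not `DaCy`'s implementation: the `NEIGHBOURS` map is a made-up, heavily truncated stand-in for a real keyboard layout, the function names are hypothetical, and unlike the real augmenters it does nothing to keep token annotations aligned.

```python
import random

# Hypothetical, truncated neighbour map for illustration only;
# a real layout such as "QWERTY_DA" covers the whole keyboard.
NEIGHBOURS = {
    "a": "qwsz",
    "e": "wrsd",
    "n": "bhjm",
    "s": "awedxz",
    "t": "ryfg",
}


def keyboard_noise(text: str, char_level: float, rng: random.Random) -> str:
    """Swap each mapped character for a neighbouring key with probability char_level."""
    out = []
    for ch in text:
        options = NEIGHBOURS.get(ch.lower())
        if options and rng.random() < char_level:
            # Note: the replacement is always lowercase in this sketch.
            out.append(rng.choice(options))
        else:
            out.append(ch)
    return "".join(out)


def remove_spaces(text: str, spacing_level: float, rng: random.Random) -> str:
    """Drop each space character with probability spacing_level."""
    return "".join(
        ch for ch in text if not (ch == " " and rng.random() < spacing_level)
    )


rng = random.Random(0)  # seeded so the noise is reproducible
sentence = "Peter Schmeichel mener at det danske landshold kan vinde"
print(keyboard_noise(sentence, char_level=0.15, rng=rng))
print(remove_spaces(sentence, spacing_level=0.4, rng=rng))
```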
`DaCy` also includes a more sophisticated augmenter for augmenting names. `create_pers_augmenter` is highly flexible: it can augment names to fit a certain pattern (e.g. first_name, last_name; abbreviated_first_name, last_name) or replace names with ones sampled from a dictionary. `DaCy` provides four utility functions for constructing such name dictionaries: `danish_names`, `female_names`, `male_names`, and `muslim_names` (see the README in `datasets/lookup_tables` for sources). Each dictionary has the keys `first_name` and `last_name`, each of which contains a list of names to sample from. The `pers_augmenter` uses this dictionary when it replaces names, so that first and last names are respected. Let's go through a couple of examples to demonstrate how it works.

In [9]:
print(danish_names().keys())
print(danish_names()["first_name"][0:5])
print(danish_names()["last_name"][0:5])

dict_keys(['first_name', 'last_name'])
['Marie', 'Anna', 'Margrethe', 'Karen', 'Kirstine']
['Jensen', 'Nielsen', 'Hansen', 'Pedersen', 'Andersen']


In [10]:
def augment_texts(texts: List[str], augmenter: Callable) -> Iterator[Example]:
    """Takes a list of strings and yields augmented examples"""
    docs = nlp.pipe(texts)
    for doc in docs:
        ex = Example(doc, doc)
        aug = augmenter(nlp, ex)
        yield aug

texts = [
    "Hans Christian Andersen var en dansk digter og forfatter",
    "1, 2, 3, Schmeichel er en mur",
    "Peter Schmeichel mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England."
    ]

# Create a dictionary to use for name replacement
dk_name_dict = danish_names()


# force_pattern_size augments PER entities to fit the format and length of `patterns`. `patterns` lets you specify arbitrary
# combinations of "fn" (first names), "ln" (last names), "abb" (abbreviated to first character) and "abbpunct" (abbreviated
# to first character + "."), separated by ",". If keep_name=True, the augmenter will not change names, but if force_pattern_size
# is True it will make them fit the length and potentially abbreviate names.
pers_aug = create_pers_augmenter(dk_name_dict, force_pattern_size=True, keep_name=False, patterns=["fn,ln"])
augmented_docs = augment_texts(texts, pers_aug)
for d in augmented_docs:
    print(next(d).y.text)

Eilif Stengaard var en dansk digter og forfatter
1, 2, 3, Skjold Boserup er en mur
Susanne Samuelsen mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England.


In [11]:
# Here's an example with keep_name=True and force_pattern_size=False which simply abbreviates first names
abb_aug = create_pers_augmenter(dk_name_dict, force_pattern_size=False, keep_name=True, patterns=["abbpunct"])
augmented_docs = augment_texts(texts, abb_aug)
for d in augmented_docs:
    print(next(d).y.text)

H. Christian Andersen var en dansk digter og forfatter
1, 2, 3, S. er en mur
P. Schmeichel mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England.


In [14]:
# patterns can also take a list of patterns to sample from (which can be weighted using the
# patterns_prob argument). The pattern to use is sampled for each entity.
# This setting is especially useful for finetuning models.
multiple_pats = create_pers_augmenter(dk_name_dict, 
                                      force_pattern_size=True,
                                      keep_name=False,
                                      patterns=["fn,ln", "abbpunct,ln", "fn,ln,ln,ln"])
augmented_docs = augment_texts(texts, multiple_pats)
for d in augmented_docs:
    print(next(d).y.text)

M. Brun var en dansk digter og forfatter
1, 2, 3, Amalie Steenholdt Fabricius Krogh er en mur
Louis Bertelsen Joensen Terkelsen mener også, at det danske landshold anno 2021 tilhører verdenstoppen og kan vinde den kommende kamp mod England.


Feel free to play around with the options for `create_pers_augmenter` to get a feeling for how it works and check out the docs.
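As a rough mental model of how a pattern string such as `"abbpunct,ln"` could be turned into a name, and how a pattern could be sampled with weights, consider the plain-Python sketch below. It is an illustration under assumptions, not `DaCy`'s code: `NAMES` is a tiny stand-in for the dictionary returned by `danish_names()`, `name_from_pattern` and `sample_pattern` are hypothetical helpers, and the sketch assumes abbreviations are drawn from first names.

```python
import random
from typing import Dict, List, Optional

# Tiny stand-in for a name dictionary with the same two keys.
NAMES: Dict[str, List[str]] = {
    "first_name": ["Marie", "Anna", "Karen"],
    "last_name": ["Jensen", "Nielsen", "Hansen"],
}


def name_from_pattern(pattern: str, names: Dict[str, List[str]], rng: random.Random) -> str:
    """Build a name from a comma-separated pattern such as "fn,ln"."""
    parts = []
    for slot in pattern.split(","):
        if slot == "fn":
            parts.append(rng.choice(names["first_name"]))
        elif slot == "ln":
            parts.append(rng.choice(names["last_name"]))
        elif slot == "abb":
            # Assumption: abbreviate a sampled first name to its first character.
            parts.append(rng.choice(names["first_name"])[0])
        elif slot == "abbpunct":
            parts.append(rng.choice(names["first_name"])[0] + ".")
        else:
            raise ValueError(f"Unknown pattern slot: {slot!r}")
    return " ".join(parts)


def sample_pattern(patterns: List[str], weights: Optional[List[float]], rng: random.Random) -> str:
    """Pick one pattern; weights=None means every pattern is equally likely."""
    return rng.choices(patterns, weights=weights, k=1)[0]


rng = random.Random(0)
patterns = ["fn,ln", "abbpunct,ln", "fn,ln,ln,ln"]
for _ in range(3):
    chosen = sample_pattern(patterns, weights=[0.6, 0.3, 0.1], rng=rng)
    print(chosen, "->", name_from_pattern(chosen, NAMES, rng))
```

Sampling a fresh pattern per entity, as the loop does here, mirrors the behaviour described in the comment on `patterns` in the cell above.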

The main strength of making the augmenters work with spaCy is that the spans of the augmented data still have the correct tags even though we add or remove words. This allows us to use them with gold-standard tagged datasets such as DaNE, for both training and evaluation.
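To see why this bookkeeping matters, the sketch below replaces one entity in a plain string and shifts the character offsets of the entities that follow it. This is only an illustration with a hypothetical `replace_span` helper; it assumes non-overlapping spans and says nothing about how spaCy or `DaCy` track annotations internally.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def replace_span(text: str, spans: List[Span], index: int, new_text: str) -> Tuple[str, List[Span]]:
    """Replace spans[index] with new_text and shift every later span accordingly."""
    start, end, _ = spans[index]
    delta = len(new_text) - (end - start)  # change in length caused by the swap
    new_doc = text[:start] + new_text + text[end:]
    shifted = []
    for i, (s, e, label) in enumerate(spans):
        if i == index:
            shifted.append((s, s + len(new_text), label))
        elif s >= end:
            shifted.append((s + delta, e + delta, label))  # span lies after the edit
        else:
            shifted.append((s, e, label))  # span lies before the edit
    return new_doc, shifted


text = "Peter Schmeichel kan vinde mod England."
loc_start = text.index("England")
spans = [
    (0, len("Peter Schmeichel"), "PER"),
    (loc_start, loc_start + len("England"), "LOC"),
]

new_text, new_spans = replace_span(text, spans, 0, "Amalie Steenholdt Fabricius Krogh")
for s, e, label in new_spans:
    print(label, repr(new_text[s:e]))
```

Without the shift, the `LOC` span would point into the middle of the longer replacement name instead of at "England".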

In [13]:
docs = nlp.pipe(texts)
augmented_docs = augment_texts(texts, multiple_pats)

# Check that the added/removed PER entities are still tagged as entities
for doc, aug_doc in zip(docs, augmented_docs):
    print(doc.ents, "\t", next(aug_doc).y.ents)

(Hans Christian Andersen, dansk) 	 (Nadja Mønster, dansk)
(Schmeichel,) 	 (L Juhl,)
(Peter Schmeichel, danske, England) 	 (Kresten Huynh, danske, England)


## Contributing

We highly encourage others to contribute more augmenters that cover a wider range of use cases. For inspiration on how to make your own, check out the source code for the ones included in `DaCy` in the `dacy/augmenters` folder and spaCy's documentation [here](https://spacy.io/usage/training#data-augmentation). If you have a good idea for one or encounter any problems, please open an issue or write on the discussion board.