Training augmenter seems to not augment the corpus #7329

Closed
CruelMoney opened this issue Mar 7, 2021 · 3 comments · Fixed by #7336
Labels
bug (Bugs and behaviour differing from documentation) · feat / training (Feature: Training utils, Example, Corpus and converters)

Comments

CruelMoney commented Mar 7, 2021

How to reproduce the behavior

I'm trying to reduce my model's sensitivity to letter casing in addresses, but using the orth_variants augmenter to randomly lowercase the training data does not seem to improve the model. Following the documentation, I assumed that setting level = 1.0 and lower = 1.0 would lowercase the whole corpus and produce a model that performs well on lowercase examples, but the model seems unchanged and still only detects cased examples. On the other hand, if I manually lowercase the whole corpus, I get a model that mostly works on lowercase examples.

I suspect there might be a bug that prevents the augmenters from running.

Here's the snippet of the augmenter config:

...
[corpora.dev.augmenter]
@augmenters = "spacy.orth_variants.v1"
# Percentage of texts that will be augmented / lowercased
level = 1.0
lower = 1.0

[corpora.dev.augmenter.orth_variants]
@readers = "srsly.read_json.v1"
path = "corpus/orth_variants.json"
...
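
To double-check what the augmenter actually yields outside of training, here's a minimal sketch I'd use (spacy.blank("en") and the corpus/dev.spacy path are placeholders for my real pipeline and dev corpus):

import spacy
import srsly
from spacy.training import Corpus

nlp = spacy.blank("en")
orth_variants = srsly.read_json("corpus/orth_variants.json")
create_augmenter = spacy.registry.augmenters.get("spacy.orth_variants.v1")
augmenter = create_augmenter(level=1.0, lower=1.0, orth_variants=orth_variants)

corpus = Corpus("corpus/dev.spacy", augmenter=augmenter)
for example in corpus(nlp):
    # With level = 1.0 and lower = 1.0 I'd expect every text to come back lowercased.
    print(example.text)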

Additionally, using the spacy.lower_case.v1 augmenter resulted in an error when initializing training:

  File "spacy/tokens/doc.pyx", line 260, in spacy.tokens.doc.Doc.__init__
ValueError: [E027] Arguments `words` and `spaces` should be sequences of the same length, or `spaces` should be left default at None. `spaces` should be a sequence of booleans, with True meaning that the word owns a ' ' character following it.

Environment

Info about spaCy

  • spaCy version: 3.0.3
  • Platform: macOS-11.2.2-x86_64-i386-64bit
  • Python version: 3.9.2
  • Pipelines: en_core_web_lg (3.0.0), en_core_web_sm (3.0.0), en_core_web_trf (3.0.0)
adrianeboyd added the bug and feat / training labels on Mar 8, 2021
adrianeboyd (Contributor) commented

Thanks for the report; it does look like there are bugs related to lowercasing in both augmenters. The variants are still applied if there's no lowercasing, but when lowercasing is enabled, it fails silently and just returns the original example. A lot of this code was complicated by quirks in GoldParse and can be simplified for v3 now, so it's good to have another look at it in any case.

We'll get it fixed for the next patch release (most likely v3.0.4). In the meantime, if you haven't already, you can define a custom lowercasing augmenter that fixes the bug in spacy.lower_case.v1:

import random
from typing import Iterator

from spacy.language import Language
from spacy.training import Example

def lower_casing_augmenter(
    nlp: Language, example: Example, *, level: float
) -> Iterator[Example]:
    if random.random() >= level:
        # Leave this example unchanged.
        yield example
    else:
        example_dict = example.to_dict()
        # Lowercase the text but keep the original tokenization: ORTH is the
        # lowercased form of each original token, so annotations stay aligned.
        doc = nlp.make_doc(example.text.lower())
        example_dict["token_annotation"]["ORTH"] = [t.lower_ for t in example.reference]
        yield example.from_dict(doc, example_dict)
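
To use it from the config, register it as an augmenter and point @augmenters at the registered name. A minimal sketch (the name my_lower_case.v1 and the functions.py module are just placeholders; pass the module to spacy train with --code functions.py so the registration runs):

from functools import partial

import spacy

@spacy.registry.augmenters("my_lower_case.v1")
def create_my_lower_casing_augmenter(level: float):
    # Return a callable with the (nlp, example) signature the corpus reader expects.
    return partial(lower_casing_augmenter, level=level)

And in the training config:

[corpora.dev.augmenter]
@augmenters = "my_lower_case.v1"
level = 1.0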

The underlying problem is that, because of tokenizer settings related to acronyms, a lowercase version of the text might not get the same tokenization as the cased version, so it's important to keep the original tokens in the reference ORTH. If you use the tokens from the new doc with the lowercased text, your tokens can get out of alignment with the rest of the reference annotation.
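
For illustration (the sentence here is just made up), an acronym shows the kind of mismatch this is guarding against:

import spacy

nlp = spacy.blank("en")
# The cased acronym typically stays a single token, while the lowercased
# version may have its final period split off, so the token counts differ.
print([t.text for t in nlp.make_doc("Ship it to the U.S. office.")])
print([t.text for t in nlp.make_doc("ship it to the u.s. office.")])
# If the tokenization differs, token-level annotation copied onto the new doc
# would be misaligned, which is why the reference ORTH keeps the original tokens.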

adrianeboyd linked pull request #7336 on Mar 8, 2021 that will close this issue
CruelMoney (Author) commented

@adrianeboyd That makes sense, good to know I wasn't going crazy :)

Thank you for the custom augmenter, much appreciated!
As a workaround, I've already trained my model on a new dataset with lowercased examples, and it performs really well now 👌

Looking forward to the next spaCy patch.

github-actions (bot) commented

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions bot locked as resolved and limited conversation to collaborators on Oct 25, 2021