Training augmenter seems to not augment the corpus #7329

Closed
CruelMoney opened this issue Mar 7, 2021 · 3 comments · Fixed by #7336
Labels
bug (Bugs and behaviour differing from documentation) · feat / training (Feature: Training utils, Example, Corpus and converters)

Comments

CruelMoney commented Mar 7, 2021

How to reproduce the behavior

I'm trying to reduce my model's sensitivity to letter casing in addresses, but using the orth_variants augmenter to randomly lowercase the training data does not seem to improve the model. Following the documentation, I assumed that setting level = 1.0 and lower = 1.0 would lowercase the whole corpus and produce a model that performs well on lowercase examples, but the model seems unchanged and still only detects cased examples. On the other hand, if I manually lowercase the whole corpus, I get a model that mostly works on lowercase examples.

I suspect there might be a bug that prevents the augmenters from running.

Here's the snippet of the augmenter config:

...
[corpora.dev.augmenter]
@augmenters = "spacy.orth_variants.v1"
# Percentage of texts that will be augmented / lowercased
level = 1.0
lower = 1.0

[corpora.dev.augmenter.orth_variants]
@readers = "srsly.read_json.v1"
path = "corpus/orth_variants.json"
...
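
To double-check what the augmenter actually yields outside of training, here's a minimal sketch I'd use (spacy.blank("en") and the corpus/dev.spacy path are placeholders for my real pipeline and dev corpus):

import spacy
import srsly
from spacy.training import Corpus

nlp = spacy.blank("en")
orth_variants = srsly.read_json("corpus/orth_variants.json")
create_augmenter = spacy.registry.augmenters.get("spacy.orth_variants.v1")
augmenter = create_augmenter(level=1.0, lower=1.0, orth_variants=orth_variants)

corpus = Corpus("corpus/dev.spacy", augmenter=augmenter)
for example in corpus(nlp):
    # With level = 1.0 and lower = 1.0 I'd expect every text to come back lowercased.
    print(example.text)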

Additionally, using the spacy.lower_case.v1 augmenter resulted in an error when initializing training:

  File "spacy/tokens/doc.pyx", line 260, in spacy.tokens.doc.Doc.__init__
ValueError: [E027] Arguments `words` and `spaces` should be sequences of the same length, or `spaces` should be left default at None. `spaces` should be a sequence of booleans, with True meaning that the word owns a ' ' character following it.

Environment

Info about spaCy

  • spaCy version: 3.0.3
  • Platform: macOS-11.2.2-x86_64-i386-64bit
  • Python version: 3.9.2
  • Pipelines: en_core_web_lg (3.0.0), en_core_web_sm (3.0.0), en_core_web_trf (3.0.0)
adrianeboyd added the bug and feat / training labels on Mar 8, 2021
adrianeboyd (Contributor) commented

Thanks for the report; it does look like there are bugs related to lowercasing in both augmenters. The variants are still applied if there's no lowercasing, but when lowercasing is enabled, it fails silently and just returns the original example. A lot of this code was complicated by quirks in GoldParse and can be simplified for v3 now, so it's good to have another look at it in any case.

We'll get it fixed for the next patch release (most likely v3.0.4). In the meantime, if you haven't already, you can define a custom lowercasing augmenter that fixes the bug in spacy.lower_case.v1:

import random
from typing import Iterator

from spacy.language import Language
from spacy.training import Example

def lower_casing_augmenter(
    nlp: Language, example: Example, *, level: float
) -> Iterator[Example]:
    if random.random() >= level:
        # Leave this example unchanged.
        yield example
    else:
        example_dict = example.to_dict()
        # Lowercase the text but keep the original tokenization: ORTH is the
        # lowercased form of each original token, so annotations stay aligned.
        doc = nlp.make_doc(example.text.lower())
        example_dict["token_annotation"]["ORTH"] = [t.lower_ for t in example.reference]
        yield example.from_dict(doc, example_dict)
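
To use it from the config, register it as an augmenter and point @augmenters at the registered name. A minimal sketch (the name my_lower_case.v1 and the functions.py module are just placeholders; pass the module to spacy train with --code functions.py so the registration runs):

from functools import partial

import spacy

@spacy.registry.augmenters("my_lower_case.v1")
def create_my_lower_casing_augmenter(level: float):
    # Return a callable with the (nlp, example) signature the corpus reader expects.
    return partial(lower_casing_augmenter, level=level)

And in the training config:

[corpora.dev.augmenter]
@augmenters = "my_lower_case.v1"
level = 1.0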

The underlying problem is that, because of tokenizer settings related to acronyms, a lowercase version of the text might not get the same tokenization as the cased version, so it's important to keep the original tokens in the reference ORTH. If you use the tokens from the new doc with the lowercased text, your tokens can get out of alignment with the rest of the reference annotation.
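
For illustration (the sentence here is just made up), an acronym shows the kind of mismatch this is guarding against:

import spacy

nlp = spacy.blank("en")
# The cased acronym typically stays a single token, while the lowercased
# version may have its final period split off, so the token counts differ.
print([t.text for t in nlp.make_doc("Ship it to the U.S. office.")])
print([t.text for t in nlp.make_doc("ship it to the u.s. office.")])
# If the tokenization differs, token-level annotation copied onto the new doc
# would be misaligned, which is why the reference ORTH keeps the original tokens.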

adrianeboyd linked pull request #7336 on Mar 8, 2021 that will close this issue
CruelMoney (Author) commented

@adrianeboyd That makes sense, good to know I wasn't going crazy :)

Thank you for the custom augmenter, much appreciated!
As a workaround, I've already trained my model on a new dataset with lowercased examples, and it performs really well now 👌

Looking forward to the next spaCy patch.

github-actions (bot) commented

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions bot locked as resolved and limited conversation to collaborators on Oct 25, 2021