Training augmenter seems to not augment the corpus #7329
Comments
Thanks for the report, it does look like there are bugs related to lowercasing in both augmenters. The variants are still applied if there's no lowercasing, but when lowercasing is enabled, it fails silently and just returns the original example. A lot of this code was more complicated due to quirks in [...]. We'll get it fixed for the next patch release (most likely v3.0.4). In the meantime, if you haven't already, you can define a custom lowercasing augmenter that fixes the bug in [...]:

    import random
    from typing import Iterator
    from spacy.language import Language
    from spacy.training import Example

    def lower_casing_augmenter(
        nlp: "Language", example: Example, *, level: float
    ) -> Iterator[Example]:
        # Pass the example through unchanged with probability (1 - level).
        if random.random() >= level:
            yield example
        else:
            example_dict = example.to_dict()
            # Tokenize the lowercased text for the predicted side.
            doc = nlp.make_doc(example.text.lower())
            # Keep the reference tokenization and lowercase each token.
            example_dict["token_annotation"]["ORTH"] = [t.lower_ for t in example.reference]
            yield example.from_dict(doc, example_dict)

The underlying problem is that because of tokenizer settings related to acronyms, a lowercase version of the text might not get the same tokenization as the cased version, so it's important to keep the original tokens in the reference.
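The tokenization point above can be illustrated with a small sketch. The tokenizer here is a toy regex invented for this example (it is not spaCy's tokenizer, and `toy_tokenize` is a hypothetical name); it only assumes, as the comment describes, that acronym handling is case-sensitive. Lowercasing the raw text and re-tokenizing then yields different tokens than lowercasing each original token, which is what `[t.lower_ for t in example.reference]` does.

```python
import re

# Toy rule (an assumption for illustration, not spaCy's behavior):
# an uppercase dotted acronym such as "U.S." stays one token; everything
# else splits into word runs and single punctuation marks.
TOKEN_RE = re.compile(r"(?:[A-Z]\.)+|\w+|[^\w\s]")


def toy_tokenize(text):
    """Split text into tokens using the toy rule above."""
    return TOKEN_RE.findall(text)


cased = "Ship to the U.S. office"

# Tokens of the original, cased text: the acronym is a single token.
cased_tokens = toy_tokenize(cased)

# Re-tokenizing the lowercased text breaks the acronym apart, so these
# tokens no longer line up with annotations made on the cased tokens.
lowered_tokens = toy_tokenize(cased.lower())

# Lowercasing each original token keeps the original token boundaries.
reference_tokens = [token.lower() for token in cased_tokens]

print(cased_tokens)
print(lowered_tokens)
print(reference_tokens)
```

With per-token lowercasing, the number of tokens matches the cased text, so token-level annotations still align with the reference.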
@adrianeboyd makes sense, good to know I wasn't going crazy :) Thank you for the custom augmenter, much appreciated! Looking forward to the next patch of spaCy.
How to reproduce the behavior
I'm trying to reduce my model's sensitivity to letter casing in addresses, but using the orth_variants augmenter to randomly lowercase the training data does not seem to improve the model. Following the documentation, I'm assuming that setting level = 1.0 and lower = 1.0 should make the whole corpus lowercase and make the model perform well on lowercase examples, but it seems unchanged and still only detects cased examples. On the other hand, if I manually make the whole corpus lowercase, I get a model that mostly works on lowercase examples. I suspect there might be a bug that keeps the augmenters from running?
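As a sanity check on that expectation, here is a minimal sketch of a `random.random() >= level` gate like the one in the maintainer's augmenter, applied to plain strings instead of spaCy Example objects. The function name `lowercase_at_level` and the sample addresses are hypothetical, and the separate `lower` setting of the orth_variants augmenter is not modeled. Because `random.random()` returns values in [0.0, 1.0), a level of 1.0 should lowercase every item when the gate works as intended.

```python
import random


def lowercase_at_level(texts, level, seed=0):
    """Lowercase each text with probability `level` (toy stand-in for an augmenter)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    augmented = []
    for text in texts:
        # Same gate as in the augmenter: keep the original when the draw
        # is at or above `level`, otherwise emit the lowercased version.
        if rng.random() >= level:
            augmented.append(text)
        else:
            augmented.append(text.lower())
    return augmented


# Hypothetical address-like samples.
corpus = ["12 Main Street", "Flat 3B, Elm Road"]

# level=1.0: the draw is always below 1.0, so everything is lowercased.
all_lower = lowercase_at_level(corpus, level=1.0)

# level=0.0: the draw is always >= 0.0, so nothing is changed.
untouched = lowercase_at_level(corpus, level=0.0)

print(all_lower)
print(untouched)
```

So the assumption about level = 1.0 is reasonable for a gate of this shape; an unchanged model points at the augmenter itself rather than at the setting.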
Here's the snippet of the augmenter config:
Additionally, using the spacy.lower_case.v1 augmenter resulted in an error when initializing the training:
Environment
Info about spaCy