* Added configuration file that initializes training with an existing model
* Added more detailed info to README files
1 parent 79c9ccc · commit 0efc8bd · 6 changed files with 201 additions and 109 deletions.
```ini
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null

[components]

[components.ner]
source = "/home/yves/projects/grammar-api3/grammar-api/quillgrammar/models/current/"

[components.transformer]
source = "/home/yves/projects/grammar-api3/grammar-api/quillgrammar/models/current/"

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 30
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 30
gold_preproc = false
limit = 2000000
augmenter = null

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1000000
max_epochs = 1
max_steps = 200000
eval_frequency = 200
frozen_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 100
buffer = 256
get_length = null

[training.logger]
@loggers = "spacy.WandbLogger.v1"
project_name = "quill-grammar"
remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 100
total_steps = 50000
initial_rate = 0.000005

[training.score_weights]
ents_per_type = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null

[initialize.components]

[initialize.tokenizer]
```
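Note that the `source` keys under `[components]` are what make this config continue training from an existing model. A config like this is typically passed to spaCy's `train` CLI, with the `null` paths filled in as command-line overrides; a minimal sketch (the file names `config.cfg`, `train.spacy`, `dev.spacy` and the output directory are assumptions, not taken from the repository):

```shell
# Sketch: launch training with this config, overriding the null paths.
python -m spacy train config.cfg \
    --output ./output \
    --paths.train corpus/train.spacy \
    --paths.dev corpus/dev.spacy \
    --gpu-id 0
```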
This file was deleted.
# How to create synthetic data for the grammar model

General information about the grammar scripts can be found in the README file in the top directory. This README contains more information about our methods for creating synthetic data.
## General

Synthetic data can be generated with the script `create_grammar_training_corpus.py`:

```bash
> python scripts/quillgrammar/create_grammar_training_corpus.py <path_to_newsoftheworld_corpus>
```
The errors are created by taking sentences from US sources in the News of the World corpus and injecting grammar errors into these sentences. By default, this injection is done with a probability of 50%. This means that:

- We take an initial set of sentences from the News of the World corpus.
- An error generator determines whether a sentence is relevant for a particular error (e.g. it contains a commonly confused word) and injects an error by replacing the original word with an alternative word or word form.
- Of all these relevant sentences, 50% are written to the output file in their original (correct) form, and 50% are written in their synthetic (incorrect) form. This ensures that the grammar model sees both correct and incorrect examples during training. The output file takes the name of the grammar error, with the extension `ndjson`.
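The sampling step above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names and the convention that the injector returns `None` for irrelevant sentences are assumptions.

```python
import random

def build_training_pairs(sentences, inject_error, prob=0.5, seed=0):
    """For each relevant sentence, write the synthetic (incorrect) version
    with probability `prob`, else keep the original (correct) version."""
    rng = random.Random(seed)
    output = []
    for sentence in sentences:
        corrupted = inject_error(sentence)
        if corrupted is None:
            # The generator found nothing to corrupt: not relevant, skip.
            continue
        if rng.random() < prob:
            output.append((corrupted, "incorrect"))
        else:
            output.append((sentence, "correct"))
    return output

# Toy injector: swap "than" for "then" where possible.
def swap_than(sentence):
    return sentence.replace("than", "then") if "than" in sentence else None

pairs = build_training_pairs(["Better than ever.", "No match here."], swap_than)
```

Only the first sentence is relevant here, so `pairs` contains a single entry, in either its correct or corrupted form depending on the coin flip.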
All output files that are used for training a spaCy model have to be added to `scripts/grammar/training_files.csv`.

To change the error that is generated by the script, update the relevant error generator on line 56 of `scripts/quillgrammar/create_grammar_training_corpus.py`.
## Error generators

Errors are created by so-called error generators. These can be found in `quillnlp.grammar.generation`, `quillnlp.grammar.fragments` (for fragment generators) and `quillnlp.grammar.verb` (for verb error generators).

The general `ErrorGenerator` class is defined in `quillnlp.grammar.generation`. Crucially, it has a `generate_from_text` method, which takes a text and injects an error. All specific error generators inherit from this class.
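The inheritance pattern can be illustrated with a simplified sketch. This is not the actual quillnlp class: the real `generate_from_text` signature and return value may differ, and the toy subclass below is invented for illustration only.

```python
class ErrorGenerator:
    """Simplified sketch of the base class: subclasses inject one error type."""
    name = "generic_error"

    def generate_from_text(self, text: str) -> str:
        raise NotImplementedError

class ThanThenErrorGenerator(ErrorGenerator):
    """Toy subclass that replaces 'than' with 'then'."""
    name = "than_versus_then"

    def generate_from_text(self, text: str) -> str:
        return text.replace("than", "then")

generator = ThanThenErrorGenerator()
corrupted = generator.generate_from_text("Faster than light.")
# corrupted == "Faster then light."
```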
### Simple errors

For simple errors, in which one word is confused with another, use the `TokenReplacementErrorGenerator` from `quillnlp.grammar.generation`. Upon creation, this error generator takes at least two arguments: a dictionary that describes which correct tokens should be replaced by which incorrect tokens, and the name of the grammar error. An optional third argument describes the probabilities of the replacement tokens.

For example, the following code creates an error generator that replaces instances of *than* by *then*, replaces instances of *then* by *than*, and labels the resulting error as `than_versus_then`:

```python
generator = TokenReplacementErrorGenerator({"then": ["than"], "than": ["then"]}, GrammarError.THAN_THEN.value)
```
Similarly, the code below creates an error generator that replaces instances of *too* by *two* or *to* and labels the resulting error as `to_vs_too_vs_two_too_optimal`. The list of probabilities ensures it uses *to* as a replacement 90% of the time (because this is the more frequent error) and *two* 10% of the time:

```python
generator = TokenReplacementErrorGenerator({'too': ['two', 'to']},
                                           GrammarError.TO_TWO_TOO_TOO_OPTIMAL.value,
                                           probs=[0.1, 0.9])
```
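Weighted replacement like this can be implemented with `random.choices`. The class below is a self-contained sketch of the idea, with a constructor modelled on the description above; it is not the quillnlp implementation, and the whitespace tokenization and single-error-per-sentence choice are assumptions.

```python
import random

class SimpleTokenReplacer:
    """Sketch of probability-weighted token replacement (illustration only)."""

    def __init__(self, replacements, name, probs=None, seed=0):
        self.replacements = replacements  # e.g. {"too": ["two", "to"]}
        self.name = name
        self.probs = probs                # weights over the replacement list
        self.rng = random.Random(seed)

    def generate_from_text(self, text):
        tokens = text.split()
        for i, token in enumerate(tokens):
            if token in self.replacements:
                options = self.replacements[token]
                tokens[i] = self.rng.choices(options, weights=self.probs)[0]
                break  # inject a single error per sentence
        return " ".join(tokens)

replacer = SimpleTokenReplacer({"too": ["two", "to"]}, "to_vs_too", probs=[0.1, 0.9])
result = replacer.generate_from_text("That is too much")
```

With these weights, `result` replaces *too* with *to* roughly nine times out of ten, and with *two* otherwise; sentences without a matching token pass through unchanged.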
### Pronoun errors

Pronoun error generators are created from the `PronounReplacementErrorGenerator`, defined in `quillnlp.grammar.generation`. This is slightly more advanced than the token replacement error generator: for example, it allows us to check the spaCy dependency role of words, in order to focus on subject/object/possessive/… pronouns. `quillnlp.grammar.generation` already has pronoun error generators for the most common pronoun errors.
### Complex errors

Other, more complex errors have their own specific error generators, such as the `PassiveWithIncorrectBeErrorGenerator`. The `name` field of such a generator specifies which grammar error it generates.