second tutorial (configuration.md) #212
Conversation
> let's take a deeper look at our experiment configuration file,
> [tutorials/getting_started/simple_tagger.json](https://github.com/allenai/allennlp/blob/master/tutorials/getting_started/simple_tagger.json).
>
> The configuration is a JSON (or [HOCON](https://github.com/typesafehub/config/blob/master/HOCON.md)) object
I find this confusing. Is it JSON or HOCON? Perhaps you mean "The configuration is HOCON (but don't worry if you are not familiar with HOCON--JSON is a valid subset of HOCON)"
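To make the "JSON is a valid subset of HOCON" point concrete, here is a minimal sketch: a config written as plain JSON parses fine with a JSON parser, and a HOCON parser would accept the same file unchanged. The specific keys (`model`, `trainer`) are just illustrative.

```python
import json

# A config written as plain JSON. Because JSON is a valid subset of HOCON,
# a HOCON parser would accept this exact text as well; HOCON just adds
# extras (comments, unquoted keys, substitutions) on top.
config_text = """
{
  "model": {"type": "simple_tagger"},
  "trainer": {"num_epochs": 40}
}
"""

config = json.loads(config_text)
print(config["trainer"]["num_epochs"])
```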
> that we need to override the list of non-padded namespaces:
>
> ```js
> "non_padded_namespaces": [],
> ```
Are we doing this to get padding, or simply `@@UNKNOWN@@` tokens for certain postags?
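For context on what's at stake in this question, here is an illustrative sketch (not AllenNLP's actual `Vocabulary` class) of the difference: a padded namespace reserves low indices for padding and unknown tokens, while a non-padded namespace maps real entries straight to 0, 1, 2, …

```python
# Toy illustration of padded vs. non-padded namespaces. Padded namespaces
# reserve index 0 for a padding token and index 1 for an out-of-vocabulary
# token; label-like namespaces usually want neither.
PADDING, UNKNOWN = "@@PADDING@@", "@@UNKNOWN@@"

def build_index(tokens, padded=True):
    index = {PADDING: 0, UNKNOWN: 1} if padded else {}
    for token in tokens:
        if token not in index:
            index[token] = len(index)
    return index

word_index = build_index(["the", "dog", "barks"], padded=True)
tag_index = build_index(["DET", "NN", "VB"], padded=False)
print(word_index)  # {'@@PADDING@@': 0, '@@UNKNOWN@@': 1, 'the': 2, 'dog': 3, 'barks': 4}
print(tag_index)   # {'DET': 0, 'NN': 1, 'VB': 2}
```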
> Let's first look at the text field embedder configuration:
>
> ```js
> "text_field_embedder": {
> ```
The indentation here is quite weird.
Would it help to have comments on these lines?
> The `"tokens"` namespace (which consists of integer encodings of the lowercased words in the input)
> gets fed into an
> [`Embedding`](http://docs.allennlp.org/en/latest/api/allennlp.modules.token_embedders.html?highlight=embedding#allennlp.modules.token_embedders.embedding.Embedding)
> module that embeds the vocabulary words in a 50-dimensional space.
append "(specified by `embedding-dim`)"?
> ```js
> "stacked_encoder": {
>     "type": "lstm",
> ```
I would only indent by 4, rather than by 8. You could also remove the initial whitespace.
> concatenated with a 50-dimensional vector for `"token_characters"`;
> that is, a 100-dimensional vector.
>
> ### The Seq2SeqEncoder
Why is this a `Seq2SeqEncoder`? The JSON below has nothing about `Seq2Seq` but rather specifies `stacked_encoder`. I'm guessing that `stacked_encoder` = `Seq2SeqEncoder`, but it's an implicit connection.
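The implicit connection flagged here is the by-name registration pattern: the config key (`stacked_encoder`) names the model's *parameter*, while `"type": "lstm"` selects which registered encoder subclass gets built. A toy sketch of that pattern (hypothetical classes, not AllenNLP's real code):

```python
# Toy sketch of a type registry. The "type" field in a config block picks a
# registered subclass; the surrounding key ("stacked_encoder") is just the
# name of the constructor argument that receives the built object.
class Seq2SeqEncoder:
    registry = {}

    @classmethod
    def register(cls, name):
        def decorator(subclass):
            cls.registry[name] = subclass
            return subclass
        return decorator

    @classmethod
    def from_params(cls, params):
        # pop "type" so the remaining params become constructor kwargs
        subclass = cls.registry[params.pop("type")]
        return subclass(**params)

@Seq2SeqEncoder.register("lstm")
class LstmEncoder(Seq2SeqEncoder):
    def __init__(self, input_size, hidden_size):
        self.input_size = input_size
        self.hidden_size = hidden_size

encoder = Seq2SeqEncoder.from_params(
    {"type": "lstm", "input_size": 100, "hidden_size": 100})
print(type(encoder).__name__)  # LstmEncoder
```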
> Finally, we'll run the training for 40 epochs;
> we'll stop prematurely if we get no improvement for 10 epochs;
> and we'll train on the CPU.
You might add instruction (or point somewhere) for how to specify a GPU, as this field is a bit opaque.
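For reference, the device is selected in the trainer block via `cuda_device` — a sketch of what that might look like (surrounding keys shown for context; values are illustrative):

```js
"trainer": {
    "num_epochs": 40,
    "patience": 10,
    // -1 means CPU; a non-negative integer is a GPU id,
    // e.g. "cuda_device": 0 to train on the first GPU
    "cuda_device": -1
}
```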
> ## What's Next
>
> TODO(joelgrus): next part of the tutorial
TODO
Love it!
> Each encoding has a "namespace", in this case `"tokens"` and `"token_characters"`.
> The `SequenceLabelField` also has a namespace, `"labels"`.
>
> ## non-padded namespaces
I was hoping this would be a topic almost all users could ignore, that just shows up in some deep dive. It's unfortunate that we need to use it here. Can you construct the validation dataset such that it's not necessary in this tutorial?
yeah, that's a good idea, I'll just filter out the lines with the stray tags
ok, I fixed the datasets and removed all references to non-padded namespaces from the tutorial and config file
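A hypothetical sketch of the fix described here — dropping dataset lines containing stray tags. The `word###tag` line format and the allowed tag set are assumptions for illustration, not the actual dataset details:

```python
# Keep only lines whose tags all belong to an allowed set; lines with
# stray tags are filtered out. Format and tag set are illustrative.
ALLOWED_TAGS = {"DET", "NN", "VB"}

def filter_lines(lines):
    kept = []
    for line in lines:
        tags = [pair.rsplit("###", 1)[1] for pair in line.split()]
        if all(tag in ALLOWED_TAGS for tag in tags):
            kept.append(line)
    return kept

lines = ["the###DET dog###NN barks###VB", "stray###FW tag###NN"]
print(filter_lines(lines))  # ['the###DET dog###NN barks###VB']
```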
👍
in addition, renamed `run` to `run.py` (I left `run` there so as not to break anyone's workflow, but it logs a "deprecated" warning, and I plan to get rid of it eventually). `python -m allennlp.run` exposes `main` as `allennlp.commands.main()` so that we can tell people to make their own run scripts by importing and calling it.
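The deprecated-alias pattern described here can be sketched generically (function names are stand-ins, not AllenNLP's actual module layout): the old name stays callable but warns and delegates to the new entry point.

```python
import warnings

def main():
    # the real entry point (stand-in for something like allennlp.commands.main)
    return "ran"

def run():
    # deprecated alias kept so existing workflows don't break:
    # log a warning, then delegate to main()
    warnings.warn("`run` is deprecated; call `main()` instead",
                  DeprecationWarning, stacklevel=2)
    return main()
```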