Add conll2003 ner,pos,chunk task. #226
Conversation
templates/conll2003/templates.yaml
WDT\",\n44:\"WP\",\n45:\"WP$\",\n46:\"WRB\"\n}) %}\n{% set _task = [\"named\ | ||
\ entities\", \"chunk tag\", \"parts of speech\"] | choice %}\n{% set _label_dict\ |
Can you separate all the templates per task ("named entities", "chunk tag", "parts of speech")? It's very hard to read or review right now...
Ok. Separating.
@VictorSanh Can you take a look?
Hmm, I don't think we should add `shuffle`.
Requirement of `shuffle`: If I use …

Requirement of `zip`: …
Let me be more clear. I get what these do, but I don't understand why they are necessary. We are aiming to be stateless and deterministic when possible. For "randomness" we added a choice filter. But it is not clear to me why this dataset needs shuffle. (Happy to talk on Slack if that is easier)
Pinged you on the BigScience Slack yesterday; here is the message.
I get what you are doing now, and it is clever. But I am not going to approve the `shuffle` filter.
@srush Actually, I gave this some thought during my free time. I understand your concern about deterministic prompts, and I am more open to your suggestion. Do you think adding …
I would prefer that we just don't include these things. This seems to be the only dataset that uses them.
@srush Note: Not sure why …
Prompt Description:
- `flat_question_with_label`: Regular task. Labels are the normalized labels in the case of POS tagging.
- `flat_question_with_random_label`: Users will not always provide labels strictly from the dataset; they may provide only a subset. So here we provide a subset of labels. If a gold label in the sample is not available in the subset, we replace it with "O". When choosing random tags, we always include the "O" tag for NER, POS, and chunk labels.
- `flat_question_without_label`: Regular task. No labels are provided.

POS Label Normalization
Both the NER and chunk tasks contain the "O" tag, but POS does not.
For parts-of-speech tags, a few labels are odd in a natural-language sense. For example, see a prompt with all POS labels:
Here the first 9 labels are normalized to the "O" tag.
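The subset-and-fallback behaviour described for `flat_question_with_random_label` can be sketched in plain Python (this is not the actual Jinja template code; the function name, label set, and seed are illustrative assumptions):

```python
import random

# Illustrative label set; the real template reads labels from the dataset.
NER_LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

def subset_with_gold_fallback(gold_tags, labels, k, seed=0):
    """Pick a random subset of k labels (always keeping "O") and
    replace any gold tag that falls outside the subset with "O"."""
    rng = random.Random(seed)
    # Sample k labels other than "O", then always add "O" back in.
    subset = set(rng.sample([l for l in labels if l != "O"], k))
    subset.add("O")
    # Gold tags not offered to the user are masked as "O".
    replaced = [t if t in subset else "O" for t in gold_tags]
    return sorted(subset), replaced

subset, tags = subset_with_gold_fallback(
    ["B-PER", "O", "B-LOC"], NER_LABELS, k=2, seed=0)
```

With a fixed seed this is deterministic, which is roughly what replacing `shuffle` with a seeded choice would buy.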
Earlier Pull
Earlier, `zip` was not available, so I wrote brute-force code with O(n²) complexity. Now that `zip` is available, I rewrote it with simpler notation and a single loop (O(n) complexity). While merging I messed things up in the previous pull #170, so I closed it and created this new pull.
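The complexity difference can be illustrated with a Python analogy (the actual template is Jinja, where `zip` is exposed to templates; the token/tag values here are made-up examples):

```python
# Pairing each token with its tag.
tokens = ["EU", "rejects", "German", "call"]
tags = ["B-ORG", "O", "B-MISC", "O"]

# Brute force, O(n^2): scan the tag list for the matching index.
pairs_slow = []
for i in range(len(tokens)):
    for j in range(len(tags)):
        if i == j:
            pairs_slow.append((tokens[i], tags[j]))

# With zip, O(n): walk both sequences in lockstep.
pairs_fast = list(zip(tokens, tags))
```

Both produce the same token/tag pairs; `zip` just avoids the redundant inner scan.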