
Add conll2003 ner,pos,chunk task. #226

Closed
wants to merge 11 commits into from

Conversation

sbmaruf
Contributor

@sbmaruf sbmaruf commented Jun 12, 2021

Prompt Description:

  1. flat_question_with_label : Regular task. Labels are normalized in the case of POS tagging.
  2. flat_question_with_random_label : Users will not always provide labels strictly from the dataset; they may provide only a subset. So here we provide a subset of the labels. If a gold label in the sample is not in the subset, we replace it with "O". When choosing random tags, we always include the "O" tag for NER, POS and chunk labels (see the sketch after this list).
  3. flat_question_without_label : Regular task. No labels are provided.
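
For illustration, a minimal Python sketch of the subsetting logic in prompt 2, assuming a hypothetical label inventory and function name (the actual prompts implement this inline in Jinja):

import random

# Hypothetical NER label inventory for illustration; the real prompts
# draw the labels from the conll2003 dataset features.
ALL_LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

def subset_labels_and_relabel(gold_tags, k=4):
    # Pick a random subset of labels, always keeping "O",
    # and map any gold tag outside the subset to "O".
    subset = set(random.sample([l for l in ALL_LABELS if l != "O"], k))
    subset.add("O")
    relabeled = [t if t in subset else "O" for t in gold_tags]
    return sorted(subset), relabeled

labels, tags = subset_labels_and_relabel(["B-PER", "O", "B-LOC", "B-ORG"])
print(labels, tags)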

POS label Normalization

Both the NER and chunk tasks contain the "O" tag, but POS does not.

In the case of part-of-speech tags, a few labels read oddly in natural language. For example, see a prompt with all POS labels:

Generate parts of speech from the following sentence. The parts of speech tags are ", '', #, $, (, ), ,, ., :, ``, CC, CD, DT, EX, FW, IN, JJ, JJR, JJS, LS, MD, NN, NNP, NNPS, NNS, NN|SYM, PDT, POS, PRP, PRP$, RB, RBR, RBS, RP, SYM, TO, UH, VB, VBD, VBG, VBN, VBP, VBZ, WDT, WP, WP$, WRB

Here the first 9 labels are normalized to the "O" tag.
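
For reference, a minimal Python sketch of that normalization, assuming a plain set lookup (the template hard-codes the equivalent mapping inline):

# Punctuation-like POS labels taken from the prompt above; the PR maps
# the leading punctuation labels to the "O" tag.
PUNCT_TAGS = {'"', "''", "#", "$", "(", ")", ",", ".", ":", "``"}

def normalize_pos(tag):
    return "O" if tag in PUNCT_TAGS else tag

print([normalize_pos(t) for t in ["NNP", ",", "VBD", "."]])
# -> ['NNP', 'O', 'VBD', 'O']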

Earlier Pull

Earlier, zip was not available in the template environment, so I wrote brute-force code with O(n^2) complexity. Now that zip is available, I rewrote it with simpler notation and a loop (O(n) complexity); a sketch of the zip-based pairing follows. While merging I messed up the previous pull #170, so I closed it and created this new pull.
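
As a sketch of the O(n) pairing, assuming zip is exposed to the Jinja environment as a global (names here are illustrative, not the PR's actual template):

import jinja2

env = jinja2.Environment()
env.globals["zip"] = zip  # assumption: zip made available to templates

template = env.from_string(
    "{% for tok, tag in zip(tokens, tags) %}{{ tok }}/{{ tag }} {% endfor %}"
)
print(template.render(tokens=["EU", "rejects", "German", "call"],
                      tags=["B-ORG", "O", "B-MISC", "O"]))
# -> EU/B-ORG rejects/O German/B-MISC call/O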

@sbmaruf
Copy link
Contributor Author

sbmaruf commented Jun 12, 2021

@srush Can you please take a look here? I closed the previous pull #170.

@VictorSanh VictorSanh self-assigned this Jun 15, 2021
Comment on lines 19 to 20
WDT\",\n44:\"WP\",\n45:\"WP$\",\n46:\"WRB\"\n}) %}\n{% set _task = [\"named\
\ entities\", \"chunk tag\", \"parts of speech\"] | choice %}\n{% set _label_dict\
Member

Can you separate the templates per task ("named entities", "chunk tag", "parts of speech")? It's very hard to read or review right now...

Contributor Author

Ok. Separating.

@VictorSanh
Member

Please separate the templates (don't try to squeeze multiple templates into one).
This will remove the need for complex things like do, which does not compile right now on my side.
[screenshot: template compilation error, 2021-06-15]

@sbmaruf
Contributor Author

sbmaruf commented Jun 15, 2021

@VictorSanh Can you take a look? The do extension is required to update a dictionary without creating a variable; otherwise the test would not pass.
So I have added the do extension and the shuffle filter here.

@srush
Collaborator

srush commented Jun 15, 2021

Hmm, I don't think we should add do or shuffle. Can you explain why we need them?

@sbmaruf
Contributor Author

sbmaruf commented Jun 16, 2021

@srush

Requirement of do:
I can add content to a dictionary in two ways:
Method 1: {% set _dummy=_random_label_dict.update({k:v}) %}
Method 2: {% do _random_label_dict.update({k:v}) %}

If I use Method 1, it causes an error in check_templates because of the unused variable _dummy.

Requirement of shuffle:
Users will not always provide labels in the exact order used by the dataset. To mimic this behavior, I shuffle the labels in the prompt description (see the sketch below).
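
A sketch of how both requirements can be wired up in Jinja2 (jinja2 ships the do extension; the shuffle filter shown here is an assumption about how it might be registered, not the PR's exact code):

import random
import jinja2

def shuffle_filter(seq):
    # Hypothetical shuffle filter: returns a new shuffled list.
    seq = list(seq)
    random.shuffle(seq)
    return seq

env = jinja2.Environment(extensions=["jinja2.ext.do"])
env.filters["shuffle"] = shuffle_filter

# Method 2 from above: do mutates the dict without binding a dummy variable.
t = env.from_string(
    "{% set d = {} %}"
    "{% for k in ['NN', 'VB'] | shuffle %}{% do d.update({k: loop.index}) %}{% endfor %}"
    "{{ d }}"
)
print(t.render())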

@srush
Collaborator

srush commented Jun 16, 2021

Let me be more clear. I get what these do, but I don't understand why they are necessary.

We are aiming to be stateless and deterministic when possible. For "randomness" we added a choice filter. But it is not clear to me why this dataset needs shuffle.

(Happy to talk on slack if that is easier)
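
For reference, a sketch of the kind of choice filter being referred to, assuming it is registered as a simple random pick (the exact promptsource implementation may differ):

import random
import jinja2

env = jinja2.Environment()
env.filters["choice"] = random.choice  # assumption: choice as a filter

t = env.from_string('{% set task = ["named entities", "chunk tag"] | choice %}{{ task }}')
print(t.render())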

@sbmaruf
Contributor Author

sbmaruf commented Jun 16, 2021

Poked you on the bigscience Slack yesterday; here is the message.
I wanted to add shuffle because the conll2003 POS tagging task has more than 40 classes. At inference time a user may want to include the labels in the prompt, but they may list them in a different order, or may not know the original order the labels appeared in during training. If the model is always trained with the labels in the same order, it might memorize that order. That's why I thought of adding the labels to the prompt in a different order for different samples, so the model becomes aware of this.

@srush
Collaborator

srush commented Jun 17, 2021

I get what you are doing now, and it is clever. But I am not going to approve the do extension or the use of random (please use choice instead). I am okay with using shuffle if you can do it without do. These are not meant to be so code-heavy. (I know you got a hard one.)

@sbmaruf
Contributor Author

sbmaruf commented Jun 17, 2021

@srush Actually, I gave this some thought during my free time. I understand your concern about deterministic prompts, and I am open to your suggestion. Do you think adding
(i) "shuffle" (requires the shuffle filter)
(ii) "subset of labels" (requires the do extension)
would substantially improve supervised prompting? If you think so, I will put in more effort to find a workaround; otherwise, I will just remove them. Please let me know.

@srush
Collaborator

srush commented Jun 18, 2021

I would prefer that we just don't include these things. This seems to be the only dataset that uses them.

@sbmaruf
Contributor Author

sbmaruf commented Jun 21, 2021

@srush
I removed those prompts. Please take a look.

Note: Not sure why check_code_quality and show_new_templates failed; locally they passed. Let me know if I need to do anything more.

@VictorSanh VictorSanh closed this Jun 25, 2022