Some issues regarding generating vocab.json files #43

yan1617262965 · 2023-06-06T09:55:57Z

Example of how you previously answered other people's questions:

Supposing it is similar to the English tokenizer, you can obtain a vocab.json file by:

from datasets.caption.field import TextField

text_field = TextField(vocab_path="path_to_save_vocab.json", build_vocab=True)

# Given a list of captions:

source = [
    "This is a first caption",
    "This is a second caption",
    # ... more captions
]

text_field.build_vocab(source)
That's how it works.

Hello author, following your example, the vocab.json file I generated contains "freqs" and "itos", but no "stoi". Could you tell me why? Or is "stoi" not needed for GRIT training?
I would greatly appreciate a reply as soon as possible.

davidnvq · 2023-06-06T10:47:07Z

Thank you for asking. I am not sure which errors you encountered.

itos is short for index_to_string (or index_to_token).
Similarly, stoi is short for string_to_index (or token_to_index).

Therefore, I guess you just need a simple line of code to get stoi from itos. For example:

stoi = {string:index for index, string in itos.items()} # if itos is a dictionary

# or 
stoi = {string:index for index, string in enumerate(itos)} # if itos is a list.
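
As a fuller illustration of the one-liners above, here is a minimal sketch that loads a vocab.json and rebuilds stoi from itos. It assumes the file is a JSON object with an "itos" entry, as described earlier in this thread; the sample contents, the temporary file, and the helper name build_stoi are hypothetical and only there to make the snippet runnable. One detail worth noting: JSON object keys are always strings, so if itos was saved as a dictionary, its indices come back as "0", "1", ... and need to be cast to int.

```python
import json
import os
import tempfile


def build_stoi(itos):
    """Invert an index-to-string mapping into a string-to-index mapping."""
    if isinstance(itos, dict):
        # JSON object keys are strings, so cast the indices back to int.
        return {token: int(index) for index, token in itos.items()}
    # Otherwise treat itos as a list where the position is the index.
    return {token: index for index, token in enumerate(itos)}


# Hypothetical file contents, for illustration only.
example_vocab = {
    "freqs": {"this": 2, "is": 2, "a": 2, "caption": 2},
    "itos": ["this", "is", "a", "caption"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.json")
    with open(vocab_path, "w", encoding="utf-8") as f:
        json.dump(example_vocab, f)

    # Load the file back, as one would with a real vocab.json.
    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)

stoi = build_stoi(vocab["itos"])
print(stoi)
```

Whether GRIT itself rebuilds stoi at load time is not stated in this thread, so treat this only as a way to derive the mapping yourself if you need it.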

yan1617262965 · 2023-06-06T11:37:30Z

Okay, thank you very much for your reply

yan1617262965 · 2023-06-07T08:52:52Z

Hello author, after fine-tuning on my own dataset, some "" tokens are still generated during inference. Could you help me understand why?

I generated my own vocab.json, but the problem persists after training.

davidnvq · 2023-06-08T01:59:45Z

I think in this case you need to put a little effort into the generation code. For example, if the logit of the unwanted token is the highest value, then choose the token with the second-highest logit at this timestep.
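
The fallback described above could be sketched roughly as follows. This is plain Python over a list of per-token scores for a single timestep, not GRIT's actual generation code; the function name pick_token and the parameter banned_index are hypothetical. In a tensor-based decoding loop, a common equivalent is to set the unwanted token's logit to negative infinity before taking the argmax.

```python
def pick_token(logits, banned_index):
    """Return the index of the highest logit, skipping banned_index.

    If the banned token has the highest logit, the token with the
    second-highest logit is returned instead.
    """
    # Rank all token indices from highest to lowest logit.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if ranked[0] == banned_index and len(ranked) > 1:
        return ranked[1]
    return ranked[0]


# Toy scores for a 3-token vocabulary at one timestep.
step_logits = [0.1, 2.0, 1.5]

blocked = pick_token(step_logits, banned_index=1)    # best token is banned
unblocked = pick_token(step_logits, banned_index=0)  # best token is allowed
print(blocked, unblocked)
```

With beam search, the same check would need to be applied to every beam at every step, which this sketch does not cover.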
