Some issues regarding generating vocab.json files #43

Open
yan1617262965 opened this issue Jun 6, 2023 · 4 comments

Comments

@yan1617262965

Example of how you previously answered other people's questions:

Suppose it is similar to the English tokenizer; you can then obtain a vocab.json file with:

from datasets.caption.field import TextField

text_field = TextField(vocab_path="path_to_save_vocab.json", build_vocab=True)

# given a list of captions
source = [
    "This is a first caption",
    "This is a second caption",
    # ....
]

text_field.build_vocab(source)
That's how it works.

Hello author, following your example, the vocab.json file I generated contains "freqs" and "itos", but no "stoi". Could you tell me why? Or is "stoi" not needed for GRIT training?
I would greatly appreciate a reply as soon as possible.

@davidnvq
Owner

davidnvq commented Jun 6, 2023

Thank you for asking. I am not sure which error you are encountering.

  • itos is short for index_to_string (or index_to_token).
  • Similarly, stoi is short for string_to_index (or token_to_index).

So I guess you just need one simple line of code to get stoi from itos. For example:

stoi = {string:index for index, string in itos.items()} # if itos is a dictionary

# or 
stoi = {string:index for index, string in enumerate(itos)} # if itos is a list.
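
To make the one-liners above concrete, here is a minimal sketch that adds "stoi" to an already loaded vocabulary. It assumes the JSON holds a top-level "itos" entry that is either a list or an object keyed by index. That layout, the helper name add_stoi, and the sample tokens are assumptions for illustration, not something confirmed by the repository.

```python
import json


def add_stoi(vocab: dict) -> dict:
    """Return a copy of vocab with a "stoi" mapping derived from "itos".

    Hypothetical helper: assumes vocab["itos"] is a list of tokens or a
    dict mapping index -> token.
    """
    itos = vocab["itos"]
    if isinstance(itos, dict):
        # JSON object keys are always strings, so cast indices back to int.
        stoi = {token: int(index) for index, token in itos.items()}
    else:
        # For a list, the position of each token is its index.
        stoi = {token: index for index, token in enumerate(itos)}
    return {**vocab, "stoi": stoi}


# Made-up sample standing in for the contents of vocab.json.
sample = '{"freqs": {"this": 2, "caption": 2}, "itos": ["this", "caption"]}'
vocab = add_stoi(json.loads(sample))
```

With a real file, json.load on the opened vocab.json would replace the sample string, and json.dump could write the result back if "stoi" needs to be stored.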

@yan1617262965
Author

Okay, thank you very much for your reply

@yan1617262965
Author

Hello author, after fine-tuning on my own dataset, some "" tokens are generated during inference. Could you help me understand why?

I generated my own vocab.json, but the problem still occurs after training.

@davidnvq
Owner

davidnvq commented Jun 8, 2023

I think in this case you need to put a little effort into the generation code. For example, if the logit of the unwanted token is the highest value at a timestep, choose the token with the second-highest logit at that timestep instead.
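
A minimal sketch of that idea for a single greedy decoding step, in plain Python. The function name pick_token and the banned_index parameter are hypothetical, and the real generation code in the repository (which may use beam search and tensors) would need the same logic adapted to it.

```python
from typing import Sequence


def pick_token(logits: Sequence[float], banned_index: int) -> int:
    """Greedy choice for one timestep that never returns banned_index.

    Hypothetical helper: banned_index is the vocabulary index of the
    unwanted token, i.e. stoi[token] for that token.
    """
    # Indices sorted from highest to lowest logit.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    # If the unwanted token wins, fall back to the runner-up.
    if ranked[0] == banned_index and len(ranked) > 1:
        return ranked[1]
    return ranked[0]


# Index 1 has the highest logit but is banned, so index 2 is chosen.
chosen = pick_token([0.1, 2.0, 1.5], banned_index=1)
```

An equivalent design is to set the unwanted token's logit to negative infinity before taking the argmax, which is often easier to apply to a whole batch at once.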
