Some issues regarding generating vocab.json files #43
Comments
Thank you for asking. I am not sure which errors you encounter.
Therefore, I guess you just need a simple line of code to get stoi from itos:
stoi = {string: index for index, string in itos.items()}  # if itos is a dictionary
# or
stoi = {string: index for index, string in enumerate(itos)}  # if itos is a list
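A minimal sketch of that inversion applied to a loaded vocab file, assuming the JSON holds an "itos" entry next to "freqs" as described in the question. The helper name `build_stoi` and the sample vocabulary are hypothetical and only for illustration. One detail worth noting: JSON object keys are always strings, so if itos was saved as a dictionary, the indices come back as "0", "1", ... and need casting to int.

```python
import json


def build_stoi(itos):
    """Invert an index-to-string mapping into a string-to-index mapping."""
    if isinstance(itos, dict):
        # JSON object keys are strings, so cast each index back to int.
        return {token: int(index) for index, token in itos.items()}
    # Otherwise treat itos as a list whose positions are the indices.
    return {token: index for index, token in enumerate(itos)}


# Hypothetical vocab contents standing in for a real vocab.json file.
vocab = json.loads(
    '{"itos": ["<pad>", "<unk>", "a", "caption"], "freqs": {"a": 2, "caption": 2}}'
)
stoi = build_stoi(vocab["itos"])
print(stoi["caption"])
```

With a real file, `vocab` would come from `json.load(open(path))` instead of the inline string; the rest stays the same.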
Okay, thank you very much for your reply.
I think in this case you need to put a little effort into the generation code. For example, if the unwanted token's logit is the highest value, then choose the second-highest logit at this timestep.
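The fallback described here could look roughly like the sketch below. The thread does not say which token should be skipped, so `blocked_id` is a hypothetical parameter, and `pick_token` is an illustrative helper rather than part of GRIT's generation code. It works on a plain list of per-token logits for a single timestep.

```python
def pick_token(logits, blocked_id):
    """Return the index of the highest logit, or the runner-up if that index is blocked_id."""
    # Rank token indices from highest to lowest logit.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if ranked[0] == blocked_id and len(ranked) > 1:
        # The top choice is the token to avoid: take the second-highest instead.
        return ranked[1]
    return ranked[0]


# Hypothetical logits for one timestep over a four-token vocabulary.
step_logits = [0.1, 2.5, 1.7, -0.3]
print(pick_token(step_logits, blocked_id=1))  # top token is blocked, falls back to index 2
print(pick_token(step_logits, blocked_id=0))  # top token is allowed, stays at index 1
```

In a batched tensor setting the same idea is usually expressed by setting the blocked token's logit to negative infinity before taking the argmax.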
Example of how you previously answered other people's questions:
Suppose that it is similar to the English tokenizer; you can obtain a vocab.json file by:
from datasets.caption.field import TextField
text_field = TextField(vocab_path="path_to_save_vocab.json", build_vocab=True)
# given a list of captions
source = [
    "This is a first caption",
    "This is a second caption",
    # ....
]
text_field.build_vocab(source)
That's how it works.
Hello author, following your example, the vocab.json file I generated contains "freqs" and "itos", but I cannot obtain "stoi". Could you tell me why? Or is "stoi" not needed for GRIT training?
I would greatly appreciate it if you could reply to me as soon as possible