A few more instructions in the README #8
If you want to use a real-world dataset, it is better to use the code in the "Image Coco" folder; you only need to modify "realtrain_cotra.txt".
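To make the "only modify realtrain_cotra.txt" step concrete, here is a hypothetical sketch of how such a file could be produced: each sentence becomes one line of 32 space-separated token IDs, right-padded with a pad ID. The sequence length (32), the pad ID (1814), and the toy `word2id` mapping are assumptions taken from the discussion below, not confirmed by the authors.

```python
# Hypothetical sketch of producing a realtrain_cotra.txt-style line.
# Assumptions from this thread: 32 tokens per line, 1814 used as padding.
SEQ_LEN = 32
PAD_ID = 1814

def encode_sentence(words, word2id, seq_len=SEQ_LEN, pad_id=PAD_ID):
    """Map words to IDs, truncate to seq_len, and right-pad with pad_id."""
    ids = [word2id[w] for w in words][:seq_len]
    return ids + [pad_id] * (seq_len - len(ids))

# Toy stand-in for the real vocabulary dictionary.
word2id = {'A': 65, 'cat': 100, '.': 193}
line = encode_sentence(['A', 'cat', '.'], word2id)
print(' '.join(map(str, line)))
```

Writing one such line per sentence would yield a file with the same shape as the one described in this thread.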
Hi, thank you for the great work! Would it be possible to give more details on how the real data is pre-processed?
For the real data in the "Image Coco" folder, I create a vocabulary dictionary for every word in
According to realtrain_cotra.txt, there are 32 tokens per line, and some lines contain more than 20 non-1814 tokens (assuming 1814 means zero padding). So I assume you meant "32-length" rather than "20-length."

In vocab_cotra.pkl, `p4801 aS'OTHERPAD'` is the last entry, so there are only 4801 vocabulary entries for COCO. But main.py says the vocab size is 4839, which doesn't agree.

realtrain_cotra.txt also uses 0 as a token (in the middle of a sentence), but 0 doesn't appear in vocab_cotra. Since 0 was designated as the start token, I believe it cannot be used in the middle of a sentence.

According to realtrain_cotra.txt, 65 seems to stand for 'A', but according to vocab_cotra, 'A' is at 67. Likewise, '.' (period) is 193 according to realtrain_cotra.txt, but 194 in vocab_cotra. By the way, does 'OTHERPAD' mean zero padding (instead of 1814)? In vocab_cotra, there's this line:
which means 194 corresponds to both '.' and 'much'. So I believe your vocab_cotra is inaccurate. Or is it not?
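The ID-to-word lookup being questioned above can be sketched as follows. This is a hypothetical illustration, not the repo's actual code: the toy dictionary stands in for the real one in vocab_cotra.pkl, and 1814 as the padding ID is an assumption from this thread.

```python
import pickle

# Toy stand-in for the id -> word dictionary stored in vocab_cotra.pkl.
toy_vocab = {67: 'A', 100: 'cat', 194: '.'}

def decode_line(line, vocab, pad_id=1814):
    """Turn one space-separated line of token IDs into words, dropping padding."""
    ids = [int(tok) for tok in line.split()]
    return [vocab[i] for i in ids if i != pad_id]

# The real dictionary would be loaded in binary mode, e.g.:
# with open('vocab_cotra.pkl', 'rb') as f:
#     vocab = pickle.load(f)

words = decode_line('67 100 194 1814 1814', toy_vocab)
print(words)
```

Decoding a few lines of realtrain_cotra.txt this way is a quick check of whether the IDs in the data file and the entries in the vocabulary actually line up.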
In fact, when I write the
For the last question, it may be a bug from when I uploaded the code. I will verify it and fix it. Thanks for the reminder.
I did print(word) and print(vocab) in convert.py, and found that '.' and 'much' are assigned different IDs, as expected. So I guess the earlier confusion is due to a bug that occurs when one opens a .pkl file as if it were a txt file. I also found that 0 corresponds to 'raining', so it has nothing to do with the start token. A few sentences from realtrain_cotra.txt were translated correctly with convert.py, so I guess there's no problem after all. Sorry for the confusion.
OK, it may be an issue with how the .pkl file was read. Thanks for your discovery.
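The ".pkl opened like a txt file" confusion above can be reproduced in a few lines: a pickled dictionary looks garbled when viewed as text, but round-trips cleanly through `pickle` when the file is opened in binary mode. The file path here is a temporary demo file, not the repo's actual vocab_cotra.pkl.

```python
import os
import pickle
import tempfile

# Pickle a small dictionary and read it back the correct way.
vocab = {193: '.', 194: 'much'}

path = os.path.join(tempfile.mkdtemp(), 'vocab_demo.pkl')
with open(path, 'wb') as f:          # binary mode is required for pickle.dump
    pickle.dump(vocab, f)

with open(path, 'rb') as f:          # binary mode is required for pickle.load
    restored = pickle.load(f)

print(restored[193], restored[194])  # '.' and 'much' remain distinct entries
```

Viewing the raw bytes of such a file in a text editor shows pickle's serialization opcodes (e.g. `aS'...'` markers), which is likely why adjacent entries appeared to share one ID when the file was inspected as plain text.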
|
Would you mind adding some introductory instructions to your README about how to load new data in place of the synthetic data for training? I would like to replicate your results and then introduce new data. Thank you for this resource.