
A few more instructions in the README #8

Closed
waynethewizard opened this issue Dec 30, 2017 · 9 comments

@waynethewizard

Would you mind adding some introductory instructions to your README on how to load new data in place of the synthetic data for training? I would like to replicate your results and then introduce new data. Thank you for this resource.

@CR-Gjx
Owner

CR-Gjx commented Jan 8, 2018

If you want to use a real-world dataset, it is better to use the code in the "Image Coco" folder; you only need to modify "realtrain_cotra.txt".
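For reference, a rough sketch of reading that file, assuming each line of realtrain_cotra.txt is one sentence encoded as space-separated integer token ids (the format discussed later in this thread):

```python
# Sketch: load realtrain_cotra.txt, assuming one sentence per line,
# encoded as space-separated integer token ids.
def load_token_file(path="realtrain_cotra.txt"):
    with open(path) as f:
        return [[int(tok) for tok in line.split()] for line in f]

sequences = load_token_file()
print(len(sequences), len(sequences[0]))  # number of sentences, tokens per line
```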

@Crista23

Hi, thank you for the great work! Would it be possible to give more details on how the real data is pre-processed?

@CR-Gjx
Owner

CR-Gjx commented Jan 10, 2018

For the real data in the "Image Coco" folder, I first create a vocabulary dictionary for every word and store it in vocab_cotra.pkl; then every word in the dataset is transformed into a number according to the dictionary. Specifically, every sentence in the dataset is aligned to a length of 20: if a sentence is shorter than 20 tokens, paddings (blanks) are added until it reaches 20, and the padding is a special token in the dictionary.
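A minimal sketch of that pipeline, assuming a plain word-to-id dictionary (illustrative only; the repository's actual conversion script and the internal structure of vocab_cotra.pkl may differ):

```python
import pickle

SEQ_LEN = 20   # fixed sentence length described above
PAD = " "      # the blank padding token, a special entry in the dictionary

def preprocess(sentences, txt_out="realtrain_cotra.txt", vocab_out="vocab_cotra.pkl"):
    # Build a word -> id dictionary, reserving an id for the padding token.
    vocab = {PAD: 0}
    for sent in sentences:
        for word in sent.split():
            vocab.setdefault(word, len(vocab))

    # Encode each sentence as ids, truncating or padding to SEQ_LEN.
    with open(txt_out, "w") as f:
        for sent in sentences:
            ids = [vocab[w] for w in sent.split()][:SEQ_LEN]
            ids += [vocab[PAD]] * (SEQ_LEN - len(ids))
            f.write(" ".join(map(str, ids)) + "\n")

    with open(vocab_out, "wb") as f:
        pickle.dump(vocab, f)
```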

@AranKomat

AranKomat commented Jan 27, 2018

According to realtrain_cotra.txt, there are 32 tokens per line, and some lines contain more than 20 non-1814 tokens (assuming 1814 here is the zero padding). So I assume you meant "32-length" rather than "20-length."

In vocab_cotra.pkl, p4801 aS'OTHERPAD' is the last entry (together with ' '), so there are only 4801 vocabulary entries for COCO. But main.py says the vocab size is 4839, which doesn't agree. realtrain_cotra.txt also uses 0 as a token (in the middle of a sentence), but 0 doesn't appear in vocab_cotra.pkl. Since 0 was designated the start token, I believe it cannot be used in the middle of a sentence. According to realtrain_cotra.txt, 65 seems to stand for 'A', but according to vocab_cotra.pkl, 'A' is at 67. Likewise, '.' (period) is 193 according to realtrain_cotra.txt, but 194 in vocab_cotra.pkl. By the way, does 'OTHERPAD' mean zero padding (instead of 1814)? In vocab_cotra.pkl, there is this passage:

p194
aS'.'
aS'much'

which means 194 corresponds to both '.' and 'much'. So I believe your vocab_cotra.pkl is inaccurate. Or is it not?
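A quick sketch of the kind of check behind these counts, with 1814 taken as the suspected padding id:

```python
# Count tokens per line and the longest run of non-padding tokens
# in realtrain_cotra.txt, treating 1814 as the suspected padding id.
with open("realtrain_cotra.txt") as f:
    lines = [line.split() for line in f]
print({len(toks) for toks in lines})                          # distinct line lengths
print(max(sum(t != "1814" for t in toks) for toks in lines))  # max non-padding count
```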

@CR-Gjx
Owner

CR-Gjx commented Jan 28, 2018

In fact, when I wrote main.py, I used a larger vocab number to prevent vocabulary overflow. Maybe it is not rigorous, but as training proceeds, some tokens' probabilities become 0 because they never occur in the training dataset.
As you say, 'OTHERPAD' serves both as a common word and as the blank. In my code, I assume the Generator network can only generate fixed-length sentences, so I added this token to guarantee that all sentences in the dataset have a fixed length. Some sentences are so short that 'OTHERPAD' appears many times.
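Put concretely (a rough sketch using the numbers from this thread):

```python
num_real_words = 4801  # actual entries in vocab_cotra.pkl, per the count above
vocab_size = 4839      # value used in main.py, deliberately larger as headroom
# ids in [num_real_words, vocab_size) never occur in the training data,
# so the generator learns to assign them near-zero probability.
```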

@CR-Gjx
Owner

CR-Gjx commented Jan 28, 2018

For the last question, it may be a bug from when I uploaded the code. I will verify and fix it. Thanks for the reminder.

@AranKomat

I added print(word) and print(vocab) in convert.py, and found that '.' and 'much' are assigned different, appropriate indices. So I guess the mismatch comes from opening the .pkl file as if it were a text file. I also found that 0 corresponds to 'raining,', so it has nothing to do with the start token. A few sentences from realtrain_cotra.txt were translated nicely with convert.py, so I guess there's no problem at all. Sorry for the confusion.
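For anyone else inspecting the file: open it with pickle rather than as text. A minimal sketch, assuming the unpickled object maps ids to words the way convert.py expects (the exact structure may differ):

```python
import pickle

with open("vocab_cotra.pkl", "rb") as f:
    vocab = pickle.load(f)  # the pNNN / aS'...' lines are pickle opcodes, not entries

# Round-trip check: decode the first encoded sentence back into words.
with open("realtrain_cotra.txt") as f:
    ids = [int(tok) for tok in f.readline().split()]
print(" ".join(str(vocab[i]) for i in ids))
```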

@CR-Gjx
Owner

CR-Gjx commented Jan 30, 2018

OK, it may be an artifact of reading the .pkl file as text. Thanks for your discovery.

@CR-Gjx CR-Gjx closed this as completed Jan 30, 2018
@bharathreddy1997

> Hi, thank you for the great work! Would it be possible to give more details on how the real data is pre-processed?

Hi, did you understand how the pickle file was generated?
