
A few more instructions in the README #8

Closed
waynethewizard opened this issue Dec 30, 2017 · 9 comments

@waynethewizard

Would you mind adding some introductory instructions to your README on how to load new data in place of the synthetic data for training? I would like to replicate your results and then introduce new data. Thank you for this resource.

@CR-Gjx
Owner

CR-Gjx commented Jan 8, 2018

If you want to use a real-world dataset, it is better to use the code in the "Image Coco" folder; you only need to modify "realtrain_cotra.txt".
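For reference, a rough sketch of reading that file, assuming each line of realtrain_cotra.txt is one sentence encoded as space-separated integer token ids (the format discussed later in this thread):

```python
# Sketch: load realtrain_cotra.txt, assuming one sentence per line,
# encoded as space-separated integer token ids.
def load_token_file(path="realtrain_cotra.txt"):
    with open(path) as f:
        return [[int(tok) for tok in line.split()] for line in f]

sequences = load_token_file()
print(len(sequences), len(sequences[0]))  # number of sentences, tokens per line
```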

@Crista23

Hi, thank you for the great work! Would it be possible to give more details on how the real data is pre-processed?

@CR-Gjx
Owner

CR-Gjx commented Jan 10, 2018

For the real data in the "Image Coco" folder, I first create a vocabulary dictionary for every word and store it in vocab_cotra.pkl; then every word in the dataset is transformed into a number according to the dictionary. Specifically, every sentence in the dataset is aligned to a length of 20: if a sentence is shorter than 20 tokens, paddings (blanks) are added until it reaches 20, and the padding is a special token in the dictionary.
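A minimal sketch of that pipeline, assuming a plain word-to-id dictionary (illustrative only; the repository's actual conversion script and the internal structure of vocab_cotra.pkl may differ):

```python
import pickle

SEQ_LEN = 20   # fixed sentence length described above
PAD = " "      # the blank padding token, a special entry in the dictionary

def preprocess(sentences, txt_out="realtrain_cotra.txt", vocab_out="vocab_cotra.pkl"):
    # Build a word -> id dictionary, reserving an id for the padding token.
    vocab = {PAD: 0}
    for sent in sentences:
        for word in sent.split():
            vocab.setdefault(word, len(vocab))

    # Encode each sentence as ids, truncating or padding to SEQ_LEN.
    with open(txt_out, "w") as f:
        for sent in sentences:
            ids = [vocab[w] for w in sent.split()][:SEQ_LEN]
            ids += [vocab[PAD]] * (SEQ_LEN - len(ids))
            f.write(" ".join(map(str, ids)) + "\n")

    with open(vocab_out, "wb") as f:
        pickle.dump(vocab, f)
```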

@AranKomat

AranKomat commented Jan 27, 2018

According to realtrain_cotra.txt, there are 32 tokens per line, and some lines contain more than 20 non-1814 tokens (assuming 1814 here is the zero padding). So I assume you meant "32-length" rather than "20-length."

In vocab_cotra.pkl, p4801 aS'OTHERPAD' is the last entry (together with ' '), so there are only 4801 vocabulary entries for COCO. But main.py says the vocab size is 4839, which doesn't agree. realtrain_cotra.txt also uses 0 as a token (in the middle of a sentence), but 0 doesn't appear in vocab_cotra.pkl. Since 0 was designated the start token, I believe it cannot be used in the middle of a sentence. According to realtrain_cotra.txt, 65 seems to stand for 'A', but according to vocab_cotra.pkl, 'A' is at 67. Likewise, '.' (period) is 193 according to realtrain_cotra.txt, but 194 in vocab_cotra.pkl. By the way, does 'OTHERPAD' mean zero padding (instead of 1814)? In vocab_cotra.pkl, there is this passage:

p194
aS'.'
aS'much'

which means 194 corresponds to both '.' and 'much'. So I believe your vocab_cotra.pkl is inaccurate. Or is it not?
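A quick sketch of the kind of check behind these counts, with 1814 taken as the suspected padding id:

```python
# Count tokens per line and the longest run of non-padding tokens
# in realtrain_cotra.txt, treating 1814 as the suspected padding id.
with open("realtrain_cotra.txt") as f:
    lines = [line.split() for line in f]
print({len(toks) for toks in lines})                          # distinct line lengths
print(max(sum(t != "1814" for t in toks) for toks in lines))  # max non-padding count
```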

@CR-Gjx
Owner

CR-Gjx commented Jan 28, 2018

In fact, when I wrote main.py, I used a larger vocab number to prevent vocabulary overflow. Maybe it is not rigorous, but as training proceeds, some tokens' probabilities become 0 because they never occur in the training dataset.
As you say, 'OTHERPAD' serves both as a common word and as the blank. In my code, I assume the Generator network can only generate fixed-length sentences, so I added this token to guarantee that all sentences in the dataset have a fixed length. Some sentences are so short that 'OTHERPAD' appears many times.
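Put concretely (a rough sketch using the numbers from this thread):

```python
num_real_words = 4801  # actual entries in vocab_cotra.pkl, per the count above
vocab_size = 4839      # value used in main.py, deliberately larger as headroom
# ids in [num_real_words, vocab_size) never occur in the training data,
# so the generator learns to assign them near-zero probability.
```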

@CR-Gjx
Owner

CR-Gjx commented Jan 28, 2018

For the last question, it may be a bug from when I uploaded the code. I will verify and fix it. Thanks for the reminder.

@AranKomat

I added print(word) and print(vocab) in convert.py, and found that '.' and 'much' are assigned different, appropriate indices. So I guess the mismatch comes from opening the .pkl file as if it were a text file. I also found that 0 corresponds to 'raining,', so it has nothing to do with the start token. A few sentences from realtrain_cotra.txt were translated nicely with convert.py, so I guess there's no problem at all. Sorry for the confusion.
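For anyone else inspecting the file: open it with pickle rather than as text. A minimal sketch, assuming the unpickled object maps ids to words the way convert.py expects (the exact structure may differ):

```python
import pickle

with open("vocab_cotra.pkl", "rb") as f:
    vocab = pickle.load(f)  # the pNNN / aS'...' lines are pickle opcodes, not entries

# Round-trip check: decode the first encoded sentence back into words.
with open("realtrain_cotra.txt") as f:
    ids = [int(tok) for tok in f.readline().split()]
print(" ".join(str(vocab[i]) for i in ids))
```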

@CR-Gjx
Owner

CR-Gjx commented Jan 30, 2018

OK, it may be an artifact of reading the .pkl file as text. Thanks for your discovery.

@CR-Gjx CR-Gjx closed this as completed Jan 30, 2018
@bharathreddy1997

> Hi, thank you for the great work! Would it be possible to give more details on how the real data is pre-processed?

Hi, did you understand how the pickle file was generated?
