
How to prepare training dataset structure #37

Open
GraphGrailAi opened this issue Feb 3, 2017 · 0 comments
GraphGrailAi commented Feb 3, 2017

I am trying to use the code with my own dataset (training is running now).
To do this, I am trying to replicate the structure of the Ubuntu Dialog Corpus (UDC) from https://arxiv.org/pdf/1506.08909v3.pdf

Your article states that "The training data consists of 1,000,000 examples, 50% positive (label 1) and 50% negative (label 0)" (http://www.wildml.com/2016/07/deep-learning-for-chatbots-2-retrieval-based-model-tensorflow/).
So I made a copy of the dataset with Pandas and paired each Context with a randomly selected Utterance labeled 0.
The result is a doubled dataset: each Context appears twice, once with its correct Utterance (label 1) and once with an incorrect one (label 0).
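Roughly like this (a minimal sketch; the file name `train.csv` and the `Context`/`Utterance`/`Label` column names are my assumptions based on the tutorial's CSV format):

```python
import pandas as pd

# Assuming a CSV of ground-truth pairs with 'Context' and 'Utterance'
# columns, as in the tutorial's training data.
df = pd.read_csv("train.csv")

# Positive examples: the original pairs, labeled 1.
pos = df[["Context", "Utterance"]].copy()
pos["Label"] = 1

# Negative examples: the same Contexts paired with Utterances shuffled
# from elsewhere in the corpus, labeled 0. (A shuffled utterance can
# occasionally land back on its own context, producing a false negative;
# with a large corpus this is rare.)
neg = df[["Context"]].copy()
neg["Utterance"] = df["Utterance"].sample(frac=1, random_state=42).values
neg["Label"] = 0

# Concatenate and shuffle to get the 50/50 positive/negative split.
train = pd.concat([pos, neg], ignore_index=True).sample(frac=1).reset_index(drop=True)
train.to_csv("train_with_negatives.csv", index=False)
```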

Is that right for training?

The paper https://arxiv.org/pdf/1506.08909v3.pdf states: "In our experiments below, we consider both the case of 1 wrong response and 10 wrong responses." - that is a completely different approach.

Should I instead add 11 rows per Context to the training set - one with the correct Utterance and 10 with randomly selected wrong ones - to get more accurate results?
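If that is the right variant, I imagine it would look roughly like this (again only a sketch under the same assumed file and column names; the paper uses the 1-correct-plus-10-distractors format for its test and validation sets):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # assumed file/columns as above

# Variant: one correct Utterance plus 10 random distractors per Context.
rows = []
for idx, r in df.iterrows():
    rows.append({"Context": r["Context"], "Utterance": r["Utterance"], "Label": 1})
    # Draw 10 utterances from other rows to serve as wrong responses.
    for u in df.loc[df.index != idx, "Utterance"].sample(n=10):
        rows.append({"Context": r["Context"], "Utterance": u, "Label": 0})

train_1_to_10 = pd.DataFrame(rows)
train_1_to_10.to_csv("train_1_to_10.csv", index=False)
```

(`iterrows` is slow on a 1M-row corpus; this only shows the intended shape of the data.)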

Also, I don't have the 'EOS' tags in my Contexts - so naturally the context is not a merged dialog, it is one big problem post. How can this influence the results?
