
How to prepare training dataset structure #37

Open
GraphGrailAi opened this issue Feb 3, 2017 · 0 comments
GraphGrailAi commented Feb 3, 2017

I am trying to use the code with my own dataset (training is running now).
To do this, I am trying to replicate the structure of the Ubuntu Dialog Corpus (UDC) from https://arxiv.org/pdf/1506.08909v3.pdf

Your article states that "The training data consists of 1,000,000 examples, 50% positive (label 1) and 50% negative (label 0)" (http://www.wildml.com/2016/07/deep-learning-for-chatbots-2-retrieval-based-model-tensorflow/).
So I made a copy of the dataset with Pandas and paired each Context with a randomly selected Utterance labeled 0.
The result is a doubled dataset: each Context appears twice, once with its correct Utterance (label 1) and once with an incorrect one (label 0).
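Roughly like this (a minimal sketch; the file name `train.csv` and the `Context`/`Utterance`/`Label` column names are my assumptions based on the tutorial's CSV format):

```python
import pandas as pd

# Assuming a CSV of ground-truth pairs with 'Context' and 'Utterance'
# columns, as in the tutorial's training data.
df = pd.read_csv("train.csv")

# Positive examples: the original pairs, labeled 1.
pos = df[["Context", "Utterance"]].copy()
pos["Label"] = 1

# Negative examples: the same Contexts paired with Utterances shuffled
# from elsewhere in the corpus, labeled 0. (A shuffled utterance can
# occasionally land back on its own context, producing a false negative;
# with a large corpus this is rare.)
neg = df[["Context"]].copy()
neg["Utterance"] = df["Utterance"].sample(frac=1, random_state=42).values
neg["Label"] = 0

# Concatenate and shuffle to get the 50/50 positive/negative split.
train = pd.concat([pos, neg], ignore_index=True).sample(frac=1).reset_index(drop=True)
train.to_csv("train_with_negatives.csv", index=False)
```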

Is that right for training?

The paper https://arxiv.org/pdf/1506.08909v3.pdf states: "In our experiments below, we consider both the case of 1 wrong response and 10 wrong responses." - that is a completely different approach.

Should I instead add 11 rows per Context to the training set - one with the correct Utterance and 10 with randomly selected wrong ones - to get more accurate results?
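If that is the right variant, I imagine it would look roughly like this (again only a sketch under the same assumed file and column names; the paper uses the 1-correct-plus-10-distractors format for its test and validation sets):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # assumed file/columns as above

# Variant: one correct Utterance plus 10 random distractors per Context.
rows = []
for idx, r in df.iterrows():
    rows.append({"Context": r["Context"], "Utterance": r["Utterance"], "Label": 1})
    # Draw 10 utterances from other rows to serve as wrong responses.
    for u in df.loc[df.index != idx, "Utterance"].sample(n=10):
        rows.append({"Context": r["Context"], "Utterance": u, "Label": 0})

train_1_to_10 = pd.DataFrame(rows)
train_1_to_10.to_csv("train_1_to_10.csv", index=False)
```

(`iterrows` is slow on a 1M-row corpus; this only shows the intended shape of the data.)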

Also, I don't have the 'EOS' tags in my Contexts - so naturally the context is not a merged dialog, it is one big problem post. How can this influence the results?
