Important fix (untokenized data written to .bin files) #2

Closed
abisee opened this issue May 8, 2017 · 3 comments

Comments

@abisee
Owner

abisee commented May 8, 2017

This is a notification that the code to obtain the CNN / Daily Mail dataset unfortunately had a bug which caused the untokenized data to be written to the .bin files (not the tokenized data, as intended). The fix has been committed here.

If you've already created your .bin and vocab files, I advise you to recreate them. To do this:

  • Pull the new version of the cnn-dailymail repo
  • Delete or rename the finished_files directory (but keep the cnn_stories_tokenized and dm_stories_tokenized directories)
  • Comment out the lines tokenize_stories(cnn_stories_dir, cnn_tokenized_stories_dir) and tokenize_stories(dm_stories_dir, dm_tokenized_stories_dir) (lines 178 and 179) of make_datafiles.py. This is because you don't need to retokenize the data.
  • Run make_datafiles.py. This will create the new .bin and vocab files.
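
The "delete or rename" step above can be sketched in Python as follows. This is only an illustration, not part of the repo: the helper name `set_aside_finished_files` and the backup name `finished_files_old` are hypothetical, and it assumes `finished_files`, `cnn_stories_tokenized` and `dm_stories_tokenized` all sit under the same root directory.

```python
import os


def set_aside_finished_files(root, backup_name="finished_files_old"):
    """Rename root/finished_files so that make_datafiles.py can recreate it.

    Hypothetical helper: the function name and backup_name are illustrative
    assumptions, not something defined in the cnn-dailymail repo.
    Returns the backup path, or None if there was nothing to rename.
    """
    # The tokenized directories must be kept, so stop early if one is missing.
    for name in ("cnn_stories_tokenized", "dm_stories_tokenized"):
        if not os.path.isdir(os.path.join(root, name)):
            raise FileNotFoundError("expected directory %r under %r" % (name, root))

    src = os.path.join(root, "finished_files")
    dst = os.path.join(root, backup_name)
    if not os.path.isdir(src):
        return None  # no old .bin / vocab files to set aside
    if os.path.exists(dst):
        raise FileExistsError("backup location %r already exists" % dst)
    os.rename(src, dst)
    return dst
```

Calling `set_aside_finished_files(".")` from the directory that holds these folders would leave the tokenized stories untouched and move only the old .bin and vocab files out of the way.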

If you've already begun training with the TensorFlow code, I advise you to restart training with the new datafiles. Switching the vocab and .bin files mid-training will not work.

Apologies for the inconvenience.

Tagging people to whom this may be relevant: @prokopevaleksey @tianjianjiang @StevenLOL @MrGLaDOS @hate5six @liuchen11 @bugtig @ayushoriginal @BenJamesbabala @BinbinBian @caomw @halolimat @ml-lab @ParseThis @qiang2100 @scylla @tonydeep @yiqingyang2012 @YuxuanHuang @Rahul-Iisc @pj-parag

@yangze01

When I run the code, it always prints "Tried to find tokenized story file 9bfbb6ede20df9611c2a8b42980629658dc5ec23.story in both directories cnn_stories_tokenized and dm_stories_tokenized. Couldn't find it."
How can I fix it?

@abisee
Owner Author

abisee commented May 22, 2017

@yangze01 That error message means that it's trying to find a .story file in cnn_stories_tokenized or dm_stories_tokenized but the file is in neither. Those directories should contain a tokenized version of every file in the original cnn / dailymail stories directories you passed into make_datafiles.py.

You probably had some error during tokenization that resulted in an incomplete set of tokenized files in cnn_stories_tokenized and dm_stories_tokenized.

The latest commit now has more informative checks and error messages.
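
One way to confirm whether tokenization was incomplete is to compare filenames between an original stories directory and its tokenized counterpart. The sketch below is not the check the repo itself performs: `find_missing_tokenized` is a hypothetical helper, and it assumes each tokenized file has the same filename as its original (which the error message above suggests, but which is still an assumption here).

```python
import os


def find_missing_tokenized(stories_dir, tokenized_dir):
    """Return the sorted filenames in stories_dir that are absent from tokenized_dir.

    Hypothetical helper for illustration; assumes a tokenized file keeps the
    filename of the original story it was produced from.
    """
    originals = set(os.listdir(stories_dir))
    tokenized = set(os.listdir(tokenized_dir))
    return sorted(originals - tokenized)
```

An empty list for both the cnn and dailymail pairs would suggest that tokenization completed; a non-empty list names the stories that would need to be tokenized again.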

@yangze01

@abisee Thanks. I downloaded the original cnn/dailymail stories, and it works.

abisee closed this as completed Aug 5, 2017