Efficiently generate summaries from new data? #51

Closed
ibarrien opened this issue Aug 29, 2017 · 9 comments
@ibarrien

ibarrien commented Aug 29, 2017

How do folks typically generate summaries given i) new text files (only content, no headlines/abstracts/urls) and ii) the pretrained model?

It seems that make_datafiles.py from the cnn-dailymail dir needs to be modified (e.g. removing much of the hardcoding). After these modifications, make_datafiles.py may be used to tokenize and chunk the new data. From there, we "decode" using the pretrained model to generate new summaries.

Is the above generally correct or is there a more efficient method?
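For reference, here is a minimal sketch of what a stripped-down make_datafiles.py can look like for new, URL-less data: it serializes already-tokenized, lowercased story text into the same length-prefixed tf.Example records that make_datafiles.py writes. The write_to_bin name and the directory paths are just placeholders for illustration, and the dummy abstract is only there because the data pipeline still expects that field.

# Sketch: write plain-text articles into the test.bin format expected by
# run_summarization.py. Assumes the text is already tokenized (e.g. with the
# Stanford CoreNLP PTBTokenizer) and will be lowercased here, as in make_datafiles.py.
import os
import struct
import tensorflow as tf

def write_to_bin(story_dir, out_file):
    with open(out_file, 'wb') as writer:
        for fname in sorted(os.listdir(story_dir)):
            with open(os.path.join(story_dir, fname)) as f:
                article = ' '.join(f.read().split()).lower()
            # No reference summary for new data, so use a dummy abstract
            # wrapped in the <s>...</s> sentence tags the pipeline expects.
            abstract = '<s> no abstract available . </s>'
            tf_example = tf.train.Example()
            tf_example.features.feature['article'].bytes_list.value.extend(
                [article.encode()])
            tf_example.features.feature['abstract'].bytes_list.value.extend(
                [abstract.encode()])
            example_str = tf_example.SerializeToString()
            # Each record is an 8-byte length followed by the serialized example.
            writer.write(struct.pack('q', len(example_str)))
            writer.write(struct.pack('%ds' % len(example_str), example_str))

os.makedirs('finished_files', exist_ok=True)
write_to_bin('tokenized_stories', 'finished_files/test.bin')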

@ibarrien
Author

ibarrien commented Aug 31, 2017

Tips for anyone running decoding using a pretrained model:
after tokenizing and chunking your text(s) as above, run a command like:

python pointer-generator-master/run_summarization.py --mode=decode --data_path=/path/to/your/finished_test_files/chunked/test_* --vocab_path=vocab --log_root=/path/to/pretrained_model --exp_name=pretrained_model_ --coverage=1 --single_pass=1 --max_enc_steps=500 --max_dec_steps=200 --min_dec_steps=100

Here, "vocab" is the from the "finished" cnn+daily mail data and "pretrained_model" is, e.g., one of the pretrained models kindly made available by the author.

The summarization results will then be located in a "decoded" folder inside a newly created directory named something like "decode_test_500maxenc..."
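A matching sketch of the chunking step referenced above (the chunk size and the test_%03d.bin naming follow the cnn-dailymail make_datafiles.py; the paths are placeholders for illustration):

# Sketch: split finished_files/test.bin into finished_files/chunked/test_000.bin,
# test_001.bin, ... with up to 1000 examples per chunk, mirroring the chunking
# done by make_datafiles.py in the cnn-dailymail repo.
import os
import struct

CHUNK_SIZE = 1000  # examples per chunk

def chunk_file(in_file, chunks_dir, set_name='test'):
    os.makedirs(chunks_dir, exist_ok=True)
    reader = open(in_file, 'rb')
    chunk = 0
    finished = False
    while not finished:
        chunk_fname = os.path.join(chunks_dir, '%s_%03d.bin' % (set_name, chunk))
        with open(chunk_fname, 'wb') as writer:
            for _ in range(CHUNK_SIZE):
                # Each record: 8-byte length, then the serialized tf.Example.
                len_bytes = reader.read(8)
                if not len_bytes:
                    finished = True
                    break
                str_len = struct.unpack('q', len_bytes)[0]
                example_str = struct.unpack('%ds' % str_len, reader.read(str_len))[0]
                writer.write(struct.pack('q', str_len))
                writer.write(struct.pack('%ds' % str_len, example_str))
        chunk += 1
    reader.close()

chunk_file('finished_files/test.bin', 'finished_files/chunked')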

@Sharathnasa

@ibarrien What changes did you make in make_datafiles.py? In the cnn-dailymail code, the input is fed in the form of URL lists; how can we feed only the story text as input and generate the new data files? Could you please shed some light?

@anashkum

anashkum commented May 2, 2018

I am having a similar issue -

I am able to successfully run the pre-trained model of TextSum (Tensorflow 1.2.1). The output consists of summaries of CNN & Dailymail articles (which are chunked into bin format prior to testing).

I have also been able to create the aforementioned bin-format test data and vocab file for the CNN/Daily Mail articles (per the instructions here). However, I am not able to create my own test data to check how good the summaries are. I have tried modifying the make_datafiles.py code to remove hard-coded values. I am able to create the tokenized files, but the next step seems to be failing. It would be great if someone could help me understand what url_lists is being used for. (I was under the impression that the summaries are generated from test.bin, which is created from the .story files and not from the URLs.) Per the GitHub readme -

"For each of the url lists all_train.txt, all_val.txt and all_test.txt, the corresponding tokenized stories are read from file, lowercased and written to serialized binary files train.bin, val.bin and test.bin. These will be placed in the newly-created finished_files directory."

How is a URL such as http://web.archive.org/web/20150401100102id_/http://www.cnn.com/2015/04/01/europe/france-germanwings-plane-crash-main/ mapped to the corresponding story file in my data folder? If someone has had success with this, please let me know how to go about it. Thanks in advance!
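For what it's worth, the URL-to-story mapping in the original make_datafiles.py is just a hash: each URL in the url_lists files is run through SHA1 and the hex digest becomes the .story filename. A small sketch of that mapping (hashhex mirrors the helper in that script; the rest is illustrative):

# Sketch: how make_datafiles.py maps a URL from the url_lists files to a
# .story filename -- the filename is the SHA1 hex digest of the URL.
import hashlib

def hashhex(s):
    """Return the SHA1 hex digest of the input string."""
    h = hashlib.sha1()
    h.update(s.encode())
    return h.hexdigest()

url = ('http://web.archive.org/web/20150401100102id_/'
       'http://www.cnn.com/2015/04/01/europe/france-germanwings-plane-crash-main/')
print(hashhex(url) + '.story')  # name of the corresponding file in the stories dir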

@pruthvishetty

If you are trying to test the model on your own data, you can avoid dealing with the url_lists folder altogether. You can follow this - https://github.com/dondon2475848/make_datafiles_for_pgn - to generate bin file(s) for summarizing your data. Use it with --single_pass=1 if you are testing individual articles, like so:
python run_summarization.py --mode=decode --data_path=/make_datafiles_for_pgn-master/output/finished_files/test* --vocab_path=/pointer-generator/finished_files/vocab --log_root=../pointer-generator/ --exp_name=pretrainedmodel --max_enc_steps=450 --max_dec_steps=200 --coverage=1 --single_pass=1
I found that ROUGE scoring had an issue with its settings.ini file. If scoring isn't a priority, you can comment out the ROUGE-related lines in decode.py.

@microLizzy

@pruthvishetty When I ran the python make_datafiles.py ./stories ./output command, the generated test.bin file was empty. I strictly followed the readme, so do you know what's going on here? Or maybe I did something wrong?

@pruthvishetty

pruthvishetty commented May 17, 2018

@microLizzy Make this change in your script:
Replace
story_fnames = [name for name in os.listdir(tokenized_stories_dir) if os.path.isfile(tokenized_stories_dir + '\\' + name)]
with
story_fnames = [name for name in os.listdir(tokenized_stories_dir)]
That should solve the issue. The author joined the path with a backslash, which works on Windows but not on GNU/Linux or macOS, so the os.path.isfile check never matched and the list of stories came out empty.

@microLizzy

@pruthvishetty That works! Thanks so much!

@jonaseltes

@pruthvishetty did you manage to disable ROUGE and still run with single_pass? Which lines did you comment out? Thanks!

@shivam13juna

@jonaseltes There are just a couple of calls to the ROUGE functions in the decode.py file. Just find where those functions are used and comment them out; it's a really easy change.
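For anyone searching for those lines, a hedged pointer (exact code differs between versions of decode.py): the scoring happens in the single_pass branch of the decode loop, in the calls to rouge_eval and rouge_log, roughly as below.

# Excerpt-style sketch for decode.py (pointer-generator), shown for orientation
# only -- line numbers and surrounding code vary by version of the repo.
# Commenting out these two calls disables ROUGE scoring, while the decoded
# summaries are still written to the decoded/ directory:
#
# results_dict = rouge_eval(self._rouge_ref_dir, self._rouge_dec_dir)
# rouge_log(results_dict, self._decode_dir)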
