Efficiently generate summaries from new data? #51

Closed
ibarrien opened this issue Aug 29, 2017 · 9 comments
@ibarrien

ibarrien commented Aug 29, 2017

How do folks typically generate summaries given i) new text files (only content, no headlines/abstracts/urls) and ii) the pretrained model?

It seems that make_datafiles.py from the cnn-dailymail dir needs to be modified (e.g. removing much of the hardcoding). After these modifications, make_datafiles.py may be used to tokenize and chunk the new data. From there, we "decode" using the pretrained model to generate new summaries.

Is the above generally correct or is there a more efficient method?
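For reference, here is a minimal sketch of what a stripped-down make_datafiles.py can look like for new, URL-less data: it serializes already-tokenized, lowercased story text into the same length-prefixed tf.Example records that make_datafiles.py writes. The write_to_bin name and the directory paths are just placeholders for illustration, and the dummy abstract is only there because the data pipeline still expects that field.

# Sketch: write plain-text articles into the test.bin format expected by
# run_summarization.py. Assumes the text is already tokenized (e.g. with the
# Stanford CoreNLP PTBTokenizer) and will be lowercased here, as in make_datafiles.py.
import os
import struct
import tensorflow as tf

def write_to_bin(story_dir, out_file):
    with open(out_file, 'wb') as writer:
        for fname in sorted(os.listdir(story_dir)):
            with open(os.path.join(story_dir, fname)) as f:
                article = ' '.join(f.read().split()).lower()
            # No reference summary for new data, so use a dummy abstract
            # wrapped in the <s>...</s> sentence tags the pipeline expects.
            abstract = '<s> no abstract available . </s>'
            tf_example = tf.train.Example()
            tf_example.features.feature['article'].bytes_list.value.extend(
                [article.encode()])
            tf_example.features.feature['abstract'].bytes_list.value.extend(
                [abstract.encode()])
            example_str = tf_example.SerializeToString()
            # Each record is an 8-byte length followed by the serialized example.
            writer.write(struct.pack('q', len(example_str)))
            writer.write(struct.pack('%ds' % len(example_str), example_str))

os.makedirs('finished_files', exist_ok=True)
write_to_bin('tokenized_stories', 'finished_files/test.bin')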

@ibarrien
Author

ibarrien commented Aug 31, 2017

Tips for anyone running decoding using a pretrained model:
after tokenizing and chunking your text(s) as above, run a command like:

python pointer-generator-master/run_summarization.py --mode=decode --data_path=/path/to/your/finished_test_files/chunked/test_* --vocab_path=vocab --log_root=/path/to/pretrained_model --exp_name=pretrained_model_ --coverage=1 --single_pass=1 --max_enc_steps=500 --max_dec_steps=200 --min_dec_steps=100

Here, "vocab" is the from the "finished" cnn+daily mail data and "pretrained_model" is, e.g., one of the pretrained models kindly made available by the author.

The summarization results will then be located in a "decoded" folder inside a newly created directory named something like "decode_test_500maxenc..."
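A matching sketch of the chunking step referenced above (the chunk size and the test_%03d.bin naming follow the cnn-dailymail make_datafiles.py; the paths are placeholders for illustration):

# Sketch: split finished_files/test.bin into finished_files/chunked/test_000.bin,
# test_001.bin, ... with up to 1000 examples per chunk, mirroring the chunking
# done by make_datafiles.py in the cnn-dailymail repo.
import os
import struct

CHUNK_SIZE = 1000  # examples per chunk

def chunk_file(in_file, chunks_dir, set_name='test'):
    os.makedirs(chunks_dir, exist_ok=True)
    reader = open(in_file, 'rb')
    chunk = 0
    finished = False
    while not finished:
        chunk_fname = os.path.join(chunks_dir, '%s_%03d.bin' % (set_name, chunk))
        with open(chunk_fname, 'wb') as writer:
            for _ in range(CHUNK_SIZE):
                # Each record: 8-byte length, then the serialized tf.Example.
                len_bytes = reader.read(8)
                if not len_bytes:
                    finished = True
                    break
                str_len = struct.unpack('q', len_bytes)[0]
                example_str = struct.unpack('%ds' % str_len, reader.read(str_len))[0]
                writer.write(struct.pack('q', str_len))
                writer.write(struct.pack('%ds' % str_len, example_str))
        chunk += 1
    reader.close()

chunk_file('finished_files/test.bin', 'finished_files/chunked')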

@Sharathnasa

@ibarrien What changes did you make in make_datafiles.py? In the cnn-dailymail code, the input is fed in the form of URL lists; how can we feed only the story text as input and generate the new data files? Could you please shed some light?

@anashkum

anashkum commented May 2, 2018

I am having a similar issue -

I am able to successfully run the pre-trained model of TextSum (Tensorflow 1.2.1). The output consists of summaries of CNN & Dailymail articles (which are chunked into bin format prior to testing).

I have also been able to create the aforementioned bin-format test data and vocab file for the CNN/Daily Mail articles (per the instructions here). However, I am not able to create my own test data to check how good the summaries are. I have tried modifying the make_datafiles.py code to remove hard-coded values. I am able to create the tokenized files, but the next step seems to be failing. It would be great if someone could help me understand what url_lists is being used for. (I was under the impression that the summaries are generated from test.bin, which is created from the .story files and not from the URLs.) Per the GitHub readme -

"For each of the url lists all_train.txt, all_val.txt and all_test.txt, the corresponding tokenized stories are read from file, lowercased and written to serialized binary files train.bin, val.bin and test.bin. These will be placed in the newly-created finished_files directory."

How is a URL such as http://web.archive.org/web/20150401100102id_/http://www.cnn.com/2015/04/01/europe/france-germanwings-plane-crash-main/ mapped to the corresponding story file in my data folder? If someone has had success with this, please let me know how to go about it. Thanks in advance!
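For what it's worth, the URL-to-story mapping in the original make_datafiles.py is just a hash: each URL in the url_lists files is run through SHA1 and the hex digest becomes the .story filename. A small sketch of that mapping (hashhex mirrors the helper in that script; the rest is illustrative):

# Sketch: how make_datafiles.py maps a URL from the url_lists files to a
# .story filename -- the filename is the SHA1 hex digest of the URL.
import hashlib

def hashhex(s):
    """Return the SHA1 hex digest of the input string."""
    h = hashlib.sha1()
    h.update(s.encode())
    return h.hexdigest()

url = ('http://web.archive.org/web/20150401100102id_/'
       'http://www.cnn.com/2015/04/01/europe/france-germanwings-plane-crash-main/')
print(hashhex(url) + '.story')  # name of the corresponding file in the stories dir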

@pruthvishetty

If you are trying to test the model on your own data, you can avoid dealing with the url_lists folder altogether. You can follow this - https://github.com/dondon2475848/make_datafiles_for_pgn - to generate bin file(s) for summarizing your data. Use it with --single_pass=1 if you are testing individual articles, like so:
python run_summarization.py --mode=decode --data_path=/make_datafiles_for_pgn-master/output/finished_files/test* --vocab_path=/pointer-generator/finished_files/vocab --log_root=../pointer-generator/ --exp_name=pretrainedmodel --max_enc_steps=450 --max_dec_steps=200 --coverage=1 --single_pass=1
I found that ROUGE scoring had an issue with its settings.ini file. If scoring isn't a priority, you can comment out the ROUGE-related lines in decode.py.

@microLizzy

@pruthvishetty When I ran the python make_datafiles.py ./stories ./output command, the generated test.bin file was empty. I strictly followed the readme, so do you know what's going on here? Or maybe I did something wrong?

@pruthvishetty

pruthvishetty commented May 17, 2018

@microLizzy Make this change in your script:
Replace
story_fnames = [name for name in os.listdir(tokenized_stories_dir) if os.path.isfile(tokenized_stories_dir + '\\' + name)]
with
story_fnames = [name for name in os.listdir(tokenized_stories_dir)]
That should solve the issue. The author joined the path with a backslash, which works on Windows but not on GNU/Linux or macOS, so the os.path.isfile check never matched and the list of stories came out empty.

@microLizzy

@pruthvishetty That works! Thanks so much!

@jonaseltes

@pruthvishetty did you manage to disable ROUGE and still run with single_pass? Which lines did you comment out? Thanks!

@shivam13juna

@jonaseltes There are just a couple of calls to the ROUGE functions in the decode.py file. Just find where those functions are used and comment them out; it's a really easy change.
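For anyone searching for those lines, a hedged pointer (exact code differs between versions of decode.py): the scoring happens in the single_pass branch of the decode loop, in the calls to rouge_eval and rouge_log, roughly as below.

# Excerpt-style sketch for decode.py (pointer-generator), shown for orientation
# only -- line numbers and surrounding code vary by version of the repo.
# Commenting out these two calls disables ROUGE scoring, while the decoded
# summaries are still written to the decoded/ directory:
#
# results_dict = rouge_eval(self._rouge_ref_dir, self._rouge_dec_dir)
# rouge_log(results_dict, self._decode_dir)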
