
Tried running it on random internet news articles. Results look more extractive than abstractive? #49

Open
anubhavmax opened this issue Aug 23, 2017 · 17 comments



anubhavmax commented Aug 23, 2017

Hi Abigail. I was trying to run the code using the already uploaded pre-trained model, as I do not have a powerful enough machine to train. I believe the vocab size is set to 50000 in the code. After running it on multiple news articles from the internet, I found the results to be extractive. I didn't really encounter any case where a new word was generated for the summary. Am I missing something in the settings? Could you please let me know where the gap in my understanding lies?

anubhavmax changed the title from "Time taken to generate summaries once the model is trained??" to "Tried running it on random internet news articles. Results look more extractive than abstractive?" on Aug 24, 2017

abisee commented Sep 13, 2017

Hi @anubhavmax, the same question has been asked here.

Yes - the pointer-generator model produces mostly extractive summaries. This is discussed in section 7.2 of the paper. It is the main area for future work!

@Sharathnasa

@anubhavmax Hi, how did you manage to run it on your own data? Could you please shed some light?

Thanks,
Sharath


alkanen commented Jan 13, 2018

@Sharathnasa, you need to run the text through the Stanford tokenizer Java program first in order to create a token list file to feed to the network.

Basically, in Linux, you run
cat normal_text.txt | java edu.stanford.nlp.process.PTBTokenizer -preserveLines

And it will print a tokenized version of the text, which you need to save to a new file. That file is then fed into the pointer generator network with the "--data_path=" argument and "--mode=decode".
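
For example, assuming the raw article sits in a placeholder file called article.txt (and the Stanford CoreNLP jar is on the Java classpath), saving the tokenized output would look roughly like:

cat article.txt | java edu.stanford.nlp.process.PTBTokenizer -preserveLines > article_tokenized.txt

article_tokenized.txt is then the file you point "--data_path=" at in decode mode.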


Sharathnasa commented Jan 14, 2018

@alkanen Thanks a lot, man! I will give it a try. By text, do you mean that if I only pass the entire article without an abstract, it will work fine?

Or should I process it into .bin and vocab files as explained in the cnn-dailymail repo? And one more thing: how is the 1-to-1 mapping between URLs and stories done? If I need to do it, how should I proceed?


alkanen commented Jan 14, 2018

@Sharathnasa Text as in the entire article without an abstract, yes. That will create a bin file with a single article in it. Use the vocab file you already have from the CNN training set; it doesn't make much sense to create a new one based on a single article, and unless I misremember, it would also break everything, because the network was trained with a particular vocab and that one needs to be used.

I'm afraid I never looked into the URL/stories mapping since that wasn't relevant for the work I did, so I can't help you there.
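
For what it's worth, the .bin format the data loader expects is simple: each example is a serialized tf.train.Example with "article" and "abstract" byte features, prefixed by an 8-byte length. A minimal sketch for packing a single tokenized article (with an empty abstract, since you only want to decode) could look like the following; the filenames are placeholders, and this mirrors what make_datafiles.py in the cnn-dailymail repo does, as I understand it:

# Minimal sketch (filenames are placeholders): wrap one tokenized article
# into the length-prefixed tf.Example format the pointer-generator reads.
import struct
from tensorflow.core.example import example_pb2

with open("article_tokenized.txt") as f:
    article = " ".join(f.read().split()).lower()  # one long, lowercased string

abstract = ""  # no reference summary is needed for decode mode

ex = example_pb2.Example()
ex.features.feature["article"].bytes_list.value.extend([article.encode()])
ex.features.feature["abstract"].bytes_list.value.extend([abstract.encode()])
serialized = ex.SerializeToString()

with open("test.bin", "wb") as writer:
    writer.write(struct.pack("q", len(serialized)))  # 8-byte length prefix
    writer.write(struct.pack("%ds" % len(serialized), serialized))

The resulting test.bin is then what --data_path points at in decode mode.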

@Sharathnasa

@alkanen Thanks once again, man. When I try to run it as you mentioned, I'm getting the error below:

vi womendriver.text | java edu.stanford.nlp.process.PTBTokenizer -preserveLines
Vim: Warning: Output is not to a terminal
Untokenizable: (U+1B, decimal: 27)

Would you please pass on the script if you have one?


alkanen commented Jan 14, 2018

@Sharathnasa you can't pipe vi into java; use cat to pipe the contents of the text file into java.

@Sharathnasa

@alkanen OK, my bad. Thanks once again. After performing the tokenization (and saving the output), do I need to run the make_datafiles.py code to generate .bin files?


alkanen commented Jan 14, 2018

Nope, just use the old vocab file used for training, and the file created by tokenization as input to the model:
python pointer-generator/run_summarization.py --log_root=<some path with trained models in it> --exp_name=<the name of your trained model> --vocab_path=<your old vocab file> --mode=decode --data_path=<the file generated by tokenizer>

@Sharathnasa

@alkanen Did you take a look at #51?


alkanen commented Jan 14, 2018

No; is there anything in particular there that you think I should be aware of?

I never had the need to summarize multiple texts at once, so I haven't looked into that use case at all.

@Sharathnasa

@alkanen Nothing in particular, I just wanted to let you know about the command he suggested running.

I have one more query:

  1. The repo says the input should be in the form of .bin files, but the tokenized output we created is not in .bin format; will the network still run?
  2. Is what you suggested only for running a single article?

@Sharathnasa

Hi @alkanen, when I run the command below

python3 pointer-generator/run_summarization.py --mode=decode --data_path=/Users/setup/text_abstraction/cnn-dailymail/finished_files/chunked/train_* --vocab_path=/Users/setup/text_abstraction/finished_files/vocab --log_root=/Users/setup/text_abstraction/pointer-generator/pretrained_model_tf1.2.1/train --exp_name="model-238410.data-00000-of-00001" --coverage=1 --single_pass=1 --max_enc_steps=500 --max_dec_steps=200 --min_dec_steps=100

I'm getting the logs below:
INFO:tensorflow:Failed to load checkpoint from /Users/setup/text_abstraction/pointer-generator/pretrained_model_tf1.2.1/train/model-238410.data-00000-of-00001/train. Sleeping for 10 secs...
INFO:tensorflow:Failed to load checkpoint from /Users/setup/text_abstraction/pointer-generator/pretrained_model_tf1.2.1/train/model-238410.data-00000-of-00001/train. Sleeping for 10 secs...
INFO:tensorflow:Failed to load checkpoint from /Users/setup/text_abstraction/pointer-generator/pretrained_model_tf1.2.1/train/model-238410.data-00000-of-00001/train. Sleeping for 10 secs...
INFO:tensorflow:Failed to load checkpoint from /Users/setup/text_abstraction/pointer-generator/pretrained_model_tf1.2.1/train/model-238410.data-00000-of-00001/train. Sleeping for 10 secs..

Where did it go wrong?


dondon2475848 commented Mar 6, 2018

Hi @Sharathnasa
You can clone the repository below:
https://github.com/dondon2475848/make_datafiles_for_pgn
Run

python make_datafiles.py  ./stories  ./output

It processes your test data into the binary format

@glalwani2

@dondon2475848 I tried your repo with a sample .txt file under the stories folder, and the .bin files didn't get created; only the tokenized file did. I am not sure why.

@dondon2475848

Did you put xxx.txt under the stories folder?
Maybe you can try xxx.story instead,
with a format like the one below:

test1.story

MOSCOW, Russia (CNN) -- Russian space officials say the crew of the Soyuz space ship is resting after a rough ride back to Earth.

A South Korean bioengineer was one of three people on board the Soyuz capsule.

The craft carrying South Korea's first astronaut landed in northern Kazakhstan on Saturday, 260 miles (418 kilometers) off its mark, they said.

Mission Control spokesman Valery Lyndin said the condition of the crew -- South Korean bioengineer Yi So-yeon, American astronaut Peggy Whitson and Russian flight engineer Yuri Malenchenko -- was satisfactory, though the three had been subjected to severe G-forces during the re-entry.

Search helicopters took 25 minutes to find the capsule and determine that the crew was unharmed.

Officials said the craft followed a very steep trajectory that subjects the crew to gravitational forces of up to 10 times those on Earth.

Interfax reported that the spacecraft's landing was rough.

This is not the first time a spacecraft veered from its planned trajectory during landing.

In October, the Soyuz capsule landed 70 kilometers from the planned area because of a damaged control cable. The capsule was carrying two Russian cosmonauts and the first Malaysian astronaut. E-mail to a friend

@highlight

Soyuz capsule lands hundreds of kilometers off-target

@highlight

Capsule was carrying South Korea's first astronaut

@highlight

Landing is second time Soyuz capsule has gone awry
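
In case it helps, the .story layout above is just the article body followed by @highlight blocks, and the preprocessing splits the two roughly like this (a sketch in the spirit of get_art_abs() in make_datafiles.py; the function name and file path here are illustrative):

# Sketch: split a .story file into the article text and the highlight
# sentences (the @highlight blocks come after the article body).
def split_story(path):
    article_lines, highlights = [], []
    seen_highlight = False
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith("@highlight"):
                seen_highlight = True
            elif seen_highlight:
                highlights.append(line)
            else:
                article_lines.append(line)
    return " ".join(article_lines), highlights

article, highlights = split_story("test1.story")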


victorherbemontagne commented Mar 23, 2018

@Sharathnasa I don't know if you still have this issue, but I think I figured it out. I had the same issue with the traceback; are you running TensorFlow 1.5?
You can check my repo: I forked @becxer's Python 3 port of the code and modified it for TensorFlow 1.5 (it still loads the TF 1.2.1 model provided in @abisee's repo). It was not much work; TF 1.5 has really poor support for tf.flags, so I modified the code to make it work.
If you look at your error, go to util.py and print the exception in the load_ckpt() function. For me it came from the fact that 4 words in the vocab_meta.tsv were not added to the vocab, so I had a shape issue; I made a small correction in the code to format the affected words and add them to the vocab, and it worked like a charm.
You can check my code and tell me if there is a bug or anything; I will work it out!
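
For anyone hitting the same "Failed to load checkpoint" loop, that debugging step amounts to roughly the following change (a sketch of load_ckpt() in util.py with approximate variable names; the bare except in the original hides the real error):

# Approximate sketch of load_ckpt() with the exception surfaced.
# Logging the exception shows why the restore fails (e.g. a vocab/shape
# mismatch, or a --log_root / --exp_name path that points at no checkpoint).
import time
import tensorflow as tf

def load_ckpt(saver, sess, ckpt_dir):
    while True:
        try:
            ckpt_state = tf.train.get_checkpoint_state(ckpt_dir)
            saver.restore(sess, ckpt_state.model_checkpoint_path)
            return ckpt_state.model_checkpoint_path
        except Exception as e:  # instead of a bare "except:"
            tf.logging.error("Checkpoint load failed: %r", e)
            tf.logging.info("Failed to load checkpoint from %s. Sleeping for 10 secs...", ckpt_dir)
            time.sleep(10)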
