
Tried running it on random internet news articles. Results look more extractive than abstractive? #49

Open
anubhavmax opened this issue Aug 23, 2017 · 17 comments



anubhavmax commented Aug 23, 2017

Hi Abigail. I was trying to run the code using the already uploaded pre-trained model, as I do not have a powerful enough machine to train. I believe the vocab size is set to 50000 in the code. After running it on multiple news articles from the internet, I found the results to be extractive. I didn't really encounter any case where a new word was generated for the summary. Am I missing something in the settings? Could you please let me know where the gap in my understanding lies?

anubhavmax changed the title from "Time taken to generate summaries once the model is trained??" to "Tried running it on random internet news articles. Results look more extractive than abstractive?" on Aug 24, 2017

abisee commented Sep 13, 2017

Hi @anubhavmax, the same question has been asked here.

Yes - the pointer-generator model produces mostly extractive summaries. This is discussed in section 7.2 of the paper. It is the main area for future work!

@Sharathnasa

@anubhavmax Hi, how did you manage to run it on your own data? Could you please shed some light?

Thanks,
Sharath


alkanen commented Jan 13, 2018

@Sharathnasa, you need to run the text through the Stanford tokenizer Java program first in order to create a token list file to feed to the network.

Basically, in Linux, you run
cat normal_text.txt | java edu.stanford.nlp.process.PTBTokenizer -preserveLines

And it will print a tokenized version of the text, which you need to save to a new file. That file is then fed into the pointer generator network with the "--data_path=" argument and "--mode=decode".
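
For example, assuming the raw article sits in a placeholder file called article.txt (and the Stanford CoreNLP jar is on the Java classpath), saving the tokenized output would look roughly like:

cat article.txt | java edu.stanford.nlp.process.PTBTokenizer -preserveLines > article_tokenized.txt

article_tokenized.txt is then the file you point "--data_path=" at in decode mode.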


Sharathnasa commented Jan 14, 2018

@alkanen Thanks a lot, man! I will give it a try. By text, do you mean that if I only pass the entire article without an abstract, it will work fine?

Or should I process it into .bin and vocab files as explained in the cnn-dailymail repo? And one more thing: how is the 1-to-1 mapping between URLs and stories done? If I need to do it, how should I proceed?


alkanen commented Jan 14, 2018

@Sharathnasa Text as in the entire article without an abstract, yes. That will create a bin file with a single article in it. Use the vocab file you already have from the CNN training set; it doesn't make much sense to create a new one based on a single article, and unless I misremember, it would also break everything, because the network was trained with a particular vocab and that one needs to be used.

I'm afraid I never looked into the URL/stories mapping since that wasn't relevant for the work I did, so I can't help you there.
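
For what it's worth, the .bin format the data loader expects is simple: each example is a serialized tf.train.Example with "article" and "abstract" byte features, prefixed by an 8-byte length. A minimal sketch for packing a single tokenized article (with an empty abstract, since you only want to decode) could look like the following; the filenames are placeholders, and this mirrors what make_datafiles.py in the cnn-dailymail repo does, as I understand it:

# Minimal sketch (filenames are placeholders): wrap one tokenized article
# into the length-prefixed tf.Example format the pointer-generator reads.
import struct
from tensorflow.core.example import example_pb2

with open("article_tokenized.txt") as f:
    article = " ".join(f.read().split()).lower()  # one long, lowercased string

abstract = ""  # no reference summary is needed for decode mode

ex = example_pb2.Example()
ex.features.feature["article"].bytes_list.value.extend([article.encode()])
ex.features.feature["abstract"].bytes_list.value.extend([abstract.encode()])
serialized = ex.SerializeToString()

with open("test.bin", "wb") as writer:
    writer.write(struct.pack("q", len(serialized)))  # 8-byte length prefix
    writer.write(struct.pack("%ds" % len(serialized), serialized))

The resulting test.bin is then what --data_path points at in decode mode.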

@Sharathnasa

@alkanen Thanks once again, man. When I try to run it as you mentioned, I'm getting the error below:

vi womendriver.text | java edu.stanford.nlp.process.PTBTokenizer -preserveLines
Vim: Warning: Output is not to a terminal
Untokenizable: (U+1B, decimal: 27)

Would you please pass on the script if you have one?


alkanen commented Jan 14, 2018

@Sharathnasa you can't pipe vi into java; use cat to pipe the contents of the text file into java.

@Sharathnasa

@alkanen OK, my bad. Thanks once again. After performing the tokenization (and saving the output), do I need to run the make_datafiles.py code to generate .bin files?


alkanen commented Jan 14, 2018

Nope, just use the old vocab file used for training, and the file created by tokenization as input to the model:
python pointer-generator/run_summarization.py --log_root=<some path with trained models in it> --exp_name=<the name of your trained model> --vocab_path=<your old vocab file> --mode=decode --data_path=<the file generated by tokenizer>

@Sharathnasa

@alkanen Did you take a look at #51?


alkanen commented Jan 14, 2018

No; is there anything in particular there that you think I should be aware of?

I never had the need to summarize multiple texts at once, so I haven't looked into that use case at all.

@Sharathnasa

@alkanen Nothing in particular, I just wanted to let you know about the command he suggested running.

I have one more query:

  1. The repo says the input should be in the form of .bin files, but the tokenized output we created is not in .bin format; will the network still run?
  2. Is what you suggested only for running a single article?

@Sharathnasa

Hi @alkanen, when I run the command below

python3 pointer-generator/run_summarization.py --mode=decode --data_path=/Users/setup/text_abstraction/cnn-dailymail/finished_files/chunked/train_* --vocab_path=/Users/setup/text_abstraction/finished_files/vocab --log_root=/Users/setup/text_abstraction/pointer-generator/pretrained_model_tf1.2.1/train --exp_name="model-238410.data-00000-of-00001" --coverage=1 --single_pass=1 --max_enc_steps=500 --max_dec_steps=200 --min_dec_steps=100

I'm getting the logs below:
INFO:tensorflow:Failed to load checkpoint from /Users/setup/text_abstraction/pointer-generator/pretrained_model_tf1.2.1/train/model-238410.data-00000-of-00001/train. Sleeping for 10 secs...
INFO:tensorflow:Failed to load checkpoint from /Users/setup/text_abstraction/pointer-generator/pretrained_model_tf1.2.1/train/model-238410.data-00000-of-00001/train. Sleeping for 10 secs...
INFO:tensorflow:Failed to load checkpoint from /Users/setup/text_abstraction/pointer-generator/pretrained_model_tf1.2.1/train/model-238410.data-00000-of-00001/train. Sleeping for 10 secs...
INFO:tensorflow:Failed to load checkpoint from /Users/setup/text_abstraction/pointer-generator/pretrained_model_tf1.2.1/train/model-238410.data-00000-of-00001/train. Sleeping for 10 secs..

Where did it go wrong?


dondon2475848 commented Mar 6, 2018

Hi @Sharathnasa
You can clone the repository below:
https://github.com/dondon2475848/make_datafiles_for_pgn
Run

python make_datafiles.py  ./stories  ./output

It processes your test data into the binary format

@glalwani2

@dondon2475848 I tried your repo with a sample .txt file under the stories folder, and the .bin files didn't get created; only the tokenized file did. I am not sure why.

@dondon2475848

Did you put xxx.txt under the stories folder?
Maybe you can try xxx.story instead,
with a format like the one below:

test1.story

MOSCOW, Russia (CNN) -- Russian space officials say the crew of the Soyuz space ship is resting after a rough ride back to Earth.

A South Korean bioengineer was one of three people on board the Soyuz capsule.

The craft carrying South Korea's first astronaut landed in northern Kazakhstan on Saturday, 260 miles (418 kilometers) off its mark, they said.

Mission Control spokesman Valery Lyndin said the condition of the crew -- South Korean bioengineer Yi So-yeon, American astronaut Peggy Whitson and Russian flight engineer Yuri Malenchenko -- was satisfactory, though the three had been subjected to severe G-forces during the re-entry.

Search helicopters took 25 minutes to find the capsule and determine that the crew was unharmed.

Officials said the craft followed a very steep trajectory that subjects the crew to gravitational forces of up to 10 times those on Earth.

Interfax reported that the spacecraft's landing was rough.

This is not the first time a spacecraft veered from its planned trajectory during landing.

In October, the Soyuz capsule landed 70 kilometers from the planned area because of a damaged control cable. The capsule was carrying two Russian cosmonauts and the first Malaysian astronaut. E-mail to a friend

@highlight

Soyuz capsule lands hundreds of kilometers off-target

@highlight

Capsule was carrying South Korea's first astronaut

@highlight

Landing is second time Soyuz capsule has gone awry
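
In case it helps, the .story layout above is just the article body followed by @highlight blocks, and the preprocessing splits the two roughly like this (a sketch in the spirit of get_art_abs() in make_datafiles.py; the function name and file path here are illustrative):

# Sketch: split a .story file into the article text and the highlight
# sentences (the @highlight blocks come after the article body).
def split_story(path):
    article_lines, highlights = [], []
    seen_highlight = False
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith("@highlight"):
                seen_highlight = True
            elif seen_highlight:
                highlights.append(line)
            else:
                article_lines.append(line)
    return " ".join(article_lines), highlights

article, highlights = split_story("test1.story")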


victorherbemontagne commented Mar 23, 2018

@Sharathnasa I don't know if you still have this issue, but I think I figured it out. I had the same issue with the traceback; are you running TensorFlow 1.5?
You can check my repo: I forked @becxer's Python 3 port of the code and modified it for TensorFlow 1.5 (it still loads the TF 1.2.1 model provided in @abisee's repo). It was not much work; TF 1.5 has really poor support for tf.flags, so I modified the code to make it work.
If you look at your error, go to util.py and print the exception in the load_ckpt() function. For me it came from the fact that 4 words in the vocab_meta.tsv were not added to the vocab, so I had a shape issue; I made a small correction in the code to format the affected words and add them to the vocab, and it worked like a charm.
You can check my code and tell me if there is a bug or anything; I will work it out!
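
For anyone hitting the same "Failed to load checkpoint" loop, that debugging step amounts to roughly the following change (a sketch of load_ckpt() in util.py with approximate variable names; the bare except in the original hides the real error):

# Approximate sketch of load_ckpt() with the exception surfaced.
# Logging the exception shows why the restore fails (e.g. a vocab/shape
# mismatch, or a --log_root / --exp_name path that points at no checkpoint).
import time
import tensorflow as tf

def load_ckpt(saver, sess, ckpt_dir):
    while True:
        try:
            ckpt_state = tf.train.get_checkpoint_state(ckpt_dir)
            saver.restore(sess, ckpt_state.model_checkpoint_path)
            return ckpt_state.model_checkpoint_path
        except Exception as e:  # instead of a bare "except:"
            tf.logging.error("Checkpoint load failed: %r", e)
            tf.logging.info("Failed to load checkpoint from %s. Sleeping for 10 secs...", ckpt_dir)
            time.sleep(10)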
