Tried running it on random internet news articles. Results look more extractive than abstractive? #49
Hi Abigail, I was trying to run the code using the already-uploaded pretrained model, as I don't have a powerful enough machine to train. I believe the vocab size is set to 50,000 in the code. After running it on multiple news articles from the internet, I found the results to be extractive; I didn't really encounter any case where a new word was generated for the summary. Am I missing something in the settings? Could you please let me know where the gap in my understanding lies?

Comments
Hi @anubhavmax, the same question has been asked here. Yes, the pointer-generator model produces mostly extractive summaries. This is discussed in section 7.2 of the paper. It is the main area for future work!
Hi @anubhavmax, how did you manage to run the model on your own data? Could you please shed some light? Thanks.
@Sharathnasa, you need to run the text through the Stanford tokenizer Java program first in order to create a token-list file to feed to the network. Basically, on Linux, you run the tokenizer command (see the sketch below), and it will print a tokenized version of the text, which you need to save to a new file. That file is then fed to the pointer-generator network with the "--data_path=" argument and "--mode=decode".
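(The exact command was lost from this comment; a plausible reconstruction based on the later exchange in this thread, where `article.txt` is a placeholder filename and the Stanford CoreNLP jar is assumed to be on the Java classpath:)

```bash
# Stream the article text through the Stanford tokenizer and save the output.
# article.txt is a placeholder; assumes the CoreNLP jar is on the CLASSPATH.
cat article.txt | java edu.stanford.nlp.process.PTBTokenizer -preserveLines > article.tokenized.txt
```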
@alkanen Thanks a lot, man! I will give it a try. "Text" in the sense that I only pass the entire article without an abstract, and it will work fine, right? Or do I need to process it into .bin and vocab files as explained in the cnn-dailymail repo? And one more thing: how is the 1-to-1 mapping between URLs and stories done? If I need to do it, how should I proceed?
@Sharathnasa Text as in the entire article without an abstract, yes. That will create a .bin file with a single article in it. Use the vocab file you already have from the CNN training set; it doesn't make much sense to create a new one based on a single article, and unless I misremember, it would also break everything, because the network was trained with a particular vocab and that one needs to be used. I'm afraid I never looked into the URL/stories mapping, since it wasn't relevant to the work I did, so I can't help you there.
@alkanen Thanks once again, man. When I try to run it as you mentioned, I'm getting an error. The command I ran: `vi womendriver.text | java edu.stanford.nlp.process.PTBTokenizer -preserveLines`. Would you please pass on the script if you have it?
@Sharathnasa You can't pipe vi into java; use cat to pipe the contents of the text file into java (see the sketch below).
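(For concreteness, the corrected pipeline would look like this; the output redirect is an addition so the tokenized text gets saved, as suggested earlier in the thread:)

```bash
# vi is an editor, not a reader; cat streams the file into the tokenizer.
cat womendriver.text | java edu.stanford.nlp.process.PTBTokenizer -preserveLines > womendriver.tokenized.txt
```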
@alkanen OK, my bad. Thanks once again. After performing the tokenization (and saving the output), do I need to run the make_datafiles.py code to generate the .bin files?
Nope, just use the old vocab file from training, plus the file created by tokenization, as input to the model:
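(The command itself appears to be missing from this comment; a plausible sketch using the flags that appear later in this thread, where all paths and the experiment name are placeholders:)

```bash
# A sketch of the decode invocation; paths and exp_name are placeholders.
python run_summarization.py \
  --mode=decode \
  --single_pass=1 \
  --data_path=/path/to/test.bin \
  --vocab_path=/path/to/finished_files/vocab \
  --log_root=/path/to/log \
  --exp_name=pretrained_model
```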
No. Is there anything in particular there you think I should be aware of? I never needed to summarize multiple texts at once, so I haven't looked into that use case at all.
@alkanen Nothing in particular, I just wanted to let you know the command he suggested running. One more query I have:
Hi @alkanen, when I run the command below:

```
python3 pointer-generator/run_summarization.py --mode=decode --data_path=/Users/setup/text_abstraction/cnn-dailymail/finished_files/chunked/train_* --vocab_path=/Users/setup/text_abstraction/finished_files/vocab --log_root=/Users/setup/text_abstraction/pointer-generator/pretrained_model_tf1.2.1/train --exp_name="model-238410.data-00000-of-00001" --coverage=1 --single_pass=1 --max_enc_steps=500 --max_dec_steps=200 --min_dec_steps=100
```

I get the logs below. Where has it gone wrong?
Hi @Sharathnasa, it processes your test data into the binary format.
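(For reference, the .bin format consumed by the pointer-generator repo is a sequence of serialized tf.Example protos with 'article' and 'abstract' features, each prefixed with an 8-byte length, as in make_datafiles.py. A minimal sketch of packing one tokenized article; it assumes TensorFlow is installed, and the dummy `<s> </s>` abstract is a placeholder since decoding doesn't need a reference summary:)

```bash
# A sketch: pack a single tokenized article into the repo's .bin format.
# Assumes TensorFlow is installed; the dummy abstract is a placeholder.
python -c "
import struct
from tensorflow.core.example import example_pb2
article = open('article.tokenized.txt', 'rb').read()
ex = example_pb2.Example()
ex.features.feature['article'].bytes_list.value.extend([article])
ex.features.feature['abstract'].bytes_list.value.extend([b'<s> </s>'])
s = ex.SerializeToString()
with open('test.bin', 'wb') as f:
    f.write(struct.pack('q', len(s)))
    f.write(s)
"
```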
@dondon2475848 I tried your repo with a sample .txt file under the stories folder, and the .bin files didn't get created; only the tokenized file did. I'm not sure why.
Did you put xxx.txt under the stories folder, e.g. test1.story?
@Sharathnasa I don't know if you still have this issue, but I think I figured it out. I had the same issue with the traceback. Are you running TensorFlow 1.5?