Interpreting the results #12

Open
raduk opened this issue May 18, 2017 · 23 comments

raduk commented May 18, 2017

I trained the model with the default parameters and evaluated at step 27847. (I didn't do the coverage training; I just launched training with the default parameters.)

These are the results I got:

ROUGE-1:
rouge_1_f_score: 0.6100 with confidence interval (0.6087, 0.6113)
rouge_1_recall: 0.5746 with confidence interval (0.5726, 0.5767)
rouge_1_precision: 0.6785 with confidence interval (0.6767, 0.6803)

ROUGE-2:
rouge_2_f_score: 0.4800 with confidence interval (0.4788, 0.4813)
rouge_2_recall: 0.4533 with confidence interval (0.4515, 0.4552)
rouge_2_precision: 0.5325 with confidence interval (0.5311, 0.5339)

ROUGE-l:
rouge_l_f_score: 0.5986 with confidence interval (0.5973, 0.5999)
rouge_l_recall: 0.5638 with confidence interval (0.5618, 0.5659)
rouge_l_precision: 0.6658 with confidence interval (0.6640, 0.6676)

How should I interpret these with respect to the results in the paper? Where could the problem be, given that the paper reports a ROUGE-1 of 0.3953 and the above is 0.61?


abisee commented May 18, 2017

Have you looked at the output files themselves?


raduk commented May 19, 2017

I did, and they don't seem too good. That is why I am a bit puzzled about the numbers. Are these numbers (the ones reported with the "single_pass" arg) the ones that were used in the paper? Or is there some other way to get them?

This is actually my goal: to make sure that the model I have trained reproduces the results in the paper.


abisee commented May 19, 2017

Yes, we used this code and the single_pass flag to get the results reported in the paper.

The code we've released here is a cleaned-up version of the code we used to get our results (removing a lot of unnecessary stuff and making things easier to understand). As you can see, a few bugs were introduced by the clean-up, which we've needed to fix. But the code should be essentially the same.

Could you post some examples of your outputs here? How many files do you have in your decoded and reference directories?


raduk commented May 19, 2017

I did multiple tests, on the whole test set (11490 files in both decoded and reference) and on the first 1000, and the results are similar. I checked that the decoded and reference folders contain the same number of files.
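A quick sanity check along these lines (paths are illustrative, adjust to your log directory) is how I confirm the two folders line up:

```python
# Illustrative check: decoded/ and reference/ should contain the same number of files.
import os

decoded_dir = "log/test/decode_dir/decoded"      # illustrative paths
reference_dir = "log/test/decode_dir/reference"

n_decoded = len(os.listdir(decoded_dir))
n_reference = len(os.listdir(reference_dir))
print(n_decoded, n_reference)
assert n_decoded == n_reference, "decoded and reference file counts differ"
```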

These are some samples:

reference:
marseille prosecutor says "so far no videos were used in the crash investigation" despite media reports .
journalists at bild and paris match are "very confident" the video clip is real, an editor says .
andreas lubitz had informed his lufthansa training school of an episode of severe depression, airline says.

decoded:
marseille prosecutor says the agency is not aware of any video footage from on board .
marseille prosecutor says it was not aware of any video footage from on board .
marseille prosecutor says it was not aware of any video footage from on board crash .

reference:
many girls in nima,one of accra's poorest slums, receive little or no education .
achievers ghana is a school funded by the community to give the next generation a better chance of success .
girls are being taught to code by tech entrepreneur regina agyare, who believes her students will go far .

decoded:
regina project achievers explains up to comes, project .
tech entrepreneur says the students proved quite we are teaching the girls .
[UNK] has one of the densest populations in education project .
the ghana has one of the densest populations .

reference:
thunderstorms with large hail are predicted for the midwest and the plains .
tornadoes could strike thursday night and friday .

decoded:
severe weather is perilous anytime, of national weather service .
severe weather is hit parts of indiana and kentucky. oklahoma .
severe weather is hit parts of indiana and kentucky. residents .
severe weather is perilous anytime, of illinois .

Do you think my pyrouge / ROUGE-1.5.5 install may have a problem? Is there a way to simply check that my ROUGE setup is the same as yours? The pyrouge tests pass.


abisee commented May 19, 2017

I'm not sure what's going on here. Your output seems pretty reasonable but not the kind of thing we'd expect to get such high ROUGE scores. I'd recommend the following:

Look in decode.py at the code that runs the pyrouge eval; it's fairly simple. Run the same commands to re-run the pyrouge eval on your decoded and reference directories (in my experience, this takes several minutes) and see if you get the same ROUGE numbers again. I think there may be some pyrouge options that give more verbose output (like ROUGE scores per file), which could be helpful.
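For reference, a minimal sketch of re-running the pyrouge eval on those two directories (paths are illustrative; the filename patterns match what decode.py writes, e.g. 000000_decoded.txt / 000000_reference.txt):

```python
# Minimal re-run of the pyrouge eval on an existing decoded/reference pair.
# Paths are illustrative; adjust to your decode directory.
from pyrouge import Rouge155

r = Rouge155()
r.system_dir = "log/test/decode_dir/decoded"        # model outputs
r.model_dir = "log/test/decode_dir/reference"       # gold summaries
r.system_filename_pattern = r"(\d+)_decoded.txt"    # e.g. 000000_decoded.txt
r.model_filename_pattern = "#ID#_reference.txt"     # e.g. 000000_reference.txt

output = r.convert_and_evaluate()   # runs ROUGE-1.5.5; this takes several minutes
print(output)                       # full per-metric report
scores = r.output_to_dict(output)   # e.g. scores["rouge_1_f_score"]
print(scores["rouge_1_f_score"])
```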

By the way, are you aware of this fix to the data-getting code?


raduk commented May 19, 2017

I think I figured out the problem: the pyrouge package under Python 2.7 generates the rouge_conf.html configuration file differently from pyrouge under Python 3.5. More precisely, the one in 3.5 generates <INPUT-FORMAT TYPE="SEE"> whereas 2.7 generates <INPUT-FORMAT TYPE="SPL">.

SPL makes the ROUGE script treat the HTML tags as words too (it does per-line matching), which explains the inflated ROUGE scores. SEE is the correct setting for HTML.

Now I get these scores, which seem more reasonable:

1 ROUGE-1 Average_R: 0.28443 (95%-conf.int. 0.27736 - 0.29145)
1 ROUGE-1 Average_P: 0.22913 (95%-conf.int. 0.22278 - 0.23556)
1 ROUGE-1 Average_F: 0.24776 (95%-conf.int. 0.24154 - 0.25383)

1 ROUGE-2 Average_R: 0.07891 (95%-conf.int. 0.07356 - 0.08464)
1 ROUGE-2 Average_P: 0.06283 (95%-conf.int. 0.05836 - 0.06727)
1 ROUGE-2 Average_F: 0.06816 (95%-conf.int. 0.06340 - 0.07314)

1 ROUGE-3 Average_R: 0.03494 (95%-conf.int. 0.03079 - 0.03904)
1 ROUGE-3 Average_P: 0.02780 (95%-conf.int. 0.02446 - 0.03128)
1 ROUGE-3 Average_F: 0.03016 (95%-conf.int. 0.02665 - 0.03381)

1 ROUGE-4 Average_R: 0.01884 (95%-conf.int. 0.01577 - 0.02204)
1 ROUGE-4 Average_P: 0.01507 (95%-conf.int. 0.01263 - 0.01766)
1 ROUGE-4 Average_F: 0.01630 (95%-conf.int. 0.01373 - 0.01907)

1 ROUGE-L Average_R: 0.25577 (95%-conf.int. 0.24911 - 0.26218)
1 ROUGE-L Average_P: 0.20593 (95%-conf.int. 0.20012 - 0.21176)
1 ROUGE-L Average_F: 0.22271 (95%-conf.int. 0.21701 - 0.22858)

1 ROUGE-W-1.2 Average_R: 0.12043 (95%-conf.int. 0.11713 - 0.12381)
1 ROUGE-W-1.2 Average_P: 0.15987 (95%-conf.int. 0.15547 - 0.16443)
1 ROUGE-W-1.2 Average_F: 0.13347 (95%-conf.int. 0.13015 - 0.13704)

1 ROUGE-S* Average_R: 0.07204 (95%-conf.int. 0.06810 - 0.07603)
1 ROUGE-S* Average_P: 0.04788 (95%-conf.int. 0.04503 - 0.05083)
1 ROUGE-S* Average_F: 0.05255 (95%-conf.int. 0.04972 - 0.05523)

1 ROUGE-SU* Average_R: 0.08544 (95%-conf.int. 0.08127 - 0.08949)
1 ROUGE-SU* Average_P: 0.05661 (95%-conf.int. 0.05361 - 0.05966)
1 ROUGE-SU* Average_F: 0.06245 (95%-conf.int. 0.05946 - 0.06531)

I'll try retraining with the fix you mentioned for data preprocessing.

Thanks for the help.


raduk commented May 19, 2017

By the way, using the following is quite nice for understanding / debugging pyrouge behavior:

pyrouge_evaluate_plain_text_files -s ../log/test/decode_tmp/decoded/ -sfp "(\d+)_decoded.txt" -m ../log/test/decode_tmp/reference/ -mfp "#ID#_reference.txt"


abisee commented May 20, 2017

@raduk These ROUGE scores look like what we'd expect. Looks like you figured it out!


scylla commented May 25, 2017

Seems like the model is overfitting quite early. I am training on the complete dataset with default parameters:
[screenshot: training and eval loss curves]


makcbe commented May 25, 2017

What kind of hardware are you running this on?


scylla commented May 25, 2017

a GTX 1080 Ti machine


raduk commented May 25, 2017

This is how mine is looking:

[screenshot: training and eval loss curves]


scylla commented May 25, 2017

any idea why this might be happening?


makcbe commented May 30, 2017

The model is at about 15k steps, with eval and train loss at 3.4 and 4.02, and is still running. Let's say train and eval are terminated, the checkpoints and other files in both the train and eval directories are backed up, and then eval and train are run with coverage=true followed by decode (which will update the ckpt files). If it is not decoding the summaries satisfactorily, is it possible to resume running train and eval with coverage=false after restoring the ckpt files that were backed up before coverage was turned on?


abisee commented May 30, 2017

@makcbe If I understand your question correctly: yes you should be able to restore a non-coverage model to continue training with coverage=false.


joy369 commented Jun 1, 2017

Hi, I am also trying to replicate the results shown in the paper. I trained the model without changing any parameters and evaluated at checkpoint-44550. Here are the loss and results I got:
[screenshot: training loss curve]

ROUGE-1:
rouge_1_f_score: 0.3405 with confidence interval (0.3381, 0.3427)
rouge_1_recall: 0.3506 with confidence interval (0.3480, 0.3532)
rouge_1_precision: 0.3536 with confidence interval (0.3507, 0.3564)

ROUGE-2:
rouge_2_f_score: 0.1384 with confidence interval (0.1362, 0.1405)
rouge_2_recall: 0.1421 with confidence interval (0.1398, 0.1443)
rouge_2_precision: 0.1447 with confidence interval (0.1421, 0.1471)

ROUGE-l:
rouge_l_f_score: 0.3074 with confidence interval (0.3051, 0.3097)
rouge_l_recall: 0.3164 with confidence interval (0.3138, 0.3189)
rouge_l_precision: 0.3195 with confidence interval (0.3168, 0.3222)

Some examples are shown below:

000000
REF:marseille prosecutor says so far no videos were used in the crash investigation '' despite media reports . journalists at bild and paris match are very confident '' the video clip is real , an editor says .
andreas lubitz had informed his lufthansa training school of an episode of severe depression , airline says .

DEC:french prosecutor : `` one can hear cries of ` my god ' in several languages , '' french prosecutor says .
french prosecutor says he was not aware of any video footage from on board the plane .
the video was recovered from a phone at the wreckage site .
the video was recovered from a phone at the wreckage site .

000001
REF:membership gives the icc jurisdiction over alleged crimes committed in palestinian territories since last june .
israel and the united states opposed the move , which could open the door to war crimes investigations against israelis .

DEC:the palestinian authority officially became the 123rd member of the international criminal court on wednesday .
the icc opened a preliminary examination into the situation in palestinian territories .
they also accepted its jurisdiction over alleged crimes committed `` in the occupied palestinian territory , including east jerusalem , since june 13 , 2014 ''

000002
REF:amnesty 's annual death penalty report catalogs encouraging signs , but setbacks in numbers of those sentenced to death .
organization claims that governments around the world are using the threat of terrorism to advance executions .
the number of executions worldwide has gone down by almost 22 % compared with 2013 , but death sentences up by 28 % .

DEC:amnesty international alleges in its annual report on the death penalty .
amnesty international alleges in its annual report on the death penalty .
`` it is shameful that so many states around the world are essentially playing with people 's lives , '' the report says .

000003
REF:amnesty international releases its annual review of the death penalty worldwide ; much of it makes for grim reading .
salil shetty : countries that use executions to deal with problems are on the wrong side of history .

DEC:armed soldiers were found guilty of a range of offenses linked to violent attacks in the region and jailed .
55 people were found guilty of a range of offenses linked to violent attacks in the region and jailed .

000004
REF:museum : anne frank died earlier than previously believed .
researchers re-examined archives and testimonies of survivors .
anne and older sister margot frank are believed to have died in february 1945 .

DEC:anne frank died of typhus in a nazi concentration camp at the age of 15 .
researchers re-examined archives of the red cross , the international training service and the bergen-belsen memorial , along with testimonies of survivors .
they concluded that anne and margot probably did not survive to march 1945 .
they concluded that anne and margot probably did not survive to march 1945 .

The machine I used is a GTX 1080 Ti. Since I also encountered the NaN-loss problem on another machine, I followed the solution in another issue and changed the training set to the chunked data after the checkpoint at 38432 (it took me about 5 days; I got ROUGE-2 recall 0.0882 at that checkpoint). The ROUGE scores seem good and the decoded abstracts are not too weird, I think.

I am wondering whether I did this right, because the number of iterations is much less than 230k (yet it took about twice as many days of training). By the way, is there any signal that can serve as a clue for when to start the coverage mechanism or stop training, rather than just training time?

Thanks for any help!


abisee commented Jun 6, 2017

@joy369 Congratulations, looks like you've got some fairly reasonable results! Remember that ROUGE scores are not a perfect measure of quality (see the discussion in section 7.1 of the paper) so you will need to do a lot of manual inspection of your summaries to gauge their quality. Are you aware of the attention visualizer tool? We find that it gives some useful clues to understand why the model produces what it does.

By the way, is there any signal that can serve as a clue for when to start the coverage mechanism or stop training, rather than just training time?

  1. Coverage is designed to reduce repetition. We turned it on for a short training period at the end of training. See the paper for more details and explanation.
  2. Mostly we stopped training when the loss on the validation set stopped dropping, or started to rise. You may find other strategies work better.
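As a generic illustration of the second point (this is not code from the repo; values are illustrative), an early-stopping check over logged validation losses might look like:

```python
# Generic early-stopping sketch: stop once the validation loss has not
# improved for `patience` consecutive evaluations. Values are illustrative.
def should_stop(val_losses, patience=5):
    best = float("inf")
    stale = 0
    for step, loss in enumerate(val_losses):
        if loss < best:
            best, stale = loss, 0      # new best: keep training
        else:
            stale += 1
            if stale >= patience:
                return step            # stop at this evaluation
    return None                        # keep training

print(should_stop([5.1, 4.6, 4.2, 4.0, 4.1, 4.05, 4.2, 4.3, 4.25, 4.4]))  # -> 8
```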

@JafferWilson

@joy369 I can see that you have succeeded in generating the results; congratulations. I have a request: could you please share the model or the checkpoint that gives you these results? I am trying to get similar results but failing, and the other tools I have tried have failed me too. Hence, I would like to test your model if you can share it.
Please consider my humble request.


joy369 commented Aug 5, 2017

@JafferWilson I have uploaded my checkpoint here:
https://drive.google.com/open?id=0B9ERf9Fj8Gh6bW1fVm5VaHQ3ams
I hope this may help you.

@JafferWilson

@joy369 I will surely test it, and I am grateful for the checkpoint upload. But I guess this is a GPU-based checkpoint. Is there a CPU version, or how can I convert it? That would be helpful.


v3nm commented Jul 5, 2018

@raduk I have run decode with single_pass=1; the reference and decoded files are there, but no ROUGE results.
I am using Python 3.6. Can you help with this?

@karansaxena

(quoting raduk's earlier comment on the pyrouge Python 2.7 vs 3.5 config difference and the corrected ROUGE scores)

@raduk, do you get these numbers using Python 3?
Another related question: how much time did it take to run on all 11.5k test samples?
CC: @abisee


egornevezhin commented Aug 7, 2019

(quoting raduk's earlier comment and karansaxena's questions above)

You can try finding and replacing all SPL values with SEE in rouge_conf.html. In my case (12.5k examples) it took about an hour to run.
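For example, a one-off patch along these lines (the config filename and location depend on your pyrouge setup; this is purely illustrative):

```python
# Illustrative patch: switch the ROUGE input format in the generated config
# from SPL (per-line matching) to SEE, the correct setting for these files.
conf_path = "rouge_conf.html"  # wherever your pyrouge setup writes its config

with open(conf_path) as f:
    conf = f.read()

with open(conf_path, "w") as f:
    f.write(conf.replace('INPUT-FORMAT TYPE="SPL"', 'INPUT-FORMAT TYPE="SEE"'))
```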
