Interpreting the results #12
Have you looked at the output files themselves?
I did, and they don't seem too good, which is why I'm a bit puzzled about the numbers. Are these numbers (the ones reported when decoding with the "single_pass" arg) the ones that were used in the paper? Or is there some other way to get them? This is actually my goal: to make sure that the model I have trained reproduces the results in the paper.
Yes, we used this code to get the results in the paper. The code we've released here is a cleaned-up version of the code we used to get our results (removing a lot of extra unnecessary stuff and making things easier to understand). As you can see, there've been a few bugs introduced by the clean-up that we've needed to fix, but the code should essentially be the same. Could you post some examples of your outputs here? How many files do you have in your decode output directories?
I did multiple tests on the whole test set (11490 files in both the reference and decoded directories). These are some samples:

reference: …
decoded: …

reference: …
decoded: …

reference: …
decoded: …

Do you think something could be wrong with my setup?
I'm not sure what's going on here. Your output seems pretty reasonable, but not the kind of thing we'd expect to get such high ROUGE scores. I'd recommend looking at the files that are actually passed to the ROUGE evaluation and checking that each decoded summary is matched with its reference. By the way, are you aware of this fix to the data-getting code?
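For a quick sanity check along those lines, something like this may help (a sketch; the directory names and filename patterns are assumptions based on the decode output convention, so adjust them to your setup):

```python
# Sketch: verify that every decoded summary has a matching reference file
# and vice versa. Paths and patterns are assumptions -- adjust as needed.
import os
import re

def collect_ids(dirpath, pattern):
    """Return the set of example IDs whose filenames match `pattern`."""
    ids = set()
    for name in os.listdir(dirpath):
        m = re.match(pattern, name)
        if m:
            ids.add(m.group(1))
    return ids

dec_ids = collect_ids('decode_dir/decoded', r'(\d+)_decoded\.txt')
ref_ids = collect_ids('decode_dir/reference', r'(\d+)_reference\.txt')

print(len(dec_ids), 'decoded files;', len(ref_ids), 'reference files')
print('decoded without reference:', sorted(dec_ids - ref_ids)[:5])
print('reference without decoded:', sorted(ref_ids - dec_ids)[:5])
```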
I think I figured out the problem: it was in how I was running the ROUGE evaluation, not in the model itself.
Now I get these scores, which seem more reasonable:

1 ROUGE-1 Average_R: 0.28443 (95%-conf.int. 0.27736 - 0.29145)
1 ROUGE-2 Average_R: 0.07891 (95%-conf.int. 0.07356 - 0.08464)
1 ROUGE-3 Average_R: 0.03494 (95%-conf.int. 0.03079 - 0.03904)
1 ROUGE-4 Average_R: 0.01884 (95%-conf.int. 0.01577 - 0.02204)
1 ROUGE-L Average_R: 0.25577 (95%-conf.int. 0.24911 - 0.26218)
1 ROUGE-W-1.2 Average_R: 0.12043 (95%-conf.int. 0.11713 - 0.12381)
1 ROUGE-S* Average_R: 0.07204 (95%-conf.int. 0.06810 - 0.07603)
1 ROUGE-SU* Average_R: 0.08544 (95%-conf.int. 0.08127 - 0.08949)

I'll try retraining with the fix you mentioned for data preprocessing. Thanks for the help.
By the way, using something like the following is quite nice for understanding / debugging pyrouge behavior:
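(A minimal sketch of the idea; the paths are placeholders, and the filename patterns are the ones the decode output uses by default.)

```python
# Sketch: run pyrouge verbosely on a decode output, so you can inspect
# the ROUGE config file and temporary files it generates.
import logging
from pyrouge import Rouge155

logging.basicConfig(level=logging.DEBUG)  # surface pyrouge's own log messages

r = Rouge155()
r.system_dir = 'decode_dir/decoded'        # placeholder paths
r.model_dir = 'decode_dir/reference'
r.system_filename_pattern = r'(\d+)_decoded.txt'
r.model_filename_pattern = '#ID#_reference.txt'

output = r.convert_and_evaluate()
print(output)                        # raw ROUGE-1.5.5 report
print(r.output_to_dict(output))      # same numbers parsed into a dict
```

The verbose logging shows where pyrouge writes its converted files and configuration, which is handy when the scores look implausible.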
@raduk These ROUGE scores look like what we'd expect. Looks like you figured it out! |
What kind of hardware are you running this on?
a GTX 1080 Ti machine |
any idea why this might be happening? |
The model is at about 15k steps, with eval and train losses at 3.4 and 4.02, and training is still running. Say the train and eval jobs are stopped, the checkpoints and other files in both the train and eval directories are backed up, and train and eval are re-run with coverage=true followed by decode (this will update the ckpt files). If it is not decoding the summaries satisfactorily, is it possible to resume running train and eval with coverage=false after restoring the ckpt files backed up before coverage was enabled?
@makcbe If I understand your question correctly: yes, you should be able to restore a non-coverage model and continue training it with coverage=false.
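A minimal sketch of the restore step, assuming you copied the whole train directory somewhere before enabling coverage (paths are made up):

```python
# Sketch: put the backed-up, pre-coverage checkpoint files back so that
# training can resume with coverage=false. Paths are placeholders.
import os
import shutil

train_dir = 'log/myexperiment/train'
backup_dir = 'log/myexperiment/train_backup'

shutil.rmtree(train_dir)
shutil.copytree(backup_dir, train_dir)

# TensorFlow resumes from whatever the 'checkpoint' index file points at,
# so confirm it references a pre-coverage model:
with open(os.path.join(train_dir, 'checkpoint')) as f:
    print(f.read())
```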
@joy369 Congratulations, looks like you've got some fairly reasonable results! Remember that ROUGE scores are not a perfect measure of quality (see the discussion in section 7.1 of the paper) so you will need to do a lot of manual inspection of your summaries to gauge their quality. Are you aware of the attention visualizer tool? We find that it gives some useful clues to understand why the model produces what it does.
@joy369 I can see that you have succeeded in generating good results, and I would like to congratulate you for that. I have a request: could you please share the model or checkpoint that gave you these results? I am trying to get similar results but failing, and the other tools I have tried have failed me too, so I would be glad to test your model if you can share it.
@JafferWilson I have uploaded my checkpoint above.
@joy369 I will surely test it, and I am grateful for the checkpoint upload. But I guess this is a GPU-based checkpoint. Is there a CPU version, or how can I convert it? That would be helpful.
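For what it's worth, TF1 checkpoints store variable values rather than device placements, so a GPU-trained checkpoint can usually be loaded on a CPU-only machine. A sketch, with a hypothetical checkpoint name:

```python
# Sketch: load a GPU-trained checkpoint on a CPU-only machine.
# 'model.ckpt-27847' is a hypothetical checkpoint name.
import tensorflow as tf

# clear_devices=True drops any device placements baked into the graph.
saver = tf.train.import_meta_graph('model.ckpt-27847.meta', clear_devices=True)
with tf.Session(config=tf.ConfigProto(device_count={'GPU': 0})) as sess:
    saver.restore(sess, 'model.ckpt-27847')
    # ...run inference on CPU...
```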
@raduk I have run decode with single_pass=1; the reference and decoded files are there, but there is no ROUGE result.
@raduk do you get these numbers using Python3? |
You can try finding and replacing all the SEE values with SPL in rouge_conf.html. In my case (12.5k examples) it took about an hour.
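If you want to script that edit, something like this works (a sketch; the filename is the one mentioned above and may differ on your setup, and narrowing the replacement to the TYPE attribute avoids touching unrelated text):

```python
# Sketch: switch the ROUGE input format from SEE to SPL in the generated
# config file. The path is an assumption -- use whatever your run produced.
path = 'rouge_conf.html'
with open(path) as f:
    conf = f.read()
with open(path, 'w') as f:
    f.write(conf.replace('TYPE="SEE"', 'TYPE="SPL"'))
```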
I trained the model with the default parameters and evaluated at step 27847. (I didn't do the coverage training; I just launched training with the default parameters.)
These are the results I got:
ROUGE-1:
rouge_1_f_score: 0.6100 with confidence interval (0.6087, 0.6113)
rouge_1_recall: 0.5746 with confidence interval (0.5726, 0.5767)
rouge_1_precision: 0.6785 with confidence interval (0.6767, 0.6803)
ROUGE-2:
rouge_2_f_score: 0.4800 with confidence interval (0.4788, 0.4813)
rouge_2_recall: 0.4533 with confidence interval (0.4515, 0.4552)
rouge_2_precision: 0.5325 with confidence interval (0.5311, 0.5339)
ROUGE-L:
rouge_l_f_score: 0.5986 with confidence interval (0.5973, 0.5999)
rouge_l_recall: 0.5638 with confidence interval (0.5618, 0.5659)
rouge_l_precision: 0.6658 with confidence interval (0.6640, 0.6676)
How should I interpret these with respect to the results in the paper? Where could the problem be, given that the paper reports a ROUGE-1 of 0.3953 and the score above is 0.61?