XSUM dataset differences with original #20
Hi Jonathan, I think this is not a bug. We had some extrapolation experiments (testing out-of-distribution performance) where we split the XSUM dataset differently; e.g., `xsum_news` is a dataset for the extrapolation experiment. Specifically, the training data contains {world, uk, business} news and the test data contains other news (e.g., health, tech). But nice catch! For the original XSUM dataset, I also tuned the length penalty `--length_pen`, and I think setting it to 0.8 in my case improved the performance.
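The topic-based split described above can be sketched in a few lines. Note this is an illustrative reconstruction, not the repository's actual preprocessing; the `topic` field and the example records are assumptions:

```python
# Hypothetical sketch of an xsum_news-style extrapolation split:
# train on {world, uk, business} articles, test on all other categories.
ID_TOPICS = {"world", "uk", "business"}

def split_by_topic(examples):
    """Partition examples into an in-distribution train pool and an OOD test pool."""
    train = [ex for ex in examples if ex["topic"] in ID_TOPICS]
    test = [ex for ex in examples if ex["topic"] not in ID_TOPICS]
    return train, test

examples = [
    {"id": 1, "topic": "world"},
    {"id": 2, "topic": "health"},
    {"id": 3, "topic": "business"},
    {"id": 4, "topic": "technology"},
]
train, test = split_by_topic(examples)
print([ex["id"] for ex in train])  # [1, 3]
print([ex["id"] for ex in test])   # [2, 4]
```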
FYI, here is the screenshot of the last couple of epochs' dev scores. My model name is `xsumprefixtune_y_200_act_cat_b=16-e=30_d=0.0_l=0.0_lr=5e-05_w=0.0_s=101_d=n_m=512`, from which I believe you can infer all the hyperparameter settings.
Yes, that's correct. I can confirm that I get almost the same results.
Hmm, actually the result in Table 2 comes from a prefix model trained on the original XSUM dataset (the Hugging Face one; let's denote it "xsum"), and the result in Table 3 ("within news") comes from a prefix model trained on my new split, denoted "xsum_news". You said you can replicate the test results (e.g., 20.93) using the xsum_news split? I find this quite surprising, because "xsum_news" is supposed to be a harder data split than "xsum".
Yes, I got really high scores for the OOD split ("xsum_news", i.e., the Table 3 "within-news" results). Note that on the aforementioned split, fully fine-tuned BART-large is also much higher for me, and slightly better than prefix-tuning.
Hmm, this is interesting. Could you share your scripts and/or hyperparameters? Also, just to confirm: you are evaluating on the test set, not the dev set, right?
I was trying to replicate your validation set results above. I will try the test set now.
Ohh, well this makes sense: the validation set is ID (in-distribution).
Oh, I see. It's still weird, though: I get the same val results as you above. I think you kind of threw a curveball at me, since I had assumed the validation scores posted above were for the other split. I can double-check by running my test scripts on both datasets.
Yeah, sorry for the confusion. The CodaLab result was on xsum_news (I didn't realize that until you pointed it out, actually; many thanks for that! I was running more OOD experiments later, which changed my defaults...), but the screenshot I pasted above is on xsum_original.
@XiangLi1999 Hi, still the same issue. I used the hyperparameters you mentioned here:

```
python train_bart.py --mode xsum --preseqlen 200 --do_train yes --fp16 yes --bsz 16 --epoch 30 --gradient_accumulation_step 1 --learning_rate 0.00005 --mid_dim 512 --n_gpu 1
```

with output directory `xsum_models/xsumprefixtune_y_200_act_cat_b=16-e=30_d=0.0_l=0.0_lr=5e-05_w=0.0_s=101_d=n_m=512`, i.e.:

```
python finetune.py --model_name_or_path facebook/bart-large --output_dir xsum_models/xsumprefixtune_y_200_act_cat_b=16-e=30_d=0.0_l=0.0_lr=5e-05_w=0.0_s=101_d=n_m=512 --data_dir xsum --tuning_mode prefixtune --preseqlen 200 --do_train --label_smoothing 0.0 --use_deep no --gpus 1 --learning_rate 5e-05 --train_batch_size 16 --eval_batch_size 16 --num_train_epochs 30 --optim_prefix yes --preseqlen 200 --prefix_mode activation --format_mode cat --gradient_accumulation_steps 1 --learning_rate 5e-05 --weight_decay 0.0 --seed 101 --mid_dim 512 --use_dropout no --prefix_dropout 0.0 --max_source_length 512 --max_target_length 60 --val_max_target_length 60 --test_max_target_length 100 --fp16 --fp16_opt_level O1
```

Still, I got 1-2 ROUGE points lower:

```json
{
  "val_avg_loss": 1.6198463439941406,
  "val_avg_rouge1": 42.233881249999996,
  "val_avg_rouge2": 19.36146875,
  "val_avg_rougeL": 34.15171875,
  "val_avg_gen_time": 0.15285185538232327,
  "val_avg_gen_len": 35.65625,
  "step_count": 27
},
{
  "val_avg_loss": 1.6255970001220703,
  "val_avg_rouge1": 42.20043125,
  "val_avg_rouge2": 19.1729125,
  "val_avg_rougeL": 34.023271875,
  "val_avg_gen_time": 0.14824258256703615,
  "val_avg_gen_len": 35.90625,
  "step_count": 28
},
{
  "val_avg_loss": 1.6247661113739014,
  "val_avg_rouge1": 42.71155,
  "val_avg_rouge2": 19.746046874999998,
  "val_avg_rougeL": 34.4363625,
  "val_avg_gen_time": 0.14882883243262768,
  "val_avg_gen_len": 35.6875,
  "step_count": 29
},
{
  "val_avg_loss": 1.6258924007415771,
  "val_avg_rouge1": 42.674909375,
  "val_avg_rouge2": 19.538184375,
  "val_avg_rougeL": 34.334109375,
  "val_avg_gen_time": 0.14808641048148274,
  "val_avg_gen_len": 35.28125,
  "step_count": 30
},
{
  "val_avg_loss": 1.6260249614715576,
  "val_avg_rouge1": 42.778821875000006,
  "val_avg_rouge2": 19.570328125,
  "val_avg_rougeL": 34.363040625000004,
  "val_avg_gen_time": 0.1509597236290574,
  "val_avg_gen_len": 36.25,
  "step_count": 31
}
```

Any suggestions?
I am a bit confused by the above sentence. The screenshot I posted is on the xsum_original val set, so you are around 0.5 off (where does the "1-2 ROUGE points lower" come from?). I never tried this code in DDP settings (due to resource constraints, sadly). I am not 100% sure, but I guess for DDP you don't need to scale the lr down; I assume DDP would handle this automatically, so the lr should still be 0.00005.
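For context, whether to rescale the learning rate under DDP usually comes down to the effective batch size. A minimal sketch of the common linear-scaling heuristic (the heuristic and the numbers are illustrative assumptions, not something this repository implements; the comment above suggests keeping the lr unchanged):

```python
# Sketch of the linear learning-rate scaling heuristic sometimes used with DDP.
# Whether to apply it at all is a judgment call; the values below are illustrative.
def effective_batch_size(per_device_bsz, n_gpus, grad_accum_steps):
    """Total examples contributing to one optimizer step."""
    return per_device_bsz * n_gpus * grad_accum_steps

def linearly_scaled_lr(base_lr, base_bsz, new_bsz):
    """Scale lr proportionally to the growth in effective batch size."""
    return base_lr * new_bsz / base_bsz

base = effective_batch_size(16, 1, 1)  # single-GPU run: 16
ddp = effective_batch_size(16, 8, 1)   # 8-GPU DDP run: 128
print(ddp // base)                     # 8x larger effective batch
print(linearly_scaled_lr(5e-05, base, ddp))  # 0.0004 under linear scaling
```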
Sorry, I notice that my ... I've tried to keep ...
I think if you get dev performance that matches the screenshot, you would get test performance matching Table 2. These two should be correlated, but not exactly the same: one is on dev and one is on test. I used a length penalty of 0.8 plus a slightly different ROUGE evaluation to get deterministic scores. The evaluation script just makes the results deterministic; it doesn't change the numbers much.
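As an aside, the ROUGE metric itself is deterministic once tokenization and aggregation are fixed; nondeterminism usually comes from bootstrap resampling in some wrapper scripts. A minimal pure-Python ROUGE-1 F1 sketch (whitespace tokenization is an assumption here; the repository's actual evaluation script may tokenize and stem differently):

```python
from collections import Counter

def rouge1_f1(reference, prediction):
    """ROUGE-1 F1: clipped unigram overlap between reference and prediction."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    overlap = sum((ref_counts & pred_counts).values())  # per-token clipped counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))  # 0.8333
```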
Yes, prefixlen=200 means Prefix(2%). Yes, adding `--length_pen 0.8` at decode time would probably improve the performance. I think the effective batch size will be 8 times bigger, but I don't have a good intuition for how that relates to the learning rate. Intuitively, I'd guess keeping the old learning rate, or slightly increasing it (to 0.00008), would still be fine, so I am not sure why it crashes for you...
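To illustrate what the length penalty does during beam search, here is a sketch using GNMT-style length normalization, where a hypothesis is ranked by `sum_log_prob / length**alpha`. The scoring form and the numbers are illustrative assumptions; beam-scoring details vary by library:

```python
def normalized_score(sum_log_prob, length, alpha):
    """GNMT-style length normalization: divide the summed log-prob by length**alpha.
    With negative log-probs, a smaller alpha shrinks the denominator more for long
    hypotheses, making them score worse and nudging beam search toward shorter output."""
    return sum_log_prob / (length ** alpha)

short = (-3.7, 10)    # (sum of token log-probs, length) - illustrative numbers
long_ = (-10.0, 30)

for alpha in (1.0, 0.8):
    s_short = normalized_score(*short, alpha)
    s_long = normalized_score(*long_, alpha)
    winner = "short" if s_short > s_long else "long"
    print(f"alpha={alpha}: short={s_short:.3f} long={s_long:.3f} -> {winner}")
```

With these numbers, the longer hypothesis wins at alpha=1.0 but the shorter one wins at alpha=0.8, which matches the intuition that a length penalty of 0.8 favors the short summaries typical of XSUM.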
I see. Thanks for your reply! I'm trying your suggestion about the learning rate.
Hello,

You shared the `xsum` dataset link here in #2. However, I see from the CodaLab link https://worksheets.codalab.org/bundles/0x58f85171b43f4e61bf411c35faab369d and from the hyperparameters/data directory in https://worksheets.codalab.org/bundles/0xa3f0cd3c10c7490ab508a351968cbdcf that you have used the `xsum_news` data. When I checked `xsum_news`, I found that the validation file has 7,186 examples, whereas the original dataset has 11,327. The test set is also different, with 11,333 examples in `xsum_news` vs. 20,418 in the original xsum.

I was wondering if you could explain the differences in eval/test dataset sizes compared to the original, and perhaps provide your script for preprocessing the original xsum.

Thanks!