XSUM dataset differences with original #20

Closed
jpilaul opened this issue Oct 13, 2021 · 16 comments

@jpilaul

jpilaul commented Oct 13, 2021

Hello,
You shared the XSUM dataset link here: #2

However, I see from the CodaLab bundle https://worksheets.codalab.org/bundles/0x58f85171b43f4e61bf411c35faab369d
and from the hyperparameters/data directory in https://worksheets.codalab.org/bundles/0xa3f0cd3c10c7490ab508a351968cbdcf that you used the xsum_news data. When I checked xsum_news, I found that the validation file has 7,186 examples, whereas the original dataset has 11,327. The test set is also different, with 11,333 examples in xsum_news vs. 20,418 in the original XSUM.

I was wondering if you could explain the differences in the eval/test dataset sizes compared to the original, and perhaps provide your script for preprocessing the original XSUM.
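
For reference, this is roughly how I am counting the original splits (a minimal sketch assuming the Hugging Face datasets library):

from datasets import load_dataset

# Load the original XSUM release from the Hugging Face hub and print the split sizes.
xsum = load_dataset("xsum")
for split in ("train", "validation", "test"):
    print(split, len(xsum[split]))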

Thanks!

jpilaul closed this as completed Oct 26, 2021
jpilaul reopened this Nov 4, 2021
@jpilaul
Author

jpilaul commented Nov 4, 2021

Still can't reconcile the data differences. If I train PrefixTuning with the original dataset (from Hugging Face), my results are 3 percentage points lower than stated in the paper. However, if I use the data in the xsum_news directory from the CodaLab bundle, they are 1 percentage point lower. I have added a screenshot of the dataset sizes here:
[Screenshot: dataset split sizes, 2021-11-04]

@XiangLi1999
Owner

Hi Jonathan, I think this is not a bug. We ran some extrapolation experiments (testing out-of-distribution performance) where we split the XSUM dataset differently; xsum_news is the dataset for one of those extrapolation experiments. Specifically, the training data contains {world, uk, business} news and the test data contains other news categories (e.g., health, tech). But nice catch!
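
(Not the actual preprocessing script, just a rough sketch of the idea; it assumes each example carries a BBC topic label, e.g. parsed from the article URL, which is an assumption here:)

# Hypothetical illustration of the xsum_news split, NOT the real preprocessing code.
# `examples` is assumed to be a list of dicts with a "topic" field derived from the BBC URL.
TRAIN_TOPICS = {"world", "uk", "business"}

def split_by_topic(examples):
    train, test = [], []
    for ex in examples:
        (train if ex["topic"] in TRAIN_TOPICS else test).append(ex)
    return train, test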

For the original XSUM dataset, I also tuned the length penalty --length_pen, and setting it to 0.8 improved the performance in my case.
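
In case it helps, a decoding call with that penalty would look roughly like this (a sketch using the standard Hugging Face generate API on plain BART-large, not the prefix-tuned checkpoint; beam size and lengths are placeholders):

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

inputs = tokenizer("Some BBC article text ...", return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(
    **inputs,
    num_beams=6,          # beam size chosen only for illustration
    length_penalty=0.8,   # the --length_pen value discussed above
    max_length=100,       # matches the --test_max_target_length used in this thread
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))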

@XiangLi1999
Owner

XiangLi1999 commented Nov 5, 2021

FYI, here is a screenshot of the dev scores for the last couple of epochs. My model name is xsumprefixtune_y_200_act_cat_b=16-e=30_d=0.0_l=0.0_lr=5e-05_w=0.0_s=101_d=n_m=512, from which I believe you can infer all the hyperparameter settings.

python finetune.py --model_name_or_path facebook/bart-large --output_dir xsum_models/xsumprefixtune_y_200_act_cat_b=16-e=30_d=0.0_l=0.0_lr=5e-05_w=0.0_s=101_d=n_m=512 --data_dir xsum --tuning_mode prefixtune --preseqlen 200 --do_train --label_smoothing 0.0 --use_deep no --gpus 1 --learning_rate 5e-05 --train_batch_size 16 --eval_batch_size 16 --num_train_epochs 30 --optim_prefix yes --preseqlen 200 --prefix_mode activation --format_mode cat --gradient_accumulation_steps 1 --learning_rate 5e-05 --weight_decay 0.0 --seed 101 --mid_dim 512 --use_dropout no --prefix_dropout 0.0 --max_source_length 512 --max_target_length 60 --val_max_target_length 60 --test_max_target_length 100 --fp16 --fp16_opt_level O1

[Screenshot: dev ROUGE scores for the last few epochs]

@jpilaul
Author

jpilaul commented Nov 5, 2021

Yes, that's correct. I can confirm that I get almost the same results with a length_pen of 0.8 and with your altered split of XSUM. However, on the original XSUM, prefix tuning performs 2-3 ROUGE points lower. Just a note: in Table 2 of your paper, you compare your model trained on your altered split of XSUM against a fully fine-tuned BART-large on the original XSUM...

@XiangLi1999
Owner

Emmm, actually the result in Table 2 comes from the prefix model trained on the original XSUM dataset (the Hugging Face one; let's denote it "xsum"), and the "within news" result in Table 3 comes from the prefix model trained on my new split, denoted "xsum_news". You said you can replicate the test results (e.g., 20.93) using the xsum_news split? I find this quite surprising, because "xsum_news" is supposed to be a harder data split than "xsum".

@jpilaul
Author

jpilaul commented Nov 17, 2021

Yes, I got really high scores on the OOD split ("xsum_news", i.e., the Table 3 "within-news" results). Note that on that split, fully fine-tuned BART-large is also much higher for me, and slightly better than prefix tuning.

@XiangLi1999
Owner

XiangLi1999 commented Nov 17, 2021

Hmm, this is interesting. Could you share your scripts and/or hyperparameters? Also, just to confirm: you are evaluating on the test set, not the dev set, right?

@jpilaul
Author

jpilaul commented Nov 17, 2021

I was trying to replicate your validation set results above. I will try the test set now.

@XiangLi1999
Owner

ohh, well this makes sense. the validation set is ID (in-distribution).

@jpilaul
Author

jpilaul commented Nov 17, 2021

Oh I see. It's still weird though. I get the same val results as you above on xsum_news, but I get 2-3 percentage points lower on xsum_original on val. Still, I think that if I just run everything on the test set, I will get scores similar to those in your paper. On xsum_original, my val set scores were always lower than my test set scores.

I think you kind of threw a curveball at me, since the validation scores posted above are meant to be for xsum_news, while I was originally asking about xsum_original :)

I can double check by running my test scripts on both datasets.

@XiangLi1999
Owner

XiangLi1999 commented Nov 17, 2021

Yeah, sorry for the confusion. The CodaLab result was on xsum_news (I didn't realize that until you pointed it out, actually; many thanks for that!!! I was running more OOD experiments later, which changed my default...), but the screenshot I pasted above is on xsum_original.

@zhaone

zhaone commented Nov 19, 2021

@XiangLi1999 Hi, still the same issue. I used the hyperparameters you mentioned here:

python train_bart.py --mode xsum --preseqlen 200 --do_train yes --fp16 yes --bsz 16 --epoch 30  --gradient_accumulation_step 1 --learning_rate 0.00005 --mid_dim 512 --n_gpu 1
xsum_models/xsumprefixtune_y_200_act_cat_b=16-e=30_d=0.0_l=0.0_lr=5e-05_w=0.0_s=101_d=n_m=512
python finetune.py --model_name_or_path facebook/bart-large --output_dir xsum_models/xsumprefixtune_y_200_act_cat_b=16-e=30_d=0.0_l=0.0_lr=5e-05_w=0.0_s=101_d=n_m=512 --data_dir xsum --tuning_mode prefixtune --preseqlen 200 --do_train --label_smoothing 0.0 --use_deep no --gpus 1 --learning_rate 5e-05 --train_batch_size 16 --eval_batch_size 16 --num_train_epochs 30 --optim_prefix yes --preseqlen 200 --prefix_mode activation --format_mode cat --gradient_accumulation_steps 1 --learning_rate 5e-05 --weight_decay 0.0 --seed 101 --mid_dim 512 --use_dropout no --prefix_dropout 0.0  --max_source_length 512 --max_target_length 60 --val_max_target_length 60 --test_max_target_length 100  --fp16 --fp16_opt_level O1

Still, I got 1-2 ROUGE points lower on the original XSUM val set. However, my val loss almost matches the screenshot you posted. Here is my metrics.json:

{
    "val_avg_loss": 1.6198463439941406,
    "val_avg_rouge1": 42.233881249999996,
    "val_avg_rouge2": 19.36146875,
    "val_avg_rougeL": 34.15171875,
    "val_avg_gen_time": 0.15285185538232327,
    "val_avg_gen_len": 35.65625,
    "step_count": 27
},
{
    "val_avg_loss": 1.6255970001220703,
    "val_avg_rouge1": 42.20043125,
    "val_avg_rouge2": 19.1729125,
    "val_avg_rougeL": 34.023271875,
    "val_avg_gen_time": 0.14824258256703615,
    "val_avg_gen_len": 35.90625,
    "step_count": 28
},
{
    "val_avg_loss": 1.6247661113739014,
    "val_avg_rouge1": 42.71155,
    "val_avg_rouge2": 19.746046874999998,
    "val_avg_rougeL": 34.4363625,
    "val_avg_gen_time": 0.14882883243262768,
    "val_avg_gen_len": 35.6875,
    "step_count": 29
},
{
    "val_avg_loss": 1.6258924007415771,
    "val_avg_rouge1": 42.674909375,
    "val_avg_rouge2": 19.538184375,
    "val_avg_rougeL": 34.334109375,
    "val_avg_gen_time": 0.14808641048148274,
    "val_avg_gen_len": 35.28125,
    "step_count": 30
},
{
    "val_avg_loss": 1.6260249614715576,
    "val_avg_rouge1": 42.778821875000006,
    "val_avg_rouge2": 19.570328125,
    "val_avg_rougeL": 34.363040625000004,
    "val_avg_gen_time": 0.1509597236290574,
    "val_avg_gen_len": 36.25,
    "step_count": 31
}

Any suggestions?
Also, have you ever tried training this model on multiple GPUs? With multiple GPUs, the performance is 2-3 points lower than what you posted (using lr=0.00014). Can you give a set of hyperparameters suitable for DDP?

@XiangLi1999
Owner

> Still, I got 1-2 ROUGE points lower on the original XSUM val set. However, my val loss almost matches the screenshot you posted.

I am a bit confused by the above sentence. The screenshot I posted is on the xsum_original val set, so you are around 0.5 off. (Where does the "1-2 ROUGE points lower" come from?)

I never tried this code in DDP settings (due to resource constraints, sadly). I am not 100% sure, but I guess for DDP you don't need to scale the lr down; I assume DDP would handle this automatically, so the lr should still be 0.00005.

@zhaone

zhaone commented Nov 19, 2021

Sorry, the "1-2 ROUGE points lower" is compared to Table 2 of the paper, PREFIX(2%); the gaps are 1, 1.3, and 1.6 respectively, though I don't know whether these two experiments are comparable. Yes, compared to the screenshot you posted, the gap is about 0.5. So I guess the performance you report in Table 2 is on the test set, and prefixlen=200 means PREFIX(2%)?

I notice that my gen_len is longer than the one in your post by about 1; should I add --length_pen 0.8 to improve the performance?

I've tried keeping lr=0.00005 but cannot replicate the performance on multiple GPUs. As I understand it, if I use 8 GPUs the effective batch size is 8 times the original, so the lr should be increased? I also don't know whether DDP does this automatically. Anyway, I'll try some other hyperparameters, and if I find a good set I'll post it.
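
For what it's worth, the usual linear-scaling heuristic (just an assumption on my side, not something verified for this code) would give:

# Back-of-the-envelope linear lr scaling for DDP (a heuristic, not a verified recipe).
per_gpu_bsz = 16
grad_accum = 1
n_gpus = 8

single_gpu_bsz = per_gpu_bsz * grad_accum             # 16, as in the single-GPU run
effective_bsz = single_gpu_bsz * n_gpus               # 128 with 8 GPUs under DDP
base_lr = 5e-5
scaled_lr = base_lr * effective_bsz / single_gpu_bsz  # 4e-4 under linear scaling
print(effective_bsz, scaled_lr)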

@XiangLi1999
Owner

I think if you get dev performance that matches the screenshot, you would get test performance matching Table 2. The two should be correlated but not exactly the same, since one is on dev and one is on test. I used a length penalty of 0.8 plus a slightly different ROUGE evaluation to get deterministic scores. The evaluation script just makes the results deterministic; it doesn't change the numbers much:

from collections import defaultdict

import numpy as np
from rouge_score import rouge_scorer, scoring

# Matches the keys reported in the metrics.json above.
ROUGE_KEYS = ["rouge1", "rouge2", "rougeL"]

def calculate_rouge(output_lns, reference_lns, use_stemmer=True):
    scorer = rouge_scorer.RougeScorer(ROUGE_KEYS, use_stemmer=use_stemmer)
    aggregator = scoring.BootstrapAggregator()
    my_deterministic_avg = defaultdict(list)

    for reference_ln, output_ln in zip(reference_lns, output_lns):
        scores = scorer.score(reference_ln, output_ln)
        # Collect per-example F-measures for a deterministic average,
        # alongside the bootstrap aggregation.
        for key, val in scores.items():
            my_deterministic_avg[key].append(val.fmeasure)
        aggregator.add_scores(scores)

    result = aggregator.aggregate()
    print(result)
    print('my deterministic avg:')
    for k, v in my_deterministic_avg.items():
        v = np.array(v)
        print('{}={}'.format(k, round(v.mean() * 100, 3)))
    return {k: round(v.mid.fmeasure * 100, 4) for k, v in result.items()}
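
For reference, it is called on parallel lists of generated and reference summaries, e.g. (made-up strings, purely for illustration):

preds = ["the cat sat on the mat .", "a man walked into a bar ."]
refs = ["the cat sat on a mat .", "a man walks into a bar ."]
print(calculate_rouge(preds, refs))  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ...}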

Yes, prefixlen=200 means Prefix(2%).

Yes, adding --length_pen 0.8 at decoding time would probably improve the performance.

I think the effective batch size will be 8 times bigger, but I don't have a good intuition for how that relates to the learning rate. Intuitively, I guess keeping the old learning rate, or slightly increasing it (e.g., 0.00008), would still be fine, so I am not sure why the performance drops for you...

@zhaone

zhaone commented Nov 19, 2021

I see. Thanks for your reply!!! I'm trying your suggestion about the learning rate.
