T0 (p=1) replicability #35

Open
tuhinjubcse opened this issue Jul 16, 2022 · 1 comment

Comments


tuhinjubcse commented Jul 16, 2022

Hi @VictorSanh,

Thanks for releasing the code and data. I am trying to retrain T0 in PyTorch. Some questions: in your paper you report both p=1 and p=5.7 results.

For p=1, I take it that you use one random prompt per example of a dataset. That part is perfectly clear.

I have some doubts about the following:

1) Sampling strategy: "proportional to the number of examples in each dataset (we treated any dataset with over 500'000 examples as having 500'000/num_templates examples)".
Does this mean that for big datasets like gigaword you include 422661 examples instead of 3803957?
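For concreteness, here is a minimal sketch of the arithmetic I have in mind. The 9 gigaword templates is my assumption (inferred from the numbers above), and the two readings of the rule are exactly what I'm unsure about:

```python
# Minimal sketch of the gigaword arithmetic (my interpretation, not confirmed).
GIGAWORD_TRAIN = 3_803_957
NUM_TEMPLATES = 9      # assumption: number of promptsource templates for gigaword
CAP = 500_000

# Reading A (what I currently do): divide the raw dataset size by the number of templates.
reading_a = GIGAWORD_TRAIN // NUM_TEMPLATES   # -> 422661

# Reading B (literal wording of the rule): treat the dataset as having CAP / num_templates examples.
reading_b = CAP // NUM_TEMPLATES              # -> 55555

print(reading_a, reading_b)
```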



2) On the Hugging Face T0 model card it says "Fine-tuning steps: 12'200", but your script says
export TRAIN_STEPS=1112200. Any idea how many epochs you trained for?



3) Can you tell me the total number of samples included for p=1, given the tasks ['commonsense_qa', 'dream', 'quail', 'quartz', 'social_i_qa', 'wiqa', 'cosmos_qa', 'qasc', 'quarel', 'sciq', 'wiki_hop', 'adversarial_qa_dbert', 'adversarial_qa_dbidaf', 'adversarial_qa_droberta', 'quoref', 'duorc_ParaphraseRC', 'duorc_SelfRC', 'ropes', 'wiki_qa', 'common_gen', 'wiki_bio', 'app_reviews', 'amazon_polarity', 'imdb', 'rotten_tomatoes', 'gigaword', 'cnn_dailymail', 'multi_news', 'samsum', 'xsum', 'ag_news', 'dbpedia_14', 'trec', 'paws_labeled_final', 'glue_mrpc', 'glue_qqp', 'yelp_review_full', 'kilt_tasks_hotpotqa']?

I get Num examples = 3068602, obtained by taking p=1 from the individual datasets and, for datasets bigger than 500k, dividing the number of samples by num_of_prompts. If you have the file for T0 (p=1) or (p=5.7), would you mind sharing it?



4) Example grouping: "We use packing to combine multiple training examples into a single sequence to reach the maximum sequence length." I'm not sure what this means. Is it necessary, and how can we do it?
@VictorSanh
Member

thanks for your patience @tuhinjubcse

  1. Sampling strategy: proportional to the number of examples in each dataset (we treated any dataset with over 500'000 examples as having 500'000/num_templates examples) -
    Does this mean for big datasets like gigaword you include 422661 examples instead of 3803957

I'll let @awebson confirm!

  2. On huggingface T0 it says Fine-tuning steps: 12'200 but in your script says
    export TRAIN_STEPS=1112200. Any idea how many epochs you trained ?

Yeah, we trained for 12'200 steps (I don't think we ever reached even one epoch). 1'112'200 comes from 1'000'000 T5 pretraining steps + 100'000 LM adaptation steps to obtain T5-LM + 12'200 steps of multitask fine-tuning.
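Spelled out (just the arithmetic from the sentence above):

```python
# Breakdown of the TRAIN_STEPS value in the script.
t5_pretraining_steps = 1_000_000
lm_adaptation_steps = 100_000        # extra LM steps to obtain T5-LM
multitask_finetuning_steps = 12_200  # the T0 fine-tuning itself
assert t5_pretraining_steps + lm_adaptation_steps + multitask_finetuning_steps == 1_112_200
```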

  3. Can you tell the total number of samples included for p=1 given tasks ['commonsense_qa', 'dream', 'quail', 'quartz', 'social_i_qa', 'wiqa', 'cosmos_qa', 'qasc', 'quarel', 'sciq', 'wiki_hop', 'adversarial_qa_dbert', 'adversarial_qa_dbidaf', 'adversarial_qa_droberta', 'quoref', 'duorc_ParaphraseRC', 'duorc_SelfRC', 'ropes', 'wiki_qa', 'common_gen', 'wiki_bio', 'app_reviews', 'amazon_polarity', 'imdb', 'rotten_tomatoes', 'gigaword', 'cnn_dailymail', 'multi_news', 'samsum', 'xsum', 'ag_news', 'dbpedia_14', 'trec', 'paws_labeled_final', 'glue_mrpc', 'glue_qqp', 'yelp_review_full', 'kilt_tasks_hotpotqa']
    I have Num examples = 3068602 , which was done by taking p=1 from individual datasets , for datasets bigger than 500k dividing num of samples by num_of_prompts. If you have the file for T0 ( p=1 ) or (p=5.7) do you mind sharing them

The mixtures t0_train_one_og_prompt and t0_train_all_og_prompts are what you need (see https://github.com/bigscience-workshop/t-zero/blob/master/training/README.md#data-preparation)
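In case it helps, here is a minimal sketch of how you could inspect those mixtures with seqio and get per-task train counts. The module that registers the T0 tasks is an assumption on my part, so follow the README above for the actual setup:

```python
# Hedged sketch: print per-task train counts for the p=1 mixture via seqio.
# Assumes the T0 tasks/mixtures are registered on import; the module name
# below is an assumption -- see the training README linked above.
import seqio
import promptsource.seqio_tasks  # noqa: F401  (assumed registration module)

mixture = seqio.get_mixture_or_task("t0_train_one_og_prompt")

total = 0
for task in mixture.tasks:
    n = task.num_input_examples("train")  # may be None if counts aren't cached
    print(f"{task.name}: {n}")
    total += n or 0
print("total train examples:", total)
```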

  4. Example grouping: "We use packing to combine multiple training examples into a single sequence to reach the maximum sequence length." I'm not sure what this means. Is it necessary, and how can we do it?

Since in tf the shapes are fixed (and not dynamic), we need to make sure to reduce padding as much as possible to make the best use of the compute. Packing means concatenating multiple inputs on the encoder side, and predicting the concatenation of the targets. Code: https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/dataset.py#L64
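If you want to replicate it outside of that mesh_tensorflow code, here is a rough, framework-agnostic sketch of the idea (my own illustration, not the exact algorithm used for T0; the 1024/256 length limits are just example values):

```python
# Rough illustration of packing: greedily concatenate tokenized examples until
# the encoder/decoder sequences would exceed the maximum lengths. A real
# implementation also tracks segment IDs / positions so attention does not
# cross example boundaries.
def pack_examples(examples, max_input_len=1024, max_target_len=256):
    """examples: iterable of (input_ids, target_ids) pairs of token-id lists."""
    packed, cur_in, cur_tgt = [], [], []
    for input_ids, target_ids in examples:
        # Start a new packed sequence if this example would overflow either side.
        if cur_in and (len(cur_in) + len(input_ids) > max_input_len
                       or len(cur_tgt) + len(target_ids) > max_target_len):
            packed.append((cur_in, cur_tgt))
            cur_in, cur_tgt = [], []
        cur_in = cur_in + input_ids
        cur_tgt = cur_tgt + target_ids
    if cur_in:
        packed.append((cur_in, cur_tgt))
    return packed

# pack_examples([([1, 2, 3], [4]), ([5, 6], [7, 8])])
# -> [([1, 2, 3, 5, 6], [4, 7, 8])]
```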
