Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace

This paper investigates how the underlying abilities of Large Language Models (LLMs), such as creative writing, code generation, and logical reasoning, develop at different paces during instruction tuning. We systematically study the effects of data volume, parameter size (7b-33b), and data construction methods on the growth of each ability.

  • The codebase and commands are provided to reproduce our experimental results.
  • The human-curated dataset, DoIT, for training and evaluation can be found here.
  • We further validate the efficacy of our data construction on other foundation models such as Baichuan2. The deployable model checkpoints can be found in this repo. 😁Have Fun!

Each ability of LLMs has its own growth pace during instruction tuning.


The code is implemented using Python 3.9 and PyTorch v2.0.1.

pip install -r requirements.txt
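After installing the requirements, it can help to confirm that the environment matches the versions stated above. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `matches` are hypothetical, and the check assumes the PyTorch distribution is installed under the package name `torch`.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches(installed, required):
    """Compare versions while ignoring local build tags such as '+cu118'."""
    if installed is None:
        return False
    return installed.split("+")[0] == required


# Versions taken from the README: Python 3.9 and PyTorch v2.0.1.
python_ok = sys.version_info[:2] == (3, 9)
torch_ok = matches(installed_version("torch"), "2.0.1")
print(f"Python 3.9: {python_ok}, PyTorch 2.0.1: {torch_ok}")
```

A mismatch does not necessarily mean the code will fail; it only flags that the environment differs from the one the experiments were run in.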


  1. Get the DoIT dataset and move its content to "data/":
git clone
  2. Get the foundation models from here. They are LLaMA series models (Touvron et al., 2023) with further pre-training in Chinese. We use the "Plus" versions, which range from 7b to 33b, in our experiments.

  3. To train models under different experimental settings:

# choices for data_type:
# ["curated-10", "curated-40", "curated-160", "curated-640", "curated-2560", "curated-10000","synthetic-10", "synthetic-40", "synthetic-160", "synthetic-640", "synthetic-2560", "synthetic-10000","synthetic-40960", "baseline", "reconstruct", "maximum", "mix-0", "mix-2560", "mix-40960"]
bash --data_type **the_setting_you_chose** --model_size 7b --model_name_or_path **path_to_foundation_model** --batch_size 8 --gradient_accumulation 1

    Training logs and model checkpoints will be saved in "/runs".
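When sweeping over several settings, it may be convenient to validate a `data_type` value before launching a run. The sketch below is an illustration only: the list is copied from the choices above, while the helper `parse_data_type` is hypothetical and not part of the codebase. It splits a setting name into its family and numeric suffix without assuming what the number means.

```python
# Setting names copied from the list of data_type choices above.
DATA_TYPES = [
    "curated-10", "curated-40", "curated-160", "curated-640",
    "curated-2560", "curated-10000",
    "synthetic-10", "synthetic-40", "synthetic-160", "synthetic-640",
    "synthetic-2560", "synthetic-10000", "synthetic-40960",
    "baseline", "reconstruct", "maximum",
    "mix-0", "mix-2560", "mix-40960",
]


def parse_data_type(name):
    """Split a setting such as 'curated-640' into (family, numeric suffix).

    Settings without a suffix, such as 'baseline', return (name, None).
    Hypothetical helper: raises ValueError for names not in DATA_TYPES.
    """
    if name not in DATA_TYPES:
        raise ValueError(f"unknown data_type: {name!r}")
    family, sep, suffix = name.partition("-")
    return family, int(suffix) if sep else None


# Example: collect the numeric suffixes available for the curated family.
curated_sizes = sorted(
    size
    for family, size in map(parse_data_type, DATA_TYPES)
    if family == "curated"
)
print(curated_sizes)
```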


Evaluate models on human-curated valid/test sets:

  1. Generate predictions on the valid/test questions.
time python -u evaluate/ --model_name_or_path **path_to_saved_checkpoint** --eval_data_path data/curated/valid #or test

    The generated answers will be saved in "evaluate/pred-data".

  2. Calculate the scores of various experimental settings on the abilities in the valid or test set:
time python -u evaluate/ --pred_data_path evaluate/pred-data/valid #or test

    The computed scores will be saved in "evaluate/results".

  3. Plot the graphs as shown in Section 4.3 of the paper.
python evaluate/ --plot_type # choices=["overall", "curated_vs_synthetic-13b", "ood", "curated_vs_synthetic-7b"]

    The plotted graphs will be saved in "evaluate/plots".

Evaluate models on two public benchmarks:

  1. Get the datasets from the official repositories of CMMLU and AGIEval and move their contents to "evaluate/cmmlu/data" and "evaluate/agieval/data", respectively.

  2. Calculate the scores in zero-shot and few-shot settings.

bash evaluate/cmmlu/ 0 MODEL_NAME_OR_PATH SAVE_TAG
bash evaluate/agieval/ 0 MODEL_NAME_OR_PATH SAVE_TAG
  - 0: The ID of the GPU to use; the default is 0.
  - MODEL_NAME_OR_PATH: The path to the model.
  - SAVE_TAG: The name of the output and log file; for example, "curated-1000_epoch10".
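The positional arguments above can be assembled programmatically when evaluating many checkpoints. This is a rough sketch under stated assumptions: `save_tag` and `benchmark_command` are hypothetical helpers, the tag format simply mirrors the "curated-1000_epoch10" example, and the script path is supplied by the caller because it is not spelled out here.

```python
def save_tag(data_type, epoch):
    """Build a tag shaped like the README example 'curated-1000_epoch10'."""
    return f"{data_type}_epoch{epoch}"


def benchmark_command(script_path, model_name_or_path, tag, gpu_id=0):
    """Return the argument list in the order: GPU id, model path, save tag.

    script_path is whatever benchmark script the caller wants to run; this
    helper only arranges the three positional arguments described above.
    """
    return ["bash", script_path, str(gpu_id), model_name_or_path, tag]


# Placeholder values for illustration; replace them with real paths.
command = benchmark_command(
    "PATH_TO_BENCHMARK_SCRIPT",
    "MODEL_NAME_OR_PATH",
    save_tag("curated-1000", 10),
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the placeholders point at real files.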


  title={Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace},
  author={Song, Chiyu and Zhou, Zhanchao and Yan, Jianhao and Fei, Yuejiao and Lan, Zhenzhong and Zhang, Yue},
  journal={arXiv preprint arXiv:2310.19651},

