Distilling Step-by-Step!

Code for the paper Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Changes in this fork

  • Add support for GCS
  • Add command-line invocation with arguments
  • Add support for hosting distilled models using Docker
  • Add support for hosting models as Vertex AI endpoints
  • Add support for hosting models as TF Serving endpoints
  • Add a Kedro pipeline for distillation
  • Add support for Vertex AI pipelines

Work in progress.

This is a fork of the distilling-step-by-step repository with the aim of creating a task-specific LLM distillation framework for healthcare. The data should be in the following format (this may change):

{
  "input": "The input here",
  "label": "The output here",
  "rationale": "The rationale generated by chain of thought"
}

The rationale may be generated using MEDprompt's self-generated CoT chain.

Place the data files at the following paths:

  • datasets/generic/generic_test.json
  • datasets/generic/generic_train.json

GCS support is a work in progress. You can use a "teacher LLM" to generate the labels and rationales; a minimal sketch of writing such a file follows below.
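A minimal sketch of preparing a training file, assuming each split is stored as a single JSON array of records in the format above; the example record and the on-disk layout are illustrative assumptions and may change along with the format:

import json
from pathlib import Path

# Hypothetical records for illustration; in practice the label and rationale
# would typically come from a teacher LLM (e.g. MEDprompt's self-generated CoT chain).
records = [
    {
        "input": "A 45-year-old presents with itchy, scaly plaques on the elbows.",
        "label": "Psoriasis",
        "rationale": "Well-demarcated scaly plaques on extensor surfaces point to psoriasis.",
    },
]

out_path = Path("datasets/generic/generic_train.json")
out_path.parent.mkdir(parents=True, exist_ok=True)

# Write the split as a JSON array; adjust if the loader expects JSON Lines instead.
with out_path.open("w") as f:
    json.dump(records, f, indent=2)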

Install

git clone https://github.com/dermatologist/distilling-step-by-step.git
cd distilling-step-by-step
pip install -e .

Command Usages

distillm

Example usage

  • Distilling step-by-step with PaLM labels and PaLM rationales:
distillm  --from_pretrained google/t5-v1_1-small \
          --alpha 0.5 \
          --batch_size 4 \
          --max_steps 100 \
          --eval_steps 2 \
          --no_log \
          --output_dir output

Argument usage

  • --from_pretrained: google/t5-v1_1-small, google/t5-v1_1-base, google/t5-v1_1-large, google/t5-v1_1-xxl
  • --dataset: esnli, anli1, cqa, svamp, generic
  • --label_type:
    • --label_type gt: Use the ground-truth (GT) label for training
    • --label_type llm: Use the LLM-predicted label for training
    • --label_type generic: Use the label provided in the generic dataset for training
  • --alpha: Task weight for multi-task training. Loss = alpha * label_prediction_loss + (1 - alpha) * rationale_generation_loss; see the sketch after this list
    • --alpha 0.5: recommended
  • --batch_size: Batch size
  • --grad_steps: Gradient accumulation steps
  • --max_input_length: Maximum input length
  • --eval_steps: Number of training steps between evaluations
  • --max_steps: Maximum steps for training
  • --run: Random seed to use
  • --model_type:
    • standard: Standard fine-tuning (--label_type gt) or distillation (--label_type llm)
    • task_prefix: Distilling step-by-step
  • --parallelize: Model parallelism
  • --output_dir: The directory for saving the distilled model
  • --gcs_project: The GCP project name
  • --gcs_path: The GCS path; _train.json and _test.json will be appended to this path
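To make the --alpha weighting concrete, here is a minimal sketch of the multi-task loss combination described above; the function name multitask_loss and the dummy loss values are illustrative assumptions, not part of the distillm package:

import torch

def multitask_loss(label_prediction_loss: torch.Tensor,
                   rationale_generation_loss: torch.Tensor,
                   alpha: float = 0.5) -> torch.Tensor:
    # Weighted sum of the two losses, as described for --alpha.
    return alpha * label_prediction_loss + (1.0 - alpha) * rationale_generation_loss

# Example with dummy per-batch losses:
loss = multitask_loss(torch.tensor(1.2), torch.tensor(0.8), alpha=0.5)
print(loss)  # -> 1.0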

Cite

If you find this repository useful, please consider citing:

@article{hsieh2023distilling,
  title={Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes},
  author={Hsieh, Cheng-Yu and Li, Chun-Liang and Yeh, Chih-Kuan and Nakhost, Hootan and Fujii, Yasuhisa and Ratner, Alexander and Krishna, Ranjay and Lee, Chen-Yu and Pfister, Tomas},
  journal={arXiv preprint arXiv:2305.02301},
  year={2023}
}

This fork

Blog posts
