This source code contains the implementation of Bitune and is sufficient to reproduce the results from the paper. Please note that the code was also used to explore different ideas, so many components have different names or refer to concepts not mentioned in the paper.
We plan to release a clean repo for Bitune in the near future.
The `lm-evaluation-harness` directory contains the repository from EleutherAI/lm-evaluation-harness, adapted to our method. You can install it with the following command:

```
pip install -e lm-evaluation-harness
```
- Set the proper absolute path to this directory in the `common_0.sh` file.
- The evaluation script requires `wandb` for logging. Update line 57 of `eval.py` with your `wandb` username (see the snippet below).
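For reference, updating the entity in a `wandb` run setup typically looks like the snippet below. The project name and the exact call in `eval.py` may differ; treat them as illustrative assumptions.

```python
import wandb

# Hypothetical example of the change needed around line 57 of eval.py:
# point the `entity` argument at your own wandb username (or team name).
wandb.init(project="bitune-eval", entity="<your-wandb-username>")
```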
- Instruction-Tuning Setup: Run the `instruct.sh` script.
- Downstream Task Training: Run the `downstream.sh` script. Make sure to set the correct number of update steps (based on the values provided in the appendix), and uncomment the appropriate lines for the dataset name, the evaluations (at the very bottom), and the method name.
- Ablations: Uncomment the lines for the selected ablation in `ablations.sh` and run the script.
- Implementation required a few modifications of the HuggingFace model classes, available in the `models` directory (see the attention-mask sketch after this list):
  - Modified KV-cache, so that it keeps the computation graph for gradients.
  - Added mixing modules with trainable coefficients (`pass_scale_k`, `pass_scale_v`).
  - Modified attention mask based on the `enforce_bidir` parameter of the `forward()` function.
  - Added a code snippet inside the `forward()` function responsible for calling the Bitune wrapper.
- The Bitune wrapper (`_pass_fn()` in the `passes.py` file; see the two-pass sketch after this list):
  - Passes the prompt through the model twice to obtain two sets of KV-cache, while setting the proper LoRA adapters & attention masks for each pass.
  - Calls the mixing modules to combine the two sets of features (`pass_scale_k`, `pass_scale_v`).
  - Does the final pass on the answer (for training), or generates the first answer token (for inference). When further tokens are generated, the Bitune wrapper is not called at all: the prompt's KV-cache is already computed and stored, so generation continues as in an unmodified model.
  - Sets all LoRA parameters as trainable again, as by default the `peft` library sets inactive adapters as non-trainable.
- The mixing module (class `PassScale` defined in `models/think_gemma.py`; see the mixing sketch after this list):
  - Contains trainable coefficients for mixing the two sets of features, separate for keys & values, so two coefficients per attention block of the model.
  - Defines a `forward()` function that applies the mixing operation based on the variant specified in the config (`config.pass_type`). Our final method is defined by the variant `607` (the one used for the experiments), and its simplified version by `801`.
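The attention-mask modification can be pictured as follows. This is an illustrative sketch, not the code from `models/`; the function name and signature are assumptions.

```python
import torch

# Sketch of the enforce_bidir idea: inside the modified forward(), the causal
# mask over the prompt is replaced by a full bidirectional mask when
# enforce_bidir=True. Names and signature below are illustrative only.
def build_prompt_attention_mask(seq_len: int, enforce_bidir: bool, device: str = "cpu") -> torch.Tensor:
    if enforce_bidir:
        # Every prompt token may attend to every other prompt token (no masking).
        return torch.zeros(seq_len, seq_len, device=device)
    # Standard causal mask: positions above the diagonal are blocked with -inf.
    mask = torch.full((seq_len, seq_len), float("-inf"), device=device)
    return torch.triu(mask, diagonal=1)
```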
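The two-pass logic of the wrapper can be sketched roughly as below. The adapter names, the `mix_kv` helper, and passing `enforce_bidir` directly through the model call are assumptions for illustration; the actual implementation lives in `passes.py`.

```python
import torch

def bitune_prompt_pass(model, prompt_ids: torch.Tensor, mix_kv):
    # Pass 1: bidirectional pass over the prompt, using its own LoRA adapter
    # and a non-causal attention mask.
    model.set_adapter("bidir")
    out_bidir = model(input_ids=prompt_ids, use_cache=True, enforce_bidir=True)

    # Pass 2: standard causal pass over the prompt with the second adapter.
    model.set_adapter("causal")
    out_causal = model(input_ids=prompt_ids, use_cache=True)

    # Combine the two KV-caches with the trainable mixing modules
    # (pass_scale_k / pass_scale_v); mix_kv stands in for that step here.
    mixed_past = mix_kv(out_bidir.past_key_values, out_causal.past_key_values)

    # Training continues with a pass on the answer tokens using this cache;
    # at inference only the first answer token is produced on top of it, and
    # later tokens reuse the stored cache without calling the wrapper again.
    return mixed_past
```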
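A simplified mixing module in the spirit of `PassScale` could look like the sketch below; the sigmoid-gated interpolation is an illustrative assumption, not the exact formula of variants `607` or `801`.

```python
import torch
import torch.nn as nn

class PassScaleSketch(nn.Module):
    # Illustrative stand-in for PassScale: one trainable mixing coefficient,
    # instantiated separately for keys (pass_scale_k) and values (pass_scale_v)
    # in every attention block of the model.
    def __init__(self, init: float = 0.0):
        super().__init__()
        self.coeff = nn.Parameter(torch.tensor(init))

    def forward(self, feats_causal: torch.Tensor, feats_bidir: torch.Tensor) -> torch.Tensor:
        # Gate the mixing weight into (0, 1) and interpolate the two feature sets.
        alpha = torch.sigmoid(self.coeff)
        return alpha * feats_bidir + (1.0 - alpha) * feats_causal
```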
The following versions of the libraries have been used:

```
transformers==4.38.2
peft==0.11.1
datasets==2.18.0
evaluate==0.4.0
```
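If you find this work useful, please cite: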
```bibtex
@misc{kopiczko2024bitune,
    title={Bitune: Bidirectional Instruction-Tuning},
    author={Dawid J. Kopiczko and Tijmen Blankevoort and Yuki M. Asano},
    year={2024},
    eprint={2405.14862},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```