It is recommended to use mamba to manage dependencies. mamba is a drop-in replacement for conda, rewritten in C++ for significantly faster dependency resolution (you can stick with conda, though). To provide reproducible environments, we use conda-lock to generate lockfiles for each platform.
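The lockfile workflow might look like the following sketch. The environment.yml path, platform list, and environment name are assumptions about this repository's setup; the block only prints the commands so you can inspect them before running anything.

```bash
# Sketch of a conda-lock workflow (commands are echoed, not executed).
# environment.yml and the platform list are assumptions; adjust to your setup.
echo "conda-lock lock -f environment.yml -p linux-64 -p linux-ppc64le"
# conda-lock writes conda-lock.yml by default; an environment can then be
# created from that lockfile:
echo "conda-lock install --name next_level conda-lock.yml"
```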
This code repository is based on the NLP Research Template; more details on setting up mamba and conda-lock can be found there.
For a fully reproducible environment and for running on HPC clusters, we provide pre-built Docker images at https://hub.docker.com/r/taczin/next_level/tags. We also provide a Dockerfile that allows you to build new Docker images with updated dependencies:

```bash
docker build --tag <username>/<imagename>:<tag> --platform=linux/<amd64/ppc64le> .
```
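Filled in with concrete values, the invocation might look like the sketch below. The username, image name, and tag are placeholders, and the command is only assembled and printed so it can be checked before an actual build.

```bash
# Placeholder values; substitute your Docker Hub username and desired tag.
USERNAME=yourname
IMAGENAME=next_level
TAG=latest
PLATFORM=linux/amd64          # or linux/ppc64le on POWER clusters

# Assemble and print the build command; drop the echo to actually build.
echo "docker build --tag ${USERNAME}/${IMAGENAME}:${TAG} --platform=${PLATFORM} ."
```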
You can activate the environment by running

```bash
bash scripts/console.sh
```

which starts a Docker container in an interactive session. Before this will work, you have to adapt the GPU devices and dataset mount paths in scripts/console.sh.
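The parts of scripts/console.sh to adapt presumably boil down to a docker run invocation along these lines. The device list, mount path, and image tag below are placeholders, and the command is only printed, not executed.

```bash
# Placeholders: adjust the GPU devices and dataset mount path to your machine.
GPUS='"device=0,1"'
DATA_DIR=/path/to/datasets

# Sketch of an interactive container start (echoed, not executed).
echo "docker run -it --gpus ${GPUS} -v ${DATA_DIR}:/data taczin/next_level:latest bash"
```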
We use Weights & Biases for experiment tracking. To enable W&B, enter your WANDB_ENTITY and WANDB_PROJECT in dlib/frameworks/wandb.py.
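For reference, W&B itself also honors entity and project settings passed as environment variables, which can be convenient on clusters. Whether this repository's wandb.py falls back to them is an assumption; it may hard-code the constants instead. The values below are placeholders.

```bash
# W&B reads these environment variables when entity/project are not set in
# code; values here are placeholders.
export WANDB_ENTITY=your-entity
export WANDB_PROJECT=next_level
echo "${WANDB_ENTITY}/${WANDB_PROJECT}"   # → your-entity/next_level
```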
To pretrain the model, the pretraining data first has to be preprocessed separately. You can run this via:

```bash
bash scripts/preprocess.sh
```

This processes the data with the all-MiniLM-L6-v2 sentence-transformers model, using a chunk size of 256 and an encoder model batch size of 4096.
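As a quick sanity check on what the chunk size means for data volume, a 10,000-token document splits into 40 chunks of 256 tokens (the document length here is just an illustrative number):

```bash
# Ceiling division: number of 256-token chunks in a 10,000-token document.
TOKENS=10000
CHUNK=256
echo $(( (TOKENS + CHUNK - 1) / CHUNK ))   # → 40
```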
To start Next-Level pretraining, run:

```bash
bash scripts/pretrain.sh
```
We cannot provide the Books3 dataset we used for pretraining; due to licensing controversy, it has been taken off most easily accessible platforms, including the Hugging Face Hub. If you do not have a copy of the dataset available, you can try substituting other book-based data, e.g., from Project Gutenberg. We have not tested data with more short-ranged dependencies, but it might also work; if you try it out, we would love to hear about your results.
First, run preprocessing for the downstream dataset you want; you can edit the dataset in the script file:

```bash
bash scripts/preprocess_downstream.py
```
Then run fine-tuning and evaluation via:

```bash
bash scripts/downstream.sh
```
Note that the QuALITY dataset needs to be downloaded first (see scripts/download_quality.sh). The BookSum dataset is used for zero-shot embedding quality evaluation and is not fine-tuned on.
We are currently working on making model checkpoints available and providing an easier way of using the model for inference. Stay tuned.