Can a character-based transformer learn word boundaries?
To get set up, create a Hugging Face account and ask @codebyzeb to add you to the group's private Hugging Face hub. The hub is where we keep data, tokenization, model and other artifacts. During training, we pull these artifacts directly from the hub (and occasionally also push to the hub programmatically).
In order to interact with the hub, you need to generate read and write access tokens from your Hugging Face account. Once generated, store these values as environment variables in a local .env file with the names HF_READ_TOKEN and HF_WRITE_TOKEN.
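As a rough illustration of how the two tokens might be read back at runtime, here is a minimal sketch using only the standard library. The helper names `parse_env_text` and `load_hf_tokens` are hypothetical and not part of this repository, and the simple KEY=VALUE layout of the .env file is an assumption; the project's own loading code may work differently.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_hf_tokens(env_path: str = ".env") -> dict[str, str]:
    """Return both hub tokens, preferring variables that are already exported.

    Hypothetical helper: the variable names come from the text above,
    everything else is an assumption.
    """
    path = Path(env_path)
    file_values = parse_env_text(path.read_text()) if path.exists() else {}
    tokens = {}
    for name in ("HF_READ_TOKEN", "HF_WRITE_TOKEN"):
        value = os.environ.get(name) or file_values.get(name)
        if not value:
            raise KeyError(f"{name} is missing from the environment and {env_path}")
        tokens[name] = value
    return tokens
```

Calling `load_hf_tokens()` from the repository root would then return a dictionary with both tokens, or raise a `KeyError` naming the one that is missing.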
You will also need to ask @rdiehlmartinez to add you to the wandb (Weights & Biases) baby-lm project. We use wandb to log metrics generated by our runs. Once you've joined the group, you will need to go to wandb to retrieve your API key. You will be prompted for this key when calling ./setup.sh (see below).
Before running the code, make sure to run the setup script ./setup.sh. This script sets up the requirements as well as git hooks for automatic code formatting. Additionally, this script makes sure you are logged into wandb and Hugging Face.
This project was developed using access to remote GPU nodes. Jobs could be queued onto these nodes using SLURM, with a cut-off of 12 hours.
In order to train for longer than 12 hours, use the --requeue-after NUM_HOURS option when calling scripts/run_project.sh. When NUM_HOURS hours have passed, the training script will automatically save the current state using wandb and exit with code 124. run_project.sh will then detect this exit code and launch a new SLURM job. It's an ugly solution but it works for our purposes.
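The requeue handshake described above can be sketched as follows. This is only an illustration of the pattern, not the project's actual training loop: the function names, the injectable `clock` argument, and the placeholder comments for the training step and checkpoint are all assumptions. Only the exit code 124 and the hour-based cutoff come from the text.

```python
import time

REQUEUE_EXIT_CODE = 124  # exit code the launcher script watches for


def time_limit_reached(start: float, requeue_after_hours: float, now: float) -> bool:
    """True once at least requeue_after_hours have elapsed since start (seconds)."""
    return (now - start) >= requeue_after_hours * 3600


def train(requeue_after_hours, num_steps: int, clock=time.monotonic) -> int:
    """Hypothetical training loop that returns the process exit code."""
    start = clock()
    for _step in range(num_steps):
        # ... one training step would run here ...
        if requeue_after_hours is not None and time_limit_reached(
            start, requeue_after_hours, clock()
        ):
            # ... the current state would be saved here before exiting ...
            return REQUEUE_EXIT_CODE
    return 0


def launcher_action(exit_code: int) -> str:
    """What a wrapper such as run_project.sh would do with the exit code."""
    return "requeue" if exit_code == REQUEUE_EXIT_CODE else "stop"
```

A script entry point would end with `sys.exit(train(...))`, so that the wrapper sees 124 only when the time limit, rather than a crash or normal completion, ended the job.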
Hyperparameter tuning was conducted using a Weights & Biases sweep. To create a sweep, run the following:
wandb sweep --project PROJECT_NAME configs/sweep.yaml
This will create a new sweep in Weights & Biases with an ID and a URL. You can then run the agent in one of three ways:
- Running an agent directly using wandb:
wandb agent USER_ID/PROJECT_NAME/SWEEP_ID
- Running the script provided:
sh scripts/run_sweep_agent USER_ID/PROJECT_NAME/SWEEP_ID
- Queue a SLURM job to run the agent on a different node:
sbatch scripts/slurm_submit_sweep_agent.wilkes3 USER_ID/PROJECT_NAME/SWEEP_ID
The first option starts a wandb agent on your machine. Agents can run on multiple machines and will sweep continuously, starting new wandb runs in succession.
The second option starts a wandb agent that will only start a single wandb run. This can also be run on multiple machines, and each agent will terminate after its run is complete. The last option queues this script on a remote machine using the SLURM job-queuing system. You will need to adjust the SLURM script to suit your needs.
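To make the difference between a continuous agent and a single-run agent concrete, here is a small sketch that builds the corresponding wandb command line. The helper `build_agent_command` is hypothetical. The sketch assumes the single-run behaviour is obtained with the wandb CLI's `--count` option; whether scripts/run_sweep_agent does it this way is not confirmed by this README.

```python
def build_agent_command(sweep_path: str, single_run: bool = False) -> list[str]:
    """Build the argv for a wandb agent.

    sweep_path is expected to look like USER_ID/PROJECT_NAME/SWEEP_ID.
    With single_run=True the agent is limited to one run via --count 1
    (an assumption about how a single-run agent could be implemented).
    """
    parts = sweep_path.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected USER_ID/PROJECT_NAME/SWEEP_ID")
    command = ["wandb", "agent"]
    if single_run:
        command += ["--count", "1"]
    command.append(sweep_path)
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where wandb is installed and logged in.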
Note that the sweep config launches the training process with a requeue cutoff of 11 hours (see Continuous Training above). If you can run your training script continuously, remove this option before creating a sweep; otherwise the script will try to launch a new SLURM job after 11 hours.