TransformerSegmentation

Can a character-based transformer learn word boundaries?

Installation

To get set up, create a Hugging Face account and ask @codebyzeb to add you to the group's private Hugging Face Hub. The hub is where we keep data, tokenization, model and other artifacts. During training, we pull these artifacts directly from the hub (and occasionally also push to the hub programmatically).

To interact with the hub, you need to generate read and write access tokens from your Hugging Face account. Once generated, store them as environment variables in a local .env file under the names HF_READ_TOKEN and HF_WRITE_TOKEN.
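
As a quick sanity check before training, the sketch below parses a .env file and reports which of the two token names are missing. It is only an illustration: the helper names parse_env_file and missing_tokens are hypothetical and not part of this repository, and the simple KEY=VALUE format it accepts is an assumption about how the file is written.

```python
from pathlib import Path

# Token names taken from the instructions above.
REQUIRED_TOKENS = ("HF_READ_TOKEN", "HF_WRITE_TOKEN")


def parse_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_tokens(values):
    """Return the required token names that are absent or empty."""
    return [name for name in REQUIRED_TOKENS if not values.get(name)]
```

Calling missing_tokens(parse_env_file(".env")) returns an empty list when both tokens are set.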

You will also need to ask @rdiehlmartinez to add you to the wandb (Weights & Biases) baby-lm project. We use wandb to log the metrics generated by our runs. Once you've joined the group, go to wandb to retrieve your API key. You will be prompted for this key when running ./setup.sh (see below).

Before running the code, make sure to run the setup script ./setup.sh. This script installs the requirements and sets up git hooks for automatic code formatting. It also makes sure you are logged into wandb and Hugging Face.

Continuous Training

This project was developed using access to remote GPU nodes. Jobs could be queued onto these nodes using SLURM, with a cut-off of 12 hours. In order to train for longer than 12 hours, use the --requeue-after NUM_HOURS option when calling scripts/run_project.sh.
When NUM_HOURS hours have passed, the training script will automatically save the current state using wandb and exit with code 124. run_project.sh will then detect this exit code and launch a new SLURM job. It's an ugly solution but it works for our purposes.
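
The requeue pattern described above can be sketched roughly as follows. This is not the code in run_project.sh or train.py: RequeueTimer and run_with_requeue are hypothetical names, and the sketch reruns the command in the same process loop rather than submitting a new SLURM job. Only the exit code 124 and the hour-based cutoff come from the description.

```python
import subprocess
import time

REQUEUE_EXIT_CODE = 124  # exit code the training script uses to request a requeue


class RequeueTimer:
    """Tracks whether the allotted number of hours has elapsed."""

    def __init__(self, hours, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + hours * 3600

    def expired(self):
        # The training loop would check this periodically, save its state,
        # and exit with REQUEUE_EXIT_CODE once it returns True.
        return self._clock() >= self._deadline


def run_with_requeue(command, max_requeues=10, runner=subprocess.call):
    """Run `command`, relaunching it each time it exits with the requeue code.

    `max_requeues` is an illustrative safety limit, not a project setting.
    Returns the final exit code.
    """
    exit_code = REQUEUE_EXIT_CODE
    for _ in range(max_requeues + 1):
        exit_code = runner(command)
        if exit_code != REQUEUE_EXIT_CODE:
            break  # finished normally or failed for another reason
    return exit_code
```

Any exit code other than 124 ends the loop, so a genuine crash is not relaunched.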

Hyperparameter Tuning

Hyperparameter tuning was conducted using a Weights & Biases sweep. To create a sweep, run the following:

wandb sweep --project PROJECT_NAME configs/sweep.yaml

This will create a new sweep in Weights & Biases with an ID and a URL. You can then run the agent in one of three ways:

Run an agent directly using wandb: wandb agent USER_ID/PROJECT_NAME/SWEEP_ID
Run the provided script: sh scripts/run_sweep_agent USER_ID/PROJECT_NAME/SWEEP_ID
Queue a SLURM job to run the agent on a different node: sbatch scripts/slurm_submit_sweep_agent.wilkes3 USER_ID/PROJECT_NAME/SWEEP_ID

The first option starts a wandb agent on your machine. Agents can run on multiple machines and will sweep continuously, starting new wandb runs in succession. The second option starts a wandb agent that launches only a single wandb run. It can also be run on multiple machines, and each agent terminates once its run is complete. The last option queues this script on a remote machine using the SLURM job-queuing system. You will need to adjust the SLURM script to suit your needs.
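
The difference between a continuous agent and a single-run agent can be illustrated with the --count option of wandb agent, which caps the number of runs an agent starts. The sketch below only builds the command line; agent_command is a hypothetical helper, and whether scripts/run_sweep_agent uses --count internally is an assumption, not something stated here.

```python
def agent_command(sweep_path, single_run=False):
    """Build a `wandb agent` command for a USER_ID/PROJECT_NAME/SWEEP_ID path."""
    if len(sweep_path.split("/")) != 3:
        raise ValueError("expected USER_ID/PROJECT_NAME/SWEEP_ID")
    command = ["wandb", "agent"]
    if single_run:
        # Stop the agent after one run instead of sweeping continuously.
        command += ["--count", "1"]
    command.append(sweep_path)
    return command
```

The resulting list can be passed to subprocess.run to start the agent.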

Note that the sweep config launches the training process with a requeue cutoff of 11 hours (see Continuous Training above). If you can run your training script continuously, remove this option before creating a sweep; otherwise, it will try to launch a new SLURM job after 11 hours.
