Code for the paper On the Effect of (Near) Duplicate Subwords in Language Modelling. The code is based on the Languini Kitchen, a codebase for training language models. For an overview of the changes made to support vocabulary (de)duplication, see this diff or check out the (de)duplication implementation.
To reproduce the plots from the paper without retraining models, you can load the relevant results via

```bash
wget https://y5d6.c15.e2-3.dev/public-bucket/results.zip
unzip results.zip
```
then install languini in a new environment

```bash
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip setuptools
pip install -e . --upgrade
```
and run the analysis notebook.
You can also train models under the (de)duplicated settings of the paper:
- Set up the environment as described above and follow the instructions for obtaining the dataset.
- Train models as described in Languini. Configure duplication settings via the config arguments (see the sketch after this list for how they interact):
  - `frac_duplicated`: fraction of the vocabulary to duplicate
  - `p_duplicate`: probability of a token being a duplicated token (given that the corresponding vocabulary item is duplicated)
  - `dedup_type`: type of deduplication to apply to the vocabulary (`"whitespace"`, `"lower"`, `"plural"`, `"all"`) or `""`/`None` for no deduplication. Use the suffix `"_50%"` to deduplicate only half of the respective near duplicates.
  - `embed_noncanonical`: whether to add an extra embedding indicating whether a token is "non-canonical"
For example, to train a model that corresponds to the $p(c) + e_\text{non-canonical}$ with $\mathbb{S}_\text{all}$ entry in Table 5, run

```bash
TRAIN_STEPS=18265
ACC_STEPS=8  # this works for a 4090 (24 GB)

torchrun --standalone languini/projects/gpt/main.py small \
    --train_batch_size 128 \
    --gradient_accumulation_steps $ACC_STEPS \
    --decay_steps $TRAIN_STEPS \
    --max_train_steps $TRAIN_STEPS \
    --frac_duplicated 0 \
    --dedup_type "all" \
    --embed_noncanonical \
    --seed 0
```
- Evaluate the model as described in Languini. You can additionally specify which deduplication mapping $\mathbb{S}$ to use for the projected perplexity $\mathrm{PPL}_\mathbb{S}$ via the `eval_dedup_type` argument. E.g., to evaluate the run above, run

```bash
RUN_PATH="path/of/your/wandb/run"  # alternatively specify checkpoint_file and config_file

./venv/bin/torchrun --standalone languini/projects/gpt/eval.py \
    --wandb_run $RUN_PATH \
    --eval_data_split test \
    --eval_dedup_type all \
    --last_n 128
```
If you specify a wandb run, the script automatically loads the checkpoint from that run and, once evaluation finishes, uploads the results to the run's summary.
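For intuition, $\mathrm{PPL}_\mathbb{S}$ scores each target token by first summing the model's probability over all tokens that $\mathbb{S}$ maps to the same canonical form. Below is a minimal sketch of that computation, assuming per-step probability distributions as plain dicts; the names and data layout are assumptions, not the repository's API.

```python
import math

def projected_perplexity(step_probs, targets, dedup_map):
    """Illustrative sketch of PPL_S: merge the probability mass of all
    (near) duplicates of each target under the mapping S, then compute
    perplexity from the merged probabilities.

    step_probs: one dict per step mapping token_id -> probability
    targets:    the target token ID at each step
    dedup_map:  token_id -> canonical token ID under S
    """
    total_nll = 0.0
    for dist, target in zip(step_probs, targets):
        canonical = dedup_map[target]
        # sum probability mass over the target's equivalence class under S
        merged = sum(p for tok, p in dist.items() if dedup_map[tok] == canonical)
        total_nll -= math.log(merged)
    return math.exp(total_nll / len(targets))
```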
You can also reproduce the GLUE experiments via e.g.
```bash
./finetune_glue.sh path/of/your/wandb/run
```
or the Word2Vec experiments via e.g.
```bash
python -m languini.projects.word2vec.main --frac_duplicated 1.0 --p_dup 0.5
```