This project focuses on training language models with high-quality data rather than large amounts of data. We use the Encyclopedia Britannica, a reliable and well-organized corpus, as our main source to achieve two goals:
- Competing in the BabyLM Challenge, which limits the dataset to 10 million words.
- Building a model that can generate encyclopedic text using the full 38-million-word dataset.
We use knowledge distillation, a technique in which smaller student models learn from larger teacher models, allowing us to build compact, fast models that still perform well. Our work, based on the BabyLlama architecture, shows that combining high-quality data with efficient training methods can produce strong results, even for specialized tasks or limited-resource settings.
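As a rough illustration of the idea (not the exact training code in this repository), a standard distillation objective combines hard-label cross-entropy with a temperature-softened KL term against the teacher's logits. The function name, temperature, and weighting below are illustrative assumptions:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Hard-label loss: standard next-token cross-entropy against the data.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    # Soft-label loss: KL divergence between temperature-softened
    # student and teacher distributions (scaled by T^2, as is customary).
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    # alpha balances learning from the data vs. learning from the teacher.
    return alpha * ce + (1 - alpha) * kl
```

With two teacher models, the soft-label term can be computed against each teacher and averaged.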
- `1_scraper.py`: downloads the articles from WikiSource, saving them as JSON files in a folder structure organized by volume and part. File names reflect the article's title.
- `2_convert_britannica.py`: takes the output of the first script and converts all JSON files to TXT, keeping only the content. The TXT files are saved in a directory structure that mirrors the original JSON file structure.
- `3_combiner.py`: takes the output of the second script and combines all individual text files into a single TXT file, with each article separated by a new line. The resulting file is named `combined_encyclopedia.txt`, which can be downloaded from Hugging Face.
- `4_cleanup.py`: takes `combined_encyclopedia.txt` and applies various regular expressions to clean and standardize the text, ensuring consistent formatting. It also adds `<s>` and `</s>` tokens between each article. The resulting file is `encyclopedia_cleaned.txt`.
- `5_splitter.py`: takes `encyclopedia_cleaned.txt` and generates a folder named `corpus_split` with two files: `train.txt` and `val.txt`. You can specify the number of words for each file, ensuring that articles are not broken up.
- `6_tokenizer.py`: trains a tokenizer on `corpus_split/train.txt`. It applies Byte Pair Encoding (BPE) with special tokens (`<pad>`, `<s>`, `</s>`) and saves the tokenizer model in the `models` directory as `tokenizer-clean.json` (see the sketch after this list).
- `7_train_teachers.py`: trains a teacher model using the tokenizer from the previous script, following a training configuration specified in a YAML file.
- `8_train_student.py`: performs knowledge distillation from the two teacher models to a smaller student model.
- `9_generator.py`: uses the student model to generate encyclopedia-style text from a given prompt (e.g., `"<s> London was"`).
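For reference, the tokenizer step can be reproduced with the Hugging Face `tokenizers` library roughly as follows. This is a minimal sketch, not the exact contents of `6_tokenizer.py`; the vocabulary size is an assumed value, while the file paths and special tokens match the pipeline above:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

# Byte-level BPE tokenizer trained on the split produced by 5_splitter.py.
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel()

# vocab_size is an assumption; the special tokens match those inserted by 4_cleanup.py.
trainer = BpeTrainer(vocab_size=16000, special_tokens=["<pad>", "<s>", "</s>"])
tokenizer.train(files=["corpus_split/train.txt"], trainer=trainer)

# Saved where the training scripts expect to find it.
tokenizer.save("models/tokenizer-clean.json")
```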
We provide a downloadable dataset on Hugging Face:
- Encyclopedia Britannica Dataset: the combined text corpus prepared by our scripts.
The dataset is sourced from the Encyclopedia Britannica available on WikiSource.
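Once `combined_encyclopedia.txt` has been downloaded from the Hugging Face page, it can be loaded, for example, with the `datasets` library's plain-text builder; the local file path below assumes the file was saved in the working directory:

```python
from datasets import load_dataset

# Load the combined corpus as a plain-text dataset, one line per example.
dataset = load_dataset("text", data_files={"train": "combined_encyclopedia.txt"})
print(dataset["train"][0]["text"])  # beginning of the first article
```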
For a detailed explanation of our work, methodology, and results, you can access:
- Our paper, which outlines the entire process and findings.
- Our presentation slides, summarizing the project's highlights.