(Image generated using Imagen3 by Google)
This repo is my experimental playground for building and training practical LLMs with relatively limited compute resources. The goal is to train real LLMs on actual datasets, starting with pre-training from scratch and eventually moving towards instruction tuning with pre-trained models.
- From Scratch: No pre-trained models, no fine-tuning, no API.
- Limited Training Time: ~ 24 hours.
- Limited Resources: 2 GPUs with 24 GB of VRAM each.
- Publicly Available Resources: Using only publicly available datasets and tools.
The dataset used for training is FineWebEdu Deduplicated from the SmolLM-Corpus. It is a de-duplicated version of FineWebEdu, filtered with a classifier to retain only high-quality educational content.
Note that the dataset is much larger than our compute budget allows, so we will probably not use all of it for training.
We provide pre-trained model checkpoints for easy experimentation and text generation. You can download these checkpoints from our Google Drive folder.
Currently available checkpoints:
- `model_gpt2_medium.pth`
To use a pre-trained checkpoint:
- Download the desired checkpoint file from the Google Drive link.
- Place it in the `checkpoints` directory of your project.
- Use it for text generation or continue training from this point.
For detailed instructions on downloading and using checkpoints, please refer to the Setup Guide.
For detailed setup instructions, please refer to docs/setup.md.
- Clone the repository:

  ```bash
  git clone https://github.com/Usman-Rafique/llm_forge.git
  cd llm_forge
  ```

- Set up the environment and install dependencies:

  ```bash
  python -m venv llm
  source llm/bin/activate  # On Windows use: llm\Scripts\activate
  pip install -r requirements.txt
  pip install -e .
  ```

- Train a model:

  ```bash
  python -m llm_forge.train gpt2_medium --config_file configs/gpt2.yaml
  ```

- Generate text using a pre-trained checkpoint:

  ```bash
  python -m llm_forge.generate_text gpt2_medium --config_file configs/gpt2.yaml --checkpoint checkpoints/model_gpt2_medium.pth
  ```
For detailed information on CLI arguments for training and text generation, please refer to:
llm_forge/
├── src/
│ └── llm_forge/
│ ├── models/
│ ├── data/
│ ├── utils/
│ ├── train.py
│ └── generate_text.py
├── tests/
├── configs/
├── docs/
├── checkpoints/
└── README.md
- Modular architecture for easy experimentation
- Support for GPT-style models
- Configurable model sizes and architectures
- Text generation capabilities
Models supported so far:
- GPT-2 from Scratch
Since training does not even finish the first epoch, the model sees each example at most once, so both the training and validation loss indicate how well the model fits the data. The loss reported below is therefore an approximation based on the training and validation loss.
| Model | Implementation | Training Time | # Training Batches | Batch Size | # Parameters* | Loss | Notes |
|---|---|---|---|---|---|---|---|
| GPT-2 Medium | From Scratch | 24 hours | 179K | 4 | 407M | 3.38 | Gradient clipping used |
* This is the total number of trainable parameters in the model. For many models, such as GPT-2, the embedding matrix is not counted toward the parameter total reported in the respective paper.
Detailed training logs for each model are available in the docs folder. You can find the logs for specific models here:
GPT-2 Medium
Fruits are good for you because fruits contain vitamin C and vitamin E.
Fruits also contain fiber, which is very important for the health of the brain and the mind.
According to the Food and Agriculture Organization (FAO) it is recommended to consume a wide range of fruits and vegetables for the health of your body and the health of your brain and the mind.
Fruits are rich in vitamins, fiber, and vitamins and are also rich in vitamins and minerals.
Fruits are rich in vitamins and fiber as
I intend to log the issues that I face and the fixes that I apply here.
- Handling Exploding/Vanishing Gradients:
  - The model training crashed after a few hours, and the loss was reported as `NaN`. This issue was caused by exploding (or vanishing) gradients.
  - Fixes:
    - Applied gradient clipping to prevent the gradient norm from exceeding a certain threshold.
    - Added checks to ensure that the loss is not `NaN` or `Inf` before the backward pass.
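The two fixes above can be sketched as a single guarded update step. This is a minimal illustration, not the repo's actual training loop; the function name and threshold are illustrative:

```python
import math
import torch

def safe_backward_step(model, loss, optimizer, max_norm=1.0):
    """Skip the update if the loss is NaN/Inf, otherwise clip gradients and step."""
    # Guard: a NaN or Inf loss would poison the weights on backward.
    if not math.isfinite(loss.item()):
        optimizer.zero_grad(set_to_none=True)
        return False  # caller can log and skip this batch
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    # Rescale gradients so their global L2 norm is at most max_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return True
```

`clip_grad_norm_` rescales all gradients jointly, so the update direction is preserved while its magnitude is bounded.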
- Special Character Handling in Tokenization:
  - One training session crashed due to a disallowed special token '<|endoftext|>' somewhere in the training data :/ Passing `allowed_special` to the tokenizer fixed it: `tokens = self.tokenizer.encode(text, allowed_special={"<|endoftext|>"})`
- Learning Rate for GPT-2 Medium:
  - I first tried a learning rate of `3e-4`, which seemed fine, but later found that `1e-4` led to faster convergence and lower loss. I still have not tried even lower learning rates or learning rate schedules.
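If a schedule is tried later, PyTorch's built-in cosine annealing is a common starting point. A minimal sketch (the model, optimizer, and step count here are illustrative, not the repo's configuration):

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Cosine decay from the base LR toward eta_min over T_max steps;
# T_max would be the planned number of training batches.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=179_000, eta_min=1e-5
)

# Inside the training loop: step the optimizer, then the scheduler.
# optimizer.step()
# scheduler.step()
```

In practice a short linear warmup is often prepended (e.g. via `torch.optim.lr_scheduler.SequentialLR`) to stabilize the first few thousand steps.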
- Mixed Precision:
  - I tried PyTorch's mixed precision (AMP) for GPT-2 Medium, but it did not reduce GPU memory usage. My setup is 2 x RTX 4090 GPUs with 24 GB of VRAM each.
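For reference, a typical AMP training step looks like the sketch below (illustrative, not the repo's loop). Note that AMP keeps fp32 master weights and optimizer state, so it mostly shrinks activation memory; if optimizer state dominates, total memory can stay roughly flat, which may explain the observation above:

```python
import torch
import torch.nn as nn

def train_step_amp(model, batch, targets, optimizer, scaler, device_type="cuda"):
    # Forward pass under autocast: eligible ops run in fp16 (bf16 on CPU),
    # while weights and optimizer state remain fp32.
    amp_dtype = torch.float16 if device_type == "cuda" else torch.bfloat16
    with torch.autocast(device_type=device_type, dtype=amp_dtype):
        logits = model(batch)
        loss = nn.functional.cross_entropy(logits, targets)
    optimizer.zero_grad(set_to_none=True)
    # GradScaler scales the loss so small fp16 gradients do not underflow;
    # it is a no-op when constructed with enabled=False.
    scaler.scale(loss).backward()
    # Unscale before clipping so the threshold applies to true gradient norms.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

The scaler would be created once outside the loop, e.g. `scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())`.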
Contributions are welcome! Please check out our Contributing Guidelines for more information.
This project is licensed under the MIT License - see the LICENSE file for details.
- The initial code is adapted from Sebastian Raschka and his LLM Workshop 2024 on "Pretraining and Finetuning LLMs from the Ground Up"
- This repo was inspired by karpathy's nanoGPT, which I used as a reference to build my own earlier version of a GPT from scratch: GPT-Nano