# Tiny Stories Hackathon
> From Cluster of stars study group

## Rules

# TinyStories Hackathon Rules
This hackathon is intended to be a fun competition that gives us practice pretraining LLMs on consumer hardware. We will follow the [TinyStories paper](<https://arxiv.org/abs/2305.07759>) and train small language models on small datasets using modest hardware.

The hackathon ends on April 7th, [AoE](<https://en.wikipedia.org/wiki/AoE>) (Anywhere on Earth).
## Datasets
1. [**TinyStories:**](<https://huggingface.co/datasets/roneneldan/TinyStories>)
   Note that the TinyStories dataset comes in two versions, both available in the HF dataset:
   - GPT-3.5 generated TinyStories
   - GPT-4 generated TinyStories

   The tar file appears to have the cleanest versions, with the fewest duplicates.
2. **[Simple Wikipedia](<https://huggingface.co/datasets/lsb/simplewiki2023>)** (optional)
   This dataset can be used to give your model more world knowledge than the TinyStories dataset alone provides. But be careful that it doesn't cause your model to use words that a typical 3- to 4-year-old wouldn't understand. It may need to be cleaned.
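
One rough way to act on the vocabulary concern above is to measure how many words in a Simple Wikipedia article never appear in TinyStories, and drop articles above some threshold. The sketch below is only an illustration: the helper names (`build_vocab`, `oov_fraction`, `filter_texts`), the regex word splitter, and the 10% threshold are all assumptions, not part of the hackathon rules.

```python
import re
from collections import Counter
from typing import Iterable, List, Set

# Assumption: a "word" is a run of lowercase letters and apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def build_vocab(texts: Iterable[str], min_count: int = 1) -> Set[str]:
    """Collect every word that appears at least min_count times in texts."""
    counts = Counter(w for t in texts for w in WORD_RE.findall(t.lower()))
    return {w for w, c in counts.items() if c >= min_count}


def oov_fraction(text: str, vocab: Set[str]) -> float:
    """Fraction of words in text that are missing from vocab."""
    words = WORD_RE.findall(text.lower())
    if not words:
        return 1.0  # treat empty text as fully out-of-vocabulary
    return sum(w not in vocab for w in words) / len(words)


def filter_texts(texts: Iterable[str], vocab: Set[str], max_oov: float = 0.1) -> List[str]:
    """Keep only the texts whose out-of-vocabulary fraction is at most max_oov."""
    return [t for t in texts if oov_fraction(t, vocab) <= max_oov]
```

In practice the vocabulary would be built from a sample of TinyStories stories and the filter applied to the Simple Wikipedia articles. The threshold is something to tune by reading what survives.
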
## Evaluation
Models will be evaluated by LLM-as-a-judge following the methodology outlined in the TinyStories paper. More details, including how to submit your model's outputs, will follow early next week.
## Model Size Limits
Participants will be slotted into one of the following categories based on their hardware:
- **Small**: Up to 30M parameters. Low-to-mid range laptop GPUs and Apple Silicon.
- **Medium**: Up to 60M parameters. Mid-range GPUs (including high-end laptop GPUs and Apple Silicon).
- **Large**: Up to 120M parameters. High-end GPUs and multi-GPU systems.
## Tokenizers
While you must train your model from scratch, you are welcome to use any pre-trained tokenizer or train your own tokenizer.
## Model Architecture
You are welcome to use any model architecture you want provided you stay within the parameter budget of your hardware by following the parameter counting rules below.

## Parameter Counting
The parameter budget is the number of unique floating-point weights receiving gradient updates:
- Unique Weights: Count each distinct floating-point weight stored in the model once.
- Reuse Multiplier: For each weight, multiply by the number of distinct times it contributes to forward computation (e.g., due to layer-sharing, layer reuse, or non-standard head-sharing). Weight-tied embedding and decoder weights are the exception and are only counted once. Multi-query and grouped-query attention (MQA/GQA) don't count as head-sharing.
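
As a worked example of the two counting rules above, here is a minimal sketch of the arithmetic. `WeightGroup` and `parameter_budget` are hypothetical names made up for illustration, and this is one reading of the rules rather than an official counter. The example numbers assume the GPT-2 vocabulary (50,257 tokens), a model width of 256, and the common rough estimate of 12·d² weights per transformer block (ignoring biases and norms).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class WeightGroup:
    name: str
    numel: int                    # number of unique floating-point weights
    uses: int = 1                 # distinct times the weights contribute to the forward pass
    tied_embedding: bool = False  # tied embedding/decoder weights are counted once


def parameter_budget(groups: Iterable[WeightGroup]) -> int:
    """Sum unique weights, applying the reuse multiplier except for tied embeddings."""
    total = 0
    for g in groups:
        multiplier = 1 if g.tied_embedding else g.uses
        total += g.numel * multiplier
    return total


groups = [
    # Used as both embedding and decoder, but exempt from the multiplier.
    WeightGroup("tied embedding/decoder", 50257 * 256, uses=2, tied_embedding=True),
    # One block applied 4 times (layer reuse) counts 4 times.
    WeightGroup("shared transformer block", 12 * 256**2, uses=4),
]
print(parameter_budget(groups))  # 16011520
```

Note how much of the budget the embedding table takes with a large pre-trained vocabulary: about 12.9M of the 16M counted here.
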
## Teams
Teams are limited to a maximum of 2 members and must be formed and declared within the first week.
## Training Frameworks
You might want to take a look at the following libraries and frameworks and adopt one for pretraining:
- [Composer](<https://docs.mosaicml.com/projects/composer/en/stable/index.html>) and optionally [LLM Foundry](<https://github.com/mosaicml/llm-foundry>)
- [PyTorch Lightning](<https://lightning.ai/docs/pytorch/stable/>) and optionally [LitGPT](<https://github.com/Lightning-AI/litgpt>)
- Hugging Face [Trainer](<https://huggingface.co/docs/transformers/en/main_classes/trainer>), [Accelerate](<https://huggingface.co/docs/accelerate/en/index>), and optionally [Axolotl](<https://axolotl-ai-cloud.github.io/axolotl/>) (a wrapper on top of HF)
- [fastai](<https://docs.fast.ai/>) with either [fastxtend](<https://fastxtend.benjaminwarner.dev/text.huggingface.html>) or [blurr](<https://ohmeow.github.io/blurr/>)

## Data

In [11]:
from datasets import load_dataset
import tiktoken

from minai import *

Grab tiny stories data

In [3]:
ds = load_dataset('roneneldan/TinyStories')
trn = ds['train']
val = ds['validation']
trn

Dataset({
    features: ['text'],
    num_rows: 2119719
})

In [12]:
tokenizer = tiktoken.get_encoding('gpt2')

In [19]:
txt = trn[0]['text']
txt

'One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.\n\nLily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."\n\nTogether, they shared the needle and sewed the button on Lily\'s shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.'

In [20]:
inp = tokenizer.encode(trn[0]['text'])[:10]
inp

[3198, 1110, 11, 257, 1310, 2576, 3706, 20037, 1043, 257]

In [21]:
tokenizer.decode(inp)

'One day, a little girl named Lily found a'

In [6]:
dls = DataLoaders.from_dd(ds, batch_size=4)

In [9]:
trn[0]

{'text': 'One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.\n\nLily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."\n\nTogether, they shared the needle and sewed the button on Lily\'s shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.'}

In [10]:
trn[1]

{'text': 'Once upon a time, there was a little car named Beep. Beep loved to go fast and play in the sun. Beep was a healthy car because he always had good fuel. Good fuel made Beep happy and strong.\n\nOne day, Beep was driving in the park when he saw a big tree. The tree had many leaves that were falling. Beep liked how the leaves fall and wanted to play with them. Beep drove under the tree and watched the leaves fall on him. He laughed and beeped his horn.\n\nBeep played with the falling leaves all day. When it was time to go home, Beep knew he needed more fuel. He went to the fuel place and got more healthy fuel. Now, Beep was ready to go fast and play again the next day. And Beep lived happily ever after.'}

In [7]:
# Look at the data
xb, yb = next(iter(dls.train))
xb.shape, yb.shape, yb[:5]

ValueError: too many values to unpack (expected 2)
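
The dataset has a single `text` feature, so a batch drawn straight from it holds raw strings rather than an (input, target) pair, which is presumably why the two-way unpacking fails (an assumption, since the full traceback isn't shown). For next-token prediction the text first needs to be tokenized and cut into fixed-length sequences where the target is the input shifted by one token. A minimal sketch of that step, with `make_lm_pairs` as a hypothetical helper and the end-of-text separator as an assumed design choice:

```python
from typing import Callable, Iterable, List, Tuple


def make_lm_pairs(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    seq_len: int,
    eot_id: int,
) -> List[Tuple[List[int], List[int]]]:
    """Concatenate tokenized texts, separated by eot_id, and cut the stream
    into (x, y) pairs where y is x shifted left by one token."""
    stream: List[int] = []
    for t in texts:
        stream.extend(encode(t))
        stream.append(eot_id)  # mark the story boundary

    pairs = []
    # Each chunk needs seq_len + 1 tokens: seq_len inputs plus one extra target.
    for i in range(0, len(stream) - seq_len, seq_len):
        chunk = stream[i : i + seq_len + 1]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs
```

With the objects from the cells above, this could be called as `make_lm_pairs(trn[:1000]['text'], tokenizer.encode_ordinary, 256, tokenizer.eot_token)`. The resulting lists would then be converted to tensors before being handed to a dataloader. Any trailing tokens that don't fill a whole chunk are dropped.
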