In this notebook, I used OpenAI's Scaling Laws paper to calculate hyperparameters for my model. However, the numbers generated at the end didn't pass the sniff test. The scaling laws told me that I should train my model for 4.28e+21 days--that's a lot of days. This motivated me to instead calculate hyperparameters using the Chinchilla paper.

You should read this file if you want to see my thought process. My actual calculations are in scalinglaws.py; I didn't want to redo the write-up.

I am going to train an encoder-only transformer language model using MLX on my MacBook Pro M2. In this file, I am going to calculate the optimal hyperparameters according to OpenAI's Scaling Laws paper.

See https://arxiv.org/pdf/2001.08361

In this paper, OpenAI investigates the optimal ratio between model parameter count (N), dataset size (D), and minimum compute (C). They find that these three quantities should be scaled together in the following proportion:

D ∝ N^0.74 ∝ C^0.54

As long as none of these quantities is a bottleneck, increasing the scale of your model will reliably decrease the model's loss.

OpenAI assumes that you start with a fixed compute budget. However, I suspect that dataset size will be my limiting factor, so I will start with it.
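Because the relation above is a proportionality, it only says how the quantities change relative to one another; it does not pin down absolute values without a constant. Here is a minimal sketch of that idea. The function name `scale_dataset` and the reference point are hypothetical, made up for illustration, and are not figures from the paper.

```python
def scale_dataset(d_ref: float, n_ref: float, n_new: float, exponent: float = 0.74) -> float:
    """Scale a reference dataset size to a new parameter count, assuming D ∝ N^exponent."""
    return d_ref * (n_new / n_ref) ** exponent

# Hypothetical reference point: a model with 1e6 parameters trained on 1e7 tokens.
d_ref, n_ref = 1e7, 1e6

# Doubling the parameter count multiplies the dataset size by 2 ** 0.74 (about 1.67).
d_new = scale_dataset(d_ref, n_ref, 2 * n_ref)
growth = d_new / d_ref
print(f"dataset growth factor: {growth:.2f}")
```

The takeaway is that the exponent alone gives growth factors, not token counts or parameter counts in absolute terms.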

In [1]:
import tiktoken
import os

def concatenate_txt_files(directory):
    combined_text = ""

    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.txt'):
                file_path = os.path.join(root, file)
                
                # Attempt to read the file with utf-8 encoding
                try:
                    with open(file_path, 'r', encoding='utf-8') as f:
                        combined_text += f.read() + "\n\n"
                
                # If a UnicodeDecodeError occurs, try another encoding
                except UnicodeDecodeError:
                    try:
                        with open(file_path, 'r', encoding='ISO-8859-1') as f:
                            combined_text += f.read() + "\n\n"
                    except Exception as e:
                        print(f"Failed to read {file_path} due to {str(e)}")

    return combined_text

directory = "/Users/blakewhitmer/PlatoV1/PlatoV1Dataset/JustBooks"
combined_text = concatenate_txt_files(directory)

encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode(combined_text, allowed_special="all")

print("Number of tokens")
print(len(tokens))

Number of tokens
73584077


This dataset started out much larger. I began by putting Python code, math practice problems, and a smattering of ebooks together. But when I looked at initial generations after 10,000 or so iterations, they didn't look right. I thought my code was generating utter nonsense. So I cut out a lot. I first cut out the Project Gutenberg licenses. Don't worry, you can find them with my dataset on Hugging Face. But they were repeated data, and that really harms small models:

https://www.anthropic.com/research/scaling-laws-and-interpretability-of-learning-from-repeated-data

I also took out the math and Python code. I think having code, LaTeX, and parts of the Western Canon in Greek, German, and English would have been too much.

I also skimmed through the dataset and cut out anything that wasn't natural language. Did you know Project Gutenberg has an ebook that's just pi to a million digits? I tried to include the first 200 Project Gutenberg ebooks, just as a heuristic, but it seems that was a mistake.

You can see examples of some of the scripts I used to trim down my dataset in trimmingcanon.py. This was done almost entirely by intuition. If it didn't look like philosophy, I deleted it.
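For the license removal specifically, something like the following would work. This is a sketch, not the contents of trimmingcanon.py, and it assumes each file carries the usual Project Gutenberg `*** START OF ...` and `*** END OF ...` marker lines. The function name `strip_gutenberg_boilerplate` and the sample text are hypothetical.

```python
import re

# Marker lines Project Gutenberg places around the body of a book. Older files
# say "THIS" instead of "THE", so the patterns accept both.
START_RE = re.compile(r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)

def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the text between the START and END markers.

    If either marker is missing (or they appear out of order), the input is
    returned unchanged so nothing is silently lost.
    """
    start = START_RE.search(text)
    end = END_RE.search(text)
    if start is None or end is None or end.start() < start.end():
        return text
    return text[start.end():end.start()].strip()

# Made-up sample in the shape of a Project Gutenberg file.
sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Body of the book.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text...\n"
)
print(strip_gutenberg_boilerplate(sample))
```

Files without the markers pass through untouched, so they would still need a manual look.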

Now, I need to calculate the optimal parameter count for my model. The math is simple: since D ∝ N^0.74, I solve for N = D^(1/0.74).

In [11]:
paramter_count = 73584077 ** (1 / 0.74)
print(format(paramter_count, ".2e"))

4.27e+10


It seems the optimal parameter count is around 43 billion parameters. And the optimal minimum compute:

In [12]:
optimal_compute = paramter_count ** (1 / 0.54)
print(format(optimal_compute, ".2e"))

4.86e+19


Now, I need to make sure this can actually fit within my compute budget! The Scaling Laws paper measures optimal compute in petaflop-days (PF-days). And this is the only resource I could find on the throughput of my Mac's GPU:

https://www.cpu-monkey.com/en/igpu-apple_m2_pro_16_core

That is 11.36 TFLOPS for fp16.
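For reference, here is a sketch of how a throughput figure like this converts into PF-days, taking one PF-day to mean 10^15 operations per second sustained for one day. The helper name `pf_days` is hypothetical, and running at the full quoted throughput the whole time is an optimistic assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def pf_days(throughput_flops: float, days: float) -> float:
    """Compute delivered by a device, in petaflop/s-days.

    One petaflop/s-day is 1e15 operations per second sustained for one day,
    so the seconds cancel and only the throughput ratio and day count remain.
    """
    total_operations = throughput_flops * SECONDS_PER_DAY * days
    operations_per_pf_day = 1e15 * SECONDS_PER_DAY
    return total_operations / operations_per_pf_day

# 11.36 TFLOPS is the fp16 figure quoted above; 100% utilization is an
# optimistic assumption.
week_budget = pf_days(11.36e12, days=7)
print(f"{week_budget:.4f} PF-days")
```

At that peak rate, a full week of training comes to roughly 0.08 PF-days.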

I will train the model for at least 1 day.

I will try to train it for about 7 days. But as time goes on, if my MacBook doesn't have anything better to do, I might leave it training for longer. And so putting that all together:

In [21]:
pflop_day = 10 ** 15 * 60 * 60 * 24
total_fp_operations_needed = optimal_compute * pflop_day
fp_per_day = 11.36 * 10 ** 12 * 24 * 60 * 60
print(fp_per_day)

optimal_compute_in_days = total_fp_operations_needed / fp_per_day


print("Total training time in days: " + format(optimal_compute_in_days, ".2e"))

9.81504e+17
Total training time in days: 4.28e+21


That's a lot of days. Maybe I should have started with my compute budget, like the scaling laws paper recommended. 😬

I hope to train my model for somewhere between 3 days and a week. I might train it more as time goes on. This would give me a compute budget, model size, and dataset size of:

In [18]:

total_compute_budget = fp_per_day / 10 ** 15 * 7
optimal_parameters = total_compute_budget ** 0.54
minimum_dataset = optimal_parameters ** 0.74

print(total_compute_budget)
print(optimal_parameters)
print(minimum_dataset)

6870.528
118.02517729411888
34.13982176924179


Hmm... this doesn't look right.