The objective is to survey current small language models and estimate how many parameters a model of our own could have.

We aim to reduce the vocabulary size, which shrinks the embedding matrices and therefore the total parameter count. We are also considering a change in precision from FP32 to BF16, which does not change the number of parameters but halves the bytes stored per parameter, so it should make training faster and the model lighter.
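
As a rough sketch of the budget arithmetic behind this objective, the snippet below inverts the size calculation used later in this notebook: given a memory budget and a precision, it returns how many parameters fit. The helper name `params_for_budget`, the `BYTES_PER_PARAM` table, and the 100 MiB budget are assumptions chosen only for illustration. It counts weights only and ignores optimizer state and activations.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def params_for_budget(budget_mib: float, dtype: str) -> int:
    """Return how many parameters fit in `budget_mib` mebibytes at `dtype`."""
    budget_bytes = budget_mib * 1024 * 1024
    return int(budget_bytes // BYTES_PER_PARAM[dtype])


# Hypothetical 100 MiB weight budget, used only as an example.
for dtype in ("fp32", "bf16"):
    print(f"{dtype}: {params_for_budget(100, dtype):,} parameters")
```

Under this assumption, moving from FP32 to BF16 doubles the number of parameters that fit in the same budget.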

In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer

input_text = "My name is Julien and I like to play football with my friends."

## TinyStories

* [Paper](https://arxiv.org/pdf/2305.07759)
* [Models]()

In [6]:
# Load the model and tokenizer
tiny_stories_tokenizer = AutoTokenizer.from_pretrained(
    "roneneldan/TinyStories-1Layer-21M"
)
tiny_stories_model = AutoModelForCausalLM.from_pretrained(
    "roneneldan/TinyStories-1Layer-21M"
)

In [7]:
total_params = sum(p.numel() for p in tiny_stories_model.parameters())
print(f"Total number of parameters: {total_params:,}")

# Calculate the total size in bytes (assuming float32, 4 bytes per parameter)
total_size_bytes = total_params * 4

# Convert to megabytes
total_size_mb = total_size_bytes / (1024 * 1024)

print(f"Total size of the model: {total_size_mb:.2f} MB")

Total number of parameters: 66,155,520
Total size of the model: 252.36 MB


In [8]:
tiny_stories_model

GPTNeoForCausalLM(
  (transformer): GPTNeoModel(
    (wte): Embedding(50257, 1024)
    (wpe): Embedding(2048, 1024)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0): GPTNeoBlock(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): GPTNeoAttention(
          (attention): GPTNeoSelfAttention(
            (attn_dropout): Dropout(p=0.0, inplace=False)
            (resid_dropout): Dropout(p=0.0, inplace=False)
            (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
            (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
        )
        (ln_2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): GPTNeoMLP(
          (c_fc): Linear(in_features=1024, out_features=4096, bias=True)
          (c_proj): Linear(
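
The module tree above (truncated in this export) is enough to reconstruct the parameter count by hand and to see how much of it comes from the vocabulary. The sketch below makes three assumptions that the truncated output does not show: `c_proj` is a `Linear(4096, 1024)` with bias, there is a final layer norm after the block, and the output head shares its weights with `wte`, so it is counted once. The helper names and the reduced vocabulary of 8,192 tokens are hypothetical.

```python
def linear(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameter count of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def layer_norm(dim: int) -> int:
    """Parameter count of a LayerNorm with affine weight and bias."""
    return 2 * dim


def gpt_neo_1layer_params(vocab_size: int, hidden: int = 1024,
                          max_positions: int = 2048, ffn: int = 4096) -> int:
    embeddings = vocab_size * hidden + max_positions * hidden  # wte + wpe
    # q_proj, k_proj, v_proj have no bias; out_proj has one.
    attention = 3 * linear(hidden, hidden, bias=False) + linear(hidden, hidden)
    mlp = linear(hidden, ffn) + linear(ffn, hidden)  # c_fc + assumed c_proj
    norms = 3 * layer_norm(hidden)  # ln_1, ln_2, and an assumed final norm
    return embeddings + attention + mlp + norms


current = gpt_neo_1layer_params(50257)
reduced = gpt_neo_1layer_params(8192)  # hypothetical smaller vocabulary
wte_share = 50257 * 1024 / current

print(f"Reconstructed total: {current:,}")
print(f"Token-embedding share: {wte_share:.1%}")
print(f"Total with an 8,192-token vocabulary: {reduced:,}")
```

With these assumptions the reconstruction matches the 66,155,520 parameters reported above, and the token embedding accounts for roughly 78% of them, which is why shrinking the vocabulary has such a large effect at this scale.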

In [4]:
# Tokenize input and generate model output
inputs = tiny_stories_tokenizer(input_text, return_tensors="pt")
outputs = tiny_stories_model.generate(inputs["input_ids"], max_length=50)

# Decode the generated output
generated_text = tiny_stories_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


My name is Julien and I like to play football with my friends. Do you want to play with me?"

Timmy was happy to play with his new football. He kicked the ball and it went flying through the air. He was


## Pythia

In [21]:
# Load the model and tokenizer
pythia_tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-14m")
pythia_model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-14m")

In [32]:
total_params = sum(p.numel() for p in pythia_model.parameters())
print(f"Total number of parameters: {total_params:,}")

# Calculate the total size in bytes (assuming float32, 4 bytes per parameter)
total_size_bytes = total_params * 4

# Convert to megabytes
total_size_mb = total_size_bytes / (1024 * 1024)

print(f"Total size of the model: {total_size_mb:.2f} MB")

Total number of parameters: 14,067,712
Total size of the model: 53.66 MB
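
To gauge how much of this model is vocabulary, the sketch below estimates the embedding share of the reported total. The architecture is not printed in this notebook, so the values used here are assumptions taken from the published Pythia-14m configuration as best recalled: a padded vocabulary of 50,304 tokens, a hidden size of 128, and separate (untied) input and output embedding matrices. Check them against `pythia_model.config` before relying on the result.

```python
total_params = 14_067_712  # reported by the cell above

# Assumed configuration values (not shown in this notebook).
vocab_size = 50_304
hidden_size = 128

# Assumed untied: one matrix for input embeddings, one for the output head.
embedding_params = 2 * vocab_size * hidden_size
non_embedding_params = total_params - embedding_params
share = embedding_params / total_params

print(f"Embedding parameters: {embedding_params:,}")
print(f"Non-embedding parameters: {non_embedding_params:,}")
print(f"Embedding share: {share:.1%}")
```

If these assumptions hold, more than nine tenths of the model is embedding weights, so a smaller vocabulary would dominate any parameter savings here.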


In [23]:
# Tokenize input and generate model output
inputs = pythia_tokenizer(input_text, return_tensors="pt")
outputs = pythia_model.generate(
    inputs["input_ids"],
    max_new_tokens=128,
)

# Decode the generated output
generated_text = pythia_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


My name is Julien and I like to play football with my friends. I'm a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little bit of a little


## Minueza-32M

It is not an especially good LLM, but it is a useful point of reference: https://www.victornogueira.app/#/articles/the-making-of-minueza-32-m-transformer-model-trained-from-scratch/readme


In [24]:
minueza_tokenizer = AutoTokenizer.from_pretrained("Felladrin/Minueza-32M-Base")
minueza_model = AutoModelForCausalLM.from_pretrained("Felladrin/Minueza-32M-Base")

In [31]:
total_params = sum(p.numel() for p in minueza_model.parameters())
print(f"Total number of parameters: {total_params:,}")

# Calculate the total size in bytes (assuming float32, 4 bytes per parameter)
total_size_bytes = total_params * 4

# Convert to megabytes
total_size_mb = total_size_bytes / (1024 * 1024)

print(f"Total size of the model: {total_size_mb:.2f} MB")

Total number of parameters: 32,792,760
Total size of the model: 125.09 MB


In [26]:
# Tokenize input and generate model output
inputs = minueza_tokenizer(input_text, return_tensors="pt")
outputs = minueza_model.generate(
    inputs["input_ids"],
    max_new_tokens=128,
)

# Decode the generated output
generated_text = minueza_tokenizer.decode(outputs[0], skip_special_tokens=True)
generated_text

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


'My name is Julien and I like to play football with my friends. I love the way I play football. I love the way I play football. I love the way I play football. I love the way I play football. I love the way I play football. I love the way I play football. I love the way I play football. I love the way I play football. I love the way I play football. I love the way I play football. I love the way I play football. I love the way I play football. I love the way I play football. I love the way I play football. I love the way I play football. I love the way I play football.'

## Qwen-2.5 0.5B Base

In [27]:
# Load the model and tokenizer
qwen_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
qwen_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

In [30]:
total_params = sum(p.numel() for p in qwen_model.parameters())
print(f"Total number of parameters: {total_params:,}")

# Calculate the total size in bytes (assuming float32, 4 bytes per parameter)
total_size_bytes = total_params * 4

# Convert to megabytes
total_size_mb = total_size_bytes / (1024 * 1024)

print(f"Total size of the model: {total_size_mb:.2f} MB")

Total number of parameters: 494,032,768
Total size of the model: 1884.59 MB
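
The four parameter counts reported in this notebook can be put side by side to compare weight sizes at FP32 and at BF16. This is a minimal sketch: the `reported_params` dictionary simply copies the totals printed above, and `size_mib` is a hypothetical helper that repeats the notebook's own formula. Dividing by 1024 * 1024 yields mebibytes, which the cells above label as MB.

```python
# Totals copied from the outputs printed earlier in this notebook.
reported_params = {
    "TinyStories-1Layer-21M": 66_155_520,
    "pythia-14m": 14_067_712,
    "Minueza-32M-Base": 32_792_760,
    "Qwen2.5-0.5B": 494_032_768,
}


def size_mib(num_params: int, bytes_per_param: int) -> float:
    """Weight size in MiB for a given parameter count and precision."""
    return num_params * bytes_per_param / (1024 * 1024)


# One row per model, smallest first: (name, params, FP32 MiB, BF16 MiB).
rows = [
    (name, count, size_mib(count, 4), size_mib(count, 2))
    for name, count in sorted(reported_params.items(), key=lambda item: item[1])
]

print(f"{'model':<24}{'params':>14}{'fp32 MiB':>12}{'bf16 MiB':>12}")
for name, count, fp32, bf16 in rows:
    print(f"{name:<24}{count:>14,}{fp32:>12.2f}{bf16:>12.2f}")
```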


In [29]:
# Tokenize input and generate model output
inputs = qwen_tokenizer(input_text, return_tensors="pt")
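# Note: as the warning below shows, a preset max_new_tokens (=2048) takes
# precedence over max_length=128, so this call is not limited to 128 tokens.
# Passing max_new_tokens=128 explicitly, as in the cells above, would bound the run.
outputs = qwen_model.generate(inputs["input_ids"], max_length=128)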

# Decode the generated output
generated_text = qwen_tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Both `max_new_tokens` (=2048) and `max_length`(=128) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


KeyboardInterrupt: 