- Be the default way developers create custom LLMs
- ✅ Create fundamentals for Ternary Quantization based on BitNet research:
  - ✅ Stub Poppins front doors
    - ✅ `bootstrap()`: Will create example `train.xml`
    - ✅ `train()`: Will create model based on `train.xml`
    - ✅ `infer()`: Will get response from model
    - ✅ `poppins bootstrap`: CLI command that calls `bootstrap()`
    - ✅ `poppins train`: CLI command that calls `train()`
    - ✅ `poppins infer`: CLI command that calls `infer()`
- ✅ Push to GitHub
- ✅ Push to crates.io
- ✅ Deploy `train.xsd` to a Cloudflare Worker
- ✅ `bootstrap()`
  - ✅ Accept an `output_dir_path` (default to `cwd`) & write example `train.xml`
  - ✅ May also be called via cli @ `poppins bootstrap`
    - ✅ CLI accepts `-o` or `--output` params for `output_dir_path`
- ✅ `BPETokenizer`
  - ✅ Write `tokenizer.json` based on `train.xml` samples
  - ✅ Add `bpe_requested_tokens` to `train.xml` constants
  - ✅ Add `bpe_min_merge_frequency` to `train.xml` constants
- ✅ Write `train()`:
  - ✅ Read training file (default to `train.xml`)
  - ✅ Parse `train.xml`
  - ✅ Validate `train.xml`
  - ✅ Create `TrainXML`
  - ✅ Write output directory (default to `.poppins`)
  - ✅ Create `Samples` (holds `training` & `validation` samples)
  - ✅ Write `output_dir/train_corpus.xml`
  - ✅ Write `output_dir/val_corpus.xml`
  - ✅ Write `output_dir/tokenizer.json`
  - ✅ Write `output_dir/train_corpus.bin`
  - ✅ Write `output_dir/val_corpus.bin`
  - ✅ Write `output_dir/train_index.bin`
  - ✅ Write `output_dir/val_index.bin`
  - Write `output_dir/manifest.json`
- ...
- MLA
- RMSNorm
- RoPE
- ReLU²
- KV Cache
- Memory
- Multi Turn
- Turso
- RLM
- Abstract Syntax Tree
- Predictable Performance:
  - Languages w/ a Garbage Collector (`Python`/`Java`/`JavaScript`) may pause during a model response for garbage collector maintenance. `Rust` does not have a garbage collector, so token generation during inference remains smooth
- Deploy Everywhere:
  - Compile to `WASM` to run in the browser
  - Compile to native `iOS` & `Android` libraries to run in mobile applications
  - Deploy to small devices (ex: `Raspberry Pi`) b/c no operating system is required
- Concurrency:
  - Python's Global Interpreter Lock (GIL) prevents true parallelism. `Rust` can use all CPU cores efficiently, which helps us scale optimally
- Peace:
  - C++ solutions like `llama.cpp` (typically called from Python via `llama-cpp-python`) can crash with memory errors that are hard to debug
  - With `Rust`, developers never see errors like segmentation faults, memory corruption or hard-to-debug crashes in production, b/c `Rust` guarantees memory safety at the language level
- Weights are the learned parameters (numbers)
- Weights are updated during training & fixed during inference
- Raw Weights are `f32` (big numbers that require 4 bytes to store)
- Quantized Ternary Weights are `-1`, `0` or `1` (require 2 bits to store)
- Gradient descent is the process of optimizing weights
- With machine learning, at the beginning of training weights are random numbers
- Then a prediction is made
- Then we compute the error
- Then we adjust the weights to reduce error
- How much we adjust the weights is based on the learning rate
- The gradient tells us the direction and magnitude to change the weight (positive means increase, negative means decrease)
- The learning rate is a small number (ex: 0.001) that controls how much we trust the gradient
- If the learning rate is too large, weights jump around and never settle (divergence)
- If learning rate is too small, training takes forever
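The steps above can be sketched in a few lines of Rust. This is a minimal illustration, not Poppins' actual training loop: the input `x`, target, starting weight and `learning_rate` are made-up values chosen so the answer is easy to check.

```rust
// Minimal gradient descent sketch: learn `weight` so that
// prediction = weight * x matches the target (the true weight is 3.0).
fn main() {
    let x = 2.0_f32; // input
    let target = 6.0_f32; // desired output
    let mut weight = 0.0_f32; // starts "random" (here: zero for simplicity)
    let learning_rate = 0.1_f32; // how much we trust the gradient

    for _ in 0..100 {
        let prediction = weight * x; // make a prediction
        let error = prediction - target; // compute the error
        let gradient = error * x; // direction & magnitude to change the weight
        weight -= learning_rate * gradient; // adjust the weight to reduce error
    }

    println!("learned weight ≈ {weight}"); // converges toward 3.0
    assert!((weight - 3.0).abs() < 1e-3);
}
```

Try raising `learning_rate` to `0.5` and the weight diverges instead of settling; lower it to `0.0001` and 100 steps are nowhere near enough — the two failure modes described above.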
- Deep learning is a machine learning architecture w/ many layers (3 to hundreds)
- Each layer transforms the data
- Each layer learns different patterns
- Each layer builds on the previous layer's representations
- AI is a system that receives inputs and provides outputs using learned weights
- With traditional programming a human writes a function to identify cats
- With AI programming a model attempts to identify a cat, adjusts weights & repeats till it's good at identifying cats
- A model is an instance of a neural network that has been trained w/ samples, can receive inputs (prompts) and provides quality outputs (responses)
- A neural network is a mathematical function that transforms an input into an output through a series of calculations
- A neural network's mathematical function includes weights and biases that are used to calculate the output
- At the beginning of training the weights and biases are random & through training these numbers get good enough to produce quality outputs
- An LLM is a Large Language Model
- An LLM is a specific type of neural network designed to work with language (text)
- The LLM receives an input (prompt) and gives back a probability distribution over the next token. Then the LLM receives another input (prompt + last token) and gives back another probability distribution. This continues till the most likely next token is a stop token.
- Attention computes how much each token should pay attention to all other tokens w/in a sequence
- Each token w/in the sequence is given 3 vectors: the query, key and value vectors
- Attention refers to the weights (probabilities) that determine how much information to take from all visible tokens
- Logits
  - Raw dot products (Q·K) that give us a single score for each token
- Attention weights are attention scores after softmax
- Attention weights are probabilities that sum to 1
- Attention weights tell us 'this token contributes `attention_weight` percent of its `Value` to the `output` for the current token'
- Attention output is the weighted sum of values using the attention weights
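The score → weight → output pipeline above can be sketched for a single query token. The scores and value numbers below are invented for illustration (real values are vectors, and scores come from Q·K dot products):

```rust
// Toy attention for one query token over a 3-token sequence.
fn softmax(scores: &[f32]) -> Vec<f32> {
    // Convert raw scores into probabilities that sum to 1
    let sum: f32 = scores.iter().map(|s| s.exp()).sum();
    scores.iter().map(|s| s.exp() / sum).collect()
}

fn main() {
    // Raw attention scores: dot products Q·K for each visible token
    let scores = [2.0_f32, 1.0, 0.1];
    // Attention weights: probabilities that sum to 1
    let weights = softmax(&scores);
    // Value for each of the 3 tokens (1 number each to keep it tiny)
    let values = [10.0_f32, 20.0, 30.0];
    // Attention output: weighted sum of values using the attention weights
    let output: f32 = weights.iter().zip(values.iter()).map(|(w, v)| w * v).sum();

    assert!((weights.iter().sum::<f32>() - 1.0).abs() < 1e-6);
    println!("weights = {weights:?}, output = {output}");
}
```

Note how the token with the biggest score (2.0) dominates the output: softmax turns a modest score gap into a large probability gap.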
- A Transformer is a deep learning architecture where each token w/in a sequence is aware of all other tokens w/in the sequence (Attention)
- A Query vector is given to a token and answers: what is this token looking for in other tokens?
- A Query vector helps us search for related tokens w/in a sequence by comparing the Query vector of the current token w/ the Key vector of other tokens
- During inference we get a query vector for the last token and compare to all other tokens
- During training we get a query vector for all tokens w/in the AI response and compare to all other tokens simultaneously
- A Key vector contains information about what a token offers to others
- We match the Key vector w/ the Query vector to determine if there is a relationship between 2 tokens
- A Value vector contains the actual data that will be passed forward if this token is selected
- Token selection in attention identifies which past tokens are most relevant to the current token
- Token prediction in attention identifies what token is most likely to come after the current token
- Training is the process of creating a model that makes useful next token predictions
- In the training process we show the model samples and let it learn from its mistakes (adjust its weights and biases)
- A sample is a simple training example that includes at least 1 prompt and 1 model response
- A sample may also include code snippets and sources
- A multi-turn sample is a sample w/ multiple prompts and responses, to teach the model how to:
- Have a conversation
- Ask good follow up questions
- Build on previous responses
- A corpus is a collection of samples
What is a hidden state?
- Math annotation is `h`
- The hidden state is the "current understanding" of the input as it flows through the model
- Input tokens start as embeddings (not yet hidden states)
- After passing through the first transformer layer, they become hidden states
- Each layer transforms the hidden state further
- A hidden state is a token vector after it has passed through at least one layer
- Hidden b/c
- Internal
- Not directly visible
- Intermediate representations
- Identifies what the model “knows” about the sequence
What is the final hidden state?
- The final hidden state (last layer's output) is what gets multiplied by output weights to predict the next token
- Annotation: `W_out[i]`
- An Output Projection Vector is a vector of weights of `embedding_dim` length for a token that identifies what hidden state pattern predicts this token
- Each token w/in the model's vocabulary has an Output Projection Vector
- When we multiply the Output Projection Vector with the hidden state, we get a score indicating how well the hidden state matches the token
- Annotation: `W_out`
- Output Projection Matrix is `embedding_dim` wide and `vocab_size` tall (`[vocab_size, embedding_dim]`)
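The multiplication described above can be sketched with made-up numbers: a tiny vocabulary of 2 tokens, `embedding_dim = 3`, and hand-picked weights (not how Poppins computes them, just the shape of the operation):

```rust
// Next-token logits from the final hidden state.
// W_out has shape [vocab_size, embedding_dim].
fn main() {
    let hidden: [f32; 3] = [0.5, -1.0, 2.0]; // final hidden state, embedding_dim = 3
    let w_out: [[f32; 3]; 2] = [
        [1.0, 0.0, 1.0],  // Output Projection Vector for token 0
        [0.0, 1.0, -1.0], // Output Projection Vector for token 1
    ];
    // Each logit is the dot product of the hidden state with one row of W_out
    let logits: Vec<f32> = w_out
        .iter()
        .map(|row| row.iter().zip(hidden.iter()).map(|(w, h)| w * h).sum())
        .collect();
    // token 0: 0.5*1 + (-1)*0 + 2*1 = 2.5; token 1: 0 + (-1) + (-2) = -3.0
    assert_eq!(logits, vec![2.5, -3.0]);
    println!("logits = {logits:?}");
}
```

Token 0 scores higher, meaning its projection vector matches the hidden state pattern better, so it is the more likely next token.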
- A Linear Layer Row is a vector of `input_dim` length that identifies what hidden state pattern activates this neuron
- Each neuron w/in a layer has a Linear Layer Row
- When we multiply the Linear Layer Row with the hidden state, we get a score indicating how well the hidden state matches the neuron
- A Linear Layer Matrix is `input_dim` wide and `output_dim` tall (`[output_dim, input_dim]`)
- A bias is a single number added during the output calculation
  - Updated during training & fixed during inference
  - The bias is a constant number added after the weighted sum
- Each token has a bias and each neuron has a bias
- Tells us how likely a token / neuron is in general
- Small bias -> token / neuron rarely appears
- High bias -> token / neuron appears often in many contexts
- An output bias is computed during training, is a unique value for each token and identifies baseline tendencies for a token
- High bias -> token appears often in many contexts
- Small bias -> token rarely appears
- Inference is when we use a trained model to generate a response
- A token is a piece of text that the model understands as a single unit
- Tokens can be:
- Words
- Parts of words
- Punctuation
- Individual characters
- Spaces are typically attached to the following word & not separate tokens
- A tokenizer is a tool that converts text into numbers (and back)
- Computers don't understand words like "hi" - they only understand numbers
- A tokenizer finds the middle ground:
- IF we give every word a unique number THEN we need a very large dictionary & can't handle words we've never seen
- IF we give every character a unique number THEN we lose word meanings
- A tokenizer splits text into pieces called "tokens" and gives each token a unique number ID
- BPE stands for Byte Pair Encoding
- BPE is a method for deciding how to split text into tokens
- BPE learns from the corpus which character & token combinations appear most frequently together, then merges them into tokens
- BPE training can merge any two adjacent non-special tokens, regardless of whether they're requested or learned
- Start with individual characters, spaces and punctuation marks as separate tokens
- Count how often each adjacent pair of tokens appears next to each other in the entire corpus
- Find the most frequent pair
- IF the most frequent pair occurs more than `MIN_MERGE_FREQUENCY` (ex: 3) times THEN merge them into a new token and repeat the process ELSE stop merging
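One iteration of the count → find → merge loop above, as a sketch. The toy corpus, threshold value and function names are illustrative, not Poppins' tokenizer code:

```rust
use std::collections::HashMap;

// One BPE training step: count adjacent token pairs across the corpus,
// then report the most frequent pair and how often it occurred.
const MIN_MERGE_FREQUENCY: usize = 3;

fn most_frequent_pair(tokens: &[&str]) -> Option<((String, String), usize)> {
    let mut counts: HashMap<(String, String), usize> = HashMap::new();
    for pair in tokens.windows(2) {
        *counts
            .entry((pair[0].to_string(), pair[1].to_string()))
            .or_insert(0) += 1;
    }
    counts.into_iter().max_by_key(|(_, count)| *count)
}

fn main() {
    // "hello hello hello" split into characters (spaces kept as tokens)
    let tokens = vec![
        "h", "e", "l", "l", "o", " ", "h", "e", "l", "l", "o", " ", "h", "e", "l", "l", "o",
    ];
    let ((a, b), count) = most_frequent_pair(&tokens).unwrap();
    if count >= MIN_MERGE_FREQUENCY {
        // e.g. merge 'l' + 'o' (seen 3 times) into 'lo'
        println!("merge '{a}' + '{b}' (seen {count} times) into '{a}{b}'");
    }
    assert_eq!(count, 3);
}
```

In this corpus several pairs tie at 3 occurrences ("he", "el", "ll", "lo"); a real trainer needs a deterministic tie-break rule so training is reproducible.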
- Merge rules tell us how to build bigger tokens from smaller ones
- When we get NEW text (not in the training data) (like a user prompt), we apply the merge rules to tokenize the text
- Example: the sequence may contain `console.log("hi world")` and the vocab may have both `console` & `console.log`; the merge rules ensure the sequence uses the largest token available in the vocab (`console.log`)
- After a merge, the smaller pre-merge tokens remain in the vocabulary
- Split prompt into characters, spaces and punctuation marks
- Apply merge rules in the exact same order they were learned to build tokens & ensure consistency
- Look up each token in the vocabulary to get its token ID
- Look up each token embedding based on the token ID
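The split-then-merge steps above can be sketched as follows. The two merge rules are invented for the example; a real tokenizer would load them from `tokenizer.json`:

```rust
// Tokenizing new text: split into characters, then apply merge rules
// in the exact same order they were learned.
fn apply_merges(mut tokens: Vec<String>, merges: &[(&str, &str)]) -> Vec<String> {
    for (a, b) in merges {
        let mut merged: Vec<String> = Vec::new();
        let mut i = 0;
        while i < tokens.len() {
            if i + 1 < tokens.len() && tokens[i] == *a && tokens[i + 1] == *b {
                merged.push(format!("{a}{b}")); // merge the pair into one token
                i += 2;
            } else {
                merged.push(tokens[i].clone()); // keep the token as-is
                i += 1;
            }
        }
        tokens = merged;
    }
    tokens
}

fn main() {
    // Learned merge rules, in order: "l"+"o" -> "lo", then "lo"+"w" -> "low"
    let merges = [("l", "o"), ("lo", "w")];
    // Step 1: split the prompt into characters
    let chars: Vec<String> = "low".chars().map(|c| c.to_string()).collect();
    // Step 2: apply merge rules in learned order
    let tokens = apply_merges(chars, &merges);
    assert_eq!(tokens, vec!["low".to_string()]);
    println!("{tokens:?}");
}
```

Because the rules apply in learned order, the same input text always produces the same token sequence — which is what keeps training and inference consistent.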
- Special tokens are pre-added to vocabulary
- Structural tokens that define the corpus format (ex: `<sample>`, `</sample>`)
- They can NOT be merged with adjacent tokens to form larger tokens
- They appear in the token sequence as single units
- Requested tokens are pre-added to vocabulary
- They can still be merged with adjacent tokens to form larger tokens
- They appear in the token sequence as single units (unless merged to make even larger tokens)
- Example: If `console.log` is a requested token then it starts as one token, & during BPE training, if `console.log` + `(` appears frequently then BPE can still merge into `console.log(`
- Embedding is the process of turning a token into a token embedding
- A token embedding is a vector of numbers that represents the meaning of a token
- A vector is a list of numbers (ex: `[1.5, 0, -2.3]`)
- Embedding dimension is the length of a token embedding vector
- More dimensions = more expressive power = more memory and computation
- Slot / Index / Position w/in a vector
- Where x, y & z meet
- A basis axis is a unique direction from the origin that aligns w/ a dimension
- Unique meaning no 2 basis axes w/in a vector share the same orientation
- Models distribute meaning (nouniness, verbiness, pronouniness, animalness) across dimensions (basis axes)
- What each dimension represents is not human defined, only the number of allowed dimensions (embedding dimension) is human defined
- A latent feature is a pattern the model discovered during training that:
- Is not explicitly named
- We did not manually define
- Exists only as numbers inside the network
- Each dimension w/in a vector captures a pattern in the data, we do not know what that pattern is and there is no guarantee that it corresponds to a clean human concept (ex: animalness)
- The model discovers useful internal dimensions automatically
- Every dimension in embeddings, hidden states & neuron outputs is a latent feature
- Orientation is what way an arrow points from the origin
- Independent of magnitude
- An input is the token embeddings for all tokens w/in a sequence
- A model moves the input through layers to comprehend the input & then predict the next token
- An input is a matrix of numbers (width = embedding dimension, height = sequence length) that comes from somewhere, that somewhere might be the:
- Original sequence
- Output of a previous layer
- A sequence is an ordered list of tokens
- A matrix is a rectangular grid of numbers with rows and columns
- `x1 * w1`
- `h1 * w1`
- A weighted input is the result of multiplying an input by its corresponding weight
- An output is a vector that is provided by a layer after aligning inputs w/ ternary weights
- The output size is equal to the number of neurons in a layer
- During attention & ffn compress the output size is equal to the embedding dimension
- During ffn expand the output size is equal to the embedding dimension * 4
```
input = [2.0, 1.5, 0.5]
weights = [
    [1, 0, -1],  // neuron 0
    [0, 1, -1],  // neuron 1
    [-1, 1, -1], // neuron 2
]
output[0] = (2.0×1) + (1.5×0) + (0.5×-1) = 2.0 + 0.0 - 0.5 = 1.5
output[1] = (2.0×0) + (1.5×1) + (0.5×-1) = 0.0 + 1.5 - 0.5 = 1.0
output[2] = (2.0×-1) + (1.5×1) + (0.5×-1) = -2.0 + 1.5 - 0.5 = -1.0
output = [1.5, 1.0, -1.0] // length 3
```
- Collection of neurons that process data simultaneously
- Each token goes through each neuron in a layer
- The number of neurons w/in a layer is equal to the output_size
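The worked example above can be reproduced as runnable Rust (a sketch of the per-layer calculation, not Poppins' actual layer code):

```rust
// One layer with 3 ternary-weight neurons applied to a 3-number input.
fn main() {
    let input = [2.0_f32, 1.5, 0.5];
    let weights: [[f32; 3]; 3] = [
        [1.0, 0.0, -1.0],  // neuron 0
        [0.0, 1.0, -1.0],  // neuron 1
        [-1.0, 1.0, -1.0], // neuron 2
    ];
    // Each neuron's output is the dot product of the input with its weight row,
    // so the output length equals the number of neurons in the layer.
    let output: Vec<f32> = weights
        .iter()
        .map(|row| row.iter().zip(input.iter()).map(|(w, x)| w * x).sum())
        .collect();
    assert_eq!(output, vec![1.5, 1.0, -1.0]);
    println!("output = {output:?}");
}
```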
- 32-bit floating point numbers (`f32`) can represent most numbers with high precision, example: `0.000000001`, `3.1415926535`, `-0.999999999`
- Precise (`f32`) numbers take up 4 bytes (32 bits) for every value
- Quantization is the process of taking numbers that need many bits (`3.1415926535`) and mapping them to numbers that need fewer bits (`3.14`)
- Ternary means 3 possible values
- Ternary Quantization is a technique where we keep raw weights in `f32` & create quantized weights that can be `-1`, `0` or `+1`
- Raw values require 32 bits but quantized values (what we'll use in inference) only require 2 bits to tell us the value (`0b01` tells us `-1`, `0b00` tells us `0` & `0b10` tells us `+1`)
  - `0b` is a prefix that tells the computer "there are binary digits coming next"
  - Not a rule from outside of Poppins, just a choice; the on and off bits can represent whatever values we want them to
- `+1`: This input matters a lot, add its effect
- `0`: Ignore this input entirely
- `-1`: This input matters, but in the OPPOSITE direction
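A sketch of quantizing raw `f32` weights to `{-1, 0, +1}` and packing them as 2 bits each. The threshold rule and raw values are illustrative assumptions (BitNet-style methods scale by the mean absolute weight; this toy version just compares against a fixed threshold), while the bit patterns match the `0b01`/`0b00`/`0b10` choice above:

```rust
// Map an f32 raw weight to a ternary value using a simple threshold rule
// (illustrative only; real schemes derive the threshold from the weights).
fn quantize(raw: f32, threshold: f32) -> i8 {
    if raw > threshold {
        1
    } else if raw < -threshold {
        -1
    } else {
        0
    }
}

// Encode one ternary value as 2 bits, per the document's convention.
fn encode(q: i8) -> u8 {
    match q {
        -1 => 0b01,
        0 => 0b00,
        1 => 0b10,
        _ => unreachable!(),
    }
}

fn main() {
    let raw = [0.73_f32, -0.02, -0.58, 0.01]; // 4 values × 32 bits = 128 bits
    let quantized: Vec<i8> = raw.iter().map(|&w| quantize(w, 0.1)).collect();
    assert_eq!(quantized, vec![1, 0, -1, 0]);

    // Pack four 2-bit values into one byte: 16x smaller than four f32s
    let packed: u8 = quantized
        .iter()
        .enumerate()
        .fold(0u8, |acc, (i, &q)| acc | (encode(q) << (i * 2)));
    assert_eq!(packed, 0b0001_0010);
    println!("packed byte = {packed:#010b}");
}
```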
- A neuron is a function that learns to detect different features (patterns)
- This function has 1 weight vector and 1 bias number
- A neuron's job is to accept an embedding vector and provide a number that represents this token's score as it relates to a particular feature
- Calculation is dot product between input & weights
- Each neuron learns a different pattern & together they form a massive pattern-detection system
- The absolute value of a number is its distance from zero, ignoring whether it's positive or negative
- Calculation is dot product + bias
- `logit = (weight₁ * input₁) + (weight₂ * input₂) + ... + (weightₙ * inputₙ) + bias`
- When predicting the next token, the model computes logits for all tokens w/in its vocabulary
- Larger logits mean the model thinks that token is more likely
- Logits aren't probabilities (they don't sum to 1 & they can be negative)
- Logits tell us how compatible a token is w/ the current hidden state
- Softmax is the process of converting logits to probabilities
- Softmax helps the much bigger score dominate
- `probability = eulers_num^(logit) / sum(eulers_num^(all_logits))`
  - Numerator is Euler's number raised to the logit for one token
  - Denominator is the sum of Euler's number raised to the logit for all tokens
- Example:
```
Vocabulary is 3 tokens
Token 0: "apple"
Token 1: "banana"
Token 2: "cherry"

logits = [2.0, 1.0, 0.1]

apple_numerator  = e^(2.0) = 7.389
banana_numerator = e^(1.0) = 2.718
cherry_numerator = e^(0.1) = 1.105

denominator = sum(e^(all_logits)) = 7.389 + 2.718 + 1.105 = 11.212
```
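Finishing the example above in Rust, dividing each numerator by the denominator to get the probabilities (apple ≈ 0.659, banana ≈ 0.242, cherry ≈ 0.099):

```rust
// Softmax: convert logits to probabilities that sum to 1.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let denominator: f32 = logits.iter().map(|l| l.exp()).sum(); // ≈ 11.212
    logits.iter().map(|l| l.exp() / denominator).collect()
}

fn main() {
    let logits = [2.0_f32, 1.0, 0.1]; // apple, banana, cherry
    let probs = softmax(&logits);
    // apple ≈ 7.389 / 11.212 ≈ 0.659
    assert!((probs[0] - 0.659).abs() < 0.001);
    assert!((probs.iter().sum::<f32>() - 1.0).abs() < 1e-6);
    println!("probs = {probs:?}");
}
```

Note the 2x logit gap between apple and banana becomes a ~2.7x probability gap: softmax helps the bigger score dominate.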
- Euler's number (`e`), approximately 2.71828, is a mathematical constant like `π`
- In neural networks, Euler's number appears in the softmax formula because Euler's number:
- Raised to any power is always positive (no negative numbers)
- Grows exponentially (large logits become much larger, small logits become tiny)
```
e^0 = 1
e^1 = 2.718
e^2 = 7.389
e^3 = 20.085
e^(-1) = 0.368
e^(-2) = 0.135
```
- SwiGLU stands for Swish Gated Linear Unit
- SwiGLU is an activation function used in feed-forward networks
- An activation function transforms a logit into an output
- ReLU stands for Rectified Linear Unit
- ReLU is an activation function
- IF `logit > 0` THEN `output = logit` ELSE `output = 0`
- `output = ReLU(logit) = max(0, logit)`
- ReLU stands for Rectified Linear Unit
- ReLU² stands for ReLU squared
- ReLU is an activation function
- Raise the ReLU to the power of 2
- ReLU² creates sparsity (many zeros) in the activations which is critical for ternary quantization
- Sigmoid is an activation function
- Output is always between 0 and 1
- `output = sigmoid(logit) = 1 / (1 + eulers_num^(-logit))`
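The three activation functions above, side by side (a sketch with hand-picked test values; the function names are just descriptive):

```rust
// ReLU: zero out negatives, pass positives through.
fn relu(logit: f32) -> f32 {
    logit.max(0.0)
}

// ReLU²: ReLU, then square — keeps the zeros (sparsity), grows positives.
fn relu_squared(logit: f32) -> f32 {
    let r = relu(logit);
    r * r
}

// Sigmoid: squash any logit into the range (0, 1).
fn sigmoid(logit: f32) -> f32 {
    1.0 / (1.0 + (-logit).exp())
}

fn main() {
    assert_eq!(relu(-2.0), 0.0); // negatives become 0
    assert_eq!(relu(3.0), 3.0); // positives pass through
    assert_eq!(relu_squared(-2.0), 0.0); // still 0: sparsity is preserved
    assert_eq!(relu_squared(3.0), 9.0); // then squared
    assert!((sigmoid(0.0) - 0.5).abs() < 1e-6); // sigmoid(0) = 0.5
    assert!(sigmoid(10.0) > 0.99 && sigmoid(-10.0) < 0.01); // always in (0, 1)
}
```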
- `(weight₁ * input₁) + (weight₂ * input₂) + ... + (weightₙ * inputₙ)`
- Multiply corresponding elements of two vectors, sum them, one number response
- Dot product is a measure of directional similarity
- How much u points in the direction of v
- A dot product:
  - `> 0` tells us that 2 vectors point roughly in the same direction
  - `= 0` tells us that 2 vectors are perpendicular
  - `< 0` tells us that 2 vectors point in opposite directions
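The three cases above, checked with small hand-picked 2D vectors (a sketch; any vectors with these orientations behave the same):

```rust
// Dot product as directional similarity.
fn dot(u: &[f32], v: &[f32]) -> f32 {
    u.iter().zip(v.iter()).map(|(a, b)| a * b).sum()
}

fn main() {
    // Roughly the same direction: positive
    assert!(dot(&[1.0, 1.0], &[2.0, 1.0]) > 0.0);
    // Perpendicular: exactly zero
    assert_eq!(dot(&[1.0, 0.0], &[0.0, 1.0]), 0.0);
    // Opposite directions: negative
    assert!(dot(&[1.0, 1.0], &[-1.0, -1.0]) < 0.0);
}
```

This is exactly the Q·K computation in attention: a query vector dotted with a key vector scores how related two tokens are.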
- Sparsity means most values w/in a vector are zero
- Dense vector = 0% sparsity = `[1,2,3]`
- Sparse vector = 50% sparsity = `[0,1,0,2]`
- Very sparse vector = 75% sparsity = `[0,0,0,1]`
- When a vector is sparse we can:
- Store only the non-zero values (less memory)
- Compute only where values are non-zero (less work)
- ⊙ (a circle w/ a dot) is the symbol for element-wise multiplication
- Multiply corresponding elements of two vectors, vector response, example:
```
a = [2, 4, 6]
b = [1, 3, 5]
a ⊙ b = [2×1, 4×3, 6×5] = [2, 12, 30]
```