- Be the default way developers create custom LLMs
- ✅ Create fundamentals for Ternary Quantization based on BitNet research:
  - ✅ Stub Poppins front doors
    - ✅ `bootstrap()`: Will create example `train.xml`
    - ✅ `train()`: Will create model based on `train.xml`
    - ✅ `infer()`: Will get response from model
    - ✅ `poppins bootstrap`: CLI command that calls `bootstrap()`
    - ✅ `poppins train`: CLI command that calls `train()`
    - ✅ `poppins infer`: CLI command that calls `infer()`
- ✅ Push to GitHub
- ✅ Push to crates.io
- ✅ Deploy `train.xsd` to a Cloudflare Worker
- ✅ `bootstrap()`
  - ✅ Accept an `output_dir_path` (default to `cwd`) & write example `train.xml`
  - ✅ May also be called via cli @ `poppins bootstrap`
    - ✅ CLI accepts `-o` or `--output` params for `output_dir_path`
- ✅ `BPETokenizer`
  - ✅ Write `tokenizer.json` based on `train.xml` samples
  - ✅ Add `bpe_requested_tokens` to `train.xml` constants
  - ✅ Add `bpe_min_merge_frequency` to `train.xml` constants
- ✅ Write `train()`:
  - ✅ Read training file (default to `train.xml`)
  - ✅ Parse `train.xml`
  - ✅ Validate `train.xml`
  - ✅ Create `TrainXML`
  - ✅ Write output directory (default to `.poppins`)
  - ✅ Create `Samples` (holds `training` & `validation` samples)
  - ✅ Write `output_dir/train_corpus.xml`
  - ✅ Write `output_dir/val_corpus.xml`
  - ✅ Write `output_dir/tokenizer.json`
  - ✅ Write `output_dir/train_corpus.bin`
  - ✅ Write `output_dir/val_corpus.bin`
  - ✅ Write `output_dir/train_index.bin`
  - ✅ Write `output_dir/val_index.bin`
  - Write `output_dir/manifest.json`
- ...
- MLA
- RMSNorm
- RoPE
- ReLU²
- KV Cache
- Memory
- Multi Turn
- Turso
- RLM
- Abstract Syntax Tree
- Predictable Performance:
  - Languages w/ a Garbage Collector (`Python`/`Java`/`JavaScript`) may pause during a model response for garbage collector maintenance. `Rust` does not have a garbage collector, so token generation during inference remains smooth
- Deploy Everywhere:
  - Compile to `WASM` to run in the browser
  - Compile to native `iOS` & `Android` libraries to run in mobile applications
  - Deploy to small devices (ex: `Raspberry Pi`) b/c no operating system is required
- Concurrency:
  - Python's Global Interpreter Lock (GIL) prevents true parallelism. `Rust` can use all CPU cores efficiently, which helps us scale optimally
- Peace:
  - C++ solutions like `llama.cpp` (typically called from Python via `llama-cpp-python`) can crash with memory errors that are hard to debug
  - With `Rust`, developers never see errors like segmentation faults, memory corruption or hard-to-debug crashes in production, b/c `Rust` guarantees memory safety at the language level
- Weights are the learned parameters (numbers)
- Weights are updated during training & fixed during inference
- Raw Weights are `f32` (big numbers that require 4 bytes to store)
- Quantized Ternary Weights are `-1`, `0` or `1` (require 2 bits to store)
- Gradient descent is the process of optimizing weights
- With machine learning, at the beginning of training weights are random numbers
- Then a prediction is made
- Then we compute the error
- Then we adjust the weights to reduce error
- How much we adjust the weights is based on the learning rate
- The gradient tells us the direction and magnitude to change the weight (positive means increase, negative means decrease)
- The learning rate is a small number (ex: 0.001) that controls how much we trust the gradient
- If the learning rate is too large, weights jump around and never settle (divergence)
- If learning rate is too small, training takes forever
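The steps above can be sketched in a few lines of Rust. This is a minimal illustration, not Poppins' actual training loop: the input `x`, target, starting weight and `learning_rate` are made-up values chosen so the answer is easy to check.

```rust
// Minimal gradient descent sketch: learn `weight` so that
// prediction = weight * x matches the target (the true weight is 3.0).
fn main() {
    let x = 2.0_f32; // input
    let target = 6.0_f32; // desired output
    let mut weight = 0.0_f32; // starts "random" (here: zero for simplicity)
    let learning_rate = 0.1_f32; // how much we trust the gradient

    for _ in 0..100 {
        let prediction = weight * x; // make a prediction
        let error = prediction - target; // compute the error
        let gradient = error * x; // direction & magnitude to change the weight
        weight -= learning_rate * gradient; // adjust the weight to reduce error
    }

    println!("learned weight ≈ {weight}"); // converges toward 3.0
    assert!((weight - 3.0).abs() < 1e-3);
}
```

Try raising `learning_rate` to `0.5` and the weight diverges instead of settling; lower it to `0.0001` and 100 steps are nowhere near enough — the two failure modes described above.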
- Deep learning is a machine learning architecture w/ many layers (3 to hundreds)
- Each layer transforms the data
- Each layer learns different patterns
- Each layer builds on the previous layer's representations
- AI is a system that receives inputs and provides outputs using learned weights
- With traditional programming a human writes a function to identify cats
- With AI programming a model attempts to identify a cat, adjusts weights & repeats till it's good at identifying cats
- A model is an instance of a neural network that has been trained w/ samples, can receive inputs (prompts) and provides quality outputs (responses)
- A neural network is a mathematical function that transforms an input into an output through a series of calculations
- A neural network's mathematical function includes weights and biases that are used to calculate the output
- At the beginning of training the weights and biases are random & through training these numbers get good enough to produce quality outputs
- An LLM is a Large Language Model
- An LLM is a specific type of neural network designed to work with language (text)
- The LLM receives an input (prompt) and gives back a probability distribution over the next token. Then the LLM receives another input (prompt + last token) and gives back another probability distribution. This continues till the most likely next token is a stop token.
- Attention computes how much each token should pay attention to all other tokens w/in a sequence
- Each token w/in the sequence is given 3 vectors: the query, key and value vectors
- Attention refers to the weights (probabilities) that determine how much information to take from all visible tokens
- Logits
  - Raw dot products (Q·K) that give us a single score for each token
- Attention weights are attention scores after softmax
- Attention weights are probabilities that sum to 1
- Attention weights tell us 'this token contributes `attention_weight` percent of its `Value` to the `output` for the current token'
- Attention output is the weighted sum of values using the attention weights
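The score → weight → output pipeline above can be sketched for a single query token. The scores and value numbers below are invented for illustration (real values are vectors, and scores come from Q·K dot products):

```rust
// Toy attention for one query token over a 3-token sequence.
fn softmax(scores: &[f32]) -> Vec<f32> {
    // Convert raw scores into probabilities that sum to 1
    let sum: f32 = scores.iter().map(|s| s.exp()).sum();
    scores.iter().map(|s| s.exp() / sum).collect()
}

fn main() {
    // Raw attention scores: dot products Q·K for each visible token
    let scores = [2.0_f32, 1.0, 0.1];
    // Attention weights: probabilities that sum to 1
    let weights = softmax(&scores);
    // Value for each of the 3 tokens (1 number each to keep it tiny)
    let values = [10.0_f32, 20.0, 30.0];
    // Attention output: weighted sum of values using the attention weights
    let output: f32 = weights.iter().zip(values.iter()).map(|(w, v)| w * v).sum();

    assert!((weights.iter().sum::<f32>() - 1.0).abs() < 1e-6);
    println!("weights = {weights:?}, output = {output}");
}
```

Note how the token with the biggest score (2.0) dominates the output: softmax turns a modest score gap into a large probability gap.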
- A Transformer is a deep learning architecture where each token w/in a sequence is aware of all other tokens w/in the sequence (Attention)
- A Query vector is given to a token and answers: what is this token looking for in other tokens?
- A Query vector helps us search for related tokens w/in a sequence by comparing the Query vector of the current token w/ the Key vector of other tokens
- During inference we get a query vector for the last token and compare to all other tokens
- During training we get a query vector for all tokens w/in the AI response and compare to all other tokens simultaneously
- A Key vector contains information about what a token offers to others
- We match the Key vector w/ the Query vector to determine if there is a relationship between 2 tokens
- A Value vector contains the actual data that will be passed forward if this token is selected
- Token selection in attention identifies which past tokens are most relevant to the current token
- Token prediction in attention identifies what token is most likely to come after the current token
- Training is the process of creating a model that makes useful next token predictions
- In the training process we show the model samples and let it learn from its mistakes (adjust its weights and biases)
- A sample is a simple training example that includes at least 1 prompt and 1 model response
- A sample may also include code snippets and sources
- A multi-turn sample is a sample w/ multiple prompts and responses, to teach the model how to:
- Have a conversation
- Ask good follow up questions
- Build on previous responses
- A corpus is a collection of samples
What is a hidden state?
- Math annotation is `h`
- The hidden state is the "current understanding" of the input as it flows through the model
- Input tokens start as embeddings (not yet hidden states)
- After passing through the first transformer layer, they become hidden states
- Each layer transforms the hidden state further
- A hidden state is a token vector after it has passed through at least one layer
- Hidden b/c
- Internal
- Not directly visible
- Intermediate representations
- Identifies what the model “knows” about the sequence
What is the final hidden state?
- The final hidden state (last layer's output) is what gets multiplied by output weights to predict the next token
- Annotation: `W_out[i]`
- An Output Projection Vector is a vector of weights of `embedding_dim` length for a token that identifies what hidden state pattern predicts this token
- Each token w/in the model's vocabulary has an Output Projection Vector
- When we multiply the Output Projection Vector with the hidden state, we get a score indicating how well the hidden state matches the token
- Annotation: `W_out`
- Output Projection Matrix is `embedding_dim` wide and `vocab_size` tall (`[vocab_size, embedding_dim]`)
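The multiplication described above can be sketched with made-up numbers: a tiny vocabulary of 2 tokens, `embedding_dim = 3`, and hand-picked weights (not how Poppins computes them, just the shape of the operation):

```rust
// Next-token logits from the final hidden state.
// W_out has shape [vocab_size, embedding_dim].
fn main() {
    let hidden: [f32; 3] = [0.5, -1.0, 2.0]; // final hidden state, embedding_dim = 3
    let w_out: [[f32; 3]; 2] = [
        [1.0, 0.0, 1.0],  // Output Projection Vector for token 0
        [0.0, 1.0, -1.0], // Output Projection Vector for token 1
    ];
    // Each logit is the dot product of the hidden state with one row of W_out
    let logits: Vec<f32> = w_out
        .iter()
        .map(|row| row.iter().zip(hidden.iter()).map(|(w, h)| w * h).sum())
        .collect();
    // token 0: 0.5*1 + (-1)*0 + 2*1 = 2.5; token 1: 0 + (-1) + (-2) = -3.0
    assert_eq!(logits, vec![2.5, -3.0]);
    println!("logits = {logits:?}");
}
```

Token 0 scores higher, meaning its projection vector matches the hidden state pattern better, so it is the more likely next token.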
- A Linear Layer Row is a vector of `input_dim` length that identifies what hidden state pattern activates this neuron
- Each neuron w/in a layer has a Linear Layer Row
- When we multiply the Linear Layer Row with the hidden state, we get a score indicating how well the hidden state matches the neuron
- A Linear Layer Matrix is `input_dim` wide and `output_dim` tall (`[output_dim, input_dim]`)
- A bias is a single number added during the output calculation
  - Updated during training & fixed during inference
  - The bias is a constant number added after the weighted sum
- Each token has a bias and each neuron has a bias
- Tells us how likely a token / neuron is in general
- Small bias -> token / neuron rarely appears
- High bias -> token / neuron appears often in many contexts
- An output bias is computed during training, is a unique value for each token and identifies baseline tendencies for a token
- High bias -> token appears often in many contexts
- Small bias -> token rarely appears
- Inference is when we use a trained model to generate a response
- A token is a piece of text that the model understands as a single unit
- Tokens can be:
- Words
- Parts of words
- Punctuation
- Individual characters
- Spaces are typically attached to the following word & not separate tokens
- A tokenizer is a tool that converts text into numbers (and back)
- Computers don't understand words like "hi" - they only understand numbers
- A tokenizer finds the middle ground:
- IF we give every word a unique number THEN we need a very large dictionary & can't handle words we've never seen
- IF we give every character a unique number THEN we lose word meanings
- A tokenizer splits text into pieces called "tokens" and gives each token a unique number ID
- BPE stands for Byte Pair Encoding
- BPE is a method for deciding how to split text into tokens
- BPE learns from the corpus which character & token combinations appear most frequently together, then merges them into tokens
- BPE training can merge any two adjacent non-special tokens, regardless of whether they're requested or learned
- Start with individual characters, spaces and punctuation marks as separate tokens
- Count how often each adjacent pair of tokens appears next to each other in the entire corpus
- Find the most frequent pair
- IF the most frequent pair occurs more than `MIN_MERGE_FREQUENCY` (ex: 3) times THEN merge them into a new token and repeat the process ELSE stop merging
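One iteration of the count → find → merge loop above, as a sketch. The toy corpus, threshold value and function names are illustrative, not Poppins' tokenizer code:

```rust
use std::collections::HashMap;

// One BPE training step: count adjacent token pairs across the corpus,
// then report the most frequent pair and how often it occurred.
const MIN_MERGE_FREQUENCY: usize = 3;

fn most_frequent_pair(tokens: &[&str]) -> Option<((String, String), usize)> {
    let mut counts: HashMap<(String, String), usize> = HashMap::new();
    for pair in tokens.windows(2) {
        *counts
            .entry((pair[0].to_string(), pair[1].to_string()))
            .or_insert(0) += 1;
    }
    counts.into_iter().max_by_key(|(_, count)| *count)
}

fn main() {
    // "hello hello hello" split into characters (spaces kept as tokens)
    let tokens = vec![
        "h", "e", "l", "l", "o", " ", "h", "e", "l", "l", "o", " ", "h", "e", "l", "l", "o",
    ];
    let ((a, b), count) = most_frequent_pair(&tokens).unwrap();
    if count >= MIN_MERGE_FREQUENCY {
        // e.g. merge 'l' + 'o' (seen 3 times) into 'lo'
        println!("merge '{a}' + '{b}' (seen {count} times) into '{a}{b}'");
    }
    assert_eq!(count, 3);
}
```

In this corpus several pairs tie at 3 occurrences ("he", "el", "ll", "lo"); a real trainer needs a deterministic tie-break rule so training is reproducible.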
- Merge rules tell us how to build bigger tokens from smaller ones
- When we get NEW text (not in the training data) (like a user prompt), we apply the merge rules to tokenize the text
- Example: the sequence may contain `console.log("hi world")` and the vocab may have both `console` & `console.log`; the merge rules ensure the sequence uses the largest token available in the vocab (`console.log`)
- After a merge, the smaller pre-merge tokens remain in the vocabulary
- Split prompt into characters, spaces and punctuation marks
- Apply merge rules in the exact same order they were learned to build tokens & ensure consistency
- Look up each token in the vocabulary to get its token ID
- Look up each token embedding based on the token ID
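The split-then-merge steps above can be sketched as follows. The two merge rules are invented for the example; a real tokenizer would load them from `tokenizer.json`:

```rust
// Tokenizing new text: split into characters, then apply merge rules
// in the exact same order they were learned.
fn apply_merges(mut tokens: Vec<String>, merges: &[(&str, &str)]) -> Vec<String> {
    for (a, b) in merges {
        let mut merged: Vec<String> = Vec::new();
        let mut i = 0;
        while i < tokens.len() {
            if i + 1 < tokens.len() && tokens[i] == *a && tokens[i + 1] == *b {
                merged.push(format!("{a}{b}")); // merge the pair into one token
                i += 2;
            } else {
                merged.push(tokens[i].clone()); // keep the token as-is
                i += 1;
            }
        }
        tokens = merged;
    }
    tokens
}

fn main() {
    // Learned merge rules, in order: "l"+"o" -> "lo", then "lo"+"w" -> "low"
    let merges = [("l", "o"), ("lo", "w")];
    // Step 1: split the prompt into characters
    let chars: Vec<String> = "low".chars().map(|c| c.to_string()).collect();
    // Step 2: apply merge rules in learned order
    let tokens = apply_merges(chars, &merges);
    assert_eq!(tokens, vec!["low".to_string()]);
    println!("{tokens:?}");
}
```

Because the rules apply in learned order, the same input text always produces the same token sequence — which is what keeps training and inference consistent.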
- Special tokens are pre-added to vocabulary
- Structural tokens that define the corpus format (ex: `<sample>`, `</sample>`)
- They can NOT be merged with adjacent tokens to form larger tokens
- They appear in the token sequence as single units
- Requested tokens are pre-added to vocabulary
- They can still be merged with adjacent tokens to form larger tokens
- They appear in the token sequence as single units (unless merged to make even larger tokens)
- Example: If `console.log` is a requested token then it starts as one token, & during BPE training, if `console.log` + `(` appears frequently then BPE can still merge into `console.log(`
- Embedding is the process of turning a token into a token embedding
- A token embedding is a vector of numbers that represents the meaning of a token
- A vector is a list of numbers (ex: `[1.5, 0, -2.3]`)
- Embedding dimension is the length of a token embedding vector
- More dimensions = more expressive power = more memory and computation
- Slot / Index / Position w/in a vector
- Where x, y & z meet
- A basis axis is a unique direction from the origin that aligns w/ a dimension
- Unique meaning no 2 basis axes w/in a vector share the same orientation
- Models distribute meaning (nouniness, verbiness, pronouniness, animalness) across dimensions (basis axes)
- What each dimension represents is not human defined, only the number of allowed dimensions (embedding dimension) is human defined
- A latent feature is a pattern the model discovered during training that:
- Is not explicitly named
- We did not manually define
- Exists only as numbers inside the network
- Each dimension w/in a vector captures a pattern in the data, we do not know what that pattern is and there is no guarantee that it corresponds to a clean human concept (ex: animalness)
- The model discovers useful internal dimensions automatically
- Every dimension in embeddings, hidden states & neuron outputs is a latent feature
- Orientation is what way an arrow points from the origin
- Independent of magnitude
- An input is the token embeddings for all tokens w/in a sequence
- A model moves the input through layers to comprehend the input & then predict the next token
- An input is a matrix of numbers (width = embedding dimension, height = sequence length) that comes from somewhere, that somewhere might be the:
- Original sequence
- Output of a previous layer
- A sequence is an ordered list of tokens
- A matrix is a rectangular grid of numbers with rows and columns
- `x1 * w1`
- `h1 * w1`
- A weighted input is the result of multiplying an input by its corresponding weight
- An output is a vector that is provided by a layer after aligning inputs w/ ternary weights
- The output size is equal to the number of neurons in a layer
- During attention & ffn compress the output size is equal to the embedding dimension
- During ffn expand the output size is equal to the embedding dimension * 4
```
input = [2.0, 1.5, 0.5]
weights = [
    [1, 0, -1],  // neuron 0
    [0, 1, -1],  // neuron 1
    [-1, 1, -1], // neuron 2
]
output[0] = (2.0×1) + (1.5×0) + (0.5×-1) = 2.0 + 0.0 - 0.5 = 1.5
output[1] = (2.0×0) + (1.5×1) + (0.5×-1) = 0.0 + 1.5 - 0.5 = 1.0
output[2] = (2.0×-1) + (1.5×1) + (0.5×-1) = -2.0 + 1.5 - 0.5 = -1.0
output = [1.5, 1.0, -1.0] // length 3
```
- Collection of neurons that process data simultaneously
- Each token goes through each neuron in a layer
- The number of neurons w/in a layer is equal to the output_size
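The worked example above can be reproduced as runnable Rust (a sketch of the per-layer calculation, not Poppins' actual layer code):

```rust
// One layer with 3 ternary-weight neurons applied to a 3-number input.
fn main() {
    let input = [2.0_f32, 1.5, 0.5];
    let weights: [[f32; 3]; 3] = [
        [1.0, 0.0, -1.0],  // neuron 0
        [0.0, 1.0, -1.0],  // neuron 1
        [-1.0, 1.0, -1.0], // neuron 2
    ];
    // Each neuron's output is the dot product of the input with its weight row,
    // so the output length equals the number of neurons in the layer.
    let output: Vec<f32> = weights
        .iter()
        .map(|row| row.iter().zip(input.iter()).map(|(w, x)| w * x).sum())
        .collect();
    assert_eq!(output, vec![1.5, 1.0, -1.0]);
    println!("output = {output:?}");
}
```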
- 32-bit floating point numbers (`f32`) can represent most numbers with high precision, example: `0.000000001`, `3.1415926535`, `-0.999999999`
- Precise (`f32`) numbers take up 4 bytes (32 bits) for every value
- Quantization is the process of taking numbers that need many bits (`3.1415926535`) and mapping them to numbers that need fewer bits (`3.14`)
- Ternary means 3 possible values
- Ternary Quantization is a technique where we keep raw weights in `f32` & create quantized weights that can be `-1`, `0` or `+1`
- Raw values require 32 bits but quantized values (what we'll use in inference) only require 2 bits to tell us the value (`0b01` tells us `-1`, `0b00` tells us `0` & `0b10` tells us `+1`)
  - `0b` is a prefix that tells the computer "there are binary digits coming next"
  - Not a rule from outside of Poppins, just a choice; the on and off bits can represent whatever values we want them to
- `+1`: This input matters a lot, add its effect
- `0`: Ignore this input entirely
- `-1`: This input matters, but in the OPPOSITE direction
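A sketch of quantizing raw `f32` weights to `{-1, 0, +1}` and packing them as 2 bits each. The threshold rule and raw values are illustrative assumptions (BitNet-style methods scale by the mean absolute weight; this toy version just compares against a fixed threshold), while the bit patterns match the `0b01`/`0b00`/`0b10` choice above:

```rust
// Map an f32 raw weight to a ternary value using a simple threshold rule
// (illustrative only; real schemes derive the threshold from the weights).
fn quantize(raw: f32, threshold: f32) -> i8 {
    if raw > threshold {
        1
    } else if raw < -threshold {
        -1
    } else {
        0
    }
}

// Encode one ternary value as 2 bits, per the document's convention.
fn encode(q: i8) -> u8 {
    match q {
        -1 => 0b01,
        0 => 0b00,
        1 => 0b10,
        _ => unreachable!(),
    }
}

fn main() {
    let raw = [0.73_f32, -0.02, -0.58, 0.01]; // 4 values × 32 bits = 128 bits
    let quantized: Vec<i8> = raw.iter().map(|&w| quantize(w, 0.1)).collect();
    assert_eq!(quantized, vec![1, 0, -1, 0]);

    // Pack four 2-bit values into one byte: 16x smaller than four f32s
    let packed: u8 = quantized
        .iter()
        .enumerate()
        .fold(0u8, |acc, (i, &q)| acc | (encode(q) << (i * 2)));
    assert_eq!(packed, 0b0001_0010);
    println!("packed byte = {packed:#010b}");
}
```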
- A neuron is a function that learns to detect different features (patterns)
- This function has 1 weight vector and 1 bias number
- A neuron's job is to accept an embedding vector and provide a number that represents this token's score as it relates to a particular feature
- Calculation is dot product between input & weights
- Each neuron learns a different pattern & together they form a massive pattern-detection system
- The absolute value of a number is its distance from zero, ignoring whether it's positive or negative
- Calculation is dot product + bias
- `logit = (weight₁ * input₁) + (weight₂ * input₂) + ... + (weightₙ * inputₙ) + bias`
- When predicting the next token, the model computes logits for all tokens w/in its vocabulary
- Larger logits mean the model thinks that token is more likely
- Logits aren't probabilities (they don't sum to 1 & they can be negative)
- Logits tell us how compatible a token is w/ the current hidden state
- Softmax is the process of converting logits to probabilities
- Softmax helps the much bigger score dominate
- `probability = eulers_num^(logit) / sum(eulers_num^(all_logits))`
  - Numerator is Euler's number raised to the logit for one token
  - Denominator is the sum of Euler's number raised to the logit for all tokens
- Example:
```
Vocabulary is 3 tokens
Token 0: "apple"
Token 1: "banana"
Token 2: "cherry"

logits = [2.0, 1.0, 0.1]

apple_numerator  = e^(2.0) = 7.389
banana_numerator = e^(1.0) = 2.718
cherry_numerator = e^(0.1) = 1.105

denominator = sum(e^(all_logits)) = 7.389 + 2.718 + 1.105 = 11.212
```
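Finishing the example above in Rust, dividing each numerator by the denominator to get the probabilities (apple ≈ 0.659, banana ≈ 0.242, cherry ≈ 0.099):

```rust
// Softmax: convert logits to probabilities that sum to 1.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let denominator: f32 = logits.iter().map(|l| l.exp()).sum(); // ≈ 11.212
    logits.iter().map(|l| l.exp() / denominator).collect()
}

fn main() {
    let logits = [2.0_f32, 1.0, 0.1]; // apple, banana, cherry
    let probs = softmax(&logits);
    // apple ≈ 7.389 / 11.212 ≈ 0.659
    assert!((probs[0] - 0.659).abs() < 0.001);
    assert!((probs.iter().sum::<f32>() - 1.0).abs() < 1e-6);
    println!("probs = {probs:?}");
}
```

Note the 2x logit gap between apple and banana becomes a ~2.7x probability gap: softmax helps the bigger score dominate.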
- Euler's number (`e`), approximately 2.71828, is a mathematical constant like `π`
- In neural networks, Euler's number appears in the softmax formula because Euler's number:
- Raised to any power is always positive (no negative numbers)
- Grows exponentially (large logits become much larger, small logits become tiny)
```
e^0 = 1
e^1 = 2.718
e^2 = 7.389
e^3 = 20.085
e^(-1) = 0.368
e^(-2) = 0.135
```
- SwiGLU stands for Swish Gated Linear Unit
- SwiGLU is an activation function used in feed-forward networks
- An activation function transforms a logit into an output
- ReLU stands for Rectified Linear Unit
- ReLU is an activation function
- IF `logit > 0` THEN `output = logit` ELSE `output = 0`
- `output = ReLU(logit) = max(0, logit)`
- ReLU stands for Rectified Linear Unit
- ReLU² stands for ReLU squared
- ReLU is an activation function
- Raise the ReLU to the power of 2
- ReLU² creates sparsity (many zeros) in the activations which is critical for ternary quantization
- Sigmoid is an activation function
- Output is always between 0 and 1
- `output = sigmoid(logit) = 1 / (1 + eulers_num^(-logit))`
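The three activation functions above, side by side (a sketch with hand-picked test values; the function names are just descriptive):

```rust
// ReLU: zero out negatives, pass positives through.
fn relu(logit: f32) -> f32 {
    logit.max(0.0)
}

// ReLU²: ReLU, then square — keeps the zeros (sparsity), grows positives.
fn relu_squared(logit: f32) -> f32 {
    let r = relu(logit);
    r * r
}

// Sigmoid: squash any logit into the range (0, 1).
fn sigmoid(logit: f32) -> f32 {
    1.0 / (1.0 + (-logit).exp())
}

fn main() {
    assert_eq!(relu(-2.0), 0.0); // negatives become 0
    assert_eq!(relu(3.0), 3.0); // positives pass through
    assert_eq!(relu_squared(-2.0), 0.0); // still 0: sparsity is preserved
    assert_eq!(relu_squared(3.0), 9.0); // then squared
    assert!((sigmoid(0.0) - 0.5).abs() < 1e-6); // sigmoid(0) = 0.5
    assert!(sigmoid(10.0) > 0.99 && sigmoid(-10.0) < 0.01); // always in (0, 1)
}
```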
- `(weight₁ * input₁) + (weight₂ * input₂) + ... + (weightₙ * inputₙ)`
- Multiply corresponding elements of two vectors, sum them, one number response
- Dot product is a measure of directional similarity
- How much u points in the direction of v
- A dot product:
  - `> 0` tells us that 2 vectors point roughly in the same direction
  - `= 0` tells us that 2 vectors are perpendicular
  - `< 0` tells us that 2 vectors point in opposite directions
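The three cases above, checked with small hand-picked 2D vectors (a sketch; any vectors with these orientations behave the same):

```rust
// Dot product as directional similarity.
fn dot(u: &[f32], v: &[f32]) -> f32 {
    u.iter().zip(v.iter()).map(|(a, b)| a * b).sum()
}

fn main() {
    // Roughly the same direction: positive
    assert!(dot(&[1.0, 1.0], &[2.0, 1.0]) > 0.0);
    // Perpendicular: exactly zero
    assert_eq!(dot(&[1.0, 0.0], &[0.0, 1.0]), 0.0);
    // Opposite directions: negative
    assert!(dot(&[1.0, 1.0], &[-1.0, -1.0]) < 0.0);
}
```

This is exactly the Q·K computation in attention: a query vector dotted with a key vector scores how related two tokens are.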
- Sparsity means most values w/in a vector are zero
- Dense vector = 0% sparsity = `[1,2,3]`
- Sparse vector = 50% sparsity = `[0,1,0,2]`
- Very sparse vector = 75% sparsity = `[0,0,0,1]`
- When a vector is sparse we can:
- Store only the non-zero values (less memory)
- Compute only where values are non-zero (less work)
- ⊙ (a circle w/ a dot) is the symbol for element-wise multiplication
- Multiply corresponding elements of two vectors, vector response, example:
```
a = [2, 4, 6]
b = [1, 3, 5]
a ⊙ b = [2×1, 4×3, 6×5] = [2, 12, 30]
```