Official transformer architecture support for Dlib #3124
Conversation
    {
        const double scale = std::sqrt(area/static_cast<double>(rect.area()));
        return centered_rect(rect, std::lround(rect.width()*scale), std::lround(rect.height()*scale));
        // Le compilateur sait maintenant que rect_area != 0
Guessing comments should be in English :)
    static constexpr long EMBEDDING_DIM = embedding_dim;
    static constexpr long PATCH_SIZE = 4;   // 32/4 = 8x8 = 64 patches
    static constexpr long NUM_PATCHES = 64; // (32/4)^2
    static constexpr long DONT_USE_ClASS_TOKEN = 0;
Why lower case l in ClASS?
    using namespace std;
    using namespace dlib;

    // Signal handling for clean termination
This signal handling seems to be duplicated in various different examples. Could it be extracted into a common utility?
Official transformer architecture support for Dlib
This pull request consolidates and stabilizes the transformer-related layers and components developed throughout 2024–2025. It introduces official support for modern language modeling in Dlib, positioning the library as a reference implementation for building neural networks for natural language processing.
The extensions have been iteratively refined over the past year, with each component tested and validated across multiple architectures and use cases. This work establishes the foundation for upcoming multimodal capabilities, with active development underway for vision transformers and combined text-image processing.
Future releases will introduce examples demonstrating transformer architectures for image processing, followed by multimodal fusion combining textual and visual information. This PR represents an important milestone that could justify a new version of Dlib to mark the official introduction of these features.
Overview
This pull request introduces complete transformer architecture support to Dlib, enabling modern language modeling capabilities while maintaining Dlib's philosophy of simple APIs and production-ready implementations. All components are written in standard C++14 for cross-platform compatibility.
Major additions
Core architectural components
Attention mechanisms:
Padding-aware attention:
- `tril_padding_context` for dynamic per-sample padding mask coordination
- `loss_cross_entropy_per_logit` ignore index support
Specialized layers:
- `linear` layer with plane-wise matrix multiplication for sequence processing
- `rms_norm` layer implementing efficient RMS normalization
- `reshape_to` layer for dimension manipulation without data copying
- `token_embeddings` layer combining embedding lookup with positional encoding
- `tril` layer for triangular mask generation
- `transpose` and `multm_prev` layers for attention computation
- `dropout_rate` layer with configurable per-layer dropout schedules
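To make the plane-wise convention concrete, here is a minimal illustrative sketch (not the layer's actual implementation) of the operation the `linear` layer performs on a single sample: the `(sequence_length, d_in)` plane is multiplied by a learned `(d_in, d_out)` weight matrix, independently for every sample in the batch. The shapes and values below are arbitrary placeholders.

```cpp
#include <dlib/matrix.h>

int main()
{
    using namespace dlib;

    // Illustrative shapes only.
    const long seq_len = 8, d_in = 16, d_out = 32;

    // One sample's (rows, cols) plane of token embeddings, and a projection.
    matrix<float> tokens  = uniform_matrix<float>(seq_len, d_in, 0.5f);
    matrix<float> weights = uniform_matrix<float>(d_in, d_out, 0.1f);

    // The plane-wise `linear` layer applies this matrix product to every
    // (rows, cols) plane in the batch.
    matrix<float> projected = tokens * weights;   // (seq_len x d_out)

    return (projected.nr() == seq_len && projected.nc() == d_out) ? 0 : 1;
}
```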
Vision Transformer support:
- `patch_embeddings` layer for image-to-sequence conversion with configurable patch size

Advanced architectures:
Optimization infrastructure
AdamW optimizer (`dlib/dnn/solvers.h`):

Learning rate scheduling (`lr_scheduler`):
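A rough usage sketch follows. `dnn_trainer`, `set_learning_rate`, `set_min_learning_rate`, and `set_mini_batch_size` are existing Dlib APIs; the `adamw` constructor arguments are an assumption modeled on Dlib's existing `adam(weight_decay, momentum1, momentum2)` and may differ from the solver actually added in this PR, and the `lr_scheduler` interaction is not shown here.

```cpp
#include <dlib/dnn.h>

using namespace dlib;

// A small toy network; any dlib network type would do here.
using net_type = loss_multiclass_log<fc<10, relu<fc<32, input<matrix<float>>>>>>;

int main()
{
    net_type net;

    // Hypothetical constructor arguments (weight decay, momentum terms);
    // the real adamw signature in dlib/dnn/solvers.h may differ.
    dnn_trainer<net_type, adamw> trainer(net, adamw(0.01f, 0.9f, 0.999f));

    trainer.set_learning_rate(1e-3);       // existing dnn_trainer API
    trainer.set_min_learning_rate(1e-6);
    trainer.set_mini_batch_size(64);

    // ... trainer.train(samples, labels); as with any other dlib solver.
    return 0;
}
```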
Language modeling utilities

Dataset preparation (`language_model_data.h`):
- `build_single_token_prediction_dataset()` for autoregressive training
- `build_multi_token_prediction_dataset()` for sequence-to-sequence tasks
- `shuffle_training_dataset()` for data randomization
- `augment_training_dataset()` for noise injection and robustness improvement
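The exact signatures of these helpers are not shown in this summary, so the snippet below only illustrates the kind of sliding-window construction that `build_single_token_prediction_dataset()` performs for autoregressive training, written against plain `std::vector`: every window of `max_seq_len` tokens becomes an input, and the token that follows it becomes the label. The function name and types here are a hypothetical stand-in, not the library's API.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative stand-in for build_single_token_prediction_dataset();
// the real helper in language_model_data.h likely differs in signature.
std::vector<std::pair<std::vector<int>, int>>
make_next_token_dataset(const std::vector<int>& tokens, std::size_t max_seq_len)
{
    std::vector<std::pair<std::vector<int>, int>> dataset;
    if (tokens.size() <= max_seq_len)
        return dataset;

    for (std::size_t i = 0; i + max_seq_len < tokens.size(); ++i)
    {
        // Input: tokens[i, i+max_seq_len); label: the token right after it.
        std::vector<int> window(tokens.begin() + i, tokens.begin() + i + max_seq_len);
        dataset.emplace_back(std::move(window), tokens[i + max_seq_len]);
    }
    return dataset;
}
```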
Inference management:
- `inference_context` class for autoregressive generation with a sliding window
Evaluation metrics:
- `compute_text_similarity()` combining all metrics
Preprocessing:
- `detect_file_type()` supporting 30+ formats via magic numbers and entropy analysis
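For readers unfamiliar with magic-number detection, the fragment below sketches only that half of the idea with a few common signatures; it is a plain-C++ illustration under assumed behavior, not the actual `detect_file_type()` implementation, which reportedly covers 30+ formats and adds entropy analysis.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Minimal magic-number check for a handful of formats (illustration only).
std::string guess_file_type(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<unsigned char> head(8, 0);
    in.read(reinterpret_cast<char*>(head.data()),
            static_cast<std::streamsize>(head.size()));

    if (head[0] == 0x89 && head[1] == 'P' && head[2] == 'N' && head[3] == 'G')  return "png";
    if (head[0] == 0xFF && head[1] == 0xD8 && head[2] == 0xFF)                  return "jpeg";
    if (head[0] == '%' && head[1] == 'P' && head[2] == 'D' && head[3] == 'F')   return "pdf";
    if (head[0] == 'P' && head[1] == 'K' && head[2] == 0x03 && head[3] == 0x04) return "zip";
    return "unknown";
}
```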
Complete transformer implementations

Canonical transformer (`canonical_transformer` namespace):
- `transformer_block` combining attention and feed-forward networks
- `transformer_stack` for building deep architectures

Fused transformer (`fused_transformer` namespace):
Loss functions

Cross-entropy per logit (`loss_cross_entropy_per_logit`):
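As a reference for what per-token cross-entropy with an ignore index means here, the sketch below averages the negative log-likelihood over a sequence while skipping positions whose label equals the ignore index. It is a plain-C++ illustration of the concept, not the layer's code, and the function name is hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// logits: [seq_len][vocab] raw scores; labels: [seq_len] target token ids.
// Positions whose label == ignore_index contribute nothing to the loss.
double cross_entropy_per_token(const std::vector<std::vector<double>>& logits,
                               const std::vector<long>& labels,
                               long ignore_index)
{
    double total = 0;
    long counted = 0;
    for (std::size_t t = 0; t < labels.size(); ++t)
    {
        if (labels[t] == ignore_index)
            continue;
        // log-sum-exp for a numerically stable softmax denominator
        double max_logit = logits[t][0];
        for (double v : logits[t]) max_logit = std::max(max_logit, v);
        double denom = 0;
        for (double v : logits[t]) denom += std::exp(v - max_logit);
        total += -(logits[t][labels[t]] - max_logit - std::log(denom));
        ++counted;
    }
    return counted ? total / counted : 0;
}
```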
Example programs

Five progressive examples demonstrate the capabilities:
slm_basic_train_ex.cpp: character-level transformer training on Shakespeare text. Demonstrates fundamental attention mechanics and memorization capability.
slm_advanced_train_ex.cpp: BPE tokenization with compact architecture. Introduces specialized loss function and byte-for-byte verification.
slm_mixture_of_experts_ex.cpp: sparse conditional computation with production-grade utilities. Demonstrates shuffle and augmentation utilities for robust training.
slm_chatbot_ex.cpp: conversational AI training pipeline with a two-stage approach. Demonstrates base language model pre-training followed by supervised fine-tuning on question-answer pairs. Includes stochastic text generation with configurable sampling strategies (temperature, repetition penalty, nucleus sampling; a generic sampling sketch follows this list) and layer-wise learning rate multipliers for efficient fine-tuning. Shows a practical implementation of an interactive chatbot with context management.
slm_vision_transformer_hybrid_ex.cpp: hybrid ViT combining patch embeddings with transformer encoder. Showcases two training processes: self-supervised feature learning and supervised learning.
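Since slm_chatbot_ex.cpp mentions temperature, repetition penalty, and nucleus (top-p) sampling, here is a compact, generic sketch of that sampling step. It operates on a plain `std::vector` of logits and is not the example's actual code; the function name and parameters are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

// Generic top-p sampling with temperature and repetition penalty.
// `logits` has one entry per vocabulary id (assumed non-empty);
// `recent` holds recently emitted token ids.
int sample_next_token(std::vector<double> logits, const std::vector<int>& recent,
                      double temperature, double repetition_penalty, double top_p,
                      std::mt19937& rng)
{
    // Penalize tokens that were generated recently.
    for (int id : recent)
        logits[id] = logits[id] > 0 ? logits[id] / repetition_penalty
                                    : logits[id] * repetition_penalty;

    // Temperature-scaled softmax (shift by the max logit for stability).
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> probs(logits.size());
    double sum = 0;
    for (std::size_t i = 0; i < logits.size(); ++i)
        sum += probs[i] = std::exp((logits[i] - max_logit) / temperature);
    for (double& p : probs) p /= sum;

    // Keep the smallest set of tokens whose cumulative probability reaches top_p.
    std::vector<int> order(probs.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(), [&](int a, int b) { return probs[a] > probs[b]; });

    std::vector<int> kept;
    std::vector<double> weights;
    double cum = 0;
    for (int id : order)
    {
        kept.push_back(id);
        weights.push_back(probs[id]);
        cum += probs[id];
        if (cum >= top_p) break;
    }

    // Sample from the truncated, renormalized distribution.
    std::discrete_distribution<int> pick(weights.begin(), weights.end());
    return kept[pick(rng)];
}
```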
Technical design
Matrix plane processing
Traditional Dlib layers operate channel-wise on 4D tensors. The extensions introduce plane-wise processing, where the `(rows, cols)` dimensions form semantic units for sequence data. This enables a batch of sequences to be represented as a `(batch, 1, sequence_length, embedding_dim)` tensor.
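For reference, the snippet below shows how such a batch maps onto Dlib's existing 4D tensor layout (num_samples, k, nr, nc) using `resizable_tensor`; `resizable_tensor` and `set_size()` are existing Dlib APIs, while the shape convention is taken from the description above and the concrete sizes are placeholders.

```cpp
#include <dlib/dnn.h>

int main()
{
    using namespace dlib;

    const long batch = 2, sequence_length = 128, embedding_dim = 64;

    // One sample per sequence, a single "channel", and the (rows, cols)
    // plane holding the (sequence_length x embedding_dim) token embeddings.
    resizable_tensor t;
    t.set_size(batch, 1, sequence_length, embedding_dim);

    // Plane-wise layers then treat each sample's plane as a matrix.
    return (t.num_samples() == batch && t.nr() == sequence_length) ? 0 : 1;
}
```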
Variable-length sequence handling

Training with batched sequences of different lengths requires coordinated masking (a sketch of the per-step call order follows this list):
- `tril_padding_context::set()` computes per-sample padding lengths before the forward pass
- the `tril` layer consults the context to extend the causal mask over padding tokens
- `loss_cross_entropy_per_logit::set_ignore_index()` excludes padding from the loss computation
- `tril_padding_context::clear()` resets the context after the training step
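A per-step workflow might look roughly like the fragment below. `train_one_step()` is the existing `dnn_trainer` API; the `tril_padding_context` and `set_ignore_index()` calls use the names from this description with guessed signatures, and `compute_padding_lengths()` is a hypothetical helper, so treat this as a sketch of the intended call order rather than the exact interface.

```cpp
// Sketch of one training step with padded batches (guessed signatures).
//
// samples : a mini-batch of token sequences, already padded to equal length
// labels  : the corresponding shifted targets
// pad_id  : the token id used for padding

// 1) Tell the masking machinery how much padding each sample carries
//    (hypothetical signature; helper name invented for this sketch).
tril_padding_context::set(compute_padding_lengths(samples, pad_id));

// 2) Exclude padded positions from the loss (ignore-index support added by
//    this PR; exact accessor guessed).
net.loss_details().set_ignore_index(pad_id);

// 3) Run the usual dlib training step; the tril layer consults the context
//    to extend the causal mask over padded tokens during the forward pass.
trainer.train_one_step(samples, labels);

// 4) Reset the per-batch padding state before the next step.
tril_padding_context::clear();
```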
Implementation approach

All components follow Dlib's design patterns:
Testing and validation
The example programs demonstrate:
Main files modified/added
New headers:
- `dlib/dnn/transformer.h` - complete transformer implementations
- `dlib/dnn/layers_transformer.h` - specialized layers for sequence processing
- `dlib/dnn/language_model_data.h` - utilities for dataset preparation and evaluation
- `dlib/tokenizer/bpe_tokenizer.h` - byte-pair encoding tokenization
- `dlib/dnn/solvers.h` - AdamW optimizer addition

New examples:
- `examples/slm_basic_train_ex.cpp`
- `examples/slm_advanced_train_ex.cpp`
- `examples/slm_mixture_of_experts_ex.cpp`
- `examples/slm_chatbot_ex.cpp`
- `examples/slm_data.h` - internal datasets for the examples
- `examples/slm_vision_transformer_hybrid_ex.cpp`

Abstract documentation:
- `docs/layers_abstract.h` - layer specifications and usage patterns
- `docs/transformer_abstract.h` - transformer architecture documentation
- `docs/language_model_data_abstract.h` - language modeling utility documentation
- `docs/solvers_abstract.h` - AdamW optimizer specification

Extended documentation
For more details, see the dedicated repository: https://github.com/Cydral/Dlib-Transformer-extensions
This contribution establishes official transformer support in Dlib, extending the library into modern natural language processing while maintaining its core values of simplicity, performance, and production readiness. The groundwork laid here enables upcoming vision transformer implementations and multimodal architectures, positioning Dlib as a comprehensive framework for contemporary deep learning applications.