```
 _____ _             _____ _____ ____
|  ___(_)_ __       |  ___|  ___| __ )
| |_  | | '_ \ _____| |_  | |_  |  _ \
|  _| | | | | |_____|  _| |  _| | |_) |
|_|   |_|_| |_|     |_|   |_|   |____/
```
A fast, shallow, and wide financial embedding model built for local usage on consumer hardware.
> **TODO:** Add photo of model architecture & benchmark/loss.
Financial forecasting and analysis require encoding a massive volume of text: from news headlines and earnings reports to long-form SEC filings. Existing models are often too deep for efficient local inference or too general to capture niche financial nuance.
Fin-FFB is designed to be a "shallow-but-wide" specialist. By trading depth for width, we achieve high-dimensional representational capacity while keeping latency low enough to run on a standard laptop. Its primary application is extracting the sentiment of the day and encoding complex financial reports locally.
The architecture disrupts the standard serial BERT design in favor of parallelized blocks and selective information flow:
- Parallel Computation: Attention and FFN blocks are computed concurrently on the same normalized input and then summed, following the PaLM/GPT-J approach. This maximizes hardware utilization.
- Full Attention Residuals (AttnRes): Instead of standard additive residuals ($h_l = h_{l-1} + f_{l-1}$), we use a depth-wise selective aggregation mechanism. A learned pseudo-query vector decides how much information to "pull" from every preceding layer (including the initial embedding).
- Amygdala Salience Filtering (Layer 0): A biologically inspired pre-processing layer that acts as a high-pass filter. It uses sharp, low-temperature latent attention to identify and highlight the most salient tokens (e.g., critical financial terms, "shocks") before the deeper transformer layers process the sequence.
- Gated Attention: Implements a sigmoid gating mechanism on the attention output to improve non-linearity and filter noise.
- Bidirectional ALiBi: Uses Attention with Linear Biases (ALiBi) instead of positional embeddings, allowing for sequence length extrapolation and better handling of long financial documents.
- Modern Primitives: Uses RMSNorm for stability and SwiGLU activations in the FFN for better gradient flow.
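The Bidirectional ALiBi point above can be sketched as follows (slope schedule from the ALiBi paper for power-of-two head counts; the function name and shapes are illustrative, not the repo's `alibi_utils.py` API):

```python
import torch

def bidirectional_alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Build a symmetric ALiBi bias: each head penalizes attention
    linearly with absolute token distance instead of using positional
    embeddings, so the model can extrapolate to longer sequences."""
    # Geometric slope schedule from the ALiBi paper (power-of-two head counts).
    start = 2.0 ** (-8.0 / num_heads)
    slopes = torch.tensor([start ** (i + 1) for i in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs()        # |i - j|, no causal mask
    return -slopes[:, None, None] * distance[None, :, :]  # (heads, seq, seq)
```

The result is added to the raw attention logits before the softmax; because the bias depends only on relative distance, a model trained at 1024 tokens can still be evaluated on longer documents.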
According to the `large.yaml` configuration, the primary "Fat" target is:
| Parameter | Value |
|---|---|
| Vocabulary Size | 30,000 (ALBERT-base-v2) |
| Hidden Dimension ($d_{model}$) | 1024 |
| Layers | 3 (+ Amygdala Layer 0) |
| Attention Heads | 16 ($d_{head} = 64$) |
| FFN Expansion | 4x ($d_{ffn} = 4096$) |
| Context Window | 1024 Tokens |
| Activation | SwiGLU |
| Normalization | RMSNorm |
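Concretely, these targets map onto a config along the following lines (key names here are assumptions for illustration; see `config/large.yaml` for the actual schema — only the values come from the table above):

```yaml
# Illustrative sketch of the "Fat" target; key names are assumed.
vocab_size: 30000        # ALBERT-base-v2 tokenizer
d_model: 1024            # hidden dimension
n_layers: 3              # plus the Amygdala Layer 0
n_heads: 16              # 64-dim heads
ffn_mult: 4              # d_ffn = 4096
max_seq_len: 1024
activation: swiglu
norm: rmsnorm
```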
Traditional Transformers compute Attention followed by an FFN in a serial bottleneck. Fin-FFB instead computes both concurrently on the same normalized input ($x_{norm} = \text{RMSNorm}(h_l)$), which shortens the sequential path and improves hardware utilization during training.
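A minimal PyTorch sketch of the idea (LayerNorm stands in for RMSNorm and a plain GELU MLP for the SwiGLU FFN; the real implementation lives in `model/attn_core.py`):

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Attention and FFN both read the same normalized input and their
    outputs are summed, replacing the usual serial Attn -> FFN path."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)   # RMSNorm in the actual model
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Stand-in FFN; the actual model uses SwiGLU with a 4x expansion.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        x_norm = self.norm(h)                       # shared normalized input
        attn_out, _ = self.attn(x_norm, x_norm, x_norm, need_weights=False)
        ffn_out = self.ffn(x_norm)                  # no dependence on attn_out
        return h + attn_out + ffn_out               # one summed residual
```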
To increase the non-linearity of the attention block and provide better noise filtering for dense financial text, we implement a sigmoid gating mechanism. The raw attention output is scaled by a learned gate projection derived from the normalized input:
- Gate Projection: $\text{Gate} = \sigma(W_g \cdot x_{norm})$
- Selective Filtering: $\text{Output} = W_o \cdot (\text{Gate} \odot \text{Attn}(x_{norm}))$
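A hedged sketch of the gate (projection names follow the formulas above; note that `nn.MultiheadAttention` applies its own internal output projection, a simplification the real module presumably folds into $W_o$):

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    """A sigmoid gate computed from the normalized input scales the raw
    attention output element-wise before the output projection."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.w_g = nn.Linear(d_model, d_model)   # gate projection W_g
        self.w_o = nn.Linear(d_model, d_model)   # output projection W_o

    def forward(self, x_norm: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x_norm, x_norm, x_norm, need_weights=False)
        gate = torch.sigmoid(self.w_g(x_norm))   # Gate = sigma(W_g . x_norm)
        return self.w_o(gate * attn_out)         # W_o . (Gate ⊙ Attn(x_norm))
```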
Instead of standard additive residuals ($h_l = h_{l-1} + f_{l-1}$), AttnRes aggregates over the full depth history:

- Pseudo-Query: A learned vector $w_l$ computes scalar importance logits for each preceding layer.
- Depth-Wise Softmax: Weights are normalized across the depth dimension, allowing the model to dynamically "choose" which preceding representations are most relevant for the current transformation.
- Selective Sum: $h_l = \sum_{i=0}^{l-1} \alpha_i v_i$
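One plausible reading of this mechanism in PyTorch (names and shapes are illustrative; the repo's version is in `model/attn_res_block.py`):

```python
import torch
import torch.nn as nn

class AttnResAggregate(nn.Module):
    """Depth-wise selective residual: a learned pseudo-query scores the
    output of every preceding layer, and a softmax over the depth axis
    mixes them into the input of the current layer."""
    def __init__(self, d_model: int):
        super().__init__()
        self.pseudo_query = nn.Parameter(torch.randn(d_model) * d_model ** -0.5)

    def forward(self, history: list) -> torch.Tensor:
        # history = [v_0 (embeddings), v_1, ..., v_{l-1}], each (B, S, d)
        v = torch.stack(history)                     # (depth, B, S, d)
        logits = torch.einsum("lbsd,d->lbs", v, self.pseudo_query)
        alpha = torch.softmax(logits, dim=0)         # normalize over depth
        return (alpha.unsqueeze(-1) * v).sum(dim=0)  # h_l = sum_i alpha_i v_i
```

With a single entry in `history` the depth softmax collapses to 1, so the first layer simply passes the embeddings through.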
The Amygdala Indexer mimics the biological amygdala's role as a primary filter for salient information. Positioned as "Layer 0" in the AttnRes framework, it scans for crucial tokens before deeper processing.
- Purpose: It provides an "emotional" high-pass filter that identifies high-impact tokens.
- Dual-Path Initialization: Creates a parallel path for gradients. Subsequent layers can choose between raw embeddings and these salient highlights, significantly accelerating convergence.
- Representation De-correlation: Sharp, low-temperature attention helps break the "embedding cone" effect (representation collapse) by introducing high-entropy signals early.
- Latent Efficiency: Operating in a reduced dimensional space ($d_{latent} \ll d_{model}$) allows for complex salience filtering with minimal computational overhead.
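A rough sketch of such a filter (the latent dimension, temperature value, and the way salience re-weights the embeddings are all assumptions; see `model/amygdala.py` for the real Layer 0):

```python
import torch
import torch.nn as nn

class AmygdalaIndexer(nn.Module):
    """Layer-0 salience filter: a learned query in a small latent space,
    with a low softmax temperature, sharply highlights high-impact tokens."""
    def __init__(self, d_model: int, d_latent: int, temperature: float = 0.1):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)   # cheap latent projection
        self.query = nn.Parameter(torch.randn(d_latent) * d_latent ** -0.5)
        self.temperature = temperature             # << 1 -> near-argmax weights

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        z = self.down(emb)                            # (B, S, d_latent)
        logits = z @ self.query / self.temperature    # sharpened scores
        salience = torch.softmax(logits, dim=-1)      # (B, S), peaky
        # Salience-weighted view of the embeddings, kept alongside v_0
        # so later layers can choose between raw and filtered paths.
        return emb * salience.unsqueeze(-1)
```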
Dropout is strategically placed to maintain training stability across the shallow-but-wide architecture:
- Post-Embedding: Applied to $v_0$ (initial embeddings) to regularize the massive $1024 \times 30000$ embedding matrix.
- Attention Weights: Applied to the softmax scores before value aggregation to prevent head dominance.
- Post-Parallel Block: Applied after the sum of Attention and FFN outputs, acting as the final regularizer before the history update.
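In code, the three sites look roughly like this (a simplified single step; `nn.MultiheadAttention`'s `dropout` argument applies to the softmax scores):

```python
import torch
import torch.nn as nn

class RegularizedParallelStep(nn.Module):
    """Illustrates the three dropout sites: post-embedding, attention
    weights, and after the summed parallel block."""
    def __init__(self, d_model: int, n_heads: int, p: float = 0.1):
        super().__init__()
        self.embed_drop = nn.Dropout(p)          # site 1: on v_0
        self.attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=p,         # site 2: softmax scores
            batch_first=True)
        self.ffn = nn.Linear(d_model, d_model)   # stand-in FFN
        self.block_drop = nn.Dropout(p)          # site 3: after Attn + FFN sum

    def forward(self, v0: torch.Tensor) -> torch.Tensor:
        x = self.embed_drop(v0)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        return self.block_drop(attn_out + self.ffn(x))
```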
The model is trained on a curated 200M token corpus (200k articles) with a high 40% Masked Language Modeling (MLM) probability to force the model to understand broader context:
- 40% - EDGAR-CORPUS: 80k sections from SEC filings.
- 40% - NYT 100Y: 80k headlines and abstracts spanning a century of news.
- 20% - Wikipedia: 40k articles for general linguistic grounding.
Evaluation and comparison to other financial models (e.g., FinBERT, BloombergGPT-lite) are currently in progress and will be added here once validated.
The project is implemented in PyTorch with several strategies to enable training on consumer hardware (e.g., MacBook Air M4 or RTX 4060 8GB):
- JIT Tokenization: To save memory, tokenization and MLM masking are performed on-the-fly during batch collation using `data_loader/mlm_loader.py`.
- Gradient Accumulation: We use a physical batch size of 4 but accumulate gradients over 32 steps to reach a virtual batch size of 128, matching the effective training dynamics of larger models.
- Mixed Precision (bf16): Specifically optimized for Apple Silicon (MPS) and modern NVIDIA GPUs to reduce VRAM footprint and speed up computation.
- Weight Tying: The output MLM decoder shares weights with the input embedding matrix, significantly reducing the total parameter count.
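The last two points combine into a training step like this sketch (toy dimensions and a toy "model"; the tying line is the whole trick: decoder and embedding share one tensor):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, ACCUM_STEPS = 30000, 64, 32       # DIM is 1024 in the real model

embed = nn.Embedding(VOCAB, DIM)
decoder = nn.Linear(DIM, VOCAB, bias=False)
decoder.weight = embed.weight                 # weight tying: one shared matrix

opt = torch.optim.AdamW(embed.parameters(), lr=1e-4)
for step in range(ACCUM_STEPS):               # physical batch 4 x 32 steps = 128
    ids = torch.randint(0, VOCAB, (4, 16))    # stand-in batch of token ids
    logits = decoder(embed(ids))              # toy "model": embed -> decode
    loss = F.cross_entropy(logits.view(-1, VOCAB), ids.view(-1))
    (loss / ACCUM_STEPS).backward()           # scale so grads average correctly
    if (step + 1) % ACCUM_STEPS == 0:
        opt.step()                            # one optimizer update per 32 steps
        opt.zero_grad()
```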
```
Fin-FFB/
├── config/            # YAML configs for model variants (Nano, Small, Med, Large, X-Large)
├── data/              # Parquet datasets (Raw bulk data and processed 'ready' files)
├── data_loader/       # PyTorch Dataset adapters and JIT MLM Collator
│   ├── mlm_loader.py  # Core logic for on-the-fly tokenization/masking
│   └── pd_adpt.py     # Pandas-to-PyTorch adapter for Parquet files
├── data_utils/        # Pre-processing pipeline for SEC/NYT/Wiki sources
├── model/             # Core Architecture Implementation
│   ├── amygdala.py        # Amygdala Indexer (Layer 0 salience filter)
│   ├── attn_core.py       # Parallel Attention + FFN blocks
│   ├── attn_res_block.py  # Full Attention Residuals (selective aggregation)
│   ├── alibi_utils.py     # Bidirectional ALiBi bias generation
│   ├── fin_ffb.py         # Base Encoder
│   └── fin_ffb_mlm.py     # MLM Pre-training wrapper
├── utils/             # Training lifecycle, hardware management, and logging
├── train.py           # Main training entry point
└── requirements.txt   # Project dependencies
```
The project is a testament to what's possible with limited resources:
- Apple Silicon Optimization: Utilizes `MPS` and `bf16` for maximum throughput on M-series chips.
- VRAM Efficiency: Shallow depth (3 layers) allows for a massive 1024-dimension width even on 8GB cards.
- Training Time: Optimized to complete a full 200M-token pre-training pass in roughly 30 hours on a consumer setup.