A tiny benchmark for continual-learning architectures, plus a production-validation suite on real-world intent classification across four datasets.
Two components:
- Synthetic benchmark — symbolic, closed-world vocabulary where pretrained shortcuts are impossible. 100 invented-word facts across 10 lessons, three question types (recall, composition, negation), full forgetting curves.
- BANKING77 extended series (50+ experiments) — 77-class intent classification with frozen pretrained encoders + Mahalanobis prototype. Validated on CLINC150 (150 classes), HWU64 (64 classes), AG News (4-class topic).
| Configuration | Official Split | 5-fold CV (honest) | Forgetting |
|---|---|---|---|
| Naive MLP (catastrophic baseline) | 32.5% | — | severe |
| Replay buffer (cap=200) | 68.9% | — | moderate |
| Frozen MiniLM + Mahalanobis | 90.5% | — | zero |
| Frozen mpnet + Mahalanobis | 92.8% | — | zero |
| 3-encoder concat + Mahalanobis | 94.22% | 93.33% ± 0.58pp | zero |
| Class merge (top-30 pairs → 49 groups) | 96.10% | — | zero (but redefines task) |
| Multi-label top-2 (T=0.05) | 97.73% strict | — | zero |
Important honesty note: the official BANKING77 test set is systematically 4.36pp easier than the training set (verified by train/test role swap). The honest cross-validation number is 93.33%, not 94.22%. Report both.
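The prototype pipeline behind the table rows above can be sketched in a few lines: one mean per class plus a single shared, shrinkage-regularised covariance, classifying by nearest squared Mahalanobis distance. A minimal sketch, assuming random Gaussian blobs stand in for frozen-encoder embeddings; the repo's actual implementation may differ in detail:

```python
import numpy as np

class MahalanobisPrototype:
    """Nearest-class-mean classifier under one shared, shrunk covariance.
    Sketch only: in the repo the inputs are frozen sentence-encoder embeddings."""

    def __init__(self, shrinkage=1e-4):
        self.shrinkage = shrinkage
        self.means = {}

    def fit(self, X, y):
        for c in np.unique(y):
            self.means[c] = X[y == c].mean(axis=0)
        # pooled within-class covariance, shrunk toward the identity
        centered = np.vstack([X[y == c] - self.means[c] for c in self.means])
        cov = centered.T @ centered / len(X)
        self.precision = np.linalg.inv(cov + self.shrinkage * np.eye(X.shape[1]))
        return self

    def predict(self, X):
        classes = list(self.means)
        # squared Mahalanobis distance to every class mean
        d = np.stack([np.einsum('nd,de,ne->n', X - self.means[c],
                                self.precision, X - self.means[c])
                      for c in classes], axis=1)
        return np.array(classes)[d.argmin(axis=1)]

# toy demo: three well-separated Gaussian "classes"
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, (50, 8)) for m in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], 50)
acc = (MahalanobisPrototype().fit(X, y).predict(X) == y).mean()
```

The shrinkage term (1e-4, the value the tuning sweep below found optimal) keeps the pooled covariance invertible when the embedding dimension is large relative to per-class sample counts.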
| Dataset | Classes | Catastrophic | Replay | Prototype (best) |
|---|---|---|---|---|
| BANKING77 | 77 | 32.5% | 68.9% | 94.22% |
| CLINC150 | 150 | 60.9% | 80.1% | 95.56% |
| HWU64 | 64 | 62.8% | 83.3% | 90.61% |
| AG News (topic) | 4 | 66.7% | 82.0% | 90.22% |
The ranking prototype >> replay >> catastrophic holds across all four datasets and three task types. The gap narrows as class count drops: on AG News (4 classes) the prototype leads replay by only ~8pp.
| Config | Cold start | E2E@1 | QPS@128 | Memory |
|---|---|---|---|---|
| MiniLM (384d) | 54ms | 24ms | 455 | 87MB + 0.7MB |
| mpnet (768d) | 903ms | 153ms | 32 | 418MB + 2.5MB |
| 3-enc concat (2176d) | 12.3s | 627ms | 9 | 700MB + 18MB |
Adding a new class takes 530ms for 100 examples. Mahalanobis prediction itself is ~1ms; the encoder dominates end-to-end latency.
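The 530ms add-a-class figure follows from the math: each class contributes only its count, feature sum, and outer-product sum, so adding a class is O(class size) and never revisits old data. A hedged sketch of that bookkeeping (the repo's exact update rule may differ); note the predictions come out identical regardless of arrival order:

```python
import numpy as np

class StreamingMahalanobis:
    """Prototype classifier rebuilt from streaming sufficient statistics.
    Sketch only: the repo's exact bookkeeping may differ."""

    def __init__(self, dim, shrinkage=1e-4):
        self.dim, self.shrinkage = dim, shrinkage
        self.n, self.s = {}, {}          # per-class count and feature sum
        self.xx = np.zeros((dim, dim))   # running sum of x x^T over all data
        self.total = 0

    def add_class(self, c, X):
        # O(|X|) update; old data is never touched, so class order cannot matter
        self.n[c], self.s[c] = len(X), X.sum(axis=0)
        self.xx += X.T @ X
        self.total += len(X)

    def predict(self, X):
        # pooled within-class scatter = sum x x^T - sum_c n_c mu_c mu_c^T
        within = self.xx - sum(np.outer(self.s[c], self.s[c]) / self.n[c]
                               for c in self.n)
        P = np.linalg.inv(within / self.total + self.shrinkage * np.eye(self.dim))
        classes = list(self.n)
        mus = [self.s[c] / self.n[c] for c in classes]
        d = np.stack([np.einsum('nd,de,ne->n', X - m, P, X - m) for m in mus],
                     axis=1)
        return np.array(classes)[d.argmin(axis=1)]

# same three classes added in two different orders -> identical predictions
rng = np.random.default_rng(1)
data = {c: rng.normal(float(c), 0.3, (50, 8)) for c in range(3)}
a, b = StreamingMahalanobis(8), StreamingMahalanobis(8)
for c in (0, 1, 2):
    a.add_class(c, data[c])
for c in (2, 0, 1):
    b.add_class(c, data[c])
Xte = np.vstack([data[c] for c in range(3)])
order_invariant = bool((a.predict(Xte) == b.predict(Xte)).all())
acc = (a.predict(Xte) == np.repeat(np.arange(3), 50)).mean()
```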
| Scorer | AUROC | ID accept @ 90% OOD reject |
|---|---|---|
| Global distance (baseline) | 0.78 | 52% |
| Margin (d2 - d1) | 0.85 | 52% (but at lower threshold) |
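The margin scorer in the table replaces "distance to the nearest class" with "gap between nearest and runner-up": in-distribution queries have a clear winner (large margin), while OOD queries sit roughly equidistant from everything (small margin). A toy sketch with an identity metric standing in for Mahalanobis distances, using scikit-learn's `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
protos = np.array([[0.0, 0.0], [4.0, 0.0]])        # two class prototypes
x_id = np.vstack([p + rng.normal(0, 0.3, (100, 2)) for p in protos])
x_ood = np.array([2.0, 4.0]) + rng.normal(0, 0.3, (100, 2))  # equidistant cloud

def dists(X):
    # squared distance to each prototype (identity metric for the sketch)
    return ((X[:, None, :] - protos[None]) ** 2).sum(-1)

d = np.sort(dists(np.vstack([x_id, x_ood])), axis=1)
is_ood = np.r_[np.zeros(200), np.ones(100)]

auc_global = roc_auc_score(is_ood, d[:, 0])               # baseline: nearest distance
auc_margin = roc_auc_score(is_ood, -(d[:, 1] - d[:, 0]))  # small margin = ambiguous = OOD
```

On this toy both scorers separate cleanly; the table's point is that on real embeddings the margin survives cases where an OOD query lands moderately close to one prototype.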
What moved the needle:
- Better encoders: MiniLM → mpnet (+2.24pp); adding e5-large (+0.94pp)
- Shrinkage tuning: 1e-4 optimal (+0.13pp)
- ICDA embedding augmentation: +0.36pp (vanishes with better encoders)
- Margin-based OOD scoring: AUROC 0.78 → 0.85
- Multi-label via distance thresholds: 94% top-1 → 97.7% top-2
- Class merging on ambiguous pairs: 94.22% → 96.10% (at 49 groups)
What didn't work (each tested and falsified):
- LoRA fine-tuning: -24.5pp even at the best of 9 configs
- Multi-prototype Mahalanobis (k=3): -0.7pp
- Hard-margin replay: -3.6pp vs random reservoir
- Label noise cleanup: -0.2pp (volume > purity)
- PCA/LDA projection: -3.1pp
- Hierarchical classification: -1.8pp
- Weighted encoder fusion: -2.2pp
- kNN-Mahalanobis (100 neighbors): -2.6pp
- Cross-encoder reranking: -21pp (STS objective anti-correlates with intent)
- LLM cascade (Qwen 2.5 3B): -3.8pp
- Fine-tuned mpnet contrastive (oracle): +0.03pp on concat
- Bigger encoder bge-large alone: -0.6pp
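The multi-label entry in the list above can be sketched as a temperature softmax over negative distances, keeping the runner-up label only when it receives non-trivial mass. T=0.05 is the table's temperature; the `p_min` cutoff below is illustrative, not the repo's tuned rule:

```python
import numpy as np

def multilabel_predict(dists, T=0.05, k=2, p_min=0.2):
    """Turn per-class distances into up to k labels.
    Softmax over negative distances with temperature T; keep the
    runner-up only if its probability clears p_min (illustrative value)."""
    z = -np.asarray(dists, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)            # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    order = np.argsort(-p, axis=-1)[..., :k]
    labels = []
    for row_p, row_o in zip(p, order):
        keep = [row_o[0]] + [c for c in row_o[1:] if row_p[c] >= p_min]
        labels.append(keep)
    return labels

# clear winner -> one label; near-tie -> two labels
labels = multilabel_predict([[0.10, 0.90, 1.50],
                             [0.10, 0.12, 1.50]])
```

In the first row the runner-up is 0.8 distance units behind, so its softmax mass is negligible; in the second row the top two are 0.02 apart and both survive.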
A late-cycle audit revealed we never tested vanilla LinearSVC on the 2176-d concat features. Result: LinearSVC matches or slightly beats Mahalanobis under joint training (94.29% official / 94.36% CV vs 94.22% / 93.33%).
The honest narrative is narrower than "Mahalanobis is magic":
| Training mode | Classifier | Accuracy |
|---|---|---|
| Joint (all 77 classes at once) | LinearSVC | 94.36% CV |
| Joint (all 77 classes at once) | Mahalanobis | 93.33% CV |
| Class-incremental | Incremental OvR LogReg | 90.52% |
| Class-incremental | Mahalanobis | 94.22% |
Mahalanobis doesn't beat everything. It holds up under class-incremental constraints while naive linear classifiers lose 3.84pp. The mathematical invariances (order, granularity, streaming sufficient statistics) are the actual contribution. Under joint training, Mahalanobis and LinearSVC both hit the frozen-feature ceiling; it's not about discrimination, it's about CIL deployability.
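The joint-vs-incremental gap can be reproduced in miniature: a one-vs-rest linear scorer trained class-incrementally never sees future classes as negatives, and its per-class scores are not calibrated against each other; a prototype classifier is immune by construction because each class's statistics are computed independently. A toy sketch, with Euclidean nearest-mean standing in for Mahalanobis and all values illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim, per = 16, 60
centers = rng.normal(0, 2.0, (6, dim))
Xs = [c + rng.normal(0, 0.6, (per, dim)) for c in centers]

# Class-incremental OvR: when class c arrives, fit a binary scorer using
# only previously seen data as negatives (placeholder noise for class 0).
scorers, seen = [], []
for Xc in Xs:
    negs = np.vstack(seen) if seen else rng.normal(0, 2.0, (per, dim))
    Xb = np.vstack([Xc, negs])
    yb = np.r_[np.ones(len(Xc)), np.zeros(len(negs))]
    scorers.append(LogisticRegression(max_iter=1000).fit(Xb, yb))
    seen.append(Xc)

Xte = np.vstack(Xs)
yte = np.repeat(np.arange(6), per)
ovr = np.stack([s.decision_function(Xte) for s in scorers], axis=1).argmax(1)

# Prototype classifier: per-class means, order- and stream-invariant
mus = np.stack([X.mean(0) for X in Xs])
proto = ((Xte[:, None, :] - mus[None]) ** 2).sum(-1).argmin(1)

acc_ovr = (ovr == yte).mean()
acc_proto = (proto == yte).mean()
```

The prototype accuracy is unaffected by the streaming setup; the incremental OvR scorer typically lags because its per-class decision functions were fit against different negative sets.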
Published BANKING77 SOTA (~95-96%) requires full gradient fine-tuning of the encoder. Our approach at 93.33% CV uses zero gradient training, has zero forgetting, is order-invariant, and deploys in 5MB. For production CIL where new intents must be added continuously without forgetting, our approach dominates; for pure accuracy without CIL constraints, fine-tuning still wins.
```bash
pip install -r requirements.txt

# Synthetic benchmark
python run.py train --arch catastrophic_mlp --seed 42
python run.py leaderboard

# BANKING77 experiments
PYTHONPATH=. python -m experiments.banking77                 # core 5-architecture baseline
PYTHONPATH=. python -m experiments.banking77_3enc            # 3-encoder concat (94.22%)
PYTHONPATH=. python -m experiments.banking77_variance_audit  # honest error bars
PYTHONPATH=. python -m experiments.banking77_latency         # production latency
PYTHONPATH=. python -m experiments.clinc150                  # CLINC150 replication
PYTHONPATH=. python -m experiments.ag_news                   # non-intent cross-task
```

Repository layout:

```
tasks.py                        Fact / Lesson / Question / Answer / Dataset dataclasses
data.py                         synthetic dataset generator
evaluation.py                   class-incremental harness
recorder.py                     append-only event log
train.py                        drive one architecture end-to-end
architectures/                  20+ LearningSystem implementations
experiments/                    BANKING77 + CLINC150 + HWU64 + AG News
  banking77.py                  core + class-incremental eval
  banking77_3enc.py             3-encoder concat (94.22% baseline)
  banking77_variance_audit.py   honest error bars
  banking77_latency.py          production latency/memory
  banking77_merge_deep.py       class merging to 96.10%
  banking77_multilabel.py       top-k distance thresholds
  banking77_ood_fix.py          margin-based OOD (AUROC 0.85)
  banking77_lora_sweep.py       fair LoRA falsification
  clinc150.py                   CLINC150 cross-dataset
  hwu64.py                      HWU64 cross-dataset
  ag_news.py                    non-intent cross-task
  ...                           40+ experiment scripts total
paper/paper.md                  full research paper draft
docs/                           GitHub Pages site
runs/                           immutable synthetic-benchmark run directories
```
MIT License. See LICENSE.
- GitHub Pages: https://danlex.github.io/nanolearn/
- Custom domain (DNS pending): https://nanolearn.alexandrudan.com