nanoLearn


A tiny benchmark for continual-learning architectures, plus a production-validation suite on real-world intent classification across four datasets.

Two components:

  1. Synthetic benchmark — symbolic, closed-world vocabulary where pretrained shortcuts are impossible. 100 invented-word facts across 10 lessons, three question types (recall, composition, negation), full forgetting curves.
  2. BANKING77 extended series (50+ experiments) — 77-class intent classification with frozen pretrained encoders + Mahalanobis prototype. Validated on CLINC150 (150 classes), HWU64 (64 classes), AG News (4-class topic).

Headline Results (Honest)

BANKING77 — 77-class class-incremental intent classification

| Configuration | Official split | 5-fold CV (honest) | Forgetting |
|---|---|---|---|
| Naive MLP (catastrophic baseline) | 32.5% | — | severe |
| Replay buffer (cap=200) | 68.9% | — | moderate |
| Frozen MiniLM + Mahalanobis | 90.5% | — | zero |
| Frozen mpnet + Mahalanobis | 92.8% | — | zero |
| 3-encoder concat + Mahalanobis | 94.22% | 93.33% ± 0.58pp | zero |
| Class merge (top-30 pairs → 49 groups) | 96.10% | — | zero (but redefines task) |
| Multi-label top-2 (T=0.05) | 97.73% | — | strict zero |

Important honesty note: the official BANKING77 test set is systematically 4.36pp easier than the training set (verified by train/test role swap). The honest cross-validation number is 93.33%, not 94.22%. Report both.

Cross-dataset validation — same ranking holds

| Dataset | Classes | Catastrophic | Replay | Prototype (best) |
|---|---|---|---|---|
| BANKING77 | 77 | 32.5% | 68.9% | 94.22% |
| CLINC150 | 150 | 60.9% | 80.1% | 95.56% |
| HWU64 | 64 | 62.8% | 83.3% | 90.61% |
| AG News (topic) | 4 | 66.7% | 82.0% | 90.22% |

The ranking prototype >> replay >> catastrophic holds across all four datasets and three task types. The gap shrinks as the class count drops: on AG News (4 classes) the prototype leads replay by only ~8pp.

Production characteristics (latency benchmarks, Apple M-series CPU)

| Config | Cold start | E2E latency @ batch 1 | QPS @ batch 128 | Memory (encoder + state) |
|---|---|---|---|---|
| MiniLM (384d) | 54ms | 24ms | 455 | 87MB + 0.7MB |
| mpnet (768d) | 903ms | 153ms | 32 | 418MB + 2.5MB |
| 3-enc concat (2176d) | 12.3s | 627ms | 9 | 700MB + 18MB |

Adding a new class takes 530ms for 100 examples. Mahalanobis prediction itself costs ~1ms; the encoder dominates all end-to-end latency.
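The classifier behind these numbers is small enough to sketch in a few lines of numpy. This is a minimal illustration of the frozen-embedding + Mahalanobis-prototype idea (per-class means plus one shared, shrinkage-regularized covariance), not the repository's implementation; the shrinkage default of 1e-4 matches the value reported below, and everything else (names, signatures) is illustrative:

```python
import numpy as np

class MahalanobisPrototype:
    """Per-class mean prototypes + one shared shrunk covariance
    over frozen embeddings. Prediction = nearest prototype under
    Mahalanobis distance."""

    def __init__(self, shrinkage=1e-4):
        self.shrinkage = shrinkage
        self.means = {}     # class label -> mean embedding
        self._prec = None   # shared precision (inverse covariance)

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=np.float64), np.asarray(y)
        centered = []
        for label in np.unique(y):
            Xc = X[y == label]
            mu = Xc.mean(axis=0)
            self.means[label] = mu
            centered.append(Xc - mu)          # within-class deviations
        centered = np.vstack(centered)
        cov = centered.T @ centered / len(centered)
        cov += self.shrinkage * np.eye(cov.shape[0])  # shrinkage regularization
        self._prec = np.linalg.inv(cov)
        return self

    def distances(self, x):
        labels = list(self.means)
        d = np.array([(x - self.means[l]) @ self._prec @ (x - self.means[l])
                      for l in labels])
        return labels, d

    def predict(self, x):
        labels, d = self.distances(x)
        return labels[int(np.argmin(d))]
```

Note why class addition is cheap in this scheme: a new class only contributes one more mean (and, optionally, a covariance refresh), so encoding the examples dominates the cost.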

OOD rejection — margin score beats raw distance

| Scorer | AUROC | ID accept @ 90% OOD reject |
|---|---|---|
| Global distance (baseline) | 0.78 | 52% |
| Margin (d2 − d1) | 0.85 | 52% (but at a lower threshold) |
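The margin scorer in the table is just the gap between the two nearest prototypes. A minimal sketch of that idea (assuming a vector of per-class Mahalanobis distances; a small gap means the query is ambiguous or out-of-distribution):

```python
import numpy as np

def margin_score(distances):
    """OOD confidence from class distances: the gap between the
    nearest and second-nearest prototype (d2 - d1). Larger margin
    means more confidently in-distribution."""
    d = np.sort(np.asarray(distances, dtype=np.float64))
    return float(d[1] - d[0])
```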

What Worked vs What Failed

Worked (positive results)

  • Better encoders: MiniLM → mpnet (+2.24pp), +e5-large (+0.94pp)
  • Shrinkage tuning: 1e-4 optimal (+0.13pp)
  • ICDA embedding augmentation: +0.36pp (vanishes with better encoders)
  • Margin-based OOD scoring: AUROC 0.78 → 0.85
  • Multi-label via distance thresholds: 94% top-1 → 97.7% top-2
  • Class merging on ambiguous pairs: 94.22% → 96.10% (at 49 groups)
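The multi-label result above comes from thresholding on prototype distances rather than training a separate multi-label head. One plausible sketch, assuming a softmax over negative distances at the reported temperature T=0.05; the `min_prob` acceptance cutoff is a hypothetical parameter for illustration, not the repository's exact rule:

```python
import numpy as np

def topk_labels(distances, labels, temperature=0.05, k=2, min_prob=0.1):
    """Turn prototype distances into up to k labels.
    Softmax over negative distances with a low temperature;
    the top label is always kept, secondary labels only if
    their probability clears min_prob (illustrative cutoff)."""
    d = np.asarray(distances, dtype=np.float64)
    logits = -d / temperature
    logits -= logits.max()               # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    order = np.argsort(-p)[:k]
    top = order[0]
    return [labels[i] for i in order if i == top or p[i] >= min_prob]
```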

Failed (negative results, all verified empirically)

  • LoRA fine-tuning: -24.5pp even at best of 9 configs
  • Multi-prototype Mahalanobis (k=3): -0.7pp
  • Hard-margin replay: -3.6pp vs random reservoir
  • Label noise cleanup: -0.2pp (volume > purity)
  • PCA/LDA projection: -3.1pp
  • Hierarchical classification: -1.8pp
  • Weighted encoder fusion: -2.2pp
  • kNN-Mahalanobis (100 neighbors): -2.6pp
  • Cross-encoder reranking: -21pp (STS objective anti-correlates with intent)
  • LLM cascade (Qwen 2.5 3B): -3.8pp
  • Fine-tuned mpnet contrastive (oracle): +0.03pp on concat
  • Bigger encoder bge-large alone: -0.6pp

Corrected core insight (after honest audit)

A late-cycle audit revealed we never tested vanilla LinearSVC on the 2176-d concat features. Result: LinearSVC matches or slightly beats Mahalanobis under joint training (94.29% official / 94.36% CV vs 94.22% / 93.33%).

The honest narrative is narrower than "Mahalanobis is magic":

| Training mode | Classifier | Accuracy |
|---|---|---|
| Joint (all 77 classes at once) | LinearSVC | 94.36% CV |
| Joint (all 77 classes at once) | Mahalanobis | 93.33% CV |
| Class-incremental | Incremental OvR LogReg | 90.52% |
| Class-incremental | Mahalanobis | 94.22% |

Mahalanobis doesn't beat everything. It holds up under class-incremental constraints while naive linear classifiers lose 3.84pp. The mathematical invariances (order, granularity, streaming sufficient statistics) are the actual contribution. Under joint training, Mahalanobis and LinearSVC both hit the frozen-feature ceiling; it's not about discrimination, it's about CIL deployability.
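The "streaming sufficient statistics" invariance can be made concrete: per-class counts and sums plus a global sum of outer products fully determine the means and the pooled within-class covariance, so the fitted model is identical no matter what order (or lesson granularity) the examples arrive in. A minimal sketch, again illustrative rather than the repository's code:

```python
import numpy as np

class StreamingStats:
    """Order-invariant sufficient statistics for a Mahalanobis
    prototype model: per-class counts and sums, plus a global
    sum of outer products. Examples and classes can arrive in
    any order; the resulting model is identical."""

    def __init__(self, dim):
        self.n = {}                        # class -> count
        self.s = {}                        # class -> sum of embeddings
        self.outer = np.zeros((dim, dim))  # global sum of x x^T
        self.total = 0

    def add(self, x, label):
        x = np.asarray(x, dtype=np.float64)
        self.n[label] = self.n.get(label, 0) + 1
        self.s[label] = self.s.get(label, 0.0) + x
        self.outer += np.outer(x, x)
        self.total += 1

    def means(self):
        return {l: self.s[l] / self.n[l] for l in self.n}

    def covariance(self):
        # pooled within-class scatter = sum x x^T - sum_c n_c mu_c mu_c^T
        S = self.outer.copy()
        for l, n in self.n.items():
            mu = self.s[l] / n
            S -= n * np.outer(mu, mu)
        return S / self.total
```

Because `add` only accumulates sums, feeding the same examples in any permutation yields bit-for-bit (up to float associativity) the same means and covariance, which is exactly the order-invariance property claimed above.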

SOTA position (honest)

Published BANKING77 SOTA (~95-96%) requires full gradient fine-tuning of the encoder. Our approach at 93.33% CV uses zero gradient training, has zero forgetting, is order-invariant, and deploys in 5MB. For production CIL where new intents must be added continuously without forgetting, our approach dominates; for pure accuracy without CIL constraints, fine-tuning still wins.

Quick Start

```shell
pip install -r requirements.txt

# Synthetic benchmark
python run.py train --arch catastrophic_mlp --seed 42
python run.py leaderboard

# BANKING77 experiments
PYTHONPATH=. python -m experiments.banking77                    # core 5-architecture baseline
PYTHONPATH=. python -m experiments.banking77_3enc               # 3-encoder concat (94.22%)
PYTHONPATH=. python -m experiments.banking77_variance_audit     # honest error bars
PYTHONPATH=. python -m experiments.banking77_latency            # production latency
PYTHONPATH=. python -m experiments.clinc150                     # CLINC150 replication
PYTHONPATH=. python -m experiments.ag_news                      # non-intent task
```

Project Layout

```
tasks.py              Fact / Lesson / Question / Answer / Dataset dataclasses
data.py               synthetic dataset generator
evaluation.py         class-incremental harness
recorder.py           append-only event log
train.py              drive one architecture end-to-end
architectures/        20+ LearningSystem implementations
experiments/          BANKING77 + CLINC150 + HWU64 + AG News
  banking77.py                 core + class-incremental eval
  banking77_3enc.py            3-encoder concat (94.22% baseline)
  banking77_variance_audit.py  honest error bars
  banking77_latency.py         production latency/memory
  banking77_merge_deep.py      class merging to 96.10%
  banking77_multilabel.py      top-k distance thresholds
  banking77_ood_fix.py         margin-based OOD (AUROC 0.85)
  banking77_lora_sweep.py      fair LoRA falsification
  clinc150.py                  CLINC150 cross-dataset
  hwu64.py                     HWU64 cross-dataset
  ag_news.py                   non-intent cross-task
  ... 40+ experiment scripts total
paper/paper.md        full research paper draft
docs/                 GitHub Pages site
runs/                 immutable synthetic-benchmark run directories
```

License

MIT License. See LICENSE.
