
Full source-port parity: training, quantization, autotune, and CLI #2

Open
ericstj wants to merge 8 commits into main from full-port



@ericstj ericstj commented Jun 8, 2026

Source-ports the remaining native fastText functionality to managed C#, taking FastText.Net from inference-only to full parity with upstream fastText. Work was done in verifiable phases; verification is quality/round-trip/determinism based (not byte-exact) since std-library RNG/distributions differ from C++.

What's new

  • Training (supervised, cbow, skipgram) — FastText.Train + FastTextTrainArgs
  • Quantization authoring — product-quantizer k-means
  • Hyperparameter autotuning: FastText.TrainSupervisedAutotune (Gaussian random search), precision@recall / recall@precision curves in Meter
  • Model serialization — save/load round-trip for full and quantized models
  • fasttext CLI mirroring upstream main.cc: supervised/skipgram/cbow, test/test-label, quantize, predict[-prob], print-word/sentence-vectors, print-ngrams, nn, analogies, dump
  • CLI packaging — framework-dependent dotnet tool (dotnet tool install -g FastText.Net.Cli, runs as fasttext)

Verification

  • 50/50 tests passing (Release)
  • CLI smoke-tested end-to-end (train → predict → test → nn → dump)
  • Tool package validated (correct DotnetToolSettings.xml, installs and runs)

Notes

  • Quantize retrain path is not yet supported (throws NotSupportedException).
  • Autotune is time/trial-bounded and stochastic; MaxTrials enables deterministic tests.

ericstj and others added 6 commits June 7, 2026 20:50
Port the complete fastText inference API and binary serialization to managed C#, mirroring src/fasttext.h.

Phase 1 (inference completeness):
- New internal Vector (vector.cc) and DenseMatrix ops; Dictionary accessors
  (getWord/getLabel/getId/getSubwords) and a labels-aware getLine.
- ModelLoader extracts the shared load path; FastTextModel delegates to it.
- New public FastText facade: word/subword/sentence/ngram vectors, word/subword
  ids, nearest neighbours, analogies, predict, and test() via a ported Meter.
- VectorTests validate vectors/ids/NN against a Python fasttext oracle generated
  from the committed models without retraining.

Phase 2 (serialization):
- Save methods on Args/Dictionary/Matrix/QuantMatrix/ProductQuantizer, each the
  inverse of Load; ModelLoader.Save mirrors saveModel. Facade gains SaveModel and
  SaveVectors/SaveOutput text exporters.
- SerializationTests assert byte-identical round-trips for dense models and a
  semantic (prediction) round-trip for the quantized lid model.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Port fastText's quantize() to managed code so supervised models can be
compressed without the native library:

- ProductQuantizer: k-means training (Estep/Mstep, seeded LCG), Train,
  ComputeCode(s).
- DenseMatrix: RawData, L2NormRow, DivideRow.
- QuantMatrix: quantizing ctor (qnorm-aware) over a DenseMatrix.
- Dictionary: Prune + EosId for vocabulary cutoff.
- FastText.Quantize(dsub, qnorm, cutoff, qout): selectEmbeddings + prune
  pipeline; retrain-after-prune throws NotSupported until training lands.
- ModelLoader.BuildModel to rebuild the model after quantizing.

k-means is seeded but std-library shuffle/uniform differ across runtimes,
so results are not byte-identical to the original C++ model. Verified by
top-1 prediction agreement, word-vector cosine, and quantize/save/reload
determinism in QuantizationTests (42/42 pass).
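
The Estep/Mstep split named above is the usual k-means alternation. The following is a minimal sketch of one iteration over flat row-major arrays, not the `ProductQuantizer` code from this commit: the function names and array layout are assumptions made for illustration, and empty clusters are simply left in place rather than re-seeded.

```csharp
using System;

// E-step: assign each d-dimensional row of x to its nearest centroid (squared L2).
static void EStep(float[] x, float[] centroids, int[] codes, int d, int k)
{
    int n = x.Length / d;
    for (int i = 0; i < n; i++)
    {
        float best = float.MaxValue;
        for (int c = 0; c < k; c++)
        {
            float dist = 0;
            for (int j = 0; j < d; j++)
            {
                float diff = x[i * d + j] - centroids[c * d + j];
                dist += diff * diff;
            }
            if (dist < best) { best = dist; codes[i] = c; }
        }
    }
}

// M-step: move each centroid to the mean of the rows assigned to it.
// Assumption: clusters with no rows are left untouched in this sketch.
static void MStep(float[] x, float[] centroids, int[] codes, int d, int k)
{
    var counts = new int[k];
    var sums = new float[k * d];
    int n = x.Length / d;
    for (int i = 0; i < n; i++)
    {
        counts[codes[i]]++;
        for (int j = 0; j < d; j++) sums[codes[i] * d + j] += x[i * d + j];
    }
    for (int c = 0; c < k; c++)
    {
        if (counts[c] == 0) continue;
        for (int j = 0; j < d; j++) centroids[c * d + j] = sums[c * d + j] / counts[c];
    }
}

// Four 2-d points forming two obvious clusters.
float[] points = { 0f, 0f, 0f, 1f, 10f, 10f, 10f, 11f };
float[] centers = { 0f, 0f, 10f, 10f };
int[] assignment = new int[4];
for (int iter = 0; iter < 5; iter++)
{
    EStep(points, centers, assignment, 2, 2);
    MStep(points, centers, assignment, 2, 2);
}
Console.WriteLine(string.Join(",", assignment)); // 0,0,1,1
```

A product quantizer would run this independently on each `dsub`-wide slice of the vectors; that outer loop is omitted here.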

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Port the fastText SGD training driver to pure managed C# for full parity:
RNG (MinstdRand), corpus building (Dictionary.ReadFromFile + discard
subsampling), DenseMatrix uniform init and gradient ops, the backprop
Loss.Forward for NS/HS/Softmax/OVA, Model.Update, and a public Train API
(FastText.Train/TrainSupervised/TrainUnsupervised + FastTextTrainArgs).

Trained models are not byte-identical to native fastText because the std
library RNG and distributions differ; training is verified by quality
(model learns its corpus), save/load round-trip, and fixed-seed
determinism. 46/46 tests pass.
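
For context, C++'s `std::minstd_rand` is a linear congruential generator with multiplier 48271 and modulus 2^31 - 1. The sketch below shows only that recurrence; `NextMinstd` is a hypothetical helper name rather than the `MinstdRand` type added here, and the distributions layered on top of the engine are not shown.

```csharp
using System;

// Sketch of the std::minstd_rand recurrence: x' = (48271 * x) mod (2^31 - 1).
// NextMinstd is a hypothetical helper, not the MinstdRand type from this PR.
// Assumes the incoming state is already in [1, 2^31 - 2].
static uint NextMinstd(ref uint state)
{
    const ulong multiplier = 48271;
    const ulong modulus = 2147483647; // 2^31 - 1
    state = (uint)(state * multiplier % modulus); // 64-bit product avoids overflow
    return state;
}

uint x = 1; // default seed of std::minstd_rand
Console.WriteLine(NextMinstd(ref x)); // 48271
Console.WriteLine(NextMinstd(ref x)); // 182605794
```

Because the state is the only input, two runs started from the same seed produce the same sequence, which is the property the fixed-seed determinism tests rely on.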

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Port fastText's autotune: a time-bounded random search over epoch, lr, dim,
wordNgrams, minn/maxn, bucket, dsub, and loss that maximizes a validation
metric, optionally under a model-size budget (via product quantization).

- AutotuneStrategy: Gaussian-perturbation sampling around the best-known args
  with a shrinking schedule (MinstdRand.NextGaussian, Marsaglia polar).
- Meter: precision@recall / recall@precision curves for the non-F1 metrics.
- Public API: FastText.TrainSupervisedAutotune + FastTextAutotuneArgs;
  FastTextTrainArgs.PinnedArgs to fix parameters; AutotuneMetric enum.
- MaxTrials caps trials for bounded, reproducible runs.

Search is stochastic and not byte-identical to native fastText (std-lib RNG
differs); verified by quality, fixed-seed determinism, pinned-arg respect, and
the size-constrained quantization path. 50/50 tests pass.
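
The Marsaglia polar method mentioned above turns pairs of uniform draws into standard normal samples by rejection sampling inside the unit disc. A minimal sketch follows; it uses `System.Random` as a stand-in uniform source (an assumption made for self-containment, since the commit names `MinstdRand.NextGaussian`), and the function name is illustrative.

```csharp
using System;

// Marsaglia polar method: draw (u, v) uniformly in the unit disc, then rescale.
// Assumption: System.Random stands in for the PR's own uniform source.
static double NextGaussianSketch(Random uniform)
{
    double u, v, s;
    do
    {
        u = 2.0 * uniform.NextDouble() - 1.0;
        v = 2.0 * uniform.NextDouble() - 1.0;
        s = u * u + v * v;
    } while (s >= 1.0 || s == 0.0); // reject points outside the disc or at the origin
    return u * Math.Sqrt(-2.0 * Math.Log(s) / s);
}

var rng = new Random(42);
const int n = 100_000;
double sum = 0, sumSq = 0;
for (int i = 0; i < n; i++)
{
    double sample = NextGaussianSketch(rng);
    sum += sample;
    sumSq += sample * sample;
}
Console.WriteLine($"mean ~ {sum / n:F3}, second moment ~ {sumSq / n:F3}");
```

The method also yields a second independent sample (`v` times the same factor), which this sketch discards for brevity.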

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Implements a managed re-implementation of fastText's command-line tool mirroring src/main.cc: supervised/skipgram/cbow training, test/test-label, quantize, predict/predict-prob, print-word/sentence-vectors, print-ngrams, nn, analogies, and dump.

Adds public FastText.DumpOption + Dump, with internal Dump(TextWriter) on Args, Dictionary, and DenseMatrix to back the dump command. CLI added to the solution and smoke-tested end-to-end.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds PackAsTool/ToolCommandName plus packaging metadata (mirroring the library: license, icon, readme, repo URL) so the CLI installs via 'dotnet tool install -g FastText.Net.Cli' and runs as 'fasttext'. Includes a consumer-focused tool README.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings June 8, 2026 05:51

@ericstj ericstj left a comment


LGTM


Copilot AI left a comment


Pull request overview

This PR completes the managed C# source-port of fastText functionality so FastText.Net supports the full upstream surface area (training, quantization authoring, autotune, full model save/load, and a fasttext CLI tool) with verification via deterministic/round-trip/semantic tests.

Changes:

  • Add managed training for supervised/cbow/skipgram plus supporting math primitives (vectors/matrices, losses, RNG).
  • Add quantization authoring (PQ k-means), autotune (random search), and full model serialization (dense byte-identical round-trips + semantic checks for quantized).
  • Add a framework-dependent dotnet tool CLI (fasttext) mirroring upstream commands, plus new test oracles/fixtures.

Show a summary per file

| File | Description |
| --- | --- |
| tests/FastText.Net.Tests/VectorTests.cs | Adds vector/id/NN parity tests against a Python-generated oracle. |
| tests/FastText.Net.Tests/vectors_oracle.json | Adds committed oracle data for synthetic models (word/subword/sentence vectors, ids, NN). |
| tests/FastText.Net.Tests/TrainingTests.cs | Adds training quality, round-trip, and determinism tests. |
| tests/FastText.Net.Tests/SerializationTests.cs | Validates byte-identical dense model round-trips and semantic quantized round-trips. |
| tests/FastText.Net.Tests/QuantizationTests.cs | Validates quantization fidelity, determinism, pruning, and guards. |
| tests/FastText.Net.Tests/gen_vectors_oracle.py | Adds script to generate the inference oracle from committed synthetic models. |
| tests/FastText.Net.Tests/FastText.Net.Tests.csproj | Copies the new oracle JSON into test output. |
| tests/FastText.Net.Tests/AutotuneTests.cs | Adds autotune correctness/determinism and size-budget quantization tests. |
| src/FastText.Net/Vector.cs | Adds a managed vector primitive using tensor operations. |
| src/FastText.Net/Training.cs | Implements training loop and exposes FastText.Train* APIs. |
| src/FastText.Net/ProductQuantizer.cs | Adds PQ training (k-means) and serialization support. |
| src/FastText.Net/ModelLoader.cs | Centralizes model load/save and loss construction. |
| src/FastText.Net/Model.cs | Extends model state and implements training forward/backprop for supported losses. |
| src/FastText.Net/MinstdRand.cs | Adds managed std::minstd_rand equivalent plus Gaussian sampling for autotune. |
| src/FastText.Net/Meter.cs | Adds precision/recall/F1 + PR-curve helpers for evaluation/autotune metrics. |
| src/FastText.Net/Matrix.cs | Adds dense/quant matrices, init, norms, update ops, and serialization. |
| src/FastText.Net/FastTextModel.cs | Refactors to use ModelLoader and updated ModelState construction. |
| src/FastText.Net/FastText.cs | Adds full managed fastText facade (vectors/NN/analogies/predict/test/quantize/dump/save). |
| src/FastText.Net/Dictionary.cs | Adds training dictionary construction, discard table, pruning, and richer line parsing. |
| src/FastText.Net/Autotune.cs | Implements autotune search + public API and strategy. |
| src/FastText.Net/Args.cs | Adds args serialization/dump and training-only fields. |
| FastText.Net.sln | Adds the CLI project to the solution. |
| cli/FastText.Net.Cli/README.md | Documents installation and command surface of the CLI tool. |
| cli/FastText.Net.Cli/Program.cs | Implements the fasttext CLI command dispatcher and argument parsing. |
| cli/FastText.Net.Cli/FastText.Net.Cli.csproj | Adds dotnet-tool packaging configuration for the CLI. |

Copilot's findings

  • Files reviewed: 25/25 changed files
  • Comments generated: 6

Comment thread src/FastText.Net/Matrix.cs
Comment thread src/FastText.Net/Model.cs
Comment thread cli/FastText.Net.Cli/Program.cs
Comment thread cli/FastText.Net.Cli/Program.cs
Comment thread src/FastText.Net/ModelLoader.cs
Comment thread src/FastText.Net/Training.cs

- Correct DenseMatrix.Uniform doc comment: blocks are sized total/10, so fewer than 10 threads leave a tail uninitialized (faithful to upstream densematrix.cc), rather than claiming full coverage.

- Restore upstream's '<k> (optional; 10 by default)' clarification to the nn and analogies usage output.

- Fix ModelLoader.Load brace/using formatting.

Behavior is unchanged and remains faithful to upstream fastText.
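
The coverage arithmetic behind the `DenseMatrix.Uniform` note can be sketched as follows. This assumes, based only on the description above, that worker `b` fills the half-open range starting at `b * (total / 10)` and one block long, clipped to `total`; the helper name is hypothetical.

```csharp
using System;

// Number of elements initialized when `threads` workers each fill one block of
// size total / 10 (assumed layout; InitializedCount is a hypothetical helper).
static long InitializedCount(long total, int threads)
{
    long blockSize = total / 10;
    return Math.Min(total, blockSize * threads);
}

Console.WriteLine(InitializedCount(1000, 4));  // 400
Console.WriteLine(InitializedCount(1000, 10)); // 1000
```

Under that assumption, four workers on a 1000-element matrix touch only the first 400 elements, which is the uninitialized tail the corrected doc comment describes.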

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment


Copilot's findings

  • Files reviewed: 25/25 changed files
  • Comments generated: 4

Comment thread src/FastText.Net/FastText.cs
Comment thread src/FastText.Net/Model.cs
Comment thread src/FastText.Net/Training.cs
Comment thread src/FastText.Net/Autotune.cs

- OneVsAllLoss.Forward: precompute a target set instead of List.Contains per output, dropping the inner loop from O(osz*targets) to O(osz).

- Skipgram: reuse a single ngram buffer across tokens instead of allocating one per token, cutting GC pressure on large corpora.

- Document that autotune forces loss to softmax unless pinned, matching upstream.

All changes are result-preserving; the determinism and round-trip tests still pass.
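
The target-set change can be illustrated with a small sketch. It is not the `OneVsAllLoss.Forward` code itself: the function name and the `bool[]` result are assumptions chosen to isolate the lookup pattern, in which the set is built once and each output is then checked in expected constant time.

```csharp
using System;
using System.Collections.Generic;

// Marks which outputs are positive labels. Building the HashSet once replaces a
// linear List.Contains scan per output (LabelOutputs is a hypothetical helper).
static bool[] LabelOutputs(int outputSize, IReadOnlyList<int> targets)
{
    var targetSet = new HashSet<int>(targets); // one pass over the targets
    var isPositive = new bool[outputSize];
    for (int i = 0; i < outputSize; i++)
    {
        isPositive[i] = targetSet.Contains(i); // expected O(1) per output
    }
    return isPositive;
}

bool[] flags = LabelOutputs(5, new[] { 1, 3 });
Console.WriteLine(string.Join(",", flags)); // False,True,False,True,False
```

The labels produced are the same as with a per-output list scan, which is why the change is result-preserving.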

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment


Copilot's findings

  • Files reviewed: 25/25 changed files
  • Comments generated: 5

Comment on lines +641 to +653
    var compacted = new Entry[_words.Length];
    int next = 0;
    for (int i = 0; i < _words.Length; i++)
    {
        if (_words[i].Type == EntryType.Label || (next < words.Count && words[next] == i))
        {
            compacted[next] = _words[i];
            next++;
        }
    }
    _nwords = words.Count;
    _size = _nwords + _nlabels;
    _words = compacted[..next];
Comment on lines +369 to +380
    long minThreshold = 1;
    int pos = 0;
    while (ReadWordBuild(data, ref pos, out ReadOnlySpan<byte> token))
    {
        AddBuild(words, token);
        if (_size > 0.75 * MaxVocabSize)
        {
            minThreshold++;
            ThresholdBuild(words, minThreshold, minThreshold);
        }
    }
    ThresholdBuild(words, _args.MinCount, _args.MinCountLabel);
Comment on lines +259 to +265
    if (_input is not DenseMatrix input || _output is not DenseMatrix output)
    {
        throw new InvalidOperationException("Quantization requires a dense (non-quantized) model.");
    }

    _args.Qout = qout;
    int dim = _args.Dim;
}

[Fact]
public void QuantizeRejectsUnsupervisedModels()
Comment on lines +98 to +99
    public int GetSubwordId(ReadOnlySpan<byte> subword) =>
        _nwords + (int)(Hash(subword) % (uint)_args.Bucket);
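
For readers unfamiliar with the bucket mapping, here is a self-contained sketch of the pattern in the hunk above. It assumes `Hash` follows upstream fastText's `Dictionary::hash`, a 32-bit FNV-1a in which each byte is sign-extended before the XOR; the body of `Hash` in this PR is not shown on this page, so treat both functions below as illustrative rather than as the PR's code. The bucket count is assumed to be positive.

```csharp
using System;
using System.Text;

// 32-bit FNV-1a where each byte is sign-extended before the XOR
// (assumed to match upstream's Dictionary::hash).
static uint Hash(ReadOnlySpan<byte> bytes)
{
    unchecked
    {
        uint h = 2166136261;
        foreach (byte b in bytes)
        {
            h ^= (uint)(sbyte)b; // sign-extend: 0xC3 becomes 0xFFFFFFC3
            h *= 16777619;
        }
        return h;
    }
}

// Maps a subword to a row placed after the word rows; bucket is assumed positive.
static int SubwordId(ReadOnlySpan<byte> subword, int nwords, int bucket) =>
    nwords + (int)(Hash(subword) % (uint)bucket);

byte[] utf8 = Encoding.UTF8.GetBytes("a");
Console.WriteLine(Hash(utf8).ToString("X8"));      // E40C292C
Console.WriteLine(SubwordId(utf8, 1000, 2000000)); // 3220
```

The sign extension only changes results for bytes at or above 0x80, so ASCII-only inputs hash the same as plain FNV-1a.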

2 participants