Port the complete fastText inference API and binary serialization to managed C#, mirroring src/fasttext.h.

Phase 1 (inference completeness):
- New internal Vector (vector.cc) and DenseMatrix ops; Dictionary accessors (getWord/getLabel/getId/getSubwords) and a labels-aware getLine.
- ModelLoader extracts the shared load path; FastTextModel delegates to it.
- New public FastText facade: word/subword/sentence/ngram vectors, word/subword ids, nearest neighbours, analogies, predict, and test() via a ported Meter.
- VectorTests validate vectors/ids/NN against a Python fasttext oracle generated from the committed models without retraining.

Phase 2 (serialization):
- Save methods on Args/Dictionary/Matrix/QuantMatrix/ProductQuantizer, each the inverse of Load; ModelLoader.Save mirrors saveModel. Facade gains SaveModel and SaveVectors/SaveOutput text exporters.
- SerializationTests assert byte-identical round-trips for dense models and a semantic (prediction) round-trip for the quantized lid model.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Port fastText's quantize() to managed code so supervised models can be compressed without the native library:
- ProductQuantizer: k-means training (Estep/Mstep, seeded LCG), Train, ComputeCode(s).
- DenseMatrix: RawData, L2NormRow, DivideRow.
- QuantMatrix: quantizing ctor (qnorm-aware) over a DenseMatrix.
- Dictionary: Prune + EosId for vocabulary cutoff.
- FastText.Quantize(dsub, qnorm, cutoff, qout): selectEmbeddings + prune pipeline; retrain-after-prune throws NotSupported until training lands.
- ModelLoader.BuildModel to rebuild the model after quantizing.

k-means is seeded but std-library shuffle/uniform differ across runtimes, so results are not byte-identical to the original C++ model. Verified by top-1 prediction agreement, word-vector cosine, and quantize/save/reload determinism in QuantizationTests (42/42 pass).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Port the fastText SGD training driver to pure managed C# for full parity: RNG (MinstdRand), corpus building (Dictionary.ReadFromFile + discard subsampling), DenseMatrix uniform init and gradient ops, the backprop Loss.Forward for NS/HS/Softmax/OVA, Model.Update, and a public Train API (FastText.Train/TrainSupervised/TrainUnsupervised + FastTextTrainArgs). Trained models are not byte-identical to native fastText because the std library RNG and distributions differ; training is verified by quality (model learns its corpus), save/load round-trip, and fixed-seed determinism. 46/46 tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
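For readers unfamiliar with the generator named in this commit: `std::minstd_rand` is a linear congruential engine with multiplier 48271, increment 0, and modulus 2^31 - 1. The sketch below shows that recurrence only. The class name `MinstdRandSketch` is hypothetical and is not the PR's `MinstdRand`. Raw engine outputs like these can be matched exactly; the commit's caveat about non-identical models concerns the standard-library distributions layered on top.

```csharp
using System;

// Hypothetical stand-in, not the PR's MinstdRand class.
// Only the recurrence x(n+1) = 48271 * x(n) mod (2^31 - 1) comes from the C++ standard.
public sealed class MinstdRandSketch
{
    private const ulong Multiplier = 48271;    // std::minstd_rand multiplier
    private const ulong Modulus = 2147483647;  // 2^31 - 1
    private uint _state;

    public MinstdRandSketch(uint seed = 1)
    {
        // With a zero increment, a seed congruent to 0 would get stuck at 0,
        // so the C++ engine maps it to 1; this sketch does the same.
        _state = (uint)(seed % Modulus);
        if (_state == 0)
        {
            _state = 1;
        }
    }

    // Advances the state and returns it; 64-bit arithmetic avoids overflow.
    public uint Next()
    {
        _state = (uint)((_state * Multiplier) % Modulus);
        return _state;
    }
}
```

With the default seed of 1, the first value is 48271, and the C++ standard specifies 399268537 as the 10000th value, which makes a convenient conformance check for any port.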
Port fastText's autotune: a time-bounded random search over epoch, lr, dim, wordNgrams, minn/maxn, bucket, dsub, and loss that maximizes a validation metric, optionally under a model-size budget (via product quantization).
- AutotuneStrategy: Gaussian-perturbation sampling around the best-known args with a shrinking schedule (MinstdRand.NextGaussian, Marsaglia polar).
- Meter: precision@recall / recall@precision curves for the non-F1 metrics.
- Public API: FastText.TrainSupervisedAutotune + FastTextAutotuneArgs; FastTextTrainArgs.PinnedArgs to fix parameters; AutotuneMetric enum.
- MaxTrials caps trials for bounded, reproducible runs.

Search is stochastic and not byte-identical to native fastText (std-lib RNG differs); verified by quality, fixed-seed determinism, pinned-arg respect, and the size-constrained quantization path. 50/50 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
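The Marsaglia polar method mentioned for Gaussian sampling can be sketched as follows. This is an illustration under stated assumptions, not the PR's `MinstdRand.NextGaussian`: it draws its uniforms from `System.Random`, and the name `GaussianSketch` is hypothetical.

```csharp
using System;

// Hypothetical illustration of the Marsaglia polar method.
// Assumption: uniforms come from System.Random rather than the PR's MinstdRand.
public static class GaussianSketch
{
    // Returns one sample from the standard normal distribution.
    public static double NextGaussian(Random rng)
    {
        double u, v, s;
        do
        {
            // Draw a point uniformly in the square [-1, 1) x [-1, 1).
            u = 2.0 * rng.NextDouble() - 1.0;
            v = 2.0 * rng.NextDouble() - 1.0;
            s = u * u + v * v;
        }
        while (s >= 1.0 || s == 0.0); // reject points outside the unit disc (and the origin)

        // Scale the accepted point; v * factor would be a second independent sample.
        double factor = Math.Sqrt(-2.0 * Math.Log(s) / s);
        return u * factor;
    }
}
```

A perturbation around a best-known value would then take the form `best + sigma * GaussianSketch.NextGaussian(rng)`, with `sigma` shrinking as the search progresses; the exact schedule is not stated in this commit message.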
Implements a managed re-implementation of fastText's command-line tool mirroring src/main.cc: supervised/skipgram/cbow training, test/test-label, quantize, predict/predict-prob, print-word/sentence-vectors, print-ngrams, nn, analogies, and dump. Adds public FastText.DumpOption + Dump, with internal Dump(TextWriter) on Args, Dictionary, and DenseMatrix to back the dump command. CLI added to the solution and smoke-tested end-to-end. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds PackAsTool/ToolCommandName plus packaging metadata (mirroring the library: license, icon, readme, repo URL) so the CLI installs via 'dotnet tool install -g FastText.Net.Cli' and runs as 'fasttext'. Includes a consumer-focused tool README. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR completes the managed C# source-port of fastText functionality so FastText.Net supports the full upstream surface area (training, quantization authoring, autotune, full model save/load, and a fasttext CLI tool) with verification via deterministic/round-trip/semantic tests.
Changes:
- Add managed training for supervised/cbow/skipgram plus supporting math primitives (vectors/matrices, losses, RNG).
- Add quantization authoring (PQ k-means), autotune (random search), and full model serialization (dense byte-identical round-trips + semantic checks for quantized).
- Add a framework-dependent `dotnet tool` CLI (`fasttext`) mirroring upstream commands, plus new test oracles/fixtures.
Show a summary per file
| File | Description |
|---|---|
| tests/FastText.Net.Tests/VectorTests.cs | Adds vector/id/NN parity tests against a Python-generated oracle. |
| tests/FastText.Net.Tests/vectors_oracle.json | Adds committed oracle data for synthetic models (word/subword/sentence vectors, ids, NN). |
| tests/FastText.Net.Tests/TrainingTests.cs | Adds training quality, round-trip, and determinism tests. |
| tests/FastText.Net.Tests/SerializationTests.cs | Validates byte-identical dense model round-trips and semantic quantized round-trips. |
| tests/FastText.Net.Tests/QuantizationTests.cs | Validates quantization fidelity, determinism, pruning, and guards. |
| tests/FastText.Net.Tests/gen_vectors_oracle.py | Adds script to generate the inference oracle from committed synthetic models. |
| tests/FastText.Net.Tests/FastText.Net.Tests.csproj | Copies the new oracle JSON into test output. |
| tests/FastText.Net.Tests/AutotuneTests.cs | Adds autotune correctness/determinism and size-budget quantization tests. |
| src/FastText.Net/Vector.cs | Adds a managed vector primitive using tensor operations. |
| src/FastText.Net/Training.cs | Implements training loop and exposes FastText.Train* APIs. |
| src/FastText.Net/ProductQuantizer.cs | Adds PQ training (k-means) and serialization support. |
| src/FastText.Net/ModelLoader.cs | Centralizes model load/save and loss construction. |
| src/FastText.Net/Model.cs | Extends model state and implements training forward/backprop for supported losses. |
| src/FastText.Net/MinstdRand.cs | Adds managed std::minstd_rand equivalent plus Gaussian sampling for autotune. |
| src/FastText.Net/Meter.cs | Adds precision/recall/F1 + PR-curve helpers for evaluation/autotune metrics. |
| src/FastText.Net/Matrix.cs | Adds dense/quant matrices, init, norms, update ops, and serialization. |
| src/FastText.Net/FastTextModel.cs | Refactors to use ModelLoader and updated ModelState construction. |
| src/FastText.Net/FastText.cs | Adds full managed fastText facade (vectors/NN/analogies/predict/test/quantize/dump/save). |
| src/FastText.Net/Dictionary.cs | Adds training dictionary construction, discard table, pruning, and richer line parsing. |
| src/FastText.Net/Autotune.cs | Implements autotune search + public API and strategy. |
| src/FastText.Net/Args.cs | Adds args serialization/dump and training-only fields. |
| FastText.Net.sln | Adds the CLI project to the solution. |
| cli/FastText.Net.Cli/README.md | Documents installation and command surface of the CLI tool. |
| cli/FastText.Net.Cli/Program.cs | Implements the fasttext CLI command dispatcher and argument parsing. |
| cli/FastText.Net.Cli/FastText.Net.Cli.csproj | Adds dotnet-tool packaging configuration for the CLI. |
Copilot's findings
- Files reviewed: 25/25 changed files
- Comments generated: 6
- Correct DenseMatrix.Uniform doc comment: blocks are sized total/10, so fewer than 10 threads leave a tail uninitialized (faithful to upstream densematrix.cc), rather than claiming full coverage.
- Restore upstream's '<k> (optional; 10 by default)' clarification to the nn and analogies usage output.
- Fix ModelLoader.Load brace/using formatting.

Behavior is unchanged and remains faithful to upstream fastText.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- OneVsAllLoss.Forward: precompute a target set instead of List.Contains per output, dropping the inner loop from O(osz*targets) to O(osz).
- Skipgram: reuse a single ngram buffer across tokens instead of allocating one per token, cutting GC pressure on large corpora.
- Document that autotune forces loss to softmax unless pinned, matching upstream.

All changes are result-preserving; the determinism and round-trip tests still pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
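The target-set change in the first bullet can be illustrated with a minimal sketch. It is not the PR's `OneVsAllLoss.Forward`: the name `OneVsAllSketch` and the score-based signature are assumptions, and gradient updates are omitted so that only the membership test is shown.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration; the PR's actual method signature is not shown on this page.
public static class OneVsAllSketch
{
    private static double Sigmoid(double x) => 1.0 / (1.0 + Math.Exp(-x));

    // Sums a binary logistic loss over every output, treating each output
    // as positive if its index is one of the targets.
    public static double Loss(IReadOnlyList<double> scores, IEnumerable<int> targets)
    {
        // Built once per example: each lookup below is O(1) on average,
        // instead of a linear scan of the target list for every output.
        var targetSet = new HashSet<int>(targets);

        double loss = 0.0;
        for (int i = 0; i < scores.Count; i++)
        {
            bool isPositive = targetSet.Contains(i);
            double p = Sigmoid(scores[i]);
            loss += isPositive ? -Math.Log(p) : -Math.Log(1.0 - p);
        }
        return loss;
    }
}
```

Because only the lookup structure changes, the summed loss is the same as with a list scan, which is consistent with the commit's claim that the change is result-preserving.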
Comment on lines +641 to +653

    var compacted = new Entry[_words.Length];
    int next = 0;
    for (int i = 0; i < _words.Length; i++)
    {
        if (_words[i].Type == EntryType.Label || (next < words.Count && words[next] == i))
        {
            compacted[next] = _words[i];
            next++;
        }
    }
    _nwords = words.Count;
    _size = _nwords + _nlabels;
    _words = compacted[..next];
Comment on lines +369 to +380

    long minThreshold = 1;
    int pos = 0;
    while (ReadWordBuild(data, ref pos, out ReadOnlySpan<byte> token))
    {
        AddBuild(words, token);
        if (_size > 0.75 * MaxVocabSize)
        {
            minThreshold++;
            ThresholdBuild(words, minThreshold, minThreshold);
        }
    }
    ThresholdBuild(words, _args.MinCount, _args.MinCountLabel);
Comment on lines +259 to +265

    if (_input is not DenseMatrix input || _output is not DenseMatrix output)
    {
        throw new InvalidOperationException("Quantization requires a dense (non-quantized) model.");
    }

    _args.Qout = qout;
    int dim = _args.Dim;
    }

    [Fact]
    public void QuantizeRejectsUnsupervisedModels()
Comment on lines +98 to +99

    public int GetSubwordId(ReadOnlySpan<byte> subword) =>
        _nwords + (int)(Hash(subword) % (uint)_args.Bucket);
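For context on the quoted lines: upstream fastText hashes subwords with a 32-bit FNV-1a variant in which each byte is first cast to a signed 8-bit value, so bytes at or above 0x80 are sign-extended before the XOR. The sketch below assumes the managed `Hash` follows that upstream behaviour; `FastTextHashSketch` and its parameter names are hypothetical.

```csharp
using System;

// Hypothetical illustration, assuming the port mirrors upstream Dictionary::hash.
public static class FastTextHashSketch
{
    // FNV-1a (32-bit) with upstream's signed-byte quirk.
    public static uint Hash(ReadOnlySpan<byte> bytes)
    {
        uint h = 2166136261u; // FNV offset basis
        foreach (byte b in bytes)
        {
            // Sign-extend the byte, as upstream's uint32_t(int8_t(c)) does.
            h ^= unchecked((uint)(sbyte)b);
            h = unchecked(h * 16777619u); // FNV prime, wrapping at 32 bits
        }
        return h;
    }

    // Maps a subword to a row after the word rows, in [wordCount, wordCount + bucket).
    public static int SubwordId(ReadOnlySpan<byte> subword, int wordCount, int bucket) =>
        wordCount + (int)(Hash(subword) % (uint)bucket);
}
```

The sign extension only matters for non-ASCII UTF-8 bytes, but omitting it would send those subwords to different buckets than a model trained by native fastText expects.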
Source-ports the remaining native fastText functionality to managed C#, taking `FastText.Net` from inference-only to full parity with upstream fastText. Work was done in verifiable phases; verification is quality/round-trip/determinism based (not byte-exact) since std-library RNG/distributions differ from C++.

What's new
- `FastText.Train` + `FastTextTrainArgs`
- `FastText.TrainSupervisedAutotune` (Gaussian random search), precision@recall / recall@precision curves in `Meter`
- `fasttext` CLI mirroring upstream `main.cc`: supervised/skipgram/cbow, test/test-label, quantize, predict[-prob], print-word/sentence-vectors, print-ngrams, nn, analogies, dump
- `dotnet tool` (`dotnet tool install -g FastText.Net.Cli`, runs as `fasttext`)

Verification
- `DotnetToolSettings.xml`, installs and runs)

Notes
- `Quantize` retrain path is not yet supported (throws `NotSupportedException`).
- `MaxTrials` enables deterministic tests.