A Swift 6.2 SPM library for text embeddings, semantic reranking, and NLP analysis on Apple platforms. It provides a unified EmbeddingProvider protocol backed by Apple's NaturalLanguage framework or GPU-accelerated MLX transformer models, tools for loading and embedding research manuscripts and academic program data, and utilities for benchmarking and threshold calibration.
Platforms: macOS 14+ · iOS 17+ · Swift 6.2 · Strict concurrency enabled
- Three embedding backends — offline Apple NLEmbedding (512d), GPU transformer models via MLX (384–1024d), and corpus-based frequency vectors
- Two reranking strategies — embedding-based bi-encoder and MLX cross-encoder for two-stage retrieval pipelines
- Document loaders — parse Markdown manuscripts and academic-program CSV files into labeled embedding corpora
- String analysis extensions — sentiment, readability (ARI), tokenization, stop-word removal, and POS-filtered lemmatization
- Benchmarking & calibration — 10 built-in test sets, threshold sweeping, and precision/recall/F1 reporting
- SwiftUI comparison view — interactive UI for evaluating and comparing embedding models on custom data
Add the package in Xcode via File › Add Package Dependencies, or in Package.swift:
```swift
dependencies: [
    .package(url: "https://github.com/dyerlab/Linguistics", from: "1.0.0")
],
targets: [
    .target(name: "MyTarget", dependencies: ["Linguistics"])
]
```

MLX models are downloaded to `~/.cache/huggingface/hub/` on first use. Apple NLEmbedding and FDL require no downloads.
An embedding converts a piece of text into a numeric vector — a point in a high-dimensional space where semantically related texts land close together. The three backends in this package represent fundamentally different philosophies about how to build that space, each with distinct trade-offs in quality, speed, and resource requirements.
How it works. FDL builds a vocabulary from a corpus you supply at initialization. Every unique token — after lemmatization and stop-word removal — becomes one dimension of the embedding space. To embed a new text, the library counts how many times each vocabulary token appears in that text. The result is a sparse vector whose length equals the vocabulary size.
This is a bag-of-words model: word order does not matter, and meaning is captured purely through the co-occurrence of tokens. Two texts are considered similar if they share many of the same words in similar proportions.
What it is good at. FDL works well in narrow, well-defined domains where the vocabulary is the signal. If you are comparing academic course descriptions within a program, or comparing documents that should share technical terminology, FDL is fast, fully offline, transparent, and interpretable — you can inspect every dimension and know exactly what it represents.
Key limitation. FDL cannot handle synonymy or paraphrase. The sentences "the method was effective" and "the approach worked well" share almost no tokens and will appear maximally dissimilar, even though they mean the same thing. FDL is also sensitive to corpus coverage — words not seen at init time are silently ignored.
Normalization note. FDL vectors are raw frequency counts and are not L2-normalized. Call vector.normal before computing cosine similarity or comparing against vectors from the other providers.
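To make the bag-of-words behavior concrete, here is a minimal self-contained sketch (not the library's implementation) that counts vocabulary tokens into a vector and then demonstrates the synonymy failure described above:

```swift
import Foundation

// Minimal bag-of-words sketch: the vocabulary defines the dimensions,
// and embedding a text means counting how often each vocabulary token
// appears in it (lemmatization and stop-word removal omitted here).
func bagOfWords(_ text: String, vocabulary: [String]) -> [Double] {
    let tokens = text.lowercased()
        .split(whereSeparator: { !$0.isLetter })
        .map(String.init)
    return vocabulary.map { word in
        Double(tokens.filter { $0 == word }.count)
    }
}

// Plain cosine similarity over raw (unnormalized) count vectors.
func cosine(_ a: [Double], _ b: [Double]) -> Double {
    let dot = zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
    let normA = (a.reduce(0) { $0 + $1 * $1 }).squareRoot()
    let normB = (b.reduce(0) { $0 + $1 * $1 }).squareRoot()
    return dot / (normA * normB)
}

let vocab = ["method", "approach", "effective", "worked", "well"]
let a = bagOfWords("the method was effective", vocabulary: vocab)
let b = bagOfWords("the approach worked well", vocabulary: vocab)
print(cosine(a, b)) // 0.0: no shared tokens, despite identical meaning
```

The zero similarity between two paraphrases is exactly the limitation noted above; a transformer backend scores this pair as highly similar.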
How it works. Apple's NaturalLanguage framework ships pretrained word vectors — each individual word in the vocabulary maps to a fixed 512-dimensional vector learned from large text corpora using techniques like GloVe or Word2Vec. To embed a sentence, the library tokenizes the text and averages the vectors of all recognized words. The resulting 512-dimensional vector is L2-normalized before being returned.
This is a static word-vector model: each word always maps to the same vector regardless of context. The word "bank" has one vector whether the sentence is about a river bank or a financial institution.
What it is good at. NLEmbedding is the fastest option by a large margin — embeddings are computed synchronously on the CPU in microseconds and require no internet connection or GPU. The model is part of the OS and needs no download. It performs well for general-purpose similarity tasks, especially at the word and short-phrase level where lexical overlap is a reasonable proxy for meaning.
Key limitation. Because word vectors are averaged, sentence-level structure and context are lost. Long passages with complex meaning tend to converge toward similar "average English" vectors, reducing discrimination. NLEmbedding also has a fixed vocabulary; out-of-vocabulary tokens (technical jargon, proper nouns, domain-specific abbreviations) are silently skipped. Use runSafe when benchmarking to skip pairs where too many tokens go unrecognized.
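The averaging scheme described above can be sketched in a few lines of plain Swift. This is an illustration with toy 3-dimensional vectors, not the library's code (real NLEmbedding vectors are 512-dimensional and come from the OS):

```swift
import Foundation

// Toy "word vectors" standing in for NLEmbedding's pretrained table.
let wordVectors: [String: [Double]] = [
    "cat":    [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
]

// Average the vectors of recognized words, then L2-normalize.
func sentenceVector(_ words: [String]) -> [Double] {
    let known = words.compactMap { wordVectors[$0] } // OOV words are skipped
    var mean = [0.0, 0.0, 0.0]
    for v in known {
        for i in 0..<3 { mean[i] += v[i] / Double(known.count) }
    }
    let norm = (mean.reduce(0) { $0 + $1 * $1 }).squareRoot()
    return mean.map { $0 / norm } // unit length: dot product = cosine
}

let v = sentenceVector(["cat", "kitten", "galaxy"]) // "galaxy" is OOV, dropped
print(v)
```

Note how the out-of-vocabulary token simply vanishes from the average; this is the silent-skipping behavior that `runSafe` guards against during benchmarking.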
How it works. Transformer models process the entire input sequence at once, allowing every token's representation to be shaped by every other token in the sentence (the attention mechanism). The model was pretrained on billions of text examples and then fine-tuned specifically for the task of producing useful sentence embeddings — a process called contrastive learning, where similar sentence pairs are pulled together in the embedding space and dissimilar pairs are pushed apart.
To embed text, MLXEmbeddingService tokenizes the input using the model's own subword tokenizer, runs a full forward pass on the GPU via MLX, and then applies mean pooling over the output token representations, followed by L2 normalization. The result is a dense vector (384–1024 dimensions depending on model) that encodes sentence-level meaning, not just word overlap.
What it is good at. Transformer embeddings handle paraphrase, synonymy, and semantic nuance far better than the other approaches. "The method was effective" and "the approach worked well" will be close in vector space. They also handle domain-specific language better because subword tokenization can decompose unknown words into familiar pieces. The larger models (BGE Large, mxbai-embed-large) approach human-level semantic judgment on standard retrieval benchmarks.
Key limitation. These models require a GPU (Metal), which means they cannot run in command-line swift test contexts without Xcode. The first run requires downloading model weights from HuggingFace Hub (90 MB for MiniLM, up to ~1.2 GB for the 1024d models). Inference is also slower than NLEmbedding — milliseconds per text rather than microseconds — though still fast enough for interactive use on Apple Silicon.
| | FDL | NLEmbedding | MLX Transformer |
|---|---|---|---|
| Captures synonymy / paraphrase | No | Partial | Yes |
| Requires GPU | No | No | Yes |
| Requires download | No | No | Yes (90 MB – 1.2 GB) |
| Speed | Microseconds | Microseconds | Milliseconds |
| Dimensions | Vocab size (variable) | 512 | 384 – 1024 |
| Normalized | No (call `.normal`) | Yes | Yes |
| Best for | Domain term overlap, corpus analysis | Fast general similarity, word-level tasks | Semantic search, reranking, paraphrase detection |
A common pattern is to use NLEmbedding for rapid prototyping and development on-device without a GPU, then switch to an MLX model for production quality. Use FDL when your task is inherently vocabulary-driven — comparing programs by their course terminology, tracking keyword prevalence over time, or any context where shared jargon is the explicit signal of interest.
The central protocol. All backends conform to it:
```swift
public protocol EmbeddingProvider: Sendable {
    func embed(_ text: String) async throws -> Vector
    func embedBatch(_ texts: [String]) async throws -> [Vector]
    func similarity(between: String, and: String) async throws -> Float
    var dimensions: Int { get async throws }
}
```

Vectors are `MatrixStuff.Vector` values. All providers except `FDLEmbeddingService` return L2-normalized vectors — dot product equals cosine similarity.
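The "dot product equals cosine similarity" shortcut for unit vectors is worth seeing once in plain arithmetic. A standalone sketch (independent of the package's `Vector` type):

```swift
import Foundation

func dot(_ a: [Double], _ b: [Double]) -> Double {
    zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
}

func cosine(_ a: [Double], _ b: [Double]) -> Double {
    dot(a, b) / ((dot(a, a)).squareRoot() * (dot(b, b)).squareRoot())
}

let a = [0.6, 0.8] // ‖a‖ = 1
let b = [1.0, 0.0] // ‖b‖ = 1

print(dot(a, b)) // 0.6
print(cosine(a, b)) // same value (within floating-point error)
```

Because the denominators are both 1 for normalized vectors, cosine similarity reduces to the dot product, which is why skipping normalization on `FDLEmbeddingService` output gives misleading scores.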
A Codable, Sendable struct that bundles a vector with its provenance and optional metadata:
```swift
public struct TextEmbedding: Sendable, Codable, Hashable {
    let provider: EmbeddingProviderOption // which model made this
    let vector: Vector                    // L2-normalized embedding
    let scaling: Double                   // optional weight (default 1.0)
    let metadata: [String: String]        // caller-defined labels
}
```

Create one from any provider:

```swift
let embedding = try await provider.embed("Hello world", as: .nlEmbedding)
```

An immutable, Codable collection of TextEmbedding values from a single source (a paper, a program, a document):
```swift
public struct Corpus: Sendable, Codable, Identifiable, Hashable {
    let id: UUID
    let label: String
    let metadata: [String: String]
    let embeddings: [TextEmbedding]
}
```

Uses Apple's on-device NLEmbedding word vectors with average pooling. No downloads, no GPU required. 512-dimensional.
```swift
let service = try NLEmbeddingService()
let vector = try await service.embed("transformer architecture")
let score = try await service.similarity(between: "cat", and: "kitten")

// Word-level operations
let neighbors = service.neighbors(for: "neural", count: 5)
let distance = service.distance(from: "happy", to: "joyful")
```

GPU-accelerated sentence transformers via MLX. Requires Metal (macOS/iOS device). Models are downloaded from HuggingFace Hub on first use.
```swift
// Default: mxbai-embed-large (1024d, ~1.2 GB)
let service = try await MLXEmbeddingService()

// Choose a model
let fast = try await MLXEmbeddingService(model: .miniLM)              // 384d, ~90 MB
let balanced = try await MLXEmbeddingService(model: .bgeBase)         // 768d, ~400 MB
let quality = try await MLXEmbeddingService(model: .bgeLarge)         // 1024d, ~1.2 GB
let quantized = try await MLXEmbeddingService(model: .qwen3Embedding) // 4-bit
let matryoshka = try await MLXEmbeddingService(model: .nomicTextV1_5)

// Custom HuggingFace model
let custom = try await MLXEmbeddingService(model: .custom("sentence-transformers/all-mpnet-base-v2"))

// Download progress
let reporting = try await MLXEmbeddingService(model: .bgeLarge) { progress in
    print("Downloading: \(Int(progress * 100))%")
}
```

Builds a vocabulary from a corpus at init time. Each embedding is a raw frequency-count vector over the vocabulary. Useful for domain-specific bag-of-words comparisons.
```swift
let documents = ["Introduction to machine learning...", "Neural networks and deep learning..."]
let service = FDLEmbeddingService(corpus: documents)
let vector = try await service.embed("supervised learning methods")

// FDL vectors are NOT L2-normalized — normalize before cosine similarity
let normalized = vector.normal // MatrixStuff Vector.normal
```

A Codable enum for tagging vectors with their provenance. Also a factory:
```swift
let provider = try await EmbeddingProviderOption.bgeBase.makeProvider()

// For FDL, pass the corpus:
let fdl = try await EmbeddingProviderOption.fdlEmbedding.makeProvider(corpus: myDocs)
```

Each case has a `displayName`, `abbreviation`, `color` (SwiftUI), and `requiresDownload` flag.
Wraps any EmbeddingProvider. Encodes the query once, then scores all documents via dot product.
```swift
let service = try NLEmbeddingService()
let reranker = EmbeddingReranker(provider: service)

let results = try await reranker.rerank(
    query: "machine learning applications",
    documents: myDocuments,
    topK: 10
)

for result in results {
    print("\(result.score): \(result.item)")
}
```

Jointly encodes query+document pairs. Significantly more accurate than bi-encoders for reranking, at the cost of latency.
```swift
let reranker = try await MLXCrossEncoderReranker(model: .bgeRerankerBase)
// Also: .bgeRerankerLarge, .bgeRerankerV2M3 (multilingual), .custom("hub-id")

let results = try await reranker.rerank(
    query: "climate change effects on biodiversity",
    documents: abstractTexts,
    topK: 5
)
```

Rerank any Sendable type with a text extractor:
```swift
struct Article: Sendable { let title: String; let body: String }

let ranked = try await reranker.rerank(
    query: "protein folding",
    items: articles,
    topK: 3,
    textExtractor: { "\($0.title) \($0.body)" }
)
```

The recommended pattern for large corpora:
```swift
// Stage 1: Fast embedding retrieval (top-100)
let embedder = EmbeddingReranker(provider: try await MLXEmbeddingService(model: .miniLM))
let candidates = try await embedder.rerank(query: query, documents: allDocs, topK: 100)

// Stage 2: Accurate cross-encoder reranking (top-10)
let crossEncoder = try await MLXCrossEncoderReranker(model: .bgeRerankerV2M3)
let final = try await crossEncoder.rerank(
    query: query,
    items: candidates,
    topK: 10,
    textExtractor: { $0.item }
)
```

Parses Markdown files produced by PDF-to-Markdown converters (e.g., marker, nougat). Classifies sections using a DocumentProfile, then embeds each section or paragraph.
```swift
let provider = try NLEmbeddingService()

// Load a single paper
let corpus = try await ManuscriptLoader.load(
    from: paperURL,
    profile: .scientificPaper, // default — covers IMRaD structure
    granularity: .paragraph,   // .section or .paragraph
    using: provider,
    as: .nlEmbedding
)

print(corpus.label)           // first # heading = paper title
print(corpus.metadata["doi"]) // extracted from first 3,000 chars

// Load an entire directory
let corpora = try await ManuscriptLoader.loadAll(
    from: markdownDirectory,
    granularity: .section,
    using: provider,
    as: .nlEmbedding
)
```

Each TextEmbedding in the corpus carries:

- `metadata["part"]` — the `ManuscriptParts` section type (Abstract, Introduction, Methods, Results, Discussion, Other)
- `metadata["granularity"]` — `"section"` or `"paragraph"`
- `metadata["text"]` — the source text
```swift
let labReport = DocumentProfile(
    id: "lab-report",
    displayName: "Lab Report",
    rules: [
        SectionRule(pattern: #"^(purpose|objective)$"#, type: .Introduction),
        SectionRule(pattern: #"^(procedure|protocol)$"#, type: .Methods),
        SectionRule(pattern: #"^(data|observations?)$"#, type: .Results),
    ],
    fallbackType: .Other
)

let corpus = try await ManuscriptLoader.load(
    from: url, profile: labReport, granularity: .section,
    using: provider, as: .nlEmbedding
)
```

Loads a CSV describing university course catalogs and returns one Corpus per academic program. Each course is embedded once per university and reused across programs.
CSV format (header row required):
| University | Program | Course | Title | Credits | Bulletin |
|---|---|---|---|---|---|
| VCU | Biology | BIOL 101 | Principles of Biology | 3 | Introduction to cell biology... |
```swift
let provider = try await MLXEmbeddingService(model: .bgeBase)
let programs = try await AcademicProgramLoader.load(
    from: csvURL,
    using: provider,
    as: .bgeBase
)

for program in programs {
    print("\(program.metadata["university"]!) — \(program.label)")
    print("  \(program.embeddings.count) courses")
}
```

Each TextEmbedding carries:

- `metadata["course"]` — course code
- `metadata["text"]` — `"\(Title) \(Bulletin)"`
- `scaling` — credit hours (as `Double`)

A bundled sample dataset is included at `Sources/Linguistics/Data/vcu_stem_programs.csv`.
String extensions powered by Apple's NaturalLanguage framework.
```swift
let text = "The results were surprisingly effective and well-received."

text.sentiment              // Double: -1.0 (negative) to 1.0 (positive)
text.sentimentScore         // averaged across paragraphs
text.sentenceLevelSentiment // [Double] — one per sentence
text.sentimentString        // emoji: "😊", "😐", "😞"

text.ARI        // Automated Readability Index (grade level)
text.words      // word count
text.sentences  // sentence count
text.paragraphs // paragraph count

let tokens = text.wordTokens(language: .english, minTokenLength: 3, lowercased: true)
let clean = text.tokensWithoutStopwords()

// Access or extend the built-in stop-word list
var stops = String.englishStopwords
stops.insert("however")
let filtered = text.tokensWithoutStopwords(stopwords: stops)

// Content lemmas: nouns, verbs, and adjectives — stop words removed by default
let lemmas = text.contentLemmas()
// e.g. ["result", "surprising", "effective", "receive"]
```

Measures a provider's ability to distinguish semantically similar from dissimilar pairs. The key metric is `discriminationGap` (`avgHighSimilarity − avgLowSimilarity`).
```swift
let benchmark = EmbeddingBenchmark()
let provider = try NLEmbeddingService()

// Use a built-in test set
let result = try await benchmark.runWithReport(
    provider: provider,
    name: "NLEmbedding",
    pairs: EmbeddingBenchmark.scientificPairs
)

print(result.discriminationGap) // higher is better
print(result.accuracy)
print(result.summary)

// runSafe skips pairs where NLEmbedding lacks vocabulary
let safeResult = await benchmark.runSafe(
    provider: provider,
    name: "NLEmbedding",
    pairs: EmbeddingBenchmark.allTestSets.flatMap(\.pairs)
)
```

Built-in test sets: `generalPairs`, `shortPhrasePairs`, `technicalPairs`, `questionPairs`, `scientificPairs`, `singleWordPairs`, `longPassagePairs`, `paraphrasePairs`, `retrievalPairs`, `conversationalPairs`.
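The discrimination gap itself is simple arithmetic over the two groups of similarity scores. A standalone sketch with made-up scores (not output from the library):

```swift
// discriminationGap = mean similarity of pairs labeled similar
//                     minus mean similarity of pairs labeled dissimilar.
// The scores below are invented for illustration.
let highScores = [0.82, 0.78, 0.90] // pairs labeled similar
let lowScores  = [0.35, 0.20, 0.41] // pairs labeled dissimilar

func mean(_ xs: [Double]) -> Double { xs.reduce(0, +) / Double(xs.count) }

let gap = mean(highScores) - mean(lowScores)
print(gap) // ≈ 0.51; a larger gap means the provider separates the groups better
```

A provider whose high- and low-similarity distributions overlap heavily will show a gap near zero regardless of its raw accuracy, which is why the gap is the headline metric.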
Sweeps similarity thresholds over labeled examples to find the best operating point for your use case.
```swift
let calibrator = ThresholdCalibrator.semanticSearch // preset sweep range
// Also: .duplicateDetection, .contentDiscovery

let examples: [LabeledPair] = [
    LabeledPair("neural network", "deep learning", isMatch: true),
    LabeledPair("neural network", "organic chemistry", isMatch: false),
]

let result = try await calibrator.calibrate(provider: provider, examples: examples)
print("Best F1 threshold: \(result.bestF1Threshold)")
print("Best precision threshold: \(result.bestPrecisionThreshold)")
print(result.report) // full threshold sweep table

// Also calibrate a reranker
let rerankerResult = try await calibrator.calibrate(reranker: crossEncoder, examples: examples)
```

An interactive view for evaluating embedding providers on custom text pairs. Drop it into any SwiftUI app to explore model selection, discrimination charts, and threshold recommendations.
```swift
import SwiftUI
import Linguistics

struct ContentView: View {
    var body: some View {
        EmbeddingComparisonView()
    }
}
```

Features:
- Enter custom text pairs or load from built-in test sets
- Run NLEmbedding-only (instant) or full GPU comparison
- View per-provider similarity distributions as charts
- Get automatic provider and threshold recommendations
- Swift 6 strict concurrency throughout. `MLXEmbeddingService` and `MLXCrossEncoderReranker` are actors to isolate Metal/MLX GPU state. All public types are `Sendable`.
- `FDLEmbeddingService` output is not L2-normalized — call `vector.normal` (MatrixStuff) before computing cosine similarity or comparing with other providers.
- Model files are never bundled — MLX models are downloaded from HuggingFace Hub on first use.
- Tests use Swift Testing (`@Test`, `#expect`), not XCTest. MLX tests require Metal and must be run from Xcode — `swift test` from the command line runs NLEmbedding tests only.
- `EmbeddingComparisonViewModel` is `@Observable` + `@MainActor` — do not add `@Published` properties.
MIT