A Swift 6.2 SPM library for text embeddings, semantic reranking, and NLP analysis on Apple platforms. It provides a unified EmbeddingProvider protocol backed by Apple's NaturalLanguage framework or GPU-accelerated MLX transformer models, tools for loading and embedding research manuscripts and academic program data, and utilities for benchmarking and threshold calibration.
Platforms: macOS 14+ · iOS 17+ · Swift 6.2 · Strict concurrency enabled
- Three embedding backends — offline Apple NLEmbedding (512d), GPU transformer models via MLX (384–1024d), and corpus-based frequency vectors
- Two reranking strategies — embedding-based bi-encoder and MLX cross-encoder for two-stage retrieval pipelines
- Document loaders — parse Markdown manuscripts and academic-program CSV files into labeled embedding corpora
- String analysis extensions — sentiment, readability (ARI), tokenization, stop-word removal, and POS-filtered lemmatization
- Benchmarking & calibration — 10 built-in test sets, threshold sweeping, and precision/recall/F1 reporting
- SwiftUI comparison view — interactive UI for evaluating and comparing embedding models on custom data
Add the package in Xcode via File › Add Package Dependencies, or in Package.swift:
```swift
dependencies: [
    .package(url: "https://github.com/dyerlab/Linguistics", from: "1.0.0")
],
targets: [
    .target(name: "MyTarget", dependencies: ["Linguistics"])
]
```

MLX models are downloaded to `~/.cache/huggingface/hub/` on first use. Apple NLEmbedding and FDL require no downloads.
An embedding converts a piece of text into a numeric vector — a point in a high-dimensional space where semantically related texts land close together. The three backends in this package represent fundamentally different philosophies about how to build that space, each with distinct trade-offs in quality, speed, and resource requirements.
How it works. FDL builds a vocabulary from a corpus you supply at initialization. Every unique token — after lemmatization and stop-word removal — becomes one dimension of the embedding space. To embed a new text, the library counts how many times each vocabulary token appears in that text. The result is a sparse vector whose length equals the vocabulary size.
This is a bag-of-words model: word order does not matter, and meaning is captured purely through the co-occurrence of tokens. Two texts are considered similar if they share many of the same words in similar proportions.
What it is good at. FDL works well in narrow, well-defined domains where the vocabulary is the signal. If you are comparing academic course descriptions within a program, or comparing documents that should share technical terminology, FDL is fast, fully offline, transparent, and interpretable — you can inspect every dimension and know exactly what it represents.
Key limitation. FDL cannot handle synonymy or paraphrase. The sentences "the method was effective" and "the approach worked well" share almost no tokens and will appear maximally dissimilar, even though they mean the same thing. FDL is also sensitive to corpus coverage — words not seen at init time are silently ignored.
Normalization note. FDL vectors are raw frequency counts and are not L2-normalized. Call vector.normal before computing cosine similarity or comparing against vectors from the other providers.
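To make the bag-of-words behavior concrete, here is a minimal self-contained sketch (not the library's implementation) that counts vocabulary tokens into a vector and then demonstrates the synonymy failure described above:

```swift
import Foundation

// Minimal bag-of-words sketch: the vocabulary defines the dimensions,
// and embedding a text means counting how often each vocabulary token
// appears in it (lemmatization and stop-word removal omitted here).
func bagOfWords(_ text: String, vocabulary: [String]) -> [Double] {
    let tokens = text.lowercased()
        .split(whereSeparator: { !$0.isLetter })
        .map(String.init)
    return vocabulary.map { word in
        Double(tokens.filter { $0 == word }.count)
    }
}

// Plain cosine similarity over raw (unnormalized) count vectors.
func cosine(_ a: [Double], _ b: [Double]) -> Double {
    let dot = zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
    let normA = (a.reduce(0) { $0 + $1 * $1 }).squareRoot()
    let normB = (b.reduce(0) { $0 + $1 * $1 }).squareRoot()
    return dot / (normA * normB)
}

let vocab = ["method", "approach", "effective", "worked", "well"]
let a = bagOfWords("the method was effective", vocabulary: vocab)
let b = bagOfWords("the approach worked well", vocabulary: vocab)
print(cosine(a, b)) // 0.0: no shared tokens, despite identical meaning
```

The zero similarity between two paraphrases is exactly the limitation noted above; a transformer backend scores this pair as highly similar.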
How it works. Apple's NaturalLanguage framework ships pretrained word vectors — each individual word in the vocabulary maps to a fixed 512-dimensional vector learned from large text corpora using techniques like GloVe or Word2Vec. To embed a sentence, the library tokenizes the text and averages the vectors of all recognized words. The resulting 512-dimensional vector is L2-normalized before being returned.
This is a static word-vector model: each word always maps to the same vector regardless of context. The word "bank" has one vector whether the sentence is about a river bank or a financial institution.
What it is good at. NLEmbedding is the fastest option by a large margin — embeddings are computed synchronously on the CPU in microseconds and require no internet connection or GPU. The model is part of the OS and needs no download. It performs well for general-purpose similarity tasks, especially at the word and short-phrase level where lexical overlap is a reasonable proxy for meaning.
Key limitation. Because word vectors are averaged, sentence-level structure and context are lost. Long passages with complex meaning tend to converge toward similar "average English" vectors, reducing discrimination. NLEmbedding also has a fixed vocabulary; out-of-vocabulary tokens (technical jargon, proper nouns, domain-specific abbreviations) are silently skipped. Use runSafe when benchmarking to skip pairs where too many tokens go unrecognized.
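The averaging scheme described above can be sketched in a few lines of plain Swift. This is an illustration with toy 3-dimensional vectors, not the library's code (real NLEmbedding vectors are 512-dimensional and come from the OS):

```swift
import Foundation

// Toy "word vectors" standing in for NLEmbedding's pretrained table.
let wordVectors: [String: [Double]] = [
    "cat":    [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
]

// Average the vectors of recognized words, then L2-normalize.
func sentenceVector(_ words: [String]) -> [Double] {
    let known = words.compactMap { wordVectors[$0] } // OOV words are skipped
    var mean = [0.0, 0.0, 0.0]
    for v in known {
        for i in 0..<3 { mean[i] += v[i] / Double(known.count) }
    }
    let norm = (mean.reduce(0) { $0 + $1 * $1 }).squareRoot()
    return mean.map { $0 / norm } // unit length: dot product = cosine
}

let v = sentenceVector(["cat", "kitten", "galaxy"]) // "galaxy" is OOV, dropped
print(v)
```

Note how the out-of-vocabulary token simply vanishes from the average; this is the silent-skipping behavior that `runSafe` guards against during benchmarking.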
How it works. Transformer models process the entire input sequence at once, allowing every token's representation to be shaped by every other token in the sentence (the attention mechanism). The model was pretrained on billions of text examples and then fine-tuned specifically for the task of producing useful sentence embeddings — a process called contrastive learning, where similar sentence pairs are pulled together in the embedding space and dissimilar pairs are pushed apart.
To embed text, MLXEmbeddingService tokenizes the input using the model's own subword tokenizer, runs a full forward pass on the GPU via MLX, and then applies mean pooling over the output token representations, followed by L2 normalization. The result is a dense vector (384–1024 dimensions depending on model) that encodes sentence-level meaning, not just word overlap.
What it is good at. Transformer embeddings handle paraphrase, synonymy, and semantic nuance far better than the other approaches. "The method was effective" and "the approach worked well" will be close in vector space. They also handle domain-specific language better because subword tokenization can decompose unknown words into familiar pieces. The larger models (BGE Large, mxbai-embed-large) approach human-level semantic judgment on standard retrieval benchmarks.
Key limitation. These models require a GPU (Metal), which means they cannot run in command-line swift test contexts without Xcode. The first run requires downloading model weights from HuggingFace Hub (90 MB for MiniLM, up to ~1.2 GB for the 1024d models). Inference is also slower than NLEmbedding — milliseconds per text rather than microseconds — though still fast enough for interactive use on Apple Silicon.
| | FDL | NLEmbedding | MLX Transformer |
|---|---|---|---|
| Captures synonymy / paraphrase | No | Partial | Yes |
| Requires GPU | No | No | Yes |
| Requires download | No | No | Yes (90 MB – 1.2 GB) |
| Speed | Microseconds | Microseconds | Milliseconds |
| Dimensions | Vocab size (variable) | 512 | 384 – 1024 |
| Normalized | No (call `.normal`) | Yes | Yes |
| Best for | Domain term overlap, corpus analysis | Fast general similarity, word-level tasks | Semantic search, reranking, paraphrase detection |
A common pattern is to use NLEmbedding for rapid prototyping and development on-device without a GPU, then switch to an MLX model for production quality. Use FDL when your task is inherently vocabulary-driven — comparing programs by their course terminology, tracking keyword prevalence over time, or any context where shared jargon is the explicit signal of interest.
The central protocol. All backends conform to it:
```swift
public protocol EmbeddingProvider: Sendable {
    func embed(_ text: String) async throws -> Vector
    func embedBatch(_ texts: [String]) async throws -> [Vector]
    func similarity(between: String, and: String) async throws -> Float
    var dimensions: Int { get async throws }
}
```

Vectors are `MatrixStuff.Vector` values. All providers except `FDLEmbeddingService` return L2-normalized vectors — dot product equals cosine similarity.
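The "dot product equals cosine similarity" shortcut for unit vectors is worth seeing once in plain arithmetic. A standalone sketch (independent of the package's `Vector` type):

```swift
import Foundation

func dot(_ a: [Double], _ b: [Double]) -> Double {
    zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
}

func cosine(_ a: [Double], _ b: [Double]) -> Double {
    dot(a, b) / ((dot(a, a)).squareRoot() * (dot(b, b)).squareRoot())
}

let a = [0.6, 0.8] // ‖a‖ = 1
let b = [1.0, 0.0] // ‖b‖ = 1

print(dot(a, b)) // 0.6
print(cosine(a, b)) // same value (within floating-point error)
```

Because the denominators are both 1 for normalized vectors, cosine similarity reduces to the dot product, which is why skipping normalization on `FDLEmbeddingService` output gives misleading scores.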
A Codable, Sendable struct that bundles a vector with its provenance and optional metadata:
```swift
public struct TextEmbedding: Sendable, Codable, Hashable {
    let provider: EmbeddingProviderOption // which model made this
    let vector: Vector                    // L2-normalized embedding
    let scaling: Double                   // optional weight (default 1.0)
    let metadata: [String: String]        // caller-defined labels
}
```

Create one from any provider:

```swift
let embedding = try await provider.embed("Hello world", as: .nlEmbedding)
```

An immutable, Codable collection of TextEmbedding values from a single source (a paper, a program, a document):
```swift
public struct Corpus: Sendable, Codable, Identifiable, Hashable {
    let id: UUID
    let label: String
    let metadata: [String: String]
    let embeddings: [TextEmbedding]
}
```

Uses Apple's on-device NLEmbedding word vectors with average pooling. No downloads, no GPU required. 512-dimensional.
```swift
let service = try NLEmbeddingService()
let vector = try await service.embed("transformer architecture")
let score = try await service.similarity(between: "cat", and: "kitten")

// Word-level operations
let neighbors = service.neighbors(for: "neural", count: 5)
let distance = service.distance(from: "happy", to: "joyful")
```

GPU-accelerated sentence transformers via MLX. Requires Metal (macOS/iOS device). Models are downloaded from HuggingFace Hub on first use.
```swift
// Default: mxbai-embed-large (1024d, ~1.2 GB)
let service = try await MLXEmbeddingService()

// Choose a model
let fast = try await MLXEmbeddingService(model: .miniLM)              // 384d, ~90 MB
let balanced = try await MLXEmbeddingService(model: .bgeBase)         // 768d, ~400 MB
let quality = try await MLXEmbeddingService(model: .bgeLarge)         // 1024d, ~1.2 GB
let quantized = try await MLXEmbeddingService(model: .qwen3Embedding) // 4-bit
let matryoshka = try await MLXEmbeddingService(model: .nomicTextV1_5)

// Custom HuggingFace model
let custom = try await MLXEmbeddingService(model: .custom("sentence-transformers/all-mpnet-base-v2"))

// Download progress
let reporting = try await MLXEmbeddingService(model: .bgeLarge) { progress in
    print("Downloading: \(Int(progress * 100))%")
}
```

Builds a vocabulary from a corpus at init time. Each embedding is a raw frequency-count vector over the vocabulary. Useful for domain-specific bag-of-words comparisons.
```swift
let documents = ["Introduction to machine learning...", "Neural networks and deep learning..."]
let service = FDLEmbeddingService(corpus: documents)
let vector = try await service.embed("supervised learning methods")

// FDL vectors are NOT L2-normalized — normalize before cosine similarity
let normalized = vector.normal // MatrixStuff Vector.normal
```

A Codable enum for tagging vectors with their provenance. Also a factory:
```swift
let provider = try await EmbeddingProviderOption.bgeBase.makeProvider()

// For FDL, pass the corpus:
let fdl = try await EmbeddingProviderOption.fdlEmbedding.makeProvider(corpus: myDocs)
```

Each case has a `displayName`, `abbreviation`, `color` (SwiftUI), and `requiresDownload` flag.
Wraps any EmbeddingProvider. Encodes the query once, then scores all documents via dot product.
```swift
let service = try NLEmbeddingService()
let reranker = EmbeddingReranker(provider: service)

let results = try await reranker.rerank(
    query: "machine learning applications",
    documents: myDocuments,
    topK: 10
)

for result in results {
    print("\(result.score): \(result.item)")
}
```

Jointly encodes query+document pairs. Significantly more accurate than bi-encoders for reranking, at the cost of latency.
```swift
let reranker = try await MLXCrossEncoderReranker(model: .bgeRerankerBase)
// Also: .bgeRerankerLarge, .bgeRerankerV2M3 (multilingual), .custom("hub-id")

let results = try await reranker.rerank(
    query: "climate change effects on biodiversity",
    documents: abstractTexts,
    topK: 5
)
```

Rerank any Sendable type with a text extractor:
```swift
struct Article: Sendable { let title: String; let body: String }

let ranked = try await reranker.rerank(
    query: "protein folding",
    items: articles,
    topK: 3,
    textExtractor: { "\($0.title) \($0.body)" }
)
```

The recommended pattern for large corpora:
```swift
// Stage 1: Fast embedding retrieval (top-100)
let embedder = EmbeddingReranker(provider: try await MLXEmbeddingService(model: .miniLM))
let candidates = try await embedder.rerank(query: query, documents: allDocs, topK: 100)

// Stage 2: Accurate cross-encoder reranking (top-10)
let crossEncoder = try await MLXCrossEncoderReranker(model: .bgeRerankerV2M3)
let final = try await crossEncoder.rerank(
    query: query,
    items: candidates,
    topK: 10,
    textExtractor: { $0.item }
)
```

Parses Markdown files produced by PDF-to-Markdown converters (e.g., marker, nougat). Classifies sections using a DocumentProfile, then embeds each section or paragraph.
```swift
let provider = try NLEmbeddingService()

// Load a single paper
let corpus = try await ManuscriptLoader.load(
    from: paperURL,
    profile: .scientificPaper, // default — covers IMRaD structure
    granularity: .paragraph,   // .section or .paragraph
    using: provider,
    as: .nlEmbedding
)

print(corpus.label)           // first # heading = paper title
print(corpus.metadata["doi"]) // extracted from first 3,000 chars

// Load an entire directory
let corpora = try await ManuscriptLoader.loadAll(
    from: markdownDirectory,
    granularity: .section,
    using: provider,
    as: .nlEmbedding
)
```

Each TextEmbedding in the corpus carries:

- `metadata["part"]` — the `ManuscriptParts` section type (Abstract, Introduction, Methods, Results, Discussion, Other)
- `metadata["granularity"]` — `"section"` or `"paragraph"`
- `metadata["text"]` — the source text
```swift
let labReport = DocumentProfile(
    id: "lab-report",
    displayName: "Lab Report",
    rules: [
        SectionRule(pattern: #"^(purpose|objective)$"#, type: .Introduction),
        SectionRule(pattern: #"^(procedure|protocol)$"#, type: .Methods),
        SectionRule(pattern: #"^(data|observations?)$"#, type: .Results),
    ],
    fallbackType: .Other
)

let corpus = try await ManuscriptLoader.load(
    from: url, profile: labReport, granularity: .section,
    using: provider, as: .nlEmbedding
)
```

Loads a CSV describing university course catalogs and returns one Corpus per academic program. Each course is embedded once per university and reused across programs.
CSV format (header row required):
| University | Program | Course | Title | Credits | Bulletin |
|---|---|---|---|---|---|
| VCU | Biology | BIOL 101 | Principles of Biology | 3 | Introduction to cell biology... |
```swift
let provider = try await MLXEmbeddingService(model: .bgeBase)
let programs = try await AcademicProgramLoader.load(
    from: csvURL,
    using: provider,
    as: .bgeBase
)

for program in programs {
    print("\(program.metadata["university"]!) — \(program.label)")
    print("  \(program.embeddings.count) courses")
}
```

Each TextEmbedding carries:

- `metadata["course"]` — course code
- `metadata["text"]` — `"\(Title) \(Bulletin)"`
- `scaling` — credit hours (as `Double`)

A bundled sample dataset is included at `Sources/Linguistics/Data/vcu_stem_programs.csv`.
String extensions powered by Apple's NaturalLanguage framework.
```swift
let text = "The results were surprisingly effective and well-received."

text.sentiment              // Double: -1.0 (negative) to 1.0 (positive)
text.sentimentScore         // averaged across paragraphs
text.sentenceLevelSentiment // [Double] — one per sentence
text.sentimentString        // emoji: "😊", "😐", "😞"

text.ARI        // Automated Readability Index (grade level)
text.words      // word count
text.sentences  // sentence count
text.paragraphs // paragraph count

let tokens = text.wordTokens(language: .english, minTokenLength: 3, lowercased: true)
let clean = text.tokensWithoutStopwords()

// Access or extend the built-in stop-word list
var stops = String.englishStopwords
stops.insert("however")
let filtered = text.tokensWithoutStopwords(stopwords: stops)

// Content lemmas: nouns, verbs, and adjectives — stop words removed by default
let lemmas = text.contentLemmas()
// e.g. ["result", "surprising", "effective", "receive"]
```

Measures a provider's ability to distinguish semantically similar from dissimilar pairs. The key metric is `discriminationGap` (`avgHighSimilarity − avgLowSimilarity`).
```swift
let benchmark = EmbeddingBenchmark()
let provider = try NLEmbeddingService()

// Use a built-in test set
let result = try await benchmark.runWithReport(
    provider: provider,
    name: "NLEmbedding",
    pairs: EmbeddingBenchmark.scientificPairs
)

print(result.discriminationGap) // higher is better
print(result.accuracy)
print(result.summary)

// runSafe skips pairs where NLEmbedding lacks vocabulary
let safeResult = await benchmark.runSafe(
    provider: provider,
    name: "NLEmbedding",
    pairs: EmbeddingBenchmark.allTestSets.flatMap(\.pairs)
)
```

Built-in test sets: `generalPairs`, `shortPhrasePairs`, `technicalPairs`, `questionPairs`, `scientificPairs`, `singleWordPairs`, `longPassagePairs`, `paraphrasePairs`, `retrievalPairs`, `conversationalPairs`.
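The discrimination gap itself is simple arithmetic over the two groups of similarity scores. A standalone sketch with made-up scores (not output from the library):

```swift
// discriminationGap = mean similarity of pairs labeled similar
//                     minus mean similarity of pairs labeled dissimilar.
// The scores below are invented for illustration.
let highScores = [0.82, 0.78, 0.90] // pairs labeled similar
let lowScores  = [0.35, 0.20, 0.41] // pairs labeled dissimilar

func mean(_ xs: [Double]) -> Double { xs.reduce(0, +) / Double(xs.count) }

let gap = mean(highScores) - mean(lowScores)
print(gap) // ≈ 0.51; a larger gap means the provider separates the groups better
```

A provider whose high- and low-similarity distributions overlap heavily will show a gap near zero regardless of its raw accuracy, which is why the gap is the headline metric.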
Sweeps similarity thresholds over labeled examples to find the best operating point for your use case.
```swift
let calibrator = ThresholdCalibrator.semanticSearch // preset sweep range
// Also: .duplicateDetection, .contentDiscovery

let examples: [LabeledPair] = [
    LabeledPair("neural network", "deep learning", isMatch: true),
    LabeledPair("neural network", "organic chemistry", isMatch: false),
]

let result = try await calibrator.calibrate(provider: provider, examples: examples)
print("Best F1 threshold: \(result.bestF1Threshold)")
print("Best precision threshold: \(result.bestPrecisionThreshold)")
print(result.report) // full threshold sweep table

// Also calibrate a reranker
let rerankerResult = try await calibrator.calibrate(reranker: crossEncoder, examples: examples)
```

An interactive view for evaluating embedding providers on custom text pairs. Drop it into any SwiftUI app to explore model selection, discrimination charts, and threshold recommendations.
```swift
import SwiftUI
import Linguistics

struct ContentView: View {
    var body: some View {
        EmbeddingComparisonView()
    }
}
```

Features:
- Enter custom text pairs or load from built-in test sets
- Run NLEmbedding-only (instant) or full GPU comparison
- View per-provider similarity distributions as charts
- Get automatic provider and threshold recommendations
- Swift 6 strict concurrency throughout. `MLXEmbeddingService` and `MLXCrossEncoderReranker` are actors to isolate Metal/MLX GPU state. All public types are `Sendable`.
- `FDLEmbeddingService` output is not L2-normalized — call `vector.normal` (MatrixStuff) before computing cosine similarity or comparing with other providers.
- Model files are never bundled — MLX models are downloaded from HuggingFace Hub on first use.
- Tests use Swift Testing (`@Test`, `#expect`), not XCTest. MLX tests require Metal and must be run from Xcode — `swift test` from the command line runs NLEmbedding tests only.
- `EmbeddingComparisonViewModel` is `@Observable` + `@MainActor` — do not add `@Published` properties.
MIT