ObjCTokenizer

Pure-Foundation Objective-C port of HuggingFace's swift-transformers Tokenizers module, packaged as a macOS framework. Loads any tokenizer.json (WordPiece, Unigram, BPE) and produces byte-identical output to HuggingFace's reference Python implementation.

License: MIT (this project). The upstream swift-transformers code is Apache 2.0; its attribution is in NOTICE.

Dependencies: Foundation only. No third-party libraries. No Hub (downloading is out of scope — ship tokenizer.json as a bundle resource alongside your app).

Quick start

@import ObjCTokenizer;

NSError *err = nil;
NSURL *url = [NSBundle.mainBundle URLForResource:@"tokenizer" withExtension:@"json"];
OCTTokenizer *tok = [OCTTokenizer tokenizerWithJSONFileURL:url error:&err];
NSArray<NSNumber *> *ids = [tok encode:@"Hello, world." error:&err];
NSString *text = [tok decode:ids error:&err];

Power-user API with padding / truncation / attention masks:

OCTEncodeOptions *opt = [OCTEncodeOptions new];
opt.maxLength = 512;
opt.padding = OCTPaddingMaxLength;
opt.addSpecialTokens = YES;
OCTEncoding *enc = [tok encodeAsEncoding:text options:opt error:&err];
// enc.ids, enc.attentionMask, enc.tokenTypeIds, enc.offsets

A single OCTTokenizer instance is immutable after init and safe to call from multiple threads.
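
As an illustration of the thread-safety claim above, here is a minimal sketch that shares one tokenizer across a concurrent `dispatch_apply`. The helper name `OCTExampleEncodeAll` is hypothetical and not part of the framework. The sketch also assumes that `encode:error:` returns nil on failure, following the usual Cocoa convention.

```objc
@import Foundation;
@import ObjCTokenizer;

// Hypothetical helper: encodes every input concurrently with one shared tokenizer.
// Returns one id array per input, with NSNull in slots where encoding failed.
static NSArray *OCTExampleEncodeAll(OCTTokenizer *tok, NSArray<NSString *> *inputs) {
    NSUInteger count = inputs.count;
    NSMutableArray *results = [NSMutableArray arrayWithCapacity:count];
    for (NSUInteger i = 0; i < count; i++) {
        [results addObject:[NSNull null]];
    }
    NSLock *lock = [NSLock new];  // guards `results` only; the tokenizer needs no lock

    dispatch_queue_t queue = dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0);
    dispatch_apply(count, queue, ^(size_t i) {
        NSError *err = nil;  // one NSError per call, never shared across threads
        NSArray<NSNumber *> *ids = [tok encode:inputs[i] error:&err];
        if (ids == nil) {  // assumption: nil signals failure
            NSLog(@"encode failed for input %zu: %@", i, err);
            return;
        }
        [lock lock];
        results[i] = ids;
        [lock unlock];
    });
    return results;
}
```

The lock protects only the mutable result array, which is not thread-safe on its own. The tokenizer itself is called from all worker threads without synchronization.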

Build

This is a stand-alone Xcode project. Open ObjCTokenizer.xcodeproj in Xcode and press Cmd+B to build or Cmd+U to run the tests, or build and test from the command line:

xcodebuild -project ObjCTokenizer.xcodeproj -scheme ObjCTokenizer build
xcodebuild -project ObjCTokenizer.xcodeproj -scheme ObjCTokenizer test

The product is ObjCTokenizer.framework with module support (@import ObjCTokenizer; works), @rpath install name, and 16 public headers (umbrella + 15 OCT*.h).

Deployment target is macOS 26.4. The framework has no third-party dependencies and links only Foundation.

Status

Kernel coverage, verified against HuggingFace reference output by the golden-corpus tests:

| Kernel | Family | Test | Status |
| --- | --- | --- | --- |
| WordPiece | BGE-small / BERT | `OCTGoldenCorpusTests.testBGESmall…` | ✅ |
| Unigram | T5-small | `OCTGoldenCorpusTests.testT5Small…` | ✅ |
| BPE | GPT-2 | `OCTGoldenCorpusTests.testGPT2…` | ✅ |
| BPE | Llama-7B | `OCTGoldenCorpusTests.testLlama7b…` | ✅ |

All four families pass byte-identical against a 585-record corpus generated by transformers.AutoTokenizer.from_pretrained(...).

Layout

ObjCTokenizer/ — framework sources. Public OCT*.{h,m} plus Internal/ for symbols that aren't part of the API.
ObjCTokenizer/ObjCTokenizer.h — umbrella header. Imports every public header via <ObjCTokenizer/X.h>.
ObjCTokenizerTests/ — XCTest bundle.
ObjCTokenizerTests/Resources/ — tokenizer JSONs (BGE-small, GPT-2, Llama-7B, T5-small), golden corpora, and the shared corpus.txt.
ObjCTokenizer.xcodeproj/ — Xcode 16+ project using File System Synchronized groups (PBXFileSystemSynchronizedRootGroup), so a new OCT*.{h,m} file is picked up automatically. Public-header status is declared in the publicHeaders exception set inside project.pbxproj.

Regenerating golden corpora

The byte-identity tests pin against a snapshot of HuggingFace's reference output. To refresh them — for example, after upgrading transformers — use the make golden target in the sibling Make-based source repo (../ObjCTokenizers/), which spins up a Python venv with transformers + tokenizers and runs scripts/generate_golden.py. The resulting *_golden.json files are checked into ObjCTokenizerTests/Resources/ and read at test time via [NSBundle bundleForClass:].

This is intentionally a one-off data pipeline rather than an Xcode build phase — corpus regeneration is rare and shouldn't gate every build.
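
Reading a checked-in golden file at test time might look like the sketch below. The class name `OCTExampleResourceTests` and the resource name `example_golden` are placeholders, and the sketch makes no assumption about the JSON layout beyond it being valid JSON. The real tests in OCTGoldenCorpusTests.m may be structured differently.

```objc
#import <XCTest/XCTest.h>

@interface OCTExampleResourceTests : XCTestCase
@end

@implementation OCTExampleResourceTests

- (void)testGoldenFileLoads {
    // Resolve resources from the .xctest bundle, not the main bundle:
    // under XCTest the main bundle is the test runner, not this target.
    NSBundle *bundle = [NSBundle bundleForClass:[self class]];
    NSURL *url = [bundle URLForResource:@"example_golden" withExtension:@"json"];
    XCTAssertNotNil(url, @"golden file missing from test bundle");
    if (url == nil) { return; }

    NSData *data = [NSData dataWithContentsOfURL:url];
    XCTAssertNotNil(data);
    if (data == nil) { return; }

    NSError *err = nil;
    id json = [NSJSONSerialization JSONObjectWithData:data options:0 error:&err];
    XCTAssertNotNil(json, @"golden file is not valid JSON: %@", err);
}

@end
```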

Adding a new tokenizer family

Drop the tokenizer.json into ObjCTokenizerTests/Resources/. It will be auto-bundled into the .xctest by FSSync.
Generate a golden corpus with make golden (or by hand) into the same folder.
Add a runGoldenCorpusFamily: test method in ObjCTokenizerTests/OCTGoldenCorpusTests.m.
If it diverges, the port is wrong — never the test.

Discipline

Port first, refactor later. The Swift original is well-engineered. Avoid "improvements" while porting — they create divergence bugs.
Byte-identical or it's a port bug. Anything less than 585/585 means the port is wrong, not the test.
Foundation primitives over the obvious splitter. Use enumerateSubstringsInRange:options:usingBlock: with NSStringEnumerationByLines instead of componentsSeparatedByString:@"\n"; prefer NSData byte iteration to building intermediate NSStrings when the algorithm is byte-level.
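
The difference between the two splitting approaches can be sketched with plain Foundation. Both helper names below are hypothetical and exist only for illustration; this is not code from the port.

```objc
#import <Foundation/Foundation.h>

// Collects the lines of `text` via Foundation's line enumeration, which treats
// \n, \r\n, \r and the Unicode line/paragraph separators as terminators.
static NSArray<NSString *> *OCTExampleLines(NSString *text) {
    NSMutableArray<NSString *> *lines = [NSMutableArray array];
    [text enumerateSubstringsInRange:NSMakeRange(0, text.length)
                             options:NSStringEnumerationByLines
                          usingBlock:^(NSString *line, NSRange lineRange,
                                       NSRange enclosingRange, BOOL *stop) {
        [lines addObject:line];
    }];
    return lines;
}

// Counts occurrences of one byte value by walking the raw buffer,
// without creating any intermediate NSString.
static NSUInteger OCTExampleCountByte(NSData *data, uint8_t needle) {
    const uint8_t *bytes = data.bytes;
    NSUInteger count = 0;
    for (NSUInteger i = 0; i < data.length; i++) {
        if (bytes[i] == needle) { count++; }
    }
    return count;
}
```

For the input @"a\r\nb\n", OCTExampleLines yields the two lines a and b, whereas componentsSeparatedByString:@"\n" yields a\r, b, and a trailing empty string. A stray carriage return or empty element of that kind is a typical source of divergence from reference output.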
