Pure-Foundation Objective-C port of HuggingFace's swift-transformers
Tokenizers module, packaged as a macOS framework. Loads any
tokenizer.json (WordPiece, Unigram, BPE) and produces byte-identical
output to HuggingFace's reference Python implementation.
License: MIT (this project) over Apache 2.0 (upstream swift-transformers
attribution in NOTICE).
Dependencies: Foundation only. No third-party libraries. No Hub
(downloading is out of scope — ship tokenizer.json as a bundle resource
alongside your app).
@import ObjCTokenizer;
NSError *err = nil;
NSURL *url = [NSBundle.mainBundle URLForResource:@"tokenizer" withExtension:@"json"];
OCTTokenizer *tok = [OCTTokenizer tokenizerWithJSONFileURL:url error:&err];
NSArray<NSNumber *> *ids = [tok encode:@"Hello, world." error:&err];
NSString *text = [tok decode:ids error:&err];

Power-user API with padding / truncation / attention masks:
OCTEncodeOptions *opt = [OCTEncodeOptions new];
opt.maxLength = 512;
opt.padding = OCTPaddingMaxLength;
opt.addSpecialTokens = YES;
OCTEncoding *enc = [tok encodeAsEncoding:text options:opt error:&err];
// enc.ids, enc.attentionMask, enc.tokenTypeIds, enc.offsets

A single OCTTokenizer instance is immutable after init and safe to call
from multiple threads.
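
Because one instance can be shared across threads, a batch can be encoded in parallel without giving each thread its own tokenizer. The following is a minimal sketch of that pattern, not part of the framework. `OCTExampleEncodeAll` is a hypothetical helper. It takes the encoder as a block so it depends only on Foundation; in an app the block would wrap `[tok encode:s error:&err]`. It assumes that call returns nil on failure, which follows Cocoa convention but is not confirmed above.

```objc
@import Foundation;

// Hypothetical helper (not part of ObjCTokenizer): runs `encode` over every
// input on a concurrent queue and returns the results in input order.
// A failed input (encode returned nil) is represented by an empty array.
static NSArray<NSArray<NSNumber *> *> *OCTExampleEncodeAll(
    NSArray<NSString *> *inputs,
    NSArray<NSNumber *> * _Nullable (^encode)(NSString *input)) {
    NSUInteger count = inputs.count;

    // Pre-fill the result so each iteration only replaces its own slot.
    NSMutableArray<NSArray<NSNumber *> *> *results =
        [NSMutableArray arrayWithCapacity:count];
    for (NSUInteger i = 0; i < count; i++) {
        [results addObject:@[]];
    }

    // The lock protects the mutable result array, which is not thread-safe.
    // The shared tokenizer itself needs no lock if it is immutable.
    NSLock *lock = [NSLock new];

    dispatch_apply(count,
                   dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0),
                   ^(size_t i) {
        NSArray<NSNumber *> *ids = encode(inputs[i]);
        if (ids == nil) {
            return; // keep the empty placeholder for this input
        }
        [lock lock];
        results[i] = ids;
        [lock unlock];
    });

    return [results copy];
}

// Usage with a shared tokenizer (assumes `tok` was created as shown earlier):
//
//   NSArray *batch = OCTExampleEncodeAll(texts, ^NSArray<NSNumber *> *(NSString *s) {
//       NSError *err = nil;
//       NSArray<NSNumber *> *ids = [tok encode:s error:&err];
//       if (ids == nil) { NSLog(@"encode failed: %@", err); }
//       return ids;
//   });
```

Each block invocation keeps its own NSError, so no error state is shared between threads.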
The project is a stand-alone Xcode project. Open ObjCTokenizer.xcodeproj
in Xcode and press Cmd+B to build or Cmd+U to test, or run from the command line:
xcodebuild -project ObjCTokenizer.xcodeproj -scheme ObjCTokenizer build
xcodebuild -project ObjCTokenizer.xcodeproj -scheme ObjCTokenizer test

The product is ObjCTokenizer.framework with module support
(@import ObjCTokenizer; works), an @rpath install name, and 16 public
headers (umbrella + 15 OCT*.h).
Deployment target is macOS 26.4. The framework has no third-party dependencies and links only Foundation.
Kernel coverage, verified against HuggingFace reference output by the golden-corpus tests:
| Kernel | Family | Test | Status |
|---|---|---|---|
| WordPiece | BGE-small / BERT | OCTGoldenCorpusTests.testBGESmall… | ✅ |
| Unigram | T5-small | OCTGoldenCorpusTests.testT5Small… | ✅ |
| BPE | GPT-2 | OCTGoldenCorpusTests.testGPT2… | ✅ |
| BPE | Llama-7B | OCTGoldenCorpusTests.testLlama7b… | ✅ |
All four families produce byte-identical output on a 585-record corpus
generated by transformers.AutoTokenizer.from_pretrained(...).
- ObjCTokenizer/: framework sources. Public OCT*.{h,m} plus Internal/ for symbols that aren't part of the API.
- ObjCTokenizer/ObjCTokenizer.h: umbrella header. Imports every public header via <ObjCTokenizer/X.h>.
- ObjCTokenizerTests/: XCTest bundle.
- ObjCTokenizerTests/Resources/: tokenizer JSONs (BGE-small, GPT-2, Llama-7B, T5-small), golden corpora, and the shared corpus.txt.
- ObjCTokenizer.xcodeproj/: Xcode 16+ project using File System Synchronized groups (PBXFileSystemSynchronizedRootGroup), so a new OCT*.{h,m} file is picked up automatically. Public-header status is declared in the publicHeaders exception set inside project.pbxproj.
The byte-identity tests pin against a snapshot of HuggingFace's reference
output. To refresh them — for example, after upgrading transformers —
use the make golden target in the sibling Make-based source repo
(../ObjCTokenizers/), which spins up a Python
venv with transformers + tokenizers and runs scripts/generate_golden.py.
The resulting *_golden.json files are checked into
ObjCTokenizerTests/Resources/ and
read at test time via [NSBundle bundleForClass:].
This is intentionally a one-off data pipeline rather than an Xcode build phase — corpus regeneration is rare and shouldn't gate every build.
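
Reading a resource via [NSBundle bundleForClass:] matters because, under XCTest, the main bundle is the test host rather than the .xctest bundle that holds the golden files. The following is a minimal sketch of such a loader, not the project's actual test code. Both helper names are hypothetical, and the parsed JSON is returned untyped because the golden file schema is not described here.

```objc
@import Foundation;

// Hypothetical helper: parses the raw bytes of a golden file.
// Returns the top-level JSON object untyped, since the schema is not
// specified in this README. Returns nil and sets *error on malformed JSON.
static id OCTExampleParseGolden(NSData *data, NSError **error) {
    return [NSJSONSerialization JSONObjectWithData:data options:0 error:error];
}

// Hypothetical helper: finds <name>.json in the bundle that contains
// `testClass` (the .xctest bundle when called from a test case) and parses it.
static id OCTExampleLoadGolden(Class testClass, NSString *name, NSError **error) {
    NSBundle *bundle = [NSBundle bundleForClass:testClass];
    NSURL *url = [bundle URLForResource:name withExtension:@"json"];
    if (url == nil) {
        if (error != NULL) {
            *error = [NSError errorWithDomain:NSCocoaErrorDomain
                                         code:NSFileNoSuchFileError
                                     userInfo:@{NSLocalizedDescriptionKey:
                                                    @"golden resource not found in bundle"}];
        }
        return nil;
    }
    NSData *data = [NSData dataWithContentsOfURL:url options:0 error:error];
    if (data == nil) {
        return nil;
    }
    return OCTExampleParseGolden(data, error);
}
```

Inside a test case the call would look like `OCTExampleLoadGolden([self class], @"gpt2_golden", &err)`, where the resource name is a placeholder for whichever *_golden.json file the test needs.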
- Drop the tokenizer.json into ObjCTokenizerTests/Resources/. It will be auto-bundled into the .xctest by FSSync.
- Generate a golden corpus with make golden (or by hand) into the same folder.
- Add a runGoldenCorpusFamily: test method in ObjCTokenizerTests/OCTGoldenCorpusTests.m.
- If it diverges, the port is wrong, never the test.
- Port first, refactor later. The Swift original is well-engineered. Avoid "improvements" while porting — they create divergence bugs.
- Byte-identical or it's a port bug. Anything less than 585/585 means the port is wrong, not the test.
- Foundation primitives over the obvious splitter. Use enumerateSubstringsInRange:options:usingBlock: with NSStringEnumerationByLines instead of componentsSeparatedByString:@"\n"; prefer NSData byte iteration to building intermediate NSStrings when the algorithm is byte-level.
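
The last point can be sketched with two small Foundation-only helpers. Both names are hypothetical and only illustrate the difference; they are not taken from the port. The line enumerator recognises \r\n as well as \n and yields no trailing empty entry, whereas splitting on @"\n" leaves the \r attached and adds an empty final component when the text ends with a newline.

```objc
@import Foundation;

// Hypothetical helper: collects lines with the Foundation line enumerator.
static NSArray<NSString *> *OCTExampleLines(NSString *text) {
    NSMutableArray<NSString *> *lines = [NSMutableArray array];
    [text enumerateSubstringsInRange:NSMakeRange(0, text.length)
                             options:NSStringEnumerationByLines
                          usingBlock:^(NSString * _Nullable line,
                                       NSRange lineRange,
                                       NSRange enclosingRange,
                                       BOOL *stop) {
        if (line != nil) {
            [lines addObject:line];
        }
    }];
    return [lines copy];
}

// Hypothetical helper: counts bytes with the high bit set by walking the raw
// buffer, without creating any intermediate NSString objects.
static NSUInteger OCTExampleCountNonASCIIBytes(NSData *data) {
    __block NSUInteger count = 0;
    [data enumerateByteRangesUsingBlock:^(const void *bytes,
                                          NSRange byteRange,
                                          BOOL *stop) {
        const uint8_t *p = (const uint8_t *)bytes;
        for (NSUInteger i = 0; i < byteRange.length; i++) {
            if (p[i] & 0x80) {
                count++;
            }
        }
    }];
    return count;
}
```

For the input @"a\r\nb\nc\n", OCTExampleLines returns three lines (a, b, c), while componentsSeparatedByString:@"\n" returns four components: "a\r", "b", "c", and an empty string.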