apocryphx/ObjCTokenizer


ObjCTokenizer

Pure-Foundation Objective-C port of HuggingFace's swift-transformers Tokenizers module, packaged as a macOS framework. Loads any tokenizer.json (WordPiece, Unigram, BPE) and produces byte-identical output to HuggingFace's reference Python implementation.

License: MIT for this project. The upstream swift-transformers code is Apache 2.0, with attribution in NOTICE.

Dependencies: Foundation only. No third-party libraries. No Hub (downloading is out of scope — ship tokenizer.json as a bundle resource alongside your app).

Quick start

@import ObjCTokenizer;

NSError *err = nil;
NSURL *url = [NSBundle.mainBundle URLForResource:@"tokenizer" withExtension:@"json"];
OCTTokenizer *tok = [OCTTokenizer tokenizerWithJSONFileURL:url error:&err];
NSArray<NSNumber *> *ids = [tok encode:@"Hello, world." error:&err];
NSString *text = [tok decode:ids error:&err];
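
The quick start above ignores err for brevity. A minimal sketch of the same calls with failure checks might look like the following. It assumes the usual Cocoa convention that each method returns nil and populates the NSError on failure, which this README does not state explicitly. The helper name RoundTripText is hypothetical and not part of the framework.

```objc
@import Foundation;
@import ObjCTokenizer;

// Hypothetical helper (not part of ObjCTokenizer): encode, then decode.
// ASSUMPTION: each OCTTokenizer call returns nil on failure and fills in
// the NSError, following the standard Cocoa error-handling pattern.
static NSString *RoundTripText(NSURL *tokenizerURL, NSString *input, NSError **outError) {
    OCTTokenizer *tok = [OCTTokenizer tokenizerWithJSONFileURL:tokenizerURL error:outError];
    if (tok == nil) {
        return nil; // e.g. missing or malformed tokenizer.json
    }

    NSArray<NSNumber *> *ids = [tok encode:input error:outError];
    if (ids == nil) {
        return nil; // encoding failed; *outError describes why
    }

    // Returns nil if decoding fails.
    return [tok decode:ids error:outError];
}
```

Checking the return value rather than the error pointer is the conventional test, since the NSError is only meaningful once a call has reported failure.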

Power-user API with padding / truncation / attention masks:

OCTEncodeOptions *opt = [OCTEncodeOptions new];
opt.maxLength = 512;
opt.padding = OCTPaddingMaxLength;
opt.addSpecialTokens = YES;
OCTEncoding *enc = [tok encodeAsEncoding:text options:opt error:&err];
// enc.ids, enc.attentionMask, enc.tokenTypeIds, enc.offsets

A single OCTTokenizer instance is immutable after init and safe to call from multiple threads.
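
As an illustration of that guarantee, here is a sketch that encodes a batch of strings concurrently with one shared tokenizer. The helper name EncodeBatch is hypothetical. Only encode:error: is taken from the API shown above, and the sketch again assumes a nil return on failure.

```objc
@import Foundation;
@import ObjCTokenizer;

// Hypothetical helper: encode every string in `texts` in parallel.
// The result has one entry per input: an NSArray<NSNumber *> of ids,
// or NSNull where encoding failed (ASSUMPTION: failure returns nil).
static NSArray *EncodeBatch(OCTTokenizer *tok, NSArray<NSString *> *texts) {
    NSUInteger count = texts.count;
    NSMutableArray *results = [NSMutableArray arrayWithCapacity:count];
    for (NSUInteger i = 0; i < count; i++) {
        [results addObject:NSNull.null]; // placeholder slots
    }

    // The tokenizer is shared without locking; the mutable results array
    // is not thread-safe, so writes to it are serialized with a lock.
    NSLock *lock = [NSLock new];
    dispatch_apply(count, dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0), ^(size_t i) {
        NSError *err = nil;
        NSArray<NSNumber *> *ids = [tok encode:texts[i] error:&err];
        [lock lock];
        results[i] = ids ?: (id)NSNull.null;
        [lock unlock];
    });
    return [results copy];
}
```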

Build

The repository is a stand-alone Xcode project: open ObjCTokenizer.xcodeproj in Xcode and press Cmd+B to build or Cmd+U to test, or use the command line:

xcodebuild -project ObjCTokenizer.xcodeproj -scheme ObjCTokenizer build
xcodebuild -project ObjCTokenizer.xcodeproj -scheme ObjCTokenizer test

The product is ObjCTokenizer.framework with module support (@import ObjCTokenizer; works), @rpath install name, and 16 public headers (umbrella + 15 OCT*.h).

Deployment target is macOS 26.4. The framework has no third-party dependencies and links only Foundation.

Status

Kernel coverage, verified against HuggingFace reference output by the golden-corpus tests:

| Kernel | Family | Test | Status |
| --- | --- | --- | --- |
| WordPiece | BGE-small / BERT | OCTGoldenCorpusTests.testBGESmall… | Pass |
| Unigram | T5-small | OCTGoldenCorpusTests.testT5Small… | Pass |
| BPE | GPT-2 | OCTGoldenCorpusTests.testGPT2… | Pass |
| BPE | Llama-7B | OCTGoldenCorpusTests.testLlama7b… | Pass |

All four families produce byte-identical output on a 585-record corpus generated with transformers.AutoTokenizer.from_pretrained(...).

Layout

  • ObjCTokenizer/ — framework sources. Public OCT*.{h,m} plus Internal/ for symbols that aren't part of the API.
  • ObjCTokenizer/ObjCTokenizer.h — umbrella header. Imports every public header via <ObjCTokenizer/X.h>.
  • ObjCTokenizerTests/ — XCTest bundle.
  • ObjCTokenizerTests/Resources/ — tokenizer JSONs (BGE-small, GPT-2, Llama-7B, T5-small), golden corpora, and the shared corpus.txt.
  • ObjCTokenizer.xcodeproj/ — Xcode 16+ project using File System Synchronized groups (PBXFileSystemSynchronizedRootGroup), so adding a new OCT*.{h,m} file picks up automatically. Public-header status is declared in the publicHeaders exception set inside project.pbxproj.

Regenerating golden corpora

The byte-identity tests pin against a snapshot of HuggingFace's reference output. To refresh them — for example, after upgrading transformers — use the make golden target in the sibling Make-based source repo (../ObjCTokenizers/), which spins up a Python venv with transformers + tokenizers and runs scripts/generate_golden.py. The resulting *_golden.json files are checked into ObjCTokenizerTests/Resources/ and read at test time via [NSBundle bundleForClass:].
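
A rough sketch of the test-time loading step, for orientation only. The class name, the resource name gpt2_golden, and the treatment of the file as generic JSON are all assumptions; the actual file names and record schema are defined by scripts/generate_golden.py and are not shown here.

```objc
@import XCTest;

// Hypothetical test case; not the project's OCTGoldenCorpusTests.
@interface GoldenLoadingSketchTests : XCTestCase
@end

@implementation GoldenLoadingSketchTests

- (void)testGoldenFileIsBundledAndParses {
    // bundleForClass: resolves the .xctest bundle that contains this class;
    // mainBundle is not necessarily the test bundle under XCTest.
    NSBundle *bundle = [NSBundle bundleForClass:self.class];

    // ASSUMPTION: "gpt2_golden" is an illustrative resource name.
    NSURL *url = [bundle URLForResource:@"gpt2_golden" withExtension:@"json"];
    XCTAssertNotNil(url, @"golden corpus not found in the test bundle");
    if (url == nil) { return; }

    NSError *err = nil;
    NSData *data = [NSData dataWithContentsOfURL:url options:0 error:&err];
    XCTAssertNotNil(data, @"%@", err);
    if (data == nil) { return; }

    // Parse as generic JSON; the record layout is left unspecified here.
    id json = [NSJSONSerialization JSONObjectWithData:data options:0 error:&err];
    XCTAssertNotNil(json, @"%@", err);
}

@end
```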

This is intentionally a one-off data pipeline rather than an Xcode build phase — corpus regeneration is rare and shouldn't gate every build.

Adding a new tokenizer family

  1. Drop the tokenizer.json into ObjCTokenizerTests/Resources/. It will be auto-bundled into the .xctest by FSSync.
  2. Generate a golden corpus with make golden (or by hand) into the same folder.
  3. Add a runGoldenCorpusFamily: test method in ObjCTokenizerTests/OCTGoldenCorpusTests.m.
  4. If it diverges, the port is wrong — never the test.

Discipline

  • Port first, refactor later. The Swift original is well-engineered. Avoid "improvements" while porting — they create divergence bugs.
  • Byte-identical or it's a port bug. Anything less than 585/585 means the port is wrong, not the test.
  • Foundation primitives over the obvious splitter. Use enumerateSubstringsInRange:options:usingBlock: with NSStringEnumerationByLines instead of componentsSeparatedByString:@"\n"; prefer iterating over NSData bytes to building intermediate NSStrings when the algorithm is byte-level.
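
The last point can be illustrated with plain Foundation. The sketch below uses two hypothetical helpers, LinesByEnumeration and CountByte, neither of which comes from the framework. The README does not spell out its reasons for the rule; one plausible motivation is that line enumeration handles CRLF and a trailing newline, where a naive split leaves stray "\r" characters and an empty final element.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: collect lines using Foundation's line enumeration.
// Line terminators (\n, \r\n, and others) are excluded from each substring.
static NSArray<NSString *> *LinesByEnumeration(NSString *text) {
    NSMutableArray<NSString *> *lines = [NSMutableArray array];
    [text enumerateSubstringsInRange:NSMakeRange(0, text.length)
                             options:NSStringEnumerationByLines
                          usingBlock:^(NSString * _Nullable line,
                                       NSRange lineRange,
                                       NSRange enclosingRange,
                                       BOOL *stop) {
        if (line != nil) {
            [lines addObject:line];
        }
    }];
    return [lines copy];
}

// Hypothetical helper: count occurrences of one byte value by walking the
// NSData buffer directly, without creating any intermediate NSString.
static NSUInteger CountByte(NSData *data, uint8_t target) {
    const uint8_t *bytes = data.bytes;
    NSUInteger count = 0;
    for (NSUInteger i = 0; i < data.length; i++) {
        if (bytes[i] == target) {
            count++;
        }
    }
    return count;
}
```

For the input @"a\r\nb\n", componentsSeparatedByString:@"\n" returns three elements ("a\r", "b", and an empty string), while LinesByEnumeration returns two ("a" and "b").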
