
botisan-ai/gpt3-tokenizer

GPT3 Tokenizer


This is an isomorphic TypeScript tokenizer for OpenAI's GPT-3 model, with support for both gpt3 and codex tokenization. It should work in both Node.js and browser environments.

Usage

First, install:

yarn add gpt3-tokenizer

In code:

import GPT3Tokenizer from 'gpt3-tokenizer';

const tokenizer = new GPT3Tokenizer({ type: 'gpt3' }); // or 'codex'
const str = "hello 👋 world 🌍";
const encoded: { bpe: number[]; text: string[] } = tokenizer.encode(str);
const decoded = tokenizer.decode(encoded.bpe);
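A common reason to tokenize locally is to check whether a prompt fits within a token budget. The sketch below relies only on the `encode` return shape shown in the snippet above. `TokenEncoder`, `countTokens`, and `fitsBudget` are hypothetical helper names introduced for illustration and are not part of this library.

```typescript
// Structural type matching the encode() result shape from the usage snippet.
// Hypothetical name: the library may export its own types.
interface TokenEncoder {
  encode(text: string): { bpe: number[]; text: string[] };
}

// Number of BPE tokens the encoder produces for `text`.
function countTokens(encoder: TokenEncoder, text: string): number {
  return encoder.encode(text).bpe.length;
}

// True when `text` encodes to at most `maxTokens` tokens.
function fitsBudget(encoder: TokenEncoder, text: string, maxTokens: number): boolean {
  return countTokens(encoder, text) <= maxTokens;
}
```

Because the helpers accept any object with a compatible `encode` method, a `GPT3Tokenizer` instance created as above should be usable as the `encoder` argument, assuming its return type matches the annotation in the usage snippet.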

Reference

This library is based on the following:

The main difference between this library and gpt-3-encoder is that this library supports both gpt3 and codex tokenization. The dictionary is taken directly from OpenAI, so the tokenization result matches the OpenAI Playground. In addition, the Map API is used instead of plain JavaScript objects, most notably for the bpeRanks lookup, which should yield some performance improvement.
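To illustrate the role a rank table plays in byte-pair encoding, here is a minimal sketch of a merge loop that looks up pair ranks in a `Map`. It is not this library's implementation: the merge table entries are made up, the pair key format (`"left right"`) is an assumption, and `bestPair` and `bpe` are hypothetical names.

```typescript
// Toy merge table: "left right" -> rank (lower rank merges first).
// Entries are invented for illustration, not taken from the real vocabulary.
const bpeRanks = new Map<string, number>([
  ['l l', 0],
  ['h e', 1],
  ['he ll', 2],
  ['hell o', 3],
]);

// Find the adjacent pair with the lowest rank, or null if no pair is in the table.
function bestPair(symbols: string[]): [string, string] | null {
  let best: [string, string] | null = null;
  let bestRank = Infinity;
  for (let i = 0; i < symbols.length - 1; i++) {
    const rank = bpeRanks.get(`${symbols[i]} ${symbols[i + 1]}`);
    if (rank !== undefined && rank < bestRank) {
      bestRank = rank;
      best = [symbols[i], symbols[i + 1]];
    }
  }
  return best;
}

// Repeatedly merge the best-ranked pair until no known pair remains.
function bpe(word: string): string[] {
  let symbols = Array.from(word);
  for (let pair = bestPair(symbols); pair !== null; pair = bestPair(symbols)) {
    const [left, right] = pair;
    const merged: string[] = [];
    for (let i = 0; i < symbols.length; i++) {
      if (i < symbols.length - 1 && symbols[i] === left && symbols[i + 1] === right) {
        merged.push(left + right);
        i++; // skip the right half of the merged pair
      } else {
        merged.push(symbols[i]);
      }
    }
    symbols = merged;
  }
  return symbols;
}
```

With this toy table, `bpe('hello')` collapses to a single symbol, while `bpe('help')` stops after merging `h` and `e`. One practical upside of `Map` over a plain object is that lookups never hit inherited keys such as `constructor`; any speed difference depends on the JavaScript engine and workload.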

License

MIT