Tokenizers

Since the computation is driven by language units, we need a way to segment texts into the desired units.

A tokenizer can be defined through the tokenizer parameter of the InspectorArgs class. For defining custom tokenizers, see custom components. Off-the-shelf choices are the following:

A whitespace tokenizer that also handles non-Latin scripts (whitespace, which is the default)
Any tokenizer from 🤗 Hugging Face, represented by a string hf::$TOKENIZER_NAME, where $TOKENIZER_NAME is the name of a model's tokenizer as indicated in the Hugging Face repository

This broad choice (including custom tokenizers) avoids any assumption about what a language unit actually is, and it broadens the applicability of 🕵️‍♀️ Variationist to a wide range of language varieties.
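
To make the two off-the-shelf string forms concrete, here is a minimal sketch of how a value such as whitespace or hf::$TOKENIZER_NAME could be told apart. It is not Variationist's actual implementation: the names `parse_tokenizer_spec` and `whitespace_tokenize` are hypothetical helpers introduced only for illustration, and the whitespace splitting shown is plain Python `str.split()`, which may differ from what the library does internally.

```python
from typing import List, Optional, Tuple

HF_PREFIX = "hf::"  # prefix used for Hugging Face tokenizers, as described above


def whitespace_tokenize(text: str) -> List[str]:
    """Illustrative whitespace tokenizer (an assumption, not the library's code).

    str.split() with no arguments splits on any run of Unicode whitespace,
    so it is not limited to texts written in Latin characters.
    """
    return text.split()


def parse_tokenizer_spec(spec: str = "whitespace") -> Tuple[str, Optional[str]]:
    """Hypothetical helper: classify a tokenizer string.

    Returns ("whitespace", None) for the default value, or
    ("huggingface", name) for a string of the form hf::name.
    """
    if spec == "whitespace":
        return ("whitespace", None)
    if spec.startswith(HF_PREFIX):
        name = spec[len(HF_PREFIX):]
        if not name:
            raise ValueError("expected a tokenizer name after 'hf::'")
        return ("huggingface", name)
    raise ValueError(f"unrecognized tokenizer spec: {spec!r}")


if __name__ == "__main__":
    print(parse_tokenizer_spec())                         # default choice
    print(parse_tokenizer_spec("hf::bert-base-uncased"))  # Hugging Face tokenizer name
    print(whitespace_tokenize("Ciao  mondo\n你好 世界"))
```

In actual use, the chosen string is what would be passed as the tokenizer parameter of InspectorArgs. The model name bert-base-uncased is only an example of a Hugging Face repository name.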
