Tokenizers

Since the computation is driven by language units, we need a way to segment texts into the desired units.

A tokenizer can be defined through the tokenizer parameter of the InspectorArgs class. For defining custom tokenizers, see custom components. Off-the-shelf choices are the following:

  • A whitespace tokenizer that goes beyond Latin characters (whitespace, used by default)
  • Any tokenizer from 🤗 Hugging Face, represented by a string hf::$TOKENIZER_NAME, where $TOKENIZER_NAME is the name of a model's tokenizer as indicated in the Hugging Face repository
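To make the two off-the-shelf options concrete, here is a minimal sketch of how a tokenizer string could be interpreted. It is not the library's own implementation: `parse_tokenizer_spec`, `whitespace_tokenize`, and the `"huggingface"` label are hypothetical names used only for illustration, and `bert-base-uncased` is just an example tokenizer name.

```python
from typing import List, Optional, Tuple

HF_PREFIX = "hf::"  # prefix described above for Hugging Face tokenizers


def parse_tokenizer_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: split a tokenizer string into (kind, hf_name)."""
    if spec.startswith(HF_PREFIX):
        name = spec[len(HF_PREFIX):]
        if not name:
            raise ValueError("expected a tokenizer name after 'hf::'")
        return ("huggingface", name)
    # Anything else is treated here as a built-in choice such as "whitespace".
    return (spec, None)


def whitespace_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a whitespace tokenizer.

    str.split() with no arguments splits on any Unicode whitespace,
    so it is not limited to Latin-script text or to the ASCII space.
    """
    return text.split()


print(parse_tokenizer_spec("hf::bert-base-uncased"))
print(parse_tokenizer_spec("whitespace"))
print(whitespace_tokenize("Привет мир"))
print(whitespace_tokenize("你好\u3000世界"))  # U+3000 is an ideographic space
```

In this sketch the name after `hf::` is only extracted, not loaded; in practice it would be the value passed on to the Hugging Face tokenizer loader.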

This wide choice (including custom tokenizers) avoids any assumption about what a language unit actually is, and it broadens the applicability of 🕵️‍♀️ Variationist to a wide range of language varieties.