# Tokenization Tool

The Dolma toolkit comes with a command line tool for tokenizing documents. The tool can be accessed using the `dolma tokens` command.

The tokenizer is optimized for processing large datasets that are split over multiple files. If you are using the Dolma toolkit to curate a dataset, such a split is produced by the `dolma mix` command.

The Dolma tokenizer tool can use any HuggingFace-compatible tokenizer.
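For example, any tokenizer that loads through the HuggingFace `transformers` library should work. The snippet below is only an illustration (the tokenizer name is a stand-in for your own choice) of how to look up the special-token ids that the parameters below expect:

```python
from transformers import AutoTokenizer

# "EleutherAI/gpt-neox-20b" is only an example; substitute the tokenizer you plan to use.
tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# These ids are what tokenizer.bos_token_id / tokenizer.eos_token_id /
# tokenizer.pad_token_id refer to in the parameter table below.
print(tok.bos_token_id, tok.eos_token_id, tok.pad_token_id)
```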

The library employs the following strategy to provide a light shuffling of the data:

- First, paths to input files are shuffled.
- Each parallel process in the tokenizer opens `N` files in parallel for tokenization.
- The process reads a chunk of `k` documents, divided equally between the `N` files (i.e., `k/N` documents per file).
- The process shuffles the documents in the chunk.
- The process writes the output.
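The Python sketch below is purely illustrative (it is not Dolma's actual implementation) and shows the intent of this strategy for a single worker process; `N` corresponds to the `ring_size` parameter and `k` to `batch_size`, both described in the table below:

```python
import random


def tokenize_with_light_shuffle(paths, tokenize, write, ring_size=8, batch_size=10_000, seed=0):
    """Illustrative sketch of the light-shuffling strategy; not Dolma's actual code."""
    rng = random.Random(seed)
    rng.shuffle(paths)                                    # 1. shuffle input file paths
    for i in range(0, len(paths), ring_size):
        ring = [open(p) for p in paths[i:i + ring_size]]  # 2. open up to N files in parallel
        while ring:
            per_file = max(1, batch_size // len(ring))    # 3. read ~k/N documents per open file
            chunk = []
            for f in list(ring):
                docs = [line for _, line in zip(range(per_file), f)]
                if not docs:                              # file exhausted: close and drop it
                    f.close()
                    ring.remove(f)
                chunk.extend(docs)
            rng.shuffle(chunk)                            # 4. shuffle documents within the chunk
            if chunk:
                write([tokenize(doc) for doc in chunk])   # 5. tokenize and write the output
```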

The tokenization library outputs two files: a `.npy` file containing the concatenated tokenized documents, and a `.csv.gz` file containing the metadata for each tokenized document. The metadata file contains the following columns:

- `start`: the start index of the document in the `.npy` file.
- `end`: the end index of the document in the `.npy` file.
- `path`: the path to the original document.
- `id`: the id of the original document.
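As an illustration, a single document's token ids can be recovered from an output pair like this. This is only a sketch: the file names are placeholders, the `uint16` dtype is the default noted below, and the CSV is assumed to list the columns in the order above without a header row; adjust to the actual files if they differ:

```python
import csv
import gzip

import numpy as np

# Placeholder file names; Dolma names the actual output files for you.
tokens = np.memmap("part-00000.npy", dtype=np.uint16, mode="r")

with gzip.open("part-00000.csv.gz", "rt", newline="") as meta:
    for start, end, path, doc_id in csv.reader(meta):  # assumed column order: start, end, path, id
        doc_tokens = tokens[int(start):int(end)]        # token ids for one document
        print(doc_id, path, len(doc_tokens))
        break                                           # just peek at the first document
```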

## Parameters

The following parameters are supported either via CLI (e.g. `dolma tokens --parameter.name value`) or via config file (e.g. `dolma -c config.json tokens`, where `config.json` contains `{"parameter": {"name": "value"}}`):

| Parameter | Required? | Description |
|-----------|-----------|-------------|
| `documents` | Yes | One or more paths to input document files. Paths can contain arbitrary wildcards. Can be local, or an S3-compatible cloud path. |
| `destination` | Yes | One or more paths for output files. Should match the number of `documents` paths. Can be local, or an S3-compatible cloud path. |
| `tokenizer.name_or_path` | Yes | Name or path of the tokenizer to use. Must be a HuggingFace-compatible tokenizer. |
| `tokenizer.bos_token_id` | Yes, if `tokenizer.eos_token_id` is missing | The id of the beginning-of-sequence token. |
| `tokenizer.eos_token_id` | Yes, if `tokenizer.bos_token_id` is missing | The id of the end-of-sequence token. |
| `tokenizer.pad_token_id` | No | The id of the padding token. |
| `tokenizer.segment_before_tokenization` | No | Whether to segment documents by paragraph before tokenization. This is useful for tokenizers like Llama that are very slow on long documents. Might not be needed once this bugfix is merged. Defaults to `False`. |
| `processes` | No | Number of processes to use for tokenization. By default, 1 process is used. |
| `files_per_process` | No | Maximum number of files per tokenization process. By default, only one file is processed at a time. This controls the number of output files generated. |
| `batch_size` | No | Number of sequences `k` to tokenize and shuffle before writing to disk. By default, `k=10000`. |
| `ring_size` | No | Number of files `N` to open in parallel for tokenization. By default, `N=8`. |
| `max_size` | No | Maximum size of an output file in bytes. By default, 1GB. |
| `dtype` | No | Data type for the memmap file; must be a valid numpy dtype. By default, `uint16`. |
| `work_dir.input` | No | Path to a local scratch directory where temporary input files can be placed. If not provided, Dolma will make one for you and delete it upon completion. |
| `work_dir.output` | No | Path to a local scratch directory where temporary output files can be placed. If not provided, Dolma will make one for you and delete it upon completion. |
| `dryrun` | No | If true, only print the configuration and exit without running the tokenizer. |
| `seed` | No | Seed for random number generation. |
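For example, a minimal config file might look like the following (all values are placeholders; use token ids that match your tokenizer, e.g. by checking them with the snippet shown earlier) and be run with `dolma -c config.json tokens`:

```json
{
  "documents": ["s3://my-bucket/my-dataset/documents/*.jsonl.gz"],
  "destination": ["s3://my-bucket/my-dataset/tokenized"],
  "tokenizer": {
    "name_or_path": "EleutherAI/gpt-neox-20b",
    "eos_token_id": 0
  },
  "processes": 8,
  "seed": 3920
}
```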