Example of using text tokenizers and dictionaries #1930

NikolayTV · 2021-11-20T10:49:38Z

Problem: Cannot properly set tokenizers and dictionaries
catboost version: 1.0.3
Operating System: Windows 11
GPU: +

The documentation does not cover this topic properly.
I cannot work out how to set different tokenizers for specific columns.
The code example below also raises an error.


model = CatBoostClassifier(
    task_type = 'GPU',
    tokenizers=[
        {
            'tokenizer_id': 'Sence',
            'separator_type': 'BySense',
            'lowercasing': 'True',
            'token_types':['Word'],
            'sub_tokens_policy':'SeveralTokens'
        }      
    ],
    dictionaries = [
        {
            'dictionary_id': 'BiGram',
            'max_dictionary_size': '50000'
        },
        {
            'dictionary_id': 'Word',
            'max_dictionary_size': '50000'
        }],
)

model.fit(
    train_pool,
    eval_set=val_pool,
    verbose=False, plot=True)

CatBoostError: C:/Program Files (x86)/Go Agent/pipelines/BuildMaster/catboost.git/catboost/private/libs/options/runtime_text_options.cpp:140: No options for tokenizerId Space
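
A likely reading of this error, offered as an assumption rather than a confirmed diagnosis: CatBoost's default feature processing rules refer to a tokenizer with the id `Space`, and the custom `tokenizers` list above declares only `Sence`, so the reference cannot be resolved. The sketch below builds a `text_processing` dictionary in the shape given in the CatBoost documentation, where `feature_processing` names the tokenizers and dictionaries it uses, and checks that every referenced id is declared. The helper `undeclared_ids` and the tokenizer id `Sense` are hypothetical names introduced for illustration. The check is plain Python and does not call CatBoost.

```python
def undeclared_ids(text_processing):
    """Return (tokenizer ids, dictionary ids) that feature_processing
    references but that are never declared in the config."""
    declared_tok = {t["tokenizer_id"] for t in text_processing.get("tokenizers", [])}
    declared_dict = {d["dictionary_id"] for d in text_processing.get("dictionaries", [])}
    missing_tok, missing_dict = set(), set()
    for rules in text_processing.get("feature_processing", {}).values():
        for rule in rules:
            missing_tok |= set(rule.get("tokenizers_names", [])) - declared_tok
            missing_dict |= set(rule.get("dictionaries_names", [])) - declared_dict
    return missing_tok, missing_dict


text_processing = {
    "tokenizers": [{
        "tokenizer_id": "Sense",  # hypothetical id; any name works if referenced consistently
        "separator_type": "BySense",
        "lowercasing": "true",
        "token_types": ["Word"],
        "sub_tokens_policy": "SeveralTokens",
    }],
    "dictionaries": [
        {"dictionary_id": "BiGram", "max_dictionary_size": "50000", "gram_order": "2"},
        {"dictionary_id": "Word", "max_dictionary_size": "50000", "gram_order": "1"},
    ],
    "feature_processing": {
        # Rules applied to every text feature; they point at the ids declared above.
        "default": [{
            "tokenizers_names": ["Sense"],
            "dictionaries_names": ["BiGram", "Word"],
            "feature_calcers": ["BoW"],
        }],
    },
}

missing_tokenizers, missing_dictionaries = undeclared_ids(text_processing)
print("undeclared tokenizers:", missing_tokenizers)
print("undeclared dictionaries:", missing_dictionaries)

# Assumed usage (not executed here):
# model = CatBoostClassifier(task_type="GPU", text_processing=text_processing)
```

If both sets come back empty, the config is at least internally consistent. Whether `feature_processing` also accepts per-column keys alongside `default` in version 1.0.3 is not confirmed here.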

andrey-khropov added the documentation label Oct 10, 2023
