
Tokenizer Config en


Configure the tokenizer

When you configure the tokenizer, you must specify the analyzers, tokenizer_config, and traditional_tables parameters. The following configuration file, named analyzer.json, provides an example:


{
    "analyzers":
    {
        "simple_analyzer":
        {
            "tokenizer_name" : "simple_tab_tokenzier",
            "stopwords" : [],
            "normalize_options" :
            {
                "case_sensitive" : true,
                "traditional_sensitive" : true,
                "width_sensitive" : true
            }
        },
        "singlews_analyzer":
        {
            "tokenizer_name" : "singlews_tokenzier",
            "stopwords" : [],
            "normalize_options" :
            {
                "case_sensitive" : true,
                "traditional_sensitive" : true,
                "width_sensitive" : true
            }
        }
    },
    "tokenizer_config" : {
        "modules" : [
            {
                "module_name": "TokenizerModule1",
                "module_path": "libTokenizerModule1.so",
                "parameters": {
                }
            }
        ],
        "tokenizers" : [
            {
                "tokenizer_name" : "simple_tab_tokenzier",
                "tokenizer_type" : "simple",
                "module_name"    : "",
                "parameters": {
                    "delimiter" : "\t"
                }
            },
            {
                "tokenizer_name" : "singlews_tokenzier",
                "tokenizer_type" : "aliws",
                "module_name"    : "",
                "parameters": {
                }
            }
        ]
    },
    "traditional_tables": [
        {
            "table_name" : "table_name",
            "transform_table" : {
                "0x4E7E" : "0x4E7E"
            }
        }
    ]
}

In this example, the transform_table entry maps 0x4E7E (the traditional Chinese character "乾") to itself, which specifies that this character is not converted to a simplified Chinese character.
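The sections are tied together by name: each analyzer's tokenizer_name must match a tokenizer defined in tokenizer_config, and the keys and values of transform_table are Unicode code points written in hexadecimal. The following Python sketch only illustrates these relationships; it is not part of the product and assumes the example above is saved as analyzer.json.

    import json

    # Illustrative sanity check (not part of the product): verify that every
    # analyzer references a tokenizer that is defined in tokenizer_config.
    with open("analyzer.json") as f:
        config = json.load(f)

    defined = {t["tokenizer_name"] for t in config["tokenizer_config"]["tokenizers"]}

    for name, analyzer in config["analyzers"].items():
        if analyzer["tokenizer_name"] not in defined:
            raise ValueError(f"analyzer {name} references an undefined tokenizer")

    # transform_table keys and values are Unicode code points in hexadecimal.
    for table in config["traditional_tables"]:
        for src, dst in table["transform_table"].items():
            print(chr(int(src, 16)), "->", chr(int(dst, 16)))   # prints: 乾 -> 乾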

  • stopwords: the stopwords to filter out during analysis. Each stopword can be an arbitrary string. You can also leave this field empty.

  • case_sensitive: specifies whether the text that the tokenizer processes is case-sensitive. If it is case-sensitive, the original text is retained. If it is not case-sensitive, uppercase letters in the original text are converted to lowercase letters.

  • traditional_sensitive: specifies whether the tokenizer distinguishes simplified and traditional Chinese characters. If the tokenizer distinguishes simplified and traditional Chinese characters, the original text that contains traditional Chinese characters is retained. If the tokenizer does not distinguish simplified and traditional Chinese characters, all traditional Chinese characters in the original text are converted to simplified Chinese characters.

  • width_sensitive: specifies whether the text is width-sensitive. If it is width-sensitive, the original text is retained. If it is not width-sensitive, all full-width characters in the original text are converted to half-width characters (see the sketch after this list).

  • traditional_table_name: specifies the table that is used to convert traditional Chinese characters to simplified Chinese characters. When traditional_sensitive is set to false, all traditional Chinese characters are converted to simplified Chinese characters by default. To prevent specific traditional Chinese characters from being converted, define a table in traditional_tables and reference it by its table_name in this parameter.
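As a rough illustration of the normalization options, the sketch below reproduces case folding and full-width-to-half-width conversion with Python's standard library. It is only an illustration of the behavior described above, not the engine's actual implementation; traditional-to-simplified conversion is not shown because it requires a mapping table such as the one configured in traditional_tables.

    import unicodedata

    def normalize(text, case_sensitive=True, width_sensitive=True):
        # Sketch of the behavior described above, not the engine's implementation.
        if not width_sensitive:
            # NFKC normalization folds full-width characters into half-width ones.
            text = unicodedata.normalize("NFKC", text)
        if not case_sensitive:
            text = text.lower()
        return text

    print(normalize("ＡＢＣ　Def", case_sensitive=False, width_sensitive=False))
    # prints "abc def": the full-width letters and the ideographic space are
    # converted to their half-width equivalents, and case is folded.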
