Add split_length by token in preprocessor #4983

@yudataguy

Description

Is your feature request related to a problem? Please describe.
With LLMs like ChatGPT, text is measured in tokens, not words. To make the best use of embeddings and other token-limited components, it would be best to have a split-by-token feature.

Describe the solution you'd like
Add another split option, token, and measure chunk size in tokens.

Other packages like langchain and llama-index already have this feature.
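A minimal sketch of what a split-by-token option could look like, with a pluggable tokenizer. A real implementation would plug in a model tokenizer (e.g. tiktoken or a Hugging Face tokenizer); here a trivial whitespace tokenizer stands in so the sketch stays self-contained. The function name `split_by_tokens` and its parameters are illustrative, not the actual preprocessor API.

```python
from typing import Callable, List


def split_by_tokens(
    text: str,
    split_length: int,
    # Stand-in tokenizer/detokenizer; swap in a real model tokenizer
    # (e.g. tiktoken encode/decode) for accurate token counts.
    tokenize: Callable[[str], List[str]] = str.split,
    detokenize: Callable[[List[str]], str] = " ".join,
) -> List[str]:
    """Split `text` into chunks of at most `split_length` tokens."""
    tokens = tokenize(text)
    return [
        detokenize(tokens[i : i + split_length])
        for i in range(0, len(tokens), split_length)
    ]


chunks = split_by_tokens("one two three four five", split_length=2)
# → ["one two", "three four", "five"]
```

With a real tokenizer the chunk boundaries would fall on model tokens rather than whitespace words, which is what makes the chunks fit embedding-model limits exactly.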
