# Split by tokens

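Language models have a token limit, and you should not exceed it. When you split text into chunks, it is therefore a good idea to count the number of tokens. Many tokenizers exist, so when you count the tokens in your text, use the same tokenizer that the language model uses.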

## tiktoken

tiktoken is a fast BPE tokenizer created by OpenAI.

We can use it to estimate the number of tokens used. The estimate will probably be more accurate for OpenAI models.

How the text is split: by the character passed in.
How the chunk size is measured: by the `tiktoken` tokenizer.

In [1]:
%pip install --upgrade --quiet  tiktoken

Note: you may need to restart the kernel to use updated packages.


In [2]:
# This is a long document we can split up.
with open("../../data/a.txt", encoding="utf-8") as f:
    state_of_the_union = f.read()
from langchain.text_splitter import CharacterTextSplitter

In [3]:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)

In [4]:
print(texts[0])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.

Last year COVID-19 kept us apart. This year we are finally together again.

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.

With a duty to one another to the American people to the Constitution.


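Note that if we use `CharacterTextSplitter.from_tiktoken_encoder`, the text is split only by `CharacterTextSplitter`, and the `tiktoken` tokenizer is used only to merge the splits. This means a split can be larger than the chunk size as measured by the `tiktoken` tokenizer. To ensure that no split exceeds the token chunk size the language model allows, we can use `RecursiveCharacterTextSplitter.from_tiktoken_encoder`, which recursively re-splits any split that is too large.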
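To see why a split-then-merge strategy can overshoot the token budget, here is a minimal sketch in plain Python. It is not LangChain's implementation: `count_tokens` and `split_then_merge` are hypothetical helpers, and a "token" is assumed to be a whitespace-separated word, standing in for the length of a real `tiktoken` encoding.

```python
def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer (assumption): one token per
    # whitespace-separated word.
    return len(text.split())


def split_then_merge(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on `separator`, then greedily merge pieces up to `chunk_size` tokens."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and count_tokens(candidate) > chunk_size:
            # Adding this piece would exceed the budget: close the current chunk.
            chunks.append(separator.join(current))
            current = [piece]
        else:
            # A lone piece is always accepted, even if it is over budget,
            # because this strategy never cuts inside a piece.
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


text = "one two\n\nthree four\n\nfive six seven eight nine ten\n\neleven"
chunks = split_then_merge(text, separator="\n\n", chunk_size=4)
for chunk in chunks:
    print(count_tokens(chunk), repr(chunk))
```

With a budget of 4 tokens, the third paragraph is emitted as one 6-token chunk because it contains no separator to cut on. A recursive splitter would handle that case by re-splitting the oversized piece with a finer separator.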

We can also load a tiktoken splitter directly, which ensures that each split is no larger than the chunk size.

In [5]:
from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

Madam Speaker, Madam Vice President, our
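Splitting directly on tokens can be pictured as sliding a fixed-size window over the token sequence. The sketch below illustrates that idea under stated assumptions; it is not `TokenTextSplitter`'s actual code. `encode`, `decode`, and `split_on_tokens` are hypothetical helpers, and the tokenizer is a whitespace stand-in rather than `tiktoken`.

```python
def encode(text: str) -> list[str]:
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return text.split()


def decode(tokens: list[str]) -> str:
    # Inverse of the stand-in tokenizer above.
    return " ".join(tokens)


def split_on_tokens(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut the token sequence into windows of at most `chunk_size` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    tokens = encode(text)
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


sample = "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman."
windows = split_on_tokens(sample, chunk_size=4, chunk_overlap=1)
for window in windows:
    print(len(encode(window)), repr(window))
```

Because the cut happens on the token sequence itself, no chunk can exceed `chunk_size` tokens. The trade-off is that chunks may end mid-sentence, as the `texts[0]` output above shows.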


## spaCy
[spaCy](https://spacy.io/) is an open-source software library for advanced natural language processing, written in Python and Cython.

Another alternative to NLTK is to use the [spaCy tokenizer](https://spacy.io/api/tokenizer).

1. How the text is split: by the spaCy tokenizer.

2. How the chunk size is measured: by number of characters.

In [6]:
%pip install --upgrade --quiet  spacy

Note: you may need to restart the kernel to use updated packages.


In [7]:
# This is a long document we can split up.
with open("../../data/b.txt", encoding="utf-8") as f:
    state_of_the_union = f.read()

In [16]:
%pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz



In [17]:
from langchain.text_splitter import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=1000)

