# Tokenization for Semantic Search (ELSER and E5)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/search/tokenization.ipynb)

Elasticsearch offers [semantic search](https://www.elastic.co/what-is/semantic-search) models, most notably [ELSER](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-elser.html) and [E5](https://www.elastic.co/search-labs/blog/articles/multilingual-vector-search-e5-embedding-model), to search through documents in a _meaningful_ way. Part of the process is breaking up texts (both documents at indexing time and queries) into tokens. Tokens are commonly thought of as words, but this is not accurate: other substrings of the text, such as punctuation and word pieces, also carry meaning for the semantic models and are therefore split out as separate tokens. For ELSER, our English-only model, this is done with the [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) tokenizer.

For Elasticsearch users it is important to know how texts are broken up into tokens because currently only the [first 512 tokens per field](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512) are considered. This means that when you index longer texts, all tokens after the first 512 are not represented in your semantic search. Hence it is valuable to know the number of tokens in your input texts.

Currently it is not possible to get token counts via the API, so we share the code for calculating them here. This notebook also shows how to break longer texts up into chunks of the right size so that no information is lost during indexing. As of version 8.12 this has to be done by the user; future versions will remove this necessity and chunk automatically behind the scenes.


# Install packages

As stated above, ELSER uses [BERT](https://huggingface.co/blog/bert-101)'s tokenizer internally. Here we install the `transformers` package, which gives us an interface to this tokenizer. (We also install the `tabulate` package to print a nice comparison table later on.)

In [1]:
!pip install -qU tabulate transformers

Next, we import everything we need. You can ignore a potential warning about models not being available, because we only need the tokenizers here.

In [2]:
import json
from urllib.request import urlopen

from tabulate import tabulate
from transformers import AutoTokenizer, BertTokenizer

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


# Define tokenizers

Now we are ready to initialize the BERT tokenizer that ELSER uses and the E5 tokenizer for multilingual semantic search. We also define a whitespace tokenizer so that we can compare the naive approach of splitting on whitespace with the two model tokenizers.

In [3]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
e5_tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-base')

def whitespace_tokenize(text):
    return text.split()

# Load example data

Download the movies example data that is also used in the other search examples.

In [4]:
url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/movies.json"
response = urlopen(url)
movies = json.load(response)

# Compare token counts

Compare the token counts of the different tokenization methods for the descriptions of the movies.

In [40]:
def count_tokens(text):
    whitespace_tokens = len(whitespace_tokenize(text))
    bert_tokens = len(bert_tokenizer.encode(text))
    e5_tokens = len(e5_tokenizer.encode(text))
    return [whitespace_tokens, bert_tokens, e5_tokens, f"{text[:80]}..."]

counts = [count_tokens(movie["plot"]) for movie in movies]

print(tabulate(sorted(counts), ["whitespace", "BERT", "E5", "text"]))

  whitespace    BERT    E5  text
------------  ------  ----  -----------------------------------------------------------------------------------
          16      21    30  An organized crime dynasty's aging patriarch transfers control of his clandestin...
          19      25    32  Two imprisoned men bond over a number of years, finding solace and eventual rede...
          20      25    34  Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven ...
          20      33    36  An insomniac office worker and a devil-may-care soapmaker form an underground fi...
          22      28    27  An undercover cop and a mole in the police attempt to identify each other while ...
          23      26    31  A computer hacker learns from mysterious rebels about the true nature of his rea...
          26      36    42  A thief who steals corporate secrets through the use of dream-sharing technology...
          27      36    42  The lives of two mob hitmen, a boxer, a gan

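Notice that both the BERT and the E5 tokenizers yield more tokens than whitespace splitting in every example, in some cases nearly twice as many (for example, 30 E5 tokens versus 16 whitespace-separated words in the first row). Why is that? Let's look at an example: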

In [41]:
example_movie = movies[0]["plot"]
print(example_movie)
print()

movie_tokens = bert_tokenizer.encode(example_movie)
print(str([bert_tokenizer.decode([t]) for t in movie_tokens]))


The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.

['[CLS]', 'the', 'lives', 'of', 'two', 'mob', 'hit', '##men', ',', 'a', 'boxer', ',', 'a', 'gangster', 'and', 'his', 'wife', ',', 'and', 'a', 'pair', 'of', 'diner', 'bandits', 'inter', '##t', '##wine', 'in', 'four', 'tales', 'of', 'violence', 'and', 'redemption', '.', '[SEP]']


We can observe:
- There are special tokens `[CLS]` and `[SEP]` that mark the beginning and end of the text. These two extra tokens will become relevant below.
- Punctuation marks are their own tokens.
- Words that are not in the tokenizer's vocabulary are split into several subword tokens, for example `hitmen` becomes `hit` and `##men`, and `intertwine` becomes `inter`, `##t` and `##wine`.

Given this behavior, it is easy to see how longer texts yield many tokens and can quickly exceed the 512-token limit mentioned above.
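To make the subword splitting more concrete, here is a minimal sketch of greedy longest-match-first splitting, the general idea behind WordPiece-style tokenizers. It is not the actual `bert-base-uncased` implementation: the function name `wordpiece_split` and the tiny `TOY_VOCAB` are assumptions made up for illustration, whereas the real tokenizer uses a vocabulary of roughly 30,000 entries and additional normalization steps.

```python
# Toy illustration of greedy longest-match-first subword splitting.
# TOY_VOCAB is a hypothetical mini vocabulary, not the real BERT vocabulary.
TOY_VOCAB = {"hit", "##men", "inter", "##t", "##wine", "boxer"}


def wordpiece_split(word, vocab, unknown_token="[UNK]"):
    """Split a single lowercase word into the longest matching vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown.
            return [unknown_token]
        pieces.append(match)
        start = end
    return pieces


print(wordpiece_split("hitmen", TOY_VOCAB))      # ['hit', '##men']
print(wordpiece_split("intertwine", TOY_VOCAB))  # ['inter', '##t', '##wine']
print(wordpiece_split("boxer", TOY_VOCAB))       # ['boxer']
```

With this toy vocabulary the splits match the ones printed by the real tokenizer above, which shows why a single word can contribute two or three tokens to the count.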

# Handling long texts

We saw how to count the number of tokens using the tokenizers from different models. ELSER uses the BERT tokenizer, so when using `.elser_model_2` it internally splits the text with this method.

Currently there is a limitation that [only the first 512 tokens are considered](https://www.elastic.co/guide/en/machine-learning/8.12/ml-nlp-limitations.html#ml-nlp-elser-v1-limit-512). To work around this, we can first split the input text into chunks that fit within this limit and feed the chunks to Elasticsearch. Each chunk can hold at most 510 tokens of text, because two of the 512 positions are taken by the special tokens (`[CLS]` and `[SEP]`) that we saw above.
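The arithmetic behind this limit can be sketched as follows. The helper name `expected_chunk_count` and the sample token counts are hypothetical and only serve to show how many chunks a text of a given length needs.

```python
import math

TOKEN_LIMIT = 512    # positions the model considers per field
SPECIAL_TOKENS = 2   # [CLS] and [SEP]


def expected_chunk_count(token_count, chunk_size=TOKEN_LIMIT - SPECIAL_TOKENS):
    """Number of chunks needed for `token_count` text tokens (special tokens excluded)."""
    return math.ceil(token_count / chunk_size)


# Hypothetical token counts, chosen to show the boundary cases.
for token_count in (36, 510, 511, 1200):
    print(token_count, "->", expected_chunk_count(token_count))
# 36 -> 1
# 510 -> 1
# 511 -> 2
# 1200 -> 3
```

A text with exactly 510 tokens still fits into one chunk, while a single additional token already requires a second one.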

In [35]:
SEMANTIC_SEARCH_TOKEN_LIMIT = 510  # 512 minus space for the 2 special tokens

def chunk(tokens, chunk_size=SEMANTIC_SEARCH_TOKEN_LIMIT):
    for i in range(0, len(tokens), chunk_size):
        yield tokens[i:i+chunk_size]

Loading a longer example text:

In [43]:
# url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/lorem-ipsum.txt"
# response = urlopen(url)
# TODO remove the local file in favor of the download above
with open("./lorem-ipsum.txt") as f:  # the file handle is closed after reading
    long_text = f.read()

Next we tokenize the long text, strip the special tokens, create chunks of at most 510 tokens and map the tokens back to text. Notice that on the first run the BERT tokenizer itself warns us about the 512-token limitation.

In [50]:
tokens = bert_tokenizer.encode(long_text)[1:-1]  # exclude special tokens at beginning and end
chunked = [
    bert_tokenizer.decode(tokens_chunk)
    for tokens_chunk in chunk(tokens)
]
chunked

['lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. est pellentesque elit ullamcorper dignissim. sit amet cursus sit amet dictum sit amet. enim neque volutpat ac tincidunt vitae semper quis lectus. nulla facilisi etiam dignissim diam quis enim lobortis. id velit ut tortor pretium. ut tortor pretium viverra suspendisse potenti nullam ac tortor. senectus et netus et malesuada fames ac. sed faucibus turpis in eu. maecenas ultricies mi eget mauris pharetra. in iaculis nunc sed augue. sit amet cursus sit amet dictum. sit amet luctus venenatis lectus magna. adipiscing tristique risus nec feugiat. nisi quis eleifend quam adipiscing vitae proin sagittis nisl rhoncus. scelerisque varius morbi enim nunc faucibus a. purus semper eget duis at tellus at. cursus metus aliquam eleifend mi. tristique senectus et netus et malesuada fames. netus et malesuada fames ac. viverra aliquet eget sit amet tellus cras. hac habitasse platea

Now these chunks can be indexed, and we can be sure that the semantic search model considers our whole text.
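As an optional check before indexing, the chunk texts can be re-encoded and counted again. Decoding tokens to text and encoding that text again is not guaranteed to reproduce exactly the same tokens, for instance when a chunk boundary falls in the middle of a word, so a chunk may end up slightly above the limit. The sketch below assumes a hypothetical helper `find_oversized_chunks` that accepts any `encode` callable; in this notebook one would pass `bert_tokenizer.encode`, which includes the two special tokens in its result. The whitespace-based `fake_encode` is only a stand-in so that the example runs without downloading a model.

```python
def find_oversized_chunks(chunks, encode, limit=512):
    """Return (index, token_count) for every chunk whose encoding exceeds `limit`."""
    oversized = []
    for index, text in enumerate(chunks):
        token_count = len(encode(text))
        if token_count > limit:
            oversized.append((index, token_count))
    return oversized


def fake_encode(text):
    # Stand-in encoder for illustration only: one "token" per whitespace-separated word.
    return text.split()


sample_chunks = ["a b c", "a b c d e f", "a"]
print(find_oversized_chunks(sample_chunks, fake_encode, limit=4))  # [(1, 6)]

# In the notebook, the equivalent call would be:
# find_oversized_chunks(chunked, bert_tokenizer.encode)
```

An empty result means every chunk fits; any reported chunk could be split again with a smaller chunk size.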