<a href="https://colab.research.google.com/github/llermaly/elasticsearch-labs/blob/supporting-blog-content%2Fhow-to-use-jina-v2-embeddings/supporting-blog-content/how-to-use-jina-v2-embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this notebook, we will extend the [Jina Late Chunking implementation example](https://github.com/jina-ai/late-chunking/blob/main/examples.ipynb) to index the documents and embeddings into Elasticsearch, and run queries against those documents.

The Jina part of the implementation will be kept untouched.

This is supporting material for the following blog post:
https://www.elastic.co/search-labs/blog/how-to-use-jina-v2-embeddings


# [Late Chunking](https://jina.ai/news/late-chunking-in-long-context-embedding-models)

This notebook explains how "Late Chunking" can be implemented. First, install the requirements:

In [3]:
!pip install transformers==4.43.4


Next, we load the embedding model. We use `jinaai/jina-embeddings-v2-base-en`, but any model that supports mean pooling will work. Models with a long maximum context length are preferred, since late chunking embeds the whole text before splitting it into chunks.

In [4]:
from transformers import AutoModel
from transformers import AutoTokenizer

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)


Now we define the text we want to encode and split it into chunks. The `chunk_by_sentences` function also returns the span annotations.
Each span annotation is a `(start, end)` pair of token indices marking where a chunk begins and ends in the tokenized text, which the chunked pooling step needs.

In [5]:
def chunk_by_sentences(input_text: str, tokenizer: callable):
    """
    Split the input text into sentences using the tokenizer
    :param input_text: The text snippet to split into sentences
    :param tokenizer: The tokenizer to use
    :return: A tuple containing the list of text chunks and their corresponding token spans
    """
    inputs = tokenizer(input_text, return_tensors='pt', return_offsets_mapping=True)
    punctuation_mark_id = tokenizer.convert_tokens_to_ids('.')
    sep_id = tokenizer.convert_tokens_to_ids('[SEP]')
    token_offsets = inputs['offset_mapping'][0]
    token_ids = inputs['input_ids'][0]
    chunk_positions = [
        (i, int(start + 1))
        for i, (token_id, (start, end)) in enumerate(zip(token_ids, token_offsets))
        if token_id == punctuation_mark_id
        and (
            token_offsets[i + 1][0] - token_offsets[i][1] > 0
            or token_ids[i + 1] == sep_id
        )
    ]
    chunks = [
        input_text[x[1] : y[1]]
        for x, y in zip([(1, 0)] + chunk_positions[:-1], chunk_positions)
    ]
    span_annotations = [
        (x[0], y[0]) for (x, y) in zip([(1, 0)] + chunk_positions[:-1], chunk_positions)
    ]
    return chunks, span_annotations
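
To make the span annotations concrete, here is a minimal, dependency-free sketch that applies the same index arithmetic to a hand-written token list. The token list and the `toy_spans` helper are hypothetical and only for illustration; the real tokenizer produces subword tokens, and the check that distinguishes a sentence-ending period from a decimal point (as in "3.85") is left out here.

```python
def toy_spans(tokens):
    """Return half-open (start, end) token spans, one per sentence.

    Mirrors the index arithmetic of chunk_by_sentences: the first span starts
    at index 1 (just after [CLS]) and every span ends at the index of its
    closing '.' token.
    """
    boundaries = [i for i, tok in enumerate(tokens) if tok == "."]
    starts = [1] + boundaries[:-1]
    return list(zip(starts, boundaries))


# Hypothetical word-level tokens; index 0 plays the role of [CLS].
tokens = ["[CLS]", "berlin", "is", "big", ".", "it", "is", "old", ".", "[SEP]"]
spans = toy_spans(tokens)
print(spans)  # [(1, 4), (4, 8)]
for start, end in spans:
    print(tokens[start:end])
```

Because the spans are half-open, the period at index 4 is not part of the first span; it becomes the first token of the second one.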

Now let's try to segment a toy example.

In [6]:
input_text = "Berlin is the capital and largest city of Germany, both by area and by population. Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits. The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."

# determine chunks
chunks, span_annotations = chunk_by_sentences(input_text, tokenizer)
print('Chunks:\n- "' + '"\n- "'.join(chunks) + '"')


Chunks:
- "Berlin is the capital and largest city of Germany, both by area and by population."
- " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."
- " The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."


Now we encode the chunks with the traditional and the context-sensitive late_chunking method:

In [7]:
def late_chunking(
    model_output: 'BatchEncoding', span_annotation: list, max_length=None
):
    token_embeddings = model_output[0]
    outputs = []
    for embeddings, annotations in zip(token_embeddings, span_annotation):
        if (
            max_length is not None
        ):  # remove annotations which go beyond the max-length of the model
            annotations = [
                (start, min(end, max_length - 1))
                for (start, end) in annotations
                if start < (max_length - 1)
            ]
        pooled_embeddings = [
            embeddings[start:end].sum(dim=0) / (end - start)
            for start, end in annotations
            if (end - start) >= 1
        ]
        pooled_embeddings = [
            embedding.detach().cpu().numpy() for embedding in pooled_embeddings
        ]
        outputs.append(pooled_embeddings)

    return outputs
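
The core of `late_chunking` is a mean over the token embeddings inside each span. The sketch below shows that step in isolation with plain Python lists standing in for tensors. `mean_pool_spans` and the 2-dimensional toy vectors are hypothetical; unlike the function above, the sketch divides by the number of vectors actually present in the slice.

```python
def mean_pool_spans(token_vectors, spans):
    """Average the token vectors inside each half-open (start, end) span."""
    pooled = []
    for start, end in spans:
        window = token_vectors[start:end]
        if not window:  # skip empty spans, as late_chunking does
            continue
        dim = len(window[0])
        pooled.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return pooled


# Five toy token vectors; indices 0 and 4 stand in for [CLS] and [SEP].
token_vectors = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0], [10.0, 20.0], [0.0, 0.0]]
spans = [(1, 3), (3, 4)]
chunk_vectors = mean_pool_spans(token_vectors, spans)
print(chunk_vectors)  # [[2.0, 4.0], [10.0, 20.0]]
```

In the real pipeline the token vectors come from one forward pass over the full text, so each of them already carries context from the other sentences before it is averaged.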

In [8]:
# chunk before
embeddings_traditional_chunking = model.encode(chunks)

# chunk afterwards (context-sensitive chunked pooling)
inputs = tokenizer(input_text, return_tensors='pt')
model_output = model(**inputs)
embeddings = late_chunking(model_output, [span_annotations])[0]
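
The call above does not pass `max_length`, which is fine for this short text. When the tokenized input is truncated to a maximum length, spans that point past that length have to be clipped or dropped, which is what the `max_length` branch of `late_chunking` does. A minimal sketch of that rule in isolation, where `clip_spans` is a hypothetical helper name:

```python
def clip_spans(spans, max_length):
    """Drop spans that start at or past the limit and shorten those crossing it."""
    limit = max_length - 1  # mirrors late_chunking, which clips at max_length - 1
    return [(start, min(end, limit)) for start, end in spans if start < limit]


spans = [(1, 5), (5, 9), (9, 14)]
clipped = clip_spans(spans, max_length=8)
print(clipped)  # [(1, 5), (5, 7)]
```

The last span is dropped because it starts beyond the limit, and the middle one is shortened to end at the limit, so fewer embeddings than chunks may come back for long inputs.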

Finally, we compare the similarity of the word "Berlin" with the chunks. The similarity should be higher for the context-sensitive chunked pooling method:

In [9]:
import numpy as np

cos_sim = lambda x, y: np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

berlin_embedding = model.encode('Berlin')

for chunk, new_embedding, trad_embeddings in zip(chunks, embeddings, embeddings_traditional_chunking):
    print(f'similarity_new("Berlin", "{chunk}"):', cos_sim(berlin_embedding, new_embedding))
    print(f'similarity_trad("Berlin", "{chunk}"):', cos_sim(berlin_embedding, trad_embeddings))

similarity_new("Berlin", "Berlin is the capital and largest city of Germany, both by area and by population."): 0.849546
similarity_trad("Berlin", "Berlin is the capital and largest city of Germany, both by area and by population."): 0.8486219
similarity_new("Berlin", " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."): 0.82489026
similarity_trad("Berlin", " Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits."): 0.70843387
similarity_new("Berlin", " The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."): 0.8498009
similarity_trad("Berlin", " The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."): 0.75345534
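
To read the output more easily, the following sketch subtracts the traditional score from the late chunking score for each chunk. The numbers are copied from the output above; the `chunk 1` to `chunk 3` labels are just shorthand for the three sentences in order.

```python
# (label, late chunking similarity, traditional similarity) from the output above
results = [
    ("chunk 1", 0.849546, 0.8486219),
    ("chunk 2", 0.82489026, 0.70843387),
    ("chunk 3", 0.8498009, 0.75345534),
]

gains = {label: late - trad for label, late, trad in results}
for label, gain in gains.items():
    print(f"{label}: +{gain:.4f}")
```

The first chunk, which names Berlin explicitly, barely changes (about +0.001). The two chunks that refer to the city only as "Its" and "The city" gain roughly 0.12 and 0.10.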


# Indexing to Elasticsearch

Now let's index the new embeddings into Elasticsearch and run queries against them.

In [10]:
!pip install elasticsearch



In [11]:
from elasticsearch import Elasticsearch, helpers, exceptions
from getpass import getpass

In [12]:
# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic Api Key: ")

# Create the client instance
client = Elasticsearch(
    # For local development
    # hosts=["http://localhost:9200"]
    cloud_id=ELASTIC_CLOUD_ID,
    api_key=ELASTIC_API_KEY,
)

Elastic Cloud ID: ··········
Elastic Api Key: ··········


## Creating the inference endpoint

In [13]:
API_KEY = getpass("HuggingFace API key:  ")

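# Remove any previous endpoint with the same ID; ignore the error if none exists yet
try:
    client.inference.delete(inference_id="jina-embeddings-v2-base-en")
except exceptions.NotFoundError:
    pass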
client.inference.put(
    task_type="text_embedding",
    inference_id="jina-embeddings-v2-base-en",
    body={
        "service": "hugging_face",
        "service_settings": {"api_key": API_KEY, "url": "https://api-inference.huggingface.co/models/jinaai/jina-embeddings-v2-base-en" }
    },
)

HuggingFace API key:  ··········


ObjectApiResponse({'model_id': 'jina-embeddings-v2-base-en', 'task_type': 'text_embedding', 'service': 'hugging_face', 'service_settings': {'url': 'https://api-inference.huggingface.co/models/jinaai/jina-embeddings-v2-base-en', 'similarity': 'cosine', 'dimensions': 768, 'rate_limit': {'requests_per_minute': 3000}}, 'task_settings': {}})

## Creating index



In [14]:
client.indices.delete(index="jina-late-chunking", ignore_unavailable=True)
client.indices.create(
    index="jina-late-chunking",
    mappings={
        "properties": {
            "content_embedding": {
                "type": "dense_vector",
                "dims": 768,
                "similarity": "cosine",
                "element_type": "float"
            },
            "content": {"type": "text"},
        }
    },
)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'jina-late-chunking'})

## Loading documents

In [15]:
# Prepare the documents to be indexed
documents = []
for chunk, new_embedding in zip(chunks, embeddings):
    documents.append(
        {
            "_index": "jina-late-chunking",
            "_source": {
                "content_embedding": new_embedding,
                "content": chunk,
            },
        }
    )
# Use helpers.bulk to index
helpers.bulk(client, documents)


(3, [])
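
For larger inputs it can help to build the bulk actions lazily and to check each vector against the mapping's `dims` before sending it. The following is a minimal sketch under those assumptions; `build_actions` and its `expected_dims` parameter are hypothetical helpers, not part of the Elasticsearch client.

```python
def build_actions(index_name, chunks, embeddings, expected_dims=768):
    """Yield one bulk action per chunk, validating the vector length first."""
    for chunk, embedding in zip(chunks, embeddings):
        # NumPy arrays expose tolist(); plain sequences are copied into a list.
        vector = embedding.tolist() if hasattr(embedding, "tolist") else list(embedding)
        if len(vector) != expected_dims:
            raise ValueError(f"expected {expected_dims} dims, got {len(vector)}")
        yield {
            "_index": index_name,
            "_source": {"content_embedding": vector, "content": chunk},
        }


# Toy data with 3-dimensional vectors, only to show the shape of an action.
actions = list(
    build_actions(
        "jina-late-chunking",
        ["a.", " b."],
        [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
        expected_dims=3,
    )
)
print(actions[0])
```

With the real data, the generator can be passed straight to the bulk helper: `helpers.bulk(client, build_actions("jina-late-chunking", chunks, embeddings))`.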

## Running semantic search

In [16]:
response = client.search(
    index="jina-late-chunking",
    knn={
        "field": "content_embedding",
        "query_vector_builder": {
            "text_embedding": {
                "model_id": "jina-embeddings-v2-base-en",
                "model_text": "who inspired taking care of the sea?",
            }
        },
        "k": 10,
        "num_candidates": 100,
    },
)

print("Late chunking results")
for hit in response["hits"]["hits"]:
    doc_id = hit["_id"]
    score = hit["_score"]
    content = hit["_source"]["content"]
    print(f"Score: {score}\nContent: {content}\n")

Late chunking results
Score: 0.6046643
Content:  Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.

Score: 0.6044569
Content:  The city is also one of the states of Germany, and is the third smallest state in the country in terms of area.

Score: 0.6022606
Content: Berlin is the capital and largest city of Germany, both by area and by population.
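
The `_score` values above are not raw cosine similarities. For a `dense_vector` field with `cosine` similarity, Elasticsearch reports kNN scores as `(1 + cosine) / 2`, so the underlying cosine can be recovered by inverting that formula. A small sketch, where `knn_score_to_cosine` is a hypothetical helper name and the scores are copied from the output above:

```python
def knn_score_to_cosine(score):
    """Invert _score = (1 + cosine) / 2 for a cosine-similarity dense_vector field."""
    return 2.0 * score - 1.0


scores = [0.6046643, 0.6044569, 0.6022606]
cosines = [knn_score_to_cosine(s) for s in scores]
for s, c in zip(scores, cosines):
    print(f"score={s:.4f} -> cosine={c:.4f}")
```

All three hits correspond to a cosine similarity of about 0.21 and lie within roughly 0.005 of each other.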

