# Lost In Translation? Multilingual Embedding Models Are All You Need

This notebook by Quynh Nguyen shows how cross-lingual vector search overcomes language barriers, enabling you to query and retrieve information in any language from both monolingual and multilingual datasets. It accompanies the piece *Lost In Translation? Multilingual Embedding Models Are All You Need* from [Elasticsearch Labs](https://www.elastic.co/search-labs).

## Prerequisites

Before running this notebook, please ensure you have deployed the `.multilingual-e5-small_linux-x86_64` model within the [*Trained Models* view](https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-deploy-model) as covered in the accompanying article. 

In [26]:
import requests
import json
import urllib3

# Disable InsecureRequestWarning
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Download the multilingual COCO dataset.
# This example retrieves only the first 100 rows.
# Alternatively, you can use the Hugging Face `datasets` library.
url = "https://datasets-server.huggingface.co/rows?dataset=romrawinjp%2Fmultilingual-coco&config=default&split=restval&offset=0&length=100"
# Make the GET request
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()

    # Define the output file path
    output_file = "multilingual_coco_sample.json"

    # Save the JSON data to a file
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)

    print(f"Data successfully downloaded and saved to {output_file}")
else:
    print(f"Failed to download data: {response.status_code}")
    print(response.text)

Data successfully downloaded and saved to multilingual_coco_sample.json
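
The request above fetches a single page of rows. To pull a larger sample, one option is to page through the same endpoint by advancing `offset`. The sketch below only builds the URLs. `rows_url` and `page_urls` are hypothetical helper names, and it assumes the endpoint keeps accepting the query parameters used in the URL above with pages of at most 100 rows.

```python
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co/rows"


def rows_url(offset, length=100, dataset="romrawinjp/multilingual-coco",
             config="default", split="restval"):
    """Build the URL for one page of rows (hypothetical helper)."""
    params = {
        "dataset": dataset,
        "config": config,
        "split": split,
        "offset": offset,
        "length": length,
    }
    # urlencode percent-encodes the "/" in the dataset name
    return f"{BASE_URL}?{urlencode(params)}"


def page_urls(total_rows, page_size=100):
    """Return one URL per page needed to cover `total_rows` rows."""
    return [
        rows_url(offset, min(page_size, total_rows - offset))
        for offset in range(0, total_rows, page_size)
    ]
```

Each URL can then be passed to `requests.get` as in the cell above, appending every page's `rows` to one list before saving.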


In [6]:
from getpass import getpass

# Get credentials securely for localhost Elasticsearch
print("Enter your Elasticsearch credentials:")
elastic_endpoint = input("Enter your Elastic endpoint: ")
api_key = getpass("Enter your API key: ")

Enter your Elasticsearch credentials:


In [27]:
from elasticsearch import Elasticsearch

try:
    es = Elasticsearch(
        hosts=[elastic_endpoint],
        api_key=api_key,
        # Disable SSL verification if using localhost
        verify_certs=False,
    )
    es.ping()

    print("Successfully connected to Elasticsearch")
except Exception as e:
    print(f"Error connecting to Elasticsearch: {e}")
    print("Please check your credentials")
    raise

Successfully connected to Elasticsearch




Before we start, let's do some housekeeping. We will create a helper function to prettify the Elasticsearch response and set up some constants.

You can find the model ID in two ways:
1) In the Kibana UI, search for `Trained Models` to navigate to the list of models
2) Via the API with `GET _ml/trained_models`, or through the Python client wrapper

In [28]:
available_models = es.ml.get_trained_models()
print(available_models)

{'count': 3, 'trained_model_configs': [{'model_id': '.elser_model_2', 'model_type': 'pytorch', 'model_package': {'packaged_model_id': 'elser_model_2', 'model_repository': 'https://ml-models.elastic.co', 'minimum_version': '11.0.0', 'size': 438123914, 'sha256': '2e0450a1c598221a919917cbb05d8672aed6c613c028008fedcd696462c81af0', 'metadata': {}, 'tags': [], 'vocabulary_file': 'elser_model_2.vocab.json'}, 'created_by': 'api_user', 'version': '12.0.0', 'create_time': 1756216912000, 'model_size_bytes': 0, 'estimated_operations': 0, 'license_level': 'platinum', 'description': 'Elastic Learned Sparse EncodeR v2', 'tags': ['elastic'], 'metadata': {}, 'input': {'field_names': ['text_field']}, 'inference_config': {'text_expansion': {'vocabulary': {'index': '.ml-inference-native-000002'}, 'tokenization': {'bert': {'do_lower_case': True, 'with_special_tokens': True, 'max_sequence_length': 512, 'truncate': 'first', 'span': -1}}}}, 'location': {'index': {'name': '.ml-inference-native-000002'}}}, {'mo
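
The raw response is verbose, and only the `model_id` values are needed to set the constant used later. A minimal sketch for pulling them out follows. `list_model_ids` is a hypothetical helper name, and it relies only on the `trained_model_configs` and `model_id` keys visible in the output above.

```python
def list_model_ids(trained_models_response):
    """Return just the model IDs from a `GET _ml/trained_models` response."""
    configs = trained_models_response["trained_model_configs"]
    return [config["model_id"] for config in configs]


# Usage with the response fetched above:
# print(list_model_ids(available_models))
```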

In [29]:
# Helper function to prettify the ES response
def prettify_response(response):
    hits = response.body["hits"]["hits"]
    return json.dumps(hits, indent=2, ensure_ascii=False)

# Default E5 model id
MODEL_ID = ".multilingual-e5-small"

In [30]:
# Define the index mapping
index_name = "coco"
mapping = {
    "mappings": {
        "properties": {
            "language": {"type": "keyword"},
            "description": {"type": "text"},
            "en": {"type": "text"},
            "image_url": {"type": "keyword"},
        }
    }
}

# Create the index if it doesn't exist
if not es.indices.exists(index=index_name):
    es.indices.create(index=index_name, body=mapping)

# Load the JSON data (written as UTF-8 above, so read it back the same way)
with open("./multilingual_coco_sample.json", "r", encoding="utf-8") as f:
    data = json.load(f)

rows = data["rows"]
# List of languages to process
languages = ["en", "es", "de", "it", "vi", "th"]

bulk_data = []
for obj in rows:
    row = obj["row"]
    image_url = row.get("image")
    image_url = image_url["src"]

    # Process each language
    for lang in languages:
        # Skip if language not present in this row
        if lang not in row:
            continue

        # Get all descriptions for this language
        descriptions = row[lang]
        first_eng_caption = row["en"][0]

        # Prepare bulk indexing data
        for description in descriptions:
            if description == "":
                continue
            # Add index operation
            bulk_data.append({"index": {"_index": index_name}})
            # Add document
            bulk_data.append(
                {
                    "language": lang,
                    "description": description,
                    "en": first_eng_caption,
                    "image_url": image_url,
                }
            )

# Perform bulk indexing
if bulk_data:
    try:
        response = es.bulk(operations=bulk_data)
        if response["errors"]:
            print("Some documents failed to index")
        else:
            # bulk_data holds two entries per document (action line + source)
            print(f"Successfully bulk indexed {len(bulk_data) // 2} documents")
    except Exception as e:
        print(f"Error during bulk indexing: {str(e)}")

print("Indexing complete!")

Successfully bulk indexed 2420 documents
Indexing complete!
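
When the bulk response reports `errors`, the message alone does not say which documents failed or why. The sketch below walks the `items` array of a bulk response and collects the failures. `failed_bulk_items` is a hypothetical helper name, and the item layout (one dict per operation, keyed by the action type) is assumed from the standard bulk response format.

```python
def failed_bulk_items(bulk_response, limit=5):
    """Collect up to `limit` failed operations from a bulk response."""
    failures = []
    for item in bulk_response["items"]:
        # Each item is keyed by its action type, e.g. "index"
        for action, result in item.items():
            if "error" in result:
                failures.append(
                    {
                        "action": action,
                        "id": result.get("_id"),
                        "status": result.get("status"),
                        "reason": result["error"].get("reason"),
                    }
                )
    return failures[:limit]


# Usage after es.bulk(...):
# for failure in failed_bulk_items(response):
#     print(failure)
```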


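Next, we create an ingest pipeline that runs each `description` through the text embedding model. The `set` processor prepends the `passage: ` prefix that E5 models expect on indexed text (queries use `query: ` instead), and the `inference` processor writes the resulting embedding to `vector_description`.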

In [31]:
pipeline_body = {
    "description": "Pipeline to run the descriptions text_field through our inference text embedding model",
    "processors": [
        {"set": {"field": "temp_desc", "value": "passage: {{description}}"}},
        {
            "inference": {
                "field_map": {"temp_desc": "text_field"},
                "model_id": MODEL_ID,
                "target_field": "vector_description",
            }
        },
        {"remove": {"field": "temp_desc"}},
    ],
}

try:
    es.ingest.put_pipeline(id="vectorize_descriptions", body=pipeline_body)
    print("Pipeline 'vectorize_descriptions' created successfully.")
except Exception as e:
    print(f"Error creating pipeline: {str(e)}")

Pipeline 'vectorize_descriptions' created successfully.


We also need a new Elasticsearch index whose mapping stores the embeddings as a `dense_vector` field. The 384 dimensions match the output size of `multilingual-e5-small`.

In [32]:
index_body = {
    "mappings": {
        "properties": {
            "description": {"type": "text"},
            "en": {"type": "text"},
            "image_url": {"type": "keyword"},
            "language": {"type": "keyword"},
            "vector_description.predicted_value": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine",
                "index_options": {"type": "bbq_hnsw"},
            },
        }
    }
}

try:
    # Drop any previous copy of the index; 400/404 responses are ignored
    es.options(ignore_status=[400, 404]).indices.delete(index="coco_multilingual")
    es.indices.create(index="coco_multilingual", body=index_body)
    print("Index 'coco_multilingual' created successfully.")
except Exception as e:
    print(f"Error creating index: {str(e)}")

Index 'coco_multilingual' created successfully.


Now we just need to reindex the data from `coco` into `coco_multilingual` through the pipeline, which vectorizes each description along the way.

In [33]:
reindex_body = {
    "source": {"index": "coco"},
    "dest": {"index": "coco_multilingual", "pipeline": "vectorize_descriptions"},
}

response = es.reindex(
    body=reindex_body,
    # Don't wait for completion here because this process might take a while
    wait_for_completion=False,
)

print("Reindex task started. Task info:")
print(response)

Reindex task started. Task info:
{'task': 'vwUBsf4QRcmG-K41mgbung:126018'}


Voilà, now let's try some queries and have some fun!

In [34]:
query_body = {
    "size": 10,
    "_source": ["description", "language", "en"],
    "knn": {
        "field": "vector_description.predicted_value",
        "k": 10,
        "num_candidates": 100,
        "query_vector_builder": {
            "text_embedding": {"model_id": MODEL_ID, "model_text": "query: kitty"}
        },
    },
}

response = es.search(index="coco_multilingual", body=query_body)
print(prettify_response(response))

[
  {
    "_index": "coco_multilingual",
    "_id": "NOa5kpkBDJxtywa-22dZ",
    "_score": 0.9305427,
    "_source": {
      "description": "Eine Katze, die auf einem Rad eines Fahrzeugs sitzt.",
      "en": "An orange striped tabby cat laying on top of a red vehicle's wheel.",
      "language": "de"
    }
  },
  {
    "_index": "coco_multilingual",
    "_id": "qObKkpkBDJxtywa-83Ci",
    "_score": 0.9305427,
    "_source": {
      "description": "Eine Katze, die auf einem Rad eines Fahrzeugs sitzt.",
      "en": "An orange striped tabby cat laying on top of a red vehicle's wheel.",
      "language": "de"
    }
  },
  {
    "_index": "coco_multilingual",
    "_id": "ypQC5JkBsSOgrYFSA-ja",
    "_score": 0.9305427,
    "_source": {
      "description": "Eine Katze, die auf einem Rad eines Fahrzeugs sitzt.",
      "en": "An orange striped tabby cat laying on top of a red vehicle's wheel.",
      "language": "de"
    }
  },
  {
    "_index": "coco_multilingual",
    "_id": "UJQS5JkBsSOgrYFSI
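
The kNN queries in this notebook share the same shape and differ only in the query text, `k`, and `num_candidates`. A small builder keeps the `query: ` prefix in one place. This is a sketch: `build_knn_query` is a hypothetical helper, and the optional `language` argument assumes the kNN `filter` option is used with a `term` query on the `language` keyword field from the mapping.

```python
def build_knn_query(text, model_id=".multilingual-e5-small", k=10,
                    num_candidates=100, size=None, language=None):
    """Build a kNN search body for the `coco_multilingual` index."""
    knn = {
        "field": "vector_description.predicted_value",
        "k": k,
        "num_candidates": num_candidates,
        "query_vector_builder": {
            "text_embedding": {
                "model_id": model_id,
                # E5 models expect the "query: " prefix on search text
                "model_text": f"query: {text}",
            }
        },
    }
    if language is not None:
        # Restrict candidates to captions in a single language
        knn["filter"] = {"term": {"language": language}}
    return {
        "size": k if size is None else size,
        "_source": ["description", "language", "en"],
        "knn": knn,
    }


# Usage:
# response = es.search(index="coco_multilingual", body=build_knn_query("kitty"))
```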

In [35]:
query_body = {
    "size": 100,
    "_source": ["description", "language", "en"],
    "knn": {
        "field": "vector_description.predicted_value",
        "k": 50,
        "num_candidates": 1000,
        "query_vector_builder": {
            "text_embedding": {
                "model_id": MODEL_ID,
                "model_text": "query: kitty lying on something",
            }
        },
    },
}

response = es.search(index="coco_multilingual", body=query_body)
print(prettify_response(response))

[
  {
    "_index": "coco_multilingual",
    "_id": "Lua5kpkBDJxtywa-22dZ",
    "_score": 0.9332434,
    "_source": {
      "description": "An orange striped tabby cat laying on top of a red vehicle's wheel.",
      "en": "An orange striped tabby cat laying on top of a red vehicle's wheel.",
      "language": "en"
    }
  },
  {
    "_index": "coco_multilingual",
    "_id": "oubKkpkBDJxtywa-83Ci",
    "_score": 0.9332434,
    "_source": {
      "description": "An orange striped tabby cat laying on top of a red vehicle's wheel.",
      "en": "An orange striped tabby cat laying on top of a red vehicle's wheel.",
      "language": "en"
    }
  },
  {
    "_index": "coco_multilingual",
    "_id": "xJQC5JkBsSOgrYFSA-ja",
    "_score": 0.9332434,
    "_source": {
      "description": "An orange striped tabby cat laying on top of a red vehicle's wheel.",
      "en": "An orange striped tabby cat laying on top of a red vehicle's wheel.",
      "language": "en"
    }
  },
  {
    "_index": "coco
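
The top hits above repeat the same caption under different `_id` values, which suggests the same rows were indexed more than once (for example, by rerunning the indexing cells). Until the index is rebuilt, duplicates can be dropped on the client side. A minimal sketch, where `unique_hits` is a hypothetical helper that keeps the first hit for each `(language, description)` pair:

```python
def unique_hits(hits):
    """Drop hits whose language and description were already seen."""
    seen = set()
    unique = []
    for hit in hits:
        source = hit["_source"]
        key = (source["language"], source["description"])
        if key not in seen:
            seen.add(key)
            unique.append(hit)
    return unique


# Usage:
# deduped = unique_hits(response["hits"]["hits"])
```

Since hits arrive sorted by score, keeping the first occurrence preserves the ranking.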

In [36]:
query_body = {
    "size": 100,
    "_source": ["description", "language", "en"],
    "knn": {
        "field": "vector_description.predicted_value",
        "k": 50,
        "num_candidates": 1000,
        "query_vector_builder": {
            "text_embedding": {"model_id": MODEL_ID, "model_text": "query: 고양이"}
        },
    },
}

response = es.search(index="coco_multilingual", body=query_body)
print(prettify_response(response))

[
  {
    "_index": "coco_multilingual",
    "_id": "C-a5kpkBDJxtywa-22JY",
    "_score": 0.9282237,
    "_source": {
      "description": "Một cô gái mỉm cười khi cô ấy ôm một con mèo và mặc một chiếc váy màu sắc rực rỡ.",
      "en": "A woman in a room with a cat.",
      "language": "vi"
    }
  },
  {
    "_index": "coco_multilingual",
    "_id": "f-bKkpkBDJxtywa-82ug",
    "_score": 0.9282237,
    "_source": {
      "description": "Một cô gái mỉm cười khi cô ấy ôm một con mèo và mặc một chiếc váy màu sắc rực rỡ.",
      "en": "A woman in a room with a cat.",
      "language": "vi"
    }
  },
  {
    "_index": "coco_multilingual",
    "_id": "oZQC5JkBsSOgrYFSA-PY",
    "_score": 0.9282237,
    "_source": {
      "description": "Một cô gái mỉm cười khi cô ấy ôm một con mèo và mặc một chiếc váy màu sắc rực rỡ.",
      "en": "A woman in a room with a cat.",
      "language": "vi"
    }
  },
  {
    "_index": "coco_multilingual",
    "_id": "J5QS5JkBsSOgrYFSIe02",
    "_score": 0.92822