# Multilingual semantic search

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/search/04-multilingual.ipynb)

In this example we'll use a multilingual embedding model,
[multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base), to search a toy dataset of
mixed-language documents. Using this model, we can search in two ways:
 * Across languages, for example using a query in German to find documents in English
 * Within a non-English language, for example using a query in German to find documents in German

While this example uses dense retrieval only, it's also possible to combine dense and traditional lexical retrieval
with hybrid search. For more information on lexical multilingual search, see the blog post
[Multilingual search using language identification in Elasticsearch](https://www.elastic.co/blog/multilingual-search-using-language-identification-in-elasticsearch).

The dataset contains snippets of Wikipedia passages from the [MIRACL](https://project-miracl.github.io/) dataset.

# 🧰 Requirements

For this example, you will need:

- Python 3.6 or later
- An Elastic deployment with a **machine learning node of at least 4GB**
   - We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html) for this example (available with a [free trial](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook))
- The [Elastic Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/installation.html)


## Create Elastic Cloud deployment

If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial.

- Go to the [Create deployment](https://cloud.elastic.co/deployments/create) page
   - Select **Create deployment**

# Install packages and initialize the Elasticsearch Python client

To get started, we'll need to connect to our Elastic deployment using the Python client.
Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.

First we need to `pip` install the packages we need for this example.

In [None]:
!pip install elasticsearch
!pip install sentence_transformers
!pip install torch

Next we import the modules we need and load the embedding model.
`getpass` is part of the Python standard library and is used to securely prompt for credentials.
The model runs on a GPU if one is available and on the CPU otherwise.

In [None]:
import getpass
import textwrap
import torch

from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

model = SentenceTransformer("intfloat/multilingual-e5-base", device=device)

Now we can instantiate the Python Elasticsearch client.
First we prompt the user for their password and Cloud ID.

🔐 NOTE: `getpass` lets us prompt the user for credentials without echoing them to the terminal. The values are still held in memory in the variables we assign them to.

Then we create a `client` object, which is an instance of the `Elasticsearch` class.

In [None]:
# Found in the "Manage Deployment" page
CLOUD_ID = getpass.getpass("Enter Elastic Cloud ID:  ")

# Password for the "elastic" user generated by Elasticsearch
ELASTIC_PASSWORD = getpass.getpass("Enter Elastic password:  ")

# Create the client instance
client = Elasticsearch(
    cloud_id=CLOUD_ID,
    basic_auth=("elastic", ELASTIC_PASSWORD)
)
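
For non-interactive runs (a scheduled job, for example), prompting is not possible. Here is a minimal sketch that falls back to `getpass` only when an environment variable is missing. The helper `read_secret` and the variable name `ELASTIC_CLOUD_ID` are assumptions made for this example, not something the Elastic client looks up on its own.

```python
import getpass
import os


def read_secret(env_var, prompt):
    """Return the value of env_var if it is set, otherwise prompt without echo."""
    value = os.environ.get(env_var)
    if value:
        return value
    return getpass.getpass(prompt)


# Placeholder value for the demo; in practice the variable is set outside Python.
os.environ["ELASTIC_CLOUD_ID"] = "my-deployment:placeholder"

cloud_id = read_secret("ELASTIC_CLOUD_ID", "Enter Elastic Cloud ID:  ")
print(cloud_id)
```

The same helper could be used for the password, so the notebook cell works both interactively and unattended.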

Confirm that the client has connected with this test:

In [None]:
print(client.info())

Refer to https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new to learn how to connect to a self-managed deployment.

Read https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html to learn how to connect using API keys.
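
As a rough sketch of how these connection options differ, the hypothetical helper below builds the keyword arguments for `Elasticsearch(...)` from whichever settings are available. It uses the client's `cloud_id`, `hosts`, `api_key` and `basic_auth` parameters. The host URL and API key are placeholders, and TLS settings for self-managed clusters (such as `ca_certs`) are left out for brevity.

```python
def connection_kwargs(cloud_id=None, hosts=None, api_key=None, password=None):
    """Build keyword arguments that can be passed to Elasticsearch(**kwargs)."""
    kwargs = {}

    # Where to connect: Elastic Cloud (cloud_id) or a self-managed cluster (hosts).
    if cloud_id:
        kwargs["cloud_id"] = cloud_id
    elif hosts:
        kwargs["hosts"] = hosts
    else:
        raise ValueError("Provide either cloud_id or hosts")

    # How to authenticate: an API key, or the "elastic" user with a password.
    if api_key:
        kwargs["api_key"] = api_key
    elif password:
        kwargs["basic_auth"] = ("elastic", password)
    else:
        raise ValueError("Provide either api_key or password")

    return kwargs


kwargs = connection_kwargs(hosts="https://localhost:9200", api_key="placeholder-api-key")
print(sorted(kwargs))
# With the client installed: client = Elasticsearch(**kwargs)
```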


# Create Elasticsearch index with required mappings

We need to add a field to support dense vector storage and search.
Note the `passage_embedding` field below, which is used to store the dense vector representation of the `passage` field.

In [None]:
# Define the mapping
mapping = {
    "mappings": {
        "properties": {
            "language": {"type": "keyword"},
            "id": {"type": "keyword"},
            "title": {"type": "text"},
            "passage": {"type": "text"},
            "passage_embedding": {
                "type": "dense_vector",
                "dims": 768,
                "index": True,
                "similarity": "cosine"
            }
        }
    }
}

# Create the index (deleting any existing index)
client.indices.delete(index="articles", ignore_unavailable=True)
client.indices.create(index="articles", mappings=mapping["mappings"])
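
The `dims` value has to match the length of the vectors the model produces (768 in the mapping above). Elasticsearch rejects a document whose vector has a different length. Below is a small sketch of a pre-flight check. The helper `check_dims` is hypothetical, and a dummy vector stands in for a real embedding.

```python
def check_dims(vector, mapping, field="passage_embedding"):
    """Raise ValueError if the vector length differs from the mapped dims."""
    expected = mapping["mappings"]["properties"][field]["dims"]
    if len(vector) != expected:
        raise ValueError(
            f"{field}: expected {expected} dimensions, got {len(vector)}"
        )
    return True


# Reduced copy of the mapping, kept local so the example is self-contained.
example_mapping = {
    "mappings": {
        "properties": {
            "passage_embedding": {"type": "dense_vector", "dims": 768}
        }
    }
}

dummy_vector = [0.1] * 768  # placeholder, not a real embedding
print(check_dims(dummy_vector, example_mapping))
```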


## Dataset

Let's index some data.
Note that we are embedding the `passage` field using the sentence transformer model.
Once indexed, you'll see that your documents contain a `passage_embedding` field (`"type": "dense_vector"`) which contains a vector of floating point values.
This is the embedding of the `passage` field in vector space.
We'll use this field to perform semantic search using kNN.

In [None]:
articles = [
    {
        "language": "en",
        "id": "1643584#0",
        "title": "Bloor Street",
        "passage": """Bloor Street is a major east–west residential and commercial thoroughfare in Toronto, Ontario, Canada. Bloor Street runs from the Prince Edward Viaduct, which spans the Don River Valley, westward into Mississauga where it ends at Central Parkway. East of the viaduct, Danforth Avenue continues along the same right-of-way. The street, approximately long, contains a significant cross-sample of Toronto's ethnic communities. It is also home to Toronto's famous shopping street, the Mink Mile.""",
    },
    {
        "language": "en",
        "id": "2190499#0",
        "title": "Elphinstone College",
        "passage": """Elphinstone College is an institution of higher education affiliated to the University of Mumbai. Established in 1856, it is one of the oldest colleges of the University of Mumbai. It is reputed for producing luminaries like Bal Gangadhar Tilak, Bhim Rao Ambedkar, Virchand Gandhi, Badruddin Tyabji, Pherozshah Mehta, Kashinath Trimbak Telang, Jamsetji Tata and for illustrious professors that includes Dadabhai Naoroji. It is further observed for having played a key role in spread of Western education in the Bombay Presidency.""",
    },
    {
        "language": "en",
        "id": "8881#0",
        "title": "Doctor (title)",
        "passage": """Doctor is an academic title that originates from the Latin word of the same spelling and meaning. The word is originally an agentive noun of the Latin verb "" 'to teach'. It has been used as an academic title in Europe since the 13th century, when the first Doctorates were awarded at the University of Bologna and the University of Paris. Having become established in European universities, this usage spread around the world. Contracted "Dr" or "Dr.", it is used as a designation for a person who has obtained a Doctorate (e.g. PhD). In many parts of the world it is also used by medical practitioners, regardless of whether or not they hold a doctoral-level degree.""",
    },
    {
        "language": "de",
        "id": "9002#0",
        "title": "Gesundheits- und Krankenpflege",
        "passage": """Die Gesundheits- und Krankenpflege als Berufsfeld umfasst die Versorgung und Betreuung von Menschen aller Altersgruppen, insbesondere kranke, behinderte und sterbende Erwachsene. Die Gesundheits- und Kinderkrankenpflege hat ihren Schwerpunkt in der Versorgung von Kindern und Jugendlichen. In beiden Fachrichtungen gehört die Verhütung von Krankheiten und Gesunderhaltung zum Aufgabengebiet der professionellen Pflege.""",
    },
    {
        "language": "de",
        "id": "7769762#0",
        "title": "Tourismusregion (Österreich)",
        "passage": """Unter Tourismusregion versteht man in Österreich die in den Landestourismusgesetzen verankerten Tourismusverbände mehrerer Gemeinden, im weiteren Sinne aller Gebietskörperschaften.""",
    },
    {
        "language": "de",
        "id": "2270104#0",
        "title": "London Wall",
        "passage": """London Wall ist die strategische Stadtmauer, die die Römer um Londinium gebaut haben, um die Stadt zu schützen, die über den wichtigen Hafen an der Themse verfügte. Bis ins späte Mittelalter hinein bildete diese Stadtmauer die Grenzen von London. Heute ist "London Wall" auch der Name einer Straße, die an einem noch bestehenden Abschnitt der Stadtmauer verläuft.""",
    },
    {
        "language": "de",
        "id": "2270104#1",
        "title": "London Wall",
        "passage": """Die Mauer wurde Ende des zweiten oder Anfang des dritten Jahrhunderts erbaut, wahrscheinlich zwischen 190 und 225, vermutlich zwischen 200 und 220. Sie entstand somit etwa achtzig Jahre nach dem im Jahr 120 erfolgten Bau der Festung, deren nördliche und westliche Mauern verstärkt und in der Höhe verdoppelt wurden, um einen Teil der neuen Stadtmauer zu bilden. Die Anlage wurde zumindest bis zum Ende des vierten Jahrhunderts weiter ausgebaut. Sie zählt zu den letzten großen Bauprojekten der Römer vor deren Rückzug aus Britannien im Jahr 410.""",
    },
]

## Index documents

Our dataset is a Python list of dictionaries, each holding a passage from a Wikipedia article in one of two languages.
We'll use the client's `bulk` method to index all documents in a single request.

The following code iterates over the articles and builds a list of operations.
Each article contributes two entries: an "index" action that names the target index, followed by the article document itself.
Before the document is added, its passage is encoded with our selected model, and the resulting vector is stored in the document's `passage_embedding` field.
Note that the E5 models expect an instruction prefix: "passage: " tells the model that the text is a passage to be embedded.
On the query side, the query string is prefixed with "query: ".

Finally, we call the `bulk` method, specifying the index name and the list of operations.

In [None]:
actions = []
for article in articles:
    actions.append({"index": {"_index": "articles"}})
    passage = article["passage"]
    passage_embedding = model.encode(f"passage: {passage}").tolist()
    article["passage_embedding"] = passage_embedding
    actions.append(article)

client.bulk(index="articles", operations=actions)
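
To make the shape of the bulk payload concrete, here is a sketch that builds the same alternating list without a model or a cluster. `fake_encode` is a stand-in for `model.encode(...).tolist()` and returns a meaningless 3-element vector. `build_actions` is a hypothetical helper name.

```python
def fake_encode(text):
    # Stand-in for model.encode(text).tolist(); NOT a real embedding.
    return [float(len(text)), 0.0, 1.0]


def build_actions(docs, index_name, encode=fake_encode):
    """Return the alternating [action, document, action, document, ...] list."""
    actions = []
    for doc in docs:
        # Action metadata entry, followed by the document source.
        actions.append({"index": {"_index": index_name}})
        embedding = encode(f"passage: {doc['passage']}")
        # Copy the document so the input list is left unmodified.
        actions.append({**doc, "passage_embedding": embedding})
    return actions


docs = [
    {"language": "en", "id": "a", "passage": "A short passage."},
    {"language": "de", "id": "b", "passage": "Ein kurzer Absatz."},
]

actions = build_actions(docs, "articles")
print(len(actions))  # 4: one action entry plus one document per article
```

Unlike the cell above, this variant copies each document instead of adding the embedding to the original dictionary. That is a design choice for the sketch, not a requirement of the bulk API.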

# Multilingual Semantic Search

In the following, we will search using three kinds of queries:
 * Query in English to find documents in any language
 * Query in English to find documents in German only (using a filter)
 * Query in German to find documents in German only (using a filter),
   to show the model's capabilities in non-English languages

Note again that the query is prefixed with "query: ", which the model requires to encode the query properly.

A quick translation for those unfamiliar with German:
 * "health" -> "Gesundheit"
 * "wall" -> "Mauer"

In [None]:
def pretty_response(response):
    for hit in response["hits"]["hits"]:
        score = hit["_score"]
        language = hit["_source"]["language"]
        doc_id = hit["_source"]["id"]
        title = hit["_source"]["title"]
        passage = hit["_source"]["passage"]
        print()
        print(f"ID: {doc_id}")
        print(f"Language: {language}")
        print(f"Title: {title}")
        print(f"Passage: {textwrap.fill(passage, 120)}")
        print(f"Score: {score}")

In [None]:
def query(q, language=None):
    knn = {
        "field": "passage_embedding",
        "query_vector" : model.encode(f"query: {q}").tolist(),
        "k": 2,
        "num_candidates": 5
    }

    if language:
        knn["filter"] = {
            "term": {
                "language": language,
            }
        }

    return client.search(index="articles", knn=knn)
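
Conceptually, a filtered kNN search returns the `k` documents most similar to the query vector among those that pass the filter. The sketch below does this by exact brute force over tiny made-up 2-dimensional vectors. Elasticsearch uses an approximate index (HNSW) instead, so this illustrates the semantics only, not the implementation. The helper names and vectors are invented for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_knn(docs, query_vector, k=2, language=None):
    """Return the k most similar docs, optionally restricted to one language."""
    candidates = [d for d in docs if language is None or d["language"] == language]
    ranked = sorted(
        candidates,
        key=lambda d: cosine(d["vector"], query_vector),
        reverse=True,
    )
    return ranked[:k]


toy_docs = [
    {"id": "en-1", "language": "en", "vector": [1.0, 0.0]},
    {"id": "de-1", "language": "de", "vector": [0.9, 0.1]},
    {"id": "de-2", "language": "de", "vector": [0.0, 1.0]},
]

unfiltered = [d["id"] for d in brute_force_knn(toy_docs, [1.0, 0.0])]
german_only = [d["id"] for d in brute_force_knn(toy_docs, [1.0, 0.0], language="de")]
print(unfiltered)
print(german_only)
```

With the filter in place, the English document drops out and the next-best German document takes its slot, even though its similarity is much lower.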

In [13]:
pretty_response(query("health"))


ID: 9002#0
Language: de
Title: Gesundheits- und Krankenpflege
Passage: Die Gesundheits- und Krankenpflege als Berufsfeld umfasst die Versorgung und Betreuung von Menschen aller Altersgruppen,
insbesondere kranke, behinderte und sterbende Erwachsene. Die Gesundheits- und Kinderkrankenpflege hat ihren Schwerpunkt
in der Versorgung von Kindern und Jugendlichen. In beiden Fachrichtungen gehört die Verhütung von Krankheiten und
Gesunderhaltung zum Aufgabengebiet der professionellen Pflege.
Score: 0.8986236

ID: 8881#0
Language: en
Title: Doctor (title)
Passage: Doctor is an academic title that originates from the Latin word of the same spelling and meaning. The word is originally
an agentive noun of the Latin verb "" 'to teach'. It has been used as an academic title in Europe since the 13th
century, when the first Doctorates were awarded at the University of Bologna and the University of Paris. Having become
established in European universities, this usage spread around the world. Contract

In the results above, the German document about healthcare ranks higher for the query "health"
than the English document, which is about doctors more generally rather than health specifically.
This is the power of a multilingual embedding model: it embeds meaning across languages,
so a query can match a relevant document even when the two are written in different languages.
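
The `_score` values shown are not raw cosine similarities. For a `dense_vector` field with `cosine` similarity, Elasticsearch documents the kNN score as `(1 + cosine) / 2`, which keeps scores non-negative. Here is a small sketch of the conversion in both directions (the function names are ours):

```python
def cosine_to_score(cosine):
    """Map a cosine similarity in [-1, 1] to a kNN _score in [0, 1]."""
    return (1.0 + cosine) / 2.0


def score_to_cosine(score):
    """Invert the mapping: recover the cosine similarity from a _score."""
    return 2.0 * score - 1.0


# Score of the top hit for the "health" query above.
print(round(score_to_cosine(0.8986236), 4))
```

Under that formula, the top hit's score of 0.8986236 corresponds to a cosine similarity of roughly 0.80 between the query and passage embeddings.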

In [14]:
pretty_response(query("wall", language="de"))


ID: 2270104#0
Language: de
Title: London Wall
Passage: London Wall ist die strategische Stadtmauer, die die Römer um Londinium gebaut haben, um die Stadt zu schützen, die über
den wichtigen Hafen an der Themse verfügte. Bis ins späte Mittelalter hinein bildete diese Stadtmauer die Grenzen von
London. Heute ist "London Wall" auch der Name einer Straße, die an einem noch bestehenden Abschnitt der Stadtmauer
verläuft.
Score: 0.8941859

ID: 2270104#1
Language: de
Title: London Wall
Passage: Die Mauer wurde Ende des zweiten oder Anfang des dritten Jahrhunderts erbaut, wahrscheinlich zwischen 190 und 225,
vermutlich zwischen 200 und 220. Sie entstand somit etwa achtzig Jahre nach dem im Jahr 120 erfolgten Bau der Festung,
deren nördliche und westliche Mauern verstärkt und in der Höhe verdoppelt wurden, um einen Teil der neuen Stadtmauer zu
bilden. Die Anlage wurde zumindest bis zum Ende des vierten Jahrhunderts weiter ausgebaut. Sie zählt zu den letzten
großen Bauprojekten der Römer vor der

In [15]:
pretty_response(query("Mauer", language="de"))


ID: 2270104#1
Language: de
Title: London Wall
Passage: Die Mauer wurde Ende des zweiten oder Anfang des dritten Jahrhunderts erbaut, wahrscheinlich zwischen 190 und 225,
vermutlich zwischen 200 und 220. Sie entstand somit etwa achtzig Jahre nach dem im Jahr 120 erfolgten Bau der Festung,
deren nördliche und westliche Mauern verstärkt und in der Höhe verdoppelt wurden, um einen Teil der neuen Stadtmauer zu
bilden. Die Anlage wurde zumindest bis zum Ende des vierten Jahrhunderts weiter ausgebaut. Sie zählt zu den letzten
großen Bauprojekten der Römer vor deren Rückzug aus Britannien im Jahr 410.
Score: 0.88160384

ID: 2270104#0
Language: de
Title: London Wall
Passage: London Wall ist die strategische Stadtmauer, die die Römer um Londinium gebaut haben, um die Stadt zu schützen, die über
den wichtigen Hafen an der Themse verfügte. Bis ins späte Mittelalter hinein bildete diese Stadtmauer die Grenzen von
London. Heute ist "London Wall" auch der Name einer Straße, die an einem noch bestehe