# Semantic Search using the Inference API with the Hugging Face Inference Endpoints Service

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/integrations/hugging-face/huggingface-integration-millions-of-documents-with-cohere-reranking.ipynb)


Learn how to use the [Inference API](https://www.elastic.co/guide/en/elasticsearch/reference/current/inference-apis.html) with the Hugging Face Inference Endpoint service for semantic search.

# 🧰 Requirements

For this example, you will need:

- An Elastic deployment:
   - We'll be using [Elastic serverless](https://www.elastic.co/docs/current/serverless) for this example (available with a [free trial](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook))

- Elasticsearch 8.14 or above.
   
- A paid [Hugging Face Inference Endpoint](https://huggingface.co/docs/inference-endpoints/guides/create_endpoint), which is required to use the Inference API with the Hugging Face Inference Endpoint service.

# Create Elastic Cloud deployment or serverless project

If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial.

# Install packages and connect with Elasticsearch Client

To get started, we'll need to connect to our Elastic deployment using the Python client (version 8.12.0 or above).
Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.

First we need to `pip` install the following packages:

- `elasticsearch`
- `datasets`

In [None]:
%pip install elasticsearch
%pip install datasets

Next, we import the modules we need. 🔐 NOTE: `getpass` lets us prompt the user for credentials without echoing them to the terminal.

In [1]:
from elasticsearch import Elasticsearch, helpers
from getpass import getpass
import datasets

Now we can instantiate the Python Elasticsearch client.

First we prompt the user for their Cloud ID and API key.
Then we create a `client` object, an instance of the `Elasticsearch` class.

In [2]:
# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic Api Key: ")

# Create the client instance
client = Elasticsearch(
    # For local development
    # hosts=["http://localhost:9200"]
    cloud_id=ELASTIC_CLOUD_ID,
    api_key=ELASTIC_API_KEY,
    request_timeout=120,
    max_retries=10,
    retry_on_timeout=True,
)

### Test the Client
Before you continue, confirm that the client has connected with this test.

In [3]:
print(client.info())


# define this now so we can use it later
def pretty_search_response(response):
    if len(response["hits"]["hits"]) == 0:
        print("Your search returned no results.")
    else:
        for hit in response["hits"]["hits"]:
            id = hit["_id"]
            score = hit["_score"]
            text = hit["_source"]["text_field"]

            pretty_output = f"\nID: {id}\nScore: {score}\nText: {text}"

            print(pretty_output)

{'name': 'serverless', 'cluster_name': 'd3ae40d244564c39961aa942d9d47f84', 'cluster_uuid': 'poKWeRbiS--nyD43R_NROw', 'version': {'number': '8.11.0', 'build_flavor': 'serverless', 'build_type': 'docker', 'build_hash': '00000000', 'build_date': '2023-10-31', 'build_snapshot': False, 'lucene_version': '9.7.0', 'minimum_wire_compatibility_version': '8.11.0', 'minimum_index_compatibility_version': '8.11.0'}, 'tagline': 'You Know, for Search'}


Refer to [the documentation](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new) to learn how to connect to a self-managed deployment.

Read [this page](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new) to learn how to connect using API keys.

<a name="create-the-inference-endpoint"></a>
## Create the inference endpoint object

Let's create the inference endpoint by using the [Create inference API](https://www.elastic.co/guide/en/elasticsearch/reference/current/put-inference-api.html).

You'll need a Hugging Face API key (access token) for this, which you can find in your Hugging Face account under [Access Tokens](https://huggingface.co/settings/tokens).

You will also need to have created a [Hugging Face Inference Endpoint service instance](https://huggingface.co/docs/inference-endpoints/guides/create_endpoint) and noted the `url` of your instance. For this notebook, we deployed the `multilingual-e5-small` model.

In [5]:
API_KEY = getpass("Huggingface API key:  ")
client.inference.put(
    inference_id="my_hf_endpoint_object",
    body={
        "service": "hugging_face",
        "service_settings": {
            "api_key": API_KEY,
            "url": "<HF-URL>",
            "similarity": "dot_product",
        },
    },
    task_type="text_embedding",
)

ObjectApiResponse({'inference_id': 'my_hf_endpoint_object', 'task_type': 'text_embedding', 'service': 'hugging_face', 'service_settings': {'url': 'https://yb0j0ol2xzvro0oc.us-east-1.aws.endpoints.huggingface.cloud', 'similarity': 'dot_product', 'dimensions': 384, 'rate_limit': {'requests_per_minute': 3000}}, 'task_settings': {}})
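The response above reports the `dimensions` and `similarity` of the endpoint, which are the same values a `dense_vector` field needs. As a minimal sketch, the hypothetical helper below (`dense_vector_mapping` is not part of the Elasticsearch client) derives a field mapping from a response-shaped dictionary, assuming the response is available as a plain `dict`.

```python
def dense_vector_mapping(endpoint_info: dict) -> dict:
    """Build a dense_vector field mapping from a create-endpoint response."""
    settings = endpoint_info["service_settings"]
    return {
        "type": "dense_vector",
        "dims": settings["dimensions"],
        "similarity": settings["similarity"],
    }


# Shape copied from the response shown above (URL and rate limit omitted).
endpoint_info = {
    "inference_id": "my_hf_endpoint_object",
    "task_type": "text_embedding",
    "service": "hugging_face",
    "service_settings": {"similarity": "dot_product", "dimensions": 384},
}

print(dense_vector_mapping(endpoint_info))
```

Deriving the mapping this way keeps the index definition in step with the model if you later deploy one with a different embedding size.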

In [6]:
client.inference.inference(
    inference_id="my_hf_endpoint_object", input="this is the raw text of my document!"
)

ObjectApiResponse({'text_embedding': [{'embedding': [0.026027203, -0.011120652, -0.048804738, -0.108695105, 0.06134937, -0.003066093, 0.053232085, 0.103629395, 0.046043355, 0.0055427994, 0.036174323, 0.022110537, 0.084891565, -0.008215214, -0.017915571, 0.041923355, 0.048264034, -0.0404355, -0.02609504, -0.023076748, 0.0077286777, 0.023034474, 0.010379155, 0.06257496, 0.025658935, 0.040398516, -0.059809092, 0.032451782, 0.020798752, -0.053219322, -0.0447653, -0.033474423, 0.085040554, -0.051343303, 0.081006914, 0.026895791, -0.031822708, -0.06217641, 0.069435075, -0.055062667, -0.014967285, -0.0040517864, 0.03874908, 0.07854211, 0.017526977, 0.040629108, -0.023190023, 0.056913305, -0.06422566, -0.009403182, -0.06666503, 0.035270344, 0.004515737, 0.07347306, 0.011125566, -0.07184689, -0.08095445, -0.04214626, -0.108447045, -0.019494658, 0.06303337, 0.019757038, -0.014584281, 0.060923614, 0.06465893, 0.108431116, 0.04072316, 0.03705652, -0.06975359, -0.050562095, -0.058487326, 0.05989619

**IMPORTANT:** If you use Elasticsearch 8.12, you must change `inference_id` in the snippet above to `model_id`! 
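If the same notebook has to run against more than one Elasticsearch version, the keyword name can be chosen at run time. The helper below is a sketch based only on the note above; `endpoint_id_kwargs` is a hypothetical name, and it assumes the server version is a dotted string such as the `version.number` value returned by `client.info()`.

```python
def endpoint_id_kwargs(endpoint_id: str, server_version: str) -> dict:
    """Return the keyword argument that identifies the inference endpoint.

    Per the note above, Elasticsearch 8.12 expects `model_id`;
    this sketch assumes every other version expects `inference_id`.
    """
    major, minor = (int(part) for part in server_version.split(".")[:2])
    key = "model_id" if (major, minor) == (8, 12) else "inference_id"
    return {key: endpoint_id}


print(endpoint_id_kwargs("my_hf_endpoint_object", "8.12.2"))
print(endpoint_id_kwargs("my_hf_endpoint_object", "8.14.0"))

# Possible usage (requires a live cluster):
# client.inference.inference(
#     **endpoint_id_kwargs("my_hf_endpoint_object", "8.14.0"),
#     input="this is the raw text of my document!",
# )
```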

## Create an ingest pipeline with an inference processor

Create an ingest pipeline with an inference processor by using the [`put_pipeline`](https://www.elastic.co/guide/en/elasticsearch/reference/master/put-pipeline-api.html) method. Reference the `inference_id` created above as `model_id` to infer on the data that is being ingested by the pipeline.

In [7]:
client.ingest.put_pipeline(
    id="hf_pipeline",
    processors=[
        {
            "inference": {
                "model_id": "my_hf_endpoint_object",
                "input_output": {
                    "input_field": "text_field",
                    "output_field": "text_embedding",
                },
            }
        }
    ],
)

ObjectApiResponse({'acknowledged': True})

Let's note a few important parameters from that API call:

- `inference`: A processor that performs inference using a machine learning model.
- `model_id`: Specifies the ID of the inference endpoint to be used. In this example, the inference ID is set to `my_hf_endpoint_object`. Use the inference ID you defined when you created the inference endpoint.
- `input_output`: Specifies input and output fields.
- `input_field`: Name of the field whose text is sent to the inference endpoint to create the `dense_vector` representation.
- `output_field`: Name of the field that will contain the inference results.
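To make the `input_output` behaviour concrete, here is a purely local sketch of what the processor does to each document: it reads `input_field`, runs inference on it, and writes the result to `output_field`. Both `apply_input_output` and `fake_embed` are hypothetical stand-ins; the real work happens inside Elasticsearch and the Hugging Face endpoint.

```python
from typing import Callable, List


def apply_input_output(
    doc: dict,
    input_field: str,
    output_field: str,
    embed: Callable[[str], List[float]],
) -> dict:
    """Return a copy of doc with embed(doc[input_field]) stored in output_field."""
    enriched = dict(doc)
    enriched[output_field] = embed(doc[input_field])
    return enriched


def fake_embed(text: str) -> List[float]:
    # Hypothetical stand-in for the inference endpoint: NOT a real embedding.
    return [float(len(text)), float(text.count(" "))]


doc = {"text_field": "this is the raw text of my document!"}
enriched = apply_input_output(doc, "text_field", "text_embedding", fake_embed)
print(enriched)
```

Note that the document must actually contain the field named in `input_field`; in this notebook that field is `text_field`.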

## Create index

Next, create the mapping of the destination index, the index that will contain the embeddings the model creates from your input text. The destination index must have a field with the [dense_vector](https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html) field type to index the output of the model we deployed in Hugging Face (`multilingual-e5-small`).

Let's create an index named `hf-endpoint-index` with the mappings we need.

In [8]:
client.indices.create(
    index="hf-endpoint-index",
    settings={
        "index": {
            "default_pipeline": "hf_pipeline",
        }
    },
    mappings={
        "properties": {
            # must match the `input_field` that `hf_pipeline` reads from
            "text_field": {"type": "text"},
            "text_embedding": {
                "type": "dense_vector",
                "dims": 384,
                "similarity": "dot_product",
            },
        }
    },
)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'hf-endpoint-index'})
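The mapping above uses `dot_product` similarity, and for float vectors Elasticsearch requires `dot_product` vectors to be normalized to unit length. A quick sanity check on an embedding returned by the endpoint could look like the sketch below; the helper names and the tolerance are assumptions, and the sample vector is a toy value rather than real model output.

```python
import math
from typing import List, Sequence


def l2_norm(vector: Sequence[float]) -> float:
    """Euclidean length of the vector."""
    return math.sqrt(sum(x * x for x in vector))


def is_unit_length(vector: Sequence[float], tolerance: float = 1e-3) -> bool:
    """True when the vector length is within `tolerance` of 1.0."""
    return math.isclose(l2_norm(vector), 1.0, abs_tol=tolerance)


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale the vector to unit length."""
    norm = l2_norm(vector)
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


sample = [3.0, 4.0]  # toy vector, not a real embedding
print(is_unit_length(sample))
print(is_unit_length(normalize(sample)))
```

If your model does not return unit-length vectors, `cosine` similarity is the usual alternative in the mapping.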

## Create an index with a `semantic_text` field (serverless or v8.15+)

If you are using Elasticsearch serverless or v8.15+, you have access to the new [`semantic_text`](https://github.com/elastic/elasticsearch/blob/main/docs/reference/mapping/types/semantic-text.asciidoc) field type. `semantic_text` has significantly faster ingest times and is recommended.

In [9]:
client.indices.create(
    index="hf-semantic-text-index",
    mappings={
        "properties": {
            "infer_field": {
                "type": "semantic_text",
                "inference_id": "my_hf_endpoint_object",
            },
            "text_field": {"type": "text", "copy_to": "infer_field"},
        }
    },
)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'hf-semantic-text-index'})

## Insert Documents

In this example, we want to show the power of using GPUs in Hugging Face's Inference Endpoint service by indexing millions of multilingual documents from the MIRACL corpus. The speed at which these documents are ingested depends on whether you use a `semantic_text` field (faster) or an ingest pipeline (slower), and on how much hardware you rent for your Hugging Face inference endpoint. Using a `semantic_text` field with a single T4 GPU, it may take about 3 hours to index 1 million documents.
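As a back-of-the-envelope check of that figure, the arithmetic below converts "about 3 hours per 1 million documents" into a throughput and projects it onto a larger corpus. It assumes throughput stays constant as the corpus grows, which is a simplification; the function names are illustrative only.

```python
def docs_per_second(docs: int, hours: float) -> float:
    """Average ingest rate for `docs` documents indexed in `hours` hours."""
    return docs / (hours * 3600)


def estimated_hours(total_docs: int, rate: float) -> float:
    """Hours needed to ingest `total_docs` at `rate` documents per second."""
    return total_docs / rate / 3600


rate = docs_per_second(1_000_000, 3)  # figure quoted above
print(f"{rate:.1f} docs/s")  # 92.6 docs/s
print(f"{estimated_hours(5_000_000, rate):.1f} hours for 5M docs")  # 15.0 hours
```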

In [10]:
langs = [
    "ar",
    "bn",
    "en",
    "es",
    "fa",
    "fi",
    "fr",
    "hi",
    "id",
    "ja",
    "ko",
    "ru",
    "sw",
    "te",
    "th",
    "zh",
]


all_langs_datasets = [
    iter(datasets.load_dataset("miracl/miracl-corpus", lang)["train"]) for lang in langs
]

In [11]:
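MAX_BULK_SIZE = 1000
MAX_BULK_UPLOADS = 1000

sentinel = object()
for j in range(MAX_BULK_UPLOADS):
    documents = []
    exhausted = False
    # take one record per language in turn until the batch is nearly full
    while not exhausted and len(documents) < MAX_BULK_SIZE - len(all_langs_datasets):
        exhausted = True
        for ds in all_langs_datasets:
            record = next(ds, sentinel)
            if record is not sentinel:
                exhausted = False
                documents.append(
                    {
                        "_index": "hf-semantic-text-index",
                        "_source": {"text_field": record["text"]},
                    }
                )
                # if you are using an ingest pipeline instead of a
                # semantic text field, use this instead
                # (`hf_pipeline` reads its input from `text_field`):
                # documents.append(
                #     {
                #         "_index": "hf-endpoint-index",
                #         "_source": {"text_field": record["text"]},
                #     }
                # )

    if not documents:
        break  # every dataset is exhausted, nothing left to upload

    try:
        # with raise_on_error=False, failures are returned instead of raised
        success, errors = helpers.bulk(
            client, documents, raise_on_error=False, timeout="60s"
        )
        # approximate running total: each batch holds slightly fewer than
        # MAX_BULK_SIZE documents
        print("Docs uploaded:", (j + 1) * MAX_BULK_SIZE)

    except Exception as e:
        print("exception:", str(e))

Docs uploaded: 1000
Docs uploaded: 2000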


KeyboardInterrupt: 
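The loop above interleaves the languages so that every bulk request contains a mix of them. The same idea can be written as a small generator, sketched here on toy iterators so it runs without a cluster; `round_robin_batches` is a hypothetical helper, not something provided by `elasticsearch` or `datasets`.

```python
from typing import Iterator, List

_SENTINEL = object()


def round_robin_batches(
    sources: List[Iterator[dict]], batch_size: int
) -> Iterator[List[dict]]:
    """Yield batches of up to batch_size records, one record per source in turn."""
    active = list(sources)
    batch: List[dict] = []
    while active:
        still_active = []
        for source in active:
            record = next(source, _SENTINEL)
            if record is _SENTINEL:
                continue  # this source is exhausted, drop it
            still_active.append(source)
            batch.append(record)
            if len(batch) == batch_size:
                yield batch
                batch = []
        active = still_active
    if batch:
        yield batch  # final, possibly smaller, batch


# Toy stand-ins for two language datasets
en = iter([{"text": "en-1"}, {"text": "en-2"}, {"text": "en-3"}])
sw = iter([{"text": "sw-1"}])

batches = list(round_robin_batches([en, sw], batch_size=2))
for batch in batches:
    print([record["text"] for record in batch])
```

Each yielded batch could then be turned into bulk actions and passed to `helpers.bulk`, and the generator stops by itself once every dataset is exhausted.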

## Semantic search

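After the dataset has been enriched with the embeddings, you can query the data using [semantic search](https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html#knn-semantic-search). Because `hf-semantic-text-index` uses a `semantic_text` field, pass the query text to a `semantic` query on that field; Elasticsearch generates the query embedding with the inference endpoint attached to the field. If you indexed through the ingest pipeline into a `dense_vector` field instead, pass a `query_vector_builder` to the k-nearest neighbor (kNN) vector search API, and provide the query text and the inference endpoint you used to create the embeddings.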

In [12]:
query = "English speaking countries"
semantic_search_results = client.search(
    index="hf-semantic-text-index",
    query={"semantic": {"field": "infer_field", "query": query}},
)

In [13]:
pretty_search_response(semantic_search_results)


ID: DDbC4pEBhYre9Ocn7zIr
Score: 0.92574656
Text: Orodha ya nchi kufuatana na wakazi

ID: bjbC4pEBhYre9OcnzC3U
Score: 0.9159906
Text: Intercontinental Cup

ID: njbC4pEBhYre9OcnzC3U
Score: 0.91523564
Text: รายการจัดเรียงตามทวีปและประเทศ

ID: bDbC4pEBhYre9Ocn3jBM
Score: 0.9142189
Text: a b c ĉ d e f g ĝ h ĥ i j ĵ k l m n o p r s ŝ t u ŭ v z

ID: 8jbD4pEBhYre9OcnDTSL
Score: 0.9127883
Text: With Australia:
With Adelaide United:

ID: MzbC4pEBhYre9Ocn_TQ1
Score: 0.9116771
Text: Más información en .

ID: _DbC4pEBhYre9Ocn7zEr
Score: 0.9106927
Text: (AS)= Asia (AF)= Afrika (NA)= Amerika ya kaskazini (SA)= Amerika ya kusini (A)= Antaktika (EU)= Ulaya na (AU)= Australia na nchi za Pasifiki.

ID: fDbC4pEBhYre9Ocn7zEr
Score: 0.9096315
Text: Stadi za lugha ya mazungumzo ni kuzungumza na kusikiliza.

ID: DDbC4pEBhYre9Ocn3jBL
Score: 0.90771043
Text: "*(Meksiko mara nyingi huhesabiwa katika Amerika ya Kati kwa sababu za kiutamaduni)"

ID: IjbC4pEBhYre9Ocn3i9L
Score: 0.9070151
Text: Englan is a small vi

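## Rerank the results with Cohere

To refine the ordering of the semantic search results, create a second inference endpoint with the `rerank` task type, backed by Cohere's `rerank-english-v3.0` model, and use it in a `text_similarity_reranker` retriever. You'll need a Cohere API key for this step.

In [17]: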
try:
    client.inference.delete(inference_id="my_cohere_rerank_endpoint")
except Exception:
    pass
client.inference.put(
    task_type="rerank",
    inference_id="my_cohere_rerank_endpoint",
    body={
        "service": "cohere",
        "service_settings": {
            "api_key": "<COHERE-API-KEY>",
            "model_id": "rerank-english-v3.0",
        },
        "task_settings": {"top_n": 100, "return_documents": True},
    },
)

ObjectApiResponse({'inference_id': 'my_cohere_rerank_endpoint', 'task_type': 'rerank', 'service': 'cohere', 'service_settings': {'model_id': 'rerank-english-v3.0', 'rate_limit': {'requests_per_minute': 10000}}, 'task_settings': {'top_n': 100, 'return_documents': True}})

In [18]:
reranked_search_results = client.search(
    index="hf-semantic-text-index",
    retriever={
        "text_similarity_reranker": {
            "retriever": {
                "standard": {
                    "query": {"semantic": {"field": "infer_field", "query": query}}
                }
            },
            "field": "text_field",
            "inference_id": "my_cohere_rerank_endpoint",
            "inference_text": query,
            "rank_window_size": 100,
        }
    },
)

In [19]:
pretty_search_response(reranked_search_results)


ID: _DbC4pEBhYre9Ocn7zEr
Score: 0.1766716
Text: (AS)= Asia (AF)= Afrika (NA)= Amerika ya kaskazini (SA)= Amerika ya kusini (A)= Antaktika (EU)= Ulaya na (AU)= Australia na nchi za Pasifiki.

ID: zDbC4pEBhYre9OcnzC7V
Score: 0.06394842
Text: Waingereza nao wakatawala Afrika Mashariki na Kusini, na kuwa sehemu ya Sudan na Somalia, Uganda, Kenya, Tanzania (chini ya jina la Tanganyika), Zanzibar, Nyasaland, Rhodesia, Bechuanaland, Basutoland na Swaziland chini ya utawala wao na baada ya kushinda katika vita huko Afrika ya Kusini walitawala Transvaal, Orange Free State, Cape Colony na Natal, na huko Afrika ya Magharibi walitawala Gambia, Sierra Leone, the Gold Coast na Nigeria.

ID: bDbC4pEBhYre9Ocn3jBM
Score: 0.013532149
Text: a b c ĉ d e f g ĝ h ĥ i j ĵ k l m n o p r s ŝ t u ŭ v z

ID: LDbD4pEBhYre9OcnHje5
Score: 0.010130412
Text: Mifano maarufu ya bunge ni Majumba ya Bunge mjini London, Kongresi mjini Washingtin D.C., Bundestag mjini Berlin na Duma nchini Moscow, Parlamento Italiano mjin
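To see how reranking moved individual documents, the positions in the two responses can be compared. The sketch below works on response-shaped dictionaries with made-up IDs; `rank_positions` and `rank_changes` are hypothetical helpers. A document can have no previous position because the first search returned only its top hits, while the reranker considered a window of 100.

```python
from typing import Dict, List, Optional, Tuple


def rank_positions(response: dict) -> Dict[str, int]:
    """Map each hit's _id to its 1-based position in the response."""
    return {
        hit["_id"]: position
        for position, hit in enumerate(response["hits"]["hits"], start=1)
    }


def rank_changes(before: dict, after: dict) -> List[Tuple[str, Optional[int], int]]:
    """List (doc_id, old_position, new_position) in reranked order.

    old_position is None when the document was not in the first response.
    """
    old = rank_positions(before)
    return [
        (doc_id, old.get(doc_id), new_position)
        for doc_id, new_position in rank_positions(after).items()
    ]


# Toy responses with made-up IDs; real ones come from client.search(...)
before = {"hits": {"hits": [{"_id": "a"}, {"_id": "b"}, {"_id": "c"}]}}
after = {"hits": {"hits": [{"_id": "c"}, {"_id": "d"}, {"_id": "b"}]}}

for doc_id, old_position, new_position in rank_changes(before, after):
    print(doc_id, old_position, "->", new_position)
```

With the notebook's own variables, the call would be `rank_changes(semantic_search_results, reranked_search_results)`.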

**NOTE:** If you search the `dense_vector` index with kNN instead of the `semantic` query, the value of `model_id` in the `query_vector_builder` must match the value of `inference_id` you created in the [first step](#create-the-inference-endpoint).
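For the ingest-pipeline index (`hf-endpoint-index`), a kNN request with a `query_vector_builder` might be assembled as in the sketch below. The helper name `knn_semantic_search_kwargs` and the `k` and `num_candidates` values are assumptions chosen for illustration; the index, field, and endpoint names are the ones used earlier in this notebook.

```python
def knn_semantic_search_kwargs(
    query_text: str,
    inference_id: str,
    index: str = "hf-endpoint-index",
    field: str = "text_embedding",
    k: int = 10,
    num_candidates: int = 100,
) -> dict:
    """Build keyword arguments for client.search() using a kNN query.

    The query text is embedded server-side by the inference endpoint
    named in `model_id`, which must be the endpoint used at index time.
    """
    return {
        "index": index,
        "knn": {
            "field": field,
            "query_vector_builder": {
                "text_embedding": {
                    "model_id": inference_id,
                    "model_text": query_text,
                }
            },
            "k": k,  # assumed value
            "num_candidates": num_candidates,  # assumed value
        },
    }


search_kwargs = knn_semantic_search_kwargs(
    "English speaking countries", "my_hf_endpoint_object"
)
print(search_kwargs)

# Possible usage (requires a live cluster and a populated index):
# results = client.search(**search_kwargs)
```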