# Ensuring Semantic Precision With Minimum Score

<a target="_blank" href="https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/supporting-blog-content/ensuring-semantic-precision-with-minimum-score/ensuring-semantic-precision-with-minimum-score.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Learn how to use `min_score` with semantic text and hybrid search to improve precision. See the [blog post](https://www.elastic.co/search-labs/blog/semantic-precision-with-minimum-score) for further context.

## Requirements

For this example, you will need:

- An Elastic deployment:
  - We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html) for this example (available with a [free trial](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook))

- Elasticsearch 9.1 or above, or [Elasticsearch serverless](https://www.elastic.co/elasticsearch/serverless)

## Create Elastic Cloud deployment

If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial.

## Install packages and connect with Elasticsearch Client

To get started, we'll connect to our Elastic deployment using the Python client (version 8.15.0 or above).
We'll identify the deployment by its HTTPS endpoint and authenticate with an API key.

First we need to `pip` install the following packages:

- `elasticsearch`
- `tqdm`

In [None]:
!pip install elasticsearch tqdm

Next, we need to import the modules we need.

In [None]:
from elasticsearch import Elasticsearch, helpers
from getpass import getpass
import json
import gzip
from tqdm.auto import tqdm as tqdm

Now we can instantiate the Python Elasticsearch client.

First we prompt the user for the deployment's HTTPS endpoint and an API key.
Then we create a `client` object, which is an instance of the `Elasticsearch` class.

🔐 NOTE: `getpass` lets us prompt for the API key without echoing it to the terminal.

In [None]:
# https://www.elastic.co/docs/deploy-manage/deploy/cloud-enterprise/connect-elasticsearch#ece-connect-endpoint
ELASTIC_ENDPOINT = input("Elastic HTTPS Endpoint: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic Api Key: ")

# Create the client instance
client = Elasticsearch(
    hosts=[ELASTIC_ENDPOINT],
    api_key=ELASTIC_API_KEY,
    request_timeout=120
)

INDEX_NAME = "search-movies"

### Test the Client
Before you continue, confirm that the client has connected with this test.

In [None]:
print(client.info())

Refer to [the documentation](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new) to learn how to connect to a self-managed deployment.

Read [this page](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new) to learn how to connect using API keys.

## Create Inference Service
To allow for fast inference we will use the [inference service](https://www.elastic.co/docs/explore-analyze/elastic-inference/eis).

In [None]:
client.inference.put(task_type='sparse_embedding', inference_id='movie-inference',
                    body = {
                        "service": "elastic",
                        "service_settings": {
                            "model_id": "elser"
                        }
                    })

## Create the Index

Now we need to create the movie index. Note that this code deletes any existing index with the same name.

In [None]:
client.indices.delete(index=INDEX_NAME, ignore_unavailable=True)
client.indices.create(
    index=INDEX_NAME,
    mappings={
    "dynamic": "true",
    "dynamic_templates": [
      {
        "all_text_fields": {
          "match_mapping_type": "string",
          "mapping": {
            "analyzer": "english",
            "fields": {
              "keyword": {
                "ignore_above": 2048,
                "type": "keyword"
              }
            }
          }
        }
      }
    ],
    "properties": {
      "title": {
        "type": "text"
      },
      "overview": {
        "type": "text",
        "copy_to": "overview_vector"
      },
      "overview_vector": {
        "type": "semantic_text",
        "inference_id": "movie-inference"
      }
    }
  }
)

Notice how we configured the mappings. We defined `overview_vector` as a `semantic_text` field.
The `inference_id` parameter defines the inference endpoint that is used to generate the embeddings for the field.
Then we configured the `overview` field to [copy its value](https://www.elastic.co/guide/en/elasticsearch/reference/current/copy-to.html) to the `overview_vector` field.

While `copy_to` is not required to use `semantic_text`, it enables use cases like hybrid search where semantic and lexical techniques are used together. We will cover a hybrid search example later in this notebook.

## Populate the Index

Let's populate the index with our example dataset.

### Colab - Download Data - TODO update with real URL

If you are running the notebook in Colab, rather than from a checkout or download of the elasticsearch-labs repository, download the data to the Colab environment first.

In [None]:
!wget -O movies.json.gz https://github.com/mbrunnert/elasticsearch-labs/raw/refs/heads/ensuring-semantic-precision-with-minimum-score/supporting-blog-content/ensuring-semantic-precision-with-minimum-score/movies.json.gz

### Run data import script

This will take a little while, probably around 15 minutes...

In [None]:
def data_generator(file_json, index):
    for doc in file_json:
        # doc["_run_ml_inference"] = True
        yield {
            "_index": index,
            "_source": doc,
        }

print("Indexing movies data, this might take a while...")
# Read the gzip-compressed JSON file as text; the context manager closes it afterwards
with gzip.open("movies.json.gz", "rt", encoding="utf-8") as file:
    file_json = json.load(file)
total_documents = len(file_json)
progress_bar = tqdm(total=total_documents, unit="documents")
success_count = 0


for ok, info in helpers.streaming_bulk(
    client=client,
    chunk_size=16,
    actions=data_generator(file_json, INDEX_NAME),
    raise_on_error=False,  # yield failed documents instead of raising, so they are reported below
):
    if ok:
        success_count += 1
    else:
        print(f"Unable to index {info['index']['_id']}: {info['index']['error']}")
    progress_bar.update(1)
    progress_bar.set_postfix(success=success_count)


progress_bar.close()

# Calculate the success percentage
success_percentage = (success_count / total_documents) * 100
print(f"Indexing completed! Success percentage: {success_percentage}%")
print("Done indexing movies data")
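
Before moving on, it can be worth confirming that the documents are actually searchable. The following is a minimal sketch, not part of the original notebook: `count_indexed` is a hypothetical helper name, and it assumes the `client` and `INDEX_NAME` objects defined earlier. It refreshes the index so recently indexed documents become visible, then asks Elasticsearch for the document count.

```python
def count_indexed(client, index_name):
    """Refresh the index and return the number of searchable documents.

    `client` is expected to be an `Elasticsearch` client instance;
    the helper name itself is hypothetical.
    """
    # Make recently indexed documents visible to search and count requests
    client.indices.refresh(index=index_name)
    # The count API responds with a body that contains a "count" field
    return client.count(index=index_name)["count"]


# Example usage in this notebook (requires a live deployment):
# print(count_indexed(client, INDEX_NAME), "documents are searchable")
```

If the count is lower than `total_documents`, some documents failed to index or the import is still in progress.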

## Semantic Search

Now that our index is populated, we can query it using semantic search.

### Aside: Pretty printing Elasticsearch search results

Your `search` API calls will return hard-to-read nested JSON.
We'll create a little function called `pretty_search_response` to return nice, human-readable outputs from our examples.

In [None]:
def pretty_search_response(response):
    hits = response["hits"]["hits"]
    if len(hits) == 0:
        print("Your search returned no results.")
    else:
        # enumerate supplies the 1-based position, so no manual counter is needed
        for position, hit in enumerate(hits, start=1):
            doc_id = hit["_id"]
            score = hit["_score"]
            title = hit["_source"]["title"]
            overview = hit["_source"]["overview"]
            rating = hit["_source"].get("rating", "unknown")

            pretty_output = f"\nPosition: {position} Score: {score} ID: {doc_id}\nTitle: {title}\nOverview: {overview}\nRating: {rating}"

            print(pretty_output)

### Semantic Search with the `semantic` Query

We can use the [`semantic` query](https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-semantic-query.html) to query the `semantic_text` field in our index quickly and easily.
Under the hood, an embedding is automatically generated for our query text using the `semantic_text` field's inference endpoint. Notice that toward the bottom of the search results, some movies are not really superhero movies, such as Beethoven's 3rd.

In [None]:
response = client.search(
    index=INDEX_NAME,
    query={"semantic": {"field": "overview_vector", "query": "superhero movie"}},
    size=100
)

pretty_search_response(response)


We will now increase precision by introducing a minimum score.

### Normalising and Enforcing Minimum Score

The example below adds the `min_score` parameter to cut irrelevant results. It uses the `minmax` normalizer to rescale scores to the 0-1 range, which makes it easier to choose a sensible score threshold. Feel free to experiment with different thresholds.

In [None]:
response = client.search(
    index=INDEX_NAME,
    size=100,
    retriever={
    "linear": {
      "rank_window_size": 500,
      "min_score": 0.25,
      "retrievers": [
        {
          "normalizer": "minmax",
          "retriever": {
            "standard": {
              "query": {
                "semantic": {
                  "field": "overview_vector",
                  "query": "superhero movie"
                }
              }
            }
          }
        }
      ]
    }
  }
)

pretty_search_response(response)
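
To build intuition for what the `minmax` normalizer and `min_score` do together, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not Elasticsearch's implementation: the function names and the raw scores are hypothetical, and the handling of the case where all scores are equal is a choice made for this sketch only.

```python
def minmax_normalize(scores):
    """Rescale raw scores linearly so the lowest maps to 0 and the highest to 1."""
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # Assumption of this sketch: identical scores all map to 1.0
        # (avoids a division by zero; Elasticsearch may behave differently)
        return [1.0 for _ in scores]
    return [(s - lowest) / (highest - lowest) for s in scores]


def apply_min_score(scores, min_score):
    """Keep only scores at or above the threshold."""
    return [s for s in scores if s >= min_score]


# Hypothetical raw semantic scores, highest first
raw_scores = [18.0, 14.0, 9.0, 5.0, 2.0]

normalized = minmax_normalize(raw_scores)
kept = apply_min_score(normalized, min_score=0.25)

print(normalized)  # [1.0, 0.75, 0.4375, 0.1875, 0.0]
print(kept)        # [1.0, 0.75, 0.4375]
```

Because the normalized scores are relative to the best and worst hits inside the rank window, the same `min_score` can keep different documents when `rank_window_size` changes.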

These results demonstrate how to cut off the long tail of irrelevant results. Let's build on the semantic search retriever with a couple of hybrid search examples.

## Hybrid Search Using Linear Retriever

The hybrid search request below uses the linear retriever with weights to control the relative importance of semantic and lexical search. Note that the weights add up to 1, so with both retrievers min-max normalized to 0-1, the total score also stays within the 0-1 interval. This again makes it easier to set and understand the minimum threshold.

In [None]:
response = client.search(
    index=INDEX_NAME,
    size=100,
    retriever={
    "linear": {
      "rank_window_size": 500,
      "min_score": 0.25,
      "retrievers": [
        {
          "weight": 0.6,
          "normalizer": "minmax",
          "retriever": {
            "standard": {
              "query": {
                "semantic": {
                  "field": "overview_vector",
                  "query": "superhero movie"
                }
              }
            }
          }
        },
        {
          "weight": 0.4,
          "normalizer": "minmax",
          "retriever": {
            "standard": {
              "query": {
                "multi_match": {
                  "query": "superhero movie",
                  "fields": ["overview","keywords", "title"],
                  "type": "cross_fields",
                  "minimum_should_match": "2"
                }
              }
            }
          }
        }
      ]
    }
  }
)

pretty_search_response(response)

Note that in this case the minimum score is applied to the total score, not just the semantic score.
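
The weighted combination can be sketched in plain Python as follows. This is an illustration only: the function name, document IDs, and scores are hypothetical, and treating a document that is missing from one retriever as scoring 0 there is an assumption of this sketch.

```python
def linear_combine(semantic, lexical, w_semantic=0.6, w_lexical=0.4):
    """Weighted sum of two sets of normalized (0-1) scores, keyed by document ID.

    Assumption of this sketch: a document absent from one retriever's
    results contributes a score of 0 for that retriever.
    """
    doc_ids = set(semantic) | set(lexical)
    return {
        doc_id: w_semantic * semantic.get(doc_id, 0.0)
        + w_lexical * lexical.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


# Hypothetical min-max normalized scores from each retriever
semantic_scores = {"a": 1.0, "b": 0.25, "c": 0.0}
lexical_scores = {"a": 0.5, "c": 1.0}

combined = linear_combine(semantic_scores, lexical_scores)

# min_score is applied to the combined score, not to either component
kept = {doc_id: s for doc_id, s in combined.items() if s >= 0.25}

for doc_id in sorted(kept, key=kept.get, reverse=True):
    print(doc_id, round(kept[doc_id], 3))
```

In this example, document `c` has a semantic score of 0 but survives the threshold because of its strong lexical score, while document `b` is dropped even though it matched semantically.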

## Hybrid Search Using RRF

The search below uses RRF and applies the minimum score only to the semantic score. This allows us to control semantic and lexical precision individually. To control lexical precision, we often use the `minimum_should_match` parameter, as in this example.

In [None]:
response = client.search(
    index=INDEX_NAME,
    size=100,
    retriever={
    "rrf": {
      "rank_window_size": 500,
      "retrievers": [
        {
          "linear": {
            "rank_window_size": 500,
            "min_score": 0.25,
            "retrievers": [
              {
                "normalizer": "minmax",
                "retriever": {
                  "standard": {
                    "query": {
                      "semantic": {
                        "field": "overview_vector",
                        "query": "superhero movie"
                      }
                    }
                  }
                }
              }
            ]
          }
        },
        {
          "standard": {
            "query": {
              "multi_match": {
                "query": "superhero movie",
                "fields": ["overview", "keywords","title"],
                "type": "cross_fields",
                "minimum_should_match": "2"
              }
            }
          }
        }
      ]
    }
  }
)

pretty_search_response(response)
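
Reciprocal rank fusion works on ranks rather than scores, which is why the threshold has to be applied inside the semantic child retriever. The sketch below illustrates the standard RRF formula, where each document receives `1 / (rank_constant + rank)` from every ranking it appears in. The function name and document IDs are hypothetical, and the rank constant of 60 is the commonly used default, assumed here.

```python
def rrf_scores(rankings, rank_constant=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each appearance adds 1 / (rank_constant + rank) to the document's score
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return scores


# Hypothetical rankings: "c" was removed from the semantic list by min_score,
# but it can still enter the final results through the lexical list
semantic_ranking = ["a", "b"]
lexical_ranking = ["b", "c", "a"]

fused = rrf_scores([semantic_ranking, lexical_ranking])
ordered = sorted(fused, key=fused.get, reverse=True)

print(ordered)  # ['b', 'a', 'c']
```

Documents found by both retrievers rise to the top, while a document that only one retriever returns is still included, just with a lower fused score.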

## Conclusion

We have shown how the minimum score parameter can be used to improve precision in search results, especially for semantic search. Feel free to experiment with different queries and `min_score` settings.
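
To make that experimentation easier, a small helper can build the retriever body for any query and threshold. This is a minimal sketch: `build_semantic_retriever` is a hypothetical name, and the body mirrors the linear retriever with `minmax` normalization used in this notebook.

```python
def build_semantic_retriever(query_text, min_score, field="overview_vector",
                             rank_window_size=500):
    """Return a linear retriever body with min-max normalization and a score threshold."""
    return {
        "linear": {
            "rank_window_size": rank_window_size,
            "min_score": min_score,
            "retrievers": [
                {
                    "normalizer": "minmax",
                    "retriever": {
                        "standard": {
                            "query": {
                                "semantic": {"field": field, "query": query_text}
                            }
                        }
                    },
                }
            ],
        }
    }


# Build one retriever per candidate threshold
thresholds = (0.1, 0.25, 0.5)
retrievers = {t: build_semantic_retriever("superhero movie", t) for t in thresholds}

# Example usage in this notebook (requires a live deployment):
# for t, retriever in retrievers.items():
#     response = client.search(index=INDEX_NAME, size=100, retriever=retriever)
#     print(t, len(response["hits"]["hits"]))
```

Comparing the number of hits returned for each threshold gives a quick feel for how aggressively a given `min_score` trims the tail.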