# Semantic search quick start

<a target="_blank" href="https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/search/00-quick-start.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This first notebook introduces some core concepts and building blocks of working with the official Elasticsearch Python client. You will first cover connecting to the client, creating and populating an index, adding a mapping, and running some simple queries.

As a next step, you will perform semantic search using [Sentence Transformers](https://www.sbert.net) for text embedding. Learn how to integrate traditional text-based search with semantic search, for a hybrid search system.

## Create Elastic Cloud deployment

If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial.

Once logged in to your Elastic Cloud account, go to the [Create deployment](https://cloud.elastic.co/deployments/create) page and select **Create deployment**. Leave all settings with their default values.

## Install packages and import modules

To get started, we'll need to connect to our Elastic deployment using the Python client.
Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.

First we need to install the `elasticsearch` Python client.

In [None]:
!pip install -qU elasticsearch

## Initialize the Elasticsearch client

Now we can instantiate the [Elasticsearch Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/index.html), providing the Cloud ID and API key for your deployment.

In [None]:
from elasticsearch import Elasticsearch
from getpass import getpass

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic Api Key: ")

# Create the client instance
client = Elasticsearch(
    # For local development
    # hosts=["http://localhost:9200"]
    cloud_id=ELASTIC_CLOUD_ID,
    api_key=ELASTIC_API_KEY,
)

Elastic Cloud ID: ··········
Elastic Api Key: ··········


If you're running Elasticsearch locally or self-managed, you can pass in the Elasticsearch host instead. [Read more](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#_verifying_https_with_certificate_fingerprints_python_3_10_or_later) on how to connect to Elasticsearch locally.

Confirm that the client has connected with this test.

In [None]:
print(client.info())

## Index some test data

Our client is set up and connected to our Elastic deployment.
Now we need some data to test out the basics of Elasticsearch queries.
We'll use a small index of books with the following fields:

- `title`
- `authors`
- `summary`
- `publish_date`
- `num_reviews`
- `publisher`

### Create an index

First ensure that you do not have a previously created index with the name `book_index`.

In [None]:
index_name = "book_index"
client.indices.delete(index=index_name, ignore_unavailable=True)

ObjectApiResponse({'acknowledged': True})

🔐 NOTE: at any time you can come back to this section and run the `delete` function above to remove your index and start from scratch.

Let's create an Elasticsearch index with the correct mappings for our test data.

In [None]:
# Define the mapping
mappings = {
    "properties": {
        "title": {
            "type" : "text"
        },
        "authors" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
              }
             }
        },
        "summary" : {
            "type" : "text"
        },
        "publish_date" : {
            "type" : "date"
        },
        "num_reviews" : {
            "type" : "long"
        },
        "publisher" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
              }
             }
        }
    }
}


# Create the index
client.indices.create(index=index_name, mappings=mappings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'book_index'})
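
The `authors` and `publisher` fields share the same shape: a `text` field for full-text search plus a `keyword` sub-field for exact matching, which the filtering example later in this notebook relies on through `publisher.keyword`. As a minimal sketch, a small helper could build that repeated fragment. The names `text_with_keyword` and `compact_mappings` are hypothetical and not part of the Elasticsearch client.

```python
def text_with_keyword(ignore_above=256):
    """Return a `text` field mapping with an exact-match `keyword` sub-field.

    Hypothetical helper for illustration; not part of the Elasticsearch client.
    """
    return {
        "type": "text",
        "fields": {
            "keyword": {"type": "keyword", "ignore_above": ignore_above}
        },
    }


# Same fields as the mapping defined above, with the repeated fragment factored out
compact_mappings = {
    "properties": {
        "title": {"type": "text"},
        "authors": text_with_keyword(),
        "summary": {"type": "text"},
        "publish_date": {"type": "date"},
        "num_reviews": {"type": "long"},
        "publisher": text_with_keyword(),
    }
}
```

Each call returns a fresh dictionary, so changing one field's mapping later does not affect the other.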

### Index test data

Run the following cell to download some test data, containing information about 10 popular programming books, from this [dataset](https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/data.json).

In [None]:
import json
from urllib.request import urlopen

url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/data.json"
response = urlopen(url)
books = json.loads(response.read())

There are a few different ways to index data using Python.
You can use the `index` operation to add documents one at a time, or add multiple documents at once with the `bulk` operation.

In [None]:
client.index(index=index_name, id=0, document=books[0])

ObjectApiResponse({'_index': 'book_index', '_id': '0', '_version': 2, 'result': 'updated', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 10, '_primary_term': 1})

The Elasticsearch `bulk` [API](https://www.elastic.co/guide/en/elasticsearch/reference/8.11/docs-bulk.html) can perform multiple operations in a single API call.

In [None]:
def generate_operations(documents, index_name):
    # Each document contributes two entries: an action line, then the document itself
    operations = []
    for i, document in enumerate(documents):
        operations.append({"index": {"_index": index_name, "_id": i}})
        operations.append(document)
    return operations

client.bulk(index=index_name, operations=generate_operations(books, index_name), refresh=True)
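
The `operations` list alternates between an action entry and the document it applies to, so `n` documents produce `2n` entries. The sketch below prints that shape for two made-up documents without contacting the cluster. The names `sketch_operations`, `sample_docs`, and `demo_index` are illustrative assumptions, not part of the notebook's data.

```python
def sketch_operations(documents, index_name):
    """Interleave one action entry and one source entry per document."""
    operations = []
    for i, document in enumerate(documents):
        operations.append({"index": {"_index": index_name, "_id": i}})
        operations.append(document)
    return operations


# Made-up documents, used only to show the structure
sample_docs = [{"title": "Book A"}, {"title": "Book B"}]
sample_operations = sketch_operations(sample_docs, "demo_index")

for entry in sample_operations:
    print(entry)
```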

Alternatively, you can index documents with the bulk [helper function](https://elasticsearch-py.readthedocs.io/en/v8.0.0/helpers.html), which takes an iterable of actions that each carry their own metadata and `_source`.

In [None]:
from elasticsearch import helpers

def generate_docs(documents, index_name):
    for i, document in enumerate(documents):
        yield dict(_index=index_name, _id=f"{i}", _source=document)

helpers.bulk(client, generate_docs(books, index_name))
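
Because `generate_docs` is a generator, actions are produced one at a time as they are consumed, so the full list never has to be held in memory. The sketch below illustrates that laziness with a made-up data source. `sketch_actions`, `many_docs`, and `demo_index` are hypothetical names, and no cluster is contacted.

```python
from itertools import islice


def sketch_actions(documents, index_name):
    """Yield one bulk-helper style action per document, lazily."""
    for i, document in enumerate(documents):
        yield {"_index": index_name, "_id": str(i), "_source": document}


# A generator expression stands in for a large data source; nothing is built up front
many_docs = ({"title": f"Book {n}"} for n in range(1_000_000))

# Only the first two actions are ever generated
first_two = list(islice(sketch_actions(many_docs, "demo_index"), 2))
print(first_two)
```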

## Making queries

With the new index created and populated, we can search through the documents.

In [None]:
response = client.search(
    index = index_name,
    query = {
        "match" : {
            "title" : "javascript"
        }
    }
)

In [None]:
response

ObjectApiResponse({'took': 0, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 2, 'relation': 'eq'}, 'max_score': 2.156847, 'hits': [{'_index': 'book_index', '_id': '5', '_score': 2.156847, '_source': {'title': 'Eloquent JavaScript', 'authors': ['marijn haverbeke'], 'summary': 'A modern introduction to programming', 'publish_date': '2018-12-04', 'num_reviews': 38, 'publisher': 'no starch press'}}, {'_index': 'book_index', '_id': '8', '_score': 1.8162923, '_source': {'title': 'JavaScript: The Good Parts', 'authors': ['douglas crockford'], 'summary': 'A deep dive into the parts of JavaScript that are essential to writing maintainable code', 'publish_date': '2008-05-15', 'num_reviews': 51, 'publisher': 'oreilly'}}]}})
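
The matching documents sit under `hits.hits`, and the total count under `hits.total.value`. The sketch below reads both from a plain dictionary that is a trimmed copy of the output above (only the fields used here are kept). The name `sample_response` is an assumption for illustration; the live `response` object can be indexed with the same keys.

```python
# Trimmed copy of the response printed above
sample_response = {
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "max_score": 2.156847,
        "hits": [
            {"_id": "5", "_score": 2.156847,
             "_source": {"title": "Eloquent JavaScript"}},
            {"_id": "8", "_score": 1.8162923,
             "_source": {"title": "JavaScript: The Good Parts"}},
        ],
    }
}

# Total number of matches and the title of each hit, in score order
total = sample_response["hits"]["total"]["value"]
titles = [hit["_source"]["title"] for hit in sample_response["hits"]["hits"]]

print(total)
print(titles)
```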

## Aside: Pretty printing Elasticsearch responses

Your API calls will return hard-to-read nested JSON.
We'll create a little function called `pretty_response` to return nice, human-readable outputs from our examples.

In [None]:
def pretty_response(response):
    if len(response['hits']['hits']) == 0:
        print('Your search returned no results.')
    else:
        for hit in response['hits']['hits']:
            id = hit['_id']
            publication_date = hit['_source']['publish_date']
            score = hit['_score']
            title = hit['_source']['title']
            summary = hit['_source']['summary']
            publisher = hit["_source"]["publisher"]
            num_reviews = hit["_source"]["num_reviews"]
            authors = hit["_source"]["authors"]
            pretty_output = (f"\nID: {id}\nPublication date: {publication_date}\nTitle: {title}\nSummary: {summary}\nPublisher: {publisher}\nReviews: {num_reviews}\nAuthors: {authors}\nScore: {score}")
            print(pretty_output)

In [None]:
pretty_response(response)


ID: 5
Publication date: 2018-12-04
Title: Eloquent JavaScript
Summary: A modern introduction to programming
Publisher: no starch press
Reviews: 38
Authors: ['marijn haverbeke']
Score: 2.156847

ID: 8
Publication date: 2008-05-15
Title: JavaScript: The Good Parts
Summary: A deep dive into the parts of JavaScript that are essential to writing maintainable code
Publisher: oreilly
Reviews: 51
Authors: ['douglas crockford']
Score: 1.8162923


# Going deeper: Semantic Search with Embedding Model

We can go beyond text search with Elastic's semantic search capabilities. By transforming our data into vectors we are able to run more complex queries.

For this example, we're using `all-MiniLM-L6-v2`, a model available through the `sentence_transformers` library. You can read more about this model on [Hugging Face](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

In [None]:
!pip install -qU sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

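We will create a new index that supports semantic search. The `dims` value of 384 in the mapping below matches the size of the embeddings produced by `all-MiniLM-L6-v2`.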

In [None]:
index_name = "book_vectors_index"
client.indices.delete(index=index_name, ignore_unavailable=True)

# Define the mapping
mappings = {
    "properties": {
        "title_vector": {
            "type": "dense_vector",
            "dims": 384,
            "index": "true",
            "similarity": "cosine"
        }
    }
}

# Create the index
client.indices.create(index=index_name, mappings=mappings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'book_vectors_index'})
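
The mapping sets `"similarity": "cosine"`. According to the Elasticsearch kNN search documentation, the `_score` for this similarity is computed as `(1 + cosine(query, vector)) / 2`, which keeps scores between 0 and 1. The sketch below reproduces that formula in plain Python on made-up two-dimensional vectors. The function names `cosine` and `knn_score` are illustrative, not Elasticsearch APIs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_score(a, b):
    """Map cosine similarity from [-1, 1] to a score in [0, 1]."""
    return (1 + cosine(a, b)) / 2


print(knn_score([1.0, 0.0], [1.0, 0.0]))   # same direction -> 1.0
print(knn_score([1.0, 0.0], [0.0, 1.0]))   # orthogonal -> 0.5
print(knn_score([1.0, 0.0], [-1.0, 0.0]))  # opposite direction -> 0.0
```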

We can apply the transformer to our dataset from earlier, adapting the logic of the `generate_operations` function so that each book also stores an embedding of its title.

In [None]:
operations = []
for i, book in enumerate(books):
    operations.append({"index": {"_index": index_name, "_id": i}})
    # Transforming the title into an embedding using the model
    book["title_vector"] = model.encode(book["title"]).tolist()
    operations.append(book)
client.bulk(index=index_name, operations=operations, refresh=True)

## Making queries

Now that we have indexed the books, we want to perform a semantic search for books that are similar to a given query.
We embed the query and perform a search. Notice how we get results that the previous `match` query did not return.

In [None]:
response = client.search(
    index=index_name,
    knn={
      "field": "title_vector",
      "query_vector": model.encode("javascript books"),
      "k": 10,
      "num_candidates": 100
    }
)

pretty_response(response)


ID: 8
Publication date: 2008-05-15
Title: JavaScript: The Good Parts
Summary: A deep dive into the parts of JavaScript that are essential to writing maintainable code
Publisher: oreilly
Reviews: 51
Authors: ['douglas crockford']
Score: 0.8042828

ID: 4
Publication date: 2015-03-27
Title: You Don't Know JS: Up & Going
Summary: Introduction to JavaScript and programming as a whole
Publisher: oreilly
Reviews: 36
Authors: ['kyle simpson']
Score: 0.6989136

ID: 5
Publication date: 2018-12-04
Title: Eloquent JavaScript
Summary: A modern introduction to programming
Publisher: no starch press
Reviews: 38
Authors: ['marijn haverbeke']
Score: 0.6796988

ID: 0
Publication date: 2019-10-29
Title: The Pragmatic Programmer: Your Journey to Mastery
Summary: A guide to pragmatic programming for software engineers and developers
Publisher: addison-wesley
Reviews: 30
Authors: ['andrew hunt', 'david thomas']
Score: 0.6206549

ID: 9
Publication date: 2012-06-27
Title: Introduction to the Theory of Comput

## Filtering

Filter context is mostly used for filtering structured data. For example, use filter context to answer questions like:

- _Does this timestamp fall into the range 2015 to 2016?_
- _Is the status field set to "published"?_

Filter context is in effect whenever a query clause is passed to a filter parameter, such as the `filter` or `must_not` parameters in a `bool` query.

[Learn more](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html#filter-context) about filter context in the Elasticsearch docs.
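
As a minimal sketch of filter context, the `bool` query below scores documents with a `match` clause under `must` and narrows them with `range` and `term` clauses under `filter`. The field names come from the `book_index` mapping defined earlier; the date range and publisher value are illustrative choices, and `bool_query` is a hypothetical name.

```python
# `must` clauses contribute to the relevance score;
# `filter` clauses only include or exclude documents.
bool_query = {
    "bool": {
        "must": [
            {"match": {"summary": "javascript"}}
        ],
        "filter": [
            # Does the publish date fall into the range 2015 to 2016?
            {"range": {"publish_date": {"gte": "2015-01-01", "lte": "2016-12-31"}}},
            # Is the publisher exactly "oreilly"?
            {"term": {"publisher.keyword": "oreilly"}},
        ],
    }
}

# With a connected client, this could be run against the first index:
# response = client.search(index="book_index", query=bool_query)
```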

### Example: Keyword Filtering

This is an example of adding a keyword filter to the query.

The example retrieves the top books whose title vectors are most similar to "javascript books" and whose publisher is Addison-Wesley.

In [None]:
response = client.search(
    index=index_name,
    knn={
      "field": "title_vector",
      "query_vector": model.encode("javascript books"),
      "k": 10,
      "num_candidates": 100,
      "filter": {
          "term": {
              "publisher.keyword": "addison-wesley"
          }
      }
    }
)

pretty_response(response)


ID: 0
Publication date: 2019-10-29
Title: The Pragmatic Programmer: Your Journey to Mastery
Summary: A guide to pragmatic programming for software engineers and developers
Publisher: addison-wesley
Reviews: 30
Authors: ['andrew hunt', 'david thomas']
Score: 0.6206549

ID: 6
Publication date: 1994-10-31
Title: Design Patterns: Elements of Reusable Object-Oriented Software
Summary: Guide to design patterns that can be used in any object-oriented language
Publisher: addison-wesley
Reviews: 45
Authors: ['erich gamma', 'richard helm', 'ralph johnson', 'john vlissides']
Score: 0.5649922
