# Semantic search quick start

<a target="_blank" href="https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/search/00-quick-start.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This interactive notebook will introduce you to some basic operations with Elasticsearch, using the official [Elasticsearch Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html).
You'll perform semantic search using [Sentence Transformers](https://www.sbert.net) for text embedding. Learn how to integrate traditional text-based search with semantic search, for a hybrid search system.

## Create Elastic Cloud deployment

If you don't have an Elastic Cloud deployment, sign up [here](https://cloud.elastic.co/registration?utm_source=github&utm_content=elasticsearch-labs-notebook) for a free trial.

Once logged in to your Elastic Cloud account, go to the [Create deployment](https://cloud.elastic.co/deployments/create) page and select **Create deployment**. Leave all settings with their default values.

## Install packages and import modules

To get started, we'll need to connect to our Elastic deployment using the Python client.
Because we're using an Elastic Cloud deployment, we'll use the **Cloud ID** to identify our deployment.

First we need to install the `elasticsearch` Python client, together with `sentence-transformers` (for the embedding model) and `ipywidgets`.

In [4]:
!pip install -qU elasticsearch sentence-transformers ipywidgets

## Set up the embedding model

For this example, we're using `all-MiniLM-L6-v2`, loaded through the `sentence_transformers` library. You can read more about this model on [Hugging Face](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

In [5]:
import tqdm as notebook_tqdm

In [6]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
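
To make the idea of an embedding comparison concrete, here is a minimal sketch of cosine similarity, the measure this notebook later configures on the vector field. The three-dimensional vectors are made-up stand-ins for the 384-dimensional vectors that `all-MiniLM-L6-v2` produces; they are not real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, assumed for illustration only
query = [1.0, 0.0, 1.0]
close = [2.0, 0.0, 2.0]  # same direction as the query, different length
far = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_similarity(query, close))
print(cosine_similarity(query, far))
```

Vectors that point in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.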

## Initialize the Elasticsearch client

Now we can instantiate the [Elasticsearch Python client](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/index.html), providing the Cloud ID and password for your deployment.

In [15]:
from elasticsearch import Elasticsearch
from getpass import getpass

# For Elastic Cloud, uncomment the lines below and supply your credentials
# CLOUD_ID = getpass("Elastic Cloud ID")
# CLOUD_PASSWORD = getpass("Elastic Password")

# client = Elasticsearch(
#     cloud_id=CLOUD_ID,
#     basic_auth=("elastic", CLOUD_PASSWORD)
# )

# For a local or self-managed cluster, pass the host URL instead
client = Elasticsearch("http://localhost:9200")
 

If you're running Elasticsearch locally or self-managed, you can pass in the Elasticsearch host instead. [Read more](https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#_verifying_https_with_certificate_fingerprints_python_3_10_or_later) on how to connect to Elasticsearch locally.
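
One way to switch between the two connection styles without editing the cell is to read the settings from environment variables. The sketch below only builds the keyword arguments; the variable names `ES_CLOUD_ID`, `ES_PASSWORD`, and `ES_URL` are hypothetical and not part of the client.

```python
import os


def connection_kwargs(env=None):
    """Return keyword arguments for Elasticsearch(...).

    ES_CLOUD_ID, ES_PASSWORD and ES_URL are hypothetical variable names
    chosen for this sketch.
    """
    env = os.environ if env is None else env
    cloud_id = env.get("ES_CLOUD_ID")
    if cloud_id:
        # Elastic Cloud: identify the deployment by its Cloud ID
        return {
            "cloud_id": cloud_id,
            "basic_auth": ("elastic", env.get("ES_PASSWORD", "")),
        }
    # Local or self-managed: fall back to a host URL
    return {"hosts": env.get("ES_URL", "http://localhost:9200")}


print(connection_kwargs({}))
```

With the client imported, this could be used as `client = Elasticsearch(**connection_kwargs())`.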

Confirm that the client has connected with this test.

In [21]:
print(client.info())

{'name': '6a4e9d6e6c2c', 'cluster_name': 'docker-cluster', 'cluster_uuid': '48MIUprFR1qNyRo5a5-GJw', 'version': {'number': '8.10.2', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '6d20dd8ce62365be9b1aca96427de4622e970e9e', 'build_date': '2023-09-19T08:16:24.564900370Z', 'build_snapshot': False, 'lucene_version': '9.7.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'}


## Index some test data

Our client is set up and connected to our Elastic deployment.
Now we need some data to test out the basics of Elasticsearch queries.
We'll use a small index of books with the following fields:

- `title`
- `authors`
- `publish_date`
- `num_reviews`
- `publisher`

### Create an index

First ensure that you do not have a previously created index with the name `book_index`.

In [22]:
client.indices.delete(index="book_index", ignore_unavailable=True)

ObjectApiResponse({'acknowledged': True})

🔐 NOTE: at any time you can come back to this section and run the `delete` function above to remove your index and start from scratch.

Let's create an Elasticsearch index with the correct mappings for our test data. 

In [23]:
# Define the mapping
mappings = {
    "properties": {
        "title_vector": {
            "type": "dense_vector",
            "dims": 384,  # must match the embedding size of the model
            "index": True,
            "similarity": "cosine"
        }
    }
}

# Create the index
client.indices.create(index='book_index', mappings=mappings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'book_index'})
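
The `dims` value has to match the length of the vectors the model produces. As a small sketch, the helper below (a hypothetical name, not part of the client) builds the same mapping from a dimension argument, so the number is stated in one place.

```python
def dense_vector_mapping(field, dims, similarity="cosine"):
    """Build an index mapping with a single indexed dense_vector field."""
    if dims <= 0:
        raise ValueError("dims must be a positive integer")
    return {
        "properties": {
            field: {
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "similarity": similarity,
            }
        }
    }


# 384 is the dimension used for title_vector in this notebook
mappings = dense_vector_mapping("title_vector", 384)
print(mappings)
```

With the model loaded, the dimension could be read from `model.get_sentence_embedding_dimension()` instead of being hard-coded.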

### Index test data

Run the following command to upload some test data, containing information about 10 popular programming books from this [dataset](https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/data.json).
`model.encode` will encode the text into a vector on the fly, using the model we initialized earlier.

In [38]:
import json
import requests

# To download the dataset instead of reading a local copy:
# url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/notebooks/search/data.json"
# response = requests.get(url)
# response.raise_for_status()
# books = response.json()

# Read the dataset from a local copy of data.json
with open('data.json') as f:
    books = json.load(f)

# print(json.dumps(books, indent=2))

operations = []
for book in books:
    operations.append({"index": {"_index": "book_index"}})
    # Transforming the title into an embedding using the model
    book["title_vector"] = model.encode(book["title"]).tolist()
    operations.append(book)

print(json.dumps(operations, indent=2))

client.bulk(index="book_index", operations=operations, refresh=True)

[
  {
    "index": {
      "_index": "book_index"
    }
  },
  {
    "title": "The Pragmatic Programmer: Your Journey to Mastery",
    "authors": [
      "andrew hunt",
      "david thomas"
    ],
    "summary": "A guide to pragmatic programming for software engineers and developers",
    "publish_date": "2019-10-29",
    "num_reviews": 30,
    "publisher": "addison-wesley",
    "title_vector": [
      0.021622776985168457,
      0.063673235476017,
      -0.09705562144517899,
      -0.061669863760471344,
      -0.001204114523716271,
      -0.034123435616493225,
      0.13590006530284882,
      0.03375348448753357,
      -0.042110804468393326,
      -0.024675676599144936,
      -0.009651674889028072,
      -0.003184354165568948,
      0.08659922331571579,
      -0.060329899191856384,
      0.06627333909273148,
      0.06954074651002884,
      -0.11942077428102493,
      -0.04336177557706833,
      0.022371141240000725,
      -0.07261000573635101,
      -0.039486344903707504,
      -0.03

ObjectApiResponse({'errors': False, 'took': 33, 'items': [{'index': {'_index': 'book_index', '_id': '96ssy4sB2f-yFDbxj7gK', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 60, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'book_index', '_id': '-Kssy4sB2f-yFDbxj7gK', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 61, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'book_index', '_id': '-assy4sB2f-yFDbxj7gK', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 62, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'book_index', '_id': '-qssy4sB2f-yFDbxj7gK', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 63, '_primary_term': 1, 'status': 201}}, {'index': 
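
The bulk API reports success or failure per item, so a top-level `'errors': False` is worth checking before moving on. The sketch below collects the failed items from a bulk response; the `sample` response is made up for illustration, and `failed_bulk_items` is a hypothetical helper name.

```python
def failed_bulk_items(response):
    """Return (action, status, error) tuples for bulk items that failed."""
    failures = []
    for item in response["items"]:
        # Each item is a one-key dict such as {"index": {...}}
        for action, result in item.items():
            if "error" in result:
                failures.append((action, result.get("status"), result["error"]))
    return failures


# Made-up response shaped like the bulk output above, with one failing item
sample = {
    "errors": True,
    "items": [
        {"index": {"_index": "book_index", "status": 201, "result": "created"}},
        {"index": {"_index": "book_index", "status": 400,
                   "error": {"type": "example_error", "reason": "illustrative failure"}}},
    ],
}

for action, status, error in failed_bulk_items(sample):
    print(action, status, error["reason"])
```

In the notebook, the same function could be applied to the value returned by `client.bulk(...)`.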

## Aside: Pretty printing Elasticsearch responses

Your API calls will return hard-to-read nested JSON.
We'll create a little function called `pretty_response` to return nice, human-readable outputs from our examples.

In [27]:
def pretty_response(response):
    if len(response['hits']['hits']) == 0:
        print('Your search returned no results.')
    else:
        for hit in response['hits']['hits']:
            id = hit['_id']
            publication_date = hit['_source']['publish_date']
            score = hit['_score']
            title = hit['_source']['title']
            summary = hit['_source']['summary']
            publisher = hit["_source"]["publisher"]
            num_reviews = hit["_source"]["num_reviews"]
            authors = hit["_source"]["authors"]
            pretty_output = (f"\nID: {id}\nPublication date: {publication_date}\nTitle: {title}\nSummary: {summary}\nPublisher: {publisher}\nReviews: {num_reviews}\nAuthors: {authors}\nScore: {score}")
            print(pretty_output)

## Making queries

Now that we have indexed the books, we want to perform a semantic search for books that are similar to a given query.
We embed the query with the same model and run a kNN search against the `title_vector` field.

In [28]:
response = client.search(
    index="book_index",
    knn={
      "field": "title_vector",
      "query_vector": model.encode("javascript books"),
      "k": 10,
      "num_candidates": 100
    }
)

pretty_response(response)


ID: w6uGyosB2f-yFDbxT7ip
Publication date: 2008-05-15
Title: JavaScript: The Good Parts
Summary: A deep dive into the parts of JavaScript that are essential to writing maintainable code
Publisher: oreilly
Reviews: 51
Authors: ['douglas crockford']
Score: 0.8042828

ID: v6uGyosB2f-yFDbxT7io
Publication date: 2015-03-27
Title: You Don't Know JS: Up & Going
Summary: Introduction to JavaScript and programming as a whole
Publisher: oreilly
Reviews: 36
Authors: ['kyle simpson']
Score: 0.6989136

ID: wKuGyosB2f-yFDbxT7io
Publication date: 2018-12-04
Title: Eloquent JavaScript
Summary: A modern introduction to programming
Publisher: no starch press
Reviews: 38
Authors: ['marijn haverbeke']
Score: 0.6796988

ID: u6uGyosB2f-yFDbxT7io
Publication date: 2019-10-29
Title: The Pragmatic Programmer: Your Journey to Mastery
Summary: A guide to pragmatic programming for software engineers and developers
Publisher: addison-wesley
Reviews: 30
Authors: ['andrew hunt', 'david thomas']
Score: 0.620655

ID:
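
The scores above are not raw cosine values. For a `dense_vector` field with `cosine` similarity, Elasticsearch documents the kNN `_score` as `(1 + cosine) / 2`, which keeps scores non-negative. Assuming no boost or additional query clause contributes to the score, the underlying cosine can be recovered as in this sketch:

```python
def cosine_from_score(score):
    """Invert _score = (1 + cosine) / 2 for a cosine-similarity kNN hit."""
    return 2 * score - 1


def score_from_cosine(cosine):
    """Map a cosine similarity in [-1, 1] to a kNN _score in [0, 1]."""
    return (1 + cosine) / 2


# Score of the top hit in the output above
print(cosine_from_score(0.8042828))
```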

## Filtering

Filter context is mostly used for filtering structured data. For example, use filter context to answer questions like:

- _Does this timestamp fall into the range 2015 to 2016?_
- _Is the status field set to "published"?_

Filter context is in effect whenever a query clause is passed to a filter parameter, such as the `filter` or `must_not` parameters in a `bool` query.

[Learn more](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html#filter-context) about filter context in the Elasticsearch docs.
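
As a rough illustration of filter context, the sketch below builds a `bool` query in which the `match` clause is scored while the `range` and `term` clauses only include or exclude documents. The field names come from the book index used in this notebook; the helper name and the specific values are assumptions for the example.

```python
import json


def filtered_title_query(text, start_date, end_date, excluded_publisher):
    """Build a bool query: scored match plus unscored filter clauses."""
    return {
        "bool": {
            # Query context: contributes to the relevance score
            "must": [{"match": {"title": text}}],
            # Filter context: yes/no answers, no effect on the score
            "filter": [
                {"range": {"publish_date": {"gte": start_date, "lte": end_date}}}
            ],
            "must_not": [{"term": {"publisher.keyword": excluded_publisher}}],
        }
    }


query = filtered_title_query("javascript", "2015-01-01", "2016-12-31", "addison-wesley")
print(json.dumps(query, indent=2))
```

The resulting dictionary could be passed to the client as `client.search(index="book_index", query=query)`.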

### Example: Keyword Filtering

This is an example of adding a keyword filter to the query.

The example retrieves the top books whose title vectors are similar to "javascript books" and whose publisher is Addison-Wesley.

In [29]:
response = client.search(
    index="book_index",
    knn={
      "field": "title_vector",
      "query_vector": model.encode("javascript books"),
      "k": 10,
      "num_candidates": 100,
      "filter": {
          "term": {
              "publisher.keyword": "addison-wesley"
          }
      }
    }
)

pretty_response(response)


ID: u6uGyosB2f-yFDbxT7io
Publication date: 2019-10-29
Title: The Pragmatic Programmer: Your Journey to Mastery
Summary: A guide to pragmatic programming for software engineers and developers
Publisher: addison-wesley
Reviews: 30
Authors: ['andrew hunt', 'david thomas']
Score: 0.620655

ID: wauGyosB2f-yFDbxT7io
Publication date: 1994-10-31
Title: Design Patterns: Elements of Reusable Object-Oriented Software
Summary: Guide to design patterns that can be used in any object-oriented language
Publisher: addison-wesley
Reviews: 45
Authors: ['erich gamma', 'richard helm', 'ralph johnson', 'john vlissides']
Score: 0.5649922
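
Because the `knn` argument is a plain dictionary, the filter can be attached conditionally. The following sketch wraps that in a hypothetical helper, `knn_clause`, and checks that `num_candidates` is at least `k`, which Elasticsearch requires. The three-element vector is a placeholder; in the notebook it would be `model.encode("javascript books").tolist()`.

```python
def knn_clause(field, query_vector, k=10, num_candidates=100, filter_clause=None):
    """Build the knn section of a search request, with an optional filter."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    clause = {
        "field": field,
        "query_vector": list(query_vector),
        "k": k,
        "num_candidates": num_candidates,
    }
    if filter_clause is not None:
        # The filter restricts which documents the kNN search may return
        clause["filter"] = filter_clause
    return clause


# Placeholder vector, assumed for illustration only
toy_vector = [0.1, 0.2, 0.3]
publisher_filter = {"term": {"publisher.keyword": "addison-wesley"}}

print(knn_clause("title_vector", toy_vector, filter_clause=publisher_filter))
```

The result could be passed as `client.search(index="book_index", knn=...)`, matching the call shown above.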
