# Similarity search with ApproxRetrievalStrategy

This notebook demonstrates how to perform a similarity search using [ElasticsearchStore](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html) and [ApproxRetrievalStrategy](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.ApproxRetrievalStrategy). We will first download a sample dataset, split the documents into chunks using LangChain, and then index them into Elasticsearch through [ElasticsearchStore.from_documents](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents).

The [ApproxRetrievalStrategy](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.ApproxRetrievalStrategy) uses the HNSW algorithm to find approximate [nearest neighbors](https://en.wikipedia.org/wiki/Nearest_neighbor_search), which is typically much faster than an exact search over every vector, at the cost of some accuracy. During indexing, dense vector fields are generated and stored in the index. To embed the query, we can provide either an embedding function or a `query_model_id`. In this example we will provide a `query_model_id` at query time.
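
To make the role of `query_model_id` more concrete, here is a minimal sketch of the kind of kNN search body Elasticsearch accepts when the query text is embedded server-side by a deployed model. It illustrates the request shape only and is not necessarily the exact body ElasticsearchStore sends. The helper name `build_knn_query` and the default values for `k` and `num_candidates` are assumptions made for this example.

```python
import json


def build_knn_query(query_text, model_id, vector_field, k=4, num_candidates=50):
    """Build an approximate kNN search body that embeds the query server-side.

    k and num_candidates defaults are illustrative assumptions.
    """
    return {
        "knn": {
            "field": vector_field,             # dense_vector field to search
            "k": k,                            # number of neighbors to return
            "num_candidates": num_candidates,  # candidates examined per shard
            "query_vector_builder": {
                "text_embedding": {
                    "model_id": model_id,      # deployed text embedding model
                    "model_text": query_text,  # raw query text to embed
                }
            },
        }
    }


body = build_knn_query(
    "How does the compensation work?",
    "sentence-transformers__all-minilm-l6-v2",
    "text_embedding.predicted_value",
)
print(json.dumps(body, indent=2))
```

A larger `num_candidates` generally improves recall at the cost of speed, which is the trade-off that makes the search approximate.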




## Install packages and import modules

In [90]:
!python3 -m pip install -qU langchain elasticsearch tiktoken sentence-transformers eland transformers

from getpass import getpass
from langchain.vectorstores import ElasticsearchStore
from langchain.embeddings.openai import OpenAIEmbeddings
from urllib.request import urlopen
from langchain.text_splitter import CharacterTextSplitter
import json

In [94]:
API_KEY = getpass("Elastic deployment API Key")
CLOUD_ID = getpass("Elastic deployment Cloud ID")
!eland_import_hub_model --cloud-id $CLOUD_ID --hub-model-id sentence-transformers/all-MiniLM-L6-v2 --task-type text_embedding --es-api-key $API_KEY --start

2023-08-22 17:32:51,016 INFO : Establishing connection to Elasticsearch
2023-08-22 17:32:51,827 INFO : Connected to cluster named 'f5045883120d4bc483f8095d5bf21210' (version: 8.9.0)
2023-08-22 17:32:51,830 INFO : Loading HuggingFace transformer tokenizer and model 'sentence-transformers/all-MiniLM-L6-v2'
2023-08-22 17:32:54,943 ERROR : Trained model with id 'sentence-transformers__all-minilm-l6-v2' already exists
2023-08-22 17:32:54,943 INFO : Run the script with the '--clear-previous' flag if you want to overwrite the existing model.
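
Note that the log refers to the model as `sentence-transformers__all-minilm-l6-v2`, not by its Hugging Face Hub ID. Judging only from this output, the Elasticsearch model ID looks like the Hub ID lowercased with `/` replaced by `__`. The sketch below reproduces that observed mapping. The function name is hypothetical, and Eland's actual rule may cover more cases than this.

```python
def hub_id_to_es_model_id(hub_model_id: str) -> str:
    """Mimic the mapping seen in the log above.

    Assumption: lowercase the Hub ID and replace "/" with "__".
    """
    return hub_model_id.replace("/", "__").lower()


es_model_id = hub_id_to_es_model_id("sentence-transformers/all-MiniLM-L6-v2")
print(es_model_id)
```

This is the ID that is later passed as `query_model_id` and as the `model_id` of the ingest pipeline.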


## Connect to Elasticsearch

ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, you can sign up for a free trial.

Because we are using an Elastic Cloud deployment, we'll use the Cloud ID to identify it. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment.

We will use ElasticsearchStore to connect to our Elastic Cloud deployment, which makes it easy to create an index and index data.




In [116]:
CLOUD_ID = getpass("Elastic deployment Cloud ID")
CLOUD_USERNAME = "elastic"
CLOUD_PASSWORD = getpass("Elastic deployment Password")

# We explicitly set ApproxRetrievalStrategy when instantiating ElasticsearchStore
vector_store = ElasticsearchStore(
    es_cloud_id=CLOUD_ID,
    es_user=CLOUD_USERNAME,
    es_password=CLOUD_PASSWORD,
    index_name="approx-search-demo",
    query_field="content",
    vector_query_field="text_embedding.predicted_value",
    strategy=ElasticsearchStore.ApproxRetrievalStrategy(
        query_model_id="sentence-transformers__all-minilm-l6-v2"
    ),
)


## Download the sample dataset

Let's download the sample dataset and deserialize the document to make document chunking easier.

In [110]:
url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/example-apps/workplace-search/example-data/"  
# url = "https://raw.githubusercontent.com/saarikabhasi/elasticsearch-labs/main/notebooks/integrations/hugging-face/blogs.json"
response = urlopen(url)

workplace_docs = json.loads(response.read())

## Create ingestion pipeline

In [117]:
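# Ingest pipeline definition: run the embedding model on each document's
# "content" field and store the result under "text_embedding"
PIPELINE_ID = "vectorize_workplace"

vector_store.client.ingest.put_pipeline(
    id=PIPELINE_ID,
    processors=[
        {
            "inference": {
                "model_id": "sentence-transformers__all-minilm-l6-v2",
                "target_field": "text_embedding",
                # the model reads its input from "text_field"
                "field_map": {"content": "text_field"},
            }
        }
    ],
)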

ObjectApiResponse({'acknowledged': True})
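
To illustrate what this pipeline does to each incoming document, here is a small pure-Python simulation. It is only a sketch: `fake_embed` is a hypothetical stand-in for the deployed model, and raising on a missing `content` field is an assumption about how the inference processor treats documents that lack its input field.

```python
def fake_embed(text, dims=384):
    """Hypothetical stand-in for the deployed model: returns a dummy vector."""
    return [float(len(text) % 7)] * dims


def simulate_inference_processor(doc, model_id,
                                 source_field="content",
                                 target_field="text_embedding"):
    """Mimic the field_map {"content": "text_field"} plus target_field."""
    if source_field not in doc:
        # Assumption: a document without the mapped field fails in the pipeline
        raise KeyError(f"input field '{source_field}' missing from document")
    enriched = dict(doc)  # leave the original document untouched
    enriched[target_field] = {
        "predicted_value": fake_embed(doc[source_field]),
        "model_id": model_id,
    }
    return enriched


enriched = simulate_inference_processor(
    {"content": "Our vacation policy ...", "name": "Vacation policy"},
    "sentence-transformers__all-minilm-l6-v2",
)
print(len(enriched["text_embedding"]["predicted_value"]))
```

The practical consequence is that whatever writes documents to the index needs to put the passage text in `content`, since that is the only field the pipeline reads.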

## Split Documents into Passages

Next, we will chunk these documents into passages of about 800 characters with no overlap using a text splitter, [CharacterTextSplitter](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.CharacterTextSplitter.html), and then create documents from these texts. Before chunking, we create the index with an explicit mapping and set the ingest pipeline as its default pipeline.

In [118]:
# define index name
INDEX_NAME="approx-search-demo"

# flag to check if index has to be deleted before creating
SHOULD_DELETE_INDEX=True

# define index mapping
INDEX_MAPPING = {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "summary":{
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "name":{
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "text_embedding": {
        "properties": {
          "is_truncated": {
            "type": "boolean"
          },
          "model_id": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "predicted_value": {
            "type": "dense_vector",
            "dims": 384,
            "index": True,
            "similarity": "l2_norm"
          }
        }
      }
    }
  }

INDEX_SETTINGS = {
    "index": {
      "number_of_replicas": "1",
      "number_of_shards": "1",
      "default_pipeline": PIPELINE_ID
    }
}

# Delete the index first if requested, so it is recreated with a fresh mapping
if SHOULD_DELETE_INDEX:
    if vector_store.client.indices.exists(index=INDEX_NAME):
        print("Deleting existing %s" % INDEX_NAME)
        # options(ignore_status=...) replaces the deprecated per-request `ignore`
        vector_store.client.options(ignore_status=[400, 404]).indices.delete(index=INDEX_NAME)

print("Creating index %s" % INDEX_NAME)
vector_store.client.options(ignore_status=[400, 404]).indices.create(
    index=INDEX_NAME, mappings=INDEX_MAPPING, settings=INDEX_SETTINGS
)


Deleting existing approx-search-demo
Creating index approx-search-demo

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'approx-search-demo'})
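
The mapping above sets `"similarity": "l2_norm"` on the `dense_vector` field. With that setting, Elasticsearch documents the relevance score as `1 / (1 + l2_norm(query, vector)^2)`, so closer vectors score nearer to 1. The following sketch computes that score for small example vectors. The function name and the sample vectors are made up for illustration.

```python
def l2_norm_score(query, vector):
    """Score two equal-length vectors as 1 / (1 + squared L2 distance)."""
    distance_sq = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1.0 / (1.0 + distance_sq)


same = l2_norm_score([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])  # identical vectors
far = l2_norm_score([0.0, 0.0], [3.0, 4.0])             # distance 5
print(same, far)
```

In the real index the vectors have 384 dimensions, matching `"dims": 384` in the mapping.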

In [121]:
metadata = []
content = []

for doc in workplace_docs:
  content.append(doc["content"])
  metadata.append({
      "name": doc["name"],
      "summary": doc["summary"],
  })

text_splitter = CharacterTextSplitter(chunk_size=800, chunk_overlap=0)
docs = text_splitter.create_documents(content, metadatas=metadata)
# actions = []
# for doc in workplace_docs:
#     actions.append({"index": {"_index": "approx-search-demo"}})
#     actions.append(doc)
# vector_store.client.bulk(index="approx-search-demo", operations=actions)

Created a chunk of size 866, which is longer than the specified 800
Created a chunk of size 1120, which is longer than the specified 800
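
The two messages above appear because the splitter cuts only at its separator. A piece that is already longer than `chunk_size` cannot be shortened and is kept whole. The sketch below is a simplified illustration of that behavior, not LangChain's implementation. It assumes a `"\n\n"` separator and measures length in characters, and the function name is hypothetical.

```python
def split_on_separator(text, chunk_size=800, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of <= chunk_size.

    A single piece longer than chunk_size is kept whole (simplified sketch).
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate         # piece still fits in the current chunk
        else:
            if current:
                chunks.append(current)  # close the current chunk
            current = piece             # start a new chunk, even if oversized
    if current:
        chunks.append(current)
    return chunks


sample = "a" * 500 + "\n\n" + "b" * 400 + "\n\n" + "c" * 900
chunk_lengths = [len(c) for c in split_on_separator(sample)]
print(chunk_lengths)
```

Here the last paragraph has 900 characters and no separator inside it, so it becomes a single oversized chunk, much like the 866 and 1120 character chunks reported above.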


## Index data into Elasticsearch

Now that we have our documents ready, we will index the data into Elasticsearch using [ElasticsearchStore.from_documents](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents). We will use the Cloud ID and password values set in the Connect to Elasticsearch step.

In this example, we will explicitly set `strategy=ElasticsearchStore.ApproxRetrievalStrategy()`, although ElasticsearchStore uses ApproxRetrievalStrategy by default.

Note: We are providing `query_model_id` in this example, so we are not providing an embedding function.

In [122]:
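# Reuse the field names configured on vector_store so each chunk is written to
# "content", the field the ingest pipeline reads
documents = ElasticsearchStore.from_documents(
    docs, es_cloud_id=CLOUD_ID, es_user=CLOUD_USERNAME, es_password=CLOUD_PASSWORD,
    index_name="approx-search-demo",
    query_field="content", vector_query_field="text_embedding.predicted_value",
    strategy=ElasticsearchStore.ApproxRetrievalStrategy(query_model_id="sentence-transformers__all-minilm-l6-v2")
)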

Error adding texts: 59 document(s) failed to index.


BulkIndexError: 59 document(s) failed to index.
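
A `BulkIndexError` like the one above says only how many documents failed. The per-document reasons are in its `errors` attribute, a list of dictionaries keyed by operation type. Here is a hedged sketch of a helper that groups those entries by reason. The helper name is hypothetical, and the sample entries are invented placeholders that only imitate the shape of real error items.

```python
from collections import Counter


def summarize_bulk_errors(errors):
    """Count failed bulk items by (action, HTTP status, error reason)."""
    summary = Counter()
    for item in errors:
        # each item looks like {"index": {"status": ..., "error": {...}}}
        for action, detail in item.items():
            error = detail.get("error") or {}
            reason = error.get("reason", "unknown") if isinstance(error, dict) else str(error)
            summary[(action, detail.get("status"), reason)] += 1
    return summary


# Hypothetical error items, shaped like the entries of BulkIndexError.errors
sample_errors = [
    {"index": {"_index": "approx-search-demo", "status": 400,
               "error": {"type": "placeholder_type", "reason": "illustrative reason A"}}},
    {"index": {"_index": "approx-search-demo", "status": 400,
               "error": {"type": "placeholder_type", "reason": "illustrative reason A"}}},
    {"index": {"_index": "approx-search-demo", "status": 400,
               "error": {"type": "placeholder_type", "reason": "illustrative reason B"}}},
]

summary = summarize_bulk_errors(sample_errors)
for (action, status, reason), count in summary.most_common():
    print(f"{count} x {action} failed with status {status}: {reason}")
```

In the notebook, the real list could be obtained by wrapping the indexing call in `try` / `except BulkIndexError as e` (imported from `elasticsearch.helpers`) and passing `e.errors` to the helper.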

## Show results function

Next, we will create a small function that prints the results of our query in a human-readable format. This function will be used in our examples to display the results.

In [18]:
def showResults(output):
    # print the number of hits, then each returned document
    print("Total results: ", len(output))
    for result in output:
        print(result)

In [108]:
results = documents.similarity_search("How does the compensation work?")
showResults(results)

Total results:  0
