# Similarity search with ApproxRetrievalStrategy and NLP model

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/integrations/langchain/llangchain-vector-store-approx-search.ipynb)

This workbook demonstrates how to perform a similarity search using [ElasticsearchStore](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html) and [ApproxRetrievalStrategy](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.ApproxRetrievalStrategy). First, we will download a sample dataset and split the documents into chunks using LangChain, then index them into Elasticsearch through [ElasticsearchStore.from_documents](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents).

The [ApproxRetrievalStrategy](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.ApproxRetrievalStrategy) uses the HNSW algorithm to find the [nearest neighbors](https://en.wikipedia.org/wiki/Nearest_neighbor_search) of a query vector. HNSW is an approximate method: it gives up a small amount of accuracy in exchange for searches that are typically much faster than an exact comparison against every vector. During indexing, dense vector fields are generated and stored within the index. To embed the query, we can either provide an embedding function or provide a `query_model_id`. In this notebook we will provide a `query_model_id`, so the query is embedded at query time by the model deployed in Elasticsearch.
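
To make the idea concrete, here is a minimal sketch of the exact (brute-force) nearest neighbor search that HNSW approximates. The function names and toy vectors are made up for illustration; this is not how Elasticsearch implements the search, only the result it tries to approach without scanning every vector.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_knn(query, vectors, k):
    """Return the ids of the k vectors closest to the query.

    Every stored vector is compared with the query, so the cost grows
    linearly with the number of vectors.
    """
    ranked = sorted(vectors, key=lambda doc_id: l2_distance(query, vectors[doc_id]))
    return ranked[:k]


# Hypothetical 2-dimensional "embeddings" (real ones here have 384 dimensions).
toy_vectors = {
    "doc-a": [0.0, 0.0],
    "doc-b": [1.0, 1.0],
    "doc-c": [5.0, 5.0],
}

nearest = exact_knn([0.9, 1.2], toy_vectors, k=2)
print(nearest)
```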

For our model, we will use [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to transform the search text into a dense vector.




## Install packages and import modules

In [3]:
!python3 -m pip install -qU langchain elasticsearch tiktoken  sentence-transformers eland  transformers

from getpass import getpass
from langchain.vectorstores import ElasticsearchStore
from langchain.embeddings.openai import OpenAIEmbeddings
from urllib.request import urlopen
from langchain.text_splitter import CharacterTextSplitter
import json

## Connect to Elasticsearch

ℹ️ We're using an Elastic Cloud deployment of Elasticsearch for this notebook. If you don't have an Elastic Cloud deployment, sign up here for a free trial.

Because we are using an Elastic Cloud deployment, we'll use the Cloud ID to identify it. To find the Cloud ID for your deployment, go to https://cloud.elastic.co/deployments and select your deployment.

We will use ElasticsearchStore to connect to our Elastic Cloud deployment, which makes it easy to create an index and index data. In the constructor, we will explicitly set `query_field` and `vector_query_field` so that they match the source and target fields of our inference pipeline.

Note: For demonstration, we will explicitly set the strategy to `ElasticsearchStore.ApproxRetrievalStrategy`, although ElasticsearchStore uses `ApproxRetrievalStrategy` by default.



In [5]:
CLOUD_ID = getpass("Elastic deployment Cloud ID")
CLOUD_USERNAME = "elastic"
CLOUD_PASSWORD = getpass("Elastic deployment Password")

vector_store = ElasticsearchStore(
            es_cloud_id=CLOUD_ID, 
            es_user=CLOUD_USERNAME, 
            query_field="text_field",
            vector_query_field="vector_query_field.predicted_value",
            es_password=CLOUD_PASSWORD,
            index_name= "approx-search-demo",
            strategy=ElasticsearchStore.ApproxRetrievalStrategy(query_model_id="sentence-transformers__all-minilm-l6-v2")
            
        )


## Deploy model using Eland

ℹ️ Once you have created an Elastic Cloud deployment, [autoscale](https://www.elastic.co/guide/en/cloud/current/ec-autoscaling.html) it to have at least one machine learning (ML) node with enough (4GB) memory. Also ensure that the Elasticsearch cluster is running.


We are using the [`eland`](https://www.elastic.co/guide/en/elasticsearch/client/eland/current/overview.html) tool to install a `text_embedding` model, [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). The model will transform your search query into a vector, which is then used to search over the set of documents stored in Elasticsearch.
Using the [`eland_import_hub_model`](https://www.elastic.co/guide/en/elasticsearch/client/eland/current/machine-learning.html#ml-nlp-pytorch) script, we can download and install the `all-MiniLM-L6-v2` transformer model, setting the NLP `--task-type` to `text_embedding`.

Authenticate your request to the cloud deployment by providing the Cloud ID, username and password. Alternatively, you could use an [API key](https://www.elastic.co/guide/en/kibana/current/api-keys.html#create-api-key) in place of the username and password.




In [None]:
!eland_import_hub_model -u $CLOUD_USERNAME -p $CLOUD_PASSWORD --cloud-id $CLOUD_ID --hub-model-id sentence-transformers/all-MiniLM-L6-v2 --task-type text_embedding --start --clear-previous
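
The same choice between an API key and a username/password pair applies when connecting from LangChain. The helper below is a hypothetical convenience, not part of this notebook or of LangChain: it only builds the keyword arguments, on the assumption that `ElasticsearchStore` accepts `es_cloud_id`, `es_api_key`, `es_user` and `es_password` as in the constructor call shown earlier. The credential values are placeholders.

```python
def connection_kwargs(cloud_id, api_key=None, user=None, password=None):
    """Build connection keyword arguments for ElasticsearchStore.

    Prefers an API key; falls back to username and password.
    """
    if api_key:
        return {"es_cloud_id": cloud_id, "es_api_key": api_key}
    if user and password:
        return {"es_cloud_id": cloud_id, "es_user": user, "es_password": password}
    raise ValueError("Provide either api_key or both user and password")


# Placeholder values; real ones would come from getpass() as above.
kwargs = connection_kwargs("my-cloud-id", api_key="my-api-key")
print(sorted(kwargs))
```

The resulting dictionary could then be unpacked into the constructor, for example `ElasticsearchStore(index_name="approx-search-demo", **kwargs, ...)`, alongside the other arguments used in this notebook.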

## Download the sample dataset

Let's download the sample dataset and deserialize the document to make document chunking easier.

In [4]:
url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/example-apps/workplace-search/example-data/data.json"  
response = urlopen(url)

workplace_docs = json.loads(response.read())

## Create ingestion pipeline

We need to create a text embedding ingest pipeline to generate vector embeddings for `text_field`.

The pipeline below defines an [inference](https://www.elastic.co/guide/en/elasticsearch/reference/current/inference-processor.html) processor that runs the NLP model on each incoming document.

In [None]:
PIPELINE_ID="vectorize_workplace"

vector_store.client.ingest.put_pipeline(
    id=PIPELINE_ID,
    processors=[
        {
            "inference": {
                "model_id": "sentence-transformers__all-minilm-l6-v2",
                "field_map": {"query_field": "text_field"},
                "target_field": "vector_query_field",
            }
        }
    ],
)
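
To picture what the pipeline does to each document, the sketch below mimics the processor in plain Python. `fake_embed` is a stand-in, not the real model, and it returns three numbers instead of 384; only the resulting document shape (the embedding nested under `vector_query_field.predicted_value`) reflects the setup above.

```python
def fake_embed(text):
    """Stand-in for the deployed model. This is NOT a real embedding."""
    return [float(len(text)), float(text.count(" ")), float(sum(map(ord, text)) % 7)]


def apply_inference(doc, source_field="text_field", target_field="vector_query_field"):
    """Return a copy of doc with the embedding nested under target_field."""
    enriched = dict(doc)
    enriched[target_field] = {"predicted_value": fake_embed(doc[source_field])}
    return enriched


indexed_doc = apply_inference({"text_field": "vacation policy"})
print(indexed_doc)
```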

## Create Index with mappings

We will now create an Elasticsearch index with the correct mapping before we index documents.
The `vector_query_field` object gets a `predicted_value` subfield of type `dense_vector` to store the vector embeddings. The text content that we search over is mapped to `text_field`.

In [None]:
# define index name
INDEX_NAME="approx-search-demo"

# flag to check if index has to be deleted before creating
SHOULD_DELETE_INDEX=True

# define index mapping
INDEX_MAPPING = {
    "properties": {
      "text_field": {"type": "text"},
      "vector_query_field": {
        "properties": {
          "is_truncated": {
            "type": "boolean"
          },
          "predicted_value": {
            "type": "dense_vector",
            "dims": 384,
            "index": True,
            "similarity": "l2_norm"
          }
        }
      }
    }
  }


INDEX_SETTINGS = {"index": { "default_pipeline": PIPELINE_ID}}

# check if we want to delete index before creating the index
if SHOULD_DELETE_INDEX:
  if vector_store.client.indices.exists(index=INDEX_NAME):
    print("Deleting existing %s" % INDEX_NAME)
    vector_store.client.indices.delete(index=INDEX_NAME, ignore=[400, 404])

print("Creating index %s" % INDEX_NAME)
vector_store.client.indices.create(index=INDEX_NAME, mappings=INDEX_MAPPING, settings=INDEX_SETTINGS,
                  ignore=[400, 404])
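
The `similarity` setting in the mapping controls how vector distance is turned into a relevance score. As a rough illustration, and assuming the formula given in the Elasticsearch `dense_vector` documentation for `l2_norm` (`1 / (1 + distance^2)`), the score could be computed as follows. The function name and the vectors are hypothetical.

```python
def l2_norm_score(query, vector):
    """Score under the assumed l2_norm formula: 1 / (1 + squared L2 distance)."""
    squared_distance = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1.0 / (1.0 + squared_distance)


identical = l2_norm_score([1.0, 0.0], [1.0, 0.0])  # distance 0
farther = l2_norm_score([1.0, 0.0], [0.0, 1.0])    # squared distance 2
print(identical, farther)
```

Under this formula, closer vectors score higher, with a maximum of 1.0 for an exact match.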


## Split Documents into Passages

Next, we will chunk these documents into passages of up to 800 characters with no overlap using a text splitter, [CharacterTextSplitter](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.CharacterTextSplitter.html), and then create documents from these texts. Note that `chunk_size` is measured in characters here, not tokens, because the splitter's default length function counts characters.

In [None]:
metadata = []
content = []

# data.json
for doc in workplace_docs:
  content.append(doc["content"])
  metadata.append({
      "name": doc["name"],
      "summary": doc["summary"]
  })

text_splitter = CharacterTextSplitter(chunk_size=800, chunk_overlap=0)
docs = text_splitter.create_documents(content, metadatas=metadata)
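
As a simplified sketch of the idea behind this kind of splitting, the function below cuts text at blank lines and greedily merges the pieces into chunks of at most `chunk_size` characters. It is an approximation written for illustration, not LangChain's implementation; the blank-line separator is assumed to match the splitter's default, and `split_into_chunks` is a made-up name.

```python
def split_into_chunks(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks.

    Each chunk holds at most chunk_size characters, except that a single
    piece longer than chunk_size is kept whole rather than cut.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "Intro paragraph.\n\nSecond paragraph here.\n\nThird."
chunks = split_into_chunks(sample, chunk_size=45)
print(chunks)
```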

## Index data into elasticsearch

Now that we have our documents ready, we will index the data into Elasticsearch using [ElasticsearchStore.from_documents](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.elasticsearch.ElasticsearchStore.html#langchain.vectorstores.elasticsearch.ElasticsearchStore.from_documents). We will use the Cloud ID and password values set in the Connect to Elasticsearch step.

We will set `query_model_id` to `"sentence-transformers__all-minilm-l6-v2"` so that this model is used to embed the query as a dense vector.

ℹ️ Note: Before you begin indexing, ensure you have [started your trained model deployment](https://www.elastic.co/guide/en/machine-learning/current/ml-nlp-deploy-model.html) in your cluster.


In [51]:
documents = ElasticsearchStore.from_documents(
    docs, es_cloud_id=CLOUD_ID, es_user=CLOUD_USERNAME, es_password=CLOUD_PASSWORD,   
    index_name= "approx-search-demo",
    query_field="text_field",
    vector_query_field="vector_query_field.predicted_value",
    strategy=ElasticsearchStore.ApproxRetrievalStrategy(query_model_id="sentence-transformers__all-minilm-l6-v2")
)

## Querying the dataset with similarity_search

Now that we have indexed our sample data into Elasticsearch, we will perform a similarity search with the query `How does compensation work`.

In [52]:
results = vector_store.similarity_search("How does compensation work")

print(results)

[Document(page_content='Performance-Based Compensation:\nIn addition to the defined compensation bands, we emphasize a performance-based compensation model. Performance evaluations will be conducted regularly, and employees exceeding performance expectations will be eligible for bonuses, incentives, and salary increases. This approach rewards high achievers and motivates employees to excel in their roles.\n\nConclusion:\nBy implementing this compensation bands strategy, our IT company aims to establish fair and competitive compensation practices that align with market standards and foster employee satisfaction. Regular evaluations and market benchmarking will enable us to adapt and refine the strategy to meet the evolving needs of our organization.', metadata={'summary': 'This document outlines a compensation framework for IT teams. It includes job levels, compensation bands, and performance-based incentives to ensure fair and competitive wages. Regular market benchmarking will be cond
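
The raw list is hard to read. A small helper like the one below could print one line per hit instead. It is a hypothetical convenience that relies only on the `page_content` and `metadata` attributes visible in the output above; the sample object and its `name` value are made up so that the snippet runs without a cluster.

```python
from types import SimpleNamespace


def summarize(results, width=80):
    """Return one line per result: rank, document name and a text preview."""
    lines = []
    for rank, doc in enumerate(results, start=1):
        name = doc.metadata.get("name", "<unnamed>")
        # Collapse newlines and repeated spaces, then truncate the preview.
        preview = " ".join(doc.page_content.split())[:width]
        lines.append(f"{rank}. {name}: {preview}")
    return lines


# Stand-in with the same two attributes as a LangChain Document.
fake_results = [
    SimpleNamespace(
        page_content="Performance-Based Compensation:\nIn addition...",
        metadata={"name": "Compensation Framework"},
    ),
]

summary = summarize(fake_results)
print("\n".join(summary))
```

With the real search output, the call would be `summarize(results)`.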

## Next steps

Now you know how to integrate LangChain with the Elasticsearch vector store using your choice of NLP model, and how to run a similarity search on the indexed dataset to get the top 4 most similar passages.

Next, check out our [langchain-vector-store](https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/langchain/langchain-vector-store.ipynb) notebook to see how to query and filter on the metadata.