# Similarity search

This guide gives an overview of the similarity search features. Since version `1.2.0`, Argilla supports adding vectors to records, which can then be used to find the records most similar to a given one. This feature combines vector (or semantic) search with more traditional keyword- and filter-based search. Vector search leverages machine learning to capture rich semantic features by embedding items (text, video, images, etc.) into a vector space, which can then be used to find "semantically" similar items.

In this guide, you'll find how to:
* Set up your Elasticsearch or OpenSearch endpoint with vector search support.
* Encode text into vectors for Argilla records.
* Use similarity search.

The next section gives a general overview of how similarity search works in Argilla.

## How it works
Similarity search in Argilla works as follows:

1. One or several vectors can be included in the `vectors` field of Argilla Records. The `vectors` field accepts a dictionary, because for certain use cases you might want to associate several vectors with a record.
2. The vectors are stored at indexing time, once the records are logged with `rg.log`.
3. If you have stored vectors in your dataset, you can use the similarity search feature in the Argilla UI or the `vector` param in the `rg.load` method of the Python Client.

In future versions, embedding services might be developed to facilitate steps 1 and 2 and associate vectors to records automatically. 
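
To make the idea of "finding the most similar records to a given one" concrete, here is a minimal brute-force sketch in plain Python. It is only an illustration: Argilla delegates the actual search to Elasticsearch or OpenSearch, and the use of cosine similarity, the helper names (`cosine_similarity`, `most_similar`), the `toy` vector key, and the three-dimensional vectors are all assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(records, vector_name, query_vector, limit=2):
    """Return the `limit` records whose named vector is closest to the query."""
    # Only records that actually carry the requested vector can be compared.
    candidates = [r for r in records if vector_name in r["vectors"]]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(r["vectors"][vector_name], query_vector),
        reverse=True,
    )
    return ranked[:limit]


# Plain dicts standing in for Argilla records (hypothetical toy data).
records = [
    {"text": "How do I locate my card?", "vectors": {"toy": [0.9, 0.1, 0.0]}},
    {"text": "My card has not arrived yet.", "vectors": {"toy": [0.8, 0.3, 0.1]}},
    {"text": "What is the exchange rate?", "vectors": {"toy": [0.0, 0.2, 0.9]}},
]

# Use the first record's vector as the query, i.e. "find records similar to this one".
query = records[0]["vectors"]["toy"]
for record in most_similar(records, "toy", query):
    print(record["text"])
```

The query record itself comes back first because it is identical to the query vector. A real vector index avoids comparing against every record, which is why the vectors are stored at indexing time.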




<div class="alert alert-info">

Note
    
It's completely up to the user which encoding or embedding mechanism to use for producing these vectors. In the "Encode text into vectors for Argilla records" section of this document you will find several examples and details about this process, using open source libraries (e.g., Hugging Face) as well as paid services (e.g., Cohere or OpenAI).

Currently, Argilla uses vector search only for searching similar records (nearest neighbours) of a given vector. This can be leveraged from Argilla UI as well as the Python Client. In the future, vector search could be leveraged as well for free text queries using Argilla UI.
    
</div>


## Set up Elasticsearch or OpenSearch with vector search support


TODO: @frascuchon please add some basic content (bullet point and references) here


<div class="alert alert-warning">

Warning

Add here potential issues with ES or OpenSearch in terms of performance, losing/not seeing data from past versions, etc.
    
</div>

## Encode text into vectors for Argilla records
The first and most important thing to do before leveraging similarity search is to turn text into a numerical representation: a vector. In practical terms, you can think of a vector as an array or list of numbers. You can associate this list of numbers with an Argilla Record by using the aforementioned `vectors` field. But the question is: **how do you create these vectors?** 

Over the years, many approaches have been used to turn text into numerical representations. The goal is to "encode" meaning, context, topics, etc., so that the result can be used to find "semantically" similar text. Some of these approaches are: *LSA* (Latent Semantic Analysis), *tf-idf*, *LDA* (Latent Dirichlet Allocation), or *doc2Vec*. More recent methods fall in the category of "neural" methods, which leverage the power of large neural networks to *embed* text into dense vectors (large arrays of real numbers). These methods have demonstrated a great ability to capture semantic features and are powering a new wave of technologies that fall under categories like neural search, semantic search, or vector search. Most of them involve using a large language model to encode the full context of a textual snippet, such as a sentence, a paragraph, and more recently larger documents.

<div class="alert alert-info">

Note
   
In the context of Argilla, we intentionally use the term `vector` in favour of `embedding` to emphasize that users can leverage methods other than neural, which might be cheaper to compute, or be more useful for their use cases.
</div>
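
As an illustration of a non-neural encoding, the sketch below computes tf-idf vectors in plain Python and wraps them in the same `vectors` dictionary shape used throughout this guide. The function name `tfidf_vectors`, the whitespace tokenization, the smoothed idf formula (one common variant among several), and the `tf-idf` key are assumptions chosen for this example, not something Argilla prescribes.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Encode each text as a dense tf-idf vector over a shared vocabulary."""
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of texts that contain each token.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed inverse document frequency, so no weight is ever zero or undefined.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term frequency (relative count in this text) times idf, one slot per vocabulary token.
        vectors.append([counts[t] / len(tokens) * idf[t] for t in vocabulary])
    return vocabulary, vectors


texts = [
    "my card has not arrived",
    "my new card has not arrived",
    "what is the exchange rate",
]
vocabulary, vectors = tfidf_vectors(texts)

# One named vector per record, stored under an arbitrary key.
records = [{"text": t, "vectors": {"tf-idf": v}} for t, v in zip(texts, vectors)]
print(len(vocabulary), len(records[0]["vectors"]["tf-idf"]))
```

Every vector has one entry per vocabulary token, so all records share the same dimensionality. Unlike neural embeddings, that dimensionality grows with the vocabulary of the dataset.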

In the next sections, we show how to encode text using different models and services and how to add them to Argilla records.

### Sentence Transformers
SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. There are dozens of [pre-trained models available](https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads) on the Hugging Face Hub.

For the example below, we are using the Fast Sentence Transformers library developed by David Berenstein. This library makes sentence transformers 5x faster using tools like quantization and ONNX. The same could be done by using 

The code below will load a dataset from the Hub, encode the `text` field, and create the `vectors` field which will contain only one key (`mini-lm-sentence-transformers`). 

<div class="alert alert-info">

Note
   
Vector keys are arbitrary names that identify each vector. If a record has more than one vector, the keys are shown in the UI so users can decide which vector to use for finding similar records. Remember you can associate several vectors with one record by using different keys.
</div>
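
The shape of a `vectors` dictionary holding several vectors can be sketched with plain Python dicts. The record below is a hypothetical stand-in for an Argilla record, the values are toy numbers (real vectors are much longer), and the helpers `vector_names` and `select_vector` are illustrative names, not part of the Argilla client.

```python
# Hypothetical record with two named vectors of different dimensionality.
record = {
    "text": "How do I locate my card?",
    "vectors": {
        "mini-lm-sentence-transformers": [-0.07, 0.12, 0.03],
        "tf-idf": [0.0, 0.4, 0.0, 0.9],
    },
}


def vector_names(records):
    """Collect every vector key used across the records, in sorted order."""
    return sorted({name for r in records for name in r["vectors"]})


def select_vector(record, name):
    """Return the vector stored under `name`, or raise a descriptive error."""
    try:
        return record["vectors"][name]
    except KeyError:
        available = ", ".join(sorted(record["vectors"]))
        raise KeyError(f"No vector named {name!r}; available: {available}") from None


print(vector_names([record]))
print(len(select_vector(record, "tf-idf")))
```

Each key addresses an independent vector, so different keys can hold vectors produced by different models and with different lengths.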


To run the code below you need to install `fast_sentence_transformers` and `datasets` with pip: `pip install fast_sentence_transformers datasets`.

In [None]:
from fast_sentence_transformers import FastSentenceTransformer as SentenceTransformer

from datasets import load_dataset

# Define fast version of sentence transformers
encoder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

# Load dataset
dataset = load_dataset("banking77", split="test")

# Encode text field using batched computation
dataset = dataset.map(lambda batch: {"vectors": encoder.encode(batch["text"])}, batch_size=32, batched=True)

# Turn vectors into a dictionary
dataset = dataset.map(
    lambda r: {"vectors": {"mini-lm-sentence-transformers": r["vectors"]}}
)

Our dataset now contains a `vectors` field with the embedding vector generated by the sentence transformer model.

In [4]:
dataset.to_pandas().head()

| | text | label | vectors |
|---|---|---|---|
| 0 | How do I locate my card? | 11 | {'mini-lm-sentence-transformers': [-0.06855544... |
| 1 | I still have not received my new card, I order... | 11 | {'mini-lm-sentence-transformers': [-0.16821901... |
| 2 | I ordered a card but it has not arrived. Help ... | 11 | {'mini-lm-sentence-transformers': [-0.17539088... |
| 3 | Is there a way to know when my card will arrive? | 11 | {'mini-lm-sentence-transformers': [0.072342686... |
| 4 | My card has not arrived yet. | 11 | {'mini-lm-sentence-transformers': [-0.25441116... |


This dataset can be transformed into an Argilla Dataset by using the `DatasetForTextClassification.from_datasets` method. Then, this dataset can be logged into Argilla as follows:

In [None]:
import argilla as rg

rg_ds = rg.DatasetForTextClassification.from_datasets(dataset, annotation="label")

rg.log(
    name="banking77",
    records=rg_ds,
    chunk_size=50,
)

### OpenAI

OpenAI provides an API endpoint called [Embeddings](https://beta.openai.com/docs/api-reference/embeddings) to get a vector representation of a given input that can be easily consumed by machine learning models and algorithms.

The code below will load a dataset from the Hub, encode the `text` field using the Embeddings endpoint, and create the `vectors` field which will contain only one key (`openai-text-embedding-ada-002`).

To run the code below you need to install `openai` and `datasets` with pip: `pip install openai datasets`.

You also need to set up your OpenAI API key, for example with: `export OPENAI_API_KEY=<your_api_key>`

In [None]:
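import os

import openai
from datasets import load_dataset

# Read the API key from the OPENAI_API_KEY environment variable set above
openai.api_key = os.getenv("OPENAI_API_KEY")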

# Load dataset
dataset = load_dataset("banking77", split="test")

def get_embedding(texts, model="text-embedding-ada-002"):
    response = openai.Embedding.create(input = texts, model=model)
    vectors = [item["embedding"] for item in response["data"]]
    return vectors

# Encode text. Get only 100 vectors as this is a paid service; remove the select to encode the full dataset
dataset = dataset.select(range(100)).map(lambda batch: {"vectors": get_embedding(batch["text"])}, batch_size=16, batched=True)

# Turn vectors into a dictionary
dataset = dataset.map(
    lambda r: {"vectors": {"openai-text-embedding-ada-002": r["vectors"]}}
)

In [64]:
dataset.to_pandas().head()

| | text | label | vectors |
|---|---|---|---|
| 0 | How do I locate my card? | 11 | {'openai-text-embedding-ada-002': [-0.02451298... |
| 1 | I still have not received my new card, I order... | 11 | {'openai-text-embedding-ada-002': [-0.03631252... |
| 2 | I ordered a card but it has not arrived. Help ... | 11 | {'openai-text-embedding-ada-002': [-0.03900404... |
| 3 | Is there a way to know when my card will arrive? | 11 | {'openai-text-embedding-ada-002': [-0.01449656... |
| 4 | My card has not arrived yet. | 11 | {'openai-text-embedding-ada-002': [-0.03863124... |


In [82]:
import argilla as rg

rg.delete("banking77-openai")

rg_ds = rg.DatasetForTextClassification.from_datasets(dataset, annotation="label")

rg.log(
    name="banking77-openai",
    records=rg_ds,
    chunk_size=50,
)











## Use similarity search

### Argilla UI

### Argilla Python Client