# Similarity search

This guide gives an overview of Argilla's similarity search features. Since version `1.2.0`, Argilla supports adding vectors to records, which can then be used to find the records most similar to a given one. This feature combines vector (or semantic) search with more traditional keyword- and filter-based search. Vector search leverages machine learning to capture rich semantic features by embedding items (text, video, images, etc.) into a vector space, which can then be used to find "semantically" similar items.

In this guide, you'll learn how to:
* Set up your Elasticsearch or OpenSearch endpoint with vector search support.
* Encode text for Argilla records.
* Use similarity search.

The next section gives a general overview of how similarity search works in Argilla.

## How it works
Similarity search in Argilla works as follows:

1. One or several vectors can be included in the `vectors` field of Argilla Records. The `vectors` field accepts a dictionary, since for certain use cases you might want to use several vectors.
2. The vectors are stored at indexing time, once the records are logged with `rg.log`.
3. If you have stored vectors in your dataset, you can use the similarity search feature in the Argilla UI or the `vector` param in the `rg.load` method of the Python Client.

In future versions, embedding services might be developed to skip steps 1 and 2 and associate vectors with records automatically.
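
As a rough illustration of the three steps above, the following toy sketch mimics them in plain Python, without an Argilla server. It is not Argilla's implementation: the in-memory `index` list, the `most_similar` helper, the vector name `toy_vector`, and the choice of cosine similarity are all assumptions made for this example. In Argilla itself, step 2 is handled by `rg.log` and step 3 by the `vector` param of `rg.load`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Step 1: each record carries a dictionary of named vectors.
# The name "toy_vector" and the values are made up for illustration.
records = [
    {"text": "I love this phone", "vectors": {"toy_vector": [0.9, 0.1, 0.0]}},
    {"text": "This phone is great", "vectors": {"toy_vector": [0.8, 0.2, 0.1]}},
    {"text": "The invoice is overdue", "vectors": {"toy_vector": [0.0, 0.1, 0.9]}},
]

# Step 2: "indexing" is just keeping the records in a list here.
index = list(records)


# Step 3: return the texts of the records closest to a query vector.
def most_similar(index, vector_name, query_vector, limit=2):
    scored = [
        (cosine_similarity(record["vectors"][vector_name], query_vector), record["text"])
        for record in index
        if vector_name in record["vectors"]
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:limit]]


neighbours = most_similar(index, "toy_vector", [0.9, 0.1, 0.0])
print(neighbours)
```

The query is expressed as a vector name plus a vector because a record may hold several vectors, and a search only makes sense against vectors produced by the same encoder.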




<div class="alert alert-info">

Note
    
It's completely up to the user which encoding or embedding mechanism to use for producing these vectors. In the "Encode text fields" section of this document you will find several examples and details about this process, using open source libraries (e.g., Hugging Face) as well as paid services (e.g., Cohere or OpenAI).

Currently, Argilla uses vector search only to find the records most similar to a given vector (its nearest neighbours). This can be leveraged from the Argilla UI as well as the Python Client. In the future, vector search could also be leveraged for free text queries using the Argilla UI.
    
</div>


## Set up Elasticsearch or OpenSearch with vector search support


TODO: @frascuchon please add some basic content (bullet point and references) here


<div class="alert alert-warning">

Warning

Add here potential issues with ES or OpenSearch in terms of performance, losing/not seeing data from past versions, etc.
    
</div>

## Encode text fields
The first and most important thing to do before leveraging similarity search is to turn text into a numerical representation: a vector. In practical terms, you can think of a vector as an array or list of numbers. You can associate this list of numbers with an Argilla Record by using the aforementioned `vectors` field. But the question is: **how do you create these vectors?** 

Over the years, many approaches have been used to turn text into numerical representations. The goal is to "encode" meaning, context, topics, etc., so that the result can be used to find "semantically" similar text. Some of these approaches are: *LSA* (Latent Semantic Analysis), *tf-idf*, *LDA* (Latent Dirichlet Allocation), or *doc2Vec*. More recent methods fall into the category of "neural" methods, which leverage the power of large neural networks to *embed* text into dense vectors (large arrays of real numbers). These methods have demonstrated a great ability to capture semantic features, and they are powering a new wave of technologies that fall under categories like neural search, semantic search, or vector search. Most of them involve using a large language model to encode the full context of a textual snippet, such as a sentence, a paragraph, and more recently larger documents.
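
To make the idea of "turning text into a list of numbers" concrete, here is a minimal sketch of one of the non-neural approaches named above, *tf-idf*, written in plain Python. The whitespace tokenizer, the smoothed idf formula, and the helper names `fit_idf` and `encode` are assumptions chosen for brevity; in practice you would more likely rely on an established library.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def fit_idf(corpus):
    """Build a fixed vocabulary and an idf weight per term from a corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for text in corpus:
        doc_freq.update(set(tokenize(text)))
    vocabulary = sorted(doc_freq)
    # Smoothed idf, so terms present in every document keep a non-zero weight.
    idf = {
        term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        for term in vocabulary
    }
    return vocabulary, idf


def encode(text, vocabulary, idf):
    """Return one number per vocabulary term: term frequency times idf."""
    counts = Counter(tokenize(text))
    total = sum(counts.values()) or 1
    return [counts[term] / total * idf[term] for term in vocabulary]


corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks fell sharply today",
]
vocabulary, idf = fit_idf(corpus)
vectors = [encode(text, vocabulary, idf) for text in corpus]

# Every text is now a list of floats with the same length as the vocabulary.
print(len(vocabulary), [len(v) for v in vectors])
```

Because the vocabulary is fixed once, every vector has the same dimension, which is what allows them to be compared with each other. Unlike the dense vectors produced by neural methods, these vectors are mostly zeros and only capture word overlap, not meaning.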

<div class="alert alert-info">

Note
   
In the context of Argilla, we intentionally use the term `vector` in favour of `embedding` to emphasize that users can leverage methods other than neural, which might be cheaper to compute, or be more useful for their use cases.
</div>

### Sentence Transformers

### OpenAI

### Cohere

### spaCy

### BERTopic

## Use similarity search

### Argilla UI

### Argilla Python Client