# Hybrid Search

### Dense vectors and Sparse Vectors

* Dense Vectors

So far, we have been representing our text with **dense vectors**.  That is, when we submitted our text to OpenAI to be embedded, what came back was a vector of 1536 entries, where each element represents a different characteristic that helps capture the *meaning* of the text.

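```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

text_inputs = ["There are good people here"]

MODEL = "text-embedding-3-small"

res = client.embeddings.create(input=text_inputs, model=MODEL)
vectors = res.data  # one entry per input text
dense_vector = vectors[0].embedding  # a list of 1536 floats
```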

This returns something like the following.

```
dense_vector = [-0.00281827, -0.00190404, -0.03839779,  0.08865113,  0.01493002,
       -0.03636289, -0.07172307,  0.08416843, -0.01134681, -0.0448859 ,
        0.01821093, -0.01549773,  0.02148448,  0.0063812 ,  0.04373574,
        0.02649802, -0.01536502,  0.01486366,  0.00197961,  0.07797524,
        0.03580255,  0.02223651,  0.0183289 ,  0.00479604,  0.02714683,
        0.01404528, -0.02565751, -0.00899487,  0.02135177,  0.02487599,
        0.01212096, -0.03798491,  0.00182294, -0.00522735, -0.01257808,
       -0.01607281,  0.02760394, -0.01060216, -0.01015241, -0.03108393 ...]
```

We call the vector above a dense vector because every value in the vector is filled.  Notice that there are no zeros.  Instead, each entry represents a separate characteristic about the document.

* Sparse Embeddings

However, there is a different way to represent a document.  A simpler way from a simpler time.  In this approach, instead of each element representing a different *characteristic* of a given text, it simply represents a different word.  If the word is present in the chunk of text, we represent that presence with a non-zero number, and if the word is not in the chunk we represent that with a zero.  That's it.

For example, if you look at the `sparse_embedding.py` file you can see our code for reading in text from a .txt file on WWI and passing it to our vectorizer from sklearn.  If you look at the vector for one of the chunks, you'll see that it is almost all zeros.

```python
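# .toarray() converts the sparse matrix to a regular array (the older .A shortcut is gone in recent SciPy)
tfidf_matrix.toarray()[0][:50]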

# array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
#        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
#        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
```


Remember that this is because each element in the vector represents a different word, and the value is non-zero only if that word is present in the chunk.

> Don't worry, our vector doesn't have to represent every word in the English language.  Instead, the vectorizer assigns a different index to every word in our corpus (i.e., all of the unique words that we feed into the model).
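
To make the idea concrete, here is a minimal sketch in plain Python.  It is not the code in `sparse_embedding.py`: the two-chunk corpus and the helper names `tokenize`, `build_vocabulary`, and `sparse_vector` are made up for illustration, and it records simple presence (1 or 0) rather than the TF-IDF weights that sklearn computes.

```python
import re


def tokenize(text):
    # Lowercase the text and split it into alphabetic words.
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(chunks):
    # Assign one index to every unique word in the corpus.
    words = sorted({word for chunk in chunks for word in tokenize(chunk)})
    return {word: index for index, word in enumerate(words)}


def sparse_vector(chunk, vocabulary):
    # 1.0 where the word appears in the chunk, 0.0 everywhere else.
    vector = [0.0] * len(vocabulary)
    for word in tokenize(chunk):
        if word in vocabulary:
            vector[vocabulary[word]] = 1.0
    return vector


# Hypothetical two-chunk corpus.
chunks = [
    "The war began in Europe",
    "An armistice ended the war",
]
vocabulary = build_vocabulary(chunks)
vectors = [sparse_vector(chunk, vocabulary) for chunk in chunks]

print(vocabulary)
print(vectors[0])
print(vectors[1])
```

Even with only two short chunks, each vector already contains zeros for the words that appear only in the other chunk.  With a real corpus of thousands of unique words, almost every entry is zero.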

### Comparing Dense vs Sparse embeddings

Ok, so which one is better?  

Well, there are a lot of benefits to our original dense embeddings.  Dense embeddings are good at capturing nuanced semantic meaning, which allows similarity searches to identify relevant documents even if they don't share exact keywords with the query.  This is what we mean by semantic search.
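
As a rough illustration of how a similarity score between dense vectors can be computed, here is a sketch using cosine similarity.  Cosine similarity is one common choice, and this lesson does not specify which metric the database actually uses.  The three-dimensional vectors are made up so they fit on a line, whereas real embeddings have 1536 entries.

```python
import math


def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings" for illustration only.
query = [0.9, 0.1, 0.0]
chunk_about_same_topic = [0.8, 0.2, 0.1]
chunk_about_other_topic = [0.0, 0.2, 0.9]

same_topic_score = cosine_similarity(query, chunk_about_same_topic)
other_topic_score = cosine_similarity(query, chunk_about_other_topic)

print(same_topic_score)
print(other_topic_score)
```

The chunk whose vector points in nearly the same direction as the query gets a score close to 1, even though no words were compared at any point.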

However, sparse embeddings come with a couple of benefits.  

1. Sometimes, having an exact word for word match *is* important.
2. It takes less time and computation to calculate a keyword match with our sparse embedding than to calculate a similarity score for every chunk in our corpus.
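
One reason keyword matching can be cheap is that it can be served from an inverted index: a mapping from each word to the chunks that contain it.  The sketch below illustrates that general idea and is not a description of how Qdrant stores sparse vectors.  The chunks and the function names `build_inverted_index` and `keyword_search` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(chunks):
    # Map every word to the set of chunk ids that contain it.
    index = defaultdict(set)
    for chunk_id, chunk in enumerate(chunks):
        for word in chunk.lower().split():
            index[word].add(chunk_id)
    return index


def keyword_search(index, query):
    # Return the ids of chunks that contain every word in the query.
    matches = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*matches) if matches else set()


# Hypothetical chunks.
chunks = [
    "the lusitania was sunk in 1915",
    "the armistice was signed in 1918",
    "trench warfare defined the western front",
]
index = build_inverted_index(chunks)

print(keyword_search(index, "lusitania"))
print(keyword_search(index, "was in"))
print(keyword_search(index, "zeppelin"))
```

A lookup only touches the entries for the words in the query, so it never has to score the chunks that share no words with it.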


### Introducing Hybrid Search

So this is where hybrid search comes into play.  With hybrid search, the entire corpus (i.e., all of the chunks in our database) is represented with both dense and sparse embeddings.  Then when a query comes in, both a sparse and a dense embedding are generated for it.  The sparse embedding is ideal for keyword matching, and the dense embedding is better for capturing the query's meaning.

What does hybrid search look like with LlamaIndex?  Something like this.

```python
query_engine = index.as_query_engine(
    similarity_top_k=2, sparse_top_k=10, vector_store_query_mode="hybrid"
)
```

So the settings above favor keyword search over semantic search.  The top ten results are retrieved using the sparse embeddings (keyword search), whereas only the top two are retrieved using the dense vectors (semantic search).
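
Once both searches have run, the two ranked lists need to be merged into one.  A widely used way to do this is reciprocal rank fusion, where each chunk earns `1 / (k + rank)` from every list it appears in.  The sketch below shows that general technique and is not a description of what LlamaIndex or Qdrant does internally.  The chunk ids are made up, and `k=60` is an assumed constant that is a common default for this method.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    # Sum 1 / (k + rank) for each chunk across all ranked lists.
    scores = {}
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results: chunk ids ordered best-first by each search.
sparse_results = ["c7", "c2", "c9", "c4"]
dense_results = ["c2", "c5"]

fused = reciprocal_rank_fusion([sparse_results, dense_results])
print(fused)
```

A chunk that ranks well in both lists, like `c2` here, rises to the top of the fused list.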

### Seeing it in action

Ok, so if you install the required libraries with pip and then run the `index.py` file, you can see the hybrid approach in action.

```bash
pip3 install -r requirements.txt
python3 -i index.py
```

We implemented it with the Qdrant database (pronounced "quadrant").  The script may take a while to run.

In the first few lines of code, we set up the Qdrant database.

```python
client = QdrantClient(path="./qdrant_data")
vector_store = QdrantVectorStore(
    "ww1", client=client, enable_hybrid=True, batch_size=20
)
```

> Notice that we have to specify `enable_hybrid=True` at the very beginning.  The batch size means that our chunks are embedded 20 at a time.

And then we set up our query engine to use a hybrid search.

```python
query_engine = index.as_query_engine(
    similarity_top_k=2, sparse_top_k=12, vector_store_query_mode="hybrid"
)
```
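
Because `index.py` is started with `python3 -i`, the `query_engine` variable should still be available at the prompt once the script finishes.  The helper below is a hypothetical sketch (it is not part of `index.py`) that runs a question and prints each retrieved chunk with its score.  It assumes the usual LlamaIndex response shape, where `query_engine.query(...)` returns a response with a `source_nodes` list, and the example question is made up.

```python
def show_sources(query_engine, question, preview_chars=80):
    # Run the query, then print each retrieved chunk alongside its score.
    response = query_engine.query(question)
    for source in response.source_nodes:
        preview = source.node.get_content()[:preview_chars]
        print(f"{source.score}: {preview}")
    return response


# In the interactive session started by `python3 -i index.py`:
# response = show_sources(query_engine, "What caused the war to begin?")
# print(response)
```

Looking at the retrieved chunks is a quick way to see whether a result came from an exact keyword match or from a match on meaning.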

### Summary

In this lesson, we learned about hybrid search.  With hybrid search, each chunk is embedded using both dense and sparse vectors.  Dense vectors are what we have been using so far, and with them, each element (i.e., dimension) represents a different characteristic of the text.  With sparse embeddings, each dimension in a vector represents the presence or absence of a word.  Sparse embeddings are faster to query and favor exact word matches, whereas dense embeddings favor semantic matches, meaning matches based on meaning.

### Resources

[LlamaIndex Qdrant Hybrid](https://docs.llamaindex.ai/en/stable/examples/vector_stores/qdrant_hybrid.html)

[Pinecone Hybrid Search](https://www.pinecone.io/learn/hybrid-search-intro/)

[Medium Text vs Vector](https://towardsdatascience.com/text-search-vs-vector-search-better-together-3bd48eb6132a)

[Summary Advanced Techniques](https://medium.com/@sauravjoshi23/complex-query-resolution-through-llamaindex-utilizing-recursive-retrieval-document-agents-and-sub-d4861ecd54e6#:~:text=The%20Sub%20Question%20Query%20Engine%20aims%20at%20deconstructing%20a%20complex,information%20from%20relevant%20data%20sources.)