In [1]:
%%capture
!pip install langchain==0.2.5 faiss-cpu==1.8.0 cohere==5.5.8 langchain-community==0.2.5 rank_bm25==0.2.2 sentence-transformers==3.0.1
!pip install llama-cpp-python==0.2.78  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124

There’s a lot of research on how to best use language models for search. Three
broad categories of these models are dense retrieval, reranking, and RAG.

# Dense Retrieval Example


Dense retrieval systems rely on the concept of embeddings, the same concept
we’ve encountered in the previous chapters, and turn the search problem into
retrieving the nearest neighbors of the search query (after both the query and
the documents are converted into embeddings). Figure 8-1 shows how dense
retrieval takes a search query, consults its archive of texts, and outputs a set of
relevant results.
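To make the nearest-neighbor idea concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and document names are made up for illustration; a real embedding model produces much larger vectors, and the helper names are assumptions rather than part of any library.

```python
import math

# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
doc_embeddings = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.2],
    "doc_c": [0.1, 0.0, 1.0],
}

def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_neighbors(query_embedding, embeddings, k=2):
    """Return the k (doc_id, distance) pairs closest to the query embedding."""
    scored = [(euclidean_distance(query_embedding, vec), doc_id)
              for doc_id, vec in embeddings.items()]
    scored.sort()  # smallest distance first
    return [(doc_id, dist) for dist, doc_id in scored[:k]]

query_embedding = [0.8, 0.2, 0.1]  # pretend this came from embedding the query
top = nearest_neighbors(query_embedding, doc_embeddings, k=2)
print(top)
```

The query vector sits closest to `doc_a`, so that document is ranked first; the search never looks at the words themselves, only at the vectors.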

Let’s take a look at a dense retrieval example by using Cohere to search the Wikipedia
page for the film Interstellar. In this example, we will do the following:
1. Get the text we want to make searchable and apply some light processing to
chunk it into sentences.
2. Embed the sentences.
3. Build the search index.
4. Search and see the results.

## 1. Getting the Text Archive and Chunking It

In [2]:

import cohere

# Create and retrieve a Cohere API key from os.cohere.ai, then paste it here.
# Remember not to share it publicly.
api_key = ''

# Create the Cohere client
co = cohere.Client(api_key)

In [1]:
text = """
Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan.
It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine.
Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind.

Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007.
Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar.
Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm.
Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles.
Interstellar uses extensive practical and miniature effects and the company Double Negative created additional digital effects.

Interstellar premiered on October 26, 2014, in Los Angeles.
In the United States, it was first released on film stock, expanding to venues using digital projectors.
The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014.
It received acclaim for its performances, direction, screenplay, musical score, visual effects, ambition, themes, and emotional weight.
It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics. Since its premiere, Interstellar gained a cult following,[5] and now is regarded by many sci-fi experts as one of the best science-fiction films of all time.
Interstellar was nominated for five awards at the 87th Academy Awards, winning Best Visual Effects, and received numerous other accolades"""

# Split into a list of sentences
texts = text.split('.')

# Clean up to remove empty spaces and new lines
texts = [t.strip(' \n') for t in texts]

In [2]:
texts

['Interstellar is a 2014 epic science fiction film co-written, directed, and produced by Christopher Nolan',
 'It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain, Bill Irwin, Ellen Burstyn, Matt Damon, and Michael Caine',
 'Set in a dystopian future where humanity is struggling to survive, the film follows a group of astronauts who travel through a wormhole near Saturn in search of a new home for mankind',
 'Brothers Christopher and Jonathan Nolan wrote the screenplay, which had its origins in a script Jonathan developed in 2007',
 'Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar',
 'Cinematographer Hoyte van Hoytema shot it on 35 mm movie film in the Panavision anamorphic format and IMAX 70 mm',
 'Principal photography began in late 2013 and took place in Alberta, Iceland, and Los Angeles',
 'Interstellar uses extensive practical a

## 2. Embedding the Text Chunks


Let’s now embed the texts. We’ll send them to the Cohere
API, and get back a vector for each text:

In [None]:
import numpy as np

# Get the embeddings
response = co.embed(
  texts=texts,
  input_type="search_document",
).embeddings

embeds = np.array(response)
print(embeds.shape)

This outputs (15, 4096), which indicates that we have 15 vectors, each one of size
4,096.

## 3. Building the Search Index
Before we can search, we need to build a search index.
An index stores the embeddings and is optimized to quickly retrieve the nearest
neighbors even if we have a very large number of points:

In [None]:
import faiss

dim = embeds.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(np.float32(embeds))
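As a rough mental model, a flat L2 index performs an exhaustive comparison of the query against every stored vector and ranks by squared Euclidean distance (Faiss's `IndexFlatL2` reports squared distances). The sketch below mimics that behavior in plain Python; `flat_l2_search` and the toy vectors are hypothetical names for illustration, not the Faiss implementation.

```python
def flat_l2_search(vectors, query, k):
    """Exhaustive search: compare the query with every vector, keep the k closest.

    Returns (distances, ids), where distances are squared Euclidean distances.
    """
    scored = []
    for idx, vec in enumerate(vectors):
        sq_dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((sq_dist, idx))
    scored.sort()  # smallest squared distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

toy_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]  # hypothetical stored embeddings
distances, ids = flat_l2_search(toy_vectors, query=[0.0, 0.5], k=2)
print(distances, ids)
```

Because every vector is compared, the cost grows linearly with the size of the archive; approximate indexes trade some accuracy for speed on large collections.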

## 4. Searching the Index
We can now search the dataset using any query we want. We simply
embed the query and present its embedding to the index, which will retrieve the most
similar sentences from the Wikipedia article (three by default).
Let’s define our search function:

In [None]:
import pandas as pd

def search(query, number_of_results=3):

  # 1. Get the query's embedding
  query_embed = co.embed(texts=[query],
                input_type="search_query",).embeddings[0]

  # 2. Retrieve the nearest neighbors
  distances , similar_item_ids = index.search(np.float32([query_embed]), number_of_results)

  # 3. Format the results
  texts_np = np.array(texts) # Convert texts list to numpy for easier indexing
  results = pd.DataFrame(data={'texts': texts_np[similar_item_ids[0]],
                              'distance': distances[0]})

  # 4. Print and return the results
  print(f"Query:'{query}'\nNearest neighbors:")
  return results

In [None]:
query = "how precise was the science"
results = search(query)
results

| texts | distance |
| --- | --- |
| It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics | 10757.379883 |
| Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar | 11566.131836 |
| Interstellar uses extensive practical and mini... | 11922.833008 |


The first result has the smallest distance, and so is the most similar to the query. Looking
at it, it answers the question perfectly. Notice that this wouldn’t have been possible if
we were only doing keyword search, because the top result does not share any
keywords with the query.
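The keyword point can be checked directly. The sketch below counts exact word overlap between the query and two of the sentences from the archive; the `tokenize` and `keyword_overlap` helpers are a deliberately naive stand-in for a real keyword search method, not how a production system scores matches.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_overlap(query, document):
    """Number of distinct words shared by the query and the document."""
    return len(tokenize(query) & tokenize(document))

query = "how precise was the science"
top_dense_result = ("It has also received praise from many astronomers for its "
                    "scientific accuracy and portrayal of theoretical astrophysics")
second_dense_result = ("Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] "
                       "Kip Thorne was an executive producer, acted as a scientific consultant, "
                       "and wrote a tie-in book, The Science of Interstellar")

overlap_top = keyword_overlap(query, top_dense_result)
overlap_second = keyword_overlap(query, second_dense_result)
print(overlap_top, overlap_second)
```

The sentence that dense retrieval ranked first shares no exact words with the query ("scientific" is not "science"), while the second sentence shares "was", "the", and "science", so a pure word-matching approach would favor it instead.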

### Disadvantages of Dense Retrieval

#### 1. Irrelevant Results When the Answer Is Missing
- If the archive does not contain the answer, dense retrieval still returns the nearest texts, however distant they are.
- **Example**: A query like *"What is the mass of the moon?"* against the Interstellar archive would still return sentences about the film, none of which answer it.
- This happens because nearest-neighbor search always returns the top-k closest embeddings; it has no built-in notion of "no relevant result".
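One possible mitigation is to discard results whose distance exceeds a cutoff. The sketch below assumes results arrive as `(text, distance)` pairs; the function name, the toy distances, and the threshold of 1.0 are all hypothetical, and a usable threshold would have to be tuned for the embedding model and data at hand.

```python
def filter_by_distance(results, max_distance):
    """Keep only (text, distance) pairs whose distance is at or below max_distance."""
    return [(text, dist) for text, dist in results if dist <= max_distance]

# Hypothetical search output, sorted by increasing distance.
toy_results = [
    ("relevant sentence", 0.4),
    ("loosely related sentence", 0.9),
    ("unrelated sentence", 1.7),
]

kept = filter_by_distance(toy_results, max_distance=1.0)
nothing = filter_by_distance(toy_results, max_distance=0.1)
print(kept)
print(nothing)  # an empty list signals "no sufficiently similar text found"
```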

#### 2. Handling Long Texts (Chunking Challenges)
- **Transformer models have a limited context size**, restricting the number of tokens they can process at once.
- Long documents need to be split into smaller chunks, but how this is done affects retrieval quality.

##### Chunking Strategies:
- **One Vector per Document:**
  - Embedding only a representative part (e.g., title or introduction) leaves out a lot of information.
  - Averaging embeddings from multiple chunks compresses information, reducing accuracy.
- **Multiple Vectors per Document:**
  - Documents are split into smaller chunks, and each chunk is embedded separately.
  - This improves retrieval accuracy but increases storage and computational requirements.
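The "averaging" variant of the one-vector-per-document strategy can be sketched as follows. The two-dimensional chunk embeddings are invented for illustration, and `mean_embedding` is a hypothetical helper name.

```python
def mean_embedding(chunk_embeddings):
    """Average a list of equal-length vectors into a single document vector."""
    n = len(chunk_embeddings)
    dim = len(chunk_embeddings[0])
    return [sum(vec[i] for vec in chunk_embeddings) / n for i in range(dim)]

# Hypothetical embeddings for three chunks of one document.
chunk_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
doc_vector = mean_embedding(chunk_embeddings)
print(doc_vector)
```

Three distinct chunk vectors collapse into one point, which illustrates why averaging loses detail: a query that closely matches a single chunk may no longer sit near the averaged document vector.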

##### Best Practices for Chunking:
- **Sentence-level chunks**: Too granular, losing context.
- **Paragraph-level chunks**: Works well if paragraphs are concise.
- **Overlapping chunks**: Improves context retention by including nearby text.
- **Adding titles or surrounding text**: Helps provide better contextual understanding.
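As an illustration of overlapping chunks and title prefixes, here is a minimal sketch that groups sentences into windows sharing a configurable number of sentences with the previous window. The function name, parameters, and default values are assumptions chosen for the example, not recommended settings.

```python
def chunk_with_overlap(sentences, chunk_size=3, overlap=1, title=None):
    """Group sentences into chunks of chunk_size, each sharing `overlap`
    sentences with the previous chunk. Optionally prefix each chunk with a title."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        body = " ".join(sentences[start:start + chunk_size])
        chunks.append(f"{title}: {body}" if title else body)
        if start + chunk_size >= len(sentences):
            break  # the last window already reached the end of the text
    return chunks

sentences = ["S1.", "S2.", "S3.", "S4.", "S5."]
chunks = chunk_with_overlap(sentences, chunk_size=3, overlap=1)
titled = chunk_with_overlap(sentences, chunk_size=3, overlap=1, title="Interstellar")
print(chunks)
print(titled)
```

Here `S3.` appears in both chunks, so text near a chunk boundary keeps some of its surrounding context.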

As the field advances, more dynamic and LLM-based chunking methods are expected to emerge.


# Reranking

A reranker takes in the search query and a number of search results, and reorders
these documents so that the ones most relevant to the query rank
higher. Cohere’s Rerank endpoint is a simple way to start using a
reranker. We simply pass it the query and texts and get the results back. We don’t
need to train or tune it:

In [None]:
query = "how precise was the science"
results = co.rerank(query=query, documents=texts, top_n=3, return_documents=True)
results.results

In [None]:
for idx, result in enumerate(results.results):
    print(idx, result.relevance_score , result.document.text)

0 0.1698185 It has also received praise from many astronomers for its scientific accuracy and portrayal of theoretical astrophysics

1 0.07004896 The film had a worldwide gross over $677 million (and $773 million with subsequent re-releases), making it the tenth-highest grossing film of 2014

2 0.0043994132 Caltech theoretical physicist and 2017 Nobel laureate in Physics[4] Kip Thorne was an executive producer, acted as a scientific consultant, and wrote a tie-in book, The Science of Interstellar

### Difference Between Dense Retrieval and Reranking

| Feature         | **Dense Retrieval** | **Reranking** |
|---------------|-----------------|----------------|
| **Purpose** | Retrieves the most relevant documents from a large corpus. | Reorders the retrieved documents to improve ranking accuracy. |
| **Process** | Uses vector embeddings and similarity search to retrieve top-k candidates. | Takes the top-k retrieved documents and refines their ranking based on a more sophisticated scoring model. |
| **Speed** | Fast, optimized for large-scale retrieval. | Slower, as it applies a second-stage ranking process. |
| **Computational Cost** | Lower per query, as document embeddings are computed once and large collections can use approximate nearest neighbor (ANN) search. | Higher, as it typically runs a transformer model over every query-document pair. |
| **Model Type** | Typically based on **bi-encoder** architectures (e.g., SBERT, DPR). | Often uses **cross-encoder** architectures (e.g., BERT-based rerankers). |
| **Strengths** | Efficient for large document collections; good for first-stage retrieval. | Improves precision by considering fine-grained contextual relationships between query and documents. |
| **Weaknesses** | May retrieve irrelevant results if the embedding similarity is misleading. | Computationally expensive and requires processing each query-document pair separately. |
| **Example Use Case** | Finding a set of potentially relevant articles for a query. | Refining the ranking of retrieved articles to show the most relevant ones at the top. |

#### How They Work Together
1. **Dense Retrieval (First Stage)**: Quickly retrieves the top-k most relevant documents using embeddings.
2. **Reranking (Second Stage)**: Uses a more detailed model to reorder these documents for better relevance.

This two-step approach balances **efficiency** (dense retrieval) and **accuracy** (reranking), making it a common pipeline in information retrieval systems. 🚀
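The two-stage pipeline can be sketched generically. In the code below, both scoring functions are toy stand-ins (word overlap for the cheap first stage, Jaccard similarity for the "more detailed" second stage); in practice the first stage would be an embedding search and the second a reranking model. All names here are hypothetical.

```python
def retrieve_then_rerank(query, documents, first_stage_score, rerank_score, k=10, top_n=3):
    """Shortlist k documents with a cheap scorer, then reorder them with a costlier one."""
    candidates = sorted(documents, key=lambda d: first_stage_score(query, d), reverse=True)[:k]
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:top_n]

# Toy stand-in for the first stage: number of shared words.
def shared_words(query, doc):
    return len(set(query.split()) & set(doc.split()))

# Toy stand-in for the reranker: Jaccard similarity of the word sets.
def jaccard(query, doc):
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)

docs = ["the cat sat on the mat", "cat", "dogs bark loudly", "a cat and a dog"]
final = retrieve_then_rerank("cat mat", docs, shared_words, jaccard, k=3, top_n=2)
print(final)
```

Only the shortlisted candidates are passed to the second scorer, which is what keeps the expensive step affordable on a large archive.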


# Retrieval-Augmented Generation


A basic RAG pipeline is made up of a search step followed by a grounded
generation step where the LLM is prompted with the question and the information
retrieved from the search step.
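A minimal sketch of these two steps is shown below. The prompt wording is one possible template, and `search_fn` and `generate_fn` are hypothetical placeholders for a real search function and a real LLM call; they are passed in as arguments so the sketch does not assume any particular API.

```python
def build_grounded_prompt(question, retrieved_texts):
    """Combine the question and the retrieved texts into a single prompt."""
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(retrieved_texts, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

def answer_with_rag(question, search_fn, generate_fn, top_k=3):
    """Search step followed by a grounded generation step."""
    retrieved = search_fn(question)[:top_k]
    prompt = build_grounded_prompt(question, retrieved)
    return generate_fn(prompt)

# Stand-ins for illustration: a fixed search result and a "model" that echoes its prompt.
fake_search = lambda q: ["Interstellar is a 2014 epic science fiction film co-written, "
                         "directed, and produced by Christopher Nolan"]
echo_generate = lambda prompt: prompt

output = answer_with_rag("Who directed Interstellar?", fake_search, echo_generate)
print(output)
```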

RAG systems incorporate search capabilities in addition to generation capabilities.
They can be seen as an improvement to generation systems because they reduce
their hallucinations and improve their factuality. They also enable use cases of “chat
with my data” that consumers and companies can use to ground an LLM on internal
company data, or a specific data source of interest (e.g., chatting with a book).
This also extends to search systems. More search engines are incorporating an LLM
to summarize results or answer questions submitted to the search engine. Examples
include Perplexity, Microsoft Bing AI, and Google Gemini.