# Notebook Information

This notebook demonstrates how to implement a simple RAG pipeline with Chroma DB.

---

This notebook is maintained by:

**Name:** Ekaterina Antonova
**Email:** [ekaterina_antonova@epam.com](ekaterina_antonova@epam.com)

---

When discussing Retrieval-Augmented Generation (RAG), it's crucial to understand that four fundamental components form the foundation of a successful RAG system. While you can enhance your RAG by incorporating advanced techniques inside these components, the components themselves remain consistent. Here's a breakdown of each key component:

#### 1. Document Processing
This step involves extracting information from your documents, which is one of the most important aspects of RAG. Remember the rule of thumb: "Garbage in, garbage out." The quality of your RAG system heavily depends on how you process documents. This includes choosing strategies for extracting, cleaning, splitting, and chunking information. The better the document processing, the more reliable the output of your RAG will be.
#### 2. Retrieval Step
This is a critical step where documents are fetched from your data store, allowing the Large Language Model (LLM) to generate responses based on these documents. Choosing an effective retrieval strategy is crucial. Various methods can be employed, such as dense vectors, sparse vectors, full-text search, or hybrid search approaches. The goal is to retrieve the most relevant documents to enhance the accuracy of the system.
#### 3. Re-Rank Step (Optional)
In scenarios that require more precise ranking of the top 'n' results, re-ranking techniques are essential. While this step is optional, it's particularly useful for systems that need a higher degree of accuracy in the initial results. More advanced concepts and techniques for re-ranking will be covered in later topics.
#### 4. Result Representation
This component focuses on how the final output is presented to the user. It might involve summarization, chat capabilities, or other forms of analysis, often facilitated by an LLM. The goal here is to deliver the processed and retrieved information in a format that meets the user's needs effectively.

### Let's explore these concepts with examples. We will build a simple, vanilla RAG with the help of LangChain and Chroma DB

In [None]:
!pip3 install openai sentence_transformers chromadb pypdf tiktoken python-dotenv langchain langchain_community langchain_core langchain_openai langchain_text_splitters pydantic > /dev/null

In [None]:
import os
import time
import json
from typing import List

import dotenv
import openai
import tiktoken
from openai import AzureOpenAI
from sentence_transformers import SentenceTransformer

# Newer LangChain releases moved these classes out of the top-level `langchain` package
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_core.embeddings import Embeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings

# Load OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT from a local .env file, if one exists
dotenv.load_dotenv()

### 1. Document processing
#### Let's start with the file. For our example, we will take a PDF from the life sciences domain.

In [None]:
url_pdf = "https://www.delstrigo.com/wp-content/uploads/sites/32/2023/09/DELSTRIGO-Patient-Education-Brochure-PDF.pdf"

# Create your PDF loader
loader = PyPDFLoader(url_pdf)

# Load the PDF document
documents = loader.load()

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

In [None]:
print("Total text chunks: ", len(docs))
print("\033[1mContent of chunk:\033[0m \n", docs[0].page_content)

We have shown the basics of text extraction and splitting, using tools readily available in the LangChain library. For a simple implementation, a vanilla text extractor and splitter were employed. More advanced techniques for text splitting will be covered in future lectures.

For now, let's focus on a few key considerations when splitting text into chunks.
#### Setting the chunk_size Parameter

An important parameter in text splitting is chunk_size. In our example, we set it to 1024, but this value can vary based on several factors, such as:

* __Model's Context Length__: The maximum input size a model can handle.
* __Semantic Meaning__: The need to preserve meaningful segments within each chunk.
* __Model's Ability to Retain Context__: Some models can manage larger contexts better than others.
 
Choosing an appropriate chunk_size ensures that the model can effectively process and retain the information within each chunk.
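
To make the effect of `chunk_size` more concrete, here is a minimal sketch of fixed-size character chunking with an optional overlap. The helper `split_by_chars` is hypothetical and written only for illustration; it is much simpler than `RecursiveCharacterTextSplitter`, which also tries to break on natural separators such as paragraphs and sentences.

```python
from typing import List


def split_by_chars(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Hypothetical helper for illustration only: it cuts at fixed offsets
    and ignores sentence or paragraph boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij" * 3  # 30 characters of toy text
print([len(c) for c in split_by_chars(sample, chunk_size=12)])
print([len(c) for c in split_by_chars(sample, chunk_size=12, chunk_overlap=4)])
```

With an overlap, neighbouring chunks share some characters, so the same text produces more chunks. This is the usual trade-off: overlap helps keep context across chunk borders at the cost of a larger index.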


### NB 
When using __RecursiveCharacterTextSplitter__, the chunk size is measured in characters. However, keep in mind that most models have their context window size measured in __tokens__, not characters.

__Tokens__: These are smaller units of text, such as words or subwords. The process of breaking down text into tokens is known as tokenization.

### Measuring Tokens in Text

To determine how many tokens are in a given piece of text, you can use the tokenizer associated with the model. For OpenAI's models, the tokenizer library is tiktoken. Here's a simple example of how to use tiktoken to count tokens in your text.
To read more, see this introduction to [tokenization](https://www.analyticsvidhya.com/blog/2020/05/what-is-tokenization-nlp/) and the [OpenAI cookbook](https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken) guide on counting tokens with tiktoken.

In [None]:
content = docs[0].page_content

# Simple length in characters
print("Characters: ", len(content))

# Count length in tokens
enc = tiktoken.encoding_for_model("gpt-4o")
print("Tokens: ", len(enc.encode(content)))

### 2. Retrieval Step

In this step, we'll explore examples using dense vectors and a hybrid approach. 

Full-text search (BM25) is typically used as the baseline for many retrieval systems. It matches terms in the query against terms in the documents, and it can serve as a baseline for RAG as well. In RAG, however, the standard approach is similarity search: both the text and the query are embedded into dense vectors (also called embeddings) using a model, and the retrieval process then calculates the distance between these vectors to rank the most relevant documents.
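
To illustrate what term matching looks like in practice, below is a minimal BM25 sketch in plain Python. It is an illustration under stated assumptions, not a production implementation: the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the toy corpus are all choices made for this example.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Toy tokenizer (assumption): lowercase and split on whitespace
    return text.lower().split()


def bm25_scores(query: str, corpus: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 score per document in corpus for the given query."""
    if not corpus:
        return []
    docs = [tokenize(d) for d in corpus]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher weight than common ones
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalisation: long documents are penalised
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus written for this example
corpus = [
    "chroma stores dense vectors",
    "bm25 ranks documents by matching terms",
    "dense vectors capture semantic similarity",
]
scores = bm25_scores("dense vectors", corpus)
for text, score in zip(corpus, scores):
    print(f"{score:.3f}  {text}")
```

Documents that share no term with the query score zero, which is the main weakness of pure term matching: a relevant chunk that uses different wording is never retrieved. Dense vectors address exactly this case.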

#### Understanding Similarity Search

__Distance Metrics__: To rank the vectors, you measure the "distance" between the document vectors and the query vector. Common metrics include:
 * Cosine Similarity: Measures the cosine of the angle between two vectors.
 * L2 (Euclidean) Distance: Measures the straight-line distance between vectors.
 * Inner Product: Calculates the dot product between vectors.

Choosing the appropriate distance metric is crucial and often depends on the specific use case and the capabilities of the vector store being used.
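
The three metrics can be sketched in a few lines of plain Python. The function names and the two-dimensional toy vectors below are assumptions made for this illustration; real embeddings have hundreds of dimensions, and vector stores compute these metrics with optimized code.

```python
import math
from typing import Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    # Dot product: sum of element-wise products
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    # Straight-line (Euclidean) distance between the two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Dot product divided by the product of the vector lengths
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


# Toy vectors (assumption): hand-picked to show that metrics can disagree
query_vec = [1.0, 0.0]
doc_vecs = {"doc_a": [2.0, 0.0], "doc_b": [0.6, 0.8]}

for name, vec in doc_vecs.items():
    print(
        name,
        "cosine:", round(cosine_similarity(query_vec, vec), 3),
        "l2:", round(l2_distance(query_vec, vec), 3),
        "inner:", round(inner_product(query_vec, vec), 3),
    )
```

In this toy case, cosine similarity ranks `doc_a` first because it points in the same direction as the query, while L2 distance ranks `doc_b` first because `doc_a` has a larger magnitude. When vectors are normalized to unit length, the rankings from the three metrics agree.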

__Embedding__: Convert both your documents and the query into dense vectors using an embedding model.

__Choosing a Model for Dense Vector Retrieval__

Once you have decided on the distance metric, you also need to select an embedding model for dense vector retrieval. When choosing one, consider:
 * Context Length: The maximum input size the model can handle.
 * Model Size: Larger models might be more powerful but can be more computationally expensive.
 * Cost: Commercial models often have usage fees.
 * Domain Specificity: Some models are trained on specific domains, which may make them more effective for certain areas.

Many large language model (LLM) providers, such as OpenAI, offer their own embedding models.
Numerous open-source embedding models are also available, so there is no need to limit yourself to commercial providers.

To identify top-performing models, you can refer to the [MTEB (Massive Text Embedding Benchmark)](https://huggingface.co/spaces/mteb/leaderboard), which provides rankings and evaluations of various embedding models. 

More complex retrieval approaches will be covered in later lessons.

In [None]:
## Instantiate the Azure OpenAI chat model
# Note: this is a chat LLM, not an embedding model; OpenAI embeddings would use AzureOpenAIEmbeddings instead

azure_llm = AzureChatOpenAI(
  api_key         = os.environ['OPENAI_API_KEY'],
  api_version     = "2023-07-01-preview",
  azure_endpoint  = "https://ai-proxy.lab.epam.com",
  model           = "gpt-4o-mini-2024-07-18",
  temperature     = 0.0
)


In [None]:
## How to create embeddings with an open-source model from the Hugging Face Hub (https://huggingface.co)

# We need to write a wrapper over the SentenceTransformer class due to an
# implementation limitation inside LangChain. If you use the model outside
# LangChain, the wrapper is not necessary.

class SentenceTransformerEmbeddings(Embeddings):
    def __init__(self, model_name: str):
        self.model = SentenceTransformer(model_name)

    def embed_documents(self, documents: List[str]) -> List[List[float]]:
        # Encode all documents in one batched call
        return self.model.encode(documents).tolist()

    def embed_query(self, query: str) -> List[float]:
        return self.model.encode(query).tolist()

In [None]:
model_name = 'BAAI/bge-base-en-v1.5'
embedding = SentenceTransformerEmbeddings(model_name)

In [None]:
# Load the documents into Chroma using the SentenceTransformers embedding defined above
# (an Azure OpenAI embedding object could be passed in the same way)

db = Chroma.from_documents(docs, embedding, collection_metadata={"hnsw:space": "cosine"})

In [None]:
# If you want to use different embeddings or a different distance metric, you need to delete the collection and reindex everything

# db.delete_collection()  # delete the collection

In [None]:
query = "what is the side effects of taking Delstrigo?"

In [None]:
## Standard vanilla similarity search

retrieved_docs = db.similarity_search_with_score(query)
retrieved_docs

### Hybrid search
Hybrid search combines different search techniques (such as sparse vectors, dense vectors, and full-text search) to improve the accuracy and relevance of search results in information retrieval systems.
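
One simple way to combine the two signals is to filter by keyword first and then rank the surviving chunks by vector similarity. The sketch below shows that idea in plain Python. The `hybrid_search` helper, the toy texts, and the hand-made two-dimensional vectors are assumptions for illustration; in a real system the vectors come from an embedding model and the filtering is done by the vector store.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hybrid_search(
    query_vec: Sequence[float],
    keyword: str,
    chunks: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Keep chunks whose text contains keyword, then rank them by cosine similarity."""
    # Full-text part: a case-sensitive substring filter
    candidates = [(text, vec) for text, vec in chunks if keyword in text]
    # Dense part: score the remaining chunks against the query vector
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy index (assumption): text paired with a hand-made vector
chunks = [
    ("alpha report on vectors", [1.0, 0.0]),
    ("beta notes on vectors", [0.0, 1.0]),
    ("alpha summary", [0.7, 0.7]),
    ("gamma appendix", [0.9, 0.1]),
]
results = hybrid_search(query_vec=[1.0, 0.1], keyword="alpha", chunks=chunks)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

Note that "gamma appendix" is very close to the query vector but is dropped because it does not contain the keyword. A hard filter like this improves precision but can hurt recall, which is why more advanced hybrid approaches combine the scores instead of filtering.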

In [None]:
# Now let's perform a vanilla hybrid search by combining similarity search with full-text search.
# The vanilla approach is to use a text query as a filter and return only chunks that contain the given phrase

docs_hybrid = db.similarity_search_with_score(query, k=10, where_document={"$contains": "side effects"})
docs_hybrid

### 3. Re-Rank Step (Optional)
We will cover this in later lessons.

### 4. Result Representation
Let's look at how we can answer the question based on the retrieved results and present the answer in a more user-friendly form.
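
Before calling the model, it usually helps to turn the retrieved chunks into a clean, readable context string. The following is a minimal sketch of that formatting step; the helper name `build_user_message` and the message layout are assumptions for illustration, not a required format.

```python
from typing import List


def build_user_message(query: str, chunks: List[str]) -> str:
    """Format the question and numbered context chunks into one user message."""
    # Number each chunk so the answer can be traced back to its source
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return f"Question: {query}\n\nContext:\n{context}"


# Toy chunks written for this example
message = build_user_message(
    "What does the notebook use as a vector store?",
    ["  The notebook stores embeddings in Chroma.  ", "Chunks are 1024 characters long."],
)
print(message)
```

Passing plain text instead of raw Python objects keeps the prompt shorter and avoids sending metadata and object representations to the model.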

In [None]:
client = AzureOpenAI(
    api_key=os.environ['OPENAI_API_KEY'],
    api_version="2023-07-01-preview",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    azure_deployment='gpt-4-1106-preview'
)


start = time.time()

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
)

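# Pass only the chunk text to the model, not the repr of (Document, score) tuples
context = "\n\n".join(doc.page_content for doc, _score in retrieved_docs)

response = client.chat.completions.create(
    model='gpt-4-1106-preview',
    temperature=0,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Query: {query}\n\nDocs:\n{context}"},
    ],
)

print(f"Took {time.time() - start:.2f} seconds to summarize documents with GPT-4.")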

In [None]:
response.choices[0].message.content