<a href="https://colab.research.google.com/github/anastaszi/GenAI/blob/main/RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval-Augmented Generation
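Retrieval-Augmented Generation, abbreviated RAG, is an architecture used in natural language processing (NLP) that combines a retrieval model, which finds relevant information in an external collection of text, with a generation model, which uses that information to produce a response. It can be applied to a variety of language understanding and generation tasks.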

RAG applications typically consist of a few different components:

1. **Retrieval Component**: The retrieval component is responsible for searching a large database of text to find relevant information. It usually employs an information retrieval (IR) system or a pre-trained retrieval model like a dense retriever, which ranks documents or passages based on their relevance to a given query.

2. **Generation Component**: The generation component is a language model, often a variant of the Transformer architecture, that can generate human-like text. It is capable of taking retrieved passages or documents and generating coherent and contextually relevant responses or completions.

3. **Interaction**: RAG models combine these two components in a way that allows them to interact. Typically, the retrieved passages or documents serve as additional context for the generation model. This context can help the generation model produce more informed and contextually relevant responses.

4. **Applications**: RAG models are versatile and can be used for a wide range of NLP tasks, including question answering, text summarization, chatbots, and more. For instance, in question answering, the retrieval component can identify relevant passages containing answers, and the generation component can generate concise and accurate responses based on that context.

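5. **Fine-Tuning**: RAG models can be fine-tuned on task-specific data to optimize their performance for a particular application, but this step is optional: many RAG pipelines, including the one in this tutorial, use pre-trained retrieval and generation models without any fine-tuning.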

6. **Efficiency**: RAG models can be more efficient than traditional approaches for some tasks, especially when dealing with large-scale document collections. Because retrieval narrows the collection down to a few relevant passages, the generation model only has to process those passages rather than the entire corpus.

RAG is particularly valuable when dealing with tasks that require access to external knowledge sources, such as open-domain question answering. It allows models to retrieve relevant information from a vast corpus of text and use that information to generate coherent and contextually appropriate responses.
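To make the interaction between the two components concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not how any particular library implements RAG: `retrieve`, `augment`, and `rag_answer` are hypothetical names, the retriever simply counts shared words instead of using a dense retriever, and `generate` stands for any function that maps a prompt to text (in this tutorial that role is played by an OpenAI model).

```python
from typing import Callable


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by how many distinct query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,  # sorted() stays stable, so ties keep their corpus order
    )
    return ranked[:k]


def augment(query: str, passages: list[str]) -> str:
    """Put the retrieved passages into the prompt as extra context for the generator."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}\nAnswer:"


def rag_answer(query: str, corpus: list[str], generate: Callable[[str], str], k: int = 2) -> str:
    """Retrieve, augment, then pass the augmented prompt to any prompt -> text generator."""
    return generate(augment(query, retrieve(query, corpus, k)))


corpus = [
    "the museum opens at nine",
    "the bakery sells bread",
    "the museum is closed on monday",
]
# An identity "generator" just returns its prompt, which makes the augmentation step visible.
prompt_echo = rag_answer("when does the museum open", corpus, generate=lambda prompt: prompt)
print(prompt_echo)
```

Swapping the identity function for a real model call is the only change needed to turn this into a (very naive) RAG pipeline; the retriever is what keeps the unrelated bakery document out of the prompt.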

## RAG vs. Fine-Tuning
In general, RAG is a way to tailor a model's responses to a specific body of content. Fine-tuning is another method aimed at this, but it is often both compute-intensive and dependent on large amounts of labeled data. RAG offers an alternative to fine-tuning that requires neither extensive compute nor labeled data.

Before we get started with this tutorial, we'll need to install a few additional libraries (if they are not installed already):

In [None]:
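# datasets provides load_dataset; tiktoken is used by LangChain's OpenAIEmbeddings
%pip install langchain pypdf openai chromadb datasets tiktoken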

For this tutorial, we'll be using the `cnn_dailymail` dataset from Hugging Face. The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading comprehension and abstractive question answering.

In [None]:
from datasets import load_dataset
# Load the dataset from Huggingface
dataset = load_dataset("cnn_dailymail", "3.0.0")

# Visually inspect
dataset

In [None]:
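# Limit to the first 50 rows as a sample (select them before converting to pandas,
# so the full training split is not copied into a DataFrame)
filtered_pdf = dataset["train"].select(range(50)).to_pandas()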

# Print out the articles to visually inspect
for i in filtered_pdf["article"]:
    print(i)

This demo uses OpenAI models for both the embeddings and text generation. To use these models, you need to configure an OpenAI API key (API usage is billed to your OpenAI account):

In [None]:
# Configure OpenAI key
import os
import getpass

os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')


## Indexing Content
Indexing content in a Retrieval Question-Answering (QA) chain for Large Language Models (LLMs) is a crucial step in the pipeline to enable efficient and accurate information retrieval. Retrieval QA chains combine the strengths of both retrieval-based and generation-based approaches to answer questions or provide information. Here's an overview of the concept:

1. **Retrieval Component**:
   
   - In a Retrieval QA chain, the process begins with the retrieval component. This component is responsible for searching and identifying relevant documents, passages, or content from a large corpus of data. It doesn't generate answers directly but narrows down the search space.

   - **Indexing**: The first step in indexing content involves creating an index of the documents or passages that the QA system will search. This index is designed for efficient and fast lookup based on specific retrieval criteria.

   - **Preprocessing**: Text within documents is preprocessed before it is indexed. For keyword-based (sparse) retrieval, common steps include tokenization, stemming, and stop-word removal; for embedding-based (dense) retrieval, documents are typically split into chunks and each chunk is encoded into a numerical representation (an embedding).

   - **Vector Embeddings**: Many modern retrieval systems use vector embeddings to represent documents or passages. Each document is transformed into a dense vector in a high-dimensional space. This allows for efficient similarity computations between queries and documents.

2. **Question Representation**:

   - The next step is to represent the user's question as a query. The question is encoded into a vector using the same embedding model that was used for the documents, so that it can be compared directly against the documents in the retrieval index.

3. **Scoring and Ranking**:

   - Once the question is represented as a vector, a similarity score is computed between the question vector and the vectors of indexed documents. Common similarity measures include cosine similarity or dot product.

   - Documents are ranked by their similarity scores to the question. The most similar documents are treated as candidates for answering the question.

4. **Passage Selection**:

   - After ranking the documents, the retrieval system may select specific passages or segments within the documents that are most likely to contain the answer. This step helps in reducing the amount of text that needs to be processed by the generation component.
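The four steps above can be sketched in plain Python with no external services. This is a toy illustration under stated assumptions, not how Chroma or OpenAI embeddings work: it swaps in a bag-of-words "embedding" (term counts after lowercasing and removing a small, illustrative stop-word list) for a learned dense embedding model, treats each sentence as a passage, and scores with cosine similarity. All function names are hypothetical.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (an assumption; real systems use larger lists, or none for dense retrieval)
STOP_WORDS = {"the", "a", "an", "is", "of", "in", "to", "and", "on", "for", "with"}


def preprocess(text: str) -> list[str]:
    """Preprocessing: lowercase, tokenize on alphanumeric runs, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def embed(text: str) -> Counter:
    """Toy sparse 'embedding' (term counts); a stand-in for a learned dense embedding model."""
    return Counter(preprocess(text))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def split_passages(document: str) -> list[str]:
    """Split a document into sentence-level passages so retrieval can select part of it."""
    return [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]


def build_index(documents: list[str]) -> list[tuple[int, str, Counter]]:
    """Indexing: embed every passage once, remembering which document it came from."""
    return [
        (doc_id, passage, embed(passage))
        for doc_id, doc in enumerate(documents)
        for passage in split_passages(doc)
    ]


def search(index: list[tuple[int, str, Counter]], question: str, k: int = 2) -> list[tuple[float, int, str]]:
    """Embed the question the same way, score every passage, and return the top-k passages."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, vec), doc_id, passage) for doc_id, passage, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


toy_docs = [
    "Fans of Harry Potter queued outside the cinema. The film opened on Friday.",
    "The stock market fell sharply on Monday. Investors were nervous.",
]
toy_index = build_index(toy_docs)
for score, doc_id, passage in search(toy_index, "Where did Harry Potter fans queue?", k=1):
    print(f"{score:.2f} doc={doc_id}: {passage}")  # 0.50 doc=0: Fans of Harry Potter queued outside the cinema.
```

Note that "queue" in the question does not match "queued" in the passage, because this sketch does no stemming; a learned embedding model would usually place such related words close together.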

The key advantage of this Retrieval QA chain is that it combines the precision of retrieval-based systems (which are good at finding relevant documents) with the flexibility and language understanding of generation-based models (which are good at generating human-like answers). By indexing content efficiently and selecting relevant passages, the retrieval component narrows down the search space for the generation component, resulting in faster and more accurate responses to user queries.

Overall, indexing content is a critical step in enabling efficient and effective question-answering systems powered by LLMs, particularly in scenarios where the information to be retrieved is extensive and diverse.

To index the `cnn_dailymail` content, we first need to create a list of the texts to be indexed.

In [None]:
# Convert the article column into a plain Python list of strings that Chroma can index
article_content = filtered_pdf["article"].to_list()

### `Chroma`
`Chroma` is a lightweight, open-source vector database. A vector database, also known as a vector store or vector database management system, is a specialized type of database designed to efficiently store, manage, and query high-dimensional vector data. In a vector database, data is represented as vectors of numerical values, and the primary focus is on similarity search, retrieval, and analysis of those vectors. These databases are commonly used wherever similarity or distance between data points matters, such as content-based recommendation systems, image similarity search, and natural language processing (NLP).

Here are some key characteristics and features of vector databases:

1. **Vector Data Storage**: Vector databases are optimized for storing high-dimensional vector data efficiently. They use data structures and indexing methods that enable fast retrieval of vectors based on similarity or distance metrics.

2. **Similarity Search**: The primary function of a vector database is to perform similarity search. Users can query the database with a vector, and the database returns vectors from the stored data that are most similar to the query vector based on a specified distance metric (e.g., cosine similarity, Euclidean distance).

3. **Indexing**: Vector databases employ indexing techniques designed specifically for vector data. These index structures enable fast lookup and retrieval of vectors by similarity. Examples include tree structures such as ball trees and approximate nearest neighbor (ANN) indexes such as HNSW (hierarchical navigable small world) graphs.

4. **Scalability**: Many vector databases are designed to scale horizontally, making them suitable for handling large datasets and distributed deployments. This is crucial for applications that involve vast amounts of vector data.

5. **Multimodal Data**: Some vector databases support multimodal data, meaning they can handle vectors representing different types of data, such as text, images, audio, or sensor data, in a unified manner.

6. **Integration**: Vector databases are often integrated with other components of data pipelines, such as machine learning models or recommendation engines, to enable efficient similarity-based operations.

7. **Dimensionality Reduction**: Some vector databases offer dimensionality reduction or compression techniques, such as Principal Component Analysis (PCA) or vector quantization, to shrink vector data for more efficient storage and querying.

8. **Real-Time and Batch Processing**: Depending on the use case, vector databases can support real-time or batch processing of vector data, making them suitable for various applications, including real-time recommendations and offline analysis.
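Similarity search with a chosen distance metric (points 2 and 3 above) can be illustrated with a brute-force store. `TinyVectorStore` below is hypothetical and is not Chroma's API; real vector databases replace the linear scan with an approximate index. The usage lines show that cosine similarity and Euclidean distance can rank the same vectors differently, because cosine similarity ignores vector length.

```python
import heapq
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


class TinyVectorStore:
    """Hypothetical in-memory vector store using exact (brute-force) search."""

    def __init__(self) -> None:
        self._ids: list[str] = []
        self._vectors: list[list[float]] = []

    def add(self, item_id: str, vector: list[float]) -> None:
        self._ids.append(item_id)
        self._vectors.append(vector)

    def query(self, vector: list[float], k: int = 1, metric: str = "cosine") -> list[str]:
        """Return the ids of the k stored vectors most similar to `vector` under `metric`."""
        pairs = zip(self._ids, self._vectors)
        if metric == "cosine":
            # Higher cosine similarity means more similar.
            scored = [(cosine_similarity(vector, v), item_id) for item_id, v in pairs]
            return [item_id for _, item_id in heapq.nlargest(k, scored)]
        if metric == "euclidean":
            # Smaller Euclidean distance means more similar.
            scored = [(math.dist(vector, v), item_id) for item_id, v in pairs]
            return [item_id for _, item_id in heapq.nsmallest(k, scored)]
        raise ValueError(f"unknown metric: {metric!r}")


store = TinyVectorStore()
store.add("far_same_direction", [10.0, 1.0])
store.add("near_other_direction", [1.5, 1.0])

query_vector = [2.0, 0.2]
print(store.query(query_vector, metric="cosine"))     # ['far_same_direction']
print(store.query(query_vector, metric="euclidean"))  # ['near_other_direction']
```

Which metric is appropriate depends on how the embeddings were trained; for vectors normalized to unit length, cosine similarity and Euclidean distance produce the same ranking.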

Vector databases play a crucial role in many modern AI and data-driven applications that require the efficient retrieval and analysis of high-dimensional vector data. They enable content-based recommendations, similarity-based search, clustering, and other operations that rely on measuring similarity or distance between data points.

In [None]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

# Create the index using Chroma (each article is embedded with OpenAI embeddings)
chroma_index = Chroma.from_texts(article_content, OpenAIEmbeddings())

# Test the index with a similarity search, returning only the top match
docs = chroma_index.similarity_search("harry potter", k=1)

Take a look at the result of the test similarity search:

In [None]:
# Inspect the returned page content
print(docs[0].page_content)

## Configure the Chain
Now that we've indexed all the relevant content, let's configure the chain. A chain is a concept in `LangChain`, described in its documentation as follows:

> Using an LLM in isolation is fine for simple applications, but more complex applications require chaining LLMs - either with each other or with other components. LangChain provides the Chain interface for such "chained" applications. We define a Chain very generically as a sequence of calls to components, which can include other chains.

These chains are very useful for stringing components together into a cohesive end-to-end pipeline.
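The idea of a chain as "a sequence of calls to components, which can include other chains" can be sketched without LangChain. `SimpleChain` below is a hypothetical illustration, not LangChain's `Chain` class: each step is assumed to take a dict of values and return new keys to merge in, and because a `SimpleChain` has the same shape as a step, chains can nest. The components `lookup`, `make_prompt`, and `fake_llm` are hypothetical stand-ins for a retriever, a prompt template, and an LLM.

```python
from typing import Any, Callable

# A step reads the current state and returns new keys to merge into it (an assumed convention).
Step = Callable[[dict[str, Any]], dict[str, Any]]


class SimpleChain:
    """Hypothetical chain: calls its steps in order, threading a shared dict through them."""

    def __init__(self, *steps: Step) -> None:
        self.steps = steps

    def __call__(self, inputs: dict[str, Any]) -> dict[str, Any]:
        state = dict(inputs)  # copy so the caller's dict is not mutated
        for step in self.steps:
            state = {**state, **step(state)}
        return state


def lookup(state: dict[str, Any]) -> dict[str, Any]:
    # Stand-in for a retriever.
    return {"context": "Paris is the capital of France."}


def make_prompt(state: dict[str, Any]) -> dict[str, Any]:
    # Stand-in for a prompt template.
    return {"prompt": f"Context: {state['context']}\nQuestion: {state['question']}"}


def fake_llm(state: dict[str, Any]) -> dict[str, Any]:
    # Stand-in for a model call.
    return {"answer": "Paris" if "Paris" in state["prompt"] else "I don't know"}


# A SimpleChain is itself a valid step, so chains can contain other chains.
qa = SimpleChain(SimpleChain(lookup, make_prompt), fake_llm)
result = qa({"question": "What is the capital of France?"})
print(result["answer"])  # Paris
```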

In [None]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

chain = RetrievalQA.from_chain_type(OpenAI(), chain_type="stuff", retriever=chroma_index.as_retriever())

## Query the Chain to test knowledge
Now that the chain is configured, we can easily test it with a question:

In [None]:
query = "What's going on with the Harry Potter actor these days?"

chain.run(query)

## Enhancing the Chain with Prompt Engineering
Now that we've validated that the existing chain works with the LLM and the vectorized content, we can look at incorporating prompt templating. Because we used the `RetrievalQA` chain from `LangChain`, we have so far relied on the prompt built into that chain. However, prompts play a large role in the relevance of the responses these LLMs produce, so it's important to know how to supply a custom prompt when necessary.
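To see roughly what the model receives, the sketch below fills a prompt shaped like the custom `human_message` template defined in the next cell, using plain `str.format`. It assumes, in the spirit of the `chain_type="stuff"` setting used above, that all retrieved passages are concatenated into the single `{context}` slot; the helper name `stuff_prompt` and the two-newline separator are illustrative assumptions, not LangChain internals.

```python
HUMAN_TEMPLATE = """### Given the CONTEXT, answer the QUESTION.
CONTEXT: {context}
###
QUESTION: {question}
###
"""


def stuff_prompt(passages: list[str], question: str, separator: str = "\n\n") -> str:
    """Hypothetical helper: join ("stuff") all passages into one context, then fill the template."""
    return HUMAN_TEMPLATE.format(context=separator.join(passages), question=question)


prompt = stuff_prompt(
    ["First retrieved article text.", "Second retrieved article text."],
    "What happened?",
)
print(prompt)
```

Because every passage lands in a single prompt, this approach is simple but bounded by the model's context window, which is one reason the retriever returns only a few documents rather than the whole index.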

In [None]:
from langchain.prompts import ChatPromptTemplate
from langchain.prompts.chat import SystemMessage, HumanMessagePromptTemplate
human_message = """### Given the CONTEXT, answer the QUESTION.
CONTEXT: {context}
###
QUESTION: {question}
###
"""

system_message = """You are a cheerful, helpful assistant that answers questions using the context given.
If you cannot answer the question using the context, then just say `I don't know`"""

template = ChatPromptTemplate.from_messages(
    [
        SystemMessage(content=(system_message)),
        HumanMessagePromptTemplate.from_template(human_message)
    ]
)

See [here](https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api) for some general guidelines of OpenAI prompting best practices.

In [None]:
# Pass the prompt template via the chain_type_kwargs parameter
chain_type_kwargs = {"prompt": template}
chain_with_prompting = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff",
                                   retriever=chroma_index.as_retriever(), chain_type_kwargs=chain_type_kwargs)

Test the new chain with the custom prompt template:

In [None]:
print(chain_with_prompting.run(query))

## Summary
This tutorial is a hands-on introduction to the Retrieval-Augmented Generation (RAG) pipeline. Working with a sample of real news articles from the `cnn_dailymail` dataset, it shows how to index documents in a vector store (`Chroma`), retrieve the passages most relevant to a query, and pass them as context to a generation model, first with the default `RetrievalQA` prompt and then with a custom prompt template. The same pattern extends to tasks such as question answering and summarization over your own content. Whether you're new to RAG or looking to deepen your understanding, this notebook provides a clear demonstration of its core functionality.