# Retrieval Augmented Generation (RAG)
## IMD1107 - Natural Language Processing
### [Dr. Elias Jacob de Menezes Neto](https://docente.ufrn.br/elias.jacob)

## Summary

### Keypoints

- Retrieval-Augmented Generation (RAG) enhances traditional generative models by integrating information retrieval systems.

- RAG addresses challenges like limited knowledge scope, factual inaccuracy, and contextual irrelevance in generative AI.

- The RAG architecture consists of a retriever component to search external knowledge sources and a generator to produce responses.

- Vector databases like ChromaDB are used to efficiently store and query embedding data for RAG systems.

- Proper chunking of documents is crucial for effective storage and retrieval in RAG systems.

- Different embedding models and LLMs can be combined to create varied RAG implementations.

### Takeaways

- RAG systems significantly improve the accuracy and relevance of AI-generated responses by utilizing external knowledge.

- Careful consideration of chunking strategies, embedding models, and retrieval methods is essential for optimal RAG performance.

- The choice of vector database and embedding model impacts the efficiency and effectiveness of information retrieval.

- Integrating retrievers with LLMs using tools like LangChain enables flexible and powerful RAG implementations.

- Maximum Marginal Relevance (MMR) search can enhance result diversity and reduce redundancy in retrieved information.

- RAG systems can be adapted for various domains and use cases, making them versatile for knowledge-intensive AI applications.

# Introduction to Retrieval-Augmented Generation (RAG)

## Definition and Concept
Retrieval-Augmented Generation (RAG) is an AI framework that enhances traditional generative models by integrating information retrieval systems. At generation time, the model is given passages retrieved from an external knowledge source, so its output can be coherent and contextually relevant while also being more accurate and up-to-date. By combining the strengths of retrieval and generation, RAG suits applications that require precise, contextually enriched information.

### Key Principles of RAG
- Retrieval of relevant information from an external knowledge base
- Integration of retrieved information into the generation process
- Improved context awareness and factual accuracy in generated outputs
- Enhanced flexibility and scalability through external knowledge sources
- Real-time access to up-to-date information for dynamic applications




## What problems does RAG solve?

Retrieval-Augmented Generation (RAG) addresses several key challenges in the field of natural language processing (NLP) and artificial intelligence (AI). Below are some of the primary problems that RAG aims to solve:

1. Limited Knowledge Scope of Generative Models
    - **Problem**: Traditional generative models (e.g., GPT-3) are limited by the static knowledge they acquired during their training phase. They cannot access or utilize information that was not included in their training data, making them less effective in providing up-to-date or domain-specific information.
    - **RAG Solution**: RAG incorporates an information retrieval component that allows the model to fetch relevant, up-to-date information from external knowledge sources. This expands the knowledge base beyond the training data and enables the generation of more current and relevant content.

2. Factual Inaccuracy
    - **Problem**: Generative models can produce text that is grammatically correct and coherent but factually incorrect. This is particularly problematic in applications where accuracy is critical, such as healthcare, legal advice, and academic writing.
    - **RAG Solution**: By retrieving factual information from reliable external sources and integrating it into the generation process, RAG improves the factual accuracy of the generated text. Responses are grounded in retrieved evidence, which makes them more likely to be factually correct as well as linguistically sound.

3. Contextual Irrelevance
    - **Problem**: Generative models often struggle with maintaining context, especially in multi-turn conversations or complex queries. They may provide responses that are contextually irrelevant or inconsistent with previous information.
    - **RAG Solution**: The retrieval component of RAG allows the model to consider a broader and more relevant context, leading to more coherent and contextually appropriate responses. This is particularly beneficial for dialogue systems and complex question-answering tasks.

4. Data Sparsity and Training Limitations
    - **Problem**: Generative models require large amounts of diverse training data to perform well across different domains. However, gathering and curating such extensive datasets can be resource-intensive and time-consuming.
    - **RAG Solution**: RAG reduces the reliance on extensive training data by using external knowledge bases. This means that even with less training data, RAG can still perform effectively across various topics and domains by retrieving and using relevant external information.

5. Inflexibility and Scalability
    - **Problem**: Traditional models need to be retrained to incorporate new information or adapt to different domains, which can be a costly and time-consuming process.
    - **RAG Solution**: RAG systems can be easily updated by modifying the external knowledge source without requiring retraining of the entire model. This makes RAG more flexible and adaptable to new information and different domains, enhancing its scalability.

6. Real-Time Information Retrieval
    - **Problem**: Many applications require real-time access to the most current information. Traditional generative models, which rely solely on pre-trained data, cannot meet this demand effectively.
    - **RAG Solution**: RAG's retrieval mechanism enables real-time access to the latest information, making it suitable for applications that require immediate and current responses, such as news generation or real-time customer support.

## Advantages of RAG
RAG offers several advantages over traditional generation methods:

1. **Enhanced Factual Accuracy**: By retrieving relevant information from a reliable external source, RAG can generate responses that are more factually accurate and consistent with real-world knowledge.

2. **Improved Context Awareness**: The retrieval component allows RAG to consider a wider range of contextual information, enabling it to generate more coherent and contextually appropriate responses.

3. **Reduced Reliance on Training Data**: RAG can capitalize on external knowledge sources, reducing the need for extensive training data to cover various topics and domains.

4. **Flexibility and Adaptability**: RAG systems can be easily adapted to different domains or updated with new knowledge by modifying the external knowledge source, without requiring retraining of the entire model.


## Use Cases and Applications
RAG finds applications in various real-world scenarios where accurate and contextually relevant language generation is crucial:

- **Question Answering**: RAG can retrieve relevant information to provide accurate and informative answers to user queries.

- **Dialogue Systems**: RAG enables more engaging and contextually appropriate conversations in chatbots and virtual assistants.

- **Content Generation**: RAG can assist in generating articles, summaries, or descriptions by retrieving relevant facts and details from external sources.

- **Knowledge-Intensive Tasks**: RAG is particularly useful in domains that require access to a large amount of factual knowledge, such as healthcare, finance, or legal services.

# RAG Architecture and Overview

## Components of a RAG System
A typical RAG system consists of two main components:

1. **Retriever**: The retriever is responsible for searching and retrieving relevant information from an external knowledge source based on the input query or context. It uses techniques like similarity search or information retrieval to find the most relevant pieces of information.

2. **Generator**: The generator is a language model that takes the retrieved information as input and generates the final output response. It integrates the retrieved knowledge into the generation process to produce informative and contextually appropriate responses.


<p align="center">
<img src="images/rag.webp" alt="" style="width: 50%; height: 50%"/>
</p>
 
The diagram illustrates the components and workflow of a Retrieval-Augmented Generation (RAG) application. Here’s a detailed explanation of each component:

### Data Preparation
1. **Raw Data Sources (A)**: This is the initial stage where raw data is collected from various sources such as documents, PDFs, web pages, etc.

2. **Information Extraction (B)**: This step involves extracting relevant information from the raw data. Techniques like Optical Character Recognition (OCR), PDF data extraction, and web crawlers are used to convert unstructured data into a structured format.

3. **Chunking (C)**: The extracted information is then divided into smaller, manageable chunks. This process helps in handling large documents and ensures that the data can be processed efficiently.

4. **Embedding (D)**: Each chunk of data is converted into a vector representation (embedding). This numerical representation captures the semantic meaning of the text, making it easier to compare and retrieve relevant information.

### Retrieval-Augmented Generation (RAG) Workflow
1. **Query (1)**: A user query is received, which initiates the RAG process.

2. **Embedding (2)**: The query is also converted into an embedding (vector representation) so that it can be compared with the stored data embeddings.

3. **Vector Database**: The data embeddings are stored and managed in a vector database, a specialized database optimized for handling and retrieving vector data. The query embedding is sent to this database to be compared against the stored embeddings.

4. **Relevant Data (3)**: The vector database is queried to find the most relevant data chunks that match the query embedding. These relevant data chunks are then retrieved.

5. **LLM(s) (4)**: The retrieved relevant data is fed into a Large Language Model (LLM). The LLM uses this data to generate a response that is contextually accurate and relevant to the query.

6. **Response (5)**: Finally, the generated response is provided to the user.
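The query-side steps above can be sketched in plain Python. This is a toy illustration under loud assumptions: `embed` is a bag-of-words counter rather than a real semantic embedding model, the "vector database" is a list held in memory, and `generate` is a stub standing in for an LLM call. The three chunks paraphrase sources used later in this notebook.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use a trained embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Data preparation: chunks are embedded once and kept alongside their text.
chunks = [
    "the evaluation process may be organized into up to three evaluation units",
    "ufrn joined the fiware foundation and created a smart cities hub at imd",
    "teams linked to imd won the hackathon inovar 2024",
]
store = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query: str, k: int = 1) -> list[str]:
    """Steps 2 and 3: embed the query and return the k most similar chunks."""
    query_embedding = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_embedding, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def generate(query: str, context: list[str]) -> str:
    """Step 4 stub: a real system would send the query and context to an LLM."""
    return f"Q: {query}\nGrounded on: {context[0]}"


query = "how many evaluation units can the evaluation process have"
relevant = retrieve(query)
print(generate(query, relevant))
```

Swapping `embed` for a real embedding model and `store` for a vector database changes the quality and scale of retrieval, but not the shape of the workflow.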

## Data Flow in RAG
The data flow in a RAG system follows these steps:

1. The input query or context is provided to the retriever.
2. The retriever searches the external knowledge source and retrieves the most relevant information based on the input.
3. The retrieved information is passed to the generator along with the input query or context.
4. The generator integrates the retrieved knowledge into the generation process and produces the final output response.

## Integration with LLMs
RAG can be integrated with existing Large Language Models (LLMs) to enhance their generation capabilities. The retriever component can be added as a preprocessing step before feeding the input to the LLM. The retrieved information can be concatenated with the input or used to condition the LLM's generation process.
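As a minimal sketch of the concatenation approach, the function below places retrieved chunks ahead of the user's question in a single prompt string. The name `build_prompt` and the instruction wording are illustrative assumptions, not a template prescribed by any library.

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Concatenate retrieved chunks with the user's question into one prompt."""
    # Number the chunks so the model (and the reader) can tell them apart.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "O processo avaliativo na UFRN pode ser organizado em até quantas avaliações?",
    ["Art. 100. O processo avaliativo pode ser organizado em até 3 (três) unidades avaliativas"],
)
print(prompt)
```

The resulting string can be passed to any chat or completion model. Instructing the model to rely only on the supplied context is a common way to discourage answers drawn from its training data alone.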

# Vector Databases

A vector database is a specialized database designed to efficiently store and query embedding data, extending the capabilities of traditional relational databases. The key distinguishing feature of a vector database is that query results are not exact matches to the query. Instead, using a specified similarity metric, the vector database returns embeddings that are similar to the query. This makes vector databases ideal for applications that involve comparing and retrieving embeddings based on their similarity rather than exact values.

## Example Use Case

Suppose you have stored some information about UFRN, like recent news and legislation that applies to our undergrad students. You can embed this information and store it in a vector database. When a user asks a question such as `Com quantos dias de antecedência um professor deve divulgar os resultados de uma prova antes de aplicar a próxima prova?` ("How many days in advance must a professor release the results of an exam before giving the next one?"), the vector database embeds the query using the same embedding model that was used to populate the database. It then compares the query embedding to the stored embeddings and returns the documents whose embeddings are most similar to it.

Because similarity is computed on embeddings, the match is driven by the document's semantics rather than by exact wording. With the retrieved document, the LLM can generate a response that is contextually relevant to the user's query and factually grounded in the retrieved text.

## Fundamental Components of a Vector Database

To make efficient storage and querying of embeddings possible, vector databases are equipped with features that balance the speed and accuracy of query results. Here are the central components you should know about:

1. **Embedding Function**: When using a vector database, you often store and query data in its raw form rather than uploading embeddings directly. Internally, the vector database needs to know how to convert your data to embeddings, and you have to specify an embedding function for this. For text, you can use embedding functions from libraries like SentenceTransformers or OpenAI's embedding models (or any other function that maps raw text to vectors).

2. **Similarity Metric**: To assess embedding similarity, you need a similarity metric such as cosine similarity, dot product, or Euclidean distance. Choosing the right similarity metric depends on your specific application.

3. **Indexing**: When dealing with a large number of embeddings, comparing a query embedding to every stored embedding can be too slow. Vector databases employ indexing algorithms that group similar embeddings together. At query time, the query embedding is compared only to a smaller subset of embeddings selected by the index. This is called approximate nearest neighbor (ANN) search, as the returned embeddings aren't guaranteed to have the highest similarity to the query.

4. **Metadata**: You can store metadata with each embedding to provide context and make query results more precise. Metadata can be filtered much like in a relational database. For example, you could store the publication year of a document as metadata and only search for similar documents published in a specific year.

5. **Storage Location**: Vector databases can store embeddings and metadata both in memory and on disk. In-memory storage allows for faster reads and writes, while disk storage is important for data persistence.

6. **CRUD Operations**: Most vector databases support create, read, update, and delete (CRUD) operations, allowing you to maintain and interact with data similar to a relational database.
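To make the similarity metric and metadata components concrete, here is a small pure-Python sketch. The record layout, the `query` function, and the `where` filter argument are hypothetical names chosen for illustration; they are not any particular vector database's API.

```python
import math


def dot_product(a, b):
    """Dot product: larger means more similar (sensitive to vector length)."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity: compares direction only, ignoring vector length."""
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )


def euclidean_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Each record pairs an embedding with metadata, here a publication year.
records = [
    {"id": "doc-1", "embedding": [1.0, 0.0], "metadata": {"year": 2023}},
    {"id": "doc-2", "embedding": [0.8, 0.6], "metadata": {"year": 2024}},
    {"id": "doc-3", "embedding": [0.0, 1.0], "metadata": {"year": 2024}},
]


def query(query_embedding, k=1, where=None):
    """Return the ids of the k most similar records, optionally filtered by metadata."""
    candidates = [
        r for r in records
        if where is None
        or all(r["metadata"].get(key) == value for key, value in where.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(query_embedding, r["embedding"]),
        reverse=True,
    )
    return [r["id"] for r in ranked[:k]]


q = [1.0, 0.1]
print(query(q))                        # most similar record overall
print(query(q, where={"year": 2024}))  # most similar record published in 2024
```

Filtering before ranking means the 2023 document can never be returned for the second query, even though it is the closest match overall.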

> While there is much more detail and complexity to explore with vector databases, these central concepts provide a solid foundation to get started.
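The speed versus accuracy trade-off of indexing (item 3) can also be illustrated with a toy example. The sketch below uses a simplified hyperplane-hashing scheme with two hand-picked hyperplanes; this is an assumption made for illustration, and production vector databases typically rely on more elaborate indexes such as HNSW graphs.

```python
import math

# Hand-picked hyperplanes for this sketch; real schemes draw many at random.
HYPERPLANES = [(1.0, 0.0), (0.0, 1.0)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def bucket_key(vector):
    """Hash a vector to a bucket: one bit per hyperplane (which side it falls on)."""
    return tuple(1 if dot(vector, h) >= 0 else 0 for h in HYPERPLANES)


vectors = {
    "a": (1.0, 0.2),
    "b": (0.9, 0.1),
    "c": (-1.0, 0.5),
    "d": (0.2, -1.0),
}

# Build the index once: vectors that hash to the same key share a bucket.
index = {}
for name, vec in vectors.items():
    index.setdefault(bucket_key(vec), []).append(name)


def exact_search(query):
    """Brute force: compare the query against every stored vector."""
    return max(vectors, key=lambda name: cosine(query, vectors[name]))


def approximate_search(query):
    """Compare the query only against vectors in its own bucket."""
    candidates = index.get(bucket_key(query), [])
    if not candidates:
        return None
    return max(candidates, key=lambda name: cosine(query, vectors[name]))


# Same bucket as the true neighbor: both searches agree.
print(approximate_search((1.0, 0.1)), exact_search((1.0, 0.1)))
# The query falls just across a hyperplane, so the approximate search misses "b".
print(approximate_search((1.0, -0.05)), exact_search((1.0, -0.05)))
```

The second query shows why the results are only approximate: the index inspects fewer vectors, at the cost of occasionally missing the true nearest neighbor.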

## [ChromaDB](https://www.trychroma.com/): A Vector Database for Embeddings

ChromaDB is a specialized vector database designed for storing and querying embeddings. Built upon SQLite, a renowned relational database, ChromaDB utilizes the SQLite engine to efficiently manage the storage and retrieval of embeddings. This system provides a straightforward and user-friendly interface for handling embeddings and querying similar vectors. Below are some key features and details about ChromaDB:

### Key Features

- **Efficient Storage**: ChromaDB utilizes SQLite's robust storage mechanisms to store embeddings efficiently. This ensures that large volumes of data can be handled without significant performance degradation.

- **Querying Capabilities**: The database is optimized for querying similar vectors, which is essential for tasks involving embeddings, such as nearest neighbor searches and similarity computations.

- **User-Friendly Interface**: ChromaDB offers a simple and intuitive interface for managing embeddings, making it accessible for users with varying levels of database proficiency.


### Running ChromaDB

You can use ChromaDB in several ways:
- Using an in-memory database, which is lost when the program ends.
- Using a file-based database, which is persisted to disk.
- Running as a server, which allows you to query the database from other applications. For this, there are Docker images available.

You can interact with ChromaDB using its Python library, its REST API, or through LangChain, which has a built-in ChromaDB integration. For our class, we'll use a file-based database and LangChain to interact with ChromaDB.

# Building your RAG System with ChromaDB and LangChain

We'll create a chatbot that has access to a knowledge base about UFRN. The chatbot will use a RAG system to retrieve information from the knowledge base and generate responses to user queries. We'll utilize ChromaDB to store and query embeddings of the knowledge base documents and LangChain to interact with ChromaDB and generate responses using a Large Language Model (LLM).

For documents about UFRN, we'll use the following sources:
- https://ufrn.br/imprensa/materias-especiais/80473/primeira-patente-da-ufrn-completa-uma-decada
- https://ufrn.br/imprensa/materias-especiais/80657/design-de-aplicativo-na-escola
- https://ufrn.br/imprensa/materias-especiais/80222/forro-na-ufrn-2
- https://ufrn.br/imprensa/materias-especiais/79929/projeto-do-ceres-resgata-especies-nativas-da-caatinga
- [Estatuto da UFRN](https://sigrh.ufrn.br/sigrh/public/colegiados/anexos/estatuto.pdf)
- [Regulamento dos Cursos de Graduação](https://ufrn.br/resources/documentos/regulamentos/regulamento-dos-cursos-de-graduacao-da-UFRN-2024.pdf)
- [Vídeo sobre o Restaurante Universitário](https://youtu.be/OHv-nx4ukZU?si=-uo5kd8OK_Zg7T9s)

This way, we'll see how to use [LangChain Document Loaders](https://python.langchain.com/v0.2/docs/how_to/#document-loaders) to retrieve information from several types of sources and how to use ChromaDB to store and query embeddings of these documents.

## Step 1 - Define our LLM

In [34]:
from dotenv import load_dotenv
import os
from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama

# Load environment variables from a .env file into the environment
load_dotenv()

# Initialize the ChatOpenAI model with specific parameters
# 'model' specifies the model to use, in this case 'gpt-4o-mini'
# 'temperature' controls the randomness of the model's output, with 0 being deterministic
model_openai = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Invoke the model with a specific prompt/question
response = model_openai.invoke(
    "O processo avaliativo na UFRN pode ser organizado em até quantas avaliações?"
)

# Print the content of the response from the model
print(response.content)

Na Universidade Federal do Rio Grande do Norte (UFRN), o processo avaliativo pode ser organizado em até quatro avaliações, conforme as diretrizes estabelecidas pela instituição. Essas avaliações podem incluir provas, trabalhos, atividades práticas e outras formas de avaliação que o professor considerar adequadas para medir o aprendizado dos alunos. É sempre bom consultar o regulamento específico do curso ou as orientações do professor para detalhes mais precisos.


In [35]:
# Initialize the ChatOllama model with specific parameters

model_llama = ChatOllama(
    model="llama3.1",  # 'model' specifies the model to use, in this case 'llama3.1'
    temperature=0,  # 'temperature' controls the randomness of the model's output, with 0 being deterministic
    base_url="http://localhost:11434",  # 'base_url' specifies the base URL for the model's API endpoint
)

# Invoke the model with a specific prompt/question
response = model_llama.invoke(
    "O processo avaliativo na UFRN pode ser organizado em até quantas avaliações?"
)

# Print the content of the response from the model
print(response.content)

A resposta é 5.


Both answers are wrong: the OpenAI model answered four and Llama 3.1 answered five. According to article 100 of [this document](https://ufrn.br/resources/documentos/regulamentos/regulamento-dos-cursos-de-graduacao-da-UFRN-2024.pdf), the maximum number is 3.

> Art. 100. O processo avaliativo pode ser organizado em até 3 (três) unidades avaliativas
>
> (Art. 100. The evaluation process may be organized into up to 3 (three) evaluation units.)

These models are very capable, but they don't know the specifics of UFRN's regulations. That's why each of them hallucinated an incorrect number. We can still use an LLM to generate the response, but we need to supply it with the correct information. This is one major advantage of using RAG systems.

-----

## Step 2 - Load the Documents

To prepare the documents for use with ChromaDB, we need to follow these steps:

- Downloading the Documents

- Extracting Text from Documents

- Splitting Text into Chunks
    - Determine an appropriate chunk size for your specific use case.
        - Consider factors like the desired granularity of information retrieval and the limitations of the underlying storage system.
    - Implement a function to split the extracted text into smaller chunks based on the chosen chunk size.
        - Options include splitting by a fixed number of characters, sentences, or paragraphs.
        - Ensure that the chunks are meaningful and maintain coherence.
    - Store the resulting chunks in a list or any other suitable data structure.

- Storing Chunks in ChromaDB
    - Initialize a connection to ChromaDB using the provided API or client library.
    - Create a new collection or database to store the text chunks.
    - Iterate over the list of text chunks and insert each chunk into ChromaDB.
        - Use appropriate methods provided by the ChromaDB library to insert the chunks efficiently.
        - Consider adding metadata or tags to each chunk if needed for better organization and retrieval.
    - Verify that the chunks are successfully stored in ChromaDB by querying the database.
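
The splitting step can be sketched as a fixed-size character splitter with overlap. This is a minimal illustration, not the splitter used later in this notebook: the function name `split_text` and its default sizes are assumptions, and in practice a library splitter (for example, one of LangChain's text splitters) that respects sentence and paragraph boundaries is usually a better choice.

```python
def split_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so a sentence cut at a
    chunk boundary still appears whole in at least one neighboring chunk
    (as long as it is shorter than the overlap).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk reaches the end, avoiding a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(split_text("abcdefghij", chunk_size=4, overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

Larger chunks carry more context per retrieved result but make matches less specific; the overlap trades some duplicated storage for fewer ideas being cut in half.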


LangChain provides several [document loaders](https://python.langchain.com/v0.2/docs/integrations/document_loaders/) that can help you load data from various sources, such as URLs, files, and databases. You can use these loaders to extract text from different types of documents and prepare them for storage in ChromaDB.

### Loading PDFs

In [36]:
import os
import requests

# Import PyPDFLoader from langchain_community.document_loaders to load and process PDF documents
from langchain_community.document_loaders import PyPDFLoader

# Initialize an empty list to store all documents
all_documents = []


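# Function to download a PDF document from a given URL and save it with a specified filename
def download_pdf(url, filename):
    # Check if the file already exists to avoid re-downloading
    if not os.path.exists(filename):
        # Create the target directory (e.g., 'data') if it does not exist yet
        os.makedirs(os.path.dirname(filename) or ".", exist_ok=True)
        # Make a GET request to the URL to fetch the PDF content
        r = requests.get(url, timeout=60)
        # Fail loudly on HTTP errors instead of saving an error page as a PDF
        r.raise_for_status()
        # Open the file in write-binary mode and save the content
        with open(filename, "wb") as f:
            f.write(r.content)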


# Download the PDF document from the specified URL and save it to the 'data' directory
download_pdf(
    "https://ufrn.br/resources/documentos/regulamentos/regulamento-dos-cursos-de-graduacao-da-UFRN-2024.pdf",
    "data/regulamento-dos-cursos-de-graduacao-da-UFRN-2024.pdf",
)

# Initialize the PyPDFLoader with the path to the downloaded PDF document
pdf_loader = PyPDFLoader("data/regulamento-dos-cursos-de-graduacao-da-UFRN-2024.pdf")

# Load and split the PDF document into individual pages
pages = pdf_loader.load_and_split()

In [37]:
# Print the content of the first page of the PDF document
# 'pages' is a list where each element represents a page of the PDF
print(pages[0])

page_content='MINISTÉRIO DA EDUCAÇÃO 
UNIVERSIDADE FEDERAL DO RIO GRANDE DO NORTE 
                  
       
 
  
RESOLUÇÃO Nº 016/2023-CONSEPE, de 04 de julho de 2023. 
 
 
Atualiza o Regulamento dos Cursos de Graduação da 
Universidade Federal do Rio Grande do Norte - UFRN. 
 
 
O VICE -REITOR DA UNIVERSIDADE FEDERAL DO RIO GRANDE DO NORTE faz saber que o 
Conselho de Ensino, Pesquisa e Extensão, no uso das atribuições que lhe confere o art. 17, Inciso III, do 
Estatuto da UFRN,  
 
CONSIDERANDO o art. 207 da Constituição Federal ao d eterminar que as universidades gozam 
de autonomia didático -científica, administrativa e de gestão financeira e patrimonial, e obedecerão ao 
princípio de indissociabilidade entre ensino, pesquisa e extensão;  
 
CONSIDERANDO a Lei nº 9.394/1996, que estabelec e as diretrizes e bases da educação 
nacional; 
 
CONSIDERANDO a necessidade de atualizar as normas relativas ao ensino de graduação, 
conforme determina o art. 359 da Resolução no 171/2013 - CO

In [38]:
len(pages)

77

In [39]:
all_documents += pages
pages = list()

In [40]:
# Download the second PDF document from the specified URL and save it to the 'data' directory
download_pdf(
    "https://sigrh.ufrn.br/sigrh/public/colegiados/anexos/estatuto.pdf",
    "data/estatuto.pdf",
)

# Initialize the PyPDFLoader with the path to the downloaded PDF document
pdf_loader = PyPDFLoader("data/estatuto.pdf")

# Load and split the PDF document into individual pages
pages = pdf_loader.load_and_split()

# Print the content of the second page of the PDF document
# 'pages' is a list where each element represents a page of the PDF
print(pages[1])

page_content='Sumário
TÍTULO I – DA INSTITUIÇÃO..............................................................5
CAPÍTULO I – Da natureza jurídica..............................................5
CAPÍTULO II – Dos princípios e dos objetivos........................... 6
SEÇÃO I – Dos princípios....................................................... 6
SEÇÃO II – Dos objetivos........................................................7
CAPÍTULO III – Da constituição básica....................................... 9
TÍTULO II – DA ADMINISTRAÇÃO UNIVERSITÁRIA....................14
CAPÍTULO I – Dos Conselhos Superiores................................14
SEÇÃO I – Do Conselho Universitário – CONSUNI......... 14
SEÇÃO II – Da Assembleia Universitária........................... 17
SEÇÃO III – Do Conselho de Ensino, Pesquisa e
Extensão – CONSEPE....................................18
SEÇÃO IV – Do Conselho de Administração –
CONSAD.......................................................... 22
SEÇÃO V – Do C

In [41]:
all_documents += pages
pages = list()

### Loading HTML from URLs

In [42]:
# Import the UnstructuredURLLoader class from the langchain_community.document_loaders module
# This class is used to load and process content from URLs
from langchain_community.document_loaders import UnstructuredURLLoader
import nltk

nltk.download("averaged_perceptron_tagger")

# Define a list of URLs to be loaded
urls = [
    "https://portal.imd.ufrn.br/portal/noticias/7318/ufrn-associa-se-a-fiware-foundation-e-cria-ihub-em-cidades-inteligentes-no-imd",
    "https://portal.imd.ufrn.br/portal/noticias/7338/equipes-ligadas-ao-imd-vencem-hackathon-inovar-2024",
    "https://portal.imd.ufrn.br/portal/noticias/7319/projeto-do-imd-com-idosos-comemora-conclus%C3%A3o-da-primeira-turma",
]

# Initialize the UnstructuredURLLoader with the list of URLs
# This loader will fetch and process the content from the provided URLs
loader = UnstructuredURLLoader(urls=urls)

# Load the content from the URLs
# 'pages' will be a list where each element contains the content of one URL
pages = loader.load()

# Note: For pages where the content is created dynamically (e.g., using JavaScript), there are other loaders that can be used with Selenium and Playwright.
# You can learn more about them on the Document Loaders page of LangChain Community.

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jacob/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [43]:
print(pages[0])

page_content='Mais notícias

UFRN associa-se a FIWARE Foundation e cria iHUB em Cidades Inteligentes no IMD

UFRN associa-se a FIWARE Foundation e cria iHUB em Cidades Inteligentes no IMD

Iniciativa visa tornar cidades brasileiras mais inteligentes, sustentáveis e integradas

27-06-2024 / ASCOM

Cidades Inteligentes Smart Metropolis



Após assinar um associação com a Fundação FIWARE, organização internacional que promove o desenvolvimento de soluções inteligentes em todo o mundo, a Universidade Federal do Rio Grande do Norte (UFRN) prevê a criação de um novo espaço (iHUB) para desenvolvimento de pesquisas, estudantes e iniciativas na área de Cidades Inteligentes.

Durante o Summit Cidades, evento nacional que aconteceu em Florianópolis (SC) na última terça-feira, 25, o Instituto Metrópole Digital (IMD/UFRN), representado pelo professor Frederico Lopes, oficializou o acordo, que marcou o início da criação do iHUB e de uma série de ações possibilitam a disseminação de tecnologias open-

In [44]:
all_documents += pages
pages = list()

### Loading Arbitrary Text

In [45]:
arbitrary_text = """
Fortalecer a atuação do Poder Judiciário na proteção do meio ambiente com uso de Inteligência Artificial (IA): esse é um dos objetivos de projeto realizado pelo Programa das Nações Unidas para o Desenvolvimento (Pnud), o Conselho Nacional de Justiça (CNJ) e a Universidade Federal do Rio Grande do Norte (UFRN). A iniciativa utilizará IA e técnicas da ciência de dados para extrair informações úteis dos textos processuais, a fim de realizar análises e previsões em ações judiciais do assunto Direito Ambiental.  

Segundo explica Rafael Leite, juiz auxiliar da Presidência do CNJ, o projeto visa usar IA, técnicas de processamento de linguagem natural e análise dos dados do processo e do conteúdo textual de suas peças para obter informações sobre a atuação do Judiciário na área do meio ambiente. “A IA é importante ferramenta para dar suporte à atuação dos magistrados e aumentar a efetividade da jurisdição – por exemplo, identificando padrões de conduta, impacto em biomas específicos e efeitos cumulativos de determinadas atividades –, além de melhor orientar ações de fiscalização de combate ao desmatamento ilegal e outros crimes ambientais”, afirma o magistrado.  

O CNJ já realiza acompanhamento de ações judiciais nos assuntos de Direito Ambiental por meio do Painel Interativo SireneJud, que reúne informações, por exemplo, sobre Terras Indígenas e áreas de desmatamento. O painel consome dados de diferentes fontes, como o DataJud – Base Nacional de Dados do Poder Judiciário, o Instituto Nacional de Pesquisas Espaciais (Inpe) e o Instituto Brasileiro de Geografia e Estatística (IBGE). 

A UFRN irá colocar à disposição do SireneJud as APIs (Application Programming Interfaces) desenvolvidas de modo a agregar novas informações ao painel interativo. As APIs permitem que os sistemas se comuniquem e os dados sejam integrados. 

Arquitetura
No projeto, a arquitetura definida para desenvolver as soluções insere-se na linguística computacional, que usa redes neurais e outros métodos para “extrair sentido de texto”, explica Elias Jacob, professor da UFRN. “A IA nada mais é do que uma representação do mundo a partir dos dados que foram passados para ela. O que ela faz? Detecta padrões”, descreve Jacob. 

A Plataforma Codex proverá o conjunto de dados que serão utilizados pelos analistas e desenvolvedores na construção dos modelos de IA e demais soluções. “O Codex importa as informações dos diversos sistemas processuais eletrônicos utilizados pelos tribunais. Ele extrai tanto os metadados do processo, como número e nome das partes, quanto o teor dos textos que estão lá”, explica o professor. 

De acordo com o Painel de Monitoramento de Implantação do Codex, o sistema tem extraído informações de 141 fontes de dados vinculadas a sistemas processuais dos tribunais brasileiros. “Todo o projeto tem a função de detectar padrões para extrair informação – que antes era inacessível – para que o humano possa tomar as decisões necessárias”, diz Jacob.  

Uso de IA no Poder Judiciário
Elias Jacob também propõe desmistificar a ideia de que modelos de IA promoveriam uma espécie de robotização do trabalho de juízes e juízas. “O que precisa ficar claro é que são ferramentas que nos ajudam a fazer algo, e não que vieram substituir o trabalho humano”, diz. 

Ele ressalta que a demanda social pelo serviço prestado pela jurisdição é elevada e, mesmo com a digitalização do processo judicial e a alta produtividade de magistrados e servidores, o Poder Judiciário enfrenta problemas de congestionamento processual. Nesse sentido, a IA é um caminho: “no Brasil, um juiz julga mais de 9 mil processos por ano. Existem várias formas de melhorar o atendimento dessa demanda. Uma delas é o desenvolvimento de ferramentas de IA”. 

O projeto com a UFRN é realizado no âmbito do Programa Justiça 4.0. A iniciativa busca fortalecer a atuação do Judiciário na tutela do meio ambiente, considerando que a Justiça brasileira dispõe de um conjunto de informações e dados relevantes sobre conflitos e crimes ambientais, explica Jacob. “Os problemas da sociedade recaem no Judiciário, que, em tese, é a melhor fonte de conhecimento sobre os problemas que assolam o país; na questão ambiental, não seria diferente”, argumenta o professor. 

Conheça os produtos previstos na parceria entre CNJ, PNUD e UFRN:
Solução de IA capaz de recomendar aos magistrados precedentes na área ambiental, buscando situações similares e permitindo maior uniformização dos julgamentos; 
Dados tratados contendo o recorte de causas ambientais que já tramitaram no Brasil;  
Ferramenta para identificar os maiores réus em causas ambientais e poluidores em geral a partir de dados retirados da Base Nacional de Dados do Poder Judiciário (DataJud);   
Solução de IA capaz de ler textos jurídicos e identificar elementos importantes, como o tipo de crime cometido, o dano causado, o bioma envolvido, o valor da condenação e o uso da legislação nacional e internacional; e
Solução de IA para prever os resultados de processos judiciais na área ambiental. 
"""

In [46]:
# Import the Document class from the langchain_core.documents module
# This class is used to create and manage document objects
from langchain_core.documents import Document

# Create a Document object with specified content and metadata
# 'page_content' is the main content of the document, which is stored in the variable 'arbitrary_text'
# 'metadata' is a dictionary containing additional information about the document
# 'title' specifies the title of the document
# 'author' specifies the author of the document
# 'source' specifies the source URL of the document
doc = Document(
    page_content=arbitrary_text,
    metadata={
        "title": "IA e ciência de dados vão auxiliar o Judiciário na proteção do meio ambiente",
        "author": "Conselho Nacional de Justiça",
        "source": "https://www.cnj.jus.br/ia-e-ciencia-de-dados-vao-auxiliar-o-judiciario-na-protecao-do-meio-ambiente/",
    },
)

# Display the Document object
doc

Document(metadata={'title': 'IA e ciência de dados vão auxiliar o Judiciário na proteção do meio ambiente', 'author': 'Conselho Nacional de Justiça', 'source': 'https://www.cnj.jus.br/ia-e-ciencia-de-dados-vao-auxiliar-o-judiciario-na-protecao-do-meio-ambiente/'}, page_content='\nFortalecer a atuação do Poder Judiciário na proteção do meio ambiente com uso de Inteligência Artificial (IA): esse é um dos objetivos de projeto realizado pelo Programa das Nações Unidas para o Desenvolvimento (Pnud), o Conselho Nacional de Justiça (CNJ) e a Universidade Federal do Rio Grande do Norte (UFRN). A iniciativa utilizará IA e técnicas da ciência de dados para extrair informações úteis dos textos processuais, a fim de realizar análises e previsões em ações judiciais do assunto Direito Ambiental.  \n\nSegundo explica Rafael Leite, juiz auxiliar da Presidência do CNJ, o projeto visa usar IA, técnicas de processamento de linguagem natural e análise dos dados do processo e do conteúdo textual de suas pe

In [47]:
all_documents.append(doc)

In [48]:
len(all_documents)

135

## Step 3 - Breaking Down the Text into Manageable Segments

Dense, information-rich texts must be divided into smaller segments before they can be stored and retrieved efficiently. This process, known as **chunking**, produces pieces that are small enough to embed and focused enough to match a specific query. Chunking makes retrieval more relevant and reduces cost, because only the relevant portions of a document, rather than the entire document, are included in the LLM prompt.

### Chunking Strategies

The following chunking techniques typically use two main parameters:

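- **chunk_size**: The maximum size of each text segment, measured in characters or tokens depending on the splitter.
- **chunk_overlap**: How much text is shared between the end of one segment and the start of the next, so that context near a boundary is not lost.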

#### Character Chunking

Character Chunking is the simplest strategy, dividing the text into segments based on a fixed number of characters. While straightforward, this method can sometimes disrupt the flow of text by breaking sentences or words unexpectedly. Despite its limitations, it serves as a good starting point for more advanced methods.
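A minimal sketch of fixed-size character chunking, to make `chunk_size` and `chunk_overlap` concrete. The function name `chunk_by_characters` and the sample string are illustrative assumptions, not part of LangChain.

```python
def chunk_by_characters(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "Retrieval augmented generation"
chunks = chunk_by_characters(sample, chunk_size=12, chunk_overlap=4)
for chunk in chunks:
    print(repr(chunk))
```

Note how the second chunk starts mid-word: this is the kind of disruption the paragraph above warns about.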

#### Recursive Character Chunking

Recursive Character Chunking builds on Character Chunking by splitting the text on an ordered list of separators, for example paragraph breaks, then line breaks, then sentences, then words. Whenever a piece is still larger than the maximum chunk size, it is split again with the next, finer separator. Because the coarsest boundaries are tried first, the chunks tend to follow the text's structure and preserve more meaning. This adaptability makes the method suitable for texts with varied structures.
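One way to picture the recursion is the sketch below: split on the coarsest separator first, and re-split only the pieces that are still longer than `chunk_size`. This is a simplified illustration under stated assumptions (no overlap, separators are dropped from the output); it is not LangChain's actual implementation, and `recursive_split` is a hypothetical name.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the first separator; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    head, *finer = separators
    pieces = text.split(head) if head else list(text)  # "" means split per character

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + head + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = (
    "First paragraph.\n\n"
    "Second paragraph is a bit longer than the first.\n\n"
    "Third."
)
result = recursive_split(text, chunk_size=30, separators=["\n\n", "\n", ". ", " ", ""])
print(result)
```

Only the second paragraph exceeds 30 characters, so only that paragraph is broken further, at word boundaries.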

#### Document Specific Chunking

Document Specific Chunking respects the document's innate structure by creating segments that align with the logical sections of the document, such as paragraphs or subsections, instead of using a fixed number of characters or a recursive process. This approach maintains the original organization of the content, making the retrieved information more relevant and useful, especially for structured documents with clearly defined sections.
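As an illustration, the sketch below treats Markdown headings as section boundaries and emits one chunk per section. The helper name `split_by_headings` and the sample text are assumptions made for this example; LangChain also ships structure-aware splitters (such as `MarkdownHeaderTextSplitter`) that serve a similar purpose.

```python
import re

# A Markdown heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.+)$")


def split_by_headings(markdown_text: str) -> list[dict]:
    """Return one chunk per Markdown section, keeping the heading as metadata."""
    sections = [{"title": None, "lines": []}]  # text before the first heading
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append({"title": match.group(2).strip(), "lines": []})
        else:
            sections[-1]["lines"].append(line)

    chunks = []
    for section in sections:
        content = "\n".join(section["lines"]).strip()
        if section["title"] is not None or content:  # skip an empty preamble
            chunks.append({"title": section["title"], "content": content})
    return chunks


sample = (
    "# Enrollment\n"
    "Students enroll each term.\n\n"
    "## Deadlines\n"
    "Deadlines are in the calendar.\n"
)
sections = split_by_headings(sample)
for section in sections:
    print(section)
```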

#### Token-based Chunking

Embedding models and language models measure their input limits in tokens, not characters. When dividing your text into segments, it is therefore advisable to count tokens so that every chunk stays within the limit of the model that will process it. For the count to be accurate, use the same tokenizer as that model.
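The idea can be sketched with any tokenizer that offers a tokenize and a detokenize step. In the example below, whitespace splitting is only a stand-in so that the code runs without external dependencies; it is an assumption for illustration, and a real pipeline would pass the target model's own tokenizer instead.

```python
from typing import Callable


def chunk_by_tokens(
    text: str,
    max_tokens: int,
    tokenize: Callable[[str], list],
    detokenize: Callable[[list], str],
) -> list[str]:
    """Group tokens into windows of at most max_tokens and turn each back into text."""
    tokens = tokenize(text)
    return [
        detokenize(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


# Stand-in tokenizer (assumption): one token per whitespace-separated word.
def whitespace_tokenize(text: str) -> list:
    return text.split()


def whitespace_detokenize(tokens: list) -> str:
    return " ".join(tokens)


token_chunks = chunk_by_tokens(
    "one two three four five",
    max_tokens=2,
    tokenize=whitespace_tokenize,
    detokenize=whitespace_detokenize,
)
print(token_chunks)
```

Swapping in a different tokenizer changes where the boundaries fall, which is why the count must come from the tokenizer of the model that will consume the chunks.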

#### Semantic Chunking

Semantic Chunking considers the relationships within the text, dividing it into meaningful, semantically complete segments. This approach ensures the integrity of the information during retrieval, leading to more accurate and contextually appropriate outcomes. The process involves:

1. Taking the embeddings of every sentence in the document.
2. Comparing the similarity of all sentences with each other.
3. Grouping sentences with the most similar embeddings together.

By focusing on the text's meaning and context, Semantic Chunking significantly enhances the quality of retrieval. It's an excellent choice when maintaining the semantic integrity of the text is crucial. However, this method requires more effort and is slower than the previous ones.
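The sketch below shows one simplified variant of this idea: it compares each sentence only with the sentence before it and starts a new chunk when their similarity drops below a threshold. Both the bag-of-words `embed` function and the threshold value are assumptions chosen so the example runs on its own; a real system would use a trained embedding model.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy embedding (assumption): word counts instead of a learned vector."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float) -> list[str]:
    """Start a new chunk whenever consecutive sentences are dissimilar."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine(embed(previous), embed(current)) >= threshold:
            groups[-1].append(current)  # same topic: extend the current chunk
        else:
            groups.append([current])  # topic shift: open a new chunk
    return [" ".join(group) for group in groups]


sentences = [
    "Chroma stores vector embeddings.",
    "Vector embeddings power similarity search.",
    "The exam is on Friday.",
    "Friday is also the project deadline.",
]
semantic_result = semantic_chunks(sentences, threshold=0.3)
for chunk in semantic_result:
    print(chunk)
```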

#### Agent Chunking

Agent Chunking mimics how humans might process a new document:

1. Start at the top of the document, treating the first part as a segment.
2. Continue down the document, deciding if a new sentence or piece of information belongs with the first segment or should start a new one.
3. Repeat this process until reaching the end of the document.
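The loop described in these steps can be sketched as follows. The decision in step 2 is delegated to a callable; `shares_topic_word` is a deliberately naive, hypothetical stand-in for whatever makes that judgment in practice (for example, a call to a language model).

```python
from typing import Callable


def agent_chunks(
    sentences: list[str],
    belongs_with: Callable[[list[str], str], bool],
) -> list[list[str]]:
    """Walk the document top to bottom, extending or closing the current segment."""
    segments: list[list[str]] = []
    for sentence in sentences:
        if segments and belongs_with(segments[-1], sentence):
            segments[-1].append(sentence)  # stays with the current segment
        else:
            segments.append([sentence])  # starts a new segment
    return segments


def shares_topic_word(segment: list[str], sentence: str) -> bool:
    """Hypothetical decider: True if the sentence repeats a longer word from the segment."""
    topic_words = {
        word.lower().strip(".,")
        for text in segment
        for word in text.split()
        if len(word) > 4
    }
    return any(word.lower().strip(".,") in topic_words for word in sentence.split())


document = [
    "Chunking splits documents.",
    "Good chunking keeps context.",
    "Retrievers search embeddings.",
]
segments = agent_chunks(document, shares_topic_word)
print(segments)
```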

### Choosing the Right Chunking Strategy

Each chunking strategy has its strengths and weaknesses, and the choice of method depends on the specific requirements of the text and the intended use. Consider the following factors when selecting a chunking strategy:

- **Text structure**: For well-structured documents with clearly defined sections, Document Specific Chunking may be the most appropriate choice.
- **Semantic integrity**: If maintaining the semantic integrity of the text is crucial, Semantic Chunking is the best option, despite being more time-consuming and effort-intensive.
- **Simplicity**: Character Chunking and Recursive Character Chunking are straightforward and easy to implement, making them suitable for quick experimentation or when the text structure is less important.
- **Language model compatibility**: Token-based Chunking ensures compliance with the token limit of the language model being used, making it a reliable choice when working with LLMs.


### Applying Token-based Chunking for Our Class

For our class, we'll use a simple, token-based chunking strategy to divide the text into manageable segments. This approach keeps every chunk within the token limit of the model that has to process it. Because different models have different token limits and tokenization methods, it's essential to consider these factors when designing your chunking strategy.

### Comparing Embedding Models and Adjusting Chunk Size

Later, we'll purposely use two different strategies to generate the embeddings for the documents:

1. Using a Sentence Transformer to generate the embeddings.
2. Using OpenAI embedding models.

This will allow us to compare the results and see how different embedding models can impact the performance of the RAG system. However, there's a difference between the two strategies that we'll need to address:

> Our Sentence-Transformer has a 512 token limit, while OpenAI's model has an 8192 token limit.

This means that each splitter needs its own chunk size, so that every chunk fits within the token limit of the model that will embed it. This is a common challenge when combining different embedding models and language models, and it's essential to consider these limitations when designing your RAG system. Note that the two models also use different tokenizers, so their token counts are not directly comparable: the same text can yield a different number of tokens in each.

Let's dive into [LangChain Text Splitters](https://python.langchain.com/v0.2/docs/concepts/#text-splitters) to see how we can apply token-based chunking for our class.

In [49]:
# Import the RecursiveCharacterTextSplitter class from langchain_text_splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Import the AutoTokenizer class from transformers
from transformers import AutoTokenizer

# Load a pre-trained Hugging Face tokenizer from the specified path
hf_tokenizer = AutoTokenizer.from_pretrained(
    "outputs/sentence_transformers/sentence-transformer"
)

# Create a RecursiveCharacterTextSplitter instance using OpenAI's tiktoken encoder
splitter_openai = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # Specify the encoding name
    chunk_size=8191,  # Define the maximum size of each chunk
    chunk_overlap=0,  # Define the overlap between chunks
    separators=[
        "\n\n",
        "\n",
        ".",
        ",",
        " ",
        "",
    ],  # Define the separators to use for splitting
)

# Create a RecursiveCharacterTextSplitter instance using the Hugging Face tokenizer
splitter_sentence_transformers = (
    RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        tokenizer=hf_tokenizer,  # Pass the loaded Hugging Face tokenizer
        chunk_size=512,  # Define the maximum size of each chunk
        chunk_overlap=0,  # Define the overlap between chunks
        separators=[
            "\n\n",
            "\n",
            ".",
            ",",
            " ",
            "",
        ],  # Define the separators to use for splitting
    )
)

In [50]:
len(splitter_sentence_transformers.split_text(arbitrary_text)), len(
    splitter_openai.split_text(arbitrary_text)
)

(3, 1)

In [51]:
chunked_documents_openai = splitter_openai.split_documents(all_documents)
chunked_documents_sentence_transformers = (
    splitter_sentence_transformers.split_documents(all_documents)
)

Token indices sequence length is longer than the specified maximum sequence length for this model (699 > 512). Running this sequence through the model will result in indexing errors


In [52]:
len(chunked_documents_openai), len(chunked_documents_sentence_transformers)

(135, 219)

## Step 4 - Storing Chunks in ChromaDB

After splitting the text into manageable segments, the next step is to store these chunks in ChromaDB for efficient retrieval during the RAG (Retrieval-Augmented Generation) process. ChromaDB is a powerful database designed for storing and querying vector embeddings, making it an ideal choice for this task. Let's dive into the details of storing chunks in ChromaDB.

### Establishing a Connection

To interact with ChromaDB, you first need to establish a connection to your ChromaDB instance. This is typically done using the provided API or client library specific to your programming language. It's essential to ensure that you have the necessary credentials and permissions to access the database securely. Proper authentication and authorization mechanisms should be in place to protect your data and maintain the integrity of your ChromaDB instance.

### Creating a Collection

Once you have successfully connected to ChromaDB, the next step is to create a collection or database to store your text chunks. Collections serve as logical containers for organizing and grouping related data. By creating a dedicated collection for your text chunks, you can easily manage and query them later. When naming your collection, choose a meaningful and descriptive name that reflects the nature of the data it holds. This will make it easier to identify and work with the collection throughout your RAG process.

### Inserting Chunks into ChromaDB

With the collection created, you can now proceed to insert the text chunks into ChromaDB. This involves iterating over the list of chunks and using the appropriate methods provided by the ChromaDB library to store each chunk efficiently. During the insertion process, consider adding metadata or tags to each chunk. Metadata can include relevant information such as the source of the text, the date of creation, or specific keywords associated with the chunk. By attaching metadata, you enhance the organizational structure and enable more targeted queries and analysis later on.

### Verifying Data Insertion

After inserting the chunks into ChromaDB, it's crucial to verify that the data has been successfully stored. You can accomplish this by running test queries against the database and examining the results. Perform queries that retrieve specific chunks based on their content or metadata to ensure that the data is accessible and matches your expectations. This verification step helps confirm the integrity and reliability of your stored data, giving you confidence in the subsequent stages of your RAG process.

> **Important Note:**
> When working with embeddings, it's vital to [remember this](https://www.youtube.com/watch?v=ulD7IsecPbU): embeddings are only comparable within the same model. If you use different models to generate embeddings for your text chunks and for your queries, the resulting vectors will not be directly comparable. To ensure accurate retrieval, use the same embedding model consistently throughout your RAG system. This includes embedding the user queries with the same model that was used to embed the documents stored in ChromaDB.


> We will save local files with ChromaDB, but you can also run Chroma in [Client/Server Mode](https://docs.trychroma.com/guides#running-chroma-in-client/server-mode).

In [53]:
# Ensure you have installed the necessary packages:
# pip install langchain-chroma chromadb langchain-huggingface langchain-openai

# Import the Chroma class from the langchain_chroma module
# This class is used for managing and querying embeddings with Chroma
from langchain_chroma import Chroma

# Import the HuggingFaceEmbeddings class from the langchain_huggingface module
# This class is used to generate embeddings using Hugging Face models
from langchain_huggingface import HuggingFaceEmbeddings

# Import the OpenAIEmbeddings class from the langchain_openai module
# This class is used to generate embeddings using OpenAI models
from langchain_openai import OpenAIEmbeddings

# Initialize the HuggingFaceEmbeddings with a specified model
# 'model_name' specifies the path to the pre-trained Sentence Transformers model
# 'show_progress=True' enables the display of progress during embedding generation
# 'model_kwargs' is a dictionary of additional arguments for the model, here specifying to use the CPU
embedding_function_sentence_transformers = HuggingFaceEmbeddings(
    model_name="outputs/sentence_transformers/sentence-transformer",
    show_progress=True,
    model_kwargs={"device": "cpu"},
)

In [54]:
# Create a ChromaDB instance and save it to a specified directory
# 'db_sentence_transformers' will store the ChromaDB instance created from the documents and embeddings

# Initialize the ChromaDB instance using the 'from_documents' method
# 'documents' is a list of chunked documents that will be stored in the database
# 'embedding' is the embedding function used to generate embeddings for the documents
# 'persist_directory' specifies the directory where the ChromaDB instance will be saved
db_sentence_transformers = Chroma.from_documents(
    documents=chunked_documents_sentence_transformers,
    embedding=embedding_function_sentence_transformers,
    persist_directory="/tmp/chroma_db_sentence_transformer",
)

# The ChromaDB instance 'db_sentence_transformers' is now created and saved to '/tmp/chroma_db_sentence_transformer'
# This instance can be used for querying and managing document embeddings

Batches:   0%|          | 0/7 [00:00<?, ?it/s]

In [55]:
# Initialize the OpenAIEmbeddings with a specified model
# 'model' specifies the OpenAI model to be used for generating embeddings
embedding_function_openai = OpenAIEmbeddings(model="text-embedding-3-small")

# Create a ChromaDB instance using the OpenAI embeddings and save it to a specified directory
# 'db_openai' will store the ChromaDB instance created from the documents and embeddings

# Initialize the ChromaDB instance using the 'from_documents' method
# 'documents' is a list of chunked documents that will be stored in the database
# 'embedding' is the embedding function used to generate embeddings for the documents
# 'persist_directory' specifies the directory where the ChromaDB instance will be saved
db_openai = Chroma.from_documents(
    documents=chunked_documents_openai,
    embedding=embedding_function_openai,
    persist_directory="/tmp/chroma_db_openai",
)

# The ChromaDB instance 'db_openai' is now created and saved to '/tmp/chroma_db_openai'
# This instance can be used for querying and managing document embeddings

In [56]:
# We can reload the ChromaDB instances from the saved files at any time by specifying the persist_directory

# Reload the ChromaDB instance for Sentence Transformers embeddings
# 'persist_directory' specifies the directory where the ChromaDB instance was previously saved
# 'embedding_function' specifies the embedding function used to generate embeddings for the documents
db_sentence_transformers = Chroma(
    persist_directory="/tmp/chroma_db_sentence_transformer",
    embedding_function=embedding_function_sentence_transformers,
)

# Reload the ChromaDB instance for OpenAI embeddings
# 'persist_directory' specifies the directory where the ChromaDB instance was previously saved
# 'embedding_function' specifies the embedding function used to generate embeddings for the documents
db_openai = Chroma(
    persist_directory="/tmp/chroma_db_openai",
    embedding_function=embedding_function_openai,
)

In [57]:
# Define the query string to search for relevant documents
query = "O processo avaliativo na UFRN pode ser organizado em até quantas avaliações?"

# Perform a similarity search using the ChromaDB instance with Sentence Transformers embeddings
# 'docs1' will store the documents that are most similar to the query based on Sentence Transformers embeddings
docs1 = db_sentence_transformers.similarity_search(query)

# Perform a similarity search using the ChromaDB instance with OpenAI embeddings
# 'docs2' will store the documents that are most similar to the query based on OpenAI embeddings
docs2 = db_openai.similarity_search(query)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [58]:
print(docs1)

[Document(id='24972478-c695-462d-8000-ec3fd5886848', metadata={'author': 'tamara', 'creationdate': '2023-10-20T11:54:29-03:00', 'creator': 'Microsoft® Office Word 2007', 'moddate': '2023-10-20T11:54:29-03:00', 'page': 24, 'page_label': '25', 'producer': 'Microsoft® Office Word 2007', 'source': 'data/regulamento-dos-cursos-de-graduacao-da-UFRN-2024.pdf', 'total_pages': 77}, page_content='das unidades, é obrigatória a realização de uma avaliação escrita, individual e presencial.  \n \n§4º  A unidade de vinculação do componen te curricular poderá, excepcionalmente, dispensar a \nobrigatoriedade estabelecida no §3º  deste artigo. \n \nArt. 101.  O rendimento acadêmico de cada unidade avaliativa é calculado a partir dos \nresultados obtidos nos instrumentos avaliativos utilizados na unidade. \n \nParágrafo único.  A quantidade de instrumentos avaliativos por unidade é definida previamente \npelo docente e divulgada no plano de ensino da turma. \n \nArt. 102.   O rendimento acadêmico parcial

In [59]:
print(docs2)

[Document(id='bbb09d51-066b-4195-9d8b-7d039e1ebbd2', metadata={'author': 'tamara', 'creationdate': '2023-10-20T11:54:29-03:00', 'creator': 'Microsoft® Office Word 2007', 'moddate': '2023-10-20T11:54:29-03:00', 'page': 23, 'page_label': '24', 'producer': 'Microsoft® Office Word 2007', 'source': 'data/regulamento-dos-cursos-de-graduacao-da-UFRN-2024.pdf', 'total_pages': 77}, page_content='datas relativas a procedimentos regulares previstos neste regulamento. \n \nArt. 94.   Os cursos de graduação se desenvolvem anualmente em dois períodos letivos \nregulares estabelecidos no Calendário Universitário. \n \n§1º  Os períodos letivos regulares têm duração de, no mínimo, 18 (dezoito) semanas de aulas. \n \n§2º  Adicionalmente, a critério da instituição, pode ser realizado período letivo especial de \nférias. \n \n§3º  O período letivo especial de férias deve ter duração mínima de 4 (quatro) semanas. \n \nArt. 95.  As aulas presenciais semanais da UFRN são ministradas: \n \nI - de segunda-feir

#### Building a Retriever from a Vector Store

A vector store can be turned into a retriever with its built-in `as_retriever` method. The resulting retriever object performs similarity searches by comparing the embedding of the search query with the embeddings of the stored documents, which lets the system fetch the documents most closely related to the query.

#### Maximal Marginal Relevance (MMR) Retrieval

MMR retrieval is an alternative approach that goes beyond traditional similarity search. Instead of focusing solely on how similar each document is to the query, MMR also considers the similarity among the retrieved documents themselves. This approach seeks to achieve a balance between relevance and diversity.

- **Relevance vs. Diversity Trade-Off**:  
  In basic similarity search, documents are selected based solely on their similarity to the query. However, this technique might select several documents that offer very similar perspectives or overlapping information. In contrast, MMR aims to reduce redundancy by favoring documents that, while still relevant, provide complementary or different insights.  

  A common formulation of this trade-off is:
  
  $$
  \text{score}(d) = \lambda \cdot \text{sim}(q, d) - (1 - \lambda) \cdot \max_{d' \in S} \text{sim}(d, d')
  $$
  
  Here,  
  - $ \text{sim}(q, d) $ measures the similarity between the query $ q $ and the document $ d $.  
  - $ \max_{d' \in S} \text{sim}(d, d') $ represents the maximum similarity between $ d $ and any document $ d' $ already selected in the set $ S $.  
  - $ \lambda $ is a parameter (with $ 0 \leq \lambda \leq 1 $) that controls the balance between choosing documents that are highly similar to the query and those that are more diverse compared to the current set.  

- **Practical Benefits of MMR**:  
  1. **Enhanced Diversity**: By intentionally selecting documents that are not only relevant but also distinct from each other, MMR ensures that different facets of the topic are represented.  
  2. **Reduced Redundancy**: With similarity-based methods, many documents containing overlapping information may be retrieved. MMR minimizes this problem by preferring documents that contribute new information.  
  3. **Broader Information Coverage**: A diverse set of results can expose users to various aspects of the topic, which is especially useful when a single document cannot fully satisfy the user's information need.  
  4. **Improved User Experience**: Presenting a varied set of results helps users gain a more complete understanding of the subject matter without having to manually filter out duplicate or near-duplicate insights.
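The scoring rule above can be turned into a greedy selection loop. The sketch below is a minimal, dependency-free illustration with hand-made two-dimensional vectors; the function name `mmr_select`, the parameter name `lam`, and the toy vectors are assumptions for this example, not the implementation used by any particular library.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query_vec: list[float],
    doc_vecs: list[list[float]],
    k: int,
    lam: float = 0.5,
) -> list[int]:
    """Greedily pick k document indices by lam * relevance - (1 - lam) * redundancy."""
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))

    def score(i: int) -> float:
        relevance = cosine(query_vec, doc_vecs[i])
        # Highest similarity to anything already chosen (0 when nothing is chosen yet).
        redundancy = max(
            (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
        )
        return lam * relevance - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [
    [1.0, 0.10],   # 0: highly relevant
    [1.0, 0.12],   # 1: near-duplicate of document 0
    [0.6, -0.80],  # 2: less relevant, but different from document 0
]

relevance_only = mmr_select(query, docs, k=2, lam=1.0)
balanced = mmr_select(query, docs, k=2, lam=0.5)
print(relevance_only, balanced)
```

With `lam=1.0` the rule reduces to plain similarity search and returns the two near-duplicates; with `lam=0.5` the second pick switches to the more distinct document. LangChain's MMR retrievers accept a comparable weight, `lambda_mult`, through `search_kwargs`.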

#### Considerations When Using MMR

- **Document Quality and Diversity**:  
  The effectiveness of MMR relies on having a vector store with a high-quality and varied set of documents. If the underlying documents lack diversity, even MMR may not be able to produce a rich set of results.

- **Tuning the MMR Parameters**:  
  The performance of the MMR retrieval method is influenced by the choice of parameters, especially the value of $ \lambda $. Fine-tuning this parameter is essential to reach the desired balance between relevance and diversity for different use cases.

> **Note:** The choice between using traditional similarity search and MMR-based search should be based on the specific needs of the application. When redundancy is a concern and a broad overview is desired, MMR presents a beneficial alternative.

In short, MMR increases the variety of the retrieved results while keeping them relevant to the query, which leads to a more effective search experience.

In [60]:
# Convert the ChromaDB instance with OpenAI embeddings into a retriever
# 'as_retriever' method converts the ChromaDB instance into a retriever object
# 'search_type='mmr'' specifies that the retriever should use Maximal Marginal Relevance (MMR) for search
# MMR helps in diversifying the search results by balancing relevance and diversity
retriever_openai = db_openai.as_retriever(search_type="mmr")

# Convert the ChromaDB instance with Sentence Transformers embeddings into a retriever
# Similar to the above, this converts the ChromaDB instance into a retriever object using MMR
retriever_sentence_transformers = db_sentence_transformers.as_retriever(
    search_type="mmr"
)

In [61]:
# Print the results of invoking the retriever with the specified query
# 'retriever_openai' is the retriever object created from the ChromaDB instance with OpenAI embeddings
# 'invoke(query)' method performs the search using the query and returns the most relevant documents
print(retriever_openai.invoke(query))

[Document(id='bbb09d51-066b-4195-9d8b-7d039e1ebbd2', metadata={'author': 'tamara', 'creationdate': '2023-10-20T11:54:29-03:00', 'creator': 'Microsoft® Office Word 2007', 'moddate': '2023-10-20T11:54:29-03:00', 'page': 23, 'page_label': '24', 'producer': 'Microsoft® Office Word 2007', 'source': 'data/regulamento-dos-cursos-de-graduacao-da-UFRN-2024.pdf', 'total_pages': 77}, page_content='datas relativas a procedimentos regulares previstos neste regulamento. \n \nArt. 94.   Os cursos de graduação se desenvolvem anualmente em dois períodos letivos \nregulares estabelecidos no Calendário Universitário. \n \n§1º  Os períodos letivos regulares têm duração de, no mínimo, 18 (dezoito) semanas de aulas. \n \n§2º  Adicionalmente, a critério da instituição, pode ser realizado período letivo especial de \nférias. \n \n§3º  O período letivo especial de férias deve ter duração mínima de 4 (quatro) semanas. \n \nArt. 95.  As aulas presenciais semanais da UFRN são ministradas: \n \nI - de segunda-feir

In [62]:
# Print the results of invoking the retriever with the specified query
# 'retriever_sentence_transformers' is the retriever object created from the ChromaDB instance with Sentence Transformers embeddings
# 'invoke(query)' method performs the search using the query and returns the most relevant documents
print(retriever_sentence_transformers.invoke(query))

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[Document(id='24972478-c695-462d-8000-ec3fd5886848', metadata={'author': 'tamara', 'creationdate': '2023-10-20T11:54:29-03:00', 'creator': 'Microsoft® Office Word 2007', 'moddate': '2023-10-20T11:54:29-03:00', 'page': 24, 'page_label': '25', 'producer': 'Microsoft® Office Word 2007', 'source': 'data/regulamento-dos-cursos-de-graduacao-da-UFRN-2024.pdf', 'total_pages': 77}, page_content='das unidades, é obrigatória a realização de uma avaliação escrita, individual e presencial.  \n \n§4º  A unidade de vinculação do componen te curricular poderá, excepcionalmente, dispensar a \nobrigatoriedade estabelecida no §3º  deste artigo. \n \nArt. 101.  O rendimento acadêmico de cada unidade avaliativa é calculado a partir dos \nresultados obtidos nos instrumentos avaliativos utilizados na unidade. \n \nParágrafo único.  A quantidade de instrumentos avaliativos por unidade é definida previamente \npelo docente e divulgada no plano de ensino da turma. \n \nArt. 102.   O rendimento acadêmico parcial

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Define the chat prompt template with a series of messages
# 'from_messages' method creates a ChatPromptTemplate from a list of message tuples
# Each tuple contains a message type and the message content
prompt = ChatPromptTemplate.from_messages(
    [
        # System message to establish the assistant's role
        # This message sets the context for the assistant, instructing it to answer questions about UFRN
        # If the assistant doesn't know the answer, it should say so
        (
            "system",
            "Você é um assistente de alunos que responde a dúvidas sobre a UFRN. Use as informações fornecidas para responder às perguntas dos alunos. Se você não souber a resposta, apenas diga que não sabe.",
        ),
        # Placeholder for chat history to maintain context
        # This placeholder will be replaced with the actual chat history during execution
        ("placeholder", "{chat_history}"),
        # Human message placeholder for user input
        # This placeholder will be replaced with the user's question and context during execution
        ("human", "\nCONTEXTO: {context} \n\nPERGUNTA: {question}"),
    ]
)

In [67]:
# Define a function to format retrieved documents
# 'docs' is a list of document objects
# The function joins the page content of each document with four newline characters in between
def format_retrieved_documents(docs):
    return "\n\n\n\n".join([doc.page_content for doc in docs])

In [69]:
# Define a base runnable for the Sentence Transformers retriever
# This runnable will handle the context and question for the chat prompt
base_runnable_sentence_transformers = (
    {
        # 'context' key will use the retriever to get relevant documents and format them
        # 'retriever_sentence_transformers' retrieves relevant documents based on the query
        # 'format_retrieved_documents' formats the retrieved documents for the chat prompt
        "context": retriever_sentence_transformers | format_retrieved_documents,
        # 'question' key passes the user's question through unchanged to the prompt
        "question": RunnablePassthrough(),
    }
    # Combine the context and question with the chat prompt template
    | prompt
)

In [70]:
# Define a base runnable for the OpenAI retriever
# This runnable will handle the context and question for the chat prompt
base_runnable_openai = (
    {
        # 'context' key will use the retriever to get relevant documents and format them
        # 'retriever_openai' retrieves relevant documents based on the query
        # 'format_retrieved_documents' formats the retrieved documents for the chat prompt
        "context": retriever_openai | format_retrieved_documents,
        # 'question' key forwards the user's question unchanged to the prompt template
        "question": RunnablePassthrough(),
    }
    # Combine the context and question with the chat prompt template
    | prompt
)
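
When a dict appears at the start of a chain like the ones defined here, LangChain runs its values in parallel on the same input (the question string) and collects the results under the dict's keys before handing them to the prompt. The plain-Python sketch below imitates that data flow without LangChain. `fake_retriever`, `format_chunks`, `build_prompt_inputs`, and `fill_human_message` are hypothetical names, and the retriever returns hard-coded text instead of querying a vector store.

```python
def fake_retriever(question):
    # Hypothetical retriever: returns hard-coded chunks instead of querying a vector store
    return ["Chunk A about the topic.", "Chunk B about the topic."]


def format_chunks(chunks):
    # Mirrors format_retrieved_documents, but on plain strings
    return "\n\n\n\n".join(chunks)


def build_prompt_inputs(question):
    # Every key receives the SAME input: the raw question string
    return {
        "context": format_chunks(fake_retriever(question)),  # retriever, then formatter
        "question": question,  # forwarded unchanged, like RunnablePassthrough()
    }


def fill_human_message(inputs):
    # Same layout as the human message in the notebook's prompt template
    return "\nCONTEXTO: {context} \n\nPERGUNTA: {question}".format(**inputs)


inputs = build_prompt_inputs("O que significa uma disciplina optativa?")
print(fill_human_message(inputs))
```

This is only a mental model of the pipeline. The real runnables also handle batching, streaming, and the optional `chat_history` placeholder.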

In [71]:
# Initialize an output parser for the model's response
# 'StrOutputParser' extracts the text content of the chat model's message as a plain string
output_parser = StrOutputParser()

In [72]:
# Create a RAG (Retrieval-Augmented Generation) chain using Sentence Transformers embeddings and OpenAI model
# 'base_runnable_sentence_transformers' handles the context and question for the chat prompt using Sentence Transformers embeddings
# 'model_openai' is the OpenAI model used to generate responses based on the retrieved context and question
# 'output_parser' parses the output of the OpenAI model into a string format

rag_chain_sentence_transformers_openai = (
    base_runnable_sentence_transformers  # Use the base runnable with Sentence Transformers embeddings
    | model_openai  # Pass the result to the OpenAI model for response generation
    | output_parser  # Parse the model's output into a string format
)

In [73]:
# Create a RAG (Retrieval-Augmented Generation) chain using Sentence Transformers embeddings and Llama 3.1 model

rag_chain_sentence_transformers_llama = (
    base_runnable_sentence_transformers  # Use the base runnable with Sentence Transformers embeddings
    | model_llama  # Pass the result to the Llama 3.1 model for response generation
    | output_parser  # Parse the model's output into a string format
)

In [75]:
# Create a RAG (Retrieval-Augmented Generation) chain using OpenAI embeddings and OpenAI model

rag_chain_openai_openai = (
    base_runnable_openai  # Use the base runnable with OpenAI embeddings
    | model_openai  # Pass the result to the OpenAI model for response generation
    | output_parser  # Parse the model's output into a string format
)

In [76]:
# Create a RAG (Retrieval-Augmented Generation) chain using OpenAI embeddings and Llama 3.1 model
rag_chain_openai_llama = (
    base_runnable_openai  # Use the base runnable with OpenAI embeddings
    | model_llama  # Pass the result to the Llama 3.1 model for response generation
    | output_parser  # Parse the model's output into a string format
)

In [81]:
query = "O que significa uma disciplina optativa?"

In [82]:
rag_chain_sentence_transformers_openai.invoke(query)

'Uma disciplina optativa é um componente curricular que o estudante pode escolher cursar, ao invés de ser obrigado a fazê-lo. Essas disciplinas permitem que os alunos personalizem sua formação acadêmica, escolhendo matérias que complementem ou ampliem seus conhecimentos em áreas de interesse. Na UFRN, a carga horária das disciplinas optativas deve ser, no mínimo, 50% superior à carga horária mínima que o aluno deve cumprir nesse grupo.'

In [83]:
rag_chain_sentence_transformers_llama.invoke(query)

'Uma disciplina optativa é um componente curricular que não é obrigatório para os alunos, mas pode ser escolhido por eles como parte de sua carga horária acadêmica. Isso significa que os alunos têm a liberdade de decidir se querem ou não incluir essa disciplina em seu plano de estudos.'

In [84]:
rag_chain_openai_openai.invoke(query)

'Uma disciplina optativa é um componente curricular que faz parte de um rol de opções disponibilizado na estrutura curricular de um curso. O estudante deve escolher cursar uma carga horária mínima de disciplinas optativas para a integralização curricular, conforme estabelecido no Projeto Pedagógico de Curso. Essas disciplinas não são obrigatórias, mas oferecem ao estudante a oportunidade de diversificar sua formação acadêmica.'

In [85]:
rag_chain_openai_llama.invoke(query)

'Uma disciplina optativa é uma opção oferecida aos estudantes para complementar seu currículo acadêmico, geralmente não sendo obrigatória para a conclusão de um curso ou programa específico. Elas permitem aos alunos explorarem áreas de interesse pessoal ou profissional que não estejam diretamente relacionadas ao seu plano de estudos principal.\n\nDisciplinas optativas podem ser oferecidas em uma variedade de formatos, desde cursos teóricos até práticas, como aulas, seminários, workshops, ou até mesmo atividades extracurriculares. Elas fornecem aos estudantes a oportunidade de desenvolver habilidades específicas, explorar novos interesses ou profundos conhecimentos em áreas que não estejam cobertas pelo seu currículo principal.\n\nA inclusão de disciplinas optativas no currículo acadêmico pode ser benéfica para os alunos por várias razões. Elas permitem aos estudantes:\n\n- **Desenvolver habilidades específicas**: Disciplinas optativas podem oferecer treinamento em áreas como programaçã

In [86]:
def get_response_all_models(query):
    # Dictionary to store responses from different RAG chains
    result = {
        "sentence_transformers-openai": rag_chain_sentence_transformers_openai.invoke(
            query
        ),
        "sentence_transformers-llama": rag_chain_sentence_transformers_llama.invoke(
            query
        ),
        "openai-openai": rag_chain_openai_openai.invoke(query),
        "openai-llama": rag_chain_openai_llama.invoke(query),
    }

    # Iterate over the results and print them in a formatted manner
    for k, v in result.items():
        print("-" * 50)  # Separator for readability
        print(f"Embedding: {k.split('-')[0]}")  # Display the embedding type
        print(f"LLM: {k.split('-')[1]}")  # Display the language model used
        print(f"Response: {v}")  # Display the response
        print("-" * 50)  # Separator for readability
        print("\n")  # Newline for better formatting

    return result  # Return the dictionary with all responses


# Example query to test the function
result = get_response_all_models(
    "Quem está fazendo o projeto de IA para o Judiciário como foco em proteção ambiental?"
)

--------------------------------------------------
Embedding: sentence_transformers
LLM: openai
Response: O projeto de IA para o Judiciário com foco em proteção ambiental está sendo realizado em parceria com a UFRN, sob a coordenação do professor Elias Jacob.
--------------------------------------------------


--------------------------------------------------
Embedding: sentence_transformers
LLM: llama
Response: O professor Elias Jacob está fazendo o projeto de IA para o Judiciário com foco na proteção ambiental, no âmbito do Programa Justiça 4.0, em parceria com a UFRN (Universidade Federal do Rio Grande do Norte).
--------------------------------------------------


--------------------------------------------------
Embedding: openai
LLM: openai
Response: O projeto de IA para o Judiciário com foco em proteção ambiental está sendo realizado pelo Programa das Nações Unidas para o Desenvolvimento (Pnud), o Conselho Nacional de Justiça (CNJ) e a Universidade Federal do Rio Grande do No

In [87]:
result = get_response_all_models(
    "Qual é a frequência mínima exigida de um aluno para ser aprovado em uma disciplina na UFRN?"
)

--------------------------------------------------
Embedding: sentence_transformers
LLM: openai
Response: Não sei.
--------------------------------------------------


--------------------------------------------------
Embedding: sentence_transformers
LLM: llama
Response: Desculpe, mas não há informações disponíveis sobre a frequência mínima exigida para aprovação em uma disciplina na UFRN. As informações fornecidas parecem se concentrar mais em regulamentos e procedimentos administrativos da instituição, como matrícula, avaliação e certificação de estudantes, mas não mencionam especificamente a frequência mínima exigida para aprovação em uma disciplina.
--------------------------------------------------


--------------------------------------------------
Embedding: openai
LLM: openai
Response: A frequência mínima exigida de um aluno para ser aprovado em uma disciplina na UFRN é de 75% (setenta e cinco por cento) da carga horária do componente curricular.
-----------------------------

## Final suggestion - [BGE-M3](https://arxiv.org/pdf/2402.03216)


The BGE-M3 model, developed by the Beijing Academy of Artificial Intelligence (BAAI), is a text embedding model well suited to the retrieval step of retrieval-augmented generation (RAG) systems. It is particularly effective for applications that require multilingual support and long input texts.

### Key Concepts and Features

- **Multilingual Capabilities**:  
  The model supports more than 100 languages, making it suitable for systems operating in diverse linguistic contexts. A single model can therefore embed queries and documents written in different languages into the same vector space.

- **Flexible Input Length**:  
  BGE-M3 can process texts ranging from short sentences to full-length documents with up to 8,192 tokens. The ability to handle varying input lengths is useful in scenarios where the retrieval and subsequent generation of responses depend on extensive context. 

- **High-Quality Embeddings**:  
  The model's embeddings capture the core meaning of a text and act as the foundation for effective retrieval and subsequent text generation. Better embeddings reduce the risk of retrieving irrelevant chunks during RAG.

### Practical Advantages in RAG Systems

- **Enhanced Multilingual Processing**:
  - With support for more than 100 languages, systems using BGE-M3 can perform well in environments with multilingual data, so the retrieval and generation components can work across different languages.
  
- **Reliable Handling of Long Documents**:
  - The capability to process documents containing up to 8,192 tokens allows practitioners to work with detailed and extensive texts. This is particularly valuable in tasks where larger contexts lead to more precise and coherent responses.

- **Increased Retrieval Accuracy**:
  - By generating semantically rich embeddings, the model improves the accuracy and relevance of retrieved information. This precision is essential for generating responses that are both accurate and context-aware.


### Accessing BGE-M3 through Ollama

To use the BGE-M3 model in your RAG system, you can run it locally with Ollama. First, pull the model from the Ollama model library with the following command:

```bash
(base) jacob@schrodinger ~ % ollama pull bge-m3
pulling manifest
pulling daec91ffb5dd... 100% ▕████████████████▏ 1.2 GB
pulling a406579cd136... 100% ▕████████████████▏ 1.1 KB
pulling 0c4c9c2a325f... 100% ▕████████████████▏ 337 B
verifying sha256 digest
writing manifest
removing any unused layers
success
```

This will download and set up the BGE-M3 model, making it readily available for embedding generation in your RAG system.

And, best of all, it's `free`!

In [88]:
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(
    model="bge-m3",  # 'model' specifies the model to use, in this case 'bge-m3'
)

In [93]:
texts = [
    "Eu gostaria de saber mais sobre o processo avaliativo na UFRN.",
    "Qual é a frequência mínima exigida de um aluno para ser aprovado em uma disciplina na UFRN?",
    "Como funcionam os trabalhos na educação superior de uma Universidade federal brasileira?",
    "Cachorros são fofos",
]

In [97]:
vectors = embeddings.embed_documents(texts)

In [98]:
# Import the pairwise cosine similarity helper from scikit-learn
from sklearn.metrics.pairwise import cosine_similarity

# Calculate the cosine similarity between all pairs of vectors
cosine_sim = cosine_similarity(vectors, vectors)

# Print the cosine similarity matrix
print(cosine_sim)

[[1.         0.67352188 0.58537433 0.30498869]
 [0.67352188 1.         0.55499701 0.23583424]
 [0.58537433 0.55499701 1.         0.28292264]
 [0.30498869 0.23583424 0.28292264 1.        ]]
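
Each entry of this matrix is the cosine of the angle between two embedding vectors: their dot product divided by the product of their lengths. The sketch below computes the same quantity directly with NumPy. The three toy vectors are made up for illustration and are not real BGE-M3 embeddings, and `cosine_similarity_matrix` is a hypothetical helper name.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    # Stack the vectors into an array of shape (n_documents, n_dimensions)
    X = np.asarray(vectors, dtype=float)
    # Scale each row to unit length; the dot product of unit vectors is their cosine
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X_unit @ X_unit.T


# Toy 3-dimensional vectors, made up for illustration (not real embeddings)
toy_vectors = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
]
sim = cosine_similarity_matrix(toy_vectors)
print(np.round(sim, 3))
```

The diagonal is always 1 because every vector is perfectly aligned with itself, and the matrix is symmetric because the dot product does not depend on the order of its arguments.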


In [99]:
# Print the similarity between all pairs of documents
for i in range(len(texts)):
    for j in range(i + 1, len(texts)):
        print(f"Similarity between document {i} and document {j}: {cosine_sim[i][j]}")

Similarity between document 0 and document 1: 0.6735218770434315
Similarity between document 0 and document 2: 0.5853743252493536
Similarity between document 0 and document 3: 0.30498868606644847
Similarity between document 1 and document 2: 0.5549970108342849
Similarity between document 1 and document 3: 0.23583424154118882
Similarity between document 2 and document 3: 0.2829226353452491
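
One way to read these numbers is to ask, for each text, which other text is its nearest neighbour. The sketch below does this with the similarity values copied from the printed matrix, so it runs without Ollama. Masking the diagonal is just one simple way to ignore self-similarity.

```python
import numpy as np

# Similarity matrix copied from the notebook output
cosine_sim = np.array(
    [
        [1.0, 0.67352188, 0.58537433, 0.30498869],
        [0.67352188, 1.0, 0.55499701, 0.23583424],
        [0.58537433, 0.55499701, 1.0, 0.28292264],
        [0.30498869, 0.23583424, 0.28292264, 1.0],
    ]
)

# Ignore self-similarity by masking the diagonal before taking the argmax
masked = cosine_sim.copy()
np.fill_diagonal(masked, -np.inf)
nearest = masked.argmax(axis=1)

for i, j in enumerate(nearest):
    print(f"Document {i} is closest to document {j} (similarity {cosine_sim[i, j]:.3f})")
```

The three university-related sentences score between roughly 0.55 and 0.67 against each other, while "Cachorros são fofos" stays below 0.31 against all of them, which is the behaviour a retriever relies on.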


# Questions

1. What is Retrieval-Augmented Generation (RAG), and how does it enhance traditional generative models?

2. What are the main challenges in generative AI that RAG aims to address, and why are they significant?

3. Describe the key components of a RAG system and their roles in the overall architecture.

4. How do vector databases like ChromaDB contribute to the efficiency and effectiveness of RAG systems?

5. Outline the fundamental principles of RAG and explain their importance in improving the quality of generated text.

6. Compare and contrast RAG with traditional generative models, highlighting the advantages of using RAG.

7. Explain how RAG solves the problem of limited knowledge scope in generative models and its implications for generating accurate and up-to-date content.

8. What is the role of Maximum Marginal Relevance (MMR) search in RAG systems, and how does it enhance the diversity and relevance of retrieved information?

9. Discuss the adaptability of RAG systems to various domains and use cases, and provide examples of how they can be customized to meet specific requirements.

10. Outline the step-by-step process of building a RAG system using ChromaDB and LangChain, emphasizing the key considerations and best practices at each stage.

`Answers are commented inside this cell`


<!-- 1. Retrieval-Augmented Generation (RAG) is an AI framework that enhances traditional generative models by integrating information retrieval systems. It allows for the generation of text that is coherent, contextually relevant, accurate, and up-to-date by employing external knowledge sources during the generation process. RAG improves the quality and relevance of generated text by providing the model with access to a vast pool of information beyond its training data.

2. RAG addresses several key challenges in generative AI, including limited knowledge scope, factual inaccuracy, contextual irrelevance, data sparsity, training limitations, inflexibility, and the need for real-time information retrieval. These challenges hinder the ability of generative models to produce high-quality, accurate, and relevant content. RAG enables the creation of more reliable, adaptable, and context-aware generative AI systems.

3. A RAG system consists of two main components: a retriever and a generator. The retriever is responsible for searching and retrieving relevant information from an external knowledge source based on the input query. It uses techniques like semantic search and similarity matching to identify the most relevant information. The generator takes the retrieved information and the original query as input and produces the final output response by integrating the retrieved knowledge into the generation process.

4. Vector databases like ChromaDB play a crucial role in RAG systems by efficiently storing and querying embedding data. Embeddings are dense vector representations of text that capture semantic meaning. ChromaDB allows RAG systems to quickly retrieve the most relevant information based on similarity metrics, such as cosine similarity, between the query and the stored embeddings. This enables the retrieval of highly relevant and accurate information, enhancing the quality of generated responses.

5. The key principles of RAG include:
- Retrieving relevant information from external knowledge bases to expand the model's knowledge scope.
- Integrating retrieved information into the generation process to improve the accuracy and relevance of generated text.
- Enhancing context awareness and factual accuracy by using up-to-date and domain-specific knowledge.
- Improving flexibility and scalability by allowing the model to adapt to new information without requiring retraining.
- Providing real-time access to up-to-date information to generate current and relevant content.
These principles collectively contribute to the creation of more reliable, adaptable, and high-quality generative AI systems.

6. Compared to traditional generative models, RAG offers several advantages:
- Enhanced factual accuracy: RAG incorporates external knowledge sources, reducing the reliance on the model's training data and improving the accuracy of generated facts.
- Improved context awareness: By retrieving relevant information based on the input query, RAG generates text that is more contextually appropriate and coherent.
- Reduced reliance on extensive training data: RAG can capitalize on external knowledge sources, reducing the need for large-scale training data and enabling the generation of diverse content.
- Flexibility and adaptability: RAG systems can be easily adapted to different domains or updated with new information without requiring retraining of the entire model.

7. RAG solves the problem of limited knowledge scope in generative models by incorporating an information retrieval component. This allows the model to fetch relevant, up-to-date information from external knowledge sources, expanding the knowledge base beyond the training data.

8. Maximum Marginal Relevance (MMR) search is a technique used in RAG systems to enhance the diversity and relevance of retrieved information. MMR search aims to balance the relevance of retrieved documents to the query while also promoting diversity in the search results. It achieves this by considering both the similarity of documents to the query and the dissimilarity among the retrieved documents.

9. RAG systems can be adapted to various domains and use cases by customizing the external knowledge source, chunking strategies, embedding models, and retrieval methods. For example, in a medical domain, the knowledge source can be a database of medical literature, and the chunking strategy can be tailored to extract relevant sections like abstracts or conclusions. The embedding model can be fine-tuned on medical text to capture domain-specific semantics. Similarly, for a legal use case, the knowledge source can be a collection of legal documents, and the retrieval methods can be adapted to handle legal jargon and citation styles.

10. Building a RAG system with ChromaDB and LangChain involves the following steps:
1. Define the LLM models: Select the appropriate language model(s) for text generation based on the specific requirements of the task.
2. Load and prepare documents: Collect and preprocess the relevant documents that will serve as the external knowledge source for the RAG system.
3. Chunk the text: Split the documents into manageable segments or chunks that can be efficiently stored and retrieved.
4. Store chunks in ChromaDB: Create a ChromaDB instance and store the text chunks along with their embeddings for efficient retrieval.
5. Build retrievers: Create retriever objects from the ChromaDB vector stores to enable fast and accurate retrieval of relevant chunks based on similarity metrics.
6. Integrate retrievers with LLMs: Use LangChain's prompt templates and output parsers to integrate the retrievers with the selected LLM(s) and generate text based on the retrieved information.
7. Refine and enhance: Iterate on the RAG system by adjusting parameters, experimenting with different retrieval strategies, and improving the prompts and output parsing logic.
8. Test and evaluate: Assess the performance of the RAG system using relevant evaluation metrics and gather feedback for further improvements. -->