**Author:** J. Žovák, `482857@mail.muni.cz`

# LVD Usage With RAG Architecture

In [None]:
!pip install -q openai
!pip install -q langchain
!pip install -q datasets==2.15.0

In [1]:
from openai import OpenAI
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
import chromadb as lvd
import pandas as pd 

## Initialize OpenAI API client

For this demo I use the OpenAI API, since it does not require setting up a local LLM.

In [14]:
API_KEY = "your-api-key-here"

In [15]:
client = OpenAI(api_key=API_KEY)
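
Hard-coding the key is fine for a throwaway demo, but a shared notebook is safer if the key comes from the environment. A minimal sketch, assuming the key has been exported as `OPENAI_API_KEY` before the notebook starts (the helper name `load_api_key` is hypothetical):

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage (assumes the variable is set):
# client = OpenAI(api_key=load_api_key())
```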

## Load Dataset

In this demo we will use the `ai-arxiv-chunked` dataset from Hugging Face. This dataset contains pre-chunked arXiv papers.
Chunking is the process of splitting documents into smaller parts and is a necessary step in a RAG pipeline.
Thanks to `ai-arxiv-chunked`, we can skip this step in this demo.

In [2]:
# Use chunked version of arxiv dataset by James Calam (https://huggingface.co/datasets/jamescalam/ai-arxiv-chunked)
from datasets import load_dataset

data = load_dataset(
    'jamescalam/ai-arxiv-chunked',
)

data

DatasetDict({
    train: Dataset({
        features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
        num_rows: 41584
    })
})
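
For documents that are not pre-chunked, the chunking step could look roughly like the sketch below. It splits on whitespace and lets neighbouring chunks overlap by a few words so that a sentence cut at a boundary still appears whole in one of them. The function name `chunk_text` and the sizes are assumptions for illustration, not the settings used to build `ai-arxiv-chunked`.

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks that overlap by `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_text(sample, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```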

In [3]:
pd_data =  pd.DataFrame(data['train'])

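We will use only data published in October 2021 or later, since the OpenAI model used in this demo has training data up to September 2021. Questions about these papers therefore cannot be answered from the model's own knowledge.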

In [4]:
pd_data['published'] = pd.to_datetime(pd_data['published'], format='%Y%m%d')

# Keep only chunks of papers published on or after 1 October 2021 and renumber the rows.
pd_data_new = pd_data[pd_data['published'] >= pd.to_datetime('2021-10-01')].reset_index(drop=True)

In [7]:
documents = pd_data_new['chunk'].tolist()

## LVD Setup

For this demo, the `all-MiniLM-L6-v2` model is used as the embedding function.

In [5]:
from chromadb.utils import embedding_functions

embedding_model = embedding_functions.SentenceTransformerEmbeddingFunction(device='cuda')




Create a collection with the embedding function and the LMI configuration.

In [6]:
chroma_client = lvd.Client()
collection = chroma_client.create_collection(
  name='news',
  # Reuse the embedding function created above instead of a second default instance.
  embedding_function=embedding_model,
  metadata={
    "lmi:n_categories": "[10]",
  }
)

Upload and embed the documents.

In [9]:
collection.add(
    ids=[f"id{i}" for i in range(len(documents))],
    documents=documents
)


            LMI Build Config:
            {
                clustering_algorithms: [<function cluster at 0x0000024491B3D550>],
                epochs: [200],
                model_types: ['MLP'],
                learning_rate: [0.01],
                n_categories: [10],
            }
             


Build the LMI index on the embedded document chunks.

In [None]:
collection.build_index()
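
The `n_categories` value of `[10]` in the configuration and the `n_buckets` argument used in the query later suggest a bucket-based index: vectors are partitioned into buckets, and a query inspects only the most promising ones. The sketch below illustrates that general idea on toy 2-D vectors. It is not LVD's implementation: it uses fixed centroids and nearest-centroid assignment where the build config above lists a clustering algorithm and an MLP, and the names `build_buckets` and `search` are hypothetical.

```python
import math


def build_buckets(vectors, centroids):
    """Assign each vector index to the bucket of its nearest centroid."""
    buckets = {b: [] for b in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda b: math.dist(vec, centroids[b]))
        buckets[nearest].append(idx)
    return buckets


def search(query, vectors, centroids, buckets, n_results=2, n_buckets=1):
    """Scan only the n_buckets closest buckets and return the nearest vector indices."""
    ranked = sorted(range(len(centroids)), key=lambda b: math.dist(query, centroids[b]))
    candidates = [idx for b in ranked[:n_buckets] for idx in buckets[b]]
    candidates.sort(key=lambda idx: math.dist(query, vectors[idx]))
    return candidates[:n_results]


# Toy data: three well-separated groups and one centroid per group (assumed values).
vectors = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.9, 0.1)]
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
buckets = build_buckets(vectors, centroids)
print(search((0.1, 0.1), vectors, centroids, buckets))  # [0, 1]
```

In this toy version, raising `n_buckets` widens the search: more candidates are compared, which costs time but lowers the chance of missing a near neighbour that landed in an adjacent bucket.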

## RAG Pipeline

In this demo I use `gpt-3.5-turbo` from [OpenAI](https://platform.openai.com/docs/models/gpt-3-5-turbo), which has training data up to September 2021.
The `llm_pipeline` function is a bare-bones call to the OpenAI API with the user prompt.

In [16]:
def llm_pipeline(prompt, context = ""):
    additional_context = f"Answer user prompt based on the following context: {context}." if context else ""
    system_prompt = f"You are a generic chatbot assistant. {additional_context}"

    completion = client.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt}
      ],
      max_tokens=100,
      temperature=0.0,
    )
    return completion.choices[0].message.content

Below is the definition of `rag_pipeline`, which represents a simple RAG architecture. It takes the user prompt and uses it to run a search query against the LVD.
The result of the search query is the context that the LLM receives. Thanks to this context, the LLM can generate an up-to-date answer.

In [67]:
def rag_pipeline(prompt, keywords):
    results = collection.query(
        query_texts=[prompt],
        include=["documents"],
        n_results=5,
        n_buckets=2,
        where_document={"$hybrid":{ "$hybrid_terms": keywords}}
    )
    context = results['documents'][0][0]
    
    answer = llm_pipeline(prompt, context)
    return answer
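
The query above asks for five results but passes only the top-ranked chunk to the LLM. One possible variation, sketched below, joins several retrieved chunks into the context while capping its length. It assumes the result has the same shape as in `rag_pipeline` (one list of documents per query text); the helper name `build_context` and the character budget are hypothetical.

```python
def build_context(results: dict, max_chunks: int = 3, max_chars: int = 4000) -> str:
    """Join the top retrieved chunks for the first query into one context string."""
    chunks = results["documents"][0][:max_chunks]
    parts, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break  # stop before the chunk text exceeds the character budget
        parts.append(chunk)
        used += len(chunk)
    return "\n\n".join(parts)


# Stand-in for a query result; real results would come from collection.query(...).
fake_results = {"documents": [["chunk A", "chunk B", "chunk C", "chunk D"]]}
print(build_context(fake_results, max_chunks=2))
```

More context gives the model more to work with, but it also increases the prompt size, so the budget needs to stay within the model's context window.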

Note that in a production environment, RAG would typically also be integrated into an LLM application framework such as [LangChain](https://www.langchain.com/) or [LlamaIndex](https://www.llamaindex.ai/).

## RAG Test

In [57]:
user_prompt = "Can you explain to me what CONDAQA can be used for in machine learning?"

llm_answer = llm_pipeline(user_prompt)
print("LLM Answer: \n", llm_answer)

[2024-05-04 15:37:48,127][INFO ][httpx] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


LLM Answer: 
 CONDAQA is not a commonly known term in the field of machine learning. It is possible that you may have mistyped or misheard the term. If you can provide more context or clarify the term you are referring to, I would be happy to help you understand its usage in machine learning.


In [68]:
rag_answer = rag_pipeline(user_prompt, ["CONDAQA", "contrastive"])
print("RAG Answer: \n", rag_answer)

[2024-05-04 15:43:50,532][INFO ][httpx] HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


RAG Answer: 
 CONDAQA, which stands for Contrastive Reading Comprehension Dataset for Reasoning about Negation, can be used in machine learning for training and evaluating models that are designed to understand and reason about negation in text. This dataset provides examples where negation plays a crucial role in answering questions, making it a valuable resource for developing natural language processing models that can accurately interpret and respond to negated statements. By using CONDAQA, researchers and practitioners can improve the performance of machine learning models on
