# Environment Setup

## Install Necessary Libraries
The libraries include:
- LangChain framework
- GPT4ALL, OpenAI and HuggingFace for various embedding methods and LLMs
- Document loaders
- Dependent libraries

__Note__:
- A C++ build toolchain is required to build a library that Chroma depends on. See https://github.com/bycloudai/InstallVSBuildToolsWindows for instructions.
- Python version: 3.12.4
- Pydantic version: 2.7.3. There is a known issue with Pydantic 1.10.8.
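
To confirm the interpreter and Pydantic versions noted above, the installed versions can be compared as tuples. This is a minimal sketch: `parse_version` and `meets_minimum` are hypothetical helper names, and the parser assumes plain dotted versions such as `2.7.3`.

```python
import re
import sys
from importlib import metadata


def parse_version(text):
    """Turn a dotted version string such as '2.7.3' into a tuple of ints."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first component without a leading number
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '2' and '2.0' compare as equal
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


print("Python:", sys.version.split()[0])
try:
    pydantic_version = metadata.version("pydantic")
    print("Pydantic:", pydantic_version, "| v2 or later:", meets_minimum(pydantic_version, "2.0"))
except metadata.PackageNotFoundError:
    print("Pydantic is not installed in this environment")
```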

In [None]:
%pip install --upgrade -r requirements.txt

In [2]:
%pip install -qU langchain-ollama

Note: you may need to restart the kernel to use updated packages.





### Get Environment Parameters
Prepare the following parameters in a `.env` file for later use.
Parameters:
- API keys for LLMs
    - OPENAI_API_KEY 
    - HUGGINGFACEHUB_API_TOKEN 
- Directory / location for documents and vector databases
    - DOC_ARVIX = "./source/from_arvix/"
    - DOC_WIKI = "./source/from_wiki/"
    - VECTORDB_OPENAI_EM = "./vector_db/openai_embedding/"
    - VECTORDB_MINILM_EM = "./vector_db/gpt4all_miniLM/"
    - TS_RAGAS = "./evaluation/testset/by_RAGAS/"
    - TS_PROMPT = "./evaluation/testset/by_direct_prompt/"
    - EVAL_DATASET = "./evaluation/evaluation_data_set/"
    - EVAL_METRIC = "./evaluation/evaluation_metric"
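
Before running the later cells, it can help to check that every parameter listed above is actually set. The sketch below is one possible check, not part of the original notebook: `REQUIRED_PARAMS` and `missing_params` are hypothetical names, and the example uses an explicit dictionary rather than the real environment.

```python
import os

# Parameter names taken from the list above
REQUIRED_PARAMS = [
    "OPENAI_API_KEY", "HUGGINGFACEHUB_API_TOKEN",
    "DOC_ARVIX", "DOC_WIKI",
    "VECTORDB_OPENAI_EM", "VECTORDB_MINILM_EM",
    "TS_RAGAS", "TS_PROMPT", "EVAL_DATASET", "EVAL_METRIC",
]


def missing_params(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Example with a placeholder mapping instead of the real environment
example_env = {"OPENAI_API_KEY": "placeholder", "DOC_ARVIX": "./source/from_arvix/"}
print(missing_params(["OPENAI_API_KEY", "DOC_ARVIX", "DOC_WIKI"], example_env))
```

After `load_dotenv()` has run, calling `missing_params(REQUIRED_PARAMS)` checks the real environment.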


In [32]:
# Get the environment parameters
import os
from dotenv import load_dotenv
load_dotenv()

True

# I. Architecture 

## A. Simple RAG Flow

<img src="diagrams/HL architecture.png" alt="HL arc" title= "HL Architecture" />

The system comprises five components:

- Internal data and documents: The system starts with a collection of internal documents and/or structured databases. Documents can be text, PDF, image, or video files. These documents and data are the sources for the knowledge base.

- Embedding processor: The documents and database entries are processed to create vector embeddings. Embeddings are numerical representations of the documents in a high-dimensional space that capture their semantic meaning.

- Vector database: The vectorized document chunks and database entries are stored in a vector database so they can be searched and retrieved at a later stage.

- Query processor: The query processor takes the user's query and performs a semantic search against the vector database to retrieve the most relevant document chunks. It then combines the user's original query with the retrieved chunks to form a context-rich (augmented) query, which helps the LLM generate a more accurate and relevant response.

- LLM: A pre-trained large language model that receives the augmented query and generates a response based on the query and the relevant documents.

The system involves 2 main pipelines: the embedding pipeline and the retrieval pipeline. Each pipeline has specific stages and processes that contribute to the overall functionality of the system.
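
The two pipelines can be illustrated with a toy example. This is only a sketch under loud assumptions: word counts stand in for a real embedding model, a Python list stands in for the vector database, and the names `embed`, `cosine`, `store`, and `retrieve` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Embedding pipeline: vectorize each chunk and keep it in the "vector database"
chunks = [
    "RAG combines retrieval with text generation",
    "Chroma is a vector database",
    "Tallahassee is the capital of Florida",
]
store = [(chunk, embed(chunk)) for chunk in chunks]


# Retrieval pipeline: embed the query the same way and rank the stored chunks
def retrieve(query, k=1):
    query_vec = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


query = "what is a vector database"
context = retrieve(query)
augmented_query = f"Context: {context[0]}\n\nQuestion: {query}"
print(augmented_query)
```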

In this experiment, we use LangChain as the framework to build a simple RAG as a chain of tasks that interacts with surrounding services such as parsing, embedding, the vector database, and LLMs.

## B. MultiModal RAG Architecture
<img src="diagrams/ISM6564-Project.png" alt="HL arc" title= "MM HL Architecture" />

# II. Implementation

## A. Ingestion Pipeline

### Step 1. Data Collection

In this step, we load data from various sources and prepare it for ingestion.
We download 5 articles from arXiv with the query "RAG for Large Language Model" and store them locally, ready for the embedding steps that follow.

#### From arXiv

In [19]:
import arxiv

client = arxiv.Client()
search = arxiv.Search(
    query="RAG for Large Language Model",  # Change the query or max_results to fetch other topics or more papers
    max_results=5,
    # sort_by=arxiv.SortCriterion.SubmittedDate
)

# Materialize the result generator so it can be iterated more than once
all_results = list(client.results(search))

In [20]:
# Print out the articles' titles
for r in all_results:
    print(f"{r.title} {r.entry_id}")

MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries http://arxiv.org/abs/2401.15391v1
Prompt-RAG: Pioneering Vector Embedding-Free Retrieval-Augmented Generation in Niche Domains, Exemplified by Korean Medicine http://arxiv.org/abs/2401.11246v1
Seven Failure Points When Engineering a Retrieval Augmented Generation System http://arxiv.org/abs/2401.05856v1
The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG) http://arxiv.org/abs/2402.16893v1
CLAPNQ: Cohesive Long-form Answers from Passages in Natural Questions for RAG systems http://arxiv.org/abs/2404.02103v1


In [32]:
# Purpose: download the articles and save them in a pre-defined location for later use
# Prepare: set the environment parameter DOC_ARVIX to the path where articles are saved
# Download and save the articles as PDFs in the "RAG_for_LLM" folder under the DOC_ARVIX path
DOC_ARVIX = os.getenv("DOC_ARVIX")
directory_path = os.path.join(DOC_ARVIX, "RAG_for_LLM")
os.makedirs(directory_path, exist_ok=True)  # no error if the folder already exists
for r in all_results:
    r.download_pdf(dirpath=directory_path)

#### From Springer

#### From Lexis

### Step 2. Embeddings

This step and the previous one are usually processed together. They are separated here to highlight that they are not always coupled.
We use the DirectoryLoader and PyMuPDFLoader classes from LangChain to load and parse all .pdf files in the directory.
Corresponding loaders exist for other data types such as Excel, presentations, and unstructured files.

Refer to https://python.langchain.com/v0.1/docs/integrations/document_loaders/ for other available loaders.
We also use the OCR library rapidocr to extract text from images. The trade-off is processing time: parsing 5 PDF files took 18 minutes with OCR compared to 0.1 seconds without.

#### 1. Text Parsing

In [None]:
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import PyMuPDFLoader
import os

# Load every file of the given data type from a directory
def load_directory(directory_path, data_type, ocr=False):
    if data_type != "pdf":
        # Only PDF is handled here; fail loudly instead of hitting an undefined loader
        raise ValueError(f"Unsupported data type: {data_type}")
    pdf_loader = DirectoryLoader(
        path=directory_path,
        glob="*.pdf",
        loader_cls=PyMuPDFLoader,
        # When ocr is True, run OCR so that text inside images is extracted as well
        loader_kwargs={"extract_images": ocr},
    )
    return pdf_loader.load()

In [34]:
directory_path = os.path.join(os.getenv("DOC_ARVIX"), "RAG_for_LLM")
# Keep the parsed documents; the chunking step reads pdf_documents
pdf_documents = load_directory(directory_path, "pdf")

#### 2. Text Chunking

Divide the data into smaller chunks for easier handling, processing, and retrieval.
The embedding service used at a later stage limits the number of tokens it can process, so documents must be split into smaller chunks.
LangChain offers many chunking methods, of which recursive character splitting and semantic chunking are the most popular.

Reference: https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/ 
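
To make the chunk size and chunk overlap parameters concrete, here is a simplified sketch that cuts text into fixed-width character windows. It is not the algorithm used by RecursiveCharacterTextSplitter, which also tries to split on separators such as paragraphs and sentences; `chunk_text` is a hypothetical helper written only for illustration.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for piece in chunk_text(sample, chunk_size=10, chunk_overlap=3):
    print(piece)
```

Each window starts 7 characters after the previous one, so neighbouring chunks share 3 characters. The overlap keeps a sentence that straddles a boundary partly visible in both chunks.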

In [54]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=30)
text_chunks = text_splitter.split_documents(pdf_documents)

#### 3. Text Vectorizing

Vectors are semantic representations of text.
Vectorizing is an important step that makes documents searchable in the later pipeline.
Embedding is an essential step in the Transformer architecture that underlies modern LLMs. Many LLM providers therefore offer their embedding functions as ready-to-use services, e.g. the OpenAI embedding API. However, it is important to consider the privacy risk of exposing internal data to those services.

IMPORTANT NOTE: 
1. The embedding method used for similarity search in the retrieval pipeline must be the same as the one used to vectorize documents in this step.
2. Public embedding services such as OpenAIEmbeddings incur a usage cost and may expose internal data.
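
The first point can be illustrated with a small sketch. Similarity search compares a query vector with stored document vectors, which only makes sense when both come from the same embedding model. The vectors below are made-up numbers, not real embeddings, and `cosine_similarity` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_vector = [0.1, 0.3, 0.5]        # made-up document embedding
query_same_model = [0.1, 0.2, 0.6]  # made-up query embedding from the same model
query_other_model = [0.4, 0.9]      # made-up embedding from a model with another dimension

print(round(cosine_similarity(doc_vector, query_same_model), 3))
try:
    cosine_similarity(doc_vector, query_other_model)
except ValueError as error:
    print("Cannot compare:", error)
```

Two different models can also produce vectors of the same dimension. In that case no error is raised, but the similarity scores are not meaningful, so the check above is necessary rather than sufficient.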

Reference: https://python.langchain.com/v0.1/docs/modules/data_connection/text_embedding/

In [5]:
from langchain_openai.embeddings import OpenAIEmbeddings  # Swap this import to use other embeddings, e.g. Llama or Gemini
embeddings = OpenAIEmbeddings()

ModuleNotFoundError: No module named 'langchain_openai'

#### 4. Image Extraction

Extract the images from each PDF. The expected result is a list of images for each PDF.

In [None]:
## Define a function that extracts images from a PDF using PyMuPDF (fitz)
import os
import fitz

def extract_images_from_pdf(pdf_path, output_folder):
    os.makedirs(output_folder, exist_ok=True)
    document = fitz.open(pdf_path)
    image_files = []

    for page_num in range(len(document)):
        page = document[page_num]
        image_list = page.get_images(full=True)

        # Export the images of this page before moving on to the next page
        for image_index, img in enumerate(image_list, start=1):
            xref = img[0]
            base_image = document.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]  # could be 'png' or 'jpeg'
            image_filename = f"{output_folder}/page_{page_num+1}_img_{image_index}.{image_ext}"
            with open(image_filename, "wb") as img_file:
                img_file.write(image_bytes)
            image_files.append(image_filename)
            print(f"Exported: {image_filename}")

    # Return the saved file paths so callers get a list of images per PDF
    return image_files

#### 5. Image Summary

Use an LLM, e.g. Llama3.1 or Gemini, to generate a summary of an image.

In [None]:
## Connect to LLM 
from langchain_openai.chat_models import ChatOpenAI
from langchain_huggingface import HuggingFaceEndpoint 
from langchain_ollama.chat_models import ChatOllama
import os
from dotenv import load_dotenv

llm_model = {
    "GPT_3_5_TURBO" : "gpt-3.5-turbo",
    "GPT_4" : "",
    "GPT_4_PREVIEW" : "gpt-4-1106-preview",
    "LOCAL_GPT4ALL" : "",
    "MISRALAI" : "mistralai/Mistral-7B-Instruct-v0.2",
    "LLAMA3_70B" : "meta-llama/Meta-Llama-3-70B-Instruct",
    "ZEPHYR_7B" : "HuggingFaceH4/zephyr-7b-beta",
    "OLLAMA_GEMMA2" : "gemma2",
    "OLLAMA_LLAMA3" : "llama3",
    "OLLAMA_LLAMA3.1" : "llama3.1"
}

def connectLLM(model):
    load_dotenv()

    # Connect to an OpenAI chat model: online, token-based
    if model == "GPT_3_5_TURBO" or model == "GPT_4_PREVIEW":
        return ChatOpenAI(openai_api_key=os.getenv("OPENAI_API_KEY"), model=llm_model[model])

    # Connect to a HuggingFace chat model: online, token-based
    # Note: using Llama3 requires registering on the HuggingFace website
    if model == "LLAMA3_70B" or model == "MISRALAI" or model == "ZEPHYR_7B":
        repo_id = llm_model[model]
        return HuggingFaceEndpoint(
            repo_id=repo_id,
            max_length=128,
            temperature=0.5,
            huggingfacehub_api_token=os.getenv("HUGGINGFACEHUB_API_TOKEN")
        )

    # Connect to Ollama for the Llama3, Llama3.1 and Gemma2 chat models
    # These models run locally and must be downloaded first; see the Ollama instructions for installing Ollama and pulling models
    if model == "OLLAMA_GEMMA2" or model == "OLLAMA_LLAMA3" or model == "OLLAMA_LLAMA3.1":
        return ChatOllama(model=llm_model[model])

    # Unknown or unconfigured model name (e.g. "GPT_4" and "LOCAL_GPT4ALL" have no model ID yet)
    raise ValueError(f"Unsupported model: {model}")

In [4]:
## Return a summary string for an input image
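
The cell above is still a placeholder. As one possible starting point, the sketch below only prepares the input for such a summary: it encodes image bytes as a base64 data URL inside an OpenAI-style list of content parts. It assumes a vision-capable chat model that accepts `image_url` content parts, which is not guaranteed for every model listed in `llm_model`; `build_image_summary_content` and its default instruction are hypothetical.

```python
import base64


def build_image_summary_content(image_bytes, image_ext,
                                instruction="Summarize this image in two sentences."):
    """Build OpenAI-style content parts: one text instruction plus one inline image."""
    ext = image_ext.lower()
    mime = "image/jpeg" if ext in ("jpg", "jpeg") else f"image/{ext}"
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return [
        {"type": "text", "text": instruction},
        {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
    ]


# Placeholder bytes stand in for a real image file read from disk
content = build_image_summary_content(b"placeholder image bytes", "png")
print(content[1]["image_url"]["url"][:30])
```

With LangChain, such a list could be wrapped in a `HumanMessage(content=content)` from `langchain_core.messages` and passed to a multimodal chat model. Whether the reply is a useful summary depends on the model chosen.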

#### 6. Image + Summary Vectorization

#### 7. Article Summary
Use an LLM to summarize the paper, either as text or as images (by converting the PDF to images).

#### 8. Topic Modeling

#### 9. Store Article Summary + Topic Model

#### 10. Store Vector DB

Several vector databases are available, e.g. Chroma, FAISS, Pinecone ...
We will create a Chroma vector database with the OpenAI embedding method.

Note: Different embedding methods produce vectors of different dimensions, which cannot be stored together.
The same embedding method must also be used in the retrieval pipeline.
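
One way to guard against mixing embedding methods is to record which model built a store and check it again before retrieval. The sketch below is an assumption, not a Chroma feature: the sidecar file name `embedding_meta.json`, the helper names, and the model names and dimension in the example are all placeholders.

```python
import json
import os
import tempfile

META_FILE = "embedding_meta.json"  # hypothetical sidecar file kept next to the store


def record_embedding(persist_directory, model_name, dimension):
    """Write the embedding model name and vector dimension next to the store."""
    os.makedirs(persist_directory, exist_ok=True)
    with open(os.path.join(persist_directory, META_FILE), "w", encoding="utf-8") as f:
        json.dump({"model": model_name, "dimension": dimension}, f)


def check_embedding(persist_directory, model_name):
    """Raise if the store was built with a different embedding model."""
    with open(os.path.join(persist_directory, META_FILE), encoding="utf-8") as f:
        meta = json.load(f)
    if meta["model"] != model_name:
        raise ValueError(f"store was built with {meta['model']!r}, not {model_name!r}")
    return meta


# Demonstration in a temporary directory with placeholder values
with tempfile.TemporaryDirectory() as tmp:
    record_embedding(tmp, "embedding-model-a", 1536)
    print(check_embedding(tmp, "embedding-model-a"))
    try:
        check_embedding(tmp, "embedding-model-b")
    except ValueError as error:
        print("Rejected:", error)
```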

Reference: https://python.langchain.com/v0.1/docs/modules/data_connection/vectorstores/ 

In [None]:
import pandas as pd
import os
from dotenv import load_dotenv
from langchain.vectorstores import Chroma
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings



CHROMA_OPENAI_RAG_FOR_LLM = "CHROMA_OPENAI_RAG_FOR_LLM"
CHROMA_HF_RAG_FOR_LLM = "CHROMA_HF_RAG_FOR_LLM"
CHROMA_MINILM_RAG_FOR_LLM = "CHROMA_MINILM_RAG_FOR_LLM"
CHROMA_OLLAMA_RAG_FOR_LLM = "CHROMA_OLLAMA_RAG_FOR_LLM"

# IMPORTANT: The Chroma instance cannot be initiated within a .py file; doing so crashes the kernel.
class VectorBD:
    
    def __init__(self,
                 vectordb_name) -> None:
        load_dotenv()
#       OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
#       print(OPENAI_API_KEY)
        if vectordb_name == CHROMA_OPENAI_RAG_FOR_LLM:
            self.vectordb_directory = os.path.join(os.getenv("VECTORDB_OPENAI_EM"),"RAG_for_LLM")
            self.embeddings = OpenAIEmbeddings()
            self.vectordb =  Chroma(persist_directory=self.vectordb_directory, embedding_function=self.embeddings)
            self.retriever = self.vectordb.as_retriever()

        if vectordb_name == CHROMA_MINILM_RAG_FOR_LLM:
            self.vectordb_directory = os.path.join(os.getenv("VECTORDB_MINILM_EM"),"RAG_for_LLM")
            self.embeddings = GPT4AllEmbeddings(model_name="all-MiniLM-L6-v2.gguf2.f16.gguf", gpt4all_kwargs={'allow_download': 'True'})
            self.vectordb =  Chroma(persist_directory=self.vectordb_directory, embedding_function=self.embeddings)
            self.retriever = self.vectordb.as_retriever()

        if vectordb_name == CHROMA_OLLAMA_RAG_FOR_LLM:
            self.vectordb_directory = os.path.join(os.getenv("VECTORDB_OLLAMA_EM"),"RAG_for_LLM")
            self.embeddings = OllamaEmbeddings(model="llama3.1")
            self.vectordb =  Chroma(persist_directory=self.vectordb_directory, embedding_function=self.embeddings)
            self.retriever = self.vectordb.as_retriever()

        if vectordb_name == CHROMA_HF_RAG_FOR_LLM:
            self.vectordb_directory = os.path.join(os.getenv("VECTORDB_HF_EM"),"RAG_for_LLM")
            self.embeddings = HuggingFaceEmbeddings()
            self.vectordb =  Chroma(persist_directory=self.vectordb_directory, embedding_function=self.embeddings)
            self.retriever = self.vectordb.as_retriever()       

    def vectorizing(self, documents):
        self.vectordb = Chroma.from_documents(documents=documents,embedding=self.embeddings, persist_directory=self.vectordb_directory)
        self.vectordb.persist()

    def invoke(self,question):
#       print(self.retriever.invoke("What is RAG?"))
        return self.retriever.invoke(question)

def connect_km(km_name):
    load_dotenv()
#   OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
#   print(OPENAI_API_KEY)
    if km_name == CHROMA_OPENAI_RAG_FOR_LLM:
        km_dir = os.path.join(os.getenv("VECTORDB_OPENAI_EM"),"RAG_for_LLM")
        km_embeddings = OpenAIEmbeddings()
        km_db =  Chroma(persist_directory=km_dir, embedding_function=km_embeddings)
        return km_db

## B. Retrieval Pipeline

The retrieval pipeline retrieves relevant chunks from the previously vectorized knowledge base to enrich the LLM prompt with specific context. This pipeline runs in response to each user query.

If a persisted store exists, load it first; here it is the Chroma vector database we have just persisted.
Then perform a semantic search in the vector database to retrieve the relevant embedded documents.

NOTE: The embedding method used in this step must be the same as the one used to vectorize the knowledge in the previous pipeline.

There is room to improve the efficiency and quality of the similarity search, especially as the knowledge base grows larger and its source types become more varied.

In [42]:
import os
from dotenv import load_dotenv
load_dotenv()

True

In [43]:
user_query = "What is retrieval augmented generation?"
#user_query = "Describe the RAG-Sequence Model?"

### Step 3. Retrieval

#### 1. Text Retrieval

In [44]:
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
db_directory = os.getenv("VECTORDB_OPENAI_EM")
db_directory = os.path.join(db_directory,"RAG_for_LLM")
embeddings = OpenAIEmbeddings()
vectordb = Chroma(persist_directory=db_directory, embedding_function=embeddings)
retriever = vectordb.as_retriever()

In [45]:
retriever.invoke(user_query)

[Document(metadata={'author': '', 'creationDate': "D:20240120233737+09'00'", 'creator': '', 'file_path': 'source\\from_arvix\\RAG_for_LLM\\2401.11246v1.Prompt_RAG__Pioneering_Vector_Embedding_Free_Retrieval_Augmented_Generation_in_Niche_Domains__Exemplified_by_Korean_Medicine.pdf', 'format': 'PDF 1.7', 'keywords': '', 'modDate': "D:20240120233737+09'00'", 'page': 1, 'producer': 'Microsoft: Print To PDF', 'source': 'source\\from_arvix\\RAG_for_LLM\\2401.11246v1.Prompt_RAG__Pioneering_Vector_Embedding_Free_Retrieval_Augmented_Generation_in_Niche_Domains__Exemplified_by_Korean_Medicine.pdf', 'subject': '', 'title': 'Microsoft Word - Prompt-GPT_v1', 'total_pages': 26, 'trapped': ''}, page_content='2 \n1. Introduction \nRetrieval-Augmented Generation (RAG) models combine a generative model with an information \nretrieval function, designed to overcome the inherent constraints of generative models.(1) They \nintegrate the robustness of a large language model (LLM) with the relevance and up-t

#### 2. Image Retrieval

#### 3. Reranking and Document Selection

#### 4. Augmented Prompt

There are many ways to write the prompt. Essentially, it instructs the LLM to generate a result based on the {question} and the {context}.

The context comes from the documents retrieved in the previous step.
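
To show how the two placeholders are filled, the sketch below uses plain `str.format` as a stand-in for a prompt template class. The names `example_template`, `format_context`, and `build_prompt` are hypothetical, and the chunk texts are made up.

```python
example_template = """
Answer the question based on the context below.
If you can't answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""


def format_context(chunks):
    """Join the retrieved chunk texts into a single context string."""
    return "\n\n".join(chunks)


def build_prompt(question, chunks):
    """Fill the template with the retrieved context and the user's question."""
    return example_template.format(context=format_context(chunks), question=question)


retrieved_chunks = [
    "RAG models combine a generative model with an information retrieval function.",
    "The retrieved documents are added to the prompt as context.",
]
print(build_prompt("What is retrieval augmented generation?", retrieved_chunks))
```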

In [46]:
from langchain.prompts import ChatPromptTemplate

QA_RAG = "SIMPLE_QUESTION_ANSWER_RAG"

MM_QA_RAG = "MULTIMODAL_QUESTION_ANSWER_RAG"

prompt_type = {
    "QA_RAG" : "SIMPLE_QUESTION_ANSWER_RAG",
    "MM_QA_RAG" : "MULTIMODAL_QUESTION_ANSWER_RAG",
}

simple_rag_template = """
Answer the question based on the context below. 
If you can't answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""
multimodal_rag_template = """
To define the new Prompt.

Context: {context}

Question: {question}
"""

def initPrompt(type) -> ChatPromptTemplate:
    # Use the multimodal template when requested; otherwise default to the simple RAG template
    if type == prompt_type["MM_QA_RAG"]:
        return ChatPromptTemplate.from_template(multimodal_rag_template)
    return ChatPromptTemplate.from_template(simple_rag_template)

In [47]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
setup = RunnableParallel(context=retriever, question=RunnablePassthrough())

### Step 4. Generation

We now send the augmented prompt to an LLM to generate a response to the user's query. The response is then parsed into readable text.
In this experiment, we use the OpenAI model GPT-3.5-Turbo.

Note: There are many options for LLM selection, from public to private and from simple to advanced. Privacy, performance, and quality are the trade-offs to consider.
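
The flow described above (retrieve context, fill the prompt, call the model, parse the reply) can be pictured as function composition, which is the idea LangChain's `|` operator expresses. The sketch below uses stub functions only; `make_chain` and every `fake_*` name are hypothetical, and no real retriever or LLM is called.

```python
def make_chain(*steps):
    """Compose steps so that the output of each one feeds the next."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


def fake_setup(question):
    # Stand-in for the retrieval step: pair the question with a fixed context
    return {"context": "RAG combines retrieval with generation.", "question": question}


def fake_prompt(inputs):
    # Stand-in for the prompt template
    return f"Context: {inputs['context']}\n\nQuestion: {inputs['question']}"


def fake_model(prompt_text):
    # Stand-in for the LLM: returns a message-like dictionary
    return {"content": f"(stub answer based on {len(prompt_text)} prompt characters)"}


def fake_parser(message):
    # Stand-in for the output parser: keep only the text
    return message["content"]


toy_chain = make_chain(fake_setup, fake_prompt, fake_model, fake_parser)
print(toy_chain("What is RAG?"))
```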

#### 1. QA Generation 
Use an LLM to generate a response to the augmented query.

In [48]:
from langchain_openai.chat_models import ChatOpenAI
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo")

In [26]:
from langchain_ollama.chat_models import ChatOllama
model = ChatOllama(model="gemma2")

In [15]:
from langchain_community.llms import GPT4All
local_path = ("C:\\Users\\derek\\Meta-Llama-3-8B-Instruct.Q4_0.gguf" )
model = GPT4All(model=local_path, verbose=False)


In [49]:
from langchain_core.output_parsers import StrOutputParser
parser = StrOutputParser()

In [50]:
# Build the prompt (it was never instantiated in earlier cells), then define a chain of tasks
prompt = initPrompt(QA_RAG)
chain = setup | prompt | model | parser

In [51]:
response = chain.invoke(user_query)
response

'Retrieval-Augmented Generation (RAG) models combine a generative model with an information retrieval function, designed to overcome the inherent constraints of generative models.'

In [53]:
from langchain_openai.chat_models import ChatOpenAI
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo")

question  = "what is the capital of Florida?"

model.invoke(question)

AIMessage(content='Tallahassee', response_metadata={'token_usage': {'completion_tokens': 4, 'prompt_tokens': 14, 'total_tokens': 18}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-2e0a5937-2cce-441a-9f95-6e7c0ec0378d-0', usage_metadata={'input_tokens': 14, 'output_tokens': 4, 'total_tokens': 18})

In [52]:
from langchain_ollama.chat_models import ChatOllama
model = ChatOllama(model="llama3.1")

question  = "what is the capital of Florida?"

model.invoke(question)

AIMessage(content='The capital of Florida is Tallahassee.', response_metadata={'model': 'llama3.1', 'created_at': '2024-08-02T23:19:21.5033819Z', 'message': {'role': 'assistant', 'content': ''}, 'done_reason': 'stop', 'done': True, 'total_duration': 2921906800, 'load_duration': 2792235500, 'prompt_eval_count': 17, 'prompt_eval_duration': 18429000, 'eval_count': 10, 'eval_duration': 109282000}, id='run-faf38f8c-70b1-453f-9a4b-307fdfae7d85-0', usage_metadata={'input_tokens': 17, 'output_tokens': 10, 'total_tokens': 27})

In [37]:
from langchain_ollama.chat_models import ChatOllama
model = ChatOllama(model="gemma2")

question  = "what is the capital of Florida?"

model.invoke(question)

AIMessage(content='The capital of Florida is **Tallahassee**. \n', response_metadata={'model': 'gemma2', 'created_at': '2024-07-29T01:00:44.0710439Z', 'message': {'role': 'assistant', 'content': ''}, 'done_reason': 'stop', 'done': True, 'total_duration': 2743309000, 'load_duration': 2525563000, 'prompt_eval_count': 16, 'prompt_eval_duration': 24912000, 'eval_count': 12, 'eval_duration': 190948000}, id='run-2f2c7c7e-37f6-403c-b0d8-82c638a242d3-0', usage_metadata={'input_tokens': 16, 'output_tokens': 12, 'total_tokens': 28})

In [None]:
import llm_connector as llm

model = llm.connectLLM("LLAMA3_70B")

question  = "what is the capital of Florida?"

model.invoke(question)

#### 2. Retrieve Topic and Relevant Articles 

#### 3. Retrieve Article Summary

#### 4. Generate the final response

In [None]:
i = 1
while True:
    user_query = input("Input your question: ")
    if user_query == "exit" or user_query == "bye" or user_query == "quit":
        print(f"\n\nUser: {user_query}")
        print("\nAI Tutor: Bye")
        break

    print(f"\n{i}\nUser: {user_query}")
    response = chain.invoke(user_query)
    print(f"\nAI Tutor: {response}")
    i=i+1

    

# III. Research Assistant Use Cases

Demonstration of the Research Assistant for:
- Answering queries
- Recommending relevant papers: from the query and from the topic
- Summarizing the recommended papers