# Environment Setup

## Install Necessary Libraries
The libraries include:
- LangChain framework
- GPT4All, OpenAI and HuggingFace for various embedding methods and LLMs
- Document loaders
- Dependent libraries

__Note__:
- A C++ build toolchain is required to build a dependent library for Chroma. See https://github.com/bycloudai/InstallVSBuildToolsWindows for instructions.
- Python version: 3.12.4
- Pydantic version: 2.7.3. There is an issue with Pydantic version 1.10.8.
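
Before installing, it can help to confirm that the running interpreter and the installed Pydantic match the versions noted above. The following is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical, and the warning text simply mirrors the note.

```python
import sys
from importlib import metadata


def check_environment(package="pydantic"):
    """Return the running Python version and the installed version of `package`.

    `check_environment` is a hypothetical helper for illustration; the package
    version is None when the package is not installed.
    """
    python_version = ".".join(str(part) for part in sys.version_info[:3])
    try:
        package_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        package_version = None
    return {"python": python_version, package: package_version}


info = check_environment()
print(f"Python {info['python']}, pydantic {info['pydantic']}")
if info["pydantic"] is not None and info["pydantic"].startswith("1."):
    print("Warning: pydantic 1.x is installed; this notebook was run with 2.7.3.")
```

Running this in the first cell makes a version mismatch visible before any of the heavier installs start.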

In [1]:
# !pip install --upgrade -r requirements.txt

In [2]:
# !pip install -qU langchain-ollama

In [3]:
# !pip install -U langchain-experimental

In [4]:
# !pip install "unstructured[all-docs]" pillow pydantic lxml matplotlib chromadb tiktoken

In [5]:
# !pip install pdf2image

In [6]:
# !pip3 install --upgrade nltk openpyxl matplotlib textblob spacy gensim scikit-learn

In [7]:
# !pip install tqdm

In [8]:
# !pip install onnx==1.16.1

In [9]:
# import textblob 
# !python -m textblob.download_corpora
# import spacy
# !python -m spacy download en_core_web_sm
# import nltk
# nltk.download('wordnet')
# nltk.download('universal_tagset')
# nltk.download('averaged_perceptron_tagger_eng')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('punkt_tab')
# nltk.download('stopwords')

In [10]:
# nltk.__version__

### Get Environment Parameters
Prepare the list of parameters in a .env file for later use.
Parameters:
- API keys for LLMs
    - OPENAI_API_KEY 
    - HUGGINGFACEHUB_API_TOKEN 
- Directory / location for documents and vector databases
    - DOC_ARVIX = "./source/from_arvix/"
    - DOC_WIKI = "./source/from_wiki/"
    - VECTORDB_OPENAI_EM = "./vector_db/openai_embedding/"
    - VECTORDB_MINILM_EM = "./vector_db/gpt4all_miniLM/"
    - TS_RAGAS = "./evaluation/testset/by_RAGAS/"
    - TS_PROMPT = "./evaluation/testset/by_direct_prompt/"
    - EVAL_DATASET = "./evaluation/evaluation_data_set/"
    - EVAL_METRIC = "./evaluation/evaluation_metric"
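
A quick check that every parameter listed above is actually set can save debugging time later. The sketch below is illustrative only: the helper name `read_settings` and the split into secret keys and path keys are assumptions, while the parameter names are taken from the list above.

```python
import os

# Parameter names taken from the list above; grouping them into "secrets"
# and "paths" is an assumption made for this sketch.
SECRET_KEYS = ["OPENAI_API_KEY", "HUGGINGFACEHUB_API_TOKEN"]
PATH_KEYS = [
    "DOC_ARVIX", "DOC_WIKI", "VECTORDB_OPENAI_EM", "VECTORDB_MINILM_EM",
    "TS_RAGAS", "TS_PROMPT", "EVAL_DATASET", "EVAL_METRIC",
]


def read_settings(env=None):
    """Split the expected parameters into found values and missing names."""
    env = os.environ if env is None else env
    found, missing = {}, []
    for key in SECRET_KEYS + PATH_KEYS:
        value = env.get(key)
        if value:
            found[key] = value
        else:
            missing.append(key)
    return found, missing


found, missing = read_settings()
if missing:
    # Only names are printed, never the secret values themselves.
    print("Missing parameters:", ", ".join(missing))
```

In the notebook, this would be called after `load_dotenv()` so that values from the .env file are already present in `os.environ`.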


# I. Architecture 

## A. Simple RAG Flow

<img src="diagrams/HL architecture.png" alt="HL arc" title= "HL Architecture" />

The system comprises 5 components:

- Internal data and documents: The system starts with a collection of internal documents and/or structured databases. Documents can be text, PDF, image, or video files. These documents and data are the sources for the knowledge base.

- Embedding processor: The documents and database entries are processed to create vector embeddings. Embeddings are numerical representations of the documents in a high-dimensional space that capture their semantic meaning.

- Vector database: The vectorized document chunks and database entries are stored in a vector database to be searched and retrieved at a later stage.

- Query processor: The query processor takes the user's query and performs a semantic search against the vector database. This component ensures that the query is interpreted correctly and retrieves the most relevant document chunks from the vector database. It combines the user's original query with the retrieved chunks to form a context-rich query. This augmented query provides additional context that can help in generating a more accurate and relevant response.

- LLM: A pre-trained large language model that receives the augmented query and generates a response based on the query and the relevant documents.

The system involves 2 main pipelines: the embedding pipeline and the retrieval pipeline. Each pipeline has specific stages and processes that contribute to the overall functionality of the system.

In this experiment, we use LangChain as the framework to build a simple RAG as a chain of tasks that interacts with surrounding services such as parsing, embedding, the vector database, and LLMs.
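
To make the two pipelines concrete, here is a toy sketch in plain Python. It is not the LangChain implementation used later: the bag-of-words `embed` function stands in for a real embedding model, the list returned by `build_index` stands in for a vector database, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Embedding pipeline: store each chunk together with its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=1):
    """Retrieval pipeline, step 1: rank stored chunks by similarity to the query."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def augment(query, context_chunks):
    """Retrieval pipeline, step 2: combine the query with the retrieved text."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


index = build_index([
    "RAG combines retrieval with text generation.",
    "Chroma is a vector database for embeddings.",
    "Paris is the capital of France.",
])
question = "Which vector database stores embeddings?"
prompt = augment(question, retrieve(index, question))
print(prompt)  # this augmented prompt is what would be sent to the LLM
```

The key design point carries over to the real system: the same `embed` function is used when indexing and when querying, and the LLM only ever sees the augmented prompt.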

## B. MultiModal RAG Architecture
<img src="diagrams/ISM6564-Project.png" alt="HL arc" title= "MM HL Architecture" />

# II. Implementation

In [11]:
# Get the environment parameters
import os
from dotenv import load_dotenv
load_dotenv()

True

## A. Ingestion Pipeline

### Step 1. Data Collection

In this step, we load data from various sources and make it ready for ingestion.
We download 5 articles from arXiv with the query "RAG for Large Language Model" and store them locally, ready for the embedding steps that follow.

#### From arXiv

In [12]:
import arxiv 
client = arxiv.Client()
search = arxiv.Search(
  query = "RAG for Large Language Model",     # Change the query to cover other topics
  max_results = 5,                            # Increase to fetch more papers
#  sort_by = arxiv.SortCriterion.SubmittedDate
)

# Run the search once and materialize the results for reuse in later cells
all_results = list(client.results(search))

In [13]:
# Print out the articles' titles
for r in all_results:
    print(f"{r.title} {r.entry_id}")

Modular RAG: Transforming RAG Systems into LEGO-like Reconfigurable Frameworks http://arxiv.org/abs/2407.21059v1
RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation http://arxiv.org/abs/2408.02545v1
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries http://arxiv.org/abs/2401.15391v1
EACO-RAG: Edge-Assisted and Collaborative RAG with Adaptive Knowledge Update http://arxiv.org/abs/2410.20299v1
Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models http://arxiv.org/abs/2410.07176v1


In [14]:
# Purpose: download articles and save them in a pre-defined location for later use
# Prerequisite: set the environment parameter DOC_ARVIX to the path where articles are saved
# Download and save the articles in PDF format under the DOC_ARVIX path
DOC_ARVIX = os.getenv("DOC_ARVIX") 
directory_path = os.path.join(DOC_ARVIX) 
os.makedirs(directory_path, exist_ok=True)  # create the folder if it does not exist
for r in all_results:
    r.download_pdf(dirpath=directory_path)

#### From Springer

#### From Lexis

### Step 2. Embeddings

This step and the previous one are usually processed together. They are separated here to emphasize that they are not always coupled.
We use the DirectoryLoader and PyMuPDFLoader classes from LangChain to load and parse all .pdf files in the directory.
Corresponding loaders are available for other data types such as Excel files, presentations, and unstructured data.

Refer to https://python.langchain.com/v0.1/docs/integrations/document_loaders/ for other available loaders. 
We also use the OCR library rapidocr to extract text from images. The trade-off is processing time: parsing 5 PDF files took 18 minutes with OCR compared to 0.1 seconds without.

#### 1. Util functions for Embeddings

In [15]:
from langchain_text_splitters import CharacterTextSplitter
from unstructured.partition.pdf import partition_pdf
from unstructured.documents.elements import NarrativeText


# Extract elements from PDF
def extract_pdf_elements(path, fname, img_path=""):
    """
    Extract images, tables, and chunked text from a PDF file.
    path: Directory that contains the PDF file
    fname: File name of the PDF
    img_path: Directory where extracted images (.jpg) are saved; defaults to path
    """
    if img_path == "":
        img_path = path
    os.makedirs(img_path, exist_ok=True)
    return partition_pdf(
        filename=os.path.join(path, fname),  # works with or without a trailing slash in path
        extract_images_in_pdf=True,
        infer_table_structure=True,
        chunking_strategy="by_title",
        max_characters=4000,
        new_after_n_chars=3800,
        combine_text_under_n_chars=2000,
        image_output_dir_path=img_path,
        form_extraction_skip_tables=False,
        extract_image_block_output_dir=img_path
    )

def extract_pdf_elements_v2(path, fname, img_path=""):
    """
    Variant of extract_pdf_elements that uses the "hi_res" partitioning strategy
    instead of "by_title" chunking.
    path: Directory that contains the PDF file
    fname: File name of the PDF
    img_path: Directory where extracted images (.jpg) are saved; defaults to path
    """
    if img_path == "":
        img_path = path
    os.makedirs(img_path, exist_ok=True)
    return partition_pdf(
        filename=os.path.join(path, fname),  # works with or without a trailing slash in path
        extract_images_in_pdf=True,
        infer_table_structure=True,
        strategy="hi_res",
        max_characters=4000,
        new_after_n_chars=3800,
        combine_text_under_n_chars=2000,
        image_output_dir_path=img_path,
        form_extraction_skip_tables=False,
        extract_image_block_output_dir=img_path
    )

# Categorize elements by type
def categorize_elements(raw_pdf_elements):
    """
    Categorize extracted elements from a PDF into tables and texts.
    raw_pdf_elements: List of unstructured.documents.elements
    """
    tables = []
    texts = []
    for element in raw_pdf_elements:
        if "unstructured.documents.elements.Table" in str(type(element)):
            tables.append(element.to_dict()["metadata"]["text_as_html"])
        elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
            texts.append(str(element))
    return texts, tables




In [17]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


# Generate summaries of text elements
def generate_text_summaries(texts, tables, summarize_texts=False):
    """
    Summarize text elements
    texts: List of str
    tables: List of str
    summarize_texts: Bool to summarize texts
    """

    # Prompt
    prompt_text = """You are an assistant tasked with summarizing tables and text for retrieval. \
    These summaries will be embedded and used to retrieve the raw text or table elements. \
    Give a concise summary of the table or text that is well optimized for retrieval. Table or text: {element} """
    prompt = ChatPromptTemplate.from_template(prompt_text)

    # Text summary chain
    model = ChatOpenAI(temperature=0, model="gpt-4")
    summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

    # Initialize empty summaries
    text_summaries = []
    table_summaries = []

    # Apply to text if texts are provided and summarization is requested
    if texts and summarize_texts:
        text_summaries = summarize_chain.batch(texts, {"max_concurrency": 5})
    elif texts:
        text_summaries = texts

    # Apply to tables if tables are provided
    if tables:
        table_summaries = summarize_chain.batch(tables, {"max_concurrency": 5})

    return text_summaries, table_summaries

In [18]:
import base64
import os

from langchain_core.messages import HumanMessage


def encode_image(image_path):
    """Getting the base64 string"""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def image_summarize(img_base64, prompt):
    """Make image summary"""
    chat = ChatOpenAI(model="gpt-4o", max_tokens=1024)

    msg = chat.invoke(
        [
            HumanMessage(
                content=[
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                    },
                ]
            )
        ]
    )
    return msg.content


def generate_img_summaries(path):
    """
    Generate summaries and base64 encoded strings for images
    path: Path to list of .jpg files extracted by Unstructured
    """

    # Store base64 encoded images
    img_base64_list = []

    # Store image summaries
    image_summaries = []

    # Prompt
    prompt = """You are an assistant tasked with summarizing images for retrieval. \
    These summaries will be embedded and used to retrieve the raw image. \
    Give a concise summary of the image that is well optimized for retrieval."""

    # Apply to images
    for img_file in sorted(os.listdir(path)):
        if img_file.endswith(".jpg"):
            print(img_file)
            img_path = os.path.join(path, img_file)
            base64_image = encode_image(img_path)
            img_base64_list.append(base64_image)
            image_summaries.append(image_summarize(base64_image, prompt))

    return img_base64_list, image_summaries



In [19]:
import pandas as pd
import os
from dotenv import load_dotenv
from langchain.vectorstores import Chroma
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings



CHROMA_OPENAI_RAG_FOR_LLM = "CHROMA_OPENAI_RAG_FOR_LLM"
CHROMA_HF_RAG_FOR_LLM = "CHROMA_HF_RAG_FOR_LLM"
CHROMA_MINILM_RAG_FOR_LLM = "CHROMA_MINILM_RAG_FOR_LLM"
CHROMA_OLLAMA_RAG_FOR_LLM = "CHROMA_OLLAMA_RAG_FOR_LLM"

# IMPORTANT: THE CHROMA INSTANCE CANNOT BE INITIATED WITHIN A .PY FILE. IT WILL CRASH THE KERNEL.
class VectorBD:
    
    def __init__(self,
                 vectordb_name) -> None:
        load_dotenv()
#       OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
#       print(OPENAI_API_KEY)
        if vectordb_name == CHROMA_OPENAI_RAG_FOR_LLM:
            self.vectordb_directory = os.path.join(os.getenv("VECTORDB_OPENAI_EM"),"RAG_for_LLM")
            self.embeddings = OpenAIEmbeddings()
            self.vectordb =  Chroma(persist_directory=self.vectordb_directory, embedding_function=self.embeddings)
            self.retriever = self.vectordb.as_retriever()

        if vectordb_name == CHROMA_MINILM_RAG_FOR_LLM:
            self.vectordb_directory = os.path.join(os.getenv("VECTORDB_MINILM_EM"),"RAG_for_LLM")
            self.embeddings = GPT4AllEmbeddings(model_name="all-MiniLM-L6-v2.gguf2.f16.gguf", gpt4all_kwargs={'allow_download': 'True'})
            self.vectordb =  Chroma(persist_directory=self.vectordb_directory, embedding_function=self.embeddings)
            self.retriever = self.vectordb.as_retriever()

        if vectordb_name == CHROMA_OLLAMA_RAG_FOR_LLM:
            self.vectordb_directory = os.path.join(os.getenv("VECTORDB_OLLAMA_EM"),"RAG_for_LLM")
            self.embeddings = OllamaEmbeddings(model="llama3.1")
            self.vectordb =  Chroma(persist_directory=self.vectordb_directory, embedding_function=self.embeddings)
            self.retriever = self.vectordb.as_retriever()

        if vectordb_name == CHROMA_HF_RAG_FOR_LLM:
            self.vectordb_directory = os.path.join(os.getenv("VECTORDB_HF_EM"),"RAG_for_LLM")
            self.embeddings = HuggingFaceEmbeddings()
            self.vectordb =  Chroma(persist_directory=self.vectordb_directory, embedding_function=self.embeddings)
            self.retriever = self.vectordb.as_retriever()       

    def vectorizing(self, documents):
        self.vectordb = Chroma.from_documents(documents=documents,embedding=self.embeddings, persist_directory=self.vectordb_directory)
        self.vectordb.persist()

    def invoke(self,question):
#       print(self.retriever.invoke("What is RAG?"))
        return self.retriever.invoke(question)

def connect_km(km_name):
    load_dotenv()
#   OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
#   print(OPENAI_API_KEY)
    if km_name == CHROMA_OPENAI_RAG_FOR_LLM:
        km_dir = os.path.join(os.getenv("VECTORDB_OPENAI_EM"),"RAG_for_LLM")
        km_embeddings = OpenAIEmbeddings()
        km_db =  Chroma(persist_directory=km_dir, embedding_function=km_embeddings)
        return km_db

In [20]:
## Connect to LLM 
from langchain_openai.chat_models import ChatOpenAI
from langchain_huggingface import HuggingFaceEndpoint 
from langchain_ollama.chat_models import ChatOllama
import os
from dotenv import load_dotenv

llm_model = {
    "GPT_3_5_TURBO" : "gpt-3.5-turbo",
    "GPT_4" : "gpt-4",
    "GPT_4o" : "gpt-4o",  #For vision
    "GPT_4_PREVIEW" : "gpt-4-1106-preview",
    "LOCAL_GPT4ALL" : "",
    "MISRALAI" : "mistralai/Mistral-7B-Instruct-v0.2",
    "LLAMA3_70B" : "meta-llama/Meta-Llama-3-70B-Instruct",
    "ZEPHYR_7B" : "HuggingFaceH4/zephyr-7b-beta",
    "OLLAMA_GEMMA2" : "gemma2",
    "OLLAMA_LLAMA3" : "llama3",
    "OLLAMA_LLAMA3.1" : "llama3.1"
}

def connectLLM(model, temperature = 0):
    load_dotenv()

    # Connect to OpenAI chat model: online, token-based
    if model == "GPT_3_5_TURBO" or model == "GPT_4_PREVIEW" or model == "GPT_4" or model == "GPT_4o":
#       print("connect llm")
        return ChatOpenAI(openai_api_key=os.getenv("OPENAI_API_KEY"), model=llm_model[model], temperature=temperature)
    
    # Connect to HuggingFace chat model: online, token-based
    # Note: to use Llama3, we need to register on the HuggingFace website
    if model == "LLAMA3_70B" or model == "MISRALAI" or model == "ZEPHYR_7B":
        repo_id = llm_model[model]
        return HuggingFaceEndpoint(
            repo_id=repo_id,
            max_length=128,
            temperature=temperature, # Should be 0.5 
            huggingfacehub_api_token=os.getenv("HUGGINGFACEHUB_API_TOKEN")
        )
    
    # Connect to Ollama for Llama3, Llama3.1 and Gemma2 chat models
    # These models run locally, so they must be downloaded first. See the instructions for installing Ollama and downloading models
    if model == "OLLAMA_GEMMA2" or model == "OLLAMA_LLAMA3" or model == "OLLAMA_LLAMA3.1":
        return ChatOllama(model=llm_model[model], temperature=temperature)
         

In [21]:
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings


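def create_multi_vector_retriever(
    vectorstore, text_summaries, texts, table_summaries, tables, image_summaries, images, image_file=None,
    document_meta=None  # metadata to be stored together with the vectorized content
):
    """
    Create retriever that indexes summaries, but returns raw images or texts
    """
    # Avoid mutable default arguments: create fresh containers on each call
    image_file = image_file if image_file is not None else []
    document_meta = document_meta if document_meta is not None else {}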

    # Initialize the storage layer
    store = InMemoryStore()
    id_key = "doc_id"

    # Create the multi-vector retriever
    retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=store,
        id_key=id_key,
    )

    # Helper function to add documents to the vectorstore and docstore
    def add_documents(retriever, doc_summaries, doc_contents,type = ""):
        doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
        summary_docs = [
            Document(page_content=s, 
                     metadata={
                         id_key: doc_ids[i],
                         'source': document_meta.get("filename",""),
                         'type': type,
                         'paper_id': document_meta.get("docid","")
                         }
                     )
            for i, s in enumerate(doc_summaries)
        ]
        content_docs = [
            Document(page_content=s, 
                     metadata={
                         id_key: doc_ids[i],
                         'source': document_meta.get("filename",""),
                         'type': type,
                         'paper_id': document_meta.get("docid","")
                         }
                     )
            for i, s in enumerate(doc_contents)
        ]
        retriever.vectorstore.add_documents(summary_docs)
        retriever.docstore.mset(list(zip(doc_ids, content_docs)))

    # Add texts, tables, and images
    # Check that text_summaries is not empty before adding
    if text_summaries:
        add_documents(retriever, text_summaries, texts,"text")
    # Check that table_summaries is not empty before adding
    if table_summaries:
        add_documents(retriever, table_summaries, tables,"table")
    # Check that image_summaries is not empty before adding
    if image_summaries:
        add_documents(retriever, image_summaries, images,"image")

    return retriever
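
The pattern behind `create_multi_vector_retriever` (search over summaries, return the raw content) can be illustrated without any libraries. The class below is a toy stand-in, not LangChain's `MultiVectorRetriever`: the name `SummaryIndexedStore` is hypothetical, and keyword overlap replaces real vector similarity.

```python
import uuid


class SummaryIndexedStore:
    """Toy stand-in for the multi-vector pattern: search summaries, return raw content."""

    def __init__(self):
        self.summary_index = []   # (summary, doc_id) pairs; plays the vector store role
        self.docstore = {}        # doc_id -> raw content; plays the document store role

    def add(self, summaries, contents):
        if len(summaries) != len(contents):
            raise ValueError("each raw item needs exactly one summary")
        for summary, content in zip(summaries, contents):
            doc_id = str(uuid.uuid4())  # shared key that links summary and raw item
            self.summary_index.append((summary, doc_id))
            self.docstore[doc_id] = content

    def search(self, query):
        """Match on summaries (keyword overlap here), but return the raw item."""
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(summary.lower().split())), doc_id)
            for summary, doc_id in self.summary_index
        ]
        if not scored:
            return None
        best_score, best_id = max(scored)
        return self.docstore[best_id] if best_score > 0 else None


store = SummaryIndexedStore()
store.add(
    ["table of query categories and counts", "bar chart of retrieval accuracy"],
    ["<table>raw html</table>", "base64-image-bytes"],
)
print(store.search("accuracy chart"))
```

The shared `doc_id` is what ties a short, searchable summary to a raw table or image that would be awkward to embed directly.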

#### 2. Document Loading

In [22]:
DOC_ARVIX = os.getenv("DOC_ARVIX") 
directory_path = os.path.join(DOC_ARVIX) 
pdffiles = [f for f in os.listdir(directory_path) if f.endswith(".pdf")]
pdffiles

['2401.15391v1.MultiHop_RAG__Benchmarking_Retrieval_Augmented_Generation_for_Multi_Hop_Queries.pdf',
 '2407.21059v1.Modular_RAG__Transforming_RAG_Systems_into_LEGO_like_Reconfigurable_Frameworks.pdf',
 '2408.02545v1.RAG_Foundry__A_Framework_for_Enhancing_LLMs_for_Retrieval_Augmented_Generation.pdf',
 '2410.07176v1.Astute_RAG__Overcoming_Imperfect_Retrieval_Augmentation_and_Knowledge_Conflicts_for_Large_Language_Models.pdf',
 '2410.20299v1.EACO_RAG__Edge_Assisted_and_Collaborative_RAG_with_Adaptive_Knowledge_Update.pdf']

In [23]:
import pickle
import pandas as pd
import uuid

# Load document catalog from the pickle file
if os.path.exists('document_catalog.pickle'):
    with open('document_catalog.pickle', 'rb') as pkl_file:
        df_documents = pickle.load(pkl_file) 
else:
    df_documents = pd.DataFrame(columns=["docid","filename","status","topic","summary","img_folder","imgs"])

# Load new files for processing
new_item = False
existing_files = list(df_documents["filename"])
for fn in pdffiles:
    if fn not in existing_files:
        i = len(df_documents.index)
        df_documents.loc[len(df_documents.index)]={"docid":str(uuid.uuid4()),"filename":fn,"status":"new","topic":"","summary":"","img_folder":"./figure/document_"+str(i),"imgs":[]}
        new_item = True
if new_item:
    with open('document_catalog.pickle', 'wb') as pkl_file:
        pickle.dump(df_documents,pkl_file)
df_documents

| | docid | filename | status | topic | summary | img_folder | imgs |
|---|---|---|---|---|---|---|---|
| 0 | a5cdaa51-39b4-42fe-bc76-e19fb729c37b | 2401.15391v1.MultiHop_RAG__Benchmarking_Retrie... | new | | | ./figure/document_0 | [] |
| 1 | 012e560a-9388-4f1f-9ae3-1b4afc2a0bcd | 2407.21059v1.Modular_RAG__Transforming_RAG_Sys... | new | | | ./figure/document_1 | [] |
| 2 | a4b74ce1-b399-4b18-93ac-0c620d1438c7 | 2408.02545v1.RAG_Foundry__A_Framework_for_Enha... | new | | | ./figure/document_2 | [] |
| 3 | decc1461-6857-422c-b0d4-b6f6420e0d6a | 2410.20299v1.EACO_RAG__Edge_Assisted_and_Colla... | new | | | ./figure/document_3 | [] |
| 4 | 30bfc2c5-9453-49d9-8ad7-9adf3f4975bb | 2410.07176v1.Astute_RAG__Overcoming_Imperfect_... | new | | | ./figure/document_4 | [] |


#### 3. Text Parsing and Image Extraction

Extract the images from each PDF. The expected result is a list of images for each PDF.

In [24]:
# This is an example for a subset of the dataframe, i.e. the first document / paper
df_subset = df_documents.loc[0]
directory_path = os.path.join(os.getenv("DOC_ARVIX"))
doc_elements = extract_pdf_elements(directory_path,df_subset["filename"],df_subset["img_folder"])

In [25]:
texts, tables = categorize_elements(doc_elements)

#### 4. Text Chunking

Divide the data into smaller chunks for better handling, processing, and retrieval.
The embedding service used at a later stage limits the number of tokens it can process, so documents must be split into smaller chunks.
LangChain offers many chunking methods, of which recursive character splitting and semantic chunking are the most popular.

Reference: https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/ 
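
As an illustration of the idea, the sketch below splits text into fixed-size windows with a small overlap. It is a simplification and not LangChain's algorithm: whitespace-separated words approximate tokens, and the helper name `chunk_words` and its default sizes are assumptions.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into chunks of at most `chunk_size` words, sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of storing some text twice.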

In [26]:
from langchain_text_splitters import CharacterTextSplitter
# chunk_size is measured in tokens (tiktoken); the default separator is "\n\n"
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=4000, chunk_overlap=0
)
joined_texts = " ".join(texts)
texts_4k_token = text_splitter.split_text(joined_texts)

#### 5. Table Process

In [27]:
from IPython.display import HTML, display

# Render each extracted table (an HTML string) in the notebook
for element in tables:
    display(HTML(element))

| News source | Fortune Magazine | The Sydney Morning Herald |
|---|---|---|
| Evidence | Back then, just like today, home prices had boomed for years before Fed officials were ultimately forced to hike interest rates aggressively in an attempt to fight inflation. | Postponements of such reports could complicate things for the Fed, which has insisted it will make upcoming decisions on interest rates based on what incoming data say about the economy. |
| Claim | Federal Reserve officials were forced to aggressively hike interest rates to combat inflation after years of home | The Federal Reserve has insisted that it will base its upcoming decisions on interest rates on the incoming economic data. |
| | booming prices. | |
| Bridge-Topic Bridge-Entity | Interest rate hikes to combat inflation Federal Reserve | Interest rate decisions based on economic data Federal Reserve |
| Query | Does the article from Fortune suggest that the Federal Reserve’s interest rate hikes are a response to past conditions, such as booming home prices, while The Sydney Morning Herald article indicates that the Federal Reserve’s future interest rate decisions will be based on incoming economic data? | |


| Category | Avg. Tokens | Entry Count |
|---|---|---|
| technology 2262.3 172 | | |
| entertainment 2084.3 114 | | |
| sports 2030.6 211 | | |
| science 1745.5 21 | | |
| business 1723.8 81 | | |
| health 1481.1 10 | | |
| total | 2046.5 | 609 |


| Query Category | Entry Count | Percentage |
|---|---|---|
| Inference Query | 816 | 31.92% |
| Comparison Query | 856 | 33.49% |
| Temporal Query | 583 | 22.81% |
| Null Query | 301 | 11.78% |
| Total | 2556 | 100.00 % |


| Num. of Evidence Needed | Count | Percentage |
|---|---|---|
| 0 (Null Query) | 301 | 11.78% |
| 2 | 1078 | 42.18% |
| 3 | 719 | 30.48% |
| 4 | 398 | 15.56% |
| Total | 2556 | 100.00 % |


| Embedding | | Without | Reranker | | With bge-reranker-large | | | |
|---|---|---|---|---|---|---|---|---|
| MRR@10 | MAP@10 | Hits@10 | Hits@4 | MRR@10 | MAP@10 | Hits@10 | Hits@4 | |
| text-embedding-ada-002 | 0.4203 | 0.3431 | 0.6381 | 0.504 | 0.5477 | 0.4625 | 0.7059 | 0.6169 |
| text-search-ada-query-001 | 0.4203 | 0.3431 | 0.6399 | 0.5031 | 0.5483 | 0.4625 | 0.7064 | 0.6174 |
| Ilm-embedder | 0.2558 | 0.1725 | 0.4499 | 0.3189 | 0.425 | 0.3059 | 0.5478 | 0.4756 |
| bge-large-en-v1.5 | 0.4298 | 0.3423 | 0.6718 | = 0.5221 | 0.563 | 0.4759 | 0.7183 | 0.6364 |
| jina-embeddings-v2-base-en | 0.0621 | 0.031 | 0.1479 | 0.0802 | 0.1412 | 0.0772 | 0.1909 | 0.1639 |
| intfloat/e5-base-v2 | 0.1843 | 0.1161 | 0.3556 | = 0.2334 | 0.3237 | 0.2165 | 0.4176 | 0.3716 |
| voyage-02 | 0.3934 | 0.3143 | 0.6506 | 0.4619 | 0.586 | 0.4795 | 0.7467 | 0.6625 |
| hkun!p/instructor-large | 0.3458 | 0.265 | 0.5717 | 0.4229 | 0.5115 | 0.4118 | 0.659 | 0.5775 |


| Models | Accuracy | |
|---|---|---|
| Retrieved Chunk | Ground-truth Chunk | |
| GPT-4 | 0.56 | 0.89 |
| ChatGPT | 0.44 | 0.57 |
| Llama-2-70b-chat-hf | 0.28 | 0.32 |
| Mixtral-8x7B-Instruct | 0.32 | 0.36 |
| Claude-2.1 | 0.52 | 0.56 |
| Google-PaLM | 0.47 | 0.74 |


#### 6. Text, Table and Image Summary

Use an LLM, e.g. Llama3.1 or Gemini, to generate a summary for each image.

In [28]:
# Get text, table summaries
text_summaries, table_summaries = generate_text_summaries(
    texts_4k_token, tables, summarize_texts=False # The original text, not a summary, will be vectorized
)

NotFoundError: Error code: 404 - {'error': {'message': 'The model `gpt-4` does not exist or you do not have access to it.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}

In [None]:
# Image summaries: returns the base64-encoded images and a summary string for each image
img_base64_list, image_summaries = generate_img_summaries(df_subset["img_folder"])

#### 7. Text, Table and Image Vectorizing

Vectors are semantic representations of texts. 
This is an important step that makes documents searchable in the later pipeline. 
Embedding is an essential step in the Transformer architecture, which underlies every modern LLM. Therefore, many LLM providers offer their embedding functions as ready-to-use services, e.g. the OpenAI embedding API. However, it is important to consider the privacy risk of exposing internal data to those services.

IMPORTANT NOTE: 
1. The embedding method used for similarity search in the retrieval pipeline must be the same as the one used to vectorize documents in this step. 
2. Public embedding methods such as OpenAIEmbeddings incur a small cost and may leak internal data.  

Reference: https://python.langchain.com/v0.1/docs/modules/data_connection/text_embedding/
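
The first note is easy to violate by accident, for example when a persisted vector database is reopened with a different embedding class. One way to guard against this, sketched below with toy data, is to record the embedding name alongside the index and refuse mismatched queries. The class `GuardedIndex` and the function `letters_embedding` are hypothetical and only illustrate the check; they are not part of Chroma or LangChain.

```python
class GuardedIndex:
    """Toy index that remembers which embedding method built it."""

    def __init__(self, embedding_name, embed_fn):
        self.embedding_name = embedding_name
        self.embed_fn = embed_fn
        self.rows = []  # (vector, text) pairs

    def add(self, texts):
        self.rows.extend((self.embed_fn(text), text) for text in texts)

    def query(self, text, embedding_name, embed_fn):
        # Refuse to compare vectors produced by a different embedding method.
        if embedding_name != self.embedding_name:
            raise ValueError(
                f"index built with {self.embedding_name!r}, queried with {embedding_name!r}"
            )
        query_vec = embed_fn(text)
        # Use the dot product as the similarity score and return the best text.
        best = max(self.rows, key=lambda row: sum(q * d for q, d in zip(query_vec, row[0])))
        return best[1]


def letters_embedding(text):
    """Hypothetical toy embedding: counts of the letters a, e, and r."""
    lowered = text.lower()
    return [lowered.count(ch) for ch in "aer"]


guarded = GuardedIndex("letters-v1", letters_embedding)
guarded.add(["retrieval", "database"])
print(guarded.query("rare", "letters-v1", letters_embedding))
```

A query made with any other embedding name raises a `ValueError` instead of silently returning meaningless similarity scores.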

In [None]:
from langchain_openai.embeddings import OpenAIEmbeddings  # Swap in another embedding class to use e.g. Llama or Gemini
embeddings = OpenAIEmbeddings()
vectordb_directory = os.path.join(os.getenv("VECTORDB_OPENAI_EM"),"")
vectorstore =  Chroma( collection_name="research_paper",persist_directory=vectordb_directory, embedding_function=embeddings)

In [None]:
vectorstore.reset_collection()

In [None]:
# Create retriever
retriever_multi_vector_img = create_multi_vector_retriever(
    vectorstore,
    text_summaries,
    texts_4k_token,
    table_summaries,
    tables,
    image_summaries,
    img_base64_list,
    df_subset["imgs"],
    dict(df_subset)
)

#### 8. Article Summary
Use an LLM to summarize the paper, either as text or as images (by converting the PDF to images).

In [None]:
# pip install --upgrade google-generativeai

In [29]:
import os
import re
from PyPDF2 import PdfReader
import google.generativeai as genai

# Load your Gemini API key from the .env file
genai.configure(api_key=os.getenv("GEMINI_API_KEY"))

# Define the directory path for PDF files
DOC_ARVIX = os.getenv("DOC_ARVIX")
directory_path = os.path.join(DOC_ARVIX)

# List all PDF files in the directory
pdffiles = [f for f in os.listdir(directory_path) if f.endswith(".pdf")]

# List to store summaries
summaries = []

def summarize_with_gemini(text):
    """Summarizes text using the Gemini API with a request for brevity."""
    try:
        model = genai.GenerativeModel("gemini-1.5-flash")
        response = model.generate_content(f"Provide a short summary (around 2-3 sentences) of the following text: {text}")
        return response.text.strip()  # Clean up the response text
    except Exception as e:
        print(f"An error occurred while summarizing: {e}")
        return None

# Loop through each PDF file and summarize
for pdf_file in pdffiles:
    pdf_path = os.path.join(directory_path, pdf_file)
    try:
        # Read the PDF file
        with open(pdf_path, "rb") as file:
            reader = PdfReader(file)
            text = ""
            for page in reader.pages:
                # Guard against pages that yield no text (e.g. image-only pages)
                text += (page.extract_text() or "") + "\n"  # Collect text from all pages
        
        # Clean up the text
        text = re.sub(r'\s+', ' ', text).strip()  # Remove excessive whitespace

        # Summarize the text
        summary = summarize_with_gemini(text)
        summaries.append(summary)  # Store summary in the list
        print(f"Summary for {pdf_file}: {summary}")

    except Exception as e:
        print(f"An error occurred while summarizing {pdf_file}: {e}")


Summary for 2401.15391v1.MultiHop_RAG__Benchmarking_Retrieval_Augmented_Generation_for_Multi_Hop_Queries.pdf: This paper introduces MultiHop-RAG, a new benchmark dataset for evaluating Retrieval-Augmented Generation (RAG) systems on multi-hop queries. These queries require retrieving and reasoning over multiple pieces of evidence, making them more challenging and realistic than single-hop queries. The authors detail the dataset construction process, using GPT-4 to generate a diverse set of multi-hop queries, and present benchmark results showing that current RAG systems struggle to effectively handle multi-hop queries.
Summary for 2407.21059v1.Modular_RAG__Transforming_RAG_Systems_into_LEGO_like_Reconfigurable_Frameworks.pdf: This paper introduces a new modular framework for Retrieval-Augmented Generation (RAG) systems called Modular RAG. By breaking down complex RAG systems into independent modules and operators, Modular RAG facilitates highly reconfigurable frameworks. The paper then

#### 9. Topic Modeling

In [30]:
# Function to perform topic modeling using Gemini
def generate_topics_with_gemini(summary):
    """Generates concise topics based on the provided summary using the Gemini API."""
    try:
        model = genai.GenerativeModel("gemini-1.5-flash")
        response = model.generate_content(
            f"Identify concise topics (as short phrases) from the following summary:\n{summary}\n\nPlease separate multiple topics with commas:"
        )
        return response.text.strip()  # Clean up the response text
    except Exception as e:
        print(f"An error occurred while generating topics: {e}")
        return None

# Generate topics for each summary
for pdf_file, summary in zip(pdffiles, summaries):
    print(f"Generating topics for {pdf_file}:")
    if summary:  # Check if summary is not None
        topics = generate_topics_with_gemini(summary)
        if topics:
            print(f"Identified Topics for {pdf_file}:\n{topics}\n")
        else:
            print(f"No topics identified for {pdf_file}.\n")
    else:
        print(f"No summary available for {pdf_file}.\n")


Generating topics for 2401.15391v1.MultiHop_RAG__Benchmarking_Retrieval_Augmented_Generation_for_Multi_Hop_Queries.pdf:
Identified Topics for 2401.15391v1.MultiHop_RAG__Benchmarking_Retrieval_Augmented_Generation_for_Multi_Hop_Queries.pdf:
MultiHop-RAG dataset, RAG system evaluation, multi-hop query challenges, GPT-4 query generation, RAG system limitations.

Generating topics for 2407.21059v1.Modular_RAG__Transforming_RAG_Systems_into_LEGO_like_Reconfigurable_Frameworks.pdf:
Identified Topics for 2407.21059v1.Modular_RAG__Transforming_RAG_Systems_into_LEGO_like_Reconfigurable_Frameworks.pdf:
Modular RAG framework, RAG flow patterns, RAG system reconfigurability, linear RAG flow, conditional RAG flow, branching RAG flow, looping RAG flow, tuning RAG flow.

Generating topics for 2408.02545v1.RAG_Foundry__A_Framework_for_Enhancing_LLMs_for_Retrieval_Augmented_Generation.pdf:
Identified Topics for 2408.02545v1.RAG_Foundry__A_Framework_for_Enhancing_LLMs_for_Retrieval_Augmented_Generation.
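
The prompt above asks Gemini for topics as a comma-separated string. As a minimal sketch, assuming the response follows that requested format, the string could be turned into a clean list as follows. `parse_topics` is a hypothetical helper and is not part of the notebook.

```python
def parse_topics(raw_topics):
    """Split a comma-separated topic string into a de-duplicated list of lowercase phrases."""
    if not raw_topics:
        return []
    seen = set()
    topics = []
    for part in raw_topics.split(","):
        # Trim whitespace and a trailing period, then normalize case
        topic = part.strip().rstrip(".").strip().lower()
        if topic and topic not in seen:
            seen.add(topic)
            topics.append(topic)
    return topics


print(parse_topics("MultiHop-RAG dataset, RAG system evaluation, RAG system limitations."))
```

Storing topics as a list rather than one long string may make it easier to compare topics across papers later on.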

#### 10. Store Article Summary + Topic Model

In [36]:
import os
import pandas as pd
from IPython.display import display, HTML
import pickle

# Load document catalog from pickle files
if os.path.exists('document_catalog.pickle'):
    with open('document_catalog.pickle', 'rb') as pkl_file:
        df_documents = pickle.load(pkl_file) 
else:
    df_documents = pd.DataFrame(columns=["filename", "status", "topic", "summary", "img_folder", "imgs"])

# Create a list of new entries for updates
updates = []

# Loop through each PDF file and its summary
for pdf_file, summary in zip(pdffiles, summaries):
    if summary:  # Ensure the summary is not None
        topics = generate_topics_with_gemini(summary)  # Generate topics for the summary
        status = "Success"
    else:
        topics = None
        status = "Failed"

    # Prepare a dictionary for the current document's update
    updates.append({
        "filename": pdf_file,
        "status": status,
        "topic": topics,
        "summary": summary,
        "img_folder": None,  # Placeholder for image folder, if applicable
        "imgs": None  # Placeholder for images, if applicable
    })

# Update the existing DataFrame without creating duplicates
for update in updates:
    filename = update['filename']
    
    if filename in df_documents['filename'].values:
        # Update the existing row based on the filename
        df_documents.loc[df_documents['filename'] == filename, ['status', 'topic', 'summary']] = update['status'], update['topic'], update['summary']
    else:
        # If the filename doesn't exist, append the new entry
        # (DataFrame.append was removed in pandas 2.0, so use pd.concat instead)
        df_documents = pd.concat([df_documents, pd.DataFrame([update])], ignore_index=True)

# Save the updated DataFrame as a pickle file
with open('document_catalog.pickle', 'wb') as pkl_file:
    pickle.dump(df_documents, pkl_file)

# Display the updated DataFrame as an HTML table (truncated to at most 10 rows)
display(HTML(df_documents.to_html(max_rows=10, max_cols=7, justify='left')))

Unnamed: 0,docid,filename,status,topic,summary,img_folder,imgs
0,a5cdaa51-39b4-42fe-bc76-e19fb729c37b,2401.15391v1.MultiHop_RAG__Benchmarking_Retrieval_Augmented_Generation_for_Multi_Hop_Queries.pdf,Success,"MultiHop-RAG benchmark, multi-hop query evaluation, RAG system challenges, GPT-4 query generation, dataset construction.","This paper introduces MultiHop-RAG, a new benchmark dataset for evaluating Retrieval-Augmented Generation (RAG) systems on multi-hop queries. These queries require retrieving and reasoning over multiple pieces of evidence, making them more challenging and realistic than single-hop queries. The authors detail the dataset construction process, using GPT-4 to generate a diverse set of multi-hop queries, and present benchmark results showing that current RAG systems struggle to effectively handle multi-hop queries.",./figure/document_0,[]
1,012e560a-9388-4f1f-9ae3-1b4afc2a0bcd,2407.21059v1.Modular_RAG__Transforming_RAG_Systems_into_LEGO_like_Reconfigurable_Frameworks.pdf,Success,"Modular RAG framework, RAG flow patterns, Reconfigurable RAG systems, Linear RAG flow, Conditional RAG flow, Branching RAG flow, Looping RAG flow, Tuning RAG flow.","This paper introduces a new modular framework for Retrieval-Augmented Generation (RAG) systems called Modular RAG. By breaking down complex RAG systems into independent modules and operators, Modular RAG facilitates highly reconfigurable frameworks. The paper then presents six prevalent RAG flow patterns (linear, conditional, branching, looping, tuning) and analyzes their implementation nuances, paving the way for a more robust and flexible approach to building and deploying RAG technologies.",./figure/document_1,[]
2,a4b74ce1-b399-4b18-93ac-0c620d1438c7,2408.02545v1.RAG_Foundry__A_Framework_for_Enhancing_LLMs_for_Retrieval_Augmented_Generation.pdf,Success,"Open-source RAG framework, LLM enhancement for RAG, Data-augmented dataset creation, RAG LLM training, RAG performance evaluation, Rapid prototyping, RAG techniques experimentation, Customizable tool for knowledge-intensive tasks.","RAG Foundry is an open-source framework for enhancing large language models (LLMs) for retrieval-augmented generation (RAG) use cases. It streamlines the process of creating data-augmented datasets, training LLMs in RAG settings, and evaluating their performance. The framework facilitates rapid prototyping and experimentation with various RAG techniques, offering researchers and practitioners a comprehensive and customizable tool for improving LLM capabilities in knowledge-intensive tasks.",./figure/document_2,[]
3,decc1461-6857-422c-b0d4-b6f6420e0d6a,2410.20299v1.EACO_RAG__Edge_Assisted_and_Collaborative_RAG_with_Adaptive_Knowledge_Update.pdf,Success,"Edge-assisted RAG, Distributed knowledge, Dynamic database updates, Inter-node collaboration, Reduced delay and resource consumption, High accuracy, Cost-effective solution.","This paper introduces EACO-RAG, a novel edge-assisted distributed system for Retrieval-Augmented Generation (RAG). EACO-RAG overcomes the scalability challenges of traditional centralized RAG systems by distributing knowledge across edge nodes, dynamically updating local databases, and leveraging inter-node collaboration. This results in significant reductions in delay and resource consumption while maintaining high accuracy, making it a highly efficient and cost-effective solution for large-scale RAG deployments.",./figure/document_3,[]
4,30bfc2c5-9453-49d9-8ad7-9adf3f4975bb,2410.07176v1.Astute_RAG__Overcoming_Imperfect_Retrieval_Augmentation_and_Knowledge_Conflicts_for_Large_Language_Models.pdf,Success,"Imperfect retrieval in RAG systems, Astute RAG approach, LLM internal knowledge integration, Iterative information consolidation, Reliable answer generation, Performance comparison with existing RAG methods.","This paper addresses the problem of imperfect retrieval in Retrieval-Augmented Generation (RAG) systems, which can lead to inaccurate outputs due to irrelevant or misleading information from retrieved sources. The authors propose Astute RAG, a novel RAG approach that leverages internal knowledge from large language models (LLMs) to mitigate the impact of imperfect retrieval by iteratively consolidating information from both internal and external sources and ultimately producing more reliable answers. Experiments demonstrate that Astute RAG significantly outperforms existing RAG methods, especially in challenging scenarios where retrieval is unreliable.",./figure/document_4,[]


#### 11. Store Vector DB (New version of Chroma persists data automatically after vectorization)

There are several vector databases to choose from: Chroma, FAISS, Pinecone, and others. 
We will create a Chroma vector database using the OpenAI embedding method. 

Note: different embedding methods produce vectors of different dimensions, which cannot be stored together in the same collection. 
The same embedding method must also be used in the retrieval pipeline.

Reference: https://python.langchain.com/v0.1/docs/modules/data_connection/vectorstores/ 
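
To illustrate why vectors from different embedding methods cannot share one collection, here is a toy in-memory store. It fixes its dimension on the first insert and rejects any vector of a different length. The `ToyVectorStore` class is hypothetical and is not Chroma's API; the vector sizes are illustrative only.

```python
class ToyVectorStore:
    """Hypothetical in-memory store; the dimension is fixed by the first vector added."""

    def __init__(self):
        self.dimension = None
        self.records = []  # list of (doc_id, vector) pairs

    def add(self, doc_id, vector):
        if self.dimension is None:
            # The first vector decides the dimension of the whole collection
            self.dimension = len(vector)
        elif len(vector) != self.dimension:
            raise ValueError(
                f"expected dimension {self.dimension}, got {len(vector)}"
            )
        self.records.append((doc_id, list(vector)))


store = ToyVectorStore()
store.add("chunk-1", [0.1] * 1536)      # vector from one embedding method
try:
    store.add("chunk-2", [0.1] * 384)   # vector from a different embedding method
except ValueError as err:
    print("Rejected:", err)
```

In practice this is why the notebook keeps a separate directory per embedding method (for example `VECTORDB_OPENAI_EM` and `VECTORDB_MINILM_EM` in the `.env` file).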

## B. Retrieval Pipeline

The retrieval pipeline retrieves relevant chunks of knowledge from the previously vectorized knowledge base to enrich the LLM prompt with specific context. This pipeline runs in response to each user query. 

First, load the vector store if one exists; here it is the Chroma vector DB we have just persisted. 
Then perform a semantic search in the vectorized database to retrieve the relevant embedded documents.

NOTE: The embedding method used in this step must be the same as the one used to vectorize the knowledge in the previous pipeline.

There is room to improve the efficiency and quality of the similarity search, especially as the knowledge base grows larger and its source types become more varied.
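
As a rough illustration of what a semantic search does, the sketch below ranks a few hand-written vectors by cosine similarity to a query vector and returns the top k. It is a minimal example under stated assumptions: the vectors are made up, `cosine_similarity` and `top_k` are hypothetical helpers, and the distance metric actually used by the vector store may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_chunks, k=3):
    """Return the k chunk texts most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in indexed_chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up 3-dimensional "embeddings" for illustration
chunks = [
    ("chunk about retrieval", [0.9, 0.1, 0.0]),
    ("chunk about cooking", [0.0, 0.2, 0.9]),
    ("chunk about embeddings", [0.7, 0.3, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], chunks, k=2))
```

A brute-force scan like this is linear in the number of chunks, which is one reason larger knowledge bases benefit from an indexed vector store.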

In [32]:
import os
from dotenv import load_dotenv
load_dotenv()

True

### Step 3. Retrieval

#### 1. Util functions for retrieval and response processing

In [33]:
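import base64
import io
import re

from IPython.display import HTML, display
from langchain_core.documents import Document
from langchain_core.messages import HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import ChatOpenAI
from PIL import Image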


def plt_img_base64(img_base64):
    """Display a base64-encoded string as an image"""
    # Create an HTML img tag with the base64 string as the source
    image_html = f'<img src="data:image/jpeg;base64,{img_base64}" />'
    # Display the image by rendering the HTML
    display(HTML(image_html))


def looks_like_base64(sb):
    """Check if the string looks like base64"""
    return re.match("^[A-Za-z0-9+/]+[=]{0,2}$", sb) is not None


def is_image_data(b64data):
    """
    Check if the base64 data is an image by looking at the start of the data
    """
    image_signatures = {
        b"\xff\xd8\xff": "jpg",
        b"\x89\x50\x4e\x47\x0d\x0a\x1a\x0a": "png",
        b"\x47\x49\x46\x38": "gif",
        b"\x52\x49\x46\x46": "webp",
    }
    try:
        header = base64.b64decode(b64data)[:8]  # Decode and get the first 8 bytes
        # True if the header starts with any known image signature
        return any(header.startswith(sig) for sig in image_signatures)
    except Exception:
        return False


def resize_base64_image(base64_string, size=(128, 128)):
    """
    Resize an image encoded as a Base64 string
    """
    # Decode the Base64 string
    img_data = base64.b64decode(base64_string)
    img = Image.open(io.BytesIO(img_data))

    # Resize the image
    resized_img = img.resize(size, Image.LANCZOS)

    # Save the resized image to a bytes buffer
    buffered = io.BytesIO()
    resized_img.save(buffered, format=img.format)

    # Encode the resized image to Base64
    return base64.b64encode(buffered.getvalue()).decode("utf-8")


def split_image_text_types(docs):
    """
    Split base64-encoded images and texts
    """
    b64_images = []
    texts = []
    for doc in docs:
        # Check if the document is of type Document and extract page_content if so
        if isinstance(doc, Document):
            doc = doc.page_content
        if looks_like_base64(doc) and is_image_data(doc):
            doc = resize_base64_image(doc, size=(1300, 600))
            b64_images.append(doc)
        else:
            texts.append(doc)
    return {"images": b64_images, "texts": texts}


def img_prompt_func(data_dict):
    """
    Join the context into a single string
    """
    formatted_texts = "\n".join(data_dict["context"]["texts"])
    messages = []

    # Adding image(s) to the messages if present
    if data_dict["context"]["images"]:
        for image in data_dict["context"]["images"]:
            image_message = {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image}"},
            }
            messages.append(image_message)

    # Adding the text for analysis
    text_message = {
        "type": "text",
        "text": (
            "You are a financial analyst tasked with providing investment advice.\n"
            "You will be given a mix of text, tables, and image(s), usually charts or graphs.\n"
            "Use this information to provide investment advice related to the user question. \n"
            f"User-provided question: {data_dict['question']}\n\n"
            "Text and / or tables:\n"
            f"{formatted_texts}"
        ),
    }
    messages.append(text_message)
    return [HumanMessage(content=messages)]


def multi_modal_rag_chain(retriever):
    """
    Multi-modal RAG chain
    """

    # Multi-modal LLM
    model = ChatOpenAI(temperature=0, model="gpt-4o", max_tokens=1024)

    # RAG pipeline
    chain = (
        {
            "context": retriever | RunnableLambda(split_image_text_types),
            "question": RunnablePassthrough(),
        }
        | RunnableLambda(img_prompt_func)
        | model
        | StrOutputParser()
    )

    return chain
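
The image check above works by decoding the start of the base64 string and comparing it against known magic bytes. Here is a small self-contained sketch of the same idea that can be run without the rest of the notebook. `detect_image_format` is a hypothetical variant that returns the format name instead of a boolean, and the sample payloads are synthetic.

```python
import base64

# Magic bytes at the start of some common image formats
SIGNATURES = {
    b"\xff\xd8\xff": "jpg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF8": "gif",
}


def detect_image_format(b64data):
    """Return the image format name if the decoded header matches a signature, else None."""
    try:
        header = base64.b64decode(b64data)[:8]
    except Exception:
        return None
    for signature, name in SIGNATURES.items():
        if header.startswith(signature):
            return name
    return None


# Synthetic payloads: a PNG header followed by padding, and ordinary text
png_like = base64.b64encode(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16).decode("utf-8")
plain_text = base64.b64encode(b"just some text").decode("utf-8")
print(detect_image_format(png_like))
print(detect_image_format(plain_text))
```

Checking the signature in addition to the base64 character pattern matters because ordinary text chunks can happen to consist only of base64-legal characters.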

#### 2. Process Query

In [34]:
user_query = "What is the performance of GPT-4 vs Mixtral?"
#user_query = "Describe the RAG-Sequence Model?"

#### 3. Retrieve Relevant Docs - Text, Table, Image

In [35]:
relevant_docs = retriever_multi_vector_img.invoke(user_query, limit=3) # Top k relevant

NameError: name 'retriever_multi_vector_img' is not defined

In [None]:
relevant_docs

In [None]:
for d in relevant_docs:
    if d.metadata["type"] == "image":
        plt_img_base64(d.page_content)
    elif d.metadata["type"] == "table":
        display(HTML(d.page_content))
    else:
        print("Text: ",d.page_content)

In [None]:
image_summaries[2]

#### 4. Reranking and Document Selection (Leave this to the MultiModal Retriever)

#### 5. Augmented Prompt

There are many ways to write the prompt. Essentially, it instructs the LLM to generate a result based on the {question} and the {context}.

The context comes from the documents retrieved in the previous step. 
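
Before bringing in a prompt-template class, it may help to see the substitution in plain Python. The sketch below joins retrieved chunks into one context string and fills the `{context}` and `{question}` placeholders with `str.format`. `build_prompt`, the separator, and the sample chunks are assumptions made for illustration.

```python
TEMPLATE = (
    "Answer the question based on the context below.\n"
    "If you can't answer the question, reply \"I don't know\".\n\n"
    "Context: {context}\n\n"
    "Question: {question}\n"
)


def build_prompt(question, retrieved_chunks):
    """Join retrieved chunks into one context string and fill the template."""
    context = "\n---\n".join(retrieved_chunks)
    return TEMPLATE.format(context=context, question=question)


prompt_text = build_prompt(
    "What is Modular RAG?",
    [
        "Modular RAG breaks RAG systems into modules.",
        "Operators connect the modules.",
    ],
)
print(prompt_text)
```

A template class adds message roles and input validation on top of this, but the core step is the same placeholder substitution.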

In [None]:
from langchain.prompts import ChatPromptTemplate

QA_RAG = "SIMPLE_QUESTION_ANSWER_RAG"

MM_QA_RAG = "MULTIMODAL_QUESTION_ANSWER_RAG"

prompt_type = {
    "QA_RAG" : "SIMPLE_QUESTION_ANSWER_RAG",
    "MM_QA_RAG" : "MULTIMODAL_QUESTION_ANSWER_RAG",
}

simple_rag_template = """
Answer the question based on the context below. 
If you can't answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""
multimodal_rag_template = """
To define the new Prompt.

Context: {context}

Question: {question}
"""

def initPrompt(type) -> ChatPromptTemplate:
    #default
    prompt = ChatPromptTemplate.from_template(simple_rag_template)
    if type == prompt_type["QA_RAG"]: 
        prompt = ChatPromptTemplate.from_template(simple_rag_template)
    if type == prompt_type["MM_QA_RAG"]: 
        prompt = ChatPromptTemplate.from_template(multimodal_rag_template)
    return prompt

### Step 4. Generation

We now send the augmented prompt to the LLM to generate a response to the user's query. The response is then parsed into readable text. 
In this experiment, the chain defined above uses the OpenAI model GPT-4o. 

Note: There are many options for LLM selection, from public to private and from simple to advanced. Privacy, performance, and quality must be traded off against one another. 

#### 1. QA Generation 
Use the LLM to generate a response to the augmented query.

In [None]:
# Create RAG chain
chain_multimodal_rag = multi_modal_rag_chain(retriever_multi_vector_img)

In [None]:
response = chain_multimodal_rag.invoke(user_query)

In [None]:
response

#### 2. Retrieve Topic and Relevant Articles 

In [None]:
# Load the document catalog saved earlier
with open('document_catalog.pickle', 'rb') as pkl_file:
    doc_catalog = pickle.load(pkl_file)
# Unique IDs of the papers that the retrieved chunks came from
articles = list({a.metadata["paper_id"] for a in relevant_docs})

In [None]:
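# Debug only: overwrite topics and summaries with dummy values.
# The list lengths must match the number of rows in doc_catalog.
doc_catalog["topic"] = ["a","b","c","c"]
doc_catalog["summary"] = ["1111","222","333","3333"]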

In [None]:
topics = list(set([doc_catalog["topic"].loc[i] for i, docid in enumerate(list(doc_catalog["docid"])) if docid in articles]))

In [None]:
topic_articles = [doc_catalog["filename"].loc[i][:-4] for i, docid, topic in zip(range(len(doc_catalog.index)),list(doc_catalog["docid"]),list(doc_catalog["topic"])) if docid not in articles and topic in topics]
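
The two comprehensions above are dense, so here is the same selection logic sketched with plain dictionaries: collect the topics of the retrieved papers, then recommend other papers that share one of those topics. The catalog rows and IDs below are made up for illustration.

```python
# Made-up catalog rows standing in for the document catalog
catalog = [
    {"docid": "d1", "filename": "paper_a.pdf", "topic": "benchmarks"},
    {"docid": "d2", "filename": "paper_b.pdf", "topic": "benchmarks"},
    {"docid": "d3", "filename": "paper_c.pdf", "topic": "frameworks"},
]
retrieved_ids = {"d1"}  # papers that contributed chunks to the answer

# Topics of the papers that were actually retrieved
retrieved_topics = {row["topic"] for row in catalog if row["docid"] in retrieved_ids}

# Other papers sharing one of those topics, with the ".pdf" suffix removed
recommended = [
    row["filename"][:-4]
    for row in catalog
    if row["docid"] not in retrieved_ids and row["topic"] in retrieved_topics
]
print(recommended)
```

Like the cells above, this matches whole topic strings exactly. If the `topic` column holds several comma-separated phrases per paper, the phrases would need to be split first for partial overlaps to count.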

#### 3. Retrieve Article Summary

In [None]:
articles_summary = []
topic_articles_summary = []
if articles:
    articles_summary = [doc_catalog["summary"].loc[i] for i, docid in enumerate(list(doc_catalog["docid"])) if docid in articles]
if topic_articles:
    topic_articles_summary = [doc_catalog["summary"].loc[i] for i, filename in enumerate(list(doc_catalog["filename"])) if filename[:-4] in topic_articles]

#### 4. Generate the final response

In [None]:
print("Your Query:", user_query)
print("The answer:", response)
if articles:
    print("\nYou can find the details of the answer from the following articles")
    for i, docid in enumerate(articles):
        # Look up the catalog row by docid so the filename and summary always match
        row = doc_catalog[doc_catalog["docid"] == docid].iloc[0]
        print("\nArticle "+str(i+1)+": "+ row["filename"][:-4])
        print("Article Summary:\n"+row["summary"])
if topic_articles:
    print("\nYou seem interested in the topics:", ", ".join(topics),"\nYou may be interested in other articles in those topics below:")
    for i in range(len(topic_articles)):
        print("\nArticle "+str(i+1)+": "+topic_articles[i])
        print("Article Summary:\n"+topic_articles_summary[i])


# III. Research Assistant Use Cases

Demonstration of the Research Assistant for: 
- Answering queries
- Finding relevant papers, both from the query and from the topic
- Summarizing the recommended papers