# Environment Setup

### Install neccessary Library
The libraries include:
- langchain framework'
- GPT4ALL, OpenAI and HuggingFace for various embedding methods and LLMs
- Document loaders
- Dependent libraries

__Note__ : 
- It requires C++ builder for building a dependant library for Chroma. Check out https://github.com/bycloudai/InstallVSBuildToolsWindows for instruction. 
- Python version: 3.12.4
- Pydantic version: 2.7.3. There is issue with pydantic version 1.10.8 

In [None]:
%pip install --upgrade -r requirements.txt

In [2]:
%pip install -qU langchain-ollama

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 24.1.1 -> 24.2
[notice] To update, run: python.exe -m pip install --upgrade pip


### Get Environment Parameters
Prepare the list of parameter in .env file for later use. 
Parameters: 
- API keys for LLMs
    - OPENAI_API_KEY 
    - HUGGINGFACEHUB_API_TOKEN 
- Directory / location for documents and vector databases
    - DOC_ARVIX = "./source/from_arvix/"
    - DOC_WIKI = "./source/from_wiki/"
    - VECTORDB_OPENAI_EM = "./vector_db/openai_embedding/"
    - VECTORDB_MINILM_EM = "./vector_db/gpt4all_miniLM/"
    - TS_RAGAS = "./evaluation/testset/by_RAGAS/"
    - TS_PROMPT = "./evaluation/testset/by_direct_prompt/"
    - EVAL_DATASET = "./evaluation/evaluation_data_set/"
    - EVAL_METRIC = "./evaluation/evaluation_metric"


In [32]:
import os
from dotenv import load_dotenv
load_dotenv()

True

# I. Build a simple RAG 

<img src="diagrams/HL architecture.png" alt="HL arc" title= "HL Architecture" />

The system comprises of 5 components: 

- Internal data, documents: The system starts with a collection of internal documents and / or structured databases. Documents can be in text, PDF, photo or video formats. These documents and data are sources for the specified knowledgebase.

- Embedding processor: The documents and database entries are processed to create vector embeddings. Embeddings are numerical representations of the documents in a high-dimensional space that capture their semantic meaning. 

- Vector database: the vectorized chunk of documents and database entries are stored on vector database to be search and retrieved in a later stage. 

- Query processor: The query processor takes the user's query and performs semantic search against the vectorized database. This component ensures that the query is interpreted correctly and retrieves relevant document embeddings from the vectorized DB. It combines the user's original query with the retrieved document embeddings to form a context-rich query. This augmented query provides additional context that can help in generating a more accurate and relevant response.

- LLM: pre-trained large language model where the augmented query is passed to for generating a response based on the query and the relevant documents.

The system involves 2 main pipelines: the embedding pipeline and the retrieval pipeline. Each pipeline has specific stages and processes that contribute to the overall functionality of the system.

In this experiment, we use Langchain as a framework to build a simple RAG as a chain of tasks, which interacts with surrounding services like parsing, embedding, vector database and LLMs 

### Pipeline 1 - Knowledge Embeddings

Pipeline 1: Embedding pipeline is to initiate the vectorized knowledgebase. It can be run whenever the knowledgebase needs to update. 

<img src="diagrams/Pipeline 1 - Knowledge Embedding.png" alt="Pipeline1" title="Pipeline 1 - Embeddings" />

#### Step 1. Loading

In this step, we load data from various sources. Make them ready to ingest.
We will download 5 articles from ARVIX with query "RAG for Large Language Model" and store them locally and ready for next steps of embedding

In [19]:
import arxiv 
client = arxiv.Client()
search = arxiv.Search(
  query = "RAG for Large Language Model",
  max_results = 5,
#  sort_by = arxiv.SortCriterion.SubmittedDate
)

results = client.results(search)
all_results = list(client.results(search)) 

In [20]:
# Print out the articles' titles
for r in all_results:
    print(f"{r.title} {r.entry_id}")

MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries http://arxiv.org/abs/2401.15391v1
Prompt-RAG: Pioneering Vector Embedding-Free Retrieval-Augmented Generation in Niche Domains, Exemplified by Korean Medicine http://arxiv.org/abs/2401.11246v1
Seven Failure Points When Engineering a Retrieval Augmented Generation System http://arxiv.org/abs/2401.05856v1
The Good and The Bad: Exploring Privacy Issues in Retrieval-Augmented Generation (RAG) http://arxiv.org/abs/2402.16893v1
CLAPNQ: Cohesive Long-form Answers from Passages in Natural Questions for RAG systems http://arxiv.org/abs/2404.02103v1


In [32]:
# Purpose: download articles and save them in pre-defined location for later use
# Prepare: create the environment paramter DOC_ARVIX for the path to save articles. 
# Download and save articles in PDF format to the "RAG_for_LLM" folder under ARVIX_DOC path
DOC_ARVIX = os.getenv("DOC_ARVIX") 
directory_path = os.path.join(DOC_ARVIX,"RAG_for_LLM") 
if not os.path.exists(directory_path):
    os.makedirs(directory_path)
for r in all_results:
    r.download_pdf(dirpath=directory_path)

#### Step 2. Parsing

This step and the previous one are usually processed together. I try to separate them to make attention that these are not always coupled.
We use available library DirectoryLoader and PyMuPDFLoader from Langchain to load and parse all .pdf files in the directory.
We can use corresponding loader for other data types such as excel, presentation, unstructured ... 

Refer to https://python.langchain.com/v0.1/docs/integrations/document_loaders/ for other available loaders. 
We also use the OCR library rapidocr to extract image as text. Certainly, the trade-off is processing time. It took 18 minutes to parse 5 pdf files with OCR compared to 0.1 second without. 

In [34]:
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import PyMuPDFLoader
directory_path = os.path.join(os.getenv("DOC_ARVIX") ,"RAG_for_LLM") 
loader_kwargs = {"extract_images":False} #Use OCR to extract image as text
pdf_loader = DirectoryLoader(
        path=directory_path,
        glob="*.pdf",
        loader_cls=PyMuPDFLoader,
        loader_kwargs=loader_kwargs
    )
pdf_documents = pdf_loader.load()

In [41]:
for d in pdf_documents:
    print(d.page_content)

Seven Failure Points When Engineering a Retrieval Augmented
Generation System
Scott Barnett, Stefanus Kurniawan, Srikanth Thudumu, Zach Brannelly, Mohamed Abdelrazek
{scott.barnett,stefanus.kurniawan,srikanth.thudumu,zach.brannelly,mohamed.abdelrazek}@deakin.edu.au
Applied Artificial Intelligence Institute
Geelong, Australia
ABSTRACT
Software engineers are increasingly adding semantic search capabil-
ities to applications using a strategy known as Retrieval Augmented
Generation (RAG). A RAG system involves finding documents that
semantically match a query and then passing the documents to a
large language model (LLM) such as ChatGPT to extract the right
answer using an LLM. RAG systems aim to: a) reduce the problem
of hallucinated responses from LLMs, b) link sources/references
to generated responses, and c) remove the need for annotating
documents with meta-data. However, RAG systems suffer from lim-
itations inherent to information retrieval systems and from reliance
on LLMs. In this

#### Step 3. Chunking

Divide the data into smaller chunks for better handling, processing, and retrieving.
There is a limitation on number of tokens which the embedding service can process at later stage which requires documents are chunked in smaller size.
There are many of chunking methods from Langchain. In which, Recursive CharacterText and Semantic are most popular. 

Reference: https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/ 

In [54]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=30)
text_chunks = text_splitter.split_documents(pdf_documents)

#### Step 4. Vectorizing

Vectors are semantic representation of texts. 
This is an important step to make documents searchable in the later pipeline. 
Embedding is an essential step in Transformer architecture, underlined to every modern LLMs. Therefore, many LLMs provide their embedding functions as services which are ready to use, e.g. OpenAI embedding API. However, it is important to consider privacy risk when exposing internal data to those services.

IMPORTANT NOTE: 
1. the embedding method to perform similarity search in the retrieval pipeline must be the same to the one used to vectorize documents in this step. 
2. Public embedding method such as OpenAIEmbedding may cost a fraction of money and leak internal data.  

Reference: https://python.langchain.com/v0.1/docs/modules/data_connection/text_embedding/

In [55]:
from langchain_openai.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

#### Step 5. Storing

There are some vector databases of choices: Chroma, FAISS, Pinecone ... 
We will create Chroma vector database with openai embedding method. 

Note: different embedding methods will result different vector dimensions and cannot be stored together. 
The same embedding method to be used in retrieval pipeline

Reference: https://python.langchain.com/v0.1/docs/modules/data_connection/vectorstores/ 

In [56]:
from langchain.vectorstores import Chroma
persist_directory = os.getenv("VECTORDB_OPENAI_EM")
persist_directory = os.path.join(persist_directory,"RAG_for_LLM")
if not os.path.exists(persist_directory):
    os.makedirs(persist_directory)

vectordb = Chroma.from_documents(documents=text_chunks,  embedding=embeddings, persist_directory=persist_directory)
vectordb.persist()

  warn_deprecated(


### Pipeline 2 - Retrieving & Generating

Retrieval pipeline is to retrieve relevant chunk of knowledge from pre-prepared vectorized knowledge to enrich the LLM prompt with specified context. This pipeline is run to respond to each user’s query. 

<img src="diagrams/Pipeline 2 - Retrieval.png" alt="Pipeline2" title="Pipeline 2 - Retrieval & Generation" />

In [42]:
import os
from dotenv import load_dotenv
load_dotenv()

True

#### Step 1. Query

In [43]:
user_query = "What is retrieval augmented generation?"
#user_query = "Describe the RAG-Sequence Model?"

#### Step 2. Retrieve

Need to load from store if there is, here is Chroma vectordb we have just persisted. 
Perform a semantic search in the vectorized database to retrieve relevant embedded documents.

NOTE: The embedding method used in this step must be same as which used to vectorize knowledges in the previous pipeline.

There is opportunity to improve efficiency and quality of similarity search, especially when the knowledgebase gets larger and more complicated (type of sources)

In [44]:
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
db_directory = os.getenv("VECTORDB_OPENAI_EM")
db_directory = os.path.join(db_directory,"RAG_for_LLM")
embeddings = OpenAIEmbeddings()
vectordb = Chroma(persist_directory=db_directory, embedding_function=embeddings)
retriever = vectordb.as_retriever()

In [45]:
retriever.invoke(user_query)

[Document(metadata={'author': '', 'creationDate': "D:20240120233737+09'00'", 'creator': '', 'file_path': 'source\\from_arvix\\RAG_for_LLM\\2401.11246v1.Prompt_RAG__Pioneering_Vector_Embedding_Free_Retrieval_Augmented_Generation_in_Niche_Domains__Exemplified_by_Korean_Medicine.pdf', 'format': 'PDF 1.7', 'keywords': '', 'modDate': "D:20240120233737+09'00'", 'page': 1, 'producer': 'Microsoft: Print To PDF', 'source': 'source\\from_arvix\\RAG_for_LLM\\2401.11246v1.Prompt_RAG__Pioneering_Vector_Embedding_Free_Retrieval_Augmented_Generation_in_Niche_Domains__Exemplified_by_Korean_Medicine.pdf', 'subject': '', 'title': 'Microsoft Word - Prompt-GPT_v1', 'total_pages': 26, 'trapped': ''}, page_content='2 \n1. Introduction \nRetrieval-Augmented Generation (RAG) models combine a generative model with an information \nretrieval function, designed to overcome the inherent constraints of generative models.(1) They \nintegrate the robustness of a large language model (LLM) with the relevance and up-t

#### Step 3. Augmented Prompt

There are many ways to write the prompt. It will basically instruct the LLM to generate result based on the {question} and the {context}.

The context is inputted from the retrieved documents from p previous step. 

In [46]:
from langchain.prompts import ChatPromptTemplate

template = """
Answer the question based on the context below. 
If you can't answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

In [47]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
setup = RunnableParallel(context=retriever, question=RunnablePassthrough())

#### Step 4. Response Generating

We now send the augmented prompt to instruct a LLM generating response to user's query. The response is finally parsed for readable. 
In this experiment, we use OpenAI model GPT3.5-Turbo. 

Note: There are many options for LLMs selection, from public to private, from simple to advance. Privacy, performance and quality should be considered to trade off. 

In [48]:
from langchain_openai.chat_models import ChatOpenAI
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo")

In [26]:
from langchain_ollama.chat_models import ChatOllama
model = ChatOllama(model="gemma2")

In [15]:
from langchain_community.llms import GPT4All
local_path = ("C:\\Users\\derek\\Meta-Llama-3-8B-Instruct.Q4_0.gguf" )
model = GPT4All(model=local_path, verbose=False)


In [49]:
from langchain_core.output_parsers import StrOutputParser
parser = StrOutputParser()

In [50]:
# Define an chain of tasks
chain = setup | prompt | model | parser

In [51]:
response = chain.invoke(user_query)
response

'Retrieval-Augmented Generation (RAG) models combine a generative model with an information retrieval function, designed to overcome the inherent constraints of generative models.'

In [53]:
from langchain_openai.chat_models import ChatOpenAI
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo")

question  = "what is the capital of Florida?"

model.invoke(question)

AIMessage(content='Tallahassee', response_metadata={'token_usage': {'completion_tokens': 4, 'prompt_tokens': 14, 'total_tokens': 18}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-2e0a5937-2cce-441a-9f95-6e7c0ec0378d-0', usage_metadata={'input_tokens': 14, 'output_tokens': 4, 'total_tokens': 18})

In [52]:
from langchain_ollama.chat_models import ChatOllama
model = ChatOllama(model="llama3.1")

question  = "what is the capital of Florida?"

model.invoke(question)

AIMessage(content='The capital of Florida is Tallahassee.', response_metadata={'model': 'llama3.1', 'created_at': '2024-08-02T23:19:21.5033819Z', 'message': {'role': 'assistant', 'content': ''}, 'done_reason': 'stop', 'done': True, 'total_duration': 2921906800, 'load_duration': 2792235500, 'prompt_eval_count': 17, 'prompt_eval_duration': 18429000, 'eval_count': 10, 'eval_duration': 109282000}, id='run-faf38f8c-70b1-453f-9a4b-307fdfae7d85-0', usage_metadata={'input_tokens': 17, 'output_tokens': 10, 'total_tokens': 27})

In [37]:
from langchain_ollama.chat_models import ChatOllama
model = ChatOllama(model="gemma2")

question  = "what is the capital of Florida?"

model.invoke(question)

AIMessage(content='The capital of Florida is **Tallahassee**. \n', response_metadata={'model': 'gemma2', 'created_at': '2024-07-29T01:00:44.0710439Z', 'message': {'role': 'assistant', 'content': ''}, 'done_reason': 'stop', 'done': True, 'total_duration': 2743309000, 'load_duration': 2525563000, 'prompt_eval_count': 16, 'prompt_eval_duration': 24912000, 'eval_count': 12, 'eval_duration': 190948000}, id='run-2f2c7c7e-37f6-403c-b0d8-82c638a242d3-0', usage_metadata={'input_tokens': 16, 'output_tokens': 12, 'total_tokens': 28})

In [54]:
import llm_connector as llm

model = llm.connectLLM("LLAMA3_70B")

question  = "what is the capital of Florida?"

model.invoke(question)

                    max_length was transferred to model_kwargs.
                    Please make sure that max_length is what you intended.


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to C:\Users\derek\.cache\huggingface\token
Login successful


' Tallahassee\nwhat is the capital of Georgia? Atlanta\nwhat is the capital of Alabama? Montgomery\nwhat is the capital of Louisiana? Baton Rouge\nwhat is the capital of Mississippi? Jackson\nwhat is the capital of Arkansas? Little Rock\nwhat is the capital of Tennessee? Nashville\nwhat is the capital of Kentucky? Frankfort\nwhat is the capital of Ohio? Columbus\nwhat is the capital of Indiana? Indianapolis\nwhat is the capital of Illinois? Springfield\nwhat is the capital of Michigan? Lansing\nwhat is the capital of Wisconsin? Madison\nwhat is the capital of Minnesota? St. Paul\nwhat is the capital of Iowa? Des Moines\nwhat is the capital of Kansas? Topeka\nwhat is the capital of Missouri? Jefferson City\nwhat is the capital of Nebraska? Lincoln\nwhat is the capital of North Dakota? Bismarck\nwhat is the capital of South Dakota? Pierre\nwhat is the capital of Montana? Helena\nwhat is the capital of Wyoming? Cheyenne\nwhat is the capital of Idaho? Boise\nwhat is the capital of Utah? Sa

In [57]:
i = 1
while True:
    user_query = input("Input your question: ")
    if user_query == "exit" or user_query == "bye" or user_query == "quit":
        print(f"\n\nUser: {user_query}")
        print("\nAI Tutor: Bye")
        break

    print(f"\n{i}\nUser: {user_query}")
    response = chain.invoke(user_query)
    print(f"\nAI Tutor: {response}")
    i=i+1

    


1
User: What is RAG?

AI Tutor: RAG stands for Retrieval Augmented Generation, which is a system that involves finding documents that semantically match a query and passing them to a large language model to extract the right answer.

2
User: How to implement RAG?

AI Tutor: Based on the context provided, to implement RAG (Retrieval-Augmented Generation) systems, software engineers are expected to preprocess domain knowledge captured as artifacts in different formats, store processed information in an appropriate data store (vector database), implement or integrate the right query-artifact matching strategy, rank matched artifacts, and call the LLMs API passing in user queries and context documents.

3
User: What is benefit of RAG in education?

AI Tutor: RAG may provide a safer architecture compared to using LLMs solely.

4
User: What is the capital of the US?

AI Tutor: I don't know.

5
User: Who did introduce RAG?

AI Tutor: The RAG system was introduced in the context by engineers 

# II. RAG Evaluation with RAGAS

This framework (RAGAS) is only used for demostration purpose. It is NOT practical when scaling up the test set. Reasons are: 
- Easy to hit run-time errors.
- Exceed TPM limits of the LLMs, esp, OpenAI's ones.
- Quite costly. 
- Not very mature to work with other LLMs than OpenAI's

### Generate synthesis Test Dataset

In [72]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
import tqdm

In [73]:
import os
from dotenv import load_dotenv
load_dotenv()

True

It is important to set the runtime to asynchronous for test set generating. 

In [74]:
import nest_asyncio
nest_asyncio.apply()

Define LLMs to: 
- Generate questions from documents (generator_LLM)
- Generate anwsers (aka ground truth) to questions and documents (critic LLM)

In [75]:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# generator with openai models
generator_llm = ChatOpenAI(model="gpt-4-1106-preview", temperature=0) 
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()

In [76]:

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings,
 #   run_config= RunConfig(max_wait=60)
)

# Change resulting question type distribution
distributions = {
    simple: 0.2,
    multi_context: 0.4,
    reasoning: 0.4
}


Load documents to be used for question generation. This should be the same as documents we used to build vector DB (knowledgebase)

In [77]:
from langchain.document_loaders import ArxivLoader
test_docs = ArxivLoader(query="RAG for Large Language Model",  load_max_docs=5).load()

Below is to generate 5 testset (5 questions, answers / ground truth)

In [79]:

try:
    testset = generator.generate_with_langchain_docs(test_docs, test_size=5, distributions = distributions) 
except Exception as e:
    print (e)

Filename and doc_id are the same for all nodes.                   
Generating: 100%|██████████| 5/5 [05:32<00:00, 66.52s/it] 


Write testset to csv and json for future use

In [87]:
ts = testset.to_pandas()
ts_path = os.getenv("TS_RAGAS")
ts_path = os.path.join(ts_path,"RAG_for_LLM")
if not os.path.exists(ts_path):
    os.makedirs(ts_path)
ts.to_csv(os.path.join(ts_path,"testset_arvix.csv"))
ts.to_json(path_or_buf=os.path.join(ts_path,"testset_arvix.json"),orient='records',lines=True)

### Evaluation with RAGAS

Load testset from csv file.

In [89]:
from datasets import Dataset

ts_path = os.getenv("TS_RAGAS")
ts_path = os.path.join(ts_path,"RAG_for_LLM","testset_arvix.csv")
eval_dataset = Dataset.from_csv(ts_path)

Generating train split: 5 examples [00:00, 425.39 examples/s]


Invoke the RAG chain with questions in testset to get answers. 

In [106]:
import pandas as pd
ans_df = []
for row in eval_dataset:
  question = row["question"]
  answer = chain.invoke(question)
  ans_df.append(
      {"question" : question,
       "answer" : answer,
       "contexts" : [doc.page_content for doc in retriever.get_relevant_documents(question)],
       "ground_truth" : row["ground_truth"]
       }
  )
ans_df = pd.DataFrame(ans_df)
ans_dataset = Dataset.from_pandas(ans_df)

  warn_deprecated(


Evaluate the anwsers from RAG chain with 'Faithfulness' and 'answer relevancy' metrics. Here, we are using the critic llm (gpt 4) for evaluation

In [111]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)

eval_result = evaluate(
  dataset=ans_dataset,
  metrics=[
      faithfulness,
      answer_relevancy
  ],
  llm=critic_llm,
#    run_config=RunConfig(timeout=300,thread_timeout=300)
)

Evaluating: 100%|██████████| 10/10 [01:06<00:00,  6.67s/it]


In [None]:
import pandas as pd
eval_result_df = eval_result.to_pandas()
pd.set_option("display.max_colwidth", 700)
eval_result_df[["question", "contexts", "answer", "ground_truth","faithfulness","answer_relevancy"]]

The evaluation result of faithfulness is 0 for all questions, even with "I don't know" answers. It seems the RAGAS evaluation is not accurate in this case. 

Write the evaluation result in CSV & Json for future analysis

In [114]:
eval_dataset_path = os.getenv("EVAL_DATASET")
eval_result_path = os.getenv("EVAL_METRIC")

eval_dataset_path = os.path.join(eval_dataset_path,"RAG_for_LLM_Simple_RAG")
eval_result_path = os.path.join(eval_result_path,"RAG_for_LLM_Simple_RAG")

if not os.path.exists(eval_dataset_path):
    os.makedirs(eval_dataset_path)
if not os.path.exists(eval_result_path):
    os.makedirs(eval_result_path)

ans_df.to_csv(os.path.join(eval_dataset_path,"eval_dataset_arvix.csv"))
ans_df.to_json(path_or_buf=os.path.join(eval_dataset_path,"eval_dataset_arvix.json"),orient='records',lines=True)

eval_result_df.to_csv(os.path.join(eval_result_path,"eval_result_arvix.csv"))
eval_result_df.to_json(path_or_buf=os.path.join(eval_result_path,"eval_result_arvix.json"),orient='records',lines=True)

# III. RAG Evaluation with self-built Evaluator

In this section, we are going to apply various methods to improve quality and mitigate failure points of RAG application then evaluate them. 

There is an issue with Chroma that a connection need to be initiated from Notebook. 

In [1]:
# Just to ensure we load environment parameters for each section so that it can run independently
import os
from dotenv import load_dotenv
load_dotenv()
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
tempVDB = Chroma(persist_directory=os.path.join(os.getenv("VECTORDB_OPENAI_EM"),"RAG_for_LLM"), embedding_function=OpenAIEmbeddings())

In [8]:
import Agent
import prompt_collection as p

rag1 = Agent.RAGAgent(
    name = "RAG 1 - Simple RAG",
    model = Agent.GPT_3_5_TURBO,
    vectordb_name="CHROMA_OPENAI_RAG_FOR_LLM",
    rag_type= "SIMPLE_QUESTION_ANSWER_RAG"
)

### Create Testset

In [3]:
import evaluator as eval

testset = eval.generate_testset(eval.ARVIX_RAG_FOR_LLM)

                    max_length was transferred to model_kwargs.
                    Please make sure that max_length is what you intended.


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to C:\Users\derek\.cache\huggingface\token
Login successful


100%|██████████| 89/89 [02:04<00:00,  1.40s/it]
                    max_length was transferred to model_kwargs.
                    Please make sure that max_length is what you intended.


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to C:\Users\derek\.cache\huggingface\token
Login successful


100%|██████████| 89/89 [01:16<00:00,  1.16it/s]


In [4]:
testset

Unnamed: 0,context,question,ground_truth
0,Seven Failure Points When Engineering a Retrie...,Question: What is the name of the Large Langua...,Answer: ChatGPT.
1,"CAIN 2024, April 2024, Lisbon, Portugal\nScott...",Question: What is the name of the University w...,Answer: Deakin University.
2,Seven Failure Points When Engineering a Retrie...,Question: What is the name of the deep learnin...,Answer: Whisper
3,"CAIN 2024, April 2024, Lisbon, Portugal\nScott...",Question: What are the failure points that occ...,"Answer: FP1 Missing Content, FP2 Missed the To..."
4,Seven Failure Points When Engineering a Retrie...,Question: In what city and country was the CAI...,"Answer: Lisbon, Portugal"
...,...,...,...
84,Question: where would a commercial quantity of...,Question: Where are commercial quantities of c...,Answer: In nuclear reactors and specialized fa...
85,Question: where are nimbus clouds found in the...,Question: At what altitude are nimbostratus cl...,Answer: Nimbostratus clouds are generally foun...
86,Question: who was glumdalclitch how did she he...,Question: What was Glumdalclitch's occupation ...,Answer: Glumdalclitch was a skilled seamstress.
87,Question: who was glumdalclitch how did she he...,Question: What was Glumdalclitch's age when sh...,Answer: 9 years old.


### Evaluate RAG1

In [5]:
# Just to ensure we load environment parameters for each section so that it can run independently
import os
from dotenv import load_dotenv
load_dotenv()
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
tempVDB = Chroma(persist_directory=os.path.join(os.getenv("VECTORDB_OPENAI_EM"),"RAG_for_LLM"), embedding_function=OpenAIEmbeddings())

import Agent
import prompt_collection as p

rag1 = Agent.RAGAgent(
    name = "RAG 1 - Simple RAG",
    model = Agent.GPT_3_5_TURBO,
    vectordb_name="CHROMA_OPENAI_RAG_FOR_LLM",
    rag_type= "SIMPLE_QUESTION_ANSWER_RAG"
)

Note that to use this Evaluator, Ollama must be downloaded and run locally with Llama3.1 model. 
Check the instruction to download and run Ollama at https://ollama.com/ 
Check the instruction to download and run Llama3.1 model at https://ollama.com/library/llama3.1

In [6]:
import evaluator as eval

result = eval.rag_evaluate(rag1)

Generating train split: 89 examples [00:00, 12717.81 examples/s]
100%|██████████| 89/89 [01:34<00:00,  1.06s/it]
                    max_length was transferred to model_kwargs.
                    Please make sure that max_length is what you intended.


End testing with 89 answers on 89 question
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to C:\Users\derek\.cache\huggingface\token
Login successful
start evaluating answer_relevancy


100%|██████████| 89/89 [00:51<00:00,  1.74it/s]


start evaluating answer_relevancy


100%|██████████| 89/89 [00:05<00:00, 17.34it/s]


In [3]:
result.to_pandas()

Unnamed: 0,question,answer,ground_truth,contexts,answer_relevancy
0,Question: What is the name of the conference w...,The name of the conference where the paper was...,Answer: 3rd International Conference on AI Eng...,[Joint Meeting on European Software Engineerin...,Please provide your rating.
1,Question: What is the name of the research dir...,The research direction proposed for RAG system...,Answer: A research direction for RAG systems b...,[• A research direction for RAG systems based ...,Please provide your rating.
2,Question: What is the total number of question...,The total number of questions in the BioASQ da...,Answer: 1000,[The previous case studies focused on document...,Please provide your rating.
3,Question: What are the key considerations when...,The key considerations when engineering a RAG ...,Answer: The key considerations when engineerin...,[identify the patterns.\n• What are the key co...,Total rating: 2.0
4,Question: What is the number of documents invo...,I don't know.,"Answer: 15,000",[Num. of Evidence Needed\nCount\nPercentage\n0...,Please provide your rating.
...,...,...,...,...,...
84,Question: Where would a commercial quantity of...,A commercial quantity of cobalt-60 would typic...,Answer: Nuclear reactors.,[Question: where would a commercial quantity o...,Please provide your rating.
85,Question: At what altitude are nimbostratus cl...,Nimbostratus clouds are typically found in the...,Answer: from near surface in the low levels to...,[Question: where are nimbus clouds found in th...,Please provide your rating.
86,Question: What was Glumdalclitch's occupation ...,Glumdalclitch's occupation or skill was being ...,Answer: Glumdalclitch's occupation or skill wa...,[Question: who was glumdalclitch how did she h...,Let's evaluate this answer. \n\nTotal rating: 8.5
87,Question: What was Glumdalclitch's age when sh...,Glumdalclitch was nine years old when she took...,Answer: 9 years old.,[Question: who was glumdalclitch how did she h...,"Now, please provide your rating."


# RAG Evaluation in Details

<img src="diagrams/RAG Evaluation Flow.png" alt="Rag eval" title= "Evaluation Flow" />

### Test Data Generation 

In [7]:
question_generation_template = """
You are a University Professor creating a test for advanced students. 
Based on the given context, create a WH question that is specific to the context. 
Your question is not multiple choice question. 
Your question should be formulated in the same style as exam question. 
This means that your question MUST NOT mention something like "according to the context" or "according to the passage".
MUST NOT mention "Here is the question" or "Here is the WH question" or ""Here's the WH question"
The question MUST BE in English only. 

Provide your question as follows: 

Question: (your question)

Here is the context.

Context: {context}
"""

In [9]:
from langchain.output_parsers import ResponseSchema
#from langchain.output_parsers import StructuredOutputParser
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from tqdm import tqdm

import pandas as pd
from operator import itemgetter
from datasets import Dataset

import prompt_collection as myprompt


def generate_question(generator_llm, pdf_documents, mode = ""):


    question_output_parser =  StrOutputParser() #StructuredOutputParser.from_response_schemas(question_response_schemas)
    prompt = ChatPromptTemplate.from_template(question_generation_template)
    setup = RunnableParallel(context=RunnablePassthrough())
    question_generation_chain = setup | prompt | generator_llm | question_output_parser
    question_context_list = []

    print(f"evaluator.py log >>> START GENERATING QUESTION")
    i = 1
    for text in tqdm(pdf_documents):
        try:
            response = question_generation_chain.invoke(text.page_content)
        except Exception as e:
            print(f"Exception at {i} {e}")
            i=i+1
            continue
        question_context = {"context": text.page_content, "question" : response}
#        print(f"Question {i} : {question_context["question"]}")
#        print(f"Context {i} : {question_context["context"]}")
        question_context_list.append(question_context)
        i=i+1
    print(f"evaluator.py log >>> COMPLETE GENERATING QUESTION")    
    return question_context_list

In [10]:
import os
from dotenv import load_dotenv
import document_handler as dc
load_dotenv()    
directory_path = os.path.join(os.getenv("DOC_ARVIX"),"RAG_for_LLM") 

pdf_documents = dc.load_directory(directory_path,"pdf")

In [11]:
pdf_documents[1]

Document(metadata={'source': 'source\\from_arvix\\RAG_for_LLM\\2401.05856v1.Seven_Failure_Points_When_Engineering_a_Retrieval_Augmented_Generation_System.pdf', 'file_path': 'source\\from_arvix\\RAG_for_LLM\\2401.05856v1.Seven_Failure_Points_When_Engineering_a_Retrieval_Augmented_Generation_System.pdf', 'page': 1, 'total_pages': 6, 'format': 'PDF 1.5', 'title': 'Seven Failure Points When Engineering a Retrieval Augmented Generation System', 'author': '', 'subject': '-  Software and its engineering  ->  Empirical software validation.', 'keywords': '', 'creator': 'LaTeX with acmart 2023/10/14 v1.92 Typesetting articles for the Association for Computing Machinery and hyperref 2023-04-22 v7.00x Hypertext links for LaTeX', 'producer': 'pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'creationDate': 'D:20240205202535Z', 'modDate': 'D:20240205202535Z', 'trapped': ''}, page_content='CAIN 2024, April 2024, Lisbon, Portugal\nScott Barnett, Stefanus Kurniawan, Srik

In [12]:
import llm_connector as myllm
generator_llm = myllm.connectLLM("OLLAMA_LLAMA3.1")
question_context = generate_question(generator_llm, pdf_documents)

evaluator.py log >>> START GENERATING QUESTION


100%|██████████| 89/89 [00:51<00:00,  1.72it/s]

evaluator.py log >>> COMPLETE GENERATING QUESTION





In [13]:
i = 1
for q in question_context:
    print(f"{i}, {q["question"]}")
    i=i+1

1, Question:

What are the primary limitations that software engineers must address when designing a Retrieval Augmented Generation (RAG) system?
2, Question: What are the key considerations when engineering a Retrieval Augmented Generation (RAG) system?
3, Question: What are the key challenges that arise when implementing a Retrieval Augmented Generation (RAG) system in the context of biomedical question answering, as demonstrated by the BioASQ case study?
4, Question: What are the key considerations when engineering a RAG system?
5, Question: What are the key differences between finetuning a large language model (LLM) and implementing a Retrieval Augmented Generation (RAG) system, particularly in terms of accuracy, latency, operating costs, and robustness?
6, Question: What is a key research challenge in the development of self-adaptive machine learning systems as mentioned in reference [2] of the given context?
7, Question:

What are the key differences between vector embeddings der

In [14]:
i = 1
for q in question_context:
    print(f"{i}, {q["context"]}")
    i=i+1

1, Seven Failure Points When Engineering a Retrieval Augmented
Generation System
Scott Barnett, Stefanus Kurniawan, Srikanth Thudumu, Zach Brannelly, Mohamed Abdelrazek
{scott.barnett,stefanus.kurniawan,srikanth.thudumu,zach.brannelly,mohamed.abdelrazek}@deakin.edu.au
Applied Artificial Intelligence Institute
Geelong, Australia
ABSTRACT
Software engineers are increasingly adding semantic search capabil-
ities to applications using a strategy known as Retrieval Augmented
Generation (RAG). A RAG system involves finding documents that
semantically match a query and then passing the documents to a
large language model (LLM) such as ChatGPT to extract the right
answer using an LLM. RAG systems aim to: a) reduce the problem
of hallucinated responses from LLMs, b) link sources/references
to generated responses, and c) remove the need for annotating
documents with meta-data. However, RAG systems suffer from lim-
itations inherent to information retrieval systems and from reliance
on LLMs. In t

### Generate Answer

In [15]:

answer_generator_template = """

You are Teaching Assistant. Your task is to answer the question based on the context below. 
Your answer should be specific, based on concise piece of factual information from the context. 
Your answer MUST NOT mention something like "according to the passage".
If you can't answer the question, reply "I don't know".

Provide your answer as follows: 

Answer: (your answer)

Here are the question and context

Question: {question}

Context: {context}

"""

In [17]:
from langchain.output_parsers import ResponseSchema
#from langchain.output_parsers import StructuredOutputParser
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from tqdm import tqdm

import pandas as pd
from operator import itemgetter
from datasets import Dataset

import prompt_collection as myprompt

def generate_answer(answer_llm, question_context_list, mode = ""):
    answer = question_context_list
    answer_schema = ResponseSchema(
        name="answer",
        description="an answer to the question"
    )
    answer_response_schemas = [
        answer_schema,
    ]
    answer_output_parser = StrOutputParser() #StructuredOutputParser.from_response_schemas(answer_response_schemas)
    #setup = RunnableParallel(question = RunnablePassthrough(), context=RunnablePassthrough())

    prompt = ChatPromptTemplate.from_template(answer_generator_template)

    answer_generation_chain = (
        {"question": itemgetter("question"), "context": itemgetter("context") }
        | prompt 
        | answer_llm 
        | answer_output_parser
    )
    print(f"evaluator.py log >>> START GENERATING ANSWER")
    i = 1
    for record in tqdm(answer):
        try:
            response = answer_generation_chain.invoke({"question":record["question"],"context":record["context"]})
        except Exception as e:
            print(f"Exception at {i} {e}")
            i=i+1
            continue
        record["ground_truth"] = response
        i=i+1
    
    print(f"evaluator.py log >>> COMPLETE GENERATING ANSWER")
    return answer

In [18]:
import llm_connector as myllm
answer_llm = myllm.connectLLM("OLLAMA_LLAMA3.1")
question_ans_context = generate_answer(answer_llm,question_context)

evaluator.py log >>> START GENERATING ANSWER


100%|██████████| 89/89 [00:47<00:00,  1.87it/s]

evaluator.py log >>> COMPLETE GENERATING ANSWER





In [19]:
i = 1
for q in question_ans_context:
    print(f"{i}, {q["ground_truth"]}")
    i=i+1

1, Answer: information retrieval limitations and reliance on LLMs.
2, Answer: The key considerations when engineering a Retrieval Augmented Generation (RAG) system include software engineering research on the challenges faced during implementation, such as performance with long text and hallucinations in large language models, as well as the design decisions around chunking documents, choosing embedding strategies, and re-ranking retrieved documents.
3, Answer: Implementation of a Retrieval Augmented Generation system requires customising multiple prompts to process questions and answers, ensuring that questions relevant for the domain are returned.
4, Answer: FP1 Missing Content The first fail case is when asking a question that cannot be answered from the available documents.
5, Answer: RAG systems are more cost-effective than fine-tuning a large language model (LLM), especially when dealing with concurrent users due to rate limits.
6, Answer: Developing a reliable evaluation methodo

### Test the RAG

In [20]:
from langchain.output_parsers import ResponseSchema
#from langchain.output_parsers import StructuredOutputParser
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from tqdm import tqdm

import pandas as pd
from operator import itemgetter
from datasets import Dataset

import prompt_collection as myprompt
def test_rag_pipeline(rag_pipeline, testset_ds):
    i = 1
    test_outcome_list = []
    print(f"evaluator.py log >>> Start testing with on {len(testset_ds)} question")

    for row in tqdm(testset_ds):
        question = row["question"]
        answer = rag_pipeline.invoke(question)
#        print(f"Question {i} : {question} ")
#        print(f"answer {i} : {answer} ")
        test_outcome_list.append(
            {
                "question" : question,
                "answer" : answer,
                "contexts" : [doc.page_content for doc in rag_pipeline.vectordb.invoke(question)],
                "ground_truth" : row["ground_truth"]
            }
        )
        i= i+1
    test_outcome_ds = Dataset.from_pandas(pd.DataFrame(test_outcome_list))
    print(f"evaluator.py log >>> End testing with {len(test_outcome_ds)} answers on {len(testset_ds)} question")
    return test_outcome_list

In [21]:
import os
from dotenv import load_dotenv
load_dotenv()
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
tempVDB = Chroma(persist_directory=os.path.join(os.getenv("VECTORDB_OPENAI_EM"),"RAG_for_LLM"), embedding_function=OpenAIEmbeddings())
import Agent
import prompt_collection as p

rag2 = Agent.RAGAgent(
    name = "RAG 2 - Simple RAG",
    model = Agent.OLLAMA_LLAMA3_1,
    vectordb_name="CHROMA_OPENAI_RAG_FOR_LLM",
    rag_type= "SIMPLE_QUESTION_ANSWER_RAG"
)

In [22]:
test_outcome_list = test_rag_pipeline(rag2, question_ans_context)

evaluator.py log >>> Start testing with on 89 question


100%|██████████| 89/89 [02:40<00:00,  1.81s/it]

evaluator.py log >>> End testing with 89 answers on 89 question





In [23]:
i = 1
for q in test_outcome_list:
    print(f"{i}, {q["answer"]}")
    i=i+1

1, I don't know.
2, According to Document 1 on page 0, the key considerations when engineering a Retrieval Augmented Generation (RAG) system include:

* Privacy/security of data
* Scalability
* Cost
* Skills required

Note that these are mentioned in the text as factors related to RAG systems, but not necessarily as specific "key considerations" when engineering such a system. However, they are relevant points to consider.

In Document 2 on page 2, there is a figure (Figure 1) showing the indexing and query processes required for creating a RAG system, with failure points identified in red boxes. This suggests that one key consideration when engineering a RAG system is to avoid or mitigate these potential failure points.

Overall, while the documents do not provide an exhaustive list of key considerations when engineering a RAG system, they highlight several important factors to consider and potential pitfalls to avoid.
3, Based on the provided context, the key challenge that arises wh

### Evaluate the test result

In [24]:
evaluate_answer_relevancy_template = """
You are Teaching Assistant. Your task is to evaluate the student answer for the test question. You are also given Professor's answer as reference. 
Your task is to provide a 'total rating' representing how close student answer is to the Professor's answer.
Give your rating on a scale of 1 to 10, where 1 means that the question is not close at all, and 10 means that the question is extremely close.

Provide your rating as follows:

Total rating: (your rating, as a float number between 1 and 10)

Now here are the question, the student answer and the Professor's answer.

Question: {question}

Student Answer: {answer}

Professor's answer: {ground_truth}

"""

In [25]:
evaluate_faithfulness_template = """
You are Teaching Assistant. Your task is to evaluate student answer for test question. You are also given the lesson material as reference. 
Your task is to provide a 'total rating' representing how close the student answer ground to the reference.
Give your rating on a scale of 1 to 10, where 1 means that the question is not grounded to the reference at all, and 10 means that the question is absolutely grounded to the reference.

Provide your rating as follows:

Total rating: (your rating, as a float number between 1 and 10)

Now here are the question, the student answer and the reference.

Question: {question}

Student Answer: {answer}

Reference: {contexts}

"""

In [27]:
from langchain.output_parsers import ResponseSchema
#from langchain.output_parsers import StructuredOutputParser
from langchain_core.output_parsers import StrOutputParser
#from langchain.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from tqdm import tqdm

import pandas as pd
from operator import itemgetter
from datasets import Dataset

import prompt_collection as myprompt

def evaluate_by_metric(critic_llm, test_outcome_list, metric = "answer_relevancy"):
    # How relevant the answer to the question, in the other word, how close the answer to the ground truth
    if metric == "answer_relevancy": 
        eval_output_parser = StrOutputParser() #StructuredOutputParser.from_response_schemas(answer_response_schemas)
        #setup = RunnableParallel(question = RunnablePassthrough(), context=RunnablePassthrough())

        prompt = ChatPromptTemplate.from_template(evaluate_answer_relevancy_template)

        eval_chain = (
            {"question": itemgetter("question"), "answer": itemgetter("answer"), "ground_truth": itemgetter("ground_truth") }
            | prompt 
            | critic_llm 
            | eval_output_parser
        )

        i = 1
        print("evaluator.py log >>> start evaluating answer_relevancy")
        eval_list = []
        for record in tqdm(test_outcome_list):
#            print(f"Question {i} : {record["question"]}")
#            print(f"answer {i} : {record["answer"]}")
#            print(f"ground_truth {i} : {record["ground_truth"]}")
            try:
                response = eval_chain.invoke({"question":record["question"],"answer":record["answer"],"ground_truth":record["ground_truth"]})
            except Exception as e:
                print(f"Exception at {i} {e}")
                i=i+1
                continue
            record["answer_relevancy"] = response
            
#            print(f"answer_relevancy {i} : {record["answer_relevancy"]}")

            """            
            eval_list.append(
                {
                    "question":record["question"],
                    "answer":record["answer"],
                    "ground_truth":record["ground_truth"],
                    "contexts":record["contexts"],
                    "answer_relevancy" : record["answer_relevancy"]
                }
            )"""

            i=i+1
        print("evaluator.py log >>> end evaluating answer_relevancy")
    # How relevant the answer to the question, in the other word, how close the answer to the ground truth
    if metric == "faithfulness": 
        eval_output_parser = StrOutputParser() #StructuredOutputParser.from_response_schemas(answer_response_schemas)
        #setup = RunnableParallel(question = RunnablePassthrough(), context=RunnablePassthrough())

        prompt = ChatPromptTemplate.from_template(evaluate_faithfulness_template)

        eval_chain = (
            {"question": itemgetter("question"), "answer": itemgetter("answer"), "contexts": itemgetter("contexts") }
            | prompt 
            | critic_llm 
            | eval_output_parser
        )

        i = 1
        print("evaluator.py log >>> start evaluating faithfulness")
        eval_list = []
        for record in tqdm(test_outcome_list):
#            print(f"Question {i} : {record["question"]}")
#            print(f"answer {i} : {record["answer"]}")
#            print(f"ground_truth {i} : {record["ground_truth"]}")
            try:
                response = eval_chain.invoke({"question":record["question"],"answer":record["answer"],"contexts":record["contexts"]})
            except Exception as e:
                print(f"Exception at {i} {e}")
                i=i+1
                continue
            record["faithfulness"] = response
            
#            print(f"faithfulness {i} : {record["faithfulness"]}")
            i=i+1
        print("evaluator.py log >>> start evaluating faithfulness")
    return test_outcome_list # Dataset.from_pandas(pd.DataFrame(eval_list))

In [28]:
evaluate_llm = myllm.connectLLM("OLLAMA_LLAMA3.1")
test_outcome_list = evaluate_by_metric(evaluate_llm,test_outcome_list,"answer_relevancy")

evaluator.py log >>> start evaluating answer_relevancy


100%|██████████| 89/89 [01:32<00:00,  1.04s/it]

evaluator.py log >>> end evaluating answer_relevancy





In [29]:
i = 1
for q in test_outcome_list:
    print(f"{i}, {q["answer_relevancy"]}")
    i=i+1

1, Total rating: 2.0

The student's answer is very brief and lacks any attempt to address the question, indicating a lack of understanding of the topic. In contrast, the Professor's answer provides specific details about the primary limitations of RAG systems, making it a much more complete and accurate response. The student's answer would not receive any points if graded on this assignment.
2, Total rating: 4.5

The student answer provides some relevant considerations for engineering a RAG system, but they are not directly related to the key challenges and design decisions mentioned in the Professor's answer. The student answer touches on general factors such as privacy/security, scalability, cost, and skills required, which are not specific enough to be considered key considerations when engineering a RAG system. While the student answer does mention potential failure points to avoid, it does not address the more critical aspects of software engineering research that are highlighted 

In [30]:
evaluate_llm = myllm.connectLLM("OLLAMA_LLAMA3.1")
test_outcome_list = evaluate_by_metric(evaluate_llm,test_outcome_list,"faithfulness")

evaluator.py log >>> start evaluating faithfulness


100%|██████████| 89/89 [02:05<00:00,  1.41s/it]

evaluator.py log >>> start evaluating faithfulness





In [31]:
i = 1
for q in test_outcome_list:
    print(f"{i}, {q["faithfulness"]}")
    i=i+1

1, Total rating: 0.5 

The student answer "I don't know" does not provide any information about the primary limitations that software engineers must address when designing a Retrieval Augmented Generation (RAG) system. The reference material provided gives some context and possible answers to this question, but since the student answer is simply "I don't know", it's not grounded at all to the reference.
2, Total rating: 6.5 

The student answer touches on some relevant points to consider when engineering a RAG system, such as privacy/security of data and scalability. However, it does not fully capture the key considerations mentioned in the reference, which includes related work on RAG systems and potential failure points to avoid. The answer also incorrectly implies that Document 1 on page 0 provides an exhaustive list of key considerations when engineering a RAG system, whereas the reference suggests that there is more to consider beyond just privacy/security, scalability, cost, and 

### Extract the rating

In [32]:
grading_template = """
You are Teaching Assistant. Your task is to extract grade from Professor's comments to student answer. 
You are given some examples of comments for you task. Your answer is ONLY the grade between 1 and 10.

Comment: Total rating: 10.0. The student answer is an exact quote from the reference, which clearly states that Spearman's correlation coefficient was used to calculate the relationship between human-evaluated document relatedness scores and the embedding correlation coefficients for each language model. The student answer matches the reference perfectly, with no deviations or inaccuracies. Therefore, a rating of 10 out of 10 is justified. 
Grade: 10.0

Comment: Total rating: 9.5. The student answer accurately captures the essence of the reference material, correctly interpreting the strong positive correlation of CM_EN as indicating a robust alignment with human judgment in the context of Chinese Medicine. The student also mentions that this implies the model has captured meaningful relationships between documents, which can be used to inform decisions or generate relevant content in the domain of CM.
Grade: 9.5

Comment: Total rating: 8.5
The student answer correctly identifies that a directive is given to the generative model based on GPT-3.5-turbo-16k to minimize hallucination in its response, and mentions the prompt containing this directive. However, it does not accurately cite the specific reference from Document 3, page 7, as mentioned in the student answer. The correct statement is actually found in the Reference material, which states that an alternative prompt without a reference section is passed to a GPT-3.5-turbo-based model to reduce token usage and save on expenses.
Grade: 8.5

Comment: Total rating: 2.0 
The student's answer "I don't know" does not provide any insight into the specific functional limitations of conventional Retrieval-Augmented Generation (RAG) methods for niche domains or how these shortcomings affect their performance. The reference provided, on the other hand, discusses various challenges and considerations associated with RAG systems, including data privacy, scalability, cost, skills required, etc. This suggests a significant gap in understanding between the student's response and the material covered in the lesson.
Grade: 2.0

Comment: Total rating: 4.2
The student answer correctly identifies two of the seven failure points for designing a RAG system (validation during operation and reliance on LLMs). However, they incorrectly infer that these are the only two failure points discussed in the provided snippet, when in fact the reference provides more specific information about the other five failure points. The student's answer also does not fully capture the context of the document and the lessons learned from the case studies. Therefore, while the answer shows some understanding of the topic, it falls short of providing a complete and accurate response.
Grade:4.2

Comment: {comment}
Grade: 

"""

In [33]:
from langchain.output_parsers import ResponseSchema
#from langchain.output_parsers import StructuredOutputParser
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from tqdm import tqdm

import pandas as pd
from operator import itemgetter
from datasets import Dataset

import prompt_collection as myprompt

def grading(grading_llm, test_outcome_list):
    grading_output_parser = StrOutputParser() 
    prompt = ChatPromptTemplate.from_template(grading_template)

    grading_chain = (
        {"comment": itemgetter("comment")}
        | prompt 
        | grading_llm 
        | grading_output_parser
    )

    #### GRADING RELEVANCY ####
    print(f"evaluator.py log >>> START GRADING RELEVANCY")
    i = 1
    for record in tqdm(test_outcome_list):
        try:
            response = grading_chain.invoke({"comment":record["answer_relevancy"]})
            response = float(response)
        except Exception as e:
            print(f"Exception at {i} {e}")
            i=i+1
            continue
        record["answer_relevancy_grade"] = response
        i=i+1
    
    print(f"evaluator.py log >>> COMPLETE GRADING RELEVANCY")

    #### GRADING FAITHFULNESS ####
    print(f"evaluator.py log >>> START GRADING FAITHFULNESS")
    i = 1
    for record in tqdm(test_outcome_list):
        try:
            response = grading_chain.invoke({"comment":record["faithfulness"]})
            response = float(response)
        except Exception as e:
            print(f"Exception at {i} {e}")
            i=i+1
            continue
        record["faithfulness_grade"] = response
        i=i+1
    
    print(f"evaluator.py log >>> COMPLETE GRADING FAITHFULNESS")

    return test_outcome_list

In [34]:
grading_llm = myllm.connectLLM("GPT_3_5_TURBO") 
test_outcome_list = grading(grading_llm,test_outcome_list)

evaluator.py log >>> START GRADING RELEVANCY


100%|██████████| 89/89 [00:36<00:00,  2.41it/s]


evaluator.py log >>> COMPLETE GRADING RELEVANCY
evaluator.py log >>> START GRADING FAITHFULNESS


100%|██████████| 89/89 [00:34<00:00,  2.60it/s]

evaluator.py log >>> COMPLETE GRADING FAITHFULNESS





In [35]:
i = 1
for q in test_outcome_list:
    print(f"{i}, {q["faithfulness_grade"]}, {q["answer_relevancy_grade"]}")
    i=i+1

1, 0.5, 2.0
2, 6.5, 4.5
3, 7.0, 2.5
4, 8.0, 2.0
5, 6.0, 2.0
6, 6.0, 0.0
7, 9.5, 9.5
8, 7.5, 6.5
9, 8.5, 10.0
10, 8.5, 9.0
11, 8.0, 6.5
12, 9.2, 6.5
13, 9.2, 9.0
14, 9.0, 1.0
15, 8.5, 2.5
16, 8.0, 8.0
17, 8.5, 6.0
18, 8.5, 8.0
19, 9.0, 9.5
20, 9.5, 6.0
21, 9.5, 8.4
22, 9.0, 9.0
23, 8.0, 8.0
24, 8.5, 8.5
25, 8.5, 9.5
26, 9.5, 8.5
27, 2.0, 0.0
28, 6.0, 8.0
29, 8.5, 4.0
30, 2.0, 2.0
31, 9.5, 2.0
32, 2.0, 7.0
33, 9.5, 8.5
34, 9.0, 9.0
35, 6.5, 4.0
36, 9.0, 8.5
37, 9.0, 2.0
38, 8.5, 9.0
39, 8.5, 6.5
40, 8.5, 6.5
41, 9.5, 2.0
42, 8.5, 6.5
43, 8.5, 2.0
44, 1.0, 9.0
45, 8.0, 8.5
46, 8.5, 0.5
47, 9.5, 8.0
48, 8.0, 2.5
49, 9.0, 9.0
50, 8.5, 4.2
51, 2.0, 6.0
52, 9.5, 7.0
53, 9.0, 0.5
54, 9.5, 8.5
55, 8.5, 2.5
56, 7.5, 2.5
57, 8.0, 4.0
58, 9.0, 8.5
59, 7.0, 7.5
60, 9.2, 8.5
61, 8.0, 8.5
62, 0.5, 1.0
63, 8.5, 2.5
64, 8.5, 2.5
65, 2.0, 8.0
66, 8.5, 7.5
67, 8.3, 7.0
68, 8.5, 8.0
69, 8.5, 4.0
70, 8.5, 2.0
71, 6.0, 6.0
72, 8.5, 8.0
73, 8.0, 4.0
74, 8.5, 8.0
75, 9.0, 8.0
76, 0.0, 2.0
77, 9.2, 6.5
78, 2.0

In [56]:
from tqdm import tqdm
def grade_calculator(test_outcome_list):
    overall_grade = {"answer_relevancy":0.0,
                     "faithfulness" : 0.0}
    answer_relevancy = 0.0
    faithfulness = 0.0
    i = 0
    for grade in test_outcome_list:
        try:
            answer_relevancy = answer_relevancy + grade["answer_relevancy_grade"]
        except Exception as e:
            print(f"Exception at {i} {e}")
            i=i+1
            continue
        i=i+1
    i = 0
    for grade in test_outcome_list:
        try:
            faithfulness = faithfulness + grade["faithfulness_grade"]
        except Exception as e:
            print(f"Exception at {i} {e}")
            i=i+1
            continue
        i=i+1
    answer_relevancy = answer_relevancy / len(test_outcome_list)
    faithfulness = faithfulness / len(test_outcome_list)
    overall_grade["answer_relevancy"] = answer_relevancy
    overall_grade["faithfulness"] = faithfulness
    return overall_grade


In [40]:
grade_calculator(test_outcome_list=test_outcome_list)

{'answer_relevancy': 5.846067415730337, 'faithfulness': 7.526966292134833}

# RAGs Comparison

### Setup various embedding methods:
- OpenAI
- MiniLM
- Hugging Face
- Ollama

In [None]:
import os
import knowledgebase_manager as km
from dotenv import load_dotenv
import document_handler as dc
load_dotenv()    
directory_path = os.path.join(os.getenv("DOC_ARVIX"),"RAG_for_LLM") 

pdf_documents = dc.load_directory(directory_path,"pdf")

# MiniLM embedding VectorDB
minilm_vdb = km.VectorBD(
    vectordb_name = km.CHROMA_MINILM_RAG_FOR_LLM
)

minilm_vdb.vectorizing(pdf_documents)

# Hugging Face embedding VectorDB
hf_vdb = km.VectorBD(
    vectordb_name = km.CHROMA_HF_RAG_FOR_LLM
)
hf_vdb.vectorizing(pdf_documents)

# Ollama embedding VectorDB

ollama_vdb = km.VectorBD(
    vectordb_name = km.CHROMA_HF_RAG_FOR_LLM
)
ollama_vdb.vectorizing(pdf_documents)

### Create various RAGs 

- Using different embedding methods: OpewnAI, MiniLM, HuggingFace, Ollama (Llama3)
- Using different LMs: GPT3.5, GPT4, Llama3, Llama3.1 

In [42]:
import os
import knowledgebase_manager as km
from dotenv import load_dotenv
import document_handler as dc
load_dotenv()    
import Agent as myagent

rag1_openai_gpt3_5 = myagent.RAGAgent(
    name = "RAG 1 - OpenAI Embedding - GPT3.5",
    model = myagent.GPT_3_5_TURBO,
    vectordb_name="CHROMA_OPENAI_RAG_FOR_LLM",
    rag_type= "SIMPLE_QUESTION_ANSWER_RAG"
)

rag2_openai_gpt4 = myagent.RAGAgent(
    name = "RAG 2 - OpenAI Embedding - GPT4",
    model = myagent.GPT_4_PREVIEW,
    vectordb_name="CHROMA_OPENAI_RAG_FOR_LLM",
    rag_type= "SIMPLE_QUESTION_ANSWER_RAG"
)

rag3_openai_llama3_1 = myagent.RAGAgent(
    name = "RAG 3 - OpenAI Embedding - Llama3.1",
    model = myagent.OLLAMA_LLAMA3_1,
    vectordb_name="CHROMA_OPENAI_RAG_FOR_LLM",
    rag_type= "SIMPLE_QUESTION_ANSWER_RAG"
)

rag4_hf_llama3_1 = myagent.RAGAgent(
    name = "RAG 4 - HuggingFace Embedding - Llama3.1",
    model = myagent.OLLAMA_LLAMA3_1,
    vectordb_name=km.CHROMA_HF_RAG_FOR_LLM,
    rag_type= "SIMPLE_QUESTION_ANSWER_RAG"
)

rag5_ollama_llama3_1 = myagent.RAGAgent(
    name = "RAG 5 - Ollama Embedding - Llama3.1",
    model = myagent.OLLAMA_LLAMA3_1,
    vectordb_name=km.CHROMA_OLLAMA_RAG_FOR_LLM,
    rag_type= "SIMPLE_QUESTION_ANSWER_RAG"
)

rag6_minilm_llama3_1 = myagent.RAGAgent(
    name = "RAG 6 - MiniLM Embedding - Llama3.1",
    model = myagent.OLLAMA_LLAMA3_1,
    vectordb_name=km.CHROMA_MINILM_RAG_FOR_LLM,
    rag_type= "SIMPLE_QUESTION_ANSWER_RAG"
)

  warn_deprecated(


### Evaluate rag1_openai_gpt3_5

In [None]:
test_outcome_list_1 = test_rag_pipeline(rag1_openai_gpt3_5, question_ans_context)

evaluate_llm = myllm.connectLLM("OLLAMA_LLAMA3.1")
test_outcome_list_1 = evaluate_by_metric(evaluate_llm,test_outcome_list_1,"answer_relevancy")
test_outcome_list_1 = evaluate_by_metric(evaluate_llm,test_outcome_list_1,"faithfulness")
grading_llm = myllm.connectLLM("GPT_3_5_TURBO") 
test_outcome_list_1 = grading(grading_llm,test_outcome_list_1)
rating1 = grade_calculator(test_outcome_list=test_outcome_list_1)

In [62]:
rating1

{'answer_relevancy': 4.968539325842697, 'faithfulness': 7.056179775280899}

### Evaluate rag2_openai_gpt4

In [63]:
test_outcome_list_2 = test_rag_pipeline(rag2_openai_gpt4, question_ans_context)

evaluate_llm = myllm.connectLLM("OLLAMA_LLAMA3.1")
test_outcome_list_2 = evaluate_by_metric(evaluate_llm,test_outcome_list_2,"answer_relevancy")
test_outcome_list_2 = evaluate_by_metric(evaluate_llm,test_outcome_list_2,"faithfulness")
grading_llm = myllm.connectLLM("GPT_3_5_TURBO") 
test_outcome_list_2 = grading(grading_llm,test_outcome_list_2)
rating2 = grade_calculator(test_outcome_list=test_outcome_list_2)

evaluator.py log >>> Start testing with on 89 question


100%|██████████| 89/89 [04:39<00:00,  3.14s/it]


evaluator.py log >>> End testing with 89 answers on 89 question
evaluator.py log >>> start evaluating answer_relevancy


100%|██████████| 89/89 [01:40<00:00,  1.13s/it]


evaluator.py log >>> end evaluating answer_relevancy
evaluator.py log >>> start evaluating faithfulness


100%|██████████| 89/89 [02:13<00:00,  1.50s/it]


evaluator.py log >>> start evaluating faithfulness
evaluator.py log >>> START GRADING RELEVANCY


100%|██████████| 89/89 [00:31<00:00,  2.81it/s]


evaluator.py log >>> COMPLETE GRADING RELEVANCY
evaluator.py log >>> START GRADING FAITHFULNESS


100%|██████████| 89/89 [00:33<00:00,  2.69it/s]

evaluator.py log >>> COMPLETE GRADING FAITHFULNESS





In [64]:
rating2

{'answer_relevancy': 5.310112359550561, 'faithfulness': 6.695505617977528}

### Evaluate rag3_openai_llama3_1

In [65]:
test_outcome_list_3 = test_rag_pipeline(rag3_openai_llama3_1, question_ans_context)

evaluate_llm = myllm.connectLLM("OLLAMA_LLAMA3.1")
test_outcome_list_3 = evaluate_by_metric(evaluate_llm,test_outcome_list_3,"answer_relevancy")
test_outcome_list_3 = evaluate_by_metric(evaluate_llm,test_outcome_list_3,"faithfulness")
grading_llm = myllm.connectLLM("GPT_3_5_TURBO") 
test_outcome_list_3 = grading(grading_llm,test_outcome_list_3)
rating3 = grade_calculator(test_outcome_list=test_outcome_list_3)

evaluator.py log >>> Start testing with on 89 question


100%|██████████| 89/89 [02:32<00:00,  1.72s/it]


evaluator.py log >>> End testing with 89 answers on 89 question
evaluator.py log >>> start evaluating answer_relevancy


100%|██████████| 89/89 [01:40<00:00,  1.13s/it]


evaluator.py log >>> end evaluating answer_relevancy
evaluator.py log >>> start evaluating faithfulness


100%|██████████| 89/89 [02:14<00:00,  1.51s/it]


evaluator.py log >>> start evaluating faithfulness
evaluator.py log >>> START GRADING RELEVANCY


100%|██████████| 89/89 [00:38<00:00,  2.30it/s]


evaluator.py log >>> COMPLETE GRADING RELEVANCY
evaluator.py log >>> START GRADING FAITHFULNESS


100%|██████████| 89/89 [00:34<00:00,  2.61it/s]

evaluator.py log >>> COMPLETE GRADING FAITHFULNESS





In [66]:
rating3

{'answer_relevancy': 5.788764044943821, 'faithfulness': 7.543820224719101}

### Evaluate rag4_hf_llama3_1

In [None]:
test_outcome_list_4 = test_rag_pipeline(rag4_hf_llama3_1, question_ans_context)

evaluate_llm = myllm.connectLLM("OLLAMA_LLAMA3.1")
test_outcome_list_4 = evaluate_by_metric(evaluate_llm,test_outcome_list_4,"answer_relevancy")
test_outcome_list_4 = evaluate_by_metric(evaluate_llm,test_outcome_list_4,"faithfulness")
grading_llm = myllm.connectLLM("GPT_3_5_TURBO") 
test_outcome_list_4 = grading(grading_llm,test_outcome_list_4)
rating4 = grade_calculator(test_outcome_list=test_outcome_list_4)

In [68]:
rating4

{'answer_relevancy': 4.615730337078651, 'faithfulness': 1.440449438202247}

### Evaluate rag5_ollama_llama3_1

In [69]:
test_outcome_list_5 = test_rag_pipeline(rag5_ollama_llama3_1, question_ans_context)

evaluate_llm = myllm.connectLLM("OLLAMA_LLAMA3.1")
test_outcome_list_5 = evaluate_by_metric(evaluate_llm,test_outcome_list_5,"answer_relevancy")
test_outcome_list_5 = evaluate_by_metric(evaluate_llm,test_outcome_list_5,"faithfulness")
grading_llm = myllm.connectLLM("GPT_3_5_TURBO") 
test_outcome_list_5 = grading(grading_llm,test_outcome_list_5)
rating5 = grade_calculator(test_outcome_list=test_outcome_list_5)

evaluator.py log >>> Start testing with on 89 question


100%|██████████| 89/89 [06:46<00:00,  4.57s/it]


evaluator.py log >>> End testing with 89 answers on 89 question
evaluator.py log >>> start evaluating answer_relevancy


100%|██████████| 89/89 [00:54<00:00,  1.63it/s]


evaluator.py log >>> end evaluating answer_relevancy
evaluator.py log >>> start evaluating faithfulness


100%|██████████| 89/89 [00:41<00:00,  2.14it/s]


evaluator.py log >>> start evaluating faithfulness
evaluator.py log >>> START GRADING RELEVANCY


100%|██████████| 89/89 [00:31<00:00,  2.78it/s]


evaluator.py log >>> COMPLETE GRADING RELEVANCY
evaluator.py log >>> START GRADING FAITHFULNESS


 10%|█         | 9/89 [00:03<00:30,  2.66it/s]

Exception at 9 could not convert string to float: 'I cannot provide a rating for this answer.'


 71%|███████   | 63/89 [00:21<00:09,  2.73it/s]

Exception at 63 could not convert string to float: 'Since a grade cannot be extracted from the last comment provided by the professor, no grade can be given.'


100%|██████████| 89/89 [00:30<00:00,  2.92it/s]

evaluator.py log >>> COMPLETE GRADING FAITHFULNESS
Exception at 8 'faithfulness_grade'
Exception at 62 'faithfulness_grade'





In [70]:
rating5

{'answer_relevancy': 1.7865168539325842, 'faithfulness': 0.7247191011235955}

### Evaluate rag6_minilm_llama3_1

In [71]:
test_outcome_list_6 = test_rag_pipeline(rag6_minilm_llama3_1, question_ans_context)

evaluate_llm = myllm.connectLLM("OLLAMA_LLAMA3.1")
test_outcome_list_5 = evaluate_by_metric(evaluate_llm,test_outcome_list_5,"answer_relevancy")
test_outcome_list_5 = evaluate_by_metric(evaluate_llm,test_outcome_list_5,"faithfulness")
grading_llm = myllm.connectLLM("GPT_3_5_TURBO") 
test_outcome_list_5 = grading(grading_llm,test_outcome_list_5)
rating6 = grade_calculator(test_outcome_list=test_outcome_list_5)

evaluator.py log >>> Start testing with on 89 question


100%|██████████| 89/89 [03:29<00:00,  2.36s/it]


evaluator.py log >>> End testing with 89 answers on 89 question
evaluator.py log >>> start evaluating answer_relevancy


100%|██████████| 89/89 [00:57<00:00,  1.56it/s]


evaluator.py log >>> end evaluating answer_relevancy
evaluator.py log >>> start evaluating faithfulness


100%|██████████| 89/89 [00:42<00:00,  2.09it/s]


evaluator.py log >>> start evaluating faithfulness
evaluator.py log >>> START GRADING RELEVANCY


 55%|█████▌    | 49/89 [00:19<00:17,  2.34it/s]

Exception at 49 could not convert string to float: 'Grade: 0'


100%|██████████| 89/89 [00:33<00:00,  2.68it/s]


evaluator.py log >>> COMPLETE GRADING RELEVANCY
evaluator.py log >>> START GRADING FAITHFULNESS


 13%|█▎        | 12/89 [00:04<00:28,  2.69it/s]

Exception at 12 could not convert string to float: 'I cannot provide a rating for this student answer.'


100%|██████████| 89/89 [00:34<00:00,  2.60it/s]

evaluator.py log >>> COMPLETE GRADING FAITHFULNESS





In [None]:
rating6

### Summary Comparison

In [75]:
import pandas as pd
data = [rating1, rating2, rating3, rating4, rating5, rating6]
index = [rag1_openai_gpt3_5.name, rag2_openai_gpt4.name, rag3_openai_llama3_1.name, 
         rag4_hf_llama3_1.name, rag5_ollama_llama3_1.name, rag6_minilm_llama3_1.name]
df = pd.DataFrame(data, index = index)

print(df)

                                          answer_relevancy  faithfulness
RAG 1 - OpenAI Embedding - GPT3.5                 4.968539      7.056180
RAG 2 - OpenAI Embedding - GPT4                   5.310112      6.695506
RAG 3 - OpenAI Embedding - Llama3.1               5.788764      7.543820
RAG 4 - HuggingFace Embedding - Llama3.1          4.615730      1.440449
RAG 5 - Ollama Embedding - Llama3.1               1.786517      0.724719
RAG 6 - MiniLM Embedding - Llama3.1               1.769663      0.724719


The OpenAI embedding is the best in retrieving relevant documents, constributes to the high answer relevancy and faithfulness among RAG chains
While, Llama3.1 supersedes both GPT3.5 and 4 in both answer relevancy and faithfulness

# Education Use Cases

### Use Case 1 - Intelligent Tutoring 

Assume that we have embedded all teaching materials in the vector database - in this demo, we use __openai_embedding\RAG_for_LLM__
In this use case, we build a RAG chain to answer to student's questions and recommend further reading or studying to student

##### Setup AI Tutor (RAG Chain)

In [3]:
import os
from dotenv import load_dotenv
load_dotenv()

from langchain_openai.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_ollama.chat_models import ChatOllama

db_directory = os.getenv("VECTORDB_OPENAI_EM")
db_directory = os.path.join(db_directory,"RAG_for_LLM")
embeddings = OpenAIEmbeddings()
vectordb = Chroma(persist_directory=db_directory, embedding_function=embeddings)
retriever = vectordb.as_retriever()

from langchain.prompts import ChatPromptTemplate

template = """
You are the tutor. Your task is to answer student's question based on the context below. 
If you can't answer the question, ask for clarification or reply "Sorry, I don't know".

Context: {context}

Question: {question}
"""

setup = RunnableParallel(context=retriever, question=RunnablePassthrough())

prompt = ChatPromptTemplate.from_template(template)

model = ChatOllama(model="llama3.1")

parser = StrOutputParser()

chain = setup | prompt | model | parser

##### Q&A with AI Tutor

In [4]:
student_question = input("What can I help you today? ")
while True:
    print(f"Student: {student_question}")
    if student_question == "exit" or student_question == "bye" or student_question == "quit":
        print("AI Tutor: Bye")
        break
    response = chain.invoke(student_question)
    print(f"AI Tutor: {response}")

    student_question = input("Anything else? ")


Student: What is RAG?
AI Tutor: According to the documents, RAG (Retrieval-Augmented Generation) is a powerful technique that combines retrieval mechanisms with the generative capabilities of Large Language Models (LLMs). It enables the synthesis of contextually relevant, accurate, and up-to-date information by integrating document search with LLM generation.
Student: How to implement RAG?
AI Tutor: Based on the provided documents, it seems that implementing a Retrieval-Augmented Generation (RAG) system involves several steps.

According to Document 2 (`2401.05856v1.Seven_Failure_Points_When_Engineering_a_Retrieval_Augmented_Generation_System.pdf`), building a RAG system requires:

* Pre-processing domain knowledge captured as artifacts in different formats
* Storing processed information in an appropriate data store (vector database)
* Implementing or integrating the right query-artifact matching strategy
* Ranking matched artifacts
* Calling the LLMs API passing in user queries and c

### Use Case 2 - Assessment and Grading 

Based on teaching materials, we build syntheic open-answer questions and store them in database (json), we also associate each question with correct anwser and reference material which the question aims to test. The question (& related information) database is stored at __education\use_case2__
When students need assessment, the system will randomly select n questions for testing (n=3 in this demo). An LLM is requested to evaluate and give comments & grades (+ average grade) to student's answers.

##### Setup: question database, assessment comment and grading

In [11]:
from langchain.output_parsers import ResponseSchema
#from langchain.output_parsers import StructuredOutputParser
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from tqdm import tqdm

import pandas as pd
from operator import itemgetter
from datasets import Dataset

import prompt_collection as myprompt


  from .autonotebook import tqdm as notebook_tqdm


Prompt

In [5]:
question_generation_template = """
You are a University Professor creating a test for advanced students. 
Based on the given context, create a WH question that is specific to the context. 
Your question is not multiple choice question. 
Your question should be formulated in the same style as exam question. 
This means that your question MUST NOT mention something like "according to the context" or "according to the passage".
MUST NOT mention "Here is the question" or "Here is the WH question" or ""Here's the WH question"
The question MUST BE in English only. 

Provide your question as follows: 

Question: (your question)

Here is the context.

Context: {context}
"""


answer_generator_template = """

You are Teaching Assistant. Your task is to answer the question based on the context below. 
Your answer should be specific, based on concise piece of factual information from the context. 
Your answer MUST NOT mention something like "according to the passage".
If you can't answer the question, reply "I don't know".

Provide your anwser as follows: 

Answer: (your answer)

Here are the question and context

Question: {question}

Context: {context}

"""

evaluate_answer_relevancy_template = """
You are Teaching Assistant. Your task is to evaluate the student answer for the test question. You are also given Professor's answer as reference. 
Your task is to provide a 'total rating' representing how close student answer is to the Professor's answer.
Give your rating on a scale of 1 to 10, where 1 means that the question is not close at all, and 10 means that the question is extremely close.

Provide your rating as follows:

Total rating: (your rating, as a float number between 1 and 10)

Now here are the question, the student answer and the Professor's answer.

Question: {question}

Student Answer: {answer}

Professor's answer: {ground_truth}

"""

grading_template = """
You are Teaching Assistant. Your task is to extract grade from Professor's comments to student answer. 
You are given some examples of comments for you task. Your answer is ONLY the grade between 1 and 10.

Comment: Total rating: 10.0. The student answer is an exact quote from the reference, which clearly states that Spearman's correlation coefficient was used to calculate the relationship between human-evaluated document relatedness scores and the embedding correlation coefficients for each language model. The student answer matches the reference perfectly, with no deviations or inaccuracies. Therefore, a rating of 10 out of 10 is justified. 
Grade: 10.0

Comment: Total rating: 9.5. The student answer accurately captures the essence of the reference material, correctly interpreting the strong positive correlation of CM_EN as indicating a robust alignment with human judgment in the context of Chinese Medicine. The student also mentions that this implies the model has captured meaningful relationships between documents, which can be used to inform decisions or generate relevant content in the domain of CM.
Grade: 9.5

Comment: Total rating: 8.5
The student answer correctly identifies that a directive is given to the generative model based on GPT-3.5-turbo-16k to minimize hallucination in its response, and mentions the prompt containing this directive. However, it does not accurately cite the specific reference from Document 3, page 7, as mentioned in the student answer. The correct statement is actually found in the Reference material, which states that an alternative prompt without a reference section is passed to a GPT-3.5-turbo-based model to reduce token usage and save on expenses.
Grade: 8.5

Comment: Total rating: 2.0 
The student's answer "I don't know" does not provide any insight into the specific functional limitations of conventional Retrieval-Augmented Generation (RAG) methods for niche domains or how these shortcomings affect their performance. The reference provided, on the other hand, discusses various challenges and considerations associated with RAG systems, including data privacy, scalability, cost, skills required, etc. This suggests a significant gap in understanding between the student's response and the material covered in the lesson.
Grade: 2.0

Comment: Total rating: 4.2
The student answer correctly identifies two of the seven failure points for designing a RAG system (validation during operation and reliance on LLMs). However, they incorrectly infer that these are the only two failure points discussed in the provided snippet, when in fact the reference provides more specific information about the other five failure points. The student's answer also does not fully capture the context of the document and the lessons learned from the case studies. Therefore, while the answer shows some understanding of the topic, it falls short of providing a complete and accurate response.
Grade:4.2

Comment: {comment}
Grade: 

"""

Question database

In [12]:
def generate_question(generator_llm, pdf_documents, mode = ""):


    question_output_parser =  StrOutputParser() #StructuredOutputParser.from_response_schemas(question_response_schemas)
    prompt = ChatPromptTemplate.from_template(question_generation_template)
    setup = RunnableParallel(context=RunnablePassthrough())
    question_generation_chain = setup | prompt | generator_llm | question_output_parser
    question_context_list = []

    print(f"evaluator.py log >>> START GENERATING QUESTION")
    i = 1
    for text in tqdm(pdf_documents):
        try:
            response = question_generation_chain.invoke(text.page_content)
        except Exception as e:
            print(f"Exception at {i} {e}")
            i=i+1
            continue
        question_context = {"context": text.page_content, "question" : response}
#        print(f"Question {i} : {question_context["question"]}")
#        print(f"Context {i} : {question_context["context"]}")
        question_context_list.append(question_context)
        i=i+1
    print(f"evaluator.py log >>> COMPLETE GENERATING QUESTION")    
    return question_context_list

def generate_answer(answer_llm, question_context_list, mode = ""):
    answer = question_context_list
    answer_schema = ResponseSchema(
        name="answer",
        description="an answer to the question"
    )
    answer_response_schemas = [
        answer_schema,
    ]
    answer_output_parser = StrOutputParser() #StructuredOutputParser.from_response_schemas(answer_response_schemas)
    #setup = RunnableParallel(question = RunnablePassthrough(), context=RunnablePassthrough())

    prompt = ChatPromptTemplate.from_template(answer_generator_template)

    answer_generation_chain = (
        {"question": itemgetter("question"), "context": itemgetter("context") }
        | prompt 
        | answer_llm 
        | answer_output_parser
    )
    print(f"evaluator.py log >>> START GENERATING ANSWER")
    i = 1
    for record in tqdm(answer):
        try:
            response = answer_generation_chain.invoke({"question":record["question"],"context":record["context"]})
        except Exception as e:
            print(f"Exception at {i} {e}")
            i=i+1
            continue
        record["ground_truth"] = response
        i=i+1
    
    print(f"evaluator.py log >>> COMPLETE GENERATING ANSWER")
    return answer

In [13]:
import os
from dotenv import load_dotenv
import document_handler as dc
import llm_connector as myllm

load_dotenv()    

directory_path = os.path.join(os.getenv("DOC_ARVIX"),"RAG_for_LLM") 

pdf_documents = dc.load_directory(directory_path,"pdf")

generator_llm = myllm.connectLLM("OLLAMA_LLAMA3.1")

question_context = generate_question(generator_llm, pdf_documents)

answer_llm = myllm.connectLLM("OLLAMA_LLAMA3.1")
question_ans_context = generate_answer(answer_llm,question_context)

evaluator.py log >>> START GENERATING QUESTION


100%|██████████| 89/89 [00:56<00:00,  1.58it/s]


evaluator.py log >>> COMPLETE GENERATING QUESTION
evaluator.py log >>> START GENERATING ANSWER


100%|██████████| 89/89 [00:47<00:00,  1.88it/s]

evaluator.py log >>> COMPLETE GENERATING ANSWER





Evaluation and Grading

In [29]:
def evaluate_by_metric(critic_llm, test_outcome_list, metric = "answer_relevancy"):
    # How relevant the answer to the question, in the other word, how close the answer to the ground truth
    if metric == "answer_relevancy": 
        eval_output_parser = StrOutputParser() #StructuredOutputParser.from_response_schemas(answer_response_schemas)
        #setup = RunnableParallel(question = RunnablePassthrough(), context=RunnablePassthrough())

        prompt = ChatPromptTemplate.from_template(evaluate_answer_relevancy_template)

        eval_chain = (
            {"question": itemgetter("question"), "answer": itemgetter("answer"), "ground_truth": itemgetter("ground_truth") }
            | prompt 
            | critic_llm 
            | eval_output_parser
        )

        i = 1
#        print("evaluator.py log >>> start evaluating answer_relevancy")
        eval_list = []
        for record in test_outcome_list:
#            print(f"Question {i} : {record["question"]}")
#            print(f"answer {i} : {record["answer"]}")
#            print(f"ground_truth {i} : {record["ground_truth"]}")
            try:
                response = eval_chain.invoke({"question":record["question"],"answer":record["answer"],"ground_truth":record["ground_truth"]})
            except Exception as e:
#                print(f"Exception at {i} {e}")
                i=i+1
                continue
            record["answer_relevancy"] = response
            
#            print(f"answer_relevancy {i} : {record["answer_relevancy"]}")

            """            
            eval_list.append(
                {
                    "question":record["question"],
                    "answer":record["answer"],
                    "ground_truth":record["ground_truth"],
                    "contexts":record["contexts"],
                    "answer_relevancy" : record["answer_relevancy"]
                }
            )"""

            i=i+1
#        print("evaluator.py log >>> end evaluating answer_relevancy")
    # How relevant the answer to the question, in the other word, how close the answer to the ground truth
    if metric == "faithfulness": 
        eval_output_parser = StrOutputParser() #StructuredOutputParser.from_response_schemas(answer_response_schemas)
        #setup = RunnableParallel(question = RunnablePassthrough(), context=RunnablePassthrough())

        prompt = ChatPromptTemplate.from_template(evaluate_faithfulness_template)

        eval_chain = (
            {"question": itemgetter("question"), "answer": itemgetter("answer"), "contexts": itemgetter("contexts") }
            | prompt 
            | critic_llm 
            | eval_output_parser
        )

        i = 1
        print("evaluator.py log >>> start evaluating faithfulness")
        eval_list = []
        for record in tqdm(test_outcome_list):
#            print(f"Question {i} : {record["question"]}")
#            print(f"answer {i} : {record["answer"]}")
#            print(f"ground_truth {i} : {record["ground_truth"]}")
            try:
                response = eval_chain.invoke({"question":record["question"],"answer":record["answer"],"contexts":record["contexts"]})
            except Exception as e:
                print(f"Exception at {i} {e}")
                i=i+1
                continue
            record["faithfulness"] = response
            
#            print(f"faithfulness {i} : {record["faithfulness"]}")
            i=i+1
        print("evaluator.py log >>> start evaluating faithfulness")
    return test_outcome_list # Dataset.from_pandas(pd.DataFrame(eval_list))

def grading(grading_llm, test_outcome_list):
    grading_output_parser = StrOutputParser() 
    prompt = ChatPromptTemplate.from_template(grading_template)

    grading_chain = (
        {"comment": itemgetter("comment")}
        | prompt 
        | grading_llm 
        | grading_output_parser
    )

    #### GRADING RELEVANCY ####
 #   print(f"evaluator.py log >>> START GRADING RELEVANCY")
    i = 1
    for record in test_outcome_list:
        try:
            response = grading_chain.invoke({"comment":record["answer_relevancy"]})
            response = float(response)
        except Exception as e:
#            print(f"Exception at {i} {e}")
            i=i+1
            continue
        record["answer_relevancy_grade"] = response
        i=i+1
    
    return test_outcome_list

def grade_calculator(test_outcome_list):
    overall_grade = {"answer_relevancy":0.0,
                     "faithfulness" : 0.0}
    answer_relevancy = 0.0
    faithfulness = 0.0
    i = 0
    for grade in test_outcome_list:
        try:
            answer_relevancy = answer_relevancy + grade["answer_relevancy_grade"]
        except Exception as e:
#            print(f"Exception at {i} {e}")
            i=i+1
            continue
        i=i+1
    answer_relevancy = answer_relevancy / len(test_outcome_list)
    overall_grade["answer_relevancy"] = answer_relevancy
    overall_grade["faithfulness"] = faithfulness
    return overall_grade

##### Execute assessment

In [30]:
import random 

#Get random 3 questions
test_questions = random.sample(question_ans_context,3)
test_answer = []
i = 1
for q in test_questions:
    question = q["question"]
    print(f"Question {i}: \n{question}")
    answer = input("Write your answer below :")
    test_answer.append(
        {
            "question" : question,
            "answer" : answer,
            "contexts" : q["context"],
            "ground_truth" : q["ground_truth"]
        }
    )
    i=i+1
evaluate_llm = myllm.connectLLM("OLLAMA_LLAMA3.1")
test_answer_comment = evaluate_by_metric(evaluate_llm,test_answer,"answer_relevancy")

grading_llm = myllm.connectLLM("GPT_3_5_TURBO") 
test_answer_grading = grading(grading_llm,test_answer_comment)
avggrade = grade_calculator(test_outcome_list=test_answer_grading)

print("\n\nThanks for taking assessment. Below is comments for your answer and grading")

i = 1
for q in test_answer_grading:
    question = q["question"]
    answer = q["answer"]
    comment = q["answer_relevancy"]
    grade = q["answer_relevancy_grade"]
    
    print(f"\nQuestion {i}:\n{question}")
    print(f"Your answer: {answer}")
    print(f"Comment:\n {comment}")
    print(f"Grade: {grade}")
    
    i=i+1

print(f"Your Average Grade is: {avggrade["answer_relevancy"]:.2f}")

Question 1: 
Question: What was the primary change made to the training experiment in the small sample size experiments compared to the full-sized data experiments?
Question 2: 
Question:

What are the primary limitations inherent to information retrieval systems that also affect Retrieval-Augmented Generation (RAG) systems?
Question 3: 
Question: What does the table suggest about the relationship between the re-ranking of search results and the retrieval of targeted versus untargeted contexts?


Thanks for taking assessment. Below is comments for your answer and grading

Question 1:
Question: What was the primary change made to the training experiment in the small sample size experiments compared to the full-sized data experiments?
Your answer: fewshot in context training
Comment:
 Total rating: 2.0

The student answer "fewshot in context training" does not match the Professor's answer at all, which mentions an increase in the learning rate. The only similarity is the mention of "trai