# Environment Setup

### Install necessary libraries

(Optional) arxiv for searching and loading documents from arXiv

In [1]:
!pip install -q -U arxiv




RAGAS for RAG Evaluation

In [2]:
!pip install -q -U ragas




(Optional) TQDM for progress indicator

In [None]:
!pip install -q -U tqdm

GPT4ALL for Local LLM and Embedding

In [None]:
!pip install gpt4all

In [11]:
!pip install --upgrade --quiet huggingface_hub
!pip install --upgrade --quiet langchain_huggingface




### Get Environment Parameters

In [2]:
import os
from dotenv import load_dotenv
load_dotenv()

True
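
The cells below read several settings through `os.getenv`, which returns `None` when a variable is not set. The following is a minimal sketch, not part of the original notebook, that checks for missing variables up front. The helper name `missing_env_vars` is hypothetical; the variable names are the ones this notebook reads later.

```python
import os

# Variable names read later in this notebook via os.getenv
REQUIRED_VARS = ["ARVIX_DOC", "DOCUMENT", "ARXIVSTORE_GPT4ALL"]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv()` surfaces a missing `.env` entry early, instead of as a `TypeError` when a path is built from `None`.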

# Pipeline 1 - Embedding

This pipeline describes the embedding flow: loading, parsing, chunking, vectorizing, and storing.

### Step 1. Loading

In this step, we load data from various sources and make it ready to ingest.

#### Load data from Arxiv

In [3]:
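import arxiv

client = arxiv.Client()
search = arxiv.Search(
    query="ReAct for Large Language Model",
    max_results=10,
    # SubmittedDate returns the newest matches first; use
    # arxiv.SortCriterion.Relevance to rank by query relevance instead.
    sort_by=arxiv.SortCriterion.SubmittedDate,
)

# client.results() returns a generator, so materialize it once into a list.
all_results = list(client.results(search))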

In [26]:
for r in all_results:
    print(f"{r.title} {r.entry_id}")

AnyTaskTune: Advanced Domain-Specific Solutions through Task-Fine-Tuning http://arxiv.org/abs/2407.07094v1
FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation http://arxiv.org/abs/2407.07093v1
V-VIPE: Variational View Invariant Pose Embedding http://arxiv.org/abs/2407.07092v1
General Relativistic effects and the NIR variability of Sgr A* II: A systematic approach to temporal asymmetry http://arxiv.org/abs/2407.07091v1
3D Gaussian Ray Tracing: Fast Tracing of Particle Scenes http://arxiv.org/abs/2407.07090v1
Fine-Tuning Linear Layers Only Is a Simple yet Effective Way for Task Arithmetic http://arxiv.org/abs/2407.07089v1
Safe and Reliable Training of Learning-Based Aerospace Controllers http://arxiv.org/abs/2407.07088v1
CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation http://arxiv.org/abs/2407.07087v1
Hypothetical Minds: Scaffolding Theory of Mind for Multi-Agent Tasks with Large Language

In [15]:
print([r.title for r in all_results])

['AnyTaskTune: Advanced Domain-Specific Solutions through Task-Fine-Tuning', 'FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation', 'V-VIPE: Variational View Invariant Pose Embedding', 'General Relativistic effects and the NIR variability of Sgr A* II: A systematic approach to temporal asymmetry', '3D Gaussian Ray Tracing: Fast Tracing of Particle Scenes', 'Fine-Tuning Linear Layers Only Is a Simple yet Effective Way for Task Arithmetic', 'Safe and Reliable Training of Learning-Based Aerospace Controllers', 'CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation', 'Hypothetical Minds: Scaffolding Theory of Mind for Multi-Agent Tasks with Large Language Models', 'On some conjectural determinants of Sun involving residues']


In [None]:
#from langchain.document_loaders import ArxivLoader
#base_docs = ArxivLoader(query="ReAct LLM", load_max_docs=5).load()

In [32]:
ARVIX_DOC = os.getenv("ARVIX_DOC")
# Make sure the target directory exists before downloading
os.makedirs(ARVIX_DOC, exist_ok=True)
for r in all_results:
    r.download_pdf(dirpath=ARVIX_DOC)

### Step 2. Parsing

##### Type 1. Text document

In [None]:
from langchain.document_loaders import TextLoader
DOCUMENT = os.getenv("DOCUMENT")
# Use os.path.join so the path works whether or not DOCUMENT ends with a separator
txt_path = os.path.join(DOCUMENT, "rag.txt")
txt_loader = TextLoader(txt_path)
text_documents = txt_loader.load()
#text_documents

##### Type 2. PDF document

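We use PyMuPDFLoader in this experiment.

In [None]:
import glob
from langchain.document_loaders import PyMuPDFLoader
# PyMuPDFLoader takes a single file path, not a glob pattern, so expand the pattern first
pdf_documents = []
for pdf_path in glob.glob(os.path.join(DOCUMENT, "*.pdf")):
    pdf_documents.extend(PyMuPDFLoader(pdf_path).load())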

In [4]:
from langchain.document_loaders import PyMuPDFLoader
pdf_documents = []
for file in os.listdir(os.getenv("ARVIX_DOC")):
    if file.endswith('.pdf'):
        pdf_path = os.path.join(os.getenv("ARVIX_DOC"), file)
        loader = PyMuPDFLoader(pdf_path)
        pdf_documents.extend(loader.load())

##### Type 3. Batch Loading Directly from source

In [34]:
from langchain.document_loaders import ArxivLoader
batch_docs = ArxivLoader(query="ReAct for Large Language Model",  load_max_docs=10).load()

In [None]:
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders.pdf import PyMuPDFLoader
from langchain.document_loaders.xml import UnstructuredXMLLoader
from langchain.document_loaders.csv_loader import CSVLoader

# Define a dictionary to map file extensions to their respective loaders
loaders = {
    '.pdf': PyMuPDFLoader,
    '.xml': UnstructuredXMLLoader,
    '.csv': CSVLoader,
}

# Define a function to create a DirectoryLoader for a specific file type
def create_directory_loader(file_type, directory_path):
    return DirectoryLoader(
        path=directory_path,
        glob=f"**/*{file_type}",
        loader_cls=loaders[file_type],
    )

# Create DirectoryLoader instances for each file type
pdf_loader = create_directory_loader('.pdf', os.getenv("ARVIX_DOC"))

# Load the files
pdf_documents = pdf_loader.load()

### Step 3. Chunking

In [18]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
text_chunks = text_splitter.split_documents(text_documents)
#documents[:3]

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
pdf_chunks = text_splitter.split_documents(pdf_documents)

In [6]:
chunks = pdf_chunks
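
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal pure-Python sketch of fixed-size character chunking with overlap. It is a simplification and not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally tries to split on separators such as paragraph and line breaks. The helper name `chunk_text` and the sample string are illustrative assumptions.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=20):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return pieces

sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = chunk_text(sample, chunk_size=10, chunk_overlap=2)
print(demo_chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk repeats the last two characters of the previous one, which is the role `chunk_overlap=20` plays in the cells above: text that straddles a boundary still appears intact in at least one chunk.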

### Step 4. Vectorizing

Option 1: Using the OpenAI embedding API

In [7]:
from langchain_openai.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

Option 2: Using GPT4All embeddings

In [57]:
from langchain_community.embeddings import GPT4AllEmbeddings
model_name = "all-MiniLM-L6-v2.gguf2.f16.gguf"
gpt4all_kwargs = {'allow_download': True}
embeddings = GPT4AllEmbeddings(
    model_name=model_name,
    gpt4all_kwargs=gpt4all_kwargs
)

Downloading: 100%|██████████| 45.9M/45.9M [00:06<00:00, 7.66MiB/s]
Verifying: 100%|██████████| 45.9M/45.9M [00:00<00:00, 855MiB/s]


### Step 5. Storing

#### In-memory vectordb

In [None]:
#from langchain_community.vectorstores import DocArrayInMemorySearch
#vectorstore = DocArrayInMemorySearch.from_documents(chunks, embeddings)

#### Persist the vectordb with Chroma

In [58]:
from langchain.vectorstores import Chroma
persist_directory = os.getenv("ARXIVSTORE_GPT4ALL")

# Create the vector database with the local GPT4All embedding model.
# Note: different embedding models produce vectors of different dimensions,
# so their vectors cannot be stored together in the same collection.
# The same embedding model must be used again in the retrieval pipeline.
vectordb = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=persist_directory)
vectordb.persist()

  warn_deprecated(
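
The note about vector dimensions in the cell above can be illustrated without any vector database. The sketch below uses made-up toy vectors and a hypothetical `cosine_similarity` helper; it only shows why vectors from two different embedding models cannot be compared, and is not how Chroma is implemented.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors; raises on a dimension mismatch."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vec_model_a = [1.0, 0.0, 0.0]        # toy 3-dimensional embedding
vec_model_b = [1.0, 0.0, 0.0, 0.0]   # toy 4-dimensional embedding from another model

print(cosine_similarity(vec_model_a, [1.0, 0.0, 0.0]))
try:
    cosine_similarity(vec_model_a, vec_model_b)
except ValueError as err:
    print("Cannot compare:", err)
```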


# Pipeline 2 - Retrieving & Generating

### Create an Agent

In [None]:
# Define the agent here

In [59]:
import os
from dotenv import load_dotenv
load_dotenv()

True

### Step 1. Query

In [79]:
user_query = "What is retrieval augmented generation"
#user_query = "Describe the RAG-Sequence Model?"

### Step 2. Search

Load the vector store from disk if a persisted one exists. The cells below load the persisted Chroma store; the in-memory vector store is left commented out.
Search efficiency can be improved as the knowledge base grows larger and covers more types of sources.
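
Conceptually, the search step scores every stored chunk embedding against the query embedding and returns the best matches. The brute-force sketch below makes that explicit with toy two-dimensional vectors. The names `normalize`, `top_k`, and `toy_store` are hypothetical, and this is not how Chroma implements search; it only shows why the cost of a naive scan grows linearly with the number of chunks.

```python
import heapq
import math

def normalize(vec):
    """Scale a non-zero vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, store, k=2):
    """Return the k (score, text) pairs whose vectors are most similar to query_vec."""
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(vec))), text)
        for text, vec in store
    )
    return heapq.nlargest(k, scored)

toy_store = [
    ("chunk about retrieval", [0.9, 0.1]),
    ("chunk about astronomy", [0.1, 0.9]),
    ("chunk about generation", [0.7, 0.3]),
]
hits = top_k([1.0, 0.0], toy_store, k=2)
print([text for _, text in hits])
```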

In [61]:
from langchain_community.embeddings import GPT4AllEmbeddings
model_name = "all-MiniLM-L6-v2.gguf2.f16.gguf"
gpt4all_kwargs = {'allow_download': True}
embeddings = GPT4AllEmbeddings(
    model_name=model_name,
    gpt4all_kwargs=gpt4all_kwargs
)

In [62]:
#retriever = vectorstore.as_retriever()

#Load vectordb from persisted store
from langchain.vectorstores import Chroma
persist_directory = os.getenv("ARXIVSTORE_GPT4ALL")
newvectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
retriever = newvectordb.as_retriever()

In [64]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
setup = RunnableParallel(context=retriever, question=RunnablePassthrough())

In [81]:
retriever.invoke(user_query)

[Document(metadata={'author': '', 'creationDate': 'D:20240710005619Z', 'creator': 'LaTeX with hyperref', 'file_path': 'arvix_document\\2407.07087v1.CopyBench__Measuring_Literal_and_Non_Literal_Reproduction_of_Copyright_Protected_Text_in_Language_Model_Generation.pdf', 'format': 'PDF 1.5', 'keywords': '', 'modDate': 'D:20240710005619Z', 'page': 5, 'producer': 'pdfTeX-1.40.25', 'source': 'arvix_document\\2407.07087v1.CopyBench__Measuring_Literal_and_Non_Literal_Reproduction_of_Copyright_Protected_Text_in_Language_Model_Generation.pdf', 'subject': '', 'title': '', 'total_pages': 23, 'trapped': ''}, page_content='the prompt. In the fact recall task, the prompt in-\nstructs the model to generate a short answer. To\nfacilitate a fair comparison between base models\nand instruction-tuned models, we incorporate an\ninstruction and in-context learning demonstrations\ninto our prompts. Refer to Section A.2 for more\ndetails.\n3.4\nHuman Analysis of Automatic Event\nCopying Evaluation\nTo verify 

### Step 3. Augmented Prompt

In [63]:
from langchain.prompts import ChatPromptTemplate

template = """
Answer the question based on the context below. 
If you can't answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
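
When the chain runs, `{context}` and `{question}` are filled in before the text reaches the model. The sketch below uses plain `str.format` and a hypothetical `format_docs` helper instead of LangChain to show the resulting prompt text. The retrieved strings are made-up placeholders, and joining only the chunk text is an assumption of this sketch; in the chain as written, the retriever output is passed to the prompt as-is.

```python
TEMPLATE = """
Answer the question based on the context below.
If you can't answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

def format_docs(chunk_texts):
    """Join retrieved chunk texts into one context string, separated by blank lines."""
    return "\n\n".join(chunk_texts)

# Placeholder chunk texts standing in for retriever output
retrieved = [
    "RAG combines a retriever with a generator.",
    "The retriever returns passages for a query.",
]
filled = TEMPLATE.format(
    context=format_docs(retrieved),
    question="What is retrieval augmented generation",
)
print(filled)
```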

### Step 4. Response Generating

In [75]:
from langchain_core.output_parsers import StrOutputParser
parser = StrOutputParser()

Option 1: Using on-cloud OpenAI

In [8]:
from langchain_openai.chat_models import ChatOpenAI
# Read the key from the environment; without this line OPENAI_API_KEY is undefined
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo")

Option 2: Using Local LLM GPT4All

In [74]:
from langchain_community.llms import GPT4All
from langchain_core.callbacks import StreamingStdOutCallbackHandler

In [72]:
local_path = ("C:\\Users\\derek\\Meta-Llama-3-8B-Instruct.Q4_0.gguf")

In [76]:
# Callbacks support token-wise streaming
callbacks = [StreamingStdOutCallbackHandler()]

# Streaming is off here; pass callbacks=callbacks and verbose=True to stream tokens
model = GPT4All(model=local_path, verbose=False)
parser = StrOutputParser()
# If you want to use a custom model add the backend parameter
# Check https://docs.gpt4all.io/gpt4all_python.html for supported backends
#model = GPT4All(model=local_path, backend="gptj", callbacks=callbacks, verbose=True)

In [77]:
chain = setup | prompt | model | parser

In [80]:
response = chain.invoke(user_query)
response

'Answer: I don\'t know. \nPlease provide more context or clarify what you mean by "retrieval augmented generation". Is it a specific method or concept in natural language processing? If so, please provide more information about it. \n\nNote that the provided documents are PDFs and contain text related to various topics such as language models, multi-agent tasks, and task arithmetic. However, none of these texts explicitly discuss "retrieval augmented generation". Therefore, I am unable to answer your question based on this context alone. If you can provide more information or clarify what you mean by the term, I may be able to help further.'

# RAG Evaluation

### Generate Synthetic Test Dataset

In [None]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# generator with openai models
# generator_llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0) 
# critic_llm = ChatOpenAI(model="gpt-4")
# embeddings = OpenAIEmbeddings()

In [17]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_huggingface import HuggingFaceEndpoint 
from langchain_huggingface.embeddings import HuggingFaceEmbeddings


In [13]:
from getpass import getpass

HUGGINGFACEHUB_API_TOKEN = getpass()

In [14]:
import os

os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN

In [15]:
repo_id = "mistralai/Mistral-7B-Instruct-v0.2"

generator_llm = HuggingFaceEndpoint(
    repo_id=repo_id,
    max_length=128,
    temperature=0.5,
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN,
)
critic_llm = HuggingFaceEndpoint(
    repo_id=repo_id,
    max_length=128,
    temperature=0,
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN,
)

                    max_length was transferred to model_kwargs.
                    Please make sure that max_length is what you intended.


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to C:\Users\derek\.cache\huggingface\token
Login successful


In [20]:
import nest_asyncio
nest_asyncio.apply()

In [21]:
embeddings = HuggingFaceEmbeddings()
generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

# Change resulting question type distribution
distributions = {
    simple: 0.2,
    multi_context: 0.4,
    reasoning: 0.4
}

try:
    testset = generator.generate_with_langchain_docs(chunks, test_size=10, distributions=distributions)
except Exception as e:
    print(e)

  def _save_to_state_dict(self, destination, prefix, keep_vars):
Exception in thread Thread-23:                                      
Traceback (most recent call last):
  File "c:\Users\derek\OneDrive\1 - Technology\Workspace\rag_win\Lib\site-packages\ragas\llms\json_load.py", line 107, in _asafe_load
    _json = self._load_all_jsons(text)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\derek\OneDrive\1 - Technology\Workspace\rag_win\Lib\site-packages\ragas\llms\json_load.py", line 146, in _load_all_jsons
    _json = json.loads(text[start:end])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\derek\AppData\Local\Programs\Python\Python312\Lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\derek\AppData\Local\Programs\Python\Python312\Lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File 
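
The `distributions` dictionary used above assigns a share of the generated questions to each question type. The sketch below is a quick sanity check, assuming the weights are meant to sum to 1. It uses plain strings as keys instead of the ragas evolution objects, the helper name `check_distribution` is hypothetical, and it makes no claim about how ragas rounds the counts internally.

```python
import math

def check_distribution(weights, test_size):
    """Validate that the weights sum to 1 and return each key's share of test_size."""
    total = sum(weights.values())
    if not math.isclose(total, 1.0):
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return {name: weight * test_size for name, weight in weights.items()}

shares = check_distribution(
    {"simple": 0.2, "multi_context": 0.4, "reasoning": 0.4},
    test_size=10,
)
print(shares)
```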



Simpler Testset generator

In [None]:
simple_generator = TestsetGenerator.with_openai()

testset = simple_generator.generate_with_langchain_docs(chunks, test_size=10, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})

In [22]:
testset.to_pandas()

NameError: name 'testset' is not defined

### Run evaluation on our RAG chain

In [None]:
questions = testset.to_pandas()["question"].to_list()
ground_truth = testset.to_pandas()["ground_truth"].to_list()

In [None]:
questions

In [None]:
ground_truth

In [None]:
from datasets import Dataset

data = {"question": [], "answer": [], "contexts": [], "ground_truth": ground_truth}

for query in questions:
    data["question"].append(query)
    data["answer"].append(chain.invoke(query))
    data["contexts"].append([doc.page_content for doc in retriever.invoke(query)])

dataset = Dataset.from_dict(data)
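
`Dataset.from_dict` expects every column to have the same number of rows. Because `ground_truth` is filled in up front while the other columns are appended inside the loop, an exception partway through the loop would leave the columns uneven. The sketch below is a small pre-check under that assumption; the helper name `column_lengths_match` and the toy data are hypothetical.

```python
def column_lengths_match(data):
    """Return True when every column in the dict holds the same number of rows."""
    lengths = {len(column) for column in data.values()}
    return len(lengths) <= 1

toy_data = {
    "question": ["q1", "q2"],
    "answer": ["a1"],             # one answer missing, e.g. after a failed chain call
    "contexts": [["c1"], ["c2"]],
    "ground_truth": ["g1", "g2"],
}
print(column_lengths_match(toy_data))  # False
```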

In [None]:
retriever.invoke(questions[1])

In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)

result = evaluate(
    dataset = dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
)

In [None]:
import pandas as pd
result_pd = result.to_pandas()
pd.set_option("display.max_colwidth", 700)
result_pd[["question", "contexts", "answer", "ground_truth","faithfulness"]]

In [37]:
while True:
    user_input = input("Enter a query: ")
    if user_input == "exit":
        break

    try:
        response = chain.invoke(user_input)
        print(response)
    except Exception as err:
        print("Exception occurred. Please try again.", str(err))

Answer: The RAG (Reinforced Augmented Generation) model uses an input sequence x to retrieve text documents z and use them as additional context when generating a target sequence y. It consists of two components: (i) a retriever pη(z|x) that returns distributions over text passages given a query x, and (ii) a generator pθ(yi|x,z,y1:i−1) parametrized by θ. The model can be used for tasks such as fact verification.
```python
import pandas as pd

# Load the data from the context into a DataFrame.

Answer: I don't know how to load this specific data, but you could use Python's `pandas` library to create a DataFrame:

```
data = [
    {"page_content": "the non-parametric memory can be replaced to update the models’ knowledge as the world changes.1\n2\...", 
     "metadata": {...}},
    ...
]

df = pd.DataFrame(data)
```  ```
Answer: I don't know how to load this specific data, but you could use Python's `pandas` library to create a DataFrame:

```
data = [
    {"page_content": "the non-para