<a href="https://colab.research.google.com/github/almutareb/rag-based-llm-app/blob/main/RAG_Arxiv_Langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Arxiv

This notebook tests querying and chatting with arXiv articles using an LLM.
First we use OpenAI's gpt-3.5-turbo (ChatGPT-3.5), then try other options.

In [None]:
!pip install langchain arxiv openai python-dotenv pymupdf

In [9]:
from langchain.retrievers import ArxivRetriever

ArxivRetriever has these arguments:

  * load_max_docs (optional, default=100): limits the number of downloaded documents. Downloading all 100 documents takes time, so use a small number for experiments. There is currently a hard limit of 300.
  * load_all_available_meta (optional, default=False): by default, only the most important fields are downloaded: Published (the date the document was published or last updated), Title, Authors, and Summary. If True, the other fields are downloaded as well.

get_relevant_documents() takes one argument, query: free text used to find documents on arXiv.org.
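
A minimal sketch of how the two constructor arguments might be prepared before building the retriever. The helper name `retriever_kwargs` is hypothetical and not part of LangChain. The limit of 300 is taken from the description above; whether the library itself clamps or rejects larger values is not assumed here.

```python
# Hard limit quoted in the argument description above.
ARXIV_HARD_LIMIT = 300


def retriever_kwargs(max_docs: int = 100, all_meta: bool = False) -> dict:
    """Build keyword arguments for ArxivRetriever (hypothetical helper)."""
    if max_docs < 1:
        raise ValueError("max_docs must be at least 1")
    return {
        # never ask for more documents than the stated hard limit
        "load_max_docs": min(max_docs, ARXIV_HARD_LIMIT),
        "load_all_available_meta": all_meta,
    }


kwargs = retriever_kwargs(max_docs=25)
print(kwargs)

# With langchain and arxiv installed, the retriever could then be built as:
#   from langchain.retrievers import ArxivRetriever
#   retriever = ArxivRetriever(**kwargs)
```

Keeping the number small, as in the cell below, shortens the download step during experiments.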

In [20]:
retriever = ArxivRetriever(load_max_docs=25)

In [21]:
docs = retriever.get_relevant_documents(query="2310.03184")

In [16]:
docs[0].metadata

{'Published': '2023-10-04',
 'Title': 'Retrieval-augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference',
 'Authors': 'Zachary Levonian, Chenglu Li, Wangda Zhu, Anoushka Gade, Owen Henkel, Millie-Ellen Postle, Wanli Xing',
 'Summary': "For middle-school math students, interactive question-answering (QA) with\ntutors is an effective way to learn. The flexibility and emergent capabilities\nof generative large language models (LLMs) has led to a surge of interest in\nautomating portions of the tutoring process - including interactive QA to\nsupport conceptual discussion of mathematical concepts. However, LLM responses\nto math questions can be incorrect or mismatched to the educational context -\nsuch as being misaligned with a school's curriculum. One potential solution is\nretrieval-augmented generation (RAG), which involves incorporating a vetted\nexternal knowledge source in the LLM prompt to increase response quality. In\nthis pa

In [17]:
docs[0].page_content[:500]

'Retrieval-augmented Generation to Improve Math\nQuestion-Answering: Trade-offs Between\nGroundedness and Human Preference\nZachary Levonian1\nChenglu Li2\nWangda Zhu3\nAnoushka Gade1\nOwen Henkel4\nMillie-Ellen Postle5\nWanli Xing3\n1Digital Harbor Foundation\n2University of Utah\n3University of Florida\n4University of Oxford\n5Rising Academies\nzach@digitalharbor.org\nAbstract\nFor middle-school math students, interactive question-answering (QA) with tutors\nis an effective way to learn. The flexibility and emer'

## Q&A with ChatGPT-3.5

In [5]:
import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

openai.api_key = os.environ['OPENAI_API_KEY']
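
Indexing `os.environ` directly raises a bare `KeyError` when the key is missing. The sketch below shows one possible way to fail with a clearer message; the helper `require_env` and the variable name `EXAMPLE_API_KEY` are hypothetical and used only for illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell"
        )
    return value


# Demonstration with a placeholder variable; a real key would come from .env.
os.environ["EXAMPLE_API_KEY"] = "placeholder-value"
print(require_env("EXAMPLE_API_KEY"))
```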

In [13]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

model = ChatOpenAI(model="gpt-3.5-turbo")
qs = ConversationalRetrievalChain.from_llm(model, retriever=retriever)

In [14]:
questions = [
    "What are the challenges of using LLMs to generate math answers for middle-school students?",
    "How does retrieval-augmented generation (RAG) address these challenges?",
    "What is the design of the RAG system used in this paper?",
    "How was the efficacy of the RAG system evaluated?",
    "What are the trade-offs between generating responses preferred by students and responses closely matched to specific educational resources?",
]

chat_history = []

for question in questions:
  result = qs({"question": question, "chat_history": chat_history})
  chat_history.append((question, result["answer"]))
  print(f"-> **Question**: {question} \n")
  print(f"**Answer**: {result['answer']} \n")

-> **Question**: What are the challenges of using LLMs to generate math answers for middle-school students? 

**Answer**: The given context does not specifically mention the challenges of using LLMs to generate math answers for middle-school students. Therefore, based on the provided context, it is not possible to determine the challenges related to this specific use case. 





-> **Question**: How does retrieval-augmented generation (RAG) address these challenges? 

**Answer**: La génération augmentée par recherche (RAG) est une approche qui combine les capacités de la recherche d'informations avec la génération de texte pour aborder les défis de la génération de texte. RAG utilise des modèles de langage pré-entraînés et des techniques de recherche d'informations pour fournir des réponses plus précises et informatives. Par exemple, au lieu de générer simplement une réponse à partir de zéro, RAG peut effectuer une recherche sur une base de connaissances ou sur le web pour obtenir des informations supplémentaires et les intégrer dans la réponse générée. Cela permet d'améliorer la qualité et la pertinence des réponses générées. Cependant, il est important de noter que RAG est une technologie en développement et peut avoir des limites et des lacunes dans sa capacité à aborder tous les défis de manière optimale. 

-> **Question**: What is the design of the RAG sy
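
The loop above passes the growing `chat_history` back into the chain on every turn. A minimal sketch of that bookkeeping, using a stand-in callable instead of the real chain so it runs without an API key; `fake_chain` and `run_conversation` are hypothetical names, not LangChain functions.

```python
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, str]]


def run_conversation(chain: Callable[[Dict], Dict], questions: List[str]) -> History:
    """Ask each question in turn, feeding all earlier (question, answer) pairs back in."""
    chat_history: History = []
    for question in questions:
        # pass a copy so the chain sees only the turns completed so far
        result = chain({"question": question, "chat_history": list(chat_history)})
        chat_history.append((question, result["answer"]))
    return chat_history


def fake_chain(inputs: Dict) -> Dict:
    """Stand-in for the retrieval chain: reports how much history it received."""
    turns = len(inputs["chat_history"])
    return {"answer": f"answer after {turns} earlier turn(s)"}


history = run_conversation(fake_chain, ["first question", "second question"])
for question, answer in history:
    print(question, "->", answer)
```

Because follow-up questions such as "How does RAG address these challenges?" only make sense with the earlier turns, the history is what lets the chain resolve "these".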

## Using an agent

In [22]:
from langchain.chat_models import ChatOpenAI
from langchain.agents import load_tools, initialize_agent, AgentType

llm = ChatOpenAI(temperature=0.0)
tools = load_tools(
    ["arxiv"],
)

agent_chain = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)

In [23]:
agent_chain.run(
    "What papers cover the use of RAG to generate math answers for middle-school students?"
)



> Entering new AgentExecutor chain...
I should search for papers on arxiv.org that discuss the use of RAG (Retrieve, Answer, Generate) models in generating math answers for middle-school students.
Action: arxiv
Action Input: "RAG model for generating math answers middle-school students"
Observation: Published: 2023-10-04
Title: Retrieval-augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference
Authors: Zachary Levonian, Chenglu Li, Wangda Zhu, Anoushka Gade, Owen Henkel, Millie-Ellen Postle, Wanli Xing
Summary: For middle-school math students, interactive question-answering (QA) with
tutors is an effective way to learn. The flexibility and emergent capabilities
of generative large language models (LLMs) has led to a surge of interest in
automating portions of the tutoring process - including interactive QA to
support conceptual discussion of mathematical concepts. However, LLM responses
to math

'The papers "Retrieval-augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference" (2023), "Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question Answering" (2022), and "Fine-tune the Entire RAG Architecture (including DPR retriever) for Question-Answering" (2021) cover the use of RAG to generate math answers for middle-school students.'
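
The verbose trace shows the model emitting an `Action:` line naming a tool and an `Action Input:` line with the tool's argument. The sketch below illustrates how such text could be split into a tool call. It is a simplified illustration with a hypothetical `parse_action` helper, not LangChain's actual output parser.

```python
import re
from typing import Optional, Tuple

# Matches "Action: <tool>" followed on the next line by "Action Input: <input>".
ACTION_PATTERN = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", re.DOTALL)


def parse_action(llm_text: str) -> Optional[Tuple[str, str]]:
    """Return (tool, tool_input) if the text contains an action, else None."""
    match = ACTION_PATTERN.search(llm_text)
    if match is None:
        return None
    tool = match.group(1).strip()
    # drop surrounding whitespace and the quotes the model put around the input
    tool_input = match.group(2).strip().strip('"')
    return tool, tool_input


step = (
    "I should search for papers on arxiv.org.\n"
    "Action: arxiv\n"
    'Action Input: "RAG model for generating math answers middle-school students"'
)
print(parse_action(step))
```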

## DistilGPT2 from Hugging Face

In [None]:
!pip install transformers

In [1]:
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="distilgpt2",
    task="text-generation",
    model_kwargs={"temperature":0, "max_length":256},
)



In [2]:
from langchain.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step. """
prompt = PromptTemplate.from_template(template)

chain = prompt | llm

question = "What are the challenges of using LLMs to generate math answers for middle-school students?"

print(chain.invoke({"question": question}))

ValueError: ignored
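
To see what text the model would receive, the template can be rendered without any model. This sketch uses plain `str.format`, assuming the prompt uses the default f-string-style placeholders; it approximates the rendering step only and does not reproduce or explain the error above.

```python
template = """Question: {question}

Answer: Let's think step by step. """

question = (
    "What are the challenges of using LLMs to generate math answers "
    "for middle-school students?"
)

# substitute the {question} placeholder, as the prompt step of the chain would
rendered = template.format(question=question)
print(rendered)
```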

In [None]:
from langchain

## New approach

In [1]:
!pip install -qU transformers accelerate einops langchain xformers bitsandbytes faiss-gpu sentence_transformers


In [7]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

# Hugging Face access token, read from the HF_API_KEY environment variable
hf_auth = os.environ['HF_API_KEY']

In [1]:
from torch import cuda, bfloat16
import transformers

#model_id = 'distilbert-base-cased-distilled-squad'
model_id = 'Open-Orca/Mistral-7B-OpenOrca'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, you need an access token

model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
    )

model = transformers.AutoModelForCausalLM.from_pretrained(
#model = transformers.AutoModelForQuestionAnswering.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)

# enable evaluation mode to allow model inference
model.eval()

print(f"Model loaded on {device}")

NameError: ignored
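
The quantization config exists to fit a large model into limited GPU memory. A rough back-of-the-envelope sketch of the saving for the weights alone, assuming a round 7 billion parameters and ignoring quantization constants, activations, and the KV cache; `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7e9  # assumption: round figure for a "7B" model

bf16 = weight_memory_gib(N_PARAMS, 16)  # unquantized 16-bit weights
nf4 = weight_memory_gib(N_PARAMS, 4)    # 4-bit quantized weights

print(f"16-bit weights: {bf16:.1f} GiB")
print(f"4-bit weights:  {nf4:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the 16-bit footprint, which is what makes a 7B model plausible on a single consumer or free Colab GPU.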

In [42]:
# initialize the tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)

In [36]:
# stopping criteria
stop_list = ['\nHuman:', '\n```\n']

stop_token_ids = [tokenizer(x)['input_ids'] for x in stop_list]
stop_token_ids

[[198, 20490, 25], [198, 15506, 63, 198]]

In [37]:
# convert stop token ids into LongTensor objects

import torch

stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]
stop_token_ids

[tensor([  198, 20490,    25], device='cuda:0'),
 tensor([  198, 15506,    63,   198], device='cuda:0')]

In [38]:
# object that checks whether the stopping criteria have been satisfied

from transformers import StoppingCriteria, StoppingCriteriaList

# define custom stopping criteria object
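class StopOnTokens(StoppingCriteria):
  def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
    for stop_ids in stop_token_ids:
      # compare the tail of the generated sequence with each stop sequence;
      # the shape check skips inputs shorter than the stop sequence
      tail = input_ids[0][-len(stop_ids):]
      if tail.shape == stop_ids.shape and torch.eq(tail, stop_ids).all():
        return True
    return False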

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
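
The stopping check is a suffix comparison: generation stops once the most recent tokens equal one of the stop sequences. A pure-Python sketch of the same idea, without torch; `ends_with_stop_sequence` is a hypothetical helper, the stop IDs are the ones printed above, and the other token IDs are arbitrary.

```python
from typing import List, Sequence


def ends_with_stop_sequence(generated: Sequence[int], stop_sequences: List[List[int]]) -> bool:
    """Return True if the generated token IDs end with any stop sequence."""
    for stop_ids in stop_sequences:
        n = len(stop_ids)
        # guard against empty stop sequences and inputs shorter than the stop sequence
        if n and len(generated) >= n and list(generated[-n:]) == list(stop_ids):
            return True
    return False


stop_sequences = [[198, 20490, 25], [198, 15506, 63, 198]]

# ends with the first stop sequence
print(ends_with_stop_sequence([464, 3280, 198, 20490, 25], stop_sequences))
# too short to contain any stop sequence
print(ends_with_stop_sequence([198, 20490], stop_sequences))
```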

In [45]:
# initialize the HF pipeline
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    # langchain expects the full text
    return_full_text=True,
    #task='question-answering',
    task='text-generation',
    # pass model parameters
    stopping_criteria=stopping_criteria,
    temperature=0.1,
    # max tokens in generated output
    max_new_tokens=512,
    # penalize repetition to reduce it
    repetition_penalty=1.1
)

In [47]:
res = generate_text("Explain to me the difference between ETF and managed Fonds")
print(res[0]["generated_text"])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Explain to me the difference between ETF and managed Fonds.,,,,,,,,,,,,,,,,,,,, [the output continues as a long run of commas]
