## Building a Retrieval-Augmented Generation (RAG) System
#### Alan García Zermeño
06/13/2024

### Section 6: Expanding the RAG System to the Internet
#### This section includes:
- Code snippets for integrating internet retrieval into the RAG system.
- Examples of questions, the retrieved information from the web, and the
corresponding generated answers.
- A brief report on the methods used to evaluate and ensure the quality of the
internet-sourced information.

In [1]:
from openai import OpenAI
import textwrap
import http.client
import json
import sys
import os

# Import script modules
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '../Scripts')))
from datacleaner import data_cleaner
from RetrievalSystemWeb import evaluator, generGemini, CRAG

[nltk_data] Downloading package punkt to /home/alan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


An effective way to improve the retrieval system is to add a web search for cases where the database does not have a specific answer to a query. First, we call the data cleaner.

In [2]:
corpus,_,answers = data_cleaner()

49 Question/Answer pairs extracted!


To make web queries we will use Serper, a low-cost Google Search API. The setup is as follows:
- Get an API key at https://serper.dev/api-key.
- Save it in the 'serper.txt' file in the APIS directory.
- **Please paste only the API key**.
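
Since the file must contain only the key, it can help to fail early when it does not. The following is a minimal sketch of such a check; `load_api_key` is a hypothetical helper and is not part of the project's scripts.

```python
from pathlib import Path


def load_api_key(path):
    """Return the API key stored on the first line of `path` (hypothetical helper)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    key = lines[0].strip() if lines else ""
    if not key:
        raise ValueError(f"No API key found in {path}")
    # A key with inner whitespace usually means extra text was pasted with it.
    if any(ch.isspace() for ch in key):
        raise ValueError(f"{path} should contain only the API key")
    return key
```

With the layout used in this notebook, a call would look like `load_api_key("../APIS/serper.txt")`.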

The strategy will be to make a web query once the function that evaluates our query determines that there is not enough information in the database to accurately answer our query. Once this occurs, the process will be as follows:

- **Processing**: We will ask Gemini to analyze our query and extract up to four words that describe our question in a format suitable for web search. For example: "*The best football players of 2023*" --> "*best football players 2023*"

- **Query**: This summarized query will be sent to the Serper API, which will return some useful URLs and snippets.

- **Extraction**: We will use JSON to extract all the snippets and concatenate them into a corpus.

- **Summarization**: Gemini will summarize the corpus into a couple of sentences to answer the query as concretely as possible.

- **Generation**: This summary will serve as context for generating the final response by GPT.

First, we call the evaluator with our query, our corpus, and our answers. It returns two values: a context and a boolean, boo. If the database contains enough information, it returns boo = False and the context found in the database, and no web query is needed. If the evaluator determines that the database does not contain enough information, it returns boo = True. In this case, the context is the query itself, which is used to start the web search.
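
The control flow described above can be sketched as a single function. This is only an illustration of the routing logic, not the code in RetrievalSystemWeb.py: the name `answer_with_fallback` and all of its callable parameters are hypothetical stand-ins for the evaluator, Gemini, Serper, and GPT calls.

```python
def answer_with_fallback(query, evaluate, summarize_query, search, summarize, generate):
    """Answer from the database when possible, otherwise from the web.

    Every callable is supplied by the caller (hypothetical names):
      evaluate(query)           -> (context, needs_web)
      summarize_query(context)  -> short web query
      search(web_query)         -> list of snippet strings
      summarize(corpus, query)  -> short summary of the corpus
      generate(query, context)  -> final answer
    """
    context, needs_web = evaluate(query)
    if needs_web:
        web_query = summarize_query(context)    # Processing
        snippets = search(web_query)            # Query and Extraction
        web_corpus = " ".join(snippets)
        context = summarize(web_corpus, query)  # Summarization
    # Generation: the context is either the database text or the web summary.
    return generate(query, context), needs_web
```

Passing the steps in as functions keeps the routing logic testable without calling any external API.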

In [3]:
query = "What is Keytruda?"
context,boo = evaluator(query,corpus,answers)

Does the following document have exact information to answer the following query?
    Please choose one of the two possible options: Yes, or No.
    

        Question: What is Keytruda?

        Document: Keytruda is administered as an intravenous infusion over 30 minutes.

        Evaluation: [Select one: Yes, No]:
No
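
The evaluator prompt asks the model for a plain Yes or No. A minimal sketch of turning that reply into the boolean flag follows, assuming the reply arrives as a string; `needs_web_search` is a hypothetical name, and treating every reply other than a clear "yes" as a reason to search the web is a design assumption made here, not necessarily what `evaluator` does.

```python
def needs_web_search(verdict):
    """Map the evaluator model's free-text reply to the web-search flag.

    Returns False only for a reply that starts with "yes"; any other reply,
    including an empty or malformed one, falls back to a web search.
    """
    cleaned = verdict.strip().lower()
    return not cleaned.startswith("yes")
```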


In this scenario, we require a web search, so we will design the web query with the help of Gemini.

In [4]:
promptQ = "Summarize this query to a maximum of 4 words to perform an internet search: \nQuery: "
queryW = generGemini(promptQ+context)
print(f"Web Query: {queryW}")

Web Query: Keytruda definition


Next, we perform the query by calling the API with our key.

In [10]:
with open("../APIS/serper.txt", 'r') as file: apiw = file.readline().strip()
conn = http.client.HTTPSConnection("google.serper.dev")
payload = json.dumps({"q": queryW})
headers = {'X-API-KEY': apiw,'Content-Type': 'application/json'}
conn.request("POST", "/search", payload, headers)
res = conn.getresponse()
res

<http.client.HTTPResponse at 0x717b1efb75e0>
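
The response object alone does not tell us whether the request succeeded. A small sketch of checking the status code before decoding the body follows; `parse_search_response` is a hypothetical helper, and it assumes that any status other than 200 should be treated as a failure.

```python
import json


def parse_search_response(status, body):
    """Decode a search API reply, raising if the request did not succeed.

    `status` is the HTTP status code and `body` the raw bytes, for example
    `res.status` and `res.read()` from http.client.
    """
    if status != 200:
        raise RuntimeError(f"Search request failed with HTTP status {status}")
    return json.loads(body.decode("utf-8"))
```

Note that `res.read()` consumes the body, so it should be read only once and the result passed on.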

Then, we parse the JSON response and extract the snippets.

In [12]:
data = res.read()
response_dict = json.loads(data.decode("utf-8"))

webCorpus = ""
for sni in response_dict['organic']:
    webCorpus += " "+sni['snippet']
    
print(textwrap.fill(webCorpus[:200])+" ...")

 KEYTRUDA is a prescription medicine used to treat a kind of cancer
called head and neck squamous cell cancer (HNSCC). KEYTRUDA may be
used with the chemotherapy ... KEYTRUDA, as a single agent, is in ...
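
The loop above assumes that the response has an 'organic' list and that every result carries a 'snippet' field. A slightly more defensive sketch, under the same assumed response shape, skips results without a snippet instead of raising a KeyError; `collect_snippets` is a hypothetical name.

```python
def collect_snippets(response_dict):
    """Join the text snippets of the organic results into one string.

    Results without a 'snippet' field are skipped, and a missing 'organic'
    list yields an empty string.
    """
    results = response_dict.get("organic", [])
    snippets = [r["snippet"] for r in results if r.get("snippet")]
    return " ".join(snippets)
```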


This large block of text is then summarized by Gemini.

In [8]:
promptZ = f"Summarize this corpus to a maximum of 2 sentences to try to answer ONLY this query: \nCorpus: {webCorpus} \nQuery: {query} \nResponse: "
response = generGemini(promptZ)
print(textwrap.fill(response))

Keytruda, a type of immunotherapy, is a prescription medicine used to
treat different types of cancers, including cervical cancer, head and
neck squamous cell carcinoma, and other types. It is administered
through intravenous infusion over 30 minutes.
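
Because the web corpus can grow with the number of results, one option is to cap its size before building the prompt. The sketch below reuses the prompt wording from the cell above; the helper name `build_summary_prompt` and the `max_chars` budget are assumptions for illustration, not values used in the project.

```python
def build_summary_prompt(web_corpus, query, max_chars=4000):
    """Build the summarization prompt, trimming the corpus to `max_chars`."""
    corpus = web_corpus.strip()
    if len(corpus) > max_chars:
        # Cut at the limit, then drop the trailing (possibly partial) word.
        corpus = corpus[:max_chars].rsplit(" ", 1)[0]
    return (
        "Summarize this corpus to a maximum of 2 sentences to try to answer "
        f"ONLY this query: \nCorpus: {corpus} \nQuery: {query} \nResponse: "
    )
```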


The final step is to use this response as context for the final GPT response to our query.

In [9]:
#Define model and prompt
with open("../APIS/gpt.txt", 'r') as file: apik = file.readline().strip()
client = OpenAI(api_key = apik)
model = "gpt-3.5-turbo"


print("IMPORTANT: This information is not extracted from the official database")

# Use Gemini's web summary (response) as context; `context` only holds the query here
prompt = f"""
    Please generate an informative and concise response to the following query.
    Use the provided context information to ensure your response is accurate and relevant.

    Context: {response}
    """

completion = client.chat.completions.create(
                model=model,
                messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": query}
                ],
                max_tokens=120
            )
print("\n\nGPT RESPONSE:")
print(textwrap.fill(completion.choices[0].message.content))

IMPORTANT: This information is not extracted from the official database


GPT RESPONSE:
Keytruda is a brand name for the drug pembrolizumab, which is a type
of immunotherapy medication used to treat cancer. It works by helping
the immune system identify and attack cancer cells. Keytruda is
approved for various types of cancers, including melanoma, non-small
cell lung cancer, head and neck cancer, and more. It is typically
prescribed by a healthcare provider and administered through infusion
or injection.


### Adding Security

Data on the internet is not always reliable, and several strategies can help keep responses trustworthy. One option is to use a model fine-tuned on verified data as a filter against false information. The simplest approach, however, is to use only trustworthy websites: we compile a list of trusted URLs, such as "keytruda.com", and extract snippets only from those domains. This requires only a small change to the extraction code:

In [13]:
secureURLs = ['keytruda.com']                           # list of trusted URLs
webSecureCorpus = ""
for sni in response_dict['organic']:
    # Add the snippet once if its link contains any trusted URL
    if any(url in sni["link"] for url in secureURLs):
        webSecureCorpus += " "+sni['snippet']
    
print(textwrap.fill(webSecureCorpus))

 KEYTRUDA is a prescription medicine used to treat a kind of cancer
called head and neck squamous cell cancer (HNSCC). KEYTRUDA may be
used with the chemotherapy ...
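
A substring check also accepts links that merely contain the trusted name, for instance in a query string or as part of another host name. A stricter sketch compares the host name of each link instead; `is_trusted` is a hypothetical helper and is not the check used in RetrievalSystemWeb.py.

```python
from urllib.parse import urlparse


def is_trusted(link, trusted_domains):
    """Return True if the link's host is a trusted domain or one of its subdomains."""
    host = (urlparse(link).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in trusted_domains)
```

With this check, the entries in the list must be full domains, such as "wikipedia.org" rather than "wikipedia".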


The complete pipeline is implemented as the CRAG() function in the "RetrievalSystemWeb.py" file. We call it with our query, our corpus, our answers, and the list of trusted websites to search if the evaluator decides that a web search is needed.

In [3]:
query = "What is Keytruda?"
CRAG(
    query = query,
    corpus = corpus,
    answers = answers,
    safeURLs = ['keytruda.com']
    )

Does the following document have exact information to answer the following query?
    Please choose one of the two possible options: Yes, or No.
    

        Question: What is Keytruda?

        Document: Keytruda is administered as an intravenous infusion over 30 minutes.

        Evaluation: [Select one: Yes, No]:
No
IMPORTANT: This information is not extracted from the official database
Web Query: Keytruda definition
 KEYTRUDA is a type of immunotherapy that works by blocking the PD-1
pathway to help prevent cancer cells from hiding. KEYTRUDA helps the
immune system do ...


GPT RESPONSE:
Keytruda is an immunotherapy drug that works by blocking the PD-1
pathway, a mechanism that cancer cells use to evade detection by the
immune system. By blocking this pathway, Keytruda helps the immune
system recognize and attack cancer cells more effectively. It is used
in the treatment of various types of cancer, such as melanoma, lung
cancer, and head and neck cancer.


The following is an example in which the evaluator found pertinent information in the database and did not perform a web search.

In [4]:
query = "What are Keytruda side effects?"
CRAG(
    query = query,
    corpus = corpus,
    answers = answers,
    safeURLs = ['keytruda.com']
    )

Does the following document have exact information to answer the following query?
    Please choose one of the two possible options: Yes, or No.
    

        Question: What are Keytruda side effects?

        Document: Common side effects include fatigue, nausea, and skin rash.

        Evaluation: [Select one: Yes, No]:
Yes


GPT RESPONSE:
Keytruda is a medication used in immunotherapy to treat certain types
of cancers. Some common side effects of Keytruda include fatigue,
nausea, and skin rash. It is important to speak with your healthcare
provider about any potential side effects and how to manage them
effectively. Other possible side effects may include diarrhea,
decreased appetite, fever, and muscle pain. Severe side effects are
less common but can include immune-related reactions that affect the
skin, lungs, liver, and other organs. It is essential to report any
concerning symptoms to your healthcare provider promptly.


The following example shows a case in which the evaluator did not find pertinent information in the database and performed a web search.

In [7]:
query = "Who invented keytruda?"
CRAG(
    query = query,
    corpus = corpus,
    answers = answers,
    safeURLs = ['keytruda.com','wikipedia']
    )

Does the following document have exact information to answer the following query?
    Please choose one of the two possible options: Yes, or No.
    

        Question: Who invented keytruda?

        Document: Keytruda is administered as an intravenous infusion over 30 minutes.

        Evaluation: [Select one: Yes, No]:
No
IMPORTANT: This information is not extracted from the official database
Web Query: Inventor of Keytruda
 Pembrolizumab, sold under the brand name Keytruda, is a humanized
antibody used in cancer immunotherapy that treats melanoma, lung
cancer, head and neck cancer, ...


GPT RESPONSE:
Keytruda, also known as pembrolizumab, was developed by the
pharmaceutical company Merck & Co., Inc. It was approved by the Food
and Drug Administration (FDA) in 2014 for the treatment of advanced
melanoma.


In [4]:
query = "keytruda and chemotherapy are the same?"
CRAG(
    query = query,
    corpus = corpus,
    answers = answers,
    safeURLs = ['keytruda.com','wikipedia'],
    printp = True
    )

Does the following document have exact information to answer the following query?
    Please choose one of the two possible options: Yes, or No.
    

        Question: keytruda and chemotherapy are the same?

        Document:  Yes, Keytruda demonstrated superior efficacy in terms of overall survival and progression-free survival compared to chemotherapy in NSCLC patients.

        Evaluation: [Select one: Yes, No]:
Yes


GPT RESPONSE:
No, Keytruda (pembrolizumab) and chemotherapy are not the same.
Keytruda is an immunotherapy drug that works by targeting and blocking
the PD-1/PD-L1 pathway, thereby boosting the immune system to fight
cancer cells. On the other hand, chemotherapy involves using drugs
that directly kill rapidly dividing cancer cells but can also affect
normal cells. In studies, Keytruda has demonstrated superior efficacy
in terms of overall survival and progression-free survival compared to
chemotherapy in patients with non-small cell lung cancer (NSCLC).


This last example is instructive. The retriever returned a document that begins with "Yes," which, taken literally, would answer our query ("keytruda and chemotherapy are the same?") affirmatively. In the run shown above, the evaluator accepted this document (Evaluation: Yes), so no web search was triggered. Even so, GPT did not repeat the misleading "Yes": it explained the difference between Keytruda and chemotherapy and used the retrieved document only for the efficacy comparison.