# Advanced Uses of Generative AI 

## Retrieval Augmented Generation
In this project, I built a Retrieval Augmented Generation (RAG) system to answer questions about the COVID-19 pandemic in the United States, using the Wikipedia page titled "COVID-19 pandemic in the United States" as the data source.

The purpose of this exercise was to explore how to use LangChain to build a system that retrieves relevant passages from a document and passes them to OpenAI's GPT-4 model to generate grounded responses. The steps were loading the document, splitting it into chunks, creating embeddings, storing them in a vector store, and then using a language model to answer queries.

#### Setting Up OpenAI API Key

To start, I imported the necessary libraries and retrieved the API key from the system's environment variables. This ensures that the API key is securely accessed and available for making the API calls.

In [1]:
import openai
import os

# The API key is stored as an environment variable for security reasons.
# It is set in the system environment and can be accessed by using the os.getenv() function.
api_key = os.getenv("OPENAI_API_KEY")

# If the key is missing, it will print an error message to indicate the issue.
if not api_key:
    print("Error: API key is not set. Please ensure the OPENAI_API_KEY environment variable is configured properly.")
    print("To set your API key, run the following command in your terminal: export OPENAI_API_KEY=your_api_key_here")
    exit(1)  # Exit if the API key is not found

# Set the API key for the OpenAI API
openai.api_key = api_key

#### Load Data from Wikipedia

Here, I used the WikipediaLoader from the langchain library to load the content of the Wikipedia page titled "COVID-19 pandemic in the United States." This page served as the knowledge base for the Q&A system.

In [2]:
from langchain.document_loaders import WikipediaLoader

# Loaded the Wikipedia page
loader = WikipediaLoader(query="COVID-19 pandemic in the United States", load_max_docs=1)
documents = loader.load()

#### Split the Text into Chunks

Since the loaded document was too large for the model to process in one go, I split the text into smaller chunks. This was done using the RecursiveCharacterTextSplitter, which ensured that each chunk contained a manageable amount of text while maintaining some overlap to keep context intact.
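To make the roles of chunk size and overlap concrete, here is a minimal sketch of fixed-window chunking. It is a simplified stand-in, not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and sentences). The function name `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text, chunk_size=100, chunk_overlap=20):
    """Cut text into fixed-size windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Hypothetical 250-character sample made of repeating digits
sample = "".join(str(i % 10) for i in range(250))
pieces = split_with_overlap(sample)

print([len(p) for p in pieces])
print(pieces[0][-20:] == pieces[1][:20])  # the tail of one chunk opens the next
```

The overlap means a sentence that straddles a boundary appears at least partly in both neighbouring chunks, at the cost of storing some text twice.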

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the document into smaller chunks of 100 characters, with a 20-character overlap between chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20, length_function=len, is_separator_regex=False)

chunks = text_splitter.split_documents(documents)

# Displayed the first 2 chunks to verify the splitting process
chunks[:2]


[Document(metadata={'title': 'COVID-19 pandemic in the United States', 'summary': "On December 31, 2019, China announced the discovery of a cluster of pneumonia cases in Wuhan. The first American case was reported on January 20, and Health and Human Services Secretary Alex Azar declared a public health emergency on January 31. Restrictions were placed on flights arriving from China, but the initial U.S. response to the pandemic was otherwise slow in terms of preparing the healthcare system, stopping other travel, and testing. The first known American deaths occurred in February and in late February President Donald Trump proposed allocating $2.5 billion to fight the outbreak. Instead, Congress approved $8.3 billion and Trump signed the Coronavirus Preparedness and Response Supplemental Appropriations Act, 2020 on March 6. Trump declared a national emergency on March 13. The government also purchased large quantities of medical equipment, invoking the Defense Production Act of 1950 to a

#### Create Embeddings and Store Them in Chroma

I created embeddings for the text chunks using OpenAI's embedding model. Each embedding represents a chunk as a numeric vector, so chunks with similar meaning end up close together and can be found with a similarity search. I stored the embeddings in a Chroma vector store, which allowed fast retrieval of the chunks most relevant to a query.
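As an illustration of what a similarity search does, the sketch below ranks three toy vectors against a query vector using cosine similarity. The vectors and chunk labels are made up for the example (real embeddings have far more dimensions), and cosine similarity is only one common metric; the metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings" keyed by a description of the chunk
toy_index = {
    "chunk about mask mandates": [0.9, 0.1, 0.0],
    "chunk about vaccine rollout": [0.1, 0.9, 0.1],
    "chunk about federal spending": [0.0, 0.2, 0.9],
}
query_vector = [0.2, 0.8, 0.0]  # stands in for an embedded question

# Sort chunk labels from most to least similar to the query
ranked = sorted(
    toy_index,
    key=lambda name: cosine_similarity(query_vector, toy_index[name]),
    reverse=True,
)
print(ranked[0])  # chunk about vaccine rollout
```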

In [4]:
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Generated embeddings for the chunks using OpenAI's embedding model
embeddings = OpenAIEmbeddings()

# Stored the embeddings in a Chroma vector store for efficient retrieval
store = Chroma.from_documents(
    chunks,
    embeddings,
    ids=[f"{item.metadata['source']}-{index}" for index, item in enumerate(chunks)],
    collection_name="COVID19-USA-Embeddings",
    persist_directory='db',  # Directory to store the embeddings
)

#### Define the Prompt Template for the Q&A System

I defined a prompt template that guided the behavior of the language model when answering questions. The template ensured that the model answered based only on the context provided and avoided fabricating information that wasn’t in the document.

In [5]:
from langchain.prompts import PromptTemplate

template = """You are a bot that answers questions about the COVID-19 pandemic in the United States, using only the context provided. If you don't know the answer, simply state that you don't know.

{context}

Question: {question}"""

PROMPT = PromptTemplate(
    template=template, input_variables=["context", "question"]
)
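To see what the model ultimately receives, the template can be filled by hand with plain `str.format`. This is a sketch of the substitution only; `PromptTemplate` is assumed to perform a comparable fill of `{context}` and `{question}`. The context string is a placeholder, not text from the Wikipedia page.

```python
template = """You are a bot that answers questions about the COVID-19 pandemic in the United States, using only the context provided. If you don't know the answer, simply state that you don't know.

{context}

Question: {question}"""

# Placeholder values: in the real chain the context comes from the retriever
example_context = "Placeholder text standing in for retrieved chunks."
example_question = "When did the COVID-19 vaccine rollout begin in the United States?"

filled_prompt = template.format(context=example_context, question=example_question)
print(filled_prompt)
```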

#### Create the Retrieval Question Answering Model

Using the embeddings stored in Chroma, I created a Retrieval Question Answering (QA) chain. The chain combined OpenAI's language model with the Chroma vector store: it retrieved the chunks most relevant to a question and passed them, together with the prompt template defined above, to the model to generate an answer.

In [6]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

# Initialized the model with temperature=0 to make responses as repeatable as possible
llm = ChatOpenAI(temperature=0, model_name="gpt-4")

# Created the RetrievalQA model using Chroma store and OpenAI model
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Used "stuff" chain type to concatenate documents as context
    retriever=store.as_retriever(),
    chain_type_kwargs={"prompt": PROMPT},
    return_source_documents=True,  # Returned the source documents along with the answer
)
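The "stuff" strategy can be pictured as plain string concatenation: every retrieved chunk is joined into one context block. The sketch below uses a hypothetical helper, `stuff_documents`, and assumes four retrieved chunks joined by a blank line; both the count and the separator are assumptions for illustration, not values read from the chain above.

```python
def stuff_documents(chunk_texts, separator="\n\n"):
    """Join non-empty chunk texts into a single context string, in retrieval order."""
    return separator.join(text.strip() for text in chunk_texts if text.strip())


# Four hypothetical chunks, each at the 100-character limit used by the splitter
retrieved = ["a" * 100, "b" * 100, "c" * 100, "d" * 100]

context = stuff_documents(retrieved)
print(len(context))  # 406: four chunks of 100 characters plus three separators
```

Under these assumptions the model sees only about 400 characters of source text per question, which is worth keeping in mind when an answer comes back as "not in the context".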


#### Ask Questions and Retrieve Answers

I asked several questions related to COVID-19 and used the chain to retrieve answers. The model generated each answer from the chunks retrieved from the Wikipedia page.

##### Question 1: When did the COVID-19 vaccine rollout begin in the United States?
To start, I asked a straightforward factual question.

In [7]:
query1 = "When did the COVID-19 vaccine rollout begin in the United States?"

In [8]:
response1 = qa.invoke({"query": query1})
print(f"Answer: {response1['result']}\n")

Answer: The COVID-19 vaccine rollout began in the United States in December 2020.



**Inference:**

This question worked well and produced an accurate and concise response from the Wikipedia page. Since the vaccine rollout was a major milestone that was widely documented in the source material, the system had no difficulty retrieving and presenting the correct answer. This demonstrates that for well-defined, fact-based queries with clear phrasing, the model and retrieval system perform effectively.

##### Question 2: What public health measures were implemented in the United States during the COVID-19 pandemic?
Next, I asked a more open-ended question, expecting the model to respond with information like mask mandates, event cancellations, and stay-at-home orders.

In [9]:
query2 = "What public health measures were implemented in the United States during the COVID-19 pandemic?"

In [10]:
response2 = qa.invoke({"query": query2})
print("Answer:", response2['result'])

Answer: The text does not provide specific information on the public health measures implemented in the United States during the COVID-19 pandemic.


This result was surprising because I knew the source document mentioned public health measures such as mask mandates. My hypothesis was that the phrase “public health measures” was too formal or abstract, so the retriever returned chunks whose wording did not match the question. The chunk size may also have played a role: with 100-character chunks, relevant information can be split across chunks or left without enough surrounding context, which reduces the chance that any single chunk is retrieved.
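One way to check this hypothesis is to look at the chunks the retriever actually returned, which are available because the chain was created with `return_source_documents=True`. The helper below, `preview_sources`, is a hypothetical name; it is shown against a stand-in response with made-up chunks so that it runs without an API call. With the real chain, `response2` would be passed instead.

```python
from types import SimpleNamespace


def preview_sources(response, width=80):
    """Return one short preview line per chunk in a RetrievalQA-style response dict."""
    lines = []
    for position, doc in enumerate(response.get("source_documents", []), start=1):
        text = " ".join(doc.page_content.split())  # collapse newlines and extra spaces
        lines.append(f"{position}. {text[:width]}")
    return lines


# Stand-in response: the objects only mimic the page_content attribute of a Document
fake_response = {
    "result": "placeholder answer",
    "source_documents": [
        SimpleNamespace(page_content="First  placeholder\nchunk."),
        SimpleNamespace(page_content="Second placeholder chunk."),
    ],
}

for line in preview_sources(fake_response):
    print(line)
```

If none of the previewed chunks mention the topic of the question, the problem lies in retrieval rather than in the model's answer.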

##### Refining the Question: How did the government try to keep people safe during the pandemic?
I rephrased the question to be more conversational and to better align with how the information was described in the source text.

In [11]:
query3 = "How did the government try to keep people safe during the pandemic?"

In [12]:
response3 = qa.invoke({"query": query3})
print(f"Answer: {response3['result']}\n")

Answer: The government tried to keep people safe during the pandemic by implementing several measures. These included the requirement to wear a face mask in specified situations, also known as mask mandates. They also prohibited and cancelled large-scale gatherings, including festivals and sporting events, to prevent the spread of the virus.



**Inference:**

The model responded accurately, mentioning key interventions such as mask mandates and the cancellation of large events. This suggested that even when the content exists in the source material, the phrasing of the query plays a critical role in retrieval performance. Using more natural, everyday language helped the system connect the query more effectively with the embedded chunks.

**Conclusion**

This exercise provided valuable insights into how Retrieval-Augmented Generation (RAG) systems perform depending on query phrasing, chunk size, and model choice. Clear, fact-based questions (like the one about vaccine rollout) returned accurate results. But broader questions (such as “public health measures”) didn’t always pull in relevant content, even when it was present. Rephrasing those questions in a more natural, everyday tone helped improve the answers, showing that how we word a question can affect what the system retrieves.

I also experimented with both GPT-3.5-turbo and GPT-4. GPT-3.5-turbo often gave broader, inferred answers, even when the details weren’t explicitly present in the source. In contrast, GPT-4 was stricter, delivering responses only when the information was clearly available in the retrieved text.

This exercise enhanced my understanding of how refining queries and choosing a model can significantly affect the performance of a RAG system.