## Assignment Week 6

### RAG using Wikipedia

Retrieval-augmented generation (RAG) is a technique for improving the output of a large language model (LLM) by having it reference knowledge outside its training data before generating a response. RAG can augment the LLM's knowledge with information that was not part of the original training data. In the activity below, the RAG approach is used with the OpenAI LLM `gpt-4o-mini`, which has a knowledge cut-off date of `Oct-2023`. Hence, the gpt-4o-mini model cannot accurately answer questions about events that happened after the `cut-off` date. It may still make up an answer even though the information was not part of its training data, but such an answer may be false or untrustworthy. With the RAG approach, a relevant and trustworthy response can be obtained by passing the additional data in the context.

For this exercise, questions about the wildfires that happened in January 2025 are first sent to the LLM `without RAG approach`. The same questions are then asked using the `RAG approach`, with information about the 2025 California wildfires from the Wikipedia page `January 2025 Southern California wildfires` provided in the context. The responses are then compared.

In [1]:
import json
import os

In [2]:
# get api key from file
with open("../../apikeys/openai-keys.json", "r") as key_file:
    api_key = json.load(key_file)["default_api_key"]
os.environ["OPENAI_API_KEY"] = api_key

##### Load Wikipedia page

In [3]:
from langchain.document_loaders import WikipediaLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [4]:
search_term = "January 2025 Southern California wildfires"
docs = WikipediaLoader(query=search_term, load_max_docs=2).load()

In [5]:
print(f"Length of docs: {len(docs)}")
print(f"{json.dumps(docs[0].metadata, indent=2)}")

Length of docs: 2
{
  "title": "January 2025 Southern California wildfires",
  "summary": "Since January 7, 2025, an ongoing series of 23 catastrophic wildfires have affected the Los Angeles metropolitan area and surrounding regions. The fires have been exacerbated by drought conditions, low humidity, and hurricane-force Santa Ana winds, which in some places have reached 100 miles per hour (160 km/h; 45 m/s). As of January 14, 2025, the wildfires have killed at least 25 people, forced over 200,000 to evacuate, and destroyed or damaged more than 12,401 structures. Most of the damage has been done by the two largest fires: the Palisades Fire in Pacific Palisades and the Eaton Fire in Altadena. They are likely the second and fourth most destructive fires in California's history, respectively.\n\n",
  "source": "https://en.wikipedia.org/wiki/January_2025_Southern_California_wildfires"
}


In the RAG approach, the additional information from the knowledge base is first chunked (split) and stored as vector embeddings in a vector database.

1. Perform Chunking

In [6]:

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap  = 50,
    length_function = len,
    is_separator_regex = False,
)

data = text_splitter.split_documents(docs)


[Document(metadata={'title': 'January 2025 Southern California wildfires', 'summary': "Since January 7, 2025, an ongoing series of 23 catastrophic wildfires have affected the Los Angeles metropolitan area and surrounding regions. The fires have been exacerbated by drought conditions, low humidity, and hurricane-force Santa Ana winds, which in some places have reached 100 miles per hour (160 km/h; 45 m/s). As of January 14, 2025, the wildfires have killed at least 25 people, forced over 200,000 to evacuate, and destroyed or damaged more than 12,401 structures. Most of the damage has been done by the two largest fires: the Palisades Fire in Pacific Palisades and the Eaton Fire in Altadena. They are likely the second and fourth most destructive fires in California's history, respectively.\n\n", 'source': 'https://en.wikipedia.org/wiki/January_2025_Southern_California_wildfires'}, page_content='Since January 7, 2025, an ongoing series of 23 catastrophic wildfires have affected the Los Ange

The chunk size is a parameter that sets the maximum number of characters in each chunk. It may need adjustment based on the use case, and it can also be used to control the number of tokens for cost optimization. Larger chunks mean more tokens.

Generally, some overlap is kept between chunks so that text is not cut off abruptly, which could result in a loss of semantic coherence.

Here, chunks are created by splitting at 500 characters with an overlap of 50 characters.
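The effect of `chunk_size` and `chunk_overlap` can be seen in a simplified sketch. The `chunk_text` helper below is hypothetical: it uses a plain sliding window over characters, whereas `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph and line breaks, so its real chunks will differ.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only; not the algorithm LangChain uses.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# 1200 characters of dummy text, split with the same settings as above
sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample, chunk_size=500, chunk_overlap=50)
print([len(c) for c in chunks])
print(chunks[0][-50:] == chunks[1][:50])  # the tail of one chunk repeats at the head of the next
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, which is why the last 50 characters of a chunk reappear at the start of the following chunk.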

It is also a good idea to check the number of tokens the chunks may consume.

In [7]:
import tiktoken as tk

In [8]:
encoding = tk.get_encoding("cl100k_base")
encoded_string = encoding.encode(data[0].page_content)
num_tokens = len(encoded_string)
num_tokens

113
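The single-chunk count above can be extended to the whole set of chunks to get a rough cost estimate. This is a sketch under stated assumptions: `rough_token_count` is a crude stand-in that assumes about 4 characters per token, and the price used in the example is a made-up placeholder, not an actual OpenAI price. In the notebook, a counter such as `lambda s: len(encoding.encode(s))` could be passed in instead.

```python
from typing import Callable, Iterable


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: assumes ~4 characters per token."""
    return max(1, len(text) // 4) if text else 0


def total_tokens(
    chunks: Iterable[str],
    count_tokens: Callable[[str], int] = rough_token_count,
) -> int:
    """Sum the token counts of all chunks using the supplied counter."""
    return sum(count_tokens(chunk) for chunk in chunks)


def estimated_cost(tokens: int, price_per_million: float) -> float:
    """Convert a token count to a cost, given a price per one million tokens."""
    return tokens / 1_000_000 * price_per_million


example_chunks = ["a" * 500, "b" * 500, "c" * 300]
tokens = total_tokens(example_chunks)
print(tokens)
# 0.02 is a hypothetical price per million tokens, used only for illustration
print(estimated_cost(tokens, price_per_million=0.02))
```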

2. Generate embeddings and store embeddings in vector database

In [9]:
from langchain.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

The chunks are converted to vector embeddings and stored in a vector database. This enables retrieval of the information in the knowledge base that best matches the user query; the retrieved text is then sent to the LLM as context in the prompt.

OpenAI's [text-embedding-3-small][1] embedding model is used to generate the embeddings. It is important to note that this is not a generative AI model; it is used only to generate vector embeddings.

[1]:https://platform.openai.com/docs/guides/embeddings
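As a rough illustration of how similarity-based retrieval works, the sketch below replaces the real embedding model with a toy bag-of-words vector and ranks chunks by cosine similarity. The `embed`, `cosine_similarity`, and `retrieve` helpers are hypothetical and are not part of LangChain, Chroma, or the OpenAI API; a real vector store may also use a different distance metric.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts, not a learned dense vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


toy_chunks = [
    "the wildfires killed at least 25 people",
    "santa ana winds reached 100 miles per hour",
    "the palisades fire and the eaton fire were the two largest fires",
]
top = retrieve("how many people were killed", toy_chunks, k=1)
print(top)
```

A learned embedding model captures meaning rather than exact word overlap, so it can match a query to a chunk even when they share few words; the ranking step is the same idea.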

In [10]:
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [11]:
store = Chroma.from_documents(
    collection_name="wikipedia_california_fire_embeddings",
    documents=data,
    embedding=embeddings,
    ids=[f"{doc.metadata['source']}_{index}" for index, doc in enumerate(data)],
    persist_directory="./vector-stores/chroma/"
)

#### Asking questions about the January 2025 California wildfires

In [12]:
from langchain_openai import ChatOpenAI

#### Without RAG

Question 1: How many people were killed in southern california wildfire in January 2025?

In [13]:
openai_model = ChatOpenAI(model="gpt-4o-mini")

In [39]:
messages = [
    {
        "role": "system",
        # Adjacent string literals avoid embedding indentation in the prompt
        "content": (
            "You are a bot that answers questions about January 2025 Southern California wildfires.\n"
            "If you do not know the answer, simply state that you do not know."
        ),
    },
    {
        "role": "user",
        "content": "Q: How many people were killed in southern california wildfire in January 2025?\nA:",
    },
]

In [40]:
response = openai_model.invoke(messages)

In [41]:
print(f"Response from LLM without RAG:\n {response.content}")

'I do not know.'

The OpenAI LLM was not able to answer the question about the California fires that broke out in January 2025 because the model's `cut-off` date is Oct 2023, so it has no knowledge of the 2025 California wildfires.

Note - The LLM was instructed to simply state that it does not know if it does not have an answer, rather than making up an inaccurate one.

#### With RAG

In [32]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import pprint

In [34]:
# Adjacent string literals avoid embedding indentation in the prompt
template = (
    "You are a bot that answers questions about January 2025 Southern California wildfires.\n"
    "If you do not know the answer, simply state that you do not know.\n"
    "{context}\n"
    "Question: {question}"
)

PROMPT = PromptTemplate(
    template=template, input_variables=["context", "question"]
)

In [35]:
llm = ChatOpenAI(temperature=0, model="gpt-4o-mini")

In [36]:
qa_with_source = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=store.as_retriever(),
    chain_type_kwargs={"prompt": PROMPT, },
    return_source_documents=True,
)

In [38]:
pprint.pprint(
    qa_with_source("How many people were killed in southern california wildfire in January 2025?")
)

{'query': 'how many people were killed in souther calrifornia wildfire in '
          'January 2025?',
 'result': 'As of January 14, 2025, the wildfires in Southern California have '
           'killed at least 25 people.',
 'source_documents': [Document(metadata={'source': 'https://en.wikipedia.org/wiki/January_2025_Southern_California_wildfires', 'summary': "Since January 7, 2025, an ongoing series of 23 catastrophic wildfires have affected the Los Angeles metropolitan area and surrounding regions. The fires have been exacerbated by drought conditions, low humidity, and hurricane-force Santa Ana winds, which in some places have reached 100 miles per hour (160 km/h; 45 m/s). As of January 14, 2025, the wildfires have killed at least 25 people, forced over 200,000 to evacuate, and destroyed or damaged more than 12,401 structures. Most of the damage has been done by the two largest fires: the Palisades Fire in Pacific Palisades and the Eaton Fire in Altadena. They are likely the second 

The RAG approach works as expected. The LLM accurately answered that the Southern California wildfires in January 2025 "killed at least 25 people". As the source documents above show, the prompt includes information about the Southern California wildfires from Wikipedia in the `context`. The `retriever` first retrieved from the vector database the context that most closely matches the user question, and that context was then included in the prompt for the LLM call.
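The way retrieved chunks end up inside the prompt can be sketched in plain Python. This is only an illustration of the idea behind the `stuff` chain type: the `build_stuffed_prompt` helper is hypothetical, and the blank-line separator between chunks is an assumption rather than a statement of how LangChain formats the context internally.

```python
TEMPLATE = (
    "You are a bot that answers questions about January 2025 Southern California wildfires.\n"
    "If you do not know the answer, simply state that you do not know.\n"
    "{context}\n"
    "Question: {question}"
)


def build_stuffed_prompt(retrieved_chunks: list[str], question: str) -> str:
    """Join the retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(retrieved_chunks)  # separator is an assumption
    return TEMPLATE.format(context=context, question=question)


prompt = build_stuffed_prompt(
    ["As of January 14, 2025, the wildfires have killed at least 25 people."],
    "How many people were killed in southern california wildfire in January 2025?",
)
print(prompt)
```

Because every retrieved chunk is placed in a single prompt, the number of chunks returned by the retriever and the chunk size together determine how many tokens each question consumes.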

#### Without RAG

Question 2: What is largest wildfire among california wildfires in 2025?

In [43]:
messages = [
    {
        "role": "system",
        # Adjacent string literals avoid embedding indentation in the prompt
        "content": (
            "You are a bot that answers questions about January 2025 Southern California wildfires.\n"
            "If you do not know the answer, simply state that you do not know."
        ),
    },
    {
        "role": "user",
        "content": "Q: What is largest wildfire among california wildfires in 2025?\nA:",
    },
]

In [44]:
response = openai_model.invoke(messages)

In [45]:
print(f"Response from LLM without RAG:\n {response.content}")

Response from LLM without RAG:
 I do not know.


As with Question 1, the OpenAI LLM was not able to answer the question.

#### With RAG

In [46]:
pprint.pprint(
    qa_with_source("What is largest wildfire among california wildfires in 2025?")
)

{'query': 'What is largest wildfire among california wildfires in 2025?',
 'result': 'The largest wildfire among the California wildfires in 2025 is the '
           'Palisades Fire in Pacific Palisades.',
 'source_documents': [Document(metadata={'source': 'https://en.wikipedia.org/wiki/January_2025_Southern_California_wildfires', 'summary': "Since January 7, 2025, an ongoing series of 23 catastrophic wildfires have affected the Los Angeles metropolitan area and surrounding regions. The fires have been exacerbated by drought conditions, low humidity, and hurricane-force Santa Ana winds, which in some places have reached 100 miles per hour (160 km/h; 45 m/s). As of January 14, 2025, the wildfires have killed at least 25 people, forced over 200,000 to evacuate, and destroyed or damaged more than 12,401 structures. Most of the damage has been done by the two largest fires: the Palisades Fire in Pacific Palisades and the Eaton Fire in Altadena. They are likely the second and fourth most de

The LLM returned the correct response about the largest wildfire among the Southern California wildfires in 2025 because the information about the largest wildfire is passed in the context of the prompt sent to the LLM.

Apart from RAG's ability to provide responses using information from a knowledge base, the responses above also include the source documents on which they were based. This is another important aspect of the RAG approach: it allows the user to trust the answer and verify that the response is grounded in the knowledge base.
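To surface those sources to a user, the distinct source URLs can be pulled out of a result shaped like the ones above. This is a minimal sketch: `FakeDocument` is a hypothetical stand-in for the retrieved document objects (assumed to expose a `metadata` dict with a `source` key, as in the printed output), and `unique_sources` is an illustrative helper, not a LangChain function.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a retrieved document with a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def unique_sources(result: dict) -> list[str]:
    """Collect distinct 'source' values from a result's source documents, in order."""
    seen = []
    for doc in result.get("source_documents", []):
        source = doc.metadata.get("source")
        if source and source not in seen:
            seen.append(source)
    return seen


url = "https://en.wikipedia.org/wiki/January_2025_Southern_California_wildfires"
fake_result = {
    "query": "example question",
    "result": "example answer",
    "source_documents": [
        FakeDocument("chunk one", {"source": url}),
        FakeDocument("chunk two", {"source": url}),
    ],
}
print(unique_sources(fake_result))
```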