## RAG notebook
#### This notebook will go over basic RAG examples

#### What is RAG? 

LLMs excel at producing answers based on "general" knowledge (this is the data they were originally trained on). When posing questions to an LLM based on niche data, or user-specific data, they will usually fail to produce correct or relevant answers. 

We want to somehow add background knowledge to our question so the LLM knows what to do with it. Here is the example

query = 'Summarize the history of Virginia' 

^ this query might be okay - it is possible the model has seen enough information about Virginia to answer, but maybe not! 

better_query = 'Summarize the history of Virginia + [state creation, state constitution, key figures etc.] 

the content in the brackets is fetched from somewhere. 

1) How do we effectively store data? 
2) How do we know what is "relevant" data?
3) How much context is "enough"

^ these are some of the key questions that engineers and researchers ask themselves when studying and building RAG systems

#### Context Window Trade-off

Why can't we just paste in the relevant data directly into our query without dealing with storage and retrieval? This is because LLMs have limited "context windows". 

"Context windows" are how many tokens the LLM can process at once. This is like the working memory of an LLM. A helpful analogy is person A gives someone digits of pi and then asks person B to recall the digits in forward and backwards order. The number of digits they can do this without error is essentially a context window. 

If LLMs had infinite context windows, we wouldn't need RAG. There are pros and cons. 

1) Long Context Windows - consistently outperform RAG, but are more expensive (when using free models, this means it takes more time to analyze). 

2) RAG is efficient - you only supply the most relevant information and is faster. 

As with many concepts in software, there are performance vs. efficiency tradeoffs. Ideally, it helps to leverage both *together* when possible

## Example 1 - Basic Query vs. Augmented Query

#### The example will show the differences about when we ask a query about a topic that an LLM may or may not have enough background information on w/ and w/o context. 



#### Early History of VA (general knowledge)

In [1]:
import os
import ollama 
general_query = 'give me a three paragraph summary of early Virgina history'

response = ollama.generate(model = 'gemma2:2b',
                           prompt = general_query)['response']
#let's write the response to a text file for easy reading 

f = open('llm_va_history_response.txt','w')
f.write(response)
f.close() #make sure to close the file

#check your project directory for the generated text file 

### Early History of VA (with context)

#### paragraphs taken from: https://www.britannica.com/place/Virginia-state/Independence-and-statehood

In [2]:
early_va_history = open('early_va_history.txt').read() 

query_with_context = f"{general_query}. Use this text as context: {early_va_history}"

response_with_context = ollama.generate(model = 'gemma2:2b',
                                        prompt = query_with_context)['response']

f = open('llm_response_with_context.txt','w')
f.write(response_with_context)
f.close() 

#again, check your project directory 
#compare answers to the generated files 



#### Note here that Gemma2-2b's context window was long enough such that we could just add the contents of the text file to the query. This is an ideal scenario.

#### With RAG, you want to break up the text into smaller chunks and store it. This process is called "indexing". 

#### Recall, that LLMs, and all other ML models, convert text into vectors (technically "tensors", but not necessary to understand how those work here). These vectors represent the text in a sensible format that is learned during training. (How LLMs are trained will be in a different notebook)

#### First, we compute embeddings for our documents. Documents don't strictly have to be text. "Documents" in RAG mean some source data that we want to index. They can be images and audio. 

#### Embeddings are learned in a way that similar words, sentences etc. produce "similar" vectors. This means that their "distance" from each other is close. Indexing documents is a way to pair embeddings with the documents they came from so when we do find similar vectors, we know what documents they are associated with. How we pair is a design decision based on application! 

#### When we pose a query to an LLM and we want to fetch context, we want to fetch documents which are relevant to the query. This means we need to embed the query and compare to documents we have indexed. 



In [3]:
#function to create embeddings from a file 
def make_embeddings(file,
                    model = 'all-minilm:33m'):
    
    
    lines = open(file,'r').read().split('.')
    embedding_dict = {} #dictionary for embeddings
    for i,line in enumerate(lines):
        '''
        For each line, we are going to compute the embedding
        each line is a context chunk that we will compare
        a query to
        '''
        embedding = ollama.embeddings(model = model,
                                      prompt = line)
        
        embedding_dict[line] = embedding['embedding']
   
    return embedding_dict 

file = 'early_va_history.txt'
model = 'all-minilm:33m'
#ollama.pull(model)
embedding_dict = make_embeddings(file = file,model = model)

#### Now let's compute the embedding distances between a query and the chunks. We will rank the chunks from most similar (i.e. smallest distance) to least similar (greatest distance). To do this efficiently, we will use a data structure called a "heap" to track the top-k most relevant documents

#### We don't *have* to use heaps since our text file is small, but as the number of documents increase - we need to store the most relevant ones efficiently. 

In [4]:

import numpy as np 
import heapq 
query = "who were the earliest people in modern day Virginia?"
top_k = 3 #top 3 chunks, the document has 12 sentences so pick a number 1 <= top_k <= 11 

def compute_distances(query,embedding_dict):
    query_embedding = np.array(ollama.embeddings(model = model,
                                        prompt = query)['embedding'])
    distance_dict = {}
    for chunk,_embedding in embedding_dict.items():
        embedding = np.array(_embedding)
        distance = np.linalg.norm(embedding - query_embedding)
        distance_dict[chunk] = distance.item()
    
    top_k_chunks = heapq.nlargest(top_k,distance_dict)
    return {'chunks': top_k_chunks,
            'distances': distance_dict}

distance_dict = compute_distances(query,embedding_dict)
#let's now see the top chunks (not necessarily sorted)
for i,chunk in enumerate(distance_dict['chunks']):
    chunk_query_dist = distance_dict['distances'][chunk]
    print(f"chunk {i+1} (distance: {chunk_query_dist:.4f}): {chunk}")

    




        

chunk 1 (distance: 4.6651): The original inhabitants of Virginia arrived some 10,000 to 12,000 years ago
chunk 2 (distance: 5.6623):  These were people of Paleo-Indian culture, who, like their successors, the Archaic-culture people, lived mainly by hunting and fishing
chunk 3 (distance: 4.6479):  The coastal areas of eastern Virginia supported a significant population of indigenous peoples who fished in the rivers and bays and hunted wild fowl


#### GraphRAG example coming soon 