# RAG – Retrieval Augmented Generation

## CIML Summer Institute

#### UC San Diego



In [2]:
import os
import warnings

import numpy as np
from chromadb.utils import embedding_functions
from langchain.chains import RetrievalQA
from langchain.memory import ConversationSummaryMemory
from langchain.prompts import PromptTemplate
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma

# Silence library warnings so the notebook output stays readable
warnings.filterwarnings('ignore')


## Setup

In this RAG tutorial, we'll be working with [LangChain](https://python.langchain.com/v0.2/docs/introduction/), which is a powerful framework for building applications with language models. LangChain provides utilities for working with various language model providers, integrating embeddings, and creating chains for more complex applications. The cell above contains the imports needed for this notebook.

We are also using [Ollama](https://ollama.com/), a platform for running LLMs on your local machine. The following steps set up Ollama for this RAG tutorial:
1. In a terminal window in JupyterLab, run the following command to start the Ollama service: **ollama serve**
2. In another terminal window in JupyterLab, run the following command to download the model: **ollama pull mistral**
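
Before moving on, it can help to confirm that the service started in step 1 is reachable. The sketch below is one way to check, assuming Ollama is listening on its default local address `http://localhost:11434`; `ollama_is_running` is a hypothetical helper name, not part of Ollama or LangChain.

```python
import urllib.request

def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # Connection refused, timeout, DNS failure, or malformed URL.
        return False

print("Ollama reachable:", ollama_is_running())
```

If this prints `False`, check that `ollama serve` is still running in its terminal window.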

## Part 1: Retrieval

- In this section, we'll focus on the retrieval aspect of RAG. We'll start by understanding vectorization, followed by storing and retrieving vectors efficiently.

#### Vectorizing

- **Vectorization** is the process of converting text into vectors in an embedding space. These vectors capture the semantic meaning of the text, enabling us to perform various operations like similarity calculations. We'll use HuggingFaceEmbeddings for this task, which is a class in LangChain that allows for an **embedding model** from HuggingFace to be integrated into LangChain applications. You can see the documentation for this LangChain class [here](https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.huggingface.HuggingFaceEmbeddings.html).


In [2]:
vectorizer = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

This vectorizer converts text into vectors in embedding space. Let's see how we can use it.

In [3]:
vectorizer.embed_query("dog")[0:10]

[-0.05314699932932854,
 0.014194376766681671,
 0.0071457442827522755,
 0.06860868632793427,
 -0.0784803256392479,
 0.010167454369366169,
 0.10228314995765686,
 -0.012064827606081963,
 0.09521343559026718,
 -0.030350156128406525]

- As you can see from above, this converts text into a series of numbers. 

### Task 1

Your job is to **write a function that takes in two strings, vectorize them, and return their cosine similarity.** Implement the following function.

#### `similarity_two_queries`

In [4]:
def similarity_two_queries(word1, word2):
    # HINT:
    # Use vectorizer.embed_query(<text>) to embed text.
    # Use np.dot to compute the dot product of the two vectors; for
    # unit-length embeddings this equals their cosine similarity.
    # TODO

    return None

In [5]:
# SOLUTION
def similarity_two_queries(word1, word2):
    # Embed both strings, then compare them with a dot product
    word1_vec = vectorizer.embed_query(word1)
    word2_vec = vectorizer.embed_query(word2)
    return np.dot(word1_vec, word2_vec)

- Observe the similarity scores of both **'cat'** and **'dog'** to the word **'kitten'**

In [6]:
print("Similarity of 'kitten' and 'cat': ",similarity_two_queries("kitten","cat"))
print("Similarity of 'kitten' and 'dog': ",similarity_two_queries("kitten","dog"))

Similarity of 'kitten' and 'cat':  0.7882107945392884
Similarity of 'kitten' and 'dog':  0.520505095530218


- By using the previously defined function,  we can take pairs of texts and quantify how **similar** they are.
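
The dot product used in `similarity_two_queries` equals cosine similarity only when both vectors have unit length. As a minimal sketch, the code below computes cosine similarity with explicit normalization so that it works for any nonzero vectors. It uses made-up 3-dimensional vectors rather than real embeddings, and `cosine_similarity` is a hypothetical helper name, not a LangChain function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (made up for illustration, not real embeddings).
v1 = [1.0, 2.0, 2.0]
v2 = [2.0, 4.0, 4.0]   # same direction as v1, twice as long
v3 = [2.0, -1.0, 0.0]  # orthogonal to v1

print(cosine_similarity(v1, v2))  # 1.0
print(cosine_similarity(v1, v3))  # 0.0
```

Scaling a vector changes its dot product with other vectors but not its cosine similarity, which is why the two measures agree only for normalized embeddings.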

### Task 2

**Which of the following words in the list `words` are most related to the word 'color'?** The function `similarity_list` takes a word and a list of words, and returns each listed word with its similarity score, sorted from highest to lowest.

In [7]:
def similarity_list(word, candidates):
    # Score every candidate against `word`, then sort from most to least similar
    scores = [(candidate, similarity_two_queries(word, candidate)) for candidate in candidates]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

In [8]:
words = ["rainbow","car","black","red","cat","tree"]

In [9]:
# TODO: Which words are most similar to color?

In [10]:
# SOLUTION
similarity_list("color",words)

[('black', 0.7855720868733791),
 ('red', 0.7491880062040888),
 ('rainbow', 0.5601088091565491),
 ('car', 0.4040626690053575),
 ('cat', 0.3583904470705671),
 ('tree', 0.357355130693161)]

### Task 3

Each query below has an appropriate text that allows you to answer the question. The function `match_queries_with_texts` matches a query with its most related text. **Come up with 3 more questions and 3 suitable answers and add them to the list below.**

In [11]:
def match_queries_with_texts(queries, texts):
    # Calculate similarities between each query and text
    similarities = np.zeros((len(queries), len(texts)))
    
    for i, query in enumerate(queries):
        for j, text in enumerate(texts):
            similarities[i, j] = similarity_two_queries(query, text)
    
    # Match each query to the text with the highest similarity
    matches = {}
    for i, query in enumerate(queries):
        best_match_idx = np.argmax(similarities[i])
        matches[query] = texts[best_match_idx]
    
    return matches

In [12]:
# TODO: Fill in the list to make suitable question-text pairs.

queries = ["What are the 7 colors of the rainbow?", 
           "What does Elsie do for work?", 
           "Which country has the largest population?",
           "-- INSERT QUERY 1 HERE--",
           "-- INSERT QUERY 2 HERE--",
           "-- INSERT QUERY 3 HERE--"]
texts = ["China has 1.4 billion people.",
         "Elsie works the register at Arby's.", 
         "The colors of the rainbow are ROYGBIV.",
         "-- INSERT TEXT 1 HERE--",
         "-- INSERT TEXT 2 HERE--",
         "-- INSERT TEXT 3 HERE--"]

In [13]:
#SOLUTION
queries = ["What are the 7 colors of the rainbow?", 
           "What does Elsie do for work?", 
           "Which country has the largest population?",
           "What time is it?",
           "What is the largest continent?",
           "Who is the greatest Football player?"]
texts = ["China has 1.4 billion people.",
         "Elsie works the register at Arby's.", 
         "The colors of the rainbow are ROYGBIV.",
         "The time is 3:14.",
         "The largest continent is Asia.",
         "Christiano Ronaldo"]

- Now we shuffle the queries and texts. Let's see if we can match them!

In [14]:
import random
random.shuffle(queries)
random.shuffle(texts)

match_queries_with_texts(queries, texts)

{'Who is the greatest Football player?': 'Christiano Ronaldo',
 'What are the 7 colors of the rainbow?': 'The colors of the rainbow are ROYGBIV.',
 'What time is it?': 'The time is 3:14.',
 'Which country has the largest population?': 'China has 1.4 billion people.',
 'What does Elsie do for work?': "Elsie works the register at Arby's.",
 'What is the largest continent?': 'The largest continent is Asia.'}

#### Database

Now let's look at how we can store these vectors for efficient retrieval. There are many storage options, but in this exercise we use [ChromaDB](https://python.langchain.com/v0.1/docs/integrations/vectorstores/chroma/), which is an open-source vector DB.

Through LangChain, we can expose the database as a LangChain ***retriever*** object, which lets us run queries much like the ones we ran before.

**Taking the `texts` and `queries` that you defined before, we can load them into ChromaDB and perform the same operations.**
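
Conceptually, a retriever with `k` results scores every stored vector against the query vector and keeps the `k` best. The sketch below shows that idea as a plain linear scan over made-up 2-dimensional vectors. It is an illustration only: `top_k` is a hypothetical helper, and it does not reflect how ChromaDB indexes or scores vectors internally.

```python
import heapq

def top_k(query_vec, stored, k=1):
    """Return the k (text, score) pairs with the highest dot-product score.

    `stored` is a list of (text, vector) pairs.
    """
    scored = [
        (text, sum(q * v for q, v in zip(query_vec, vec)))
        for text, vec in stored
    ]
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Made-up 2-dimensional "embeddings", purely for illustration.
stored = [
    ("text about population", [0.9, 0.1]),
    ("text about rainbows", [0.1, 0.9]),
    ("text about jobs", [0.6, 0.6]),
]
query = [1.0, 0.0]
print(top_k(query, stored, k=2))
```

Here the population text scores highest and the jobs text second. A linear scan like this is fine for a handful of texts; vector databases exist to make the same lookup fast for large collections.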

In [15]:
ids = list(range(len(texts)))
db = Chroma.from_texts(texts, vectorizer, metadatas=[{"id": id} for id in ids])
retriever = db.as_retriever(search_kwargs={"k": 1})

In [16]:
texts

['The time is 3:14.',
 'The colors of the rainbow are ROYGBIV.',
 "Elsie works the register at Arby's.",
 'The largest continent is Asia.',
 'China has 1.4 billion people.',
 'Christiano Ronaldo']

In [17]:
retriever.invoke("Which country has the largest population?")

[Document(page_content='China has 1.4 billion people.', metadata={'id': 4})]

### Task 4

Now let’s apply the same retrieval process to a file we read in. The file `workplaces.txt` contains the names and workplaces of several people.


In [18]:
with open("workplaces.txt", 'r') as file:
    lines = file.readlines()
lines = [line.strip() for line in lines]
print(lines[0:4])

["Aaron works at McDonald's", 'Beth works at Starbucks', 'Charlie works at Walmart', 'Daisy works at Amazon']


`workplace_retriever` is a function that reads `workplaces.txt` and returns a retriever backed by a database of its lines, which you can use to find out where the people in the file work. The argument `k` sets how many top results are returned.

In [19]:
def workplace_retriever(k=3):
    with open("workplaces.txt", 'r') as file:
        lines = file.readlines()
    lines = [line.strip() for line in lines]
    db = Chroma.from_texts(lines,vectorizer, metadatas=[{"id": id} for id in list(range(len(lines)))])
    retriever = db.as_retriever(search_kwargs={"k": k})
    return retriever

Using `workplace_retriever`, **find out who works at Starbucks and McDonald's**.

In [20]:
# TODO: Find out who works at Starbucks and who works at McDonald's. Use workplace_retriever(k=3).invoke(<query>) to do this.
# You can experiment with the value of k to make sure you find everyone who works at one place.

In [21]:
# SOLUTION
workplace_retriever(3).invoke("Who works at Starbucks")

[Document(page_content='Brian works at Starbucks', metadata={'id': 27}),
 Document(page_content='Beth works at Starbucks', metadata={'id': 1}),
 Document(page_content="Aaron works at McDonald's", metadata={'id': 0})]

In [22]:
# SOLUTION
workplace_retriever(3).invoke("Who works at McDonald's")

[Document(page_content="Aaron works at McDonald's", metadata={'id': 0}),
 Document(page_content="Alice works at McDonald's", metadata={'id': 26}),
 Document(page_content='Wendy works at Reddit', metadata={'id': 22})]

### Task 5

#### Chunking

The `workplaces.txt` data we just looked at was conveniently split into lines, with each line representing a distinct and meaningful chunk of information. This straightforward structure makes it easier to process and analyze the text data.

However, it is usually not so straightforward:
- When dealing with text data, especially from large or complex documents, it's essential to handle the formatting and structure efficiently.
- If we get a not-so-simply formatted file, we can break it down into manageable chunks using LangChain's `TextLoader` and `RecursiveCharacterTextSplitter`.
- This allows us to preprocess and chunk the data effectively for further use in our RAG pipeline.

Let's take a look at some of the *Expanse* documentation [here](https://www.sdsc.edu/support/user_guides/expanse.html). We have downloaded the contents of this webpage into two text files named `expanse_doc_1.txt` and `expanse_doc_2.txt`.

In [23]:
with open("expanse_doc_1.txt", 'r') as file:
    lines = file.readlines()
lines = [line.strip() for line in lines]
print(lines[20:35])

['Job Charging', 'Compiling', 'Running Jobs', 'GPU Nodes', 'Data Movement', 'Storage', 'Composable Systems', 'Software Packages', 'Publications', 'Expanse User Guide', 'Technical Summary', '', '', 'Expanse is a dedicated Advanced Cyberinfrastructure Coordination Ecosystem: Services and Support (ACCESS) cluster designed by Dell and SDSC delivering 5.16 peak petaflops, and will offer Composable Systems and Cloud Bursting.', '']


- We see that the text is not split into meaningful chunks of information by default, so we need to do our best to format it in a way that is useful. This is why we use chunks, which group neighboring pieces of text together.

A function that chunks `expanse_doc_1.txt` has been provided below.  When using the RecursiveCharacterTextSplitter, the chunk size determines the maximum size of each text chunk. This is particularly useful when dealing with large documents that need to be split into smaller, manageable pieces for better retrieval and analysis.
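
To build intuition for how chunk size and overlap interact, here is a deliberately simplified sketch that slices text into fixed-size character windows. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that class also tries to split on the given separators); `chunk_text` is a hypothetical helper written only for illustration.

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks

sample = "Expanse supports shared-node jobs."
for size in (10, 20):
    print(size, chunk_text(sample, chunk_size=size, chunk_overlap=3))
```

Small windows cut sentences into fragments that carry little meaning on their own, while larger windows keep more surrounding context in each chunk.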

**Experiment with different chunk sizes** and pick a size that captures enough information to answer the question: ***"How do you run jobs on expanse?"*** Try sizes **10, 100 and 1000** and observe what info is being given.

In [24]:
def expanse_retriever(chunk_size):
    loader = TextLoader('expanse_doc_1.txt')
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=10, separators=[" ", ",", "\n"])
    texts = text_splitter.split_documents(documents)
    db = Chroma(embedding_function=vectorizer)
    db.add_documents(texts)
    retriever = db.as_retriever(search_kwargs={"k": 3})
    return retriever

In [25]:
# TODO: Think about how many characters would be needed to contain useful information for such a complex task


In [26]:
# SOLUTION
expanse_retriever(1000).invoke("How do you run jobs on expanse?")

[Document(page_content='up to 30M core-hours.\nJob Scheduling Policies\nThe maximum allowable job size on Expanse is 4,096 cores – a limit that helps shorten wait times since there are fewer nodes in idle state waiting for large number of nodes to become free.\nExpanse supports long-running jobs - run times can be extended to one week. Users requests will be evaluated based on number of jobs and job size. \nExpanse supports shared-node jobs (more than one job on a single node). Many applications are serial or can only scale to a few cores. Allowing shared nodes improves job throughput, provides higher overall system utilization, and allows more users to run on Expanse.\nTechnical Details\nSystem Component\tConfiguration\nCompute Nodes\nCPU Type\tAMD EPYC 7742\nNodes\t728\nSockets\t2\nCores/socket\t64\nClock speed\t2.25 GHz\nFlop speed\t4608 GFlop/s\nMemory capacity\t\n* 256 GB DDR4 DRAM\n\nLocal Storage\t\n1TB Intel P4510 NVMe PCIe SSD\n\nMax CPU Memory bandwidth\t409.5 GB/s\nGPU Nodes

### Task 6
#### Multiple Document Chunking

When we have more than one document we want to use in our database, we can simply iteratively chunk them. Metadata for the text source is added by default, but we can add our own metadata as well in the form of IDs.


The function `expanse_all_retriever`, provided below, chunks both `expanse_doc_1.txt` and `expanse_doc_2.txt`. Using a chunk size of 1000 characters, **find which document is most likely to contain the information for "Compiling Codes".** *Hint: Look at the metadata.*

In [27]:
def expanse_all_retriever(chunk_size):
    import glob
    db = Chroma(embedding_function=vectorizer)
    pattern = 'expanse_doc_*.txt'
    file_list = glob.glob(pattern)
    for file_name in file_list:
        loader = TextLoader(file_name)
        documents = loader.load()
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=10, separators=[" ", ",", "\n"])
        texts = text_splitter.split_documents(documents)
        for id,text in enumerate(texts):
            text.metadata["chunk_number"] = id
        db.add_documents(texts)

    
    retriever = db.as_retriever(search_kwargs={"k": 3})
    return retriever

In [28]:
# TODO: Find the relevant source for the query "Compiling Codes"

In [29]:
# SOLUTION
chunks = expanse_all_retriever(1000).invoke("Compiling Codes")
for chunk in chunks:
    print(chunk.metadata)

# The answer is expanse_doc_2.txt

{'source': 'expanse_doc_2.txt', 'chunk_number': 0}
{'source': 'expanse_doc_2.txt', 'chunk_number': 1}
{'source': 'expanse_doc_2.txt', 'chunk_number': 5}
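
With only three chunks the answer is easy to read off, but for larger `k` it may help to tally the `source` field. The sketch below does this with `collections.Counter`, using plain dictionaries copied from the output above in place of real `Document` metadata.

```python
from collections import Counter

# Metadata dictionaries copied from the output above.
retrieved_metadata = [
    {"source": "expanse_doc_2.txt", "chunk_number": 0},
    {"source": "expanse_doc_2.txt", "chunk_number": 1},
    {"source": "expanse_doc_2.txt", "chunk_number": 5},
]

# Count how many retrieved chunks came from each source file.
source_counts = Counter(meta["source"] for meta in retrieved_metadata)
best_source, hits = source_counts.most_common(1)[0]
print(f"{best_source} supplied {hits} of {len(retrieved_metadata)} chunks")
```

In the notebook, the same tally could be built from `chunk.metadata` for each retrieved chunk.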


## Part 2: Basic RAG


Ollama is an open-source LLM platform that allows us to use a plethora of different LLMs. In this notebook, Mistral is our LLM of choice. Feel free to play around with it.

In [30]:
ollama = Ollama(model="mistral")
ollama.invoke("How are you doing?")

" I'm just a computer program, so I don't have feelings or experiences like humans do. But I'm here and ready to help you with any question you might have! Let's chat! What can I assist you with today?"

### Task 7

**Write a function that passes your question to the retriever returned by `workplace_retriever`, collects the relevant documents, and then sends this context to Ollama so it can answer your question in natural language.** Fill in `workplace_question`, which accomplishes this task.

In [31]:
# TODO
def workplace_question(question):
    retriever = None  # TODO: assign the retriever
    context = None  # TODO: invoke the retriever here
    ollama = Ollama(model="mistral")
    prompt = f"Based on the following context: {context}, answer the question: "
    response = None  # TODO: invoke ollama with the prompt and question
    return response

In [32]:
#SOLUTION
def workplace_question(question):
    retriever = workplace_retriever()
    context = retriever.invoke(question)
    ollama = Ollama(model="mistral")
    prompt = f"Based on the following context: {context}, answer the question: "
    response = ollama.invoke(prompt + question)
    return response

In [33]:
workplace_question("Who works at Starbucks?")

' The individuals who work at Starbucks are Brian (from document with id 27) and Beth (from document with id 1).'
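
In the solution, `context` is a list of `Document` objects, so the f-string embeds their full printed form, metadata included, which is probably why the answer mentions document ids. One possible refinement (a sketch, not the tutorial's method) is to pass only the text of each document. `Doc` below is a hypothetical stand-in for LangChain's `Document` so the example runs without LangChain, and `build_prompt` is likewise a made-up helper name.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for LangChain's Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def build_prompt(docs, question):
    """Join the text of each retrieved document and append the question."""
    context = "\n".join(doc.page_content for doc in docs)
    return (
        "Based on the following context:\n"
        f"{context}\n"
        f"answer the question: {question}"
    )

docs = [
    Doc("Brian works at Starbucks", {"id": 27}),
    Doc("Beth works at Starbucks", {"id": 1}),
    Doc("Aaron works at McDonald's", {"id": 0}),
]
prompt = build_prompt(docs, "Who works at Starbucks?")
print(prompt)
```

The resulting string could be passed to `ollama.invoke` in place of `prompt + question`.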

## Part 3: LangChain RAG

The above is a very simple example of RAG. Using LangChain, we can put everything together more cleanly in a single function. Let's combine everything we've learned into the function `generate_rag`.

- The implementation below defines a custom class that lets us see which chunks are retrieved for each query.

In [34]:
def generate_rag(verbose=False, chunk_info=False):
    import glob
    vectorizer = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    db = Chroma(embedding_function=vectorizer)
    pattern = 'expanse_doc_*.txt'
    file_list = glob.glob(pattern)
    for file_name in file_list:
        loader = TextLoader(file_name)
        documents = loader.load()
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=10, separators=[" ", ",", "\n"])
        texts = text_splitter.split_documents(documents)
        for id,text in enumerate(texts):
            text.metadata["chunk_number"] = id
        db.add_documents(texts)
    
    template = """<s>[INST] Given the context - {context} </s>[INST] [INST] Answer the following question - {question}[/INST]"""
    pt = PromptTemplate(
                template=template, input_variables=["context", "question"]
            )
    # Let's retrieve the top 3 chunks for our results
    retriever = db.as_retriever(search_kwargs={"k": 3})
    class CustomRetrievalQA(RetrievalQA):
        def invoke(self, *args, **kwargs):
            result = super().invoke(*args, **kwargs)
            if chunk_info:
                # Print out the chunks that were retrieved
                print("Chunks being looked at:")
                chunks = retriever.invoke(*args, **kwargs)
                for chunk in chunks:
                    print(f"Source: {chunk.metadata['source']}, Chunk number: {chunk.metadata['chunk_number']}")
                    print(f"Text snippet: {chunk.page_content[:200]}...\n")  # Print the first 200 characters
            return result
    rag = CustomRetrievalQA.from_chain_type(
        llm=Ollama(model="mistral"),
        retriever=retriever,
        memory=ConversationSummaryMemory(llm=Ollama(model="mistral")),
        chain_type_kwargs={"prompt": pt, "verbose": verbose},
    )

    return rag

### Task 8
**Compare how mistral performs without context, and with context, i.e. without RAG and with RAG.**

In [35]:
print(ollama.invoke("How do you check available resources on Expanse Supercomputer"))

 To check the available resources on an Expanse Supercomputer, you would typically use command-line tools or graphical interfaces provided by the supercomputing system. Here's a general guide for using some common commands:

1. **SSH (Secure Shell)**: You can connect to your allocated node(s) using SSH and then run commands to check resource availability. For example, you might use `uname -a` to get detailed system information, or `free -h` to see memory usage, among other commands.

2. **Resource Management System (RMS)**: Most supercomputers have a Resource Management System like Slurm, LSF, or Torque. These systems allow you to submit jobs for processing and monitor their status. To check resource availability in such a system, you can use commands like `squeue` to see currently running jobs, pending jobs, and resource usage.

3. **Monitoring Tools**: Some supercomputers have monitoring tools that provide real-time or historical data about the system's performance and resource utili

In [36]:
expanse_rag = generate_rag()
result = expanse_rag.invoke("How do you check available resources on Expanse Supercomputer")
print(result["result"])

 To check the available resources on the Expanse Supercomputer, you can use the command `expanse-client user -r expanse`. This command will display a table showing the list of projects and their usage details on the designated resource (Expanse by default if no resource is specified). If you want to see the full list of available resources, you can use the 'resource' command without any parameters.

Here's the command again for your convenience:

* To check available projects on Expanse: `[user@login01 ~]$ expanse-client user -r expanse`
* To see full list of available resources: `[user@login02 ~]$ resource`


**When we set `verbose` to True, we can see exactly what is being passed into the LLM, highlighted in green.**

In [37]:
expanse_rag = generate_rag(verbose=True)
result = expanse_rag.invoke("How do you check available resources on Expanse Supercomputer")
print(result["result"])



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
<s>[INST] Given the context - script provides additional details regarding project availability and usage.  The script is located at:

/cm/shared/apps/sdsc/current/bin/expanse-client

The script uses the 'sdsc' module, which is loaded by default. 

[user@login01 ~]$ module load sdsc
 
To review your available projects on Expanse resource use the 'user' parameter and '-r' to desginate a resource.  If no resouce is designated expanse data will be shown by default.

user@login01 ~]$ expanse-client user -r expanse

Resource expanse

╭───┬─────────────┬─────────┬────────────┬──────┬───────────┬─────────────────╮
│   │ NAME        │ PROJECT │ TG PROJECT │ USED │ AVAILABLE │ USED BY PROJECT │
├───┼─────────────┼─────────┼────────────┼──────┼───────────┼─────────────────┤
│ 1 │ user        │ ddp386  │            │ 0    │ 110000    │ 8318            │
╰───┴─────

**For more concise information, setting `chunk_info` to True lets us see the details of each retrieved chunk as well as its source.**

In [38]:
expanse_rag = generate_rag(chunk_info=True)
result = expanse_rag.invoke("How do you check available resources on Expanse Supercomputer")

Chunks being looked at:
Source: expanse_doc_1.txt, Chunk number: 12
Text snippet: script provides additional details regarding project availability and usage.  The script is located at:

/cm/shared/apps/sdsc/current/bin/expanse-client

The script uses the 'sdsc' module, which is lo...

Source: expanse_doc_1.txt, Chunk number: 1
Text snippet: Compute Units (SSCUs), comprising 728 standard nodes, 54 GPU nodes and 4 large-memory nodes. Every Expanse node has access to a 12 PB Lustre parallel file system (provided by Aeon Computing) and a 7 P...

Source: expanse_doc_1.txt, Chunk number: 2
Text snippet: up to 30M core-hours.
Job Scheduling Policies
The maximum allowable job size on Expanse is 4,096 cores – a limit that helps shorten wait times since there are fewer nodes in idle state waiting for lar...



In [39]:
print(result["result"])

 To check the available resources on the Expanse Supercomputer, you can use the command `expanse-client user -r expanse`. This will display a table of available projects on the Expanse resource.

Here's how to execute it:

```bash
[user@login01 ~]$ module load sdsc
[user@login01 ~]$ expanse-client user -r expanse
```

If you want to see a full list of available resources, use the `resource` command without any parameters:

```bash
[user@login02 ~]$ resource
```

For more detailed information about Expanse resources such as CPU and GPU types, memory capacity, and storage configurations, you can refer to the technical details provided in the context.


#### Great work! We've officially made a chatbot that can help us out with all things *Expanse*, at least according to the 2 .txt files we have access to!