# RAG – Retrieval Augmented Generation


In [2]:
!pip install langchain -q
!pip install langchain_community -q
!pip install chromadb -q

In [3]:
import os
import numpy as np
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain.memory import ConversationSummaryMemory
from langchain.prompts import PromptTemplate
from chromadb.utils import embedding_functions

import warnings
warnings.filterwarnings('ignore')

## Setup

In this RAG tutorial, we'll be working with [LangChain](https://python.langchain.com/v0.2/docs/introduction/), a framework for building applications with language models. LangChain provides utilities for working with various language model providers, integrating embeddings, and creating chains for more complex applications. The imports above cover everything this notebook needs.

We are also using [Ollama](https://ollama.com/), a platform for running LLMs on your local machine. Set it up for this tutorial as follows:
1. In a terminal window in JupyterLab, start the Ollama service: **ollama serve**
2. In another terminal window in JupyterLab, download the model: **ollama pull mistral**
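
Before moving on, it can help to confirm that the service from step 1 is actually reachable. The sketch below is one way to do that, assuming Ollama is listening on its default address `http://localhost:11434` and that its `/api/tags` endpoint (which lists locally pulled models) responds. The helper name `ollama_is_running` is hypothetical and not part of any library.

```python
import urllib.request


def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if the Ollama HTTP API answers at base_url, else False.

    base_url is an assumption: 11434 is Ollama's default port, but your
    setup may differ.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Connection refused, timeouts, and HTTP errors all derive from OSError
        return False


print("Ollama reachable:", ollama_is_running())
```

If this prints `False`, check that `ollama serve` is still running in its terminal window.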

## Part 1: Retrieval

- In this section, we'll focus on the retrieval aspect of RAG. We'll start by understanding vectorization, followed by storing and retrieving vectors efficiently.

#### Vectorizing

- **Vectorization** is the process of converting text into vectors in an embedding space. These vectors capture the semantic meaning of the text, enabling us to perform various operations like similarity calculations. We'll use HuggingFaceEmbeddings for this task, which is a class in LangChain that allows for an **embedding model** from HuggingFace to be integrated into LangChain applications. You can see the documentation for this LangChain class [here](https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.huggingface.HuggingFaceEmbeddings.html).


In [None]:
!pip install tf-keras --upgrade -q
!pip install --upgrade transformers sentence-transformers langchain_community -q


In [10]:
from langchain_community.embeddings import HuggingFaceEmbeddings

# Initialize the vectorizer
vectorizer = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")


This vectorizer converts text into vectors in embedding space. Let's see how to use it.

In [11]:
vectorizer.embed_query("dog")[0:10]

[-0.05314696952700615,
 0.014194391667842865,
 0.007145767565816641,
 0.06860870122909546,
 -0.0784803181886673,
 0.010167491622269154,
 0.10228311270475388,
 -0.012064853683114052,
 0.09521345049142838,
 -0.030350126326084137]

- As you can see from above, this converts text into a series of numbers. 

### Task 1

Your job is to **write a function that takes in two strings, vectorizes them, and returns their cosine similarity.** Implement the following function.

#### `similarity_two_queries`

In [12]:
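def similarity_two_queries(word1, word2):
    # Embed both strings, then compute cosine similarity:
    # the dot product divided by the product of the vector norms
    word1_vec = np.array(vectorizer.embed_query(word1))
    word2_vec = np.array(vectorizer.embed_query(word2))
    return np.dot(word1_vec, word2_vec) / (np.linalg.norm(word1_vec) * np.linalg.norm(word2_vec))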

- Observe the similarity scores of both **'cat'** and **'dog'** to the word **'kitten'**

In [13]:
print("Similarity of 'kitten' and 'cat': ",similarity_two_queries("kitten","cat"))
print("Similarity of 'kitten' and 'dog': ",similarity_two_queries("kitten","dog"))

Similarity of 'kitten' and 'cat':  0.7882108523104192
Similarity of 'kitten' and 'dog':  0.5205050432579855


- By using the previously defined function,  we can take pairs of texts and quantify how **similar** they are.
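
To make the similarity score concrete, here is a minimal pure-Python sketch of cosine similarity on toy vectors. The three-dimensional "embeddings" are made-up numbers chosen for illustration, not output from the embedding model, and the helper name `cosine_similarity` is our own.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values): kitten and cat point in similar directions
kitten = [0.9, 0.1, 0.0]
cat = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

print("kitten vs cat:", cosine_similarity(kitten, cat))
print("kitten vs car:", cosine_similarity(kitten, car))
```

Vectors pointing in the same direction score 1, and orthogonal vectors score 0, regardless of their lengths.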

### Task 2

**Which of the words in the list `words` are most related to the word 'color'?** The function `similarity_list` takes a word and a list of candidate words, and returns each candidate with its similarity score, sorted from highest to lowest.

In [14]:
def similarity_list(word, candidates):
    # Score every candidate against `word`, then sort from most to least similar
    scores = [(candidate, similarity_two_queries(word, candidate)) for candidate in candidates]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

In [15]:
words = ["rainbow","car","black","red","cat","tree"]

In [16]:
similarity_list("color",words)

[('black', 0.7855719945035862),
 ('red', 0.7491879989606778),
 ('rainbow', 0.5601087130572295),
 ('car', 0.4040626729451856),
 ('cat', 0.35839052763321166),
 ('tree', 0.35735521995380115)]

### Task 3

Each query below has an appropriate text that allows you to answer the question. The function `match_queries_with_texts` matches a query with its most related text. **Come up with 3 more questions and 3 suitable answers and add them to the list below.**

In [17]:
def match_queries_with_texts(queries, texts):
    # Calculate similarities between each query and text
    similarities = np.zeros((len(queries), len(texts)))
    
    for i, query in enumerate(queries):
        for j, text in enumerate(texts):
            similarities[i, j] = similarity_two_queries(query, text)
    
    # Match each query to the text with the highest similarity
    matches = {}
    for i, query in enumerate(queries):
        best_match_idx = np.argmax(similarities[i])
        matches[query] = texts[best_match_idx]
    
    return matches

In [18]:
queries = ["What are the 7 colors of the rainbow?", 
           "What does Elsie do for work?", 
           "Which country has the largest population?",
           "What time is it?",
           "What is the largest continent?",
           "Who is the greatest Football player?"]
texts = ["China has 1.4 billion people.",
         "Elsie works the register at Arby's.", 
         "The colors of the rainbow are ROYGBIV.",
         "The time is 3:14.",
         "The largest continent is Asia.",
         "Christiano Ronaldo"]

- Now we shuffle the queries and texts. Let's see if we can match them!

In [19]:
import random
random.shuffle(queries)
random.shuffle(texts)

match_queries_with_texts(queries, texts)

{'Who is the greatest Football player?': 'Christiano Ronaldo',
 'What is the largest continent?': 'The largest continent is Asia.',
 'What does Elsie do for work?': "Elsie works the register at Arby's.",
 'What time is it?': 'The time is 3:14.',
 'Which country has the largest population?': 'China has 1.4 billion people.',
 'What are the 7 colors of the rainbow?': 'The colors of the rainbow are ROYGBIV.'}

#### Database

Now let's look at how to store these vectors for efficient retrieval. There are many storage options; in this exercise we use [ChromaDB](https://python.langchain.com/v0.1/docs/integrations/vectorstores/chroma/), an open-source vector database.

Through LangChain, we can expose the database as a LangChain ***retriever*** object, which lets us run the same kind of queries as before.

**Taking the `texts` and `queries` that you defined before, we can load them into ChromaDB and perform the same operations.**
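
Conceptually, a vector store keeps (text, vector) pairs and a retriever returns the `k` texts whose vectors are closest to the query vector. The sketch below illustrates that idea in plain Python. It is not how ChromaDB is implemented: `ToyVectorStore` and `toy_embed` are hypothetical names, and the word-count "embedding" is a crude stand-in for a real model.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    cleaned = text.lower().replace("?", "").replace(".", "")
    return Counter(cleaned.split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def add_texts(self, texts):
        for text in texts:
            self.items.append((text, toy_embed(text)))

    def search(self, query, k=1):
        """Return the k stored texts most similar to the query."""
        query_vec = toy_embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "The largest continent is Asia.",
    "The time is 3:14.",
    "Elsie works the register at Arby's.",
])
top = store.search("What is the largest continent?", k=1)
print(top)
```

A real vector database adds persistence and approximate nearest-neighbor indexes so that search stays fast as the collection grows.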

In [20]:
ids = list(range(len(texts)))
db = Chroma.from_texts(texts, vectorizer, metadatas=[{"id": id} for id in ids])
retriever = db.as_retriever(search_kwargs={"k": 1})

In [21]:
texts

['The colors of the rainbow are ROYGBIV.',
 "Elsie works the register at Arby's.",
 'China has 1.4 billion people.',
 'The time is 3:14.',
 'The largest continent is Asia.',
 'Christiano Ronaldo']

In [22]:
retriever.invoke("Which country has the largest population?")

[Document(metadata={'id': 2}, page_content='China has 1.4 billion people.')]

### Task 4

Now let’s apply the same retrieval process to a file we read in. The file `workplaces.txt` contains names and workplaces of several people. 


In [23]:
with open("workplaces.txt", 'r') as file:
    lines = file.readlines()
lines = [line.strip() for line in lines]
print(lines[0:4])

["Aaron works at McDonald's", 'Beth works at Starbucks', 'Charlie works at Walmart', 'Daisy works at Amazon']


`workplace_retriever` is a function that loads the `workplaces.txt` file into a database and returns it as a retriever, which you can use to find out the workplaces of the people in the file. The argument `k` sets how many top results are returned.

In [24]:
def workplace_retriever(k=3):
    with open("workplaces.txt", 'r') as file:
        lines = file.readlines()
    lines = [line.strip() for line in lines]
    db = Chroma.from_texts(lines,vectorizer, metadatas=[{"id": id} for id in list(range(len(lines)))])
    retriever = db.as_retriever(search_kwargs={"k": k})
    return retriever

Using `workplace_retriever`, **find out who works at Starbucks and McDonald's**.

In [25]:
# TODO: Find out who works at Starbucks and who works at McDonald's. Use workplace_retriever(k=3).invoke(<query>) to do this
# You can experiment with the value of k to make sure you find all people that work in one place.

In [28]:
# Query for employees at Starbucks (raise k to list more matches)
results_starbucks = workplace_retriever(k=1).invoke("Who works at Starbucks?")
print("Employees at Starbucks:")
for result in results_starbucks:
    print(result)

Employees at Starbucks:
page_content='Brian works at Starbucks' metadata={'id': 27}


In [29]:
# Query for employees at McDonald's (raise k to list more matches)
results_mcdonalds = workplace_retriever(k=1).invoke("Who works at McDonald's?")
print("Employees at McDonald's:")
for result in results_mcdonalds:
    print(result)

Employees at McDonald's:
page_content='Alice works at McDonald's' metadata={'id': 26}


### Task 5

#### Chunking

The `workplaces.txt` data we just looked at was conveniently split into lines, with each line representing a distinct and meaningful chunk of information. This straightforward structure makes it easier to process and analyze the text data.

However, it is usually not so straightforward:
- When dealing with text data, especially from large or complex documents, it's essential to handle the formatting and structure efficiently.
- If we get a not-so-simply formatted file, we can break it down into manageable chunks using LangChain's `TextLoader` and `RecursiveCharacterTextSplitter`.
- This allows us to preprocess and chunk the data effectively for further use in our RAG pipeline.

Let's take a look at some of the *TIDE* documentation [here](https://tide.sdsu.edu/l). We have downloaded the contents of this webpage into two text files named `tide_doc_1.txt` and `tide_doc_2.txt`.

In [30]:
with open("tide_doc_1.txt", 'r') as file:
    lines = file.readlines()
lines = [line.strip() for line in lines]
print(lines[20:35])

['Launch Server: Use your campus credentials to access the TIDE JupyterHub and launch your computational environment.', 'Jobs', 'For tasks that require extended computation time or more complex configurations, TIDE allows the execution of containers as jobs within namespaces. Here’s how you can manage these jobs:', '', 'Purpose:', '', 'Long-Running Tasks: Execute long-running or resource-intensive jobs using containers.', 'Namespaces:', '', 'Organization: Namespaces help in organizing users, jobs, and other resources within the TIDE environment.', 'Assistance:', '', 'Support: If you need help with creating or managing namespaces, submit a TIDE Support Request, and the team will assist you in setting up the necessary configurations.', 'Quick Links', 'TIDE Support Request: Submit a support request for any issues or queries.']


- The text is not split into meaningful chunks of information by default, so we need to do our best to format it in a way that is useful. This is why we use chunks, which group a piece of text together with its neighboring text.

A function that chunks `tide_doc_1.txt` has been provided below. When using the `RecursiveCharacterTextSplitter`, the chunk size determines the maximum size of each text chunk. This is particularly useful when dealing with large documents that need to be split into smaller, manageable pieces for better retrieval and analysis.
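
To see what chunk size and overlap do, here is a naive fixed-size character chunker. This is a simplified sketch, not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators); `chunk_text` is a hypothetical helper written only for illustration.

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Split text into pieces of at most chunk_size characters.

    Consecutive pieces share chunk_overlap characters, so information that
    straddles a boundary appears in both neighbors.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Smaller chunks give more precise matches but may cut an answer in half; larger chunks keep more context at the cost of diluting the embedding with unrelated text.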

**Experiment with different chunk sizes** and pick a size that captures enough information to answer the question "What is TIDE?":

In [31]:
def tide_retriever(chunk_size):
    loader = TextLoader('tide_doc_1.txt')
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=10, separators=[" ", ",", "\n"])
    texts = text_splitter.split_documents(documents)
    db = Chroma(embedding_function=vectorizer)
    db.add_documents(texts)
    retriever = db.as_retriever(search_kwargs={"k": 3})
    return retriever

In [32]:
# TODO: Think about how many characters would be needed to contain useful information for such a complex task


In [33]:
# SOLUTION
tide_retriever(1000).invoke("What is TIDE?")

[Document(metadata={'source': 'expanse_doc_1.txt'}, page_content='the TIDE environment.\nAssistance:\n\nSupport: If you need help with creating or managing namespaces, submit a TIDE Support Request, and the team will assist you in setting up the necessary configurations.\nQuick Links\nTIDE Support Request: Submit a support request for any issues or queries.\nTIDE JupyterHub: Access the JupyterHub interface for managing computational resources.\nTIDE on GitHub: Explore the TIDE documentation and additional resources on GitHub.\nTIDE YouTube: Watch tutorials and informational videos about TIDE.\nNSF Award #2346'),
 Document(metadata={'source': 'expanse_doc_1.txt'}, page_content='TIDE Wave\nTechnology Infrastructure for Data Exploration (TIDE)\nTIDE is an advanced infrastructure platform integrated into the National Research Platform Nautilus hyper-cluster. It provides a robust environment for data exploration and computational tasks, leveraging cutting-edge technology to support artifici

### Task 6
#### Multiple Document Chunking

When we have more than one document we want to use in our database, we can simply iteratively chunk them. Metadata for the text source is added by default, but we can add our own metadata as well in the form of IDs.
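
The idea of tagging each chunk with its source and position can be sketched without any library. The example below uses plain dictionaries in place of LangChain `Document` objects and made-up file names; `chunk_with_metadata` is a hypothetical helper, and the fixed-size slicing is a simplification of real text splitting.

```python
def chunk_with_metadata(docs, chunk_size):
    """Chunk each document and record where every chunk came from.

    docs maps a source name to its full text. Each returned record mirrors
    the page_content/metadata layout used for retrieved results.
    """
    records = []
    for source, text in docs.items():
        pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        for chunk_number, piece in enumerate(pieces):
            records.append({
                "page_content": piece,
                "metadata": {"source": source, "chunk_number": chunk_number},
            })
    return records


# Hypothetical documents, used only to show the numbering
records = chunk_with_metadata({"a.txt": "aaaaaa", "b.txt": "bbb"}, chunk_size=4)
for record in records:
    print(record["metadata"], repr(record["page_content"]))
```

Note that the chunk numbering restarts at 0 for every source, so a chunk is only uniquely identified by the pair (source, chunk number).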


The function `tide_all_retriever` below chunks both `tide_doc_1.txt` and `tide_doc_2.txt`. Using a chunk size of 1000 characters, **find which document is most likely to contain information on "Compiling Codes".**

In [34]:
def tide_all_retriever(chunk_size):
    import glob
    db = Chroma(embedding_function=vectorizer)
    pattern = 'tide_doc_*.txt'
    file_list = glob.glob(pattern)
    for file_name in file_list:
        loader = TextLoader(file_name)
        documents = loader.load()
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=10, separators=[" ", ",", "\n"])
        texts = text_splitter.split_documents(documents)
        for id,text in enumerate(texts):
            text.metadata["chunk_number"] = id
        db.add_documents(texts)

    
    retriever = db.as_retriever(search_kwargs={"k": 3})
    return retriever

In [35]:
# TODO: Find the relevant source for the query "Compiling Codes"

In [36]:
# SOLUTION
chunks = tide_all_retriever(1000).invoke("Compiling Codes")
for chunk in chunks:
    print(chunk.metadata)

# The answer is tide_doc_2.txt

{'chunk_number': 0, 'source': 'expanse_doc_2.txt'}
{'chunk_number': 1, 'source': 'expanse_doc_2.txt'}
{'chunk_number': 5, 'source': 'expanse_doc_2.txt'}


## Part 2: Basic RAG


Ollama is an open-source LLM platform that allows us to use a plethora of different LLMs. In this notebook, Mistral is our LLM of choice. Feel free to play around with it.

In [37]:
ollama = Ollama(model="mistral")
ollama.invoke("How are you doing?")

" I'm just a computer program, so I don't have feelings or emotions. I'm here to help you with any questions or problems you might have! How can I assist you today?"

### Task 7

**Write a function that passes your question to the retriever returned by `workplace_retriever`, collects the retrieved texts as context, and sends that context along with the question to Ollama so it can answer in natural language.** Fill in `workplace_question`, which accomplishes this task.

In [39]:

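def workplace_question(question):
    # Retrieve the most relevant lines and keep only their text
    retriever = workplace_retriever()
    docs = retriever.invoke(question)
    context = "\n".join(doc.page_content for doc in docs)
    # Ask the LLM to answer using the retrieved context
    ollama = Ollama(model="mistral")
    prompt = f"Based on the following context:\n{context}\nAnswer the question: {question}"
    return ollama.invoke(prompt)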

In [40]:
workplace_question("Who works at Starbucks?")

' Brian works at Starbucks.'
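
The core of this basic RAG step is prompt assembly: retrieved text goes in first, the question goes in last. The sketch below isolates that step so it can be inspected without a running LLM. The helper name `build_rag_prompt` and the exact prompt wording are assumptions for illustration, and the chunk strings stand in for retriever output.

```python
def build_rag_prompt(question, chunks):
    """Join retrieved text chunks into one context block and append the question."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_rag_prompt(
    "Who works at Starbucks?",
    ["Beth works at Starbucks", "Daisy works at Amazon"],
)
print(prompt)
```

Printing the prompt before sending it to the model is a quick way to check whether a wrong answer comes from bad retrieval or from the model itself.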

## Part 3: LangChain RAG

The above is a very simple example of RAG. Now, using LangChain, we can put everything together in a cleaner, all-inclusive way. Let's combine everything we've learned into the function `generate_rag`.

- The below implementation has a custom class that allows us to view what chunks are being used based on our queries.

In [41]:
def generate_rag(verbose=False, chunk_info=False):
    import glob
    vectorizer = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    db = Chroma(embedding_function=vectorizer)
    pattern = 'tide_doc_*.txt'
    file_list = glob.glob(pattern)
    for file_name in file_list:
        loader = TextLoader(file_name)
        documents = loader.load()
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=10, separators=[" ", ",", "\n"])
        texts = text_splitter.split_documents(documents)
        for id,text in enumerate(texts):
            text.metadata["chunk_number"] = id
        db.add_documents(texts)
    
    template = """[INST] Given the context - {context} Answer the following question - {question} [/INST]"""
    pt = PromptTemplate(
                template=template, input_variables=["context", "question"]
            )
    # Let's retrieve the top 3 chunks for our results
    retriever = db.as_retriever(search_kwargs={"k": 3})
    class CustomRetrievalQA(RetrievalQA):
        def invoke(self, *args, **kwargs):
            result = super().invoke(*args, **kwargs)
            if chunk_info:
                # Print out the chunks that were retrieved
                print("Chunks being looked at:")
                chunks = retriever.invoke(*args, **kwargs)
                for chunk in chunks:
                    print(f"Source: {chunk.metadata['source']}, Chunk number: {chunk.metadata['chunk_number']}")
                    print(f"Text snippet: {chunk.page_content[:200]}...\n")  # Print the first 200 characters
            return result
    rag = CustomRetrievalQA.from_chain_type(
        llm=Ollama(model="mistral"),
        retriever=retriever,
        memory=ConversationSummaryMemory(llm=Ollama(model="mistral")),
        chain_type_kwargs={"prompt": pt, "verbose": verbose},
    )

    return rag

### Task 8
**Compare how Mistral performs without context and with context, i.e. without RAG and with RAG.**

In [42]:
print(ollama.invoke("How do you check available resources on tide Supercomputer"))

 To check the available resources on an Expanse Supercomputer, you would typically use command-line tools provided by the system. Here's a general guide:

1. **SSH into the supercomputer**: First, you need to establish a secure connection to the supercomputer using SSH (Secure Shell). You'll need the hostname or IP address of the supercomputer and your credentials. The command would look something like this:

   ```
   ssh username@supercomputer_hostname
   ```

2. **Check CPU usage**: To check the CPU usage, you can use the `top` command. This command will display a dynamically updated list of processes on your system sorted by their resource utilization.

   ```
   top
   ```

3. **Check memory usage**: To check the memory (RAM) usage, you can use the `free -h` command. This command will display information about the total amount of free and used memory in a human-readable format.

   ```
   free -h
   ```

4. **Check disk usage**: To check the disk usage, you can use the `df -h` com

In [43]:
tide_rag = generate_rag()
result = tide_rag.invoke("How do you check available resources on tide Supercomputer")
print(result["result"])

 To check available resources on the Expanse Supercomputer, you can use the following command in your terminal after loading the required module (sdsc) and navigating to the correct directory:

```bash
[user@login01 ~]$ module load sdsc
[user@login01 ~]$ expanse-client resource
```

This will display a table showing available resources, including the resource name, project details, usage, and availability. If you want to check the resources for a specific project or resource, you can use the 'user' parameter followed by '-r' to specify the desired resource:

```bash
[user@login01 ~]$ expanse-client user -r <resource>
```

Replace '<resource>' with the name of your desired resource or leave it blank to view data for the default resource.


**When we set `verbose` to True, we can see exactly what is being passed into the LLM, highlighted in green.**

In [None]:
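result = generate_rag(verbose=True).invoke("How do you check available resources on tide Supercomputer")
print(result["result"])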

#### Great work! We've officially made a chatbot that can help us out with all things *TIDE*, at least according to the 2 .txt files we have access to!