# RAG Demo
<br>
The code in this notebook was adapted from LangChain's simple <a href='https://python.langchain.com/docs/tutorials/rag/' target='_blank'>RAG application walkthrough</a> and <a href='https://huggingface.co/spaces/cboettig/streamlit-demo/blob/main/pages/rag.py' target='_blank'>Professor Boettiger's Streamlit RAG demo</a>.  
<br>

Before running this notebook, open a terminal and run `pip install -r requirements.txt` to install the necessary packages.

<hr style="border: 5px solid #0D335F;" />
<hr style="border: 2px solid #5FAE5B;" />

# Setting up RAG

This portion of the notebook walks through the code used to set up our RAG system for the demos.  
<br>
To run all the code in this section and skip to the demo, click the table of contents icon in the left menu bar. Then right-click the title of this section and choose 'Select and Run Cell(s) for this Heading'. Finally, click the Demos heading to jump to that portion of the notebook.

<hr style="border: 1px solid #5FAE5B;" />
    
## Initial Setup

First we'll set up the chatbot, embedding model, and embedding storage system.

**General Questions:**  
Which LLM, embedding model, and vector store should I use? For this demo, we'll pick one of each arbitrarily.

In [2]:
#%pip install --quiet --upgrade langchain-text-splitters langchain-community langgraph

### Ask for your OpenAI API key if you haven't already set one

In [3]:
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    api_key = getpass.getpass("Enter API key for OpenAI: ")
    # The key is kept in a local variable only; os.environ is left unchanged.
    # Open question: should we also set os.environ["OPENAI_API_KEY"] = api_key,
    # or is it better not to modify the user's environment?
else:
    api_key = os.environ["OPENAI_API_KEY"]

Enter API key for OpenAI:  ········


**TODO:** Replace with some call to a secrets folder for easier use while developing
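
One possible way to address this TODO is sketched below. It is only an illustration: the helper name `load_api_key` and the file path `secrets/openai_api_key.txt` are hypothetical, and the file is assumed to contain nothing but the key. The helper checks the environment variable first, then the file, and returns `None` if neither is available, so the notebook can still fall back to the `getpass` prompt.

```python
import os
from pathlib import Path

def load_api_key(env_var="OPENAI_API_KEY", secrets_file="secrets/openai_api_key.txt"):
    """Return the API key from the environment or a local file, else None.

    Both default values are assumptions made for this sketch.
    """
    key = os.environ.get(env_var)
    if key:
        return key
    path = Path(secrets_file)
    if path.is_file():
        # Strip the trailing newline that most editors add to the file.
        return path.read_text(encoding="utf-8").strip()
    return None

found_key = load_api_key()
print("Key found" if found_key else "No key found; fall back to the getpass prompt")
```

If this approach is used, the secrets file should be listed in `.gitignore` so the key is never committed.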

### Set up the language model

In [14]:
#%pip install -qU "langchain-openai" 

In [5]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model="llama3",
    api_key=api_key,
    base_url="https://llm.nrp-nautilus.io",
    temperature=0,
)

### Set up the embedding model
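An embedding model converts a piece of text into a vector (a list of numbers), so that texts with similar meanings end up with similar vectors. [insert link to further explanation]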

In [6]:
#Set up the embedding model
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="embed-mistral",
    api_key=api_key,
    base_url="https://llm.nrp-nautilus.io",
)

**Perhaps change this ^ to embed mistral**
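
To make the idea of an embedding more concrete, here is a minimal sketch that compares hand-made three-dimensional vectors. The vectors and the `cosine_similarity` helper are invented for illustration; they are not produced by the embedding model configured above, which returns much longer vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", written by hand rather than produced by a model.
toy_vectors = {
    "river": [0.9, 0.1, 0.0],
    "stream": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_vectors["river"], toy_vectors["stream"])
sim_unrelated = cosine_similarity(toy_vectors["river"], toy_vectors["invoice"])
print(f"river vs stream:  {sim_related:.2f}")
print(f"river vs invoice: {sim_unrelated:.2f}")
```

Words with related meanings point in nearly the same direction and score close to 1, while unrelated words score close to 0. A retriever relies on this property when it looks for text relevant to a question.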

### Set up the embedding storage system

In [7]:
#%pip install -qU langchain-core

In [8]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)

### Initial Setup Complete!

<hr style="border: 1px solid #5FAE5B;" />

## Data Processing Pipeline (Indexing)
Here's where we start processing the textual data in the document(s) we want our chatbot to use when answering our questions. In our case, this will involve 3 steps: 

1. Load the document(s)    
2. Split the document(s) into smaller pieces  
3. Produce vectors representing these smaller pieces, and use those vectors to organize our pieces in a database

If we want to change the document(s) our chatbot is using, we'll have to add the new documents and run through this part of the process again (hence the name 'pipeline').

### Load the document(s)
This code allows us to load the textual data from PDFs into a format that we can work with. You can also load HTML pages directly from the web by following the steps described in 
<a href='https://python.langchain.com/docs/tutorials/rag/#loading-documents' target='_blank'>the 'loading documents' portion of the RAG application walkthrough</a>.

In [15]:
#%pip install -qU langchain-community pypdf 

In [10]:
from langchain_community.document_loaders import PyPDFLoader, PyPDFDirectoryLoader

def pdf_loader(url):
    """
    Loads the PDF at the given url.

    Arguments:
        url: the url to the PDF you want to load

    Returns: A list of documents, one per page, containing the text data (and metadata) of the specified PDF.
    """
    loader = PyPDFLoader(url)
    return loader.load()

docs = pdf_loader('https://canature.maps.arcgis.com/sharing/rest/content/items/8da9faef231c4e31b651ae6dff95254e/data')

To load multiple PDFs: put all the PDFs in a folder, uncomment the last line of the cell below, paste in the path to your folder, and then run the cell.

In [11]:
def multiple_pdf_loader(folder_path):
    """
    Loads all PDFs in the specified folder.

    Arguments:
        folder_path: the path to the folder containing all the PDFs you want to load.

    Returns: A list of documents, one per page, covering every PDF in the folder.
    """
    loader = PyPDFDirectoryLoader(folder_path)
    return loader.load()

#Uncomment the line below and paste in the path to your pdf folder to load multiple PDFs. An example folder file path would look like: 'C:/Users/evanl/Downloads/PDF Folder Name'
#docs = multiple_pdf_loader('paste the path to your folder here')

**TODO:** Test the multiple_pdf_loader function  
Also should I remove the docstrings? Do they make the code look scarier than it is just because it makes the cell look much bigger than it should be?

### Split the document(s) into bite-sized pieces
This code will take our document(s) and split their text into smaller sub-sections, sometimes referred to as 'chunks'. There are two important parameters to note in the cell below: `chunk_size` and `chunk_overlap`. 

The `chunk_size` parameter sets the maximum number of characters in each chunk (many chunks end up somewhat smaller, since the splitter prefers to break at natural boundaries such as paragraphs). The `chunk_overlap` parameter sets how many characters a chunk can share with the chunk that directly follows it in the text. The importance of `chunk_overlap` is discussed in the article (see breaking mode 1), and will be demonstrated later in this notebook.

You can read more about LangChain's text splitting methods [here](https://python.langchain.com/docs/how_to/recursive_text_splitter/).

In [12]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # chunk size (characters)
    chunk_overlap=200,  # chunk overlap (characters)
    add_start_index=True,  # track index in original document
)
all_splits = text_splitter.split_documents(docs)

print(f"Split pdf into {len(all_splits)} sub-documents.")

Split pdf into 188 sub-documents.
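
To see what the two parameters do, here is a minimal sketch of a fixed-size sliding window over a short string. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally; the helper `sliding_window_chunks` and the sample text are made up for this illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbours share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in toy_chunks:
    print(chunk)
```

In the printed chunks, the last four characters of each chunk reappear as the first four characters of the next one. That shared text is the overlap.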


### Add the pieces to the vector store
Under the hood, this code is actually doing two things. When we set up the vector store earlier, we told it which embedding model to use. Now, when we add the chunks of our documents to the vector store, it first calls the embedding model to create a vector representation of each chunk. Then it stores each chunk together with its vector. This will allow us to quickly search for relevant pieces of our document(s) later, by comparing the vector of a question against the stored vectors.

**TODO:** fact check my description of the under-the-hood activities (I think it's true, but that's just because I don't see how else it could work)

In [13]:
document_ids = vector_store.add_documents(documents=all_splits)

print(document_ids[:3])

['28556928-7cdd-43bd-af31-93e74241c729', '7edc9640-d0d7-42d1-9da5-127cede7823b', '091afac0-7b78-49fb-ae25-7c13a3270670']
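
The embed-then-store idea can be sketched in a few lines of plain Python. Everything here is a stand-in for illustration: `toy_embed` counts words instead of calling a real embedding model, and `ToyVectorStore` is a hypothetical class, not LangChain's `InMemoryVectorStore`.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for an embedding model: represent text by its word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToyVectorStore:
    """Keeps (vector, text) pairs and ranks them against a query."""

    def __init__(self, embed):
        self.embed = embed
        self.entries = []

    def add_texts(self, texts):
        # Step 1: embed each chunk. Step 2: store it next to its vector.
        for text in texts:
            self.entries.append((self.embed(text), text))

    def search(self, query, k=1):
        # Embed the query, then return the k most similar stored texts.
        query_vector = self.embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vector, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]

store = ToyVectorStore(toy_embed)
store.add_texts([
    "wetlands filter water and store carbon",
    "the budget covers staff and travel costs",
    "forests provide habitat for many bird species",
])
top_hit = store.search("which habitat do bird species use", k=1)
print(top_hit)
```

The search returns the chunk that shares the most vocabulary with the question. A real embedding model goes further and can match chunks that are similar in meaning even when they share no words.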


### Indexing Complete!

At this point we've completed the 'indexing' portion of our set up process. This has involved 3 steps:  

1. **Loading our document(s)**: We used PyPDFLoader to load our PDF(s) into a format we could process using code.
2. **Text Splitting**: We used a text splitter to break our document(s) into smaller pieces that our chatbot will be able to more easily digest.  
3. **Adding chunks to our vector store**: We used an embedding model to represent the pieces of our document(s) as vectors, then stored each piece alongside its vector so that relevant pieces can be found quickly later.
                                                                                                                                                                                                        
Next we will set up a 'retriever' which will use this organized database to retrieve relevant pieces of our document based on the user's question.

<hr style="border: 1px solid #5FAE5B;" />

## Retrieval and Generation

<br>
<hr style="border: 5px solid #0D335F;" />
<hr style="border: 2px solid #5FAE5B;" />

# Demos

<hr style="border: 1px solid #5FAE5B;" />

## Breaking mode 1

<hr style="border: 1px solid #5FAE5B;" />

## Breaking mode 2

<br>
<hr style='border: 3px solid #0D335F;' />
<hr style='border: 1px solid #5FAE5B;' />

# Sources
This is a collection of all the links I inserted throughout the doc

https://python.langchain.com/docs/tutorials/rag/  
https://huggingface.co/spaces/cboettig/streamlit-demo/blob/main/pages/rag.py  
https://python.langchain.com/docs/how_to/recursive_text_splitter/

### Dump:

**Breaking mode 1:**  
Higher chunk overlap increases the chance that, if one chunk is deemed relevant to the prompt, the chunks surrounding it will also be seen as relevant. In effect, this encourages the RAG model to read more of the context surrounding the chunk where it believes an answer is located. The downside of high chunk overlap is increased computational intensity, since higher overlap means there will be more chunks.
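
The claim that higher overlap produces more chunks can be checked with some quick arithmetic. The sketch below assumes a simple fixed-size sliding window, which only approximates LangChain's splitter. The function name `estimated_chunk_count` and the 100,000-character document length are hypothetical.

```python
import math

def estimated_chunk_count(text_length, chunk_size, chunk_overlap):
    """Number of fixed-size windows needed to cover text_length characters.

    Each new window advances by (chunk_size - chunk_overlap) characters.
    """
    if text_length <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return math.ceil((text_length - chunk_overlap) / step)

text_length = 100_000  # hypothetical document length in characters
counts = {
    overlap: estimated_chunk_count(text_length, chunk_size=1000, chunk_overlap=overlap)
    for overlap in (0, 200, 500)
}
for overlap, count in counts.items():
    print(f"chunk_overlap={overlap}: about {count} chunks")
```

Under these assumptions, going from no overlap to an overlap of half the chunk size roughly doubles the number of chunks, and every extra chunk has to be embedded and stored.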

### Questions:

How should we set up the notebook so users can conveniently enter their OpenAI API key?

What do we think about the blue and green horizontal lines? Are there tweaks we could make that would be better?

I assume the actual notebook shouldn't have the %pip install cells right?  
And how do I add the -U in requirements.txt (or I guess just the U since I assume the q just means don't fill the screen with text)?  
