# Tech Bytes 3: Let's talk about RAG

Today we're going to build a chatbot, but not just any chatbot: one that knows about the things you teach it!

LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data they were trained on, up to a specific cutoff date. If you want to build AI applications that can reason about private data or data introduced after a model's cutoff date, you need to augment the knowledge of the model with the specific information it needs. The process of retrieving the appropriate information and inserting it into the model prompt is known as Retrieval Augmented Generation (RAG).

A typical RAG application has two main components:

* **Indexing**: a pipeline for ingesting data from a source and indexing it. This usually happens offline.

* **Retrieval and generation**: the actual RAG chain, which takes the user query at run time and retrieves the relevant data from the index, then passes that to the model.

In [None]:
!pip install langchain-openai==0.1.7 langchain-core==0.2.1 langchain-community==0.2.1 langchain==0.2.1 sentence-transformers==3.0.0 faiss-cpu==1.8.0

# Start by connecting to OpenAI
This time we will use a library called `langchain`, which makes the interactions between the LLM and Python easier, so there is no need for `requests`.

The main concept you need to know is that, to execute a call, you call the object's `.invoke()` method. You will see an example below.

In [None]:
from langchain_openai import AzureChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_core.runnables import RunnablePassthrough
import getpass

In [None]:
api_key = getpass.getpass("Enter your OpenAI API key: ")

In [None]:
gpt4o_mini = AzureChatOpenAI(
    azure_deployment="gpt-4o-mini",
    api_key=api_key,
    openai_api_version="2024-02-01",
    azure_endpoint="https://oai-tech-bytes.openai.azure.com/"
    )

gpt4o_mini.invoke("What's the capital of France?")

## Exercise
Ask gpt-4o-mini: what is AI?

In [None]:
# your code here

Looks a bit weird, right? That's because langchain returns a message object with extra attributes and metadata attached to the response, not just the text. We can parse the answer and get the actual text by using `StrOutputParser()`.

We can also use the `|` (pipe) operator, which takes the output of one component and passes it as the input to the next one. So instead of doing:
```python
intermediate_output = gpt4o_mini.invoke("What's the capital of France?")
output = StrOutputParser().invoke(intermediate_output)
```

We do:

In [None]:
llm = gpt4o_mini | StrOutputParser()
llm.invoke("What's the capital of France?")
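
If the `|` operator feels a bit magical, here is a minimal sketch of the idea in plain Python. The `Step` class, `fake_model` and `fake_parser` are hypothetical stand-ins made up for this illustration, not LangChain's actual implementation; they only show how `__or__` can chain two components so that the output of the first becomes the input of the second.

```python
class Step:
    """Hypothetical stand-in for a chain component (not a LangChain class)."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        # Run this step on the given input
        return self.func(value)

    def __or__(self, other):
        # `a | b` builds a new step that runs `a` first, then feeds its output to `b`
        return Step(lambda value: other.invoke(self.invoke(value)))


# Two toy steps: one pretends to be a model, the other extracts the text
fake_model = Step(lambda question: {"content": f"Answer to: {question}"})
fake_parser = Step(lambda response: response["content"])

toy_chain = fake_model | fake_parser
print(toy_chain.invoke("What is AI?"))
```

The chain built with `|` is itself something you can call `.invoke()` on, which is why chains can keep growing by adding more steps.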

## Exercise
Ask again what AI is, but this time using the `llm` chain

In [None]:
# your code here

## Prompt Templates

You can also create prompt templates. Templates let you define placeholders and reuse a prompt without having to type it out every time.
Let's generalize the prompt so we can ask for the capital of any country.

In [None]:
template = """What is the capital of {country}?"""
prompt = PromptTemplate.from_template(template)

In [None]:
# Let's create a new chain for this

capital_bot = prompt | gpt4o_mini | StrOutputParser()

capital_bot.invoke("France")

In [None]:
# You can also invoke the chain with a dictionary; this is useful when you have multiple placeholders
capital_bot.invoke({"country": "France"})
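
To get an intuition for what the template step does, the sketch below fills the same kind of placeholders with plain Python string formatting. This is only an analogy: `PromptTemplate` does more than `str.format`, and the second template with `{count}` and `{topic}` is made up for this illustration.

```python
template = "What is the capital of {country}?"

# Filling the placeholder by hand: roughly what happens to the template
# before the text is sent to the model
filled_prompt = template.format(country="France")
print(filled_prompt)

# With several placeholders, a dictionary keeps the values together
facts_template = "Give me {count} facts about {topic}."
values = {"count": 3, "topic": "embeddings"}
filled_facts_prompt = facts_template.format(**values)
print(filled_facts_prompt)
```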

## Exercise
Create a translator bot that takes a language and a sentence as input and translates the sentence into that language

In [None]:
# Your code goes here:

# Adding information to the prompt
One way to add new information would be to add it to the prompt. For example:

In [None]:
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.

{context}

Question: {question}

Helpful Answer:"""
prompt = PromptTemplate.from_template(template)

qa_bot = prompt | gpt4o_mini | StrOutputParser()

In [None]:
qa_bot.invoke({"context": "The capital of Erickland is Santillan.", 
               "question": "What is the capital of Erickland?"})

That was easy because the context was small. But the same approach also works when we pass a whole document as the context.

The code below reads the file `docs/strategy.txt` and stores its content in the variable `context`. Can you build a bot that answers questions about it?

In [None]:
with open("docs/strategy.txt", "r") as f:
    context = f.read()

# Printing the first 1000 characters of the context
print(context[:1000])

In [None]:
# Your code goes here


Let's create a bot for the annual report!

In [None]:
with open("docs/2023-annual-report.txt", "r") as f:
    annual_report = f.read()

# Printing the first 1000 characters of the annual report
print(annual_report[:1000])

In [None]:
# Your code goes here
template = """Use the following pieces of context to answer the question at the end.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.

{context}

Question: {question}

Helpful Answer:"""
prompt = PromptTemplate.from_template(template)

nnf_2023_bot = prompt | gpt4o_mini | StrOutputParser()

nnf_2023_bot.invoke({"context": annual_report,
                     "question": "How many DKK in grants have been awarded by NNF in 2019-2023?"})

## Exercise
How many words are there in `annual_report`? (Hint: `len(annual_report)` counts characters, so split the text into words first, for example with `str.split()`, and then use `len`.)

In [None]:
# Your code goes here

# RAG
Our loaded documents are quite long, which means they may not fit in the context window of many models. Even models that can fit a full document in their context window can struggle to find information in very long inputs.

To handle this, we’ll split the documents into chunks and put only the relevant chunks in the context.

In this case we’ll split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the RecursiveCharacterTextSplitter, which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.
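
To make chunk size and overlap concrete, here is a deliberately naive sketch that cuts a string into fixed-size pieces. The function `split_with_overlap` is hypothetical and much simpler than `RecursiveCharacterTextSplitter`, which tries to cut at separators such as new lines instead of at arbitrary positions. The tiny sizes are only there to make the overlap visible.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut `text` into fixed-size chunks; neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


sample_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = split_with_overlap(sample_text, chunk_size=10, chunk_overlap=4)
for chunk in toy_chunks:
    print(chunk)
```

Each chunk starts with the last 4 characters of the previous one, so a statement that straddles a boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.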

## How to find relevant information?
This is THE question, and it is a bit outside the scope of this workshop, but you can search for the keywords "information retrieval" and "embeddings" if you are really curious.

I'll provide a big part of the code that you can just re-use. Don't worry if you don't get everything :)


In [None]:
# Loading the documents
loader = DirectoryLoader("./docs/", loader_cls=TextLoader, glob="*.txt")
docs = loader.load()

#Splitting the documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

# Creating embedding for the documents and putting them in a vector store
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
vectorstore = FAISS.from_documents(documents=all_splits, embedding=embedding_model)


We create a retriever out of our vector store. The `6` you see is the number of similar "documents" (chunks) that will be retrieved.

In [None]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

In [None]:
query = "How many DKK in grants have been awarded by NNF in 2019-2023?"
retrieved_docs = retriever.invoke(query)

In [None]:
print(retrieved_docs[0].page_content)

## What's happening under the hood? (only if you're interested)
It may seem obscure what that one line of code is doing, but it's really simple. It's a 5-step process:
1. The `query` is passed through our embedding model and transformed into a vector; let's call it `query_vector`.
2. The `query_vector` is then compared to all the vectors in the vectorstore. Remember that the vectors in the vectorstore are just a mathematical representation of parts of the documents.
3. We then take the vectors that are the most "similar" to our `query_vector`.
4. We return a list with the documents whose vectors were closest to the `query_vector`.
5. We then print the first document.
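
The steps above can be sketched in a few lines of plain Python. Everything here is made up for illustration: the 3-dimensional vectors, the texts and the helper `top_k`. Cosine similarity is used as one common similarity measure; the actual vector store may use a different distance.

```python
import math


def cosine_similarity(a, b):
    # 1.0 means "same direction", values near 0 mean "unrelated"
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings"; a real model produces hundreds of dimensions
toy_store = {
    "A text about grants": [0.9, 0.1, 0.0],
    "A text about cooking": [0.1, 0.9, 0.1],
    "A text about funding": [0.7, 0.2, 0.3],
}
query_vector = [1.0, 0.0, 0.1]  # pretend this came from the embedding model


def top_k(query_vector, store, k=2):
    # Score every stored vector against the query and keep the k best texts
    scored = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]


toy_results = top_k(query_vector, toy_store, k=2)
print(toy_results[0])
```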

# Putting everything together

We will create a chain called `rag_chain` that has only one input: the user's question.

The question is forked and passed through two different pipelines:

1. The retrieval pipeline, where the retriever compares the question to the documents in the vectorstore and the `format_docs` function joins the retrieved documents into a single string. That string is passed to the prompt as the `context` property.
2. A passthrough, where the question itself is passed unchanged to the prompt as the `question` property.

Once the prompt is filled with the context and the question, we send it to the LLM and print the result.

In [None]:
# You can safely ignore this function; it just joins the retrieved documents into one string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [None]:
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm  # llm already ends with StrOutputParser(), so no extra parser is needed
)

In [None]:
rag_chain.invoke("How many DKK in grants have been awarded by NNF in 2019-2023?")
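
If the dictionary at the start of the chain is confusing, the sketch below spells out the same data flow with ordinary functions. All names here (`fake_retriever`, `join_chunks`, `fill_prompt`, `fake_llm`, `toy_rag_chain`) are hypothetical stand-ins that need no API key; they mimic the shape of the chain, not its real behaviour.

```python
def fake_retriever(question):
    # Stand-in for the retriever: returns the chunks judged relevant
    return ["Chunk about grants.", "Chunk about strategy."]


def join_chunks(chunks):
    # Same idea as format_docs: glue the chunks together with blank lines
    return "\n\n".join(chunks)


def fill_prompt(context, question):
    # Stand-in for the prompt template
    return f"{context}\n\nQuestion: {question}\n\nHelpful Answer:"


def fake_llm(prompt_text):
    # Stand-in for the model call; it only reports the size of what it received
    return f"(model received {len(prompt_text)} characters)"


def toy_rag_chain(question):
    # The question is "forked": one copy goes through retrieval,
    # the other is passed along unchanged
    inputs = {
        "context": join_chunks(fake_retriever(question)),
        "question": question,
    }
    prompt_text = fill_prompt(**inputs)
    return fake_llm(prompt_text)


print(toy_rag_chain("What is this about?"))
```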

# Your turn!
Now it's up to you. Here are some exercises to play with; feel free to mess around with them :)

# Exercise 1: Validation
Right now our agent can answer questions about NNF... but also about coding in Python, or about the weather in Mexico. I think you can see how this can be abused... How can you add some guardrails to prevent it?

The following prompt shouldn't be possible:

In [None]:
print(rag_chain.invoke("Write a python function that sums all fibonacci numbers between 1-18"))

# Exercise 2: Cite your sources!
We know LLMs are prone to hallucination... How can you make it return the sources its knowledge came from?

Pssst: maybe you want to look into modifying the `format_docs` function, although there are several ways of doing it

In [None]:
# Your code goes here

# Exercise 3: Follow-up questions *really hard*
Right now, our agent can answer questions about NNF. But if you ask a follow-up question, it has no idea what you were talking about, because an LLM has no memory. The only way to give it memory is to add the past requests to each new request manually. How could you do it...?

In [None]:
# your code goes here