#### This app demos a Retrieval Augmented Generation (RAG) pattern:
RAG is a well-known pattern that allows an LLM (such as Llama 3) to answer \
questions on private data (which is most likely not part of the training set). 

RAG is one way to reduce hallucination, since answers are grounded in the retrieved \
data (and it is much cheaper than fine-tuning).

The demo includes how-tos for the following:
- download Llama 3.1 from HF
- use LangChain to ask Llama general questions and follow-up questions using memory
- use LangChain to load content (a recent web page - Hugging Face's blog post on Llama 3.1) and chat about it.
- use an embeddings model, `sentence-transformers/all-mpnet-base-v2` for generating embeddings

For now, this demo needs an L40S (or a GPU with more compute). In principle it should fit on a 4090, but currently it doesn't.

This demo uses the following container, `image:nvcr.io/nvidia/pytorch:24.07-py3`, the PyTorch image from NVIDIA NGC.

Roughly based on https://github.com/meta-llama/llama-recipes/blob/main/recipes/quickstart/RAG/hello_llama_cloud.ipynb

#### Install packages
Let's start by installing the necessary packages:

In [None]:
!pip install langchain
!pip install faiss-cpu
!pip install bs4
!pip install langchain_community

#### Log in to HF and download 'meta-llama/Meta-Llama-3-8B-Instruct'

In [None]:
!pip install huggingface-hub 

In [None]:
# go here for token: https://huggingface.co/settings/tokens
from huggingface_hub import notebook_login
notebook_login()

In [None]:
!pip install langchain_huggingface

In [None]:
import torch
print(f'allocated (GB): {torch.cuda.memory_allocated() / (1024 ** 3)}')
# no memory should be allocated at this point

In [None]:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from langchain_huggingface import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="meta-llama/Meta-Llama-3-8B-Instruct",
    device=0,
    task="text-generation",
    pipeline_kwargs={
        "max_new_tokens": 100,
        "top_k": 5,
        "temperature": 0.1,
    },
)

In [None]:
# check GPU memory allocation again
print(f'allocated (GB): {torch.cuda.memory_allocated() / (1024 ** 3)}')  # ~29.9 GB
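
The ~29.9 GB figure is consistent with the weights being held as 32-bit floats, which would also explain why the model does not fit on a 24 GB card. Below is a back-of-the-envelope check, not a measurement. It assumes roughly 8.03 billion parameters for the 8B model (an approximate figure that does not come from this notebook) and ignores activations and the KV cache. `weights_gib` and `N_PARAMS` are names made up for this sketch.

```python
def weights_gib(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return n_params * bytes_per_param / (1024 ** 3)

# Assumption: approximate parameter count of the 8B model
N_PARAMS = 8.03e9

fp32_gib = weights_gib(N_PARAMS, 4)  # 4 bytes per parameter in float32
fp16_gib = weights_gib(N_PARAMS, 2)  # 2 bytes per parameter in float16/bfloat16

print(f"float32 weights: {fp32_gib:.1f} GiB")
print(f"float16 weights: {fp16_gib:.1f} GiB")
```

Under these assumptions, half-precision weights would need about half the memory, which is why loading in a 16-bit dtype is a common way to fit a model of this size on smaller GPUs. This notebook does not try that.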

In [None]:
# test the model
print(f'Output:\n{llm.invoke("Hugging Face is")}')

#### Q&A w/model (on model memory, *not* on your data)
With the model set up, you are now ready to ask some questions. \
Here is an example of the simplest way to ask the model some general questions.

In [None]:
question = "who wrote the book Innovator's dilemma?"
answer = llm.invoke(question)
print(answer)

##### Do multi-turn conversations work?
We'll now ask a follow-up question to get more information on the book.\
Since the chat history is not passed on, Llama doesn't have the context and doesn't know the follow-up is about the book. \
Thus it treats this as a new query.

In [None]:
# ask one more question
# chat history hasn't been passed, so Llama doesn't have the context and doesn't know this is more about the book
followup = "tell me more"
followup_answer = llm.invoke(followup)
print(followup_answer)

#### Q&A w/model (with conversation history, still *not* your data)
To get around this, we need to provide the model with the chat history.\
To do this, we will use `ConversationBufferMemory` to pass the chat history to the model \
and give it the ability to handle follow-up questions.
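
Conceptually, a buffer memory just keeps the earlier turns and prepends them to each new prompt. The toy sketch below illustrates that idea in plain Python. It is not LangChain's implementation: `SimpleBufferMemory`, `build_prompt`, and the `Human:`/`AI:` prompt layout are all hypothetical, and the stored answer is a placeholder rather than real model output.

```python
class SimpleBufferMemory:
    """Toy stand-in for a conversation buffer: stores turns and renders them as text."""

    def __init__(self):
        self.turns = []  # list of (human_message, ai_message) tuples

    def save(self, human, ai):
        self.turns.append((human, ai))

    def render(self):
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


def build_prompt(memory, new_input):
    """Prepend the stored history (if any) to the new user input."""
    history = memory.render()
    prefix = f"{history}\n" if history else ""
    return f"{prefix}Human: {new_input}\nAI:"


toy_memory = SimpleBufferMemory()
# Placeholder answer, standing in for whatever the model returned
toy_memory.save("who wrote the book Innovator's dilemma?", "Clayton Christensen.")

toy_prompt = build_prompt(toy_memory, "tell me more")
print(toy_prompt)
```

Because the earlier question and answer are now part of the prompt, "tell me more" has something to refer to.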

In [None]:
# using ConversationBufferMemory to pass memory (chat history) for follow up questions
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=False,
)

##### Repeating with conversational memory
Once this is set up, let us repeat the steps from before and ask the model a simple question.\
Then we pass the question plus the answer back into the model for context along with the follow up question.

In [None]:
# restart from the original question
answer = conversation.predict(input=question)
print(answer)

In [None]:
# ConversationChain has already saved the previous question and answer in `memory`,
# so calling memory.save_context() again here would store that turn twice.
# The follow up "tell me more" is sent to Llama together with that context.
followup_answer = conversation.predict(input=followup)
print(followup_answer)

#### Finally, a small RAG pattern
Next, let's explore using Llama 3.1 to answer questions using documents for context. \
This allows us to not rely on Llama 3.1's knowledge but provide better context without needing to finetune.

##### Import the HF and LangChain components
We will import `HuggingFaceEmbeddings` and `RecursiveCharacterTextSplitter` to assist in storing the documents.\
The embedding model is `sentence-transformers/all-mpnet-base-v2`.

##### The Vector Store
We need to store our document in a vector store. \
There are more than <B>30 vector stores (DBs)</B> supported by LangChain. \
For this example we will use FAISS, a popular open source vector store by Facebook.

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(["https://huggingface.co/blog/llama3"])
docs = loader.load()

#### Chunk the docs, build the Vector DB
To store the documents, we will need to split them into chunks using `RecursiveCharacterTextSplitter` and create vector representations \
of these chunks using `HuggingFaceEmbeddings` on them before storing them into our vector database. \
In general, you should use larger chunk sizes for highly structured text such as code and smaller sizes for less structured text. \
You may need to experiment with different chunk sizes and overlap values to find out the best numbers.
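
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter in plain Python. It is only an illustration of the two parameters: `split_with_overlap` is a hypothetical helper, and `RecursiveCharacterTextSplitter` itself works differently, since it prefers to break on separators such as paragraphs and sentences.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters,
    where each window repeats the last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# 25 characters of toy text: digits 0-9 repeated
sample = "".join(str(i % 10) for i in range(25))
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(repr(chunk))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.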

In [None]:
# Split the document into chunks with a specified chunk size
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
all_splits = text_splitter.split_documents(docs)

# Store the document into a vector store with a specific embedding model
vectorstore = FAISS.from_documents(all_splits, HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"))

#### Retrieve using Semantic Search

We then use `RetrievalQA` to retrieve the documents from the vector database and give the model more context on Llama 3.1, thereby reducing hallucination. \
Llama 3.1 also really shines with the new 128k context!

For each question, LangChain runs a semantic similarity search for it in the vector DB, then passes the search results to Llama as the context for answering the question.
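
The retrieval step can be sketched with toy vectors: embed the question, score every stored chunk against it, and keep the best matches as context. The example below uses cosine similarity and hand-written three-dimensional vectors purely for illustration. The real embeddings are much larger, and the FAISS index may use a different metric such as Euclidean distance. `cosine_similarity`, `top_k`, and the toy data are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed_chunks, k=2):
    """Return the texts of the k chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "vector store": (chunk text, made-up embedding)
toy_index = [
    ("chunk about tokenizer", [0.9, 0.1, 0.0]),
    ("chunk about licensing", [0.0, 0.2, 0.9]),
    ("chunk about vocabulary", [0.8, 0.3, 0.1]),
]
toy_query = [1.0, 0.2, 0.0]  # made-up embedding of the user's question

retrieved = top_k(toy_query, toy_index, k=2)
context = "\n".join(retrieved)  # this text would be placed in the prompt as context
print(context)
```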

In [None]:
# use LangChain's RetrievalQA, to associate Llama 3 with the loaded documents stored in the vector db
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever()
)

question = "What's new with Llama 3?"
result = qa_chain({"query": question})
print(f"Response:\n{result['result']}")

#### Followup without memory
Now, let's bring it all together by incorporating follow-up questions.\
First we ask a follow-up question without giving the model the context of the previous conversation. \
Without this context, the answer we get does not relate to our original question.

In [None]:
# no context passed so Llama 3 doesn't have enough context to answer so it lets its imagination run wild
result = qa_chain({"query": "Based on what architecture?"})
print(result['result'])

As before, we need to give the model the context of our previous question. \
This time we use the `ConversationalRetrievalChain` class so that we can ask follow-up questions.

In [None]:
# use ConversationalRetrievalChain to pass chat history for follow up questions
from langchain.chains import ConversationalRetrievalChain
chat_chain = ConversationalRetrievalChain.from_llm(llm, vectorstore.as_retriever(), return_source_documents=True)
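
Roughly speaking, a conversational retrieval chain first rewrites the follow-up, together with the chat history, into a standalone question, and then runs retrieval on that rewritten question. The sketch below shows only how such a rewriting prompt might be assembled from a list of `(question, answer)` tuples. `format_history`, `condense_prompt`, and the prompt wording are hypothetical and do not reproduce LangChain's actual template.

```python
def format_history(chat_history):
    """Render a list of (question, answer) tuples as plain text."""
    return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)


def condense_prompt(chat_history, followup_question):
    """Build a prompt asking the model to rewrite the follow-up as a standalone question."""
    return (
        "Given the conversation below, rewrite the follow-up as a standalone question.\n\n"
        f"{format_history(chat_history)}\n"
        f"Follow-up: {followup_question}\n"
        "Standalone question:"
    )


# Placeholder answer, standing in for the model's earlier reply
toy_history = [("What's new with Llama 3?", "(model answer goes here)")]
condense_text = condense_prompt(toy_history, "Based on what architecture?")
print(condense_text)
```

A well-behaved model would turn this into something like a question about the architecture Llama 3 is based on, which is then a usable search query for the vector store.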

#### Followup with memory

In [None]:
# let's ask the original question "What's new with Llama 3?" again
result = chat_chain({"question": question, "chat_history": []})
print(result['answer'])

In [None]:
# this time we pass chat history along with the follow up so good things should happen
chat_history = [(question, result["answer"])]
followup = "Based on what architecture?"
followup_answer = chat_chain({"question": followup, "chat_history": chat_history})
print(followup_answer['answer'])

In [None]:
# further follow ups can be made possible by updating chat_history like this:
chat_history.append((followup, followup_answer["answer"]))
more_followup = "What changes in vocabulary size?"
more_followup_answer = chat_chain({"question": more_followup, "chat_history": chat_history})
print(more_followup_answer['answer'])

<B>Additional Note</B>: If results get cut off, you can set "max_new_tokens" in the `pipeline_kwargs` of the `HuggingFacePipeline.from_model_id` call above to a larger number (as shown below) to avoid the cut-off.

pipeline_kwargs={"max_new_tokens": 1000, "top_k": 5, "temperature": 0.1}