#### RAG Pipeline Integrated with LLM Hosted on Furiosa RNGD
This notebook demonstrates a retrieval-augmented generation (RAG) pipeline that uses an LLM hosted on Furiosa RNGD.
It is inspired by [Mastering LangChain RAG: Integrating Chat History (Part 2)](https://medium.com/@eric_vaillancourt/mastering-langchain-rag-integrating-chat-history-part-2-4c80eae11b43) \
and the LangChain tutorial [QA Chat with History](https://python.langchain.com/docs/tutorials/qa_chat_history/). \
Before you run the notebook, ensure that the [OpenAI-compatible RNGD server](https://developer.furiosa.ai/latest/en/furiosa_llm/furiosa-llm-serve.html) is running. \
Otherwise, calls to the LLM will fail with a connection error.
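
One way to confirm the server is reachable before running the cells is a quick request to its model-listing route. This is a minimal sketch, assuming the server exposes the standard OpenAI-style `/v1/models` route at the same base URL the notebook uses; the helper names `models_url` and `server_is_up` are illustrative and not part of any library.

```python
import urllib.error
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL for an OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/models"


def server_is_up(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if the server answers the model-listing request with HTTP 200."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    # Assumption: same base URL as the ChatOpenAI client in this notebook.
    print("server reachable:", server_is_up("http://localhost:8000/v1"))
```

If this prints `False`, start the server first; the later cells would otherwise fail when the chain calls the LLM.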

In [1]:
import bs4
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_chroma import Chroma
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
import textwrap

USER_AGENT environment variable not set, consider setting it to identify your requests.
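
The warning above is harmless for this demo. It appears to be emitted when the `langchain_community` document loaders are imported without a `USER_AGENT` environment variable. A minimal sketch of silencing it, assuming the variable is set before those imports run; the value shown is an arbitrary placeholder:

```python
import os

# Placeholder identifier (an assumption); use any string that identifies your requests.
# setdefault leaves an existing USER_AGENT value untouched.
os.environ.setdefault("USER_AGENT", "rag-demo-notebook/0.1")
```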


### Setup 
We need to set up three components from LangChain's suite of integrations:
1. An LLM
2. An embeddings model
3. A vector store

The cell below also populates the vector store: it loads a blog post, splits it into overlapping chunks, and embeds each chunk.

In [2]:


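# LLM served by the local OpenAI-compatible RNGD endpoint
llm = ChatOpenAI(model="meta-llama/Llama-3.1-70B-Instruct", 
                 base_url="http://localhost:8000/v1")
# No base_url is given here, so embeddings are requested from OpenAI's hosted API
# by default (requires OPENAI_API_KEY)
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")


# Prepare the document in the vector store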
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)
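
To illustrate what `chunk_size` and `chunk_overlap` control in the cell above, here is a simplified fixed-window chunker in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally: that class splits on separators such as paragraph and line breaks and merges the pieces up to the size limit, so its chunks are rarely exactly `chunk_size` long. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.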



### Set up the conversational RAG chain

In [None]:
# Construct retriever
retriever = vectorstore.as_retriever()

# Contextualize question
contextualize_q_system_prompt = """Given a chat history and the latest user question \
which might reference context in the chat history, formulate a standalone question \
which can be understood without the chat history. Do NOT answer the question, \
just reformulate it if needed and otherwise return it as is."""

contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)
history_aware_retriever = create_history_aware_retriever(
    llm, retriever, contextualize_q_prompt
)

# Answer question 
qa_system_prompt = """You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, just say that you don't know. \
Use three sentences maximum and keep the answer concise.\

{context}"""
qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", qa_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)
question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)

rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

# Statefully manage chat history
store = {}


def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]


conversational_rag_chain = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)
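
`RunnableWithMessageHistory` looks up the history for the `session_id` given in `config`, supplies it as `chat_history`, and appends the new question and answer after each call. The plain-Python sketch below mimics only that bookkeeping to show that sessions stay separate. The names `toy_store` and `record_turn` and the placeholder answers are hypothetical, and no LangChain classes are involved.

```python
from collections import defaultdict

# Hypothetical stand-in for the notebook's `store`:
# session_id -> list of (role, text) messages.
toy_store: dict[str, list[tuple[str, str]]] = defaultdict(list)


def record_turn(session_id: str, question: str, answer: str) -> None:
    """Append one question/answer exchange to the given session's history."""
    history = toy_store[session_id]
    history.append(("human", question))
    history.append(("ai", answer))


record_turn("abc123", "What is Task Decomposition?", "<answer 1>")
record_turn("abc123", "What are common ways of doing it?", "<answer 2>")
record_turn("xyz789", "What are common ways of doing it?", "<answer 3>")

print(len(toy_store["abc123"]))  # 4 messages: two exchanges
print(len(toy_store["xyz789"]))  # 2 messages: a fresh session with no earlier context
```

Because the notebook's `store` is an in-memory dict, histories are lost when the kernel restarts.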


In [3]:
answer = conversational_rag_chain.invoke(
    {"input": "What is Task Decomposition?"},
    config={"configurable": {"session_id": "abc123"}},
)["answer"]
print(textwrap.fill(answer))

Task decomposition is the process of breaking down a complex task into
smaller, manageable subgoals or steps. This allows an agent or model
to plan ahead and make the task more achievable by transforming it
into multiple, simpler tasks.


### Maintaining History
The follow-up question below reuses the same `session_id`, so the chat history is available to the chain. Note in the answer that the word 'it' is correctly resolved to 'task decomposition'.

In [4]:
answer = conversational_rag_chain.invoke(
    {"input": "What are common ways of doing it?"},
    config={"configurable": {"session_id": "abc123"}},
)["answer"]
print(textwrap.fill(answer))

Task decomposition can be done by LLM with simple prompting, using
task-specific instructions, and through human inputs. Another
approach, LLM+P, involves an external classical planner to do long-
horizon planning, utilizing the Planning Domain Definition Language
(PDDL) as an intermediate interface.
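
The follow-up resolves 'it' because the history-aware retriever rewrites the question before searching: `create_history_aware_retriever` passes the input straight to the retriever when `chat_history` is empty, and otherwise asks the LLM to produce a standalone query. The sketch below reproduces only that control flow. `toy_reformulate` is a hypothetical string-replacement stand-in for the LLM call, not what the model actually does.

```python
from typing import Callable

Message = tuple[str, str]  # (role, text)


def history_aware_query(
    question: str,
    chat_history: list[Message],
    reformulate: Callable[[str, list[Message]], str],
) -> str:
    """Return the query that would be sent to the retriever."""
    if not chat_history:
        return question  # no history: use the question as is
    return reformulate(question, chat_history)


def toy_reformulate(question: str, chat_history: list[Message]) -> str:
    """Hypothetical stand-in for the LLM: swap 'it' for the last human topic."""
    last_human = [text for role, text in chat_history if role == "human"][-1]
    topic = last_human.removeprefix("What is ").rstrip("?")
    return question.replace(" it?", f" {topic}?")


history = [
    ("human", "What is Task Decomposition?"),
    ("ai", "Task decomposition is the process of breaking down a complex task..."),
]
query = history_aware_query("What are common ways of doing it?", history, toy_reformulate)
print(query)  # What are common ways of doing Task Decomposition?
```

With an empty history the same call returns the question unchanged, which is why the first question in a session goes to the retriever without an extra LLM round trip.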
