# RAG Phases: LangChain Quickstart
* We will build a simple RAG app over a text data source.
* We will see how LangSmith can help us trace and understand the RAG app.

## Architecture
Steps of the Indexing phase:
* Load text with a [Document Loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/).
* Split text into small chunks with a [Splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/).
* Convert chunks into [embeddings](https://python.langchain.com/docs/modules/data_connection/text_embedding/) and store them in a [vector database, also called vector store](https://python.langchain.com/docs/modules/data_connection/vectorstores/).

Steps of the Retrieval and Generation phase:
* Given a question, the most relevant chunks are retrieved from the vector database using a [Retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/).
* The question and the chunks are sent to the [LLM](https://python.langchain.com/docs/modules/model_io/llms/) in a prompt. The LLM then produces the answer. You can use a [ChatModel](https://python.langchain.com/docs/modules/model_io/chat) instead of an LLM.

## Resources available from LangChain
* [RAG Quickstart Guide](https://python.langchain.com/docs/use_cases/question_answering/quickstart).
* [How to get the source documents used to produce a RAG answer](https://python.langchain.com/docs/use_cases/question_answering/sources).
* [How to stream RAG answers](https://python.langchain.com/docs/use_cases/question_answering/streaming).
* [How to add chat history to a RAG app](https://python.langchain.com/docs/use_cases/question_answering/chat_history).
* [How to do RAG when each user has their own private data](https://python.langchain.com/docs/use_cases/question_answering/per_user).
* [How to use Agents for RAG](https://python.langchain.com/docs/use_cases/question_answering/conversational_retrieval_agents).
* [How to use RAG with local models](https://python.langchain.com/docs/use_cases/question_answering/local_retrieval_qa).

## Setup

#### Recommended: create a new virtualenv
* pyenv virtualenv 3.11.4 your_venv_name
* pyenv activate your_venv_name
* pip install jupyterlab
* jupyter lab

#### Dependencies
* OpenAI Chat Model.
* OpenAI Embeddings.
* Chroma Vector Database.

#### Necessary Modules:
* langchain
* langchain-community
* langchainhub
* langchain-openai
* chromadb
* bs4: for web scraping

In [1]:
#%pip install --upgrade --quiet  langchain langchain-community langchainhub langchain-openai chromadb bs4

NOTE: In a Jupyter Notebook, both `%pip` and `!pip` are used to manage Python packages, but they operate slightly differently due to the way Jupyter interprets these commands.

1. **`%pip` Command**: This is a Jupyter-specific magic command. When you use `%pip`, it ensures that the package installation or any other pip operation you are performing happens in the current kernel's environment. This is important in Jupyter because notebooks can be connected to different Python environments (kernels). The `%pip` magic command helps to avoid confusion about where packages are being installed, especially in cases where you might have multiple environments or kernels. It was introduced to address the common issue where users would install a package using `!pip` and then discover it was not available in the notebook because it was installed in a different Python environment.

2. **`!pip` Command**: This command invokes the system shell and runs `pip` as if you were running it in your terminal or command prompt. It does install or upgrade packages, but not necessarily into the environment the notebook's kernel is using: `!pip` runs whichever `pip` the shell finds first on its `PATH`, typically the one from the environment in which Jupyter was started, which may differ from the kernel's environment. Before `%pip` was introduced, this often left users wondering why a freshly installed package could not be imported in their notebook.

In summary, while both commands can be used to install Python packages in Jupyter Notebooks, `%pip` is more Jupyter-aware and ensures that packages are installed in the environment associated with the current notebook's kernel, thus avoiding common environment mismatch issues. It is generally recommended to use `%pip` when working in Jupyter Notebooks to ensure consistency and avoid potential confusion.
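
One way to check which environment a notebook kernel is using is to look at `sys.executable`. The sketch below builds a pip command bound to that interpreter. The helper name `pip_install_command` is hypothetical, and it only constructs the command without running it.

```python
import sys


def pip_install_command(packages):
    """Build a pip command tied to the interpreter running this code."""
    # "python -m pip" uses the pip that belongs to that exact interpreter,
    # which is the environment-matching behaviour described for %pip above.
    return [sys.executable, "-m", "pip", "install", *packages]


cmd = pip_install_command(["langchain", "chromadb"])
print(" ".join(cmd))
```

Printing `sys.executable` in a notebook cell and comparing it with the output of `which pip` in a terminal is a quick way to spot an environment mismatch.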

#### .env File
Remember to include:
OPENAI_API_KEY=your_openai_api_key

LANGCHAIN_TRACING_V2=true
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
LANGCHAIN_API_KEY=your_langchain_api_key
LANGCHAIN_PROJECT=your_project_name

* Our LangSmith project name: RAGquickstart

#### Recommended: activate LangSmith from the beginning
* LangSmith credentials in the .env file:
    * LANGCHAIN_TRACING_V2=true
    * LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
    * LANGCHAIN_API_KEY=your_key
    * LANGCHAIN_PROJECT=your_project_name

In [2]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]
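
Accessing `os.environ["OPENAI_API_KEY"]` raises a `KeyError` when the variable is missing. A small check like the one below can report every missing variable at once. This is a sketch: the helper `missing_env_vars` and the `REQUIRED_VARS` tuple are hypothetical names, and the example uses a plain dictionary instead of the real environment.

```python
import os

# Hypothetical list of variables this notebook expects in the .env file.
REQUIRED_VARS = ("OPENAI_API_KEY", "LANGCHAIN_API_KEY")


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with a stand-in environment where only one key is defined.
example_env = {"OPENAI_API_KEY": "sk-placeholder"}
missing = missing_env_vars(REQUIRED_VARS, example_env)
print(missing)
```

Calling `missing_env_vars(REQUIRED_VARS)` without the second argument checks the real environment after `load_dotenv` has run.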

## Open LangSmith to track the following operations
* smith.langchain.com

## Goal of our RAG App
* RAG App over the blog post [LLM Powered Autonomous Agents](https://lilianweng.github.io/posts/2023-06-23-agent/) by Lilian Weng.

#### Required modules

In [4]:
import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

For demo purposes, we will import these modules again in the next cells.

# Phase 1: Indexing

## Step #1 of the Indexing phase: Load data
* WebBaseLoader loads HTML from web URLs.
* BeautifulSoup (bs4) selects which HTML classes to keep and parses them into text.

In [5]:
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Only keep post title, headers, and content from the full HTML.
bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs={"parse_only": bs4_strainer},
)
docs = loader.load()

In [6]:
len(docs[0].page_content)

42823

In [7]:
print(docs[0].page_content[:500])



      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In


#### Additional info on Loading Data
* [Documentation on Document Loaders](https://python.langchain.com/docs/modules/data_connection/document_loaders/). Interesting points:
    * PDF. Check the Multimodal LLM Apps section.
    * CSV.
    * JSON.
    * HTML.
* [Integrations with 3rd Party Document Loaders](https://python.langchain.com/docs/integrations/document_loaders/). Interesting points:
    * Email.
    * Github.
    * Google Drive.
    * Hugging Face dataset.
    * Images.
    * Jupyter Notebook.
    * MS Excel.
    * MS Powerpoint.
    * MS Word.
    * MongoDB.
    * Pandas DataFrame.
    * Twitter.
    * URL.
    * Unstructured file.
    * WebBaseLoader.
    * YouTube audio.
    * YouTube transcription.

## Step #2 of the Indexing Phase: Split data into small chunks
* We’ll split our documents into chunks of 1000 characters with 200 characters of overlap between chunks.
* We use the RecursiveCharacterTextSplitter, which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.
* We set add_start_index=True so that the character index at which each split Document starts within the initial Document is preserved as the metadata attribute “start_index”.
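
To make the roles of chunk size, overlap, and start index concrete, here is a simplified sketch in plain Python. It cuts at fixed character positions, which is an assumption made for brevity: the RecursiveCharacterTextSplitter instead looks for separators such as new lines, so its chunks are usually shorter than the limit. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut `text` into fixed-size chunks that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(
            {"page_content": text[start:start + chunk_size], "start_index": start}
        )
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


sample_text = "".join(str(i % 10) for i in range(2500))
sample_chunks = split_with_overlap(sample_text)
for chunk in sample_chunks:
    print(chunk["start_index"], len(chunk["page_content"]))
```

Each chunk starts 800 characters after the previous one, so the last 200 characters of one chunk are repeated at the beginning of the next.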

In [8]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

In [9]:
len(all_splits)

66

In [10]:
len(all_splits[0].page_content)

969

In [11]:
all_splits[10].metadata

{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/',
 'start_index': 7056}

#### Additional info on Splitting chunks of text
* [Documentation on Text Splitters](https://python.langchain.com/docs/modules/data_connection/document_transformers/). Interesting points:
    * Recursively split by character.
    * Semantic chunking.
    * Recursively split JSON.

## Step #3 of the Indexing Phase: Transform chunks into embeddings and store them in the vector database
* Now we will index our 66 text chunks so that we can search over them.
* When we want to search over the chunks, we take a text search query, embed it, and perform some sort of “similarity” search to identify the stored chunks with the most similar embeddings to our query embedding.
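
The idea of "embed the query, then rank stored chunks by similarity" can be sketched without any external service. The example below is an assumption-heavy toy: it uses word counts as a stand-in for real embeddings (OpenAI embeddings are dense vectors produced by a model) and cosine similarity as the similarity measure. The names `embed`, `cosine`, and `top_k` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, chunks, k=2):
    """Return the k chunks whose vectors are most similar to the query vector."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


toy_chunks = [
    "task decomposition splits a big task into steps",
    "memory stores past observations",
    "tools let the agent call external APIs",
]
best = top_k("what is task decomposition", toy_chunks, k=1)
print(best)
```

A real vector store precomputes and stores the chunk embeddings once, rather than recomputing them for each query as this sketch does.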

In [12]:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())

#### Additional info on Embeddings
* [Documentation on Embeddings](https://python.langchain.com/docs/modules/data_connection/text_embedding).
* [Integration with 3rd Party Embedding Models](https://python.langchain.com/docs/integrations/text_embedding/). Interesting points:
    *  OpenAI.
    *  Cohere.
    *  Fireworks.
    *  Google Vertex AI.
    *  Hugging Face.
    *  Llama.cpp.
    *  MistralAI.
    *  Ollama.

#### Additional info on Vector Databases (also called Vector Stores)
* [Documentation on Vector Databases](https://python.langchain.com/docs/modules/data_connection/vectorstores/).
* [Integration with 3rd Party Vector Databases](https://python.langchain.com/docs/integrations/vectorstores/). Interesting points:
    * Chroma.
    * Faiss.
    * Postgres.
    * PGVector.
    * Databricks.
    * Pinecone.
    * Redis.
    * Supabase.
    * LanceDB.
    * MongoDB.
    * Neo4j.
    * Activeloop DeepLake. 

# Phase 2: Retrieval
* We are going to create a RAG app that
    * takes a user question,
    * searches for documents relevant to that question,
    * passes the retrieved documents and initial question to a model,
    * and returns an answer.
* First we need to define our logic for searching over documents. LangChain defines a Retriever interface which wraps an index that can return relevant Documents given a string query.
* The most common type of Retriever is the VectorStoreRetriever, which uses the similarity search capabilities of a vector store to facilitate retrieval. Any VectorStore can easily be turned into a Retriever with VectorStore.as_retriever().

#### Set the retriever
A "retriever" refers to a component or object that is used to retrieve (or search for) specific information from a database or a collection of data. The retriever is designed to work with a vector database that stores data in the form of embeddings.

Embeddings are numerical representations of data (in this case, text data split into smaller chunks) that capture the semantic meaning of the original content. These embeddings are created using models like the ones provided by OpenAI (for example, through `OpenAIEmbeddings`), which convert textual content into high-dimensional vectors that numerically represent the semantic content of the text.

The vector database, in this context represented by `Chroma`, is a specialized database that stores these embeddings. The primary function of this database is to enable efficient similarity searches. That is, given a query in the form of an embedding, the database can quickly find and return the most similar embeddings (and by extension, the corresponding chunks of text or documents) stored within it.

When the code sets the `retriever` as `vectorstore.as_retriever()`, it essentially initializes a retriever object with the capability to perform these similarity-based searches within the `Chroma` vector database. This means that the retriever can be used to find the most relevant pieces of information (text chunks, in this scenario) based on a query. This is particularly useful in applications that need to retrieve information based on semantic similarity rather than exact keyword matches.
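
The relationship between a vector store and its retriever can be illustrated with a toy wrapper. This is only a sketch of the pattern, not LangChain's implementation: `ToyVectorStore` and `ToyRetriever` are hypothetical classes, and the "similarity" here is a simple count of shared words rather than a distance between embeddings.

```python
class ToyVectorStore:
    """Hypothetical store that ranks texts by the number of words shared with the query."""

    def __init__(self, texts):
        self.texts = list(texts)

    def similarity_search(self, query, k=4):
        query_words = set(query.lower().split())
        ranked = sorted(
            self.texts,
            key=lambda text: len(query_words & set(text.lower().split())),
            reverse=True,
        )
        return ranked[:k]

    def as_retriever(self, search_kwargs=None):
        # The retriever only remembers the store and the search settings.
        k = (search_kwargs or {}).get("k", 4)
        return ToyRetriever(self, k)


class ToyRetriever:
    """Hypothetical retriever: a thin wrapper exposing a single invoke() method."""

    def __init__(self, store, k):
        self.store = store
        self.k = k

    def invoke(self, query):
        return self.store.similarity_search(query, k=self.k)


toy_store = ToyVectorStore(
    [
        "task decomposition breaks a task into steps",
        "chain of thought prompting",
        "vector stores hold embeddings",
    ]
)
toy_retriever = toy_store.as_retriever(search_kwargs={"k": 2})
toy_results = toy_retriever.invoke("what is task decomposition")
print(toy_results)
```

The point of the wrapper is that downstream code only needs `invoke(query)`, regardless of which store sits behind it.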

In [13]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

In [14]:
retrieved_docs = retriever.invoke("What are the approaches to Task Decomposition?")

* When we execute the previous cell, we start seeing our project on LangSmith.

In [15]:
len(retrieved_docs)

6

In [16]:
print(retrieved_docs[0].page_content)

Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.


#### Additional information on Retrievers
* [Documentation on Retrievers](https://python.langchain.com/docs/modules/data_connection/retrievers/). Interesting points:
    * Vectorstore.
    * Time-Weighted Vectorstore: most recent first.
    * Advanced Retrievers. 
* [Integrations with 3rd Party Retrieval Services](https://python.langchain.com/docs/integrations/retrievers/).

# Phase 3: Generation
* We will put it all together into a chain that takes a question, retrieves relevant documents, constructs a prompt, passes that to a model, and parses the output.
* We’ll use the gpt-3.5-turbo OpenAI chat model, but any LangChain LLM or ChatModel could be substituted in.

#### Determine which Foundation LLM we will use

In [17]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo-0125", temperature=0)

#### Set the prompt
* In this case, we will use a [pre-defined prompt from the LangSmith Prompt Hub](https://smith.langchain.com/hub/rlm/rag-prompt?organizationId=ec6a7494-139b-5170-8044-143bc78622a9).

In [18]:
from langchain import hub

prompt = hub.pull("rlm/rag-prompt")

In [None]:
# To print the prompt we have selected from the LangSmith Prompt Hub
prompt

* Alternative with a custom prompt:

In [2]:
# from langchain_core.prompts import ChatPromptTemplate

# template = """Answer the question based only on the following context:
# {context}

# Question: {question}
# """

# prompt = ChatPromptTemplate.from_template(template)

In [19]:
example_messages = prompt.invoke(
    {"context": "filler context", 
     "question": "filler question"}
).to_messages()

example_messages

[HumanMessage(content="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: filler question \nContext: filler context \nAnswer:")]

* You can see the previous operation reflected on LangSmith.

In [22]:
print(example_messages[0].content)

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: filler question 
Context: filler context 
Answer:


#### Define a pre-processing function to better format the document
* This step is optional

In [None]:
# Pre-processing: join the retrieved chunks into one context string for the prompt
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
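
To see what `format_docs` produces, it can be run on stand-in objects. In this sketch, `FakeDoc` is a hypothetical replacement for LangChain's Document class that carries only the `page_content` attribute the function needs.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in exposing only the attribute format_docs reads."""
    page_content: str


def format_docs(docs):
    # Same logic as the cell above: one blank line between chunks.
    return "\n\n".join(doc.page_content for doc in docs)


context = format_docs([FakeDoc("First chunk."), FakeDoc("Second chunk.")])
print(context)
```

The result is a single string, which is what the `{context}` placeholder of the prompt expects.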

#### Define the RAG chain
Pay attention to these details:
* See how we use the retriever as context
* See how we use the pre-processing function
* See how we use the StrOutputParser to format the answer as a string

In [23]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

* Using the LCEL Runnable protocol to define the chain allows us to:
    * pipe together components and functions in a transparent way.
    * automatically trace our chain in LangSmith.
    * get streaming, async, and batched calling out of the box.
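
The `|` syntax works because Python lets a class define the `__or__` method. The sketch below shows the general mechanism with a hypothetical `Step` class. It is an illustration of operator-based piping under that assumption, not LangChain's actual Runnable implementation.

```python
class Step:
    """Hypothetical pipeline step wrapping a single-argument function."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `a | b` builds a new step that runs a first, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


strip_step = Step(str.strip)
upper_step = Step(str.upper)
exclaim_step = Step(lambda text: text + "!")

pipeline = strip_step | upper_step | exclaim_step
piped = pipeline.invoke("  what is task decomposition  ")
print(piped)
```

Because each `|` returns another `Step`, chains of any length expose the same `invoke` method as a single step.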

### Runnables
A "Runnable" is like a tool or a small machine in a factory line that has a specific job. You can think of it as a mini-program that knows how to do one particular task, like read data, change it, or make decisions based on that data.

A "Runnable" in the context of LangChain is essentially a component or a piece of code that can be executed or "run". 

It acts like a building block that can perform a specific operation or process data in a particular way.

The chaining or composition of Runnables allows you to build a sequence where data flows from one Runnable to another, each performing its function, ultimately leading to a final result.

In essence, a Runnable makes it easy to define and reuse modular operations that can be combined in various ways to build complex data processing and interaction pipelines efficiently.

#### RunnableParallel
RunnableParallel is like a manager that can handle multiple tasks at once in a software system. Instead of doing one job at a time, it can do several jobs simultaneously, making the process faster and more efficient.

RunnableParallel is used to manage different components that do tasks like retrieving information, passing questions, or formatting data, and then brings together all their outputs neatly. This makes it very handy for building complex operations where multiple, independent processes need to run at the same time.

#### RunnablePassthrough
RunnablePassthrough is a simple tool that lets data pass through it without making any changes. You can think of it like a clear tube: whatever you send into one end comes out the other end unchanged.

When working with data transformations and pipelines, RunnablePassthrough is often used alongside other tools that might modify the data. For example, when multiple operations are run in parallel, you might want to let some data flow through unaltered while other data gets modified or processed differently.

RunnablePassthrough is often used in combination with RunnableParallel:

In [None]:
# from langchain_core.runnables import RunnableParallel, RunnablePassthrough

# runnable = RunnableParallel(
#     passed=RunnablePassthrough(),
#     modified=lambda x: x["num"] + 1,
# )

# runnable.invoke({"num": 1})

##### Output:
{'passed': {'num': 1}, 'modified': 2}

**In the previous example:** 
   - `RunnableParallel` is set up with two keys, `passed` and `modified`.
   - Under the `passed` key, `RunnablePassthrough` is used. This means that the data under this key remains unchanged. In the example, it passes the input `{'num': 1}` directly to the output.
   - Under the `modified` key, a function modifies the data by adding 1 to the value of `num`, so the output becomes 2.

**In our example:**
   - Here, `RunnablePassthrough` helps in a retrieval scenario where it is used to directly pass the user's question to a model that needs it to generate a response.
   - While `RunnablePassthrough` takes care of the question, another component fetches relevant context data to be used with the question for generating the response.

In summary, RunnablePassthrough is useful when you want to ensure some parts of the input data are transmitted through a system or process without any changes, while other parts might be actively modified or enriched.
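
The first step of our RAG chain, a dictionary with a `context` branch and a `question` branch, can be mimicked in plain Python. This is a sketch under stated assumptions: `run_parallel`, `fake_retrieve_context`, and `passthrough` are hypothetical names, the retrieval is faked, and the branches run one after another here rather than concurrently.

```python
def run_parallel(branches, value):
    """Send the same input to every branch and collect the outputs by key."""
    return {key: branch(value) for key, branch in branches.items()}


def fake_retrieve_context(question):
    # Stand-in for `retriever | format_docs`.
    return f"(chunks retrieved for: {question})"


def passthrough(value):
    # Stand-in for RunnablePassthrough: returns the input unchanged.
    return value


prompt_inputs = run_parallel(
    {"context": fake_retrieve_context, "question": passthrough},
    "What is Task Decomposition?",
)
print(prompt_inputs)
```

The resulting dictionary has exactly the two keys the prompt template needs, with the question preserved as typed by the user.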


#### We can start asking questions to our RAG application

* We will use streaming, so the answer arrives piece by piece, as in ChatGPT, instead of being printed all at once.

In [24]:
for chunk in rag_chain.stream("What is Task Decomposition?"):
    print(chunk, end="", flush=True)

Task decomposition is a technique that breaks down complex tasks into smaller, more manageable steps. It involves transforming big tasks into multiple simpler tasks to aid in the interpretation of the model's thinking process. Task decomposition can be done using prompting techniques, task-specific instructions, or human inputs.

* Now you can see the previous operation reflected on LangSmith.
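
The streaming loop above follows the usual Python pattern of consuming a generator. The sketch below imitates it with a hypothetical `fake_stream` generator that yields one word at a time. A real model streams tokens as they are generated, which this toy does not attempt to reproduce.

```python
def fake_stream(answer):
    """Yield the answer word by word, imitating chunks arriving over time."""
    for word in answer.split(" "):
        yield word + " "


pieces = []
for chunk in fake_stream("Task decomposition breaks big tasks into steps."):
    # end="" keeps the chunks on one line; flush=True shows each one immediately.
    print(chunk, end="", flush=True)
    pieces.append(chunk)
print()
```

Joining the collected pieces gives back the full answer, which is what `invoke` would return in a single step.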

#### Alternative approach with a customized prompt and without streaming.

In [25]:
from langchain_core.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.

{context}

Question: {question}

Helpful Answer:"""
custom_rag_prompt = PromptTemplate.from_template(template)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | custom_rag_prompt
    | llm
    | StrOutputParser()
)

rag_chain.invoke("What is Task Decomposition?")

'Task decomposition is the process of breaking down complex tasks into smaller and simpler steps to make them more manageable for autonomous agents or AI systems. This approach allows for better planning and execution of tasks by focusing on individual components rather than tackling the entire task at once. Thanks for asking!'

* You can see the previous operation reflected on LangSmith.
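
Conceptually, the prompt step fills the `{context}` and `{question}` placeholders before the text reaches the model. The sketch below shows that substitution with Python's built-in `str.format`. The shortened template and the helper `fill_prompt` are assumptions for illustration, and `PromptTemplate` offers more than this (for example, input validation).

```python
# Shortened version of the custom template, kept only for illustration.
TEMPLATE = """Use the following pieces of context to answer the question at the end.

{context}

Question: {question}

Helpful Answer:"""


def fill_prompt(context, question):
    """Substitute the two placeholders with concrete values."""
    return TEMPLATE.format(context=context, question=question)


filled = fill_prompt(
    "Task decomposition splits work into steps.",
    "What is Task Decomposition?",
)
print(filled)
```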

#### Additional Information on Chat Models and LLM Models
* [Documentation on Chat Models](https://python.langchain.com/docs/modules/model_io/chat/). Interesting points:
    * How-To Guides.
    * Function calling.
    * Streaming. 
* [Integration with 3rd Party Chat Models](https://python.langchain.com/docs/integrations/chat/). Interesting points:
    * OpenAI.
    * Anthropic.
    * Azure OpenAI.
    * Cohere.
    * Fireworks.
    * Google AI.
    * Hugging Face.
    * Llama 2 Chat.
    * MistralAI.
    * Ollama.
* [Documentation on LLM Models](https://python.langchain.com/docs/modules/model_io/llms). Interesting points:
    * How-To Guides.
    * Streaming.
    * Tracking token usage. 
* [Integrations with 3rd Party LLM Models](https://python.langchain.com/docs/integrations/llms). Interesting points:
    * OpenAI.
    * Anthropic.
    * Azure OpenAI.
    * Cohere.
    * Fireworks.
    * Google AI.
    * GPT4All.
    * Hugging Face.
    * Llama.cpp.
    * Ollama.
    * Replicate.