# Lab Statement: Introduction to Creating RAGs (Retrieval-Augmented Generators) with OpenAI

## Name: David Santiago Castro

### Overview
One of the most powerful applications enabled by LLMs is sophisticated question-answering chatbots. These are applications that can answer questions about specific source information. They use a technique known as Retrieval-Augmented Generation, or RAG.

### Concepts
We will cover the following concepts:

- Indexing: a pipeline for ingesting data from a source and indexing it. This usually happens in a separate process.
- Retrieval and generation: the actual RAG process, which takes the user query at run time, retrieves the relevant data from the index, and then passes it to the model.

Once we’ve indexed our data, we will use an agent as our orchestration framework to implement the retrieval and generation steps.

### Installation

We require these langchain dependencies:



In [130]:
%pip install langchain langchain-text-splitters langchain-community bs4 

Note: you may need to restart the kernel to use updated packages.






### LangSmith

LangSmith is a tool that helps us see what is happening inside our LangChain applications. Since these applications often involve many steps and multiple LLM calls, it can be difficult to understand how everything works. LangSmith allows us to log and inspect the process, so we can track each step and debug more easily. To use it, we must first register and then configure our environment variables so that all activity in our chain or agent is logged for review.


In [131]:
import os
from dotenv import load_dotenv
from langchain.chat_models import init_chat_model

load_dotenv()  

langchain_tracing_v2 = os.getenv("LANGCHAIN_TRACING_V2")
langchain_api_key = os.getenv("LANGCHAIN_API_KEY")
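
The cell above reads the two LangSmith variables but does not check them. As a minimal sketch, assuming the same variable names, a small helper can report which ones are unset before any chain runs. The helper name `missing_env_vars` is hypothetical and is not part of LangChain or LangSmith.

```python
import os


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Variable names taken from the cell above.
required = ["LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print(f"LangSmith tracing is not fully configured; missing: {missing}")
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real environment.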



### Components
We will need to select three components from LangChain’s suite of integrations: a chat model, an embeddings model, and a vector store. For the chat model, we select OpenAI.

In [132]:
%pip install -U "langchain[openai]" python-dotenv langchain-openai

model = init_chat_model(
    "gpt-4o",
    model_provider="openai",
    openai_api_key=os.getenv("GITHUB_TOKEN"),
    openai_api_base="https://models.inference.ai.azure.com"
)




Select an embeddings model:


In [133]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-large",
    openai_api_key=os.getenv("GITHUB_TOKEN"),
    openai_api_base="https://models.inference.ai.azure.com"
)

We selected Pinecone as the vector database as requested:


In [134]:
%pip install -qU langchain-pinecone



In [135]:
from pinecone import Pinecone

pinecone_api_key = os.getenv("PINECONE_API_KEY")
pc = Pinecone(api_key=pinecone_api_key)

Before initializing our vector store, let’s connect to a Pinecone index. If an index with the name stored in index_name doesn’t exist, it will be created.

In [136]:
from pinecone import ServerlessSpec
from langchain_pinecone import PineconeVectorStore

index_name = "langchain-test-index-v2" 

if not pc.has_index(index_name):
    pc.create_index(
        name=index_name,
        dimension=3072,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

index = pc.Index(index_name)
vector_store = PineconeVectorStore(embedding=embeddings, index=index)

# The document chunks are added to the store in the Indexing section, once all_splits exists.


### Indexing


Indexing commonly works as follows:
- Load: First we need to load our data. This is done with Document Loaders.
- Split: Text splitters break large Documents into smaller chunks. This is useful both for indexing data and for passing it into a model, as large chunks are harder to search over and won’t fit in a model’s finite context window.
- Store: We need somewhere to store and index our splits, so that they can be searched over later. This is often done using a VectorStore and an Embeddings model.

### Loading documents

We first need to load the blog post content, and for that we use DocumentLoaders, which are tools that bring in data from a source and turn it into a list of Document objects we can work with. In our case, we’ll use the WebBaseLoader, which relies on urllib to fetch HTML from web URLs and then uses BeautifulSoup to convert that HTML into text. We can adjust how BeautifulSoup does the parsing by passing parameters through bs_kwargs. Since we only care about HTML tags with the classes post-content, post-title, or post-header, we’ll filter out everything else so that we only keep the relevant text.


In [137]:
import bs4
from langchain_community.document_loaders import WebBaseLoader

bs4_strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs={"parse_only": bs4_strainer},
)
docs = loader.load()

assert len(docs) == 1
print(f"Total characters: {len(docs[0].page_content)}")

Total characters: 43047
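
To see the effect of the strainer in isolation, the following sketch applies the same SoupStrainer to a small inline HTML string instead of the real blog post, so no network call is needed. The HTML snippet and its class names other than the three post classes are made up for illustration, and the sketch assumes the bs4 package installed earlier is available.

```python
import bs4

# Hypothetical page: only the post-title and post-content elements should survive.
html = """
<html><body>
<nav class="site-nav">Home | About</nav>
<h1 class="post-title">Sample Post</h1>
<div class="post-content"><p>Body text we want to keep.</p></div>
<footer class="site-footer">Copyright</footer>
</body></html>
"""

# Same filter as in the loader cell: keep only elements with these classes.
strainer = bs4.SoupStrainer(class_=("post-title", "post-header", "post-content"))
soup = bs4.BeautifulSoup(html, "html.parser", parse_only=strainer)

# Join the surviving text fragments, one per line.
text = soup.get_text(separator="\n", strip=True)
print(text)
```

The navigation bar and footer never enter the parsed tree, which is why the loaded document contains only the post text.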


In [138]:
print(docs[0].page_content[:500])



      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In


We started by using a DocumentLoader, which is an object that takes data from a source and turns it into a list of Documents we can work with. More than 160 loader integrations are available, and the BaseLoader API reference documents the core interface.

Since our blog post has over 43,000 characters, it’s too long to fit into the context window of many models. Even if a model could handle the full text, searching through such a large input would be inefficient. To solve this, we split the document into smaller chunks that can be embedded and stored in a vector database. This way, at runtime we only retrieve the most relevant fragments instead of the entire post.

For splitting, we use a RecursiveCharacterTextSplitter, which breaks the text down step by step using common separators such as new lines. It keeps dividing until each chunk is within the configured size. This method is recommended for general text use cases because it balances efficiency with preserving meaningful structure.


In [139]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  
    chunk_overlap=200,  
    add_start_index=True,  
)
all_splits = text_splitter.split_documents(docs)

print(f"Split blog post into {len(all_splits)} sub-documents.")

Split blog post into 63 sub-documents.
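
To build intuition for what the splitter does, here is a simplified pure-Python sketch of recursive splitting. It is not LangChain's implementation: the function name `recursive_split` is hypothetical, the separator list is an assumption, and chunk overlap is deliberately left out to keep the idea visible.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Tries the coarsest separator first and falls back to finer ones
    only for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb cccc\n\ndddd eeee"
pieces = recursive_split(sample, chunk_size=10)
print(pieces)
```

The first paragraph is too long for one chunk, so it is split again on spaces, while the second paragraph fits and stays whole.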


We use a TextSplitter to break our list of Documents into smaller chunks, making them easier to store and retrieve later. Once we have these fragments, we need to index them so they can be searched at runtime. In our case, we ended up with 63 text chunks. Following the semantic search tutorial, our approach is to embed the content of each chunk and insert those embeddings into a vector store. This allows us to perform vector searches: when we get a query, we can quickly find and return only the most relevant pieces of text. We can embed and store all our document splits in one step, using the vector store and embedding model we selected at the beginning of the tutorial.


In [140]:
document_ids = vector_store.add_documents(documents=all_splits)

print(document_ids[:3])

['7beb196c-7f56-4e4a-8d45-38a8e89632d4', '380dd8f6-3b0f-4a9f-b84a-f7a5ef5726e9', 'abcb1eb9-7ddb-4d0e-af7b-3f0f0d7a4dcc']


We use Embeddings to turn text into numerical representations that capture meaning, and then store those embeddings in a VectorStore, which is basically a searchable database for vectors. With this setup, we finish the indexing part of our pipeline: we now have a vector store filled with the blog’s content, split into chunks. When a user asks a question, we can retrieve the most relevant chunks from the store and then generate an answer by passing both the question and those retrieved pieces to the model.
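
The idea of embedding text and searching by similarity can be sketched without any external service. The toy store below uses word counts as a stand-in for real embeddings and ranks entries by cosine similarity. The class `ToyVectorStore` and the function `embed` are hypothetical names for illustration only; a real embeddings model returns dense vectors that capture meaning far better than word overlap.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a sparse word-count vector (an assumption, not a real model)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    def __init__(self):
        self._entries = []  # list of (vector, original text) pairs

    def add_texts(self, texts):
        for text in texts:
            self._entries.append((embed(text), text))

    def similarity_search(self, query, k=2):
        query_vec = embed(query)
        ranked = sorted(
            self._entries,
            key=lambda entry: cosine_similarity(query_vec, entry[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "task decomposition breaks a big goal into smaller steps",
    "memory lets the agent recall earlier observations",
    "tool use lets the agent call external apis",
])
top = store.similarity_search("what is task decomposition", k=1)
print(top)
```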

We want to build a simple application that takes a user’s question, finds the most relevant documents, and then passes both the question and those documents to a model to generate an answer. To demonstrate this, we’ll show two approaches: first, a RAG agent that performs searches using a basic tool, which works well as a general-purpose solution; and second, a two-step RAG chain that uses only one LLM call per query, making it faster and very effective for straightforward questions.
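
The two-step chain can be sketched in plain Python to show why it needs only one model call: retrieval always runs first, and its results are placed in a single prompt. The names `build_rag_prompt`, `two_step_rag`, `fake_retrieve`, and `fake_llm` are hypothetical stand-ins, as is the prompt wording; in the notebook, the retriever would be the vector store and the model would be the chat model configured earlier.

```python
def build_rag_prompt(question, documents):
    """Combine the retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(documents)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def two_step_rag(question, retrieve, llm, k=2):
    documents = retrieve(question, k)                # step 1: retrieval, no model involved
    prompt = build_rag_prompt(question, documents)
    return llm(prompt)                               # step 2: the single model call


# Stand-ins so the sketch runs without a vector store or an API key.
def fake_retrieve(question, k):
    corpus = ["Chunk about task decomposition.", "Chunk about planning.", "Chunk about memory."]
    return corpus[:k]


llm_calls = []


def fake_llm(prompt):
    llm_calls.append(prompt)
    return "stub answer"


answer = two_step_rag("What is task decomposition?", fake_retrieve, fake_llm)
print(answer, "| model calls:", len(llm_calls))
```

The trade-off is that the chain cannot decide to search again, which is what the agent approach adds.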


### RAG Agents

A RAG agent is a simple way to build a RAG application: we give the agent a tool that can retrieve information. In practice, we can create a minimal agent by implementing a tool that connects directly to our vector store, so the agent can search for relevant text chunks and use them to answer user questions.


In [141]:
from langchain.tools import tool

@tool(response_format="content_and_artifact")
def retrieve_context(query: str):
    """Retrieve information to help answer a query."""
    retrieved_docs = vector_store.similarity_search(query, k=2)
    serialized = "\n\n".join(
        (f"Source: {doc.metadata}\nContent: {doc.page_content}")
        for doc in retrieved_docs
    )
    return serialized, retrieved_docs

We build the agent:


In [142]:
from langchain.agents import create_agent


tools = [retrieve_context]
prompt = (
    "You have access to a tool that retrieves context from a blog post. "
    "Use the tool to help answer user queries."
)
agent = create_agent(model, tools, system_prompt=prompt)

We construct a question that would normally require an iterative sequence of retrieval steps to answer:

In [None]:
query = (
    "What is the standard method for Task Decomposition?\n\n"
    "Once you get the answer, look up common extensions of that method."
)

for event in agent.stream(
    {"messages": [{"role": "user", "content": query}]},
    stream_mode="values",
):
    event["messages"][-1].pretty_print()

Note that the agent:
- Generates a query to search for a standard method for task decomposition;
- On receiving the answer, generates a second query to search for common extensions of it;
- Having received all necessary context, answers the question.

We can see the full sequence of steps, along with latency and other metadata, in the LangSmith trace.
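
The sequence of steps listed above can be imitated with a toy loop. This is not how create_agent is implemented: the decision function here is scripted rather than driven by an LLM, and all names (`run_agent`, `scripted_decide`, `fake_retrieve_context`) are hypothetical. The sketch only shows the control flow of calling a tool repeatedly until a final answer is produced.

```python
def run_agent(question, decide, tools, max_steps=5):
    """Call tools as directed by `decide` until it returns a final answer."""
    history = [("user", question)]
    for _ in range(max_steps):
        action = decide(history)
        if action["type"] == "final":
            history.append(("assistant", action["content"]))
            return history
        result = tools[action["tool"]](action["query"])
        history.append(("tool", result))
    raise RuntimeError("agent did not finish within max_steps")


def scripted_decide(history):
    """Stand-in for the model: two searches, then an answer."""
    tool_results = [content for role, content in history if role == "tool"]
    if len(tool_results) == 0:
        return {"type": "tool", "tool": "retrieve_context",
                "query": "standard method for task decomposition"}
    if len(tool_results) == 1:
        return {"type": "tool", "tool": "retrieve_context",
                "query": "common extensions of that method"}
    return {"type": "final", "content": "Answer composed from 2 retrieved contexts."}


def fake_retrieve_context(query):
    return f"[context for: {query}]"


history = run_agent(
    "What is the standard method for Task Decomposition?",
    scripted_decide,
    {"retrieve_context": fake_retrieve_context},
)
for role, content in history:
    print(f"{role}: {content}")
```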

In summary, RAG agents act like AI assistants that first search for relevant information in reliable documents or databases and then use that context to generate a better response. Instead of relying solely on what the model already knows, they combine retrieval and generation, which can make responses more accurate, more up to date, and grounded in real sources.