# RAG From Scratch: Overview

These notebooks walk through the process of building RAG app(s) from scratch.

They will build towards a broader understanding of the RAG landscape, as shown here:

![Screenshot 2024-03-25 at 8.30.33 PM.png](attachment:c566957c-a8ef-41a9-9b78-e089d35cf0b7.png)

## Environment

`(1) Packages`

In [1]:
! pip install langchain_community tiktoken langchain-openai langchainhub chromadb langchain

Collecting langchain_community
  Downloading langchain_community-0.3.14-py3-none-any.whl.metadata (2.9 kB)
Collecting tiktoken
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting langchain-openai
  Downloading langchain_openai-0.2.14-py3-none-any.whl.metadata (2.7 kB)
Collecting langchainhub
  Downloading langchainhub-0.1.21-py3-none-any.whl.metadata (659 bytes)
Collecting chromadb
  Downloading chromadb-0.6.2-py3-none-any.whl.metadata (6.8 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting httpx-sse<0.5.0,>=0.4.0 (from langchain_community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain_community)
  Downloading pydantic_settings-2.7.1-py3-none-any.whl.metadata (3.5 kB)
Collecting types-requests<3.0.0.0,>=2.31.0.2 (from langchainhub)
  Downloading

`(2) LangSmith`

https://docs.smith.langchain.com/

In [None]:
import os
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_API_KEY'] = '<your-api-key>'  # replace with your LangSmith API key

`(3) API Keys`

In [None]:
os.environ['OPENAI_API_KEY'] = '<your-api-key>'  # replace with your OpenAI API key
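
Hardcoding keys in a notebook makes them easy to leak when the notebook is shared. One alternative, sketched below, is to prompt for a key only when it is not already set in the environment. The helper `set_env_if_missing` is hypothetical (it is not part of LangChain or LangSmith) and uses only the standard library's `getpass`.

```python
import getpass
import os


def set_env_if_missing(name, prompt_fn=getpass.getpass):
    """Prompt for an environment variable only if it is not already set.

    `prompt_fn` is injectable so the helper can be exercised without a terminal.
    """
    if not os.environ.get(name):
        os.environ[name] = prompt_fn(f"{name}: ")
    return os.environ[name]


# In a notebook you might call, for example:
# set_env_if_missing("OPENAI_API_KEY")
# set_env_if_missing("LANGCHAIN_API_KEY")
```

Because the value is typed at the prompt rather than written into a cell, it does not end up in the saved notebook.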

## Part 1: Overview

[RAG quickstart](https://python.langchain.com/docs/use_cases/question_answering/quickstart)

In [None]:
import bs4
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

#### INDEXING ####

# Load Documents
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)

# WebBaseLoader fetches the page and parses it with BeautifulSoup.
# bs_kwargs is passed through to BeautifulSoup; the bs4.SoupStrainer given as parse_only
# restricts parsing to elements with the classes "post-content", "post-title", and "post-header",
# so the loader extracts content only from those sections of the page.



docs = loader.load()

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# RecursiveCharacterTextSplitter splits text into smaller chunks while trying to preserve
# logical structure. This is useful when large texts must be broken into smaller pieces
# for tasks such as summarization, question answering, or search indexing.

# Unlike a fixed-width split, it works through a list of separators in order
# (by default "\n\n", "\n", " ", and finally ""), so it prefers paragraph breaks, then line breaks,
# then spaces, and only falls back to splitting between characters when a piece is still too large.
# This helps each chunk remain a meaningful unit of text.

# Key features:
# Recursive splitting: tries the coarsest separator first and recurses to finer ones
#                      only for pieces that still exceed chunk_size.
# Flexible control: chunk_size (measured in characters by default), chunk_overlap
#                   (how much adjacent chunks share), and the separators are all configurable.
# Text preprocessing: useful for tasks that work better on smaller chunks
#                     (e.g., generating embeddings, large-scale document analysis).

# Embed
vectorstore = Chroma.from_documents(documents=splits,
                                    embedding=OpenAIEmbeddings())


retriever = vectorstore.as_retriever()

# vectorstore.as_retriever() wraps the vector store (here, Chroma) in a retriever, so it can be used
# as the retrieval step for question answering, document search, or any other task
# that needs the chunks of text most similar to a query.

# How it works:
# Vector store: holds the text chunks together with their vector representations,
# computed by an embedding model (here, OpenAI embeddings).
# Retriever: an abstraction for querying the vector store. The input query is embedded
# and compared against the stored chunk embeddings, and the retriever returns the most
# relevant chunks according to a similarity or distance metric (for example, cosine similarity).
# The result is a list of chunks that are close in meaning to the query.

#### RETRIEVAL and GENERATION ####

# Prompt
prompt = hub.pull("rlm/rag-prompt")

# Pulls the "rlm/rag-prompt" prompt template from the LangChain Hub

# LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

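# The | (pipe) operator chains components into a sequence: the output of each step is the input of the next.
# The dict at the start runs on the incoming question: "context" is built by passing it through
# the retriever and then format_docs, while RunnablePassthrough() forwards it unchanged as "question".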


# Question
rag_chain.invoke("What is Task Decomposition?")
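
To make the indexing and retrieval steps above concrete without any API calls, here is a minimal sketch in plain Python. It is not how Chroma or OpenAIEmbeddings work internally. The fixed-size `chunk_text` splitter, the bag-of-words `embed` function, and the `retrieve` helper are hypothetical stand-ins, chosen only to show the shape of the pipeline: split, embed, rank by cosine similarity, and join the top chunks into a context string.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Fixed-size character chunks with overlap (a simplification of recursive splitting)."""
    step = chunk_size - chunk_overlap
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


def embed(text):
    """Toy embedding: lowercase word counts. Real embedding models return dense float vectors."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


# Illustrative chunks (made up for this sketch, not taken from the blog post)
chunks = [
    "Task decomposition breaks a complex task into smaller steps.",
    "Memory lets an agent recall information from earlier interactions.",
    "Tool use allows an agent to call external APIs.",
]

top = retrieve("What is task decomposition?", chunks, k=1)
context = "\n\n".join(top)  # same role as format_docs in the chain above
print(context)
```

In the real chain, the `context` string is what fills the prompt template alongside the question before the LLM is called.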