- What is it?
    - RAG, or Retrieval Augmented Generation, is a technique that enhances Large Language Models (LLMs) by incorporating additional data beyond what the model was trained on.
- RAG relies on the following core components:
    - Documents: RAG processes documents by splitting them into smaller chunks for better indexing and retrieval.
    - Document Loaders: These load data for indexing in RAG applications.
    - Text Splitters: They break down large documents into manageable chunks for indexing and model input.
    - Embedding Models: Used to embed text chunks for indexing and retrieval in RAG.
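        - An embedding is a vector of numbers that represents the meaning of a piece of text (a word, a sentence, or a whole chunk). Texts with similar meanings get vectors that are close together.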
    - Vector Stores: Store and index the embedded text chunks for efficient retrieval.
    - Retrievers: Retrieve relevant data from the vector store based on user queries for the model to generate answers.
- Why is it important?
    - RAG is crucial as it allows LLMs to incorporate new data beyond their training set, enabling them to reason about private or post-training data.
- What problem does it solve for us?
    - It solves the problem of limited knowledge in LLMs by dynamically adding the specific information needed for accurate responses.
- Additional use cases with examples.
    - Q&A on large documents.
    - Q&A on knowledge bases.
    - Q&A on code repositories.
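To make the components above concrete, here is a minimal, dependency-free sketch of the retrieval flow. It is only an illustration under stated assumptions: a toy bag-of-words count vector stands in for a real embedding model, a plain Python list stands in for the vector store, and the names `embed`, `cosine_similarity`, and `retrieve` are made up for this example rather than taken from any library.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (a real system would call an embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, store, k=2):
    """Return the k chunks whose vectors are most similar to the query vector."""
    query_vec = embed(query)
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _vec in ranked[:k]]

# "Documents" that have already been split into chunks
chunks = [
    "The economy added new jobs this year.",
    "Our schools need more funding for teachers.",
    "Inflation and the economy remain top concerns.",
]

# "Vector store": each chunk paired with its embedding
store = [(chunk, embed(chunk)) for chunk in chunks]

# "Retriever": find the chunks closest to the user's question
top_chunks = retrieve("What about the economy?", store, k=2)
for chunk in top_chunks:
    print(chunk)
```

The two chunks that mention the economy rank above the one about schools. A real embedding model would also match on meaning, not just on shared words.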


- Document Loaders:
    - TextLoader
    - CSVLoader
    - WebBaseLoader
    - YoutubeLoader
    - NotionDBLoader

- Text Splitters:
    - CharacterTextSplitter
    - RecursiveCharacterTextSplitter
    - MarkdownTextSplitter
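The `chunk_size` and `chunk_overlap` parameters these splitters share can be illustrated with a small sketch. This is not LangChain's implementation (the real splitters prefer to break on separators such as blank lines); it is a simplified fixed-width splitter, and `split_with_overlap` is a hypothetical helper written only for this example.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character chunks; consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # How far the window advances each time
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

text = "abcdefghijklmnopqrstuvwxyz"

no_overlap = split_with_overlap(text, chunk_size=10, chunk_overlap=0)
with_overlap = split_with_overlap(text, chunk_size=10, chunk_overlap=3)

print(no_overlap)    # ['abcdefghij', 'klmnopqrst', 'uvwxyz']
print(with_overlap)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Overlap repeats the end of one chunk at the start of the next, so a sentence that straddles a boundary still appears whole in at least one chunk. The cost is more chunks to embed and store.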

In [None]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

# Load the document: read the text content from your file
# (replace "state_of_the_union.txt" with the path to your text file)
loader = TextLoader("state_of_the_union.txt")
documents = loader.load()

# Split the document into chunks to make the text easier to index and retrieve.
# CharacterTextSplitter splits on a separator (blank lines by default) and merges the
# pieces into chunks of up to about 1000 characters, with no overlap between chunks.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Display information about the split documents
print(f"Number of document chunks: {len(docs)}")
print(f"Sample chunk:\n{docs[0].page_content}")  # Show the content of the first chunk


OpenAI Embedding Pricing:
- https://openai.com/api/pricing/
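- $0.02 per 1M tokens (the text-embedding-3-small rate; rates differ by model and change over time, so check the pricing page)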
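As a rough worked example of what that rate means in practice, the sketch below estimates the cost of embedding a corpus. It assumes the $0.02 per 1M tokens figure quoted above and an average of about 4 characters per token, which is only a rule of thumb for English text; `estimate_embedding_cost` is a hypothetical helper, not an OpenAI API.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, the rate quoted in the notes above
CHARS_PER_TOKEN = 4              # Assumption: rough average for English text

def estimate_embedding_cost(num_chars, price_per_million=PRICE_PER_MILLION_TOKENS):
    """Estimate the embedding cost in USD for a corpus of num_chars characters."""
    estimated_tokens = num_chars / CHARS_PER_TOKEN
    return estimated_tokens / 1_000_000 * price_per_million

# About 4 million characters is roughly 1 million tokens
cost = estimate_embedding_cost(4_000_000)
print(f"Estimated cost: ${cost:.4f}")
```

For an exact count, tokenize the text with the tokenizer that matches your embedding model instead of dividing by a fixed ratio.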

Chroma:
- Chroma is an open-source embedding database specifically designed for AI applications. It allows you to store, index, and retrieve embedded representations of your text data (or other vectorized data) efficiently. Think of it as a specialized database for embeddings, optimized for the kinds of similarity searches you need in natural language processing tasks.
- Alternatives:
    - FAISS, Pinecone, Weaviate


In [None]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Create Embeddings: Turn text into numbers so the computer can understand meaning
embeddings = OpenAIEmbeddings()  # Uses OpenAI's embedding model (requires your API key)

# Create Vector Store: A database to store the embeddings for efficient searching
db = Chroma.from_documents(docs, embeddings)  # Embed each chunk and store in the Chroma database

In [None]:
# User's Question: What the user wants to know
query = "What did the president say about the economy?"

# Retrieve Relevant Documents: Find the most relevant parts of the text based on the query
# search_kwargs={"k": 3} asks the vector store for the top 3 most similar chunks
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 3})
relevant_docs = retriever.get_relevant_documents(query)

# TODO: Show similarity scores for each document
# # Search the DB.
# results = db.similarity_search_with_relevance_scores(query, k=3)
# if len(results) == 0 or results[0][1] < 0.7:
#     print("Unable to find matching results.")
#     # Stop here (use "return" if this code lives inside a function)

# Display the Relevant Results
print(f"Relevant Documents:\n")
for doc in relevant_docs:
    print(doc.page_content)  # Print the content of each relevant document chunk


# TODO: Show how to pass chunks into prompt template and use it to generate the response.
# https://github.com/pixegami/langchain-rag-tutorial/blob/main/query_data.py
# context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
# prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
# prompt = prompt_template.format(context=context_text, question=query)
# print(prompt)
# model = ChatOpenAI()
# response_text = model.predict(prompt)
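
The TODO above (passing retrieved chunks into a prompt) can be sketched without any LangChain dependency. This version uses plain `str.format` in place of `ChatPromptTemplate`, and hard-coded `(text, score)` pairs in place of real search results. The `PROMPT_TEMPLATE` wording, the `build_prompt` helper, and the 0.7 threshold are assumptions for illustration.

```python
PROMPT_TEMPLATE = """Answer the question based only on the following context:

{context}

---

Question: {question}
"""

def build_prompt(results, question, min_score=0.7):
    """Join retrieved chunks into one context string and fill the prompt template.

    `results` is a list of (chunk_text, relevance_score) pairs, best match first.
    Returns None when there are no results or the best match scores below min_score.
    """
    if not results or results[0][1] < min_score:
        return None
    context_text = "\n\n---\n\n".join(text for text, _score in results)
    return PROMPT_TEMPLATE.format(context=context_text, question=question)

# Hard-coded stand-ins for what a similarity search with scores might return
results = [
    ("The economy grew last year.", 0.82),
    ("Unemployment fell to record lows.", 0.75),
]

prompt = build_prompt(results, "What did the president say about the economy?")
print(prompt)
```

The resulting string is what would be sent to the chat model. Returning `None` for weak matches lets the caller answer "I could not find that" instead of asking the model to guess.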


TODO: 
- Show how to add citations.
- Show how to add metadata while embedding.
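
One possible shape for both TODO items, sketched in plain Python: each chunk carries a metadata dictionary (LangChain documents expose this as `doc.metadata`, and the loaders used here typically record a `source` path), and the answer is followed by the distinct sources of the chunks that were retrieved. The `Chunk` class, the `format_answer_with_citations` helper, the `chunk` index field, and the file name `trade_report.txt` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """A text chunk plus the metadata stored alongside its embedding."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_answer_with_citations(answer, source_chunks):
    """Append a de-duplicated list of sources, in retrieval order, to the answer text."""
    sources = []
    for chunk in source_chunks:
        source = chunk.metadata.get("source", "unknown")
        if source not in sources:
            sources.append(source)
    citations = "\n".join(f"[{i}] {source}" for i, source in enumerate(sources, start=1))
    return f"{answer}\n\nSources:\n{citations}"

# Chunks as they might come back from the retriever, each with its metadata
retrieved = [
    Chunk("The economy grew last year.", {"source": "state_of_the_union.txt", "chunk": 3}),
    Chunk("Unemployment fell.", {"source": "state_of_the_union.txt", "chunk": 7}),
    Chunk("Exports rose.", {"source": "trade_report.txt", "chunk": 1}),
]

cited = format_answer_with_citations("The economy grew and unemployment fell.", retrieved)
print(cited)
```

Because the metadata travels with each chunk from indexing through retrieval, citations need no extra lookup at answer time.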

DuckDB + Parquet:

In the chromadb.config.Settings, we used chroma_db_impl="duckdb+parquet". Here's why this combination is often used:

- DuckDB: An in-process analytical database that can run entirely in memory. It is fast at loading, indexing, and querying small to medium-sized datasets.
- Parquet: A columnar storage format that is efficient for storing and reading large datasets. It also supports compression, which can help reduce storage costs.

By combining DuckDB with Parquet, you get the best of both worlds:

- Fast in-memory processing for interactive queries and small datasets.
- Efficient on-disk storage and retrieval for larger datasets thanks to Parquet's columnar format, so the data persists between sessions.

You can choose other combinations depending on your needs:

- duckdb: Use only DuckDB (in-memory; good for smaller datasets that do not need to persist).
- parquet: Use only Parquet files (suitable for larger datasets but might be slower for interactive queries). Verify that your Chroma version accepts this value before relying on it.

Note that this setting belongs to older Chroma releases. From version 0.4 on, Chroma stores its data in SQLite and no longer accepts chroma_db_impl.


ConversationBufferMemory:

- Purpose: Stores the history of messages exchanged in a conversation, including both the user's queries and the AI's responses.
- Alternatives: LangChain offers other memory options, such as:
    - ConversationSummaryMemory: Summarizes the conversation as it progresses.
    - ConversationEntityMemory: Tracks specific entities mentioned in the conversation.
    - ConversationKGMemory: Stores information as a knowledge graph.
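
The idea behind a buffer memory can be shown with a small sketch: keep every message in order and render the transcript into the next prompt. `SimpleBufferMemory` is a hypothetical class written for this example; it mimics the concept only and is not LangChain's ConversationBufferMemory.

```python
class SimpleBufferMemory:
    """Keeps every message of the conversation in order, like a transcript."""

    def __init__(self):
        self.messages = []

    def save_turn(self, user_message, ai_message):
        """Record one exchange: the user's message followed by the AI's reply."""
        self.messages.append(("human", user_message))
        self.messages.append(("ai", ai_message))

    def as_text(self):
        """Render the history as text that can be placed in the next prompt."""
        return "\n".join(f"{role}: {content}" for role, content in self.messages)

memory = SimpleBufferMemory()
memory.save_turn("Who gave the speech?", "The president.")
memory.save_turn("What did he say about jobs?", "That job growth was strong.")

history = memory.as_text()
print(history)
```

The buffer grows with every turn, so long conversations eventually exceed the model's context window. That is the motivation for the summarizing alternatives listed above.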


Alternatives to chain_type="stuff":

In the RetrievalQA chain, the chain_type parameter determines how the retrieved documents are incorporated into the prompt for the language model. Here are the common options:

- stuff: Inserts all retrieved documents directly into a single prompt, one after the other. This is the simplest option, but it only works when all the documents fit in the model's context window.
- map_reduce: Runs the model over each document separately (the "map" step), then combines the intermediate outputs into a final answer (the "reduce" step).
- refine: Uses the first retrieved document to generate an initial response, then iteratively refines the response using each subsequent document.
- map_rerank: Asks the model to answer from each document separately and to score its confidence in that answer, then returns the highest-scoring answer.
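
To show how these strategies differ in the number of model calls they make, here is a sketch with a stand-in model. `fake_llm` only logs the prompts it receives, and the `stuff`, `map_reduce`, and `refine` functions are simplified illustrations of the ideas above, not LangChain's implementations. The prompt wording is made up.

```python
calls = []

def fake_llm(prompt):
    """Stand-in for a real model call: logs the prompt and returns a placeholder answer."""
    calls.append(prompt)
    return f"<answer {len(calls)}>"

def stuff(docs, question):
    # One call: all documents concatenated into a single prompt
    context = "\n\n".join(docs)
    return fake_llm(f"Context:\n{context}\n\nQuestion: {question}")

def map_reduce(docs, question):
    # One call per document (map), then one call to combine the partial answers (reduce)
    partials = [fake_llm(f"Context:\n{doc}\n\nQuestion: {question}") for doc in docs]
    combined = "\n".join(partials)
    return fake_llm(f"Combine these partial answers:\n{combined}\n\nQuestion: {question}")

def refine(docs, question):
    # The first document gives an initial answer; each later document refines it
    answer = fake_llm(f"Context:\n{docs[0]}\n\nQuestion: {question}")
    for doc in docs[1:]:
        answer = fake_llm(
            f"Existing answer: {answer}\nNew context:\n{doc}\n\nRefine the answer to: {question}"
        )
    return answer

docs = ["chunk one", "chunk two", "chunk three"]
call_counts = {}
for name, strategy in [("stuff", stuff), ("map_reduce", map_reduce), ("refine", refine)]:
    calls.clear()
    strategy(docs, "What is said about the economy?")
    call_counts[name] = len(calls)

print(call_counts)  # {'stuff': 1, 'map_reduce': 4, 'refine': 3}
```

With three documents, stuff needs one call, refine needs three sequential calls, and map_reduce needs four (three of which could run in parallel). The trade-off is between prompt size and the number and latency of calls.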

How Developers Typically Use Collections:

There's no one-size-fits-all answer, but here are some common approaches:

- By Source: Divide collections based on the source of the data (e.g., "books," "articles," "code").
- By Topic: Group documents related to similar topics into collections (e.g., "machine learning," "finance," "health").
- By Time: If your data has a temporal dimension, you can create collections for different time periods (e.g., "2023 news," "2022 reports").
- By Granularity: Sometimes, you might have documents that naturally fit into different levels of granularity. For example, you could have a collection for entire books and another collection for individual chapters.
- By Author: If you are working with text documents from a variety of authors, it can be helpful to split things up by author.

In [None]:
# Continual Chat

from chromadb.config import Settings
from langchain.chains import ConversationalRetrievalChain
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.memory import ConversationBufferMemory
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

# Your OpenAI API Key
OPENAI_API_KEY = 'YOUR_OPENAI_API_KEY'

# 1. Load and Prepare Your Data
loader = TextLoader("state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# 2. Set Up Your Chroma Database
# Note: chroma_db_impl="duckdb+parquet" is the legacy setting for chromadb versions before 0.4.
persist_directory = "db"
client_settings = Settings(chroma_db_impl="duckdb+parquet", persist_directory=persist_directory)
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

# Wrap the collection in LangChain's Chroma vector store: a raw chromadb collection has no
# as_retriever() method, and the wrapper embeds queries with the same model as the chunks.
vectorstore = Chroma(
    collection_name="my_documents",
    embedding_function=embeddings,
    client_settings=client_settings,
    persist_directory=persist_directory,
)

# Embed and store the chunks only if the collection is still empty
if not vectorstore.get()["ids"]:
    vectorstore.add_documents(docs, ids=[f"doc_{i}" for i in range(len(docs))])
    vectorstore.persist()  # Write the collection to disk

# 3. Set Up the Question Answering System with History-Aware Retrieval
retriever = vectorstore.as_retriever()
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, temperature=0)

# chain_type="stuff" inserts all retrieved chunks directly into the prompt
qa_chain = ConversationalRetrievalChain.from_llm(llm, retriever=retriever, memory=memory, chain_type="stuff")

# 4. Have a Conversation!
while True:  
    query = input("You: ")
    if query.lower() == "exit":
        break
    result = qa_chain({"question": query}) 
    print(f"AI: {result['answer']}") 



# TODO: Use langchain.chains import create_history_aware_retriever
# https://github.com/alejandro-ao/chat-with-websites/blob/master/src/app.py