# Lesson 4: Q&A over Documents
This notebook documents my learning journey through the **LangChain for LLM Application Development** course from DeepLearning.AI \
[Lesson 4: QnA](https://learn.deeplearning.ai/courses/langchain/lesson/mv7m1/question-and-answer)

\
What I Learned
- How to load and prepare document data for RAG (`CSVLoader`)
- How to embed documents using HuggingFace models
- How to create a vector store for semantic search
- How to manually construct a RAG pipeline using retrievers and LLMs
- How to simplify the full pipeline using `VectorstoreIndexCreator`

## Setting up the Environment

In [None]:
!pip install -qU python-dotenv
!pip install -qU langchain-groq
!pip install -qU langchain-community
!pip install -qU langchain-huggingface
!pip install -qU docarray

In [None]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
import os
from dotenv import load_dotenv

_ = load_dotenv() # read local .env file
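
Before creating the model, it can help to confirm that the key was actually loaded. The following is a minimal sketch, not part of the course material. It assumes the Groq key is stored under the name `GROQ_API_KEY`; the notebook does not show the `.env` contents, so adjust the name if yours differs. `missing_env_keys` is a hypothetical helper.

```python
import os


def missing_env_keys(required_keys, env=None):
    """Return the required keys that are absent or empty in the environment."""
    env = os.environ if env is None else env
    return [key for key in required_keys if not env.get(key)]


# GROQ_API_KEY is an assumed variable name; change it to match your .env file.
missing = missing_env_keys(["GROQ_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.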

In [None]:
from langchain.chat_models import init_chat_model

llm = init_chat_model(
    model="llama-3.3-70b-versatile",
    model_provider="groq",
    temperature=0.9
)

## I. Experimenting with a Simple RAG


1. Load product data from a CSV file


In [None]:
from langchain_community.document_loaders.csv_loader import CSVLoader

file_path = "OutdoorClothingCatalog_1000.csv"
loader = CSVLoader(file_path=file_path)

2. Define the embedding model to convert text into vectors

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# more information: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

3. Create a vector index from the loaded documents using the embedding model

In [None]:
from langchain.indexes.vectorstore import VectorstoreIndexCreator
from langchain_community.vectorstores import DocArrayInMemorySearch

index_creator = VectorstoreIndexCreator(
    embedding=embeddings,
    vectorstore_cls=DocArrayInMemorySearch
)

index = index_creator.from_loaders([loader])

4. Query and display the result

In [None]:
query = "Please list all your shirts with sun protection in a table in markdown and summarize each one."

In [None]:
response = index.query(query, llm=llm)

In [None]:
from IPython.display import display, Markdown

display(Markdown(response))

### Shirts with Sun Protection
| Shirt Name | Description | Fabric | UPF Rating | Care |
| --- | --- | --- | --- | --- |
| Women's Tropical Tee, Sleeveless | Sleeveless button-up with SunSmart™ protection, wrinkle resistant, and low-profile pockets | 71% nylon, 29% polyester | UPF 50+ | Machine wash and dry |
| Sun Shield Shirt | High-performance sun shirt with moisture-wicking and abrasion-resistant fabric | 78% nylon, 22% Lycra Xtra Life fiber | UPF 50+ | Hand wash, line dry |
| Tropical Breeze Shirt | Lightweight, breathable long-sleeve men's UPF shirt with SunSmart™ protection and moisture-wicking fabric | 71% nylon, 29% polyester | UPF 50+ | Machine wash and dry |
| Men's Plaid Tropic Shirt, Short-Sleeve | Ultracomfortable sun protection shirt with SunSmart technology and wrinkle-free fabric | 52% polyester, 48% nylon | UPF 50+ | Machine wash and dry |

Each shirt provides UPF 50+ sun protection, blocking 98% of the sun's harmful UV rays. The Women's Tropical Tee and Tropical Breeze Shirt feature SunSmart™ technology, while the Sun Shield Shirt has a high-performance fabric recommended by The Skin Cancer Foundation. The Men's Plaid Tropic Shirt offers a comfortable and breathable design with wrinkle-free fabric. All shirts are designed to provide superior sun protection and comfort for outdoor activities.

## II. Step-by-Step Exploration of RAG with LangChain

Define a loader

In [None]:
file_path = "OutdoorClothingCatalog_1000.csv"
loader = CSVLoader(file_path=file_path)

Load data

In [None]:
docs = loader.load()

In [None]:
doc = docs[0]
print(type(doc)) # just see what type it is

<class 'langchain_core.documents.base.Document'>


A `Document` object has two main parts: `page_content` (the text) and `metadata` (here, the source file and the row number).


In [None]:
print(doc)

page_content=': 0
name: Women's Campside Oxfords
description: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. 

Size & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. 

Specs: Approx. weight: 1 lb.1 oz. per pair. 

Construction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. 

Questions? Please contact us for any inquiries.' metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0}


In [None]:
doc.page_content

": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \r\n\r\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \r\n\r\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \r\n\r\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \r\n\r\nQuestions? Please contact us for any inquiries."

In [None]:
doc.metadata

{'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0}
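
To make the `page_content` format concrete, here is a rough stdlib-only sketch of how one CSV row can become one document with `column: value` lines plus `source` and `row` metadata. It only imitates the output shown above and is not `CSVLoader`'s actual implementation. `SimpleDocument`, `rows_to_documents`, and the two shortened sample rows are made up for illustration.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(csv_text, source):
    """Turn each CSV row into one document of 'column: value' lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            SimpleDocument(content, {"source": source, "row": row_number})
        )
    return documents


# Made-up sample rows; the first column has an empty header, as in the catalog.
sample_csv = (
    ",name,description\n"
    "0,Women's Campside Oxfords,Soft canvas shoe\n"
    "1,Sun Shield Shirt,UPF 50+ sun shirt\n"
)
sample_docs = rows_to_documents(sample_csv, source="sample.csv")
print(sample_docs[0].page_content)
print(sample_docs[0].metadata)
```

The empty header of the first column explains why the real `page_content` starts with `: 0`.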

Define an embedding model

In [None]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [None]:
query = "Please list all your shirts with sun protection in a table in markdown and summarize each one."
embed = embeddings.embed_query(query)

In [None]:
print(len(embed)) # The embedding dimension

384
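
Semantic search compares vectors like this one, commonly by cosine similarity. Here is a small sketch with made-up 3-dimensional vectors instead of real 384-dimensional embeddings. The vector values and names are illustrative assumptions, not model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors: the "shirt" vector points the same way as the query,
# while the "shoe" vector is orthogonal to it.
query_vec = [1.0, 0.0, 1.0]
shirt_vec = [2.0, 0.0, 2.0]
shoe_vec = [0.0, 3.0, 0.0]

print(cosine_similarity(query_vec, shirt_vec))  # close to 1.0: same direction
print(cosine_similarity(query_vec, shoe_vec))   # 0.0: nothing in common
```

Only the direction of a vector matters for this score, not its length, which is why `shirt_vec` scores as a near-perfect match even though its values are twice as large.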


Create the vector database (vector store) in memory

In [None]:
db = DocArrayInMemorySearch.from_documents(
    docs,
    embeddings
)

Run a semantic similarity search to retrieve the documents most relevant to the query

In [None]:
docs = db.similarity_search(query)

In [None]:
len(docs)  # by default, it returns 4 results

4

In [None]:
len(db.similarity_search(query, k=10))  # you can use the 'k' parameter to specify the number of results

10
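
Conceptually, a top-`k` similarity search scores every stored text against the query and keeps the best `k`. The sketch below shows that idea with a toy word-count "embedding" in place of the HuggingFace model. It is an assumption-laden illustration, not how `DocArrayInMemorySearch` is implemented, and the catalog entries are invented.

```python
import math
from collections import Counter


def toy_embed(text):
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(query, texts, k=4):
    """Return the k texts most similar to the query, best match first."""
    query_vec = toy_embed(query)
    ranked = sorted(texts, key=lambda t: cosine(query_vec, toy_embed(t)), reverse=True)
    return ranked[:k]


# Invented mini catalog for illustration.
catalog = [
    "sun protection shirt",
    "waterproof hiking boots",
    "shirt with sun shield",
    "wool winter hat",
]
top = similarity_search("shirt sun protection", catalog, k=2)
print(top)
```

A real embedding model would also match texts that share meaning but not words, which is the main thing this word-count stand-in cannot do.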

Wrap the vector store as a retriever so it can be used in retrieval chains (such as the one built with `create_retrieval_chain` later in this notebook)

In [None]:
retriever = db.as_retriever()

Combine the text content of all retrieved documents into a single string to use as input for the language model.

In [None]:
qdocs = "".join(doc.page_content for doc in docs)

In [None]:
# Reuse the `query` string defined earlier instead of repeating it inline
response = llm.invoke(f"{qdocs} Question: {query}")

In [None]:
display(Markdown(response.content))

### Sun Protection Shirts
| Shirt Name | Description | Fabric | UPF Rating | Care |
| --- | --- | --- | --- | --- |
| Women's Tropical Tee, Sleeveless | Sleeveless button-up with SunSmart™ protection, slightly fitted, and wrinkle resistant | 71% nylon, 29% polyester | UPF 50+ | Machine wash and dry |
| Sun Shield Shirt | High-performance sun shirt with SPF 50+ sun protection, slightly fitted, and quick-drying | 78% nylon, 22% Lycra Xtra Life fiber | UPF 50+ | Hand wash, line dry |
| Sunrise Tee | Lightweight, UV-protective button-down shirt with moisture-wicking fabric, slightly fitted, and wrinkle-free | 71% nylon, 29% polyester | UPF 50+ | Machine wash and dry |
| Tropical Breeze Shirt | Lightweight, breathable long-sleeve shirt with SunSmart™ protection, traditional fit, and moisture-wicking fabric | 71% nylon, 29% polyester | UPF 50+ | Machine wash and dry |

**Summary:**
These four shirts offer sun protection with UPF 50+ ratings, blocking 98% of the sun's harmful UV rays. The Women's Tropical Tee and Sunrise Tee are designed for women, while the Tropical Breeze Shirt is for men. The Sun Shield Shirt is a high-performance option with quick-drying comfort. All shirts have moisture-wicking fabrics and are designed for outdoor activities, travel, or everyday wear.
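
The manual "stuffing" step above can be sketched as a small helper. One assumption here differs from the cell above: the documents are joined with a blank line instead of `""`, so that adjacent product descriptions do not run together. `build_stuffed_prompt` and the sample strings are hypothetical.

```python
def build_stuffed_prompt(contents, question, separator="\n\n"):
    """Join document texts into one context block and append the question."""
    context = separator.join(contents)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Shortened, made-up document texts for illustration.
contents = ["name: Sun Shield Shirt", "name: Sunrise Tee"]
question = "Which shirts offer sun protection?"

stuffed_prompt = build_stuffed_prompt(contents, question)
print(stuffed_prompt)
```

With the real objects, `contents` would be `[doc.page_content for doc in docs]` and the resulting string would go to `llm.invoke(...)`.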

Finally: build a full retrieval chain (without manually retrieving the context or invoking the LLM directly)

In [None]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "Use the given context to answer the question, "
     "\n{context}\n"
     "Respond in markdown table and summarize each one."),
    ("human", "{input}")
])
# IMPORTANT:
# create_retrieval_chain passes the user's question through as {input}
# and fills {context} with the retrieved documents.
# Make sure your prompt includes both placeholders: {context} and {input}.


# Create a "stuff" chain: combines all retrieved documents into a single long string for the LLM
question_answer_chain = create_stuff_documents_chain(llm, prompt)

# Create the full Retrieval QA chain by connecting the retriever and the question-answer chain
chain = create_retrieval_chain(retriever, question_answer_chain)


In [None]:
response = chain.invoke({"input": "Please list all your shirts with sun protection"})

In [None]:
display(Markdown(response["answer"]))

### Sun Protection Shirts
Here is a list of shirts with sun protection:

| ID | Name | Description | Sun Protection |
| --- | --- | --- | --- |
| 255 | Sun Shield Shirt | High-performance sun shirt with UPF 50+ rating | Blocks 98% of UV rays |
| 679 | Women's Tropical Tee | Sleeveless shirt with built-in SunSmart UPF 50+ rating | Blocks 98% of UV rays |
| 535 | Men's TropicVibe Shirt | Short-sleeve shirt with UPF 50+ rating | Blocks 98% of UV rays |
| 709 | Sunrise Tee | Women's UV-protective button down shirt with UPF 50+ rating | Blocks 98% of UV rays |

All shirts have a **UPF 50+ rating**, which is the highest rated sun protection possible, blocking **98% of the sun's harmful rays**.
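
To see what the chain does conceptually, here is a plain-Python sketch of the same flow: take `input`, retrieve documents, build the context, and return a dictionary that includes `answer`. It only mirrors the shape of the result used above and is not LangChain's implementation. `make_retrieval_chain`, `fake_retrieve`, and `fake_answer` are hypothetical stand-ins for the chain, the retriever, and the LLM.

```python
def make_retrieval_chain(retrieve, answer):
    """Build a callable that retrieves context and then answers the question."""
    def invoke(inputs):
        question = inputs["input"]
        documents = retrieve(question)          # step 1: fetch relevant texts
        context = "\n\n".join(documents)        # step 2: stuff them into one string
        return {
            **inputs,
            "context": documents,
            "answer": answer(context, question),  # step 3: generate the answer
        }
    return invoke


def fake_retrieve(question):
    """Stand-in retriever that always returns two made-up product texts."""
    return ["name: Sun Shield Shirt", "name: Sunrise Tee"]


def fake_answer(context, question):
    """Stand-in for the LLM call: just counts the products in the context."""
    return f"{question} -> {context.count('name:')} products found"


toy_chain = make_retrieval_chain(fake_retrieve, fake_answer)
result = toy_chain({"input": "Which shirts?"})
print(result["answer"])
```

Keeping the retrieved documents in the result alongside the answer makes it possible to check which sources the answer was based on.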

👉 All of the steps above can be simplified using `VectorstoreIndexCreator`:




In [None]:
# llm = ...  # define the LLM
# embeddings = ...  # define the embedding model
# loader = ...  # define the document loader

index_creator = VectorstoreIndexCreator(
    embedding=embeddings,
    vectorstore_cls=DocArrayInMemorySearch
)

index = index_creator.from_loaders([loader])


query = "Please list all your shirts with sun protection in a table in markdown and summarize each one."
response = index.query(query, llm=llm)


---
# Summary of Your LangChain Learning Journey

This notebook covered question answering over documents, a core LangChain use case. You built a RAG (Retrieval-Augmented Generation) system step by step and also used LangChain's abstractions to simplify the process. Here's a breakdown of the key concepts:

**1. Setting Up Your Environment for RAG:**

* You began by installing the necessary libraries for this lesson, including:
    - `langchain-groq` for interacting with the Groq API.
    - `langchain-community` for community-contributed components like loaders.
    - `langchain-huggingface` to use Hugging Face's embedding models.
    - `docarray` for in-memory vector storage.
* You used `python-dotenv` to load API keys from a local `.env` file instead of hard-coding them in the notebook.

**2. Building a RAG Pipeline - The Easy Way:**

* You started with the `VectorstoreIndexCreator`, a high-level abstraction that simplifies the creation of a RAG pipeline.
* You learned how to:
    - Load documents from a CSV file using `CSVLoader`.
    - Define an embedding model using `HuggingFaceEmbeddings`.
    - Create a vector store and query it with a single command, demonstrating the power of LangChain's abstractions.

**3. Deep Dive into the Components of RAG:**

* You then deconstructed the RAG pipeline to understand each component in detail:
    - **Loading Documents:** You loaded documents and inspected their structure, understanding the `page_content` and `metadata`.
    - **Embeddings:** You used an embedding model to convert your query into a vector representation.
    - **Vector Stores:** You created an in-memory vector store using `DocArrayInMemorySearch` and performed similarity searches to retrieve relevant documents.
    - **Retrievers:** You wrapped your vector store in a retriever to make it compatible with LangChain's QA chains.
    - **Manual generation:** You retrieved documents, combined their content into a single string, and passed that context to the LLM directly.

**4. Building a More Advanced RAG Chain:**

* You used `create_retrieval_chain` together with `create_stuff_documents_chain` to build a chain that retrieves the context and calls the LLM in a single `invoke` call.
* You learned how to use `ChatPromptTemplate` to create a prompt that effectively utilizes the retrieved context to answer the user's question.

**In essence, you've learned how to:**

* **Load** and prepare data from external documents.
* **Embed** text into a vector representation for semantic search.
* **Store** and retrieve information from a vector store.
* **Construct** a complete RAG pipeline, both manually and with the help of LangChain's abstractions.
* **Chain** all these components together to create a powerful Q&A application.

These are the fundamental skills for building applications that answer questions based on your own data.

---
# Key Commands and Imports to Remember

### Python Libraries:
- **`import os`**: Interacts with the operating system, mainly for accessing environment variables.
- **`from dotenv import load_dotenv`**: Loads environment variables from a `.env` file to securely manage API credentials.

### LangChain Libraries:
- **`from langchain.chat_models import init_chat_model`**: Easily initializes a chat model instance from a provider.
- **`from langchain_community.document_loaders.csv_loader import CSVLoader`**: A loader for reading data from CSV files.
- **`from langchain_huggingface import HuggingFaceEmbeddings`**: A class for using embedding models from Hugging Face.
- **`from langchain.indexes.vectorstore import VectorstoreIndexCreator`**: A high-level tool for quickly creating a vector store index from documents.
- **`from langchain_community.vectorstores import DocArrayInMemorySearch`**: An in-memory vector store for fast prototyping and development.
- **`from langchain.chains import create_retrieval_chain`**: A function to create a chain that retrieves documents and then answers a question based on them.
- **`from langchain.chains.combine_documents import create_stuff_documents_chain`**: A function that creates a chain which "stuffs" all retrieved documents into the prompt.
- **`from langchain_core.prompts import ChatPromptTemplate`**: For creating and formatting prompt templates that can be used in a chain.