# Retrieval Augmented Generation (RAG) with LangChain

This notebook builds a Retrieval Augmented Generation (RAG) system using LangChain and HuggingFace embeddings. It covers the following steps:

1. Loading documents from the knowledge-base directory
2. Splitting documents into smaller chunks
3. Creating vector embeddings for each chunk
4. Storing vectors in a Chroma vector database
5. Building a chat application using RAG with the Gemini API

## Library Imports

In [1]:
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.memory import ConversationBufferWindowMemory  
from langchain.chains import ConversationalRetrievalChain
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_chroma import Chroma

from sklearn.manifold import TSNE
import plotly.graph_objects as go

import numpy as np
import os
import glob
import time

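from dotenv import load_dotenv

# Load environment variables (for example the Gemini API key) from a local .env file, if present
load_dotenv()

DB_NAME = "vector_db"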

## 1. Loading Documents from Knowledge Base

Load all markdown files from the `knowledge-base` directory and its subdirectories. Each document gets a `doc_type` metadata field set to the name of its top-level folder, which categorizes it.

In [2]:
# Get list of directories in knowledge-base
folders = glob.glob("../knowledge-base/*")
text_loader_kwargs = {"autodetect_encoding": True}

# Initialize list to hold documents
documents = []

# Loop through each folder to load documents
for folder in folders:
    # Use folder name as document type
    doc_type = os.path.basename(folder)
    
    # Create loader for all .md files in the folder
    loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
    
    # Load documents from the folder
    folder_docs = loader.load()
    
    # Add metadata and append to main list
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        documents.append(doc)

print("Total documents loaded:", len(documents))

Total documents loaded: 17
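
A quick sanity check after loading is to count documents per `doc_type`. The sketch below is a minimal illustration that uses stand-in objects exposing only a `.metadata` attribute; `count_by_doc_type` and `sample_docs` are hypothetical names, not part of the notebook. The same function should work on the `documents` list, assuming every entry carries a `doc_type` key.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_doc_type(docs):
    """Count documents per `doc_type` metadata value."""
    return Counter(doc.metadata["doc_type"] for doc in docs)


# Stand-in objects that mimic only the `.metadata` attribute of a loaded document.
sample_docs = [
    SimpleNamespace(metadata={"doc_type": "company"}),
    SimpleNamespace(metadata={"doc_type": "company"}),
    SimpleNamespace(metadata={"doc_type": "visas"}),
]

counts = count_by_doc_type(sample_docs)
for doc_type, n in sorted(counts.items()):
    print(f"{doc_type}: {n}")
```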


### 1.1. Examining Loaded Documents

In [3]:
documents[0]

Document(metadata={'source': '../knowledge-base/company/about.md', 'doc_type': 'company'}, page_content='# Về Công Ty - Korea Study Consultant Center\n\n## Lịch Sử Thành Lập\n\nKorea Study Consultant Center được thành lập vào năm 2018 với sứ mệnh kết nối các bạn trẻ Việt Nam với hệ thống giáo dục chất lượng cao của Hàn Quốc. Được sáng lập bởi các chuyên gia giáo dục có nhiều năm kinh nghiệm tại Hàn Quốc, chúng tôi đã hỗ trợ hơn 2,000 học sinh Việt Nam thực hiện ước mơ du học tại xứ sở kim chi.\n\n## Tầm Nhìn & Sứ Mệnh\n\n### Tầm Nhìn\nTrở thành trung tâm tư vấn du học Hàn Quốc hàng đầu tại Việt Nam, mang đến cơ hội giáo dục tốt nhất cho thế hệ trẻ Việt Nam.\n\n### Sứ Mệnh\n- Cung cấp dịch vụ tư vấn du học chuyên nghiệp và uy tín\n- Hỗ trợ toàn diện từ khâu chuẩn bị hồ sơ đến khi định cư tại Hàn Quốc\n- Xây dựng cầu nối văn hóa và giáo dục giữa Việt Nam và Hàn Quốc\n- Đảm bảo tỷ lệ thành công cao nhất cho học sinh\n\n## Giá Trị Cốt Lõi\n\n### 1. Chuyên Nghiệp\n- Đội ngũ tư vấn viên có b

## 2. Splitting Documents into Chunks

`RecursiveCharacterTextSplitter` divides each document into chunks of at most 1,024 characters, with up to 256 characters of overlap between neighboring chunks to preserve context across chunk boundaries.

In [4]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=256,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")

Created 135 chunks
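
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-window chunker. It is only a sketch of the overlap idea: `RecursiveCharacterTextSplitter` additionally prefers to break at the listed separators, which this version ignores. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share `chunk_overlap` characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


example = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(example)
```

With a larger overlap the window advances more slowly, so the same text yields more chunks and more duplicated characters in the vector store.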


### 2.1. Examining Created Chunks

In [5]:
chunks[5]

Document(metadata={'source': '../knowledge-base/company/overview.md', 'doc_type': 'company'}, page_content='### Văn Phòng Đại Diện Seoul\n📍 **Địa chỉ**: #1203, Gangnam Finance Center, Gangnam-gu, Seoul, South Korea  \n📞 **Điện thoại**: +82-2-558-9876  \n📧 **Email**: seoul@koreastudyvn.com  \n\n## Thông Tin Liên Lạc Nhanh\n\n🌐 **Website**: www.koreastudyvn.com  \n📱 **Hotline**: 1900-6789  \n💬 **Zalo**: 0901-234-567  \n📘 **Facebook**: Korea Study Consultant Center Vietnam  \n📸 **Instagram**: @koreastudyvn  \n🎬 **YouTube**: Korea Study VN  \n💼 **LinkedIn**: Korea Study Consultant Center  \n\n## Giấy Phép & Chứng Nhận\n\n### Giấy Phép Hoạt Động\n- **Giấy phép kinh doanh**: Số 0123456789-001, cấp ngày 15/03/2018\n- **Giấy phép hoạt động tư vấn du học**: Số EDU-2018-VN-001\n- **Chứng nhận ISO 9001:2015**: Quản lý chất lượng dịch vụ\n\n### Thành Viên Của\n- **Hiệp hội Tư vấn Du học Việt Nam (VIECA)**\n- **Liên minh Giáo dục Việt Nam - Hàn Quốc**\n- **Mạng lưới Đối tác Giáo dục Quốc tế (IEPN)*

In [6]:
doc_types = set(chunk.metadata['doc_type'] for chunk in chunks)
print(f"Document types found: {', '.join(doc_types)}")

Document types found: schools, company, visas, employees


## 3. Creating Vector Embeddings and Storing in Chroma Database

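This step uses the HuggingFace `sentence-transformers/all-MiniLM-L6-v2` model to create a vector embedding for each text chunk and stores the vectors in a Chroma database. If a database already exists at `DB_NAME`, its collection is deleted first so the store is rebuilt from scratch.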

In [7]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

if os.path.exists(DB_NAME):
    Chroma(persist_directory=DB_NAME, embedding_function=embeddings).delete_collection()


In [8]:
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory=DB_NAME
)

print(f"Vectorstore created with {vectorstore._collection.count()} documents")

Vectorstore created with 135 documents


In [9]:
collection = vectorstore._collection

sample_embedding = collection.get(limit=1, include=["embeddings"])["embeddings"][0]

dimensions = len(sample_embedding)
print(f"The vectors have {dimensions:,} dimensions")

The vectors have 384 dimensions
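
Embeddings are useful because texts with similar meaning map to nearby vectors. One common way to compare two vectors is cosine similarity, sketched below in plain Python on toy 2-dimensional vectors. This is an illustration only: the distance metric a vector store actually uses depends on its configuration, and `cosine_similarity` is a hypothetical helper, not part of the notebook.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```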


In [10]:
sample_embedding

array([-4.62564491e-02,  1.97377615e-02,  5.19227646e-02, -6.66938350e-02,
       -2.26987321e-02,  1.27717359e-02,  7.94795156e-02,  1.46346195e-02,
        5.64175658e-02,  1.37510812e-02,  1.15849324e-01, -6.65821955e-02,
        3.20206187e-03, -3.47903855e-02, -3.53573225e-02, -4.56932969e-02,
       -4.71743979e-02,  1.45339761e-02, -4.97469343e-02, -9.72204804e-02,
       -1.14383707e-02,  8.46363837e-04, -1.46946963e-03, -3.99394003e-06,
       -7.87993334e-03,  2.86254417e-02, -3.68046276e-02,  4.76785488e-02,
        2.19761226e-02, -8.49490520e-04, -3.24455053e-02,  1.46460757e-01,
       -1.51343085e-02, -1.45951333e-02,  4.59653884e-02,  2.21654009e-02,
       -2.31053289e-02,  1.39798019e-02,  3.91129553e-02,  7.46086799e-03,
       -6.55276626e-02, -6.74750954e-02,  4.91987988e-02, -7.31474012e-02,
        5.80212772e-02,  8.07655044e-03, -7.41121247e-02, -2.56642159e-02,
        6.78543147e-05, -4.59108222e-03, -2.01404691e-02,  6.04366995e-02,
        6.14164257e-03,  

## 4. Visualizing Vector Embeddings with t-SNE

t-SNE reduces the high-dimensional embeddings (384 dimensions here) to 2D so that relationships between vectors can be plotted. Each point represents a text chunk, and its color indicates the document type.

In [11]:
result = collection.get(include=['embeddings', 'documents', 'metadatas'])

vectors = np.array(result['embeddings'])

documents = result['documents']

doc_types = [metadata['doc_type'] for metadata in result['metadatas']]

# Map each document type to a plot color
color_map = {'company': 'blue', 'employees': 'green', 'visas': 'red', 'schools': 'orange'}
colors = [color_map[t] for t in doc_types]

In [12]:
tsne = TSNE(n_components=2, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

fig = go.Figure(data=[go.Scatter(
    x=reduced_vectors[:, 0],
    y=reduced_vectors[:, 1],
    mode='markers',
    marker=dict(size=5, color=colors, opacity=0.8),
    text=[f"Type: {t}<br>Text: {d[:50]}..." for t, d in zip(doc_types, documents)],
    hoverinfo='text'
)])

fig.update_layout(
    title='2D Visualization of Chroma Vector Store',
    xaxis_title='x',
    yaxis_title='y',
    width=800,
    height=600,
    margin=dict(r=20, b=10, l=10, t=40)
)

fig.show()

## 5. Building the RAG Query Chain

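This section builds a RAG query chain that combines the Gemini language model with the vector store, so that questions are answered from the embedded documents. A conversation memory keeps the chat history so that follow-up questions are answered in context.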

In [13]:
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0.7
)

memory = ConversationBufferWindowMemory(memory_key='chat_history', return_messages=True)

retriever = vectorstore.as_retriever(search_kwargs={"k": 30})

conversation_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory
)
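
Conceptually, a retrieval chain ranks chunks by their similarity to the question, keeps the top `k` (30 in the cell above), and places them in the prompt sent to the LLM. The sketch below illustrates that flow with precomputed scores. It is not LangChain's internal implementation: `top_k`, `build_prompt`, and the prompt wording are hypothetical.

```python
import heapq


def top_k(scored_chunks, k):
    """Return the texts of the k highest-scoring (score, text) pairs, best first."""
    best = heapq.nlargest(k, scored_chunks, key=lambda pair: pair[0])
    return [text for _, text in best]


def build_prompt(question, context_chunks):
    """Assemble a prompt that asks the model to answer from the retrieved context."""
    context = "\n\n".join(context_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy similarity scores; in a real system these come from the vector store.
scored = [(0.12, "Visa fees"), (0.91, "Company history"), (0.55, "Office address")]
selected = top_k(scored, k=2)
prompt = build_prompt("When was the company founded?", selected)
print(prompt)
```

A larger `k` gives the model more context to draw on, at the cost of a longer prompt per query.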


### 5.1. Testing the Query Chain

In [14]:
def test_query_performance():
    """Test query with timing"""
    query = "Can you briefly describe the Korea Study Center?"
    start_time = time.perf_counter()  # monotonic clock, better suited to measuring durations
    result = conversation_chain.invoke({"question": query})
    end_time = time.perf_counter()
    
    print(f"Query processed in {end_time - start_time:.2f} seconds")
    print("Answer:", result["answer"])
    if "source_documents" in result:
        print(f"Used {len(result['source_documents'])} source documents")

In [15]:
test_query_performance()


Query processed in 2.81 seconds
Answer: Based on the information provided, here's a summary of the Korea Study Consultant Center Vietnam:

*   **Services:** They provide educational consulting services for Vietnamese students who want to study in Korea.
*   **Expertise:** They specialize in helping students get into top Korean universities, especially for Master's and PhD programs. They have a high success rate with KGSP scholarships and placements in STEM fields.
*   **Network:** They have connections with rectors/presidents of 15+ top Korean universities, international affairs officers, faculty members, and government officials.
*   **Achievements:** They've supported 450+ Vietnamese students, have a high scholarship success rate, developed specialized training programs, and published research on international education.
*   **Contact:** Hotline, email, Zalo, and Facebook are provided for contact.


### 5.2. Initializing Chat Interface with Gradio

In [16]:
memory = ConversationBufferWindowMemory(memory_key='chat_history', return_messages=True)
conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)
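
`ConversationBufferWindowMemory` keeps only the most recent exchanges, so older turns drop out of the prompt as the chat grows. Re-creating the memory, as in the cell above, starts the chat interface with an empty history. The sketch below shows the windowing idea with a bounded deque. `WindowMemory` is a hypothetical stand-in, not the LangChain class.

```python
from collections import deque


class WindowMemory:
    """Keep only the last `k` (question, answer) exchanges."""

    def __init__(self, k):
        self._turns = deque(maxlen=k)  # the oldest turn is dropped automatically

    def add_turn(self, question, answer):
        self._turns.append((question, answer))

    def history(self):
        return list(self._turns)


window = WindowMemory(k=2)
window.add_turn("q1", "a1")
window.add_turn("q2", "a2")
window.add_turn("q3", "a3")
print(window.history())
```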

In [17]:
def chat(message, history):
    result = conversation_chain.invoke({"question": message})
    return result["answer"]

In [18]:
import gradio as gr
view = gr.ChatInterface(chat, type="messages", theme="soft").launch(inbrowser=True)

* Running on local URL:  http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.




## 6. Conclusion

This notebook demonstrated how to build a complete RAG system: loading and chunking documents, creating vector embeddings, storing them in Chroma, and serving an interactive chat interface. The system can be extended with other document types, and answer quality can be tuned, for example by adjusting the chunk size or the number of retrieved chunks.