<a href="https://colab.research.google.com/github/estellacoding/ll-rag-chroma/blob/main/langchain_llamaindex_rag_chroma.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LangChain RAG

## Install packages

In [None]:
!pip install pypdf
!pip install langchain_community
!pip install langchain-openai
!pip install langchain-chroma
!pip install chromadb
!pip install tiktoken

## Download data

In [2]:
!mkdir "data"
!wget "https://arxiv.org/pdf/2308.00352" -O data/metagpt.pdf
!ls

--2025-01-15 05:01:36--  https://arxiv.org/pdf/2308.00352
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.67.42, 151.101.195.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16753634 (16M) [application/pdf]
Saving to: ‘data/metagpt.pdf’


2025-01-15 05:01:37 (35.9 MB/s) - ‘data/metagpt.pdf’ saved [16753634/16753634]

data  sample_data
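
A re-runnable alternative to the shell cell above can be sketched in Python with only the standard library. `download_if_missing` is a hypothetical helper name, not part of the notebook, and it assumes that a file already present at the destination is complete.

```python
from pathlib import Path
from urllib.request import urlopen


def download_if_missing(url: str, dest: str) -> Path:
    """Download url to dest unless dest already exists (hypothetical helper)."""
    path = Path(dest)
    path.parent.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    if not path.exists():
        with urlopen(url) as response:
            path.write_bytes(response.read())
    return path
```

Calling `download_if_missing("https://arxiv.org/pdf/2308.00352", "data/metagpt.pdf")` would fetch the paper once and skip the download on later runs.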


## Load data

In [3]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = PyPDFLoader("./data/metagpt.pdf")
text_splitter = RecursiveCharacterTextSplitter()
pages = loader.load_and_split(text_splitter=text_splitter)

print(pages[-1].page_content)

Published as a conference paper at ICLR 2024
Table 9: Additional results of pure MetaGPT w/o feedback on SoftwareDev. Averages (Avg.) of 70 tasks are calculated and 10 randomly selected tasks are
included. ‘#’ denotes ‘The number of’, while ‘ID’ is ‘Task ID’.
ID Code statistics Doc statistics Cost statistics Cost of revision Code executability
#code files #lines of code #lines per code file #doc files #lines of doc #lines per doc file #prompt tokens #completion tokens time costs money costs
0 5.00 196.00 39.20 3.00 210.00 70.00 24087.00 6157.00 582.04 $ 1.09 1. TypeError 4
1 6.00 191.00 31.83 3.00 230.00 76.67 32517.00 6238.00 566.30 $ 1.35 1. TypeError 4
2 3.00 198.00 66.00 3.00 235.00 78.33 21934.00 6316.00 553.11 $ 1.04 1. lack
@app.route(’/’)
3
3 5.00 164 32.80 3.00 202.00 67.33 22951.00 5312.00 481.34 $ 1.01 1. PNG file miss-
ing 2. Compile bug
fixes
2
4 6.00 203.00 33.83 3.00 210.00 70.00 30087.00 6567.00 599.58 $ 1.30 1. PNG file
missing 2. Com-
pile bug fixes 3.
pygame.surface 
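
`RecursiveCharacterTextSplitter` tries a list of separators from coarse to fine (`"\n\n"`, `"\n"`, `" "`, `""` by default) and only falls back to a finer one for pieces that are still too long. The idea can be sketched as follows. This is a simplified illustration, not LangChain's implementation: `recursive_split` is a hypothetical function and it ignores chunk overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters.

    Simplified illustration: try the coarsest separator first and recurse
    with finer ones only for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb\n\nccc ddd eee", chunk_size=8))
# ['aaa bbb', 'ccc ddd', 'eee']
```

The first paragraph fits in one chunk, so it is kept whole; the second is too long and is split on spaces instead.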

## Vectorization

In [4]:
# Delete the folder if it exists
import shutil
import os

if os.path.exists("./vector"):
    shutil.rmtree("./vector")

In [5]:
# Text -> split/index -> embed -> build database -> persist
from langchain.indexes import VectorstoreIndexCreator
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

# Create the vector index
index_creator = VectorstoreIndexCreator(
    embedding=OpenAIEmbeddings(api_key=OPENAI_API_KEY),
    vectorstore_cls=Chroma,  # use Chroma as the vector database
    vectorstore_kwargs={"persist_directory": "./vector"}
)

# Split, embed, index, and persist the already-split documents (pages)
docsearch = index_creator.from_documents(pages)

## Vector database

In [6]:
from langchain_chroma import Chroma

db = Chroma(embedding_function=OpenAIEmbeddings(api_key=OPENAI_API_KEY), persist_directory='./vector')

In [7]:
similarity_context = db.similarity_search("Describe the five roles in MetaGPT framework", k=10)

# Print the retrieved similar content
print("\nSimilar context:")
for i, doc in enumerate(similarity_context, start=1):
    print(f"\n--- Context {i} ---")
    print(doc.page_content)


Similar context:

--- Context 1 ---
software company: Product Manager, Architect, Project Manager, Engineer, and QA Engineer, as
shown in Figure 1. In MetaGPT, we specify the agent’s profile, which includes their name, profile,
goal, and constraints for each role. We also initialize the specific context and skills for each role.
For instance, a Product Manager can use web search tools, while an Engineer can execute code, as
shown in Figure 2. All agents adhere to the React-style behavior as described in Yao et al. (2022).
Every agent monitors the environment ( i.e., the message pool in MetaGPT) to spot important ob-
servations (e.g.,, messages from other agents). These messages can either directly trigger actions or
assist in finishing the job.
Workflow across Agents By defining the agents’ roles and operational skills, we can establish
basic workflows. In our work, we follow SOP in software development, which enables all agents to
work in a sequential manner.
4

--- Context 2 ---
Published as
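
Conceptually, `similarity_search` embeds the query and returns the `k` stored chunks whose vectors lie closest to it. The sketch below shows that ranking step with hand-written toy vectors standing in for OpenAI embeddings. Cosine similarity and a linear scan are assumptions chosen for clarity; the metric and index Chroma actually uses depend on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional vectors; real embeddings have far more dimensions.
store = [
    ("chunk about the five roles", [0.9, 0.1, 0.0]),
    ("chunk about cost statistics", [0.0, 0.2, 0.9]),
    ("chunk about the workflow across agents", [0.7, 0.6, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]

for text, score in top_k(query_vec, store, k=2):
    print(f"{score:.3f}  {text}")
```

With these toy vectors the roles chunk ranks first and the workflow chunk second, while the cost-statistics chunk is dropped.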

## Run queries

In [8]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

llm = ChatOpenAI(model="gpt-4o-mini", openai_api_key=OPENAI_API_KEY)

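# Set up the retriever: fetch the k most relevant chunks
retriever = db.as_retriever(search_kwargs={"k": 5})

# The system prompt is kept in Traditional Chinese; in English it reads:
# "You are a professional assistant. Extract the accurate answer from the given content."
system_prompt = (
    "你是一個專業的助理，請從給定內容中提取準確答案。"
    "Context: {context}"
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

# Create the chain that stuffs the retrieved documents into the prompt
question_answer_chain = create_stuff_documents_chain(llm, prompt)

# Create the retrieval chain
chain = create_retrieval_chain(retriever, question_answer_chain)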
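
The two chain constructors above divide the work: the retrieval chain sends the user's `input` to the retriever, and the stuff-documents chain joins every retrieved chunk into the `{context}` slot of the prompt before calling the model. A minimal sketch of that flow with plain functions follows. `fake_retriever` and `fake_llm` are hypothetical stand-ins so that it runs without an API key, and the code illustrates the data flow only, not LangChain's implementation.

```python
def stuff_documents(docs, separator="\n\n"):
    """Join every retrieved chunk into one context string."""
    return separator.join(docs)


def retrieval_chain(question, retriever, llm, system_template):
    """Retrieve, fill the prompt, call the model, and return all three parts."""
    docs = retriever(question)
    system_message = system_template.format(context=stuff_documents(docs))
    messages = [("system", system_message), ("human", question)]
    return {"input": question, "context": docs, "answer": llm(messages)}


# Hypothetical stand-ins for the Chroma retriever and the chat model.
def fake_retriever(question):
    return ["Chunk A about roles.", "Chunk B about workflow."]


def fake_llm(messages):
    return f"(model reply to {len(messages)} messages)"


result = retrieval_chain(
    "What are the five roles?",
    fake_retriever,
    fake_llm,
    "Answer using only the given content.\nContext: {context}",
)
print(result["answer"])
```

The returned dictionary mirrors the shape the notebook relies on: the answer is read from `result['answer']`, and the retrieved chunks remain available under `context`.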

In [9]:
result = chain.invoke({"input": "What are the five roles in the MetaGPT framework?"})
result['answer']

'The five roles in the MetaGPT framework are Product Manager, Architect, Project Manager, Engineer, and QA Engineer.'

In [10]:
result = chain.invoke({"input": "Describe the five roles in MetaGPT framework"})
result['answer']

'The five roles in the MetaGPT framework, as specified, are:\n\n1. **Product Manager**: Focuses on overseeing the product lifecycle and ensuring that the product meets the market needs. They can utilize web search tools to gather relevant information.\n\n2. **Architect**: Responsible for designing the overall system architecture and ensuring that the software aligns with technical standards and requirements.\n\n3. **Project Manager**: Manages project timelines, resources, and coordination among team members to ensure successful project delivery.\n\n4. **Engineer**: Engages in the technical development of the software. They have the capability to execute code, which enables them to carry out programming tasks effectively.\n\n5. **QA Engineer**: Focuses on quality assurance processes, testing the software to identify bugs or issues, and ensuring that the final product meets quality standards. \n\nEach role has specific responsibilities and skills, contributing to the collaborative workfl

In [11]:
# The same request as in the previous cell, asked in Traditional Chinese
result = chain.invoke({"input": "說明 MetaGPT 框架中的五個角色"})
result['answer']

'MetaGPT 框架中的五個角色包括：\n\n1. **產品經理 (Product Manager)** - 負責定義產品的需求和方向，並確保產品的成功推出。\n2. **架構師 (Architect)** - 專注於設計系統的整體架構與技術選型，以確保系統的可擴展性和可靠性。\n3. **項目經理 (Project Manager)** - 負責管理項目的進度和資源，協調團隊成員之間的合作，確保項目按時完成。\n4. **工程師 (Engineer)** - 實際編寫代碼並進行系統開發，解決技術問題。\n5. **質量保證工程師 (QA Engineer)** - 負責測試產品以確保其質量，並檢查系統是否符合需求。'

# LlamaIndex RAG

## Install packages

In [None]:
!pip install llama_index
!pip install llama-index-embeddings-openai
!pip install llama-index-vector-stores-chroma

## Download data

In [2]:
!mkdir "data"
!curl -L "https://arxiv.org/pdf/2308.00352" -e "https://arxiv.org/pdf/2308.00352" -o "data/metagpt.pdf"
!ls

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15.9M  100 15.9M    0     0  25.7M      0 --:--:-- --:--:-- --:--:-- 25.7M
data  sample_data


## Load data

In [3]:
from llama_index.core import SimpleDirectoryReader

# Read every file in the given folder
documents = SimpleDirectoryReader("./data").load_data(show_progress=True)
print("Last loaded document:", documents[-1])

Loading files: 100%|██████████| 1/1 [00:01<00:00,  1.20s/file]

Last loaded document: Doc ID: dfd9ad1a-813a-46c9-8e29-1bf670af3b45
Text: Published as a conference paper at ICLR 2024 Table 9: Additional
results of pure MetaGPT w/o feedback on SoftwareDev. Averages (Avg.)
of 70 tasks are calculated and 10 randomly selected tasks are
included. ‘#’ denotes ‘The number of’, while ‘ID’ is ‘Task ID’. ID
Code statistics Doc statistics Cost statistics Cost of revision Code
executability #...





## Vectorization

In [4]:
# Delete the folder if it exists
import shutil
import os

if os.path.exists("./vector"):
    shutil.rmtree("./vector")

In [5]:
from llama_index.embeddings.openai import OpenAIEmbedding
import openai
from llama_index.llms.openai import OpenAI
from llama_index.core import VectorStoreIndex
from llama_index.core import Settings, PromptTemplate
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
import chromadb
from google.colab import userdata
import os

openai.api_key = userdata.get("OPENAI_API_KEY")
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

llm = OpenAI(model="gpt-4o-mini")

# The instruction is kept in Traditional Chinese; in English it reads:
# "You are a professional assistant. Extract the accurate answer from the given content."
# LlamaIndex fills {context_str} with the retrieved text and {query_str} with the question.
qa_prompt = (
    "你是一個專業的助理，請從給定內容中提取準確答案。\n"
    "Context: {context_str}\n"
    "Question: {query_str}\n"
)
prompt_template = PromptTemplate(qa_prompt)

# Set up the Chroma vector database
db = chromadb.PersistentClient(path="./vector")
chroma_collection = db.get_or_create_collection("llamaindex_chroma")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build the vector index from the loaded documents
print("Building the vector index...")
index = VectorStoreIndex.from_documents(
    documents,  # split into nodes by the default SentenceSplitter
    storage_context=storage_context,
    embed_model=OpenAIEmbedding(embed_batch_size=10),
    show_progress=True,
)
print("Vector index built.")

# Turn the index into a query engine; a custom QA prompt is passed as text_qa_template
query_engine = index.as_query_engine(llm=llm, text_qa_template=prompt_template)
print("Query engine ready:", query_engine)

Building the vector index...


Parsing nodes:   0%|          | 0/29 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/34 [00:00<?, ?it/s]

Vector index built.
Query engine ready: <llama_index.core.query_engine.retriever_query_engine.RetrieverQueryEngine object at 0x7eb768dcf0a0>
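
The progress bars above show 29 documents parsed into 34 nodes. With `embed_batch_size=10`, those nodes would be embedded in groups of at most 10. The sketch below only illustrates what the batch size means, assuming batches are filled in order; `batched` is a hypothetical helper, not a LlamaIndex function.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


nodes = [f"node-{i}" for i in range(34)]  # 34 nodes, as in the log above
batch_sizes = [len(batch) for batch in batched(nodes, 10)]
print(batch_sizes)  # [10, 10, 10, 4]
```

A larger batch size means fewer requests, at the cost of larger individual payloads.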


## Vector database

In [6]:
# Get a retriever from the index; similarity_top_k sets how many chunks are returned
retriever = index.as_retriever(similarity_top_k=5)

# Run the similarity search
query = "Describe the five roles in MetaGPT framework"
print(f"Running similarity search: {query}")
similarity_context = retriever.retrieve(query)

# Print the retrieved similar content
print("\nSimilar context:")
for i, doc in enumerate(similarity_context, start=1):
    print(f"\n--- Context {i} ---")
    print(doc.get_content())

Running similarity search: Describe the five roles in MetaGPT framework

Similar context:

--- Context 1 ---
Published as a conference paper at ICLR 2024
Figure 2: An example of the communication protocol (left) and iterative programming with exe-
cutable feedback (right). Left: Agents use a shared message pool to publish structured messages.
They can also subscribe to relevant messages based on their profiles. Right: After generating the
initial code, the Engineer agent runs and checks for errors. If errors occur, the agent checks past
messages stored in memory and compares them with the PRD, system design, and code files.
3 M ETAGPT: A M ETA-PROGRAMMING FRAMEWORK
MetaGPT is a meta-programming framework for LLM-based multi-agent systems. Sec. 3.1 pro-
vides an explanation of role specialization, workflow and structured communication in this frame-
work, and illustrates how to organize a multi-agent system within the context of SOPs. Sec. 3.2
presents a communication protocol that enhances role communication e

## Run queries

In [7]:
response = query_engine.query("What are the five roles in the MetaGPT framework?")
print(response)

The five roles in the MetaGPT framework are Product Manager, Architect, Project Manager, Engineer, and QA Engineer.


In [8]:
# A request to describe the five roles, asked in Traditional Chinese
response = query_engine.query("說明 MetaGPT 框架中的五個角色")
print(response)

MetaGPT 框架中定義了五個角色，分別是產品經理、架構師、項目經理、工程師和質量保證工程師。每個角色都有其特定的職責和技能，這樣的角色專業化使得複雜的工作可以分解為更小、更具體的任務。產品經理負責業務分析和洞察，工程師則專注於編程。這些角色的明確定義有助於在軟體開發過程中建立基本的工作流程，促進各角色之間的協作。
