<a href="https://colab.research.google.com/github/estellacoding/ll-rag-chroma/blob/main/langchain_llamaindex_rag_chroma.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LangChain RAG

## Install packages

In [None]:
!pip install pypdf
!pip install langchain_community
!pip install langchain-openai
!pip install langchain-chroma
!pip install chromadb
!pip install tiktoken

## Download data

In [2]:
!mkdir "data"
!wget "https://openreview.net/pdf?id=VtmBAGCN7o" -O data/metagpt.pdf
!ls

--2025-01-15 02:09:08--  https://openreview.net/pdf?id=VtmBAGCN7o
Resolving openreview.net (openreview.net)... 35.184.86.251
Connecting to openreview.net (openreview.net)|35.184.86.251|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16911937 (16M) [application/pdf]
Saving to: ‘data/metagpt.pdf’


2025-01-15 02:09:11 (7.73 MB/s) - ‘data/metagpt.pdf’ saved [16911937/16911937]

data  sample_data
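
Before parsing, it can help to confirm that the download really is a PDF, since a server may return an HTML error page under the same file name. The helper below is a minimal sketch: `looks_like_pdf` is a hypothetical name introduced here, and it only checks the `%PDF-` magic bytes, not whether the file is complete or well formed.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file exists and starts with the PDF magic bytes."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        # Every PDF file begins with the header "%PDF-<version>"
        return f.read(5) == b"%PDF-"


# Path used by the download cell above
print(looks_like_pdf("data/metagpt.pdf"))
```

If this prints `False`, re-running the download (or opening the file to inspect what the server returned) is a reasonable next step.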


## Load data

In [3]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("./data/metagpt.pdf")
# Use chunks of up to 500 characters with a 200-character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=200)
pages = loader.load_and_split(text_splitter=text_splitter)

print(pages[-1].page_content)

8 5.00 215.00 43.00 3.00 301.00 100.33 29372.00 6499.00 621.73 $ 1.27 1. tensorflow ver-
sion error 2. model
training method not
implement
2
9 5.00 215.00 43.00 3.00 270.00 90.00 24799.00 5734.00 550.88 $ 1.27 1. dependency er-
ror 2. URL 403 er-
ror
3
10 3.00 93.00 31.00 3.00 254.00 84.67 24109.00 5363.00 438.50 $ 0.92 1. dependency er-
ror 2. missing main
func.
4
Avg. 4.71 191.57 42.98 3.00 240.00 80.00 26626.86 6218.00 516.71 $1.12 0.51 (only consider
item scored 2, 3 or
4)
3.36
29
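
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window chunker. It is only a sketch of the overlap idea: `chunk_text` is a hypothetical helper, and `RecursiveCharacterTextSplitter` itself splits on separators such as paragraphs and lines before merging pieces, so its real chunks will not match this fixed-width window.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_text(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

Each printed chunk begins with the last four characters of the previous one, which is the role the 200-character overlap plays in the cell above: text that straddles a chunk boundary still appears whole in at least one chunk.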


## Vectorization

In [4]:
# Delete the vector store directory if it already exists
import shutil
import os

if os.path.exists("./vector"):
    shutil.rmtree("./vector")

In [5]:
# Text -> split/index -> embed -> build database -> persist
from langchain.indexes import VectorstoreIndexCreator
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

# Use OpenAI's embedding model to convert text into embedding vectors
embeddings = OpenAIEmbeddings(api_key=OPENAI_API_KEY)

# Create the vector index
index_creator = VectorstoreIndexCreator(
    embedding=embeddings,
    vectorstore_cls=Chroma, # Vector database to use
    vectorstore_kwargs={"persist_directory": "./vector"}
)

# Embed and index the already-split documents (pages), then persist them
docsearch = index_creator.from_documents(pages)

## Vector database

In [6]:
from langchain_chroma import Chroma

db = Chroma(embedding_function=embeddings, persist_directory='./vector')

In [7]:
similarity_context = db.similarity_search("Describe the five roles in MetaGPT framework", k=10)
for doc in similarity_context:
    print("-"*66)
    print(doc.page_content)

------------------------------------------------------------------
MetaGPT is a meta-programming framework for LLM-based multi-agent systems. Sec. 3.1 pro-
vides an explanation of role specialization, workflow and structured communication in this frame-
work, and illustrates how to organize a multi-agent system within the context of SOPs. Sec. 3.2
presents a communication protocol that enhances role communication efficiency. We also imple-
ment structured communication interfaces and an effective publish-subscribe mechanism. These
------------------------------------------------------------------
agents within the MetaGPT framework. This platform provides users with an operational interface,
allowing users to easily manage a variety of agents with different emotions, personalities, and capa-
bilities for specific tasks.
16
------------------------------------------------------------------
Preprint
• We introduce MetaGPT, a meta-programming framework for multi-agent collaboration based 
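
Conceptually, a similarity search ranks stored chunk embeddings by how close they are to the query embedding and returns the top `k`. The sketch below shows that ranking with cosine similarity on hand-made three-dimensional vectors. The vectors, names, and the choice of cosine similarity are all assumptions for illustration: real embeddings have many more dimensions, and the metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int):
    """Return the k (name, score) pairs most similar to the query."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" for three hypothetical chunks
toy_vectors = {
    "roles": [0.9, 0.1, 0.0],
    "benchmarks": [0.1, 0.9, 0.2],
    "costs": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]

for name, score in top_k(query_vector, toy_vectors, k=2):
    print(name, round(score, 3))
```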

## Run queries

In [8]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

llm = ChatOpenAI(model="gpt-4o-mini", openai_api_key=OPENAI_API_KEY)

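# Configure the retriever to return the k most relevant chunks
retriever = db.as_retriever(search_kwargs={"k": 100})

system_prompt = (
    # "You are a professional assistant; extract the accurate answer from the given content."
    "你是一個專業的助理，請從給定內容中提取準確答案。"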
    "Context: {context}"
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

# Create the chain that passes the retrieved documents to the LLM
question_answer_chain = create_stuff_documents_chain(llm, prompt)

# Create the retrieval chain
chain = create_retrieval_chain(retriever, question_answer_chain)

In [9]:
result = chain.invoke({"input": "What are the five roles in the MetaGPT framework?"})
result['answer']

'The five roles in the MetaGPT framework are: Product Manager, Architect, Project Manager, Engineer, and QA Engineer.'

In [10]:
result = chain.invoke({"input": "Describe the five roles in MetaGPT framework"})
result['answer']

'In the MetaGPT framework, the five roles defined are:\n\n1. **Product Manager**: Responsible for analyzing competition and user needs to create Product Requirements Documents (PRDs) that guide the developmental process.\n\n2. **Architect**: Focuses on system design, generating system interface designs and flow diagrams based on the PRDs provided by the Product Manager.\n\n3. **Project Manager**: Manages the overall project workflow and ensures that tasks are distributed effectively among team members.\n\n4. **Engineer**: Executes code based on the system design and PRDs, and is responsible for coding tasks and implementing functionalities.\n\n5. **QA Engineer**: Formulates test cases to ensure code quality and performs quality assurance checks on the produced software to validate its functionality and reliability.'

In [11]:
# Same question in Traditional Chinese: "Describe the five roles in the MetaGPT framework"
result = chain.invoke({"input": "說明 MetaGPT 框架中的五個角色"})
result['answer']

'MetaGPT 框架中的五個角色包括：\n\n1. **產品經理 (Product Manager)**：負責生成產品需求文檔 (PRD)，分析市場競爭和用戶需求，以指導開發過程。\n\n2. **架構師 (Architect)**：負責系統界面的設計，生成系統模塊設計和交互序列的文檔，確保系統的整體架構符合需求。\n\n3. **項目經理 (Project Manager)**：負責任務分配，協調各角色之間的合作，確保項目的順利進行。\n\n4. **工程師 (Engineer)**：負責根據設計文檔執行代碼，編寫實際的軟件解決方案，並進行測試和調試。\n\n5. **質量保證工程師 (QA Engineer)**：負責制定測試用例，檢查代碼的質量，確保最終產品符合規範和需求。 \n\n這些角色協同工作，通過標準化操作程序 (SOPs) 提升軟件開發的效率和質量。'
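
The "stuff" strategy used above amounts to concatenating the retrieved chunks into a single context string and substituting it into the `{context}` slot of the system prompt. The following is a simplified sketch of that idea in plain Python. `stuff_documents` and `build_messages` are hypothetical helpers, the blank-line separator is an assumption, and the real chain works on `Document` objects rather than bare strings.

```python
def stuff_documents(contents: list[str], separator: str = "\n\n") -> str:
    """Join all retrieved chunk texts into one context string."""
    return separator.join(contents)


def build_messages(system_template: str, contents: list[str], question: str):
    """Fill the {context} slot and pair the result with the user's question."""
    context = stuff_documents(contents)
    return [
        ("system", system_template.format(context=context)),
        ("human", question),
    ]


template = "Extract an accurate answer from the given content. Context: {context}"
retrieved = [
    "MetaGPT defines five roles.",
    "The roles include Product Manager and Architect.",
]
for role, text in build_messages(template, retrieved, "What are the five roles?"):
    print(f"[{role}] {text}")
```

Because every retrieved chunk is placed in one prompt, a large `k` increases the prompt size roughly in proportion to the number of chunks, which is worth keeping in mind when choosing `k` for the retriever.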

# LlamaIndex RAG

## Install packages

In [None]:
!pip install llama_index
!pip install llama-index-embeddings-openai
!pip install llama-index-vector-stores-chroma

## Download data

In [2]:
!mkdir "data"
!curl -L "https://openreview.net/pdf?id=VtmBAGCN7o" -e "https://openreview.net/pdf?id=VtmBAGCN7o" -o "data/metagpt.pdf"
!ls

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 16.1M  100 16.1M    0     0  20.1M      0 --:--:-- --:--:-- --:--:-- 20.1M
data  sample_data


## Load data

In [14]:
from llama_index.core import SimpleDirectoryReader

# Read every file in the given directory
documents = SimpleDirectoryReader("./data").load_data(show_progress=True)
print("Last loaded document:", documents[-1])

Loading files: 100%|██████████| 1/1 [00:04<00:00,  5.00s/file]

Last loaded document: Doc ID: e26de12a-1195-4742-8380-d3ab9dcb0407
Text: Preprint Table 9: Additional results of pure MetaGPT w/o
feedback on SoftwareDev. Averages (Avg.) of 70 tasks are calculated
and 10 randomly selected tasks are included. ‘#’ denotes ‘The number
of’, while ‘ID’ is ‘Task ID’. ID Code statistics Doc statistics Cost
statistics Cost of revision Code executability #code files #lines of
code #lines per...





## Vectorization

In [4]:
# Delete the vector store directory if it already exists
import shutil
import os

if os.path.exists("./vector"):
    shutil.rmtree("./vector")

In [5]:
from llama_index.embeddings.openai import OpenAIEmbedding
import openai
from llama_index.core import VectorStoreIndex
from llama_index.core import Settings, PromptTemplate
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
import chromadb
from google.colab import userdata
import os

openai.api_key = userdata.get("OPENAI_API_KEY")
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

# Use OpenAI's embedding model to convert text into embedding vectors
embed_model = OpenAIEmbedding(embed_batch_size=10)

# Question-answering prompt; LlamaIndex fills in {context_str} and {query_str}
# Instruction text: "You are a professional assistant; extract the accurate answer from the given content."
system_prompt = (
    "你是一個專業的助理，請從給定內容中提取準確答案。\n"
    "Context: {context_str}\n"
    "Question: {query_str}\n"
)
prompt_template = PromptTemplate(system_prompt)

# Set up the Chroma vector database
db = chromadb.PersistentClient(path="./vector")
chroma_collection = db.get_or_create_collection("llamaindex_chroma")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build the vector index from the embeddings of the loaded documents
print("Building vector index...")
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
    show_progress=True,
)
print("Vector index built.")

# Turn the index into a query engine that uses the custom QA prompt
query_engine = index.as_query_engine(text_qa_template=prompt_template)
print("Query engine ready:", query_engine)

Building vector index...


Parsing nodes:   0%|          | 0/29 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/34 [00:00<?, ?it/s]

Vector index built.
Query engine ready: <llama_index.core.query_engine.retriever_query_engine.RetrieverQueryEngine object at 0x7eebf67d45b0>
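
The progress output above reports 34 embeddings, and the embedding model was created with `embed_batch_size=10`. The sketch below shows how 34 items divide into batches of at most 10. `batched` is a hypothetical helper, and treating the batches as consecutive slices is an assumption about how requests are grouped, not a description of the library's internals.

```python
def batched(items: list, batch_size: int) -> list[list]:
    """Split items into consecutive batches of at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# 34 placeholder nodes, matching the count in the progress output above
nodes = [f"node-{i}" for i in range(34)]
batches = batched(nodes, batch_size=10)
print([len(batch) for batch in batches])  # [10, 10, 10, 4]
```

A smaller batch size means more requests with smaller payloads; a larger one means fewer, bigger requests.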


## Vector database

In [7]:
# Get a retriever from the index that returns the 5 most similar nodes
retriever = index.as_retriever(similarity_top_k=5)

# Run the similarity search
query = "Describe the five roles in MetaGPT framework"
print(f"Running similarity search: {query}")
similarity_context = retriever.retrieve(query)

# Print the retrieved content
print("\nSimilar context:")
for i, doc in enumerate(similarity_context, start=1):
    print(f"\n--- Context {i} ---")
    print(doc.get_content())

Running similarity search: Describe the five roles in MetaGPT framework

Similar context:

--- Context 1 ---
Preprint
Table 2: Comparison of capabilities for MetaGPT and other approaches. ‘!’ indicates the
presence of a specific feature in the corresponding framework, ‘%’ its absence.
Framework Capabiliy AutoGPT LangChain AgentVerse ChatDev MetaGPT
PRD generation % % % % !
Tenical design genenration % % % % !
API interface generation % % % % !
Code generation ! ! ! ! !
Precompilation execution % % % % !
Role-based task management % % % ! !
Code review % % ! ! !
Table 3: Ablation study on roles. ‘#’ denotes ‘The number of’, ‘Product’ denotes ‘Product man-
ager’, and ‘Project’ denotes ‘Project manager’. ‘ !’ indicates the addition of a specific role. ‘Revi-
sions’ refers to ‘Human Revision Cost’.
Engineer Product Architect Project #Agents #Lines Expense Revisions Executability
! % % % 1 83.0 $ 0.915 10 1.0
! ! % % 2 112.0 $ 1.059 6.5 2.0
! ! ! % 3 143.0 $ 1.204 4.0 2.5
! ! % ! 3 205.0 $ 1.251 3.5 2.0
! ! ! ! 4 1

## Run queries

In [8]:
response = query_engine.query("What are the five roles in the MetaGPT framework?")
print(response)

Product Manager, Architect, Project Manager, Engineer, QA Engineer


In [9]:
# Same question in Traditional Chinese: "Describe the five roles in the MetaGPT framework"
response = query_engine.query("說明 MetaGPT 框架中的五個角色")
print(response)

MetaGPT框架中的五個角色是產品經理（Product Manager）、架構師（Architect）、專案經理（Project Manager）、工程師（Engineer）和品質保證工程師（QA Engineer）。每個角色在框架中有特定的設定，包括他們的名稱、設定、目標和約束，並為每個角色初始化特定的上下文和技能。例如，產品經理可以使用網絡搜索工具，而工程師可以執行代碼。在MetaGPT中，所有代理都遵循React風格的行為。每個代理監控環境（即MetaGPT中的消息池）以發現重要的觀察結果，這些消息可以直接觸發操作或幫助完成工作。
