
## LangChain 0.3+ Vector Database Comparison
Compares the performance and characteristics of different vector databases.

Required packages:
- langchain>=0.3.0
- langchain-community>=0.0.1
- chromadb>=0.4.0
- faiss-cpu>=1.7.4
- pymilvus>=2.3.3
- pinecone-client>=3.0.0
- pandas>=2.0.0
- numpy>=1.24.0
- python-dotenv>=0.19.0
"""




# Vector Database Performance Comparison

## Method Characteristics Table

| Characteristic   | FAISS | Annoy | ScaNN | Milvus | Weaviate | Pinecone | Qdrant | ChromaDB |
|------------------|:-----:|:-----:|:-----:|:------:|:--------:|:--------:|:------:|:--------:|
| Query efficiency |   ○   |   △   |   ○   |   ○    |    ○     |    ○     |   ○    |    △     |
| Insert/update    |   △   |   ×   |   ○   |   ○    |    ○     |    ○     |   ○    |    ○     |
| Memory usage     |   △   |   ○   |   △   |   ○    |    ○     |    ○     |   ○    |    △     |
| Index build time |   ○   |   △   |   ○   |   ○    |    △     |    △     |   ○    |    △     |
| Scalability      |   ○   |   △   |   ○   |   ○    |    ○     |    ○     |   ○    |    △     |
| Recall           |   ○   |   ×   |   ○   |   ○    |    ○     |    ○     |   ○    |    ○     |
| Multimodal support |   ×   |   ×   |   ×   |   ○    |    ○     |    ○     |   ○    |    ○     |
| Concurrency      |   ○   |   △   |   ○   |   ○    |    ○     |    ○     |   ○    |    △     |



## Problem Characteristics Table

| Characteristic                  | FAISS | Annoy | ScaNN | Milvus | Weaviate | Pinecone | Qdrant | ChromaDB |
|---------------------------------|:-----:|:-----:|:-----:|:------:|:--------:|:--------:|:------:|:--------:|
| High-speed retrieval            |   ○   |   △   |   ○   |   ○    |    ○     |    ○     |   ○    |    △     |
| Multimodal support              |   ×   |   ×   |   ×   |   ○    |    ○     |    ○     |   ○    |    ○     |
| Memory-constrained applications |   △   |   ○   |   △   |   ○    |    ○     |    ○     |   ○    |    △     |
| High concurrency                |   ○   |   △   |   ○   |   ○    |    ○     |    ○     |   ○    |    △     |
| Cloud solution                  |   ×   |   ×   |   ×   |   ○    |    ○     |    ○     |   ○    |    △     |



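## Method Characteristics vs. Problem Characteristics Matrix

| Method characteristic / Problem characteristic | High-speed retrieval | Multimodal support | Memory-constrained applications | High concurrency | Cloud solution |
|------------------------------------------------|:--------------------:|:------------------:|:-------------------------------:|:----------------:|:--------------:|
| Query efficiency                               |          ○           |         ×          |                △                |        ○         |       ×        |
| Insert/update                                  |          △           |         ×          |                ○                |        △         |       ×        |
| Memory usage                                   |          △           |         ×          |                ○                |        ○         |       ○        |
| Index build time                               |          ○           |         △          |                △                |        ○         |       ○        |
| Scalability                                    |          ○           |         ○          |                ○                |        ○         |       ○        |
| Recall                                         |          ○           |         ○          |                ○                |        ○         |       ○        |
| Multimodal support                             |          ×           |         ○          |                ○                |        ○         |       ○        |
| Concurrency                                    |          ○           |         ○          |                ○                |        ○         |       ○        |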



## Symbol Legend

- **○**: excellent performance, or a priority
- **△**: average performance, or no strong preference
- **×**: poor performance, or a concern
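
As a rough illustration of how the tables above might be used to shortlist a database, the sketch below converts the symbols into numbers and sums them over a set of requirements. The numeric weights (○ = 2, △ = 1, × = 0), the `rank_databases` helper, and the English row names are assumptions introduced here; the symbol rows are copied from the problem characteristics table.

```python
# Symbol-to-score mapping: an assumption made purely for illustration.
SCORE = {"○": 2, "△": 1, "×": 0}

DATABASES = ["FAISS", "Annoy", "ScaNN", "Milvus",
             "Weaviate", "Pinecone", "Qdrant", "ChromaDB"]

# One symbol per database, in the column order of the problem characteristics table.
PROBLEM_TABLE = {
    "High-speed retrieval":            "○△○○○○○△",
    "Multimodal support":              "×××○○○○○",
    "Memory-constrained applications": "△○△○○○○△",
    "High concurrency":                "○△○○○○○△",
    "Cloud solution":                  "×××○○○○△",
}


def rank_databases(requirements):
    """Return (database, total score) pairs, best first, for the given requirement names."""
    totals = {}
    for i, db in enumerate(DATABASES):
        totals[db] = sum(SCORE[PROBLEM_TABLE[req][i]] for req in requirements)
    # sorted() is stable, so databases with equal totals keep their table order.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rank_databases(["High-speed retrieval", "Cloud solution"])
for db, score in ranking:
    print(f"{db}: {score}")
```

A plain sum treats every requirement as equally important; weighting the requirements would be a natural extension if some matter more than others.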



In [7]:
import os
import time
import numpy as np
import pandas as pd
import psutil
import wikipediaapi
import logging
from typing import List
from langchain_community.vectorstores import FAISS, Milvus, Qdrant, Chroma
from langchain_openai import OpenAIEmbeddings
from qdrant_client import QdrantClient
from langchain_core.documents import Document

In [21]:
from dotenv import load_dotenv

# Configure logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

# Load environment variables
load_dotenv()

True

In [19]:
class RealDataVectorBenchmark:
    """Vector database benchmark based on real data."""

    def __init__(self, num_articles=100, embedding_dim=384):
        """Initialize the test data."""
        # Note: num_articles and embedding_dim are stored but not used below;
        # the article list is fixed and the vector size is set by the embedding model.
        self.num_articles = num_articles
        self.embedding_dim = embedding_dim
        self.embeddings = OpenAIEmbeddings()
        self.wikipedia = wikipediaapi.Wikipedia(user_agent='CoolBot/0.0 (https://example.org/coolbot/; coolbot@example.org) generic-library/0.0', language='en')

        # Fetch the Wikipedia articles
        self.test_texts = self.fetch_wikipedia_articles()
        self.test_vectors = self.texts_to_embeddings(self.test_texts)
        self.query_vector = self.texts_to_embeddings(["Artificial Intelligence"])[0]  # query used for testing

        print(self.test_texts)

    def fetch_wikipedia_articles(self):
        """Fetch real text from the Wikipedia API."""
        articles = [
            "Artificial intelligence", "Machine learning", "Deep learning",
            "Natural language processing", "Neural networks",
            "Quantum computing", "Blockchain", "Cloud computing",
            "Cybersecurity", "Data science"
        ]
        fetched_texts = []
        for title in articles:
            page = self.wikipedia.page(title)
            if page.exists():
                fetched_texts.append(page.summary[:1000])  # keep at most 1000 characters per article
            else:
                print(f"Article {title} does not exist")
        return fetched_texts

    def texts_to_embeddings(self, texts):
        """Convert texts into embedding vectors."""
        return self.embeddings.embed_documents(texts)

    def memory_usage(self):
        """Return the current process memory usage (RSS) in MB."""
        return psutil.Process().memory_info().rss / 1024 / 1024

    def calculate_recall(self, results):
        """Compute recall (%) for the query results."""
        retrieved_texts = [doc.page_content for doc in results]

        # Expected answers (matched as case-sensitive substrings of the retrieved texts)
        correct_answers = {"Artificial intelligence", "Machine learning", "Deep learning"}

        # Count each expected answer at most once
        matched = sum(1 for ans in correct_answers if any(ans in text for text in retrieved_texts))

        # Cap recall at 100%
        recall = min((matched / len(correct_answers)) * 100, 100)

        return recall

    def evaluate_faiss(self):
        """Benchmark FAISS."""
        logger.info("Testing FAISS ...")
        # Note: from_texts embeds the texts again, so this timing includes the embedding API call
        start_time = time.time()
        vectorstore = FAISS.from_texts(self.test_texts, embedding=self.embeddings)
        insert_time = time.time() - start_time

        start_time = time.time()
        results = vectorstore.similarity_search_by_vector(self.query_vector, k=3)
        query_time = time.time() - start_time

        # Extract the text of each query result
        retrieved_texts = [doc.page_content for doc in results]

        print("\n=== FAISS query results ===")
        print("🔍 Query: \"Artificial Intelligence\"")  # show the original query text
        print("📌 Top-3 relevant results:")
        for i, text in enumerate(retrieved_texts, 1):
            print(f"   {i}. {text[:200]}...")  # show only the first 200 characters
        print("--------------------------------\n")

        recall = self.calculate_recall(results)

        return {
            "Database": "FAISS",
            "Insert Time (s)": insert_time,
            "Query Time (s)": query_time,
            "Memory Usage (MB)": self.memory_usage(),
            "Recall (%)": recall
        }

    def evaluate_qdrant(self):
        """Benchmark Qdrant."""
        logger.info("Testing Qdrant ...")
        # Qdrant.from_texts builds its own in-memory client from `location`,
        # so no separate QdrantClient instance is needed here.

        start_time = time.time()
        vectorstore = Qdrant.from_texts(
            self.test_texts, embedding=self.embeddings, collection_name="qdrant_realdata", location=":memory:"
        )
        insert_time = time.time() - start_time

        start_time = time.time()
        results = vectorstore.similarity_search_by_vector(self.query_vector, k=3)
        query_time = time.time() - start_time

        # Extract the text of each query result
        retrieved_texts = [doc.page_content for doc in results]

        print("\n=== Qdrant query results ===")
        print("🔍 Query: \"Artificial Intelligence\"")
        print("📌 Top-3 relevant results:")
        for i, text in enumerate(retrieved_texts, 1):
            print(f"   {i}. {text[:200]}...")
        print("--------------------------------\n")

        recall = self.calculate_recall(results)

        return {
            "Database": "Qdrant",
            "Insert Time (s)": insert_time,
            "Query Time (s)": query_time,
            "Memory Usage (MB)": self.memory_usage(),
            "Recall (%)": recall
        }

    def evaluate_milvus(self):
        """Benchmark Milvus (expects a Milvus server at localhost:19530)."""
        logger.info("Testing Milvus ...")
        start_time = time.time()
        vectorstore = Milvus.from_texts(
            self.test_texts,
            embedding=self.embeddings,
            collection_name="milvus_realdata",
            connection_args={"host": "localhost", "port": "19530"}
        )
        insert_time = time.time() - start_time

        start_time = time.time()
        results = vectorstore.similarity_search_by_vector(self.query_vector, k=3)
        query_time = time.time() - start_time

        # Print the result texts rather than the raw query vector, matching the other backends
        retrieved_texts = [doc.page_content for doc in results]

        print("\n=== Milvus query results ===")
        print("🔍 Query: \"Artificial Intelligence\"")
        print("📌 Top-3 relevant results:")
        for i, text in enumerate(retrieved_texts, 1):
            print(f"   {i}. {text[:200]}...")
        print("--------------------------------\n")

        recall = self.calculate_recall(results)

        return {
            "Database": "Milvus",
            "Insert Time (s)": insert_time,
            "Query Time (s)": query_time,
            "Memory Usage (MB)": self.memory_usage(),
            "Recall (%)": recall
        }

    def evaluate_chroma(self):
        """Benchmark Chroma."""
        logger.info("Testing Chroma ...")
        # Note: this collection is persisted on disk, so re-running the benchmark adds the
        # same texts again; that may explain identical hits appearing in the top-k results.
        persist_directory = os.path.join("vectorstore", "chroma_realdata")
        os.makedirs(persist_directory, exist_ok=True)

        start_time = time.time()
        vectorstore = Chroma.from_texts(
            texts=self.test_texts, embedding=self.embeddings, collection_name="chroma_realdata", persist_directory=persist_directory
        )
        insert_time = time.time() - start_time

        start_time = time.time()
        results = vectorstore.similarity_search_by_vector(self.query_vector, k=3)
        query_time = time.time() - start_time

        # Extract the text of each query result
        retrieved_texts = [doc.page_content for doc in results]

        print("\n=== Chroma query results ===")
        print("🔍 Query: \"Artificial Intelligence\"")
        print("📌 Top-3 relevant results:")
        for i, text in enumerate(retrieved_texts, 1):
            print(f"   {i}. {text[:200]}...")
        print("--------------------------------\n")

        recall = self.calculate_recall(results)

        return {
            "Database": "ChromaDB",
            "Insert Time (s)": insert_time,
            "Query Time (s)": query_time,
            "Memory Usage (MB)": self.memory_usage(),
            "Recall (%)": recall
        }



    def run_benchmark(self):
        """Run all benchmarks."""
        results = [
            self.evaluate_faiss(),
            self.evaluate_qdrant(),
            # self.evaluate_milvus(),
            self.evaluate_chroma()
        ]
        
        df = pd.DataFrame(results)
        print(df)
        return df

def main():
    """Entry point."""
    print("\n=== Real-Data Vector Database Benchmark ===\n")
    benchmark = RealDataVectorBenchmark(num_articles=100, embedding_dim=384)
    results = benchmark.run_benchmark()
    return results

if __name__ == "__main__":
    main()
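
The benchmark above times each call once with `time.time()`. On platforms where that clock is coarse, a very fast query can be reported as `0.000000` seconds. A minimal sketch of an alternative, assuming a hypothetical helper named `time_call` (not part of the original class), uses `time.perf_counter()` and takes the median over several repeats:

```python
import statistics
import time


def time_call(fn, *args, repeats=5, **kwargs):
    """Call fn(*args, **kwargs) `repeats` times; return (last result, median seconds)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # high-resolution clock intended for interval timing
        result = fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return result, statistics.median(durations)


# Stand-in workload (hypothetical). In the benchmark this could instead be
# vectorstore.similarity_search_by_vector(query_vector, k=3).
result, seconds = time_call(sorted, range(100_000, 0, -1), repeats=5)
print(f"median time: {seconds:.6f} s")
```

Repeating is only appropriate for read-only operations such as queries; repeating an insert would add the same documents several times.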



=== Real-Data Vector Database Benchmark ===



2025-02-13 15:02:02,551 - INFO - Wikipedia: language=en, user_agent: CoolBot/0.0 (https://example.org/coolbot/; coolbot@example.org) generic-library/0.0 (Wikipedia-API/0.8.1; https://github.com/martin-majlis/Wikipedia-API/), extract_format=1
2025-02-13 15:02:02,552 - INFO - Request URL: https://en.wikipedia.org/w/api.php?format=json&redirects=1&action=query&prop=info&titles=Artificial intelligence&inprop=protection|talkid|watched|watchers|visitingwatchers|notificationtimestamp|subjectid|url|readable|preload|displaytitle|varianttitles
2025-02-13 15:02:03,048 - INFO - Request URL: https://en.wikipedia.org/w/api.php?format=json&redirects=1&action=query&prop=extracts&titles=Artificial intelligence&explaintext=1&exsectionformat=wiki
2025-02-13 15:02:03,812 - INFO - Request URL: https://en.wikipedia.org/w/api.php?format=json&redirects=1&action=query&prop=info&titles=Machine learning&inprop=protection|talkid|watched|watchers|visitingwatchers|notificationtimestamp|subjectid|url|readable|preloa

['Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines, particularly computer systems. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals. Such machines may be called AIs.\nHigh-profile applications of AI include advanced web search engines (e.g., Google Search); recommendation systems (used by YouTube, Amazon, and Netflix); virtual assistants (e.g., Google Assistant, Siri, and Alexa); autonomous vehicles (e.g., Waymo); generative and creative tools (e.g., ChatGPT and AI art); and superhuman play and analysis in strategy games (e.g., chess and Go). However, many AI applications are not perceived as AI: "A lot of cutting edge AI has filtered into general applications, often without being called AI because once something becomes useful enough and common en

2025-02-13 15:02:14,035 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-02-13 15:02:14,289 - INFO - Testing Qdrant ...



=== FAISS query results ===
🔍 Query: "Artificial Intelligence"
📌 Top-3 relevant results:
   1. Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines, particularly computer systems. It is a field of research in computer science that develops and studies metho...
   2. Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus...
   3. Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded ...
--------------------------------



2025-02-13 15:02:14,601 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-02-13 15:02:14,955 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-02-13 15:02:15,357 - INFO - Testing Chroma ...



=== Qdrant query results ===
🔍 Query: "Artificial Intelligence"
📌 Top-3 relevant results:
   1. Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines, particularly computer systems. It is a field of research in computer science that develops and studies metho...
   2. Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus...
   3. Natural language processing (NLP) is a subfield of computer science and especially artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded ...
--------------------------------



2025-02-13 15:02:15,743 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"



=== Chroma query results ===
🔍 Query: "Artificial Intelligence"
📌 Top-3 relevant results:
   1. Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines, particularly computer systems. It is a field of research in computer science that develops and studies metho...
   2. Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines, particularly computer systems. It is a field of research in computer science that develops and studies metho...
   3. Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines, particularly computer systems. It is a field of research in computer science that develops and studies metho...
--------------------------------
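
The Chroma output above lists the same passage three times, so the top-3 contains only one distinct document. The sketch below shows one way to make that visible when scoring: it drops exact duplicates before computing recall. The helper names and the shortened sample text are hypothetical, and the matching here is case-insensitive, which differs from the case-sensitive check in `calculate_recall`.

```python
def unique_in_order(texts):
    """Drop exact duplicate texts, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for text in texts:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


def recall_percent(retrieved_texts, expected_phrases):
    """Percentage of expected phrases found (case-insensitively) in any distinct retrieved text."""
    distinct = [text.lower() for text in unique_in_order(retrieved_texts)]
    hits = sum(
        1 for phrase in expected_phrases
        if any(phrase.lower() in text for text in distinct)
    )
    return 100.0 * hits / len(expected_phrases)


# Hypothetical, shortened result texts mimicking three identical hits.
duplicated = ["Artificial intelligence (AI) is intelligence exhibited by machines."] * 3
expected = ["Artificial intelligence", "Machine learning", "Deep learning"]

distinct_count = len(unique_in_order(duplicated))
recall = recall_percent(duplicated, expected)
print(f"distinct results: {distinct_count}, recall: {recall:.2f}%")
```

Reporting the number of distinct results next to recall helps separate a retrieval-quality problem from a data-loading problem such as repeated inserts.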

   Database  Insert Time (s)  Query Time (s)  Memory Usage (MB)  Recall (%)
0     FAISS         0.513410        0.000000         359.613281   66.666667
1    Qdrant         1.059199        0.000000         359.832031   66.666667
2  ChromaDB         0.679660        0.005