# 68. HNSW Performance Comparison: bge-small vs E5-base vs MiniLM

## Purpose
- Compare the HNSW search performance of three embedding models
- Build the HNSW indexes with the DuckDB VSS extension
- Measure search accuracy (Recall@K), search speed, and index build time

## Models Compared
| Model | Dimensions | Library | Notes |
|-------|------------|---------|-------|
| BAAI/bge-small-en-v1.5 | 384 | FastEmbed | Lightweight, English-only |
| intfloat/multilingual-e5-base | 768 | sentence-transformers | Multilingual |
| all-MiniLM-L6-v2 | 384 | FastEmbed | Fastest, English-only |

## 0. Setup

In [1]:
import numpy as np
import time
import duckdb
from pathlib import Path
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

DATA_DIR = Path("../data")
np.random.seed(42)

# Configuration
N_DOCUMENTS = 10000
N_TEST_QUERIES = 100
TOP_K_VALUES = [10, 50, 100]

print(f"Configuration:")
print(f"  Documents: {N_DOCUMENTS}")
print(f"  Test queries: {N_TEST_QUERIES}")
print(f"  Top-K values: {TOP_K_VALUES}")

Configuration:
  Documents: 10000
  Test queries: 100
  Top-K values: [10, 50, 100]


## 1. Data Preparation (English Wikipedia)

In [2]:
from datasets import load_dataset

print("Loading Wikipedia English dataset...")
wiki_en = load_dataset(
    "wikimedia/wikipedia",
    "20231101.en",
    split="train",
    streaming=True
)

documents = []
for i, item in enumerate(wiki_en):
    if len(documents) >= N_DOCUMENTS:
        break
    text = item['text'][:500].strip()
    if len(text) >= 50:
        documents.append(text)

print(f"Collected {len(documents)} documents")
print(f"Sample: {documents[0][:100]}...")

Loading Wikipedia English dataset...



Collected 10000 documents
Sample: Anarchism is a political philosophy and movement that is skeptical of all justifications for authori...
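
The collection loop above truncates each article to its first 500 characters, strips whitespace, and skips anything shorter than 50 characters. A minimal sketch of that filtering logic, run on a small in-memory list instead of the streamed dataset; the helper name `collect_documents` and the fake records are illustrative assumptions, not part of the notebook:

```python
def collect_documents(items, n_documents, max_chars=500, min_chars=50):
    """Collect up to n_documents truncated texts, skipping very short ones."""
    documents = []
    for item in items:
        if len(documents) >= n_documents:
            break
        # Truncate first, then strip, mirroring the notebook's order of operations.
        text = item["text"][:max_chars].strip()
        if len(text) >= min_chars:
            documents.append(text)
    return documents


# Hypothetical stand-in for the streamed Wikipedia records.
fake_stream = [
    {"text": "short"},                   # dropped: fewer than 50 characters
    {"text": "A" * 600},                 # kept, truncated to 500 characters
    {"text": "  " + "B" * 60 + "  "},    # kept, whitespace stripped
    {"text": "C" * 80},                  # never reached: limit already hit
]

docs = collect_documents(fake_stream, n_documents=2)
print(len(docs), [len(d) for d in docs])
```

Because truncation happens before stripping, a document with leading whitespace can end up slightly shorter than 500 characters.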


## 2. Embedding Generation

In [3]:
# Dictionary that collects the results for each model
results = {
    'bge_small': {'name': 'bge-small-en-v1.5', 'dim': 384},
    'e5_base': {'name': 'multilingual-e5-base', 'dim': 768},
    'minilm': {'name': 'all-MiniLM-L6-v2', 'dim': 384},
}

In [4]:
# bge-small-en-v1.5 (FastEmbed)
from fastembed import TextEmbedding

print("="*60)
print("Loading bge-small-en-v1.5 (FastEmbed)...")
start = time.time()
bge_model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
results['bge_small']['load_time'] = time.time() - start
print(f"  Load time: {results['bge_small']['load_time']:.1f}s")

print("Generating embeddings...")
start = time.time()
bge_embeddings = np.array(list(bge_model.embed(documents))).astype(np.float32)
results['bge_small']['embed_time'] = time.time() - start
results['bge_small']['embeddings'] = bge_embeddings
print(f"  Shape: {bge_embeddings.shape}")
print(f"  Time: {results['bge_small']['embed_time']:.1f}s ({results['bge_small']['embed_time']/len(documents)*1000:.1f} ms/doc)")

Loading bge-small-en-v1.5 (FastEmbed)...

2026-02-05 12:32:39.766789679 [W:onnxruntime:Default, device_discovery.cc:164 DiscoverDevicesForPlatform] GPU device discovery failed: device_discovery.cc:89 ReadFileContents Failed to open file: "/sys/class/drm/card0/device/vendor"


  Load time: 0.2s
Generating embeddings...


  Shape: (10000, 384)
  Time: 290.4s (29.0 ms/doc)


In [5]:
# all-MiniLM-L6-v2 (FastEmbed)
print("="*60)
print("Loading all-MiniLM-L6-v2 (FastEmbed)...")
start = time.time()
minilm_model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
results['minilm']['load_time'] = time.time() - start
print(f"  Load time: {results['minilm']['load_time']:.1f}s")

print("Generating embeddings...")
start = time.time()
minilm_embeddings = np.array(list(minilm_model.embed(documents))).astype(np.float32)
results['minilm']['embed_time'] = time.time() - start
results['minilm']['embeddings'] = minilm_embeddings
print(f"  Shape: {minilm_embeddings.shape}")
print(f"  Time: {results['minilm']['embed_time']:.1f}s ({results['minilm']['embed_time']/len(documents)*1000:.1f} ms/doc)")

Loading all-MiniLM-L6-v2 (FastEmbed)...
  Load time: 0.1s
Generating embeddings...


  Shape: (10000, 384)
  Time: 97.9s (9.8 ms/doc)


In [6]:
# multilingual-e5-base (sentence-transformers)
from sentence_transformers import SentenceTransformer

print("="*60)
print("Loading multilingual-e5-base (sentence-transformers)...")
start = time.time()
e5_model = SentenceTransformer("intfloat/multilingual-e5-base", device="cpu")
results['e5_base']['load_time'] = time.time() - start
print(f"  Load time: {results['e5_base']['load_time']:.1f}s")

print("Generating embeddings...")
docs_with_prefix = [f"passage: {d}" for d in documents]
start = time.time()
e5_embeddings = e5_model.encode(docs_with_prefix, show_progress_bar=True, convert_to_numpy=True).astype(np.float32)
results['e5_base']['embed_time'] = time.time() - start
results['e5_base']['embeddings'] = e5_embeddings
print(f"  Shape: {e5_embeddings.shape}")
print(f"  Time: {results['e5_base']['embed_time']:.1f}s ({results['e5_base']['embed_time']/len(documents)*1000:.1f} ms/doc)")

Loading multilingual-e5-base (sentence-transformers)...


  Load time: 4.3s
Generating embeddings...



  Shape: (10000, 768)
  Time: 619.6s (62.0 ms/doc)
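
The E5 models expect a role prefix on every input: the notebook prepends `passage: ` to documents here and `query: ` to the search text in section 6. A small sketch of a helper that keeps the two roles explicit; `add_e5_prefix` is a hypothetical name introduced only for illustration:

```python
def add_e5_prefix(texts, role):
    """Prepend the E5 role prefix ('query' or 'passage') to each text."""
    if role not in ("query", "passage"):
        raise ValueError(f"role must be 'query' or 'passage', got {role!r}")
    return [f"{role}: {text}" for text in texts]


passages = add_e5_prefix(["Anarchism is a political philosophy."], "passage")
queries = add_e5_prefix(["what is anarchism"], "query")
print(passages[0])
print(queries[0])
```

Note that the Recall@K evaluation in section 4 reuses document embeddings as queries, so those queries carry the `passage: ` prefix rather than `query: `.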


In [7]:
# Compare embedding generation times
print("\n" + "="*60)
print("Embedding Generation Time Summary")
print("="*60)
print(f"{'Model':<25} {'Dim':>6} {'Time (s)':>10} {'ms/doc':>10}")
print("-"*55)
for key in ['minilm', 'bge_small', 'e5_base']:
    r = results[key]
    ms_per_doc = r['embed_time'] / len(documents) * 1000
    print(f"{r['name']:<25} {r['dim']:>6} {r['embed_time']:>10.1f} {ms_per_doc:>10.1f}")


Embedding Generation Time Summary
Model                        Dim   Time (s)     ms/doc
-------------------------------------------------------
all-MiniLM-L6-v2             384       97.9        9.8
bge-small-en-v1.5            384      290.4       29.0
multilingual-e5-base         768      619.6       62.0


## 3. Building the DuckDB HNSW Index

In [8]:
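def create_hnsw_index(embeddings, model_key, dim):
    """
    Build an HNSW index in an in-memory DuckDB database.

    Note: by default the VSS extension only creates HNSW indexes in
    in-memory databases. Persisting them to disk requires the experimental
    setting `SET hnsw_enable_experimental_persistence = true`.
    """
    # Use an in-memory database (the default requirement for HNSW indexes)
    conn = duckdb.connect(":memory:")
    
    # Install and load the VSS extension
    conn.execute("INSTALL vss")
    conn.execute("LOAD vss")
    
    # Create the table
    conn.execute(f"""
        CREATE TABLE documents (
            id INTEGER PRIMARY KEY,
            embedding FLOAT[{dim}]
        )
    """)
    
    # Insert the data (row by row; not included in the index build time)
    print(f"  Inserting {len(embeddings)} documents...")
    for i, emb in enumerate(embeddings):
        conn.execute(
            "INSERT INTO documents VALUES (?, ?)",
            [i, emb.tolist()]
        )
    
    # Build the HNSW index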
    print(f"  Building HNSW index...")
    start = time.time()
    conn.execute(f"""
        CREATE INDEX hnsw_idx ON documents 
        USING HNSW (embedding)
        WITH (metric = 'cosine')
    """)
    index_time = time.time() - start
    
    return conn, index_time

In [9]:
# Build an HNSW index for each model
connections = {}

for key in ['bge_small', 'minilm', 'e5_base']:
    r = results[key]
    print(f"\n{'='*60}")
    print(f"Building HNSW index for {r['name']}...")
    
    conn, index_time = create_hnsw_index(
        r['embeddings'], key, r['dim']
    )
    
    connections[key] = conn
    results[key]['index_time'] = index_time
    
    print(f"  Index build time: {index_time:.2f}s")


Building HNSW index for bge-small-en-v1.5...
  Inserting 10000 documents...


  Building HNSW index...


  Index build time: 1.23s

Building HNSW index for all-MiniLM-L6-v2...
  Inserting 10000 documents...


  Building HNSW index...


  Index build time: 1.42s

Building HNSW index for multilingual-e5-base...
  Inserting 10000 documents...


  Building HNSW index...



  Index build time: 2.71s


In [10]:
# Compare index build times
print("\n" + "="*60)
print("HNSW Index Build Time Summary")
print("="*60)
print(f"{'Model':<25} {'Dim':>6} {'Index Time (s)':>15}")
print("-"*50)
for key in ['minilm', 'bge_small', 'e5_base']:
    r = results[key]
    print(f"{r['name']:<25} {r['dim']:>6} {r['index_time']:>15.2f}")


HNSW Index Build Time Summary
Model                        Dim  Index Time (s)
--------------------------------------------------
all-MiniLM-L6-v2             384            1.42
bge-small-en-v1.5            384            1.23
multilingual-e5-base         768            2.71


## 4. Search Accuracy Evaluation (Recall@K)

In [11]:
from sklearn.metrics.pairwise import cosine_similarity

def get_ground_truth(query_emb, all_embeddings, k, exclude_idx=None):
    """
    Get the exact Top-K neighbors by brute-force cosine similarity.
    """
    sims = cosine_similarity(query_emb.reshape(1, -1), all_embeddings)[0]
    if exclude_idx is not None:
        # -inf guarantees the excluded item can never rank in the Top-K
        sims[exclude_idx] = -np.inf
    return set(np.argsort(sims)[-k:])

def hnsw_search(conn, query_emb, k, dim):
    """
    Search with the DuckDB HNSW index.

    Returns the ids ordered from nearest to farthest.
    """
    query_list = query_emb.tolist()
    result = conn.execute(f"""
        SELECT id, array_cosine_distance(embedding, ?::FLOAT[{dim}]) as distance
        FROM documents
        ORDER BY distance
        LIMIT {k}
    """, [query_list]).fetchall()
    return [r[0] for r in result]

def evaluate_recall(conn, embeddings, dim, n_queries, k_values):
    """
    Evaluate Recall@K against the brute-force ground truth.
    """
    np.random.seed(42)
    query_indices = np.random.choice(len(embeddings), n_queries, replace=False)
    
    recall_results = {k: [] for k in k_values}
    search_times = []
    
    for idx in query_indices:
        query_emb = embeddings[idx]
        
        for k in k_values:
            # Ground truth
            gt = get_ground_truth(query_emb, embeddings, k, exclude_idx=idx)
            
            # HNSW search
            start = time.time()
            hnsw_ids = hnsw_search(conn, query_emb, k+1, dim)  # +1 for self
            search_times.append(time.time() - start)
            
            # Drop the query itself if present, then keep the k nearest.
            # Truncating the ordered list (not a set) ensures the farthest
            # result is the one dropped when the query itself is missing.
            hnsw_ids = [i for i in hnsw_ids if i != idx][:k]
            
            # Recall
            recall = len(set(hnsw_ids) & gt) / k
            recall_results[k].append(recall)
    
    return recall_results, np.mean(search_times) * 1000  # ms
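
The ground truth used above is an exhaustive scan: score every document against the query by cosine similarity and keep the K best, excluding the query itself. A dependency-free sketch of the same idea on toy 2-D vectors; the function names are assumptions made for this illustration, and the notebook itself uses `sklearn` and NumPy:

```python
import math


def cosine_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def exact_top_k(query, vectors, k, exclude_idx=None):
    """Return the indices of the k most similar vectors, nearest first."""
    scored = [
        (cosine_sim(query, vec), i)
        for i, vec in enumerate(vectors)
        if i != exclude_idx
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy vectors: index 1 is nearly parallel to index 0, index 3 points the opposite way.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(exact_top_k(vectors[0], vectors, k=2, exclude_idx=0))
```

Skipping the excluded index outright, instead of overwriting its score, avoids having to pick a sentinel value.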

In [12]:
# Evaluate Recall@K for each model
print("Evaluating Recall@K...")

for key in ['bge_small', 'minilm', 'e5_base']:
    r = results[key]
    print(f"\n  {r['name']}...")
    
    recall_results, avg_search_time = evaluate_recall(
        connections[key], r['embeddings'], r['dim'],
        N_TEST_QUERIES, TOP_K_VALUES
    )
    
    results[key]['recall'] = {k: np.mean(v) for k, v in recall_results.items()}
    results[key]['search_time_ms'] = avg_search_time
    
    for k in TOP_K_VALUES:
        print(f"    Recall@{k}: {results[key]['recall'][k]*100:.1f}%")
    print(f"    Avg search time: {avg_search_time:.2f} ms")

Evaluating Recall@K...

  bge-small-en-v1.5...


    Recall@10: 99.8%
    Recall@50: 97.6%
    Recall@100: 97.3%
    Avg search time: 4.50 ms

  all-MiniLM-L6-v2...


    Recall@10: 99.7%
    Recall@50: 97.4%
    Recall@100: 96.2%
    Avg search time: 4.41 ms

  multilingual-e5-base...


    Recall@10: 99.7%
    Recall@50: 96.6%
    Recall@100: 95.5%
    Avg search time: 6.59 ms
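
Recall@K here is the fraction of the exact Top-K neighbors that the HNSW search also returns. The sketch below shows the calculation on plain id lists, including the self-exclusion step: K+1 results are fetched, the query's own id is removed, and the K nearest remaining ids are kept. The function name and the example ids are hypothetical:

```python
def recall_at_k(approx_ranked_ids, exact_ids, k, exclude_id=None):
    """Recall@K of a ranked approximate result against the exact Top-K set.

    approx_ranked_ids must be ordered nearest first so that truncation
    drops the farthest candidates.
    """
    kept = [i for i in approx_ranked_ids if i != exclude_id][:k]
    return len(set(kept) & set(exact_ids)) / k


exact = {3, 4, 5}  # exact Top-3 neighbors of query id 7 (made-up ids)

# Usual case: the query itself comes back first and is removed.
with_self = recall_at_k([7, 3, 9, 4], exact, k=3, exclude_id=7)

# Rare case: the query is missing, so the 4th-ranked id is truncated instead.
without_self = recall_at_k([3, 9, 4, 5], exact, k=3, exclude_id=7)

print(with_self, without_self)
```

In both cases two of the three true neighbors survive, giving a recall of 2/3.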


## 5. Cross-Model Comparison of Search Results

In [13]:
def cross_model_overlap(emb1, emb2, n_queries=100, k=10):
    """
    Fraction of exact (brute-force) Top-K neighbors shared by two models.
    """
    np.random.seed(42)
    query_indices = np.random.choice(len(emb1), n_queries, replace=False)
    
    overlaps = []
    for idx in query_indices:
        gt1 = get_ground_truth(emb1[idx], emb1, k, exclude_idx=idx)
        gt2 = get_ground_truth(emb2[idx], emb2, k, exclude_idx=idx)
        overlap = len(gt1 & gt2) / k
        overlaps.append(overlap)
    
    return np.mean(overlaps), np.std(overlaps)

# Cross-model comparison
print("\n" + "="*60)
print("Cross-Model Top-10 Search Result Overlap")
print("="*60)

model_pairs = [
    ('bge_small', 'e5_base'),
    ('minilm', 'e5_base'),
    ('bge_small', 'minilm'),
]

print(f"\n{'Model Pair':<35} {'Overlap':>12} {'Std':>10}")
print("-"*60)

for key1, key2 in model_pairs:
    overlap, std = cross_model_overlap(
        results[key1]['embeddings'],
        results[key2]['embeddings'],
        n_queries=N_TEST_QUERIES,
        k=10
    )
    name1 = results[key1]['name']
    name2 = results[key2]['name']
    print(f"{name1} vs {name2:<15} {overlap*100:>11.1f}% {std*100:>9.1f}%")


Cross-Model Top-10 Search Result Overlap

Model Pair                               Overlap        Std
------------------------------------------------------------
bge-small-en-v1.5 vs multilingual-e5-base        40.0%      21.7%
all-MiniLM-L6-v2 vs multilingual-e5-base        35.6%      22.0%
bge-small-en-v1.5 vs all-MiniLM-L6-v2        36.2%      19.6%
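
The overlap figure is computed per query as the share of Top-K ids that both models return, then summarized as a mean and a population standard deviation (the default of `np.std`). A minimal sketch with made-up neighbor lists; `topk_overlap` is an illustrative name:

```python
from statistics import mean, pstdev


def topk_overlap(ids_a, ids_b, k):
    """Fraction of the Top-K ids shared by two ranked result lists."""
    return len(set(ids_a[:k]) & set(ids_b[:k])) / k


# Hypothetical Top-4 neighbors of two queries under model A and model B.
per_query = [
    topk_overlap([1, 2, 3, 4], [3, 4, 5, 6], k=4),  # 2 of 4 shared
    topk_overlap([1, 2, 3, 4], [1, 2, 3, 9], k=4),  # 3 of 4 shared
]

print(f"overlap = {mean(per_query):.3f}, std = {pstdev(per_query):.3f}")
```

A standard deviation of roughly 20 points, as in the table above, means the agreement between two models varies widely from query to query.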


## 6. Single-Query Latency Measurement

In [14]:
def measure_single_query_latency(model_key, query_text, n_runs=10):
    """
    Measure the latency of embedding a single query plus the HNSW search.
    """
    r = results[model_key]
    conn = connections[model_key]
    dim = r['dim']
    
    embed_times = []
    search_times = []
    
    for _ in range(n_runs):
        # Embedding generation
        start = time.time()
        if model_key == 'e5_base':
            query_emb = e5_model.encode([f"query: {query_text}"], convert_to_numpy=True)[0].astype(np.float32)
        elif model_key == 'bge_small':
            query_emb = np.array(list(bge_model.embed([query_text])))[0].astype(np.float32)
        else:  # minilm
            query_emb = np.array(list(minilm_model.embed([query_text])))[0].astype(np.float32)
        embed_times.append(time.time() - start)
        
        # HNSW search
        start = time.time()
        _ = hnsw_search(conn, query_emb, 10, dim)
        search_times.append(time.time() - start)
    
    return {
        'embed_ms': np.mean(embed_times) * 1000,
        'search_ms': np.mean(search_times) * 1000,
        'total_ms': (np.mean(embed_times) + np.mean(search_times)) * 1000,
    }

# Test query
test_query = "machine learning algorithms for natural language processing"

print("\n" + "="*60)
print("Single Query Latency (embed + search)")
print(f"Query: '{test_query}'")
print("="*60)

print(f"\n{'Model':<25} {'Embed (ms)':>12} {'Search (ms)':>12} {'Total (ms)':>12}")
print("-"*65)

for key in ['minilm', 'bge_small', 'e5_base']:
    latency = measure_single_query_latency(key, test_query)
    results[key]['latency'] = latency
    print(f"{results[key]['name']:<25} {latency['embed_ms']:>12.1f} {latency['search_ms']:>12.1f} {latency['total_ms']:>12.1f}")


Single Query Latency (embed + search)
Query: 'machine learning algorithms for natural language processing'

Model                       Embed (ms)  Search (ms)   Total (ms)
-----------------------------------------------------------------
all-MiniLM-L6-v2                   3.5          6.7         10.3
bge-small-en-v1.5                  3.6          4.2          7.8
multilingual-e5-base              31.8          4.9         36.7
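
The measurement above averages 10 runs with `time.time()` and no warm-up, so the first model measured can absorb one-off costs. One possible refinement, sketched under the assumption that a warm-up phase and a median are acceptable for this comparison, is shown below; `time_call` is a hypothetical helper and the timed lambda is only a placeholder workload:

```python
import statistics
import time


def time_call(fn, n_warmup=3, n_runs=30):
    """Time fn() after a warm-up phase and return summary statistics in ms."""
    for _ in range(n_warmup):
        fn()  # warm caches and lazy initialization before measuring
    samples_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000)
    return {
        "median_ms": statistics.median(samples_ms),
        "mean_ms": statistics.fmean(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
    }


# Placeholder workload; in the notebook this would wrap the embed or search call.
stats = time_call(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

Reporting the median alongside the mean makes a single slow run less likely to change the ranking between models.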


## 7. Overall Comparison Summary

In [15]:
print("="*80)
print("HNSW Performance Comparison Summary")
print("="*80)

print("\n【Model Specifications】")
print(f"{'Model':<25} {'Dim':>6} {'Library':>25}")
print("-"*60)
for key in ['minilm', 'bge_small', 'e5_base']:
    r = results[key]
    lib = 'FastEmbed (ONNX)' if key != 'e5_base' else 'sentence-transformers'
    print(f"{r['name']:<25} {r['dim']:>6} {lib:>25}")

print("\n【Embedding Performance (10,000 docs)】")
print(f"{'Model':<25} {'Time (s)':>10} {'ms/doc':>10}")
print("-"*50)
for key in ['minilm', 'bge_small', 'e5_base']:
    r = results[key]
    ms = r['embed_time'] / len(documents) * 1000
    print(f"{r['name']:<25} {r['embed_time']:>10.1f} {ms:>10.1f}")

print("\n【HNSW Index Build Time】")
print(f"{'Model':<25} {'Time (s)':>10}")
print("-"*40)
for key in ['minilm', 'bge_small', 'e5_base']:
    r = results[key]
    print(f"{r['name']:<25} {r['index_time']:>10.2f}")

print("\n【HNSW Search Recall@K】")
print(f"{'Model':<25}", end='')
for k in TOP_K_VALUES:
    print(f" {'R@'+str(k):>8}", end='')
print()
print("-"*55)
for key in ['minilm', 'bge_small', 'e5_base']:
    r = results[key]
    print(f"{r['name']:<25}", end='')
    for k in TOP_K_VALUES:
        print(f" {r['recall'][k]*100:>7.1f}%", end='')
    print()

print("\n【Single Query Latency】")
print(f"{'Model':<25} {'Embed':>10} {'Search':>10} {'Total':>10}")
print("-"*60)
for key in ['minilm', 'bge_small', 'e5_base']:
    r = results[key]
    lat = r['latency']
    print(f"{r['name']:<25} {lat['embed_ms']:>9.1f}ms {lat['search_ms']:>9.1f}ms {lat['total_ms']:>9.1f}ms")

HNSW Performance Comparison Summary

【Model Specifications】
Model                        Dim                   Library
------------------------------------------------------------
all-MiniLM-L6-v2             384          FastEmbed (ONNX)
bge-small-en-v1.5            384          FastEmbed (ONNX)
multilingual-e5-base         768     sentence-transformers

【Embedding Performance (10,000 docs)】
Model                       Time (s)     ms/doc
--------------------------------------------------
all-MiniLM-L6-v2                97.9        9.8
bge-small-en-v1.5              290.4       29.0
multilingual-e5-base           619.6       62.0

【HNSW Index Build Time】
Model                       Time (s)
----------------------------------------
all-MiniLM-L6-v2                1.42
bge-small-en-v1.5               1.23
multilingual-e5-base            2.71

【HNSW Search Recall@K】
Model                         R@10     R@50    R@100
-------------------------------------------------------
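all-MiniLM-L6-v2             99.7%    97.4%    96.2%
bge-small-en-v1.5            99.8%    97.6%    97.3%
multilingual-e5-base         99.7%    96.6%    95.5%

【Single Query Latency】
Model                          Embed     Search      Total
------------------------------------------------------------
all-MiniLM-L6-v2                3.5ms       6.7ms      10.3ms
bge-small-en-v1.5               3.6ms       4.2ms       7.8ms
multilingual-e5-base           31.8ms       4.9ms      36.7ms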

In [16]:
# Cleanup
for conn in connections.values():
    conn.close()
print("\nDatabase connections closed.")


Database connections closed.


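## 8. Results Summary

### Key Findings

#### 1. Embedding Generation Speed (CPU)
| Model | ms/doc | Speedup vs E5-base |
|-------|--------|--------------------|
| all-MiniLM-L6-v2 | 9.8 | 6.3x (fastest) |
| bge-small-en-v1.5 | 29.0 | 2.1x |
| multilingual-e5-base | 62.0 | 1.0x (baseline) |

- **MiniLM** is the fastest, about 6.3x faster than E5-base. E5-base runs through sentence-transformers while the other two run through FastEmbed (ONNX), so this gap reflects the runtime as well as the model size.
- **bge-small** takes about 3x as long as MiniLM, even though both produce 384-dimensional vectors through FastEmbed.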
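
The ratios in the table follow directly from the measured ms/doc values. A short worked calculation, using only the numbers reported above:

```python
# Measured embedding cost per document (ms), taken from the summary table.
ms_per_doc = {
    "all-MiniLM-L6-v2": 9.8,
    "bge-small-en-v1.5": 29.0,
    "multilingual-e5-base": 62.0,
}

# Speedup relative to the slowest model (E5-base).
baseline = ms_per_doc["multilingual-e5-base"]
speedup = {name: round(baseline / ms, 1) for name, ms in ms_per_doc.items()}

# How much longer bge-small takes than MiniLM.
bge_vs_minilm = round(ms_per_doc["bge-small-en-v1.5"] / ms_per_doc["all-MiniLM-L6-v2"], 1)

print(speedup)
print(bge_vs_minilm)
```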

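#### 2. HNSW Search Accuracy (Recall@K)
| Model | R@10 | R@50 | R@100 |
|-------|------|------|-------|
| bge-small-en-v1.5 | **99.8%** | **97.6%** | **97.3%** |
| all-MiniLM-L6-v2 | 99.7% | 97.4% | 96.2% |
| multilingual-e5-base | 99.7% | 96.6% | 95.5% |

- All models reach an R@10 above 99%.
- **bge-small** has the highest recall at every K, but the margins are small: 0.1 points at K=10 and at most 1.8 points at K=100, measured over 100 queries.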

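#### 3. Single-Query Latency (Embedding + HNSW Search)
| Model | Embed | Search | Total |
|-------|-------|--------|-------|
| bge-small-en-v1.5 | 3.6ms | 4.2ms | **7.8ms** |
| all-MiniLM-L6-v2 | 3.5ms | 6.7ms | 10.3ms |
| multilingual-e5-base | 31.8ms | 4.9ms | 36.7ms |

- **bge-small** has the lowest measured total latency (7.8ms).
- MiniLM embeds just as quickly, but its measured HNSW search time is higher (6.7ms). Both indexes are 384-dimensional, and in the Recall@K evaluation MiniLM averaged 4.41 ms against 4.50 ms for bge-small, so this gap likely reflects warm-up or noise in the 10-run measurement (MiniLM was measured first), not a real difference in search speed.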

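#### 4. Cross-Model Search Result Overlap
| Pair | Overlap |
|------|---------|
| bge-small vs E5-base | 40.0% |
| MiniLM vs E5-base | 35.6% |
| bge-small vs MiniLM | 36.2% |

- The exact Top-10 results of different models agree on only about 35-40% of documents.
- The choice of model therefore changes the search results substantially.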

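### Conclusion

| Criterion | Recommended model |
|-----------|-------------------|
| **Overall balance (English)** | **bge-small-en-v1.5** |
| Fastest embedding | all-MiniLM-L6-v2 |
| Lowest latency | bge-small-en-v1.5 |
| Multilingual support | multilingual-e5-base |
| ITQ LSH compatibility | bge-small-en-v1.5 (correlation -0.61, from Experiment 67) |

**bge-small-en-v1.5** offers:
- The lowest single-query latency (7.8ms)
- High HNSW recall (R@10 of 99.8%)
- Good ITQ LSH correlation (-0.61, Experiment 67)
- FastEmbed support (no PyTorch required)

For English-only, speed-sensitive use cases, **bge-small-en-v1.5** is the best choice in this experiment.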