# 66. all-MiniLM-L6-v2 英語限定評価

## 目的
- all-MiniLM-L6-v2 (FastEmbed ONNX) と E5-base の性能比較
- 英語データでの検索品質（Recall@10）評価
- CPU速度比較
- all-MiniLM-L6-v2 用のITQ LSH・Pivot学習データ作成

## 比較対象
| モデル | ライブラリ | 次元数 | 特徴 |
|--------|------------|--------|------|
| all-MiniLM-L6-v2 | FastEmbed (ONNX) | 384 | 高速、英語専用 |
| multilingual-e5-base | sentence-transformers | 768 | 高品質、多言語対応 |

## 0. セットアップ

In [1]:
import numpy as np
import time
from pathlib import Path
import pickle
import warnings
warnings.filterwarnings('ignore')

import sys
sys.path.append('..')
from src.itq_lsh import ITQLSH, hamming_distance_batch

DATA_DIR = Path("../data")
np.random.seed(42)

# 設定
N_DOCUMENTS = 10000
N_BITS = 128
N_PIVOTS = 8

print(f"Documents: {N_DOCUMENTS}")
print(f"ITQ bits: {N_BITS}")
print(f"Pivots: {N_PIVOTS}")

Documents: 10000
ITQ bits: 128
Pivots: 8


## 1. 英語データ準備（Wikipedia English）

In [2]:
from datasets import load_dataset

print("Loading Wikipedia English dataset (streaming)...")
wiki_en = load_dataset(
    "wikimedia/wikipedia",
    "20231101.en",
    split="train",
    streaming=True
)

# 10,000件のドキュメントを収集
documents = []
for i, item in enumerate(wiki_en):
    if len(documents) >= N_DOCUMENTS:
        break
    text = item['text'][:500].strip()
    if len(text) >= 50:  # 短すぎるものは除外
        documents.append(text)

print(f"Collected {len(documents)} documents")
print(f"Sample: {documents[0][:200]}...")

Loading Wikipedia English dataset (streaming)...


Resolving data files:   0%|          | 0/41 [00:00<?, ?it/s]

Collected 10000 documents
Sample: Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy, typi...


## 2. モデルのロードと埋め込み生成

In [3]:
# FastEmbed: all-MiniLM-L6-v2
from fastembed import TextEmbedding

print("Loading all-MiniLM-L6-v2 (FastEmbed)...")
start = time.time()
minilm_model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
minilm_load_time = time.time() - start
print(f"  Loaded in {minilm_load_time:.1f}s")

# 埋め込み生成
print("\nGenerating embeddings with all-MiniLM-L6-v2...")
start = time.time()
minilm_embeddings = np.array(list(minilm_model.embed(documents)))
minilm_embed_time = time.time() - start
print(f"  Shape: {minilm_embeddings.shape}")
print(f"  Time: {minilm_embed_time:.1f}s ({minilm_embed_time/len(documents)*1000:.1f} ms/doc)")

Loading all-MiniLM-L6-v2 (FastEmbed)...


[0;93m2026-02-04 16:18:46.138738459 [W:onnxruntime:Default, device_discovery.cc:164 DiscoverDevicesForPlatform] GPU device discovery failed: device_discovery.cc:89 ReadFileContents Failed to open file: "/sys/class/drm/card0/device/vendor"[m


  Loaded in 0.2s

Generating embeddings with all-MiniLM-L6-v2...


  Shape: (10000, 384)
  Time: 101.7s (10.2 ms/doc)


In [4]:
# Sentence-Transformers: E5-base
from sentence_transformers import SentenceTransformer

print("Loading multilingual-e5-base (sentence-transformers)...")
start = time.time()
e5_model = SentenceTransformer("intfloat/multilingual-e5-base", device="cpu")
e5_load_time = time.time() - start
print(f"  Loaded in {e5_load_time:.1f}s")

# 埋め込み生成（E5はpassage:プレフィックス必要）
print("\nGenerating embeddings with E5-base...")
docs_with_prefix = [f"passage: {d}" for d in documents]
start = time.time()
e5_embeddings = e5_model.encode(docs_with_prefix, show_progress_bar=True, convert_to_numpy=True)
e5_embed_time = time.time() - start
print(f"  Shape: {e5_embeddings.shape}")
print(f"  Time: {e5_embed_time:.1f}s ({e5_embed_time/len(documents)*1000:.1f} ms/doc)")

Loading multilingual-e5-base (sentence-transformers)...


  Loaded in 4.2s

Generating embeddings with E5-base...


Batches:   0%|          | 0/313 [00:00<?, ?it/s]

  Shape: (10000, 768)
  Time: 624.1s (62.4 ms/doc)


In [5]:
# 速度比較サマリー
print("\n" + "="*60)
print("CPU Speed Comparison (10,000 documents)")
print("="*60)
print(f"{'Model':<30} {'Dim':>6} {'Time (s)':>10} {'ms/doc':>10} {'Speedup':>10}")
print("-"*70)
print(f"{'all-MiniLM-L6-v2 (FastEmbed)':<30} {384:>6} {minilm_embed_time:>10.1f} {minilm_embed_time/len(documents)*1000:>10.1f} {e5_embed_time/minilm_embed_time:>9.1f}x")
print(f"{'multilingual-e5-base (ST)':<30} {768:>6} {e5_embed_time:>10.1f} {e5_embed_time/len(documents)*1000:>10.1f} {'1.0':>10}x")


CPU Speed Comparison (10,000 documents)
Model                             Dim   Time (s)     ms/doc    Speedup
----------------------------------------------------------------------
all-MiniLM-L6-v2 (FastEmbed)      384      101.7       10.2       6.1x
multilingual-e5-base (ST)         768      624.1       62.4        1.0x


## 3. 検索品質比較（Recall@10）

In [6]:
from sklearn.metrics.pairwise import cosine_similarity

def compute_recall_at_k(query_embeddings, doc_embeddings, k=10, n_queries=100):
    """
    ランダムなクエリでRecall@kを計算
    Ground truth: 同じモデルでのコサイン類似度Top-k
    """
    np.random.seed(42)
    query_indices = np.random.choice(len(doc_embeddings), n_queries, replace=False)
    
    recalls = []
    for idx in query_indices:
        query = doc_embeddings[idx:idx+1]
        
        # コサイン類似度で検索
        similarities = cosine_similarity(query, doc_embeddings)[0]
        similarities[idx] = -1  # 自分自身を除外
        
        top_k = np.argsort(similarities)[-k:][::-1]
        recalls.append(1.0)  # 同じモデルなので常に100%
    
    return np.mean(recalls)

# クロスモデル比較: MiniLMのクエリでE5の結果と比較
def cross_model_similarity(minilm_emb, e5_emb, n_queries=100, k=10):
    """
    MiniLMとE5で同じドキュメントの検索結果がどれだけ一致するか
    """
    np.random.seed(42)
    query_indices = np.random.choice(len(minilm_emb), n_queries, replace=False)
    
    overlaps = []
    for idx in query_indices:
        # MiniLMでの検索結果
        minilm_sim = cosine_similarity(minilm_emb[idx:idx+1], minilm_emb)[0]
        minilm_sim[idx] = -1
        minilm_top_k = set(np.argsort(minilm_sim)[-k:])
        
        # E5での検索結果
        e5_sim = cosine_similarity(e5_emb[idx:idx+1], e5_emb)[0]
        e5_sim[idx] = -1
        e5_top_k = set(np.argsort(e5_sim)[-k:])
        
        # 重複率
        overlap = len(minilm_top_k & e5_top_k) / k
        overlaps.append(overlap)
    
    return np.mean(overlaps), np.std(overlaps)

print("Computing cross-model similarity...")
overlap_mean, overlap_std = cross_model_similarity(minilm_embeddings, e5_embeddings, n_queries=200, k=10)
print(f"\nTop-10 Overlap between MiniLM and E5: {overlap_mean*100:.1f}% ± {overlap_std*100:.1f}%")

Computing cross-model similarity...



Top-10 Overlap between MiniLM and E5: 37.9% ± 21.9%


In [7]:
# 複数のkで比較
print("\nTop-K Overlap Analysis:")
print(f"{'K':>5} {'Overlap':>12} {'Std':>10}")
print("-"*30)
for k in [5, 10, 20, 50, 100]:
    overlap_mean, overlap_std = cross_model_similarity(minilm_embeddings, e5_embeddings, n_queries=200, k=k)
    print(f"{k:>5} {overlap_mean*100:>11.1f}% {overlap_std*100:>9.1f}%")


Top-K Overlap Analysis:
    K      Overlap        Std
------------------------------


    5        37.0%      24.6%


   10        37.9%      21.9%


   20        39.2%      19.2%


   50        38.9%      14.9%


  100        39.6%      13.5%


## 4. all-MiniLM-L6-v2 用 ITQ学習と保存

In [8]:
# ITQ学習
print(f"Training ITQ with {N_BITS} bits...")
itq = ITQLSH(n_bits=N_BITS, n_iterations=50)
itq.fit(minilm_embeddings)

# ハッシュ生成
minilm_hashes = itq.transform(minilm_embeddings)
print(f"Hash shape: {minilm_hashes.shape}")

# 保存
itq.save(DATA_DIR / "itq_minilm_128bits.pkl")
np.save(DATA_DIR / "10k_minilm_hashes_128bits.npy", minilm_hashes)
print(f"Saved: itq_minilm_128bits.pkl, 10k_minilm_hashes_128bits.npy")

Training ITQ with 128 bits...
ITQ学習開始: samples=10000, dim=384, bits=128
  Centering完了: mean_norm=0.1440
  PCA完了: explained_variance=79.61%
  ITQ iteration 10: quantization_error=0.8786


  ITQ iteration 20: quantization_error=0.8779
  ITQ iteration 30: quantization_error=0.8775


  ITQ iteration 40: quantization_error=0.8774
  ITQ iteration 50: quantization_error=0.8773
ITQ学習完了
Hash shape: (10000, 128)
Saved: itq_minilm_128bits.pkl, 10k_minilm_hashes_128bits.npy


In [9]:
# ITQ品質評価：ハミング距離とコサイン類似度の相関
from scipy.stats import spearmanr

np.random.seed(42)
sample_indices = np.random.choice(len(minilm_embeddings), 500, replace=False)
sample_emb = minilm_embeddings[sample_indices]
sample_hashes = minilm_hashes[sample_indices]

# コサイン類似度行列
cos_sim_matrix = cosine_similarity(sample_emb)

# ハミング距離行列
hamming_matrix = np.zeros((len(sample_hashes), len(sample_hashes)))
for i in range(len(sample_hashes)):
    hamming_matrix[i] = hamming_distance_batch(sample_hashes[i:i+1], sample_hashes)[0]

# 上三角部分のみ取得（対角除く）
upper_indices = np.triu_indices(len(sample_hashes), k=1)
cos_values = cos_sim_matrix[upper_indices]
hamming_values = hamming_matrix[upper_indices]

# 相関係数
correlation, p_value = spearmanr(cos_values, hamming_values)
print(f"\nSpearman correlation (cosine vs hamming): {correlation:.4f}")
print(f"(Negative correlation expected: lower hamming = higher similarity)")


Spearman correlation (cosine vs hamming): -0.0225
(Negative correlation expected: lower hamming = higher similarity)


## 5. Pivot選択と保存

In [10]:
def select_pivots_furthest_first(embeddings, n_pivots):
    """
    Furthest First法でピボットを選択
    """
    n = len(embeddings)
    pivot_indices = []
    
    # 最初のピボットはランダム
    np.random.seed(42)
    first_pivot = np.random.randint(n)
    pivot_indices.append(first_pivot)
    
    # 各点から最も近いピボットまでの距離
    min_distances = np.full(n, np.inf)
    
    for _ in range(n_pivots - 1):
        # 最新ピボットからの距離を計算
        last_pivot = pivot_indices[-1]
        distances = 1 - cosine_similarity(embeddings, embeddings[last_pivot:last_pivot+1]).flatten()
        
        # 最小距離を更新
        min_distances = np.minimum(min_distances, distances)
        
        # 既存ピボットを除外
        min_distances[pivot_indices] = -1
        
        # 最も遠い点を次のピボットに
        next_pivot = np.argmax(min_distances)
        pivot_indices.append(next_pivot)
    
    return np.array(pivot_indices)

# ピボット選択
print(f"Selecting {N_PIVOTS} pivots using Furthest First...")
pivot_indices = select_pivots_furthest_first(minilm_embeddings, N_PIVOTS)
pivots = minilm_embeddings[pivot_indices]
print(f"Pivot indices: {pivot_indices}")

# 全ドキュメントとピボット間の距離
pivot_distances = 1 - cosine_similarity(minilm_embeddings, pivots)
print(f"Pivot distances shape: {pivot_distances.shape}")

# 保存
np.save(DATA_DIR / "pivots_8_minilm.npy", pivots)
np.save(DATA_DIR / "10k_minilm_pivot_distances.npy", pivot_distances)
np.save(DATA_DIR / "10k_minilm_embeddings.npy", minilm_embeddings)
print(f"Saved: pivots_8_minilm.npy, 10k_minilm_pivot_distances.npy, 10k_minilm_embeddings.npy")

Selecting 8 pivots using Furthest First...


Pivot indices: [7270 3722 2751 7590 4513 4020 5414 7975]
Pivot distances shape: (10000, 8)
Saved: pivots_8_minilm.npy, 10k_minilm_pivot_distances.npy, 10k_minilm_embeddings.npy


## 6. ITQ LSH + Pivot 評価

In [11]:
def evaluate_itq_lsh(query_idx, embeddings, hashes, k=10, candidates_list=[100, 500, 1000]):
    """
    ITQ LSHでの検索評価
    """
    query_emb = embeddings[query_idx:query_idx+1]
    query_hash = hashes[query_idx:query_idx+1]
    
    # Ground truth: コサイン類似度Top-k
    cos_sim = cosine_similarity(query_emb, embeddings)[0]
    cos_sim[query_idx] = -1
    ground_truth = set(np.argsort(cos_sim)[-k:])
    
    results = {}
    for n_candidates in candidates_list:
        # ハミング距離で候補取得
        hamming_dists = hamming_distance_batch(query_hash, hashes)
        hamming_dists = hamming_dists.astype(float)  # int -> float for inf assignment
        hamming_dists[query_idx] = np.inf
        candidates = np.argsort(hamming_dists)[:n_candidates]
        
        # 候補内でコサイン類似度再ランク
        candidate_sims = cosine_similarity(query_emb, embeddings[candidates])[0]
        top_k_in_candidates = candidates[np.argsort(candidate_sims)[-k:]]
        
        recall = len(set(top_k_in_candidates) & ground_truth) / k
        results[n_candidates] = recall
    
    return results

def evaluate_pivot_filter(query_idx, embeddings, pivot_distances, threshold, k=10):
    """
    Pivotフィルタリングの評価
    """
    query_emb = embeddings[query_idx:query_idx+1]
    query_pivot_dist = pivot_distances[query_idx]
    
    # Ground truth
    cos_sim = cosine_similarity(query_emb, embeddings)[0]
    cos_sim[query_idx] = -1
    ground_truth = set(np.argsort(cos_sim)[-k:])
    
    # Pivotフィルタ: |d(q,p) - d(x,p)| < threshold for all pivots
    dist_diff = np.abs(pivot_distances - query_pivot_dist)
    max_diff = np.max(dist_diff, axis=1)
    candidates_mask = max_diff < threshold
    candidates_mask[query_idx] = False
    
    n_candidates = np.sum(candidates_mask)
    reduction_rate = 1 - n_candidates / (len(embeddings) - 1)
    
    # フィルタ後の候補でRecall計算
    if n_candidates > 0:
        candidate_indices = np.where(candidates_mask)[0]
        filter_recall = len(set(candidate_indices) & ground_truth) / k
    else:
        filter_recall = 0.0
    
    return filter_recall, reduction_rate, n_candidates

# 評価実行
np.random.seed(42)
test_queries = np.random.choice(len(minilm_embeddings), 100, replace=False)

print("Evaluating ITQ LSH...")
itq_results = {100: [], 500: [], 1000: []}
for idx in test_queries:
    results = evaluate_itq_lsh(idx, minilm_embeddings, minilm_hashes)
    for n_cand, recall in results.items():
        itq_results[n_cand].append(recall)

print("\nITQ LSH Recall@10:")
print(f"{'Candidates':>12} {'Recall@10':>12}")
print("-"*26)
for n_cand in [100, 500, 1000]:
    mean_recall = np.mean(itq_results[n_cand])
    print(f"{n_cand:>12} {mean_recall*100:>11.1f}%")

Evaluating ITQ LSH...



ITQ LSH Recall@10:
  Candidates    Recall@10
--------------------------
         100        82.6%
         500        96.9%
        1000        98.9%


In [12]:
# Pivotフィルタ評価
print("\nEvaluating Pivot Filter...")
thresholds = [0.10, 0.15, 0.20, 0.25, 0.30]

print(f"\n{'Threshold':>10} {'Filter Recall':>14} {'Reduction':>12} {'Avg Candidates':>16}")
print("-"*55)

for threshold in thresholds:
    filter_recalls = []
    reduction_rates = []
    n_candidates_list = []
    
    for idx in test_queries:
        fr, rr, nc = evaluate_pivot_filter(idx, minilm_embeddings, pivot_distances, threshold)
        filter_recalls.append(fr)
        reduction_rates.append(rr)
        n_candidates_list.append(nc)
    
    mean_fr = np.mean(filter_recalls)
    mean_rr = np.mean(reduction_rates)
    mean_nc = np.mean(n_candidates_list)
    print(f"{threshold:>10.2f} {mean_fr*100:>13.1f}% {mean_rr*100:>11.1f}% {mean_nc:>16.0f}")


Evaluating Pivot Filter...

 Threshold  Filter Recall    Reduction   Avg Candidates
-------------------------------------------------------


      0.10          22.4%        96.9%              310


      0.15          64.7%        79.6%             2041


      0.20          89.8%        52.3%             4774


      0.25          96.7%        29.2%             7077


      0.30          99.0%        14.6%             8539


## 7. E5-base との比較（同じ評価）

In [13]:
# E5-base用のITQ学習
print("Training ITQ for E5-base...")
itq_e5 = ITQLSH(n_bits=N_BITS, n_iterations=50)
itq_e5.fit(e5_embeddings)
e5_hashes = itq_e5.transform(e5_embeddings)

# E5-base用のPivot
print("Selecting pivots for E5-base...")
e5_pivot_indices = select_pivots_furthest_first(e5_embeddings, N_PIVOTS)
e5_pivots = e5_embeddings[e5_pivot_indices]
e5_pivot_distances = 1 - cosine_similarity(e5_embeddings, e5_pivots)

# ITQ評価
print("\nEvaluating ITQ LSH for E5-base...")
e5_itq_results = {100: [], 500: [], 1000: []}
for idx in test_queries:
    results = evaluate_itq_lsh(idx, e5_embeddings, e5_hashes)
    for n_cand, recall in results.items():
        e5_itq_results[n_cand].append(recall)

print("\nITQ LSH Recall@10 (E5-base):")
for n_cand in [100, 500, 1000]:
    mean_recall = np.mean(e5_itq_results[n_cand])
    print(f"  Candidates={n_cand}: {mean_recall*100:.1f}%")

Training ITQ for E5-base...
ITQ学習開始: samples=10000, dim=768, bits=128
  Centering完了: mean_norm=0.8404
  PCA完了: explained_variance=61.78%


  ITQ iteration 10: quantization_error=0.9399
  ITQ iteration 20: quantization_error=0.9396


  ITQ iteration 30: quantization_error=0.9394
  ITQ iteration 40: quantization_error=0.9393


  ITQ iteration 50: quantization_error=0.9392
ITQ学習完了
Selecting pivots for E5-base...

Evaluating ITQ LSH for E5-base...



ITQ LSH Recall@10 (E5-base):
  Candidates=100: 82.3%
  Candidates=500: 96.7%
  Candidates=1000: 98.8%


In [14]:
# Pivotフィルタ評価（E5-base）
print("\nPivot Filter Evaluation (E5-base):")
print(f"{'Threshold':>10} {'Filter Recall':>14} {'Reduction':>12}")
print("-"*40)

for threshold in [0.15, 0.20, 0.25]:
    filter_recalls = []
    reduction_rates = []
    
    for idx in test_queries:
        fr, rr, nc = evaluate_pivot_filter(idx, e5_embeddings, e5_pivot_distances, threshold)
        filter_recalls.append(fr)
        reduction_rates.append(rr)
    
    print(f"{threshold:>10.2f} {np.mean(filter_recalls)*100:>13.1f}% {np.mean(reduction_rates)*100:>11.1f}%")


Pivot Filter Evaluation (E5-base):
 Threshold  Filter Recall    Reduction
----------------------------------------


      0.15         100.0%         0.4%


      0.20         100.0%         0.2%


      0.25         100.0%         0.1%


## 8. 総合比較サマリー

In [15]:
print("="*70)
print("Final Comparison: all-MiniLM-L6-v2 vs E5-base")
print("="*70)

print("\n【CPU Performance (10,000 English documents)】")
print(f"  all-MiniLM-L6-v2 (FastEmbed): {minilm_embed_time:.1f}s ({minilm_embed_time/len(documents)*1000:.1f} ms/doc)")
print(f"  E5-base (sentence-transformers): {e5_embed_time:.1f}s ({e5_embed_time/len(documents)*1000:.1f} ms/doc)")
print(f"  Speedup: {e5_embed_time/minilm_embed_time:.1f}x")

print("\n【Search Quality (Top-10 Overlap)】")
overlap_mean, _ = cross_model_similarity(minilm_embeddings, e5_embeddings, n_queries=200, k=10)
print(f"  MiniLM vs E5 Top-10 overlap: {overlap_mean*100:.1f}%")

print("\n【ITQ LSH Recall@10 (candidates=500)】")
print(f"  all-MiniLM-L6-v2: {np.mean(itq_results[500])*100:.1f}%")
print(f"  E5-base: {np.mean(e5_itq_results[500])*100:.1f}%")

print("\n【Saved Files for all-MiniLM-L6-v2】")
print(f"  - itq_minilm_128bits.pkl")
print(f"  - 10k_minilm_hashes_128bits.npy")
print(f"  - pivots_8_minilm.npy")
print(f"  - 10k_minilm_pivot_distances.npy")
print(f"  - 10k_minilm_embeddings.npy")

Final Comparison: all-MiniLM-L6-v2 vs E5-base

【CPU Performance (10,000 English documents)】
  all-MiniLM-L6-v2 (FastEmbed): 101.7s (10.2 ms/doc)
  E5-base (sentence-transformers): 624.1s (62.4 ms/doc)
  Speedup: 6.1x

【Search Quality (Top-10 Overlap)】


  MiniLM vs E5 Top-10 overlap: 37.9%

【ITQ LSH Recall@10 (candidates=500)】
  all-MiniLM-L6-v2: 96.9%
  E5-base: 96.7%

【Saved Files for all-MiniLM-L6-v2】
  - itq_minilm_128bits.pkl
  - 10k_minilm_hashes_128bits.npy
  - pivots_8_minilm.npy
  - 10k_minilm_pivot_distances.npy
  - 10k_minilm_embeddings.npy


---

# 実験66 結果サマリー

## CPU性能比較（10,000件英語Wikipedia）

| モデル | 次元 | 処理時間 | ms/doc | Speedup |
|--------|------|----------|--------|---------|
| **all-MiniLM-L6-v2 (FastEmbed)** | 384 | 101.7s | **10.2** | **6.1x** |
| multilingual-e5-base (ST) | 768 | 624.1s | 62.4 | 1.0x |

## 検索品質比較

### Top-K オーバーラップ（MiniLM vs E5-base）
| K | オーバーラップ |
|---|---------------|
| 5 | 37.0% |
| 10 | 37.9% |
| 20 | 39.2% |
| 100 | 39.6% |

**重要な発見**: 両モデルの検索結果は**約38%しか一致しない**。これはモデル間で意味理解が大きく異なることを示す。

## ITQ LSH 評価

### Recall@10
| 候補数 | MiniLM | E5-base |
|--------|--------|---------|
| 100 | 82.6% | 82.3% |
| 500 | 96.9% | 96.7% |
| 1000 | 98.9% | 98.8% |

両モデルとも同等のITQ LSH性能を発揮。

### ハミング距離相関
| モデル | Spearman相関 |
|--------|-------------|
| MiniLM | **-0.0225** (非常に弱い) |
| E5-base (参考: 実験61-63より) | -0.65〜-0.75 |

**注意**: MiniLMのITQハッシュはコサイン類似度との相関が非常に弱い。PCA説明分散は79.6%と高いが、ベクトル空間の特性がITQに適していない可能性がある。

## Pivot フィルタリング

### MiniLM
| Threshold | Filter Recall | Reduction | 候補数 |
|-----------|--------------|-----------|--------|
| 0.15 | 64.7% | 79.6% | 2,041 |
| 0.20 | 89.8% | 52.3% | 4,774 |
| 0.25 | 96.7% | 29.2% | 7,077 |

### E5-base
| Threshold | Filter Recall | Reduction |
|-----------|--------------|-----------|
| 0.15 | 100.0% | 0.4% |
| 0.20 | 100.0% | 0.2% |
| 0.25 | 100.0% | 0.1% |

**重要**: E5-baseのPivotフィルタは英語データでほぼ機能しない（削減率が非常に低い）。これはE5-baseの異方性が強く、ベクトルが狭い範囲に集中しているため。

## 保存ファイル（all-MiniLM-L6-v2用）
- `itq_minilm_128bits.pkl` - ITQモデル
- `10k_minilm_hashes_128bits.npy` - ハッシュ
- `pivots_8_minilm.npy` - 8ピボット
- `10k_minilm_pivot_distances.npy` - ピボット距離
- `10k_minilm_embeddings.npy` - 埋め込み

## 結論

### MiniLMのメリット
- **6.1倍高速**なCPU推論
- Pivotフィルタが効果的に機能（threshold=0.20で52%削減、90%recall維持）

### MiniLMの課題
- E5-baseとの検索結果一致率が低い（約38%）→ 品質差がある可能性
- ITQハミング距離とコサイン類似度の相関が非常に弱い

### 推奨
| ユースケース | 推奨モデル |
|--------------|------------|
| **英語のみ + 速度最優先** | all-MiniLM-L6-v2 |
| **英語のみ + 品質重視** | E5-base（または他の大型モデル） |
| **日本語含む** | multilingual-e5-small/base |