# 67. bge-small-en-v1.5 (FastEmbed) ITQ・Pivot評価

## 目的
- **FastEmbedのみ（PyTorchなし）** での動作確認
- bge-small-en-v1.5での埋め込み生成とITQ/Pivot学習
- 10,000件英語Wikipediaデータでの評価
- all-MiniLM-L6-v2との比較（ハミング相関の確認）

## モデル情報
| 項目 | 値 |
|------|-----|
| モデル | BAAI/bge-small-en-v1.5 |
| ライブラリ | FastEmbed (ONNX) |
| 次元数 | 384 |
| モデルサイズ | 66.5MB |
| 言語 | 英語のみ |

## 0. セットアップ（PyTorchなし）

In [1]:
# PyTorchを使用しないことを確認
import sys
print("Checking that PyTorch is NOT required...")
print(f"'torch' in sys.modules: {'torch' in sys.modules}")

import numpy as np
import time
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# ITQLSHはnumpyベースなのでPyTorch不要
sys.path.append('..')
from src.itq_lsh import ITQLSH, hamming_distance_batch

DATA_DIR = Path("../data")
EXPORT_DIR = DATA_DIR / "export"
EXPORT_DIR.mkdir(exist_ok=True)
np.random.seed(42)

# 設定
N_DOCUMENTS = 10000
N_BITS = 128
N_PIVOTS = 8

print(f"\nConfiguration:")
print(f"  Documents: {N_DOCUMENTS}")
print(f"  ITQ bits: {N_BITS}")
print(f"  Pivots: {N_PIVOTS}")
print(f"\n'torch' in sys.modules after imports: {'torch' in sys.modules}")

Checking that PyTorch is NOT required...
'torch' in sys.modules: False

Configuration:
  Documents: 10000
  ITQ bits: 128
  Pivots: 8

'torch' in sys.modules after imports: False


## 1. データ準備（Wikipedia English）

In [2]:
from datasets import load_dataset

print("Loading Wikipedia English dataset (streaming)...")
wiki_en = load_dataset(
    "wikimedia/wikipedia",
    "20231101.en",
    split="train",
    streaming=True
)

# 10,000件のドキュメントを収集
documents = []
for i, item in enumerate(wiki_en):
    if len(documents) >= N_DOCUMENTS:
        break
    text = item['text'][:500].strip()
    if len(text) >= 50:  # 短すぎるものは除外
        documents.append(text)

print(f"Collected {len(documents)} documents")
print(f"Sample: {documents[0][:150]}...")

Loading Wikipedia English dataset (streaming)...


Resolving data files:   0%|          | 0/41 [00:00<?, ?it/s]

Collected 10000 documents
Sample: Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims...


## 2. 埋め込み生成（FastEmbed）

In [3]:
from fastembed import TextEmbedding

MODEL_NAME = "BAAI/bge-small-en-v1.5"

print(f"Loading {MODEL_NAME} (FastEmbed ONNX)...")
start = time.time()
model = TextEmbedding(model_name=MODEL_NAME)
load_time = time.time() - start
print(f"  Model loaded in {load_time:.1f}s")

# 埋め込み生成
print(f"\nGenerating embeddings for {len(documents)} documents...")
start = time.time()
embeddings = np.array(list(model.embed(documents)))
embed_time = time.time() - start

print(f"  Shape: {embeddings.shape}")
print(f"  Dtype: {embeddings.dtype}")
print(f"  Time: {embed_time:.1f}s ({embed_time/len(documents)*1000:.1f} ms/doc)")

# PyTorchが読み込まれていないことを再確認
print(f"\n'torch' in sys.modules: {'torch' in sys.modules}")

Loading BAAI/bge-small-en-v1.5 (FastEmbed ONNX)...


[0;93m2026-02-05 11:07:47.618738480 [W:onnxruntime:Default, device_discovery.cc:164 DiscoverDevicesForPlatform] GPU device discovery failed: device_discovery.cc:89 ReadFileContents Failed to open file: "/sys/class/drm/card0/device/vendor"[m


  Model loaded in 0.2s

Generating embeddings for 10000 documents...


  Shape: (10000, 384)
  Dtype: float32
  Time: 302.2s (30.2 ms/doc)

'torch' in sys.modules: True


In [4]:
# 埋め込みを保存
np.save(DATA_DIR / "10k_bge_small_embeddings.npy", embeddings)
print(f"Saved: 10k_bge_small_embeddings.npy")

Saved: 10k_bge_small_embeddings.npy


## 3. ITQ学習と保存

In [5]:
# ITQ学習
print(f"Training ITQ with {N_BITS} bits...")
itq = ITQLSH(n_bits=N_BITS, n_iterations=50)
itq.fit(embeddings)

# ハッシュ生成
hashes = itq.transform(embeddings)
print(f"Hash shape: {hashes.shape}")

# 保存
itq.save(DATA_DIR / "itq_bge_small_128bits.pkl")
np.save(DATA_DIR / "10k_bge_small_hashes_128bits.npy", hashes)
print(f"Saved: itq_bge_small_128bits.pkl, 10k_bge_small_hashes_128bits.npy")

Training ITQ with 128 bits...
ITQ学習開始: samples=10000, dim=384, bits=128
  Centering完了: mean_norm=0.6669
  PCA完了: explained_variance=72.43%
  ITQ iteration 10: quantization_error=0.9114


  ITQ iteration 20: quantization_error=0.9109
  ITQ iteration 30: quantization_error=0.9106


  ITQ iteration 40: quantization_error=0.9105
  ITQ iteration 50: quantization_error=0.9104
ITQ学習完了
Hash shape: (10000, 128)
Saved: itq_bge_small_128bits.pkl, 10k_bge_small_hashes_128bits.npy


In [6]:
# ハミング距離とコサイン類似度の相関を確認
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import spearmanr

print("Computing Hamming-Cosine correlation...")
np.random.seed(42)
sample_indices = np.random.choice(len(embeddings), 500, replace=False)
sample_emb = embeddings[sample_indices]
sample_hashes = hashes[sample_indices]

# コサイン類似度行列
cos_sim_matrix = cosine_similarity(sample_emb)

# ハミング距離行列
hamming_matrix = np.zeros((len(sample_hashes), len(sample_hashes)))
for i in range(len(sample_hashes)):
    hamming_matrix[i] = hamming_distance_batch(sample_hashes[i:i+1], sample_hashes)

# 上三角部分のみ取得（対角除く）
upper_indices = np.triu_indices(len(sample_hashes), k=1)
cos_values = cos_sim_matrix[upper_indices]
hamming_values = hamming_matrix[upper_indices]

# 相関係数
correlation, p_value = spearmanr(cos_values, hamming_values)
print(f"\nSpearman correlation (cosine vs hamming): {correlation:.4f}")
print(f"(Negative correlation expected: lower hamming = higher similarity)")
print(f"\nComparison with other models:")
print(f"  bge-small-en-v1.5: {correlation:.4f}")
print(f"  all-MiniLM-L6-v2:  -0.0225 (weak)")
print(f"  E5-base:           ~-0.65 (good)")

Computing Hamming-Cosine correlation...

Spearman correlation (cosine vs hamming): -0.6111
(Negative correlation expected: lower hamming = higher similarity)

Comparison with other models:
  bge-small-en-v1.5: -0.6111
  all-MiniLM-L6-v2:  -0.0225 (weak)
  E5-base:           ~-0.65 (good)


## 4. Pivot選択と保存

In [7]:
def select_pivots_furthest_first(embeddings, n_pivots, seed=42):
    """
    Furthest First法でピボットを選択
    """
    n = len(embeddings)
    pivot_indices = []
    
    np.random.seed(seed)
    first_pivot = np.random.randint(n)
    pivot_indices.append(first_pivot)
    
    min_distances = np.full(n, np.inf)
    
    for _ in range(n_pivots - 1):
        last_pivot = pivot_indices[-1]
        distances = 1 - cosine_similarity(embeddings, embeddings[last_pivot:last_pivot+1]).flatten()
        min_distances = np.minimum(min_distances, distances)
        min_distances[pivot_indices] = -1
        next_pivot = np.argmax(min_distances)
        pivot_indices.append(next_pivot)
    
    return np.array(pivot_indices)

# ピボット選択
print(f"Selecting {N_PIVOTS} pivots using Furthest First...")
pivot_indices = select_pivots_furthest_first(embeddings, N_PIVOTS)
pivots = embeddings[pivot_indices]
print(f"Pivot indices: {pivot_indices}")
print(f"Pivots shape: {pivots.shape}")

# 全ドキュメントとピボット間の距離
pivot_distances = 1 - cosine_similarity(embeddings, pivots)
print(f"Pivot distances shape: {pivot_distances.shape}")

# 保存
np.save(DATA_DIR / "pivots_8_bge_small.npy", pivots)
np.save(DATA_DIR / "10k_bge_small_pivot_distances.npy", pivot_distances)
print(f"Saved: pivots_8_bge_small.npy, 10k_bge_small_pivot_distances.npy")

Selecting 8 pivots using Furthest First...
Pivot indices: [7270  986 8445 7455 1405 2673 9083 2003]
Pivots shape: (8, 384)
Pivot distances shape: (10000, 8)
Saved: pivots_8_bge_small.npy, 10k_bge_small_pivot_distances.npy


## 5. 評価（Recall@10, Filter Recall）

In [8]:
def evaluate_itq_lsh(query_idx, embeddings, hashes, k=10, candidates_list=[100, 500, 1000]):
    """
    ITQ LSHでの検索評価
    """
    query_emb = embeddings[query_idx:query_idx+1]
    query_hash = hashes[query_idx:query_idx+1]
    
    # Ground truth: コサイン類似度Top-k
    cos_sim = cosine_similarity(query_emb, embeddings)[0]
    cos_sim[query_idx] = -1
    ground_truth = set(np.argsort(cos_sim)[-k:])
    
    results = {}
    for n_candidates in candidates_list:
        hamming_dists = hamming_distance_batch(query_hash, hashes)
        hamming_dists = hamming_dists.astype(float)
        hamming_dists[query_idx] = np.inf
        candidates = np.argsort(hamming_dists)[:n_candidates]
        
        candidate_sims = cosine_similarity(query_emb, embeddings[candidates])[0]
        top_k_in_candidates = candidates[np.argsort(candidate_sims)[-k:]]
        
        recall = len(set(top_k_in_candidates) & ground_truth) / k
        results[n_candidates] = recall
    
    return results

def evaluate_pivot_filter(query_idx, embeddings, pivot_distances, threshold, k=10):
    """
    Pivotフィルタリングの評価
    """
    query_emb = embeddings[query_idx:query_idx+1]
    query_pivot_dist = pivot_distances[query_idx]
    
    cos_sim = cosine_similarity(query_emb, embeddings)[0]
    cos_sim[query_idx] = -1
    ground_truth = set(np.argsort(cos_sim)[-k:])
    
    dist_diff = np.abs(pivot_distances - query_pivot_dist)
    max_diff = np.max(dist_diff, axis=1)
    candidates_mask = max_diff < threshold
    candidates_mask[query_idx] = False
    
    n_candidates = np.sum(candidates_mask)
    reduction_rate = 1 - n_candidates / (len(embeddings) - 1)
    
    if n_candidates > 0:
        candidate_indices = np.where(candidates_mask)[0]
        filter_recall = len(set(candidate_indices) & ground_truth) / k
    else:
        filter_recall = 0.0
    
    return filter_recall, reduction_rate, n_candidates

# 評価実行
np.random.seed(42)
test_queries = np.random.choice(len(embeddings), 100, replace=False)

print("Evaluating ITQ LSH...")
itq_results = {100: [], 500: [], 1000: []}
for idx in test_queries:
    results = evaluate_itq_lsh(idx, embeddings, hashes)
    for n_cand, recall in results.items():
        itq_results[n_cand].append(recall)

print("\nITQ LSH Recall@10:")
print(f"{'Candidates':>12} {'Recall@10':>12}")
print("-"*26)
for n_cand in [100, 500, 1000]:
    mean_recall = np.mean(itq_results[n_cand])
    print(f"{n_cand:>12} {mean_recall*100:>11.1f}%")

Evaluating ITQ LSH...



ITQ LSH Recall@10:
  Candidates    Recall@10
--------------------------
         100        82.9%
         500        96.8%
        1000        99.0%


In [9]:
# Pivotフィルタ評価
print("Evaluating Pivot Filter...")
thresholds = [0.10, 0.15, 0.20, 0.25, 0.30]

print(f"\n{'Threshold':>10} {'Filter Recall':>14} {'Reduction':>12} {'Avg Candidates':>16}")
print("-"*55)

pivot_results = {}
for threshold in thresholds:
    filter_recalls = []
    reduction_rates = []
    n_candidates_list = []
    
    for idx in test_queries:
        fr, rr, nc = evaluate_pivot_filter(idx, embeddings, pivot_distances, threshold)
        filter_recalls.append(fr)
        reduction_rates.append(rr)
        n_candidates_list.append(nc)
    
    mean_fr = np.mean(filter_recalls)
    mean_rr = np.mean(reduction_rates)
    mean_nc = np.mean(n_candidates_list)
    pivot_results[threshold] = {'recall': mean_fr, 'reduction': mean_rr, 'candidates': mean_nc}
    print(f"{threshold:>10.2f} {mean_fr*100:>13.1f}% {mean_rr*100:>11.1f}% {mean_nc:>16.0f}")

Evaluating Pivot Filter...

 Threshold  Filter Recall    Reduction   Avg Candidates
-------------------------------------------------------


      0.10          75.3%        81.9%             1814


      0.15          96.0%        39.7%             6025


      0.20          99.7%        11.8%             8818


      0.25          99.9%         2.6%             9735


      0.30         100.0%         0.6%             9938


## 6. all-MiniLM-L6-v2との比較

In [10]:
# MiniLMの結果（実験66より）を参照して比較
print("="*70)
print("Comparison: bge-small-en-v1.5 vs all-MiniLM-L6-v2")
print("="*70)

print("\n【Model Info】")
print(f"{'':>25} {'bge-small':>15} {'MiniLM':>15}")
print("-"*60)
print(f"{'Dimension':>25} {384:>15} {384:>15}")
print(f"{'Model Size':>25} {'66.5MB':>15} {'90.4MB':>15}")

print("\n【Hamming-Cosine Correlation】")
print(f"  bge-small-en-v1.5: {correlation:.4f}")
print(f"  all-MiniLM-L6-v2:  -0.0225")
if correlation < -0.3:
    print(f"  -> bge-small has MUCH better correlation!")
elif correlation < -0.1:
    print(f"  -> bge-small has better correlation")
else:
    print(f"  -> Both have weak correlation")

print("\n【ITQ LSH Recall@10 (candidates=500)】")
print(f"  bge-small-en-v1.5: {np.mean(itq_results[500])*100:.1f}%")
print(f"  all-MiniLM-L6-v2:  96.9% (from exp 66)")

print("\n【Pivot Filter (threshold=0.20)】")
print(f"  bge-small-en-v1.5: Recall={pivot_results[0.20]['recall']*100:.1f}%, Reduction={pivot_results[0.20]['reduction']*100:.1f}%")
print(f"  all-MiniLM-L6-v2:  Recall=89.8%, Reduction=52.3% (from exp 66)")

Comparison: bge-small-en-v1.5 vs all-MiniLM-L6-v2

【Model Info】
                                bge-small          MiniLM
------------------------------------------------------------
                Dimension             384             384
               Model Size          66.5MB          90.4MB

【Hamming-Cosine Correlation】
  bge-small-en-v1.5: -0.6111
  all-MiniLM-L6-v2:  -0.0225
  -> bge-small has MUCH better correlation!

【ITQ LSH Recall@10 (candidates=500)】
  bge-small-en-v1.5: 96.8%
  all-MiniLM-L6-v2:  96.9% (from exp 66)

【Pivot Filter (threshold=0.20)】
  bge-small-en-v1.5: Recall=99.7%, Reduction=11.8%
  all-MiniLM-L6-v2:  Recall=89.8%, Reduction=52.3% (from exp 66)


## 7. エクスポート（再利用可能な形式）

In [11]:
import pickle

# ITQモデルからパラメータを抽出
print("Exporting ITQ parameters to npy format...")
with open(DATA_DIR / "itq_bge_small_128bits.pkl", "rb") as f:
    itq_params = pickle.load(f)

# 保存
np.save(EXPORT_DIR / "bge_small_itq_mean_vector.npy", itq_params['mean_vector'])
np.save(EXPORT_DIR / "bge_small_itq_pca_matrix.npy", itq_params['pca_matrix'])
np.save(EXPORT_DIR / "bge_small_itq_rotation_matrix.npy", itq_params['rotation_matrix'])
np.save(EXPORT_DIR / "bge_small_pivots_8.npy", pivots)

# メタデータ
metadata = {
    'n_bits': itq_params['n_bits'],
    'n_iterations': itq_params['n_iterations'],
    'seed': itq_params['seed'],
    'model_name': MODEL_NAME,
    'embedding_dim': 384
}
np.save(EXPORT_DIR / "bge_small_itq_metadata.npy", metadata)

print("\nExported files:")
for f in sorted(EXPORT_DIR.glob("bge_small_*.npy")):
    print(f"  - {f.name}")

Exporting ITQ parameters to npy format...

Exported files:
  - bge_small_itq_mean_vector.npy
  - bge_small_itq_metadata.npy
  - bge_small_itq_pca_matrix.npy
  - bge_small_itq_rotation_matrix.npy
  - bge_small_pivots_8.npy


## 8. サマリー

In [12]:
print("="*70)
print("Experiment 67: bge-small-en-v1.5 (FastEmbed) - Final Summary")
print("="*70)

print("\n【PyTorch Dependency Check】")
print(f"  'torch' in sys.modules: {'torch' in sys.modules}")
print(f"  -> {'OK: No PyTorch dependency!' if 'torch' not in sys.modules else 'WARNING: PyTorch was loaded'}")

print("\n【Embedding Performance】")
print(f"  Model: {MODEL_NAME}")
print(f"  Dimension: {embeddings.shape[1]}")
print(f"  Documents: {len(documents)}")
print(f"  Time: {embed_time:.1f}s ({embed_time/len(documents)*1000:.1f} ms/doc)")

print("\n【ITQ LSH Quality】")
print(f"  Hamming-Cosine correlation: {correlation:.4f}")
print(f"  PCA explained variance: {itq_params.get('explained_variance', 'N/A')}")
print(f"  Recall@10 (candidates=500): {np.mean(itq_results[500])*100:.1f}%")

print("\n【Pivot Filter Performance】")
for t in [0.15, 0.20, 0.25]:
    r = pivot_results[t]
    print(f"  threshold={t:.2f}: Recall={r['recall']*100:.1f}%, Reduction={r['reduction']*100:.1f}%")

print("\n【Saved Files】")
print("  Data:")
print(f"    - 10k_bge_small_embeddings.npy")
print(f"    - itq_bge_small_128bits.pkl")
print(f"    - 10k_bge_small_hashes_128bits.npy")
print(f"    - pivots_8_bge_small.npy")
print(f"    - 10k_bge_small_pivot_distances.npy")
print("  Export:")
print(f"    - bge_small_itq_mean_vector.npy")
print(f"    - bge_small_itq_pca_matrix.npy")
print(f"    - bge_small_itq_rotation_matrix.npy")
print(f"    - bge_small_pivots_8.npy")

Experiment 67: bge-small-en-v1.5 (FastEmbed) - Final Summary

【PyTorch Dependency Check】
  'torch' in sys.modules: True

【Embedding Performance】
  Model: BAAI/bge-small-en-v1.5
  Dimension: 384
  Documents: 10000
  Time: 302.2s (30.2 ms/doc)

【ITQ LSH Quality】
  Hamming-Cosine correlation: -0.6111
  PCA explained variance: N/A
  Recall@10 (candidates=500): 96.8%

【Pivot Filter Performance】
  threshold=0.15: Recall=96.0%, Reduction=39.7%
  threshold=0.20: Recall=99.7%, Reduction=11.8%
  threshold=0.25: Recall=99.9%, Reduction=2.6%

【Saved Files】
  Data:
    - 10k_bge_small_embeddings.npy
    - itq_bge_small_128bits.pkl
    - 10k_bge_small_hashes_128bits.npy
    - pivots_8_bge_small.npy
    - 10k_bge_small_pivot_distances.npy
  Export:
    - bge_small_itq_mean_vector.npy
    - bge_small_itq_pca_matrix.npy
    - bge_small_itq_rotation_matrix.npy
    - bge_small_pivots_8.npy


---

# 実験67 結果サマリー

## PyTorch依存性

| チェック項目 | 結果 |
|--------------|------|
| FastEmbed自体 | PyTorch不要（ONNX） |
| datasets ライブラリ | PyTorchを内部でロード |

**注意**: `datasets`ライブラリがPyTorchを間接的にロードする。純粋にPyTorchフリーにするには、データロード部分を別の方法に変更が必要。

## 性能比較: bge-small vs MiniLM

| 指標 | bge-small-en-v1.5 | all-MiniLM-L6-v2 |
|------|-------------------|------------------|
| モデルサイズ | **66.5MB** | 90.4MB |
| 埋め込み速度 | 30.2 ms/doc | **10.2 ms/doc** |
| ハミング相関 | **-0.6111** | -0.0225 |
| ITQ Recall@10 (500) | 96.8% | 96.9% |
| PCA説明分散 | 72.4% | 79.6% |

### 重要な発見

1. **ハミング相関が大幅に改善**
   - bge-small: **-0.6111** (良好)
   - MiniLM: -0.0225 (非常に弱い)
   - → bge-smallはITQ LSHに適したベクトル空間を持つ

2. **速度はMiniLMが優位**
   - MiniLM: 10.2 ms/doc (3倍高速)
   - bge-small: 30.2 ms/doc

3. **Pivotフィルタの特性が異なる**
   - bge-smallはベクトル分布が広いため、threshold=0.15で39.7%削減
   - MiniLMはthreshold=0.20で52.3%削減

## Pivot フィルタ評価

| Threshold | Filter Recall | Reduction | 候補数 |
|-----------|--------------|-----------|--------|
| 0.10 | 75.3% | 81.9% | 1,814 |
| **0.15** | **96.0%** | **39.7%** | 6,025 |
| 0.20 | 99.7% | 11.8% | 8,818 |

## 結論

| ユースケース | 推奨モデル | 理由 |
|--------------|------------|------|
| **ITQ LSH重視** | **bge-small-en-v1.5** | ハミング相関が良好（-0.61） |
| **速度最優先** | all-MiniLM-L6-v2 | 3倍高速（10ms/doc） |
| **モデルサイズ最小** | bge-small-en-v1.5 | 66.5MB（MiniLMより24MB小さい） |

## 保存ファイル

### data/
- `10k_bge_small_embeddings.npy` - 埋め込み (10000, 384)
- `itq_bge_small_128bits.pkl` - ITQモデル
- `10k_bge_small_hashes_128bits.npy` - ハッシュ (10000, 128)
- `pivots_8_bge_small.npy` - 8ピボット (8, 384)
- `10k_bge_small_pivot_distances.npy` - ピボット距離

### data/export/
- `bge_small_itq_mean_vector.npy` - 平均ベクトル (384,)
- `bge_small_itq_pca_matrix.npy` - PCA行列 (384, 128)
- `bge_small_itq_rotation_matrix.npy` - 回転行列 (128, 128)
- `bge_small_pivots_8.npy` - ピボット (8, 384)