# 63. BGE-M3 ITQ and Pivot Evaluation

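## Purpose
- Generate embeddings with `BAAI/bge-m3` (1024 dimensions) and train ITQ and pivots on them
- Evaluate on 10,000 sampled Japanese Wikipedia articles (9,990 remain after short texts are dropped)
- Save the trained artifacts (ITQ model, pivot points)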

## Output files
- `data/itq_bge_m3_128bits.pkl` - ITQ model
- `data/10k_bge_m3_hashes_128bits.npy` - hashes
- `data/pivots_8_bge_m3.npy` - 8 pivots
- `data/10k_bge_m3_pivot_distances.npy` - pivot distances
- `data/10k_bge_m3_embeddings.npy` - embeddings

## Notes
- BGE-M3 is a large model with 1024-dimensional embeddings
- With 128-bit ITQ, the PCA step keeps 128 of the 1024 dimensions (about 12.5%)
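
To make the 12.5% figure concrete, the short sketch below works through the arithmetic. It assumes float32 embeddings and hashes packed at one bit per dimension; the notebook itself stores one array element per bit, and the variable names `dim_ratio`, `bytes_per_embedding`, `bytes_per_hash`, and `size_ratio` are illustrative only.

```python
# Compression arithmetic for a 1024-dim float32 embedding vs. a 128-bit hash.
# Assumption: embeddings are float32 and hash bits are packed 8 per byte.
EMBEDDING_DIM = 1024
N_BITS = 128

dim_ratio = N_BITS / EMBEDDING_DIM          # fraction of dimensions kept
bytes_per_embedding = EMBEDDING_DIM * 4     # 4 bytes per float32 value
bytes_per_hash = N_BITS // 8                # 8 bits per byte
size_ratio = bytes_per_embedding // bytes_per_hash

print(f"Dimensions kept: {dim_ratio:.1%}")
print(f"Bytes per vector: {bytes_per_embedding} -> {bytes_per_hash} ({size_ratio}x smaller)")
```

The ratio describes dimensions and storage only. How much of the variance survives is reported separately by the PCA step during ITQ training.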

## 0. Setup

In [1]:
import numpy as np
import time
from pathlib import Path
from tqdm import tqdm
import sys
sys.path.insert(0, '../src')
from itq_lsh import ITQLSH, hamming_distance_batch

# Check GPU availability
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

DATA_DIR = Path("../data")
np.random.seed(42)

# Model settings
MODEL_NAME = "BAAI/bge-m3"
MODEL_SHORT = "bge_m3"
EMBEDDING_DIM = 1024
N_SAMPLES = 10000

PyTorch version: 2.10.0+cu128
CUDA available: True
GPU: NVIDIA GeForce RTX 4090


## 1. Data preparation (sampling 10,000 documents)

In [2]:
# Fetch text from the existing Japanese Wikipedia dataset
from datasets import load_dataset

print("Loading Wikipedia Japanese dataset (streaming)...")
start_time = time.time()

wiki_ja = load_dataset(
    "wikimedia/wikipedia",
    "20231101.ja",
    split="train",
    streaming=True
)

print(f"Dataset loaded in {time.time() - start_time:.1f}s")

Loading Wikipedia Japanese dataset (streaming)...


Dataset loaded in 3.1s


In [3]:
# Collect up to 10,000 documents
print(f"Collecting {N_SAMPLES:,} documents...")
start_time = time.time()

documents = []
titles = []

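for i, item in enumerate(tqdm(wiki_ja, total=N_SAMPLES, desc="Collecting")):
    # Stop after scanning N_SAMPLES articles. Skipped articles still count
    # toward this limit, so fewer than N_SAMPLES documents may be collected.
    if i >= N_SAMPLES:
        break

    # Preprocessing: use roughly the first 500 characters
    text = item['text'][:500].strip()
    if len(text) < 50:  # skip texts that are too short
        continue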
    
    documents.append(text)
    titles.append(item['title'])

print(f"\nCollected {len(documents):,} documents in {time.time() - start_time:.1f}s")
print(f"Sample title: {titles[0]}")
print(f"Sample text (first 100 chars): {documents[0][:100]}...")

Collecting 10,000 documents...


Collecting: 100%|██████████| 10000/10000 [00:14<00:00, 677.77it/s]


Collected 9,990 documents in 14.8s
Sample title: アンパサンド
Sample text (first 100 chars): アンパサンド（&, ）は、並立助詞「…と…」を意味する記号である。ラテン語で「…と…」を表す接続詞 "et" の合字を起源とする。現代のフォントでも、Trebuchet MS など一部のフォントでは、...





## 2. Embedding generation

In [4]:
from sentence_transformers import SentenceTransformer

# Load the model
print(f"Loading model: {MODEL_NAME}")
start_time = time.time()

model = SentenceTransformer(MODEL_NAME)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

print(f"Model loaded in {time.time() - start_time:.1f}s")
print(f"Device: {device}")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")

Loading model: BAAI/bge-m3


Model loaded in 5.8s
Device: cuda
Embedding dimension: 1024


In [5]:
# Generate embeddings
print(f"\nGenerating embeddings for {len(documents):,} documents...")
start_time = time.time()

embeddings = model.encode(
    documents,
    batch_size=32,  # BGE-M3 is large, so keep the batch size small
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True
)

elapsed = time.time() - start_time
print(f"\nEmbedding generation completed!")
print(f"Time: {elapsed:.1f}s")
print(f"Speed: {len(documents)/elapsed:.1f} docs/sec")
print(f"Shape: {embeddings.shape}")


Generating embeddings for 9,990 documents...


Embedding generation completed!
Time: 69.5s
Speed: 143.8 docs/sec
Shape: (9990, 1024)


In [6]:
# Save the embeddings
EMB_PATH = DATA_DIR / f"10k_{MODEL_SHORT}_embeddings.npy"
np.save(EMB_PATH, embeddings)
print(f"Saved embeddings: {EMB_PATH} ({EMB_PATH.stat().st_size / 1024**2:.1f} MB)")

Saved embeddings: ../data/10k_bge_m3_embeddings.npy (39.0 MB)


## 3. ITQ training and saving

In [7]:
# Train ITQ (128 bits)
N_BITS = 128

print(f"Training ITQ with {N_BITS} bits...")
start_time = time.time()

itq = ITQLSH(n_bits=N_BITS, n_iterations=50, seed=42)
itq.fit(embeddings)

print(f"\nTraining time: {time.time() - start_time:.1f}s")

Training ITQ with 128 bits...
ITQ学習開始: samples=9990, dim=1024, bits=128
  Centering完了: mean_norm=0.5552


  PCA完了: explained_variance=68.88%


  ITQ iteration 10: quantization_error=0.9026


  ITQ iteration 20: quantization_error=0.9020


  ITQ iteration 30: quantization_error=0.9017


  ITQ iteration 40: quantization_error=0.9016


  ITQ iteration 50: quantization_error=0.9015
ITQ学習完了

Training time: 0.9s
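
The `ITQLSH` class comes from the local `itq_lsh` module, whose source is not shown in this notebook. As a rough illustration of the steps named in the log above (centering, PCA, iterative rotation updates), here is a minimal NumPy sketch of Iterative Quantization. The function names `fit_itq_sketch` and `transform_itq_sketch` are hypothetical, and nothing here is claimed to match the actual `ITQLSH` implementation.

```python
import numpy as np

def fit_itq_sketch(X, n_bits, n_iterations=50, seed=42):
    """Illustrative ITQ fit; NOT the itq_lsh.ITQLSH implementation."""
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    Xc = X - mean                                   # centering
    # PCA: keep the top n_bits right singular vectors of the centered data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_bits].T                               # (dim, n_bits)
    V = Xc @ W                                      # projected data
    # Start from a random orthogonal rotation
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
    for _ in range(n_iterations):
        B = np.where(V @ R >= 0, 1.0, -1.0)         # fix R, update binary codes
        U, _, Wt = np.linalg.svd(V.T @ B)           # fix B, solve orthogonal Procrustes
        R = U @ Wt
    return mean, W, R

def transform_itq_sketch(X, mean, W, R):
    """Map vectors to 0/1 codes with the learned projection and rotation."""
    return ((X - mean) @ W @ R >= 0).astype(np.uint8)

# Small random demo data (hypothetical, unrelated to the Wikipedia embeddings)
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 32)).astype(np.float32)
mean, W, R = fit_itq_sketch(X_demo, n_bits=8, n_iterations=10)
codes = transform_itq_sketch(X_demo, mean, W, R)
print(codes.shape)
```

Each iteration alternates between assigning codes by sign and re-solving for the rotation that best aligns the projected data with those codes.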


In [8]:
# Save the ITQ model
ITQ_PATH = DATA_DIR / f"itq_{MODEL_SHORT}_{N_BITS}bits.pkl"
itq.save(str(ITQ_PATH))
print(f"Saved ITQ model: {ITQ_PATH}")

Saved ITQ model: ../data/itq_bge_m3_128bits.pkl


In [9]:
# Generate and save the hashes
print("Generating hashes...")
start_time = time.time()

hashes = itq.transform(embeddings)
print(f"Hashes shape: {hashes.shape}, time: {time.time() - start_time:.2f}s")

HASH_PATH = DATA_DIR / f"10k_{MODEL_SHORT}_hashes_{N_BITS}bits.npy"
np.save(HASH_PATH, hashes)
print(f"Saved hashes: {HASH_PATH}")

Generating hashes...
Hashes shape: (9990, 128), time: 0.01s
Saved hashes: ../data/10k_bge_m3_hashes_128bits.npy
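
The hashes above have shape `(9990, 128)`, that is, one array element per bit. As a hedged aside, if the elements are 0/1 values (the dtype is not shown in the output), they could be packed into 16 bytes per document and compared with XOR plus a popcount table. The sketch below uses random stand-in data, and the names `demo_hashes`, `POPCOUNT`, and `packed_hamming_to_all` are illustrative, not part of `itq_lsh`.

```python
import numpy as np

# Hypothetical stand-in for the (n_docs, 128) hash matrix; assumes 0/1 values.
rng = np.random.default_rng(0)
demo_hashes = rng.integers(0, 2, size=(1000, 128), dtype=np.uint8)

# Pack 8 bits per byte: (1000, 128) -> (1000, 16)
packed = np.packbits(demo_hashes, axis=1)

# Lookup table: number of set bits for every byte value 0..255
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def packed_hamming_to_all(query_packed, all_packed):
    """Hamming distance from one packed hash to every packed hash."""
    xor = np.bitwise_xor(all_packed, query_packed)   # set bits mark differences
    return POPCOUNT[xor].sum(axis=1, dtype=np.int64)

dists = packed_hamming_to_all(packed[0], packed)
print(packed.shape, dists[:5])
```

The packed distances should match the element-wise comparison `np.sum(demo_hashes[0] != demo_hashes, axis=1)` exactly.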


## 4. Pivot selection and saving

In [10]:
# Helper functions
def hamming_distance(h1: np.ndarray, h2: np.ndarray) -> int:
    """Hamming distance between two hashes."""
    return int(np.sum(h1 != h2))

def hamming_distance_to_all(query_hash: np.ndarray, all_hashes: np.ndarray) -> np.ndarray:
    """Hamming distances from the query hash to every document hash."""
    return np.sum(query_hash != all_hashes, axis=1)

def select_pivots_furthest_first(hashes: np.ndarray, n_pivots: int, seed: int = 42) -> np.ndarray:
    """
    Select pivots with the furthest-first method.
    """
    rng = np.random.default_rng(seed)
    n_samples = len(hashes)
    
    sample_size = min(10000, n_samples)
    sample_indices = rng.choice(n_samples, sample_size, replace=False)
    sample_hashes = hashes[sample_indices]
    
    pivot_indices = [rng.integers(sample_size)]
    pivots = [sample_hashes[pivot_indices[0]]]
    
    for _ in range(n_pivots - 1):
        min_dists = np.full(sample_size, np.inf)
        for pivot in pivots:
            dists = hamming_distance_to_all(pivot, sample_hashes)
            min_dists = np.minimum(min_dists, dists)
        
        min_dists[pivot_indices] = -1
        new_idx = np.argmax(min_dists)
        pivot_indices.append(new_idx)
        pivots.append(sample_hashes[new_idx])
    
    return np.array(pivots)

def compute_pivot_distances(hashes: np.ndarray, pivots: np.ndarray) -> np.ndarray:
    n_samples = len(hashes)
    n_pivots = len(pivots)
    
    distances = np.zeros((n_samples, n_pivots), dtype=np.uint8)
    for i, pivot in enumerate(tqdm(pivots, desc="Computing pivot distances")):
        distances[:, i] = hamming_distance_to_all(pivot, hashes)
    
    return distances

In [11]:
# Select 8 pivots
N_PIVOTS = 8

print(f"Selecting {N_PIVOTS} pivots using Furthest First method...")
pivots = select_pivots_furthest_first(hashes, N_PIVOTS)
print(f"Pivots shape: {pivots.shape}")

# Check the distances between pivots
pivot_dists = []
for i in range(N_PIVOTS):
    for j in range(i+1, N_PIVOTS):
        pivot_dists.append(hamming_distance(pivots[i], pivots[j]))
print(f"Pivot-to-pivot distances: min={min(pivot_dists)}, max={max(pivot_dists)}, mean={np.mean(pivot_dists):.1f}")

Selecting 8 pivots using Furthest First method...


Pivots shape: (8, 128)
Pivot-to-pivot distances: min=64, max=87, mean=69.8


In [12]:
# Save the pivots
PIVOT_PATH = DATA_DIR / f"pivots_8_{MODEL_SHORT}.npy"
np.save(PIVOT_PATH, pivots)
print(f"Saved pivots: {PIVOT_PATH}")

Saved pivots: ../data/pivots_8_bge_m3.npy


In [13]:
# Compute and save pivot distances for all documents
print("Computing pivot distances for all documents...")
pivot_distances = compute_pivot_distances(hashes, pivots)
print(f"Pivot distances shape: {pivot_distances.shape}")

PIVOT_DIST_PATH = DATA_DIR / f"10k_{MODEL_SHORT}_pivot_distances.npy"
np.save(PIVOT_DIST_PATH, pivot_distances)
print(f"Saved pivot distances: {PIVOT_DIST_PATH}")

Computing pivot distances for all documents...


Computing pivot distances: 100%|██████████| 8/8 [00:00<00:00, 1326.26it/s]

Pivot distances shape: (9990, 8)
Saved pivot distances: ../data/10k_bge_m3_pivot_distances.npy





## 5. Evaluation (Recall@10, Filter Recall)

In [14]:
def pivot_filter(query_hash: np.ndarray, pivots: np.ndarray, 
                 all_pivot_distances: np.ndarray, threshold: int) -> np.ndarray:
    n_docs, n_pivots = all_pivot_distances.shape
    query_pivot_dists = np.array([hamming_distance(query_hash, p) for p in pivots])
    
    mask = np.ones(n_docs, dtype=bool)
    for i in range(n_pivots):
        lower = query_pivot_dists[i] - threshold
        upper = query_pivot_dists[i] + threshold
        mask &= (all_pivot_distances[:, i] >= lower) & (all_pivot_distances[:, i] <= upper)
    
    return np.where(mask)[0]

def evaluate_model(
    embeddings: np.ndarray,
    hashes: np.ndarray,
    pivots: np.ndarray,
    pivot_distances: np.ndarray,
    thresholds: list = [15, 20],
    n_queries: int = 100,
    top_k: int = 10,
    candidate_limits: list = [100, 500, 1000]
):
    n_docs = len(embeddings)
    query_indices = np.random.choice(n_docs, n_queries, replace=False)
    
    print(f"Computing ground truth for {n_queries} queries...")
    ground_truth = []
    for q_idx in tqdm(query_indices, desc="Ground truth"):
        sims = embeddings @ embeddings[q_idx]
        sims[q_idx] = -1
        top_indices = np.argsort(sims)[-top_k:][::-1]
        ground_truth.append(set(top_indices))
    
    results = []
    
    # Baseline
    print("\nEvaluating baseline (no filter)...")
    baseline_recalls = {limit: [] for limit in candidate_limits}
    
    for i, q_idx in enumerate(tqdm(query_indices, desc="Baseline")):
        query_hash = hashes[q_idx]
        distances = hamming_distance_batch(query_hash, hashes)
        distances[q_idx] = 999
        sorted_indices = np.argsort(distances)
        
        for limit in candidate_limits:
            top_candidates = set(sorted_indices[:limit])
            recall = len(top_candidates & ground_truth[i]) / top_k
            baseline_recalls[limit].append(recall)
    
    baseline_result = {
        'method': 'Baseline (no filter)',
        'threshold': '-',
        'reduction_rate': 0.0,
        'filter_recall': 1.0,
    }
    for limit in candidate_limits:
        baseline_result[f'recall@{top_k}_limit{limit}'] = np.mean(baseline_recalls[limit])
    results.append(baseline_result)
    
    # Pivot filtering
    for threshold in thresholds:
        print(f"\nEvaluating Pivot filter (threshold={threshold})...")
        
        step1_candidates_list = []
        recalls = {limit: [] for limit in candidate_limits}
        filter_recall = []
        
        for i, q_idx in enumerate(tqdm(query_indices, desc=f"Pivot t={threshold}")):
            candidates = pivot_filter(hashes[q_idx], pivots, pivot_distances, threshold)
            candidates = candidates[candidates != q_idx]
            step1_candidates_list.append(len(candidates))
            
            gt_in_candidates = len(ground_truth[i] & set(candidates)) / top_k
            filter_recall.append(gt_in_candidates)
            
            if len(candidates) == 0:
                for limit in candidate_limits:
                    recalls[limit].append(0.0)
                continue
            
            query_hash = hashes[q_idx]
            candidate_hashes = hashes[candidates]
            distances = hamming_distance_batch(query_hash, candidate_hashes)
            sorted_indices = np.argsort(distances)
            
            for limit in candidate_limits:
                if len(sorted_indices) < limit:
                    top_candidates = set(candidates[sorted_indices])
                else:
                    top_candidates = set(candidates[sorted_indices[:limit]])
                
                recall = len(top_candidates & ground_truth[i]) / top_k
                recalls[limit].append(recall)
        
        result = {
            'method': f'Pivot t={threshold}',
            'threshold': threshold,
            'reduction_rate': 1 - np.mean(step1_candidates_list) / n_docs,
            'filter_recall': np.mean(filter_recall),
        }
        for limit in candidate_limits:
            result[f'recall@{top_k}_limit{limit}'] = np.mean(recalls[limit])
        results.append(result)
    
    return results
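
The pivot filter relies on the triangle inequality for Hamming distance: for any pivot p, |d(q, p) - d(x, p)| <= d(q, x). A document whose hash lies within `threshold` of the query hash therefore always passes the filter. True cosine neighbours whose hashes lie farther away can still be dropped, which is one reason filter recall can fall below 100%. The sketch below checks the first property on random hashes; the data and the name `passes_filter` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

# Random stand-in data (hypothetical): 500 hashes of 128 bits, 8 pivots
rng = np.random.default_rng(0)
hashes = rng.integers(0, 2, size=(500, 128), dtype=np.uint8)
pivots = hashes[rng.choice(500, 8, replace=False)]

# Precomputed distance from every document to every pivot: shape (500, 8)
pivot_dists = (hashes[:, None, :] != pivots[None, :, :]).sum(axis=2)

def passes_filter(q_idx, t):
    """True where |d(x, p) - d(q, p)| <= t holds for every pivot p."""
    diff = np.abs(pivot_dists - pivot_dists[q_idx])
    return (diff <= t).all(axis=1)

q, t = 0, 60
true_d = (hashes != hashes[q]).sum(axis=1)   # exact Hamming distance to the query
mask = passes_filter(q, t)

# Every document within distance t of the query survives the filter
assert mask[true_d <= t].all()
print(f"{mask.sum()} of {len(hashes)} documents pass, {(true_d <= t).sum()} are within t")
```

The filter is a necessary condition only: documents farther than `t` from the query can also pass, so the surviving candidates still need to be ranked by their actual Hamming distance.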

In [15]:
# Run the evaluation
results = evaluate_model(
    embeddings, hashes, pivots, pivot_distances,
    thresholds=[15, 20],
    n_queries=100,
    top_k=10,
    candidate_limits=[100, 500, 1000]
)

Computing ground truth for 100 queries...
Ground truth: 100%|██████████| 100/100 [00:00<00:00, 743.64it/s]

Evaluating baseline (no filter)...
Baseline: 100%|██████████| 100/100 [00:00<00:00, 923.35it/s]

Evaluating Pivot filter (threshold=15)...
Pivot t=15: 100%|██████████| 100/100 [00:00<00:00, 1221.17it/s]

Evaluating Pivot filter (threshold=20)...
Pivot t=20: 100%|██████████| 100/100 [00:00<00:00, 840.88it/s]




In [16]:
# Display results
import pandas as pd

df_results = pd.DataFrame(results)
print(f"\n{'='*80}")
print(f"BGE-M3 ({EMBEDDING_DIM}-dim) evaluation results")
print(f"{'='*80}")
print(df_results.to_string(index=False))


BGE-M3 (1024-dim) evaluation results
              method threshold  reduction_rate  filter_recall  recall@10_limit100  recall@10_limit500  recall@10_limit1000
Baseline (no filter)         -        0.000000          1.000               0.854               0.986                0.996
          Pivot t=15        15        0.624123          0.915               0.808               0.913                0.914
          Pivot t=20        20        0.326546          0.992               0.857               0.979                0.988
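
As a reading aid for the columns above: `filter_recall` is the fraction of ground-truth neighbours that survive the pivot filter, while `recall@10_limitN` is the fraction still present after the survivors are ranked by Hamming distance and cut to N candidates. The toy example below mirrors that calculation on made-up IDs; `recall_at_k` and the sample values are hypothetical.

```python
def recall_at_k(retrieved, ground_truth):
    """Fraction of ground-truth neighbours present in the retrieved set."""
    return len(set(retrieved) & set(ground_truth)) / len(ground_truth)

# Made-up document IDs for illustration
ground_truth = {3, 7, 9, 12}
# Survivors of the pivot filter, assumed already sorted by Hamming distance
filter_candidates = [1, 3, 7, 9, 20, 31]
# Keep only the closest 3 candidates (a candidate limit of 3)
top_candidates = filter_candidates[:3]

filter_recall = recall_at_k(filter_candidates, ground_truth)
recall_limit3 = recall_at_k(top_candidates, ground_truth)
print(filter_recall, recall_limit3)
```

Because the limited candidate set is a subset of the filter survivors, recall at any limit cannot exceed the filter recall for the same query.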


## 6. Summary

In [17]:
print("="*60)
print(f"BGE-M3 ITQ/Pivot Evaluation - Summary")
print("="*60)
print(f"Model: {MODEL_NAME}")
print(f"Embedding dimension: {EMBEDDING_DIM}")
print(f"Documents: {len(documents):,}")
print(f"ITQ bits: {N_BITS}")
print(f"Pivots: {N_PIVOTS}")
print()
print("Saved files:")
print(f"  - {EMB_PATH.name}")
print(f"  - {ITQ_PATH.name}")
print(f"  - {HASH_PATH.name}")
print(f"  - {PIVOT_PATH.name}")
print(f"  - {PIVOT_DIST_PATH.name}")
print("="*60)

BGE-M3 ITQ/Pivot Evaluation - Summary
Model: BAAI/bge-m3
Embedding dimension: 1024
Documents: 9,990
ITQ bits: 128
Pivots: 8

Saved files:
  - 10k_bge_m3_embeddings.npy
  - itq_bge_m3_128bits.pkl
  - 10k_bge_m3_hashes_128bits.npy
  - pivots_8_bge_m3.npy
  - 10k_bge_m3_pivot_distances.npy


## 7. Experimental results summary

### Model information
| Item | Value |
|------|-----|
| Model name | BAAI/bge-m3 |
| Embedding dimension | 1024 |
| Documents | 9,990 |
| ITQ bits | 128 bits |
| Pivots | 8 |

### Evaluation results

| Method | Reduction rate | Filter Recall | Recall@10 (lim100) | Recall@10 (lim500) | Recall@10 (lim1000) |
|------|--------|---------------|--------------------|--------------------|---------------------|
| Baseline | 0% | 100% | 85.4% | 98.6% | **99.6%** |
| Pivot t=15 | **62.4%** | 91.5% | 80.8% | 91.3% | 91.4% |
| Pivot t=20 | 32.7% | **99.2%** | 85.7% | 97.9% | 98.8% |
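
The reduction rate can be read as an average candidate count, since the notebook computes it as `1 - mean(candidates) / n_docs`. The short sketch below converts the reported rates back into counts; the dictionary names are illustrative.

```python
# Reported values from the evaluation output
n_docs = 9990
reduction_rate = {"Pivot t=15": 0.624123, "Pivot t=20": 0.326546}

# Average number of documents left to rank after the pivot filter
remaining = {method: (1 - rate) * n_docs for method, rate in reduction_rate.items()}

for method, count in remaining.items():
    print(f"{method}: about {count:.0f} candidates remain on average")
```

Under these figures, t=15 leaves roughly 3,755 documents per query to rank and t=20 roughly 6,728.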

### Saved files
- `data/10k_bge_m3_embeddings.npy` - embedding vectors (39.0 MB)
- `data/itq_bge_m3_128bits.pkl` - trained ITQ model
- `data/10k_bge_m3_hashes_128bits.npy` - 128-bit hashes
- `data/pivots_8_bge_m3.npy` - 8 pivots
- `data/10k_bge_m3_pivot_distances.npy` - pivot distances

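### Discussion
- BGE-M3, a large 1024-dimensional model, reaches the highest Recall@10 (99.6%), obtained by the unfiltered baseline with a 1,000-candidate limit
- Pivot t=20 removes about 33% of candidates while keeping Recall@10 at 98.8% with a 1,000-candidate limit (the best of the three models compared)
- Embedding generation runs at 143.8 docs/sec (roughly 1/4 to 1/6 the speed of the other models)
- Going from 1024 dimensions to 128 bits keeps only about 12.5% of the dimensions (the PCA step retains 68.88% of the variance), yet accuracy remains high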