# 62. Multilingual-E5-Small ITQ/Pivot Evaluation

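## Purpose
- Generate embeddings with `intfloat/multilingual-e5-small` (384 dimensions) and train ITQ and pivots on them
- Evaluate on a 10,000-article sample of Japanese Wikipedia
- Save the trained artifacts (ITQ model, pivots)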

## Output files
- `data/itq_e5_small_128bits.pkl` - ITQ model
- `data/10k_e5_small_hashes_128bits.npy` - hashes
- `data/pivots_8_e5_small.npy` - 8 pivots
- `data/10k_e5_small_pivot_distances.npy` - pivot distances
- `data/10k_e5_small_embeddings.npy` - embeddings

## Note
- Text passed to the E5 model is prefixed with `passage:`
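
E5 models are trained with role prefixes: documents to be indexed take `passage: ` and search queries take `query: `. This notebook only embeds passages, so the query side below is an assumed usage, and both helper names are hypothetical.

```python
def as_passage(text: str) -> str:
    """Prefix a document that will be embedded and indexed."""
    return f"passage: {text.strip()}"


def as_query(text: str) -> str:
    """Prefix a search query (assumed usage; not exercised in this notebook)."""
    return f"query: {text.strip()}"


docs = [as_passage(t) for t in ["  Tokyo is the capital of Japan. ", "ITQ learns binary codes."]]
query = as_query("capital of Japan")
print(docs[0])
print(query)
```

Keeping the prefixing in one place makes it harder to forget on one side of the index/query pair.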

## 0. Setup

In [1]:
import numpy as np
import time
from pathlib import Path
from tqdm import tqdm
import sys
sys.path.insert(0, '../src')
from itq_lsh import ITQLSH, hamming_distance_batch

# Check GPU availability
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

DATA_DIR = Path("../data")
np.random.seed(42)

# Model settings
MODEL_NAME = "intfloat/multilingual-e5-small"
MODEL_SHORT = "e5_small"
EMBEDDING_DIM = 384
N_SAMPLES = 10000

PyTorch version: 2.10.0+cu128
CUDA available: True
GPU: NVIDIA GeForce RTX 4090


## 1. Data preparation (sampling 10,000 documents)

In [2]:
# Fetch text from the existing Wikipedia dataset
from datasets import load_dataset

print("Loading Wikipedia Japanese dataset (streaming)...")
start_time = time.time()

wiki_ja = load_dataset(
    "wikimedia/wikipedia",
    "20231101.ja",
    split="train",
    streaming=True
)

print(f"Dataset loaded in {time.time() - start_time:.1f}s")

Loading Wikipedia Japanese dataset (streaming)...


Dataset loaded in 2.9s


In [3]:
# Collect 10,000 documents
print(f"Collecting {N_SAMPLES:,} documents...")
start_time = time.time()

documents = []
titles = []

for i, item in enumerate(tqdm(wiki_ja, total=N_SAMPLES, desc="Collecting")):
    if i >= N_SAMPLES:
        break
    
    # Preprocess the text (use roughly the first 500 characters)
    text = item['text'][:500].strip()
    if len(text) < 50:  # Skip texts that are too short (they still count toward N_SAMPLES)
        continue
    
    # Prefix required by the E5 model
    documents.append(f"passage: {text}")
    titles.append(item['title'])

print(f"\nCollected {len(documents):,} documents in {time.time() - start_time:.1f}s")
print(f"Sample title: {titles[0]}")
print(f"Sample text (first 100 chars): {documents[0][:100]}...")

Collecting 10,000 documents...
Collecting: 100%|██████████| 10000/10000 [00:14<00:00, 676.03it/s]

Collected 9,990 documents in 14.8s
Sample title: アンパサンド
Sample text (first 100 chars): passage: アンパサンド（&, ）は、並立助詞「…と…」を意味する記号である。ラテン語で「…と…」を表す接続詞 "et" の合字を起源とする。現代のフォントでも、Trebuchet MS など一...
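
The loop above counts skipped articles toward `N_SAMPLES`, which is why 9,990 rather than 10,000 documents are collected. If exactly N documents were wanted, one option is to filter first and take N afterwards. The sketch below runs on a synthetic stream rather than the Wikipedia dataset, and `iter_passages` is a hypothetical helper, not part of the notebook.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def iter_passages(items: Iterable[Dict[str, str]], max_chars: int = 500,
                  min_chars: int = 50) -> Iterator[Dict[str, str]]:
    """Yield truncated, prefixed passages, dropping texts that are too short."""
    for item in items:
        text = item["text"][:max_chars].strip()
        if len(text) < min_chars:
            continue
        yield {"title": item["title"], "text": f"passage: {text}"}


# Synthetic stand-in for the streamed dataset: every 4th article is too short.
fake_stream = ({"title": f"doc{i}", "text": "x" * (10 if i % 4 == 0 else 80)}
               for i in range(100))

# Filter first, then take N, so short articles do not consume the quota.
sample = list(islice(iter_passages(fake_stream), 20))
print(len(sample), sample[0]["title"])
```

The results reported in this notebook were produced with the 9,990-document set, so switching to this approach would change the numbers slightly.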





## 2. Embedding generation

In [4]:
from sentence_transformers import SentenceTransformer

# Load the model
print(f"Loading model: {MODEL_NAME}")
start_time = time.time()

model = SentenceTransformer(MODEL_NAME)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

print(f"Model loaded in {time.time() - start_time:.1f}s")
print(f"Device: {device}")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")

Loading model: intfloat/multilingual-e5-small


Model loaded in 4.3s
Device: cuda
Embedding dimension: 384


In [5]:
# Generate embeddings
print(f"\nGenerating embeddings for {len(documents):,} documents...")
start_time = time.time()

embeddings = model.encode(
    documents,
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True
)

elapsed = time.time() - start_time
print(f"\nEmbedding generation completed!")
print(f"Time: {elapsed:.1f}s")
print(f"Speed: {len(documents)/elapsed:.1f} docs/sec")
print(f"Shape: {embeddings.shape}")


Generating embeddings for 9,990 documents...

Embedding generation completed!
Time: 11.2s
Speed: 895.4 docs/sec
Shape: (9990, 384)


In [6]:
# Save the embeddings
EMB_PATH = DATA_DIR / f"10k_{MODEL_SHORT}_embeddings.npy"
np.save(EMB_PATH, embeddings)
print(f"Saved embeddings: {EMB_PATH} ({EMB_PATH.stat().st_size / 1024**2:.1f} MB)")

Saved embeddings: ../data/10k_e5_small_embeddings.npy (14.6 MB)
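
The reported 14.6 MB can be checked by hand: 9,990 vectors × 384 dimensions × 4 bytes. The sketch below assumes float32 embeddings and a 128-byte `.npy` header, which is typical for a small array but is not confirmed by the notebook.

```python
n_docs, dim = 9990, 384
bytes_per_float32 = 4      # assumption: embeddings are stored as float32
npy_header_bytes = 128     # assumption: typical header size of a small .npy file

payload = n_docs * dim * bytes_per_float32
total_mib = (payload + npy_header_bytes) / 1024**2
print(f"{payload:,} bytes payload, {total_mib:.1f} MiB on disk")
```

The match with the saved file size suggests the embeddings are indeed stored at 4 bytes per value.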


## 3. ITQ training and saving

In [7]:
# Train ITQ (128 bits)
N_BITS = 128

print(f"Training ITQ with {N_BITS} bits...")
start_time = time.time()

itq = ITQLSH(n_bits=N_BITS, n_iterations=50, seed=42)
itq.fit(embeddings)

print(f"\nTraining time: {time.time() - start_time:.1f}s")

Training ITQ with 128 bits...
ITQ学習開始: samples=9990, dim=384, bits=128
  Centering完了: mean_norm=0.8826
  PCA完了: explained_variance=67.67%
  ITQ iteration 10: quantization_error=0.9450


  ITQ iteration 20: quantization_error=0.9446
  ITQ iteration 30: quantization_error=0.9445


  ITQ iteration 40: quantization_error=0.9444
  ITQ iteration 50: quantization_error=0.9443
ITQ学習完了

Training time: 0.6s
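
The internals of `ITQLSH` are not shown in this notebook. As a rough guide to what the log lines correspond to (centering, PCA, then alternating rotation updates), here is a minimal NumPy sketch of the standard ITQ formulation. It is not the `itq_lsh` implementation, its error values are not normalised the same way as the log above, and `fit_itq` / `transform_itq` are hypothetical names.

```python
import numpy as np


def fit_itq(X: np.ndarray, n_bits: int, n_iterations: int = 50, seed: int = 42):
    """Return (mean, projection, rotation, per-iteration quantization errors)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=np.float64)
    mean = X.mean(axis=0)
    Xc = X - mean

    # PCA: keep the n_bits directions with the largest variance (n_bits <= dim).
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    W = vt[:n_bits].T                      # (dim, n_bits)
    V = Xc @ W                             # (n_samples, n_bits)

    # Start from a random orthogonal rotation.
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))

    errors = []
    for _ in range(n_iterations):
        B = np.where(V @ R >= 0, 1.0, -1.0)      # fix R, update the codes
        U, _, Vt = np.linalg.svd(V.T @ B)        # fix B, solve orthogonal Procrustes
        R = U @ Vt
        errors.append(float(np.linalg.norm(B - V @ R) ** 2))
    return mean, W, R, errors


def transform_itq(X: np.ndarray, mean: np.ndarray, W: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Encode rows of X as 0/1 bits."""
    return ((X - mean) @ W @ R >= 0).astype(np.uint8)


rng = np.random.default_rng(0)
X = rng.standard_normal((500, 32)).astype(np.float32)
mean, W, R, errors = fit_itq(X, n_bits=16, n_iterations=20)
codes = transform_itq(X, mean, W, R)
print(codes.shape, errors[0] >= errors[-1])
```

Because each step minimises the same objective with the other variable held fixed, the quantization error cannot increase between iterations, which is consistent with the slowly decreasing values in the log.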


In [8]:
# Save the ITQ model
ITQ_PATH = DATA_DIR / f"itq_{MODEL_SHORT}_{N_BITS}bits.pkl"
itq.save(str(ITQ_PATH))
print(f"Saved ITQ model: {ITQ_PATH}")

Saved ITQ model: ../data/itq_e5_small_128bits.pkl


In [9]:
# Generate and save the hashes
print("Generating hashes...")
start_time = time.time()

hashes = itq.transform(embeddings)
print(f"Hashes shape: {hashes.shape}, time: {time.time() - start_time:.2f}s")

HASH_PATH = DATA_DIR / f"10k_{MODEL_SHORT}_hashes_{N_BITS}bits.npy"
np.save(HASH_PATH, hashes)
print(f"Saved hashes: {HASH_PATH}")

Generating hashes...
Hashes shape: (9990, 128), time: 0.01s
Saved hashes: ../data/10k_e5_small_hashes_128bits.npy
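
The hash array has shape `(9990, 128)`, which suggests one stored value per bit. If storage or scan speed mattered, the bits could be packed into 16 bytes per document and compared with XOR plus a popcount table. This is a sketch under that assumption (the dtype of `hashes` is not shown in the notebook); `pack_codes` and `hamming_packed` are hypothetical names.

```python
import numpy as np

# 8-bit popcount lookup table: POPCOUNT[b] = number of set bits in byte b.
POPCOUNT = np.array([bin(b).count("1") for b in range(256)], dtype=np.uint8)


def pack_codes(bits: np.ndarray) -> np.ndarray:
    """Pack an (n, n_bits) array of 0/1 values into (n, ceil(n_bits / 8)) bytes."""
    return np.packbits(bits.astype(np.uint8), axis=1)


def hamming_packed(query: np.ndarray, packed: np.ndarray) -> np.ndarray:
    """Hamming distance from one packed code to every row of `packed`."""
    return POPCOUNT[np.bitwise_xor(packed, query)].sum(axis=1, dtype=np.int64)


rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(1000, 128), dtype=np.uint8)
packed = pack_codes(bits)

# The packed distance must agree with the element-wise comparison.
reference = (bits != bits[0]).sum(axis=1)
assert np.array_equal(hamming_packed(packed[0], packed), reference)
print(bits.nbytes, "->", packed.nbytes, "bytes")
```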


## 4. Pivot selection and saving

In [10]:
# Helper functions
def hamming_distance(h1: np.ndarray, h2: np.ndarray) -> int:
    """Hamming distance between two hashes."""
    return np.sum(h1 != h2)

def hamming_distance_to_all(query_hash: np.ndarray, all_hashes: np.ndarray) -> np.ndarray:
    """Compute the Hamming distance between the query and every document."""
    return np.sum(query_hash != all_hashes, axis=1)

def select_pivots_furthest_first(hashes: np.ndarray, n_pivots: int, seed: int = 42) -> np.ndarray:
    """
    Select pivots with the furthest-first method.
    """
    rng = np.random.default_rng(seed)
    n_samples = len(hashes)
    
    sample_size = min(10000, n_samples)
    sample_indices = rng.choice(n_samples, sample_size, replace=False)
    sample_hashes = hashes[sample_indices]
    
    pivot_indices = [rng.integers(sample_size)]
    pivots = [sample_hashes[pivot_indices[0]]]
    
    for _ in range(n_pivots - 1):
        min_dists = np.full(sample_size, np.inf)
        for pivot in pivots:
            dists = hamming_distance_to_all(pivot, sample_hashes)
            min_dists = np.minimum(min_dists, dists)
        
        min_dists[pivot_indices] = -1
        new_idx = np.argmax(min_dists)
        pivot_indices.append(new_idx)
        pivots.append(sample_hashes[new_idx])
    
    return np.array(pivots)

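def compute_pivot_distances(hashes: np.ndarray, pivots: np.ndarray) -> np.ndarray:
    """Compute the Hamming distance from every document to every pivot."""
    n_samples = len(hashes)
    n_pivots = len(pivots)
    
    # uint8 is sufficient as long as the code length is at most 255 bits
    distances = np.zeros((n_samples, n_pivots), dtype=np.uint8)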
    for i, pivot in enumerate(tqdm(pivots, desc="Computing pivot distances")):
        distances[:, i] = hamming_distance_to_all(pivot, hashes)
    
    return distances

In [11]:
# Select 8 pivots
N_PIVOTS = 8

print(f"Selecting {N_PIVOTS} pivots using Furthest First method...")
pivots = select_pivots_furthest_first(hashes, N_PIVOTS)
print(f"Pivots shape: {pivots.shape}")

# Check the distances between pivots
pivot_dists = []
for i in range(N_PIVOTS):
    for j in range(i+1, N_PIVOTS):
        pivot_dists.append(hamming_distance(pivots[i], pivots[j]))
print(f"Pivot-to-pivot distances: min={min(pivot_dists)}, max={max(pivot_dists)}, mean={np.mean(pivot_dists):.1f}")

Selecting 8 pivots using Furthest First method...
Pivots shape: (8, 128)
Pivot-to-pivot distances: min=65, max=86, mean=70.0
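
For reference, two independent, uniformly random 128-bit codes differ in 64 bits on average, with a standard deviation of about 5.7 (sqrt(128)/2). The selected pivots (min 65, mean 70.0, max 86) are therefore spread somewhat wider than random codes would be. A quick simulation with synthetic codes, not the notebook's hashes:

```python
import numpy as np

n_bits = 128
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=(100_000, n_bits), dtype=np.uint8)
b = rng.integers(0, 2, size=(100_000, n_bits), dtype=np.uint8)
dists = (a != b).sum(axis=1)

expected_mean = n_bits / 2            # each bit differs with probability 1/2
expected_std = np.sqrt(n_bits) / 2    # standard deviation of Binomial(n_bits, 1/2)
print(f"simulated mean={dists.mean():.2f} (theory {expected_mean:.0f}), "
      f"std={dists.std():.2f} (theory {expected_std:.2f})")
```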


In [12]:
# Save the pivots
PIVOT_PATH = DATA_DIR / f"pivots_8_{MODEL_SHORT}.npy"
np.save(PIVOT_PATH, pivots)
print(f"Saved pivots: {PIVOT_PATH}")

Saved pivots: ../data/pivots_8_e5_small.npy


In [13]:
# Compute and save the pivot distances for all documents
print("Computing pivot distances for all documents...")
pivot_distances = compute_pivot_distances(hashes, pivots)
print(f"Pivot distances shape: {pivot_distances.shape}")

PIVOT_DIST_PATH = DATA_DIR / f"10k_{MODEL_SHORT}_pivot_distances.npy"
np.save(PIVOT_DIST_PATH, pivot_distances)
print(f"Saved pivot distances: {PIVOT_DIST_PATH}")

Computing pivot distances for all documents...
Computing pivot distances: 100%|██████████| 8/8 [00:00<00:00, 1226.85it/s]

Pivot distances shape: (9990, 8)
Saved pivot distances: ../data/10k_e5_small_pivot_distances.npy
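
The stored pivot distances make a triangle-inequality filter possible. Hamming distance is a metric, so for any pivot p, |d(q, p) - d(x, p)| <= d(q, x). A document whose pivot distance differs from the query's by more than t on any pivot must therefore be more than t bits away from the query. The sketch below checks this on synthetic codes; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_bits, n_pivots, threshold = 2000, 128, 8, 20

# Synthetic codes: noisy copies of a few base codes, so that near neighbours exist.
base = rng.integers(0, 2, size=(20, n_bits), dtype=np.uint8)
flips = (rng.random((n_docs, n_bits)) < 0.05).astype(np.uint8)
codes = base[rng.integers(0, 20, size=n_docs)] ^ flips

pivots = codes[rng.choice(n_docs, n_pivots, replace=False)]
# int16 avoids unsigned wrap-around when the distances are subtracted below.
doc_to_pivot = (codes[:, None, :] != pivots[None, :, :]).sum(axis=2).astype(np.int16)

query = codes[0]
query_to_pivot = (pivots != query).sum(axis=1).astype(np.int16)

# Keep a document only if every pivot distance is within `threshold` of the query's.
keep = (np.abs(doc_to_pivot - query_to_pivot) <= threshold).all(axis=1)

true_dist = (codes != query).sum(axis=1)
within = true_dist <= threshold
# No document within `threshold` bits of the query may be filtered out.
assert keep[within].all()
print(f"kept {keep.sum()} of {n_docs}; {within.sum()} documents lie within {threshold} bits")
```

The guarantee only covers documents within t bits of the query. True nearest neighbours in embedding space can sit farther away in Hamming space, which is why the filter recall measured in section 5 is below 100%.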





## 5. Evaluation (Recall@10, Filter Recall)

In [14]:
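def pivot_filter(query_hash: np.ndarray, pivots: np.ndarray, 
                 all_pivot_distances: np.ndarray, threshold: int) -> np.ndarray:
    """Return indices of documents whose distance to every pivot is within
    `threshold` of the query's distance to that pivot."""
    n_docs, n_pivots = all_pivot_distances.shape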
    query_pivot_dists = np.array([hamming_distance(query_hash, p) for p in pivots])
    
    mask = np.ones(n_docs, dtype=bool)
    for i in range(n_pivots):
        lower = query_pivot_dists[i] - threshold
        upper = query_pivot_dists[i] + threshold
        mask &= (all_pivot_distances[:, i] >= lower) & (all_pivot_distances[:, i] <= upper)
    
    return np.where(mask)[0]

def evaluate_model(
    embeddings: np.ndarray,
    hashes: np.ndarray,
    pivots: np.ndarray,
    pivot_distances: np.ndarray,
    thresholds: tuple = (15, 20),
    n_queries: int = 100,
    top_k: int = 10,
    candidate_limits: tuple = (100, 500, 1000)
):
    n_docs = len(embeddings)
    query_indices = np.random.choice(n_docs, n_queries, replace=False)
    
    print(f"Computing ground truth for {n_queries} queries...")
    ground_truth = []
    for q_idx in tqdm(query_indices, desc="Ground truth"):
        sims = embeddings @ embeddings[q_idx]
        sims[q_idx] = -1
        top_indices = np.argsort(sims)[-top_k:][::-1]
        ground_truth.append(set(top_indices))
    
    results = []
    
    # Baseline
    print("\nEvaluating baseline (no filter)...")
    baseline_recalls = {limit: [] for limit in candidate_limits}
    
    for i, q_idx in enumerate(tqdm(query_indices, desc="Baseline")):
        query_hash = hashes[q_idx]
        distances = hamming_distance_batch(query_hash, hashes)
        distances[q_idx] = 999
        sorted_indices = np.argsort(distances)
        
        for limit in candidate_limits:
            top_candidates = set(sorted_indices[:limit])
            recall = len(top_candidates & ground_truth[i]) / top_k
            baseline_recalls[limit].append(recall)
    
    baseline_result = {
        'method': 'Baseline (no filter)',
        'threshold': '-',
        'reduction_rate': 0.0,
        'filter_recall': 1.0,
    }
    for limit in candidate_limits:
        baseline_result[f'recall@{top_k}_limit{limit}'] = np.mean(baseline_recalls[limit])
    results.append(baseline_result)
    
    # Pivot filtering
    for threshold in thresholds:
        print(f"\nEvaluating Pivot filter (threshold={threshold})...")
        
        step1_candidates_list = []
        recalls = {limit: [] for limit in candidate_limits}
        filter_recall = []
        
        for i, q_idx in enumerate(tqdm(query_indices, desc=f"Pivot t={threshold}")):
            candidates = pivot_filter(hashes[q_idx], pivots, pivot_distances, threshold)
            candidates = candidates[candidates != q_idx]
            step1_candidates_list.append(len(candidates))
            
            gt_in_candidates = len(ground_truth[i] & set(candidates)) / top_k
            filter_recall.append(gt_in_candidates)
            
            if len(candidates) == 0:
                for limit in candidate_limits:
                    recalls[limit].append(0.0)
                continue
            
            query_hash = hashes[q_idx]
            candidate_hashes = hashes[candidates]
            distances = hamming_distance_batch(query_hash, candidate_hashes)
            sorted_indices = np.argsort(distances)
            
            for limit in candidate_limits:
                if len(sorted_indices) < limit:
                    top_candidates = set(candidates[sorted_indices])
                else:
                    top_candidates = set(candidates[sorted_indices[:limit]])
                
                recall = len(top_candidates & ground_truth[i]) / top_k
                recalls[limit].append(recall)
        
        result = {
            'method': f'Pivot t={threshold}',
            'threshold': threshold,
            'reduction_rate': 1 - np.mean(step1_candidates_list) / n_docs,
            'filter_recall': np.mean(filter_recall),
        }
        for limit in candidate_limits:
            result[f'recall@{top_k}_limit{limit}'] = np.mean(recalls[limit])
        results.append(result)
    
    return results

In [15]:
# Run the evaluation
results = evaluate_model(
    embeddings, hashes, pivots, pivot_distances,
    thresholds=[15, 20],
    n_queries=100,
    top_k=10,
    candidate_limits=[100, 500, 1000]
)

Computing ground truth for 100 queries...
Ground truth: 100%|██████████| 100/100 [00:00<00:00, 2954.57it/s]

Evaluating baseline (no filter)...
Baseline: 100%|██████████| 100/100 [00:00<00:00, 1103.96it/s]

Evaluating Pivot filter (threshold=15)...
Pivot t=15: 100%|██████████| 100/100 [00:00<00:00, 1289.93it/s]

Evaluating Pivot filter (threshold=20)...
Pivot t=20: 100%|██████████| 100/100 [00:00<00:00, 902.40it/s]

In [16]:
# Display the results
import pandas as pd

df_results = pd.DataFrame(results)
print(f"\n{'='*80}")
print(f"Multilingual-E5-Small ({EMBEDDING_DIM} dimensions) evaluation results")
print(f"{'='*80}")
print(df_results.to_string(index=False))


Multilingual-E5-Small (384 dimensions) evaluation results
              method threshold  reduction_rate  filter_recall  recall@10_limit100  recall@10_limit500  recall@10_limit1000
Baseline (no filter)         -        0.000000          1.000               0.835               0.970                0.984
          Pivot t=15        15        0.660493          0.895               0.782               0.887                0.894
          Pivot t=20        20        0.404089          0.974               0.826               0.956                0.973
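
Two quick readings of this table. First, `reduction_rate` is defined as `1 - mean_candidates / n_docs`, so it can be inverted to get the average number of candidates that survive the filter. Second, recall after filtering can never exceed `filter_recall`, which is why 0.894 sits just under 0.895 and 0.973 just under 0.974. The numbers below are copied from the table; the variable names are illustrative.

```python
n_docs = 9990
baseline_lim1000 = 0.984
rows = {
    "Pivot t=15": {"reduction_rate": 0.660493, "filter_recall": 0.895, "recall_lim1000": 0.894},
    "Pivot t=20": {"reduction_rate": 0.404089, "filter_recall": 0.974, "recall_lim1000": 0.973},
}

summary = {}
for name, r in rows.items():
    # Invert reduction_rate = 1 - mean_candidates / n_docs.
    mean_candidates = n_docs * (1 - r["reduction_rate"])
    # The final recall is bounded above by the filter recall.
    assert r["recall_lim1000"] <= r["filter_recall"]
    relative = r["recall_lim1000"] / baseline_lim1000
    summary[name] = (round(mean_candidates), round(relative, 3))
    print(f"{name}: ~{mean_candidates:.0f} candidates/query, "
          f"{relative:.1%} of baseline recall at limit 1000")
```

At t=15 roughly 3,392 of 9,990 documents survive per query and recall is about 90.9% of the baseline; at t=20 roughly 5,953 survive and recall is about 98.9% of the baseline.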


## 6. Summary

In [17]:
print("="*60)
print(f"Multilingual-E5-Small ITQ/Pivot Evaluation - Summary")
print("="*60)
print(f"Model: {MODEL_NAME}")
print(f"Embedding dimension: {EMBEDDING_DIM}")
print(f"Documents: {len(documents):,}")
print(f"ITQ bits: {N_BITS}")
print(f"Pivots: {N_PIVOTS}")
print(f"")
print(f"Saved files:")
print(f"  - {EMB_PATH.name}")
print(f"  - {ITQ_PATH.name}")
print(f"  - {HASH_PATH.name}")
print(f"  - {PIVOT_PATH.name}")
print(f"  - {PIVOT_DIST_PATH.name}")
print("="*60)

Multilingual-E5-Small ITQ/Pivot Evaluation - Summary
Model: intfloat/multilingual-e5-small
Embedding dimension: 384
Documents: 9,990
ITQ bits: 128
Pivots: 8

Saved files:
  - 10k_e5_small_embeddings.npy
  - itq_e5_small_128bits.pkl
  - 10k_e5_small_hashes_128bits.npy
  - pivots_8_e5_small.npy
  - 10k_e5_small_pivot_distances.npy


## 7. Experiment Results Summary

### Model information
| Item | Value |
|------|-----|
| Model name | intfloat/multilingual-e5-small |
| Embedding dimension | 384 |
| Number of documents | 9,990 |
| ITQ bits | 128 bits |
| Number of pivots | 8 |

### Evaluation results

| Method | Reduction rate | Filter Recall | Recall@10 (lim100) | Recall@10 (lim500) | Recall@10 (lim1000) |
|------|--------|---------------|--------------------|--------------------|---------------------|
| Baseline | 0% | 100% | 83.5% | 97.0% | **98.4%** |
| Pivot t=15 | **66.0%** | 89.5% | 78.2% | 88.7% | 89.4% |
| Pivot t=20 | 40.4% | **97.4%** | 82.6% | 95.6% | 97.3% |
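
The `limN` columns report whether the true top-10 neighbours appear among the N nearest hashes. In a retrieval pipeline such a shortlist would typically be reranked with the original embeddings. The notebook does not include that step, so the following is an assumed usage sketch on synthetic data: random-hyperplane sign bits stand in for ITQ codes, and `two_stage_search` is a hypothetical name.

```python
import numpy as np


def two_stage_search(query_emb, query_code, embeddings, codes, limit=100, top_k=10):
    """Shortlist by Hamming distance, then rerank the shortlist by cosine similarity."""
    hamming = (codes != query_code).sum(axis=1)
    # Stable sort so that ties are broken by index deterministically.
    shortlist = np.argsort(hamming, kind="stable")[:limit]
    sims = embeddings[shortlist] @ query_emb      # rows are L2-normalised
    return shortlist[np.argsort(-sims, kind="stable")[:top_k]]


rng = np.random.default_rng(0)
emb = rng.standard_normal((2000, 64))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
# Stand-in for ITQ: random-hyperplane sign bits (an assumption for this demo only).
planes = rng.standard_normal((64, 128))
codes = (emb @ planes >= 0).astype(np.uint8)

q = 0
exact = np.argsort(-(emb @ emb[q]), kind="stable")[:10]
approx = two_stage_search(emb[q], codes[q], emb, codes, limit=200, top_k=10)
recall = len(set(exact.tolist()) & set(approx.tolist())) / 10
print(f"recall@10 with a 200-document shortlist: {recall:.2f}")
```

Unlike the evaluation in section 5, this demo leaves the query document in the index, so it always comes back as its own top hit.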

### Saved files
- `data/10k_e5_small_embeddings.npy` - embedding vectors (14.6 MB)
- `data/itq_e5_small_128bits.pkl` - trained ITQ model
- `data/10k_e5_small_hashes_128bits.npy` - 128-bit hashes
- `data/pivots_8_e5_small.npy` - 8 pivots
- `data/10k_e5_small_pivot_distances.npy` - pivot distances

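### Discussion
- E5-Small has the same 384 dimensions as GTE-Small and performs similarly
- Pivot t=20 cuts candidates by about 40% while keeping Recall@10 at 97.3% with a 1,000-candidate limit (higher than GTE-Small)
- E5 models are used with the `passage:` prefix
- E5-Small belongs to the same family as E5-base, so migrating to E5-base is straightforward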