# 71. Exporting ITQ LSH and Pivot Data

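## Purpose
- Save the ITQ LSH model (128 bits) parameters in npy format so they can be reused
- Confirm that the 8 pivot vectors can be reloaded and reused
- Organize the files for porting to another system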

## Target models
1. **multilingual-e5-base** (768 dimensions) - multilingual
2. **all-MiniLM-L6-v2** (384 dimensions) - English-only, fast

## 0. Setup

In [1]:
import numpy as np
import pickle
from pathlib import Path

DATA_DIR = Path("../data")
EXPORT_DIR = DATA_DIR / "export"
EXPORT_DIR.mkdir(parents=True, exist_ok=True)

print(f"Export directory: {EXPORT_DIR}")

Export directory: ../data/export


## 1. Exporting data for multilingual-e5-base

In [2]:
# Load the E5-base ITQ model
print("Loading E5-base ITQ model...")
with open(DATA_DIR / "itq_e5_base_128bits.pkl", "rb") as f:
    e5_itq_params = pickle.load(f)

print(f"\nE5-base ITQ parameters:")
for key, value in e5_itq_params.items():
    if isinstance(value, np.ndarray):
        print(f"  {key}: shape={value.shape}, dtype={value.dtype}")
    else:
        print(f"  {key}: {value}")

Loading E5-base ITQ model...

E5-base ITQ parameters:
  n_bits: 128
  n_iterations: 50
  seed: 42
  mean_vector: shape=(768,), dtype=float32
  pca_matrix: shape=(768, 128), dtype=float32
  rotation_matrix: shape=(128, 128), dtype=float32


In [3]:
# Save the E5-base ITQ parameters as npy
print("Exporting E5-base ITQ parameters...")

# Mean vector
np.save(EXPORT_DIR / "e5_base_itq_mean_vector.npy", e5_itq_params['mean_vector'])
print(f"  Saved: e5_base_itq_mean_vector.npy - shape {e5_itq_params['mean_vector'].shape}")

# PCA matrix
np.save(EXPORT_DIR / "e5_base_itq_pca_matrix.npy", e5_itq_params['pca_matrix'])
print(f"  Saved: e5_base_itq_pca_matrix.npy - shape {e5_itq_params['pca_matrix'].shape}")

# Rotation matrix
np.save(EXPORT_DIR / "e5_base_itq_rotation_matrix.npy", e5_itq_params['rotation_matrix'])
print(f"  Saved: e5_base_itq_rotation_matrix.npy - shape {e5_itq_params['rotation_matrix'].shape}")

# Metadata (settings)
e5_metadata = {
    'n_bits': e5_itq_params['n_bits'],
    'n_iterations': e5_itq_params['n_iterations'],
    'seed': e5_itq_params['seed'],
    'model_name': 'intfloat/multilingual-e5-base',
    'embedding_dim': 768
}
# np.save pickles a dict as a 0-d object array; reload with np.load(..., allow_pickle=True).item()
np.save(EXPORT_DIR / "e5_base_itq_metadata.npy", e5_metadata)
print(f"  Saved: e5_base_itq_metadata.npy - {e5_metadata}")

Exporting E5-base ITQ parameters...
  Saved: e5_base_itq_mean_vector.npy - shape (768,)
  Saved: e5_base_itq_pca_matrix.npy - shape (768, 128)
  Saved: e5_base_itq_rotation_matrix.npy - shape (128, 128)
  Saved: e5_base_itq_metadata.npy - {'n_bits': 128, 'n_iterations': 50, 'seed': 42, 'model_name': 'intfloat/multilingual-e5-base', 'embedding_dim': 768}
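
Because `e5_metadata` is a Python dict, `np.save` stores it as a pickled 0-d object array, so a plain `np.load` on that file raises an error unless `allow_pickle=True` is passed. The sketch below shows the reload, plus a JSON alternative that may be easier to read from a non-Python system. The `.json` file name and the temporary directory are assumptions for illustration; this notebook does not produce a JSON file.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Same keys as the metadata saved above; values copied from the cell output.
metadata = {
    "n_bits": 128,
    "n_iterations": 50,
    "seed": 42,
    "model_name": "intfloat/multilingual-e5-base",
    "embedding_dim": 768,
}

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)

    # np.save pickles the dict into a 0-d object array.
    npy_path = tmp_dir / "e5_base_itq_metadata.npy"
    np.save(npy_path, metadata)

    # Loading therefore needs allow_pickle=True; .item() unwraps the dict.
    loaded = np.load(npy_path, allow_pickle=True).item()

    # Hypothetical JSON alternative (the file name is an assumption).
    json_path = tmp_dir / "e5_base_itq_metadata.json"
    json_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    loaded_json = json.loads(json_path.read_text(encoding="utf-8"))

print(loaded == metadata, loaded_json == metadata)
```

Only enable `allow_pickle=True` for files from a trusted source, since unpickling can execute arbitrary code.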


In [4]:
# Check the E5-base pivot data
print("\nChecking E5-base Pivot data...")

# Inspect the existing pivot file
e5_pivots_old = np.load(DATA_DIR / "pivots_8_furthest_first.npy")
print(f"  Original file: pivots_8_furthest_first.npy")
print(f"  Shape: {e5_pivots_old.shape}")
print(f"  Dtype: {e5_pivots_old.dtype}")
print(f"  ⚠ This is ITQ hashes, not embeddings!")

# Recompute the pivots from the E5-base embeddings
print("\n  Recalculating E5-base pivots from embeddings...")
from sklearn.metrics.pairwise import cosine_similarity

e5_embeddings = np.load(DATA_DIR / "wikipedia_400k_e5_base_embeddings.npy")
print(f"  Loaded embeddings: {e5_embeddings.shape}")

def select_pivots_furthest_first(embeddings, n_pivots, seed=42):
    """Select pivots with the furthest-first method."""
    n = len(embeddings)
    pivot_indices = []
    np.random.seed(seed)
    first_pivot = np.random.randint(n)
    pivot_indices.append(first_pivot)
    min_distances = np.full(n, np.inf)
    
    for _ in range(n_pivots - 1):
        last_pivot = pivot_indices[-1]
        distances = 1 - cosine_similarity(embeddings, embeddings[last_pivot:last_pivot+1]).flatten()
        min_distances = np.minimum(min_distances, distances)
        min_distances[pivot_indices] = -1
        next_pivot = np.argmax(min_distances)
        pivot_indices.append(next_pivot)
    
    return np.array(pivot_indices)

# Compute from a sample (to save memory)
sample_size = 50000
np.random.seed(42)
sample_indices = np.random.choice(len(e5_embeddings), sample_size, replace=False)
sample_embeddings = e5_embeddings[sample_indices]

pivot_indices = select_pivots_furthest_first(sample_embeddings, 8)
e5_pivots = sample_embeddings[pivot_indices]

print(f"  Calculated pivots shape: {e5_pivots.shape}")
print(f"  Pivot dtype: {e5_pivots.dtype}")

# Export
np.save(EXPORT_DIR / "e5_base_pivots_8.npy", e5_pivots)
print(f"  ✓ Saved: e5_base_pivots_8.npy (8, 768)")


Checking E5-base Pivot data...
  Original file: pivots_8_furthest_first.npy
  Shape: (8, 128)
  Dtype: uint8
  ⚠ This is ITQ hashes, not embeddings!

  Recalculating E5-base pivots from embeddings...


  Loaded embeddings: (399029, 768)


  Calculated pivots shape: (8, 768)
  Pivot dtype: float32
  ✓ Saved: e5_base_pivots_8.npy (8, 768)
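
To make the selection rule easier to follow, here is a toy version of the same greedy furthest-first idea on four 2-D vectors, written with plain NumPy instead of scikit-learn. The function name `furthest_first` and the fixed `first` index are assumptions for this sketch; the notebook's own function picks the first pivot at random.

```python
import numpy as np


def furthest_first(points, n_pivots, first=0):
    """Greedy furthest-first selection using cosine distance (toy version)."""
    # Normalise rows so that a dot product equals cosine similarity.
    unit = points / np.linalg.norm(points, axis=1, keepdims=True)
    chosen = [first]
    min_dist = np.full(len(points), np.inf)
    for _ in range(n_pivots - 1):
        # Distance from every point to the most recently chosen pivot.
        dist = 1.0 - unit @ unit[chosen[-1]]
        # Track each point's distance to its nearest pivot so far.
        min_dist = np.minimum(min_dist, dist)
        min_dist[chosen] = -1.0  # never re-select an existing pivot
        chosen.append(int(np.argmax(min_dist)))
    return np.array(chosen)


# Rows 0 and 1 are nearly parallel, row 2 is opposite to row 0, row 3 is orthogonal.
toy = np.array([[1.0, 0.0], [0.99, 0.1], [-1.0, 0.0], [0.0, 1.0]])
pivots = furthest_first(toy, n_pivots=3, first=0)
print(pivots)  # [0 2 3]
```

Row 1 is never chosen because it is almost identical in direction to row 0, which is the behaviour that spreads the pivots across the embedding space.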


## 2. Exporting data for all-MiniLM-L6-v2

In [5]:
# Load the MiniLM ITQ model
print("Loading MiniLM ITQ model...")
with open(DATA_DIR / "itq_minilm_128bits.pkl", "rb") as f:
    minilm_itq_params = pickle.load(f)

print(f"\nMiniLM ITQ parameters:")
for key, value in minilm_itq_params.items():
    if isinstance(value, np.ndarray):
        print(f"  {key}: shape={value.shape}, dtype={value.dtype}")
    else:
        print(f"  {key}: {value}")

Loading MiniLM ITQ model...

MiniLM ITQ parameters:
  n_bits: 128
  n_iterations: 50
  seed: 42
  mean_vector: shape=(384,), dtype=float64
  pca_matrix: shape=(384, 128), dtype=float32
  rotation_matrix: shape=(128, 128), dtype=float32


In [6]:
# Save the MiniLM ITQ parameters as npy
print("Exporting MiniLM ITQ parameters...")

# Mean vector
np.save(EXPORT_DIR / "minilm_itq_mean_vector.npy", minilm_itq_params['mean_vector'])
print(f"  Saved: minilm_itq_mean_vector.npy - shape {minilm_itq_params['mean_vector'].shape}")

# PCA matrix
np.save(EXPORT_DIR / "minilm_itq_pca_matrix.npy", minilm_itq_params['pca_matrix'])
print(f"  Saved: minilm_itq_pca_matrix.npy - shape {minilm_itq_params['pca_matrix'].shape}")

# Rotation matrix
np.save(EXPORT_DIR / "minilm_itq_rotation_matrix.npy", minilm_itq_params['rotation_matrix'])
print(f"  Saved: minilm_itq_rotation_matrix.npy - shape {minilm_itq_params['rotation_matrix'].shape}")

# Metadata (settings)
minilm_metadata = {
    'n_bits': minilm_itq_params['n_bits'],
    'n_iterations': minilm_itq_params['n_iterations'],
    'seed': minilm_itq_params['seed'],
    'model_name': 'sentence-transformers/all-MiniLM-L6-v2',
    'embedding_dim': 384
}
# np.save pickles a dict as a 0-d object array; reload with np.load(..., allow_pickle=True).item()
np.save(EXPORT_DIR / "minilm_itq_metadata.npy", minilm_metadata)
print(f"  Saved: minilm_itq_metadata.npy - {minilm_metadata}")

Exporting MiniLM ITQ parameters...
  Saved: minilm_itq_mean_vector.npy - shape (384,)
  Saved: minilm_itq_pca_matrix.npy - shape (384, 128)
  Saved: minilm_itq_rotation_matrix.npy - shape (128, 128)
  Saved: minilm_itq_metadata.npy - {'n_bits': 128, 'n_iterations': 50, 'seed': 42, 'model_name': 'sentence-transformers/all-MiniLM-L6-v2', 'embedding_dim': 384}


In [7]:
# Check and save the MiniLM pivot data
print("\nChecking MiniLM Pivot data...")

minilm_pivots = np.load(DATA_DIR / "pivots_8_minilm.npy")
print(f"  Original file: pivots_8_minilm.npy")
print(f"  Shape: {minilm_pivots.shape}")
print(f"  Dtype: {minilm_pivots.dtype}")

# Copy for export
np.save(EXPORT_DIR / "minilm_pivots_8.npy", minilm_pivots)
print(f"  Saved: minilm_pivots_8.npy")


Checking MiniLM Pivot data...
  Original file: pivots_8_minilm.npy
  Shape: (8, 384)
  Dtype: float64
  Saved: minilm_pivots_8.npy
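
The cell outputs show that the MiniLM mean vector and pivots are float64, while the PCA and rotation matrices (and all E5-base arrays) are float32. If the target system expects a single dtype, which is an assumption about the consumer, one option is to cast to float32 before exporting. The sketch below uses random stand-in data with the same shape as the MiniLM pivots and reports the rounding error the cast introduces.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the float64 MiniLM pivots (8, 384); random values, not real data.
pivots_f64 = rng.standard_normal((8, 384))

# Cast to float32 so every exported array shares one dtype.
pivots_f32 = pivots_f64.astype(np.float32)

# Worst-case rounding error introduced by the cast.
max_abs_err = np.max(np.abs(pivots_f64 - pivots_f32))

print(pivots_f32.dtype, pivots_f32.nbytes, pivots_f64.nbytes)
print(f"max abs error: {max_abs_err:.2e}")
```

Whether this error matters depends on how close the rotated values sit to zero before sign quantization, so it is worth comparing hashes before and after the cast on real embeddings.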


## 3. Confirming usage (reuse test)

In [8]:
def itq_transform(embedding, mean_vector, pca_matrix, rotation_matrix):
    """
    Convert embedding vectors into ITQ hashes.
    
    Args:
        embedding: input vector(s), (dim,) or (n, dim)
        mean_vector: mean vector, (dim,)
        pca_matrix: PCA projection matrix, (dim, n_bits)
        rotation_matrix: ITQ rotation matrix, (n_bits, n_bits)
    
    Returns:
        Binary hash, (n_bits,) or (n, n_bits)
    """
    single_input = embedding.ndim == 1
    if single_input:
        embedding = embedding.reshape(1, -1)
    
    # Centering
    centered = embedding - mean_vector
    
    # PCA projection
    projected = centered @ pca_matrix
    
    # ITQ rotation
    rotated = projected @ rotation_matrix
    
    # Quantize by sign
    binary_hash = (rotated > 0).astype(np.uint8)
    
    if single_input:
        return binary_hash[0]
    return binary_hash

print("ITQ transform function defined.")

ITQ transform function defined.
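
`itq_transform` returns one `uint8` per bit, so a 128-bit hash occupies 128 bytes. For storage or transfer, the bits can be packed into 16 bytes with `np.packbits`, and the Hamming distance can be computed on the packed form. This is a minimal sketch using random stand-in bits rather than real hashes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for two 128-bit outputs of itq_transform (one uint8 per bit).
hash_a = rng.integers(0, 2, size=128, dtype=np.uint8)
hash_b = hash_a.copy()
hash_b[:5] ^= 1  # flip the first five bits

# Pack 8 bits per byte: 128 bits -> 16 bytes.
packed_a = np.packbits(hash_a)
packed_b = np.packbits(hash_b)

# Hamming distance on the packed form: XOR, then count the set bits.
hamming = int(np.unpackbits(packed_a ^ packed_b).sum())

print(packed_a.shape, hamming)  # (16,) 5
```

`np.unpackbits(packed_a)` restores the original 128-element bit array, so the packed form loses no information.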


In [9]:
# Reuse test with E5-base
print("Testing E5-base ITQ reload...")

# Load from the npy files
e5_mean = np.load(EXPORT_DIR / "e5_base_itq_mean_vector.npy")
e5_pca = np.load(EXPORT_DIR / "e5_base_itq_pca_matrix.npy")
e5_rot = np.load(EXPORT_DIR / "e5_base_itq_rotation_matrix.npy")
e5_pivots_reload = np.load(EXPORT_DIR / "e5_base_pivots_8.npy")

# Test with a dummy 768-dimensional vector
dummy_e5_emb = np.random.randn(768).astype(np.float32)
e5_hash = itq_transform(dummy_e5_emb, e5_mean, e5_pca, e5_rot)

print(f"  Input embedding shape: {dummy_e5_emb.shape}")
print(f"  Output hash shape: {e5_hash.shape}")
print(f"  Output hash (first 32 bits): {e5_hash[:32]}")
print(f"  Pivots shape: {e5_pivots_reload.shape}")
print(f"  Pivots dtype: {e5_pivots_reload.dtype}")
assert e5_pivots_reload.shape == (8, 768), f"Expected (8, 768), got {e5_pivots_reload.shape}"
print(f"  ✓ E5-base reload test passed!")

Testing E5-base ITQ reload...
  Input embedding shape: (768,)
  Output hash shape: (128,)
  Output hash (first 32 bits): [1 1 0 1 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 1 1 0 0 0 0]
  Pivots shape: (8, 768)
  Pivots dtype: float32
  ✓ E5-base reload test passed!


In [10]:
# Reuse test with MiniLM
print("Testing MiniLM ITQ reload...")

# Load from the npy files
minilm_mean = np.load(EXPORT_DIR / "minilm_itq_mean_vector.npy")
minilm_pca = np.load(EXPORT_DIR / "minilm_itq_pca_matrix.npy")
minilm_rot = np.load(EXPORT_DIR / "minilm_itq_rotation_matrix.npy")
minilm_pivots_reload = np.load(EXPORT_DIR / "minilm_pivots_8.npy")

# Test with a dummy 384-dimensional vector
dummy_minilm_emb = np.random.randn(384).astype(np.float32)
minilm_hash = itq_transform(dummy_minilm_emb, minilm_mean, minilm_pca, minilm_rot)

print(f"  Input embedding shape: {dummy_minilm_emb.shape}")
print(f"  Output hash shape: {minilm_hash.shape}")
print(f"  Output hash (first 32 bits): {minilm_hash[:32]}")
print(f"  Pivots shape: {minilm_pivots_reload.shape}")
print(f"  ✓ MiniLM reload test passed!")

Testing MiniLM ITQ reload...
  Input embedding shape: (384,)
  Output hash shape: (128,)
  Output hash (first 32 bits): [0 0 1 1 0 0 0 1 1 1 1 0 1 1 1 0 0 1 0 1 1 0 1 1 1 0 0 0 0 1 0 0]
  Pivots shape: (8, 384)
  ✓ MiniLM reload test passed!
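
The reload tests above confirm shapes using a random input. A stricter check is to compare the reloaded arrays with the originals element by element and confirm that both produce identical hashes. The sketch below does this with synthetic parameters in a temporary directory; the helper name `to_hash` and the dict layout are assumptions for illustration, not part of the exported files.

```python
import tempfile
from pathlib import Path

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins with the same shapes and dtypes as the E5-base parameters.
params = {
    "mean_vector": rng.standard_normal(768).astype(np.float32),
    "pca_matrix": rng.standard_normal((768, 128)).astype(np.float32),
    "rotation_matrix": rng.standard_normal((128, 128)).astype(np.float32),
}


def to_hash(x, p):
    """Centre, project, rotate, then take the sign (same steps as itq_transform)."""
    rotated = (x - p["mean_vector"]) @ p["pca_matrix"] @ p["rotation_matrix"]
    return (rotated > 0).astype(np.uint8)


with tempfile.TemporaryDirectory() as tmp:
    # Save each array to .npy and load it back.
    reloaded = {}
    for name, array in params.items():
        path = Path(tmp) / f"{name}.npy"
        np.save(path, array)
        reloaded[name] = np.load(path)

# Every array should survive the round trip unchanged.
arrays_match = all(np.array_equal(params[k], reloaded[k]) for k in params)

# The same inputs should therefore hash identically.
batch = rng.standard_normal((4, 768)).astype(np.float32)
hashes_match = np.array_equal(to_hash(batch, params), to_hash(batch, reloaded))

print(arrays_match, hashes_match)
```

In the notebook itself, the same comparison could be run between `e5_itq_params` and the arrays loaded from `EXPORT_DIR`.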


## 4. List of exported files

In [11]:
import os

print("="*70)
print("Exported Files Summary")
print("="*70)
print(f"\nExport directory: {EXPORT_DIR.absolute()}")
print("\n" + "-"*70)

# File list
files = sorted(EXPORT_DIR.glob("*.npy"))
print(f"\n{'File Name':<45} {'Size':>10}")
print("-"*60)
for f in files:
    size = os.path.getsize(f)
    if size > 1024*1024:
        size_str = f"{size/1024/1024:.1f} MB"
    elif size > 1024:
        size_str = f"{size/1024:.1f} KB"
    else:
        size_str = f"{size} B"
    print(f"{f.name:<45} {size_str:>10}")

Exported Files Summary

Export directory: /home/terapyon/dev/vibe-coding/lsh-cascade-poc/notebooks/../data/export

----------------------------------------------------------------------

File Name                                           Size
------------------------------------------------------------
e5_base_itq_mean_vector.npy                       3.1 KB
e5_base_itq_metadata.npy                           378 B
e5_base_itq_pca_matrix.npy                      384.1 KB
e5_base_itq_rotation_matrix.npy                  64.1 KB
e5_base_pivots_8.npy                             24.1 KB
minilm_itq_mean_vector.npy                        3.1 KB
minilm_itq_metadata.npy                            387 B
minilm_itq_pca_matrix.npy                       192.1 KB
minilm_itq_rotation_matrix.npy                   64.1 KB
minilm_pivots_8.npy                              24.1 KB


In [12]:
print("\n" + "="*70)
print("Files by Model")
print("="*70)

print("\n[multilingual-e5-base (768 dimensions)]")
print("  ITQ LSH:")
print(f"    - e5_base_itq_mean_vector.npy    : mean vector (768,)")
print(f"    - e5_base_itq_pca_matrix.npy     : PCA matrix (768, 128)")
print(f"    - e5_base_itq_rotation_matrix.npy: rotation matrix (128, 128)")
print(f"    - e5_base_itq_metadata.npy       : metadata")
print("  Pivot:")
print(f"    - e5_base_pivots_8.npy           : 8 pivot vectors (8, 768)")

print("\n[all-MiniLM-L6-v2 (384 dimensions)]")
print("  ITQ LSH:")
print(f"    - minilm_itq_mean_vector.npy     : mean vector (384,)")
print(f"    - minilm_itq_pca_matrix.npy      : PCA matrix (384, 128)")
print(f"    - minilm_itq_rotation_matrix.npy : rotation matrix (128, 128)")
print(f"    - minilm_itq_metadata.npy        : metadata")
print("  Pivot:")
print(f"    - minilm_pivots_8.npy            : 8 pivot vectors (8, 384)")


Files by Model

[multilingual-e5-base (768 dimensions)]
  ITQ LSH:
    - e5_base_itq_mean_vector.npy    : mean vector (768,)
    - e5_base_itq_pca_matrix.npy     : PCA matrix (768, 128)
    - e5_base_itq_rotation_matrix.npy: rotation matrix (128, 128)
    - e5_base_itq_metadata.npy       : metadata
  Pivot:
    - e5_base_pivots_8.npy           : 8 pivot vectors (8, 768)

[all-MiniLM-L6-v2 (384 dimensions)]
  ITQ LSH:
    - minilm_itq_mean_vector.npy     : mean vector (384,)
    - minilm_itq_pca_matrix.npy      : PCA matrix (384, 128)
    - minilm_itq_rotation_matrix.npy : rotation matrix (128, 128)
    - minilm_itq_metadata.npy        : metadata
  Pivot:
    - minilm_pivots_8.npy            : 8 pivot vectors (8, 384)


---

# Export summary

## Output directory
`data/export/`

## multilingual-e5-base (768 dimensions)

| File | Shape | Purpose |
|------|-------|---------|
| `e5_base_itq_mean_vector.npy` | (768,) | ITQ: mean vector |
| `e5_base_itq_pca_matrix.npy` | (768, 128) | ITQ: PCA projection matrix |
| `e5_base_itq_rotation_matrix.npy` | (128, 128) | ITQ: rotation matrix |
| `e5_base_itq_metadata.npy` | dict | ITQ: metadata |
| `e5_base_pivots_8.npy` | (8, 768) | Pivot: 8 pivot vectors |

## all-MiniLM-L6-v2 (384 dimensions)

| File | Shape | Purpose |
|------|-------|---------|
| `minilm_itq_mean_vector.npy` | (384,) | ITQ: mean vector |
| `minilm_itq_pca_matrix.npy` | (384, 128) | ITQ: PCA projection matrix |
| `minilm_itq_rotation_matrix.npy` | (128, 128) | ITQ: rotation matrix |
| `minilm_itq_metadata.npy` | dict | ITQ: metadata |
| `minilm_pivots_8.npy` | (8, 384) | Pivot: 8 pivot vectors |

## Usage

### ITQ hash transform
```python
import numpy as np

# Load the exported files
mean = np.load("e5_base_itq_mean_vector.npy")
pca = np.load("e5_base_itq_pca_matrix.npy")
rot = np.load("e5_base_itq_rotation_matrix.npy")

# Transform function
def itq_transform(embedding, mean, pca, rot):
    centered = embedding - mean      # centering
    projected = centered @ pca       # PCA projection to n_bits dimensions
    rotated = projected @ rot        # ITQ rotation
    return (rotated > 0).astype(np.uint8)  # quantize by sign

# Usage example
# get_embedding is a placeholder for your embedding model call; it is not defined in this notebook.
embedding = get_embedding(text)  # (768,)
hash_code = itq_transform(embedding, mean, pca, rot)  # (128,)
```

### Pivot filtering
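```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load the exported file
pivots = np.load("e5_base_pivots_8.npy")  # (8, 768)

# Distances between the query and each pivot
# get_embedding is a placeholder for your embedding model call; it is not defined in this notebook.
query_embedding = get_embedding(query)  # (768,)
query_pivot_dist = 1 - cosine_similarity(query_embedding.reshape(1, -1), pivots)[0]

# Candidate filtering (example threshold: 0.20)
# doc_pivot_distances: (n_docs, 8) distances from each document to the pivots
def filter_candidates(doc_pivot_distances, query_pivot_dist, threshold=0.20):
    diff = np.abs(doc_pivot_distances - query_pivot_dist)
    return np.max(diff, axis=1) < threshold
```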
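
The snippet above assumes a precomputed `doc_pivot_distances` array of shape `(n_docs, 8)` but does not build it. The sketch below runs the whole flow on random toy vectors with plain NumPy. The helper `cosine_distance`, the toy data, and the tighter `threshold=0.02` are assumptions for illustration: random high-dimensional vectors all sit at nearly the same distance from every pivot, so 0.20 would keep every toy document. Note also that `1 - cosine similarity` does not satisfy the triangle inequality, so this filter is a heuristic and may discard true neighbours.

```python
import numpy as np


def cosine_distance(a, b):
    """Pairwise 1 - cosine similarity between rows of a and rows of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_unit @ b_unit.T


def filter_candidates(doc_pivot_distances, query_pivot_dist, threshold=0.20):
    """Keep documents whose pivot distances all lie within threshold of the query's."""
    diff = np.abs(doc_pivot_distances - query_pivot_dist)
    return np.max(diff, axis=1) < threshold


rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 768))            # toy document embeddings
pivots = rng.standard_normal((8, 768))             # toy pivots
query = docs[0] + 0.01 * rng.standard_normal(768)  # near-duplicate of document 0

# Precompute once per corpus: (n_docs, n_pivots).
doc_pivot_distances = cosine_distance(docs, pivots)

# Compute once per query: (n_pivots,).
query_pivot_dist = cosine_distance(query[None, :], pivots)[0]

# Toy threshold; see the note above about why it is tighter than 0.20.
mask = filter_candidates(doc_pivot_distances, query_pivot_dist, threshold=0.02)
print(mask.shape, bool(mask[0]), int(mask.sum()))
```

Document 0 passes the filter because the query is a near-duplicate of it; only the surviving documents would then go on to a full similarity comparison.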