# 75. ANN Benchmark Summary

## Purpose
- Aggregate results from experiments 73-74
- Master comparison: Exact vs HNSW vs C-only vs C+Pivot
- Analyze ITQ hash quality across different data characteristics
- Determine when ITQ-LSH + Pivot is practical vs HNSW

## 0. Setup

In [1]:
import numpy as np
from pathlib import Path

DATA_DIR = Path('../data')
DATASET_KEYS = ['glove', 'sift', 'fashion', 'nytimes', 'gist']
TOP_K_VALUES = [1, 10, 100]
CANDIDATE_LIMITS = [100, 500, 1000, 2000, 5000]
PIVOT_THRESHOLDS = [15, 20, 25, 30]

## 1. Load All Results

In [2]:
# From notebook 73 (each file stores a pickled dict, hence allow_pickle=True and .item())
c_only_results = np.load(DATA_DIR / 'ann_c_only_results.npy', allow_pickle=True).item()
quality_results = np.load(DATA_DIR / 'ann_quality_results.npy', allow_pickle=True).item()
filter_recalls = np.load(DATA_DIR / 'ann_filter_recalls.npy', allow_pickle=True).item()

# From notebook 74
c_pivot_results = np.load(DATA_DIR / 'ann_c_pivot_results.npy', allow_pickle=True).item()
hnsw_results = np.load(DATA_DIR / 'ann_hnsw_results.npy', allow_pickle=True).item()

# Dataset metadata
dataset_info = {}
for key in DATASET_KEYS:
    train_path = DATA_DIR / f'ann_{key}_train.npy'
    if not train_path.exists():
        continue
    train = np.load(train_path, mmap_mode='r')
    dataset_info[key] = {'n_train': train.shape[0], 'dim': train.shape[1]}

available_keys = [k for k in DATASET_KEYS if k in dataset_info]
print(f'Loaded results for {len(available_keys)} datasets: {available_keys}')

Loaded results for 5 datasets: ['glove', 'sift', 'fashion', 'nytimes', 'gist']


## 2. Master Comparison Table: Recall@10

In [3]:
print('='*100)
print('Master Comparison: Recall@10')
print('='*100)

L_VALS = [500, 1000, 2000]
T_VALS = [20, 25]

# Header
header = f'{"Dataset":<10} {"D":>4} {"N":>10} {"HNSW":>7}'
for L in L_VALS:
    header += f' {"C(L="+str(L)+")":>10}'
for t in T_VALS:
    for L in [1000, 2000]:
        header += f' {"P("+str(t)+","+str(L)+")":>11}'
print(f'\n{header}')
print('-' * len(header))

for key in available_keys:
    info = dataset_info[key]
    
    # HNSW
    if key in hnsw_results and hnsw_results[key]['recalls'].get(10) is not None:
        hnsw_r10 = f'{hnsw_results[key]["recalls"][10]*100:.1f}%'
    else:
        hnsw_r10 = 'N/A'
    
    line = f'{key:<10} {info["dim"]:>4} {info["n_train"]:>10,} {hnsw_r10:>7}'
    
    # C-only
    for L in L_VALS:
        r10 = c_only_results[key][L][10]
        line += f' {r10*100:>9.1f}%'
    
    # C+Pivot
    for t in T_VALS:
        for L in [1000, 2000]:
            r10 = c_pivot_results[key][t][L]['recalls'][10]
            line += f' {r10*100:>10.1f}%'
    
    print(line)

Master Comparison: Recall@10

Dataset       D          N    HNSW   C(L=500)  C(L=1000)  C(L=2000)  P(20,1000)  P(20,2000)  P(25,1000)  P(25,2000)
-------------------------------------------------------------------------------------------------------------------
glove       100  1,183,514   79.9%      58.5%      66.9%      74.8%       67.0%       74.6%       66.7%       74.8%
sift        128  1,000,000   97.9%      76.2%      85.0%      91.6%       85.3%       91.4%       85.3%       91.5%
fashion     784     60,000   99.3%      97.2%      99.1%      99.7%       98.9%       99.4%       99.2%       99.8%
nytimes     256    290,000   84.7%      68.9%      74.2%      79.4%       73.0%       78.0%       73.6%       78.6%
gist        960  1,000,000   81.6%      48.1%      59.4%      70.2%       59.5%       70.3%       59.4%       70.4%
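
The C-only columns can be read as a two-stage search: a Hamming shortlist of L candidates followed by an exact cosine rerank. Below is a minimal sketch under stated assumptions, not the notebook 73 implementation: random sign projections stand in for the ITQ rotation, the data is synthetic, and `sign_hash`, `hamming_to_all`, `c_only_search`, and `recall_at_k` are hypothetical helper names.

```python
import numpy as np

def sign_hash(x, proj):
    """Pack the sign bits of x @ proj into uint8 codes (stand-in for ITQ)."""
    bits = (x @ proj) > 0
    return np.packbits(bits, axis=1)

def hamming_to_all(q_code, db_codes):
    """Hamming distance from one packed code to every packed database code."""
    xor = np.bitwise_xor(db_codes, q_code)
    return np.unpackbits(xor, axis=1).sum(axis=1)

def c_only_search(q, q_code, db, db_codes, L, k):
    """Take the L nearest codes by Hamming distance, then rerank by cosine."""
    d_ham = hamming_to_all(q_code, db_codes)
    shortlist = np.argsort(d_ham, kind='stable')[:L]
    cand = db[shortlist]
    cos = cand @ q / (np.linalg.norm(cand, axis=1) * np.linalg.norm(q))
    return shortlist[np.argsort(-cos)[:k]]

def recall_at_k(found, truth):
    """Fraction of the true neighbour ids that appear in the returned ids."""
    return len(set(found.tolist()) & set(truth.tolist())) / len(truth)

# Synthetic data: 2,000 vectors in 32 dimensions, hashed to 64 bits
rng = np.random.default_rng(0)
db = rng.standard_normal((2000, 32))
q = rng.standard_normal(32)
proj = rng.standard_normal((32, 64))
db_codes = sign_hash(db, proj)
q_code = sign_hash(q[None, :], proj)[0]

# Exact top-10 by cosine similarity serves as ground truth
cos_all = db @ q / (np.linalg.norm(db, axis=1) * np.linalg.norm(q))
truth = np.argsort(-cos_all)[:10]

for L in (100, 500, 2000):
    found = c_only_search(q, q_code, db, db_codes, L, k=10)
    print(f'L={L}: recall@10 = {recall_at_k(found, truth):.2f}')
```

With L equal to the database size, the shortlist contains every vector and the rerank reduces to exact search, so recall depends only on how many true neighbours survive the Hamming stage.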


## 3. Recall@K Detail (K=1, 10, 100)

In [4]:
print('='*100)
print('Recall@K Detail (L=1000 for C-only/C+Pivot, t=25 for C+Pivot)')
print('='*100)

L_DETAIL = 1000
T_DETAIL = 25

for k_val in TOP_K_VALUES:
    print(f'\n--- Recall@{k_val} ---')
    print(f'{"Dataset":<10} {"HNSW":>8} {"C-only":>8} {"C+Pivot":>8} {"Gap(HNSW-C)":>12} {"Gap(C-CP)":>10}')
    print('-'*60)
    
    for key in available_keys:
        # HNSW
        if key in hnsw_results and hnsw_results[key]['recalls'].get(k_val) is not None:
            hnsw_r = hnsw_results[key]['recalls'][k_val]
            hnsw_str = f'{hnsw_r*100:.1f}%'
        else:
            hnsw_r = None
            hnsw_str = 'N/A'
        
        c_r = c_only_results[key][L_DETAIL][k_val]
        cp_r = c_pivot_results[key][T_DETAIL][L_DETAIL]['recalls'][k_val]
        
        gap_hnsw_c = f'{(hnsw_r - c_r)*100:+.1f}pp' if hnsw_r is not None else 'N/A'
        gap_c_cp = f'{(c_r - cp_r)*100:+.1f}pp'
        
        print(f'{key:<10} {hnsw_str:>8} {c_r*100:>7.1f}% {cp_r*100:>7.1f}% {gap_hnsw_c:>12} {gap_c_cp:>10}')

Recall@K Detail (L=1000 for C-only/C+Pivot, t=25 for C+Pivot)

--- Recall@1 ---
Dataset        HNSW   C-only  C+Pivot  Gap(HNSW-C)  Gap(C-CP)
------------------------------------------------------------
glove         84.8%    81.8%    81.1%       +3.0pp     +0.7pp
sift          98.1%    91.5%    91.8%       +6.6pp     -0.3pp
fashion       99.2%    99.6%    99.8%       -0.4pp     -0.2pp
nytimes       82.3%    81.2%    80.8%       +1.1pp     +0.4pp
gist          86.8%    72.9%    73.2%      +13.9pp     -0.3pp

--- Recall@10 ---
Dataset        HNSW   C-only  C+Pivot  Gap(HNSW-C)  Gap(C-CP)
------------------------------------------------------------
glove         79.9%    66.9%    66.7%      +12.9pp     +0.2pp
sift          97.9%    85.0%    85.3%      +12.9pp     -0.4pp
fashion       99.3%    99.1%    99.2%       +0.2pp     -0.0pp
nytimes       84.7%    74.2%    73.6%      +10.5pp     +0.6pp
gist          81.6%    59.4%    59.4%      +22.2pp     +0.0pp

--- Recall@100 ---
Dataset        

## 4. Hash Quality vs Data Characteristics

In [5]:
print('='*80)
print('Hash Quality: Hamming-Cosine Spearman Correlation')
print('='*80)

print(f'\n{"Dataset":<10} {"Dim":>5} {"N":>10} {"Spearman":>10} {"Quality":>10}')
print('-'*50)

for key in available_keys:
    info = dataset_info[key]
    corr = quality_results[key]
    
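    # Quality rating: only a negative correlation counts as good, so rate on
    # -corr; abs(corr) would also reward a (wrongly) positive correlation.
    strength = -corr
    if strength > 0.6:
        quality = 'Good'
    elif strength > 0.4:
        quality = 'Fair'
    elif strength > 0.2:
        quality = 'Weak'
    else:
        quality = 'Poor'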
    
    print(f'{key:<10} {info["dim"]:>5} {info["n_train"]:>10,} {corr:>10.4f} {quality:>10}')

print('\nNote: Negative Spearman = larger Hamming ~ lower cosine (expected, good)')
print('Stronger negative correlation -> better hash quality')

Hash Quality: Hamming-Cosine Spearman Correlation

Dataset      Dim          N   Spearman    Quality
--------------------------------------------------
glove        100  1,183,514    -0.5083       Fair
sift         128  1,000,000    -0.9252       Good
fashion      784     60,000    -0.7446       Good
nytimes      256    290,000    -0.5084       Fair
gist         960  1,000,000    -0.4551       Fair

Note: Negative Spearman = larger Hamming ~ lower cosine (expected, good)
Stronger negative correlation -> better hash quality
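
How the correlations in `quality_results` were computed is not shown in this notebook. As a rough illustration of the metric only, the sketch below computes a tie-aware Spearman correlation between Hamming distance and cosine similarity over random vector pairs. It assumes synthetic Gaussian data and random sign projections in place of ITQ; `average_ranks` and `spearman` are hypothetical helpers written with NumPy alone.

```python
import numpy as np

def average_ranks(x):
    """Zero-based ranks where tied values share the mean of their positions."""
    x = np.asarray(x)
    order = np.argsort(x, kind='stable')
    sorted_x = x[order]
    _, first_idx, counts = np.unique(sorted_x, return_index=True, return_counts=True)
    group_rank = first_idx + (counts - 1) / 2.0
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.repeat(group_rank, counts)
    return ranks

def spearman(a, b):
    """Spearman correlation: the Pearson correlation of the average ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

# Synthetic vector pairs (assumption: not one of the benchmark datasets)
rng = np.random.default_rng(0)
n_pairs, dim, n_bits = 2000, 32, 128
a = rng.standard_normal((n_pairs, dim))
b = rng.standard_normal((n_pairs, dim))
proj = rng.standard_normal((dim, n_bits))  # random hyperplanes, not ITQ

cosine = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
hamming = np.sum((a @ proj > 0) != (b @ proj > 0), axis=1)

rho = spearman(hamming, cosine)
print(f'Spearman(Hamming, cosine) = {rho:.4f}')
```

Average ranks matter here because Hamming distances are small integers and therefore heavily tied; ranking ties arbitrarily would add noise to the correlation.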


## 5. Pivot Pruning Analysis

In [6]:
print('='*80)
print('Pivot Pruning: Reduction Rate vs Filter Recall vs Final Recall@10 (L=1000)')
print('='*80)

L_PIVOT = 1000

for key in available_keys:
    info = dataset_info[key]
    print(f'\n{key} (N={info["n_train"]:,}, D={info["dim"]}):')
    print(f'  {"Threshold":>10} {"Reduction":>10} {"FilterR@10":>12} {"FinalR@10":>10} {"AvgCands":>10}')
    print(f'  {"-"*56}')
    
    # No pivot (C-only)
    c_r10 = c_only_results[key][L_PIVOT][10]
    print(f'  {"(no pivot)":>10} {"0.0%":>10} {"100.0%":>12} {c_r10*100:>9.1f}% {info["n_train"]:>10,}')
    
    for t in PIVOT_THRESHOLDS:
        r = c_pivot_results[key][t][L_PIVOT]
        print(f'  {t:>10} {r["reduction_rate"]*100:>9.1f}% {r["filter_recall_10"]*100:>11.1f}% '
              f'{r["recalls"][10]*100:>9.1f}% {r["mean_candidates"]:>10,.0f}')

Pivot Pruning: Reduction Rate vs Filter Recall vs Final Recall@10 (L=1000)

glove (N=1,183,514, D=100):
   Threshold  Reduction   FilterR@10  FinalR@10   AvgCands
  --------------------------------------------------------
  (no pivot)       0.0%       100.0%      66.9%  1,183,514
          15      24.3%        95.7%      65.7%    896,148
          20       4.7%        99.8%      67.0%  1,128,130
          25       0.7%       100.0%      66.7%  1,175,447
          30       0.1%       100.0%      67.1%  1,182,525

sift (N=1,000,000, D=128):
   Threshold  Reduction   FilterR@10  FinalR@10   AvgCands
  --------------------------------------------------------
  (no pivot)       0.0%       100.0%      85.0%  1,000,000
          15      79.3%        98.8%      84.9%    207,124
          20      66.8%        99.9%      85.3%    332,284
          25      59.3%       100.0%      85.3%    407,290
          30      55.1%       100.0%      85.4%    449,048

fashion (N=60,000, D=784):
   Threshold  

## 6. Candidate Limit to Match HNSW

In [7]:
print('='*80)
print('Candidate Limit Needed to Match HNSW Recall@10')
print('='*80)

for key in available_keys:
    info = dataset_info[key]
    
    if key not in hnsw_results or hnsw_results[key]['recalls'].get(10) is None:
        print(f'\n{key}: HNSW not available')
        continue
    
    hnsw_r10 = hnsw_results[key]['recalls'][10]
    
    print(f'\n{key} (HNSW R@10={hnsw_r10*100:.1f}%):')
    
    # C-only: find smallest L that achieves various fractions of HNSW R@10
    targets = [0.90, 0.95, 0.99]
    for target in targets:
        target_r10 = hnsw_r10 * target
        found_L = None
        for L in CANDIDATE_LIMITS:
            if c_only_results[key][L][10] >= target_r10:
                found_L = L
                break
        
        status = f'L={found_L}' if found_L is not None else f'L>{CANDIDATE_LIMITS[-1]}'
        actual = c_only_results[key][found_L][10] if found_L is not None else c_only_results[key][CANDIDATE_LIMITS[-1]][10]
        print(f'  {target*100:.0f}% of HNSW ({target_r10*100:.1f}%): C-only needs {status} '
              f'(achieves {actual*100:.1f}%)')

Candidate Limit Needed to Match HNSW Recall@10

glove (HNSW R@10=79.9%):
  90% of HNSW (71.9%): C-only needs L=2000 (achieves 74.8%)
  95% of HNSW (75.9%): C-only needs L=5000 (achieves 83.9%)
  99% of HNSW (79.1%): C-only needs L=5000 (achieves 83.9%)

sift (HNSW R@10=97.9%):
  90% of HNSW (88.1%): C-only needs L=2000 (achieves 91.6%)
  95% of HNSW (93.0%): C-only needs L=5000 (achieves 96.9%)
  99% of HNSW (96.9%): C-only needs L>5000 (achieves 96.9%)

fashion (HNSW R@10=99.3%):
  90% of HNSW (89.3%): C-only needs L=500 (achieves 97.2%)
  95% of HNSW (94.3%): C-only needs L=500 (achieves 97.2%)
  99% of HNSW (98.3%): C-only needs L=1000 (achieves 99.1%)

nytimes (HNSW R@10=84.7%):
  90% of HNSW (76.2%): C-only needs L=2000 (achieves 79.4%)
  95% of HNSW (80.5%): C-only needs L=5000 (achieves 85.8%)
  99% of HNSW (83.8%): C-only needs L=5000 (achieves 85.8%)

gist (HNSW R@10=81.6%):
  90% of HNSW (73.5%): C-only needs L=5000 (achieves 83.0%)
  95% of HNSW (77.5%): C-only needs L=5000 

## 7. Filter Recall Analysis

In [8]:
print('='*80)
print('Filter Recall: True top-10 in Hamming candidates (C-only, no pivot)')
print('='*80)

print(f'\n{"Dataset":<10}', end='')
for L in CANDIDATE_LIMITS:
    print(f' {"L="+str(L):>8}', end='')
print()
print('-'*60)

for key in available_keys:
    print(f'{key:<10}', end='')
    for L in CANDIDATE_LIMITS:
        print(f' {filter_recalls[key][L]*100:>7.1f}%', end='')
    print()

Filter Recall: True top-10 in Hamming candidates (C-only, no pivot)

Dataset       L=100    L=500   L=1000   L=2000   L=5000
------------------------------------------------------------
glove         38.6%    58.5%    66.9%    74.8%    83.9%
sift          50.0%    76.2%    85.0%    91.7%    97.0%
fashion       79.7%    97.2%    99.1%    99.7%    99.9%
nytimes       55.9%    69.8%    75.1%    80.4%    86.8%
gist          26.5%    48.1%    59.5%    70.2%    83.0%
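
Filter recall asks how many of the true top-10 neighbours survive the Hamming shortlist, before any reranking. The sketch below evaluates several candidate limits from a single sort of the Hamming distances. `filter_recall_curve` is a hypothetical helper and the distances are toy values, not benchmark data.

```python
import numpy as np

def filter_recall_curve(hamming_dists, true_topk, limits):
    """For each limit L, the fraction of true_topk ids among the L nearest by Hamming."""
    order = np.argsort(hamming_dists, kind='stable')
    # rank_of[i] is the position of item i in the Hamming ordering
    rank_of = np.empty(len(order), dtype=np.int64)
    rank_of[order] = np.arange(len(order))
    true_ranks = rank_of[np.asarray(true_topk)]
    return {L: float(np.mean(true_ranks < L)) for L in limits}

# Toy example: 8 database items, 3 true neighbours
hamming_dists = np.array([5, 1, 9, 3, 7, 2, 8, 4])
true_topk = [1, 3, 6]
curve = filter_recall_curve(hamming_dists, true_topk, limits=[2, 4, 8])
for L, r in curve.items():
    print(f'L={L}: filter recall = {r:.2f}')
```

Because an exact cosine rerank keeps every true neighbour that reaches the shortlist, filter recall bounds the final recall of the C-only pipeline from above, which may be why the two tables nearly coincide.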


## 8. Conclusions

In [9]:
print('='*80)
print('CONCLUSIONS')
print('='*80)

print('\n1. ITQ-LSH Hash Quality:')
for key in available_keys:
    corr = quality_results[key]
    print(f'   {key}: Spearman={corr:.4f}')

print('\n2. C-only Pipeline (ITQ -> Hamming -> Cosine rerank):')
for key in available_keys:
    r10_1k = c_only_results[key][1000][10]
    r10_5k = c_only_results[key][5000][10]
    print(f'   {key}: R@10={r10_1k*100:.1f}% (L=1000), {r10_5k*100:.1f}% (L=5000)')

print('\n3. C+Pivot Pipeline:')
for key in available_keys:
    # Most aggressive (smallest) threshold that keeps >= 90% of C-only R@10
    c_r10 = c_only_results[key][1000][10]
    best_t = None
    for t in sorted(PIVOT_THRESHOLDS):
        cp_r10 = c_pivot_results[key][t][1000]['recalls'][10]
        if cp_r10 >= c_r10 * 0.90:
            best_t = t
            break
    if best_t is not None:
        r = c_pivot_results[key][best_t][1000]
        print(f'   {key}: Best t={best_t} -> R@10={r["recalls"][10]*100:.1f}%, '
              f'reduction={r["reduction_rate"]*100:.1f}%')
    else:
        r = c_pivot_results[key][PIVOT_THRESHOLDS[-1]][1000]
        print(f'   {key}: Even t={PIVOT_THRESHOLDS[-1]} -> R@10={r["recalls"][10]*100:.1f}%, '
              f'reduction={r["reduction_rate"]*100:.1f}%')

print('\n4. HNSW Comparison:')
for key in available_keys:
    if key in hnsw_results and hnsw_results[key]['recalls'].get(10) is not None:
        hnsw_r10 = hnsw_results[key]['recalls'][10]
        c_r10 = c_only_results[key][1000][10]
        gap = (hnsw_r10 - c_r10) * 100
        print(f'   {key}: HNSW={hnsw_r10*100:.1f}% vs C-only(L=1000)={c_r10*100:.1f}% (gap={gap:+.1f}pp)')
    else:
        print(f'   {key}: HNSW not available')

print('\n5. Recommendations:')
print('   - ITQ-LSH 128-bit provides a reasonable hash quality baseline across diverse datasets')
print('   - C-only pipeline at L=1000-2000 balances candidate reduction with Recall')
print('   - Pivot pruning (t=20-25) can reduce candidates with controllable Recall loss')
print('   - For high-Recall requirements (>95%), larger candidate limits or HNSW preferred')
print('   - Dataset characteristics (dim, data distribution) significantly affect hash quality')

CONCLUSIONS

1. ITQ-LSH Hash Quality:
   glove: Spearman=-0.5083
   sift: Spearman=-0.9252
   fashion: Spearman=-0.7446
   nytimes: Spearman=-0.5084
   gist: Spearman=-0.4551

2. C-only Pipeline (ITQ -> Hamming -> Cosine rerank):
   glove: R@10=66.9% (L=1000), 83.9% (L=5000)
   sift: R@10=85.0% (L=1000), 96.9% (L=5000)
   fashion: R@10=99.1% (L=1000), 99.9% (L=5000)
   nytimes: R@10=74.2% (L=1000), 85.8% (L=5000)
   gist: R@10=59.4% (L=1000), 83.0% (L=5000)

3. C+Pivot Pipeline:
   glove: Best t=15 -> R@10=65.7%, reduction=24.3%
   sift: Best t=15 -> R@10=84.9%, reduction=79.3%
   fashion: Best t=15 -> R@10=97.3%, reduction=89.3%
   nytimes: Best t=15 -> R@10=68.3%, reduction=35.0%
   gist: Best t=15 -> R@10=56.5%, reduction=80.4%

4. HNSW Comparison:
   glove: HNSW=79.9% vs C-only(L=1000)=66.9% (gap=+12.9pp)
   sift: HNSW=97.9% vs C-only(L=1000)=85.0% (gap=+12.9pp)
   fashion: HNSW=99.3% vs C-only(L=1000)=99.1% (gap=+0.2pp)
   nytimes: HNSW=84.7% vs C-only(L=1000)=74.2% (gap=+10.5pp)


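## 9. Overall Assessment

### Objective Positioning of the ITQ-LSH + Pivot Pipeline

Validation on five standard benchmark datasets clarified the **applicability conditions and limits** of the ITQ-LSH + Pivot pipeline.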

#### Correspondence with Dataset Characteristics

| Dataset | Hash quality | C-only (L=1000) | HNSW | Gap | Pivot effect | Overall verdict |
|---|---|---|---|---|---|---|
| **Fashion** | Good (-0.74) | 99.1% | 99.3% | 0.2pp | Excellent (82% reduction) | **Practical** |
| **SIFT** | Good (-0.93) | 85.0% | 97.9% | 12.9pp | Excellent (67% reduction) | **Conditionally practical** |
| **NYTimes** | Fair (-0.51) | 74.2% | 84.7% | 10.5pp | Poor (9% reduction) | **Limited** |
| **GloVe** | Fair (-0.51) | 66.9% | 79.9% | 12.9pp | Poor (5% reduction) | **Limited** |
| **GIST** | Fair (-0.46) | 59.4% | 81.6% | 22.2pp | Good (53% reduction) | **Unsuitable** |

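#### Conditions Under Which ITQ-LSH Works Well

1. **Hash quality (Spearman) of -0.7 or stronger, i.e. ρ ≤ -0.7**: Fashion and SIFT show a strong correlation, so Hamming distance is a good approximation of cosine similarity
2. **Enough bits are available (n_bits ≥ 128)**: GloVe has dim=100, which limits the hash to 100 bits and leaves its resolution insufficient
3. **The data is relatively low-dimensional (< 256D)**: the larger the information loss when compressing high-dimensional vectors to 128 bits, the more accuracy degrades (GIST: 960D → 128 bits). Fashion (784D, 99.1% at L=1000) is the exception in this benchmark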

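#### Conditions Under Which Pivot Pruning Works Well

1. **Pivot distances in hash space have a large spread (std > 10)**: SIFT (16.1) and Fashion (15.3) show high reduction rates
2. **Inter-pivot distances are sufficiently large**: SIFT (mean=68.5) and GIST (mean=70.5) are well separated
3. **Ineffective on low-variance data**: GloVe (std=5.2) and NYTimes (std=5.7) allow almost no pruning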
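
The pivot filter itself is defined in notebook 74, not here. One common formulation, assumed for this sketch, keeps an item only if its Hamming distance to every pivot is within `t` of the query's distance to that pivot. By the triangle inequality, every discarded item is then farther than `t` from the query. `hamming` and `pivot_prune` are hypothetical helpers and the codes are uniformly random, not ITQ output.

```python
import numpy as np

def hamming(codes, code):
    """Hamming distance between each row of unpacked 0/1 codes and one code."""
    return np.count_nonzero(codes != code, axis=1)

def pivot_prune(db_pivot_d, q_pivot_d, t):
    """Boolean mask of items whose pivot distances are all within t of the query's.

    Triangle inequality: |d(q, p) - d(x, p)| <= d(q, x), so an item that fails
    this test for any pivot p must satisfy d(q, x) > t.
    """
    return np.all(np.abs(db_pivot_d - q_pivot_d) <= t, axis=1)

# Synthetic 128-bit codes (assumption: uniformly random bits)
rng = np.random.default_rng(0)
n, n_bits, n_pivots = 5000, 128, 8
db_codes = rng.integers(0, 2, size=(n, n_bits), dtype=np.uint8)
pivots = db_codes[rng.choice(n, size=n_pivots, replace=False)]

# Query: item 0 with five bits flipped, so it has one known near neighbour
q_code = db_codes[0].copy()
q_code[:5] ^= 1

# Item-to-pivot distances are computed once; each query adds n_pivots distances
db_pivot_d = np.stack([hamming(db_codes, p) for p in pivots], axis=1)
q_pivot_d = hamming(pivots, q_code)

for t in (15, 20, 25, 30):
    keep = pivot_prune(db_pivot_d, q_pivot_d, t)
    print(f't={t}: kept={int(keep.sum()):,} reduction={1 - keep.mean():.1%}')
```

Uniformly random 128-bit codes have a pivot-distance standard deviation of about sqrt(128)/2 ≈ 5.7, so this toy data sits in the low-variance regime of item 3; the reduction rate grows only when item-to-pivot distances spread out relative to `t`.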

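#### Lessons from the HNSW Comparison

- **With L fixed at 1000**: Recall@10 trails HNSW by 0.2 to 22.2pp; only Fashion is on par
- **At L=5000**: GloVe (83.9% vs 79.9%), NYTimes (85.8% vs 84.7%), and GIST (83.0% vs 81.6%) **exceed** HNSW
- **SIFT at L=5000**: 96.9%, close to HNSW's 97.9%
- → **With a large enough candidate set, ITQ-LSH can match or exceed HNSW's recall**, although the speed cost must be taken into account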

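#### Consistency with Existing Experiments (Wikipedia 10K/400K)

- R@10=84.2% (Pivot, t=20) in experiment 84 (English Wikipedia, 10K) is consistent with SIFT's R@10=85.3%
- The accuracy of E5-base-v2 (768D → 128 bits) presumably suffers from the same high-dimensional compression problem as GIST (960D → 128 bits)
- On small datasets (10K-60K), ITQ-LSH tends to reach high accuracy, as it does on Fashion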

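### Practical Recommendations

| Usage scenario | Recommended setting | Expected Recall@10 |
|---|---|---|
| Small (< 100K), low to medium dimensionality | C-only, L=1000 | > 95% |
| Medium (100K-1M), low-dimensional (< 256D) | C+Pivot (t=20), L=2000 | 85-92% |
| Medium, high-dimensional (> 256D) | C-only with L=5000, or HNSW (recommended) | 70-83% |
| Large (> 1M) | HNSW recommended (ITQ-LSH cost grows with the candidate count) | - |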

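### Open Issues

1. **Extending the bit count**: 128 bits is insufficient for high-dimensional data (GIST, GloVe). Evaluation at 256 and 512 bits is needed
2. **Speed benchmark**: this experiment evaluates accuracy only. The speed of exhaustive Hamming search has not yet been compared with HNSW
3. **Multi-probe LSH**: revisit the band filter and confidence multi-probe (their problems were already identified in experiments 80-84)
4. **Better pivot selection**: pivot selection strategies for low-variance data (e.g. select pivots in embedding space → map them into hash space)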