# 72. ANN Benchmark Dataset Preparation

## Purpose
- Download 5 standard ANN benchmark datasets from ann-benchmarks.com
- Normalize L2-metric datasets to unit vectors (for cosine evaluation)
- Compute/validate cosine ground truth for all datasets
- Save preprocessed arrays as .npy for experiments 73-75

## Datasets
| Name | Short key | N | D | Original metric | Action |
|------|-----------|---|---|-----------------|--------|
| GloVe-100 | glove | 1,183,514 | 100 | angular | Use neighbors directly |
| SIFT-128 | sift | 1,000,000 | 128 | euclidean | L2-normalize, recompute GT |
| Fashion-MNIST-784 | fashion | 60,000 | 784 | euclidean | L2-normalize, recompute GT |
| NYTimes-256 | nytimes | 290,000 | 256 | angular | Use neighbors directly |
| GIST-960 | gist | 1,000,000 | 960 | euclidean | L2-normalize, recompute GT |

## 0. Setup

In [1]:
import numpy as np
import h5py
import urllib.request
import time
from pathlib import Path
from tqdm import tqdm

DATA_DIR = Path('../data')
ANN_DIR = DATA_DIR / 'ann-benchmarks'
ANN_DIR.mkdir(parents=True, exist_ok=True)  # also creates ../data if it is missing

DATASETS = {
    'glove':   {'url': 'glove-100-angular.hdf5',   'metric': 'angular'},
    'sift':    {'url': 'sift-128-euclidean.hdf5',   'metric': 'euclidean'},
    'fashion': {'url': 'fashion-mnist-784-euclidean.hdf5', 'metric': 'euclidean'},
    'nytimes': {'url': 'nytimes-256-angular.hdf5',  'metric': 'angular'},
    'gist':    {'url': 'gist-960-euclidean.hdf5',   'metric': 'euclidean'},
}

BASE_URL = 'https://ann-benchmarks.com/'
GT_K = 100  # ground truth top-K

print(f'Data directory: {DATA_DIR}')
print(f'ANN benchmarks directory: {ANN_DIR}')
print(f'Datasets: {list(DATASETS.keys())}')

Data directory: ../data
ANN benchmarks directory: ../data/ann-benchmarks
Datasets: ['glove', 'sift', 'fashion', 'nytimes', 'gist']


## 1. Download Datasets

In [2]:
import subprocess

def download_dataset(name, info):
    """Download HDF5 file if not already present, using curl."""
    filepath = ANN_DIR / info['url']
    if filepath.exists():
        size_mb = filepath.stat().st_size / (1024 * 1024)
        print(f'  {name}: already exists ({size_mb:.1f} MB)')
        return filepath
    
    url = BASE_URL + info['url']
    print(f'  {name}: downloading from {url} ...')
    start = time.time()
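    result = subprocess.run(
        # -f makes curl exit non-zero on HTTP errors instead of saving the error page;
        # -S keeps the error message on stderr even though -s silences progress output
        ['curl', '-fsSL', '-o', str(filepath), url],
        capture_output=True, text=True
    )
    if result.returncode != 0:
        filepath.unlink(missing_ok=True)  # drop any partial file so a rerun retries
        raise RuntimeError(f'Download of {name} failed: {result.stderr}')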
    elapsed = time.time() - start
    size_mb = filepath.stat().st_size / (1024 * 1024)
    print(f'  {name}: downloaded {size_mb:.1f} MB in {elapsed:.1f}s')
    return filepath

print('Downloading ANN benchmark datasets...')
for name, info in DATASETS.items():
    download_dataset(name, info)

Downloading ANN benchmark datasets...
  glove: downloading from https://ann-benchmarks.com/glove-100-angular.hdf5 ...


  glove: downloaded 462.9 MB in 43.5s
  sift: downloading from https://ann-benchmarks.com/sift-128-euclidean.hdf5 ...


  sift: downloaded 500.8 MB in 46.6s
  fashion: downloading from https://ann-benchmarks.com/fashion-mnist-784-euclidean.hdf5 ...


  fashion: downloaded 217.0 MB in 21.1s
  nytimes: downloading from https://ann-benchmarks.com/nytimes-256-angular.hdf5 ...


  nytimes: downloaded 300.6 MB in 28.2s
  gist: downloading from https://ann-benchmarks.com/gist-960-euclidean.hdf5 ...


  gist: downloaded 3666.5 MB in 362.6s
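
Because the download step shells out to `curl`, a failed transfer can leave a truncated file or an HTML error page on disk, and the `filepath.exists()` check would then skip it on the next run. The following is a minimal sketch of a cheap sanity check, assuming the files carry no HDF5 user block, so the 8-byte format signature sits at offset 0. The helper name `looks_like_hdf5` is hypothetical and not part of the notebook.

```python
from pathlib import Path

# The 8-byte signature that starts every HDF5 file without a user block.
HDF5_SIGNATURE = b'\x89HDF\r\n\x1a\n'


def looks_like_hdf5(path):
    """Return True if the file exists and begins with the HDF5 signature.

    Hypothetical helper: it only inspects the first 8 bytes, so it catches
    HTML error pages and empty files, but not a file with a truncated tail.
    """
    path = Path(path)
    if not path.is_file():
        return False
    with path.open('rb') as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE
```

Calling `looks_like_hdf5(ANN_DIR / info['url'])` before reusing a cached file would catch the most common failure. When h5py is already imported, `h5py.is_hdf5` performs a similar check.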


## 2. Inspect HDF5 Structure

In [3]:
print('Inspecting HDF5 structure...')
for name, info in DATASETS.items():
    filepath = ANN_DIR / info['url']
    with h5py.File(filepath, 'r') as f:
        print(f'\n{name} ({info["url"]}):')
        for key in f.keys():
            print(f'  {key}: shape={f[key].shape}, dtype={f[key].dtype}')

Inspecting HDF5 structure...

glove (glove-100-angular.hdf5):
  distances: shape=(10000, 100), dtype=float32
  neighbors: shape=(10000, 100), dtype=int32
  test: shape=(10000, 100), dtype=float32
  train: shape=(1183514, 100), dtype=float32

sift (sift-128-euclidean.hdf5):
  distances: shape=(10000, 100), dtype=float32
  neighbors: shape=(10000, 100), dtype=int32
  test: shape=(10000, 128), dtype=float32
  train: shape=(1000000, 128), dtype=float32

fashion (fashion-mnist-784-euclidean.hdf5):
  distances: shape=(10000, 100), dtype=float32
  neighbors: shape=(10000, 100), dtype=int32
  test: shape=(10000, 784), dtype=float32
  train: shape=(60000, 784), dtype=float32

nytimes (nytimes-256-angular.hdf5):
  distances: shape=(10000, 100), dtype=float32
  neighbors: shape=(10000, 100), dtype=int32
  test: shape=(10000, 256), dtype=float32
  train: shape=(290000, 256), dtype=float32

gist (gist-960-euclidean.hdf5):
  distances: shape=(1000, 100), dtype=float32
  neighbors: shape=(1000, 100),

## 3. Load and Preprocess

In [4]:
def load_and_preprocess(name, info):
    """Load HDF5, optionally L2-normalize."""
    filepath = ANN_DIR / info['url']
    with h5py.File(filepath, 'r') as f:
        train = np.array(f['train'], dtype=np.float32)
        test = np.array(f['test'], dtype=np.float32)
        neighbors = np.array(f['neighbors'])
        distances = np.array(f['distances'])
    
    print(f'\n{name}: train={train.shape}, test={test.shape}, '
          f'neighbors={neighbors.shape}, metric={info["metric"]}')
    
    if info['metric'] == 'euclidean':
        print(f'  L2-normalizing...')
        train_norms = np.linalg.norm(train, axis=1, keepdims=True)
        test_norms = np.linalg.norm(test, axis=1, keepdims=True)
        
        zero_train = np.sum(train_norms.flatten() == 0)
        zero_test = np.sum(test_norms.flatten() == 0)
        if zero_train > 0 or zero_test > 0:
            print(f'  WARNING: {zero_train} zero train, {zero_test} zero test vectors')
            train_norms[train_norms == 0] = 1.0
            test_norms[test_norms == 0] = 1.0
        
        train = train / train_norms
        test = test / test_norms
        
        sample_norms = np.linalg.norm(train[:100], axis=1)
        print(f'  Post-norm: mean={sample_norms.mean():.6f}, std={sample_norms.std():.8f}')
    
    return {
        'train': train,
        'test': test,
        'neighbors_original': neighbors,
        'distances_original': distances,
        'metric': info['metric'],
        'name': name,
    }

all_datasets = {}
for name, info in DATASETS.items():
    all_datasets[name] = load_and_preprocess(name, info)


glove: train=(1183514, 100), test=(10000, 100), neighbors=(10000, 100), metric=angular



sift: train=(1000000, 128), test=(10000, 128), neighbors=(10000, 100), metric=euclidean
  L2-normalizing...


  Post-norm: mean=1.000000, std=0.00000003

fashion: train=(60000, 784), test=(10000, 784), neighbors=(10000, 100), metric=euclidean
  L2-normalizing...


  Post-norm: mean=1.000000, std=0.00000003

nytimes: train=(290000, 256), test=(10000, 256), neighbors=(10000, 100), metric=angular



gist: train=(1000000, 960), test=(1000, 960), neighbors=(1000, 100), metric=euclidean
  L2-normalizing...




  Post-norm: mean=1.000000, std=0.00000003


## 4. Compute Cosine Ground Truth

In [5]:
def compute_cosine_ground_truth(train, test, k=100, batch_size=100):
    """Brute-force cosine similarity ground truth."""
    n_queries = len(test)
    gt_neighbors = np.zeros((n_queries, k), dtype=np.int64)
    
    # Precompute norms
    train_norms = np.linalg.norm(train, axis=1)
    train_norms[train_norms == 0] = 1.0
    
    for start in tqdm(range(0, n_queries, batch_size), desc='GT computation'):
        end = min(start + batch_size, n_queries)
        batch = test[start:end]
        batch_norms = np.linalg.norm(batch, axis=1, keepdims=True)
        batch_norms[batch_norms == 0] = 1.0
        
        # Cosine similarities
        dots = batch @ train.T
        cos_sims = dots / (batch_norms * train_norms[np.newaxis, :])
        
        for i in range(end - start):
            top_indices = np.argpartition(-cos_sims[i], k)[:k]
            top_sorted = top_indices[np.argsort(-cos_sims[i, top_indices])]
            gt_neighbors[start + i] = top_sorted
    
    return gt_neighbors

print('Computing/validating ground truth for all datasets...')

Computing/validating ground truth for all datasets...


In [6]:
for name, data in all_datasets.items():
    print(f'\n{"="*60}')
    print(f'{name} (metric={data["metric"]})')
    
    if data['metric'] == 'angular':
        # Spot-check: recompute for 10 queries
        print('  Angular: spot-checking provided GT...')
        gt_check = compute_cosine_ground_truth(
            data['train'], data['test'][:10], k=GT_K, batch_size=10
        )
        overlaps = [
            len(set(gt_check[i]) & set(data['neighbors_original'][i, :GT_K])) / GT_K
            for i in range(10)
        ]
        mean_overlap = np.mean(overlaps)
        print(f'  Overlap with provided GT: {mean_overlap:.1%}')
        
        if mean_overlap > 0.90:
            print('  Using provided GT directly.')
            data['gt_neighbors'] = data['neighbors_original'][:, :GT_K].astype(np.int64)
        else:
            print('  Overlap low, recomputing full cosine GT...')
            data['gt_neighbors'] = compute_cosine_ground_truth(
                data['train'], data['test'], k=GT_K
            )
    else:
        # Euclidean (normalized): must recompute
        print('  Euclidean (normalized): computing cosine GT...')
        data['gt_neighbors'] = compute_cosine_ground_truth(
            data['train'], data['test'], k=GT_K
        )
    
    print(f'  GT shape: {data["gt_neighbors"].shape}')


glove (metric=angular)
  Angular: spot-checking provided GT...


GT computation: 100%|██████████| 1/1 [00:00<00:00,  8.88it/s]




  Overlap with provided GT: 100.0%
  Using provided GT directly.
  GT shape: (10000, 100)

sift (metric=euclidean)
  Euclidean (normalized): computing cosine GT...


GT computation: 100%|██████████| 100/100 [01:18<00:00,  1.27it/s]




  GT shape: (10000, 100)

fashion (metric=euclidean)
  Euclidean (normalized): computing cosine GT...


GT computation: 100%|██████████| 100/100 [00:05<00:00, 17.23it/s]




  GT shape: (10000, 100)

nytimes (metric=angular)
  Angular: spot-checking provided GT...


GT computation: 100%|██████████| 1/1 [00:00<00:00, 27.42it/s]




  Overlap with provided GT: 99.7%
  Using provided GT directly.
  GT shape: (10000, 100)

gist (metric=euclidean)
  Euclidean (normalized): computing cosine GT...


GT computation: 100%|██████████| 10/10 [00:11<00:00,  1.19s/it]

  GT shape: (1000, 100)
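
The spot-check in this section measures the fraction of indices shared between two top-K lists, which is the same quantity usually reported as recall@K. Below is a minimal sketch of that measure as a standalone function, assuming both inputs are sequences of per-query index lists (NumPy arrays also work). The name `mean_overlap_at_k` is hypothetical.

```python
def mean_overlap_at_k(found, reference, k):
    """Mean fraction of the reference top-k that also appears in the found top-k.

    Order within the top-k is ignored, matching the set intersection used
    in the spot-check.
    """
    if len(found) != len(reference):
        raise ValueError('found and reference must cover the same queries')
    if len(found) == 0:
        raise ValueError('need at least one query')
    total = 0.0
    for f_row, r_row in zip(found, reference):
        f_set = {int(i) for i in f_row[:k]}
        r_set = {int(i) for i in r_row[:k]}
        total += len(f_set & r_set) / k
    return total / len(found)


# Two queries, k=3: the first shares 2 of 3 indices, the second all 3.
print(mean_overlap_at_k([[1, 2, 3], [4, 5, 6]], [[3, 2, 9], [6, 5, 4]], k=3))
```

Casting each index to `int` keeps the comparison independent of dtype, which matters here because the provided neighbors are `int32` while the recomputed ones are `int64`.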





## 5. Save Preprocessed Data

In [7]:
print('Saving preprocessed data...')
for name, data in all_datasets.items():
    prefix = f'ann_{name}'
    
    np.save(DATA_DIR / f'{prefix}_train.npy', data['train'])
    np.save(DATA_DIR / f'{prefix}_test.npy', data['test'])
    np.save(DATA_DIR / f'{prefix}_gt_neighbors.npy', data['gt_neighbors'])
    
    train_mb = data['train'].nbytes / (1024**2)
    test_mb = data['test'].nbytes / (1024**2)
    
    print(f'  {name}: train={data["train"].shape} ({train_mb:.1f}MB), '
          f'test={data["test"].shape} ({test_mb:.1f}MB), '
          f'gt={data["gt_neighbors"].shape}')

print(f'\nAll data saved to {DATA_DIR}')

Saving preprocessed data...


  glove: train=(1183514, 100) (451.5MB), test=(10000, 100) (3.8MB), gt=(10000, 100)


  sift: train=(1000000, 128) (488.3MB), test=(10000, 128) (4.9MB), gt=(10000, 100)
  fashion: train=(60000, 784) (179.4MB), test=(10000, 784) (29.9MB), gt=(10000, 100)


  nytimes: train=(290000, 256) (283.2MB), test=(10000, 256) (9.8MB), gt=(10000, 100)


  gist: train=(1000000, 960) (3662.1MB), test=(1000, 960) (3.7MB), gt=(1000, 100)

All data saved to ../data
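
The sizes printed above follow directly from the array shapes: a float32 array of shape (N, D) occupies N × D × 4 bytes. The same arithmetic gives the size of the per-batch similarity matrix in `compute_cosine_ground_truth`, which holds `batch_size` × N float32 values. A short sketch of that calculation follows, with `float32_mb` as a hypothetical helper name.

```python
def float32_mb(*shape):
    """Size in MiB of a float32 array with the given shape (4 bytes per element)."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * 4 / (1024 ** 2)


# (N, D) of each train split, taken from the tables in this notebook.
TRAIN_SHAPES = {
    'glove': (1_183_514, 100),
    'sift': (1_000_000, 128),
    'fashion': (60_000, 784),
    'nytimes': (290_000, 256),
    'gist': (1_000_000, 960),
}

for name, (n, d) in TRAIN_SHAPES.items():
    # Second number: one batch of 100 queries scored against all N train vectors.
    print(f'{name}: train={float32_mb(n, d):.1f} MB, '
          f'similarity batch={float32_mb(100, n):.1f} MB')
```

With the default `batch_size=100`, the similarity matrix for a million-vector dataset is a few hundred MB, which is small next to the GIST train array itself.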


## 6. Summary

In [8]:
print('='*80)
print('Dataset Summary')
print('='*80)
print(f'{"Dataset":<15} {"N_train":>10} {"N_test":>8} {"Dim":>6} '
      f'{"Orig metric":>12} {"Normalized":>12} {"Train size":>12}')
print('-'*80)
for name, data in all_datasets.items():
    n_train = data['train'].shape[0]
    n_test = data['test'].shape[0]
    dim = data['train'].shape[1]
    normalized = 'Yes' if data['metric'] == 'euclidean' else 'No'
    train_mb = data['train'].nbytes / (1024**2)
    print(f'{name:<15} {n_train:>10,} {n_test:>8,} {dim:>6} '
          f'{data["metric"]:>12} {normalized:>12} {train_mb:>10.1f} MB')

Dataset Summary
Dataset            N_train   N_test    Dim  Orig metric   Normalized   Train size
--------------------------------------------------------------------------------
glove            1,183,514   10,000    100      angular           No      451.5 MB
sift             1,000,000   10,000    128    euclidean          Yes      488.3 MB
fashion             60,000   10,000    784    euclidean          Yes      179.4 MB
nytimes            290,000   10,000    256      angular           No      283.2 MB
gist             1,000,000    1,000    960    euclidean          Yes     3662.1 MB


## Evaluation and Discussion

### Dataset characteristics

All five standard benchmark datasets were downloaded and preprocessed successfully (about 5 GB in total).

| Dataset | Size | Dimensions | Original metric | Processing |
|---------|------|------------|-----------------|------------|
| GloVe-100 | 1.18M | 100 | angular | Used as-is |
| SIFT-128 | 1.00M | 128 | euclidean | L2-normalized |
| Fashion-MNIST | 60K | 784 | euclidean | L2-normalized |
| NYTimes-256 | 290K | 256 | angular | Used as-is |
| GIST-960 | 1.00M | 960 | euclidean | L2-normalized |

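### Preprocessing notes

- **L2 normalization**: The euclidean datasets (SIFT, Fashion-MNIST, GIST) were L2-normalized so that every dataset can be evaluated uniformly with cosine similarity. For unit vectors the cosine ranking and the L2 ranking coincide, which makes the comparison fair.
- **Zero vectors**: 10 zero vectors were detected in the GIST training data. To avoid division by zero, their norms were set to 1.0, so they remain zero vectors after normalization. At 10 out of 1,000,000 vectors, the effect on the experimental results is negligible.
- **Ground truth**: For the angular datasets (GloVe, NYTimes), the provided neighbors were spot-checked by recomputing the top 100 for 10 queries. The overlap was 100.0% for GloVe and 99.7% for NYTimes, above the 90% threshold in the code, so the provided neighbors were used as-is. For the euclidean datasets, the cosine ground truth was recomputed by brute force after L2 normalization.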
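
The ranking equivalence behind the normalization step follows from expanding the squared distance. For unit vectors $q$ and $x$ (so $\|q\| = \|x\| = 1$):

```latex
\|q - x\|^2 = \|q\|^2 + \|x\|^2 - 2\, q \cdot x = 2 - 2\cos(q, x)
```

Squared L2 distance is therefore a strictly decreasing function of cosine similarity, so sorting by ascending distance and sorting by descending cosine similarity return the same neighbors, up to ties.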

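### Preparation for the next experiments

For each dataset, `ann_<name>_train.npy`, `ann_<name>_test.npy`, and `ann_<name>_gt_neighbors.npy` are saved in the `data/` directory and are ready to use from experiment 73 onward. Dataset sizes range widely, from 60K to 1.2M vectors, which suits an evaluation of ITQ-LSH scalability.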