# 48. Wikipedia 400k Embedding Generation & DuckDB Persistence

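## Purpose
- Generate embeddings for roughly 400,000 Wikipedia articles with sentence-transformers on a GPU
- Persist them in DuckDB together with an HNSW index
- Use the result as the base dataset for upcoming large-scale evaluation experiments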

## Model
- `intfloat/multilingual-e5-base` (768 dimensions)
- GPU inference through sentence-transformers, for speed

## Output Files
- `data/wikipedia_400k_e5_base.duckdb` - embedding vectors + HNSW index
- `data/wikipedia_400k_e5_base_embeddings.npy` - embedding vectors (for ITQ training)
- `data/wikipedia_400k_e5_base_meta.npz` - metadata (titles, document IDs)

**Note**: The file names include the model name `e5_base` to distinguish them from the existing `experiment_400k.duckdb` (E5-large, 1024 dimensions).

## 0. Setup

In [2]:
import numpy as np
import duckdb
import time
import os
from pathlib import Path
from tqdm import tqdm

# Check the GPU
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

PyTorch version: 2.10.0+cu128
CUDA available: True
GPU: NVIDIA GeForce RTX 4090
GPU Memory: 23.5 GB


In [None]:
# Configuration
MODEL_NAME = "intfloat/multilingual-e5-base"
EMBEDDING_DIM = 768
TARGET_COUNT = 400_000
BATCH_SIZE = 64  # batch size for GPU encoding

# Output paths (include the model name to distinguish from existing data)
DATA_DIR = Path("../data")
DATA_DIR.mkdir(exist_ok=True)
OUTPUT_DB = DATA_DIR / "wikipedia_400k_e5_base.duckdb"

print(f"Target: {TARGET_COUNT:,} documents")
print(f"Model: {MODEL_NAME}")
print(f"Output: {OUTPUT_DB}")

## 1. Fetching the Wikipedia Data

The Wikipedia data is loaded from Hugging Face `datasets`.

In [4]:
from datasets import load_dataset

# Load the Japanese Wikipedia dump in streaming mode
print("Loading Wikipedia Japanese dataset (streaming)...")
start_time = time.time()

# Japanese Wikipedia (2023-11-01 snapshot).
# `trust_remote_code` is omitted: recent `datasets` releases no longer
# support it, and this Parquet-based dataset does not need it.
wiki_ja = load_dataset(
    "wikimedia/wikipedia",
    "20231101.ja",
    split="train",
    streaming=True,
)

print(f"Dataset loaded in {time.time() - start_time:.1f}s")

Loading Wikipedia Japanese dataset (streaming)...


README.md: 0.00B [00:00, ?B/s]

Dataset loaded in 3.8s


In [5]:
# Collect documents (scan up to TARGET_COUNT articles)
print(f"Collecting {TARGET_COUNT:,} documents...")
start_time = time.time()

documents = []
titles = []
doc_ids = []

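# Note: `i` counts scanned articles, not kept ones, so articles skipped
# below reduce the final total (399,029 of the 400,000 scanned were kept).
for i, item in enumerate(tqdm(wiki_ja, total=TARGET_COUNT, desc="Collecting")):
    if i >= TARGET_COUNT:
        break

    # Preprocessing: use only the first 500 characters
    text = item['text'][:500].strip()
    if len(text) < 50:  # skip articles that are too short
        continue

    # Prefix that E5 models expect for documents
    documents.append(f"passage: {text}")
    titles.append(item['title'])
    doc_ids.append(item['id'])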

print(f"\nCollected {len(documents):,} documents in {time.time() - start_time:.1f}s")
print(f"Sample title: {titles[0]}")
print(f"Sample text (first 100 chars): {documents[0][:100]}...")

Collecting 400,000 documents...


Collecting:  28%|██▊       | 112559/400000 [05:23<02:55, 1640.10it/s]'(ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)"), '(Request ID: a52ab705-7cc9-4d71-93fb-113b1804311a)')' thrown while requesting GET https://huggingface.co/datasets/wikimedia/wikipedia/resolve/b04c8d1ceb2f5cd4588862100d08de323dccfbaa/20231101.ja/train-00001-of-00015.parquet
Retrying in 1s [Retry 1/5].
Collecting: 100%|██████████| 400000/400000 [12:04<00:00, 552.27it/s] 


Collected 399,029 documents in 724.3s
Sample title: アンパサンド
Sample text (first 100 chars): passage: アンパサンド（&, ）は、並立助詞「…と…」を意味する記号である。ラテン語で「…と…」を表す接続詞 "et" の合字を起源とする。現代のフォントでも、Trebuchet MS など一...
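
The loop above stops after scanning `TARGET_COUNT` articles, so the 971 articles shorter than 50 characters are dropped and 399,029 passages remain. If exactly 400,000 passages were required, one option would be to count kept passages instead of scanned ones. The following is a minimal sketch of that idea, not the notebook's code: `collect_passages`, its parameters, and the `sample` records are hypothetical, and `items` stands in for the streaming dataset.

```python
from itertools import islice


def collect_passages(items, target, max_chars=500, min_chars=50):
    """Collect up to `target` passages, counting kept items rather than scanned ones."""
    def kept():
        for item in items:
            text = item["text"][:max_chars].strip()
            if len(text) < min_chars:
                continue  # too short: skip without using up the quota
            yield {
                "id": item["id"],
                "title": item["title"],
                "text": f"passage: {text}",  # E5 document prefix
            }
    # islice stops pulling from the stream once `target` passages are kept
    return list(islice(kept(), target))


# Tiny stand-in for the streaming dataset (hypothetical records)
sample = [
    {"id": "1", "title": "A", "text": "x" * 80},
    {"id": "2", "title": "B", "text": "short"},
    {"id": "3", "title": "C", "text": "y" * 600},
    {"id": "4", "title": "D", "text": "z" * 60},
]
picked = collect_passages(sample, target=2)
print([p["title"] for p in picked])
```

Because `islice` is lazy, the stream is read only as far as needed, which matters when each item comes over the network.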





## 2. Embedding Generation (sentence-transformers, GPU)

In [6]:
from sentence_transformers import SentenceTransformer

# Load the model
print(f"Loading model: {MODEL_NAME}")
start_time = time.time()

model = SentenceTransformer(MODEL_NAME)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

print(f"Model loaded in {time.time() - start_time:.1f}s")
print(f"Device: {device}")
print(f"Embedding dimension: {model.get_sentence_embedding_dimension()}")

Loading model: intfloat/multilingual-e5-base
Model loaded in 5.5s
Device: cuda
Embedding dimension: 768


In [7]:
# Generate embeddings
print(f"\nGenerating embeddings for {len(documents):,} documents...")
print(f"Batch size: {BATCH_SIZE}")
start_time = time.time()

embeddings = model.encode(
    documents,
    batch_size=BATCH_SIZE,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True  # normalization is recommended for E5
)

elapsed = time.time() - start_time
print(f"\nEmbedding generation completed!")
print(f"Time: {elapsed:.1f}s ({elapsed/60:.1f} min)")
print(f"Speed: {len(documents)/elapsed:.1f} docs/sec")
print(f"Shape: {embeddings.shape}")
print(f"Memory: {embeddings.nbytes / 1024**3:.2f} GB")


Generating embeddings for 399,029 documents...
Batch size: 64


Batches:   0%|          | 0/6235 [00:00<?, ?it/s]


Embedding generation completed!
Time: 941.4s (15.7 min)
Speed: 423.9 docs/sec
Shape: (399029, 768)
Memory: 1.14 GB
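
The batch count and memory figure reported above can be cross-checked with a little arithmetic. The sketch below assumes the embeddings are stored as float32 (4 bytes per value), which is consistent with the reported size. The variable names are illustrative only.

```python
import math

n_docs, dim, batch_size = 399_029, 768, 64
bytes_per_float32 = 4  # assumption: embeddings are float32

# Number of encode() batches: the last batch is partially filled
n_batches = math.ceil(n_docs / batch_size)

# Size of the dense (n_docs, dim) matrix in GiB
size_gib = n_docs * dim * bytes_per_float32 / 1024**3

print(f"batches: {n_batches}")                 # batches: 6235
print(f"float32 matrix: {size_gib:.2f} GiB")   # float32 matrix: 1.14 GiB
```

Both values match the progress bar total (6235) and the `Memory: 1.14 GB` line printed by the cell.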


## 3. DuckDB Persistence + HNSW Index

In [8]:
# Remove any existing file
if OUTPUT_DB.exists():
    OUTPUT_DB.unlink()
    print(f"Removed existing: {OUTPUT_DB}")

# Connect to DuckDB (file-backed, persistent mode)
con = duckdb.connect(str(OUTPUT_DB))

# Install and load the VSS extension
con.execute("INSTALL vss;")
con.execute("LOAD vss;")
print("VSS extension loaded")

VSS extension loaded


In [9]:
# Create the table
print("Creating table...")
con.execute(f"""
    CREATE TABLE wikipedia_docs (
        id INTEGER PRIMARY KEY,
        doc_id VARCHAR,
        title VARCHAR,
        text VARCHAR,
        embedding FLOAT[{EMBEDDING_DIM}]
    )
""")
print("Table created")

Creating table...
Table created


In [10]:
# Insert the data in batches
print(f"Inserting {len(documents):,} documents...")
start_time = time.time()

INSERT_BATCH = 10000
for i in tqdm(range(0, len(documents), INSERT_BATCH), desc="Inserting"):
    batch_end = min(i + INSERT_BATCH, len(documents))
    
    # Prepare the rows for this batch
    batch_data = [
        (j, doc_ids[j], titles[j], documents[j], embeddings[j].tolist())
        for j in range(i, batch_end)
    ]
    
    con.executemany(
        "INSERT INTO wikipedia_docs VALUES (?, ?, ?, ?, ?)",
        batch_data
    )

print(f"\nInsert completed in {time.time() - start_time:.1f}s")

# Verify the row count
count = con.execute("SELECT COUNT(*) FROM wikipedia_docs").fetchone()[0]
print(f"Total records: {count:,}")

Inserting 399,029 documents...


Inserting: 100%|██████████| 40/40 [17:50<00:00, 26.76s/it]


Insert completed in 1070.5s
Total records: 399,029





In [12]:
# Create the HNSW index
print("Creating HNSW index...")
start_time = time.time()
con.execute("SET hnsw_enable_experimental_persistence = true;")


con.execute("""
    CREATE INDEX wikipedia_hnsw_idx ON wikipedia_docs 
    USING HNSW (embedding)
    WITH (metric = 'cosine')
""")

print(f"HNSW index created in {time.time() - start_time:.1f}s")

Creating HNSW index...


FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

HNSW index created in 105.5s


In [13]:
# Inspect the table and its indexes
print("\n=== Database Info ===")
print(con.execute("PRAGMA table_info('wikipedia_docs')").fetchdf())
print("\n=== Indexes ===")
print(con.execute("SELECT * FROM duckdb_indexes()").fetchdf())


=== Database Info ===
   cid       name        type  notnull dflt_value     pk
0    0         id     INTEGER     True       None   True
1    1     doc_id     VARCHAR    False       None  False
2    2      title     VARCHAR    False       None  False
3    3       text     VARCHAR    False       None  False
4    4  embedding  FLOAT[768]    False       None  False

=== Indexes ===
    database_name  database_oid schema_name  schema_oid          index_name  \
0  wikipedia_400k           592        main         590  wikipedia_hnsw_idx   

   index_oid      table_name  table_oid comment tags  is_unique  is_primary  \
0       2012  wikipedia_docs       2006    None   {}      False       False   

   expressions                                                sql  
0  [embedding]  CREATE INDEX wikipedia_hnsw_idx ON wikipedia_d...  


## 4. Sanity Check

In [14]:
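# Search test with a sample query
print("Testing HNSW search...")

# Embed the query ("about Japanese history") with the E5 "query: " prefix
query_text = "query: 日本の歴史について"
query_embedding = model.encode([query_text], normalize_embeddings=True)[0]

# Note: ordering by array_cosine_similarity(...) DESC may run as a full scan;
# DuckDB's VSS docs describe index use for ORDER BY array_cosine_distance(...) LIMIT k.
start_time = time.time()
results = con.execute(f"""
    SELECT id, title, array_cosine_similarity(embedding, ?::FLOAT[{EMBEDDING_DIM}]) as similarity
    FROM wikipedia_docs
    ORDER BY similarity DESC
    LIMIT 10
""", [query_embedding.tolist()]).fetchdf()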

print(f"Search time: {(time.time() - start_time)*1000:.1f}ms")
print("\nTop 10 results:")
print(results)

Testing HNSW search...
Search time: 303.1ms

Top 10 results:
       id   title  similarity
0  249275    日本文明    0.865258
1   77237   歴史の一覧    0.859950
2    3446   1689年    0.859729
3   37551    340年    0.859658
4   26196    日本人論    0.859130
5  220919  手掘り日本史    0.858300
6  225548     戦国史    0.857259
7   16551      古代    0.856780
8   60636   歴史書一覧    0.856451
9    5469    日本書紀    0.853323
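
A search time of roughly 300 ms over 399,029 vectors looks more like a full scan than an index lookup. As far as the DuckDB VSS documentation describes it, the HNSW index accelerates queries of the form `ORDER BY <distance function>(column, constant vector) LIMIT k`, and for an index created with `metric = 'cosine'` the matching function is `array_cosine_distance`. The sketch below only builds such a query string. `build_hnsw_topk_sql` is a hypothetical helper, and whether the planner actually uses the index with a bound parameter should be confirmed with `EXPLAIN` on the installed DuckDB version.

```python
def build_hnsw_topk_sql(table: str, column: str, dim: int, k: int) -> str:
    """Build a top-k query ordered by cosine distance (ascending = most similar first)."""
    return (
        f"SELECT id, title "
        f"FROM {table} "
        f"ORDER BY array_cosine_distance({column}, ?::FLOAT[{dim}]) "
        f"LIMIT {k}"
    )


sql = build_hnsw_topk_sql("wikipedia_docs", "embedding", 768, 10)
print(sql)
```

With the notebook's connection this could be run as `con.execute(sql, [query_embedding.tolist()]).fetchdf()`. Prefixing the query with `EXPLAIN` and looking for an `HNSW_INDEX_SCAN` node is the documented way to check that the index is used. Cosine distance is `1 - cosine similarity`, so the ranking is the same as ordering by similarity in descending order.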


In [15]:
# Close the connection
con.close()

# Check the file size
file_size = OUTPUT_DB.stat().st_size / 1024**3
print(f"\n=== Output File ===")
print(f"Path: {OUTPUT_DB}")
print(f"Size: {file_size:.2f} GB")


=== Output File ===
Path: ../data/wikipedia_400k.duckdb
Size: 3.37 GB


## 5. Reload Test

In [16]:
# Reload the persisted database
print("Reloading from disk...")
start_time = time.time()

con2 = duckdb.connect(str(OUTPUT_DB), read_only=True)
con2.execute("LOAD vss;")

print(f"Reload time: {time.time() - start_time:.2f}s")

# Verify the row count
count = con2.execute("SELECT COUNT(*) FROM wikipedia_docs").fetchone()[0]
print(f"Records: {count:,}")

# Sample search
start_time = time.time()
result = con2.execute(f"""
    SELECT id, title, array_cosine_similarity(embedding, ?::FLOAT[{EMBEDDING_DIM}]) as similarity
    FROM wikipedia_docs
    ORDER BY similarity DESC
    LIMIT 5
""", [query_embedding.tolist()]).fetchdf()

print(f"Search time: {(time.time() - start_time)*1000:.1f}ms")
print("\nTop 5:")
print(result)

con2.close()

Reloading from disk...
Reload time: 0.04s
Records: 399,029
Search time: 691.0ms

Top 5:
       id  title  similarity
0  249275   日本文明    0.865258
1   77237  歴史の一覧    0.859950
2    3446  1689年    0.859729
3   37551   340年    0.859658
4   26196   日本人論    0.859130


## 6. Saving the Embeddings as NumPy (Optional)

For cases that need direct access to the vectors, such as ITQ training.

In [None]:
# Also save as a NumPy file (model name included to keep files distinct)
NPY_PATH = DATA_DIR / "wikipedia_400k_e5_base_embeddings.npy"

print(f"Saving embeddings to {NPY_PATH}...")
np.save(NPY_PATH, embeddings)

npy_size = NPY_PATH.stat().st_size / 1024**3
print(f"Saved: {npy_size:.2f} GB")

# Save the metadata as well
META_PATH = DATA_DIR / "wikipedia_400k_e5_base_meta.npz"
np.savez(META_PATH, titles=np.array(titles), doc_ids=np.array(doc_ids))
print(f"Metadata saved: {META_PATH}")
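
When a later experiment needs only part of the 1.14 GB embedding file, `np.load` with `mmap_mode="r"` maps the file instead of reading it all into RAM. The sketch below shows the pattern on a tiny temporary file. The file name `demo_embeddings.npy` and the array contents are placeholders, not the notebook's data.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_embeddings.npy"
    demo = np.arange(12, dtype=np.float32).reshape(4, 3)
    np.save(path, demo)

    # mmap_mode="r" maps the file read-only instead of loading it into memory
    mapped = np.load(path, mmap_mode="r")
    first_rows = np.array(mapped[:2])  # copy only the rows that are needed
    del mapped  # release the mapping before the temp directory is removed

print(first_rows)
```

For the real file, the same call would be `np.load(DATA_DIR / "wikipedia_400k_e5_base_embeddings.npy", mmap_mode="r")`.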

## 7. Summary

In [None]:
print("="*60)
print("Wikipedia 400k E5-base Embedding Dataset - Summary")
print("="*60)
print(f"Documents: {len(documents):,}")
print(f"Model: {MODEL_NAME}")
print(f"Embedding dimension: {EMBEDDING_DIM}")
print(f"")
print(f"Output files:")
print(f"  - DuckDB: {OUTPUT_DB} ({file_size:.2f} GB)")
print(f"  - NumPy: {NPY_PATH} ({npy_size:.2f} GB)")
print(f"  - Meta: {META_PATH}")
print(f"")
print(f"DuckDB features:")
print(f"  - HNSW index on embeddings (cosine similarity)")
print(f"  - Columns: id, doc_id, title, text, embedding")
print(f"")
print(f"Note: Files named with 'e5_base' to distinguish from")
print(f"      experiment_400k.duckdb (E5-large, 1024 dim)")
print("="*60)

## 8. ITQ Training and Persistence

ITQ models are trained at both 64 bits and 96 bits and saved for use in later experiments.

In [19]:
# Import the ITQ implementation
import sys
sys.path.insert(0, '../src')
from itq_lsh import ITQLSH

# Load the embeddings (if they are no longer in memory)
if 'embeddings' not in dir():
    print("Loading embeddings from file...")
    embeddings = np.load(DATA_DIR / "wikipedia_400k_e5_base_embeddings.npy")
    print(f"Loaded: {embeddings.shape}")
else:
    print(f"Using in-memory embeddings: {embeddings.shape}")

Using in-memory embeddings: (399029, 768)


In [20]:
# Train ITQ with 64 bits
print("="*50)
print("Training ITQ with 64 bits")
print("="*50)
start_time = time.time()

itq_64 = ITQLSH(n_bits=64, n_iterations=50, seed=42)
itq_64.fit(embeddings)

print(f"\nTraining time: {time.time() - start_time:.1f}s")

# Save
ITQ_64_PATH = DATA_DIR / "itq_e5_base_64bits.pkl"
itq_64.save(str(ITQ_64_PATH))
print(f"Saved: {ITQ_64_PATH}")

Training ITQ with 64 bits
ITQ学習開始: samples=399029, dim=768, bits=64
  Centering完了: mean_norm=0.8698
  PCA完了: explained_variance=48.00%
  ITQ iteration 10: quantization_error=0.9321
  ITQ iteration 20: quantization_error=0.9318
  ITQ iteration 30: quantization_error=0.9316
  ITQ iteration 40: quantization_error=0.9315
  ITQ iteration 50: quantization_error=0.9314
ITQ学習完了

Training time: 11.0s
Saved: ../data/itq_e5_base_64bits.pkl


In [21]:
# Train ITQ with 96 bits
print("="*50)
print("Training ITQ with 96 bits")
print("="*50)
start_time = time.time()

itq_96 = ITQLSH(n_bits=96, n_iterations=50, seed=42)
itq_96.fit(embeddings)

print(f"\nTraining time: {time.time() - start_time:.1f}s")

# Save
ITQ_96_PATH = DATA_DIR / "itq_e5_base_96bits.pkl"
itq_96.save(str(ITQ_96_PATH))
print(f"Saved: {ITQ_96_PATH}")

Training ITQ with 96 bits
ITQ学習開始: samples=399029, dim=768, bits=96
  Centering完了: mean_norm=0.8698
  PCA完了: explained_variance=57.64%
  ITQ iteration 10: quantization_error=0.9394
  ITQ iteration 20: quantization_error=0.9391
  ITQ iteration 30: quantization_error=0.9390
  ITQ iteration 40: quantization_error=0.9389
  ITQ iteration 50: quantization_error=0.9388
ITQ学習完了

Training time: 15.9s
Saved: ../data/itq_e5_base_96bits.pkl
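
`ITQLSH` comes from the project's `src/itq_lsh.py`, which is not shown here. Its log lines (centering, PCA, then iterations that report a quantization error) are consistent with standard Iterative Quantization as described by Gong and Lazebnik. The following NumPy sketch illustrates that standard procedure under this assumption. `fit_itq` and `transform_itq` are hypothetical names, not the project's API, and the demo data is random.

```python
import numpy as np


def fit_itq(X, n_bits, n_iterations=50, seed=42):
    """Fit ITQ: center, PCA-project to n_bits dimensions, then learn a rotation."""
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    Xc = X - mean

    # PCA: keep the n_bits eigenvectors with the largest eigenvalues
    cov = Xc.T @ Xc / len(Xc)
    _, eigvecs = np.linalg.eigh(cov)  # eigenvalues come back in ascending order
    W = eigvecs[:, ::-1][:, :n_bits]
    V = Xc @ W

    # Start from a random orthogonal rotation
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
    for _ in range(n_iterations):
        B = np.where(V @ R >= 0, 1.0, -1.0)  # fix R, update the binary codes
        U, _s, Wt = np.linalg.svd(V.T @ B)   # fix B, solve orthogonal Procrustes
        R = U @ Wt
    return mean, W, R


def transform_itq(X, mean, W, R):
    """Map vectors to 0/1 codes of shape (n_samples, n_bits)."""
    return ((X - mean) @ W @ R >= 0).astype(np.uint8)


# Random demo data (not the Wikipedia embeddings)
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 16)).astype(np.float32)
mean, W, R = fit_itq(X_demo, n_bits=8, n_iterations=10)
codes = transform_itq(X_demo, mean, W, R)
print(codes.shape, codes.dtype)
```

Each iteration alternates between assigning codes for a fixed rotation and finding the orthogonal rotation that best aligns the projected data with those codes. This is why the logged quantization error decreases only slightly from one checkpoint to the next.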


In [22]:
# Generate and save hashes for all documents
print("Generating hashes for all documents...")

# 64-bit hashes
start_time = time.time()
hashes_64 = itq_64.transform(embeddings)
print(f"64 bits hashes: {hashes_64.shape}, time: {time.time() - start_time:.1f}s")

# 96-bit hashes
start_time = time.time()
hashes_96 = itq_96.transform(embeddings)
print(f"96 bits hashes: {hashes_96.shape}, time: {time.time() - start_time:.1f}s")

# Save the hashes
HASH_64_PATH = DATA_DIR / "wikipedia_400k_e5_base_hashes_64bits.npy"
HASH_96_PATH = DATA_DIR / "wikipedia_400k_e5_base_hashes_96bits.npy"

np.save(HASH_64_PATH, hashes_64)
np.save(HASH_96_PATH, hashes_96)

print(f"\nSaved hashes:")
print(f"  - {HASH_64_PATH} ({HASH_64_PATH.stat().st_size / 1024**2:.1f} MB)")
print(f"  - {HASH_96_PATH} ({HASH_96_PATH.stat().st_size / 1024**2:.1f} MB)")

Generating hashes for all documents...
64 bits hashes: (399029, 64), time: 0.4s
96 bits hashes: (399029, 96), time: 0.4s

Saved hashes:
  - ../data/wikipedia_400k_e5_base_hashes_64bits.npy (24.4 MB)
  - ../data/wikipedia_400k_e5_base_hashes_96bits.npy (36.5 MB)
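
The saved sizes (24.4 MB for 399,029 × 64 and 36.5 MB for 399,029 × 96) work out to one byte per bit. If disk or memory footprint becomes a concern, the codes could be packed eight bits per byte with `np.packbits` and compared by Hamming distance via XOR. The sketch below assumes the hashes are 0/1 arrays and uses random stand-in data. `hamming_to_all` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
codes = rng.integers(0, 2, size=(5, 64), dtype=np.uint8)  # stand-in for hashes_64

packed = np.packbits(codes, axis=1)        # (5, 8): eight bits per byte
restored = np.unpackbits(packed, axis=1)   # back to (5, 64)


def hamming_to_all(query_packed, db_packed):
    """Hamming distance from one packed code to every packed row."""
    diff = np.bitwise_xor(db_packed, query_packed)
    # Count the set bits in each row of the XOR result
    return np.unpackbits(diff, axis=1).sum(axis=1)


dists = hamming_to_all(packed[0], packed)
print(packed.shape, dists[0])
```

Packing reduces a 64-bit code from 64 bytes to 8 bytes per document, and the XOR-based distance gives the same result as comparing the unpacked bits position by position.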


In [23]:
# Reload test for the ITQ models
print("Testing ITQ model reload...")

itq_64_reloaded = ITQLSH.load(str(ITQ_64_PATH))
itq_96_reloaded = ITQLSH.load(str(ITQ_96_PATH))

# Verify on a sample
sample_idx = 0
original_64 = itq_64.transform(embeddings[sample_idx:sample_idx+1])
reloaded_64 = itq_64_reloaded.transform(embeddings[sample_idx:sample_idx+1])
print(f"64 bits match: {np.array_equal(original_64, reloaded_64)}")

original_96 = itq_96.transform(embeddings[sample_idx:sample_idx+1])
reloaded_96 = itq_96_reloaded.transform(embeddings[sample_idx:sample_idx+1])
print(f"96 bits match: {np.array_equal(original_96, reloaded_96)}")

print("\nITQ models saved and verified successfully!")

Testing ITQ model reload...
64 bits match: True
96 bits match: True

ITQ models saved and verified successfully!


## 9. Final Summary

In [24]:
# Final summary
print("="*60)
print("Wikipedia 400k E5-base Dataset - Final Summary")
print("="*60)

# File list
files = [
    ("DuckDB (embeddings + HNSW)", OUTPUT_DB),
    ("Embeddings (NumPy)", DATA_DIR / "wikipedia_400k_e5_base_embeddings.npy"),
    ("Metadata", DATA_DIR / "wikipedia_400k_e5_base_meta.npz"),
    ("ITQ Model (64 bits)", ITQ_64_PATH),
    ("ITQ Model (96 bits)", ITQ_96_PATH),
    ("Hashes (64 bits)", HASH_64_PATH),
    ("Hashes (96 bits)", HASH_96_PATH),
]

print(f"\nDocuments: {len(embeddings):,}")
print(f"Model: {MODEL_NAME}")
print(f"Embedding dimension: {EMBEDDING_DIM}")
print(f"\nOutput files:")
for name, path in files:
    if path.exists():
        size = path.stat().st_size
        if size > 1024**3:
            size_str = f"{size / 1024**3:.2f} GB"
        else:
            size_str = f"{size / 1024**2:.1f} MB"
        print(f"  - {path.name}: {size_str}")
    else:
        print(f"  - {path.name}: NOT FOUND")

print("="*60)

Wikipedia 400k E5-base Dataset - Final Summary

Documents: 399,029
Model: intfloat/multilingual-e5-base
Embedding dimension: 768

Output files:
  - wikipedia_400k.duckdb: NOT FOUND
  - wikipedia_400k_e5_base_embeddings.npy: 1.14 GB
  - wikipedia_400k_e5_base_meta.npz: 132.4 MB
  - itq_e5_base_64bits.pkl: 0.2 MB
  - itq_e5_base_96bits.pkl: 0.3 MB
  - wikipedia_400k_e5_base_hashes_64bits.npy: 24.4 MB
  - wikipedia_400k_e5_base_hashes_96bits.npy: 36.5 MB


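## 10. Results and Evaluation

### Processing Time
| Step | Time |
|-----|------|
| Wikipedia data collection (400,000 scanned, 399,029 kept) | 12.1 min |
| Embedding generation (GPU) | 15.7 min (424 docs/sec) |
| DuckDB insert | 17.8 min |
| HNSW index creation | 1.8 min |
| ITQ training (64 bits + 96 bits) | 0.4 min (11.0 s + 15.9 s) |
| **Total** | **about 48 min** |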

### Output Files
| File | Size | Contents |
|---------|--------|------|
| `wikipedia_400k_e5_base.duckdb` | 3.37 GB | 399,029 documents + HNSW index |
| `wikipedia_400k_e5_base_embeddings.npy` | 1.14 GB | Embedding vectors |
| `wikipedia_400k_e5_base_meta.npz` | 132.4 MB | Titles and document IDs |
| `itq_e5_base_64bits.pkl` | 0.2 MB | ITQ model (64 bits) |
| `itq_e5_base_96bits.pkl` | 0.3 MB | ITQ model (96 bits) |
| `wikipedia_400k_e5_base_hashes_64bits.npy` | 24.4 MB | Hashes for all documents (64 bits) |
| `wikipedia_400k_e5_base_hashes_96bits.npy` | 36.5 MB | Hashes for all documents (96 bits) |

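### Notes
- Model: `intfloat/multilingual-e5-base` (768 dimensions)
- The existing `experiment_400k.duckdb` uses E5-large (1024 dimensions), so the two datasets are distinguished by file name
- Persisting the HNSW index requires `SET hnsw_enable_experimental_persistence = true;`
- The next experiment will use this data to run the Overlap chunk evaluation (64/96 bits) at the 400k-document scale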