# Word2GM Training Data Pipeline

**Pipeline: Corpus file → TFRecord training artifacts (triplets and vocabulary)**

Use this notebook to prepare a Google Books 5-gram corpus for Word2GM skip-gram training.

## Pipeline Workflow

1. **Input**: A preprocessed corpus file (e.g., `2019.txt`) on `/vast` NVMe storage
2. **Processing**: TensorFlow-native filtering, vocabulary building, and triplet generation
3. **Output**: TFRecord artifacts in a year-specific subdirectory (e.g., `2019_artifacts/`)
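
The processing step can be pictured with a small pure-Python sketch. The notebook does not show how the pipeline builds triplets, so everything here is an assumption made for illustration: the hypothetical `make_triplets` helper treats the middle token of each 5-gram as the center word, the remaining in-vocabulary tokens as positive contexts, and draws one uniformly random vocabulary word as the negative sample. The real pipeline is TensorFlow-native and may differ on each of these points.

```python
import random


def make_triplets(line, vocab, rng):
    """Return (center, positive, negative) triplets for one 5-gram line.

    Hypothetical helper for illustration only; not the pipeline's code.
    """
    tokens = line.split()
    center_idx = len(tokens) // 2
    center = tokens[center_idx]
    # Skip the whole line if the center word was filtered out of the vocabulary.
    if center not in vocab:
        return []
    vocab_list = sorted(vocab)  # stable order so a seeded rng is reproducible
    triplets = []
    for i, token in enumerate(tokens):
        # Positive contexts are the other in-vocabulary tokens of the 5-gram.
        if i == center_idx or token not in vocab:
            continue
        # Assumption: negatives are drawn uniformly from the vocabulary.
        negative = rng.choice(vocab_list)
        triplets.append((center, token, negative))
    return triplets


vocab = {"the", "quick", "brown", "fox", "jumps"}
triplets = make_triplets("the quick brown fox jumps", vocab, random.Random(0))
```

With a fully in-vocabulary 5-gram this yields four triplets, one per context word, all sharing the center word `brown`.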

### Artifact Storage
The pipeline creates year-specific subdirectories alongside the original text corpora:
<pre>
/vast/edk202/NLP_corpora/.../data/
├── 2018.txt
├── 2019.txt
├── 2020.txt
├── 2018_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
├── 2019_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
└── 2020_artifacts/
    ├── triplets.tfrecord.gz
    └── vocab.tfrecord.gz
</pre>
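
The layout above implies a simple naming rule, sketched here with `pathlib`. The `artifact_paths` helper is hypothetical and not part of `word2gm_fast`; it only mirrors the directory and file names shown in the tree, and the `/data` directory in the example is a placeholder.

```python
from pathlib import Path


def artifact_paths(corpus_dir, year):
    """Map a corpus directory and year to the expected artifact files.

    Hypothetical helper: it reproduces the naming in the tree above and
    does not check that the files exist.
    """
    base = Path(corpus_dir) / f"{year}_artifacts"
    return {
        "triplets": base / "triplets.tfrecord.gz",
        "vocab": base / "vocab.tfrecord.gz",
    }


paths = artifact_paths("/data", 2019)
print(paths["triplets"].as_posix())
```

Because the files are gzip-compressed TFRecords, they can presumably be opened with `tf.data.TFRecordDataset(path, compression_type="GZIP")`; the record schema is not shown in this notebook, so parsing the records is left out.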

In [2]:
import os
import sys
import time
from pathlib import Path

# Enable automatic reloading of changed modules
%load_ext autoreload
%autoreload 2
print("Autoreload enabled; modules will update automatically when files change")

# Change to the project root directory
os.chdir('/scratch/edk202/word2gm-fast')

# Import TensorFlow with its startup log output suppressed
from src.word2gm_fast.utils import import_tensorflow_silently

tf = import_tensorflow_silently(deterministic=False)
print(f"TensorFlow {tf.__version__} imported silently")

# Import optimized data pipeline modules
from src.word2gm_fast.dataprep.pipeline import batch_prepare_training_data

print("All pipeline modules loaded successfully")
print("Ready to process corpus and generate training data")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Autoreload enabled; modules will update automatically when files change
TensorFlow 2.19.0 imported silently
All pipeline modules loaded successfully
Ready to process corpus and generate training data


## Prepare one or more corpora in parallel 
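The batch call in this section takes its years as a string such as `"1751-1800"`. How `batch_prepare_training_data` interprets that string is not shown here; the sketch below assumes an inclusive `start-end` range and uses a hypothetical `expand_year_range` helper to show how many yearly files such a range would cover.

```python
def expand_year_range(year_range):
    """Expand a 'start-end' string into a list of years.

    Hypothetical helper; assumes both endpoints are included.
    """
    start_text, end_text = year_range.split("-")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"start year {start} is after end year {end}")
    return list(range(start, end + 1))


years = expand_year_range("1751-1800")
print(len(years), years[0], years[-1])
```

Under that assumption, the count of expanded years is the number of corpus files the batch job would look for, which is a quick sanity check against the "Processing N years" line in the output.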

In [None]:
# Configuration
corpus_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data"

# Process every year in the range; use_multiprocessing=True runs years in parallel
print("Processing corpus data with optimized pipeline...")
results = batch_prepare_training_data(
    corpus_dir=corpus_dir,
    year_range="1751-1800",
    compress=True,
    show_progress=True,
    show_summary=True,
    use_multiprocessing=True
)

Processing corpus data with optimized pipeline...

PARALLEL BATCH PROCESSING
Processing 51 years
Using 14 parallel workers
Estimated speedup: 14.0x
1712 complete (1/51): 23 triplets, 36 vocab, 0.8s
1702 complete (2/51): 63 triplets, 75 vocab, 0.9s
1713 complete (3/51): 257 triplets, 210 vocab, 0.9s
1707 complete (4/51): 117 triplets, 108 vocab, 1.0s
1700 complete (5/51): 381 triplets, 325 vocab, 1.3s
1708 complete (6/51): 905 triplets, 564 vocab, 1.5s
1717 complete (7/51): 314 triplets, 312 vocab, 0.6s
1706 complete (8/51): 737 triplets, 483 vocab, 1.7s
1709 complete (9/51): 738 triplets, 465 vocab, 1.7s
1703 complete (10/51): 1,589 triplets, 794 vocab, 1.9s
1718 complete (11/51): 361 triplets, 321 vocab, 0.7s
1701 complete (12/51): 881 triplets, 542 vocab, 2.0s
1715 complete (13/51): 1,497 triplets, 815 vocab, 1.5s
1714 complete (14/51): 1,601 triplets, 834 vocab, 1.6s
1711 complete (15/51): 1,991 triplets, 983 vocab, 2.7s
1721 complete (16/51): 1,071 triplets, 644 vocab, 1.4s
1716 co