# Word2GM Training Data Pipeline

**Pipeline: Corpus file → TFRecord training artifacts (triplets and vocabulary)**

Use this notebook to prepare Google Books 5-gram corpora for Word2GM skip-gram training.

## Pipeline Workflow

1. **Input**: Preprocessed corpus file (e.g., `2019.txt`) in `/vast` NVMe storage
2. **Processing**: TensorFlow-native filtering, vocabulary building, and triplet generation
3. **Output**: TFRecord artifacts in organized subdirectories (e.g., `2019_artifacts/`)
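
The triplet-generation step can be pictured with a small pure-Python sketch. It assumes that each corpus line is a whitespace-separated 5-gram, that the middle token is the center word, and that each triplet is `(center, positive context, negative sample)` with the negative drawn uniformly from the vocabulary. The function `fivegram_to_triplets` and the toy vocabulary are hypothetical illustrations, not the code in `src.word2gm_fast.dataprep`, which is TensorFlow-native and may filter and sample differently.

```python
import random


def fivegram_to_triplets(line, vocab, rng):
    """Turn one 5-gram line into (center, positive, negative) triplets.

    Hypothetical helper: assumes the middle token is the center word and
    that negatives are sampled uniformly from `vocab`.
    """
    tokens = line.split()
    if len(tokens) != 5:
        # Skip malformed lines rather than guessing a center word
        return []
    center = tokens[2]
    triplets = []
    for position, context in enumerate(tokens):
        if position == 2:
            continue  # the center word is not its own context
        negative = rng.choice(vocab)  # uniform negative sample (assumption)
        triplets.append((center, context, negative))
    return triplets


# Toy vocabulary and a seeded generator so the sketch is repeatable
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
rng = random.Random(0)
example_triplets = fivegram_to_triplets("the cat sat on mat", vocab, rng)
for triplet in example_triplets:
    print(triplet)
```

Under these assumptions, each well-formed 5-gram yields four triplets, one for each context position around the center word.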

### Artifact Storage
The pipeline creates year-specific subdirectories alongside the original text corpora:
<pre>
/vast/edk202/NLP_corpora/.../data/
├── 2018.txt
├── 2019.txt
├── 2020.txt
├── 2018_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
├── 2019_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
└── 2020_artifacts/
    ├── triplets.tfrecord.gz
    └── vocab.tfrecord.gz
</pre>
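
The layout above can be expressed as a small path helper, which is handy for checking that a year's artifacts exist before training. This is a minimal sketch: `artifact_paths` is a hypothetical function, `/path/to/data` is a placeholder directory, and the assumption that uncompressed files simply drop the `.gz` suffix is a guess that has not been checked against the pipeline.

```python
from pathlib import Path


def artifact_paths(corpus_dir, year, compress=True):
    """Return the expected corpus and artifact paths for one year.

    Hypothetical helper mirroring the directory tree shown above.
    Assumption: with compress=False the files are named *.tfrecord.
    """
    corpus_dir = Path(corpus_dir)
    suffix = ".tfrecord.gz" if compress else ".tfrecord"
    artifact_dir = corpus_dir / f"{year}_artifacts"
    return {
        "corpus": corpus_dir / f"{year}.txt",
        "triplets": artifact_dir / f"triplets{suffix}",
        "vocab": artifact_dir / f"vocab{suffix}",
    }


# Placeholder directory; substitute the real corpus location
paths = artifact_paths("/path/to/data", 2019)
for name, path in paths.items():
    print(f"{name}: {path} (exists: {path.exists()})")
```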

In [1]:
import os
import sys
import time
from pathlib import Path

# Enable automatic reloading of changed modules
%load_ext autoreload
%autoreload 2
print("Autoreload enabled; modules will update automatically when files change")

# Change to the project root so that the `src` package is importable
os.chdir('/scratch/edk202/word2gm-fast')

# Import TensorFlow with its startup log output suppressed
from src.word2gm_fast.utils import import_tensorflow_silently

tf = import_tensorflow_silently(deterministic=False)
print(f"TensorFlow {tf.__version__} imported silently")

# Import optimized data pipeline modules
from src.word2gm_fast.dataprep.pipeline import batch_prepare_training_data

print("All pipeline modules loaded successfully")
print("Ready to process corpus and generate training data")

Autoreload enabled; modules will update automatically when files change
TensorFlow 2.19.0 imported silently
All pipeline modules loaded successfully
Ready to process corpus and generate training data


## Prepare one or more corpora in parallel 
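
The cell below passes `year_range="1829-1900"`, and the output reports 72 years, which is consistent with an inclusive range. The sketch here shows that arithmetic. `parse_year_range` is a hypothetical helper written for illustration; the real `batch_prepare_training_data` may accept other formats and parse them differently.

```python
def parse_year_range(year_range):
    """Expand a "START-END" string into an inclusive list of years.

    Hypothetical helper: only handles a single hyphenated range.
    """
    start_text, end_text = year_range.split("-")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"start year {start} is after end year {end}")
    # range() excludes its stop value, so add 1 to include the end year
    return list(range(start, end + 1))


years = parse_year_range("1829-1900")
print(f"{len(years)} years, from {years[0]} to {years[-1]}")
```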

In [None]:
# Configuration
corpus_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data"

# Process every year in the range; years run in parallel when use_multiprocessing=True
results = batch_prepare_training_data(
    corpus_dir=corpus_dir,
    year_range="1829-1900",
    compress=False,
    show_progress=True,
    show_summary=True,
    use_multiprocessing=True
)


PARALLEL BATCH PROCESSING
Processing 72 years
Using 14 parallel workers
Estimated speedup: 14.0x
1837 complete (1/72): 1,235,211 triplets, 25,101 vocab, 696.1s
1835 complete (2/72): 1,433,895 triplets, 25,335 vocab, 784.3s
1830 complete (3/72): 1,503,816 triplets, 25,450 vocab, 790.3s
1838 complete (4/72): 1,502,966 triplets, 26,324 vocab, 796.9s
1841 complete (5/72): 1,698,066 triplets, 26,666 vocab, 894.3s
1833 complete (6/72): 1,729,847 triplets, 26,916 vocab, 952.3s
1831 complete (7/72): 1,993,305 triplets, 27,548 vocab, 1037.5s
1840 complete (8/72): 2,031,941 triplets, 27,640 vocab, 1070.1s
1829 complete (9/72): 2,057,033 triplets, 27,319 vocab, 1071.2s
1832 complete (10/72): 2,185,572 triplets, 27,982 vocab, 1149.1s
1836 complete (11/72): 2,299,374 triplets, 29,166 vocab, 1218.5s
1834 complete (12/72): 2,415,856 triplets, 28,853 vocab, 1286.2s
1839 complete (13/72): 2,443,839 triplets, 30,160 vocab, 1319.2s
1842 complete (14/72): 3,485,500 triplets, 32,456 vocab, 1864.5s
1843 co