# Word2GM Training Data Pipeline

**Pipeline: Corpus file → TFRecord training artifacts (triplets and vocabulary)**

Use this notebook to prepare Google Books 5-gram corpora for Word2GM skip-gram training.

## Pipeline Workflow

1. **Input**: Preprocessed corpus file (e.g., `2019.txt`) stored on `/vast` NVMe storage
2. **Processing**: TensorFlow-native filtering, vocabulary building, and triplet generation
3. **Output**: TFRecord artifacts in organized subdirectories (e.g., `2019_artifacts/`)
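The vocabulary and triplet steps can be illustrated with a minimal pure-Python sketch. This is not the project's TensorFlow-native implementation: the helper names `build_vocab` and `make_triplets` are hypothetical, and the sketch assumes that the center word is the middle token of each 5-gram and that negatives are drawn uniformly from the vocabulary. The real pipeline may filter, sample, and encode differently.

```python
import random
from collections import Counter


def build_vocab(lines, min_count=1):
    """Map each sufficiently frequent token to an integer ID (hypothetical helper)."""
    counts = Counter(tok for line in lines for tok in line.split())
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(words)}


def make_triplets(lines, vocab, rng):
    """Yield (center, positive, negative) ID triplets from n-gram lines.

    Assumption: the center is the middle token of the line; every other
    token is a positive context word; the negative is sampled uniformly.
    """
    all_ids = list(vocab.values())
    for line in lines:
        tokens = line.split()
        # Skip lines containing out-of-vocabulary tokens or too few tokens.
        if len(tokens) < 3 or any(t not in vocab for t in tokens):
            continue
        mid = len(tokens) // 2
        center = vocab[tokens[mid]]
        for j, tok in enumerate(tokens):
            if j != mid:
                yield (center, vocab[tok], rng.choice(all_ids))


lines = ["the quick brown fox jumps"]
vocab = build_vocab(lines)
triplets = list(make_triplets(lines, vocab, random.Random(0)))
```

Each 5-gram yields four triplets here, one per context position around the center word.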

### Artifact Storage
The pipeline creates year-specific subdirectories alongside the original text corpora:
<pre>
/vast/edk202/NLP_corpora/.../data/
├── 2018.txt
├── 2019.txt
├── 2020.txt
├── 2018_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
├── 2019_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
└── 2020_artifacts/
    ├── triplets.tfrecord.gz
    └── vocab.tfrecord.gz
</pre>
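The naming convention in the tree above can be expressed as a small path helper. This is a sketch only: `artifact_paths` is a hypothetical function, not part of the pipeline module, and it assumes the gzip-compressed file names shown in the tree.

```python
from pathlib import Path


def artifact_paths(corpus_file):
    """Return the expected artifact locations for a yearly corpus file.

    Hypothetical helper: mirrors the `<year>_artifacts/` layout shown above.
    """
    corpus_file = Path(corpus_file)
    # e.g. .../data/2019.txt -> .../data/2019_artifacts
    artifact_dir = corpus_file.with_name(f"{corpus_file.stem}_artifacts")
    return {
        "dir": artifact_dir,
        "triplets": artifact_dir / "triplets.tfrecord.gz",
        "vocab": artifact_dir / "vocab.tfrecord.gz",
    }


paths = artifact_paths("/data/2019.txt")
```

A helper like this makes it easy to check whether a given year has already been processed by testing `paths["triplets"].is_file()` before rerunning the pipeline.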

In [1]:
import os
import sys
import time
from pathlib import Path

# Enable automatic reloading of changed modules
%load_ext autoreload
%autoreload 2
print("Autoreload enabled; modules will update automatically when files change")

# Change to the project root (the parent of the notebooks directory)
os.chdir('/scratch/edk202/word2gm-fast/notebooks')
os.chdir("..")

# Clean TensorFlow import with complete silencing
from src.word2gm_fast.utils import import_tensorflow_silently

tf = import_tensorflow_silently(deterministic=False)
print(f"TensorFlow {tf.__version__} imported silently")

# Import optimized data pipeline modules
from src.word2gm_fast.dataprep.pipeline import batch_prepare_training_data

print("All pipeline modules loaded successfully")
print("Ready to process corpus and generate training data")

Autoreload enabled; modules will update automatically when files change
TensorFlow 2.19.0 imported silently
All pipeline modules loaded successfully
Ready to process corpus and generate training data


## Prepare one or more corpora in parallel
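Before launching a batch, it can help to check which years in a range actually have a corpus file on disk. The sketch below assumes a `"start-end"` range string like the one passed to `batch_prepare_training_data` and one `<year>.txt` file per year. Both helpers are hypothetical and do not reflect how the pipeline itself parses `year_range`.

```python
from pathlib import Path


def parse_year_range(year_range):
    """Expand a 'start-end' string into an inclusive list of years (hypothetical helper)."""
    start, end = (int(part) for part in year_range.split("-"))
    if start > end:
        raise ValueError(f"start year {start} is after end year {end}")
    return list(range(start, end + 1))


def find_missing_years(corpus_dir, year_range):
    """List the years in the range that have no '<year>.txt' file in corpus_dir."""
    corpus_dir = Path(corpus_dir)
    return [
        year
        for year in parse_year_range(year_range)
        if not (corpus_dir / f"{year}.txt").is_file()
    ]


years = parse_year_range("1829-1900")
```

Calling `find_missing_years(corpus_dir, "1829-1900")` on the real corpus directory would then list any years to exclude or investigate.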

In [5]:
# Configuration
corpus_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data"

# Process each year in the range, optionally in parallel
results = batch_prepare_training_data(
    corpus_dir=corpus_dir,
    year_range="1829-1900",
    compress=False,
    show_progress=True,
    show_summary=True,
    use_multiprocessing=True
)


PARALLEL BATCH PROCESSING
Processing 28 years
Using 14 parallel workers
Estimated speedup: 14.0x


1802 complete (1/28): 92,701 triplets, 8,487 vocab, 70.0s
1801 complete (2/28): 157,360 triplets, 11,000 vocab, 100.6s
1805 complete (3/28): 190,798 triplets, 11,609 vocab, 113.0s
1804 complete (4/28): 128,037 triplets, 9,769 vocab, 119.3s
1814 complete (5/28): 301,143 triplets, 14,965 vocab, 182.2s
1803 complete (6/28): 349,757 triplets, 14,473 vocab, 204.9s
1807 complete (7/28): 369,858 triplets, 14,626 vocab, 216.0s
1806 complete (8/28): 403,392 triplets, 14,124 vocab, 223.7s
1808 complete (9/28): 317,657 triplets, 13,193 vocab, 224.9s
1813 complete (10/28): 508,850 triplets, 17,428 vocab, 323.9s
1816 complete (11/28): 390,078 triplets, 16,144 vocab, 242.2s
1811 complete (12/28): 691,955 triplets, 17,916 vocab, 403.5s
1812 complete (13/28): 710,549 triplets, 18,423 vocab, 432.8s
1817 complete (14/28): 524,951 triplets, 17,803 vocab, 325.2s
1809 complete (15/28): 734,668 triplets, 19,226 vocab, 443.2s
1815 complete (16/28): 641,679 triplets, 17,905 vocab, 375.6s
1818 complete (17/28)