# Word2GM Training Data Pipeline

**Pipeline: Corpus file → TFRecord training artifacts (triplets and vocabulary)**

Use this notebook to prepare one or more Google Books 5-gram corpora for Word2GM skip-gram training.

## Pipeline Workflow

1. **Input**: Preprocessed corpus file (e.g., `2019.txt`) in `/vast` NVMe storage
2. **Processing**: TensorFlow-native filtering, vocabulary building, and triplet generation
3. **Output**: TFRecord artifacts in organized subdirectories (e.g., `2019_artifacts/`)
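
The processing step can be illustrated with a minimal pure-Python sketch. It is not the pipeline's TensorFlow-native implementation. The helper names `build_vocab` and `make_triplets` are hypothetical, and the sketch assumes that the center word is the middle token of each line, that every other token on the line is a positive context, and that negatives are drawn uniformly from the vocabulary.

```python
import random
from collections import Counter


def build_vocab(lines, min_count=1):
    """Map each token seen at least min_count times to an integer ID."""
    counts = Counter(tok for line in lines for tok in line.split())
    kept = sorted(tok for tok, n in counts.items() if n >= min_count)
    return {tok: i for i, tok in enumerate(kept)}


def make_triplets(lines, vocab, rng):
    """Yield (center, positive, negative) ID triplets.

    Assumption: the center is the middle of the in-vocabulary tokens on
    each line; all other tokens on that line are positive contexts.
    """
    ids = list(vocab.values())
    for line in lines:
        toks = [vocab[t] for t in line.split() if t in vocab]
        if len(toks) < 2:
            continue  # no context available for this line
        mid = len(toks) // 2
        center = toks[mid]
        for j, pos in enumerate(toks):
            if j == mid:
                continue
            neg = rng.choice(ids)  # uniform negative sampling (a simplification)
            yield (center, pos, neg)


lines = ["the old man walked home", "a dog ran to her"]
vocab = build_vocab(lines)
triplets = list(make_triplets(lines, vocab, random.Random(0)))
print(len(vocab), "vocabulary entries,", len(triplets), "triplets")
```

Each five-token line yields four triplets here, one per context position. A production pipeline would typically also filter tokens and write the results to TFRecord files rather than keep them in memory.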

### Artifact Storage
The pipeline creates year-specific subdirectories alongside the original text corpora:
<pre>
/vast/edk202/NLP_corpora/.../data/
├── 2018.txt
├── 2019.txt
├── 2020.txt
├── 2018_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
├── 2019_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
└── 2020_artifacts/
    ├── triplets.tfrecord.gz
    └── vocab.tfrecord.gz
</pre>
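
The layout above can be checked with a short standard-library sketch. The helper names `discover_years`, `artifact_paths`, and `missing_artifacts` are hypothetical and are not part of the project. The sketch assumes corpus files are named `<year>.txt` and that artifacts follow the `<year>_artifacts/` convention shown in the tree.

```python
from pathlib import Path


def discover_years(corpus_dir):
    """Return sorted year strings for files named like '<year>.txt'."""
    return sorted(
        p.stem for p in Path(corpus_dir).glob("*.txt") if p.stem.isdigit()
    )


def artifact_paths(corpus_dir, year):
    """Return the expected artifact files for one year."""
    out_dir = Path(corpus_dir) / f"{year}_artifacts"
    return {
        "triplets": out_dir / "triplets.tfrecord.gz",
        "vocab": out_dir / "vocab.tfrecord.gz",
    }


def missing_artifacts(corpus_dir):
    """List years whose artifact directory lacks at least one file."""
    return [
        year
        for year in discover_years(corpus_dir)
        if not all(p.exists() for p in artifact_paths(corpus_dir, year).values())
    ]
```

Calling `missing_artifacts(corpus_dir)` after a run gives a quick way to see which years still need processing, for example after an interrupted batch. The sort is lexicographic, which matches chronological order only while all years have the same number of digits.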

In [3]:
import os
import sys
import time
from pathlib import Path

# Enable automatic reloading of changed modules
%load_ext autoreload
%autoreload 2
print("Autoreload enabled; modules will update automatically when files change")

# Change to the project root so that `src.` imports resolve
os.chdir('/scratch/edk202/word2gm-fast')

# Import TensorFlow with its startup log output suppressed
from src.word2gm_fast.utils import import_tensorflow_silently

tf = import_tensorflow_silently(deterministic=False)
print(f"TensorFlow {tf.__version__} imported silently")

# Import optimized data pipeline modules
from src.word2gm_fast.dataprep.pipeline import batch_prepare_training_data

print("All pipeline modules loaded successfully")
print("Ready to process corpus and generate training data")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Autoreload enabled; modules will update automatically when files change
TensorFlow 2.19.0 imported silently
All pipeline modules loaded successfully
Ready to process corpus and generate training data


## Prepare one or more corpora in parallel

In [None]:
# Configuration
corpus_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data"

# Process every corpus file found in corpus_dir; years=None auto-discovers the available years
results = batch_prepare_training_data(
    corpus_dir=corpus_dir,
    years=None,
    compress=True,
    max_workers=4,
    show_progress=True,
    show_summary=True
)

🔍 Discovering available corpus years...
📅 Available corpus years (390): 1578, 1579, 1583, 1587, 1590, 1594, 1595, 1597, 1598, 1600, 1602, 1603, 1604, 1608, 1609, 1611, 1613, 1615, 1620, 1622, 1623, 1626, 1627, 1629, 1631, 1632, 1633, 1634, 1635, 1640, 1642, 1643, 1644, 1647, 1650, 1652, 1655, 1660, 1661, 1662, 1667, 1668, 1669, 1671, 1673, 1674, 1675, 1677, 1678, 1679, 1680, 1681, 1682, 1683, 1684, 1685, 1686, 1687, 1688, 1689, 1690, 1691, 1692, 1693, 1694, 1695, 1696, 1697, 1698, 1699, 1700, 1701, 1702, 1703, 1704, 1705, 1706, 1707, 1708, 1709, 1710, 1711, 1712, 1713, 1714, 1715, 1716, 1717, 1718, 1719, 1720, 1721, 1722, 1723, 1724, 1725, 1726, 1727, 1728, 1729, 1730, 1731, 1732, 1733, 1734, 1735, 1736, 1737, 1738, 1739, 1740, 1741, 1742, 1743, 1744, 1745, 1746, 1747, 1748, 1749, 1750, 1751, 1752, 1753, 1754, 1755, 1756, 1757, 1758, 1759, 1760, 1761, 1762, 1763, 1764, 1765, 1766, 1767, 1768, 1769, 1770, 1771, 1772, 1773, 1774, 1775, 1776, 1777, 1778, 1779, 1780, 1781, 1782, 1783, 1784

Process ForkProcess-4:
Process ForkProcess-1:
Process ForkProcess-1:
Process ForkProcess-2:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/ext3/miniforge3/envs/word2gm-fast2/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/ext3/miniforge3/envs/word2gm-fast2/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/ext3/miniforge3/envs/word2gm-fast2/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/ext3/miniforge3/envs/word2gm-fast2/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/ext3/miniforge3/envs/word2gm-fast2/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/ext3/miniforge3/envs/word2gm-fast2/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._targe

KeyboardInterrupt: 