# Word2GM Training Data Pipeline

**Pipeline: Corpus file → TFRecord training artifacts (triplets and vocabulary)**

Use this notebook to prepare Google Books 5-gram corpora for Word2GM skip-gram training.

## Pipeline Workflow

1. **Input**: Preprocessed corpus file (e.g., `2019.txt`) on `/vast` NVMe storage
2. **Processing**: TensorFlow-native filtering, vocabulary building, and triplet generation
3. **Output**: TFRecord artifacts in organized subdirectories (e.g., `2019_artifacts/`)
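
To make the processing step concrete, here is a minimal pure-Python sketch of triplet generation. It is not the pipeline's TensorFlow-native implementation: the helper name `make_triplets`, the use of the middle token as the center word, and uniform negative sampling are all assumptions made for illustration. Each triplet pairs a center word with one true context word from the same 5-gram and one randomly drawn negative word.

```python
import random


def make_triplets(ngram_line, vocab, rng):
    """Build (center, positive, negative) index triplets from one n-gram line.

    Hypothetical helper: assumes whitespace-separated tokens, the middle
    token as the center word, and uniform negative sampling over `vocab`.
    """
    tokens = ngram_line.split()
    center_pos = len(tokens) // 2
    center = tokens[center_pos]
    if center not in vocab:
        return []  # skip lines whose center word is out of vocabulary

    vocab_ids = list(vocab.values())
    triplets = []
    for i, token in enumerate(tokens):
        # Skip the center word itself and any out-of-vocabulary context token
        if i == center_pos or token not in vocab:
            continue
        negative = rng.choice(vocab_ids)  # may coincide with a true context word
        triplets.append((vocab[center], vocab[token], negative))
    return triplets


# Toy vocabulary and 5-gram; "UNK" stands in for a filtered-out token
vocab = {"the": 0, "old": 1, "house": 2, "on": 3, "hill": 4}
rng = random.Random(0)
triplets = make_triplets("the old house on UNK", vocab, rng)
print(triplets)
```

A real implementation would more likely draw negatives from a frequency-weighted distribution and reject negatives that match the center or context word; this sketch skips both for brevity.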

### Artifact Storage
The pipeline creates year-specific subdirectories alongside the original text corpora:
<pre>
/vast/edk202/NLP_corpora/.../data/
├── 2018.txt
├── 2019.txt
├── 2020.txt
├── 2018_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
├── 2019_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
└── 2020_artifacts/
    ├── triplets.tfrecord.gz
    └── vocab.tfrecord.gz
</pre>
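
The layout above can be expressed as a small path helper. This is a sketch under stated assumptions: `artifact_paths` and `available_years` are hypothetical names, not part of `word2gm_fast`, and the `.gz` file names assume compression is enabled. The second helper lists the years in a range that actually have a corpus file on disk.

```python
from pathlib import Path


def artifact_paths(corpus_dir, year):
    """Return the corpus file and expected artifact paths for one year.

    Hypothetical helper mirroring the directory layout shown above.
    """
    corpus_dir = Path(corpus_dir)
    artifact_dir = corpus_dir / f"{year}_artifacts"
    return {
        "corpus": corpus_dir / f"{year}.txt",
        "triplets": artifact_dir / "triplets.tfrecord.gz",
        "vocab": artifact_dir / "vocab.tfrecord.gz",
    }


def available_years(corpus_dir, start, end):
    """List years in [start, end] (inclusive) whose corpus file exists."""
    return [
        year
        for year in range(start, end + 1)
        if artifact_paths(corpus_dir, year)["corpus"].is_file()
    ]


paths = artifact_paths("data", 2019)
print(paths["triplets"])
```

Calling `available_years` before a batch run shows which years in a requested range have no corpus file, so sparse early ranges can be narrowed instead of producing a long list of "Corpus file not found" failures.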

In [3]:
import os
import sys
import time
from pathlib import Path

# Enable automatic reloading of changed modules
%load_ext autoreload
%autoreload 2
print("Autoreload enabled; modules will update automatically when files change")

# Change to the project root so that `src.` imports resolve
os.chdir('/scratch/edk202/word2gm-fast')

# Import TensorFlow with its startup log output suppressed
from src.word2gm_fast.utils import import_tensorflow_silently

tf = import_tensorflow_silently(deterministic=False)
print(f"TensorFlow {tf.__version__} imported silently")

# Import optimized data pipeline modules
from src.word2gm_fast.dataprep.pipeline import batch_prepare_training_data

print("All pipeline modules loaded successfully")
print("Ready to process corpus and generate training data")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Autoreload enabled; modules will update automatically when files change
TensorFlow 2.19.0 imported silently
All pipeline modules loaded successfully
Ready to process corpus and generate training data


## Prepare one or more corpora in parallel

In [4]:
# Configuration
corpus_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data"

# Process a range of years (endpoints inclusive) with up to 4 parallel workers
results = batch_prepare_training_data(
    corpus_dir=corpus_dir,
    year_range="1400-1600",
    compress=True,
    max_workers=4,
    show_progress=True,
    show_summary=True
)

Year range '1400-1600' expanded to 201 years: 1400, 1401, 1402, 1403, 1404, 1405, 1406, 1407, 1408, 1409, 1410, 1411, 1412, 1413, 1414, 1415, 1416, 1417, 1418, 1419, 1420, 1421, 1422, 1423, 1424, 1425, 1426, 1427, 1428, 1429, 1430, 1431, 1432, 1433, 1434, 1435, 1436, 1437, 1438, 1439, 1440, 1441, 1442, 1443, 1444, 1445, 1446, 1447, 1448, 1449, 1450, 1451, 1452, 1453, 1454, 1455, 1456, 1457, 1458, 1459, 1460, 1461, 1462, 1463, 1464, 1465, 1466, 1467, 1468, 1469, 1470, 1471, 1472, 1473, 1474, 1475, 1476, 1477, 1478, 1479, 1480, 1481, 1482, 1483, 1484, 1485, 1486, 1487, 1488, 1489, 1490, 1491, 1492, 1493, 1494, 1495, 1496, 1497, 1498, 1499, 1500, 1501, 1502, 1503, 1504, 1505, 1506, 1507, 1508, 1509, 1510, 1511, 1512, 1513, 1514, 1515, 1516, 1517, 1518, 1519, 1520, 1521, 1522, 1523, 1524, 1525, 1526, 1527, 1528, 1529, 1530, 1531, 1532, 1533, 1534, 1535, 1536, 1537, 1538, 1539, 1540, 1541, 1542, 1543, 1544, 1545, 1546, 1547, 1548, 1549, 1550, 1551, 1552, 1553, 1554, 1555, 1556, 1557, 1558, 

1401 failed (1/201): Corpus file not found: /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data/1401.txt
1400 failed (2/201): Corpus file not found: /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data/1400.txt
1402 failed (3/201): Corpus file not found: /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data/1402.txt
1404 failed (4/201): Corpus file not found: /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data/1404.txt
1405 failed (5/201): Corpus file not found: /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data/1405.txt
1406 failed (6/201): Corpus file not found: /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data/1406.txt
1407 failed (7/201): Corpus file not found: /vast/edk202/NLP_corpora/Google_Books/20200217/eng-ficti