# Word2GM Training Data Pipeline

**Pipeline: Corpus file → TFRecord training artifacts (triplets and vocabulary)**

Use this notebook to prepare Google Books 5-gram corpora for Word2GM skip-gram training.

## Pipeline Workflow

1. **Input**: Preprocessed corpus file (e.g., `2019.txt`) in `/vast` NVMe storage
2. **Processing**: TensorFlow-native filtering, vocabulary building, and triplet generation
3. **Output**: TFRecord artifacts in organized subdirectories (e.g., `2019_artifacts/`)
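Before launching a long batch, it can help to see which yearly input files are present and non-empty. The sketch below is a hypothetical helper (`find_corpus_files` is not part of `word2gm_fast`); it assumes only the `<year>.txt` naming shown in this notebook.

```python
from pathlib import Path


def find_corpus_files(corpus_dir, years):
    """Split years into (usable, missing, empty) by inspecting <year>.txt files.

    Hypothetical helper: assumes one plain-text file per year named <year>.txt.
    """
    corpus_dir = Path(corpus_dir)
    usable, missing, empty = [], [], []
    for year in years:
        path = corpus_dir / f"{year}.txt"
        if not path.is_file():
            missing.append(year)      # no input file for this year
        elif path.stat().st_size == 0:
            empty.append(year)        # file exists but contains no data
        else:
            usable.append(year)
    return usable, missing, empty
```

Calling `find_corpus_files(corpus_dir, range(1600, 1801))` would report missing and zero-byte years up front, so they can be dropped from the request before any worker starts.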

### Artifact Storage
The pipeline creates year-specific subdirectories alongside the original text corpora:
<pre>
/vast/edk202/NLP_corpora/.../data/
├── 2018.txt
├── 2019.txt
├── 2020.txt
├── 2018_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
├── 2019_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
└── 2020_artifacts/
    ├── triplets.tfrecord.gz
    └── vocab.tfrecord.gz
</pre>
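The layout above can be expressed as a small path helper. This is an illustrative sketch rather than the pipeline's own code: `artifact_paths` is a hypothetical name, and the assumption that uncompressed output drops the `.gz` suffix is not confirmed by this notebook.

```python
from pathlib import Path


def artifact_paths(corpus_dir, year, compress=True):
    """Return the expected artifact locations for one year.

    Assumption: uncompressed artifacts use the same names without ".gz".
    """
    suffix = ".tfrecord.gz" if compress else ".tfrecord"
    artifact_dir = Path(corpus_dir) / f"{year}_artifacts"
    return {
        "dir": artifact_dir,
        "triplets": artifact_dir / f"triplets{suffix}",
        "vocab": artifact_dir / f"vocab{suffix}",
    }


paths = artifact_paths("/path/to/data", 2019)
print(paths["triplets"])
```

A helper like this makes it easy to check whether a year has already been processed, for example with `paths["triplets"].exists()`.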

## Set Up for Data Preparation

In [1]:
import sys
from pathlib import Path
import os

# Set project root and add src to path
PROJECT_ROOT = '/scratch/edk202/word2gm-fast'
project_root = Path(PROJECT_ROOT)
src_path = project_root / 'src'

if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

# Enable autoreload for development
%load_ext autoreload
%autoreload 2

# Import TensorFlow quietly using our simplified silencing
from word2gm_fast.utils.tf_silence import import_tf_quietly
tf = import_tf_quietly(force_cpu=False)  # Allow GPU usage in notebook

# Basic imports
import numpy as np
import pandas as pd

# Import the batch processing function directly
from word2gm_fast.dataprep.simple_pipeline import run_pipeline

print("Setup complete - ready for data preparation")
print(f"✅ TensorFlow {tf.__version__} imported quietly")

2025-07-15 00:11:22.311087: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-07-15 00:11:22.327766: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1752552682.346540 2977804 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1752552682.352161 2977804 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1752552682.366679 2977804 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

Setup complete - ready for data preparation
✅ TensorFlow 2.19.0 imported quietly


## Print Resource Summary

In [2]:
# Import and run resource summary
from word2gm_fast.utils.resource_summary import print_resource_summary

print_resource_summary()

2025-07-15 00:11:27.876782: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)


<pre>SYSTEM RESOURCE SUMMARY
=============================================
Hostname: cm011.hpc.nyu.edu

Job Allocation:
   CPUs: 14
   Memory: 125.0 GB
   Partition: short
   Job ID: 63738842
   Node list: cm011

Physical GPU Hardware:
   No physical GPUs allocated to this job

TensorFlow GPU Recognition:
   TensorFlow can access 0 GPU(s)
   Built with CUDA support: True
=============================================</pre>
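The summary lists 14 CPUs for this job, while the batch run later in this notebook reports 24 parallel workers. If you want the worker count to follow the allocation, one option is to derive it explicitly. The sketch below is an assumption-laden helper (`allocated_cpus` is a hypothetical name); it relies on the Slurm variable `SLURM_CPUS_PER_TASK`, which is only set when the job requests CPUs per task, and falls back to the process CPU affinity on Linux.

```python
import os


def allocated_cpus(env=None):
    """Best-effort count of CPUs available to this process."""
    env = os.environ if env is None else env
    slurm_value = env.get("SLURM_CPUS_PER_TASK")
    if slurm_value and slurm_value.isdigit() and int(slurm_value) > 0:
        return int(slurm_value)                    # value set by Slurm for the job
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))        # Linux: CPUs this process may use
    return os.cpu_count() or 1                     # portable fallback


print(allocated_cpus())
```

The result can then be passed as `max_workers=allocated_cpus()`, since the cells in this notebook already pass integer values for `max_workers`.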

## Prepare Corpora

Here, we run the data-preparation pipeline from start to finish — reading preprocessed ngram corpora, generating all valid triplets, extracting the vocabulary, and saving the triplets and vocabulary as `tfrecord` files.

### Options for Data Preparation

You can control which years are processed and how the batch runs by adjusting the arguments to `run_pipeline`:

**Ways to specify years:**
- `years="2010"`: Process a single year (only 2010).
- `years="2010,2012,2015"`: Process a comma-separated list of years.
- `years="2010-2015"`: Process an inclusive range of years (2010 through 2015).
- `years="2010,2012-2014,2016"`: Combine individual years and ranges (2010, 2012, 2013, 2014, 2016).

**Other options:**
- `compress`: If `True`, output TFRecords are gzip-compressed. If `False`, output is uncompressed.
- `show_progress`: If `True`, display a progress bar for each year.
- `show_summary`: If `True`, print a summary of the processed data for each year.
- `use_multiprocessing`: If `True`, process years in parallel using multiple CPU cores (recommended for large datasets).
- `max_workers`: Number of parallel workers. The cells below pass `None`, `2`, and `4`; with `None`, the run reported 24 workers.
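The year specifications listed above can be expanded with a few lines of Python. This is a minimal sketch of the format as described here, not the parser used inside `word2gm_fast`; `parse_years` is a hypothetical name.

```python
def parse_years(spec):
    """Expand a spec such as "2010,2012-2014,2016" into a sorted list of ints."""
    years = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue                      # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Descending range: {part!r}")
            years.update(range(start, end + 1))   # ranges are inclusive
        else:
            years.add(int(part))
    return sorted(years)


print(parse_years("2010,2012-2014,2016"))
```

Under this reading, `"1600-1800"` expands to 201 years, which is consistent with the run output in this notebook: 162 years processed after 39 missing files were skipped.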

**TensorFlow Logging:**
- TensorFlow logging is set to ERROR level to reduce verbose output
- The pipeline still works normally, but with cleaner console output
- Critical errors will still be displayed if they occur
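For reference, a common way to get this behavior in plain TensorFlow is shown below. It is a sketch only: what `import_tf_quietly` does internally is not shown in this notebook, so the correspondence is an assumption. The `TF_CPP_MIN_LOG_LEVEL` variable controls TensorFlow's C++ logging and must be set before the first `import tensorflow`.

```python
import os

# Must be set before TensorFlow is imported for the first time.
# "0" = all messages, "1" = hide INFO, "2" = hide INFO and WARNING,
# "3" = hide INFO, WARNING, and ERROR.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

try:
    import tensorflow as tf
except ImportError:
    tf = None  # TensorFlow not installed; only the environment variable is set.

if tf is not None:
    # Python-side logger, which is separate from the C++ log level above.
    tf.get_logger().setLevel("ERROR")
```

Child processes started after this point inherit the environment variable, which matters when years are processed by parallel workers.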

See the function docstring or source for more advanced options.

In [None]:
corpus_dir = '/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data'

results = run_pipeline(
    corpus_dir=corpus_dir,
    years="1600-1800",
    compress=False,
    show_progress=True,
    max_workers=None,
)


Processing 162 years: 1600-1800
Skipping 39 missing files
Using 24 parallel workers


2025-07-15 00:11:31.528978: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:31.547651: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:31.559913: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:31.608233: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:31.627596: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:31.634731: W tensorflow/core/framework/op_kernel.cc:1857] OP_REQUIRES failed at lookup_table_op.cc:1069 : INVALID_ARGUMENT: keys and values cannot be empty tensors.
2025-07-15 00:11:31.634752: I tensorflow/core/framewo

1644: FAILED (1/162)


2025-07-15 00:11:31.830826: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:31.832685: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:31.898333: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:31.906452: W tensorflow/core/framework/op_kernel.cc:1857] OP_REQUIRES failed at lookup_table_op.cc:1069 : INVALID_ARGUMENT: keys and values cannot be empty tensors.
2025-07-15 00:11:31.906476: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: INVALID_ARGUMENT: keys and values cannot be empty tensors.
2025-07-15 00:11:31.961171: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:31.9836

1615: FAILED (2/162)
1602: FAILED (3/162)


2025-07-15 00:11:32.034003: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:32.060917: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:32.072940: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:32.082013: W tensorflow/core/framework/op_kernel.cc:1857] OP_REQUIRES failed at lookup_table_op.cc:1069 : INVALID_ARGUMENT: keys and values cannot be empty tensors.
2025-07-15 00:11:32.082033: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: INVALID_ARGUMENT: keys and values cannot be empty tensors.
2025-07-15 00:11:32.086419: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:32.1077

1613: OK - 2 triplets (1.3s) [4/162]
1608: OK - 0 triplets (1.4s) [5/162]
1626: OK - 0 triplets (1.5s) [6/162]
1632: OK - 0 triplets (1.5s) [7/162]


2025-07-15 00:11:32.530097: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:32.563005: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:32.668868: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:32.680461: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:32.699525: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1635: OK - 0 triplets (1.7s) [8/162]
1634: OK - 0 triplets (1.7s) [9/162]
1611: OK - 0 triplets (1.8s) [10/162]
1640: OK - 2 triplets (1.8s) [11/162]


2025-07-15 00:11:32.736180: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:32.753656: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:32.756616: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:32.782248: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:32.788749: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:32.817307: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:32.829310: I tensorflow/core/framework/local_rendezvous.cc:407] L

1647: OK - 11 triplets (1.2s) [12/162]
1633: OK - 7 triplets (1.9s) [13/162]
1622: OK - 0 triplets (1.9s) [14/162]
1620: OK - 0 triplets (1.9s) [15/162]
1623: OK - 3 triplets (1.9s) [16/162]
1643: OK - 0 triplets (1.9s) [17/162]


2025-07-15 00:11:32.943319: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:33.024332: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:33.086008: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:33.093804: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:33.114808: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1652: OK - 10 triplets (1.0s) [18/162]
1604: OK - 48 triplets (2.3s) [19/162]
1655: OK - 0 triplets (1.0s) [20/162]
1629: OK - 8 triplets (2.3s) [21/162]
1609: OK - 30 triplets (2.4s) [22/162]
1631: OK - 21 triplets (2.4s) [23/162]
1600: OK - 78 triplets (2.4s) [24/162]


2025-07-15 00:11:33.152245: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:33.185863: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:33.196160: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:33.217557: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:33.242826: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:33.256783: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:33.329249: I tensorflow/core/framework/local_rendezvous.cc:407] L

1669: OK - 1 triplets (0.6s) [25/162]
1662: OK - 2 triplets (0.9s) [26/162]
1603: OK - 84 triplets (2.5s) [27/162]


2025-07-15 00:11:33.361918: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1660: OK - 31 triplets (1.3s) [28/162]
1668: OK - 7 triplets (1.0s) [29/162]
1677: OK - 0 triplets (0.9s) [30/162]


2025-07-15 00:11:33.721589: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:33.792852: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:33.811676: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:33.822510: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1661: OK - 4 triplets (1.4s) [31/162]
1675: OK - 18 triplets (1.0s) [32/162]
1674: OK - 0 triplets (1.1s) [33/162]
1642: OK - 0 triplets (3.0s) [34/162]


2025-07-15 00:11:34.186752: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:34.295279: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:34.337688: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1686: OK - 4 triplets (0.9s) [35/162]
1671: OK - 41 triplets (1.6s) [36/162]
1691: OK - 3 triplets (0.7s) [37/162]


2025-07-15 00:11:34.515605: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:34.565828: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:34.573847: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:34.573926: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:34.631470: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1690: OK - 0 triplets (1.0s) [38/162]
1695: OK - 0 triplets (0.8s) [39/162]
1693: OK - 4 triplets (0.9s) [40/162]
1689: OK - 21 triplets (1.4s) [41/162]
1650: OK - 319 triplets (3.0s) [42/162]
1680: OK - 51 triplets (1.8s) [43/162]
1673: OK - 122 triplets (2.1s) [44/162]
1687: OK - 36 triplets (1.6s) [45/162]
1694: OK - 6 triplets (1.1s) [46/162]
1682: OK - 53 triplets (1.8s) [47/162]


2025-07-15 00:11:34.868559: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:34.882515: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:34.906780: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:34.915005: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:34.928355: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1696: OK - 9 triplets (1.2s) [48/162]
1683: OK - 43 triplets (2.0s) [49/162]


2025-07-15 00:11:35.272420: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:35.281240: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:35.318056: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1699: OK - 22 triplets (1.0s) [50/162]
1692: OK - 12 triplets (1.6s) [51/162]
1698: OK - 0 triplets (1.0s) [52/162]
1697: OK - 0 triplets (1.1s) [53/162]
1702: OK - 7 triplets (0.9s) [54/162]


2025-07-15 00:11:35.645888: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:35.851856: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1707: OK - 8 triplets (1.0s) [55/162]
1712: OK - 1 triplets (0.8s) [56/162]


2025-07-15 00:11:36.585867: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:36.781227: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1700: OK - 40 triplets (2.0s) [57/162]
1709: OK - 81 triplets (1.7s) [58/162]
1713: OK - 21 triplets (1.5s) [59/162]


2025-07-15 00:11:36.887772: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1701: OK - 103 triplets (2.3s) [60/162]
1703: OK - 217 triplets (2.1s) [61/162]
1688: OK - 397 triplets (3.6s) [62/162]
1708: OK - 108 triplets (2.1s) [63/162]
1718: OK - 34 triplets (1.0s) [64/162]
1717: OK - 29 triplets (1.4s) [65/162]


2025-07-15 00:11:37.409018: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:37.685158: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1679: OK - 343 triplets (5.3s) [66/162]
1711: OK - 294 triplets (3.0s) [67/162]
1705: OK - 743 triplets (4.0s) [68/162]
1716: OK - 415 triplets (3.3s) [69/162]
1715: OK - 207 triplets (3.7s) [70/162]
1721: OK - 132 triplets (2.4s) [71/162]
1725: OK - 103 triplets (2.5s) [72/162]


2025-07-15 00:11:39.700634: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:39.828510: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1723: OK - 240 triplets (2.9s) [73/162]


2025-07-15 00:11:40.438627: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1734: OK - 16 triplets (0.9s) [74/162]
1731: OK - 116 triplets (1.8s) [75/162]
1627: OK - 1,003 triplets (10.0s) [76/162]
1706: OK - 73 triplets (6.2s) [77/162]
1735: OK - 49 triplets (1.6s) [78/162]
1724: OK - 574 triplets (4.6s) [79/162]


2025-07-15 00:11:41.777288: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1681: OK - 2,095 triplets (8.9s) [80/162]
1714: OK - 211 triplets (7.0s) [81/162]
1729: OK - 77 triplets (4.8s) [82/162]


2025-07-15 00:11:43.549648: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:44.145537: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1704: OK - 910 triplets (9.5s) [83/162]


2025-07-15 00:11:44.410691: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:44.583660: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:44.938189: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1741: OK - 827 triplets (4.2s) [84/162]


2025-07-15 00:11:47.201091: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1736: OK - 1,702 triplets (7.0s) [85/162]
1710: OK - 989 triplets (12.5s) [86/162]


2025-07-15 00:11:47.908165: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1738: OK - 516 triplets (7.7s) [87/162]
1678: OK - 1,407 triplets (16.3s) [88/162]
1737: OK - 1,475 triplets (8.6s) [89/162]
1746: OK - 552 triplets (3.7s) [90/162]
1740: OK - 1,358 triplets (8.1s) [91/162]
1745: OK - 185 triplets (6.8s) [92/162]
1733: OK - 1,065 triplets (12.3s) [93/162]
1684: OK - 2,959 triplets (18.5s) [94/162]
1744: OK - 463 triplets (9.3s) [95/162]


2025-07-15 00:11:53.869638: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1720: OK - 4,168 triplets (17.3s) [96/162]


2025-07-15 00:11:55.246145: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:11:55.273765: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1758: OK - 291 triplets (3.3s) [97/162]
1732: OK - 1,171 triplets (18.3s) [98/162]
1728: OK - 1,679 triplets (20.5s) [99/162]
1667: OK - 7,375 triplets (29.4s) [100/162]


2025-07-15 00:12:04.509534: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1743: OK - 2,573 triplets (22.1s) [101/162]
1748: OK - 1,090 triplets (18.8s) [102/162]
1759: OK - 2,040 triplets (11.1s) [103/162]
1727: OK - 9,518 triplets (31.2s) [104/162]


2025-07-15 00:12:11.398195: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:12:12.112060: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1730: OK - 8,263 triplets (33.6s) [105/162]
1739: OK - 7,568 triplets (31.4s) [106/162]
1764: OK - 164 triplets (6.3s) [107/162]
1753: OK - 4,490 triplets (23.2s) [108/162]
1726: OK - 12,784 triplets (36.3s) [109/162]
1754: OK - 3,550 triplets (23.3s) [110/162]
1751: OK - 10,439 triplets (25.6s) [111/162]
1722: OK - 7,684 triplets (39.8s) [112/162]
1756: OK - 4,614 triplets (25.5s) [113/162]


2025-07-15 00:12:19.526429: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1762: OK - 2,923 triplets (20.4s) [114/162]


2025-07-15 00:12:24.644947: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1763: OK - 1,566 triplets (20.2s) [115/162]
1761: OK - 4,616 triplets (27.7s) [116/162]
1757: OK - 8,222 triplets (36.0s) [117/162]


2025-07-15 00:12:29.629660: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:12:30.176433: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1752: OK - 13,627 triplets (40.8s) [118/162]
1685: OK - 10,242 triplets (57.4s) [119/162]
1765: OK - 9,420 triplets (28.9s) [120/162]
1778: OK - 2,284 triplets (12.4s) [121/162]


2025-07-15 00:12:41.752769: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:12:42.209573: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1766: OK - 8,810 triplets (34.0s) [122/162]


2025-07-15 00:12:42.625371: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:12:43.105133: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


In [55]:
# SIMPLE PIPELINE: Clean, minimal wrapper
print("🚀 SIMPLE PIPELINE")
print("=" * 50)

# Import the clean, simple pipeline
%autoreload 2
from word2gm_fast.dataprep.simple_pipeline import run_pipeline

# Set corpus directory
corpus_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data"

print(f"Corpus directory: {corpus_dir}")
print()

# Example 1: Process a few test years
print("📋 TESTING: Small batch (3 years)")
results = run_pipeline(
    corpus_dir=corpus_dir,
    years="1683,1684,1685",
    compress=True,
    max_workers=2,
    show_progress=True
)

print()
print("📋 RESULTS:")
for year, result in results.items():
    if "error" in result:
        print(f"❌ {year}: {result['error']}")
    else:
        triplets = result['triplet_count']
        vocab = result['vocab_size']
        duration = result['duration']
        print(f"✅ {year}: {triplets:,} triplets, {vocab:,} vocab ({duration:.1f}s)")

print()
print("🎯 SIMPLE PIPELINE FEATURES:")
print("✅ Minimal complexity - just specify years and run")
print("✅ Automatic parallel processing")
print("✅ Clean progress output")
print("✅ Error handling per year")
print("✅ One-pass vocab-from-triplets approach")
print("✅ Zero UNK contamination guaranteed")

print()
print("📖 USAGE EXAMPLES:")
print('run_pipeline(corpus_dir, "1680-1690")        # Process decade')
print('run_pipeline(corpus_dir, "1684,1690,1695")   # Specific years')
print('run_pipeline(corpus_dir, "1680-1690", max_workers=4)  # Custom workers')

print()
print("=" * 50)
print("🎉 Simple pipeline ready for production!")

🚀 SIMPLE PIPELINE
Corpus directory: /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data

📋 TESTING: Small batch (3 years)
Processing 3 years: 1683-1685
Using 2 parallel workers


2025-07-14 20:45:58.728821: E tensorflow/core/framework/node_def_util.cc:680] NodeDef mentions attribute use_unbounded_threadpool which is not in the op definition: Op<name=MapDataset; signature=input_dataset:variant, other_arguments: -> handle:variant; attr=f:func; attr=Targuments:list(type),min=0; attr=output_types:list(type),min=1; attr=output_shapes:list(shape),min=1; attr=use_inter_op_parallelism:bool,default=true; attr=preserve_cardinality:bool,default=false; attr=force_synchronous:bool,default=false; attr=metadata:string,default=""> This may be expected if your graph generating binary is newer  than this binary. Unknown attributes will be ignored. NodeDef: {{node ParallelMapDatasetV2/_4}}


1683: OK - 43 triplets (0.6s) [1/3]


2025-07-14 20:45:59.584267: E tensorflow/core/framework/node_def_util.cc:680] NodeDef mentions attribute use_unbounded_threadpool which is not in the op definition: Op<name=MapDataset; signature=input_dataset:variant, other_arguments: -> handle:variant; attr=f:func; attr=Targuments:list(type),min=0; attr=output_types:list(type),min=1; attr=output_shapes:list(shape),min=1; attr=use_inter_op_parallelism:bool,default=true; attr=preserve_cardinality:bool,default=false; attr=force_synchronous:bool,default=false; attr=metadata:string,default=""> This may be expected if your graph generating binary is newer  than this binary. Unknown attributes will be ignored. NodeDef: {{node ParallelMapDatasetV2/_4}}


1684: OK - 2,959 triplets (4.6s) [2/3]
1685: OK - 10,242 triplets (9.5s) [3/3]

Completed in 10.2s
✅ Successful: 3 years
📊 Total triplets: 13,244
📚 Average vocab: 2,526

📋 RESULTS:
✅ 1683: 43 triplets, 104 vocab (0.6s)
✅ 1684: 2,959 triplets, 2,533 vocab (4.6s)
✅ 1685: 10,242 triplets, 4,942 vocab (9.5s)

🎯 SIMPLE PIPELINE FEATURES:
✅ Minimal complexity - just specify years and run
✅ Automatic parallel processing
✅ Clean progress output
✅ Error handling per year
✅ One-pass vocab-from-triplets approach
✅ Zero UNK contamination guaranteed

📖 USAGE EXAMPLES:
run_pipeline(corpus_dir, "1680-1690")        # Process decade
run_pipeline(corpus_dir, "1684,1690,1695")   # Specific years
run_pipeline(corpus_dir, "1680-1690", max_workers=4)  # Custom workers

🎉 Simple pipeline ready for production!
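
The usage examples above pass `years` either as a range (`"1680-1690"`) or as a comma-separated list (`"1684,1690,1695"`). The sketch below shows one way such a string could be expanded into individual years. `parse_years` is a hypothetical helper, not part of `word2gm_fast`, and the pipeline's own parsing may differ. The inclusive range end is an assumption, based on the `"1680-1689"` run reporting 10 years.

```python
def parse_years(spec):
    """Expand a year spec such as "1680-1689" or "1684,1690,1695".

    Hypothetical helper for illustration only; ranges are treated as
    inclusive on both ends (an assumption, not confirmed pipeline behavior).
    """
    years = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1684,,1690"
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            if start > end:
                raise ValueError(f"Range start is after range end: {part!r}")
            years.update(range(start, end + 1))
        else:
            years.add(int(part))
    return sorted(years)


decade = parse_years("1680-1689")
picked = parse_years("1684,1690,1695")
print(len(decade), picked)
```

Deduplicating through a set means overlapping specs such as `"1680-1685,1684"` yield each year only once.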

In [57]:
results = run_pipeline(
    corpus_dir=corpus_dir,
    years="1680-1689",
    compress=True,
    max_workers=4,  # Use more workers for larger batches
    show_progress=True
)


Processing 10 years: 1680-1689
Using 4 parallel workers


2025-07-14 20:47:14.682083: E tensorflow/core/framework/node_def_util.cc:680] NodeDef mentions attribute use_unbounded_threadpool which is not in the op definition: Op<name=MapDataset; signature=input_dataset:variant, other_arguments: -> handle:variant; attr=f:func; attr=Targuments:list(type),min=0; attr=output_types:list(type),min=1; attr=output_shapes:list(shape),min=1; attr=use_inter_op_parallelism:bool,default=true; attr=preserve_cardinality:bool,default=false; attr=force_synchronous:bool,default=false; attr=metadata:string,default=""> This may be expected if your graph generating binary is newer  than this binary. Unknown attributes will be ignored. NodeDef: {{node ParallelMapDatasetV2/_4}}
2025-07-14 20:47:14.684088: E tensorflow/core/framework/node_def_util.cc:680] NodeDef mentions attribute use_unbounded_threadpool which is not in the op definition: Op<name=MapDataset; signature=input_dataset:variant, other_arguments: -> handle:variant; attr=f:func; attr=Targuments:list(type),m

1683: OK - 43 triplets (0.7s) [1/10]
1682: OK - 53 triplets (0.7s) [2/10]
1680: OK - 51 triplets (0.7s) [3/10]


2025-07-14 20:47:15.146576: E tensorflow/core/framework/node_def_util.cc:680] NodeDef mentions attribute use_unbounded_threadpool which is not in the op definition: Op<name=MapDataset; signature=input_dataset:variant, other_arguments: -> handle:variant; attr=f:func; attr=Targuments:list(type),min=0; attr=output_types:list(type),min=1; attr=output_shapes:list(shape),min=1; attr=use_inter_op_parallelism:bool,default=true; attr=preserve_cardinality:bool,default=false; attr=force_synchronous:bool,default=false; attr=metadata:string,default=""> This may be expected if your graph generating binary is newer  than this binary. Unknown attributes will be ignored. NodeDef: {{node ParallelMapDatasetV2/_4}}


1686: OK - 4 triplets (0.4s) [4/10]
1687: OK - 36 triplets (0.6s) [5/10]
1688: OK - 397 triplets (1.4s) [6/10]
1681: OK - 2,095 triplets (3.4s) [7/10]
1689: OK - 21 triplets (0.5s) [8/10]
1684: OK - 2,959 triplets (4.7s) [9/10]
1685: OK - 10,242 triplets (9.7s) [10/10]

Completed in 10.5s
✅ Successful: 10 years
📊 Total triplets: 15,901
📚 Average vocab: 1,077


In [4]:
# TEST: Improved Multiprocessing Silencing
print("🔇 TESTING IMPROVED MULTIPROCESSING SILENCING")
print("=" * 60)

# Ensure module path is available
import sys
import os
from pathlib import Path
PROJECT_ROOT = '/scratch/edk202/word2gm-fast'
src_path = Path(PROJECT_ROOT) / 'src'
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

# Import the pipeline function
from word2gm_fast.dataprep.simple_pipeline import run_pipeline

# Set corpus directory
corpus_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data"

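# Set up TensorFlow environment for worker processes.
# Note: logging variables such as TF_CPP_MIN_LOG_LEVEL generally need to be
# set before TensorFlow is first imported in a process, so this mainly
# affects newly started workers, not a TensorFlow already loaded in this kernel.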
from word2gm_fast.utils.tf_silence import setup_tf_environment
setup_tf_environment(force_cpu=True)

print("✅ TensorFlow environment configured for workers")
print(f"TF_CPP_MIN_LOG_LEVEL: {os.environ.get('TF_CPP_MIN_LOG_LEVEL', 'NOT SET')}")
print(f"TF_ENABLE_ONEDNN_OPTS: {os.environ.get('TF_ENABLE_ONEDNN_OPTS', 'NOT SET')}")

# Test with a small batch
print()
print("📋 Testing with small batch (should be much quieter)...")
results = run_pipeline(
    corpus_dir=corpus_dir,
    years="1683,1684",  # Just 2 files for testing
    compress=True,
    max_workers=2,
    show_progress=True
)

print()
print("📋 RESULTS:")
for year, result in results.items():
    if "error" in result:
        print(f"❌ {year}: {result['error']}")
    else:
        triplets = result['triplet_count']
        vocab = result['vocab_size']
        duration = result['duration']
        print(f"✅ {year}: {triplets:,} triplets, {vocab:,} vocab ({duration:.1f}s)")

print()
print("🎯 If you still see TensorFlow messages above, they are:")
print("• Coming from TensorFlow's C++ core (very hard to suppress)")
print("• Normal INFO messages, not errors")
print("• Much reduced compared to before!")

🔇 TESTING IMPROVED MULTIPROCESSING SILENCING
✅ TensorFlow environment configured for workers
TF_CPP_MIN_LOG_LEVEL: 3
TF_ENABLE_ONEDNN_OPTS: 0

📋 Testing with small batch (should be much quieter)...
Processing 2 years: 1683-1684
Using 2 parallel workers


2025-07-15 00:06:16.018190: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
2025-07-15 00:06:16.731085: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:06:17.177532: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1683: OK - 43 triplets (1.2s) [1/2]


2025-07-15 00:06:17.686904: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
2025-07-15 00:06:19.418593: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


1684: OK - 2,959 triplets (3.9s) [2/2]

Completed in 4.0s
✅ Successful: 2 years
📊 Total triplets: 3,002
📚 Average vocab: 1,318

📋 RESULTS:
✅ 1683: 43 triplets, 104 vocab (1.2s)
✅ 1684: 2,959 triplets, 2,533 vocab (3.9s)

🎯 If you still see TensorFlow messages above, they are:
• Coming from TensorFlow's C++ core (very hard to suppress)
• Normal INFO messages, not errors
• Much reduced compared to before!
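
The summary lines in the output above (successful years, total triplets, average vocab) can be reproduced from the `results` dictionary. The sketch below assumes the per-year layout used in the results loop: `triplet_count`, `vocab_size`, and `duration` on success, or an `error` key on failure. `summarize` is a hypothetical helper, and the string year keys are an assumption. The example values are copied from the two-year run above.

```python
def summarize(results):
    """Aggregate per-year pipeline results into summary statistics.

    Hypothetical helper: assumes each value is a dict with either an
    "error" key or "triplet_count" / "vocab_size" / "duration" keys.
    """
    ok = {year: r for year, r in results.items() if "error" not in r}
    failed = sorted(year for year in results if year not in ok)
    total_triplets = sum(r["triplet_count"] for r in ok.values())
    # Guard against division by zero when every year failed.
    avg_vocab = sum(r["vocab_size"] for r in ok.values()) / len(ok) if ok else 0.0
    return {
        "successful": len(ok),
        "failed": failed,
        "total_triplets": total_triplets,
        "avg_vocab": avg_vocab,
    }


# Values taken from the two-year test run shown above.
example = {
    "1683": {"triplet_count": 43, "vocab_size": 104, "duration": 1.2},
    "1684": {"triplet_count": 2959, "vocab_size": 2533, "duration": 3.9},
}
summary = summarize(example)
print(f"Successful: {summary['successful']} years")
print(f"Total triplets: {summary['total_triplets']:,}")
```

Keeping failed years in a separate list makes it easy to rerun only those years with a comma-separated `years` string.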


In [None]:
# TEST: Updated TensorFlow Silencing (No INFO/WARNING)
print("🔕 TESTING: TF_CPP_MIN_LOG_LEVEL=2 (No INFO/WARNING)")
print("=" * 60)

# Reset environment and apply updated silencing
import os
os.environ.pop('TF_CPP_MIN_LOG_LEVEL', None)  # Clear old setting

# Apply new silencing settings
from word2gm_fast.utils.tf_silence import setup_tf_environment
setup_tf_environment(force_cpu=True)

print("✅ Updated TensorFlow environment:")
print(f"TF_CPP_MIN_LOG_LEVEL: {os.environ.get('TF_CPP_MIN_LOG_LEVEL', 'NOT SET')} (should suppress INFO and WARNING)")
print(f"TF_SUPPRESS_LOGS: {os.environ.get('TF_SUPPRESS_LOGS', 'NOT SET')}")

# Test with a single year
print()
print("📋 Testing single year (should be very quiet)...")
results = run_pipeline(
    corpus_dir=corpus_dir,
    years="1683",  # Single small file
    compress=True,
    max_workers=1,  # Single worker for cleaner output
    show_progress=True
)

print()
print("📋 RESULT:")
for year, result in results.items():
    if "error" in result:
        print(f"❌ {year}: {result['error']}")
    else:
        triplets = result['triplet_count']
        vocab = result['vocab_size']
        duration = result['duration']
        print(f"✅ {year}: {triplets:,} triplets, {vocab:,} vocab ({duration:.1f}s)")

print()
print("🎯 TF_CPP_MIN_LOG_LEVEL=2 should suppress:")
print("• INFO messages (I) - Local rendezvous, End of sequence, etc.")
print("• WARNING messages (W) - OP_REQUIRES failed, etc.")
print("• Keep only ERROR and FATAL messages")
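
As a rough illustration of how the log-level setting reaches worker processes, the sketch below passes `TF_CPP_MIN_LOG_LEVEL` to a child process through its environment and reads it back. It deliberately does not import TensorFlow, so it only demonstrates environment inheritance, not the silencing itself. `quiet_tf_env` is a hypothetical helper and is not how `setup_tf_environment` is implemented. The level meanings in the comments follow TensorFlow's usual convention.

```python
import os
import subprocess
import sys

# TF_CPP_MIN_LOG_LEVEL thresholds for TensorFlow's C++ logging:
#   "0" shows everything, "1" hides INFO, "2" hides INFO and WARNING,
#   "3" hides INFO, WARNING, and ERROR.


def quiet_tf_env(level="2"):
    """Return a copy of the current environment with the TF log level set.

    Hypothetical helper for illustration; the variable generally has to be
    in place before TensorFlow is first imported in the target process.
    """
    env = dict(os.environ)
    env["TF_CPP_MIN_LOG_LEVEL"] = level
    return env


# Start a child interpreter with the modified environment and read back
# the value it sees; a real worker would import TensorFlow at this point.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['TF_CPP_MIN_LOG_LEVEL'])"],
    env=quiet_tf_env("2"),
    capture_output=True,
    text=True,
    check=True,
)
seen_by_child = child.stdout.strip()
print(f"Child process sees TF_CPP_MIN_LOG_LEVEL={seen_by_child}")
```

Building a separate environment dictionary leaves the notebook kernel's own settings untouched, which helps when comparing different log levels in one session.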