# Word2GM Training Data Pipeline

**Pipeline: Corpus file → TFRecord training artifacts (triplets and vocabulary)**

Use this notebook to prepare Google Books 5-gram corpora for Word2GM skip-gram training.

## Pipeline Workflow

1. **Input**: Preprocessed corpus file (e.g., `2019.txt`) in `/vast` NVMe storage
2. **Processing**: TensorFlow-native filtering, vocabulary building, and triplet generation
3. **Output**: TFRecord artifacts in organized subdirectories (e.g., `2019_artifacts/`)

### Artifact Storage
The pipeline creates year-specific subdirectories alongside the original text corpora:
<pre>
/vast/edk202/NLP_corpora/.../data/
├── 2018.txt
├── 2019.txt
├── 2020.txt
├── 2018_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
├── 2019_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
└── 2020_artifacts/
    ├── triplets.tfrecord.gz
    └── vocab.tfrecord.gz
</pre>
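
The layout above implies a simple mapping from a corpus file to its artifact directory. The following is a minimal sketch, assuming the `<year>.txt` and `<year>_artifacts/` naming shown in the tree. The helper name `artifact_paths` is hypothetical and not part of `word2gm_fast`.

```python
from pathlib import Path


def artifact_paths(corpus_file):
    """Return the assumed artifact locations for a corpus file such as 2019.txt."""
    corpus_file = Path(corpus_file)
    # Artifacts live in a sibling directory named after the file stem (the year)
    artifact_dir = corpus_file.with_name(f"{corpus_file.stem}_artifacts")
    return {
        "triplets": artifact_dir / "triplets.tfrecord.gz",
        "vocab": artifact_dir / "vocab.tfrecord.gz",
    }


paths = artifact_paths("data/2019.txt")
print(paths["triplets"].as_posix())
print(paths["vocab"].as_posix())
```

The `.gz` file names match the tree above. This notebook does not show the file names used for uncompressed output, so the sketch does not try to cover that case.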

## Set Up for Data Preparation

In [29]:
# Set project root directory and add `src` to path
import sys
from pathlib import Path
import os
import logging

# Suppress TensorFlow C++ logging; this must be set before TensorFlow is imported
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # 0=all, 1=hide INFO, 2=also hide WARNING, 3=also hide ERROR

# Limit Python-side loggers to ERROR and stop them propagating to the root logger
for logger_name in ['tensorflow', 'absl', 'h5py']:
    logging.getLogger(logger_name).setLevel(logging.ERROR)
    logging.getLogger(logger_name).propagate = False

# Redirect Python-level stderr into a buffer (stdout is left untouched).
# The redirect stays active until tf_redirect.__exit__(None, None, None) is called.
import io
import contextlib
tf_stderr = io.StringIO()
tf_redirect = contextlib.redirect_stderr(tf_stderr)
tf_redirect.__enter__()

PROJECT_ROOT = '/scratch/edk202/word2gm-fast'
project_root = Path(PROJECT_ROOT)
src_path = project_root / 'src'

if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

# Import the notebook setup utilities
from word2gm_fast.utils.notebook_setup import setup_data_preprocessing_notebook, enable_autoreload

# Enable autoreload for development
enable_autoreload()

# Set up environment (CPU-only for data preprocessing)
env = setup_data_preprocessing_notebook(project_root=PROJECT_ROOT)

# Extract commonly used modules for convenience
tf = env['tensorflow']
np = env['numpy']
pd = env['pandas']
batch_prepare_training_data = env['batch_prepare_training_data']
print_resource_summary = env['print_resource_summary']

# Redirect the process-level stderr (file descriptor 2) to /dev/null so that
# messages written directly by TensorFlow's C++ runtime are discarded.
# (Assigning to an attribute of the ctypes libc handle does not redirect anything.)
try:
    devnull_fd = os.open(os.devnull, os.O_WRONLY)
    os.dup2(devnull_fd, 2)
    os.close(devnull_fd)
except OSError:
    # If that doesn't work, we still have our other methods
    pass

print("TensorFlow logging has been aggressively suppressed")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


<pre>Autoreload enabled</pre>

<pre>Project root: /scratch/edk202/word2gm-fast
TensorFlow version: 2.19.0
Device mode: CPU-only</pre>

<pre>Data preprocessing environment ready!</pre>

TensorFlow logging has been aggressively suppressed


## Print Resource Summary

In [30]:
print_resource_summary()

<pre>SYSTEM RESOURCE SUMMARY
============================================================
Hostname: cm013.hpc.nyu.edu

Job Allocation:
   CPUs: 14
   Memory: 125.0 GB
   Requested partitions: short,cs,cm,cpu_a100_2,cpu_a100_1,cpu_gpu
   Running on: SSH failed: Host key verification failed.
   Job ID: 63583584
   Node list: cm013

GPU Information:
   Error: NVML Shared Library Not Found

TensorFlow GPU Detection:
   TensorFlow detects 0 GPU(s)
   Built with CUDA: True
============================================================</pre>

## Prepare Corpora

This section runs the data-preparation pipeline end to end: it reads the preprocessed n-gram corpora, generates all valid triplets, extracts the vocabulary, and saves the triplets and vocabulary as `tfrecord` files.

### Options for Data Preparation

You can control which years are processed and how the batch preparation runs by adjusting the arguments to `batch_prepare_training_data`:

**Ways to specify years:**
- `year_range="2010"` — Process a single year (e.g., only 2010).
- `year_range="2010,2012,2015"` — Process a comma-separated list of years.
- `year_range="2010-2015"` — Process a range of years, inclusive (2010 through 2015).
- `year_range="2010,2012-2014,2016"` — Combine individual years and ranges (2010, 2012, 2013, 2014, 2016).
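
The accepted formats can be illustrated with a small parser. This is only a sketch of the syntax described above, not the parser that `batch_prepare_training_data` actually uses. The function name `parse_year_range` is hypothetical.

```python
def parse_year_range(year_range):
    """Expand a spec such as "2010,2012-2014,2016" into a sorted list of years."""
    years = set()
    for part in year_range.split(","):
        part = part.strip()
        if not part:
            continue  # Ignore empty items such as a trailing comma
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            if start > end:
                raise ValueError(f"Invalid range: {part!r}")
            years.update(range(start, end + 1))  # Ranges are inclusive
        else:
            years.add(int(part))
    return sorted(years)


print(parse_year_range("2010,2012-2014,2016"))  # [2010, 2012, 2013, 2014, 2016]
```

Under this reading, `"1600-1650"` expands to 51 years, because both endpoints are included.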

**Other options:**
- `compress` — If `True`, output TFRecords are gzip-compressed. If `False`, output is uncompressed.
- `show_progress` — If `True`, display a progress bar for each year.
- `show_summary` — If `True`, print a summary of the processed data for each year.
- `use_multiprocessing` — If `True`, process years in parallel using multiple CPU cores (recommended for large datasets).

**TensorFlow Logging:**
- The Python-side TensorFlow logger is set to ERROR, and `TF_CPP_MIN_LOG_LEVEL=3` hides C++ INFO, WARNING, and ERROR messages
- The pipeline still works normally, but with cleaner console output
- Python exceptions raised by the pipeline are still reported if they occur

See the function docstring or source for more advanced options.

In [31]:
# Configuration
corpus_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data"

downsample_threshold = 1  # Typical values: 1e-3 (conservative), 1e-4 (moderate), 1e-5 (aggressive)

# Further suppress TensorFlow logging for this operation
import io
import logging
import os
import sys
import tensorflow as tf

# Keep the TF and root loggers at ERROR level
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # Generally only effective if set before TF is first imported
tf.get_logger().setLevel('ERROR')  # Only errors from TF logger
logging.root.setLevel(logging.ERROR)  # Only errors from root logger

# Capture stderr during processing; stdout is left alone so progress stays visible
original_stderr = sys.stderr
process_stderr = io.StringIO()

print("Starting batch processing...")
print("(TensorFlow logging has been suppressed - only pipeline results will be shown)")

try:
    # Redirect stderr to capture TF noise
    sys.stderr = process_stderr
    
    # Process year(s)
    results = batch_prepare_training_data(
        corpus_dir=corpus_dir,
        year_range="1600-1650",
        compress=False,
        show_progress=True,
        show_summary=True,
        use_multiprocessing=True,
        downsample_threshold=downsample_threshold
    )
    
    # Print only the important progress messages
    progress_output = process_stderr.getvalue()
    filtered_lines = []
    for line in progress_output.split('\n'):
        # Only keep lines with the year and triplet/vocab information
        if any(x in line for x in ['complete', 'Successful:', 'Total triplets:', 'Average vocab']):
            filtered_lines.append(line)
    
    if filtered_lines:
        print("\nPROGRESS SUMMARY (TF noise filtered out):")
        print("\n".join(filtered_lines))
    
finally:
    # Restore stderr
    sys.stderr = original_stderr

print("\nBatch processing complete!")
# Count only the years whose result does not carry an 'error' entry
successful = sum(1 for data in results.values() if 'error' not in data)
print(f"Processed {successful} of {len(results)} years successfully.")

# Display a clearer summary of vocabulary stats
print("\nVOCABULARY STATISTICS EXPLANATION:")
print("=" * 60)
print("Total Vocabulary Size:   The complete number of tokens in the vocabulary file")
print("                         (All possible tokens that could appear in triplets)")
print("")
print("Unique Tokens in Triplets: The actual number of distinct token indices used")
print("                           in the generated training triplets")
print("")
print("Token Coverage %:        Percentage of vocabulary tokens that appear in triplets")
print("                         (unique_tokens ÷ total_vocab_size × 100)")
print("")
print("Unused Tokens:           Vocabulary tokens that never appear in any triplets")
print("                         (total_vocab_size - unique_tokens)")
print("=" * 60)

# Print a summary table for all processed years
if len(results) > 0:
    print("\nDETAILED VOCABULARY STATISTICS BY YEAR:")
    print("=" * 90)
    print(f"{'Year':<10} {'Vocab Size':<15} {'Unique Tokens':<20} {'Coverage %':<15} {'Unused Tokens':<15}")
    print("-" * 90)
    
    for year, data in sorted(results.items()):
        if 'error' not in data:
            vocab_size = data.get('vocab_size', 0)
            unique_tokens = data.get('unique_token_count', 0)
            coverage = data.get('unique_token_percentage', 0)
            unused = data.get('unused_token_count', 0)
            
            print(f"{year:<10} {vocab_size:<15,d} {unique_tokens:<20,d} {coverage:<15.1f} {unused:<15,d}")
    
    print("=" * 90)

Starting batch processing...
(TensorFlow logging has been suppressed - only pipeline results will be shown)
Requested 51 years, found 26 corpus files.
Skipping 25 missing years (showing first 5): 1601, 1605, 1606, 1607, 1610, ...


PARALLEL BATCH PROCESSING
Processing 26 years
Using 14 parallel workers
Estimated speedup: 14.0x


1615 failed (1/26): cannot import name 'read_triplets_from_tfrecord' from 'word2gm_fast.io.triplets' (/scratch/edk202/word2gm-fast/src/word2gm_fast/io/triplets.py)
1602 failed (2/26): cannot import name 'read_triplets_from_tfrecord' from 'word2gm_fast.io.triplets' (/scratch/edk202/word2gm-fast/src/word2gm_fast/io/triplets.py)
1620 failed (3/26): cannot import name 'read_triplets_from_tfrecord' from 'word2gm_fast.io.triplets' (/scratch/edk202/word2gm-fast/src/word2gm_fast/io/triplets.py)
1608 failed (4/26): cannot import name 'read_triplets_from_tfrecord' from 'word2gm_fast.io.triplets' (/scratch/edk202/word2gm-fast/src/word2gm_fast/io/triplets.py)
1626 failed (5/26): cannot import name 'read_triplets_from_tfrecord' from 'word2gm_fast.io.triplets' (/scratch/edk202/word2gm-fast/src/word2gm_fast/io/triplets.py)
1622 failed (6/26): cannot import name 'read_triplets_from_tfrecord' from 'word2gm_fast.io.triplets' (/scratch/edk202/word2gm-fast/src/word2gm_fast/io/triplets.py)
1611 failed (7/2

In [None]:
# Find all vocab.txt files under corpus_dir and show how to delete them (dry run by default)
import subprocess
import os

# First, let's find all vocab.txt files to see what will be deleted
print("Searching for vocab.txt files...")
result = subprocess.run([
    'find', corpus_dir, '-name', 'vocab.txt', '-type', 'f'
], capture_output=True, text=True)

if result.returncode == 0:
    files = result.stdout.strip().split('\n') if result.stdout.strip() else []
    if files and files[0]:  # Check if we actually found files
        print(f"Found {len(files)} vocab.txt files:")
        for file in files:
            print(f"  {file}")
        
        # Dry run: print the commands that would delete the files instead of deleting them
        print(f"\nTo delete all {len(files)} vocab.txt files, run:")
        print(f"find {corpus_dir} -name 'vocab.txt' -type f -delete")
        print("\nOr if you want to see what's being deleted:")
        print(f"find {corpus_dir} -name 'vocab.txt' -type f -print -delete")
        
        # Uncomment the line below to actually delete the files
        # subprocess.run(['find', corpus_dir, '-name', 'vocab.txt', '-type', 'f', '-delete'])
        # print("Files deleted!")
        
    else:
        print("No vocab.txt files found in the directory tree.")
else:
    print(f"Error searching for files: {result.stderr}")
    
print(f"\nNote: This searches recursively starting from: {corpus_dir}")
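
The same dry run can be done without shelling out to `find`. The following is a minimal sketch using only the standard library. The helper `find_vocab_files` and its `delete` flag are illustrative and not part of the pipeline, and the demonstration runs against a throwaway directory rather than the real corpus tree.

```python
import tempfile
from pathlib import Path


def find_vocab_files(root, delete=False):
    """List vocab.txt files under root; remove them only if delete=True."""
    matches = sorted(p for p in Path(root).rglob("vocab.txt") if p.is_file())
    if delete:
        for path in matches:
            path.unlink()
    return matches


# Demonstrate on a temporary directory so nothing real is touched
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "2019_artifacts" / "vocab.txt"
    target.parent.mkdir()
    target.write_text("example\n")

    found = find_vocab_files(tmp)                 # Dry run: nothing is removed
    still_there = target.exists()
    removed = find_vocab_files(tmp, delete=True)  # Deletes the matches
    gone = not target.exists()

print(len(found), still_there, len(removed), gone)
```

Calling `find_vocab_files(corpus_dir)` with the default `delete=False` only lists matches, which mirrors the commented-out deletion step in the cell above.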