# Word2GM Training Data Pipeline

**Pipeline: Corpus file → TFRecord training artifacts (triplets and vocabulary)**

Use this notebook to prepare Google Books 5-gram corpora for Word2GM skip-gram training.

## Pipeline Workflow

1. **Input**: Preprocessed corpus file (e.g., `2019.txt`) in `/vast` NVMe storage
2. **Processing**: TensorFlow-native filtering, vocabulary building, and triplet generation
3. **Output**: TFRecord artifacts in organized subdirectories (e.g., `2019_artifacts/`)

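For orientation, the vocabulary and triplet steps in item 2 can be sketched in pure Python. This is not the `word2gm_fast` implementation, which is TensorFlow-native and not shown in this notebook. The function names, the choice of the middle token as the center word, and the uniform negative sampling are all assumptions made for illustration.

```python
import random

def build_vocab(lines):
    """Map each distinct token to an integer ID (sorted order is an assumption)."""
    tokens = sorted({tok for line in lines for tok in line.split()})
    return {tok: i for i, tok in enumerate(tokens)}

def generate_triplets(lines, vocab, rng):
    """Yield (center, positive, negative) ID triplets from 5-gram lines."""
    vocab_ids = list(vocab.values())
    for line in lines:
        tokens = line.split()
        if len(tokens) != 5:
            continue  # skip lines that are not 5-grams
        center = vocab[tokens[2]]  # middle token as center word (assumption)
        for pos, tok in enumerate(tokens):
            if pos == 2:
                continue
            negative = rng.choice(vocab_ids)  # uniform negative sample (assumption)
            yield center, vocab[tok], negative

corpus = ["the quick brown fox jumps", "a lazy dog sleeps here"]
vocab = build_vocab(corpus)
triplets = list(generate_triplets(corpus, vocab, random.Random(0)))
print(len(vocab), len(triplets))
```

Each 5-gram contributes four triplets here, one per context position. A production pipeline would typically also filter rare or unknown tokens before this step.
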
### **Artifact Storage**
The pipeline creates year-specific subdirectories alongside the original text corpora:
<pre>
/vast/edk202/NLP_corpora/.../data/
├── 2018.txt
├── 2019.txt
├── 2020.txt
├── 2018_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
├── 2019_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
└── 2020_artifacts/
    ├── triplets.tfrecord.gz
    └── vocab.tfrecord.gz
</pre>
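
The naming convention in the tree above can be expressed with `pathlib`. The helper below is a hypothetical illustration, not part of `word2gm_fast`. It assumes the artifact directory is always `<year>_artifacts` next to the corpus file and that the file names match the tree.

```python
from pathlib import Path

def artifact_paths(corpus_file):
    """Return the artifact directory and file paths for one corpus file."""
    corpus_file = Path(corpus_file)
    # 2019.txt -> 2019_artifacts, in the same parent directory
    artifact_dir = corpus_file.with_name(f"{corpus_file.stem}_artifacts")
    return {
        "dir": artifact_dir,
        "triplets": artifact_dir / "triplets.tfrecord.gz",
        "vocab": artifact_dir / "vocab.tfrecord.gz",
    }

paths = artifact_paths("data/2019.txt")
print(paths["triplets"])
```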

## Setup

In [1]:
%load_ext autoreload
%autoreload 2

import os
import sys
import warnings
import subprocess
from pathlib import Path

# Setup project path
project_root = Path('/scratch/edk202/word2gm-fast')
os.chdir(project_root)
src_path = project_root / 'src'
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

# Force CPU-only mode for the entire notebook to avoid GPU/multiprocessing conflicts
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

# Import TensorFlow silently (will use CPU-only due to environment variable above)
from word2gm_fast.utils.tf_silence import import_tensorflow_silently
tf = import_tensorflow_silently(force_cpu=True, gpu_memory_growth=False)

print(f"TensorFlow version: {tf.__version__}")
print("Running in CPU-only mode (optimal for data preprocessing + multiprocessing)")

# Import core dependencies
import numpy as np
import pandas as pd
import time
import psutil

# Import Word2GM modules
from word2gm_fast.dataprep.pipeline import batch_prepare_training_data

print("Setup complete; all modules loaded successfully")

TensorFlow version: 2.19.0
Running in CPU-only mode (optimal for data preprocessing + multiprocessing)
Setup complete; all modules loaded successfully
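
The setup cell sets `CUDA_VISIBLE_DEVICES` before importing TensorFlow. The ordering matters because the variable is read when CUDA is initialized, so setting it before the import is the safe convention. A minimal sketch of that guard follows. The helper name `force_cpu_only` is hypothetical and is not the project's `import_tensorflow_silently`.

```python
import os
import sys

def force_cpu_only():
    """Hide CUDA devices from frameworks imported later in this process.

    Returns False if TensorFlow was already imported, in which case the
    setting may come too late to take effect.
    """
    already_imported = "tensorflow" in sys.modules
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
    return not already_imported

if not force_cpu_only():
    print("Warning: TensorFlow was imported before GPUs were hidden")
```

After TensorFlow is imported, `tf.config.list_physical_devices("GPU")` should return an empty list if the setting took effect.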


In [2]:
# System Resource Summary
import socket
import pynvml

print("SYSTEM RESOURCE SUMMARY")
print("=" * 48)

# Hostname
hostname = socket.gethostname()
print(f"Hostname: {hostname}")

# Job-allocated CPUs (from SLURM)
cpus_allocated = int(os.environ.get('SLURM_CPUS_PER_TASK', psutil.cpu_count()))
print(f"Job-allocated CPUs: {cpus_allocated}")

# Job-allocated memory (from SLURM) 
mem_per_node_mb = os.environ.get('SLURM_MEM_PER_NODE')
if mem_per_node_mb:
    mem_gb = int(mem_per_node_mb) / 1024
    print(f"Job-allocated memory: {mem_gb:.1f} GB")

# Current partition (actual, not requested)
current_partition = os.environ.get('SLURM_JOB_PARTITION', 'unknown')
print(f"Current partition: {current_partition}")

# GPU Detection (CUDA only)
print("\nGPU Detection:")
try:
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()
    print(f"  CUDA GPUs detected: {device_count}")
    
    for i in range(device_count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
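        name = pynvml.nvmlDeviceGetName(handle)
        # Older pynvml releases return bytes; newer ones return str
        if isinstance(name, bytes):
            name = name.decode('utf-8')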
        memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        memory_gb = memory_info.total / (1024**3)
        print(f"    GPU {i}: {name} ({memory_gb:.1f} GB)")
        
except Exception:
    print("  No CUDA GPUs detected")

print("=" * 48)

SYSTEM RESOURCE SUMMARY
Hostname: cm015.hpc.nyu.edu
Job-allocated CPUs: 28
Job-allocated memory: 125.0 GB
Current partition: short,cs,cm,cpu_a100_2

GPU Detection:
  No CUDA GPUs detected
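
The resource cell falls back to `psutil.cpu_count()` when `SLURM_CPUS_PER_TASK` is unset, which reports every logical CPU on the node rather than the CPUs the job may actually use. A hedged alternative is sketched below. The helper `allocated_cpus` is hypothetical and relies on `os.sched_getaffinity`, which is only available on Linux.

```python
import os

def allocated_cpus(env=None):
    """Best-effort count of CPUs available to this job."""
    env = os.environ if env is None else env
    value = env.get("SLURM_CPUS_PER_TASK")
    if value:
        return int(value)
    if hasattr(os, "sched_getaffinity"):
        # CPUs this process is allowed to run on (respects cgroup/affinity limits)
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1

print(allocated_cpus({"SLURM_CPUS_PER_TASK": "28"}))
```

Passing the environment as a parameter keeps the function easy to test outside a SLURM job.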


## Prepare one or more corpora in parallel 
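
The call in this section passes `year_range="1701-1800"`, and the logged output reports 100 years, so the range is evidently inclusive at both ends. How `batch_prepare_training_data` parses the string is not shown here. The sketch below is a hypothetical parser that illustrates the inclusive interpretation. The comma-separated syntax it also accepts is an assumption.

```python
def parse_year_range(spec):
    """Expand '1701-1800' or '1701,1705-1707' into a sorted list of years."""
    years = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            if start > end:
                raise ValueError(f"Invalid range: {part!r}")
            years.update(range(start, end + 1))  # inclusive of both endpoints
        else:
            years.add(int(part))
    return sorted(years)

print(len(parse_year_range("1701-1800")))
```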

In [7]:
# Configuration
corpus_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data"

# Process years with multiprocessing (CPU-only mode configured in the Setup cell)
results = batch_prepare_training_data(
    corpus_dir=corpus_dir,
    year_range="1701-1800",
    compress=False,
    show_progress=True,
    show_summary=True,
    use_multiprocessing=True
)


PARALLEL BATCH PROCESSING
Processing 100 years
Using 28 parallel workers
Estimated speedup: 28.0x
1712 complete (1/100): 23 triplets, 36 vocab, 0.4s
1702 complete (2/100): 63 triplets, 75 vocab, 0.5s
1713 complete (3/100): 257 triplets, 210 vocab, 0.6s
1707 complete (4/100): 117 triplets, 108 vocab, 0.6s
1717 complete (5/100): 314 triplets, 312 vocab, 0.7s
1709 complete (6/100): 738 triplets, 465 vocab, 0.9s
1718 complete (7/100): 361 triplets, 321 vocab, 1.0s
1706 complete (8/100): 737 triplets, 483 vocab, 1.0s
1701 complete (9/100): 881 triplets, 542 vocab, 1.1s
1701 com