# Word2GM Training Data Pipeline

**Pipeline: Corpus file → TFRecord training artifacts (triplets and vocabulary)**

Use this notebook to prepare Google Books 5-gram corpora for Word2GM skip-gram training.

## Pipeline Workflow

1. **Input**: Preprocessed corpus file (e.g., `2019.txt`) in `/vast` NVMe storage
2. **Processing**: TensorFlow-native filtering, vocabulary building, and triplet generation
3. **Output**: TFRecord artifacts in organized subdirectories (e.g., `2019_artifacts/`)

### **Artifact Storage**
The pipeline creates year-specific subdirectories alongside the original text corpora:
<pre>
/vast/edk202/NLP_corpora/.../data/
├── 2018.txt
├── 2019.txt
├── 2020.txt
├── 2018_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
├── 2019_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
└── 2020_artifacts/
    ├── triplets.tfrecord.gz
    └── vocab.tfrecord.gz
</pre>
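
The layout above pairs each `<year>.txt` corpus with a sibling `<year>_artifacts/` directory. As a minimal sketch of that naming convention, the hypothetical helper `artifact_paths` below (it is not part of `word2gm_fast`) derives the expected locations from a corpus directory and a year. It assumes the two file names shown in the tree and uses a placeholder directory.

```python
from pathlib import Path


def artifact_paths(corpus_dir, year):
    """Return the expected corpus and artifact paths for one year.

    Hypothetical helper for illustration only: it mirrors the directory
    layout shown above and never touches the filesystem.
    """
    corpus_dir = Path(corpus_dir)
    artifact_dir = corpus_dir / f"{year}_artifacts"
    return {
        "corpus": corpus_dir / f"{year}.txt",
        "triplets": artifact_dir / "triplets.tfrecord.gz",
        "vocab": artifact_dir / "vocab.tfrecord.gz",
    }


# Placeholder directory; substitute the real corpus location.
for name, path in artifact_paths("/path/to/data", 2019).items():
    print(f"{name:>8}: {path}")
```

A helper like this can be handy for checking that both artifacts exist before starting a training run.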

## Setup

In [20]:
!ls /

NGC-DL-CONTAINER-LICENSE    environment  lib32	 opt   scratch	    sys
bin			    etc		 lib64	 proc  share	    tmp
boot			    ext3	 libx32  root  singularity  usr
cuda-keyring_1.1-1_all.deb  home	 media	 run   srv	    var
dev			    lib		 mnt	 sbin  state	    vast


In [6]:
%load_ext autoreload
%autoreload 2

import os
import sys
import warnings
import subprocess
from pathlib import Path

# Setup project path
project_root = Path('/scratch/edk202/word2gm-fast')
os.chdir(project_root)
src_path = project_root / 'src'
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

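# 🔧 CRITICAL: Set CUDA library paths BEFORE importing TensorFlow.
# Note: on Linux the dynamic loader reads LD_LIBRARY_PATH at process start, so this
# in-kernel update mainly affects child processes; the libcuda check further down
# reports whether the driver library can actually be loaded.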
print("🔧 Setting up CUDA environment for GPU support...")

# Set CUDA library paths (NYU Greene)
cuda_lib_paths = [
    '/usr/lib64',  # System NVIDIA driver libraries
    '/share/apps/cuda/11.6.2/lib64',  # CUDA toolkit
    '/share/apps/cuda/11.6.2/targets/x86_64-linux/lib',  # CUDA target libs
    '/usr/local/cuda/lib64'  # Default CUDA location
]

# Update LD_LIBRARY_PATH
current_ld_path = os.environ.get('LD_LIBRARY_PATH', '')
new_ld_paths = []

for cuda_path in cuda_lib_paths:
    if os.path.exists(cuda_path):
        new_ld_paths.append(cuda_path)
        print(f"✅ Found CUDA path: {cuda_path}")

if new_ld_paths:
    if current_ld_path:
        os.environ['LD_LIBRARY_PATH'] = ':'.join(new_ld_paths) + ':' + current_ld_path
    else:
        os.environ['LD_LIBRARY_PATH'] = ':'.join(new_ld_paths)
    print(f"✅ Updated LD_LIBRARY_PATH with {len(new_ld_paths)} CUDA paths")

# Test if libcuda.so.1 is now accessible
try:
    import ctypes
    ctypes.CDLL('libcuda.so.1')
    print("✅ libcuda.so.1 successfully loaded!")
    cuda_available = True
except Exception as e:
    print(f"⚠️  libcuda.so.1 still not accessible: {e}")
    cuda_available = False

# Import TensorFlow silently
from word2gm_fast.utils.tf_silence import import_tensorflow_silently
tf = import_tensorflow_silently()

print(f"TensorFlow version: {tf.__version__}")

# Import core dependencies
import numpy as np
import pandas as pd
import time
import psutil

# Import Word2GM modules
from word2gm_fast.dataprep.pipeline import batch_prepare_training_data

print("Setup complete; all modules loaded successfully")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
🔧 Setting up CUDA environment for GPU support...
✅ Found CUDA path: /usr/lib64
✅ Found CUDA path: /share/apps/cuda/11.6.2/lib64
✅ Found CUDA path: /share/apps/cuda/11.6.2/targets/x86_64-linux/lib
✅ Found CUDA path: /usr/local/cuda/lib64
✅ Updated LD_LIBRARY_PATH with 4 CUDA paths
⚠️  libcuda.so.1 still not accessible: libcuda.so.1: cannot open shared object file: No such file or directory
TensorFlow version: 2.19.0
Setup complete; all modules loaded successfully
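
The `libcuda.so.1 still not accessible` warning in the output above is consistent with how the Linux dynamic loader works: it reads `LD_LIBRARY_PATH` once at process start, so updating `os.environ` inside a running kernel generally does not change where `ctypes.CDLL` looks. One way to narrow the problem down is to search the candidate directories directly and load the library by absolute path. The sketch below uses a hypothetical helper, `find_shared_library`. The directory list is copied from the setup cell, and whether any of those directories actually contains `libcuda.so.1` depends on the node and container.

```python
import ctypes
from pathlib import Path


def find_shared_library(name, search_dirs):
    """Return the first existing path to `name` in `search_dirs`, else None."""
    for directory in search_dirs:
        candidate = Path(directory) / name
        if candidate.exists():
            return candidate
    return None


# Directories copied from the setup cell; adjust for your cluster.
search_dirs = [
    "/usr/lib64",
    "/share/apps/cuda/11.6.2/lib64",
    "/share/apps/cuda/11.6.2/targets/x86_64-linux/lib",
    "/usr/local/cuda/lib64",
]

lib_path = find_shared_library("libcuda.so.1", search_dirs)
if lib_path is None:
    print("libcuda.so.1 not found in any candidate directory")
else:
    try:
        # Passing an absolute path skips the library search for this file
        # (its own dependencies are still resolved the usual way).
        ctypes.CDLL(str(lib_path))
        print(f"Loaded {lib_path}")
    except OSError as exc:
        print(f"Found {lib_path} but could not load it: {exc}")
```

If the file is missing from every directory, the driver library is probably not exposed inside the environment at all, and no path change will help. For TensorFlow itself, the more reliable route is to set `LD_LIBRARY_PATH` before the kernel process starts, for example in the job script.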


In [2]:
# === CLUSTER RESOURCE MONITORING ===
import socket
import subprocess
import glob
import psutil
import os
import re

print("CLUSTER RESOURCE ALLOCATION SUMMARY")
print("=" * 50)

# Hostname
hostname = socket.gethostname()
print(f"Hostname: {hostname}")

# CPU Information
logical_cores = psutil.cpu_count(logical=True)
try:
    # Detect physical cores via thread siblings
    physical_cores = 0
    seen_cores = set()
    
    cpu_paths = glob.glob('/sys/devices/system/cpu/cpu[0-9]*')
    cpu_numbers = sorted([int(path.split('cpu')[-1]) for path in cpu_paths])
    
    for cpu_num in cpu_numbers:
        if cpu_num in seen_cores:
            continue
        siblings_path = f'/sys/devices/system/cpu/cpu{cpu_num}/topology/thread_siblings_list'
        try:
            with open(siblings_path, 'r') as f:
                siblings = f.read().strip()
            if ',' in siblings:
                sibling_list = [int(x) for x in siblings.split(',')]
            elif '-' in siblings:
                start, end = siblings.split('-')
                sibling_list = list(range(int(start), int(end) + 1))
            else:
                sibling_list = [int(siblings)]
            seen_cores.update(sibling_list)
            physical_cores += 1
        except (OSError, ValueError):
            physical_cores += 1
            seen_cores.add(cpu_num)
    
    if physical_cores > 0 and physical_cores != logical_cores:
        print(f"CPU cores: {physical_cores} physical, {logical_cores} logical (hyperthreading)")
    else:
        print(f"CPU cores: {logical_cores} logical")
except Exception:
    print(f"CPU cores: {logical_cores} logical")

# Memory Information
memory = psutil.virtual_memory()
total_memory_gb = memory.total / (1024**3)
available_memory_gb = memory.available / (1024**3)

# Check for SLURM memory limits
slurm_memory = None
try:
    slurm_mem_per_node = os.environ.get('SLURM_MEM_PER_NODE')
    if slurm_mem_per_node:
        slurm_memory = int(slurm_mem_per_node) / 1024  # MB to GB
except ValueError:
    pass

if slurm_memory and slurm_memory < total_memory_gb:
    print(f"Job-allocated memory: {slurm_memory:.1f} GB")
else:
    print(f"Memory: {total_memory_gb:.1f} GB total")

# Verify GPU
accessible_gpus = []
for gpu_id in range(4):
    device_path = f"/dev/nvidia{gpu_id}"
    if os.path.exists(device_path):
        try:
            with open(device_path, 'rb') as f:
                pass
            accessible_gpus.append(gpu_id)
        except (PermissionError, OSError):
            continue

if accessible_gpus:
    print(f"GPU: {len(accessible_gpus)} accessible")
    for gpu_id in accessible_gpus:
        print(f"  GPU {gpu_id}: /dev/nvidia{gpu_id}")
else:
    print("GPU: Not accessible")

print("\nSTORAGE QUOTAS AND USAGE")
print("=" * 50)

# Get quota information
try:
    cmd = ['ssh', '-o', 'ConnectTimeout=3', '-o', 'BatchMode=yes', 
           '-o', 'StrictHostKeyChecking=no', 'log-1.hpc.nyu.edu', 'myquota']
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=8)
    
    if result.returncode == 0 and result.stdout.strip():
        lines = result.stdout.strip().split('\n')
        
        # Clean ANSI color codes from output
        ansi_escape = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')
        
        # Parse quota output
        quota_data = []
        
        for line in lines:
            line = ansi_escape.sub('', line).strip()  # Remove color codes
            if line.startswith('/'):
                # Split by whitespace, but handle the complex format
                parts = line.split()
                if len(parts) >= 5:
                    filesystem = parts[0]
                    allocation = parts[3]
                    usage_info = parts[4]
                    
                    # Extract usage and percentage from usage_info
                    usage_parts = usage_info.split('/')
                    if len(usage_parts) >= 1:
                        space_part = usage_parts[0]
                        # Extract usage and percentage using regex
                        match = re.match(r'([0-9.]+[KMGT]?B)\(([0-9.]+)%\)', space_part)
                        if match:
                            used = match.group(1)
                            percent = match.group(2) + '%'
                        else:
                            # Try simpler pattern
                            if '(' in space_part and ')' in space_part:
                                used = space_part.split('(')[0]
                                percent_match = re.search(r'\(([0-9.]+)%\)', space_part)
                                percent = percent_match.group(1) + '%' if percent_match else "N/A"
                            else:
                                used = space_part
                                percent = "N/A"
                    else:
                        used = "N/A"
                        percent = "N/A"
                    
                    quota_data.append({
                        'filesystem': filesystem,
                        'allocation': allocation,
                        'used': used,
                        'percent': percent
                    })
        
        if quota_data:
            # Print formatted table
            print(f"{'Filesystem':<12} {'Allocation':<15} {'Used':<12} {'Percent':<8}")
            print("-" * 50)
            for item in quota_data:
                print(f"{item['filesystem']:<12} {item['allocation']:<15} {item['used']:<12} {item['percent']:<8}")
        else:
            print("No quota information found")
            
except Exception as e:
    print(f"Quota check failed: {str(e)}")

print("=" * 50)
print("\nResource summary complete")

CLUSTER RESOURCE ALLOCATION SUMMARY
Hostname: gr004.hpc.nyu.edu
CPU cores: 48 logical
Job-allocated memory: 125.0 GB
GPU: 1 accessible
  GPU 0: /dev/nvidia0

STORAGE QUOTAS AND USAGE
Filesystem   Allocation      Used         Percent 
--------------------------------------------------
/home        50.0GB/30.7K    15.56GB      31.12%  
/scratch     5.0TB/1.0M      323.29GB     6.31%   
/archive     2.0TB/20.5K     1262.12GB    61.63%  
/vast        2TB/5.0M        1.37TB       69.0%   

Resource summary complete
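
The physical-core detection in the cell above handles `thread_siblings_list` values that are either a pure comma list (such as `0,24`) or a single range (such as `0-1`). The kernel's CPU-list format can also mix the two (for example `0-1,24-25`), which that parsing would reject. A more general parser could look like the sketch below; `parse_cpu_list` is a hypothetical helper name, not something the notebook defines.

```python
def parse_cpu_list(text):
    """Parse a Linux CPU list such as '0,24', '0-3', or '0-1,24-25'."""
    cpus = []
    for chunk in text.strip().split(","):
        if not chunk:
            continue  # tolerate empty input or stray commas
        if "-" in chunk:
            start, end = chunk.split("-")
            cpus.extend(range(int(start), int(end) + 1))  # ranges are inclusive
        else:
            cpus.append(int(chunk))
    return cpus


for sample in ["0,24", "0-3", "0-1,24-25", "7"]:
    print(f"{sample!r:>12} -> {parse_cpu_list(sample)}")
```

Swapping this in for the three-way `if`/`elif`/`else` would keep the physical-core count correct on machines that report mixed lists.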


## Prepare one or more corpora in parallel 

In [None]:
# Configuration
corpus_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data"

# Process years with optional parallel processing
results = batch_prepare_training_data(
    corpus_dir=corpus_dir,
    year_range="1951-1952",
    compress=False,
    show_progress=True,
    show_summary=True,
    use_multiprocessing=True
)
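
The `year_range="1951-1952"` argument selects which yearly corpus files are processed. How `batch_prepare_training_data` parses that string is not shown in this notebook. As a sketch only, the code below assumes an inclusive `START-END` range and the `<year>.txt` naming from the Artifact Storage section. Both `expand_year_range` and `corpus_files` are hypothetical helpers.

```python
from pathlib import Path


def expand_year_range(year_range):
    """Expand 'START-END' (assumed inclusive) or a single 'YEAR' into a list of ints."""
    if "-" in year_range:
        start, end = (int(part) for part in year_range.split("-"))
        if start > end:
            raise ValueError(f"start year {start} is after end year {end}")
        return list(range(start, end + 1))
    return [int(year_range)]


def corpus_files(corpus_dir, year_range):
    """Map each year in the range to its expected '<year>.txt' path."""
    return [Path(corpus_dir) / f"{year}.txt" for year in expand_year_range(year_range)]


# Placeholder directory; substitute the real corpus location.
for path in corpus_files("/path/to/data", "1951-1952"):
    print(path)
```

Listing the expected inputs this way makes it easy to confirm that every file exists before launching a multiprocess run.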

In [8]:
# GPU Diagnostics - Check TensorFlow GPU configuration
print("TensorFlow GPU Diagnostics")
print("=" * 40)

# Check environment variables
print(f"CUDA_VISIBLE_DEVICES: {os.environ.get('CUDA_VISIBLE_DEVICES', 'Not set')}")

# Check TensorFlow GPU detection
print(f"\nTensorFlow GPU devices:")
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    for i, gpu in enumerate(gpus):
        print(f"  GPU {i}: {gpu}")
        # Check memory info
        try:
            details = tf.config.experimental.get_device_details(gpu)
            print(f"    Details: {details}")
        except Exception:
            print(f"    Details: Not available")
else:
    print("  No GPU devices detected by TensorFlow")

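# Test GPU computation
# NOTE: with soft device placement, a '/GPU:0' request can silently run on the CPU,
# so a successful result here does not by itself prove that a GPU was used.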
print(f"\nGPU computation test:")
try:
    with tf.device('/GPU:0'):
        a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
        b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
        c = tf.matmul(a, b)
        result = c.numpy()
    print(f"  ✅ GPU computation successful: {result.flatten()}")
except Exception as e:
    print(f"  ❌ GPU computation failed: {e}")
    print(f"  Falling back to CPU...")

# Check if pipeline functions are using GPU
print(f"\nChecking Word2GM pipeline GPU usage...")
try:
    # Check if the pipeline functions have device placement
    import inspect
    from word2gm_fast.dataprep import pipeline
    
    # Look for GPU-related code in the pipeline
    source = inspect.getsource(pipeline.batch_prepare_training_data)
    if 'GPU' in source or 'device' in source:
        print("  ✅ Pipeline appears to have GPU support")
    else:
        print("  ⚠️  Pipeline may not be using GPU - check implementation")
        
except Exception as e:
    print(f"  ❌ Could not inspect pipeline: {e}")

print("=" * 40)

TensorFlow GPU Diagnostics
CUDA_VISIBLE_DEVICES: 0

TensorFlow GPU devices:
  No GPU devices detected by TensorFlow

GPU computation test:
  ✅ GPU computation successful: [1. 3. 3. 7.]

Checking Word2GM pipeline GPU usage...
  ⚠️  Pipeline may not be using GPU - check implementation


In [7]:
# 🔧 SYSTEMATIC GPU DIAGNOSIS
print("🔧 SYSTEMATIC GPU DIAGNOSIS")
print("=" * 60)

# Step 1: Hardware-level GPU check
print("1. HARDWARE-LEVEL GPU CHECK:")
print("-" * 30)
import os
accessible_gpus = []
for gpu_id in range(4):
    device_path = f"/dev/nvidia{gpu_id}"
    if os.path.exists(device_path):
        try:
            with open(device_path, 'rb') as f:
                pass
            accessible_gpus.append(gpu_id)
            print(f"   ✅ GPU {gpu_id}: {device_path} - ACCESSIBLE")
        except (PermissionError, OSError) as e:
            print(f"   ❌ GPU {gpu_id}: {device_path} - PERMISSION ERROR: {e}")
    else:
        print(f"   ❌ GPU {gpu_id}: {device_path} - NOT FOUND")

print(f"   Summary: {len(accessible_gpus)} GPU(s) accessible at hardware level")

# Step 2: Environment variables
print("\n2. ENVIRONMENT VARIABLES:")
print("-" * 30)
env_vars = ['CUDA_VISIBLE_DEVICES', 'CUDA_DEVICE_ORDER', 'NVIDIA_VISIBLE_DEVICES']
for var in env_vars:
    value = os.environ.get(var, 'NOT SET')
    print(f"   {var}: {value}")

# Step 3: TensorFlow GPU support
print("\n3. TENSORFLOW GPU SUPPORT:")
print("-" * 30)
print(f"   TensorFlow version: {tf.__version__}")
print(f"   Built with CUDA: {tf.test.is_built_with_cuda()}")
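# tf.test.is_gpu_available() is deprecated; query the physical device list instead
print(f"   GPU support available: {bool(tf.config.list_physical_devices('GPU'))}")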

# Step 4: Physical devices detected by TensorFlow
print("\n4. TENSORFLOW PHYSICAL DEVICES:")
print("-" * 30)
all_devices = tf.config.list_physical_devices()
gpu_devices = tf.config.list_physical_devices('GPU')
print(f"   All devices: {len(all_devices)}")
for device in all_devices:
    print(f"     {device}")
print(f"   GPU devices: {len(gpu_devices)}")
for device in gpu_devices:
    print(f"     {device}")

# Step 5: CUDA library check
print("\n5. CUDA LIBRARY CHECK:")
print("-" * 30)
try:
    # Try to get CUDA version info
    import subprocess
    result = subprocess.run(['nvcc', '--version'], capture_output=True, text=True, timeout=5)
    if result.returncode == 0:
        print(f"   nvcc available: {result.stdout.split('release')[1].split(',')[0].strip()}")
    else:
        print(f"   nvcc not available or error: {result.stderr}")
except Exception as e:
    print(f"   nvcc check failed: {e}")

# Check for libcuda
import ctypes
try:
    libcuda = ctypes.CDLL('libcuda.so.1')
    print(f"   ✅ libcuda.so.1 loaded successfully")
except Exception as e:
    print(f"   ❌ libcuda.so.1 failed to load: {e}")

# Step 6: Module environment (HPC specific)
print("\n6. MODULE ENVIRONMENT (NYU Greene):")
print("-" * 30)
try:
    # `module` is a shell function rather than an executable, so invoke it via a login shell
    result = subprocess.run(['bash', '-lc', 'module list'], capture_output=True, text=True, timeout=5)
    module_output = result.stdout + result.stderr  # module tools commonly write to stderr
    if 'cuda' in module_output.lower():
        print(f"   ✅ CUDA module appears to be loaded")
        # Extract CUDA modules
        cuda_lines = [line for line in module_output.split('\n') if 'cuda' in line.lower()]
        for line in cuda_lines:
            print(f"     {line.strip()}")
    else:
        print(f"   ⚠️  No CUDA module detected in module list")
        print(f"   Available modules with 'cuda': ")
        avail_result = subprocess.run(['bash', '-lc', 'module avail cuda'], capture_output=True, text=True, timeout=5)
        print(f"     {avail_result.stdout + avail_result.stderr}")
except Exception as e:
    print(f"   Module check failed: {e}")

# Step 7: Recommended fixes
print("\n7. RECOMMENDED FIXES:")
print("-" * 30)
if len(accessible_gpus) > 0 and len(gpu_devices) == 0:
    print("   🎯 DIAGNOSIS: Hardware GPU accessible, but TensorFlow can't see it")
    print("   📋 LIKELY CAUSES:")
    print("      - CUDA/cuDNN version mismatch with TensorFlow")
    print("      - TensorFlow not built with GPU support")
    print("      - Missing CUDA modules on cluster")
    print("   🔧 RECOMMENDED ACTIONS:")
    print("      1. Load CUDA module: module load cuda/11.8")
    print("      2. Check TensorFlow-GPU installation")
    print("      3. Verify CUDA/cuDNN compatibility")
elif len(accessible_gpus) == 0:
    print("   🎯 DIAGNOSIS: No GPU accessible at hardware level")
    print("   🔧 RECOMMENDED ACTIONS:")
    print("      1. Request GPU node: salloc --gres=gpu:1")
    print("      2. Check job allocation")
else:
    print("   ✅ GPU appears to be properly configured")

print("=" * 60)

🔧 SYSTEMATIC GPU DIAGNOSIS
1. HARDWARE-LEVEL GPU CHECK:
------------------------------
   ✅ GPU 0: /dev/nvidia0 - ACCESSIBLE
   ❌ GPU 1: /dev/nvidia1 - PERMISSION ERROR: [Errno 1] Operation not permitted: '/dev/nvidia1'
   ❌ GPU 2: /dev/nvidia2 - PERMISSION ERROR: [Errno 1] Operation not permitted: '/dev/nvidia2'
   ❌ GPU 3: /dev/nvidia3 - PERMISSION ERROR: [Errno 1] Operation not permitted: '/dev/nvidia3'
   Summary: 1 GPU(s) accessible at hardware level

2. ENVIRONMENT VARIABLES:
------------------------------
   CUDA_VISIBLE_DEVICES: 0
   CUDA_DEVICE_ORDER: NOT SET
   NVIDIA_VISIBLE_DEVICES: all

3. TENSORFLOW GPU SUPPORT:
------------------------------
   TensorFlow version: 2.19.0
   Built with CUDA: True
   GPU support available: False

4. TENSORFLOW PHYSICAL DEVICES:
------------------------------
   All devices: 1
     PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')
   GPU devices: 0

5. CUDA LIBRARY CHECK:
------------------------------
   nvcc available: 12.6

## 🔧 GPU Fix Required: Restart Kernel

**DIAGNOSIS**: The CUDA runtime libraries were not loadable when TensorFlow was imported, so TensorFlow registered only the CPU device even though `/dev/nvidia0` is accessible.

**SOLUTION**: 
1. ✅ CUDA module loaded: `module load cuda/11.6.2`  
2. 🔄 **RESTART KERNEL** (Kernel → Restart Kernel)
3. 🚀 Re-run setup cells with CUDA properly available

After the restart, TensorFlow should detect GPU 0; re-run the GPU diagnostics cell to confirm that it appears in the physical device list.

In [9]:
# 🔍 DEFINITIVE GPU vs CPU TEST
print("🔍 TESTING ACTUAL DEVICE PLACEMENT")
print("=" * 50)

# Test with device logging enabled to see where computation REALLY runs
print("Testing with TensorFlow device placement logging:")
tf.debugging.set_log_device_placement(True)

try:
    print("\n1. Requesting /GPU:0:")
    with tf.device('/GPU:0'):
        a = tf.constant([[1.0, 2.0]])
        b = tf.constant([[3.0], [4.0]])
        c = tf.matmul(a, b)
        actual_device = c.device
        result = c.numpy()
    
    print(f"   Requested: /GPU:0")
    print(f"   Actually used: {actual_device}")
    if "GPU" in actual_device:
        print("   ✅ TRUE GPU computation")
    else:
        print("   ❌ CPU FALLBACK (soft placement)")

    print("\n2. Explicitly requesting /CPU:0:")
    with tf.device('/CPU:0'):
        a = tf.constant([[1.0, 2.0]])
        b = tf.constant([[3.0], [4.0]])
        c = tf.matmul(a, b)
        actual_device = c.device
        result = c.numpy()
    
    print(f"   Requested: /CPU:0")
    print(f"   Actually used: {actual_device}")

except Exception as e:
    print(f"   Error during device testing: {e}")

finally:
    # Turn off device logging
    tf.debugging.set_log_device_placement(False)

print("\n🎯 CONCLUSION:")
if len(tf.config.list_physical_devices('GPU')) == 0:
    print("   TensorFlow CANNOT see any GPU devices")
    print("   All '/GPU:0' requests are falling back to CPU")
    print("   The 'successful' GPU computation was actually CPU computation")
    print("   This confirms the GPU configuration issue")
else:
    print("   TensorFlow can see GPU devices - GPU computation should work")

print("=" * 50)

🔍 TESTING ACTUAL DEVICE PLACEMENT
Testing with TensorFlow device placement logging:

1. Requesting /GPU:0:
   Requested: /GPU:0
   Actually used: /job:localhost/replica:0/task:0/device:CPU:0
   ❌ CPU FALLBACK (soft placement)

2. Explicitly requesting /CPU:0:
   Requested: /CPU:0
   Actually used: /job:localhost/replica:0/task:0/device:CPU:0

🎯 CONCLUSION:
   TensorFlow CANNOT see any GPU devices
   All '/GPU:0' requests are falling back to CPU
   The 'successful' GPU computation was actually CPU computation
   This confirms the GPU configuration issue
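
The `"GPU" in actual_device` test above works for the device strings shown in this output. Placement can be read more explicitly by parsing the trailing `device:TYPE:INDEX` component of the string. The sketch below does that with two hypothetical helpers, `parse_device` and `fell_back_to_cpu`. It assumes device strings of the form printed above.

```python
import re

# Matches the trailing 'device:TYPE:INDEX' part of a TensorFlow device string.
_DEVICE_RE = re.compile(r"device:(?P<type>[A-Za-z_]+):(?P<index>\d+)$")


def parse_device(device_string):
    """Return (device_type, index) from a device string, or None if it does not match."""
    match = _DEVICE_RE.search(device_string)
    if match is None:
        return None
    return match.group("type").upper(), int(match.group("index"))


def fell_back_to_cpu(requested_type, actual_device):
    """True when the op ran on a different device type than the one requested."""
    parsed = parse_device(actual_device)
    return parsed is not None and parsed[0] != requested_type.upper()


# Device string taken from the output above.
actual = "/job:localhost/replica:0/task:0/device:CPU:0"
print(parse_device(actual))
print(fell_back_to_cpu("GPU", actual))
```

Comparing the parsed type with the requested one avoids false positives from substring matches and gives a single boolean that a setup cell could assert on.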
