# Word2GM Training Data Pipeline

**Pipeline: Corpus file → TFRecord training artifacts (triplets and vocabulary)**

Use this notebook to prepare Google Books 5-gram corpora for Word2GM skip-gram training.

## Pipeline Workflow

1. **Input**: Preprocessed corpus file (e.g., `2019.txt`) in `/vast` NVMe storage
2. **Processing**: TensorFlow-native filtering, vocabulary building, and triplet generation
3. **Output**: TFRecord artifacts in organized subdirectories (e.g., `2019_artifacts/`)
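To make the triplet-generation step concrete, here is a minimal pure-Python sketch of how one 5-gram line could be turned into (center, positive, negative) skip-gram triplets. It is not the pipeline's TensorFlow implementation: the helper `make_triplets`, the choice of the middle token as the center word, and the uniform negative sampling are all assumptions made for illustration.

```python
import random


def make_triplets(tokens, vocab, rng):
    """Yield (center, positive, negative) triplets for one n-gram line.

    Assumption: the middle token is the center word and every other
    token in the line is a positive context word.
    """
    center_idx = len(tokens) // 2
    center = tokens[center_idx]
    for i, context in enumerate(tokens):
        if i == center_idx:
            continue
        # Assumption: negatives are drawn uniformly from the vocabulary,
        # excluding the center word and the true context word.
        candidates = [w for w in vocab if w not in (center, context)]
        negative = rng.choice(candidates)
        yield center, context, negative


# Hypothetical toy vocabulary and 5-gram, for illustration only.
vocab = ["the", "quick", "brown", "fox", "jumps", "lazy", "dog"]
rng = random.Random(0)
triplets = list(make_triplets("the quick brown fox jumps".split(), vocab, rng))
for triplet in triplets:
    print(triplet)
```

Each 5-gram yields four triplets under these assumptions, one per context position around the center word.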

### Artifact Storage
The pipeline creates year-specific subdirectories alongside the original text corpora:
<pre>
/vast/edk202/NLP_corpora/.../data/
├── 2018.txt
├── 2019.txt
├── 2020.txt
├── 2018_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
├── 2019_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
└── 2020_artifacts/
    ├── triplets.tfrecord.gz
    └── vocab.tfrecord.gz
</pre>
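The layout above can be expressed as a small path helper. This is only a sketch of the naming convention shown in the tree; `artifact_paths` is a hypothetical function, not part of `word2gm_fast`, and the data directory in the usage line is a placeholder.

```python
from pathlib import Path


def artifact_paths(data_dir, year):
    """Return the expected artifact locations for one year's corpus.

    Mirrors the directory tree shown above: `<year>_artifacts/` sits
    next to `<year>.txt` and holds the two TFRecord files.
    """
    artifact_dir = Path(data_dir) / f"{year}_artifacts"
    return {
        "corpus": Path(data_dir) / f"{year}.txt",
        "triplets": artifact_dir / "triplets.tfrecord.gz",
        "vocab": artifact_dir / "vocab.tfrecord.gz",
    }


# Placeholder directory; substitute the real corpus location.
paths = artifact_paths("/path/to/data", 2019)
for name, path in paths.items():
    print(f"{name}: {path}")
```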

## Set Up for Data Preparation

In [1]:
# Set project root directory and add `src` to path
import sys
from pathlib import Path

PROJECT_ROOT = '/scratch/edk202/word2gm-fast'
project_root = Path(PROJECT_ROOT)
src_path = project_root / 'src'

if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

# Import the notebook setup utilities
from word2gm_fast.utils.notebook_setup import setup_data_preprocessing_notebook, enable_autoreload

# Enable autoreload for development
enable_autoreload()

# Set up environment (CPU-only for data preprocessing)
env = setup_data_preprocessing_notebook(project_root=PROJECT_ROOT)

# Extract commonly used modules for convenience
tf = env['tensorflow']
np = env['numpy']
pd = env['pandas']
batch_prepare_training_data = env['batch_prepare_training_data']
print_resource_summary = env['print_resource_summary']

Autoreload enabled


SyntaxError: invalid syntax (resource_summary.py, line 6)

## Print Resource Summary

In [5]:
print_resource_summary()

[autoreload of word2gm_fast.utils.resource_summary failed: Traceback (most recent call last):
  File "/ext3/miniforge3/envs/word2gm-fast2/lib/python3.12/site-packages/IPython/extensions/autoreload.py", line 283, in check
    superreload(m, reload, self.old_objects)
  File "/ext3/miniforge3/envs/word2gm-fast2/lib/python3.12/site-packages/IPython/extensions/autoreload.py", line 483, in superreload
    module = reload(module)
             ^^^^^^^^^^^^^^
  File "/ext3/miniforge3/envs/word2gm-fast2/lib/python3.12/importlib/__init__.py", line 131, in reload
    _bootstrap._exec(spec, module)
  File "<frozen importlib._bootstrap>", line 866, in _exec
  File "<frozen importlib._bootstrap_external>", line 995, in exec_module
  File "<frozen importlib._bootstrap_external>", line 1133, in get_code
  File "<frozen importlib._bootstrap_external>", line 1063, in source_to_code
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/scratch/edk202/word2gm-fast/src/word

SYSTEM RESOURCE SUMMARY
Hostname: gr009.hpc.nyu.edu

Job Allocation:
   CPUs: 14
   Memory: 125.0 GB
   Requested partitions: rtx8000,v100,a100_2,a100_1,h100_1
   Actually running on: rtx8000
   Job ID: 62853078
   Node list: gr009

GPU Information:
   CUDA GPUs detected: 1
   GPU 0: Quadro RTX 8000
      Memory: 0.5/45.0 GB (44.5 GB free)
      Temperature: 29°C
      Utilization: GPU 0%, Memory 0%

TensorFlow GPU Detection:
   TensorFlow detects 0 GPU(s)
   Built with CUDA: True
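The summary reports one CUDA GPU on the node but zero GPUs visible to TensorFlow, which is consistent with the CPU-only setup requested for preprocessing. How `setup_data_preprocessing_notebook` achieves this is not shown here. One common approach, sketched below as an assumption, is to set `CUDA_VISIBLE_DEVICES` to `-1` before TensorFlow is imported; `hide_gpus` is a hypothetical helper.

```python
import os


def hide_gpus(env=None):
    """Mark all CUDA devices as hidden in the given environment mapping.

    Must run before TensorFlow initializes CUDA to have any effect.
    """
    env = os.environ if env is None else env
    env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env


# Demonstrate on a plain dict so the real environment is left untouched here.
demo_env = hide_gpus({})
print(demo_env["CUDA_VISIBLE_DEVICES"])  # prints -1
```

In a real session the call would be `hide_gpus()` with no argument, placed ahead of `import tensorflow`.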


In [None]:
import os
import socket
import subprocess

# Show requested partitions (what you submitted to)
requested_partitions = os.environ.get('SLURM_JOB_PARTITION', 'unknown')
print(f"Requested partitions: {requested_partitions}")

# Get node name
nodename = os.environ.get('SLURMD_NODENAME', socket.gethostname())
print(f"Running on node: {nodename}")

# Get the actual partition by SSH'ing to login node
job_id = os.environ.get('SLURM_JOB_ID')
if job_id:
    try:
        # SSH to greene-login and run squeue to get actual partition
        ssh_cmd = ['ssh', 'greene-login', 'squeue', '-j', job_id, '-h', '-o', '%P']
        result = subprocess.run(ssh_cmd, capture_output=True, text=True, timeout=10)
        
        if result.returncode == 0:
            actual_partition = result.stdout.strip()
            print(f"Actually running on partition: {actual_partition}")
        else:
            print(f"SSH squeue failed: {result.stderr.strip()}")
            
    except subprocess.TimeoutExpired:
        print("SSH command timed out")
    except Exception as e:
        print(f"Error running SSH squeue: {e}")
else:
    print("No SLURM_JOB_ID found - not in a SLURM job")

# Show other useful SLURM info
print("\nOther SLURM info:")
print(f"Job ID: {job_id or 'N/A'}")
print(f"Node list: {os.environ.get('SLURM_JOB_NODELIST', 'N/A')}")
print(f"Allocated CPUs: {os.environ.get('SLURM_CPUS_PER_TASK', 'N/A')}")
print(f"Memory per node: {os.environ.get('SLURM_MEM_PER_NODE', 'N/A')} MB")

## Prepare One or More Corpora in Parallel

In [None]:
# Configuration
corpus_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data"

# Process years with multiprocessing (CPU-only mode is configured in the setup cell above)
results = batch_prepare_training_data(
    corpus_dir=corpus_dir,
    year_range="1701-1800",
    compress=False,
    show_progress=True,
    show_summary=True,
    use_multiprocessing=True
)
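The `year_range="1701-1800"` argument reads as an inclusive span of years. How `batch_prepare_training_data` parses this string is not shown in the notebook, so the following is a minimal sketch of one plausible interpretation; `parse_year_range` is a hypothetical helper, not a function from `word2gm_fast`.

```python
def parse_year_range(year_range):
    """Expand 'START-END' (inclusive) or a single 'YEAR' into a list of ints."""
    start, sep, end = year_range.partition("-")
    first = int(start)
    # A bare year such as "2019" has no separator, so it maps to itself.
    last = int(end) if sep else first
    if last < first:
        raise ValueError(f"end year {last} precedes start year {first}")
    return list(range(first, last + 1))


years = parse_year_range("1701-1800")
print(len(years), years[0], years[-1])  # prints 100 1701 1800
```

Under this reading the call above would cover 100 yearly corpus files, which is the kind of workload where `use_multiprocessing=True` is worthwhile.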