# Word2GM Training Data Pipeline

**Pipeline: Corpus file → TFRecord training artifacts (triplets and vocabulary)**

Use this notebook to prepare Google Books 5-gram corpora for Word2GM skip-gram training.

## Pipeline Workflow

1. **Input**: Preprocessed corpus file (e.g., `2019.txt`) in `/vast` NVMe storage
2. **Processing**: TensorFlow-native filtering, vocabulary building, and triplet generation
3. **Output**: TFRecord artifacts in organized subdirectories (e.g., `2019_artifacts/`)

### Artifact Storage
The pipeline creates year-specific subdirectories alongside the original text corpora:
<pre>
/vast/edk202/NLP_corpora/.../data/
├── 2018.txt
├── 2019.txt
├── 2020.txt
├── 2018_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
├── 2019_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
└── 2020_artifacts/
    ├── triplets.tfrecord.gz
    └── vocab.tfrecord.gz
</pre>
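The layout above can be expressed as a small path helper. This is only a sketch: `artifact_paths` is a hypothetical name, not a function from the `word2gm_fast` package, and the uncompressed `.tfrecord` file names are an assumption based on the `.gz` names shown in the tree.

```python
from pathlib import Path


def artifact_paths(data_dir, year, compress=True):
    """Return the expected corpus and artifact paths for one year.

    Hypothetical helper for illustration; it only builds paths and
    does not touch the filesystem.
    """
    data_dir = Path(data_dir)
    artifact_dir = data_dir / f"{year}_artifacts"
    # Assumption: uncompressed output simply drops the ".gz" suffix.
    suffix = ".tfrecord.gz" if compress else ".tfrecord"
    return {
        "corpus": data_dir / f"{year}.txt",
        "artifact_dir": artifact_dir,
        "triplets": artifact_dir / f"triplets{suffix}",
        "vocab": artifact_dir / f"vocab{suffix}",
    }


paths = artifact_paths("data", 2019)
print(paths["triplets"].as_posix())  # data/2019_artifacts/triplets.tfrecord.gz
```

Keeping the artifacts next to the source text file means a year's inputs and outputs can be located from the year number alone.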

## Set Up for Data Preparation

In [1]:
# Basic setup - add project src to path
import sys
from pathlib import Path
import os

# Set project root and add src to path
PROJECT_ROOT = '/scratch/edk202/word2gm-fast'
project_root = Path(PROJECT_ROOT)
src_path = project_root / 'src'

if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

# Enable autoreload for development
%load_ext autoreload
%autoreload 2

# Basic imports
import tensorflow as tf
import numpy as np
import pandas as pd

# Import the batch processing function directly
from word2gm_fast.dataprep.pipeline import batch_prepare_training_data

print("Setup complete - ready for data preparation")

2025-07-14 17:40:48.131008: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-07-14 17:40:48.147376: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1752529248.165213 2041250 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1752529248.170561 2041250 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1752529248.184454 2041250 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

Setup complete - ready for data preparation


## Print Resource Summary

In [2]:
# Import and run resource summary
from word2gm_fast.utils.resource_summary import print_resource_summary

print_resource_summary()

2025-07-14 17:40:52.170233: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)


<pre>SYSTEM RESOURCE SUMMARY
=============================================
Hostname: cm047.hpc.nyu.edu

Job Allocation:
   CPUs: 14
   Memory: 125.0 GB
   Partition: short
   Job ID: 63723940
   Node list: cm047

Physical GPU Hardware:
   No physical GPUs allocated to this job

TensorFlow GPU Recognition:
   TensorFlow can access 0 GPU(s)
   Built with CUDA support: True
=============================================</pre>

## Prepare Corpora

This section runs the data-preparation pipeline end to end: it reads the preprocessed n-gram corpora, generates all valid triplets, extracts the vocabulary, and saves the triplets and vocabulary as `tfrecord` files.
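To make "valid triplets" more concrete, here is a minimal pure-Python sketch of how (center, positive context, negative sample) triplets could be derived from one 5-gram line. It is not the pipeline's TensorFlow implementation. The function name `triplets_from_ngram`, the choice of the middle token as the center word, the rule that `UNK` tokens never appear in a triplet, and uniform negative sampling are all assumptions made for illustration.

```python
import random


def triplets_from_ngram(line, vocab, rng, unk="UNK"):
    """Yield (center, positive, negative) triplets from one 5-gram line.

    Illustrative sketch only; the real pipeline may define validity
    and negative sampling differently.
    """
    tokens = line.split()
    center_index = len(tokens) // 2
    center = tokens[center_index]
    if center == unk:
        return  # assumption: an unknown center word yields no triplets
    # Assumption: negatives are drawn uniformly from the known vocabulary.
    candidates = [word for word in vocab if word != unk]
    for i, context in enumerate(tokens):
        if i == center_index or context == unk:
            continue  # skip the center itself and unknown context words
        yield (center, context, rng.choice(candidates))


vocab = ["UNK", "the", "cat", "sat", "on", "mat"]
rng = random.Random(0)  # seeded so repeated runs draw the same negatives
for triplet in triplets_from_ngram("the cat sat UNK mat", vocab, rng):
    print(triplet)
```

In this sketch the line `the cat sat UNK mat` produces three triplets, one for each known context word around the center `sat`.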

### Options for Data Preparation

You can control which years are processed and how the batch preparation runs by adjusting the arguments to `batch_prepare_training_data`:

**Ways to specify years:**
- `year_range="2010"` — Process a single year (e.g., only 2010).
- `year_range="2010,2012,2015"` — Process a comma-separated list of years.
- `year_range="2010-2015"` — Process a range of years, inclusive (2010 through 2015).
- `year_range="2010,2012-2014,2016"` — Combine individual years and ranges (2010, 2012, 2013, 2014, 2016).
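The accepted `year_range` formats can be illustrated with a small parser. This is a sketch of the format as described above, not the parsing code used inside `batch_prepare_training_data`. `parse_year_range` is a hypothetical name, and sorting and de-duplicating the result is an assumption.

```python
def parse_year_range(year_range):
    """Expand a spec such as "2010,2012-2014,2016" into a sorted list of years.

    Hypothetical helper for illustration only.
    """
    years = set()
    for part in year_range.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Invalid range: {part!r}")
            years.update(range(start, end + 1))  # ranges are inclusive
        else:
            years.add(int(part))
    return sorted(years)


print(parse_year_range("2010,2012-2014,2016"))  # [2010, 2012, 2013, 2014, 2016]
```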

**Other options:**
- `compress` — If `True`, output TFRecords are gzip-compressed. If `False`, output is uncompressed.
- `show_progress` — If `True`, display a progress bar for each year.
- `show_summary` — If `True`, print a summary of the processed data for each year.
- `use_multiprocessing` — If `True`, process years in parallel using multiple CPU cores (recommended for large datasets).

**TensorFlow Logging:**
- TensorFlow logging is set to the ERROR level to reduce console noise.
- The pipeline's behavior is unchanged; only the log output is quieter.
- Critical errors are still displayed if they occur.
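For reference, one common way to get ERROR-level TensorFlow logging is sketched below. How the pipeline itself sets the level is not shown in this notebook, so treat this as an assumption. `quiet_tensorflow_logging` is a hypothetical helper name, while `TF_CPP_MIN_LOG_LEVEL` and `tf.get_logger()` are standard TensorFlow mechanisms.

```python
import os


def quiet_tensorflow_logging():
    """Limit TensorFlow console output to errors (illustrative sketch)."""
    # "2" hides INFO and WARNING messages from the C++ backend. It only
    # takes effect if set before TensorFlow is first imported.
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
    try:
        import tensorflow as tf
    except ImportError:
        return None  # TensorFlow is not installed in this environment
    # Raise the threshold of TensorFlow's Python logger as well.
    tf.get_logger().setLevel("ERROR")
    return tf


tf = quiet_tensorflow_logging()
```

Because the environment variable is read at import time, a call like this belongs at the top of the first notebook cell, before any other TensorFlow import.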

See the function docstring or source for more advanced options.

In [None]:
# Test the refactored frequency scanner with corpus_to_dataset
from word2gm_fast.dataprep.corpus_to_dataset import make_dataset
from word2gm_fast.dataprep.dataset_to_frequency import dataset_to_frequency

# Test corpus path
test_corpus_path = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data/2019.txt"

print("Testing refactored frequency scanner with modular approach...")
print("=" * 60)

# Step 1: Create dataset from corpus using make_dataset (unpack the tuple)
print("Step 1: Creating TextLineDataset from corpus...")
dataset, summary = make_dataset(test_corpus_path)
print(f"Dataset created from: {test_corpus_path}")

print("\nStep 2: Scanning frequencies from dataset...")
# Step 2: Scan frequencies from the dataset
frequency_table = dataset_to_frequency(dataset)

print("\n" + "=" * 60)
print("MODULAR FREQUENCY SCANNER TEST RESULTS")
print("=" * 60)
print(f"Total unique tokens: {len(frequency_table):,}")

# Show some examples
if frequency_table:
    print("\nTop 20 most frequent tokens:")
    sorted_items = sorted(frequency_table.items(), key=lambda x: x[1], reverse=True)
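    for rank, (token, freq) in enumerate(sorted_items[:20], start=1):
        print(f"  {rank:2d}. '{token}' -> {freq:,} times")

    print("\nLeast frequent tokens (showing last 10):")
    # Rank of the first tail item; clamped so tables with fewer than 10 tokens still number from 1
    first_tail_rank = max(len(sorted_items) - 9, 1)
    for rank, (token, freq) in enumerate(sorted_items[-10:], start=first_tail_rank):
        print(f"  {rank:2d}. '{token}' -> {freq:,} times")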
    
    # Check if UNK is present
    unk_freq = frequency_table.get('UNK', 0)
    if unk_freq > 0:
        print(f"\n'UNK' token frequency: {unk_freq:,}")
    else:
        print("\n'UNK' token not found in frequency table")

print("=" * 60)
print("✅ Modular approach working: make_dataset → dataset_to_frequency")

Testing refactored frequency scanner with modular approach...
Step 1: Creating TextLineDataset from corpus...
Dataset created from: /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data/2019.txt

Step 2: Scanning frequencies from dataset...
Converting dataset to frequency table...
  Processed 1,000,000 lines, collected 5,000,000 tokens
  Processed 2,000,000 lines, collected 10,000,000 tokens
  Processed 3,000,000 lines, collected 15,000,000 tokens
  Processed 4,000,000 lines, collected 20,000,000 tokens
