# Word2GM Training Data Pipeline

**Pipeline: Corpus file → TFRecord training artifacts (triplets and vocabulary)**

Use this notebook to prepare Google Books 5-gram corpora for Word2GM skip-gram training.

## Pipeline Workflow

1. **Input**: Preprocessed corpus file (e.g., `2019.txt`) in `/vast` NVMe storage
2. **Processing**: TensorFlow-native filtering, vocabulary building, and triplet generation
3. **Output**: TFRecord artifacts in organized subdirectories (e.g., `2019_artifacts/`)

### Artifact Storage
The pipeline creates year-specific subdirectories alongside the original text corpora:
<pre>
/vast/edk202/NLP_corpora/.../data/
├── 2018.txt
├── 2019.txt
├── 2020.txt
├── 2018_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
├── 2019_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
└── 2020_artifacts/
    ├── triplets.tfrecord.gz
    └── vocab.tfrecord.gz
</pre>
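The layout above can be expressed as a small path helper. This is a minimal sketch, not part of the pipeline: `artifact_paths` is a hypothetical name, and the `.tfrecord` suffix used for uncompressed output is an assumption, since only the `.gz` file names are shown here.

```python
from pathlib import Path


def artifact_paths(corpus_dir, year, compressed=True):
    """Return the expected corpus and artifact paths for one year.

    Hypothetical helper that mirrors the directory tree shown above.
    """
    # Assumption: uncompressed artifacts simply drop the ".gz" suffix.
    suffix = ".tfrecord.gz" if compressed else ".tfrecord"
    base = Path(corpus_dir)
    artifact_dir = base / f"{year}_artifacts"
    return {
        "corpus": base / f"{year}.txt",
        "triplets": artifact_dir / f"triplets{suffix}",
        "vocab": artifact_dir / f"vocab{suffix}",
    }


# Placeholder directory for illustration only.
paths = artifact_paths("/tmp/data", 2019)
for name, path in paths.items():
    print(f"{name}: {path}")
```

A helper like this makes it easy to check which years already have artifacts before launching a batch run.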

## Set Up for Data Preparation

In [2]:
# Set project root directory and add `src` to path
import sys
from pathlib import Path

PROJECT_ROOT = '/scratch/edk202/word2gm-fast'
project_root = Path(PROJECT_ROOT)
src_path = project_root / 'src'

if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

# Import the notebook setup utilities
from word2gm_fast.utils.notebook_setup import setup_data_preprocessing_notebook, enable_autoreload

# Enable autoreload for development
enable_autoreload()

# Set up environment (CPU-only for data preprocessing)
env = setup_data_preprocessing_notebook(project_root=PROJECT_ROOT)

# Extract commonly used modules for convenience
tf = env['tensorflow']
np = env['numpy']
pd = env['pandas']
batch_prepare_training_data = env['batch_prepare_training_data']
print_resource_summary = env['print_resource_summary']

<pre>Autoreload enabled</pre>

<pre>Project root: /scratch/edk202/word2gm-fast
TensorFlow version: 2.19.0
Device mode: CPU-only</pre>

<pre>Data preprocessing environment ready!</pre>

## Print Resource Summary

In [3]:
print_resource_summary()

<pre>SYSTEM RESOURCE SUMMARY
============================================================
Hostname: gv009.hpc.nyu.edu

Job Allocation:
   CPUs: 14
   Memory: 125.0 GB
   Requested partitions: v100,rtx8000,a100_2,a100_1,h100_1
   Running on: SSH failed: Host key verification failed.
   Job ID: 63450166
   Node list: gv009

GPU Information:
   CUDA GPUs detected: 1
   GPU 0: Tesla V100-PCIE-32GB
      Memory: 0.6/32.0 GB (31.4 GB free)
      Temperature: 31°C
      Utilization: GPU 0%, Memory 0%

TensorFlow GPU Detection:
   TensorFlow detects 0 GPU(s)
   Built with CUDA: True
============================================================</pre>

## Prepare Corpora

Here we run the data-preparation pipeline from start to finish: reading the preprocessed n-gram corpora, generating all valid triplets, extracting the vocabulary, and saving the triplets and vocabulary as TFRecord files.

### Options for Data Preparation

You can control which years are processed and how the batch preparation runs by adjusting the arguments to `batch_prepare_training_data`:

**Ways to specify years:**
- `year_range="2010"` — Process a single year (e.g., only 2010).
- `year_range="2010,2012,2015"` — Process a comma-separated list of years.
- `year_range="2010-2015"` — Process a range of years, inclusive (2010 through 2015).
- `year_range="2010,2012-2014,2016"` — Combine individual years and ranges (2010, 2012, 2013, 2014, 2016).
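The four `year_range` forms listed above could be expanded into a list of years roughly as follows. This is an illustrative sketch only: `parse_year_range` is a hypothetical function, and the parser inside `batch_prepare_training_data` may behave differently (for example, in how it handles whitespace or duplicate years).

```python
def parse_year_range(year_range):
    """Expand a string such as "2010,2012-2014,2016" into a sorted list of years.

    Hypothetical helper; not the pipeline's own parser.
    """
    years = set()
    for part in year_range.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = (int(piece) for piece in part.split("-", 1))
            if start > end:
                raise ValueError(f"Invalid range: {part!r}")
            years.update(range(start, end + 1))  # ranges are inclusive
        else:
            years.add(int(part))
    return sorted(years)


print(parse_year_range("2010,2012-2014,2016"))
# [2010, 2012, 2013, 2014, 2016]
```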

**Other options:**
- `compress` — If `True`, output TFRecords are gzip-compressed. If `False`, output is uncompressed.
- `show_progress` — If `True`, display a progress bar for each year.
- `show_summary` — If `True`, print a summary of the processed data for each year.
- `use_multiprocessing` — If `True`, process years in parallel using multiple CPU cores (recommended for large datasets).

See the function docstring or source for more advanced options.
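The configuration cell that follows also passes `downsample_threshold`, which the list above does not describe. As a rough illustration of why smaller thresholds are more aggressive, the sketch below uses the frequent-word subsampling rule from the original word2vec paper, where a word with relative frequency `f` is kept with probability `min(1, sqrt(t / f))`. Whether this pipeline uses exactly that rule is an assumption, and `keep_probability` is a hypothetical helper.

```python
import math


def keep_probability(word_frequency, threshold):
    """Probability of keeping a word under word2vec-style subsampling.

    Assumption: the rule is min(1, sqrt(threshold / frequency)); the
    pipeline's actual downsampling rule may differ.
    """
    if word_frequency <= 0:
        raise ValueError("word_frequency must be positive")
    return min(1.0, math.sqrt(threshold / word_frequency))


# A word that makes up 0.1% of all tokens, under three thresholds.
for threshold in (1e-3, 1e-4, 1e-5):
    prob = keep_probability(1e-3, threshold)
    print(f"threshold={threshold:g}: keep probability {prob:.3f}")
```

Under this rule, words rarer than the threshold are always kept, so lowering the threshold only affects the most frequent words.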

In [None]:
# Configuration
corpus_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data"

downsample_threshold = 1e-5  # Typical values: 1e-3 (conservative), 1e-4 (moderate), 1e-5 (aggressive)

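# Process year(s). Note: the output shown below is from a run that requested
# 301 years starting at 1400, not the 1701-1800 range set here.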
results = batch_prepare_training_data(
    corpus_dir=corpus_dir,
    year_range="1701-1800",
    compress=False,
    show_progress=True,
    show_summary=True,
    use_multiprocessing=True,
    downsample_threshold=downsample_threshold
)

Requested 301 years, found 71 corpus files.
Skipping 230 missing years (showing first 5): 1400, 1401, 1402, 1403, 1404, ...


PARALLEL BATCH PROCESSING
Processing 71 years
Using 14 parallel workers
Estimated speedup: 14.0x


1602 complete (1/71): 0 triplets, 1 vocab, 1.6s
1608 complete (2/71): 0 triplets, 4 vocab, 1.6s
1579 complete (3/71): 0 triplets, 16 vocab, 1.6s
1590 complete (4/71): 0 triplets, 5 vocab, 1.6s
1578 complete (5/71): 0 triplets, 12 vocab, 1.6s
1583 complete (6/71): 0 triplets, 27 vocab, 1.6s
1598 complete (7/71): 12 triplets, 195 vocab, 1.6s
1595 complete (8/71): 14 triplets, 223 vocab, 1.7s
1604 complete (9/71): 34 triplets, 409 vocab, 1.7s
1597 complete (10/71): 27 triplets, 338 vocab, 1.7s
1594 complete (11/71): 51 triplets, 527 vocab, 1.8s
1600 complete (12/71): 60 triplets, 535 vocab, 1.8s
1603 complete (13/71): 46 triplets, 522 vocab, 1.9s
1611 complete (14/71): 1 triplets, 14 vocab, 0.3s
1622 complete (15/71): 0 triplets, 6 vocab, 0.3s
1620 complete (16/71): 1 triplets, 12 vocab, 0.3s
1613 complete (17/71): 1 triplets, 27 vocab, 0.3s
1615 complete (18/71): 0 triplets, 1 vocab, 0.3s
1623 complete (19/71): 0 triplets, 59 vocab, 0.3s
1626 complete (20/71): 0 triplets, 3 vocab, 0.3s
1