# Word2GM Training Data Pipeline

**Pipeline: Corpus file → TFRecord training artifacts (triplets and vocabulary)**

Use this notebook to prepare Google Books 5-gram corpora for Word2GM skip-gram training.

## Pipeline Workflow

1. **Input**: Preprocessed corpus file (e.g., `2019.txt`) in `/vast` NVMe storage
2. **Processing**: TensorFlow-native filtering, vocabulary building, and triplet generation
3. **Output**: TFRecord artifacts in organized subdirectories (e.g., `2019_artifacts/`)
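
To make the triplet-generation step concrete, here is a minimal sketch of how one 5-gram line might be turned into skip-gram triplets. It is not the pipeline's implementation. It assumes that a triplet is `(center, positive context, negative sample)`, that the center is the middle token of the 5-gram, that a token is valid only if it is in the vocabulary, and that negatives are drawn uniformly from the vocabulary. The function name `triplets_from_5gram` is hypothetical.

```python
import random


def triplets_from_5gram(line, vocab, rng):
    """Return (center, context, negative) triplets for one 5-gram line.

    Assumptions (not taken from the pipeline source): the center word is
    the middle token, and out-of-vocabulary tokens are skipped.
    """
    tokens = line.split()
    center_index = 2
    if len(tokens) != 5 or tokens[center_index] not in vocab:
        return []

    center = tokens[center_index]
    negatives_pool = sorted(vocab)  # sorted so sampling is reproducible
    triplets = []
    for index, context in enumerate(tokens):
        if index == center_index or context not in vocab:
            continue
        negative = rng.choice(negatives_pool)  # uniform negative sample
        triplets.append((center, context, negative))
    return triplets


vocab = {"the", "quick", "brown", "fox", "jumps"}
example = triplets_from_5gram("the quick brown fox jumps", vocab, random.Random(0))
for triplet in example:
    print(triplet)
```

A real pipeline would typically sample negatives from a frequency-based distribution rather than uniformly. The uniform choice here only keeps the sketch short.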

### Artifact Storage
The pipeline creates year-specific subdirectories alongside the original text corpora:
<pre>
/vast/edk202/NLP_corpora/.../data/
├── 2018.txt
├── 2019.txt
├── 2020.txt
├── 2018_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
├── 2019_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
└── 2020_artifacts/
    ├── triplets.tfrecord.gz
    └── vocab.tfrecord.gz
</pre>
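
The layout above can be addressed programmatically. The sketch below builds the artifact paths for a given year and counts the raw records in a TFRecord file, choosing the compression type from the file suffix. The helper names (`artifact_paths`, `compression_type_for`, `count_records`) are hypothetical and not part of `word2gm_fast`. The record schema is not assumed, so records are only counted, not parsed.

```python
from pathlib import Path


def artifact_paths(corpus_dir, year):
    """Return the expected artifact paths for one year, per the layout above."""
    artifact_dir = Path(corpus_dir) / f"{year}_artifacts"
    return {
        "triplets": artifact_dir / "triplets.tfrecord.gz",
        "vocab": artifact_dir / "vocab.tfrecord.gz",
    }


def compression_type_for(path):
    """Pick the TFRecordDataset compression type from the file suffix."""
    return "GZIP" if str(path).endswith(".gz") else ""


def count_records(path):
    """Count raw records in a TFRecord file without parsing them."""
    import tensorflow as tf  # imported lazily so the path helpers work without TF

    dataset = tf.data.TFRecordDataset(
        str(path), compression_type=compression_type_for(path)
    )
    return sum(1 for _ in dataset)


paths = artifact_paths("/path/to/data", 2019)  # placeholder directory
print(paths["triplets"])
```

Calling `count_records(paths["triplets"])` on a finished run should give a number comparable to the triplet count reported in that year's summary, assuming one record per triplet.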

## Set Up for Data Preparation

In [1]:
# Set project root directory and add `src` to path
import sys
from pathlib import Path

PROJECT_ROOT = '/scratch/edk202/word2gm-fast'
project_root = Path(PROJECT_ROOT)
src_path = project_root / 'src'

if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

# Import the notebook setup utilities
from word2gm_fast.utils.notebook_setup import setup_data_preprocessing_notebook, enable_autoreload

# Enable autoreload for development
enable_autoreload()

# Set up environment (CPU-only for data preprocessing)
env = setup_data_preprocessing_notebook(project_root=PROJECT_ROOT)

# Extract commonly used modules for convenience
tf = env['tensorflow']
np = env['numpy']
pd = env['pandas']
batch_prepare_training_data = env['batch_prepare_training_data']
print_resource_summary = env['print_resource_summary']

<pre>Autoreload enabled</pre>

<pre>Project root: /scratch/edk202/word2gm-fast
TensorFlow version: 2.19.0
Device mode: CPU-only</pre>

<pre>Data preprocessing environment ready!</pre>

## Print Resource Summary

In [2]:
print_resource_summary()

<pre>SYSTEM RESOURCE SUMMARY
============================================================
Hostname: cm016.hpc.nyu.edu

Job Allocation:
   CPUs: 14
   Memory: 125.0 GB
   Requested partitions: short,cs,cm,cpu_a100_2,cpu_a100_1,cpu_gpu
   Running on: SSH failed: Host key verification failed.
   Job ID: 63299322
   Node list: cm016

GPU Information:
   Error: NVML Shared Library Not Found

TensorFlow GPU Detection:
   TensorFlow detects 0 GPU(s)
   Built with CUDA: True
============================================================</pre>

## Prepare Corpora

This step runs the data-preparation pipeline end to end: it reads the preprocessed n-gram corpora, generates all valid triplets, extracts the vocabulary, and saves the triplets and vocabulary as TFRecord files.

### Options for Data Preparation

You can control which years are processed and how the batch preparation runs by adjusting the arguments to `batch_prepare_training_data`:

**Ways to specify years:**
- `year_range="2010"` — Process a single year (e.g., only 2010).
- `year_range="2010,2012,2015"` — Process a comma-separated list of years.
- `year_range="2010-2015"` — Process a range of years, inclusive (2010 through 2015).
- `year_range="2010,2012-2014,2016"` — Combine individual years and ranges (2010, 2012, 2013, 2014, 2016).
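
As an illustration of the four forms above, the following sketch expands a `year_range` string into a sorted list of years. The function `expand_year_range` is hypothetical. The parsing inside `batch_prepare_training_data` is not shown in this notebook and may differ, for example in how it treats duplicates or malformed input.

```python
def expand_year_range(spec):
    """Expand a spec such as "2010,2012-2014,2016" into a sorted list of years."""
    years = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Range start exceeds end: {part!r}")
            years.update(range(start, end + 1))  # ranges are inclusive
        else:
            years.add(int(part))
    return sorted(years)


print(expand_year_range("2010,2012-2014,2016"))
```

Under this reading, `"1801-1900"` expands to 100 years.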

**Other options:**
- `compress` — If `True`, output TFRecords are gzip-compressed. If `False`, output is uncompressed.
- `show_progress` — If `True`, display a progress bar for each year.
- `show_summary` — If `True`, print a summary of the processed data for each year.
- `use_multiprocessing` — If `True`, process years in parallel using multiple CPU cores (recommended for large datasets).
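
When years are processed in parallel, they finish in whatever order the workers complete them, so per-year progress lines are not necessarily in chronological order. The sketch below shows one common way to structure such a run with the standard library's `concurrent.futures`. It is an assumption about the general pattern, not the code behind `use_multiprocessing`. `prepare_one_year` is a hypothetical stand-in for the real per-year work.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed


def prepare_one_year(year):
    """Hypothetical stand-in for the per-year preparation step."""
    return {"year": year}


def run_years(years, max_workers, executor_cls=ProcessPoolExecutor):
    """Process each year in a worker and report results as they finish."""
    results = {}
    with executor_cls(max_workers=max_workers) as pool:
        futures = [pool.submit(prepare_one_year, year) for year in years]
        # as_completed yields futures in completion order, not submission order
        for done, future in enumerate(as_completed(futures), start=1):
            summary = future.result()
            results[summary["year"]] = summary
            print(f"{summary['year']} complete ({done}/{len(futures)})")
    return results


# ThreadPoolExecutor is swapped in here only to keep the demo lightweight.
demo = run_years([1801, 1802, 1803], max_workers=2, executor_cls=ThreadPoolExecutor)
```

With the default `ProcessPoolExecutor`, the worker function must be importable by the worker processes, which is why it is defined at module level.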

See the function docstring or source for more advanced options.

In [None]:
# Configuration
corpus_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data"

# Process years with multiprocessing (CPU-only mode is configured in the setup cell, In [1])
results = batch_prepare_training_data(
    corpus_dir=corpus_dir,
    year_range="1801-1900",
    compress=False,
    show_progress=True,
    show_summary=True,
    use_multiprocessing=True
)


PARALLEL BATCH PROCESSING
Processing 100 years
Using 14 parallel workers
Estimated speedup: 14.0x
1825 complete (25/100): 1,187,396 triplets, 23,176 vocab, 712.2s
1820 complete (26/100): 1,640,775 triplets, 24,563 vocab, 908.9s
1822 complete (27/100): 1,583,366 triplets, 25,169 vocab, 908.9s
1824 complete (28/100): 1,567,353 triplets, 24,419 vocab, 892.2s
1830 complete (29/100): 1,503,816 triplets, 25,450 vocab, 870.7s
1829 complete (30/100): 2,057,033 triplets, 27,319 vocab, 1179.7s
1837 complete (31/100): 1,235,211 triplets, 25,101 vocab, 703.6s
1835 complete (32/100): 1,433,895 triplets, 25,335 vocab, 821.7s
1831 complete (33/100): 1,993,305 triplets, 27,548 vocab, 1084.8s
1833 complete (34/100): 1,729,847 triplets, 26,916 vocab, 1003.6s
1838 complete (35/100): 1,502,966 triplets, 26,324 vocab, 833.6s
1832 complete (36/100): 2,185,572 triplets, 27,982 vocab, 1206.4s
1834 complete (37/100): 2,415,856 triplets, 28,853 vocab, 1318.6s
1841 complete (38/100): 1,698,066 triplets, 26,666 

1804 complete (1/100): 128,037 triplets, 9,769 vocab, 97.1s
1801 complete (2/100): 157,360 triplets, 11,000 vocab, 105.8s
1802 complete (3/100): 92,701 triplets, 8,487 vocab, 110.0s
1805 complete (4/100): 190,798 triplets, 11,609 vocab, 135.5s
1808 complete (5/100): 317,657 triplets, 13,193 vocab, 198.6s
1803 complete (6/100): 349,757 triplets, 14,473 vocab, 208.7s
1807 complete (7/100): 369,858 triplets, 14,626 vocab, 218.4s
1814 complete (8/100): 301,143 triplets, 14,965 vocab, 221.1s
1806 complete (9/100): 403,392 triplets, 14,124 vocab, 245.6s
1813 complete (10/100): 508,850 triplets, 17,428 vocab, 306.5s
1816 complete (11/100): 390,078 triplets, 16,144 vocab, 235.6s
1811 complete (12/100): 691,955 triplets, 17,916 vocab, 411.9s
1812 complete (13/100): 710,549 triplets, 18,423 vocab, 420.4s
1817 complete (14/100): 524,951 triplets, 17,803 vocab, 320.6s
1809 complete (15/100): 734,668 triplets, 19,226 vocab, 433.1s
1815 complete (16/100): 641,679 triplets, 17,905 vocab, 383.7s
1818 