# Skip-Gram Data Preparation

**Pipeline: Corpus file → TFRecord training artifacts (triplets and vocabulary)**

Use this notebook to prepare Google Books 5-gram corpora for skip-gram training.

## Pipeline Workflow

1. **Input**: Preprocessed corpus file (e.g., `2019.txt`) in `/vast` storage

2. **Processing**:
    * Convert the text corpus to a `tf.data.TextLineDataset` object
    * Build an in-memory vocabulary frequency dictionary
    * Generate an in-memory `tf.data.Dataset` of string triplets (center, positive, negative); downsample if requested
    * Index the vocabulary and convert the triplets to integers

3. **Output**: TFRecord artifacts in organized subdirectories (e.g., `2019_artifacts/`)

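### Artifact Storage
The pipeline creates year-specific subdirectories alongside the original text corpora. The tree below shows compressed (`.gz`) artifacts; the runs in this notebook pass `compress=False` and write `vocab.tfrecord` and `triplets.tfrecord` without the suffix.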
<pre>
/vast/edk202/NLP_corpora/.../data/
├── 2018.txt
├── 2019.txt
├── 2020.txt
├── 2018_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
├── 2019_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
└── 2020_artifacts/
    ├── triplets.tfrecord.gz
    └── vocab.tfrecord.gz
</pre>
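
The naming convention above can be sketched with `pathlib`. This is a minimal illustration, not part of `word2gm_fast`: the helper name `artifact_dir_for` and the example path are hypothetical.

```python
from pathlib import Path


def artifact_dir_for(corpus_file: str) -> Path:
    """Return the sibling '<year>_artifacts' directory for a corpus file.

    Hypothetical helper: it only mirrors the directory layout shown above.
    """
    corpus = Path(corpus_file)
    # '2019.txt' -> '2019_artifacts', placed next to the corpus file
    return corpus.with_name(f"{corpus.stem}_artifacts")


# Illustrative path only; substitute the real corpus location.
print(artifact_dir_for("/tmp/data/2019.txt"))
```

Deriving the output directory from the corpus path keeps the two in sync when the year changes.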

## Set Up for Data Preparation

In [1]:
# Enable autoreload for development
%load_ext autoreload
%autoreload 2

In [2]:
# Set project root and add src to path
import sys
from pathlib import Path

PROJECT_ROOT = '/scratch/edk202/word2gm-fast'
project_root = Path(PROJECT_ROOT)
src_path = project_root / 'src'

if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

In [3]:
# Print resource summary
from word2gm_fast.utils.resource_summary import print_resource_summary

print_resource_summary()

<pre>SYSTEM RESOURCE SUMMARY
=============================================
Hostname: cm046.hpc.nyu.edu

Job Allocation:
   CPUs: 36
   Memory: 250.0 GB
   Partition: SSH failed, using fallback: short,cm
   Job ID: 63820737
   Node list: cm046

Physical GPU Hardware:
   No physical GPUs allocated to this job

TensorFlow GPU Recognition:
   TensorFlow can access 0 GPU(s)
   Built with CUDA support: True
=============================================</pre>

In [4]:
corpus_path = (
    '/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/'
    '6corpus/yearly_files/data/1778.txt'
)

In [5]:
from word2gm_fast.dataprep.corpus_to_dataset import make_dataset

tf_dataset, _ = make_dataset(
    corpus_path,
    cache=True,
    show_summary=True,
    show_properties=True,
    preview_n=10)

<pre>Preview of 10 random retained 5-grams:

    UNK foster parent UNK UNK
    UNK old way UNK UNK
    UNK free enjoyment UNK UNK
    UNK one hand UNK UNK
    UNK thus reply UNK UNK
    UNK truly great man UNK
    UNK man lay dead UNK
    UNK neither blind UNK deaf
    UNK hundred pound UNK UNK
    UNK help thinking UNK UNK</pre>

<pre>Processed Dataset Properties:

- Element spec: TensorSpec(shape=(), dtype=tf.string, name=None)
- Cardinality: Unknown
- Threading: Default settings
- Transformations: Cached</pre>

<pre>Summary:

- Retained: 7491
- Rejected: 15544
- Total: 23035</pre>

In [6]:
from word2gm_fast.dataprep.dataset_to_frequency import dataset_to_frequency

frequency_table = dataset_to_frequency(tf_dataset)
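
As a rough picture of what a vocabulary frequency dictionary holds, the sketch below counts tokens over a few of the 5-gram lines previewed earlier. It is plain Python rather than the `dataset_to_frequency` implementation, and excluding `UNK` from the counts is an assumption made for illustration.

```python
from collections import Counter


def count_token_frequencies(lines, unk_token="UNK"):
    """Count how often each token occurs across whitespace-split lines.

    Assumption: the UNK placeholder is skipped; the real function may differ.
    """
    counts = Counter()
    for line in lines:
        counts.update(tok for tok in line.split() if tok != unk_token)
    return dict(counts)


sample_lines = [
    "UNK foster parent UNK UNK",
    "UNK truly great man UNK",
    "UNK man lay dead UNK",
]
sample_freqs = count_token_frequencies(sample_lines)
print(sample_freqs)
```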

In [7]:
from word2gm_fast.dataprep.dataset_to_triplets import dataset_to_triplets

triplets_ds, _ = dataset_to_triplets(
    dataset=tf_dataset,
    frequency_table=frequency_table,
    downsample_threshold=1e-5,
    preview_n=10,
    cache=True,
    show_properties=True,
    show_summary=True
)

<pre>Preview of 10 random triplets:

    (shall, call, execute)
    (happiness, secure, foul)
    (blind, deaf, observer)
    (write, originally, distant)
    (deal, agitate, list)
    (upon, breast, force)
    (accept, invitation, dark)
    (london, port, satisfy)
    (wound, dress, white)
    (treatment, consider, intoxication)</pre>

<pre>Triplets Dataset Properties:

- Element spec: TensorSpec(shape=(3,), dtype=tf.string, name=None)
- Cardinality: Unknown
- Threading: Default settings
- Transformations: Mapped, FlatMapped, Cached</pre>

<pre>Generated Triplets Summary:

- Total triplets: 2,284
- Unique centers: 1,045
- Unique positives: 1,621
- Unique negatives: 1,552
- Total unique words: 2,354</pre>
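
The following sketch shows one way (center, positive, negative) triplets could be produced from 5-gram lines with frequency-based downsampling. It is not the `dataset_to_triplets` implementation. The assumptions are: the center is the middle token, `UNK` tokens are skipped, negatives are drawn uniformly from the vocabulary, and the keep probability follows the word2vec paper's subsampling rule, `min(1, sqrt(t / f))`, where `f` is a word's relative frequency and `t` the threshold.

```python
import math
import random


def keep_probability(count, total, threshold=1e-5):
    """Keep probability min(1, sqrt(t / f)); f is the relative frequency."""
    relative_freq = count / total
    return min(1.0, math.sqrt(threshold / relative_freq))


def make_triplets(lines, freqs, threshold=1e-5, unk="UNK", seed=0):
    """Build (center, positive, negative) string triplets from 5-gram lines.

    Assumptions for illustration only: middle token is the center, UNK is
    skipped, and negatives are sampled uniformly from the vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted(freqs)
    total = sum(freqs.values())
    triplets = []
    for line in lines:
        tokens = line.split()
        if len(tokens) != 5 or tokens[2] == unk:
            continue
        center = tokens[2]
        # Randomly drop frequent centers
        if rng.random() >= keep_probability(freqs[center], total, threshold):
            continue
        for position, positive in enumerate(tokens):
            if position == 2 or positive == unk:
                continue
            # Randomly drop frequent context words
            if rng.random() >= keep_probability(freqs[positive], total, threshold):
                continue
            triplets.append((center, positive, rng.choice(vocab)))
    return triplets


toy_lines = ["UNK neither blind UNK deaf"]
toy_freqs = {"neither": 1, "blind": 1, "deaf": 1}
# threshold=1.0 keeps every word, so only the negatives are random here
toy_triplets = make_triplets(toy_lines, toy_freqs, threshold=1.0)
print(toy_triplets)
```

With a uniform sampler the negative can coincide with the center or the positive; a production sampler would typically exclude those cases or weight by frequency.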

In [8]:
from word2gm_fast.dataprep.index_vocab import triplets_to_integers

integer_triplets, vocab_table, vocab_list, vocab_size, vocab_summary = triplets_to_integers(
    triplets_dataset=triplets_ds,
    frequency_table=frequency_table,
    preview_n=10,
    show_summary=True,
    show_properties=True,
    cache=True
)

<pre>Preview of 10 random integer triplets:

    (8, 66, 1799)
    (198, 2186, 1842)
    (730, 1722, 2035)
    (222, 2047, 528)
    (117, 861, 1967)
    (7, 734, 257)
    (232, 970, 164)
    (481, 2086, 1038)
    (591, 922, 2346)
    (1529, 1196, 1933)</pre>

<pre>Integer Triplets Dataset Properties:

- Element spec: TensorSpec(shape=(3,), dtype=tf.int32, name=None)
- Cardinality: 2284
- Threading: Default settings
- Transformations: TensorSlices, Cached</pre>

<pre>Vocabulary Summary:

- Vocabulary size: 2,355
- Index range: 0 to 2,354
- UNK token: UNK (index 0)
- Sample tokens: would, one, great, make, good</pre>
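
The vocabulary size of 2,355 is the 2,354 unique triplet words plus the `UNK` token at index 0. A minimal sketch of such an indexing step follows. It is not the `triplets_to_integers` implementation, and ordering the remaining words by descending frequency is an assumption suggested by the sample tokens above.

```python
def build_vocab(freqs, unk="UNK"):
    """Return (vocab_list, word_to_id) with UNK reserved at index 0.

    Assumption: remaining words are ordered by descending frequency,
    with ties broken alphabetically.
    """
    ordered = sorted((w for w in freqs if w != unk), key=lambda w: (-freqs[w], w))
    vocab_list = [unk] + ordered
    word_to_id = {word: index for index, word in enumerate(vocab_list)}
    return vocab_list, word_to_id


def encode_triplets(triplets, word_to_id):
    """Map string triplets to integer triplets; unknown words map to 0 (UNK)."""
    return [tuple(word_to_id.get(word, 0) for word in triplet) for triplet in triplets]


demo_vocab, demo_ids = build_vocab({"man": 2, "great": 1, "dead": 1})
demo_encoded = encode_triplets([("man", "great", "dead")], demo_ids)
print(demo_vocab, demo_encoded)
```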

In [9]:
from word2gm_fast.io.vocab import write_vocab_to_tfrecord
from word2gm_fast.io.triplets import write_triplets_to_tfrecord
import os

# Define output directory
output_dir = (
    '/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/'
    '6corpus/yearly_files/data/1778_artifacts'
)

# Create the output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

# Save vocabulary table (uncompressed, since compress=False)
vocab_path = f"{output_dir}/vocab.tfrecord"
write_vocab_to_tfrecord(
    vocab_table=vocab_table,
    output_path=vocab_path,
    compress=False
)

# Save integer triplets (uncompressed, since compress=False)
triplets_path = f"{output_dir}/triplets.tfrecord"
triplet_count = write_triplets_to_tfrecord(
    dataset=integer_triplets,
    output_path=triplets_path,
    compress=False
)

<pre>Writing vocabulary TFRecord to: /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data/1778_artifacts/vocab.tfrecord
Vocabulary write complete. Words written: 2,355
Writing TFRecord to: /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data/1778_artifacts/triplets.tfrecord
Write complete. Triplets written: 2,284</pre>

In [None]:
from word2gm_fast.dataprep.pipeline import run_pipeline

corpus_dir = '/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data'
year_range = "1989-2019"

results = run_pipeline(
    corpus_dir=corpus_dir,
    years=year_range,
    compress=False,
    max_workers=32,
    downsample_threshold=1e-5,
    cache=True
)

<pre>Processing 31 years: 1989-2019
Using 32 parallel workers</pre>
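
The range `"1989-2019"` is inclusive, which is why 31 years are processed. The sketch below shows how such a string could be expanded. The helper `parse_year_range` is hypothetical, and its support for comma-separated parts is an assumption, not a documented feature of `run_pipeline`.

```python
def parse_year_range(spec):
    """Expand a spec such as '1989-2019' or '1778,1800-1802' into a list of years.

    Hypothetical helper; ranges are treated as inclusive on both ends.
    """
    years = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(value) for value in part.split("-"))
            years.extend(range(start, end + 1))
        elif part:
            years.append(int(part))
    return years


years = parse_year_range("1989-2019")
# There is no benefit in starting more workers than there are years
workers = min(32, len(years))
print(len(years), workers)
```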