# Skip-Gram Data Preparation

**Pipeline: Corpus file → TFRecord training artifacts (triplets and vocabulary)**

Use this notebook to prepare Google Books 5-gram corpora for skip-gram training.

## Pipeline Workflow

1. **Input**: Preprocessed corpus file (e.g., `2019.txt`) in `/vast` storage

2. **Processing**:
    * Convert text corpus to `tf.data.TextLineDataset` object
    * Create a vocabulary frequency table
    * Generate string triplets with optional frequency-based downsampling

3. **Output**: TFRecord artifacts in organized subdirectories (e.g., `2019_artifacts/`)

### Artifact Storage
The pipeline creates year-specific subdirectories alongside the original text corpora:
<pre>
/vast/edk202/NLP_corpora/.../data/
├── 2018.txt
├── 2019.txt
├── 2020.txt
├── 2018_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
├── 2019_artifacts/
│   ├── triplets.tfrecord.gz
│   └── vocab.tfrecord.gz
└── 2020_artifacts/
    ├── triplets.tfrecord.gz
    └── vocab.tfrecord.gz
</pre>
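
The naming convention in the tree above can be sketched with `pathlib`. This is only an illustration: `artifact_dir_for` and `artifact_paths` are hypothetical helpers, not functions from `word2gm_fast`, and the `data/2019.txt` path is a placeholder.

```python
from pathlib import Path


def artifact_dir_for(corpus_file: str) -> Path:
    """Return the sibling `<stem>_artifacts` directory for a corpus file.

    Hypothetical helper: it only mirrors the directory layout shown above.
    """
    corpus = Path(corpus_file)
    return corpus.with_name(f"{corpus.stem}_artifacts")


def artifact_paths(corpus_file: str) -> dict:
    """Return the two artifact file paths expected for a corpus file."""
    out_dir = artifact_dir_for(corpus_file)
    return {
        "triplets": out_dir / "triplets.tfrecord.gz",
        "vocab": out_dir / "vocab.tfrecord.gz",
    }


paths = artifact_paths("data/2019.txt")  # placeholder path
print(paths["triplets"])
print(paths["vocab"])
```

Deriving both paths from the corpus file name keeps the artifacts next to the text they were built from, which is the arrangement the tree describes.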

## Set Up for Data Preparation

In [1]:
# Enable autoreload for development
%load_ext autoreload
%autoreload 2

In [2]:
# Set project root and add src to path
import sys
from pathlib import Path
import os

PROJECT_ROOT = '/scratch/edk202/word2gm-fast'
project_root = Path(PROJECT_ROOT)
src_path = project_root / 'src'

if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

In [3]:
# Print resource summary
from word2gm_fast.utils.resource_summary import print_resource_summary

print_resource_summary()

<pre>SYSTEM RESOURCE SUMMARY
=============================================
Hostname: cm050.hpc.nyu.edu

Job Allocation:
   CPUs: 14
   Memory: 125.0 GB
   Partition: short
   Job ID: 63744123
   Node list: cm050

Physical GPU Hardware:
   No physical GPUs allocated to this job

TensorFlow GPU Recognition:
   TensorFlow can access 0 GPU(s)
   Built with CUDA support: True
=============================================</pre>

In [17]:
corpus_path = (
    '/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/'
    '6corpus/yearly_files/data/1631.txt'
)

In [18]:
from word2gm_fast.dataprep.corpus_to_dataset import make_dataset

tf_dataset, _ = make_dataset(
    corpus_path,
    cache=True,
    show_summary=True,
    show_properties=True,
    preview_n=10)

<pre>Preview of 10 random retained 5-grams:

   UNK UNK spanish dominion UNK
   UNK UNK midst UNK war
   cut UNK notch UNK UNK
   take UNK sea UNK UNK
   UNK convenient place UNK UNK
   UNK know well UNK UNK
   UNK UNK much facility UNK
   build UNK one day UNK
   upon UNK guard UNK UNK
   UNK UNK neither know UNK</pre>

<pre>Processed Dataset Properties:

- Element spec: TensorSpec(shape=(), dtype=tf.string, name=None)
- Cardinality: Unknown
- Threading: Default settings
- Transformations: Cached</pre>

<pre>Summary:

- Retained: 215
- Rejected: 590
- Total: 805</pre>
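
The filtering criterion used by `make_dataset` is not shown in this notebook. The preview is consistent with a rule that keeps a 5-gram only when its center token is not `UNK` and at least one context token is not `UNK`. The sketch below implements that assumed rule in plain Python; `is_retained` is a hypothetical name, and the second and third sample lines are made up for illustration.

```python
UNK = "UNK"


def is_retained(line: str, n: int = 5) -> bool:
    """Assumed rule: valid center token plus at least one valid context token."""
    tokens = line.split()
    if len(tokens) != n:
        return False
    center = n // 2
    if tokens[center] == UNK:
        return False
    # Keep the line only if some non-center position holds a real word.
    return any(tok != UNK for i, tok in enumerate(tokens) if i != center)


lines = [
    "UNK UNK spanish dominion UNK",  # taken from the preview above
    "UNK UNK UNK dominion UNK",      # made-up line: center is UNK
    "UNK UNK spanish UNK UNK",       # made-up line: no usable context
]
retained = [line for line in lines if is_retained(line)]
summary = {
    "retained": len(retained),
    "rejected": len(lines) - len(retained),
    "total": len(lines),
}
print(summary)
```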

In [19]:
from word2gm_fast.dataprep.dataset_to_frequency import dataset_to_frequency

frequency_table = dataset_to_frequency(tf_dataset)
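
The return type of `dataset_to_frequency` is not shown here. As a rough mental model, assuming the table maps each token to its count over the retained 5-grams, the same computation in plain Python might look like the sketch below (`count_tokens` is a hypothetical name). The vocabulary summary later in this notebook reports 1,075 token instances, which equals 215 retained 5-grams times 5 tokens and fits this reading.

```python
from collections import Counter


def count_tokens(lines):
    """Count every whitespace-separated token across the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# Two lines taken from the 5-gram preview above.
lines = [
    "UNK UNK spanish dominion UNK",
    "UNK convenient place UNK UNK",
]
counts = count_tokens(lines)
total = sum(counts.values())

# Relative frequencies are what a downsampling threshold is compared against.
rel_freq = {tok: c / total for tok, c in counts.items()}
print(counts.most_common(3))
```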

In [20]:
from word2gm_fast.dataprep.dataset_to_triplets import dataset_to_triplets

triplets_ds, _ = dataset_to_triplets(
    dataset=tf_dataset,
    frequency_table=frequency_table,
    downsample_threshold=1e-5,
    preview_n=10,
    cache=True,
    show_properties=True,
    show_summary=True
)

<pre>Preview of 10 random triplets:

   (tell, well, devise)
   (child, woman, dominion)
   (would, indeed, matter)
   (hundred, pound, religion)
   (great, river, body)
   (place, call, possibly)
   (question, great, truth)
   (bad, bad, nobleman)
   (great, toe, furnish)
   (west, east, pound)</pre>

<pre>Triplets Dataset Properties:

- Element spec: TensorSpec(shape=(3,), dtype=tf.string, name=None)
- Cardinality: Unknown
- Threading: Default settings
- Transformations: Mapped, FlatMapped, Cached</pre>

<pre>Generated Triplets Summary:

- Total triplets: 21
- Unique centers: 20
- Unique positives: 21
- Unique negatives: 21
- Total unique words: 58</pre>
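
How `dataset_to_triplets` downsamples is not shown in this notebook. A common choice is the word2vec subsampling rule, which keeps a token with probability `min(1, sqrt(t / f))`, where `f` is the token's relative frequency and `t` is the threshold. The sketch below assumes that rule, assumes each triplet is `(center, positive, negative)` with the center taken from the middle of the 5-gram, and assumes negatives are drawn uniformly from the vocabulary. All function names are hypothetical.

```python
import math
import random

UNK = "UNK"


def keep_probability(count: int, total: int, threshold: float) -> float:
    """Assumed word2vec-style rule: keep with probability min(1, sqrt(t / f))."""
    freq = count / total
    return min(1.0, math.sqrt(threshold / freq))


def make_triplets(lines, counts, threshold, rng):
    """Build (center, positive, negative) triplets from 5-gram lines."""
    total = sum(counts.values())
    vocab = sorted(tok for tok in counts if tok != UNK)
    triplets = []
    for line in lines:
        tokens = line.split()
        mid = len(tokens) // 2
        center = tokens[mid]
        if center == UNK:
            continue
        if rng.random() >= keep_probability(counts[center], total, threshold):
            continue  # center word was downsampled away
        for i, positive in enumerate(tokens):
            if i == mid or positive == UNK:
                continue
            if rng.random() >= keep_probability(counts[positive], total, threshold):
                continue  # context word was downsampled away
            negative = rng.choice(vocab)  # uniform draw: an assumption
            triplets.append((center, positive, negative))
    return triplets


# Counts reported in this notebook: 'hundred' occurs 20 times in 1,075 tokens.
print(keep_probability(20, 1075, 1e-5))
# A token that occurs once has the highest possible keep probability here.
print(keep_probability(1, 1075, 1e-5))

counts = {"UNK": 3, "spanish": 1, "dominion": 1}
print(make_triplets(["UNK UNK spanish dominion UNK"], counts, 1.0, random.Random(0)))
```

Under this assumed rule, every token in a 1,075-token corpus has a relative frequency of at least 1/1075, far above `1e-5`, so even the rarest word is kept only about one time in ten. That would fit with 215 retained 5-grams yielding only 21 triplets.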

In [39]:
# Re-run triplet generation with a smaller preview (preview_n=5)
from word2gm_fast.dataprep.dataset_to_triplets import dataset_to_triplets

triplets_ds, _ = dataset_to_triplets(
    dataset=tf_dataset,
    frequency_table=frequency_table,
    downsample_threshold=1e-5,
    preview_n=5,
    cache=True,
    show_properties=True,
    show_summary=True
)

<pre>Preview of 5 random triplets:

   (good, use, none)
   (great, river, body)
   (french, spaniard, tall)
   (child, woman, dominion)
   (away, run, charity)</pre>

<pre>Triplets Dataset Properties:

- Element spec: TensorSpec(shape=(3,), dtype=tf.string, name=None)
- Cardinality: Unknown
- Threading: Default settings
- Transformations: Mapped, FlatMapped, Cached</pre>

<pre>Generated Triplets Summary:

- Total triplets: 21
- Unique centers: 20
- Unique positives: 21
- Unique negatives: 21
- Total unique words: 58</pre>

In [51]:
from word2gm_fast.dataprep.index_vocab import triplets_to_integers

integer_triplets, vocab_table, vocab_list, vocab_size, vocab_summary = triplets_to_integers(
    triplets_dataset=triplets_ds,
    frequency_table=frequency_table,
    preview_n=10,
    show_summary=True,
    show_properties=True,
    cache=True
)

<pre>Preview of 10 random integer triplets:

   (54, 35, 17)
   (5, 57, 41)
   (24, 53, 55)
   (26, 27, 42)
   (12, 14, 25)
   (10, 40, 45)
   (6, 28, 52)
   (50, 5, 58)
   (5, 29, 33)
   (13, 21, 49)</pre>

<pre>Integer Triplets Dataset Properties:

- Element spec: TensorSpec(shape=(3,), dtype=tf.int32, name=None)
- Cardinality: 21
- Threading: Default settings
- Transformations: TensorSlices, Cached</pre>

<pre>Vocabulary Summary:

- Vocabulary size: 59
- Index range: 0 to 58
- UNK token: UNK (index 0)
- Sample tokens: hundred, good, pound, two, great
- Total token instances: 1,075
- Most frequent: UNK(594), hundred(20), good(15), pound(11), two(11)</pre>

In [52]:
vocab_list

['UNK',
 'hundred',
 'good',
 'pound',
 'two',
 'great',
 'part',
 'one',
 'well',
 'could',
 'much',
 'west',
 'coast',
 'place',
 'upon',
 'would',
 'away',
 'thing',
 'woman',
 'another',
 'bad',
 'call',
 'child',
 'ever',
 'french',
 'manner',
 'north',
 'pole',
 'remote',
 'river',
 'run',
 'use',
 'want',
 'body',
 'charity',
 'continue',
 'devise',
 'dominion',
 'east',
 'except',
 'facility',
 'furnish',
 'glory',
 'heires',
 'indeed',
 'man',
 'matter',
 'nobleman',
 'none',
 'possibly',
 'question',
 'religion',
 'rest',
 'spaniard',
 'story',
 'tall',
 'tell',
 'toe',
 'truth']
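
The ordering of `vocab_list` appears consistent with `UNK` pinned at index 0, followed by the words that occur in the triplets sorted by descending frequency, with ties broken alphabetically. That ordering is inferred from the output above rather than from the `triplets_to_integers` source, so treat the sketch below as an assumption. `build_vocab` and `encode` are hypothetical names, and the two triplets are made up from words whose counts are reported in the vocabulary summary.

```python
def build_vocab(triplets, counts, unk="UNK"):
    """Return (vocab_list, word_to_index) with `unk` pinned at index 0."""
    words = {w for triplet in triplets for w in triplet if w != unk}
    # Descending count first, then alphabetical order to break ties.
    ordered = sorted(words, key=lambda w: (-counts.get(w, 0), w))
    vocab_list = [unk] + ordered
    word_to_index = {w: i for i, w in enumerate(vocab_list)}
    return vocab_list, word_to_index


def encode(triplets, word_to_index):
    """Map string triplets to integer triplets using the vocabulary."""
    return [tuple(word_to_index[w] for w in triplet) for triplet in triplets]


# Counts as reported in the vocabulary summary above.
counts = {"UNK": 594, "hundred": 20, "good": 15, "pound": 11, "two": 11}
# Made-up triplets built only from those words.
triplets = [("hundred", "pound", "good"), ("two", "good", "pound")]

vocab_list, word_to_index = build_vocab(triplets, counts)
integer_triplets = encode(triplets, word_to_index)
print(vocab_list)
print(integer_triplets)
```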

In [None]:
# Check vocab_table object type and structure
import tensorflow as tf

print("Vocab table type:", type(vocab_table))

# Test lookups; lookup() takes a tensor of keys, so wrap each key in tf.constant
if hasattr(vocab_table, 'lookup'):
    print("\nTesting lookup functionality:")
    for word in ('UNK', 'hundred', 'good'):
        print(f"'{word}' ->", vocab_table.lookup(tf.constant([word])).numpy())

# Check vocab_table properties
print(f"\nVocab table key dtype: {vocab_table.key_dtype}")
print(f"Vocab table value dtype: {vocab_table.value_dtype}")

# Create a simple word-to-index dictionary from vocab_list
word_to_index = {word: idx for idx, word in enumerate(vocab_list)}
print(f"\nCreated word_to_index dictionary with {len(word_to_index)} entries")
print("Sample mappings:")
for word in vocab_list[:5]:
    print(f"  '{word}' -> {word_to_index[word]}")

Vocab table type: <class 'tensorflow.python.ops.lookup_ops.StaticHashTable'>
Vocab table: <tensorflow.python.ops.lookup_ops.StaticHashTable object at 0x14a22c1252b0>

Testing lookup functionality:


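AttributeError: 'list' object has no attribute 'dtype'

This error is consistent with `lookup` having been called with a plain Python list: `StaticHashTable.lookup` reads the `dtype` of its argument, so keys need to be passed as a tensor, for example `tf.constant(['UNK'])`.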
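
The `word_to_index` dictionary gives a TensorFlow-free way to sanity-check mappings. The sketch below sends out-of-vocabulary words to index 0, the `UNK` index reported in the vocabulary summary. Whether `vocab_table` uses the same default is an assumption, and `lookup_indices` is a hypothetical helper.

```python
def lookup_indices(words, word_to_index, unk_index=0):
    """Map words to indices, falling back to `unk_index` for unknown words."""
    return [word_to_index.get(word, unk_index) for word in words]


# First five entries of the vocab_list printed earlier in this notebook.
vocab_list = ["UNK", "hundred", "good", "pound", "two"]
word_to_index = {word: idx for idx, word in enumerate(vocab_list)}

# 'zebra' is a made-up out-of-vocabulary word.
ids = lookup_indices(["UNK", "hundred", "good", "zebra"], word_to_index)
print(ids)
```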