In [None]:
# Let's measure the ACTUAL triplet generation time by materializing the dataset
print("\n[-- Actual Triplet Generation Timing --]\n")

# Count total triplets (this forces evaluation) - WARNING: This takes a long time!
# Uncomment the lines below to see the real timing (takes ~15 minutes for large files)

# start_actual = time.time()
# total_triplets = sum(1 for _ in triplets_ds)
# duration_actual = time.time() - start_actual
# rate_actual = file_size / duration_actual

# print(f"Total triplets generated: {total_triplets:,}")
# print(f"Actual generation time: {duration_actual:.2f}s")
# print(f"Actual generation rate: {rate_actual:.2f} MB/s")
# print(f"Triplets per second: {total_triplets/duration_actual:,.0f}")

# The original timing was just building the computation graph!
print(f"Graph construction time: {duration_triplets:.3f}s")
print("Actual processing time: ~15 minutes for large files")
print("The 'fast' triplet generation was just building the computation graph!")
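
Because `tf.data` pipelines are lazy, a fair timing has to consume the dataset rather than just construct it. The helper below is a minimal sketch of that measurement. `time_materialization` and `slow_numbers` are hypothetical names, not part of word2gm-fast, and a plain generator stands in for `triplets_ds` so the block runs without TensorFlow.

```python
import time
from itertools import islice


def time_materialization(iterable, limit=None):
    """Consume `iterable` and return (item_count, seconds_elapsed).

    If `limit` is given, stop after at most that many items so a slow
    pipeline can be sampled without a full pass.
    """
    source = iterable if limit is None else islice(iterable, limit)
    start = time.perf_counter()
    count = sum(1 for _ in source)
    return count, time.perf_counter() - start


def slow_numbers(n):
    """Stand-in for a lazy pipeline: no work happens until iteration."""
    for i in range(n):
        yield i * i


# Building the generator does no work, much like defining a tf.data pipeline.
build_start = time.perf_counter()
lazy = slow_numbers(100_000)
build_seconds = time.perf_counter() - build_start

# Consuming it is where the real cost shows up.
count, consume_seconds = time_materialization(lazy)
print(f"build: {build_seconds:.6f}s, consume: {consume_seconds:.6f}s, items: {count:,}")
```

The same helper should accept a `tf.data.Dataset`, since datasets are iterable in eager mode. For a sampled run on a dataset, `dataset.take(k)` is probably a better fit than `limit`.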

In [1]:
import os
import sys
import time
import tensorflow as tf

os.chdir('/scratch/edk202/word2gm-fast/notebooks')
os.chdir("..")

from src.word2gm_fast.dataprep.corpus_to_dataset import make_dataset
from src.word2gm_fast.dataprep.index_vocab import make_vocab
from src.word2gm_fast.dataprep.dataset_to_triplets import build_skipgram_triplets

2025-06-21 01:26:45.295294: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-06-21 01:26:45.311914: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1750483605.329218 2870105 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1750483605.334406 2870105 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1750483605.348141 2870105 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

In [3]:
!python -m unittest -b tests.test_corpus_to_dataset

2025-06-21 00:40:52.893344: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-06-21 00:40:52.907089: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1750480852.923017 3902574 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1750480852.927811 3902574 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1750480852.940672 3902574 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

In [4]:
!python -m unittest -b tests.test_index_vocab

2025-06-21 00:40:56.420920: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-06-21 00:40:56.434513: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1750480856.450534 3902645 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1750480856.455334 3902645 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1750480856.468235 3902645 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

In [5]:
!python -m unittest -v tests.test_dataset_to_triplets

2025-06-21 00:40:59.807210: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-06-21 00:40:59.820766: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1750480859.836531 3902688 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1750480859.841335 3902688 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1750480859.854048 3902688 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

In [2]:
# Corpus file information
corpus_file = "1850.txt"
corpus_dir = "/vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data"
corpus_path = os.path.join(corpus_dir, corpus_file)
file_size = os.path.getsize(corpus_path) / 1024 / 1024
print("CORPUS FILE: ", corpus_path, "\n")

# Build the corpus pipeline. tf.data is lazy, so this times pipeline
# construction only; no lines are read or filtered yet.
start = time.time()
dataset, _ = make_dataset(corpus_path)
duration_load = time.time() - start
rate_load = file_size / duration_load
dataset = dataset.cache()

# Build the vocabulary from the dataset. This call consumes the dataset,
# so its timing includes reading and filtering the corpus.
start = time.time()
vocab_table = make_vocab(dataset)
duration_vocab = time.time() - start
rate_vocab = file_size / duration_vocab

# Define the triplet pipeline from the dataset and vocabulary table.
# Also lazy: triplets are generated only when triplets_ds is iterated.
start = time.time()
triplets_ds = build_skipgram_triplets(dataset, vocab_table)
duration_triplets = time.time() - start
rate_triplets = file_size / duration_triplets

# Benchmarking output
print("[--    Benchmarks    --]\n")
print(f"{'Step':<35}{'Duration':>10}{'Quantity':>21}{'Rate':>21}")
print("-" * 87)
print(f"{'Corpus loading and filtering':<35}{duration_load:8,.2f}{'s':>2}{file_size:18,.2f}{'MB':>3}{rate_load:16,.2f}{'MB/s':>5}")
print(f"{'Vocabulary creation':<35}{duration_vocab:8,.2f}{'s':>2}{file_size:18,.2f}{'MB':>3}{rate_vocab:16,.2f}{'MB/s':>5}")
print(f"{'Triplet generation':<35}{duration_triplets:8,.2f}{'s':>2}{file_size:18,.2f}{'MB':>3}{rate_triplets:16,.2f}{'MB/s':>5}")

# Create reverse lookup from vocab table once at the beginning
vocab_export = vocab_table.export()
vocab_keys = vocab_export[0].numpy()
vocab_values = vocab_export[1].numpy()
index_to_word = {idx: word.decode('utf-8') for word, idx in zip(vocab_keys, vocab_values)}

# Show sample lines (use a fresh iterator)
print("\n[--   Sample Lines   --]\n")
sample_lines = list(dataset.shuffle(1000, seed=42).take(5).as_numpy_iterator())
for line_bytes in sample_lines:
    print(line_bytes.decode("utf-8"))

# Test the vocab table with example words
print("\n[--    Test Words    --]\n")
test_words = ["UNK", "man", "king", "nonexistentword"]
ids = vocab_table.lookup(tf.constant(test_words)).numpy()
print(f"{'Word':<18} {'ID':>6}")
print("-" * 25)
for word, idx in zip(test_words, ids):
    print(f"{word:<18} {idx:>6}")

# Show sample triplets
print("\n[--  Sample Triplets  --]\n")
print(f"{'Center':<8} {'Center Word':<12} {'Positive':<8} {'Pos Word':<12} {'Negative':<8} {'Neg Word':<12}")
print("-" * 75)

# Get sample triplets (use a fresh iterator with seed for reproducibility)
sample_triplets = list(triplets_ds.shuffle(1000, seed=123).take(5).as_numpy_iterator())
for triplet in sample_triplets:
    center, positive, negative = triplet
    center_word = index_to_word.get(center, f"ID_{center}")
    pos_word = index_to_word.get(positive, f"ID_{positive}")
    neg_word = index_to_word.get(negative, f"ID_{negative}")
    print(f"{center:<8} {center_word:<12} {positive:<8} {pos_word:<12} {negative:<8} {neg_word:<12}")


CORPUS FILE:  /vast/edk202/NLP_corpora/Google_Books/20200217/eng-fiction/5gram_files/6corpus/yearly_files/data/1850.txt 



2025-06-21 01:26:50.080213: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
2025-06-21 01:28:54.910835: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


[--    Benchmarks    --]

Step                                 Duration             Quantity                 Rate
---------------------------------------------------------------------------------------
Corpus loading and filtering           0.29 s            120.81 MB          419.37 MB/s
Vocabulary creation                  124.59 s            120.81 MB            0.97 MB/s
Triplet generation                     0.15 s            120.81 MB          788.08 MB/s

[--   Sample Lines   --]



2025-06-21 01:28:55.449700: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


UNK one another UNK two
UNK one day UNK serve
UNK one another UNK good
UNK one afore long UNK
UNK one back UNK UNK

[--    Test Words    --]

Word                   ID
-------------------------
UNK                     0
man                 18349
king                16702
nonexistentword         0

[--  Sample Triplets  --]

Center   Center Word  Positive Pos Word     Negative Neg Word    
---------------------------------------------------------------------------
1638     article      24205    rank         6700     corrigan    
2515     battle       20956    one          13066    gratefully  
3239     blood        20956    one          4585     carpentaria 
833      almost       15475    infinite     10444    evangelicals
1658     artorius     982      among        14982    impending   

In [3]:
# Let's compare TF-native vs Python-based approaches to see where optimizations matter
print("\n[-- TF-Native vs Python Comparison --]\n")

# First, let's try a more Python-heavy approach for triplet generation
def python_heavy_triplets(dataset, vocab_table):
    """Generate triplets using more Python logic (less TF-friendly)"""
    import random
    random.seed(123)  # For reproducibility
    
    triplets = []
    vocab_size_int = int(vocab_table.size().numpy())
    
    for line_tensor in dataset:
        line = line_tensor.numpy().decode('utf-8')
        tokens = line.strip().split()
        
        if len(tokens) != 5:
            continue
            
        # Look up token IDs one by one (less efficient)
        token_ids = []
        for token in tokens:
            token_id = vocab_table.lookup(tf.constant([token])).numpy()[0]
            token_ids.append(token_id)
        
        center_id = token_ids[2]
        context_ids = [token_ids[0], token_ids[1], token_ids[3], token_ids[4]]
        
        # Skip if center is UNK
        if center_id == 0:
            continue
            
        # Generate triplets for each valid context
        for context_id in context_ids:
            if context_id != 0:  # Skip UNK context
                negative_id = random.randint(1, vocab_size_int - 1)
                triplets.append((center_id, context_id, negative_id))
    
    return triplets

# Time the Python-heavy approach
print("Testing Python-heavy approach...")
start_python = time.time()
python_triplets = python_heavy_triplets(dataset.take(1000), vocab_table)  # Just 1000 lines for comparison
duration_python = time.time() - start_python
print(f"Python-heavy (1000 lines): {duration_python:.3f}s, {len(python_triplets)} triplets")

# Time the TF-native approach on the same number of triplets
# (same count, not necessarily the same 1000 source lines)
print("Testing TF-native approach...")
start_tf = time.time()
tf_triplets = list(triplets_ds.take(len(python_triplets)))
duration_tf = time.time() - start_tf
print(f"TF-native (same count): {duration_tf:.3f}s, {len(tf_triplets)} triplets")

print(f"\nSpeedup from TF-native: {duration_python/duration_tf:.1f}x faster")

# But the REAL benefits of TF-native code show up in different scenarios:
print("\n[-- Where TF-Native Really Matters --]\n")

print("1. BATCHED PROCESSING:")
print("   - TF datasets can be batched, prefetched, and parallelized")
print("   - Python iteration is sequential and single-threaded")

print("\n2. MEMORY EFFICIENCY:")
print("   - TF datasets stream data without loading everything into memory")
print("   - Python approach above loads all triplets into a list")

print("\n3. INTEGRATION WITH TRAINING:")
print("   - TF datasets integrate seamlessly with tf.keras.Model.fit()")
print("   - No need to convert between Python objects and tensors")

print("\n4. GRAPH COMPILATION:")
print("   - With tf.function or XLA, TF ops can be compiled for speed")
print("   - Python logic cannot be compiled")

# Demonstrate memory efficiency difference
# NOTE: python_triplets was built before memory_before is sampled, so the
# two RSS readings below bracket no allocation and the reported increase is
# about 0 MB. To measure the list's footprint, sample RSS before calling
# python_heavy_triplets and again after it returns.
import psutil

process = psutil.Process(os.getpid())
memory_before = process.memory_info().rss / 1024 / 1024  # MB

print(f"\n5. MEMORY USAGE DEMO:")
print(f"   Memory before: {memory_before:.1f} MB")

# python_triplets is already resident at this point (see NOTE above)
memory_after_python = process.memory_info().rss / 1024 / 1024
print(f"   Memory after Python triplets: {memory_after_python:.1f} MB")
print(f"   Memory increase: {memory_after_python - memory_before:.1f} MB")

# The TF dataset doesn't materialize until consumed
print(f"   TF dataset memory: minimal (lazy evaluation)")

print(f"\n6. SCALABILITY:")
print(f"   Python approach: {len(python_triplets)} triplets = {memory_after_python - memory_before:.1f} MB")
print(f"   For 14M triplets: ~{(memory_after_python - memory_before) * 14000000 / len(python_triplets) / 1024:.1f} GB!")
print(f"   TF approach: streams data, constant memory usage")


[-- TF-Native vs Python Comparison --]

Testing Python-heavy approach...


2025-06-21 01:28:56.684578: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


Python-heavy (1000 lines): 0.845s, 1858 triplets
Testing TF-native approach...
TF-native (same count): 0.527s, 1858 triplets

Speedup from TF-native: 1.6x faster

[-- Where TF-Native Really Matters --]

1. BATCHED PROCESSING:
   - TF datasets can be batched, prefetched, and parallelized
   - Python iteration is sequential and single-threaded

2. MEMORY EFFICIENCY:
   - TF datasets stream data without loading everything into memory
   - Python approach above loads all triplets into a list

3. INTEGRATION WITH TRAINING:
   - TF datasets integrate seamlessly with tf.keras.Model.fit()
   - No need to convert between Python objects and tensors

4. GRAPH COMPILATION:
   - With tf.function or XLA, TF ops can be compiled for speed
   - Python logic cannot be compiled

5. MEMORY USAGE DEMO:
   Memory before: 1147.7 MB
   Memory after Python triplets: 1147.7 MB
   Memory increase: 0.0 MB
   TF dataset memory: minimal (lazy evaluation)

6. SCALABILITY:
   Python approach: 1858 triplets = 0.0 MB
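
The 0.0 MB reading above comes from sampling memory only after `python_triplets` already existed. Below is a minimal sketch of the before/after ordering. It swaps in the standard-library `tracemalloc` for `psutil` RSS, so it tracks Python-level allocations only, not TensorFlow's native buffers. `build_fake_triplets` is a hypothetical stand-in for `python_heavy_triplets`, and the vocabulary size of 25,000 is an illustrative assumption.

```python
import random
import tracemalloc


def build_fake_triplets(n, vocab_size=25_000, seed=123):
    """Hypothetical stand-in: n random (center, positive, negative) ID tuples."""
    rng = random.Random(seed)
    return [
        (
            rng.randint(1, vocab_size - 1),
            rng.randint(1, vocab_size - 1),
            rng.randint(1, vocab_size - 1),
        )
        for _ in range(n)
    ]


tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()    # sample BEFORE allocating
triplets = build_fake_triplets(100_000)
after, peak = tracemalloc.get_traced_memory()  # sample AFTER, list still alive
tracemalloc.stop()

bytes_per_triplet = (after - before) / len(triplets)
print(f"{len(triplets):,} triplets: {(after - before) / 1024 / 1024:.1f} MB "
      f"({bytes_per_triplet:.0f} bytes each)")
print(f"Extrapolated to 14M triplets: "
      f"{bytes_per_triplet * 14_000_000 / 1024**3:.2f} GB")
```

The extrapolation is only as good as the per-triplet estimate, so it is worth measuring on a sample large enough that interpreter noise is negligible.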
 

In [4]:
# Let's test whether caching a dataset materializes it
print("\n[-- Dataset Caching Investigation --]\n")

# Create a fresh dataset without caching
print("Creating fresh dataset without caching...")
start_fresh = time.time()
fresh_dataset, _ = make_dataset(corpus_path)
duration_fresh = time.time() - start_fresh
print(f"Fresh dataset creation: {duration_fresh:.3f}s")

# Now cache it - does this materialize the data?
print("\nCaching the dataset...")
start_cache = time.time()
cached_dataset = fresh_dataset.cache()
duration_cache = time.time() - start_cache
print(f"Cache operation: {duration_cache:.3f}s")

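# Test: time a partial read of the cached dataset.
# NOTE: take(1000) after cache() does not fill the cache. TF discards
# partially cached contents unless the iterator reads the whole dataset.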
print("\nFirst iteration through cached dataset (should materialize)...")
start_first = time.time()
first_count = sum(1 for _ in cached_dataset.take(1000))
duration_first = time.time() - start_first
print(f"First iteration (1000 items): {duration_first:.3f}s, count: {first_count}")

# Test: repeat the partial read. Nothing was cached by the first pass,
# so any difference here reflects warm-up effects rather than the cache.
print("\nSecond iteration through same cached dataset...")
start_second = time.time()
second_count = sum(1 for _ in cached_dataset.take(1000))
duration_second = time.time() - start_second
print(f"Second iteration (1000 items): {duration_second:.3f}s, count: {second_count}")

print(f"\nCaching speedup: {duration_first/duration_second:.1f}x faster on second pass")

# Check memory usage during caching
import psutil
process = psutil.Process(os.getpid())
memory_before_full_cache = process.memory_info().rss / 1024 / 1024

print(f"\nMemory before full dataset cache: {memory_before_full_cache:.1f} MB")

# Force full materialization into cache
print("Forcing full dataset into cache...")
start_full_cache = time.time()
full_count = sum(1 for _ in cached_dataset)
duration_full_cache = time.time() - start_full_cache

memory_after_full_cache = process.memory_info().rss / 1024 / 1024
memory_increase = memory_after_full_cache - memory_before_full_cache

print(f"Full cache materialization: {duration_full_cache:.2f}s")
print(f"Total items cached: {full_count:,}")
print(f"Memory after full cache: {memory_after_full_cache:.1f} MB")
print(f"Memory increase: {memory_increase:.1f} MB")

# Now test if subsequent iterations are really fast
print("\nTesting cached performance...")
start_cached = time.time()
cached_count = sum(1 for _ in cached_dataset)
duration_cached = time.time() - start_cached
print(f"Iteration over fully cached dataset: {duration_cached:.3f}s")
print(f"Items: {cached_count:,}")

print(f"\n[-- Caching Conclusions --]\n")
print(f"1. .cache() call itself: {duration_cache:.3f}s (just creates cache object)")
# NOTE: duration_first timed a take(1000) read, which does not fill the cache.
# The pass that actually materializes the cache is the full one, timed in
# duration_full_cache.
print(f"2. First iteration: {duration_first:.3f}s (materializes data into cache)")
print(f"3. Subsequent iterations: {duration_cached:.3f}s (reads from cache)")
print(f"4. Memory overhead: {memory_increase:.1f} MB for {full_count:,} items")
print(f"5. Cache speedup: {duration_full_cache/duration_cached:.0f}x faster")

print("\nSo caching does NOT materialize immediately - it's lazy until first use!")


[-- Dataset Caching Investigation --]

Creating fresh dataset without caching...
Fresh dataset creation: 0.062s

Caching the dataset...
Cache operation: 0.001s

First iteration through cached dataset (should materialize)...


2025-06-21 01:30:46.991351: W tensorflow/core/kernels/data/cache_dataset_ops.cc:916] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.
2025-06-21 01:30:47.163967: W tensorflow/core/kernels/data/cache_dataset_ops.cc:916] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


First iteration (1000 items): 0.241s, count: 1000

Second iteration through same cached dataset...
Second iteration (1000 items): 0.172s, count: 1000

Caching speedup: 1.4x faster on second pass

Memory before full dataset cache: 1148.9 MB
Forcing full dataset into cache...


2025-06-21 01:32:43.002837: I tensorflow/core/framework/local_rendezvous.cc:407] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


Full cache materialization: 115.84s
Total items cached: 2,038,127
Memory after full cache: 1639.0 MB
Memory increase: 490.1 MB

Testing cached performance...
Iteration over fully cached dataset: 113.477s
Items: 2,038,127

[-- Caching Conclusions --]

1. .cache() call itself: 0.001s (just creates cache object)
2. First iteration: 0.241s (materializes data into cache)
3. Subsequent iterations: 113.477s (reads from cache)
4. Memory overhead: 490.1 MB for 2,038,127 items
5. Cache speedup: 1x faster

So caching does NOT materialize immediately - it's lazy until first use!
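
The figures above can be turned into per-item numbers, which makes the "1x" cache speedup easier to read. The sketch below only does arithmetic on values copied from this run's output. The interpretation is a guess rather than a measurement: because both full passes cost about the same per item, the time may be dominated by element-by-element Python iteration rather than by the upstream pipeline that the cache skips.

```python
# Figures copied from the notebook output above.
items = 2_038_127
first_pass_s = 115.84      # full pass that fills the cache
second_pass_s = 113.477    # full pass that reads from the cache
memory_increase_mb = 490.1

# Per-item cost of each full pass, in microseconds.
us_per_item_first = first_pass_s / items * 1e6
us_per_item_second = second_pass_s / items * 1e6

# Ratio reported as "Cache speedup" (the notebook rounds it to 0 decimals).
speedup = first_pass_s / second_pass_s

# Approximate resident memory per cached item, in bytes.
bytes_per_item = memory_increase_mb * 1024 * 1024 / items

print(f"First pass:  {us_per_item_first:.1f} us/item")
print(f"Second pass: {us_per_item_second:.1f} us/item")
print(f"Speedup:     {speedup:.3f}x")
print(f"Memory:      {bytes_per_item:.0f} bytes/item")
```

One way to check the guess would be to time a pass that batches the dataset before iterating, which cuts the number of Python-level steps without changing the data read.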
