# Dataset Editing Tutorial

This tutorial demonstrates how to use TokenSmith's Edit Handler to modify and inject content into tokenized datasets. You'll learn how to edit training data safely, inject specific content, and validate modifications before applying them permanently.

**Prerequisites:**
- Complete tutorials 1-3 (basic setup, inspection, and sampling)
- Have a tokenized dataset ready with batch info generated
- Understanding of tokenization and sequence structure

**What you'll learn:**
- How to inject text at specific locations in the dataset
- How the injection types differ and what effect each one has
- How to edit safely with dry runs
- How to run batch injection workflows for multiple modifications
- How to validate and preview changes before making them permanent
- Best practices for dataset editing and version control

## Setup

Let's start by setting up our environment and dataset manager, building on the previous tutorials.

In [20]:
# Add local checkouts to the import path (adjust these paths for your machine)
import sys
sys.path.insert(0, "/NS/llm-pretraining/work/afkhan/tokensmith")
sys.path.insert(0, "/NS/llm-pretraining/work/afkhan/USC_Colab/gpt-neox")

# Import required libraries
import numpy as np
import random
import warnings
from transformers import AutoTokenizer
from tokensmith.manager import DatasetManager

# Load tokenizer
TOKENIZER_NAME_OR_PATH = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME_OR_PATH, add_eos_token=True)
print(f"Loaded tokenizer: {TOKENIZER_NAME_OR_PATH}")
print(f"EOS token: {tokenizer.eos_token} (ID: {tokenizer.eos_token_id})")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Loaded tokenizer: EleutherAI/gpt-neox-20b
EOS token: <|endoftext|> (ID: 0)


In [21]:
# Initialize DatasetManager
dataset_manager = DatasetManager()

# Setup the dataset for editing, inspection, sampling, and export
dataset_manager.setup_edit_inspect_sample_export(
    dataset_prefix='../../artifacts/data_tokenized_text_document',
    batch_info_save_prefix='../../artifacts/batch_info',
    train_iters=100,
    train_batch_size=16,
    train_seq_len=2048,
    seed=42,
    splits_string='990,5,5',
    packing_impl='packed',
    allow_chopped=True,
)
print("Dataset manager setup complete!")
print(f"Edit handler available: {dataset_manager.edit is not None}")

    warming up index mmap file...
    reading sizes...
    reading pointers...
    reading document index...
Dataset manager setup complete!
Edit handler available: True


## Understanding the Edit Handler

The Edit Handler provides several key methods for dataset modification:

1. **`inject_and_preview()`** - Inject single text samples with preview
2. **`inject_multiple_samples()`** - Batch injection of multiple samples
3. **`preview_sample()`** - Preview existing samples without modification
4. **`validate_injection_location()`** - Validate injection locations

Let's explore each of these methods in detail.

## Basic Sample Preview

Before making any modifications, let's examine existing samples to understand the dataset structure.

In [22]:
# Preview a sample without modification
sample_id = 50

# Get sample with document details
sample_text, doc_details = dataset_manager.edit.preview_sample(
    sample_id=sample_id,
    return_doc_details=True,
    return_detokenized=True,
    tokenizer=tokenizer
)

print(f"=== Preview of Sample {sample_id} ===")
print(f"Text length: {len(sample_text)} characters")
print(f"Document range: {doc_details['doc_index_f']} to {doc_details['doc_index_l']}")
print(f"Spans multiple docs: {doc_details['doc_index_f'] != doc_details['doc_index_l']}")
print(f"\nSample text (first 200 chars):")
print(f"{sample_text[:200]}...")
print(f"\nSample text (last 100 chars):")
print(f"...{sample_text[-100:]}")

=== Preview of Sample 50 ===
Text length: 8495 characters
Document range: 14417 to 14425
Spans multiple docs: True

Sample text (first 200 chars):
 Ben's car fell on the ground and broke. The wheel came off and the paint scratched. "Uh oh!" Lily said, looking at Ben's car. "I'm sorry, Ben. I did not mean to break your car." Ben picked up his car...

Sample text (last 100 chars):
...ngs that hid in the dark. She hugged her teddy bear and closed her eyes. She tried to think of happy


In [23]:
# Compare with tokenized version
sample_tokens = dataset_manager.edit.preview_sample(
    sample_id=sample_id,
    return_detokenized=False,
    return_doc_details=False
)

total_tokens = sum(len(segment) for segment in sample_tokens)
print(f"Tokenized version:")
print(f"Number of segments: {len(sample_tokens)}")
print(f"Total tokens: {total_tokens}")
print(f"First segment shape: {sample_tokens[0].shape}")
print(f"First 10 tokens: {sample_tokens[0][:10]}")
print(f"Last 10 tokens: {sample_tokens[-1][-10:]}")

Tokenized version:
Number of segments: 9
Total tokens: 2049
First segment shape: (261,)
First 10 tokens: [6029  434 1113 6497  327  253 3216  285 9377   15]
Last 10 tokens: [4581  617 2927   15 1500 3597  281 1158  273 5211]
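
The total of 2049 tokens is one more than `train_seq_len=2048`. A common reason for this in language-model data pipelines is that each sample is split into inputs and next-token labels that are shifted by one position. The sketch below illustrates that split with stand-in token IDs; it assumes this layout applies here and is not taken from TokenSmith's code.

```python
import numpy as np

train_seq_len = 2048  # value from the setup cell

# Stand-in token IDs; a real sample would come from preview_sample().
sample = np.arange(train_seq_len + 1, dtype=np.uint16)

# Assumed layout: inputs and labels are the same sample shifted by one token.
inputs = sample[:-1]
labels = sample[1:]

print(len(sample), len(inputs), len(labels))
```

Under this assumption, both `inputs` and `labels` have exactly `train_seq_len` entries.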


## Validation and Safety Checks

Before performing injections, it's important to validate injection locations and understand dataset boundaries.

In [24]:
# Test validation function
test_locations = [0, 50, 100, 500, 1000, 2000, 10000, -1, -5]

print("=== Injection Location Validation ===")
for loc in test_locations:
    is_valid = dataset_manager.edit.validate_injection_location(loc)
    status = "✓ Valid" if is_valid else "✗ Invalid"
    print(f"Location {loc:5d}: {status}")

# Find dataset size
dataset_size = dataset_manager.WriteableMMapIndexedDataset.num_samples
print(f"\nDataset contains {dataset_size} samples")
print(f"Valid injection range: 0 to {dataset_size - 1}")

=== Injection Location Validation ===
Location     0: ✗ Invalid
Location    50: ✗ Invalid
Location   100: ✗ Invalid
Location   500: ✗ Invalid
Location  1000: ✗ Invalid
Location  2000: ✗ Invalid
Location 10000: ✗ Invalid
Location    -1: ✗ Invalid
Location    -5: ✗ Invalid

Dataset contains 1600 samples
Valid injection range: 0 to 1599
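
The output above marks every location as invalid, even though the printed range is 0 to 1599, so it may be worth cross-checking indices independently of `validate_injection_location()`. The following is a minimal sketch with a hypothetical helper, `in_sample_range`, which is not part of TokenSmith and only checks bounds against the sample count.

```python
def in_sample_range(loc, num_samples):
    """Return True if loc is a zero-based index into num_samples samples.

    Hypothetical helper: negative indices are rejected rather than
    wrapped around, to avoid editing an unintended sample.
    """
    return 0 <= loc < num_samples


num_samples = 1600  # dataset size reported above
test_locations = [0, 50, 100, 500, 1000, 2000, 10000, -1, -5]

in_range = {loc: in_sample_range(loc, num_samples) for loc in test_locations}
for loc, ok in in_range.items():
    print(f"Location {loc:5d}: {'in range' if ok else 'out of range'}")
```

If a location passes this check but the validator still rejects it, the cause lies somewhere other than the index bounds.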


## Basic Text Injection with Dry Run

Let's start with a basic text injection using dry run mode to safely preview changes.

In [25]:
# Basic injection example with dry run
injection_text = "This is a test injection to demonstrate TokenSmith's editing capabilities."
injection_location = 75

print("=== Basic Injection Example (Dry Run) ===")
print(f"Injecting text: '{injection_text}'")
print(f"Location: {injection_location}")
print(f"Injection type: seq_shuffle (default)")
print("\n" + "="*60)

# Perform dry run injection
dataset_manager.edit.inject_and_preview(
    text=injection_text,
    tokenizer=tokenizer,
    injection_loc=injection_location,
    injection_type="seq_shuffle",
    dry_run=True,  # Safe mode - no actual changes
    add_eos_token=True
)

>> Casting injection data from int64 to <class 'numpy.uint16'>


=== Basic Injection Example (Dry Run) ===
Injecting text: 'This is a test injection to demonstrate TokenSmith's editing capabilities.'
Location: 75
Injection type: seq_shuffle (default)

Dummy sample: [ 1552   310   247  1071  8829   281  7568 35097 21484   434 14835 13789
    15     0]
Training sample 75
Sample consists of segments from 10 documents
Raw sample: [  253 16064    15 ...   253  3644    15]
---
Decoded sample:  the museum. Sam was very happy that she was able to help BRO when he was broken. The end.<|endoftext|>Once upon a time, there was a little girl named Lily. Lily wanted to know more about magnets. She asked her mom, “What are magnets?” Her mom told her, “Magnets are very impressive. They pull things to them.” Lily was excited. She wanted to see a magnet for herself. She went to the store and found the most incredible magnet. It was a rainbow and it glittered in the light. Lily was so excited and she wanted to show her friends what she knew about magnets. She took out
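
The message `Casting injection data from int64 to <class 'numpy.uint16'>` indicates that token IDs are narrowed to 16 bits before being written. A `uint16` can hold values from 0 to 65535, so any larger ID would not survive the cast. The sketch below shows one way to guard against that; `cast_token_ids` is a hypothetical helper, not a TokenSmith function.

```python
import numpy as np


def cast_token_ids(token_ids, dtype=np.uint16):
    """Cast token IDs to dtype, refusing values that would not fit."""
    ids = np.asarray(token_ids, dtype=np.int64)
    info = np.iinfo(dtype)
    if ids.size and (ids.min() < info.min or ids.max() > info.max):
        raise ValueError(
            f"token IDs outside [{info.min}, {info.max}] "
            f"cannot be stored as {np.dtype(dtype).name}"
        )
    return ids.astype(dtype)


# Token IDs of the dummy sample printed above.
dummy = [1552, 310, 247, 1071, 8829, 281, 7568, 35097,
         21484, 434, 14835, 13789, 15, 0]
safe = cast_token_ids(dummy)
print(safe.dtype, safe.max())
```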

## Understanding Injection Types

TokenSmith supports two injection types that determine where in the sequence the new content is placed:

1. **`seq_shuffle`** - Randomly places the injection within the sequence
2. **`seq_start`** - Places the injection at the beginning of the sequence

Let's compare both types:

In [26]:
# Compare injection types
test_text = "INJECTION: This text demonstrates different injection strategies."
test_location = 125

print("=== Comparing Injection Types ===")
print(f"Test text: '{test_text}'")
print(f"Location: {test_location}")

# Test seq_shuffle injection
print("\n" + "="*50)
print("1. SEQ_SHUFFLE Injection:")
print("="*50)

result_shuffle = dataset_manager.edit.inject_and_preview(
    text=test_text,
    tokenizer=tokenizer,
    injection_loc=test_location,
    injection_type="seq_shuffle",
    dry_run=True,
    return_details=True,
    rng=np.random.default_rng(42)  # Fixed seed for reproducibility
)

print(f"Original length: {len(result_shuffle['original_sample']['decoded_text'])}")
print(f"Modified length: {len(result_shuffle['modified_sample']['decoded_text'])}")
print(f"Injection position determined by: Random placement within sequence")

>> Casting injection data from int64 to <class 'numpy.uint16'>


=== Comparing Injection Types ===
Test text: 'INJECTION: This text demonstrates different injection strategies.'
Location: 125

1. SEQ_SHUFFLE Injection:
Original length: 8419
Modified length: 8419
Injection position determined by: Random placement within sequence


In [27]:
# Test seq_start injection
print("\n" + "="*50)
print("2. SEQ_START Injection:")
print("="*50)

result_start = dataset_manager.edit.inject_and_preview(
    text=test_text,
    tokenizer=tokenizer,
    injection_loc=test_location,
    injection_type="seq_start",
    dry_run=True,
    return_details=True
)

print(f"Original length: {len(result_start['original_sample']['decoded_text'])}")
print(f"Modified length: {len(result_start['modified_sample']['decoded_text'])}")
print(f"Injection position: Beginning of sequence")

# Show first 200 characters to see the injection
print(f"\nFirst 200 chars of modified sample:")
print(f"{result_start['modified_sample']['decoded_text'][:200]}...")

>> Casting injection data from int64 to <class 'numpy.uint16'>



2. SEQ_START Injection:
Original length: 8419
Modified length: 8419
Injection position: Beginning of sequence

First 200 chars of modified sample:
 a time there was a little ice cream cone. It was filled with white, creamy ice cream, and it made the cone look normal. But then something strange happened. The ice cream started to melt. It oozed an...
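
To make the difference between the two injection types concrete, here is a small NumPy sketch. It is not TokenSmith's implementation: the function `inject` is hypothetical, and it assumes that an injection overwrites a window of the fixed-length sample rather than shifting the remaining tokens, which would be consistent with the unchanged lengths in the outputs above.

```python
import numpy as np


def inject(sample, injection, injection_type="seq_shuffle", rng=None):
    """Overwrite a window of `sample` with `injection`, keeping the length fixed.

    Returns the modified copy and the offset where the injection starts.
    """
    sample = np.asarray(sample)
    injection = np.asarray(injection, dtype=sample.dtype)
    if len(injection) > len(sample):
        raise ValueError("injection is longer than the sample")

    if injection_type == "seq_start":
        offset = 0
    elif injection_type == "seq_shuffle":
        rng = rng if rng is not None else np.random.default_rng()
        # Any offset that keeps the whole injection inside the sample.
        offset = int(rng.integers(0, len(sample) - len(injection) + 1))
    else:
        raise ValueError(f"unknown injection_type: {injection_type}")

    modified = sample.copy()
    modified[offset:offset + len(injection)] = injection
    return modified, offset


toy_sample = np.arange(10, dtype=np.uint16)
toy_injection = np.array([100, 101, 102], dtype=np.uint16)

start_result, start_offset = inject(toy_sample, toy_injection, "seq_start")
shuffle_result, shuffle_offset = inject(
    toy_sample, toy_injection, "seq_shuffle", rng=np.random.default_rng(42)
)
print(start_offset, start_result)
print(shuffle_offset, shuffle_result)
```

In both cases the result has the same length as the input; only the position of the overwritten window differs.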


## Advanced Injection with Return Details

For programmatic analysis, we can return structured data instead of just printing results.

In [28]:
# Advanced injection with detailed analysis
analysis_text = "ANALYSIS: This injection includes detailed metadata for research purposes."
analysis_location = 200

print("=== Advanced Injection Analysis ===")

injection_result = dataset_manager.edit.inject_and_preview(
    text=analysis_text,
    tokenizer=tokenizer,
    injection_loc=analysis_location,
    injection_type="seq_shuffle",
    dry_run=True,
    return_details=True,
    add_eos_token=True,
    rng=np.random.default_rng(123)
)

# Analyze the results
print(f"Injection Location: {injection_result['injection_location']}")
print(f"Injection Type: {injection_result['injection_type']}")
print(f"Dry Run: {injection_result['dry_run']}")
print(f"Injected Text: '{injection_result['injected_text']}'")
print(f"Injected Tokens: {len(injection_result['injected_tokens'])} tokens")
print(f"First 10 injected tokens: {injection_result['injected_tokens'][:10]}")

# Compare original vs modified
orig = injection_result['original_sample']
mod = injection_result['modified_sample']

print(f"\nOriginal Sample:")
print(f"  Token count: {len(orig['raw_tokens'])}")
print(f"  Character count: {len(orig['decoded_text'])}")
print(f"  Document spans: {orig['num_documents']}")

print(f"\nModified Sample:")
print(f"  Token count: {len(mod['raw_tokens'])}")
print(f"  Character count: {len(mod['decoded_text'])}")
print(f"  Document spans: {mod['num_documents']}")

# Calculate changes
token_diff = len(mod['raw_tokens']) - len(orig['raw_tokens'])
char_diff = len(mod['decoded_text']) - len(orig['decoded_text'])

print(f"\nChanges:")
print(f"  Token difference: {token_diff:+d}")
print(f"  Character difference: {char_diff:+d}")

>> Casting injection data from int64 to <class 'numpy.uint16'>


=== Advanced Injection Analysis ===
Injection Location: 200
Injection Type: seq_shuffle
Dry Run: True
Injected Text: 'ANALYSIS: This injection includes detailed metadata for research purposes.'
Injected Tokens: 15 tokens
First 10 injected tokens: [34, 21686, 6328, 1830, 27, 831, 8829, 3797, 7000, 21464]

Original Sample:
  Token count: 2049
  Character count: 8590
  Document spans: 13

Modified Sample:
  Token count: 2049
  Character count: 8590
  Document spans: 13

Changes:
  Token difference: +0
  Character difference: +0
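
Token and character differences of +0 are expected for a fixed-length sample if the injection overwrites existing tokens, but they would also appear if the dry-run preview left the sample unchanged. Comparing the token arrays element by element distinguishes the two cases. The helper below, `changed_span`, is a hypothetical sketch and not part of TokenSmith.

```python
import numpy as np


def changed_span(original, modified):
    """Return (start, stop) of the smallest slice covering every differing
    token, or None if the two samples are identical."""
    original = np.asarray(original)
    modified = np.asarray(modified)
    if original.shape != modified.shape:
        raise ValueError("samples must have the same shape")
    diff = np.flatnonzero(original != modified)
    if diff.size == 0:
        return None
    return int(diff[0]), int(diff[-1]) + 1


before = np.array([5, 6, 7, 8, 9, 10])
after = np.array([5, 6, 70, 80, 9, 10])
print(changed_span(before, after))
print(changed_span(before, before))
```

Applied to the result above, the call would be `changed_span(orig['raw_tokens'], mod['raw_tokens'])`; a return value of `None` would mean the previewed sample is token-for-token identical to the original.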


## Batch Injection Workflows

For research and analysis, you often need to inject multiple samples. The Edit Handler supports batch operations.

In [29]:
# Prepare multiple injections
injections = [
    {
        "text": "PROMPT: Once upon a time in a digital kingdom,",
        "injection_loc": 300,
        "injection_type": "seq_start"
    },
    {
        "text": "CONTEXT: This story explores the intersection of technology and narrative.",
        "injection_loc": 301,
        "injection_type": "seq_shuffle"
    },
    {
        "text": "INSTRUCTION: Please continue this story with creative and engaging content.",
        "injection_loc": 302,
        "injection_type": "seq_start"
    },
    {
        "text": "METADATA: Generated by TokenSmith for research purposes.",
        "injection_loc": 303,
        "injection_type": "seq_shuffle"
    }
]

print("=== Batch Injection Example ===")
print(f"Prepared {len(injections)} injections:")
for i, inj in enumerate(injections, 1):
    print(f"  {i}. Location {inj['injection_loc']:3d} ({inj['injection_type']}): '{inj['text'][:50]}...'")

print("\n" + "="*70)

=== Batch Injection Example ===
Prepared 4 injections:
  1. Location 300 (seq_start): 'PROMPT: Once upon a time in a digital kingdom,...'
  2. Location 301 (seq_shuffle): 'CONTEXT: This story explores the intersection of t...'
  3. Location 302 (seq_start): 'INSTRUCTION: Please continue this story with creat...'
  4. Location 303 (seq_shuffle): 'METADATA: Generated by TokenSmith for research pur...'
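
Before running a batch, it can help to confirm that no two entries target the same sample, since how colliding injections are resolved is not shown in this tutorial. A minimal sketch, using a hypothetical helper `duplicate_locations` that works on the same list-of-dicts layout as `injections`:

```python
from collections import Counter


def duplicate_locations(injections):
    """Return the sorted injection_loc values that occur more than once."""
    counts = Counter(inj["injection_loc"] for inj in injections)
    return sorted(loc for loc, n in counts.items() if n > 1)


example = [
    {"text": "a", "injection_loc": 300, "injection_type": "seq_start"},
    {"text": "b", "injection_loc": 301, "injection_type": "seq_shuffle"},
    {"text": "c", "injection_loc": 300, "injection_type": "seq_shuffle"},
]
print(duplicate_locations(example))
```

For the four injections prepared above, which target locations 300 through 303, the helper would return an empty list.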



In [30]:
# Execute batch injections with detailed results
batch_results = dataset_manager.edit.inject_multiple_samples(
    injections=injections,
    tokenizer=tokenizer,
    rng=np.random.default_rng(456),
    add_eos_token=True,
    dry_run=True,  # Safe mode
    return_details=True
)

print(f"\nBatch injection completed! Processed {len(batch_results)} injections.")

# Analyze batch results
print("\n=== Batch Results Summary ===")
for i, result in enumerate(batch_results, 1):
    if 'error' in result:
        print(f"Injection {i}: ERROR - {result['error']}")
    else:
        orig_len = len(result['original_sample']['decoded_text'])
        mod_len = len(result['modified_sample']['decoded_text'])
        diff = mod_len - orig_len
        print(f"Injection {i}: SUCCESS - Location {result['injection_location']}, {diff:+d} chars")

>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>



Batch injection completed! Processed 4 injections.

=== Batch Results Summary ===
Injection 1: SUCCESS - Location 300, +0 chars
Injection 2: SUCCESS - Location 301, +0 chars
Injection 3: SUCCESS - Location 302, +0 chars
Injection 4: SUCCESS - Location 303, +0 chars


## Reproducible Injection with Seeds

For research reproducibility, it's important to control randomness in injections.

In [31]:
# Demonstrate reproducible injections
def reproducible_injection_demo():
    """Demonstrate that same seeds produce identical injection results."""
    
    test_text = "REPRODUCIBILITY: This injection should be identical across runs."
    test_location = 150
    seed = 789
    
    print("=== Reproducibility Test ===")
    
    # First injection
    result1 = dataset_manager.edit.inject_and_preview(
        text=test_text,
        tokenizer=tokenizer,
        injection_loc=test_location,
        injection_type="seq_shuffle",
        rng=np.random.default_rng(seed),
        dry_run=True,
        return_details=True
    )
    
    # Second injection with same seed
    result2 = dataset_manager.edit.inject_and_preview(
        text=test_text,
        tokenizer=tokenizer,
        injection_loc=test_location,
        injection_type="seq_shuffle",
        rng=np.random.default_rng(seed),  # Same seed
        dry_run=True,
        return_details=True
    )
    
    # Third injection with different seed
    result3 = dataset_manager.edit.inject_and_preview(
        text=test_text,
        tokenizer=tokenizer,
        injection_loc=test_location,
        injection_type="seq_shuffle",
        rng=np.random.default_rng(seed + 1),  # Different seed
        dry_run=True,
        return_details=True
    )
    
    # Compare results
    identical_12 = result1['modified_sample']['decoded_text'] == result2['modified_sample']['decoded_text']
    identical_13 = result1['modified_sample']['decoded_text'] == result3['modified_sample']['decoded_text']
    
    print(f"Result 1 == Result 2 (same seed): {identical_12}")
    print(f"Result 1 == Result 3 (different seed): {identical_13}")
    
    # Show injection details for verification
    for i, result in enumerate([result1, result2, result3], 1):
        details = result['injection_details']
        print(f"\nRun {i} injection details:")
        if 'pt_window_offset' in details:
            print(f"  Window offset: {details['pt_window_offset']}")
        if 'pt_injection_len' in details:
            print(f"  Injection length: {details['pt_injection_len']}")

reproducible_injection_demo()

>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>


=== Reproducibility Test ===
Result 1 == Result 2 (same seed): True
Result 1 == Result 3 (different seed): True

Run 1 injection details:
  Window offset: 1
  Injection length: 15

Run 2 injection details:
  Window offset: 1
  Injection length: 15

Run 3 injection details:
  Window offset: 0
  Injection length: 15
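
The reproducibility guarantee comes from NumPy's `Generator`: a generator built with `np.random.default_rng(seed)` produces the same stream of values every time it is created from the same seed. The sketch below shows this in isolation; `draw_offsets` is a hypothetical stand-in for whatever random choices an injection makes internally.

```python
import numpy as np


def draw_offsets(seed, n=5, high=2049):
    """Draw n integer offsets in [0, high) from a freshly seeded generator."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, high, size=n).tolist()


same_a = draw_offsets(789)
same_b = draw_offsets(789)  # same seed, fresh generator
print(same_a == same_b)
```

Note that the guarantee applies to a freshly created generator. Passing one generator object to several calls advances its state, so later calls see different values than they would with a new generator from the same seed.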


## Safe Editing Practices

Let's explore best practices for safe dataset editing, including validation, backup strategies, and incremental testing.

In [32]:
# Comprehensive safety check function
def comprehensive_safety_check(injections_list, tokenizer, dataset_manager):
    """Perform comprehensive safety checks before batch injection."""
    
    print("=== Comprehensive Safety Check ===")
    
    # Check 1: Validate all injection locations
    print("\n1. Validating injection locations...")
    invalid_locations = []
    for i, inj in enumerate(injections_list):
        loc = inj['injection_loc']
        if not dataset_manager.edit.validate_injection_location(loc):
            invalid_locations.append((i, loc))
    
    if invalid_locations:
        print(f"  ❌ Found {len(invalid_locations)} invalid locations:")
        for idx, loc in invalid_locations:
            print(f"     Injection {idx}: location {loc}")
        return False
    else:
        print(f"  ✅ All {len(injections_list)} locations are valid")
    
    # Check 2: Validate injection texts
    print("\n2. Validating injection texts...")
    empty_texts = []
    for i, inj in enumerate(injections_list):
        if not inj.get('text') or not inj['text'].strip():
            empty_texts.append(i)
    
    if empty_texts:
        print(f"  ❌ Found {len(empty_texts)} empty texts at indices: {empty_texts}")
        return False
    else:
        print(f"  ✅ All texts are non-empty")
    
    # Check 3: Test tokenization
    print("\n3. Testing tokenization...")
    tokenization_errors = []
    for i, inj in enumerate(injections_list):
        try:
            tokens = tokenizer(inj['text'])['input_ids']
            if len(tokens) == 0:
                tokenization_errors.append((i, "Empty token sequence"))
        except Exception as e:
            tokenization_errors.append((i, str(e)))
    
    if tokenization_errors:
        print(f"  ❌ Found {len(tokenization_errors)} tokenization errors:")
        for idx, error in tokenization_errors:
            print(f"     Injection {idx}: {error}")
        return False
    else:
        print(f"  ✅ All texts tokenize successfully")
    
    # Check 4: Preview first injection
    print("\n4. Previewing first injection...")
    try:
        first_inj = injections_list[0]
        preview_result = dataset_manager.edit.inject_and_preview(
            text=first_inj['text'],
            tokenizer=tokenizer,
            injection_loc=first_inj['injection_loc'],
            injection_type=first_inj.get('injection_type', 'seq_shuffle'),
            dry_run=True,
            return_details=True
        )
        print(f"  ✅ Preview successful for location {first_inj['injection_loc']}")
    except Exception as e:
        print(f"  ❌ Preview failed: {e}")
        return False
    
    print("\n🎉 All safety checks passed!")
    return True

# Test with our injection list
safety_result = comprehensive_safety_check(injections, tokenizer, dataset_manager)
print(f"\nSafety check result: {'PASSED' if safety_result else 'FAILED'}")

=== Comprehensive Safety Check ===

1. Validating injection locations...
  ❌ Found 4 invalid locations:
     Injection 0: location 300
     Injection 1: location 301
     Injection 2: location 302
     Injection 3: location 303

Safety check result: FAILED
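
A backup is the simplest safeguard before any run with `dry_run=False`. The sketch below copies the dataset files that share a prefix into a backup directory. It assumes the dataset is stored as a `.bin`/`.idx` pair next to the prefix, as is common for memory-mapped indexed datasets; `backup_dataset` is a hypothetical helper, and the suffixes should be adjusted if your files are named differently.

```python
import shutil
from pathlib import Path


def backup_dataset(dataset_prefix, backup_dir, suffixes=(".bin", ".idx")):
    """Copy every file named <dataset_prefix><suffix> into backup_dir.

    Raises FileNotFoundError before copying anything if a file is missing.
    """
    sources = [Path(str(dataset_prefix) + suffix) for suffix in suffixes]
    missing = [src for src in sources if not src.exists()]
    if missing:
        raise FileNotFoundError(f"missing dataset files: {missing}")

    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for src in sources:
        dst = backup_dir / src.name
        shutil.copy2(src, dst)  # copy2 also preserves timestamps
        copied.append(dst)
    return copied
```

A call such as `backup_dataset('../../artifacts/data_tokenized_text_document', '../../artifacts/backup')` would then precede the first permanent edit.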


## Performance Considerations

When editing large datasets, performance matters. The cell below times five single injections against one batch of five, all in dry-run mode.

In [33]:
import time

# Performance testing for different injection strategies
def performance_test():
    """Test performance of different injection approaches."""
    
    print("=== Performance Testing ===")
    
    # Test 1: Single injections
    print("\n1. Testing single injections...")
    single_injection_times = []
    
    for i in range(5):
        start_time = time.perf_counter()  # monotonic clock, suited to short intervals
        
        dataset_manager.edit.inject_and_preview(
            text=f"Performance test injection {i}",
            tokenizer=tokenizer,
            injection_loc=400 + i,
            injection_type="seq_shuffle",
            dry_run=True,
            return_details=False  # Skip detailed output for speed
        )
        
        single_injection_times.append(time.perf_counter() - start_time)
    
    avg_single = np.mean(single_injection_times)
    print(f"  Average single injection time: {avg_single:.4f} seconds")
    
    # Test 2: Batch injection
    print("\n2. Testing batch injection...")
    
    batch_injections = [
        {
            "text": f"Batch performance test {i}",
            "injection_loc": 500 + i,
            "injection_type": "seq_shuffle"
        }
        for i in range(5)
    ]
    
    start_time = time.perf_counter()  # monotonic clock, suited to short intervals
    
    dataset_manager.edit.inject_multiple_samples(
        injections=batch_injections,
        tokenizer=tokenizer,
        dry_run=True,
        return_details=False
    )
    
    batch_time = time.perf_counter() - start_time
    print(f"  Batch injection time (5 injections): {batch_time:.4f} seconds")
    print(f"  Average per injection in batch: {batch_time / 5:.4f} seconds")
    
    # Performance comparison
    print("\n3. Performance comparison:")
    total_single_time = avg_single * 5
    speedup = total_single_time / batch_time if batch_time > 0 else float('inf')
    print(f"  5 single injections: {total_single_time:.4f} seconds")
    print(f"  1 batch injection: {batch_time:.4f} seconds")
    print(f"  Speedup factor: {speedup:.2f}x")

performance_test()

>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>


=== Performance Testing ===

1. Testing single injections...
Dummy sample: [35975  1071  8829   470     0]
Training sample 400
Sample consists of segments from 10 documents
Raw sample: [ 1476 16543 12918 ...   452   794   326]
---
Decoded sample: !" Anna laughed. She saw a dress and a scarf and put them on. "And I look like a queen!" They pretended to be a king and a queen and had fun. They sat on a seat in the closet and talked about their kingdom. But then they heard a voice. It was Mom. She was looking for them. "Anna! Ben! Where are you?" Mom called. Anna and Ben got scared. They did not want Mom to see them in the old clothes. They thought Mom would be angry. "Quick, hide!" Anna whispered. They took off the clothes and hats and put them back in the closet. They closed the door and stayed quiet. Mom came to the hall. She saw the closet and opened it. She was surprised to see Anna and Ben inside. "What are you doing here?" Mom asked. Anna and Ben looked at Mom. They did not know wha

>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>


=== Injection 1/5 ===
=== Injection 2/5 ===


>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>


=== Injection 3/5 ===


>> Casting injection data from int64 to <class 'numpy.uint16'>


=== Injection 4/5 ===
=== Injection 5/5 ===
  Batch injection time (5 injections): 0.0395 seconds
  Average per injection in batch: 0.0079 seconds

3. Performance comparison:
  5 single injections: 0.0287 seconds
  1 batch injection: 0.0395 seconds
  Speedup factor: 0.72x
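
The timings above cover only five calls of a few milliseconds each, so a factor such as 0.72x may largely reflect noise from printing and first-call overhead. A more robust comparison repeats each measurement and reports the median. The sketch below is a generic timing helper, not part of TokenSmith; `time_call` and its parameters are hypothetical.

```python
import statistics
import time


def time_call(fn, repeats=20, warmup=2):
    """Run fn several times and return (median, best) wall-clock seconds."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialisation before measuring
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), min(timings)


median_s, best_s = time_call(lambda: sum(range(10_000)), repeats=10)
print(f"median: {median_s:.6f}s, best: {best_s:.6f}s")
```

To compare the two strategies, `fn` could wrap a single `inject_and_preview()` call in one measurement and an `inject_multiple_samples()` call in another, both with `dry_run=True`.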


## Real-World Editing Scenarios

Let's explore some practical scenarios where dataset editing is valuable for research and training.

In [34]:
# Scenario 1: Adding prompts and instructions for fine-tuning
def instruction_injection_scenario():
    """Simulate adding instruction prompts to prepare data for fine-tuning."""
    
    print("=== Scenario 1: Instruction Injection for Fine-tuning ===")
    
    # Define instruction templates
    instruction_templates = [
        "Instruction: Summarize the following text in one sentence.\n",
        "Instruction: Identify the main theme of this passage.\n",
        "Instruction: Rewrite this text in a more formal tone.\n",
        "Instruction: Extract the key facts from the following content.\n"
    ]
    
    # Create instruction injections
    instruction_injections = []
    base_location = 600
    
    for i, template in enumerate(instruction_templates):
        instruction_injections.append({
            "text": template,
            "injection_loc": base_location + i,
            "injection_type": "seq_start"  # Instructions go at the beginning
        })
    
    print(f"Created {len(instruction_injections)} instruction injections:")
    for i, inj in enumerate(instruction_injections, 1):
        print(f"  {i}. {inj['text'].strip()}")
    
    # Execute with safety checks
    if comprehensive_safety_check(instruction_injections, tokenizer, dataset_manager):
        print("\nExecuting instruction injections...")
        
        results = dataset_manager.edit.inject_multiple_samples(
            injections=instruction_injections,
            tokenizer=tokenizer,
            dry_run=True,
            return_details=True
        )
        
        print(f"\nInstruction injection results:")
        for i, result in enumerate(results, 1):
            if 'error' not in result:
                mod_text = result['modified_sample']['decoded_text']
                # Show first 150 characters to see the instruction
                preview = mod_text[:150].replace('\n', ' ')
                print(f"  Sample {i}: {preview}...")
    
instruction_injection_scenario()

=== Scenario 1: Instruction Injection for Fine-tuning ===
Created 4 instruction injections:
  1. Instruction: Summarize the following text in one sentence.
  2. Instruction: Identify the main theme of this passage.
  3. Instruction: Rewrite this text in a more formal tone.
  4. Instruction: Extract the key facts from the following content.
=== Comprehensive Safety Check ===

1. Validating injection locations...
  ❌ Found 4 invalid locations:
     Injection 0: location 600
     Injection 1: location 601
     Injection 2: location 602
     Injection 3: location 603


In [35]:
# Scenario 2: Adding metadata and provenance information
def metadata_injection_scenario():
    """Simulate adding metadata for dataset provenance and tracking."""
    
    print("\n=== Scenario 2: Metadata Injection for Provenance ===")
    
    # Create metadata injections
    metadata_injections = [
        {
            "text": "[META: Source=TinyStories, Version=1.0, ProcessedBy=TokenSmith]",
            "injection_loc": 700,
            "injection_type": "seq_start"
        },
        {
            "text": "[QUALITY: Human-reviewed=True, Rating=High, LastCheck=2024]",
            "injection_loc": 701,
            "injection_type": "seq_shuffle"
        },
        {
            "text": "[USAGE: AllowCommercial=True, AllowDerivatives=True, License=MIT]",
            "injection_loc": 702,
            "injection_type": "seq_start"
        }
    ]
    
    print(f"Created {len(metadata_injections)} metadata injections:")
    for i, inj in enumerate(metadata_injections, 1):
        print(f"  {i}. {inj['text']}")
    
    # Execute metadata injections
    print("\nExecuting metadata injections...")
    
    results = dataset_manager.edit.inject_multiple_samples(
        injections=metadata_injections,
        tokenizer=tokenizer,
        dry_run=True,
        return_details=True
    )
    
    print(f"\nMetadata injection summary:")
    for i, result in enumerate(results, 1):
        if 'error' not in result:
            orig_len = len(result['original_sample']['decoded_text'])
            mod_len = len(result['modified_sample']['decoded_text'])
            metadata_added = len(result['injected_text'])
            print(f"  Sample {i}: +{metadata_added} chars metadata, total growth: {mod_len - orig_len:+d} chars")

metadata_injection_scenario()


=== Scenario 2: Metadata Injection for Provenance ===
Created 3 metadata injections:
  1. [META: Source=TinyStories, Version=1.0, ProcessedBy=TokenSmith]
  2. [QUALITY: Human-reviewed=True, Rating=High, LastCheck=2024]
  3. [USAGE: AllowCommercial=True, AllowDerivatives=True, License=MIT]

Executing metadata injections...


>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>



Metadata injection summary:
  Sample 1: +63 chars metadata, total growth: +0 chars
  Sample 2: +59 chars metadata, total growth: +0 chars
  Sample 3: +65 chars metadata, total growth: +0 chars
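
The log lines above report token IDs being cast from `int64` to `numpy.uint16`. A NumPy cast of that kind wraps out-of-range values silently, so a quick range check before injecting can be worthwhile. The sketch below is independent of TokenSmith: `fits_dtype` is a hypothetical helper, and the token IDs are illustrative values rather than real tokenizer output.

```python
import numpy as np


def fits_dtype(token_ids, dtype=np.uint16) -> bool:
    """Return True if every token ID can be stored in `dtype` without wrapping."""
    ids = np.asarray(token_ids, dtype=np.int64)
    info = np.iinfo(dtype)
    return bool(ids.size == 0 or (ids.min() >= info.min and ids.max() <= info.max))


small_ids = [0, 510, 50276]  # illustrative IDs, all below 65536
large_ids = [0, 70000]       # 70000 needs more than 16 bits

ok_small = fits_dtype(small_ids)
ok_large = fits_dtype(large_ids)

# Casting anyway wraps modulo 65536: 70000 becomes 4464.
wrapped = np.array(large_ids, dtype=np.int64).astype(np.uint16)
print(ok_small, ok_large, wrapped.tolist())
```

If the check fails for your tokenizer, the safer reaction is to stop and investigate the dataset dtype rather than to proceed with the injection.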


In [36]:
# Scenario 3: Research experiment with controlled interventions
def research_intervention_scenario():
    """Simulate controlled interventions for research experiments."""
    
    print("\n=== Scenario 3: Research Intervention Experiment ===")
    
    # Define experimental conditions
    conditions = {
        "positive_sentiment": "This story has a wonderful and uplifting conclusion.",
        "negative_sentiment": "This story has a tragic and disappointing ending.",
        "neutral_control": "This story concludes in a typical manner.",
        "question_prompt": "What do you think happens next in this story?"
    }
    
    # Create experimental injections
    experiment_injections = []
    base_location = 800
    
    for i, (condition, text) in enumerate(conditions.items()):
        experiment_injections.append({
            "text": f"[CONDITION: {condition.upper()}] {text}",
            "injection_loc": base_location + i,
            "injection_type": "seq_shuffle",
            "condition": condition
        })
    
    print(f"Experimental design with {len(experiment_injections)} conditions:")
    for i, inj in enumerate(experiment_injections, 1):
        condition = inj['condition']
        text_preview = inj['text'][:60]
        print(f"  {i}. {condition}: {text_preview}...")
    
    # Execute experimental injections
    print("\nExecuting experimental injections...")
    
    results = dataset_manager.edit.inject_multiple_samples(
        injections=experiment_injections,
        tokenizer=tokenizer,
        rng=np.random.default_rng(2024),  # Fixed seed for reproducibility
        dry_run=True,
        return_details=True
    )
    
    # Analyze experimental results
    print("\nExperimental results:")
    for i, (result, inj) in enumerate(zip(results, experiment_injections), 1):
        if 'error' not in result:
            condition = inj['condition']
            injection_loc = result['injection_location']
            
            print(f"  Condition {i} ({condition}):")
            print(f"    Location: {injection_loc}")
            # "Successful" here only means the call returned no error
            print("    Injection successful: ✓")
            
            # Check whether the marker appears verbatim in the decoded modified sample
            modified_text = result['modified_sample']['decoded_text']
            condition_marker = f"[CONDITION: {condition.upper()}]"
            
            if condition_marker in modified_text:
                print("    Condition marker found: ✓")
            else:
                # A ✗ means the decoded sample does not contain the marker;
                # inspect the sample before running without dry_run
                print("    Condition marker found: ✗")

research_intervention_scenario()


=== Scenario 3: Research Intervention Experiment ===
Experimental design with 4 conditions:
  1. positive_sentiment: [CONDITION: POSITIVE_SENTIMENT] This story has a wonderful a...
  2. negative_sentiment: [CONDITION: NEGATIVE_SENTIMENT] This story has a tragic and ...
  3. neutral_control: [CONDITION: NEUTRAL_CONTROL] This story concludes in a typic...
  4. question_prompt: [CONDITION: QUESTION_PROMPT] What do you think happens next ...

Executing experimental injections...


>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>
>> Casting injection data from int64 to <class 'numpy.uint16'>



Experimental results:
  Condition 1 (positive_sentiment):
    Location: 800
    Injection successful: ✓
    Condition marker found: ✗
  Condition 2 (negative_sentiment):
    Location: 801
    Injection successful: ✓
    Condition marker found: ✗
  Condition 3 (neutral_control):
    Location: 802
    Injection successful: ✓
    Condition marker found: ✗
  Condition 4 (question_prompt):
    Location: 803
    Injection successful: ✓
    Condition marker found: ✗


## Best Practices and Guidelines

Let's consolidate the best practices for dataset editing with TokenSmith.

### TokenSmith Dataset Editing Best Practices

🛡️ **SAFETY PRACTICES:**
1. Always use `dry_run=True` for initial testing
2. Validate injection locations before batch operations
3. Test tokenization on all texts before injection
4. Use `return_details=True` for programmatic analysis
5. Preview samples before and after modification
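
The pre-flight checks above can be sketched as a small validator that runs before any batch call. This is a library-independent illustration: `check_injection_specs` is a hypothetical helper (not a TokenSmith function), `max_loc` is an assumed stand-in for the number of valid locations in your dataset, and the allowed types are just the two used in this tutorial.

```python
from typing import Any

# Injection types used in this tutorial; adjust if your setup supports others.
ALLOWED_TYPES = {"seq_start", "seq_shuffle"}


def check_injection_specs(specs: list[dict[str, Any]], max_loc: int) -> list[str]:
    """Return human-readable problems; an empty list means the specs look sane."""
    problems = []
    for idx, spec in enumerate(specs):
        text = spec.get("text")
        loc = spec.get("injection_loc")
        kind = spec.get("injection_type")
        if not isinstance(text, str) or not text.strip():
            problems.append(f"spec {idx}: empty or non-string text")
        if not isinstance(loc, int) or isinstance(loc, bool) or not 0 <= loc < max_loc:
            problems.append(f"spec {idx}: injection_loc {loc!r} outside [0, {max_loc})")
        if kind not in ALLOWED_TYPES:
            problems.append(f"spec {idx}: unknown injection_type {kind!r}")
    return problems


specs = [
    {"text": "[META: Source=TinyStories]", "injection_loc": 700, "injection_type": "seq_start"},
    {"text": "", "injection_loc": 5000, "injection_type": "seq_middle"},
]
problems = check_injection_specs(specs, max_loc=1000)
for line in problems:
    print(line)
```

Running a check like this first keeps malformed specifications out of the batch call, so any error the Edit Handler reports afterwards is more likely to concern the dataset itself.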

🔄 **REPRODUCIBILITY PRACTICES:**
1. Use fixed seeds with `np.random.default_rng()`
2. Document all injection parameters
3. Save injection specifications for replay
4. Version control your injection scripts
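
One way to make the reproducibility items above concrete is to store the seed together with the injection specifications in a single JSON file. The sketch below uses only the standard library and NumPy; the file name `injection_specs.json`, the temporary directory, and the record layout are assumptions chosen for illustration.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

SEED = 2024  # same fixed seed as the research scenario in this tutorial

injections = [
    {
        "text": "[CONDITION: NEUTRAL_CONTROL] This story concludes in a typical manner.",
        "injection_loc": 802,
        "injection_type": "seq_shuffle",
    },
]

# Keep the seed next to the specs so a later run can rebuild the same generator.
record = {"seed": SEED, "injections": injections}

spec_path = Path(tempfile.mkdtemp()) / "injection_specs.json"  # hypothetical location
spec_path.write_text(json.dumps(record, indent=2), encoding="utf-8")

# Replay: reload the record and recreate the generator from the stored seed.
replayed = json.loads(spec_path.read_text(encoding="utf-8"))
rng_first = np.random.default_rng(SEED)
rng_replay = np.random.default_rng(replayed["seed"])

# Generators built from the same seed produce the same draws.
draws_first = rng_first.integers(0, 1000, size=5)
draws_replay = rng_replay.integers(0, 1000, size=5)
same_draws = bool((draws_first == draws_replay).all())
print(same_draws)
```

Committing the JSON file alongside the injection script gives you both the parameters and the code needed to repeat a run.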

⚡ **PERFORMANCE PRACTICES:**
1. Use batch operations for multiple injections
2. Avoid `return_details=True` for large batches unless needed
3. Test performance on small samples first
4. Consider memory usage for very large datasets
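
If detailed results for a very large batch do not fit comfortably in memory, one option is to process the injection list in fixed-size chunks and keep only a summary of each chunk. The sketch below shows the chunking step alone; `chunked` is a hypothetical helper, and whether chunking actually helps depends on how the Edit Handler handles large batches internally, which this tutorial does not specify.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunked(items: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


# Ten illustrative specs, split into chunks of four.
all_specs = [
    {"text": f"note {n}", "injection_loc": 900 + n, "injection_type": "seq_start"}
    for n in range(10)
]
chunk_sizes = [len(chunk) for chunk in chunked(all_specs, 4)]
print(chunk_sizes)
```

Each chunk can then be passed to a batch call on its own, starting with a single small chunk to gauge runtime before committing to the full list.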

📊 **ANALYSIS PRACTICES:**
1. Compare before/after statistics
2. Monitor injection success rates
3. Track changes in sequence lengths
4. Validate injection placement for different types
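
The analysis steps above can be collected into one summary function. The sketch assumes the result dictionaries have the layout used in this tutorial's scenarios (an `error` key on failure, otherwise `injected_text`, `original_sample`, and `modified_sample` with a `decoded_text` field). `summarize_results` is a hypothetical helper, and the results below are hand-written stand-ins, not real TokenSmith output.

```python
from statistics import mean


def summarize_results(results: list[dict]) -> dict:
    """Summarize a batch of detailed injection results."""
    ok = [r for r in results if "error" not in r]
    deltas = [
        len(r["modified_sample"]["decoded_text"]) - len(r["original_sample"]["decoded_text"])
        for r in ok
    ]
    # Count results whose decoded modified text contains the injected text verbatim.
    found = [r["injected_text"] in r["modified_sample"]["decoded_text"] for r in ok]
    return {
        "total": len(results),
        "succeeded": len(ok),
        "success_rate": len(ok) / len(results) if results else 0.0,
        "mean_char_delta": mean(deltas) if deltas else 0.0,
        "text_found": sum(found),
    }


# Hand-written stand-ins that only illustrate the shape of the summary.
mock_results = [
    {
        "injected_text": "[META]",
        "original_sample": {"decoded_text": "Once upon a time"},
        "modified_sample": {"decoded_text": "[META] Once upon"},
    },
    {
        "injected_text": "[META]",
        "original_sample": {"decoded_text": "The end"},
        "modified_sample": {"decoded_text": "The end"},
    },
    {"error": "invalid location"},
]
summary = summarize_results(mock_results)
print(summary)
```

Tracking `text_found` separately from `succeeded` matters because a call can return without an error while the injected text is still absent from the decoded sample.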

🔍 **RESEARCH PRACTICES:**
1. Design clear experimental conditions
2. Use appropriate injection types for your use case
3. Document the research rationale for each injection
4. Plan for control groups and baselines

## Summary and Key Takeaways

Congratulations! You've successfully learned how to use TokenSmith's Edit Handler for dataset modification. Here's what we covered:

### Tutorial Summary: Dataset Editing Methods

📚 **KEY CONCEPTS LEARNED:**
1. Edit Handler initialization and setup
2. Single text injection with `inject_and_preview()`
3. Batch injection with `inject_multiple_samples()`
4. Sample preview without modification
5. Injection location validation
6. Dry run vs. production mode
7. Injection types: `seq_shuffle` vs. `seq_start`
8. Reproducible editing with seeds
9. Safety checks and validation workflows
10. Performance optimization techniques

🛠️ **METHODS MASTERED:**
- `dataset_manager.edit.inject_and_preview()` → Single text injection with preview
- `dataset_manager.edit.inject_multiple_samples()` → Batch injection operations
- `dataset_manager.edit.preview_sample()` → Sample inspection without changes
- `dataset_manager.edit.validate_injection_location()` → Location validation

🎯 **INJECTION STRATEGIES EXPLORED:**
- **seq_start**: Place injections at sequence beginning
- **seq_shuffle**: Randomly place injections within sequence
- **Batch operations**: Efficient multi-sample modification
- **Reproducible seeding**: Consistent results across runs
- **Safety-first approach**: Validation and dry runs before changes

🔬 **REAL-WORLD SCENARIOS COVERED:**
1. Instruction injection for fine-tuning
2. Metadata addition for provenance tracking
3. Research interventions for controlled experiments
4. Performance optimization for large datasets

✅ **BEST PRACTICES ESTABLISHED:**
- Always start with dry runs
- Validate locations and texts before injection
- Use fixed seeds for reproducibility
- Prefer batch operations for efficiency
- Monitor and analyze injection results
- Document experimental designs clearly

🚀 **NEXT STEPS:**
- Apply editing techniques to your own datasets
- Experiment with different injection strategies
- Integrate editing into your research workflows
- Explore export functionality to save modified datasets
- Combine editing with search and sampling for complex analyses

---

🎉 **You're now ready to safely and effectively edit datasets with TokenSmith!**

## Additional Resources

For more advanced usage and additional tutorials:

- **[TokenSmith Documentation](https://aflah02.github.io/tokensmith)** - Complete API reference
- **[Basic Setup Tutorial](01_basic_setup.ipynb)** - Getting started with TokenSmith
- **[Inspection Tutorial](02_inspect_samples.ipynb)** - Dataset examination techniques
- **[Sampling Tutorial](03_sampling_methods.ipynb)** - Flexible data sampling strategies
- **[Search Tutorial](05_search_functionality.ipynb)** - Advanced search capabilities

### Pro Tips for Production Use:

1. **Version Control**: Always version your datasets before making modifications
2. **Backup Strategy**: Keep backups of original datasets
3. **Testing**: Test on small subsets before applying to full datasets
4. **Documentation**: Document all modifications for reproducibility
5. **Monitoring**: Track the impact of modifications on model performance
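
As a minimal sketch of the backup tip, the function below copies every file that shares a dataset prefix into a separate directory before any non-dry-run edit. `backup_dataset_files` is a hypothetical helper, and the `.bin`/`.idx` file pair in the demo is an assumed on-disk layout; check which files your dataset prefix actually maps to.

```python
import shutil
import tempfile
from pathlib import Path


def backup_dataset_files(prefix: Path, backup_dir: Path) -> list[Path]:
    """Copy every file whose name starts with the prefix's name into backup_dir."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(prefix.parent.glob(prefix.name + "*")):
        if src.is_file():
            # copy2 also preserves timestamps and other file metadata
            copied.append(Path(shutil.copy2(src, backup_dir / src.name)))
    return copied


# Demo on throwaway files in a temporary directory.
work = Path(tempfile.mkdtemp())
for suffix in (".bin", ".idx"):
    (work / f"data_tokenized_text_document{suffix}").write_bytes(b"demo")

copies = backup_dataset_files(work / "data_tokenized_text_document", work / "backup")
print([p.name for p in copies])
```

Keeping the backup outside the working directory, or on separate storage, protects it from the same mistakes that could damage the original files.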

### Common Pitfalls to Avoid:

- Skipping validation steps
- Not using dry runs for testing
- Ignoring sequence length limits
- Forgetting to set random seeds for reproducibility
- Making modifications without backing up original data

Happy editing with TokenSmith! 🔧✨