# Smart Copy Strategies and Filtering

**Duration:** 25 minutes  
**Level:** Intermediate

Learn how to copy files efficiently with skip strategies and filtering.

## What You'll Learn

- Skip strategies (never, exists, size, hash)
- Custom skip functions
- File filtering (include/exclude patterns)
- Progress tracking and callbacks
- Incremental backups
- Performance optimization

## Why Skip Strategies?

Copying can be expensive (time, bandwidth, cost). Skip strategies let you:
- Avoid copying unchanged files
- Save time and bandwidth
- Implement incremental backups
- Reduce cloud storage costs

Let's optimize! ⚡

In [None]:
from genro_storage import StorageManager
import time

storage = StorageManager()
storage.configure([
    {'name': 'source', 'type': 'memory'},
    {'name': 'dest', 'type': 'memory'}
])

print("✓ Storage ready")

## 1. Default Behavior (skip='never')

By default, copy always overwrites:

In [None]:
# Create source file
src = storage.node('source:file.txt')
src.write('Version 1')

# Copy once
dst = storage.node('dest:file.txt')
src.copy(dst)
print(f"First copy: {dst.read()}")

# Modify source and copy again
src.write('Version 2')
src.copy(dst)  # Overwrites even though dst exists
print(f"Second copy: {dst.read()}")

print("\n✓ Default behavior: always overwrite")

## 2. Skip Strategy: 'exists'

Skip if destination exists (fastest):

In [None]:
# Create files
src1 = storage.node('source:new_file.txt')
src1.write('New content')

src2 = storage.node('source:existing_file.txt')
src2.write('Will be skipped')

dst2 = storage.node('dest:existing_file.txt')
dst2.write('Already exists')

# Copy with skip='exists'
print("Copying with skip='exists':")

src1.copy(storage.node('dest:new_file.txt'), skip='exists')
print("  new_file: copied")

src2.copy(dst2, skip='exists')
print("  existing_file: skipped")

print(f"\nDestination content: {dst2.read()}")
print("✓ Existing file was not overwritten")

## 3. Skip Strategy: 'size'

Skip if the destination exists AND its size matches the source's. A same-size edit goes undetected, as `a.txt` shows in this example:

In [None]:
# Files with same size
src_a = storage.node('source:a.txt')
src_a.write('12345')  # 5 bytes

dst_a = storage.node('dest:a.txt')
dst_a.write('ABCDE')  # 5 bytes, different content

# Files with different size
src_b = storage.node('source:b.txt')
src_b.write('123456')  # 6 bytes

dst_b = storage.node('dest:b.txt')
dst_b.write('ABC')  # 3 bytes

print("Copying with skip='size':")

src_a.copy(dst_a, skip='size')
print(f"  a.txt: skipped (same size)")
print(f"    Content still: {dst_a.read()}")

src_b.copy(dst_b, skip='size')
print(f"  b.txt: copied (different size)")
print(f"    Content now: {dst_b.read()}")

## 4. Skip Strategy: 'hash'

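Skip if the destination exists AND its MD5 hash matches the source's. This is the slowest built-in strategy, but the only one that detects a change that leaves the file size unchanged: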

In [None]:
# Files with same content
src_same = storage.node('source:same.txt')
src_same.write('Identical content')

dst_same = storage.node('dest:same.txt')
dst_same.write('Identical content')

# Files with different content
src_diff = storage.node('source:diff.txt')
src_diff.write('New content')

dst_diff = storage.node('dest:diff.txt')
dst_diff.write('Old content')

print("Copying with skip='hash':")

src_same.copy(dst_same, skip='hash')
print(f"  same.txt: skipped (MD5 match)")

src_diff.copy(dst_diff, skip='hash')
print(f"  diff.txt: copied (MD5 mismatch)")
print(f"    New content: {dst_diff.read()}")
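
To make the trade-offs between the four built-in strategies concrete, here is a minimal standard-library sketch of the comparison each one implies. It is not how genro_storage implements them: `should_skip` and `file_md5` are hypothetical helpers that work on plain `pathlib` paths, and checking the size before hashing is an added shortcut, not something the library is known to do.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 64 * 1024) -> str:
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def should_skip(src: Path, dst: Path, strategy: str) -> bool:
    """Decide whether copying src over dst can be skipped (hypothetical helper)."""
    if strategy == "never":
        return False
    if strategy not in ("exists", "size", "hash"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    if not dst.exists():
        return False  # nothing to compare against, so copy
    if strategy == "exists":
        return True
    if src.stat().st_size != dst.stat().st_size:
        return False  # different sizes imply different content
    if strategy == "size":
        return True
    return file_md5(src) == file_md5(dst)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp, "src.txt")
    dst = Path(tmp, "dst.txt")
    missing = Path(tmp, "missing.txt")
    src.write_text("12345")
    dst.write_text("ABCDE")  # same size, different content

    decisions = {name: should_skip(src, dst, name)
                 for name in ("never", "exists", "size", "hash")}
    decision_for_missing = should_skip(src, missing, "hash")

print(decisions)
```

With same-size but different content, `exists` and `size` both report "skip" while `hash` reports "copy", which is the gap the `hash` strategy closes at the cost of reading both files.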

## 5. Custom Skip Functions

Write your own skip logic:

In [None]:
RECENT_SECONDS = 0.5  # Destinations modified more recently than this are skipped

def skip_if_recent(source, dest):
    """Skip if destination was modified within the last RECENT_SECONDS"""
    if not dest.exists:
        return False  # Don't skip, dest doesn't exist

    age = time.time() - dest.mtime
    return age < RECENT_SECONDS  # Skip if modified very recently

# Create old file
old_file = storage.node('dest:old.txt')
old_file.write('Old')
time.sleep(1)  # Let it age past RECENT_SECONDS

# Create recent file
recent_file = storage.node('dest:recent.txt')
recent_file.write('Recent')

# Try to copy
src = storage.node('source:update.txt')
src.write('Updated content')

print("Copying with custom skip function:")

src.copy(old_file, skip=skip_if_recent)
print(f"  old.txt: copied (too old)")
print(f"    Content: {old_file.read()}")

src.copy(recent_file, skip=skip_if_recent)
print(f"  recent.txt: skipped (too recent)")
print(f"    Content: {recent_file.read()}")
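
A skip function is just a predicate of the form `(source, dest) -> bool`, where `True` means "skip". The sketch below shows the same shape with plain `pathlib` paths instead of storage nodes, so `Path.stat().st_mtime` stands in for the node's `mtime`. The factory `make_skip_if_dest_newer` and its `tolerance_seconds` parameter are hypothetical names used only for illustration.

```python
import os
import tempfile
from pathlib import Path


def make_skip_if_dest_newer(tolerance_seconds: float = 0.0):
    """Build a skip function: True means 'skip', False means 'copy'."""
    def skip(source: Path, dest: Path) -> bool:
        if not dest.exists():
            return False  # nothing at the destination yet, so copy
        # Skip when the destination is at least as recent as the source.
        return dest.stat().st_mtime + tolerance_seconds >= source.stat().st_mtime
    return skip


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp, "source.txt")
    stale = Path(tmp, "stale.txt")
    fresh = Path(tmp, "fresh.txt")
    for path in (source, stale, fresh):
        path.write_text("data")

    # Set explicit timestamps so the example does not depend on sleep().
    os.utime(source, (1_000_000, 1_000_000))
    os.utime(stale, (999_000, 999_000))      # older than the source
    os.utime(fresh, (1_000_500, 1_000_500))  # newer than the source

    skip = make_skip_if_dest_newer()
    skip_stale = skip(source, stale)
    skip_fresh = skip(source, fresh)
    skip_missing = skip(source, Path(tmp, "missing.txt"))

print(skip_stale, skip_fresh, skip_missing)
```

Comparing the two modification times, rather than comparing one of them to the wall clock, keeps the decision deterministic and easy to test.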

## 6. Directory Copy with Skip

Apply skip strategy to entire directory:

In [None]:
# Create source directory
src_dir = storage.node('source:project')
src_dir.mkdir()
src_dir.child('file1.txt').write('Content 1')
src_dir.child('file2.txt').write('Content 2')
src_dir.child('file3.txt').write('Content 3')

# First backup
backup1 = storage.node('dest:backup1')
src_dir.copy(backup1)
print(f"✓ First backup: {len(list(backup1.children()))} files")

# Modify one file
src_dir.child('file2.txt').write('Modified content 2')

# Incremental backup into the same destination with hash skip
# (copying into a fresh directory would skip nothing)
src_dir.copy(backup1, skip='hash')

print("✓ Second backup: only changed files copied")
print(f"  file1.txt: {backup1.child('file1.txt').read()}")
print(f"  file2.txt: {backup1.child('file2.txt').read()}")

## 7. Progress Tracking

Monitor copy operations with callbacks:

In [None]:
# Create source with multiple files
data_dir = storage.node('source:data')
data_dir.mkdir()

for i in range(10):
    data_dir.child(f'file_{i}.txt').write_text(f'Data {i}')

# Progress callback
copied_count = 0
skipped_count = 0

def on_file_copied(src, dst):
    global copied_count
    copied_count += 1
    print(f"  ✓ Copied: {src.basename}")

def on_file_skipped(src, dst):
    global skipped_count
    skipped_count += 1
    print(f"  ⊘ Skipped: {src.basename}")

# Copy with callbacks
backup_dir = storage.node('dest:data_backup')
print("First copy:")
data_dir.copy(backup_dir, on_file=on_file_copied)

print(f"\nSecond copy (with skip='exists'):")
copied_count = 0
skipped_count = 0
data_dir.copy(backup_dir, skip='exists', 
              on_file=on_file_copied,
              on_skip=on_file_skipped)

print(f"\nSummary: {copied_count} copied, {skipped_count} skipped")
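
Module-level counters with `global` work in a notebook but are awkward to reuse. One alternative, sketched here, is to keep the counts on an object and pass its bound methods as callbacks. `CopyStats` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that carry only the `basename` attribute the callbacks read.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace


@dataclass
class CopyStats:
    """Collects copy results; its bound methods serve as callbacks."""
    copied: list = field(default_factory=list)
    skipped: list = field(default_factory=list)

    def on_file(self, src, dst):
        self.copied.append(src.basename)

    def on_skip(self, src, dst):
        self.skipped.append(src.basename)

    def summary(self) -> str:
        return f"{len(self.copied)} copied, {len(self.skipped)} skipped"


# Simulate one copied file and two skipped files with stand-in nodes.
stats = CopyStats()
nodes = [SimpleNamespace(basename=f"file_{i}.txt") for i in range(3)]
stats.on_file(nodes[0], None)
stats.on_skip(nodes[1], None)
stats.on_skip(nodes[2], None)
print(stats.summary())
```

Assuming the `on_file` and `on_skip` parameters behave as in the cell above, the methods could be passed as `on_file=stats.on_file, on_skip=stats.on_skip`, and a fresh `CopyStats()` replaces the manual counter reset between runs.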

## 8. File Filtering: Include Patterns

Copy only specific file types:

In [None]:
# Create mixed directory
mixed_dir = storage.node('source:mixed')
mixed_dir.mkdir()

mixed_dir.child('doc1.txt').write_text('Text 1')
mixed_dir.child('doc2.txt').write_text('Text 2')
mixed_dir.child('image1.jpg').write_text('JPG data')
mixed_dir.child('image2.png').write_text('PNG data')
mixed_dir.child('video.mp4').write_text('Video data')

# Copy only text files
text_backup = storage.node('dest:text_only')
mixed_dir.copy(text_backup, include=['*.txt'])

print("Text-only backup contains:")
for child in text_backup.children():
    print(f"  - {child.basename}")

# Copy only images
image_backup = storage.node('dest:images_only')
mixed_dir.copy(image_backup, include=['*.jpg', '*.png'])

print("\nImage-only backup contains:")
for child in image_backup.children():
    print(f"  - {child.basename}")

## 9. File Filtering: Exclude Patterns

Skip specific files or patterns:

In [None]:
# Create project directory
project = storage.node('source:myproject')
project.mkdir()

project.child('main.py').write_text('# Main')
project.child('utils.py').write_text('# Utils')
project.child('.env').write_text('SECRET=xxx')
project.child('.gitignore').write_text('*.pyc')
project.child('__pycache__').mkdir()
project.child('README.md').write_text('# Project')

# Copy excluding hidden and cache files
clean_copy = storage.node('dest:clean_project')
project.copy(clean_copy, exclude=['.*', '__pycache__'])

print("Clean copy contains:")
for child in clean_copy.children():
    print(f"  - {child.basename}")
print("\n✓ Hidden files and cache excluded")

## 10. Combining Include and Exclude

Use both for fine-grained control:

In [None]:
# Create complex directory
docs = storage.node('source:documents')
docs.mkdir()

docs.child('report.pdf').write_text('PDF')
docs.child('draft.pdf').write_text('Draft PDF')
docs.child('notes.txt').write_text('Notes')
docs.child('draft.txt').write_text('Draft notes')
docs.child('data.csv').write_text('CSV')

# Copy only PDFs and TXTs, but exclude drafts
final_docs = storage.node('dest:final_documents')
docs.copy(final_docs, 
          include=['*.pdf', '*.txt'],
          exclude=['draft.*'])

print("Final documents:")
for child in final_docs.children():
    print(f"  - {child.basename}")
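
The interaction between the two lists can be pictured with the standard library's `fnmatch` module. This is a sketch under the assumption that a file must match at least one include pattern and no exclude pattern; `is_selected` is a hypothetical helper, and the library's own matching rules may differ in detail.

```python
from fnmatch import fnmatchcase


def is_selected(name, include=None, exclude=None):
    """Return True if name passes the include list and misses the exclude list."""
    if include and not any(fnmatchcase(name, pattern) for pattern in include):
        return False
    if exclude and any(fnmatchcase(name, pattern) for pattern in exclude):
        return False
    return True


names = ["report.pdf", "draft.pdf", "notes.txt", "draft.txt", "data.csv"]
selected = [name for name in names
            if is_selected(name, include=["*.pdf", "*.txt"], exclude=["draft.*"])]
print(selected)  # ['report.pdf', 'notes.txt']

# The '.*' pattern matches any name that starts with a dot.
hidden = [name for name in [".env", ".gitignore", "main.py"]
          if not is_selected(name, exclude=[".*"])]
print(hidden)  # ['.env', '.gitignore']
```

Under this reading, exclude patterns win over include patterns, which is why `draft.pdf` is dropped even though it matches `*.pdf`.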

## 11. Custom Filter Functions

Filter based on any criteria:

In [None]:
def filter_small_files(node):
    """Only include files smaller than 20 bytes"""
    if node.isdir:
        return True  # Include directories
    return node.size < 20

# Create files of various sizes
sized_dir = storage.node('source:sized')
sized_dir.mkdir()

sized_dir.child('small.txt').write_text('small')  # 5 bytes
sized_dir.child('medium.txt').write_text('medium content here')  # 19 bytes
sized_dir.child('large.txt').write_text('large content here with more text')  # 33 bytes

# Copy with size filter
small_only = storage.node('dest:small_files')
sized_dir.copy(small_only, filter=filter_small_files)

print("Small files only:")
for child in small_only.children():
    print(f"  - {child.basename} ({child.size} bytes)")

## 12. Performance Comparison

Compare different skip strategies:

In [None]:
import time

# Create test data
test_dir = storage.node('source:perf_test')
test_dir.mkdir()

for i in range(50):
    test_dir.child(f'file_{i}.txt').write_text(f'Content {i}' * 10)

# Initial copy
dest_dir = storage.node('dest:perf_test')
test_dir.copy(dest_dir)

# Test different strategies
strategies = ['never', 'exists', 'size', 'hash']
results = {}

for strategy in strategies:
    # Change one file before each run so every strategy has one changed file to handle
    test_dir.child('file_25.txt').write_text(f'Modified for {strategy}!')
    start = time.perf_counter()
    test_dir.copy(dest_dir, skip=strategy)
    results[strategy] = time.perf_counter() - start

print("Performance comparison (50 files, 1 changed):")
for strategy, elapsed in results.items():
    print(f"  {strategy:10s}: {elapsed*1000:.2f}ms")

print("\n✓ 'exists' is typically fastest for skip scenarios")

## 13. Incremental Backup Pattern

A complete incremental backup implementation:

In [None]:
def incremental_backup(source_dir, backup_dir, strategy='hash'):
    """Perform incremental backup with statistics"""
    stats = {
        'copied': 0,
        'skipped': 0,
        'bytes_copied': 0,
        'bytes_skipped': 0
    }
    
    def on_copy(src, dst):
        stats['copied'] += 1
        stats['bytes_copied'] += src.size
        
    def on_skip(src, dst):
        stats['skipped'] += 1
        stats['bytes_skipped'] += src.size
    
    source_dir.copy(backup_dir, 
                   skip=strategy,
                   on_file=on_copy,
                   on_skip=on_skip)
    
    return stats

# Create data
data = storage.node('source:important_data')
data.mkdir()
for i in range(20):
    data.child(f'data_{i}.txt').write_text(f'Important data {i}' * 5)

# First backup
backup = storage.node('dest:backup')
print("First backup:")
stats1 = incremental_backup(data, backup, 'hash')
print(f"  Copied: {stats1['copied']} files, {stats1['bytes_copied']} bytes")

# Modify some files
data.child('data_5.txt').write_text('Modified')
data.child('data_10.txt').write_text('Modified')

# Incremental backup
print("\nIncremental backup:")
stats2 = incremental_backup(data, backup, 'hash')
print(f"  Copied: {stats2['copied']} files, {stats2['bytes_copied']} bytes")
print(f"  Skipped: {stats2['skipped']} files, {stats2['bytes_skipped']} bytes")
print(f"\n✓ Saved {stats2['bytes_skipped']} bytes by skipping")
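
The same pattern can be tried outside the library on a real directory tree. The following is a standard-library sketch, not the genro_storage implementation: `backup_tree` and `same_content` are hypothetical helpers, and the hash comparison reads whole files into memory, which is only reasonable for small files.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def same_content(a: Path, b: Path) -> bool:
    """Compare two files by size first, then by MD5 digest."""
    if a.stat().st_size != b.stat().st_size:
        return False
    return (hashlib.md5(a.read_bytes()).digest()
            == hashlib.md5(b.read_bytes()).digest())


def backup_tree(source_dir: Path, backup_dir: Path) -> dict:
    """Copy new or changed files from source_dir to backup_dir; return counts."""
    stats = {"copied": 0, "skipped": 0}
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        dst = backup_dir / src.relative_to(source_dir)
        if dst.exists() and same_content(src, dst):
            stats["skipped"] += 1
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        stats["copied"] += 1
    return stats


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp, "source")
    backup = Path(tmp, "backup")
    source.mkdir()
    for i in range(5):
        (source / f"data_{i}.txt").write_text(f"Important data {i}")

    first = backup_tree(source, backup)       # everything is new
    (source / "data_2.txt").write_text("Modified")
    second = backup_tree(source, backup)      # only the changed file is copied

print(first, second)
```

The second run copies one file and skips the other four, mirroring the copied/skipped statistics gathered by `incremental_backup` above.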

## 14. Try It Yourself! 🎯

**Exercise 1:** Create a sync function that copies new/modified files and deletes removed ones:

In [None]:
def sync_directories(source, dest):
    """
    Sync source to dest:
    - Copy new/modified files
    - Delete files not in source
    """
    # Your code here
    pass

**Exercise 2:** Implement a smart backup that keeps only last N versions:

In [None]:
def rotating_backup(source_dir, backup_base, max_versions=3):
    """
    Create timestamped backup and keep only last N versions.
    backup_base/2024-01-15_10-30/
    backup_base/2024-01-15_14-20/
    etc.
    """
    # Your code here
    pass

**Exercise 3:** Create a deduplication function using hash comparison:

In [None]:
def find_duplicates(directory):
    """
    Find duplicate files in directory by hash.
    Return dict: {hash: [node1, node2, ...]}
    """
    # Your code here
    pass

## Summary

You've mastered copy optimization:

- ✓ Skip strategies (never, exists, size, hash)
- ✓ Custom skip functions
- ✓ File filtering (include/exclude)
- ✓ Progress tracking with callbacks
- ✓ Custom filter functions
- ✓ Performance considerations
- ✓ Incremental backup patterns

## Skip Strategy Guide

| Strategy | Speed | Safety | Use When |
|----------|-------|--------|----------|
| `never` | No checks, but copies every file | N/A | Always overwrite |
| `exists` | Fastest | Low | First-time sync |
| `size` | Fast | Medium | Quick incremental |
| `hash` | Slow | High | Critical data |
| Custom | Varies | Custom | Special logic |

## Best Practices

- **Development**: Use `exists` for speed
- **Production**: Use `size` for balance
- **Critical data**: Use `hash` for correctness
- **Monitor**: Always use callbacks for large operations
- **Filter**: Exclude unnecessary files early

## What's Next?

Continue to:

- **[06_versioning.ipynb](06_versioning.ipynb)** - S3 versioning features
- **[07_advanced_features.ipynb](07_advanced_features.ipynb)** - Advanced integrations

Happy optimizing! 🚀