Professional Python utility for real-time directory synchronization with delta updates, conflict resolution, and .gitignore-style filtering

dilenshah23/file-sync-py

File Sync Utility

A professional-grade Python utility for watching a directory and syncing its changes to a destination directory (local or a mounted remote location), with delta updates, conflict resolution, and .gitignore-style filtering.

Features

Core Functionality

  • Real-time Directory Watching: Monitors source directory for file changes
  • Delta Updates: Only syncs files that have changed (using SHA-256 hashing)
  • Conflict Resolution: Multiple strategies for handling file conflicts
  • Smart Filtering: .gitignore-style pattern matching for excluding files
  • Deletion Syncing: Optional synchronization of file deletions
  • Metadata Tracking: Persistent storage of file states for efficient syncing

Technical Highlights

  • Hash-based Change Detection: Uses SHA-256 for reliable delta detection
  • Event-driven Architecture: Watchdog-based file system monitoring
  • Efficient Large File Handling: Chunked reading for memory efficiency
  • Configurable Debouncing: Prevents redundant syncs from rapid changes
  • Comprehensive Logging: File and console logging for audit trails

Installation

# Install dependencies
pip install -r requirements.txt

# Or install watchdog directly
pip install watchdog

Quick Start

One-time Sync

python file_sync_utility.py /path/to/source /path/to/destination

Watch Mode (Real-time Sync)

python file_sync_utility.py /path/to/source /path/to/destination --watch

With Deletion Sync

python file_sync_utility.py /path/to/source /path/to/destination --watch --sync-deletions

Usage Examples

Basic Examples

1. Simple one-time sync:

python file_sync_utility.py ./my_project ./backup

2. Watch mode with conflict resolution:

python file_sync_utility.py ./source ./dest --watch --conflict newest

3. Sync with custom ignore file:

python file_sync_utility.py ./source ./dest --ignore-file .mysyncignore

4. Non-recursive sync (top-level only):

python file_sync_utility.py ./source ./dest --no-recursive

Advanced Examples

5. Full production setup:

python file_sync_utility.py \
    /home/user/projects \
    /mnt/backup/projects \
    --watch \
    --sync-deletions \
    --conflict newest \
    --check-interval 2

6. Interactive conflict resolution:

python file_sync_utility.py ./source ./dest --watch --conflict prompt

Command-Line Arguments

positional arguments:
  source                Source directory to watch
  destination           Destination directory for syncing

optional arguments:
  -h, --help            show this help message and exit
  --watch, -w           Watch for changes (default: one-time sync)
  --ignore-file, -i     Path to ignore patterns file (default: .syncignore)
  --conflict, -c        Conflict resolution strategy
                        Choices: source, destination, newest, prompt
                        Default: newest
  --sync-deletions, -d  Delete files from destination that don't exist in source
  --no-recursive, -nr   Don't sync subdirectories
  --check-interval, -t  Check interval in seconds for watch mode (default: 1)
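
The help text above could be produced by an argparse setup along the following lines. This is a minimal sketch rather than the project's actual parser: the function name `build_parser` is hypothetical, and treating `--check-interval` as a float is an assumption. The flags, choices, and defaults are taken from the listing above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Watch a directory and sync changes to a destination."
    )
    parser.add_argument("source", help="Source directory to watch")
    parser.add_argument("destination", help="Destination directory for syncing")
    parser.add_argument("--watch", "-w", action="store_true",
                        help="Watch for changes (default: one-time sync)")
    parser.add_argument("--ignore-file", "-i", default=".syncignore",
                        help="Path to ignore patterns file")
    parser.add_argument("--conflict", "-c", default="newest",
                        choices=["source", "destination", "newest", "prompt"],
                        help="Conflict resolution strategy")
    parser.add_argument("--sync-deletions", "-d", action="store_true",
                        help="Delete destination files missing from source")
    parser.add_argument("--no-recursive", "-nr", action="store_true",
                        help="Don't sync subdirectories")
    # Assumption: the interval is parsed as a float number of seconds.
    parser.add_argument("--check-interval", "-t", type=float, default=1.0,
                        help="Check interval in seconds for watch mode")
    return parser
```

Calling `build_parser().parse_args(["./source", "./dest", "--watch"])` returns a namespace with `watch=True` and every other option at its documented default.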

Conflict Resolution Strategies

1. source (Recommended for one-time sync)

Always use the source file, overwriting destination.

python file_sync_utility.py ./source ./dest --conflict source

2. destination

Keep destination file, don't overwrite.

python file_sync_utility.py ./source ./dest --conflict destination

3. newest (Recommended for watch mode)

Compare modification times and keep the newest file.

python file_sync_utility.py ./source ./dest --conflict newest --watch
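
A modification-time comparison of this kind might look like the sketch below. The function name `resolve_newest` is hypothetical, and the tie-breaking rule (keep the destination when both timestamps are equal) is an assumption, not something the README specifies.

```python
import os


def resolve_newest(source_path: str, dest_path: str) -> str:
    """Return 'source' if the source file is strictly newer, else 'destination'."""
    source_mtime = os.path.getmtime(source_path)
    dest_mtime = os.path.getmtime(dest_path)
    # Assumption: on equal timestamps the destination is left untouched.
    return "source" if source_mtime > dest_mtime else "destination"
```

Modification times are only as reliable as the clocks and filesystems involved, so this strategy is best suited to locations that share a clock.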

4. prompt

Ask user for each conflict (interactive mode).

python file_sync_utility.py ./source ./dest --conflict prompt

Prompt options:

  • s or source: Use source file
  • d or destination: Keep destination file
  • k or keep: Keep both (destination renamed with timestamp)
  • i or skip: Skip this file
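
For the keep option, the destination file needs a new name that carries a timestamp. The sketch below shows one way to build such a name. The helper `backup_name` and the `name.YYYYMMDD_HHMMSS.ext` layout are hypothetical; the README does not state the actual naming scheme.

```python
import os
import time
from typing import Optional


def backup_name(dest_path: str, timestamp: Optional[float] = None) -> str:
    """Return dest_path with a local-time stamp inserted before the extension."""
    root, ext = os.path.splitext(dest_path)
    # time.localtime(None) uses the current time.
    stamp = time.strftime("%Y%m%d_%H%M%S", time.localtime(timestamp))
    return f"{root}.{stamp}{ext}"
```

The destination file could then be renamed to that path with `os.replace` before the source file is copied into place.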

Ignore Patterns (.syncignore)

The utility supports .gitignore-style pattern matching for excluding files and directories.

Pattern Syntax

Create a .syncignore file in your source directory:

# Comments start with #

# Ignore all Python cache
__pycache__/
*.pyc

# Ignore specific directories
node_modules/
.git/
dist/

# Ignore file types
*.log
*.tmp

# Negation (include despite other rules)
!important.log

# Match anywhere in path
*.txt

# Match from root only
/root_only.txt

# Directory-only patterns
temp/
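
To make the rules above concrete, here is a simplified matcher built on the standard `fnmatch` module. It is a sketch under stated assumptions, not the project's `IgnorePatternMatcher`: the function name `is_ignored` is hypothetical, paths are assumed to be relative file paths with `/` separators, the last matching rule wins, and `*` may match across `/` in anchored patterns (real .gitignore semantics are stricter).

```python
import fnmatch
from typing import List


def is_ignored(rel_path: str, patterns: List[str]) -> bool:
    """Simplified .gitignore-style check for a relative file path."""
    parts = rel_path.split("/")
    ignored = False
    for raw in patterns:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        negate = line.startswith("!")
        if negate:
            line = line[1:]
        if line.endswith("/"):
            # Directory-only pattern: match any parent directory component.
            name = line.strip("/")
            hit = any(fnmatch.fnmatchcase(p, name) for p in parts[:-1])
        elif line.startswith("/"):
            # Anchored pattern: match the whole path from the root.
            hit = fnmatch.fnmatchcase(rel_path, line[1:])
        else:
            # Unanchored pattern: match any single path component.
            hit = any(fnmatch.fnmatchcase(p, line) for p in parts)
        if hit:
            ignored = not negate  # the last matching rule wins
    return ignored
```

With the patterns `*.log` and `!important.log`, `is_ignored("app.log", ...)` is true while `is_ignored("important.log", ...)` is false, because the negation rule comes later and overrides the earlier match.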

Pattern Examples

# Python development
__pycache__/
*.py[cod]
.Python
venv/
*.egg-info/

# Node.js
node_modules/
npm-debug.log

# IDE
.vscode/
.idea/
*.swp

# OS files
.DS_Store
Thumbs.db

# Build artifacts
dist/
build/
*.o
*.exe

How It Works

Delta Update Algorithm

  1. Initial Scan: On startup, scans source directory and calculates SHA-256 hashes
  2. Metadata Comparison: Compares current hashes with stored metadata
  3. Change Detection: Only syncs files where hashes differ
  4. Conflict Resolution: Applies configured strategy when both files changed
  5. Metadata Update: Stores new hashes for future comparisons
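
Steps 2 and 3 amount to a dictionary comparison. The sketch below assumes both the fresh scan and the stored metadata are reduced to a mapping from relative path to hash; the function names are hypothetical. `find_deleted` shows how the same comparison could drive the optional deletion sync.

```python
from typing import Dict, List


def find_changed(current: Dict[str, str], stored: Dict[str, str]) -> List[str]:
    """Return paths that are new or whose hash differs from the stored one."""
    return sorted(path for path, digest in current.items()
                  if stored.get(path) != digest)


def find_deleted(current: Dict[str, str], stored: Dict[str, str]) -> List[str]:
    """Return paths recorded at the last sync that no longer exist in source."""
    return sorted(set(stored) - set(current))
```

Only the paths returned by `find_changed` need to be copied, which is what keeps repeat syncs cheap when most files are untouched.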

File System Watching

Source Directory Changes
        ↓
Watchdog Event Detected
        ↓
Debounce Check (1 second default)
        ↓
Hash Calculation
        ↓
Metadata Comparison
        ↓
Conflict Resolution (if needed)
        ↓
File Copy with Metadata Preservation
        ↓
Metadata Update
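
The debounce step in the flow above can be sketched as a small per-path timer. The class name `Debouncer` is hypothetical, and this is a leading-edge variant (handle the first event, drop repeats inside the interval), which is an assumption about how the utility debounces. A trailing-edge variant would instead wait for a quiet period before syncing.

```python
import time
from typing import Dict, Optional


class Debouncer:
    """Drop repeated events for the same path within `interval` seconds."""

    def __init__(self, interval: float = 1.0) -> None:
        self.interval = interval
        self._last_handled: Dict[str, float] = {}

    def should_process(self, path: str, now: Optional[float] = None) -> bool:
        # `now` is injectable so the logic can be tested without sleeping.
        now = time.monotonic() if now is None else now
        last = self._last_handled.get(path)
        if last is not None and now - last < self.interval:
            return False  # too soon after the last handled event
        self._last_handled[path] = now
        return True
```

An event handler would call `should_process(event_path)` first and only hash and copy the file when it returns true.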

Conflict Detection

A conflict occurs when:

  1. File exists in both source and destination
  2. Both files have changed since last sync
  3. File hashes don't match

The utility detects this by comparing:

  • Current source hash vs. stored metadata
  • Current destination hash vs. stored metadata
  • Source hash vs. destination hash
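
The three comparisons above combine into a single predicate. In this sketch the function name `is_conflict` is hypothetical, `None` stands for "no file" or "no stored record", and treating differing files with no sync record as a conflict is an assumption.

```python
from typing import Optional


def is_conflict(source_hash: str,
                dest_hash: Optional[str],
                stored_hash: Optional[str]) -> bool:
    """Return True when source and destination both changed since last sync."""
    if dest_hash is None:
        return False  # nothing at the destination to conflict with
    if source_hash == dest_hash:
        return False  # contents already identical
    if stored_hash is None:
        # Assumption: files differ and there is no sync record, so be cautious.
        return True
    return source_hash != stored_hash and dest_hash != stored_hash
```

When only the source hash differs from the stored one, the change is an ordinary update and no resolution strategy is needed.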

Technical Details

File Hashing

  • Algorithm: SHA-256
  • Chunk Size: 4096 bytes (efficient for large files)
  • Memory Usage: O(1) regardless of file size
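
Chunked hashing with these parameters can be sketched as follows. The function name `hash_file` echoes the call shown later in this README, but the signature here is an assumption.

```python
import hashlib


def hash_file(path: str, chunk_size: int = 4096) -> str:
    """Return the SHA-256 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because only one chunk is held in memory at a time, a multi-gigabyte file costs no more memory than a small one.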

Metadata Storage

  • Format: JSON
  • Location: .file_sync_metadata.json in source directory
  • Contents: File paths, hashes, sizes, modification times
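
A JSON-backed store of this shape might be loaded and saved as in the sketch below. The function names and the per-file record layout are hypothetical. Falling back to an empty dictionary on unreadable JSON mirrors the "starts fresh" behaviour described under Error Handling, and the write-then-rename step is an added assumption to avoid leaving a half-written file behind.

```python
import json
import os
from typing import Any, Dict

METADATA_FILE = ".file_sync_metadata.json"


def load_metadata(source_dir: str) -> Dict[str, Any]:
    """Load stored file states, or start fresh if missing or corrupt."""
    path = os.path.join(source_dir, METADATA_FILE)
    try:
        with open(path, "r", encoding="utf-8") as handle:
            data = json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    return data if isinstance(data, dict) else {}


def save_metadata(source_dir: str, metadata: Dict[str, Any]) -> None:
    """Write metadata to a temporary file, then rename it into place."""
    path = os.path.join(source_dir, METADATA_FILE)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(metadata, handle, indent=2)
    os.replace(tmp_path, path)
```

A record could look like `{"file1.txt": {"hash": "...", "size": 12, "mtime": 1700000000.0}}`, though the real field names may differ.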

Performance Optimizations

  • Chunked file reading for memory efficiency
  • Event debouncing (1 second default) prevents redundant syncs
  • Hash-based change detection avoids unnecessary file copies
  • Efficient directory scanning with early termination

Logging

  • File Log: file_sync.log (persistent)
  • Console Log: Real-time status updates
  • Log Levels: INFO, WARNING, ERROR
  • Rotation: Manual (can be extended with logging.handlers)
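
As a sketch of the `logging.handlers` extension mentioned above, the setup below writes to a size-rotated file and to the console. The function name `build_logger`, the logger name `file_sync`, the message format, and the size limits are all hypothetical choices.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(log_path: str = "file_sync.log",
                 max_bytes: int = 1_000_000,
                 backups: int = 3) -> logging.Logger:
    """Return a logger that writes to a rotating file and to the console."""
    logger = logging.getLogger("file_sync")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeat calls
        formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        file_handler = RotatingFileHandler(
            log_path, maxBytes=max_bytes, backupCount=backups
        )
        file_handler.setFormatter(formatter)
        console_handler = logging.StreamHandler()
        console_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
        logger.addHandler(console_handler)
    return logger
```

With these settings the log rolls over at roughly 1 MB and keeps three older files alongside the current one.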

Use Cases

1. Development Backup

Automatically back up your working directory to a remote location:

python file_sync_utility.py ~/projects /mnt/backup/projects --watch --sync-deletions

2. Cloud Sync Preparation

Prepare files for manual cloud upload (excluding large/temp files):

python file_sync_utility.py ~/documents ~/cloud_staging --ignore-file .cloudignore

3. Build Artifact Distribution

Copy build outputs to distribution directory:

python file_sync_utility.py ./dist ./releases --conflict source --sync-deletions

4. Remote Development

Sync local changes to a remote development server that is mounted locally:

python file_sync_utility.py ./local_project /mnt/remote_server/project --watch --conflict newest

5. Multi-location Sync

Keep multiple working directories synchronized:

# Terminal 1
python file_sync_utility.py ~/workspace/project /mnt/location1/project --watch

# Terminal 2
python file_sync_utility.py ~/workspace/project /mnt/location2/project --watch

Architecture

Class Structure

FileSyncUtility (Main Controller)
    ├── FileSyncEngine (Core Sync Logic)
    │   ├── FileHasher (Hash Calculation)
    │   ├── IgnorePatternMatcher (Filter Logic)
    │   ├── ConflictResolver (Conflict Handling)
    │   └── MetadataStore (State Persistence)
    └── Observer (Watchdog)
        └── FileSyncEventHandler (Event Processing)

Data Flow

File Change → Event → Debounce → Hash → Metadata Check → 
Conflict Resolution → Copy → Metadata Update → Log

Error Handling

The utility handles various error scenarios:

  • Missing source directory: Exits with error message
  • Permission errors: Logs error, continues with other files
  • Hash calculation failures: Logs error, skips file
  • I/O errors during copy: Logs error, continues syncing
  • Metadata corruption: Starts fresh with empty metadata
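
The "log the error and continue" behaviour for per-file failures can be sketched as a loop that catches `OSError` for each copy. The function name `copy_all` and the logger name are hypothetical; `shutil.copy2` is used because the README mentions it for the copy step.

```python
import logging
import shutil
from typing import Iterable, List, Tuple

logger = logging.getLogger("file_sync")


def copy_all(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Copy each (source, destination) pair; return the sources that failed."""
    failed: List[str] = []
    for source_path, dest_path in pairs:
        try:
            shutil.copy2(source_path, dest_path)
        except OSError as exc:  # covers PermissionError and FileNotFoundError
            logger.error("Failed to copy %s: %s", source_path, exc)
            failed.append(source_path)
    return failed
```

Returning the failed paths lets the caller leave their metadata untouched so they are retried on the next sync.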

Limitations

  1. One-way Sync: Only syncs from source to destination
  2. Local Filesystem: Designed for local/mounted filesystems (not native cloud APIs)
  3. No Encryption: Files copied without encryption (add separately if needed)
  4. No Compression: Files copied as-is (add archive step if needed)
  5. Platform-specific Paths: Uses OS-native path separators

Extending the Utility

Custom Conflict Resolution

Add new resolution strategy:

@staticmethod
def _resolve_custom(source_path: str, dest_path: str) -> str:
    # Your custom logic here
    return 'source'  # or 'destination' or 'skip'

Custom Hash Algorithm

Change hash algorithm:

# At the call site that uses FileHasher (for example, in FileSyncEngine)
source_hash = self.hasher.hash_file(source_path, algorithm='md5')
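
One way to support an `algorithm` argument like the one above is to build the digest object by name with `hashlib.new`. The helper `make_digest` is hypothetical and only illustrates the lookup. Note that MD5 is faster but not collision-resistant, so it suits change detection only when tampering is not a concern.

```python
import hashlib


def make_digest(algorithm: str = "sha256"):
    """Return a new hashlib digest object for the named algorithm."""
    if algorithm not in hashlib.algorithms_available:
        raise ValueError(f"Unsupported hash algorithm: {algorithm!r}")
    return hashlib.new(algorithm)
```

The returned object exposes the usual `update()` and `hexdigest()` methods, so the chunked reading loop does not need to change.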

Add File Encryption

Wrap the copy operation:

# In FileSyncEngine._sync_file
shutil.copy2(source_path, dest_path)
encrypt_file(dest_path)  # Your encryption function

Testing

Create Test Environment

# Create test directories
mkdir -p test_source test_dest

# Create test files
echo "Test content" > test_source/file1.txt
echo "Another file" > test_source/file2.txt
mkdir test_source/subdir
echo "Nested file" > test_source/subdir/file3.txt

# Create ignore file
cat > test_source/.syncignore << EOF
*.log
temp/
EOF

# Run sync
python file_sync_utility.py test_source test_dest --watch

Test Scenarios

  1. Basic sync: Create/modify files, verify they sync
  2. Ignore patterns: Create ignored files, verify they don't sync
  3. Conflict resolution: Modify same file in both locations
  4. Deletion sync: Delete source file, verify destination deletion
  5. Large files: Test with files > 1GB
  6. Rapid changes: Edit files quickly, verify debouncing

Performance Tips

  1. Adjust check interval: Increase for less CPU usage

    --check-interval 5  # Check every 5 seconds

  2. Use ignore patterns: Exclude unnecessary files

    *.log
    __pycache__/
    node_modules/

  3. Disable deletion sync: If not needed

    # Don't use --sync-deletions flag

  4. One-time sync: Use for large initial syncs

    # Do initial sync without watch
    python file_sync_utility.py source dest
    # Then start watching
    python file_sync_utility.py source dest --watch

Troubleshooting

Issue: Files not syncing

Check:

  1. The file does not match an ignore pattern
  2. The source directory path is correct
  3. The destination directory is writable
  4. file_sync.log contains no errors

Issue: Too many sync operations

Solutions:

  1. Increase check interval: --check-interval 5
  2. Add more ignore patterns
  3. Check for infinite loop (syncing to subdirectory of source)
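
The infinite-loop case in item 3 can be checked up front by testing whether the destination resolves to a path inside the source. The sketch below uses `pathlib`; the function name `is_inside` is hypothetical, and whether the utility performs such a check itself is not stated.

```python
from pathlib import Path


def is_inside(child: str, parent: str) -> bool:
    """Return True if `child` is `parent` itself or lies anywhere beneath it."""
    child_path = Path(child).resolve()
    parent_path = Path(parent).resolve()
    return child_path == parent_path or parent_path in child_path.parents
```

If `is_inside(destination, source)` is true, every copy lands inside the watched tree and triggers a new event, so either move the destination or add it to the ignore patterns.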

Issue: Conflicts not resolving

Check:

  1. Conflict strategy setting: --conflict <strategy>
  2. File permissions
  3. file_sync.log for conflict messages

Issue: High memory usage

Possible causes:

  1. Very large files (unlikely on their own, since hashing reads chunks rather than the entire file)
  2. Too many files being watched
  3. Metadata file corruption

Solutions:

  1. Split large directories
  2. Use ignore patterns to exclude large files
  3. Delete .file_sync_metadata.json and restart

Contributing

To extend this utility:

  1. Add new conflict resolution strategies in ConflictResolver
  2. Implement additional hash algorithms in FileHasher
  3. Add pattern matching features in IgnorePatternMatcher
  4. Extend metadata storage in MetadataStore
  5. Add new event handlers in FileSyncEventHandler

License

This is a portfolio project demonstrating Python development skills including:

  • File I/O and system operations
  • Event-driven programming with watchdog
  • Hash algorithms for change detection
  • Configuration management
  • Command-line interfaces with argparse
  • Professional logging
  • Object-oriented design patterns

Author

Created as a portfolio project demonstrating proficiency in:

  • Python 3 development
  • File system operations
  • Hash algorithms (SHA-256)
  • Event-driven architecture
  • Configuration management
  • Error handling and logging
  • CLI tool development
  • Object-oriented programming

Skills Demonstrated

  • ✅ watchdog library for file system monitoring
  • ✅ File I/O with chunked reading for efficiency
  • ✅ Hashing algorithms (SHA-256) for change detection
  • ✅ Configuration files (JSON for metadata)
  • ✅ Pattern matching (.gitignore-style)
  • ✅ Conflict resolution strategies
  • ✅ Delta updates (only sync changes)
  • ✅ Professional logging
  • ✅ Command-line interfaces
  • ✅ Error handling and resilience
  • ✅ Object-oriented design
  • ✅ Type hints and documentation
