Enterprise-Grade Batch Processing and Web Automation Solution
A production-ready Python automation framework demonstrating advanced Selenium WebDriver patterns, sophisticated error recovery, real-time progress monitoring, and enterprise-level batch processing capabilities.
This framework showcases professional automation engineering practices for large-scale web document processing. Built with maintainability, reliability, and scalability in mind, it demonstrates patterns applicable to enterprise automation challenges.
- Batch Processing: Process hundreds or thousands of records from Excel with intelligent queueing
- Advanced Error Recovery: Multi-phase retry logic with automatic reprocessing of failures
- Real-time Monitoring: 30-minute checkpoint reporting with comprehensive statistics
- Robust Error Handling: Graceful degradation with detailed error tracking and audit trails
- Multi-environment Support: Configurable environments (production, staging, QA, dev)
- Page Object Model: Clean, maintainable architecture following industry best practices
- Quality Validation: Automatic detection and retry of incomplete/corrupted downloads
- Audit Trail Generation: CSV export of processed records with URLs for compliance
- Separation of test logic from page interactions
- Reusable, maintainable page components
- Encapsulation of element locators and actions
page_objects/
├── base_page.py      # Common web automation utilities
├── search_page.py    # Search functionality abstraction
└── document_page.py  # Document interaction patterns
- Automatic resource cleanup
- Exception-safe browser lifecycle management
- Professional resource handling
- Multi-environment configuration
- Dynamic URL routing
- Flexible driver initialization
- Multiple processing modes (download vs. open)
- Configurable language preferences
- Pluggable timeout strategies
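The automatic resource cleanup and exception-safe browser lifecycle listed above are commonly implemented with a context manager. The sketch below shows one way to do that and is not the framework's actual code: `managed_driver` and `FakeDriver` are hypothetical names, and the fake stands in for a Selenium WebDriver (whose `quit()` method ends the browser session) so the example runs without a browser.

```python
from contextlib import contextmanager


class FakeDriver:
    """Hypothetical stand-in for a Selenium WebDriver; only quit() is modeled."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


@contextmanager
def managed_driver(factory):
    """Create a driver via `factory` and guarantee quit() runs on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and when the body of the `with` block raises.
        driver.quit()


if __name__ == "__main__":
    with managed_driver(FakeDriver) as drv:
        print("closed inside block:", drv.closed)
    print("closed after block:", drv.closed)
```

Passing a factory rather than a driver instance keeps creation and cleanup in one place, so a real `webdriver.Chrome` constructor could presumably be swapped in for `FakeDriver`.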
python-selenium-patterns/
├── README.md                       # Main documentation (you are here)
├── main.py                         # CLI entry point
├── config.py                       # Centralized configuration
├── automation_base.py              # Base automation infrastructure
├── requirements.txt                # Python dependencies
├── run_tests.sh                    # Test runner script
│
├── documentation/                  # Project documentation
│   ├── ARCHITECTURE.md             # Architecture guide
│   ├── QUICKSTART.md               # Quick start guide
│   ├── PROJECT_STRUCTURE.md        # Detailed structure
│   ├── TESTING_GUIDE.md            # Testing documentation
│   ├── FINAL_PROJECT_SUMMARY.md    # Project summary
│   ├── TRANSFER_BOT_SUMMARY.md     # Transfer bot details
│   └── COMPLETED_TRANSFER_BOT.md   # Transfer bot completion
│
├── input/                          # Sample input files
│   ├── sample_input.xlsx           # Document bot example
│   └── sample_transfer_input.xlsx  # Transfer bot example
│
├── bots/                           # Specialized automation bots
│   ├── __init__.py
│   ├── bot_document_processor.py   # Document processing bot
│   ├── bot_transfer.py             # Transfer bot
│   ├── README.md                   # Bot documentation
│   └── TRANSFER_BOT_README.md      # Transfer bot guide
│
├── page_objects/                   # Page Object Model components
│   ├── __init__.py
│   ├── base_page.py
│   ├── search_page.py
│   ├── document_page.py
│   └── transfer_page.py
│
├── tests/                          # Comprehensive test suite (95+ tests)
│   ├── __init__.py
│   ├── conftest.py                 # Shared fixtures
│   ├── test_data_validation.py     # Data validation tests
│   ├── test_success_scenarios.py   # Success path tests
│   ├── test_failure_scenarios.py   # Error handling tests
│   ├── test_infrastructure.py      # Infrastructure tests
│   ├── test_bot_transfer.py        # Transfer bot tests
│   ├── README.md                   # Test documentation
│   └── TEST_SUMMARY.md             # Test statistics
│
├── drivers/                        # WebDriver executables
│   └── README.txt
│
└── output/                         # Downloaded files destination
Challenge: Process large volumes of records reliably
Solution: Robust batch processing with comprehensive error tracking
- Parse Excel files with flexible column detection
- Handle missing/malformed data gracefully
- Track progress across hundreds of records
- Aggregate errors for batch reporting
# Flexible input parsing
record_order_map = self.parse_input_file(input_file)
# Handles: RECORD_ID only, or RECORD_ID + ORDER, or RECORD_ID + ORDER + TREATMENT
Challenge: Network timeouts, page load failures, incomplete downloads
Solution: Automatic two-phase processing with intelligent retry
Phase 1: Initial processing
- Process all records
- Track failures: skipped, generation errors, small files
Phase 2: Automatic retry
- Reprocess only problematic records
- Extended timeouts for difficult cases
- Exclude permanent failures (generation errors)
# Automatic retry logic
problematic_records = self.process_records(record_map)
if problematic_records:
    retry_map = {rid: record_map[rid] for rid in problematic_records}
    self.process_records(retry_map, is_retry=True)
Challenge: Long-running operations need visibility
Solution: 30-minute checkpoint reporting with detailed metrics
Features:
- Session statistics (last 30 minutes)
- Overall statistics (since start)
- ETA calculations based on actual performance
- Success rate tracking
- Throughput monitoring (records/hour)
Sample Output:
======================================================================
30-MINUTE CHECKPOINT
======================================================================
TIME: 1h 45m elapsed | Session: 30m
======================================================================
--- PROGRESS ---
Processed: 234/500 (46.8%)
Remaining: 266 records
--- SESSION STATISTICS (Last 30 minutes) ---
Total Processed: 58
  Downloaded: 52
  Skipped: 5
  Generation Failed: 1
Avg Time/Record: 31.0 seconds
--- OVERALL STATISTICS ---
Total Downloaded: 215
Total Skipped: 17
Total Generation Failed: 2
Success Rate: 91.9%
Avg Time/Record: 26.9 seconds
--- TIME ESTIMATE ---
Estimated Time Remaining: 1.9 hours
======================================================================
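As a rough illustration of how checkpoint figures like these might be derived, the sketch below computes success rate, average time per record, throughput, and ETA from a handful of counters. `RunStats` and its field names are hypothetical, and the exact formulas and rounding used by the framework are not shown in this README, so the results may differ slightly from the sample report.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Hypothetical counters for one run; field names are illustrative only."""
    total_records: int
    downloaded: int
    skipped: int
    generation_failed: int
    elapsed_seconds: float

    @property
    def processed(self) -> int:
        return self.downloaded + self.skipped + self.generation_failed

    @property
    def success_rate(self) -> float:
        """Downloaded records as a percentage of processed records."""
        return 100.0 * self.downloaded / self.processed if self.processed else 0.0

    @property
    def avg_seconds_per_record(self) -> float:
        return self.elapsed_seconds / self.processed if self.processed else 0.0

    @property
    def records_per_hour(self) -> float:
        if not self.elapsed_seconds:
            return 0.0
        return 3600.0 * self.processed / self.elapsed_seconds

    @property
    def eta_hours(self) -> float:
        """Remaining records times the observed average, in hours."""
        remaining = self.total_records - self.processed
        return remaining * self.avg_seconds_per_record / 3600.0


if __name__ == "__main__":
    # Inputs taken from the overall statistics in the sample checkpoint.
    stats = RunStats(total_records=500, downloaded=215, skipped=17,
                     generation_failed=2, elapsed_seconds=105 * 60)
    print(f"Processed: {stats.processed}/{stats.total_records}")
    print(f"Success rate: {stats.success_rate:.1f}%")
    print(f"Avg time/record: {stats.avg_seconds_per_record:.1f} s")
    print(f"Throughput: {stats.records_per_hour:.1f} records/hour")
```

Basing the ETA on the observed average, rather than a fixed per-record estimate, lets the projection adapt as the run speeds up or slows down.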
Challenge: Detect incomplete or corrupted downloads
Solution: Automated file size validation with retry
- Files < 10KB flagged as suspicious
- Automatic inclusion in retry phase
- Detailed reporting of problematic files
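A size check of this kind could look like the sketch below. It is an assumption-laden illustration rather than the framework's `_check_small_files`: the function name `check_small_files`, the `*.pdf` filter, and the reading of "10KB" as 10 * 1024 bytes are all choices made here for the example.

```python
import tempfile
from pathlib import Path

SMALL_FILE_THRESHOLD = 10 * 1024  # assumption: "10KB" means 10 * 1024 bytes


def check_small_files(output_dir, threshold=SMALL_FILE_THRESHOLD):
    """Return (filename, size_in_bytes) pairs for PDFs smaller than threshold."""
    suspicious = []
    for path in sorted(Path(output_dir).glob("*.pdf")):
        size = path.stat().st_size
        if size < threshold:
            suspicious.append((path.name, size))
    return suspicious


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "tiny.pdf").write_bytes(b"x" * 2048)
        (Path(tmp) / "normal.pdf").write_bytes(b"x" * 50_000)
        print(check_small_files(tmp))
```

A size threshold is only a heuristic: it catches truncated downloads and error pages saved as PDFs, but a legitimately short document would also be flagged and retried.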
small_files = self._check_small_files(output_dir)
# Returns: [('file1.pdf', 2048), ('file2.pdf', 5120)]
Challenge: Enterprise compliance and debugging
Solution: Multi-level logging and URL tracking
- Detailed console logging (INFO level)
- Verbose file logging (DEBUG level)
- CSV export of processed record URLs
- Session-based URL exports (every 30 min)
Exported CSV Format:
RECORD_ID,DOCUMENT_URL
1001,https://example.com/document?id=1001
1002,https://example.com/document?id=1002
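A file in this format can be produced with the standard `csv` module. The sketch below is a minimal illustration; `export_document_urls` is a hypothetical helper name, not a function confirmed to exist in the framework.

```python
import csv
import tempfile
from pathlib import Path


def export_document_urls(records, csv_path):
    """Write (record_id, url) pairs under a RECORD_ID,DOCUMENT_URL header."""
    csv_path = Path(csv_path)
    # newline="" lets the csv module control line endings itself.
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["RECORD_ID", "DOCUMENT_URL"])
        writer.writerows(records)
    return csv_path


if __name__ == "__main__":
    rows = [
        (1001, "https://example.com/document?id=1001"),
        (1002, "https://example.com/document?id=1002"),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        out = export_document_urls(rows, Path(tmp) / "document_urls_session_1.csv")
        print(out.read_text(encoding="utf-8"))
```

Using `csv.writer` instead of manual string joining means URLs containing commas or quotes are escaped correctly.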
Error Categories:
| Category | Typical cause | Retry? |
|---|---|---|
| Skipped | Navigation failures, timeouts | Yes |
| Small Files | Downloads < 10KB | Yes |
| Generation Failed | Server-side errors | No |
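The retry rules in the table above can be expressed as data, so the second phase only revisits outcomes marked as retryable. The following is a sketch under assumptions: `Outcome`, `run_two_phase`, and the `process_one(record_id, is_retry)` callback are hypothetical names introduced for illustration, not the framework's API.

```python
from enum import Enum


class Outcome(Enum):
    """Outcome labels mirroring the table; the bool marks retry eligibility."""
    DOWNLOADED = ("downloaded", False)
    SKIPPED = ("skipped", True)
    SMALL_FILE = ("small_file", True)
    GENERATION_FAILED = ("generation_failed", False)

    @property
    def retryable(self):
        return self.value[1]


def run_two_phase(record_ids, process_one):
    """Process every record once, then retry only the retryable failures.

    `process_one(record_id, is_retry)` is a hypothetical callback returning
    an Outcome. The final outcome for each record is returned in a dict.
    """
    # Phase 1: initial pass over all records.
    results = {rid: process_one(rid, False) for rid in record_ids}
    # Phase 2: one retry for records whose outcome is marked retryable.
    for rid in [r for r, outcome in results.items() if outcome.retryable]:
        results[rid] = process_one(rid, True)
    return results


if __name__ == "__main__":
    def fake_process(rid, is_retry):
        # Record 2 times out on the first pass and succeeds on retry.
        if rid == 2 and not is_retry:
            return Outcome.SKIPPED
        return Outcome.GENERATION_FAILED if rid == 3 else Outcome.DOWNLOADED

    for rid, outcome in run_two_phase([1, 2, 3], fake_process).items():
        print(rid, outcome.name)
```

Keeping the retry flag next to each category means a newly added failure type has to state its retry policy explicitly.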
Error Recovery Strategies:
- Timeouts: Graceful continuation, record tracking
- Page Load Failures: Tab cleanup, move to next record
- Network Issues: Automatic retry with extended timeouts
- File System Errors: Safe filename sanitization
Challenge: Test in staging, deploy to production
Solution: Environment abstraction with URL routing
base_urls = {
    'production': 'https://www.example.com',
    'staging': 'https://staging.example.com',
    'qa': 'https://qa.example.com',
    'dev': 'https://dev.example.com'
}
# Run in different environments
python main.py -d -f input.xlsx -o output/ -e production
python main.py -d -f input.xlsx -o output/ -e staging
- Explicit waits with custom conditions
- Multi-tab/window management
- JavaScript execution for page interactions
- Print dialog automation (PDF generation)
- Dynamic content handling
- Type hints and documentation
- Context managers for resource safety
- Logging framework usage
- Exception handling hierarchies
- Cross-platform compatibility
- SOLID Principles: Single responsibility, open/closed, dependency inversion
- DRY Principle: Reusable base classes and utilities
- Clean Code: Meaningful names, small functions, clear intent
- Error Handling: Fail gracefully, provide actionable messages
- Testing: Comprehensive test suite with 80+ tests (85%+ coverage)
- pytest Framework: Professional testing with fixtures and mocks
- Test Coverage: 80+ tests across 4 categories (data, success, failure, infra)
- Test Patterns: Unit, integration, and infrastructure testing
- CI/CD Ready: Automated test execution and coverage reporting
- Quality Metrics: 85%+ code coverage target
- Pandas for Excel manipulation
- Flexible schema handling
- Data validation and sanitization
- CSV generation and export
- Command-line interface design
- Multi-environment configuration
- Logging and monitoring
- Performance metrics collection
- Resource management
- Automated testing infrastructure
- Python: 3.7+
- Chrome Browser: Latest version
- ChromeDriver: Matching Chrome version
- Dependencies: See requirements.txt
pip install -r requirements.txt
# For running tests (optional)
pip install pytest pytest-cov pytest-mock
cd /path/to/python-selenium-patterns
pip install -r requirements.txt
- Visit: https://chromedriver.chromium.org/downloads
- Download version matching your Chrome browser
- Place in drivers/ directory
- Make executable (macOS/Linux):
chmod +x drivers/chromedriver
python main.py --help
python main.py -d -f <input_file> -o <output_dir> [options]
| Argument | Description |
|---|---|
| -d, --download | Enable document processing mode |
| -f, --file | Path to input Excel file |
| -o, --output | Output directory for downloads |

| Argument | Default | Description |
|---|---|---|
| -e, --env | production | Environment: production, staging, qa, dev |
| -m, --mode | download | Task mode: download or open |
| -l, --language | default | Language: default or english |
| -v, --verbose | off | Enable verbose logging |
| --headless | off | Run browser in headless mode |
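For readers adapting the CLI, the arguments above map naturally onto `argparse`. The sketch below is an illustration, not the contents of `main.py`: `build_parser` is a hypothetical name, and treating `-d`, `-f`, and `-o` as required is an assumption based on the first table having no Default column. `BASE_URLS` mirrors the `base_urls` mapping shown earlier in this README.

```python
import argparse

# Mirrors the base_urls mapping shown earlier in this README.
BASE_URLS = {
    "production": "https://www.example.com",
    "staging": "https://staging.example.com",
    "qa": "https://qa.example.com",
    "dev": "https://dev.example.com",
}


def build_parser():
    """Build a parser for the documented flags (required-ness is assumed)."""
    parser = argparse.ArgumentParser(
        prog="main.py", description="Batch document automation (sketch)")
    parser.add_argument("-d", "--download", action="store_true", required=True,
                        help="enable document processing mode")
    parser.add_argument("-f", "--file", required=True,
                        help="path to input Excel file")
    parser.add_argument("-o", "--output", required=True,
                        help="output directory for downloads")
    parser.add_argument("-e", "--env", choices=sorted(BASE_URLS),
                        default="production", help="target environment")
    parser.add_argument("-m", "--mode", choices=["download", "open"],
                        default="download", help="task mode")
    parser.add_argument("-l", "--language", choices=["default", "english"],
                        default="default", help="document language")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="enable verbose logging")
    parser.add_argument("--headless", action="store_true",
                        help="run browser in headless mode")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["-d", "-f", "input/sample_input.xlsx", "-o", "output/", "-e", "staging"])
    print(args.env, "->", BASE_URLS[args.env])
```

Restricting `--env` to the keys of `BASE_URLS` makes an unknown environment fail at argument parsing instead of midway through a batch.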
Download documents for all records in input file:
python main.py -d \
-f input/sample_input.xlsx \
-o output/
Full visibility into processing with confirmation prompts:
python main.py -d \
-f input/sample_input.xlsx \
-o output/ \
-e staging \
-v
Output:
============================================================
CONFIGURATION SUMMARY
============================================================
Environment: staging
Task Mode: download
Language: default
Input File: /path/to/input/sample_input.xlsx
Output Directory: /path/to/output
Verbose Logging: True
Headless Mode: False
============================================================
Configuration loaded. Press ENTER to continue or 'N' to abort:
Verify documents load without downloading:
python main.py -d \
-f input/sample_input.xlsx \
-o output/ \
-m open \
-v
Use Case: Quickly validate accessibility of 500+ documents without disk I/O
Run without GUI (perfect for CI/CD or scheduled jobs):
python main.py -d \
-f input/sample_input.xlsx \
-o output/ \
--headless
Full production run with all features:
python main.py -d \
-f production_records.xlsx \
-o /data/downloads/ \
-e production \
-m download \
-l default \
-v
Required Column: RECORD_ID
Optional Columns: ORDER, TREATMENT
| RECORD_ID | ORDER | TREATMENT |
|---|---|---|
| 1001 | 5001 | 7001 |
| 1002 | 5002 | 7002 |
| 1003 | 5003 | 7003 |
Format 1: Record ID only
RECORD_ID
1001
1002
1003
Format 2: Record ID + Order
RECORD_ID,ORDER
1001,5001
1002,5002
Format 3: Full format (all columns)
RECORD_ID,ORDER,TREATMENT
1001,5001,7001
1002,5002,7002
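The three layouts above differ only in which optional columns are present, so a parser can key on the header row. The framework reads Excel files (Pandas is listed among its dependencies); to stay dependency-free, this sketch applies the same idea to CSV text with the standard library. `parse_records` is a hypothetical name and is not the framework's `parse_input_file`.

```python
import csv
import io

OPTIONAL_COLUMNS = ("ORDER", "TREATMENT")


def parse_records(text):
    """Map each RECORD_ID to a dict of whichever optional columns are present."""
    reader = csv.DictReader(io.StringIO(text))
    headers = [h.strip().upper() for h in (reader.fieldnames or [])]
    if "RECORD_ID" not in headers:
        raise ValueError("input must contain a RECORD_ID column")
    reader.fieldnames = headers  # normalise header case and whitespace
    record_map = {}
    for row in reader:
        record_id = (row.get("RECORD_ID") or "").strip()
        if not record_id:
            continue  # skip rows with a missing record id
        record_map[record_id] = {
            col: (row.get(col) or "").strip() or None
            for col in OPTIONAL_COLUMNS if col in headers
        }
    return record_map


if __name__ == "__main__":
    print(parse_records("RECORD_ID\n1001\n1002\n"))
    print(parse_records("RECORD_ID,ORDER,TREATMENT\n1001,5001,7001\n"))
```

Empty optional cells are returned as `None` rather than empty strings, which keeps "column absent" (key missing) distinguishable from "value missing" (key present, value `None`).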
Location: Specified output directory
Naming: Record_[ID]_([ID]).pdf
Example: Record_1001_(1001).pdf
- Real-time progress updates
- 30-minute checkpoint reports
- Final statistics summary
- Error summaries
File: automation.log
Content: Detailed DEBUG-level logging
Use: Troubleshooting, audit trails
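The split between INFO on the console and DEBUG in the log file can be set up with two handlers on one logger. This is a minimal sketch using the standard `logging` module; `configure_logging`, the logger name, and the format strings are assumptions made for the example, not taken from the framework.

```python
import logging
import sys


def configure_logging(log_path="automation.log", logger_name="automation"):
    """Attach an INFO console handler and a DEBUG file handler to one logger."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)  # let the handlers do the filtering
    logger.handlers.clear()         # avoid duplicate handlers on repeated calls
    logger.propagate = False

    console = logging.StreamHandler(sys.stdout)
    console.setLevel(logging.INFO)
    console.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))

    logger.addHandler(console)
    logger.addHandler(file_handler)
    return logger
```

Calling `configure_logging()` once at startup and then `logger.debug(...)` or `logger.info(...)` elsewhere sends debug detail only to the file, while info-level progress appears in both places.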
Location: ~/Downloads/
Format: document_urls_session_[N].csv
Frequency: Every 30 minutes + final export
Columns: RECORD_ID, DOCUMENT_URL
======================================================================
FINAL RUN STATISTICS
======================================================================
TOTAL TIME: 2h 15m 32s
======================================================================
--- FINAL RESULTS ---
Total Records: 500
Processed: 500/500 (100.0%)
--- OVERALL STATISTICS ---
Total Downloaded: 472
Total Skipped: 21
Total Generation Failed: 7
Success Rate: 94.4%
Avg Time/Record: 16.2 seconds (0.27 minutes)
--- PERFORMANCE METRICS ---
Throughput: 221.3 records/hour
======================================================================
This demonstration uses google.com as a test target to showcase automation patterns without requiring proprietary systems.
To adapt this framework for actual web applications:
- Update Page Locators: Modify page_objects/*.py with actual element locators
- Configure URLs: Update base_urls in config.py
- Adjust Navigation: Implement actual navigation flows in page objects
- Add Authentication: Extend automation_base.py with login logic
- Customize Validation: Modify _is_document_page() for your URLs
# page_objects/search_page.py
from selenium.webdriver.common.by import By

from .base_page import BasePage  # assumes BasePage lives in page_objects/base_page.py


class SearchPage(BasePage):
    # Update locators for your application
    SEARCH_INPUT = (By.ID, "search-box")
    SEARCH_BUTTON = (By.XPATH, "//button[@type='submit']")

    def search_record(self, record_id):
        # Implement your actual search logic
        search_input = self.find_element(self.SEARCH_INPUT)
        search_input.send_keys(record_id)
        self.click(self.SEARCH_BUTTON)
        # Parse actual results
        return {'id': record_id, 'name': 'Result Name'}
Error: SessionNotCreatedException
Solution:
# Check Chrome version
google-chrome --version # Linux; on macOS/Windows check Chrome menu > About Google Chrome
# Download matching ChromeDriver from:
# https://chromedriver.chromium.org/downloads
Error: Permission denied: 'drivers/chromedriver'
Solution:
chmod +x drivers/chromedriver
# If still blocked on macOS, go to System Preferences > Security & Privacy > Allow
Error: ModuleNotFoundError: No module named 'selenium'
Solution:
pip install -r requirements.txt
Error: PermissionError: [Errno 13]
Solution:
# Use absolute path with write permissions
python main.py -d -f input.xlsx -o ~/Documents/output/
This framework demonstrates the developer's proficiency in:
- Production-Ready Code: Exception handling, logging, resource management
- Scalability Thinking: Batch processing, checkpoint systems, retry logic
- User Experience: Progress visibility, error messages, configuration options
- Maintainability: Clean architecture, documentation, modular design
- Problem Solving: Multi-phase retry, quality validation, audit trails
Patterns showcased here can be applied to:
- E2E testing automation
- Data migration scripts
- Web scraping projects
- Scheduled batch jobs
- Regression test suites
- CI/CD integration
The framework includes a complete test suite with 80+ tests covering:
- Data Validation (12 tests) - Input parsing, validation, quality checks
- Success Scenarios (18 tests) - Happy path execution
- Failure Scenarios (20 tests) - Error handling and recovery
- Infrastructure (30 tests) - System setup and configuration
# Install test dependencies
pip install pytest pytest-cov pytest-mock
# Run all tests
pytest tests/ -v
# Run with coverage report
pytest tests/ --cov=. --cov-report=html
# Or use the test runner script
./run_tests.sh all # All tests
./run_tests.sh coverage # With coverage report
./run_tests.sh quick # Essential tests only
See detailed test documentation:
- tests/README.md - Complete test guide
- tests/TEST_SUMMARY.md - Test statistics and overview
Expected Coverage: 85%+ code coverage
Comprehensive documentation is available in the documentation/ folder:
- ARCHITECTURE.md - System architecture and design patterns
- QUICKSTART.md - Quick start guide to get running fast
- PROJECT_STRUCTURE.md - Detailed project structure
- TESTING_GUIDE.md - Testing framework and guidelines
- FINAL_PROJECT_SUMMARY.md - Executive project summary
- TRANSFER_BOT_SUMMARY.md - Transfer bot implementation details
- COMPLETED_TRANSFER_BOT.md - Transfer bot completion summary
- bots/README.md - Bot architecture and patterns
- bots/TRANSFER_BOT_README.md - Transfer bot quick reference
- tests/README.md - Test suite documentation
- tests/TEST_SUMMARY.md - Test statistics and coverage
Potential additions to demonstrate additional skills:
- Property-based testing with Hypothesis
- Performance benchmarking tests
- Mutation testing
- Load testing scenarios
- Prometheus metrics export
- Grafana dashboard integration
- Alert notifications (email/Slack)
- Parallel processing with multiprocessing
- Distributed execution with Celery
- Queue-based architecture with Redis
- Docker containerization
- GitHub Actions workflows
- Automated deployment pipelines
This is a demonstration/portfolio project. Feel free to adapt patterns for your own use.
This framework was created to showcase enterprise automation engineering capabilities. It demonstrates real-world patterns used in production systems for processing thousands of records reliably.
Key Achievements:
- Processes 200+ records/hour
- Roughly 92-94% success rate in the sample runs shown above, with automatic retry
- Zero manual intervention required
- Complete audit trail generation
- Production-tested patterns
The patterns shown here scale from dozens to thousands of records and can be adapted for any web automation challenge requiring reliability, visibility, and maintainability.
Built with Python 3 • Selenium WebDriver • Pandas
Demonstrating Production-Ready Automation Engineering