A comprehensive system for generating realistic test datasets from business entities while maintaining data relationships and business logic integrity.
The Test Data Generation System is designed to create realistic test datasets for both functional and load testing by leveraging existing data sources (S3 files and DynamoDB tables) while maintaining data relationships and business logic integrity.
- Business Entity Management: Define and manage business entities with their data sources and relationships
- Functional Test Data Generation: Create targeted datasets for functional testing with specific criteria
- Load Test Data Generation: Generate large-scale datasets for performance testing with realistic patterns
- Data Source Integration: Support for Amazon S3 and DynamoDB data sources
- Relationship Maintenance: Automatically maintain referential integrity between related entities
- Pattern Recognition: Learn from existing data patterns to generate realistic synthetic data
- Data Anonymization: Built-in PII masking and anonymization capabilities
- Entity Registry: Central repository for business entity definitions
- Data Connectors: Pluggable connectors for various data sources
- Generation Engines: Separate engines for functional and load test data generation
- Web Interface: User-friendly web interface for system management
- REST API: Comprehensive API for programmatic access
- Python 3.8+
- pip or poetry for dependency management
- Clone the repository:

```bash
git clone <repository-url>
cd test-data-generation-system
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Start the API server:

```bash
python -m src.api.main
```

- Open the web interface by navigating to `http://localhost:52712` in your browser
- Define Business Entities: Create entity definitions with data sources and relationships
- Generate Functional Test Data: Specify criteria and generate targeted datasets
- Generate Load Test Data: Create large-scale datasets for performance testing
- Monitor Jobs: Track generation progress and download results
```yaml
entities:
  customer:
    description: "Customer business entity"
    sources:
      - type: dynamodb
        connection_params:
          region: us-east-1
          table: customers
          primary_key: customer_id
    relationships:
      - entity: order
        type: one_to_many
        foreign_key: customer_id
    schema:
      customer_id:
        type: string
        likely_identifier: true
      email:
        type: string
        likely_email: true
      name:
        type: string
```
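
If you prefer to register entities programmatically instead of through the web interface, a script along the following lines could post each definition from the YAML file to the API. The file name (`config/entities.yaml`), the payload shape, and the `name` field are assumptions for illustration, not a documented contract:

```python
# Hypothetical registration script: reads the YAML above and posts each entity
# definition to the API. File name, payload shape, and "name" field are assumptions.
import requests
import yaml

API_URL = "http://localhost:52712"

with open("config/entities.yaml") as f:
    config = yaml.safe_load(f)

for name, definition in config["entities"].items():
    response = requests.post(f"{API_URL}/entities", json={"name": name, **definition})
    response.raise_for_status()
    print(f"Registered entity: {name}")
```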
Create a `.env` file with your configuration:

```bash
# AWS Configuration
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1

# API Configuration
API_HOST=0.0.0.0
API_PORT=52712
DEBUG=true

# Generation Settings
DEFAULT_SAMPLE_SIZE=1000
MAX_GENERATION_VOLUME=1000000
```
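
The API reads these values at startup. As a rough illustration only (the project's actual settings code in `src/config/` may differ), `python-dotenv` can load the file into the environment:

```python
# Minimal sketch of reading the .env file with python-dotenv; the real settings
# module may load configuration differently.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # copies .env entries into os.environ

API_PORT = int(os.getenv("API_PORT", "52712"))
DEFAULT_SAMPLE_SIZE = int(os.getenv("DEFAULT_SAMPLE_SIZE", "1000"))
MAX_GENERATION_VOLUME = int(os.getenv("MAX_GENERATION_VOLUME", "1000000"))
```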
- `GET /entities` - List all entities
- `GET /entities/{name}` - Get entity details
- `POST /entities` - Register new entity
- `GET /entities/validate` - Validate entity graph
- `POST /entities/discover-relationships` - Auto-discover relationships

- `POST /generate/functional` - Generate functional test data
- `POST /generate/load` - Generate load test data

- `GET /jobs` - List all jobs
- `GET /jobs/{id}` - Get job details
- `GET /jobs/{id}/result` - Download job result
- `DELETE /jobs/{id}` - Delete job
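
Putting these endpoints together, a client typically submits a generation request, polls the job until it finishes, and downloads the result. The curl examples below show the request payloads in more detail; the `job_id` and `status` field names in this sketch are assumptions about the response shape:

```python
# Hypothetical end-to-end client: submit a functional generation job, poll it,
# and download the result. "job_id" and "status" are assumed response fields.
import time

import requests

API_URL = "http://localhost:52712"

payload = {
    "name": "checkout_test",
    "entities": {"customer": {"count": 50, "include_related": True}},
    "anonymize": True,
}

job = requests.post(f"{API_URL}/generate/functional", json=payload).json()
job_id = job["job_id"]  # assumed field name

while True:
    status = requests.get(f"{API_URL}/jobs/{job_id}").json().get("status")
    if status in ("completed", "failed"):
        break
    time.sleep(2)

if status == "completed":
    result = requests.get(f"{API_URL}/jobs/{job_id}/result")
    with open("checkout_test.json", "wb") as out:
        out.write(result.content)
```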
```bash
curl -X POST http://localhost:52712/generate/functional \
  -H "Content-Type: application/json" \
  -d '{
    "name": "checkout_test",
    "entities": {
      "customer": {
        "count": 50,
        "criteria": [
          {
            "field": "account_status",
            "value": "active"
          }
        ],
        "include_related": true
      }
    },
    "anonymize": true
  }'
```
```bash
curl -X POST http://localhost:52712/generate/load \
  -H "Content-Type: application/json" \
  -d '{
    "name": "peak_load_test",
    "volume": 100000,
    "entities": {
      "customer": {
        "count": 20000,
        "distribution": "normal"
      }
    },
    "relationships_maintained": true
  }'
```
```yaml
sources:
  - type: s3
    connection_params:
      region: us-east-1
      access_key_id: ${AWS_ACCESS_KEY_ID}
      secret_access_key: ${AWS_SECRET_ACCESS_KEY}
      bucket: my-data-bucket
      prefix: customer-data/
      format: json
```
```yaml
sources:
  - type: dynamodb
    connection_params:
      region: us-east-1
      access_key_id: ${AWS_ACCESS_KEY_ID}
      secret_access_key: ${AWS_SECRET_ACCESS_KEY}
      table: customers
      primary_key: customer_id
```
Run the test suite:

```bash
pytest tests/
```

Run specific test files:

```bash
pytest tests/test_entity_registry.py -v
```
```
├── src/
│   ├── api/             # REST API implementation
│   ├── core/            # Core models and types
│   ├── entities/        # Entity management
│   ├── connectors/      # Data source connectors
│   ├── generators/      # Data generation engines
│   ├── config/          # Configuration management
│   └── utils/           # Utility functions
├── tests/               # Test suite
├── config/              # Configuration files
├── web/                 # Web interface
└── requirements.txt     # Python dependencies
```
- Create a new connector class inheriting from `DataConnector` (see the sketch after this list)
- Implement the required methods: `connect()`, `disconnect()`, `get_schema()`, etc.
- Register the connector in the generation engines
- Update the entity configuration schema
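
As a starting point, a minimal connector might look like the sketch below. The import path and method signatures are assumptions inferred from the steps above rather than the project's actual interface:

```python
# Sketch of a new connector for local CSV files. The DataConnector import path and
# method signatures are assumptions, not the project's real interface.
import csv

from src.connectors.base import DataConnector  # assumed module path


class CsvFileConnector(DataConnector):
    """Hypothetical connector that serves records from a local CSV file."""

    def __init__(self, connection_params: dict):
        self.path = connection_params["path"]
        self._rows = None

    def connect(self):
        # Load the file eagerly; a production connector might stream instead.
        with open(self.path, newline="") as f:
            self._rows = list(csv.DictReader(f))

    def disconnect(self):
        self._rows = None

    def get_schema(self) -> dict:
        # Infer a flat schema from the header row; every field is reported as a string.
        fields = self._rows[0].keys() if self._rows else []
        return {field: {"type": "string"} for field in fields}
```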
- Add new pattern analysis methods in the load generator
- Implement custom data generation algorithms
- Add new distribution types and sampling strategies
- Update the API to support new parameters
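
For example, a new distribution type could be wired in roughly as follows; the generator's real extension hooks and parameter names are assumptions for illustration:

```python
# Illustrative only: one way a new "pareto" distribution type could be sampled
# alongside the existing "normal" option.
import random


def sample_counts(distribution: str, n: int, **params) -> list:
    """Return n synthetic integer values drawn from the named distribution."""
    if distribution == "normal":
        mean, stddev = params.get("mean", 100), params.get("stddev", 20)
        return [max(0, int(random.gauss(mean, stddev))) for _ in range(n)]
    if distribution == "pareto":  # hypothetical new distribution type
        alpha, scale = params.get("alpha", 2.0), params.get("scale", 10)
        return [int(random.paretovariate(alpha) * scale) for _ in range(n)]
    raise ValueError(f"Unknown distribution: {distribution}")
```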
- Data Anonymization: All sensitive data is automatically anonymized
- Access Control: Implement proper authentication and authorization
- Encryption: Use encryption for data in transit and at rest
- Audit Logging: Complete audit trail of all generation activities
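
As an illustration of the anonymization idea above (not the system's actual masking rules), hashing identifier fields keeps masked values stable so that related records still join correctly:

```python
# Rough sketch of field-level anonymization; the built-in masking may differ.
import hashlib


def mask_email(email: str) -> str:
    # Replace the local part with a stable hash so joins on email still work.
    local, _, domain = email.partition("@")
    digest = hashlib.sha256(local.encode()).hexdigest()[:10]
    return f"user_{digest}@{domain}"


print(mask_email("jane.doe@example.com"))  # user_<10-char hash>@example.com
```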
- Parallel Processing: Multi-threaded data generation for large datasets
- Caching: Pattern and metadata caching for improved performance
- Streaming: Stream processing for very large datasets
- Resource Management: Dynamic resource allocation based on workload
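
A rough sketch of the batched, multi-threaded approach (batch size, worker count, and the `generate_batch` placeholder are illustrative, not the engine's real API):

```python
# Sketch of batched, multi-threaded generation for large volumes.
from concurrent.futures import ThreadPoolExecutor


def generate_batch(batch_size: int) -> list:
    # Placeholder for one call into the generation engine.
    return [{"record": i} for i in range(batch_size)]


def generate_in_parallel(total: int, batch_size: int = 10_000, workers: int = 4) -> list:
    sizes = [batch_size] * (total // batch_size)
    if total % batch_size:
        sizes.append(total % batch_size)
    records = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in pool.map(generate_batch, sizes):
            records.extend(chunk)
    return records
```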
- Connection Errors: Verify AWS credentials and network connectivity
- Memory Issues: Reduce batch sizes for large dataset generation
- Slow Generation: Check data source performance and network latency
- Relationship Errors: Validate entity graph before generation
Enable debug logging:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

Check the API logs:

```bash
tail -f /tmp/api.log
```
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
For questions and support:
- Create an issue in the repository
- Check the documentation
- Review existing issues and discussions