Test Data Generation System

A comprehensive system for generating realistic test datasets from business entities while maintaining data relationships and business logic integrity.

Overview

The Test Data Generation System is designed to create realistic test datasets for both functional and load testing by leveraging existing data sources (S3 files and DynamoDB tables) while maintaining data relationships and business logic integrity.

Features

Core Capabilities

  • Business Entity Management: Define and manage business entities with their data sources and relationships
  • Functional Test Data Generation: Create targeted datasets for functional testing with specific criteria
  • Load Test Data Generation: Generate large-scale datasets for performance testing with realistic patterns
  • Data Source Integration: Support for Amazon S3 and DynamoDB data sources
  • Relationship Maintenance: Automatically maintain referential integrity between related entities
  • Pattern Recognition: Learn from existing data patterns to generate realistic synthetic data
  • Data Anonymization: Built-in PII masking and anonymization capabilities

Architecture Components

  • Entity Registry: Central repository for business entity definitions
  • Data Connectors: Pluggable connectors for various data sources
  • Generation Engines: Separate engines for functional and load test data generation
  • Web Interface: User-friendly web interface for system management
  • REST API: Comprehensive API for programmatic access

Quick Start

Prerequisites

  • Python 3.8+
  • pip or poetry for dependency management

Installation

  1. Clone the repository:
git clone <repository-url>
cd test-data-generation-system
  2. Install dependencies:
pip install -r requirements.txt
  3. Start the API server:
python -m src.api.main
  4. Open the web interface: Navigate to http://localhost:52712 in your browser

Basic Usage

  1. Define Business Entities: Create entity definitions with data sources and relationships
  2. Generate Functional Test Data: Specify criteria and generate targeted datasets
  3. Generate Load Test Data: Create large-scale datasets for performance testing
  4. Monitor Jobs: Track generation progress and download results

Configuration

Entity Definition Example

entities:
  customer:
    description: "Customer business entity"
    sources:
      - type: dynamodb
        connection_params:
          region: us-east-1
        table: customers
        primary_key: customer_id
    relationships:
      - entity: order
        type: one_to_many
        foreign_key: customer_id
    schema:
      customer_id:
        type: string
        likely_identifier: true
      email:
        type: string
        likely_email: true
      name:
        type: string
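
To register this definition programmatically, use the POST /entities endpoint listed in the API Reference below. The exact request body is not documented in this README, so the payload here is a hypothetical sketch that simply mirrors the YAML definition (it also assumes the requests library is available):

import requests

# Hypothetical payload: POST /entities is documented below, but its exact
# request schema is not shown here, so this mirrors the YAML definition above.
entity_definition = {
    "name": "customer",
    "description": "Customer business entity",
    "sources": [
        {
            "type": "dynamodb",
            "connection_params": {"region": "us-east-1"},
            "table": "customers",
            "primary_key": "customer_id",
        }
    ],
    "relationships": [
        {"entity": "order", "type": "one_to_many", "foreign_key": "customer_id"}
    ],
    "schema": {
        "customer_id": {"type": "string", "likely_identifier": True},
        "email": {"type": "string", "likely_email": True},
        "name": {"type": "string"},
    },
}

response = requests.post("http://localhost:52712/entities", json=entity_definition)
response.raise_for_status()
print(response.json())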

Environment Variables

Create a .env file with your configuration:

# AWS Configuration
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1

# API Configuration
API_HOST=0.0.0.0
API_PORT=52712
DEBUG=true

# Generation Settings
DEFAULT_SAMPLE_SIZE=1000
MAX_GENERATION_VOLUME=1000000
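
The API server may or may not read the .env file automatically; if you need to load it yourself in scripts or tests, python-dotenv (not listed in this README, so treat it as an optional extra) is a common choice:

import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # copies the variables from .env into the process environment

api_port = int(os.getenv("API_PORT", "52712"))
max_volume = int(os.getenv("MAX_GENERATION_VOLUME", "1000000"))
print(f"API port: {api_port}, max generation volume: {max_volume}")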

API Reference

Entity Management

  • GET /entities - List all entities
  • GET /entities/{name} - Get entity details
  • POST /entities - Register new entity
  • GET /entities/validate - Validate entity graph
  • POST /entities/discover-relationships - Auto-discover relationships

Data Generation

  • POST /generate/functional - Generate functional test data
  • POST /generate/load - Generate load test data

Job Management

  • GET /jobs - List all jobs
  • GET /jobs/{id} - Get job details
  • GET /jobs/{id}/result - Download job result
  • DELETE /jobs/{id} - Delete job

Example API Usage

Generate Functional Test Data

curl -X POST http://localhost:52712/generate/functional \
  -H "Content-Type: application/json" \
  -d '{
    "name": "checkout_test",
    "entities": {
      "customer": {
        "count": 50,
        "criteria": [
          {
            "field": "account_status",
            "value": "active"
          }
        ],
        "include_related": true
      }
    },
    "anonymize": true
  }'

Generate Load Test Data

curl -X POST http://localhost:52712/generate/load \
  -H "Content-Type: application/json" \
  -d '{
    "name": "peak_load_test",
    "volume": 100000,
    "entities": {
      "customer": {
        "count": 20000,
        "distribution": "normal"
      }
    },
    "relationships_maintained": true
  }'
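
Check Job Status and Download Results

Generation runs are tracked as jobs, and results are fetched through the Job Management endpoints. The sketch below assumes the requests library is installed; the job id and the status field names are assumptions, since the response formats are not shown in this README.

import time

import requests

BASE_URL = "http://localhost:52712"
job_id = "<job-id>"  # id of a generation job; how it is returned is not shown here

# Poll GET /jobs/{id} until the job finishes (field names are assumed).
while True:
    job = requests.get(f"{BASE_URL}/jobs/{job_id}").json()
    if job.get("status") in ("completed", "failed"):
        break
    time.sleep(5)

# Download the generated dataset via GET /jobs/{id}/result.
result = requests.get(f"{BASE_URL}/jobs/{job_id}/result")
result.raise_for_status()
with open("generated_dataset.json", "wb") as fh:
    fh.write(result.content)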

Data Sources

Amazon S3 Configuration

sources:
  - type: s3
    connection_params:
      region: us-east-1
      access_key_id: ${AWS_ACCESS_KEY_ID}
      secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    bucket: my-data-bucket
    prefix: customer-data/
    format: json

DynamoDB Configuration

sources:
  - type: dynamodb
    connection_params:
      region: us-east-1
      access_key_id: ${AWS_ACCESS_KEY_ID}
      secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    table: customers
    primary_key: customer_id
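
Before pointing the system at these sources, it can be worth confirming that your credentials can reach them at all. A quick standalone check with boto3 (install it separately if it is not already a dependency) might look like this, reusing the bucket, prefix, and table names from the examples above:

import boto3

session = boto3.Session(region_name="us-east-1")

# Confirm the S3 prefix is readable.
s3 = session.client("s3")
listing = s3.list_objects_v2(Bucket="my-data-bucket", Prefix="customer-data/", MaxKeys=5)
print("S3 objects found:", listing.get("KeyCount", 0))

# Confirm the DynamoDB table exists and is active.
dynamodb = session.resource("dynamodb")
table = dynamodb.Table("customers")
print("DynamoDB table status:", table.table_status)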

Testing

Run the test suite:

pytest tests/

Run specific test files:

pytest tests/test_entity_registry.py -v

Development

Project Structure

├── src/
│   ├── api/                 # REST API implementation
│   ├── core/                # Core models and types
│   ├── entities/            # Entity management
│   ├── connectors/          # Data source connectors
│   ├── generators/          # Data generation engines
│   ├── config/              # Configuration management
│   └── utils/               # Utility functions
├── tests/                   # Test suite
├── config/                  # Configuration files
├── web/                     # Web interface
└── requirements.txt         # Python dependencies

Adding New Data Sources

  1. Create a new connector class inheriting from DataConnector (a rough sketch follows this list)
  2. Implement required methods: connect(), disconnect(), get_schema(), etc.
  3. Register the connector in the generation engines
  4. Update entity configuration schema
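
The base DataConnector class lives under src/connectors/ (per the project structure above), but its exact signatures are not shown in this README, so the following is only a rough, hypothetical sketch of what a new connector might look like:

import csv
from typing import Any, Dict, List

from src.connectors.base import DataConnector  # hypothetical import path


class CSVFileConnector(DataConnector):
    """Hypothetical connector that reads records from a local CSV file."""

    def __init__(self, path: str):
        self.path = path
        self._rows: List[Dict[str, Any]] = []

    def connect(self) -> None:
        with open(self.path, newline="") as fh:
            self._rows = list(csv.DictReader(fh))

    def disconnect(self) -> None:
        self._rows = []

    def get_schema(self) -> Dict[str, str]:
        # Infer a flat string schema from the header row; a real connector
        # would report proper field types.
        first_row = self._rows[0] if self._rows else {}
        return {column: "string" for column in first_row}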

Extending Generation Strategies

  1. Add new pattern analysis methods in the load generator
  2. Implement custom data generation algorithms
  3. Add new distribution types and sampling strategies (see the sketch after this list)
  4. Update the API to support new parameters
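
The plumbing for registering new strategies is not documented here, so the snippet below only illustrates the kind of sampling function a new distribution type might contribute; it is a hypothetical example built on the standard library, not part of the current codebase:

import random
from typing import List


def sample_lognormal_counts(total: int, entities: int, sigma: float = 0.75) -> List[int]:
    """Hypothetical strategy: split `total` records across `entities` using a
    log-normal weighting, so a small share of entities receive most of the volume."""
    weights = [random.lognormvariate(0.0, sigma) for _ in range(entities)]
    scale = total / sum(weights)
    # Rounding means the counts only approximately sum to `total`.
    return [max(1, round(weight * scale)) for weight in weights]


# Example: spread 100,000 generated records across 20,000 customers.
counts = sample_lognormal_counts(100_000, 20_000)
print(min(counts), max(counts), sum(counts))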

Security Considerations

  • Data Anonymization: PII fields can be masked and anonymized during generation (for example via the anonymize flag in generation requests)
  • Access Control: Implement proper authentication and authorization
  • Encryption: Use encryption for data in transit and at rest
  • Audit Logging: Complete audit trail of all generation activities

Performance Optimization

  • Parallel Processing: Multi-threaded data generation for large datasets
  • Caching: Pattern and metadata caching for improved performance
  • Streaming: Stream processing for very large datasets
  • Resource Management: Dynamic resource allocation based on workload

Troubleshooting

Common Issues

  1. Connection Errors: Verify AWS credentials and network connectivity
  2. Memory Issues: Reduce batch sizes for large dataset generation
  3. Slow Generation: Check data source performance and network latency
  4. Relationship Errors: Validate entity graph before generation

Debugging

Enable debug logging:

import logging
logging.basicConfig(level=logging.DEBUG)

Check API logs:

tail -f /tmp/api.log

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For questions and support:

  • Create an issue in the repository
  • Check the documentation
  • Review existing issues and discussions
