Test Data Generation System

A comprehensive system for generating realistic test datasets from business entities while maintaining data relationships and business logic integrity.

Overview

The Test Data Generation System is designed to create realistic test datasets for both functional and load testing by leveraging existing data sources (S3 files and DynamoDB tables) while maintaining data relationships and business logic integrity.

Features

Core Capabilities

  • Business Entity Management: Define and manage business entities with their data sources and relationships
  • Functional Test Data Generation: Create targeted datasets for functional testing with specific criteria
  • Load Test Data Generation: Generate large-scale datasets for performance testing with realistic patterns
  • Data Source Integration: Support for Amazon S3 and DynamoDB data sources
  • Relationship Maintenance: Automatically maintain referential integrity between related entities
  • Pattern Recognition: Learn from existing data patterns to generate realistic synthetic data
  • Data Anonymization: Built-in PII masking and anonymization capabilities

Architecture Components

  • Entity Registry: Central repository for business entity definitions
  • Data Connectors: Pluggable connectors for various data sources
  • Generation Engines: Separate engines for functional and load test data generation
  • Web Interface: User-friendly web interface for system management
  • REST API: Comprehensive API for programmatic access

Quick Start

Prerequisites

  • Python 3.8+
  • pip or poetry for dependency management

Installation

  1. Clone the repository:
git clone <repository-url>
cd test-data-generation-system
  2. Install dependencies:
pip install -r requirements.txt
  3. Start the API server:
python -m src.api.main
  4. Open the web interface: Navigate to http://localhost:52712 in your browser

Basic Usage

  1. Define Business Entities: Create entity definitions with data sources and relationships
  2. Generate Functional Test Data: Specify criteria and generate targeted datasets
  3. Generate Load Test Data: Create large-scale datasets for performance testing
  4. Monitor Jobs: Track generation progress and download results

Configuration

Entity Definition Example

entities:
  customer:
    description: "Customer business entity"
    sources:
      - type: dynamodb
        connection_params:
          region: us-east-1
        table: customers
        primary_key: customer_id
    relationships:
      - entity: order
        type: one_to_many
        foreign_key: customer_id
    schema:
      customer_id:
        type: string
        likely_identifier: true
      email:
        type: string
        likely_email: true
      name:
        type: string
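
To register this definition programmatically, use the POST /entities endpoint listed in the API Reference below. The exact request body is not documented in this README, so the payload here is a hypothetical sketch that simply mirrors the YAML definition (it also assumes the requests library is available):

import requests

# Hypothetical payload: POST /entities is documented below, but its exact
# request schema is not shown here, so this mirrors the YAML definition above.
entity_definition = {
    "name": "customer",
    "description": "Customer business entity",
    "sources": [
        {
            "type": "dynamodb",
            "connection_params": {"region": "us-east-1"},
            "table": "customers",
            "primary_key": "customer_id",
        }
    ],
    "relationships": [
        {"entity": "order", "type": "one_to_many", "foreign_key": "customer_id"}
    ],
    "schema": {
        "customer_id": {"type": "string", "likely_identifier": True},
        "email": {"type": "string", "likely_email": True},
        "name": {"type": "string"},
    },
}

response = requests.post("http://localhost:52712/entities", json=entity_definition)
response.raise_for_status()
print(response.json())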

Environment Variables

Create a .env file with your configuration:

# AWS Configuration
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1

# API Configuration
API_HOST=0.0.0.0
API_PORT=52712
DEBUG=true

# Generation Settings
DEFAULT_SAMPLE_SIZE=1000
MAX_GENERATION_VOLUME=1000000
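
The API server may or may not read the .env file automatically; if you need to load it yourself in scripts or tests, python-dotenv (not listed in this README, so treat it as an optional extra) is a common choice:

import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # copies the variables from .env into the process environment

api_port = int(os.getenv("API_PORT", "52712"))
max_volume = int(os.getenv("MAX_GENERATION_VOLUME", "1000000"))
print(f"API port: {api_port}, max generation volume: {max_volume}")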

API Reference

Entity Management

  • GET /entities - List all entities
  • GET /entities/{name} - Get entity details
  • POST /entities - Register new entity
  • GET /entities/validate - Validate entity graph
  • POST /entities/discover-relationships - Auto-discover relationships

Data Generation

  • POST /generate/functional - Generate functional test data
  • POST /generate/load - Generate load test data

Job Management

  • GET /jobs - List all jobs
  • GET /jobs/{id} - Get job details
  • GET /jobs/{id}/result - Download job result
  • DELETE /jobs/{id} - Delete job

Example API Usage

Generate Functional Test Data

curl -X POST http://localhost:52712/generate/functional \
  -H "Content-Type: application/json" \
  -d '{
    "name": "checkout_test",
    "entities": {
      "customer": {
        "count": 50,
        "criteria": [
          {
            "field": "account_status",
            "value": "active"
          }
        ],
        "include_related": true
      }
    },
    "anonymize": true
  }'

Generate Load Test Data

curl -X POST http://localhost:52712/generate/load \
  -H "Content-Type: application/json" \
  -d '{
    "name": "peak_load_test",
    "volume": 100000,
    "entities": {
      "customer": {
        "count": 20000,
        "distribution": "normal"
      }
    },
    "relationships_maintained": true
  }'
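
Check Job Status and Download Results

Generation runs are tracked as jobs, and results are fetched through the Job Management endpoints. The sketch below assumes the requests library is installed; the job id and the status field names are assumptions, since the response formats are not shown in this README.

import time

import requests

BASE_URL = "http://localhost:52712"
job_id = "<job-id>"  # id of a generation job; how it is returned is not shown here

# Poll GET /jobs/{id} until the job finishes (field names are assumed).
while True:
    job = requests.get(f"{BASE_URL}/jobs/{job_id}").json()
    if job.get("status") in ("completed", "failed"):
        break
    time.sleep(5)

# Download the generated dataset via GET /jobs/{id}/result.
result = requests.get(f"{BASE_URL}/jobs/{job_id}/result")
result.raise_for_status()
with open("generated_dataset.json", "wb") as fh:
    fh.write(result.content)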

Data Sources

Amazon S3 Configuration

sources:
  - type: s3
    connection_params:
      region: us-east-1
      access_key_id: ${AWS_ACCESS_KEY_ID}
      secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    bucket: my-data-bucket
    prefix: customer-data/
    format: json

DynamoDB Configuration

sources:
  - type: dynamodb
    connection_params:
      region: us-east-1
      access_key_id: ${AWS_ACCESS_KEY_ID}
      secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    table: customers
    primary_key: customer_id
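
Before pointing the system at these sources, it can be worth confirming that your credentials can reach them at all. A quick standalone check with boto3 (install it separately if it is not already a dependency) might look like this, reusing the bucket, prefix, and table names from the examples above:

import boto3

session = boto3.Session(region_name="us-east-1")

# Confirm the S3 prefix is readable.
s3 = session.client("s3")
listing = s3.list_objects_v2(Bucket="my-data-bucket", Prefix="customer-data/", MaxKeys=5)
print("S3 objects found:", listing.get("KeyCount", 0))

# Confirm the DynamoDB table exists and is active.
dynamodb = session.resource("dynamodb")
table = dynamodb.Table("customers")
print("DynamoDB table status:", table.table_status)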

Testing

Run the test suite:

pytest tests/

Run specific test files:

pytest tests/test_entity_registry.py -v

Development

Project Structure

├── src/
│   ├── api/                 # REST API implementation
│   ├── core/                # Core models and types
│   ├── entities/            # Entity management
│   ├── connectors/          # Data source connectors
│   ├── generators/          # Data generation engines
│   ├── config/              # Configuration management
│   └── utils/               # Utility functions
├── tests/                   # Test suite
├── config/                  # Configuration files
├── web/                     # Web interface
└── requirements.txt         # Python dependencies

Adding New Data Sources

  1. Create a new connector class inheriting from DataConnector (a rough sketch follows this list)
  2. Implement required methods: connect(), disconnect(), get_schema(), etc.
  3. Register the connector in the generation engines
  4. Update entity configuration schema
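
The base DataConnector class lives under src/connectors/ (per the project structure above), but its exact signatures are not shown in this README, so the following is only a rough, hypothetical sketch of what a new connector might look like:

import csv
from typing import Any, Dict, List

from src.connectors.base import DataConnector  # hypothetical import path


class CSVFileConnector(DataConnector):
    """Hypothetical connector that reads records from a local CSV file."""

    def __init__(self, path: str):
        self.path = path
        self._rows: List[Dict[str, Any]] = []

    def connect(self) -> None:
        with open(self.path, newline="") as fh:
            self._rows = list(csv.DictReader(fh))

    def disconnect(self) -> None:
        self._rows = []

    def get_schema(self) -> Dict[str, str]:
        # Infer a flat string schema from the header row; a real connector
        # would report proper field types.
        first_row = self._rows[0] if self._rows else {}
        return {column: "string" for column in first_row}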

Extending Generation Strategies

  1. Add new pattern analysis methods in the load generator
  2. Implement custom data generation algorithms
  3. Add new distribution types and sampling strategies (see the sketch after this list)
  4. Update the API to support new parameters
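
The plumbing for registering new strategies is not documented here, so the snippet below only illustrates the kind of sampling function a new distribution type might contribute; it is a hypothetical example built on the standard library, not part of the current codebase:

import random
from typing import List


def sample_lognormal_counts(total: int, entities: int, sigma: float = 0.75) -> List[int]:
    """Hypothetical strategy: split `total` records across `entities` using a
    log-normal weighting, so a small share of entities receive most of the volume."""
    weights = [random.lognormvariate(0.0, sigma) for _ in range(entities)]
    scale = total / sum(weights)
    # Rounding means the counts only approximately sum to `total`.
    return [max(1, round(weight * scale)) for weight in weights]


# Example: spread 100,000 generated records across 20,000 customers.
counts = sample_lognormal_counts(100_000, 20_000)
print(min(counts), max(counts), sum(counts))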

Security Considerations

  • Data Anonymization: PII fields can be masked and anonymized during generation (for example via the anonymize flag in generation requests)
  • Access Control: Implement proper authentication and authorization
  • Encryption: Use encryption for data in transit and at rest
  • Audit Logging: Complete audit trail of all generation activities

Performance Optimization

  • Parallel Processing: Multi-threaded data generation for large datasets
  • Caching: Pattern and metadata caching for improved performance
  • Streaming: Stream processing for very large datasets
  • Resource Management: Dynamic resource allocation based on workload

Troubleshooting

Common Issues

  1. Connection Errors: Verify AWS credentials and network connectivity
  2. Memory Issues: Reduce batch sizes for large dataset generation
  3. Slow Generation: Check data source performance and network latency
  4. Relationship Errors: Validate entity graph before generation

Debugging

Enable debug logging:

import logging
logging.basicConfig(level=logging.DEBUG)

Check API logs:

tail -f /tmp/api.log

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For questions and support:

  • Create an issue in the repository
  • Check the documentation
  • Review existing issues and discussions
