mock-data-generator

Overview

The mock-data-generator.py Python script produces mock data for Senzing. The senzing/mock-data-generator Docker image packages the script for use in Docker formations (e.g. docker-compose, Kubernetes).

mock-data-generator.py has several subcommands. Apart from version, each one pairs a data source (templated random data or a URL-addressable file) with a destination (STDOUT or Kafka).

To see all of the subcommands, run:

$ ./mock-data-generator.py --help
usage: mock-data-generator.py [-h]
                              {version,random-to-stdout,random-to-kafka,url-to-stdout,url-to-kafka}
                              ...

Generate mock data from a URL-addressable file or templated random data. For
more information, see https://github.com/Senzing/mock-data-generator

positional arguments:
  {version,random-to-stdout,random-to-kafka,url-to-stdout,url-to-kafka}
                        Subcommands (SENZING_SUBCOMMAND):
    version             Print version of mock-data-generator.py.
    random-to-stdout    Send random data to STDOUT
    random-to-kafka     Send random data to Kafka
    url-to-stdout       Send HTTP or file data to STDOUT
    url-to-kafka        Send HTTP or file data to Kafka

optional arguments:
  -h, --help            show this help message and exit

To see the options for a subcommand, run commands like:

./mock-data-generator.py random-to-stdout --help
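
The help layout above is what Python's argparse prints for a parser with subparsers. The following is a minimal sketch of such a parser, not the script's actual source: the `SUBCOMMANDS` table and `build_parser` function are hypothetical names, and only the `--random-seed` option (which appears in the examples later in this document) is wired in.

```python
import argparse

# Subcommand names and help strings copied from the --help output above.
SUBCOMMANDS = {
    "version": "Print version of mock-data-generator.py.",
    "random-to-stdout": "Send random data to STDOUT",
    "random-to-kafka": "Send random data to Kafka",
    "url-to-stdout": "Send HTTP or file data to STDOUT",
    "url-to-kafka": "Send HTTP or file data to Kafka",
}


def build_parser():
    """Build a parser whose layout resembles the help text above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="mock-data-generator.py")
    subparsers = parser.add_subparsers(
        dest="subcommand", help="Subcommands (SENZING_SUBCOMMAND):"
    )
    for name, help_text in SUBCOMMANDS.items():
        subparser = subparsers.add_parser(name, help=help_text)
        if name == "random-to-stdout":
            # Only this option is sketched; the real script accepts more.
            subparser.add_argument("--random-seed", type=int, default=0)
    return parser


args = build_parser().parse_args(["random-to-stdout", "--random-seed", "1"])
print(args.subcommand, args.random_seed)
```

Each subparser carries its own options, which is why `--help` has to be requested per subcommand.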

Contents

  1. Using Command Line
    1. Prerequisite software
    2. Set environment variables
    3. Clone repository
    4. Install
    5. Demonstrate
  2. Using Docker
    1. Build docker image
    2. Configuration
    3. Run docker image
      1. Demonstrate random to STDOUT
      2. Demonstrate random to Kafka
      3. Demonstrate URL to STDOUT
      4. Demonstrate URL to Kafka
  3. Developing
    1. Build docker image for development
  4. Errors

Using Command Line

Prerequisite software

The following software programs need to be installed.

  1. YUM-based installs - For Red Hat, CentOS, openSUSE and others.

    sudo yum -y install epel-release
    sudo yum -y install git
  2. APT-based installs - For Ubuntu and others

    sudo apt update
    sudo apt -y install git

Set environment variables

  1. Set the following variables, which are used throughout the installation procedure. The values may be changed, but the defaults work as-is.

    export GIT_ACCOUNT=senzing
    export GIT_REPOSITORY=mock-data-generator
    export DOCKER_IMAGE_TAG=senzing/mock-data-generator
  2. Synthesize environment variables.

    export GIT_ACCOUNT_DIR=~/${GIT_ACCOUNT}.git
    export GIT_REPOSITORY_DIR="${GIT_ACCOUNT_DIR}/${GIT_REPOSITORY}"
    export GIT_REPOSITORY_URL="https://github.com/${GIT_ACCOUNT}/${GIT_REPOSITORY}.git"
  3. Set environment variables described in "Configuration".
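
To make the derived values concrete, here is a small sketch that mirrors the shell substitutions in step 2. The `synthesize` function and its `home` parameter are hypothetical and exist only for illustration; the shell exports above remain the actual procedure.

```python
import os.path


def synthesize(git_account="senzing", git_repository="mock-data-generator", home=None):
    """Derive the directory and URL values the same way the shell exports do."""
    # "~" in the shell export expands to the user's home directory.
    home = os.path.expanduser("~") if home is None else home
    account_dir = f"{home}/{git_account}.git"
    return {
        "GIT_ACCOUNT_DIR": account_dir,
        "GIT_REPOSITORY_DIR": f"{account_dir}/{git_repository}",
        "GIT_REPOSITORY_URL": f"https://github.com/{git_account}/{git_repository}.git",
    }


for name, value in synthesize().items():
    print(f"{name}={value}")
```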

Clone repository

  1. Get repository.

    mkdir --parents "${GIT_ACCOUNT_DIR}"
    cd "${GIT_ACCOUNT_DIR}"
    git clone "${GIT_REPOSITORY_URL}"

Install

  1. YUM installs - For Red Hat, CentOS, openSUSE and others.

    sudo xargs yum -y install < ${GIT_REPOSITORY_DIR}/src/yum-packages.txt
  2. APT installs - For Ubuntu and others

    sudo xargs apt -y install < ${GIT_REPOSITORY_DIR}/src/apt-packages.txt
  3. PIP installs

    sudo pip install -r ${GIT_REPOSITORY_DIR}/requirements.txt

Demonstrate

  1. Show help. Example:

    cd ${GIT_REPOSITORY_DIR}
    ./mock-data-generator.py --help
    ./mock-data-generator.py random-to-stdout --help
  2. Show random output on STDOUT. Example:

    cd ${GIT_REPOSITORY_DIR}
    ./mock-data-generator.py random-to-stdout
  3. Show random output throttled to 1 record per second. Example:

    cd ${GIT_REPOSITORY_DIR}
    ./mock-data-generator.py random-to-stdout \
      --records-per-second 1
  4. Show repeatable "random" output using random seed. Example:

    cd ${GIT_REPOSITORY_DIR}
    ./mock-data-generator.py random-to-stdout \
      --random-seed 1
  5. Generate 10 repeatable random records at a rate of 2 per second. Example:

    cd ${GIT_REPOSITORY_DIR}
    ./mock-data-generator.py random-to-stdout \
      --random-seed 22 \
      --record-min 1 \
      --record-max 10 \
      --records-per-second 2
  6. Send output to a JSON-lines file. Example:

    cd ${GIT_REPOSITORY_DIR}
    ./mock-data-generator.py random-to-stdout \
      --random-seed 22 \
      --record-min 1 \
      --record-max 10 \
      --records-per-second 2 \
      > output-file.jsonlines
  7. Read 5 records from a URL-addressable file at a rate of 3 per second. Example:

    cd ${GIT_REPOSITORY_DIR}
    ./mock-data-generator.py url-to-stdout \
      --input-url https://s3.amazonaws.com/public-read-access/TestDataSets/loadtest-dataset-1M.json \
      --record-min 1 \
      --record-max 5 \
      --records-per-second 3
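
A JSON-lines file such as the one written in step 6 holds one JSON object per line, so it can be read back line by line. The sketch below assumes that format; `read_json_lines` is a hypothetical helper, and the sample records are made up (real records contain whatever fields the generator emits).

```python
import io
import json


def read_json_lines(stream):
    """Yield one dict per non-blank line of a JSON-lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up sample standing in for output-file.jsonlines.
sample = io.StringIO(
    '{"DATA_SOURCE": "TEST", "EXAMPLE_FIELD": 1}\n'
    "\n"
    '{"DATA_SOURCE": "TEST", "EXAMPLE_FIELD": 2}\n'
)
records = list(read_json_lines(sample))
print(len(records), "records read")
```

To read the real file, pass an open file object instead of the sample, for example `open("output-file.jsonlines")`.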

Using Docker

Build docker image

  1. Build docker image.

    sudo docker build --tag senzing/mock-data-generator https://github.com/senzing/mock-data-generator.git

Configuration

  • SENZING_DEBUG - Print debug statements to log.
  • SENZING_DATA_SOURCE - If a JSON line does not have the DATA_SOURCE key/value, this value is inserted.
  • SENZING_ENTITY_TYPE - If a JSON line does not have the ENTITY_TYPE key/value, this value is inserted.
  • SENZING_INPUT_URL - URL of source file. Default: https://s3.amazonaws.com/public-read-access/TestDataSets/loadtest-dataset-1M.json
  • SENZING_KAFKA_BOOTSTRAP_SERVER - Hostname and port of Kafka server. Default: "localhost"
  • SENZING_KAFKA_TOPIC - Kafka topic. Default: "senzing-kafka-topic"
  • SENZING_RANDOM_SEED - Identify seed for random number generator. Value of 0 uses system clock. Values greater than 0 give repeatable results. Default: "0"
  • SENZING_RECORD_MAX - Identify highest record number to generate. Value of 0 means no maximum. Default: "0"
  • SENZING_RECORD_MIN - Identify lowest record number to generate. Default: "1"
  • SENZING_RECORD_MONITOR - Write a log record every N mock records. Default: "10000"
  • SENZING_RECORDS_PER_SECOND - Throttle output to a specified records per second. Value of 0 means no throttling. Default: "0"
  • SENZING_SUBCOMMAND - Identify the subcommand to be run. See mock-data-generator.py --help for complete list.
  1. To determine which configuration parameters are used for each <subcommand>, run:

    ./mock-data-generator.py <subcommand> --help
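
As a rough illustration of how these settings behave, the sketch below merges environment variables over the defaults listed above and inserts DATA_SOURCE and ENTITY_TYPE only when a JSON line lacks them. It is an assumption-laden sketch, not the script's own code: `load_config`, `fill_missing` and the `EXAMPLE_FIELD` key are hypothetical.

```python
import json
import os

# Defaults copied from the list above; variables with no stated default are omitted.
DEFAULTS = {
    "SENZING_INPUT_URL": "https://s3.amazonaws.com/public-read-access/TestDataSets/loadtest-dataset-1M.json",
    "SENZING_KAFKA_BOOTSTRAP_SERVER": "localhost",
    "SENZING_KAFKA_TOPIC": "senzing-kafka-topic",
    "SENZING_RANDOM_SEED": "0",
    "SENZING_RECORD_MAX": "0",
    "SENZING_RECORD_MIN": "1",
    "SENZING_RECORD_MONITOR": "10000",
    "SENZING_RECORDS_PER_SECOND": "0",
}


def load_config(environ=None):
    """Return the defaults overridden by any SENZING_* variables that are set."""
    environ = os.environ if environ is None else environ
    config = dict(DEFAULTS)
    config.update({k: v for k, v in environ.items() if k.startswith("SENZING_")})
    return config


def fill_missing(json_line, data_source=None, entity_type=None):
    """Insert DATA_SOURCE / ENTITY_TYPE only when the record does not have them."""
    record = json.loads(json_line)
    if data_source is not None:
        record.setdefault("DATA_SOURCE", data_source)
    if entity_type is not None:
        record.setdefault("ENTITY_TYPE", entity_type)
    return json.dumps(record, sort_keys=True)


config = load_config({"SENZING_RECORD_MAX": "10"})
print(config["SENZING_RECORD_MIN"], config["SENZING_RECORD_MAX"])
print(fill_missing('{"EXAMPLE_FIELD": 1}', data_source="TEST"))
```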

Run docker image

Demonstrate random to STDOUT

  1. Run the docker container. Example:

    export SENZING_SUBCOMMAND=random-to-stdout
    export SENZING_RANDOM_SEED=0
    export SENZING_RECORD_MIN=1
    export SENZING_RECORD_MAX=10
    export SENZING_RECORDS_PER_SECOND=0
    
    sudo docker run -it  \
      --env SENZING_SUBCOMMAND="${SENZING_SUBCOMMAND}" \
      --env SENZING_RANDOM_SEED="${SENZING_RANDOM_SEED}" \
      --env SENZING_RECORD_MIN="${SENZING_RECORD_MIN}" \
      --env SENZING_RECORD_MAX="${SENZING_RECORD_MAX}" \
      --env SENZING_RECORDS_PER_SECOND="${SENZING_RECORDS_PER_SECOND}" \
      senzing/mock-data-generator
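
The seed, record window and throttle settings used above can be pictured with the following sketch. It is not the generator's implementation: `generate_records` and the record fields `RECORD_NUMBER` and `EXAMPLE_VALUE` are hypothetical, and drawing a value for skipped record numbers is a design choice of this sketch so that record N is identical wherever the window starts.

```python
import itertools
import json
import random
import time


def generate_records(random_seed=0, record_min=1, record_max=0,
                     records_per_second=0, sleep=time.sleep):
    """Yield JSON lines numbered record_min..record_max (0 means no maximum)."""
    # Seed 0 is treated as "not repeatable": Python picks its own seed.
    rng = random.Random(random_seed if random_seed > 0 else None)
    numbers = itertools.count(1) if record_max == 0 else range(1, record_max + 1)
    for number in numbers:
        # Draw for every record number, even skipped ones (see lead-in).
        value = rng.randint(0, 999999)
        if number < record_min:
            continue
        yield json.dumps({"RECORD_NUMBER": number, "EXAMPLE_VALUE": value})
        if records_per_second > 0:
            # Throttle by pausing after each record.
            sleep(1.0 / records_per_second)


first = list(generate_records(random_seed=22, record_min=1, record_max=10))
second = list(generate_records(random_seed=22, record_min=1, record_max=10))
print(len(first), first == second)
```

The `sleep` parameter is injectable only so that the throttle can be exercised without real delays.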

Demonstrate random to Kafka

  1. Run docker-compose-stream-loader-demo

  2. Identify the Docker network. Example:

    docker network ls
    
    # Choose value from NAME column of docker network ls
    export SENZING_NETWORK=nameofthe_network
  3. Run the docker container. Example:

    export SENZING_SUBCOMMAND=random-to-kafka
    
    export SENZING_KAFKA_BOOTSTRAP_SERVER=senzing-kafka:9092
    export SENZING_KAFKA_TOPIC="senzing-kafka-topic"
    export SENZING_NETWORK=senzingdockercomposestreamloaderdemo_backend
    export SENZING_RANDOM_SEED=1
    export SENZING_RECORD_MIN=210
    export SENZING_RECORD_MAX=220
    export SENZING_RECORDS_PER_SECOND=1
    
    sudo docker run -it  \
      --net "${SENZING_NETWORK}" \
      --env SENZING_SUBCOMMAND="${SENZING_SUBCOMMAND}" \
      --env SENZING_KAFKA_BOOTSTRAP_SERVER="${SENZING_KAFKA_BOOTSTRAP_SERVER}" \
      --env SENZING_KAFKA_TOPIC="${SENZING_KAFKA_TOPIC}" \
      --env SENZING_RANDOM_SEED="${SENZING_RANDOM_SEED}" \
      --env SENZING_RECORD_MIN="${SENZING_RECORD_MIN}" \
      --env SENZING_RECORD_MAX="${SENZING_RECORD_MAX}" \
      --env SENZING_RECORDS_PER_SECOND="${SENZING_RECORDS_PER_SECOND}" \
      senzing/mock-data-generator

Demonstrate URL to STDOUT

  1. Run the docker container. Example:

    export SENZING_SUBCOMMAND=url-to-stdout
    
    export SENZING_INPUT_URL=https://s3.amazonaws.com/public-read-access/TestDataSets/loadtest-dataset-1M.json
    export SENZING_RECORD_MIN=240
    export SENZING_RECORD_MAX=250
    export SENZING_RECORDS_PER_SECOND=0
    
    sudo docker run -it  \
      --env SENZING_SUBCOMMAND="${SENZING_SUBCOMMAND}" \
      --env SENZING_INPUT_URL="${SENZING_INPUT_URL}" \
      --env SENZING_RECORD_MIN="${SENZING_RECORD_MIN}" \
      --env SENZING_RECORD_MAX="${SENZING_RECORD_MAX}" \
      --env SENZING_RECORDS_PER_SECOND="${SENZING_RECORDS_PER_SECOND}" \
      senzing/mock-data-generator
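
Selecting records 240 through 250 from a line-oriented source amounts to slicing the stream by line number. The sketch below assumes records are counted from 1 and that both bounds are inclusive, which matches the "5 records" example with minimum 1 and maximum 5; `select_records` is a hypothetical helper and the input lines are stand-ins.

```python
import itertools


def select_records(lines, record_min=1, record_max=0):
    """Yield lines numbered record_min..record_max, counting from 1 (0 means no maximum)."""
    stop = None if record_max == 0 else record_max
    for line in itertools.islice(lines, max(record_min, 1) - 1, stop):
        # Network responses yield bytes; local text files yield str.
        yield line.decode("utf-8") if isinstance(line, bytes) else line


# Stand-in for the lines of a downloaded file.
lines = [f'{{"EXAMPLE_FIELD": {i}}}\n'.encode("utf-8") for i in range(1, 301)]
selected = list(select_records(lines, record_min=240, record_max=250))
print(len(selected), selected[0].strip())
```

The response object returned by `urllib.request.urlopen` can be iterated line by line, so it could be passed as `lines` to stream a file from SENZING_INPUT_URL without downloading it fully first.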

Demonstrate URL to Kafka

  1. Run docker-compose-stream-loader-demo

  2. Identify the Docker network. Example:

    docker network ls
    
    # Choose value from NAME column of docker network ls
    export SENZING_NETWORK=nameofthe_network
  3. Run the docker container. Example:

    export SENZING_SUBCOMMAND=url-to-kafka
    
    export SENZING_INPUT_URL=https://s3.amazonaws.com/public-read-access/TestDataSets/loadtest-dataset-1M.json
    export SENZING_KAFKA_BOOTSTRAP_SERVER=senzing-kafka:9092
    export SENZING_KAFKA_TOPIC="senzing-kafka-topic"
    export SENZING_NETWORK=senzingdockercomposestreamloaderdemo_backend
    export SENZING_RECORD_MIN=260
    export SENZING_RECORD_MAX=300
    export SENZING_RECORD_MONITOR=10
    export SENZING_RECORDS_PER_SECOND=10
    
    sudo docker run -it  \
      --net "${SENZING_NETWORK}" \
      --env SENZING_SUBCOMMAND="${SENZING_SUBCOMMAND}" \
      --env SENZING_INPUT_URL="${SENZING_INPUT_URL}" \
      --env SENZING_KAFKA_BOOTSTRAP_SERVER="${SENZING_KAFKA_BOOTSTRAP_SERVER}" \
      --env SENZING_KAFKA_TOPIC="${SENZING_KAFKA_TOPIC}" \
      --env SENZING_RECORD_MIN="${SENZING_RECORD_MIN}" \
      --env SENZING_RECORD_MAX="${SENZING_RECORD_MAX}" \
      --env SENZING_RECORD_MONITOR="${SENZING_RECORD_MONITOR}" \
      --env SENZING_RECORDS_PER_SECOND="${SENZING_RECORDS_PER_SECOND}" \
      senzing/mock-data-generator
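
When the same `docker run` invocation has to be launched from a program rather than typed, building the argument list avoids shell quoting problems. The sketch below only assembles and prints the command, using the flags shown above; `docker_run_args` is a hypothetical helper, and nothing is executed.

```python
import shlex


def docker_run_args(image, env, network=None):
    """Assemble the argument list for the docker run commands shown above."""
    args = ["sudo", "docker", "run", "-it"]
    if network:
        args += ["--net", network]
    for name, value in env.items():
        # One --env flag per variable, as in the shell examples.
        args += ["--env", f"{name}={value}"]
    args.append(image)
    return args


env = {
    "SENZING_SUBCOMMAND": "url-to-kafka",
    "SENZING_KAFKA_BOOTSTRAP_SERVER": "senzing-kafka:9092",
    "SENZING_KAFKA_TOPIC": "senzing-kafka-topic",
    "SENZING_RECORD_MIN": "260",
    "SENZING_RECORD_MAX": "300",
}
command = docker_run_args(
    "senzing/mock-data-generator",
    env,
    network="senzingdockercomposestreamloaderdemo_backend",
)
print(shlex.join(command))
```

The resulting list could be handed to `subprocess.run(command)` on a host where Docker and the named network exist.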

Developing

Build docker image for development

  1. Check whether Docker is already installed.

    sudo docker --version
  2. If needed, install Docker. See HOWTO - Install Docker

  3. Option #1 - Using make command

    cd ${GIT_REPOSITORY_DIR}
    sudo make docker-build
  4. Option #2 - Using docker command

    cd ${GIT_REPOSITORY_DIR}
    sudo docker build --tag ${DOCKER_IMAGE_TAG} .

Errors

  1. See doc/errors.md.
