# Data Generator Server
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->
The default Docker Compose deployment includes a data generation service created from the published Docker image at `imply/datagen:latest`. 
This image is built from the [druid-datagenerator](https://github.com/implydata/druid-datagenerator) project.

This notebook shows you how to use the data generation service included in the Docker Compose deployment. It explains how to use predefined data generator configurations as well as how to build a custom data generator. You will also learn how to create sample data files for batch ingestion and how to generate live streaming data for streaming ingestion.

## Table of contents

* [Initialization](#Initialization)
* [List available configurations](#List-available-configurations)
* [Generate a data file for backfilling history](#Generate-a-data-file-for-backfilling-history)
* [Batch ingestion of generated files](#Batch-ingestion-of-generated-files)
* [Generate custom data](#Generate-custom-data)
* [Stream generated data](#Stream-generated-data)
* [Ingest data from a stream](#Ingest-data-from-a-stream)
* [Cleanup](#Cleanup)


## Initialization

To interact with the data generation service, use the REST client provided in the [`druidapi` Python package](https://druid.apache.org/docs/latest/tutorials/tutorial-jupyter-index.html#python-api-for-druid).

In [None]:
import druidapi
import os
import time

# Datagen client 
datagen = druidapi.rest.DruidRestClient("http://datagen:9999")

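# Use os.environ.get so that an unset DRUID_HOST falls back to the
# Docker Compose router instead of raising a KeyError
if os.environ.get('DRUID_HOST') is None:
    druid_host = "http://router:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"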

# Druid client
druid = druidapi.jupyter_client(druid_host)



# these imports and constants are used by multiple cells
from datetime import datetime, timedelta
import json

headers = {
  'Content-Type': 'application/json'
}

### List available configurations
Use the `/list` API endpoint to list the predefined data generator configurations available on the server.

In [None]:
display(datagen.get(f"/list", require_ok=False).json())

### Generate a data file for backfilling history
When generating a file for backfill purposes, you can select the start time and the duration of the simulation.

Configure the data generator request as follows:
* `name`: an arbitrary name you assign to the job. Refer to the job name to get the job status or to stop the job.
* `target.type`: set to `"file"` to generate a data file.
* `target.path`: the name of the file to generate. The data generator ignores any path specified and creates the file in the current working directory.
* `time_type`, `time`: the data generator simulates the time range you specify, with a start timestamp in the `time_type` property and a duration in the `time` property. To specify `time`, use the `h` suffix for hours, `m` for minutes, and `s` for seconds.
* `concurrency`: the maximum number of entities used concurrently to generate events. Each entity is a separate state machine that simulates things like user sessions, IoT devices, or other concurrent sources of event data.

The following example uses the `clickstream.json` predefined configuration to generate data into a file called `clicks.json`. The data generator starts the sample data at one hour prior to the current time and simulates events for a duration of one hour. Because the time is simulated, the job finishes within a few seconds.

In [None]:
# Configure the start time to one hour prior to the current time. 
startDateTime = (datetime.now() - timedelta(hours = 1)).strftime('%Y-%m-%dT%H:%M:%S.001')
print(f"Starting to generate history at {startDateTime}.")

# Give the datagen job a name for use in subsequent API calls
job_name="gen_clickstream1"

# Generate a data file on the datagen server
datagen_request = {
    "name": job_name,
    "target": { "type": "file", "path":"clicks.json"},
    "config_file": "clickstream/clickstream.json", 
    "time_type": startDateTime,
    "time": "1h",
    "concurrency":100
}
response = datagen.post("/start", json.dumps(datagen_request), headers=headers, require_ok=False)
response.json()
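
The `time` suffixes described above can be illustrated with a small helper. This is a minimal sketch, not the data generator's own parser: `parse_duration` and `simulated_window` are hypothetical names, and the sketch assumes whole-number values only.

```python
from datetime import datetime, timedelta, timezone

# Suffixes described for the `time` property: hours, minutes, seconds.
_UNITS = {"h": "hours", "m": "minutes", "s": "seconds"}

def parse_duration(text: str) -> timedelta:
    """Convert a string such as '1h', '30m', or '45s' into a timedelta.

    Assumption: only whole-number values are handled in this sketch.
    """
    value, suffix = text[:-1], text[-1:]
    if suffix not in _UNITS or not value.isdigit():
        raise ValueError(f"Unsupported duration: {text!r}")
    return timedelta(**{_UNITS[suffix]: int(value)})

def simulated_window(start: datetime, duration: str) -> tuple:
    """Return the (start, end) timestamps that a simulation would cover."""
    return start, start + parse_duration(duration)

# Illustrative start time; the notebook derives its own from datetime.now().
start = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
window_start, window_end = simulated_window(start, "1h")
print(window_start.isoformat(), "->", window_end.isoformat())
```

Knowing the simulated window up front helps when choosing a time filter for queries against the ingested data.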

#### Display jobs
Use the `/jobs` API endpoint to get the current jobs and job statuses.

In [None]:
display(datagen.get(f"/jobs").json())

#### Get status of a job
Use the `/status/JOB_NAME` API endpoint to get the status of a specific job.

In [None]:
display(datagen.get(f"/status/{job_name}", require_ok=False).json())
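
If a later step depends on a job having finished, the status call can be wrapped in a polling loop. The sketch below is generic on purpose: `wait_for_job` is a hypothetical helper, and the `"status"` field and the `"RUNNING"` and `"COMPLETE"` values in the demonstration are assumptions, not the documented response format. Inspect a real `/status` response before writing the `is_done` predicate.

```python
import time
from typing import Callable

def wait_for_job(fetch_status: Callable[[], dict],
                 is_done: Callable[[dict], bool],
                 timeout: float = 60.0,
                 interval: float = 1.0) -> dict:
    """Call fetch_status until is_done(status) is true or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if is_done(status):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("Job did not finish before the timeout")
        time.sleep(interval)

# Demonstration with canned responses (the field and values are assumed).
_fake_responses = iter([
    {"status": "RUNNING"},
    {"status": "RUNNING"},
    {"status": "COMPLETE"},
])
final = wait_for_job(
    fetch_status=lambda: next(_fake_responses),
    is_done=lambda s: s.get("status") == "COMPLETE",
    timeout=5,
    interval=0,
)
print(final)
```

In this notebook, `fetch_status` could be `lambda: datagen.get(f"/status/{job_name}", require_ok=False).json()`.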

#### Stop a job
Use the `/stop/JOB_NAME` API endpoint to stop a job.

In [None]:
display(datagen.post(f"/stop/{job_name}", '').json())

#### List files created on datagen server
Use the `/files` API endpoint to list files available on the server.

In [None]:
display(datagen.get("/files").json())

### Batch ingestion of generated files
Use a [Druid HTTP input source](https://druid.apache.org/docs/latest/ingestion/native-batch-input-sources.html#http-input-source) in the [EXTERN function](https://druid.apache.org/docs/latest/multi-stage-query/reference.html#extern-function) of a [SQL-based ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html) to load generated files.
You can access files by name from within Druid using the URI `http://datagen:9999/file/FILE_NAME`. Alternatively, if you run Druid outside of Docker but on the same machine, access the file with `http://localhost:9999/file/FILE_NAME`.
The following example assumes that both Druid and the data generator server are running in Docker Compose.

In [None]:
sql = '''
REPLACE INTO "clicks" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["http://datagen:9999/file/clicks.json"]}',
    '{"type":"json"}'
  )
) EXTEND ("time" VARCHAR, "user_id" VARCHAR, "event_type" VARCHAR, "client_ip" VARCHAR, "client_device" VARCHAR, "client_lang" VARCHAR, "client_country" VARCHAR, "referrer" VARCHAR, "keyword" VARCHAR, "product" VARCHAR))
SELECT
  TIME_PARSE("time") AS "__time",
  "user_id",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country",
  "referrer",
  "keyword",
  "product"
FROM "ext"
PARTITIONED BY DAY
'''  

druid.display.run_task(sql)
print("Waiting for segment availability ...")
druid.sql.wait_until_ready('clicks')
print("Data is available for query.")
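
The first argument to `EXTERN` is a JSON string, so it can be built with `json.dumps` rather than written by hand. The following sketch uses two hypothetical helpers, `http_input_source` and `extend_signature`, and assumes the `http://datagen:9999/file/FILE_NAME` URI pattern described above.

```python
import json

def http_input_source(file_name: str,
                      base_url: str = "http://datagen:9999") -> str:
    """Build the HTTP input source JSON for a file on the datagen server."""
    uri = f"{base_url}/file/{file_name}"
    return json.dumps({"type": "http", "uris": [uri]})

def extend_signature(columns: dict) -> str:
    """Build the column list for the EXTEND clause from name -> SQL type."""
    return ", ".join(f'"{name}" {sql_type}' for name, sql_type in columns.items())

input_source = http_input_source("clicks.json")
signature = extend_signature({"time": "VARCHAR", "user_id": "VARCHAR"})
print(input_source)
print(signature)
```

Pass `base_url="http://localhost:9999"` when Druid runs outside Docker on the same machine.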

In [None]:
sql = '''
SELECT "event_type", "user_id", COUNT(DISTINCT "client_ip") AS ip_count
FROM "clicks"
GROUP BY 1,2
ORDER BY 3 DESC
LIMIT 10
'''
druid.display.sql(sql)

## Generate custom data

You can find the full set of configuration options for the data generator in the [README](https://github.com/implydata/druid-datagenerator#data-generator-configuration).

This section demonstrates a simple custom configuration as an example. Notice that the emitter defines the schema as a list of dimensions, and each dimension specifies how its values are generated:

In [None]:
gen_config = {
  "emitters": [
    {
      "name": "simple_record",
      "dimensions": [
        {
          "type": "string",
          "name": "random_string_column",
          "length_distribution": {
            "type": "constant",
            "value": 13
          },
          "cardinality": 0,
          "chars": "#.abcdefghijklmnopqrstuvwxyz"
        },
        {
          "type": "int",
          "name": "distributed_number",
          "distribution": {
            "type": "uniform",
            "min": 0,
            "max": 1000
          },
          "cardinality": 10,
          "cardinality_distribution": {
            "type": "exponential",
            "mean": 5
          }
        }
      ]
    }
  ],
  "interarrival": {
    "type": "constant",
    "value": 1
  },
  "states": [
    {
      "name": "state_1",
      "emitter": "simple_record",
      "delay": {
        "type": "constant",
        "value": 1
      },
      "transitions": [
        {
          "next": "state_1",
          "probability": 1.0
        }
      ]
    }
  ]
}

target = { "type":"file", "path":"sample_data.json"}
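
Before submitting a custom configuration, it can help to check that its parts refer to each other consistently. The sketch below is a hypothetical sanity check, not part of the data generator. It assumes that every state must name a defined emitter, that every transition must point to a defined state, and that the transition probabilities of a state sum to 1. The configuration above satisfies all three, but the README remains the authority on what is actually required.

```python
import math

def check_generator_config(config: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    emitter_names = {e["name"] for e in config.get("emitters", [])}
    state_names = {s["name"] for s in config.get("states", [])}
    for state in config.get("states", []):
        if state["emitter"] not in emitter_names:
            problems.append(
                f"state {state['name']!r} uses unknown emitter {state['emitter']!r}")
        transitions = state.get("transitions", [])
        for transition in transitions:
            if transition["next"] not in state_names:
                problems.append(
                    f"state {state['name']!r} transitions to unknown state "
                    f"{transition['next']!r}")
        # Assumption: probabilities of a state's transitions should sum to 1.
        total = sum(t["probability"] for t in transitions)
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            problems.append(
                f"state {state['name']!r} probabilities sum to {total}")
    return problems

# Small configurations shaped like the one in this notebook.
sample_config = {
    "emitters": [{"name": "simple_record", "dimensions": []}],
    "states": [{
        "name": "state_1",
        "emitter": "simple_record",
        "transitions": [{"next": "state_1", "probability": 1.0}],
    }],
}
broken_config = {
    "emitters": [{"name": "simple_record", "dimensions": []}],
    "states": [{
        "name": "state_1",
        "emitter": "simple_record",
        "transitions": [{"next": "state_2", "probability": 0.5}],
    }],
}
print(check_generator_config(sample_config))
print(check_generator_config(broken_config))
```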

This example uses the `config` attribute of the request to configure a new custom data generator instead of using a predefined `config_file`.

In [None]:
# generate 1 hour of simulated time using custom configuration
datagen_request = {
    "name": "sample_custom",
    "target": target,
    "config": gen_config, 
    "time": "1h",
    "concurrency":10,
    "time_type": "SIM"
}
response = datagen.post("/start", json.dumps(datagen_request), headers=headers, require_ok=False)
response.json()

In [None]:
display(datagen.get(f"/jobs", require_ok=False).json())

In [None]:
# display the first 1k characters of the generated data file
display( datagen.get(f"/file/sample_data.json").content[:1024])
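
The raw bytes are easier to inspect once parsed into records. The following sketch assumes the generated file contains one JSON object per line. `parse_json_lines` is a hypothetical helper, and the sample bytes are illustrative rather than real generator output.

```python
import json

def parse_json_lines(raw: bytes, limit: int = 5) -> list:
    """Parse up to `limit` newline-delimited JSON records from raw bytes.

    Parsing stops at the first line that is not valid JSON, because a slice
    such as content[:1024] can cut the last record short.
    """
    records = []
    for line in raw.decode("utf-8", errors="replace").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            break
        if len(records) >= limit:
            break
    return records

# Illustrative sample: one complete record followed by a truncated one.
sample = (
    b'{"time":"2024-01-01T00:00:00.000","random_string_column":"abc",'
    b'"distributed_number":7}\n'
    b'{"time":"2024-01-01T00:00:01.000","random_st'
)
records = parse_json_lines(sample)
print(records)
```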

In [None]:
datagen.post(f"/stop/sample_custom",'')

## Stream generated data

The data generator works exactly the same whether it is writing data to a file or publishing messages into a stream. You only need to change the target configuration.

To use the Kafka container running in Docker Compose, use the broker address `kafka:9092`. This tutorial reads the `KAFKA_HOST` environment variable set by Docker Compose to determine the Kafka host.

In [None]:
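# Use os.environ.get so that an unset KAFKA_HOST falls back to the
# Docker Compose Kafka container instead of raising a KeyError
if os.environ.get('KAFKA_HOST') is None:
    kafka_host = "kafka:9092"
else:
    kafka_host = f"{os.environ['KAFKA_HOST']}:9092"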

The simplest `target` object for Kafka and, similarly, Confluent is:

In [None]:
target = {
    "type":"kafka",
    "endpoint": kafka_host,
    "topic": "custom_data"
}

# Generate 1 hour of data in real time using the custom configuration.
# The stream runs for an hour unless you stop it.
datagen_request = {
    "name": "sample_custom",
    "target": target,
    "config": gen_config, 
    "time": "1h",
    "concurrency":10,
    "time_type": "REAL"
}
response = datagen.post("/start", json.dumps(datagen_request), headers=headers, require_ok=False)
response.json()
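
Since the file and Kafka requests differ only in `target` and `time_type`, the request body can be assembled by one helper. This is a sketch: `build_datagen_request` is a hypothetical function, and requiring exactly one of `config` or `config_file` is a design choice made here, not a documented rule of the service.

```python
def build_datagen_request(name: str, target: dict, *,
                          config: dict = None,
                          config_file: str = None,
                          duration: str = "1h",
                          concurrency: int = 10,
                          time_type: str = "SIM") -> dict:
    """Assemble a request body for the /start endpoint."""
    # Design choice in this sketch: require exactly one configuration source.
    if (config is None) == (config_file is None):
        raise ValueError("Provide exactly one of config or config_file")
    request = {
        "name": name,
        "target": target,
        "time": duration,
        "concurrency": concurrency,
        "time_type": time_type,
    }
    if config is not None:
        request["config"] = config
    else:
        request["config_file"] = config_file
    return request

placeholder_config = {"emitters": [], "states": []}  # stand-in for gen_config

file_request = build_datagen_request(
    "sample_custom",
    {"type": "file", "path": "sample_data.json"},
    config=placeholder_config,
    time_type="SIM",
)
kafka_request = build_datagen_request(
    "sample_custom",
    {"type": "kafka", "endpoint": "kafka:9092", "topic": "custom_data"},
    config=placeholder_config,
    time_type="REAL",
)
print(file_request["target"], kafka_request["target"])
```

The resulting dictionary can be sent the same way as the requests in this notebook, with `datagen.post("/start", json.dumps(request), headers=headers, require_ok=False)`.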

In [None]:
time.sleep(1) # avoid race condition of async job start
display(datagen.get(f"/jobs", require_ok=False).json())

### Ingest data from a stream 
This example shows how to start a streaming ingestion supervisor in Apache Druid to consume your custom data:

In [None]:
ingestion_spec ={
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "consumerProperties": {
        "bootstrap.servers": "kafka:9092"
      },
      "topic": "custom_data",
      "inputFormat": {
        "type": "json"
      },
      "useEarliestOffset": True
    },
    "tuningConfig": {
      "type": "kafka",
      "maxRowsInMemory": 100000,
      "resetOffsetAutomatically": False
    },
    "dataSchema": {
      "dataSource": "custom_data",
      "timestampSpec": {
        "column": "time",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": [
          "random_string_column",
          {
            "type": "long",
            "name": "distributed_number"
          }
        ]
      },
      "granularitySpec": {
        "queryGranularity": "none",
        "rollup": False,
        "segmentGranularity": "hour"
      }
    }
  }
}

headers = {
  'Content-Type': 'application/json'
}

druid.rest.post("/druid/indexer/v1/supervisor", json.dumps(ingestion_spec), headers=headers)
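
The `dimensions` list in the supervisor spec mirrors the dimensions of the generator's emitter. The sketch below derives one from the other. `druid_dimensions` is a hypothetical helper that covers only the two generator types used in this notebook, mapping `string` to a plain dimension name and `int` to a `long` dimension, as the spec above does.

```python
def druid_dimensions(emitter: dict) -> list:
    """Map generator dimension definitions to Druid dimensionsSpec entries."""
    dimensions = []
    for dim in emitter["dimensions"]:
        if dim["type"] == "string":
            # String dimensions can be listed by name only.
            dimensions.append(dim["name"])
        elif dim["type"] == "int":
            dimensions.append({"type": "long", "name": dim["name"]})
        else:
            raise ValueError(f"No mapping in this sketch for type {dim['type']!r}")
    return dimensions

# Emitter reduced to the fields this sketch reads.
sample_emitter = {
    "name": "simple_record",
    "dimensions": [
        {"type": "string", "name": "random_string_column"},
        {"type": "int", "name": "distributed_number"},
    ],
}
dimensions = druid_dimensions(sample_emitter)
print(dimensions)
```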

Wait for the data to become available, then query it. The streaming tasks take a short while to start, but once they are consuming, query results reflect data that is very close to real time. Run the following cell multiple times to see how the data changes:

In [None]:
druid.sql.wait_until_ready('custom_data', verify_load_status=False)
druid.display.sql('''
SELECT SUM(distributed_number) sum_randoms, count(*) total_count
FROM custom_data
''')

### Cleanup

Stop the streaming ingestion and the streaming producer:

In [None]:
print(f"Stop streaming generator: [{datagen.post('/stop/sample_custom','',require_ok=False)}]")
print(f'Reset offsets for streaming ingestion: [{druid.rest.post("/druid/indexer/v1/supervisor/custom_data/reset","", require_ok=False)}]')
print(f'Stop streaming ingestion: [{druid.rest.post("/druid/indexer/v1/supervisor/custom_data/terminate","", require_ok=False)}]')

Wait for streaming ingestion to complete and then remove the custom data table:

In [None]:
print(f"Drop datasource: [{druid.datasources.drop('custom_data')}]")