# Data Generator Server
The default docker compose deployment includes a data generation service created from the published docker image `imply/datagen:latest`. 
This image is built by the project https://github.com/implydata/druid-datagenerator. 

To interact with the data generation service, you can use the REST client provided in the `druidapi` Python package.

In [None]:
import druidapi
import os

# Datagen client 
datagen = druidapi.rest.DruidRestClient("http://datagen:9999")

# os.environ.get returns None for an unset variable instead of raising KeyError
if os.environ.get('DRUID_HOST') is None:
    druid_host = "http://router:8888"
else:
    druid_host = f"http://{os.environ['DRUID_HOST']}:8888"

# Druid client
druid = druidapi.jupyter_client(druid_host)
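
The host lookup above follows a pattern that recurs later in this notebook: read an environment variable and fall back to the docker compose service name. A minimal sketch of that pattern as a helper follows. The function name `resolve_host` and its `environ` parameter are illustrative and not part of `druidapi`.

```python
import os

def resolve_host(env_var, default_host, environ=None):
    """Return the host named by env_var, or default_host when it is unset or empty."""
    # `environ` is a hypothetical parameter that lets the lookup be tried without
    # touching the real process environment.
    environ = os.environ if environ is None else environ
    # .get() returns None for a missing key; `or` also covers an empty string.
    return environ.get(env_var) or default_host

# Unset variable: fall back to the docker compose service name.
print(f"http://{resolve_host('DRUID_HOST', 'router', environ={})}:8888")
# Variable set: use the configured host.
print(f"http://{resolve_host('DRUID_HOST', 'router', environ={'DRUID_HOST': 'localhost'})}:8888")
```

Treating an empty value like an unset one is a design choice of this sketch. Drop the `or` in favor of `environ.get(env_var, default_host)` if an empty string should be passed through.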


### List available configurations
Use the /list API to get the data generator's available configuration files, each of which contains a predefined data generator schema.

In [None]:
display(datagen.get(f"/list", require_ok=False).json())

### Generate a data file for backfilling history
When generating a file for backfill purposes, you can select the start time and the duration of the simulation.
This example shows how to do that:
- "target" specifies "type":"file" which generates a data file.
- "path" within the "target" is only a filename; any directory path included in it is ignored.
- The data generator simulates time when you specify a start time in the "time_type" property and a duration in the "time" property.
- "concurrency" indicates the maximum number of entities used concurrently to generate events. Each entity is a separate state machine that simulates things like user sessions, IoT devices, or other concurrent sources of event data. 

In [None]:
from datetime import datetime, timedelta
import json

# determine start time, in this example we are starting one hour ago 
startDateTime = (datetime.now() - timedelta(hours = 1)).strftime('%Y-%m-%dT%H:%M:%S.001')
print(f"Starting to generate history at {startDateTime}.")

job_name="gen_clickstream1"

headers = {
  'Content-Type': 'application/json'
}

# this request generates a data file on the datagen server
datagen_request = {
    "name": job_name,
    "target": { "type": "file", "path":"clicks.json"},
    "config_file": "clickstream/clickstream.json", 
    "time": "1h",
    "concurrency":100,
    "time_type": startDateTime
}
response = datagen.post("/start", json.dumps(datagen_request), headers=headers, require_ok=False)
response.json()
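
If you backfill more than once, it can help to build the request body in one place. The following is a minimal sketch that only assembles the dictionary and does not call the server. The helper name `backfill_request` and its parameters are illustrative, and it assumes that the "time" property accepts other hour counts in the same `<n>h` form as the "1h" used above.

```python
from datetime import datetime, timedelta

def backfill_request(name, filename, config_file, hours_back, concurrency=100, now=None):
    """Build a /start request body that simulates the last `hours_back` hours into a file."""
    # `now` is injectable so the result is reproducible; it defaults to the current time.
    now = now or datetime.now()
    start = now - timedelta(hours=hours_back)
    return {
        "name": name,
        "target": {"type": "file", "path": filename},
        "config_file": config_file,
        # Assumption: durations other than "1h" use the same "<n>h" form.
        "time": f"{hours_back}h",
        "concurrency": concurrency,
        # Same timestamp format as the cell above.
        "time_type": start.strftime("%Y-%m-%dT%H:%M:%S.001"),
    }

example = backfill_request(
    "gen_clickstream_3h", "clicks_3h.json", "clickstream/clickstream.json",
    hours_back=3, now=datetime(2024, 1, 1, 12, 0, 0),
)
print(example)
```

The resulting dictionary can be posted to /start with `json.dumps`, in the same way as `datagen_request` above.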

### Display jobs
Use the /jobs API to get the current jobs and their status.

In [None]:
display(datagen.get(f"/jobs").json())

### Get status of a job
Use the /status/\<job_name> API to get the status of a single job.

In [None]:
display(datagen.get(f"/status/{job_name}", require_ok=False).json())
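
A file-generation job needs to finish before the file is ingested, so a small polling loop can be useful. The sketch below is deliberately generic: it takes a function that fetches the status and a predicate that decides when the job is done, because the exact shape of the status response is not shown in this notebook. The `"status"` field and the `"COMPLETE"` value in the demonstration are assumptions; check them against the output of the cell above.

```python
import time

def wait_for_job(fetch_status, is_finished, poll_seconds=5.0, timeout_seconds=600.0,
                 sleep=time.sleep):
    """Call fetch_status() until is_finished(status) is true or the timeout elapses."""
    waited = 0.0
    while True:
        status = fetch_status()
        if is_finished(status):
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job not finished after {timeout_seconds} seconds")
        sleep(poll_seconds)
        waited += poll_seconds

# Demonstration with canned responses instead of a live server.
# The "status" field and its values are assumed, not taken from the datagen API.
responses = iter([{"status": "RUNNING"}, {"status": "RUNNING"}, {"status": "COMPLETE"}])
final = wait_for_job(
    fetch_status=lambda: next(responses),
    is_finished=lambda s: s.get("status") == "COMPLETE",
    poll_seconds=0,
    sleep=lambda seconds: None,  # skip real sleeping in the demonstration
)
print(final)
```

Against the live service, `fetch_status` could be `lambda: datagen.get(f"/status/{job_name}", require_ok=False).json()`, with `is_finished` adapted to whatever fields the response actually contains.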

### Stop a job
Use the /stop/\<job_name> API to stop a job.

In [None]:
display(datagen.post(f"/stop/{job_name}", '').json())

### List files created on datagen server
Use the /files API to list files available on the server.

In [None]:
display(datagen.get("/files").json())

### Batch Loading of Generated Files
Use a [Druid HTTP input source](https://druid.apache.org/docs/latest/ingestion/native-batch-input-sources.html#http-input-source) in the [EXTERN function](https://druid.apache.org/docs/latest/multi-stage-query/reference.html#extern-function) of a [SQL-based ingestion](https://druid.apache.org/docs/latest/multi-stage-query/index.html) to load generated files.
Files are accessible by name at `http://datagen:9999/file/<name of the file>`. If you are ingesting into a Druid instance that runs locally but outside of Docker, use `http://localhost:9999/file/<name of the file>` instead.
The following example assumes that both Druid and the data generator server are running in docker compose.

In [None]:
sql = '''
REPLACE INTO "clicks" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["http://datagen:9999/file/clicks.json"]}',
    '{"type":"json"}'
  )
) EXTEND ("time" VARCHAR, "user_id" VARCHAR, "event_type" VARCHAR, "client_ip" VARCHAR, "client_device" VARCHAR, "client_lang" VARCHAR, "client_country" VARCHAR, "referrer" VARCHAR, "keyword" VARCHAR, "product" VARCHAR))
SELECT
  TIME_PARSE("time") AS "__time",
  "user_id",
  "event_type",
  "client_ip",
  "client_device",
  "client_lang",
  "client_country",
  "referrer",
  "keyword",
  "product"
FROM "ext"
PARTITIONED BY DAY
'''
druid.sql.run_task(sql)
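
The statement above repeats every column name in both the EXTEND clause and the SELECT list. A minimal sketch of generating the same shape of statement from a column list follows. The helper name `extern_ingest_sql` is illustrative. It assumes every column is read as VARCHAR, as in the example above, and that the names and URI contain no quote characters.

```python
import json

def extern_ingest_sql(table, uri, time_column, columns):
    """Build a REPLACE ... EXTERN statement that reads VARCHAR columns from a JSON file."""
    # json.dumps produces the input source JSON with correct quoting of the URI.
    input_source = json.dumps({"type": "http", "uris": [uri]})
    # The time column is read as VARCHAR too and parsed in the SELECT list.
    extend = ", ".join(f'"{c}" VARCHAR' for c in [time_column] + list(columns))
    select = ",\n  ".join(f'"{c}"' for c in columns)
    return f'''REPLACE INTO "{table}" OVERWRITE ALL
WITH "ext" AS (SELECT *
FROM TABLE(
  EXTERN(
    '{input_source}',
    '{{"type":"json"}}'
  )
) EXTEND ({extend}))
SELECT
  TIME_PARSE("{time_column}") AS "__time",
  {select}
FROM "ext"
PARTITIONED BY DAY'''

generated = extern_ingest_sql(
    "clicks", "http://datagen:9999/file/clicks.json", "time",
    ["user_id", "event_type", "client_ip"],
)
print(generated)
```

The returned string can be passed to `druid.sql.run_task` in the same way as the hand-written statement.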

In [None]:
druid.display.sql('''
SELECT  event_type, 
        count( DISTINCT "user_id") users, 
        count( DISTINCT "client_ip") ips, 
        count( DISTINCT "client_ip") - count( DISTINCT "user_id") ips_minus_users
FROM "clicks"
GROUP BY 1
HAVING count( DISTINCT "user_id") - count( DISTINCT "client_ip") < 0
ORDER BY 4 DESC
''')


## Generating custom data

You can find the full set of configuration options in the [data generator project's readme](https://github.com/implydata/druid-datagenerator#data-generator-configuration).

In this section we use a simple custom configuration as an example to generate some data.

In [None]:
gen_config = {
  "emitters": [
    {
      "name": "simple_record",
      "dimensions": [
        {
          "type": "string",
          "name": "random_string_column",
          "length_distribution": {
            "type": "constant",
            "value": 13
          },
          "cardinality": 0,
          "chars": "#.abcdefghijklmnopqrstuvwxyz"
        },
        {
          "type": "int",
          "name": "distributed_number",
          "distribution": {
            "type": "uniform",
            "min": 0,
            "max": 1000
          },
          "cardinality": 10,
          "cardinality_distribution": {
            "type": "exponential",
            "mean": 5
          }
        }
      ]
    }
  ],
  "interarrival": {
    "type": "constant",
    "value": 1
  },
  "states": [
    {
      "name": "state_1",
      "emitter": "simple_record",
      "delay": {
        "type": "constant",
        "value": 1
      },
      "transitions": [
        {
          "next": "state_1",
          "probability": 1.0
        }
      ]
    }
  ]
}

target = { "type":"file", "path":"sample_data.json"}
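
Before submitting a hand-written configuration, a quick local sanity check can catch simple mistakes. The sketch below encodes three checks that are assumptions about what a consistent configuration looks like, not rules quoted from the data generator project: each state names a defined emitter, each transition points at a defined state (or at a terminal name, here assumed to be `"stop"`), and the transition probabilities of a state sum to 1.0. The function name `check_generator_config` is illustrative.

```python
def check_generator_config(config, terminal_states=("stop",)):
    """Return a list of human-readable problems found in a generator config dict."""
    problems = []
    emitters = {e["name"] for e in config.get("emitters", [])}
    states = {s["name"] for s in config.get("states", [])}
    for state in config.get("states", []):
        name = state["name"]
        if state.get("emitter") not in emitters:
            problems.append(f"state {name!r} uses unknown emitter {state.get('emitter')!r}")
        transitions = state.get("transitions", [])
        for t in transitions:
            # terminal_states is an assumed set of names that end a simulated entity.
            if t["next"] not in states and t["next"] not in terminal_states:
                problems.append(f"state {name!r} transitions to unknown state {t['next']!r}")
        total = sum(t["probability"] for t in transitions)
        if abs(total - 1.0) > 1e-9:
            problems.append(f"state {name!r} transition probabilities sum to {total}, not 1.0")
    return problems

# A consistent config shaped like the one above, and a deliberately broken one.
good = {
    "emitters": [{"name": "simple_record", "dimensions": []}],
    "states": [{"name": "state_1", "emitter": "simple_record",
                "transitions": [{"next": "state_1", "probability": 1.0}]}],
}
bad = {
    "emitters": [{"name": "simple_record", "dimensions": []}],
    "states": [{"name": "state_1", "emitter": "missing",
                "transitions": [{"next": "state_2", "probability": 0.5}]}],
}
print(check_generator_config(good))
print(check_generator_config(bad))
```

Calling `check_generator_config(gen_config)` on the configuration defined above should return an empty list under these assumed rules.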

Now, instead of using a config_file, we use the config attribute of the request to supply our custom generator configuration.

In [None]:
# generate 1 hour of simulated time using custom configuration
datagen_request = {
    "name": "sample_custom",
    "target": target,
    "config": gen_config, 
    "time": "1h",
    "concurrency":10,
    "time_type": "SIM"
}
response = datagen.post("/start", json.dumps(datagen_request), headers=headers, require_ok=False)
response.json()

In [None]:
display(datagen.get(f"/jobs", require_ok=False).json())

In [None]:

display( datagen.get(f"/file/sample_data.json").content[:1024])
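
The raw bytes shown above are cut at 1024 bytes, so the last record is usually incomplete. A minimal sketch for turning such a chunk into Python dictionaries follows. It assumes the generated file holds one JSON object per line, which is worth confirming against the output above. The helper name `parse_complete_lines` is illustrative.

```python
import json

def parse_complete_lines(chunk):
    """Parse newline-delimited JSON from a byte chunk, ignoring a truncated last line."""
    # errors="replace" tolerates a multi-byte character cut in half at the chunk boundary.
    text = chunk.decode("utf-8", errors="replace")
    lines = text.split("\n")
    # Everything before the final newline is a complete line; the remainder may be cut off.
    complete = lines[:-1]
    return [json.loads(line) for line in complete if line.strip()]

# Canned sample with a truncated third record, standing in for the server response.
sample = b'{"value": 1}\n{"value": 2}\n{"val'
records = parse_complete_lines(sample)
print(records)
```

With the live service, the same function could be applied to `datagen.get("/file/sample_data.json").content[:1024]`.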

## Streaming generated data

The data generator works exactly the same whether it is outputting data to a file or publishing messages into a stream. All you need to change is the target configuration.

To use the Kafka container that runs as part of the docker compose deployment, use the host name `kafka:9092`. The following code uses the KAFKA_HOST variable, specified when bringing up the cluster, to select the appropriate host.

In [None]:
# os.environ.get returns None for an unset variable instead of raising KeyError
if os.environ.get('KAFKA_HOST') is None:
    kafka_host = "kafka:9092"
else:
    kafka_host = f"{os.environ['KAFKA_HOST']}:9092"

The simplest `target` object for Kafka (and similarly Confluent) is:

In [None]:
target = {
    "type":"kafka",
    "endpoint": kafka_host,
    "topic": "custom_data"
}

# generate 1 hour of simulated time using custom configuration
datagen_request = {
    "name": "sample_custom",
    "target": target,
    "config": gen_config, 
    "time": "1h",
    "concurrency":10,
    "time_type": "SIM"
}
response = datagen.post("/start", json.dumps(datagen_request), headers=headers, require_ok=False)
response.json()

In [None]:
display(datagen.get(f"/jobs", require_ok=False).json())

### Ingesting data from a stream 
This example shows how to start a streaming ingestion for the custom data being published:

In [None]:
ingestion_spec ={
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "consumerProperties": {
        "bootstrap.servers": "kafka:9092"
      },
      "topic": "custom_data",
      "inputFormat": {
        "type": "json"
      },
      "useEarliestOffset": True
    },
    "tuningConfig": {
      "type": "kafka",
      "maxRowsInMemory": 100000,
      "resetOffsetAutomatically": False
    },
    "dataSchema": {
      "dataSource": "custom_data",
      "timestampSpec": {
        "column": "time",
        "format": "iso"
      },
      "dimensionsSpec": {
        "dimensions": [
          "random_string_column",
          {
            "type": "long",
            "name": "distributed_number"
          }
        ]
      },
      "granularitySpec": {
        "queryGranularity": "none",
        "rollup": False,
        "segmentGranularity": "hour"
      }
    }
  }
}

headers = {
  'Content-Type': 'application/json'
}

druid.rest.post("/druid/indexer/v1/supervisor", json.dumps(ingestion_spec), headers=headers)
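
The `dimensions` list in the ingestion spec mirrors the emitter dimensions in the generator configuration. A minimal sketch of deriving one from the other follows. The type mapping is an assumption based only on the two types used in this notebook ("string" stays a plain name, "int" becomes a Druid "long"), and any other generator type falls back to a string dimension. The helper name `druid_dimensions` is illustrative.

```python
def druid_dimensions(emitter, type_map=None):
    """Translate a generator emitter's dimensions into a Druid dimensionsSpec list."""
    # Assumed mapping; extend it for other generator types as needed.
    type_map = type_map or {"string": "string", "int": "long"}
    dimensions = []
    for dim in emitter["dimensions"]:
        druid_type = type_map.get(dim["type"], "string")
        if druid_type == "string":
            # String dimensions can be listed by name only.
            dimensions.append(dim["name"])
        else:
            dimensions.append({"type": druid_type, "name": dim["name"]})
    return dimensions

# An emitter trimmed down to the fields this sketch reads.
sample_emitter = {
    "name": "simple_record",
    "dimensions": [
        {"type": "string", "name": "random_string_column"},
        {"type": "int", "name": "distributed_number"},
    ],
}
print(druid_dimensions(sample_emitter))
```

For the configuration used in this notebook, `druid_dimensions(gen_config["emitters"][0])` should produce the same list as the hand-written `dimensionsSpec` above.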

In [None]:
druid.display.sql('''
SELECT random_string_column, MAX(distributed_number)
FROM custom_data
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
''')