Skip to content

Latest commit

 

History

History
291 lines (273 loc) · 8.6 KB

datagen.md

File metadata and controls

291 lines (273 loc) · 8.6 KB
title weight type aliases
DataGen
13
docs
/dev/table/connectors/datagen.html

DataGen SQL Connector

{{< label "Scan Source: Bounded" >}} {{< label "Scan Source: UnBounded" >}}

The DataGen connector allows for creating tables based on in-memory data generation. This is useful when developing queries locally without access to external systems such as Kafka. Tables can include [Computed Column syntax]({{< ref "docs/dev/table/sql/create" >}}#create-table) which allows for flexible record generation.

The DataGen connector is built-in, no additional dependencies are required.

Usage

By default, a DataGen table will create an unbounded number of rows with a random value for each column. For variable sized types, char/varchar/string/array/map/multiset, the length can be specified. Additionally, a total number of rows can be specified, resulting in a bounded table.

There also exists a sequence generator, where users specify a sequence of start and end values. If any column in a table is a sequence type, the table will be bounded and end with the first sequence completes.

Time types are always the local machines current system time.

CREATE TABLE Orders (
    order_number BIGINT,
    price        DECIMAL(32,2),
    buyer        ROW<first_name STRING, last_name STRING>,
    order_time   TIMESTAMP(3)
) WITH (
  'connector' = 'datagen'
)

Often, the data generator connector is used in conjunction with the LIKE clause to mock out physical tables.

CREATE TABLE Orders (
    order_number BIGINT,
    price        DECIMAL(32,2),
    buyer        ROW<first_name STRING, last_name STRING>,
    order_time   TIMESTAMP(3)
) WITH (...)

-- create a bounded mock table
CREATE TEMPORARY TABLE GenOrders
WITH (
    'connector' = 'datagen',
    'number-of-rows' = '10'
)
LIKE Orders (EXCLUDING ALL)

Types

Type Supported Generators Notes
BOOLEAN random
CHAR random / sequence
VARCHAR random / sequence
STRING random / sequence
DECIMAL random / sequence
TINYINT random / sequence
SMALLINT random / sequence
INT random / sequence
BIGINT random / sequence
FLOAT random / sequence
DOUBLE random / sequence
DATE random Always resolves to the current date of the local machine.
TIME random Always resolves to the current time of the local machine.
TIMESTAMP random Resolves a past timestamp relative to the current timestamp of the local machine. The max past can be specified by the 'max-past' option.
TIMESTAMP_LTZ random Resolves a past timestamp relative to the current timestamp of the local machine. The max past can be specified by the 'max-past' option.
INTERVAL YEAR TO MONTH random
INTERVAL DAY TO MONTH random
ROW random Generates a row with random subfields.
ARRAY random Generates an array with random entries.
MAP random Generates a map with random entries.
MULTISET random Generates a multiset with random entries.

Connector Options

Option Required Default Type Description
connector
required (none) String Specify what connector to use, here should be 'datagen'.
rows-per-second
optional 10000 Long Rows per second to control the emit rate.
number-of-rows
optional (none) Long The total number of rows to emit. By default, the table is unbounded.
fields.#.kind
optional random String Generator of this '#' field. Can be 'sequence' or 'random'.
fields.#.min
optional (Minimum value of type) (Type of field) Minimum value of random generator, work for numeric types.
fields.#.max
optional (Maximum value of type) (Type of field) Maximum value of random generator, work for numeric types.
fields.#.max-past
optional 0 Duration Maximum past of timestamp random generator, only works for timestamp types.
fields.#.length
optional 100 Integer Size or length of the collection for generating char/varchar/string/array/map/multiset types.
fields.#.start
optional (none) (Type of field) Start value of sequence generator.
fields.#.end
optional (none) (Type of field) End value of sequence generator.