title | weight | type | aliases | |
---|---|---|---|---|
DataGen |
13 |
docs |
|
{{< label "Scan Source: Bounded" >}} {{< label "Scan Source: UnBounded" >}}
The DataGen connector allows for creating tables based on in-memory data generation. This is useful when developing queries locally without access to external systems such as Kafka. Tables can include [Computed Column syntax]({{< ref "docs/dev/table/sql/create" >}}#create-table) which allows for flexible record generation.
The DataGen connector is built-in, no additional dependencies are required.
By default, a DataGen table will create an unbounded number of rows with a random value for each column. For variable sized types, char/varchar/string/array/map/multiset, the length can be specified. Additionally, a total number of rows can be specified, resulting in a bounded table.
There also exists a sequence generator, where users specify a sequence of start and end values. If any column in a table is a sequence type, the table will be bounded and end with the first sequence completes.
Time types are always the local machines current system time.
CREATE TABLE Orders (
order_number BIGINT,
price DECIMAL(32,2),
buyer ROW<first_name STRING, last_name STRING>,
order_time TIMESTAMP(3)
) WITH (
'connector' = 'datagen'
)
Often, the data generator connector is used in conjunction with the LIKE
clause to mock out physical tables.
CREATE TABLE Orders (
order_number BIGINT,
price DECIMAL(32,2),
buyer ROW<first_name STRING, last_name STRING>,
order_time TIMESTAMP(3)
) WITH (...)
-- create a bounded mock table
CREATE TEMPORARY TABLE GenOrders
WITH (
'connector' = 'datagen',
'number-of-rows' = '10'
)
LIKE Orders (EXCLUDING ALL)
Type | Supported Generators | Notes |
---|---|---|
BOOLEAN | random | |
CHAR | random / sequence | |
VARCHAR | random / sequence | |
STRING | random / sequence | |
DECIMAL | random / sequence | |
TINYINT | random / sequence | |
SMALLINT | random / sequence | |
INT | random / sequence | |
BIGINT | random / sequence | |
FLOAT | random / sequence | |
DOUBLE | random / sequence | |
DATE | random | Always resolves to the current date of the local machine. |
TIME | random | Always resolves to the current time of the local machine. |
TIMESTAMP | random | Resolves a past timestamp relative to the current timestamp of the local machine. The max past can be specified by the 'max-past' option. |
TIMESTAMP_LTZ | random | Resolves a past timestamp relative to the current timestamp of the local machine. The max past can be specified by the 'max-past' option. |
INTERVAL YEAR TO MONTH | random | |
INTERVAL DAY TO MONTH | random | |
ROW | random | Generates a row with random subfields. |
ARRAY | random | Generates an array with random entries. |
MAP | random | Generates a map with random entries. |
MULTISET | random | Generates a multiset with random entries. |
Option | Required | Default | Type | Description |
---|---|---|---|---|
required | (none) | String | Specify what connector to use, here should be 'datagen'. | |
optional | 10000 | Long | Rows per second to control the emit rate. | |
optional | (none) | Long | The total number of rows to emit. By default, the table is unbounded. | |
optional | random | String | Generator of this '#' field. Can be 'sequence' or 'random'. | |
optional | (Minimum value of type) | (Type of field) | Minimum value of random generator, work for numeric types. | |
optional | (Maximum value of type) | (Type of field) | Maximum value of random generator, work for numeric types. | |
optional | 0 | Duration | Maximum past of timestamp random generator, only works for timestamp types. | |
optional | 100 | Integer | Size or length of the collection for generating char/varchar/string/array/map/multiset types. | |
optional | (none) | (Type of field) | Start value of sequence generator. | |
optional | (none) | (Type of field) | End value of sequence generator. |