
Pull Queries: Perf test #3545

Closed
big-andy-coates opened this issue Oct 11, 2019 · 8 comments

big-andy-coates commented Oct 11, 2019

Current state:

Based on a simple setup like the one below:

CREATE STREAM orders_stream (order_id STRING, item_id INTEGER, qty INTEGER) WITH (KAFKA_TOPIC='orders_stream', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='JSON');

INSERT INTO orders_stream(order_id, item_id, qty) VALUES ('order-1', 1, 1);
INSERT INTO orders_stream(order_id, item_id, qty) VALUES ('order-1', 2, 3);
INSERT INTO orders_stream(order_id, item_id, qty) VALUES ('order-2', 1, 2);


SET 'auto.offset.reset'='earliest';
CREATE TABLE order_quantities AS
  SELECT order_id,
         sum(qty) as total_qty
  FROM orders_stream
  GROUP BY order_id;


SELECT * FROM order_quantities WHERE ROWKEY = 'order-1';

We found a few bottlenecks.
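For context on where the time should (and should not) go: the aggregation is maintained incrementally as rows arrive, so a pull query is conceptually just a key lookup into the table's materialized state. A minimal Java sketch of that data path (the `HashMap` here is an illustrative stand-in for the Kafka Streams state store, not the actual implementation):

```java
import java.util.HashMap;
import java.util.Map;

public class PullQuerySketch {
    public static void main(String[] args) {
        // Rows inserted into orders_stream: (order_id, item_id, qty)
        Object[][] rows = {
            {"order-1", 1, 1},
            {"order-1", 2, 3},
            {"order-2", 1, 2},
        };

        // order_quantities: GROUP BY order_id, SUM(qty), maintained
        // incrementally as rows arrive (stand-in for the state store).
        Map<String, Integer> orderQuantities = new HashMap<>();
        for (Object[] row : rows) {
            orderQuantities.merge((String) row[0], (Integer) row[2], Integer::sum);
        }

        // SELECT * FROM order_quantities WHERE ROWKEY = 'order-1';
        // is conceptually just this point lookup.
        System.out.println(orderQuantities.get("order-1")); // prints 4
    }
}
```

The lookup itself is O(1); the overheads discussed below (parsing, compilation, plan building) sit in front of it on every request.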


apurvam commented Oct 25, 2019

@vinothchandar I believe you are already working on this? Maybe reassign?

@vinothchandar

done. makes sense.

@vinothchandar

Goals going forward:

  • Get the KSQL-level overheads (parsing, compilation) to under 1 ms
  • Run a benchmark generating orders data using the ksql-datagen tool and the wrk workload generator (for ease of reproducibility by everyone)

@vinothchandar

Benchmark setup:

Use ksql-datagen to generate some orders (more command-line opts to be added):

ksql-datagen quickstart=orders topic=orders_topic

KSQL queries to set up the final table. We decide how many orders we want in the table for pull queries (1000 in the example below) and map each orders_raw row to an order ID in that range.
This involves a UDF, randomstr(min, max), which generates a random string to help vary the row sizes in storage, and a UDAF, STR_MAX, which implements a max over strings, used to update the value for a given order ID and keep one string in the agg_order_data column.

SET 'auto.offset.reset'='earliest';
CREATE STREAM orders_raw (
        ordertime BIGINT,
        orderid INT,
        itemid VARCHAR,
        orderunits DOUBLE,
        address STRUCT<
            city VARCHAR,
            state VARCHAR,
            zipcode INT>)
     WITH (
        KAFKA_TOPIC='orders_topic',
        VALUE_FORMAT='JSON');

CREATE STREAM orders_stream AS
SELECT
  *,
  CONCAT('order-', CAST(CAST(FLOOR(RANDOM() * 1000) AS BIGINT) AS VARCHAR)) as gen_orderid,
  RANDOMSTR(10, 20) as order_data
FROM orders_raw;


SET 'auto.offset.reset'='earliest';
CREATE TABLE order_quantities AS
SELECT
  gen_orderid,
  sum(orderunits) as total_qty,
  STR_MAX(order_data) as agg_order_data
FROM orders_stream
GROUP BY gen_orderid;
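For reference, a minimal Java sketch of the core logic the two helpers need (illustrative only; these are not the actual UDF/UDAF implementations, and the ksqlDB annotation plumbing is omitted):

```java
import java.util.Random;

public class OrderUdfSketch {
    // Hypothetical core of RANDOMSTR(min, max): a random alphanumeric
    // string whose length is uniform in [min, max], used to vary row
    // sizes in storage.
    static String randomStr(int min, int max, Random rnd) {
        int len = min + rnd.nextInt(max - min + 1);
        String alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append(alphabet.charAt(rnd.nextInt(alphabet.length())));
        }
        return sb.toString();
    }

    // Hypothetical aggregate step of STR_MAX: keep the lexicographically
    // larger string, so each order id retains a single value.
    static String strMax(String agg, String next) {
        return (agg == null || next.compareTo(agg) > 0) ? next : agg;
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        String s = randomStr(10, 20, rnd);
        System.out.println(s.length());          // between 10 and 20
        System.out.println(strMax("abc", "abd")); // prints abd
    }
}
```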

Once we have this, the following Lua script generates pull queries that pick a random order in the range and query it, for benchmarking:

[ksql-benchmark]$ cat orders-pull-query-bench.lua 
-- Generates pull queries that fetch out random orders from KSQL 
num_orders = 1000

function init(args)
   print("ksqlDB workload generator")
end

request = function()
   wrk.method = "POST"
   wrk.body   = '{"ksql":"SELECT * FROM order_quantities WHERE ROWKEY = \'order-' .. math.random(num_orders) .. '\';"}'
   wrk.headers["Content-Type"] = "application/vnd.ksql.v1+json"
   return wrk.format(nil, nil)
end


[ksql-benchmark]$ wrk -t 1 -c 1 -d 5 --latency -s ./orders-pull-query-bench.lua http://localhost:8088/ksql
Running 5s test @ http://localhost:8088/ksql
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   180.76us  282.09us   5.29ms   98.83%
    Req/Sec     6.17k   589.59     6.86k    78.43%
  Latency Distribution
     50%  139.00us
     75%  163.00us
     90%  215.00us
     99%  786.00us
  31297 requests in 5.10s, 143.48MB read
  Non-2xx or 3xx responses: 31297
Requests/sec:   6137.02
Transfer/sec:     28.13MB
------------------------------


vpapavas commented Nov 1, 2019

This is a flame graph of a pull query with the above-mentioned fixes:

[Screenshot: flame graph of a pull query, 2019-10-31]

The next bottleneck seems to be building the logical plan #3709

@vinothchandar

With #3542 and #3663 merged, ksql can do about 2.5K-3K pull queries per box at < 20ms p90 latency.


rodesai commented Nov 6, 2019

@vinothchandar can you add the specs of the environment you tested on? (cloud provider, instance type, memory, cpu specs, storage used, jvm settings)


vpapavas commented Nov 7, 2019

Cloud provider = AWS
Instance type = i3.xlarge
Memory = 32GB
CPU = 4 procs, 2 cores
Storage = SSD
jvm settings = no extra settings
