Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow performance INSERT'ing large JSONB column in YCQL #2110

Closed
ajcaldera1 opened this issue Aug 20, 2019 · 1 comment

Comments

@ajcaldera1
Copy link
Contributor

commented Aug 20, 2019

[UPDATE: This doesn't appear to be a JSONB specific issue, but rather an issue related to large CQL statements and the time to prepare/compile the statement. Using bind variables and reusing the prepared statements seems to address the performance problem. See later comment in this issue with a sample code snippet for optimal way to rewrite the program.]

Original Issue:
When INSERT'ing large JSONB columns into YCQL tables, a noticeable slowdown occurs as the size of the column increases. Consider the following Python 2.7 program contributed by a forum user:

import os
import json
import time
from cassandra.cluster import Cluster, ExecutionProfile

cluster = Cluster(['127.0.0.1'])
profile = ExecutionProfile()
profile.request_timeout = 60
cluster.add_execution_profile('standard', profile)
session = cluster.connect()
session.execute('CREATE KEYSPACE IF NOT EXISTS ybdemo;')

session.execute("CREATE TABLE IF NOT EXISTS ybdemo.ugent (id int PRIMARY KEY, data jsonb);")

for i in range (0,100):
    data = dict(text='#'*int(1.5**i))
    data = json.dumps(data, sort_keys=True)
    try:
        ts = time.time()
        s = "INSERT INTO ybdemo.ugent(id, data) VALUES (0, '{data}');".format(data=data)
        session.execute(s, execution_profile=profile)
    except Exception as e:
        raise
    finally:
        print(len(data), time.time() - ts)

cluster.shutdown()

Running on a RF=1 Docker created instance on a MacBook Pro (w/ SSD) and --memory_limit_hard_bytes=8589934592 set:

Alans-MacBook-Pro-2:yugabyte acaldera$ python test_json.py
(13, 0.004549980163574219)
(13, 0.0017061233520507812)
(14, 0.001416921615600586)
(15, 0.0013608932495117188)
(17, 0.001291036605834961)
(19, 0.0012450218200683594)
(23, 0.0011868476867675781)
(29, 0.0013430118560791016)
(37, 0.0014600753784179688)
(50, 0.0015170574188232422)
(69, 0.0012469291687011719)
(98, 0.0013740062713623047)
(141, 0.001354217529296875)
(206, 0.0013859272003173828)
(303, 0.0023958683013916016)
(449, 0.0019290447235107422)
(668, 0.002177000045776367)
(997, 0.0017459392547607422)
(1489, 0.0014219284057617188)
(2228, 0.0015060901641845703)
(3337, 0.0014979839324951172)
(4999, 0.0014851093292236328)
(7493, 0.0015978813171386719)
(11234, 0.003070831298828125)
(16846, 0.0034270286560058594)
(25263, 0.0036149024963378906)
(37888, 0.004394054412841797)
(56827, 0.006770133972167969)
(85234, 0.010408163070678711)
(127846, 0.015661954879760742)
(191763, 0.030829906463623047)
(287638, 0.053843021392822266)
(431451, 0.10183215141296387)
(647171, 0.21008610725402832)
(970751, 0.46135783195495605)
(1456121, 0.9927921295166016)
(2184176, 2.1320548057556152)
(3276258, 4.839470863342285)
(4914381, 10.423587083816528)
(7371566, 23.607666969299316)
(11057344, 52.74215888977051)
(16586010, 60.01662302017212)
Traceback (most recent call last):
  File "test_json.py", line 26, in <module>

  File "cassandra/cluster.py", line 2141, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 4033, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'127.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.1

@ajcaldera1 ajcaldera1 added this to To Do in YBase features via automation Aug 20, 2019

@ajcaldera1 ajcaldera1 added this to To do in JSON datatype via automation Aug 20, 2019

@kmuthukk

This comment has been minimized.

Copy link
Collaborator

commented Aug 21, 2019

I ran a 'sudo perf top' when this workload was running, and it showed these two functions on the top:

   89.96%  libql_parser.so     [.] yyFlexLexer::yy_get_previous_state
    5.17%  libql_parser.so     [.] yyFlexLexer::yy_get_next_buffer
   ...

Looks like the compilation (re-prepare of statements) with large JSONB literals embedded in the CQL statement text is inefficient, and we are spending a lot of time in the lexer. We'll look into if there's anything we can do to optimize that.

Meanwhile, please use prepared statements as a workaround, and you'll see noticeably better results. Here is a sample snippet:

import os
import json
import time
from cassandra.cluster import Cluster, ExecutionProfile

cluster = Cluster(['127.0.0.1'])
profile = ExecutionProfile()
profile.request_timeout = 60

session = cluster.connect()
session.execute('CREATE KEYSPACE IF NOT EXISTS ybdemo;')
session.execute("CREATE TABLE IF NOT EXISTS ybdemo.ugent (id int PRIMARY KEY, data jsonb);")

prepared = session.prepare("INSERT INTO ybdemo.ugent(id, data) VALUES (?, ?)")

for i in range (0,50):
    size = int(1.5**i)
    if (size > 32 * 1024 * 1024):
      size = 32 * 1024 * 1024
    data = dict(text='#'*size)
    data = json.dumps(data, sort_keys=True)
    try:
        ts = time.time()
        bound = prepared.bind((i, data))
        session.execute(bound)
    except Exception as e:
        raise
    finally:
        print(len(data), time.time() - ts)

cluster.shutdown()

And here's the corresponding output:

00:28 $ python ~/notes/json_ycql_prepared.py
(13, 0.002209186553955078)
(13, 0.0022399425506591797)
(14, 0.0016598701477050781)
(15, 0.0009889602661132812)
(17, 0.0011608600616455078)
(19, 0.002560138702392578)
(23, 0.0010418891906738281)
(29, 0.0010190010070800781)
(37, 0.003557920455932617)
(50, 0.0009789466857910156)
(69, 0.0014739036560058594)
(98, 0.0010399818420410156)
(141, 0.001013040542602539)
(206, 0.0009589195251464844)
(303, 0.0010008811950683594)
(449, 0.0009090900421142578)
(668, 0.0009469985961914062)
(997, 0.0009300708770751953)
(1489, 0.001394033432006836)
(2228, 0.0022170543670654297)
(3337, 0.002758026123046875)
(4999, 0.04141998291015625)
(7493, 0.04092907905578613)
(11234, 0.040647029876708984)
(16846, 0.040585994720458984)
(25263, 0.04088902473449707)
(37888, 0.04099392890930176)
(56827, 0.04108405113220215)
(85234, 0.04209303855895996)
(127846, 0.042201995849609375)
(191763, 0.04306983947753906)
(287638, 0.0450739860534668)
(431451, 0.04624605178833008)
(647171, 0.053468942642211914)
(970751, 0.05219316482543945)
(1456121, 0.0660390853881836)
(2184176, 0.07591390609741211)
(3276258, 0.08906793594360352)
(4914381, 0.11516404151916504)
(7371566, 0.11293411254882812)
(11057344, 0.18571805953979492)
(16586010, 0.3744819164276123)
(24879009, 0.4528789520263672)
(33554444, 0.6168291568756104)
(33554444, 0.5642240047454834)
(33554444, 0.553412914276123)
(33554444, 0.649878978729248)
(33554444, 0.63043212890625)
(33554444, 0.6016709804534912)
(33554444, 0.6068048477172852)

Please also note that when using yb-docker-ctl, it by default gives the yb-tserver processes only 1GB of memory. For workloads with large payload (like 10MB+), would recommend setting the memory higher as Alan mention in the forum thread.

Please also note that we support only up to 32MB sizes RPCs and data sizes. Therefore, I modified the program provided by the original filer of the issue on forum to cap the size at 32MB.

@kmuthukk kmuthukk added this to To Do in Query Language via automation Aug 21, 2019

@kmuthukk kmuthukk removed this from To Do in YBase features Aug 21, 2019

@kmuthukk kmuthukk removed the area/docdb label Aug 21, 2019

@kmuthukk kmuthukk closed this Aug 21, 2019

Query Language automation moved this from To Do to Done Aug 21, 2019

JSON datatype automation moved this from To do to Done Aug 21, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
3 participants
You can’t perform that action at this time.