TAJO-1430: Improve SQLAnalyzer by session-based parsing-result caching #442
Conversation
The cache is maintained by approximately LRU manner with 200 size and also cached item will be expired after 1 hour since last access.
|
Rebased. |
1 similar comment
Rebased. |
Hmm. I attached the test pass result of the following command.
https://app.box.com/s/rx85b6pv26sgo400e5dxf30bepyr86nl Currently, Travis CI seems to be unstable and the builds fail on other reasons like the following.
|
@@ -155,7 +174,7 @@ public SubmitQueryResponse executeQuery(Session session, String query, boolean i | |||
if (isJson) { | |||
planningContext = buildExpressionFromJson(query); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better if the cache is supported for json queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it slow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly, I didn't estimate the exact time. But, using cache will be much faster than parsing huge json queries, I think.
Thanks for the contribution. |
Thank you, @jihoonson . By the way, could you see the Jira, too? There was some discussion about this. https://issues.apache.org/jira/browse/TAJO-1430 It's just I want to make sure what I understand. |
@jihoonson , I changed this patch according to your advice except JSON. |
// Set queryCache in session | ||
if (session.getQueryCache() == null) { | ||
session.setQueryCache(CacheBuilder.newBuilder() | ||
.maximumSize(200) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, the size means the number of items contained in the cache. So, the actual cache size depends on the size of cached queries. This will have a problem of potentially exhausting memory. So, it would be better to set the maximum weight rather than the maximum size.
In addition, I think that the maximum weight and the expiration period should be configurable. ConfVar
would be a good place because we don't need to maintain different configurations for each user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea. No problem. By the way, the weight should be estimated by SQL query string length, right? Or, do you have some Util to measure Expr memory consumption? Up to now, I can not find that kind of code in Expr. Do you think SQL query string length is okay?
Ok. I'll investigate the json parsing performance. If you have any references for that, please share with me. I left another comment. |
In fact, I don't know any use cases of JSON in the real fields. Is it popular to use JSON query? |
We support json queries to address some requests to integrate other olap tools. As you know, generating json is much easier than generating sql. On the weight, the query length looks good. |
According to your advice, I updated and finished testing the code. In these days, Travis CI is prone to fail. If you need, I will upload my tested log. |
Travis CI fails. Here is my succeeded test log.
|
Thanks. Would you mind triggering Jenkins by putting your patch on Jira? |
Sure! |
Hi, @jihoonson . Jenkins fails to start the test.
What can I do now? Please visit the following link for detail info. https://builds.apache.org/job/PreCommit-TAJO-Build/719/console |
Hmm. Let me check for a second. I found the following Jenkins logs.
|
For JDK6, I split one function call line into two lines. |
@@ -203,6 +203,7 @@ public static int setDateOrder(int dateOrder) { | |||
|
|||
// Query Configuration | |||
QUERY_SESSION_TIMEOUT("tajo.query.session.timeout-sec", 60, Validators.min("0")), | |||
QUERY_SESSION_CACHE_SIZE("tajo.query.session.cache-size", 1000000, Validators.min("1000000")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a couple of comments on this line.
- The cache size seems to be specified in bytes. This will be difficult for humans to specify the exact cache size. How about using the size unit of KB?
- The session variable name looks too general even though this cache is only for parsed queries. It would be better if users can figure out what will be configured from its name. Also, the size unit should be included in the name like
tajo.task.size-mb
. - The minimum cache size is 1 GB. Do you have any reasons?
- In addition to the size configuration, it would be great if users can turn off/on this cache feature. This is because the cache may be useless in some workloads such as ad-hoc analysis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @jihoonson . Thank you for kind advice.
- What about tajo.query.session.query-cache-size-mb, then? I think it would be more proper according to your advice.
- The minimum cache size is 1MB in terms of SQL String length. As you see in TAJO-1430 example, 100K-length query takes over 30 seconds in SQL parsing, so the cache will hold up to 10 x 100K queries accessed in last 1 hour. In real situation, 10 or more MB is needed.
- For the cache on/off feautre, is it okay by using the condition tajo.query.session.query-cache-size-mb=0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right. I meant 1MB not 1GB. According to your comment, it takes quite long time to parse queries even though their size is a few KBs. So, I think that KB is appropriate for the unit of size.
On minimum cache size, Validators.min("1000000")
means that the cache size under 1MB is an invalid value for this configuration. I think that you didn't intend that, but if so, please tell me the reason.
On the cache on/off, your suggestion looks good. In addition to your suggestion, it will be good if we can avoid checking the cache when the configured cache size is 0. I think that this cache will not be used in many cases because the cached data can be used only when the exactly same queries are submitted repeatedly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- tajo.query.session.query-cache-size-kb, then? No problem.
- On minimum cache size, let say 1K-size cache holding 10 x 100Byte SQL queries. The effect is small. In this case, we had better off the cache. In that sense, we need to guide the effective cache size practically.
- If cache is off, the cache will be null and Tajo will avoid checking the cache. I agree that it's important.
- This query cache will be important especially for PreparedStatement of TAJO-1435. (I'm working on this too.) PreparedStatement is popular in real enterprise environments. I did add PLACE_HOLDER in SQL syntax for '?' of PreparedStatement and am trying to replace them after cloning Expr. (Anyway, this is beyond the scope of this issue).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your rapid reply. Here are my answers.
- Thanks for understanding.
- I mean, with
Validators.min("1000000")
, Tajo won't accept tajo.query.session.query-cache-size-mb=0. Validators should be used to validate the configuration. So, IMO,Validators.min(0)
is better to prohibit negative values. - Thanks for understanding.
- I also think that query caching is important. I mean, the current implementation will not be popularly used because the cached data can be used only when the exactly same queries are submitted repeatedly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- You're right. Validators.min(0) is proper now because we use 0 for cache off option, too.
- I see. Current implementation does.
Now, I will update the code soon. Thank you for reviewing, always!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your understanding!
I changed the code and finished testing on my laptop. |
Thanks. I think that this work is almost done with the latest patch. |
Thank you, @jihoonson . |
+1 LGTM! |
Please see the issue TAJO-1430 for the effect.