This repository has been archived by the owner on May 12, 2021. It is now read-only.

TAJO-1430: Improve SQLAnalyzer by session-based parsing-result caching #442

Closed
wants to merge 1 commit into from


dongjoon-hyun
Member

Please see the issue TAJO-1430 for the effect.

@dongjoon-hyun
Member Author

The cache is maintained in an approximately LRU manner with a maximum of 200 entries, and a cached item expires 1 hour after its last access.

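// Per-session cache of parsed queries: SQL text -> parsed Expr
session.setQueryCache(CacheBuilder.newBuilder()
    .maximumSize(200)                      // at most 200 entries
    .expireAfterAccess(1, TimeUnit.HOURS)  // drop entries unused for 1 hour
    .build(new CacheLoader<String, Expr>() {
      @Override
      public Expr load(String sql) throws SQLSyntaxError {
        return analyzer.parse(sql);        // parse only on a cache miss
      }
    }));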

@dongjoon-hyun
Member Author

Rebased.

@dongjoon-hyun
Member Author

Hmm. I attached the passing test log from the following command.

mvn clean install -Pparallel-test,hcatalog-0.12.0 -DLOG_LEVEL=ERROR -Dmaven.fork.count=2 > TAJO-1430.travis.log.txt

https://app.box.com/s/rx85b6pv26sgo400e5dxf30bepyr86nl

Currently, Travis CI seems to be unstable, and the builds fail for other reasons, such as the following.

ERROR: org.apache.tajo.master.rm.TajoWorkerResourceManager (run(346)) - java.lang.InterruptedException
2015-04-04 10:12:39,174 ERROR: org.apache.tajo.util.history.HistoryWriter (writeHistory(318)) - Error while saving query history: q_1428141094535_0543:Filesystem closed

@@ -155,7 +174,7 @@ public SubmitQueryResponse executeQuery(Session session, String query, boolean i
if (isJson) {
planningContext = buildExpressionFromJson(query);
Contributor

It would be better if the cache also supported JSON queries.

Member Author

Is it slow?

Contributor

Honestly, I haven't measured the exact time, but I think using the cache will be much faster than parsing huge JSON queries.

@jihoonson
Contributor

Thanks for the contribution.
I left some comments.

@dongjoon-hyun
Member Author

Thank you, @jihoonson. By the way, could you look at the Jira issue, too? There was some discussion about this there.

https://issues.apache.org/jira/browse/TAJO-1430

I just want to make sure I understand correctly.
By the way, sorry for the outdated title of this pull request.
I'll update it now to match the Jira title.

@dongjoon-hyun dongjoon-hyun changed the title TAJO-1430: Implement Query Parsing Result Caching TAJO-1430: Improve SQLAnalyzer by session-based parsing-result caching Apr 5, 2015
@dongjoon-hyun
Member Author

@jihoonson, I changed this patch according to your advice, except for the JSON part.
I'm still not sure whether the JSON parser has a performance issue. I think that is a little beyond this issue's scope.

// Set queryCache in session
if (session.getQueryCache() == null) {
session.setQueryCache(CacheBuilder.newBuilder()
.maximumSize(200)
Contributor

Here, the size means the number of items contained in the cache, so the actual memory used depends on the size of the cached queries. This could exhaust memory. It would therefore be better to set a maximum weight rather than a maximum size.

In addition, I think that the maximum weight and the expiration period should be configurable. ConfVar would be a good place because we don't need to maintain different configurations for each user.

Member Author

That's a good idea. No problem. By the way, the weight should be estimated from the SQL query string length, right? Or do you have a utility to measure Expr memory consumption? So far, I haven't found that kind of code in Expr. Do you think the SQL query string length is okay?

@jihoonson
Contributor

OK. I'll investigate the JSON parsing performance. If you have any references for that, please share them with me.

I left another comment.
Thanks.

@dongjoon-hyun
Member Author

In fact, I don't know of any real-world use cases for JSON queries. Are JSON queries commonly used?

@jihoonson
Contributor

We support JSON queries to address some requests to integrate with other OLAP tools. As you know, generating JSON is much easier than generating SQL.

On the weight, the query length looks good.
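
To make the weight idea concrete, here is a minimal sketch of a weight-bounded cache in which each entry weighs as much as its SQL string is long. It uses only the JDK (an access-ordered LinkedHashMap) rather than Guava, so it is a swapped-in illustration and not the patch's code; with Guava the same idea would presumably be expressed through CacheBuilder's maximumWeight and weigher options. The class name WeightedQueryCache and its methods are hypothetical.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: an LRU cache bounded by total SQL length, not entry count. */
final class WeightedQueryCache<V> {
    private final long maxWeight;      // total number of SQL characters allowed
    private long currentWeight = 0;
    // accessOrder = true: iteration runs from least to most recently used
    private final LinkedHashMap<String, V> entries =
        new LinkedHashMap<String, V>(16, 0.75f, true);

    WeightedQueryCache(long maxWeight) {
        this.maxWeight = maxWeight;
    }

    /** Weight of an entry: the length of its SQL string. */
    private static int weigh(String sql) {
        return sql.length();
    }

    synchronized V get(String sql) {
        return entries.get(sql);       // also marks the entry as most recently used
    }

    synchronized void put(String sql, V parsed) {
        int weight = weigh(sql);
        if (weight > maxWeight) {
            return;                    // a single oversized query is never cached
        }
        boolean existed = entries.containsKey(sql);
        entries.put(sql, parsed);
        if (!existed) {
            currentWeight += weight;
        }
        // Evict least recently used entries until the total weight fits again.
        Iterator<Map.Entry<String, V>> it = entries.entrySet().iterator();
        while (currentWeight > maxWeight && it.hasNext()) {
            Map.Entry<String, V> eldest = it.next();
            currentWeight -= weigh(eldest.getKey());
            it.remove();
        }
    }

    synchronized int size() {
        return entries.size();
    }

    synchronized long weight() {
        return currentWeight;
    }
}
```

With a maximum weight of 10, caching three 4-character queries evicts whichever of the first two was used least recently, regardless of how many entries the cache holds.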

@dongjoon-hyun
Member Author

Following your advice, I updated the code and finished testing it. These days, Travis CI is prone to failure. If you need it, I will upload my test log.

@dongjoon-hyun
Member Author

Travis CI fails. Here is my successful test log.

mvn clean install -Pparallel-test,hcatalog-0.12.0 -DLOG_LEVEL=WARN -Dmaven.fork.count=2 > TAJO-1430.150407.log.txt

https://app.box.com/s/s7ivihn6myvpr2sbr0t6cfd8fb55zawj

@jihoonson
Contributor

Thanks. Would you mind triggering Jenkins by putting your patch on Jira?
Jenkins is better than Travis because it also includes the Findbugs test.

@dongjoon-hyun
Member Author

Sure!

@dongjoon-hyun
Member Author

Hi, @jihoonson. Jenkins fails to start the test.

Compiling /home/jenkins/jenkins-slave/workspace/PreCommit-TAJO-Build/incubator-tajo
/home/jenkins/tools/maven/latest/bin/mvn clean test -DskipTests -Phcatalog-0.12.0 > /home/jenkins/jenkins-slave/workspace/PreCommit-TAJO-Build/patchprocess/trunkJavacWarnings.txt 2>&1
Trunk compilation is broken?

What can I do now? Please see the following link for details.

https://builds.apache.org/job/PreCommit-TAJO-Build/719/console

@dongjoon-hyun
Member Author

Hmm. Let me check for a second. I found the following Jenkins logs.

[ERROR] COMPILATION ERROR : 
[INFO] -------------------------------------------------------------
[ERROR] /home/jenkins/jenkins-slave/workspace/PreCommit-TAJO-Build/incubator-tajo/tajo-core/src/main/java/org/apache/tajo/master/GlobalEngine.java:[160,13] setQueryCache(com.google.common.cache.LoadingCache<java.lang.String,org.apache.tajo.algebra.Expr>) in org.apache.tajo.session.Session cannot be applied to (com.google.common.cache.LoadingCache<java.lang.Object,org.apache.tajo.algebra.Expr>)
[INFO] 1 error

@dongjoon-hyun
Member Author

For JDK6, I split one function call line into two lines.
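
The comment does not show the two lines. One plausible reading, given the compiler error above, is that the built cache was first assigned to a typed local variable and then passed to the setter. Under the Java 5 to 7 inference rules, a generic call used directly as a method argument gets no help from the parameter type, so a type variable that the arguments leave open falls back to Object; an assignment to a declared type does guide inference. The sketch below reproduces that shape with JDK-only stand-ins: Loader, Session, and build are hypothetical and are not Tajo or Guava types.

```java
import java.util.HashMap;
import java.util.Map;

final class InferenceSketch {
    /** Hypothetical stand-in for a cache loader. */
    interface Loader<K, V> {
        V load(K key);
    }

    /** Hypothetical stand-in for a session whose setter requires String keys. */
    static final class Session {
        private Map<String, Integer> queryCache;

        void setQueryCache(Map<String, Integer> cache) {
            this.queryCache = cache;
        }

        Map<String, Integer> getQueryCache() {
            return queryCache;
        }
    }

    /** K appears only under "? super K", so the argument alone does not pin K down. */
    static <K, V> Map<K, V> build(Loader<? super K, V> loader) {
        return new HashMap<K, V>();
    }

    static Session configure() {
        Loader<String, Integer> lengthLoader = new Loader<String, Integer>() {
            @Override
            public Integer load(String sql) {
                return sql.length();
            }
        };
        Session session = new Session();
        // One-line form, which an older compiler may reject because K becomes Object:
        //   session.setQueryCache(build(lengthLoader));
        // Two-line form: the declared type of the local variable drives inference.
        Map<String, Integer> cache = build(lengthLoader);
        session.setQueryCache(cache);
        return session;
    }
}
```

Java 8 and later accept both forms, so the split only matters when building with an older compiler.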

@@ -203,6 +203,7 @@ public static int setDateOrder(int dateOrder) {

// Query Configuration
QUERY_SESSION_TIMEOUT("tajo.query.session.timeout-sec", 60, Validators.min("0")),
QUERY_SESSION_CACHE_SIZE("tajo.query.session.cache-size", 1000000, Validators.min("1000000")),
Contributor

I have a couple of comments on this line.

  • The cache size seems to be specified in bytes, which makes it hard for people to specify the exact cache size. How about using KB as the unit?
  • The session variable name looks too general even though this cache is only for parsed queries. It would be better if users could tell what is being configured from the name. Also, the size unit should be included in the name, as in tajo.task.size-mb.
  • The minimum cache size is 1 GB. Is there a reason for that?
  • In addition to the size configuration, it would be great if users could turn this cache feature on and off, because the cache may be useless in some workloads such as ad-hoc analysis.

Member Author

Hi, @jihoonson. Thank you for the kind advice.

  • What about tajo.query.session.query-cache-size-mb, then? I think that fits your advice better.
  • The minimum cache size is 1 MB in terms of SQL string length. As the TAJO-1430 example shows, parsing a query of length 100K takes over 30 seconds, so a 1 MB cache holds up to 10 such 100K queries accessed within the last hour. In real situations, 10 MB or more is needed.
  • For the cache on/off feature, is it okay to use the condition tajo.query.session.query-cache-size-mb=0?

Contributor

Oh, right. I meant 1 MB, not 1 GB. According to your comment, it takes quite a long time to parse queries even when they are only a few KB in size. So, I think KB is the appropriate unit of size.

On the minimum cache size, Validators.min("1000000") means that any cache size under 1 MB is an invalid value for this configuration. I don't think you intended that, but if you did, please tell me the reason.

On the cache on/off switch, your suggestion looks good. In addition, it would be good to avoid checking the cache at all when the configured cache size is 0. I think this cache will go unused in many cases, because the cached data can be used only when exactly the same queries are submitted repeatedly.

Member Author

  1. tajo.query.session.query-cache-size-kb, then? No problem.
  2. On the minimum cache size: consider a 1 KB cache holding 10 SQL queries of 100 bytes each. The benefit is small, and in that case we would be better off turning the cache off. In that sense, we need to guide users toward a practically effective cache size.
  3. If the cache is off, the cache will be null and Tajo will avoid checking it. I agree that this is important.
  4. This query cache will be especially important for the PreparedStatement work in TAJO-1435 (I'm working on that too). PreparedStatement is popular in real enterprise environments. I added PLACE_HOLDER to the SQL syntax for the '?' of PreparedStatement and am trying to replace the placeholders after cloning the Expr. (Anyway, this is beyond the scope of this issue.)

Contributor

Thanks for your rapid reply. Here are my answers.

  1. Thanks for understanding.
  2. I mean, with Validators.min("1000000"), Tajo won't accept tajo.query.session.query-cache-size-mb=0. Validators should be used to validate the configuration. So, IMO, Validators.min(0) is better to prohibit negative values.
  3. Thanks for understanding.
  4. I also think that query caching is important. I mean that the current implementation will not be widely used, because the cached data can be used only when exactly the same queries are submitted repeatedly.

Member Author

  1. You're right. Validators.min(0) is appropriate now because we also use 0 as the cache-off option.
  2. I see. The current implementation does behave that way.

I will update the code soon. Thank you for reviewing, as always!
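
The outcome of this thread (a minimum of 0, where 0 turns the cache off and the lookup is skipped entirely) can be sketched as follows. This is an illustration under stated assumptions, not the committed code: QueryCacheSwitch and its methods are hypothetical, a plain HashMap stands in for the real bounded cache, and the parser is passed in as a function.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class QueryCacheSwitch {
    /** Rejects negative sizes, mirroring a validator with a minimum of 0. */
    static int validateSizeKb(int sizeKb) {
        if (sizeKb < 0) {
            throw new IllegalArgumentException("cache size must be >= 0 KB: " + sizeKb);
        }
        return sizeKb;
    }

    /** Returns null when the size is 0 (cache off); otherwise an empty cache. */
    static <V> Map<String, V> createCacheOrNull(int sizeKb) {
        // Bounding the cache by weight is omitted here; a HashMap is only a stand-in.
        return validateSizeKb(sizeKb) == 0 ? null : new HashMap<String, V>();
    }

    /** Parses directly when the cache is off; otherwise consults the cache first. */
    static <V> V parse(Map<String, V> cacheOrNull, String sql, Function<String, V> parser) {
        if (cacheOrNull == null) {
            return parser.apply(sql);      // cache disabled: no lookup at all
        }
        V cached = cacheOrNull.get(sql);
        if (cached == null) {
            cached = parser.apply(sql);    // miss: parse once and remember the result
            cacheOrNull.put(sql, cached);
        }
        return cached;
    }
}
```

Because the lookup key is the exact SQL string, a hit occurs only when an identical query text is submitted again, which is the limitation raised in the review.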

Contributor

Thanks for your understanding!

@dongjoon-hyun
Copy link
Member Author

I changed the code and finished testing on my laptop.

@jihoonson
Contributor

Thanks. I think that this work is almost done with the latest patch.
I'll commit tonight after some tests.

@dongjoon-hyun
Member Author

Thank you, @jihoonson .

@jihoonson
Contributor

+1 LGTM!
Thanks for your contribution.
I think it would be great to have some test cases, but they can be added in https://issues.apache.org/jira/browse/TAJO-1435.
I'll commit shortly.

@asfgit asfgit closed this in 7d72088 Apr 13, 2015
@dongjoon-hyun dongjoon-hyun deleted the TAJO-1430 branch April 15, 2015 09:32