improve task archiver query performance #5088

ticoann · 2014-04-18T04:59:47Z

ticoann · 2014-04-21T19:49:15Z

@yuyiguo, @hufnagel
Please review.
A couple of things changed from the original query.
The GROUP BY and HAVING clauses seem to cost a lot of time.
So instead of those, the following condition is applied after the LEFT OUTER JOIN:
wmbs_sub_files_acquired.subscription is Null
This has the same effect, since subscription is the key for the LEFT OUTER JOIN.
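
The equivalence described above can be checked on a toy schema. This is only a sketch: it uses an in-memory SQLite database as a stand-in for the real backends, and the two tables are reduced to the columns the join needs (the real WMBS tables have more). The row values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reduced, hypothetical versions of the WMBS tables
    CREATE TABLE wmbs_subscription (id INTEGER PRIMARY KEY);
    CREATE TABLE wmbs_sub_files_acquired (subscription INTEGER, fileid INTEGER);
    INSERT INTO wmbs_subscription (id) VALUES (1), (2), (3);
    -- Subscription 2 still has two acquired files; 1 and 3 have none
    INSERT INTO wmbs_sub_files_acquired VALUES (2, 10), (2, 11);
""")

# Original style: aggregate per subscription and keep the empty groups
group_by_sql = """
    SELECT wmbs_subscription.id
    FROM wmbs_subscription
    LEFT OUTER JOIN wmbs_sub_files_acquired ON
        wmbs_sub_files_acquired.subscription = wmbs_subscription.id
    GROUP BY wmbs_subscription.id
    HAVING COUNT(wmbs_sub_files_acquired.subscription) = 0
    ORDER BY wmbs_subscription.id
"""

# Revised style: keep only the rows where the outer join found no match
is_null_sql = """
    SELECT wmbs_subscription.id
    FROM wmbs_subscription
    LEFT OUTER JOIN wmbs_sub_files_acquired ON
        wmbs_sub_files_acquired.subscription = wmbs_subscription.id
    WHERE wmbs_sub_files_acquired.subscription IS NULL
    ORDER BY wmbs_subscription.id
"""

group_by_ids = [row[0] for row in conn.execute(group_by_sql)]
is_null_ids = [row[0] for row in conn.execute(is_null_sql)]
print(group_by_ids, is_null_ids)
```

Both forms agree here because the join condition is the subscription key alone, so an unmatched subscription produces exactly one row with NULLs on the right-hand side.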

However,

LEFT OUTER JOIN wmbs_jobgroup ON
     wmbs_jobgroup.subscription = wmbs_subscription.id
LEFT OUTER JOIN wmbs_job ON
     wmbs_job.jobgroup = wmbs_jobgroup.id AND
     wmbs_job.state_time > :maxTime AND
     wmbs_job.state != %d
GROUP BY ...
HAVING COUNT(wmbs_job.id) = 0

In this case (which is the major bottleneck) I can't replace it with an IS NULL check on wmbs_job,
since that means something different.
So I used an INNER JOIN and subtracted its result from the original set.
The query seems much faster (by a factor of roughly 20 or more) and passes the current unit tests.
However, I would appreciate it if you could check the correctness.
One thing I didn't change is the
GROUP BY ...
HAVING COUNT
clause for the child subscription. The data there seems small enough (over workflows) that changing it wouldn't gain much time, but we can change that later.
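
One way the IS NULL shortcut can differ from the HAVING COUNT form is sketched below, assuming a subscription may own more than one job group. The schema is a reduced, hypothetical version of the WMBS tables, SQLite stands in for the real database, and both :maxTime and :state are bound as named parameters (the thread's query uses %d for the state; 0 and 4 are the values from the test query later in this thread).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE wmbs_subscription (id INTEGER PRIMARY KEY);
    CREATE TABLE wmbs_jobgroup (id INTEGER PRIMARY KEY, subscription INTEGER);
    CREATE TABLE wmbs_job (id INTEGER PRIMARY KEY, jobgroup INTEGER,
                           state INTEGER, state_time INTEGER);
    INSERT INTO wmbs_subscription (id) VALUES (1), (2), (3);
    -- Subscription 1: one group with an unfinished job, one empty group
    -- Subscription 2: one group whose only job is in the excluded state
    -- Subscription 3: no job groups at all
    INSERT INTO wmbs_jobgroup VALUES (10, 1), (11, 1), (20, 2);
    INSERT INTO wmbs_job VALUES (100, 10, 1, 5), (200, 20, 4, 5);
""")
binds = {"maxTime": 0, "state": 4}

joins = """
    FROM wmbs_subscription
    LEFT OUTER JOIN wmbs_jobgroup ON
        wmbs_jobgroup.subscription = wmbs_subscription.id
    LEFT OUTER JOIN wmbs_job ON
        wmbs_job.jobgroup = wmbs_jobgroup.id AND
        wmbs_job.state_time > :maxTime AND
        wmbs_job.state != :state
"""

# Original form: count matching jobs across all groups of a subscription
original_sql = ("SELECT wmbs_subscription.id" + joins +
                "GROUP BY wmbs_subscription.id HAVING COUNT(wmbs_job.id) = 0 "
                "ORDER BY wmbs_subscription.id")

# Naive shortcut: true as soon as ANY group has no matching job
naive_sql = ("SELECT DISTINCT wmbs_subscription.id" + joins +
             "WHERE wmbs_job.id IS NULL ORDER BY wmbs_subscription.id")

# Subtraction: drop every subscription that has at least one matching job
subtraction_sql = """
    SELECT wmbs_subscription.id
    FROM wmbs_subscription
    WHERE wmbs_subscription.id NOT IN (
        SELECT wmbs_jobgroup.subscription
        FROM wmbs_jobgroup
        INNER JOIN wmbs_job ON
            wmbs_job.jobgroup = wmbs_jobgroup.id AND
            wmbs_job.state_time > :maxTime AND
            wmbs_job.state != :state)
    ORDER BY wmbs_subscription.id
"""

original_ids = [r[0] for r in conn.execute(original_sql, binds)]
naive_ids = [r[0] for r in conn.execute(naive_sql, binds)]
subtraction_ids = [r[0] for r in conn.execute(subtraction_sql, binds)]
print(original_ids, naive_ids, subtraction_ids)
```

In this toy data the naive form wrongly keeps subscription 1, because its empty job group yields a NULL row even though another group still has an unfinished job. The subtraction form matches the original.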

I think it might be better to break the query down into at least 3 parts (for clearer steps):
select the parent subscriptions that are done
select the child subscriptions that are done
update the subscriptions.

But for now I will leave it as it is.

Seangchan

yuyiguo · 2014-04-23T18:15:20Z

It is hard for me to really examine the code without seeing the DB schema and the data in the DB. Just comparing the revised SQL with the current SQL, the revised one seems better written to me. One thing I observed is that completeNonJobSQL and subWithUnfinishedJobSQL share most of their query. If we can create a temporary static table from the common part of the query, then subWithUnfinishedJobSQL only needs to do an additional select from that table.
Something like
with t1 as (
    SELECT distinct wmbs_subscription.id,
                    wmbs_subscription.fileset,
                    wmbs_workflow.name
    FROM wmbs_subscription
    INNER JOIN wmbs_fileset ON
        wmbs_fileset.id = wmbs_subscription.fileset AND
        wmbs_fileset.open = 0
    INNER JOIN wmbs_workflow ON
        wmbs_workflow.id = wmbs_subscription.workflow AND
        wmbs_workflow.injected = 1
    LEFT OUTER JOIN wmbs_sub_files_available ON
        wmbs_sub_files_available.subscription = wmbs_subscription.id
    LEFT OUTER JOIN wmbs_sub_files_acquired ON
        wmbs_sub_files_acquired.subscription = wmbs_subscription.id
    WHERE wmbs_subscription.finished = 0 AND
          wmbs_sub_files_available.subscription is Null AND
          wmbs_sub_files_acquired.subscription is Null)

Then we can select from t1 w/o repeating the same query.
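
A minimal sketch of that idea, assuming a reduced schema and using SQLite only as a stand-in: the common part is written once as a CTE named t1, and each statement that needs it is built by prefixing that fragment. The shortened table definitions, the row values, and the Python variable names are hypothetical; state 4 is the excluded job state taken from the test query later in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE wmbs_subscription (id INTEGER PRIMARY KEY, finished INTEGER);
    CREATE TABLE wmbs_sub_files_available (subscription INTEGER, fileid INTEGER);
    CREATE TABLE wmbs_jobgroup (id INTEGER PRIMARY KEY, subscription INTEGER);
    CREATE TABLE wmbs_job (id INTEGER PRIMARY KEY, jobgroup INTEGER, state INTEGER);
    INSERT INTO wmbs_subscription VALUES (1, 0), (2, 0), (3, 0), (4, 1);
    INSERT INTO wmbs_sub_files_available VALUES (3, 30);  -- 3 still has files
    INSERT INTO wmbs_jobgroup VALUES (10, 1), (20, 2);
    INSERT INTO wmbs_job VALUES (100, 10, 1), (200, 20, 4);  -- job 100 unfinished
""")

# Common part, written once. A CTE is scoped to a single statement,
# so every statement that uses t1 carries this prefix.
T1 = """
WITH t1 AS (
    SELECT DISTINCT wmbs_subscription.id
    FROM wmbs_subscription
    LEFT OUTER JOIN wmbs_sub_files_available ON
        wmbs_sub_files_available.subscription = wmbs_subscription.id
    WHERE wmbs_subscription.finished = 0 AND
          wmbs_sub_files_available.subscription IS NULL
)
"""

# Subscriptions with no files left to process
complete_sql = T1 + "SELECT id FROM t1 ORDER BY id"

# Of those, the ones that still have a job outside the excluded state
unfinished_sql = T1 + """
SELECT DISTINCT t1.id
FROM t1
INNER JOIN wmbs_jobgroup ON
    wmbs_jobgroup.subscription = t1.id
INNER JOIN wmbs_job ON
    wmbs_job.jobgroup = wmbs_jobgroup.id AND
    wmbs_job.state != 4
ORDER BY t1.id
"""

complete_ids = [r[0] for r in conn.execute(complete_sql)]
unfinished_ids = [r[0] for r in conn.execute(unfinished_sql)]
print(complete_ids, unfinished_ids)
```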
We can talk about this tomorrow.
Yuyi

ticoann · 2014-04-23T18:24:26Z

Thanks Yuyi, I didn't know we could do that. That is a good tip. Performance-wise, the newer query is much faster. My main concern is the correctness of the query: whether both queries give the same result in all cases. Let's talk tomorrow; Dirk will be here as well.

ticoann · 2014-04-24T23:58:15Z

Hi Yuyi,
It seems the WITH ... AS ... clause is not supported by MySQL (at least not by the MySQL version we are using).
However, it seems the query is faster in Oracle. I will have two different versions of the query, one for Oracle and one for MySQL.

WITH
  complete_subscription
AS
(SELECT distinct wmbs_subscription.id,
                wmbs_subscription.fileset,
                wmbs_workflow.name
    FROM wmbs_subscription
    INNER JOIN wmbs_fileset ON
        wmbs_fileset.id = wmbs_subscription.fileset AND
        wmbs_fileset.open = 0
    INNER JOIN wmbs_workflow ON
        wmbs_workflow.id = wmbs_subscription.workflow AND
        wmbs_workflow.injected = 1
    LEFT OUTER JOIN wmbs_sub_files_available ON
        wmbs_sub_files_available.subscription = wmbs_subscription.id
    LEFT OUTER JOIN wmbs_sub_files_acquired ON
        wmbs_sub_files_acquired.subscription = wmbs_subscription.id
  WHERE wmbs_subscription.finished = 0 AND
        wmbs_sub_files_available.subscription is Null AND
        wmbs_sub_files_acquired.subscription is Null
 )
SELECT complete_subscription.id
    FROM complete_subscription 

    INNER JOIN wmbs_fileset ON
        wmbs_fileset.id = complete_subscription.fileset
    LEFT OUTER JOIN wmbs_fileset_files ON
        wmbs_fileset_files.fileset = wmbs_fileset.id
    LEFT OUTER JOIN wmbs_file_parent ON
        wmbs_file_parent.parent = wmbs_fileset_files.fileid
    LEFT OUTER JOIN wmbs_fileset_files child_fileset ON
        child_fileset.fileid = wmbs_file_parent.child
    LEFT OUTER JOIN wmbs_subscription child_subscription ON
        child_subscription.fileset = child_fileset.fileset AND
        child_subscription.finished = 0
    LEFT OUTER JOIN wmbs_workflow child_workflow ON
        child_subscription.workflow = child_workflow.id AND
        child_workflow.name != complete_subscription.name

  WHERE complete_subscription.id
    NOT IN (SELECT complete_subscription.id
            FROM complete_subscription
            INNER JOIN wmbs_jobgroup ON
                wmbs_jobgroup.subscription = complete_subscription.id
            INNER JOIN wmbs_job ON
                wmbs_job.jobgroup = wmbs_jobgroup.id AND
                wmbs_job.state_time > 0 AND
                wmbs_job.state != 4)
  GROUP BY complete_subscription.id
  HAVING COUNT(child_workflow.name) = 0
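
For a backend without WITH support, one possible approach is to keep the shared fragment in a single string and inline it as an aliased derived table wherever the CTE name would appear. The sketch below is an assumption about how the two versions could be generated, not the code in this pull request. It runs both variants on SQLite only, so it says nothing about actual MySQL or Oracle behaviour or performance; the schema is reduced and the names COMPLETE_SUB, UNFINISHED, with_sql and derived_sql are hypothetical.

```python
import sqlite3

# Shared fragment: subscriptions with nothing left to acquire
COMPLETE_SUB = """
    SELECT DISTINCT wmbs_subscription.id
    FROM wmbs_subscription
    LEFT OUTER JOIN wmbs_sub_files_acquired ON
        wmbs_sub_files_acquired.subscription = wmbs_subscription.id
    WHERE wmbs_subscription.finished = 0 AND
          wmbs_sub_files_acquired.subscription IS NULL
"""

# Subscriptions from {src} that still have a job outside the excluded state
UNFINISHED = """
    SELECT cs.id FROM {src} cs
    INNER JOIN wmbs_jobgroup ON
        wmbs_jobgroup.subscription = cs.id
    INNER JOIN wmbs_job ON
        wmbs_job.jobgroup = wmbs_jobgroup.id AND
        wmbs_job.state != 4
"""

# Variant 1: common table expression, referenced twice by name
with_sql = (
    "WITH complete_subscription AS (" + COMPLETE_SUB + ") "
    "SELECT id FROM complete_subscription WHERE id NOT IN ("
    + UNFINISHED.format(src="complete_subscription") + ") ORDER BY id"
)

# Variant 2: no WITH; the fragment is repeated as an aliased derived table
inline = "(" + COMPLETE_SUB + ")"
derived_sql = (
    "SELECT id FROM " + inline + " cs0 WHERE id NOT IN ("
    + UNFINISHED.format(src=inline) + ") ORDER BY id"
)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE wmbs_subscription (id INTEGER PRIMARY KEY, finished INTEGER);
    CREATE TABLE wmbs_sub_files_acquired (subscription INTEGER, fileid INTEGER);
    CREATE TABLE wmbs_jobgroup (id INTEGER PRIMARY KEY, subscription INTEGER);
    CREATE TABLE wmbs_job (id INTEGER PRIMARY KEY, jobgroup INTEGER, state INTEGER);
    INSERT INTO wmbs_subscription VALUES (1, 0), (2, 0), (3, 0);
    INSERT INTO wmbs_sub_files_acquired VALUES (3, 30);
    INSERT INTO wmbs_jobgroup VALUES (10, 1), (20, 2);
    INSERT INTO wmbs_job VALUES (100, 10, 1), (200, 20, 4);
""")

with_ids = [r[0] for r in conn.execute(with_sql)]
derived_ids = [r[0] for r in conn.execute(derived_sql)]
print(with_ids, derived_ids)
```

Generating both strings from the same fragment keeps the two versions from drifting apart, at the cost of the derived-table variant evaluating the fragment more than once unless the optimizer merges the copies.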

improve task archiver query performance

f3a30d2

ticoann added a commit that referenced this pull request Apr 27, 2014

Merge pull request #5088 from ticoann/improve_taskarchiver_query

5f573f9

ticoann merged commit 5f573f9 into dmwm:master Apr 27, 2014

ticoann added the High Priority label Jun 3, 2014

ticoann added this to the WMAgent1404 milestone Jun 3, 2014

ticoann self-assigned this Jun 3, 2014
