improve task archiver query performance #5088

Merged
merged 1 commit into from
Apr 27, 2014

Conversation

ticoann
Contributor

@ticoann ticoann commented Apr 18, 2014

fixes #4865

@ticoann
Contributor Author

ticoann commented Apr 21, 2014

@yuyiguo, @hufnagel
Please review.
A couple of things changed from the original query.
It seems the GROUP BY and HAVING clauses cost a lot of time.
So instead of using them, the following condition is applied after the LEFT OUTER JOIN:
wmbs_sub_files_acquired.subscription is Null
This has the same effect, since subscription is the key for the LEFT OUTER JOIN.
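
To illustrate the equivalence described above, here is a minimal sketch using Python's sqlite3 module. The two-table schema is a hypothetical, heavily simplified stand-in for the real WMBS tables (the fileid column and the sample rows are assumptions), and the queries are not the actual task archiver SQL. It only shows that, for a single LEFT OUTER JOIN on the subscription key, the GROUP BY / HAVING COUNT form and the is Null form select the same subscriptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, hypothetical versions of the WMBS tables
    CREATE TABLE wmbs_subscription (id INTEGER PRIMARY KEY);
    CREATE TABLE wmbs_sub_files_acquired (subscription INTEGER, fileid INTEGER);
    INSERT INTO wmbs_subscription (id) VALUES (1), (2), (3);
    -- Subscription 2 still has two acquired files; 1 and 3 have none
    INSERT INTO wmbs_sub_files_acquired VALUES (2, 10), (2, 11);
""")

# Original style: aggregate per subscription and keep the empty groups
having_sql = """
    SELECT wmbs_subscription.id
    FROM wmbs_subscription
    LEFT OUTER JOIN wmbs_sub_files_acquired ON
        wmbs_sub_files_acquired.subscription = wmbs_subscription.id
    GROUP BY wmbs_subscription.id
    HAVING COUNT(wmbs_sub_files_acquired.subscription) = 0
    ORDER BY wmbs_subscription.id
"""

# Revised style: keep only the rows where the outer join found no match
is_null_sql = """
    SELECT wmbs_subscription.id
    FROM wmbs_subscription
    LEFT OUTER JOIN wmbs_sub_files_acquired ON
        wmbs_sub_files_acquired.subscription = wmbs_subscription.id
    WHERE wmbs_sub_files_acquired.subscription is Null
    ORDER BY wmbs_subscription.id
"""

having_ids = [row[0] for row in conn.execute(having_sql)]
is_null_ids = [row[0] for row in conn.execute(is_null_sql)]
print(having_ids, is_null_ids)
```

Both lists contain subscriptions 1 and 3 for this sample data. The is Null form avoids the aggregation step, which is where the thread reports most of the time being spent.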

However,

LEFT OUTER JOIN wmbs_jobgroup ON
     wmbs_jobgroup.subscription = wmbs_subscription.id
LEFT OUTER JOIN wmbs_job ON
     wmbs_job.jobgroup = wmbs_jobgroup.id AND
     wmbs_job.state_time > :maxTime AND
     wmbs_job.state != %d
GROUP BY ...
HAVING COUNT(wmbs_job.id) = 0

In this case (which is the major bottleneck) I can't replace it with a wmbs_job is Null check,
since that means something different.
So I used an inner join to find the subscriptions that have such jobs and subtracted them from the original set.
The query seems to be much faster (more than 20 times) and passes the current unit tests.
However, I would appreciate it if you could check the correctness.
One thing I didn't change is the
GROUP BY ...
HAVING COUNT
clause for the child subscriptions. The data there seems small enough (it is over workflows) that changing it wouldn't gain much time, but we can change that later.
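
A small sketch of why the is Null shortcut does not carry over to the jobgroup/job join, again using sqlite3 with a hypothetical simplified schema and made-up sample rows. The `:state` bind stands in for the `%d` placeholder in the snippet above, which is an assumption made for the example. A subscription with two job groups, only one of which has a matching job, still produces a Null row for the other group, so the naive check lets it through; the inner join plus subtraction does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, hypothetical versions of the WMBS tables
    CREATE TABLE wmbs_subscription (id INTEGER PRIMARY KEY);
    CREATE TABLE wmbs_jobgroup (id INTEGER PRIMARY KEY, subscription INTEGER);
    CREATE TABLE wmbs_job (id INTEGER PRIMARY KEY, jobgroup INTEGER,
                           state_time INTEGER, state INTEGER);
    INSERT INTO wmbs_subscription (id) VALUES (1), (2), (3);
    -- Subscription 1 has two job groups, subscription 2 has one, 3 has none
    INSERT INTO wmbs_jobgroup VALUES (10, 1), (11, 1), (20, 2);
    -- Job 100 matches the filter; job 200 is in the excluded state
    INSERT INTO wmbs_job VALUES (100, 10, 100, 1);
    INSERT INTO wmbs_job VALUES (200, 20, 100, 4);
""")
binds = {"maxTime": 50, "state": 4}

JOINS = """
    LEFT OUTER JOIN wmbs_jobgroup ON
        wmbs_jobgroup.subscription = wmbs_subscription.id
    LEFT OUTER JOIN wmbs_job ON
        wmbs_job.jobgroup = wmbs_jobgroup.id AND
        wmbs_job.state_time > :maxTime AND
        wmbs_job.state != :state
"""

# Original form: no matching job in any job group of the subscription
having_sql = ("SELECT wmbs_subscription.id FROM wmbs_subscription" + JOINS +
              "GROUP BY wmbs_subscription.id HAVING COUNT(wmbs_job.id) = 0 "
              "ORDER BY wmbs_subscription.id")

# Naive rewrite: true as soon as ANY job group has no matching job
naive_sql = ("SELECT DISTINCT wmbs_subscription.id FROM wmbs_subscription" + JOINS +
             "WHERE wmbs_job.id is Null ORDER BY wmbs_subscription.id")

# Subtraction: find subscriptions with matching jobs, then exclude them
subtract_sql = """
    SELECT wmbs_subscription.id
    FROM wmbs_subscription
    WHERE wmbs_subscription.id NOT IN (
        SELECT wmbs_jobgroup.subscription
        FROM wmbs_jobgroup
        INNER JOIN wmbs_job ON
            wmbs_job.jobgroup = wmbs_jobgroup.id AND
            wmbs_job.state_time > :maxTime AND
            wmbs_job.state != :state)
    ORDER BY wmbs_subscription.id
"""

having_ids = [row[0] for row in conn.execute(having_sql, binds)]
naive_ids = [row[0] for row in conn.execute(naive_sql, binds)]
subtract_ids = [row[0] for row in conn.execute(subtract_sql, binds)]
print(having_ids, naive_ids, subtract_ids)
```

With this data the HAVING form and the subtraction both return subscriptions 2 and 3, while the naive is Null form also returns subscription 1. Note that NOT IN behaves differently if the subquery can return Null values, so the sketch assumes wmbs_jobgroup.subscription is never Null.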

I think it might be better to break the query down into at least 3 parts (for clearer steps):
select parent subscription done
select child subscription done
update subscription.

But for now I will leave it as it is.

Seangchan

@yuyiguo
Member

yuyiguo commented Apr 23, 2014

It is hard for me to really examine the code without seeing the DB schema and the data in the DB. Just comparing the revised SQL with the current SQL, it seems to me that the revised one is better written. The one thing I observed is that completeNonJobSQL and subWithUnfinishedJobSQL share most of the query. If we create a temporary static table from the common part of the query, then subWithUnfinishedJobSQL can do an additional select from that table.
Something like
with t1 as (
    SELECT distinct wmbs_subscription.id,
                    wmbs_subscription.fileset,
                    wmbs_workflow.name
    FROM wmbs_subscription
    INNER JOIN wmbs_fileset ON
        wmbs_fileset.id = wmbs_subscription.fileset AND
        wmbs_fileset.open = 0
    INNER JOIN wmbs_workflow ON
        wmbs_workflow.id = wmbs_subscription.workflow AND
        wmbs_workflow.injected = 1
    LEFT OUTER JOIN wmbs_sub_files_available ON
        wmbs_sub_files_available.subscription = wmbs_subscription.id
    LEFT OUTER JOIN wmbs_sub_files_acquired ON
        wmbs_sub_files_acquired.subscription = wmbs_subscription.id
    WHERE wmbs_subscription.finished = 0 AND
          wmbs_sub_files_available.subscription is Null AND
          wmbs_sub_files_acquired.subscription is Null)

Then we can select from t1 w/o repeating the same query.
We can talk about this tomorrow.
Yuyi
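
A minimal sketch of the suggestion above, using sqlite3 (this assumes the SQLite bundled with Python is 3.8.3 or newer, which is when WITH support was added). The schema and rows are hypothetical simplifications, and t1 here keeps only the id column. The point is that the common part is written once and referenced twice in the same statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, hypothetical versions of the WMBS tables
    CREATE TABLE wmbs_subscription (id INTEGER PRIMARY KEY, finished INTEGER);
    CREATE TABLE wmbs_sub_files_available (subscription INTEGER, fileid INTEGER);
    CREATE TABLE wmbs_jobgroup (id INTEGER PRIMARY KEY, subscription INTEGER);
    CREATE TABLE wmbs_job (id INTEGER PRIMARY KEY, jobgroup INTEGER, state INTEGER);
    INSERT INTO wmbs_subscription VALUES (1, 0), (2, 0), (3, 0), (4, 1);
    INSERT INTO wmbs_sub_files_available VALUES (2, 10);  -- 2 still has files
    INSERT INTO wmbs_jobgroup VALUES (30, 3);
    INSERT INTO wmbs_job VALUES (300, 30, 1);             -- 3 has an unfinished job
""")

# The common part of both queries, written once
T1_SQL = """
    WITH t1 AS (
        SELECT DISTINCT wmbs_subscription.id
        FROM wmbs_subscription
        LEFT OUTER JOIN wmbs_sub_files_available ON
            wmbs_sub_files_available.subscription = wmbs_subscription.id
        WHERE wmbs_subscription.finished = 0 AND
              wmbs_sub_files_available.subscription is Null)
"""

# First use: the subscriptions with no files left to process
complete_sql = T1_SQL + "SELECT id FROM t1 ORDER BY id"

# Second use: t1 appears both in the outer select and in the subtraction
done_sql = T1_SQL + """
    SELECT t1.id
    FROM t1
    WHERE t1.id NOT IN (
        SELECT t1.id
        FROM t1
        INNER JOIN wmbs_jobgroup ON
            wmbs_jobgroup.subscription = t1.id
        INNER JOIN wmbs_job ON
            wmbs_job.jobgroup = wmbs_jobgroup.id AND
            wmbs_job.state != 4)
    ORDER BY t1.id
"""

complete_ids = [row[0] for row in conn.execute(complete_sql)]
done_ids = [row[0] for row in conn.execute(done_sql)]
print(complete_ids, done_ids)
```

For the sample rows, t1 contains subscriptions 1 and 3, and only subscription 1 remains after subtracting those with unfinished jobs.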

@ticoann
Contributor Author

ticoann commented Apr 23, 2014

Thanks Yuyi, I didn't know we could do that. That is a good tip. Performance-wise, the newer query is much faster. My main concern is the correctness of the query: whether both queries give the same result in all cases. Let's talk tomorrow, Dirk will be here as well.

@ticoann
Contributor Author

ticoann commented Apr 24, 2014

Hi Yuyi,
It seems the WITH ... AS ... clause is not supported by MySQL (at least not by the MySQL version we are using).
However, it seems the query is faster in Oracle. I will have two different versions of the query, one for Oracle and one for MySQL.

WITH
  complete_subscription
AS
(SELECT distinct wmbs_subscription.id,
                wmbs_subscription.fileset,
                wmbs_workflow.name
    FROM wmbs_subscription
    INNER JOIN wmbs_fileset ON
        wmbs_fileset.id = wmbs_subscription.fileset AND
        wmbs_fileset.open = 0
    INNER JOIN wmbs_workflow ON
        wmbs_workflow.id = wmbs_subscription.workflow AND
        wmbs_workflow.injected = 1
    LEFT OUTER JOIN wmbs_sub_files_available ON
        wmbs_sub_files_available.subscription = wmbs_subscription.id
    LEFT OUTER JOIN wmbs_sub_files_acquired ON
        wmbs_sub_files_acquired.subscription = wmbs_subscription.id
  WHERE wmbs_subscription.finished = 0 AND
        wmbs_sub_files_available.subscription is Null AND
        wmbs_sub_files_acquired.subscription is Null
 )
SELECT complete_subscription.id
    FROM complete_subscription 

    INNER JOIN wmbs_fileset ON
        wmbs_fileset.id = complete_subscription.fileset
    LEFT OUTER JOIN wmbs_fileset_files ON
        wmbs_fileset_files.fileset = wmbs_fileset.id
    LEFT OUTER JOIN wmbs_file_parent ON
        wmbs_file_parent.parent = wmbs_fileset_files.fileid
    LEFT OUTER JOIN wmbs_fileset_files child_fileset ON
        child_fileset.fileid = wmbs_file_parent.child
    LEFT OUTER JOIN wmbs_subscription child_subscription ON
        child_subscription.fileset = child_fileset.fileset AND
        child_subscription.finished = 0
    LEFT OUTER JOIN wmbs_workflow child_workflow ON
        child_subscription.workflow = child_workflow.id AND
        child_workflow.name != complete_subscription.name

  WHERE complete_subscription.id
    NOT IN (SELECT complete_subscription.id
              FROM complete_subscription
              INNER JOIN wmbs_jobgroup ON
                  wmbs_jobgroup.subscription = complete_subscription.id
              INNER JOIN wmbs_job ON
                  wmbs_job.jobgroup = wmbs_jobgroup.id AND
                  wmbs_job.state_time > 0 AND
                  wmbs_job.state != 4)
  GROUP BY complete_subscription.id
  HAVING COUNT(child_workflow.name) = 0
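
One possible way to keep two dialect versions from drifting apart is to generate both from the same common fragment: a WITH clause where it is supported, and the same fragment inlined as a derived table where it is not. The sketch below is only an illustration under stated assumptions. `build_query` and `COMMON_SQL` are hypothetical names, the schema is simplified, and both variants are run against sqlite3 purely to check that they return the same rows; it is not how the merged commit organises its SQL.

```python
import sqlite3

# Common part shared by both variants (simplified, hypothetical schema)
COMMON_SQL = """
    SELECT DISTINCT wmbs_subscription.id
    FROM wmbs_subscription
    LEFT OUTER JOIN wmbs_sub_files_acquired ON
        wmbs_sub_files_acquired.subscription = wmbs_subscription.id
    WHERE wmbs_subscription.finished = 0 AND
          wmbs_sub_files_acquired.subscription is Null
"""


def build_query(use_cte):
    """Hypothetical helper: return the query with or without a WITH clause."""
    if use_cte:
        prefix = "WITH complete_subscription AS (%s)" % COMMON_SQL
        source = "complete_subscription"
    else:
        # Inline the common part as a derived table with the same alias
        prefix = ""
        source = "(%s) complete_subscription" % COMMON_SQL
    return """%s
        SELECT complete_subscription.id
        FROM %s
        WHERE complete_subscription.id NOT IN (
            SELECT complete_subscription.id
            FROM %s
            INNER JOIN wmbs_jobgroup ON
                wmbs_jobgroup.subscription = complete_subscription.id
            INNER JOIN wmbs_job ON
                wmbs_job.jobgroup = wmbs_jobgroup.id AND
                wmbs_job.state != 4)
        ORDER BY complete_subscription.id
    """ % (prefix, source, source)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE wmbs_subscription (id INTEGER PRIMARY KEY, finished INTEGER);
    CREATE TABLE wmbs_sub_files_acquired (subscription INTEGER, fileid INTEGER);
    CREATE TABLE wmbs_jobgroup (id INTEGER PRIMARY KEY, subscription INTEGER);
    CREATE TABLE wmbs_job (id INTEGER PRIMARY KEY, jobgroup INTEGER, state INTEGER);
    INSERT INTO wmbs_subscription VALUES (1, 0), (2, 0), (3, 0);
    INSERT INTO wmbs_sub_files_acquired VALUES (3, 10);  -- 3 still has acquired files
    INSERT INTO wmbs_jobgroup VALUES (20, 2);
    INSERT INTO wmbs_job VALUES (200, 20, 1);            -- 2 has an unfinished job
""")

cte_ids = [row[0] for row in conn.execute(build_query(use_cte=True))]
inline_ids = [row[0] for row in conn.execute(build_query(use_cte=False))]
print(cte_ids, inline_ids)
```

Both variants return only subscription 1 here. The inlined form repeats the common subquery in the generated SQL, so whether it keeps the speedup on a given database would need to be measured separately.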

ticoann added a commit that referenced this pull request Apr 27, 2014
improve task archiver query performance
@ticoann ticoann merged commit 5f573f9 into dmwm:master Apr 27, 2014
@ticoann ticoann added this to the WMAgent1404 milestone Jun 3, 2014
@ticoann ticoann self-assigned this Jun 3, 2014
Development

Successfully merging this pull request may close these issues.

revisit the subscription finishing query