Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

revisit the subscription finishing query #4865

Closed
ticoann opened this issue Dec 2, 2013 · 6 comments · Fixed by #5088
Closed

revisit the subscription finishing query #4865

ticoann opened this issue Dec 2, 2013 · 6 comments · Fixed by #5088

Comments

@ticoann
Copy link
Contributor

ticoann commented Dec 2, 2013

https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMBS/MySQL/Subscriptions/GetAndMarkNewFinishedSubscriptions.py
Not sure how can it be improved or what other things need to be changed.
There are a couple of problems on this query.

  1. performance. It takes ~4hours to finish the query sometimes.
  2. hard to know where the condition fails. (Whether is is not finished because of child subscription is not done or current subscription is not finished).

Will it be helpful to break down the query?

Dirk could you look at this if you have a chance? If you don't have time, could you give me some hint to improve this.

@ghost ghost assigned hufnagel Dec 2, 2013
@hufnagel
Copy link
Member

hufnagel commented Dec 2, 2013

It would be good to make a snapshot of a database when it is in a state where the query takes a very long time. I can take a look at things, but without being able to test various different query optimizations and look at the execution plans and trace the queries, it's hard to make any improvements.

If this is encountered in an active agent, just shutdown the agent and let me know. I can then open a ticket with Oracle support to make a copy of the db and we can restart the agent after. It will mean at least a few hours downtime though.

@ticoann
Copy link
Contributor Author

ticoann commented Dec 3, 2013

We probably need to do that. The one I was looking at (vocms201) currently takes more than ~23h and going for this query to finish. I have to ask ops whether we can shutdown the agent completely for a few hours. So I guess it is not possible to copy db while agent is running. Anyway, I will talk to ops tomorrow.

@hufnagel
Copy link
Member

hufnagel commented Dec 3, 2013

I think at the very least one has to lock the db while the copy is running. Otherwise the copied information would not be internally consistent, because you would allow changes in some already copied tables while other not yet copied tables haven't been copied yet and that would mean the target database might have inconsistent information (FK relationships missing etc). If you can let me know the database, I can open a ticket with CERN Oracle Support and ask, they might have a way (blocking the application for a while might be preferable to shutting the agent down outright).

@hufnagel
Copy link
Member

hufnagel commented Dec 3, 2013

Btw, I have been observing this from the other end. On the weekly CMSR Oracle production use statistics reports send by the CERN Oracle team, the WMAgent production instances have been constantly among the highest load of all of the CMS production Oracle accounts. And we have 6 of them ! This is just an example from the other end that contributes to that, I am sure there are others. We need to do a schema review and look at how we do queries and make sure needed indexes are there and what the most busiest queries are etc etc. Otherwise we will run against a wall at some point (would say queries taking 24 hours already is hitting a wall).

@ericvaandering
Copy link
Member

Yes, this is on the work plan. Probably Feb or March is when it will start.

On Dec 3, 2013, at 3:10, Dirk Hufnagel notifications@github.com wrote:

Btw, I have been observing this from the other end. On the weekly CMSR Oracle production use statistics reports send by the CERN Oracle team, the WMAgent production instances have been constantly among the highest load of all of the CMS production Oracle accounts. And we have 6 of them ! This is just an example from the other end that contributes to that, I am sure there are others. We need to do a schema review and look at how we do queries and make sure needed indexes are there and what the most busiest queries are etc etc. Otherwise we will run against a wall at some point (would say queries taking 24 hours already is hitting a wall).


Reply to this email directly or view it on GitHub.

@ticoann
Copy link
Contributor Author

ticoann commented Feb 6, 2014

            SELECT complete_subscription.id
            FROM ( SELECT wmbs_subscription.id,
                          wmbs_subscription.fileset,
                          wmbs_workflow.name
                   FROM wmbs_subscription
                       INNER JOIN wmbs_fileset ON
                           wmbs_fileset.id = wmbs_subscription.fileset AND
                           wmbs_fileset.open = 0
                       INNER JOIN wmbs_workflow ON
                           wmbs_workflow.id = wmbs_subscription.workflow AND
                           wmbs_workflow.injected = 1
                       LEFT OUTER JOIN wmbs_sub_files_available ON
                           wmbs_sub_files_available.subscription = wmbs_subscription.id
                       LEFT OUTER JOIN wmbs_sub_files_acquired ON
                           wmbs_sub_files_acquired.subscription = wmbs_subscription.id
                       LEFT OUTER JOIN wmbs_jobgroup ON
                           wmbs_jobgroup.subscription = wmbs_subscription.id
                       LEFT OUTER JOIN wmbs_job ON
                           wmbs_job.jobgroup = wmbs_jobgroup.id AND
                           wmbs_job.state_time > 1386074466 AND
                           wmbs_job.state != 4
                   WHERE wmbs_subscription.finished = 0
                   GROUP BY wmbs_subscription.id,
                            wmbs_subscription.fileset,
                            wmbs_workflow.name
                   HAVING COUNT(wmbs_sub_files_available.subscription) = 0
                   AND COUNT(wmbs_sub_files_acquired.subscription) = 0
                   AND COUNT(wmbs_job.id) = 0 ) complete_subscription
                INNER JOIN wmbs_fileset ON
                    wmbs_fileset.id = complete_subscription.fileset
                LEFT OUTER JOIN wmbs_fileset_files ON
                    wmbs_fileset_files.fileset = wmbs_fileset.id
                LEFT OUTER JOIN wmbs_file_parent ON
                    wmbs_file_parent.parent = wmbs_fileset_files.fileid
                LEFT OUTER JOIN wmbs_fileset_files child_fileset ON
                    child_fileset.fileid = wmbs_file_parent.child
                LEFT OUTER JOIN wmbs_subscription child_subscription ON
                    child_subscription.fileset = child_fileset.fileset AND
                    child_subscription.finished = 0
                LEFT OUTER JOIN wmbs_workflow child_workflow ON
                    child_subscription.workflow = child_workflow.id AND
                    child_workflow.name != complete_subscription.name
            GROUP BY complete_subscription.id
            HAVING COUNT(child_workflow.name) = 0

1097 rows selected.

Execution Plan

Plan hash value: 1923316344


| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |

| 0 | SELECT STATEMENT | | 1 | 197 | | 557K (94)| 00:00:39 |
|* 1 | FILTER | | | | | | |
| 2 | HASH GROUP BY | | 1 | 197 | | 557K (94)| 00:00:39 |
|* 3 | HASH JOIN RIGHT OUTER | | 10976 | 2111K| | 557K (94)| 00:00:39 |
| 4 | TABLE ACCESS FULL | WMBS_WORKFLOW | 915 | 56730 | | 15 (0)| 00:00:01 |
| 5 | VIEW | | 10976 | 1447K| | 557K (94)| 00:00:39 |
|* 6 | HASH JOIN RIGHT OUTER | | 10976 | 2208K| | 557K (94)| 00:00:39 |
|* 7 | TABLE ACCESS FULL | WMBS_SUBSCRIPTION | 1806 | 21672 | | 11 (0)| 00:00:01 |
|* 8 | HASH JOIN OUTER | | 10976 | 2079K| | 557K (94)| 00:00:39 |
|* 9 | HASH JOIN OUTER | | 5590 | 998K| | 552K (95)| 00:00:38 |
| 10 | NESTED LOOPS OUTER | | 2748 | 458K| | 549K (95)| 00:00:38 |
| 11 | NESTED LOOPS | | 53 | 8480 | | 548K (95)| 00:00:38 |
| 12 | VIEW | | 53 | 8215 | | 548K (95)| 00:00:38 |
|* 13 | FILTER | | | | | | |
| 14 | HASH GROUP BY | | 53 | 6678 | | 548K (95)| 00:00:38 |
|* 15 | HASH JOIN RIGHT OUTER | | 2108M| 247G| | 95679 (71)| 00:00:07 |
|* 16 | TABLE ACCESS FULL | WMBS_JOB | 9697 | 198K| | 25277 (1)| 00:00:02 |
|* 17 | HASH JOIN RIGHT OUTER | | 2108M| 206G| | 37937 (91)| 00:00:03 |
| 18 | TABLE ACCESS FULL | WMBS_JOBGROUP | 96366 | 846K| | 241 (2)| 00:00:01 |
|* 19 | HASH JOIN | | 63M| 5805M| | 5228 (39)| 00:00:01 |
|* 20 | TABLE ACCESS FULL | WMBS_FILESET | 49732 | 339K| | 244 (3)| 00:00:01 |
|* 21 | HASH JOIN RIGHT OUTER| | 63M| 5381M| 7176K| 4006 (26)| 00:00:01 |
| 22 | TABLE ACCESS FULL | WMBS_SUB_FILES_ACQUIRED | 459K| 1793K| | 901 (2)| 00:00:01 |
|* 23 | HASH JOIN OUTER | | 249K| 20M| | 619 (3)| 00:00:01 |
|* 24 | HASH JOIN | | 1805 | 142K| | 27 (4)| 00:00:01 |
|* 25 | TABLE ACCESS FULL | WMBS_SUBSCRIPTION | 1806 | 28896 | | 11 (0)| 00:00:01 |
|* 26 | TABLE ACCESS FULL | WMBS_WORKFLOW | 905 | 58825 | | 15 (0)| 00:00:01 |
| 27 | TABLE ACCESS FULL | WMBS_SUB_FILES_AVAILABLE | 249K| 976K| | 588 (2)| 00:00:01 |
|* 28 | INDEX UNIQUE SCAN | WMBS_FILESET_PK | 1 | 5 | | 0 (0)| 00:00:01 |
| 29 | TABLE ACCESS BY INDEX ROWID | WMBS_FILESET_FILES | 52 | 572 | | 8 (0)| 00:00:01 |
|* 30 | INDEX RANGE SCAN | WMBS_FILESET_FILES_IDX_FILESET | 80 | | | 2 (0)| 00:00:01 |
| 31 | TABLE ACCESS FULL | WMBS_FILE_PARENT | 5342K| 61M| | 3719 (5)| 00:00:01 |

| 32 | TABLE ACCESS FULL | WMBS_FILESET_FILES | 5157K| 54M| | 4824 (4)| 00:00:01 |

Predicate Information (identified by operation id):

1 - filter(COUNT("CHILD_WORKFLOW"."NAME")=0)
3 - access("CHILD_SUBSCRIPTION"."WORKFLOW"="CHILD_WORKFLOW"."ID"(+))
filter("CHILD_WORKFLOW"."NAME"(+)<>"COMPLETE_SUBSCRIPTION"."NAME")
6 - access("CHILD_SUBSCRIPTION"."FILESET"(+)="CHILD_FILESET"."FILESET")
7 - filter("CHILD_SUBSCRIPTION"."FINISHED"(+)=0)
8 - access("CHILD_FILESET"."FILEID"(+)="WMBS_FILE_PARENT"."CHILD")
9 - access("WMBS_FILE_PARENT"."PARENT"(+)="WMBS_FILESET_FILES"."FILEID")
13 - filter(COUNT("WMBS_SUB_FILES_AVAILABLE"."SUBSCRIPTION")=0 AND COUNT("WMBS_SUB_FILES_ACQUIRED"."SUBSCRIPTION")=0
AND COUNT("WMBS_JOB"."ID")=0)
15 - access("WMBS_JOB"."JOBGROUP"(+)="WMBS_JOBGROUP"."ID")
16 - filter("WMBS_JOB"."STATE"(+)<>4 AND "WMBS_JOB"."STATE_TIME"(+)>1386074466)
17 - access("WMBS_JOBGROUP"."SUBSCRIPTION"(+)="WMBS_SUBSCRIPTION"."ID")
19 - access("WMBS_FILESET"."ID"="WMBS_SUBSCRIPTION"."FILESET")
20 - filter(TO_NUMBER("WMBS_FILESET"."OPEN")=0)
21 - access("WMBS_SUB_FILES_ACQUIRED"."SUBSCRIPTION"(+)="WMBS_SUBSCRIPTION"."ID")
23 - access("WMBS_SUB_FILES_AVAILABLE"."SUBSCRIPTION"(+)="WMBS_SUBSCRIPTION"."ID")
24 - access("WMBS_WORKFLOW"."ID"="WMBS_SUBSCRIPTION"."WORKFLOW")
25 - filter("WMBS_SUBSCRIPTION"."FINISHED"=0)
26 - filter("WMBS_WORKFLOW"."INJECTED"=1)
28 - access("WMBS_FILESET"."ID"="COMPLETE_SUBSCRIPTION"."FILESET")
30 - access("WMBS_FILESET_FILES"."FILESET"(+)="WMBS_FILESET"."ID")

Statistics

   442  recursive calls
     0  db block gets
667078  consistent gets
     0  physical reads
   940  redo size
 19677  bytes sent via SQL*Net to client
  1323  bytes received via SQL*Net from client
    75  SQL*Net roundtrips to/from client
    57  sorts (memory)
     0  sorts (disk)
  1097  rows processed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants