Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize object store interaction during job creation. #1957

Closed
wants to merge 1 commit into from

Conversation

jmchilton
Copy link
Member

Previously object_store_id on datasets was set as part of create - and create requires an object ID and therefore by extension a database flush. Together, this required a flush per output dataset while creating jobs.

This commit splits the object_store_id setting out from create - so that datasets can set an object_store_id during dataset creation and before flushing. The upshot is that flushing the database can now be done once for all datasets instead of once per dataset.

This optimization will in particular help executing tools with many outputs. For instance, the time to create a job for the sample tool create_10.xml which has 10 outputs dropped from 1086.074 ms to 230.539 ms with this change.

Before:

galaxy.tools DEBUG 2016-03-21 01:28:51,785 Validated and populated state for tool request (37.990 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:51,885 Handled output named out_file1 for tool create_10 (85.515 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:51,972 Handled output named out_file2 for tool create_10 (86.611 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:52,045 Handled output named out_file3 for tool create_10 (72.657 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:52,135 Handled output named out_file4 for tool create_10 (90.179 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:52,210 Handled output named out_file5 for tool create_10 (74.831 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:52,277 Handled output named out_file6 for tool create_10 (66.720 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:52,349 Handled output named out_file7 for tool create_10 (71.524 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:52,429 Handled output named out_file8 for tool create_10 (79.768 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:52,517 Handled output named out_file9 for tool create_10 (87.794 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:52,627 Handled output named out_file10 for tool create_10 (110.615 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:52,720 Verified access to datasets for Job[unflushed,tool_id=create_10] (7.573 ms)
galaxy.tools.execute DEBUG 2016-03-21 01:28:52,845 Tool [create_10] created job [19] (1060.070 ms)
galaxy.tools.execute DEBUG 2016-03-21 01:28:52,871 Executed 1 job(s) for tool create_10 request: (1086.074 ms)

After:

galaxy.tools DEBUG 2016-03-21 01:30:05,284 Validated and populated state for tool request (44.121 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,299 Handled output named out_file1 for tool create_10 (1.674 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,301 Handled output named out_file2 for tool create_10 (0.983 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,302 Handled output named out_file3 for tool create_10 (1.046 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,302 Handled output named out_file4 for tool create_10 (0.490 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,303 Handled output named out_file5 for tool create_10 (0.469 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,304 Handled output named out_file6 for tool create_10 (0.512 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,304 Handled output named out_file7 for tool create_10 (0.465 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,305 Handled output named out_file8 for tool create_10 (0.462 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,305 Handled output named out_file9 for tool create_10 (0.488 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,306 Handled output named out_file10 for tool create_10 (0.446 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,393 Created output datasets in object store (0.958 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,395 Verified access to datasets for Job[unflushed,tool_id=create_10] (0.359 ms)
galaxy.tools.execute DEBUG 2016-03-21 01:30:05,491 Tool [create_10] created job [20] (207.226 ms)
galaxy.tools.execute DEBUG 2016-03-21 01:30:05,515 Executed 1 job(s) for tool create_10 request: (230.539 ms)

Previously object_store_id on datasets was set as part of create - and create requires an object ID and therefore by extension a database flush. Together, this required a flush per output dataset while creating jobs.

This commit splits the object_store_id setting out from create - so that datasets can set an object_store_id during dataset creation and before flushing. The upshot is that flushing the database can now be done once for all datasets instead of once per dataset.

This optimization will in particular help executing tools with many outputs. For instance, the time to create a job for the sample tool create_10.xml which has 10 outputs dropped from 1086.074 ms to 230.539 ms with this change.

Before:

```
galaxy.tools DEBUG 2016-03-21 01:28:51,785 Validated and populated state for tool request (37.990 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:51,885 Handled output named out_file1 for tool create_10 (85.515 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:51,972 Handled output named out_file2 for tool create_10 (86.611 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:52,045 Handled output named out_file3 for tool create_10 (72.657 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:52,135 Handled output named out_file4 for tool create_10 (90.179 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:52,210 Handled output named out_file5 for tool create_10 (74.831 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:52,277 Handled output named out_file6 for tool create_10 (66.720 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:52,349 Handled output named out_file7 for tool create_10 (71.524 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:52,429 Handled output named out_file8 for tool create_10 (79.768 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:52,517 Handled output named out_file9 for tool create_10 (87.794 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:52,627 Handled output named out_file10 for tool create_10 (110.615 ms)
galaxy.tools.actions INFO 2016-03-21 01:28:52,720 Verified access to datasets for Job[unflushed,tool_id=create_10] (7.573 ms)
galaxy.tools.execute DEBUG 2016-03-21 01:28:52,845 Tool [create_10] created job [19] (1060.070 ms)
galaxy.tools.execute DEBUG 2016-03-21 01:28:52,871 Executed 1 job(s) for tool create_10 request: (1086.074 ms)
```

After:

```
galaxy.tools DEBUG 2016-03-21 01:30:05,284 Validated and populated state for tool request (44.121 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,299 Handled output named out_file1 for tool create_10 (1.674 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,301 Handled output named out_file2 for tool create_10 (0.983 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,302 Handled output named out_file3 for tool create_10 (1.046 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,302 Handled output named out_file4 for tool create_10 (0.490 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,303 Handled output named out_file5 for tool create_10 (0.469 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,304 Handled output named out_file6 for tool create_10 (0.512 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,304 Handled output named out_file7 for tool create_10 (0.465 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,305 Handled output named out_file8 for tool create_10 (0.462 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,305 Handled output named out_file9 for tool create_10 (0.488 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,306 Handled output named out_file10 for tool create_10 (0.446 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,393 Created output datasets in object store (0.958 ms)
galaxy.tools.actions INFO 2016-03-21 01:30:05,395 Verified access to datasets for Job[unflushed,tool_id=create_10] (0.359 ms)
galaxy.tools.execute DEBUG 2016-03-21 01:30:05,491 Tool [create_10] created job [20] (207.226 ms)
galaxy.tools.execute DEBUG 2016-03-21 01:30:05,515 Executed 1 job(s) for tool create_10 request: (230.539 ms)
```

Conflicts:
	lib/galaxy/tools/actions/__init__.py
@jmchilton
Copy link
Member Author

Doing more debugging and optimization and of course I came right back to this critical path. This makes a lot of sense for many outputs - but I think the benefit of this potentially skyrockets with large output collections. I wish I understood why I marked it as WIP, I'm going to close and reapply against dev and see what happens.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant