Speedup startup and toolbox operations #3909
lib/galaxy/queue_worker.py
Outdated
app.reindex_tool_search(new_toolbox)
app.toolbox = new_toolbox
app.reindex_tool_search()
app.tool_cache.reset_status()
return new_toolbox
You can remove the return statement.
lib/galaxy/tools/data/__init__.py
Outdated
if time.time() - self.update_time > 1:
    content = os.walk(self.tool_data_path)
    try:
        self._tool_data_path_files = set([os.path.join(dirpath, fn) for dirpath, _, fn_list in content for fn in fn_list if fn and fn.endswith('.loc') or fn.endswith('.loc.sample')])
You don't need the square brackets inside set(), i.e. you can just write set(os.path.join...or fn.endswith('.loc.sample'))
That's a nice one, I didn't know about that!
Yeah, you use a generator instead of creating the list first and then converting it to a set.
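For illustration, here is a minimal sketch of that suggestion. The function name `collect_loc_files` is hypothetical and not part of the PR; the sketch also swaps the two chained `endswith` checks for the tuple form of `str.endswith`, which is a simplification of mine rather than something proposed in the review.

```python
import os


def collect_loc_files(root):
    """Return the set of .loc / .loc.sample file paths found under root.

    Hypothetical helper, for illustration only.
    """
    # The generator expression is consumed directly by set(), so no
    # intermediate list is built before the set is created.
    return set(
        os.path.join(dirpath, fn)
        for dirpath, _, fn_list in os.walk(root)
        for fn in fn_list
        # str.endswith accepts a tuple of suffixes.
        if fn.endswith(('.loc', '.loc.sample'))
    )
```

The result is the same set as with the bracketed list version; only the temporary list is avoided.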
lib/galaxy/tools/data/__init__.py
Outdated
if path in self.tool_data_path_files:
    return True
else:
    return False
return path in self.tool_data_path_files ?
A few things:
Thanks @jmchilton @nsoranzo, indeed, there are still some things left to be worked out; I just wanted to get some results from Jenkins (your early feedback is great, of course!) ... somehow the toolshed tests are failing even on dev. Probably OSX to blame.
lib/galaxy/tools/search/__init__.py
Outdated
return add_doc_kwds

def update_index(self, tool_cache, index_help=True):
    """Use `tool_cache` to determine which tools need indexing and which tools should be expired."""
very nice!
Thanks @jmchilton ... I think I'm close to the optimum for the things that I can improve without real-life feedback. I've brought the startup time on my server down from 38 +/- 4 seconds to 17 +/- 2 seconds (the server does not yet have a huge number of tools installed, it's running in Docker, and some of the things are stored on a relatively slow network drive ... YMMV). Profiling shows that most of the time is now spent parsing XML, so apart from moving everything to YAML or getting cElementTree to work I don't think there is much left that I can do.
For the toolbox reloading part I strongly suspect that @bgruening's reloading slowness was due to deactivated or outdated entries in the shed_tool_conf.xml file. Those entries would never enter the tool_cache, so every single one of them was triggering a new db query. I hope I solved this with the ToolShedRepository cache. In any case I added a few timing statements here and there, so that we have a better idea of what could be slow.
Oops, sorry about closing that - I meant to cancel my half-written comment, not close your PR.
I have 91 tools locally and this PR:
- Increases my startup toolbox indexing time from ~900ms to ~1.4s.
- Decreases my tool installation index rebuilding down to 33ms from ~700ms.
- Leaves my startup time at ~8s.
Overall very nice improvement on the installation rebuild time.
@mvdbeek Can we somehow clean the async queue for index rebuilding if Galaxy goes down? It is useless at that point.
Hmm, that's a bit surprising, because that means the startup actually became slower, since the toolbox indexing doesn't count towards startup time anymore. Or did you include the indexing in the timing? Are you using SQLite locally? It's possible that the ToolShedRepositoryCache doesn't do much for you if you are not using SQLite.
That was the primary objective, so hooray :)
Did you actually manage to queue up indexing from a previous restart?
@mvdbeek I have Postgres on a very quick machine with an SSD. I included the index in the timing. So ~6.7s without, I guess. And yeah, after one restart I had three queued-up indexing tasks.
OK, we should definitely address that before merging. And I have seen an instance on my server where the search index had disappeared (no tools found by typing in the search box) until I triggered a toolbox reload.
This reverts commit 556469f.
…o trigger cache rebuild
Updating the upload file formats is handled when loading the datatype into the registry, and reloading the upload tool seems unnecessary and insufficient, since this would only affect the current web handler. In any case removed and added datatypes are immediately propagated to the upload tool.
…x or datamanager reload
OK, I am building the toolbox search index during startup as we used to (af156a8), because doing that via the task queue didn't work on my production instance, which would start up with an empty tool search and only fill the search once the toolbox had been reloaded once. I am keeping track of how many times the toolbox has been reloaded, and reindex the search only if it has been rebuilt less often than the toolbox has been reloaded. This way we ignore queued-up rebuild requests. I think this is now ready for a final round of review (ping @martenson).
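A rough sketch of the counter idea described above, as I read it. The class and attribute names (`ReindexGuardSketch`, `toolbox_reload_count`, `index_rebuild_count`) are hypothetical and are not taken from the Galaxy code; the actual index rebuild is replaced by a placeholder comment.

```python
class ReindexGuardSketch:
    """Hypothetical stand-in for the app object, for illustration only."""

    def __init__(self):
        self.toolbox_reload_count = 0  # how often the toolbox was reloaded
        self.index_rebuild_count = 0   # reload count the index is up to date with

    def reload_toolbox(self):
        # A real reload would rebuild the toolbox here.
        self.toolbox_reload_count += 1

    def reindex_tool_search(self):
        """Rebuild the index only if it lags behind the toolbox.

        Returns True if a rebuild happened, False if the request was stale.
        """
        if self.index_rebuild_count >= self.toolbox_reload_count:
            return False  # queued-up request with nothing new to index
        # A real implementation would update the search index here.
        self.index_rebuild_count = self.toolbox_reload_count
        return True
```

With three reloads followed by three queued reindex requests, only the first request does any work in this sketch; the other two return immediately.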
Thank you Marius, again, a fabulous contribution! 💯
@bgruening mentioned slow toolbox reloading in Gitter a while back, so I had a look at what the major things were that could be sped up.
The most effective speedup for reloading the toolbox on my dev instance is rebuilding the search index only for the tools that have changed since the last toolbox reload. The speedup varies with how many tools are installed, of course; for me this went from ~2-3 seconds to 0.4 seconds.
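The incremental rebuild could look roughly like the following sketch. It assumes a tool cache that records which tool ids are new and which have expired since the last reset; the classes, the `new_tool_ids` / `expired_tool_ids` attributes, and the dict used as an "index" are all hypothetical simplifications, and only the names `update_index` and `reset_status` appear in the PR.

```python
class ToolCacheSketch:
    """Hypothetical cache that tracks changes since the last reset."""

    def __init__(self):
        self.new_tool_ids = set()
        self.expired_tool_ids = set()

    def reset_status(self):
        self.new_tool_ids.clear()
        self.expired_tool_ids.clear()


class SearchIndexSketch:
    """Hypothetical index; a plain dict stands in for the real search index."""

    def __init__(self):
        self.docs = {}

    def update_index(self, tool_cache, tools_by_id):
        """Index only new tools and drop expired ones; return both counts."""
        expired = len(tool_cache.expired_tool_ids)
        for tool_id in tool_cache.expired_tool_ids:
            self.docs.pop(tool_id, None)
        indexed = len(tool_cache.new_tool_ids)
        for tool_id in tool_cache.new_tool_ids:
            self.docs[tool_id] = tools_by_id[tool_id]
        # Everything is now accounted for, so forget the pending changes.
        tool_cache.reset_status()
        return indexed, expired
```

The cost of an update then scales with the number of changed tools rather than with the size of the whole toolbox.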
In terms of startup time, the most effective change is caching the ToolShedRepositories for 1 second in 1b6da12 and always fetching the ToolVersion objects from the cache in 4d5dd2d. This should be long enough to load the complete toolbox with just a single emitted db query. This reduced the time to build the full Galaxy app object from ~7.7 seconds to ~4.7 seconds. The more tools you have, the better the speedup, I guess.
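As a minimal sketch of the short-lived cache idea (not the code from 1b6da12): a loader callable stands in for the database query, and the result is reused until it is older than the time-to-live. `TimedCacheSketch`, `loader`, `ttl`, and `clock` are hypothetical names chosen for this example.

```python
import time

_UNSET = object()  # marks "nothing loaded yet"


class TimedCacheSketch:
    """Hypothetical cache that reloads its value at most once per ttl."""

    def __init__(self, loader, ttl=1.0, clock=time.monotonic):
        self._loader = loader  # stands in for the expensive db query
        self._ttl = ttl
        self._clock = clock    # injectable so the behaviour can be tested
        self._value = _UNSET
        self._loaded_at = 0.0

    def get(self):
        now = self._clock()
        if self._value is _UNSET or now - self._loaded_at > self._ttl:
            self._value = self._loader()
            self._loaded_at = now
        return self._value
```

Any number of `get()` calls within the same second share one `loader()` call, which is the effect described above for loading the toolbox.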
Data table reloading was also quite slow. It turns out there were hundreds of os.path.exists() calls that even on my SSD were taking quite a while to finish. I addressed that in 12f7dca. I think this may even speed up other areas in Galaxy; just running the unit tests went from 140 seconds to 80 seconds locally. I guess everything that accesses a bunch of tools sequentially will benefit from this.
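A simplified sketch of how many os.path.exists() calls can be replaced by one directory walk plus set lookups, loosely modelled on the snippets reviewed earlier in this thread. It is not the code from 12f7dca: the class name and the `max_age` parameter are hypothetical, and the `.loc` suffix filter is left out for brevity.

```python
import os
import time


class ToolDataPathFilesSketch:
    """Hypothetical helper that answers 'does this file exist?' from a set."""

    def __init__(self, tool_data_path, max_age=1.0):
        self.tool_data_path = tool_data_path
        self.max_age = max_age  # seconds before the listing is refreshed
        self.update_time = 0.0
        self._files = set()

    @property
    def files(self):
        # Walk the directory tree at most once per max_age seconds.
        if time.time() - self.update_time > self.max_age:
            self._files = set(
                os.path.join(dirpath, fn)
                for dirpath, _, fn_list in os.walk(self.tool_data_path)
                for fn in fn_list
            )
            self.update_time = time.time()
        return self._files

    def exists(self, path):
        # A set membership test instead of a filesystem call per path.
        return path in self.files
```

The trade-off is staleness: a file created after the last walk is not seen until the listing is refreshed, and lookups only match paths built the same way as the walked ones (same root prefix).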