File search optimisations #1904
Conversation
Love this. Just sat down to properly understand the changes and I like all of them. 👍🏻 🌟

The only hesitation I have with any of it is the caching - this now stores 1k lines of every searched file in memory, and I worry slightly that this could lead to huge memory footprints for users running on millions of files, where only a very small percentage of files are relevant for MultiQC.

What do you think about adding a sanity check whereby the cache is skipped once we have cached a certain number of files? That hopefully won't kick in for most users, and the majority can enjoy faster run times. But for people running in these situations with massive numbers of files, it might prevent MultiQC from crashing / using up the entire server.. Again, it could be a customisable threshold, so people wanting to reduce either run time or memory usage could tweak it.. Something like the sketch below.
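A minimal sketch of that sanity check, assuming a hypothetical threshold and counter (neither `FILESEARCH_FILE_CACHE_LIMIT` nor `_num_cached_files` exists in MultiQC - these names are illustrative only):

```python
# Sketch only: the threshold is a hypothetical, customisable config option.
FILESEARCH_FILE_CACHE_LIMIT = 10_000

_num_cached_files = 0  # how many files currently hold cached lines

def maybe_cache_lines(f: dict, lines: list[str]) -> None:
    """Cache the first lines of a searched file, unless the cache is already full."""
    global _num_cached_files
    if _num_cached_files >= FILESEARCH_FILE_CACHE_LIMIT:
        return  # skip caching; later search patterns re-read this file from disk
    f["contents_lines"] = lines
    _num_cached_files += 1
```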
Wait, I think I was getting mixed up with scope and order of execution here..
Great!
Just did my own speed benchmark on the ewels/MultiQC_TestData repo, and I make it a 6x speedup!
multiqc/utils/report.py (Outdated)
@@ -363,35 +363,40 @@ def search_file(pattern, f, module_key):
    if pattern.get("contents") is not None or pattern.get("contents_re") is not None:
        if pattern.get("contents_re") is not None:
            repattern = re.compile(pattern["contents_re"])
        try:
            if not f.get("contents_lines") or config.filesearch_lines_limit < pattern.get("num_lines", 0):
I don't really understand why we need `or config.filesearch_lines_limit < pattern.get("num_lines", 0)` to bust the cache here..?
if not f.get("contents_lines") or config.filesearch_lines_limit < pattern.get("num_lines", 0): | |
if not f.get("contents_lines"): |
No, wait - it's because it's for this specific search pattern. So you have to bust earlier search patterns' caches, I get it..

However, here we're busting the cache on every loop whenever the pattern's `num_lines` is greater than the global limit. I think we could probably check the length of the cached lines to see whether what we have already is sufficient.. Something like the sketch below.
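A minimal sketch of that length check, just to pin down the idea (this is not the change that was actually pushed; names follow the snippets above):

```python
def cache_is_sufficient(f: dict, pattern: dict, lines_limit: int) -> bool:
    """True if the cached lines can satisfy this search pattern without re-reading."""
    cached = f.get("contents_lines")
    if not cached:
        return False  # nothing cached yet - must read the file
    wanted = pattern.get("num_lines", lines_limit)
    # If the cache holds fewer lines than the global limit, we hit EOF while
    # building it - the whole file is cached and nothing more can be read.
    # Otherwise it is only sufficient if it covers the lines this pattern wants.
    return len(cached) < lines_limit or len(cached) >= wanted
```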
Pushed a change here - please sanity check that it makes sense 😅
multiqc/utils/report.py (Outdated)
with io.open(file_path, "r", encoding="utf-8") as fh:
    for i, line in enumerate(fh):
        f["contents_lines"].append(line)
        if i >= config.filesearch_lines_limit:
I think that the following is needed here, so that individual modules can override the global default for the maximum number of lines cached..?
Suggested change:
- if i >= config.filesearch_lines_limit:
+ if i >= config.filesearch_lines_limit and i >= pattern.get("num_lines", 0):
Pushed this change, hope that's ok
- Return False immediately if filename patterns don't match
- Fix behaviour with num_lines
- Bump num_lines for afterqc
I just ran a before- and after- run on the test data repo and did a crude diff on the log output to spot any potential problems. Two modules come up with fewer samples found:

< | htseq | Found 3 reports
< | afterqc | Found 6 reports
> | afterqc | Found 5 reports

Full diff:

1c1
< /// MultiQC 🔍 | v1.15.dev0 (abc25b9)
---
> /// MultiQC 🔍 | v1.15.dev0 (4c7395a)
58d57
< | htseq | Found 3 reports
164c163
< | afterqc | Found 6 reports
---
> | afterqc | Found 5 reports
189,192c188,191
< | multiqc | Run took 104.20 seconds
< | multiqc | - 21.31s: Searching files
< | multiqc | - 35.73s: Running modules
< | multiqc | - 2.92s: Compressing report data
---
> | multiqc | Run took 86.75 seconds
> | multiqc | - 3.46s: Searching files
> | multiqc | - 35.97s: Running modules
> | multiqc | - 2.82s: Compressing report data

Sure enough, htseq finds files by looking for a string that sits right at the end of the report, past the new line limit. AfterQC is similar:

afterqc:
    fn: "*.json"
    contents: "allow_mismatch_in_poly"

..and sure enough, one JSON file has it at line 1558. (oh, how much easier life would be if everyone's tools generated files with standard filename suffixes).

Out of curiosity I tried pushing the global line limit up, but that works against the point of this PR. Instead, I tried fixing these by setting `num_lines` in the relevant search patterns - bumping it for afterqc. For htseq we ideally need an option to be able to search from the end of the file.
I also noticed another minor optimisation in ba717a9 - if a filename search is specified and a file doesn't match it, we can return immediately. I think that this is safe - the test log output seems ok. It shaves another few ms off, down to 2.95s - a 7.2x speedup 😎 The shape of it is sketched below.
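Roughly the shape of that early return, as a simplified sketch rather than the actual code from ba717a9 (the real `search_file` takes more arguments and handles more pattern keys):

```python
import fnmatch
import re

def search_file(pattern: dict, f: dict) -> bool:
    """Simplified: does file f match this search pattern?"""
    # If a filename pattern is given and the file doesn't match it, return
    # immediately - no point reading the file contents at all.
    if pattern.get("fn") is not None and not fnmatch.fnmatch(f["fn"], pattern["fn"]):
        return False
    if pattern.get("fn_re") is not None and re.match(pattern["fn_re"], f["fn"]) is None:
        return False
    # ...contents / contents_re checks that actually read the file go here...
    return True
```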
ok I need to go to bed.
Been thinking about this in the back of my mind the past couple of days. Adding new code to fetch the end of files efficiently should certainly be possible. However, no matter how fast it is, it'll still likely significantly slow down the search for just this one single module, which is quite annoying.

I'm wondering if the least bad solution is to add a new module-level config into the search configs whereby a module can be disabled by default. Then if the module is specified on the command line with `-m` / `--module`, its search is enabled. That way, the search will be fast for the vast majority of users who are not using this module, and those who are can take a small performance hit. Anyone who is following best practices of defining the expected modules will not be affected.

Another option is to add a "fake" file name pattern for this module, for example requiring that the path must include a specific string. I don't love this, as it will be a breaking change for some users. However it also seems cruel to penalise the majority for the few. I may change my mind about it, just putting the idea down for discussion 😅
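For what it's worth, fetching the end of a file efficiently is straightforward with backwards seeks - a minimal standalone sketch, assuming a hypothetical helper (nothing like this exists in the codebase yet):

```python
import os

def read_last_lines(file_path: str, num_lines: int, block_size: int = 8192) -> list[str]:
    """Return roughly the last num_lines of a file without scanning it from the start."""
    with open(file_path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        end = fh.tell()
        data = b""
        # Read fixed-size blocks backwards from EOF until enough newlines are seen.
        while end > 0 and data.count(b"\n") <= num_lines:
            start = max(0, end - block_size)
            fh.seek(start)
            data = fh.read(end - start) + data
            end = start
    return data.decode("utf-8", errors="replace").splitlines()[-num_lines:]
```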
Just caught up! One quick thought: while searching the files, exclude all files that are recognised by other modules, and then only pass whatever is left unrecognised to modules like htseq. Another idea is to support more complex search rules, so a pattern like the htseq one could be constrained further.
Hello! Just chiming in on this as this is a nice update I would like to see come through. How about updating the current search pattern for htseq to constrain it further? There were a few "false" matches from other modules, but at the moment there was no noticeable increase in the search time.
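Purely as a hypothetical illustration of what a more constrained pattern could look like (these values are invented, not the actual suggestion from this thread; `fn_re`, `contents` and `num_lines` are real MultiQC search-pattern keys):

```yaml
htseq:
    fn_re: '.*\.(txt|tsv)$'       # only consider plausible count-table filenames
    contents: "__too_low_aQual"   # special counter that htseq-count writes
    num_lines: 100000             # search far enough into the file to reach it
```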
@pontushojer - I think this is brilliant! The only downside of it I see is that in the logs, I think those false positives would appear as "found" for htseq.
@vladsavelyev The "false" files are actually not reported in the log, as the module only reports the successfully parsed files - see below.
@pontushojer, oh that's right! Awesome, I'll add your code if you don't mind :)
I don't mind 😄
Co-authored-by: Pontus Höjer <pontushojer@gmail.com>
Awesome stuff, thanks @pontushojer! 🙌🏻 Diff is looking good now. Basically just this:

< | multiqc | - 21.31s: Searching files
> | multiqc | - 3.51s: Searching files

Just needs changelog and some docs, then it's good to merge I think.. 🏃🏻‍♂️
ok, I need to go and sleep now so I will merge as this is functionally complete. I'll make a new issue for the required docs + changelog and tackle that tomorrow prior to release if no-one beats me to it. Thanks both!
Now that this is in place, I think that the limit on the size of files we're willing to search matters a lot less for run time. As a result I think I can safely bump that, so I'll change the default from 10MB to 50MB. This is a fairly common issue people run into, so it should be a nice improvement - and it's now possible without sacrificing speed.
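If this refers to MultiQC's `log_filesize_limit` config option (my assumption - the 10MB default matches), the new default would correspond to the following user-config override:

```yaml
# multiqc_config.yaml - maximum size of files included in the search, in bytes
log_filesize_limit: 52428800  # 50 * 1024 * 1024 = 50MB (old default: 10485760)
```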
Hey Phil,
I was playing with the codebase a bit, thinking about the v2 roadmap. Unrelated to it, I noticed a few potential optimisations in file search that make the tests run faster. Not sure if you wanted any refactoring before v2, but I'll leave this PR anyway for the future.
1. Moving the expensive `mimetypes.guess_type` call out of the inner loop. This check is independent of the search patterns, so decreasing the number of times it's called gives a significant speed boost: 372s -> 102s on file search.
2. Adding a default global limit to the number of lines. From the profiler, a limit of 1000 lines speeds up file search 3x: 372s -> 123s. Not sure though if any module would break from this global limit, need to look into that.
3. File reading is expensive, and the `search_file` function reads the same file multiple times, once per search pattern - similar to `mimetypes.guess_type`. But we can't move it out of the inner loop as it depends on the search pattern; however, we can cache the lines that we read. This needs a global limit on the lines to be read, so coupled with item 2, it gives a nice performance boost (372s -> 73s). A sketch of the combined shape is at the bottom of this description.

Here are full per-line profiling logs of `add_file` and `search_file`: https://gist.github.com/vladsavelyev/9ec7c0d21349477f5fa437e5dd2118cf

Could do more of that, though the Python codebase is fast enough, so it's not really critical in any way. More important would be to scale the HTML reports to a larger number of samples, but that's on the roadmap!
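A minimal sketch of the combined structure from points 1-3, using simplified stand-in names rather than the real report.py signatures:

```python
import io
import mimetypes

FILESEARCH_LINES_LIMIT = 1000  # default global line limit (point 2)

def add_file(file_path: str, patterns: list[dict]) -> list[dict]:
    """Return the search patterns matching this file (simplified sketch)."""
    # Point 1: guess_type doesn't depend on the pattern, so call it once per
    # file here instead of once per (file, pattern) pair in the inner loop.
    file_type, _ = mimetypes.guess_type(file_path)
    if file_type is not None and not file_type.startswith("text"):
        return []
    f = {"fn": file_path, "contents_lines": None}
    return [p for p in patterns if search_file(p, f)]

def search_file(pattern: dict, f: dict) -> bool:
    # Point 3: read the file once and cache its first lines, so every
    # subsequent pattern searches the in-memory cache instead of the disk.
    if f["contents_lines"] is None:
        f["contents_lines"] = []
        with io.open(f["fn"], "r", encoding="utf-8") as fh:
            for i, line in enumerate(fh):
                f["contents_lines"].append(line)
                if i >= FILESEARCH_LINES_LIMIT:  # point 2: cap lines read
                    break
    contents = pattern.get("contents")
    return contents is not None and any(contents in line for line in f["contents_lines"])
```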