
File search optimisations #1904

Merged · 13 commits · Aug 3, 2023
Conversation

@vladsavelyev (Member) commented Apr 14, 2023

Hey Phil,

I was playing with the codebase a bit, thinking about the v2 roadmap. Unrelated to that, I noticed a few potential optimisations in file search that make the tests run faster. Not sure if you wanted any refactoring before v2, but I'll leave this PR here for the future anyway.

  1. Moving the expensive mimetypes.guess_type call out of the inner loop. This check is independent of the search patterns, so decreasing the number of times it's called gives a significant speed boost: 372s -> 102s on file search.

  2. Adding a default global limit to the number of lines. From the profiler, the limit of 1000 lines speeds up file search 3x: 372s -> 123s. Not sure though if any module would break from this global limit, need to look into that.

  3. File reading is expensive, and the search_file function reads the same file multiple times, once per search pattern - similar to the mimetypes.guess_type situation. We can't move the read out of the inner loop because it depends on the search pattern; however, we can cache the lines we read. This needs a global limit on the number of lines read, so coupled with item 2 it gives a nice performance boost (372s -> 73s). See the sketch after this list.
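
For illustration, here is a minimal sketch of how the three optimisations fit together. This is hypothetical, heavily simplified code rather than the actual MultiQC implementation - the real add_file/search_file handle many more pattern types, and FILESEARCH_LINES_LIMIT stands in for the new config option:

import io
import mimetypes

FILESEARCH_LINES_LIMIT = 1000  # stand-in for the new global config option

def add_file(file_path, search_patterns):
    # (1) the expensive mimetypes call happens once per file,
    # hoisted out of the per-pattern loop
    file_type, _ = mimetypes.guess_type(file_path)
    if file_type is not None and not file_type.startswith("text"):
        return False
    f = {"fn": file_path, "contents_lines": None}
    return any(search_file(p, f, file_path) for p in search_patterns)

def search_file(pattern, f, file_path):
    # (3) read the file once and cache its lines, so subsequent patterns
    # searching the same file don't re-read it from disk
    if f["contents_lines"] is None:
        f["contents_lines"] = []
        with io.open(file_path, "r", encoding="utf-8") as fh:
            for i, line in enumerate(fh):
                f["contents_lines"].append(line)
                if i >= FILESEARCH_LINES_LIMIT:  # (2) global cap on lines read
                    break
    num_lines = pattern.get("num_lines", FILESEARCH_LINES_LIMIT)
    return any(pattern["contents"] in line for line in f["contents_lines"][:num_lines])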

Here are full per-line profiling logs of add_file and search_file: https://gist.github.com/vladsavelyev/9ec7c0d21349477f5fa437e5dd2118cf

Could do more of that, though the Python codebase is fast enough, so it's not really critical in any way. More important would be to scale the HTML reports to a larger number of samples, but that's on the roadmap!

@ewels (Member) commented Apr 21, 2023

A 5x search speedup from a PR with a 40-line diff??

Amazing!! 👏🏻 😮

@ewels ewels added this to the MultiQC v1.15 milestone Apr 21, 2023
@ewels (Member) commented Jul 29, 2023

Love this. Just sat down to properly understand the changes and I like all of them. 👍🏻 🌟

The only hesitation I have with any of it is the caching - this now stores 1k lines of every searched file in memory, and I worry slightly that this could lead to huge memory footprints for users running on millions of files where only a very small percentage of files are relevant for MultiQC.

What do you think about adding a sanity check whereby the cache is skipped once we have cached a certain number of files? That hopefully won't kick in for most users, so the majority can enjoy faster run times, but for people running with massive numbers of files it might prevent MultiQC from crashing or using up the entire server..

Again, it could be a customisable threshold, so people wanting to trade off run time against memory usage could tweak it..

@ewels (Member) commented Jul 29, 2023

Wait, I think I was getting mixed up with scope and order of execution here..

  • Main function
    • Loop for each file found in the search directory
    • add_file() function
      • Loop through each search pattern
      • search_file() function

Here, f is defined within the add_file() function and so is only kept for the duration of that one file, so it shouldn't eat up memory with increasing numbers of files searched.. right? So I think we're fine.

Great!
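
In other words, the shape is roughly this (hypothetical sketch, reusing the simplified search_file from the sketch above, just to illustrate the scoping):

def search_all(file_paths, search_patterns):
    for path in file_paths:  # loop over every file found
        add_file(path, search_patterns)

def add_file(path, search_patterns):
    f = {"fn": path, "contents_lines": None}  # the line cache lives here, per file
    for pattern in search_patterns:  # loop over every search pattern
        search_file(pattern, f, path)
    # f goes out of scope here, so the cached lines can be freed
    # before the next file is searched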

@ewels (Member) commented Jul 29, 2023

Just did my own speed benchmark on the ewels/MultiQC_TestData repo with --profile-runtime, and the search time went from 21.31s down to 3.48s 🤯

So I make it a 6x speedup!

@@ -363,35 +363,40 @@ def search_file(pattern, f, module_key):
    if pattern.get("contents") is not None or pattern.get("contents_re") is not None:
        if pattern.get("contents_re") is not None:
            repattern = re.compile(pattern["contents_re"])
        try:
            if not f.get("contents_lines") or config.filesearch_lines_limit < pattern.get("num_lines", 0):
Member commented:
I don't really understand why we need the or config.filesearch_lines_limit < pattern.get("num_lines", 0) part to bust the cache here..?

Suggested change:
-            if not f.get("contents_lines") or config.filesearch_lines_limit < pattern.get("num_lines", 0):
+            if not f.get("contents_lines"):

Member commented:

No, wait - it's because it's for this specific search pattern. So you have to bust earlier search pattern caches, I get it..

However, here we're busting the cache on every loop iteration whenever the pattern's num_lines is greater than the global setting. I think we could probably check the length of the cached lines and see whether what we already have is sufficient..
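
Something like this, perhaps - a hypothetical sketch of that check, where contents_fully_read is an assumed flag marking that the whole file has already been consumed (not necessarily what was actually pushed):

def cache_is_sufficient(f, pattern, global_limit):
    # True if the cached lines already cover what this pattern needs,
    # so the file doesn't have to be read again
    wanted = max(global_limit, pattern.get("num_lines", 0))
    cached = f.get("contents_lines") or []
    # enough lines cached, or the file itself ended before the cap was hit
    return len(cached) >= wanted or f.get("contents_fully_read", False)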

Member commented:

Pushed a change here - please sanity check that it makes sense 😅

with io.open(file_path, "r", encoding="utf-8") as fh:
    for i, line in enumerate(fh):
        f["contents_lines"].append(line)
        if i >= config.filesearch_lines_limit:
@ewels (Member) commented Jul 29, 2023:

I think that the following is needed here, so that individual modules can override the global default for the maximum number of lines cached..?

Suggested change:
-        if i >= config.filesearch_lines_limit:
+        if i >= config.filesearch_lines_limit and i >= pattern.get("num_lines", 0):

Member commented:

Pushed this change, hope that's ok

- Return False immediately if filename patterns don't match
- Fix behaviour with num_lines
- Bump num_lines for afterqc
@ewels (Member) commented Jul 29, 2023

I just ran a before-and-after run on the test data repo and did a crude diff on the log output to spot any potential problems.

Two modules come up with fewer samples found:

< |             htseq | Found 3 reports
< |           afterqc | Found 6 reports
> |           afterqc | Found 5 reports
Full diff:
1c1
<   /// MultiQC 🔍 | v1.15.dev0 (abc25b9)
---
>   /// MultiQC 🔍 | v1.15.dev0 (4c7395a)
58d57
< |             htseq | Found 3 reports
164c163
< |           afterqc | Found 6 reports
---
> |           afterqc | Found 5 reports
189,192c188,191
< |           multiqc | Run took 104.20 seconds
< |           multiqc |  - 21.31s: Searching files
< |           multiqc |  - 35.73s: Running modules
< |           multiqc |  - 2.92s: Compressing report data
---
> |           multiqc | Run took 86.75 seconds
> |           multiqc |  - 3.46s: Searching files
> |           multiqc |  - 35.97s: Running modules
> |           multiqc |  - 2.82s: Compressing report data

Sure enough, htseq finds files by looking for the string __too_low_aQual, which sits at the end of the count files - approximately line 50-60k in the test files 😕

AfterQC is similar:

afterqc:
  fn: "*.json"
  contents: "allow_mismatch_in_poly"

..and sure enough, one JSON file has it at line 1558. (oh, how much easier life would be if everyone's tools generated files with standard filename suffixes).

Out of curiosity I tried pushing filesearch_lines_limit up to 100k instead of 1k, but the search time went from 3s back up to 11s, so the low limit really is needed.

Instead, I tried fixing these by setting num_lines to something larger for afterqc - as it only applies to JSON files due to the filename pattern, it basically doesn't affect run time for us.

For htseq we ideally need an option to be able to tail a file instead of reading from the top. I spent a while trying to implement this (using file-read-backwards amongst others), but couldn't get anything that was close to being fast enough. So that one remains unresolved.
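
(For reference, the usual trick is to seek() relative to the end of the file and scan backwards block by block for newlines - a hypothetical helper along these lines - though as noted above, nothing built on this approach was fast enough here:)

import os

def tail_lines(path, num_lines, block_size=8192):
    # Hypothetical helper: return roughly the last num_lines of a file
    # by seeking to the end and reading backwards block by block.
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        end = fh.tell()
        data = b""
        while end > 0 and data.count(b"\n") <= num_lines:
            start = max(0, end - block_size)
            fh.seek(start)
            data = fh.read(end - start) + data
            end = start
    return data.decode("utf-8", errors="replace").splitlines()[-num_lines:]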

@ewels (Member) commented Jul 29, 2023

I also noticed another minor optimisation in ba717a9 - that if a filename search is specified and a file doesn't match it, we can return immediately. I think that this is safe - the test log output seems ok.

It shaves a bit more off, down to 2.95s - a 7.2x speedup 😎

@ewels (Member) commented Jul 29, 2023

ok I need to go to bed.
Just need to figure out how to solve the htseq file search, then I think we can merge... (and release 🎉 )

@ewels (Member) commented Jul 31, 2023

Been thinking about this in the back of my mind for the past couple of days. Adding new code to fetch the end of files efficiently should certainly be possible. However, no matter how fast it is, it'll still likely slow down the search significantly for just this one module, which is quite annoying.

I'm wondering if the least bad solution is to add a new module-level config into the search configs whereby a module can be disabled by default. Then if the module is specified on the command line with -m or in a custom config file, it is enabled.

That way, the search will be fast for the vast majority of users who are not using this module, and those who are can take a small performance hit. Anyone who is following best practices of defining the expected modules will not be affected.

Another option is to add a "fake" file name pattern for this module, for example the path must include htseq. Users would be able to customise this, I think it would amount to the same thing though and maybe slightly more disruptive.

I don't love this, as it will be a breaking change for some users. However it also seems cruel to penalise the majority for the few. I may change my mind about it, just putting the idea down for discussion 😅

@vladsavelyev (Member Author) commented
Just caught up! The htseq case is really annoying indeed. And I don't like the idea of silently skipping htseq either. Huh.

One quick thought is, while searching the files, to exclude all files that are recognised by other modules, and then only pass whatever is left unrecognised to modules like htseq. But I think in the general case, when users pass a run directory, there are always going to be some unrecognised bulky logs.

Another idea is to support more complex search rules, so for the htseq module the rule would also require that a file starts with 2 columns (apart from the header), the second column being an integer. Normal users and contributors wouldn't need to worry about it; we would just have to do that once for modules such as htseq, which should hopefully be rare.

@pontushojer (Contributor) commented
Hello! Just chiming in, as this is a nice update I would like to see come through.

How about updating the current search pattern for htseq? From what I understand of the output, the first row is either a header or two columns, the first being the feature name and the second an integer count. I ran some tests using the regex search pattern below and it caught all three of the test datasets in MultiQC_TestData.

htseq:
  contents_re: '^(feature\tcount|\w+\t\d+)$'
  num_lines: 1
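
(A quick hypothetical snippet to sanity-check what that pattern does and doesn't match on a first line:)

import re

repattern = re.compile(r"^(feature\tcount|\w+\t\d+)$")

assert repattern.match("feature\tcount")        # the header variant
assert repattern.match("ENSG00000000003\t412")  # feature name + integer count
assert not repattern.match("not a count line")  # free text doesn't match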

There were a few "false" matches from other modules (pairtools, rna-seqc, flash, hicpro and genomescope2), but these were removed by the file parsing in the htseq module, as it looks for feature names from this list: ["__no_feature", "__ambiguous", "__too_low_aQual", "__not_aligned", "__alignment_not_unique"]. So any falsely matched files caught here would be discarded.

At the moment there is no noticeable increase in the htseq module runtime with this change, but the runtime might increase if there were a lot of "false" matches. Maybe we could set an upper limit on the file size accepted by the htseq module. One could also implement the reverse file parsing suggested above, but at the module level, since only the last 5 lines need to be read.

@vladsavelyev (Member Author) commented
@pontushojer - I think this is brilliant!

The only downside I see is that in the logs, those false positives would appear as "found" for htseq, which might be a bit misleading? But still much better than having to read entire long files during the file search.

@pontushojer (Contributor) commented
@vladsavelyev The "false" files are actually not reported in the log, as the module only reports the successfully parsed files - see below.

$ multiqc --profile-runtime -f testdata -m htseq

  /// MultiQC 🔍 | v1.15.dev0 (b08c6b9)

|           multiqc | Only using modules: htseq
|           multiqc | Search path : /Users/pontus.hojer/projects/MultiQC-pontus/testdata
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2270/2270  
|            report | Profile-runtime: Searching files took 0.30s
|             htseq | Found 3 reports
|   profile_runtime | Running run time profiling module
|           multiqc | Report      : multiqc_report.html   (overwritten)
|           multiqc | Data        : multiqc_data   (overwritten)
|           multiqc | Run took 1.00 seconds
|           multiqc |  - 0.30s: Searching files
|           multiqc |  - 0.28s: Running modules
|           multiqc |  - 0.00s: Compressing report data
|           multiqc | For more information, see the 'Run Time' section in multiqc_report.html
|           multiqc | MultiQC complete

@vladsavelyev (Member Author) commented
@pontushojer, oh that's right! Awesome, I'll add your code if you don't mind :)

@pontushojer (Contributor) commented
I don't mind 😄

Co-authored-by: Pontus Höjer <pontushojer@gmail.com>
@ewels (Member) commented Aug 3, 2023

Awesome stuff, thanks @pontushojer! 🙌🏻

Diff is looking good now. Basically just this:

< |           multiqc |  - 21.31s: Searching files
> |           multiqc |  - 3.51s: Searching files

Just needs a changelog entry and some docs, then I think it's good to merge.. 🏃🏻‍♂️

@ewels (Member) commented Aug 3, 2023

ok, I need to go and sleep now, so I will merge as this is functionally complete. I'll make a new issue for the required docs + changelog and tackle that tomorrow prior to release, if no-one beats me to it.

Thanks both!

@ewels ewels merged commit dd43295 into MultiQC:master Aug 3, 2023
16 checks passed
@ewels (Member) commented Aug 4, 2023

Now that this is in place, I think the log_filesize_limit config option is a lot less important - it was there mostly to solve a similar problem: spending ages parsing huge files looking for logs.

As a result I think I can safely bump that, so I'll change the default from 10MB to 50MB. This is a fairly common issue people run into, so it should be a nice improvement - and one that's now possible without sacrificing speed.

@vladsavelyev vladsavelyev deleted the optimisation branch August 5, 2023 20:06