
File search optimisations #1904

Merged · 13 commits · Aug 3, 2023
Conversation

@vladsavelyev (Member) commented Apr 14, 2023

Hey Phil,

I was playing with the codebase a bit, thinking about the v2 roadmap. Unrelated to that, I noticed a few potential optimisations in file search that make the tests run faster. Not sure if you wanted any refactoring before v2, but I'll leave this PR here for the future anyway.

  1. Moving the expensive mimetypes.guess_type call out of the inner loop. This check is independent of the search patterns, so decreasing the number of times it's called gives a significant speed boost: 372s -> 102s on file search.

  2. Adding a default global limit to the number of lines. From the profiler, the limit of 1000 lines speeds up file search 3x: 372s -> 123s. Not sure though if any module would break from this global limit, need to look into that.

  3. File reading is expensive, and the search_file function reads the same file multiple times, once per search pattern - similar to the mimetypes.guess_type situation. We can't move the read out of the inner loop because it depends on the search pattern; however, we can cache the lines we read. This needs a global limit on the number of lines read, so coupled with item 2 it gives a nice performance boost (372s -> 73s). See the sketch after this list.
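
For illustration, here is a minimal sketch of how the three optimisations fit together. This is hypothetical, heavily simplified code rather than the actual MultiQC implementation - the real add_file/search_file handle many more pattern types, and FILESEARCH_LINES_LIMIT stands in for the new config option:

import io
import mimetypes

FILESEARCH_LINES_LIMIT = 1000  # stand-in for the new global config option

def add_file(file_path, search_patterns):
    # (1) the expensive mimetypes call happens once per file,
    # hoisted out of the per-pattern loop
    file_type, _ = mimetypes.guess_type(file_path)
    if file_type is not None and not file_type.startswith("text"):
        return False
    f = {"fn": file_path, "contents_lines": None}
    return any(search_file(p, f, file_path) for p in search_patterns)

def search_file(pattern, f, file_path):
    # (3) read the file once and cache its lines, so subsequent patterns
    # searching the same file don't re-read it from disk
    if f["contents_lines"] is None:
        f["contents_lines"] = []
        with io.open(file_path, "r", encoding="utf-8") as fh:
            for i, line in enumerate(fh):
                f["contents_lines"].append(line)
                if i >= FILESEARCH_LINES_LIMIT:  # (2) global cap on lines read
                    break
    num_lines = pattern.get("num_lines", FILESEARCH_LINES_LIMIT)
    return any(pattern["contents"] in line for line in f["contents_lines"][:num_lines])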

Here are full per-line profiling logs of add_file and search_file: https://gist.github.com/vladsavelyev/9ec7c0d21349477f5fa437e5dd2118cf

Could do more of that, though the Python codebase is fast enough, so it's not really critical in any way. More important would be to scale the HTML reports to a larger number of samples, but that's on the roadmap!

@ewels (Member) commented Apr 21, 2023

A 5x search speedup from a PR with a 40-line diff??

Amazing!! 👏🏻 😮

@ewels ewels added this to the MultiQC v1.15 milestone Apr 21, 2023
@ewels (Member) commented Jul 29, 2023

Love this. Just sat down to properly understand the changes and I like all of them. 👍🏻 🌟

The only hesitation I have with any of it is the caching - this now stores 1k lines of every searched file in memory, and I worry slightly that this could lead to huge memory footprints for users running on millions of files where only a very small percentage of files are relevant for MultiQC.

What do you think about adding a sanity check whereby the cache is skipped once we have cached a certain number of files? That hopefully won't kick in for most users, so the majority can enjoy faster run times, but for people running with massive numbers of files it might prevent MultiQC from crashing or using up the entire server..

Again, it could be a customisable threshold, so people wanting to trade off run time against memory usage could tweak it..

@ewels (Member) commented Jul 29, 2023

Wait, I think I was getting mixed up with scope and order of execution here..

  • Main function
    • Loop for each file found in the search directory
    • add_file() function
      • Loop through each search pattern
      • search_file() function

Here, f is defined within the add_file() function and so is only kept for the duration of that one file, so it shouldn't eat up memory with increasing numbers of files searched.. right? So I think we're fine.

Great!
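
In other words, the shape is roughly this (hypothetical sketch, reusing the simplified search_file from the sketch above, just to illustrate the scoping):

def search_all(file_paths, search_patterns):
    for path in file_paths:  # loop over every file found
        add_file(path, search_patterns)

def add_file(path, search_patterns):
    f = {"fn": path, "contents_lines": None}  # the line cache lives here, per file
    for pattern in search_patterns:  # loop over every search pattern
        search_file(pattern, f, path)
    # f goes out of scope here, so the cached lines can be freed
    # before the next file is searched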

@ewels (Member) commented Jul 29, 2023

Just did my own speed benchmark on the ewels/MultiQC_TestData repo with --profile-runtime, and the search time went from 21.31s down to 3.48s 🤯

So I make it a 6x speedup!

@@ -363,35 +363,40 @@ def search_file(pattern, f, module_key):
    if pattern.get("contents") is not None or pattern.get("contents_re") is not None:
        if pattern.get("contents_re") is not None:
            repattern = re.compile(pattern["contents_re"])
        try:
            if not f.get("contents_lines") or config.filesearch_lines_limit < pattern.get("num_lines", 0):
Member commented:
I don't really understand why we need the or config.filesearch_lines_limit < pattern.get("num_lines", 0) part to bust the cache here..?

Suggested change:
-            if not f.get("contents_lines") or config.filesearch_lines_limit < pattern.get("num_lines", 0):
+            if not f.get("contents_lines"):

Member commented:

No, wait - it's because it's for this specific search pattern. So you have to bust earlier search pattern caches, I get it..

However, here we're busting the cache on every loop iteration whenever the pattern's num_lines is greater than the global setting. I think we could probably check the length of the cached lines and see whether what we already have is sufficient..
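
Something like this, perhaps - a hypothetical sketch of that check, where contents_fully_read is an assumed flag marking that the whole file has already been consumed (not necessarily what was actually pushed):

def cache_is_sufficient(f, pattern, global_limit):
    # True if the cached lines already cover what this pattern needs,
    # so the file doesn't have to be read again
    wanted = max(global_limit, pattern.get("num_lines", 0))
    cached = f.get("contents_lines") or []
    # enough lines cached, or the file itself ended before the cap was hit
    return len(cached) >= wanted or f.get("contents_fully_read", False)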

Member commented:

Pushed a change here - please sanity check that it makes sense 😅

with io.open(file_path, "r", encoding="utf-8") as fh:
    for i, line in enumerate(fh):
        f["contents_lines"].append(line)
        if i >= config.filesearch_lines_limit:
@ewels (Member) commented Jul 29, 2023:

I think that the following is needed here, so that individual modules can override the global default for the maximum number of lines cached..?

Suggested change:
-        if i >= config.filesearch_lines_limit:
+        if i >= config.filesearch_lines_limit and i >= pattern.get("num_lines", 0):

Member commented:

Pushed this change, hope that's ok

- Return False immediately if filename patterns don't match
- Fix behaviour with num_lines
- Bump num_lines for afterqc
@ewels (Member) commented Jul 29, 2023

I just ran a before-and-after run on the test data repo and did a crude diff on the log output to spot any potential problems.

Two modules come up with fewer samples found:

< |             htseq | Found 3 reports
< |           afterqc | Found 6 reports
> |           afterqc | Found 5 reports
Full diff:
1c1
<   /// MultiQC 🔍 | v1.15.dev0 (abc25b9)
---
>   /// MultiQC 🔍 | v1.15.dev0 (4c7395a)
58d57
< |             htseq | Found 3 reports
164c163
< |           afterqc | Found 6 reports
---
> |           afterqc | Found 5 reports
189,192c188,191
< |           multiqc | Run took 104.20 seconds
< |           multiqc |  - 21.31s: Searching files
< |           multiqc |  - 35.73s: Running modules
< |           multiqc |  - 2.92s: Compressing report data
---
> |           multiqc | Run took 86.75 seconds
> |           multiqc |  - 3.46s: Searching files
> |           multiqc |  - 35.97s: Running modules
> |           multiqc |  - 2.82s: Compressing report data

Sure enough, htseq finds files by looking for the string __too_low_aQual, which sits at the end of the count files - approximately line 50-60k in the test files 😕

AfterQC is similar:

afterqc:
  fn: "*.json"
  contents: "allow_mismatch_in_poly"

..and sure enough, one JSON file has it at line 1558. (oh, how much easier life would be if everyone's tools generated files with standard filename suffixes).

Out of curiosity I tried pushing filesearch_lines_limit up to 100k instead of 1k, but the search time went from 3s back up to 11s, so the low limit really is needed.

Instead, I tried fixing these by setting num_lines to something larger for afterqc - as it only applies to JSON files due to the filename pattern, it basically doesn't affect run time for us.

For htseq we ideally need an option to be able to tail a file instead of reading from the top. I spent a while trying to implement this (using file-read-backwards amongst others), but couldn't get anything that was close to being fast enough. So that one remains unresolved.
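
(For reference, the usual trick is to seek() relative to the end of the file and scan backwards block by block for newlines - a hypothetical helper along these lines - though as noted above, nothing built on this approach was fast enough here:)

import os

def tail_lines(path, num_lines, block_size=8192):
    # Hypothetical helper: return roughly the last num_lines of a file
    # by seeking to the end and reading backwards block by block.
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        end = fh.tell()
        data = b""
        while end > 0 and data.count(b"\n") <= num_lines:
            start = max(0, end - block_size)
            fh.seek(start)
            data = fh.read(end - start) + data
            end = start
    return data.decode("utf-8", errors="replace").splitlines()[-num_lines:]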

@ewels (Member) commented Jul 29, 2023

I also noticed another minor optimisation in ba717a9 - that if a filename search is specified and a file doesn't match it, we can return immediately. I think that this is safe - the test log output seems ok.

It shaves a bit more off, down to 2.95s - a 7.2x speedup 😎

@ewels (Member) commented Jul 29, 2023

ok I need to go to bed.
Just need to figure out how to solve the htseq file search, then I think we can merge... (and release 🎉 )

@ewels (Member) commented Jul 31, 2023

Been thinking about this in the back of my mind for the past couple of days. Adding new code to fetch the end of files efficiently should certainly be possible. However, no matter how fast it is, it'll still likely slow down the search significantly for just this one module, which is quite annoying.

I'm wondering if the least bad solution is to add a new module-level config into the search configs whereby a module can be disabled by default. Then if the module is specified on the command line with -m or in a custom config file, it is enabled.

That way, the search will be fast for the vast majority of users who are not using this module, and those who are can take a small performance hit. Anyone who is following best practices of defining the expected modules will not be affected.

Another option is to add a "fake" file name pattern for this module, for example the path must include htseq. Users would be able to customise this, I think it would amount to the same thing though and maybe slightly more disruptive.

I don't love this, as it will be a breaking change for some users. However it also seems cruel to penalise the majority for the few. I may change my mind about it, just putting the idea down for discussion 😅

@vladsavelyev (Member Author) commented
Just caught up! The htseq case is really annoying indeed. And I don't like the idea of silently skipping htseq either. Huh.

One quick thought is, while searching the files, to exclude all files that are recognised by other modules, and then only pass whatever is left unrecognised to modules like htseq. But I think in the general case, when users pass a run directory, there are always going to be some unrecognised bulky logs.

Another idea is to support more complex search rules, so for the htseq module the rule would also require that a file starts with 2 columns (apart from the header), the second column being an integer. Normal users and contributors wouldn't need to worry about it; we would just have to do that once for modules such as htseq, which should hopefully be rare.

@pontushojer (Contributor) commented
Hello! Just chiming in, as this is a nice update I would like to see come through.

How about updating the current search pattern for htseq? From what I understand of the output, the first row is either a header or two columns, the first being the feature name and the second an integer count. I ran some tests using the regex search pattern below and it caught all three of the test datasets in MultiQC_TestData.

htseq:
  contents_re: '^(feature\tcount|\w+\t\d+)$'
  num_lines: 1
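
(A quick hypothetical snippet to sanity-check what that pattern does and doesn't match on a first line:)

import re

repattern = re.compile(r"^(feature\tcount|\w+\t\d+)$")

assert repattern.match("feature\tcount")        # the header variant
assert repattern.match("ENSG00000000003\t412")  # feature name + integer count
assert not repattern.match("not a count line")  # free text doesn't match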

There were a few "false" matches from other modules (pairtools, rna-seqc, flash, hicpro and genomescope2), but these were removed by the file parsing in the htseq module, as it looks for feature names from this list: ["__no_feature", "__ambiguous", "__too_low_aQual", "__not_aligned", "__alignment_not_unique"]. So any falsely matched files caught here would be discarded.

At the moment there is no noticeable increase in the htseq module runtime with this change, but the runtime might increase if there were a lot of "false" matches. Maybe we could set an upper limit on the file size accepted by the htseq module. One could also implement the reverse file parsing suggested above, but at the module level, since only the last 5 lines need to be read.

@vladsavelyev (Member Author) commented
@pontushojer - I think this is brilliant!

The only downside I see is that in the logs, those false positives would appear as "found" for htseq, which might be a bit misleading? But still much better than having to read entire long files during the file search.

@pontushojer (Contributor) commented
@vladsavelyev The "false" files are actually not reported in the log, as the module only reports the successfully parsed files - see below.

$ multiqc --profile-runtime -f testdata -m htseq

  /// MultiQC 🔍 | v1.15.dev0 (b08c6b9)

|           multiqc | Only using modules: htseq
|           multiqc | Search path : /Users/pontus.hojer/projects/MultiQC-pontus/testdata
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2270/2270  
|            report | Profile-runtime: Searching files took 0.30s
|             htseq | Found 3 reports
|   profile_runtime | Running run time profiling module
|           multiqc | Report      : multiqc_report.html   (overwritten)
|           multiqc | Data        : multiqc_data   (overwritten)
|           multiqc | Run took 1.00 seconds
|           multiqc |  - 0.30s: Searching files
|           multiqc |  - 0.28s: Running modules
|           multiqc |  - 0.00s: Compressing report data
|           multiqc | For more information, see the 'Run Time' section in multiqc_report.html
|           multiqc | MultiQC complete

@vladsavelyev (Member Author) commented
@pontushojer, oh that's right! Awesome, I'll add your code if you don't mind :)

@pontushojer (Contributor) commented
I don't mind 😄

Co-authored-by: Pontus Höjer <pontushojer@gmail.com>
@ewels (Member) commented Aug 3, 2023

Awesome stuff, thanks @pontushojer! 🙌🏻

Diff is looking good now. Basically just this:

< |           multiqc |  - 21.31s: Searching files
> |           multiqc |  - 3.51s: Searching files

Just needs a changelog entry and some docs, then I think it's good to merge.. 🏃🏻‍♂️

@ewels (Member) commented Aug 3, 2023

ok, I need to go and sleep now, so I will merge as this is functionally complete. I'll make a new issue for the required docs + changelog and tackle that tomorrow prior to release, if no-one beats me to it.

Thanks both!

@ewels ewels merged commit dd43295 into MultiQC:master Aug 3, 2023
16 checks passed
@ewels (Member) commented Aug 4, 2023

Now that this is in place, I think the log_filesize_limit config option is a lot less important - it was there mostly to solve a similar problem: spending ages parsing huge files looking for logs.

As a result I think I can safely bump that, so I'll change the default from 10MB to 50MB. This is a fairly common issue people run into, so it should be a nice improvement - and one that's now possible without sacrificing speed.

@vladsavelyev vladsavelyev deleted the optimisation branch August 5, 2023 20:06