Optimize parsing of 2D data dictionaries in `multiqc.utils.utils_functions.write_data_file()` #1891

sstrong99 · 2023-03-24T22:32:48Z

This comment contains a description of changes (with reason)
CHANGELOG.md has been updated

I ran into an issue with MultiQC hanging on large preseq files (10k lines). The cause seems to be the double loop over 300 samples x 10,000 lines. I did some benchmarking with the code below. The results are:

test_orig() (the original method) took 5.7s
test_opt() took 0.4s, 14x faster, but only works if you can trust that all the data are the same type
test_opt_check_all() took 3s, 1.9x faster

This PR currently implements test_opt, which requires that you trust that all the data are the same type. I'm not sure if that's an OK assumption.

def test_orig(data):
    return list(
        dict.fromkeys(
            [
                str(data_header)
                for sample_data in data.values()
                for data_header in sample_data.keys()
                if type(sample_data[data_header]) is not dict
            ]
        )
    )

def test_opt(data):
    h = set()
    for values in data.values():
        # Inspect the first value of each sample (iterating the dict itself yields keys)
        first_sub_value = next(iter(values.values()), None)
        if isinstance(first_sub_value, dict):
            continue
        h |= values.keys()
    h = [str(item) for item in h]
    return h

def test_opt_check_all(data):
    h = set()
    for values in data.values():
        passing = {k for k, v in values.items() if not isinstance(v, dict)}
        h |= passing
    h = [str(item) for item in h]
    return h

# COMMAND ----------

import timeit
import string

# Note: the repeated letters collapse to 26 unique sample keys, each with 10,000 entries
test_dict = {k: {k1: None for k1 in range(10000)} for k in list(string.ascii_lowercase) * 12}

# COMMAND ----------

timeit.timeit(lambda: test_orig(test_dict), number=50)

# COMMAND ----------

timeit.timeit(lambda: test_opt(test_dict), number=50)

# COMMAND ----------

timeit.timeit(lambda: test_opt_check_all(test_dict), number=50)

# COMMAND ----------

opt_result = test_opt(test_dict)
orig_result = test_orig(test_dict)
opt_check_all_result = test_opt_check_all(test_dict)
# Sets do not guarantee iteration order, so compare sorted lists
assert sorted(opt_result) == sorted(orig_result)
assert sorted(opt_check_all_result) == sorted(orig_result)
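
To make the trade-off between the two optimized variants concrete, here is a small sketch (not part of the PR) of how they can differ on mixed-type input. The function names `headers_first_value_only` and `headers_check_all` and the sample data are hypothetical. The first function assumes that the first value of each sample is representative of the rest.

```python
def headers_first_value_only(data):
    """Skip a sample entirely if its first value is a dict (trusts uniform types)."""
    headers = set()
    for sample_data in data.values():
        first_value = next(iter(sample_data.values()), None)
        if isinstance(first_value, dict):
            continue
        headers |= sample_data.keys()
    # Sort only to make the result deterministic for this demonstration
    return sorted(str(h) for h in headers)


def headers_check_all(data):
    """Check every value and keep only keys whose value is not a dict."""
    headers = set()
    for sample_data in data.values():
        headers |= {k for k, v in sample_data.items() if not isinstance(v, dict)}
    return sorted(str(h) for h in headers)


# Hypothetical mixed-type input: sample "s2" holds a nested dict after a scalar.
mixed = {
    "s1": {"reads": 10, "dups": 2},
    "s2": {"reads": 12, "nested": {"a": 1}},
}

fast = headers_first_value_only(mixed)
strict = headers_check_all(mixed)
print(fast)    # includes "nested", because only the first value of "s2" was inspected
print(strict)  # omits "nested"
```

If nested dicts can appear anywhere other than the first position, only the check-all variant filters them out, which is the cost of the faster shortcut.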

vladsavelyev

Thank you for the optimisation and benchmarking, that's really great!

Steven Strong and others added 3 commits March 3, 2023 17:27

optimize write_data_file

0ae010e

add changelog

b5a940a

Merge branch 'master' into optimize_write_data_file

26179dd

vladsavelyev changed the title ~~optimize write_data_file~~ optimize(core): Optimize parsing of 2D data dictionaries in multiqc.utils.utils_functions.write_data_file() Sep 1, 2023

vladsavelyev requested review from ewels and vladsavelyev September 1, 2023 17:06

vladsavelyev approved these changes Sep 1, 2023

vladsavelyev added the awaits-review Awaiting final review and merge. label Sep 1, 2023

vladsavelyev changed the title ~~optimize(core): Optimize parsing of 2D data dictionaries in multiqc.utils.utils_functions.write_data_file()~~ Optimize parsing of 2D data dictionaries in multiqc.utils.utils_functions.write_data_file() Sep 4, 2023

Update CHANGELOG.md

f5bacba

vladsavelyev approved these changes Sep 4, 2023

vladsavelyev merged commit 383127f into MultiQC:master Sep 4, 2023
3 checks passed

sstrong99 deleted the optimize_write_data_file branch September 5, 2023 14:03

vladsavelyev mentioned this pull request Oct 19, 2023

Fix column order in data exported from matplotlib linegraph with --data-format tsv --export #2143

Merged
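
The column-order fix mentioned above touches on a general Python property: a `set` does not guarantee iteration order, whereas a `dict` preserves insertion order (Python 3.7+). Below is a minimal sketch of an order-preserving header collection. The name `headers_ordered` and the sample data are hypothetical, and this is not claimed to be what #2143 implements.

```python
def headers_ordered(data):
    """Collect non-dict headers in first-seen order, using a dict as an ordered set."""
    headers = {}
    for sample_data in data.values():
        for key, value in sample_data.items():
            if not isinstance(value, dict):
                # setdefault keeps the position of the first occurrence
                headers.setdefault(str(key), None)
    return list(headers)


# Hypothetical input with keys deliberately out of alphabetical order
data = {
    "s1": {"zeta": 1, "alpha": 2},
    "s2": {"alpha": 3, "mid": 4, "nested": {"x": 1}},
}

ordered = headers_ordered(data)
print(ordered)  # ['zeta', 'alpha', 'mid']
```

This variant checks every value, so it trades some of the speed of the set-based shortcut for a stable column order.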

sstrong99 commented Mar 24, 2023 • edited by ewels
