Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize parsing of 2D data dictionaries in multiqc.utils.utils_functions.write_data_file() #1891

Merged
merged 4 commits into from
Sep 4, 2023

Conversation

sstrong99
Copy link
Contributor

@sstrong99 sstrong99 commented Mar 24, 2023

  • This comment contains a description of changes (with reason)
  • CHANGELOG.md has been updated

Ran into an issue with multiqc hanging with large preseq files (10k lines). It seems to be this double loop over the 300 samples x 10000 lines that is causing the hang. I did some benchmarking with the below code. The results are

  • the original method test_orig() took 5.7s
  • test_opt took 0.4s, 14x faster, but only works if you can trust that all the data are the same type
  • test_opt_check_al() took 3s, 1.9x faster

This PR currently implements test_opt, which requires that you trust that all the data are the same type. I'm not sure if that's an OK assumption.

def test_orig(data):
    return list(
        dict.fromkeys(
            [
                str(data_header)
                for sample_data in data.values()
                for data_header in sample_data.keys()
                if type(sample_data[data_header]) is not dict
            ]
        )
    )
​
def test_opt(data):
    h = set()
    for values in data.values():
        first_sub_value = next(iter(values))
        if isinstance(first_sub_value, dict):
            continue
        h |= values.keys()
    h = [str(item) for item in h]
    return hdef test_opt_check_all(data):
    h = set()
    for values in data.values():
        passing = {k for k, v in values.items() if not isinstance(v, dict)}
        h |= passing
    h = [str(item) for item in h]
    return h# COMMAND ----------import timeit
import stringtest_dict={k: {k1: None for k1 in range(10000)} for k in list(string.ascii_lowercase)*12}
​
# COMMAND ----------timeit.timeit(lambda: test_orig(test_dict), number=50)
​
# COMMAND ----------timeit.timeit(lambda: test_opt(test_dict), number=50)
​
# COMMAND ----------timeit.timeit(lambda: test_opt_check_all(test_dict), number=50)
​
# COMMAND ----------opt_result = test_opt(test_dict)
orig_result = test_orig(test_dict)
opt_check_all_result = test_opt_check_all(test_dict)
assert opt_result == orig_result
assert opt_check_all_result == orig_result

@vladsavelyev vladsavelyev changed the title optimize write_data_file optimize(core): Optimize parsing of 2D data dictionaries in multiqc.utils.utils_functions.write_data_file() Sep 1, 2023
Copy link
Member

@vladsavelyev vladsavelyev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for optimisation and benchmarking, that's really great!

@vladsavelyev vladsavelyev added the awaits-review Awaiting final review and merge. label Sep 1, 2023
@vladsavelyev vladsavelyev changed the title optimize(core): Optimize parsing of 2D data dictionaries in multiqc.utils.utils_functions.write_data_file() Optimize parsing of 2D data dictionaries in multiqc.utils.utils_functions.write_data_file() Sep 4, 2023
@vladsavelyev vladsavelyev merged commit 383127f into MultiQC:master Sep 4, 2023
3 checks passed
@sstrong99 sstrong99 deleted the optimize_write_data_file branch September 5, 2023 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaits-review Awaiting final review and merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants