add E3SM-IO logs from Theta with heatmap/dxt data #23

Merged Feb 18, 2022 (1 commit)

Conversation

shanedsnyder
Contributor

This PR includes 2 darshan logs of a larger-scale run of the E3SM-IO benchmark (I case) on the Theta supercomputer at ALCF:

  • e3sm_io_heatmap_and_dxt.darshan contains both heatmap and DXT module data
  • e3sm_io_heatmap_only.darshan contains just heatmap data for a separate run of the same benchmark

The README provides more details on job size, benchmark configuration, etc.

These logs serve a similar purpose as the ones in #22, just from a larger-scale job with lots of read/write activity.

@tylerjereddy
Collaborator

hey, now you can help fix the CI ;)

Looks like DXT tracing adds two orders of magnitude to the binary log file size?

@shanedsnyder
Contributor Author

Yeah, I'll try to sort out the CI issues, then get these both merged in. Though from @nawtrey's #22 (comment), sounds like CI is going to be failing until we get heatmap support properly in pydarshan via darshan-hpc/darshan#615

@tylerjereddy
Collaborator

That's right, we can't escape chicken-and-egg problems once the CI gets to the proper test failures.

@shanedsnyder
Contributor Author

DXT tracing is especially expensive in these examples because of the sheer number of operations (there's around a million of them).
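A minimal sketch of how the size gap between the two logs could be quantified, assuming both files are available locally. The helper names `size_ratio_orders` and `compare_logs` are hypothetical and not part of darshan or pydarshan; only the Python standard library is used.

```python
import math
import os


def size_ratio_orders(small_bytes: int, large_bytes: int) -> float:
    """Return log10(large / small): the size gap in orders of magnitude."""
    if small_bytes <= 0 or large_bytes <= 0:
        raise ValueError("sizes must be positive")
    return math.log10(large_bytes / small_bytes)


def compare_logs(heatmap_only_path: str, heatmap_and_dxt_path: str) -> float:
    """Compare two log files on disk by their size in bytes (hypothetical helper)."""
    return size_ratio_orders(
        os.path.getsize(heatmap_only_path),
        os.path.getsize(heatmap_and_dxt_path),
    )
```

For the logs in this PR, the call would be `compare_logs("e3sm_io_heatmap_only.darshan", "e3sm_io_heatmap_and_dxt.darshan")`; a result near 2 would match the "two orders of magnitude" estimate above.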

@tylerjereddy
Collaborator

Well, we could merge in conservative guards in pydarshan-devel to avoid the errors before merging Jakob's branch, but not sure if worth it.

@tylerjereddy
Collaborator

When I was trying to produce some sample cross-comparison plots on branch treddy_demo_diag_heatmap I got the traceback below the fold. May be worth checking that, I'll push the branch up on my fork just in case. (happens for both logs, but not the log in gh-22)

Traceback (most recent call last):
  File "/Users/treddy/github_projects/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.0/python-3.9.5-u6d3ai52ic4te2cyyfljbk2cfmshv2jv/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/treddy/github_projects/spack/opt/spack/darwin-bigsur-skylake/apple-clang-12.0.0/python-3.9.5-u6d3ai52ic4te2cyyfljbk2cfmshv2jv/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/__main__.py", line 3, in <module>
    main()
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/cli/__init__.py", line 164, in main
    mod.main(args)
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/cli/summary.py", line 518, in main
    report_data = ReportData(log_path=log_path)
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/cli/summary.py", line 125, in __init__
    self.report = darshan.DarshanReport(log_path, read_all=True)
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/report.py", line 374, in __init__
    self.open(filename, read_all=read_all)    
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/report.py", line 431, in open
    self.read_all()
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/report.py", line 547, in read_all
    self.read_all_heatmap_records()
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/report.py", line 627, in read_all_heatmap_records
    heatmaps[mod].add_record(rec)
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/datatypes/heatmap.py", line 78, in add_record
    raise ValueError("Record nbins is not consistent with current heatmap.")
ValueError: Record nbins is not consistent with current heatmap.

@tylerjereddy
Collaborator

Could be something in that branch, since it was just designed as a quick hack to compare the heatmaps on top of what Jakob did.

@shanedsnyder
Contributor Author

I hit that same error you mention ("Record nbins is not consistent with current heatmap."), just using darshan-hpc/darshan#665 directly, so I don't think it's anything related to changes in your fork.

I also don't hit the error with the diagonal log you submitted, so don't think it's something that affects all logs with heatmap data. The logs in this PR do have APMPI data, which your logs do not, maybe it's somehow related to that. I don't know, but I'll keep looking into it before merging.
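To illustrate the kind of check behind that message, here is a self-contained sketch. It is not pydarshan's actual `heatmap.py` code: `HeatmapSketch`, `nbins_histogram`, and the dict-shaped records are hypothetical stand-ins, assuming only that each record carries an `nbins` value and that the first record sets the expected bin count.

```python
from collections import Counter


class HeatmapSketch:
    """Hypothetical stand-in for a heatmap accumulator; not pydarshan's class."""

    def __init__(self):
        self.nbins = None
        self.records = []

    def add_record(self, rec):
        # The first record fixes the expected bin count; later ones must match.
        if self.nbins is None:
            self.nbins = rec["nbins"]
        elif rec["nbins"] != self.nbins:
            raise ValueError("Record nbins is not consistent with current heatmap.")
        self.records.append(rec)


def nbins_histogram(records):
    """Count how many records report each nbins value (diagnostic helper)."""
    return Counter(rec["nbins"] for rec in records)
```

Running something like `nbins_histogram` over the records before accumulating them would show whether a few outlier records or a systematic difference triggers the error, which could help separate a log-specific quirk from a bug in the bindings.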

@tylerjereddy
Collaborator

I suppose unintentionally problematic logs are especially welcome in the logs repo

@shanedsnyder
Contributor Author

> I suppose unintentionally problematic logs are especially welcome in the logs repo

This is true. Also, I think there could be a bug in the heatmap module bindings PR that's just being triggered by this larger scale log.

I'm going to merge both of these then provide details back on darshan-hpc/darshan#665
