Skip to content

Conversation

@judysng
Copy link
Contributor

@judysng judysng commented Jan 12, 2024

Description of changes

  • Added integration test for checking that there isn't a file handler leak in computemgtd
  • Added resource leak checking as callable functions. Called the functions in starccm test because the starccm runs for ~20 mins, which is long enough to notice the leak

Tests

  • Ran on personal Jenkins
    • Current computemgtd's number of file descriptors doesn't change over time
    • Ran with a modified computemgtd that introduces the above file handler leak -- the test fails and number of open files increases as expected

References

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@judysng judysng requested review from a team as code owners January 12, 2024 19:45
@judysng judysng added the skip-changelog-update Disables the check that enforces changelog updates in PRs label Jan 12, 2024
@judysng judysng force-pushed the resource-leak branch 2 times, most recently from 87b60f3 to 1c7e6db Compare January 12, 2024 19:49
@codecov
Copy link

codecov bot commented Jan 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (3c5d0fc) 90.21% compared to head (599feb1) 90.21%.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #6009   +/-   ##
========================================
  Coverage    90.21%   90.21%           
========================================
  Files          180      180           
  Lines        15822    15822           
========================================
  Hits         14274    14274           
  Misses        1548     1548           
Flag Coverage Δ
unittests 90.21% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@judysng judysng force-pushed the resource-leak branch 3 times, most recently from f5025f1 to fa13db4 Compare January 12, 2024 19:59


@pytest.mark.usefixtures("instance", "os", "scheduler")
def test_resource_leaks(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid having a specific integration test for it? and instead add this logic to check file descriptors and resource leak by default in all the tests. I mean, we don't need to wait 30 minutes in every test, but we can do the check.

Copy link
Contributor Author

@judysng judysng Jan 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the PR to add the logic in functions in utils.py and called it in the starccm test. Unfortunately I don't think we can add it by default to all the tests unless we add the assert in cluster creation/deletion, which I think we'd want to avoid

Copy link
Contributor

@dreambeyondorange dreambeyondorange Jan 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How long does the starccm test run typically?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

~20 mins, which is long enough for the resource leak test to fail. I tested running this starccm test with the resource leak issue here and it failed as expected.

return result.stdout.strip()


def check_file_handler_leak(remote_command_executor, slurm_commands, cluster, region):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this function name best describes what this is doing. It seems like this just gets the number of open files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, I renamed it to be get_compute_ip_to_no_files

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think the term no is less explicit than num to abbreviate the number of files. no has more binary implications or boolean implications

@judysng judysng force-pushed the resource-leak branch 4 times, most recently from da28e00 to 5f04eda Compare January 23, 2024 15:04
Signed-off-by: Judy Ng <njud@amazon.com>
Signed-off-by: Judy Ng <njud@amazon.com>
@judysng judysng merged commit 0a03492 into aws:develop Jan 23, 2024
himani2411 pushed a commit to himani2411/aws-parallelcluster that referenced this pull request Feb 6, 2024
* Add resource leak check into util, call checks in starccm test

Signed-off-by: Judy Ng <njud@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

skip-changelog-update Disables the check that enforces changelog updates in PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants