Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI Rework: More Aggressive Culling Of FPGA Resources, Slack/PR Notifications #1316

Merged
merged 9 commits into from Dec 10, 2022

Conversation

abejgonzalez
Copy link
Contributor

@abejgonzalez abejgonzalez commented Dec 2, 2022

This PR attempts to fix/alleviate some CI issues that have caused resource leakages (of both manager and FPGA instances).

Existing issues found:

  • The workflow monitor did not start properly (a combination of needing all env. variables, and other small oversights). However, this wasn't apparent since there was no logging/notification of something going wrong.
  • FPGA instances were left alive for too long (workflow monitor was failing is one reason for this).

It does the following:

  • workflow_monitor.py and setup_workflow_monitor.py
    • The monitor needs access to all environment variables used in ci_variables.py. Thus, instead of setting an env.sh file that holds needed env. variables, a new ci_env dict is created (see the ci_variables.py changes) and set for the corresponding run() calls.
    • Now there are 2 verification steps to ensure that the workflow monitor at least started properly (the workflow monitor has it's own check to ensure it is properly running)
    • Since the workflow monitor has access to all env. variables needed, it doesn't have the args anymore.
    • The workflow monitor now sends PR comments when starting + failing
    • The workflow monitor now has a check_and_terminate_select_instances that has a 45m timeout on FPGA instances (see platform_lib.py for more info).
    • The workflow monitor now has a try/catch to check if failures happen (at least it should hopefully send the PR comment if something fails).
  • ci_variables.py - A new ci_env dict is populated with all env. variables used throughout CI. Instead of using python variable names for environment variables (which don't have the same naming convention) use the env. variable name for the key in the dict. All code locations are changed to account for this. This dict is type checked so that all keys are considered present. Future CI type checking will happen in a followup PR.
  • firesim-cleanup.yml - This workflow now sends a Slack message if something goes wrong. This is tied to the cull-instances script now returning an error if it culls something it expected not to cull.
  • cull-old-ci-insts.py - This now explicitly checks for FPGA instances that are live for over 1hr. Also now exits non-zero if something was terminated unexpectedly (triggers a slack message).
  • {github_}common.py - Moved GitHub specific items to a new file github_common.py to avoid a circular dependency with the platform_lib file. A new issues URL and issue function is added for PR comment posts.
  • platform_lib.py:
    • New find_select_ci_instances func. for each platform to find "select" CI instances (i.e. FPGA instances) and return them.
    • New check_and_terminate_select_instances func. to find "select" instances and terminate them. If instances are terminated then it sends a PR comment.

There are now multiple checks for FPGA instances (for extra extra extra security):

  • Linux test timeout - 30m timeout
  • Workflow Monitor - 45m timeout
  • Cull CI Instances - 1h timeout

Related PRs / Issues

N/A

UI / API Impact

N/A

Verilog / AGFI Compatibility

N/A

Contributor Checklist

  • Is this PR's title suitable for inclusion in the changelog and have you added a changelog:<topic> label?
  • Did you add Scaladoc/docstring/doxygen to every public function/method?
  • Did you add at least one test demonstrating the PR?
  • Did you delete any extraneous prints/debugging code?
  • Did you state the UI / API impact?
  • Did you specify the Verilog / AGFI compatibility impact?
  • If applicable, did you regenerate and publicly share default AGFIs?
  • If applicable, did you apply the ci:fpga-deploy label?
  • If applicable, did you apply the Please Backport label?

Reviewer Checklist (only modified by reviewer)

Note: to run CI on PRs from forks, comment @Mergifyio copy main and manage the change from the new PR.

  • Is the title suitable for inclusion in the changelog and does the PR have a changelog:<topic> label?
  • Did you mark the proper release milestone?
  • Did you check whether all relevant Contributor checkboxes have been checked?

@abejgonzalez abejgonzalez self-assigned this Dec 2, 2022
@abejgonzalez
Copy link
Contributor Author

Started up the workflow monitor for CI run 3625480853

@abejgonzalez
Copy link
Contributor Author

Something went wrong in the workflow monitor for CI run 3625480853. Verify CI instances are terminated properly. Must be checked before submitting the PR.

@abejgonzalez abejgonzalez force-pushed the fix-ci-leakages branch 2 times, most recently from 7ac2d51 to afb44e5 Compare December 6, 2022 06:27
@abejgonzalez abejgonzalez force-pushed the fix-ci-leakages branch 3 times, most recently from 570e3a6 to a337f1e Compare December 6, 2022 07:24
@abejgonzalez abejgonzalez changed the title CI TESTING CI Rework: More Aggressive Culling Of FPGA Resources, Slack/PR Notifications, Vitis Re-Enable Dec 6, 2022
@abejgonzalez
Copy link
Contributor Author

abejgonzalez commented Dec 6, 2022

While not working 100%. This PR is now ready for the 1st round of reviews.

@t14916
Copy link
Contributor

t14916 commented Dec 6, 2022

This may be adjacent to this PR in particular, but if the aim is to reduce CI costs, you can also try a more aggressive filtering approach right? For example, I think manager/python library changes will run the full gamut of scala tests and chipyard tests, but is that really necessary?

@abejgonzalez
Copy link
Contributor Author

This may be adjacent to this PR in particular, but if the aim is to reduce CI costs, you can also try a more aggressive filtering approach right? For example, I think manager/python library changes will run the full gamut of scala tests and chipyard tests, but is that really necessary?

I agree. However, I'm treating that as future work

@abejgonzalez abejgonzalez changed the base branch from combo-fixes to main December 6, 2022 21:56
@abejgonzalez abejgonzalez force-pushed the fix-ci-leakages branch 2 times, most recently from 2bbdac2 to 273f6b2 Compare December 6, 2022 22:05
@abejgonzalez
Copy link
Contributor Author

Workflow monitor started for CI run: 3633720674

@abejgonzalez
Copy link
Contributor Author

Workflow monitor started for CI run: 3633857500

@abejgonzalez
Copy link
Contributor Author

Workflow monitor started for CI run: 3633857500

@abejgonzalez
Copy link
Contributor Author

Something went wrong in the workflow monitor for CI run 3655073569. Verify CI instances are terminated properly. Must be checked before submitting the PR.

Exception Message:

can't subtract offset-naive and offset-aware datetimes

Traceback Message:

Traceback (most recent call last):
  File "/home/centos/firesim/.github/scripts/./workflow-monitor.py", line 62, in main
    platform_lib.check_and_terminate_run_farm_instances(45, ci_env['GITHUB_RUN_ID'])
  File "/home/centos/firesim/.github/scripts/platform_lib.py", line 278, in check_and_terminate_run_farm_instances
    if (datetime.datetime.now() - inst['LaunchTime']) >= datetime.timedelta(minutes=timeout):
        ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~
TypeError: can't subtract offset-naive and offset-aware datetimes

@abejgonzalez
Copy link
Contributor Author

Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR.

17 similar comments
@abejgonzalez
Copy link
Contributor Author

Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR.

@abejgonzalez
Copy link
Contributor Author

Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR.

@abejgonzalez
Copy link
Contributor Author

Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR.

@abejgonzalez
Copy link
Contributor Author

Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR.

@abejgonzalez
Copy link
Contributor Author

Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR.

@abejgonzalez
Copy link
Contributor Author

Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR.

@abejgonzalez
Copy link
Contributor Author

Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR.

@abejgonzalez
Copy link
Contributor Author

Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR.

@abejgonzalez
Copy link
Contributor Author

Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR.

@abejgonzalez
Copy link
Contributor Author

Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR.

@abejgonzalez
Copy link
Contributor Author

Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR.

@abejgonzalez
Copy link
Contributor Author

Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR.

@abejgonzalez
Copy link
Contributor Author

Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR.

@abejgonzalez
Copy link
Contributor Author

Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR.

@abejgonzalez
Copy link
Contributor Author

Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR.

@abejgonzalez
Copy link
Contributor Author

Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR.

@abejgonzalez
Copy link
Contributor Author

Uncaught 2 FPGA instance shutdown(s) detected for CI run: 3659621259. Verify CI state before submitting PR.

@abejgonzalez
Copy link
Contributor Author

Something went wrong in the workflow monitor for CI run 3661037859. Verify CI instances are terminated properly. Must be checked before submitting the PR.

Exception Message:

[Errno 2] No such file or directory: '/home/centos/actions-runner-0/_work/_temp/_github_workflow/event.json'

Traceback Message:

Traceback (most recent call last):
  File "/home/centos/firesim/.github/scripts/./workflow-monitor.py", line 66, in main
    platform_lib.check_and_terminate_run_farm_instances(0, ci_env['GITHUB_RUN_ID'])
  File "/home/centos/firesim/.github/scripts/platform_lib.py", line 302, in check_and_terminate_run_farm_instances
    issue_post(ci_env['PERSONAL_ACCESS_TOKEN'], get_issue_number(),
                                                ^^^^^^^^^^^^^^^^^^
  File "/home/centos/firesim/.github/scripts/github_common.py", line 59, in get_issue_number
    with open(ci_env['GITHUB_EVENT_PATH']) as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/home/centos/actions-runner-0/_work/_temp/_github_workflow/event.json'

@abejgonzalez
Copy link
Contributor Author

@t14916 This should fully work now. Sorry for the delay.

@abejgonzalez abejgonzalez changed the title CI Rework: More Aggressive Culling Of FPGA Resources, Slack/PR Notifications, Vitis Re-Enable CI Rework: More Aggressive Culling Of FPGA Resources, Slack/PR Notifications Dec 10, 2022
Copy link
Contributor

@t14916 t14916 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@abejgonzalez abejgonzalez merged commit 35f4f5a into main Dec 10, 2022
@abejgonzalez abejgonzalez deleted the fix-ci-leakages branch December 10, 2022 17:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants