Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make lookout ingesters process PodUnschedulable errors #2478

Merged
merged 4 commits into from
May 18, 2023

Conversation

JamesMurkin
Copy link
Contributor

@JamesMurkin JamesMurkin commented May 17, 2023

The motivation for this change is that we have many runs that never get the node name set - because this information is on the PodUnschedulable error events.

This makes tracking if certain nodes have most of the issues very difficult, as most runs with errors don't get the node set

This PR:

  • No longer skips non-terminal errors, this should be fine as:
    • The only events that are non-terminal are PodUnschedulable and PodTerminated and the ingester handles both of these explicitly
    • Each event will only have 1 error
  • Sets the nodeName on PodUnschedulable error events

Longer term this flow needs more thought:

  • Possibly add a PodNodeAssigned event
  • Possibly add other events to display start up status
  • Add NodeName to other events

However for now this is a minimal change to help with investigation into problem nodes

┆Issue is synchronized with this Jira Task by Unito

The motivation for this change is that we have many runs that never get the node name set - because this information is on the PodUnschedulable error events.

This makes tracking if certain nodes have most of the issues very difficult, as most runs with errors don't get the node set

This PR:
 - No longer skips non-terminal errors, this should be fine as:
  - The only events that are non-terminal are PodUnschedulable and PodTerminated and the ingester handles both of these explicitly
  - Each event will only have 1 error
 - Set the nodeName on PodUnschedulable error events

Longer term this flow needs more thought:
 - Possibly add a PodNodeAssigned event
 - Possibly add other events to display start up status
 - Add NodeName to other events

However for now this is a minimal change to help with investigation into problem nodes
@codecov
Copy link

codecov bot commented May 17, 2023

Codecov Report

Patch coverage: 80.58% and project coverage change: +0.04 🎉

Comparison is base (b37d1dd) 58.45% compared to head (7fd9b33) 58.50%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2478      +/-   ##
==========================================
+ Coverage   58.45%   58.50%   +0.04%     
==========================================
  Files         235      235              
  Lines       29537    29526      -11     
==========================================
+ Hits        17267    17273       +6     
+ Misses      10949    10932      -17     
  Partials     1321     1321              
Flag Coverage Δ
armada-server 58.50% <80.58%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ernal/lookoutingester/instructions/instructions.go 70.05% <76.78%> (+1.84%) ⬆️
...nal/lookoutingesterv2/instructions/instructions.go 76.74% <85.10%> (+1.43%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@JamesMurkin JamesMurkin marked this pull request as ready for review May 18, 2023 11:49
carlocamurri
carlocamurri previously approved these changes May 18, 2023
@JamesMurkin JamesMurkin merged commit 517d946 into master May 18, 2023
23 checks passed
@JamesMurkin JamesMurkin deleted the lookout_ingester branch May 18, 2023 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants