
fix(BRE2-940): out of Nebius capacity error mapping and ufw regression #118

Merged
patelspratik merged 4 commits into main from nebiusresourceserr
May 12, 2026

Contributor

@patelspratik patelspratik commented May 8, 2026

  • Maps Nebius NotEnoughResources service errors to ErrInsufficientResources instead of treating them as quota failures.

For capacity errors, Nebius can wrap NotEnoughResources inside a serviceerror.Error / operation error. We now inspect the service-error details before falling back to generic ResourceExhausted handling, so true capacity exhaustion is classified correctly while quota failures still map to ErrOutOfQuota. Example

  • Fixes Docker published-port firewalling on Nebius images where Docker starts after cloud-init.

For firewalling, some Nebius CPU images run cloud-init before Docker creates the DOCKER-USER chain. The old setup tried to write DOCKER-USER rules too early, Docker later started with an empty chain, and docker run -p ... ports became publicly reachable. This change installs an idempotent Docker firewall script and runs it both during cloud-init and via a docker.service ExecStartPost hook.

The validation test has been failing for weeks because we now use a CPU image in our test pipeline, not a GPU one.

  • A manual repro confirmed that the published Docker port was reachable before applying the rule set and blocked after applying it.

@patelspratik patelspratik requested a review from a team as a code owner May 8, 2026 20:34
@patelspratik patelspratik force-pushed the nebiusresourceserr branch from f49162d to ec6ab83 Compare May 8, 2026 20:46
@patelspratik patelspratik changed the title fix(BRE2-940): additional Nebius capacity error mapping fix(BRE2-940): out of Nebius capacity error mapping May 8, 2026
@patelspratik patelspratik force-pushed the nebiusresourceserr branch from 287dc95 to d684d75 Compare May 9, 2026 00:54
@patelspratik patelspratik force-pushed the nebiusresourceserr branch from 1c472db to 590764d Compare May 9, 2026 01:29
drewmalin
drewmalin previously approved these changes May 9, 2026
Comment on lines +41 to +51
var serviceErr *serviceerror.Error
if errors.As(e, &serviceErr) {
	for _, detail := range serviceErr.Details {
		switch detail.(type) {
		case *serviceerror.NotEnoughResources:
			return v1.ErrInsufficientResources
		case *serviceerror.QuotaFailure:
			return v1.ErrOutOfQuota
		}
	}
}
Contributor

Nice. I wonder if the block below ever actually fires? Interesting that we are looking directly at gRPC codes there, whereas here we are testing the err type (which seems more appropriate and expected).

Contributor Author

ya I was poking through the code and was similarly confused by the casting. It seems there are a couple paths and the errors get rewrapped as serviceerror or operations. I figured no harm if this block is the capture now. Though I am wondering if the below block was just always wrong.

Comment thread v1/providers/nebius/instance.go Outdated
Comment on lines +1685 to +1686
// UFW persists its own rules in /etc/ufw; only DOCKER-USER needed a Docker
// startup hook after removing netfilter-persistent.
Contributor

This is probably going to be confusing to future readers, as it mainly documents what changed in this PR ("after removing netfilter-persistent").

Comment thread v1/providers/nebius/instance.go Outdated
Comment on lines +1673 to +1678
// DOCKER-USER is Docker's documented filter hook for this traffic. The
// ordering is important: some Nebius images run cloud-init before Docker has
// created DOCKER-USER, and Docker may create/reset the chain during daemon
// startup. We therefore install both:
// - an immediate cloud-init run for images where Docker is already active
// - a docker.service ExecStartPost hook for images where Docker starts later
Contributor

From cloudinit logs (which we should start looking at and triggering a build failure for if they do not succeed):

...
iptables: No chain/target/match by that name.
iptables: No chain/target/match by that name. 
iptables: No chain/target/match by that name.
iptables: No chain/target/match by that name. 
iptables: No chain/target/match by that name.
iptables: No chain/target/match by that name. 
iptables: No chain/target/match by that name.
iptables: No chain/target/match by that name. 
iptables: No chain/target/match by that name.
iptables: No chain/target/match by that name. 
iptables: No chain/target/match by that name.
iptables: No chain/target/match by that name. 
iptables: No chain/target/match by that name.
...

it seems like the core fix should be to add an initial line in the iptables setup to ensure the chain is created (docker itself doesn't need to be the one to create it):

iptables -N DOCKER-USER || true

it looks like the new firewall script does exactly this -- so the comment here isn't wrong to mention ordering, but it could be simplified to state "we need to ensure the chain exists so we will attempt to create it, but it's possible that this is running after docker itself created it in which case we will ignore the error and move on".
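A rough sketch of how such an idempotent script could be assembled from Go follows. The function names and the example rule are hypothetical and not taken from the PR; only `iptables -N DOCKER-USER || true` comes from the comment above. The rule guard relies on `iptables -C`, which checks whether an identical rule already exists.

```go
package main

import (
	"fmt"
	"strings"
)

// ensureChain returns a shell line that creates the chain if it is missing
// and ignores the error when the chain already exists.
func ensureChain(chain string) string {
	return fmt.Sprintf("iptables -N %s || true", chain)
}

// ensureRule returns a shell line that inserts a rule at the top of the chain
// only when `iptables -C` reports that the identical rule is not present.
// With several specs, the last one inserted ends up first in the chain.
func ensureRule(chain, spec string) string {
	return fmt.Sprintf("iptables -C %s %s 2>/dev/null || iptables -I %s %s",
		chain, spec, chain, spec)
}

// firewallScript assembles the script; the rule specs are placeholders.
func firewallScript(chain string, specs []string) string {
	lines := []string{"#!/bin/sh", ensureChain(chain)}
	for _, s := range specs {
		lines = append(lines, ensureRule(chain, s))
	}
	return strings.Join(lines, "\n") + "\n"
}

func main() {
	// Hypothetical rule: allow replies to established connections.
	fmt.Print(firewallScript("DOCKER-USER", []string{
		"-m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT",
	}))
}
```

Since every line is safe to repeat, the same script can run once from cloud-init and again each time Docker starts without duplicating rules.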

Contributor

Actually on that note:

https://github.com/brevdev/cloud/blob/main/v1/providers/shadeform/firewall.go#L19

Could you add a "iptables -N DOCKER-USER || true" on the shadeform side? Better safe than sorry.

Comment on lines +1660 to +1661
dockerServiceDropInDir = "/etc/systemd/system/docker.service.d"
dockerFirewallDropInPath = dockerServiceDropInDir + "/10-brev-firewall.conf"
Contributor

This one actually might be worth a comment as we alternatively could make a oneshot unit.

The benefit of a standalone unit is that we can inspect it individually, so we could do something like systemctl status brev-docker-firewall. With the drop-in we'll need to go straight to the docker service (e.g. journalctl -u docker). Not necessarily a bad thing, but worth considering if/when we need to answer the question "what happened to this VM's iptables?"
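For reference, a minimal sketch of what the drop-in approach amounts to is shown below. The two path constants are copied from the diff; the `dropIn` helper and the script path are hypothetical, since the PR excerpt does not show how the file contents are generated.

```go
package main

import "fmt"

const (
	dockerServiceDropInDir   = "/etc/systemd/system/docker.service.d"
	dockerFirewallDropInPath = dockerServiceDropInDir + "/10-brev-firewall.conf"
)

// dropIn renders a systemd drop-in that runs scriptPath after the Docker
// daemon has started, via ExecStartPost.
func dropIn(scriptPath string) string {
	return fmt.Sprintf("[Service]\nExecStartPost=%s\n", scriptPath)
}

func main() {
	// Hypothetical script location, used only for illustration.
	fmt.Printf("# %s\n%s", dockerFirewallDropInPath,
		dropIn("/usr/local/sbin/brev-docker-firewall.sh"))
}
```

A drop-in written this way only takes effect after `systemctl daemon-reload`, and its output lands in the docker unit's journal, which is the inspection trade-off raised in the comment above.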

@patelspratik patelspratik changed the title fix(BRE2-940): out of Nebius capacity error mapping fix(BRE2-940): out of Nebius capacity error mapping and ufw regression May 11, 2026
@patelspratik patelspratik merged commit ffcb421 into main May 12, 2026
7 of 10 checks passed
@patelspratik patelspratik deleted the nebiusresourceserr branch May 12, 2026 17:49
2 participants