Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fail with useful error on unmatched interface (bsc#990745) #575

Closed
wants to merge 7 commits into from

Conversation

aspiers
Copy link
Member

@aspiers aspiers commented Jul 27, 2016

If we fail to find an admin interface matching a pattern such as +10g1 for a given node, don't proceed blindly, but instead bail with a helpful error message. This was experienced in a scenario where the Crowbar nodes were a mix of physical machines and VMs with slow virtio interfaces, but the network.json was incorrectly configured and the admin network conduit failed to match the slow virtio interfaces.

Also add a warning whenever a conduit cannot be matched against available interfaces, to help more quickly identify incorrectly configured network.json files, and de-obfuscate the code with comments and better variable names.

https://bugzilla.suse.com/show_bug.cgi?id=990745

map_if_ref(if_remap, if_ref)
r = map_if_ref(if_remap, if_ref)
if r.nil?
raise "Unable to find interface for #{node.fqdn} matching #{if_ref} out of #{if_remap}"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metrics/LineLength: Line is too long. 105/100

@aspiers
Copy link
Member Author

aspiers commented Jul 27, 2016

In particular, I need someone to confirm that there aren't any scenarios where it's OK to have an unmatched interface pattern. Because if there are, obviously this will break them.

@vuntz
Copy link
Member

vuntz commented Jul 28, 2016

I might be wrong, but I think it's actually okay to have this be nil in some cases, as not all conduits are used on all nodes. So I think your patch is too aggressive here. You can try setting something crazy on the storage network (which is not used unless you deploy swift or ceph), and see if the exception is triggered.

# intf_to_if_map precalculated by #build_node_map if you are
# calling this in a loop and want to avoid it being calculated
# on each call.
#

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/TrailingWhitespace: Trailing whitespace detected. (https://github.com/bbatsov/ruby-style-guide#no-trailing-whitespace)

@aspiers
Copy link
Member Author

aspiers commented Jul 28, 2016

@vuntz I think you're probably right. I've drastically improved the PR.

@aspiers
Copy link
Member Author

aspiers commented Aug 1, 2016

Not convinced that gating failure was related to the changes - forced a rebuild.

@Itxaka
Copy link
Member

Itxaka commented Aug 1, 2016

lgtm +1

@aspiers
Copy link
Member Author

aspiers commented Aug 11, 2016

@rhafer Since @vuntz isn't around, your review on this would be appreciated!

@rhafer
Copy link
Contributor

rhafer commented Aug 12, 2016

@aspiers Yes, I have this on my radar. Just didn't find the time yet to give it the attention it deserves :(

desired.downto(0,&filter)
when "?"
(desired..speeds.length).each(&filter)
desired.downto(0,&filter) unless found

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style/SpaceAfterComma: Space missing after comma. (https://github.com/bbatsov/ruby-style-guide#spaces-operators)

@aspiers
Copy link
Member Author

aspiers commented Aug 12, 2016

@rhafer OK thanks :) Rebased, slightly improved one of the comments, and fixed some broken indentation.

r = map_if_ref(if_remap, if_ref)
if r.nil?
Chef::Log.warn "Unable to find interface for #{node.fqdn} " \
"matching #{if_ref} out of #{if_remap}"
Copy link
Contributor

@rhafer rhafer Aug 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @vuntz already mentioned it's perfectly fine to have this nil here for some of the conduits. As long as that conduit is not used on the node (i.e. not network using that conduit is allocated/enabled on that node). I such a case this will cause a lot of unnecessary warnings (build_node_map is called quite frequently) in the logs. I fear that it will just confuse people.

Wouldn't be better to throw a more useful error message for cases where the conduit is actually used on a node (e.g. here: https://github.com/crowbar/crowbar-core/blob/master/chef/cookbooks/network/recipes/default.rb#L161)

Or even when trying to allocate/enable a network already (i.e. in the rails app)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I such a case this will cause a lot of unnecessary warnings (build_node_map is called quite frequently) in the logs. I fear that it will just confuse people.

Yeah OK, fair point.

Wouldn't be better to throw a more useful error message for cases where the conduit is actually used on a node

Sure, that's why this PR also adds this error.

Or even when trying to allocate/enable a network already (i.e. in the rails app)

You mean, somewhere like in NetworkService#allocate_ip_by_type? That seems to get all its network info from the databag.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhafer Would it be good enough to change this Chef::Log.warn into a .debug? Also, I'm still not sure exactly where else you are suggesting that errors should be added if the interface is nil? (See question above.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be good enough to change this Chef::Log.warn into a .debug?

Yeah, I guess that's fine.

You mean, somewhere like in NetworkService#allocate_ip_by_type?

Yes. And NetworkService#enable_interface. The advantage of doing proper checks there would be that we could actually fail with an error much earlier than during the chef-client run. (Note: I am not say that you should be the one implementing that, it just an idea that I had when reviewing your code)

That seems to get all its network info from the databag.

Yeah, it would need to be enhanced to call into NodeObject#build_node_map. Which reminds me that might want to add the debug message there as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, it's now .debug.

Based on our follow-on IRC discussion about this, I've created Add network interface validation to NetworkService#build_net_info (Trello).

@aspiers
Copy link
Member Author

aspiers commented Sep 7, 2016

@rhafer
Copy link
Contributor

rhafer commented Sep 8, 2016

+1

@aspiers I guess you want to remove the "wip" and "do not merge labels"?

@aspiers
Copy link
Member Author

aspiers commented Sep 9, 2016

Rebased to fix conflict.

Adam Spiers added 7 commits September 21, 2016 22:45
It's amazing that this critical part of Crowbar's core code has remained
cryptic for so long.  Let's make a start at documenting it so that
future generations don't suffer the same pain.
This isn't technically a bug, because in Ruby:

    a, b, c = [nil, nil]

will result in c being assigned nil.  However it's more correct to
be consistent in the number of elements the method returns.
If we fail to find an interface matching a pattern such as "+10g1"
for a given node, give some useful debugging.  We don't make it a
warning, because there are cases where we expect this to happen,
e.g. with rarely used networks such as 'os_sdn'.

This helps more quickly identify incorrectly configured network.json
files, and was experienced in a scenario where the Crowbar nodes were a
mix of physical machines and VMs with slow virtio interfaces.

https://bugzilla.suse.com/show_bug.cgi?id=990745
Catch the case where we failed to find an admin interface for a node,
and bail with a more helpful error message.

This was experienced in a scenario where the Crowbar nodes were a mix of
physical machines and VMs with slow virtio interfaces, but the
network.json was incorrectly configured and the admin network conduit
failed to match the slow virtio interfaces.

https://bugzilla.suse.com/show_bug.cgi?id=990745
"rescue nil" is not only a really bad habit, but it's completely
unnecessary when looking up a value in a Hash.
@aspiers
Copy link
Member Author

aspiers commented Sep 21, 2016

Failed gating build has been deleted, so rebased to trigger another one.

@aspiers
Copy link
Member Author

aspiers commented Nov 6, 2016

Bah, @vuntz completely rewrote this so rebasing is going to be ugly... One nice thing about gerrit is that it warns you when reviews conflict with each other. #githubsucks

@aspiers
Copy link
Member Author

aspiers commented Nov 6, 2016

@vuntz I had a quick attempt to rebase and gave up as everything looks different. Let's fix this together this week.

if interface.nil?
raise "Failed to find admin node interface for node #{node.fqdn}! " \
"Check your network barclamp proposal."
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about this: no objection about this part, as it's no regression, but I think it's not enough. We do want to avoid a crash in chef-client here, as having a rogue node could result, thanks to this, in a DoS on part of the deployment infrastructure. I'll submit a patch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #803

@aspiers
Copy link
Member Author

aspiers commented Dec 22, 2016

@vuntz @rhafer Please can you either rebase this or help me rebase it? I put a lot of effort into adding useful comments to make cryptic code readable, but since you rewrote things completely by moving a lot of stuff to Crowbar::ConduitResolver I would have to almost start from scratch, and I don't know what other interface changes you introduced during this move.

@vuntz
Copy link
Member

vuntz commented Jan 6, 2017

Obsoleted by #803

@vuntz vuntz closed this Jan 6, 2017
@aspiers aspiers deleted the bail-on-unmatched-intf branch February 8, 2017 19:04
aspiers pushed a commit to aspiers/crowbar-core that referenced this pull request Feb 8, 2017
This is a forward-port of the first commit of crowbar#575 (893bfbb) in an
attempt to make this code even clearer.  crowbar#575 never got merged because
it took too long to review it, and in the meantime all of this code
got significantly rewritten during extraction of the conduit resolver
code into a library to be shared between Chef and Crowbar.

The other (fix) part of crowbar#575 was superseded by crowbar#803.
@aspiers
Copy link
Member Author

aspiers commented Feb 8, 2017

Looks like I'll have to redo this myself. Resubmitted as #1082.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
6 participants