
SNMP-less servers are no longer connected in the NAV topology #1753

Closed
a31amit opened this issue Aug 7, 2018 · 21 comments

@a31amit

a31amit commented Aug 7, 2018

I have some servers without SNMP and some switches with SNMP. However, the Netmap weather map doesn't show some of the servers as connected.

The servers show uplinks to the switches, and their Ethernet port information is visible, but Netmap doesn't draw them as connected; it shows them as isolated nodes.

How can I troubleshoot this? Please help.

@jmbredal
Collaborator

jmbredal commented Aug 8, 2018

This is related to topology. There are some discussions about this in the nav-users mailing list archive; search for "topology". I would guess the answer in the "Navtopology Difficulties and VLAN Bandwidth load" thread is relevant.

@a31amit
Author

a31amit commented Aug 22, 2018

No luck identifying the issue so far.

@lunkwill42
Member

@a31amit Please clarify: Am I correct in understanding that the ipdevinfo page shows uplink information for a given server, yet it is isolated in the layer 2 netmap view?

@a31amit
Author

a31amit commented Sep 4, 2018

That's correct. It doesn't show as connected in the layer 2 Netmap view.
[Screenshot: nav_debug]

@a31amit
Author

a31amit commented Sep 4, 2018

@lunkwill42 let me know if you need more details. To add a note about my configuration: I am using the latest 4.8.5 appliance, and I have changed the frequency of the cron jobs, which run as the navcron user.

I tried deleting and re-creating the devices, but that didn't help.

@lunkwill42
Member

Right, @a31amit , and what are your actual view settings in Netmap when this happens?

@koukou73gr

koukou73gr commented Oct 26, 2018

Hi, is there any update on this? I have the same problem, running 4.8.6. I am quite confident my netmap was correct while I was on 4.7.x, but it has been quite some time since I looked.

Funny thing is, some servers DO show as connected, but most do not. And even those are missing the connection from their second interface to another switch.

Even my gateway shows isolated!
All my switches show properly connected to each other.

For example, the direct neighborship candidates report shows this for ceph-10-206-123-178 and host-10-206-123-174:

| Device | Port | Source | Candidate neighbor | Candidate neighbor port |
|---|---|---|---|---|
| dell5324-pclab.physics.auth.gr | g7 | lldp | ceph-10-206-123-178.physics.auth.gr | enp8s0f0 |
| dell5324-pclab.physics.auth.gr | g7 | lldp | ceph-10-206-123-178.physics.auth.gr | enp8s0f1 |
| dell5324-pclab.physics.auth.gr | g7 | lldp | ceph-10-206-123-178.physics.auth.gr | team0 |
| dell5324-pclab.physics.auth.gr | g8 | lldp | ceph-10-206-123-178.physics.auth.gr | enp8s0f0 |
| dell5324-pclab.physics.auth.gr | g8 | lldp | ceph-10-206-123-178.physics.auth.gr | enp8s0f1 |
| dell5324-pclab.physics.auth.gr | g8 | lldp | ceph-10-206-123-178.physics.auth.gr | team0 |
| dell5324-pclab.physics.auth.gr | ch1 | cam | ceph-10-206-123-178.physics.auth.gr | |
| dell5324-pclab.physics.auth.gr | ch6 | cam | ceph-10-206-123-178.physics.auth.gr | |
| dell6224-lab.physics.auth.gr | 1/0/23 | cam | ceph-10-206-123-178.physics.auth.gr | |
| n012.physics.auth.gr | ch2 | cam | ceph-10-206-123-178.physics.auth.gr | |
| n014.physics.auth.gr | ch1 | cam | ceph-10-206-123-178.physics.auth.gr | |
| n016.physics.auth.gr | g5 | cam | ceph-10-206-123-178.physics.auth.gr | |
| n016.physics.auth.gr | g5 | lldp | ceph-10-206-123-178.physics.auth.gr | eno1 |

| Device | Port | Source | Candidate neighbor | Candidate neighbor port |
|---|---|---|---|---|
| dell5324-pclab.physics.auth.gr | g3 | cam | host-10-206-123-174.physics.auth.gr | |
| dell5324-pclab.physics.auth.gr | g3 | lldp | host-10-206-123-174.physics.auth.gr | eno2 |
| dell6224-lab.physics.auth.gr | 1/0/23 | cam | host-10-206-123-174.physics.auth.gr | |
| n012.physics.auth.gr | ch1 | cam | host-10-206-123-174.physics.auth.gr | |
| n014.physics.auth.gr | ch1 | cam | host-10-206-123-174.physics.auth.gr | |
| n016.physics.auth.gr | g6 | lldp | host-10-206-123-174.physics.auth.gr | eno1 |
| n016.physics.auth.gr | g6 | lldp | host-10-206-123-174.physics.auth.gr | eno2 |
| n016.physics.auth.gr | g6 | lldp | host-10-206-123-174.physics.auth.gr | team0 |
| n016.physics.auth.gr | g6 | lldp | host-10-206-123-174.physics.auth.gr | virtpubbr0 |
| n016.physics.auth.gr | ch2 | cam | host-10-206-123-174.physics.auth.gr | |

And this is my netmap:

[Screenshot of Netmap, taken 2018-10-26]

As you can see, of the above, only ceph-10-206-123-178 shows as connected, and only to one switch, not the other.

I've changed the dell5324 and n016 switches from SW to GSW so that ip2mac runs on them (it used to run for SW too, but stopped after an upgrade on 17-4-2018; maybe that's when Netmap broke too...), hoping this would change the situation somehow, but it didn't.
 

@koukou73gr

As an addendum, here's what dell5324 and n016 think about their neighbors:
[Screenshot]
(the unrecognised neighbor is a "known" transient connection to a hosted server box)

[Screenshot]

Thanks!

@lunkwill42
Member

@koukou73gr , if I understand you correctly, ceph-10-206-123-178.physics.auth.gr should appear as connected to n016.physics.auth.gr, but it doesn't.

You provided this piece of collected candidate data:

n016.physics.auth.gr | g5 | lldp | ceph-10-206-123-178.physics.auth.gr | eno1

But: Do you have candidate data from ceph-10-206-123-178.physics.auth.gr that corroborates this?

As part of the changed topology algorithm in NAV 4.8, LLDP data that cannot be corroborated by both peers will be thrown out by NAV.

Also, thanks for providing ample data - it makes the bug report a lot more useful.
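A minimal sketch of that corroboration rule, assuming each LLDP record is a simple (device, port, neighbor, neighbor port) tuple and that both ends name their ports the same way. This is not NAV's actual topology code; `LldpRecord` and `corroborated` are hypothetical names, and `switch-a`/`switch-b` are made-up devices. Only the n016 row comes from the report above.

```python
from typing import NamedTuple


class LldpRecord(NamedTuple):
    """One LLDP neighbor entry as reported by `device` (hypothetical shape)."""
    device: str
    port: str
    neighbor: str
    neighbor_port: str


def corroborated(records: list[LldpRecord]) -> list[LldpRecord]:
    """Keep only records whose link is also reported from the other end."""
    seen = {(r.device, r.port, r.neighbor, r.neighbor_port) for r in records}
    return [
        r for r in records
        # The mirror-image record must exist for the link to be trusted.
        if (r.neighbor, r.neighbor_port, r.device, r.port) in seen
    ]


records = [
    # Reported by the switch only; the server exposes no LLDP data over SNMP.
    LldpRecord("n016.physics.auth.gr", "g5",
               "ceph-10-206-123-178.physics.auth.gr", "eno1"),
    # A switch-to-switch link reported by both ends (made-up data).
    LldpRecord("switch-a", "1", "switch-b", "2"),
    LldpRecord("switch-b", "2", "switch-a", "1"),
]

kept = corroborated(records)
print([(r.device, r.neighbor) for r in kept])
```

Since the server never reports its own side of the link, the n016 to ceph-10-206-123-178 record has no mirror image and is discarded, while the switch-to-switch link survives.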

@koukou73gr

> @koukou73gr , if I understand you correctly, ceph-10-206-123-178.physics.auth.gr should appear as connected to n016.physics.auth.gr, but it doesn't.
>
> You provided this piece of collected candidate data:
>
> n016.physics.auth.gr | g5 | lldp | ceph-10-206-123-178.physics.auth.gr | eno1
>
> But: Do you have candidate data from ceph-10-206-123-178.physics.auth.gr that corroborates this?

Nope! Nothing. No reverse entry. Only the switch-to-server connection is there, not vice versa. There isn't even a reverse entry for the connection from ceph-10-206-123-178 to the dell5324 switch, which IS drawn on the netmap!

> As part of the changed topology algorithm in NAV 4.8, LLDP data that cannot be corroborated by both peers will be thrown out by NAV.
>
> Also, thanks for providing ample data - it makes the bug report a lot more useful.

Whatever else I can collect and present to you, please ask. If there is some sort of debug switch to increase the output, please say so.

@lunkwill42
Member

@koukou73gr Well, that would explain why ceph-10-206-123-178 doesn't appear connected to n016. Does ceph-10-206-123-178.physics.auth.gr provide LLDP data over SNMP at all?

@koukou73gr

No, and it never did. None of my servers ever did; as I am now finding out, that requires extra configuration on each server. I never had this set up, but still, once upon a time my netmap was correct.

@koukou73gr

Right, reconfiguring lldpd on the server in question to connect as an SNMP subagent did the trick.

Is this now a requirement? I don't recall reading anything about it in the 4.8.x release notes.

@a31amit
Copy link
Author

a31amit commented Oct 29, 2018

That doesn't seem to be the case for me. A few servers do show as connected, but the direct neighborship candidates report lists their source as "CAM" rather than LLDP (meaning CAM data was used to build the netmap), while others don't show as connected at all. Yet I can see those connected in the Device Info and Ports views.

- Switches: Arista, monitored over SNMP
- Servers: no SNMP, no discovery method used

@lunkwill42
Copy link
Member

@koukou73gr Since NAV 4.8, the topology detector requires LLDP neighbor reports to be bilateral (we had a lot of issues with HP devices that reflected partial CDP information into LLDP records, which caused topological chaos). I can see this was not explicitly stated in the release notes, and in the case of servers, it might not have been fully thought through. It was never a requirement for servers to support SNMP at all, so we may need to rethink this decision.

@koukou73gr

@lunkwill42 , thanks for confirming this.
Please consider mentioning this in future release notes.

And sorry for hijacking the issue, it appears @a31amit has a slightly different problem.

@lunkwill42
Copy link
Member

lunkwill42 commented Nov 5, 2018

@a31amit I agree with koukou73gr that you two are probably not having the same issue, so I ask again: What are your actual view settings in Netmap when the problem occurs?

@a31amit
Copy link
Author

a31amit commented Dec 5, 2018

[Screenshot of Netmap view settings]

These are the settings I am using.

@lunkwill42
Member

Hm, @a31amit, your example from above shows a device with an uplink to Ethernet12 at ■■■sw10■ca. If you look at the details of that Ethernet 12 port in ipdevinfo, does it have a registered reciprocal connection that includes a port, or does it only have information about which device is connected?

I see a similar issue on an installation I have access to, and although the switch that shows as isolated in Netmap has a known uplink, the topology information on the other end of that link does not include the port information - which is probably why Netmap seems to discard it.

@a31amit
Copy link
Author

a31amit commented Jan 18, 2019

We have since moved that particular server, but I can confirm from another server with CAM information that its connection doesn't show any port; only the device name/IP information and VLAN details are shown.
I suspect this is because the server doesn't respond to SNMP at all; only switches are configured for SNMP in our environment.
The direct neighborship candidates report shows the source as CAM, and the candidate neighbor port and control columns are empty.

@lunkwill42
Member

@a31amit , it is definitely because NAV does not have port information from the server. It could only get that if the server either has an SNMP agent, or has an LLDP agent (in which case, the port would be identified when pulling the LLDP records from the switch).

Nevertheless, your NAV has the data that is needed to draw a proper link in Netmap, so I would say the bug is likely in Netmap, and is likely caused by Netmap ignoring the link because of the missing neighboring port information.
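To illustrate the suspected Netmap behavior, here is a hypothetical sketch (not Netmap's actual code) of building map edges from links whose far-end port may be unknown, as it is for a server NAV only sees through CAM data. `Link`, `build_edges` and all device names are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Link:
    """A topology link from a device port to a neighbor (hypothetical shape)."""
    device: str
    port: str
    neighbor: str
    neighbor_port: Optional[str]  # None when the neighbor has no SNMP/LLDP agent


def build_edges(links: list[Link], require_neighbor_port: bool) -> set[frozenset[str]]:
    """Return undirected device-to-device edges for a layer 2 map.

    With require_neighbor_port=True, links whose far-end port is unknown are
    dropped, leaving SNMP-less servers isolated; with False they are kept.
    """
    edges = set()
    for link in links:
        if require_neighbor_port and link.neighbor_port is None:
            continue
        edges.add(frozenset((link.device, link.neighbor)))
    return edges


links = [
    Link("switch-a", "Ethernet12", "server-1", None),  # CAM-derived, no server port
    Link("switch-a", "Ethernet1", "switch-b", "Ethernet1"),
]

strict = build_edges(links, require_neighbor_port=True)
lenient = build_edges(links, require_neighbor_port=False)
print(len(strict), len(lenient))
```

Under the strict rule the server link disappears and `server-1` would be drawn as isolated, even though the two device endpoints, which are all a map edge needs, are known.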

@lunkwill42 lunkwill42 self-assigned this Jan 22, 2019
@lunkwill42 lunkwill42 added the bug label Jan 22, 2019
@lunkwill42 lunkwill42 changed the title netmap weather map doesnt show some devices connected SNMP-less servers are no longer connected in the NAV topology Aug 22, 2019
@lunkwill42 lunkwill42 added this to the 4.9.8 milestone Aug 22, 2019
lunkwill42 added a commit that referenced this issue Aug 22, 2019
When the source has multiple distinct neighbor candidates, we cannot
trust a random one, so only trust when there is a single distinct
candidate.

Closes #1753
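The rule in the commit message can be sketched like this, assuming candidates are grouped per (device, port) on the reporting side. It is an illustration, not the code from the actual commit; `trusted_candidates` is a hypothetical name. The rows are four of the CAM entries from the direct neighborship candidates report posted earlier in this thread.

```python
from collections import defaultdict

# Four CAM rows from the report, as (device, port, source, candidate) tuples.
ROWS = [
    ("dell6224-lab.physics.auth.gr", "1/0/23", "cam", "ceph-10-206-123-178.physics.auth.gr"),
    ("dell6224-lab.physics.auth.gr", "1/0/23", "cam", "host-10-206-123-174.physics.auth.gr"),
    ("n016.physics.auth.gr", "g5", "cam", "ceph-10-206-123-178.physics.auth.gr"),
    ("n016.physics.auth.gr", "ch2", "cam", "host-10-206-123-174.physics.auth.gr"),
]


def trusted_candidates(rows):
    """Map (device, port) to its neighbor, but only when exactly one
    distinct candidate was seen on that port."""
    candidates = defaultdict(set)
    for device, port, _source, neighbor in rows:
        candidates[(device, port)].add(neighbor)
    return {
        key: next(iter(neighbors))
        for key, neighbors in candidates.items()
        if len(neighbors) == 1  # several distinct candidates: trust none
    }


for (device, port), neighbor in sorted(trusted_candidates(ROWS).items()):
    print(device, port, "->", neighbor)
```

Port 1/0/23 on dell6224-lab saw both servers, so neither is trusted there; the two n016 ports each saw a single server and are kept.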