Aldrin2 Firmware 3.0.1: NO-CARRIER even when everything else indicates link #152

Open
aep opened this issue Nov 18, 2021 · 31 comments

@aep

aep commented Nov 18, 2021

i'm puzzled if this is a bug in firmware 3.0.1 or maybe an installation error.
after moving an AS5114-48X, all links are down. SFP modules are detected:

[   66.044885] sfp sfp-1: module OEM              AXS85-192-M3     rev A    sn CSG101KA5702     dc 201024

ethtool says it's up, but ip link still shows NO-CARRIER. consequently all the routes are dead.

ethtool -S says it's receiving packets just fine, but not sending any (probably since it's marked no-carrier)

root@localhost:~# ethtool -S swp1
NIC statistics:
     good_octets_received: 3864300
     good_octets_sent: 0
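
A minimal sketch of how the "receiving but not sending" pattern above could be checked automatically. It assumes the `name: value` layout shown in the `ethtool -S` output and the two counter names from this report; the helper names `parse_ethtool_stats` and `rx_only` are hypothetical.

```python
def parse_ethtool_stats(text):
    """Parse `ethtool -S` text output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        # Skip headers such as "NIC statistics:" and anything without a numeric value.
        if sep and value.strip().isdigit():
            stats[name.strip()] = int(value)
    return stats


def rx_only(stats):
    """True when the port has received octets but never sent any."""
    received = stats.get("good_octets_received", 0)
    sent = stats.get("good_octets_sent", 0)
    return received > 0 and sent == 0


if __name__ == "__main__":
    sample = """NIC statistics:
     good_octets_received: 3864300
     good_octets_sent: 0
"""
    print(rx_only(parse_ethtool_stats(sample)))
```

On a live system the text would come from running `ethtool -S swp1` and capturing its output.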

unfortunately i never tested a cold boot before deploying. fw3.1 seemed to work fine in my tests where i plugged in the SFPs after boot.

how does carrier detection work, and is there a possibility i need to do something other than "ip link set up" for it to work?

@paulmenzel
Contributor

Is it firmware version 3.1 or 3.0.1?

@aep
Author

aep commented Nov 18, 2021

sorry, yes 3.0.1.

btw /sys/class/net/swp1/operstate is still down after ip link set up. not sure if that indicates something
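
For anyone wanting to check the same thing on several ports, here is a small sketch that reads `operstate` and `carrier` from sysfs. The function name `link_state` and the `sysfs_root` parameter are hypothetical; the parameter exists only so the function can be pointed at a test directory.

```python
from pathlib import Path


def link_state(ifname, sysfs_root="/sys/class/net"):
    """Return (operstate, carrier) for an interface; carrier is None if unreadable."""
    base = Path(sysfs_root) / ifname
    operstate = (base / "operstate").read_text().strip()
    try:
        carrier = int((base / "carrier").read_text().strip())
    except (OSError, ValueError):
        # Reading `carrier` fails with EINVAL while the interface is
        # administratively down, so treat that as "unknown".
        carrier = None
    return operstate, carrier


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        print(name, *link_state(name))
```

An `operstate` of `down` together with a `carrier` of `0` after `ip link set ... up` matches the NO-CARRIER symptom described in this report.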

@paulmenzel
Contributor

For correctness, please edit/update the title and original report.

PS: Fingers crossed, that Marvell and plvision.eu folks are going to help you quickly.

@aep changed the title from "Aldrin2 Firmware 3.1: NO-CARRIER even when everything else indicates link" to "Aldrin2 Firmware 3.0.1: NO-CARRIER even when everything else indicates link" on Nov 18, 2021
@sonoble
Contributor

sonoble commented Nov 18, 2021 via email

@aep
Author

aep commented Nov 18, 2021

There is a patch for 3.1.0 rc1 on the marvell-switching GitHub

don't think it's the firmware after all. i downgraded to 2.8.0 and still have the same issue.

Do you know what the OS/SDK version on the a385

how do i find out? i can access uboot.
or do you mean dentos? currently trying revision 3480ace, before the 3.0 update.

additional discovery: i can even see traffic being trapped to the kernel via tcpdump,
but there are no outgoing packets, possibly because the route is marked linkdown

10.100.10.0/24 dev swp17 proto kernel scope link src 10.100.10.17 offload linkdown 
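
A short sketch for listing every route that carries the `linkdown` flag, assuming the plain-text `ip route` format shown above. The helper name `linkdown_routes` is hypothetical.

```python
def linkdown_routes(route_text):
    """Return (prefix, device) pairs for routes flagged 'linkdown' in `ip route` text output."""
    dead = []
    for line in route_text.splitlines():
        tokens = line.split()
        if "linkdown" not in tokens or "dev" not in tokens:
            continue
        dev_index = tokens.index("dev") + 1
        if dev_index < len(tokens):
            # The first token is the destination prefix; the token after "dev" is the port.
            dead.append((tokens[0], tokens[dev_index]))
    return dead


if __name__ == "__main__":
    sample = "10.100.10.0/24 dev swp17 proto kernel scope link src 10.100.10.17 offload linkdown"
    print(linkdown_routes(sample))
```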

a layer2 bridge, which should be entirely within the asic, also doesn't forward or learn anything

unfortunately the driver is confusing to read since it's a large patch file.
the only time it changes carrier state might be in mvsw_pr_port_handle_event, which originates in the binary blob.

but i'm currently trying to find the origin of this message
"[ 420.757989] Aldrin2 0000:01:00.0 swp20: configuring for inband/10gbase-r link mode"

@taraschornyiplv
Contributor

can you pls provide the output of onlpdump.
also can you please test the DAC cable?

@aep
Author

aep commented Nov 18, 2021

onlpdump

see attached

onldump.txt

also can you please test the DAC cable?

unfortunately i can't. this device is already deployed.
however, i had a technician loop one of the fibers from port 47 to port 48 and the modules happily report receiving a signal:

root@localhost:~# ethtool -m swp47 | grep -i rece
	Receiver signal average optical power     : 0.5889 mW / -2.30 dBm
root@localhost:~# ethtool -m swp48 | grep -i rece
	Receiver signal average optical power     : 0.7221 mW / -1.41 dBm
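
The two units reported by `ethtool -m` are related by dBm = 10 * log10(power in mW). A small sketch that reproduces the figures above and compares them with the module's RX low alarm threshold of -17.01 dBm; the function names are hypothetical.

```python
import math


def mw_to_dbm(milliwatts):
    """Convert optical power from milliwatts to dBm."""
    return 10.0 * math.log10(milliwatts)


def rx_power_ok(milliwatts, low_alarm_dbm):
    """True when the received power is above the module's low alarm threshold."""
    return mw_to_dbm(milliwatts) > low_alarm_dbm


if __name__ == "__main__":
    for port, mw in (("swp47", 0.5889), ("swp48", 0.7221)):
        print(port, round(mw_to_dbm(mw), 2), rx_power_ok(mw, -17.01))
```

Both looped ports sit roughly 15 dB above the low alarm threshold, which supports the observation that the optics themselves see a healthy signal.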

full module output as text here

root@localhost:~# ethtool -m swp47
	Identifier                                : 0x03 (SFP)
	Extended identifier                       : 0x04 (GBIC/SFP defined by 2-wire interface ID)
	Connector                                 : 0x07 (LC)
	Transceiver codes                         : 0x10 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
	Transceiver type                          : 10G Ethernet: 10G Base-SR
	Encoding                                  : 0x06 (64B/66B)
	BR, Nominal                               : 10300MBd
	Rate identifier                           : 0x00 (unspecified)
	Length (SMF,km)                           : 0km
	Length (SMF)                              : 0m
	Length (50um)                             : 80m
	Length (62.5um)                           : 20m
	Length (Copper)                           : 0m
	Length (OM3)                              : 300m
	Laser wavelength                          : 850nm
	Vendor name                               : OEM
	Vendor OUI                                : 00:90:65
	Vendor PN                                 : AXS85-192-M3
	Vendor rev                                : A
	Option values                             : 0x00 0x1a
	Option                                    : RX_LOS implemented
	Option                                    : TX_FAULT implemented
	Option                                    : TX_DISABLE implemented
	BR margin, max                            : 0%
	BR margin, min                            : 0%
	Vendor SN                                 : CSG101KA5706
	Date code                                 : 201024
	Optical diagnostics support               : Yes
	Laser bias current                        : 6.888 mA
	Laser output power                        : 0.5330 mW / -2.73 dBm
	Receiver signal average optical power     : 0.5964 mW / -2.24 dBm
	Module temperature                        : 27.16 degrees C / 80.88 degrees F
	Module voltage                            : 3.4092 V
	Alarm/warning flags implemented           : Yes
	Laser bias current high alarm             : Off
	Laser bias current low alarm              : Off
	Laser bias current high warning           : Off
	Laser bias current low warning            : Off
	Laser output power high alarm             : Off
	Laser output power low alarm              : Off
	Laser output power high warning           : Off
	Laser output power low warning            : Off
	Module temperature high alarm             : Off
	Module temperature low alarm              : Off
	Module temperature high warning           : Off
	Module temperature low warning            : Off
	Module voltage high alarm                 : Off
	Module voltage low alarm                  : Off
	Module voltage high warning               : Off
	Module voltage low warning                : Off
	Laser rx power high alarm                 : Off
	Laser rx power low alarm                  : Off
	Laser rx power high warning               : Off
	Laser rx power low warning                : Off
	Laser bias current high alarm threshold   : 15.000 mA
	Laser bias current low alarm threshold    : 1.000 mA
	Laser bias current high warning threshold : 13.000 mA
	Laser bias current low warning threshold  : 2.000 mA
	Laser output power high alarm threshold   : 2.5118 mW / 4.00 dBm
	Laser output power low alarm threshold    : 0.1258 mW / -9.00 dBm
	Laser output power high warning threshold : 1.9952 mW / 3.00 dBm
	Laser output power low warning threshold  : 0.1584 mW / -8.00 dBm
	Module temperature high alarm threshold   : 90.00 degrees C / 194.00 degrees F
	Module temperature low alarm threshold    : -10.00 degrees C / 14.00 degrees F
	Module temperature high warning threshold : 85.00 degrees C / 185.00 degrees F
	Module temperature low warning threshold  : -5.00 degrees C / 23.00 degrees F
	Module voltage high alarm threshold       : 3.6000 V
	Module voltage low alarm threshold        : 2.9000 V
	Module voltage high warning threshold     : 3.5000 V
	Module voltage low warning threshold      : 3.0000 V
	Laser rx power high alarm threshold       : 3.1622 mW / 5.00 dBm
	Laser rx power low alarm threshold        : 0.0199 mW / -17.01 dBm
	Laser rx power high warning threshold     : 1.9952 mW / 3.00 dBm
	Laser rx power low warning threshold      : 0.0316 mW / -15.00 dBm

@aep
Author

aep commented Nov 18, 2021

err....

running onlpdump makes the links come up.

[  253.028020] Aldrin2 0000:01:00.0 swp47: Link is Up - 10Gbps/Full - flow control off
[  253.030557] Aldrin2 0000:01:00.0 swp48: Link is Up - 10Gbps/Full - flow control off
[  253.035743] IPv6: ADDRCONF(NETDEV_CHANGE): swp47: link becomes ready
[  253.050293] IPv6: ADDRCONF(NETDEV_CHANGE): swp48: link becomes ready

i tested this 3 times now to make extra sure i'm not imagining it.

  1. cold boot
  2. ip link set swp47 up
  3. observe that it is linkdown
  4. onlpdump
  5. ip link now shows LOWER_UP and the link works fine
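
Steps 3 and 5 of the reproduction above come down to inspecting the flag list that `ip link show` prints in angle brackets. A minimal sketch of that check follows; the sample lines are illustrative and not taken from the affected switch, and the helper names are hypothetical.

```python
import re


def link_flags(ip_link_line):
    """Extract the flag set from a line such as '5: swp47: <BROADCAST,UP,LOWER_UP> mtu 1500 ...'."""
    match = re.search(r"<([^>]*)>", ip_link_line)
    return set(match.group(1).split(",")) if match else set()


def has_carrier(ip_link_line):
    """True when the interface reports LOWER_UP and is not flagged NO-CARRIER."""
    flags = link_flags(ip_link_line)
    return "LOWER_UP" in flags and "NO-CARRIER" not in flags


if __name__ == "__main__":
    before = "5: swp47: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 state DOWN"
    after = "5: swp47: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP"
    print(has_carrier(before), has_carrier(after))
```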

@sonoble
Contributor

sonoble commented Nov 18, 2021

We have seen that behavior and filed a bug with Marvell. Will follow up and see if it is fixed in 3.1.0 rc1.

@paulmenzel
Contributor

We have seen that behavior and filed a bug with Marvell. Will follow up and see if it is fixed in 3.1.0 rc1.

Thank you for escalating it.

Sorry OT: It’d be great, if you used the public bug trackers, so that the whole community can benefit and participate.

@jmpolom

jmpolom commented Nov 23, 2021

We have seen that behavior and filed a bug with Marvell. Will follow up and see if it is fixed in 3.1.0 rc1.

@sonoble where was this bug filed?

This really needs to be a more transparent process since this is purportedly an open source project. I do not see this issue on the public switchdev-prestera tracker.

This is going to become a major usability issue that prevents further adoption of DENT offerings. A better solution is required here. Users need a way to directly engage with those who develop critical pieces of software and the Prestera driver is certainly one of them. How do we work with Marvell to open things up here?

cc: @storrgie @trishan @lperkov

@Mickey201
Contributor

Hi @jmpolom, I think you run too fast to a wrong conclusion.

Hi @aep, the scenario you specified sounds familiar. Adding Accton team member: @richardlee66
We suspect you might have an old CPLD version.
Please read the platform CPLD version according to the below procedure: (From U-Boot command line)

Marvell>> i2c dev 2

Major CPLD number:

Marvell>> i2c md 0x40 01 1
0001: 01

Minor CPLD number:

Marvell>> i2c md 0x40 ff 1
00ff: 03

Per the example above - The CPLD version is 1.03
The latest CPLD version used in the Marvell lab is 1.09
If you have an older version - please contact @richardlee66 from Accton team.
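
A sketch of how the two register reads combine into a version number. It assumes the `i2c md` output lines look like `0001: 01` (optionally followed by an ASCII column) and that the bytes are hexadecimal; the function names and the `LAB_VERSION` constant are hypothetical.

```python
LAB_VERSION = (1, 9)  # "1.09", the version quoted above as used in the Marvell lab


def parse_i2c_md(line):
    """Parse one `i2c md` output line such as '00ff: 05    .' into (offset, value)."""
    offset, _, rest = line.partition(":")
    return int(offset, 16), int(rest.split()[0], 16)


def cpld_version(major_line, minor_line):
    """Combine the major (offset 0x01) and minor (offset 0xff) reads into a tuple."""
    _, major = parse_i2c_md(major_line)
    _, minor = parse_i2c_md(minor_line)
    return major, minor


def format_version(version):
    """Render (1, 3) as '1.03', matching the notation used in this thread."""
    return f"{version[0]:x}.{version[1]:02x}"


if __name__ == "__main__":
    version = cpld_version("0001: 01", "00ff: 03")
    print(format_version(version), "older than lab version:", version < LAB_VERSION)
```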

@paulmenzel
Contributor

Hi @jmpolom, I think you run too fast to a wrong conclusion.

Why? If it’s a known bug, why isn’t the problem and solution documented? What is the problem actually with older CPLD versions?

Hi @aep, the scenario you specified sounds familiar. Adding Accton team member: @richardlee66 We suspect you might have an old CPLD version. Please read the platform CPLD version according to the below procedure: (From U-Boot command line)

If there are known problems, why does the Linux kernel driver not check the CPLD version, and warn about it in the log files?

[…]

@Mickey201
Contributor

Mickey201 commented Nov 23, 2021

@paulmenzel, assuming my concern about the CPLD is right - this is not a bug. It means @aep probably has a platform with ENG CPLD image.
We should ask the Accton team how they manage their platform's support - but defining it as a Marvell Switchdev Driver bug is wrong.

If there are known problems, why does the Linux kernel driver not check the CPLD version, and warn about it in the log files?

The CPLD Driver is a platform driver handled by the Accton team - please consult with them. Marvell Switchdev driver has no direct interface with the CPLD.

@jmpolom

jmpolom commented Nov 23, 2021

Hi @jmpolom, I think you run too fast to a wrong conclusion.

Why? If it’s a known bug, why isn’t the problem and solution documented? What is the problem actually with older CPLD versions?

My comment was based on earlier comments seeming to suggest a driver issue and also the comment from @sonoble suggesting a bug was filed somewhere that isn’t public. Maybe that is not accurate but I’d like to see some explanation either way.

Generally there hasn’t been a specifically identified support entry point for the Marvell-based DENT platforms. IE: if you have a hardware issue, where does a user ask for help? It has seemed to default to this issue tracker but that really is quite messy and should be better thought out. This issue tracker should not be used to provide end user device support and also to coordinate the development of an OS. It lumps a ton of disparate things into one bin and will become increasingly more difficult/painful/undesirable to interact with.

@aep
Author

aep commented Nov 23, 2021

Please read the platform CPLD version according to the below procedure: (From U-Boot command line)

Marvell>> i2c dev 2
Setting bus to 2
Marvell>> i2c md 0x40 01 1
0001: 01    .
Marvell>>  i2c md 0x40 ff 1
00ff: 05    .

It means @aep probably has a platform with ENG CPLD image.

this is a regular production device from Accton. If dentos only wants to support specific revisions of hardware, it would be nice to have that documented, so we can purchase the correct revision in the future.

This issue tracker should not be used to provide end user device support and also to coordinate the development of an OS

they're the same thing. Unless you're specifically suggesting that dentos doesn't accept outside contributions, which would explain trivial bugfix PRs being ignored.

Again, i would really appreciate if the purpose of dent is better documented. The overall tone appears to be that this is actually internal to some corporate agreement rather than for general use. Otherwise we'll have to find workarounds for the silicon that happens to be out there, as we traditionally do in linux.

@sonoble
Contributor

sonoble commented Nov 23, 2021 via email

@sonoble
Contributor

sonoble commented Nov 23, 2021 via email

@jmpolom

jmpolom commented Nov 23, 2021

This issue really highlights some major deficiencies with the DENT Project that need to be resolved sooner rather than later if we want users to stick around. I see the following as questions that need to be answered:

  • How do users engage hardware ODMs for support (ex: for things supported via ODM developed ONLP drivers)?
    • Hopefully we get things together and realize getting board support into the kernel is the best path forward, but in the meantime ONLP seems to be a thing to deal with. The ODMs seem to be solely responsible for this, so support points must be noted.
  • How do users engage ASIC OEMs (Marvell and nvidia) for support?
    • Nvidia has a documented email address to obtain support for Spectrum ASIC devices
    • Marvell has nothing documented here and they seem to completely ignore issues on their tracker. Additionally, when developers from Marvell do engage community members the tone is incredibly condescending.

Again, i would really appreciate if the purpose of dent is better documented. The overall tone appears to be that this is actually internal to some corporate agreement rather than for general use. Otherwise we'll have to find workarounds for the silicon that happens to be out there, as we traditionally do in linux.

I’d generally agree that the feeling the rest of us users are left with is that we are simply bystanders to someone else’s corporate objectives. We do not have clearly identified avenues for support with many of the major players here and it leads to a pretty lousy experience. This must be improved or we risk alienating existing users and denying future ones.

@aep
Author

aep commented Nov 23, 2021

Do you have a contact at Accton/edge-core that you can work with or a reseller?

I have requested support from the reseller, but accton is not a brand that cares about longevity of their products.
We only buy from them because it's the only dentos device available right now.

The suggestion is to update to the latest cpld.

Can we collect the CPLD images in a repo, similar to how we do it for firmware?
The e-waste problem here is rather tragic if getting dent to work requires vendor support.

@taraschornyiplv
Contributor

@aep until you get the updated CPLD image, try building an image without this commit.

@paulmenzel
Contributor

Commit fee5b08 (Modify RX loss active high for CPLD RX loss definition correction) has no commit message body describing the problem and the fix. It also does not mention the side effect with older CPLD versions.

@Mickey201
Contributor

You're right @paulmenzel, I think Taras only wanted to help @aep get the system working until Edge-Core engages.
@richardlee66, @brandonchuang, please join this discussion and provide your insight about Edge-core support plans.

@aep
Author

aep commented Nov 28, 2021

Reseller responded that Accton does not release updated CPLD images for the dentos line, so these machines are dead on arrival unless the dentos community can somehow agree on a workaround.

This line in the offending commit unfortunately confirms the sentiment of the rest of this thread.

+ It is currently not required for Amazon 'ethtool -m' support but it is intended for future use.

We could maintain a community fork of dentos that works outside of Amazon, but i'm not sure if there's even interest.

@demliu
Copy link

demliu commented Nov 29, 2021

@aep
Our Edgecore engineers looked at this issue. They are not sure this is a CPLD issue.
Latest CPLD version is here, https://accpartner.accton.com/sites/csp/ONIE/AS5114-48X/CPLD/as5114-bptfr_v01.01.09h_as4224-bptfr_v0c.02.0ah.updater. This link needs customer log in.
It’s an ONIE updater; you can apply it with the ONIE upgrade method from within the ONIE system.

@aep
Author

aep commented Dec 1, 2021

Our Edgecore engineers looked at this issue

thank you @demliu , i'm sure you realize this is not a particularly useful response, since there's nothing anyone outside of Accton can do about it. The CPLD isn't documented, and the sources aren't available. The commit that broke it is Accton specific. I'll happily help debugging the issue here, in public.

This link needs customer log in.

Please make this publicly available, or fix dentos main to work with all devices you shipped.

This open source project won't work if basic functionality requires an NDA.
But if dentos doesn't work, there's no benefit to building datacenters with Accton devices in the first place.
This is literally the only reason i'm pushing for giving Accton a chance again.

@aep
Author

aep commented Dec 12, 2021

@demliu

I have now received the CPLD from the reseller after pressure from Marvell. This might fix my issue but i'm not willing to purchase more devices until edge-core commits to making them publicly available. Dentos cannot be successful if this level of escalation is required for everyone participating. If dentos doesn't work we will stick to cisco, who have great support.

Please make them publicly available or commit to a stable ABI.

@jmpolom

jmpolom commented Dec 12, 2021

I now received the CPLD from the reseller after pressure from Marvell. This might fix my issue but i'm not willing to purchase more devices until edge-core commits to making them publicly available.

Is this an update to the CPLD firmware itself that you received?

@demliu

demliu commented Dec 13, 2021

@aep
Thank you for the feedback! Edgecore does not yet publicly offer tech support for DENT. As a customer, you can still get the available services from the tech support team. We appreciate your feedback about DENT and will let our team know.

@paulmenzel
Contributor

@aep, thank you again very much for debugging the issue. We also seem to have run into it on one device with Aldrin2 firmware 3.0.1 and CPLD firmware 1.05, the same as yours. One of our three devices showed this issue once: there had been no restart, and the link just dropped. Before that it worked fine. For you it was a little different, right? All the ports (besides management) never worked, did they?

Before we go through updating our devices, did you apply the update, and did it fix the issue?

@aep
Author

aep commented Jun 22, 2023

yes, the switch doesn't come up without the binary blobs. you need to have an exact match between dentos and the binaries, which aren't public, and we don't know which ones dentos devs test.

DENTOS never made it out of the lab unfortunately. We're too small to get a single vendor (Accton) to give us the required binary blobs. Only through pressure from Marvell did they give us anything, once. The second potential vendor (Delta) doesn't even want to sell us anything.

As to your question on the ML: I think the CPLD is just for board specific things like pinouts, power, idk. They could probably just open source it if they wanted to.

If there was a large enough community, i'm sure we could convince Marvell to just sell us the chip. the rest of the board is trivial. but i'm not seeing any traction that would make that a compelling argument. And if there was a relevant community, Accton would probably also be convinced to just publish the blobs. TLDR: unless you're facebook, give up.


7 participants