RX endpoint failure on 10/100/1000 adapters when high traffic volume #5

Closed
acagliano opened this issue Jun 28, 2024 · 12 comments
Labels: bug, external, wontfix

Comments

@acagliano
Collaborator

STEPS TO REPRODUCE

  1. Run the lwIPDEMO application.
  2. Wait for netif found and httpd running to print.
  3. Wait for the first IP address to print (the DHCP-assigned IP).
  4. A few more IP addresses may print.
  5. The calculator hangs at this point and needs a reset button press.
  • This will leak memory.
  • This reproduces only with an NCM adapter.
@acagliano acagliano added the bug Something isn't working label Jun 28, 2024
@acagliano acagliano assigned acagliano and tkbstudios and unassigned acagliano Jun 28, 2024
@acagliano
Collaborator Author

Assigning to @tkbstudios to test with their netchat application and other assorted utilities that do not use the HTTP client. When done, report back and re-assign to me please.

@acagliano
Collaborator Author

Possible memory management issue.

@tkbstudios
Collaborator

Works fine with my adapter, I think mine is NCM tho. Are you sure it's not something with your adapter? I'll check if it's NCM later.

@acagliano
Collaborator Author

Traced freeze to usb_HandleEvents() after many runs.

@acagliano
Collaborator Author

Trace
The freeze occurs after many runs of the main loop in main.c.

  • "handling events" prints.
  • "eth:DBG: eth callback" and "eth:DBG: eth callback done" print 3x.
  • The device freezes; "done" does not print.

Is the bug in usbdrvce code? If so, where and why? Or is it a quirk of this specific NCM adapter?
It may be worth asking others to test further with NCM devices to see how widespread this is, and perhaps to trace it to a specific adapter type/manufacturer/chipset.

@tkbstudios
Collaborator

tkbstudios commented Jul 1, 2024

I tested on my hardware. As you said, it will print out stuff if it's NCM?
For me it spams my screen with this:

handling events
eth:DBG: eth callback
eth:DBG: eth callback done

So I think it might be your hardware, or maybe the toolchain? Could it be that you need to update your toolchain and clibs to v11.2?

@acagliano
Copy link
Collaborator Author

acagliano commented Jul 1, 2024

> So I think it might be your hardware, or maybe the toolchain? Could it be that you need to update your toolchain and clibs to v11.2?

Already done. I wish I had another NCM device with me, but I only took one of each kind.
Maybe I'll have others test with the same kind of adapter.

@tkbstudios
Collaborator

Yeah most probably

@acagliano acagliano changed the title from "Device freezes once IP address acquired in lwipDEMO when using NCM adapter" to "device freezes once IP address acquired in lwipDEMO when using NCM adapter" Jul 1, 2024
@acagliano
Collaborator Author

Freeze occurs on event id 13 (USB_DEVICE_INTERRUPT).

@acagliano
Collaborator Author

I suspect that this might be a duplicate of or related to the issue detailed here: CE-Programming/toolchain#482.

NCM devices tend to have higher throughput because they concatenate multiple packet payloads into a single USB transfer, usually multiple MTUs in size. It is possible that this higher throughput is triggering the same bug detailed in the issue above, which may be why it does not reproduce with ECM.
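
To make the concatenation concrete, here is a rough C sketch of how one NCM transfer (an NTB) could be unpacked into individual Ethernet frames. It assumes the 16-bit NTB layout as I understand it from the NCM 1.0 spec (an NTH16 header with signature "NCMH", then NDP16 tables with signature "NCM0", no CRC). `struct datagram_ref` and `ntb16_list_datagrams` are names made up for this example, not code from this project.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Location of one Ethernet frame inside an NTB (hypothetical helper type). */
struct datagram_ref {
    uint16_t offset;  /* byte offset from the start of the NTB */
    uint16_t length;  /* frame length in bytes */
};

/* Read a little-endian 16-bit field. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * List the datagrams in one NTB-16 block.
 * Returns the number of entries written to out (at most max_out),
 * or -1 if the block looks malformed.
 */
int ntb16_list_datagrams(const uint8_t *ntb, size_t len,
                         struct datagram_ref *out, size_t max_out)
{
    /* NTH16: "NCMH", wHeaderLength (12), wSequence, wBlockLength, wNdpIndex */
    if (len < 12 || memcmp(ntb, "NCMH", 4) != 0 || rd16(ntb + 4) != 12)
        return -1;

    size_t limit = len;
    size_t block_len = rd16(ntb + 8);
    if (block_len != 0) {           /* 0 means the block ends with a short packet */
        if (block_len > len)
            return -1;
        limit = block_len;
    }

    size_t ndp = rd16(ntb + 10);
    size_t count = 0;
    int ndp_budget = 8;             /* arbitrary cap so a looping chain terminates */

    while (ndp != 0 && ndp_budget-- > 0) {
        /* NDP16: "NCM0", wLength, wNextNdpIndex, then (index, length) pairs */
        if (ndp + 8 > limit || memcmp(ntb + ndp, "NCM0", 4) != 0)
            return -1;
        size_t ndp_len = rd16(ntb + ndp + 4);
        if (ndp_len < 16 || ndp + ndp_len > limit)
            return -1;

        for (size_t off = 8; off + 4 <= ndp_len; off += 4) {
            size_t idx = rd16(ntb + ndp + off);
            size_t dlen = rd16(ntb + ndp + off + 2);
            if (idx == 0 || dlen == 0)
                break;              /* a zero entry terminates the list */
            if (idx + dlen > limit)
                return -1;
            if (count == max_out)
                return (int)count;  /* caller's array is full */
            out[count].offset = (uint16_t)idx;
            out[count].length = (uint16_t)dlen;
            count++;
        }
        ndp = rd16(ntb + ndp + 6);  /* follow wNextNdpIndex */
    }
    return (int)count;
}
```

One transfer can therefore hand the calculator several frames at once, where an ECM adapter would deliver one frame per transfer.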

For the time being I recommend using only ECM with this project, until either the above bug is fixed (assuming that also fixes this) or a culprit in the NCM driver code is identified.

@acagliano acagliano added the external Related to or caused by an issue out of scope of this project label Jul 6, 2024
@acagliano
Collaborator Author

FIX NOTES

It is recommended that users only use 10/100 Mbps adapters with this project. 10/100/1000 and Gigabit adapters will almost certainly exhibit the behavior described below.

Problem Detail

High Speed Ethernet device stalls/endpoint errors when backlogged.
During times of little network activity, everything works fine. However, when a lot of requests or network traffic hits the adapter, the calculator is flooded with incoming packets. You can see this with debug output enabled: the rate of text dumped to the screen increases sharply as RX frames arrive more often and get larger.

Eventually some error occurs (likely the USB buffers [on calc?] becoming backlogged) and the endpoint stops functioning properly. Subsequent attempts to queue a transfer on the bulk IN endpoint fail immediately with USB_TRANSFER_CANCELED and USB_TRANSFER_BUS_ERROR. The function would then immediately queue another transfer, which would fail again, queue another, and so on, before usb_HandleEvents could even exit. This gave the appearance that the calculator had frozen; it actually had not.

Resolution

I have resolved the freeze/memory leak by treating 3 failed retries as a fatal error. At that point the device is disabled, which stops the transfer requeues and takes the network down. This also prevents the memory leak that resulted.
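
A minimal sketch of that kind of retry cap, in C. This is not the actual driver code: the status enum only stands in for the usbdrvce codes named above, `struct rx_state` and `rx_should_requeue` are hypothetical, and counting consecutive failures (reset on any success) is my assumption about how the 3 retries are tracked.

```c
#include <stdbool.h>

/* Stand-ins for the transfer statuses mentioned in this thread. */
enum xfer_status { XFER_OK, XFER_CANCELED, XFER_BUS_ERROR };

#define RX_MAX_RETRIES 3

/* Per-device RX bookkeeping (hypothetical). */
struct rx_state {
    int consecutive_failures;
    bool disabled;
};

/*
 * Meant to be called from the RX completion callback.
 * Returns true if another IN transfer should be queued,
 * false if the device has been given up on.
 */
bool rx_should_requeue(struct rx_state *st, enum xfer_status status)
{
    if (st->disabled)
        return false;               /* already fatal, never requeue */

    if (status == XFER_OK) {
        st->consecutive_failures = 0;
        return true;
    }

    /* Failed transfer: allow a bounded number of retries, then stop. */
    if (++st->consecutive_failures >= RX_MAX_RETRIES) {
        st->disabled = true;        /* caller tears down the netif here */
        return false;
    }
    return true;
}
```

The point of the cap is that the callback can no longer requeue forever inside a single usb_HandleEvents call, so control always returns to the main loop.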

Issue is open for a possible resolution: CE-Programming/toolchain#490.
This may not be fixable in any reliable manner. Ethernet devices advertise their speeds to the network, and their speed is not easily configurable by the host unless some hardware-specific driver or firmware hack allows it. The calculator may simply be too slow a device to communicate reliably with most 10/100/1000 adapters.

@acagliano acagliano added the wontfix This will not be worked on label Jul 7, 2024
@acagliano acagliano changed the title from "device freezes once IP address acquired in lwipDEMO when using NCM adapter" to "RX endpoint failure on 10/100/1000 adapters when high traffic volume" Jul 7, 2024
@acagliano acagliano reopened this Jul 7, 2024
@acagliano
Collaborator Author

Resolution

The NCM driver needs to request a transfer of at least the NTB max size, so that each queued IN transfer can hold the largest NTB the adapter may send.
Thanks @commandblockguy
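
For illustration, a hedged C sketch of sizing the RX transfer from the adapter's NTB parameters. It assumes the 28-byte NTB Parameter Structure returned by the NCM GET_NTB_PARAMETERS request, with dwNtbInMaxSize at byte offset 4, as I understand the NCM 1.0 spec. `ncm_rx_transfer_size` is a hypothetical helper, not this project's API. Requesting less than the adapter may send in one NTB is one plausible cause of the bus errors described earlier, though this thread does not confirm the exact mechanism.

```c
#include <stddef.h>
#include <stdint.h>

#define NTB_PARAMS_LEN 28u  /* size of the NTB Parameter Structure */

/* Read little-endian fields from a byte buffer. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Pick the length to request for each bulk IN transfer.
 * params:      raw NTB Parameter Structure read from the adapter
 * rx_buf_size: size of the local receive buffer
 * Returns dwNtbInMaxSize if the buffer can hold a full NTB, else 0.
 * On 0 the caller needs a bigger buffer, or must ask the adapter
 * for smaller NTBs before queueing any transfer.
 */
uint32_t ncm_rx_transfer_size(const uint8_t *params, size_t params_len,
                              uint32_t rx_buf_size)
{
    if (params_len < NTB_PARAMS_LEN || rd16(params) < NTB_PARAMS_LEN)
        return 0;                        /* truncated or bogus structure */

    uint32_t ntb_in_max = rd32(params + 4);  /* dwNtbInMaxSize */
    if (ntb_in_max == 0 || ntb_in_max > rx_buf_size)
        return 0;

    return ntb_in_max;
}
```

On a memory-constrained device the useful branch is probably the failing one: if the reported maximum does not fit, the NCM spec defines a SET_NTB_INPUT_SIZE request for negotiating a smaller NTB, which I have not verified against this driver.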

acagliano added a commit that referenced this issue Jul 12, 2024