Bug with IDF SPI driver ? Ethernet at same time as SPI LCD causes display corruption (IDFGH-5658) #7380
Comments
Hi there, it seems I have encountered the same bug. We use the IDF v4.2 branch. We have a custom board with an ST25R3911 connected by SPI to the ESP32. For internet connectivity we use WiFi or Ethernet. With WiFi everything works fine. With Ethernet it works fine most of the time, but it is possible to "break" the ST25R3911 by sending many ICMP requests to the ESP32 ( Meantime, in another project we use the same HW and IDF v3.3. In that case there is no such issue, so I think it's an issue in the v4.x branch. By "break" the ST25R3911 I mean that it falls into an "odd" state in which it does not process commands. |
Can confirm SPI transfer issues when ethernet is receiving packets. For us, it seems that the SPI data transmit pointer is reversed by 32 bytes, then the read pointer jumps back to its original value and the rest of the data is transferred correctly. Sometimes this reversal happens after just a few transmitted bytes (again seems 2^n length, e.g. 64 bytes); sometimes this offset exists from the beginning, for our whole SPI transfer (almost 4000 bytes). Of course I don't know whether it is the SPI data transmit pointer; just the data appears "shifted", i.e. sections repeat (for a while). Higher ethernet packet rates and larger ethernet packets make the bug appear more frequently. With the right |
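One way to measure the "shifted" data described above is to transmit a buffer in which every 16-bit word encodes its own index, capture what actually went out (for example with a logic analyzer or a MOSI-to-MISO loopback), and look for the first run of words that hold the wrong index. The following is a minimal host-side sketch of that idea; `fill_index_pattern`, `find_glitch` and `glitch_report_t` are hypothetical names, not part of ESP-IDF, and the capture step is assumed to happen elsewhere.

```c
#include <stddef.h>
#include <stdint.h>

/* Result of comparing a captured transfer against the index pattern. */
typedef struct {
    size_t first_bad; /* byte offset of the first wrong word, or len if none */
    size_t bad_len;   /* length in bytes of the first contiguous wrong run */
    long shift;       /* source offset minus expected offset, in bytes */
} glitch_report_t;

/* Fill buf so that every big-endian 16-bit word holds its own word index.
 * Hypothetical helper; len is assumed even and at most 131072 bytes. */
void fill_index_pattern(uint8_t *buf, size_t len)
{
    for (size_t i = 0; i + 1 < len; i += 2) {
        uint16_t word = (uint16_t)(i / 2);
        buf[i] = (uint8_t)(word >> 8);
        buf[i + 1] = (uint8_t)(word & 0xFF);
    }
}

/* Scan the captured bytes and describe the first run of misplaced words. */
glitch_report_t find_glitch(const uint8_t *captured, size_t len)
{
    glitch_report_t r = { len, 0, 0 };
    for (size_t i = 0; i + 1 < len; i += 2) {
        size_t seen = ((size_t)captured[i] << 8) | captured[i + 1];
        int ok = (seen == i / 2);
        if (!ok && r.bad_len == 0) {
            /* First wrong word: remember where it is and where it came from. */
            r.first_bad = i;
            r.shift = ((long)seen - (long)(i / 2)) * 2;
        }
        if (!ok) {
            r.bad_len += 2;
        } else if (r.bad_len != 0) {
            break; /* data is back in sync: end of the first wrong run */
        }
    }
    return r;
}
```

A negative `shift` means the wrong run repeats data from earlier in the buffer, which is how a read pointer that "skips back" would show up; the run length then shows whether the glitch really has a power-of-two size.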
We have been chasing this bug for 2 weeks and I think I finally identified the relationship between Ethernet and display glitches. Our display glitches could be explained by missed set x/y commands. The ILI9488 samples the D/C pin on the first falling clock after CS goes low. This requires SPI mode 3 to have an idle high CLK. We use DMA for all display writes and set the DC pin in the pre-callback. When Ethernet is not connected, I think most display writes finish immediately and the dma queue stays empty. As a result, the CS pin is toggled by the DMA SPI handler for each transfer. We fixed it by adding a write high and write low to the CS pin immediately after setting the DC pin, in the pre-callback. What led you to conclude that the pointer is wrong? |
i don't know if there is a pointer that is wrong. i can see data repeating (skipping back) during a transfer, but sometimes only part of the data in the middle of the transaction.
On November 8, 2021 1:47:52 AM PST, Elco Jacobs ***@***.***> wrote:
We have been chasing this bug for 2 weeks and I think I finally identified the relationship between Ethernet and display glitches. Our display glitches could be explained by missed set x/y commands.
The ILI9488 samples the D/C pin on the first falling clock after CS goes low. This requires SPI mode 3 to have an idle high CLK.
But that's not all.
We use DMA for all display writes and set the DC pin in the pre-callback. When Ethernet is not connected, I think most display writes finish immediately and the dma queue stays empty. As a result, the CS pin is toggled by the DMA SPI handler for each transfer.
When Ethernet is connected, the ethernet task loads the CPU or dma handler with other tasks, which allows the display dma queue to fill with more than one transfer. When these dma transfers are handled back to back, the CS pin stays low. The second transfer in the queue does not have a CS pin high to low transition and the DC pin is not resampled.
We fixed it by adding a write high and write low to the CS pin immediately after setting the DC pin, in the pre-callback.
What led you to conclude that the pointer is wrong?
|
Repeated parts of the screen could perhaps come from a missed set position command. I can't know whether your problem is the same as ours, but perhaps try toggling CS high and low after setting DC and before starting the SPI transfer. I thought it was a memory bug in the driver too, but this has fixed it for us. It still could be a timing issue with a bug in the driver, so keep us posted on what you find. |
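The CS/DC explanation above can be checked with a toy host-side model. This sketch assumes, as described in the comment, that the display latches D/C only on the first clock after a CS falling edge and keeps that value until the next falling edge. It is not ESP-IDF code: `display_model_t` and the `model_*` functions are hypothetical, and `model_transfer` stands in for one queued DMA transfer whose pre-callback sets the DC pin.

```c
#include <stdbool.h>

/* Toy model of a display that samples D/C once per CS falling edge. */
typedef struct {
    bool cs;           /* chip select level, true = high (deselected) */
    bool dc;           /* D/C pin level, true = data, false = command */
    bool latched_dc;   /* what the display currently believes D/C to be */
    bool need_sample;  /* set by a CS falling edge, cleared on first clock */
    int commands_seen; /* bytes the display interpreted as commands */
    int data_seen;     /* bytes the display interpreted as data */
} display_model_t;

void model_init(display_model_t *d)
{
    d->cs = true;
    d->dc = true;
    d->latched_dc = true;
    d->need_sample = false;
    d->commands_seen = 0;
    d->data_seen = 0;
}

void model_set_cs(display_model_t *d, bool level)
{
    if (d->cs && !level) {
        d->need_sample = true; /* high-to-low transition arms the sampler */
    }
    d->cs = level;
}

void model_clock_byte(display_model_t *d)
{
    if (d->cs) {
        return; /* not selected, byte is ignored */
    }
    if (d->need_sample) {
        d->latched_dc = d->dc;
        d->need_sample = false;
    }
    if (d->latched_dc) {
        d->data_seen++;
    } else {
        d->commands_seen++;
    }
}

/* One transfer: the "pre-callback" sets D/C and optionally pulses CS. */
void model_transfer(display_model_t *d, bool is_data, int nbytes,
                    bool pulse_cs_in_pre_cb)
{
    d->dc = is_data;
    if (pulse_cs_in_pre_cb) {
        model_set_cs(d, true);  /* workaround: force a fresh falling edge */
        model_set_cs(d, false);
    }
    for (int i = 0; i < nbytes; i++) {
        model_clock_byte(d);
    }
}
```

In this model, a command transfer followed back to back by a data transfer with CS held low leaves the data bytes interpreted as commands, while pulsing CS in the pre-callback makes each transfer resample D/C. It says nothing about data that is corrupted in the middle of a single transfer.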
Seems 90c4827 introduced this bug (bisect result). |
See #7874 for a bugfix. I don't know if this will fix it for everybody, and whether the change from 32 to 16 bytes RX burst length is sufficient in all cases, but it seems to work for us. |
Turns out that 16 fixed it for my test firmware, but I have to go to 8 bytes for our main application firmware. |
Nice find. Do you know why the burst length has an effect? |
No clue. If I had to guess, maybe the SPI DMA doesn't get time on the bus, but doesn't have a way to deal with it and instead outputs old data in its buffer and at some point somehow snaps back to the correct read address.
…On 13/11/2021 15:38, Elco Jacobs wrote:
Nice find. Do you know why the burst length has an effect?
|
" but I have to go to 8 bytes for our main application firmware." I played with these values months back, but dismissed the approach as the length required to make SPI reliable is so short that I suspect it defeats the entire point of using DMA for the transaction? |
it's the dma burst setting for Ethernet. idf 3.3 used 4 bytes.
On November 14, 2021 7:50:08 AM PST, jonshouse1 ***@***.***> wrote:
" but I have to go to 8 bytes for our main application firmware."
I played with these values months back, but dismissed the approach as the length required to make SPI reliable is so short that I suspect it defeats the entire point of using DMA for the transaction?
|
"it's the dma burst setting for Ethernet. idf 3.3 used 4 bytes." Ahh ok, interesting, that would make me simply wrong, sorry. I tried hacking the settings in the "esp-iot-solution/components/bus/spi_bus.c" code, as this seems to be the SPI driver that the SPI LCD on my VNC client project is using. I also tried tweaking the IDF SPI code but failed to make any real improvement. I vaguely remember that disabling DMA for the SPI driver fixed the issue, but I was not shocked to find it so slow as to be worthless. |
Just to confirm I changed emac_hal.c as per #7380 and rebuilt my project. I've been running the VNC client for a couple of hours and see no display corruption so far. Thanks. |
Did you try the cs toggle after each DC pin change? It does sound like a timing issue to me. Maybe shortening the burst length causes the ethernet DMA to not hold the DMA long enough to let SPI DMA queue up. Another thing worth trying is to explicitly give the ethernet DMA and SPI DMA different channels. |
Not sure who this is addressed to ? Just for clarity, I tried changing many settings in the "esp-iot-solution/components/bus/spi_bus.c" driver, the SPI display driver my project seems to be using. No change fixed the issue. Also tried changing these values below, my comments are tagged JA None of this had any positive effect. Your theory does not seem to match the observation of "corecode" or my experimentation. I since changed back to the default spi_bus.c and applied #7380 and I can confirm that seems to fix the display corruption issue for me, on the version of tools I am using.
|
You changed the DMA queue to length 1, which means you cannot queue up more than 1 transfer and will get a CS toggle before each transfer. Another thing to note is that setting the receive length to 0 for a transaction does NOT disable RX; it makes it default to the transmit length. You really need to set the RX data pointer to nullptr. I'm just trying to help find the real cause; I have solved my own problem already and have a working display at 16 MHz SPI with a DMA queue of length 10. Toggling the DC pin is done by the pre-callback in the DMA handler. Reducing the tx burst length might work, but it is just a workaround. |
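The receive-length rule mentioned above can be written down as a small host-side model. This sketch assumes full-duplex mode and mirrors only the `length`, `rxlength`, `tx_buffer` and `rx_buffer` fields of ESP-IDF's `spi_transaction_t`; `trans_model_t` and `effective_rx_bits` are hypothetical names, and flags such as the inline RX data option are ignored.

```c
#include <stddef.h>

/* Host-side stand-in for the relevant fields of spi_transaction_t. */
typedef struct {
    size_t length;         /* total data length, in bits */
    size_t rxlength;       /* receive length, in bits; 0 means "use length" */
    const void *tx_buffer; /* data to transmit */
    void *rx_buffer;       /* NULL means no receive phase */
} trans_model_t;

/* Number of bits that would be written to rx_buffer (full-duplex assumed). */
size_t effective_rx_bits(const trans_model_t *t)
{
    if (t->rx_buffer == NULL) {
        return 0; /* only a NULL pointer actually disables reception */
    }
    return t->rxlength != 0 ? t->rxlength : t->length;
}
```

Under this model, a display write with `rxlength` set to 0 but a non-NULL (or uninitialized) `rx_buffer` still receives `length` bits into whatever that pointer refers to, which is why clearing the pointer matters more than clearing the length.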
"If the rx pointer is uninitialized, SPI reception can overwrite random memory." Yes, I can see that would be the case. You talk as if this is my driver and I somehow have agency over it? I expect others above my paygrade to ship drivers that work! ... If you feel you can do better then please have a go at diagnosing and fixing the issue. My changes were mostly by blind feel; currently my test gear is packed away in storage and my skills are marginal, so clearly I am not the best person to fix the issue. I tried lots of permutations; the code I pasted was simply the state it was in when I gave up poking at it. As I said, nothing before #7380 made any difference. If you feel you can nail the issue down to a clear fault and solution then please do so. |
I was led here by the issue that @corecode created. Here you give a stack-allocated local buffer (pixels) to the display driver. If the function jag_draw_bitmap is blocking until the transfer is completed, then giving it a stack-allocated buffer is fine. I don't have the time to dive deeper to figure this out. |
? Please clarify, this is probably something I have missed. I opened this bug report and I see no issues against ESPVNCC yet.
I did wonder that; if you look back through the commits you will see several attempts at different semaphores. Frankly, I lack the skill to unpick issues if the drivers simply do not work for the one workload I am doing! I can't really do much better in my code until I get a working IDF and drivers; then I will have time (and clarity) to add the missing locks. Put simply, I cannot debug hardware, my code AND the IDF drivers at the same time. Please keep the conversation here to the core issue ("Ethernet at same time as SPI LCD causes display corruption (IDFGH-5658)"). |
PS I am also seeing a periodic infrequent crash with my code, this actually may be the issue with my code "elcojacobs" just described :-)
|
I was referring to #7874
If you are giving the SPI DMA driver a pointer to memory that is deallocated, that is a bug in your code. I am not certain there is no bug in the ESP-IDF driver. My point is just that the fact that using Ethernet and DMA at the same time causes corruption does not necessarily mean that the ESP-IDF drivers are buggy.
I am not using your code in any way, so I won't open an issue against it or fix it. I have a board with an SPI display (ILI9488) and hardware ethernet (LAN8724) and open-source firmware: https://github.com/BrewBlox/brewblox-firmware Writing DMA code is hard, you'll have to take into account many things, like
|
Elco, you are derailing this issue.
The problem is that if you use the normal SPI driver (which uses DMA internally) and you receive Ethernet frames at the same time, then occasionally the SPI transfers incorrect data.
Changing the (internal) Ethernet DMA burst size back to what it was in IDF 3.3 makes these incorrect transfers disappear. ESP bug. Maybe hardware.
On November 15, 2021 10:16:46 AM PST, Elco Jacobs ***@***.***> wrote:
> ? Please clarify, this is probably something I have missed. I opened this bug report and I see no issues against ESPVNCC yet.
I was referring to #7380
My reply should have been to that issue, but I accidentally placed it here.
> If you think the issue with display corruption is just my code then you are wrong (not saying my code might not have all kind of issues, just that the CORE issue with it is not my code!) The display corruption seems to be an interaction between the drivers for physical Ethernet and DMA driven SPI.
If you are giving the SPI DMA driver a pointer to memory that is deallocated, that is a bug in your code.
If the ethernet driver is the only piece of code overwriting that same area of memory, it is perfectly allowed to do so, because you released the memory. You could make the pixel buffer static to not release it and re-use it every time you call that function.
I am not certain there is no bug in the ESP-IDF driver. My point is just that the fact that using Ethernet and DMA at the same time causes corruption does not necessarily mean that the ESP-IDF drivers are buggy.
> If you wish to take a stab at fixing the issue here then please feel free, but only hardware using Physical Ethernet will reproduce the issue.
I am not using your code in any way, so I won't open an issue against it or fix it.
If there is an actual bug, I want to know. That's why I am replying with the issues that I found to help others fix bugs in their code or pinpoint the actual issue in the drivers.
I have a board with an SPI display (ILI9488) and hardware ethernet (LAN8724) and open-source firmware: https://github.com/BrewBlox/brewblox-firmware
I was convinced there was a bug in ESP-IDF that caused interaction between SPI and ethernet until I finally found the issue in our code that caused the interaction.
Writing DMA code is hard, you'll have to take into account many things, like
- Only allocating DMA-capable memory and only freeing it when the DMA transfer is done
- Thread-safe access to the DMA buffer
- Toggling CS/DC pin at the right time for the display
|
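The first two items in the list above (keeping the buffer alive until the transfer is done, and thread-safe access to it) can be sketched as a simple ownership handoff on a statically allocated line buffer. This is a host-side illustration using C11 atomics, not ESP-IDF code: `line_buffer_t`, `line_buffer_try_acquire` and `line_buffer_release` are hypothetical names. On a real target the release would presumably be called once the driver reports completion (for example from a post-transfer callback or after `spi_device_get_trans_result` returns), and whether the buffer is DMA-capable has to be ensured separately, for instance by allocating it with `heap_caps_malloc` and `MALLOC_CAP_DMA`.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_PIXELS 240 /* one display line of 16-bit pixels (assumed size) */

typedef struct {
    uint16_t pixels[LINE_PIXELS];
    atomic_bool in_flight; /* true while a transfer owns the buffer */
} line_buffer_t;

/* Static storage: the buffer is never released back to the stack or heap. */
static line_buffer_t g_line;

/* Returns the buffer if no transfer is using it, otherwise NULL. */
line_buffer_t *line_buffer_try_acquire(void)
{
    bool expected = false;
    if (atomic_compare_exchange_strong(&g_line.in_flight, &expected, true)) {
        return &g_line;
    }
    return NULL;
}

/* Call only after the transfer that used the buffer has completed. */
void line_buffer_release(line_buffer_t *b)
{
    atomic_store(&b->in_flight, false);
}
```

A drawing task would acquire the buffer, fill `pixels`, queue the transfer, and leave the release to the completion path; a second task that fails to acquire it knows the previous line is still being sent and must not overwrite it.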
There are 2 functions: |
I use blocking spi transmit and have problems.
…On 15/11/2021 11:27, Elco Jacobs wrote:
There are 2 functions:
|spi_device_queue_trans| and |spi_device_transmit|.
|spi_device_transmit| just calls |spi_device_queue_trans| and waits
for the (DMA) transfer to complete.
If SPI transfers have errors when using the blocking
spi_device_transmit, then I agree that it is an internal framework issue.
|
Jesus, I just added info that I think could help people at espressif solve the bug or pinpoint the problem. I'm not trying to prove anything. I spent 2 weeks chasing a display bug and I am sharing what I found in attempt to help improve the framework that we all use. |
In that case you remind me of talking to my wife: lots of words but maybe forgetting to include the basic context. Is the display bug related to an IDF display driver or do you have your own display driver? If it relates to a driver in IDF, should it not be posted against that? If it is your own display driver, then what did you learn about the interaction between Ethernet and DMA SPI? |
I had display corruption, using lvgl on top of esp-idf SPI drivers, which occurred only when Ethernet was plugged in. With lan8724 phy and internal mac. No display corruption without Ethernet plugged in. But you are convinced that this could not have the same cause as the issue reported here, so please forget anything I said. 🙄 |
Fantastic, why not lead with that!
Sounds the same as my issue. Maybe you could summarise, in a way that avoids display driver specifics, what you learnt. Is #7380 not a viable fix for the issue? If not, then why? If it is a viable fix, then why all the extra words and fluff about my code and display drivers from you? Can you simplify and summarise the additional point you are making? |
I did try to explain why a timing issue could cause the CS to stay low, which could cause the display to miss the command/data selection. I was watching this issue because I thought there was a bug in the low level drivers. I found out there maybe wasn't and figured it would be nice to share my findings.... |
my issue often presents as data in the middle of the spi transaction gets corrupted. I don't think that can be explained with CS signals.
On November 15, 2021 5:40:52 PM PST, Elco Jacobs ***@***.***> wrote:
I did try to explain why a timing issue could cause the CS to stay low, which could cause the display to miss the command/data selection.
The issue is subtle, it requires understanding of how the display samples the pin and how seemingly unrelated dma transfers can cause timing differences. So I needed a lot of words, and you misguided your anger at espressif at me because of it.
Yes I did point out other potential bugs that I found in an effort to help in case we all share a bug with similar symptoms but other causes, and you think I am an asshole because of it. A timing issue could mask both memory bugs or the DC pin sampling issue.
I was watching this issue because I thought there was a bug in the low level drivers. I found out there maybe wasn't and figured it would be nice to share my findings....
I was met with "no this is a bug in the driver, please shut up", so I tried to explain it in more detail. Only to be compared to your wife or elitist Linux devs.
|
We have at least 3 different permutations of drivers and code in this bug report that all fail in almost exactly the same way. Are you claiming that 3 sets of people have 3 different subtly wrong programs that only show breakage when Ethernet is active? That would seem very, very unlikely. One of my tests was to write every line of the LCD over and over with a single colour from a global static buffer; even that corrupts when Ethernet is active. Now are you going to claim that this code is subtly wrong, that my pointer to 240 x 16 bits of a solid colour is somehow not quite correct and does not meet some magic pixel painting constraint? |
Interesting topic. I did a quick test by putting the LAN8720 Ethernet initialization code into this example, and both of them just work fine, even without the PR fix. FYI, I didn't use the iot-solution driver but the st7789 driver located in esp-idf. We will still take that PR into consideration, and may make the burst size configurable. |
Hi @suda-morris, thanks for looking at this. You will have to send a high Ethernet packet load to make the error more likely. Depending on the packet load, I see the transmission error anywhere from once per minute to several times a second. Longer frames make the transmission error more likely. You don't even have to send IP packets; it is sufficient to send any Ethernet frame (which gets discarded immediately by lwip). I used this command:
|
I have a similar issue on my project. I recently added ethernet so there are now lots of concurrent accesses on SPI bus (3 devices: W5500, display and gpio expander). W5500 on its own works, spi driver as well (has been working for years now) but when they co-exist, there is a crash after a few seconds of high load. All transactions are polling. The crash happens during a display transaction because, although the display transaction has started, get_acquiring_dev() claims there is nobody that has acquired the device (NULL returned). Changing |
This issue is only related to the embedded emac, not external wiznet via SPI.
…On 31/12/2021 00:25, philippe44 wrote:
I have a similar issue on my project. I recently added ethernet so
there are not lots of concurrent write on SPI bus (3 devices: W5500,
display and gpio expander). W5500 on its own works, spi driver as well
(has been working for years now) but when they co-exist, there is a
crash after a few seconds of high load. All transactions are polling.
The crash happens during a display transaction because, although the
display transaction has started, get_acquiring_dev() claims there is
nobody that has acquired the device (NULL returned). Changing
|dmabmr.rx_dma_pbl = EMAC_DMA_BURST_LENGTH_8BEAT; |does not do anything
|
Yep I realized. I thought the connection with get_acquiring_dev() would suffice to give me some pointer but no. I've opened another issue #8179 |
Merges espressif/esp-idf#7874 Closes espressif/esp-idf#7380 * Original commit: espressif/esp-idf@2553fb5
This looks like a bug with the esp-idf SPI driver.
espressif/esp-iot-solution#110