Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug with IDF SPI driver ? Ethernet at same time as SPI LCD causes display corruption (IDFGH-5658) #7380

Closed
jonshouse1 opened this issue Aug 6, 2021 · 46 comments
Labels
Resolution: Done Issue is done internally Status: Resolved Issue is done internally

Comments

@jonshouse1
Copy link

This looks like a bug with the esp-idf SPI driver.

espressif/esp-iot-solution#110

@espressif-bot espressif-bot added the Status: Opened Issue is new label Aug 6, 2021
@github-actions github-actions bot changed the title Bug with IDF SPI driver ? Ethernet at same time as SPI LCD causes display corruption Bug with IDF SPI driver ? Ethernet at same time as SPI LCD causes display corruption (IDFGH-5658) Aug 6, 2021
@lexxvir
Copy link

lexxvir commented Aug 16, 2021

Hi there,

It seems I encountered the same bug. We use IDF v.4.2 branch. We have a custom board with ST25R3911 connected by SPI to ESP32. For internet connectivity we are using WiFi or Ethernet. When we use WiFi all works fine. When we use Ethernet it most time works fine but it's possible to "break" ST25R3911 by sending many ICMP requests to the ESP32 (ping -f for couple of seconds is enough). Same ping -f command does not "break" ST25R3911 when we use WiFi.

Meantime, in another project we use the same HW and IDF v.3.3. In that case there is no such an issue, so I think it's an issue in the v.4.x branch.

By "break" ST25R3911 I mean that it falls to "odd" state in that it does not process the commands.

@corecode
Copy link
Contributor

corecode commented Nov 8, 2021

Can confirm SPI transfer issues when ethernet is receiving packets. For us, it seems that the SPI data transmit pointer is reversed by 32 bytes, then the read pointer jumps back to its original value and the rest of the data is transferred correctly. Sometimes this reversal happens after just a few transmitted bytes (again seems 2^n length, e.g. 64 bytes); sometimes this offset exists from the beginning, for our whole SPI transfer (almost 4000 bytes). Of course I don't know whether it is the SPI data transmit pointer; just the data appears "shifted", i.e. sections repeat (for a while).

Higher ethernet packet rates and larger ethernet packets make the bug appear more frequently. With the right nping settings I can trigger this bug several times per second.

@elcojacobs
Copy link

We have been chasing this bug for 2 weeks and I think I finally identified the relationship between Ethernet and display glitches. Our display glitches could be explained by missed set x/y commands.

The ILI9488 samples the D/C pin on the first falling clock after CS goes low. This requires SPI mode 3 to have an idle high CLK.
But that's not all.

We use DMA for all display writes and set the DC pin in the pre-callback. When Ethernet is not connected, I think most display writes finish immediately and the dma queue stays empty. As a result, the CS pin is toggled by the DMA SPI handler for each transfer.
When Ethernet is connected, the ethernet task loads the CPU or dma handler with other tasks, which allows the display dma queue to fill with more than one transfer. When these dma transfers are handled back to back, the CS pin stays low. The second transfer in the queue does not have a CS pin high to low transition and the DC pin is not resampled.

We fixed it by adding a write high and write low to the CS pin immediately after setting the DC pin, in the pre-callback.

What led you to conclude that the pointer is wrong?

@corecode
Copy link
Contributor

corecode commented Nov 8, 2021 via email

@elcojacobs
Copy link

Repeated parts of the screen could perhaps come from a missed set position command.
Instead of overwriting the same part, you would be writing to a different position.

I can't know whether your problem is the same as ours, but perhaps try toggling CS high and low after setting DC and before starting the SPI transfer.

I thought it was a memory bug in the driver too, but this has fixed it for us. It still could be a timing issue with a bug in the driver, so keep us posted on what you find.

@corecode
Copy link
Contributor

corecode commented Nov 9, 2021

Seems 90c4827 introduced this bug (bisect result).

@corecode
Copy link
Contributor

See #7874 for a bugfix. I don't know if this will fix it for everybody, and whether the change from 32 to 16 bytes RX burst length is sufficient in all cases, but it seems to work for us.

@corecode
Copy link
Contributor

Turns out that 16 fixed it for my test firmware, but I have to go to 8 bytes for our main application firmware.

corecode added a commit to corecode/esp-idf that referenced this issue Nov 11, 2021
@elcojacobs
Copy link

Nice find. Do you know why the burst length has an effect?

@corecode
Copy link
Contributor

corecode commented Nov 14, 2021 via email

@jonshouse1
Copy link
Author

" but I have to go to 8 bytes for our main application firmware."

I played with these values months back, but dismissed the approach as the length required to make SPI reliable is so short that I suspect it defeats the entire point of using DMA for the transaction?

@corecode
Copy link
Contributor

corecode commented Nov 14, 2021 via email

@jonshouse1
Copy link
Author

"it's the dma burst setting for Ethernet. idf 3.3 used 4 bytes."

Ahh ok, interesting, that would make me simply wrong, sorry.
I neglected to compare the SPI drivers to previous versions. I mostly focused on the driver handling the display in my project rather than the driver handling Ethernet.

I tried hacking the settings in the "esp-iot-solution/components/bus/spi_bus.c" code as this seems to be the SPI driver that the SPI LCD on my VNC client project is using. I also tried tweaking the idf SPI code but failed to make any real improvements. A vaguely remember that disabling DMA for the SPI driver fixed the issue but I was not shocked when it was so slow as to be worthless.

@jonshouse1
Copy link
Author

Just to confirm I changed emac_hal.c as per #7380 and rebuilt my project.

I've been running the VNC client for a couple of hours and see no display corruption so far.

Thanks.

@elcojacobs
Copy link

Did you try the cs toggle after each DC pin change? It does sound like a timing issue to me.
Whether the CS pin toggle works because of how the display samples the pin or because it caused a small delay, I'm not sure, but I have not seen any glitches since.

Maybe shortening the burst length causes the ethernet DMA to not hold the DMA long enough to let SPI DMA queue up.

Another thing worth trying is to explicitly give the ethernet DMA and SPI DMA different channels.

@jonshouse1
Copy link
Author

jonshouse1 commented Nov 15, 2021

Did you try the cs toggle after each DC pin change?

Not sure who this is addressed to ?

Just for clarity, I tried changing many settings in the "esp-iot-solution/components/bus/spi_bus.c" driver, the SPI display driver my project seems to be using. No change fixed the issue.
I tried changing dma_chan from the original "host_id"

Also tried changing these values below, my comments are tagged JA

None of this had any positive effect. Your theory does not seem to match the observation of "corecode" or my experimentation.

I since changed back to the default spi_bus.c and applied #7380 and I can confirm that seems to fix the display corruption issue for me, on the version of tools I am using.

$ idf.py --version
ESP-IDF v4.4-dev-1594-g1d7068e4b-dirty

spi_bus_device_handle_t spi_bus_device_create(spi_bus_device_handle_t bus_handle, const spi_device_config_t *device_conf)
{
    SPI_BUS_CHECK(NULL != bus_handle, "Pointer error", NULL);
    _spi_bus_t *spi_bus = (_spi_bus_t *)bus_handle;

    _spi_device_t *spi_dev = malloc(sizeof(_spi_device_t));
    spi_device_interface_config_t devcfg = {
        .command_bits = 0,
        .address_bits = 0,
        .dummy_bits = 0,
        .clock_speed_hz = device_conf->clock_speed_hz,
        .duty_cycle_pos = 128,      //50% duty cycle
        .mode = device_conf->mode,
        .spics_io_num = device_conf->cs_io_num,
        // Keep the CS low 3 cycles after transaction, to stop slave from missing the last bit when CS has less propagation delay than CLK
        .cs_ena_posttrans = 4,          // JA, was 3 now 5
        .cs_ena_pretrans = 4,           // JA added, pause after chip select before spi write
        //.flags = SPI_DEVICE_HALFDUPLEX,       // JA tried, kills touch driver
        .queue_size = 1                 // JA changed from 3 to 1
`

@elcojacobs
Copy link

elcojacobs commented Nov 15, 2021

You changed the dma queue to length 1, which means you cannot queue up more than 1 transfer and will get a CS toggle before each transfer.
If that didn't help, my theory is incorrect.

Another thing to note is that setting the receive length to 0 for a transaction is NOT disabling rx, that will make it default to the transmit length. You really need to set the rx data pointer to nullptr.
If the rx pointer is uninitialized, SPI reception can overwrite random memory.

I'm just trying to help find the real cause, I have solved my own problem already and have a working display at 16mhz SPI with a dma queue of length 10. Toggling the DC pin is done bij de pre-callback in the dma handler.

Reducing the tx burst length might work, but it is just a workaround.

@jonshouse1
Copy link
Author

jonshouse1 commented Nov 15, 2021

"If the rx pointer is uninitialized, SPI reception can overwrite random memory."

Yes I can see that would be the case. You talk as if this is my driver and I somehow have agency over it?, I expect others above my paygrade to ship drivers that work! ... if you feel you can do better then please have a go at diagnosing and fixing the issue.

My changes where mostly by blind feel, currently my test gear is packed away in storage and my skills are marginal so clearly I am not the best person to fix the issue. I tried lots or permutations, the code I pasted was simply the state it was in when I gave up my poking at it. As I said nothing before #7380 made any difference, if you feel you can nail the issue down to a clear fault and solution then please do so.

@elcojacobs
Copy link

I was led here by the issue that @corecode created.
I had a quick look at your use of the driver and I do see a potential problem:

https://github.com/jonshouse1/ESPVNCC/blob/49cb7454c9cb6a9386fb05162651ef44c0a3c8cc/main/lcd_vncc.c#L363

Here you give a stack-allocated local buffer (pixels) to the display driver.
If this driver sends the data asynchronously using DMA, then this function will return and the buffer will be deallocated. This means the memory can be overwritten before it is transferred by DMA.

If the function jag_draw_bitmap is blocking until the transfer is completed, then giving it a stack-allocated buffer is fine. I don't have the time to dive deeper to figure this out.
If the jag_draw_bitmap function queues a transfer and doesn't wait for completion, then anything the processor does between the return of this function and the actual transfer can overwrite your pixels.

@jonshouse1
Copy link
Author

jonshouse1 commented Nov 15, 2021

I was led here by the issue that @corecode created.

? Please clarify, this is probably something I have missed. I opened this bug report and I see no issues against ESPVNCC yet.

Here you give a stack-allocated local buffer (pixels) to the display driver.

I did wonder that, if you look back through the commits you will see several attempts at different semaphores. Frankly I lack the skill to unpick issues if the drivers simply do not work for the one workload I am doing!
If you think the issue with display corruption is just my code then you are wrong (not saying my code might not have all kind of issues, just that the CORE issue with it is not my code!) The display corruption seems to be an interaction between the drivers for physical Ethernet and DMA driven SPI .
I've written several bits of test code that prove this to be the case. IE Ethernet alone works fine, SPI Display alone works fine, only the two together cause issues, even if the code does not process the network data the interaction still exists.
Rebuilding ESPVNCC to use Wifi 100% fixes it. Adding a delay between Ethernet RX and SPI write improves it, other people see the exact same issue and so on ...... I am 100% convinced that this is not just my issue, but an actual issue.

I can't really do much better in my code until I get a working IDF and drivers, then I will have time (and clarity) to add the missing locks. Put simply I can not debug hardware, my code AND the IDF drivers at the same time.

Please keep the conversation here to the core issue ("Ethernet at same time as SPI LCD causes display corruption (IDFGH-5658)"
If you want to open an issue against ESPVNCC or email me directly then I welcome your comments, fixes or improvements. This was my first project of its type and a new implementation of a VNC client from scratch so I do not expect it not to be bug free, nor do I regard myself as the last word in code quality :-)
If you wish to take a stab at fixing the issue here then please feel free, but only hardware using Physical Ethernet will reproduce the issue.
Thanks.

@jonshouse1
Copy link
Author

PS I am also seeing a periodic infrequent crash with my code, this actually may be the issue with my code "elcojacobs" just described :-)
Will others will using #7380 please check the long term stability of their projects. As I said probably just my issue at this point?


rect 00004 xpos=00003 ypos=00068 width=00237 height=00013 et=0 took 10ms
rect 00005 xpos=00003 ypos=00094 width=00237 height=00104 et=0 took 80ms
Got VNC_SMT_FRAMEBUFFERUPDATE 1 rectangles
assertion "handle == get_acquiring_dev(host)" failed: file "IDF/components/driver/spi_master.c", line 949, function: spi_device_polling_end

abort() was called at PC 0x40100b1b on core 0


Backtrace:0x400d3b37:0x3ffce5900x4008794d:0x3ffce5b0 0x4008d60a:0x3ffce5d0 0x40100b1b:0x3ffce640 0x40085559:0x3ffce670 0x400855cd:0x3ffce6a0 0x400db033:0x3ffce6c0 0x400db185:0x3ffce720 0x400db2c4:0x3ffce750 0x400dbf44:0x3ffce790 0x400dc047:0x3ffce7d0 0x400d9310:0x3ffce800 0x400d9ca6:0x3ffce820 0x400d9dfe:0x3ffcf060 0x400da2f1:0x3ffcf0a0 0x4008a769:0x3ffcf1d0 


@elcojacobs
Copy link

elcojacobs commented Nov 15, 2021

? Please clarify, this is probably something I have missed. I opened this bug report and I see no issues against ESPVNCC yet.

I was referring to #7874

If you think the issue with display corruption is just my code then you are wrong (not saying my code might not have all kind of issues, just that the CORE issue with it is not my code!) The display corruption seems to be an interaction between the drivers for physical Ethernet and DMA driven SPI.

If you are giving the SPI DMA driver a pointer to memory that is deallocated, that is a bug in your code.
If the ethernet driver is the only piece of code overwriting that same area of memory, it is perfectly allowed to do so, because you released the memory. You could make the pixel buffer static to not release it and re-use it every time you call that function.

I am not certain there is no bug in the ESP-IDF driver. My point is just that the fact that using ethernet and DMA at the same time causes corruption does not mean necessarily that the ESP-IDF drivers are buggy.

If you wish to take a stab at fixing the issue here then please feel free, but only hardware using Physical Ethernet will reproduce the issue.

I am not using your code in any way, so I won't open an issue against it or fix it.
If there is an actual bug, I want to know. That's why I am replying with the issues that I found to help others fix bugs in their code or pinpoint the actual issue in the drivers.

I have a board with an SPI display (ILI9488) and hardware ethernet (LAN8724) and open-source firmware: https://github.com/BrewBlox/brewblox-firmware
I was convinced there was a bug in ESP-IDF that caused interaction between SPI and ethernet until I finally found the issue in our code that caused the interaction.

Writing DMA code is hard, you'll have to take into account many things, like

  • Only allocating DMA-capable memory and only freeing it when the DMA transfer is done
  • Thread-safe access to the DMA buffer
  • Toggling CS/DC pin at the right time for the display

@corecode
Copy link
Contributor

corecode commented Nov 15, 2021 via email

@elcojacobs
Copy link

There are 2 functions:
spi_device_queue_trans and spi_device_transmit.
spi_device_transmit just calls spi_device_queue_trans and waits for the (DMA) transfer to complete.
If SPI transfers have errors when using the blocking spi_device_transmit, then I agree that it is an internal framework issue.

@corecode
Copy link
Contributor

corecode commented Nov 15, 2021 via email

@elcojacobs
Copy link

Jesus, I just added info that I think could help people at espressif solve the bug or pinpoint the problem.

I'm not trying to prove anything. I spent 2 weeks chasing a display bug and I am sharing what I found in attempt to help improve the framework that we all use.

@jonshouse1
Copy link
Author

I'm not trying to prove anything. I spent 2 weeks chasing a display bug and I am sharing what I found in attempt to help improve the framework that we all use.

In that case you remind me of talking to my wife, lots of words but maybe forgetting to include the basic context.

Is the display bug related to to an IDF display driver or do you have your own display driver? If it relates to a driver in IDF then it should not be posted against that? If it is your own display driver then what did you learn about the interaction between Ethernet and DMA SPI.
If you are NOT using Ethernet and SPI then you are wasting your time and ours as this is ONLY about that combination and its interaction, you seem to be writing a tutorial on SPI displays rather than contributing to this bug report best I can tell?

@elcojacobs
Copy link

I had display corruption, using lvgl on top of esp-idf SPI drivers, which occurred only when Ethernet was plugged in. With lan8724 phy and internal mac.

No display corruption without Ethernet plugged in.
If Ethernet was actually used, I would get display glitches, unplug the cable, no glitches.
The app could have 2 network connections and 2 ip addresses, so I can say that I had no glitches with wifi with the same code firmware binary.

But you are convinced that this could not have the same cause as the issue reported here, so please forget anything I said. 🙄

@jonshouse1
Copy link
Author

I had display corruption, using lvgl on top of esp-idf SPI drivers, which occurred only when Ethernet was plugged in. With lan8724 phy and internal mac.

No display corruption without Ethernet plugged in.
If Ethernet was actually used, I would get display glitches, unplug the cable, no glitches.
The app could have 2 network connections and 2 ip addresses, so I can say that I had no glitches with wifi with the same code firmware binary.

Fantastic, why not lead with that!

But you are convinced that this could not have the same cause as the issue reported here, so please forget anything I said.

Sounds the same as my issue.
How does a long tutorial on writing display drivers, Data/CMD and chip selects on SPI displays help with this. You seem to be saying that the DMA transaction stops in the wrong place, but wrong relative to what? The information you provide seems to relate to your experience of a specific display driver, and (had you read anything I had written) you would know not the display driver I am using. As this seems to effect any SPI+Ethernet combination then why does the fault not lie with the Ethernet driver as #7380 suggests?

Maybe you could summarise in way avoiding display driver specifics what you learnt. Is #7380 is not a viable fix for the issue?, if not then why? If it is a viable fix, then why all the extra words and fluff about my code and display drivers from you, can you simplify and summarise the additional point you are making?

@elcojacobs
Copy link

elcojacobs commented Nov 16, 2021

I did try to explain why a timing issue could cause the CS to stay low, which could cause the display miss the command/data selection.
The issue is subtle, it requires understanding of how the display samples the pin and how seemingly unrelated dma transfers can cause timing differences. So I needed a lot of words, and you misguided your anger at espressif at me because of it.
Yes I did point out other potential bugs that I found in an effort to help in case we all share a bug with similar symptoms but other causes, and you think I am an asshole because of it. A timing difference could mask both memory bugs or the DC pin sampling issue. Changing the ethernet burst length changes the time an Ethernet DMA transfer takes.

I was watching this issue because I thought there was a bug in the low level drivers. I found out there maybe wasn't and figured it would be nice to share my findings....
I was met with "no this is a bug in the driver, please shut up", so I tried to explain it in more detail. Only to be compared to your wife or elitist Linux devs.

@corecode
Copy link
Contributor

corecode commented Nov 16, 2021 via email

@jonshouse1
Copy link
Author

I did try to explain why a timing issue could cause the CS to stay low, which could cause the display miss the command/data selection.

We have at least 3 different permutations of drivers and code that all fail to work in almost exactly the same way in this bug report. Are you claiming that 3 sets people have 3 different subtly wrong programs that only show breakage when Ethernet is active because that would seem very very unlikely?
You seem to have grasped the premise that my code is faulty, hence the report is invalid and the others are also invalid, ok possible, but unlikely and as I say flatly contradicted by the tests I did (I guess you are forcing me to re-write or find them ...)

One of my tests was to write every line of the LCD over and over with a single colour from a global static buffer, even that corrupts when Ethernet is active, now you are going to claim that this code is subtly wrong? that my pointer to 240 x16 bits of a solid colour is somehow not quite correct, does not meet some magic pixel painting constraint ?

@suda-morris
Copy link
Collaborator

Interesting topic, I did a quick test by putting the LAN8720 Ethernet initialization code into this example, and both of them just work fine, even without the PR fix FYI, I didn't use the iot-solution driver but the st7789 driver located in esp-idf. For that PR, we will still take it into consideration, maybe will make the burst size configurable.

@corecode
Copy link
Contributor

Hi @suda-morris, thanks for looking at this. You will have to send a high ethernet packet load to make it become more likely. Depending on the packet load, I see the transmission error once per minute or several times a second. Longer frames make the transmission error more likely.

You don't even have to send IP packets; it is sufficient to send any ethernet frame (that get discarded immediately by lwip).

I used this command:

sudo nping --count 0 --quiet --rate 100000 --ether-type 0x0700 --dest-mac cc:50:e3:be:f7:5b --data-length 1400 169.254.92.111

@espressif-bot espressif-bot added Status: In Progress Work is in progress and removed Status: Opened Issue is new labels Nov 26, 2021
@philippe44
Copy link

philippe44 commented Dec 31, 2021

I have a similar issue on my project. I recently added ethernet so there are now lots of concurrent accesses on SPI bus (3 devices: W5500, display and gpio expander). W5500 on its own works, spi driver as well (has been working for years now) but when they co-exist, there is a crash after a few seconds of high load. All transactions are polling.

The crash happens during a display transaction because, although the display transaction has started, get_acquiring_dev() claims there is nobody that has acquired the device (NULL returned). Changing dmabmr.rx_dma_pbl = EMAC_DMA_BURST_LENGTH_8BEAT; does not do anything

@corecode
Copy link
Contributor

corecode commented Dec 31, 2021 via email

@philippe44
Copy link

philippe44 commented Jan 1, 2022

This issue is only related to the embedded emac, not external wiznet via SPI.

Yep I realized. I thought the connection with get_acquiring_dev() would suffice to give me some pointer but no. I've opened another issue #8179

@espressif-bot espressif-bot added Status: Reviewing Issue is being reviewed and removed Status: In Progress Work is in progress labels Feb 7, 2022
@espressif-bot espressif-bot added Resolution: Done Issue is done internally Status: Resolved Issue is done internally and removed Status: Reviewing Issue is being reviewed labels Feb 15, 2022
gabsuren pushed a commit to gabsuren/esp-protocols-1 that referenced this issue Apr 8, 2022
dokmic pushed a commit to dokmic/eth2wlan that referenced this issue May 8, 2022
gabsuren pushed a commit to gabsuren/esp-protocols-1 that referenced this issue May 17, 2022
gabsuren pushed a commit to gabsuren/esp-protocols-1 that referenced this issue May 27, 2022
gabsuren pushed a commit to gabsuren/esp-protocols-1 that referenced this issue May 27, 2022
0xFEEDC0DE64 pushed a commit to 0xFEEDC0DE64/esp-protocols that referenced this issue Jun 30, 2022
euripedesrocha pushed a commit to euripedesrocha/esp-protocols that referenced this issue Oct 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Resolution: Done Issue is done internally Status: Resolved Issue is done internally
Projects
None yet
Development

No branches or pull requests

7 participants