I2C data corruption with timer and UDP server running (IDFGH-11762) #12860

Open
3 tasks done
dek-RVB opened this issue Dec 22, 2023 · 11 comments
Labels
Status: Opened (Issue is new), Type: Bug (bugs in IDF)

Comments

@dek-RVB

dek-RVB commented Dec 22, 2023

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

IDF version.

v5.1.2 (also tested on master)

Espressif SoC revision.

ESP32 (revision v3.1)

Operating System used.

Windows

How did you build your project?

VS Code IDE

If you are using Windows, please specify command line type.

None

Development Kit.

ESP32-Ethernet-Kit-V1.2 and custom board

Power Supply used.

External 5V

What is the expected behavior?

The temperature is monitored using a PCT2075 over an I2C bus, while an auto-reload esp timer that triggers a level 3 interrupt is running. A UDP server is also set up. The temperature is expected to be output without corrupted data.

What is the actual behavior?

The temperature monitored by the setup described above is reasonable most of the time, but spikes are logged at random moments. Sometimes the corrupted value consists of two identical consecutive bytes; other times it looks random. No pattern has been seen yet. The data on the I2C bus has been checked and does not show any of those temperature spikes. The device does not crash.

Steps to reproduce.

  1. Connect the SDA and SCL pin of the Adafruit PCT2075 to IO2 and IO4 of the ESP32-Ethernet-Kit V1.2, respectively.

  2. Connect the address pins of the PCT2075 to ground or 3V3, and make sure to set the matching address in the code (ec_control.c --> PCT2075_I2C_ADDR).

  3. Connect the PCT2075 to GND and 3V3.

  4. Connect the ESP32-Ethernet-Kit V1.2 to a PoE capable device.

  5. Make sure the interrupt level of the 'High resolution timer (esp_timer)' is set to '3' and the 'Support ISR dispatch method' checkbox is active in the sdkconfig.

  6. Build and flash the project found in the attached files.

  7. Open the monitor; an IP address will be assigned to the device, and temperatures below 20°C and above 60°C will be logged. The measurements before and after the erroneous data are also logged.

  8. The occurrence of errors can be significantly increased by flooding the device with ARP messages. This can be done by:

  • using an external network performance tester
  • running an 'arp-scan' command on a PC
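The two settings from step 5 can also be written as an sdkconfig fragment. This is only a sketch: the option names below are the ones the esp_timer component is understood to use in ESP-IDF v5.x, and they should be verified in menuconfig for the IDF version in use.

```
# Assumed option names for the esp_timer settings in step 5; verify in menuconfig.
CONFIG_ESP_TIMER_INTERRUPT_LEVEL=3
CONFIG_ESP_TIMER_SUPPORTS_ISR_DISPATCH_METHOD=y
```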

EC_controller_test.zip

Debug Logs.

E (2656) pct2075_read_temperature: prev raw data = 1a40
E (2656) pct2075_read_temperature: Temperature is not within limits: 64.250000 (raw data = 4040)
E (2656) pct2075_read_temperature: next raw data = 1a40
E (2996) pct2075_read_temperature: prev raw data = 1aa0
E (2996) pct2075_read_temperature: Temperature is not within limits: -95.375000 (raw data = a0a0)
E (2996) pct2075_read_temperature: next raw data = 1aa0
E (173046) pct2075_read_temperature: prev raw data = 1a40
E (173046) pct2075_read_temperature: Temperature is not within limits: 64.250000 (raw data = 4040)
E (173046) pct2075_read_temperature: next raw data = 1a40
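The raw values in these log lines can be checked by hand. The following is a minimal sketch, assuming the PCT2075 temperature register holds an 11-bit two's-complement value left-justified in 16 bits with 0.125 °C per step, which matches the value pairs logged above. The function names are hypothetical and are not taken from the attached project; the 20 °C and 60 °C limits come from step 7.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not from the attached project.
 * Assumption: the reading is an 11-bit two's-complement value in the
 * upper bits of the 16-bit register, 0.125 degC per step. */
static float pct2075_raw_to_celsius(uint16_t raw)
{
    /* Clear the 5 unused low bits, then reinterpret the value as signed. */
    int16_t aligned = (int16_t)(raw & 0xFFE0u);
    /* Dividing by 32 is exact here because the low 5 bits are zero. */
    return (float)(aligned / 32) * 0.125f;
}

/* Limits used by the log filter described in step 7. */
static bool temperature_within_limits(float celsius)
{
    return celsius >= 20.0f && celsius <= 60.0f;
}
```

With this decoding, 0x1a40 gives 26.25 °C, 0x4040 gives 64.25 °C and 0xa0a0 gives -95.375 °C, in agreement with the log.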

More Information.

Initial Setup

Custom PCB

The custom design is a PCB containing:

  • ESP32-PICO-V3-02 SiP
  • SPI bus with a DAC and an ADC. The ESP32 writes to and reads registers from the DAC. The ADC reading is triggered every 1 ms by an auto-reload timer to have high speed measurements.
  • 400 kHz I2C bus with an I/O expander, current monitor and temperature sensor.
  • PoE module that feeds 5V, which is also converted to 3V3, to the PCB.
  • PHY for the Ethernet connection. This is used to set up a UDP server.

Errors on the custom PCB

I2C

The custom board sporadically reported current and temperature spikes (both positive and negative) at random moments. Those spikes do not happen at the same time. The I2C bus was monitored with an oscilloscope and did not show any sign of corrupted data sent over the bus. The time between two spikes ranges from a few seconds to a couple of hours.

We discovered later that the rate of erroneous values is increased by flooding the network with ARP messages. Disabling the initialization of the UDP server removed the current and temperature spikes.

It was also discovered that disabling the esp timer callback removes the current and temperature spikes as well. However, enabling the callback with an empty function still gives erroneous data. Increasing the timer's frequency increases the error rate. The frequency cannot be too high, as that introduces watchdog timeouts.

Increasing the timer frequency and ARP flooding together consistently raise the error rate to a couple of spikes per 10 minutes.

SPI

The SPI reads from the DAC are randomly converted to writes, which gives unwanted values at the output of the DAC (confirmed by monitoring the SPI bus with an oscilloscope). ARP flooding has no impact on the rate at which SPI reads turn into writes. However, increasing the timer's frequency increases the number of such converted reads. It is still unclear whether the SPI and I2C errors are related.

Tests

The system has been checked for stack overflows, task sizes, memory leaks...
The power supply is stable.

Also tested:

  • on a Linux build using VS Code ESP-IDF
  • with a different phy
  • with the VS Code ESP-IDF V4.6
  • gptimer instead of esp timer
  • multiple iterations of the custom board
  • with the master branch

but none of the above helped to resolve the weird behavior of the system.

ESP32-Ethernet-Kit V1.2

First, the code has been reduced to its minimum, while still showing erroneous data on the custom board. Therefore only the UDP server initialization (no active task), the auto-reload timer with an empty callback and a task that reads the temperature sensor using I2C have been preserved. This reduces the errors to only temperature spikes. This code has been ported to the ESP32-Ethernet-Kit V1.2 in combination with an Adafruit PCT2075.

To increase the number of errors, the timer auto-reload value has been set to 1 µs and the number of I2C reads has been increased. To be clear, the errors still occur without those changes, but they can then take hours to happen.

Does anyone know what is going on with this specific combination of UDP server, auto-reload timer and I2C bus?

Thanks in advance

RVB

@dek-RVB dek-RVB added the Type: Bug bugs in IDF label Dec 22, 2023
@espressif-bot espressif-bot added the Status: Opened Issue is new label Dec 22, 2023
@github-actions github-actions bot changed the title I2C data corruption with timer and UDP server running I2C data corruption with timer and UDP server running (IDFGH-11762) Dec 22, 2023
@mythbuster5
Collaborator

Q1: From your description, do you mean that the data on the oscilloscope is correct, but the data that the ESP got is wrong?
Q2: What happens if you disable UDP and only initialize the timer and I2C? Is the data still wrong?

@dek-RVB
Author

dek-RVB commented Dec 26, 2023

@mythbuster5
Answer to Q1: The data on the oscilloscope is correct, but the data reported by the esp is sometimes wrong.
Answer to Q2: If the UDP server is not initialized while the timer and I2C are running, the error does not occur.

@xavierhamel

Hi,
We also currently have a similar problem. We are running an I2C sensor with a UDP server and a BLE scan. About once every hour the application crashes because of heap memory corruption. It is not possible for us to compare the values from the oscilloscope with the data that the I2C stack returns like you did (there is too much data).

When disabling the UDP server or the BLE scanning, the problem seems to occur much less often. Our guess is that the problem is coming from the I2C driver (or how we use it). The UDP server and BLE scanning just use the heap a lot, which creates an environment where memory corruption is much more likely.

We have tested and reproduced the issue on v5.0 and v5.1.

@redfast00
Contributor

Seems to be related to #7781

@dek-RVB
Author

dek-RVB commented Mar 13, 2024

Some more tests have been performed, and they gave new insights into the errors.

Error analysis

Measurement format

In the original code a current, voltage and temperature measurement (using I2C as described in the issue) are sequentially performed in a FreeRTOS task. This task then waits 10 ms (using vTaskDelay) and performs the 3 measurements again and again...

Each measurement consists of 2 bytes of data. A diagram is shown below. I1 is the most significant byte of the current measurement and I0 the least significant byte. The same convention is used for the voltage and temperature.

[Figure: Ideal_transmission]

Error format

A pattern was discovered in the errors. These errors can be separated into multiple cases. Keep in mind that the I2C bus was monitored and that this bus contains no errors. All the errors happen internally in the ESP32.

Temperature errors

The temperature errors are always the same. The first byte of the temperature, T1, is replaced by the last byte of the voltage measurement, V0, as shown below.

[Figure: Temp_error]

Current errors

There are two types of current errors. The first one is similar to the temperature error: the first byte of the current, I1, is replaced by the last byte of the previous measurement, T0. This is shown below.

[Figure: Current_error1]

The second error is more complicated, as I have no idea where the erroneous byte comes from. The last byte of the current, I0, is replaced by a value that is constant within one run of the code. Restarting the ESP32 (without rebuilding/reflashing) can change that value, but not always. The observed values are 0x04 and 0x8D, but I have no idea what causes these bytes to appear there.

[Figure: Current_error2]
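The three error cases described above can be told apart mechanically when a bus capture is available as a reference. The following is a minimal host-side sketch, not code from the project: it assumes one cycle is stored as six bytes in bus order (I1, I0, V1, V0, T1, T0), and all names in it are hypothetical.

```c
#include <stdint.h>

/* Byte order of one measurement cycle as seen on the bus (assumed layout). */
enum { POS_I1, POS_I0, POS_V1, POS_V0, POS_T1, POS_T0, FRAME_LEN };

typedef enum {
    FRAME_OK,
    FRAME_TEMP_MSB_STALE,          /* T1 replaced by V0 */
    FRAME_CURRENT_MSB_STALE,       /* I1 replaced by T0 of the previous cycle */
    FRAME_CURRENT_LSB_UNEXPLAINED, /* I0 replaced by a byte of unknown origin */
    FRAME_OTHER
} frame_error_t;

/* bus: bytes captured on the wire, rx: bytes reported by the driver,
 * prev_t0: last temperature byte of the previous cycle. */
static frame_error_t classify_frame(const uint8_t *bus, const uint8_t *rx,
                                    uint8_t prev_t0)
{
    int bad = -1;
    for (int i = 0; i < FRAME_LEN; i++) {
        if (bus[i] != rx[i]) {
            if (bad >= 0) {
                return FRAME_OTHER; /* more than one byte differs */
            }
            bad = i;
        }
    }
    if (bad < 0) {
        return FRAME_OK;
    }
    if (bad == POS_T1 && rx[POS_T1] == bus[POS_V0]) {
        return FRAME_TEMP_MSB_STALE;
    }
    if (bad == POS_I1 && rx[POS_I1] == prev_t0) {
        return FRAME_CURRENT_MSB_STALE;
    }
    if (bad == POS_I0) {
        return FRAME_CURRENT_LSB_UNEXPLAINED;
    }
    return FRAME_OTHER;
}
```

A classifier like this only labels what was observed; it does not explain where the unexplained I0 byte comes from.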

FreeRTOS timing

While I was monitoring the bus with an oscilloscope and logging the measured data with its timestamp using a logging script, a weird timing behavior was discovered. The FreeRTOS tick rate is set to 100 ticks per second, which gives a minimal interval of 10 ms between two measurement timestamps. An example of the normal behavior is shown below.

[Figure: OS_tick0]

Whenever a temperature error occurs, the OS misses its 10 ms mark and instead logs a timestamp at a 15 ms mark. It then waits another 15 ms and afterwards proceeds with the expected 10 ms between two logged timestamps. I would expect that whenever the OS cannot reach its set tick rate (due to, for example, CPU overload), it would skip a tick, causing a delta of 20 ms between two timestamps. An example of the error and the expected behavior is shown below.

[Figure: OS_tick1]

The current error again is more complex. Whenever the erroneous current is negative (MSB of I1 being 1) the OS misses its 10 ms mark and instead logs a timestamp at the 12 ms mark. It then waits 18 ms before proceeding with the expected 10 ms delta between two timestamps. Whenever the erroneous current is positive (MSB of I1 being 0) the 10 ms mark is reached and no timing problems are visible in the timestamp logging.
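The timing irregularities described above can be spotted automatically in a timestamp log. The following is a minimal host-side sketch, not part of the logging script mentioned in this comment: it assumes the timestamps are available as an array of milliseconds, and the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the intervals between consecutive timestamps (in ms) that differ
 * from the expected period, e.g. 10 ms for a 100 Hz tick rate. */
static size_t count_irregular_intervals(const uint32_t *ts_ms, size_t n,
                                        uint32_t period_ms)
{
    size_t irregular = 0;
    for (size_t i = 1; i < n; i++) {
        if (ts_ms[i] - ts_ms[i - 1] != period_ms) {
            irregular++;
        }
    }
    return irregular;
}
```

For an illustrative sequence such as 100, 110, 125, 140, 150 ms (the 15 ms + 15 ms pattern of a temperature error), this reports two irregular intervals; the 12 ms + 18 ms pattern of a negative current error is flagged the same way.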

Example code ESP32-Ethernet-Kit V1.2

I provided example code in the original issue to easily reproduce the errors. The new insights into the replaced bytes explain the error pattern of the example code. As only temperature measurements are performed, the first byte of the newly measured temperature is replaced by the last byte of the previously measured temperature. The temperature is relatively stable, which results in two identical bytes in the reported erroneous temperature measurement.
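This explanation can be checked against the debug logs in the original issue. The sketch below models the described replacement, assuming the corrupted value is formed from the previous reading's low byte followed by the current reading's low byte; the function name is hypothetical.

```c
#include <stdint.h>

/* Model of the observed corruption: the MSB of the current reading is
 * replaced by the LSB of the previous reading. */
static uint16_t stale_msb_corruption(uint16_t prev_raw, uint16_t cur_raw)
{
    uint8_t prev_lsb = (uint8_t)(prev_raw & 0xFFu);
    uint8_t cur_lsb = (uint8_t)(cur_raw & 0xFFu);
    return (uint16_t)(((uint16_t)prev_lsb << 8) | cur_lsb);
}
```

With a stable reading of 0x1a40 before and during the error, the model yields 0x4040, and with 0x1aa0 it yields 0xa0a0, which are the corrupted raw values in the log.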

Notes and other tests

  • Note that the measured voltage is never erroneous. It is still unclear to me why these errors occur and why they follow the pattern described above.
  • I have tried clearing the I2C RX buffers in between two measurements, but this does not resolve anything.
  • I also have tried bit-banging the I2C bus by setting and clearing the SDA and SCL pins. This resulted in 0 I2C errors. This again proves that there is no problem with the custom board or with the used peripherals. I strongly believe there is an internal problem in the ESP32 causing these errors.

@mythbuster5 Would you have any insights on this weird behavior?

Thanks in advance.

RVB

@peturdainn

I can totally reproduce this issue on the Espressif devkit

@redfast00
Contributor

We have been able to make the problematic code a lot smaller: it's only 170 lines of C now, in a single file, based on the 'i2c simple example' https://github.com/espressif/esp-idf/tree/v5.1.2/examples/peripherals/i2c/i2c_simple. We have determined that the network stack is not involved in this bug, so we've eliminated it from the code. I've attached the zip.

As for other suggestions:

"When calling the i2c_driver_install() API to register I2C, please set the last parameter to ESP_INTR_FLAG_IRAM for testing."

This does not fix the problem; corruption still happens (but less often).

"Increase the timer period for testing."

The timer period was increased from 1 to 10 µs; corruption still happens.

"You can try to set the esp_timer task core affinity to CPU1 for testing."

However, with the affinity set to CPU1, the corruption does not seem to happen anymore.

We're not satisfied with this solution yet, for the following reasons:

  • We still don't know why the corruption happened in the first place, so we don't know if we've solved it or just made it a lot more infrequent
  • The 'note that they may break other features' warning is concerning, and there doesn't seem to be any documentation for what those features may be.

We have also reduced the hardware needed to reproduce this bug. We are able to reproduce this on just an official ESP32-Ethernet-Kit_A_V1.2 with a PCT2075 sensor module from Adafruit (https://www.adafruit.com/product/4369). This sensor is likely also available from other vendors, in case Adafruit does not ship to your region.

based_on_i2c_simple.zip

@redfast00
Contributor

redfast00 commented Oct 15, 2024

We received support from Espressif. There was indeed an issue with the I2C FIFO; the following patch, provided by one of their employees, fixes it:

From fb0c921cc6c93a755f3f39f472fc88b59d130dad Mon Sep 17 00:00:00 2001
From: Jacques_Zhao <redacted@espressif.com>
Date: Fri, 30 Aug 2024 19:23:45 +0800
Subject: [PATCH] i2c: fix i2c read error

---
 components/hal/esp32/include/hal/i2c_ll.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/components/hal/esp32/include/hal/i2c_ll.h b/components/hal/esp32/include/hal/i2c_ll.h
index f2903de44a..f1aa6aacf9 100644
--- a/components/hal/esp32/include/hal/i2c_ll.h
+++ b/components/hal/esp32/include/hal/i2c_ll.h
@@ -518,6 +518,7 @@ static inline void i2c_ll_get_scl_timing(i2c_dev_t *hw, int *high_period, int *l
 __attribute__((always_inline))
 static inline void i2c_ll_write_txfifo(i2c_dev_t *hw, const uint8_t *ptr, uint8_t len)
 {
+    hw->fifo_conf.nonfifo_en = 0;
     uint32_t fifo_addr = (hw == &I2C0) ? 0x6001301c : 0x6002701c;
     for(int i = 0; i < len; i++) {
         WRITE_PERI_REG(fifo_addr, ptr[i]);
@@ -536,9 +537,14 @@ static inline void i2c_ll_write_txfifo(i2c_dev_t *hw, const uint8_t *ptr, uint8_
 __attribute__((always_inline))
 static inline void i2c_ll_read_rxfifo(i2c_dev_t *hw, uint8_t *ptr, uint8_t len)
 {
+    hw->fifo_conf.nonfifo_en = 1;
     for(int i = 0; i < len; i++) {
-        ptr[i] = HAL_FORCE_READ_U32_REG_FIELD(hw->fifo_data, data);
+        ptr[i] = hw->ram_data[i];
     }
+    hw->fifo_conf.nonfifo_en = 0;
+
+    hw->fifo_conf.rx_fifo_rst = 1;
+    hw->fifo_conf.rx_fifo_rst = 0;
 }
 
 /**
-- 
2.34.1

@AxelLin
Contributor

AxelLin commented Oct 15, 2024

@mythbuster5 Could you review #12860 (comment) ?

@redfast00
Contributor

We have tested this on 10 boards for two weeks and haven't had a single error anymore (we used to have multiple per 10 minutes). The patch above was sent by an Espressif employee, thanks for the support!

@AxelLin
Contributor

AxelLin commented Nov 13, 2024

@mythbuster5 Could you review #12860 (comment) ?

Is there anything wrong with the above fix? (I'm wondering why it's still not fixed on GitHub.)
