Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TW#24580] UART ISR race condition in uart_rx_intr_handler_default causing stuck data in the UART FIFO #2204

Closed
ukfasthands opened this issue Jul 19, 2018 · 31 comments

Comments

@ukfasthands
Copy link

Environment

  • Development Kit: none : Custom PCB with ESP-WROOM-32
  • Kit version : n/a
  • Core (if using chip or module): ESP-WROOM-32
  • IDF version : fdee0b6
  • Development Env: Make
  • Operating System: Windows 10
  • Power Supply: external 5v

Problem Description

There is a UART ISR race condition in uart_rx_intr_handler_default when using high baud rate (500000).

Current ISR:
1 rx_fifo_len = uart_reg->status.rxfifo_cnt;
2 [... deleted code for clarity]
3 while (buf_idx < rx_fifo_len) {
4 p_uart->rx_data_buf[buf_idx++] = uart_reg->fifo.rw_byte;
5 }
6 [.. delete ed code for clarity]
7 uart_clear_intr_status(uart_num, UART_RXFIFO_TOUT_INT_CLR_M | UART_RXFIFO_FULL_INT_CLR_M);

It is possible to get characters stuck in the FIFO as the ISR first reads the FIFO size (line 1 above) and pulls the data from the FIFO before clearing the ISR. If data is received while reading from the FIFO (time between lines 2 to 6 above)) it it will not be read. The ISR will be cleared so the data will be stuck in the FIFO until more data is received. This will delay communications.

I suggest that the ISR should be cleared before reading the uart_reg->status.rxfifo_cnt; This way if more data is received while read the FIFO another interrupt will be generated once completed.

Expected Behavior

Actual Behavior

Steps to repropduce

Very hard to reproduce:
esp32 --UART--> esp32
Baud rate set to 500000
UART config:
#define UART_EMPTY_THRESH_DEFAULT (10)
#define UART_FULL_THRESH_DEFAULT (127)
#define UART_TOUT_THRESH_DEFAULT (1)

Packet burst from esp32 1 to esp32 2

@costaud
Copy link
Collaborator

costaud commented Jul 25, 2018

"The ISR will be cleared so the data will be stuck in the FIFO until more data is received. This will delay communications."

Actually the interrupt status can not be cleared before reading data out of the FIFO. We will update the documentation about this point.

Even if there is one byte left in the FIFO, it will trigger timeout interrupt if no more data is received.

@lollizze
Copy link

I'm experience the same issue, but for me the status reg rxfifo_cnt seems to become wrong. After a while a few bytes is stuck in the fifo, but rxfifo_cnt is 0, until I send in more bytes but the rxfifo_cnt will still be those few bytes short

@FayeY FayeY changed the title UART ISR race condition in uart_rx_intr_handler_default causing stuck data in the UART FIFO [TW#24580] UART ISR race condition in uart_rx_intr_handler_default causing stuck data in the UART FIFO Jul 26, 2018
@koobest
Copy link
Contributor

koobest commented Jul 30, 2018

Hello
@ukfasthands, your fifo_full threshold is set too high, is there still a problem if use the default value?
@lollizze, when the rxfifo_cnt reaches 128, the count value will not increase any more, continue to receive data will cause fifo to overflow( but rxfifo_cnt keeps 128), in this case, we need to reset fifo.

@lollizze
Copy link

Hi @koobest we have set our rx_flow_ctrl_thresh to 110 and the rest is default from esp

#define UART_EMPTY_THRESH_DEFAULT  (10)
#define UART_FULL_THRESH_DEFAULT  (120)
#define UART_TOUT_THRESH_DEFAULT   (10)

I don't think fifo overflow is our problem, with my settings and flow control i expect that an overflow should never happen

@negativekelvin
Copy link
Contributor

If tout interrupt occurs after rxfifo_cnt was read (during fifo readout) it will be cleared anyway together with fifo full interrupt and bytes will be stuck.

@koobest
Copy link
Contributor

koobest commented Aug 22, 2018

Hi,
I tried to accept 67M of data at the 500000 baud rate and no data loss.

  1. if rxfifo_cnt == 0 and a few bytes are stuck in fifo, It must be that fifo has overflowed. we need to reset the fifo.

  2. In uart.c, setting the threshold by changing #define UART_FULL_THRESH_DEFAULT (120) is useless, we should set the specific field in uart_config_t,otherwise, it will use a random value.

@koobest
Copy link
Contributor

koobest commented Aug 22, 2018

@negativekelvin,
Assume that the baud rate is 1M. if a time out interrupt is generated during the reading of data from fifo,this means that reading fifo requires 10us. in this case, no data will stuck in fifo. If there is a byte stuck in fifo, it means that reading fifo requires 20us.

@lollizze
Copy link

@koobest I don't think fifo overflow is our problem, with my settings and flow control i expect that an overflow should never happen. we have set our rx_flow_ctrl_thresh to 100 and the rest is default from esp. our rx buffer is 1024 and we aren't handling all UART_DATA events when it arrives meaning we get UART_BUFFER_FULL from time to time

@negativekelvin
Copy link
Contributor

@koobest worst case with interrupt latency the new byte arrives just after rxfifo_cnt is read and timeout occurs just before flag is cleared so 10us, how long does it take to empty fifo at 80mhz?

@koobest
Copy link
Contributor

koobest commented Aug 23, 2018

Hi,@lollizze, can you provide more details to help us debugging?

  1. Does the flow control works when the data is not read out in time?
  2. How do you handle the uart events in your application?
  3. Is there any other task to read and write flash?
  4. Can you kindly provide the uart initialization code and the IDF commit ID? I can use the same
    configuration for debugging.

thanks!!

@koobest
Copy link
Contributor

koobest commented Aug 23, 2018

Hi,@negativekelvin
Each fifo reads takes about 10 apb cycles, so empty fifo will takes
t = UART_FULL_THRESH_DEFAULT * 10 apb cycle. I will have a test at a higher baud rate.

@negativekelvin
Copy link
Contributor

@koobest ok so it could be possible to mistakenly squelch timeout interrupt. All that needs to be done is to AND the interrupt clear mask with uart_intr_status. I don't see any downside.

@lollizze
Copy link

lollizze commented Aug 23, 2018

@koobest

  1. I don't think we have any flow control problem, everything coming in seem to be read out
  2. uart event is received by a rx task that notifys our main thread, and if upper layer can receive data, e.g TCP then we read from uart
  3. read and write flash? or do you mean uart. In the uart we have two task one for rx and one for tx, we have a main task, then we have some minor tasks but main task is handling most stuff
  4. d91c425
     uart_config_t uart_config;
     esp_err_t espStatus;
     uart_config->data_bits = UART_DATA_8_BITS;
     uart_config->parity = UART_PARITY_DISABLE;
     uart_config->stop_bits = UART_STOP_BITS_1;
     uart_config->flow_ctrl = UART_HW_FLOWCTRL_CTS_RTS;
     uart_config->rx_flow_ctrl_thresh = 100;
     uart_config->use_ref_tick = false;
     uart_config->baud_rate = 3000000;
     espStatus = uart_param_config(0, &uart_config);
     espStatus = gpio_set_direction(GPIO_NUM_1, GPIO_MODE_OUTPUT);
     espStatus = gpio_set_level(GPIO_NUM_1, 1);
    
     espStatus = uart_set_pin(0, GPIO_NUM_1, GPIO_NUM_3, GPIO_NUM_22, GPIO_NUM_19);
     rxMutex = xSemaphoreCreateBinary();
     xSemaphoreGive(p->rxMutex);
    
     espStatus = uart_driver_install(0,
                                     1024,   // rx buf size
                                     0,   // tx buf size, 0 means tx task is blocked until all data in buffer is transmitted
                                     50,
                                     &uartIsrEventQueue,
                                     0);
    
     if (rxTask == NULL)
     {
         xTaskCreate(&uartRxTask, "uart_rx_task", 2048, p, 5, &rxTask);
     }
    
     if (txQueue == NULL)
     {
         txQueue = xQueueCreate(2, sizeof(QMessage));
         cb_ASSERT(txQueue != NULL);
     }
    
     if (txTask == NULL)
     {
         xTaskCreate(&uartTxTask, "uart_tx_task", 2048, p, 6, &txTask);
         cb_ASSERT(txTask != NULL);
     }
    

our testsetup
3Mbit uart on 2 modules, TCP connection between them, duplex. send 3 MB data in 1024 bytes chunks

@koobest
Copy link
Contributor

koobest commented Aug 27, 2018

Hi,@ lollizze
Can you increase your rx_buf_size, the problem should be caused by the buffer being too small. In the ringbuffer, there will be some bytes used to hold the ringbuffer's private data. so, 1024 byte of buffer is not enough for receiving 1024 byte of data. by the way, if increasing rx_buf_size can solve the problem, you can set flow control threshold to 120.

@lollizze
Copy link

@koobest How can the size of the ring buffer affect the rx fifo, the stuck byte is in fifo before it is copied into the buffer. We can't have a buffer that is big enough to never get full.
regarding threshold we lowered to 100 because we had problem in earlier sdk that we got fifo overflow when we had 120 that might be solved now we have just kept it as it was

@koobest
Copy link
Contributor

koobest commented Aug 27, 2018

@ lollizze Can you help check the following register of uart when the problem occurs?

UART[uart_num]->status.rxfifo_cnt 
UART[uart_num]->mem_rx_status.wr_addr
UART[uart_num]->mem_rx_status.rd_addr

If ( (UART[uart_num]->status.rxfifo_cnt == 0) && ( UART[uart_num]->mem_rx_status.wr_addr !=
UART[uart_num]->mem_rx_status.rd_addr) ),

it means there some data was stuck in the fifo. this can helps us locate the problem.

thanks !!

@lollizze
Copy link

@koobest here is the status of the registers
uart_reg->status.rxfifo_cnt = 0
UART[uart_num]->mem_rx_status.wr_addr = 415
UART[uart_num]->mem_rx_status.rd_addr = 414

@lollizze
Copy link

While having a breakpoint in uart isr context to check register rx rd_addr drifted to 426, can that happen if breaking at that point? when starting again the uart is completely out of sync

@koobest
Copy link
Contributor

koobest commented Aug 28, 2018

@lollizze
From the register value, we can confirm that the state machine of the UART is in the wrong state。
How many bytes do you read each time, Are you using the following reading method? if not, you can have a try, it can avoid the ringbuffer full situation.

uart_read_bytes(0, buf, 1024, portMAX_DELAY);

@lollizze
Copy link

@koobest yes we are using that(uart_read_bytes(0, buffer, nBytes, 0);), but the amount of bytes is calculated based on how much we can handle in the application

@koobest
Copy link
Contributor

koobest commented Aug 29, 2018

Hi, @lollizze
How do you handle UART_BUFFER_FULL event? due to the use of flow control, the complete transfer of data can be guaranteed, but when the UART_BUFFER_FULL event is comming, we still need to call uart_read_bytes to read the data out.
in uart_events example, since we don't use flow control, we use the following way to handle UART_BUFFER_FULL events.

uart_flush_input(EX_UART_NUM);
xQueueReset(uart0_queue);

@lollizze
Copy link

We are handling UART_BUFFER_FULL same as UART_DATA

@qt1
Copy link

qt1 commented Sep 7, 2018

I am not sure that this is caused by a race condition between the CPU and the ISR as described above.

As far as I understand the CPU is not involved with the internal management of the fifo (rxfifo_cnt , wr_addr, rd_addr) .
This seems more like a hardware problem, maybe in the internals of the fifo or microcode.
Maybe collision between fifo channels?

@qt1
Copy link

qt1 commented Sep 7, 2018

I added a pull request: espressif/arduino-esp32#1849 for #2388

This may also help here.

Please check
Thanks

@lollizze
Copy link

@qt1 your problem is probably the same as I have seen, rx_cnt is 0 but the wr rd addr is not the same

@negativekelvin
Copy link
Contributor

@qt1
Copy link

qt1 commented Nov 26, 2018

@lollizze As far as I understand the management of the FIFO is (should be) done by the hardware or some microcode.
Therefore the state of the FIFO should be consistent in all cases.

The hardware API seems to be consistent: you only read from the FIFO and everything else happens automatically, so the CPU is not involved in the transaction and therefore the FIFO management should be affected by the CPU or interrupts.

I suspect that this behaviour is a bug in the hardware and should be fixed by the Espresiff
The patch of comparing the pointers gives OK results but it is not perfect. I hope they address this in the next hardware releases..

(I didn't read the whole patch as it seems to be very long)

@igrr igrr closed this as completed in a9c1124 Nov 28, 2018
@negativekelvin
Copy link
Contributor

When using dual cores, rx_buffer_full_flg may read and write on different cores at same time, which may lose synchronization.

This is what leads to uart_reg->status.rxfifo_cnt == 0 && uart_reg->mem_rx_status.wr_addr != uart_reg->mem_rx_status.rd_addr ?

@koobest
Copy link
Contributor

koobest commented Dec 3, 2018

Hi, @negativekelvin
This fix may be unrelated to uart_reg->status.rxfifo_cnt == 0 && uart_reg->mem_rx_status.wr_addr != uart_reg->mem_rx_status.rd_addr,but we hope you can have a try , it fixed a UART read bug.

Since we can not reproduce your issue, we can't fix it. However, we can continue to work on this issue, but need some help from your:

  • This issue can be reproduced using the UART driver in IDF. since I am not sure if you are using IDF, arduino or a driver that you implement yourself, there are too many comments on this issue :-).
  • Provide a example that can reproduce it.
  • Please do not put your attachment in https://gist....(if have), we can not open it :).

thanks !!

@qt1
Copy link

qt1 commented Dec 3, 2018

Dear @koobest ,

The uart_reg->status.rxfifo_cnt == 0 && uart_reg->mem_rx_status.wr_addr != uart_reg->mem_rx_status.rd_addr, is a patch.
It did not fix the bug' it just reduces the problem but the problem is not solved.

I have a system where the problem occurs every 20-30 minutes. From time to time the patch does not help and there is some other mess in the buffer.
My program is a simple bridge that copy bytes between UART2 UART0 and a TCP stream.

The behaviour is statistical. My guess is that it is probably a race condition in the hardware that mix UART channels.

Good luck fixing it!

@koobest
Copy link
Contributor

koobest commented Dec 4, 2018

@qt1
Is your system based on arduino or IDF? I saw that you have an arduino pull request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants