EVB FATCATLAB 9DOF sporadically crashes kernel #662

kortschak · 2016-06-03T10:22:45Z

System information:

I am using: EVB

I installed ev3dev using this image file: evb-ev3dev-jessie-2016-01-25.img

My kernel version is: 4.4.9-ti-rt-r25-11-ev3dev-bb.org

My host computer is running: linux

I am connecting using: Wi-Fi

About my issue:

Plug in the 9DOF device and remove then plug in again.
Attempt to navigate in Brickman - not possible.

Additionally the EVB will not respond to an invocation of /sbin/poweroff (presumably it is waiting for Brickman to terminate).

kortschak · 2016-06-03T10:25:18Z

Oooh. It's not Brickman. Relevant lines of dmesg output:

[   51.947745] lego-port port0: Added new device 'evb-ports:in1:ev3-uart-host'
[   52.205858] Registered EV3 UART sensor line discipline. (29)
[   54.192904] lego-sensor sensor0: Registered 'ev3-uart-76' on 'evb-ports:in1'.
[   54.233011] tty ttyS4: Reconnected due to: Bad message size.
[   56.508776] tty ttyS4: Reconnected due to: Bad message size.
[   66.913634] lego-port port0: Removed device 'evb-ports:in1:ev3-uart-host'
[   66.953411] lego-sensor sensor0: Unregistered 'ev3-uart-76' on 'evb-ports:in1'.
[   66.953590] pin1_mv: 4997
[   66.953596] pin6_mv: 53
[   66.953601] pin5_gpio: 1
[   66.953605] pin6_gpio: 0
[   68.376911] lego-port port0: Added new device 'evb-ports:in1:ev3-uart-host'
[   69.305480] lego-sensor sensor1: Registered 'ev3-uart-76' on 'evb-ports:in1'.
[   69.345030] tty ttyS4: Reconnected due to: Bad checksum.
[   69.365600] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: bfb02a35

[   69.365619] CPU: 0 PID: 226 Comm: kworker/0:3 Tainted: G        WC      4.4.9-ti-rt-r25-11-ev3dev-bb.org #1
[   69.365623] Hardware name: Generic AM33XX (Flattened Device Tree)
[   69.365667] Workqueue: events ev3_uart_handle_rx_data [ev3_uart_sensor_ld]
[   69.365709] [<c0015e91>] (unwind_backtrace) from [<c001259d>] (show_stack+0x11/0x14)
[   69.365732] [<c001259d>] (show_stack) from [<c03dafa1>] (dump_stack+0x5d/0x6c)
[   69.365751] [<c03dafa1>] (dump_stack) from [<c00fadbf>] (panic+0xbb/0x1d8)
[   69.365779] [<c00fadbf>] (panic) from [<c00314db>] (__stack_chk_fail+0x17/0x18)
[   69.365801] [<c00314db>] (__stack_chk_fail) from [<bfb02a35>] (ev3_uart_open+0x0/0x22c [ev3_uart_sensor_ld])
[   69.365851] [<bfb02a35>] (ev3_uart_open [ev3_uart_sensor_ld]) from [<00000000>] (  (null))
[   69.446314] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: bfb02a35

dlech · 2016-06-03T16:05:44Z

Hmm... nothing obvious jumping out here. Makes me wonder if this is somehow related to #623.

It would be useful to put a logic analyzer on this to see if we really have a bad message size or if our driver needs to be fixed.

Also, an inspection of ev3_uart_sensor_ld.c by a fresh pair of eyes to see if we are passing a pointer to some stack allocated memory somewhere would be useful. It could be that we are using the pointer after the stack is popped.

bmegli · 2016-06-05T17:28:02Z

I am not sure if this is the problem but lines 400-401 of ev3_uart_sensor_ld.c look suspicious.

port->in_port = NULL;
put_device(&port->in_port->dev);

So we are dereferencing through NULL pointer?
The lines should be probably in the reverse order.

Update -> Pull Request issued

dlech · 2016-06-05T18:25:08Z

I don't think we ever reach this code. We would certainly get a crash if we did. Thanks for the PR to fix it.

bmegli · 2016-06-06T08:44:39Z

I think I may have found the problem:

static void ev3_uart_handle_rx_data(struct work_struct *work)
{
        //...
        u8 message[EV3_UART_MAX_MESSAGE_SIZE + 2];
        int count = CIRC_CNT(cb->head, cb->tail, EV3_UART_BUFFER_SIZE);
        //message is 36 bytes long (btw, shouldn't it be MAX_DATA_SIZE+2 or just MAX_MESSAGE_SIZE)
        //count is up to 255
        // ...
        //after syncing
        msg_size = ev3_uart_msg_size((u8)cb->buf[cb->tail]);
        //msg_size may be up 128 + 2 + 1!
        //EV3_UART_MAX_MESSAGE_SIZE macro returns up to 128, is this intentional?
        //note - we are before CRC check here still
        if (msg_size > count)
            break;

    size_to_end = CIRC_CNT_TO_END(cb->head, cb->tail, EV3_UART_BUFFER_SIZE);
    if (msg_size > size_to_end) {
        memcpy(message, cb->buf + cb->tail, size_to_end);
        memcpy(message + size_to_end, cb->buf, msg_size - size_to_end);
        cb->tail = msg_size - size_to_end;
    } else {
        memcpy(message, cb->buf + cb->tail, msg_size);
        cb->tail += msg_size;
        if (cb->tail >= EV3_UART_BUFFER_SIZE)
           cb->tail = 0;
        }

       //now we may have overwritten 36 byte buffer with up to 131 msg_size bytes

       //only after that we check that size is OK but we already may have had overflow

      if (msg_size > EV3_UART_MAX_MESSAGE_SIZE) {
         port->last_err = "Bad message size.";
         goto err_invalid_state;
      }
     //now follows the CRC check, but it is also too late

The obvious fix here is to move:

  if (msg_size > EV3_UART_MAX_MESSAGE_SIZE)
  //...

before memcpy of the memory, but I am not sure if another problem also doesn't lie in:

#define EV3_UART_CMD_SIZE(byte) (1 << (((byte) >> 3) & 0x7))

which doesn't limit size to around 30 bytes but to 128 instead

I am not enough into the code to answer all of that but hopefully it is enough for @dlech to fix it.

dlech · 2016-06-06T16:30:38Z

Awesome! This explains the stack corruption after the Bad Message Size error that @kortschak was seeing.

dlech · 2016-06-06T16:44:33Z

For reference, the official LEGO firmware does not list 64 or 128 as possible data sizes

https://github.com/mindboards/ev3sources/blob/master/lms2012/d_uart/Linuxmod_AM1808/d_uart_mod.c#L230

dlech · 2016-06-06T16:57:39Z

Ah, but LEGO has buffer size of 64.

https://github.com/mindboards/ev3sources/blob/78ebaf5b6f8fe31cc17aa5dce0f8e4916a4fc072/lms2012/lms2012/source/lms2012.h#L1027

dlech · 2016-06-06T16:58:13Z

But still max data length of 32

https://github.com/mindboards/ev3sources/blob/78ebaf5b6f8fe31cc17aa5dce0f8e4916a4fc072/lms2012/lms2012/source/lms2012.h#L248

bmegli · 2016-06-06T17:09:56Z

This explains the stack corruption after the Bad Message Size error that @kortschak was seeing.

Yes, correcting this will make the kernel code safe but there still remains the question:

Why was there incorrect message in the first place?

Maybe it is artifact of disconnecting the sensor while working or changing baudrate or other factor.

For reference, the official LEGO firmware does not list 64 or 128 as possible data sizes

Yes but we probably should not expect device to play nicely at this point. It can be anything connected to uart that just went through sync procedure, either by coincidence or even maliciously.

We can also simply get corrupted data.

You know - let's not trust them - sensors, nobody knows who they work for ;-)

Anyway - great code! Three thumbs up!

@bmegli

We were overflowing a stack allocated array when a message had a bad size before acutally checking the size of the message. Suggested-by: @bmegli Issue: ev3dev/ev3dev#662

dlech · 2016-06-06T17:12:59Z

Why was there incorrect message in the first place?

@kortschak will have to add some printks and let us know.

I've pushed the fix suggested by @bmegli, so that should at least fix the stack corruption issue. @kortschak, can you confirm?

dlech · 2016-06-06T17:44:02Z

Interesting... With these changes, I'm getting a bunch of "Reconnected due to: Bad message size." with the LEGO EV3 Ultrasonic sensors (which was the problem sensor reported in #664).

bmegli · 2016-06-06T18:19:42Z

Hmm, in original ev3 code you linked they do have a length check:

   Uart1RecMesLng  =  GET_MESSAGE_LENGTH(Byte) + 2; 

           if (Uart1RecMesLng <= UART_BUFFER_SIZE) 
           { // Valid length 
             Uart1RecMes[Uart1RecMesIn]  =  Byte; 
             Uart1RecMesIn++; 
           }

bmegli · 2016-06-06T18:24:55Z

and

UART_BUFFER_SIZE=64

so they pass messages up to 32 length

I mean, if GET_MESSAGE_LENGTH gives 64, they add 2 to it and it will fail the check
and message is not valid

edit: they allow messages smaller-equal 62
edit of edit: the steps due to coding are 32+2, 64+2 so they allow smaller equal 34

bmegli · 2016-06-06T18:25:55Z

So the buffer seems just "nearest greater than 32 power of 2" ;-)

dlech · 2016-06-06T18:44:56Z

I haven't been able to reproduce the "Reconnected due to: Bad message size." since the first reboot. I am running into #572 quite a bit though. It makes me wonder if we have some kind of cross-talk issue with the hardware that is causing data corruption on the UARTs (Bluetooth and input ports 1 and 2 all use SoC UARTs).

We should save that discussion for another issue though. This issue is about the kernel crash, and I think we have fixed it.

bmegli · 2016-06-06T19:18:16Z

Ok, one thing before I forget:

static inline int ev3_uart_msg_size(u8 header) 
 { 
    int size; 

    if (!(header & EV3_UART_MSG_TYPE_MASK)) /* SYNC, NACK, ACK */ 
        return 1;

    size = EV3_UART_CMD_SIZE(header); 
    size += 2; /* header and checksum */ 
    if ((header & EV3_UART_MSG_TYPE_MASK) == EV3_UART_MSG_TYPE_INFO) 
        size++; /* extra command byte */  

    return size; 
}

So maybe max length could be 32+2+1.
I mean maybe my comment to limit

u8 message[EV3_UART_MAX_MESSAGE_SIZE];

length to 32+2 was wrong

To consider but yes, probably not here and now ;-)

Edit: in original lms code they allow length up to 62 as valid, ugh, but probably mean 34 or 35
Edit of Edit: silly me, the protocol allows 32+2, the next step is 64+2 so they allow at most 34

dlech · 2016-06-06T20:22:07Z

So maybe max length could be 32+2+1.

Yes, I suppose it could be if there is corrupt data.

dlech · 2016-06-06T20:27:53Z

although we should catch the error now instead of writing past the end of the buffer, so I think it is OK the way it is. EV3_UART_MSG_TYPE_INFO should never be the full 32 bytes (longest is 11 bytes). Only EV3_UART_MSG_TYPE_DATA are that big.

bmegli · 2016-06-06T21:23:02Z

Oh, ok, unlike you I don't know the protocol

Just wanted to make sure you didn't change the buffer size too hasty. I was only guessing this was a typo (message[UART_MAX_MESSAGE_SIZE+2] -> message[UART_MAX_MESSAGE_SIZE] that probably sneaked from the form message[UART_MAX_DATA_SIZE+2]. Uggghh.

Enough. ;-)

dlech · 2016-06-15T03:02:30Z

This fix has been released in the 12-ev3dev kernel.

Please confirm and close the issue if it is fixed.

kortschak · 2016-06-15T11:35:25Z

This appears to fix the failure. Though there is an interesting line in dmesg on reconnection. After the first connect, you see the following:

[  244.687878] lego-sensor sensor2: Unregistered 'ev3-uart-76' on 'evb-ports:in1'.
[  244.696682] pin1_mv: 4987
[  244.696687] pin6_mv: 68
[  244.696693] pin5_gpio: 1
[  244.696697] pin6_gpio: 0
[  253.009123] lego-port port0: Added new device 'evb-ports:in1:ev3-uart-host'
[  253.930700] lego-sensor sensor3: Registered 'ev3-uart-76' on 'evb-ports:in1'.
[  256.222986] tty ttyS4: Reconnected due to: Bad message size

Also interesting, but unrelated is that here gravity is only half strength! I get ±0.5g depending on orientation, never 1g.

bmegli · 2016-06-15T12:04:49Z

Interesting.

Wasn't there tty ttyS4: buffer overrun somewhere around? If not it means that there may be also some other, unknown yet, cause for Bad message size with 9DOF, but non critical now.

kortschak · 2016-06-15T12:32:40Z

I'll be opening another issue for the 9DOF wrt the factor of 2 error, so I may as well make a note of this as well. I'll do that tomorrow.

kortschak · 2016-06-15T12:34:19Z

From #662 (comment)

[   69.305480] lego-sensor sensor1: Registered 'ev3-uart-76' on 'evb-ports:in1'.
[   69.345030] tty ttyS4: Reconnected due to: Bad checksum.
[   69.365600] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: bfb02a35

bmegli · 2016-06-15T12:45:11Z

Yes, yes, I know about that one. The kernel panic on incorrect message size has already been fixed by @dlech. But there is still unknown reason for bad message size itself (which is non-fatal now).

@dlech has added buffer overrun message in place of possible cause but if it's not showing then it's some another cause.

For short:

there is bad message from 9DOF for some reason
this was causing kernel panic (result, fixed already)

kortschak · 2016-06-15T12:51:32Z

Yes, that was my meaning. It seems there is still an unknown reason, it was present in the original dmesg lines (the middle line in my last comment), but we didn't notice because worse things were also happening.

kortschak · 2016-06-15T12:58:31Z

Here's a sequence of connecting/removing of the 9DOF on an EV3 with a recent kernel upgrade (4.4.13-12-ev3dev-ev3)

[   68.492177] lego-port port0: Added new device 'in1:ev3-uart-host'
[   69.777069] Registered EV3 UART sensor line discipline. (29)
[   70.739787] lego-sensor sensor0: Registered 'ev3-uart-76' on 'in1'.
[   70.792347] tty ttyS1: Reconnected due to: Bad message size
[   73.023261] tty ttyS1: Reconnected due to: Bad message size
[   73.444670] ttyS1: buffer overrun
[   73.445291] ttyS1: buffer overrun
[   73.447617] ttyS1: buffer overrun
[   77.135573] ttyS1: buffer overrun
[   77.136330] ttyS1: buffer overrun
[   77.138547] ttyS1: buffer overrun
[   77.139235] ttyS1: buffer overrun
[   77.141624] ttyS1: buffer overrun
[   77.141975] ttyS1: buffer overrun
[   77.142622] ttyS1: buffer overrun
[   79.404329] tty ttyS1: Reconnected due to: Bad message size
[   87.576531] ev3_uart_receive_buf: 7 callbacks suppressed
[   87.576584] ttyS1: buffer overrun
[   87.577162] ttyS1: buffer overrun
[   89.786367] tty ttyS1: Reconnected due to: Bad message size
[   90.956587] ttyS1: buffer overrun
[   90.957222] ttyS1: buffer overrun
[   90.959295] ttyS1: buffer overrun
[   90.960065] ttyS1: buffer overrun
[   90.962575] ttyS1: buffer overrun
[   90.962862] ttyS1: buffer overrun
[   90.963937] ttyS1: buffer overrun
[   90.965152] ttyS1: buffer overrun
[   93.167001] tty ttyS1: Reconnected due to: Bad message size
[   94.177694] ev3_uart_receive_buf: 10 callbacks suppressed
[   94.177833] ttyS1: buffer overrun
[   94.178349] ttyS1: buffer overrun
[   96.447631] tty ttyS1: Reconnected due to: Bad message size
[  123.715056] ttyS1: buffer overrun
[  123.715812] ttyS1: buffer overrun
[  133.015177] lego-port port0: Removed device 'in1:ev3-uart-host'
[  133.138656] lego-sensor sensor0: Unregistered 'ev3-uart-76' on 'in1'.
[  135.596631] lego-port port0: Added new device 'in1:ev3-uart-host'
[  137.836522] lego-sensor sensor1: Registered 'ev3-uart-76' on 'in1'.
[  155.711342] ttyS1: buffer overrun
[  155.712021] ttyS1: buffer overrun
[  155.714145] ttyS1: buffer overrun
[  155.714862] ttyS1: buffer overrun
[  157.920986] tty ttyS1: Reconnected due to: Bad message size
[  188.358156] lego-port port0: Removed device 'in1:ev3-uart-host'
[  188.444315] lego-sensor sensor1: Unregistered 'ev3-uart-76' on 'in1'.
[  190.940146] lego-port port0: Added new device 'in1:ev3-uart-host'
[  193.189607] lego-sensor sensor2: Registered 'ev3-uart-76' on 'in1'.
[  195.470074] tty ttyS1: Reconnected due to: Bad message size

bmegli · 2016-06-15T17:28:14Z

@kortschak This is actually great news ;-)

The problem is pinpointed now exactly where we have guessed.

Forgive me the flood of questions - you are right about the issue (new).

kortschak · 2016-06-15T23:56:30Z

Yeah. I thought you would like that.

I'll have time to file the issue tonight.

dlech changed the title ~~EVB FATCATLAB 9DOF sporadically crashes Brickman~~ EVB FATCATLAB 9DOF sporadically crashes kernel Jun 3, 2016

dlech added the sensors label Jun 3, 2016

dlech mentioned this issue Jun 9, 2016

Plugging in Sensors prevents boot. #623

Closed

bmegli mentioned this issue Jun 10, 2016

Possible Cause for "Reconnected due to: Bad message size" #671

Open

kortschak closed this as completed Jun 15, 2016

kortschak mentioned this issue Jun 16, 2016

FATCATLAB 9DOF causes buffer overrun on Ev3 #679

Closed

bmegli mentioned this issue Jun 27, 2016

Possible race condition when resyncing UART sensor #690

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EVB FATCATLAB 9DOF sporadically crashes kernel #662

EVB FATCATLAB 9DOF sporadically crashes kernel #662

kortschak commented Jun 3, 2016

kortschak commented Jun 3, 2016

dlech commented Jun 3, 2016

bmegli commented Jun 5, 2016 •

edited

dlech commented Jun 5, 2016

bmegli commented Jun 6, 2016

dlech commented Jun 6, 2016

dlech commented Jun 6, 2016

dlech commented Jun 6, 2016

dlech commented Jun 6, 2016

bmegli commented Jun 6, 2016

dlech commented Jun 6, 2016 •

edited

dlech commented Jun 6, 2016 •

edited

bmegli commented Jun 6, 2016

bmegli commented Jun 6, 2016 •

edited

bmegli commented Jun 6, 2016

dlech commented Jun 6, 2016

bmegli commented Jun 6, 2016 •

edited

dlech commented Jun 6, 2016

dlech commented Jun 6, 2016

bmegli commented Jun 6, 2016

dlech commented Jun 15, 2016

kortschak commented Jun 15, 2016

bmegli commented Jun 15, 2016

kortschak commented Jun 15, 2016

kortschak commented Jun 15, 2016

bmegli commented Jun 15, 2016

kortschak commented Jun 15, 2016

kortschak commented Jun 15, 2016

bmegli commented Jun 15, 2016

kortschak commented Jun 15, 2016

EVB FATCATLAB 9DOF sporadically crashes kernel #662

EVB FATCATLAB 9DOF sporadically crashes kernel #662

Comments

kortschak commented Jun 3, 2016

System information:

About my issue:

kortschak commented Jun 3, 2016

dlech commented Jun 3, 2016

bmegli commented Jun 5, 2016 • edited

dlech commented Jun 5, 2016

bmegli commented Jun 6, 2016

dlech commented Jun 6, 2016

dlech commented Jun 6, 2016

dlech commented Jun 6, 2016

dlech commented Jun 6, 2016

bmegli commented Jun 6, 2016

dlech commented Jun 6, 2016 • edited

dlech commented Jun 6, 2016 • edited

bmegli commented Jun 6, 2016

bmegli commented Jun 6, 2016 • edited

bmegli commented Jun 6, 2016

dlech commented Jun 6, 2016

bmegli commented Jun 6, 2016 • edited

dlech commented Jun 6, 2016

dlech commented Jun 6, 2016

bmegli commented Jun 6, 2016

dlech commented Jun 15, 2016

kortschak commented Jun 15, 2016

bmegli commented Jun 15, 2016

kortschak commented Jun 15, 2016

kortschak commented Jun 15, 2016

bmegli commented Jun 15, 2016

kortschak commented Jun 15, 2016

kortschak commented Jun 15, 2016

bmegli commented Jun 15, 2016

kortschak commented Jun 15, 2016

bmegli commented Jun 5, 2016 •

edited

dlech commented Jun 6, 2016 •

edited

dlech commented Jun 6, 2016 •

edited

bmegli commented Jun 6, 2016 •

edited

bmegli commented Jun 6, 2016 •

edited