Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EVB FATCATLAB 9DOF sporadically crashes kernel #662

Closed
kortschak opened this issue Jun 3, 2016 · 30 comments
Closed

EVB FATCATLAB 9DOF sporadically crashes kernel #662

kortschak opened this issue Jun 3, 2016 · 30 comments
Labels

Comments

@kortschak
Copy link
Member

System information:

  • I am using: EVB
  • I installed ev3dev using this image file: evb-ev3dev-jessie-2016-01-25.img
  • My kernel version is: 4.4.9-ti-rt-r25-11-ev3dev-bb.org
  • My host computer is running: linux
  • I am connecting using: Wi-Fi

About my issue:

Plug in the 9DOF device and remove then plug in again.
Attempt to navigate in Brickman - not possible.

Additionally the EVB will not respond to an invocation of /sbin/poweroff (presumably it is waiting for Brickman to terminate).

@kortschak
Copy link
Member Author

Oooh. It's not Brickman. Relevant lines of dmesg output:

[   51.947745] lego-port port0: Added new device 'evb-ports:in1:ev3-uart-host'
[   52.205858] Registered EV3 UART sensor line discipline. (29)
[   54.192904] lego-sensor sensor0: Registered 'ev3-uart-76' on 'evb-ports:in1'.
[   54.233011] tty ttyS4: Reconnected due to: Bad message size.
[   56.508776] tty ttyS4: Reconnected due to: Bad message size.
[   66.913634] lego-port port0: Removed device 'evb-ports:in1:ev3-uart-host'
[   66.953411] lego-sensor sensor0: Unregistered 'ev3-uart-76' on 'evb-ports:in1'.
[   66.953590] pin1_mv: 4997
[   66.953596] pin6_mv: 53
[   66.953601] pin5_gpio: 1
[   66.953605] pin6_gpio: 0
[   68.376911] lego-port port0: Added new device 'evb-ports:in1:ev3-uart-host'
[   69.305480] lego-sensor sensor1: Registered 'ev3-uart-76' on 'evb-ports:in1'.
[   69.345030] tty ttyS4: Reconnected due to: Bad checksum.
[   69.365600] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: bfb02a35

[   69.365619] CPU: 0 PID: 226 Comm: kworker/0:3 Tainted: G        WC      4.4.9-ti-rt-r25-11-ev3dev-bb.org #1
[   69.365623] Hardware name: Generic AM33XX (Flattened Device Tree)
[   69.365667] Workqueue: events ev3_uart_handle_rx_data [ev3_uart_sensor_ld]
[   69.365709] [<c0015e91>] (unwind_backtrace) from [<c001259d>] (show_stack+0x11/0x14)
[   69.365732] [<c001259d>] (show_stack) from [<c03dafa1>] (dump_stack+0x5d/0x6c)
[   69.365751] [<c03dafa1>] (dump_stack) from [<c00fadbf>] (panic+0xbb/0x1d8)
[   69.365779] [<c00fadbf>] (panic) from [<c00314db>] (__stack_chk_fail+0x17/0x18)
[   69.365801] [<c00314db>] (__stack_chk_fail) from [<bfb02a35>] (ev3_uart_open+0x0/0x22c [ev3_uart_sensor_ld])
[   69.365851] [<bfb02a35>] (ev3_uart_open [ev3_uart_sensor_ld]) from [<00000000>] (  (null))
[   69.446314] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: bfb02a35

@dlech dlech changed the title EVB FATCATLAB 9DOF sporadically crashes Brickman EVB FATCATLAB 9DOF sporadically crashes kernel Jun 3, 2016
@dlech dlech added the sensors label Jun 3, 2016
@dlech
Copy link
Member

dlech commented Jun 3, 2016

Hmm... nothing obvious jumping out here. Makes me wonder if this is somehow related to #623.

It would be useful to put a logic analyzer on this to see if we really have a bad message size or if our driver needs to be fixed.

Also, an inspection of ev3_uart_sensor_ld.c by a fresh pair of eyes to see if we are passing a pointer to some stack allocated memory somewhere would be useful. It could be that we are using the pointer after the stack is popped.

@bmegli
Copy link
Member

bmegli commented Jun 5, 2016

I am not sure if this is the problem but lines 400-401 of ev3_uart_sensor_ld.c look suspicious.

port->in_port = NULL;
put_device(&port->in_port->dev);

So we are dereferencing through NULL pointer?
The lines should be probably in the reverse order.

Update -> Pull Request issued

@dlech
Copy link
Member

dlech commented Jun 5, 2016

I don't think we ever reach this code. We would certainly get a crash if we did. Thanks for the PR to fix it.

@bmegli
Copy link
Member

bmegli commented Jun 6, 2016

I think I may have found the problem:

static void ev3_uart_handle_rx_data(struct work_struct *work)
{
        //...
        u8 message[EV3_UART_MAX_MESSAGE_SIZE + 2];
        int count = CIRC_CNT(cb->head, cb->tail, EV3_UART_BUFFER_SIZE);
        //message is 36 bytes long (btw, shouldn't it be MAX_DATA_SIZE+2 or just MAX_MESSAGE_SIZE)
        //count is up to 255
        // ...
        //after syncing
        msg_size = ev3_uart_msg_size((u8)cb->buf[cb->tail]);
        //msg_size may be up 128 + 2 + 1!
        //EV3_UART_MAX_MESSAGE_SIZE macro returns up to 128, is this intentional?
        //note - we are before CRC check here still
        if (msg_size > count)
            break;

    size_to_end = CIRC_CNT_TO_END(cb->head, cb->tail, EV3_UART_BUFFER_SIZE);
    if (msg_size > size_to_end) {
        memcpy(message, cb->buf + cb->tail, size_to_end);
        memcpy(message + size_to_end, cb->buf, msg_size - size_to_end);
        cb->tail = msg_size - size_to_end;
    } else {
        memcpy(message, cb->buf + cb->tail, msg_size);
        cb->tail += msg_size;
        if (cb->tail >= EV3_UART_BUFFER_SIZE)
           cb->tail = 0;
        }

       //now we may have overwritten 36 byte buffer with up to 131 msg_size bytes

       //only after that we check that size is OK but we already may have had overflow

      if (msg_size > EV3_UART_MAX_MESSAGE_SIZE) {
         port->last_err = "Bad message size.";
         goto err_invalid_state;
      }
     //now follows the CRC check, but it is also too late

The obvious fix here is to move:

  if (msg_size > EV3_UART_MAX_MESSAGE_SIZE)
  //...

before memcpy of the memory, but I am not sure if another problem also doesn't lie in:

#define EV3_UART_CMD_SIZE(byte) (1 << (((byte) >> 3) & 0x7))

which doesn't limit size to around 30 bytes but to 128 instead

I am not enough into the code to answer all of that but hopefully it is enough for @dlech to fix it.

@dlech
Copy link
Member

dlech commented Jun 6, 2016

Awesome! This explains the stack corruption after the Bad Message Size error that @kortschak was seeing.

@dlech
Copy link
Member

dlech commented Jun 6, 2016

For reference, the official LEGO firmware does not list 64 or 128 as possible data sizes

https://github.com/mindboards/ev3sources/blob/master/lms2012/d_uart/Linuxmod_AM1808/d_uart_mod.c#L230

@dlech
Copy link
Member

dlech commented Jun 6, 2016

@dlech
Copy link
Member

dlech commented Jun 6, 2016

@bmegli
Copy link
Member

bmegli commented Jun 6, 2016

This explains the stack corruption after the Bad Message Size error that @kortschak was seeing.

Yes, correcting this will make the kernel code safe but there still remains the question:

Why was there incorrect message in the first place?

Maybe it is artifact of disconnecting the sensor while working or changing baudrate or other factor.

For reference, the official LEGO firmware does not list 64 or 128 as possible data sizes

Yes but we probably should not expect device to play nicely at this point. It can be anything connected to uart that just went through sync procedure, either by coincidence or even maliciously.

We can also simply get corrupted data.

You know - let's not trust them - sensors, nobody knows who they work for ;-)

Anyway - great code! Three thumbs up!

dlech added a commit to ev3dev/lego-linux-drivers that referenced this issue Jun 6, 2016
We were overflowing a stack allocated array when a message had a bad
size before acutally checking the size of the message.

Suggested-by: @bmegli
Issue: ev3dev/ev3dev#662
@dlech
Copy link
Member

dlech commented Jun 6, 2016

Why was there incorrect message in the first place?

@kortschak will have to add some printks and let us know.

I've pushed the fix suggested by @bmegli, so that should at least fix the stack corruption issue. @kortschak, can you confirm?

@dlech
Copy link
Member

dlech commented Jun 6, 2016

Interesting... With these changes, I'm getting a bunch of "Reconnected due to: Bad message size." with the LEGO EV3 Ultrasonic sensors (which was the problem sensor reported in #664).

@bmegli
Copy link
Member

bmegli commented Jun 6, 2016

Hmm, in original ev3 code you linked they do have a length check:

   Uart1RecMesLng  =  GET_MESSAGE_LENGTH(Byte) + 2; 

           if (Uart1RecMesLng <= UART_BUFFER_SIZE) 
           { // Valid length 
             Uart1RecMes[Uart1RecMesIn]  =  Byte; 
             Uart1RecMesIn++; 
           } 

@bmegli
Copy link
Member

bmegli commented Jun 6, 2016

and

UART_BUFFER_SIZE=64

so they pass messages up to 32 length

I mean, if GET_MESSAGE_LENGTH gives 64, they add 2 to it and it will fail the check
and message is not valid

edit: they allow messages smaller-equal 62
edit of edit: the steps due to coding are 32+2, 64+2 so they allow smaller equal 34

@bmegli
Copy link
Member

bmegli commented Jun 6, 2016

So the buffer seems just "nearest greater than 32 power of 2" ;-)

@dlech
Copy link
Member

dlech commented Jun 6, 2016

I haven't been able to reproduce the "Reconnected due to: Bad message size." since the first reboot. I am running into #572 quite a bit though. It makes me wonder if we have some kind of cross-talk issue with the hardware that is causing data corruption on the UARTs (Bluetooth and input ports 1 and 2 all use SoC UARTs).

We should save that discussion for another issue though. This issue is about the kernel crash, and I think we have fixed it.

@bmegli
Copy link
Member

bmegli commented Jun 6, 2016

Ok, one thing before I forget:

static inline int ev3_uart_msg_size(u8 header) 
 { 
    int size; 

    if (!(header & EV3_UART_MSG_TYPE_MASK)) /* SYNC, NACK, ACK */ 
        return 1;

    size = EV3_UART_CMD_SIZE(header); 
    size += 2; /* header and checksum */ 
    if ((header & EV3_UART_MSG_TYPE_MASK) == EV3_UART_MSG_TYPE_INFO) 
        size++; /* extra command byte */  

    return size; 
} 

So maybe max length could be 32+2+1.
I mean maybe my comment to limit

u8 message[EV3_UART_MAX_MESSAGE_SIZE];

length to 32+2 was wrong

To consider but yes, probably not here and now ;-)

Edit: in original lms code they allow length up to 62 as valid, ugh, but probably mean 34 or 35
Edit of Edit: silly me, the protocol allows 32+2, the next step is 64+2 so they allow at most 34

@dlech
Copy link
Member

dlech commented Jun 6, 2016

So maybe max length could be 32+2+1.

Yes, I suppose it could be if there is corrupt data.

@dlech
Copy link
Member

dlech commented Jun 6, 2016

although we should catch the error now instead of writing past the end of the buffer, so I think it is OK the way it is. EV3_UART_MSG_TYPE_INFO should never be the full 32 bytes (longest is 11 bytes). Only EV3_UART_MSG_TYPE_DATA are that big.

@bmegli
Copy link
Member

bmegli commented Jun 6, 2016

Oh, ok, unlike you I don't know the protocol

Just wanted to make sure you didn't change the buffer size too hasty. I was only guessing this was a typo (message[UART_MAX_MESSAGE_SIZE+2] -> message[UART_MAX_MESSAGE_SIZE] that probably sneaked from the form message[UART_MAX_DATA_SIZE+2]. Uggghh.

Enough. ;-)

@dlech
Copy link
Member

dlech commented Jun 15, 2016

This fix has been released in the 12-ev3dev kernel.

Please confirm and close the issue if it is fixed.

@kortschak
Copy link
Member Author

This appears to fix the failure. Though there is an interesting line in dmesg on reconnection. After the first connect, you see the following:

[  244.687878] lego-sensor sensor2: Unregistered 'ev3-uart-76' on 'evb-ports:in1'.
[  244.696682] pin1_mv: 4987
[  244.696687] pin6_mv: 68
[  244.696693] pin5_gpio: 1
[  244.696697] pin6_gpio: 0
[  253.009123] lego-port port0: Added new device 'evb-ports:in1:ev3-uart-host'
[  253.930700] lego-sensor sensor3: Registered 'ev3-uart-76' on 'evb-ports:in1'.
[  256.222986] tty ttyS4: Reconnected due to: Bad message size

Also interesting, but unrelated is that here gravity is only half strength! I get ±0.5g depending on orientation, never 1g.

@bmegli
Copy link
Member

bmegli commented Jun 15, 2016

Interesting.

Wasn't there tty ttyS4: buffer overrun somewhere around? If not it means that there may be also some other, unknown yet, cause for Bad message size with 9DOF, but non critical now.

@kortschak
Copy link
Member Author

I'll be opening another issue for the 9DOF wrt the factor of 2 error, so I may as well make a note of this as well. I'll do that tomorrow.

@kortschak
Copy link
Member Author

From #662 (comment)

[   69.305480] lego-sensor sensor1: Registered 'ev3-uart-76' on 'evb-ports:in1'.
[   69.345030] tty ttyS4: Reconnected due to: Bad checksum.
[   69.365600] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: bfb02a35

@bmegli
Copy link
Member

bmegli commented Jun 15, 2016

Yes, yes, I know about that one. The kernel panic on incorrect message size has already been fixed by @dlech. But there is still unknown reason for bad message size itself (which is non-fatal now).

@dlech has added buffer overrun message in place of possible cause but if it's not showing then it's some another cause.

For short:

  • there is bad message from 9DOF for some reason
  • this was causing kernel panic (result, fixed already)

@kortschak
Copy link
Member Author

Yes, that was my meaning. It seems there is still an unknown reason, it was present in the original dmesg lines (the middle line in my last comment), but we didn't notice because worse things were also happening.

@kortschak
Copy link
Member Author

Here's a sequence of connecting/removing of the 9DOF on an EV3 with a recent kernel upgrade (4.4.13-12-ev3dev-ev3)

[   68.492177] lego-port port0: Added new device 'in1:ev3-uart-host'
[   69.777069] Registered EV3 UART sensor line discipline. (29)
[   70.739787] lego-sensor sensor0: Registered 'ev3-uart-76' on 'in1'.
[   70.792347] tty ttyS1: Reconnected due to: Bad message size
[   73.023261] tty ttyS1: Reconnected due to: Bad message size
[   73.444670] ttyS1: buffer overrun
[   73.445291] ttyS1: buffer overrun
[   73.447617] ttyS1: buffer overrun
[   77.135573] ttyS1: buffer overrun
[   77.136330] ttyS1: buffer overrun
[   77.138547] ttyS1: buffer overrun
[   77.139235] ttyS1: buffer overrun
[   77.141624] ttyS1: buffer overrun
[   77.141975] ttyS1: buffer overrun
[   77.142622] ttyS1: buffer overrun
[   79.404329] tty ttyS1: Reconnected due to: Bad message size
[   87.576531] ev3_uart_receive_buf: 7 callbacks suppressed
[   87.576584] ttyS1: buffer overrun
[   87.577162] ttyS1: buffer overrun
[   89.786367] tty ttyS1: Reconnected due to: Bad message size
[   90.956587] ttyS1: buffer overrun
[   90.957222] ttyS1: buffer overrun
[   90.959295] ttyS1: buffer overrun
[   90.960065] ttyS1: buffer overrun
[   90.962575] ttyS1: buffer overrun
[   90.962862] ttyS1: buffer overrun
[   90.963937] ttyS1: buffer overrun
[   90.965152] ttyS1: buffer overrun
[   93.167001] tty ttyS1: Reconnected due to: Bad message size
[   94.177694] ev3_uart_receive_buf: 10 callbacks suppressed
[   94.177833] ttyS1: buffer overrun
[   94.178349] ttyS1: buffer overrun
[   96.447631] tty ttyS1: Reconnected due to: Bad message size
[  123.715056] ttyS1: buffer overrun
[  123.715812] ttyS1: buffer overrun
[  133.015177] lego-port port0: Removed device 'in1:ev3-uart-host'
[  133.138656] lego-sensor sensor0: Unregistered 'ev3-uart-76' on 'in1'.
[  135.596631] lego-port port0: Added new device 'in1:ev3-uart-host'
[  137.836522] lego-sensor sensor1: Registered 'ev3-uart-76' on 'in1'.
[  155.711342] ttyS1: buffer overrun
[  155.712021] ttyS1: buffer overrun
[  155.714145] ttyS1: buffer overrun
[  155.714862] ttyS1: buffer overrun
[  157.920986] tty ttyS1: Reconnected due to: Bad message size
[  188.358156] lego-port port0: Removed device 'in1:ev3-uart-host'
[  188.444315] lego-sensor sensor1: Unregistered 'ev3-uart-76' on 'in1'.
[  190.940146] lego-port port0: Added new device 'in1:ev3-uart-host'
[  193.189607] lego-sensor sensor2: Registered 'ev3-uart-76' on 'in1'.
[  195.470074] tty ttyS1: Reconnected due to: Bad message size

@bmegli
Copy link
Member

bmegli commented Jun 15, 2016

@kortschak This is actually great news ;-)

The problem is pinpointed now exactly where we have guessed.

Forgive me the flood of questions - you are right about the issue (new).

@kortschak
Copy link
Member Author

Yeah. I thought you would like that.

I'll have time to file the issue tonight.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants