New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EVB FATCATLAB 9DOF sporadically crashes kernel #662
Comments
Oooh. It's not Brickman. Relevant lines of
|
Hmm... nothing obvious jumping out here. Makes me wonder if this is somehow related to #623. It would be useful to put a logic analyzer on this to see if we really have a bad message size or if our driver needs to be fixed. Also, an inspection of |
I am not sure if this is the problem but lines 400-401 of port->in_port = NULL;
put_device(&port->in_port->dev); So we are dereferencing through NULL pointer? Update -> Pull Request issued |
I don't think we ever reach this code. We would certainly get a crash if we did. Thanks for the PR to fix it. |
I think I may have found the problem: static void ev3_uart_handle_rx_data(struct work_struct *work)
{
//...
u8 message[EV3_UART_MAX_MESSAGE_SIZE + 2];
int count = CIRC_CNT(cb->head, cb->tail, EV3_UART_BUFFER_SIZE);
//message is 36 bytes long (btw, shouldn't it be MAX_DATA_SIZE+2 or just MAX_MESSAGE_SIZE)
//count is up to 255
// ...
//after syncing
msg_size = ev3_uart_msg_size((u8)cb->buf[cb->tail]);
//msg_size may be up 128 + 2 + 1!
//EV3_UART_MAX_MESSAGE_SIZE macro returns up to 128, is this intentional?
//note - we are before CRC check here still
if (msg_size > count)
break;
size_to_end = CIRC_CNT_TO_END(cb->head, cb->tail, EV3_UART_BUFFER_SIZE);
if (msg_size > size_to_end) {
memcpy(message, cb->buf + cb->tail, size_to_end);
memcpy(message + size_to_end, cb->buf, msg_size - size_to_end);
cb->tail = msg_size - size_to_end;
} else {
memcpy(message, cb->buf + cb->tail, msg_size);
cb->tail += msg_size;
if (cb->tail >= EV3_UART_BUFFER_SIZE)
cb->tail = 0;
}
//now we may have overwritten 36 byte buffer with up to 131 msg_size bytes
//only after that we check that size is OK but we already may have had overflow
if (msg_size > EV3_UART_MAX_MESSAGE_SIZE) {
port->last_err = "Bad message size.";
goto err_invalid_state;
}
//now follows the CRC check, but it is also too late The obvious fix here is to move: if (msg_size > EV3_UART_MAX_MESSAGE_SIZE)
//... before #define EV3_UART_CMD_SIZE(byte) (1 << (((byte) >> 3) & 0x7)) which doesn't limit size to around 30 bytes but to 128 instead I am not enough into the code to answer all of that but hopefully it is enough for @dlech to fix it. |
Awesome! This explains the stack corruption after the Bad Message Size error that @kortschak was seeing. |
For reference, the official LEGO firmware does not list 64 or 128 as possible data sizes |
Ah, but LEGO has buffer size of 64. |
But still max data length of 32 |
Yes, correcting this will make the kernel code safe but there still remains the question: Why was there incorrect message in the first place? Maybe it is artifact of disconnecting the sensor while working or changing baudrate or other factor.
Yes but we probably should not expect device to play nicely at this point. It can be anything connected to uart that just went through sync procedure, either by coincidence or even maliciously. We can also simply get corrupted data. You know - let's not trust them - sensors, nobody knows who they work for ;-) Anyway - great code! Three thumbs up! |
We were overflowing a stack allocated array when a message had a bad size before acutally checking the size of the message. Suggested-by: @bmegli Issue: ev3dev/ev3dev#662
@kortschak will have to add some I've pushed the fix suggested by @bmegli, so that should at least fix the stack corruption issue. @kortschak, can you confirm? |
Interesting... With these changes, I'm getting a bunch of "Reconnected due to: Bad message size." with the LEGO EV3 Ultrasonic sensors (which was the problem sensor reported in #664). |
Hmm, in original ev3 code you linked they do have a length check: Uart1RecMesLng = GET_MESSAGE_LENGTH(Byte) + 2;
if (Uart1RecMesLng <= UART_BUFFER_SIZE)
{ // Valid length
Uart1RecMes[Uart1RecMesIn] = Byte;
Uart1RecMesIn++;
} |
and
so they pass messages up to 32 length I mean, if GET_MESSAGE_LENGTH gives 64, they add 2 to it and it will fail the check edit: they allow messages smaller-equal 62 |
So the buffer seems just "nearest greater than 32 power of 2" ;-) |
I haven't been able to reproduce the "Reconnected due to: Bad message size." since the first reboot. I am running into #572 quite a bit though. It makes me wonder if we have some kind of cross-talk issue with the hardware that is causing data corruption on the UARTs (Bluetooth and input ports 1 and 2 all use SoC UARTs). We should save that discussion for another issue though. This issue is about the kernel crash, and I think we have fixed it. |
Ok, one thing before I forget: static inline int ev3_uart_msg_size(u8 header)
{
int size;
if (!(header & EV3_UART_MSG_TYPE_MASK)) /* SYNC, NACK, ACK */
return 1;
size = EV3_UART_CMD_SIZE(header);
size += 2; /* header and checksum */
if ((header & EV3_UART_MSG_TYPE_MASK) == EV3_UART_MSG_TYPE_INFO)
size++; /* extra command byte */
return size;
} So maybe max length could be 32+2+1. u8 message[EV3_UART_MAX_MESSAGE_SIZE]; length to 32+2 was wrong To consider but yes, probably not here and now ;-) Edit: in original lms code they allow length up to 62 as valid, ugh, but probably mean 34 or 35 |
Yes, I suppose it could be if there is corrupt data. |
although we should catch the error now instead of writing past the end of the buffer, so I think it is OK the way it is. |
Oh, ok, unlike you I don't know the protocol Just wanted to make sure you didn't change the buffer size too hasty. I was only guessing this was a typo ( Enough. ;-) |
This fix has been released in the 12-ev3dev kernel. Please confirm and close the issue if it is fixed. |
This appears to fix the failure. Though there is an interesting line in dmesg on reconnection. After the first connect, you see the following:
Also interesting, but unrelated is that here gravity is only half strength! I get ±0.5g depending on orientation, never 1g. |
Interesting. Wasn't there |
I'll be opening another issue for the 9DOF wrt the factor of 2 error, so I may as well make a note of this as well. I'll do that tomorrow. |
From #662 (comment)
|
Yes, yes, I know about that one. The kernel panic on incorrect message size has already been fixed by @dlech. But there is still unknown reason for @dlech has added For short:
|
Yes, that was my meaning. It seems there is still an unknown reason, it was present in the original dmesg lines (the middle line in my last comment), but we didn't notice because worse things were also happening. |
Here's a sequence of connecting/removing of the 9DOF on an EV3 with a recent kernel upgrade (4.4.13-12-ev3dev-ev3)
|
@kortschak This is actually great news ;-) The problem is pinpointed now exactly where we have guessed. Forgive me the flood of questions - you are right about the issue (new). |
Yeah. I thought you would like that. I'll have time to file the issue tonight. |
System information:
About my issue:
Plug in the 9DOF device and remove then plug in again.
Attempt to navigate in Brickman - not possible.
Additionally the EVB will not respond to an invocation of /sbin/poweroff (presumably it is waiting for Brickman to terminate).
The text was updated successfully, but these errors were encountered: