Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

App crash with Mq Error Code 15 #332

Closed
1 task
marthtz opened this issue Nov 2, 2020 · 5 comments
Closed
1 task

App crash with Mq Error Code 15 #332

marthtz opened this issue Nov 2, 2020 · 5 comments
Assignees
Labels
bug Something isn't working
Milestone

Comments

@marthtz
Copy link
Contributor

marthtz commented Nov 2, 2020

Required information

Operating system:
Ubuntu 18.04 LTS

Compiler version:
GCC 7.4.0

Observed result or behaviour:
Add a session ID to the message queue communication to filter outdated messages when an application restarts.

Expected result or behaviour:
Application registers at RouDi and runs like expected.

Conditions where it occurred / Performed steps:
Quite hard, since it is a race. But is is possible with a debugger. Start RouDi and an application. Just stop the activation thread in RouDi right before the WAKEUP_TRIGGER is sent. Stop and restart the application.
Quite rare, but when this bug is triggered, then the application registers at RouDi and receives an outdated message when an application/sender/receiver port is requested. The application terminates itself.

TODO

  • Re-write iceoryx_posh/test/moduletests/test_mq_interface.cpp to test only POSH specific functionality of the IpcInterface e.g. getShmTopic(), IpcChannel is already covered by iceoryx_utils/test/moduletests/test_ipc_channel.cpp (follow-up from Integration of Unix Domain Sockets #381 )
@marthtz marthtz added the bug Something isn't working label Nov 2, 2020
@marthtz
Copy link
Contributor Author

marthtz commented Nov 2, 2020

To actually fix this bug, we need to send a session ID with each mqueue message. With the session ID, the recipient can check if the received message is valid or from an outdated session.

The message queue communication needs a more sophisticated protocol with an header and payload. Since the communication can break in several ways, I would propose the following changes. To fix this specific bug, only the first step needs to be implemented.

Step 1

  • MqMessage is split into header and payload, where the header is just an fixed, reserved amount of space at the beginning of the internal string and filled in sendMessageToProcess with the session ID which is passed alongside the MqMessage
  • to be able to create a more sophisticated header in the next steps, the first byte is the magic number 0xFF
  • followed by 8 byte session id
  • followed by two bytes payload size and the actual payload, which consists of the current protocol
  • to keep our tests running, we need an "always valid" session id. this could be either 0 or -1u. this is not great for security reasons, but doesn't make it worse than the current situation
  • framing
    • magic number (0xFF): 1 byte -> maybe have this as protocol version byte and also use it later
    • session id (uint64_t): 8 byte
    • payload length (uint16_t, N): 2 bytes
    • payload: N bytes

Step 2

  • split MqMessageType into several enums, since it contains to much unrelated stuff
  • MqMessageType (uint8_t)
    • REQUEST-> for e.g. IMPL_SENDER
    • RESPONSE -> for e.g. IMPL_SENDER
    • NOTIFICATION -> better name???; for e.g. KEEP_ALIVE, WAKEUP_TRIGGER
  • MqMessageCommand (uint8_t)
    • REG
    • IMPL_SENDER
    • FIND_SERVICE --> this could actually allocate a chunk from the mempool and just send a ChunkManagement pointer to the application; would fix AOS-2931
    • ...
    • INVALID -> this is the last one, used for sanity checks and unifies the current BEGIN, NO_TYPE and END enum values
  • MqMessageCommandResult (uint8_t)
    • ACK
    • ERROR
  • MqMessageErrorType (uint8_t)
    • UNKNOWN_MESSAGE
    • MESSAGE_NOT_SUPPORTED
    • INCOMPATIBLE_PROTOCOL -> REG should send an expected protocol version, since we don't promise any binary compatibility
    • NO_UNIQUE_CREATED
    • SENDER_LIST_FULL
    • INVALID -> used for sanity checks
  • MqMessageNotifications (uint8_t)
    • KEEP_ALIVE -> this could be refactored to an atomic_flag; atomic_flag::clear() in app; atomic_flag::test_and_set() in RouDi; we need some memory for each application in the shared memory, though
    • WAKEUP_TRIGGER -> this could be refactored to be triggered by a semaphore in the shared memory
    • APP_WAIT -> this should probably remain a mqueue message; alternatively a multi pusher single popper queue in shm
    • INVALID -> used for sanity checks
  • framing for REQUEST/NOTIFICATION
    • message type (uint8_t): 1 byte
    • message command/notification (uint8_t): 1 byte
    • session id (uint64_t): 8 byte
    • payload size (uint16_t, N): 2 byte
    • payload (like now, except MqMessageType): N bytes
  • framing for RESPONSE
    • message type (uint8_t): 1 byte
    • message command (uint8_t): 1 byte
    • message command result (uint8_t): 1 byte
    • session id (uint64_t): 8 byte
    • payload size (uint16_t, N): 2 byte
    • payload (like now, except MqMessageType): N bytes

Step 3

  • don't convert to ASCII, but introduce a binary protocol for the payload
  • each payload entry consists of 2 bytes data length + actual data
  • framing
    • message type (uint8_t): 1 byte
    • message command/notification (uint8_t): 1 byte
    • session id (uint64_t): 8 byte
    • payload size (uint16_t, N): 2 byte
    • data entry size (uint16_t, X): 2 byte
    • data entry: X byte
    • data entry size (uint16_t, Y): 2 byte
    • data entry: Y byte

Step 1 is necessary to fix this bug.

Step 2 will prevent us from bugs where the app receives interleaving messages. We already had this with the introduction of SERVICE_REGISTRY_CHANGE_COUNTER. With the NOTIFICATION approach, app requests could be cached in the app and the app could wait for the actual response.

Step 3 is not much work when Step 2 is done and will increase the performance since the serialization is now much simpler.

For Step 2 and 3 probably an own story should be created. Maybe it's also unnecessary if we do the transition to a shm based communication, but even there we need some kind of protocol.

@budrus
Copy link
Contributor

budrus commented Mar 21, 2021

@mossmaurice @elBoberido. Do we still have this issue with the latest changes in #404 ?

@elBoberido
Copy link
Member

@budrus this issue only persists if there is an additional thread accessing the IPC channel. We had that once in a RouDi extension.

This issue contains a proposal how to overhaul the IPC communication. This could be done as part of #611, though. @mossmaurice what do you think?

@budrus
Copy link
Contributor

budrus commented Jan 20, 2022

@elBoberido please create an issue for the refactoring and then we'll close this one

@elBoberido
Copy link
Member

With the merge of the Client and Server API, RouDi addons do not need to reuse the socket/message queue therefore this issue is closed. The proposed protocol for the UDS communication will be implemented in #1133

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants