Wrong handling of applications that are started multiple times in parallel #404

Closed
4 tasks done
budrus opened this issue Dec 2, 2020 · 6 comments · Fixed by #523, #598 or #606
Labels: bug (Something isn't working)

budrus (Contributor) commented Dec 2, 2020

Required information

Operating system:
Gentoo Linux

Compiler version:
GCC 9.3.0

Observed result or behaviour:
When a second instance of an application is started with the same runtime name, the handling is wrong.
With Monitoring: The second application tries to register again and again as there is no response, until it runs into a timeout saying that there is no RouDi. RouDi ignores the registration because this app is already registered, but does not send an error response.
Without Monitoring: The second application is accepted and the shared memory resources of the first one are removed. But the first application is still running and on a nice trip to undefined behavior.

Expected result or behaviour:
RouDi should prevent multiple instances of an application with the same runtime name from running in parallel and should instantly report a clear error.

Conditions where it occurred / Performed steps:
Start RouDi and the first instance of the application. Then start another instance of the application

Bugfix steps

  • Fix verbose output when iox-roudi --help
  • Extend timeout info in runtime if RouDi is not answering b/c keepalive limit was not yet reached
  • Send de-register message from runtime d'tor
  • Add APP_ALREADY_REGISTERED message and inform user via runtime
@budrus budrus self-assigned this Dec 2, 2020
@budrus budrus added the bug Something isn't working label Dec 2, 2020
budrus (Contributor, Author) commented Dec 2, 2020

For the re-registration process of applications we have to distinguish the following cases.

  1. a gracefully terminated application re-registers with RouDi w/ active monitoring
  2. a hard terminated application re-registers with RouDi w/ active monitoring
  3. a second application with the same name tries to register with RouDi w/ active monitoring
  4. a gracefully terminated application re-registers with RouDi w/o monitoring
  5. a hard terminated application re-registers with RouDi w/o monitoring
  6. a second application with the same name tries to register with RouDi w/o monitoring

Gracefully terminated means that main() returns and the runtime goes out of scope (e.g. by catching Ctrl-C with a signal handler that terminates an endless loop). Hard termination means that Ctrl-C is not caught, we run into an assertion, etc. The difference is that our runtime d'tor is only executed in case of a graceful termination.
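A minimal sketch of such a graceful termination, assuming a plain signal handler around an endless loop (the runtime acquisition is elided since it is not the point here):

#include <atomic>
#include <csignal>

std::atomic_bool keepRunning{true};

static void sigHandler(int)
{
    keepRunning = false; // leave the endless loop instead of aborting
}

int main()
{
    std::signal(SIGINT, sigHandler); // catch Ctrl-C
    // ... acquire the runtime here ...

    while (keepRunning)
    {
        // application work
    }

    return 0; // main returns, the stack unwinds and the runtime d'tor runs
}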

Current behavior is:

1. a gracefully terminated application re-registers with RouDi w/ active monitoring
Re-registration is fine when the monitoring has already cleaned up. If the monitoring has not yet cleaned up, there is a warning from RouDi and no response to the REG message. As the application tries again, the re-registration succeeds after the monitoring cleanup.
2. a hard terminated application re-registers with RouDi w/ active monitoring
Re-registration is fine when the monitoring has already cleaned up. If the monitoring has not yet cleaned up, there is a warning from RouDi and no response to the REG message. As the application tries again, the re-registration succeeds after the monitoring cleanup.
3. a second application with the same name tries to register with RouDi w/ active monitoring
The second application starts to register over and over again as there is no response from RouDi. After the "wait for RouDi" timeout the application is terminated with the error MQ_INTERFACE__REG_ROUDI_NOT_AVAILABLE. RouDi prints a warning for every registration attempt but does not react otherwise.
4. a gracefully terminated application re-registers with RouDi w/o monitoring
If the monitoring is deactivated, RouDi cannot distinguish between still running and re-registering applications. It assumes a re-registration, removes the old application (and its shared memory resources) and registers the new application. This is fine in this case.
5. a hard terminated application re-registers with RouDi w/o monitoring
If the monitoring is deactivated, RouDi cannot distinguish between still running and re-registering applications. It assumes a re-registration, removes the old application (and its shared memory resources) and registers the new application. This is fine in this case.
6. a second application with the same name tries to register with RouDi w/o monitoring
If the monitoring is deactivated, RouDi cannot distinguish between still running and re-registering applications. It assumes a re-registration, removes the old application (and its shared memory resources) and registers the new application. This is NOT fine in this case as the first application is still running: it becomes a zombie and works on shared memory objects that were freed.

Proposed future behavior is:

The following changes are proposed:

  • Doing a deregister with RouDi in the d'tor of the runtime, i.e. triggering the cleanup of the shared memory resources. This gives an earlier cleanup when monitoring is activated and allows cleaning up when there is no monitoring (related to Clean up all shared memory resources when corresponding user objects go out of scope #369)
  • Sending an error response over the message queue if an application is already registered. This leads to an immediate termination of the second application with an error saying that this application is already running.
  • When RouDi detects that an application is already registered and the monitoring is deactivated, checking with kill(0) (i.e. no signal) whether the first application is still alive. If it is not, do the registration for the new one; if it is still running, send an error response to the second application (a minimal sketch of such a liveness check follows below this list). This has the drawback that we have a problem when the first application was terminated and the PID was reused before it is started again. The consequence would be that RouDi must be restarted. But this only happens when a) the monitoring is deactivated (it is enabled by default), b) the application was not gracefully terminated and c) the PID is reused between termination and restart. This is not a perfect solution but I currently don't see a better one.
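A minimal sketch of the liveness check mentioned in the last bullet, assuming plain POSIX (the function name is only a placeholder, not existing iceoryx code):

#include <cerrno>
#include <signal.h>    // kill
#include <sys/types.h> // pid_t

// kill() with signal number 0 performs the existence and permission check
// without delivering any signal to the target process.
bool isProcessAlive(const pid_t pid)
{
    if (kill(pid, 0) == 0)
    {
        return true; // process exists and we are allowed to signal it
    }
    // EPERM: process exists but we lack permission; ESRCH: no such process
    return errno == EPERM;
}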

1. a gracefully terminated application re-registers with RouDi w/ active monitoring
The re-registration is fine as the termination of the last execution did a deregister.
2. a hard terminated application re-registers with RouDi w/ active monitoring
Re-registration is fine when the monitoring has already cleaned up. If the monitoring has not yet cleaned up, there is an error response and the application gets terminated. The registration must be retried after the monitoring has cleaned up.
3. a second application with the same name tries to register with RouDi w/ active monitoring
Error response from RouDi and termination of the application with an error saying that it is already running.
4. a gracefully terminated application re-registers with RouDi w/o monitoring
The re-registration is fine as the termination of the last execution did a deregister.
5. a hard terminated application re-registers with RouDi w/o monitoring
If the PID was not already reused, RouDi does the cleanup as it detects that the first application is no longer running, and registers the new one. If the PID is already reused we have some kind of deadlock :-(
6. a second application with the same name tries to register with RouDi w/o monitoring
As the first application is still running, we detect this and send an error response. The second application gets terminated with an error saying that it is already running.

@sculpordwarf @dkroenke @mossmaurice. What do you think?

@budrus budrus added this to To do in v1.0 via automation Dec 15, 2020
@budrus budrus added this to To do in Sprints Jan 15, 2021
mossmaurice (Contributor) commented:

@budrus Proposal makes sense to me. I'll take up this issue and work on it in the coming weeks.

I plan the following pull requests (in no particular order):

  1. Fix verbose output when iox-roudi --help
  2. Extend timeout info in runtime if RouDi is not answering b/c keepalive limit was not yet reached
  3. Send de-register message from runtime d'tor
  4. Add APP_ALREADY_REGISTERED message and inform user via runtime

@mossmaurice mossmaurice assigned mossmaurice and unassigned budrus Jan 21, 2021
sculpordwarf (Contributor) commented:

@budrus
I don't get 'checking with kill(0) (i.e. no signal) whether the first application is still alive'.
kill(0) = kill own process? kill is also a signal, isn't it?

From my point of view:

  1. a, 5. a and 6. a) RouDi knows the PID of the process. An "are you still there" function could be realized via signals. Maybe it would be a good idea to shut down a single process like it is done for all processes in ProcessManager::killAllProcesses: send SIGTERM (which fails if the process can't be shut down or is already gone (different error code)) -> cyclically resend SIGTERM to check if the process has shut down -> if the app has not shut down after time X -> SIGKILL -> SIGTERM cyclically to check if it is gone (a sketch of this escalation follows after the enums below). Otherwise, you could use some signals and implement a PING/PONG over them -> RouDi may need to expose its PID, e.g. over REG_ACK.
    In RouDi you need to check in the REG case whether the PID is already used by an already registered process. A shutdown of the previous process is needed if the PID is already in the application list, since the PID is unique; otherwise a kill could hit the wrong app when a re-register comes. By sending random REG messages to RouDi, this can be misused to shut down other applications. If the PID has been reused by another application (e.g. top), RouDi could kill the wrong app.

  2. a) The application should not get MQ_INTERFACE__REG_ROUDI_NOT_AVAILABLE. It should get a meaningful error message. RouDi should respond with a message like "registration failed" together with a reason:

enum class MqMessageType : int32_t
{
    ...,
    REG_FAILED // new response type for a failed registration
};

enum class MqMessageFailedReasonType : int32_t
{
    APPLICATION_ALREADY_UP // reason sent along with REG_FAILED
};
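A minimal sketch of the SIGTERM/SIGKILL escalation described in point 1, assuming plain POSIX; the polling interval and retry count are placeholders, not values taken from iceoryx:

#include <cerrno>
#include <chrono>
#include <signal.h>    // kill, SIGTERM, SIGKILL
#include <sys/types.h> // pid_t
#include <thread>

// Ask politely with SIGTERM, poll until the process is gone,
// escalate to SIGKILL after a deadline.
bool shutdownProcess(const pid_t pid)
{
    constexpr auto POLL_INTERVAL = std::chrono::milliseconds(100);
    constexpr int MAX_TERM_POLLS = 50; // ~5 s of patience before escalating

    kill(pid, SIGTERM);
    for (int i = 0; i < MAX_TERM_POLLS; ++i)
    {
        std::this_thread::sleep_for(POLL_INTERVAL);
        if (kill(pid, 0) == -1 && errno == ESRCH)
        {
            return true; // process has terminated
        }
    }

    kill(pid, SIGKILL); // did not react to SIGTERM in time
    std::this_thread::sleep_for(POLL_INTERVAL);
    return kill(pid, 0) == -1 && errno == ESRCH;
}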

budrus (Contributor, Author) commented Jan 24, 2021

@sculpordwarf

I don't get 'checking with kill(0) (i.e. no signal) whether the first application is still alive'

I think you can send kill to the process but with 0 as the signal parameter, i.e. without actually delivering a signal. See here
I thought this could be the content of the "are you still there" function.

If there is no monitoring and the PID was already reused we cannot differentiate 5 and 6.
If we respond with APP_ALREADY_REGISTERED this is fine for an application that was started twice, but if it crashed it can never register again.
If we kill the process with that PID we would kill an arbitrary application just to get our crashed application registered again. This is the worst option for me.

I'm wondering if you are describing a case where an application registers with a PID that already belongs to a registered application. This is an additional scenario; here we could remove the application that is in the list, as there are no two processes with the same PID. But this case is not the normal case of 5 and 6. There we would have the same process name but different PIDs.

The attack you describe is a problem with the current message queue design and must be fixed when switching to UDS and message passing. There we can get the information from the OS about which user / application registers. A user-provided PID is a bad idea for sure.

mossmaurice (Contributor) commented Jan 29, 2021

After a chat with @sculpordwarf and @elBoberido we decided to switch to UNIX domain sockets first (#381) before addressing this bug report. The behaviour is implementation-specific and therefore there is no need to fix the message queue behaviour, as the message queues will be deprecated before v1.0.

Current behaviour example (3 & 6):

  1. Start 1st Subscriber
  2. Start 2nd Subscriber (same exec again): 2nd subscriber "steals" the MQ due to mq_unlink
  3. Start RouDi
  4. Start Publisher
  5. Communication happens between Publisher and 2nd Subscriber, 1st Subscriber loops endlessly

Ideas to allow more than one app with the same name:

  • Create 2nd IPC channel (Foo_2) if there is already one with the same name
    • RouDi could then use this IPC channel to inform the 2nd app that the name is already taken by another one
    • Let RouDi do the cleanup of the IPC channels?
    • Use AppName_PID as name for IPC channel?
  • Is it possible to use one IPC channel with UNIX domain sockets and two receiving applications?

elBoberido (Member) commented:

  * Let RouDi do the cleanup of the IPC channels?

This is the tricky part. Ideally the cleanup is not done by deleting and recreating the IPC channel, since this opens a window for a race with a third start of the application, which notices that there is no IPC channel and tries to create one.

I would suggest the following sequence:

  • create the IPC channel with the app name only if there is none -> happy path
  • if there is one, create a unique IPC channel and tell RouDi to use that one for the initial response (a sketch of these first two steps follows after this list)
    • could be a flag in the registration message like outdatedIpcChannelCheck
    • if RouDi thinks the application is still running, tell the new instance that it is not allowed to register
      • the new instance deletes its IPC channel
    • if RouDi has already cleaned up the resources, tell the new instance that it is registered and to use the IPC channel with the app name only
      • RouDi opens the IPC channel with the app name only for further communication
      • the new instance opens the IPC channel with the app name only and deletes all remaining messages
      • the new instance deletes the previously created IPC channel and only uses the one with the app name
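A minimal sketch of the first two steps of this sequence (exclusive creation, fallback to a unique channel), shown with POSIX message queues purely for illustration; the function and the naming scheme (app name plus PID) are assumptions, not existing iceoryx code:

#include <cerrno>
#include <fcntl.h>   // O_CREAT, O_EXCL, O_RDONLY
#include <mqueue.h>  // mq_open
#include <string>
#include <unistd.h>  // getpid

// Try to create the IPC channel with the app name exclusively. If it already
// exists, fall back to a unique channel that is only used for the initial
// registration handshake with RouDi (e.g. flagged via outdatedIpcChannelCheck).
mqd_t openRegistrationChannel(const std::string& appName, std::string& usedName)
{
    usedName = "/" + appName;
    mqd_t mq = mq_open(usedName.c_str(), O_CREAT | O_EXCL | O_RDONLY, 0660, nullptr);
    if (mq != static_cast<mqd_t>(-1))
    {
        return mq; // happy path: no channel with the app name existed
    }
    if (errno != EEXIST)
    {
        return static_cast<mqd_t>(-1); // unrelated error
    }

    // a channel with the app name already exists: create a unique one for the handshake
    usedName = "/" + appName + "_" + std::to_string(getpid());
    return mq_open(usedName.c_str(), O_CREAT | O_EXCL | O_RDONLY, 0660, nullptr);
}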

@mossmaurice mossmaurice moved this from To do to Current in Sprints Feb 4, 2021