Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRIU: getLastRestoreTime part2 #18598

Closed
tajila opened this issue Dec 12, 2023 · 9 comments
Closed

CRIU: getLastRestoreTime part2 #18598

tajila opened this issue Dec 12, 2023 · 9 comments
Labels
comp:vm criu Used to track CRIU snapshot related work

Comments

@tajila
Copy link
Contributor

tajila commented Dec 12, 2023

Currently, InternalCRIUSupport::getLastRestoreTimeImpl returns last restore time based on the time we collect when we resume from checkpoint. This time is inaccurate as, CRIU itself runs for roughly 30ms before it transitions to the JVM.

To provide a more accurate time, we need an API that returns the CRIU restore process start time.

@tajila tajila added criu Used to track CRIU snapshot related work comp:vm labels Dec 12, 2023
@singh264
Copy link
Contributor

@tajila I would like to address this GitHub issue.

@singh264
Copy link
Contributor

@tajila could you clarify if the goal is for InternalCRIUSupport::getLastRestoreTimeImpl to get the start time of the process running the criu restore command?

This process seems to start a new process that runs the java command to complete the code execution from the checkpoint. In the CRIU demo, I observe that the criu restore process continues until the restored java process completes, and the delta between the start time of the criu restore process and the start time of the restored java process is approximately 4 ms. It seems like the delta is influenced by factors like system load, checkpoint size, or other system configurations since it is not close to 30 ms.

@tajila
Copy link
Contributor Author

tajila commented Dec 12, 2023

It seems like the delta is influenced by factors like system load, checkpoint size, or other system configurations since it is not close to 30 ms.

with real world applications its much larger. But, yes with a small app there wont be too much of a difference.

could you clarify if the goal is for InternalCRIUSupport::getLastRestoreTimeImpl to get the start time of the process running the criu restore command?

Thats correct, we want it to include the time taken by criu restore so it is more accurate.

@singh264
Copy link
Contributor

@tajila requesting your feedback on two approaches for obtaining the criu restore process pid to determine its start time:

  1. Get the parent process id in the restored java process as it matches the criu restore process id.
  2. Get the criu restore process pid from a pre-created file by criu before the restored java process initiates.

The first appraoch is simpler but it is dependent on stable process relationships.

The second approach is robust in dynamic process relationship scenarios but less efficient, requiring criu to support creating a file with its pid during restore.

@tajila
Copy link
Contributor Author

tajila commented Dec 14, 2023

What about getpid/getppid ? I guess before you get to far, you should check if the data in the /proc/[pid]/stat file is accurate and does in fact include the total time.

@singh264
Copy link
Contributor

What about getpid/getppid ?

getppid would be called in the jvm to get the pid of the criu restore process in the first approach. getppid would be called in criu restore to save its pid to a file for the jvm in the second approach. The first approach seems to be better as it is simpler and requires no changes to criu.

check if the data in the /proc/[pid]/stat file is accurate and does in fact include the total time

/proc/[pid]/stat includes the total time in ns precision epoch time, however it is the same for both the criu restore process and the restored java process, which suggests that the start time in /proc/[pid]/stat may not be accurate enough to differentiate between the two processes. This can be tested with stat -c %y /proc/[pid]/stat.

/proc/[pid] also includes the total time in ns precision epoch time and it indicates a non-zero delta between the start time of the criu restore process and the restored java process. This can be tested with stat -c %y /proc/[pid].

It seems like obtaining the process start time from /proc/[pid] would be a suitable as it meets the criteria indicated in the issue description, and it also corresponds to the existing vm->checkpointState.lastRestoreTimeMillis:

ps -ef | grep criu; ps -ef | grep java
root     1038126  335988  0 11:03 pts/3    00:00:00 /root/criu_demo/criu/criu/criu restore -D /root/criu_demo/InstantOnStartupGuide/workspace/checkpointData --shell-job -v4 --log-file=restore.log
root     1038077 1038126  1 11:03 pts/3    00:00:00 java -XX:+EnableCRIUSupport --add-exports java.base/openj9.internal.criu=ALL-UNNAMED HelloInstantOn

// criu restore process
root@c24919v1:~/openj9_issues_18598# stat -c %y /proc/1038126; stat -c %y /proc/1038126/stat 
2023-12-18 11:03:14.098547507 -0800
2023-12-18 11:03:27.454622061 -0800

// restored java process
root@c24919v1:~/openj9_issues_18598# stat -c %y /proc/1038077; stat -c %y /proc/1038077/stat 
2023-12-18 11:03:14.150547798 -0800
2023-12-18 11:03:27.454622061 -0800

// Monday, December 18, 2023 11:03:14.222 -0800
vm->checkpointState.lastRestoreTimeMillis: 1702926194222 

@tajila
Copy link
Contributor Author

tajila commented Jan 3, 2024

It seems like obtaining the process start time from /proc/[pid] would be a suitable as it meets the criteria indicated in the issue description, and it also corresponds to the existing vm->checkpointState.lastRestoreTimeMillis:

lastRestoreTimeMillis is inaccurate. What is most accurutate in this scenario is when criu restore process started. This is what we really want, because this represents the true latency the user will incur.

@tajila
Copy link
Contributor Author

tajila commented Jan 3, 2024

Perhaps the we need an api that get process start time of an arbritary process given a pid. This way we can query the parent of the restored process (criu restore ...).

singh264 added a commit to singh264/openj9 that referenced this issue Jan 24, 2024
Get the start time of the criu restore process, which is
the parent process of the restored java process.

Issue: eclipse-openj9#18598
Depends-on: eclipse/omr#7214
Signed-off-by: Amarpreet Singh <Amarpreet.A.Singh@ibm.com>
@singh264
Copy link
Contributor

This issue can be closed and it was addressed by:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:vm criu Used to track CRIU snapshot related work
Projects
Status: Done
Development

No branches or pull requests

2 participants