Bug in restoring a checkpoint using SimpleProcessor and SwitchableProcessor in ARM & KVM setup #932

Closed
mbabaie opened this issue Mar 12, 2024 · 6 comments · Fixed by #986

@mbabaie
Contributor

mbabaie commented Mar 12, 2024

Describe the bug

I am trying to take a checkpoint and restore it in an ARM full-system setup, using the KVM CPU to fast-forward the kernel boot. After the boot, the script switches to the Timing CPU and then takes a checkpoint. Taking the checkpoint works. However, restoring the checkpoint fails.

Affects version
gem5/develop branch
Commit revision ID I am working on: 6f90fec

gem5 Modifications
I am using this script: configs/example/gem5_library/arm-ubuntu-run-with-kvm.py
I updated the script with these changes:

  1. Added a few lines of code so that it can take a checkpoint and restore from it.

  2. Either SimpleProcessor() or SwitchableProcessor() can be used, both for taking a checkpoint and for restoring it. When restoring with SwitchableProcessor(), it starts with the TIMING CPU and never switches.

  3. Extended the command list a bit so that taking a checkpoint and restoring it is meaningful.

You can find the script and all of its changes here.
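
The modified script itself is not reproduced in this thread. As a rough sketch only, the four options used in the commands below (--take-chkpt, --chkpt-cpu-switchable, --chkpt-dir, --rstr-cpu-switchable) could be parsed as follows. The helper str2bool, the defaults, and the help strings are assumptions for illustration, not the author's code.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse 'True'/'False' strings (hypothetical helper).

    argparse's type=bool would treat the string 'False' as True,
    so the text is compared explicitly instead.
    """
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Option names are taken from the commands in this issue;
    # defaults and help text are assumptions.
    parser = argparse.ArgumentParser(description="Take or restore a checkpoint")
    parser.add_argument("--take-chkpt", type=str2bool, default=False,
                        help="take a checkpoint instead of restoring one")
    parser.add_argument("--chkpt-cpu-switchable", type=str2bool, default=False,
                        help="use a switchable processor when taking the checkpoint")
    parser.add_argument("--rstr-cpu-switchable", type=str2bool, default=False,
                        help="use a switchable processor when restoring")
    parser.add_argument("--chkpt-dir", type=str, default=None,
                        help="checkpoint directory to restore from")
    return parser


# Example: the argument list of the first restore command below.
args = build_parser().parse_args([
    "--take-chkpt=False",
    "--chkpt-dir=m5out_chkpt/checkpoint",
    "--rstr-cpu-switchable=False",
])
print(args.take_chkpt, args.chkpt_dir, args.rstr_cpu_switchable)
```

Parsing the booleans explicitly matters here because every command passes them as the literal strings True or False.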

To Reproduce

  1. Clone the repo and branch I shared above.

  2. Compile gem5 with this command:

scons build/ARM/gem5.opt -j40 --without-tcmalloc
  3. Generate a checkpoint using the command below. The checkpoint will be written to m5out_chkpt/checkpoint. Note: this assumes a SwitchableProcessor() is used to take the checkpoint.
build/ARM/gem5.opt --outdir=m5out_chkpt configs/example/gem5_library/arm-ubuntu-run-with-kvm.py --take-chkpt=True --chkpt-cpu-switchable=True

Terminal Output

  1. To restore the checkpoint using a Simple(Timing) CPU, use the command below:
build/ARM/gem5.opt --outdir=m5out_rstr_simple configs/example/gem5_library/arm-ubuntu-run-with-kvm.py --take-chkpt=False --chkpt-dir=m5out_chkpt/checkpoint --rstr-cpu-switchable=False

Here's the output from running the command above:

src/dev/arm/energy_ctrl.cc:252: warn: Existing EnergyCtrl, but no enabled DVFSHandler found.
gem5.opt: src/dev/arm/gic_v2.hh:365: uint8_t gem5::GicV2::getCpuTarget(gem5::ContextID, uint32_t) const: Assertion `ctx < sys->threads.numRunning()' failed.
Program aborted at tick 1203416333724
--- BEGIN LIBC BACKTRACE ---
build/ARM/gem5.opt(_ZN4gem515print_backtraceEv+0x3c)[0xaaaacc1e208c]
build/ARM/gem5.opt(_ZN4gem512abortHandlerEi+0x5c)[0xaaaacc204f60]
linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffa493b7dc]
/lib/aarch64-linux-gnu/libc.so.6(+0x7f200)[0xffffa3c0f200]
/lib/aarch64-linux-gnu/libc.so.6(raise+0x1c)[0xffffa3bca67c]
/lib/aarch64-linux-gnu/libc.so.6(abort+0xe4)[0xffffa3bb7130]
/lib/aarch64-linux-gnu/libc.so.6(+0x33fd0)[0xffffa3bc3fd0]
/lib/aarch64-linux-gnu/libc.so.6(__assert_perror_fail+0x0)[0xffffa3bc4040]
build/ARM/gem5.opt(_ZNK4gem55GicV212getCpuTargetEij+0x164)[0xaaaaccad5374]
build/ARM/gem5.opt(_ZN4gem55GicV27sendIntEj+0x3c)[0xaaaaccace6dc]
build/ARM/gem5.opt(_ZN4gem512MuxingKvmGicINS_10GicV2TypesEE7sendIntEj+0xd0)[0xaaaacc123f90]
build/ARM/gem5.opt(_ZN4gem512DrainManager6resumeEv+0xdc)[0xaaaacc1f210c]
build/ARM/gem5.opt(+0xfc10ec)[0xaaaacac110ec]
build/ARM/gem5.opt(+0xe5f2c4)[0xaaaacaaaf2c4]
/lib/aarch64-linux-gnu/libpython3.10.so.1.0(+0x120334)[0xffffa4440334]
/lib/aarch64-linux-gnu/libpython3.10.so.1.0(_PyObject_MakeTpCall+0x8c)[0xffffa43f9470]
/lib/aarch64-linux-gnu/libpython3.10.so.1.0(+0xdc528)[0xffffa43fc528]
/lib/aarch64-linux-gnu/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x91e0)[0xffffa4399a00]
/lib/aarch64-linux-gnu/libpython3.10.so.1.0(+0x1b4104)[0xffffa44d4104]
/lib/aarch64-linux-gnu/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x91e0)[0xffffa4399a00]
/lib/aarch64-linux-gnu/libpython3.10.so.1.0(+0x1b4104)[0xffffa44d4104]
/lib/aarch64-linux-gnu/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x6480)[0xffffa4396ca0]
/lib/aarch64-linux-gnu/libpython3.10.so.1.0(+0x1b4104)[0xffffa44d4104]
/lib/aarch64-linux-gnu/libpython3.10.so.1.0(PyEval_EvalCode+0xa4)[0xffffa44cef28]
/lib/aarch64-linux-gnu/libpython3.10.so.1.0(+0x1af4d4)[0xffffa44cf4d4]
/lib/aarch64-linux-gnu/libpython3.10.so.1.0(+0x120bb8)[0xffffa4440bb8]
/lib/aarch64-linux-gnu/libpython3.10.so.1.0(_PyEval_EvalFrameDefault+0x5b60)[0xffffa4396380]
/lib/aarch64-linux-gnu/libpython3.10.so.1.0(+0x1b4104)[0xffffa44d4104]
build/ARM/gem5.opt(+0xfba748)[0xaaaacac0a748]
build/ARM/gem5.opt(main+0x198)[0xaaaacaa4ef68]
/lib/aarch64-linux-gnu/libc.so.6(+0x273fc)[0xffffa3bb73fc]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xffffa3bb74cc]
--- END LIBC BACKTRACE ---
For more info on how to address this issue, please visit https://www.gem5.org/documentation/general_docs/common-errors/ 

Aborted (core dumped)
  2. To restore the checkpoint using a SimpleSwitchable(Timing) CPU, use the command below:
build/ARM/gem5.opt --outdir=m5out_rstr_switchable configs/example/gem5_library/arm-ubuntu-run-with-kvm.py --take-chkpt=False --chkpt-dir=m5out_chkpt/checkpoint --rstr-cpu-switchable=True

Here's the output from running the command above:

info: Using default config
Reading the checkpoint
warn: Setting the checkpoint path via the Simulator constructor is deprecated and will be removed in future releases of gem5. Please set this through via the appropriate workload function (i.e., `set_se_binary_workload` or `set_kernel_disk_workload`). If both are set the workload function set takes precedence.
Done with restoring the checkpoint. Now running the simulation.
Global frequency set at 1000000000000 ticks per second
warn: No dot file generated. Please install pydot to generate the dot file and pdf.
src/mem/dram_interface.cc:690: warn: DRAM device capacity (16384 Mbytes) does not match the address range assigned (1024 Mbytes)
src/mem/dram_interface.cc:690: warn: DRAM device capacity (16384 Mbytes) does not match the address range assigned (1024 Mbytes)
src/sim/kernel_workload.cc:46: info: kernel located at: /home/babaie/.cache/gem5/arm64-linux-kernel-5.4.49
src/base/loader/symtab.cc:95: warn: Cannot insert a new symbol table due to name collisions. Adding prefix to each symbol's name can resolve this issue.
src/base/statistics.hh:279: warn: One of the stats is a legacy stat. Legacy stat is a stat that does not belong to any statistics::Group. Legacy stat is deprecated.
board.vncserver: Listening for connections on port 5900
board.terminal: Listening for connections on port 3456
board.realview.uart1.device: Listening for connections on port 3457
board.realview.uart2.device: Listening for connections on port 3458
board.realview.uart3.device: Listening for connections on port 3459
board.remote_gdb: Listening for connections on port 7000
src/sim/serialize.hh:379: fatal: fatal condition !paramInImpl(cp, name, param) occurred: Can't unserialize 'board.processor.start0.core:_pid'
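
The fatal error above names a checkpoint section, board.processor.start0.core, that the restore expects to find. One way to see which sections were actually written is to list the section headers in the checkpoint file. This is a minimal sketch, assuming the checkpoint directory contains gem5's usual INI-style m5.cpt file; the function names are hypothetical.

```python
import re
from pathlib import Path
from typing import List

# Matches an INI-style section header such as "[board.processor.cores0.core]".
SECTION_RE = re.compile(r"^\[(?P<name>[^\]]+)\]\s*$")


def checkpoint_sections(cpt_text: str, prefix: str = "") -> List[str]:
    """Return section names, in file order, that start with `prefix`."""
    names = []
    for line in cpt_text.splitlines():
        match = SECTION_RE.match(line)
        if match and match.group("name").startswith(prefix):
            names.append(match.group("name"))
    return names


def sections_in_file(path: str, prefix: str = "") -> List[str]:
    """Same as checkpoint_sections, but reads the text from a file."""
    text = Path(path).read_text(errors="replace")
    return checkpoint_sections(text, prefix)
```

For example, sections_in_file("m5out_chkpt/checkpoint/m5.cpt", "board.processor") lists the processor sections that were saved. If board.processor.start0.core is not among them, the object hierarchy built at restore time differs from the one that was checkpointed.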

Host ISA
ARM

@mbabaie mbabaie added the bug label Mar 12, 2024
@powerjg
Contributor

powerjg commented Mar 12, 2024

It doesn't surprise me too much that the SimpleSwitchableProcessor doesn't work well. It was never meant to be used for anything except the absolute simplest use cases.

What happens if you use just the KVM processor to take the checkpoint and then restore with atomic or the simple timing?

@mbabaie
Contributor Author

mbabaie commented Mar 13, 2024

Hi Jason,
Thank you very much for your reply.

I ran the same experiments with SimpleProcessor(), using the same repo and run script.
Here are the results:

  1. To generate the checkpoint, use this command:
    Note: a SimpleProcessor(KVM) is used to take the checkpoint.
build/ARM/gem5.opt --outdir=m5out_chkpt_kvm_simpleProcessor configs/example/gem5_library/arm-ubuntu-run-with-kvm.py --take-chkpt=True --chkpt-cpu-switchable=False
  2. To restore the checkpoint with a SimpleProcessor(Timing), use the command below:
build/ARM/gem5.opt --outdir=m5out_rstr_simpleTiming --debug-flags=ExecAll configs/example/gem5_library/arm-ubuntu-run-with-kvm.py --take-chkpt=False --chkpt-dir=m5out_chkpt_kvm_simpleProcessor/checkpoint --rstr-cpu-switchable=False

Here's the terminal output:

info: Using default config
Reading the checkpoint.
warn: Setting the checkpoint path via the Simulator constructor is deprecated and will be removed in future releases of gem5. Please set this through via the appropriate workload function (i.e., `set_se_binary_workload` or `set_kernel_disk_workload`). If both are set the workload function set takes precedence.
Global frequency set at 1000000000000 ticks per second
warn: No dot file generated. Please install pydot to generate the dot file and pdf.
src/mem/dram_interface.cc:690: warn: DRAM device capacity (16384 Mbytes) does not match the address range assigned (1024 Mbytes)
src/mem/dram_interface.cc:690: warn: DRAM device capacity (16384 Mbytes) does not match the address range assigned (1024 Mbytes)
src/sim/kernel_workload.cc:46: info: kernel located at: /home/babaie/.cache/gem5/arm64-linux-kernel-5.4.49
src/base/loader/symtab.cc:95: warn: Cannot insert a new symbol table due to name collisions. Adding prefix to each symbol's name can resolve this issue.
src/base/statistics.hh:279: warn: One of the stats is a legacy stat. Legacy stat is a stat that does not belong to any statistics::Group. Legacy stat is deprecated.
board.vncserver: Listening for connections on port 5913
board.terminal: Listening for connections on port 3508
board.realview.uart1.device: Listening for connections on port 3509
board.realview.uart2.device: Listening for connections on port 3510
board.realview.uart3.device: Listening for connections on port 3511
board.remote_gdb: Listening for connections on port 7013
src/arch/arm/isa.cc:1559: warn: Checkpoint value for register id_pfr1 does not match current configuration (checkpointed: 0x1, current: 0x1011)
src/arch/arm/isa.cc:1559: warn: Checkpoint value for register id_aa64pfr0_el1 does not match current configuration (checkpointed: 0x1100000022, current: 0x1100002222)
src/arch/arm/isa.cc:1559: warn: Checkpoint value for register id_pfr1 does not match current configuration (checkpointed: 0x1, current: 0x1011)
src/arch/arm/isa.cc:1559: warn: Checkpoint value for register id_aa64pfr0_el1 does not match current configuration (checkpointed: 0x1100000022, current: 0x1100002222)
src/dev/arm/energy_ctrl.cc:252: warn: Existing EnergyCtrl, but no enabled DVFSHandler found.
src/sim/simulate.cc:199: info: Entering event queue @ 1317305722654.  Starting simulation...

Although --debug-flags=ExecAll should print every instruction committed by the cores, nothing is printed. The simulation also never ends. With --debug-flags=Event, the trace shows the same event being rescheduled and executed over and over at the same tick:

...
1317305924784: board.processor.cores1.core.wrapped_function_event: EventFunctionWrapped 288 rescheduled @ 1317305924784
1317305924784: board.processor.cores1.core.wrapped_function_event: EventFunctionWrapped 288 executed @ 1317305924784
1317305924784: board.processor.cores1.core.wrapped_function_event: EventFunctionWrapped 288 rescheduled @ 1317305924784
1317305924784: board.processor.cores1.core.wrapped_function_event: EventFunctionWrapped 288 executed @ 1317305924784
1317305924784: board.processor.cores1.core.wrapped_function_event: EventFunctionWrapped 288 rescheduled @ 1317305924784
1317305924784: board.processor.cores1.core.wrapped_function_event: EventFunctionWrapped 288 executed @ 1317305924784
1317305924784: board.processor.cores1.core.wrapped_function_event: EventFunctionWrapped 288 rescheduled @ 1317305924784
1317305924784: board.processor.cores1.core.wrapped_function_event: EventFunctionWrapped 288 executed @ 1317305924784
1317305924784: board.processor.cores1.core.wrapped_function_event: EventFunctionWrapped 288 rescheduled @ 1317305924784
1317305924784: board.processor.cores1.core.wrapped_function_event: EventFunctionWrapped 288 executed @ 1317305924784
1317305924784: board.processor.cores1.core.wrapped_function_event: EventFunctionWrapped 288 rescheduled @ 1317305924784
1317305924784: board.processor.cores1.core.wrapped_function_event: EventFunctionWrapped 288 executed @ 1317305924784
1317305924784: board.processor.cores1.core.wrapped_function_event: EventFunctionWrapped 288 rescheduled @ 1317305924784
...
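
To check a longer trace for this pattern, the Event output can be scanned for events that are rescheduled for the very tick at which the line was logged. This is a rough sketch, assuming the line format shown above; the function name is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Line format as seen in the Event trace above:
# <tick>: <object name>: <description> rescheduled|executed @ <tick>
EVENT_RE = re.compile(
    r"^(?P<tick>\d+): (?P<name>\S+): .*? "
    r"(?P<action>rescheduled|executed) @ (?P<when>\d+)$"
)


def count_same_tick_reschedules(lines: Iterable[str]) -> Counter:
    """Count reschedules whose target tick equals the tick they were logged at.

    Keys are (event name, tick). A large count for one key means simulated
    time is not advancing for that event.
    """
    counts: Counter = Counter()
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if not match or match.group("action") != "rescheduled":
            continue
        if match.group("tick") == match.group("when"):
            counts[(match.group("name"), int(match.group("tick")))] += 1
    return counts
```

Applied to a saved trace, counts.most_common(5) gives the events most often rescheduled in place. In the excerpt above, every reschedule belongs to board.processor.cores1.core.wrapped_function_event at tick 1317305924784.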

@powerjg
Contributor

powerjg commented Mar 14, 2024

@giactra do you have any hints as to how to debug taking a checkpoint with Arm KVM and restoring?

@kaustav-goswami
Contributor

kaustav-goswami commented Mar 15, 2024

Hi, for the SimpleProcessor case: with KVM CPUs, the ArmBoard uses ArmDefaultRelease.for_kvm() as its release, which removes the SECURITY and VIRTUALIZATION extensions from ArmDefaultRelease(). When the system is restored with anything other than KVM CPUs, the ArmBoard uses release=ArmDefaultRelease() instead. That mismatch is why the warnings for id_pfr1 and id_aa64pfr0_el1 appear during restore. The simulation starts, but it gets stuck in gem5::ArmISA::ISA while reading the wrong register value. The ArmBoard has an init parameter, release=, that can be used to work around this problem. I'll document this case, add checks within the board, and open a new PR.
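
The register values in the restore warnings are consistent with this explanation. The sketch below decodes the low 4-bit fields of both ID registers and reports which ones differ between the checkpointed and current values. The field positions come from the Arm architecture's definitions of ID_AA64PFR0_EL1 and ID_PFR1, not from this thread, and the function name is hypothetical.

```python
from typing import Dict, Tuple

# Field name -> least-significant bit of its 4-bit field (low 16 bits only).
ID_AA64PFR0_EL1_FIELDS = {"EL0": 0, "EL1": 4, "EL2": 8, "EL3": 12}
ID_PFR1_FIELDS = {"ProgMod": 0, "Security": 4, "MProgMod": 8, "Virtualization": 12}


def changed_fields(
    checkpointed: int, current: int, fields: Dict[str, int]
) -> Dict[str, Tuple[int, int]]:
    """Return {field: (checkpointed value, current value)} for fields that differ."""
    diffs = {}
    for name, lsb in fields.items():
        old = (checkpointed >> lsb) & 0xF
        new = (current >> lsb) & 0xF
        if old != new:
            diffs[name] = (old, new)
    return diffs


# Values copied from the warnings in the restore log above.
print(changed_fields(0x1100000022, 0x1100002222, ID_AA64PFR0_EL1_FIELDS))
# {'EL2': (0, 2), 'EL3': (0, 2)}
print(changed_fields(0x1, 0x1011, ID_PFR1_FIELDS))
# {'Security': (0, 1), 'Virtualization': (0, 1)}
```

A field value of 0 means the feature is not implemented, so the checkpoint describes a system without EL2 and EL3, while the restoring configuration implements both. Those are exactly the virtualization and security features that differ between the two releases.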

@giactra
Contributor

giactra commented Mar 15, 2024

I don't know if changing the release will fix the issue, but IMHO we shouldn't just document things better; we should provide a less obscure platform.

I think we should remove complexity from the ArmBoard:

https://github.com/gem5/gem5/blob/develop/src/python/gem5/components/boards/arm_board.py#L119

That code silently changes the release to make it work with KVM, depending on the CPUs in use.
We already have a KVM-specific config, which is arm-ubuntu-run-with-kvm.py. It means that if you want to use KVM, you should use that platform. We should move the ArmRelease.for_kvm call from the ArmBoard to that config.

You want to boot linux without KVM?

Use arm-ubuntu-run.py

You want to use KVM?

Use arm-ubuntu-run-with-kvm.py

@kaustav-goswami
Contributor

Okay, I'll open a new PR fixing the release.

@ivanaamit ivanaamit linked a pull request Apr 2, 2024 that will close this issue