Don't dump pages which only contain zero bytes #2331

Open
wants to merge 1 commit into
base: criu-dev
8 changes: 8 additions & 0 deletions Documentation/criu.txt
@@ -369,6 +369,14 @@ mount -t cgroup -o devices,freezer none devices,freezer
Deduplicate "old" data in pages images of previous *dump*. This option
implies incremental *dump* mode (see the *pre-dump* command).

*--skip-zero-pages*::
Don't dump pages containing only zero bytes. This is a
potentially expensive operation because it checks for
every single process page if it contains only zeros, but
it can significantly decrease the image size and improve the
startup-time if many such pages exist. It effectively
replaces such pages with the kernel's zero-page on restore.

*-l*, *--file-locks*::
Dump file locks. It is necessary to make sure that all file lock users
are taken into dump, so it is only safe to use this for enclosed containers
1 change: 1 addition & 0 deletions criu/config.c
@@ -650,6 +650,7 @@ int parse_options(int argc, char **argv, bool *usage_error, bool *has_exec_cmd,
{ "ms", no_argument, 0, 1054 },
BOOL_OPT("track-mem", &opts.track_mem),
BOOL_OPT("auto-dedup", &opts.auto_dedup),
BOOL_OPT("skip-zero-pages", &opts.skip_zero_pages),
{ "libdir", required_argument, 0, 'L' },
{ "cpu-cap", optional_argument, 0, 1057 },
BOOL_OPT("force-irmap", &opts.force_irmap),
3 changes: 3 additions & 0 deletions criu/cr-service.c
@@ -541,6 +541,9 @@ static int setup_opts_from_req(int sk, CriuOpts *req)
if (req->has_auto_dedup)
opts.auto_dedup = req->auto_dedup;

if (req->has_skip_zero_pages)
opts.skip_zero_pages = req->skip_zero_pages;

if (req->has_force_irmap)
opts.force_irmap = req->force_irmap;

1 change: 1 addition & 0 deletions criu/crtools.c
@@ -541,6 +541,7 @@ int main(int argc, char *argv[], char *envp[])
" pages images of previous dump\n"
" when used on restore, as soon as page is restored, it\n"
" will be punched from the image\n"
" --skip-zero-pages don't dump pages containing only zero bytes.\n"
" --pre-dump-mode splice - parasite based pre-dumping (default)\n"
" read - process_vm_readv syscall based pre-dumping\n"
"\n"
1 change: 1 addition & 0 deletions criu/include/cr_options.h
@@ -157,6 +157,7 @@ struct cr_options {
int track_mem;
char *img_parent;
int auto_dedup;
int skip_zero_pages;
unsigned int cpu_cap;
int force_irmap;
char **exec_cmd;
2 changes: 2 additions & 0 deletions criu/include/stats.h
@@ -33,6 +33,8 @@ enum {
CNT_SHPAGES_SKIPPED_PARENT,
CNT_SHPAGES_WRITTEN,

CNT_SKIPPED_ZERO_PAGES,

DUMP_CNT_NR_STATS,
};

53 changes: 48 additions & 5 deletions criu/mem.c
@@ -3,8 +3,10 @@
#include <sys/mman.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/prctl.h>
#include <sys/uio.h>

#include "types.h"
#include "cr_options.h"
@@ -31,6 +33,7 @@
#include "prctl.h"
#include "compel/infect-util.h"
#include "pidfd-store.h"
#include "xmalloc.h"

#include "protobuf.h"
#include "images/pagemap.pb-c.h"
@@ -191,11 +194,33 @@ static int generate_iovs(struct pstree_item *item, struct vma_area *vma, struct
bool has_parent)
{
unsigned long nr_scanned;
unsigned long pages[3] = {};
/* Counters for PAGES_SKIPPED_PARENT, PAGES_LAZY, PAGES_WRITTEN and SKIPPED_ZERO_PAGES */
unsigned long pages[4] = {};
unsigned long vaddr;
bool dump_all_pages;
int ret = 0;

static char *zero_page = NULL;
static char *remote_page = NULL;
int zero = 0;
struct iovec local[2];
struct iovec remote[1];
int nread = 0;

/* Allocate the reference zero page and the read buffer once. */
if (opts.skip_zero_pages && zero_page == NULL) {
zero_page = xmalloc(PAGE_SIZE);
remote_page = xmalloc(PAGE_SIZE);
if (zero_page == NULL || remote_page == NULL) {
pr_warn("Can't allocate memory - disabling --skip-zero-pages\n");
opts.skip_zero_pages = 0;
} else {
memzero(zero_page, PAGE_SIZE);
}
}
/*
 * local[] and remote[] are automatic variables, so they must be
 * set up on every call, not only on the first one.
 */
if (opts.skip_zero_pages) {
local[0].iov_base = remote_page;
local[0].iov_len = PAGE_SIZE;
remote[0].iov_base = NULL;
remote[0].iov_len = PAGE_SIZE;
}

dump_all_pages = should_dump_entire_vma(vma->e);

nr_scanned = 0;
@@ -207,9 +232,25 @@ static int generate_iovs(struct pstree_item *item, struct vma_area *vma, struct

/* If dump_all_pages is true, should_dump_page is called to get pme. */
next = should_dump_page(pmc, vma->e, vaddr, &softdirty);
if (!dump_all_pages && next != vaddr) {
vaddr = next - PAGE_SIZE;
continue;
if (!dump_all_pages) {
if (next != vaddr) {
vaddr = next - PAGE_SIZE;
continue;
} else if (opts.skip_zero_pages) {
remote[0].iov_base = (void *)vaddr;
nread = process_vm_readv(item->pid->real, local, 1, remote, 1, 0);
Member

I don't like the idea to read process memory twice. We have to avoid this.

btw: #2292 solves the same problem in a more optimal way.

Member

I agree that it would be better if we can avoid reading process memory twice.

#2292 solves the same problem in a more optimal way.

I think this is different. In #2292, we exclude pages with zero PFN (PAGE_IS_PFNZERO), while this option skips zero-filled memory (e.g., memory that has been filled with zeros using memset() would be skipped with this option).
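
To make the distinction above concrete, here is a minimal sketch (not code from this PR). It maps two private anonymous pages, only reads the first, and clears the second with memset(). Both read back as all zeros, but only the first can still be backed by the kernel's shared zero page; the second has taken a write fault and owns a physical page. The helper names `buf_is_zero` and `zero_page_demo` are made up for this illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper (not part of CRIU): true if all len bytes are zero. */
bool buf_is_zero(const unsigned char *buf, size_t len)
{
	assert(buf != NULL);
	/* buf[0] == 0 and every byte equals its successor => all bytes are 0. */
	return len == 0 || (buf[0] == 0 && memcmp(buf, buf + 1, len - 1) == 0);
}

/*
 * Map two private anonymous pages. The first one is only read, so the
 * kernel may back it with the shared zero page. The second one is written
 * with memset() and therefore gets its own physical page.
 * Reports whether each page reads as zeros.
 * Returns 0 on success, -1 if mmap() fails.
 */
int zero_page_demo(bool *untouched_is_zero, bool *memset_is_zero)
{
	size_t psz = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *p = mmap(NULL, 2 * psz, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return -1;

	memset(p + psz, 0, psz); /* write fault on the second page */
	*untouched_is_zero = buf_is_zero(p, psz);
	*memset_is_zero = buf_is_zero(p + psz, psz);
	munmap(p, 2 * psz);
	return 0;
}
```

A content check like `buf_is_zero()` cannot tell the two pages apart, which is why this option catches pages that a PFN-based check would miss.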

Author

Is this a performance issue or a correctness issue? If it's just about performance, I think the benefit might justify the additional overhead; after all, the feature is off by default.

In the case this is a correctness problem, do you have a suggestion how we can avoid this?

PS: and yes, @rst0git is right - this change is about skipping regular pages which are filled with only zero bytes.

Member

This can be useful for runtimes like Java, which often allocate large memory regions without fully using them (e.g. for the heap). For a simple Helloworld Java program, this new feature shrinks the image size by about 20%, from 13 MB to 11 MB.

we exclude pages with zero PFN (PAGE_IS_PFNZERO), while this option skips zero-filled memory

I believe we need a zdtm test that reproduces such a page: one with zero data but without PAGE_IS_PFNZERO. Maybe we even want to fix the kernel to report such a page as "PAGE_IS_PFNZERO" instead.

I agree with Andrei that manually checking page content with memcmp is an anti-pattern. Nack from me, at least in the current state.

Member

I also agree. Reading a page a second time and using memcmp() does not sound optimal.

I still like the idea of not including empty pages in the checkpoint, but it sounds difficult.

If the kernel could track it, that would be nice. Not sure the kernel has a better alternative than memcmp() to find a zeroed memory page.

At this point I think it would be nice to see some numbers. How much faster is restoring if something like this PR is applied? Although I don't like the memcmp(), it would only be used if the user explicitly selects the corresponding command-line option. Maybe that makes it acceptable. Can the second reading of the page be avoided?

Maybe some post-processing of the checkpoint image would be an alternative. Remove the empty pages after checkpointing and have support during restore to handle pages like this.

Member

@avagin avagin Jan 25, 2024

Is this a performance issue or a correctness issue? If it's just about performance, I think the benefit might justify the additional overhead; after all, the feature is off by default.

I don't understand why we need to read process pages to do this check? Why can't we do that before dumping these pages into the image (page_xfer_dump_pages)?
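
A rough sketch of that idea, for illustration only: once the page data has been read for dumping, the zero check can run on that copy, so the process memory is never read a second time. The sketch assumes the data sits in a plain memory buffer and that pages are 4 KiB. CRIU's real page_xfer_dump_pages() path is organized differently, so this only shows the principle. All names here (`SKETCH_PAGE_SIZE`, `sketch_page_is_zero`, `sketch_drop_zero_pages`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096 /* assumption: 4 KiB pages */

/* True if the page at 'page' contains only zero bytes. */
bool sketch_page_is_zero(const unsigned char *page)
{
	/* First byte is zero and every byte equals its successor. */
	return page[0] == 0 &&
	       memcmp(page, page + 1, SKETCH_PAGE_SIZE - 1) == 0;
}

/*
 * Walk a buffer that already holds nr_pages of page data and compact the
 * non-zero pages to the front, in place. keep[i] records whether page i
 * was kept, so that a restorer would know which pages to leave unpopulated.
 * Returns the number of pages kept.
 */
size_t sketch_drop_zero_pages(unsigned char *buf, size_t nr_pages, bool *keep)
{
	size_t i, out = 0;

	assert(buf != NULL && keep != NULL);
	for (i = 0; i < nr_pages; i++) {
		unsigned char *src = buf + i * SKETCH_PAGE_SIZE;

		keep[i] = !sketch_page_is_zero(src);
		if (!keep[i])
			continue;
		if (out != i)
			memmove(buf + out * SKETCH_PAGE_SIZE, src, SKETCH_PAGE_SIZE);
		out++;
	}
	return out;
}
```

The cost is still one pass over every dumped page, but it avoids the extra process_vm_readv() call per page.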

Member

@Snorch Snorch Jan 26, 2024

I believe memset(0) is used in specific scenarios where applications/libraries are dealing with sensitive data and/or use custom memory management.

It's not just the "malloc+memset(0)" use case. Java does mmap and pretouch memory so we can have a lot of pages with only zero content (but not the kernel zero page) in some scenarios.

First, I thought it did not work. upd: that was stupid of me not to enable --skip-zero-pages; with the option it works fine, sorry.

Second, if Java puts so much effort into having those zeroed pages in RSS, isn't it a bad idea to restore them as if they were "PAGE_IS_PFNZERO" pages? =)

[root@turmoil tmp]# ./malloc-test 
Enter any char to stop

------ In another terminal ------
[root@turmoil snorch]# grep 2097164 -A3 -B1 /proc/$(pidof malloc-test)/smaps
7fb87b744000-7fb8fb747000 rw-p 00000000 00:00 0 
Size:            2097164 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:             2097160 kB
[root@turmoil tmp]# /home/snorch/devel/ms/criu/criu/criu dump --skip-zero-pages -v4 -o dump.log -t $(pidof malloc-test) -j -D /images-dir/
[root@turmoil tmp]# /home/snorch/devel/ms/criu/criu/criu restore -j -D /images-dir/

------ In another terminal ------
[root@turmoil snorch]# grep 2097164 -A3 -B1 /proc/$(pidof malloc-test)/smaps
7fb87b744000-7fb8fb747000 rw-p 00000000 00:00 0 
Size:            2097164 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:             1048584 kB

upd2: If we want to preserve them in RSS, we can remember those special zero-filled pages in images on dump without saving their data and then on restore put them to RSS by writing zeroes.
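
The upd2 idea could be sketched as follows. This is an illustration only, not an existing CRIU image format: the dump would store runs of zero-filled pages as (first page, count) records without data, and restore would write zeros into those ranges so that each page takes a write fault and should count toward RSS again. `struct sketch_zero_run` and `sketch_restore_zero_runs` are hypothetical names.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical image record: a run of zero-filled pages, no data stored. */
struct sketch_zero_run {
	size_t first_page; /* index of the first page within the mapping */
	size_t nr_pages;   /* number of consecutive zero-filled pages */
};

/*
 * Re-create zero-filled pages in a freshly mapped region by writing zeros.
 * The write makes the kernel allocate a private page for each of them
 * instead of leaving the range unpopulated.
 * Returns the number of pages written.
 */
size_t sketch_restore_zero_runs(unsigned char *base, size_t map_pages,
				const struct sketch_zero_run *runs,
				size_t nr_runs)
{
	size_t psz = (size_t)sysconf(_SC_PAGESIZE);
	size_t i, populated = 0;

	for (i = 0; i < nr_runs; i++) {
		/* Each run must lie inside the mapping. */
		assert(runs[i].first_page + runs[i].nr_pages <= map_pages);
		memset(base + runs[i].first_page * psz, 0,
		       runs[i].nr_pages * psz);
		populated += runs[i].nr_pages;
	}
	return populated;
}
```

This keeps the image small while preserving the RSS of the original process, at the price of touching every such page during restore.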

Author

Second, if Java puts so much effort into having those zeroed pages in RSS, isn't it a bad idea to restore them as if they were "PAGE_IS_PFNZERO" pages? =)

@Snorch, Java (or OpenJDK-based JVMs, to be more specific) was not designed and optimized for cloud/container use cases but rather for large, monolithic application servers. In the old days, pretouching/zeroing memory was a way to pre-allocate physical memory and not potentially get it from swap later. For current cloud/container use cases the huge memory footprint can be a problem. With checkpointing, a small image is more important, and COWing a PAGE_IS_PFNZERO page is much faster than loading and populating it from disk.

Author

I don't understand why we need to read process pages to do this check? Why can't we do that before dumping these pages into the image (page_xfer_dump_pages)?

Thanks a lot for your suggestion @avagin. I'm short of time for the next week because of FOSDEM/Jfokus but I'll try to come up with a new version which moves the zero check to page_xfer_dump_pages() afterwards.

Member

@simonis FOSDEM was my favorite conference when I lived close by. btw @mihalicyn is there too; he is one of the CRIU maintainers. He will be happy to help with any questions.

if (nread == PAGE_SIZE) {
zero = memcmp(zero_page, remote_page, PAGE_SIZE);
/*
* If the page contains just zeros we can treat it like the zero page and skip it.
* At restore it will be replaced by a reference to the zero page and COWed if accessed.
*/
if (zero == 0) {
pages[3]++;
continue;
}
}
}
}

if (vma_entry_can_be_lazy(vma->e) && !is_stack(item, vaddr))
@@ -247,8 +288,10 @@ static int generate_iovs(struct pstree_item *item, struct vma_area *vma, struct
cnt_add(CNT_PAGES_SKIPPED_PARENT, pages[0]);
cnt_add(CNT_PAGES_LAZY, pages[1]);
cnt_add(CNT_PAGES_WRITTEN, pages[2]);
cnt_add(CNT_SKIPPED_ZERO_PAGES, pages[3]);

pr_info("Pagemap generated: %lu pages (%lu lazy) %lu holes\n", pages[2] + pages[1], pages[1], pages[0]);
pr_info("Pagemap generated: %lu pages (%lu lazy) %lu holes %lu skipped zero\n",
pages[2] + pages[1], pages[1], pages[0], pages[3]);
return ret;
}

7 changes: 7 additions & 0 deletions criu/stats.c
@@ -134,6 +134,9 @@ static void display_stats(int what, StatsEntry *stats)
stats->dump->pages_skipped_parent, stats->dump->pages_skipped_parent);
pr_msg("Memory pages written: %" PRIu64 " (0x%" PRIx64 ")\n", stats->dump->pages_written,
stats->dump->pages_written);
if (stats->dump->has_skipped_zero_pages)
pr_msg("Memory pages skipped because zero: %" PRIu64 " (0x%" PRIx64 ")\n",
stats->dump->skipped_zero_pages, stats->dump->skipped_zero_pages);
pr_msg("Lazy memory pages: %" PRIu64 " (0x%" PRIx64 ")\n", stats->dump->pages_lazy,
stats->dump->pages_lazy);
} else if (what == RESTORE_STATS) {
@@ -178,6 +181,10 @@ void write_stats(int what)
ds_entry.has_page_pipes = true;
ds_entry.page_pipe_bufs = dstats->counts[CNT_PAGE_PIPE_BUFS];
ds_entry.has_page_pipe_bufs = true;
if (opts.skip_zero_pages) {
ds_entry.has_skipped_zero_pages = true;
ds_entry.skipped_zero_pages = dstats->counts[CNT_SKIPPED_ZERO_PAGES];
}

ds_entry.shpages_scanned = dstats->counts[CNT_SHPAGES_SCANNED];
ds_entry.has_shpages_scanned = true;
1 change: 1 addition & 0 deletions images/rpc.proto
@@ -145,6 +145,7 @@ message criu_opts {
optional bool leave_stopped = 69;
optional bool display_stats = 70;
optional bool log_to_stderr = 71;
optional bool skip_zero_pages = 72;
/* optional bool check_mounts = 128; */
}

2 changes: 2 additions & 0 deletions images/stats.proto
@@ -22,6 +22,8 @@ message dump_stats_entry {
optional uint64 shpages_scanned = 12;
optional uint64 shpages_skipped_parent = 13;
optional uint64 shpages_written = 14;

optional uint64 skipped_zero_pages = 15;
}

message restore_stats_entry {
11 changes: 11 additions & 0 deletions lib/c/criu.c
@@ -387,6 +387,17 @@ void criu_set_auto_dedup(bool auto_dedup)
criu_local_set_auto_dedup(global_opts, auto_dedup);
}

void criu_local_set_skip_zero_pages(criu_opts *opts, bool skip_zero_pages)
{
opts->rpc->has_skip_zero_pages = true;
opts->rpc->skip_zero_pages = skip_zero_pages;
}

void criu_set_skip_zero_pages(bool skip_zero_pages)
{
criu_local_set_skip_zero_pages(global_opts, skip_zero_pages);
}

void criu_local_set_force_irmap(criu_opts *opts, bool force_irmap)
{
opts->rpc->has_force_irmap = true;
1 change: 1 addition & 0 deletions test/javaTests/pom.xml
@@ -18,6 +18,7 @@
<!-- Suite testng xml file to consider for test execution -->
<suiteXmlFiles>
<suiteXmlFile>test.xml</suiteXmlFile>
<suiteXmlFile>test-zero.xml</suiteXmlFile>
</suiteXmlFiles>
</configuration>
</plugin>
@@ -112,7 +112,7 @@ public void runtest(String testName, String checkpointOpt, String restoreOpt) th
String pid;
int exitCode;

System.out.println("======= Testing " + testName + " ========");
System.out.println("======= Testing " + testName + " " + checkpointOpt + " ========");

testSetup(testName);

89 changes: 89 additions & 0 deletions test/javaTests/test-zero.xml
@@ -0,0 +1,89 @@
<?xml version = "1.0" encoding = "UTF-8"?>
<!DOCTYPE suite SYSTEM "http://testng.org/testng-1.0.dtd" >

<suite name = "Suite2">
<parameter name="checkpointOpt" value="--skip-zero-pages"/>
<parameter name="restoreOpt" value=""/>

<test name = "test1-FileRead">
<parameter name="testname" value="FileRead"/>
<classes>
<class name = "org.criu.java.tests.CheckpointRestore"/>
</classes>
</test>

<test name = "test2-ReadWrite">
<parameter name="testname" value="ReadWrite"/>
<classes>
<class name = "org.criu.java.tests.CheckpointRestore"/>
</classes>
</test>

<test name = "test3-MemoryMappings">
<parameter name="testname" value="MemoryMappings"/>
<classes>
<class name = "org.criu.java.tests.CheckpointRestore"/>
</classes>
</test>

<test name = "test4-MultipleFileRead">
<parameter name="testname" value="MultipleFileRead"/>
<classes>
<class name = "org.criu.java.tests.CheckpointRestore"/>
</classes>
</test>

<test name = "test5-MultipleFileWrite">
<parameter name="testname" value="MultipleFileWrite"/>
<classes>
<class name = "org.criu.java.tests.CheckpointRestore"/>
</classes>
</test>

<test name = "test6-Sockets">
<parameter name="testname" value="Sockets"/>
<parameter name="checkpointOpt" value="--tcp-established --skip-zero-pages"/>
<parameter name="restoreOpt" value="--tcp-established"/>
<classes>
<class name = "org.criu.java.tests.CheckpointRestore"/>
</classes>
</test>

<test name = "test7-SocketsListen">
<parameter name="testname" value="SocketsListen"/>
<parameter name="checkpointOpt" value="--tcp-established --skip-zero-pages"/>
<parameter name="restoreOpt" value="--tcp-established"/>
<classes>
<class name = "org.criu.java.tests.CheckpointRestore"/>
</classes>
</test>

<test name = "test8-SocketsConnect">
<parameter name="testname" value="SocketsConnect"/>
<parameter name="checkpointOpt" value="--tcp-established --skip-zero-pages"/>
<parameter name="restoreOpt" value="--tcp-established"/>
<classes>
<class name = "org.criu.java.tests.CheckpointRestore"/>
</classes>
</test>

<test name = "test9-SocketsMultiple">
<parameter name="testname" value="SocketsMultiple"/>
<parameter name="checkpointOpt" value="--tcp-established --skip-zero-pages"/>
<parameter name="restoreOpt" value="--tcp-established"/>
<classes>
<class name = "org.criu.java.tests.CheckpointRestore"/>
</classes>
</test>

<test name = "test10-SocketsData">
<parameter name="testname" value="SocketsData"/>
<parameter name="checkpointOpt" value="--tcp-established --skip-zero-pages"/>
<parameter name="restoreOpt" value="--tcp-established"/>
<classes>
<class name = "org.criu.java.tests.CheckpointRestore"/>
</classes>

</test>

</suite>
9 changes: 8 additions & 1 deletion test/zdtm.py
@@ -1052,6 +1052,7 @@ def __init__(self, opts):
self.__sat = bool(opts['sat'])
self.__dedup = bool(opts['dedup'])
self.__mdedup = bool(opts['noauto_dedup'])
self.__skip_zero_pages = bool(opts['skip_zero_pages'])
self.__user = bool(opts['user'])
self.__rootless = bool(opts['rootless'])
self.__leave_stopped = bool(opts['stop'])
@@ -1381,6 +1382,9 @@ def dump(self, action, opts=[]):
if self.__dedup:
a_opts += ["--auto-dedup"]

if self.__skip_zero_pages:
a_opts += ["--skip-zero-pages"]

a_opts += ["--timeout", "10"]

criu_dir = os.path.dirname(os.getcwd())
@@ -2083,7 +2087,7 @@ def run_test(self, name, desc, flavor):
'dedup', 'sbs', 'freezecg', 'user', 'dry_run', 'noauto_dedup',
'remote_lazy_pages', 'show_stats', 'lazy_migrate', 'stream',
'tls', 'criu_bin', 'crit_bin', 'pre_dump_mode', 'mntns_compat_mode',
'rootless')
'rootless', 'skip_zero_pages')
arg = repr((name, desc, flavor, {d: self.__opts[d] for d in nd}))

if self.__use_log:
@@ -2697,6 +2701,9 @@ def get_cli_args():
rp.add_argument("--noauto-dedup",
help="Manual deduplicate images on iterations",
action='store_true')
rp.add_argument("--skip-zero-pages",
Member

@rst0git rst0git Jan 31, 2024

@simonis Would you be able to also enable testing with existing ZDTM tests using --skip-zero-pages in run-ci-tests.sh?

Member

@simonis It was great meeting you at FOSDEM and your talk was very good!

We can use something like the following patch to run the existing ZDTM tests with --skip-zero-pages: rst0git@45a8ca1

Member

@avagin Thank you for the review! Would it be sufficient to run the following tests?

diff --git a/scripts/ci/run-ci-tests.sh b/scripts/ci/run-ci-tests.sh
index ef7e869e0..8f5e25d03 100755
--- a/scripts/ci/run-ci-tests.sh
+++ b/scripts/ci/run-ci-tests.sh
@@ -268,6 +268,9 @@ make -C test/others/rpc/ run
 ./test/zdtm.py run -t zdtm/transition/maps007 --pre 2 --page-server --dedup
 ./test/zdtm.py run -t zdtm/transition/maps007 --pre 2 --pre-dump-mode read
 
+# Run tests with --skip-zero-pages
+./test/zdtm.py run --skip-zero-pages -T '.*maps0.*'
+
 ./test/zdtm.py run -t zdtm/transition/pid_reuse --pre 2 # start time based pid reuse detection
 ./test/zdtm.py run -t zdtm/transition/pidfd_store_sk --rpc --pre 2 # pidfd based pid reuse detection

help="Don't dump pages containing only zero bytes",
action='store_true')
rp.add_argument("--nocr",
help="Do not CR anything, just check test works",
action='store_true')
1 change: 1 addition & 0 deletions test/zdtm/static/Makefile
@@ -271,6 +271,7 @@ TST_NOFILE := \
sigtrap01 \
change_mnt_context \
fd_offset \
zero_pages \
# jobctl00 \

PKG_CONFIG ?= pkg-config