
Faster startup -- Share code objects from memory-mapped file #86

Closed
yuleil opened this issue Sep 19, 2021 · 9 comments

Comments

@yuleil

yuleil commented Sep 19, 2021

This is a CPython startup improvement approach proposed by the Alibaba Compiler Team.

We are working on ways to speed up Python application startup. The main idea is to share code objects from a memory-mapped file, which yields startup benefits similar to Experiment E with a simpler implementation.

Our design is inspired by the Application Class-Data Sharing (AppCDS) feature, introduced in OpenJDK. AppCDS allows a set of application classes to be pre-processed into a shared archive file, which can then be memory-mapped at runtime to reduce startup time and memory footprint.

Based on this principle, we propose a Code-Data Sharing (CDS) approach, which deep-copies a set of code objects into a memory-mapped heap image file. At runtime we:

  • map the heap image at its predetermined address with MAP_FIXED, so that pointers within the image remain valid
  • patch ob_type for every object in the image: ASLR randomizes the address of the interpreter's data section, so a dumped ob_type may otherwise point to the wrong address (a rough C sketch follows this list)
  • rehash the frozensets, whose internal hash tables depend on the per-process hash seed
  • fetch code objects directly from the heap image when importing packages
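
For illustration, a minimal C sketch of the map-and-patch step might look like the following. The archive header layout (cds_header_t) and the helper lookup_current_type_addr() are hypothetical names invented for this sketch, not part of the actual patch:

/* Hedged sketch, not the real implementation: map the heap image back at
   its dump-time address, then walk every object and re-point ob_type at
   the type object's address in the current process. */
#include <Python.h>
#include <sys/mman.h>

typedef struct {
    void   *requested_addr;   /* address the heap image was dumped at      */
    size_t  image_size;       /* total size of the heap image              */
    size_t  nobjects;         /* number of objects in the image            */
    size_t  obj_offsets[];    /* offset of each object from the image base */
} cds_header_t;

/* Hypothetical helper: translate a dump-time type pointer into the address
   of the same type object in this process (needed because ASLR moves the
   interpreter's data section between runs). */
extern PyTypeObject *lookup_current_type_addr(PyTypeObject *dumped);

static void *map_and_patch(int fd, const cds_header_t *hdr)
{
    /* MAP_FIXED forces the image back to its dump-time address, so all
       intra-image pointers stay valid without relocation. */
    void *base = mmap(hdr->requested_addr, hdr->image_size,
                      PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_FIXED,
                      fd, 0);
    if (base == MAP_FAILED)
        return NULL;

    for (size_t i = 0; i < hdr->nobjects; i++) {
        PyObject *op = (PyObject *)((char *)base + hdr->obj_offsets[i]);
        op->ob_type = lookup_current_type_addr(op->ob_type);
    }
    return base;
}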

Experiments

Env: Linux & Intel Skylake

Running empty application

$time for i in `seq 100`; do PYCDSMODE=0 python3 -c ''; done
real  0m1.486s
user  0m1.186s
sys   0m0.307s

$PYCDSMODE=1 python3 -c '' # dump
$time for i in `seq 100`; do PYCDSMODE=2 python3 -c ''; done
real  0m1.201s
user  0m0.934s
sys   0m0.273s

Startup time benefit: 19.18% reduction

WebServer (flask + requests + pymongo)

$time PYCDSMODE=0 python3 -c 'import flask, requests, pymongo'
real  0m0.303s
user  0m0.278s
sys   0m0.025s

$PYCDSMODE=1 python3 -c 'import flask, requests, pymongo' # dump
$time PYCDSMODE=2 python3 -c 'import flask, requests, pymongo'
real  0m0.257s
user  0m0.232s
sys   0m0.024s

Startup time benefit: 15.18% reduction

Summary

Compared to the existing approaches, the main contributions of our CDS approach are:

  • CDS uses the heap objects directly, while the memory-mapped implementation in PyICE still needs some deserialization

  • CDS doesn't need to generate C source code, so no C toolchain is required for compilation. This is essential for production environments in the cloud

Considering that AppCDS has proven successful since OpenJDK 10, we believe our proposal can become a practical feature for improving CPython startup performance, even though our overall design is still evolving.

@yuleil yuleil changed the title Faster startup -- Sharing code objects from memory-mapped file Faster startup -- Share code objects from memory-mapped file Sep 19, 2021
@gvanrossum
Collaborator

Thanks! We'll try to study this in the coming weeks.

@gvanrossum
Collaborator

To make comparison easier, here's a GitHub diff from v3.9.5, which is where this code was forked:

https://github.com/python/cpython/compare/v3.9.5...yuleil:cds?expand=1

@gvanrossum
Collaborator

I haven't looked deeply into the implementation, but the idea looks decent enough: there's a "dump" mode that creates an mmap'ed segment with a snapshot of the heap in a file (or some part of it -- perhaps only code objects?), and a "use" mode that maps that file into memory at the original address. The beauty is that this supports arbitrary 3rd party modules.

The complexity is caused by the need to fix up the segment after it's been mmap'ed in, because:

  1. The segment contains references to objects that aren't in the segment (e.g. Py_None, PyTuple_Type), and those objects' addresses may vary; and
  2. Frozen sets need rehashing (frozen sets seem to complicate everything in this area, e.g. reproducible marshal output).

The solution is nicely general, but requires two new tp_* fields in all type objects. This seems less than ideal, since IIUC these are only populated for a handful of builtin objects. Also, it means that the mmap'ed segments cannot be shared between processes (so Instagram's "pre-forked servers" solution still wins for memory use).

Comparing with Experiment E (#84), the tooling is easier to use with 3rd party modules, although the dump/use mechanism is a bit clunky. I wonder if you could borrow an idea from #84 and generate a table of fix-ups for the mmap'ed segment that does things like patching references to standard types and singleton values, instead of adding new tp_* fields to all types.
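
For concreteness, such a fix-up table might look roughly like this; every name below is invented for the sketch, and it is only an illustration of the idea, not anyone's actual code:

/* Hypothetical sketch of the fix-up-table idea: instead of asking each
   type how to relocate itself, the dump records, for every external
   reference in the segment, where the pointer lives and which well-known
   object it must point to. */
#include <Python.h>

typedef enum {
    REF_NONE,          /* Py_None       */
    REF_TRUE,          /* Py_True       */
    REF_FALSE,         /* Py_False      */
    REF_TUPLE_TYPE,    /* &PyTuple_Type */
    REF_CODE_TYPE,     /* &PyCode_Type  */
    /* ... one entry per singleton or static type the segment references */
} wellknown_ref_t;

typedef struct {
    size_t          offset;  /* where in the segment the pointer lives */
    wellknown_ref_t target;  /* what it must point to in this process  */
} fixup_t;

static PyObject *resolve(wellknown_ref_t ref)
{
    switch (ref) {
    case REF_NONE:       return Py_None;
    case REF_TRUE:       return Py_True;
    case REF_FALSE:      return Py_False;
    case REF_TUPLE_TYPE: return (PyObject *)&PyTuple_Type;
    case REF_CODE_TYPE:  return (PyObject *)&PyCode_Type;
    }
    return NULL;
}

/* After mmap'ing the segment, rewrite each recorded slot in place. */
static void apply_fixups(char *segment, const fixup_t *fixups, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        PyObject **slot = (PyObject **)(segment + fixups[i].offset);
        *slot = resolve(fixups[i].target);
    }
}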

Ideal would be if you could package this as a 3rd party extension module that can be distributed via PyPI.

@gvanrossum
Collaborator

Some more questions:

  • Have you compared this to an application packaging system like PyOxidizer?
  • Could you briefly describe how you determine what is dumped in dump mode? (I.e., how do you capture all code objects and their dependencies but nothing else?)
  • How portable is this approach? Would it work on Mac? Windows?
  • Would it be possible to dump (a significant portion of) the stdlib and embed that in the CPython binary? (How would a user then go about dumping a collection of 3rd party libraries in addition?)

@yuleil
Author

yuleil commented Sep 23, 2021

We are thankful for your timely feedback. Below are some explanations regarding your questions.

The solution is nicely general, but requires two new tp_* fields in all type objects. This seems less than ideal, since IIUC these are only populated for a handful of builtin objects.

I wonder if you could borrow an idea from #84 and generate a table of fix-ups for the mmap'ed segment that does things like patching references to standard types and singleton values, instead of adding new tp_* fields to all types.

The reason for the two new tp_* fields is that I originally planned to implement a generic shared heap that could hold objects of any type, and then share code objects on top of it. That would take quite a long time to implement, though. From a practical perspective, we will focus on the types related to code objects and follow prior art to avoid adding new tp_* fields.

Also, it means that the mmap'ed segments cannot be shared between processes (so Instagram's "pre-forked servers" solution still wins for memory use).

Access to the data in the mmap'ed segments is not read-only, because of reference counting and the patches to ob_type and Py_None. The first write to a page triggers copy-on-write, so memory is not actually saved. We could adopt an idea from the AppCDS feature in OpenJDK, which separates the archive (the mmap'ed segments) into read-only (ro) and read-write (rw) sections. We will also check out Instagram's solution, as you suggested.
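
A rough sketch of what that ro/rw split could look like, assuming a page-aligned two-section file layout (all names here are assumptions for the sketch, not the current implementation):

/* Hedged sketch of the AppCDS-style split: the archive file is laid out as
   [ro section][rw section].  The ro part (immutable bytes: string data,
   bytecode) is mapped shared and read-only so its physical pages can be
   shared across processes; the rw part (object headers with refcounts,
   ob_type slots to patch) is mapped copy-on-write. */
#include <sys/mman.h>

typedef struct {
    off_t ro_offset; size_t ro_size;  /* page-aligned read-only section  */
    off_t rw_offset; size_t rw_size;  /* page-aligned read-write section */
    void *ro_addr;   void  *rw_addr;  /* dump-time addresses (MAP_FIXED) */
} archive_layout_t;

static int map_archive(int fd, const archive_layout_t *a)
{
    /* Read-only pages: never written, so the OS can keep one physical
       copy shared between all interpreter processes. */
    if (mmap(a->ro_addr, a->ro_size, PROT_READ,
             MAP_SHARED | MAP_FIXED, fd, a->ro_offset) == MAP_FAILED)
        return -1;

    /* Mutable pages: copy-on-write; only pages actually written to
       (refcount updates, ob_type patches) get private copies. */
    if (mmap(a->rw_addr, a->rw_size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_FIXED, fd, a->rw_offset) == MAP_FAILED)
        return -1;

    return 0;
}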

Ideal would be if you could package this as a 3rd party extension module that can be distributed via PyPI.

This matches our plans perfectly. We also hope to distribute this via PyPI and will continue working in that direction.

More Q & A

Have you compared this to an application packaging system like PyOxidizer?

Not yet. We will take a close look at that.

Please briefly describe how you determine what is dumped, in dump mode? (I.e., how you capture all code objects and their dependencies but nothing else.)

def patch_import_paths():
    # Dump mode (cds_mode == 1): wrap the get_code() that SourceFileLoader
    # inherits from SourceLoader, so that every code object compiled from
    # source is recorded as it is loaded.
    if sys.flags.cds_mode == 1:
        def patch_get_code(orig_get_code):
            def wrap_get_code(self, name):
                code = orig_get_code(self, name)
                # Record the code object; it is deep-copied into the
                # mmap'ed segment when the interpreter exits.
                SharedCodeWrap.set_module_code(name, code)
                return code
            return wrap_get_code
        SourceFileLoader.get_code = patch_get_code(SourceLoader.get_code)

We patch the SourceFileLoader.get_code method to record the loaded code objects. The recorded code objects are deep-copied into the mmap'ed segments when the process exits.

One challenge is that every object referenced by a code object must also be deep-copied into the mmap'ed segment, and that operation depends heavily on each type's internal implementation. We therefore add a tp_move_in field that records the deep-copy operation for each type; a rough sketch of such a hook follows.
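
For illustration, a tp_move_in-style hook for tuples might look roughly like this; sharedheap_alloc() and move_in() are invented stand-ins for the archive writer's machinery, so this is a sketch of the idea rather than the actual code:

/* Hypothetical sketch of a tp_move_in hook for tuples: copy the tuple
   into the shared heap, then recursively move each item in and rewrite
   the item pointers to their in-heap copies. */
#include <Python.h>
#include <string.h>

extern void *sharedheap_alloc(size_t size);  /* bump allocator in the image */
extern PyObject *move_in(PyObject *op);      /* per-type deep-copy dispatch */

static PyObject *tuple_move_in(PyObject *op)
{
    Py_ssize_t n = PyTuple_GET_SIZE(op);
    size_t size = _PyObject_VAR_SIZE(&PyTuple_Type, n);
    PyTupleObject *copy = sharedheap_alloc(size);

    /* Copy the header and item slots verbatim, then fix up each item to
       point at its own copy inside the shared heap. */
    memcpy(copy, op, size);
    for (Py_ssize_t i = 0; i < n; i++) {
        PyObject *item = PyTuple_GET_ITEM(op, i);
        copy->ob_item[i] = move_in(item);  /* recursive deep move */
    }
    return (PyObject *)copy;
}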

How portable is this approach? Would it work on Mac? Windows?

This feature was developed on macOS systems with the M1 chip, so it works fine on Mac and Linux. It doesn't work on Windows for now, due to the current use of the mmap system call. We will work on extending portability to more operating systems.

Would it be possible to dump (a significant portion of) the stdlib and embed that in the CPython binary? (How would a user then go about dumping a collection of 3rd party libraries in addition?)

A practical reference is OpenJDK's JEP 341: Default CDS Archives, which generates a CDS archive of JDK-internal classes at build time. When a user needs to dump 3rd party libraries, a new archive file is generated containing both the stdlib and the 3rd party libraries used by the program, so the pre-built default archive is no longer needed.

We tested an empty Python program, python3 -c pass. Dumping the stdlib produces an archive file of about 700KB. In contrast, OpenJDK 17's built-in archive file lib/server/classes.jsa is 14MB. So it is generally feasible to embed a default archive in the CPython binary.


We sincerely appreciate the attention you are giving to this. We will continue improving the implementation and will keep you posted on our progress.

@gvanrossum
Collaborator

Thanks for your answers. I hope you bring the project to maturity. I have one follow-up question:

We tested an empty Python program, python3 -c pass. Dumping the stdlib produces an archive file of about 700KB.

That number looks suspiciously low. Which modules are included in that? The PYC files for the stdlib total to at least 70 MB.

@yuleil
Author

yuleil commented Oct 8, 2021

CDS uses a trace-based model. Since the test program is python3 -c pass, only the modules used by python3 -c pass (and loaded after _bootstrap_external._install(), where patch_import_paths() is called) are moved into the mmap'ed segment.

More specifically, these modules are:

  • codecs
  • encodings.latin_1
  • posixpath
  • os
  • site
  • _bootlocale
  • abc
  • io
  • stat
  • _sitebuiltins
  • genericpath
  • encodings.utf_8
  • encodings.aliases
  • encodings
  • _collections_abc

@oraluben

The previous branch is obsolete; we've rewritten the implementation (e.g., removing the extra fields in the type object and reducing the hard-coded GC logic). The new version can be found at python/cpython@54a4e1b...oraluben:cds/main.

This will be the foundation of our future development.

@gvanrossum
Collaborator

Thanks. I don't think anyone on our team will have time to review the new version before our meeting, so hopefully you can explain some of the differences when we talk tomorrow.

@faster-cpython faster-cpython locked and limited conversation to collaborators Dec 2, 2021

This issue was moved to a discussion.

You can continue the conversation there.
