
Add a cache for JITed object code #433

Merged
merged 6 commits into pyston:master from the objectcache branch on Apr 20, 2015

Conversation

undingen
Contributor

@undingen undingen commented Apr 3, 2015

This is a simple cache: we still need to generate the IR, but if we can find a cache file created for the exact same IR, we load it and skip instruction selection etc.

Our benchmark test suite does not benefit much from this patch because for most tests the JITing time is too short.

  • I see a +2% slowdown on the first run against unmodified pyston (we have to do additional hashing, disk writes, and pointer relocation)
  • on the second run this turns into a -2% speedup
  • django_setup.py improves from 2.3sec to about 1.6sec

The cache works by creating symbols for all embedded pointers which change from one pyston start to the next when using the same executable. (Fixed executable-specific addresses are emitted directly, in order to reduce the number of symbols which need relocations.)
The IR for the module, in its text representation, gets crc32-hashed (I'm very open to other ideas :-)), and if we find a file with the same hash in the pyston_object_cache directory, we use it. While this approach is very safe, it fails if the IR output does not have the same variable names, line numbers, etc.
That's why I changed the variable name assignment to use an incremental index in the name instead of the pointer value as a string.
Even after this patch there are still a few instances of nondeterministic IR output (but by far most cases should be handled); I plan to improve this in a follow-up patch.
On our devel machines this patch will generate a huge number of cache files, because a cache file only works for the exact same executable. I plan to generate a hash of the pyston executable, save the cache files in a directory named after that hash, and on startup remove all directories whose name does not match this hash. Better ideas?
Another issue is that the IR is now more difficult to read, because the patchpoint call destinations all point to a dummy -1 address; but there is an option to disable the cache if one has to debug something.
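As a rough illustration of the lookup described above, a minimal sketch (all names hypothetical; std::hash and std::filesystem stand in for the crc32 hashing and the LLVM path utilities the patch actually uses):

```cpp
#include <filesystem>
#include <fstream>
#include <functional>
#include <optional>
#include <sstream>
#include <string>

namespace fs = std::filesystem;

// Cache directory name taken from the description above.
static const char* kCacheDir = "pyston_object_cache";

// Hash the textual IR; the real patch uses crc32, std::hash stands in here.
std::string hashIR(const std::string& ir_text) {
    std::ostringstream ss;
    ss << std::hex << std::hash<std::string>{}(ir_text);
    return ss.str();
}

// Return the cached object buffer if a file with the same hash exists,
// otherwise nothing (the caller would then run instruction selection).
std::optional<std::string> lookupObjectCache(const std::string& ir_text) {
    fs::path cache_file = fs::path(kCacheDir) / hashIR(ir_text);
    if (!fs::exists(cache_file))
        return std::nullopt;
    std::ifstream in(cache_file, std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();
    return buf.str();
}

// Store a freshly compiled object buffer under the IR hash.
void storeObjectCache(const std::string& ir_text, const std::string& obj) {
    fs::create_directories(kCacheDir);
    std::ofstream out(fs::path(kCacheDir) / hashIR(ir_text), std::ios::binary);
    out << obj;
}
```

The key property this relies on is the deterministic IR text: any run-to-run variation in the printed IR changes the hash and causes a cache miss.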

@undingen
Contributor Author

undingen commented Apr 3, 2015

Oh :-( I should have run make check before deciding that some piece of code was unnecessary...

@kmod
Collaborator

kmod commented Apr 3, 2015

sre_parse_parse goes from 2s to 0.9s :) Though with -I it's still 0.2s... but at least now the next things to optimize are our unoptimized analyses where I think we have a lot of room to do much better (as opposed to the already-optimized LLVM compilation).

That's too bad about printing the patchpoints, but maybe we could use the existing dumpPrettyIR() functionality -- where we copy and modify the entire IR to make it nicer for printing -- and insert the function names there? Though I don't think we have to do that for this PR.

I think we'll probably want to iterate on the cache-management strategy, so I'm not too worried if it's not great in the initial PR. We might want to bound the size of the cache overall (LRU with max size?) since we might run a ton of code through a single executable.

std::error_code error_code;
llvm::raw_fd_ostream IRObjectFile(cache_file.c_str(), error_code, llvm::sys::fs::F_RW);
RELEASE_ASSERT(!error_code, "");
IRObjectFile << Obj.getBuffer();
Collaborator

I think we might need to be more careful here -- if the user ctrl-C's while we're writing out the object file, on the next run we'll happily load it and then segfault. I ran into this with the .pyc files as well -- I was hoping that we wouldn't have to worry about it, but it actually happened enough even just during development that we had to add checksums. I'm not sure how easy that is to do here, but I think that'd be ideal; another option would be to write out the file to a temporary name and then rename it to the correct name, which guards against the ctrl-C behavior but not other kinds of corruption.
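The write-to-a-temporary-name-then-rename option could look roughly like this (a sketch with hypothetical names; rename() is atomic on POSIX within one filesystem, so this guards against the ctrl-C case but, as noted, not against other kinds of corruption):

```cpp
#include <cstdio>   // std::rename, std::remove
#include <fstream>
#include <string>

// Write the data to "<path>.tmp" first, then rename it onto the final name.
// An interrupted run leaves at worst a stale .tmp file behind, never a
// truncated file under the name the cache actually loads from.
bool writeFileAtomically(const std::string& path, const std::string& data) {
    std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        out << data;
        if (!out.good()) {  // open or write failure
            std::remove(tmp.c_str());
            return false;
        }
    }  // stream is flushed and closed here, before the rename below
    if (std::rename(tmp.c_str(), path.c_str()) != 0) {
        std::remove(tmp.c_str());
        return false;
    }
    return true;
}
```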

Contributor Author

You are right.
Another issue you probably already noticed (because of the test failures) is that I currently have a design issue in the change I integrated to generate deterministic node names (aka the replacement for #%p). I was under the impression that the strings only had to be unique inside the CFG, but that's not true 😂. Do you have an idea how to generate per-module deterministic node names which are unique? I'm currently afk but will look into it tomorrow.

@kmod
Collaborator

kmod commented Apr 3, 2015

Ok, I think I found the source of the errors: when we remap generator expressions / set+dict comprehensions we generate an inner function object, and currently we call nodeName() during the CFG process of the outer scope and insert those names into the inner scope. Previously this was fine, but with your changes we get clashes when the ids overlap. Can you double-check? But I think it's ok to just give these specific values hard-coded names, since I don't think there can ever be more than one of them per scope. I sent you a PR that I think fixes it.
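A per-scope counter is one way to get names that are deterministic across runs yet unique within a scope; a minimal sketch (NodeNamer and its interface are hypothetical illustrations, not Pyston's actual nodeName()):

```cpp
#include <string>
#include <unordered_map>

// Deterministic naming: each node gets "#<base>_<index>" where the index is
// an incrementing counter local to the scope, instead of embedding a pointer
// value (which differs from run to run and defeats hashing of the IR text).
class NodeNamer {
    int next_id = 0;
    std::unordered_map<const void*, std::string> names;

public:
    std::string nodeName(const void* node, const std::string& base) {
        auto it = names.find(node);
        if (it != names.end())
            return it->second;  // stable name for a node already seen
        std::string name = "#" + base + "_" + std::to_string(next_id++);
        names.emplace(node, name);
        return name;
    }
};
```

The clash described above corresponds to reusing one NodeNamer across an outer and an inner scope: both counters start at zero, so the inner scope can regenerate a name the outer scope already handed it.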

@undingen
Contributor Author

undingen commented Apr 5, 2015

Update:

  • @kmod I added your pull request; thanks, this fixed the issue with the duplicate node names.
  • I changed the hash to SHA256.
  • I patched llvm to support const GlobalVariables as arguments to a patchpoint without having to copy the address into a register. This fixes the "ran out of registers during register allocation" bug I encountered with the object cache enabled. It works by generating symbol entries inside the stackmap constant table. The llvm patch definitely needs a test + more comments, and I'm not confident that it doesn't introduce a problem.

The cache-file corruption check is not yet implemented, and everything needs more testing.

@kmod
Collaborator

kmod commented Apr 6, 2015

The llvm patch seems pretty reasonable to me; I would maybe email llvm-dev + Andrew Trick + Juergen Ributzka for comments?

As a general comment about testing: Michael has some tester improvements that should make it easier to write integration tests, so we could/should use that for this. I think it's ok though if we push this without having 100% confidence in it; let's do some reasonable testing, add a flag to turn it off on demand, and push it :)

@@ -959,6 +968,9 @@ CompiledFunction* doCompile(SourceInfo* source, ParamNames* param_names, const O
source->cfg->print();

assert(g.cur_module == NULL);

clearRelocatableSymsMap();

std::string name = getUniqueFunctionName(nameprefix, effort, entry_descriptor);
Collaborator

does this end up causing cache misses if the generated function name is different?

Contributor Author

Yes, this can generate cache misses because of the static num_functions variable. :-(

@undingen
Contributor Author

I sent the (slightly changed) llvm patch to the llvm-commits mailing list: http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20150413/271030.html
I will update this pull request when I receive feedback on the patch.

@undingen undingen force-pushed the objectcache branch 2 times, most recently from eb488c1 to ad3d0a1 Compare April 15, 2015 20:52
@undingen
Contributor Author

I updated the pull request but did not squash it yet, in order to make it easier to see what has changed.
The changes start from: Update llvm patch e472a8d

  • I updated the LLVM patch to the one I sent to the llvm-commits mailing list. The changes are mostly formatting and adding a test.
  • Added code to clean up the cache directory. The heuristic is very simple: there is a configurable limit on the number of cache files (currently 500); on startup I check whether there are more files than that, and if so remove the oldest ones. All the cache files are now located in ~/.cache/pyston/object_cache/
  • Save the cache files compressed using the very fast LZ4 library. This reduces the cache files to about one-tenth of the uncompressed size and adds a checksum for detecting e.g. truncated cache files, but does not decrease performance.
  • Sadly I can't use the current lz4 ubuntu package (too old), so I had to import the git module and change the make and cmake support (hope I did this right). I also tried zlib, but it noticeably slowed down the cache.
  • In order to generate deterministic CFG node names I made nodeName() not depend on the passed name but instead generate a new unique name. (While this changes the behavior, it looks like nothing depends on getting the same name for two calls to nodeName() with the same node argument.)

Not done:

  • There are still sometimes cache misses because the generated IR is not always deterministic, but I think this can be fixed later.
  • Also there is an llvm patch [1] on the mailing list which adds support for symbolic patchpoint function addresses. That patch would remove part of the changes this pull request makes to the patchpoints. But it has not yet landed in the llvm repo, and I haven't made use of it in this pull request yet. (But we could -- just one more patch to llvm.)

[1] http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20150330/269160.html

@kmod
Collaborator

kmod commented Apr 17, 2015

looks good to me :)

@undingen undingen force-pushed the objectcache branch 2 times, most recently from fb41bae to 9a3c499 Compare April 17, 2015 13:49
@undingen
Contributor Author

I squashed it a little bit, and rebased to HEAD.
I also sneaked in another small change :-(, it's the change to src/analysis/scoping_analysis.cpp:

+        // Sort the entries by name to make the order deterministic.
+        std::vector<InternedString> referenced_from_nested_sorted(usage->referenced_from_nested.begin(),
+                                                                  usage->referenced_from_nested.end());
+        std::sort(referenced_from_nested_sorted.begin(), referenced_from_nested_sorted.end());
         int i = 0;
-        for (auto& p : usage->referenced_from_nested) {
+        for (auto& p : referenced_from_nested_sorted) {
             closure_offsets[p] = i;
             i++;

This removes several cache misses.
I think it's ready to get merged.

@undingen
Contributor Author

Fixed the merge conflicts.

@undingen undingen force-pushed the objectcache branch 2 times, most recently from 17c0f60 to 589cf7b Compare April 19, 2015 17:12
undingen and others added 6 commits April 20, 2015 11:25
We still need to generate the IR, but if we can find
a cache file created for the exact same IR we will load it and
skip instruction selection etc...
When remapping generator expressions / set+dict comprehensions,
we create an explicit new function, which will later get run
through CFG again.  With the new changes, we can't use a nodeName()
that was generated in the parent scope, since that will end up clashing
when the generated scope generates its own names.

I think it's not too bad to fix since in these cases the arguments are only
ever used inside of the inner scope, so we just have to use names that aren't
used elsewhere.
Adds the LZ4 compression library and uses it for compressing cached objects.
This saves a lot of space (in my test it reduces the required space to about one-tenth),
and adds a checksum to the file in order to detect truncated cache files,
without reducing speed.
@kmod
Collaborator

kmod commented Apr 20, 2015

closing + reopening to force a new Travis-CI build (I had broken master when you pushed your most recent commit...).

@kmod kmod closed this Apr 20, 2015
@kmod kmod reopened this Apr 20, 2015
kmod added a commit that referenced this pull request Apr 20, 2015
Add a cache for JITed object code
@kmod kmod merged commit 691b8d6 into pyston:master Apr 20, 2015
@kmod
Collaborator

kmod commented Apr 20, 2015

This is awesome :) django_test.py takes 22s for me on the first run but only 9.4s on the second run!

@undingen
Contributor Author

Cool, thanks for tracking down the remaining cache misses.
