Start implementing caching for unchanging code blocks #1

Open
fgregg opened this issue Apr 2, 2018 · 15 comments


fgregg commented Apr 2, 2018

No description provided.


fgregg commented Apr 13, 2018

Here's Brandon Willard talking about the approaches he was thinking about:

mpastell#19 (comment)

@brandonwillard

@fgregg, looks like you're going down the same rabbit hole that I revisit every year!

A lot of the possible (and worthwhile) functionality should exist in its own project. For instance, the type of caching that invalidates entries based on a variable/call dependency graph, efficient incremental bytecode storage and updating (good for interactive sessions), automatic caching based on instrumentation/runtimes, etc.

As we discussed, there are necessarily low-level language implementation details, but a considerable amount of the logic surrounding those functions can be abstracted at a level sufficient for orchestration in Python (perhaps within the Jupyter framework).

If you're interested, we could probably knock out a full fledged example for Python sessions/kernels and start the generalization from there.


fgregg commented Apr 14, 2018

That sounds great, @brandonwillard. Do you have a suggestion about how we should proceed?


piccolbo commented Apr 14, 2018

I was wondering if you are aware of the dill package and that it offers the ability to save a session. From https://pypi.python.org/pypi/dill:

dill provides the ability to save the state of an interpreter session in a single command

I have not used this feature, but I have used other serialization features, and I think it's a high-quality package and the dev is very responsive. If storing the entire session is agreeable, then it's a pretty simple algorithm: restore the session before the chunk that has changed; evaluate the chunk; save the session after the chunk. If there is any change in the saved session, re-evaluate the next chunk. Premature optimization etc etc
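A minimal sketch of that save/restore loop, using the standard pickle module over an explicit namespace dict; dill's dump_session/load_session would extend the same idea to the full interpreter session. The chunk representation (a list of source strings) and the dirty-flag handling are illustrative assumptions:

```python
import hashlib
import pickle


def run_with_session_cache(chunks, cache=None):
    """Save/restore caching as described above: restore the namespace saved
    after each unchanged chunk, and re-evaluate every chunk downstream of
    the first change. `chunks` is assumed to be a list of source strings."""
    if cache is None:
        cache = {}
    namespace = {}
    dirty = False
    for position, source in enumerate(chunks):
        key = (position, hashlib.sha256(source.encode()).hexdigest())
        if not dirty and key in cache:
            # Chunk unchanged: restore the "session" as it was after it ran.
            namespace = pickle.loads(cache[key])
        else:
            dirty = True  # any change invalidates all later chunks
            exec(source, namespace)
            namespace.pop("__builtins__", None)  # exec() injects this
            cache[key] = pickle.dumps(namespace)  # save session after chunk
    return namespace, cache
```

With dill in place of pickle, dump_session/load_session could snapshot the real interpreter session rather than an explicit dict, subject to the serialization caveats below.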


fgregg commented Apr 14, 2018

Thanks for sharing that, @piccolbo.

It's helped me see that serializing and restoring sessions is not something that I want.

Here's a common pattern I have.

<<setup, cache=False>>
import psycopg2
conn = psycopg2.connect('postgres:///my_db')
c = conn.cursor()
@

<<expensive_query, cache=True>>
c.execute('''VERY EXPENSIVE QUERY''')
results = c.fetchall()
@

I don't really want to serialize the connection or the cursor (neither of which can really be serialized anyway). Even if I could sanely serialize and hydrate the connections, I wouldn't want the rehydration of expensive_query to clobber the value of conn or c which restoring the session would do.
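A results-only cache along these lines might pickle just the values a chunk is declared to produce, keyed by the chunk's source, and leave live objects like conn and c untouched. The run_cached helper and its outputs parameter are invented here for illustration:

```python
import hashlib
import pickle

_results_cache = {}


def run_cached(source, namespace, outputs):
    """Evaluate `source` in `namespace`, caching only the named `outputs`.

    On a cache hit the chunk is not re-executed: only the pickled outputs
    (e.g. `results`) are restored, so unpicklable state such as `conn`
    and `c` is never serialized or clobbered."""
    key = hashlib.sha256(source.encode()).hexdigest()
    if key in _results_cache:
        namespace.update(pickle.loads(_results_cache[key]))
    else:
        exec(source, namespace)
        _results_cache[key] = pickle.dumps(
            {name: namespace[name] for name in outputs}
        )
```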


brandonwillard commented Apr 14, 2018

@piccolbo, yes, I was originally using dill's session pickling in my Pweave caching branch (it's commented-out in that commit, though), but — as @fgregg said — it was overkill. That's also what motivated my thinking about incremental caching.

@fgregg, regarding first steps, here's a small example of AST-based "assign" statement caching. I left out any considerations for out-of-scope variable assigns within functions and classes, as well as variable-to-block dependencies (the kind that would invalidate caches for dependent blocks), etc.
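For orientation, a stripped-down version of such an AST "assign" pass, using the standard ast module and skipping nested scopes just as the linked example does:

```python
import ast


def assigned_names(source):
    """Return the names bound by top-level assignment statements in `source`.

    Function and class bodies are skipped, mirroring the caveat above about
    out-of-scope assigns; augmented and annotated assigns are included."""
    names = set()
    for node in ast.parse(source).body:  # top-level statements only
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, (ast.AugAssign, ast.AnnAssign)):
            targets = [node.target]
        else:
            continue
        for target in targets:
            # Walk tuple/list targets down to the individual names.
            for leaf in ast.walk(target):
                if isinstance(leaf, ast.Name):
                    names.add(leaf.id)
    return names
```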

This is the kind of thing that might work for org-mode and Pweave, but I think it should exist closer to the Jupyter level (e.g. in a client or kernel). Anyway, from here, we should figure out exactly where this sort of logic should exist and start considering which other languages to support and whether or not we can easily tease out assign statements. Either that or start on the aforementioned missing features.

One idea I kept having involved Pygments. It has a wide array of lexers and they might be useful for obtaining assigns in a similar, and highly generic, way. That approach is extremely limited, but somewhat promising for a broad, slightly-less-than-naive caching similar to the example given here. Tools like this and, say, Antlr are a nice way to keep the work going in one language (e.g. Python) while covering more languages. Plus, much of what we've discussed doesn't directly rely on bytecode compilation and/or execution; for instance, the caller's frame and exec-like expressions in my example code can — and probably should — be replaced with remote code execution calls via Jupyter.


fgregg commented Apr 16, 2018

Thanks for sharing that @brandonwillard.

It looks like you are hoping to implement a dependency graph to invalidate cached code if a dependency changes.

That's a very neat idea, but that would seem to assume that none of the dependencies were connections to external resources.

In the example code I posted above, conn and c are not serializable, so anything that depended upon them would also need to be re-run? In a certain way that seems sane, because this code can't know if I have made changes to the database I'm connecting to. However, while that's probably the most conservative behavior, it's not the one I want.

<<setup, cache=False>>
import psycopg2
conn = psycopg2.connect('postgres:///my_db')
c = conn.cursor()
@

<<expensive_query, cache=True>>
c.execute('''VERY EXPENSIVE QUERY''')
results = c.fetchall()
@


brandonwillard commented Apr 16, 2018

What I think is reasonable to implement in caching logic currently stops short at the bytecode level.

Nonetheless, there are worthwhile, albeit less-than-automatic, ways around specific problems — like the one you mention — and I imagine most would involve some intervention by the user (e.g. specifying a form to evaluate that would determine whether or not remote content was changed).
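One way to spell that user intervention is a hypothetical chunk option holding an expression to evaluate on each run: when its value changes, the cache key changes even though the chunk source did not. The option name (cache_unless) and helper below are invented for illustration:

```python
import hashlib


def chunk_key(source, namespace, cache_unless=None):
    """Build a cache key from the chunk source plus, optionally, the value
    of a user-supplied validation expression (e.g. a cheap probe such as
    "SELECT max(updated_at) FROM my_table" wrapped in a query call)."""
    digest = hashlib.sha256(source.encode())
    if cache_unless is not None:
        # Evaluate the user-specified form; if its repr changes between
        # runs, the key changes and the cached entry is invalidated.
        probe = eval(cache_unless, namespace)
        digest.update(repr(probe).encode())
    return digest.hexdigest()
```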

If you wanted to get really fancy, though, you could add caching logic with awareness specific to libraries like psycopg2. In this specific instance, it's quite possible (but maybe not worthwhile) to determine the tables involved in the relevant data-generating queries, build dependency graphs for those, and use the same caching logic. Even better, if this sort of logic was tailored for data-frame abstraction libraries like Ibis and Blaze, it might be easier to implement and be more applicable!


fgregg commented Apr 18, 2018

Okee doke. I think I'm going to start by implementing what knitr does by default.

In pseudocode:

old_globals = globals().copy()
for chunk in chunks:
    if chunk in cache:
        chunk.results = cache[chunk].results
        globals().update(cache[chunk].objects)
    else:
        result = eval(chunk)
        cache[chunk].results = result
        # what this chunk added or changed in globals
        cache[chunk].objects = {k: v for k, v in globals().items()
                                if k not in old_globals or old_globals[k] is not v}
    old_globals = globals().copy()

I think there are better things we can do in the future, but this seems pretty simple, and from working with knitr, it seems acceptable.
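A runnable rendering of that loop, using an explicit namespace dict in place of globals() and the chunk's source string as the cache key:

```python
import pickle


def run_with_chunk_cache(chunks, cache):
    """knitr-style caching as in the pseudocode above: on a hit, restore
    only the objects the chunk added or changed; on a miss, evaluate the
    chunk and record them. Uses a plain dict instead of globals()."""
    namespace = {}
    for source in chunks:
        old = dict(namespace)
        if source in cache:
            namespace.update(pickle.loads(cache[source]))
        else:
            exec(source, namespace)
            namespace.pop("__builtins__", None)  # exec() injects this
            # Objects this chunk added or rebound in the namespace.
            changed = {k: v for k, v in namespace.items()
                       if k not in old or old[k] is not v}
            cache[source] = pickle.dumps(changed)
    return namespace
```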

@brandonwillard

With source-string hashing for chunk validation, though, right?


fgregg commented Apr 18, 2018

Yes.

@brandonwillard

By the way, a great next step could involve AST-based hashing for cache and/or block validation, instead of string hashing. That way, inconsequential string changes in blocks wouldn't affect the cache (e.g. whitespace, comments, reorderings, variable name changes, etc.).
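A sketch of that, assuming plain ast.dump as the canonical form: hashing the parsed AST ignores whitespace and comments, though reorderings and variable renames would still change the hash without a further normalization pass:

```python
import ast
import hashlib


def ast_hash(source):
    """Hash the parsed AST rather than the source string.

    ast.dump omits position attributes by default, so formatting-only
    edits (whitespace, comments) produce the same hash; semantic edits
    (changed constants, names, structure) produce a different one."""
    tree = ast.parse(source)
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()
```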


fgregg commented Apr 18, 2018

I agree! Although, from working with knitr, I've often found it useful to bust the cache by adding in meaningless whitespace. Haha.

@brandonwillard

Ha, yeah, I've had to do that, as well. However, I think this functionality could be provided by a [temporary] chunk option.

You know, this whole idea of AST-based caching really makes me wish I was working with JVM languages again. Seems like one could cover a whole lot of ground with a unified bytecode like Java's.

fgregg pushed a commit that referenced this issue Dec 15, 2018