Start implementing caching for unchanging code blocks #1

Open
fgregg opened this issue Apr 2, 2018 · 15 comments


fgregg commented Apr 2, 2018

No description provided.


fgregg commented Apr 13, 2018

Here's Brandon Willard talking about the approaches he was thinking about:

mpastell#19 (comment)

@brandonwillard

@fgregg, looks like you're going down the same rabbit hole that I revisit every year!

A lot of the possible (and worthwhile) functionality should exist in its own project. For instance, the type of caching that invalidates entries based on a variable/call dependency graph, efficient incremental bytecode storage and updating (good for interactive sessions), automatic caching based on instrumentation/runtimes, etc.

As we discussed, there are necessarily low-level language implementation details, but a considerable amount of the logic surrounding those functions can be abstracted at a level sufficient for orchestration in Python (perhaps within the Jupyter framework).

If you're interested, we could probably knock out a full fledged example for Python sessions/kernels and start the generalization from there.


fgregg commented Apr 14, 2018

That sounds great, @brandonwillard. Do you have a suggestion about how we should proceed?


piccolbo commented Apr 14, 2018

I was wondering if you are aware of the dill package and that it offers the ability to save a session. From https://pypi.python.org/pypi/dill:

dill provides the ability to save the state of an interpreter session in a single command

I have not used this feature, but I have used other serialization features, and I think it's a high-quality package and the dev is very responsive. If storing the entire session is agreeable, then it's a pretty simple algorithm: restore the session before the chunk that has changed; evaluate the chunk; save the session after the chunk. If there is any change in the saved session, re-evaluate the next chunk. Premature optimization etc etc
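A minimal sketch of that save/restore loop, using the standard pickle module over an explicit namespace dict; dill's dump_session/load_session would extend the same idea to the full interpreter session. The chunk representation (a list of source strings) and the dirty-flag handling are illustrative assumptions:

```python
import hashlib
import pickle


def run_with_session_cache(chunks, cache=None):
    """Save/restore caching as described above: restore the namespace saved
    after each unchanged chunk, and re-evaluate every chunk downstream of
    the first change. `chunks` is assumed to be a list of source strings."""
    if cache is None:
        cache = {}
    namespace = {}
    dirty = False
    for position, source in enumerate(chunks):
        key = (position, hashlib.sha256(source.encode()).hexdigest())
        if not dirty and key in cache:
            # Chunk unchanged: restore the "session" as it was after it ran.
            namespace = pickle.loads(cache[key])
        else:
            dirty = True  # any change invalidates all later chunks
            exec(source, namespace)
            namespace.pop("__builtins__", None)  # exec() injects this
            cache[key] = pickle.dumps(namespace)  # save session after chunk
    return namespace, cache
```

With dill in place of pickle, dump_session/load_session could snapshot the real interpreter session rather than an explicit dict, subject to the serialization caveats below.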


fgregg commented Apr 14, 2018

Thanks for sharing that, @piccolbo.

It's helped me see that serializing and restoring sessions is not something that I want.

Here's a common pattern I have.

<<setup, cache=False>>
import psycopg2
conn = psycopg2.connect('postgres:///my_db')
c = conn.cursor()
@

<<expensive_query, cache=True>>
c.execute('''VERY EXPENSIVE QUERY''')
results = c.fetchall()
@

I don't really want to serialize the connection or the cursor (neither of which can really be serialized anyway). Even if I could sanely serialize and hydrate the connections, I wouldn't want the rehydration of expensive_query to clobber the value of conn or c which restoring the session would do.
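A results-only cache along these lines might pickle just the values a chunk is declared to produce, keyed by the chunk's source, and leave live objects like conn and c untouched. The run_cached helper and its outputs parameter are invented here for illustration:

```python
import hashlib
import pickle

_results_cache = {}


def run_cached(source, namespace, outputs):
    """Evaluate `source` in `namespace`, caching only the named `outputs`.

    On a cache hit the chunk is not re-executed: only the pickled outputs
    (e.g. `results`) are restored, so unpicklable state such as `conn`
    and `c` is never serialized or clobbered."""
    key = hashlib.sha256(source.encode()).hexdigest()
    if key in _results_cache:
        namespace.update(pickle.loads(_results_cache[key]))
    else:
        exec(source, namespace)
        _results_cache[key] = pickle.dumps(
            {name: namespace[name] for name in outputs}
        )
```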


brandonwillard commented Apr 14, 2018

@piccolbo, yes, I was originally using dill's session pickling in my Pweave caching branch (it's commented-out in that commit, though), but — as @fgregg said — it was overkill. That's also what motivated my thinking about incremental caching.

@fgregg, regarding first steps, here's a small example of AST-based "assign" statement caching. I left out any considerations for out-of-scope variable assigns within functions and classes, as well as variable-to-block dependencies (the kind that would invalidate caches for dependent blocks), etc.
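For orientation, a stripped-down version of such an AST "assign" pass, using the standard ast module and skipping nested scopes just as the linked example does:

```python
import ast


def assigned_names(source):
    """Return the names bound by top-level assignment statements in `source`.

    Function and class bodies are skipped, mirroring the caveat above about
    out-of-scope assigns; augmented and annotated assigns are included."""
    names = set()
    for node in ast.parse(source).body:  # top-level statements only
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, (ast.AugAssign, ast.AnnAssign)):
            targets = [node.target]
        else:
            continue
        for target in targets:
            # Walk tuple/list targets down to the individual names.
            for leaf in ast.walk(target):
                if isinstance(leaf, ast.Name):
                    names.add(leaf.id)
    return names
```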

This is the kind of thing that might work for org-mode and Pweave, but I think it should exist closer to the Jupyter level (e.g. in a client or kernel). Anyway, from here, we should figure out exactly where this sort of logic should exist and start considering which other languages to support and whether or not we can easily tease out assign statements. Either that or start on the aforementioned missing features.

One idea I kept having involved Pygments. It has a wide array of lexers and they might be useful for obtaining assigns in a similar, and highly generic, way. That approach is extremely limited, but somewhat promising for a broad, slightly-less-than-naive caching similar to the example given here. Tools like this and, say, Antlr are a nice way to keep the work going in one language (e.g. Python) while covering more languages. Plus, much of what we've discussed doesn't directly rely on bytecode compilation and/or execution; for instance, the caller's frame and exec-like expressions in my example code can — and probably should — be replaced with remote code execution calls via Jupyter.


fgregg commented Apr 16, 2018

Thanks for sharing that @brandonwillard.

It looks like you are hoping to implement a dependency graph to invalidate cached code if a dependency changes.

That's a very neat idea, but that would seem to assume that none of the dependencies were connections to external resources.

In the example code I posted above, conn and c are not serializable, so anything that depended upon them would also need to be re-run? In a certain way that seems sane, because this code can't know if I have made changes to the database I'm connecting to. However, while that's probably the most conservative behavior, it's not the one I want.

<<setup, cache=False>>
import psycopg2
conn = psycopg2.connect('postgres:///my_db')
c = conn.cursor()
@

<<expensive_query, cache=True>>
c.execute('''VERY EXPENSIVE QUERY''')
results = c.fetchall()
@


brandonwillard commented Apr 16, 2018

What I think is reasonable to implement in caching logic currently stops short at the bytecode level.

Nonetheless, there are worthwhile, albeit less-than-automatic, ways around specific problems — like the one you mention — and I imagine most would involve some intervention by the user (e.g. specifying a form to evaluate that would determine whether or not remote content was changed).
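One way to spell that user intervention is a hypothetical chunk option holding an expression to evaluate on each run: when its value changes, the cache key changes even though the chunk source did not. The option name (cache_unless) and helper below are invented for illustration:

```python
import hashlib


def chunk_key(source, namespace, cache_unless=None):
    """Build a cache key from the chunk source plus, optionally, the value
    of a user-supplied validation expression (e.g. a cheap probe such as
    "SELECT max(updated_at) FROM my_table" wrapped in a query call)."""
    digest = hashlib.sha256(source.encode())
    if cache_unless is not None:
        # Evaluate the user-specified form; if its repr changes between
        # runs, the key changes and the cached entry is invalidated.
        probe = eval(cache_unless, namespace)
        digest.update(repr(probe).encode())
    return digest.hexdigest()
```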

If you wanted to get really fancy, though, you could add caching logic with awareness specific to libraries like psycopg2. In this specific instance, it's quite possible (but maybe not worthwhile) to determine the tables involved in the relevant data-generating queries, build dependency graphs for those, and use the same caching logic. Even better, if this sort of logic was tailored for data-frame abstraction libraries like Ibis and Blaze, it might be easier to implement and be more applicable!


fgregg commented Apr 18, 2018

Okee doke. I think I'm going to start by implementing what knitr does by default.

In pseudocode:

old_globals = globals().copy()
for chunk in chunks:
    if chunk in cache:
        chunk.results = cache[chunk].results
        globals().update(cache[chunk].objects)
    else:
        result = eval(chunk)
        cache[chunk].results = result
        # what this chunk added or changed in globals
        cache[chunk].objects = {k: v for k, v in globals().items()
                                if k not in old_globals or old_globals[k] is not v}
    old_globals = globals().copy()

I think there are better things we can do in the future, but this seems pretty simple, and from working with knitr, it seems acceptable.
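A runnable rendering of that loop, using an explicit namespace dict in place of globals() and the chunk's source string as the cache key:

```python
import pickle


def run_with_chunk_cache(chunks, cache):
    """knitr-style caching as in the pseudocode above: on a hit, restore
    only the objects the chunk added or changed; on a miss, evaluate the
    chunk and record them. Uses a plain dict instead of globals()."""
    namespace = {}
    for source in chunks:
        old = dict(namespace)
        if source in cache:
            namespace.update(pickle.loads(cache[source]))
        else:
            exec(source, namespace)
            namespace.pop("__builtins__", None)  # exec() injects this
            # Objects this chunk added or rebound in the namespace.
            changed = {k: v for k, v in namespace.items()
                       if k not in old or old[k] is not v}
            cache[source] = pickle.dumps(changed)
    return namespace
```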

@brandonwillard

With source-string hashing for chunk validation, though, right?


fgregg commented Apr 18, 2018

Yes.

@brandonwillard

By the way, a great next step could involve AST-based hashing for cache and/or block validation, instead of string hashing. That way, inconsequential string changes in blocks wouldn't affect the cache (e.g. whitespace, comments, reorderings, variable name changes, etc.).
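A sketch of that, assuming plain ast.dump as the canonical form: hashing the parsed AST ignores whitespace and comments, though reorderings and variable renames would still change the hash without a further normalization pass:

```python
import ast
import hashlib


def ast_hash(source):
    """Hash the parsed AST rather than the source string.

    ast.dump omits position attributes by default, so formatting-only
    edits (whitespace, comments) produce the same hash; semantic edits
    (changed constants, names, structure) produce a different one."""
    tree = ast.parse(source)
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()
```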


fgregg commented Apr 18, 2018

I agree! Although, from working with knitr, I've often found it useful to bust the cache by adding in meaningless whitespace. Haha.

@brandonwillard

Ha, yeah, I've had to do that, as well. However, I think this functionality could be provided by a [temporary] chunk option.

You know, this whole idea of AST-based caching really makes me wish I was working with JVM languages again. Seems like one could cover a whole lot of ground with a unified bytecode like Java's.

fgregg pushed a commit that referenced this issue Dec 15, 2018