Migrating from Jupyter

Oliver edited this page Apr 21, 2021 · 1 revision

Vizier for the Jupyter Developer

Vizier is similar to Jupyter, but makes several design choices that require you to think a little differently when writing code. This page outlines what you need to watch out for, describes workarounds, and details what we're doing to make the transition smoother in the future. If you find a pain point especially troubling, go and upvote/discuss the related issue!

Cells Execute in Order

Unlike Jupyter, where you manually execute cells, Vizier cells run in order. If you go back and edit an earlier cell, all cells that depend on it will be re-executed.

What this means for you

  1. When you update a cell, all cells that depend on it will be re-executed immediately.
  2. You won't be able to access datasets/artifacts generated by later cells.

Why

Reproducibility is a huge problem for Jupyter, in large part because cells can be executed in arbitrary order. This makes Jupyter hard to approach for newcomers and encourages lots of bad habits (like making notebooks that have to be executed in "the right" order). The main reason Jupyter works this way is that dependency tracking in Python is hard: if you edited one cell, Jupyter would have to re-execute every cell below it, since any of them might be affected. Vizier, on the other hand, knows which cells depend on which. When you edit a cell, only the cells that depend on it are re-executed (although you can still re-execute cells manually if you like).

Each Python Cell is a Separate Script

Unlike Jupyter, where you treat the entire notebook as one big script run in a single interpreter, Vizier runs each Python script cell in a fresh interpreter.

What this means for you

  1. You'll need to repeat import statements in each Python cell (we're cleaning this up).
  2. To pass a variable or dataset from one cell to the next, you'll need to use the vizierdb module. For example, to get a dataset created by an earlier cell, use vizierdb.get_dataset(ds_name). To allow later cells to use your dataset, use vizierdb.update_dataset or vizierdb.create_dataset.
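
The hand-off described in item 2 can be sketched as follows. The vizierdb module is only available inside a Vizier Python cell, so this sketch uses a hypothetical in-memory stand-in (FakeVizierDB) to stay runnable anywhere. Only the three method names come from this page; the (name, dataset) argument order for create_dataset and update_dataset, and the use of plain lists of dicts as datasets, are assumptions made for illustration. Each "cell" is modeled as a function so that its local variables and imports do not leak into the next one.

```python
# Hypothetical stand-in for the vizierdb module so this sketch runs outside Vizier.
# Only the method names come from the page; the argument lists are assumptions.
class FakeVizierDB:
    def __init__(self):
        self._datasets = {}

    def create_dataset(self, name, rows):
        if name in self._datasets:
            raise ValueError(f"dataset {name!r} already exists")
        self._datasets[name] = list(rows)

    def get_dataset(self, name):
        # Hand back a copy of the row list, not the stored list itself
        return list(self._datasets[name])

    def update_dataset(self, name, rows):
        if name not in self._datasets:
            raise KeyError(name)
        self._datasets[name] = list(rows)


vizierdb = FakeVizierDB()


def cell_1():
    # Imports are local to the cell: nothing carries over to later cells.
    import math
    rows = [{"n": n, "root": math.sqrt(n)} for n in (1, 4, 9)]
    vizierdb.create_dataset("squares", rows)  # publish for later cells


def cell_2():
    import math  # repeated, because this cell starts from a fresh interpreter
    rows = vizierdb.get_dataset("squares")  # fetch what cell_1 published
    rows = [dict(r, floor_root=math.floor(r["root"])) for r in rows]
    vizierdb.update_dataset("squares", rows)


cell_1()
cell_2()
result = vizierdb.get_dataset("squares")
```

Inside Vizier you would drop the stand-in class and call the real vizierdb module directly; check its documentation for the actual signatures.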

Why

First and foremost, it makes dependency tracking feasible. Every time you call one of the vizierdb dataset accessors, Vizier records the dataset you got, created, or updated. This means Vizier knows which cells your Python cell depends on, can figure out which cells depend on your cell, and can avoid re-executing your cell when possible.
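
To make the idea concrete, here is a minimal sketch of how recorded reads and writes are enough to decide what to re-run. It is a simplified model, not Vizier's actual scheduler: the function name dependents and the (reads, writes) representation of a cell are hypothetical.

```python
def dependents(cells, edited):
    """Return indices of the cells that must re-run after cell `edited` changes.

    `cells` is an ordered list of (reads, writes) pairs, where each element
    is a set of dataset names the cell was recorded reading or writing.
    """
    # Datasets whose contents may differ from the previous run
    dirty = set(cells[edited][1])
    rerun = []
    for i in range(edited + 1, len(cells)):
        reads, writes = cells[i]
        if reads & dirty:
            # The cell reads something that changed: re-run it, and treat
            # everything it writes as changed too.
            rerun.append(i)
            dirty |= writes
        else:
            # The cell is unaffected; any name it overwrites now points at
            # an unchanged dataset again.
            dirty -= writes
    return rerun


notebook = [
    (set(), {"raw"}),                     # 0: load data
    ({"raw"}, {"clean"}),                 # 1: clean it
    (set(), {"lookup"}),                  # 2: load an unrelated lookup table
    ({"clean", "lookup"}, {"report"}),    # 3: join both
]

after_edit_0 = dependents(notebook, 0)  # cells 1 and 3 read affected data
after_edit_2 = dependents(notebook, 2)  # only cell 3 reads "lookup"
```

Cell 2 is skipped when cell 0 is edited because it reads nothing that changed, which is exactly the saving that a single shared interpreter cannot offer.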

Another benefit of going through accessors is that it makes it easier to translate artifacts between different languages. This is what powers Vizier's polyglot features: a dataset created by SQL can be accessed by Python, and when Python creates a dataset, R cells can access it without any problems.

Artifacts are Immutable; Artifact References are not

In contrast to state-based notebooks like Jupyter, artifacts produced by cells (e.g., datasets) in Vizier are immutable: Once created, they exist in perpetuity as-is. Vizier creates the illusion of mutability by allowing names to be updated to point to new versions of the artifact.

What this means for you

When you update a dataset (e.g., in a SQL or Python cell), what you're actually doing under the hood is creating a new immutable dataset and updating the name to point to it. This means a few things. First, there's no need to manually cache datasets, since datasets (and other artifacts) are automatically cached by default. Vizier does have a "Clone dataset" cell type, but this cell executes almost immediately because it doesn't need to copy the dataset. It just creates a new name for the same dataset!

Second, it means that you can always go back to an earlier version of the dataset. You can always review a dataset as it was earlier in the notebook and compare it against a later version of the dataset.
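
The relationship between immutable artifacts and mutable names can be illustrated with a toy model. This is a sketch of the concept only: the ArtifactStore class and its methods are hypothetical and are not part of Vizier's API.

```python
class ArtifactStore:
    """Toy model: immutable versions plus mutable name -> version pointers."""

    def __init__(self):
        self._versions = []  # append-only; the list index is the version id
        self._names = {}     # name -> version id

    def put(self, name, rows):
        # "Updating" never touches old data: store a new immutable version
        # and repoint the name at it.
        self._versions.append(tuple(rows))
        version_id = len(self._versions) - 1
        self._names[name] = version_id
        return version_id

    def get(self, name):
        return self._versions[self._names[name]]

    def get_version(self, version_id):
        return self._versions[version_id]

    def clone(self, source, target):
        # No data is copied: the new name points at the same version.
        self._names[target] = self._names[source]


store = ArtifactStore()
v1 = store.put("sales", [1, 2, 3])
store.clone("sales", "sales_backup")   # constant time, regardless of size
v2 = store.put("sales", [1, 2, 3, 4])  # "update": new version, name repointed

current = store.get("sales")
backup = store.get("sales_backup")
earlier = store.get_version(v1)
```

After the update, the name sales resolves to the new version, while sales_backup and the earlier version id still resolve to the original data, which is why cloning is cheap and earlier versions remain available for comparison.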

Why

We see this as a strictly good thing.