# Tutorial

A dataflow notebook is an extension of the [Jupyter](https://docs.jupyter.org/en/latest/) computational notebook that works with Python. This documentation assumes familiarity with computational notebooks and terminology like cells, outputs, and identifiers. If you are not familiar with what a computational notebook is or the Jupyter project, we highly recommend reviewing those concepts on their [website](https://docs.jupyter.org/en/latest/) and [trying](https://docs.jupyter.org/en/latest/start/index.html#try-the-jupyterlab-interface) the [JupyterLab](https://jupyterlab.readthedocs.io/en/latest/) interface. There is also an extensive [user guide](https://jupyterlab.readthedocs.io/en/latest/user/index.html) for JupyterLab.

Dataflow notebooks differ from classic Jupyter notebooks in a few key ways. Most importantly, cells are linked when one cell references the output of another. These links form a dependency graph that lets us determine, for any cell, which cells it depends on and which cells depend on it.
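To make the idea of a dependency graph concrete, here is a minimal sketch in plain Python. It is not the notebook's implementation: the cell labels (`load`, `clean`, `plot`, `summary`) and the `references` table are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical cells: each entry maps a cell to the cells whose outputs it references.
references = {
    "load": [],
    "clean": ["load"],
    "plot": ["clean"],
    "summary": ["clean"],
}

def invert(refs):
    """Build the reverse map: cell -> cells that reference its outputs."""
    downstream = defaultdict(list)
    for cell, upstream_cells in refs.items():
        for up in upstream_cells:
            downstream[up].append(cell)
    return downstream

def reachable(start, edges):
    """Return every cell reachable from `start` by following `edges`."""
    seen, stack = set(), list(edges.get(start, []))
    while stack:
        cell = stack.pop()
        if cell not in seen:
            seen.add(cell)
            stack.extend(edges.get(cell, []))
    return seen

# Cells that "plot" depends on, directly or indirectly.
upstream_of_plot = reachable("plot", references)
# Cells that depend on "load", directly or indirectly.
downstream_of_load = reachable("load", invert(references))

print(sorted(upstream_of_plot))    # ['clean', 'load']
print(sorted(downstream_of_load))  # ['clean', 'plot', 'summary']
```

Following the links in one direction answers "what must run before this cell?", and following them in the other answers "what is affected if this cell changes?".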

## Variables and Outputs

You can use variables in a dataflow notebook in much the same way as in a Jupyter notebook, with one key difference: a variable cannot be referenced across cells unless it is denoted as an output. Thus, transient identifiers like a counter `i` used in a loop are valid only in the current cell by default. To ensure that an identifier can be referenced from other cells, it needs to be listed on the **last line** of the cell. Jupyter notebooks already treat the last line specially: the value of an expression on that line is displayed as the cell's output. In dataflow notebooks, any variable or assignment on the last line both outputs the value and makes the identifier usable in other cells. For example, a single cell with the code `a = 3 * 4` assigns the expression's value, 12, to `a`, shows that value as the output, and lets us reference `a` in another cell.

In [1369097314]:
a = 3 * 4

12

However, if that cell instead defined `b` on the first line and `c` on the second line, only `c` would be output and accessible. `b` would not.

In [200066137]:
b = 1 * 2
c = 3 * 4

12

To make multiple variables accessible, you can list them as a tuple in the last line.

In [2172169693]:
d = 1 * 2
e = 3 * 4
d, e

2

12

You can also use simultaneous assignment to set the values at the same time.

In [4044153153]:
i, j = 1 * 2, 3 * 4

2

12
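The last-line rule illustrated by the four cells above can be modeled in plain Python with the standard `ast` module. This is only a sketch of the rule as described in this tutorial, not the kernel's actual logic, and the helper name `exported_names` is hypothetical.

```python
import ast

def exported_names(source):
    """Return the names that the last statement of a cell would export."""
    body = ast.parse(source).body
    if not body:
        return []
    last = body[-1]
    if isinstance(last, ast.Assign):
        targets = last.targets      # e.g. `a = ...` or `i, j = ...`
    elif isinstance(last, ast.Expr):
        targets = [last.value]      # e.g. `c` or `d, e`
    else:
        return []
    names = []
    for target in targets:
        # A tuple such as `d, e` contributes each of its elements.
        elements = target.elts if isinstance(target, ast.Tuple) else [target]
        names.extend(e.id for e in elements if isinstance(e, ast.Name))
    return names

print(exported_names("a = 3 * 4"))                   # ['a']
print(exported_names("b = 1 * 2\nc = 3 * 4"))        # ['c']
print(exported_names("d = 1 * 2\ne = 3 * 4\nd, e"))  # ['d', 'e']
print(exported_names("i, j = 1 * 2, 3 * 4"))         # ['i', 'j']
```

Note that `b` never appears in the second result: it is assigned on the first line only, so under this rule it stays local to its cell.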

You might have noticed that each output of a cell is **labeled** with the name of its variable. In classic Jupyter notebooks, outputs are labeled with numbers that change with each execution, and a tuple of outputs is presented as a single output rendered as a (textual) tuple. With dataflow notebooks, each output is labeled individually. This means that you can scan through a notebook and see all of the variables that can be referenced in other cells **and** a representation of their values!

## Reusing Identifiers

When coding, people often lodge variable identifiers in their memory so that they can reference them later. For a data scientist, the dataframe they are currently analyzing is often identified as `df`. When manipulating a dataframe (e.g. cleaning or transforming it), it is often useful to see the results of particular changes as they are developed in a step-by-step manner. However, giving a different name to each one of the intermediate outputs can be tedious (`df1`, `df2`, `df3`, ...) and error-prone, so the identifier `df` may be reused to represent the "current" state of the dataframe. If we only link cells based on their named outputs, this introduces an ambiguity which plagues Jupyter notebooks: which `df` are we referencing at any given point? Since cells can be reordered and executed in different orders, the exact dependencies between the assignments to `df` can be impossible to follow.
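The ambiguity is easy to reproduce outside a notebook. The sketch below simulates a classic shared namespace with `exec`; the three cell bodies and their labels `A`, `B`, and `C` are hypothetical, and a plain list stands in for a dataframe.

```python
# Hypothetical cells that all read or write the same identifier, `df`.
cells = {
    "A": "df = [3, 1, 2]",
    "B": "df = sorted(df)",
    "C": "first = df[0]",
}

def run(order):
    """Execute the cells in the given order in one shared namespace."""
    namespace = {}
    for label in order:
        exec(cells[label], namespace)
    return namespace["first"]

print(run(["A", "B", "C"]))  # 1, because C sees the sorted df
print(run(["A", "C", "B"]))  # 3, because C sees the original df
```

The code in cell `C` is identical in both runs, yet its result depends on which assignment to `df` happened to execute last.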

### Cell Identifiers

A dataflow notebook stores **both** the variable name and the cell identifier for any referenced identifier. When a variable is output only once, the cell identifier is superfluous, but when the name is output by more than one cell, we can disambiguate a reference by appending the cell identifier. To support this, dataflow notebooks extend Python's syntax so that an identifier may be followed by the `$` symbol and a cell identifier. Thus, in the following sequence of cells, we can see which `x` is being referenced in the final cell by looking at the appended identifier:

In [1482671858]:
x = 1 * 2

2

In [1586220593]:
x = 3 * 4

12

In [1256994353]:
x$5e8bce31 ** 2

144

When a reference is not ambiguous (as in every cell up to the second definition of `x`), we hide the persistent identifier. We always record it with the notebook, however, so that if the reference ever becomes ambiguous, we can display the correct one.

Note that you are never required to add the `$`-suffix. Whenever an identifier is referenced (even if the reference is ambiguous), the notebook will default to associating the identifier with the **most recently executed** cell with that output. If you wish to reference a different cell, you can enter the variable name and then use Jupyter's tab completion to choose the cell you wish to reference.
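The resolution rule can be sketched in plain Python. This models only the behavior described above, not the notebook's internals; the `history` list and the helpers `cell_id` and `resolve` are hypothetical. The sketch also relies on an observation from the examples above: the number in each `In [...]` prompt appears to be the decimal form of the cell's hexadecimal identifier (1586220593 is `5e8bce31` in hexadecimal).

```python
def cell_id(prompt_number):
    """Format a decimal prompt number as an 8-digit hexadecimal identifier."""
    return format(prompt_number, "08x")

# Outputs in execution order: (cell identifier, name, value).
history = [
    (cell_id(1482671858), "x", 2),
    (cell_id(1586220593), "x", 12),
]

def resolve(reference, history):
    """Resolve `name` or `name$cellid` against the execution history."""
    name, _, wanted = reference.partition("$")
    # Walk backwards so that an unqualified name picks the most recent cell.
    for ident, out_name, value in reversed(history):
        if out_name == name and (not wanted or ident == wanted):
            return value
    raise NameError(reference)

print(resolve("x", history))                   # 12
print(resolve("x$5e8bce31", history))          # 12
print(resolve("x$" + history[0][0], history))  # 2
```

An unqualified `x` and `x$5e8bce31` agree here only because that cell was executed last. Qualifying the reference keeps it stable even if the other cell is re-executed later.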

Finally, when the last line of a cell is an expression rather than a variable name or assignment, the output is labeled with the cell identifier. While such outputs can be accessed through the global `Out` dictionary (e.g. `Out['4aec3631']`), we recommend using meaningful names instead.

## Cell Names

While a cell's hexadecimal identifier is unique and may remind git users of the hashes used to identify content or commits, such identifiers are generally difficult to remember. To address this, dataflow notebooks support user-defined cell names. A user can select any cell they wish to recall and then set its name by clicking on the tag icon in the cell toolbar. This action is also available via the Edit menu under "Add/Modify Cell Name..." Then, when programming, a user can type a variable name and disambiguate it by adding the cell name as a suffix, in the same way that a cell identifier is used. If a cell is named later, the notebook will update the display of any references to that cell to show the new name.

In [3836741696]:
y = 1 * 2

2

In [1831057233]:
y = 3 * 4

12

In [4248419359]:
y$first ** 2

4

Cell names must be unique within a given notebook, but they are not permanent. This means that you can change the name of a cell or move a name from one cell to another. The cell's unique hexadecimal identifier always remains the same, and all code is persisted with that identifier. Thus, we automatically check and update the displayed code in any cell to match new or changed names.
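One way to picture this separation is as a stored reference plus a lookup table used only for display. The sketch below is a plain-Python model under that assumption, not the notebook's code: the `names` table, the `display` helper, and the name `initial` are hypothetical, and the identifier is derived from the prompt number of the first `y` cell on the assumption that the prompt shows its decimal form.

```python
# Identifier of the cell that output y = 2 (assumed to be the hexadecimal
# form of the prompt number shown in the example above).
first_cell = format(3836741696, "08x")

# What is persisted with the notebook: the variable and the permanent identifier.
stored = ("y", first_cell)

def display(var, ident, names):
    """Render a stored reference, preferring a cell name when one exists."""
    by_ident = {cell: name for name, cell in names.items()}
    return f"{var}${by_ident.get(ident, ident)}"

names = {"first": first_cell}      # hypothetical name table: name -> identifier
print(display(*stored, names))     # y$first

# Renaming the cell changes only the name table, never the stored reference.
names = {"initial": first_cell}
print(display(*stored, names))     # y$initial

# Without a name, the display falls back to the hexadecimal identifier.
print(display(*stored, {}))
```

Because the stored pair never changes, renaming a cell cannot break or redirect an existing reference.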

## Functions, Classes, and Modules

Because Python functions, classes, and modules all have associated identifiers, the same rules that apply to variables apply to these entities. Thus, if you want to use a function or class defined in one cell in another cell, you also need to list its name on the last line of the defining cell.

In [3130517244]:
def f(x):
    return x ** 2
f

<function __main__.__closure__.<locals>.f(x)>

In [1042737090]:
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
Point

__main__.__closure__.<locals>.Point

For imported modules, it would be tedious to list every imported identifier a second time, so the notebook detects any imports and **automatically** classifies them as outputs.

In [1539372238]:
import collections.abc
from collections import Counter

3 + 4

collections.Counter

<module 'collections' from '/Users/dakoop/Applications/mambaforge/envs/dfnb-combined-test-install/lib/python3.13/collections/__init__.py'>

7
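To see which identifiers an import statement binds, and therefore which names would become outputs in the cell above, the standard `ast` module is enough. This is a sketch of one possible detection approach, not the notebook's own code; the helper name `imported_names` is hypothetical.

```python
import ast

def imported_names(source):
    """Return the identifiers bound by top-level import statements in a cell."""
    names = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds `a`; `import a.b as c` binds `c`.
                names.append(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                if alias.name != "*":  # star imports bind names unknown statically
                    names.append(alias.asname or alias.name)
    return names

cell = "import collections.abc\nfrom collections import Counter\n\n3 + 4"
print(imported_names(cell))  # ['collections', 'Counter']
```

This matches the outputs shown above: `import collections.abc` binds the top-level name `collections`, and `from collections import Counter` binds `Counter`.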