# Tutorial 1: Random forest ML project
This tutorial will show you how `mandala` works in a small random forest
ML project. You'll see how *queryable & composable* memoization is a simple way
to achieve the main goals of scientific data management:
- easily iterate by just dumping more logic/parameters/experiments on top of the
code you already ran. Memoization automatically takes care of loading past
results, skipping over past computations, and merging results across compatible
versions of your code.
- explore the interdependencies of all saved results incrementally and
declaratively with **computation frames**, generalized dataframes that operate
over memoized computational graphs. Expand the computational history of
artifacts backward (to what produced them) and/or forward (to experiments that
used them), and generate dataframes of results for further analysis.

# Imports & setup

In [4]:
from typing import Tuple
import numpy as np
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# recommended way to import mandala functionality
from mandala._next.imports import *

np.random.seed(0)

In [2]:
@op # memoizing decorator
def generate_dataset(random_seed=42):
    print("Generating dataset...")
    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_seed)
    return X_train, X_test, y_train, y_test

@op
def train_model(X_train, y_train):
    print("Training model...")
    model = RandomForestClassifier(n_estimators=1)
    model.fit(X_train, y_train)
    return model, round(model.score(X_train, y_train), 2)

@op
def eval_model(model, X_test, y_test):
    print("Evaluating model...")
    return round(model.score(X_test, y_test), 2)

# Running and iterating on the pipeline

## Run the pipeline once with default settings

In [5]:
storage = Storage() # in-memory storage for all results in this notebook

with storage: # we make @ops use a given storage with this `with` block 
    X_train, X_test, y_train, y_test = generate_dataset()
    model, train_acc = train_model(X_train, y_train)
    test_acc = eval_model(model, X_test, y_test)
    print(f"Train accuracy: {train_acc}, Test accuracy: {test_acc}")

Generating dataset...
Training model...
Evaluating model...
Train accuracy: AtomRef(0.91, hid='8f5...', cid='97b...'), Test accuracy: AtomRef(0.76, hid='57f...', cid='9e5...')


Now all three calls are saved to the storage. `@op`s return **value 
references**, which wrap a Python object with some storage metadata needed to
make the memoization **compose**. 

Thanks to that metadata, when we re-run memoized code, the storage recognizes
step-by-step that all work has already been done, and only loads *references* to
the results (not the Python objects themselves):

In [6]:
with storage: # same code, but now it only loads pointers to saved results
    X_train, X_test, y_train, y_test = generate_dataset()
    model, train_acc = train_model(X_train, y_train)
    test_acc = eval_model(model, X_test, y_test)
    print(f"Train accuracy: {train_acc}, Test accuracy: {test_acc}")

Train accuracy: AtomRef(hid='8f5...', cid='97b...', in_memory=False), Test accuracy: AtomRef(hid='57f...', cid='9e5...', in_memory=False)
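
To make the idea of composable memoization concrete, here is a minimal sketch in plain Python. It is not `mandala`'s implementation: `ToyRef`, `ToyStorage` and `toy_op` are hypothetical names, and each reference carries a single content hash, whereas the printouts above show that `mandala`'s references carry both a `hid` and a `cid`. The sketch only shows why returning references lets a chain of memoized calls be recognized step by step without loading any values.

```python
import hashlib
import pickle
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class ToyRef:
    """Hypothetical stand-in for a value reference (not mandala's AtomRef)."""
    cid: str                 # content ID: hash of the wrapped value
    obj: Any = None          # the value itself; None when only a pointer was loaded
    in_memory: bool = False


class ToyStorage:
    """Hypothetical in-memory store (not mandala's Storage)."""

    def __init__(self) -> None:
        self.objects: Dict[str, Any] = {}   # cid -> saved value
        self.calls: Dict[str, str] = {}     # call key -> cid of the output
        self.executed: List[str] = []       # names of functions that actually ran

    def wrap(self, value: Any) -> ToyRef:
        if isinstance(value, ToyRef):
            return value
        cid = hashlib.sha256(pickle.dumps(value)).hexdigest()[:8]
        self.objects[cid] = value
        return ToyRef(cid, value, True)


def toy_op(storage: ToyStorage) -> Callable:
    def decorator(func: Callable) -> Callable:
        def wrapper(*args: Any) -> ToyRef:
            refs = [storage.wrap(a) for a in args]
            # A call is identified by the function name and the input IDs only.
            key = func.__name__ + "(" + ",".join(r.cid for r in refs) + ")"
            if key in storage.calls:
                return ToyRef(storage.calls[key])  # pointer only, nothing is loaded
            # Cache miss: look up the input values and run the function.
            result = func(*(storage.objects[r.cid] for r in refs))
            storage.executed.append(func.__name__)
            out = storage.wrap(result)
            storage.calls[key] = out.cid
            return out
        return wrapper
    return decorator


toy_storage = ToyStorage()


@toy_op(toy_storage)
def double(x):
    return 2 * x


@toy_op(toy_storage)
def increment(x):
    return x + 1


first = increment(double(10))    # both functions execute
second = increment(double(10))   # both calls are recognized; nothing executes
print(toy_storage.executed)      # ['double', 'increment']
print(second.in_memory)          # False
```

The second chain works because `double(10)` returns a pointer whose ID alone is enough to look up the saved `increment` call.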


## Iterate directly on top of memoized code and change memoized functions backward-compatibly
This also makes it easy to iterate on a project by just adding new logic on top of
already memoized code. For example, let's add a new parameter to `train_model`
in a way compatible with our current results:

In [7]:
@op
def train_model(X_train, y_train, n_estimators=NewArgDefault(1)):
    print("Training model...")
    model = RandomForestClassifier(n_estimators=n_estimators)
    model.fit(X_train, y_train)
    return model, round(model.score(X_train, y_train), 2)

with storage: # we make @ops use a given storage with this `with` block 
    X_train, X_test, y_train, y_test = generate_dataset()
    for n_estimators in [1, 10, 100]:
        model, train_acc = train_model(X_train, y_train, n_estimators=n_estimators)
        test_acc = eval_model(model, X_test, y_test)
        print(f"With {n_estimators} trees: Train accuracy={train_acc}, Test accuracy={test_acc}")

With 1 trees: Train accuracy=AtomRef(hid='8f5...', cid='97b...', in_memory=False), Test accuracy=AtomRef(hid='57f...', cid='9e5...', in_memory=False)
Training model...
Evaluating model...
With 10 trees: Train accuracy=AtomRef(1.0, hid='760...', cid='b67...'), Test accuracy=AtomRef(0.94, hid='600...', cid='c3b...')
Training model...
Evaluating model...
With 100 trees: Train accuracy=AtomRef(1.0, hid='ab0...', cid='b67...'), Test accuracy=AtomRef(0.98, hid='45f...', cid='d15...')


When we add a new argument with a default value wrapped as `NewArgDefault(obj)`,
`mandala` will ignore this parameter when its value equals `obj`, and fall back
on memoized calls that don't provide this argument. 

This is why `Training model...` was printed only twice, and why the results
for `n_estimators=1` are not in memory (`in_memory=False`).
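
The fallback rule can be sketched as a key-building function. This is an illustration under assumptions, not `mandala`'s code: `ToyNewArgDefault` and `call_key` are hypothetical, and the inputs are plain strings and integers rather than references. The point is that an argument left at its `NewArgDefault` value is dropped from the key, so the call matches one saved before the argument existed.

```python
from typing import Any, Dict, Tuple


class ToyNewArgDefault:
    """Hypothetical marker for a default value added after calls were saved."""

    def __init__(self, value: Any) -> None:
        self.value = value


def call_key(func_name: str, kwargs: Dict[str, Any], defaults: Dict[str, Any]) -> Tuple:
    """Build a memoization key, skipping new arguments left at their default."""
    items = []
    for name, value in sorted(kwargs.items()):
        default = defaults.get(name)
        if isinstance(default, ToyNewArgDefault) and value == default.value:
            continue  # treated as if the argument had never been added
        items.append((name, value))
    return (func_name, tuple(items))


defaults = {"n_estimators": ToyNewArgDefault(1)}

# Key of a call saved before `n_estimators` was introduced.
old_key = call_key("train_model", {"data": "digits"}, {})
# Keys of calls made after the change.
same = call_key("train_model", {"data": "digits", "n_estimators": 1}, defaults)
new = call_key("train_model", {"data": "digits", "n_estimators": 10}, defaults)

print(same == old_key)   # True: the old memoized call is reused
print(new == old_key)    # False: a genuinely new call
```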

## Pros and cons of memoization
Composable memoization is a powerful "imperative" query interface: if you have
the memoized code in front of you, you can just retrace it and get references to
any intermediate results you want to look at. This is very flexible, because you
can add control flow logic to restrict which results you look at. Retracing is
cheap, because large objects are not loaded from storage, letting you narrow
down your "query" before high-bandwidth interaction with the storage.
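
As a rough illustration of narrowing down a query before loading anything large, the sketch below uses a hypothetical `LazyRef` class in place of `mandala`'s references. The test accuracies are the ones printed earlier in this tutorial, and the "models" are placeholder strings. Only the cheap accuracy values are read for every configuration; a model is read only when its accuracy passes the filter.

```python
from typing import Callable, Dict, List, Tuple


class LazyRef:
    """Hypothetical pointer to a saved value; the value is read only on load()."""

    def __init__(self, loader: Callable[[], object], size: str) -> None:
        self._loader = loader
        self.size = size

    def load(self) -> object:
        loaded.append(self.size)   # record every read from the pretend storage
        return self._loader()


loaded: List[str] = []

# Pretend that retracing returned these pointers: n_estimators -> (test_acc, model).
retraced: Dict[int, Tuple[LazyRef, LazyRef]] = {
    n: (
        LazyRef(lambda acc=acc: acc, "small"),
        LazyRef(lambda n=n: f"model with {n} trees", "large"),
    )
    for n, acc in [(1, 0.76), (10, 0.94), (100, 0.98)]
}

good_models = []
for n, (acc_ref, model_ref) in retraced.items():
    if acc_ref.load() > 0.9:                   # cheap: a single float
        good_models.append(model_ref.load())   # expensive: only when needed

print(good_models)            # ['model with 10 trees', 'model with 100 trees']
print(loaded.count("large"))  # 2
```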

However, the memoized code may not always be in front of you! Especially in
larger projects, where it's easy to lose track of what has already been
computed, there's a need for a complementary storage interface based on
**declarative** principles. We discuss this next.

# Querying the storage with computation frames 
To address this limitation of memoization, `mandala` offers the
`ComputationFrame` class. A computation frame is a generalization of the
familiar `pandas` dataframe in two main ways:
- the "columns" of a computation frame represent a **computational graph**, made of 
variables and operations on them.
- the "rows" of a computation frame are **computations that (partially) follow
the structure of the computational graph**.

It's best to illustrate this via some examples:

In [33]:
cf = storage.cf(train_model)
cf

ComputationFrame with 5 variable(s) (10 unique refs), 1 operation(s) (3 unique calls)
Computational graph:
    output_0, output_1 = train_model(y_train=y_train, n_estimators=n_estimators, X_train=X_train)

In [34]:
cf.rename(vars={'output_0': 'model', 'output_1': 'train_acc'}, inplace=True)

ComputationFrame with 5 variable(s) (10 unique refs), 1 operation(s) (3 unique calls)
Computational graph:
    model, train_acc = train_model(y_train=y_train, n_estimators=n_estimators, X_train=X_train)

We just got a computation frame (CF) corresponding to a very simple computational
graph: it only has one operation, `train_model`, with its associated inputs and
outputs. The printout also describes the overall number of `Ref`s and `Call`s
represented by this CF.

We can directly extract a dataframe from it:

In [10]:
cf.get_df()

Extracting tuples from the computation graph:
output_0, output_1 = train_model(n_estimators=n_estimators, X_train=X_train, y_train=y_train)...


| | X_train | n_estimators | y_train | train_model | output_0 | output_1 |
|---|---|---|---|---|---|---|
| 0 | [[0.0, 0.0, 3.0, 14.0, 1.0, 0.0, 0.0, 0.0, 0.0... | 100.0 | [6, 0, 0, 3, 0, 5, 0, 0, 4, 1, 2, 8, 4, 5, 9, ... | Call(train_model, cid='bb3...', hid='255...') | (DecisionTreeClassifier(max_features='sqrt', r... | 1.0 |
| 1 | [[0.0, 0.0, 3.0, 14.0, 1.0, 0.0, 0.0, 0.0, 0.0... | NaN | [6, 0, 0, 3, 0, 5, 0, 0, 4, 1, 2, 8, 4, 5, 9, ... | Call(train_model, cid='ca5...', hid='ac0...') | (DecisionTreeClassifier(max_features='sqrt', r... | 0.91 |
| 2 | [[0.0, 0.0, 3.0, 14.0, 1.0, 0.0, 0.0, 0.0, 0.0... | 10.0 | [6, 0, 0, 3, 0, 5, 0, 0, 4, 1, 2, 8, 4, 5, 9, ... | Call(train_model, cid='c4f...', hid='5f7...') | (DecisionTreeClassifier(max_features='sqrt', r... | 1.0 |


We get back a table, where each row corresponds to a call to `train_model`.
Furthermore, the `Call` objects themselves (which contain metadata about the
call) appear in the column dedicated to the single operation in the graph.

We see that in the `n_estimators` column we have the values `[100.0, NaN, 10.0]`, 
reflecting the fact that we made 1 call to `train_model` before introducing the
`n_estimators` argument, and 2 afterwards.
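
The same pattern can be reproduced with plain `pandas`, independently of `mandala`. The sketch below builds a dataframe from one record per call, using the training accuracies printed earlier; the record layout is an assumption made for illustration. It shows why a call that never received `n_estimators` yields `NaN`, and why the other values appear as `10.0` and `100.0`: a column that contains `NaN` is stored as floats.

```python
import pandas as pd

# One record per call; the first call predates the `n_estimators` argument.
calls = [
    {"train_acc": 0.91},                        # no n_estimators recorded
    {"n_estimators": 10, "train_acc": 1.0},
    {"n_estimators": 100, "train_acc": 1.0},
]

df = pd.DataFrame(calls)
print(df)
print(df["n_estimators"].dtype)          # float64, because NaN forces ints to floats
print(df["n_estimators"].isna().sum())   # 1 call without the argument
```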

## Exploring storage by expanding the CF
This CF gets much more interesting and useful when we can look into where the
inputs to `train_model` came from, and what the outputs were used for. We can
add the history of particular variables by calling `expand_back` on the CF, and 
similarly `expand_forward` shows the operations that consume given variables: 

In [13]:
print(cf.expand_back(varnames=["X_train", "y_train"]))
print(cf.expand_forward(varnames=['output_0', 'output_1']))

ComputationFrame with 6 variable(s) (11 unique refs), 2 operation(s) (4 unique calls)
Computational graph:
    X_train, y_train = generate_dataset(random_seed=random_seed)
    output_0, output_1 = train_model(y_train=y_train, n_estimators=n_estimators, X_train=X_train)
