# Automatic versioning and dependency tracking
`mandala` can
- automatically track the functions and global variables that went into a
  memoized call;
- detect when they have changed;
- let you choose whether a change
  - requires recomputation of dependent calls (e.g., a bug fix)
  - or not (e.g., a refactor that doesn't affect the essential logic).

The core principle of the versioning system is that the *content* of your code
(modulo semantic equivalences you can choose to add manually) implicitly
determines "which world" you are in, and therefore which calls are memoized.
This gives you a low-overhead, low-effort way to reuse memoized results at a
very fine granularity, while keeping storage well organized as a project
evolves.
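
To make this idea concrete, here is a minimal sketch of memoization keyed on
code content. It is not how `mandala` is implemented: the names `content_hash`
and `ContentKeyedCache` are hypothetical, and hashing the parsed AST is an
assumption made here so that comments and formatting do not count as changes.

```python
import ast
import hashlib


def content_hash(source: str) -> str:
    """Hash the parsed AST, so comments and whitespace do not affect the result."""
    normalized = ast.dump(ast.parse(source))
    return hashlib.sha256(normalized.encode()).hexdigest()


class ContentKeyedCache:
    """Memoize results under (code hash, function name, arguments)."""

    def __init__(self):
        self._results = {}

    def call(self, source: str, func_name: str, *args):
        key = (content_hash(source), func_name, args)
        if key not in self._results:
            namespace = {}
            exec(source, namespace)  # define the function from its source
            self._results[key] = namespace[func_name](*args)
        return self._results[key]

    def __len__(self):
        return len(self._results)


v1 = "def scale(x):\n    return x * 2\n"
v1_commented = "def scale(x):\n    # double the input\n    return x * 2\n"
v2 = "def scale(x):\n    return x * 3\n"

cache = ContentKeyedCache()
cache.call(v1, "scale", 10)            # computed
cache.call(v1_commented, "scale", 10)  # reused: same AST, same "world"
cache.call(v2, "scale", 10)            # recomputed: the logic changed
```

In this sketch the cache ends up with two entries, one per distinct version of
the logic, even though three different source strings were used.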

**Note**: this functionality is still a work in progress! 

## Understanding dependencies
To get started, let's define a memoized function `train_model` that depends on a
helper function, a helper class, and some global variables:

In [2]:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from mandala.imports import *
from typing import Tuple
from pathlib import Path
from sklearn.model_selection import train_test_split

N_SAMPLES = 1000
X, y = make_classification(n_samples=N_SAMPLES, random_state=42)

C = 0.1
class MyRegression: # a thin wrapper around sklearn's LogisticRegression

    def __init__(self):
        self.lr = LogisticRegression(C=C)
    
    def fit(self, X, y):
        self.lr.fit(X, y)
    
    def score(self, X, y):
        return self.lr.score(X, y)

N_ESTIMATORS = 100
def train_rf(X_train, y_train, X_test, y_test) -> float:
    rf = RandomForestClassifier(n_estimators=N_ESTIMATORS)
    rf.fit(X_train, y_train)
    score = rf.score(X_test, y_test)
    return score

@op
def train_model(model_class: str = 'lr') -> float:
    print(f'training model {model_class}')
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    if model_class == 'lr':
        model = MyRegression()
        model.fit(X_train, y_train)
        acc = model.score(X_test, y_test)
    else:
        acc = train_rf(X_train, y_train, X_test, y_test)
    return acc

We also define a storage with the `deps_path` argument. This tells `mandala`
where to look for the code of the dependencies of a project. For this
demonstration, we set it to just the current `__main__` scope, so `mandala`
looks for dependencies only in this notebook. Library functions such as
`train_test_split` are therefore excluded.

In [3]:
storage = Storage(deps_path='__main__')

Let's now memoize a call to `train_model` and see what dependencies we find:

In [4]:
with storage.run():
    acc = train_model()

storage.versions(train_model) 

training model lr


The dependencies of the call we memoized include:
  - the **function's own source code**, in this case the code of `train_model`;
  - the **global variables** that it accesses, in this case `X` and `y`;
  - **recursively**, the **dependencies of any functions/methods that it calls**.
    For example, `train_model` depends on `MyRegression.__init__`
    (among other functions/methods), and `MyRegression.__init__` itself depends
    on `C`.
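
The recursive structure described above can be sketched with plain
introspection. This is a rough static approximation and not `mandala`'s
mechanism: `collect_dependencies`, `RATE`, `discount`, and `total` are
hypothetical names, and the sketch only follows plain functions (not methods or
nested code such as generator expressions).

```python
import types


def collect_dependencies(func, found=None):
    """Map each global name reachable from `func` to its current value."""
    found = {} if found is None else found
    for name in func.__code__.co_names:
        if name in found or name not in func.__globals__:
            continue  # already visited, or a builtin/attribute name
        value = func.__globals__[name]
        found[name] = value
        if isinstance(value, types.FunctionType):
            collect_dependencies(value, found)  # follow the call graph
    return found


RATE = 0.5  # a global variable read by `discount`


def discount(price):
    return price * RATE


def total(prices):
    result = 0.0
    for p in prices:
        result += discount(p)
    return result


deps = collect_dependencies(total)
print(sorted(deps))  # ['RATE', 'discount']
```

Here `total` never mentions `RATE` directly, yet it shows up as a dependency
because `discount` reads it, mirroring how `train_model` depends on `C` through
`MyRegression.__init__`.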

A change to any of these *may* make a memoized call invalid. Note that whether
or not a result is stale depends on the function's dependencies **for that
particular call**, and on whether the changes to the dependencies change the
meaning of the computation. 
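
One way to picture per-call dependencies is to record which functions actually
run during a given call. The sketch below does this with the standard library's
`sys.setprofile`; it is an illustration only, not a description of `mandala`'s
internals, and `called_functions`, `train`, `fit_linear`, and `fit_forest` are
hypothetical names.

```python
import sys


def called_functions(func, *args, **kwargs):
    """Run `func` and return (result, names of Python functions that ran)."""
    called = set()

    def profiler(frame, event, arg):
        if event == "call":  # Python-level calls only; C calls are "c_call"
            called.add(frame.f_code.co_name)

    previous = sys.getprofile()
    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(previous)  # always restore the previous profiler
    return result, called


def fit_linear(data):
    return sum(data) / len(data)


def fit_forest(data):
    return max(data)


def train(kind, data):
    return fit_linear(data) if kind == "lr" else fit_forest(data)


_, lr_deps = called_functions(train, "lr", [1, 2, 3])
_, rf_deps = called_functions(train, "rf", [1, 2, 3])
```

The two calls go through the same entry point but touch different helpers, so a
later change to `fit_forest` would only be relevant to the second call.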

Indeed, if we now train a random forest classifier instead, we will see a new
version with different dependencies show up:

In [5]:
with storage.run():
    acc = train_model(model_class='rf')

storage.versions(train_model)

training model rf


## Detecting changes
Now to the fun part: automatically reacting to changes in dependencies. Let's
make three different kinds of changes to the dependencies:
- change the constant `C` used in the logistic regression;
- change the source of the method `__init__` of the `MyRegression` class;
- change the source of `train_model` itself.

When you run the cell below, you'll be presented with a diff of the
dependencies, organized by module and by kind of dependency (function or global
variable), along with the memoized functions affected by each change:

In [6]:
N_SAMPLES = 1000 
X, y = make_classification(n_samples=N_SAMPLES, random_state=42)

C = 1.0 # changed from 0.1
class MyRegression: # a thin wrapper around sklearn's LogisticRegression

    def __init__(self):
        self.lr = LogisticRegression(C=C, class_weight='balanced') # changed: added class_weight='balanced' (default is None)
    
    def fit(self, X, y):
        self.lr.fit(X, y)
    
    def score(self, X, y):
        return self.lr.score(X, y)

N_ESTIMATORS = 100
def train_rf(X_train, y_train, X_test, y_test) -> float:
    rf = RandomForestClassifier(n_estimators=N_ESTIMATORS)
    rf.fit(X_train, y_train)
    score = rf.score(X_test, y_test)
    return score

@op
def train_model(model_class: str = 'lr') -> float:
    print(f'training model {model_class}')
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    if model_class == 'lr':
        model = MyRegression()
        model.fit(X_train, y_train)
        acc = model.score(X_test, y_test)
    else:
        acc = train_rf(X_train, y_train, X_test, y_test)
    print(f'accuracy: {acc}') # CHANGE: print the accuracy
    return acc

with storage.run():
    rf_acc = train_model(model_class='rf')
    lr_acc = train_model(model_class='lr')

CHANGE DETECTED in train_model from module __main__
Dependent components:
  Version of "train_model" from module "__main__" (content: 3186cce4d44f99c0e346bbc4668f613b, semantic: e27ccd0e76ba6886b390d33c7ade95a3)
  Version of "train_model" from module "__main__" (content: 574da61ae9e58f448972ba39ef30c15e, semantic: 3c430ce7f58c07cac3490e3eafcdc5bf)
===DIFF===:
     else:
         acc = train_rf(X_train, y_train, X_test, y_test)
+    print(f'accuracy: {acc}') # CHANGE: print the accuracy
     return acc
Does this change require recomputation of dependent calls? [y]es/[n]o/[a]bort 
You answered: "n"
CHANGE DETECTED in MyRegression.__init__ from module __main__
Dependent components:
  Version of "train_model" from module "__main__" (content: 3186cce4d44f99c0e346bbc4668f613b, semantic: e27ccd0e76ba6886b390d33c7ade95a3)
===DIFF===:
     def __init__(self):
-        self.lr = LogisticRegression(C=C)
+        self.lr = LogisticRegression(C=C, class_weight='balanced') # c

Let's unpack what happened. In the interactive dialog above, we marked the
changes as follows:
- the added printing of accuracy in `train_model` was marked as not requiring
  recomputation; it will only affect future calls.
- the changes to `C` and `MyRegression.__init__` were marked as requiring
  recomputation, as they substantially change what gets computed!

As a result, the call that trained a logistic regression model was recomputed,
while the call that used a random forest classifier was reused. This is the
power of fine-grained dependency tracking!
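
The dialog output above lists both a `content` and a `semantic` hash for each
version. One plausible reading, sketched below as an assumption rather than as
`mandala`'s actual bookkeeping, is that a change marked as not requiring
recomputation gets a new content hash but inherits the old semantic hash.
`VersionTable` and `digest` are hypothetical names.

```python
import hashlib


def digest(text: str) -> str:
    return hashlib.md5(text.encode()).hexdigest()


class VersionTable:
    """Track a content hash and a semantic hash for one dependency."""

    def __init__(self, source: str):
        self.current = digest(source)
        self.semantic_of = {self.current: self.current}

    def update(self, new_source: str, requires_recompute: bool) -> None:
        new_content = digest(new_source)
        if requires_recompute:
            # A breaking change starts a new semantic version.
            self.semantic_of[new_content] = new_content
        else:
            # A compatible change inherits the previous semantic version.
            self.semantic_of[new_content] = self.semantic_of[self.current]
        self.current = new_content

    @property
    def semantic(self) -> str:
        return self.semantic_of[self.current]


table = VersionTable("acc = model.score(X, y)")
original = table.semantic

table.update("acc = model.score(X, y); print(acc)", requires_recompute=False)
after_refactor = table.semantic  # unchanged, so memoized calls stay valid

table.update("acc = model.score(X, y, sample_weight=w)", requires_recompute=True)
after_fix = table.semantic       # new value, so dependent calls are recomputed
```

Under this reading, memoized calls would be looked up by semantic hash, which
is why answering "n" in the dialog lets existing results be reused.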

Now that you have multiple versions of your code, you can also look at how a
particular component evolved during a project. For example, `storage.sources()`
gives you a tree-like revision history:

In [7]:
storage.sources(train_model)

Revision history for the source code of function train_model from module __main__ ("===HEAD===" is the current version):


## Conclusion
This was a very brief introduction to the versioning machinery. In particular,
we didn't cover several other things this design makes possible:
- revisiting old versions and making new branches off of them;
- auditing all the source code that went into a particular call;
- incorporating versioning information in declarative queries.