# Logistic regression in pytorch
## What's in this tutorial?
This notebook will walk you through the most basic use of `mandala` for storing
and tracking ML experiment results. It uses logistic regression on MNIST as a
"minimally interesting" example. By following this mini-project, you will learn
how to
- break up an experiment into Python functions whose calls can be tracked and
queried;
- use memoization to avoid re-running expensive computations and to
naturally interact with and grow your project (by adjusting the parameters and
adding new code);
- get a powerful query interface to your results "for free" by repurposing the
  pure Python code of your experiments.

The overall goal that `mandala` enables is to let you write only the
plain-Python code you'd write if you were just running experiments in-memory -
yet get the benefits of a database-backed experiment tracking system that knows
not to recompute already computed quantities, and is aware of the relationships
between these quantities and the parameters that went into them.

## Project description
As the example project, you will use logistic regression to classify MNIST
digits. On a high level, you'll do the following:
- load the train and test sets;
- define the model, loss function, and optimizer;
- explore the space of hyperparameters (learning rate, batch size, ...) to 
find a good combination;
- iterate on this pipeline until you're satisfied with the results.

## Import libraries

In [1]:
# Tuple is used in the return annotations of the functions below
from typing import Tuple

import torch
import torchvision as tv
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader, Subset

# recommended way to import mandala functionality
from mandala_lite.all import *

## Define supporting functions
For the purposes of this tutorial, you'll break the project into two main
functions:
- `get_dataloaders` returns `DataLoader`s for the train and test sets with a
given batch size;
- `train_model` is passed these `DataLoader`s, trains a model on the train set
with given hyperparameters, and returns the accuracy on the test set.

This kind of decomposition of the problem is a good practice in ML in general,
but also when using `mandala` in particular. It allows you to re-use the same
intermediate results in different experiments. For example, you might want to
try different models on the same data, or different data on the same model. With
this decomposition, you can just call `get_dataloaders` once, and then call 
`train_model` with different models and hyperparameters.

Below are definitions of these functions in fairly standard `pytorch` code. Note
the use of `@op` to mark the functions as tracked by `mandala` - more on this
in a moment:

### Load the data

In [9]:
# MNIST constants
INPUT_SIZE = 28**2
NUM_CLASSES = 10


@op
def get_dataloaders(batch_size: int = 100) -> Tuple[DataLoader, DataLoader]:
    # get train and test loaders for the MNIST dataset
    train_data = Subset(
        MNIST("data", train=True, download=True, transform=tv.transforms.ToTensor()),
        indices=range(2_000),
    )
    test_data = MNIST("data", train=False, transform=tv.transforms.ToTensor())
    train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False)
    return train_loader, test_loader

### Train the model

In [10]:
class LogisticRegression(torch.nn.Module):
    """
    A single `pytorch` linear layer
    """

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(INPUT_SIZE, NUM_CLASSES)

    def forward(self, feature):
        output = self.linear(feature)
        return output


@op
def train_model(
    train_loader: DataLoader,
    test_loader: DataLoader,
    learning_rate: float = 0.001,
    num_epochs: int = 5,
) -> Tuple[LogisticRegression, float]:
    # train a logistic regression model with the given loaders and hyperparameters
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = LogisticRegression().to(device)
    loss = torch.nn.CrossEntropyLoss().to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    for epoch in range(num_epochs):
        # train
        for batch_index, (images, labels) in enumerate(train_loader):
            images = images.view(-1, INPUT_SIZE).to(device)
            labels = labels.to(device)
            optimizer.zero_grad()
            output = model(images)
            loss_value = loss(output, labels)
            loss_value.backward()
            optimizer.step()
        # test (gradients are not needed for evaluation)
        accurate, total = 0, 0
        with torch.no_grad():
            for images, labels in test_loader:
                images = images.view(-1, INPUT_SIZE).to(device)
                labels = labels.to(device)
                output = model(images)
                # the predicted class is the index of the largest logit
                _, predicted = torch.max(output, 1)
                total += labels.size(0)
                accurate += (predicted == labels).sum()
        acc = 100 * accurate / total
        print(
            f"Epoch: {epoch}, Training loss: {round(loss_value.item(), 2)}. Test accuracy: {round(acc.item(), 2)}"
        )
    return model, round(float(acc.item()), 2)
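
`torch.nn.CrossEntropyLoss` applies a log-softmax to the outputs of the linear layer and takes the negative log-probability of the true class, which is what turns the single linear layer into a (multinomial) logistic regression. The sketch below reproduces that computation for one example in plain Python. The helper name `cross_entropy` is illustrative only and is not a PyTorch function. It also shows a handy sanity check: when all 10 logits are equal, the loss is ln(10), so a training loss near 2.3 early on suggests the model has barely moved from a uniform guess.

```python
import math


def cross_entropy(logits, label):
    """Negative log-softmax probability of the true class for one example."""
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Equal logits over the 10 MNIST classes: every class has probability 1/10.
uniform_loss = cross_entropy([0.0] * 10, label=3)
print(round(uniform_loss, 4))  # 2.3026, i.e. ln(10)

# A confident, correct prediction drives the loss toward 0.
confident_logits = [0.0] * 10
confident_logits[3] = 10.0
confident_loss = cross_entropy(confident_logits, label=3)
print(confident_loss < 0.01)  # True
```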

## Run the "default" experiment and log the results
Now that you have defined the functions that make up your experiment, you can
run it with the default parameters and log the results. The `@op` decorator on
the functions above tells `mandala` to track the calls to these functions, and
to store the results in the database - but this only happens when you call these
functions in the context of a given *storage*. So go ahead and create a storage
for this project: 

In [11]:
storage = Storage()

This storage will hold the results of all the experiments you run in this
notebook. With the storage created, you can run the pipeline and log the results
as follows:

In [12]:
with storage.run():
    train_loader, test_loader = get_dataloaders()
    model, acc = train_model(train_loader, test_loader)
    print(f"Final accuracy: {acc}")

Epoch: 0, Training loss: 2.28. Test accuracy: 12.76
Epoch: 1, Training loss: 2.26. Test accuracy: 16.09
Epoch: 2, Training loss: 2.27. Test accuracy: 19.58
Epoch: 3, Training loss: 2.24. Test accuracy: 23.42
Epoch: 4, Training loss: 2.2. Test accuracy: 27.18
Final accuracy: ValueRef(27.18, uid=5480f19d29d345a1174392cca86057a649b6d0e480cf4130eb9509193ab1e6c01b9605377ef2a568f74119c805d59194a62fdd320030bd1bb3618f1ad3a09249)


### What just happened?
There is a lot to unpack in these few lines:
- The `storage.run()` context manager tells `mandala` that all the calls to the
functions decorated with `@op` inside this context should be tracked and stored.
- In this context, each time an `@op`-decorated function is called for the first
time on a set of inputs, `mandala` stores the inputs and outputs of this call in
the storage. 
- Values shared between calls are stored only once. For example,
  `train_loader` is saved a single time, even though it is both an output of
  the call to `get_dataloaders` and an input to the call to `train_model`.
- The `acc` object (like all objects returned by `@op`-decorated functions) is a
*value reference*, which is a value wrapped with storage-relevant metadata.
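
To make these points concrete, here is a minimal in-memory sketch of the same ideas. It is a toy under stated assumptions, not how `mandala` is implemented: the names `ToyStorage` and `toy_op` are hypothetical, values are identified by a SHA-256 hash of their pickled bytes, and nothing is persisted to a database.

```python
import hashlib
import pickle


class ToyStorage:
    """Hypothetical stand-in for a storage; not mandala's implementation."""

    def __init__(self):
        self.values = {}  # uid -> value; each distinct value is kept once
        self.calls = {}   # (function name, input uids) -> output uid

    def put(self, value):
        # Content-based uid: equal values map to the same key.
        uid = hashlib.sha256(pickle.dumps(value)).hexdigest()
        self.values.setdefault(uid, value)
        return uid


def toy_op(storage):
    """Decorator that memoizes calls in the given ToyStorage."""
    def decorator(func):
        def wrapper(*args):
            key = (func.__name__, tuple(storage.put(a) for a in args))
            if key not in storage.calls:  # first call on these inputs
                storage.calls[key] = storage.put(func(*args))
            return storage.values[storage.calls[key]]
        return wrapper
    return decorator


toy_storage = ToyStorage()
executions = []  # records which function bodies actually ran


@toy_op(toy_storage)
def make_data(n):
    executions.append("make_data")
    return list(range(n))


@toy_op(toy_storage)
def total(data):
    executions.append("total")
    return sum(data)


for _ in range(2):  # the second pass is served entirely from storage
    result = total(make_data(5))

print(result)                   # 10
print(executions)               # ['make_data', 'total']
print(len(toy_storage.values))  # 3: the input 5, the list, and the sum
```

The list returned by `make_data` is also the input to `total`, yet it occupies a single entry in `toy_storage.values`, mirroring the "stored only once" behavior described above.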

So what happens when you call `@op`-decorated functions a second time on the
same inputs? Find out by running the cell below:

In [13]:
with storage.run():
    train_loader, test_loader = get_dataloaders()
    model, acc = train_model(train_loader, test_loader)
    print(f"Final accuracy: {acc}")

Final accuracy: ValueRef(27.18, uid=5480f19d29d345a1174392cca86057a649b6d0e480cf4130eb9509193ab1e6c01b9605377ef2a568f74119c805d59194a62fdd320030bd1bb3618f1ad3a09249)


Note that this time the code ran much faster. This is because `mandala`
recognized that the inputs to the functions were the same as before, and so it
didn't need to re-run the calls. This is also evident from the lack of output
from the model training!
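
The printed `ValueRef(27.18, uid=...)` shows that `acc` is a value reference rather than a bare float. As a rough mental model only, a value reference can be pictured as a small wrapper that pairs an object with an identifier. The names `ToyRef` and `toy_unwrap` below are hypothetical and do not come from `mandala`; the uid string is a made-up placeholder.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToyRef:
    """Hypothetical wrapper: a value plus an identifier used by a storage."""
    obj: Any
    uid: str


def toy_unwrap(value):
    # Return the underlying object; pass plain values through unchanged.
    return value.obj if isinstance(value, ToyRef) else value


acc_ref = ToyRef(obj=27.18, uid="placeholder-uid")
print(acc_ref)                        # ToyRef(obj=27.18, uid='placeholder-uid')
print(round(toy_unwrap(acc_ref), 2))  # 27.18
print(toy_unwrap(0.5))                # 0.5
```

Under this picture, arithmetic or formatting on the underlying number requires unwrapping the reference first.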

## Explore parameters while reusing past results
Where `mandala` really shines is in the ability to minimally change the code of
your experiment to efficiently explore different parameters, add new logic on
top of a workflow, and query existing results. Let's see how this works by
exploring the effect of the learning rate on the accuracy of the model.

In [14]:
with storage.run():
    train_loader, test_loader = get_dataloaders()
    for learning_rate in [0.001, 0.01, 0.1]:
        model, acc = train_model(train_loader, test_loader, learning_rate)
        print(
            f"===End of run=== learning_rate: {learning_rate}, acc: {round(unwrap(acc), 2)}"
        )

===End of run=== learning_rate: 0.001, acc: 27.18
Epoch: 0, Training loss: 2.12. Test accuracy: 45.78
Epoch: 1, Training loss: 1.92. Test accuracy: 61.59
Epoch: 2, Training loss: 1.84. Test accuracy: 68.46
Epoch: 3, Training loss: 1.68. Test accuracy: 71.69
Epoch: 4, Training loss: 1.56. Test accuracy: 73.57
===End of run=== learning_rate: 0.01, acc: 73.57
Epoch: 0, Training loss: 1.23. Test accuracy: 78.13
Epoch: 1, Training loss: 0.86. Test accuracy: 81.3
Epoch: 2, Training loss: 0.75. Test accuracy: 83.19
Epoch: 3, Training loss: 0.62. Test accuracy: 84.03
Epoch: 4, Training loss: 0.6. Test accuracy: 85.13
===End of run=== learning_rate: 0.1, acc: 85.13


Among these three values, a higher learning rate gives a better test accuracy
after 5 epochs. Now, let's try varying the batch size as well:

In [15]:
with storage.run():
    for batch_size in [100, 200, 400]:
        train_loader, test_loader = get_dataloaders(batch_size=batch_size)
        for learning_rate in [0.001, 0.01, 0.1]:
            model, acc = train_model(train_loader, test_loader, learning_rate)
            print(
                f"===End of run=== batch_size: {batch_size}, learning_rate: {learning_rate}, acc: {round(unwrap(acc), 2)}"
            )

===End of run=== batch_size: 100, learning_rate: 0.001, acc: 27.18
===End of run=== batch_size: 100, learning_rate: 0.01, acc: 73.57
===End of run=== batch_size: 100, learning_rate: 0.1, acc: 85.13
Epoch: 0, Training loss: 2.32. Test accuracy: 13.79
Epoch: 1, Training loss: 2.29. Test accuracy: 15.01
Epoch: 2, Training loss: 2.28. Test accuracy: 16.4
Epoch: 3, Training loss: 2.28. Test accuracy: 17.9
Epoch: 4, Training loss: 2.27. Test accuracy: 19.61
===End of run=== batch_size: 200, learning_rate: 0.001, acc: 19.61
Epoch: 0, Training loss: 2.2. Test accuracy: 29.91
Epoch: 1, Training loss: 2.1. Test accuracy: 47.85
Epoch: 2, Training loss: 2.02. Test accuracy: 57.92
Epoch: 3, Training loss: 1.91. Test accuracy: 64.49
Epoch: 4, Training loss: 1.83. Test accuracy: 68.17
===End of run=== batch_size: 200, learning_rate: 0.01, acc: 68.17
Epoch: 0, Training loss: 1.59. Test accuracy: 74.63
Epoch: 1, Training loss: 1.18. Test accuracy: 78.29
Epoch: 2, Training loss: 0.98. Test accuracy: 80.

## Query the results
By now, you have run the pipeline with many different combinations of
parameters, and it's getting difficult to make sense of all the results so far.
One option to get to the results is to just re-run the above workflow, or a "sub-workflow"
of it. For example, how might you get all the results for a given learning rate,
e.g. `learning_rate=0.1`?

One answer: just by re-running the subset of the above code using this value of
the learning rate:

In [17]:
with storage.run():
    for batch_size in [100, 200, 400]:
        train_loader, test_loader = get_dataloaders(batch_size=batch_size)
        for learning_rate in [0.1]:  # only change relative to previous cell
            model, acc = train_model(train_loader, test_loader, learning_rate)
            print(
                f"===end of run=== batch_size: {batch_size}, learning_rate: {learning_rate}, acc: {round(unwrap(acc), 2)}"
            )

===end of run=== batch_size: 100, learning_rate: 0.1, acc: 85.13
===end of run=== batch_size: 200, learning_rate: 0.1, acc: 82.81
===end of run=== batch_size: 400, learning_rate: 0.1, acc: 79.42


This kind of storage access pattern is called **retracing**: you "retrace"
computational code that you have already run before in order to recover the
quantities computed along the way. You can use retracing to *imperatively* query
existing results (like you did above), or to easily add new parameters/logic
that need to compute over existing results.
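
The retracing pattern can be illustrated without `mandala` at all. The sketch below uses `functools.lru_cache` from the standard library as an in-memory stand-in for a storage; `fake_train` is a hypothetical placeholder that returns a label instead of training anything. After the full grid has been run once, re-running the `learning_rate=0.1` slice triggers no new executions.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fake_train(batch_size, learning_rate):
    # Placeholder for an expensive training run.
    return f"model(batch_size={batch_size}, learning_rate={learning_rate})"


# First pass: run the full grid, so every call is a cache miss.
for batch_size in [100, 200, 400]:
    for learning_rate in [0.001, 0.01, 0.1]:
        fake_train(batch_size, learning_rate)
misses_after_grid = fake_train.cache_info().misses

# Retrace: re-run only the learning_rate=0.1 slice of the same code.
retraced = {bs: fake_train(bs, 0.1) for bs in [100, 200, 400]}
info = fake_train.cache_info()

print(misses_after_grid, info.misses, info.hits)  # 9 9 3
```

Unlike this in-memory cache, the tutorial's storage also records how inputs and outputs relate to each other, which is what the declarative queries below rely on.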

However, sometimes you don't have a specific piece of code to retrace and just
want to look at all the results in storage. For this, you can use a
*declarative* query interface via the `storage.query()` context manager:

In [19]:
with storage.query() as q:
    batch_size = Q().named("batch_size")
    train_loader, test_loader = get_dataloaders(batch_size=batch_size)
    learning_rate = Q().named("learning_rate")
    model, acc = train_model(train_loader, test_loader, learning_rate)
    df = q.get_table(batch_size, learning_rate, acc.named("accuracy"))

df.sort_values(by=["accuracy"], ascending=False)

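|   | batch_size | learning_rate | accuracy |
|---|------------|---------------|----------|
| 3 | 100 | 0.1 | 85.13 |
| 8 | 200 | 0.1 | 82.81 |
| 4 | 400 | 0.1 | 79.42 |
| 6 | 100 | 0.01 | 73.57 |
| 2 | 200 | 0.01 | 68.17 |
| 7 | 400 | 0.01 | 47.5 |
| 0 | 100 | 0.001 | 27.18 |
| 5 | 200 | 0.001 | 19.61 |
| 1 | 400 | 0.001 | 14.46 |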
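
One pattern in the table above is that, for every learning rate, accuracy drops as the batch size grows. A plausible contributor (an interpretation, not something the experiments isolate) is that with a fixed number of epochs a larger batch size means fewer SGD updates. The sketch below checks both observations in plain Python, using the training subset size and default `num_epochs` from the function definitions earlier and the rows copied from the table.

```python
import math

TRAIN_SIZE = 2_000  # size of the training Subset in get_dataloaders
NUM_EPOCHS = 5      # default num_epochs in train_model

# (batch_size, learning_rate, accuracy) rows copied from the table above
rows = [
    (100, 0.1, 85.13), (200, 0.1, 82.81), (400, 0.1, 79.42),
    (100, 0.01, 73.57), (200, 0.01, 68.17), (400, 0.01, 47.5),
    (100, 0.001, 27.18), (200, 0.001, 19.61), (400, 0.001, 14.46),
]

# Total optimizer steps per run: batches per epoch times number of epochs.
updates = {bs: NUM_EPOCHS * math.ceil(TRAIN_SIZE / bs) for bs in (100, 200, 400)}
print(updates)  # {100: 100, 200: 50, 400: 25}

# For each learning rate, list accuracies in order of increasing batch size.
trend = {}
for lr in (0.001, 0.01, 0.1):
    by_batch = sorted((bs, acc) for bs, rate, acc in rows if rate == lr)
    trend[lr] = [acc for _, acc in by_batch]

# Accuracy decreases with batch size at every learning rate in the table.
print(all(accs == sorted(accs, reverse=True) for accs in trend.values()))  # True
```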
