# 🎉 Introducing TorchEval! 🎉

### Fork this notebook on [Google Colab](https://colab.research.google.com/github/pytorch/torcheval/blob/main/examples/Introducing_TorchEval.ipynb)

First, let's install TorchEval from PyPI.

In [None]:
%%capture
!pip install torcheval

TorchEval is a library built for users who want highly performant implementations of common metrics to evaluate machine learning models. It also provides an easy-to-use interface for building custom metrics with the same toolkit. Building your metrics with TorchEval makes running distributed training loops with `torch.distributed` a breeze.

# Using Metrics

Let's set up a simple sequential model with one linear layer and a ReLU activation.

We'll also define a function that runs some random data through the randomly initialized network and returns the network output along with some random labels.


In [1]:
import torch
BATCH_SIZE = 8
INPUT_SIZE = 10
NUM_CLASSES = 6

model = torch.nn.Sequential(torch.nn.Linear(INPUT_SIZE, NUM_CLASSES), torch.nn.ReLU())

def get_outputs_and_targets():
    input = torch.rand(size=(BATCH_SIZE, INPUT_SIZE))
    target = torch.randint(size=(BATCH_SIZE,), high=NUM_CLASSES)
    outputs = model(input)
    return outputs, target

### Functional Implementations ([Docs](https://pytorch.org/torcheval/main/torcheval.metrics.functional.html))

Now let's see how accurate our random model is, using a functional implementation of multiclass accuracy.

In [2]:
from torcheval.metrics.functional import multiclass_accuracy

outputs, target = get_outputs_and_targets()

print(
    multiclass_accuracy(outputs, target)
)  # just computes the metric value for this batch of data

tensor(0.1250)
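
To make the number above less of a black box, here is a dependency-free sketch of what we understand multiclass accuracy to reduce to under the default settings: take the highest-scoring class in each row and report the fraction of rows where it matches the target. The helper names `argmax` and `accuracy_from_scores` are ours for illustration and are not part of TorchEval, and the tie-breaking rule (first maximum wins) is a simplifying assumption.

```python
def argmax(scores):
    """Return the index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy_from_scores(score_rows, targets):
    """Fraction of rows whose highest-scoring class equals the target label."""
    predictions = [argmax(row) for row in score_rows]
    num_correct = sum(int(p == t) for p, t in zip(predictions, targets))
    return num_correct / len(targets)


# Hypothetical scores for a batch of 4 examples and 3 classes.
scores = [
    [0.1, 0.7, 0.2],  # predicted class 1
    [0.9, 0.0, 0.1],  # predicted class 0
    [0.3, 0.3, 0.4],  # predicted class 2
    [0.0, 0.2, 0.8],  # predicted class 2
]
targets = [1, 2, 2, 0]

batch_accuracy = accuracy_from_scores(scores, targets)
print(batch_accuracy)  # 2 of the 4 predictions match their targets
```

Like the functional call above, this only describes a single batch; nothing is remembered between calls.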


### Class Based Implementations ([Docs](https://pytorch.org/torcheval/main/torcheval.metrics.html))

Now let's assume we have a few batches of data that we want to run through the model, aggregating the results. This can be done easily with our interfaces inheriting from the `Metric` class.

In [3]:
from torcheval.metrics import MulticlassAccuracy

metric = MulticlassAccuracy()
for i in range(10):
    outputs, target = get_outputs_and_targets()
    metric.update(outputs, target) # updates internal state variables with this batch of data 
                                   # those state variables will be used to compute the metric on all 10 batches of data

print(metric.compute()) # compute metric on all 10 batches. Call metric.reset() to clear internal variables 
                        # and compute metric on a new set of data.

tensor(0.2000)


Deferred computation of the metric is a useful trick for speeding up evaluation loops. In some cases, accumulating state can be done quickly, while metric computation is slow.

Now let's reset the internal state variables of our metric. Any following calls to `.compute()` will only compute the metric over data passed in since the last `.reset()`.

In [7]:
metric.reset()
metric.update(torch.tensor([1,0,1]), torch.tensor([1,0,1])) # should give 100% accuracy
print(metric.compute())

tensor(1.)
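
The `update()` / `compute()` / `reset()` cycle used above can be pictured with a small plain-Python stand-in. This is only a sketch of the pattern, not TorchEval's implementation: the class `RunningAccuracy` and its two counters are hypothetical names, and it takes already-predicted labels rather than score tensors. It also shows why it helps to keep counts as state instead of averaging per-batch values: the two disagree when batches have different sizes.

```python
class RunningAccuracy:
    """Toy stateful metric: accumulate counts in update(), divide in compute()."""

    def __init__(self):
        self.num_correct = 0
        self.num_total = 0

    def update(self, predictions, targets):
        # Cheap bookkeeping only; no metric value is produced here.
        self.num_correct += sum(int(p == t) for p, t in zip(predictions, targets))
        self.num_total += len(targets)
        return self

    def compute(self):
        # Uses everything seen since construction or the last reset().
        return self.num_correct / self.num_total

    def reset(self):
        self.num_correct = 0
        self.num_total = 0
        return self


running = RunningAccuracy()
running.update([1, 0, 2, 2], [1, 2, 2, 0])  # 2 of 4 correct
running.update([0, 1], [0, 1])              # 2 of 2 correct
pooled = running.compute()                  # 4 correct out of 6 examples
mean_of_batches = (2 / 4 + 2 / 2) / 2       # averaging batch values instead

running.reset()
running.update([1, 0, 1], [1, 0, 1])
after_reset = running.compute()             # only the post-reset data counts

print(pooled, mean_of_batches, after_reset)
```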


### In A Distributed Setting

With distributed training loops, we also need to synchronize results across processes. TorchEval makes this very simple. Let's take a look at a basic training loop.

We need some boilerplate to get distributed training running.

In [None]:
import torch.distributed as dist
import torch.distributed.launcher as pet
import uuid

lc = pet.LaunchConfig(
        min_nodes=1,
        max_nodes=1,
        nproc_per_node=4,
        run_id=str(uuid.uuid4()),
        rdzv_backend="c10d",
        rdzv_endpoint="localhost:0",
        max_restarts=0,
        monitor_interval=1,
    )

lc.start_method = "fork"

Now let's set up a distributed loop to do inference.

In [None]:
from torcheval.metrics.toolkit import sync_and_compute # import sync_and_compute from our toolkit to sync data between processes
from torcheval.metrics import MulticlassF1Score

def distributed_loop():
    dist.init_process_group(backend="gloo")
    metric = MulticlassF1Score(num_classes=NUM_CLASSES)
    for i in range(10):
        input = torch.rand(size=(BATCH_SIZE, INPUT_SIZE))
        target = torch.randint(size=(BATCH_SIZE,), high=NUM_CLASSES)
        outputs = model(input).detach()
        metric.update(outputs, target)

    return sync_and_compute(metric) # include recipient_rank="all" for each process to return computed metric

In [None]:
batch_values = pet.elastic_launch(lc, entrypoint=distributed_loop)()
print(batch_values) # process with label "rank 0" is using all 40 batches of data

{0: tensor(0.1750), 1: None, 2: None, 3: None}


You can see that on rank 0, we got back a metric value which was computed using all the data across processes. If we had used `sync_and_compute(metric, recipient_rank="all")`, then each process would return `tensor(0.1750)`.
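
The effect of `recipient_rank` can be mimicked in a single process. The following is a rough simulation under our own assumptions, not how TorchEval performs the sync (the real thing communicates between processes through `torch.distributed`): each rank's state is reduced to a hypothetical `(num_correct, num_total)` pair, which is simpler than the state an F1 metric would keep, and `simulate_sync_and_compute` is an illustrative name.

```python
def simulate_sync_and_compute(per_rank_state, recipient_rank=0):
    """Merge every rank's (num_correct, num_total) and hand the result
    only to the recipient rank, or to every rank if recipient_rank == "all"."""
    total_correct = sum(correct for correct, _ in per_rank_state.values())
    total_seen = sum(seen for _, seen in per_rank_state.values())
    global_value = total_correct / total_seen

    results = {}
    for rank in sorted(per_rank_state):
        receives = recipient_rank == "all" or recipient_rank == rank
        results[rank] = global_value if receives else None
    return results


# Hypothetical per-rank counts: 4 ranks, 10 batches of 8 examples each.
per_rank_state = {0: (15, 80), 1: (20, 80), 2: (10, 80), 3: (19, 80)}

only_rank_0 = simulate_sync_and_compute(per_rank_state)
every_rank = simulate_sync_and_compute(per_rank_state, recipient_rank="all")

print(only_rank_0)
print(every_rank)
```

In both cases the value is computed from all 320 examples; the option only changes which ranks get it back.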

A more complete example of a distributed training setup is in our [examples directory](https://github.com/pytorch/torcheval/blob/main/examples/distributed_example.py).

# Adding your own metric

To add your own metric, you simply need to inherit from `Metric` and implement four methods.

1. `__init__(self)`: Defines the state variables.
2. `update(self, *args)`: Determines how to update the state variables with new data.
3. `compute(self)`: Computes the metric from the state variables.
4. `merge_state(self, metrics)`: Describes how to merge the internal states when the metric objects from all processes are collected on a single device.
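
Before turning to a real metric, the four methods can be sketched in plain Python. This stand-in deliberately does not inherit from `Metric`, and the class name `RunningMean` and its attributes are hypothetical. It is only meant to show the property a `merge_state` implementation should have: merging per-process metrics gives the same answer as one metric that saw all the data.

```python
class RunningMean:
    """Plain-Python stand-in with the same four methods a custom metric defines."""

    def __init__(self):
        # 1. Define the state variables.
        self.total = 0.0
        self.count = 0

    def update(self, values):
        # 2. Fold a new batch into the state.
        self.total += sum(values)
        self.count += len(values)
        return self

    def compute(self):
        # 3. Turn the state into the metric value.
        return self.total / self.count

    def merge_state(self, metrics):
        # 4. Absorb the state of metrics that were updated elsewhere.
        for other in metrics:
            self.total += other.total
            self.count += other.count
        return self


# Three "processes", each seeing a different shard of the data.
shards = [[1, 2, 3], [4, 5], [6]]
per_process = [RunningMean().update(shard) for shard in shards]

merged = per_process[0].merge_state(per_process[1:])
single = RunningMean().update([x for shard in shards for x in shard])

print(merged.compute(), single.compute())
```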

Below, we implement the Kolmogorov-Smirnov two-sample test, utilizing the implementation already in `scipy`.

As a brief refresher, the two-sample KS statistic is the maximum absolute difference between the empirical CDFs of the two samples. If the two sets of samples come from the same distribution, the KS statistic will tend toward 0 as the amount of data grows.
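
For intuition, the statistic itself fits in a few lines of standard-library Python. This is a minimal sketch of the definition (the largest gap between the two empirical CDFs), not a replacement for `scipy`: the function name `ks_statistic` is ours, and it does not compute a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_1, sample_2):
    """Largest absolute gap between the empirical CDFs of two samples."""
    sorted_1, sorted_2 = sorted(sample_1), sorted(sample_2)
    n_1, n_2 = len(sorted_1), len(sorted_2)

    largest_gap = 0.0
    # Empirical CDFs are step functions, so the gap can only change at
    # observed values; checking those points is enough.
    for x in sorted_1 + sorted_2:
        cdf_1 = bisect_right(sorted_1, x) / n_1  # fraction of sample_1 <= x
        cdf_2 = bisect_right(sorted_2, x) / n_2  # fraction of sample_2 <= x
        largest_gap = max(largest_gap, abs(cdf_1 - cdf_2))
    return largest_gap


identical = ks_statistic([1, 2, 3], [1, 2, 3])       # same values, no gap
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])        # no overlap at all
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])   # partial overlap

print(identical, disjoint, shifted)
```

Note that both samples have to be sorted before anything can be compared, which is the expensive part of the computation.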

In [None]:
from torcheval.metrics.metric import Metric
from scipy.stats import ks_2samp

class KS_2Samp(Metric[torch.Tensor]):
    def __init__(self, device = None) -> None:
        super().__init__(device=device)
        # Keep a list of the samples
        self._add_state("dist_1_samples", torch.tensor([], device=self.device)) 
        self._add_state("dist_2_samples", torch.tensor([], device=self.device))

    @torch.inference_mode() # turn off autograd and apply some automatic optimizations
    def update(self, new_samples_dist_1, new_samples_dist_2):
        # When new data comes in, just add them to the list of samples
        self.dist_1_samples = torch.cat((self.dist_1_samples, new_samples_dist_1))
        self.dist_2_samples = torch.cat((self.dist_2_samples, new_samples_dist_2))
        return self

    @torch.inference_mode()
    def compute(self):
        print("Computing with", self.dist_1_samples.shape[0], "samples", end=": ")  # just for show
        # Let scipy do the hard work
        return ks_2samp(
            self.dist_1_samples.cpu().detach().numpy(),
            self.dist_2_samples.cpu().detach().numpy(),
        )

    @torch.inference_mode()
    def merge_state(self, metrics):
        # Merging the states just means concatenating all the samples for each distribution
        dist_1_samples = [self.dist_1_samples, ]
        dist_2_samples = [self.dist_2_samples, ]
        for metric in metrics:
            dist_1_samples.append(metric.dist_1_samples)
            dist_2_samples.append(metric.dist_2_samples)
        self.dist_1_samples = torch.cat(dist_1_samples)
        self.dist_2_samples = torch.cat(dist_2_samples)
        return self

Let's check whether the implementation works. We'll draw 10,000 samples from a uniform distribution twice and see what the KS statistic is between the two sets.

In [None]:
metric = KS_2Samp()
metric.update(torch.rand(10000), torch.rand(10000))
print(metric.compute()) # statistic should be close to 0 since they have the same underlying distribution

Computing with 10000 samples: KstestResult(statistic=0.0173, pvalue=0.10027104449847714)


Now let's check how the accumulation of state is working by printing the KS statistic with more and more samples.

In [None]:
metric = KS_2Samp()
for i in range(10):
    metric.update(torch.rand(500), torch.rand(500))
    print(metric.compute()) # statistic should fall with more data

print()
print("=========")
print("Resetting Metric")
print("=========")
print()

metric.reset()
metric.update(torch.rand(3), torch.rand(5))
print("KS Statistic for a small random batch:", metric.compute())  # statistic will likely be large with small amounts of data

Computing with 500 samples: KstestResult(statistic=0.048, pvalue=0.6126241113875229)
Computing with 1000 samples: KstestResult(statistic=0.038, pvalue=0.4659595288557257)
Computing with 1500 samples: KstestResult(statistic=0.035333333333333335, pvalue=0.3063862891844912)
Computing with 2000 samples: KstestResult(statistic=0.0235, pvalue=0.6388604192561329)
Computing with 2500 samples: KstestResult(statistic=0.0192, pvalue=0.7462473796111823)
Computing with 3000 samples: KstestResult(statistic=0.018666666666666668, pvalue=0.6728446559019895)
Computing with 3500 samples: KstestResult(statistic=0.015142857142857144, pvalue=0.8171893962320109)
Computing with 4000 samples: KstestResult(statistic=0.01325, pvalue=0.8740415683356425)
Computing with 4500 samples: KstestResult(statistic=0.015555555555555555, pvalue=0.6476780146655224)
Computing with 5000 samples: KstestResult(statistic=0.0164, pvalue=0.5120142730148939)

Resetting Metric

Computing with 3 samples: KS Statistic for a small random

See how the metric accumulates state over time. As we get more examples, the KS statistic gets smaller (roughly). 

Deferring computation of the metric is a useful trick when accumulating state is fast but computing the metric is slow. This is the case for the KS test, because computing it requires sorting the sample tensors.
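
A toy count makes the saving concrete. The sketch below uses a hypothetical `SortCountingMetric` whose `update()` only appends samples and whose `compute()` sorts everything accumulated so far, recording how many elements it had to sort. The number of sorted elements is just a rough proxy for cost, not a benchmark.

```python
SAMPLES_PER_UPDATE = 500
NUM_UPDATES = 10


class SortCountingMetric:
    """Toy metric: cheap append in update(), full sort in compute()."""

    def __init__(self):
        self.samples = []
        self.elements_sorted = 0  # bookkeeping for this demo only

    def update(self, new_samples):
        self.samples.extend(new_samples)
        return self

    def compute(self):
        ordered = sorted(self.samples)
        self.elements_sorted += len(ordered)
        return ordered[-1]  # the maximum, as a stand-in statistic


def batch(i):
    """Deterministic stand-in for a batch of 500 samples."""
    return range(i * SAMPLES_PER_UPDATE, (i + 1) * SAMPLES_PER_UPDATE)


# Calling compute() after every update re-sorts all samples seen so far.
eager = SortCountingMetric()
for i in range(NUM_UPDATES):
    eager.update(batch(i))
    eager.compute()

# Deferring compute() to the end sorts the full set once.
lazy = SortCountingMetric()
for i in range(NUM_UPDATES):
    lazy.update(batch(i))
lazy.compute()

print(eager.elements_sorted, lazy.elements_sorted)
```

Both approaches end with the same final value; the eager loop simply sorts 500 + 1000 + ... + 5000 elements along the way instead of 5000 once.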

# Running In A Distributed Setting

Since we've built our metric on top of TorchEval, running it in a distributed loop is a breeze.

In [None]:
def distributed_ksloop():
    dist.init_process_group(backend="gloo")
    metric = KS_2Samp()
    for i in range(10):
        metric.update(torch.rand(500), torch.rand(500))

    return sync_and_compute(metric)

In [None]:
batch_values = pet.elastic_launch(lc, entrypoint=distributed_ksloop)()
print(batch_values[0])

Computing with 20000 samples: 

KstestResult(statistic=0.016000000000000014, pvalue=0.011823127020348602)


Notice that our final result is computed with 20,000 samples, 5,000 from each process!

# Conclusions

Thanks for checking out TorchEval! 

Please check out our [Docs](https://pytorch.org/torcheval/main/), [More Examples](https://pytorch.org/torcheval/main/metric_example.html), and [GitHub Repo](https://github.com/pytorch/torcheval).
