# HFTA Tutorial

This notebook demonstrates how to integrate HFTA into a simple MNIST training example.

## Setup

Install the HFTA library from GitHub.

In [None]:
!pip install git+https://github.com/UofT-EcoSystem/hfta

### Demo with a benchmark

Here is a demo run on one of the benchmarks provided in the `hfta` GitHub repo.

Check [here](https://github.com/UofT-EcoSystem/hfta/tree/main/examples/mobilenet) for the code of this example (MobileNet V2).

If you want a simpler example of how to use HFTA with an ordinary PyTorch model, see the next section.

In [None]:
# Clone the GitHub repo to get the benchmark scripts
!git clone https://github.com/UofT-EcoSystem/hfta

In [None]:
# Run the MobileNet V2 benchmark
!python hfta/examples/mobilenet/main.py --version v2 --epochs 5 --amp --eval --dataset cifar10 --device cuda --lr 0.01 0.02 0.03 --hfta

## Train an MNIST model without HFTA

In this section, we use a simpler example to show how to convert an ordinary PyTorch model to its HFTA version.

We train a simple neural network with two convolutional layers and two fully connected layers, together with max-pooling and dropout layers. The model is trained on the MNIST dataset.
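
The model below uses `nn.Linear(9216, 128)` as its first fully connected layer. As a quick sanity check, the following sketch derives that number from the layer settings used in this notebook (3x3 convolutions with stride 1 and no padding, followed by a 2x2 max pool). The helper `conv_out` is a hypothetical name introduced only for this illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
  """Output length of one spatial dimension after a convolution or pooling."""
  return (size + 2 * padding - kernel) // stride + 1


size = 28                                   # MNIST images are 28x28
size = conv_out(size, kernel=3)             # conv1: 28 -> 26
size = conv_out(size, kernel=3)             # conv2: 26 -> 24
size = conv_out(size, kernel=2, stride=2)   # max_pool2d(2): 24 -> 12
flat_features = 64 * size * size            # conv2 produces 64 channels
print(flat_features)  # 9216
```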

### Define the non-HFTA model

In [None]:
import time
import random
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

try:
  import torch_xla
  import torch_xla.core.xla_model as xm
  import torch_xla.debug.metrics as met
except ImportError:
  pass


class Net(nn.Module):

  def __init__(self):
    super(Net, self).__init__()
    self.conv1 = nn.Conv2d(1, 32, 3, 1)
    self.conv2 = nn.Conv2d(32, 64, 3, 1)
    self.max_pool2d = nn.MaxPool2d(2)
    self.fc1 = nn.Linear(9216, 128)
    self.fc2 = nn.Linear(128, 10)
    self.dropout1 = nn.Dropout2d(0.25)
    self.dropout2 = nn.Dropout2d(0.5)

  def forward(self, x):
    x = self.conv1(x)
    x = F.relu(x)
    x = self.conv2(x)
    x = F.relu(x)
    x = self.max_pool2d(x)
    x = self.dropout1(x)
    x = torch.flatten(x, 1)
    x = self.fc1(x)
    x = F.relu(x)
    x = self.dropout2(x)
    x = self.fc2(x)
    output = F.log_softmax(x, dim=1)
    return output

### Define the training and testing loop

In [None]:
def train(config, model, device, train_loader, optimizer, epoch):
  model.train()
  for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = model(data)
    loss = F.nll_loss(output, target)
    loss.backward()
    if config["device"] == 'xla':
      xm.optimizer_step(optimizer, barrier=True)
    else:
      optimizer.step()
    if batch_idx % config["log_interval"] == 0:
      print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
          epoch,
          batch_idx * len(data),
          len(train_loader.dataset),
          100. * batch_idx / len(train_loader),
          loss.item(),
      ))
      if config["dry_run"]:
        break


def test(model, device, test_loader):
  model.eval()
  test_loss = 0
  correct = 0
  with torch.no_grad():
    for data, target in test_loader:
      data, target = data.to(device), target.to(device)
      N = target.size(0)
      output = model(data)
      test_loss += F.nll_loss(output, target,
                              reduction='none').view(-1, N).sum(dim=1)
      pred = output.argmax(dim=1, keepdim=True)
      correct += pred.eq(target.view_as(pred)).view(-1, N).sum(dim=1)

  length = len(test_loader.dataset)
  test_loss /= length
  loss_str = ["%.4f" % e for e in test_loss]
  correct_str = [
      "%d/%d(%.2lf%%)" % (e, length, 100. * e / length) for e in correct
  ]
  print('Test set: \tAverage loss: {}, \n \t\t\tAccuracy: {}\n'.format(
      loss_str, correct_str))

### Define the main loop

In [None]:
def main(config):
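  # Seed all random number generators from the config for reproducibility
  random.seed(config["seed"])
  np.random.seed(config["seed"])
  torch.manual_seed(config["seed"])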

  device = (torch.device(config["device"])
            if config["device"] in {'cpu', 'cuda'} else xm.xla_device())

  kwargs = {'batch_size': config["batch_size"]}
  kwargs.update({'num_workers': 1, 'pin_memory': True, 'shuffle': True},)

  transform = transforms.Compose(
      [transforms.ToTensor(),
       transforms.Normalize((0.1307,), (0.3081,))])

  dataset1 = datasets.MNIST('./data',
                            train=True,
                            download=True,
                            transform=transform)
  dataset2 = datasets.MNIST('./data', train=False, transform=transform)
  train_loader = torch.utils.data.DataLoader(dataset1, **kwargs)
  test_loader = torch.utils.data.DataLoader(dataset2, **kwargs)

  model = Net().to(device)

  optimizer = optim.Adadelta(
      model.parameters(),
      lr=config["lr"][0],
  )

  start = time.perf_counter()
  for epoch in range(1, config["epochs"] + 1):
    now = time.perf_counter()
    train(config, model, device, train_loader, optimizer, epoch)
    print('Epoch {} took {} s!'.format(epoch, time.perf_counter() - now))
  end = time.perf_counter()

  test(model, device, test_loader)

  print('All jobs Finished, Each epoch took {} s on average!'.format(
      (end - start) / config["epochs"]))

### Train the model

In [None]:
config = {
    "device": "cuda",
    "batch_size": 64,
    "lr": [1.0],
    "gamma": 0.7,
    "epochs": 4,
    "seed": 1,
    "log_interval": 500,
    "dry_run": False,
    "save_model": False,
}

print(config)
main(config)

## Improve hardware utilization with HFTA

### How to modify an MNIST model to use HFTA

Currently, to benefit from HFTA you need to modify your model, training loop, and data shapes. The HFTA package provides convenient converters to facilitate this, and we are working on automating the process.

Check the comments in the code to see what needs to be done. In this example, we use HFTA to fuse multiple models with different learning rates, which improves hardware utilization. The learning rate is not the only option: other hyperparameters can be fused as well.
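
To build some intuition for what "fusing" means, here is a minimal conceptual sketch in plain PyTorch. It assumes that fusing B linear layers amounts to stacking their parameters along a new model dimension and running one batched matrix multiply. This is an illustration of the idea only, not HFTA's actual implementation, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, N, in_features, out_features = 3, 4, 8, 5  # illustrative sizes

# B independent layers, as they would exist in B separate training jobs.
layers = [nn.Linear(in_features, out_features) for _ in range(B)]

with torch.no_grad():
  # Stack the per-model parameters along a new leading "model" dimension.
  weight = torch.stack([l.weight.t() for l in layers])        # [B, in, out]
  bias = torch.stack([l.bias for l in layers]).unsqueeze(1)   # [B, 1, out]

  x = torch.randn(B, N, in_features)  # one batch of N samples per model

  # One batched matrix multiply replaces B separate matrix multiplies.
  fused = torch.baddbmm(bias, x, weight)                      # [B, N, out]
  separate = torch.stack([layers[b](x[b]) for b in range(B)])

print(fused.shape)
print(torch.allclose(fused, separate, atol=1e-5))
```

The fused computation launches one larger kernel instead of B small ones, which is where the utilization gain is expected to come from.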

#### Modify the model

In [None]:
from __future__ import print_function
import sys
import time
import random
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

try:
  import torch_xla
  import torch_xla.core.xla_model as xm
  import torch_xla.debug.metrics as met
except ImportError:
  pass

# Use helper functions from the hfta package to convert your operators and optimizers
from hfta.ops import get_hfta_op_for
from hfta.optim import get_hfta_optim_for


class Net(nn.Module):

  # When initializing the model, save the number of fused models (B) and convert
  # each default operator to its HFTA version with get_hfta_op_for(<default>, B).
  # In this notebook, B=0 means HFTA is disabled.
  def __init__(self, B=0):
    super(Net, self).__init__()
    self.B = B
    self.conv1 = get_hfta_op_for(nn.Conv2d, B=B)(1, 32, 3, 1)
    self.conv2 = get_hfta_op_for(nn.Conv2d, B=B)(32, 64, 3, 1)
    self.max_pool2d = get_hfta_op_for(nn.MaxPool2d, B=B)(2)
    self.fc1 = get_hfta_op_for(nn.Linear, B=B)(9216, 128)
    self.fc2 = get_hfta_op_for(nn.Linear, B=B)(128, 10)
    self.dropout1 = get_hfta_op_for(nn.Dropout2d, B=B)(0.25)
    self.dropout2 = get_hfta_op_for(nn.Dropout2d, B=B)(0.5)

  # Minor modifications to the forward pass on special operators.
  # Check the documentation of each operator for details.
  def forward(self, x):
    x = self.conv1(x)
    x = F.relu(x)
    x = self.conv2(x)
    x = F.relu(x)
    x = self.max_pool2d(x)
    x = self.dropout1(x)

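    # In the fused case x is 5-D with N and B on the first two dimensions.
    # Flatten from dim 2 to keep both, then swap them so the model
    # dimension comes first for the linear layers.
    if self.B > 0:
      x = torch.flatten(x, 2)
      x = x.transpose(0, 1)
    else:
      x = torch.flatten(x, 1)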

    x = self.fc1(x)
    x = F.relu(x)
    x = self.dropout2(x)
    x = self.fc2(x)

    # In the fused case the output is 3-D, so the class dimension is dim 2.
    if self.B > 0:
      output = F.log_softmax(x, dim=2)
    else:
      output = F.log_softmax(x, dim=1)

    return output
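
The reshaping in `forward` is easier to follow with concrete shapes. The sketch below traces them with random tensors in plain PyTorch, assuming the fused layout implied by the code above: `[N, B, C, H, W]` for the convolutional part and `[B, N, features]` for the linear part. The values of `N` and `B` are arbitrary.

```python
import torch

N, B = 4, 3  # batch size and number of fused models (illustrative values)

# Stand-in for the output of the convolutional part: [N, B, C, H, W].
x = torch.randn(N, B, 64, 12, 12)

x = torch.flatten(x, 2)   # keep N and B, flatten C*H*W -> [N, B, 9216]
x = x.transpose(0, 1)     # model dimension first        -> [B, N, 9216]
print(x.shape)

# Stand-in for the output of fc2: [B, N, 10], so classes live on dim=2.
logits = torch.randn(B, N, 10)
log_probs = torch.log_softmax(logits, dim=2)
row_sums = log_probs.exp().sum(dim=2)  # probabilities sum to ~1 per sample
print(row_sums.shape)
```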

#### Modify the training and testing loop

In [None]:
def train(config, model, device, train_loader, optimizer, epoch, B):
  model.train()
  for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)

    # Replicate the input batch for each of the B fused models:
    # data gains a model dimension of size B, and target becomes [B * N].
    if B > 0:
      N = target.size(0)
      data = data.unsqueeze(1).expand(-1, B, -1, -1, -1)
      target = target.repeat(B)

    optimizer.zero_grad()
    output = model(data)

    # The loss must also account for the fused models: nll_loss averages over
    # all B * N entries, so multiply by B to get the sum of per-model mean losses.
    if B > 0:
      loss = B * F.nll_loss(output.view(B * N, -1), target)
    else:
      loss = F.nll_loss(output, target)

    loss.backward()
    if config["device"] == 'xla':
      xm.optimizer_step(optimizer, barrier=True)
    else:
      optimizer.step()
    if batch_idx % config["log_interval"] == 0:
      print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
          epoch, batch_idx * len(data), len(train_loader.dataset),
          100. * batch_idx / len(train_loader), loss.item()))
      if config["dry_run"]:
        break


def test(model, device, test_loader, B):
  model.eval()
  test_loss = 0
  correct = 0
  with torch.no_grad():
    for data, target in test_loader:
      data, target = data.to(device), target.to(device)
      N = target.size(0)

      # Replicate the input batch for each of the B fused models
      if B > 0:
        data = data.unsqueeze(1).expand(-1, B, -1, -1, -1)
        target = target.repeat(B)

      output = model(data)

      # Change the shape of the output for summing up the loss
      if B > 0:
        output = output.view(B * N, -1)

      test_loss += F.nll_loss(output, target,
                              reduction='none').view(-1, N).sum(dim=1)
      pred = output.argmax(dim=1, keepdim=True)
      correct += pred.eq(target.view_as(pred)).view(-1, N).sum(dim=1)

  length = len(test_loader.dataset)
  test_loss /= length
  loss_str = ["%.4f" % e for e in test_loss]
  correct_str = [
      "%d/%d(%.2lf%%)" % (e, length, 100. * e / length) for e in correct
  ]
  print('Test set: \tAverage loss: {}, \n \t\t\tAccuracy: {}\n'.format(
      loss_str, correct_str))
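
The factor `B` in the fused training loss can be checked numerically. The sketch below uses random log-probabilities as a stand-in for the fused model output (no HFTA operators are involved) and compares the fused loss against the sum of each model's own mean loss. The sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, B, num_classes = 4, 3, 10  # illustrative sizes

target = torch.randint(0, num_classes, (N,))
# Stand-in for the fused model output: log-probabilities of shape [B, N, classes].
output = F.log_softmax(torch.randn(B, N, num_classes), dim=2)

# Same transformation as in the training loop above.
fused_target = target.repeat(B)  # [B * N], model 0's targets first
fused_loss = B * F.nll_loss(output.view(B * N, -1), fused_target)

# Reference: each model's mean loss computed on its own, then summed.
per_model = torch.stack([F.nll_loss(output[b], target) for b in range(B)])

print(torch.allclose(fused_loss, per_model.sum(), atol=1e-5))
```

Because the fused loss is a sum of per-model losses and the models share no parameters, each model receives the same gradient it would get if trained alone.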

#### Modify the main loop

In [None]:
def main(config):
  random.seed(config["seed"])
  np.random.seed(config["seed"])
  torch.manual_seed(config["seed"])

  device = (torch.device(config["device"])
            if config["device"] in {'cpu', 'cuda'} else xm.xla_device())

  kwargs = {'batch_size': config["batch_size"]}
  kwargs.update({'num_workers': 1, 'pin_memory': True, 'shuffle': True},)

  transform = transforms.Compose(
      [transforms.ToTensor(),
       transforms.Normalize((0.1307,), (0.3081,))])

  # Infer the number of fused models from the number of learning rates provided
  B = len(config["lr"]) if config["use_hfta"] else 0

  dataset1 = datasets.MNIST('./data',
                            train=True,
                            download=True,
                            transform=transform)
  dataset2 = datasets.MNIST('./data', train=False, transform=transform)
  train_loader = torch.utils.data.DataLoader(dataset1, **kwargs)
  test_loader = torch.utils.data.DataLoader(dataset2, **kwargs)

  # Create the model and specify the number of fused models (B)
  model = Net(B).to(device)

  print('B={} lr={}'.format(B, config["lr"]), file=sys.stderr)

  # Convert the default optimizer (PyTorch Adadelta) to its HFTA version with get_hfta_optim_for(<default>, B).
  optimizer = get_hfta_optim_for(optim.Adadelta, B=B)(
      model.parameters(),
      lr=config["lr"] if B > 0 else config["lr"][0],
  )

  start = time.perf_counter()
  for epoch in range(1, config["epochs"] + 1):
    now = time.perf_counter()
    train(config, model, device, train_loader, optimizer, epoch, B)
    print('Epoch {} took {} s!'.format(epoch, time.perf_counter() - now))
  end = time.perf_counter()

  test(model, device, test_loader, B)

  print('All jobs Finished, Each epoch took {} s on average!'.format(
      (end - start) / (max(B, 1) * config["epochs"])))
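
Passing a list of learning rates to the fused optimizer means each model is updated with its own rate. One way to picture this, as an assumption for illustration only, is a learning-rate tensor broadcast over parameters that are stacked along a leading model dimension. The sketch uses a plain SGD step instead of Adadelta to keep the arithmetic obvious, and it does not reflect how HFTA stores its parameters internally.

```python
import torch

B = 3
lr = torch.tensor([0.1, 0.2, 0.3])   # one learning rate per fused model

# Hypothetical fused parameter: B weight matrices stacked along dim 0.
weight = torch.ones(B, 2, 2)
grad = torch.full((B, 2, 2), 0.5)    # pretend gradient, same for every model

# Plain SGD step; lr is broadcast over the non-model dimensions.
weight = weight - lr.view(B, 1, 1) * grad
print(weight[:, 0, 0])  # tensor([0.9500, 0.9000, 0.8500])
```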

### Train a single HFTA-enhanced MNIST model

Note that this run may be slower than the non-HFTA version because enabling HFTA adds a small amount of overhead. Here we train only a single model, so the overhead slows training down. As more models are fused, the overhead is amortized, eventually yielding an overall improvement.

In [None]:
# Enable HFTA without fusing: a single learning rate gives B=1,
# so only one model is trained
config = {
    "use_hfta": True,
    "device": "cuda",
    "batch_size": 64,
    "lr": [0.1],
    "gamma": 0.7,
    "epochs": 4,
    "seed": 1,
    "log_interval": 500,
    "dry_run": False,
    "save_model": False,
}


print(config)
main(config)

### Train fused HFTA-enhanced MNIST models

In [None]:
# Enable HFTA and fuse 6 models
config = {
    "use_hfta": True,
    "device": "cuda",
    "batch_size": 64,
    "lr": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "gamma": 0.7,
    "epochs": 4,
    "seed": 1,
    "log_interval": 500,
    "dry_run": False,
    "save_model": False,
}


print(config)
main(config)

## Conclusion

Comparing the time per epoch for the non-HFTA and HFTA versions of the same model shows that HFTA increases training throughput, especially on more powerful hardware. See our [paper](https://arxiv.org/pdf/2102.02344.pdf) for more details.

We are working to make it more convenient to integrate HFTA into an ordinary PyTorch model.