
# Introduction

In this notebook, we take a PyTorch BERT-family model from Hugging Face and compile it to a format that IREE can execute. We then compare the size of the IREE runtime with that of a full PyTorch install. Additional features of IREE are described on the [IREE homepage](https://iree-org.github.io/iree/#key-features).

# Package Installation

Installing `torch-mlir`, which compiles the model into a form IREE can consume, requires Python 3.9 or 3.10. The `torch-mlir` wheel pinned below is built for CPython 3.10 (`cp310`); on Python 3.9, substitute a `cp39` wheel.

As of September 2022, Colab runs only Python 3.7, so you must connect this notebook to a [local Colab runtime](https://research.google.com/colaboratory/local-runtimes.html) that uses one of the supported Python versions.

In [None]:
import platform
# Torch-MLIR requires Python 3.9 or 3.10.
assert platform.python_version().startswith(('3.9.', '3.10.')), (
    f"Python 3.9 or 3.10 required, found {platform.python_version()}")

In [None]:
%%capture
!pip install -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html torch==1.13.0.dev20220913+cpu
!pip install https://github.com/llvm/torch-mlir/releases/download/snapshot-20220913.595/torch_mlir-20220913.595-cp310-cp310-linux_x86_64.whl
!pip install iree-compiler iree-runtime iree-tools-tf -f https://github.com/iree-org/iree/releases
!pip install git+https://github.com/iree-org/iree-torch.git
!pip install transformers
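The `torch-mlir` wheel pinned above encodes its supported interpreter in its filename: under the wheel naming convention (PEP 427), `cp310-cp310` is the Python and ABI tag pair, so it installs only on CPython 3.10. The helpers below (`wheel_python_tags` and `wheel_matches_interpreter` are hypothetical names, not part of any library) read the Python tag from a wheel filename or URL and compare it with the running interpreter. This is a minimal sketch: it only recognizes CPython `cpXY` tags and reports `False` for generic tags such as `py3`.

```python
import sys


def wheel_python_tags(wheel_filename: str) -> set[str]:
    """Return the Python tag(s) of a wheel filename or URL.

    PEP 427 layout: {dist}-{version}(-{build})?-{python}-{abi}-{platform}.whl
    A tag field may be a compressed set such as "py2.py3".
    """
    stem = wheel_filename.rsplit("/", 1)[-1].removesuffix(".whl")
    python_tag = stem.split("-")[-3]  # third field from the end
    return set(python_tag.split("."))


def wheel_matches_interpreter(wheel_filename: str,
                              version_info=sys.version_info) -> bool:
    """True if the wheel's Python tag names this CPython major/minor version."""
    tag = f"cp{version_info[0]}{version_info[1]}"
    return tag in wheel_python_tags(wheel_filename)


url = ("https://github.com/llvm/torch-mlir/releases/download/snapshot-20220913.595/"
       "torch_mlir-20220913.595-cp310-cp310-linux_x86_64.whl")
print(wheel_python_tags(url), wheel_matches_interpreter(url))
```

Running this check before `pip install` gives a clearer error than a pip "is not a supported wheel on this platform" failure.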

# Model Setup

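Hugging Face models return structured output (a `ModelOutput` object, or a tuple when loaded with `torchscript=True`) that can hold logits, attention weights, and hidden states. A traced module is simplest to compile with Torch-MLIR when it returns a single tensor, so we wrap the model in a small `torch.nn.Module` whose `forward` returns only the logits. The wrapper also puts the model in evaluation mode, which disables dropout so that tracing captures inference behavior.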

In [None]:
import torch
import torch_mlir
import iree_torch

from transformers import AutoTokenizer, AutoModelForSequenceClassification


def prepare_sentence_tokens(hf_model: str, sentence: str):
    tokenizer = AutoTokenizer.from_pretrained(hf_model)
    return torch.tensor([tokenizer.encode(sentence)])


class OnlyLogitsHuggingFaceModel(torch.nn.Module):
    """Wrapper that returns only the logits from a HuggingFace model."""

    def __init__(self, model_name: str):
        super().__init__()
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_name,  # The pretrained model name.
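            # The number of output labels: 2 for binary classification.
            num_labels=2,
            # Whether the model returns attention weights.
            output_attentions=False,
            # Whether the model returns all hidden states.
            output_hidden_states=False,
            # Return plain tuples instead of ModelOutput objects so the model can be traced.
            torchscript=True,
        )
        # Evaluation mode disables dropout for deterministic inference.
        self.model.eval()

    def forward(self, input_ids):
        # Return only the logits, the first element of the output tuple.
        return self.model(input_ids)[0]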


# Suppress warnings
import warnings
warnings.simplefilter("ignore")
import os
os.environ["TOKENIZERS_PARALLELISM"] = "true"
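To see the wrapping pattern in isolation, the sketch below replaces the Hugging Face model with a tiny hypothetical `ToyTupleModel`. Like a model loaded with `torchscript=True`, it returns a tuple; the hypothetical `OnlyFirstOutput` wrapper, which mirrors `OnlyLogitsHuggingFaceModel`, keeps only the first element so `torch.jit.trace` records a single tensor output. Layer sizes and token ids are arbitrary and chosen only for illustration.

```python
import torch


class ToyTupleModel(torch.nn.Module):
    """Hypothetical stand-in for a Hugging Face classifier with torchscript=True."""

    def __init__(self, vocab_size: int = 100, hidden: int = 8, num_labels: int = 2):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden)
        self.classifier = torch.nn.Linear(hidden, num_labels)

    def forward(self, input_ids):
        hidden_states = self.embed(input_ids)                # (batch, seq, hidden)
        logits = self.classifier(hidden_states.mean(dim=1))  # (batch, num_labels)
        # Tuple output, like a Hugging Face model with return_dict disabled.
        return logits, hidden_states


class OnlyFirstOutput(torch.nn.Module):
    """Keeps only the first element (the logits) of the wrapped model's tuple."""

    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model
        self.model.eval()  # disable dropout and similar training-only behavior

    def forward(self, input_ids):
        return self.model(input_ids)[0]


# Shape (batch=1, seq_len=4), matching what prepare_sentence_tokens produces.
example_ids = torch.tensor([[1, 5, 7, 2]])
wrapped = OnlyFirstOutput(ToyTupleModel())
traced_toy = torch.jit.trace(wrapped, example_ids)
logits = traced_toy(example_ids)
print(type(logits).__name__, tuple(logits.shape))
```

Without the wrapper, the traced graph would also return `hidden_states`, which the rest of this notebook does not need.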

# IREE Compilation

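Now the PyTorch model can be compiled. The cell below traces the model with `torch.jit.trace`, lowers the traced module to MLIR in the Linalg-on-tensors form with Torch-MLIR, compiles that MLIR into an IREE VM flatbuffer (`.vmfb`) for the chosen backend, and then loads and runs the flatbuffer with the IREE runtime.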

In [None]:
# The HuggingFace model name to use
model_name = "philschmid/MiniLM-L6-H384-uncased-sst2"

# The sentence to run the model on
sentence = "The quick brown fox jumps over the lazy dog."

print("Parsing sentence tokens.")
example_input = prepare_sentence_tokens(model_name, sentence)

print("Instantiating model.")
model = OnlyLogitsHuggingFaceModel(model_name)

print("Tracing model.")
traced = torch.jit.trace(model, example_input)

print("Compiling with Torch-MLIR")
linalg_on_tensors_mlir = torch_mlir.compile(
    traced,
    example_input,
    output_type=torch_mlir.OutputType.LINALG_ON_TENSORS)

print("Compiling with IREE")
# Backend options:
#
# llvm-cpu - CPU, compiled to native code
# vmvx - CPU, interpreted
# vulkan - GPU, via the vendor-neutral Vulkan API
# cuda - GPU, NVIDIA devices only
iree_backend = "llvm-cpu"
iree_vmfb = iree_torch.compile_to_vmfb(linalg_on_tensors_mlir, iree_backend)

print("Loading in IREE")
invoker = iree_torch.load_vmfb(iree_vmfb, iree_backend)

print("Running on IREE")
result = invoker.forward(example_input)
print("RESULT:", result)

Parsing sentence tokens.
Instantiating model.
Tracing model.
Compiling with Torch-MLIR
Compiling with IREE
Loading in IREE
Running on IREE
RESULT: tensor([[ 1.8574, -1.8036]])
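The `RESULT` tensor holds raw logits for the two labels (`num_labels=2`), not probabilities. A softmax converts them into probabilities; the sketch below does this in plain Python on the logits printed above. Which index corresponds to which sentiment is defined by the model's `config.id2label`, which this notebook does not print, so the sketch reports only the winning index.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)                         # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the RESULT line above.
probs = softmax([1.8574, -1.8036])
predicted_index = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs], predicted_index)  # → [0.975, 0.025] 0
```

Inside the notebook, `torch.softmax(result, dim=-1)` computes the same probabilities directly from the returned tensor.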


The model is now running on IREE. The compiled module (`iree_vmfb`) can be saved, deployed, and executed independently of PyTorch.
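Saving the compiled module is a matter of writing its bytes to a file. The sketch below assumes `iree_vmfb` is a `bytes` object (the flatbuffer returned by `iree_torch.compile_to_vmfb`); `save_vmfb`, `read_vmfb`, and the file name `minilm_sst2.vmfb` are hypothetical.

```python
from pathlib import Path


def save_vmfb(flatbuffer: bytes, path: str) -> Path:
    """Write compiled IREE module bytes to disk (hypothetical helper)."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_bytes(flatbuffer)
    return out


def read_vmfb(path: str) -> bytes:
    """Read compiled IREE module bytes back from disk (hypothetical helper)."""
    return Path(path).read_bytes()


# In the notebook (assuming iree_vmfb is bytes):
# save_vmfb(iree_vmfb, "minilm_sst2.vmfb")
# invoker = iree_torch.load_vmfb(read_vmfb("minilm_sst2.vmfb"), "llvm-cpu")
```

Reloading through `iree_torch.load_vmfb` still imports PyTorch, because that wrapper accepts and returns `torch.Tensor` values. A deployment without PyTorch would load the same bytes through the `iree.runtime` API directly, which is not shown here.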



# Runtime Size Comparison

One benefit of running a model on IREE is lightweight deployment: the IREE runtime has a much smaller footprint than a full PyTorch install. The cell below reports the on-disk size of each installed Python package.

In [None]:
import os
!du -sh {os.path.dirname(torch.__file__)}
import iree.runtime as iree_runtime
!du -sh {os.path.dirname(iree_runtime.__file__)}

713M	/usr/local/google/home/danielellis/colab-test-venv/lib/python3.10/site-packages/torch
4.0M	/usr/local/google/home/danielellis/colab-test-venv/lib/python3.10/site-packages/iree/runtime
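The cell above relies on the `du` shell tool. Where `du` is unavailable, a portable pure-Python measurement can walk the package directory and sum file sizes. This is a sketch with hypothetical helper names; it sums apparent file sizes, whereas `du` counts allocated disk blocks, so its numbers will differ slightly from the output above.

```python
import os


def dir_size_bytes(root: str) -> int:
    """Total apparent size of regular files under root; symlinks are skipped."""
    total = 0
    # os.walk does not descend into symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def format_mib(num_bytes: int) -> str:
    """Format a byte count in mebibytes with one decimal place."""
    return f"{num_bytes / 2**20:.1f} MiB"


# In the notebook (assumes torch and iree.runtime are importable):
# print(format_mib(dir_size_bytes(os.path.dirname(torch.__file__))))
# print(format_mib(dir_size_bytes(os.path.dirname(iree_runtime.__file__))))
```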
