<a href="https://colab.research.google.com/github/battuzz/torch_aot/blob/main/TorchAOT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AOT Compilation of torch models

## Install the latest version of PyTorch (CPU)

In [1]:
!pip3 install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

Looking in indexes: https://download.pytorch.org/whl/cpu


In [2]:
!pip install cmake



## Import libraries

In [3]:
import torch
print(torch.__version__)

# Default dtype used for the model's internal computations (change it here, e.g. to torch.float32).
# The C++ interface always uses float64, regardless of this setting.
torch.set_default_dtype(torch.float64)

2.7.1+cpu


## Define models

The models we chose are:
- a very basic MLP with 3 layers
- a simple Gaussian Process posterior with squared exponential kernel

We use float64 as the data type of the interface. Internally, the model can instead cast its inputs down to float32, compute in float32, and cast the results back to float64, which may improve performance at the cost of some precision.

In [4]:
NUM_INPUTS = 5
NUM_OUTPUTS = 7
NUM_INDUCING_POINTS = 350

class ModelNN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(NUM_INPUTS, 128)
        self.fc2 = torch.nn.Linear(128, 128)
        self.fc3 = torch.nn.Linear(128, NUM_OUTPUTS)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)
        return x

def squared_distance(x1, x2):
    return (
        torch.sum(x1**2, dim=1, keepdim=True)
        + torch.sum(x2**2, dim=1)
        - 2 * torch.mm(x1, x2.t())
    )


def rbf_kernel(x1, x2, lengthscale=1.0):
    dist = squared_distance(x1 / lengthscale, x2 / lengthscale)
    return torch.exp(-0.5 * dist)

class ModelGPPosterior(torch.nn.Module):
    def __init__(self):
        super().__init__()

        self.lengthscales = torch.nn.Parameter(torch.randn(NUM_INPUTS))
        self.inducing_points = torch.nn.Parameter(
        torch.randn(NUM_INDUCING_POINTS, NUM_INPUTS)
        )
        self.alpha = torch.nn.Parameter(torch.randn(NUM_INDUCING_POINTS, NUM_OUTPUTS))

    def forward(self, x):
        Kuf = rbf_kernel(x, self.inducing_points, self.lengthscales)
        mean = Kuf @ self.alpha
        return mean


class CastToFloat64Wrapper(torch.nn.Module):
    def __init__(self, inner_model : torch.nn.Module):
        super().__init__()
        self.model_ = inner_model

    def forward(self, x):
        x_default_dtype = x.type(torch.get_default_dtype())
        result = self.model_(x_default_dtype)
        result_f64 = result.type(torch.float64)

        return result_f64
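
For reference, the computation in `ModelGPPosterior.forward` above can be written out as follows. The symbols are introduced here for illustration and do not appear in the notebook: $z_m$ is the $m$-th row of `inducing_points` (with $M$ = `NUM_INDUCING_POINTS`), $\ell_d$ is the $d$-th entry of `lengthscales` (with $D$ = `NUM_INPUTS`), and $\alpha_m$ is the $m$-th row of `alpha`.

```latex
k(x, z_m) = \exp\!\left(-\frac{1}{2} \sum_{d=1}^{D} \left(\frac{x_d - z_{m,d}}{\ell_d}\right)^{2}\right),
\qquad
\mu(x) = \sum_{m=1}^{M} k(x, z_m)\, \alpha_m

% squared_distance evaluates the sum via the expansion
\lVert a - b \rVert^{2} = \lVert a \rVert^{2} + \lVert b \rVert^{2} - 2\, a^{\top} b
```

Because the lengthscales enter only through their squares, the sign of each randomly initialized $\ell_d$ does not affect the kernel value.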

## Train models with random data

In [5]:
def train_with_random_data(model):
    X = torch.randn(1000, NUM_INPUTS).type(torch.float64)
    y = torch.randn(1000, NUM_OUTPUTS).type(torch.float64)

    model.train()

    optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
    for epoch in range(30):
        optimizer.zero_grad()
        output = model(X)
        loss = torch.nn.functional.mse_loss(output, y)
        loss.backward()
        optimizer.step()

        print(f"Epoch {epoch + 1}, Loss: {loss.item()}")

model_nn = CastToFloat64Wrapper(ModelNN())
model_gp = CastToFloat64Wrapper(ModelGPPosterior())
train_with_random_data(model_nn)
train_with_random_data(model_gp)

Epoch 1, Loss: 0.9978238012071641
Epoch 2, Loss: 0.9965290112781162
Epoch 3, Loss: 0.9952749213191149
Epoch 4, Loss: 0.994061126066982
Epoch 5, Loss: 0.9928870313537529
Epoch 6, Loss: 0.9917540233458468
Epoch 7, Loss: 0.9906601361048546
Epoch 8, Loss: 0.9896044033370593
Epoch 9, Loss: 0.9885861983119293
Epoch 10, Loss: 0.987604158883998
Epoch 11, Loss: 0.9866569744713525
Epoch 12, Loss: 0.9857445739751902
Epoch 13, Loss: 0.98486760698659
Epoch 14, Loss: 0.9840230932904375
Epoch 15, Loss: 0.9832090206324187
Epoch 16, Loss: 0.9824244705443933
Epoch 17, Loss: 0.9816697725218735
Epoch 18, Loss: 0.9809429698205399
Epoch 19, Loss: 0.9802439656227822
Epoch 20, Loss: 0.9795702244311616
Epoch 21, Loss: 0.9789185884869569
Epoch 22, Loss: 0.978288564934835
Epoch 23, Loss: 0.9776791901626033
Epoch 24, Loss: 0.9770892672377227
Epoch 25, Loss: 0.9765203592998056
Epoch 26, Loss: 0.9759698374167334
Epoch 27, Loss: 0.975436906603212
Epoch 28, Loss: 0.9749198646298654
Epoch 29, Loss: 0.9744185353810849


## Export models

In [6]:
example_input = torch.randn((1, NUM_INPUTS)).type(torch.float64)

# Export NN
model_nn.eval()

exported = torch.export.export(model_nn, (example_input,))
torch._inductor.aoti_compile_and_package(
    exported,
    package_path="model_nn.pt2",
)
with open("model_nn_inputs_shape.txt", "w") as f:
    f.write(
        f"{len(example_input.shape)} {' '.join(map(str, example_input.shape))}"
    )

model_gp.eval()

exported = torch.export.export(model_gp, (example_input,))
torch._inductor.aoti_compile_and_package(
    exported,
    package_path="model_gp.pt2",
)
with open("model_gp_inputs_shape.txt", "w") as f:
    f.write(
        f"{len(example_input.shape)} {' '.join(map(str, example_input.shape))}"
    )
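
The `*_inputs_shape.txt` files written above use a simple whitespace-separated format: the number of dimensions followed by each dimension size. As a minimal sketch of that format, the helpers below (`format_shape` and `parse_shape` are hypothetical names, not part of the notebook) serialize a shape the same way and read it back, mirroring what the C++ program does when it loads the file.

```python
from typing import List, Sequence


def format_shape(shape: Sequence[int]) -> str:
    """Serialize a shape as '<ndim> <dim0> <dim1> ...', like the export cell does."""
    return f"{len(shape)} {' '.join(map(str, shape))}"


def parse_shape(text: str) -> List[int]:
    """Parse the shape format back into a list of dimension sizes."""
    tokens = text.split()
    num_dims = int(tokens[0])
    dims = [int(tok) for tok in tokens[1 : 1 + num_dims]]
    if len(dims) != num_dims:
        raise ValueError(f"expected {num_dims} dimensions, found {len(dims)}")
    return dims


encoded = format_shape((1, 5))  # shape of example_input when NUM_INPUTS = 5
print(encoded)                  # prints "2 1 5"
print(parse_shape(encoded))     # prints "[1, 5]"
```

Unlike this sketch, the C++ reader in the notebook does not check that as many sizes were read as the first number announces.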

## Write artefacts used in compilation

In particular, we'll need:
- A `CMakeLists.txt` that links the executable against libtorch
- The `inference.cpp` code that loads and benchmarks the model
- A build script (`build.sh`) that configures and compiles the project


In [7]:
cmake_contents = """cmake_minimum_required(VERSION 3.18 FATAL_ERROR)
project(aoti_example)

find_package(Torch REQUIRED)

add_executable(aoti_example inference.cpp)

target_link_libraries(aoti_example "${TORCH_LIBRARIES}")
set_property(TARGET aoti_example PROPERTY CXX_STANDARD 17)
"""
with open('CMakeLists.txt', 'w') as f:
    f.write(cmake_contents)

In [8]:
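build_contents = """#!/bin/bash
# Locate the CMake config shipped with the installed torch package
export CMAKE_PREFIX_PATH=$(python3 -c 'import torch; print(torch.utils.cmake_prefix_path)')
export TORCHINDUCTOR_FREEZING=1

rm -rf build
mkdir build
# Single-config generators (e.g. Makefiles) take the build type at configure time
cmake -B build -DCMAKE_BUILD_TYPE=Release .
cmake --build build --config Release
"""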
with open('build.sh', 'w') as f:
    f.write(build_contents)

In [9]:
!chmod +x ./build.sh

In [10]:
cpp_content = """#include <iostream>
#include <vector>
#include <chrono>
#include <fstream>

#include <torch/torch.h>
#include <torch/csrc/inductor/aoti_package/model_package_loader.h>

using namespace std::chrono;

int main(int argc, char* argv[]) {

    if (argc < 3) {
        std::cerr << "Usage: " << argv[0] << " <model.pt2> <inputs.txt>" << std::endl;
        return 1;
    }

    // Load input
    std::ifstream input_file{argv[2]};
    if (!input_file) {
        std::cerr << "Error opening input file: " << argv[2] << std::endl;
        return 1;
    }
    int num_dims {};
    input_file >> num_dims;

    std::vector<int64_t> input_dims{};
    for (int i = 0; i < num_dims; ++i) {
        int64_t dim_size;
        input_file >> dim_size;
        input_dims.push_back(dim_size);
    }

    input_file.close();

    torch::Tensor input = torch::randn(input_dims, torch::dtype(torch::kFloat64));
    std::vector<torch::Tensor> inputs { input };

    c10::InferenceMode mode;
    torch::inductor::AOTIModelPackageLoader loader(argv[1], "model", false);

    // Warmup
    std::vector<torch::Tensor> outputs;
    for (int i = 0; i < 1000; i++) {
        outputs = loader.run(inputs);
    }

    // Benchmark
    auto start_time = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 1000; i++) {
        outputs = loader.run(inputs);
    }
    auto end_time = std::chrono::high_resolution_clock::now();

    auto elapsed = duration_cast<microseconds>(end_time - start_time);
    std::cout << "Average inference time over 1000 runs: "
              << (elapsed.count() / 1000) << " us" << std::endl;

    return 0;
}
"""

with open('inference.cpp', 'w') as f:
    f.write(cpp_content)
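
The C++ program above follows a warm-up-then-measure pattern: it runs the model 1000 times untimed, then times another 1000 runs and reports the average. The same pattern can be sketched in Python, for example to time a baseline in the notebook itself. The `benchmark` helper and the toy workload below are hypothetical and not part of the notebook.

```python
import time
from typing import Callable


def benchmark(fn: Callable[[], object], warmup: int = 1000, runs: int = 1000) -> float:
    """Return the average wall-clock time of fn() in microseconds."""
    # Warm-up iterations are executed but not timed.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / runs * 1e6


# Hypothetical workload standing in for a model call such as model_nn(example_input).
avg_us = benchmark(lambda: sum(range(100)), warmup=100, runs=1000)
print(f"Average time over 1000 runs: {avg_us:.2f} us")
```

This sketch returns a floating-point average, whereas the C++ version divides an integer microsecond count by 1000 and therefore reports whole microseconds only.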

In [11]:
!ls .

build		inference.cpp		   model_nn_inputs_shape.txt
build.sh	model_gp_inputs_shape.txt  model_nn.pt2
CMakeLists.txt	model_gp.pt2		   sample_data


## Compile the model

In [12]:
!./build.sh

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
  static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
  /usr/local/lib/python3.11/dist-packages/torch/share/cmake/Torch/TorchConfig.cmake:125 (append_torchlib_if_found)
  CMakeLists.txt:4 (find_package)

-- Found Torch: /usr/local/lib/python3.11/dist-packages/torch/lib/libtorch.so
-- Configuring done (0.5s)
-- Generating done (0.0s)
-- Build files have been written to: /content/build
[ 50%] Building CXX object CMakeFiles/aoti_example.dir/inferenc

## Run the benchmark

In [13]:
!./build/aoti_example model_nn.pt2 model_nn_inputs_shape.txt

Average inference time over 1000 runs: 14 us


In [14]:
!./build/aoti_example model_gp.pt2 model_gp_inputs_shape.txt

Average inference time over 1000 runs: 18 us


## Empirical results

In our benchmarks on a standard laptop, we measured the following average inference times (in microseconds; lower is better):

| Method | MLP f64 | MLP f32 | GP f64 | GP f32 |
|---|---|---|---|---|
| AOT Inductor (torch) | 11 | 12 | 10 | 11 |
| ONNX Runtime (torch) | 11 | 5 | 12 | 8 |
| TensorFlow AOT | 3 | - | 7 | - |


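# Try to export with derivatives / Jacobian

Exporting derivatives fails because the autograd calls cannot be traced: both `torch.autograd.functional.jacobian` and `torch.autograd.grad` raise an `Unsupported` error during `torch.export.export`.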

In [17]:
class ModelWithJacobian(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model_ = model

    def forward(self, x):
        dy = torch.autograd.functional.jacobian(self.model_, x, create_graph=True)
        return dy


class ModelWithGrads(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model_ = model

    def forward(self, x):
        y = self.model_(x)
        dy = torch.autograd.grad(y, x, retain_graph=True, create_graph=True, )
        return dy


In [18]:
m = ModelWithJacobian(model_nn)

m.eval()

exported = torch.export.export(m, (example_input,))
torch._inductor.aoti_compile_and_package(
    exported,
    package_path="model_grads.pt2",
)

Unsupported: Failed to convert args/kwargs to proxy
  Explanation: Missing `as_proxy()` implementation for some arg/kwarg.


  Developer debug context: call_function args: NNModuleVariable() TensorVariable() ConstantVariable(bool: True)


from user code:
   File "/tmp/ipython-input-17-2830246847.py", line 7, in forward
    dy = torch.autograd.functional.jacobian(self.model_, x, create_graph=True)

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"


In [19]:
m = ModelWithGrads(model_nn)

m.eval()

exported = torch.export.export(m, (example_input,))
torch._inductor.aoti_compile_and_package(
    exported,
    package_path="model_grads.pt2",
)

Unsupported: Attempted to call function marked as skipped
  Explanation: Dynamo developers have intentionally marked that the function `grad` in file `/usr/local/lib/python3.11/dist-packages/torch/autograd/__init__.py` should not be traced.
  Hint: Avoid calling the function `grad`.
  Hint: Remove the function `grad` or the file `/usr/local/lib/python3.11/dist-packages/torch/autograd/__init__.py` from torch/_dynamo/trace_rules.py. More graph breaks may occur as a result of attempting to trace into the function.
  Hint: Please file an issue to PyTorch.

  Developer debug context: module: torch.autograd, qualname: grad, skip reason: <missing reason>


from user code:
   File "/tmp/ipython-input-17-2830246847.py", line 18, in forward
    dy = torch.autograd.grad(y, x, retain_graph=True, create_graph=True, )

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
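
Since both autograd-based exports fail, one possible workaround, which the notebook does not test, is to approximate the Jacobian from forward evaluations only, using central finite differences. The sketch below is plain Python: `numerical_jacobian` and `toy_model` are hypothetical names, and `toy_model` merely stands in for a forward-only call into the compiled model.

```python
from typing import Callable, List, Sequence


def numerical_jacobian(
    f: Callable[[Sequence[float]], Sequence[float]],
    x: Sequence[float],
    eps: float = 1e-6,
) -> List[List[float]]:
    """Approximate J[i][j] = d f_i / d x_j with central differences."""
    columns = []
    for j in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus = f(x_plus)
        f_minus = f(x_minus)
        columns.append([(p - m) / (2 * eps) for p, m in zip(f_plus, f_minus)])
    # columns[j][i] holds d f_i / d x_j; transpose to the (outputs, inputs) layout.
    return [list(row) for row in zip(*columns)]


def toy_model(x: Sequence[float]) -> List[float]:
    # Hypothetical stand-in for a forward-only call into the compiled model.
    return [x[0] * x[1], x[0] + 2.0 * x[1]]


jac = numerical_jacobian(toy_model, [3.0, 4.0])
# The exact Jacobian at (3, 4) is [[4, 3], [1, 2]]; the estimate agrees up to rounding error.
```

This costs two forward passes per input dimension (10 for `NUM_INPUTS = 5`) and is only an approximation, so it trades accuracy and speed for needing nothing beyond the forward model.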
