# Shipping SqueezeNet from PyTorch to ONNX to Android App

In this notebook we show how to export SqueezeNet, which is implemented and trained in PyTorch (doing the same with the fastai library is still a **TODO**), so that it runs on mobile devices.

Let's get started. First, make sure [PyTorch](https://pytorch.org/) and [ONNX](https://onnx.ai/) are installed in your environment and that you have cloned the [AICamera](https://github.com/bwasti/AICamera) repo with git.

_NOTE: Caffe2 pre-built binaries are installed together with PyTorch, as the [Caffe2 source code now lives in the PyTorch repository](https://github.com/caffe2/caffe2)._

1. [Install PyTorch 1.0 preview locally](https://pytorch.org/get-started/locally/#start-locally). Run this command:
```sh
conda install pytorch-nightly cuda92 -c pytorch
```

2. Install ONNX. See the instructions in this [notebook](https://nbviewer.jupyter.org/github/cedrickchee/data-science-notebooks/blob/master/notebooks/deep_learning/fastai_mobile/onnx_from_pytorch_to_caffe2.ipynb#Install-ONNX).

**Import some Python packages**

In [1]:
import io
import numpy as np
import torch.onnx

_**NOTE: while the work to bridge ResNet-family models built using the [fastai v1 library](https://docs.fast.ai/) to pure PyTorch land continues, the steps below use a mobile-first CNN as the example: SqueezeNet, available from torchvision. This model was developed in plain PyTorch (not in fastai v1).**_

## Network - SqueezeNet v1.1

### Efficient Convolutional Neural Networks (CNNs) for Mobile Vision

SqueezeNet is a small CNN that achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. [Paper](http://arxiv.org/abs/1602.07360).

**Use cases**

SqueezeNet models perform image classification: they take images as input and classify the major object in the image into a set of pre-defined classes. They are trained on the ImageNet dataset, which contains images from 1000 classes. SqueezeNet models are highly efficient in terms of size and speed while providing good accuracy. This makes them ideal for platforms with strict constraints on size.

**SqueezeNet version 1.1**

SqueezeNet 1.1 presented in the [official SqueezeNet repo](https://github.com/DeepScale/SqueezeNet/tree/master/SqueezeNet_v1.1) is an improved version of SqueezeNet 1.0 from the [paper](http://arxiv.org/abs/1602.07360). 

SqueezeNet version 1.1 requires 2.4x less computation than version 1.0, without sacrificing accuracy. [Jun 2016]

[SqueezeNet 1.1 pre-trained model weights](https://github.com/DeepScale/SqueezeNet/tree/master/SqueezeNet_v1.1)

The following [SqueezeNet implementation in PyTorch](https://github.com/pytorch/vision/blob/master/torchvision/models/squeezenet.py) is by Marat Dukhan and is part of `torchvision`:

In [2]:
import math
import torch
import torch.nn as nn
import torch.nn.init as init
import torch.utils.model_zoo as model_zoo


__all__ = ['SqueezeNet', 'squeezenet1_0', 'squeezenet1_1']


model_urls = {
    'squeezenet1_0': 'https://download.pytorch.org/models/squeezenet1_0-a815701f.pth',
    'squeezenet1_1': 'https://download.pytorch.org/models/squeezenet1_1-f364aa15.pth',
}


class Fire(nn.Module):

    def __init__(self, inplanes, squeeze_planes,
                 expand1x1_planes, expand3x3_planes):
        super(Fire, self).__init__()
        self.inplanes = inplanes
        self.squeeze = nn.Conv2d(inplanes, squeeze_planes, kernel_size=1)
        self.squeeze_activation = nn.ReLU(inplace=True)
        self.expand1x1 = nn.Conv2d(squeeze_planes, expand1x1_planes,
                                   kernel_size=1)
        self.expand1x1_activation = nn.ReLU(inplace=True)
        self.expand3x3 = nn.Conv2d(squeeze_planes, expand3x3_planes,
                                   kernel_size=3, padding=1)
        self.expand3x3_activation = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.squeeze_activation(self.squeeze(x))
        return torch.cat([
            self.expand1x1_activation(self.expand1x1(x)),
            self.expand3x3_activation(self.expand3x3(x))
        ], 1)


class SqueezeNet(nn.Module):

    def __init__(self, version=1.0, num_classes=1000):
        super(SqueezeNet, self).__init__()
        if version not in [1.0, 1.1]:
            raise ValueError("Unsupported SqueezeNet version {version}:"
                             "1.0 or 1.1 expected".format(version=version))
        self.num_classes = num_classes
        if version == 1.0:
            self.features = nn.Sequential(
                nn.Conv2d(3, 96, kernel_size=7, stride=2),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=False),
                Fire(96, 16, 64, 64),
                Fire(128, 16, 64, 64),
                Fire(128, 32, 128, 128),
                nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=False),
                Fire(256, 32, 128, 128),
                Fire(256, 48, 192, 192),
                Fire(384, 48, 192, 192),
                Fire(384, 64, 256, 256),
                nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=False),
                Fire(512, 64, 256, 256),
            )
        else:
            self.features = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=3, stride=2),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=False),
                Fire(64, 16, 64, 64),
                Fire(128, 16, 64, 64),
                nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=False),
                Fire(128, 32, 128, 128),
                Fire(256, 32, 128, 128),
                nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=False),
                Fire(256, 48, 192, 192),
                Fire(384, 48, 192, 192),
                Fire(384, 64, 256, 256),
                Fire(512, 64, 256, 256),
            )
        # Final convolution is initialized differently from the rest
        final_conv = nn.Conv2d(512, self.num_classes, kernel_size=1)
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            final_conv,
            nn.ReLU(inplace=True),
            nn.AvgPool2d(13)
        )

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                if m is final_conv:
                    init.normal_(m.weight.data, mean=0.0, std=0.01)
                else:
                    init.kaiming_uniform_(m.weight.data)
                if m.bias is not None:
                    m.bias.data.zero_()

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x.view(x.size(0), self.num_classes)


def squeezenet1_1(pretrained=False, **kwargs):
    r"""SqueezeNet 1.1 model from the `official SqueezeNet repo
    <https://github.com/DeepScale/SqueezeNet/tree/master/SqueezeNet_v1.1>`_.
    SqueezeNet 1.1 has 2.4x less computation and slightly fewer parameters
    than SqueezeNet 1.0, without sacrificing accuracy.
    Args:
        pretrained (bool): If True, returns a model pre-trained on ImageNet
    """
    model = SqueezeNet(version=1.1, **kwargs)
    if pretrained:
        model.load_state_dict(model_zoo.load_url(model_urls['squeezenet1_1']))
    return model
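
To make the size claims above concrete, the layer list for version 1.1 can be turned into a parameter count by hand. The following is a minimal sketch in plain Python, not part of `torchvision`; the helper names `conv_params` and `fire_params` are hypothetical and introduced only for illustration. Each `nn.Conv2d` contributes `in_channels * out_channels * kernel_size**2` weights plus one bias per output channel, and each `Fire` module outputs `expand1x1_planes + expand3x3_planes` channels because its two expand branches are concatenated.

```python
def conv_params(in_ch, out_ch, kernel_size):
    """Weights plus biases of a Conv2d layer with a square kernel."""
    return in_ch * out_ch * kernel_size * kernel_size + out_ch


def fire_params(inplanes, squeeze_planes, expand1x1_planes, expand3x3_planes):
    """Parameters of one Fire module: squeeze conv plus two expand convs."""
    return (conv_params(inplanes, squeeze_planes, 1)
            + conv_params(squeeze_planes, expand1x1_planes, 1)
            + conv_params(squeeze_planes, expand3x3_planes, 3))


# Fire configurations of SqueezeNet 1.1, copied from the class definition above.
FIRE_CONFIGS_V1_1 = [
    (64, 16, 64, 64), (128, 16, 64, 64),
    (128, 32, 128, 128), (256, 32, 128, 128),
    (256, 48, 192, 192), (384, 48, 192, 192),
    (384, 64, 256, 256), (512, 64, 256, 256),
]

# Each Fire module's input width must equal the previous module's output width.
for prev, nxt in zip(FIRE_CONFIGS_V1_1, FIRE_CONFIGS_V1_1[1:]):
    assert prev[2] + prev[3] == nxt[0]

total_params = (
    conv_params(3, 64, 3)                                  # first convolution
    + sum(fire_params(*cfg) for cfg in FIRE_CONFIGS_V1_1)  # eight Fire modules
    + conv_params(512, 1000, 1)                            # final 1x1 classifier conv
)
print(total_params)       # 1235496
print(total_params * 4)   # size in bytes when stored as float32
```

At 4 bytes per float32 parameter this comes to roughly 4.9 MB of weights.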

## Model

We can get the PyTorch model by calling the following function:

In [3]:
# Get pre-trained SqueezeNet model
torch_model = squeezenet1_1(True)

Downloading: "https://download.pytorch.org/models/squeezenet1_1-f364aa15.pth" to /home/ubuntu/.torch/models/squeezenet1_1-f364aa15.pth
100%|██████████| 4966400/4966400 [00:01<00:00, 3173216.83it/s]


## ONNX

### Export the PyTorch model as ONNX model

In [4]:
from torch.autograd import Variable
batch_size = 1 # just a random number

Input to the model:

In [5]:
x = Variable(torch.randn(batch_size, 3, 224, 224), requires_grad=True)
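
The 224 x 224 input size is not arbitrary: the classifier ends in `nn.AvgPool2d(13)`, which assumes a 13 x 13 feature map. As a rough sketch (plain Python, with a hypothetical helper `out_size`), the spatial size can be traced through the strided layers of version 1.1 using the standard output-size formula for an unpadded convolution or a floor-mode pooling layer. The `Fire` modules are left out because their 1x1 convolutions and padded 3x3 convolutions keep the spatial size unchanged.

```python
def out_size(size, kernel_size, stride):
    """Output size of an unpadded conv or floor-mode (ceil_mode=False) pooling layer."""
    return (size - kernel_size) // stride + 1


size = 224
# (name, kernel_size, stride) of the version 1.1 layers that change the spatial size
for name, kernel_size, stride in [
    ("conv1", 3, 2),
    ("maxpool1", 3, 2),
    ("maxpool2", 3, 2),
    ("maxpool3", 3, 2),
]:
    size = out_size(size, kernel_size, stride)
    print(name, size)   # 111, 55, 27, 13 in turn

# The final 13 x 13 map is what nn.AvgPool2d(13) reduces to a single value per class.
```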

### Export the model

In [6]:
torch_out = torch.onnx._export(torch_model,             # model being run
                               x,                       # model input (or a tuple for multiple inputs)
                               "squeezenet.onnx",       # where to save the model (can be a file or file-like object)
                               export_params=True)      # store the trained parameter weights inside the model file

This step writes a `squeezenet.onnx` file (around 5 MB) to the current working directory.

## Caffe2

After that, we can prepare and run the model and **verify** that the result of the model running on PyTorch matches the result running on **ONNX (with Caffe2 backend)**.

In [7]:
import onnx
import caffe2.python.onnx.backend
from onnx import helper

**Load the ONNX ModelProto object**. The model is a standard Python protobuf object.

In [8]:
model = onnx.load("squeezenet.onnx")

Prepare the Caffe2 backend for executing the model. This **converts the ONNX graph into a Caffe2 NetDef** that can execute it.

In [9]:
prepared_backend = caffe2.python.onnx.backend.prepare(model)

**Run the model in Caffe2.**

Construct a map from input names to Tensor data.

The graph itself lists the input image first, followed by inputs for all weight parameters.

Since the weights are already embedded, we just need to pass the input image.

The first graph input is therefore the one we set.

In [10]:
W = {model.graph.input[0].name: x.data.numpy()}

Run the Caffe2 net:

In [11]:
c2_out = prepared_backend.run(W)[0]

Verify the numerical correctness up to 3 decimal places.

In [12]:
np.testing.assert_almost_equal(torch_out.data.cpu().numpy(), c2_out, decimal=3)
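
For reference, the NumPy documentation describes `decimal=3` as requiring `abs(desired - actual) < 1.5 * 10**(-3)` for every element. The sketch below restates that criterion in plain Python on made-up numbers; `almost_equal` is a hypothetical helper written for illustration, not a NumPy function.

```python
def almost_equal(actual, desired, decimal=3):
    """Element-wise check mirroring the documented assert_almost_equal criterion."""
    tolerance = 1.5 * 10 ** (-decimal)
    return all(abs(d - a) < tolerance for a, d in zip(actual, desired))


reference = [0.1234, 5.6789, 2.0000]   # made-up stand-in for the PyTorch output
close = [0.1240, 5.6780, 2.0010]       # every element differs by about 0.001 or less
far = [0.1234, 5.6789, 2.0100]         # last element differs by 0.01

print(almost_equal(close, reference))  # True
print(almost_equal(far, reference))    # False
```

A tolerance of this kind is a sensible choice here because the two backends may order floating-point operations differently, so bit-identical outputs should not be expected.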

## Export the model to run on mobile devices

Leverage the cross-platform capability of Caffe2.

In [13]:
# Export to mobile
from caffe2.python.onnx.backend import Caffe2Backend as c2

`Caffe2Backend` is the backend for running ONNX on Caffe2.

Rewrite ONNX graph to Caffe2 NetDef:

In [14]:
init_net, predict_net = c2.onnx_graph_to_caffe2_net(model)

with open("squeeze_init_net.pb", "wb") as f:
    f.write(init_net.SerializeToString())
with open("squeeze_predict_net.pb", "wb") as f:
    f.write(predict_net.SerializeToString())

You'll see two files, `squeeze_init_net.pb` and `squeeze_predict_net.pb`, in the same directory as this notebook. Let's make sure they can run with `Predictor`, since that's what we'll use in the mobile app.

### Loading Pre-Trained Models

Optional reading, for reference:
- [Tutorial](https://caffe2.ai/docs/tutorial-loading-pre-trained-models.html)
  - This tutorial uses the SqueezeNet model to identify objects in images.
  - You'll learn how to read the protobuf files (i.e. `init_net.pb` and `predict_net.pb`), use the Predictor function in your Caffe2 workspace to load the blobs from the protobufs, run the net, and get the results.

**Verify it runs with `Predictor`**

Read the protobuf (`*.pb`) files:

In [16]:
# with open("squeeze_init_net.pb") as f:
#     init_net = f.read()
# with open("squeeze_predict_net.pb") as f:
#     predict_net = f.read()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 24: invalid continuation byte

**Fix** the previous `UnicodeDecodeError`.

Solution: [add the `rb` flag when opening the file](https://github.com/pytorch/pytorch/issues/10070#issuecomment-410979572)

In [15]:
with open("squeeze_init_net.pb", "rb") as f:
    init_net = f.read()
with open("squeeze_predict_net.pb", "rb") as f:
    predict_net = f.read()
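
The error occurs because a serialized protobuf is arbitrary binary data, and opening a file in text mode makes Python try to decode it as text. A small sketch with a throwaway file (the bytes below are made up and are not a real `NetDef`) reproduces the failure and shows that binary mode returns the data unchanged.

```python
import os
import tempfile

payload = b"\x0a\x04\xf0\x28\x8c\x28"  # made-up bytes that are not valid UTF-8

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pb")
    with open(path, "wb") as f:
        f.write(payload)

    # Text mode: Python attempts to decode the bytes as UTF-8 and fails.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Binary mode: the raw bytes come back exactly as written.
    with open(path, "rb") as f:
        data = f.read()

print(text_mode_failed)   # True
print(data == payload)    # True
```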

[Workspace](https://caffe2.ai/docs/workspace.html) is a key component of Caffe2.


> Workspace is a class that holds all the related objects created during runtime:
>
> 1. all blobs, and
> 2. all instantiated networks. It is the owner of all these objects and deals with the scaffolding logistics.

I think this concept is somewhat similar to a TensorFlow Session.

Use the `Predictor` function in your `Workspace` to load the blobs from the protobufs:

In [16]:
from caffe2.python import workspace

p = workspace.Predictor(init_net, predict_net) # create Predictor by using init NetDef and predict NetDef

Finally, **run the net and get the results**!

In [17]:
img = np.random.rand(1, 3, 224, 224).astype(np.float32) # create a random image tensor

result, = p.run([img])
print(result.shape) # our model produces prediction for each of ImageNet 1000 classes

(1, 1000)
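
The 1000 values are per-class scores, so the predicted label is the index of the largest one. A minimal sketch of picking the top-scoring classes in plain Python follows; `top_k` is a hypothetical helper and the scores are made up, so in the notebook you would pass something like `result[0].tolist()` instead. Because the input here is a random image, the resulting classes carry no meaning; the point is only to check the plumbing.

```python
def top_k(scores, k=5):
    """Return (class_index, score) pairs for the k highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


scores = [0.1, 2.5, 0.0, 1.7, 0.3, 3.2]   # made-up scores for six classes
print(top_k(scores, k=3))  # [(5, 3.2), (1, 2.5), (3, 1.7)]
```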


---

## Fast.ai Mobile Camera Project

### Integrating Caffe2 on Mobile

Caffe2 is optimized for mobile integrations, on both Android and iOS, and for running models on lower-powered devices.

In this notebook, we will go through what you need to know to implement Caffe2 in your mobile project.

### Shipping the models into the Android app

After we are sure that it runs with `Predictor`, we can copy `squeeze_init_net.pb` and `squeeze_predict_net.pb` to the `AICamera/app/src/main/assets` directory.
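
Copying the files can also be scripted. The sketch below uses only the standard library; the function name `copy_models_to_assets` is hypothetical, and the paths in the usage comment assume that the notebook directory holds the two `.pb` files and that the AICamera repo is cloned next to it.

```python
import shutil
from pathlib import Path

MODEL_FILES = ("squeeze_init_net.pb", "squeeze_predict_net.pb")


def copy_models_to_assets(src_dir, assets_dir, names=MODEL_FILES):
    """Copy the exported NetDef files into the app's assets directory."""
    src_dir, assets_dir = Path(src_dir), Path(assets_dir)
    assets_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        source = src_dir / name
        if not source.is_file():
            raise FileNotFoundError(f"missing exported model file: {source}")
        target = assets_dir / name
        shutil.copy2(source, target)   # overwrites an older copy if present
        copied.append(target)
    return copied


# Example call (paths are assumptions about your local layout):
# copy_models_to_assets(".", "AICamera/app/src/main/assets")
```

Failing loudly on a missing file is a deliberate choice here, since an app built with only one of the two nets would not be able to create its predictor.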

Now we can launch Android Studio and import the AICamera project. Next, run the app by pressing the `Shift + F10` shortcut keys.

See the [Caffe2 AI Camera tutorial](https://caffe2.ai/docs/AI-Camera-demo-android.html) for more details on how Caffe2 can be invoked in the Android mobile app.

### Android App Development using Android Studio

We are building our mobile app using Android Studio version 2.2.3 or above.

#### Some of the problems we encountered:

- [RESOLVED] Gradle Sync problems:
  - Resolution: Install [Android Native Development Kit (NDK)](https://developer.android.com/ndk/) version r15.
- [RESOLVED] Error: "Unable to get the CMake version located at: /home/cedric/m/dev/android/sdk/cmake/bin"
  - Install CMake 3.6.xxxxxxx using SDK Manager:
  ![](../../../images/fastai_mobile/Android_Studio_SDK_Manager_Install_CMake.png)
- [RESOLVED] Error: "Expected NDK STL shared object file at `/home/cedric/m/dev/android/sdk/ndk-bundle/sources/cxx-stl/gnu-libstdc++/4.9/libs/armeabi-v7a/libgnustl_shared.so`"
  - [GitHub Issue](https://github.com/caffe2/AICamera/issues/55)
- [RESOLVED] Error: "This Gradle plugin requires a newer IDE able to request IDE model level 3. For Android Studio this means version 3.0+"
  - [GitHub issue](https://github.com/caffe2/AICamera/issues/55)
  - [StackOverflow question](https://stackoverflow.com/questions/45171647/this-gradle-plugin-requires-android-studio-3-0-minimum)
- [RESOLVED] Error: "(5, 0) Could not find method google() for arguments [] on repository container."
  - [GitHub Issue](https://github.com/react-native-community/react-native-svg/issues/584)
- [RESOLVED] Error: "A problem occurred configuring project ':app'. > buildToolsVersion is not specified."
  - [GitHub Issue](https://github.com/react-native-community/react-native-svg/issues/584)
  - [StackOverflow question](https://stackoverflow.com/questions/32153544/errorcause-buildtoolsversion-is-not-specified)
- [WIP] Error: "android A/libc Fatal signal 6 (SIGABRT), code -6"
  - [GitHub Issue—AICamera demo with Other Networks](https://github.com/caffe2/AICamera/issues/37)

#### Android project, Java source code, and the Android Studio work space

- `ClassifyCamera.java` code, initialize Caffe2 core C++ libraries, `squeeze_predict_net.pb` protobuf file.

![](../../../images/fastai_mobile/Android_Studio_ClassifyCamera_Java_predict_protobuf.png "Android Studio - ClassifyCamera.java code, initialize Caffe2, squeeze_predict_net.pb protobuf file")

- NDK external native C++ build handled by CMake tooling and a CMakeLists file.

![](../../../images/fastai_mobile/Android_NDK_external_build_cmake_c_plus_plus.png)

- CMakeLists source code and JNI libs such as Caffe2 libraries for ARM architecture (e.g. `armeabi-v7a/libCaffe2.a`).

![](../../../images/fastai_mobile/Android_Studio_cmake_libCaffe2.png)

### Demo

Check out a working Caffe2 implementation on mobile:

[Android camera app demo (video)](https://youtu.be/TYkoaVNCMos)

#### Technical specifications

- Network architecture: SqueezeNet 1.1
- Real-time image classification from video stream
- Performance: average 3 fps (frames per second)

#### Steps

- Deploy mobile config and the models to devices.
- Instantiate a Caffe2 instance (Android) or caffe2::Predictor instance (iOS) to expose the model to your Java or iOS code.
- Pass inputs to the model and get outputs back.

#### Objects in graph

- caffe2::NetDef - (binary-serialized) protocol buffer instance that encapsulates the computation graph and the pre-trained weights.
- caffe2::Predictor - stateful class that is instantiated with an "initialization" NetDef and a "predict" NetDef, and executes the "predict" NetDef with the input and returns the output.

#### Mobile app layout in pure C++

- Caffe2 core library, composed of the Workspace, Blob, Net, and Operator classes.
- Caffe2 operator library, a range of Operator implementations (such as convolution).
- Non-optional dependencies:
  - Google Protobuf (the lite version, around 300kb)
  - Eigen: certain primitives require a BLAS (on Android) and a vectorized vector/matrix manipulation library, and Eigen is the fastest benchmarked on ARM.
- NNPACK, which specifically optimizes convolutions on ARM

#### Model

A model consists of two parts: a set of weights that represent the learned parameters (updated during training), and a set of 'operations' that form a computation graph that represents how to combine the input data (which varies with each graph pass) with the learned parameters (constant with each graph pass). The parameters (and intermediate states in the computation graph) live in a Caffe2 Workspace (like a TensorFlow Session), where a Blob represents an arbitrary typed pointer, typically a TensorCPU, which is an *n*-dimensional array (like PyTorch’s Tensor).

The core class is caffe2::Predictor, which exposes the constructor:

```c++
Predictor(const NetDef& init_net, const NetDef& predict_net)
```
where the two `NetDef` inputs are Google Protocol Buffer objects that represent the 2 computation graphs described above:
- the `init_net` typically runs a set of operations that deserialize weights into the Workspace
- the `predict_net` specifies how to execute the computation graph for each input

The Predictor is a stateful class.
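
The split between the two nets can be illustrated with a toy analogue in plain Python. This is only a conceptual sketch and not the Caffe2 API: `ToyPredictor`, its dictionary workspace, and the callables standing in for operators are all hypothetical. It shows the init net running once to populate the workspace with weights, and the predict net being executed against that state for every input.

```python
class ToyPredictor:
    """Toy stand-in for the init/predict split; NOT the Caffe2 Predictor API."""

    def __init__(self, init_net, predict_net):
        # "init net": run once, fills the workspace with named weight blobs.
        self.workspace = {}
        for blob_name, value in init_net:
            self.workspace[blob_name] = value
        # "predict net": an ordered list of operators applied to every input.
        self.predict_net = predict_net

    def run(self, x):
        blob = x
        for operator in self.predict_net:
            blob = operator(self.workspace, blob)
        return blob


# Hypothetical one-weight "model": y = relu(w * x + b)
init_net = [("w", 2.0), ("b", 1.0)]
predict_net = [
    lambda ws, x: ws["w"] * x + ws["b"],   # affine operator reading the weights
    lambda ws, x: max(0.0, x),             # ReLU operator
]

predictor = ToyPredictor(init_net, predict_net)
print(predictor.run(3.0))    # 7.0
print(predictor.run(-2.0))   # 0.0
```

The weights are loaded once in the constructor and reused by each `run` call, which is the sense in which such a predictor is stateful.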

#### Performance considerations

Currently Caffe2 is optimized for ARM CPUs with NEON (basically any ARM CPU since 2012). There are other advantages to offloading compute onto the GPU/DSP, and it's an active work in progress to expose these in Caffe2.

For a convolutional implementation, it is recommended to use NNPACK since that's substantially faster (around 2x-3x) than the standard `im2col/sgemm` implementation used in most frameworks.
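
For context, the `im2col/sgemm` approach turns a convolution into a matrix multiplication: every input patch is unrolled into a row, and the rows are multiplied by the flattened kernel. The following is a deliberately tiny single-channel sketch in plain Python (the helper names are hypothetical). Real implementations also handle channels, strides and padding, and hand the multiplication to an optimized GEMM routine.

```python
def im2col(image, k):
    """Unroll every k x k patch of a 2-D image (list of rows) into one row."""
    height, width = len(image), len(image[0])
    return [
        [image[i + di][j + dj] for di in range(k) for dj in range(k)]
        for i in range(height - k + 1)
        for j in range(width - k + 1)
    ]


def conv2d_im2col(image, kernel):
    """Stride-1, unpadded 2-D convolution (no kernel flip) via im2col."""
    k = len(kernel)
    flat_kernel = [value for row in kernel for value in row]
    patches = im2col(image, k)
    # The "GEMM" step: one dot product per unrolled patch.
    flat_out = [sum(p * q for p, q in zip(patch, flat_kernel)) for patch in patches]
    out_width = len(image[0]) - k + 1
    return [flat_out[r:r + out_width] for r in range(0, len(flat_out), out_width)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
print(conv2d_im2col(image, kernel))  # [[6, 8], [12, 14]]
```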

For non-convolutional (e.g. ranking) workloads, the key computational primitives are often fully-connected layers (e.g. FullyConnectedOp in Caffe2, InnerProductLayer in Caffe, nn.Linear in Torch). For these use cases, you can fall back to a BLAS library, specifically Accelerate on iOS and Eigen on Android.

#### Memory considerations

The memory usage of an instantiated and run Predictor can be modeled as the sum of the size of the weights and the total size of the activations. No ‘static’ memory is allocated; all allocations are tied to the Workspace instance owned by the Predictor, so there should be no memory impact after all Predictor instances are deleted.
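
As a back-of-the-envelope sketch of that model, the estimate is the number of weight elements plus the number of activation elements, multiplied by the element size. The helper `estimate_predictor_bytes` is hypothetical. The weight count is the one obtained by summing the SqueezeNet 1.1 layer sizes, and the activation shapes are an illustrative subset only (the outputs of the first convolution and the three pooling layers for a 224 x 224 input), not a full accounting of every intermediate blob.

```python
from math import prod


def estimate_predictor_bytes(num_weights, activation_shapes, bytes_per_element=4):
    """Weights plus activations, assuming every element has the same size (float32 by default)."""
    activation_elements = sum(prod(shape) for shape in activation_shapes)
    return (num_weights + activation_elements) * bytes_per_element


SQUEEZENET_1_1_WEIGHTS = 1235496   # summed from the layer definitions

# Illustrative subset of activation shapes (channels, height, width); an assumption,
# not a measurement of what Caffe2 actually allocates.
example_activations = [
    (64, 111, 111),   # after the first convolution
    (64, 55, 55),     # after the first max pool
    (128, 27, 27),    # after the second max pool
    (256, 13, 13),    # after the third max pool
]

estimate = estimate_predictor_bytes(SQUEEZENET_1_1_WEIGHTS, example_activations)
print(f"{estimate / 1e6:.1f} MB")
```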