This notebook is based on https://www.tensorflow.org/performance/quantization

Small devices such as mobile phones and the Raspberry Pi have very little memory and computation power.

Training neural networks is done by applying many tiny nudges to the weights, and these small increments typically need floating point precision to work (though there are research efforts to use quantized representations here too).
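To make the point about tiny nudges concrete, here is a minimal sketch in plain Python. The value range `[-1, 1]`, the step size `1e-4`, and the helper name `to_8bit_grid` are illustrative assumptions, not taken from TensorFlow. It applies the same small update 100 times to a floating-point weight and to a weight that is snapped back to a 256-level grid after every update.

```python
def to_8bit_grid(value, lo=-1.0, hi=1.0):
    """Snap a float to the nearest of 256 evenly spaced levels in [lo, hi]."""
    step = (hi - lo) / 255.0
    code = round((value - lo) / step)
    return lo + code * step

nudge = 1e-4                      # illustrative size of one small weight update
start = to_8bit_grid(0.5)         # both weights start on the same grid point
float_weight = start
grid_weight = start

for _ in range(100):
    float_weight += nudge                             # accumulates in floating point
    grid_weight = to_8bit_grid(grid_weight + nudge)   # rounded back to the grid each time

float_change = float_weight - start
grid_change = grid_weight - start
print(f"float weight moved by {float_change:.4f}")
print(f"8-bit weight moved by {grid_change:.4f}")
```

Each nudge is much smaller than the spacing between grid levels (about 0.0078 here), so the rounded weight never moves, while the floating-point weight accumulates all 100 updates.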

Taking a pre-trained model and running inference is very different. One of the magical qualities of Deep Neural Networks is that they tend to cope very well with high levels of noise in their inputs.
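One way to picture this, under the assumption that the rounding error from low-precision weights behaves like a small amount of noise, is the toy sketch below. The weights, the input, and the helper names are made up for illustration and do not come from any real model. It rounds the weights of a tiny linear layer to 256 levels and checks whether the predicted class changes.

```python
def quantize_dequantize(rows):
    """Round every weight to one of 256 levels between the global min and max."""
    flat = [w for row in rows for w in row]
    lo, hi = min(flat), max(flat)
    step = (hi - lo) / 255.0
    return [[lo + round((w - lo) / step) * step for w in row] for row in rows]

def logits(rows, x):
    """One output score per row: the dot product of that row with the input."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical 3-class linear layer and input, chosen only for this example.
weights = [
    [0.9, -0.3, 0.2, 0.1],
    [-0.5, 0.8, 0.05, -0.2],
    [0.1, 0.1, -0.7, 0.6],
]
x = [1.0, 0.5, -0.25, 2.0]

float_logits = logits(weights, x)
noisy_logits = logits(quantize_dequantize(weights), x)
largest_shift = max(abs(a - b) for a, b in zip(float_logits, noisy_logits))

print("predicted class (float weights):", argmax(float_logits))
print("predicted class (8-bit weights):", argmax(noisy_logits))
print(f"largest change in any score: {largest_shift:.4f}")
```

The scores shift slightly, but the gap between the best class and the runner-up is far larger than the shift, so the prediction stays the same. Real networks are much deeper, so this is only an intuition, not a guarantee.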

### Why Quantize?

Neural network models can take up a lot of space on disk, with the original AlexNet being over 200 MB in float format for example. Almost all of that size is taken up with the weights for the neural connections, since there are often many millions of these in a single model.
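A rough back-of-the-envelope check of that figure, assuming about 60 million parameters for AlexNet (an approximate count that is not stated in the text above) and counting only the weights:

```python
def model_size_mb(num_parameters, bits_per_weight):
    """Approximate storage for the weights alone, in megabytes (10**6 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e6

# Assumption: roughly 60 million parameters, an approximate figure for AlexNet.
alexnet_parameters = 60_000_000

float_mb = model_size_mb(alexnet_parameters, 32)
eight_bit_mb = model_size_mb(alexnet_parameters, 8)
print(f"32-bit floats: {float_mb:.0f} MB")
print(f" 8-bit values: {eight_bit_mb:.0f} MB")
```

The estimate ignores file-format overhead, so real files will differ somewhat, but it shows why the weights dominate the size on disk.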

The image below shows a ReLU operation.

<img src="images/3.png">

The nodes and weights of a neural network are originally stored as 32-bit floating-point numbers. The simplest motivation for quantization is to shrink file sizes by storing the min and max for each layer, and then compressing each float value to an eight-bit integer. Because each value then takes 8 bits instead of 32, the size of the files is reduced by about 75%.
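The min/max scheme can be sketched in a few lines of plain Python. This is an illustrative linear mapping under simple assumptions (one min and max per list of weights, round to nearest), not the exact routine TensorFlow uses. The names `quantize` and `dequantize` are hypothetical.

```python
import random

def quantize(values):
    """Map floats linearly onto the integers 0..255 using the list's min and max."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0 or 1.0          # fall back to 1.0 if all values are equal
    codes = bytes(min(255, max(0, round((v - lo) / scale))) for v in values)
    return codes, lo, hi

def dequantize(codes, lo, hi):
    """Recover approximate floats from the 8-bit codes and the stored range."""
    scale = (hi - lo) / 255.0
    return [lo + c * scale for c in codes]

random.seed(0)
weights = [random.uniform(-3.0, 6.0) for _ in range(1000)]   # stand-in for one layer

codes, lo, hi = quantize(weights)
restored = dequantize(codes, lo, hi)

max_error = max(abs(a - b) for a, b in zip(weights, restored))
float_bytes = 4 * len(weights)     # 32-bit floats
quant_bytes = len(codes)           # one byte per weight (plus two floats for min/max)

print(f"largest round-trip error: {max_error:.5f}")
print(f"bytes as float32: {float_bytes}, bytes as 8-bit: {quant_bytes}")
```

The round-trip error is bounded by half the spacing between levels, `(hi - lo) / 255 / 2`, which is the price paid for storing each weight in one byte.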

<img src="images/4.png">

### Code

In [None]:
# cURL allows us to fetch the file from this location
curl -L "https://storage.googleapis.com/download.tensorflow.org/models/inception_v3_2016_08_28_frozen.pb.tar.gz" |
  tar -C tensorflow/examples/label_image/data -xz
# The Graph Transform tool is designed to work on models that are saved as GraphDef files
# (with a .pb suffix, e.g. inception_v3_2016_08_28_frozen.pb), usually in a binary protobuf format.
# A GraphDef is the low-level definition of a TensorFlow computational graph.
# Build the Graph Transform tool.
bazel build tensorflow/tools/graph_transforms:transform_graph
# Run the tool. A comment placed between backslash-continued lines would end the
# command early, so the flags and transforms are explained here instead:
#   --in_graph   path to the frozen input graph downloaded above
#   --out_graph  path where the output (quantized) graph is written
#   strip_unused_nodes, remove_nodes: remove nodes and operations that are not useful in the deployed graph
#   fold_constants: replace sub-graphs that always evaluate to a constant with that constant
#   fold_batch_norms, fold_old_batch_norms: fold batch-normalization scaling into the convolution weights
#     (the "old" variant handles the batch-norm ops found in older graphs)
#   quantize_weights, quantize_nodes: convert large float weights and calculation nodes to eight-bit equivalents
#   sort_by_execution_order: arrange the nodes in the GraphDef file in topological order,
#     so that the inputs of any given node always come before the node itself
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=tensorflow/examples/label_image/data/inception_v3_2016_08_28_frozen.pb \
  --out_graph=/tmp/quantized_graph.pb \
  --inputs=input \
  --outputs=InceptionV3/Predictions/Reshape_1 \
  --transforms='add_default_attributes strip_unused_nodes(type=float, shape="1,299,299,3")
    remove_nodes(op=Identity, op=CheckNumerics) fold_constants(ignore_errors=true)
    fold_batch_norms fold_old_batch_norms quantize_weights quantize_nodes
    strip_unused_nodes sort_by_execution_order'
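
Once the command above has finished, one simple sanity check is to compare the two files on disk. The sketch below is a hypothetical helper (`size_report` is not part of TensorFlow) that uses only the Python standard library. It assumes the two paths from the commands above and prints a message if the files are not there.

```python
import os

def size_report(original_path, quantized_path):
    """Return (original_bytes, quantized_bytes, fraction_saved) for two files."""
    original = os.path.getsize(original_path)
    quantized = os.path.getsize(quantized_path)
    return original, quantized, 1.0 - quantized / original

# Paths taken from the commands above; adjust them if your layout differs.
original_path = "tensorflow/examples/label_image/data/inception_v3_2016_08_28_frozen.pb"
quantized_path = "/tmp/quantized_graph.pb"

if os.path.exists(original_path) and os.path.exists(quantized_path):
    original, quantized, saved = size_report(original_path, quantized_path)
    print(f"original:  {original / 1e6:.1f} MB")
    print(f"quantized: {quantized / 1e6:.1f} MB")
    print(f"saved:     {saved:.0%}")
else:
    print("Graph files not found; run the commands above first.")
```

The saving may come out a little under 75%, since the graph structure and the stored ranges also take up some space.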