# Run AlexNet on CPU and GPU

In this tutorial, I'll run AlexNet in Caffe2 and measure its speed. Before starting, I recommend reading another tutorial, [Loading Pre-Trained Model](https://github.com/caffe2/caffe2/blob/master/caffe2/python/tutorials/Loading_Pretrained_Models.ipynb).

## Convert AlexNet from Caffe to Caffe2

First, we need to download the [deploy.prototxt](https://github.com/BVLC/caffe/blob/master/models/bvlc_alexnet/deploy.prototxt) and [bvlc_alexnet.caffemodel](http://dl.caffe.berkeleyvision.org/bvlc_alexnet.caffemodel).

```
wget https://raw.githubusercontent.com/BVLC/caffe/master/models/bvlc_alexnet/deploy.prototxt
wget http://dl.caffe.berkeleyvision.org/bvlc_alexnet.caffemodel
```

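After that, if you have Caffe2 installed, you can find caffe_translator.py in CAFFE_ROOT/caffe2/python. The following command converts AlexNet to the Caffe2 format. As written, it is meant to be run from that directory, with the two downloaded files placed in its tutorials subdirectory.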

```
python caffe_translator.py tutorials/deploy.prototxt tutorials/bvlc_alexnet.caffemodel --init_net tutorials/init_net.pb --predict_net tutorials/predict_net.pb
```

## Run AlexNet in CPU mode

Now, let's write some code to run AlexNet in CPU mode.

In [6]:
# First let's import a few things needed.
import numpy as np
import os, time

from caffe2.proto import caffe2_pb2
from caffe2.python import workspace

# set the root of caffe2 (expand '~' explicitly, because open() does not)
caffe_root = os.path.expanduser('~/caffe2')

# set the path of init_net and predict_net
init_net_path = os.path.join(caffe_root, 'caffe2/python/tutorials/init_net.pb')
predict_net_path = os.path.join(caffe_root, 'caffe2/python/tutorials/predict_net.pb')
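
Since `open()` does not expand `~` by itself, it can be worth resolving each path and checking that the converted file exists before trying to load it. The following is a minimal sketch using only the standard library; `resolve_model_file` is a hypothetical helper name and not part of Caffe2.

```python
import os


def resolve_model_file(root, relative_path):
    """Expand '~' in root, join it with relative_path, and return the absolute path.

    Raises IOError if no file exists there (hypothetical helper, not a Caffe2 API).
    """
    path = os.path.abspath(os.path.join(os.path.expanduser(root), relative_path))
    if not os.path.isfile(path):
        raise IOError('model file not found: {}'.format(path))
    return path
```

With the layout used in this tutorial, it could be called as `resolve_model_file('~/caffe2', 'caffe2/python/tutorials/init_net.pb')`, so that a missing file is reported before any protobuf parsing starts.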

In [10]:
# We set the device options to CPU
device_opts = caffe2_pb2.DeviceOption()
device_opts.device_type = caffe2_pb2.CPU

In [18]:
# load init_net, which contains the blobs (weights) of AlexNet
init_def = caffe2_pb2.NetDef()
# the .pb file is a serialized protobuf, so read it in binary mode
with open(init_net_path, 'rb') as f:
    init_def.ParseFromString(f.read())
    init_def.device_option.CopyFrom(device_opts)
    workspace.RunNetOnce(init_def)

print(workspace.Blobs())
print(workspace.FetchBlob('conv1_w').shape)
print(workspace.FetchBlob('data').shape)

[u'_fc6_mask', u'_fc7_mask', u'_norm1_scale', u'_norm2_scale', u'conv1', u'conv1_b', u'conv1_w', u'conv2', u'conv2_b', u'conv2_w', u'conv3', u'conv3_b', u'conv3_w', u'conv4', u'conv4_b', u'conv4_w', u'conv5', u'conv5_b', u'conv5_w', u'data', u'fc6', u'fc6_b', u'fc6_w', u'fc7', u'fc7_b', u'fc7_w', u'fc8', u'fc8_b', u'fc8_w', u'norm1', u'norm2', u'pool1', u'pool2', u'pool5', u'prob']
(96, 3, 11, 11)
(1,)


In [22]:
# load predict_net, which contains the operators of AlexNet
net_def = caffe2_pb2.NetDef()
# read the serialized protobuf in binary mode
with open(predict_net_path, 'rb') as f:
    net_def.ParseFromString(f.read())
    net_def.device_option.CopyFrom(device_opts)
    workspace.CreateNet(net_def, overwrite=True)

print(net_def.name)
print(workspace.Blobs())
print(workspace.FetchBlob('_fc7_mask'))

AlexNet
[u'_fc6_mask', u'_fc7_mask', u'_norm1_scale', u'_norm2_scale', u'conv1', u'conv1_b', u'conv1_w', u'conv2', u'conv2_b', u'conv2_w', u'conv3', u'conv3_b', u'conv3_w', u'conv4', u'conv4_b', u'conv4_w', u'conv5', u'conv5_b', u'conv5_w', u'data', u'fc6', u'fc6_b', u'fc6_w', u'fc7', u'fc7_b', u'fc7_w', u'fc8', u'fc8_b', u'fc8_w', u'norm1', u'norm2', u'pool1', u'pool2', u'pool5', u'prob']
_fc7_mask, a C++ native class of type nullptr (uninitialized).


In [21]:
# here we feed the data blob with a random 1x3x227x227 float32 array
workspace.FeedBlob('data', np.random.rand(1, 3, 227, 227).astype(np.float32), device_opts)
print(workspace.FetchBlob('data').shape)

(1, 3, 227, 227)


In [24]:
# run alexnet 50 iters in cpu mode
num_iters = 50
start = time.time()
for i in range(num_iters):
    workspace.RunNet(net_def.name, 1)
end = time.time()
print('Run time per RunNet: {}'.format((end - start) / num_iters))

Run time per RunNet: 0.161446561813
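
The loop above times every iteration, including the first ones, which may carry one-time setup cost. A small helper that runs a few untimed warm-up calls first can make the number easier to compare with the benchmark figures. This is only a sketch: `time_per_call` is a hypothetical name, and it times any Python callable rather than anything specific to Caffe2.

```python
import time


def time_per_call(fn, num_iters, warmup=0):
    """Return the mean wall-clock seconds per call of fn().

    `warmup` calls are made first and excluded from the measurement.
    """
    for _ in range(warmup):
        fn()
    start = time.time()
    for _ in range(num_iters):
        fn()
    return (time.time() - start) / num_iters
```

In this tutorial it could be used as `time_per_call(lambda: workspace.RunNet(net_def.name, 1), 50, warmup=10)`, assuming the net has already been created as shown above.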


In [29]:
# benchmark alexnet
warmup_runs = 10
main_runs = 50
run_individual = True
cpu_stats = workspace.BenchmarkNet(net_def.name, warmup_runs, main_runs, run_individual)

In [30]:
# TODO: I haven't found a proper way to get the operators of a net, so they are listed by hand.
operators = [
    'Milliseconds per iter: {}',
    '(conv1, Conv) {}',
    '(conv1, Relu) {}',
    '(norm1, LRN) {}',
    '(pool1, MaxPool) {}',
    '(conv2, Conv) {}',
    '(conv2, Relu) {}',
    '(norm2, LRN) {}',
    '(pool2, MaxPool) {}',
    '(conv3, Conv) {}',
    '(conv3, Relu) {}',
    '(conv4, Conv) {}',
    '(conv4, Relu) {}',
    '(conv5, Conv) {}',
    '(conv5, Relu) {}',
    '(pool5, MaxPool) {}',
    '(fc6, FC) {}',
    '(fc6, Relu) {}',
    '(fc6, Dropout) {}',
    '(fc7, FC) {}',
    '(fc7, Relu) {}',
    '(fc7, Dropout) {}',
    '(fc8, FC) {}',
    '(prob, Softmax) {}'
]
def show_stats(stats):
    # stats[0] is the time for the whole net; the rest are per-operator times
    for i, stat in enumerate(stats):
        print('Operator #{} '.format(i) + operators[i].format(stat) + ' ms/iters')

show_stats(cpu_stats)

Operator #0 Milliseconds per iter: 158.544464111 ms/iters
Operator #1 (conv1, Conv) 10.0862464905 ms/iters
Operator #2 (conv1, Relu) 0.152279004455 ms/iters
Operator #3 (norm1, LRN) 15.3392858505 ms/iters
Operator #4 (pool1, MaxPool) 1.75234615803 ms/iters
Operator #5 (conv2, Conv) 17.9576473236 ms/iters
Operator #6 (conv2, Relu) 0.101289302111 ms/iters
Operator #7 (norm2, LRN) 11.8667507172 ms/iters
Operator #8 (pool2, MaxPool) 1.29284083843 ms/iters
Operator #9 (conv3, Conv) 15.8340377808 ms/iters
Operator #10 (conv3, Relu) 0.015605058521 ms/iters
Operator #11 (conv4, Conv) 6.03665542603 ms/iters
Operator #12 (conv4, Relu) 0.0141125591472 ms/iters
Operator #13 (conv5, Conv) 5.42444992065 ms/iters
Operator #14 (conv5, Relu) 0.00877944100648 ms/iters
Operator #15 (pool5, MaxPool) 0.278489500284 ms/iters
Operator #16 (fc6, FC) 42.2674255371 ms/iters
Operator #17 (fc6, Relu) 0.00187726004515 ms/iters
Operator #18 (fc6, Dropout) 0.0027340000961 ms/iters
Operator #19 (fc7, FC) 18.090538024
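
Regarding the TODO in the cell above: a `NetDef` message has a repeated `op` field, and each operator carries a `type` and a list of `output` blob names, so the labels could probably be derived from `net_def` instead of being typed by hand. The sketch below assumes that each label is the operator's first output plus its type, and that the first entry of the stats list is the whole-net time, as in the output above. `operator_labels` is a hypothetical name.

```python
def operator_labels(net_def):
    """Build one format string per stats entry from a NetDef-like object.

    Assumes net_def.op is iterable and each op has `.type` and `.output`.
    """
    # The first stats entry is the time for the whole net.
    labels = ['Milliseconds per iter: {}']
    for op in net_def.op:
        # Use the first output blob as the operator's display name.
        name = op.output[0] if len(op.output) else op.type
        labels.append('({}, {}) {{}}'.format(name, op.type))
    return labels
```

The returned list has the same shape as the hand-written `operators` list, so it could be assigned to `operators` before calling `show_stats`.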

## Run AlexNet in GPU mode

Running AlexNet in GPU mode is simple. The only change is device_opts.

In [40]:
# first, reset workspace
workspace.ResetWorkspace()

True

In [41]:
# set the device options to CUDA
device_opts = caffe2_pb2.DeviceOption()
device_opts.device_type = caffe2_pb2.CUDA
device_opts.cuda_gpu_id = 0

In [42]:
# init net (read the serialized protobuf in binary mode)
init_def = caffe2_pb2.NetDef()
with open(init_net_path, 'rb') as f:
    init_def.ParseFromString(f.read())
    init_def.device_option.CopyFrom(device_opts)
    workspace.RunNetOnce(init_def)

# create net
net_def = caffe2_pb2.NetDef()
with open(predict_net_path, 'rb') as f:
    net_def.ParseFromString(f.read())
    net_def.device_option.CopyFrom(device_opts)
    workspace.CreateNet(net_def, overwrite=True)

# feed data blob
workspace.FeedBlob('data', np.random.rand(1, 3, 227, 227).astype(np.float32), device_opts)

# run alexnet 1000 iters in gpu mode
num_iters = 1000
start = time.time()
for i in range(num_iters):
    workspace.RunNet(net_def.name, 1)
end = time.time()
print('Run time per RunNet: {}'.format((end - start) / num_iters))

# benchmark alexnet
warmup_runs = 50
main_runs = 1000
run_individual = True
gpu_stats = workspace.BenchmarkNet(net_def.name, warmup_runs, main_runs, run_individual)

Run time per RunNet: 0.00201209211349
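
To put the two wall-clock measurements side by side, the per-run time from the CPU section can be divided by the one printed here. The snippet is a small sketch using the numbers reported in this tutorial; `speedup` is a hypothetical helper, and the ratio will differ on other hardware.

```python
def speedup(cpu_seconds, gpu_seconds):
    """Return how many times faster the GPU run is than the CPU run."""
    if gpu_seconds <= 0:
        raise ValueError('gpu_seconds must be positive')
    return cpu_seconds / gpu_seconds


# Per-run times printed earlier in this tutorial (seconds per RunNet call).
cpu_time = 0.161446561813
gpu_time = 0.00201209211349
print('GPU speedup: {:.1f}x'.format(speedup(cpu_time, gpu_time)))
```

For these two measurements the ratio comes out at roughly 80 times.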


In [43]:
# These stats were measured on a TITAN X
show_stats(gpu_stats)

Operator #0 Milliseconds per iter: 1.70603919029 ms/iters
Operator #1 (conv1, Conv) 0.102756902575 ms/iters
Operator #2 (conv1, Relu) 0.0112983090803 ms/iters
Operator #3 (norm1, LRN) 0.0401390828192 ms/iters
Operator #4 (pool1, MaxPool) 0.020453164354 ms/iters
Operator #5 (conv2, Conv) 0.166687503457 ms/iters
Operator #6 (conv2, Relu) 0.0109590971842 ms/iters
Operator #7 (norm2, LRN) 0.0613889656961 ms/iters
Operator #8 (pool2, MaxPool) 0.0173483639956 ms/iters
Operator #9 (conv3, Conv) 0.111571870744 ms/iters
Operator #10 (conv3, Relu) 0.0101103847846 ms/iters
Operator #11 (conv4, Conv) 0.150309726596 ms/iters
Operator #12 (conv4, Relu) 0.00984948966652 ms/iters
Operator #13 (conv5, Conv) 0.147298529744 ms/iters
Operator #14 (conv5, Relu) 0.00964171625674 ms/iters
Operator #15 (pool5, MaxPool) 0.0151336118579 ms/iters
Operator #16 (fc6, FC) 0.480961769819 ms/iters
Operator #17 (fc6, Relu) 0.00969945359975 ms/iters
Operator #18 (fc6, Dropout) 0.00229897466488 ms/iters
Operator #19 (fc