This repository has been archived by the owner on Feb 7, 2023. It is now read-only.

Caffe Translator Error: Convolution Layer #468

Closed
milewis1 opened this issue May 1, 2017 · 15 comments

Comments

@milewis1

milewis1 commented May 1, 2017

I have a Caffe model that I'm trying to translate into Caffe2. However, I'm running into the following error on the first operator:

RuntimeError: [enforce fail at conv_op_impl.h:25] X.ndim() == filter.ndim(). 4 vs 1 Error from operator:
input: "data" input: "conv1_w" input: "conv1_b" output: "conv1" type: "Conv" arg { name: "stride" i: 2 } arg { name: "pad" i: 3 } arg { name: "kernel" i: 7 }

The original Caffe model starts like this:

    layer {
      name: "data"
      type: "Input"
      top: "data"
      input_param { shape { dim: 1 dim: 3 dim: 224 dim: 224 } }
    }
    layer {
      name: "conv1"
      type: "Convolution"
      bottom: "data"
      top: "conv1"
      param { lr_mult: 1 decay_mult: 1 }
      param { lr_mult: 2 decay_mult: 0 }
      convolution_param {
        num_output: 64
        bias_term: true
        pad: 3
        kernel_size: 7
        stride: 2
      }
    }

My input is a single color image with shape [1, 3, 224, 224]. Has anyone tried to do something similar?

@teaglin

teaglin commented May 1, 2017

I've had conversion issues too and haven't been able to resolve them yet. I tried a GoogleNet model, which doesn't seem to work in Caffe2, but works great in Caffe. Input is a single color image [1, 3, 224, 224].

caffe2::EnforceNotMet: [enforce fail at fully_connected_op.h:61] K == W.size() / W.dim32(0). Dimension mismatch: X: 1 1024 2 2, W: 2 1024, b: 2, axis: 1, M: 1, N: 2, K: 4096 Error from operator: input: "pool5/7x7_s1" input: "loss3/classifier_w" input: "loss3/classifier_b" output: "loss3/classifier" type: "FC"
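
The numbers in this FC error can be checked by hand. Below is a minimal sketch, assuming (as the enforce message suggests) that the check flattens every dimension of X from `axis` onward into K and compares it with W's element count divided by its first dimension. The helper name `fc_inner_dims` is hypothetical and is not part of Caffe2.

```python
from functools import reduce
from operator import mul


def prod(dims):
    """Product of a list of dimensions (1 for an empty list)."""
    return reduce(mul, dims, 1)


def fc_inner_dims(x_shape, w_shape, axis=1):
    """Return (K from X, K expected by W) for an FC-style check.

    Assumption: K is the product of X's dims from `axis` onward, and W
    expects W.size() / W.dim(0) inputs per output row.
    """
    k_from_x = prod(x_shape[axis:])
    k_from_w = prod(w_shape) // w_shape[0]
    return k_from_x, k_from_w


# Shapes copied from the error message above.
k_x, k_w = fc_inner_dims([1, 1024, 2, 2], [2, 1024])
print(k_x, k_w)  # 4096 1024
```

Under that assumption the mismatch comes from the 2 x 2 spatial extent of `pool5/7x7_s1`: the classifier weights expect 1024 inputs, which would match a 1 x 1 pooled output.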

@KleinYuan

KleinYuan commented May 2, 2017

Same here.
But the error means what it says: the dimensions do not match.

@littleowl

Hi @milewis1, thanks for your other PRs, which have let me get as far as I have in trying to convert this ResNet model. - #472

After applying them, I get the same error you have here. I was able to get past it by sending a Tensor with only 1 dimension instead of the required 4, only to fail again a few lines of code later:
libc++abi.dylib: terminating with uncaught exception of type caffe2::EnforceNotMet: [enforce fail at conv_op_impl.h:33] C == filter.dim32(1) * group_. Convolution op: input channels does not match: # of input channels 0 is not equal to kernel channels * group:3*1 Error from operator:

However, in the other issue thread, @KeyKy gave an example of how he edited the prototxt file. I tried my best to edit the ResNet50 prototxt in a similar way (it was more complicated than I first thought).
Of course, I am unsure about these changes and whether the model would still work afterwards. The interesting thing is that I was able to run the predictor without error, but some layer and everything after it returns NaN.
It seems odd that it would still run at all after so many changes to inputs and outputs.

So I guess there must also be something wrong with the way the translator converts the input layer or convolution layers in general, and/or maybe it does something wrong when a convolution is connected to a BatchNorm layer.
:(

@KeyKy

KeyKy commented May 4, 2017

@littleowl I looked into the SpatialBN source code and found:

    .Input(
        1,
        "scale",
        "The scale as a 1-dimensional tensor of size C to be applied to the "
        "output.")

scale is a tensor of size C, but the PR sets its size to 1. So in my code, I set it to size C. My code:

@TranslatorRegistry.Register("BatchNorm")
def TranslateBatchNorm(layer, pretrained_blobs, is_test):
    caffe_op = BaseTranslate(layer, "SpatialBN")
    output = caffe_op.output[0]
    param = layer.batch_norm_param
    AddArgument(caffe_op, "is_test", is_test)
    AddArgument(caffe_op, "epsilon", param.eps)
    AddArgument(caffe_op, "order", "NCHW")

    caffe_op.input.extend([output + "_scale", output + "_bias", output + "_mean", output + "_var"])
    if not is_test:
        caffe_op.output.extend([output + "_mean", output + "_var", output + "_saved_mean", output + "_saved_var"])

    n_channels = pretrained_blobs[0].shape[0] # get C
    mean = utils.NumpyArrayToCaffe2Tensor(pretrained_blobs[0], output + '_mean')
    var = utils.NumpyArrayToCaffe2Tensor(pretrained_blobs[1], output + '_var')
    pretrained_blobs[2] = np.tile(pretrained_blobs[2], (n_channels, )) # set C
    scale = utils.NumpyArrayToCaffe2Tensor(pretrained_blobs[2], output + '_scale')

    # Create a zero bias array the same size as the scale, we'll let the following
    # Scale (Mul + Add operators in Caffe2) layer handle any bias, just like Caffe
    bias = utils.NumpyArrayToCaffe2Tensor(np.zeros_like(pretrained_blobs[2]), output + '_bias')

    return caffe_op, [scale, bias, mean, var]

It does not give me NaN with my prototxt.

@milewis1
Author

milewis1 commented May 4, 2017 via email

@danielhauagge

@milewis1 were you able to find a fix for this? I'm running into the exact same problem. I'm trying to get a ResNet50 translated from a Caffe model to a Caffe2 one. I already applied PR #469 that you proposed, which helped with the BatchNorm, but now I'm running into this issue with the convolutional layer.

@danielhauagge

danielhauagge commented Jul 11, 2017

I think I might have found an issue. In the caffe_translator code, if you add a print statement to ConvertTensorProtosToInitNet

def ConvertTensorProtosToInitNet(net_params, input_name):
    init_net = caffe2_pb2.NetDef()
    for tensor in net_params.protos:
        print tensor.name, list(tensor.dims) # <--- this line
        if len(tensor.float_data) == 0:
            raise RuntimeError("Only float tensors are supported in this util.")
        op = core.CreateOperator(
            "GivenTensorFill", [], [tensor.name],
            arg=[
                utils.MakeArgument("shape", list(tensor.dims)),
                utils.MakeArgument("values", tensor.float_data)])
        init_net.op.extend([op])
    init_net.op.extend([core.CreateOperator("ConstantFill", [], [input_name], shape=[1])])
    return init_net

you will see in the output that two tensors with the same name show up: first with the right number of dimensions (4), and then with just 1. See below:

conv_1_w [64L, 3L, 7L, 7L]       <--- first time shows up with correct number of dimensions
conv_1_b [2L, 2L, 2L, 2L, 2L]
conv_1_scale [64L]
conv_1_bias [64L]
conv_1_mean [64L]
conv_1_var [64L]
conv_1_w [64L]    <--- now shows up with only 1 dimension
conv_1_b [64L]

To check whether this was the issue, I hacked ConvertTensorProtosToInitNet to keep track of the tensor names already seen in the loop. If a tensor had already been seen, I continued to the next tensor without extending init_net.op. After this I got a different error:

RuntimeError: [enforce fail at elementwise_op.h:180] A.ndim() > B.ndim(). 4 vs 4. If you are doing broadcasting, input1 should have a smaller number of dimensions. Error from operator: 
input: "conv_1" input: "conv_1_w" output: "conv_1_internal" type: "Mul" arg { name: "axis" i: 1 } arg { name: "broadcast" i: 1 }
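
The duplicate-name check described above can be sketched without Caffe2 installed. This is only an illustration: `FakeTensor` is a hypothetical stand-in for the tensor protos, and `final_shapes` assumes the fill operators run in order so that a later fill with the same name overwrites an earlier one, which would explain why the convolution sees a 1-dimensional filter.

```python
from collections import namedtuple

# Hypothetical stand-in for a tensor proto: only name and dims are modeled.
FakeTensor = namedtuple("FakeTensor", ["name", "dims"])


def split_duplicates(tensors):
    """Keep the first tensor seen for each name; collect later repeats."""
    seen = set()
    kept, skipped = [], []
    for t in tensors:
        if t.name in seen:
            skipped.append(t)
            continue
        seen.add(t.name)
        kept.append(t)
    return kept, skipped


def final_shapes(tensors):
    """Assumed behavior: fills run in order, so the last fill for a name wins."""
    shapes = {}
    for t in tensors:
        shapes[t.name] = list(t.dims)
    return shapes


# Names and shapes taken from the printed output above (abridged).
tensors = [
    FakeTensor("conv_1_w", (64, 3, 7, 7)),
    FakeTensor("conv_1_scale", (64,)),
    FakeTensor("conv_1_w", (64,)),
]

print(final_shapes(tensors)["conv_1_w"])  # [64]
kept, skipped = split_duplicates(tensors)
print([t.name for t in skipped])  # ['conv_1_w']
```

Skipping the repeat keeps the 4-dimensional filter but silently drops the other layer's parameters, which is consistent with the hack only moving the failure to a later operator.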

@danielhauagge

danielhauagge commented Jul 11, 2017

OK, so it seems like the issue I mentioned in the previous comment has to do with the Scale layer being done in-place on the network I'm translating (ResNet50).

In the function TranslateScale, the variable named output is set to mul_op.output[0]. In my case, because the Scale layer is done in-place, its output and input have the same name, and that name is the name of the convolution layer, conv_1, not of the scale layer. That causes its parameter names to clash with those of the convolution layer. One thing I did was to change

output = mul_op.output[0]

to

output = layer.name

in TranslateScale. Now output has the value scale_1, which prevents the name clash I mentioned in the previous comment. I'm not sure whether this should be done throughout the code, though. Is there a reason to use .output[0] instead of layer.name?
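
The clash can be reproduced with a small standalone sketch. The `Layer` record and the `param_names` helper are hypothetical simplifications; the real translator works on Caffe layer protos, and only the `_w` / `_b` suffixes are taken from the output shown earlier.

```python
from collections import namedtuple

# Hypothetical, simplified layer record (not the Caffe LayerParameter proto).
Layer = namedtuple("Layer", ["name", "bottom", "top"])


def param_names(prefix):
    """Parameter blob names derived from a prefix, as in conv_1_w / conv_1_b."""
    return [prefix + "_w", prefix + "_b"]


def has_clash(names):
    """True if any blob name appears more than once."""
    return len(names) != len(set(names))


conv = Layer(name="conv_1", bottom=["data"], top=["conv_1"])
# In-place Scale layer: it reads and writes the blob named conv_1.
scale = Layer(name="scale_1", bottom=["conv_1"], top=["conv_1"])

# Prefix taken from the output blob: both layers produce conv_1_w / conv_1_b.
by_output = param_names(conv.top[0]) + param_names(scale.top[0])
# Prefix taken from the layer name: the Scale layer gets scale_1_w / scale_1_b.
by_name = param_names(conv.name) + param_names(scale.name)

print(has_clash(by_output))  # True
print(has_clash(by_name))    # False
```

Naming by layer.name only avoids the clash as long as layer names are themselves unique in the prototxt.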

Right now the network runs without errors but I still need to see if the output I get is correct.

@danielhauagge

danielhauagge commented Jul 11, 2017

OK, tested and seems like the network works fine now. I'll submit a PR with the change.

@milewis1
Author

milewis1 commented Jul 11, 2017 via email

@ARSwhut

ARSwhut commented Oct 21, 2017

I have the same problem.

RuntimeError Traceback (most recent call last)
in ()
9
10 # run the net and return prediction
---> 11 results = p.run([img])
12 #results = np.asarray(results)
13 #print "results shape: ", results.shape

RuntimeError: [enforce fail at tensor.h:671] i < dims_.size(). 0 vs 0. Exceeding ndim limit Error from operator:
input: "data" input: "conv1_w" input: "conv1_b" output: "conv1" type: "Conv" arg { name: "stride" i: 2 } arg { name: "pad" i: 0 } arg { name: "kernel" i: 3 } device_option { } engine: ""
** while accessing input: data

@yangqiongyongyu

Have you resolved them? @ARSwhut

@wm10240

wm10240 commented Dec 26, 2017

@ARSwhut Have you resolved it?

@milewis1
Author

milewis1 commented Dec 26, 2017 via email

@BIGBALLON

@danielhauagge thanks for your solution!! It works fine now!!!!
