This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Integrate Baidu-warpctc #2326

Merged
sxjscience merged 1 commit into apache:master from xlvector:warpctc on Jun 16, 2016

Conversation

@xlvector (Contributor) commented Jun 3, 2016

The code is not ready for merge! (Related to issue #2305)

I just want your help reviewing the code in plugin/warpctc/ to see if I have made any large mistakes. (There are many debug logs in the code; I will remove them at the end.)

The current code can run in CPU mode ("can run" means it will not crash; I have not checked whether the results are right).

For GPU mode, it will crash.

@xlvector changed the title from "Warpctc" to "Integrate Baidu-warpctc" on Jun 3, 2016
&cpu_alloc_bytes),
"Error: get_workspace_size in inf_test");
std::cout << cpu_alloc_bytes << std::endl;
void* ctc_cpu_workspace = malloc(cpu_alloc_bytes);
Review comment (Contributor):

Use ctx.get_host_space

@xlvector (Contributor, Author) commented Jun 4, 2016

Thanks all, I will try all your ideas today.

@xlvector (Contributor, Author) commented Jun 5, 2016

I added the following InferType function:

virtual bool InferType(std::vector<int> *in_type,
                       std::vector<int> *out_type,
                       std::vector<int> *aux_type) const {
  CHECK_LE(in_type->size(), this->ListArguments().size());
  in_type->clear();
  in_type->push_back(mshadow::kFloat32);
  in_type->push_back(mshadow::kInt32);
  out_type->clear();
  out_type->push_back(mshadow::kFloat32);
  return true;
}

I want the label to be of int type. In the Python code, I use symbol.Cast:

label = mx.sym.Cast(data = label, dtype = 'int32')

This works OK under CPU mode.

However, when I switch the ctx to GPU, I get the following error:

/dl/mxnet/dmlc-core/include/dmlc/logging.h:235: [11:51:03] src/ndarray/ndarray_function.cu:19: Check failed: (to->type_flag_) == (from.type_flag_) Source and target must have the same data type when copying across devices.

I think this occurs when copying the label data generated by the DataIter from CPU to GPU. Here, I find to->type_flag_ = 0 (float32) and from.type_flag_ = 4 (int32).

@xlvector (Contributor, Author) commented Jun 5, 2016

I found that this is because Copy<cpu, cpu> can perform a type cast:

void Copy<cpu, cpu>(const TBlob &from, TBlob *to,
                    Context from_ctx, Context to_ctx,
                    RunContext ctx) {
  MSHADOW_TYPE_SWITCH(to->type_flag_, DType, {
    if (to->type_flag_ == from.type_flag_) {
      mshadow::Copy(to->FlatTo2D<cpu, DType>(),
                    from.FlatTo2D<cpu, DType>());
    } else {
      MSHADOW_TYPE_SWITCH(from.type_flag_, SrcDType, {
        to->FlatTo2D<cpu, DType>() =
            mshadow::expr::tcast<DType>(from.FlatTo2D<cpu, SrcDType>());
      })
    }
  })
}

But Copy<cpu, gpu> cannot perform a type cast.

So my problem is that I still cannot get an int-typed label, even after implementing InferType.
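The asymmetry described above can be sketched in Python, with NumPy arrays standing in for the two copy paths (a conceptual sketch, not MXNet's actual implementation): the same-device copy casts on the fly, while the cross-device path insists on matching dtypes.

```python
import numpy as np

def copy_cpu_cpu(src, dst):
    """Sketch of Copy<cpu, cpu>: casts when dtypes differ (the tcast branch)."""
    dst[...] = src.astype(dst.dtype)

def copy_cpu_gpu(src, dst):
    """Sketch of Copy<cpu, gpu>: refuses to cast across devices."""
    assert src.dtype == dst.dtype, (
        "Source and target must have the same data type "
        "when copying across devices.")
    dst[...] = src

labels_int32 = np.array([3, 1, 7, 5], dtype=np.int32)
host_buf = np.empty(4, dtype=np.float32)
copy_cpu_cpu(labels_int32, host_buf)    # fine: int32 cast to float32
# copy_cpu_gpu(labels_int32, host_buf)  # would raise: dtype mismatch
```

This is why the int32 label from the CPU DataIter triggers the check failure only on the device-copy path.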

@xlvector (Contributor, Author) commented Jun 5, 2016

WarpCTC needs the label to be an int* in the CPU context.

How can I keep the other symbols on the GPU while keeping the label on the CPU?

@tqchen (Member) commented Jun 5, 2016

Currently the assumption is that the inputs need to be in the same context, due to the type-inference algorithm.

The simplest way is to copy the GPU array to CPU during computation on demand. @antinucleon knows how to allocate a cpu temp space.

Stream<xpu> *s = ctx.get_stream<xpu>();
Tensor<xpu, 2> grad = in_grad[warpctc_enum::kData].FlatTo2D<xpu, real_t>(s);
grad.dptr_ = grads;
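The suggestion above — copy device data to the host only when the CPU-only kernel needs it, then write the result back — can be sketched in NumPy terms (a conceptual illustration; `device_acts` stands in for a GPU tensor, and `compute_grads` is a hypothetical stand-in for warp-ctc's CPU kernel):

```python
import numpy as np

def backward_via_host(device_acts, labels, compute_grads):
    """On-demand host round-trip: device -> host, CPU kernel, host -> device.

    `device_acts` stands in for a GPU tensor; `compute_grads` stands in for
    the CPU-only warp-ctc gradient computation (hypothetical signature).
    """
    host_acts = np.array(device_acts, copy=True)   # device -> host copy
    host_grads = compute_grads(host_acts, labels)  # CPU-only computation
    device_acts[...] = host_grads                  # host -> device copy-back
    return device_acts
```

The key design point is that the host buffer is temporary and allocated per call, matching tqchen's "cpu temp space" advice.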

@xlvector (Contributor, Author) commented Jun 6, 2016

I failed to test under GPU, so I copy memory to the CPU, run compute_ctc_loss, and get grads. I want to set in_grad to grads.

Here, I do not know whether in_grad can take values from grads.

@tqchen (Member) commented Jun 9, 2016

Any updates on this?

@xlvector (Contributor, Author) commented Jun 9, 2016

CPU works now. I am working on GPU these days.

@xlvector (Contributor, Author) commented:

I have made everything work now (on both CPU and GPU), and I am working on writing an example to verify the correctness of CTC. I will submit the code next week.

Makefile Outdated
ifeq ($(USE_OPENCV), 1)
CFLAGS += -DMXNET_USE_OPENCV=1 `pkg-config --cflags opencv`
LDFLAGS += `pkg-config --libs opencv`
LDFLAGS += -lopencv_core -lopencv_highgui -lopencv_imgproc -lopencv_imgcodecs
Review comment (Contributor):

why change this?

@xlvector (Contributor, Author) replied:

I will revert these when I do the final submit. Thanks!

@xlvector force-pushed the warpctc branch 3 times, most recently from 06fd3f1 to 97a4891, on June 12, 2016 14:20
@xlvector (Contributor, Author) commented:

After rebasing on upstream/master, I found that pull request #2358 causes a crash when doing evaluation. I have commented on #2358.

@sxjscience merged commit 2279cbf into apache:master on Jun 16, 2016
@Yang507 commented Jul 1, 2016

@xlvector When running lstm_ocr.py, one error happens:
TypeError: must be sequence of length 4, not 2
I just changed the fonts of ImageCaptcha, because I cannot find the file (./data/Xerox.ttf).

@xlvector (Contributor, Author) commented Jul 1, 2016

@Yang507 You can download Xerox.ttf from http://www.webpagepublicity.com/free-fonts/x/Xerox%20Sans%20Serif%20Narrow.ttf and put it in ./data/

Can you paste the full error message?

@livedimg commented Jul 2, 2016

I use cnn_ocr.py (https://github.com/xlvector/learning-dl/tree/master/mxnet/ocr).
Why is Train-Accuracy = 0.000000 after batch 1300? And this batch number always changes on the next training run:
2016-07-01 12:27:17,341 Epoch[0] Batch [50] Speed: 181.20 samples/sec Train-Accuracy=0.000000
2016-07-01 12:27:26,751 Epoch[0] Batch [100] Speed: 170.05 samples/sec Train-Accuracy=0.000000
2016-07-01 12:27:36,490 Epoch[0] Batch [150] Speed: 164.29 samples/sec Train-Accuracy=0.000000
2016-07-01 12:27:46,196 Epoch[0] Batch [200] Speed: 164.85 samples/sec Train-Accuracy=0.000400
2016-07-01 12:27:56,264 Epoch[0] Batch [250] Speed: 158.92 samples/sec Train-Accuracy=0.001600
2016-07-01 12:28:05,724 Epoch[0] Batch [300] Speed: 169.14 samples/sec Train-Accuracy=0.004000
2016-07-01 12:28:15,906 Epoch[0] Batch [350] Speed: 157.14 samples/sec Train-Accuracy=0.001600
2016-07-01 12:28:25,906 Epoch[0] Batch [400] Speed: 160.00 samples/sec Train-Accuracy=0.000400
2016-07-01 12:28:35,385 Epoch[0] Batch [450] Speed: 168.79 samples/sec Train-Accuracy=0.006800
2016-07-01 12:28:45,341 Epoch[0] Batch [500] Speed: 160.71 samples/sec Train-Accuracy=0.023200
2016-07-01 12:28:54,888 Epoch[0] Batch [550] Speed: 167.60 samples/sec Train-Accuracy=0.076000
2016-07-01 12:29:04,616 Epoch[0] Batch [600] Speed: 164.47 samples/sec Train-Accuracy=0.122800
2016-07-01 12:29:14,225 Epoch[0] Batch [650] Speed: 166.50 samples/sec Train-Accuracy=0.161600
2016-07-01 12:29:23,789 Epoch[0] Batch [700] Speed: 167.30 samples/sec Train-Accuracy=0.172400
2016-07-01 12:29:33,543 Epoch[0] Batch [750] Speed: 164.03 samples/sec Train-Accuracy=0.175200
2016-07-01 12:29:43,938 Epoch[0] Batch [800] Speed: 153.93 samples/sec Train-Accuracy=0.242800
2016-07-01 12:29:53,821 Epoch[0] Batch [850] Speed: 161.88 samples/sec Train-Accuracy=0.264800
2016-07-01 12:30:03,501 Epoch[0] Batch [900] Speed: 165.30 samples/sec Train-Accuracy=0.282400
2016-07-01 12:30:12,893 Epoch[0] Batch [950] Speed: 170.37 samples/sec Train-Accuracy=0.305200
2016-07-01 12:30:22,977 Epoch[0] Batch [1000] Speed: 158.66 samples/sec Train-Accuracy=0.304000
2016-07-01 12:30:32,574 Epoch[0] Batch [1050] Speed: 166.72 samples/sec Train-Accuracy=0.265600
2016-07-01 12:30:42,185 Epoch[0] Batch [1100] Speed: 166.48 samples/sec Train-Accuracy=0.000400
2016-07-01 12:30:51,542 Epoch[0] Batch [1150] Speed: 170.99 samples/sec Train-Accuracy=0.004800
2016-07-01 12:31:01,694 Epoch[0] Batch [1200] Speed: 157.61 samples/sec Train-Accuracy=0.000000
2016-07-01 12:31:11,258 Epoch[0] Batch [1250] Speed: 167.30 samples/sec Train-Accuracy=0.000400
2016-07-01 12:31:20,939 Epoch[0] Batch [1300] Speed: 165.27 samples/sec Train-Accuracy=0.000000
2016-07-01 12:31:31,098 Epoch[0] Batch [1350] Speed: 157.51 samples/sec Train-Accuracy=0.000000
2016-07-01 12:31:40,550 Epoch[0] Batch [1400] Speed: 169.26 samples/sec Train-Accuracy=0.000000
2016-07-01 12:31:50,023 Epoch[0] Batch [1450] Speed: 168.91 samples/sec Train-Accuracy=0.000000
2016-07-01 12:31:59,475 Epoch[0] Batch [1500] Speed: 169.27 samples/sec Train-Accuracy=0.000400
2016-07-01 12:32:08,882 Epoch[0] Batch [1550] Speed: 170.09 samples/sec Train-Accuracy=0.000400
2016-07-01 12:32:18,604 Epoch[0] Batch [1600] Speed: 164.58 samples/sec Train-Accuracy=0.000000

@xlvector (Contributor, Author) commented Jul 2, 2016

@livedimg We can discuss this in that repo.

@Yang507 commented Jul 3, 2016

@xlvector
[Error]:
begin fit
infer label shape: 128
2016-07-01 09:54:48,300 Start training with [gpu(3)]
infer label shape: 128
infer label shape: 128
infer label shape: 128
iter
Traceback (most recent call last):
File "lstm_ocr.py", line 165, in <module>
batch_end_callback=mx.callback.Speedometer(BATCH_SIZE, 50),)
File "../../python/mxnet/model.py", line 788, in fit
sym_gen=self.sym_gen)
File "../../python/mxnet/model.py", line 221, in _train_multi_device
for data_batch in train_data:
File "lstm_ocr.py", line 66, in iter
img = self.captcha.generate(num)
File "/usr/local/lib/python2.7/dist-packages/captcha/image.py", line 40, in generate
im = self.generate_image(chars)
File "/usr/local/lib/python2.7/dist-packages/captcha/image.py", line 220, in generate_image
self.create_noise_curve(im, color)
File "/usr/local/lib/python2.7/dist-packages/captcha/image.py", line 132, in create_noise_curve
Draw(image).arc(points, start, end, fill=color)
File "/usr/lib/python2.7/dist-packages/PIL/ImageDraw.py", line 164, in arc
self.draw.draw_arc(xy, start, end, ink)
TypeError: must be sequence of length 4, not 2

Thanks! I downloaded Xerox.ttf from the link and put it in my own directory, but I get the same error.

@xlvector (Contributor, Author) commented Jul 4, 2016

It seems this error is just because of captcha. Can you write a test script to check that python-captcha works on its own?
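A throwaway script along these lines can isolate the captcha problem from MXNet (a sketch, assuming `pip install captcha Pillow`; `ImageCaptcha` and `generate` are the names that appear in the traceback above):

```python
def captcha_smoke_test(text="1234"):
    """Try to generate one captcha image; report the outcome rather than crash."""
    try:
        from captcha.image import ImageCaptcha
    except ImportError:
        return "captcha not installed"
    try:
        data = ImageCaptcha().generate(text)  # BytesIO holding a PNG
    except Exception as e:                    # e.g. the Pillow arc() TypeError
        return "failed: %s" % e
    return "ok" if data.getvalue() else "failed: empty image"

if __name__ == "__main__":
    print(captcha_smoke_test())
```

If this prints the TypeError, the bug is entirely in captcha/Pillow and has nothing to do with lstm_ocr.py.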

@Yang507 commented Jul 5, 2016

I changed the learning rate to 0.005, and the Train-Accuracy is close to 1.
One thing is important for me: I have some captcha images, but I don't know how to feed the images into the code and train on them @xlvector. I'm very sorry to take your time, and thank you for your help.

@sinmaystar commented:

@Yang507 I have met the same problem as you: when I use the captcha.generate() method, it raises the same TypeError: must be sequence of length 4, not 2. What did you do to solve this problem?

@Yang507 commented Jul 14, 2016

@sinmaystar You could run the following command in your terminal: sudo pip install -U Pillow, then run lstm_ocr.py again. I almost forgot about it.

@sinmaystar commented:

@Yang507 Is just updating the Pillow package enough? My Pillow version is 2.3.0; what is your Pillow version?

@Yang507 commented Jul 14, 2016

My Pillow version is 3.3 now. I forget whether I had the Pillow package before I updated it. @sinmaystar

@sinmaystar commented:

@Yang507 Can you run this code successfully right now? Are you sure it's Pillow's problem? I have updated Pillow but it's still not working.

@xlvector (Contributor, Author) commented:

@sinmaystar you can refer to lepture/captcha#7

@Yang507 commented Jul 18, 2016

@xlvector I met the same problem as you when I ran lstm_ocr.py with my own data (number=18); the following error occurs:

 Traceback (most recent call last):
 File "lstm_ocr.py", line 165, in <module>
     batch_end_callback=mx.callback.Speedometer(BATCH_SIZE, 50))
 File "../../python/mxnet/model.py", line 745, in fit
    self._init_params(dict(data.provide_data+data.provide_label))
 File "../../python/mxnet/model.py", line 486, in _init_params
    assert(arg_shapes is not None)
 AssertionError

I know my data format does not match the random captcha format, but I don't know how to write an iterator to load my captcha data and train with lstm_ocr.py. I would appreciate your help.
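As a starting point for a custom iterator, the label side is the easy half and can be sketched independently of MXNet. The snippet below is a hypothetical helper that derives int32 labels from file names like 3175.png (image decoding and the actual mx.io.DataIter subclass are left out); the +1 shift assumes, as in the warpctc example's convention, that class 0 is reserved for the CTC blank — drop it if your label convention differs.

```python
import os
import numpy as np

def labels_from_filenames(paths, num_digits=4):
    """Hypothetical helper: parse digit labels from names like '3175.png'.

    Returns an int32 array of shape (batch, num_digits). Assumes class 0
    is the CTC blank, so digit d is stored as d + 1.
    """
    labels = np.zeros((len(paths), num_digits), dtype=np.int32)
    for i, path in enumerate(paths):
        stem = os.path.splitext(os.path.basename(path))[0]
        assert len(stem) >= num_digits and stem[:num_digits].isdigit(), path
        labels[i] = [int(c) + 1 for c in stem[:num_digits]]
    return labels
```

A DataIter wrapping this would pair each label row with the decoded, resized image batch.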

@thewintersun commented:

Could you write a bit more introduction in your warpctc example?
There are no comments at all...

@xlvector (Contributor, Author) commented:

@thewintersun Actually, compared with the example in examples/rnn, only one line changed.
For a brief introduction, see https://zhuanlan.zhihu.com/p/21344595?refer=xlvector

@anxingle (Contributor) commented Jul 21, 2016

@xlvector The log from your LSTM training shows the network takes less than 40 minutes to converge, finishing within epoch 0. So I wonder: did you set BATCH_SIZE (in lstm_ocr) very large (160 or 320)? Or the learning rate a little lower (not 0.005)? I ran lstm_ocr.py on a GTX 980 Ti; the result is not as good as yours (about 0.09) and took me about 3 hours.

@sinmaystar commented:

@anxingle Did you run the lstm_ocr.py example successfully?
How did you integrate Baidu warp-ctc?
I have installed Baidu warp-ctc and followed the README instructions, but I get an error:
AttributeError: 'module' object has no attribute 'WarpCTC'
How can I solve this problem?

@anxingle (Contributor) commented:

@sinmaystar Yeah, it's really hard to compile. Did you set $LD_LIBRARY_PATH (so that libwarpctc.so is on the path) and edit make/config.mk so that warpctc_path points to warp-ctc?
Can you tell me where your warp-ctc is, what your config.mk looks like, and any more info?

@sinmaystar commented:

I did not set LD_LIBRARY_PATH. I have commented out two lines as the introduction says; my warp-ctc and mxnet are both in the home directory. Thanks for your help; I will retry it tomorrow. If it does not work, maybe I will ask you again. Thanks a lot! You are very kind.

@xlvector mentioned this pull request on Jul 21, 2016
@sinmaystar commented:

@anxingle Where can I find the libwarpctc.so file, and how did you set $LD_LIBRARY_PATH? I cannot find libwarpctc.so after compiling warp-ctc.

@anxingle (Contributor) commented:

@sinmaystar It is in warp-ctc/build. You can use this command to find it: find /home/$USER/warp-ctc/ -name libwarpctc.so. It shows the message: /home/a/warp-ctc/build/libwarpctc.so. Then edit ~/.bashrc and add the line export LD_LIBRARY_PATH="/home/a/warp-ctc/build:/usr/local/cuda-7.5/lib64:$LD_LIBRARY_PATH" (this is my own path) to the end of the file.

@sinmaystar commented:

@anxingle Thanks for your help; I have run this example successfully.
I am working on adding a CNN layer before the LSTM, so it can extract word features and resist rotation, but I do not know how to add a simple CNN layer before the LSTM. Do you have any suggestions?

@yuzhuqingyun commented:

When predicting with:
model_load = mx.model.FeedForward.load('ocr', 15)
[prob, data1, label1] = model_load.predict(data_val, return_data=True)
I get the error:
self.ctx[0], grad_req = 'null', **dict(input_shapes)
ValueError: Input node is not complete
How can I predict with one picture?

@BrianZhu01 commented:

I also met this error. How did you integrate Baidu warp-ctc?
I have installed Baidu warp-ctc and followed the README instructions, but I get an error:
AttributeError: 'module' object has no attribute 'WarpCTC'
How can I solve this problem?
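This AttributeError usually means the mxnet package being imported was not built with the warpctc plugin enabled. A small diagnostic (a sketch, assuming only that mxnet, if importable, exposes its symbols under mxnet.symbol) can tell the failure modes apart:

```python
import importlib

def warpctc_status():
    """Report whether the installed mxnet (if any) exposes the WarpCTC symbol."""
    try:
        mx = importlib.import_module("mxnet")
    except Exception:
        return "mxnet not importable"
    sym = getattr(mx, "symbol", None)
    if sym is not None and hasattr(sym, "WarpCTC"):
        return "WarpCTC available"
    return "mxnet built without the warpctc plugin"

if __name__ == "__main__":
    print(warpctc_status())
```

If it reports the plugin is missing, enable warpctc in make/config.mk, rebuild, and reinstall the Python package before retrying.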

@xiazi-yu commented:

Hello, when I use mxnet_predict.py in mxnet for prediction, it seems the GPU option is not supported; it can only run on the CPU. With the GPU it hangs and reports the error "terminate called without an active exception".

@livedimg commented Sep 25, 2017 via email

