This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Integrate Baidu-warpctc #2326

Merged
sxjscience merged 1 commit into apache:master from xlvector:warpctc on Jun 16, 2016

Conversation

@xlvector (Contributor) commented Jun 3, 2016

The code is not ready for merge! (Related to issue #2305)

I just want your help reviewing the code in plugin/warpctc/ to see if I have made any large mistakes. (There are many debug logs in the code; I will remove them at the end.)

The current code can run in CPU mode ("can run" means it will not crash; I have not checked whether the results are right).

For GPU mode, it will crash.

@xlvector changed the title from "Warpctc" to "Integrate Baidu-warpctc" on Jun 3, 2016
&cpu_alloc_bytes),
"Error: get_workspace_size in inf_test");
std::cout << cpu_alloc_bytes << std::endl;
void* ctc_cpu_workspace = malloc(cpu_alloc_bytes);
Review comment (Contributor):

Use ctx.get_host_space

@xlvector (Contributor, Author) commented Jun 4, 2016

Thanks all, I will try all your ideas today.

@xlvector (Contributor, Author) commented Jun 5, 2016

I added the following InferType function:

virtual bool InferType(std::vector<int> *in_type,
                       std::vector<int> *out_type,
                       std::vector<int> *aux_type) const {
  CHECK_LE(in_type->size(), this->ListArguments().size());
  in_type->clear();
  in_type->push_back(mshadow::kFloat32);
  in_type->push_back(mshadow::kInt32);
  out_type->clear();
  out_type->push_back(mshadow::kFloat32);
  return true;
}

I want the label to be of int type. In the Python code, I use symbol.Cast:

label = mx.sym.Cast(data = label, dtype = 'int32')

This works OK under CPU mode.

However, when I switch the ctx to GPU, I get the following error:

/dl/mxnet/dmlc-core/include/dmlc/logging.h:235: [11:51:03] src/ndarray/ndarray_function.cu:19: Check failed: (to->type_flag_) == (from.type_flag_) Source and target must have the same data type when copying across devices.

I think this occurs when copying the label data generated by the DataIter from CPU to GPU. Here, I find to->type_flag_ = 0 (float32) and from.type_flag_ = 4 (int32).

@xlvector (Contributor, Author) commented Jun 5, 2016

I found that this is because Copy<cpu, cpu> can perform a type cast:

void Copy<cpu, cpu>(const TBlob &from, TBlob *to,
                    Context from_ctx, Context to_ctx,
                    RunContext ctx) {
  MSHADOW_TYPE_SWITCH(to->type_flag_, DType, {
    if (to->type_flag_ == from.type_flag_) {
      mshadow::Copy(to->FlatTo2D<cpu, DType>(),
                    from.FlatTo2D<cpu, DType>());
    } else {
      MSHADOW_TYPE_SWITCH(from.type_flag_, SrcDType, {
        to->FlatTo2D<cpu, DType>() =
            mshadow::expr::tcast<DType>(from.FlatTo2D<cpu, SrcDType>());
      })
    }
  })
}

But Copy<cpu, gpu> cannot perform a type cast.

So my problem is that I still cannot get an int-typed label, even after implementing InferType.
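The asymmetry described above can be sketched in Python, with NumPy arrays standing in for the two copy paths (a conceptual sketch, not MXNet's actual implementation): the same-device copy casts on the fly, while the cross-device path insists on matching dtypes.

```python
import numpy as np

def copy_cpu_cpu(src, dst):
    """Sketch of Copy<cpu, cpu>: casts when dtypes differ (the tcast branch)."""
    dst[...] = src.astype(dst.dtype)

def copy_cpu_gpu(src, dst):
    """Sketch of Copy<cpu, gpu>: refuses to cast across devices."""
    assert src.dtype == dst.dtype, (
        "Source and target must have the same data type "
        "when copying across devices.")
    dst[...] = src

labels_int32 = np.array([3, 1, 7, 5], dtype=np.int32)
host_buf = np.empty(4, dtype=np.float32)
copy_cpu_cpu(labels_int32, host_buf)    # fine: int32 cast to float32
# copy_cpu_gpu(labels_int32, host_buf)  # would raise: dtype mismatch
```

This is why the int32 label from the CPU DataIter triggers the check failure only on the device-copy path.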

@xlvector (Contributor, Author) commented Jun 5, 2016

WarpCTC needs the label to be an int* in the CPU context.

How can I keep the other symbols on the GPU while keeping the label on the CPU?

@tqchen (Member) commented Jun 5, 2016

Currently the assumption is that the inputs need to be in the same context, due to the type-inference algorithm.

The simplest way is to copy the GPU array to CPU during computation on demand. @antinucleon knows how to allocate a cpu temp space.

Stream<xpu> *s = ctx.get_stream<xpu>();
Tensor<xpu, 2> grad = in_grad[warpctc_enum::kData].FlatTo2D<xpu, real_t>(s);
grad.dptr_ = grads;
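The suggestion above — copy device data to the host only when the CPU-only kernel needs it, then write the result back — can be sketched in NumPy terms (a conceptual illustration; `device_acts` stands in for a GPU tensor, and `compute_grads` is a hypothetical stand-in for warp-ctc's CPU kernel):

```python
import numpy as np

def backward_via_host(device_acts, labels, compute_grads):
    """On-demand host round-trip: device -> host, CPU kernel, host -> device.

    `device_acts` stands in for a GPU tensor; `compute_grads` stands in for
    the CPU-only warp-ctc gradient computation (hypothetical signature).
    """
    host_acts = np.array(device_acts, copy=True)   # device -> host copy
    host_grads = compute_grads(host_acts, labels)  # CPU-only computation
    device_acts[...] = host_grads                  # host -> device copy-back
    return device_acts
```

The key design point is that the host buffer is temporary and allocated per call, matching tqchen's "cpu temp space" advice.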

@xlvector (Contributor, Author) commented Jun 6, 2016

I failed to test under GPU, so I copy memory to the CPU, run compute_ctc_loss, and get grads. I want to set in_grad to grads.

Here, I do not know whether in_grad can take values from grads.

@tqchen (Member) commented Jun 9, 2016

Any updates on this?

@xlvector (Contributor, Author) commented Jun 9, 2016

CPU works now. I am working on GPU these days.

@xlvector (Contributor, Author) commented:

I have made everything work now (on both CPU and GPU), and I am working on writing an example to verify the correctness of CTC. I will submit the code next week.

Makefile Outdated
ifeq ($(USE_OPENCV), 1)
CFLAGS += -DMXNET_USE_OPENCV=1 `pkg-config --cflags opencv`
LDFLAGS += `pkg-config --libs opencv`
LDFLAGS += -lopencv_core -lopencv_highgui -lopencv_imgproc -lopencv_imgcodecs
Review comment (Contributor):

why change this?

@xlvector (Contributor, Author) replied:

I will revert these when I do the final submit. Thanks!

@xlvector force-pushed the warpctc branch 3 times, most recently from 06fd3f1 to 97a4891, on June 12, 2016 14:20
@xlvector (Contributor, Author) commented:

After rebasing on upstream/master, I found that pull request #2358 causes a crash when doing evaluation. I have commented on #2358.

@sxjscience merged commit 2279cbf into apache:master on Jun 16, 2016
@Yang507 commented Jul 1, 2016

@xlvector When running lstm_ocr.py, one error happens:
TypeError: must be sequence of length 4, not 2
I just changed the fonts of ImageCaptcha, because I cannot find the file (./data/Xerox.ttf).

@xlvector (Contributor, Author) commented Jul 1, 2016

@Yang507 You can download Xerox.ttf from http://www.webpagepublicity.com/free-fonts/x/Xerox%20Sans%20Serif%20Narrow.ttf and put it in ./data/

Can you paste the full error message?

@livedimg commented Jul 2, 2016

I use cnn_ocr.py (https://github.com/xlvector/learning-dl/tree/master/mxnet/ocr).
Why is Train-Accuracy = 0.000000 after batch 1300? And this batch number always changes on the next training run:
2016-07-01 12:27:17,341 Epoch[0] Batch [50] Speed: 181.20 samples/sec Train-Accuracy=0.000000
2016-07-01 12:27:26,751 Epoch[0] Batch [100] Speed: 170.05 samples/sec Train-Accuracy=0.000000
2016-07-01 12:27:36,490 Epoch[0] Batch [150] Speed: 164.29 samples/sec Train-Accuracy=0.000000
2016-07-01 12:27:46,196 Epoch[0] Batch [200] Speed: 164.85 samples/sec Train-Accuracy=0.000400
2016-07-01 12:27:56,264 Epoch[0] Batch [250] Speed: 158.92 samples/sec Train-Accuracy=0.001600
2016-07-01 12:28:05,724 Epoch[0] Batch [300] Speed: 169.14 samples/sec Train-Accuracy=0.004000
2016-07-01 12:28:15,906 Epoch[0] Batch [350] Speed: 157.14 samples/sec Train-Accuracy=0.001600
2016-07-01 12:28:25,906 Epoch[0] Batch [400] Speed: 160.00 samples/sec Train-Accuracy=0.000400
2016-07-01 12:28:35,385 Epoch[0] Batch [450] Speed: 168.79 samples/sec Train-Accuracy=0.006800
2016-07-01 12:28:45,341 Epoch[0] Batch [500] Speed: 160.71 samples/sec Train-Accuracy=0.023200
2016-07-01 12:28:54,888 Epoch[0] Batch [550] Speed: 167.60 samples/sec Train-Accuracy=0.076000
2016-07-01 12:29:04,616 Epoch[0] Batch [600] Speed: 164.47 samples/sec Train-Accuracy=0.122800
2016-07-01 12:29:14,225 Epoch[0] Batch [650] Speed: 166.50 samples/sec Train-Accuracy=0.161600
2016-07-01 12:29:23,789 Epoch[0] Batch [700] Speed: 167.30 samples/sec Train-Accuracy=0.172400
2016-07-01 12:29:33,543 Epoch[0] Batch [750] Speed: 164.03 samples/sec Train-Accuracy=0.175200
2016-07-01 12:29:43,938 Epoch[0] Batch [800] Speed: 153.93 samples/sec Train-Accuracy=0.242800
2016-07-01 12:29:53,821 Epoch[0] Batch [850] Speed: 161.88 samples/sec Train-Accuracy=0.264800
2016-07-01 12:30:03,501 Epoch[0] Batch [900] Speed: 165.30 samples/sec Train-Accuracy=0.282400
2016-07-01 12:30:12,893 Epoch[0] Batch [950] Speed: 170.37 samples/sec Train-Accuracy=0.305200
2016-07-01 12:30:22,977 Epoch[0] Batch [1000] Speed: 158.66 samples/sec Train-Accuracy=0.304000
2016-07-01 12:30:32,574 Epoch[0] Batch [1050] Speed: 166.72 samples/sec Train-Accuracy=0.265600
2016-07-01 12:30:42,185 Epoch[0] Batch [1100] Speed: 166.48 samples/sec Train-Accuracy=0.000400
2016-07-01 12:30:51,542 Epoch[0] Batch [1150] Speed: 170.99 samples/sec Train-Accuracy=0.004800
2016-07-01 12:31:01,694 Epoch[0] Batch [1200] Speed: 157.61 samples/sec Train-Accuracy=0.000000
2016-07-01 12:31:11,258 Epoch[0] Batch [1250] Speed: 167.30 samples/sec Train-Accuracy=0.000400
2016-07-01 12:31:20,939 Epoch[0] Batch [1300] Speed: 165.27 samples/sec Train-Accuracy=0.000000
2016-07-01 12:31:31,098 Epoch[0] Batch [1350] Speed: 157.51 samples/sec Train-Accuracy=0.000000
2016-07-01 12:31:40,550 Epoch[0] Batch [1400] Speed: 169.26 samples/sec Train-Accuracy=0.000000
2016-07-01 12:31:50,023 Epoch[0] Batch [1450] Speed: 168.91 samples/sec Train-Accuracy=0.000000
2016-07-01 12:31:59,475 Epoch[0] Batch [1500] Speed: 169.27 samples/sec Train-Accuracy=0.000400
2016-07-01 12:32:08,882 Epoch[0] Batch [1550] Speed: 170.09 samples/sec Train-Accuracy=0.000400
2016-07-01 12:32:18,604 Epoch[0] Batch [1600] Speed: 164.58 samples/sec Train-Accuracy=0.000000

@xlvector (Contributor, Author) commented Jul 2, 2016

@livedimg We can discuss this in that repo.

@Yang507 commented Jul 3, 2016

@xlvector
[Error]:
begin fit
infer label shape: 128
2016-07-01 09:54:48,300 Start training with [gpu(3)]
infer label shape: 128
infer label shape: 128
infer label shape: 128
iter
Traceback (most recent call last):
File "lstm_ocr.py", line 165, in <module>
batch_end_callback=mx.callback.Speedometer(BATCH_SIZE, 50),)
File "../../python/mxnet/model.py", line 788, in fit
sym_gen=self.sym_gen)
File "../../python/mxnet/model.py", line 221, in _train_multi_device
for data_batch in train_data:
File "lstm_ocr.py", line 66, in iter
img = self.captcha.generate(num)
File "/usr/local/lib/python2.7/dist-packages/captcha/image.py", line 40, in generate
im = self.generate_image(chars)
File "/usr/local/lib/python2.7/dist-packages/captcha/image.py", line 220, in generate_image
self.create_noise_curve(im, color)
File "/usr/local/lib/python2.7/dist-packages/captcha/image.py", line 132, in create_noise_curve
Draw(image).arc(points, start, end, fill=color)
File "/usr/lib/python2.7/dist-packages/PIL/ImageDraw.py", line 164, in arc
self.draw.draw_arc(xy, start, end, ink)
TypeError: must be sequence of length 4, not 2

Thanks! I downloaded Xerox.ttf from the link and put it in my own directory, but I get the same error.

@xlvector (Contributor, Author) commented Jul 4, 2016

It seems this error is just because of captcha. Can you write a test script to check that python-captcha works on its own?
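A throwaway script along these lines can isolate the captcha problem from MXNet (a sketch, assuming `pip install captcha Pillow`; `ImageCaptcha` and `generate` are the names that appear in the traceback above):

```python
def captcha_smoke_test(text="1234"):
    """Try to generate one captcha image; report the outcome rather than crash."""
    try:
        from captcha.image import ImageCaptcha
    except ImportError:
        return "captcha not installed"
    try:
        data = ImageCaptcha().generate(text)  # BytesIO holding a PNG
    except Exception as e:                    # e.g. the Pillow arc() TypeError
        return "failed: %s" % e
    return "ok" if data.getvalue() else "failed: empty image"

if __name__ == "__main__":
    print(captcha_smoke_test())
```

If this prints the TypeError, the bug is entirely in captcha/Pillow and has nothing to do with lstm_ocr.py.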

@Yang507 commented Jul 5, 2016

I changed the learning rate to 0.005, and the Train-Accuracy is close to 1.
One thing is important for me: I have some captcha images, but I don't know how to feed the images into the code and train on them @xlvector. I'm very sorry to take your time, and thank you for your help.

@sinmaystar commented:

@Yang507 I have met the same problem as you: when I use the captcha.generate() method, it raises the same TypeError: must be sequence of length 4, not 2. What did you do to solve this problem?

@Yang507 commented Jul 14, 2016

@sinmaystar You could run the following command in your terminal: sudo pip install -U Pillow, then run lstm_ocr.py again. I almost forgot about it.

@sinmaystar commented:

@Yang507 Is just updating the Pillow package enough? My Pillow version is 2.3.0; what is your Pillow version?

@Yang507 commented Jul 14, 2016

My Pillow version is 3.3 now. I forget whether I had the Pillow package before I updated it. @sinmaystar

@sinmaystar commented:

@Yang507 Can you run this code successfully right now? Are you sure it's Pillow's problem? I have updated Pillow but it's still not working.

@xlvector (Contributor, Author) commented:

@sinmaystar you can refer to lepture/captcha#7

@Yang507 commented Jul 18, 2016

@xlvector I met the same problem as you when I ran lstm_ocr.py with my own data (number=18); the following error occurs:

 Traceback (most recent call last):
 File "lstm_ocr.py", line 165, in <module>
     batch_end_callback=mx.callback.Speedometer(BATCH_SIZE, 50))
 File "../../python/mxnet/model.py", line 745, in fit
    self._init_params(dict(data.provide_data+data.provide_label))
 File "../../python/mxnet/model.py", line 486, in _init_params
    assert(arg_shapes is not None)
 AssertionError

I know my data format does not match the random captcha format, but I don't know how to write an iterator to load my captcha data and train with lstm_ocr.py. I would appreciate your help.
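As a starting point for a custom iterator, the label side is the easy half and can be sketched independently of MXNet. The snippet below is a hypothetical helper that derives int32 labels from file names like 3175.png (image decoding and the actual mx.io.DataIter subclass are left out); the +1 shift assumes, as in the warpctc example's convention, that class 0 is reserved for the CTC blank — drop it if your label convention differs.

```python
import os
import numpy as np

def labels_from_filenames(paths, num_digits=4):
    """Hypothetical helper: parse digit labels from names like '3175.png'.

    Returns an int32 array of shape (batch, num_digits). Assumes class 0
    is the CTC blank, so digit d is stored as d + 1.
    """
    labels = np.zeros((len(paths), num_digits), dtype=np.int32)
    for i, path in enumerate(paths):
        stem = os.path.splitext(os.path.basename(path))[0]
        assert len(stem) >= num_digits and stem[:num_digits].isdigit(), path
        labels[i] = [int(c) + 1 for c in stem[:num_digits]]
    return labels
```

A DataIter wrapping this would pair each label row with the decoded, resized image batch.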

@thewintersun commented:

Could you write a bit more introduction in your warpctc example?
There are no comments at all...

@xlvector (Contributor, Author) commented:

@thewintersun Actually, compared with the example in examples/rnn, only one line changed.
For a brief introduction, see https://zhuanlan.zhihu.com/p/21344595?refer=xlvector

@anxingle (Contributor) commented Jul 21, 2016

@xlvector The log from your LSTM training shows the network takes less than 40 minutes to converge, finishing within epoch 0. So I wonder: did you set BATCH_SIZE (in lstm_ocr) very large (160 or 320)? Or the learning rate a little lower (not 0.005)? I ran lstm_ocr.py on a GTX 980 Ti; the result is not as good as yours (about 0.09) and took me about 3 hours.

@sinmaystar commented:

@anxingle Did you run the lstm_ocr.py example successfully?
How did you integrate Baidu warp-ctc?
I have installed Baidu warp-ctc and followed the README instructions, but I get an error:
AttributeError: 'module' object has no attribute 'WarpCTC'
How can I solve this problem?

@anxingle (Contributor) commented:

@sinmaystar Yeah, it's really hard to compile. Did you set $LD_LIBRARY_PATH (so that libwarpctc.so is on the path) and edit make/config.mk so that warpctc_path points to warp-ctc?
Can you tell me where your warp-ctc is, what your config.mk looks like, and any more info?

@sinmaystar commented:

I did not set LD_LIBRARY_PATH. I have commented out two lines as the introduction says; my warp-ctc and mxnet are both in the home directory. Thanks for your help; I will retry it tomorrow. If it does not work, maybe I will ask you again. Thanks a lot! You are very kind.

@xlvector mentioned this pull request on Jul 21, 2016
@sinmaystar commented:

@anxingle Where can I find the libwarpctc.so file, and how did you set $LD_LIBRARY_PATH? I cannot find libwarpctc.so after compiling warp-ctc.

@anxingle (Contributor) commented:

@sinmaystar It is in warp-ctc/build. You can use this command to find it: find /home/$USER/warp-ctc/ -name libwarpctc.so. It shows the message: /home/a/warp-ctc/build/libwarpctc.so. Then edit ~/.bashrc and add the line export LD_LIBRARY_PATH="/home/a/warp-ctc/build:/usr/local/cuda-7.5/lib64:$LD_LIBRARY_PATH" (this is my own path) to the end of the file.

@sinmaystar commented:

@anxingle Thanks for your help; I have run this example successfully.
I am working on adding a CNN layer before the LSTM, so it can extract word features and resist rotation, but I do not know how to add a simple CNN layer before the LSTM. Do you have any suggestions?

@yuzhuqingyun commented:

When predicting with:
model_load = mx.model.FeedForward.load('ocr', 15)
[prob, data1, label1] = model_load.predict(data_val, return_data=True)
I get the error:
self.ctx[0], grad_req = 'null', **dict(input_shapes)
ValueError: Input node is not complete
How can I predict with one picture?

@BrianZhu01 commented:

I also met this error. How did you integrate Baidu warp-ctc?
I have installed Baidu warp-ctc and followed the README instructions, but I get an error:
AttributeError: 'module' object has no attribute 'WarpCTC'
How can I solve this problem?
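This AttributeError usually means the mxnet package being imported was not built with the warpctc plugin enabled. A small diagnostic (a sketch, assuming only that mxnet, if importable, exposes its symbols under mxnet.symbol) can tell the failure modes apart:

```python
import importlib

def warpctc_status():
    """Report whether the installed mxnet (if any) exposes the WarpCTC symbol."""
    try:
        mx = importlib.import_module("mxnet")
    except Exception:
        return "mxnet not importable"
    sym = getattr(mx, "symbol", None)
    if sym is not None and hasattr(sym, "WarpCTC"):
        return "WarpCTC available"
    return "mxnet built without the warpctc plugin"

if __name__ == "__main__":
    print(warpctc_status())
```

If it reports the plugin is missing, enable warpctc in make/config.mk, rebuild, and reinstall the Python package before retrying.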

@xiazi-yu commented:

Hello, when I use mxnet_predict.py in mxnet for prediction, it seems the GPU option is not supported; it can only run on the CPU. With the GPU it hangs and reports the error "terminate called without an active exception".

@livedimg commented Sep 25, 2017 via email

