chenxinpeng/yolo2.pytorch
YOLO-v2 references

  1. https://github.com/longcw/yolo2-pytorch

Pre-building the extensions

First, enter the misc directory:

python3.6 build.py build_ext --inplace

Then enter the layers/reorg/src directory:

nvcc -c -o reorg_cuda_kernel.cu.o reorg_cuda_kernel.cu -x cu -Xcompiler -fPIC -arch=sm_52
cd ../
python3.6 build.py

Then enter the roi_pooling/src/cuda directory:

nvcc -c -o roi_pooling_kernel.cu.o roi_pooling_kernel.cu -x cu -Xcompiler -fPIC -arch=sm_52
cd ../../
python3.6 build.py
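The `-arch=sm_52` flag above is tied to one GPU generation; on other hardware it should match the device's compute capability. A minimal helper for formatting the flag (the name `sm_flag` is illustrative, not part of this repo); with PyTorch installed, the capability pair can be queried via `torch.cuda.get_device_capability()`:

```python
def sm_flag(major, minor):
    """Format a compute capability pair as an nvcc -arch flag, e.g. (5, 2) -> 'sm_52'."""
    return "sm_%d%d" % (major, minor)

# With PyTorch available, the pair for GPU 0 would come from:
#   import torch
#   major, minor = torch.cuda.get_device_capability(0)
print(sm_flag(5, 2))  # the value hard-coded in the nvcc commands above
```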

refcoco

Experiments revealed severe class imbalance: the classification loss never comes down.

Targets

| dataset | splitBy | detections | val | testA | testB | val+Ours | testA+Ours | testB+Ours |
|---|---|---|---|---|---|---|---|---|
| RefCOCO | unc | ssd(testAB) | - | 92.76% | 80.11% | - | 95.85% | 88.34% |
| RefCOCO | unc | Mask R-CNN | - | 94.18% | 83.09% | - | 96.46% | 90.00% |
| RefCOCO+ | unc | ssd(testAB) | - | 86.89% | 69.63% | - | 94.83% | 85.10% |
| RefCOCO+ | unc | Mask R-CNN | - | 94.17% | 83.15% | - | 95.65% | 87.38% |
| RefCOCOg | google | ssd(val) | 75.74% | - | - | 84.70% | - | - |
| RefCOCOg | google | Mask R-CNN | 90.96% | - | - | 92.30% | - | - |

scripts

In hindsight, it is better to keep the preprocessing work (building the vocabulary, word vectors, etc.) in standalone scripts; mixing it into the main training code quickly gets messy.

NIC

RefCOCO

| seed | Feat | Max | BLEU-4 | CIDEr |
|---|---|---|---|---|
| 110 | VGG | 37 | 0.11301 | 0.85345 |
| 111 | V4 | 10 | 0.14117 | 1.032 |

RefCOCO+

| seed | Feat | Max | BLEU-4 | CIDEr |
|---|---|---|---|---|
| 110 | V4 | 28 | 0.07565 | 0.697 |

RefCOCOg

| seed | Feat | Max | BLEU-4 | CIDEr |
|---|---|---|---|---|
| 111 | V4 | 35 | 0.12280 | 0.702 |
| 114 | V4 | 29 | 0.12652 | 0.711 |

Versions

Version 3

Fixed bugs in the original source code, mainly in the _process_batch function.

Version 4

Directly predict $(x_1, y_1, w, h)$; reaches about 26%–28% on testA and testB, much better than the earlier anchor-based version.

Version 5

Added a soft attention mechanism: the last hidden state of the sequence LSTM is used to compute soft attention over the visual features. At first, V5 still predicts the same thing as V4, i.e. $(x_1, y_1, w, h)$. The improvement is clear. Details:
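The soft-attention step can be sketched as follows (a minimal NumPy sketch under assumed shapes: a 13x13 feature map attended by the LSTM's last hidden state; names like `soft_attention` are illustrative, not the repo's API):

```python
import numpy as np

def soft_attention(feats, h_last, W):
    """feats: (C, H, W_) visual features; h_last: (D,) last LSTM hidden state;
    W: (D, C) projection. Returns attention weights (H*W_,) and the attended feature (C,)."""
    C, H, W_ = feats.shape
    flat = feats.reshape(C, H * W_)                # (C, HW)
    scores = (h_last @ W) @ flat                   # (HW,) dot-product scores
    scores = scores - scores.max()                 # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over spatial positions
    attended = flat @ alpha                        # (C,) weighted sum of features
    return alpha, attended

rng = np.random.default_rng(0)
alpha, ctx = soft_attention(rng.normal(size=(1024, 13, 13)),
                            rng.normal(size=(512,)),
                            rng.normal(size=(512, 1024)) * 0.01)
```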

| seed | epoch | val | testA | testB |
|---|---|---|---|---|
| 116 | 0 | - | 30.23% | 27.11% |
| 116 | 1 | - | 32.30% | 29.36% |
| 116 | 2 | - | 33.36% | 30.87% |
| 116 | 3 | - | 34.03% | 31.64% |

I did not train this any further. Next, following YOLO v1, I replaced the predicted $(x_1, y_1, w, h)$ with $(x_{1}, y_{1}, \sqrt{w}, \sqrt{h})$.
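The square-root parameterization (from the YOLO v1 loss) makes a fixed absolute error count more for small boxes than for large ones. A minimal sketch of encoding and decoding the targets (illustrative names, not the repo's code):

```python
import numpy as np

def encode_box(x1, y1, w, h):
    """YOLO-v1-style target: predict sqrt of width/height so that small boxes
    are not dominated by large ones in the squared-error loss."""
    return np.array([x1, y1, np.sqrt(w), np.sqrt(h)])

def decode_box(t):
    """Invert encode_box: square the last two components."""
    x1, y1, sw, sh = t
    return np.array([x1, y1, sw ** 2, sh ** 2])

t = encode_box(0.2, 0.3, 0.25, 0.64)
```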

Version 6

In this version the prediction is still the transformed $(x_{1}, y_{1}, \sqrt{w}, \sqrt{h})$; in addition, a cross entropy loss is added. It works well: RefCOCO (UNC) reaches 60%.

I tried unfreezing the word embedding for training, but the result was terrible, almost 30 points lower.

I tried lowering coord_scale to 1.0, but that did not help either.

| seed | coord scale | fine-tuning | val | Optimizer |
|---|---|---|---|---|
| 124 | 10.0 | Yes | 0.473 | adam |
| 125 | 10.0 | Yes | 0.472 | adam |
| 126 | 10.0 | No | 0.467 | adam |
| 127 | 10.0 | No | 0.467 | adam |
| 117 | 1.0 | Yes | epoch 159, 0.583 | sgd |
| 116 | 1.0 | Yes | epoch 42, 0.463 | adam |
| 134 | 10.0 | No | epoch 47, 0.403 | sgd |
| 135 | 10.0 | No | epoch 32, 0.400 | sgd |
| 136 | 10.0 | Yes | epoch 26, 0.581 | sgd |
| 137 | 10.0 | Yes | epoch 34, 0.572 | sgd |
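The combined objective of this version (a scaled coordinate regression loss plus a cross-entropy classification loss) can be sketched as below; a pure-NumPy sketch with `coord_scale` as in the table and all other names illustrative:

```python
import numpy as np

def total_loss(pred_box, gt_box, cls_logits, gt_cls, coord_scale=10.0):
    """Sum of a scaled squared-error coordinate loss and a cross-entropy class loss."""
    coord_loss = coord_scale * np.sum((pred_box - gt_box) ** 2)
    logits = cls_logits - cls_logits.max()              # stable log-softmax
    log_probs = logits - np.log(np.exp(logits).sum())
    ce_loss = -log_probs[gt_cls]
    return coord_loss + ce_loss

loss = total_loss(np.array([0.5, 0.5, 0.4, 0.6]),
                  np.array([0.4, 0.5, 0.5, 0.5]),
                  np.array([2.0, 0.5, -1.0]), gt_cls=0)
```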

Version 7

In this version I tried to speed up training by extracting the conv7 features in advance with the forward_cnn function (size bsize x 1024 x 13 x 13). But the result was only 43%. My guess is that this is because (1) there is no fine-tuning, and (2) the BN layers behave differently when frozen this way.

| seed | coord scale | fine-tuning | val | Optimizer |
|---|---|---|---|---|
| 111 | 5.0 | No | 0.436 | adam |
| 112 | 10.0 | No | 0.440 | adam |
| 113 | 10.0 | No | 0.439 | adam |
| 110 | 10.0 | No | 0.350 | sgd |
| 111 | 10.0 | No | 0.348 | sgd |
| 120 | 1.0 | No | 0.289 | sgd |
| 121 | 1.0 | No | 0.289 | sgd |

Version 8

In this version, building on V7, I added a decoder after the text embedding layer, following the paper Semi-supervised Sequence Learning: after obtaining the sentence representation, a decoder reconstructs the sentence from it. But the result was only about 42%, below V7, so this V8 experiment fell short of expectations and did not help.

Version 9

This version combines the feature-fusion methods of Google TDM (concat), the element-wise product from DSSD, and the element-wise sum from FPN. For convenience, the LSTM size is 1024. Did not work.

Version 10

Only the FPN element-wise sum fusion. Again, for convenience, the LSTM size is set to 1024. Did not work.

Version 11

Only the DSSD element-wise product fusion. Again, for convenience, the LSTM size is set to 1024. Did not work.

Version 12

Features are fused by concat; anchor boxes are introduced for prediction. The anchors were computed following darknet_scripts, giving 3 anchors: [(0.65, 9.15), (2.21, 8.00), (4.44, 5.90)]. I tried various anchors, but the anchor-based variant never trained well. To be revisited.
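darknet_scripts derives anchors by k-means clustering of box (width, height) pairs with an IoU-based distance (d = 1 - IoU). A minimal sketch of that idea (illustrative code, not the script itself):

```python
import numpy as np

def iou_wh(wh, centers):
    """IoU between a (w, h) box and each center, assuming shared top-left corners."""
    inter = np.minimum(wh[0], centers[:, 0]) * np.minimum(wh[1], centers[:, 1])
    union = wh[0] * wh[1] + centers[:, 0] * centers[:, 1] - inter
    return inter / union

def kmeans_anchors(whs, k, iters=100, seed=0):
    """Cluster (w, h) pairs with distance 1 - IoU; returns k anchor centers."""
    rng = np.random.default_rng(seed)
    centers = whs[rng.choice(len(whs), k, replace=False)]
    for _ in range(iters):
        # assign each box to the center with the smallest (1 - IoU) distance
        assign = np.array([np.argmin(1 - iou_wh(wh, centers)) for wh in whs])
        for j in range(k):
            if np.any(assign == j):
                centers[j] = whs[assign == j].mean(axis=0)
    return centers

anchors = kmeans_anchors(np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 4.8]]), k=2)
```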

Version 14, 416 input size

On top of V6, segmentation is added: conv8 (i.e. the concat of the visual and sequence features, passed through a conv + bn layer), of size 1536 x 1 x 1, goes through deconv layers up to the deconv size.

RefCOCO, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
|---|---|---|---|---|---|---|---|---|
| 110 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 38, 0.638 | 0.65494 | 0.59686 |
| 111 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 50, 0.631 | 0.65706 | 0.59804 |
| 112 | 416 | Yes | 10.0 | 1.0 | 1.0 | epoch 51, 0.631 | 0.66166 | 0.59333 |
| 113 | 416 | Yes | 10.0 | 1.0 | 1.0 | epoch 46, 0.623 | 0.66042 | 0.58665 |
| 116 | 416 | Yes | 10.0 | 1.0 | 1.0 | epoch 45, 0.611 | 0.63479 | 0.57017 |
| 117 | 416 | Yes | 10.0 | 1.0 | 1.0 | epoch 54, 0.619 | 0.63090 | 0.57821 |
| 134 | 32 | Yes | 20.0 | 5.0 | 1.0 | epoch 37, 0.617 | 0.65194 | 0.59450 |
| 135 | 32 | Yes | 20.0 | 5.0 | 1.0 | epoch 44, 0.629 | 0.65795 | 0.58724 |
| 136 | 32 | Yes | 20.0 | 1.0 | 1.0 | epoch 36, 0.603 | 0.63691 | 0.58175 |
| 137 | 32 | Yes | 20.0 | 1.0 | 1.0 | epoch 37, 0.614 | 0.62807 | 0.58705 |
| 210 | 32 | Yes | 10.0 | 0.0 | 0.0 | epoch 18, 0.408 | 0.43504 | 0.41001 |
| 211 | 32 | Yes | 10.0 | 0.0 | 0.0 | epoch 17, 0.395 | 0.42408 | 0.39921 |
| v6, 117 | 32 | Yes | 10.0 | 1.0 | 0.0 | epoch 159, 0.583 | 0.59873 | 0.54681 |
| 311 | 32 | Yes | 50.0 | 10.0 | 0.0 | epoch 28, 0.640 | 0.66254 | 0.57213 |
| 311 | 32 | Yes | 50.0 | 10.0 | 0.0 | epoch 23, 0.640 | 0.64840 | 0.56016 |
| 312 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 25, 0.636 | 0.67156 | 0.56075 |
| 313 | 32 | Yes | 50.0 | 0.0 | 0.0 | epoch 33, 0.557 | 0.57416 | 0.53837 |
| 412 | 32 | No | 50.0 | 10.0 | 1.0 | epoch 83, 0.467 | 0.50731 | 0.45043 |
| 413 | 32 | No | 50.0 | 10.0 | 0.0 | epoch 51, 0.467 | 0.49019 | 0.44318 |
| 416 | 32 | No | 50.0 | 0.0 | 0.0 | epoch 33, 0.450 | 0.46279 | 0.42414 |

2018.5.28 rebuttal experiment:

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
|---|---|---|---|---|---|---|---|---|
| 516 | 32 | Yes | 50.0 | 0.0 | 1.0 | epoch 131, 0.608 | | |

models/yolo2_refer_comprehension_416_v14_refcoco_unc_seed_312/model_epoch-25.pth
models/yolo2_refer_comprehension_416_v14_refcoco_unc_seed_311/model_epoch-23.pth

RefCOCO+, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
|---|---|---|---|---|---|---|---|---|
| 112 | 416 | Yes | 10.0 | 1.0 | 1.0 | epoch 58, 0.409 | 0.45774 | 0.34956 |
| 113 | 416 | Yes | 10.0 | 1.0 | 1.0 | epoch 59, 0.405 | 0.45145 | 0.34383 |
| 144 | 32 | Yes | 20.0 | 1.0 | 1.0 | epoch 43, 0.442 | 0.49284 | 0.37247 |
| 145 | 32 | Yes | 20.0 | 1.0 | 1.0 | epoch 39, 0.435 | 0.48166 | 0.36101 |
| 146 | 32 | Yes | 20.0 | 5.0 | 1.0 | epoch 45, 0.390 | - | - |
| 147 | 32 | Yes | 20.0 | 5.0 | 1.0 | epoch 44, 0.384 | - | - |
| 212 | 32 | Yes | 10.0 | 0.0 | 0.0 | epoch 31, 0.201 | 0.22634 | 0.18327 |
| 213 | 32 | Yes | 10.0 | 0.0 | 0.0 | epoch 25, 0.200 | 0.22686 | 0.18347 |
| 216 | 32 | Yes | 10.0 | 1.0 | 0.0 | epoch 32, 0.379 | 0.42525 | 0.31029 |
| 217 | 32 | Yes | 10.0 | 1.0 | 0.0 | epoch 31, 0.388 | 0.42176 | 0.32911 |
| 414 | 32 | Yes | 20.0 | 0.0 | 0.0 | epoch 21, 0.201 | - | - |
| 415 | 32 | Yes | 20.0 | 0.0 | 0.0 | epoch 32, 0.206 | - | - |
| 416 | 32 | Yes | 20.0 | 1.0 | 0.0 | epoch 33, 0.373 | - | - |
| 417 | 32 | Yes | 20.0 | 1.0 | 0.0 | epoch 31, 0.352 | - | - |
| 316 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 48, 0.471 | 0.52008 | 0.39272 |
| 317 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 43, 0.468 | 0.52410 | 0.38658 |
| 514 | 32 | Yes | 50.0 | 10.0 | 0.0 | epoch 37, 0.449 | 0.49424 | 0.37615 |
| 515 | 32 | Yes | 50.0 | 10.0 | 0.0 | epoch 37, 0.455 | 0.49721 | 0.37513 |
| 516 | 32 | Yes | 50.0 | 0.0 | 0.0 | epoch 24, 0.074 | 0.08383 | 0.06975 |
| 517 | 32 | Yes | 50.0 | 0.0 | 0.0 | epoch 36, 0.250 | 0.28379 | 0.22234 |
| 610 | 32 | No | 50.0 | 10.0 | 0.0 | epoch 56, 0.274 | 0.29742 | 0.24811 |
| 611 | 32 | No | 50.0 | 0.0 | 0.0 | epoch 27, 0.192 | 0.21691 | 0.16241 |
| 617 | 32 | No | 50.0 | 10.0 | 1.0 | epoch 45, 0.289 | 0.32676 | 0.26222 |

2018.5.28 rebuttal experiment:

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
|---|---|---|---|---|---|---|---|---|
| 717 | 32 | Yes | 50.0 | 0.0 | 1.0 | epoch 72, 0.279 | | |

models/yolo2_refer_comprehension_416_v14_refcoco+_unc_seed_316/model_epoch-48.pth
models/yolo2_refer_comprehension_416_v14_refcoco+_unc_seed_514/model_epoch-37.pth

RefCOCOg, google

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
|---|---|---|---|---|---|---|---|---|
| 114 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 30, 0.263 | - | - |
| 115 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 29, 0.270 | - | - |
| 120 | 32 | Yes | 50.0 | 10.0 | 10.0 | epoch 30, 0.332 | - | - |
| 121 | 32 | Yes | 50.0 | 10.0 | 10.0 | epoch 31, 0.326 | - | - |
| 122 | 32 | Yes | 20.0 | 5.0 | 1.0 | epoch 20, 0.299 | - | - |
| 123 | 32 | Yes | 20.0 | 5.0 | 1.0 | epoch 23, 0.293 | - | - |
| 214 | 32 | Yes | 10.0 | 0.0 | 0.0 | epoch 13, 0.189 | - | - |
| 215 | 32 | Yes | 10.0 | 0.0 | 0.0 | epoch 12, 0.198 | - | - |
| 316 | 32 | Yes | 10.0 | 1.0 | 0.0 | epoch 25, 0.247 | - | - |
| 317 | 32 | Yes | 10.0 | 1.0 | 0.0 | epoch 23, 0.247 | - | - |
| 514 | 32 | Yes | 50.0 | 0.0 | 0.0 | epoch 51, 0.192 | - | - |
| 515 | 32 | Yes | 50.0 | 10.0 | 0.0 | epoch 46, 0.342 | - | - |
| 610 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 33, 0.343 | - | - |
| 611 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 30, 0.347 | - | - |
| 712 | 32 | No | 50.0 | 10.0 | 1.0 | epoch 59, 0.260 | - | - |
| 713 | 32 | No | 50.0 | 10.0 | 0.0 | epoch 22, 0.255 | - | - |
| 714 | 32 | No | 50.0 | 0.0 | 0.0 | epoch 43, 0.179 | - | - |
| 816 | 32 | Yes | 50.0 | 10.0 | 1.0 | | - | - |
| 817 | 32 | Yes | 50.0 | 10.0 | 1.0 | | - | - |

2018.5.28 rebuttal experiment:

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
|---|---|---|---|---|---|---|---|---|
| 915 | 32 | Yes | 50.0 | 0.0 | 1.0 | | | |

models/yolo2_refer_comprehension_416_v14_refcocog_google_seed_611/model_epoch-30.pth
models/yolo2_refer_comprehension_416_v14_refcocog_google_seed_515/model_epoch-46.pth

Version 15, 416 input size

On top of V6, segmentation is added, this time by passing the 1536 x 1 x 1 feature through upsampling. Not done yet.

Version 16, 416 input size

Use conv7 (bsize x 1024 x 13 x 13) after AvgPool to deconv a 32 x 32 segmentation map.

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val |
|---|---|---|---|---|---|---|
| 114 | 32 | Yes | 10.0 | 1.0 | 1.0 | |
| 115 | 32 | Yes | 10.0 | 1.0 | 1.0 | |

Version 6, 608 input size

| seed | fine-tuning | coord scale | cls scale | val |
|---|---|---|---|---|
| 116 | Yes | 10.0 | 1.0 | epoch 57, 0.587 |
| 117 | Yes | 10.0 | 1.0 | epoch 46, 0.592 |

Version 14, 608 input size

RefCOCO, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val |
|---|---|---|---|---|---|---|
| 116 | 32 | Yes | 10.0 | 1.0 | 1.0 | |
| 117 | 32 | Yes | 10.0 | 1.0 | 1.0 | |

RefCOCO+, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val |
|---|---|---|---|---|---|---|
| 114 | 32 | Yes | 10.0 | 1.0 | 1.0 | |
| 115 | 32 | Yes | 10.0 | 1.0 | 1.0 | |

RefCOCOg, google

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val |
|---|---|---|---|---|---|---|
| 111 | 32 | Yes | 10.0 | 1.0 | 1.0 | |
| 110 | 32 | Yes | 10.0 | 1.0 | 1.0 | |

Version 18, 416 input size

Based on the v14 (416) code, with a co-attention mechanism added.

RefCOCO, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
|---|---|---|---|---|---|---|---|---|
| 110 | 32 | Yes | 20.0 | 5.0 | 1.0 | epoch 24, 0.621 | - | - |
| 111 | 32 | Yes | 20.0 | 5.0 | 1.0 | epoch 39, 0.621 | - | - |
| 112 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 34, 0.587 | - | - |
| 113 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 34, 0.584 | - | - |
| 210 | 32 | Yes | 50.0 | 10.0 | 1.0 | - | - | - |
| 211 | 32 | Yes | 50.0 | 10.0 | 1.0 | - | - | - |
| 212 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 36, 0.628 | | |
| 213 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 38, 0.639 | | |
| 310 | 32 | Yes | 50.0 | 10.0 | 1.0 | | | |
| 311 | 32 | Yes | 50.0 | 10.0 | 1.0 | | | |

2018.5.24 rebuttal experiment:

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
|---|---|---|---|---|---|---|---|---|
| 510 | 32 | Yes | 50.0 | 0.0 | 1.0 | epoch 74, 0.172 | | |
| 511 | 32 | Yes | 50.0 | 0.0 | 10.0 | epoch 76, 0.158 | | |

RefCOCO+, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
|---|---|---|---|---|---|---|---|---|
| 216 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 42, 0.440 | | |
| 217 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 42, 0.457 | | |
| 312 | 32 | Yes | 50.0 | 10.0 | 1.0 | | | |
| 313 | 32 | Yes | 50.0 | 10.0 | 1.0 | | | |

2018.5.24 rebuttal experiment:

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
|---|---|---|---|---|---|---|---|---|
| 512 | 32 | Yes | 50.0 | 0.0 | 1.0 | epoch 75, 0.096 | | |
| 513 | 32 | Yes | 50.0 | 0.0 | 10.0 | epoch 79, 0.082 | | |

RefCOCOg, google

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
|---|---|---|---|---|---|---|---|---|
| 110 | 32 | Yes | 50.0 | 10.0 | 10.0 | epoch 29, 0.308 | - | - |
| 111 | 32 | Yes | 50.0 | 10.0 | 10.0 | epoch 25, 0.314 | - | - |
| 210 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 21, 0.311 | | |
| 211 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 23, 0.316 | | |
| 314 | 32 | Yes | 50.0 | 10.0 | 1.0 | | | |
| 315 | 32 | Yes | 50.0 | 10.0 | 1.0 | | | |

2018.5.24 rebuttal experiment:

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
|---|---|---|---|---|---|---|---|---|
| 514 | 32 | Yes | 50.0 | 0.0 | 1.0 | epoch 91, 0.118 | | |
| 515 | 32 | Yes | 50.0 | 0.0 | 10.0 | epoch 61, 0.097 | | |

Version 20, 416 input size

Use Inception V4 as the base network, replacing the YOLO-v2 weights.

Version 21, 416 input size

Remove the multimodal interaction.

RefCOCO, unc

| seed | deconv size | fine-tuning | coord scale | seg scale | val | testA | testB |
|---|---|---|---|---|---|---|---|
| 110 | 32 | Yes | 50.0 | 0.0 | epoch 20, 0.275 | 0.31006 | 0.27282 |
| 111 | 32 | Yes | 50.0 | 0.0 | epoch 16, 0.192 | | |

RefCOCO+, unc

| seed | deconv size | fine-tuning | coord scale | seg scale | val | testA | testB |
|---|---|---|---|---|---|---|---|
| 112 | 32 | Yes | 50.0 | 0.0 | epoch 21, 0.079 | 0.08261 | 0.07527 |
| 113 | 32 | Yes | 50.0 | 0.0 | epoch 22, 0.078 | | |

RefCOCOg, google

| seed | deconv size | fine-tuning | coord scale | seg scale | val | testA | testB |
|---|---|---|---|---|---|---|---|
| 114 | 32 | Yes | 50.0 | 0.0 | epoch 21, 0.088 | | |
| 115 | 32 | Yes | 50.0 | 0.0 | epoch 27, 0.132 | | |

Version 1, 416 input size, YOLO-v3

Thu, 6.14, 2018:

RefCOCO, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
|---|---|---|---|---|---|---|---|---|---|
| 111 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 30, 0.535 | - | - | Thu, 6.14, 2018 |
| 210 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 47, 0.647 | 0.68464 | 0.59627 | Sun, 6.17, 2018 |
| 211 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 37, 0.648 | 0.68057 | 0.61237 | Sun, 6.17, 2018 |

Oddly, seeds 110 and 112 perform much better; both ran for 40 epochs before I killed them. Since no validation was run during training, below are the per-epoch test results:

| seed | epoch | testA | testB |
|---|---|---|---|
| 110 | 28 | 0.65282 | 0.58135 |
| 110 | 29 | 0.65795 | 0.58155 |
| 110 | 30 | 0.65795 | 0.58292 |
| 110 | 31 | 0.66148 | 0.58626 |
| 110 | 32 | 0.65441 | 0.58253 |
| 110 | 33 | 0.65724 | 0.57939 |
| 110 | 34 | 0.65017 | 0.58116 |
| 110 | 35 | 0.65812 | 0.58351 |
| 110 | 36 | 0.65194 | 0.58077 |
| 110 | 37 | 0.65547 | 0.58096 |

| seed | epoch | testA | testB |
|---|---|---|---|
| 112 | 27 | 0.66307 | 0.59235 |
| 112 | 28 | 0.66219 | 0.59470 |
| 112 | 29 | 0.66608 | 0.59352 |
| 112 | 30 | 0.66077 | 0.59058 |
| 112 | 31 | 0.67191 | 0.59274 |
| 112 | 32 | 0.66378 | 0.59254 |
| 112 | 33 | 0.66254 | 0.59254 |
| 112 | 34 | 0.66820 | 0.59352 |
| 112 | 35 | 0.66060 | 0.59019 |
| 112 | 36 | 0.66590 | 0.59529 |
| 112 | 37 | 0.66731 | 0.59431 |
| 112 | 38 | 0.66608 | 0.58705 |
| 112 | 39 | 0.66307 | 0.58822 |

RefCOCO+, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
|---|---|---|---|---|---|---|---|---|---|
| 113 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 33, 0.336 | - | - | Thu, 6.14, 2018 |
| 114 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 31, 0.299 | - | - | Thu, 6.14, 2018 |
| 212 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 52, 0.449 | 0.49738 | 0.37533 | Sun, 6.17, 2018 |
| 213 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 47, 0.454 | 0.51188 | 0.37451 | Sun, 6.17, 2018 |

RefCOCOg, google

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
|---|---|---|---|---|---|---|---|---|---|
| 115 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 29, 0.232 | - | - | Thu, 6.14, 2018 |
| 116 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 26, 0.234 | - | - | Thu, 6.14, 2018 |
| 214 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 30, 0.321 | - | - | Sun, 6.17, 2018 |
| 215 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 34, 0.327 | - | - | Sun, 6.17, 2018 |

Version 2, 416 input size, YOLO-v3

Use the output of YOLO-v3's second prediction block as the image encoder.

26 x 26 weights, RefCOCO, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
|---|---|---|---|---|---|---|---|---|---|
| 116 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 37, 0.582 | 0.62383 | 0.51835 | |
| 117 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 41, 0.587 | 0.61640 | 0.51011 | |

26 x 26 --> 13 x 13 weights, RefCOCO, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
|---|---|---|---|---|---|---|---|---|---|
| 125 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 42, 0.586 | 0.62789 | 0.53680 | |
| 126 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 38, 0.585 | 0.62949 | 0.52758 | |

Version 3, 416 input size, YOLO-v3

Following MAttNet, use a Bi-LSTM as the sentence encoder.

RefCOCO, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
|---|---|---|---|---|---|---|---|---|---|
| 114 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 39, 0.681 | 0.70992 | 0.63768 | |
| 117 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 35, 0.675 | | | |
| 124 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 32, 0.679 | | | |
| 125 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 40, 0.665 | | | |
| 126 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 39, 0.663 | | | |
| 130 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 30, 0.672 | | | |
| 131 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 29, 0.666 | | | |

| seed | epoch | testA | testB |
|---|---|---|---|
| 114 | 24 | 0.70567 | 0.62944 |
| 114 | 25 | 0.71292 | 0.62924 |
| 114 | 26 | 0.70355 | 0.63494 |
| 114 | 27 | 0.70744 | 0.63238 |
| 114 | 28 | 0.71133 | 0.63513 |
| 114 | 29 | 0.71062 | 0.63631 |
| 114 | 30 | 0.70780 | 0.62689 |
| 114 | 31 | 0.71168 | 0.63651 |
| 114 | 32 | 0.71328 | 0.63690 |
| 114 | 33 | 0.71186 | 0.63533 |
| 114 | 34 | 0.71239 | 0.63729 |
| 114 | 35 | 0.70886 | 0.63631 |
| 114 | 36 | 0.70744 | 0.63592 |
| 114 | 37 | 0.71062 | 0.63454 |
| 114 | 38 | 0.71133 | 0.63101 |
| 114 | 39 | 0.70992 | 0.63768 |
| 117 | 19 | 0.70674 | 0.62414 |
| 117 | 20 | 0.70497 | 0.63101 |
| 117 | 21 | 0.70285 | 0.63160 |
| 117 | 22 | 0.72158 | 0.63670 |
| 117 | 23 | 0.70886 | 0.63454 |
| 117 | 24 | 0.71398 | 0.63945 |
| 117 | 25 | 0.71522 | 0.63906 |
| 117 | 26 | 0.72512 | 0.63906 |
| 117 | 27 | 0.72035 | 0.63984 |
| 117 | 28 | 0.72229 | 0.64024 |
| 117 | 29 | 0.71982 | 0.63886 |
| 117 | 30 | 0.71646 | 0.63906 |
| 117 | 31 | 0.71381 | 0.64043 |
| 117 | 32 | 0.72176 | 0.64573 |
| 117 | 33 | 0.72123 | 0.64259 |
| 117 | 34 | 0.71982 | 0.64004 |
| 117 | 35 | 0.71752 | 0.63670 |
| 117 | 36 | 0.71699 | 0.63925 |
| 117 | 37 | 0.71769 | 0.64082 |
| 117 | 38 | 0.71575 | 0.64141 |
| 117 | 39 | 0.71451 | 0.63808 |
| 124 | 28 | 0.72264 | 0.62964 |
| 124 | 29 | 0.72371 | |
| 124 | 30 | 0.71540 | |
| 124 | 31 | 0.71911 | |
| 124 | 32 | 0.71876 | |
| 124 | 33 | 0.71487 | |
| 124 | 34 | 0.71858 | |
| 124 | 35 | 0.71363 | |
| 124 | 36 | 0.71716 | |

RefCOCO+, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
|---|---|---|---|---|---|---|---|---|---|
| 110 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 49, 0.479 | 0.53790 | 0.39476 | Sun, 6.24, 2018 |
| 111 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 51, 0.477 | 0.53615 | 0.40151 | Sun, 6.24, 2018 |

RefCOCOg, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
|---|---|---|---|---|---|---|---|---|---|
| 112 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 22, 0.382 | - | - | Sun, 6.24, 2018 |
| 113 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 20, 0.379 | - | - | Sun, 6.24, 2018 |

Version 4, 416 input size, YOLO-v3

Replace the cls target of 1 with a Gaussian target.

RefCOCO, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
|---|---|---|---|---|---|---|---|---|---|
| 110 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | | | |
| 111 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | | | |
| 112 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch | | | |
| 113 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch | | | |

Did not work.

Version 4 (new), 416 input size

Take the feature map output by layer 61 (bsize * 512 * 26 * 26) and apply max pooling to get a bsize * 512 * 13 * 13 feature map. Then concatenate it with the layer-80 output (bsize * 1024 * 13 * 13) to get a bsize * 1536 * 13 * 13 feature map. Finally, a fully convolutional nonlinear mapping transforms this map, which then replaces the original feature map that came from layer 80 alone.
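The fusion step above can be sketched in NumPy (shapes as described; 2x2 max pooling and channel concat; the function names are illustrative, not the repo's code):

```python
import numpy as np

def maxpool2x2(x):
    """2x2 max pooling with stride 2 on a (C, H, W) array."""
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

def fuse(feat61, feat80):
    """Downsample the 26x26 map to 13x13 and concatenate along channels."""
    return np.concatenate([maxpool2x2(feat61), feat80], axis=0)

fused = fuse(np.zeros((512, 26, 26)), np.zeros((1024, 13, 13)))
```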

RefCOCO, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
|---|---|---|---|---|---|---|---|---|---|
| 110 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 25, 0.677 | | | Wed, 7.18, 2018 |
| 111 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 26, 0.674 | | | Wed, 7.18, 2018 |

RefCOCO+, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
|---|---|---|---|---|---|---|---|---|---|
| 112 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 50, 0.456 | | | Wed, 7.18, 2018 |
| 113 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 54, 0.466 | | | Wed, 7.18, 2018 |

RefCOCOg, google

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
|---|---|---|---|---|---|---|---|---|---|
| 114 | 32 | Yes | 10.0 | 1.0 | 1.0 | 0.367 | - | - | Wed, 7.18, 2018 |
| 115 | 32 | Yes | 10.0 | 1.0 | 1.0 | 0.360 | - | - | Wed, 7.18, 2018 |

Version 5

This version follows Hu Ronghang's Modeling Relationships in Referential Expressions with Compositional Modular Networks: the query sequence is processed by three self-attention modules, namely subject attention, relationship attention, and object attention. The three results are concatenated to form the query sequence representation.

RefCOCO, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
|---|---|---|---|---|---|---|---|---|---|
| 110 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | | | Wed, 7.18, 2018 |
| 111 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | | | Wed, 7.18, 2018 |

RefCOCO+, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
|---|---|---|---|---|---|---|---|---|---|
| 112 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | | | Wed, 7.18, 2018 |
| 113 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | | | Wed, 7.18, 2018 |

RefCOCOg, google

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
|---|---|---|---|---|---|---|---|---|---|
| 114 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | | | Wed, 7.18, 2018 |
| 115 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | | | Wed, 7.18, 2018 |

Version 6

In this version the query sequence is processed the same way as in Hu Ronghang's work, but encoded with a two-layer bidirectional LSTM.

RefCOCO, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
|---|---|---|---|---|---|---|---|---|---|
| 126 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 34, 0.687 | 0.72565 | 0.64396 | Wed, 7.23, 2018 |
| 116 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 42, 0.684 | 0.72618 | 0.64455 | Wed, 7.23, 2018 |
| 117 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 45, 0.684 | 0.72406 | 0.63965 | Wed, 7.23, 2018 |

Regenerated the vocabulary, reducing it to 1991 words; still running. Experiments show the smaller vocabulary does affect the results somewhat.

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
|---|---|---|---|---|---|---|---|---|---|
| 132 | 32 | Yes | 10.0 | 1.0 | 2.0 | epoch 30, 0.665 | | | Wed, 7.25, 2018 |
| 133 | 32 | Yes | 10.0 | 1.0 | 2.0 | epoch 28, 0.671 | | | Wed, 7.25, 2018 |
| 134 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 32, 0.658 | | | Wed, 7.25, 2018 |
| 135 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 35, 0.657 | | | Wed, 7.25, 2018 |

RefCOCO+, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
|---|---|---|---|---|---|---|---|---|---|
| 112 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | | | Wed, 7.18, 2018 |
| 113 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | | | Wed, 7.18, 2018 |

RefCOCOg, google

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
|---|---|---|---|---|---|---|---|---|---|
| 114 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | | | Wed, 7.18, 2018 |
| 115 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | | | Wed, 7.18, 2018 |

Version 7

Based on Version 6, with horizontal-flip data augmentation. Did not work: training blew up and performance collapsed; the cause remains to be investigated.

Version 8

Based on Version 6, with an added IoU loss: compute the jaccard overlap between gt boxes and predicted boxes, following https://github.com/amdegroot/ssd.pytorch/blob/master/layers/box_utils.py#L48 . But it did not help; adding the IoU loss even seems to hurt the predictions.
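The jaccard overlap used for that loss is plain IoU; a minimal sketch for corner-format boxes (x1, y1, x2, y2), not the ssd.pytorch implementation itself:

```python
def jaccard(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```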

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | iou scale | val | testA | testB |
|---|---|---|---|---|---|---|---|---|---|
| 110 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | poor | | |
| 111 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | poor | | |
| 112 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | poor | | |
| 113 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | poor | | |
| 114 | 32 | Yes | 10.0 | 1.0 | 1.0 | 2.0 | poor | | |
| 115 | 32 | Yes | 10.0 | 1.0 | 1.0 | 2.0 | poor | | |

Version 9

Based on Version 6, with attribute prediction added. For seeds 110 and 111 the parameter initialization of self.attr_fc was added; for seeds 114 and 115 I forgot to initialize self.attr_fc, though the impact should be small. Experiments show that attribute prediction clearly helps testB (i.e. non-person objects), a gain of nearly 2 points, while the testA results are essentially unchanged.

RefCOCO

Follow-up: seed 110 at epoch 35 reached 0.688 on val; on testA it reached 0.73113.

Seeds 140–143 concatenate Google's Word2Vec with GloVe.

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
|---|---|---|---|---|---|---|---|---|---|---|
| 110 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | epoch 21, 0.688 | 0.72512 | 0.66143 | 4 train, 39 test |
| 111 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | epoch 31, 0.686 | 0.72176 | 0.62414 | 4 train, 39 test |
| 120 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 24, 0.674 | 0.70355 | 0.62630 | 4 train, drop 0.1 |
| 121 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 29, 0.686 | 0.72335 | 0.63965 | 4 train, drop 0.1 |
| 126 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 25, 0.681 | 0.71946 | 0.64534 | 4 train, drop 0.0 |
| 127 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 28, 0.689 | 0.73378 | 0.62355 | 4 train, drop 0.0 |
| 134 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 30, 0.659 | 0.70161 | 0.61433 | Killed |
| 135 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 31, 0.659 | 0.70886 | 0.61276 | Killed |
| 136 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 28, 0.655 | 0.70002 | 0.60608 | Killed |
| 137 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 26, 0.655 | 0.70037 | 0.62277 | Killed |
| 114 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | epoch 26, 0.676 | 0.72229 | 0.63847 | Killed |
| 115 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | epoch 26, 0.678 | 0.71416 | 0.62826 | Killed |
| 134 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 18, 0.699 | 0.73307 | 0.64789 | 4 train, drop 0.0 |
| 135 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 17, 0.701 | 0.73025 | 0.64868 | 4 train, drop 0.0 |
| 136 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 22, 0.716 | 0.74934 | 0.66673 | 4 train, drop 0.0 |
| 137 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 23, 0.706 | 0.74969 | 0.66654 | 4 train, drop 0.0 |
| 135 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 23, 0.706 | 0.75146 | 0.66104 | 39 train, drop 0.0 |
| 137 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 20, 0.712 | 0.75252 | 0.64966 | 39 train, drop 0.0 |
| 140 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 25, 0.723 | 0.76825 | 0.67085 | 4 train, drop 0.0 |
| 141 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 25, 0.726 | 0.75234 | 0.66202 | 4 train, drop 0.0 |
| 142 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 21, 0.719 | 0.75694 | 0.66359 | 4 train, drop 0.0 |
| 143 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 25, 0.721 | 0.75853 | 0.66909 | 4 train, drop 0.0 |

RefCOCOg, google split

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | Note |
|---|---|---|---|---|---|---|---|---|
| 110 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 19, 0.389 | |
| 111 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 18, 0.399 | |
| 112 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 20, 0.419 | |
| 113 | 32 | Yes | 20.0 | 1.0 | 2.0 | 2.0 | epoch 18, 0.415 | |
| 114 | 32 | Yes | 20.0 | 2.0 | 2.0 | 2.0 | epoch 22, 0.411 | |
| 115 | 32 | Yes | 20.0 | 2.0 | 1.0 | 1.0 | epoch 20, 0.411 | |
| 122 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | | |
| 123 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | | |
| 144 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 22, 0.431 | 4 train, drop 0.0 |
| 145 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 21, 0.437 | 4 train, drop 0.0 |
| 146 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 20, 0.438 | 4 train, drop 0.0 |

Version 10

Based on Version 6: segmentation now uses only the visual-attention result. Previously, the 1024-d visual attention was concatenated with the 2048-d sequence embedding and passed through DeConv to predict the segmentation.

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
|---|---|---|---|---|---|---|---|---|---|---|
| 112 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | epoch 24, 0.677 | | | not fully run |
| 113 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | epoch 20, 0.681 | | | not fully run |

Version 11

Based on Version 9, mimicking MAttNet's visual subject representation module: concatenate YOLO V3's layer 62 with layer 80, pass the result through a 1x1 conv layer, then AvgPool it to predict the attributes. Meanwhile, the concatenated feature map is concatenated again with the layer-80 map to serve as the visual representation. This increases the parameter count by almost 20%, to a staggering 102,210,023, i.e. over 100 million parameters!

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
|---|---|---|---|---|---|---|---|---|---|---|
| 114 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | epoch 25, 0.670 | | | 100.102.33.39 |
| 115 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | epoch 27, 0.678 | | | 100.102.33.39 |
| 116 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | epoch 26, 0.675 | | | 100.102.33.39 |

Version 12

Based on Version 6, with RoI Pooling added to generate captioning. About 80-odd million parameters.

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | cap scale | val | testA | testB | Note |
|---|---|---|---|---|---|---|---|---|---|---|
| 112 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | epoch 20, 0.646 | | | 100.102.33.39, Stop |
| 113 | 32 | Yes | 10.0 | 1.0 | 1.0 | 0.1 | epoch 30, 0.673 | | | 100.102.33.39, Stop |
| 117 | 32 | Yes | 10.0 | 1.0 | 1.0 | 0.5 | epoch 20, 0.653 | | | 100.102.33.39, Stop |
| 120 | 32 | Yes | 10.0 | 1.0 | 1.0 | 0.1 | epoch 22, 0.654 | | | 100.102.33.4, Running |
| 121 | 32 | Yes | 10.0 | 1.0 | 1.0 | 0.1 | epoch 28, 0.656 | | | 100.102.33.4, Running |

Version 13

Based on Version 9. First, I counted the data: in the RefCOCO dataset, person referrings number 21,013, while the remaining common objects each account for a small share, totaling 21,391. To balance this, within each training batch (say batch size 10), 5 samples are drawn at random from person referrings and the remaining 5 from common objects. In addition, a weighting scheme is used: for each class, compute its referring count relative to the total, and take np.log() of the inverse frequency as the class weight. The person weight is therefore small, only about 0.7, while the rarest object's weight can reach np.log(42404.0 / 1), roughly 10.65.
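Under that scheme the weights work out as follows (counts taken from the paragraph above; `class_weight` is an illustrative name):

```python
import numpy as np

def class_weight(total, count):
    """Inverse-frequency log weight: rare classes get large weights."""
    return np.log(total / count)

total = 42404.0
print(round(class_weight(total, 21013), 2))  # person weight, around 0.7
print(round(class_weight(total, 1), 2))      # rarest-class weight, around 10.65
```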

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
|---|---|---|---|---|---|---|---|---|---|---|
| 112 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 32, 0.657 | | | 100.102.33.4, batch |
| 113 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 28, 0.657 | | | 100.102.33.4, batch |

Version 14

RefCOCOg

In this version, RoI Pooling is replaced with bilinear interpolation. References:

  1. https://github.com/jcjohnson/densecap/blob/master/densecap/modules/BilinearRoiPooling.lua
  2. https://github.com/jcjohnson/densecap/blob/master/densecap/modules/BoxToAffine.lua
  3. https://github.com/jcjohnson/densecap/blob/master/densecap/modules/BatchBilinearSamplerBHWD.lua
  4. https://github.com/qassemoquab/stnbhwd/blob/master/AffineGridGeneratorBHWD.lua
  5. https://zhuanlan.zhihu.com/p/28455306
  6. http://www.telesens.co/2018/03/11/object-detection-and-classification-using-r-cnns/

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | cap scale | pool size | val | Note |
|---|---|---|---|---|---|---|---|---|---|
| 110 | 32 | Yes | 20.0 | 1.0 | 2.0 | 0.1 | 1 | epoch 17, 0.373 | Killed |
| 111 | 32 | Yes | 20.0 | 1.0 | 2.0 | 0.5 | 1 | epoch 22, 0.356 | Killed |
| 112 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | 1 | epoch 15, 0.091 | Killed |
| 113 | 32 | Yes | 20.0 | 1.0 | 2.0 | 0.1 | 3 | epoch 20, 0.379 | Killed |
| 114 | 32 | Yes | 20.0 | 1.0 | 2.0 | 0.5 | 3 | epoch 23, 0.354 | Killed |
| 115 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | 3 | epoch 17, 0.106 | Killed |
| 116 | 32 | Yes | 20.0 | 1.0 | 2.0 | 0.1 | 5 | epoch 22, 0.384 | Killed |
| 117 | 32 | Yes | 20.0 | 1.0 | 2.0 | 0.5 | 5 | epoch 16, 0.360 | Killed |
| 120 | 32 | Yes | 20.0 | 1.0 | 2.0 | 0.1 | 7 | epoch 15, 0.377 | Killed |
| 121 | 32 | Yes | 20.0 | 1.0 | 2.0 | 0.1 | 7 | epoch 15, 0.363 | Killed |
| 112 | 32 | Yes | 20.0 | 1.0 | 2.0 | 0.05 | 7 | | Killed |
| 113 | 32 | Yes | 20.0 | 1.0 | 2.0 | 0.05 | 7 | | Killed |

Version 15

In this version, visual attention is applied twice. The first pass attends over YOLO-v3's layer 60 (bsize * 512 * 26 * 26) with the sequence embedding. Note that the sequence embedding here uses two two-layer LSTMs, embedding two different sentences separately. The second pass attends over YOLO-v3's layer 80 (bsize * 1024 * 13 * 13) with the sequence embedding. The two visual-attention results are concatenated with each other and with the sequence embedding, for a dimensionality of 512 + 1024 + 2048 = 3584. The loss is the sum of the losses of the two attention maps.

RefCOCO

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
|---|---|---|---|---|---|---|---|---|---|---|
| 110 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 20, 0.688 | | | 4 train |
| 111 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 29, 0.702 | | | 4 train |
| 110 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 26, 0.699 | | | 39 train |
| 121 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 21, 0.699 | | | 39 train |

Version 16

Similar to Version 15, with extra convolutional layers added for the transformation.

RefCOCO

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
|---|---|---|---|---|---|---|---|---|---|---|
| 114 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 21, 0.677 | | | 39 train |
| 116 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 24, 0.689 | | | 39 train |

Version 17

将原先 Version 9 中双层 LSTM 替换为 单层双向 LSTM,在 utils.py 中新增加了一个函数 get_sent_neg(),用来对一句话进行“反转”,如下:

import numpy as np

def get_sent_neg(sent_pos, sent_stop):
    # Return a copy of sent_pos with the tokens up to and including
    # sent_stop reversed; the padding after sent_stop is left untouched.
    sent_neg = np.copy(sent_pos)
    sent_neg[:sent_stop + 1] = sent_pos[:sent_stop + 1][::-1]

    return sent_neg

RefCOCO

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
|---|---|---|---|---|---|---|---|---|---|---|
| 125 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 21, 0.713 | 0.76189 | 0.66634 | 39 train |
| 126 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 27, 0.713 | 0.75164 | 0.66026 | 39 train |
| 127 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 27, 0.718 | 0.75747 | 0.66065 | 39 train |

RefCOCOg

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
|---|---|---|---|---|---|---|---|---|---|---|
| 122 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 20, 0.422 | - | - | 39 train |
| 123 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 21, 0.421 | - | - | 39 train |
| 124 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 20, 0.424 | - | - | 39 train |

Version 18

Based on Version 9, with a Transformer module integrated.

Version 19

Based on Version 9, with a cross-gating mechanism added.

Version 20

Based on Version 9, with 3 extra sentences added.

Version 21

Added negative-sample sentences.

Version 22

Take the output feature maps of layers conv 80, conv 92, and conv 104, apply visual attention to each, concatenate the results, and combine them with the sequence embedding to jointly predict the coordinates.

Version 23

In this version, each of YOLO-v3's 3 coordinate-prediction feature maps predicts coordinates independently; if any one of the 3 predictions is correct, the model is counted as correct.

This somewhat undermines the motivation of our model, but it does help performance.
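The "any of the K boxes correct" criterion is an oracle-style metric; a sketch of how it could be computed, assuming the usual IoU >= 0.5 correctness threshold (all names illustrative):

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def oracle_correct(pred_boxes, gt_box, thresh=0.5):
    """Count the sample correct if ANY of the K predicted boxes matches the gt."""
    return any(iou(p, gt_box) >= thresh for p in pred_boxes)

hit = oracle_correct([(0, 0, 1, 1), (10, 10, 12, 12)], (10, 10, 12, 12))
```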

RefCOCO

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
|---|---|---|---|---|---|---|---|---|---|---|
| 110 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 25, 0.727 | 0.76472 | 0.67066 | 39 train |
| 111 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 25, 0.736 | 0.76878 | 0.67517 | 39 train |
| 112 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 29, 0.731 | 0.77355 | 0.68126 | 39 train |
| 113 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 27, 0.736 | 0.76719 | 0.68616 | 39 train |

Version 24

In this version, the overall representations of the sequences (s1, s2) are concatenated and used for visual attention over the feature map. The visual-attention result is then used to do sequence attention over s1 and s2 separately. The two attention results are concatenated and used to predict the coordinates.

RefCOCO

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
|---|---|---|---|---|---|---|---|---|---|---|
| 114 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 29, 0.711 | 0.74722 | 0.66104 | 39 train |
| 115 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 30, 0.702 | 0.74456 | 0.65319 | 39 train |
| 116 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 29, 0.712 | 0.75057 | 0.66418 | 39 train |
| 117 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 25, 0.717 | 0.74686 | 0.65810 | 39 train |

Version 25

In this version, the segmentation information and the attributes-prediction module are removed, and more feature maps are used for prediction.

Concretely, with segmentation and attributes prediction fully removed, I now use 10 feature-map layers of YOLO-v3 (107 feature maps in total): conv 20, conv 32, conv 44, conv 56, conv 68, conv 80, conv 86, conv 92, conv 98, and conv 104. Each predicts separately, with its own visual-attention weights. Among the boxes predicted from these 10 feature maps, if any one is correct, we tentatively count the sample as correct. After just one epoch of training, this model already reaches 63.532 accuracy.

RefCOCO

| seed | deconv size | fine-tuning | coord scale | cls scale | val | testA | testB | Note |
|---|---|---|---|---|---|---|---|---|
| 114 | 32 | Yes | 20.0 | 1.0 | epoch 29, 0.824 | 0.84815 | 0.77841 | 39 train |
| 115 | 32 | Yes | 20.0 | 1.0 | epoch 28, 0.821 | 0.85734 | 0.77743 | 39 train |
| 116 | 32 | Yes | 20.0 | 1.0 | epoch 20, 0.824 | 0.85293 | 0.77409 | 39 train |
| 117 | 32 | Yes | 20.0 | 1.0 | epoch 28, 0.819 | 0.85346 | 0.78508 | 39 train |

But this raises a new problem: how do we pick the best one out of these 10 boxes? Moreover, picking the best proposal from Mask R-CNN or SSD already yields higher testA/testB accuracy than our approach, which is the awkward part.

Version 17

Rewritten from the original Version 17: each input is now a single sentence rather than the two-sentence pairs [A;C], [B;B], [C;A]. A two-layer bidirectional LSTM encodes the sentence.

RefCOCO, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
|---|---|---|---|---|---|---|---|---|---|---|
| 114 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | | | | 39 train |
| 115 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | | | | 39 train |
| 116 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | | | | 39 train |
| 117 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | | | | 39 train |

Version 26

Rewritten from the original Version 17: each input is a single sentence rather than the two-sentence pairs [A;C], [B;B], [C;A]. A two-layer LSTM encodes the sentence.

RefCOCO, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
|---|---|---|---|---|---|---|---|---|---|---|
| 110 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | | | | 39 train |
| 111 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | | | | 39 train |
| 112 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | | | | 39 train |
| 113 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | | | | 39 train |

Version 27

Rewritten from the original Version 17: each input is a single sentence rather than the two-sentence pairs [A;C], [B;B], [C;A]. A single-layer LSTM encodes the sentence.

RefCOCO, unc

| seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
|---|---|---|---|---|---|---|---|---|---|---|
| 114 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | | | | 39 train |
| 115 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | | | | 39 train |
| 116 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | | | | 39 train |

About

referring expression comprehension
