First, enter the misc directory:
python3.6 build.py build_ext --inplace
Enter the layers/reorg/src directory:
nvcc -c -o reorg_cuda_kernel.cu.o reorg_cuda_kernel.cu -x cu -Xcompiler -fPIC -arch=sm_52
cd ../
python3.6 build.py
Enter the roi_pooling/src/cuda directory:
nvcc -c -o roi_pooling_kernel.cu.o roi_pooling_kernel.cu -x cu -Xcompiler -fPIC -arch=sm_52
cd ../../
python3.6 build.py
Experiments revealed a severe class-imbalance problem: the classification loss never came down.
dataset | splitBy | detections | val | testA | testB | val+Ours | testA+Ours | testB+Ours |
---|---|---|---|---|---|---|---|---|
RefCOCO | unc | ssd(testAB) | - | 92.76% | 80.11% | - | 95.85% | 88.34% |
RefCOCO | unc | Mask R-CNN | - | 94.18% | 83.09% | - | 96.46% | 90.00% |
RefCOCO+ | unc | ssd(testAB) | - | 86.89% | 69.63% | - | 94.83% | 85.10% |
RefCOCO+ | unc | Mask R-CNN | - | 94.17% | 83.15% | - | 95.65% | 87.38% |
RefCOCOg | google | ssd(val) | 75.74% | - | - | 84.70% | - | - |
RefCOCOg | google | Mask R-CNN | 90.96% | - | - | 92.30% | - | - |
Later I found it better to keep preprocessing (dictionary building, word vectors, etc.) separate; mixing it into the main training code easily becomes messy.
seed | Feat | Max | BLEU-4 | CIDEr |
---|---|---|---|---|
110 | VGG | 37 | 0.11301 | 0.85345 |
111 | V4 | 10 | 0.14117 | 1.032 |
seed | Feat | Max | BLEU-4 | CIDEr |
---|---|---|---|---|
110 | V4 | 28 | 0.07565 | 0.697 |
seed | Feat | Max | BLEU-4 | CIDEr |
---|---|---|---|---|
111 | V4 | 35 | 0.12280 | 0.702 |
114 | V4 | 29 | 0.12652 | 0.711 |
Fixed errors in the original source code, mainly in the `_process_batch` function.
Direct prediction.
Added a soft attention mechanism, using the last hidden state of the LSTM over the sequence to compute soft attention; in V5 the initial prediction target is still the same as in V4.
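A minimal numpy sketch of the soft-attention step described above (shapes and names are illustrative, not the project's actual PyTorch code): attention weights come from dot products between the LSTM's last hidden state and each spatial feature vector, followed by a softmax.

```python
import numpy as np

def soft_attention(features, h_last):
    # features: (N, D) spatial feature vectors; h_last: (D,) last LSTM hidden state.
    scores = features @ h_last                    # (N,) similarity scores
    scores = scores - scores.max()                # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    attended = weights @ features                 # (D,) attention-weighted feature
    return attended, weights

feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
h = np.array([2.0, 0.0])
att, w = soft_attention(feats, h)
```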
seed | epoch | val | testA | testB |
---|---|---|---|---|
116 | 0 | - | 30.23% | 27.11% |
116 | 1 | - | 32.30% | 29.36% |
116 | 2 | - | 33.36% | 30.87% |
116 | 3 | - | 34.03% | 31.64% |
I did not continue any further beyond this point. Next, I changed the prediction target.
In this version, the prediction target is still the substituted one.
Tried leaving the word embeddings unfrozen during training, but the results were terrible, nearly 30 points lower.
Tried lowering coord_scale to 1.0, but it did not help.
seed | coord scale | fine-tuning | val | Optimizer |
---|---|---|---|---|
124 | 10.0 | Yes | 0.473 | adam |
125 | 10.0 | Yes | 0.472 | adam |
126 | 10.0 | No | 0.467 | adam |
127 | 10.0 | No | 0.467 | adam |
117 | 1.0 | Yes | epoch 159, 0.583 | sgd |
116 | 1.0 | Yes | epoch 42, 0.463 | adam |
134 | 10.0 | No | epoch 47, 0.403 | sgd |
135 | 10.0 | No | epoch 32, 0.400 | sgd |
136 | 10.0 | Yes | epoch 26, 0.581 | sgd |
137 | 10.0 | Yes | epoch 34, 0.572 | sgd |
In this version, I tried to speed up training by pre-extracting the conv7 feature maps with the `forward_cnn` function (size: bsize x 1024 x 13 x 13). However, accuracy was only 43%. My guess is that this is because (1) no fine-tuning was performed and (2) the BN layers.
seed | coord scale | fine-tuning | val | Optimizer |
---|---|---|---|---|
111 | 5.0 | No | 0.436 | adam |
112 | 10.0 | No | 0.440 | adam |
113 | 10.0 | No | 0.439 | adam |
110 | 10.0 | No | 0.350 | sgd |
111 | 10.0 | No | 0.348 | sgd |
120 | 1.0 | No | 0.289 | sgd |
121 | 1.0 | No | 0.289 | sgd |
In this version, building on V7, I added a decoder after the text-embedding layer, following the paper Semi-supervised Sequence Learning: after obtaining a sentence representation, the decoder reconstructs the sentence itself. However, accuracy was only about 42%, below V7, so this V8 experiment fell short of expectations and brought no benefit.
This version combines three feature-fusion methods: Concat from Google's TDM, Element-wise Product from DSSD, and Element-wise Sum from FPN. For convenience, the LSTM size is 1024. Did not work.
FPN-style Element-wise Sum fusion only. Again, for convenience, the LSTM size is set to 1024. Did not work.
DSSD-style Element-wise Product fusion only. Again, for convenience, the LSTM size is set to 1024. Did not work.
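The three fusion variants above differ only in how two feature maps of the same spatial size are merged; a schematic numpy sketch (shapes illustrative):

```python
import numpy as np

a = np.random.rand(2, 4, 13, 13)  # e.g. visual feature map (bsize, C, H, W)
b = np.random.rand(2, 4, 13, 13)  # e.g. the second (projected/upsampled) map

fused_concat = np.concatenate([a, b], axis=1)  # TDM-style: channel concat
fused_prod = a * b                             # DSSD-style: element-wise product
fused_sum = a + b                              # FPN-style: element-wise sum
```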
Features are fused via concat; anchor boxes are introduced for prediction. The anchor computation follows darknet_scripts, giving 3 anchors: [(0.65, 9.15), (2.21, 8.00), (4.44, 5.90)]. I tried various anchors, but the anchor-based runs never trained well. Left pending.
Based on V6, segmentation is added: conv8 (i.e., the concatenation of visual and sequence features passed through conv + bn layers), of size 1536 x 1 x 1, is run through deconv layers up to the deconv size.
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
---|---|---|---|---|---|---|---|---|
110 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 38, 0.638 | 0.65494 | 0.59686 |
111 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 50, 0.631 | 0.65706 | 0.59804 |
112 | 416 | Yes | 10.0 | 1.0 | 1.0 | epoch 51, 0.631 | 0.66166 | 0.59333 |
113 | 416 | Yes | 10.0 | 1.0 | 1.0 | epoch 46, 0.623 | 0.66042 | 0.58665 |
116 | 416 | Yes | 10.0 | 1.0 | 1.0 | epoch 45, 0.611 | 0.63479 | 0.57017 |
117 | 416 | Yes | 10.0 | 1.0 | 1.0 | epoch 54, 0.619 | 0.63090 | 0.57821 |
134 | 32 | Yes | 20.0 | 5.0 | 1.0 | epoch 37, 0.617 | 0.65194 | 0.59450 |
135 | 32 | Yes | 20.0 | 5.0 | 1.0 | epoch 44, 0.629 | 0.65795 | 0.58724 |
136 | 32 | Yes | 20.0 | 1.0 | 1.0 | epoch 36, 0.603 | 0.63691 | 0.58175 |
137 | 32 | Yes | 20.0 | 1.0 | 1.0 | epoch 37, 0.614 | 0.62807 | 0.58705 |
210 | 32 | Yes | 10.0 | 0.0 | 0.0 | epoch 18, 0.408 | 0.43504 | 0.41001 |
211 | 32 | Yes | 10.0 | 0.0 | 0.0 | epoch 17, 0.395 | 0.42408 | 0.39921 |
v6, 117 | 32 | Yes | 10.0 | 1.0 | 0.0 | epoch 159,0.583 | 0.59873 | 0.54681 |
311 | 32 | Yes | 50.0 | 10.0 | 0.0 | epoch 28, 0.640 | 0.66254 | 0.57213 |
311 | 32 | Yes | 50.0 | 10.0 | 0.0 | epoch 23, 0.640 | 0.64840 | 0.56016 |
312 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 25, 0.636 | 0.67156 | 0.56075 |
313 | 32 | Yes | 50.0 | 0.0 | 0.0 | epoch 33, 0.557 | 0.57416 | 0.53837 |
412 | 32 | No | 50.0 | 10.0 | 1.0 | epoch 83, 0.467 | 0.50731 | 0.45043 |
413 | 32 | No | 50.0 | 10.0 | 0.0 | epoch 51, 0.467 | 0.49019 | 0.44318 |
416 | 32 | No | 50.0 | 0.0 | 0.0 | epoch 33, 0.450 | 0.46279 | 0.42414 |
2018.5.28 rebuttal experiments:
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
---|---|---|---|---|---|---|---|---|
516 | 32 | Yes | 50.0 | 0.0 | 1.0 | epoch 131, 0.608 | | |
models/yolo2_refer_comprehension_416_v14_refcoco_unc_seed_312/model_epoch-25.pth
models/yolo2_refer_comprehension_416_v14_refcoco_unc_seed_311/model_epoch-23.pth
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
---|---|---|---|---|---|---|---|---|
112 | 416 | Yes | 10.0 | 1.0 | 1.0 | epoch 58, 0.409 | 0.45774 | 0.34956 |
113 | 416 | Yes | 10.0 | 1.0 | 1.0 | epoch 59, 0.405 | 0.45145 | 0.34383 |
144 | 32 | Yes | 20.0 | 1.0 | 1.0 | epoch 43, 0.442 | 0.49284 | 0.37247 |
145 | 32 | Yes | 20.0 | 1.0 | 1.0 | epoch 39, 0.435 | 0.48166 | 0.36101 |
146 | 32 | Yes | 20.0 | 5.0 | 1.0 | epoch 45, 0.390 | - | - |
147 | 32 | Yes | 20.0 | 5.0 | 1.0 | epoch 44, 0.384 | - | - |
212 | 32 | Yes | 10.0 | 0.0 | 0.0 | epoch 31, 0.201 | 0.22634 | 0.18327 |
213 | 32 | Yes | 10.0 | 0.0 | 0.0 | epoch 25, 0.200 | 0.22686 | 0.18347 |
216 | 32 | Yes | 10.0 | 1.0 | 0.0 | epoch 32, 0.379 | 0.42525 | 0.31029 |
217 | 32 | Yes | 10.0 | 1.0 | 0.0 | epoch 31, 0.388 | 0.42176 | 0.32911 |
414 | 32 | Yes | 20.0 | 0.0 | 0.0 | epoch 21, 0.201 | - | - |
415 | 32 | Yes | 20.0 | 0.0 | 0.0 | epoch 32, 0.206 | - | - |
416 | 32 | Yes | 20.0 | 1.0 | 0.0 | epoch 33, 0.373 | - | - |
417 | 32 | Yes | 20.0 | 1.0 | 0.0 | epoch 31, 0.352 | - | - |
316 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 48, 0.471 | 0.52008 | 0.39272 |
317 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 43, 0.468 | 0.52410 | 0.38658 |
514 | 32 | Yes | 50.0 | 10.0 | 0.0 | epoch 37, 0.449 | 0.49424 | 0.37615 |
515 | 32 | Yes | 50.0 | 10.0 | 0.0 | epoch 37, 0.455 | 0.49721 | 0.37513 |
516 | 32 | Yes | 50.0 | 0.0 | 0.0 | epoch 24, 0.074 | 0.08383 | 0.06975 |
517 | 32 | Yes | 50.0 | 0.0 | 0.0 | epoch 36, 0.250 | 0.28379 | 0.22234 |
610 | 32 | No | 50.0 | 10.0 | 0.0 | epoch 56, 0.274 | 0.29742 | 0.24811 |
611 | 32 | No | 50.0 | 0.0 | 0.0 | epoch 27, 0.192 | 0.21691 | 0.16241 |
617 | 32 | No | 50.0 | 10.0 | 1.0 | epoch 45, 0.289 | 0.32676 | 0.26222 |
2018.5.28 rebuttal experiments:
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
---|---|---|---|---|---|---|---|---|
717 | 32 | Yes | 50.0 | 0.0 | 1.0 | epoch 72, 0.279 | | |
models/yolo2_refer_comprehension_416_v14_refcoco+_unc_seed_316/model_epoch-48.pth
models/yolo2_refer_comprehension_416_v14_refcoco+_unc_seed_514/model_epoch-37.pth
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
---|---|---|---|---|---|---|---|---|
114 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 30, 0.263 | - | - |
115 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 29, 0.270 | - | - |
120 | 32 | Yes | 50.0 | 10.0 | 10.0 | epoch 30, 0.332 | - | - |
121 | 32 | Yes | 50.0 | 10.0 | 10.0 | epoch 31, 0.326 | - | - |
122 | 32 | Yes | 20.0 | 5.0 | 1.0 | epoch 20, 0.299 | - | - |
123 | 32 | Yes | 20.0 | 5.0 | 1.0 | epoch 23, 0.293 | - | - |
214 | 32 | Yes | 10.0 | 0.0 | 0.0 | epoch 13, 0.189 | - | - |
215 | 32 | Yes | 10.0 | 0.0 | 0.0 | epoch 12, 0.198 | - | - |
316 | 32 | Yes | 10.0 | 1.0 | 0.0 | epoch 25, 0.247 | - | - |
317 | 32 | Yes | 10.0 | 1.0 | 0.0 | epoch 23, 0.247 | - | - |
514 | 32 | Yes | 50.0 | 0.0 | 0.0 | epoch 51, 0.192 | - | - |
515 | 32 | Yes | 50.0 | 10.0 | 0.0 | epoch 46, 0.342 | - | - |
610 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 33, 0.343 | - | - |
611 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 30, 0.347 | - | - |
712 | 32 | No | 50.0 | 10.0 | 1.0 | epoch 59, 0.260 | - | - |
713 | 32 | No | 50.0 | 10.0 | 0.0 | epoch 22, 0.255 | - | - |
714 | 32 | No | 50.0 | 0.0 | 0.0 | epoch 43, 0.179 | - | - |
816 | 32 | Yes | 50.0 | 10.0 | 1.0 | - | - | |
817 | 32 | Yes | 50.0 | 10.0 | 1.0 | - | - |
2018.5.28 rebuttal experiments:
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
---|---|---|---|---|---|---|---|---|
915 | 32 | Yes | 50.0 | 0.0 | 1.0 | | | |
models/yolo2_refer_comprehension_416_v14_refcocog_google_seed_611/model_epoch-30.pth
models/yolo2_refer_comprehension_416_v14_refcocog_google_seed_515/model_epoch-46.pth
Based on V6, add segmentation: the 1536 x 1 x 1 feature goes through upsampling instead. Not done yet.
Use Conv7 (bsize x 1024 x 13 x 13), pass it through AvgPool, then deconv out a 32 x 32 segmentation map.
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val |
---|---|---|---|---|---|---|
114 | 32 | Yes | 10.0 | 1.0 | 1.0 | |
115 | 32 | Yes | 10.0 | 1.0 | 1.0 |
seed | fine-tuning | coord scale | cls scale | val |
---|---|---|---|---|
116 | Yes | 10.0 | 1.0 | epoch 57, 0.587 |
117 | Yes | 10.0 | 1.0 | epoch 46, 0.592 |
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val |
---|---|---|---|---|---|---|
116 | 32 | Yes | 10.0 | 1.0 | 1.0 | |
117 | 32 | Yes | 10.0 | 1.0 | 1.0 |
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val |
---|---|---|---|---|---|---|
114 | 32 | Yes | 10.0 | 1.0 | 1.0 | |
115 | 32 | Yes | 10.0 | 1.0 | 1.0 |
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val |
---|---|---|---|---|---|---|
111 | 32 | Yes | 10.0 | 1.0 | 1.0 | |
110 | 32 | Yes | 10.0 | 1.0 | 1.0 |
Based on the v14 (416) code, add a co-attention mechanism.
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
---|---|---|---|---|---|---|---|---|
110 | 32 | Yes | 20.0 | 5.0 | 1.0 | epoch 24, 0.621 | - | - |
111 | 32 | Yes | 20.0 | 5.0 | 1.0 | epoch 39, 0.621 | - | - |
112 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 34, 0.587 | - | - |
113 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 34, 0.584 | - | - |
210 | 32 | Yes | 50.0 | 10.0 | 1.0 | - | - | - |
211 | 32 | Yes | 50.0 | 10.0 | 1.0 | - | - | - |
212 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 36, 0.628 | ||
213 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 38, 0.639 | ||
310 | 32 | Yes | 50.0 | 10.0 | 1.0 | |||
311 | 32 | Yes | 50.0 | 10.0 | 1.0 |
2018.5.24 rebuttal experiments:
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
---|---|---|---|---|---|---|---|---|
510 | 32 | Yes | 50.0 | 0.0 | 1.0 | epoch 74, 0.172 | ||
511 | 32 | Yes | 50.0 | 0.0 | 10.0 | epoch 76, 0.158 |
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
---|---|---|---|---|---|---|---|---|
216 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 42, 0.440 | ||
217 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 42, 0.457 | ||
312 | 32 | Yes | 50.0 | 10.0 | 1.0 | |||
313 | 32 | Yes | 50.0 | 10.0 | 1.0 |
2018.5.24 rebuttal experiments:
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
---|---|---|---|---|---|---|---|---|
512 | 32 | Yes | 50.0 | 0.0 | 1.0 | epoch 75, 0.096 | ||
513 | 32 | Yes | 50.0 | 0.0 | 10.0 | epoch 79, 0.082 |
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
---|---|---|---|---|---|---|---|---|
110 | 32 | Yes | 50.0 | 10.0 | 10.0 | epoch 29, 0.308 | - | - |
111 | 32 | Yes | 50.0 | 10.0 | 10.0 | epoch 25, 0.314 | - | - |
210 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 21, 0.311 | ||
211 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 23, 0.316 | ||
314 | 32 | Yes | 50.0 | 10.0 | 1.0 | |||
315 | 32 | Yes | 50.0 | 10.0 | 1.0 |
2018.5.24 rebuttal experiments:
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB |
---|---|---|---|---|---|---|---|---|
514 | 32 | Yes | 50.0 | 0.0 | 1.0 | epoch 91, 0.118 | ||
515 | 32 | Yes | 50.0 | 0.0 | 10.0 | epoch 61, 0.097 |
Use Inception V4 as the base network, replacing the YOLO-v2 weights.
Removed the Multimodal Interaction.
seed | deconv size | fine-tuning | coord scale | seg scale | val | testA | testB |
---|---|---|---|---|---|---|---|
110 | 32 | Yes | 50.0 | 0.0 | epoch 20, 0.275 | 0.31006 | 0.27282 |
111 | 32 | Yes | 50.0 | 0.0 | epoch 16, 0.192 |
seed | deconv size | fine-tuning | coord scale | seg scale | val | testA | testB |
---|---|---|---|---|---|---|---|
112 | 32 | Yes | 50.0 | 0.0 | epoch 21, 0.079 | 0.08261 | 0.07527 |
113 | 32 | Yes | 50.0 | 0.0 | epoch 22, 0.078 |
seed | deconv size | fine-tuning | coord scale | seg scale | val | testA | testB |
---|---|---|---|---|---|---|---|
114 | 32 | Yes | 50.0 | 0.0 | epoch 21, 0.088 | ||
115 | 32 | Yes | 50.0 | 0.0 | epoch 27, 0.132 |
Thu, 6.14, 2018:
- Use the first 80 convolutional layers of YOLO-v3 as the encoder; "first 80 layers" means the convolutional part before the first prediction (YOLO) layer of YOLO-v3.
- The weights are the official YOLO-v3 weights: https://pjreddie.com/media/files/yolov3.weights
- The weight conversion follows: https://github.com/eriklindernoren/PyTorch-YOLOv3/blob/master/models.py
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
---|---|---|---|---|---|---|---|---|---|
111 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 30, 0.535 | - | - | Thu, 6.14, 2018 |
210 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 47, 0.647 | 0.68464 | 0.59627 | Sun, 6.17, 2018 |
211 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 37, 0.648 | 0.68057 | 0.61237 | Sun, 6.17, 2018 |
Strangely, after these runs, seeds 110 and 112 performed much better; both ran for 40 epochs before I killed them. Also, since no validation was set up during training, the per-epoch test results are listed below:
seed | epoch | testA | testB |
---|---|---|---|
110 | 28 | 0.65282 | 0.58135 |
110 | 29 | 0.65795 | 0.58155 |
110 | 30 | 0.65795 | 0.58292 |
110 | 31 | 0.66148 | 0.58626 |
110 | 32 | 0.65441 | 0.58253 |
110 | 33 | 0.65724 | 0.57939 |
110 | 34 | 0.65017 | 0.58116 |
110 | 35 | 0.65812 | 0.58351 |
110 | 36 | 0.65194 | 0.58077 |
110 | 37 | 0.65547 | 0.58096 |
seed | epoch | testA | testB |
---|---|---|---|
112 | 27 | 0.66307 | 0.59235 |
112 | 28 | 0.66219 | 0.59470 |
112 | 29 | 0.66608 | 0.59352 |
112 | 30 | 0.66077 | 0.59058 |
112 | 31 | 0.67191 | 0.59274 |
112 | 32 | 0.66378 | 0.59254 |
112 | 33 | 0.66254 | 0.59254 |
112 | 34 | 0.66820 | 0.59352 |
112 | 35 | 0.66060 | 0.59019 |
112 | 36 | 0.66590 | 0.59529 |
112 | 37 | 0.66731 | 0.59431 |
112 | 38 | 0.66608 | 0.58705 |
112 | 39 | 0.66307 | 0.58822 |
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
---|---|---|---|---|---|---|---|---|---|
113 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 33, 0.336 | - | - | Thu, 6.14, 2018 |
114 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 31, 0.299 | - | - | Thu, 6.14, 2018 |
212 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 52, 0.449 | 0.49738 | 0.37533 | Sun, 6.17, 2018 |
213 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 47, 0.454 | 0.51188 | 0.37451 | Sun, 6.17, 2018 |
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
---|---|---|---|---|---|---|---|---|---|
115 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 29, 0.232 | - | - | Thu, 6.14, 2018 |
116 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 26, 0.234 | - | - | Thu, 6.14, 2018 |
214 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 30, 0.321 | - | - | Sun, 6.17, 2018 |
215 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 34, 0.327 | - | - | Sun, 6.17, 2018 |
Use the output of the second prediction module of YOLO-v3 as the image encoder.
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
---|---|---|---|---|---|---|---|---|---|
116 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 37, 0.582 | 0.62383 | 0.51835 | |
117 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 41, 0.587 | 0.61640 | 0.51011 |
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
---|---|---|---|---|---|---|---|---|---|
125 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 42, 0.586 | 0.62789 | 0.53680 | |
126 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 38, 0.585 | 0.62949 | 0.52758 |
Following MAttNet, a Bi-LSTM is used as the sentence encoder.
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
---|---|---|---|---|---|---|---|---|---|
114 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 39, 0.681 | 0.70992 | 0.63768 | |
117 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 35, 0.675 | |||
124 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 32, 0.679 | |||
125 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 40, 0.665 | |||
126 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch 39, 0.663 | |||
130 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 30, 0.672 | |||
131 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 29, 0.666 |
seed | epoch | testA | testB |
---|---|---|---|
114 | 24 | 0.70567 | 0.62944 |
114 | 25 | 0.71292 | 0.62924 |
114 | 26 | 0.70355 | 0.63494 |
114 | 27 | 0.70744 | 0.63238 |
114 | 28 | 0.71133 | 0.63513 |
114 | 29 | 0.71062 | 0.63631 |
114 | 30 | 0.70780 | 0.62689 |
114 | 31 | 0.71168 | 0.63651 |
114 | 32 | 0.71328 | 0.63690 |
114 | 33 | 0.71186 | 0.63533 |
114 | 34 | 0.71239 | 0.63729 |
114 | 35 | 0.70886 | 0.63631 |
114 | 36 | 0.70744 | 0.63592 |
114 | 37 | 0.71062 | 0.63454 |
114 | 38 | 0.71133 | 0.63101 |
114 | 39 | 0.70992 | 0.63768 |
117 | 19 | 0.70674 | 0.62414 |
117 | 20 | 0.70497 | 0.63101 |
117 | 21 | 0.70285 | 0.63160 |
117 | 22 | 0.72158 | 0.63670 |
117 | 23 | 0.70886 | 0.63454 |
117 | 24 | 0.71398 | 0.63945 |
117 | 25 | 0.71522 | 0.63906 |
117 | 26 | 0.72512 | 0.63906 |
117 | 27 | 0.72035 | 0.63984 |
117 | 28 | 0.72229 | 0.64024 |
117 | 29 | 0.71982 | 0.63886 |
117 | 30 | 0.71646 | 0.63906 |
117 | 31 | 0.71381 | 0.64043 |
117 | 32 | 0.72176 | 0.64573 |
117 | 33 | 0.72123 | 0.64259 |
117 | 34 | 0.71982 | 0.64004 |
117 | 35 | 0.71752 | 0.63670 |
117 | 36 | 0.71699 | 0.63925 |
117 | 37 | 0.71769 | 0.64082 |
117 | 38 | 0.71575 | 0.64141 |
117 | 39 | 0.71451 | 0.63808 |
124 | 28 | 0.72264 | 0.62964 |
124 | 29 | 0.72371 | |
124 | 30 | 0.71540 | |
124 | 31 | 0.71911 | |
124 | 32 | 0.71876 | |
124 | 33 | 0.71487 | |
124 | 34 | 0.71858 | |
124 | 35 | 0.71363 | |
124 | 36 | 0.71716 |
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
---|---|---|---|---|---|---|---|---|---|
110 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 49, 0.479 | 0.53790 | 0.39476 | Sun, 6.24, 2018 |
111 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 51, 0.477 | 0.53615 | 0.40151 | Sun, 6.24, 2018 |
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
---|---|---|---|---|---|---|---|---|---|
112 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 22, 0.382 | - | - | Sun, 6.24, 2018 |
113 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 20, 0.379 | - | - | Sun, 6.24, 2018 |
Replace the cls target of 1 with a Gaussian target.
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
---|---|---|---|---|---|---|---|---|---|
110 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | |||
111 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | |||
112 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch | |||
113 | 32 | Yes | 50.0 | 10.0 | 1.0 | epoch |
Did not work.
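One plausible reading of the Gaussian cls target above (my sketch, with an assumed sigma; not the repo's code): instead of a hard 1 at the ground-truth cell, place a 2-D Gaussian centred on that cell over the 13 x 13 grid.

```python
import numpy as np

def gaussian_target(cx, cy, size=13, sigma=1.0):
    # 2-D Gaussian centred on the ground-truth cell (cx, cy), peak value 1.
    ys, xs = np.mgrid[0:size, 0:size]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

t = gaussian_target(6, 6)
```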
Take the feature map output by layer 61 (bsize * 512 * 26 * 26), apply max pooling to get a bsize * 512 * 13 * 13 feature map, then concatenate it with the layer-80 output (bsize * 1024 * 13 * 13) to obtain a bsize * 1536 * 13 * 13 feature map. Finally, a fully convolutional nonlinear transform maps this feature map, which replaces the original layer-80-only feature map.
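The downsample-and-concat step above can be sketched in numpy (batch dimension omitted; a 2x2 max pool stands in for the pooling layer, and the 1x1 "fully convolutional" transform is shown as a channel-mixing matrix):

```python
import numpy as np

def max_pool_2x2(x):
    # x: (C, H, W) with even H, W -> (C, H/2, W/2)
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

f61 = np.random.rand(512, 26, 26)   # layer-61 feature map
f80 = np.random.rand(1024, 13, 13)  # layer-80 feature map

fused = np.concatenate([max_pool_2x2(f61), f80], axis=0)  # (1536, 13, 13)

# A 1x1 convolution is a per-pixel linear map over channels.
w = np.random.rand(1024, 1536) * 0.01
out = np.einsum('oc,chw->ohw', w, fused)  # (1024, 13, 13), replaces the layer-80 map
```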
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
---|---|---|---|---|---|---|---|---|---|
110 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 25, 0.677 | Wed, 7.18, 2018 | ||
111 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 26, 0.674 | Wed, 7.18, 2018 |
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
---|---|---|---|---|---|---|---|---|---|
112 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 50, 0.456 | Wed, 7.18, 2018 | ||
113 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 54, 0.466 | Wed, 7.18, 2018 |
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
---|---|---|---|---|---|---|---|---|---|
114 | 32 | Yes | 10.0 | 1.0 | 1.0 | 0.367 | - | - | Wed, 7.18, 2018 |
115 | 32 | Yes | 10.0 | 1.0 | 1.0 | 0.360 | - | - | Wed, 7.18, 2018 |
This version adopts the structure from Hu Ronghang's Modeling Relationships in Referential Expressions with Compositional Modular Networks: three self-attentions over the query sequence, namely subject attention, relationship attention, and object attention. The three results are concatenated as the representation of the query sequence.
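A schematic of the three parallel self-attention heads (subject / relationship / object) over word features, each with its own learned scoring vector; all names and dimensions here are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def self_attend(words, score_vec):
    # words: (T, D) word features; score_vec: (D,) learned scoring vector
    weights = softmax(words @ score_vec)
    return weights @ words  # (D,) attended summary

rng = np.random.default_rng(0)
words = rng.random((5, 8))                 # 5 words, 8-d embeddings
heads = [rng.random(8) for _ in range(3)]  # subject / relationship / object
query_repr = np.concatenate([self_attend(words, v) for v in heads])  # (24,)
```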
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
---|---|---|---|---|---|---|---|---|---|
110 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | Wed, 7.18, 2018 | ||
111 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | Wed, 7.18, 2018 |
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
---|---|---|---|---|---|---|---|---|---|
112 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | Wed, 7.18, 2018 | ||
113 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | Wed, 7.18, 2018 |
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
---|---|---|---|---|---|---|---|---|---|
114 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | Wed, 7.18, 2018 | ||
115 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | Wed, 7.18, 2018 |
In this version, the sentence is processed the same way as in Hu Ronghang's work, except that a two-layer bidirectional LSTM encodes the query sequence.
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
---|---|---|---|---|---|---|---|---|---|
126 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 34, 0.687 | 0.72565 | 0.64396 | Wed, 7.23, 2018 |
116 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 42, 0.684 | 0.72618 | 0.64455 | Wed, 7.23, 2018 |
117 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 45, 0.684 | 0.72406 | 0.63965 | Wed, 7.23, 2018 |
Regenerated the dictionary, shrinking the vocabulary to 1,991 words; still running. Experiments show the reduced vocabulary does affect the results to some degree.
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
---|---|---|---|---|---|---|---|---|---|
132 | 32 | Yes | 10.0 | 1.0 | 2.0 | epoch 30, 0.665 | Wed, 7.25, 2018 | ||
133 | 32 | Yes | 10.0 | 1.0 | 2.0 | epoch 28, 0.671 | Wed, 7.25, 2018 | ||
134 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 32, 0.658 | Wed, 7.25, 2018 | ||
135 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch 35, 0.657 | Wed, 7.25, 2018 |
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
---|---|---|---|---|---|---|---|---|---|
112 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | Wed, 7.18, 2018 | ||
113 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | Wed, 7.18, 2018 |
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | val | testA | testB | date |
---|---|---|---|---|---|---|---|---|---|
114 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | Wed, 7.18, 2018 | ||
115 | 32 | Yes | 10.0 | 1.0 | 1.0 | epoch | Wed, 7.18, 2018 |
Based on Version 6, with horizontal-flip data augmentation. Did not work: training blew up and performance collapsed completely; the cause remains to be investigated.
Based on Version 6, added an IoU loss computing the jaccard overlap between ground-truth and predicted boxes, following https://github.com/amdegroot/ssd.pytorch/blob/master/layers/box_utils.py#L48 , but it did not help. The IoU loss even seems to hurt the predictions.
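The jaccard overlap referenced above is plain intersection-over-union; a small sketch for (x1, y1, x2, y2) boxes:

```python
def iou(box_a, box_b):
    # boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```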
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | iou scale | val | testA | testB |
---|---|---|---|---|---|---|---|---|---|
110 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | poor results | ||
111 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | poor results | ||
112 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | poor results | ||
113 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | poor results | ||
114 | 32 | Yes | 10.0 | 1.0 | 1.0 | 2.0 | poor results | ||
115 | 32 | Yes | 10.0 | 1.0 | 1.0 | 2.0 | poor results |
Based on Version 6, added attribute prediction. Seeds 110 and 111 include the parameter initialization of `self.attr_fc`; for seeds 114 and 115 I forgot to initialize `self.attr_fc`, though the impact should be small. Experiments show that attribute prediction clearly helps on testB (i.e., non-person objects), a gain of nearly 2%, while the results on testA barely change.
Follow-up: at epoch 35, seed 110 reached 0.688 on val; after testing, testA reached 0.73113.
Seeds 140-143 concatenate Google's Word2Vec with GloVe.
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
---|---|---|---|---|---|---|---|---|---|---|
110 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | epoch 21, 0.688 | 0.72512 | 0.66143 | 4 train, 39 test |
111 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | epoch 31, 0.686 | 0.72176 | 0.62414 | 4 train, 39 test |
120 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 24, 0.674 | 0.70355 | 0.62630 | 4 train, drop 0.1 |
121 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 29, 0.686 | 0.72335 | 0.63965 | 4 train, drop 0.1 |
126 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 25, 0.681 | 0.71946 | 0.64534 | 4 train, drop 0.0 |
127 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 28, 0.689 | 0.73378 | 0.62355 | 4 train, drop 0.0 |
134 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 30, 0.659 | 0.70161 | 0.61433 | Killed |
135 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 31, 0.659 | 0.70886 | 0.61276 | Killed |
136 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 28, 0.655 | 0.70002 | 0.60608 | Killed |
137 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 26, 0.655 | 0.70037 | 0.62277 | Killed |
114 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | epoch 26, 0.676 | 0.72229 | 0.63847 | Killed |
115 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | epoch 26, 0.678 | 0.71416 | 0.62826 | Killed |
134 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 18, 0.699 | 0.73307 | 0.64789 | 4 train, drop 0.0 |
135 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 17, 0.701 | 0.73025 | 0.64868 | 4 train, drop 0.0 |
136 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 22, 0.716 | 0.74934 | 0.66673 | 4 train, drop 0.0 |
137 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 23, 0.706 | 0.74969 | 0.66654 | 4 train, drop 0.0 |
135 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 23, 0.706 | 0.75146 | 0.66104 | 39 train drop 0.0 |
137 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 20, 0.712 | 0.75252 | 0.64966 | 39 train drop 0.0 |
140 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 25, 0.723 | 0.76825 | 0.67085 | 4 train, drop 0.0 |
141 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 25, 0.726 | 0.75234 | 0.66202 | 4 train, drop 0.0 |
142 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 21, 0.719 | 0.75694 | 0.66359 | 4 train, drop 0.0 |
143 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 25, 0.721 | 0.75853 | 0.66909 | 4 train, drop 0.0 |
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | Note |
---|---|---|---|---|---|---|---|---|
110 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 19, 0.389 | |
111 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 18, 0.399 | |
112 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 20, 0.419 | |
113 | 32 | Yes | 20.0 | 1.0 | 2.0 | 2.0 | epoch 18, 0.415 | |
114 | 32 | Yes | 20.0 | 2.0 | 2.0 | 2.0 | epoch 22, 0.411 | |
115 | 32 | Yes | 20.0 | 2.0 | 1.0 | 1.0 | epoch 20, 0.411 | |
122 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | ||
123 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | ||
144 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 22, 0.431 | 4 train, drop 0.0 |
145 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 21, 0.437 | 4 train, drop 0.0 |
146 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 20, 0.438 | 4 train, drop 0.0 |
Based on Version 6: segmentation now uses only the visual-attention result. Previously, the 1024-d visual attention was concatenated with the 2048-d sequence embedding and then passed through DeConv to predict the segmentation.
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | date |
---|---|---|---|---|---|---|---|---|---|---|
112 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | epoch 24, 0.677 | not fully run | |
113 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | epoch 20, 0.681 | not fully run | |
Based on Version 9, mimicking MAttNet's visual subject representation module: concatenate layer 62 and layer 80 of YOLO V3, pass the result through one 1x1 convolution layer, then AvgPool to predict attributes. Meanwhile, the concatenated feature map is concatenated again with the layer-80 map to serve as the visual representation. This grows the network's parameters by nearly 20%, to an astonishing 102,210,023, over 100 million parameters!
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
---|---|---|---|---|---|---|---|---|---|---|
114 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | epoch 25, 0.670 | 100.102.33.39 | ||
115 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | epoch 27, 0.678 | 100.102.33.39 | ||
116 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | epoch 26, 0.675 | 100.102.33.39 |
Based on Version 6, added RoI Pooling to generate captions. Roughly 80-odd million parameters.
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | cap scale | val | testA | testB | Note |
---|---|---|---|---|---|---|---|---|---|---|
112 | 32 | Yes | 10.0 | 1.0 | 1.0 | 1.0 | epoch 20, 0.646 | 100.102.33.39, Stop | ||
113 | 32 | Yes | 10.0 | 1.0 | 1.0 | 0.1 | epoch 30, 0.673 | 100.102.33.39, Stop | ||
117 | 32 | Yes | 10.0 | 1.0 | 1.0 | 0.5 | epoch 20, 0.653 | 100.102.33.39, Stop | ||
120 | 32 | Yes | 10.0 | 1.0 | 1.0 | 0.1 | epoch 22, 0.654 | 100.102.33.4, Running | ||
121 | 32 | Yes | 10.0 | 1.0 | 1.0 | 0.1 | epoch 28, 0.656 | 100.102.33.4, Running |
Based on Version 9. Statistics first: in RefCOCO there are 21,013 referring expressions for person, while the remaining common objects each make up only a small share, 21,391 in total. To balance this, within each training batch (e.g., batch size 10), 5 samples are drawn at random from person and the other 5 from common objects. In addition, a weighting scheme is used: for each class, compute its share of all referring expressions and take np.log() of the inverse to obtain the class weight. Person therefore gets a small weight, only about 0.7, while the rarest object can reach np.log(42404.0 / 1), roughly 10.65.
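The weighting rule above (weight = log of the total referring count over the class's count) can be checked quickly; both the ~0.7 person weight and the 10.65 maximum follow:

```python
import numpy as np

total = 42404.0  # 21013 person + 21391 common-object referring expressions

def class_weight(count):
    # Inverse-frequency weight: np.log(total / count)
    return np.log(total / count)

w_person = class_weight(21013)   # ~0.70
w_rarest = class_weight(1)       # ~10.65
```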
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
---|---|---|---|---|---|---|---|---|---|---|
112 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 32, 0.657 | 100.102.33.4, batch | ||
113 | 32 | Yes | 10.0 | 1.0 | 2.0 | 1.0 | epoch 28, 0.657 | 100.102.33.4, batch |
In this version, RoI Pooling is replaced with bilinear interpolation. References:
- https://github.com/jcjohnson/densecap/blob/master/densecap/modules/BilinearRoiPooling.lua
- https://github.com/jcjohnson/densecap/blob/master/densecap/modules/BoxToAffine.lua
- https://github.com/jcjohnson/densecap/blob/master/densecap/modules/BatchBilinearSamplerBHWD.lua
- https://github.com/qassemoquab/stnbhwd/blob/master/AffineGridGeneratorBHWD.lua
- https://zhuanlan.zhihu.com/p/28455306
- http://www.telesens.co/2018/03/11/object-detection-and-classification-using-r-cnns/
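The core operation behind the references above is bilinear sampling at fractional coordinates; a minimal single-channel sketch ((y, x) in pixel coordinates, not the batched BHWD modules linked above):

```python
import numpy as np

def bilinear_sample(img, y, x):
    # img: (H, W); (y, x) may be fractional. Interpolate from the 4 neighbours.
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, img.shape[0] - 1), min(x0 + 1, img.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * img[y0, x0] + (1 - dy) * dx * img[y0, x1]
            + dy * (1 - dx) * img[y1, x0] + dy * dx * img[y1, x1])
```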
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | cap_scale | pool size | val | Note |
---|---|---|---|---|---|---|---|---|---|
110 | 32 | Yes | 20.0 | 1.0 | 2.0 | 0.1 | 1 | epoch 17, 0.373 | Killed |
111 | 32 | Yes | 20.0 | 1.0 | 2.0 | 0.5 | 1 | epoch 22, 0.356 | Killed |
112 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | 1 | epoch 15, 0.091 | Killed |
113 | 32 | Yes | 20.0 | 1.0 | 2.0 | 0.1 | 3 | epoch 20, 0.379 | Killed |
114 | 32 | Yes | 20.0 | 1.0 | 2.0 | 0.5 | 3 | epoch 23, 0.354 | Killed |
115 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | 3 | epoch 17, 0.106 | Killed |
116 | 32 | Yes | 20.0 | 1.0 | 2.0 | 0.1 | 5 | epoch 22, 0.384 | Killed |
117 | 32 | Yes | 20.0 | 1.0 | 2.0 | 0.5 | 5 | epoch 16, 0.360 | Killed |
120 | 32 | Yes | 20.0 | 1.0 | 2.0 | 0.1 | 7 | epoch 15, 0.377 | Killed |
121 | 32 | Yes | 20.0 | 1.0 | 2.0 | 0.1 | 7 | epoch 15, 0.363 | Killed |
112 | 32 | Yes | 20.0 | 1.0 | 2.0 | 0.05 | 7 | Killed | |
113 | 32 | Yes | 20.0 | 1.0 | 2.0 | 0.05 | 7 | Killed |
This version applies visual attention twice. The first attention is between YOLO-v3's layer 60 (bsize * 512 * 26 * 26) and the sequence embedding; note that the sequence embedding here uses two two-layer LSTMs, embedding two different sentences separately. The second attention is between YOLO-v3's layer 80 (bsize * 1024 * 13 * 13) and the sequence embedding. The two visual-attention results are concatenated with each other and with the sequence embedding, giving 512 + 1024 + 2048 = 3584 dimensions. The loss is the sum of the losses over the two attention maps.
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
---|---|---|---|---|---|---|---|---|---|---|
110 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 20, 0.688 | 4 train, | ||
111 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 29, 0.702 | 4 train, | ||
110 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 26, 0.699 | 39 train, | ||
121 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 21, 0.699 | 39 train, |
Roughly the same as Version 15, but with extra convolutional layers added for the transform.
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
---|---|---|---|---|---|---|---|---|---|---|
114 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 21, 0.677 | - | - | 39 train
116 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 24, 0.689 | - | - | 39 train
Replaced the two-layer LSTM from Version 9 with a single-layer bidirectional LSTM, and added a new function `get_sent_neg()` in utils.py that "reverses" a sentence, as follows:
```python
import numpy as np

def get_sent_neg(sent_pos, sent_stop):
    # Reverse the tokens up to and including the stop index;
    # trailing padding is left untouched.
    sent_neg = np.copy(sent_pos)
    sent_neg[:sent_stop + 1] = sent_pos[:sent_stop + 1][::-1]
    return sent_neg
```
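A quick standalone check of the reversal behavior (the helper is redefined here so the snippet runs on its own; the token ids and zero-padding scheme are illustrative):

```python
import numpy as np

def get_sent_neg(sent_pos, sent_stop):
    # Reverse the tokens up to and including the stop index.
    sent_neg = np.copy(sent_pos)
    sent_neg[:sent_stop + 1] = sent_pos[:sent_stop + 1][::-1]
    return sent_neg

sent = np.array([4, 7, 9, 0, 0])     # token ids, zero-padded after position 2
print(get_sent_neg(sent, 2))          # [9 7 4 0 0]
```

Only the real tokens are flipped; the reversed copy feeds the backward direction while the padding stays in place.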
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
---|---|---|---|---|---|---|---|---|---|---|
125 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 21, 0.713 | 0.76189 | 0.66634 | 39 train |
126 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 27, 0.713 | 0.75164 | 0.66026 | 39 train |
127 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 27, 0.718 | 0.75747 | 0.66065 | 39 train |
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
---|---|---|---|---|---|---|---|---|---|---|
122 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 20, 0.422 | - | - | 39 train |
123 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 21, 0.421 | - | - | 39 train |
124 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 20, 0.424 | - | - | 39 train |
Based on Version 9, with a Transformer module integrated.
Based on Version 9, with a cross-gating mechanism added.
Based on Version 9, with three additional sentences.
Added negative-sample sentences.
The feature maps output at layers conv 80, conv 92, and conv 104 are each passed through visual attention, concatenated with each other, and then combined with the sequence embedding to jointly predict the coordinates.
In this version, each of the three feature maps YOLO-v3 uses for coordinate prediction predicts a box independently; if any one of the three predictions is correct, the model's output counts as correct.
This evaluation somewhat undermines the original motivation of our model, but it does help improve performance.
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
---|---|---|---|---|---|---|---|---|---|---|
110 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 25, 0.727 | 0.76472 | 0.67066 | 39 train |
111 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 25, 0.736 | 0.76878 | 0.67517 | 39 train |
112 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 29, 0.731 | 0.77355 | 0.68126 | 39 train |
113 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 27, 0.736 | 0.76719 | 0.68616 | 39 train |
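The "correct if any scale hits" evaluation above can be sketched as an any-of-k IoU check. The 0.5 IoU threshold and the (x1, y1, x2, y2) box format are assumptions for illustration; the helper names are hypothetical.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def any_scale_correct(pred_boxes, gt_box, thresh=0.5):
    """Count the sample as correct if any scale's box matches the ground truth."""
    return any(iou(p, gt_box) >= thresh for p in pred_boxes)

gt = [10, 10, 50, 50]
preds = [[100, 100, 140, 140],   # scale 1 (e.g. conv 80): miss
         [12, 8, 52, 48],        # scale 2 (e.g. conv 92): close hit
         [0, 0, 30, 30]]         # scale 3 (e.g. conv 104): miss
print(any_scale_correct(preds, gt))  # True
```

Because a single hit among the three boxes suffices, this metric is necessarily an upper bound on the accuracy of any rule that must first commit to one of the three boxes.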
In this version, I concatenate the overall representations of the two sequences (s1, s2) and use the result to apply visual attention to the feature map. The visual-attention output is then used to apply sequence attention to s1 and s2 separately. The outputs of the two attention steps are concatenated and used to predict the coordinates.
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
---|---|---|---|---|---|---|---|---|---|---|
114 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 29, 0.711 | 0.74722 | 0.66104 | 39 train |
115 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 30, 0.702 | 0.74456 | 0.65319 | 39 train |
116 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 29, 0.712 | 0.75057 | 0.66418 | 39 train |
117 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | epoch 25, 0.717 | 0.74686 | 0.65810 | 39 train |
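The second stage, where the visual-attention output attends back over each sentence's hidden states, can be sketched as follows. The dimensions, the dot-product scoring, and reusing the visual vector directly as the query are illustrative assumptions, not the model's exact wiring.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sequence_attention(hiddens, query):
    """Soft attention over per-token LSTM hidden states.
    hiddens: (T, D); query: (D,). Returns a weighted sum (D,)."""
    weights = softmax(hiddens @ query)   # (T,) one weight per token
    return weights @ hiddens             # (D,)

rng = np.random.default_rng(0)
h_s1 = rng.standard_normal((12, 512))    # hidden states for sentence s1
h_s2 = rng.standard_normal((9, 512))     # hidden states for sentence s2
visual = rng.standard_normal(512) * 0.1  # visual-attention output as the query

# Attend over s1 and s2 separately, then concatenate for the coordinate head
ctx = np.concatenate([sequence_attention(h_s1, visual),
                      sequence_attention(h_s2, visual)])
print(ctx.shape)  # (1024,)
```

The concatenated context is what would feed the coordinate predictor in this version.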
In this version, I remove the segmentation information as well as the attributes-prediction module, and rely on more feature maps for prediction.
Concretely, with the segmentation and attributes-prediction modules fully removed, I now take 10 feature-map layers from YOLO-v3 (107 feature maps in total): conv 20, conv 32, conv 44, conv 56, conv 68, conv 80, conv 86, conv 92, conv 98, and conv 104. Each layer predicts independently with its own visual-attention weights. If any of the 10 predicted boxes is correct, the sample is for now counted as correct. After a single epoch of training, this model already reaches an accuracy of 63.532.
seed | deconv size | fine-tuning | coord scale | cls scale | val | testA | testB | Note |
---|---|---|---|---|---|---|---|---|
114 | 32 | Yes | 20.0 | 1.0 | epoch 29, 0.824 | 0.84815 | 0.77841 | 39 train |
115 | 32 | Yes | 20.0 | 1.0 | epoch 28, 0.821 | 0.85734 | 0.77743 | 39 train |
116 | 32 | Yes | 20.0 | 1.0 | epoch 20, 0.824 | 0.85293 | 0.77409 | 39 train |
117 | 32 | Yes | 20.0 | 1.0 | epoch 28, 0.819 | 0.85346 | 0.78508 | 39 train |
But this raises a new problem: how do we pick the best box out of these 10? Moreover, picking the best proposal from Mask R-CNN or SSD still gives higher testA/testB accuracy than our approach here, which is the awkward part.
Rewritten on top of Version 17 so that each input is a single sentence rather than the sentence pairs [A;C], [B;B], [C;A]; a two-layer bidirectional LSTM encodes the sentence.
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
---|---|---|---|---|---|---|---|---|---|---|
114 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | - | - | - | 39 train
115 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | - | - | - | 39 train
116 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | - | - | - | 39 train
117 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | - | - | - | 39 train
Rewritten on top of Version 17 so that each input is a single sentence rather than the sentence pairs [A;C], [B;B], [C;A]; a two-layer LSTM encodes the sentence.
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
---|---|---|---|---|---|---|---|---|---|---|
110 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | - | - | - | 39 train
111 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | - | - | - | 39 train
112 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | - | - | - | 39 train
113 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | - | - | - | 39 train
Rewritten on top of Version 17 so that each input is a single sentence rather than the sentence pairs [A;C], [B;B], [C;A]; a single-layer LSTM encodes the sentence.
seed | deconv size | fine-tuning | coord scale | cls scale | seg scale | attr scale | val | testA | testB | Note |
---|---|---|---|---|---|---|---|---|---|---|
114 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | - | - | - | 39 train
115 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | - | - | - | 39 train
116 | 32 | Yes | 20.0 | 1.0 | 2.0 | 1.0 | - | - | - | 39 train