End-to-end Learning of Deep Visual Representations for Image Retrieval #17

chullhwan-song · 2018-07-12T01:01:56Z

chullhwan-song · 2018-07-12T01:40:22Z

What?
- Instance-level image retrieval
- A learnable R-MAC descriptor
- Landmark dataset: uses the neuralcode dataset introduced in Neural Codes for Image Retrieval #14. As noted there, this data is noisy, so it is cleaned first.
- Training is based on a triplet loss. See Deep metric learning using Triplet network #16.
- Evaluated on the Oxford 5k, Paris 6k and Holidays test sets. This was the SOTA until DELF (Large-Scale Image Retrieval with Attentive Deep Local Features #4) came out.
- A paper from NAVER LABS Europe (formerly the Xerox research centre).
Training set construction
- The noisy neuralcode dataset is cleaned to build a clean training set.
  - SIFT & Hessian-Affine keypoints -> BoW, kNN -> graph; run on a 32-core server
  - This cleaning process leaves about 49,000 images (divided in 42,410 training and 6,382 validation images) still belonging to one of the 586 landmarks,
  - Available for download via a link -> DELF uses it as well.
  - There is also a bounding box estimation step, which is a way to obtain the local regions for R-MAC (originally a rigid grid).
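R-MAC
- The base descriptor in this paper is MAC; R-MAC upgrades it into a feature that also captures local information.
- It is a popular feature that is now widely used and adapted.
- MAC builds a feature whose length is the channel dimension of the conv feature map: each channel has size w x h, and the max over those w x h values is taken.
- Building on this, the rigid grid regions are obtained in a pre-processing step as follows.
  - In the released Caffe-based source, L = 2 ?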

    def get_rmac_region_coordinates(self, H, W, L):
        # Almost verbatim from Tolias et al Matlab implementation.
        # Could be heavily pythonized, but really not worth it...
        # Desired overlap of neighboring regions
        ovr = 0.4
        # Possible regions for the long dimension
        steps = np.array((2, 3, 4, 5, 6, 7), dtype=np.float32)
        w = np.minimum(H, W)

        b = (np.maximum(H, W) - w) / (steps - 1)
        # steps(idx) regions for long dimension. The +1 comes from Matlab
        # 1-indexing...
        idx = np.argmin(np.abs(((w**2 - w * b) / w**2) - ovr)) + 1

        # Region overplus per dimension
        Wd = 0
        Hd = 0
        if H < W:
            Wd = idx
        elif H > W:
            Hd = idx

        regions_xywh = []
        for l in range(1, L+1):
            wl = np.floor(2 * w / (l + 1))
            wl2 = np.floor(wl / 2 - 1)
            # Center coordinates
            if l + Wd - 1 > 0:
                b = (W - wl) / (l + Wd - 1)
            else:
                b = 0
            cenW = np.floor(wl2 + b * np.arange(l - 1 + Wd + 1)) - wl2
            # Center coordinates
            if l + Hd - 1 > 0:
                b = (H - wl) / (l + Hd - 1)
            else:
                b = 0
            cenH = np.floor(wl2 + b * np.arange(l - 1 + Hd + 1)) - wl2

            for i_ in cenH:
                for j_ in cenW:
                    regions_xywh.append([j_, i_, wl, wl])

        # Round the regions. Careful with the borders!
        for i in range(len(regions_xywh)):
            for j in range(4):
                regions_xywh[i][j] = int(round(regions_xywh[i][j]))
            if regions_xywh[i][0] + regions_xywh[i][2] > W:
                regions_xywh[i][0] -= ((regions_xywh[i][0] + regions_xywh[i][2]) - W)
            if regions_xywh[i][1] + regions_xywh[i][3] > H:
                regions_xywh[i][1] -= ((regions_xywh[i][1] + regions_xywh[i][3]) - H)
        return np.array(regions_xywh).astype(np.float32)

MAC features are computed over these regions (sub-regions, seen from the conv feature map).
Then, per region: l2 norm -> whitening -> l2 norm; the region vectors are summed and l2-normalized once more to obtain the final feature.
- Whitening is PCA-based, but the original MAC features are not learnable. To address this, it seems to be replaced by shift + FC, i.e. an affine transformation (it has to be differentiable for backprop).
  
  * In addition, the regions obtained by the bounding box estimation mentioned above replace the rigid grid scheme.
- Based on Faster R-CNN, e.g. for feature extraction.
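The pooling and normalization chain described above can be sketched in NumPy. This is a minimal sketch, not the authors' Caffe implementation: the function names (`mac`, `rmac`), the `(D, H, W)` feature-map layout, regions given as `(x, y, w, h)` in feature-map cells, and the `shift`/`proj` parameters standing in for the shift + FC layer are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """L2-normalize along the last axis (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def mac(feature_map):
    """MAC: per-channel max over all spatial positions of a (D, H, W) map."""
    return feature_map.reshape(feature_map.shape[0], -1).max(axis=1)


def rmac(feature_map, regions, shift=None, proj=None):
    """R-MAC over regions given as (x, y, w, h) in feature-map cells.

    shift / proj are hypothetical stand-ins for the learnable shift + FC
    layer; by default they are a zero shift and an identity projection.
    """
    dim = feature_map.shape[0]
    shift = np.zeros(dim) if shift is None else shift
    proj = np.eye(dim) if proj is None else proj

    # 1) MAC inside every region
    vecs = np.stack([
        mac(feature_map[:, int(y):int(y + h), int(x):int(x + w)])
        for x, y, w, h in regions
    ])
    # 2) l2 norm -> affine "whitening" -> l2 norm, per region
    vecs = l2_normalize(vecs)
    vecs = l2_normalize((vecs - shift) @ proj)
    # 3) sum over regions, then a final l2 norm
    return l2_normalize(vecs.sum(axis=0))
```

In practice `shift` and `proj` would be initialised from a PCA fitted on region vectors and then updated by backprop; here they only mark where that affine step sits in the chain.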
Triplet-loss-based training
- The descriptor is finally made learnable and then trained.
  - From which layer of the backbone are the features extracted? (At implementation time, every such detail needs thought.)
    - For VGG16 there is exactly one mention of the layer: Fig. 5 names conv5_3, which in Caffe is the layer just before pool5.
    - Nothing is stated for ResNet-101, but the released Caffe source uses "res5c".

....,
layer {
        bottom: "res5b"
        bottom: "res5c_branch2c"
        top: "res5c"
        name: "res5c"
        type: "Eltwise"
}

layer {
        bottom: "res5c"
        top: "res5c"
        name: "res5c_relu"
        type: "ReLU"
}

## Get rmac regions with a RoiPooling layer. If batch size was 1, we end up with N_regions x D x pooled_h x pooled_w
layer {
   name: "pooled_rois"
   type: "ROIPooling"
   bottom: "res5c"
   bottom: "rois"
   top: "pooled_rois"
   roi_pooling_param {
     pooled_w: 1
     pooled_h: 1
    spatial_scale: 0.03125 # 1/32
   }
 }
layer {
  name: "pooled_rois/normalized"
  type: "Python"
  bottom: "pooled_rois"
  top: "pooled_rois/normalized"
  python_param {
    module: 'custom_layers'
    layer: 'NormalizeLayer'
    param_str: "{}"
  }
}
....
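The `pooled_rois` layer above pools each region to 1 x 1 with `spatial_scale: 0.03125`, which amounts to a per-region max over the `res5c` map. A rough NumPy sketch of that step follows, assuming ROIs are given as `(x1, y1, x2, y2)` in input-image pixels and using a simplified rounding and clipping rule; the function name `roi_max_pool_1x1` is made up for this example, and the real Caffe layer may differ in its border handling.

```python
import numpy as np


def roi_max_pool_1x1(feature_map, roi_xyxy, spatial_scale=1.0 / 32):
    """Max-pool one ROI of a (D, H, W) feature map down to a D-dim vector.

    roi_xyxy is (x1, y1, x2, y2) in image pixels (an assumption of this
    sketch); coordinates are scaled to feature-map cells and clipped.
    """
    _, height, width = feature_map.shape
    # Scale to feature-map cells, rounding half up (coordinates are >= 0)
    x1, y1, x2, y2 = (int(np.floor(v * spatial_scale + 0.5)) for v in roi_xyxy)
    # Clip so the region always covers at least one cell
    x1 = min(max(x1, 0), width - 1)
    y1 = min(max(y1, 0), height - 1)
    x2 = min(max(x2, x1), width - 1)
    y2 = min(max(y2, y1), height - 1)
    return feature_map[:, y1:y2 + 1, x1:x2 + 1].max(axis=(1, 2))
```

With a stride-32 backbone, a 128 x 128 input gives a 4 x 4 map, so an ROI covering the top-left 64 x 64 pixels pools over roughly the top-left cells of that map.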

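Comparison with classification loss
For triplet-loss training, how the negative samples are obtained is fundamentally important (see FaceNet).
I won't go into detail, but the paper uses networks such as VGG/ResNet-101, runs into difficulties such as GPU memory, and proposes a workaround.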
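For reference, a triplet ranking loss on l2-normalized descriptors can be written in a few lines. This is a generic sketch rather than the paper's training code: the margin value `0.1`, the function names, and the simple "hardest negative" picker are assumptions used only to illustrate why the choice of negatives matters.

```python
import numpy as np


def triplet_loss(query, positive, negative, margin=0.1):
    """0.5 * max(0, margin + ||q - d+||^2 - ||q - d-||^2) for one triplet.

    The margin of 0.1 is an illustrative value, not taken from the notes.
    """
    d_pos = np.sum((query - positive) ** 2)
    d_neg = np.sum((query - negative) ** 2)
    return 0.5 * max(0.0, margin + d_pos - d_neg)


def hardest_negative(query, negatives):
    """Index of the negative closest to the query (it gives the largest loss)."""
    dists = np.sum((negatives - query) ** 2, axis=1)
    return int(np.argmin(dists))


q = np.array([1.0, 0.0])
pos = np.array([1.0, 0.0])
neg = np.array([0.0, 1.0])
easy = triplet_loss(q, pos, neg)  # negative is far away: loss is zero
hard = triplet_loss(q, neg, pos)  # roles swapped: margin is violated
```

An easy triplet contributes zero loss and therefore zero gradient, which is why mining negatives that violate the margin is central to this kind of training.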
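Experiments
- Query images
  - Cropped: Oxf5k/Par6k
  - Original image: Holidays
- What Oxf5k means (obvious, but...)
  - It of course means the Oxford dataset was used.
  - It contains 5,063 images in total.
  - There are query lists and ground-truth lists.
    - pitt_rivers_1_query.txt : list(pitt_rivers_1_good.txt, pitt_rivers_1_junk.txt, pitt_rivers_1_ok.txt)
  - What I wonder: is the AP computed from the query's ranking over only those 5,063 images?
    - That seems too small to be meaningful (I have doubts).
    - Since Oxf105k is listed right next to it, this reading is probably correct.
      - Oxf5k vs Oxf105k = 83.1 vs 78.6
      - The database grew by 100,000 images, yet the gap is only about 5 points (maybe that is normal).
- It was effectively the SOTA until DELF came out (the bar was set very high). However, DELF's own reported result comes from an experiment that combines it with this work.
- Among the various other experiments, the SOTA is reached with ResNet-101.
- Training
  - "batch size of 64 triplets per iteration": I would like to ask whether this actually fits in GPU memory. On top of that, the image size is not a fixed 224 but based on 800, so the images themselves are very large.
  - Is training done with dynamic rather than fixed image sizes?
    - First, from the paper:
      - The second consideration is the amount of memory required during training, as we train with large images (largest side resized to 800 pixels) and with three streams at the same time.
    - Within a fixed batch size, each image is resized so that its longest side is 800, which seems to mean that training preserves each image's native aspect ratio.
    - In actual training, however, a fixed amount of memory has to be reserved, so width & height must be statically fixed to some size, as far as I know.
      - For example, SPPNet is designed to accept original-size inputs rather than a fixed input size. But in actual training (in the open-source code) everything I have checked so far uses fixed sizes. Only at inference can it run without resizing.
    - My guess is that actual training first resizes the largest side to 800 pixels, and then resizes again to a fixed size. Otherwise I do not see how it would work.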
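To make the size question concrete, the "largest side resized to 800 pixels" rule and its cost relative to a fixed 224 x 224 input can be worked out directly. This is only a sketch of the arithmetic; the helper names are hypothetical and nothing here reflects how the authors actually batch their images.

```python
def resize_longest_side(width, height, target=800):
    """New (width, height) with the longest side scaled to `target`,
    keeping the aspect ratio."""
    scale = target / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def pixel_ratio(size, reference=(224, 224)):
    """How many times more pixels `size` has than `reference`."""
    return (size[0] * size[1]) / (reference[0] * reference[1])


landscape = resize_longest_side(1024, 768)  # a 4:3 image
portrait = resize_longest_side(768, 1024)   # the same image rotated
ratio = pixel_ratio(landscape)              # cost relative to 224 x 224
```

Activation memory in a convolutional backbone grows roughly with the pixel count, so a 4:3 image at this scale costs on the order of ten times a 224 x 224 crop, and a triplet multiplies that by three streams. Landscape and portrait images also end up with different shapes, which is the batching difficulty raised in the notes.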
- Triplet loss works better than classification loss.
- The rigid grid performs worse, but the gap is small.
- Fine-tuning on the clean data works better.
- Also: QE, multi-resolution
- For compression, PQ works best, etc.
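On the earlier question of how AP is obtained from the good/ok/junk lists: in the usual Oxford-style protocol, the database is ranked for each query, junk images are skipped as if they were absent, and good + ok images count as positives. The sketch below follows that convention with a trapezoidal area under the precision-recall curve; the function name and the toy file names are made up for illustration, and it is not the official evaluation script.

```python
def average_precision(ranked, positives, junk):
    """AP for one query.

    ranked:    database image names, best match first
    positives: set of names from the good and ok lists
    junk:      set of names that are ignored in the ranking
    """
    old_recall, old_precision = 0.0, 1.0
    ap, hits, rank = 0.0, 0, 0
    for name in ranked:
        if name in junk:
            continue  # junk images neither help nor hurt
        if name in positives:
            hits += 1
        rank += 1
        recall = hits / len(positives)
        precision = hits / rank
        # Trapezoidal area between consecutive points of the PR curve
        ap += (recall - old_recall) * (old_precision + precision) / 2.0
        old_recall, old_precision = recall, precision
    return ap


ranking = ["a.jpg", "j.jpg", "b.jpg", "c.jpg"]
ap = average_precision(ranking, positives={"a.jpg", "c.jpg"}, junk={"j.jpg"})
```

The reported mAP is the mean of this value over all queries. For Oxf105k the same queries and ground truth are used, only with roughly 100,000 distractor images added to the ranked list, which is why the score drops but the protocol stays the same.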

chullhwan-song mentioned this issue Jul 16, 2018

CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples #19

Open

chullhwan-song added the Deep Image Feature label Aug 3, 2018

chullhwan-song mentioned this issue Jun 12, 2019

2018 Google Landmark Retrieval Challenge review #105

Open

chullhwan-song mentioned this issue Jun 21, 2019

REMAP: Multi-layer entropy-guided pooling of dense CNN features for image retrieval #145

Open

chullhwan-song mentioned this issue Dec 26, 2019

Regional Maximum Activations of Convolutions with Attention for Cross-domain Beauty and Personal Care Product Retrieval #270

Open
