DeViSE: A Deep Visual-Semantic Embedding Model #1

chullhwan-song · 2018-06-19T01:52:26Z

https://research.google.com/pubs/pub41869.html
Andrea Frome*, Greg S. Corrado*, Jonathon Shlens*, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, Tomas Mikolov
google

chullhwan-song · 2018-06-19T02:11:11Z

What ?

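Input: an image vs. the caption (text sentence) for that image.
Image features and their related text features are extracted by different algorithms (different spaces), so their similarity cannot be computed directly.
The goal, in other words, is to make that similarity computable.
To that end, the paper introduces an embedding technique based on deep learning (honestly, plain "NN" seems the closer term).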

Dataset

ImageNet, 1000 classes
- Being an early paper, it uses only the class names as text, each made up of 1 to 3 words.
- Later papers started to use datasets with 5 sentences per image, such as flickr3k.

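Model architecture

Very simple.
Initialized from two pre-trained models:
- text feature - word2vec
- image feature - AlexNet
A transform (affine transformation) maps the image feature to the same dimension as word2vec. Everyone does it this way, e.g. Google's image captioning.
On the text side, nothing is done beyond word2vec.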
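A minimal sketch of the projection step described above, assuming NumPy and toy dimensions. The names `project_image`, `VISUAL_DIM`, and `TEXT_DIM`, the random stand-in vectors, and the bias term `b` are all illustrative assumptions (the bias follows the "affine" wording in these notes; a purely linear layer would drop it). None of this is the paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

VISUAL_DIM = 8  # hypothetical: stands in for the visual model's top-layer size
TEXT_DIM = 4    # hypothetical: stands in for the word2vec dimension


def project_image(v, M, b):
    """Affine map from the visual feature space into the word-embedding space."""
    return M @ v + b


# Trainable parameters of the transformation layer (randomly initialised here).
M = rng.normal(scale=0.1, size=(TEXT_DIM, VISUAL_DIM))
b = np.zeros(TEXT_DIM)

v = rng.normal(size=VISUAL_DIM)  # stand-in for a pre-trained visual feature
t = rng.normal(size=TEXT_DIM)    # stand-in for a word2vec label vector

z = project_image(v, M, b)       # image embedding, now in the text space
score = float(t @ z)             # dot-product similarity between image and label
```

Once both vectors live in the same space, any similarity function could be applied; the dot product is used here because it is the one these notes mention for the loss.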
Similarity metric
- This seems to be the most important part.
- Loss computation: it looks essentially the same as metric learning.
  - used a combination of dot-product similarity and hinge rank loss
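  - hinge rank loss: $\text{loss}(\text{image}, \text{label}) = \sum_{j \neq \text{label}} \max\left[0,\ \text{margin} - \vec{t}_{\text{label}} M \vec{v}(\text{image}) + \vec{t}_j M \vec{v}(\text{image})\right]$
    - $\vec{v}(\text{image})$: a column vector denoting the output of the top layer of our core visual network for the given image
    - $M$ is the matrix of trainable parameters in the linear transformation layer
    - $\vec{t}_{\text{label}}$: a row vector denoting learned embedding vector for the provided text label
    - $\vec{t}_j$: the embeddings of other text terms
    - fixed margin of 0.1 was used in all experiments
    - In more detail,
      - $\vec{t}_{\text{label}}$ forms the positive pair with the embedded image feature (= the feature output by image > core visual model > transformation in the first figure above).
      - $\vec{t}_j$ is a negative text feature unrelated to the image.
        
        In other words, positive vs. negative text pairs are built around the image feature, and the hinge loss is computed over them.
      - The open question is the sum in front. How should it be computed?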
        
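        ref) Excerpt from Deep Fusion LSTMs for Text Semantic Matching
        
        The reason for including this equation: a typical triplet loss has the form m + pos_dis - neg_dis, not m - pos_dis + neg_dis. The signs flip here because DeViSE scores a pair with a dot-product similarity (larger is better), whereas a triplet loss uses a distance (smaller is better).
      - img_feat = $M \vec{v}(\text{image})$, the transformed image feature
      - positive pair = img_feat with $\vec{t}_{\text{label}}$, the embedding of the correct label
      - negative pair = img_feat with $\vec{t}_j$, the embedding of another text term. The difference between the two needs to be understood.
        
        Assume that training samples a batch of examples at each step, since that is the usual practice.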
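The hinge rank loss discussed above can be sketched as follows, assuming NumPy. The function names, the toy vectors, and the optional `neg_idx` argument (which restricts the sum to a sampled subset of negatives, one possible answer to "how should the sum be computed") are illustrative assumptions, not the paper's implementation. The second function shows the sign relationship with a distance-based triplet term.

```python
import numpy as np


def hinge_rank_loss(v, M, T, label, margin=0.1, neg_idx=None):
    """Sum of max(0, margin - t_label.M.v + t_j.M.v) over negative labels j.

    v: (D_v,) visual feature, M: (D_t, D_v) transformation matrix,
    T: (num_labels, D_t) label embeddings, label: index of the correct label.
    neg_idx: optional indices of sampled negatives (assumption; default: all).
    """
    scores = T @ (M @ v)  # dot-product similarity of the image to every label
    if neg_idx is None:
        neg_idx = np.arange(T.shape[0])
    neg_idx = np.asarray(neg_idx, dtype=int)
    neg_idx = neg_idx[neg_idx != label]  # the sum excludes j == label
    return float(np.maximum(0.0, margin - scores[label] + scores[neg_idx]).sum())


def triplet_term(pos_dis, neg_dis, margin=0.1):
    """Usual distance-based triplet term: max(0, m + pos_dis - neg_dis)."""
    return max(0.0, margin + pos_dis - neg_dis)


# Toy example: identity transform, three labels, label 0 is correct.
M = np.eye(2)
v = np.array([1.0, 0.0])
T = np.array([[1.0, 0.0], [0.95, 0.0], [0.0, 1.0]])

full_loss = hinge_rank_loss(v, M, T, label=0)                 # about 0.05
sampled_loss = hinge_rank_loss(v, M, T, label=0, neg_idx=[2])  # 0.0

# Treating distance as negative similarity turns the triplet term
# into the same expression as a single hinge rank term.
same_as_triplet = triplet_term(pos_dis=-1.0, neg_dis=-0.95)
```

Only label 1 (similarity 0.95) comes within the margin of the correct label (similarity 1.0), so it is the only term that contributes. Sampling only label 2 as a negative gives zero loss, which illustrates how much the choice of negatives matters.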

Experiments

Additional notes

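In practice, the objective above is no different from classification.
Moreover, according to the paper, it scores below the softmax baseline.
However, the contributions are:
- embedding two heterogeneous kinds of feature into a common space
- extending this to the notion of zero-shot learning
  - For example, because the images are labeled tiger shark, bull shark, and blue shark, a classification model cannot find them with the word "shark". With this paper's algorithm, they can naturally be found.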
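The "shark" example can be illustrated with a toy sketch, assuming NumPy. Every vector and file name below is made up for illustration: the 2-d vectors stand in for word2vec embeddings and for image features that have already been projected into the text space. The function `rank_images` is a hypothetical helper, not something from the paper.

```python
import numpy as np

# Hand-made 2-d stand-ins for word2vec vectors (hypothetical values).
word_vecs = {
    "tiger shark": np.array([0.9, 0.1]),
    "bull shark": np.array([0.8, 0.2]),
    "tabby cat": np.array([0.1, 0.9]),
    "shark": np.array([1.0, 0.0]),  # never used as a training label
}

# Stand-ins for projected image embeddings, assumed to have been trained
# to lie close to the word vector of their own label.
image_embs = {
    "img_tiger_shark.jpg": np.array([0.85, 0.15]),
    "img_bull_shark.jpg": np.array([0.75, 0.25]),
    "img_tabby_cat.jpg": np.array([0.05, 0.95]),
}


def rank_images(query, image_embs, word_vecs):
    """Rank images by dot-product similarity to the query word's vector."""
    q = word_vecs[query]
    scored = [(float(q @ emb), name) for name, emb in image_embs.items()]
    return [name for _, name in sorted(scored, reverse=True)]


ranking = rank_images("shark", image_embs, word_vecs)
```

A softmax classifier has no output unit for "shark", so it has nothing to query with. In the shared space the query works because the word vector for "shark" lies near the vectors of the specific shark labels.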

chullhwan-song added the Image_to_Text_Embedding label Jun 19, 2018

chullhwan-song mentioned this issue Jul 24, 2018

Learning Deep Structure-Preserving Image-Text Embeddings #26

Open

peternara mentioned this issue Apr 11, 2020

Learning Type-Aware Embeddings for Fashion Compatibility #360

Open
