Less is More: Learning Highlight Detection from Video Duration #113

chullhwan-song · 2019-03-06T03:39:47Z

https://arxiv.org/abs/1903.00859

chullhwan-song · 2019-03-08T03:00:21Z

Abstract

Personally, this is my first paper of this kind (video-related), so the review may be a bit rough.
See below.

Contribution

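Unsupervised video highlight detection - short videos (user-generated videos) taken from ordinary video data are used as-is > these are treated as a clear training signal.
- Rather than unsupervised.. it could arguably be called weakly supervised.. probably.
Proposes a new "video clip deep ranking framework" > it is robust to training with noisy labels.
Trains on a set one to two orders of magnitude larger than existing ones, and shows that this scale is very important for performance.
SOTA among unsupervised highlight detection methods on two public benchmark sets.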

Approach

A "domain-specific highlight detection" model trained from unlabeled video data
Collection method - large-scale hashtag video data for a domain

Large-scale Instagram Training Video

On how the data is collected~
Public videos with hashtags > large scale > Instagram...
Instagram
- short and eye-catching videos
- from 1 second up to about 1 minute

Learning Highlights from Video Duration

Introduces a ranking model that uses large-scale hashtagged videos and their durations to train a video highlight detector.
A video highlight can be defined as a short segment within a longer video that captures the user's attention and interest.
Our goal is
- to learn a function f(x) that predicts a highlight score for a temporal video segment given its feature x.
Then, for a new video, its segments are ranked by their predicted highlight scores.
Supervised regression solutions would learn f(x) from labeled training data. However, calibrating highlight scores collected from multiple human annotators is itself challenging, i.e., human-assigned scores are subjective, which seems to be the point.
Instead, highlight detection can be cast as a ranking problem learned from human-labeled/edited video-highlight pairs.
- Segments in the manually annotated highlight ought to score higher than those elsewhere in the original long video.
- However, such data is hard and expensive to collect.
To address this, we propose learning highlight detection from a very large collection of unlabeled videos.
- When people upload a short video they are selective about its content, whereas long videos are very likely to contain a mix of good and less interesting content.
- So video duration is used as the label information (the duration of videos as supervision signals).
- In other words, f(x) is trained with labels that give segments from short videos higher scores than segments from long videos.
- That said, long videos do not always deserve low scores. So a ranking model that can handle this noisy ranking data is devised.

Training data and loss

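D : the set of videos collected for a tag
D is split into three non-overlapping sets, D = {D_S, D_L, D_R} > S: short, L: long, R: rest
- S : 15 seconds or less, L : 45 seconds or more.
These videos are then cut into segments of uniform length (2 seconds in the paper). > the later steps seem to operate on this unit~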
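The split by duration and the 2-second segmentation can be sketched as follows. This is only an illustration under stated assumptions: the function names and the sample video ids are hypothetical, and dropping a trailing remainder shorter than one segment is my assumption, not something these notes confirm.

```python
from typing import Dict, List, Tuple

SHORT_MAX_SEC = 15.0  # from the notes: short videos are 15 s or less
LONG_MIN_SEC = 45.0   # from the notes: long videos are 45 s or more
SEGMENT_SEC = 2.0     # from the notes: uniform 2-second segments


def partition_by_duration(durations: Dict[str, float]) -> Dict[str, List[str]]:
    """Split video ids into short (D_S), long (D_L) and rest (D_R)."""
    groups: Dict[str, List[str]] = {"short": [], "long": [], "rest": []}
    for video_id, seconds in durations.items():
        if seconds <= SHORT_MAX_SEC:
            groups["short"].append(video_id)
        elif seconds >= LONG_MIN_SEC:
            groups["long"].append(video_id)
        else:
            groups["rest"].append(video_id)
    return groups


def split_into_segments(seconds: float,
                        segment_sec: float = SEGMENT_SEC) -> List[Tuple[float, float]]:
    """Return (start, end) times of consecutive full-length segments.

    Assumption: a trailing remainder shorter than segment_sec is dropped.
    """
    count = int(seconds // segment_sec)
    return [(k * segment_sec, (k + 1) * segment_sec) for k in range(count)]


# Hypothetical video ids and durations in seconds.
durations = {"a": 9.0, "b": 60.0, "c": 30.0}
groups = partition_by_duration(durations)
segments_a = split_into_segments(durations["a"])
```

Only segments from the short and long groups would then feed the ranking pairs; the rest group is left out of the comparison.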
Let s_i denote the i-th video segment; a feature x_i is extracted from s_i.
- How the feature is extracted is not known yet. ?
- A segment presumably spans multiple frames, so the question is how the features of multiple frames are composed (merged??)
So our goal is to rank video segments from shorter videos higher than those from longer videos.
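Build training pairs (s_i, s_j), with s_i from a short video and s_j from a long video; the collection of these pairs is denoted P.
ranking loss
- f is a CNN
- earlier ranking approaches compare segments inside and outside the true highlight region of the same video
- in contrast, our constraints span segments from distinct short and long videos.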
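The equation image for the ranking loss did not survive in these notes. A minimal sketch, assuming the standard pairwise hinge form max(0, 1 - f(x_i) + f(x_j)) summed over pairs, where x_i comes from a short video and x_j from a long one; the margin of 1, the function name, and the sample scores are assumptions for illustration.

```python
from typing import List, Tuple


def pairwise_ranking_loss(pairs: List[Tuple[float, float]],
                          margin: float = 1.0) -> float:
    """Sum of hinge terms max(0, margin - f(x_i) + f(x_j)) over all pairs.

    Each pair holds (score of a short-video segment, score of a long-video segment).
    """
    return sum(max(0.0, margin - score_short + score_long)
               for score_short, score_long in pairs)


# Hypothetical scores (f(x_i), f(x_j)) for three training pairs.
pairs = [(2.0, 0.5), (0.2, 0.1), (0.0, 1.0)]
loss = pairwise_ranking_loss(pairs)
```

The first pair is already ranked correctly with a margin, so it contributes nothing; the other two are penalized in proportion to how far they are from satisfying the margin.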

Learning from noisy pairs - I did not understand this part well

A short video is not necessarily a highlight, and a long video may well contain highlights. The hashtags themselves may also carry no meaningful information.
So only some subset of the pair set P carries valid ranking constraints.
- s_i is a highlight, s_j is a non-highlight
Ideally, the ranking model would learn only from the valid ranking constraints and ignore the rest. > How??
To make this possible without annotation, the work introduces binary latent variables w_ij
- indicating whether a ranking constraint is valid or not..
So the objective function becomes as follows..
- h : neural network
- |P| : total number of ranking constraints
- p : the anticipated proportion of ranking constraints that are valid
  - training with p = 0.8 tells the system that about 80% of the pairs are a priori expected
    to be valid.
  - noise level prior p
- h(x_i, x_j) gives the value of w_ij
  - Looking closely at Eq. (2), there are two equalities that involve w_ij
  - It is an NN that seems to take the two inputs x_i and x_j..?? > presumably explained later?
  - For now, I read it as replacing the binary variable described just above with this network output
  - Being an NN it is trainable > so it can be optimized > the aim seems to be to select training pairs through learning rather than through a hyper-parameter~ (see just below)
  - Benefits of doing this
    - the function f and the selection of training pairs are optimized together.
    - w_ij is conditioned on the input features, so it can learn whether a ranking constraint is valid as a function of the specific visual input.
    - relaxing w_ij to [0, 1] > we capture uncertainty about pair validity during training.
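The image of Eq. (2) is missing from these notes. A reconstruction consistent with the definitions listed above, assuming a pairwise hinge term with margin 1, writing the pair set as \mathcal{P} and the selection network as h (both symbols are filled in by me), would be:

```latex
\begin{aligned}
L(\mathcal{D}) &= \sum_{(s_i, s_j) \in \mathcal{P}} w_{ij}\, \max\bigl(0,\; 1 - f(x_i) + f(x_j)\bigr) \\
\text{s.t.}\quad & \sum_{(s_i, s_j) \in \mathcal{P}} w_{ij} = p\,|\mathcal{P}|, \qquad w_{ij} \in [0, 1], \\
& w_{ij} = h(x_i, x_j)
\end{aligned}
```

The two equalities on w_ij are the budget constraint (only a fraction p of the pairs is expected to be valid) and the parameterization of w_ij by the network h.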
final loss
- So the latent variables w_ij are parameterized and learned: they provide learned weights for the training samples.
- The objective function is redefined once more
  - the training set is split into groups, each containing n pairs,
  - within each group the w_ij sum to 1
  - the groups P_1, ..., P_m are a random split of the pair set P
  - i.e., there are m groups, each holding n pairs.
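The image of the final objective is also missing here. A reconstruction under the same assumptions (hinge term with margin 1, pair set \mathcal{P} split at random into groups \mathcal{P}_1, \dots, \mathcal{P}_m of n pairs each, \sigma_g a softmax taken over the pairs of group g; the symbol names are filled in by me) would be:

```latex
\begin{aligned}
L(\mathcal{D}) &= \sum_{g=1}^{m} \sum_{(s_i, s_j) \in \mathcal{P}_g} w_{ij}\, \max\bigl(0,\; 1 - f(x_i) + f(x_j)\bigr) \\
\text{s.t.}\quad & \sum_{(s_i, s_j) \in \mathcal{P}_g} w_{ij}
  = \sum_{(s_i, s_j) \in \mathcal{P}_g} \sigma_g\bigl(h(x_i, x_j)\bigr) = 1, \qquad w_{ij} \in [0, 1]
\end{aligned}
```

Because the weights in each group of n pairs sum to 1, on average 1 of every n pairs is treated as valid, which is where the prior p = 1/n comes from.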

σ_g is the softmax function defined within group g
With a group size of n, this upholds the label noise prior p, with p = 1/n, while allowing a differentiable loss for the selection function h
- A small n speeds up training, at the cost of mistakenly promoting some invalid pairs.
- Conversely, a large n is more selective but costs more training time.
- In actual training, n is fixed to 8
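The group-wise weighting can be sketched in plain Python. This is a sketch under assumptions, not the paper's code: the selection logits stand in for the outputs of h on each pair, the hinge margin of 1 is assumed, and the example uses a group of 2 pairs for brevity where the notes say n = 8.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax; outputs are positive and sum to 1."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def group_loss(selection_logits: Sequence[float],
               scores_short: Sequence[float],
               scores_long: Sequence[float],
               margin: float = 1.0) -> float:
    """Softmax-weighted hinge loss over one group of pairs.

    selection_logits: hypothetical outputs of h, one per pair in the group.
    scores_short / scores_long: f(x_i) and f(x_j) for the same pairs.
    """
    weights = softmax(selection_logits)  # w_ij, summing to 1 in the group
    hinges = [max(0.0, margin - fs + fl)
              for fs, fl in zip(scores_short, scores_long)]
    return sum(w * hg for w, hg in zip(weights, hinges))


# Hypothetical group of 2 pairs: the first satisfies the margin, the second violates it.
scores_short = [2.0, 0.0]
scores_long = [0.5, 1.0]
equal_weight_loss = group_loss([0.0, 0.0], scores_short, scores_long)
reweighted_loss = group_loss([5.0, -5.0], scores_short, scores_long)
```

With equal logits each pair gets weight 0.5; when the selection logits favor the first pair, the violating pair is almost ignored, which is how a pair judged invalid stops driving the ranking function.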

Network structure

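f and h are neural networks
- f : a fully-connected model with 3 hidden layers
  - not even a CNN..? that simple? > probably discussed below~
- h : 3 fully-connected layers + n-way softmax > Eq. (3)
The paper follows this with a figure of the overall network structure.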

Video segment feature representation

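Extract the feature x_i for segment s_i
- A frame is an image, and a segment holds several images, right? (the same question as above)
Here 3D convolution is applied to capture temporal information > new to me as well
So the backbone is ResNet-34, apparently built from 3D convolutions..
- pretrained on Kinetics > need to look at the reference (first time seeing it)
  - Question: does an actual video come with "kinetics" information? It should just be sequential frames..? Kinetics here is the name of a large-scale action-recognition video dataset used for pretraining, not an extra input signal.
- the feature is taken after pooling of the final convolution layer
- 512 dimensions
So there is no separate merging step and everything is handled at once.. > it suggests a way to use "sequential frames" without a step that merges them.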
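To make the "no separate merging step" point concrete: a 3D convolutional network outputs a feature map with a time axis, and pooling collapses time and space together. A minimal sketch, assuming global average pooling (the pooling type is my assumption; the notes only say "pooling") over a hypothetical, tiny 512-channel map.

```python
from typing import List


def global_average_pool(feature_map: List[List[List[List[float]]]]) -> List[float]:
    """Average a [channel][time][height][width] map down to one value per channel."""
    pooled = []
    for channel in feature_map:
        values = [v for frame in channel for row in frame for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Hypothetical final-layer map: 512 channels, 2 time steps, 2x2 spatial grid.
# Channel c is filled with the constant c so the result is easy to check.
feature_map = [[[[float(c)] * 2 for _ in range(2)] for _ in range(2)]
               for c in range(512)]
x_i = global_average_pool(feature_map)
```

The result is a single 512-dimensional vector per segment, with the temporal axis already folded in by the pooling.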

Experiments

chullhwan-song added Video Highlight Detection labels Mar 6, 2019

chullhwan-song closed this as completed Mar 6, 2019

chullhwan-song reopened this Mar 12, 2019
