Everybody Dance Now #100

chullhwan-song · 2019-02-25T01:47:00Z

https://arxiv.org/abs/1808.07371

chullhwan-song · 2019-02-26T01:09:52Z

Abstract

A motion transfer study built on the "do as I do" idea.
It treats the problem as per-frame image-to-image translation with spatio-temporal smoothing.
Pose detection serves as the intermediate representation between source and target, and the model learns a mapping from pose images to the target subject's appearance.
This setup is adapted for temporally coherent video generation, including realistic face synthesis.
Video example from the paper: https://www.youtube.com/watch?v=PCBTZh41Ris&feature=youtu.be

METHOD OVERVIEW

Given = {source person's video, target person's video}
The goal is to generate a video of the target person performing the same motion as in the source person's video, as in Fig. 1.
To achieve this, the method has 3 stages:
- pose detection -> produces the pose stick figures.
- global pose normalization -> accounts for differences in body shape and position between source and target.
- mapping from normalized pose stick figures to the target subject -> through adversarial training, the model learns to map "normalized pose stick figures" (presumably carrying the source's pose information?) to images of the target person.
Overall training process
- pose detector P
- frame y from the original target video
- generate the pose stick figure x, where x = P(y)
- During training, the (x, y) pairs are used to learn a generator G that maps a given pose stick figure x to a synthesized image of the target person.
- Two losses: the discriminator (D) loss and a reconstruction loss (dist) computed with a pre-trained VGG.
  - So G(x) is optimized toward the ground truth (target frame y).
  - D decides whether a pair is real (pose stick figure x, ground truth image y) or fake (pose stick figure x, model output G(x)).
Transfer setup: bottom part of Fig. 3
- As in training, the pose detector P extracts the pose from the source frame y′ to produce the pose stick figure x′.
- However, the source subject can appear at a different position or scale than the subjects in the target video (where they stand, smaller or larger in the frame...). (An expected problem.)
- To solve this:
  - The source pose has to be aligned with the target. That means transforming the source's original pose x′ so that it is consistent with the poses x in the target video. The paper applies a "global pose normalization Norm" for this.
  - The normalized pose stick figure x is then fed into the trained model G to obtain G(x), an image of the target person that corresponds to the original source image y′.
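
The two setups above differ only in where the pose comes from and whether Norm is applied. A minimal sketch of that data flow, assuming P, Norm and G are supplied as plain callables (the type aliases and function names here are hypothetical, not the paper's code):

```python
from typing import Callable, List, Tuple

Pose = List[Tuple[float, float]]   # hypothetical: one (x, y) per joint
Image = List[List[float]]          # hypothetical: grayscale image as nested lists


def make_training_pairs(target_frames: List[Image],
                        P: Callable[[Image], Pose]) -> List[Tuple[Pose, Image]]:
    """Training: x = P(y) for every target frame y, giving (x, y) pairs for G."""
    return [(P(y), y) for y in target_frames]


def transfer(source_frames: List[Image],
             P: Callable[[Image], Pose],
             norm: Callable[[Pose], Pose],
             G: Callable[[Pose], Image]) -> List[Image]:
    """Transfer: y' -> x' = P(y') -> x = Norm(x') -> G(x)."""
    return [G(norm(P(y_src))) for y_src in source_frames]
```

Note that G never sees the source person's pixels, only the normalized pose, which is why the stick figure works as the intermediate representation.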

POSE ESTIMATION AND NORMALIZATION

Pose estimation
- Uses a CPM-style algorithm -> OpenPose.
- A pose detector P has to be prepared in advance -> it detects poses as (x, y) joint coordinates...
- These are used to create the pose stick figure, as on the left of Fig. 2,
  - by plotting the keypoints and drawing lines between connected joints
- This is the input to G during training.
- During transfer, P estimates the source subject's pose, which is then normalized as described in the next section (Section 4.2).
  - The normalized pose coordinates are used to create the "pose stick figures" that are the input to G.
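
Plotting the keypoints and drawing lines between connected joints can be sketched as follows. The joint names and the tiny `LIMBS` skeleton are assumptions for illustration; OpenPose's real skeleton has more joints, and the paper's rendering is not specified in these notes.

```python
from typing import Dict, List, Tuple

# Hypothetical skeleton: pairs of joints that are connected by a line.
LIMBS: List[Tuple[str, str]] = [("head", "hip"), ("hip", "l_ankle"), ("hip", "r_ankle")]


def draw_stick_figure(keypoints: Dict[str, Tuple[int, int]],
                      height: int, width: int) -> List[List[int]]:
    """Rasterize keypoints given as (col, row) and their limb lines into a binary image."""
    canvas = [[0] * width for _ in range(height)]
    # Plot every detected keypoint that falls inside the canvas.
    for x, y in keypoints.values():
        if 0 <= y < height and 0 <= x < width:
            canvas[y][x] = 1
    # Draw a straight line for every limb whose two joints were detected.
    for a, b in LIMBS:
        if a not in keypoints or b not in keypoints:
            continue
        (x0, y0), (x1, y1) = keypoints[a], keypoints[b]
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for i in range(steps + 1):
            x = round(x0 + (x1 - x0) * i / steps)
            y = round(y0 + (y1 - y0) * i / steps)
            if 0 <= y < height and 0 <= x < width:
                canvas[y][x] = 1
    return canvas
```
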
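Global pose normalization
- Across videos, subjects (naturally) have different limb proportions, and the camera may be farther away or closer.
- So the two have to be matched to each other, as in the Transfer section of Figure 3.
  - When transferring motion between two subjects, the source person's pose keypoints must be transformed to match the target's body shape and proportions.
- To do this matching:
  - The subject's height and ankle positions are analyzed, and the transformation is found with a linear mapping between the closest and farthest ankle positions in both videos. (I am not sure about my translation here.)
    - It seems to mean: find reference points (ankle positions) in both videos and map them onto each other to get the transformation.
  - After gathering these statistics, the scale and translation are computed for each frame based on its corresponding pose detection.
- More details are in Section 9 (APPENDIX).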
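
One plausible reading of this linear mapping, written as a sketch rather than the paper's exact formula: the frame's ankle height is located between the source video's far and close ankle positions, the same relative position is taken in the target video, and the scale is interpolated between the height ratios at the two extremes. The `VideoStats` fields and the choice to scale around the ankle point are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixels, y grows downward


@dataclass
class VideoStats:
    """Hypothetical per-video statistics gathered from all pose detections."""
    ankle_far: float     # ankle y when the subject is farthest from the camera
    ankle_close: float   # ankle y when the subject is closest to the camera
    height_far: float    # subject height in pixels at the far position
    height_close: float  # subject height in pixels at the close position


def normalize_pose(pose: List[Point], ankle: Point,
                   src: VideoStats, tgt: VideoStats) -> List[Point]:
    """Scale and translate one source pose into the target video's geometry."""
    span = src.ankle_close - src.ankle_far
    # Relative depth of this frame: 0 at the far position, 1 at the close one.
    t = 0.0 if span == 0 else (ankle[1] - src.ankle_far) / span
    # Linear map of the ankle position into the target video's ankle range.
    new_ankle_y = tgt.ankle_far + t * (tgt.ankle_close - tgt.ankle_far)
    # Interpolate the target/source height ratio between far and close.
    scale_far = tgt.height_far / src.height_far
    scale_close = tgt.height_close / src.height_close
    scale = scale_far + t * (scale_close - scale_far)
    ax, ay = ankle
    # Scale around the ankle point, then move the ankle to its new height.
    return [(ax + scale * (x - ax), new_ankle_y + scale * (y - ay)) for x, y in pose]
```
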

ADVERSARIAL TRAINING OF IMAGE TO IMAGE TRANSLATION

Based on pix2pixHD, modified to account for two properties:
- temporally coherent video frames (temporal smoothing) and realistic face image synthesis.
Why add these two?
- Temporal smoothing - the model is applied to video rather than to independent images, so the current frame has to follow the previous frame smoothly, without visible jumps.
- Realistic face image - a Face GAN is added to make the face look more realistic. The face is one of the most important features; in a dance video, matching the facial expression from the original video onto the generated face should give a much better result.

pix2pixHD framework

multi-scale discriminators D = (D1, D2, D3)
L_FM(G, D) is the feature-matching loss, computed on D's features
L_P(G(x), y) is the perceptual reconstruction loss
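
For reference, the feature-matching term in pix2pixHD compares a discriminator's intermediate activations on the real pair and on the generated pair with an L1 distance, normalized per layer and summed over layers and scales. A minimal sketch, assuming each layer's activations have already been flattened into a list of floats (the function names are hypothetical):

```python
from typing import Sequence


def feature_matching_loss(real_feats: Sequence[Sequence[float]],
                          fake_feats: Sequence[Sequence[float]]) -> float:
    """Sum over layers of the mean absolute difference between activations."""
    total = 0.0
    for real, fake in zip(real_feats, fake_feats):
        total += sum(abs(r - f) for r, f in zip(real, fake)) / len(real)
    return total


def multiscale_feature_matching(real_by_scale: Sequence[Sequence[Sequence[float]]],
                                fake_by_scale: Sequence[Sequence[Sequence[float]]]) -> float:
    """Add up the per-discriminator losses, one entry per scale (D1, D2, D3)."""
    return sum(feature_matching_loss(r, f)
               for r, f in zip(real_by_scale, fake_by_scale))
```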

Temporal smoothing

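From G's side: the pose stick figure x_t for the current time t is concatenated with the previously generated frame G(x_{t-1}) and fed to G.
From D's side: (x_{t-1}, x_t, G(x_{t-1}), G(x_t)) -> fake, (x_{t-1}, x_t, y_{t-1}, y_t) -> real
- The four items in each tuple appear to be concatenated.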
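
The conditioning described above can be sketched with frames flattened to plain lists, where list concatenation stands in for channel-wise concatenation. Starting from a zero image before the first frame is an assumption of this sketch, and `G` is any callable supplied by the caller, not the paper's network.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[float]  # hypothetical: a frame flattened to a list of pixel values


def generate_sequence(poses: Sequence[Frame],
                      G: Callable[[Frame], Frame]) -> List[Frame]:
    """Run G frame by frame, feeding it the current pose plus the previous output."""
    if not poses:
        return []
    outputs: List[Frame] = []
    prev = [0.0] * len(poses[0])  # assumed: zero image before the first frame
    for x_t in poses:
        prev = G(x_t + prev)      # input is [x_t ; G(x_{t-1})]
        outputs.append(prev)
    return outputs


def discriminator_inputs(x_prev: Frame, x_t: Frame,
                         y_prev: Frame, y_t: Frame,
                         g_prev: Frame, g_t: Frame) -> Tuple[Frame, Frame]:
    """Build the concatenated real and fake inputs for two consecutive frames."""
    real = x_prev + x_t + y_prev + y_t
    fake = x_prev + x_t + g_prev + g_t
    return real, fake
```

Because D sees two consecutive frames at once, a generator that produces plausible single frames but inconsistent pairs can still be penalized.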

Face GAN

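Another GAN, used to make the face more realistic.
It takes only the region around the face as input, for both the image and the pose stick figure.
The face region of the pose stick figure (x_F) and the face region of G(x), written G(x)_F, are concatenated and passed to the face generator G_f, which outputs a residual
r = G_f(x_F, G(x)_F). Adding this residual back to the face region of G(x) gives G(x)_F + r.
So the face GAN loss uses
- (x_F, y_F) -> real
  - face region of the input pose stick figure, face region of the ground truth target person image
- (x_F, G(x)_F + r) -> fake
Based on pix2pixHD, but using only the global generator network...
70x70 Patch-GAN discriminator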
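
The residual composition can be sketched on nested-list images: crop the face region from both the pose stick figure and G(x), let G_f predict a residual, add it to the cropped face, and paste the result back. The square crop given by `top`, `left`, `size` and the helper names are assumptions; the crop is assumed to lie fully inside the image.

```python
from typing import Callable, List

Image = List[List[float]]


def crop(img: Image, top: int, left: int, size: int) -> Image:
    """Cut a size x size window out of img."""
    return [row[left:left + size] for row in img[top:top + size]]


def refine_face(full: Image, pose: Image, top: int, left: int, size: int,
                G_f: Callable[[Image, Image], Image]) -> Image:
    """Return a copy of G(x) whose face region is replaced by G(x)_F + r."""
    x_F = crop(pose, top, left, size)   # face region of the pose stick figure
    g_F = crop(full, top, left, size)   # face region of the generated image
    r = G_f(x_F, g_F)                   # residual predicted by the face generator
    out = [row[:] for row in full]      # leave the original G(x) untouched
    for i in range(size):
        for j in range(size):
            out[top + i][left + j] = g_F[i][j] + r[i][j]
    return out
```
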
The following figure shows experimental results when temporal smoothing and the face GAN are combined.

Final Full Objective

Eq. (5) corresponds to the original pix2pixHD loss in Eq. (1). Because it has to be applied to video rather than to a single frame, it is rewritten as in Eq. (5).
LSGAN ? - "Least squares generative adversarial networks"
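
For context on the LSGAN note: least-squares GANs replace the log-loss of the standard GAN with squared errors against target labels. A minimal sketch using the common 0/1 labels (fake = 0, real = 1); how exactly this enters the objective in these notes is not stated, so treat the wiring as an assumption.

```python
from typing import Sequence


def lsgan_d_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Discriminator: push scores on real inputs to 1 and on fake inputs to 0."""
    real_term = sum((s - 1.0) ** 2 for s in d_real) / len(d_real)
    fake_term = sum(s ** 2 for s in d_fake) / len(d_fake)
    return 0.5 * (real_term + fake_term)


def lsgan_g_loss(d_fake: Sequence[float]) -> float:
    """Generator: push the discriminator's scores on fake inputs to 1."""
    return 0.5 * sum((s - 1.0) ** 2 for s in d_fake) / len(d_fake)
```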

EXPERIMENT

chullhwan-song added Video Pose GAN labels Feb 25, 2019

chullhwan-song closed this as completed Feb 25, 2019

chullhwan-song reopened this Feb 26, 2019

chullhwan-song added the Image-to-Image Translation label Feb 26, 2019

chullhwan-song mentioned this issue Feb 27, 2019

Be Your Own Prada: Fashion Synthesis with Structural Coherence #92

Open
