High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs #101

Open

chullhwan-song opened this issue Feb 25, 2019 · 1 comment

Labels

GAN Image-to-Image Translation

Owner

chullhwan-song commented Feb 25, 2019

https://arxiv.org/abs/1711.11585

Owner Author

chullhwan-song commented Feb 25, 2019

Abstract

Note: the authors' slides (PPT) at the link are used as a reference.
conditional generative adversarial networks (conditional GANs)
semantic label (map) to realistic photo
- Takes a semantic segmentation image and synthesizes a realistic image for the label regions mapped in it.
Previous work produced results at much lower resolution.
This work addresses that problem: 2048 × 1024 output = pix2pixHD
- novel adversarial loss
- new multi-scale generator
- new discriminator architecture
- In addition, two extra features make interactive visual manipulation possible.
  - Object instance segmentation information is incorporated, so objects can be manipulated (removed, added, or changed to another category).
  - A method for generating diverse results from the same input is presented, so results can be edited interactively.

Instance-Level Image Synthesis

The pix2pix Baseline

The baseline is the paper "Image-to-image translation with conditional adversarial networks", also known as "pix2pix". For reference, the paper summarized here is pix2pixHD.
- That work proposed a conditional GAN for image-to-image translation.
  - pair set (s_i, x_i) : s_i is a semantic label map and x_i is the real photo that corresponds to s_i
  - G : U-Net based network
  - D : patch-based fully convolutional network (from the FCN paper?)
    - The input to D is the semantic label map concatenated with the corresponding image
      - The corresponding image is either the matching real image or the output of G. The accompanying figure in the original post makes this clear.
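The way D's input is assembled can be sketched in a few lines. This is only an illustration in plain Python lists: the one-hot encoding of the label map, the helper names `one_hot` and `discriminator_input`, and the toy sizes are assumptions, not something stated in the notes above.

```python
def one_hot(label_map, num_classes):
    """Turn an H x W map of class ids into num_classes binary H x W channels."""
    return [[[1.0 if v == c else 0.0 for v in row] for row in label_map]
            for c in range(num_classes)]


def discriminator_input(label_map, image_channels, num_classes):
    """Channel-wise concatenation: label channels first, then image channels."""
    return one_hot(label_map, num_classes) + image_channels


# Toy 2 x 2 example with 2 classes and a 3-channel image.
s = [[0, 1], [1, 0]]
x_real = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(3)]  # real photo x_i
x_fake = [[[0.1, 0.1], [0.1, 0.1]] for _ in range(3)]  # stand-in for G(s_i)

real_pair = discriminator_input(s, x_real, num_classes=2)  # D should call this real
fake_pair = discriminator_input(s, x_fake, num_classes=2)  # D should call this fake
```

Both pairs share the same label channels, so D can only separate them by judging whether the image part fits the label map.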
The problem with pix2pix is that it falls short at generating high-resolution images.

Improving Photorealism and Resolution

Coarse-to-fine generator
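- The term coarse-to-fine was used a lot in older work such as Viola-Jones face detection, and it is common in tracking too.
- In any case, for a generic target object it means finding it roughly first and then locating it in detail. Presumably the same applies here: translate roughly first, then do a more detailed translation afterwards?
- G consists of two sub-networks, G1 and G2
  - G1 : global network
    - based on "Perceptual losses for real-time style transfer and super-resolution"
    - low resolution (half size) : 1024 × 512
  - G2 : local enhancer network
    - high resolution : 2048 × 1024
    - G2 has two parts: the first part receives the same input as G1, and the second part receives the element-wise sum of the output of G1 and the output of the first part of G2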
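The data flow between G1 and the two parts of G2 can be sketched as below. The sub-networks are hypothetical stand-ins (2×2 average pooling, identity, nearest-neighbour upsampling) rather than the real convolutional layers, and it is assumed that G1 sees the label map at half resolution, as the 1024 × 512 figure suggests. Only the shapes and the element-wise sum are meant to be illustrative.

```python
def avg_pool_2x(img):
    """Halve height and width by averaging each 2x2 block."""
    h, w = len(img), len(img[0])
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
             for x in range(0, w - 1, 2)]
            for y in range(0, h - 1, 2)]


def upsample_2x(img):
    """Double height and width by nearest-neighbour repetition."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def elementwise_sum(a, b):
    assert len(a) == len(b) and len(a[0]) == len(b[0])
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def g1_global(label_half):
    # Stand-in for G1: identity on the half-resolution input.
    return [list(row) for row in label_half]


def g2_first_part(label_full):
    # Stand-in for the first part of G2: full resolution down to half resolution.
    return avg_pool_2x(label_full)


def g2_second_part(features_half):
    # Stand-in for the second part of G2: back up to full resolution.
    return upsample_2x(features_half)


def coarse_to_fine(label_full):
    g1_out = g1_global(avg_pool_2x(label_full))
    g2_feat = g2_first_part(label_full)
    fused = elementwise_sum(g1_out, g2_feat)  # the element-wise sum from the notes
    return g2_second_part(fused)


label = [[1.0] * 8 for _ in range(4)]  # toy 4 x 8 "label map"
output = coarse_to_fine(label)
```

The sum only works because both feature maps have the same (half) resolution, which is why the first part of G2 has to downsample before the two branches are fused.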
Multi-scale discriminators
- To tell high-resolution real images from synthesized ones, the discriminator needs a large receptive field.
  - This normally calls for a deeper network or a CNN with larger filters, both of which need a lot of memory.
- So this work uses 3 multi-scale discriminators: D1, D2, D3
  - trained to differentiate real and synthesized images at the 3 different scales
  - used together with the coarse-to-fine generator
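One way to picture the 3 scales is an image pyramid in which each discriminator receives a progressively downsampled copy of the same image. The sketch below assumes 2×2 average pooling for the downsampling and uses a hypothetical helper name, `build_pyramid`. The discriminators themselves are left out.

```python
def avg_pool_2x(img):
    """Halve height and width by averaging each 2x2 block."""
    h, w = len(img), len(img[0])
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
             for x in range(0, w - 1, 2)]
            for y in range(0, h - 1, 2)]


def build_pyramid(img, num_scales=3):
    """Return [full, 1/2, 1/4, ...] resolution copies of img."""
    pyramid = [img]
    for _ in range(num_scales - 1):
        pyramid.append(avg_pool_2x(pyramid[-1]))
    return pyramid


image = [[float(x + y) for x in range(8)] for y in range(4)]  # toy 4 x 8 image
scales = build_pyramid(image)
sizes = [(len(s), len(s[0])) for s in scales]
```

Here `scales[0]` would go to the finest discriminator and `scales[2]` to the coarsest. The coarsest copy lets a discriminator of the same depth cover a larger part of the original image, which is how the memory cost of a very deep network is avoided.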

Improved adversarial loss

Improves on Eq. 2.
feature matching loss
- feature extracted from the i-th layer of D_k : D_k^(i)
- T is the total number of layers
- N_i is the number of elements in each layer
GAN loss
- Similar to the loss used in VAE-GANs
- The discriminator feature matching loss is related to the perceptual loss, which is used for image super-resolution and style transfer
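As a rough sketch of the feature matching term for a single discriminator: for each of the T layers, take the L1 distance between the features of the real image and those of the synthesized image, divide by the number of elements in that layer, and sum over layers. Features are represented here as flat lists of floats, and the function name `feature_matching_loss` is made up for this example.

```python
def feature_matching_loss(real_feats, fake_feats):
    """Sum over layers of the mean absolute difference between features.

    real_feats, fake_feats: one flat list of floats per discriminator layer.
    """
    assert len(real_feats) == len(fake_feats)
    total = 0.0
    for real, fake in zip(real_feats, fake_feats):
        assert len(real) == len(fake)
        n_i = len(real)  # number of elements in this layer
        total += sum(abs(r - f) for r, f in zip(real, fake)) / n_i
    return total


# Two layers (T = 2) with 2 and 4 elements.
real = [[1.0, 2.0], [0.0, 0.0, 3.0, 1.0]]
fake = [[0.0, 2.0], [0.0, 2.0, 3.0, 1.0]]
loss = feature_matching_loss(real, fake)  # 1/2 + 2/4
```

With several discriminators the same quantity would be computed per discriminator and added up. Only the generator is pushed to reduce it, while the discriminators act as fixed feature extractors for this term.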

Using Instance Maps

Uses a semantic label map dataset, labeled at the pixel level.
- Plus boundary improvement: using the instance boundary map (Fig. 4 b) as well gives results like the following.
  - the channel-wise concatenation = instance boundary map + semantic label map + real/synthesized image
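A minimal sketch of how an instance boundary map can be derived from an instance ID map: a pixel is marked 1 if any of its 4 neighbours has a different object ID, and 0 otherwise. The function name and the toy map are made up for illustration.

```python
def instance_boundary_map(inst):
    """Return an H x W map with 1 where a 4-neighbour has a different instance id."""
    h, w = len(inst), len(inst[0])
    edge = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and inst[ny][nx] != inst[y][x]:
                    edge[y][x] = 1
                    break
    return edge


# Two objects side by side; they could well share one semantic class.
inst = [[1, 1, 2, 2],
        [1, 1, 2, 2]]
boundary = instance_boundary_map(inst)
```

The boundary map is what separates adjacent objects of the same class, which a semantic label map alone cannot do. It would then be stacked channel-wise with the semantic label map and the real or synthesized image.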

Multi-modal results using feature embedding

To generate diverse images and allow instance-level control,
- low-dimensional feature channels are used as input to G.
- an Encoder produces the low-dimensional features
- instance-wise average pooling layer : average pooling applied to the Encoder output
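Instance-wise average pooling can be sketched as follows: the Encoder output is averaged over all pixels that belong to the same instance, and that average is written back to every pixel of the instance. A single feature channel and the function name `instance_average_pool` are simplifying assumptions.

```python
def instance_average_pool(features, inst):
    """Replace each feature value by the mean over its instance.

    features: H x W list of floats (one encoder channel).
    inst:     H x W list of instance ids, same shape as features.
    """
    sums, counts = {}, {}
    for frow, irow in zip(features, inst):
        for f, i in zip(frow, irow):
            sums[i] = sums.get(i, 0.0) + f
            counts[i] = counts.get(i, 0) + 1
    means = {i: sums[i] / counts[i] for i in sums}
    return [[means[i] for i in irow] for irow in inst]


features = [[1.0, 3.0, 10.0],
            [2.0, 2.0, 20.0]]
inst = [[7, 7, 9],
        [7, 7, 9]]
pooled = instance_average_pool(features, inst)
```

Because every instance ends up with one feature value per channel, changing that value for a single instance is enough to change how that object is rendered, which is what makes instance-level control possible.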

Experiments

chullhwan-song mentioned this issue

Everybody Dance Now #100
