Spatial Transformer Network #2

chullhwan-song · 2018-06-20T01:05:28Z

chullhwan-song · 2018-06-20T01:22:59Z

What?

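After YOLO, this is another paper that gave me a "wow, that's clever" feeling.
STN starts from the question of whether a CNN is invariant to scale, rotation, and translation.
In practice, a CNN is not invariant to scale or rotation, and is only partially invariant to translation.
(That partial invariance comes from max pooling, i.e. downsampling.) To compensate, people often build an augmented dataset. (cf. SIFT?)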
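To make the "only partially invariant to translation" point concrete, here is a minimal sketch in plain Python. The helper names `max_pool_1d` and `shift_right` are made up for this illustration, and a 1D signal stands in for a feature map. A shift smaller than the pooling window can leave the pooled output unchanged, while a larger shift does not.

```python
def max_pool_1d(signal, window=2):
    """Non-overlapping max pooling (stride equals the window size)."""
    return [max(signal[i:i + window])
            for i in range(0, len(signal) - window + 1, window)]


def shift_right(signal, k):
    """Shift a signal right by k samples, padding with zeros on the left."""
    return [0] * k + signal[:len(signal) - k]


signal = [0, 0, 1, 0, 0, 0]

pooled = max_pool_1d(signal)                         # [0, 1, 0]
pooled_shift1 = max_pool_1d(shift_right(signal, 1))  # [0, 1, 0], unchanged
pooled_shift2 = max_pool_1d(shift_right(signal, 2))  # [0, 0, 1], changed
```

Whether a one-sample shift is absorbed depends on where the peak falls relative to the pooling window, which is why the invariance is only partial.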
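A CNN applies the same fixed operations over the whole feature map (globally) and cannot single out a local region. In contrast, an STN can select the "regions
that are most relevant (attention)" and can also "transform those regions".
Apart from that, it can be trained with back-propagation just like a CNN.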
benefit
- image classification
- co-localisation
- spatial attention
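It is a modular layer that can be added to an existing network.
So what exactly does it do? As an example, when inserted into an existing NN it can be seen as a network that provides spatial transformation dynamically.
- The figure for this example shows the result of training for MNIST digit classification with a spatial transformer placed directly in front of a fully-connected network.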

Localisation network takes the input feature map, and through a number of hidden layers
outputs parameters of spatial transformation.
Grid generator creates a sampling grid by using predicted transformation parameters.
Sampler takes feature map and the sampling grid as inputs, and produces the output map
sampled from the input at the grid points.

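The grid generator determines the locations in the input feature map at which to sample.
If Tθ is a 2D affine transform:
- An affine transform can express scale, rotation, translation, skew, and cropping with 6 parameters.
If Tθ is restricted to isotropic scaling (the same factor horizontally and vertically) plus translation, it acts as an attention model.
In short, as long as Tθ is differentiable with respect to its parameters, it can take other general forms as well, such as a projective transformation or a thin plate spline transformation.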
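The grid generation step can be sketched as follows. This is a minimal plain-Python illustration, not the paper's implementation: the function name `affine_grid` and the example matrices are assumptions made here. It uses normalised target coordinates in [-1, 1] and maps each target point (x_t, y_t) to a source point (x_s, y_s) through a 2x3 affine matrix.

```python
def affine_grid(theta, out_h, out_w):
    """Return an out_h x out_w grid of (x_s, y_s) source coordinates.

    theta is a 2x3 affine matrix [[a, b, tx], [c, d, ty]]. Target
    coordinates are normalised to span [-1, 1] on both axes.
    """
    def linspace(n):
        # n evenly spaced values from -1 to 1 (a single point maps to 0)
        return [0.0] if n == 1 else [-1.0 + 2.0 * i / (n - 1) for i in range(n)]

    (a, b, tx), (c, d, ty) = theta
    grid = []
    for y_t in linspace(out_h):
        row = []
        for x_t in linspace(out_w):
            # (x_s, y_s)^T = theta * (x_t, y_t, 1)^T
            row.append((a * x_t + b * y_t + tx, c * x_t + d * y_t + ty))
        grid.append(row)
    return grid


# Identity: the sampling grid covers the whole input.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Attention-style transform (illustrative values): isotropic scale 0.5
# plus a translation of (0.5, 0.5), so only one quadrant is sampled.
attention = [[0.5, 0.0, 0.5], [0.0, 0.5, 0.5]]

grid_id = affine_grid(identity, 3, 3)
grid_att = affine_grid(attention, 3, 3)
```

With the identity matrix the grid corners are (-1, -1) and (1, 1). With the attention-style matrix they shrink to (0, 0) and (1, 1), which amounts to cropping and zooming into one quadrant of the input.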

The sampler applies the sampling grid Tθ(G) to the input U to produce V.
For each pixel of V, the sampling grid Tθ(G) holds the location in U whose value is mapped to that pixel.
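A minimal sketch of the sampling step with bilinear interpolation, in plain Python on a single-channel 2D list. The function names are hypothetical. Two choices here are assumptions of this sketch rather than anything stated above: how normalised coordinates map to pixel indices (-1 and 1 land on the first and last pixel centres), and zero padding outside the input.

```python
import math


def bilinear_sample(U, x_s, y_s):
    """Sample the 2D list U at normalised coordinates (x_s, y_s) in [-1, 1].

    Neighbours that fall outside U contribute zero (assumed zero padding).
    """
    h, w = len(U), len(U[0])
    # Map normalised coordinates to pixel coordinates.
    x = (x_s + 1.0) * (w - 1) / 2.0
    y = (y_s + 1.0) * (h - 1) / 2.0
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    # Only the four nearest pixels can have non-zero bilinear weight.
    for n in (y0, y0 + 1):
        for m in (x0, x0 + 1):
            if 0 <= n < h and 0 <= m < w:
                weight = max(0.0, 1 - abs(x - m)) * max(0.0, 1 - abs(y - n))
                value += U[n][m] * weight
    return value


def sample_grid(U, grid):
    """Build V by sampling U at every (x_s, y_s) in the sampling grid."""
    return [[bilinear_sample(U, x_s, y_s) for (x_s, y_s) in row] for row in grid]


U = [[0.0, 1.0],
     [2.0, 3.0]]

center = bilinear_sample(U, 0.0, 0.0)   # average of all four pixels
corner = bilinear_sample(U, 1.0, 1.0)   # exactly the bottom-right pixel

# An identity sampling grid reproduces U.
identity_grid = [[(-1.0, -1.0), (1.0, -1.0)],
                 [(-1.0, 1.0), (1.0, 1.0)]]
V = sample_grid(U, identity_grid)
```

Because each output value is a weighted sum of input pixels with weights that depend piecewise-linearly on the sampling coordinates, gradients can flow both to U and to the grid.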
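During training, back-propagating the loss naturally requires the sampling to be differentiable with respect to both U and G. For bilinear interpolation, the partial derivative with respect to each can be written out explicitly.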
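For reference, the bilinear sampling formula and its partial derivatives can be written as below. The notation is an assumption following the usual STN convention: V_i^c is the i-th output pixel in channel c, U_{nm}^c is the input pixel at row n and column m, and (x_i^s, y_i^s) is the source coordinate from the sampling grid.

```latex
V_i^c = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \,
        \max(0,\, 1 - |x_i^s - m|)\, \max(0,\, 1 - |y_i^s - n|)

\frac{\partial V_i^c}{\partial U_{nm}^c}
      = \max(0,\, 1 - |x_i^s - m|)\, \max(0,\, 1 - |y_i^s - n|)

\frac{\partial V_i^c}{\partial x_i^s}
      = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \,
        \max(0,\, 1 - |y_i^s - n|)
        \begin{cases}
          0  & \text{if } |m - x_i^s| \ge 1 \\
          1  & \text{if } m \ge x_i^s \\
          -1 & \text{if } m < x_i^s
        \end{cases}
```

The derivative with respect to y_i^s has the same form with the roles of x and y (and of m and n) swapped. The kernel is not differentiable at integer coordinates, so these are sub-gradients there. From x_i^s and y_i^s the gradient continues through Tθ to the transformation parameters θ.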

A Spatial Transformer Network is a CNN into which a spatial transformer module, made up of a localisation network, a grid generator, and a sampler, has been inserted.
In theory, spatial transformer modules can be inserted at any point in a CNN, and in any number.
Placing the spatial transformer module directly at the input of the CNN is the most common arrangement.
Video reference: https://www.youtube.com/watch?time_continue=5&v=Ywv0Xi2-14Y

chullhwan-song · 2018-06-20T01:33:58Z

chullhwan-song added the Attention label Feb 15, 2019

chullhwan-song mentioned this issue Mar 21, 2019

Robust Scene Text Recognition with Automatic Rectification #124

Open