You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Video Object Detection with an Aligned Spatial-Temporal Memory
Architecture
Summary
Proposing 1. a novel Spatial-Temporal Memory module (STMM) (as the recurrent computation unit) to model long-term temoral appearance and motion dynamicis; 2. a novel MatchTrans module to align the Spatial-Temporal Memory (feature maps) across frames.
Assuming $F_t$ as the appearane feature for the current frame and $M^{\rightarrow}{t-1}$ as the the feature of all previous frames, the **STMM updates $M^{\rightarrow}{t}$ with the input $F_t$ and $M^{\rightarrow}{t-1}$**. Two STMMs are used to obtain feature maps from both directions and the final feature maps are the concatenation of $M^{\rightarrow}{t}$ and $M^{\leftarrow}{t}$. The MatchTrans module computes transformation coefficients $\Gamma$ for position (x,y) from $M{t-1}$ to $M^{'}{t-1}$ (' means matched to $F_t$) by measuing the similarity between $F_t(x,y)$ and $F{t-1}(x+i,y+j)$. The transformation coefficients are then used to synthesize $M^{'}{t-1}$ by interpolating the corresponding $M{t-1}$ feature vectors: $M^{'}{t-1}(x,y)=\sum{i,j \in {-k,...k}} \Gamma_{x,y}(i,j) \cdot M_{t-1}(x+i,y+j)$. Sequence length $T=7$ for training and and $T=11$ for testing. During testing, STMN detector and initial R-FCN detector detection results are ensembled.