We follow the evaluation standards of the video segmentation field. Let us clarify this from the following three aspects.
First, we evaluate $T-2$ frames out of every $T$ frames, as illustrated here.
Simply speaking, we remove the first and last frames during evaluation because some models operate on frame differences, much like optical flow models do. Since we do not know whether such models act in the forward or backward direction, we drop both the first frame and the last one.
On top of that, we train and infer on the same $T-2$ frames. So we do not provide the label at timestamp $t=1$, because that frame is never evaluated.
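For concreteness, here is a minimal sketch of that cropping, assuming per-frame predictions and ground-truth masks stacked along the time axis (the tensor names preds and gts are only for illustration, not from the codebase):

import torch

T, H, W = 10, 64, 64
preds = torch.rand(T, H, W)                  # hypothetical per-frame predictions
gts = (torch.rand(T, H, W) > 0.5).float()    # hypothetical ground-truth masks

# Keep only the middle T-2 frames: the first and last frames are not scored.
preds_eval, gts_eval = preds[1:-1], gts[1:-1]
print(preds_eval.shape)  # torch.Size([8, 64, 64])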
From the design-insight perspective, we attempt to explore the effectiveness of the anchor frame (at $t=1$), whose broadcasting ability is achieved via our global-local modules. This matching-based spatial-temporal modelling works well even though we provide no annotation for this frame.
Hope this information helps. Further discussions are welcome.
import torch

for idx, (img, label) in enumerate(zip(img_li, label_li)):
    if idx == 0:
        # Allocate buffers on the first frame: IMG holds all T frames,
        # LABEL holds only T-1 frames (no label is stored for frame 0).
        IMG = torch.zeros(len(img_li), *img.shape)
        LABEL = torch.zeros(len(img_li) - 1, *label.shape)
        IMG[idx, :, :, :] = img
    else:
        IMG[idx, :, :, :] = img
        # Labels are shifted by one: the label of frame idx goes to slot idx - 1.
        LABEL[idx - 1, :, :, :] = label
So, why does LABEL have fewer frames than IMG? I've been confused about this.
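If I follow the answer above, the alignment works out like this (toy shapes only, used here for illustration): IMG keeps all $T$ frames, LABEL starts at the second frame, and evaluation later drops the last frame as well.

import torch

T, C, H, W = 5, 3, 16, 16
IMG = torch.rand(T, C, H, W)         # all T frames
LABEL = torch.rand(T - 1, 1, H, W)   # labels for frames 1..T-1; frame 0 has none

# Frame idx pairs with LABEL[idx - 1], so IMG[1:] aligns element-wise with LABEL.
train_imgs, train_labels = IMG[1:], LABEL

# Evaluation additionally removes the last frame, leaving T-2 scored frames.
eval_imgs, eval_labels = IMG[1:-1], LABEL[:-1]
print(eval_imgs.shape, eval_labels.shape)  # (3, 3, 16, 16) (3, 1, 16, 16)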