Our implementation is based on CLVR's implementation. However, we found multiple issues with it. We tried to reach Shao-Hua Sun by email and on GitHub to discuss these issues, but received no reply.
In this document, we discuss a serious bug in CLVR's implementation. Then, we highlight other I/O and performance limitations.
The following image demonstrates the loss function bug: the two equations (Eq.1 and Eq.2) are not equivalent.
Mehdi et al. [1] employed Eq.1. This equation enforces that the features of two randomly chosen, different images differ. Eq.1 computes a single scalar distance (d \in R^1) between the two features and makes sure d > M.
In contrast, Eq.2 computes the distance separately for every dimension (d \in R^D) and makes sure d > M in every dimension, where D is the dimension of the feature f.
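The difference between the two variants can be illustrated with a minimal NumPy sketch. This is not the code from either repository: the function names `contrastive_scalar` and `contrastive_per_dim` are hypothetical, and the use of a squared difference, a hinge of the form max(0, M - d), and a sum over dimensions in the second variant are assumptions made for illustration.

```python
import numpy as np


def contrastive_scalar(f_x, f_y, margin):
    """Eq.1-style term (assumed form): one scalar distance per image pair."""
    d = np.sum((f_x - f_y) ** 2)      # d in R^1, squared L2 distance
    return max(0.0, margin - d)       # penalty vanishes once d >= margin


def contrastive_per_dim(f_x, f_y, margin):
    """Eq.2-style term (assumed form): one distance per feature dimension."""
    d = (f_x - f_y) ** 2              # d in R^D, one entry per dimension
    # Each dimension is penalised on its own until its entry reaches the margin.
    return float(np.sum(np.maximum(0.0, margin - d)))


# Two toy features that differ strongly in a single dimension.
f_x = np.array([3.0, 0.0, 0.0, 0.0])
f_y = np.array([0.0, 0.0, 0.0, 0.0])
margin = 4.0

print(contrastive_scalar(f_x, f_y, margin))   # 0.0  (d = 9 > M)
print(contrastive_per_dim(f_x, f_y, margin))  # 12.0 (three dimensions violate the margin)
```

In this toy case the scalar variant is already satisfied, while the per-dimension variant keeps pushing every coordinate apart, which is a much stricter constraint than the one described in [1].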
CLVR's repository uses Eq.2, while we follow Mehdi et al. [1] and use Eq.1. This fix is essential for training to converge.
[1] Representation Learning by Learning to Count