# Implementation of "Visual Relationship Detection with Language Priors"



In [1]:
%qtconsole

In [2]:
import tensorflow as tf
import numpy as np
import gensim

## Load and prepare data


Our dataset (Table 1) contains 5000 images with 100 object categories and 70 predicates. In total, the dataset contains 37,993 relationships with 6,672 relationship types and 24.25 predicates per object category. Some example relationships are shown in Figure 3. The distribution of relationships in our dataset highlights the long tail of infrequent relationships (Figure 3 (left)).


#### TODO

Sample ~5000 images and compare their statistics to the figures above.
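One way to approach this comparison is to compute the same summary statistics from a list of relationship triplets. The sketch below is not from the paper: `relationship_stats` is a hypothetical helper, triplets are assumed to be `(subject, predicate, object)` tuples of category names, and "predicates per object category" is assumed to mean the average number of distinct predicates each category takes part in.

```python
from collections import Counter, defaultdict

def relationship_stats(triplets):
    """Summarize a list of (subject, predicate, object) triplets."""
    triplets = list(triplets)
    type_counts = Counter(triplets)          # how often each relationship type occurs
    preds_per_obj = defaultdict(set)         # distinct predicates seen with each category
    for subj, pred, obj in triplets:
        preds_per_obj[subj].add(pred)
        preds_per_obj[obj].add(pred)
    avg_preds = sum(len(p) for p in preds_per_obj.values()) / len(preds_per_obj)
    return {
        "num_relationships": len(triplets),
        "num_relationship_types": len(type_counts),
        "avg_predicates_per_object": avg_preds,
        "type_counts": type_counts,
    }

# Toy data standing in for the sampled annotations.
toy = [
    ("man", "riding", "horse"),
    ("man", "riding", "horse"),
    ("man", "wearing", "hat"),
    ("car", "has", "wheel"),
]
stats = relationship_stats(toy)
```

Sorting `stats["type_counts"].most_common()` gives the frequency ranking needed to inspect the long tail of infrequent relationships.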

In [4]:
# Load the pretrained 300-dim word2vec vectors (requires the GoogleNews binary on disk).
M = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
word2vec = lambda t: M[t]

IOError: [Errno 2] No such file or directory: './GoogleNews-vectors-negative300.bin'

## Visual Appearance Module

![1](eqs/1.png)

where Θ is the parameter set {z_k, s_k}. z_k and s_k are the parameters learnt to convert our CNN features to relationship likelihoods, and k = 1, ..., K indexes the K predicates in our dataset. P_i(O_1) and P_j(O_2) are the CNN likelihoods of categorizing box O_1 as object category i and box O_2 as category j. CNN(O_1, O_2) is the CNN features extracted from the union of the O_1 and O_2 boxes.
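As a worked illustration of Eq 1, the sketch below scores every ⟨i, k, j⟩ triplet for one pair of boxes at once. It is a toy example under stated assumptions: `visual_scores` is a hypothetical name, the arrays are made-up numbers rather than real CNN outputs, and Z is assumed to hold one row z_k per predicate.

```python
import numpy as np

def visual_scores(p_o1, p_o2, cnn_feat, Z, S):
    """Score every <i, k, j> triplet for one pair of boxes (Eq 1).

    p_o1, p_o2 : (N,) object likelihoods for boxes O1 and O2
    cnn_feat   : (D,) CNN features of the union of the two boxes
    Z, S       : (K, D) weights and (K,) biases, one row per predicate
    Returns an (N, K, N) array indexed as [i, k, j].
    """
    pred = Z @ cnn_feat + S  # z_k^T CNN(O1, O2) + s_k for every k
    return np.einsum('i,k,j->ikj', p_o1, pred, p_o2)

# Toy example: 2 object categories, 2 predicates, 2-dim features.
p_o1 = np.array([0.5, 0.5])
p_o2 = np.array([0.6, 0.4])
cnn_feat = np.array([2.0, 3.0])
Z = np.array([[1.0, 0.0], [0.0, 2.0]])
S = np.array([0.5, -1.0])
scores = visual_scores(p_o1, p_o2, cnn_feat, Z, S)
```

Here the predicate term is `[2.5, 5.0]`, so for example `scores[0, 1, 1]` is `0.5 * 5.0 * 0.4`.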

In [None]:
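O = None  # TODO: object probabilities: get final layer vector of "probabilities" from VGG-16 standard
R = None  # TODO: get final layer of relationship CNN (use tensorflow . . .)

def V(i, j, k, Z, S, O, R):
    """Eq 1: visual score of predicate k between object categories i and j."""
    P_i = O[i]
    P_j = O[j]
    # z_k^T CNN(O1, O2) + s_k, where R holds the CNN features of the union box
    P_k = np.dot(Z[k], R) + S[k]
    return P_i * P_j * P_k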

## Language Module

![2](eqs/2.png)

Next, we concatenate the word2vec vectors of the two objects and transform them into the relationship vector space using a projection parameterized by W, which we learn. This projection represents how two objects interact with each other. We denote word2vec() as the function that converts a word to its 300-dimensional vector. The relationship projection function (shown in Figure 4) is defined in Eq 2.
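The projection can be tried out without the pretrained model by substituting small made-up embeddings. In this sketch `toy_vectors`, `project`, and the 3-dim vectors are assumptions for illustration only; W is assumed to have one row w_k per predicate and `b` one bias per predicate.

```python
import numpy as np

# Toy 3-dim embeddings standing in for the 300-dim word2vec vectors.
toy_vectors = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "horse": np.array([0.0, 1.0, 0.0]),
}

def project(word_i, word_j, W, b, vectors=toy_vectors):
    """Eq 2 for all K predicates: W [vec(word_i), vec(word_j)] + b."""
    x = np.concatenate([vectors[word_i], vectors[word_j]])  # shape (2 * dim,)
    return W @ x + b                                        # shape (K,)

W = np.array([[1.0] * 6, [0.5] * 6])  # K = 2 predicates, 2 * 3 input dims
b = np.array([0.0, -1.0])
scores = project("man", "horse", W, b)
```

Indexing the result with a predicate index k gives the scalar score f for the triplet ⟨i, k, j⟩.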

In [None]:
T = None  # TODO: T[index] = word, mapping each object category index to its word2vec token

def F(i, j, W, B, T):
    """
    Project relationship `R = <i,j>` to K-dim relationship space.
    """
    # np.concatenate takes a sequence of arrays; W has one row per predicate
    w2v = np.concatenate([word2vec(T[i]), word2vec(T[j])])
    return np.dot(W, w2v) + B

def f(i, j, k, W, B, T):
    """
    Project relationship `R = <i,j>` to scalar space for predicate k.
    """
    w2v = np.concatenate([word2vec(T[i]), word2vec(T[j])])
    return np.dot(W[k], w2v) + B[k]

#### Training Projection Function

![3](eqs/3.png)

where d(R, R′) is the sum of the cosine distances (in word2vec space [7]) between the two objects and the predicates of the two relationships R and R′.

We want the distance between ⟨man - riding - horse⟩ to be close to ⟨man - riding - cow⟩ but farther from ⟨car - has - wheel⟩. We formulate this by using a heuristic where the distance between two relationships is proportional to the word2vec distance between its component objects and predicate.
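The heuristic can be checked on toy data. The sketch below assumes d(R, R′) sums the cosine distances between corresponding words of the two triplets; the 2-dim vectors and the names `toy_vectors`, `cosine_distance`, and `relationship_distance` are made up for illustration and are not real word2vec output.

```python
import numpy as np

# Made-up 2-dim embeddings; "horse" and "cow" are deliberately close.
toy_vectors = {
    "man":    np.array([1.0, 0.0]),
    "car":    np.array([0.0, 1.0]),
    "riding": np.array([1.0, 0.2]),
    "has":    np.array([0.1, 1.0]),
    "horse":  np.array([0.9, 0.5]),
    "cow":    np.array([0.8, 0.6]),
    "wheel":  np.array([-0.2, 1.0]),
}

def cosine_distance(u, v):
    """1 - cosine similarity of two vectors."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def relationship_distance(r1, r2, vectors=toy_vectors):
    """d(R, R'): sum of cosine distances between corresponding words."""
    return sum(cosine_distance(vectors[a], vectors[b]) for a, b in zip(r1, r2))

riding_horse = ("man", "riding", "horse")
riding_cow = ("man", "riding", "cow")
car_wheel = ("car", "has", "wheel")

d_near = relationship_distance(riding_horse, riding_cow)
d_far = relationship_distance(riding_horse, car_wheel)
```

With these vectors `d_near` is much smaller than `d_far`, which is the ordering the heuristic is meant to produce.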

In [None]:
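class Relationship:
    def __init__(self, i, j, k):
        # i, j: object category indices; k: predicate index
        self.i, self.j, self.k = i, j, k
        self.r = (i, j, k)

def dist(R1, R2, W, B, O, T):
    """Eq 3 ratio: squared difference of f scores over the word2vec distance d(R1, R2)."""
    d_rel = f(*R1.r, W, B, T) - f(*R2.r, W, B, T)
    # Cosine distance (1 - similarity) between corresponding objects of R1 and R2.
    # TODO: add the predicate term once a predicate vocabulary is available.
    d_obj = (1 - M.similarity(T[R1.i], T[R2.i])) + \
            (1 - M.similarity(T[R1.j], T[R2.j]))
    return (d_rel ** 2) / d_obj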

![4](eqs/4.png)

To satisfy Eq 3, we randomly sample pairs of relationships (⟨R, R′⟩) and minimize the variance of the Eq 3 ratio across them. The sample number we use is 500K.
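The sampling step can be sketched with plain arrays before wiring it to the real model. Everything here is an assumption for illustration: `sample_pairs` and `projection_variance` are hypothetical names, `f_scores` stands for precomputed f values, and `dists` for a precomputed matrix of d(R, R′). The toy data is built so that the ratio is constant, which is the case the objective rewards.

```python
import numpy as np

def sample_pairs(num_items, num_pairs, rng):
    """Draw random index pairs and drop those where R and R' coincide."""
    pairs = rng.integers(0, num_items, size=(num_pairs, 2))
    return pairs[pairs[:, 0] != pairs[:, 1]]

def projection_variance(f_scores, dists, pairs):
    """Variance of (f(R) - f(R'))^2 / d(R, R') over the sampled pairs."""
    a, b = pairs[:, 0], pairs[:, 1]
    ratios = (f_scores[a] - f_scores[b]) ** 2 / dists[a, b]
    return ratios.var()

rng = np.random.default_rng(0)
idx = np.arange(5, dtype=float)
f_scores = idx.copy()                        # toy f values
dists = (idx[:, None] - idx[None, :]) ** 2   # toy distances, so every ratio equals 1
pairs = sample_pairs(5, 1000, rng)

k_constant = projection_variance(f_scores, dists, pairs)
k_varying = projection_variance(idx ** 2, dists, pairs)
```

`k_constant` is zero because every sampled ratio equals 1, while `k_varying` is positive because the ratio changes from pair to pair.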

In [None]:
N = 100  # number of object categories
K = 70   # number of predicates
num_samples = 500000

from numpy.random import randint

def sample_R(num_samples=500000):
    """Draw `num_samples` random pairs of relationships."""
    R_rand = lambda: Relationship(randint(N), randint(N), randint(K))
    return [(R_rand(), R_rand()) for n in range(num_samples)]


R_pairs = sample_R(num_samples)

def K_fun(R_pairs, W, B, O, T):
    """Eq 4: variance of the Eq 3 ratio over the sampled pairs."""
    D = [dist(R1, R2, W, B, O, T) for R1, R2 in R_pairs]
    return np.var(D)


#### Likelihood of a Relationship

![5](eqs/5.png)


The output of our projection function should ideally indicate the likelihood of a visual relationship. For example, our model should not assign a high likelihood score to a relationship like ⟨dog - drive - car⟩, which is unlikely to occur. We model this by enforcing that if R occurs more frequently than R′ in our training data, then it should have a higher likelihood of occurring again. 

Minimizing this objective enforces that a relationship with a lower likelihood of occurring has a lower f() score.
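A minimal sketch of this ranking constraint, assuming training-set frequencies are available as a dictionary of counts. The name `likelihood_loss`, the string keys, and the toy numbers are assumptions for illustration; the hinge margin of 1 follows the `+ 1` used in the notebook code.

```python
def likelihood_loss(f_scores, counts, pairs):
    """Hinge penalty whenever the rarer relationship is not scored at least 1 lower."""
    loss = 0.0
    for a, b in pairs:
        if counts[a] == counts[b]:
            continue  # no ordering constraint between equally frequent relationships
        frequent, rare = (a, b) if counts[a] > counts[b] else (b, a)
        loss += max(f_scores[rare] - f_scores[frequent] + 1.0, 0.0)
    return loss

counts = {"man-riding-horse": 10, "dog-drive-car": 0}
pairs = [("man-riding-horse", "dog-drive-car")]

# Scores that respect the frequency ordering incur no loss ...
good = likelihood_loss({"man-riding-horse": 3.0, "dog-drive-car": 1.0}, counts, pairs)
# ... while scores that invert it are penalized.
bad = likelihood_loss({"man-riding-horse": 1.0, "dog-drive-car": 3.0}, counts, pairs)
```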

In [None]:
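def L(R_pairs, W, B):
    """Eq 5: hinge loss that ranks the more frequent relationship higher."""
    val = 0.0
    for R1, R2 in R_pairs:
        # Assumes each pair is ordered so that R2 occurs more often than R1 in the training data.
        val += max(f(*R1.r, W, B, T) - f(*R2.r, W, B, T) + 1, 0)
    return val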

![6](eqs/6.png)

The visual appearance module (V()) and the language module (f()) are combined to maximize the rank of the ground truth relationship R with bounding boxes O1 and O2 using the following rank loss function C.
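The rank loss for a single ground truth relationship can be sketched as a hinge on combined scores. This is a simplification under stated assumptions: `rank_loss` is a hypothetical name, `scores` is assumed to already hold the product V() * f() for each candidate, and every candidate other than the ground truth is treated as a competitor.

```python
def rank_loss(scores, gt_index):
    """Hinge rank loss for one ground truth relationship.

    scores   : list of V() * f() values, one per candidate relationship
    gt_index : position of the ground truth relationship in `scores`
    """
    others = [s for n, s in enumerate(scores) if n != gt_index]
    best_other = max(others, default=0.0)  # strongest competing candidate
    return max(1.0 - scores[gt_index] + best_other, 0.0)

scores = [2.0, 0.5, 0.2]
loss_when_gt_ranked_first = rank_loss(scores, 0)   # margin satisfied
loss_when_gt_ranked_second = rank_loss(scores, 1)  # margin violated
```

Summing `rank_loss` over all ground truth relationships gives the total C.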

In [None]:
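def C(Z, S, W, B):
    """Eq 6: rank loss over the sampled relationships."""
    Rs = [rel for R_pair in R_pairs for rel in R_pair]
    # R2 competes with R1 when its object pair and its predicate both differ.
    diff_R = lambda R1, R2: ((R1.i != R2.i) or (R1.j != R2.j)) and (R1.k != R2.k)
    cs = []

    for R1 in Rs:
        vis  = V(*R1.r, Z, S, O, R)
        lang = f(*R1.r, W, B, T)
        # Highest-scoring competitor; 0 if there is none.
        c = max((V(*R2.r, Z, S, O, R) * f(*R2.r, W, B, T)
                 for R2 in Rs if diff_R(R1, R2)), default=0.0)
        cs.append(max(1 - vis * lang + c, 0))

    return sum(cs)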

![7](eqs/7.png)


We use a ranking loss function to make it more likely for our model to choose the correct relationship. Given the large number of possible relationships, we find that a classification loss performs worse. Therefore, our final objective function combines Eq 6 with Eqs 4 and 5 as Eq 7.

In [None]:
lamb1 = 0.05
lamb2 = 0.001

# Eq 7: minimize `loss` over Z, S, W, B
loss = C(Z, S, W, B) + (lamb1 * K_fun(R_pairs, W, B, O, T)) + (lamb2 * L(R_pairs, W, B))



![8](eqs/8.png)

In [None]:
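# Load the pretrained 300-dim word2vec vectors (requires the GoogleNews binary on disk).
M = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
word2vec = lambda t: M[t]

# Visual appearance module
# ---------------

O = None  # TODO: object probabilities: get final layer vector of "probabilities" from VGG-16 standard
R = None  # TODO: get final layer of relationship CNN (use tensorflow . . .)

def V(i, j, k, Z, S, O, R):
    """Eq 1: visual score of predicate k between object categories i and j."""
    P_i = O[i]
    P_j = O[j]
    # z_k^T CNN(O1, O2) + s_k, where R holds the CNN features of the union box
    P_k = np.dot(Z[k], R) + S[k]
    return P_i * P_j * P_k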

# Language module
# ===============

T = None  # TODO: T[index] = word, mapping each object category index to its word2vec token

def F(i, j, W, B, T):
    """
    Project relationship `R = <i,j>` to K-dim relationship space.
    """
    # np.concatenate takes a sequence of arrays; W has one row per predicate
    w2v = np.concatenate([word2vec(T[i]), word2vec(T[j])])
    return np.dot(W, w2v) + B

def f(i, j, k, W, B, T):
    """
    Project relationship `R = <i,j>` to scalar space for predicate k.
    """
    w2v = np.concatenate([word2vec(T[i]), word2vec(T[j])])
    return np.dot(W[k], w2v) + B[k]

# Training projection function
# ----------------------------

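class Relationship:
    def __init__(self, i, j, k):
        # i, j: object category indices; k: predicate index
        self.i, self.j, self.k = i, j, k
        self.r = (i, j, k)

def dist(R1, R2, W, B, O, T):
    """Eq 3 ratio: squared difference of f scores over the word2vec distance d(R1, R2)."""
    d_rel = f(*R1.r, W, B, T) - f(*R2.r, W, B, T)
    # Cosine distance (1 - similarity) between corresponding objects of R1 and R2.
    # TODO: add the predicate term once a predicate vocabulary is available.
    d_obj = (1 - M.similarity(T[R1.i], T[R2.i])) + \
            (1 - M.similarity(T[R1.j], T[R2.j]))
    return (d_rel ** 2) / d_obj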

N = 100  # number of object categories
K = 70   # number of predicates
num_samples = 500000

from numpy.random import randint

def sample_R(num_samples=500000):
    """Draw `num_samples` random pairs of relationships."""
    R_rand = lambda: Relationship(randint(N), randint(N), randint(K))
    return [(R_rand(), R_rand()) for n in range(num_samples)]


R_pairs = sample_R(num_samples)

def K_fun(R_pairs, W, B, O, T):
    """Eq 4: variance of the Eq 3 ratio over the sampled pairs."""
    D = [dist(R1, R2, W, B, O, T) for R1, R2 in R_pairs]
    return np.var(D)

# Likelihood of a relationship
# ----------------------------

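def L(R_pairs, W, B):
    """Eq 5: hinge loss that ranks the more frequent relationship higher."""
    val = 0.0
    for R1, R2 in R_pairs:
        # Assumes each pair is ordered so that R2 occurs more often than R1 in the training data.
        val += max(f(*R1.r, W, B, T) - f(*R2.r, W, B, T) + 1, 0)
    return val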
        
# Rank loss function
# ------------------

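def C(Z, S, W, B):
    """Eq 6: rank loss over the sampled relationships."""
    Rs = [rel for R_pair in R_pairs for rel in R_pair]
    # R2 competes with R1 when its object pair and its predicate both differ.
    diff_R = lambda R1, R2: ((R1.i != R2.i) or (R1.j != R2.j)) and (R1.k != R2.k)
    cs = []

    for R1 in Rs:
        vis  = V(*R1.r, Z, S, O, R)
        lang = f(*R1.r, W, B, T)
        # Highest-scoring competitor; 0 if there is none.
        c = max((V(*R2.r, Z, S, O, R) * f(*R2.r, W, B, T)
                 for R2 in Rs if diff_R(R1, R2)), default=0.0)
        cs.append(max(1 - vis * lang + c, 0))

    return sum(cs)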


# Final objective function
# ------------------------

lamb1 = 0.05
lamb2 = 0.001

# Eq 7: minimize `loss` over Z, S, W, B
loss = C(Z, S, W, B) + (lamb1 * K_fun(R_pairs, W, B, O, T)) + (lamb2 * L(R_pairs, W, B))



In [None]:
# Load the pretrained 300-dim word2vec vectors (requires the GoogleNews binary on disk).
M = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
word2vec = lambda t: M[t]

from numpy.random import randint


class Model:
    
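    def __init__(self, TODO):
        self.T = None   # TODO: T['word'] = word2vec index   (maybe do reverse also?)
        self.W = None   # language projection weights (Eq 2)
        self.B = None   # language projection biases (Eq 2)
        self.Z = None   # visual module weights z_k (Eq 1)
        self.S = None   # visual module biases s_k (Eq 1)
        self.w2v = None    # word2vec model
        self.N = 100    # number of object categories
        self.K = 70     # number of predicates
        self.num_samples = 500000
        self.lamb1 = 0.05
        self.lamb2 = 0.001
        self.batch_size = None  # TODO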


    # Visual appearance module
    # ------------------------

    def V(self, i, j, k, P_O1, P_O2, P_R):
        # P_O1, P_O2: object "probabilities" for the two boxes (TODO: final layer of standard VGG-16)
        # P_R: features of the union box (TODO: final layer of relationship CNN, use tensorflow . . .)
        P_i = P_O1[i]
        P_j = P_O2[j]
        P_k = np.dot(self.Z[k], P_R) + self.S[k]
        return P_i * P_j * P_k
    
    def Vs(self, i, j, img_idx):
        """
        Get 70-vector for all relationships for a given i,j.
        
        """
        # self.P_O and self.P_R are filled in by V_all
        P_i = self.P_O[img_idx, i]
        P_j = self.P_O[img_idx, j]
        P_r = self.P_R[img_idx, :]
        P = np.dot(self.Z, P_r) + self.S
        return P_i * P_j * P
    
    def V_all(self, img_ids, scene_graphs):
        """
        Compute V for a list, with an arbitrary sized set of relation triplets for each.
        
        """
        
#         if len(img_ids) != len(img_Rs) != len(img_FRs):
#             print 'Error: number of relationships != number of images'; return
        
        
        # TODO: select only specific objects/relationships/scene graphs
        objss = [scene_graph.objects for scene_graph in scene_graphs]
        relss = [scene_graph.relationships for scene_graph in scene_graphs]
        obj_coords = [[(obj.x, obj.y, obj.w, obj.h) for obj in objs] for objs in objss]
        rel_coords = [[(rel.x, rel.y, rel.w, rel.h) for rel in rels] for rels in relss]
        
        

        self.P_O = self.run_images(img_ids, obj_coords)
        self.P_R = self.run_images(img_ids, rel_coords)

        
        
        v_batch = {}    
        for img_idx, scene_graph in enumerate(scene_graphs):
            img_id = scene_graph.image.id  ## TODO: will this work?
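            P_o1 = self.P_O[img_idx, :]
            P_o2 = self.P_O[img_idx, :]
            P_r  = self.P_R[img_idx, :]
            # Convert each relationship to an (i, j, k) triplet via get_R (still a TODO).
            Rs = [self.get_R(rel) for rel in scene_graph.relationships]
            v_batch[img_id] = {R: self.V(*R, P_o1, P_o2, P_r) for R in Rs}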
            
            
#             ks = [k for i,j,k in img_Rs]
#             v = {(i,j): self.Vs(i, j, P_O1, P_O2, P_R)[ks] for i,j,k in img_Rs[img_idx]}
#             v_batch[img_id] = v

                
        return v_batch
    
    
    def run_images(self, img_ids, coords):
        """
        Load images, run cnn on each list of crop coords for each.
        
        """
        # TODO: loop over img_ids, crop each image at its coords, and run the CNN
        raise NotImplementedError
    
    
    def get_R(self, rel):
        """
        TODO
        ====
        - have this turn a vg Relationship into a triplet (i,j,k)
        - have a `self.dictionary` 
        
        """
    
    def get_cnn_prob(self, cnn, img_ids, layer='prob'):
        ims = self.TODO_load_images(img_ids)
        
        cnn.feed({'data':ims})
        cnn.run()
        prob = cnn.get_layer_TODO(layer)
        return prob
                
        
                
            
            
    

    # Language module
    # ===============

    def F(self, i, j):
        """
        Project relationship `R = <i,j>` to K-dim relationship space.
        """
        # np.concatenate takes a sequence of arrays; W has one row per predicate
        word2vec = np.concatenate([self.w2v[self.T[i]], self.w2v[self.T[j]]])
        return np.dot(self.W, word2vec) + self.B

    def f(self, i, j, k):
        """
        Project relationship `R = <i,j>` to scalar space for predicate k.
        """
        word2vec = np.concatenate([self.w2v[self.T[i]], self.w2v[self.T[j]]])
        return np.dot(self.W[k], word2vec) + self.B[k]

    # Training projection function
    # ----------------------------

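    def dist(self, R1, R2):
        """
        Distance between two predicate triplets.
        
        """
        d_rel = self.f(*R1) - self.f(*R2)
        # Cosine distance (1 - similarity) between corresponding objects of R1 and R2.
        # TODO: add the predicate term once a predicate vocabulary is available.
        d_obj = (1 - self.w2v.similarity(self.T[R1[0]], self.T[R2[0]])) + \
                (1 - self.w2v.similarity(self.T[R1[1]], self.T[R2[1]]))
        return (d_rel ** 2) / d_obj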

    def sample_R(self):
        """
        Draw a number of random (i,j,k) indices: 2 objects and 1 relationship
        
        """
        R_rand = lambda: (randint(self.N), randint(self.N), randint(self.K))
        R_pairs = [(R_rand(), R_rand()) for n in range(self.num_samples)]
        return R_pairs

    def K_fun(self):
        """
        Eq (4): randomly sample relationship pairs and minimize variance.
        
        """
        R_samples = self.sample_R()

        D = []
        for R1, R2 in R_samples:
            d = self.dist(R1, R2)
            D.append(d)
        return np.var(D)

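    def L(self, Rs):
        """
        Likelihood of the sampled relationships
        
        """
        # TODO: restrict to pairs where R2 occurs more often than R1 in the training data (Eq 5)
        val = 0.0
        for R1 in Rs:
            for R2 in Rs:
                val += max(self.f(*R1) - self.f(*R2) + 1, 0)
        return val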

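    def C(self, Rs, TODO_IMG):
        """
        Rank loss function
        
        Rs : list of relationships for training data
        
        """
        # TODO: self.V also needs the per-image CNN outputs (P_O1, P_O2, P_R) from TODO_IMG
        cs = []
        for R1 in Rs:
            # Highest-scoring competitor whose predicate and object pair both differ from R1
            c = max((self.V(*R2) * self.f(*R2) for R2 in Rs if
                     (R1[2] != R2[2]) and ((R1[0] != R2[0]) or (R1[1] != R2[1]))),
                    default=0.0)
            c = max(1 - self.V(*R1) * self.f(*R1) + c, 0)
            cs.append(c)
        return sum(cs)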

    def objective(self, Rs, TODO_IMG):
        """
        Final objective function.
        
        """
        loss = self.C(Rs, TODO_IMG) +  \
               (self.lamb1 * self.K_fun()) +  \
               (self.lamb2 * self.L(Rs))
        return loss
                
    def SGD(self, TODO):
        """
        Perform SGD over eqs 5 (L) 6 (C)
        """

