# Embeddings from a pretrained model (ResNet50 by default)

The authors start from a scene image $I_s$ and a product image $I_p$. They push each image through a ResNet50 pretrained on ImageNet and take the output of the final pooling layer, which can be interpreted as an embedding vector (they call it the visual feature; this is what our global similarity is going to be based on). This gives us $\textbf{v}_s \in R^{d_1}$ for the scene image and, in the same way, $\textbf{v}_p \in R^{d_1}$ for the product image.
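A minimal sketch of that extraction step with the ResNet50 that ships with Keras. The helper name `build_global_extractor`, the 224x224 input size and the random stand-in images are my own assumptions, not something taken from the paper. The sketch builds the network with `weights=None` so it runs without a download; passing `weights="imagenet"` should give the pretrained version described above.

```python
import numpy as np
from tensorflow import keras


def build_global_extractor(weights=None, input_shape=(224, 224, 3)):
    """ResNet50 without the classification head.

    pooling="avg" makes the model return the globally average-pooled
    final feature map, i.e. one vector per image.
    """
    return keras.applications.ResNet50(
        include_top=False,
        weights=weights,  # assumption: use "imagenet" for the pretrained weights
        input_shape=input_shape,
        pooling="avg",
    )


extractor = build_global_extractor()

# Stand-in batch of two random "images" with pixel values in [0, 255].
images = np.random.uniform(0, 255, size=(2, 224, 224, 3)).astype("float32")
images = keras.applications.resnet50.preprocess_input(images)

v = np.asarray(extractor(images, training=False))
print(v.shape)  # (2, 2048), so d_1 = 2048 for ResNet50
```

The same extractor would be applied to both the scene and the product images to get $\textbf{v}_s$ and $\textbf{v}_p$.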

For the local embeddings they follow a similar procedure but take the output of one of the intermediate convolutional layers. A convolutional feature map preserves the spatial layout of the image, so each position in the map corresponds to a region of the image: for example, the lower-left position holds all the features computed from the lower-left corner. That gives us the local feature maps $m_i \in R^{d_2}$, where $i$ denotes the location in the picture.
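To make the indexing concrete, here is a small NumPy sketch of how a feature map splits into one vector per location. The sizes `H`, `W`, `d_2` and the helper `location_index` are made up for illustration; the real values depend on which intermediate layer is used.

```python
import numpy as np

# Assumed sizes for illustration only: a 7x7 feature map with 256 channels.
H, W, d_2 = 7, 7, 256
feature_map = np.random.rand(H, W, d_2).astype("float32")

# One d_2-dimensional vector m_i per spatial location, in row-major order.
local_features = feature_map.reshape(H * W, d_2)


def location_index(row, col, width=W):
    """Map a (row, col) position in the feature map to the flat index i."""
    return row * width + col


# Lower-left corner of the image: last row, first column.
i = location_index(H - 1, 0)
m_i = local_features[i]
print(local_features.shape, i, m_i.shape)  # (49, 256) 42 (256,)
```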

All these embeddings are passed through a 2-layer MLP $g(\Theta;\cdot)$ that maps them into the same metric space, so that a distance can be computed between them. The MLP params seem to be the same for the product and scene embeddings ($\Theta_g$) but different for the location-specific ones ($\Theta_l$, $\Theta_l'$). With it we get the final embeddings $\textbf{f}_s = g(\Theta_g;\textbf{v}_s)$, $\textbf{f}_p = g(\Theta_g;\textbf{v}_p)$, $\textbf{f}_i = g(\Theta_l;m_i)$, $\hat{\textbf{f}}_i' = g(\Theta_l';m_i')$.

_The authors mention applying an l2 norm penalty; this is the architecture for g(): Linear-BN-ReLU-Dropout-Linear-L2Norm_

In section 4.1 they mention unit normalization of the embeddings, so what they mean by l2 norm is presumably what the UnitNormalization layer from Keras does: it rescales each vector so that its l2 norm is 1.
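As a sanity check on what unit normalization means here, a plain NumPy sketch: each vector is divided by its l2 norm. The function `unit_normalize` and its `eps` guard are my own illustration, not the Keras implementation.

```python
import numpy as np


def unit_normalize(x, axis=-1, eps=1e-12):
    """Rescale vectors along `axis` to unit l2 norm (eps avoids division by zero)."""
    x = np.asarray(x, dtype="float64")
    squared_norm = np.sum(x ** 2, axis=axis, keepdims=True)
    return x / np.sqrt(np.maximum(squared_norm, eps))


f = unit_normalize([[3.0, 4.0], [0.0, 2.0]])
print(f)                           # rows: [0.6, 0.8] and [0.0, 1.0]
print(np.linalg.norm(f, axis=-1))  # both norms are 1 (up to rounding)
```

With unit-norm embeddings, the squared Euclidean distance and the cosine similarity carry the same information, since $\|a-b\|^2 = 2 - 2\,a^\top b$ when $\|a\| = \|b\| = 1$.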

If the notation in the paper is not wrong, the global embeddings of the scene and product images share their parameters, that is, $\Theta_s = \Theta_p$. According to Stack Overflow, this is the appropriate way to handle it: https://stackoverflow.com/questions/60094332/how-to-use-the-same-layer-model-twice-in-one-model-in-keras


Guillem 12/04/2025

In [2]:
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras import layers 



I was done writing my function and then ChatGPT was like no, that's too simple, not reusable enough, blablabla. I checked the Keras docs and they mention this as the proper way to build a custom layer, so I rewrote it. It looks pretty much the same, but some fancy Keras stuff with automatic shape inference goes on in the background...

https://www.tensorflow.org/tutorials/customization/custom_layers

In [3]:
class GLayer(layers.Layer):
    """
    Implements the g(·) that takes the visual feature vectors (v_s, v_p global embeddings) or
    a local feature vector m_i and puts them through a small nn, where the final dimension of
    the embeddings gets specified.

    inputs (to call):

        works for v_s, v_p, m_i. Should be a 1*1*feature vector per sample, it gets flattened.

    parameters:

        u_intermediate: units of the first dense layer, unspecified in the paper if I am not mistaken.
        dropout: dropout rate of the Dropout layer, unspecified in the paper if I am not mistaken.
        f_dimension: the dimension of the final embeddings used in the attention mechanism.

    returns:

        the unit-normalized f embedding for each input embedding.
    
    """
    def __init__(self, u_intermediate, dropout, f_dimension,name=None, **kwargs):
        # runs the constructor of the parent class (keras Layer) so the layer is set up properly
        super(GLayer, self).__init__(name=name,**kwargs)
        self.flatten = layers.Flatten()
        self.dense1 = layers.Dense(units=u_intermediate) # no activation means linear.
        self.bn = layers.BatchNormalization()
        self.relu = layers.ReLU()
        self.dropout = layers.Dropout(rate=dropout)
        self.dense2 = layers.Dense(units=f_dimension)
        self.unit_norm = layers.UnitNormalization()
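    def call(self, inputs, training=None):
        x = self.flatten(inputs)
        x = self.dense1(x)
        # pass `training` explicitly so BN and dropout switch between train and inference behaviour
        x = self.bn(x, training=training)
        x = self.relu(x)
        x = self.dropout(x, training=training)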
        x = self.dense2(x)
        f = self.unit_norm(x)
        return f
    def get_config(self):
        # If we want to save the model we need this apparently.
        config = super().get_config()
        config.update({
            "u_intermediate": self.dense1.units,
            "dropout": self.dropout.rate,
            "f_dimension": self.dense2.units, 
            "name": self.name
        })
        return config

We should have shared weights for both global embeddings and different params for each of the local ones.
Create one GLayer for both global embeddings and #(width*height) layers for the local mappings.

In [4]:
#gnn = GLayer(u_intermediate=256, dropout=0.3, f_dimension=128)

# Input shape could be (batch_size, 1, 1, feature_dim)
#out1 = gnn(input1)
#out2 = gnn(input2)  # Shared weights
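A runnable sketch of the weight sharing idea with the functional API: calling the same layer object on two inputs reuses one set of weights. To keep the block self-contained it uses a `keras.Sequential` stand-in (`make_g`, my own name) with the same Linear-BN-ReLU-Dropout-Linear-L2Norm stack instead of the `GLayer` class, and the sizes (2048, 256, 0.3, 128) are assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def make_g(u_intermediate=256, dropout=0.3, f_dimension=128):
    """Stand-in for GLayer: Linear-BN-ReLU-Dropout-Linear-L2Norm."""
    return keras.Sequential([
        layers.Flatten(),
        layers.Dense(u_intermediate),
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.Dropout(dropout),
        layers.Dense(f_dimension),
        layers.UnitNormalization(),
    ])


feature_dim = 2048  # assumed size of v_s and v_p
v_s_in = keras.Input(shape=(1, 1, feature_dim), name="v_s")
v_p_in = keras.Input(shape=(1, 1, feature_dim), name="v_p")

shared_g = make_g()     # one instance ...
f_s = shared_g(v_s_in)  # ... called on the scene embedding
f_p = shared_g(v_p_in)  # ... and again on the product embedding (same weights)

model = keras.Model(inputs=[v_s_in, v_p_in], outputs=[f_s, f_p])

# Feeding the same vector through both branches should give the same output.
x = np.random.rand(4, 1, 1, feature_dim).astype("float32")
out_s, out_p = [np.asarray(o) for o in model([x, x], training=False)]
print(out_s.shape, np.allclose(out_s, out_p, atol=1e-5))
```

For the local mappings a separate instance (for example another `make_g()` call) would hold its own parameters.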


## $e_c$ embeddings

A bit tricky to implement, although it is just a bias layer that changes by category.

**Forgot the unit normalization! -> Now added**

_To Do_: Add docstrings and check differences between build and call... Also that final call thing that was giving me errors.

In [5]:
class CategoryDependentBiasLayer(layers.Layer):
    """ 
    args: 
        num_categories: given from data 
        bias_dim: should be same dimension as f 
    returns: 
        bias vector for the correct category, should be of the same dimension as f 
    """
    def __init__(self, num_categories, bias_dim, **kwargs):
        super(CategoryDependentBiasLayer, self).__init__(**kwargs)
        self.num_categories = num_categories
        self.bias_dim = bias_dim
        self.unit_norm = layers.UnitNormalization()
    def build(self, input_shape):
        # One bias vector per category
        self.biases = self.add_weight(
            shape=(self.num_categories, self.bias_dim),
            initializer="zeros",
            trainable=True,
            name="category_biases"
        )

    def call(self, category_indices):
        # Some stuff from chatgpt to differentiate between one input and a batch of inputs... 
        # TO DO: Check it makes sense.
        category_indices = tf.convert_to_tensor(category_indices)
    
        # Handle scalar (inference) input
        if category_indices.shape.rank == 0:
            category_indices = tf.expand_dims(category_indices, axis=0)
    
        # Handle (batch_size, 1) input (common in Keras with integer inputs)
        if category_indices.shape.rank == 2 and category_indices.shape[-1] == 1:
            category_indices = tf.squeeze(category_indices, axis=-1)
    
        # Now shape should be (batch_size,)
        biases = tf.gather(self.biases, category_indices)
        return self.unit_norm(biases)
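To check what the lookup in `call` does, here is a tiny standalone sketch of the same three steps (squeeze, gather, unit-normalize) on a hand-written bias table. The table values and the name `bias_table` are made up so the result can be verified by hand, and `tf.math.l2_normalize` stands in for the UnitNormalization layer.

```python
import numpy as np
import tensorflow as tf

# Made-up table: 3 categories, bias_dim = 2 (the real one is a trainable weight).
bias_table = tf.constant([[3.0, 4.0],
                          [0.0, 2.0],
                          [1.0, 0.0]])

# Category ids as they often arrive from Keras: shape (batch_size, 1).
category_indices = tf.constant([[2], [0]])
category_indices = tf.squeeze(category_indices, axis=-1)  # -> shape (batch_size,)

picked = tf.gather(bias_table, category_indices)  # one row per sample
e_c = tf.math.l2_normalize(picked, axis=-1)       # unit l2 norm per row
print(e_c.numpy())  # rows: [1.0, 0.0] and [0.6, 0.8]
```

One thing to double check in the layer above: a row of zeros stays a zero vector after normalization, so with the "zeros" initializer the bias should be all zeros at the start of training rather than unit norm.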