In [36]:
import pandas
import os
from few_shot_learning.datasets import FashionProductImages, FashionProductImagesSmall
from few_shot_learning.utils_data import prepare_class_embedding, prepare_vocab,\
    prepare_vocab_embedding, prepare_word_embedding
from few_shot_learning.train_zero_shot import zero_shot_training
# from few_shot_learning.utils_evaluation import evaluate_few_shot
from config import DATA_PATH, PATH
from few_shot_learning.models import Identity, ClassEmbedding

# Zero-Shot Learning

# 1. Introduction and Strategy

The idea of zero-shot learning is to train a model that can classify images from unseen classes, that is, classes of which the model has not seen any samples during training. An example in the context of the fashion dataset could be the task of deciding whether a new, unseen image shows "Jeans" or "Casual Shoes" when neither jeans nor casual shoe images were present at training time. At training time, the model can only access images and class labels from a set of training classes that is disjoint from the set of unseen classes. These could be images of "Formal Shoes" and "Tshirts".

The task of zero-shot learning differs even from one-shot learning: in few-shot learning, the model has access at test time to a few sample images from each unseen class and can adapt using these samples before being asked to classify unseen images. In zero-shot learning, no such samples are available.

To classify unseen classes with zero-shot learning, the model needs to make use of at least one of two strategies. The first relies on additional class-level meta information that describes seen and unseen classes in terms of attributes, e.g. visual attributes. In this case the model can classify images of unseen classes at test time by relating the unseen classes to classes seen at training time via their attributes.
The second uses the class labels themselves: the model understands what to look for in an unseen image by relating the unseen labels semantically to the labels of classes seen at training time. In the context of the fashion dataset, the example above already illustrates this strategy: if the model has knowledge of the semantic similarity between "Formal Shoes" (training class) and "Casual Shoes" (unseen class), it could reasonably deduce that an unseen image of a casual shoe must look more like a formal shoe than like jeans.

For the fashion dataset, the only class-level meta information is contained in the `masterCategory` and `subCategory` columns of the data. The column `productDisplayName`, although semantically informative, represents sample-level information and thus cannot straightforwardly be used for zero-shot learning. One can imagine pooling the semantic descriptions of all samples of a given class and distilling them into a single class-level description. This could be, for example, a measure of how often certain descriptors like "green" are used to describe images of the class. Importantly, even with this strategy, the product descriptions of individual images could not be used directly.
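The pooling idea described above (which was not the strategy used here) could be sketched roughly as follows. The rows are hypothetical toy data, not taken from the dataset, and the function name `descriptor_frequencies` is made up for illustration; the only assumption about the real data is that `articleType` serves as the class label and `productDisplayName` as the per-sample description.

```python
from collections import Counter, defaultdict

# Hypothetical (articleType, productDisplayName) pairs, NOT real dataset rows.
toy_rows = [
    ("Tshirts", "Green Printed Tshirt"),
    ("Tshirts", "Blue Striped Tshirt"),
    ("Tshirts", "Green Polo Tshirt"),
    ("Jeans", "Blue Slim Fit Jeans"),
]


def descriptor_frequencies(rows):
    """For each class, return the fraction of samples whose description contains each word."""
    counts = defaultdict(Counter)
    totals = Counter()
    for label, name in rows:
        totals[label] += 1
        # Use a set so that a word is counted at most once per sample.
        counts[label].update(set(name.lower().split()))
    return {
        label: {word: c / totals[label] for word, c in counter.items()}
        for label, counter in counts.items()
    }


freqs = descriptor_frequencies(toy_rows)
print(freqs["Tshirts"]["green"])
```

The resulting per-class dictionaries are class-level descriptions: they summarize all samples of a class and never expose the description of a single image.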

Here, I chose a different strategy, combining the class-level attributes `masterCategory` and `subCategory` as categorical features with NLP features of the class labels through pre-trained semantic word embeddings. Specifically, I chose to represent class labels as vectors via the [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/) language model. The algorithm used for zero-shot learning with these features was **Prototypical Networks** in the zero-shot configuration, as described in the Prototypical Networks paper.



# 2. Methods and Training Procedure

In [14]:
# loading data with class attribute configuration
all_data = FashionProductImagesSmall(split='all', classes=None, return_class_attributes=True)

The `masterCategory` column is class-level meta information that takes 7 different values:

In [15]:
print(all_data.df_meta["masterCategory"].unique())

['Apparel' 'Accessories' 'Footwear' 'Personal Care' 'Free Items'
 'Sporting Goods' 'Home']


I chose to represent these as categorical, one-hot features, ignoring their semantic meaning. The dataset returns them when constructed with `return_class_attributes=True`, e.g. via:

In [5]:
X, y, attr = all_data[0]
attr

array([0., 1., 0., 0., 0., 0., 0.])

In [6]:
len(attr)

7
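For illustration, a one-hot encoding of this kind could be built as in the sketch below. This is not the dataset class's own code: the helper `one_hot_master_category` is hypothetical, and the alphabetical index order is an assumption, so the position of the `1` may differ from what `FashionProductImagesSmall` returns.

```python
import numpy as np

# The seven masterCategory values printed above. Sorting them alphabetically
# is an assumption; the dataset class may use a different index order.
MASTER_CATEGORIES = sorted([
    "Apparel", "Accessories", "Footwear", "Personal Care",
    "Free Items", "Sporting Goods", "Home",
])


def one_hot_master_category(category):
    """Return a 7-dimensional one-hot vector for a masterCategory value."""
    vec = np.zeros(len(MASTER_CATEGORIES), dtype=np.float64)
    vec[MASTER_CATEGORIES.index(category)] = 1.0
    return vec


attr_example = one_hot_master_category("Apparel")
print(attr_example)
```

Because the encoding is purely categorical, any two distinct categories are equally far apart; no semantic similarity between, say, "Apparel" and "Footwear" is expressed.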

A set of utility functions uses the word representations from **GloVe: Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download): glove.42B.300d.zip** to build word embeddings of the class labels. This is a common strategy e.g. for zero-shot learning on ImageNet (see this review [paper](https://arxiv.org/pdf/1707.00600.pdf)).

A slight complication arises for the fashion data in that the class labels are sometimes multi-word descriptions, e.g. "Perfume and Body Mist" or "Laptop Bag". A possible solution would be to learn an RNN on top of the single-word embeddings, outputting a fixed size embedding vector for all classes. Given how complicated the setup is expected to be, however, it is preferable to not introduce additional parameters when it can be avoided. For multi-word class labels, I chose to simply pool all single-word embeddings for a given multi-word class label via a simple averaging operation as suggested in this [paper](https://www.aclweb.org/anthology/P18-1041.pdf).
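The averaging step could look roughly like the sketch below. It reuses the names `vocab_embedding` and `word2idx` from the notebook as parameters, but the function `embed_class_label` itself is hypothetical, and both the lowercasing and the skipping of out-of-vocabulary words are assumptions rather than the documented behaviour of the utility functions. A toy 4-dimensional vocabulary stands in for the 300-dimensional GloVe vectors.

```python
import numpy as np


def embed_class_label(label, vocab_embedding, word2idx):
    """Average the word vectors of all in-vocabulary words in a class label."""
    words = label.lower().split()
    # Assumption: words missing from the vocabulary are silently skipped.
    vectors = [vocab_embedding[word2idx[w]] for w in words if w in word2idx]
    if not vectors:
        raise KeyError(f"no word of {label!r} is in the vocabulary")
    return np.mean(vectors, axis=0)


# Toy 4-dimensional vocabulary standing in for the 300-d GloVe vectors.
toy_word2idx = {"laptop": 0, "bag": 1, "tshirts": 2}
toy_embedding = np.array([
    [1.0, 0.0, 2.0, 0.0],
    [3.0, 2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
])

laptop_bag = embed_class_label("Laptop Bag", toy_embedding, toy_word2idx)
print(laptop_bag)
```

Averaging keeps the output dimension fixed regardless of the number of words in the label and adds no trainable parameters, which is the property the text argues for.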

The class label embeddings thus are $300$-dimensional vectors, which are simply concatenated with the $7$-dimensional attribute vectors encoding the `masterCategory`. A simple linear layer is learned on top to project these features into a $256$-dimensional cross-modal embedding space into which the query images will be mapped as well. As in the Prototypical Networks paper, the $256$-dimensional class embeddings are normalized to unit length in the cross-modal space.

To embed images in this cross-modal space, a linear layer is learned on top of pre-trained ResNet18 image features, which are $512$-dimensional. The ResNet18 features are not fine-tuned. There is a potential problem with this approach when ImageNet (on which ResNet18 was trained) contains classes that overlap with the unseen test classes, since the image features would then not be free of information about those classes, but I chose to ignore that.
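The two projections and the resulting classification step can be sketched in NumPy as follows. This is a sketch under stated assumptions, not the repository's PyTorch implementation: all inputs and weights are random stand-ins, bias terms are omitted, and the use of squared Euclidean distance with a softmax follows the Prototypical Networks paper rather than anything confirmed by this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CLASSES, N_QUERIES = 5, 3
LABEL_DIM, ATTR_DIM, IMAGE_DIM, EMBED_DIM = 300, 7, 512, 256

# Random stand-ins for GloVe label vectors, one-hot attributes and ResNet18 features.
label_vecs = rng.normal(size=(N_CLASSES, LABEL_DIM))
attrs = np.eye(ATTR_DIM)[rng.integers(0, ATTR_DIM, size=N_CLASSES)]
image_feats = rng.normal(size=(N_QUERIES, IMAGE_DIM))

# Randomly initialised weights of the two linear layers (learned in practice).
W_class = rng.normal(scale=0.05, size=(LABEL_DIM + ATTR_DIM, EMBED_DIM))
W_image = rng.normal(scale=0.05, size=(IMAGE_DIM, EMBED_DIM))

# Class prototypes: concatenate (307-d), project (256-d), normalise to unit length.
class_inputs = np.concatenate([label_vecs, attrs], axis=1)
prototypes = class_inputs @ W_class
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)

# Query images: project the frozen image features into the same space.
queries = image_feats @ W_image

# Squared Euclidean distance from every query to every prototype.
dists = ((queries[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)

# Softmax over negative distances gives class probabilities per query.
logits = -dists
logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
predictions = probs.argmax(axis=1)
print(predictions)
```

No support images appear anywhere in this computation: the prototypes come entirely from class-level information, which is what makes the setup zero-shot.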

Below is a code snippet showcasing the class label embeddings:

In [None]:
# this prepares the indexing of word2vec for glove and takes a while
prepare_word_embedding()

In [7]:
background = FashionProductImagesSmall(split='all', classes='background', return_class_attributes=True)
# evaluation = FashionProductImagesSmall(split='all', classes='evaluation', return_class_attributes=True)

target_vocab = prepare_vocab(background.df_meta, columns=["articleType"])
vocab_embedding, word2idx = prepare_vocab_embedding(target_vocab)

In [8]:
vocab_embedding[word2idx["tshirts"]]

array([-1.9351e-02,  6.9610e-02, -1.3461e-01, -3.5931e-01,  3.8046e-01,
        4.4326e-01,  7.0563e-02, -1.6641e-01,  2.3013e-01, -1.8636e-02,
       -9.0016e-02,  3.6856e-01, -3.2842e-01,  1.3455e-01,  5.6520e-01,
       -6.7276e-01,  3.0663e-01,  2.4319e-01, -1.6085e-01,  2.9144e-01,
       -6.9306e-01,  1.8493e-01, -1.0965e-01, -4.1638e-01, -2.7546e-01,
       -1.0150e-01, -2.2765e-01, -5.8273e-02,  2.2619e-02,  4.9403e-01,
        3.3521e-01, -6.1759e-01, -1.6073e-01,  3.8577e-01, -2.6317e-01,
       -5.3999e-01,  1.4394e-01,  4.0844e-01, -6.3808e-02,  4.8194e-01,
       -2.4429e-01,  4.9381e-02, -4.8517e-01, -3.7395e-01, -3.0018e-01,
       -2.8939e-01,  3.5502e-01,  1.5590e-01,  4.3168e-01,  4.3281e-02,
        1.9047e-01,  9.2106e-02,  1.1732e+00, -6.1249e-01,  1.4235e-01,
       -2.2451e-01,  4.2231e-01,  1.0565e-01,  5.6860e-01, -3.1866e-02,
       -4.7184e-01, -3.7409e-01,  5.8321e-01, -1.2297e-01, -6.7348e-02,
       -2.8081e-02, -6.1327e-01,  6.7741e-01, -2.2748e-02, -7.42

# 3. Training

To train the models, run:

```
python -m experiments.zero_shot_experiment --k-train 10 --k-test 2 --q-train 10 --q-test 1 --small-dataset --pretrained --freeze

python -m experiments.zero_shot_experiment --k-train 20 --k-test 5 --q-train 10 --q-test 1 --small-dataset --pretrained --freeze

python -m experiments.zero_shot_experiment --k-train 30 --k-test 15 --q-train 10 --q-test 1 --small-dataset --pretrained --freeze
```
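As a rough illustration of what one training episode might contain, the sketch below samples `k` classes and `q` query images per class. It assumes that `--k-train` is the number of classes per episode and `--q-train` the number of query images per class; the function `sample_zero_shot_episode` and the toy index are hypothetical and are not part of `experiments.zero_shot_experiment`.

```python
import random


def sample_zero_shot_episode(samples_by_class, k, q, rng):
    """Pick k classes and q query samples per class; no support images are drawn."""
    classes = rng.sample(sorted(samples_by_class), k)
    queries = [
        (sample, cls)
        for cls in classes
        for sample in rng.sample(samples_by_class[cls], q)
    ]
    return classes, queries


# Hypothetical toy index: class name -> list of image ids.
toy_index = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(12)}

classes, queries = sample_zero_shot_episode(toy_index, k=10, q=10, rng=random.Random(0))
print(len(classes), len(queries))
```

Under these assumptions, an episode with `k=10` and `q=10` yields 100 query images, each to be assigned to one of the 10 sampled class embeddings.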

# 4. Results

In [35]:
LOG_DIR = os.path.expanduser("~/few-shot-learning/logs/proto_nets")
MODEL_DIR = os.path.expanduser("~/few-shot-learning/models/proto_nets")

In [32]:
small = True
pretrained = True

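validate = [False]
# tuple order: (n_test, n_train, k_test, k_train, q_test, q_train);
# n = 0 shots in both cases, since this is the zero-shot setting
shot_way_query = [(0,0,2,10,1,10), (0,0,5,20,1,10), (0,0,15,30,1,10)]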

# best_model_state_dict = {}
csv_logs = {}
top1_accuracy = {}

for val in validate:
    
    # best_model_state_dict[val] = {}
    csv_logs[val] = {}
    top1_accuracy[val] = {}
    
    for (n_test, n_train, k_test, k_train, q_test, q_train) in shot_way_query:
        
        param_str = f'fashion_nt={n_train}_kt={k_train}_qt={q_train}_' \
        f'nv={n_test}_kv={k_test}_qv={q_test}_small={small}_' \
        f'pretrained={pretrained}_validate={val}'

        logfile = os.path.join(LOG_DIR, param_str + ".csv")
        modelfile = os.path.join(MODEL_DIR, param_str + ".pth")
        
        # best_model_state_dict[val][(n_test, k_test)] = {}
        csv_log = pandas.read_csv(logfile)
        csv_logs[val][(n_test, k_test)] = csv_log
        top1_accuracy[val][(n_test, k_test)] = csv_log[f"val_{n_test}-shot_{k_test}-way_acc"].iloc[-1]
        

In [33]:
top1_accuracy

{False: {(0, 2): 0.77, (0, 5): 0.484, (0, 15): 0.20266666666666666}}

Top-1 accuracy (in %) for the initial zero-shot learning experiments is shown below:

|                      | Fashion Small |         |          |
|----------------------|---------------|---------|----------|
| **k-way, zero-shot** | **k=2**       | **k=5** | **k=15** |
| 40 epochs            | 77.0          | 48.4    | 20.3     |
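To put these numbers in context, they can be compared against uniform guessing, which has an expected accuracy of $1/k$ in a $k$-way task. The short sketch below does this for the accuracies read from the log files; it assumes balanced evaluation episodes, which is consistent with `--q-test 1` (one query image per class) but is not verified here.

```python
# Final top-1 accuracies from the log files, keyed by the number of ways k.
top1 = {2: 0.77, 5: 0.484, 15: 0.20266666666666666}

# Ratio of achieved accuracy to the chance level 1/k of uniform guessing.
ratios = {k: acc / (1.0 / k) for k, acc in top1.items()}

for k, acc in top1.items():
    chance = 1.0 / k
    print(f"k={k:>2}: accuracy {acc:.1%}, chance {chance:.1%}, ratio {ratios[k]:.2f}x")
```

Absolute accuracy drops as $k$ grows, but so does the chance level, so the ratio to chance is the more comparable quantity across the three settings.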