In [1]:
from __future__ import print_function
import torch
import os
import numpy as np
import sys
import matplotlib.pyplot as plt

from few_shot_learning.datasets import FashionProductImages, FashionProductImagesSmall   
from few_shot_learning.utils_evaluation import evaluate_few_shot
from config import DATA_PATH

# 1. Data Preprocessing and Statistics

I chose to integrate as much as possible with the existing code from the [github repository](https://github.com/oscarknagg/few-shot). To this end, I mainly adapted the dataset classes `few_shot_learning.datasets.FashionProductImages` and `few_shot_learning.datasets.FashionProductImagesSmall` to expose a `pandas.DataFrame` with the columns `'class_id'` and `'id'`, which allows integration with `few_shot.core.NShotTaskSampler` from https://github.com/oscarknagg/few-shot/blob/master/few_shot/core.py. In keeping with the repository's naming conventions, I chose to call the **meta-training set** the **background set** and the **meta-test set** the **evaluation set**.

Further changes to the existing codebase appear in `few_shot_learning.train_few_shot` and are marked as such. They are mainly small bugfixes and concern a different monitoring strategy during training (monitoring of validation accuracy instead of test accuracy).

The following code snippets illustrate the functionality:

In [2]:
# use split='all' to ignore the training/test split from the transfer learning experiments
# use classes='background' to use the background/meta-training classes
background = FashionProductImagesSmall(DATA_PATH, split='all', classes='background')
# use classes='evaluation' to use the evaluation/meta-testing classes
evaluation = FashionProductImagesSmall(DATA_PATH, split='all', classes='evaluation')

In [3]:
background.df
# evaluation.df

Unnamed: 0,id,gender,masterCategory,subCategory,articleType,baseColour,season,year,usage,productDisplayName,productDisplayName2,filename,my_id,class_id
2,0,Women,Accessories,Watches,Watches,Silver,Winter,2016.0,Casual,Titan Women Silver Watch,,59263.jpg,59263,58
4,1,Men,Apparel,Topwear,Tshirts,Grey,Summer,2012.0,Casual,Puma Men Grey T-shirt,,53759.jpg,53759,55
5,2,Men,Apparel,Topwear,Tshirts,Grey,Summer,2011.0,Casual,Inkfruit Mens Chain Reaction T-shirt,,1855.jpg,1855,55
9,3,Men,Accessories,Watches,Watches,Black,Winter,2016.0,Casual,Skagen Men Black Watch,,30039.jpg,30039,58
11,4,Women,Accessories,Belts,Belts,Black,Summer,2012.0,Casual,Fossil Women Black Huarache Weave Belt,,48123.jpg,48123,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44439,25039,Women,Apparel,Topwear,Tshirts,Peach,Fall,2011.0,Casual,Tantra Women Printed Peach T-shirt,,12544.jpg,12544,55
44440,25040,Women,Apparel,Topwear,Tops,Blue,Summer,2012.0,Casual,Sepia Women Blue Printed Top,,42234.jpg,42234,53
44442,25041,Men,Footwear,Flip Flops,Flip Flops,Red,Summer,2011.0,Casual,Lotto Men's Soccer Track Flip Flop,,6461.jpg,6461,20
44443,25042,Men,Apparel,Topwear,Tshirts,Blue,Fall,2011.0,Casual,Puma Men Graphic Stellar Blue Tshirt,,18842.jpg,18842,55


In [4]:
background.classes

array(['Accessory Gift Set', 'Baby Dolls', 'Backpacks', 'Bangle',
       'Basketballs', 'Bath Robe', 'Belts', 'Booties', 'Boxers', 'Bra',
       'Camisoles', 'Capris', 'Compact', 'Concealer', 'Cufflinks',
       'Deodorant', 'Dresses', 'Duffel Bag', 'Dupatta', 'Flats',
       'Flip Flops', 'Formal Shoes', 'Gloves', 'Hair Colour', 'Handbags',
       'Heels', 'Jackets', 'Jeggings', 'Kajal and Eyeliner', 'Kurtis',
       'Laptop Bag', 'Leggings', 'Lip Gloss', 'Lip Liner', 'Mobile Pouch',
       'Mufflers', 'Nail Polish', 'Necklace and Chains', 'Nightdress',
       'Patiala', 'Pendant', 'Rain Jacket', 'Ring', 'Rompers',
       'Rucksacks', 'Sandals', 'Shorts', 'Skirts', 'Sports Sandals',
       'Stockings', 'Suspenders', 'Swimwear', 'Ties', 'Tops', 'Trousers',
       'Tshirts', 'Waist Pouch', 'Waistcoat', 'Watches', 'Water Bottle'],
      dtype='<U19')

In [5]:
evaluation.classes

array(['Bracelet', 'Briefs', 'Caps', 'Casual Shoes', 'Churidar',
       'Clutches', 'Earrings', 'Eyeshadow', 'Face Moisturisers',
       'Face Wash and Cleanser', 'Foundation and Primer',
       'Fragrance Gift Set', 'Free Gifts', 'Highlighter and Blush',
       'Innerwear Vests', 'Jeans', 'Jewellery Set', 'Jumpsuit',
       'Kurta Sets', 'Kurtas', 'Lip Care', 'Lipstick', 'Lounge Pants',
       'Lounge Shorts', 'Mascara', 'Mask and Peel', 'Messenger Bag',
       'Night suits', 'Perfume and Body Mist', 'Salwar', 'Sarees',
       'Scarves', 'Shirts', 'Shoe Accessories', 'Socks', 'Sports Shoes',
       'Stoles', 'Sunglasses', 'Sunscreen', 'Sweaters', 'Sweatshirts',
       'Track Pants', 'Tracksuits', 'Travel Accessory', 'Trunk', 'Tunics',
       'Wallets'], dtype='<U22')
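
To illustrate why the `'id'` and `'class_id'` columns are enough for episodic sampling, here is a minimal plain-Python sketch of drawing one k-way, n-shot episode. It is not the implementation of `few_shot.core.NShotTaskSampler`; the function `sample_episode` and the toy records (standing in for rows of `background.df`) are hypothetical.

```python
import random
from collections import defaultdict


def sample_episode(records, k_way, n_shot, q_queries, rng=random):
    """Draw one episode from records that carry 'id' and 'class_id' keys."""
    ids_by_class = defaultdict(list)
    for row in records:
        ids_by_class[row["class_id"]].append(row["id"])

    # Pick k distinct classes, then n support + q query samples per class.
    episode_classes = rng.sample(sorted(ids_by_class), k_way)
    support, query = [], []
    for class_id in episode_classes:
        chosen = rng.sample(ids_by_class[class_id], n_shot + q_queries)
        support += [(i, class_id) for i in chosen[:n_shot]]
        query += [(i, class_id) for i in chosen[n_shot:]]
    return support, query


# Toy records with made-up values: 4 classes with 10 samples each.
toy_records = [{"id": i, "class_id": i % 4} for i in range(40)]
support, query = sample_episode(
    toy_records, k_way=3, n_shot=1, q_queries=5, rng=random.Random(0)
)
print(len(support), len(query))
```

Support and query samples of a class are drawn without replacement from the same pool, so no image appears in both sets of an episode.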

# 2. Tests

Run `pytest` in the root directory to run tests. There are tests specifically designed to test the integration with `few_shot.core.NShotTaskSampler` in `tests/test_few_shot_integration`.

# 3. Training

I focused on running experiments with Prototypical Networks.
For example, run the following for 1- and 5-shot, 2-, 5- and 15-way experiments on the small dataset:

```
python -m experiments.few_shot_learning --dataset fashion --k-test 2 --n-test 1 --k-train 10 --n-train 1 --q-train 5 --small-dataset
python -m experiments.few_shot_learning --dataset fashion --k-test 5 --n-test 1 --k-train 30 --n-train 1 --q-train 5 --small-dataset
python -m experiments.few_shot_learning --dataset fashion --k-test 15 --n-test 1 --k-train 30 --n-train 1 --q-train 5 --small-dataset
python -m experiments.few_shot_learning --dataset fashion --k-test 2 --n-test 5 --k-train 10 --n-train 5 --q-train 5 --small-dataset
python -m experiments.few_shot_learning --dataset fashion --k-test 5 --n-test 5 --k-train 30 --n-train 5 --q-train 5 --small-dataset
python -m experiments.few_shot_learning --dataset fashion --k-test 15 --n-test 5 --k-train 30 --n-train 5 --q-train 5 --small-dataset
```

The full set of experiments that I ran can be found in `experiments/few_shot_experiments`.

For a list of arguments, refer to the help message:

In [6]:
run "~/few-shot-learning/experiments/few_shot_experiment.py" -h

usage: few_shot_experiment.py [-h] [--dataset DATASET] [--distance DISTANCE]
                              [--n-train N_TRAIN] [--n-test N_TEST]
                              [--k-train K_TRAIN] [--k-test K_TEST]
                              [--q-train Q_TRAIN] [--q-test Q_TEST]
                              [--small-dataset] [-a ARCHITECTURE]
                              [--pretrained] [--validate]
                              [--n_val_classes N_VAL_CLASSES] [--gpu GPU]

optional arguments:
  -h, --help            show this help message and exit
  --dataset DATASET
  --distance DISTANCE
  --n-train N_TRAIN
  --n-test N_TEST
  --k-train K_TRAIN
  --k-test K_TEST
  --q-train Q_TRAIN
  --q-test Q_TEST
  --small-dataset
  -a ARCHITECTURE, --arch ARCHITECTURE
                        model architecture: alexnet | densenet121 |
                        densenet161 | densenet169 | densenet201 | googlenet |
                        inception_v3 | mobilenet_v2 | resnet101 | resnet152 |
       

Currently, the only available option is `--dataset fashion`. The arguments `--n-train N, --n-test N, --k-train K, --k-test K` determine the type of few-shot learning to run, that is, the number of support samples per class and the number of classes encountered during training and testing, respectively. The arguments `--q-train Q, --q-test Q` determine the number of query examples per class and can be chosen flexibly. It is important, though, to ensure `n_train + q_train <= 10` and `n_test + q_test <= 10`, as 10 is the minimum number of samples per class in both the background set and the evaluation set of the fashion dataset.
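
The sample-count constraint can be checked before launching an experiment. The helper below is a hypothetical sketch and not part of the repository; the limit of 10 samples per class is the one stated above.

```python
MIN_SAMPLES_PER_CLASS = 10  # minimum number of samples per class, as stated above


def check_episode_config(n, q, min_samples=MIN_SAMPLES_PER_CLASS):
    """Return n + q, or raise ValueError if it exceeds the smallest class size."""
    if n + q > min_samples:
        raise ValueError(
            f"n + q = {n + q} exceeds the {min_samples} samples of the smallest class"
        )
    return n + q


check_episode_config(n=5, q=5)  # largest training setting used here: exactly 10
check_episode_config(n=5, q=1)  # 5-shot test setting
```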

Usually, one would choose `n_train=n_test` and `k_train=k_test` to match training and testing conditions exactly. However, it is possible to choose `k_train > k_test` to make the training conditions harder, possibly combating overfitting. I used this strategy in all experiments, as suggested in the Prototypical Networks paper.

The number of query samples is kept at `q_train=5` and `q_test=1`.

Further arguments include `--small-dataset`, which trains on the smaller images (80-by-60 pixels) and should be chosen as long as `--pretrained` is not used. The default model is a 4-layer conv-net, for which the embedding dimension is reasonable at this image size. In the original code, this model was used for 28-by-28 Omniglot images and 84-by-84 miniImageNet images, which is in the same ballpark as the small fashion dataset.

If `--small-dataset` is not given, the images will have size 400-by-300 as in the transfer learning experiments. This is too large for a 4-layer conv-net, as the embedding dimension will explode together with the memory and computation requirements. Instead, I used `--pretrained` to use a pre-defined model from the ImageNet model zoo. For all models from the zoo, the last fully connected layer is discarded (actually replaced with an identity), so that the models can serve as embedding models. I only ran experiments with `--arch resnet18`, which is also the default. That turned out to be intensive enough, since for few-shot learning with Prototypical Networks, the batch size is `(n_train + q_train) * k_train`. This goes up to 300 for the most difficult configuration I had in mind (`n_train=5, q_train=5, k_train=30`), which did not fit on a single 12GB GPU. One could think about multi-GPU training, but I am not certain at the moment whether this works as straightforwardly as with transfer learning.
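
As a quick check of the batch-size formula, the following sketch evaluates `(n_train + q_train) * k_train` for the training configurations used in this notebook. The function name `episode_batch_size` is only illustrative.

```python
def episode_batch_size(n, k, q):
    """Images per episode: (n support + q query) samples for each of k classes."""
    return (n + q) * k


# (n_train, k_train, q_train) values used in the experiments
training_configs = [(1, 10, 5), (1, 30, 5), (5, 10, 5), (5, 20, 5), (5, 30, 5)]
for n, k, q in training_configs:
    size = episode_batch_size(n, k, q)
    print(f"n_train={n}, k_train={k}, q_train={q}: {size} images")
```

Reducing `k_train` from 30 to 20 in the 5-shot case shrinks the episode from 300 to 200 images.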

Technically, only the larger architecture, not the pretrained weights, is required for the 400-by-300 pixel images. However, using pretrained weights was one of the ideas I had for improving performance, so I coupled model architecture and pretraining in the arguments.

### Strategy to test model and improve performance

The important decisions I made for training can be summarized like this:

- As in the original implementation:
    - *n_epochs*: 80
    - *learning rate schedule*: `learning_rate=1e-3`, halved every `drop_lr_every=20` epochs
    - *episodes_per_epoch*: 100
    - *model architecture*: 4-layer conv-net (for 80-by-60 pixel small dataset) 
    - *distance*: L2
- Changes:
    - *validation*: introduced the possibility to monitor validation accuracy. The validation set is constructed by taking a subset of the background classes. The size of the subset is determined by `n_val_classes` and should be at least `k_test`. The original implementation monitored only performance on the evaluation/meta-test set. This is not a good idea, since performance on the meta-test set should not be used for early stopping or best model selection.
    - *validation_episodes*: 200
    - *evaluation_episodes*: 1000 (the original code had a bug and evaluated only on *episodes_per_epoch* episodes)
    - *model architecture*: **resnet18** with pretrained weights (for 400-by-300 pixel dataset)
    - *data augmentation*: `torchvision.transforms.RandomResizedCrop`, `torchvision.transforms.RandomPerspective`, `torchvision.transforms.RandomHorizontalFlip` only for the background/meta-training set.
    - *n-shot, k-way*:
        - 1-shot, 2-way: `n_train=1, n_test=1, k_train=10, k_test=2, q_train=5, q_test=1`
        - 1-shot, 5-way: `n_train=1, n_test=1, k_train=30, k_test=5, q_train=5, q_test=1`
        - 1-shot, 15-way: `n_train=1, n_test=1, k_train=30, k_test=15, q_train=5, q_test=1`
        - 5-shot, 2-way: `n_train=5, n_test=5, k_train=10, k_test=2, q_train=5, q_test=1`
        - 5-shot, 5-way: `n_train=5, n_test=5, k_train=30, k_test=5, q_train=5, q_test=1`
        - 5-shot, 15-way: `n_train=5, n_test=5, k_train=30, k_test=15, q_train=5, q_test=1`
        - 5-shot, 15-way: `n_train=5, n_test=5, k_train=20, k_test=15, q_train=5, q_test=1` (for 400-by-300 pixel images, not enough memory with `k_train=30`)
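
The learning rate schedule listed above can be written as a step decay. This is a minimal sketch, assuming 0-based epoch indices and a drop at the start of epochs 20, 40 and 60; the exact epoch at which the training code applies each drop may differ by one.

```python
def learning_rate(epoch, base_lr=1e-3, drop_lr_every=20, factor=0.5):
    """Step decay: multiply the base rate by `factor` once per completed interval."""
    return base_lr * factor ** (epoch // drop_lr_every)


n_epochs = 80
for epoch in range(0, n_epochs, 20):
    print(f"epoch {epoch:2d}: lr = {learning_rate(epoch):.6f}")
```

Under these assumptions, the last 20 of the 80 epochs run at one eighth of the initial learning rate.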
    
    

# 4. Results

In [7]:
LOG_DIR = os.path.expanduser("~/few-shot-learning/logs/proto_nets")
MODEL_DIR = os.path.expanduser("~/few-shot-learning/models/proto_nets")

if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

## 4.1. Small dataset

In [9]:
# load best model state_dict
small = True
pretrained = False

validate = [False, True]
shot = [(1,1), (5,5)]
way = [(2,10), (5,30), (15,30)]
query = [(1,5), (1,5)]

best_model_state_dict = {}
results = {}

for val in validate:
    
    best_model_state_dict[val] = {}
    results[val] = {}
    
    for (n_test, n_train), (q_test, q_train) in zip(shot, query):
        
        best_model_state_dict[val][n_test] = {}
        results[val][n_test] = {}
        
        for (k_test, k_train) in way:
            
            param_str = f'fashion_nt={n_train}_kt={k_train}_qt={q_train}_' \
            f'nv={n_test}_kv={k_test}_qv={q_test}_small={small}_' \
            f'pretrained={pretrained}_validate={val}'
            
            logfile = os.path.join(LOG_DIR, param_str + ".csv")
            modelfile = os.path.join(MODEL_DIR, param_str + ".pth")
            
            print(modelfile)
            if not os.path.isfile(modelfile):
                print("not found")
                continue
                
            state_dict = torch.load(modelfile, map_location=device)
            best_model_state_dict[val][n_test][k_test] = state_dict
            
            totals = evaluate_few_shot(
                state_dict,
                n_shot=n_test,
                k_way=k_test,
                q_queries=q_test,
                device=device,
                architecture='resnet18',
                pretrained=pretrained,
                small_dataset=small,
                metric_name="accuracy",
                evaluation_episodes=1000,
                num_input_channels=3,
                distance='l2'  
            )
            
            results[val][n_test][k_test] = totals
           

/home/ubuntu/few-shot-learning/models/proto_nets/fashion_nt=1_kt=10_qt=5_nv=1_kv=2_qv=1_small=True_pretrained=False_validate=False.pth
/home/ubuntu/few-shot-learning/models/proto_nets/fashion_nt=1_kt=30_qt=5_nv=1_kv=5_qv=1_small=True_pretrained=False_validate=False.pth
/home/ubuntu/few-shot-learning/models/proto_nets/fashion_nt=1_kt=30_qt=5_nv=1_kv=15_qv=1_small=True_pretrained=False_validate=False.pth
/home/ubuntu/few-shot-learning/models/proto_nets/fashion_nt=5_kt=10_qt=5_nv=5_kv=2_qv=1_small=True_pretrained=False_validate=False.pth
/home/ubuntu/few-shot-learning/models/proto_nets/fashion_nt=5_kt=30_qt=5_nv=5_kv=5_qv=1_small=True_pretrained=False_validate=False.pth
/home/ubuntu/few-shot-learning/models/proto_nets/fashion_nt=5_kt=30_qt=5_nv=5_kv=15_qv=1_small=True_pretrained=False_validate=False.pth
/home/ubuntu/few-shot-learning/models/proto_nets/fashion_nt=1_kt=10_qt=5_nv=1_kv=2_qv=1_small=True_pretrained=False_validate=True.pth
/home/ubuntu/few-shot-learning/models/proto_nets/fashi

In [10]:
results

{False: {1: {2: {'loss': 0.2830387688097544, 'accuracy': 0.8905},
   5: {'loss': 0.8062824005491566, 'accuracy': 0.7478},
   15: {'loss': 1.4794104229211806, 'accuracy': 0.5820666666666666}},
  5: {2: {'loss': 0.11855049550533295, 'accuracy': 0.9575},
   5: {'loss': 0.3890641279128595, 'accuracy': 0.8796},
   15: {'loss': 0.8682025781907141, 'accuracy': 0.7622666666666666}}},
 True: {1: {2: {'loss': 0.33640861096978186, 'accuracy': 0.885},
   5: {'loss': 0.8115123350869399, 'accuracy': 0.7266},
   15: {'loss': 1.7122219169139863, 'accuracy': 0.5309333333333334}},
  5: {2: {'loss': 0.11383653211593628, 'accuracy': 0.962},
   5: {'loss': 0.4309825728155229, 'accuracy': 0.8732},
   15: {'loss': 1.0273922931402921, 'accuracy': 0.7261333333333333}}}}

The top-1 accuracy for the few-shot experiments is shown in the following table:

|                           | Fashion Small |     |      |      |     |      |
|---------------------------|---------------|-----|------|------|-----|------|
| **k-way**                 | **2**         |**5**|**15**|**2** |**5**|**15**|
| **n-shot**                | **1**         |**1**|**1** |**5** |**5**|**5** |
| 80 epochs                 | 89.1          |74.8 |58.2  |95.8  |88.0 |76.2  |
| best model (validation)   | 88.5          |72.7 |53.1  |96.2  |87.3 |72.6  |

It seems that selecting the best model via validation accuracy underperforms the simple approach of training the model for 80 epochs in five of the six settings. It could be that the gains from potentially combating overfitting through validation do not outweigh the cost of sacrificing 10-20 of the 60 available background classes for validation.
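
To make the cost of validation concrete, the sketch below holds out `n_val_classes` of the background classes and reports how many remain for training. How the training code actually selects the validation classes is not shown in this notebook; the random choice and the helper name `split_background_classes` are assumptions.

```python
import random


def split_background_classes(classes, n_val_classes, seed=0):
    """Split class names into (training classes, validation classes)."""
    rng = random.Random(seed)
    val = set(rng.sample(list(classes), n_val_classes))
    train = [c for c in classes if c not in val]
    return train, sorted(val)


# Placeholder names standing in for the 60 entries of `background.classes`.
background_classes = [f"class_{i}" for i in range(60)]
for n_val in (10, 20):
    train, val = split_background_classes(background_classes, n_val)
    print(f"n_val_classes={n_val}: {len(train)} training classes remain")
```

With 20 validation classes, a third of the background classes is no longer available for meta-training.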

## 4.2. Full dataset

In [19]:
# load best model state_dict
small = False
pretrained = True

validate = [True] #[False, True]
# n-shot, k-way and q-query have a more complicated structure here.
# Due to memory problems, some combinations were not possible.
# This only concerns the training values, though.
# Each tuple is (n_test, n_train, k_test, k_train, q_test, q_train).
shot_way_query = [
    (1,1,2,10,1,5),(1,1,5,30,1,5),(1,1,15,30,1,5),
    (5,5,2,10,1,5),(5,5,5,20,1,5),(5,5,15,20,1,5)
]

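# Create the result container only on the first run, so that results from
# earlier runs of this cell (e.g. validate=False) are kept.
if 'results_full' not in globals():
    results_full = {}

for val in validate:
    
    best_model_state_dict.setdefault(val, {})
    results_full.setdefault(val, {})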
    
    for (n_test, n_train, k_test, k_train, q_test, q_train) in shot_way_query:
        
        best_model_state_dict[val][(n_test, k_test)] = {}
        results_full[val][(n_test, k_test)] = {}
        
        param_str = f'fashion_nt={n_train}_kt={k_train}_qt={q_train}_' \
        f'nv={n_test}_kv={k_test}_qv={q_test}_small={small}_' \
        f'pretrained={pretrained}_validate={val}'

        logfile = os.path.join(LOG_DIR, param_str + ".csv")
        modelfile = os.path.join(MODEL_DIR, param_str + ".pth")

        print(modelfile)
        if not os.path.isfile(modelfile):
            print("not found")
            continue

        state_dict = torch.load(modelfile, map_location=device)
        best_model_state_dict[val][(n_test, k_test)] = state_dict

        totals = evaluate_few_shot(
            state_dict,
            n_shot=n_test,
            k_way=k_test,
            q_queries=q_test,
            device=device,
            architecture='resnet18',
            pretrained=pretrained,
            small_dataset=small,
            metric_name="accuracy",
            evaluation_episodes=400,
            num_input_channels=3,
            distance='l2'  
        )

        results_full[val][(n_test, k_test)] = totals
           

/home/ubuntu/few-shot-learning/models/proto_nets/fashion_nt=5_kt=20_qt=5_nv=5_kv=15_qv=1_small=False_pretrained=True_validate=True.pth
not found


In [18]:
results_full

{False: {(1, 2): {'loss': 0.29882320020347836, 'accuracy': 0.90625},
  (1, 5): {'loss': 0.7771180894847567, 'accuracy': 0.783},
  (1, 15): {'loss': 1.6507120440900325, 'accuracy': 0.5725},
  (5, 2): {'loss': 0.12285698566585779, 'accuracy': 0.95875},
  (5, 5): {'loss': 0.44557703804807036, 'accuracy': 0.8815},
  (5, 15): {'loss': 0.8342426947876811, 'accuracy': 0.7483333333333333}},
 True: {(1, 2): {'loss': 0.35847300887107847, 'accuracy': 0.89625},
  (1, 5): {'loss': 0.8223146000178531, 'accuracy': 0.757},
  (1, 15): {'loss': 1.733450947329402, 'accuracy': 0.5735},
  (5, 2): {'loss': 0.12618330255150795, 'accuracy': 0.96875},
  (5, 5): {'loss': 0.3712885344750248, 'accuracy': 0.874},
  (5, 15): {}}}

The top-1 accuracy for the dataset with 400-by-300 pixel images is shown in the table below:

|                           | Fashion Full (400-by-300) |     |      |      |     |      |
|---------------------------|---------------|-----|------|------|-----|------|
| **k-way**                 | **2**         |**5**|**15**|**2** |**5**|**15**|
| **n-shot**                | **1**         |**1**|**1** |**5** |**5**|**5** |
| 80 epochs                 | 90.6          |78.3 |57.2  |95.9  |88.1 |74.8  |
| best model (validation)   | 89.6          |75.7 |57.3  |96.9  |87.4 |tbd   |

There are a few gains here and there but overall the results with a pretrained network do not improve upon the results from before, where the images and the network are smaller. On the other hand, this shows that the approach is viable for larger image sizes, where the smaller network produces embeddings with too high a dimensionality. Unfortunately, even in this case the validation approach underperforms slightly.

# 5. Outlook

### Extensions:

- *Class augmentation*: As in the Omniglot experiments in the [paper](https://arxiv.org/pdf/1703.05175v2.pdf), it could be a viable approach to augment the set of background classes. In Omniglot, this was done by 90 degree rotations of the original images, increasing the number of classes from 1200 to 4800. Note that this is different from the simple data augmentations used here, which increase the effective number of samples available for each class. An increase in the number of classes could be especially beneficial for monitoring validation accuracy, which came with substantial sacrifices in the current setup with only 60 background classes.
    It is unclear to me, however, whether rotations of the original images for the fashion data are a good way to create new virtual classes.
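
A minimal sketch of rotation-based class augmentation, assuming images are given as nested lists and that each (class, rotation) pair is assigned a new class id. The helpers `rotate90` and `augment_classes` are hypothetical and not part of the repository.

```python
def rotate90(image):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment_classes(samples, n_classes, n_rotations=4):
    """Return (image, class_id) pairs where every rotation forms its own class."""
    augmented = []
    for image, class_id in samples:
        rotated = image
        for r in range(n_rotations):
            # Rotation r of class c becomes the new class c + r * n_classes.
            augmented.append((rotated, class_id + r * n_classes))
            rotated = rotate90(rotated)
    return augmented


# Two toy 2-by-2 "images" with made-up pixel values, from classes 0 and 59.
toy_samples = [([[1, 2], [3, 4]], 0), ([[5, 6], [7, 8]], 59)]
augmented = augment_classes(toy_samples, n_classes=60)
print(sorted({class_id for _, class_id in augmented}))
```

With 60 background classes this yields 240 virtual classes. Note that the 80-by-60 images are not square, so 90 and 270 degree rotations change the image shape and would need resizing or padding before being fed to the same network.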