In [1]:
from sys import path
from pathlib import Path
from collections import defaultdict

import torchvision.transforms as transforms

import pandas as pd

In [2]:
path.append( "../code/" )

from relationnet.nn import ImageRelationNetwork

Let's import the relation net code. It's not very generalized at the moment, so you need to get your dataframe into a particular format for it to work. I append its directory to sys.path here so I can import it easily, but you could also move it into the current directory and get it to work.

In [3]:
data_root_dir = Path( "/home/lewis/Work/Employment/fellowshipai/fashion/data/streetstyle/" )
model_dir = Path( "/home/lewis/Work/Employment/fellowshipai/fashion/submission" )
im_dir = data_root_dir/"streetstyle27k_cropped"
csv = data_root_dir/"ss27k_labels.csv"

We're going to extract specific classes from the CSV and create a new dataframe from them, keeping only those classes. Have a read below for why we're only using one label category (clothing category) and not all of them.

In [4]:
targets = set( [ "clothing_category_dress", "clothing_category_outwear", "clothing_category_shirt", 
              "clothing_category_suit", "clothing_category_sweater", "clothing_category_tank_top",
             "clothing_category_t-shirt" ] )

ims = { "image": [], "label": [] }
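with open( csv, "r" ) as f:
    for line in f:
        # each row is "<image>,<space-separated labels>"; strip the newline so the last label can match
        im, cls = line.strip().split( "," )
        cls = set( cls.split( " " ) )

        # keep an image only if it has exactly one of the target categories
        inter = targets.intersection( cls )
        if len( inter ) == 1:
            ims["image"].append( im )
            ims["label"].append( list( inter )[0] )
        elif len( inter ) > 1:
            # more than one clothing category on one image: report it and skip
            print( im, cls )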
            
df = pd.DataFrame.from_dict( ims ).sample( frac=1 )
df.head()

       image                                              label
8976   1a23c7470bfccfd65c068ea12299b5a8_7479875228365...  clothing_category_t-shirt
13467  1e80fcaa80f96a744233886cacfe97d8_7978015252306...  clothing_category_dress
6069   7d39a00c898b09501b49098fe7034fe6_9244181586496...  clothing_category_shirt
10605  863b16ef17ba5b4b9be4f9c8cb17f237_8505265244947...  clothing_category_sweater
4254   8ea7166097ac1e5d3a89f08aab7b4988_6437852106281...  clothing_category_t-shirt


Ideally we would train on the entire dataset, but the way relation nets are set up makes that difficult. They train on a single relation score between two images. For a single-label problem this is easy: images are related if they are of the same class and unrelated otherwise. It's more complicated in a multi-label scenario. If the labels are not mutually exclusive, i.e. one does not preclude the other, you run into false negatives and false positives.
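
To make the single-label case concrete, here is a minimal sketch of how 0/1 relation targets could be built for one episode. The function name `relation_targets` and the toy labels are made up for illustration; this is not the code inside `relationnet`.

```python
def relation_targets(query_labels, support_labels):
    """Return a len(query) x len(support) matrix of relation targets.

    An entry is 1.0 when the query and support image share a class, else 0.0.
    """
    return [
        [1.0 if q == s else 0.0 for s in support_labels]
        for q in query_labels
    ]


# Hypothetical 5-way, 1-shot episode: one support image per class.
support = ["dress", "shirt", "suit", "sweater", "t-shirt"]
query = ["shirt", "dress", "shirt"]

targets = relation_targets(query, support)
for label, row in zip(query, targets):
    print(label, row)
```

With exclusive classes every query row contains exactly one 1.0, which is what makes the single-label setup unambiguous.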

For example, suppose you are training clothing category and colour on the same net (same weights) and you have two examples, (dress, red) and (jumper, red). When training with clothing category as the relation metric, you score them as not similar and the net updates its weights. When you then score the same pair on colour, you score them as similar. The net doesn't know the difference between colour and clothing type: it does not distinguish between different kinds of similarity. You therefore introduce noise in the form of either a false positive (dress being similar to jumper) or a false negative (red not being similar to red).

Even if you don't use the same images for the comparison, you are still comparing on the same latent features (the CNN representation for 'red' etc). There are ways around this problem, e.g. having a separate head in the relation net for each exclusive label group (clothing category is exclusive in that you can't be labelled as wearing both a dress and a t-shirt, so we only need one head for the entire category). Going further, a paper (https://arxiv.org/pdf/1805.12501.pdf) introduces the idea of linking these heads with a shared loss function so they can share semantic information (e.g. gender=woman predicts clothing_category=dress). We may try these extensions, depending on how large our dataset is (which determines whether we are constrained to few-shot approaches or not); however, that is not the direction we are going right now.
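
The conflict described above can be shown with a toy example. This is only a sketch with made-up items and a hypothetical helper `pair_targets`; it lists the image pairs that would receive contradictory targets if a single relation score were trained on both attributes.

```python
from itertools import combinations

# Three made-up images, each with a category and a colour label.
examples = {
    "a": {"category": "dress", "colour": "red"},
    "b": {"category": "jumper", "colour": "red"},
    "c": {"category": "dress", "colour": "blue"},
}


def pair_targets(examples, attribute):
    """Relation target (1 = related, 0 = unrelated) for every image pair."""
    return {
        (i, j): int(examples[i][attribute] == examples[j][attribute])
        for i, j in combinations(sorted(examples), 2)
    }


by_category = pair_targets(examples, "category")
by_colour = pair_targets(examples, "colour")

# Pairs whose target flips depending on which attribute defines "related".
conflicts = [pair for pair in by_category if by_category[pair] != by_colour[pair]]
print(conflicts)
```

Every pair in `conflicts` would push the shared weights in opposite directions depending on which attribute the episode happens to use.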

In [5]:
len( df["image"].unique() ), len( df["image"] )

(15890, 15890)

Relation nets are meant to be scored on unseen _classes_, not just unseen images. This doesn't really work for us: we only have 7 classes and want to train with 5 of them in each episode, so we can't hold another 5 out for the validation set. This is fine though, we aren't leaking data. The main justification behind holding out classes (I think) is to promote the few-shot capabilities of the architecture in working with totally unseen classes. Fundamentally the architecture is not trying to learn how to compare classes but how to compare abstract images, i.e. to find commonalities.

Since the default implementation is set up to split the validation set by class like this, we're going to create our own split manually.

In [6]:
val_pcnt = 0.2
val_len = int( len( df ) * val_pcnt )

val_df, train_df = df[:val_len], df[val_len:]

train_df.head()

       image                                              label
2377   d747f6321f54aa8b674d08e6a1381f59_7203359563331...  clothing_category_t-shirt
6443   0b6c3cad49cb7b06d68c337de11b8ad9_7532315915052...  clothing_category_t-shirt
1423   73fa0b081f9c82ab2e0ff80220e3b13f_7022395166408...  clothing_category_shirt
8523   921d82a93037bf75d3abd83606dbaab0_8507080194065...  clothing_category_t-shirt
1144   824a98c7ca6c3dd9fddce42819618de7_9251100104920...  clothing_category_sweater
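
For context, this is roughly what drawing one 5-way episode from (image, label) rows like the ones above could look like. It is a sketch under assumptions: `sample_episode`, `k_shot` and `n_query` are illustrative names and values, and the actual sampling inside `ImageRelationNetwork` may differ.

```python
import random
from collections import defaultdict


def sample_episode(rows, n_way=5, k_shot=1, n_query=2, rng=random):
    """Pick n_way classes, then k_shot support and n_query query images per class.

    rows is an iterable of (image, label) pairs.
    """
    by_label = defaultdict(list)
    for image, label in rows:
        by_label[label].append(image)

    # Only classes with enough images for both support and query are usable.
    eligible = sorted(l for l, ims in by_label.items() if len(ims) >= k_shot + n_query)
    classes = rng.sample(eligible, n_way)

    support, query = [], []
    for label in classes:
        picked = rng.sample(by_label[label], k_shot + n_query)
        support += [(im, label) for im in picked[:k_shot]]
        query += [(im, label) for im in picked[k_shot:]]
    return support, query


# Toy stand-in for the dataframe: 7 classes with 10 images each.
rows = [(f"img_{c}_{i}.jpg", f"class_{c}") for c in range(7) for i in range(10)]
support, query = sample_episode(rows, rng=random.Random(0))
print(len(support), len(query))
```

With 7 classes in total, any two 5-way episodes share at least 3 classes, which is why a class-level hold-out is not possible here.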


We can define a set of transformations to apply to each image as it is retrieved. These can be any torchvision transform; have a look at the docs for more. The original paper used fixed rotations (0, 90, 180, 270 degrees) applied to an entire episode of images. We do it a bit differently here by applying a random rotation per image instead of per episode. As expected, this reduces the accuracy on the benchmark set, but not by much, and I think random rotations lead to a more robust model anyway.

In [7]:
# we take these statistics straight from the paper although they may not be appropriate for our dataset.
# It shouldn't affect accuracy too much however.
norm = transforms.Normalize( mean=[0.92206], std=[0.08426] )
# random rotation per image, sampled uniformly over a full turn
rot = transforms.RandomRotation( [0, 360] )
tsf = transforms.Compose( [rot, transforms.ToTensor(), norm] )
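
The difference between the two rotation schemes comes down to where the angle is sampled. The sketch below only draws angles (it does not touch images), and the helper names are made up for illustration.

```python
import random

FIXED_ANGLES = (0, 90, 180, 270)


def per_episode_angles(n_images, rng):
    """Original-paper style: one fixed angle shared by every image in the episode."""
    angle = rng.choice(FIXED_ANGLES)
    return [angle] * n_images


def per_image_angles(n_images, rng):
    """This notebook's style: an independent random angle for each image."""
    return [rng.uniform(0.0, 360.0) for _ in range(n_images)]


rng = random.Random(0)
episode = per_episode_angles(10, rng)
independent = per_image_angles(10, rng)
print(episode)
print([round(a, 1) for a in independent])
```

In the per-episode scheme all images being compared share an orientation, whereas with per-image angles the net also has to relate images that are rotated differently from each other.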

In [8]:
# todo: size needs to be 84 because of the nn dimensions, need to work out a way to infer the proper dimensions/view from size
rn = ImageRelationNetwork( df=train_df, val_df=val_df, data_dir=im_dir, model_dir=model_dir, data_num_dims=3, padding=0, shuffle=True, tsf=tsf, size=84 )

In [9]:
rn.train( print_on=100 )

Beginning train
Scoring on validation set...
Validation set accuracy: 0.20073333333333332
Saving model
Episode: 100 Loss: 0.1599491983652115
Episode: 200 Loss: 0.1574077606201172
Episode: 300 Loss: 0.15948539972305298
Episode: 400 Loss: 0.1593618094921112
Episode: 500 Loss: 0.15747348964214325
Episode: 600 Loss: 0.1575508862733841
Episode: 700 Loss: 0.16456301510334015
Episode: 800 Loss: 0.15151704847812653
Episode: 900 Loss: 0.1609458327293396
Episode: 1000 Loss: 0.16191701591014862
Episode: 1100 Loss: 0.15128806233406067
Episode: 1200 Loss: 0.15386083722114563
Episode: 1300 Loss: 0.16425058245658875
Episode: 1400 Loss: 0.15359683334827423
Episode: 1500 Loss: 0.1600746512413025
Episode: 1600 Loss: 0.15862910449504852
Episode: 1700 Loss: 0.14875391125679016
Episode: 1800 Loss: 0.15047527849674225
Episode: 1900 Loss: 0.14869792759418488
Episode: 2000 Loss: 0.16144363582134247
Episode: 2100 Loss: 0.1498306542634964
Episode: 2200 Loss: 0.15337921679019928
Episode: 2300 Loss: 0.14599832892

Episode: 19700 Loss: 0.11790784448385239
Episode: 19800 Loss: 0.12241291999816895
Episode: 19900 Loss: 0.12180857360363007
Episode: 20000 Loss: 0.1167541891336441
Scoring on validation set...
Validation set accuracy: 0.39681333333333335
Saving model
Episode: 20100 Loss: 0.12754984200000763
Episode: 20200 Loss: 0.12351611256599426
Episode: 20300 Loss: 0.11808043718338013
Episode: 20400 Loss: 0.1522122323513031
Episode: 20500 Loss: 0.12074363976716995
Episode: 20600 Loss: 0.13170786201953888
Episode: 20700 Loss: 0.12210224568843842
Episode: 20800 Loss: 0.12174086272716522
Episode: 20900 Loss: 0.12744639813899994
Episode: 21000 Loss: 0.1485367864370346
Episode: 21100 Loss: 0.11839252710342407
Episode: 21200 Loss: 0.14569413661956787
Episode: 21300 Loss: 0.123101145029068
Episode: 21400 Loss: 0.12222810089588165
Episode: 21500 Loss: 0.1347251534461975
Episode: 21600 Loss: 0.11882774531841278
Episode: 21700 Loss: 0.10868573188781738
Episode: 21800 Loss: 0.13400529325008392
[output truncated]

Episode: 39000 Loss: 0.11465155333280563
Episode: 39100 Loss: 0.0977013111114502
Episode: 39200 Loss: 0.11200759559869766
Episode: 39300 Loss: 0.13735944032669067
Episode: 39400 Loss: 0.1168680414557457
Episode: 39500 Loss: 0.10081189125776291
Episode: 39600 Loss: 0.1139603704214096
Episode: 39700 Loss: 0.10008884966373444
Episode: 39800 Loss: 0.13146212697029114
Episode: 39900 Loss: 0.09883448481559753
Episode: 40000 Loss: 0.13082613050937653
Scoring on validation set...
Validation set accuracy: 0.42396
Episode: 40100 Loss: 0.12396571040153503
Episode: 40200 Loss: 0.1286865770816803
Episode: 40300 Loss: 0.11933986097574234
Episode: 40400 Loss: 0.11182323098182678
Episode: 40500 Loss: 0.1020120307803154
Episode: 40600 Loss: 0.10444393008947372
Episode: 40700 Loss: 0.11743707209825516
Episode: 40800 Loss: 0.10420970618724823
Episode: 40900 Loss: 0.1081407442688942
Episode: 41000 Loss: 0.09445800632238388
Episode: 41100 Loss: 0.10195919871330261
Episode: 41200 Loss: 0.10348039865493774
[output truncated]

Episode: 58300 Loss: 0.11414483934640884
Episode: 58400 Loss: 0.10541059076786041
Episode: 58500 Loss: 0.1321871280670166
Episode: 58600 Loss: 0.09112986922264099
Episode: 58700 Loss: 0.10765879601240158
Episode: 58800 Loss: 0.14407476782798767
Episode: 58900 Loss: 0.0810224786400795
Episode: 59000 Loss: 0.09004433453083038
Episode: 59100 Loss: 0.08513959497213364
Episode: 59200 Loss: 0.11329155415296555
Episode: 59300 Loss: 0.09473826736211777
Episode: 59400 Loss: 0.08527130633592606
Episode: 59500 Loss: 0.07199623435735703
Episode: 59600 Loss: 0.10287430137395859
Episode: 59700 Loss: 0.08081746101379395
Episode: 59800 Loss: 0.10817600786685944
Episode: 59900 Loss: 0.10789787024259567
Episode: 60000 Loss: 0.0900472104549408
Scoring on validation set...
Validation set accuracy: 0.44552
Saving model
Episode: 60100 Loss: 0.07124406099319458
Episode: 60200 Loss: 0.0971679612994194
Episode: 60300 Loss: 0.07859226316213608
Episode: 60400 Loss: 0.11741618067026138
Episode: 60500 Loss: 0.1483

Episode: 77800 Loss: 0.08294432610273361
Episode: 77900 Loss: 0.08510445058345795
Episode: 78000 Loss: 0.07954765111207962
Episode: 78100 Loss: 0.06435859948396683
Episode: 78200 Loss: 0.10014348477125168
Episode: 78300 Loss: 0.09211694449186325
Episode: 78400 Loss: 0.058398183435201645
Episode: 78500 Loss: 0.08468029648065567
Episode: 78600 Loss: 0.08199609071016312
Episode: 78700 Loss: 0.1035439670085907
Episode: 78800 Loss: 0.10910926014184952
Episode: 78900 Loss: 0.10390175133943558
Episode: 79000 Loss: 0.08561650663614273
Episode: 79100 Loss: 0.08996885269880295
Episode: 79200 Loss: 0.0949498787522316
Episode: 79300 Loss: 0.08980295062065125
Episode: 79400 Loss: 0.0617012158036232
Episode: 79500 Loss: 0.07126345485448837
Episode: 79600 Loss: 0.09499789774417877
Episode: 79700 Loss: 0.06762786209583282
Episode: 79800 Loss: 0.0917576402425766
Episode: 79900 Loss: 0.10007762908935547
Episode: 80000 Loss: 0.0838574543595314
Scoring on validation set...
Validation set accuracy: 0.43974

Episode: 97200 Loss: 0.06025364622473717
Episode: 97300 Loss: 0.10015638172626495
Episode: 97400 Loss: 0.0840829461812973
Episode: 97500 Loss: 0.10319524258375168
Episode: 97600 Loss: 0.06907015293836594
Episode: 97700 Loss: 0.10065454244613647
Episode: 97800 Loss: 0.0638587549328804
Episode: 97900 Loss: 0.10316804051399231
Episode: 98000 Loss: 0.08068301528692245
Episode: 98100 Loss: 0.07418342679738998
Episode: 98200 Loss: 0.05160099267959595
Episode: 98300 Loss: 0.0761108323931694
Episode: 98400 Loss: 0.08698144555091858
Episode: 98500 Loss: 0.08840814977884293
Episode: 98600 Loss: 0.10412349551916122
Episode: 98700 Loss: 0.061697788536548615
Episode: 98800 Loss: 0.06349963694810867
Episode: 98900 Loss: 0.09987089037895203
Episode: 99000 Loss: 0.09823610633611679
Episode: 99100 Loss: 0.0719592347741127
Episode: 99200 Loss: 0.06052582338452339
Episode: 99300 Loss: 0.06018393114209175
Episode: 99400 Loss: 0.108957938849926
Episode: 99500 Loss: 0.0778040662407875
[output truncated]
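
A reference point for reading this log: the starting accuracy of ~0.20 is chance level for 5 classes, and the starting loss of ~0.16 is what a net that outputs a constant score would get. The arithmetic below assumes a mean squared error loss against 0/1 relation targets, as in the original relation network paper, and balanced 5-way episodes; the notebook itself does not show the loss function, so treat this as an assumption.

```python
def constant_score_mse(n_way, score):
    """MSE of always predicting `score` when 1 in n_way pairs is a true match."""
    positives = 1.0 / n_way
    return positives * (1.0 - score) ** 2 + (1.0 - positives) * score ** 2


# Always predicting 1/5: 0.2 * 0.8**2 + 0.8 * 0.2**2, approximately 0.16
baseline = constant_score_mse(5, 0.2)
chance_accuracy = 1.0 / 5
print(round(baseline, 4), chance_accuracy)
```

Under those assumptions, losses below roughly 0.16 indicate the net is doing better than an uninformed constant guess.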

The previous cohort's SotA result for clothing category with binary classification was 0.683. Clearly we are not beating this, although that's not surprising. Given the amount of data we have, and the interactions with other labels (e.g. sleeve length, clothing pattern, etc.) that we are ignoring, it is to be expected that a standard ResNet classifier outperforms the relation net. The ~45% result we obtain here is about what we would expect, as the original paper achieved around 50% on a subset of ImageNet. The two datasets are broadly similar in that they contain a lot of background noise per image and the commonalities between classes are more nuanced.

Having said that, with a proper segmentation model the relation net could focus more on the shape, colour, etc. of the clothes without being distracted by the background, and we may well achieve better results, closer to the Omniglot benchmark, which achieves ~99% accuracy. The same could be true for the ResNet classifier, however, so it remains to be seen which approach is better.

The real value of relation nets, however, is in few-shot analysis. Since we don't have the similar.ai dataset at the time of writing, we are still unsure which approaches are even available to us, and it may be the case that we _have_ to use relation nets.

We may also be able to improve accuracy, as mentioned above, with separate heads for each label and loss interaction terms.