# BERT Model for Outfits

Train a BERT-style model on outfit data (positive examples only) so that the model can be used for transfer learning.

1. Creating BERT-style training instances is straightforward, but it is unclear how to handle the [CLS], [SEP] and [MASK] tokens, because they mix textual tokens with images. We could learn a separate embedding for each of these three tokens, but how should they be combined with the item images, which either have a predefined embedding or an embedding learnt in a different way?

2. The vocabulary is large, on the order of 200,000 items. The masked language model (MLM) output would be a softmax over this dimension.

3. An outfit has no inherent item order, so MLM probably does not make sense. Fill-in-the-blank (FITB) is a better option.
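
A fill-in-the-blank instance can be built from positive outfits alone, which fits the positive-only setup described above. The sketch below is a minimal illustration and not the pipeline used in this notebook: `make_fitb_instance`, the number of candidates and the toy item ids are all assumptions made for the example, and distractors are drawn uniformly at random without regard to item category.

```python
import random

def make_fitb_instance(outfit, item_pool, num_candidates=4, rng=None):
    """Build one fill-in-the-blank question from a positive outfit.

    Hypothetical helper: removes one item from the outfit and mixes it
    with randomly drawn distractors that are not part of the outfit.
    """
    rng = rng or random.Random()
    blank_pos = rng.randrange(len(outfit))
    answer = outfit[blank_pos]
    context = outfit[:blank_pos] + outfit[blank_pos + 1:]

    # Distractors come from the rest of the catalogue.
    outfit_items = set(outfit)
    negatives = [item for item in item_pool if item not in outfit_items]
    candidates = rng.sample(negatives, num_candidates - 1) + [answer]
    rng.shuffle(candidates)
    return {"context": context,
            "candidates": candidates,
            "label": candidates.index(answer)}

outfit = ["top_1", "jeans_7", "boots_3"]  # toy item ids
item_pool = outfit + [f"item_{i}" for i in range(20)]
instance = make_fitb_instance(outfit, item_pool, rng=random.Random(0))
print(instance)
```

With instances of this shape, a model only has to rank a handful of candidates against the remaining items, so the output is a softmax over `num_candidates` entries instead of the full vocabulary.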

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import tensorflow as tf
import os
import json
import pickle
import numpy as np
from tqdm import tqdm 
from argparse import Namespace
import pandas as pd

In [3]:
data_file = "/recsys_data/RecSys/fashion/polyvore-dataset/polyvore_outfits/nondisjoint/bert_train.txt"
vocab_file = "/recsys_data/RecSys/fashion/polyvore-dataset/polyvore_outfits/nondisjoint/vocab.txt"

In [4]:
max_seq_length = 8
max_predictions_per_seq = 3

def decode_fn(record_bytes):
    return tf.io.parse_single_example(
      # Data
      record_bytes,
      # Schema
      {
        "input_ids":
            tf.io.FixedLenFeature([max_seq_length], tf.int64),
        "input_mask":
            tf.io.FixedLenFeature([max_seq_length], tf.int64),
        "segment_ids":
            tf.io.FixedLenFeature([max_seq_length], tf.int64),
        "masked_lm_positions":
            tf.io.FixedLenFeature([max_predictions_per_seq], tf.int64),
        "masked_lm_ids":
            tf.io.FixedLenFeature([max_predictions_per_seq], tf.int64),
        "masked_lm_weights":
            tf.io.FixedLenFeature([max_predictions_per_seq], tf.float32),
        "next_sentence_labels":
            tf.io.FixedLenFeature([1], tf.int64),
      })

In [5]:
dataset = tf.data.TFRecordDataset([data_file])
dataset = dataset.map(decode_fn)
dataset = dataset.batch(32)

In [6]:
data_list = list(dataset.as_numpy_iterator())

# To inspect a single decoded batch instead (dataset is already mapped with decode_fn):
# for batch in dataset.take(1):
#     print(batch)

In [7]:
x = data_list[0]
(x['input_ids'].shape, x['input_mask'].shape, x['masked_lm_ids'].shape,
 x['masked_lm_positions'].shape, x['masked_lm_weights'].shape,
 x['next_sentence_labels'].shape, x['segment_ids'].shape)

((32, 8), (32, 8), (32, 3), (32, 3), (32, 3), (32, 1), (32, 8))

In [8]:
x['input_ids']

array([[204679,  72700, 204681,  78017, 204680, 204681,  30019, 204680],
       [204679, 130657, 204681, 204681, 204680,  56228, 154398, 204680],
       [204679, 204681,  49819,  45150, 204680, 204681,  67973, 204680],
       [204679,  20558, 204681, 204681, 204680, 184611, 131288, 204680],
       [204679, 204681, 100139,  97137, 204680, 204681,  89707, 204680],
       [204679,  95687, 204681, 204681, 204680,  43880, 167625, 204680],
       [204679,  45008, 204681, 204681, 204680, 165946,  96253, 204680],
       [204679,  39454, 204681,  36244, 204680,  32220, 204681, 204680],
       [204679,  49575, 204681,  18771, 204680, 126723,  28309, 204680],
       [204679,  84908, 204681, 156368, 204680, 157835,  12329, 204680],
       [204679,  64614, 204681, 204681, 204680,  12345, 161319, 204680],
       [204679,  44450, 162621, 204681, 204680, 132762, 204681, 204680],
       [204679, 154831, 203890,  21537, 204680,  69976, 204681, 204680],
       [204679,  34198, 204681, 190788, 204680, 184

In [21]:
x['masked_lm_ids']

array([[ 65694, 113711,      0],
       [  7186,  33879,      0],
       [143307, 169821,      0],
       [ 25213, 153447,      0],
       [144736,  18787,      0],
       [ 56758, 118579,      0],
       [167311,  76142,      0],
       [172050,  81976,      0],
       [103198,  28309,      0],
       [119081, 156368,      0],
       [146483, 100679,      0],
       [ 53418, 150250,      0],
       [183679,  69688,      0],
       [130919,   7954,      0],
       [ 73044,  93101,      0],
       [ 32513, 178625,      0],
       [202741, 118139,      0],
       [ 22550,  98684,      0],
       [153657, 151641,      0],
       [195954,  68319,      0],
       [166237,  23611,      0],
       [ 91954,  20964,      0],
       [183519, 157219,      0],
       [103405,  24536,      0],
       [153684,  96338,      0],
       [ 39237, 169344,      0],
       [  2263,  14365,      0],
       [ 12869,  46117,      0],
       [ 75988,  96779,      0],
       [  5359,  68048,      0],
       [ 7

In [22]:
x['masked_lm_positions']

array([[2, 5, 0],
       [2, 3, 0],
       [1, 5, 0],
       [2, 3, 0],
       [1, 5, 0],
       [2, 3, 0],
       [2, 3, 0],
       [2, 6, 0],
       [2, 6, 0],
       [2, 3, 0],
       [2, 3, 0],
       [3, 6, 0],
       [1, 6, 0],
       [2, 6, 0],
       [3, 5, 0],
       [3, 6, 0],
       [2, 3, 0],
       [1, 6, 0],
       [5, 6, 0],
       [1, 5, 0],
       [1, 4, 0],
       [3, 6, 0],
       [2, 3, 0],
       [1, 3, 0],
       [2, 3, 0],
       [3, 5, 0],
       [2, 5, 0],
       [1, 6, 0],
       [1, 3, 0],
       [1, 4, 0],
       [1, 6, 0],
       [1, 6, 0]])
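
Taken together, `masked_lm_positions`, `masked_lm_ids` and `masked_lm_weights` describe which slots were selected for prediction and what they originally held. As a minimal sketch, the first row printed above can be un-masked as follows. The weight vector is an assumption (this notebook does not print `masked_lm_weights`), chosen so that the zero-padded third slot is ignored; `restore_masked` is a hypothetical helper name.

```python
def restore_masked(input_ids, positions, label_ids, weights):
    """Return a copy of input_ids with the masked slots filled back in."""
    restored = list(input_ids)
    for pos, label, weight in zip(positions, label_ids, weights):
        if weight > 0:  # weight 0 marks a padded prediction slot
            restored[pos] = label
    return restored

# First row of the arrays shown above.
input_ids = [204679, 72700, 204681, 78017, 204680, 204681, 30019, 204680]
positions = [2, 5, 0]
label_ids = [65694, 113711, 0]
weights = [1.0, 1.0, 0.0]  # assumed values, not printed in the notebook

restored = restore_masked(input_ids, positions, label_ids, weights)
print(restored)
# [204679, 72700, 65694, 78017, 204680, 113711, 30019, 204680]
```

Without the weight check, the padded slot `(position 0, id 0)` would overwrite the first token of the row.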

In [23]:
x['segment_ids']

array([[0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 1, 1, 1, 1],
       [0,

In [9]:
vocab_file = "/recsys_data/RecSys/fashion/polyvore-dataset/polyvore_outfits/nondisjoint/vocab.txt"
token_dict, inv_dict = {}, {}
with open(vocab_file, 'r') as fr:
    for line in fr:
        k, v = line.strip().split()
        token_dict[k] = int(v)
        inv_dict[int(v)] = k
    

In [10]:
inv_dict[204679]

'[CLS]'
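
Token ids in this vocabulary reach at least 204681, which gives a sense of how large a full-vocabulary softmax layer would be. The arithmetic below is a rough sketch under two assumptions that the notebook does not confirm at this point: ids are contiguous from 0 to 204681, and the output layer is a plain dense layer on a 768-dimensional hidden state that is not tied to any input embedding table. `mlm_output_layer_size` is a hypothetical helper.

```python
def mlm_output_layer_size(vocab_size, hidden_size, bytes_per_param=4):
    """Parameter count and memory of a dense vocabulary-sized output layer."""
    params = vocab_size * hidden_size + vocab_size  # weights + bias
    return params, params * bytes_per_param

# Assumptions: ids are contiguous from 0 to 204681, hidden size is 768.
vocab_size = 204681 + 1
params, n_bytes = mlm_output_layer_size(vocab_size, hidden_size=768)
print(f"{params:,} parameters, {n_bytes / 2**20:.0f} MiB in float32")
```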

In [11]:
embed_dir = "/recsys_data/RecSys/fashion/polyvore-dataset/precomputed"
image_embedding_file = os.path.join(embed_dir, "effnet_tuned_polyvore.pkl")
with open(image_embedding_file, "rb") as fr:
    embedding_dict = pickle.load(fr)


In [12]:
# Random (untrained) placeholder embeddings for the three special tokens
extra_embedding = np.random.uniform(size=(3,1280))
embedding_dict["[CLS]"] = extra_embedding[0]
embedding_dict["[SEP]"] = extra_embedding[1]
embedding_dict["[MASK]"] = extra_embedding[2]

In [13]:
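def convert_to_input(inps):
    """Map a (batch, seq_len) array of token ids to a (batch, seq_len, 1280) array of embeddings."""
    y = []
    for ii in range(inps.shape[0]):
        toks = inps[ii,:]
        toks = [inv_dict[jj] for jj in toks]  # token id -> token string
        vs = np.array([embedding_dict[t] for t in toks])  # token string -> embedding
        y.append(vs)
    return np.stack(y)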

In [14]:
y = convert_to_input(x['input_ids'])
y.shape

(32, 8, 1280)

## Train a Transformer Model

In [3]:
base_dir = "/recsys_data/RecSys/fashion/polyvore-dataset/polyvore_outfits"
data_type = "nondisjoint" # "nondisjoint", "disjoint"
train_dir = os.path.join(base_dir, data_type)
image_dir = os.path.join(base_dir, "images")
embed_dir = "/recsys_data/RecSys/fashion/polyvore-dataset/precomputed"
train_json = "train.json"
valid_json = "valid.json"
test_json = "test.json"

train_file = "compatibility_train.txt"
valid_file = "compatibility_valid.txt"
test_file = "compatibility_test.txt"
item_file = "polyvore_item_metadata.json"
outfit_file = "polyvore_outfit_titles.json"

In [4]:
with open(os.path.join(train_dir, train_json), 'r') as fr:
    train_pos = json.load(fr)
    
with open(os.path.join(train_dir, valid_json), 'r') as fr:
    valid_pos = json.load(fr)
    
with open(os.path.join(train_dir, test_json), 'r') as fr:
    test_pos = json.load(fr)
    
with open(os.path.join(base_dir, item_file), 'r') as fr:
    pv_items = json.load(fr)
    
with open(os.path.join(base_dir, outfit_file), 'r') as fr:
    pv_outfits = json.load(fr)
print(f"Total {len(train_pos)}, {len(valid_pos)}, {len(test_pos)} outfits in train, validation and test split, respectively")

Total 53306, 5000, 10000 outfits in train, validation and test split, respectively


In [5]:
def read_compatibility_file(path):
    """Read one compatibility file: the first field of each line goes to y, the rest to X."""
    X, y = [], []
    with open(path, 'r') as fr:
        for line in fr:
            elems = line.strip().split()
            y.append(elems[0])   # label, kept as a string
            X.append(elems[1:])  # remaining fields: the items of the example
    return X, y

train_X, train_y = read_compatibility_file(os.path.join(train_dir, train_file))
valid_X, valid_y = read_compatibility_file(os.path.join(train_dir, valid_file))
test_X, test_y = read_compatibility_file(os.path.join(train_dir, test_file))

print(f"Total {len(train_X)}, {len(valid_X)}, {len(test_X)} examples in train, validation and test split, respectively")

Total 106612, 10000, 20000 examples in train, validation and test split, respectively


In [6]:
# Map each item reference in the compatibility files to its item_id.
item_dict = {}
for outfits, examples in [(train_pos, train_X), (valid_pos, valid_X), (test_pos, test_X)]:
    for ii, outfit in enumerate(outfits):
        items = outfit['items']
        mapped = examples[ii]
        item_dict.update({jj:kk['item_id'] for jj, kk in zip(mapped, items)})
    print(len(item_dict))  # running total after each split

284767
311548
365054


In [7]:
max_seq_length = 8
batch_size = 32

In [8]:
from data_process import BertDataGen

extra_embedding = np.random.uniform(size=(1,1280))
train_gen = BertDataGen(positive_samples=train_pos,
                        item_description=pv_items, 
                        extra_embedding=extra_embedding,
                        max_samples_per_example=1,
                        image_embedding_file=os.path.join(embed_dir, "effnet_tuned_polyvore.pkl"),
                        max_items=max_seq_length,
                        batch_size=batch_size,
                       )

valid_gen = BertDataGen(positive_samples=valid_pos,
                        item_description=pv_items, 
                        extra_embedding=extra_embedding,
                        max_samples_per_example=1,
                        image_embedding_file=os.path.join(embed_dir, "effnet_tuned_polyvore.pkl"),
                        max_items=max_seq_length,
                        batch_size=batch_size,
                       )


Total 154 item categories


 94%|█████████▍| 49999/53306 [01:14<00:04, 670.91it/s]


Total 100000 examples
Total 154 item categories


100%|██████████| 5000/5000 [00:07<00:00, 662.14it/s]


Total 10000 examples


In [9]:
for ii in range(4):
    inps, targs = train_gen[ii]
    print(inps.shape, targs.shape)

(32, 8, 1280) (32,)
(32, 8, 1280) (32,)
(32, 8, 1280) (32,)
(32, 8, 1280) (32,)


In [20]:
with open("bert_config.json", 'r') as fr:
    bert_config = json.load(fr)

# add some extra parameters
bert_config["model_name"] = "rnn"
bert_config["inp_seq_len"] = max_seq_length
bert_config["inp_dim"] = 1280
bert_config["d_model"] = 768 # 128
bert_config["include_text"] = False
bert_config["text_feature_dim"] = 768
bert_config["image_embedding_dim"] = 1280
bert_config["include_item_categories"] = False
bert_config["image_data_type"] = "embedding"

bert_config["num_attention_heads"] = 12
bert_config["dff"] = 768 # 128
bert_config["seed_value"] = 100
bert_config["embedding_activation"] = "linear"
bert_config["rate"] = 0.1
bert_config["final_activation"] = "sigmoid"

# Training
bert_config["learning_rate"] = 1e-04
bert_config["epochs"] = 100
bert_config["batch_size"] = 32
bert_config["patience"] = 5

bert_config = Namespace(**bert_config)
bert_config

Namespace(attention_probs_dropout_prob=0.1, batch_size=32, d_model=768, dff=768, embedding_activation='linear', epochs=100, final_activation='sigmoid', hidden_act='gelu', hidden_dropout_prob=0.1, hidden_size=768, image_data_type='embedding', image_embedding_dim=1280, include_item_categories=False, include_text=False, initializer_range=0.02, inp_dim=1280, inp_seq_len=8, intermediate_size=3072, learning_rate=0.0001, max_position_embeddings=512, model_name='rnn', num_attention_heads=12, num_hidden_layers=12, patience=5, rate=0.1, seed_value=100, text_feature_dim=768, type_vocab_size=2, vocab_size=30522)

In [21]:
from bert_modeling import BertModel

bml = BertModel(bert_config)
bml.model.summary()

Model: "rnn"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 8, 1280)]         0         
_________________________________________________________________
tf_op_layer_Sum (TensorFlowO [(None, 8)]               0         
_________________________________________________________________
tf_op_layer_NotEqual (Tensor [(None, 8)]               0         
_________________________________________________________________
bidirectional (Bidirectional (None, 8, 1536)           12589056  
_________________________________________________________________
bidirectional_1 (Bidirection (None, 8, 1536)           14161920  
_________________________________________________________________
batch_normalization (BatchNo (None, 8, 1536)           6144      
_________________________________________________________________
permute (Permute)            (None, 1536, 8)           0       
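
The parameter counts in the summary above can be checked by hand. They are consistent with two bidirectional LSTM layers of 768 units each, followed by batch normalization over 1536 channels. The layer type and unit count are inferred from the numbers and are not confirmed by the model code, which is not shown here; the helper names below are hypothetical.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM: 4 gates, each with input, recurrent and bias weights."""
    return 4 * (units * (input_dim + units) + units)

def bidirectional_lstm_params(input_dim, units):
    """A bidirectional wrapper holds two independent LSTMs."""
    return 2 * lstm_params(input_dim, units)

units = 768  # assumed: half of the 1536-wide concatenated output
first = bidirectional_lstm_params(1280, units)        # input is the 1280-d image embedding
second = bidirectional_lstm_params(2 * units, units)  # input is the 1536-d output of the first layer
batch_norm = 4 * (2 * units)  # gamma, beta, moving mean, moving variance per channel

print(first)       # 12589056
print(second)      # 14161920
print(batch_norm)  # 6144
```

All three values match the `Param #` column, which suggests that with `model_name = "rnn"` the trained network is recurrent and does not use a Transformer encoder.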

In [22]:
history = bml.train(train_gen, valid_gen)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100


In [23]:
from data_process import get_polyvore_data, get_zalando_data

In [24]:
pv_config = {"batch_size": 32, "max_seq_len": 8, "include_text": False, 
             "image_embedding_dim": 1280, 
             "image_embedding_file": os.path.join(embed_dir, "effnet_tuned_polyvore.pkl"),
             "text_embedding_file": os.path.join(embed_dir, "bert_polyvore.pkl"),
             "text_embedding_dim": 768,
             "include_item_categories": False,
             "image_data_type": "embedding",
             "add_cls": True,
             "extra_embedding": extra_embedding,
            }
pv_config = Namespace(**pv_config)
pv_train, pv_valid, pv_test = get_polyvore_data(pv_config)

Total 53306, 5000, 10000 outfits in train, validation and test split, respectively
Total 106612, 10000, 20000 examples in train, validation and test split, respectively
284767
311548
365054


In [25]:
def get_accuracy_auc(data_gen, model):
    """Print accuracy and AUC of `model` over all batches of `data_gen`."""
    m = tf.keras.metrics.BinaryAccuracy()
    m2 = tf.keras.metrics.AUC()
    acc_list = []
    pbar = tqdm(range(len(data_gen)))
    ys, yhats = [], []
    for ii in pbar:
        x, y = data_gen[ii]  # one batch
        yhat = model(x)
        m.update_state(y, yhat)
        # m is never reset, so this is the running accuracy over all batches so far
        batch_acc = m.result().numpy()
        acc_list.append(batch_acc)
        pbar.set_description("Batch accuracy %g" % batch_acc)
        ys.append(y)
        yhats.append(yhat)
    # Mean of the running accuracies, not the accuracy over the whole set
    print(f"Average Accuracy: {np.mean(acc_list)}")
    big_y = np.concatenate(ys, axis=0)
    big_yh = np.concatenate(yhats, axis=0)
    m2.update_state(big_y, big_yh)
    auc = m2.result().numpy()
    print(f"AUC: {auc}")
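
Keras metric objects accumulate state across `update_state` calls until they are reset, so `m.result()` inside the loop of `get_accuracy_auc` is a cumulative accuracy, and the printed "Average Accuracy" is the mean of those cumulative values. The toy sketch below (pure Python, made-up batch counts, hypothetical helper name) shows how that mean can differ from the accuracy over the whole set.

```python
def running_accuracies(batch_stats):
    """Cumulative accuracy after each batch, given (num_correct, batch_size) pairs."""
    correct = total = 0
    out = []
    for num_correct, batch_size in batch_stats:
        correct += num_correct
        total += batch_size
        out.append(correct / total)
    return out

batch_stats = [(1, 4), (3, 4), (4, 4)]  # toy numbers, not from the notebook
running = running_accuracies(batch_stats)
mean_of_running = sum(running) / len(running)  # what "Average Accuracy" reports
overall = running[-1]                          # accuracy over all 12 examples
print(running, mean_of_running, overall)
```

In this toy case the overall accuracy is 8/12 while the mean of the running values is lower, because early batches are counted several times. The last value shown in the progress bar is the overall accuracy.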


In [26]:
get_accuracy_auc(pv_valid, bml.model)
get_accuracy_auc(pv_test, bml.model)
get_accuracy_auc(pv_train, bml.model)

Batch accuracy 0.626298: 100%|██████████| 313/313 [00:29<00:00, 10.79it/s]


Average Accuracy: 0.6173496246337891
AUC: 0.6748557686805725


Batch accuracy 0.6286: 100%|██████████| 625/625 [00:57<00:00, 10.95it/s]  


Average Accuracy: 0.6261652112007141
AUC: 0.6724649667739868


Batch accuracy 0.637434: 100%|██████████| 3332/3332 [05:09<00:00, 10.77it/s]


Average Accuracy: 0.6388095021247864
AUC: 0.6875200867652893


In [27]:
from data_process import get_zalando_data

zd_config = {"batch_size": 32, "max_seq_len": 8, "include_text": False, 
             "image_embedding_dim": 1280, 
             "image_embedding_file": "/recsys_data/RecSys/Zalando_Outfit/female/Outfit_Data/precomputed/effnet2_zalando.pkl",
             "text_embedding_file": os.path.join(embed_dir, "bert_polyvore.pkl"),
             "text_embedding_dim": 768,
             "include_item_categories": False,
             "image_data_type": "embedding",
             "add_cls": True,
             "extra_embedding": extra_embedding,
            }
zd_config = Namespace(**zd_config)
zd_train, zd_valid, zd_test = get_zalando_data(zd_config)

Total 272541, 7479, 17427 examples in train, validation and test split, respectively
90847 training examples, average 3.89 items, max 8 item
2493 validation examples, average 4.12 items, max 7 items
5809 test examples, average 4.13 items, max 7 item


In [28]:
get_accuracy_auc(zd_valid, bml.model)
get_accuracy_auc(zd_test, bml.model)
get_accuracy_auc(zd_train, bml.model)

Batch accuracy 0.432141: 100%|██████████| 234/234 [00:21<00:00, 10.84it/s]


Average Accuracy: 0.43026426434516907
AUC: 0.5008366703987122


Batch accuracy 0.43575: 100%|██████████| 545/545 [00:49<00:00, 10.96it/s] 


Average Accuracy: 0.4426589012145996
AUC: 0.5023459196090698


Batch accuracy 0.427473: 100%|██████████| 8517/8517 [13:05<00:00, 10.84it/s]


Average Accuracy: 0.42716798186302185
AUC: 0.5053552985191345
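
An AUC of about 0.50 on all three Zalando splits means the Polyvore-trained model ranks positive and negative Zalando examples no better than chance. One way to read the number is as the fraction of (positive, negative) pairs in which the positive example gets the higher score. The sketch below implements that pairwise definition in pure Python on toy data; `pairwise_auc` is a hypothetical helper. `tf.keras.metrics.AUC` instead approximates the area from a fixed set of thresholds, so its values can differ slightly from this exact computation.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 0]
print(pairwise_auc(labels, [0.9, 0.1, 0.8, 0.3]))  # 1.0, perfect ranking
print(pairwise_auc(labels, [0.5, 0.5, 0.5, 0.5]))  # 0.5, uninformative scores
```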


In [25]:
pd.DataFrame({"Polyvore-ND": {"train-AUC": 0.738, "valid-AUC": 0.719, "test-AUC": 0.721},
              "Zalando": {"train-AUC": 0.50, "valid-AUC": 0.50, "test-AUC": 0.50},
             })

           Polyvore-ND  Zalando
train-AUC        0.738      0.5
valid-AUC        0.719      0.5
test-AUC         0.721      0.5
