#### Env settings

Get ready! Find all the details to set up your machine in the *set-up/set_up.ipynb* Jupyter notebook.

---

#### Libraries

In [None]:
import os
import sys

import random
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

from IPython.display import display

#### Config variables

In [None]:
# set your local path to the labcamp directory 
# ...
path = \
 {
     "log": os.path.join(home, "log"),
     "udf": os.path.join(home, "src"),
     "data": os.path.join(home, "data"),
     "model": os.path.join(home, "models")
 }

In [None]:
sys.path.insert(0, path["udf"])

import ipynb.fs.full.utils as utils

def log_info(logger, 
             info_string,
             end=None,
             video_print=True):
    logger.info(info_string)
    if video_print:
        print(info_string, end=end)

In [None]:
# set the file where we'll save the progress during the model training
logger, logfile_name = utils.LogFile(directory=path["log"]).get_logfile()
log_info(logger, "NEXT OUTFIT LABCAMP - Model training\n", video_print=False)

# _AmazonFashion_ Dataset

In [None]:
# Load data
clock_start = datetime.now()
# allow_pickle=True is required by recent NumPy versions to load object arrays
dataset = np.load(os.path.join(path["data"], "AmazonFashion6ImgPartitioned.npy"), 
                  encoding="bytes", allow_pickle=True)
[user_train, user_validation, user_test, items, user_num, item_num] = dataset

process_duration = datetime.now() - clock_start
log_info(logger, "Loading data took %.2f seconds" % (process_duration.total_seconds()))

* It consists of reviews of clothing items crawled from _Amazon.com_
* It contains six representative fashion categories (men/women’s tops, bottoms and shoes)
* We treat users’ reviews as ***implicit feedback***:
    > If an item $i$ has been reviewed, then $i$ is referred to as *observed* and is assigned a higher preference score than a *not-observed* item $j$
* __For data preprocessing, inactive users _u_ (for whom $|I_u^+| < 5$) have been discarded.__
* __For each user, one action for validation and another for testing have been withheld at random. All remaining items are used for training.__
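
The preprocessing described above can be sketched on toy data. This is only an illustration under stated assumptions, not the code that produced the dataset: the function name `leave_one_out_split`, the toy users and the item ids are all hypothetical.

```python
import random

def leave_one_out_split(user_items, min_items=5, seed=0):
    """Toy sketch: drop inactive users, then hold out one item each for
    validation and test; everything else goes to training."""
    rng = random.Random(seed)
    train, validation, test = {}, {}, {}
    # keep only users with at least `min_items` reviewed items
    active_users = [u for u, its in user_items.items() if len(its) >= min_items]
    for new_idx, u in enumerate(active_users):
        shuffled = list(user_items[u])
        rng.shuffle(shuffled)
        validation[new_idx] = [shuffled[0]]   # one withheld item
        test[new_idx] = [shuffled[1]]         # another withheld item
        train[new_idx] = shuffled[2:]         # all remaining items
    return train, validation, test

# Hypothetical users: "B" has fewer than 5 items and is discarded.
toy = {"A": [1, 2, 3, 4, 5, 6], "B": [7, 8], "C": [1, 3, 5, 7, 9]}
train, validation, test = leave_one_out_split(toy)
print(len(train), [len(v) for v in train.values()])
```

Users are re-indexed from 0 after filtering, which mirrors the integer keys used by ***user_train***, ***user_validation*** and ***user_test***.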

Amazon datasets are derived from [here](http://jmcauley.ucsd.edu/data/amazon/). Please cite the corresponding papers if you use the datasets. __Please note that raw images are for academic use only.__

##### Take a moment to explore the data structure. You will find that:
<img style="float:right;margin:0px 30px 10px 50px" width="50%" src="imgs/cover.jpg"/> 

> * ***user_num*** is the total number of users
* ***item_num*** is the total number of reviewed items
* ***user_train***, ***user_validation***, ***user_test*** store users' review info:
    - The user is identified by ***reviewerID*** and ***reviewerName***
    - ***user_train***, ***user_validation***, ***user_test*** are dictionaries, the key is a mapping into $[0:$ ***user_num***$]$ of the ***reviewerID*** field
    - Each element of ***user_**** is a list of some of the reviews made by the considered user: the complete set of reviews made by that user has been split in order to create training (N items), test (1 item), and validation (1 item) sets.
    - Each element of the list is a new dict storing the actual info, see the example below
* ***items*** is a dict and each element stores info about one specific item, identified by __asin__
    - The dict key is a mapping into $[0:$ ***item_num***$]$ of the ***asin*** field
    - If it exists, the ***related*** field is very interesting: it's a dict having keys ***also_bought*** and ***also_viewed***
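
To make the structure concrete, here is a minimal sketch with toy stand-ins for ***items*** and ***user_train***. All values are invented, and only a few of the fields are shown; the byte-string keys mirror the way the dataset is accessed in the cells below.

```python
# Toy stand-ins for the loaded objects; every value here is invented.
items_toy = {
    0: {b'asin': b'B000TOY01', b'title': b'Toy sneaker',
        b'categories': [[b'Clothing', b'Men'], [b'Men', b'Shoes']]},
    1: {b'asin': b'B000TOY02', b'title': b'Toy t-shirt',
        b'categories': [[b'Clothing', b'Women']]},
}
# Each user maps to a list of review dicts; b'productid' is a key of items_toy.
user_train_toy = {0: [{b'productid': 0}, {b'productid': 1}]}

decoded = []
for review in user_train_toy[0]:
    item = items_toy[review[b'productid']]
    title = item[b'title'].decode("utf-8")
    # flatten the nested category lists and remove duplicates
    cats = sorted({c.decode("utf-8") for c in sum(item[b'categories'], [])})
    decoded.append((title, cats))
    print(title, cats)
```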

In [None]:
# NB Your proxy settings may cause some problems
for item_idx in random.sample(range(len(items)), k=20):
    title = items[item_idx][b'title'].decode("utf-8")
    categories = items[item_idx][b'categories']
    cat = "; ".join(np.unique([c.decode("utf-8") for c in sum(categories, [])]))
    
    img = utils.image_displayer(items[item_idx][b'imUrl'].decode("utf-8"))
    print("Title: %s" % (title))
    print("Categories: %s" % (cat))
    display(img)

In [None]:
print("user_num: %.0f" % (user_num))
print("len of user_train: %.0f" % (len(user_train)))

In [None]:
print("item_num: %.0f" % (item_num))
print("len of items: %.0f" % (len(items)))

In [None]:
iii = []
for u in user_train.keys():
    iii += [item[b'productid'] for item in user_train[u]]
ii_train = np.unique(iii)
len(ii_train)

In [None]:
iii = []
for u in user_test.keys():
    iii += [item[b'productid'] for item in user_test[u]]
ii_test = np.unique(iii)
len(ii_test)

In [None]:
len(set(ii_test).intersection(ii_train))

In [None]:
user_train[random.randrange(user_num)]

In [None]:
idx = random.randrange(item_num)

In [None]:
items[idx].keys()

In [None]:
items[idx]

# Deep Visually-aware Fashion Recommender System

### An end-to-end approach

We'll develop an **end-to-end visually-aware ranking method** to *simultaneously extract task-guided visual features and learn user latent factors*.

<img style="float:left;margin:10px 30px 0px 30px" width="52%" src="imgs/clothes2bit.png"/> 

Our goal is to generate, for each user $u$, a **personalized ranking over items the user $u$ has not interacted with yet**.

To achieve this, 
* We set the preference predictor of a user $u$ about an item $i$ as the score given by $$x_{u,i} = \theta_u^T \Phi(X_i),$$ where $\theta_u$ is the vector of latent factors of user $u$ and $\Phi(X_i)$ is the embedding of the image $X_i$ of item $i$. 
Both $\theta_u$ and $\Phi(X_i)$ live in $\mathbb{R}^K$, the **K-dimensional latent space** whose dimensions correspond to facets of fashion style that explain **variance in users’ opinions**; the score $x_{u,i}$ itself is a scalar.


* We use **Bayesian Personalized Ranking (BPR)**, the state-of-the-art ranking optimization framework for implicit feedback, as the learning method. 
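
As a rough numerical illustration of the preference predictor, the sketch below computes $x_{u,i} = \theta_u^T \Phi(X_i)$ for one user and three items. The random vectors are hypothetical stand-ins: in the actual model $\Phi(X_i)$ comes from the convolutional network and $\theta_u$ is learned.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4                                   # toy latent dimensionality
theta_u = rng.normal(size=K)            # user latent factors (hypothetical values)
phi_items = rng.normal(size=(3, K))     # stand-in for the embeddings of 3 item images

scores = phi_items @ theta_u            # one scalar score x_{u,i} per item
ranking = np.argsort(-scores)           # item indices from most to least preferred
print(scores, ranking)
```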

<img style="float:right;margin:25px 30px 0px 43px" width="37%" src="imgs/doilikeit.png"/> 

### Training on batches via bootstrap sampling of triples

In BPR, the main idea is
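* to optimize rankings by considering **randomly-selected triplets** $$(u, i, j) = (\text{user}, \text{observed item}, \text{not-observed item})$$
* to seek to maximize an **objective function** given by $$\sum_{(u,i,j)} \ln \sigma(x_{uij}), \quad x_{uij} = x_{u,i} - x_{u,j},$$ where $\sigma$ is the sigmoid function. This is a smooth surrogate for the number of triplets in which the observed item is scored above the not-observed one, i.e. $x_{u,i} \geq x_{u,j}$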
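
A small numerical sketch may help here. Writing $x_{uij}$ for the score difference between the observed and the not-observed item of a triplet, maximizing $\sum \ln \sigma(x_{uij})$ is the same as minimizing $\sum \mathrm{softplus}(-x_{uij})$, because $\ln \sigma(x) = -\mathrm{softplus}(-x)$. The three score differences below are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    return math.log1p(math.exp(x))

# Hypothetical score differences x_uij for three sampled triplets.
x_uij = [2.0, 0.5, -1.0]

bpr_objective = sum(math.log(sigmoid(x)) for x in x_uij)   # to be maximized
softplus_loss = sum(softplus(-x) for x in x_uij)           # to be minimized
print(bpr_objective, softplus_loss)
```

The two printed numbers are equal in magnitude and opposite in sign, which is why a minimizing engine can be given the softplus form.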

Each training iteration (__epoch__) involves $B$ batches of data. For each batch, we sample a training set and a validation set.
* __Training set.__ Composed of $N = B \times$__batch_size__ users: for each user, one pair _(observed item, not-observed item)_ is randomly chosen.
* __Validation set.__ Composed of all the users that have been selected in the training set: for each user, $M$ pairs _($v$, not-observed item)_ are randomly chosen, with $v$ the single observed item stored in ***user_validation*** for the considered user.


### Performance validation via AUC

* The AUC measures the quality of a ranking based on pairwise comparisons
* The AUC is the measure that BPR-like methods are trained to optimize

Basically, we are **counting the fraction of times that the "observed" items $i$ are preferred over "non-observed" items $j$.**
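
For a single user, this counting can be sketched in plain Python as follows. The helper name `pairwise_auc` and the scores are hypothetical, and this is not the Keras metric used for training.

```python
def pairwise_auc(observed_score, not_observed_scores):
    """Fraction of sampled not-observed items scored below the observed one."""
    wins = sum(1 for s in not_observed_scores if observed_score > s)
    return wins / len(not_observed_scores)

# Hypothetical scores: the observed item beats 3 of the 4 sampled items.
auc_u = pairwise_auc(0.8, [0.1, 0.9, 0.3, 0.5])
print(auc_u)  # 0.75
```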

In [None]:
def uniform_train_validation_sample_batch(user_train_ratings,
                                          user_validation_ratings,
                                          item_images,
                                          batch_size,
                                          image_width=224,
                                          image_height=224,
                                          validation_sample_count=1000):
    """
    validation_sample_count (int): Number of not-observed items to sample to get the validation set for each user.
    """

    triplet_train_batch = {}
    triplet_validation_batch = {}
    for b in range(batch_size):
        # user id
        u = random.randrange(len(user_train_ratings))

        # training set
        i = ...                                          # >> COMPLETE HERE!
        j = ...                                          # >> COMPLETE HERE!
        
        image_i = image_translate(item_images[i][b'imgs'], 
                                  image_width, 
                                  image_height)
        image_j = image_translate(item_images[j][b'imgs'],
                                  image_width, 
                                  image_height)
        triplet_train_batch[u] = [image_i,
                                  image_j]

        # validation set
        i = ...                                          # >> COMPLETE HERE!
        image_i = image_translate(item_images[i][b'imgs'],
                                  image_width, 
                                  image_height)

        reviewed_items = set()
        for item in user_train_ratings[u]:
            reviewed_items.add(item[b'productid'])
        reviewed_items.add(user_validation_ratings[u][0][b'productid'])

        triplet_validation_batch[u] = []
        for j ...                                        # >> COMPLETE HERE!
            if j ...                                     # >> COMPLETE HERE!
                image_j = image_translate(item_images[j][b'imgs'],
                                          image_width, 
                                          image_height)
                triplet_validation_batch[u].append([image_i,
                                                    image_j])
        
    return triplet_train_batch, triplet_validation_batch

In [None]:
# Define the loss following the BPR method.
# Note: BPR maximizes ln(sigmoid(x)), whereas the Keras engine minimizes the loss,
# so we minimize softplus(-x) = -ln(sigmoid(x)) instead.
def softplus_loss(label_matrix, prediction_matrix):
    return K.mean(K.softplus(-prediction_matrix))

In [None]:
# Define the metric as AUC according to the BPR method
#
# Count the ratio of prediction value > 0
# i.e., predicting positive item score > negative item score for a user
#
# Pay attention.
# Do not use a plain integer as a parameter to keras.backend.switch,
# instead, pass a compatible tensor (for example create it with keras.backend.zeros_like)
def auc(label_tensor, prediction_tensor):
    return K.mean(...)                                   # >> COMPLETE HERE!

## Build the model

Let's move to the *src/convolutional_siameseNet.ipynb* Jupyter notebook.

## Model Training

#### Libraries

In [None]:
from keras import backend as K
from keras.models import model_from_yaml
from keras.utils import to_categorical
from keras.regularizers import l2
from keras.optimizers import Adam

import ipynb.fs.full.convolutional_siameseNet as model

#### Hyper-parameters

In [None]:
# Network params
# image size
image_width = 224
image_height = 224

# latent dimensionality K
latent_dimensionality = 100

# weight decay - conv layer
lambda_cnn = 1e-3  # 2e-4
# weight decay - fc layer
lambda_fc = 1e-3
# regularizer for theta_u
lambda_u = 1.0

In [None]:
# Training params
# epoch params
learning_rate = 1e-4
training_epoch = 3 # 30
batch_count = 2**8
# batch_size = 2**7
validation_sample_count = 100

#### Let's consider a subset of users to speed up the process

In [None]:
log_info(logger, "original total nb of users: %.0f" % user_num)
user_num_original = user_num
user_train_original = user_train

In [None]:
# for each batch, force the number of users to be the same
batch_count = 2**8
user_num = (user_num_original - (user_num_original % batch_count))
log_info(logger, "total nb of users: %.0f" % user_num)

# one complete model will be linked to each user_subset
user_subsets = dict(zip(range(batch_count), np.array_split(range(user_num), batch_count)))
log_info(logger, "total nb of batches: %.0f" % len(user_subsets))
log_info(logger, "users per batch: %.0f" % len(user_subsets[0]))

In [None]:
# let's consider 2**4 batches of users
batch_count = 2**4
user_num = len(user_subsets[0]) * batch_count
log_info(logger, "nb of considered users: %.0f" % user_num)
user_subsets = dict(zip(range(batch_count), np.array_split(range(user_num), batch_count)))
log_info(logger, "nb of considered batches: %.0f" % len(user_subsets))
log_info(logger, "users per batch: %.0f" % len(user_subsets[0]))
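
The arithmetic behind the user subsets can be checked on small numbers. The sketch below uses hypothetical values (`user_num_toy`, `batch_count_toy`) and the same `np.array_split` call as above.

```python
import numpy as np

user_num_toy = 1003       # hypothetical number of users
batch_count_toy = 8       # hypothetical number of subsets

# Drop the remainder so that every subset has the same size.
usable = user_num_toy - (user_num_toy % batch_count_toy)
subsets = dict(zip(range(batch_count_toy),
                   np.array_split(range(usable), batch_count_toy)))
print(usable, len(subsets), len(subsets[0]))
```

Here 3 users are dropped, leaving 8 subsets of 125 consecutive user indices each.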

#### Set and compile the DVBPR

In [None]:
clock_start = datetime.now()
conv_siamese_net = model.ConvSiameseNet(users_dim=len(user_subsets[0]),
                                        width=image_width,
                                        height=image_height,
                                        depth=3,
                                        latent_dim=latent_dimensionality,
                                        cnn_w_regularizer=l2(lambda_cnn),
                                        fc_w_regularizer=l2(lambda_fc),
                                        u_w_regularizer=l2(lambda_u)
                                        )
process_duration = datetime.now() - clock_start
log_info(logger, 
         "Building Convolutional SiameseNet model (%.0f params) took %.2f minutes" % (conv_siamese_net.count_params(), 
                                                                                      process_duration.total_seconds() / 60))

In [None]:
optimizer = Adam(learning_rate)
conv_siamese_net.compile(loss=utils.softplus_loss,
                         optimizer=optimizer,
                         metrics=[utils.auc])

In [None]:
# serialize model to YAML
model_yaml = conv_siamese_net.to_yaml()
with open(os.path.join(path["model"], "dvbpr.yaml"), "w") as yaml_file:
    yaml_file.write(model_yaml)

#### Find the pre-trained models in the *models/pre-trained-24early-stopped-epochs-98AUC* directory

## Given a user, let's predict the final ranking!

#### Randomly choose a user

In [None]:
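user = random.randrange(user_num)
user = 312  # override the random choice with a fixed user
# the dataset was loaded with encoding="bytes", so keys and strings are bytes
print("user idx: %.0f\nuser name: %s" % (user, user_train_original[user][0][b'reviewerName'].decode("utf-8")))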

In [None]:
# See what she/he likes
observed_items_url_cat = utils.get_observed_imUrl_imCat(user_idx=user, 
                                                        user_train_ratings=user_train_original,  
                                                        item_images=items)
for idx, url_cat in observed_items_url_cat.items():   
    img = utils.image_displayer(url_cat["imUrl"])
    print("categories: %s" % ("; ".join(url_cat["imCat"])))
    display(img)

In [None]:
baseline_url_cat = utils.get_observed_imUrl_imCat(user_idx=user, 
                                                  user_train_ratings=user_test,  
                                                  item_images=items)
baseline_id = list(baseline_url_cat.keys())[0]
baseline_cat = baseline_url_cat[baseline_id]["imCat"]
baseline_img = utils.image_translate(items[baseline_id][b'imgs'],
                                     image_width,
                                     image_height)
display(utils.image_displayer(baseline_url_cat[baseline_id]["imUrl"]))

In [None]:
# Sample items she/he has not reviewed
not_observed_item_ids = random.sample(range(len(items)), k=15000)
not_observed_item_ids = [item_id for item_id in not_observed_item_ids if item_id not in observed_items_url_cat.keys()]

In [None]:
# Get the trained layers for that user
batch_models = os.listdir(os.path.join(path["model"], "pre-trained-24early-stopped-epochs-98AUC"))
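# use a name other than `model`, which already refers to the imported notebook module
trained_model = [model_file for model_file in batch_models
                 if user in user_subsets[int(os.path.splitext(model_file)[0].split("_")[2])]][0]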
print(trained_model)

In [None]:
# Define the user matrix 
user_subset_origin = user_subsets[int(os.path.splitext(trained_model)[0].split("_")[2])][0]
user_E = to_categorical(list(range(user - user_subset_origin,
                                   user - user_subset_origin + latent_dimensionality)),
                        num_classes=latent_dimensionality * len(user_subsets[0])).transpose()
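
To see what the `to_categorical(...).transpose()` call above produces, here is a NumPy-only sketch on small, hypothetical sizes. It relies on the fact that `to_categorical(idx, n)` returns one one-hot row of length `n` per index, which `np.eye(n)[idx]` reproduces; it only illustrates shapes and does not replace the Keras call.

```python
import numpy as np

latent_dim_toy = 3        # hypothetical K
users_per_subset = 4      # hypothetical subset size
user_offset = 2           # position of the user inside the subset (hypothetical)

indices = list(range(user_offset, user_offset + latent_dim_toy))
num_classes = latent_dim_toy * users_per_subset
# np.eye(n)[idx] yields one one-hot row per index, like to_categorical(idx, n)
one_hot = np.eye(num_classes)[indices].transpose()
print(one_hot.shape)      # (12, 3)
```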

#### Predict the ranking of user's preferences 

In [None]:
# Build the DVBPR model
dvbpr_ranker = model.ConvSiameseNet(users_dim=len(user_subsets[0]),
                                    width=image_width,
                                    height=image_height,
                                    depth=3,
                                    latent_dim=latent_dimensionality,
                                    cnn_w_regularizer=l2(lambda_cnn),
                                    fc_w_regularizer=l2(lambda_fc),
                                    u_w_regularizer=l2(lambda_u)
                                   )

In [None]:
# Transfer the trained weights to our predictor
dvbpr_ranker.load_weights(os.path.join(path["model"], "pre-trained-24early-stopped-epochs-98AUC", 
                                       os.path.split(trained_model)[1]))

In [None]:
# Get the preferences scores for new items
user_placeholder = []
users_E = []
baseline_item_image = []
new_item_images = []
for item_id in not_observed_item_ids:
    user_placeholder.append(1)
    users_E.append(user_E)
    baseline_item_image.append(baseline_img)
    new_item_images.append(utils.image_translate(items[item_id][b'imgs'],
                                                      image_width, 
                                                      image_height)) 

preference_scores = dict(zip(not_observed_item_ids, 
                             dvbpr_ranker.predict(
                                 [np.array(user_placeholder),
                                  np.array(users_E),
                                  np.array(baseline_item_image),
                                  np.array(new_item_images)])))
# wrap in list(): a dict view is not accepted by every pandas version
item_score = pd.DataFrame(list(preference_scores.items()), columns=["item", "score"])
item_score["score"] = item_score["score"].map(lambda s: s[0])

# and order them on a 0-100 scale
item_score.sort_values("score", ascending=False, inplace=True)
item_score.set_index("item", inplace=True)
item_score["score"] = ((item_score["score"] - item_score["score"].min()) /
                       (item_score["score"].max() - item_score["score"].min()) * 100).round(2)
item_score["categories"] = item_score.index.map(lambda i: 
                                                np.unique([cat.decode("utf-8") 
                                                           for cat in sum(items[i][b'categories'], [])]))
item_score["close2test"] = item_score["categories"].map(lambda cat: sum([c in baseline_cat for c in cat]) > 1)
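
The rescaling to a 0-100 scale used above is a min-max normalization. A minimal sketch on hypothetical raw scores:

```python
import pandas as pd

# Hypothetical raw preference scores, indexed by item id.
raw = pd.Series([-1.5, 0.0, 2.5], index=[10, 11, 12], name="score")

# Map the lowest score to 0 and the highest to 100, rounding to 2 decimals.
scaled = ((raw - raw.min()) / (raw.max() - raw.min()) * 100).round(2)
print(scaled.to_dict())
```

The scaled values only express the relative position of each item within the sampled candidates; they are not probabilities.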

#### Let's display the predicted ranking!

In [None]:
suggested_count = 0
for item_idx in item_score.index:
    # show the 5 top-ranked items flagged as close to the test item
    if suggested_count < 5 and item_score.loc[item_idx, "close2test"]:
        img = utils.image_displayer(items[item_idx][b'imUrl'].decode("utf-8"))
        print("item id %.0f" % (item_idx))
        print("score: %.2f" % (item_score.loc[item_idx, "score"]))
        display(img)
        suggested_count += 1

In [None]:
img = utils.image_displayer(items[item_idx][b'imUrl'].decode("utf-8"))