These are my notes on the paper [Unified Deep Supervised Domain Adaptation and Generalization](https://arxiv.org/abs/1709.10190). All mistakes are my own.

In [1]:
import tensorflow as tf

## The problem

- Source and target datasets $D_S, D_T$ with covariate shift between the respective distributions.
- Both datasets share the same set of classes.
- Both may have a large number of datapoints (e.g. images), but in $D_T$ most of the points are unlabelled.
- A model trained on $D_S$ will typically not do well on $D_T$ unless it is fine-tuned on $D_T$. If the model is deep, however, the lack of labels for $D_T$ means we probably can't fine-tune it successfully.

## Solution
1. Map inputs $\mathcal{X}_S$ and $\mathcal{X}_T$ to a common embedding space $\mathcal{Z}$ via functions $g_i: \mathcal{X}_i \rightarrow \mathcal{Z}$.
2. Since the embeddings in $\mathcal{Z}$ are assumed to be domain invariant, they can be mapped to a single label space $\mathcal{Y}$ via a function $h: \mathcal{Z} \rightarrow \mathcal{Y}$.


- The model function for domain $i$ is $f_i = h \circ g_i$.
- Usually $g$ can be shared between domains, e.g. a ConvNet for images.
- The feature network $g$ is trained using both $D_S$ and $D_T$
- The classifier $h$ is trained initially only on $D_S$ and is later fine-tuned on $D_T$
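
A minimal NumPy sketch of the composition $f = h \circ g$ with a shared $g$. The layer sizes, the random weights and the names `g`, `h` and `f` are my own illustrative assumptions, not anything taken from the paper; the point is only that inputs from both domains pass through the same embedding function before reaching the shared classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 8 input features, 4-dimensional embedding, 3 classes.
W_g = rng.normal(size=(8, 4))
W_h = rng.normal(size=(4, 3))

def g(x):
    """Shared feature network: maps inputs from either domain into Z."""
    return np.maximum(0.0, x @ W_g)

def h(z):
    """Classifier: maps embeddings to class probabilities via a softmax."""
    logits = z @ W_h
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

def f(x):
    """Full model f = h o g, identical for source and target inputs."""
    return h(g(x))

x_source = rng.normal(size=(5, 8))        # toy batch standing in for D_S
x_target = rng.normal(size=(2, 8)) + 1.0  # shifted toy batch standing in for D_T

p_source, p_target = f(x_source), f(x_target)
```

Each row of `p_source` and `p_target` is a probability vector over the same three classes, whichever domain the input came from.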

## Domain invariance

- Domain invariance of $\mathcal{Z}$ can be enforced by making the probability distributions of the embeddings from each domain align.
- In practice, due to lack of samples from the target domain, the distance between distributions is approximated by the sum of pairwise distances between embeddings.
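
The pairwise approximation can be sketched in NumPy as follows. The function name `pairwise_half_sq_distances` and the toy embeddings are assumptions for illustration; the per-pair distance used here (half the squared Euclidean distance) is the same one that `distance_loss` computes in the code further down.

```python
import numpy as np

def pairwise_half_sq_distances(z_s, z_t):
    """Half squared Euclidean distance between every source/target pair.

    z_s: (N_s, F) source embeddings, z_t: (N_t, F) target embeddings.
    Returns an (N_s, N_t) matrix.
    """
    diff = z_s[:, None, :] - z_t[None, :, :]  # (N_s, N_t, F) via broadcasting
    return 0.5 * np.sum(diff ** 2, axis=-1)

# Toy embeddings: three source points and a single labelled target point.
z_s = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
z_t = np.array([[1.0, 1.0]])

d = pairwise_half_sq_distances(z_s, z_t)
alignment = d.sum()  # surrogate for the distance between the two distributions
```

Here the three pairwise terms are 1.0, 0.5 and 1.0, so the surrogate evaluates to 2.5 even though only one target sample is available.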

### Semantic alignment
- Regardless of domain:
    - Samples with the same label should be nearby in the embedding space
    - Samples with different labels should be far apart


- The model can be trained to produce semantically aligned domain invariant embeddings by optimising the following losses:

    - Minimise the distance $d$ between embeddings of samples with the same label (semantic alignment)

    $$\sum_a\sum_{i,j:\; y_i^s = y_j^t = a}d\left(g\left(x_i^s\right), g\left(x_j^t\right)\right)$$

    - Penalise the similarity $k$ between embeddings of samples with different labels (separation)

    $$\sum_{a,b:\; a\neq b}\;\sum_{i,j:\; y_i^s = a,\, y_j^t = b}k\left(g\left(x_i^s\right), g\left(x_j^t\right)\right)$$



Note that in the code below we take the mean of similarity and distance losses separately for each class represented in the batch based on the following from the paper:

> to balance the classification versus the contrastive semantic alignment portion of the loss (5), (7) and (8) are normalized

where (8) is the equation for the similarity loss and (7) for the distance loss, for a single class. This seems to suggest that the single class losses are normalised. 

In [14]:
def distance_loss(a, b):
    return tf.reduce_sum((a - b) ** 2, axis=-1) / 2.

def similarity_loss(a, b, margin):
    # The small epsilon keeps the gradient of sqrt finite when a == b.
    dist = tf.sqrt(2. * distance_loss(a, b) + 1e-12)
    return (tf.maximum(0., margin - dist) ** 2) / 2.

def separation_loss(x1, x2, y1, y2, margin):
    select = tf.not_equal(y1, y2)
    
    x1 = tf.boolean_mask(x1, select) # (M, F)
    x2 = tf.boolean_mask(x2, select) # (M, F)
    y = tf.boolean_mask(y1, select) # (M)
    
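    sim = similarity_loss(x1, x2, margin) # (M,)
    # Class labels need not be 0..K-1, so group by the indices from tf.unique.
    classes, segment_ids = tf.unique(y)
    
    losses = tf.math.unsorted_segment_mean(sim, segment_ids, tf.size(classes))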
    
    return tf.reduce_sum(losses)

def semantic_alignment_loss(x1, x2, y1, y2):
    select = tf.equal(y1, y2)
    
    x1 = tf.boolean_mask(x1, select) # (M, F)
    x2 = tf.boolean_mask(x2, select) # (M, F)
    y = tf.boolean_mask(y1, select) # (M)
    
    dist = distance_loss(x1, x2)  # (M,)
    # Class labels need not be 0..K-1, so group by the indices from tf.unique.
    classes, segment_ids = tf.unique(y)
    
    losses = tf.math.unsorted_segment_mean(dist, segment_ids, tf.size(classes))
    
    return tf.reduce_sum(losses)
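
To sanity-check the per-class normalisation, the following NumPy sketch mirrors the two functions above on a tiny batch of already-paired embeddings. It is an independent reference written for these notes, not code from the paper, and the names `per_class_mean_sum`, `alignment_np` and `separation_np` are hypothetical. `np.unique(..., return_inverse=True)` maps arbitrary class labels onto contiguous group indices.

```python
import numpy as np

def per_class_mean_sum(values, labels):
    """Average `values` within each class present in `labels`, then sum the
    per-class means. Returns 0.0 for an empty selection."""
    if len(values) == 0:
        return 0.0
    _, group = np.unique(labels, return_inverse=True)  # labels -> 0..K-1
    sums = np.bincount(group, weights=values)
    counts = np.bincount(group)
    return float(np.sum(sums / counts))

def alignment_np(x1, x2, y1, y2):
    """Same-label pairs: half squared Euclidean distance."""
    same = y1 == y2
    dist = 0.5 * np.sum((x1[same] - x2[same]) ** 2, axis=-1)
    return per_class_mean_sum(dist, y1[same])

def separation_np(x1, x2, y1, y2, margin):
    """Different-label pairs: squared hinge on the Euclidean distance."""
    diff = y1 != y2
    norm = np.linalg.norm(x1[diff] - x2[diff], axis=-1)
    sim = 0.5 * np.maximum(0.0, margin - norm) ** 2
    return per_class_mean_sum(sim, y1[diff])

# Four pairs (row i of x1 is paired with row i of x2). The labels are 3 and 7
# on purpose, to show that they need not be 0..K-1.
x1 = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
x2 = np.array([[2.0, 0.0], [0.0, 4.0], [1.0, 1.0], [0.0, 1.0]])
y1 = np.array([3, 3, 7, 7])
y2 = np.array([3, 3, 7, 3])

l_sa = alignment_np(x1, x2, y1, y2)              # class 3 mean 5.0 + class 7 mean 0.0
l_s = separation_np(x1, x2, y1, y2, margin=2.0)  # single pair at distance 1.0
```

With these inputs the alignment term is 5.0 and the separation term is 0.5, which is what the TensorFlow functions should also return for the same batch and margin.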

## Domain generalization 
- Here there may be any number of domains and all of the losses use all of the domains
- The target domain is regarded as unknown
- The goal is to make $g$ map to a domain invariant embedding space such that the model would perform well on any arbitrary target domain
- As the number of pairs for the pairwise losses grows quadratically with the number of domains, pairs may be selected randomly. For example, they mention that for a given sample they randomly select a fixed number (say 2 or 5) of samples from each other source domain to pair with it. It is not clear to me whether this selection happens within a batch.
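
One possible reading of this sampling scheme, sketched in NumPy: for every sample, draw a fixed number of partners from each of the other source domains. The function name `sample_cross_domain_pairs`, its arguments and the choice of sampling without replacement are my assumptions rather than details taken from the paper.

```python
import numpy as np

def sample_cross_domain_pairs(domain_ids, n_per_domain, rng):
    """Pair each sample with `n_per_domain` random samples from every other domain.

    domain_ids: (N,) integer array giving the domain of each sample.
    Returns a (P, 2) integer array of (anchor index, partner index) rows.
    """
    pairs = []
    domains = np.unique(domain_ids)
    for i, d_i in enumerate(domain_ids):
        for d in domains:
            if d == d_i:
                continue  # only pair across domains
            candidates = np.flatnonzero(domain_ids == d)
            k = min(n_per_domain, len(candidates))
            partners = rng.choice(candidates, size=k, replace=False)
            pairs.extend((i, int(j)) for j in partners)
    return np.array(pairs, dtype=int).reshape(-1, 2)

rng = np.random.default_rng(0)
domain_ids = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])  # three toy source domains
pairs = sample_cross_domain_pairs(domain_ids, n_per_domain=2, rng=rng)
```

With 9 samples, 2 other domains per sample and 2 partners per domain, this yields 9 × 2 × 2 = 36 pairs, so the number of pairs grows linearly with the number of samples instead of covering every cross-domain combination.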

## Architecture

They report two kinds of architectures 

1. Pre-computed features followed by some trained layers to generate embeddings, followed by a classifier --- for example for their VLCS domain generalisation experiment they use

$$\underbrace{\text{(DeCaF-fc6 features)-fc-fc}}_g\text{-}\underbrace{\text{fc-softmax}}_h$$

2. End-to-end trained feature network with classifier on top --- for example for the MNIST domain adaptation experiment they use:

$$\underbrace{\text{conv5x5-maxpool-conv5x5-maxpool-fc-fc}}_g\text{-}
\underbrace{\text{fc-softmax}}_h$$

## Results
### Office dataset
- Performance reported for pairwise domain adaptation for three pairs of domains from $\mathcal{W}, \mathcal{A}, \mathcal{D}$
- Does better on average than the other models, both supervised and unsupervised.
- This holds across various settings (training and testing on all or a selection of classes, training on some and testing on other classes).
- The improvement over other models is most noticeable where the domain shift is greater (between $\mathcal{W}$ and $\mathcal{A}$, and between $\mathcal{A}$ and $\mathcal{D}$).

### MNIST-USPS 
- Does better than other models as the number of samples from the target domain is increased. 

### Domain generalization
- Experiments are done on the VLCS and rotated MNIST datasets. Averaged across the different combinations of source and target datasets, performance is better than that of other models, and for some of the combinations it is the best.