# Unsupervised Domain Adaptation by Backpropagation
In this notebook, we will demonstrate how to implement Domain Adversarial Neural Network (DANN) as proposed in this [paper](https://arxiv.org/abs/1409.7495). In this notebook, we will adapt a digit classifier trained on MNIST digit dataset to USPS digit dataset.

## Initial Imports

In [1]:
import os

import tensorflow as tf
import numpy as np

import fastestimator as fe

## Step 1: Defining Pipeline
We will download the two datasets and then define ``Pipeline`` objects accordingly.

In [2]:
from fastestimator.dataset import mnist, usps
from fastestimator.op.numpyop import ImageReader
from fastestimator import RecordWriter

usps_train_csv, usps_eval_csv, usps_parent_dir = usps.load_data()
mnist_train_csv, mnist_eval_csv, mnist_parent_dir = mnist.load_data()

Extracting /root/fastestimator_data/USPS/zip.train.gz
Extracting /root/fastestimator_data/USPS/zip.test.gz


In [3]:
# parameters
batch_size = 128
epochs = 100 

The dataset api creates a train csv file with each row containing a relative path to a image and the class label. Two train csv files will have the same column names. We need to change these column names to unique name for our purpose.

In [4]:
import pandas as pd

df = pd.read_csv(mnist_train_csv)
df.columns = ['source_img', 'source_label']
df.to_csv(mnist_train_csv, index=False)

df = pd.read_csv(usps_train_csv)
df.columns = ['target_img', 'target_label']
df.to_csv(usps_train_csv, index=False)

With the modified csv files, we can now create an input data pipeline that returns a batch from the MNIST dataset and the USPS dataset.

#### Note that the input data pipeline created here is an unpaired dataset of the MNIST and the USPS.

In [5]:
from fastestimator.op.tensorop import Resize, Minmax

writer = RecordWriter(save_dir=os.path.join(os.path.dirname(mnist_parent_dir), 'dann', 'tfr'),
                      train_data=(usps_train_csv, mnist_train_csv),
                      ops=(
                          [ImageReader(inputs="target_img", outputs="target_img", parent_path=usps_parent_dir, grey_scale=True)], # first tuple element
                          [ImageReader(inputs="source_img", outputs="source_img", parent_path=mnist_parent_dir, grey_scale=True)])) # second tuple element

We apply the following preprocessing to both datasets:

* Resize of images to $28\times28$
* Minmax pixel value normalization

In [6]:
pipeline = fe.Pipeline(
    batch_size=batch_size,
    data=writer,
    ops=[
        Resize(inputs="target_img", outputs="target_img", size=(28, 28)),
        Resize(inputs="source_img", outputs="source_img", size=(28, 28)),
        Minmax(inputs="target_img", outputs="target_img"),
        Minmax(inputs="source_img", outputs="source_img")
    ]
)

## Step 2: Defining Network
![DANN](./GRL.png)
*Image Credit: [DANN Paper](https://arxiv.org/abs/1409.7495)*

With ``Pipeline`` defined, we define the network architecture.
The digit classification model is composed of the feature extraction network and the classifier network.
In addition, the domain discriminator is attached to the output of the feature extraction network.

The main idea is to train the feature extraction network to extract features that are invariant to domain shift.
MNIST samples are used to train both classification branch (upper branch) and domain classification branch (lower branch). 
USPS samples are only used to train domain classification branch.
Note that there is a gradient reversal layer between feature extraction network and the domain classification network. This layer helps the feature extraction network to be domain invariant by updating its parameters in the reverse direction of the gradient of domain classification loss.

For stable training, the gradient of domain classification $\frac{\partial L_{d}}{\partial \theta_{d}}$ is scaled by a constant $\lambda$ which smoothyl changes from 0 to 1 throughout the training. In our example, we define a tensor variable named ``alpha`` for this purpose.

In [7]:
from tensorflow.keras import layers, Model
from fastestimator.layers import GradReversal
alpha = tf.Variable(0.0, dtype=tf.float32, trainable=False)
img_shape=(28, 28, 1)
feat_dim = 7 * 7 * 48

def build_feature_extractor(img_shape=(28, 28, 1)):
    x0 = layers.Input(shape=img_shape)
    x = layers.Conv2D(32, 5, activation="relu", padding="same")(x0)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(48, 5, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D()(x)
    feat_map = layers.Flatten()(x)
    return Model(inputs=x0, outputs=feat_map)


def build_label_predictor(feat_dim):
    x0 = layers.Input(shape=(feat_dim,))
    x = layers.Dense(100, activation="relu")(x0)
    x = layers.Dense(100, activation="relu")(x)
    return Model(inputs=x0, outputs=x)

def build_domain_predictor(feat_dim):
    x0 = layers.Input(shape=(feat_dim,))
    x = GradReversal(l=alpha)(x0)
    x = layers.Dense(100, activation="relu")(x)
    x = layers.Dense(1, activation="sigmoid")(x)
    return Model(inputs=x0, outputs=x)

In [8]:
feature_extractor = fe.build(
    model_def=lambda: build_feature_extractor(img_shape),
    model_name="feature_extractor",
    loss_name="fe_loss",
    optimizer=tf.keras.optimizers.Adam(1e-4)
)

label_predictor = fe.build(
    model_def=lambda: build_label_predictor(feat_dim),
    model_name="label_predictor",
    loss_name="fe_loss",
    optimizer=tf.keras.optimizers.Adam(1e-4)
)

domain_predictor = fe.build(
    model_def=lambda: build_domain_predictor(feat_dim),
    model_name="domain_predictor",
    loss_name="fe_loss",
    optimizer=tf.keras.optimizers.Adam(1e-4)
)

We define the loss for feature extraction network.

In [9]:
from fastestimator.op.tensorop.loss import Loss, BinaryCrossentropy, SparseCategoricalCrossentropy
from tensorflow.keras import losses

class FELoss(Loss):
    def __init__(self, inputs, outputs=None, mode=None):
        super().__init__(inputs=inputs, outputs=outputs, mode=mode)        
        self.label_loss_obj = losses.SparseCategoricalCrossentropy(reduction=losses.Reduction.NONE)
        self.domain_loss_obj = losses.BinaryCrossentropy(reduction=losses.Reduction.NONE)        
        
    def forward(self, data, state):
        src_c_logit, src_c_label, src_d_logit, tgt_d_logit = data
        c_loss = self.label_loss_obj(y_true=src_c_label, y_pred=src_c_logit)
        src_d_loss = self.domain_loss_obj(y_true=tf.zeros_like(src_d_logit), y_pred=src_d_logit) 
        tgt_d_loss = self.domain_loss_obj(y_true=tf.ones_like(tgt_d_logit), y_pred=tgt_d_logit)
        return c_loss + src_d_loss + tgt_d_loss

We define the overall forward pass of the training.

In [10]:
from fastestimator.op.tensorop.model import ModelOp
network = fe.Network(ops=[
    ModelOp(inputs="source_img", outputs="src_feat", model=feature_extractor),
    ModelOp(inputs="target_img", outputs="tgt_feat", model=feature_extractor),
    ModelOp(inputs="src_feat", outputs="src_c_logit", model=label_predictor),
    ModelOp(inputs="src_feat", outputs="src_d_logit", model=domain_predictor),
    ModelOp(inputs="tgt_feat", outputs="tgt_d_logit", model=domain_predictor),
    FELoss(inputs=("src_c_logit","source_label", "src_d_logit", "tgt_d_logit"), outputs="fe_loss")    
])

As mentioned in the [paper](https://arxiv.org/abs/1409.7495), the magnitude of the reversed gradient is smoothly changed from [0, 1] for stable training. 
We accomplish this by defining a trace to control the value of ``alpha`` defined previously.

In [11]:
from fastestimator.trace import Trace
from tensorflow.python.keras import backend

class GRLWeightController(Trace):
    def __init__(self, alpha):
        super().__init__(inputs=None, outputs=None, mode="train")
        self.alpha = alpha
        
    def on_begin(self, state):
        self.total_steps = state['total_train_steps']
        
    def on_batch_begin(self, state):
        p = state['train_step'] / self.total_steps
        current_alpha = float(2.0 / (1.0 + np.exp(-10.0 * p)) - 1.0)
        backend.set_value(self.alpha, current_alpha)

In [12]:
traces = [GRLWeightController(alpha=alpha)]

estimator = fe.Estimator(
    pipeline= pipeline, 
    network=network,
    traces = traces,
    epochs = epochs
)

## Step 3: Defining Estimator
We put everything together in ``Estimator`` and start training.

In [13]:
estimator.fit()

    ______           __  ______     __  _                 __            
   / ____/___ ______/ /_/ ____/____/ /_(_)___ ___  ____ _/ /_____  _____
  / /_  / __ `/ ___/ __/ __/ / ___/ __/ / __ `__ \/ __ `/ __/ __ \/ ___/
 / __/ / /_/ (__  ) /_/ /___(__  ) /_/ / / / / / / /_/ / /_/ /_/ / /    
/_/    \__,_/____/\__/_____/____/\__/_/_/ /_/ /_/\__,_/\__/\____/_/     
                                                                        

FastEstimator: Reading non-empty directory: /root/fastestimator_data/dann/tfr
FastEstimator: Found 60000 examples for train in /root/fastestimator_data/dann/tfr/train_summary1.json
FastEstimator: Found 7291 examples for train in /root/fastestimator_data/dann/tfr/train_summary0.json
FastEstimator-Warn: No ModelSaver Trace detected. Models will not be saved.
FastEstimator-Start: step: 0; total_train_steps: 5600; feature_extractor_lr: 1e-04; label_predictor_lr: 1e-04; domain_predictor_lr: 1e-04; 
FastEstimator-Train: step: 0; fe_loss: 9.094802; 
FastEstimato