# Unsupervised Domain Adaptation 

<center>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bonom/DL-Project/blob/main/notebook.ipynb)

</center>

Notebook for the final project of the course "Machine Learning" at the University of Trento.

## Students:
- Andrea Bonomi [GitHub](https://github.com/bonom)
- Evelyn Turri [GitHub](https://github.com/EvelynTurri)

## Weights

All the weights of the best models we trained are available [here](https://drive.google.com/drive/folders/1YNAXZ96Deq58423yJ9ignHHTZMdbl33n?usp=share_link).

## Introduction

In this notebook, we explore the concept of **domain adaptation** and how it can be applied to improve the performance of machine learning models. In domain adaptation, we aim to learn, from a source data distribution, a model that performs well on a different (but related) target data distribution. This is particularly useful when we have limited labeled data in the target domain but a large amount of labeled data in the source domain.

We will start by discussing the problem of domain shift and how it can affect the performance of a model. Then, we will introduce different domain adaptation techniques such as feature alignment and adversarial training. We will also implement these techniques in PyTorch and evaluate their effectiveness on a real-world dataset.

## Goal

The goal is to implement a UDA (Unsupervised Domain Adaptation) technique for a multiclass classification task on a subset of the [Adaptiope](https://ieeexplore.ieee.org/document/9423412) dataset, consisting of images belonging to two different domains: Real Life and Product. The UDA technique can be chosen among:

 - Discrepancy-based methods
 - Adversarial-based methods
 - Reconstruction-based methods

As UDA method we present the **Maximum Classifier Discrepancy** (MCD) approach, a discrepancy-based one. This method is taken from the paper [Maximum Classifier Discrepancy for Unsupervised Domain Adaptation](https://openaccess.thecvf.com/content_cvpr_2018/papers/Saito_Maximum_Classifier_Discrepancy_CVPR_2018_paper.pdf). We show the results obtained with this method and some possible improvements.

## Organization & Guidelines

Below we present the organization of the notebook and the guidelines that we followed during the project.

### Notations

- Source domain = domain from which we have the labeled data
- Target domain = domain to which we want to adapt our model (we have no labeled data)
- $P$ = Product domain
- $RW$ = Real World domain
- $P \rightarrow RW$ = Product to Real World: source domain is $P$ and target domain is $RW$
- $RW \rightarrow P$ = Real World to Product: source domain is $RW$ and target domain is $P$

### Dataset

The datasets are split as follows:

- Source domain = Product domain:
    - Train = $P_{train}$
    - Test = $P_{test}$
- Target domain = Real World domain:
    - Train = $RW_{train}$
    - Test = $RW_{test}$
  
### Tasks

Tasks that we accomplish for both directions ($P$ as source domain and $RW$ as target domain, and vice versa):

1. *Baseline* 
    - Train the proposed model in a supervised way on the source domain and evaluate it as is on the target domain
    - Accuracy here is called $acc_{so}$ referred to the *source only* scenario
    - Finally, we compute the baseline in both directions: $P \rightarrow RW$ and $RW \rightarrow P$

2. *Upperbound* 
    - Train in a supervised way on the target train set and test on the target test set
        - $P \rightarrow RW$ : train on $RW_{train}$ and test on $RW_{test}$
        - $RW \rightarrow P$ : train on $P_{train}$ and test on $P_{test}$

3. *Advanced* 
    - Simultaneously:
        - Train in a supervised way on the source domain
        - Train in an unsupervised way on the target domain, applying the UDA component
    - Test on the target domain
    - Accuracy here is called $acc_{uda}$ referred to the *unsupervised domain adaptation* scenario  
    - Compute the gain: $G = acc_{uda} - acc_{so}$
  
        > Notice that $acc_{uda}$ should be bigger than $acc_{so}$ 

4. *Our proposals*
    - Try to improve the results of the advanced task with the help of some papers and ideas that we will present in the report
    - Compute the gain: $G = acc_{uda} - acc_{so}$

### Accuracies assignment

Minimum accuracy required for each task:

<center>

| Version | $P \rightarrow RW$ | $RW \rightarrow P$ |
| :---: | :---: | :---: |
| Source only | 76% | 90% |
| DA | 80% | 93% |

</center>

## Preparing the Notebook

### Requirements

- Google Colab (pre-installed)
- wandb (to install: `!pip install -U wandb`)

In [None]:
! pip install -qU wandb

In [None]:
# For plotting correctly
%matplotlib inline 

### Imports

In [None]:
import os
import copy
import time
import shutil
import torch
import torchvision
import numpy as np
import pandas as pd
import seaborn as sns
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import torchvision.models as models
import torchvision.transforms as transforms

from sklearn.manifold import TSNE
from sklearn.metrics import confusion_matrix, classification_report

from tqdm import tqdm
from typing import Tuple, List, Dict, Union, Optional, Callable
from torch.utils.data import Dataset, DataLoader, random_split, Subset

# If on google colab needs to import drive to mount google drive
try:
    from google.colab import drive
except ImportError:
    print("Not using Google Colab, not mounting google drive.")

# To avoid errors with wandb
os.environ['WANDB_NOTEBOOK_NAME'] = '229357_230616.ipynb'

import wandb

### Wandb setup

In [None]:
if wandb.login():
    print("Logged in to Weights & Biases!")
else:
    if wandb.login(relogin=True):
        print("Logged in to Weights & Biases!")
    else:
        raise Exception("Could not log in to Weights & Biases!")

### Plotting

For a better view of the `matplotlib` plots in the notebook, we use the code explained in this [GitHub Gist](https://gist.github.com/micaleel/965d1672a58af2fcd469d8a24024cb29).

In [None]:
from IPython.display import HTML

# Center the matplotlib output images in the notebook cells
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")

### Constants


In [None]:
# For reproducibility, set the seed
torch.manual_seed(0)

# To fix the width of the pandas dataframes
pd.set_option('expand_frame_repr', False)

# Device configuration
DEFAULT_DEVICE = torch.device("cpu")
if torch.cuda.is_available():
    print("Using GPU as default device!")
    DEFAULT_DEVICE = torch.device("cuda")
    torch.backends.cudnn.enabled = True # Accelerate cuda
else:
    print("Using CPU as default device!\nThe performances will decrease significantly!")

# Hyper-parameters
DEFAULT_BATCH_SIZE = 256
DEFAULT_CLASSES = ["backpack", "bookcase", "car jack", "comb", "crown", "file cabinet", "flat iron", "game controller", "glasses", "helicopter", "ice skates", "letter tray", "monitor", "mug", "network switch", "over-ear headphones", "pen", "purse", "stand mixer", "stroller"]
DEFAULT_EPOCHS_BASELINE = 10
DEFAULT_EPOCHS = 30
DEFAULT_GAMMA = 0.1
DEFAULT_GENERATOR_SPLIT = torch.Generator().manual_seed(0)
DEFAULT_K_STEPS = 4
DEFAULT_LEARNING_RATE = 1e-1
DEFAULT_MOMENTUM = 0.9
DEFAULT_NUM_CLASSES = len(DEFAULT_CLASSES)
DEFAULT_SHUFFLE_DATALOADER = True
DEFAULT_SPLIT_RATIO = 0.8
DEFAULT_STEP_SIZE = 1
DEFAULT_TSNE_PERPLEXITY = 5
DEFAULT_WANDB_PROJECT_NAME = "229357_230616"
DEFAULT_WEIGHT_DECAY = 1e-5

### Paths for saving models
# First get the time and date
_time_now = time.strftime("%Y%m%d-%H%M%S")

# Then create the path
LOG_PATH = os.path.join("logs", _time_now)
LOG_PATH_MODELS = os.path.join(LOG_PATH, "models")
LOG_PATH_IMAGES = os.path.join(LOG_PATH, "images")

# Create the directories
if not os.path.exists(LOG_PATH_MODELS):
    os.makedirs(LOG_PATH_MODELS)
if not os.path.exists(LOG_PATH_IMAGES):
    os.makedirs(LOG_PATH_IMAGES)

# Release variables that are no longer needed
del _time_now 

## Dataset

The provided dataset is [Adaptiope](https://ieeexplore.ieee.org/document/9423412). We use only a subset of the dataset, composed of 2000 images for each of the two domains, Real World and Product. The images are divided into 20 classes: `backpack`, `bookcase`, `car jack`, `comb`, `crown`, `file cabinet`, `flat iron`, `game controller`, `glasses`, `helicopter`, `ice skates`, `letter tray`, `monitor`, `mug`, `network switch`, `over-ear headphones`, `pen`, `purse`, `stand mixer`, `stroller`.

### ONLY IF you have already unzipped the dataset
If you have already uploaded the unzipped dataset to your drive, you can just run this cell to load the dataset from the drive. Otherwise skip this cell and go to the next one.

Note that the dataset should be in the folder `dataset` in your personal drive. Otherwise change `gdrive_path` accordingly.

In [None]:
gdrive_path = "gdrive/My Drive/dataset/adaptiope_small"

# Mount google drive if not already mounted
if not os.path.exists("gdrive"):
    drive.mount('/content/gdrive')

# Reading the data directly from google drive is slow, so we copy the extracted dataset locally
classes = DEFAULT_CLASSES
if os.path.isdir(gdrive_path):
    for d, td in zip([os.path.join(gdrive_path, "product_images"), os.path.join(gdrive_path, "real_life")], ["adaptiope_small/product_images", "adaptiope_small/real_life"]):
        os.makedirs(td, exist_ok=True)
        for c in tqdm(classes):
            c_path = os.path.join(d, c)
            c_target = os.path.join(td, c)
            # dirs_exist_ok avoids an error when the cell is run more than once
            shutil.copytree(c_path, c_target, dirs_exist_ok=True)

### Drive Mount

In [None]:
drive.mount('/content/gdrive')

In [None]:
! mkdir dataset
! cp "gdrive/My Drive/dataset/Adaptiope.zip" dataset/

In [None]:
! unzip -qq dataset/Adaptiope.zip 

In [None]:
# If there is an old version of the dataset it is better to remove it for safety
! rm -rf adaptiope_small

In [None]:
# Create folder
! mkdir adaptiope_small

In [None]:
# Build the dataset folder
classes = DEFAULT_CLASSES
for d, td in zip(["Adaptiope/product_images", "Adaptiope/real_life"], ["adaptiope_small/product_images", "adaptiope_small/real_life"]):
    os.makedirs(td)
    for c in tqdm(classes):
        c_path = os.path.join(d, c)
        c_target = os.path.join(td, c)
        shutil.copytree(c_path, c_target)

### Transformations
Transformations applied to the input, as required by the assignment instructions. We use the `torchvision.transforms` library and apply the same transformations to both domains:
1. Resize the image to `224x224` pixels to be compatible with the Resnet architecture
2. Convert the image to a PyTorch tensor, the input type expected by the model
3. Normalize the image with the mean and standard deviation of the ImageNet dataset
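
To make step 3 concrete, the sketch below reproduces by hand the channel-wise operation `(x - mean) / std` that `transforms.Normalize` applies. The helper `normalize_like_imagenet` and the tensor `dummy_image` are illustrative names that do not appear elsewhere in the notebook.

```python
import torch

# ImageNet statistics, the same values used in the transformations cell
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406])
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225])

def normalize_like_imagenet(image: torch.Tensor) -> torch.Tensor:
    """Channel-wise (x - mean) / std for a (3, H, W) tensor with values in [0, 1]."""
    mean = IMAGENET_MEAN.view(3, 1, 1)
    std = IMAGENET_STD.view(3, 1, 1)
    return (image - mean) / std

# Hypothetical input: an image whose pixels all equal the ImageNet mean
dummy_image = IMAGENET_MEAN.view(3, 1, 1).expand(3, 224, 224)
normalized = normalize_like_imagenet(dummy_image)

print(normalized.shape)        # the spatial size is unchanged
print(normalized.abs().max())  # a mean-valued image is mapped to zero
```

After normalization, the inputs follow roughly the same statistics as the images the pretrained Resnet saw during its ImageNet training.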

In [None]:
transformations = transforms.Compose([
    transforms.Resize((224, 224)),                          # Resize to the input size of resnet
    transforms.ToTensor(),                                  # Convert to pytorch Tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406],        # Normalize with ImageNet mean and std
                         std=[0.229, 0.224, 0.225])   
])

### Load Datasets

This cell loads two datasets, *product_images* and *real_life* from the *adaptiope_small* directory and applies the same set of `transformations` to each of them. 

It also checks whether the default `device` is the CPU: if so, it reduces the batch size to 8 and shrinks both datasets to 100 samples each to avoid long training times. This is done with the help of the `torch.utils.data.Subset` utility class.

In [None]:
# Load datasets and apply transformations
products_dataset = torchvision.datasets.ImageFolder('adaptiope_small/product_images', transformations)
reals_dataset = torchvision.datasets.ImageFolder('adaptiope_small/real_life', transformations)

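# If using CPU it is better to use smaller datasets to avoid long training times
if DEFAULT_DEVICE == torch.device("cpu"):
    DEFAULT_BATCH_SIZE = 8
    # ImageFolder sorts the samples by class, so the first 100 samples would all belong to one class:
    # we pick 100 random indices instead, with a fixed seed for reproducibility
    _subset_generator = torch.Generator().manual_seed(0)
    products_dataset = Subset(products_dataset, torch.randperm(len(products_dataset), generator=_subset_generator)[:100].tolist())
    reals_dataset = Subset(reals_dataset, torch.randperm(len(reals_dataset), generator=_subset_generator)[:100].tolist())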

print("Products dataset size: {}".format(len(products_dataset)))
print("Real life dataset size: {}".format(len(reals_dataset)))

### Create Train and Test Datasets

Both datasets are split into train and test sets with a ratio of `DEFAULT_SPLIT_RATIO` (Default: 80% and 20%).
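
As a small illustration of how a seeded generator makes the split reproducible, the sketch below splits a toy dataset twice with the same seed. The `toy_dataset` is a hypothetical stand-in for one domain, not the real images.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical stand-in for one domain: 2000 samples, as in each domain of the subset
toy_dataset = TensorDataset(torch.arange(2000))

split_ratio = 0.8
n_train = int(len(toy_dataset) * split_ratio)
n_test = len(toy_dataset) - n_train

# Two generators created with the same seed produce the same split
train_a, test_a = random_split(toy_dataset, [n_train, n_test], generator=torch.Generator().manual_seed(0))
train_b, test_b = random_split(toy_dataset, [n_train, n_test], generator=torch.Generator().manual_seed(0))

print(len(train_a), len(test_a))           # 1600 400
print(train_a.indices == train_b.indices)  # True
```

Note that a generator is stateful: reusing the same generator object for a second split continues from where the first one stopped, so the two splits are still reproducible across runs but are not identical to each other.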

In [None]:
# Split dataset into train and test
split_product_train = int(len(products_dataset)*DEFAULT_SPLIT_RATIO)
split_real_train = int(len(reals_dataset)*DEFAULT_SPLIT_RATIO)

print("Splitting products dataset into train and test with ratio {}:\n\n\tTrain product = {}\n\tTest product = {}\n\n\tTrain real life = {}\n\tTest real life = {}".format(
    DEFAULT_SPLIT_RATIO, 
    split_product_train, 
    len(products_dataset)-split_product_train, 
    split_real_train, 
    len(reals_dataset)-split_real_train)
)

# Split dataset into train and test
train_products, test_products = random_split(products_dataset, [split_product_train, len(products_dataset)-split_product_train], generator = DEFAULT_GENERATOR_SPLIT)
train_reals, test_reals = random_split(reals_dataset, [split_real_train, len(reals_dataset)-split_real_train], generator = DEFAULT_GENERATOR_SPLIT)

### Create Dataloaders

Build the dataloaders for both domains. We enable the `shuffle` parameter to shuffle the dataset and improve the training.

In [None]:
# Create dataloaders
train_loader_products = DataLoader(train_products, batch_size=DEFAULT_BATCH_SIZE, shuffle=DEFAULT_SHUFFLE_DATALOADER)
test_loader_products = DataLoader(test_products, batch_size=DEFAULT_BATCH_SIZE, shuffle=DEFAULT_SHUFFLE_DATALOADER)
train_loader_reals = DataLoader(train_reals, batch_size=DEFAULT_BATCH_SIZE, shuffle=DEFAULT_SHUFFLE_DATALOADER)
test_loader_reals = DataLoader(test_reals, batch_size=DEFAULT_BATCH_SIZE, shuffle=DEFAULT_SHUFFLE_DATALOADER)

## Global Functions

Definition of the functions used most often throughout the notebook.

### Accuracy function
Definition of the accuracy function as defined in the instructions: $$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$ 
where:
- TP = True Positives
- TN = True Negatives
- FP = False Positives
- FN = False Negatives
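
For multiclass predictions, a common way to obtain the accuracy is to count the samples whose highest-scoring class matches the label and divide by the number of samples. The sketch below accumulates this count over batches; the helper `count_correct` and the logits are made up for illustration.

```python
import torch

def count_correct(logits: torch.Tensor, labels: torch.Tensor) -> int:
    """Number of samples whose arg-max class equals the label."""
    predicted = logits.argmax(dim=1)
    return (predicted == labels).sum().item()

# Two hypothetical batches of logits over 3 classes
batches = [
    (torch.tensor([[2.0, 0.1, 0.3], [0.2, 1.5, 0.1]]), torch.tensor([0, 1])),
    (torch.tensor([[0.1, 0.2, 3.0], [1.0, 0.9, 0.8]]), torch.tensor([2, 1])),
]

correct, total = 0, 0
for logits, labels in batches:
    correct += count_correct(logits, labels)
    total += labels.shape[0]

accuracy = correct / total
print(f"{correct}/{total} correct, accuracy = {accuracy:.2f}")  # 3/4 correct, accuracy = 0.75
```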

In [None]:
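def compute_accuracy(prediction: torch.Tensor, label: torch.Tensor) -> int:
    """Count the correct predictions in a batch.

    The returned value is the number of correctly classified samples, not a ratio:
    divide it by the number of samples to obtain the accuracy.
    """
    # The predicted class is the one with the highest score
    _, predicted = prediction.max(dim=1)

    return (predicted == label).sum().item()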

### Gain function

Definition of the **gain**: $$ G = acc_{uda} - acc_{so} $$
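
As a worked example, the sketch below computes the gain implied by the minimum accuracies required by the assignment. These values are the thresholds from the table above, not measured results, and the `gain` helper is only an illustrative name.

```python
def gain(acc_uda: float, acc_so: float) -> float:
    """Gain of the adapted model over the source-only baseline."""
    return acc_uda - acc_so

# Minimum accuracies required by the assignment (thresholds, not measured results)
minimum_required = {
    "P -> RW": {"source_only": 0.76, "uda": 0.80},
    "RW -> P": {"source_only": 0.90, "uda": 0.93},
}

for direction, acc in minimum_required.items():
    g = gain(acc["uda"], acc["source_only"])
    print(f"{direction}: minimum gain = {100 * g:.1f} percentage points")
```

A positive gain means that the UDA component improved over the source-only baseline.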

In [None]:
def compute_gain(acc_uda, acc_so) -> float:
    return acc_uda - acc_so

### Logger function

Function to log the results in the notebook.

In [None]:
def make_log_print(status:str = "Train", epoch:tuple = None, timer:float = None, metrics: dict = None, *args, **kwargs) -> None:
    def _convert_seconds_to_h_m_s(seconds):
        m, s = divmod(seconds, 60)
        h, m = divmod(m, 60)
        h, m, s = int(h), int(m), int(s)

        # Zero-pad minutes and seconds so that, e.g., 1 h 5 min 3 s reads 1:05:03
        if h > 0:
            _ret = f"{h}:{m:02d}:{s:02d} hh:mm:ss"
        else:
            if m > 0:
                _ret = f"{m}:{s:02d} mm:ss"
            else:
                _ret = f"{s} seconds"
          
        return _ret
    
    _string = " " + status + " "
    if epoch is not None:
        _actual_epoch = epoch[0]
        _total_epoch = epoch[1]

        _string = " " + status + " - Epoch " + str(_actual_epoch) + "/" + str(_total_epoch) + " " 
      
    # Convert seconds of timer to a readable string
    if timer is not None:
        _chrono = _convert_seconds_to_h_m_s(timer)

    # Compute the estimated time remaining (it needs the epoch information)
    if timer is not None and epoch is not None:
        _eta = (_total_epoch - _actual_epoch) * timer/_actual_epoch 
        _eta = _convert_seconds_to_h_m_s(_eta)

    
    print(f"{_string:=^50}")
    if metrics is not None:
        if "train/loss_source" in metrics.keys():
            print(f"  Training loss source {metrics['train/loss_source']:.3f}, Training accuracy source {metrics['train/accuracy_source']:.3f}")
            print(f"  Training loss target {metrics['train/loss_target']:.3f}, Training accuracy target {metrics['train/accuracy_target']:.3f}")
            print(f"  Discrepancy {metrics['train/discrepancy']:.3f}, Discrepancy generator {metrics['train/loss_features']:.3f}")
        elif "train/loss" in metrics.keys():
            print(f"  Training loss {metrics['train/loss']:.3f}, Training accuracy {metrics['train/accuracy']:.3f}")
        if "eval/accuracy_source" in metrics.keys():
            print(f"  Test loss {metrics['eval/loss']:.3f}, Test accuracy source {metrics['eval/accuracy_source']:.3f}, Test accuracy target {metrics['eval/accuracy_target']:.3f}, Test accuracy total {metrics['eval/accuracy']:.3f}")
        elif "eval/loss" in metrics.keys():
            print(f"  Test loss {metrics['eval/loss']:.3f}, Test accuracy {metrics['eval/accuracy']:.3f}")

    if timer is not None and epoch is not None:
        print(f"  Time/epoch: {timer/_actual_epoch:.2f} s - Time elapsed: {_chrono} - Time remaining: {_eta}")

    if args is not None:
        for arg in args:
            for k, v in arg.items():
                print(f"  {k}: {v}")

    if kwargs is not None:
        for k, v in kwargs.items():
            print(f"  {k}: {v}")
      
    _string = " End " + status.lower() + " "
    if epoch is not None:
        _string = " End " + status.lower() + " " + str(_actual_epoch) + "/" + str(_total_epoch) + " "

    print(f"{_string:=^50}")

### Optimizer, Criterion and Scheduler

These three functions are used to create the optimizer, the criterion and the scheduler. It is possible to change their parameters by passing them as keyword arguments in the function call.

#### Example

```python
optimizer = get_optimizer(model, optimizer='sgd', lr=0.001, momentum=0.9)
```
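
To get a feeling for the two schedulers, the sketch below evaluates their closed forms with the default constants of this notebook: `LambdaLR` scales the base learning rate by `lr_lambda(epoch)`, while `StepLR` multiplies it by `gamma` every `step_size` epochs. The functions `lambda_lr` and `step_lr` are illustrative helpers, not part of the notebook.

```python
BASE_LR = 1e-1   # same value as DEFAULT_LEARNING_RATE
GAMMA = 0.1      # same value as DEFAULT_GAMMA
STEP_SIZE = 1    # same value as DEFAULT_STEP_SIZE

def lambda_lr(epoch: int) -> float:
    """Closed form of LambdaLR with lr_lambda = 0.9 ** epoch."""
    return BASE_LR * 0.9 ** epoch

def step_lr(epoch: int) -> float:
    """Closed form of StepLR: multiply by gamma every step_size epochs."""
    return BASE_LR * GAMMA ** (epoch // STEP_SIZE)

for epoch in range(4):
    print(f"epoch {epoch}: lambda = {lambda_lr(epoch):.4f}, step = {step_lr(epoch):.4f}")
```

With the default `step_size` of 1 and `gamma` of 0.1, the step scheduler shrinks the learning rate tenfold at every epoch, much faster than the default lambda schedule.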

In [None]:
def get_optimizer(model: nn.Module, optimizer: str = 'adam', *args, **kwargs) -> torch.optim.Optimizer:
    _lr = DEFAULT_LEARNING_RATE
    _weight_decay = DEFAULT_WEIGHT_DECAY 
    _momentum = DEFAULT_MOMENTUM
    if "lr" in kwargs:
        _lr = kwargs["lr"]
    if "weight_decay" in kwargs:
        _weight_decay = kwargs["weight_decay"]

    if optimizer == 'adam':
        return torch.optim.Adam(model.parameters(), lr=_lr, weight_decay=_weight_decay)
    elif optimizer == 'sgd':
        if "momentum" in kwargs:
            _momentum = kwargs["momentum"]
        return torch.optim.SGD(model.parameters(), lr=_lr, momentum=_momentum, weight_decay=_weight_decay)
    else:
        raise Exception(f"Optimizer {optimizer} not implemented.")

def get_criterion(criterion: str = 'cross_entropy', *args, **kwargs) -> nn.Module:
    if criterion == 'cross_entropy':
        return nn.CrossEntropyLoss()
    elif criterion == 'nll':
        return nn.NLLLoss()
    else:
        raise Exception(f"Criterion {criterion} not implemented.")

def get_scheduler(optimizer: torch.optim.Optimizer, scheduler: str = 'lambda', *args, **kwargs) -> torch.optim.lr_scheduler._LRScheduler:
    _lr_lambda = lambda epoch: 0.9 ** epoch
    _step_size = DEFAULT_STEP_SIZE
    _gamma = DEFAULT_GAMMA

    if "lr_lambda" in kwargs:
        _lr_lambda = kwargs["lr_lambda"]
    if "step_size" in kwargs:
        _step_size = kwargs["step_size"]
    if "gamma" in kwargs:
        _gamma = kwargs["gamma"]

    if scheduler == 'lambda':
        return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=_lr_lambda)
    elif scheduler == 'step':
        return torch.optim.lr_scheduler.StepLR(optimizer, step_size=_step_size, gamma=_gamma)
    else:
        raise Exception(f"Scheduler {scheduler} not implemented.")

### Visualizer

Functions to visualize the results of the training: they print the confusion matrix and some performance metrics, and plot all the relevant metrics in the notebook.

#### `plot_tsne`

This function plots the t-SNE representation of the features extracted by the model. It is used to check whether the learned features separate the classes, that is, whether samples of the same class end up close to each other.

#### `visualize_results`

This function plots the results of the training. It plots 
 - The loss and the accuracy of the training and testing sets
 - The confusion matrix (on both multiclass and single class)
 - The t-SNE representation of the features extracted by the model
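
For the single-class view of the confusion matrix, one standard convention is to treat each class in turn as the positive one and sum the counts over all classes. The sketch below follows this convention on made-up labels; `one_vs_rest_counts` is an illustrative helper and the notebook's own cell may count differently.

```python
from collections import Counter

def one_vs_rest_counts(labels, predictions, num_classes):
    """Sum TP, FP, FN and TN over all classes, treating each class as positive in turn."""
    counts = Counter(TP=0, FP=0, FN=0, TN=0)
    for y_true, y_pred in zip(labels, predictions):
        if y_pred == y_true:
            counts["TP"] += 1                # the correct class
            counts["TN"] += num_classes - 1  # every other class is correctly rejected
        else:
            counts["FP"] += 1                # the wrongly predicted class
            counts["FN"] += 1                # the missed true class
            counts["TN"] += num_classes - 2  # the remaining classes
    return counts

# Hypothetical labels and predictions over 3 classes
labels = [0, 1, 2, 2]
predictions = [0, 2, 2, 1]
counts = one_vs_rest_counts(labels, predictions, num_classes=3)
print(dict(counts))  # {'TP': 2, 'FP': 2, 'FN': 2, 'TN': 6}
```

The four counts always sum to the number of samples times the number of classes, which is a quick sanity check for this kind of table.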

In [None]:
def plot_tsne(
        feature_model: nn.Module = None,
        classifier_model_s: nn.Module = None,
        classifier_model_t: nn.Module = None,
        dataset: DataLoader = None,
        # all_predictions: np.ndarray = None,
        # all_labels: np.ndarray = None,
        device: torch.device = DEFAULT_DEVICE,
        save_path: str = None,
        *args, 
        **kwargs
    ) -> None:
    """
    Plot t-SNE of features.
    """
    feature_model.to(device)
    feature_model.eval()
    classifier_model_s.to(device)
    classifier_model_s.eval()
    if classifier_model_t is not None:
        classifier_model_t.to(device)
        classifier_model_t.eval()
        all_predictions_t = []

    all_predictions = []
    all_predictions_s = []
    all_labels = []
    n_samples = 0

    # Gradients are not needed to extract features and predictions
    with torch.no_grad():
        for batch in dataset:
            images, labels = batch
            images = images.to(device)
            labels = labels.to(device)
            features = feature_model(images)
            predictions_s = classifier_model_s(features, features_layer = True)

            if classifier_model_t is not None:
                predictions_t = classifier_model_t(features, features_layer = True)
                all_predictions_t.append(predictions_t.detach().cpu().numpy())

            all_predictions.append(features.detach().cpu().numpy())
            all_predictions_s.append(predictions_s.detach().cpu().numpy())

            all_labels.append(labels.detach().cpu().numpy())
            n_samples += images.shape[0]
    
    all_predictions = np.concatenate(all_predictions, axis=0)
    all_predictions_s = np.concatenate(all_predictions_s, axis=0)
    all_labels = np.concatenate(all_labels, axis=0)

    if classifier_model_t is not None:
        all_predictions_t = np.concatenate(all_predictions_t, axis=0)

    # Get tsne
    tsne = TSNE(n_components=2, perplexity=DEFAULT_TSNE_PERPLEXITY)
    tsne_results = tsne.fit_transform(all_predictions)

    # Plot
    plt.figure(figsize=(10, 10))
    plt.scatter(tsne_results[:, 0], tsne_results[:, 1], c=all_labels, cmap='tab20')  # 20 classes need 20 distinct colors
    plt.colorbar()
    if save_path is not None:
        plt.savefig(save_path+"_tsne_resnet.png")
    else:
        plt.show()

    # clear plot
    plt.clf()
    tsne = TSNE(n_components=2, perplexity=DEFAULT_TSNE_PERPLEXITY)
    tsne_results_s = tsne.fit_transform(all_predictions_s)
    plt.figure(figsize=(10, 10))
    plt.scatter(tsne_results_s[:, 0], tsne_results_s[:, 1], c=all_labels, cmap='tab20')  # 20 classes need 20 distinct colors
    plt.colorbar()
    if save_path is not None:
        plt.savefig(save_path+"_tsne_source.png")
    else:
        plt.show()
    
    if classifier_model_t is not None:
        plt.clf()
        tsne = TSNE(n_components=2, perplexity=DEFAULT_TSNE_PERPLEXITY)
        tsne_results_t = tsne.fit_transform(all_predictions_t)
        plt.figure(figsize=(10, 10))
        plt.scatter(tsne_results_t[:, 0], tsne_results_t[:, 1], c=all_labels, cmap='tab20')  # 20 classes need 20 distinct colors
        plt.colorbar()
        if save_path is not None:
            plt.savefig(save_path+"_tsne_target.png")
        else:
            plt.show()


def visualize_results(
        feature_model: nn.Module,
        classifier_model_s: nn.Module,
        eval_dataset: DataLoader,
        classifier_model_t: nn.Module = None,
        metrics: dict = None,
        device: torch.device = DEFAULT_DEVICE,
        save_path: str = None,
        *args, 
        **kwargs
    ) -> None:
    feature_model.to(device)
    feature_model.eval()

    classifier_model_s.to(device)
    classifier_model_s.eval()

    if classifier_model_t is not None:
        classifier_model_t.to(device)
        classifier_model_t.eval()

    ### Init variables
    all_features = []
    all_predictions = []
    all_labels = []
    n_samples = 0
    
    ### Populate confusion matrix multi-class
    for batch in eval_dataset:
        images, labels = batch
        images = images.to(device)
        labels = labels.to(device)
        features = feature_model(images)
        outputs = classifier_model_s(features)
        if classifier_model_t is not None:
            outputs += classifier_model_t(features)
        _, y = torch.max(outputs, dim=1)
        for _f, _label, _pred in zip(features, labels, y):
            all_features.append(_f.cpu().detach().numpy())
            all_predictions.append(_pred.cpu().detach().numpy().item())
            all_labels.append(_label.cpu().detach().numpy().item())
        n_samples += images.shape[0]

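    # Keep the class names ordered by label index, so that they match the rows of the classification report
    all_classes = [DEFAULT_CLASSES[i] for i in sorted(set(all_labels + all_predictions))]
    plt.figure(0, figsize=(15, 15))
    # Passing all the labels keeps the matrix 20x20 even if some classes are missing from the data
    confusion_m = pd.DataFrame(confusion_matrix(all_labels, all_predictions, labels=list(range(DEFAULT_NUM_CLASSES)), normalize='true'),index=DEFAULT_CLASSES,columns=DEFAULT_CLASSES)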
    sns.heatmap(confusion_m, cmap='icefire', annot = True, xticklabels=True, yticklabels=True, linewidths=0.3)
    plt.title('Confusion matrix', fontsize=14)
    if save_path is not None:
        plt.savefig(save_path+"_confusion_matrix.png")
    else:
        plt.show()
    
    ### Populate confusion matrix single-class
    additional_infos = {
        "TP": 0,
        "TN": 0,
        "FP": 0,
        "FN": 0,
    }

    for y_pred, y_true in zip(all_predictions, all_labels):
        if y_pred == y_true:
            # One true positive for the correct class, true negatives for all the others
            additional_infos["TP"] += 1
            additional_infos["TN"] += DEFAULT_NUM_CLASSES - 1
        else:
            # One false positive (predicted class), one false negative (true class),
            # true negatives for the remaining classes
            additional_infos["FP"] += 1
            additional_infos["FN"] += 1
            additional_infos["TN"] += DEFAULT_NUM_CLASSES - 2
    
    print()
    print(
        pd.DataFrame(
            [[additional_infos["TP"], additional_infos["FN"]],
             [additional_infos["FP"], additional_infos["TN"]]],
            index=["Actual Positive", "Actual Negative"],
            columns=["Predicted Positive", "Predicted Negative"])
    )
    print()
        
    ### Print the report for all classes
    print(classification_report(all_labels, all_predictions, zero_division=0, target_names=all_classes))
    
    # Plot the metrics 
    if metrics is not None:
        if "train/loss_source" in metrics.keys():
            # Plot loss on source and target together
            # create a subplot with 1 row and 3 columns
            fig, axs = plt.subplots(1, 3, figsize=(15, 5))
            fig.suptitle('Training')
            axs[0].plot(metrics["train/loss_source"], label="Source")
            axs[0].plot(metrics["train/loss_target"], label="Target")
            axs[0].plot(metrics["eval/loss"], label="Test")
            axs[0].set_title('Loss')
            axs[0].set_xlabel('Epoch')
            axs[0].set_ylabel('Loss')
            axs[0].legend()
            axs[1].plot(metrics["train/accuracy_source"], label="Source Train")
            axs[1].plot(metrics["train/accuracy_target"], label="Target Train")
            axs[1].plot(metrics["eval/accuracy_source"], label="Source Test")
            axs[1].plot(metrics["eval/accuracy_target"], label="Target Test")
            axs[1].plot(metrics["eval/accuracy"], label="Total Test")
            axs[1].set_title('Accuracy')
            axs[1].set_xlabel('Epoch')
            axs[1].set_ylabel('Accuracy')
            axs[1].legend()
            axs[2].plot(metrics["train/discrepancy"], label="Discrepancy")
            axs[2].plot(metrics["train/loss_features"], label="Discrepancy generator")
            axs[2].set_title('Discrepancy')
            axs[2].set_xlabel('Epoch')
            axs[2].set_ylabel('Discrepancy')
            axs[2].legend()
            if save_path is not None:
                plt.savefig(save_path+"_metrics.png")
            else:
                plt.show()

        elif "train/loss" in metrics.keys():
            # Plot loss
            fig, axs = plt.subplots(1, 2, figsize=(10, 5))
            fig.suptitle('Training')
            axs[0].plot(metrics["train/loss"], label="Train")
            axs[0].plot(metrics["eval/loss"], label="Test")
            axs[0].set_title('Loss')
            axs[0].set_xlabel('Epoch')
            axs[0].set_ylabel('Loss')
            axs[0].legend()
            axs[1].plot(metrics["train/accuracy"], label="Train")
            axs[1].plot(metrics["eval/accuracy"], label="Test")
            axs[1].set_title('Accuracy')
            axs[1].set_xlabel('Epoch')
            axs[1].set_ylabel('Accuracy')
            axs[1].legend()
            if save_path is not None:
                plt.savefig(save_path+"_metrics.png")
            else:
                plt.show()
            
        elif "eval/loss" in metrics.keys() and "train/loss" not in metrics.keys():
            # Plot loss
            fig, axs = plt.subplots(1, 2, figsize=(10, 5))
            fig.suptitle('Test')
            axs[0].plot(metrics["eval/loss"], label="Loss")
            axs[0].set_title('Loss')
            axs[0].set_xlabel('Epoch')
            axs[0].set_ylabel('Loss')
            axs[0].legend()
            axs[1].plot(metrics["eval/accuracy"], label="Accuracy")
            axs[1].set_title('Accuracy')
            axs[1].set_xlabel('Epoch')
            axs[1].set_ylabel('Accuracy')
            axs[1].legend()
            if save_path is not None:
                plt.savefig(save_path+"_eval.png")
            else:
                plt.show()
        else:
            print("No metrics to plot")
        
    # Plot t-SNE
    plot_tsne(
        feature_model=feature_model,
        classifier_model_s=classifier_model_s,
        classifier_model_t=classifier_model_t,
        dataset=eval_dataset,
        device=device,
        save_path=save_path)

## Maximum Classifier Discrepancy for Unsupervised Domain Adaptation (MCD)

### Theoretical idea of the paper
The idea of the paper is to align the distributions of the source domain dataset and the target domain dataset by utilizing the task-specific decision boundaries. 

The paper presents 3 main components: 
- 1 generator: used to extract features from the inputs
- 2 classifiers (acting as discriminators): used for the classification part, since their role is to classify elements given their features.

There are 2 goals, and the proposed method tries to reach both of them:
1. Maximize the discrepancy between the 2 classifiers' outputs to detect target samples that are far from the support of the source.
2. Train the feature extractor to generate target features near the support, so as to minimize the discrepancy.
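
The discrepancy used in the MCD paper is the mean absolute difference between the softmax outputs of the two classifiers. The sketch below shows this measure on random logits; the function name `discrepancy` and the tensors are illustrative, and the exact loss used later in this notebook is not shown here.

```python
import torch
import torch.nn.functional as F

def discrepancy(logits_1: torch.Tensor, logits_2: torch.Tensor) -> torch.Tensor:
    """Mean absolute difference between the two classifiers' softmax outputs."""
    probs_1 = F.softmax(logits_1, dim=1)
    probs_2 = F.softmax(logits_2, dim=1)
    return torch.mean(torch.abs(probs_1 - probs_2))

# Hypothetical logits of two classifiers on a batch of 4 target samples, 20 classes
torch.manual_seed(0)
logits_a = torch.randn(4, 20)
logits_b = torch.randn(4, 20)

print(discrepancy(logits_a, logits_a))  # identical predictions give zero discrepancy
print(discrepancy(logits_a, logits_b))  # disagreeing classifiers give a positive value
```

In goal 1 the classifiers are updated to increase this value on target samples, while in goal 2 the generator is updated to decrease it.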

<p align="center">
    <img src="https://drive.google.com/uc?id=13hBuS7j2ynA8C7PJ_N3xun0gnjvn08V_" width=800px>
    <figcaption align = "center"><b>Theoretical Idea of MCD Method</b></figcaption>
</p>

### Feature extractor

In the paper, the authors implemented their own architecture for the generator. However, due to computational constraints, we could not use their proposed architectures: the training required hundreds of epochs. Thus, we came up with the idea of using a Resnet as feature extractor, removing the last linear layer so that it outputs only features. 

In the class below we implemented a pretrained Resnet, with the possibility of choosing between `Resnet-18`, `Resnet-50` or `Resnet-101` with their corresponding pretrained weights. 
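
Without its final linear layer, a torchvision Resnet ends with global average pooling, so its output has shape `(N, C, 1, 1)`. The sketch below compares two ways of turning it into a 2D tensor on a hypothetical batch of one image; `pooled` is a random stand-in for the backbone output.

```python
import torch

# Hypothetical pooled backbone output for a single image: (N=1, C=512, H=1, W=1)
pooled = torch.randn(1, 512, 1, 1)

squeezed = pooled.squeeze()           # drops every size-1 dimension, including the batch one
flattened = torch.flatten(pooled, 1)  # keeps the batch dimension and flattens the rest

print(squeezed.shape)   # torch.Size([512])
print(flattened.shape)  # torch.Size([1, 512])
```

With batches of more than one sample both calls give the same `(N, C)` result; they differ only when a batch contains a single image, for instance the last batch of an epoch.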

In [None]:
def get_resnet_model(model: str = 'resnet50', *args, **kwargs) -> nn.Module:
    if model == 'resnet50':
        return models.resnet50(weights="IMAGENET1K_V2")
    elif model == 'resnet101':
        return models.resnet101(weights="IMAGENET1K_V2")
    elif model == 'resnet18':
        return models.resnet18(weights="IMAGENET1K_V1")
    else:
        raise Exception(f"Model {model} not implemented.")

class FeatureModel(nn.Module):
    def __init__(self, resnet_version: str = 'resnet50') -> None:
        super(FeatureModel, self).__init__()

        #Define the pretrained architecture
        resnet = get_resnet_model(resnet_version)

        # Save the output size of the last layer because we will need it later
        self.output_size = resnet.fc.in_features

        # Define the new architecture by removing the last layer
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
            
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The backbone outputs a 4D tensor (N, C, 1, 1) and we want a 2D one (N, C).
        # Unlike squeeze(), flatten keeps the batch dimension even when N == 1.
        x = torch.flatten(self.backbone(x), start_dim=1)
        
        return x
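
Without its final linear layer, the backbone returns a 4D tensor of shape `(N, C, 1, 1)`, which has to become `(N, C)` before it reaches the classifier. The sketch below uses stand-in tensors rather than the real backbone and compares two ways of dropping the trailing dimensions. The two only differ for a batch of a single image, which can happen with the last batch of a loader.

```python
import torch

# Stand-ins for the backbone output: global average pooling yields
# (batch, channels, 1, 1); 512 channels as in Resnet-18.
single = torch.zeros(1, 512, 1, 1)
batch = torch.zeros(4, 512, 1, 1)

# squeeze() without arguments drops every size-1 dimension, batch included.
squeezed_single = single.squeeze()
squeezed_batch = batch.squeeze()

# flatten(start_dim=1) keeps the batch dimension whatever its size.
flat_single = torch.flatten(single, start_dim=1)
flat_batch = torch.flatten(batch, start_dim=1)

print(squeezed_single.shape, flat_single.shape)
print(squeezed_batch.shape, flat_batch.shape)
```

With a batch of one, `squeeze()` returns a 1D tensor, which a `BatchNorm1d`/`Linear` head would no longer interpret as a batch.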

### Classifier

The classifier is a simple 3-layer neural network. Its input size equals the output size of the feature extractor and its output size is the number of classes in the dataset. Each layer consists of a linear layer followed by a batch normalization layer and a ReLU activation function. Finally, there is a dropout layer to reduce overfitting.

Parameters (`input_size` has no default and is taken from the feature extractor):
 - `input_size`: 512 (Resnet-18)
 - `hidden_dims`: (256, 128)
 - `num_classes`: 20 (`DEFAULT_NUM_CLASSES`)
 - `dropout_pr`: 0.1

In [None]:
class Predictor(nn.Module):
    def __init__(self, input_size:int, num_classes:int = DEFAULT_NUM_CLASSES, hidden_dims: tuple = (256, 128), dropout_pr: float = 0.1) -> None:
        super(Predictor, self).__init__()
        
        # Build the model
        self.fc1 = nn.Linear(input_size, hidden_dims[0])
        self.batch_norm1 = nn.BatchNorm1d(hidden_dims[0])

        self.fc2 = nn.Linear(hidden_dims[0], hidden_dims[1])
        self.batch_norm2 = nn.BatchNorm1d(hidden_dims[1])

        self.fc3 = nn.Linear(hidden_dims[1], num_classes)
        self.batch_norm3 = nn.BatchNorm1d(num_classes)

        self.relu = nn.ReLU()

        self.dropout = nn.Dropout(dropout_pr)

    def forward(self, x: torch.Tensor, features_layer: bool = False) -> torch.Tensor:
        x = self.fc1(x)
        x = self.batch_norm1(x)
        x = self.relu(x)

        x = self.fc2(x)
        x = self.batch_norm2(x)
        x = self.relu(x)
        
        if features_layer:
            return x

        x = self.fc3(x)
        x = self.batch_norm3(x)
        x = self.relu(x)

        x = self.dropout(x)

        return x
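
As a quick shape check, the layer stack described above can be mirrored with `nn.Sequential`. This is only an illustrative sketch, assuming Resnet-18 features (512 values) and 20 classes. `head` is a hypothetical name, not the class used in the notebook.

```python
import torch
from torch import nn

num_classes = 20  # value of DEFAULT_NUM_CLASSES according to the list above

# Same sequence of layers as the classifier: 3 x (Linear, BatchNorm, ReLU) + Dropout
head = nn.Sequential(
    nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(),
    nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(),
    nn.Linear(128, num_classes), nn.BatchNorm1d(num_classes), nn.ReLU(),
    nn.Dropout(0.1),
)

# A batch of 8 random feature vectors, as if coming from Resnet-18
features = torch.randn(8, 512)

head.eval()
with torch.no_grad():
    out = head(features)

print(out.shape)
```

The output has one row per sample and one column per class.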


## 1) Baseline Approach

To define our baseline we use the architecture described in the paper, trained in a supervised way on the source domain only, without any UDA technique.

In the paper's method, the UDA part comes into play when the 2 classifiers are trained to maximize their discrepancy and the feature extractor is trained to minimize it. For this reason we implement our baseline as follows.
- Architecture:
    - 1 generator for feature extractor: we used a pretrained Resnet-18
    - 1 classifier for the classification task
- Method:
    1. Train in a supervised way on the source domain; only the Classifier is trained
    2. Test on the target domain
   <!-- , by generating feature from the Features extractor, and computing the classification task through the classifier -->
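
The recipe above, where only the classifier is updated, can be sketched on toy modules. This is a minimal sketch, not the notebook's training loop: `backbone` and `classifier` are hypothetical stand-ins for the pretrained Resnet-18 and the classifier. Wrapping the feature extraction in `torch.no_grad()` is our own assumption of an optional saving, since handing only the classifier's parameters to the optimizer is already enough to leave the backbone untouched.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins: a tiny "backbone" and a linear classifier
backbone = nn.Linear(16, 8)
classifier = nn.Linear(8, 3)

# Only the classifier's parameters are handed to the optimizer
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# Random toy data: 32 samples, 3 classes
x = torch.randn(32, 16)
y = torch.randint(0, 3, (32,))

backbone_before = backbone.weight.detach().clone()
classifier_before = classifier.weight.detach().clone()

backbone.eval()
for _ in range(5):
    # no_grad() keeps the backbone out of the autograd graph entirely
    with torch.no_grad():
        features = backbone(x)
    loss = criterion(classifier(features), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

backbone_unchanged = torch.equal(backbone.weight, backbone_before)
classifier_changed = not torch.equal(classifier.weight, classifier_before)
print(backbone_unchanged, classifier_changed)
```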

Results are discussed after the run for $P \rightarrow RW$ and $RW \rightarrow P$ 

### Train and Test functions

In [None]:
def train_one_step_baseline(
        feature_model: nn.Module, 
        source_model: nn.Module,
        optimizer: torch.optim.Optimizer, 
        criterion: nn.Module, 
        dataset: DataLoader,
        device: str = DEFAULT_DEVICE,
    ) -> dict:
    # Make sure model is in training mode
    source_model.train()
    
    # Initialize variables
    n_samples = 0
    cumulative_loss = 0
    cumulative_accuracy = 0

    # Train loop
    for img, label in dataset:
        # Move data to device
        img, label = img.to(device), label.to(device)

        # Forward pass
        features = feature_model(img)
        out = source_model(features)
        
        # Compute loss
        loss : torch.Tensor = criterion(out, label)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        #Compute cumulative accuracy and cumulative loss
        cumulative_accuracy += compute_accuracy(out, label)
        cumulative_loss += loss.item()
        n_samples += img.shape[0]

    return {
        'train/loss': cumulative_loss/n_samples, 
        'train/accuracy': cumulative_accuracy/n_samples,
    }

def eval_one_step_baseline(
        feature_model: nn.Module, 
        source_model: nn.Module,
        criterion: nn.Module, 
        dataset: DataLoader,
        device: str = DEFAULT_DEVICE,
    ) -> dict:

    # Make sure model is in evaluation
    source_model.eval()
    
    # Initialize variables
    n_samples = 0
    cumulative_loss = 0
    cumulative_accuracy = 0
    # Eval loop: gradients are not needed here, so we disable them to save memory
    with torch.no_grad():
        for img, label in dataset:
            # Move data to device
            img, label = img.to(device), label.to(device)

            # Forward pass
            features = feature_model(img)
            out = source_model(features)

            # Compute loss
            loss : torch.Tensor = criterion(out, label)

            #Compute cumulative accuracy and cumulative loss
            cumulative_accuracy += compute_accuracy(out, label)
            cumulative_loss += loss.item()
            n_samples += img.shape[0]

    return {
        'eval/loss': cumulative_loss/n_samples,
        'eval/accuracy': cumulative_accuracy/n_samples,
    }
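
A note on how the returned values are scaled. The sketch below assumes that `criterion` returns the mean loss of the batch (the default reduction of PyTorch's `CrossEntropyLoss`) and uses made-up batch losses and sizes. Under that assumption, summing the batch means and dividing by the number of samples gives a value that is smaller than the per-sample loss by roughly a factor of the batch size.

```python
# Hypothetical per-batch mean losses and batch sizes (the last batch is smaller)
batch_losses = [0.64, 0.60, 0.50]
batch_sizes = [32, 32, 16]

n_samples = sum(batch_sizes)

# What the loops accumulate: sum of batch means divided by the sample count
reported = sum(batch_losses) / n_samples

# Per-sample average: weight each batch mean by its batch size first
per_sample = sum(l * n for l, n in zip(batch_losses, batch_sizes)) / n_samples

print(round(reported, 5), round(per_sample, 5))
```

Both quantities evolve in the same way during training, so the curves keep their shape. Only the absolute scale of the reported loss differs.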

In [None]:
def train_baseline(
        feature_model: nn.Module,
        source_model: nn.Module,
        optimizer: torch.optim.Optimizer,
        criterion: nn.Module,
        train_dataset: DataLoader,
        eval_dataset: DataLoader,
        scheduler = None,
        n_epochs: int = DEFAULT_EPOCHS_BASELINE,
        device: str = DEFAULT_DEVICE,
        save_name: str = 'baseline',  # the '.pth' extension is appended when saving
    ) -> float:
    # Move models to device
    feature_model.to(device)
    source_model.to(device)

    # Make sure feature model is in evaluation - we don't want to train it here
    feature_model.eval()
    
    # Initialize variables for best model saving
    best_accuracy = 0
    best_model = None

    # Initialize variables
    start_timer = time.time()

    # Store metrics to visualize
    metrics_visualize = {
        'train/loss': [],
        'train/accuracy': [],
        'eval/loss': [],
        'eval/accuracy': [],
    }
    
    # Train loop
    for epoch in range(n_epochs):
        # Train
        train_results = train_one_step_baseline(feature_model, source_model, optimizer, criterion, train_dataset, device)
        # Eval
        eval_results = eval_one_step_baseline(feature_model, source_model, criterion, eval_dataset, device)

        # Combine results into one dictionary
        metrics = {**train_results, **eval_results}

        # Log results in wandb
        wandb.log(metrics)

        # Save best model if eval accuracy is better
        if eval_results['eval/accuracy'] > best_accuracy:
            best_accuracy = eval_results['eval/accuracy']
            best_model = copy.deepcopy(source_model)
            
            # Although it can take some computational time, we save the model every time
            # it improves to make sure we don't lose it in case of a crash
            torch.save(best_model.state_dict(), os.path.join(LOG_PATH_MODELS, save_name+'.pth'))

        # Print results to console
        make_log_print("Train", (epoch+1, n_epochs), time.time()-start_timer, metrics)

        # Save metrics to visualize
        for key, value in metrics.items():
            metrics_visualize[key].append(value)

        # Step scheduler 
        if scheduler is not None:
            scheduler.step()
    
    # Visualize results like loss, accuracy, confusion matrix and t-sne
    visualize_results(
        feature_model=feature_model,
        classifier_model_s=best_model,
        eval_dataset=eval_dataset,
        metrics=metrics_visualize, 
        device=device, 
        save_path=os.path.join(LOG_PATH_IMAGES, save_name))

    # Return best accuracy since we want to compare it with other models
    # No need to return best model since we save it 
    return best_accuracy
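
Since only the `state_dict` of the best classifier is written to disk, using it later requires building the same architecture first and then loading the weights into it. A minimal round-trip sketch, assuming a stand-in `nn.Linear` model and a temporary directory instead of `LOG_PATH_MODELS`:

```python
import os
import tempfile

import torch
from torch import nn

# Hypothetical stand-in for the trained classifier
model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "baseline_example.pth")
    torch.save(model.state_dict(), path)

    # A fresh model with the same architecture receives the saved weights
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(path, map_location="cpu"))

same_weights = torch.equal(model.weight, restored.weight)
print(same_weights)
```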

### Runs

We will evaluate the baseline in 2 directions w.r.t. the dataset:
- $P → RW$ : training on the product domain and testing the model on the real world domain
- $RW → P$ : training on the real world domain and testing the model on the product domain

### $P \rightarrow RW$

In [None]:
# Variable definition
feature_model = FeatureModel(resnet_version='resnet18')
source_model = Predictor(input_size=feature_model.output_size)
optimizer = get_optimizer(source_model, optimizer='sgd', lr=DEFAULT_LEARNING_RATE)
criterion = get_criterion(criterion='cross_entropy')

# Initialize wandb
experiment_name = 'Baseline_P_RW'

wandb.init(
    # Set the project where this run will be logged
    project=DEFAULT_WANDB_PROJECT_NAME, 
    # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
    name=experiment_name, 
    # Track hyperparameters and run metadata
    config={
        "feature_model": feature_model.__class__.__name__,
        "source_model": source_model.__class__.__name__,
        "source_optimizer": optimizer.__class__.__name__,
        "criterion": criterion.__class__.__name__,
        "training_dataset": "Products",
        "evaluation_dataset": "RealWorld",
        "training_type": "Baseline P-RW",
        "learning_rate": optimizer.param_groups[0]['lr'],
        "batch_size": DEFAULT_BATCH_SIZE,
        "epochs": DEFAULT_EPOCHS_BASELINE,
    }
)

# Train
accuracy_baseline_p_rw = train_baseline(
    feature_model=feature_model,
    source_model=source_model,
    optimizer=optimizer,
    criterion=criterion,
    train_dataset=train_loader_products,
    eval_dataset=test_loader_reals,
    save_name=experiment_name
    )

# Close wandb
wandb.finish()

# Delete variables to free memory
del feature_model, source_model, optimizer, criterion, experiment_name

# Clear memory
if DEFAULT_DEVICE.type == 'cuda':
    torch.cuda.empty_cache()

# Print results
print('Accuracy baseline P-RW: {:.2f}'.format(accuracy_baseline_p_rw*100))

#### Results
Here we report the results of the baseline on the $P \rightarrow RW$ direction.

##### Final results

<center>

| | Accuracy | Loss | 
| --- | :---: | --- |
| Train | 88.9 % | 0.002 | 
| Test | 80.7 % | 0.004 |

</center>

##### Metrics

<center>

| | precision | recall | f1-score | support |
| --- | :---: | :---: | :---: | :---: |
| backpack | 0.96 | 0.96 | 0.96 | 24 |
| bookcase | 0.87 | 0.87 | 0.87 | 15 |
| car jack | 1.00 | 0.86 | 0.92 | 21 |
| comb | 0.91 | 0.95 | 0.93 | 21 |
| crown | 1.00 | 0.71 | 0.83 | 17 |
| file cabinet | 0.83 | 0.71 | 0.77 | 21 |
| flat iron | 1.00 | 0.73 | 0.84 | 22 |
| game controller | 0.64 | 0.95 | 0.76 | 22 |
| glasses | 0.82 | 0.53 | 0.64 | 17 |
| helicopter | 1.00 | 0.89 | 0.94 | 19 |
| ice skates | 0.93 | 0.70 | 0.80 | 20 |
| letter tray | 0.57 | 1.00 | 0.72 | 17 |
| monitor | 0.81 | 0.63 | 0.71 | 27 |
| mug | 0.89 | 0.94 | 0.91 | 17 |
| network switch | 0.68 | 1.00 | 0.81 | 19 |
| over-ear headphones | 0.52 | 0.94 | 0.67 | 16 |
| pen | 0.77 | 0.85 | 0.81 | 20 |
| purse | 0.94 | 0.71 | 0.81 | 21 |
| stand mixer | 0.78 | 0.61 | 0.68 | 23 |
| stroller | 0.88 | 0.71 | 0.79 | 21 |
| | | | | |
| accuracy | | | 0.81 | 400 |
| macro avg | 0.84 | 0.81 | 0.81 | 400 |
| weighted avg | 0.84 | 0.81 | 0.81 | 400 |

</center>
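
The f1-score column is the harmonic mean of precision and recall. The small sketch below recomputes it for two rows of the table. Because it starts from the rounded values shown there, other rows may differ in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall copied from the table above
game_controller = f1_score(0.64, 0.95)
glasses = f1_score(0.82, 0.53)

print(round(game_controller, 2), round(glasses, 2))
```

The two rows show opposite failure modes: "game controller" has high recall but low precision, so the class is predicted too often, while "glasses" has low recall, so many of its images are assigned to other classes.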

##### Loss and Accuracy curves

<p align="center">
    <img src="https://drive.google.com/uc?id=17MQhezyHGicYNNiS_D43FRtAPVPrddVK" width=400px>
    <img src="https://drive.google.com/uc?id=1k7IXzX4gTDGvcZc6LwOwlvcICVSa9KLr" width=400px>
    <figcaption align = "center"><b>Loss & Accuracy</b></figcaption>
</p>

##### Confusion Matrix

<p align="center">
    <img src="https://drive.google.com/uc?id=1ccL_83f5IGKVjBTQfZ_FWy69c7eryHQK" width=600px>
    <figcaption align = "center"><b>Confusion Matrix</b></figcaption>
</p>

##### t-SNE

<p align="center">
    <img src="https://drive.google.com/uc?id=1MePvXiItwbH8fEM7VrL0hm8IZdHoN_Yf" width=400px>
    <figcaption align = "center"><b>t-SNE</b></figcaption>
</p>

### $RW \rightarrow P$

In [None]:
# Variable definition
feature_model = FeatureModel(resnet_version='resnet18')
source_model = Predictor(input_size=feature_model.output_size)
optimizer = get_optimizer(source_model, optimizer='sgd', lr=DEFAULT_LEARNING_RATE)
criterion = get_criterion(criterion='cross_entropy')

# Initialize wandb
experiment_name = 'Baseline_RW_P'

wandb.init(
    # Set the project where this run will be logged
    project=DEFAULT_WANDB_PROJECT_NAME, 
    # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
    name=experiment_name, 
    # Track hyperparameters and run metadata
    config={
        "feature_model": feature_model.__class__.__name__,
        "source_model": source_model.__class__.__name__,
        "source_optimizer": optimizer.__class__.__name__,
        "criterion": criterion.__class__.__name__,
        "training_dataset": "RealWorld",
        "evaluation_dataset": "Products",
        "training_type": "Baseline RW-P",
        "learning_rate": optimizer.param_groups[0]['lr'],
        "batch_size": DEFAULT_BATCH_SIZE,
        "epochs": DEFAULT_EPOCHS_BASELINE,
    }
)

# Train
accuracy_baseline_rw_p = train_baseline(
    feature_model=feature_model,
    source_model=source_model,
    optimizer=optimizer,
    criterion=criterion,
    train_dataset=train_loader_reals,
    eval_dataset=test_loader_products,
    save_name=experiment_name
    )

# Close wandb
wandb.finish()

# Delete variables to free memory
del feature_model, source_model, optimizer, criterion, experiment_name

# Empty cache
if DEFAULT_DEVICE.type == 'cuda':
    torch.cuda.empty_cache()

# Print results
print('Accuracy baseline RW-P: {:.2f}'.format(accuracy_baseline_rw_p*100))

#### Results

Here we report the results of the baseline on the $RW \rightarrow P$ direction.

##### Final results

<center>

| | Accuracy | Loss |
| --- | :---: | --- |
| Train | 89.3 % | 0.002 |
| Test | 92.2 % | 0.002 |

</center>

##### Metrics

<center>

| | precision | recall | f1-score | support |
| --- | :---: | :---: | :---: | :---: |
| backpack | 0.87 | 0.72 | 0.79 | 18 |
| bookcase | 0.94 | 1.00 | 0.97 | 16 |
| car jack | 1.00 | 0.95 | 0.97 | 19 |
| comb | 0.95 | 1.00 | 0.97 | 19 |
| crown | 0.86 | 0.95 | 0.90 | 19 |
| file cabinet | 0.92 | 0.96 | 0.94 | 24 |
| flat iron | 1.00 | 0.71 | 0.83 | 17 |
| game controller | 0.89 | 1.00 | 0.94 | 17 |
| glasses | 0.88 | 1.00 | 0.94 | 15 |
| helicopter | 0.93 | 0.93 | 0.93 | 29 |
| ice skates | 0.89 | 1.00 | 0.94 | 17 |
| letter tray | 0.86 | 0.95 | 0.90 | 19 |
| monitor | 1.00 | 0.95 | 0.97 | 20 |
| mug | 0.83 | 0.90 | 0.86 | 21 |
| network switch | 0.92 | 0.79 | 0.85 | 29 |
| over-ear headphones | 0.94 | 0.81 | 0.87 | 21 |
| pen | 0.83 | 1.00 | 0.91 | 20 |
| purse | 1.00 | 0.94 | 0.97 | 16 |
| stroller | 1.00 | 1.00 | 1.00 | 24 |
| stand mixer | 1.00 | 0.95 | 0.97 | 20 |
| | | | | |
| accuracy | | | 0.92 | 400 |
| macro avg | 0.93 | 0.93 | 0.92 | 400 |
| weighted avg | 0.93 | 0.92 | 0.92 | 400 |

</center>

##### Loss and Accuracy curves

<p align="center">
    <img src="https://drive.google.com/uc?id=1KXR0IBuo8TqPP3qtLWYz1TL4imSv_8M4" width=400px>
    <img src="https://drive.google.com/uc?id=1WGlQdMffQfDs1cDY4NkPwecemyz_dKQW" width=400px>
    <figcaption align = "center"><b>Loss & Accuracy</b></figcaption>
</p>

##### Confusion Matrix

<p align="center">
    <img src="https://drive.google.com/uc?id=1AjQiAWXVMXEhjg97MWfC_rhceUeBrqrO" width=600px>
    <figcaption align = "center"><b>Confusion Matrix</b></figcaption>
</p>

##### t-SNE

<p align="center">
    <img src="https://drive.google.com/uc?id=1myFwRjs6_xVE1op-KSdD34DGbXEInupD" width=400px>
    <figcaption align = "center"><b>t-SNE</b></figcaption>
</p>



### Discussion

Thanks to the pretrained Resnet-18 used as feature extractor, training starts with a high accuracy in both directions and reaches very good baseline results in just a few epochs. <!-- We note that the features extractor model is not trained in the baseline, so the only model trained is the classifier, which has to learn how classify at its best the features coming from the Resnet.  -->

As we can see from the tables, the accuracy of $P \rightarrow RW$ (80.7 %) is much lower than that of $RW \rightarrow P$ (92.2 %). We think that this is due to the difference between the training domains. In the first case the model has to learn from a domain that is poor in details, and it is evaluated on a much more complex domain. In the second case the model learns from more complex images, whose features characterize the object better than the corresponding images in the other domain.

The confusion matrices support this statement: the matrix for $P \rightarrow RW$ is less sparse than the one for $RW \rightarrow P$, which means that the model is more often unsure about the class of the input image.
The t-SNE representation for $P \rightarrow RW$ is also more confused, and the classes are not grouped together as tightly as in the $RW \rightarrow P$ direction.
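
To make "less sparse" concrete: every non-zero cell outside the diagonal of a confusion matrix is a pair of classes that the model confuses. A toy sketch in plain Python, with made-up labels and not the notebook's evaluation code:

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Made-up labels and predictions for 3 classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, 3)

# Number of distinct (true, predicted) confusions and overall accuracy
off_diag_cells = sum(
    1 for i in range(3) for j in range(3) if i != j and cm[i][j] > 0
)
accuracy = sum(cm[i][i] for i in range(3)) / len(y_true)

print(cm, off_diag_cells, round(accuracy, 3))
```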

Below we report some further results for both cases.

#### Classifiers

We want to underline that the classifiers already learn very well in the baseline, as shown by the difference between the t-SNE representation before the classifier and the one at its end. After the classifier the classes are very well separated, even though there are still some outliers, especially in the $P \rightarrow RW$ direction, which, as we wrote before, is the harder direction to learn. Indeed, in that case the t-SNE representation of the Resnet features is quite confused, while in the second case the classes are already well separated before the classifier.

For both baselines we report the t-SNE representations before and after the classifier.

<p align="center">
    <p align="center">
        <img src="https://drive.google.com/uc?id=18ecf6MmDEBs6LfKvDreGn55-IDa9fKym" width=400px>
        <img src="https://drive.google.com/uc?id=1MePvXiItwbH8fEM7VrL0hm8IZdHoN_Yf" width=400px>
        <figcaption align = "center"><b>P -> RW: t-SNE before and after classifier</b></figcaption>
    </p>
    <p align="center">
        <img src="https://drive.google.com/uc?id=1589q3xPdj5ykOD6Kd41-IVEt-C2RFFdm" width=400px>
        <img src="https://drive.google.com/uc?id=1myFwRjs6_xVE1op-KSdD34DGbXEInupD" width=400px>
        <figcaption align = "center"><b>RW -> P : t-SNE before and after classifier</b></figcaption>
    </p>
</p>    


## 2) Upperbound
We now compute the upperbound for both directions. Each upperbound is named after the domain it is trained and tested on:
- $P$ : $acc_{up}^P$
- $RW$ : $acc_{up}^{RW}$

### Upperbound $RW$

> MEMO : train on $RW_{train}$ and test on $RW_{test}$

In [None]:
# Variable definition
feature_model = FeatureModel(resnet_version='resnet18')
source_model = Predictor(input_size=feature_model.output_size)
optimizer = get_optimizer(source_model, optimizer='sgd', lr=DEFAULT_LEARNING_RATE)
criterion = get_criterion(criterion='cross_entropy')

# Initialize wandb
experiment_name = 'Upperbound_RW'

wandb.init(
    # Set the project where this run will be logged
    project=DEFAULT_WANDB_PROJECT_NAME, 
    # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
    name=experiment_name, 
    # Track hyperparameters and run metadata
    config={
        "feature_model": feature_model.__class__.__name__,
        "source_model": source_model.__class__.__name__,
        "source_optimizer": optimizer.__class__.__name__,
        "criterion": criterion.__class__.__name__,
        "training_dataset": "RealWorld",
        "evaluation_dataset": "RealWorld",
        "training_type": "Upperbound RW",
        "learning_rate": optimizer.param_groups[0]['lr'],
        "batch_size": DEFAULT_BATCH_SIZE,
        "epochs": DEFAULT_EPOCHS_BASELINE,
    }
)

# Train
accuracy_upperbound_rw = train_baseline(
    feature_model=feature_model,
    source_model=source_model,
    optimizer=optimizer,
    criterion=criterion,
    train_dataset=train_loader_reals,
    eval_dataset=test_loader_reals,
    save_name=experiment_name
)

# Close wandb
wandb.finish()

# Delete variables to free memory
del feature_model, source_model, optimizer, criterion, experiment_name

# Empty cache
if DEFAULT_DEVICE.type == 'cuda':
    torch.cuda.empty_cache() 

# Print results
print('Accuracy upperbound RW: {:.2f}'.format(accuracy_upperbound_rw*100))

### Results

Here we report the results of the upperbound for the $P \rightarrow RW$ direction.

#### Final results

<center>

| | Accuracy | Loss |
| --- | :---: | --- |
| Train | 91.2 % | 0.001 |
| Test | 90.5 % | 0.002 |

</center>

#### Metrics

<center>

| | precision | recall | f1-score | support |
| --- | :---: | :---: | :---: | :---: |
| backpack | 0.89 | 1.00 | 0.94 | 17 |
| bookcase | 0.94 | 0.94 | 0.94 | 16 |
| car jack | 0.95 | 0.86 | 0.90 | 21 |
| comb | 1.00 | 0.95 | 0.98 | 21 |
| crown | 1.00 | 0.95 | 0.98 | 21 |
| file cabinet | 0.79 | 0.70 | 0.75 | 27 |
| flat iron | 0.89 | 1.00 | 0.94 | 24 |
| game controller | 0.94 | 0.94 | 0.94 | 17 |
| glasses | 0.89 | 0.76 | 0.82 | 21 |
| helicopter | 0.95 | 0.95 | 0.95 | 20 |
| ice skates | 0.94 | 0.73 | 0.82 | 22 |
| letter tray | 0.68 | 1.00 | 0.81 | 17 |
| monitor | 1.00 | 1.00 | 1.00 | 19 |
| mug | 1.00 | 0.95 | 0.98 | 21 |
| network switch | 0.86 | 1.00 | 0.93 | 19 |
| over-ear headphones | 0.74 | 0.85 | 0.79 | 20 |
| pen | 0.91 | 0.91 | 0.91 | 22 |
| purse | 0.94 | 1.00 | 0.97 | 15 |
| stand mixer | 0.82 | 0.61 | 0.70 | 23 |
| stroller | 0.84 | 0.94 | 0.89 | 17 |
| | | | | |
| accuracy | | | 0.89 | 400 |
| macro avg | 0.90 | 0.90 | 0.90 | 400 |
| weighted avg | 0.90 | 0.89 | 0.89 | 400 |

</center>

#### Loss and Accuracy curves

<p align="center">
    <img src="https://drive.google.com/uc?id=1s6ye1ogF8ZUR5Q6roVvVECNn22IzF4r6" width=400px>
    <img src="https://drive.google.com/uc?id=1WzNFNOEDn9GOLkJ3525p4WeEbKhwMAOD" width=400px>
    <figcaption align = "center"><b>Loss & Accuracy</b></figcaption>
</p>

#### Confusion Matrix

<p align="center">
    <img src="https://drive.google.com/uc?id=1libsX7VuId7voTl_xpEqheP44NJVoTzS" width=600px>
    <figcaption align = "center"><b>Confusion Matrix</b></figcaption>
</p>

#### t-SNE

<p align="center">
    <img src="https://drive.google.com/uc?id=133bxFUsiDYEI2A0XWWMYRC9DKPmR2W4p" width=400px>
    <figcaption align = "center"><b>t-SNE</b></figcaption>
</p>


### Upperbound $P$

> MEMO : train on $P_{train}$ and test on $P_{test}$

In [None]:
# Variable definition
feature_model = FeatureModel(resnet_version='resnet18')
source_model = Predictor(input_size=feature_model.output_size)
optimizer = get_optimizer(source_model, optimizer='sgd', lr=DEFAULT_LEARNING_RATE)
criterion = get_criterion(criterion='cross_entropy')

# Initialize wandb
experiment_name = 'Upperbound_P'

wandb.init(
    # Set the project where this run will be logged
    project=DEFAULT_WANDB_PROJECT_NAME, 
    # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
    name=experiment_name, 
    # Track hyperparameters and run metadata
    config={
        "feature_model": feature_model.__class__.__name__,
        "source_model": source_model.__class__.__name__,
        "source_optimizer": optimizer.__class__.__name__,
        "criterion": criterion.__class__.__name__,
        "training_dataset": "Products",
        "evaluation_dataset": "Products",
        "training_type": "Upperbound P",
        "learning_rate": optimizer.param_groups[0]['lr'],
        "batch_size": DEFAULT_BATCH_SIZE,
        "epochs": DEFAULT_EPOCHS_BASELINE,
    }
)

# Train
accuracy_upperbound_p = train_baseline(
    feature_model=feature_model,
    source_model=source_model,
    optimizer=optimizer,
    criterion=criterion,
    train_dataset=train_loader_products,
    eval_dataset=test_loader_products,
    save_name=experiment_name
)

# Close wandb
wandb.finish()

# Delete variables to free memory
del feature_model, source_model, optimizer, criterion, experiment_name

# Clear memory
if DEFAULT_DEVICE.type == 'cuda':
    torch.cuda.empty_cache()

# Print results
print('Accuracy upperbound P: {:.2f}'.format(accuracy_upperbound_p*100))

### Results

Here we report the results of the upperbound on the $RW \rightarrow P$ direction.

#### Final results

<center>

| | Accuracy | Loss |
| --- | :---: | --- |
| Train | 88.9 % | 0.002 |
| Test | 97.3 % | 0.001 |

</center>

#### Metrics

<center>

| | precision | recall | f1-score | support |
| --- | :---: | :---: | :---: | :---: |
| backpack | 1.00 | 1.00 | 1.00 | 17 |
| bookcase | 1.00 | 1.00 | 1.00 | 20 |
| car jack | 0.94 | 1.00 | 0.97 | 29 |
| comb | 1.00 | 1.00 | 1.00 | 20 |
| crown | 1.00 | 1.00 | 1.00 | 19 |
| file cabinet | 1.00 | 0.95 | 0.97 | 20 |
| flat iron | 0.82 | 0.78 | 0.80 | 18 |
| game controller | 1.00 | 0.90 | 0.95 | 21 |
| glasses | 1.00 | 1.00 | 1.00 | 24 |
| helicopter | 1.00 | 1.00 | 1.00 | 24 |
| ice skates | 0.82 | 0.86 | 0.84 | 21 |
| letter tray | 1.00 | 0.89 | 0.94 | 19 |
| monitor | 0.84 | 1.00 | 0.91 | 16 |
| mug | 1.00 | 1.00 | 1.00 | 19 |
| network switch | 0.94 | 1.00 | 0.97 | 17 |
| over-ear headphones | 1.00 | 1.00 | 1.00 | 15 |
| pen | 1.00 | 1.00 | 1.00 | 17 |
| purse | 1.00 | 1.00 | 1.00 | 19 |
| stand mixer | 0.94 | 1.00 | 0.97 | 16 |
| stroller | 0.96 | 0.90 | 0.93 | 29 |
| | | | | |
| accuracy | | | 0.96 | 400 |
| macro avg | 0.96 | 0.96 | 0.96 | 400 |
| weighted avg | 0.96 | 0.96 | 0.96 | 400 |

</center>

#### Loss and Accuracy curves

<p align="center">
    <img src="https://drive.google.com/uc?id=1eEYSm9sFjaUOqy_6jg-wbc9xU2zOpqxD" width=400px>
    <img src="https://drive.google.com/uc?id=1DdF90y6hqwAH3gClgqmnpmdhnU8oPanF" width=400px>
    <figcaption align = "center"><b>Loss & Accuracy</b></figcaption>
</p>

#### Confusion Matrix

<p align="center">
    <img src="https://drive.google.com/uc?id=1CbFhOHQPUYbLHium1gu8Fvlk1P9mzA5g" width=600px>
    <figcaption align = "center"><b>Confusion Matrix</b></figcaption>
</p>

#### t-SNE

<p align="center">
    <img src="https://drive.google.com/uc?id=1jZlEJe906sZ0fyfsmUcjP0zki6KFD4Ek" width=400px>
    <figcaption align = "center"><b>t-SNE</b></figcaption>
</p>


## Discussion

As in the baseline, only the classifier is trained. In the baseline, the classifier is trained on the source domain and tested on the target domain. In the upperbound the classifier is both trained and tested on the target domain.

The upperbounds are very important: they give an ideal maximum value for the performance. In this task $acc_{up}^P$ is the reference accuracy for $RW \rightarrow P$, while $acc_{up}^{RW}$ is the one for $P \rightarrow RW$.

In the next sections we summarize the values that we will use as reference in the rest of the notebook.

### Results baseline and upperbounds

From the baseline and the upperbound we summarize the results (Accuracies) in the table below.

<center>

| | Baseline | Upperbound |
| :---: | :---: | :---: |
| $P \rightarrow RW$ | 80.7 % | 90.5 % |
| $RW \rightarrow P$  | 92.2 % | 97.3 % |

</center>

### Gain

The table below shows the maximum possible gain we can achieve, i.e. the difference between the upperbound and the baseline accuracy:

<center>

| | Gain |
| :---: | :---: |
| $P \rightarrow RW$ | 9.8 % |
| $RW \rightarrow P$ | 5.1 % |

</center>
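
The gains are simply the upperbound accuracy minus the baseline accuracy for each direction. A short sketch of the arithmetic, using the accuracies from the summary table (the dictionary names are ours):

```python
# Accuracies (in %) from the baseline and upperbound summary table
baseline = {"P -> RW": 80.7, "RW -> P": 92.2}
upperbound = {"P -> RW": 90.5, "RW -> P": 97.3}

# Maximum gain a UDA method can bring in each direction
gain = {
    direction: round(upperbound[direction] - baseline[direction], 1)
    for direction in baseline
}

print(gain)
```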

## 3) Advanced : test the UDA component
### Maximum Classifier Discrepancy Approach

MCD is an adversarial training method that tries to minimize the discrepancy between the source and the target domain. The method is based on the following idea: if the features of the source and the target domain are aligned, then a classifier trained on the source domain should be able to classify the target domain with a high accuracy.

After the baseline and the upperbound results, we can now try the UDA component and the method proposed in the paper. 

For this part it is important to apply the following training steps together:
- **Step 1** : train in a supervised way on the source domain 
- **Step 2** : train in an unsupervised way on the target domain

In the paper the previous steps are adapted to the architecture:
- **Step A** : train in a supervised way on the source domain the Feature Extractor and both Classifiers
- **Step B** : train in an unsupervised way on the target domain
    - Step B.1 : Fix the Generator and train in an unsupervised way the 2 Classifiers on the target domain
    - Step B.2 : Fix the Classifiers and train in an unsupervised way the Generator on the target domain

Step A is crucial because it helps to obtain task-specific discriminative features, Step B.1 is useful to train the classifiers as a discriminator for a fixed generator while Step B.2 is needed to train the generator in order to minimize the discrepancy for the fixed classifiers.
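
The three steps can also be written as optimization problems. The formulation below follows the original paper, in which Step B.1 keeps the supervised source loss next to the discrepancy term. The notation is our own assumption: $G$ is the generator, $C_1$ and $C_2$ are the two classifiers with probability outputs $p_1$ and $p_2$, $\mathcal{L}_{ce}$ is the sum of the cross-entropy losses of both classifiers on the labeled source data $(X_s, Y_s)$, and $\mathcal{D}_{loss}$ is the discrepancy loss defined in the next section.

$$
\begin{aligned}
\text{Step A:}\quad & \min_{G,\,C_1,\,C_2}\; \mathcal{L}_{ce}(X_s, Y_s) \\
\text{Step B.1:}\quad & \min_{C_1,\,C_2}\; \mathcal{L}_{ce}(X_s, Y_s) - \mathbb{E}_{x_t \sim X_t}\big[\mathcal{D}_{loss}(p_1(x_t), p_2(x_t))\big] \\
\text{Step B.2:}\quad & \min_{G}\; \mathbb{E}_{x_t \sim X_t}\big[\mathcal{D}_{loss}(p_1(x_t), p_2(x_t))\big]
\end{aligned}
$$

The minus sign in Step B.1 is what makes the classifiers maximize the discrepancy on the target samples $X_t$, while Step B.2 makes the generator minimize the same quantity.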

When testing the obtained model, the actual prediction is given by the combination of the source classifier's prediction and the target classifier's prediction. 
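
The text does not fix how the two predictions are combined. One simple option, which is our assumption here, is to sum the class probabilities of the two classifiers and take the most likely class. A minimal sketch with made-up logits:

```python
import torch

# Hypothetical logits of the two classifiers for 2 samples and 3 classes
out_source = torch.tensor([[2.0, 0.5, 0.1], [0.2, 0.3, 0.1]])
out_target = torch.tensor([[1.5, 0.7, 0.2], [0.1, 0.2, 1.9]])

# Sum the class probabilities, then take the most likely class
probs = torch.softmax(out_source, dim=1) + torch.softmax(out_target, dim=1)
prediction = probs.argmax(dim=1)

print(prediction.tolist())
```

For the second sample the first classifier alone would weakly prefer class 1, but the confident vote of the second classifier for class 2 prevails in the sum.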

In the following sections there is the code implementation and the results for each direction.

### Discrepancy Loss

The discrepancy loss is defined as the mean absolute difference between the probability outputs of the two classifiers for the same input. In Step B.1 and Step B.2 the discrepancy loss must be maximized and minimized respectively, so that the model learns in an adversarial manner.

$$\mathcal{D}_{loss}(p_1, p_2) = \frac{1}{K} \sum_{k=1}^K |p_{1_{k}} - p_{2_{k}}| $$

where $p_{1_{k}}$ and $p_{2_{k}}$ denote the probability outputs of $p_1$ and $p_2$ for class $k$ respectively, and $K$ is the number of classes.
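
As a worked example of the formula, consider a single sample with $K = 3$ classes and two made-up probability vectors. Plain Python is enough here:

```python
def discrepancy_single(p1, p2):
    """Mean absolute difference between two probability vectors."""
    return sum(abs(a - b) for a, b in zip(p1, p2)) / len(p1)

# The two classifiers disagree: one favours class 0, the other class 2
p1 = [0.7, 0.2, 0.1]
p2 = [0.1, 0.2, 0.7]

d_disagree = discrepancy_single(p1, p2)  # (0.6 + 0.0 + 0.6) / 3
d_agree = discrepancy_single(p1, p1)     # identical outputs

print(round(d_disagree, 3), d_agree)
```

The loss is zero when the two classifiers agree and grows with their disagreement, which is what makes it usable as a signal for target samples far from the source support.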

In [None]:
# Discrepancy function
def discrepancy(out1: torch.Tensor, out2: torch.Tensor) -> torch.Tensor:
    # Mean absolute difference over all samples and classes
    return torch.mean(torch.abs(out1 - out2))

### Train and Test functions

In [None]:
def train_mcd_one_step(
        features_model: nn.Module, 
        classifiers: Tuple[nn.Module, nn.Module],
        optimizers: Tuple[torch.optim.Optimizer, torch.optim.Optimizer, torch.optim.Optimizer], 
        criterion: nn.Module, 
        datasets: Tuple[DataLoader, DataLoader],
        discrepancy_fn: Callable,
        k: int = DEFAULT_K_STEPS,
        device: str = DEFAULT_DEVICE,
    ) -> dict:
    ### For NLLLoss
    # m = nn.LogSoftmax(dim=1)

    # Extract classifiers
    classifier_source, classifier_target = classifiers

    # Extract optimizers
    optimizer_features, optimizer_source, optimizer_target = optimizers

    # Extract datasets
    train_loader_source, train_loader_target = datasets

    # Set models to train mode
    features_model.train()
    classifier_source.train()
    classifier_target.train()
    
    # Initialize cumulative variables
    cum_loss_source = 0
    cum_loss_target = 0
    cum_accuracy_source = 0
    cum_accuracy_target = 0
    n_samples = 0

    # Train loop    
    for (x_source, y_source), (x_target, y_target) in zip(train_loader_source, train_loader_target):
        # Move to device
        x_source, y_source = x_source.to(device), y_source.to(device)
        x_target, y_target = x_target.to(device), y_target.to(device)

        # Extract features and compute output
        features_source = features_model(x_source)
        out_source = classifier_source(features_source)
        out_target = classifier_target(features_source)

        # Compute loss
        loss_source = criterion(out_source, y_source)
        loss_target = criterion(out_target, y_source)
        loss_s: torch.Tensor = loss_source + loss_target
        cum_loss_source += loss_s.item()
        ### For NLLLoss
        # loss_source = criterion(m(out_source), y_source)
        # loss_target = criterion(m(out_target), y_source)
        # loss_s = loss_source + loss_target
        # cum_loss_source += loss_s.item()

        # Compute accuracy
        cum_accuracy_source += compute_accuracy(out_source, y_source)
        cum_accuracy_source += compute_accuracy(out_target, y_source)

        # Set optimizers to zero_grad
        optimizer_features.zero_grad()
        optimizer_source.zero_grad()
        optimizer_target.zero_grad()

        # Backward
        loss_s.backward()

        # Update weights
        optimizer_features.step()
        optimizer_source.step()
        optimizer_target.step()

        ### UDA ###
        # Extract features and compute output - source
        features_source = features_model(x_source)
        out_source_s = classifier_source(features_source)
        out_target_s = classifier_target(features_source)

        # Extract features and compute output - target
        features_target = features_model(x_target)
        out_source_t = classifier_source(features_target)
        out_target_t = classifier_target(features_target)
        
        # Compute loss - Only on source!!!
        loss_source = criterion(out_source_s, y_source)
        loss_target = criterion(out_target_s, y_source)
        loss_t = loss_source + loss_target
        ### For NLLLoss
        # loss_source = criterion(m(out_source_s), y_source)
        # loss_target = criterion(m(out_target_s), y_source)
        # loss_t = loss_source + loss_target

        # Compute discrepancy loss on target (use the function passed as argument)
        d = discrepancy_fn(out_source_t, out_target_t)

        # Compute total loss
        loss: torch.Tensor = loss_t - d
        cum_loss_target += loss.item()

        # Compute accuracy on target - labels are used only to monitor training, never in the loss
        cum_accuracy_target += compute_accuracy(out_source_t, y_target)
        cum_accuracy_target += compute_accuracy(out_target_t, y_target)

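        # Set optimizers to zero_grad (before the backward pass, otherwise the
        # freshly computed gradients are erased and the step below cannot use them)
        optimizer_source.zero_grad()
        optimizer_target.zero_grad()

        # Backward the total loss of source and target
        loss.backward()

        # Update weights (classifiers only: the feature extractor is fixed in this step)
        optimizer_source.step()
        optimizer_target.step()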
        
        # Update n_samples
        n_samples += x_source.shape[0]
        n_samples += x_target.shape[0]
        
        ### Train features model for k steps ###
        for i in range(k):
            # Extract features and compute output 
            features_target = features_model(x_target)
            out_source_t = classifier_source(features_target)
            out_target_t = classifier_target(features_target)

            # Compute loss
            loss_discrepancy: torch.Tensor = discrepancy_fn(out_source_t, out_target_t)

            # Zero gradients
            optimizer_features.zero_grad()

            # Backpropagate
            loss_discrepancy.backward()
            
            # Step optimizer
            optimizer_features.step()

    return {
        'train/discrepancy' : d.cpu().detach().numpy(),
        'train/loss_features' : loss_discrepancy.cpu().detach().numpy(),
        'train/loss_source': cum_loss_source / n_samples,
        'train/loss_target': cum_loss_target / n_samples,
        'train/accuracy_source': cum_accuracy_source / n_samples,
        'train/accuracy_target': cum_accuracy_target / n_samples,
    }

In [None]:

# Gradients are not needed at evaluation time
@torch.no_grad()
def test_mcd_one_step(
        feature_model: nn.Module, 
        classifiers: Tuple[nn.Module, nn.Module],
        criterion: nn.Module,
        dataset: DataLoader,
        device: str = DEFAULT_DEVICE,
    ) -> dict:
    ### For NLLLoss
    # m = nn.LogSoftmax(dim=1)

    # Extract classifiers
    classifier_source, classifier_target = classifiers

    # Set models to eval mode
    feature_model.eval()
    classifier_source.eval()
    classifier_target.eval()

    # Initialize cumulative variables
    cum_loss = 0
    cum_accuracy_source = 0
    cum_accuracy_target = 0
    cum_accuracy = 0
    n_samples = 0

    # Test loop
    for x, y in dataset:
        # Move to device
        x, y = x.to(device), y.to(device)

        # Extract features and compute output
        features = feature_model(x)
        out_source = classifier_source(features)
        out_target = classifier_target(features)
        out = out_source + out_target

        # Compute loss
        loss = criterion(out, y)
        ### For NLLLoss
        # loss = criterion(m(out), y)
        cum_loss += loss.item()

        # Compute accuracy
        cum_accuracy_source += compute_accuracy(out_source, y)
        cum_accuracy_target += compute_accuracy(out_target, y)
        cum_accuracy += compute_accuracy(out, y)

        # Update n_samples
        n_samples += x.shape[0]

    return {
        'eval/loss': cum_loss / n_samples,
        'eval/accuracy_source': cum_accuracy_source / n_samples,
        'eval/accuracy_target': cum_accuracy_target / n_samples,
        'eval/accuracy': cum_accuracy / n_samples,
    }


In [None]:
def train_mcd(
        feature_model: nn.Module, 
        classifiers: Tuple[nn.Module, nn.Module],
        optimizers: Tuple[torch.optim.Optimizer, torch.optim.Optimizer, torch.optim.Optimizer], 
        schedulers: Tuple[torch.optim.lr_scheduler._LRScheduler, torch.optim.lr_scheduler._LRScheduler, torch.optim.lr_scheduler._LRScheduler],
        criterion: nn.Module, 
        discrepancy_fn: Callable,
        train_datasets: Tuple[DataLoader, DataLoader],
        eval_dataset: DataLoader,
        n_epochs: int = DEFAULT_EPOCHS,
        k: int = DEFAULT_K_STEPS,
        device: str = DEFAULT_DEVICE,
        save_name: str = "UDA_MCD_P_RW",
    ) -> float:
    # Extract classifiers
    classifier_source, classifier_target = classifiers

    # Move to device
    feature_model.to(device)
    classifier_source.to(device)
    classifier_target.to(device)

    # Extract schedulers
    scheduler_features, scheduler_source, scheduler_target = schedulers

    # Initialize variables
    best_accuracy = 0
    best_model_f = None
    best_model_s = None
    best_model_t = None

    # Start timer
    start_timer = time.time()

    # Store metrics 
    metrics_visualize = {
        'train/discrepancy': [],
        'train/loss_features': [],
        'train/loss_source': [],
        'train/loss_target': [],
        'train/accuracy_source': [],
        'train/accuracy_target': [],
        'eval/loss': [],
        'eval/accuracy': [],
        'eval/accuracy_source': [],
        'eval/accuracy_target': [],
    }

    # Train loop
    for epoch in range(n_epochs):
        # Train
        train_metrics = train_mcd_one_step(
            features_model=feature_model, 
            classifiers=classifiers, 
            optimizers=optimizers, 
            criterion=criterion, 
            discrepancy_fn=discrepancy_fn,
            datasets=train_datasets, 
            k=k,
            device=device,
        )
        # Test
        test_metrics = test_mcd_one_step(
            feature_model=feature_model, 
            classifiers=classifiers, 
            criterion=criterion,
            dataset=eval_dataset, 
            device=device,
        )

        # Put together metrics
        metrics = {**train_metrics, **test_metrics}

        # wandb log
        wandb.log(metrics)

        # Save best model
        if test_metrics['eval/accuracy'] > best_accuracy:
            best_accuracy = test_metrics['eval/accuracy']

            best_model_f = copy.deepcopy(feature_model)
            best_model_s = copy.deepcopy(classifier_source)
            best_model_t = copy.deepcopy(classifier_target)
            
            torch.save(best_model_f.state_dict(), os.path.join(LOG_PATH_MODELS, save_name+"_features_extractor.pth"))
            torch.save(best_model_s.state_dict(), os.path.join(LOG_PATH_MODELS, save_name+"_source.pth"))
            torch.save(best_model_t.state_dict(), os.path.join(LOG_PATH_MODELS, save_name+"_target.pth"))

        # Log metrics
        opt_f, opt_s, opt_t = optimizers
        make_log_print("Train", (epoch+1, n_epochs), time.time() - start_timer, metrics, lr=(round(opt_f.param_groups[0]['lr'], 6), round(opt_s.param_groups[0]['lr'], 6), round(opt_t.param_groups[0]['lr'], 6)))

        # Store metrics
        for key in metrics_visualize.keys():
            metrics_visualize[key].append(metrics[key])

        # Update scheduler
        scheduler_features.step()
        scheduler_source.step()
        scheduler_target.step()

    # Plot metrics
    visualize_results(
        feature_model=best_model_f,
        classifier_model_s=best_model_s,
        classifier_model_t=best_model_t,
        eval_dataset=eval_dataset,
        metrics=metrics_visualize, 
        device=device, 
        save_path=os.path.join(LOG_PATH_IMAGES, save_name))

    return best_accuracy

### $P \rightarrow RW$

In [None]:
feature_model = FeatureModel(resnet_version='resnet18')
model_s = Predictor(input_size=feature_model.output_size)
model_t = Predictor(input_size=feature_model.output_size)

opt_f = get_optimizer(feature_model, optimizer='sgd', lr=1e-2)
opt_s = get_optimizer(model_s, optimizer='sgd', lr=1e-1)
opt_t = get_optimizer(model_t, optimizer='sgd', lr=1e-1)

scheduler_f = get_scheduler(opt_f)
scheduler_s = get_scheduler(opt_s)
scheduler_t = get_scheduler(opt_t)

criterion = get_criterion(criterion='cross_entropy')

# Initialize wandb
experiment_name = 'UDA_P_RW'

wandb.init(
    # Set the project where this run will be logged
    project=DEFAULT_WANDB_PROJECT_NAME, 
    # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
    name=experiment_name, 
    # Track hyperparameters and run metadata
    config={
        "feature_model": feature_model.__class__.__name__,
        "source_model": model_s.__class__.__name__,
        "target_model": model_t.__class__.__name__,
        "features_optimizer": opt_f.__class__.__name__,
        "source_optimizer": opt_s.__class__.__name__,
        "target_optimizer": opt_t.__class__.__name__,
        "features_scheduler": scheduler_f.__class__.__name__,
        "source_scheduler": scheduler_s.__class__.__name__,
        "target_scheduler": scheduler_t.__class__.__name__,
        "criterion": criterion.__class__.__name__,
        "training_type": "UDA P-RW",
        "training_dataset": "Products Supervised + RealWorld Unsupervised",
        "evaluation_dataset": "RealWorld",
        "learning_rate_feature_model": opt_f.param_groups[0]['lr'],
        "learning_rate_source_model": opt_s.param_groups[0]['lr'],
        "learning_rate_target_model": opt_t.param_groups[0]['lr'],
        "batch_size": DEFAULT_BATCH_SIZE,
        "epochs": DEFAULT_EPOCHS,
    }
)

accuracy_uda_p_rw = train_mcd(
    feature_model=feature_model,
    classifiers=(model_s, model_t),
    optimizers=(opt_f, opt_s, opt_t),
    schedulers=(scheduler_f, scheduler_s, scheduler_t),
    criterion=criterion,
    discrepancy_fn=discrepancy,
    train_datasets=(train_loader_products, train_loader_reals),
    eval_dataset=test_loader_reals,
    n_epochs=DEFAULT_EPOCHS,
    k=DEFAULT_K_STEPS,
    save_name=experiment_name
)

# Close wandb
wandb.finish()

# Delete variables
del feature_model, model_s, model_t, opt_f, opt_s, opt_t, scheduler_f, scheduler_s, scheduler_t, criterion, experiment_name

# Empty cache
if DEFAULT_DEVICE.type == 'cuda':
    torch.cuda.empty_cache()

# Print results
print("Accuracy UDA P-RW: {:.2f}%".format(accuracy_uda_p_rw*100))
print("Gain UDA P-RW: {:.2f}%".format((accuracy_uda_p_rw - 0.807)*100))

#### Results

Here we report the results of the MCD method on the $P \rightarrow RW$ direction.

##### Final results

<center>

| | Accuracy | Loss |
| :---: | :---: | :---: |
| Train source (supervised) | 88.9 % | 0.002 |
| Train target (unsupervised) | 75.4 % | 0.001 |
| Test source | 87.5 % | 0.002 |
| Test target | 87.7 % | 0.002 |
| Test overall | 88.0 % | 0.002 |

</center>

##### Metrics

<center>

| | precision | recall | f1-score | support |
| :---: | :---: | :---: | :---: | :---: |
| backpack | 0.95 | 0.88 | 0.91 | 24 |
| bookcase | 0.91 | 1.00 | 0.95 | 21 |
| car jack | 0.90 | 0.82 | 0.86 | 22 |
| comb | 0.79 | 0.88 | 0.83 | 17 |
| crown | 1.00 | 0.95 | 0.97 | 19 |
| file cabinet | 1.00 | 1.00 | 1.00 | 21 |
| flat iron | 0.95 | 0.86 | 0.90 | 21 |
| game controller | 0.94 | 1.00 | 0.97 | 17 |
| glasses | 0.85 | 0.81 | 0.83 | 27 |
| helicopter | 0.70 | 0.94 | 0.80 | 17 |
| ice skates | 0.70 | 0.86 | 0.78 | 22 |
| letter tray | 0.88 | 1.00 | 0.93 | 21 |
| monitor | 0.76 | 0.87 | 0.81 | 15 |
| mug | 1.00 | 0.81 | 0.89 | 21 |
| network switch | 0.88 | 0.94 | 0.91 | 16 |
| over-ear headphones | 0.89 | 0.85 | 0.87 | 20 |
| pen | 0.95 | 0.95 | 0.95 | 19 |
| purse | 0.88 | 0.88 | 0.88 | 17 |
| stroller | 0.88 | 0.65 | 0.75 | 23 |
| stand mixer | 0.88 | 0.75 | 0.81 | 20 |
| | | | | |
| accuracy | | | 0.88 | 400 |
| macro avg | 0.89 | 0.88 | 0.88 | 400 |
| weighted avg | 0.89 | 0.88 | 0.88 | 400 |

</center>

##### Loss and Accuracy curves

<p align="center">
    <img src="https://drive.google.com/uc?id=1S3MAoINzyNJ8uZHAtjc7hPxqUd5gRKog" width=400px>
    <img src="https://drive.google.com/uc?id=1fBV4dWJDjCSl8yW-jg_YV9DDbcTmBim8" width=400px>
    <img src="https://drive.google.com/uc?id=1e5_Y1drSnguFs_b4M5RJNBCQ7sMw0AAv" width=400px>
    <figcaption align = "center"><b>Loss & Accuracy & Discrepancy(features only)</b></figcaption>
</p>

##### Confusion Matrix

<p align="center">
    <img src="https://drive.google.com/uc?id=151JLK_-ohBakq8jzkAId27S4G_Wpog8g" width=600px>
    <figcaption align = "center"><b>Confusion Matrix</b></figcaption>
</p>

##### t-SNE

<p align="center">
    <img src="https://drive.google.com/uc?id=1uk6_7Dx_AN-rrHrGgd7lXjTUD6Fld_hu" width=400px>
    <img src="https://drive.google.com/uc?id=1gZafwwJh6Dv7MaPef45CdyUzikPCsWic" width=400px>
    <figcaption align = "center"><b>t-SNE on Source & Target</b></figcaption>
</p>



#### Discussion : $P \rightarrow RW$
As the results show, MCD clearly improves performance in this direction, with a gain of 7.2 % over the baseline. The t-SNE representation and the confusion matrix are also cleaner than the corresponding plots of the baseline.

As the figure below shows, the UDA confusion matrix (on the right) has fewer off-diagonal entries and is more concentrated on the diagonal, which means that more samples have been classified correctly.

<p align="center">
    <img src="https://drive.google.com/uc?id=1ccL_83f5IGKVjBTQfZ_FWy69c7eryHQK" width=400px>
    <img src="https://drive.google.com/uc?id=151JLK_-ohBakq8jzkAId27S4G_Wpog8g" width=400px>
    <figcaption align = "center"><b>Confusion matrix Baseline and MCD</b></figcaption>
</p>

The t-SNE representations, instead, show that the features extracted per class with UDA are more compact than those of the baseline model. The representation from the target classifier has more outliers than the one from the source classifier.
<p align="center">
    <img src="https://drive.google.com/uc?id=1MePvXiItwbH8fEM7VrL0hm8IZdHoN_Yf" width=300px>
    <img src="https://drive.google.com/uc?id=1uk6_7Dx_AN-rrHrGgd7lXjTUD6Fld_hu" width=300px>
    <img src="https://drive.google.com/uc?id=1gZafwwJh6Dv7MaPef45CdyUzikPCsWic" width=300px>
    <figcaption align = "center"><b>t-SNE Baseline source only and MCD source and target classifier</b></figcaption>
</p>

### $RW \rightarrow P$

In [None]:
feature_model = FeatureModel(resnet_version='resnet18')
model_s = Predictor(input_size=feature_model.output_size)
model_t = Predictor(input_size=feature_model.output_size)

opt_f = get_optimizer(feature_model, optimizer='sgd', lr=1e-2)
opt_s = get_optimizer(model_s, optimizer='sgd', lr=1e-1)
opt_t = get_optimizer(model_t, optimizer='sgd', lr=1e-1)

scheduler_f = get_scheduler(opt_f)
scheduler_s = get_scheduler(opt_s)
scheduler_t = get_scheduler(opt_t)

criterion = get_criterion(criterion='cross_entropy')

# Initialize wandb
experiment_name = 'UDA_RW_P'

wandb.init(
    # Set the project where this run will be logged
    project=DEFAULT_WANDB_PROJECT_NAME, 
    # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
    name=experiment_name, 
    # Track hyperparameters and run metadata
    config={
        "feature_model": feature_model.__class__.__name__,
        "source_model": model_s.__class__.__name__,
        "target_model": model_t.__class__.__name__,
        "features_optimizer": opt_f.__class__.__name__,
        "source_optimizer": opt_s.__class__.__name__,
        "target_optimizer": opt_t.__class__.__name__,
        "features_scheduler": scheduler_f.__class__.__name__,
        "source_scheduler": scheduler_s.__class__.__name__,
        "target_scheduler": scheduler_t.__class__.__name__,
        "criterion": criterion.__class__.__name__,
        "training_type": "UDA RW-P",
        "training_dataset": "RealWorld Supervised + Products Unsupervised",
        "evaluation_dataset": "Products",
        "learning_rate_feature_model": opt_f.param_groups[0]['lr'],
        "learning_rate_source_model": opt_s.param_groups[0]['lr'],
        "learning_rate_target_model": opt_t.param_groups[0]['lr'],
        "batch_size": DEFAULT_BATCH_SIZE,
        "epochs": DEFAULT_EPOCHS,
    }
)

accuracy_uda_rw_p = train_mcd(
    feature_model=feature_model,
    classifiers=(model_s, model_t),
    optimizers=(opt_f, opt_s, opt_t),
    schedulers=(scheduler_f, scheduler_s, scheduler_t),
    criterion=criterion,
    discrepancy_fn=discrepancy,
    train_datasets=(train_loader_reals, train_loader_products),
    eval_dataset=test_loader_products,
    n_epochs=DEFAULT_EPOCHS,
    k=DEFAULT_K_STEPS,
    save_name=experiment_name
)

# Close wandb
wandb.finish()

# Delete variables
del feature_model, model_s, model_t, opt_f, opt_s, opt_t, scheduler_f, scheduler_s, scheduler_t, criterion, experiment_name

# Empty cache
if DEFAULT_DEVICE.type == 'cuda':
    torch.cuda.empty_cache()

# Print results
print("Accuracy UDA RW-P: {:.2f}%".format(accuracy_uda_rw_p*100))
print("Gain UDA RW-P: {:.2f}%".format((accuracy_uda_rw_p - 0.922)*100))

#### Results

Here we report the results of the MCD method on the $RW \rightarrow P$ direction.

##### Final results

<center>

| | Accuracy | Loss |
| :---: | :---: | :---: |
| Train source (supervised) | 89.89 % | 0.001 |
| Train target (unsupervised) | 86.3 % | 0.001 |
| Test source | 96.3 % | 0.001 |
| Test target | 96.3 % | 0.001 |
| Test overall | 96.5 % | 0.001 |

</center>

##### Metrics

<center>

| | precision | recall | f1-score | support |
| :---: | :---: | :---: | :---: | :---: |
| backpack | 0.96 | 1.00 | 0.98 | 24 |
| bookcase | 1.00 | 1.00 | 1.00 | 17 |
| car jack | 1.00 | 0.95 | 0.97 | 19 |
| comb | 0.97 | 1.00 | 0.98 | 29 |
| crown | 1.00 | 1.00 | 1.00 | 19 |
| file cabinet | 0.94 | 1.00 | 0.97 | 15 |
| flat iron | 0.94 | 1.00 | 0.97 | 16 |
| game controller | 0.90 | 0.95 | 0.92 | 19 |
| glasses | 0.94 | 0.89 | 0.91 | 18 |
| helicopter | 1.00 | 0.90 | 0.95 | 21 |
| ice skates | 1.00 | 1.00 | 1.00 | 17 |
| letter tray | 1.00 | 1.00 | 1.00 | 19 |
| monitor | 1.00 | 0.96 | 0.98 | 24 |
| mug | 0.95 | 0.95 | 0.95 | 20 |
| network switch | 0.92 | 0.83 | 0.87 | 29 |
| over-ear headphones | 1.00 | 1.00 | 1.00 | 20 |
| pen | 0.91 | 0.95 | 0.93 | 21 |
| purse | 1.00 | 1.00 | 1.00 | 20 |
| stroller | 1.00 | 1.00 | 1.00 | 17 |
| stand mixer | 0.89 | 1.00 | 0.94 | 16 |
| | | | | |
| accuracy | | | 0.96 | 400 |
| macro avg | 0.97 | 0.97 | 0.97 | 400 |
| weighted avg | 0.97 | 0.96 | 0.96 | 400 |

</center>

##### Loss and Accuracy curves

<p align="center">
    <img src="https://drive.google.com/uc?id=18mD0Q0b3DgkqPWbxOp0UwIrRyqKHrbBY" width=400px>
    <img src="https://drive.google.com/uc?id=1o6itff12B8edlB_Brk3BI_RLTdtfTtan" width=400px>
    <img src="https://drive.google.com/uc?id=17-tCeYqB7IZ9IBAgh7JqIAWOQcOcIVv0" width=400px>
    <figcaption align = "center"><b>Loss & Accuracy & Discrepancy(features only)</b></figcaption>
</p>

##### Confusion Matrix

<p align="center">
    <img src="https://drive.google.com/uc?id=1Zo2TUDgOj1wd8SF8fdsuUl6bWgBaBGOL" width=600px>
    <figcaption align = "center"><b>Confusion Matrix</b></figcaption>
</p>

##### t-SNE

<p align="center">
    <img src="https://drive.google.com/uc?id=1IMQ-p3TLz_A8Fj6eMaZARdQG2QTteaJA" width=400px>
    <img src="https://drive.google.com/uc?id=1PJYjGwy6A08NAmCn7e3aEvF3byaUqf6J" width=400px>
    <figcaption align = "center"><b>t-SNE on Source & Target</b></figcaption>
</p>



#### Discussion : $RW \rightarrow P$

As the results show, performance improves considerably and almost reaches the ideal upper bound. Although the t-SNE representation and the confusion matrix of the baseline were already good, the UDA component yields even better performance. The accuracy reaches 96.5 %, about 0.8 percentage points below the upper bound (a gain of 4.3 % against a maximum gain of 5.1 %, see the Final Discussion).

The final confusion matrix of the UDA model is close to an identity matrix, which tells us that the errors are limited to a few classes. In fact, only 8 classes out of 20 do not reach 100 % per-class accuracy (recall). Even for these classes the values are very high, and only 2 of them fall below 90 %.

<p align="center">
    <img src="https://drive.google.com/uc?id=1AjQiAWXVMXEhjg97MWfC_rhceUeBrqrO" width=400px>
    <img src="https://drive.google.com/uc?id=1Zo2TUDgOj1wd8SF8fdsuUl6bWgBaBGOL" width=400px>
    <figcaption align = "center"><b>Confusion matrix Baseline and MCD</b></figcaption>
</p>

As with the confusion matrix, the t-SNE representations are almost perfect: the class features obtained with the UDA component are well separated, with only very few outliers.
<p align="center">
    <img src="https://drive.google.com/uc?id=1myFwRjs6_xVE1op-KSdD34DGbXEInupD" width=300px>
    <img src="https://drive.google.com/uc?id=1IMQ-p3TLz_A8Fj6eMaZARdQG2QTteaJA" width=300px>
    <img src="https://drive.google.com/uc?id=1PJYjGwy6A08NAmCn7e3aEvF3byaUqf6J" width=300px>
    <figcaption align = "center"><b>t-SNE Baseline and MCD source and target classifier</b></figcaption>
</p>

The $RW \rightarrow P$ direction performs very well with this UDA method, almost reaching the upper bound. This makes further improvements very difficult, because the remaining margin is small.


### Final Discussion

The final gains for $P \rightarrow RW$ and $RW \rightarrow P$ are reported here:

<center>

| | Accuracy Baseline | Accuracy UDA | Gain | Maximum Gain |
| :---: | :---: | :---: |  :---: |  :---: |
| $P \rightarrow RW$ | 80.7 % | 88.0 % | 7.2 % | 9.8 % |
| $RW \rightarrow P$ | 92.2 % | 96.5 % | 4.3 % | 5.1 % |

</center>

As said before, the high accuracy on $RW \rightarrow P$ leaves little room for further improvement, while $P \rightarrow RW$ has a larger margin.
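
To make the remaining margin concrete, the short sketch below derives it from the Gain and Maximum Gain columns of the table. The values are copied from the table; the dictionary layout and variable names are only illustrative.

```python
# Values copied from the table above (percentage points)
results = {
    "P -> RW": {"gain": 7.2, "max_gain": 9.8},
    "RW -> P": {"gain": 4.3, "max_gain": 5.1},
}

summary = {}
for direction, r in results.items():
    # Margin still available before reaching the upper bound
    remaining = round(r["max_gain"] - r["gain"], 1)
    # Share of the maximum gain already obtained with MCD
    recovered = round(100 * r["gain"] / r["max_gain"], 1)
    summary[direction] = (remaining, recovered)
    print(f"{direction}: {remaining} points left, {recovered} % of the maximum gain obtained")
```

The remaining margin is 2.6 points for $P \rightarrow RW$ and 0.8 points for $RW \rightarrow P$.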

## 4) Improvements proposed
In the following section we propose an improvement for the MCD method, together with some additional ideas that we tried but that only reach the same performance as MCD. The first part explains the first improvement and reports its results, while the second part covers the other ideas.

### Improvement 1: Supervised training in the features generator

In the paper of the previous approach, the authors use three training steps, which we recap here:

- **Step A** : train the Feature Extractor and both Classifiers in a supervised way on the source domain
- **Step B** : train in an unsupervised way on the target domain
    - Step B.1 : fix the Feature Extractor (Generator) and train the 2 Classifiers in an unsupervised way on the target domain
    - Step B.2 : fix the Classifiers and train the Feature Extractor (Generator) in an unsupervised way on the target domain
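
Steps B.1 and B.2 can be sketched on toy modules as follows. This is a minimal illustration and not the training function of this notebook: the linear layers, the random data, the learning rate and the names `G`, `C1`, `C2` are all assumptions made for the example, and the discrepancy is computed on softmax probabilities, following the formula in the Discrepancy Loss section.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins (assumptions): feature extractor G and two classifiers C1, C2
G = nn.Linear(8, 4)
C1 = nn.Linear(4, 3)
C2 = nn.Linear(4, 3)
opt_g = torch.optim.SGD(G.parameters(), lr=0.1)
opt_c = torch.optim.SGD(list(C1.parameters()) + list(C2.parameters()), lr=0.1)
criterion = nn.CrossEntropyLoss()

x_s, y_s = torch.randn(16, 8), torch.randint(0, 3, (16,))  # labelled source batch
x_t = torch.randn(16, 8)                                    # unlabelled target batch


def discrepancy(logits1: torch.Tensor, logits2: torch.Tensor) -> torch.Tensor:
    """Mean absolute difference between the two softmax outputs."""
    return (F.softmax(logits1, dim=1) - F.softmax(logits2, dim=1)).abs().mean()


# Step B.1: G is fixed; the classifiers keep their source accuracy
# and MAXIMIZE the discrepancy on the target batch (hence the minus sign)
with torch.no_grad():
    f_s, f_t = G(x_s), G(x_t)  # no gradient reaches G
loss_b1 = criterion(C1(f_s), y_s) + criterion(C2(f_s), y_s) - discrepancy(C1(f_t), C2(f_t))
opt_c.zero_grad()
loss_b1.backward()
opt_c.step()

# Step B.2: the classifiers are fixed; G MINIMIZES the target discrepancy
c1_before = C1.weight.detach().clone()
g_before = G.weight.detach().clone()
f_t = G(x_t)
loss_b2 = discrepancy(C1(f_t), C2(f_t))
opt_g.zero_grad()
loss_b2.backward()
opt_g.step()  # only G's optimizer steps, so C1 and C2 stay unchanged
```

Here "fixing" a module simply means not stepping its optimizer (B.2) or not letting gradients reach it (B.1); other ways of freezing parameters would work as well.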

What we thought was missing is a supervised training of the features generator that takes into account not only the two distinct predictions of the classifiers, but also a combined prediction of the two.

After Step B we propose to add another step, which we call **Step C**, with the following procedure:
- Extract the features from an image of the source domain
- Compute the predictions with the classifier of the source domain and the classifier of the target domain, respectively called `out_source_s` and `out_target_s`
- Compute the combined prediction : `out = out_source_s + out_target_s`
- Compute the CrossEntropy Loss between the combined output and the corresponding label of the source domain
- Update the weights of the Features Extractor with the Loss just computed
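
The procedure above can be sketched in isolation on toy modules. This is a minimal illustration under stated assumptions: the linear layers, the random batch and the learning rate are made up for the example, and only the variable names mirror the ones used in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins (assumptions) for the feature extractor and the two classifiers
features_model = nn.Linear(8, 4)
classifier_source = nn.Linear(4, 3)
classifier_target = nn.Linear(4, 3)
optimizer_features = torch.optim.SGD(features_model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# Random labelled source batch (16 samples, 3 classes)
x_source = torch.randn(16, 8)
y_source = torch.randint(0, 3, (16,))

# Snapshots used to check which weights change
w_features = features_model.weight.detach().clone()
w_source = classifier_source.weight.detach().clone()

# Step C: cross-entropy on the combined prediction of the two classifiers
features_source = features_model(x_source)
out_source_s = classifier_source(features_source)
out_target_s = classifier_target(features_source)
out = out_source_s + out_target_s
loss_ce = criterion(out, y_source)

# Only the feature extractor is updated
optimizer_features.zero_grad()
loss_ce.backward()
optimizer_features.step()
```

The backward pass also leaves gradients on the classifiers, so their optimizers should call `zero_grad()` before their next update.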

For this improvement the only function we modify is `train_mcd_one_step()`, to which we add **Step C**. The test function and the training loop are the same as in the MCD method.

#### Train function

In [None]:
def train_mcd_one_step(
        features_model: nn.Module, 
        classifiers: Tuple[nn.Module, nn.Module],
        optimizers: Tuple[torch.optim.Optimizer, torch.optim.Optimizer, torch.optim.Optimizer], 
        criterion: nn.Module, 
        datasets: Tuple[DataLoader, DataLoader],
        epoch: int,
        k: int = DEFAULT_K_STEPS,
        device: str = DEFAULT_DEVICE,
    ) -> dict:
    ### For NLLLoss
    # m = nn.LogSoftmax(dim=1)

    # Extract classifiers
    classifier_source, classifier_target = classifiers

    # Extract optimizers
    optimizer_features, optimizer_source, optimizer_target = optimizers

    # Extract datasets
    train_loader_source, train_loader_target = datasets

    # Set models to train mode
    features_model.train()
    classifier_source.train()
    classifier_target.train()
    
    # Initialize cumulative variables
    cum_loss_source = 0
    cum_loss_target = 0
    cum_accuracy_source = 0
    cum_accuracy_target = 0
    n_samples = 0

    # Train loop    
    for (x_source, y_source), (x_target, y_target) in zip(train_loader_source, train_loader_target):
        # Move to device
        x_source, y_source = x_source.to(device), y_source.to(device)
        x_target, y_target = x_target.to(device), y_target.to(device)

        # Extract features and compute output
        features_source = features_model(x_source)
        out_source = classifier_source(features_source)
        out_target = classifier_target(features_source)

        # Compute loss
        loss_source = criterion(out_source, y_source)
        loss_target = criterion(out_target, y_source)
        loss_s = loss_source + loss_target
        cum_loss_source += loss_s.item()
        ### For NLLLoss
        # loss_source = criterion(m(out_source), y_source)
        # loss_target = criterion(m(out_target), y_source)
        # loss_s = loss_source + loss_target
        # cum_loss_source += loss_s.item()

        # Compute accuracy
        cum_accuracy_source += compute_accuracy(out_source, y_source)
        cum_accuracy_source += compute_accuracy(out_target, y_source)

        # Set optimizers to zero_grad
        optimizer_features.zero_grad()
        optimizer_source.zero_grad()
        optimizer_target.zero_grad()

        # Backward
        loss_s.backward()

        # Update weights
        optimizer_features.step()
        optimizer_source.step()
        optimizer_target.step()

        ### UDA ###
        # Extract features and compute output - source
        features_source = features_model(x_source)
        out_source_s = classifier_source(features_source)
        out_target_s = classifier_target(features_source)

        # Extract features and compute output - target
        features_target = features_model(x_target)
        out_source_t = classifier_source(features_target)
        out_target_t = classifier_target(features_target)
        
        # Compute loss - Only on source!!!
        loss_source = criterion(out_source_s, y_source)
        loss_target = criterion(out_target_s, y_source)
        loss_t = loss_source + loss_target
        ### For NLLLoss
        # loss_source = criterion(m(out_source_s), y_source)
        # loss_target = criterion(m(out_target_s), y_source)
        # loss_t = loss_source + loss_target

        # Compute discrepancy loss on target
        d = discrepancy(out_source_t, out_target_t)

        # Compute total loss
        loss = loss_t - d
        cum_loss_target += loss.item()

        # Compute accuracy on target - labels are used only to monitor training, never in the loss
        cum_accuracy_target += compute_accuracy(out_source_t, y_target)
        cum_accuracy_target += compute_accuracy(out_target_t, y_target)

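        # Set optimizers to zero_grad (before the backward pass, otherwise the
        # freshly computed gradients are erased and the step below cannot use them)
        optimizer_source.zero_grad()
        optimizer_target.zero_grad()

        # Backward the total loss of source and target
        loss.backward()

        # Update weights (classifiers only: the feature extractor is fixed in this step)
        optimizer_source.step()
        optimizer_target.step()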
        
        # Update n_samples
        n_samples += x_source.shape[0]
        n_samples += x_target.shape[0]
        
        
        for i in range(k):
            # Extract features and compute output 
            features_target = features_model(x_target)
            out_source_t = classifier_source(features_target)
            out_target_t = classifier_target(features_target)

            # Compute loss
            loss_discrepancy = discrepancy(out_source_t, out_target_t)

            # Zero gradients
            optimizer_features.zero_grad()

            # Backpropagate
            loss_discrepancy.backward()
            
            # Step optimizer
            optimizer_features.step()

        # Step C:
        # Extract features and compute output (use the function argument, not a global)
        features_source = features_model(x_source)
        out_source_s = classifier_source(features_source)
        out_target_s = classifier_target(features_source)

        # Compute combined output
        out = out_source_s + out_target_s

        # Compute loss
        loss_CE = criterion(out, y_source)

        # Zero gradients
        optimizer_features.zero_grad()

        # Backpropagate
        loss_CE.backward()
        
        # Step optimizer
        optimizer_features.step()


    return {
        'train/discrepancy' : d.cpu().detach().numpy(),
        'train/loss_features' : loss_discrepancy.cpu().detach().numpy(),
        'train/loss_source': cum_loss_source / n_samples,
        'train/loss_target': cum_loss_target / n_samples,
        'train/accuracy_source': cum_accuracy_source / n_samples,
        'train/accuracy_target': cum_accuracy_target / n_samples,
    }

In [None]:
def train_mcd(
        feature_model: nn.Module, 
        classifiers: Tuple[nn.Module, nn.Module],
        optimizers: Tuple[torch.optim.Optimizer, torch.optim.Optimizer, torch.optim.Optimizer], 
        schedulers: Tuple[torch.optim.lr_scheduler._LRScheduler, torch.optim.lr_scheduler._LRScheduler, torch.optim.lr_scheduler._LRScheduler],
        criterion: nn.Module, 
        train_datasets: Tuple[DataLoader, DataLoader],
        eval_dataset: DataLoader,
        n_epochs: int = DEFAULT_EPOCHS,
        k: int = DEFAULT_K_STEPS,
        device: str = DEFAULT_DEVICE,
        save_name: str = "UDA_MCD_P_RW",
    ) -> Tuple[nn.Module, nn.Module, nn.Module]:
    # Extract classifiers
    classifier_source, classifier_target = classifiers

    # Move to device
    feature_model.to(device)
    classifier_source.to(device)
    classifier_target.to(device)

    # Extract schedulers
    scheduler_features, scheduler_source, scheduler_target = schedulers

    # Initialize variables
    best_accuracy = 0
    best_model_f = None
    best_model_s = None
    best_model_t = None

    # Start timer
    start_timer = time.time()

    # Store metrics 
    metrics_visualize = {
        'train/discrepancy': [],
        'train/loss_features': [],
        'train/loss_source': [],
        'train/loss_target': [],
        'train/accuracy_source': [],
        'train/accuracy_target': [],
        'eval/loss': [],
        'eval/accuracy': [],
        'eval/accuracy_source': [],
        'eval/accuracy_target': [],
    }

    # Train loop
    for epoch in range(n_epochs):
        # Train
        train_metrics = train_mcd_one_step(
            features_model=feature_model, 
            classifiers=classifiers, 
            optimizers=optimizers, 
            criterion=criterion, 
            datasets=train_datasets, 
            epoch=epoch,
            k=k,
            device=device,
        )
        # Test
        test_metrics = test_mcd_one_step(
            feature_model=feature_model, 
            classifiers=classifiers, 
            criterion=criterion,
            dataset=eval_dataset, 
            device=device,
        )

        # Put together metrics
        metrics = {**train_metrics, **test_metrics}

        # wandb log
        wandb.log(metrics)

        # Save best model
        if test_metrics['eval/accuracy'] > best_accuracy:
            best_accuracy = test_metrics['eval/accuracy']

            best_model_f = copy.deepcopy(feature_model)
            best_model_s = copy.deepcopy(classifier_source)
            best_model_t = copy.deepcopy(classifier_target)
            
            torch.save(best_model_f.state_dict(), os.path.join(LOG_PATH_MODELS, save_name+"_features_extractor.pth"))
            torch.save(best_model_s.state_dict(), os.path.join(LOG_PATH_MODELS, save_name+"_source.pth"))
            torch.save(best_model_t.state_dict(), os.path.join(LOG_PATH_MODELS, save_name+"_target.pth"))

        # Log metrics
        opt_f, opt_s, opt_t = optimizers
        make_log_print("Train", (epoch+1, n_epochs), time.time() - start_timer, metrics, lr=(round(opt_f.param_groups[0]['lr'], 6), round(opt_s.param_groups[0]['lr'], 6), round(opt_t.param_groups[0]['lr'], 6)))

        # Store metrics
        for key in metrics_visualize.keys():
            metrics_visualize[key].append(metrics[key])

        # Update scheduler
        scheduler_features.step()
        scheduler_source.step()
        scheduler_target.step()

    # Plot metrics
    visualize_results(
        feature_model=best_model_f,
        classifier_model_s=best_model_s,
        classifier_model_t=best_model_t,
        eval_dataset=eval_dataset,
        metrics=metrics_visualize, 
        device=device, 
        save_path=os.path.join(LOG_PATH_IMAGES, save_name))
        
    return best_accuracy

#### $P \rightarrow RW$

In [None]:
feature_model = FeatureModel(resnet_version='resnet18')
model_s = Predictor(input_size=feature_model.output_size)
model_t = Predictor(input_size=feature_model.output_size)

opt_f = get_optimizer(feature_model, optimizer='sgd', lr=1e-2)
opt_s = get_optimizer(model_s, optimizer='sgd', lr=1e-1)
opt_t = get_optimizer(model_t, optimizer='sgd', lr=1e-1)

scheduler_f = get_scheduler(opt_f)
scheduler_s = get_scheduler(opt_s)
scheduler_t = get_scheduler(opt_t)

criterion = get_criterion(criterion='cross_entropy')

# Initialize wandb
experiment_name = 'UDA_P_RW_1'

wandb.init(
    # Set the project where this run will be logged
    project=DEFAULT_WANDB_PROJECT_NAME, 
    # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
    name=experiment_name, 
    # Track hyperparameters and run metadata
    config={
        "feature_model": feature_model.__class__.__name__,
        "source_model": model_s.__class__.__name__,
        "target_model": model_t.__class__.__name__,
        "features_optimizer": opt_f.__class__.__name__,
        "source_optimizer": opt_s.__class__.__name__,
        "target_optimizer": opt_t.__class__.__name__,
        "features_scheduler": scheduler_f.__class__.__name__,
        "source_scheduler": scheduler_s.__class__.__name__,
        "target_scheduler": scheduler_t.__class__.__name__,
        "criterion": criterion.__class__.__name__,
        "training_type": "UDA P-RW Improvement 1",
        "training_dataset": "Products Supervised + RealWorld Unsupervised",
        "evaluation_dataset": "RealWorld",
        "learning_rate_feature_model": opt_f.param_groups[0]['lr'],
        "learning_rate_source_model": opt_s.param_groups[0]['lr'],
        "learning_rate_target_model": opt_t.param_groups[0]['lr'],
        "batch_size": DEFAULT_BATCH_SIZE,
        "epochs": 40,
    }
)

accuracy_uda_p_rw_1 = train_mcd(
    feature_model=feature_model,
    classifiers=(model_s, model_t),
    optimizers=(opt_f, opt_s, opt_t),
    schedulers=(scheduler_f, scheduler_s, scheduler_t),
    criterion=criterion,
    train_datasets=(train_loader_products, train_loader_reals),
    eval_dataset=test_loader_reals,
    n_epochs=DEFAULT_EPOCHS + 10,
    k=DEFAULT_K_STEPS,
    save_name=experiment_name
)

# Close wandb
wandb.finish()

# Delete variables
del feature_model, model_s, model_t, opt_f, opt_s, opt_t, scheduler_f, scheduler_s, scheduler_t, criterion, experiment_name

# Empty cache
if DEFAULT_DEVICE.type == 'cuda':
    torch.cuda.empty_cache()

# Print results
print("Accuracy UDA P-RW: {:.2f}%".format(accuracy_uda_p_rw_1*100))
print("Gain UDA P-RW: {:.2f}%".format((accuracy_uda_p_rw_1 - accuracy_baseline_p_rw)*100))

##### Results 

<center>

| | Accuracy | Loss |
| :---: | :---: | :---: |
| Train source (supervised) | 76.5 % | 0.001 |
| Train target (unsupervised) | 91.2 % | 0.001 |
| Test source | 88.3 % | 0.002 |
| Test target | 88 % | 0.002 |
| Test overall | 89 % | 0.002 |

</center>

###### Metrics
<center>

| | precision | recall | f1-score |  support |
| :---: | :---: | :---: | :---: | :---: |
| backpack | 0.90 | 0.86 | 0.88 | 22 |
| bookcase | 0.88 | 0.88 | 0.88 | 17 |
| car jack | 0.68 | 1.00 | 0.81 | 17 |
| comb | 0.88 | 0.85 | 0.87 | 27 |
| crown | 0.88 | 0.94 | 0.91 | 16 |
| file cabinet | 0.86 | 0.86 | 0.86 | 21 |
| flat iron | 1.00 | 0.81 | 0.89 | 21 |
| helicopter | 0.88 | 0.88 | 0.88 | 17 |
| game controller | 0.90 | 0.95 | 0.92 | 19 |
| glasses | 0.68 | 0.86 | 0.76 | 22 |
| ice skates | 0.89 | 0.85 | 0.87 | 20 |
| letter tray | 0.86 | 0.78 | 0.82 | 23 |
| monitor | 0.95 | 1.00 | 0.98 | 21 |
| mug | 1.00 | 1.00 | 1.00 | 21 |
| network switch | 0.94 | 0.75 | 0.83 | 20 |
| over-ear headphones | 0.88 | 0.93 | 0.90 | 15 |
| pen | 0.94 | 1.00 | 0.97 | 17 |
| purse | 0.95 | 0.88 | 0.91 | 24 |
| stand mixer | 1.00 | 0.95 | 0.97 | 19 |
| stroller | 1.00 | 0.86 | 0.92 | 21 |
| | | | | |
| accuracy | | | 0.89 | 400 |
| macro avg | 0.90 | 0.89 | 0.89 | 400 |
| weighted avg | 0.90 | 0.89 | 0.89 | 400 |

</center>
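
As a reading aid for the table above, the f1-score column is the harmonic mean of precision and recall. The sketch below recomputes it for three rows copied from the table; the function name is hypothetical. Because the table shows values rounded to two decimals, recomputing from them can differ by 0.01 for some other rows.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows copied from the table above: class -> (precision, recall)
rows = {
    "car jack": (0.68, 1.00),
    "glasses": (0.68, 0.86),
    "network switch": (0.94, 0.75),
}

# Recompute the f1-score of each row, rounded like the table
recomputed = {name: round(f1_score(p, r), 2) for name, (p, r) in rows.items()}
```

For these rows the recomputed values match the reported ones (0.81, 0.76 and 0.83).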

###### Loss and Accuracy curves

<p align="center">
    <img src="https://drive.google.com/uc?id=13e5QlZbba5XoL901SsMKedJkCfDXvMaP" width=400px>
    <img src="https://drive.google.com/uc?id=1T4HTrsFsEaXkrZG9xDGccrjTfJXX-c9U" width=400px>
    <img src="https://drive.google.com/uc?id=1_hs-IGUr5pnR4dXo4l5_5DMb4g_lsJTF" width=400px>
    <figcaption align = "center"><b>Loss & Accuracy & Discrepancy(features only)</b></figcaption>
</p>

###### Confusion Matrix

<p align="center">
    <img src="https://drive.google.com/uc?id=1cKJVp0WDaRR3NnUrAB5DUpoT9axvZZOS" width=600px>
    <figcaption align = "center"><b>Confusion Matrix</b></figcaption>
</p>

###### t-SNE

<p align="center">
    <img src="https://drive.google.com/uc?id=101I23Ef_Uj6PDM1Pj60QbwndGBtag8M9" width=400px>
    <img src="https://drive.google.com/uc?id=1HbIOrOBcvkF5-rfbwvSkijevvQ0GY8Vb" width=400px>
    <figcaption align = "center"><b>t-SNE on Source & Target</b></figcaption>
</p>



##### Discussion $P \rightarrow RW$

The table below compares our results with those of MCD without the improvement:

<center>

| | Accuracy | 
| :---: | :---: |
| MCD | 88% |
| MCD + Improvement | 89% | 

</center>

The proposed improvement raises the accuracy by 1 percentage point over the MCD method without the improvement (from 88% to 89%).
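
To put this difference in perspective, the sketch below converts an accuracy gap into a number of test images and computes the binomial standard error of a single accuracy estimate. It assumes the 400 test images reported in the support column of the metrics table and independent samples; the function name is hypothetical.

```python
import math


def accuracy_gain(acc_base, acc_new, n_samples):
    """Return (extra correctly classified samples, standard error of acc_new)."""
    extra_correct = round((acc_new - acc_base) * n_samples)
    # Binomial standard error of an accuracy measured on n_samples images
    std_error = math.sqrt(acc_new * (1 - acc_new) / n_samples)
    return extra_correct, std_error


# MCD (88%) versus MCD + Improvement (89%) on 400 test images
extra, se = accuracy_gain(0.88, 0.89, 400)
```

Here the gain corresponds to 4 additional correctly classified images, while the standard error of a single accuracy estimate is about 1.6 percentage points, so differences of this size are best confirmed over several runs.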

As the graph shows, the accuracy is higher across all epochs while the overall trend stays the same. This suggests that adding a supervised step on the feature extractor, based on the combined output of the two classifiers, adds value to the method and improves its performance.

<p align="center">
    <img src="https://drive.google.com/uc?id=1JFkTV6aQUuRMeTSeK7H1VYef_zr_ESdP" width=400px>
    <figcaption align = "center"><b>Accuracy MCD and MCD + Improvement</b></figcaption>
</p>

The improvement shows not only in the accuracy but also in the minimization of the discrepancy during training: as the graph below shows, the discrepancy of the improved MCD is generally lower than that of the plain MCD.

<p align="center">
    <img src="https://drive.google.com/uc?id=1XTLO5-bO_iG-pJzd-jQc5AozsZ2WaDRl" width=400px>
    <figcaption align = "center"><b>Discrepancy MCD and MCD + Improvement</b></figcaption>
</p>

Furthermore, comparing the confusion matrices of MCD and MCD + Improvement, we can see that, even though some errors remain, the classes that had a very low accuracy reach a higher value with the improvement, while the other classes keep the same value. In particular, classes with similar objects both reach a higher accuracy and are no longer misclassified as each other. One example is backpack and purse: with MCD the purse class had a very low accuracy and was most often misclassified as backpack.

<p align="center">
    <img src="https://drive.google.com/uc?id=151JLK_-ohBakq8jzkAId27S4G_Wpog8g" width=500px>
    <img src="https://drive.google.com/uc?id=1cKJVp0WDaRR3NnUrAB5DUpoT9axvZZOS" width=500px>
    <figcaption align = "center"><b>Confusion Matrix of MCD and MCD + Improvement</b></figcaption>
</p>


#### $RW \rightarrow P$

In [None]:
feature_model = FeatureModel(resnet_version='resnet18')
model_s = Predictor(input_size=feature_model.output_size)
model_t = Predictor(input_size=feature_model.output_size)

opt_f = get_optimizer(feature_model, optimizer='sgd', lr=1e-2)
opt_s = get_optimizer(model_s, optimizer='sgd', lr=1e-1)
opt_t = get_optimizer(model_t, optimizer='sgd', lr=1e-1)

scheduler_f = get_scheduler(opt_f)
scheduler_s = get_scheduler(opt_s)
scheduler_t = get_scheduler(opt_t)

criterion = get_criterion(criterion='cross_entropy')

# Initialize wandb
experiment_name = 'UDA_RW_P_1'

wandb.init(
    # Set the project where this run will be logged
    project=DEFAULT_WANDB_PROJECT_NAME, 
    # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
    name=experiment_name, 
    # Track hyperparameters and run metadata
    config={
        "feature_model": feature_model.__class__.__name__,
        "source_model": model_s.__class__.__name__,
        "target_model": model_t.__class__.__name__,
        "features_optimizer": opt_f.__class__.__name__,
        "source_optimizer": opt_s.__class__.__name__,
        "target_optimizer": opt_t.__class__.__name__,
        "features_scheduler": scheduler_f.__class__.__name__,
        "source_scheduler": scheduler_s.__class__.__name__,
        "target_scheduler": scheduler_t.__class__.__name__,
        "criterion": criterion.__class__.__name__,
        "training_type": "UDA RW-P Improvement 1",
        "training_dataset": "RealWorld Supervised + Products Unsupervised",
        "evaluation_dataset": "Products",
        "learning_rate_feature_model": opt_f.param_groups[0]['lr'],
        "learning_rate_source_model": opt_s.param_groups[0]['lr'],
        "learning_rate_target_model": opt_t.param_groups[0]['lr'],
        "batch_size": DEFAULT_BATCH_SIZE,
        "epochs": 40,
    }
)

accuracy_uda_rw_p_1 = train_mcd(
    feature_model=feature_model,
    classifiers=(model_s, model_t),
    optimizers=(opt_f, opt_s, opt_t),
    schedulers=(scheduler_f, scheduler_s, scheduler_t),
    criterion=criterion,
    train_datasets=(train_loader_reals, train_loader_products),
    eval_dataset=test_loader_products,
    n_epochs=DEFAULT_EPOCHS + 10,
    k=DEFAULT_K_STEPS,
    save_name=experiment_name
)

# Close wandb
wandb.finish()

# Delete variables
del feature_model, model_s, model_t, opt_f, opt_s, opt_t, scheduler_f, scheduler_s, scheduler_t, criterion, experiment_name

# Empty cache
if DEFAULT_DEVICE.type == 'cuda':
    torch.cuda.empty_cache()

# Print results
print("Accuracy UDA RW-P: {:.2f}%".format(accuracy_uda_rw_p_1*100))
print("Gain UDA RW-P: {:.2f}%".format((accuracy_uda_rw_p_1 - accuracy_baseline_rw_p)*100))

##### Results 

<center>

| | Accuracy | Loss |
| :---: | :---: | :---: |
| Train source (supervised) | 90.4 % | 0.001 |
| Train target (unsupervised) | 84.3 % | 0.001 |
| Test source | 95.5 % | 0.001 |
| Test target | 95.3 % | 0.001 |
| Test overall | 95.8 % | 0.001 |

</center>

###### Metrics

<center>

| | precision | recall | f1-score | support |
| :---: | :---: | :---: | :---: | :---: |
| backpack | 0.88 | 1.00 | 0.94 | 15 |
| bookcase | 0.83 | 0.83 | 0.83 | 18 |
| car jack | 1.00 | 0.94 | 0.97 | 17 |
| comb | 1.00 | 1.00 | 1.00 | 19 |
| crown | 0.97 | 1.00 | 0.98 | 29 |
| file cabinet | 1.00 | 0.90 | 0.95 | 21 |
| flat iron | 1.00 | 1.00 | 1.00 | 20 |
| game controller | 0.87 | 0.95 | 0.91 | 21 |
| glasses | 0.92 | 0.83 | 0.87 | 29 |
| helicopter | 1.00 | 1.00 | 1.00 | 17 |
| ice skates | 0.96 | 0.96 | 0.96 | 24 |
| letter tray | 0.89 | 1.00 | 0.94 | 17 |
| monitor | 0.94 | 1.00 | 0.97 | 16 |
| mug | 1.00 | 1.00 | 1.00 | 16 |
| network switch | 1.00 | 1.00 | 1.00 | 19 |
| over-ear headphones | 0.95 | 0.95 | 0.95 | 19 |
| pen | 1.00 | 1.00 | 1.00 | 19 |
| purse | 1.00 | 0.96 | 0.98 | 24 |
| stand mixer | 0.95 | 0.90 | 0.92 | 20 |
| stroller | 1.00 | 1.00 | 1.00 | 20 |
| | | | | |
| accuracy | | | 0.96 | 400 |
| macro avg | 0.96 | 0.96 | 0.96 | 400 |
| weighted avg | 0.96 | 0.96 | 0.96 | 400 |

</center>

###### Loss and Accuracy curves

<p align="center">
    <img src="https://drive.google.com/uc?id=1Lcd55AfzshVfhlZycNhdv1yefEkJ5eCV" width=400px>
    <img src="https://drive.google.com/uc?id=1COPj3-mK6UHvYbNv0cXkB4L78pY5AqcO" width=400px>
    <img src="https://drive.google.com/uc?id=1xD1oAdi5k_tYBzM41cExYfXXhhc3Guf3" width=400px>
    <figcaption align = "center"><b>Loss & Accuracy & Discrepancy(features only)</b></figcaption>
</p>

###### Confusion Matrix

<p align="center">
    <img src="https://drive.google.com/uc?id=13jpRypMmlD-C79RJXXsaZQj81p9gfgTm" width=600px>
    <figcaption align = "center"><b>Confusion Matrix</b></figcaption>
</p>

###### t-SNE

<p align="center">
    <img src="https://drive.google.com/uc?id=10iOJqbUNu1XxQk11ZIX73VKRbmrFciTz" width=400px>
    <img src="https://drive.google.com/uc?id=1IWJwGILCo1eG5rLli7UjAOKrw8Sn3BIG" width=400px>
    <figcaption align = "center"><b>t-SNE on Source & Target</b></figcaption>
</p>

##### Discussion $RW \rightarrow P$

The table below compares our results with those of MCD without the improvement:

<center>

| | Accuracy | 
| :---: | :---: |
| MCD | 96.5% |
| MCD + Improvement | 95.8% | 

</center>

For this task the proposed idea did not bring any improvement, as the accuracy shows: it dropped from 96.5% to 95.8%. However, the accuracy of MCD was already very high, which left little room for further gains.

For this case too we report the accuracy and discrepancy graphs. Once again the discrepancy is generally lower than that of MCD, but this did not translate into a higher accuracy.

<p align="center">
    <img src="https://drive.google.com/uc?id=1iDPtMjod1i7aRI3gy8_kxK3NA-kxugjC" width=400px>
    <img src="https://drive.google.com/uc?id=1EiBQIrctj_BGaSsA7LWJUplmAFardVaz" width=400px>
    <figcaption align = "center"><b>Accuracy & Discrepancy of MCD and MCD + Improvement</b></figcaption>
</p>

Even though there is no improvement in this case, the performance is very similar and the classes are still well separated. 
From the confusion matrix we can see that some classes that were confused with others under MCD improved, but with the improvement they started to be confused with different classes, as happens for example with game controller and helicopter.
<p align="center">
    <img src="https://drive.google.com/uc?id=1Zo2TUDgOj1wd8SF8fdsuUl6bWgBaBGOL" width=400px>
    <img src="https://drive.google.com/uc?id=13jpRypMmlD-C79RJXXsaZQj81p9gfgTm" width=400px>
    <figcaption align = "center"><b>Confusion Matrix</b></figcaption>
</p>

### Improvement 2 : Loss function

With the previous improvement the accuracy of the model increased (at least for $P \rightarrow RW$), but the loss did not change much. We believe the loss did not improve because the loss function is not the best one for this task.

The authors of the MCD paper use the CrossEntropy loss. We looked for a better way to compute the loss and took it from [CDA: Contrastive-adversarial Domain Adaptation](https://arxiv.org/abs/2301.03826).

They propose a two-stage loss whose stages correspond to the supervised and unsupervised steps. It is defined as:
 - **Stage 1** : the loss is the sum of the CrossEntropy loss and a Supervised Contrastive Loss. The Supervised Contrastive Loss ($L_{SupCL}$) is computed as:
   $$L_{SupCL}(X_s, Y_s) = - \sum_{z, z^+ \in D_s} \log \frac{\exp (z^T z^+ / \tau)}{\exp (z^T z^+ / \tau) + \sum_{z^- \in D_s} \exp (z^T z^- / \tau)}$$
   where $z$ denotes the $l_2$-normalized latent embedding generated by $G$ (the feature generator) for a source input sample $x_s$, $z^+$ is the embedding of a sample with the same label as $z$, and $z^-$ is the embedding of a sample with a different label. $D_s$ is the source domain while the target domain is $D_t$. The temperature $\tau$ (a hyperparameter set to `0.1` by default) affects how the model learns from hard negatives.
 - **Stage 2** : the loss is a cross-domain contrastive loss:
   $$L_{CrossCL}(X_s, Y_s, X_t) = - \sum_{i = 1 ;\ z_s \in D_s ;\ z_t \in D_t}^N \log \frac{\exp ({z_s^i}^T z^i_t / \tau)}{\exp ({z_s^i}^T z^i_t / \tau) + \sum_{i \neq k = 1}^N \exp ({z_s^i}^T z^k_t / \tau)}$$

However, we have not been able to implement the second stage of the loss function, because it assumes that some form of pseudo-labels is available for the target domain. The authors propose to use *k*-means clustering to generate them. We did not use it since we cannot use labels for the target domain.

Thus, the loss function at **stage 2** is implemented as the MCD loss function: CrossEntropy - Discrepancy.
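
The "CrossEntropy - Discrepancy" objective can be illustrated on a single sample with the framework-free sketch below. The notebook's own `discrepancy` helper is defined in an earlier cell that is not shown here, so this sketch assumes the definition used in the MCD paper (mean absolute difference between the two classifiers' softmax outputs). All function names and values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discrepancy(logits_1, logits_2):
    """Mean absolute difference between the two heads' class probabilities."""
    p1, p2 = softmax(logits_1), softmax(logits_2)
    return sum(abs(a - b) for a, b in zip(p1, p2)) / len(p1)


def stage_2_objective(logits_1, logits_2, label):
    """Cross-entropy of both heads minus their discrepancy, for one sample."""
    ce = -math.log(softmax(logits_1)[label]) - math.log(softmax(logits_2)[label])
    d = discrepancy(logits_1, logits_2)
    return ce - d, d


# Two heads that agree have zero discrepancy; opposite heads approach the maximum
objective_same, d_same = stage_2_objective([2.0, 0.0], [2.0, 0.0], label=0)
d_opposite = discrepancy([10.0, -10.0], [-10.0, 10.0])
```

Minimizing this objective with respect to the classifiers keeps them accurate on the labeled samples while pushing their predictions apart.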

In [None]:
def supervised_contrastive_loss(zs, ys, temperature=0.1):
    """
    Supervised contrastive loss on the source batch.

    zs: features output by the ResNet feature extractor, shape (batch, dim)
    ys: ground-truth labels of the batch, shape (batch,)
    temperature: temperature scaling applied to the similarities
    """
    # Normalize the embeddings
    zs = F.normalize(zs, dim=1)
    loss = 0
    logits_plus = 0
    logits_neg = 0
    for i in range(DEFAULT_NUM_CLASSES):
        # z+ = embeddings of the samples whose ground-truth class is i
        # z- = embeddings of the samples whose ground-truth class is not i
        z_plus = zs[ys == i]
        z_minus = zs[ys != i]
        for z in z_plus:
            logits_plus = torch.exp(torch.matmul(zs, z.t()) / temperature)

            for m in z_minus:
                logits_neg += torch.exp(torch.matmul(zs, m.t()) / temperature)

        # Compute the loss for the positive pairs
        loss += torch.log(logits_plus / (logits_plus + logits_neg))

    return -loss.mean()

def loss_stage_1(source_x, source_y, target_x, features, temperature: float = 0.1):
    # First get the loss with CrossEntropyLoss
    criterion = nn.CrossEntropyLoss()
    loss = criterion(source_x, source_y) + criterion(target_x, source_y)

    # Then add the loss with the supervised contrastive loss
    return loss + supervised_contrastive_loss(features, source_y, temperature)

def loss_stage_2(source_x, source_y, target_x, features, temperature: float = 0.1):
    # First get the loss with CrossEntropyLoss
    criterion = nn.CrossEntropyLoss()
    loss = criterion(source_x, source_y) + criterion(target_x, source_y)

    # Compute discrepancy loss
    discrepancy_loss = discrepancy(source_x, target_x)

    # Then remove the discrepancy loss
    return loss - discrepancy_loss, discrepancy_loss
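
The formula for $L_{SupCL}$ given earlier can also be evaluated pair by pair. The sketch below is a framework-free reference in plain Python, not the notebook's implementation: the function and helper names are hypothetical, embeddings are assumed to be non-zero vectors, and the loss is the plain sum over positive pairs as in the formula (no averaging). A reference like this can be handy for checking a tensor implementation on a tiny batch.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def supervised_contrastive_loss_reference(embeddings, labels, temperature=0.1):
    """Sum of -log(pos / (pos + negatives)) over all (anchor, positive) pairs."""
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    loss = 0.0
    for i in range(n):
        # Negatives of anchor i: every sample with a different label
        neg = sum(math.exp(dot(z[i], z[k]) / temperature)
                  for k in range(n) if labels[k] != labels[i])
        for j in range(n):
            # Positives of anchor i: other samples with the same label
            if j == i or labels[j] != labels[i]:
                continue
            pos = math.exp(dot(z[i], z[j]) / temperature)
            loss -= math.log(pos / (pos + neg))
    return loss


# Toy batch: two tight groups of 2-D embeddings
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_clustered = supervised_contrastive_loss_reference(toy_embeddings, [0, 0, 1, 1])
loss_mixed = supervised_contrastive_loss_reference(toy_embeddings, [0, 1, 0, 1])
```

When the labels agree with the geometry of the embeddings the loss is close to zero, while labels that cut across the two groups give a much larger value.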

#### Train & Test functions

These functions have to be redefined in order to use the new loss function.

In [None]:
def train_improved_loss_one_step(
        features_model: nn.Module, 
        classifiers: Tuple[nn.Module, nn.Module],
        optimizers: Tuple[torch.optim.Optimizer, torch.optim.Optimizer, torch.optim.Optimizer], 
        datasets: Tuple[DataLoader, DataLoader],
        k: int = DEFAULT_K_STEPS,
        device: str = DEFAULT_DEVICE,
    ) -> dict:
    # Extract classifiers
    classifier_source, classifier_target = classifiers

    # Extract optimizers
    optimizer_features, optimizer_source, optimizer_target = optimizers

    # Extract datasets
    train_loader_source, train_loader_target = datasets

    # Set models to train mode
    features_model.train()
    classifier_source.train()
    classifier_target.train()
    
    # Initialize cumulative variables
    cum_loss_source = 0
    cum_loss_target = 0
    cum_accuracy_source = 0
    cum_accuracy_target = 0
    n_samples = 0

    # Train loop    
    for (x_source, y_source), (x_target, y_target) in zip(train_loader_source, train_loader_target):
        # Move to device
        x_source, y_source = x_source.to(device), y_source.to(device)
        x_target, y_target = x_target.to(device), y_target.to(device)

        # Extract features and compute output
        features_source = features_model(x_source)
        out_source = classifier_source(features_source)
        out_target = classifier_target(features_source)

        # Compute loss
        loss_s = loss_stage_1(out_source, y_source, out_target, features_source)
        cum_loss_source += loss_s.item()

        # Compute accuracy
        cum_accuracy_source += compute_accuracy(out_source, y_source)
        cum_accuracy_source += compute_accuracy(out_target, y_source)

        # Set optimizers to zero_grad
        optimizer_features.zero_grad()
        optimizer_source.zero_grad()
        optimizer_target.zero_grad()

        # Backward
        loss_s.backward()

        # Update weights
        optimizer_features.step()
        optimizer_source.step()
        optimizer_target.step()

        ### UDA ###
        # Extract features and compute output - source
        features_source = features_model(x_source)
        out_source_s = classifier_source(features_source)
        out_target_s = classifier_target(features_source)

        # Extract features and compute output - target
        features_target = features_model(x_target)
        out_source_t = classifier_source(features_target)
        out_target_t = classifier_target(features_target)
        
        # Compute loss - Only on source!!!
        loss, d = loss_stage_2(out_source_s, y_source, out_target_s, features_source)
        cum_loss_target += loss.item()

        # Compute accuracy - target labels are used for monitoring only, never for training
        cum_accuracy_target += compute_accuracy(out_source_t, y_target)
        cum_accuracy_target += compute_accuracy(out_target_t, y_target)

        # Set optimizers to zero_grad (before the backward pass, otherwise the new gradients are discarded)
        optimizer_source.zero_grad()
        optimizer_target.zero_grad()

        # Backward the stage-2 loss
        loss.backward()

        # Update weights
        optimizer_source.step()
        optimizer_target.step()
        
        # Update n_samples
        n_samples += x_source.shape[0]
        n_samples += x_target.shape[0]
        
        for i in range(k):
            # Extract features and compute output 
            features_target = features_model(x_target)
            out_source_t = classifier_source(features_target)
            out_target_t = classifier_target(features_target)

            # Compute loss
            loss_discrepancy = discrepancy(out_source_t, out_target_t)

            # Zero gradients
            optimizer_features.zero_grad()

            # Backpropagate
            loss_discrepancy.backward()
            
            # Step optimizer
            optimizer_features.step()


    return {
        'train/discrepancy' : d.cpu().detach().numpy(),
        'train/loss_features' : loss_discrepancy.cpu().detach().numpy(),
        'train/loss_source': cum_loss_source / n_samples,
        'train/loss_target': cum_loss_target / n_samples,
        'train/accuracy_source': cum_accuracy_source / n_samples,
        'train/accuracy_target': cum_accuracy_target / n_samples,
    }

In [None]:
def train_improved(
        feature_model: nn.Module, 
        classifiers: Tuple[nn.Module, nn.Module],
        optimizers: Tuple[torch.optim.Optimizer, torch.optim.Optimizer, torch.optim.Optimizer], 
        schedulers: Tuple[torch.optim.lr_scheduler._LRScheduler, torch.optim.lr_scheduler._LRScheduler, torch.optim.lr_scheduler._LRScheduler],
        train_datasets: Tuple[DataLoader, DataLoader],
        eval_dataset: DataLoader,
        n_epochs: int = DEFAULT_EPOCHS,
        k: int = DEFAULT_K_STEPS,
        device: str = DEFAULT_DEVICE,
        save_name: str = "UDA_MCD_P_RW",
    ) -> float:
    # Extract classifiers
    classifier_source, classifier_target = classifiers

    # Move to device
    feature_model.to(device)
    classifier_source.to(device)
    classifier_target.to(device)

    # Extract schedulers
    scheduler_features, scheduler_source, scheduler_target = schedulers

    # Initialize variables
    best_accuracy = 0
    best_model_f = None
    best_model_s = None
    best_model_t = None

    # Start timer
    start_timer = time.time()

    # Train loop
    for epoch in range(n_epochs):
        # Train
        train_metrics = train_improved_loss_one_step(
            features_model=feature_model, 
            classifiers=classifiers, 
            optimizers=optimizers, 
            datasets=train_datasets, 
            k=k,
            device=device,
        )
        # Test
        test_metrics = test_mcd_one_step(
            feature_model=feature_model, 
            classifiers=classifiers, 
            dataset=eval_dataset, 
            device=device,
        )

        # Put together metrics
        metrics = {**train_metrics, **test_metrics}

        # Save best model
        if test_metrics['eval/accuracy'] > best_accuracy:
            best_accuracy = test_metrics['eval/accuracy']

            best_model_f = copy.deepcopy(feature_model)
            best_model_s = copy.deepcopy(classifier_source)
            best_model_t = copy.deepcopy(classifier_target)
            
            torch.save(best_model_f.state_dict(), os.path.join(LOG_PATH_MODELS, save_name+"_features_extractor.pth"))
            torch.save(best_model_s.state_dict(), os.path.join(LOG_PATH_MODELS, save_name+"_source.pth"))
            torch.save(best_model_t.state_dict(), os.path.join(LOG_PATH_MODELS, save_name+"_target.pth"))

        # Log metrics
        opt_f, opt_s, opt_t = optimizers
        make_log_print("Train", (epoch+1, n_epochs), time.time() - start_timer, metrics, lr=(round(opt_f.param_groups[0]['lr'], 6), round(opt_s.param_groups[0]['lr'], 6), round(opt_t.param_groups[0]['lr'], 6)))

        # Update scheduler
        scheduler_features.step()
        scheduler_source.step()
        scheduler_target.step()

    # No per-epoch metrics history is collected here, so visualize_results is not called

    return best_accuracy

#### $P \rightarrow RW$

In [None]:
feature_model = FeatureModel(resnet_version='resnet18')
model_s = Predictor(input_size=feature_model.output_size)
model_t = Predictor(input_size=feature_model.output_size)

opt_f = get_optimizer(feature_model, optimizer='sgd', lr=1e-2)
opt_s = get_optimizer(model_s, optimizer='sgd', lr=1e-1)
opt_t = get_optimizer(model_t, optimizer='sgd', lr=1e-1)

scheduler_f = get_scheduler(opt_f)
scheduler_s = get_scheduler(opt_s)
scheduler_t = get_scheduler(opt_t)

# Initialize wandb
experiment_name = 'Improvment_Loss_P_RW'

wandb.init(
    # Set the project where this run will be logged
    project=DEFAULT_WANDB_PROJECT_NAME, 
    # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
    name=experiment_name, 
    # Track hyperparameters and run metadata
    config={
        "feature_model": feature_model.__class__.__name__,
        "source_model": model_s.__class__.__name__,
        "target_model": model_t.__class__.__name__,
        "features_optimizer": opt_f.__class__.__name__,
        "source_optimizer": opt_s.__class__.__name__,
        "target_optimizer": opt_t.__class__.__name__,
        "features_scheduler": scheduler_f.__class__.__name__,
        "source_scheduler": scheduler_s.__class__.__name__,
        "target_scheduler": scheduler_t.__class__.__name__,
        "criterion": "2-Stage Loss",
        "training_type": "UDA P-RW Improved Loss",
        "training_dataset": "Products Supervised + RealWorld Unsupervised",
        "evaluation_dataset": "RealWorld",
        "learning_rate_feature_model": opt_f.param_groups[0]['lr'],
        "learning_rate_source_model": opt_s.param_groups[0]['lr'],
        "learning_rate_target_model": opt_t.param_groups[0]['lr'],
        "batch_size": DEFAULT_BATCH_SIZE,
        "epochs": DEFAULT_EPOCHS,
    }
)

accuracy_improvement_loss_p_rw = train_improved(
    feature_model=feature_model,
    classifiers=(model_s, model_t),
    optimizers=(opt_f, opt_s, opt_t),
    schedulers=(scheduler_f, scheduler_s, scheduler_t),
    train_datasets=(train_loader_products, train_loader_reals),
    eval_dataset=test_loader_reals,
    n_epochs=DEFAULT_EPOCHS,
    k=DEFAULT_K_STEPS,
    save_name=experiment_name
)

# Close wandb
wandb.finish()

# Delete variables
del feature_model, model_s, model_t, opt_f, opt_s, opt_t, scheduler_f, scheduler_s, scheduler_t, experiment_name

# Empty cache
if DEFAULT_DEVICE.type == 'cuda':
    torch.cuda.empty_cache()

# Print results
print("Accuracy UDA P-RW: {:.2f}%".format(accuracy_improvement_loss_p_rw*100))
print("Gain UDA P-RW: {:.2f}%".format((accuracy_improvement_loss_p_rw - accuracy_baseline_p_rw)*100))

#### $RW \rightarrow P$

In [None]:
feature_model = FeatureModel(resnet_version='resnet18')
model_s = Predictor(input_size=feature_model.output_size)
model_t = Predictor(input_size=feature_model.output_size)

opt_f = get_optimizer(feature_model, optimizer='sgd', lr=1e-2)
opt_s = get_optimizer(model_s, optimizer='sgd', lr=1e-1)
opt_t = get_optimizer(model_t, optimizer='sgd', lr=1e-1)

scheduler_f = get_scheduler(opt_f)
scheduler_s = get_scheduler(opt_s)
scheduler_t = get_scheduler(opt_t)

# Initialize wandb
experiment_name = 'Improvment_Loss_RW_P'

wandb.init(
    # Set the project where this run will be logged
    project=DEFAULT_WANDB_PROJECT_NAME, 
    # We pass a run name (otherwise it’ll be randomly assigned, like sunshine-lollypop-10)
    name=experiment_name, 
    # Track hyperparameters and run metadata
    config={
        "feature_model": feature_model.__class__.__name__,
        "source_model": model_s.__class__.__name__,
        "target_model": model_t.__class__.__name__,
        "features_optimizer": opt_f.__class__.__name__,
        "source_optimizer": opt_s.__class__.__name__,
        "target_optimizer": opt_t.__class__.__name__,
        "features_scheduler": scheduler_f.__class__.__name__,
        "source_scheduler": scheduler_s.__class__.__name__,
        "target_scheduler": scheduler_t.__class__.__name__,
        "criterion": "2-Stage Loss",
        "training_type": "UDA RW-P Improved Loss",
        "training_dataset": "RealWorld Supervised + Products Unsupervised",
        "evaluation_dataset": "Products",
        "learning_rate_feature_model": opt_f.param_groups[0]['lr'],
        "learning_rate_source_model": opt_s.param_groups[0]['lr'],
        "learning_rate_target_model": opt_t.param_groups[0]['lr'],
        "batch_size": DEFAULT_BATCH_SIZE,
        "epochs": DEFAULT_EPOCHS,
    }
)

accuracy_improvement_loss_rw_p = train_mcd(
    feature_model=feature_model,
    classifiers=(model_s, model_t),
    optimizers=(opt_f, opt_s, opt_t),
    schedulers=(scheduler_f, scheduler_s, scheduler_t),
    train_datasets=(train_loader_reals, train_loader_products),
    eval_dataset=test_loader_products,
    n_epochs=DEFAULT_EPOCHS,
    k=DEFAULT_K_STEPS,
    save_name=experiment_name
)

# Close wandb
wandb.finish()

# Delete variables
del feature_model, model_s, model_t, opt_f, opt_s, opt_t, scheduler_f, scheduler_s, scheduler_t, experiment_name

# Empty cache
if DEFAULT_DEVICE.type == 'cuda':
    torch.cuda.empty_cache()

# Print results
# Assumes accuracy_baseline_rw_p was computed in the RW -> P baseline section
print("Accuracy UDA RW-P: {:.2f}%".format(accuracy_improvement_loss_rw_p*100))
print("Gain UDA RW-P: {:.2f}%".format((accuracy_improvement_loss_rw_p - accuracy_baseline_rw_p)*100))

#### Discussion

Unfortunately, in both directions the results were worse than we expected. The accuracy does not improve over the baseline; it is in fact much worse. It starts at around 5% and stays there for the whole training, while the loss does not decrease.

We think a crucial factor is that we were not able to implement the second stage of the loss function. The authors propose to use *k*-means clustering to generate pseudo-labels for the target domain; we did not use it since we cannot use labels for the target domain.
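
For reference, the snippet below is a minimal sketch of what a *k*-means pseudo-labeling step could look like. It is not the procedure of the CDA paper and not part of our training code. It assumes that features have already been extracted as NumPy arrays, and the function name `kmeans_pseudo_labels`, its parameters, and the choice to initialize the centroids from the source class means are all our own illustrative assumptions. The function only reads target features, never target labels.

```python
import numpy as np


def kmeans_pseudo_labels(source_feats, source_labels, target_feats, n_classes, n_iters=10):
    """Cluster target features and return one pseudo-label per target sample."""
    # Assumption: centroids start from the per-class mean of the labeled source
    # features, so that cluster index c corresponds to class c.
    # Every class must appear at least once in source_labels.
    centroids = np.stack(
        [source_feats[source_labels == c].mean(axis=0) for c in range(n_classes)]
    )

    for _ in range(n_iters):
        # Distance of every target feature to every centroid: (n_target, n_classes)
        dists = np.linalg.norm(target_feats[:, None, :] - centroids[None, :, :], axis=2)
        pseudo_labels = dists.argmin(axis=1)

        # Move each centroid to the mean of the target features assigned to it
        for c in range(n_classes):
            members = target_feats[pseudo_labels == c]
            if len(members) > 0:  # keep the old centroid if the cluster is empty
                centroids[c] = members.mean(axis=0)

    # Final assignment with the updated centroids
    dists = np.linalg.norm(target_feats[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


# Toy usage with 2-D "features" (made-up numbers, only to show the call)
source_feats = np.array([[0.0, 0.0], [0.2, 0.1], [10.0, 10.0], [10.1, 9.9]])
source_labels = np.array([0, 0, 1, 1])
target_feats = np.array([[1.0, 1.0], [1.2, 0.8], [11.0, 11.0], [10.8, 11.2]])
print(kmeans_pseudo_labels(source_feats, source_labels, target_feats, n_classes=2))
```

In a real pipeline the features would come from the feature extractor, and the resulting pseudo-labels would feed the second stage of the loss.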


## Final discussion

In this project we have implemented the MCD method and we have tried to improve it. 

We started from a baseline that was already fairly good (80.7% for the $P \rightarrow RW$ direction and 92.2% for the $RW \rightarrow P$ direction). There was still room for improvement, since the upper bounds were 9.8 and 5.1 percentage points above the baseline, respectively.

With the MCD method we obtained good accuracy: 88.0% in the $P \rightarrow RW$ direction, with some margin left for further improvement (2.5 percentage points below the upper bound), and 96.5% in the $RW \rightarrow P$ direction, with very little margin left (0.8 percentage points).

We tried to improve the MCD method first with an idea of our own and then by replacing the loss function with the one proposed in [CDA: Contrastive-adversarial Domain Adaptation](https://arxiv.org/abs/2301.03826). The first improvement gave a better accuracy in the $P \rightarrow RW$ direction, while the second gave no improvement and actually made the results worse.

The overall results are reported in the table below:

<center>

| Direction | Stage | Accuracy |
| :---: | :---: | :---: |
| $P \rightarrow RW$ | Baseline | 80.7% |
| $P \rightarrow RW$ | MCD | 88.0% |
| $P \rightarrow RW$ | Improvement 1 | 89.0% |
| $P \rightarrow RW$ | Upper bound | 90.5% |
| | | |
| $RW \rightarrow P$ | Baseline | 92.2% |
| $RW \rightarrow P$ | MCD | 96.5% |
| $RW \rightarrow P$ | Improvement 1 | 95.8% |
| $RW \rightarrow P$ | Upper bound | 97.3% |

</center>
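
The gain over the baseline and the remaining margin to the upper bound can be recomputed from the table above. The snippet below is a small sketch that does so; the accuracy values are copied from the table, while the dictionary layout and the helper name `summarize` are illustrative choices of ours.

```python
# Accuracies (in %) copied from the results table
results = {
    "P->RW": {"Baseline": 80.7, "MCD": 88.0, "Improvement 1": 89.0, "Upper bound": 90.5},
    "RW->P": {"Baseline": 92.2, "MCD": 96.5, "Improvement 1": 95.8, "Upper bound": 97.3},
}


def summarize(stages):
    """Return {stage: (gain over baseline, margin to upper bound)} in percentage points."""
    baseline = stages["Baseline"]
    upper = stages["Upper bound"]
    return {
        name: (round(acc - baseline, 1), round(upper - acc, 1))
        for name, acc in stages.items()
        if name not in ("Baseline", "Upper bound")
    }


for direction, stages in results.items():
    for name, (gain, margin) in summarize(stages).items():
        print(f"{direction} {name}: gain {gain:+.1f} pp, margin to upper bound {margin:.1f} pp")
```

Differences are reported in percentage points, so they can be read directly against the table.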

### Conclusions

In conclusion, we have implemented the MCD method and tried to improve it. We obtained very good results in both the $P \rightarrow RW$ and $RW \rightarrow P$ directions. The proposed idea improved the result only in the first direction (89.0% vs. 88.0%); in the second it reached a similar, slightly lower accuracy (95.8% vs. 96.5%) and brought no improvement.

Finally, we can say that MCD already works very well; the proposed idea is a plus, especially for the more difficult cases.