## <font color=green> Domain Adaptation using Domain Adversarial Neural Networks </font>

### <font color=blue> Goal of this tutorial: </font>
- Know the background of Domain Adaptation and Domain Adversarial Neural Networks (DANN)
- Implement DANN for sentiment domain adaptation using PyTorch

###  <font color=blue> General: </font>
- This notebook was last tested on Python 3.6.4, PyTorch 0.4.0, scikit-learn 0.21.3
- We would like to acknowledge the DANN repository from fungtion (https://github.com/fungtion/DANN) which we used as a reference to code up the DANN model.

## <font color=green> Background </font>

### <font color=blue> Problem: </font>
Domain adapation is the problem of learning a machine learning classifier (say) in a domain with no labeled examples. The straightforward approach to build a classifier in this setting is to manually annotate some instances in the domain. Unfortunately, annotation work is either time consuming or expensive or both in most cases. 

### <font color=blue> How about leveraging labeled examples from a related domain?  </font>
That's a good start. We can maybe train our classifier on the labeled domain (let's call it source domain) and evaluate on the unlabled domain of interest (let's call it target domain). This approach mostly likely does not work as there is a shift in the distribution of data in the training set from the test set. In machine learning terms, we say that the approach breaks the i.i.d (independently sampled and identically distributed) assumption of machine learning models that the distribution of the training data and the test data should be identical.

### <font color=blue> What would be a good application of domain adaptation? </font>
Let's say we want to build a binary (+ve/-ve) sentiment classifier in Automobile domain (target domain). Although we generally have a lot of unlabeled reviews in Automobile domain, we don't have any labeled examples from this domain to work with. Maybe we do have access to labeled sentiment reviews (+ve/-ve) from a related domain, Books (source domain). So the problem becomes: given labeled examples from a source domain (Books) and unlabeled examples from a target domain (Automobile), how do we go about building a sentiment classifier for the target domain (Automobile) which is reasonable? 

### <font color=blue> What's a reasonable model for domain adaptation? </font>
A model which achieves good accuracy is generally reasonable. So we test the predictions of our domain adaption model on few labeled examples on the target domain (Automobile) to measure the performance. Note that the labeled examples in the target domain should be treated as the test set and hence we should always follow the golden rule of machine learning (i.e., never use them to influence our model in anyway).

### <font color=blue> Domain Adversarial Neural Networks: </font>
<img src="images/dann_architecture.jpeg" alt="DANN Architecture" title="DANN - Architecture" />
Domain Adversarial Neural Networks tries to solve this problem by training a model that simulatenously does two things: i) learns representation that makes examples from source and target domain appear similarly (domain classification) and ii) simultaneously optimizes the representation for achieving minimal error in classifying labeled samples from the source domain (source classification). We can imagine a 1 hidden layer neural network which 
1. takes the example as input (e.g., tf-idf vector for a review)
2. passes it to affine transformation followed by non-linearity (e.g, sigmoid) to compute hidden representation
3. passes the hidden representation to a classifier that predicts the sentiment label of the source example WELL. (minimizing the source classification error)
4. passes the same hidden representation to another classifier that predicts the domain of the input example BADLY. (maximizing the domain classification error)
- The objective in iii) ensures the internal representation of neural network contains no discriminative information about the domain of the input. The other objective in iv) preserves low error rate on classification of source samples.  And the intuition of DANN is to make the source accuracy correspond to target accuracy when the source and target domain distributions are made similar.

### <font color=blue> References: </font>
To know more about DANN, take a look at the following articles:
1. Domain-Adversarial Neural Networks https://arxiv.org/pdf/1412.4446.pdf
2. Unsupervised Domain Adaptation by Backpropagation http://sites.skoltech.ru/compvision/projects/grl/files/paper.pdf


Let's do all the imports.

In [10]:
'''
One place for all the imports
'''
import os
import math
import numpy as np
from scipy import sparse
from collections import Counter
import random
from tqdm import trange

import torch
import torch.nn as nn
import torch.optim as optim
import torch.backends.cudnn as cudnn
from torch.autograd import Function

# set the seed
manual_seed = 123
random.seed(manual_seed)
np.random.seed(manual_seed)
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
  torch.cuda.manual_seed(manual_seed)

Prepare the Amazon Movie Reviews Dataset
- We'll try to stick to the specs. mentioned in https://arxiv.org/pdf/1412.4446.pdf
- We'll focus on adapation from Books to DVD domain.
- We'll use the preprocessed dataset from http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
- Download link: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/processed_acl.tar.gz

In [5]:
!wget http://www.cs.jhu.edu/~mdredze/datasets/sentiment/processed_acl.tar.gz
!tar -xvf processed_acl.tar.gz

--2021-04-03 03:30:24--  http://www.cs.jhu.edu/~mdredze/datasets/sentiment/processed_acl.tar.gz
Resolving www.cs.jhu.edu (www.cs.jhu.edu)... 128.220.13.64
Connecting to www.cs.jhu.edu (www.cs.jhu.edu)|128.220.13.64|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19633323 (19M) [application/x-gzip]
Saving to: ‘processed_acl.tar.gz.1’


2021-04-03 03:30:28 (8.08 MB/s) - ‘processed_acl.tar.gz.1’ saved [19633323/19633323]

processed_acl/
processed_acl/dvd/
processed_acl/dvd/negative.review
processed_acl/dvd/unlabeled.review
processed_acl/dvd/positive.review
processed_acl/books/
processed_acl/books/negative.review
processed_acl/books/unlabeled.review
processed_acl/books/positive.review
processed_acl/kitchen/
processed_acl/kitchen/negative.review
processed_acl/kitchen/unlabeled.review
processed_acl/kitchen/positive.review
processed_acl/electronics/
processed_acl/electronics/negative.review
processed_acl/electronics/unlabeled.review
processed_acl/electronics/positive.

In [11]:
# location of the data and size of the vocabulary
DATA_DIR = "processed_acl"
VOCAB_SIZE = 5000

# creates the vocab from the preprocessed features
def create_vocab(documents):
  # count all the tokens in both the files (document frequency)
  vocab_count = Counter()
  for doc in documents:
    doc = doc.strip()
    tokens = [token.split(":")[0] for token in doc.split()[0:-1]] # last token is the label so we ignore it
    for token in set(tokens):
      vocab_count[token] += 1
  # create the token to id and id to token mappings
  t2i, i2t = {}, {}
  for token, _ in vocab_count.most_common()[:VOCAB_SIZE]:
    t2i[token] = len(i2t)
    i2t[t2i[token]] = token
  print("created vocab. of size %d"%len(t2i))
  print("top 10 tokens ...")
  print(list(t2i)[0:10])
  return t2i, i2t, vocab_count

# create the sparse tf-idf representation from the reviews
def tfidf_docs(documents, t2i, i2t, vocab_count):
  coords_1, coords_2, values, y = [], [], [], []
  row_id = 0
  n, d = len(documents), len(t2i)
  for doc in documents:
    items = doc.split()
    for item in items[0:-1]:
      token, freq = item.split(":")
      if token in t2i:
        col_id = t2i[token]
        # we will use the weighing scheme 2 from the recommended options 
        # ref: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
        tf_score = 1.0 + math.log(float(freq))
        idf_score = math.log(1.0 + float(n)/vocab_count[token])
        coords_1.append(row_id)
        coords_2.append(col_id)
        values.append(tf_score*idf_score)
    label = 1 if items[-1].split(":")[1] == "positive" else 0
    y.append(label)
    row_id = row_id + 1
  X = sparse.coo_matrix((values, (coords_1, coords_2)), shape=(n, d))
  y = np.array(y)
  print("shape of inputs = %d, %d"%(X.shape[0], X.shape[1]))
  print("number of non-zero entries = %d"%(X.count_nonzero()))
  return X, y
    
# reads the labeled documents to X, y
def read_labeled_dir(domain, vocab=None):
  print("processing the labeled %s domain"%domain)
  
  # file paths
  positive_f = os.path.join(DATA_DIR, domain, 'positive.review')
  negative_f = os.path.join(DATA_DIR, domain, 'negative.review')
    
  # load both files to memory
  positive_documents = [line.strip() for line in open(positive_f)]
  negative_documents = [line.strip() for line in open(negative_f)]
  total_documents = positive_documents + negative_documents
  random.shuffle(total_documents)
  
  if not vocab:
    # read the vocab
    t2i, i2t, vocab_count = create_vocab(total_documents)
  else:
    t2i, i2t, vocab_count = vocab['t2i'], vocab['i2t'], vocab['vocab_count']

  # create the tf-idf representation for all the documents
  X, y = tfidf_docs(total_documents, t2i, i2t, vocab_count)
  
  return {'inputs': X, 'labels': y}, {'t2i': t2i, 'i2t': i2t, 'vocab_count': vocab_count}

# read the unlabeled documents to X
def read_unlabeled_file(domain, vocab):
  print("processing the unlabeled %s domain"%domain)
  
  # file paths
  unlab_f = os.path.join(DATA_DIR, domain, 'unlabeled.review')
  
  # load the content to memory
  unlab_documents = [line.strip() for line in open(unlab_f)]
  random.shuffle(unlab_documents)
  unlab_documents = unlab_documents[0:2000]
  
  t2i, i2t, vocab_count = vocab['t2i'], vocab['i2t'], vocab['vocab_count']
  
  # create the tf-idf representation for all the unlabeled documents
  X, _ = tfidf_docs(unlab_documents, t2i, i2t, vocab_count)
  
  return {'inputs': X}

# read the labeled data from source domain
books_labeled_data, books_vocab = read_labeled_dir("books")

# read the labeled data from target domain
dvd_labeled_data, _ = read_labeled_dir("dvd", vocab=books_vocab)

# read the unlabeled data from target domain
dvd_unlabeled_data = read_unlabeled_file("dvd", books_vocab)

processing the labeled books domain
created vocab. of size 5000
top 10 tokens ...
['book', 'i', 'this_book', 'not', 'was', 'you', 'read', 'one', 'all', 'about']
shape of inputs = 2000, 5000
number of non-zero entries = 198264
processing the labeled dvd domain
shape of inputs = 2000, 5000
number of non-zero entries = 175503
processing the unlabeled dvd domain
shape of inputs = 2000, 5000
number of non-zero entries = 174850


#### Create the PyTorch modules

In [13]:
# gradient reversal layer
class ReverseLayerF(Function):
  @staticmethod
  def forward(ctx, x, lmbda):
    ctx.lmbda = lmbda
    return x.view_as(x)

  @staticmethod
  def backward(ctx, grad_output):
    output = grad_output.neg() * ctx.lmbda
    return output, None

# DANN layers
class DANN(nn.Module):
  def __init__(self, d, h, lmbda):
    super(DANN, self).__init__()
    self.lmbda = lmbda
    self.input_to_hidden = nn.Linear(d, h)
    self.class_classifier = nn.Linear(h, 2) # classes: positive vs negative
    self.domain_classifier = nn.Linear(h, 2) # classes: source vs target
  
  def forward(self, input_data):
    hidden_rep = self.input_to_hidden(input_data)
    class_output = self.class_classifier(hidden_rep)
    reverse_feature = ReverseLayerF.apply(hidden_rep, self.lmbda)
    domain_output = self.domain_classifier(reverse_feature)
    return class_output, domain_output

#### Create the model instance, optimizer, loss and so on

In [14]:
# hyperparameters for training the model
ALPHA = 0.001 # learning rate
HIDDEN_SIZE = 200 # search space [1, 5, 12, 25, 50, 75, 100, 150, 200]
BATCH_SIZE = 50
EPOCHS = 5
LAMBDA = 0.1 # search space among 9 values between 10^{−2} and 1 on a logarithmic scale

# create the model instance
n, d = books_labeled_data["inputs"].shape
model = DANN(d, HIDDEN_SIZE, LAMBDA)
model.to(device)

# setup optimizer
optimizer = optim.SGD(model.parameters(), lr=ALPHA)

# setup both the loss
loss_class = torch.nn.NLLLoss()
loss_domain = torch.nn.NLLLoss()

# ensure all the parameters of the model are learnable
for p in model.parameters():
  p.requires_grad = True

#### Train the model on Books reviews

In [17]:
# collect the training data
X_src = torch.from_numpy(books_labeled_data["inputs"].toarray()).float()
y_src_class = torch.from_numpy(books_labeled_data["labels"])
X_targ = torch.from_numpy(dvd_unlabeled_data["inputs"].toarray()).float()

# placeholders for holding the current batch
cur_X_src = torch.cuda.FloatTensor(BATCH_SIZE, d, device=device)
cur_y_src_class = torch.zeros(BATCH_SIZE, dtype=torch.long, device=device)
cur_y_src_domain = torch.zeros(BATCH_SIZE, dtype=torch.long, device=device)
cur_X_targ = torch.cuda.FloatTensor(BATCH_SIZE, d, device=device)
cur_y_targ_domain = torch.ones(BATCH_SIZE, dtype=torch.long, device=device)

# start training
print('training...')
model.train()
num_batches = n // BATCH_SIZE
for epoch in trange(EPOCHS):
  rand_idx = np.random.permutation(n)
  for bi in range(num_batches):
    # prepare batch
    for sample_i in range(BATCH_SIZE):
      cur_idx = rand_idx[BATCH_SIZE*bi + sample_i]
      cur_X_src[sample_i] = X_src[cur_idx]
      cur_y_src_class[sample_i] = y_src_class[cur_idx]
      cur_X_targ[sample_i] = X_targ[cur_idx]
    # train the model using this batch
    model.zero_grad() # clears the gradient buffer
    # source side losses
    class_output, domain_output = model(cur_X_src)
    error_src_class = loss_class(class_output, cur_y_src_class)
    error_src_domain = loss_domain(domain_output, cur_y_src_domain)
    # target side losses
    _, domain_output = model(cur_X_targ)
    error_src_domain = loss_domain(domain_output, cur_y_targ_domain)
    # total losses
    error_this_batch = error_src_class + error_src_domain + error_src_domain
    # backward prop.
    error_this_batch.backward()
    optimizer.step()

  0%|          | 0/5 [00:00<?, ?it/s]

training...


100%|██████████| 5/5 [00:00<00:00,  5.33it/s]


#### Evaluate the model on DVD reviews

In [None]:
# collect the evaluation data
X_targ = torch.from_numpy(dvd_labeled_data["inputs"].toarray()).float()
y_targ_class = torch.from_numpy(dvd_labeled_data["labels"])

# placeholders for holding the current batch
cur_X_targ = torch.FloatTensor(BATCH_SIZE, d, device=device)
cur_y_targ_class = torch.zeros(BATCH_SIZE, dtype=torch.long, device=device)

model.eval()
num_test_instances = X_targ.shape[0]
num_batches = num_test_instances // BATCH_SIZE
errors = 0.0
with torch.no_grad():
  for bi in range(num_batches):
    # prepare batch
    for sample_i in range(BATCH_SIZE):
      cur_idx = BATCH_SIZE*bi + sample_i
      cur_X_targ[sample_i] = X_targ[cur_idx]
      cur_y_targ_class[sample_i] = y_targ_class[cur_idx]
    pred_class_output, _ = model(cur_X_targ)
    # update errors
    for sample_i in range(BATCH_SIZE):
      cur_label = 0 if pred_class_output[sample_i][0] > pred_class_output[sample_i][1] else 1
      if cur_y_targ_class[sample_i] != cur_label:
        errors += 1.0
print("evaluation error in target labeled samples = %.3f"%(errors/num_test_instances))

evaluation error in target labeled samples = 0.399


That's it!