# Guide to Transformers Domain Adaptation
This guide illustrates an end-to-end workflow of domain adaptation, where we domain-adapt a transformer model for biomedical NLP applications.

It showcases the two domain adaptation techniques we investigated in our research:
1. Data Selection
2. Vocabulary Augmentation

Following that, we demonstrate how such a domain-adapted Transformers model is compatible with 🤗 `transformers`'s training interface and how it outperforms an out-of-the-box (non-domain adapted) model.

These techniques are applied to BERT (`bert-base-uncased` in this guide), but the codebase is written to generalize to other classes of Transformers supported by HuggingFace.

### Caveats
For this guide, we use a much smaller subset (<0.05%) of the in-domain corpora due to memory and time constraints.


In [6]:
! pip install -e ../../adaptation-metrics


Obtaining file:///home/dboumber/work/git/adaptation-metrics
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: adaptation-metrics
  Building editable for adaptation-metrics (pyproject.toml) ... done
  Created wheel for adaptation-metrics: filename=adaptation_metrics-0.4.1-py3-none-any.whl size=2132 sha256=8bd6fe5959c0cf33542c68144f0e6837daeda7658c3e2156fad2cd5bbd05b066
  Stored in directory: /tmp/pip-ephem-wheel-cache-2hye0p44/wheels/2f/3b/69/35e22713bc99f0ab155794a707a69c43422463b0993bdc761a
Successfully built adaptation-metrics
Installing collected packages: adaptation-metrics
  Attempting uninstall: adaptation-metrics
    Found existing installation: adaptation-metrics 0.4.1
    Uninstalling adaptation-metrics-0.4.1:
      Successfully 

## Constants
We first define some constants, including the appropriate model card and relevant paths to text corpora.

There are two types of corpora in the context of Domain Adaptation:

1. Fine-Tuning Corpus
> Given an NLP task (e.g. text classification, summarization, etc.), the text portion of this dataset is the fine-tuning corpus.

2. In-Domain Corpus
> This is an unsupervised text dataset that is used for domain pre-training. The text domain is the same as, if not broader than, the domain of fine-tuning corpus.

In [2]:

from huggingface_hub import notebook_login


notebook_login()




# Fine-tuning corpus


In [1]:
from datasets import load_dataset

model_card = 'bert-base-uncased'
subdomains = {
    'fake_news':load_dataset('redasers/difraud', 'fake_news'),
    'phishing':load_dataset('redasers/difraud', 'phishing'),
    'job_scams':load_dataset('redasers/difraud', 'job_scams'),
    'political_statements':load_dataset('redasers/difraud', 'political_statements'),
    'product_reviews':load_dataset('redasers/difraud', 'product_reviews'),
    'sms':load_dataset('redasers/difraud', 'sms'),
    'twitter_rumours':load_dataset('redasers/difraud', 'twitter_rumours'),
    }



In [2]:
from tqdm.notebook import tqdm


DOMAIN = "twitter_rumours"

domain_data = {}
domain_labels = {}
source_domains = {}

for target in tqdm(subdomains):
    # Text and labels of the target domain, plus pooled lists for all other domains
    domain_data[target] = {"target": subdomains[target]["train"]["text"], "sources": []}
    domain_labels[target] = {"target": subdomains[target]["train"]["label"], "sources": []}
    source_domains[target] = {"sources": []}

    for source in subdomains:
        if source == target:
            continue
        # Per-source text and labels (labels come from the source domain, not the target)
        domain_data[target][source] = subdomains[source]["train"]["text"]
        domain_labels[target][source] = subdomains[source]["train"]["label"]

        # Pool every non-target domain into a single "sources" corpus
        domain_data[target]["sources"].extend(subdomains[source]["train"]["text"])
        domain_labels[target]["sources"].extend(subdomains[source]["train"]["label"])
    

   
print(domain_data[DOMAIN]["target"][0:10])


X_target = domain_data[DOMAIN]["target"]
X_sources = domain_data[DOMAIN]["sources"]
y_target = domain_labels[DOMAIN]["target"] 
y_sources = domain_labels[DOMAIN]["sources"]

print(len(X_target), len(X_sources))

  0%|          | 0/7 [00:00<?, ?it/s]

["I don't know how this is going to end. But man, it doesn't feel good right now. #ferguson http://t.co/OKwtolhshi", 'Passenger plane carrying 148 people crashes in the French Alps en route from Spain to Germany. #Germanwings', 'Photos of the scene unfolding after multiple shootings in Ottawa this morning. http://t.co/xq8Ihiuf5X http://t.co/wS4PkA5ddg', 'ISIS FLAG VISIBLE AS GUNMAN SEIZES SYDNEY CAFE, HOLDS HOSTAGES', '→ http://t.co/6W6HpstrfA #planecrash plane crash Drexel University graduate, mom among Germanwings plane crash victi… http://t.co/htTp0ALZye', 'BREAKING: @ctvottawa CONFIRMS: One shooter dead. Police working under assumption more than one shooter. 3 shooting incidents #ottnews', "'Stupidity will not win', says surviving Charlie Hebdo journalist who confirms to @AFP that newspaper will come out next week.", '#BREAKING One woman shot in drive-by shooting on 1300 Highmont near the QT. Cannot confirm if the victim survived. #ferguson #mikebrown', 'FOX NEWS ALERT: @GregPalkot

In [3]:
import pandas as pd

df_src = pd.DataFrame({'text': X_sources, 'label': y_sources})
df_tgt = pd.DataFrame({'text': X_target, 'label': y_target})
df_dev = df_src.sample(frac=0.05)
df_src = df_src.drop(df_dev.index)
df_dev = df_dev.reset_index(drop=True).sample(frac=1)
df_src = df_src.reset_index(drop=True).sample(frac=1)
df_tgt = df_tgt.reset_index(drop=True).sample(frac=1)


In [4]:
print(len(df_src), len(df_tgt), len(df_dev))

68447 4631 3602


### Load model and tokenizer
Next we load the model and its corresponding tokenizer.

In [5]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained(model_card)
tokenizer = AutoTokenizer.from_pretrained(model_card, use_fast=True)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [8]:
from adaptation_metrics import DataSelector


similarity_metrics = ["jensen-shannon", "cosine",]
    #"jensen-shannon", 
    #"renyi",
    #"cosine", 
    #"euclidean", 
    #"variational", 
    #"bhattacharyya",]

diversity_metrics = [ "type_token_ratio", "entropy",]
    #"num_token_types",
    #"type_token_ratio",
    #"entropy",
    #"simpsons_index",
    #"renyi_entropy",

selector = DataSelector(keep=1000,
                        tokenizer=tokenizer, 
                        diversity_metrics=diversity_metrics, 
                        similarity_metrics=similarity_metrics)
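
As a rough illustration of what the two diversity metrics selected above measure, the sketch below computes a type-token ratio and a Shannon entropy for a document. This is not the `DataSelector` implementation: the function names, the toy documents, and the whitespace tokenization are assumptions made for illustration only.

```python
import math
from collections import Counter


def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def shannon_entropy(tokens):
    """Shannon entropy (in nats) of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical toy documents, tokenized on whitespace for simplicity
repetitive = "breaking breaking breaking news news news".split()
varied = "police confirm one shooter dead in ottawa".split()

ttr_repetitive = type_token_ratio(repetitive)
ttr_varied = type_token_ratio(varied)
entropy_repetitive = shannon_entropy(repetitive)
entropy_varied = shannon_entropy(varied)
```

Both metrics are higher for the varied document than for the repetitive one, which is the sense in which they capture lexical diversity.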




In [9]:
df_tgt["text"].values.tolist()

["Here's a transcript of the Prime Minister's press conference on #SydneySiege #MartinPlaceSiege http://t.co/UjX25PbCg8",
 "Makes me sad what's going on in Ottawa right now. Thoughts and prays to everyone involved. Everybody stay safe",
 'RT @scottbix: Incredible shot: A woman gives mouth-to-mouth to a fallen soldier at the War Memorial http://t.co/eTlJyKBRy8 #hillshooting',
 "Shooting up public places than claiming insanity, stealing other people's land, mass genocides RT @PoeticGenius19: What's white culture?",
 'Snipers set up on National Art Gallery as we remain barricaded in Centre Block on Parliament Hill #cdnpoli. http://t.co/lWKaxLI9jO',
 'These 3 words say it all. #ferguson #MikeBrown http://t.co/Wuzy0YVEUg',
 'What we know about low-cost airline Germanwings whose plane crashed in the French Alps today http://t.co/dpVjeGkEGT http://t.co/0HkXddItLs',
 'even after #CharlieHebdo massacre,journalists like @sagarikaGhose (in todays #HindustanTimes) wl continue t defend #islam & run

In [10]:
selector.fit(df_tgt["text"].values.tolist())

Token indices sequence length is longer than the specified maximum sequence length for this model (167104 > 512). Running this sequence through the model will result in indexing errors


In [12]:
results = selector.transform(df_dev["text"].tolist())

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

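`DataSelector` was instantiated above with `keep=1000`. Assuming an integer `keep` is interpreted as an absolute number of documents, the selected corpus should contain the 1,000 most relevant documents; a fractional value such as `keep=0.5` would instead keep the top 50% of the in-domain corpus.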
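
To make the selection step more concrete, here is a minimal sketch that ranks candidate documents by their Jensen-Shannon divergence to the target corpus and keeps the closest ones. It is not the `DataSelector` implementation: the helper names, the whitespace tokenization, and the handling of `keep` (an integer as a count, a float as a proportion) are assumptions for illustration.

```python
import math
from collections import Counter


def term_distribution(tokens, vocab):
    """Relative frequency of each vocabulary term in `tokens`."""
    counts = Counter(tokens)
    total = sum(counts[t] for t in vocab)
    return [counts[t] / total if total else 0.0 for t in vocab]


def kl_divergence(p, q):
    """KL(p || q), skipping terms where p is zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def select_documents(target_docs, candidate_docs, keep):
    """Keep the candidates whose term distribution is closest to the target corpus."""
    tokenized = [doc.lower().split() for doc in candidate_docs]
    target_tokens = [tok for doc in target_docs for tok in doc.lower().split()]
    vocab = sorted(set(target_tokens).union(*tokenized))
    target_dist = term_distribution(target_tokens, vocab)

    # Lower divergence means the candidate looks more like the target corpus
    scores = [jensen_shannon(term_distribution(toks, vocab), target_dist) for toks in tokenized]
    # Assumption: int -> absolute count, float -> proportion of the candidates
    n_keep = keep if isinstance(keep, int) else int(len(candidate_docs) * keep)
    ranked = sorted(range(len(candidate_docs)), key=lambda i: scores[i])
    return [candidate_docs[i] for i in ranked[:n_keep]]


# Hypothetical toy corpora
target = ["plane crash in the french alps", "germanwings plane crash victims"]
candidates = [
    "win a free prize now",
    "plane crash reported in the alps",
    "cheap loans apply today",
    "germanwings crash victims named",
]
selected = select_documents(target, candidates, keep=2)
```

In this toy example the two candidates that share vocabulary with the target corpus are selected, while the two unrelated candidates receive the maximum divergence and are dropped.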

## Vocabulary Augmentation
We can extend the existing vocabulary of the model to include domain-specific terminology. This allows the representations of such terminology to be learnt explicitly during domain pre-training.

In [None]:
len(results)  # `results` is the selected corpus returned by `selector.transform` above

In [None]:
from adaptation_metrics import VocabAugmentor

target_vocab_size = 31_000  # len(tokenizer) == 30_522

augmentor = VocabAugmentor(
    tokenizer=tokenizer,
    cased=False,
    target_vocab_size=target_vocab_size
)
# Obtain new domain-specific terminology based on the fine-tuning corpus
new_tokens = augmentor.get_new_tokens()


In [None]:
print(new_tokens[:20])

#### Update model and tokenizer with new vocab terminologies

In [None]:
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
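
Conceptually, updating the tokenizer and resizing the embeddings amounts to assigning new ids after the existing vocabulary and appending matching rows to the embedding matrix. The sketch below illustrates this with plain Python lists. It is not how `transformers` implements it: `resize_embeddings`, the toy vocabulary, and the random initialization of the new rows are assumptions for illustration.

```python
import random


def resize_embeddings(embeddings, new_vocab_size, seed=0):
    """Return a copy of `embeddings` grown to `new_vocab_size` rows.

    Existing rows are kept unchanged; the added rows are filled with small
    random values (an assumed initialization, chosen for illustration).
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    resized = [row[:] for row in embeddings]
    for _ in range(new_vocab_size - len(embeddings)):
        resized.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return resized


# Toy vocabulary and 4-dimensional embeddings (hypothetical values)
vocab = {"[PAD]": 0, "plane": 1, "crash": 2}
embeddings = [[0.0] * 4, [0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]

# New tokens receive the next free ids, after the existing vocabulary
new_tokens = ["germanwings", "#ferguson"]
for token in new_tokens:
    vocab.setdefault(token, len(vocab))

resized = resize_embeddings(embeddings, len(vocab))
```

Because the new rows start out untrained, the representations of the added tokens only become useful after further training, which is what the domain pre-training step provides.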

## Domain Pre-Training
Domain pre-training is the third step in domain adaptation: we continue training the Transformer model on the in-domain corpus, using the same pre-training procedure as the original model (for BERT, masked language modeling).
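
For BERT, that pre-training procedure is masked language modeling. The following sketch shows the idea on a list of string tokens, using the 80/10/10 replacement scheme from the original BERT recipe. It is an illustration only, not the code used by `DataCollatorForLanguageModeling`: `mask_tokens`, the toy sentence, and the use of strings instead of token ids are assumptions.

```python
import random

MASK, IGNORE = "[MASK]", -100


def mask_tokens(tokens, vocab, mlm_probability=0.15, seed=0):
    """Apply BERT-style masking to a list of tokens.

    Returns (inputs, labels): labels hold the original token at selected
    positions and IGNORE elsewhere, so only selected positions are scored.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() >= mlm_probability:
            # Position not selected: keep the token and exclude it from the loss
            inputs.append(token)
            labels.append(IGNORE)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:      # 80%: replace with the mask token
            inputs.append(MASK)
        elif roll < 0.9:    # 10%: replace with a random vocabulary token
            inputs.append(rng.choice(vocab))
        else:               # 10%: keep the original token
            inputs.append(token)
    return inputs, labels


# Hypothetical toy corpus, repeated so that the masking rate is visible
tokens = "police confirm one shooter dead near parliament hill".split() * 50
inputs, labels = mask_tokens(tokens, vocab=sorted(set(tokens)))
masked_fraction = sum(label != IGNORE for label in labels) / len(tokens)
```

The model is trained to recover the original token at each selected position, so continuing this objective on in-domain text adapts its representations to the new domain without requiring labels.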

#### Create dataset

In [None]:
import itertools as it
from pathlib import Path
from typing import Sequence, Union, Generator

from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

In [None]:
datasets = load_dataset(
    'text',
    data_files={
        "train": dpt_corpus_train_data_selected,
        "val": dpt_corpus_val
    }
)

tokenized_datasets = datasets.map(
    lambda examples: tokenizer(examples['text'], truncation=True, max_length=model.config.max_position_embeddings),
    batched=True
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

#### Instantiate TrainingArguments and Trainer

In [None]:
training_args = TrainingArguments(
    output_dir="./results/domain_pre_training",
    overwrite_output_dir=True,
    max_steps=100,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    evaluation_strategy="steps",
    save_steps=50,
    save_total_limit=2,
    logging_steps=50,
    seed=42,
    # fp16=True,
    dataloader_num_workers=2,
    disable_tqdm=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['val'],
    data_collator=data_collator,
    tokenizer=tokenizer,  # This tokenizer has new tokens
)

In [None]:
trainer.train()

## Fine-Tuning for Specific Tasks
We can plug our domain-adapted model into any fine-tuning task supported by HuggingFace.

For this guide, we compare the performance of an out-of-the-box (OOB) model against that of a domain-adapted model for Named Entity Recognition on the BC2GM dataset, a popular biomedical benchmarking dataset.

Utility functions for NER preprocessing and evaluation are adapted from HuggingFace's [NER fine-tuning example notebook](https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb).
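
The core of the NER preprocessing is aligning word-level labels with sub-word tokens. The sketch below shows that step in isolation. The `word_ids` list is written by hand here as a hypothetical tokenization; in practice it would come from a fast tokenizer, and `align_labels` is an illustrative helper rather than part of any library.

```python
IGNORE_INDEX = -100


def align_labels(word_labels, word_ids, label_to_id):
    """Expand word-level labels to sub-word tokens.

    `word_ids` maps each token to the index of the word it came from, or
    None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(IGNORE_INDEX)  # special tokens are ignored by the loss
        else:
            aligned.append(label_to_id[word_labels[word_idx]])
    return aligned


label_to_id = {"O": 0, "B": 1, "I": 2}
# Hypothetical split: "interleukin-2" -> 3 sub-word tokens, other words -> 1 each
words = ["interleukin-2", "receptor", "binds"]
word_labels = ["B", "I", "O"]
word_ids = [None, 0, 0, 0, 1, 2, None]

aligned = align_labels(word_labels, word_ids, label_to_id)
# aligned == [-100, 1, 1, 1, 2, 0, -100]
```

Every sub-word token inherits the label of its word, and the two special tokens receive `-100` so that they do not contribute to the loss.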

#### Preprocess raw dataset to form NER dataset

In [None]:
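from functools import partial
from typing import List, Literal, NamedTuple

import numpy as np
from datasets import Dataset, load_dataset, load_metric


class Example(NamedTuple):
    # One sentence: parallel lists of tokens and their labels
    token: List[str]
    label: List[str]

def load_ner_dataset(mode: Literal['train', 'val', 'test']):
    """Read a tab-separated token/label file in which blank lines separate sentences."""
    file = f"data/BC2GM_{mode}.tsv"
    examples = []
    with open(file) as f:
        token = []
        label = []
        for line in f:
            if line.strip() == "":
                # A blank line closes the current sentence (repeated blank lines are skipped)
                if token:
                    examples.append(Example(token=token, label=label))
                token = []
                label = []
                continue
            t, l = line.strip().split("\t")
            token.append(t)
            label.append(l)
        # Keep the last sentence if the file does not end with a blank line
        if token:
            examples.append(Example(token=token, label=label))

    res = list(zip(*[(ex.token, ex.label) for ex in examples]))
    d = {'token': res[0], 'labels': res[1]}
    return Dataset.from_dict(d)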


def tokenize_and_align_labels(examples, tokenizer):
    tokenized_inputs = tokenizer(examples["token"], truncation=True, is_split_into_words=True)
    label_to_id = dict(map(reversed, enumerate(label_list)))

    labels = []
    for i, label in enumerate(examples["labels"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label_to_id[label[word_idx]])
            # For the other tokens in a word, we reuse the word's label, so every
            # sub-word token is labelled.
            else:
                label_ids.append(label_to_id[label[word_idx]])
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs


def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

In [None]:
%%capture
# Install `seqeval`
!pip install seqeval

In [None]:
label_list = ["O", "B", "I"]
metric = load_metric('seqeval')

train_dataset = load_ner_dataset('train')
val_dataset = load_ner_dataset('val')
test_dataset = load_ner_dataset('test')

#### Instantiate NER models
Here we instantiate three task-specific NER models for comparison:
1. `da_model`: A domain-adapted NER model we just trained in this guide
2. `da_full_corpus_model`: The same domain-adapted NER model except that it was trained on the full in-domain training corpus
3. `oob_model`: An out-of-the-box BERT NER model (not domain-adapted)

In [None]:
from transformers import AutoModelForTokenClassification, DataCollatorForTokenClassification

best_checkpoint = './results/domain_pre_training/checkpoint-100'
da_model = AutoModelForTokenClassification.from_pretrained(best_checkpoint, num_labels=len(label_list))

da_full_corpus_model = AutoModelForTokenClassification.from_pretrained('./domain-adapted-bert', num_labels=len(label_list))
full_corpus_tokenizer = AutoTokenizer.from_pretrained('./domain-adapted-bert')

oob_tokenizer = AutoTokenizer.from_pretrained(model_card)
oob_model = AutoModelForTokenClassification.from_pretrained(model_card, num_labels=len(label_list))

#### Create datasets, TrainingArguments and Trainer for each model

In [None]:
from typing import Dict

from datasets import Dataset


def preprocess_datasets(tokenizer, **datasets) -> Dict[str, Dataset]:
    tokenize_ner = partial(tokenize_and_align_labels, tokenizer=tokenizer)
    return {k: ds.map(tokenize_ner, batched=True) for k, ds in datasets.items()}

######################
##### `da_model` #####
######################
da_datasets = preprocess_datasets(
    tokenizer,
    train=train_dataset,
    val=val_dataset,
    test=test_dataset
)

training_args = TrainingArguments(
    output_dir="./results/domain_adapted_fine_tuning",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",
    save_steps=200,
    save_total_limit=2,
    logging_steps=100,
    seed=42,
    fp16=True,
    dataloader_num_workers=2,
    disable_tqdm=False
)

da_trainer = Trainer(
    model=da_model,
    args=training_args,
    train_dataset=da_datasets['train'],
    eval_dataset=da_datasets['val'],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,  # This tokenizer has new tokens
    compute_metrics=compute_metrics
)


##################################
##### `da_full_corpus_model` #####
##################################
da_full_corpus_datasets = preprocess_datasets(
    full_corpus_tokenizer,
    train=train_dataset,
    val=val_dataset,
    test=test_dataset
)

training_args = TrainingArguments(
    output_dir="./results/domain_adapted_full_corpus_fine_tuning",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",
    save_steps=200,
    save_total_limit=2,
    logging_steps=100,
    seed=42,
    fp16=True,
    dataloader_num_workers=2,
    disable_tqdm=False
)

da_full_corpus_trainer = Trainer(
    model=da_full_corpus_model,
    args=training_args,
    train_dataset=da_full_corpus_datasets['train'],
    eval_dataset=da_full_corpus_datasets['val'],
    data_collator=DataCollatorForTokenClassification(full_corpus_tokenizer),
    tokenizer=full_corpus_tokenizer,  # This tokenizer has new tokens
    compute_metrics=compute_metrics
)


#######################
##### `oob_model` #####
#######################
oob_datasets = preprocess_datasets(
    oob_tokenizer,
    train=train_dataset,
    val=val_dataset,
    test=test_dataset
)

training_args = TrainingArguments(
    output_dir="./results/out_of_the_box_fine_tuning",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",
    save_steps=200,
    save_total_limit=2,
    logging_steps=100,
    seed=42,
    fp16=True,
    dataloader_num_workers=2,
    disable_tqdm=False
)

oob_model_trainer = Trainer(
    model=oob_model,
    args=training_args,
    train_dataset=oob_datasets['train'],
    eval_dataset=oob_datasets['val'],
    data_collator=DataCollatorForTokenClassification(oob_tokenizer),
    tokenizer=oob_tokenizer,  # This is the original tokenizer (without domain-specific tokens)
    compute_metrics=compute_metrics
)

#### Train and evaluate `da_model`

In [None]:
da_trainer.train()
da_trainer.evaluate(da_datasets['test'])

#### Train and evaluate `da_full_corpus_model`

In [None]:
da_full_corpus_trainer.train()
da_full_corpus_trainer.evaluate(da_full_corpus_datasets['test'])

#### Train and evaluate `oob_model`

In [None]:
oob_model_trainer.train()
oob_model_trainer.evaluate(oob_datasets['test'])

#### Results
Of the three models, `da_full_corpus_model` (which was domain-adapted on the entire in-domain training corpus) outperforms `oob_model` by over 2% on the test F1 score. In fact, `da_full_corpus_model` is one of many domain-adapted models we trained that outperform SOTA on BC2GM.

Also, `da_model` underperforms `oob_model`. This is to be expected, as `da_model` underwent minimal domain pre-training in this guide.
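
The F1 score reported by `seqeval` is computed over whole entities rather than individual tokens. The sketch below is a simplified illustration of that idea for the untyped `B`/`I`/`O` tags used in this guide. It is not `seqeval`'s implementation: `extract_spans` and `entity_f1` are illustrative helpers, and an `I` tag that does not follow a `B` is simply ignored here.

```python
def extract_spans(tags):
    """Return the set of (start, end) entity spans in a B/I/O tag sequence."""
    spans, start = set(), None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        if start is not None and tag != "I":
            spans.add((start, i))
            start = None
        if tag == "B":
            start = i
    return spans


def entity_f1(true_tags, pred_tags):
    """F1 over exact-match entity spans."""
    true_spans, pred_spans = extract_spans(true_tags), extract_spans(pred_tags)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one entity predicted exactly, one misplaced
true_tags = ["B", "I", "O", "B", "O"]
pred_tags = ["B", "I", "O", "O", "B"]
score = entity_f1(true_tags, pred_tags)  # precision = recall = 0.5, so F1 = 0.5
```

An entity only counts as correct when its full span matches, which makes this metric stricter than token-level accuracy.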

## Conclusion
In this guide, you have seen how to use `DataSelector` and `VocabAugmentor` to domain-adapt a transformers model by performing Data Selection and Vocabulary Augmentation, respectively.

You have also seen that they are compatible with HuggingFace's `transformers`, `tokenizers` and `datasets` libraries.

Finally, we showed that a model domain-adapted on the full in-domain corpus performs better than an out-of-the-box model.