## Experiment description:

The authors of the VOLT paper (https://arxiv.org/abs/2012.15671) draw two conclusions of interest. First, using the VOLT-generated vocabulary can substantially increase the downstream task performance of a model without tuning other hyperparameters. Second, for the same target size, VOLT- and BPE-generated vocabularies overlap heavily, so in practice one is not restricted to the VOLT-obtained vocabulary and can instead use only the optimal size that VOLT finds. As the paper focuses on machine translation tasks, it would be helpful to test these results on a text classification task.
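
The overlap claim can be checked directly once both vocabularies are available as token lists. A minimal sketch, assuming each vocabulary is a plain Python collection of token strings; the function name `vocab_overlap` and the toy tokens are illustrative and not taken from the paper:

```python
def vocab_overlap(vocab_a, vocab_b):
    """Return the share of tokens in vocab_a that also occur in vocab_b."""
    set_a, set_b = set(vocab_a), set(vocab_b)
    if not set_a:
        return 0.0
    return len(set_a & set_b) / len(set_a)

# Toy vocabularies (hypothetical tokens, for illustration only)
volt_vocab = ["th", "the", "in", "ing", "er"]
bpe_vocab = ["th", "the", "in", "on", "er"]

overlap = vocab_overlap(volt_vocab, bpe_vocab)
print(f"overlap: {overlap:.2f}")
```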

For our experiment we fine-tune the DistilBERT model from the Hugging Face library. We compare the performance of the model with the default versus the VOLT-generated vocabulary size on a range of datasets, to see whether a drastic performance boost is possible without further parameter tuning.

# install

In [1]:
!pip install transformers
!pip install datasets
!pip install tokenizers

Collecting transformers
  Downloading transformers-4.12.3-py3-none-any.whl (3.1 MB)
     |████████████████████████████████| 3.1 MB 5.4 MB/s
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.1.0-py3-none-any.whl (59 kB)
     |████████████████████████████████| 59 kB 6.0 MB/s
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
     |████████████████████████████████| 895 kB 36.0 MB/s
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
     |████████████████████████████████| 596 kB 42.7 MB/s
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
     |████████████████████████████████| 3.3 MB 35.2 MB/s
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attem

# config

In [2]:
##
## for volt evaluation, uncomment one of the datasets here
## size of the "small" set should match the volt output, obtained in
## the last cell of "Find the optimal size" section
##

# 1.1 imdb small
#dataset_id, task, tok_train_fold, sentence1_key, sentence2_key, num_labels, vocab_len = 'imdb', None, 'unsupervised', 'text', None, 2, 6000

# 1.2 imdb big
#dataset_id, task, tok_train_fold, sentence1_key, sentence2_key, num_labels, vocab_len = 'imdb', None, 'unsupervised', 'text', None, 2, 30000

# 2.1 hate small
dataset_id, task, tok_train_fold, sentence1_key, sentence2_key, num_labels, vocab_len = 'tweet_eval', 'hate', 'train', 'text', None, 2, 6000

# 2.2 hate big
#dataset_id, task, tok_train_fold, sentence1_key, sentence2_key, num_labels, vocab_len = 'tweet_eval', 'hate', 'train', 'text', None, 2, 30000

# 3.1 emotion small
#dataset_id, task, tok_train_fold, sentence1_key, sentence2_key, num_labels, vocab_len = 'tweet_eval', 'emotion', 'train', 'text', None, 4, 7000

# 3.2 emotion big
#dataset_id, task, tok_train_fold, sentence1_key, sentence2_key, num_labels, vocab_len = 'tweet_eval', 'emotion', 'train', 'text', None, 4, 30000

# 4.1 sentiment small
#dataset_id, task, tok_train_fold, sentence1_key, sentence2_key, num_labels, vocab_len = 'tweet_eval', 'sentiment', 'train', 'text', None, 3, 8000

# 4.2 sentiment big
#dataset_id, task, tok_train_fold, sentence1_key, sentence2_key, num_labels, vocab_len = 'tweet_eval', 'sentiment', 'train', 'text', None, 3, 30000

# 5.1 offensive small
#dataset_id, task, tok_train_fold, sentence1_key, sentence2_key, num_labels, vocab_len = 'tweet_eval', 'offensive', 'train', 'text', None, 2, 7000

# 5.2 offensive big
#dataset_id, task, tok_train_fold, sentence1_key, sentence2_key, num_labels, vocab_len = 'tweet_eval', 'offensive', 'train', 'text', None, 2, 30000

# 6.1 irony small
#dataset_id, task, tok_train_fold, sentence1_key, sentence2_key, num_labels, vocab_len = 'tweet_eval', 'irony', 'train', 'text', None, 2, 8000

# 6.2 irony big
#dataset_id, task, tok_train_fold, sentence1_key, sentence2_key, num_labels, vocab_len = 'tweet_eval', 'irony', 'train', 'text', None, 2, 30000

validation_key = 'test'
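
Switching datasets by commenting lines in and out is easy to get wrong. One possible alternative, sketched here with the same values as the "small" and "big" lines above, keeps the configurations in a dictionary and selects one by name. The names `RunConfig`, `_BASE`, `BIG_VOCAB_LEN` and `select_config` are hypothetical:

```python
from typing import NamedTuple, Optional

class RunConfig(NamedTuple):
    dataset_id: str
    task: Optional[str]
    tok_train_fold: str
    sentence1_key: str
    sentence2_key: Optional[str]
    num_labels: int
    vocab_len: int

# name -> (dataset_id, task, tokenizer training fold, num_labels, "small" vocab size)
_BASE = {
    "imdb": ("imdb", None, "unsupervised", 2, 6000),
    "hate": ("tweet_eval", "hate", "train", 2, 6000),
    "emotion": ("tweet_eval", "emotion", "train", 4, 7000),
    "sentiment": ("tweet_eval", "sentiment", "train", 3, 8000),
    "offensive": ("tweet_eval", "offensive", "train", 2, 7000),
    "irony": ("tweet_eval", "irony", "train", 2, 8000),
}
BIG_VOCAB_LEN = 30000  # size used by every "big" configuration

def select_config(name, small=True):
    """Build the configuration for one dataset, with the small or big vocabulary."""
    dataset_id, task, fold, num_labels, small_len = _BASE[name]
    vocab_len = small_len if small else BIG_VOCAB_LEN
    return RunConfig(dataset_id, task, fold, "text", None, num_labels, vocab_len)

# Equivalent to uncommenting "2.1 hate small"
(dataset_id, task, tok_train_fold, sentence1_key,
 sentence2_key, num_labels, vocab_len) = select_config("hate", small=True)
```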

In [3]:
from datasets import load_dataset
dataset = load_dataset(dataset_id, task)

Downloading and preparing dataset tweet_eval/hate (download: 1.62 MiB, generated: 1.72 MiB, post-processed: Unknown size, total: 3.35 MiB) to /root/.cache/huggingface/datasets/tweet_eval/hate/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343...


Dataset tweet_eval downloaded and prepared to /root/.cache/huggingface/datasets/tweet_eval/hate/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343. Subsequent calls will reuse this data.



In [4]:
dataset['train'][0:6]

{'label': [0, 1, 1, 1, 0, 0],
 'text': ['@user nice new signage. Are you not concerned by Beatlemania -style hysterical crowds crongregating on you…',
  'A woman who you fucked multiple times saying yo dick small is a compliment you know u hit that spot 😎',
  '@user @user real talk do you have eyes or were they gouged out by a rapefugee?',
  'your girlfriend lookin at me like a groupie in this bitch!',
  'Hysterical woman like @user',
  'Me flirting- So tell me about your father...']}

# Find optimal size

In [5]:
!git clone https://github.com/Jingjing-NLP/VOLT/

Cloning into 'VOLT'...
remote: Enumerating objects: 4126, done.
remote: Counting objects: 100% (58/58), done.
remote: Compressing objects: 100% (49/49), done.
remote: Total 4126 (delta 15), reused 49 (delta 9), pack-reused 4068
Receiving objects: 100% (4126/4126), 20.19 MiB | 18.49 MiB/s, done.
Resolving deltas: 100% (679/679), done.


In [6]:
%cd /content/VOLT
!git clone https://github.com/moses-smt/mosesdecoder.git
!git clone https://github.com/rsennrich/subword-nmt.git
!pip3 install sentencepiece
!pip3 install tqdm 
%cd POT
!pip3 install --editable ./ -i https://pypi.doubanio.com/simple --user
%cd ../

/content/VOLT
Cloning into 'mosesdecoder'...
remote: Enumerating objects: 148070, done.
remote: Counting objects: 100% (498/498), done.
remote: Compressing objects: 100% (206/206), done.
remote: Total 148070 (delta 315), reused 433 (delta 289), pack-reused 147572
Receiving objects: 100% (148070/148070), 129.86 MiB | 19.28 MiB/s, done.
Resolving deltas: 100% (114341/114341), done.
Cloning into 'subword-nmt'...
remote: Enumerating objects: 580, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 580 (delta 0), reused 1 (delta 0), pack-reused 576
Receiving objects: 100% (580/580), 237.41 KiB | 2.42 MiB/s, done.
Resolving deltas: 100% (349/349), done.
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
     |████████████████████████████████| 1.2 MB 5.4 MB/s
Installing collected packages: sentencepiece
Successfully installed 

In [7]:
with open("corpus.txt", "w+") as f:
    for fold in dataset:
        for elem in dataset[fold]:
            f.write(elem['text']+"\n")

In [8]:
# Assume source_file is the file storing your data

#subword-nmt style:
!mkdir bpeoutput
#BPE_CODE=bpeoutput/code # the path to save vocabulary
!python3 subword-nmt/learn_bpe.py -s 30000 --input corpus.txt --output bpeoutput/code

  args.input = codecs.open(args.input.name, encoding='utf-8')
  args.output = codecs.open(args.output.name, 'w', encoding='utf-8')
no pair has frequency >= 2. Stopping
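
The message `no pair has frequency >= 2. Stopping` means the learner ran out of repeated symbol pairs before reaching the requested 30000 merges. The toy learner below is a simplified sketch of that stopping rule and not the subword-nmt implementation; the function `learn_bpe_toy` and its `min_frequency` parameter are illustrative:

```python
from collections import Counter

def learn_bpe_toy(words, num_merges, min_frequency=2):
    """Learn up to num_merges merges; stop early when no pair is frequent enough."""
    # Each word is a tuple of symbols, mapped to its corpus frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < min_frequency:
            break  # analogous to "no pair has frequency >= 2. Stopping"
        merges.append(best)
        # Replace every occurrence of the best pair by the merged symbol.
        merged = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        vocab = merged
    return merges

# 100 merges are requested, but the tiny corpus supports far fewer.
merges = learn_bpe_toy(["low", "low", "lower", "new"], num_merges=100)
print(len(merges), merges)
```

On a small corpus such as a tweet dataset, the number of learned merges can therefore end up below the value passed to `-s`.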


In [9]:
!python3 subword-nmt/apply_bpe.py -c bpeoutput/code --input corpus.txt --output bpeoutput/source.file

  args.codes = codecs.open(args.codes.name, encoding='utf-8')
  args.input = codecs.open(args.input.name, encoding='utf-8')
  args.output = codecs.open(args.output.name, 'w', encoding='utf-8')


In [10]:
#
# !! find best size here !!
# (check that the size is the same as in this dataset's "small" config)
#

!python3 ot_run.py --source_file bpeoutput/source.file \
          --token_candidate_file bpeoutput/code \
          --vocab_file bpeoutput/vocab --max_number 10000 --interval 1000  --loop_in_ot 500 --tokenizer subword-nmt --size_file bpeoutput/size 

reading candidate tokens
100% 29306/29306 [00:35<00:00, 836.91it/s]
reading char file
100% 12970/12970 [00:01<00:00, 6656.32it/s]
  v = np.divide(b, KtransposeU)
best size:  6000
One optional solution is that you can use this size to generated vocabulary in subword-nmt or sentencepiece
100% 6000/6000 [00:02<00:00, 2382.91it/s]
Traceback (most recent call last):
  File "ot_run.py", line 226, in <module>
    write_vocab(oldtokens, Gs, chars, vocab_file, threshold) #generate the vocabulary based on the optimal matrix
  File "ot_run.py", line 107, in write_vocab
    left, right = token.split(" ")
ValueError: not enough values to unpack (expected 2, got 1)
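
The traceback shows `write_vocab` failing because one candidate token does not contain the single space that `token.split(" ")` expects. The best size (6000) is printed before the crash, so the rest of this notebook can still proceed with that value. As an illustration of the failure mode only, and not a patch for `ot_run.py`, a defensive parser could skip malformed entries; `split_merge_pair` and the sample entries are hypothetical:

```python
def split_merge_pair(token):
    """Split a merge entry 'left right' into its two parts, or return None if malformed."""
    parts = token.split(" ")
    if len(parts) != 2 or not all(parts):
        return None
    return parts[0], parts[1]

# Two well-formed entries and two malformed ones (hypothetical data)
candidates = ["th e", "in g", "orphan", "a b c"]
pairs = [p for p in map(split_merge_pair, candidates) if p is not None]
print(pairs)
```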


## optional: generate vocabulary here

In [11]:
#!echo "#version: 0.2" > bpeoutput/vocab.seg # add version info
#!echo bpeoutput/vocab >> bpeoutput/vocab.seg

In [12]:
# python3 subword-nmt/apply_bpe.py -c bpeoutput/vocab --input corpus.txt --output bpeoutput/source.file

# config

In [13]:
# change the vocabulary size here if, for some reason,
# it doesn't match the precalculated one

#
# vocab_len = #best size from 3 cells above
#

validation_key = 'test'

In [14]:
from datasets import load_dataset
dataset = load_dataset(dataset_id, task)

Reusing dataset tweet_eval (/root/.cache/huggingface/datasets/tweet_eval/hate/1.1.0/12aee5282b8784f3e95459466db4cdf45c6bf49719c25cdb0743d71ed0410343)



# Train tokenizer

In [15]:
#https://huggingface.co/docs/tokenizers/python/latest/quicktour.html

# following the experimental setup described in the VOLT paper,
# we use a BPE tokenizer for tuning BERT (instead of BERT's standard WordPiece tokenizer)

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers import normalizers
from tokenizers.normalizers import Lowercase, NFD, StripAccents
from tokenizers.pre_tokenizers import Whitespace

bert_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

bert_tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])

bert_tokenizer.pre_tokenizer = Whitespace()

In [16]:
from tokenizers.processors import TemplateProcessing

bert_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)
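
To make the template concrete: for a single sequence the post-processor wraps the token ids in `[CLS]` and `[SEP]`, and for a pair it appends the second sequence and its closing `[SEP]` with type id 1. The pure-Python sketch below mimics that behaviour, assuming `[CLS]` and `[SEP]` have ids 1 and 2 as declared above. It is not the `tokenizers` implementation, and `apply_template` is a hypothetical name:

```python
CLS_ID, SEP_ID = 1, 2  # ids declared in special_tokens above

def apply_template(ids_a, ids_b=None):
    """Return (input_ids, type_ids) following "[CLS] $A [SEP] $B:1 [SEP]:1"."""
    input_ids = [CLS_ID, *ids_a, SEP_ID]
    type_ids = [0] * len(input_ids)
    if ids_b is not None:
        input_ids += [*ids_b, SEP_ID]
        type_ids += [1] * (len(ids_b) + 1)
    return input_ids, type_ids

single = apply_template([10, 11])
pair = apply_template([10, 11], [20])
print(single)
print(pair)
```

All datasets in this notebook use `sentence2_key = None`, so only the single-sequence form is exercised here.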

In [17]:
def batch_iterator(batch_size=1000):
    d = dataset[tok_train_fold]
    for i in range(0, len(d), batch_size):
        yield d[i : i + batch_size]["text"]

In [18]:
from tokenizers.trainers import BpeTrainer

# be sure to set the correct vocab_len to experiment with

bpe_trainer = BpeTrainer(
    vocab_size=vocab_len,
    special_tokens=["[UNK]", "[CLS]", "[SEP]",
                    "[PAD]", "[MASK]"]
)
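# `length` is the number of training sequences (used for progress reporting);
# len(dataset) would only count the splits of the DatasetDict
bert_tokenizer.train_from_iterator(batch_iterator(), trainer=bpe_trainer, length=len(dataset[tok_train_fold]))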

# Train Bert

In [19]:
from datasets import load_metric
metric = load_metric("f1")


In [20]:
from transformers import PreTrainedTokenizerFast

#bert_tokenizer.enable_truncation(max_length, stride=0, strategy='longest_first')
#bert_tokenizer.enable_padding()

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=bert_tokenizer, model_max_length=512)

In [21]:
fast_tokenizer.add_special_tokens({'pad_token': '[PAD]'})

0

In [22]:
def preprocess_function(examples):
    if sentence2_key is None:
        return fast_tokenizer(examples[sentence1_key], truncation=True, padding=True)
    return fast_tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True, padding=True)

encoded_dataset = dataset.map(preprocess_function, batched=True)


In [23]:
#bert_tokenizer.encode(dataset[fold]['text'][0])

In [24]:
model_checkpoint = "distilbert-base-uncased"
batch_size = 24 # colab can handle 24 for certain; change if you need to

In [25]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier

In [26]:
metric_name = "f1"
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-{task}",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    gradient_accumulation_steps=1,
)

In [27]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    if task != "stsb":
        predictions = np.argmax(predictions, axis=1)
    else:
        predictions = predictions[:, 0]
    return metric.compute(predictions=predictions, references=labels,
                          average='macro')
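
With `average='macro'` the F1 score is computed per class and then averaged without weighting by class frequency, so a rare class counts as much as a frequent one. A small pure-Python sketch of that computation, for illustration only (the notebook itself relies on the `f1` metric loaded above; `macro_f1` is a hypothetical helper):

```python
def macro_f1(predictions, references):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(references) | set(predictions))
    scores = []
    for label in labels:
        pairs = list(zip(predictions, references))
        tp = sum(p == label and r == label for p, r in pairs)
        fp = sum(p == label and r != label for p, r in pairs)
        fn = sum(p != label and r == label for p, r in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

score = macro_f1(predictions=[0, 0, 1, 1], references=[0, 1, 1, 1])
print(round(score, 4))
```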

In [28]:
# tune the model

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=fast_tokenizer,
    compute_metrics=compute_metrics,
)

In [29]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text, token_type_ids.
***** Running training *****
  Num examples = 9000
  Num Epochs = 5
  Instantaneous batch size per device = 24
  Total train batch size (w. parallel, distributed & accumulation) = 24
  Gradient Accumulation steps = 1
  Total optimization steps = 1875


| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | No log | 0.990698 | 0.429822 |
| 2 | 0.611600 | 1.136201 | 0.435168 |
| 3 | 0.475800 | 1.316331 | 0.376632 |
| 4 | 0.412700 | 1.37334 | 0.375298 |
| 5 | 0.412700 | 1.421763 | 0.386868 |


The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text, token_type_ids.
***** Running Evaluation *****
  Num examples = 2970
  Batch size = 8
Saving model checkpoint to distilbert-base-uncased-finetuned-hate/checkpoint-375
Configuration saved in distilbert-base-uncased-finetuned-hate/checkpoint-375/config.json
Model weights saved in distilbert-base-uncased-finetuned-hate/checkpoint-375/pytorch_model.bin
tokenizer config file saved in distilbert-base-uncased-finetuned-hate/checkpoint-375/tokenizer_config.json
Special tokens file saved in distilbert-base-uncased-finetuned-hate/checkpoint-375/special_tokens_map.json
The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text, token_type_ids.
***** Running Evaluation *****
  Num examples = 2970
  Batch size = 8
Saving model checkpoint to dist

TrainOutput(global_step=1875, training_loss=0.473513525390625, metrics={'train_runtime': 1057.9383, 'train_samples_per_second': 42.536, 'train_steps_per_second': 1.772, 'total_flos': 1341331761940992.0, 'train_loss': 0.473513525390625, 'epoch': 5.0})
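
The step counts in the log follow from the configuration: 9000 training examples at a batch size of 24 give 375 optimizer steps per epoch, hence `checkpoint-375` after the first epoch and 1875 steps after five. A quick check, assuming a single device and no gradient accumulation as in the arguments above:

```python
import math

num_examples = 9000      # "Num examples" in the training log
batch_size = 24          # per_device_train_batch_size
grad_accum_steps = 1     # gradient_accumulation_steps
num_epochs = 5           # num_train_epochs

steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum_steps))
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

Assuming the `Trainer` default of logging the training loss every 500 steps, the same arithmetic would be consistent with the `No log` entry for epoch 1 (only 375 steps completed) and with the identical training loss reported for epochs 4 and 5 (the last log before both evaluations falls at step 1500).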

In [30]:
trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text, token_type_ids.
***** Running Evaluation *****
  Num examples = 2970
  Batch size = 8


{'epoch': 5.0,
 'eval_f1': 0.4351675942421997,
 'eval_loss': 1.1362006664276123,
 'eval_runtime': 27.6086,
 'eval_samples_per_second': 107.575,
 'eval_steps_per_second': 13.474}
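
Because `load_best_model_at_end=True` is combined with `metric_for_best_model=metric_name`, the final `evaluate()` call reports the epoch-2 checkpoint (F1 0.4352) rather than the last epoch (F1 0.3869). The selection rule can be sketched from the logged values; the list below copies them from the training output:

```python
# (epoch, validation loss, F1) copied from the training log above
history = [
    (1, 0.990698, 0.429822),
    (2, 1.136201, 0.435168),
    (3, 1.316331, 0.376632),
    (4, 1.37334, 0.375298),
    (5, 1.421763, 0.386868),
]

# Pick the epoch with the highest F1, as the Trainer does for this metric
best_epoch, best_loss, best_f1 = max(history, key=lambda row: row[2])
print(best_epoch, best_loss, best_f1)
```

The selected row matches the `eval_f1` and `eval_loss` returned above. Since `validation_key = 'test'`, this selection is made on the test split.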