# BERT Classifier

The BERT classifier pipeline was built using [this guide](https://mccormickml.com/2019/07/22/BERT-fine-tuning/).

## 1. GPU Setup

Ensures that a GPU is enabled in the current runtime.

In [None]:
import tensorflow as tf

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


In [None]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


## 2. Loading Dataset

Downloads and unpacks the CoLA dataset. The data is read into memory in section 3A.

In [None]:
!pip install wget



In [None]:
import wget
import os

print('Downloading dataset...')

# The URL for the dataset zip file.
url = 'https://nyu-mll.github.io/CoLA/cola_public_1.1.zip'

# Download the file (if we haven't already)
if not os.path.exists('./cola_public_1.1.zip'):
    wget.download(url, './cola_public_1.1.zip')

Downloading dataset...


In [None]:
# Unzip the dataset (if we haven't already)
if not os.path.exists('./cola_public/'):
    !unzip cola_public_1.1.zip

Archive:  cola_public_1.1.zip
   creating: cola_public/
  inflating: cola_public/README      
   creating: cola_public/tokenized/
  inflating: cola_public/tokenized/in_domain_dev.tsv  
  inflating: cola_public/tokenized/in_domain_train.tsv  
  inflating: cola_public/tokenized/out_of_domain_dev.tsv  
   creating: cola_public/raw/
  inflating: cola_public/raw/in_domain_dev.tsv  
  inflating: cola_public/raw/in_domain_train.tsv  
  inflating: cola_public/raw/out_of_domain_dev.tsv  


## 3. Data Parsing and Preprocessing

We need to parse and tokenize our data, then format it the way BERT expects. To format it properly, we must:

1. Add special tokens to the start and end of each sentence.
2. Pad & truncate all sentences to a single constant length.
3. Explicitly differentiate real tokens from padding tokens with the “attention mask”.
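
The three steps above can be sketched without the tokenizer. This is a minimal illustration, not the library's implementation: `pad_and_mask` is a hypothetical helper, and the IDs 101, 102, and 0 are assumed to be the `[CLS]`, `[SEP]`, and `[PAD]` IDs, matching the values that appear in the tokenizer output later in this notebook.

```python
def pad_and_mask(token_ids, max_len, cls_id=101, sep_id=102, pad_id=0):
    """Hypothetical helper: wrap token IDs in [CLS]/[SEP], then pad or truncate to max_len."""
    # Step 1: add the special tokens, truncating the body so they still fit.
    body = list(token_ids)[: max_len - 2]
    ids = [cls_id] + body + [sep_id]
    # Step 3: real tokens (including the special ones) get mask 1 ...
    mask = [1] * len(ids)
    # Step 2: ... and padding fills the remainder, with mask 0.
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


ids, mask = pad_and_mask([2028, 2062, 18404], max_len=8)
print(ids)   # [101, 2028, 2062, 18404, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```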

Augmented samples are also generated in this section.

### 3A. Preprocessing

Loads data into arrays, partitions data into labeled and unlabeled sections.

In [None]:
import pandas as pd

# Load the dataset into a pandas dataframe.
df = pd.read_csv("./cola_public/raw/in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

# Report the number of sentences.
print('Number of training sentences: {:,}\n'.format(df.shape[0]))

# Display 10 random rows from the data.
df.sample(10)

Number of training sentences: 8,551



| | sentence_source | label | label_notes | sentence |
|---|---|---|---|---|
| 5183 | kl93 | 1 | | I would dance with Mary or Sue. |
| 65 | gj04 | 0 | \*? | Bill floated into the cave for hours. |
| 6911 | m_02 | 1 | | Jim was enthusiastically chopping logs. |
| 2888 | l-93 | 0 | \* | The child and her mother clung. |
| 1551 | r-67 | 1 | | Tom, Dick, and Harry know it. |
| 7029 | sgww85 | 0 | \* | Kim alienated cats and beaten his dog. |
| 7446 | sks13 | 0 | \* | Mary thinks for Bill to come. |
| 4640 | ks08 | 1 | | John has driven the car. |
| 6966 | m_02 | 1 | | The vase was smashed deliberately. |
| 7659 | sks13 | 1 | | John hopes to sleep. |


In [None]:
# Get the lists of sentences and their labels.
# Split off unlabeled data

import numpy as np

sentences_full = df.sentence.values
labels_full = df.label.values

print('dataset size:', len(sentences_full))

# Use FULL 2x samples
# labeled_samples = 2000
# sentences = sentences_full[:]
# labels = labels_full[:labeled_samples]
# labels = np.concatenate((labels, np.array([-1] * (len(sentences_full) - labeled_samples))), axis=None)

# Use 2000 samples
labeled_samples = 2000
sentences = sentences_full[:]
labels = labels_full[:labeled_samples]
labels = np.concatenate((labels, np.array([-1] * (len(sentences_full) - labeled_samples))), axis=None)

print(len(sentences), len(labels))
print(labels)

dataset size: 8551
8551 8551
[ 1  1  1 ... -1 -1 -1]


In [None]:
# Load Yelp dataset (yelp no longer used)

# import json
# from tqdm import tqdm

# yelp_sentences = []
# yelp_star_labels = []
# yelp_binary_labels = []
# line_counter = 0

# with open(yelp_filepath) as file:
#     for line in tqdm(file):
#         if line_counter > 2200:
#             break
#         line_counter += 1
#         review = json.loads(line)
#         yelp_sentences.append(review['text'])
#         stars = review['stars']
#         yelp_star_labels.append(stars)
#         yelp_binary_labels.append(1 if stars > 2 else 0)

# print(yelp_sentences[:3])
# print(yelp_star_labels[:3])
# print(yelp_binary_labels[:3])
# print(len(yelp_sentences))

# Make part of data unlabeled

# Should use something like 4,000 labeled, 16,000 unlabeled
# Currently using 500 labeled, 1500 unlabeled
# sentences = yelp_sentences[:2000]
# labels = yelp_binary_labels[:2000]
# # labels += [-1] * 1500

# print(len(sentences), len(labels))

### 3B. Data Augmentation

This section contains the UDA (Unsupervised Data Augmentation) code. TF-IDF word replacement and back translation are used to augment the sentences.

In [None]:
!pip install transformers


In [None]:
from transformers import BertTokenizer

# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Loading BERT tokenizer...






#### 3B1. TF-IDF Replacement

This section contains our implementation of the TF-IDF word replacement described in the UDA paper.

In [None]:
import os
import math
import string
import numpy as np
from tqdm import tqdm
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

from nltk.tokenize import RegexpTokenizer
nltk_tokenizer_words_only = RegexpTokenizer(r'\w+')

# Load corpus and build data structures

corpus = sentences
corpus_tokenized = []
for sent in tqdm(corpus):
    temp_tokens = nltk_tokenizer_words_only.tokenize(sent)
    
    corpus_tokenized.append(' '.join(
        [t.lower() for t in temp_tokens]))

# Hyperparameter P
p = 0.7

corpus_size = len(corpus)
vectorizer = TfidfVectorizer(token_pattern = r"(?u)\b\w+\b")
tfidf_matrix = vectorizer.fit_transform(corpus_tokenized)
idf_scores = {}
frequencies = {}
scores = np.zeros(len(vectorizer.vocabulary_))

for sentence in tqdm(corpus_tokenized):
    document_words = set()
    for word in sentence.split(' '):
        # Count frequencies
        if word not in frequencies:
            frequencies[word] = 0
        frequencies[word] += 1

        # Document count
        if word not in document_words:
            if word not in idf_scores:
                idf_scores[word] = 0
            idf_scores[word] += 1
            document_words.add(word)

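# Convert the stored document frequencies into inverse document frequencies.
# (Named doc_freq rather than df so the pandas dataframe `df` is not overwritten.)
for word in tqdm(idf_scores.keys()):
    doc_freq = idf_scores[word]
    idf = math.log(corpus_size / doc_freq)
    idf_scores[word] = idf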

# Build probabilities for word selection
for word in tqdm(vectorizer.vocabulary_.keys()):
    if word == 'unk':
        continue
    idx = vectorizer.vocabulary_[word]
    scores[idx] = idf_scores[word] * frequencies[word]

# Compute max - cur_score
max_val = scores.max()
for idx, score in enumerate(scores):
    scores[idx] = max_val - score

# Normalize
total = 0
for score in scores:
    total += score
scores /= total

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


100%|██████████| 8551/8551 [00:00<00:00, 248558.45it/s]
100%|██████████| 8551/8551 [00:00<00:00, 172401.25it/s]
100%|██████████| 5392/5392 [00:00<00:00, 937360.10it/s]
100%|██████████| 5392/5392 [00:00<00:00, 909063.72it/s]
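
The construction of the sampling distribution in the cell above can be followed on a toy corpus. The sketch below is a pure-Python restatement under the same scoring rule (idf times corpus frequency, flipped with `max - score`, then normalized); `build_sampling_distribution` and the three toy documents are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def build_sampling_distribution(tokenized_docs):
    """Return {word: probability}, favouring words with a low idf * corpus frequency."""
    term_freq = Counter(w for doc in tokenized_docs for w in doc)
    doc_freq = Counter(w for doc in tokenized_docs for w in set(doc))
    n_docs = len(tokenized_docs)

    # Score each word by idf * frequency.
    score = {w: math.log(n_docs / doc_freq[w]) * term_freq[w] for w in term_freq}

    # Flip the scores (max - score) so low-scoring words get more mass, then normalize.
    max_score = max(score.values())
    flipped = {w: max_score - s for w, s in score.items()}
    total = sum(flipped.values())
    return {w: v / total for w, v in flipped.items()}


docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird", "flew", "flew"]]
probs = build_sampling_distribution(docs)
for word, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{word:5s} {prob:.3f}")
```

In this toy corpus "the" appears in every document, so its idf is zero and it receives the largest sampling probability, while the highest-scoring word ("flew") receives probability zero. Replacement words are therefore drawn mostly from uninformative vocabulary.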


In [None]:
from nltk.tokenize import RegexpTokenizer
nltk_tokenizer_with_hashtags = RegexpTokenizer(r'\w+|##\w+')

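def sample_corpus_keywords():
    # Draw a vocabulary index from the precomputed distribution and map it back to its word.
    idx = np.random.choice(len(scores), p=scores)
    # Note: scikit-learn 1.0+ provides get_feature_names_out() as the replacement for this call.
    return vectorizer.get_feature_names()[idx]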

def tfidf_replacement(idx, sentence):
    # Remove punctuation
    # sentence = ' '.join(nltk_tokenizer_with_hashtags.tokenize(sentence))

    tokens = nltk_tokenizer_words_only.tokenize(sentence)
    tokens = [t.lower() for t in tokens]
    tfidf_vector = tfidf_matrix[idx]
    C = tfidf_vector.max()
    Z = 0

    # Calculate Z
    for token in tokens:
        token_idx = vectorizer.vocabulary_[token]
        tfidf = tfidf_vector[0, token_idx]
        Z += C - tfidf
    Z /= len(tokens)
    
    # Calculate probabilities of replacement and make replacements
    new_sentence = []
    for token in tokens:
        token_idx = vectorizer.vocabulary_[token]
        tfidf = tfidf_vector[0, token_idx]
        prob = min(p * (C - tfidf) / Z, 1)

        # Make replacement with probability
        if np.random.uniform() < prob:
            new_sentence.append(sample_corpus_keywords())
        else:
            new_sentence.append(token)

    return ' '.join(new_sentence).replace(' ##', '')
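
The replacement probability used in `tfidf_replacement` is `min(p * (C - tfidf) / Z, 1)`, where `C` is the largest TF-IDF value in the sentence and `Z` is the mean of `C - tfidf` over its tokens. A small worked example on made-up TF-IDF values shows the effect; `replacement_probs` is a hypothetical helper, and the `z == 0` guard is an addition for this sketch that the cell above does not have.

```python
def replacement_probs(tfidf_values, p):
    """Per-token replacement probability: min(p * (C - tfidf) / Z, 1)."""
    c = max(tfidf_values)                  # C: highest TF-IDF value in the sentence
    gaps = [c - v for v in tfidf_values]   # C - tfidf for each token
    z = sum(gaps) / len(gaps)              # Z: mean gap over the sentence
    if z == 0:                             # sketch-only guard: every token ties with C
        return [0.0] * len(gaps)
    return [min(p * g / z, 1.0) for g in gaps]


probs = replacement_probs([0.6, 0.2, 0.4], p=0.7)
print([round(x, 3) for x in probs])  # [0.0, 1.0, 0.7]
```

The token with the highest TF-IDF value is never replaced, and the least informative token is replaced with the highest probability, capped at 1.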

#### 3B2. Back Translation

This section contains our implementation of back translation as described in the UDA paper.
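
The round trip itself does not depend on any particular translation service. The sketch below shows the control flow only: the two translator callables are hypothetical stand-ins backed by lookup tables, and the fallback to the original sentence on failure matches what the cell below does.

```python
def back_translate(sentence, to_pivot, from_pivot):
    """Round-trip a sentence through a pivot language; fall back to the input on failure."""
    try:
        return from_pivot(to_pivot(sentence))
    except Exception:
        # Keep the original sentence if either translation step fails.
        return sentence


# Stand-in "translators" (hypothetical): lookup tables in place of a real service.
EN_TO_PIVOT = {"the vase was smashed": "PIVOT_1"}
PIVOT_TO_EN = {"PIVOT_1": "the vase got broken"}

paraphrase = back_translate("the vase was smashed", EN_TO_PIVOT.__getitem__, PIVOT_TO_EN.__getitem__)
unchanged = back_translate("john hopes to sleep", EN_TO_PIVOT.__getitem__, PIVOT_TO_EN.__getitem__)
print(paraphrase)  # the vase got broken
print(unchanged)   # john hopes to sleep
```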

In [None]:
# Back Translation
!pip install BackTranslation

from nltk.tokenize import RegexpTokenizer
nltk_tokenizer_words_only = RegexpTokenizer(r'\w+')

from BackTranslation import BackTranslation
trans_kr = BackTranslation(url=[
        'translate.google.com',
        'translate.google.co.kr',
        ])

def backtranslate(sent):
    sent = ' '.join(nltk_tokenizer_words_only.tokenize(sent))
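    try:
        # Translate English -> Chinese -> English to obtain a paraphrase.
        result = trans_kr.translate(sent, src='en', tmp = 'zh-cn')
        return result.result_text
    except Exception:
        # Fall back to the unaugmented sentence if the translation request fails.
        return sent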


#### 3B3. Augment the data!

In [None]:
# TF-IDF replacement
# Only augments unlabeled examples
sentence_id_to_augmentations = {}
augmentation = 'TFIDF'

# for idx, data in tqdm(enumerate(zip(sentences, labels)), position=0, leave=True):
#     sentence, label = data
#     if label == -1:
#         if augmentation == 'BT':
#             replacement = backtranslate(sentence)
#         else:
#             replacement = tfidf_replacement(idx, sentence)
#         sentence_id_to_augmentations[idx] = replacement

for idx, data in tqdm(enumerate(zip(sentences, labels)), position=0, leave=True):
    sentence, label = data
    if label == -1:
        sentence_id_to_augmentations[idx] = []
        for i in range(5):
            replacement = tfidf_replacement(idx, sentence)
            # replacement = backtranslate(sentence)
            sentence_id_to_augmentations[idx].append(replacement)

8551it [08:00, 17.79it/s]


In [None]:
sentence_id_to_augmentations[2001]

['concert noise neck homer headache today terry',
 'france noise gave trainer headache enjoy terry',
 'field noise gave rare headache spirits terry',
 'blamed noise solid anxious headache spread terry',
 'npr noise cream few headache exist terry']

### 3C. Tokenization

The original and augmented sentences are vectorized in this section.
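
One detail of the encoding cell in this section is that it writes each sentence's index into the last position of its ID row and zeroes the attention mask there, so the training loop can recover the index from a shuffled batch and look up that sentence's augmentations. A minimal sketch of the trick on plain lists, using hypothetical helper names:

```python
def tag_with_sentence_id(input_ids, attention_mask, sentence_id):
    """Hypothetical helper: store sentence_id in the last slot and mask that slot out."""
    tagged_ids = list(input_ids)
    tagged_mask = list(attention_mask)
    tagged_ids[-1] = sentence_id   # the slot no longer holds a token ID
    tagged_mask[-1] = 0            # so attention must ignore it
    return tagged_ids, tagged_mask


def read_sentence_id(tagged_ids):
    """Recover the sentence index that was stored in the last slot."""
    return tagged_ids[-1]


ids = [101, 2028, 2062, 102, 0, 0, 0, 0]
mask = [1, 1, 1, 1, 0, 0, 0, 0]
tagged_ids, tagged_mask = tag_with_sentence_id(ids, mask, sentence_id=2001)
print(read_sentence_id(tagged_ids))  # 2001
```

The trick relies on two conditions that hold for this dataset. The last position must be padding: the longest encoded sentence is 47 tokens against an encoding size of 64, so no real token is overwritten. The stored index must also stay below the vocabulary size (30,522 for this model), because the embedding layer still looks up the value even though attention ignores it.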

In [None]:
# Determine max length sentence

SENT_MAX_LEN = 0
greater_than_max = 0

# For every sentence...
for sent in sentences:

    # Tokenize the text and add `[CLS]` and `[SEP]` tokens.
    input_ids = tokenizer.encode(sent, add_special_tokens=True)

    # Update the maximum sentence length.
    if len(input_ids) > 511:
        greater_than_max += 1
    SENT_MAX_LEN = max(SENT_MAX_LEN, len(input_ids))

SENT_MAX_LEN += 1
print('Max sentence length plus 1: ', SENT_MAX_LEN)
print('Sentences greater than 511 words:', greater_than_max)

# encoding_size = min(SENT_MAX_LEN, 64)
encoding_size = 64

print('Encoding dim:', encoding_size)

Max sentence length plus 1:  48
Sentences greater than 511 words: 0
Encoding dim: 64


In [None]:
# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
attention_masks = []

# For every sentence...
for idx, sent in enumerate(sentences):
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad or truncate the sentence to `max_length`
    #   (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
                        sent,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
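                        max_length = encoding_size, # Pad & truncate all sentences to this length.
                        padding = 'max_length',     # Replaces the deprecated pad_to_max_length.
                        truncation = True,          # Truncate explicitly to avoid the warning.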
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )

    # Tag the tensor with the sentence index by writing it into the last position,
    # so the training loop can look up this sentence's augmentations.
    encoded_dict['input_ids'][0, -1] = idx
    
    # Zero the attention mask at that position so the model ignores the stored index.
    encoded_dict['attention_mask'][0, -1] = 0

    # Add the encoded sentence to the list.
    input_ids.append(encoded_dict['input_ids'])
    
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)

# Print the first three sentences, now as lists of IDs, with their masks.
print('Original: ', sentences[:3])
print('Token IDs:', input_ids[:3])
print('masks:', attention_masks[:3])


Original:  ["Our friends won't buy this analysis, let alone the next one we propose."
 "One more pseudo generalization and I'm giving up."
 "One more pseudo generalization or I'm giving up."]
Token IDs: tensor([[  101,  2256,  2814,  2180,  1005,  1056,  4965,  2023,  4106,  1010,
          2292,  2894,  1996,  2279,  2028,  2057, 16599,  1012,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0],
        [  101,  2028,  2062, 18404,  2236,  3989,  1998,  1045,  1005,  1049,
          3228,  2039,  1012,   102,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     

In [None]:
# # Tokenize the augmentations
# for idx in sentence_id_to_augmentations.keys():
#     sent = sentence_id_to_augmentations[idx]
#     encoded_dict = tokenizer.encode_plus(
#                         sent,                      # Sentence to encode.
#                         add_special_tokens = True, # Add '[CLS]' and '[SEP]'
#                         max_length = encoding_size,           # Pad & truncate all sentences.
#                         pad_to_max_length = True,
#                         return_attention_mask = True,   # Construct attn. masks.
#                         return_tensors = 'pt',     # Return pytorch tensors.
#                    )

#     # Replace value in dictionary with tensor & attention mask
#     sentence_id_to_augmentations[idx] = (encoded_dict['input_ids'], encoded_dict['attention_mask'])

# Tokenize the augmentations
for idx in sentence_id_to_augmentations.keys():
    aug_list = sentence_id_to_augmentations[idx]
    aug_list_tokenized = []
    for sent in aug_list:
        encoded_dict = tokenizer.encode_plus(
                            sent,                      # Sentence to encode.
                            add_special_tokens = True, # Add '[CLS]' and '[SEP]'
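                            max_length = encoding_size, # Pad & truncate all sentences to this length.
                            padding = 'max_length',     # Replaces the deprecated pad_to_max_length.
                            truncation = True,          # Truncate explicitly to avoid the warning.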
                            return_attention_mask = True,   # Construct attn. masks.
                            return_tensors = 'pt',     # Return pytorch tensors.
                    )

        # Replace value in dictionary with tensor & attention mask
        aug_list_tokenized.append((encoded_dict['input_ids'], encoded_dict['attention_mask']))
    sentence_id_to_augmentations[idx] = aug_list_tokenized

print(sentence_id_to_augmentations[len(sentences) - 1])



[(tensor([[  101, 12361,  6293,  5416,  2309,  2004, 26775, 20755,  1060,  4234,
           102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0]]), tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])), (tensor([[  101,  7270,  7165,  2055, 15723,  2131,  2005,  4234,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0, 

## 4. Dataloader

The DataLoader is initialized with the entire training set, labeled and unlabeled examples together. No validation split is held out here.

In [None]:
from torch.utils.data import TensorDataset, random_split
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

dataset = TensorDataset(input_ids, attention_masks, labels)
batch_size = 32

# Create the DataLoader for the training set.
# We'll take training samples in random order. 
train_dataloader = DataLoader(
            dataset,  # The training samples.
            sampler = RandomSampler(dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

## 5. Train Model

Fine-tunes `BertForSequenceClassification` as the classifier. Hyperparameters are copied from the guide linked at the top of this notebook.

### 5A. Training Setup

Imports the model, sets hyperparameters and optimizer. Loads helper functions.

In [None]:
from transformers import BertForSequenceClassification, AdamW, BertConfig

# Load BertForSequenceClassification, the pretrained BERT model with a single 
# linear classification layer on top. 
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", # Use the 12-layer BERT model, with an uncased vocab.
    num_labels = 2, # The number of output labels--2 for binary classification.
                    # You can increase this for multi-class tasks.   
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)

# Tell pytorch to run this model on the GPU.
model.cuda()


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [None]:
# Note: AdamW is a class from the huggingface library (as opposed to pytorch).
# The 'W' refers to the decoupled weight decay fix.
optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5
                  eps = 1e-8 # args.adam_epsilon  - default is 1e-8.
                )

In [None]:
from transformers import get_linear_schedule_with_warmup

# Number of training epochs. The BERT authors recommend between 2 and 4.
# Note: section 5B later overrides this value with epochs = 15, while the
# scheduler below is sized for the 4 epochs set here.
epochs = 4

# Total number of training steps is [number of batches] x [number of epochs]. 
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)
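
To see what the scheduler does to the learning rate, the sketch below reproduces the shape of a linear schedule with warmup in plain Python. It assumes the multiplier ramps linearly from 0 to 1 over the warmup steps and then decays linearly to 0 at `num_training_steps`; `linear_schedule_factor` is a hypothetical name, not a library function.

```python
import math


def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


batches_per_epoch = math.ceil(8551 / 32)   # 268 batches for 8,551 sentences at batch size 32
total_steps = batches_per_epoch * 4        # 1072 steps for 4 epochs

for step in (0, 536, 1072, 2000):
    print(step, 2e-5 * linear_schedule_factor(step, 0, total_steps))
```

Under this assumption, every step past `total_steps` gets a learning rate of zero. That matters if the training loop runs for more epochs than the 4 used to compute `total_steps`.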

In [None]:
# Helper functions
import numpy as np

# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

import time
import datetime

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
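
A quick check of the two helpers on small inputs. `format_time` is repeated so the snippet runs on its own, and `flat_accuracy_lists` is a hypothetical pure-Python counterpart to `flat_accuracy` that works on nested lists instead of NumPy arrays.

```python
import datetime


def format_time(elapsed):
    """Same helper as above: seconds in, h:mm:ss string out."""
    return str(datetime.timedelta(seconds=int(round(elapsed))))


def flat_accuracy_lists(logits, labels):
    """Fraction of rows whose arg-max logit matches the label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


print(format_time(3725.4))  # 1:02:05
print(format_time(59.6))    # 0:01:00
accuracy = flat_accuracy_lists([[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]], [1, 0, 0])
print(accuracy)             # two of three predictions are correct
```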

### 5B. Training Loop

In [None]:
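# Hyperparameters
beta = 0.9          # Confidence threshold an unlabeled prediction must exceed.
temperature = 0.4   # Softmax temperature used to sharpen predictions.
weight = 0.5        # Weight of the unsupervised loss term.
epochs = 15

# Rebuild the scheduler: it was sized for 4 epochs in 5A and would otherwise reach a zero learning rate early.
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 0, num_training_steps = total_steps)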

#### 5B1. UDA Training Block

This section trains the model using UDA.
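
The loop in this section combines three ideas: a confidence threshold (`beta`) that decides whether an unlabeled example contributes at all, temperature sharpening of its prediction, and a weighted sum of the supervised and unsupervised losses. The sketch below isolates that arithmetic in plain Python. The function names are hypothetical and the loss values are made-up numbers, so this illustrates the bookkeeping rather than the training code itself.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, optionally divided by a temperature first."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def is_confident(logits, beta):
    """An unlabeled example contributes only if its top probability exceeds beta."""
    return max(softmax(logits)) > beta


def combined_loss(sup_losses, unsup_losses, weight):
    """Mean supervised loss plus weight times mean unsupervised loss (empty lists add 0)."""
    loss = 0.0
    if sup_losses:
        loss += sum(sup_losses) / len(sup_losses)
    if unsup_losses:
        loss += weight * sum(unsup_losses) / len(unsup_losses)
    return loss


print(is_confident([2.0, 0.0], beta=0.9))     # False: top probability is about 0.88
print(is_confident([3.0, 0.0], beta=0.9))     # True: top probability is about 0.95
print(softmax([2.0, 0.0], temperature=0.4))   # sharper than the plain softmax
print(combined_loss([0.7, 0.5], [0.2], weight=0.5))
```

With `temperature` below 1, the sharpened distribution puts more mass on the top class than the plain softmax does, which is the purpose of the `temperature = 0.4` setting.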

In [None]:
## TRAINING FOR UDA ONLY

import random
import numpy as np
import torch.nn.functional as F

# This training code is based on the `run_glue.py` script here:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128

# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# We'll store a number of quantities such as training and validation loss, 
# validation accuracy, and timings.
training_stats = []

# Measure the total training time for the whole run.
total_t0 = time.time()

# For each epoch...
for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_train_loss = 0

    # Put the model into training mode. Don't be misled--the call to 
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):

        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Calculate elapsed time as an h:mm:ss string.
            elapsed = format_time(time.time() - t0)
            
            # Report progress.
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Unpack this training batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using the 
        # `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because 
        # accumulating the gradients is "convenient while training RNNs". 
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()        

        # Perform a forward pass (evaluate the model on this training batch).
        # In PyTorch, calling `model` will in turn call the model's `forward` 
        # function and pass down the arguments. The `forward` function is 
        # documented here: 
        # https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification
        # The results are returned in a results object, documented here:
        # https://huggingface.co/transformers/main_classes/output.html#transformers.modeling_outputs.SequenceClassifierOutput
        # Specifically, we'll get the loss (because we provided labels) and the
        # "logits"--the model outputs prior to activation.

        result = model(b_input_ids, 
                       token_type_ids=None, 
                       attention_mask=b_input_mask, 
                    #    labels=b_labels,
                       return_dict=True)
        
        # loss: 1D tensor holding loss value
        # logits: (batch_size, num_classes) tensor holding logits of each classification

        # loss = result.loss
        sup_loss = torch.tensor(0.0, requires_grad=True).to(device)
        unsup_loss = torch.tensor(0.0, requires_grad=True).to(device)
        sup_loss_count = 0
        unsup_loss_count = 0
        logits = result.logits

        for ind_input_ids, ind_logits, ind_label in zip(b_input_ids, logits, b_labels):
            if ind_label != -1:
                sup_loss += F.cross_entropy(ind_logits.unsqueeze(0), ind_label.unsqueeze(0))
                sup_loss_count += 1
            else:
                # Only use this unlabeled example if the prediction is confident enough.
                if torch.max(F.softmax(ind_logits, dim=-1)) > beta:
                    # Grab index of sentence
                    idx = ind_input_ids[len(ind_input_ids) - 1].item()
                    
                    # print('ind_logits', ind_logits)

                    # Sharpen the predictions (dim is given explicitly; the implicit form is deprecated).
                    sharpened = F.log_softmax(torch.div(ind_logits, temperature), dim=-1)
                    sharpened_2d = torch.unsqueeze(sharpened, 0)
                    
                    # print('sharpened', sharpened)

                    # Iterate over all augmentations
                    aug_list = sentence_id_to_augmentations[idx]

                    for aug_input_ids, aug_input_mask in aug_list:
                        # Calculate classification on augmented sample

                        # print('aug input ids:', aug_input_ids.shape)
                        # print('aug input mask:', aug_input_mask.shape)

                        aug_result = model(aug_input_ids.to(device), 
                            token_type_ids=None, 
                            attention_mask=aug_input_mask.to(device), 
                            return_dict=True)

                        aug_label = torch.unsqueeze(torch.argmax(aug_result.logits), 0)

                        # print('sharpened2d:', sharpened_2d)
                        # print('aug_result_logits:', aug_result.logits)
                        # print('aug_label:', aug_label)

                        unsup_loss += F.cross_entropy(sharpened_2d, aug_label)
                        unsup_loss_count += 1

        loss = torch.tensor(0.0, requires_grad=True).to(device)

        if sup_loss_count > 0:
            loss += (sup_loss / sup_loss_count)
        if unsup_loss_count > 0:
            loss += (weight * (unsup_loss / unsup_loss_count))


        # print('train loss', loss, 'loss count', loss_count)   

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value 
        # from the tensor.
        total_train_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)            
    
    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))
        
print("")
print("Training complete!")

print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))


Training...
  Batch    40  of    268.    Elapsed: 0:00:14.
  Batch    80  of    268.    Elapsed: 0:00:27.
  Batch   120  of    268.    Elapsed: 0:00:41.
  Batch   160  of    268.    Elapsed: 0:00:54.
  Batch   200  of    268.    Elapsed: 0:01:09.
  Batch   240  of    268.    Elapsed: 0:01:23.

  Average training loss: 0.69
  Training epoch took: 0:01:32

Running Validation...

Training...
  Batch    40  of    268.    Elapsed: 0:00:15.
  Batch    80  of    268.    Elapsed: 0:00:30.
  Batch   120  of    268.    Elapsed: 0:00:45.
  Batch   160  of    268.    Elapsed: 0:01:00.
  Batch   200  of    268.    Elapsed: 0:01:15.
  Batch   240  of    268.    Elapsed: 0:01:31.

  Average training loss: 0.82
  Training epoch took: 0:01:43

Running Validation...

Training...
  Batch    40  of    268.    Elapsed: 0:00:17.
  Batch    80  of    268.    Elapsed: 0:00:35.
  Batch   120  of    268.    Elapsed: 0:00:51.
  Batch   160  of    268.    Elapsed: 0:01:08.
  Batch   200  of    268.    Elapsed: 0:01:25.
  Batch   240  of    268.    Elapsed: 0:01:43.

  Average training loss: 0.87
  Training epoch took
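The unlabeled branch of the training loop above keeps only predictions whose top class probability exceeds `beta`, then sharpens them by dividing the logits by `temperature`. A minimal batched sketch of just those two steps follows. The function name `confident_sharpened_targets` and the example logits are illustrative assumptions, not part of the notebook, and the sketch returns probabilities where the loop above works with log-probabilities.

```python
import torch
import torch.nn.functional as F


def confident_sharpened_targets(logits, beta, temperature):
    """Return (mask, sharpened) for a batch of unlabeled logits.

    mask[i] is True when the most likely class of example i has
    probability greater than beta. sharpened[i] is
    softmax(logits[i] / temperature); a temperature below 1 makes the
    distribution more peaked.
    """
    probs = F.softmax(logits, dim=-1)
    mask = probs.max(dim=-1).values > beta
    sharpened = F.softmax(logits / temperature, dim=-1)
    return mask, sharpened


# Hypothetical logits for three unlabeled sentences and two classes.
logits = torch.tensor([[4.0, 0.0], [0.2, 0.1], [-3.0, 2.0]])
mask, sharpened = confident_sharpened_targets(logits, beta=0.9, temperature=0.4)

print(mask)       # which examples pass the confidence threshold
print(sharpened)  # sharpened class probabilities per example
```

Only the examples selected by `mask` would contribute to the unsupervised loss; the second example here is too uncertain and would be skipped.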

#### 5B2. SUPERVISED TRAINING BLOCK

This section contains the original supervised training code.
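Stripped of logging and bookkeeping, each supervised training step follows the same order: clear gradients, forward pass, backward pass, gradient clipping, optimizer step, scheduler step. The sketch below illustrates only that order on a stand-in linear model with random data. The model, the optimizer settings, and the linearly decaying `LambdaLR` schedule are assumptions chosen for illustration, not the notebook's BERT configuration.

```python
import torch
from torch import nn

torch.manual_seed(42)

# Stand-in model and data (assumption); the notebook trains a BERT classifier.
model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
# Hypothetical schedule: the learning rate decays linearly over 10 steps.
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: 1.0 - step / 10)

inputs = torch.randn(8, 4)
labels = torch.randint(0, 2, (8,))

# Keep a copy of the weights so the update can be observed.
before = model.weight.detach().clone()

model.train()                 # switch to training mode (no training happens here)
model.zero_grad()             # clear gradients left over from a previous step
loss = nn.functional.cross_entropy(model(inputs), labels)
loss.backward()               # compute gradients
# Clip the global gradient norm to 1.0; the return value is the norm
# measured before clipping.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()              # apply the parameter update
scheduler.step()              # advance the learning-rate schedule

print('loss:', loss.item())
print('gradient norm before clipping:', total_norm.item())
print('learning rate after one step:', scheduler.get_last_lr()[0])
```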

In [None]:
## TRAINING LOOP FOR SUPERVISED MODEL (run this block only if the previous block was not run)
import random

# This training code is based on the `run_glue.py` script here:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128

training_stats = []
# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Store the average loss after each epoch so we can plot them.
loss_values = []

# For each epoch...
for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_loss = 0
    total_eval_loss = 0
    total_eval_accuracy = 0


    # Put the model into training mode. Don't be misled--the call to 
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):

        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Calculate elapsed time in minutes.
            elapsed = format_time(time.time() - t0)
            
            # Report progress.
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Unpack this training batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using the 
        # `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because 
        # accumulating the gradients is "convenient while training RNNs". 
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()        

        # Perform a forward pass (evaluate the model on this training batch).
        # Because we provide the `labels`, the first element of the output is
        # the loss (followed by the logits).
        # The documentation for this `model` function is here: 
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
        outputs = model(b_input_ids, 
                    token_type_ids=None, 
                    attention_mask=b_input_mask, 
                    labels=b_labels)
        
        # Pull the loss value out of the model output. Indexing with `[0]`
        # works whether the model returns a plain tuple or an output object.
        loss = outputs[0]

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value 
        # from the tensor.
        total_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over the training data.
    avg_train_loss = total_loss / len(train_dataloader)            
    
    # Store the loss value for plotting the learning curve.
    loss_values.append(avg_train_loss)

    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))
        
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        
        # Telling the model not to compute or store gradients, saving memory and
        # speeding up validation
        with torch.no_grad():        

            # Forward pass. Because we provide the labels, the output contains
            # the validation loss followed by the logit predictions.
            # token_type_ids is the same as the "segment ids", which 
            # differentiates sentence 1 and 2 in 2-sentence tasks.
            # The documentation for this `model` function is here: 
            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
            outputs = model(b_input_ids, 
                            token_type_ids=None, 
                            attention_mask=b_input_mask,
                            labels=b_labels)
        
        # Get the loss and the "logits" output by the model. The "logits" are
        # the output values prior to applying an activation function like the
        # softmax.
        loss = outputs[0]
        logits = outputs[1]

        # Accumulate the validation loss for this batch (not the last
        # training loss).
        total_eval_loss += loss.item()

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Calculate the accuracy for this batch of validation sentences, and
        # accumulate it over all batches.
        total_eval_accuracy += flat_accuracy(logits, label_ids)

    # Report the final accuracy for this validation run.
    avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
    print("  Accuracy: {0:.2f}".format(avg_val_accuracy))

    # Calculate the average loss over all of the batches.
    avg_val_loss = total_eval_loss / len(validation_dataloader)
    
    # Measure how long the validation run took.
    validation_time = format_time(time.time() - t0)
    
    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Valid. Accur.': avg_val_accuracy,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")

## 6. Model Evaluation

The model is evaluated on held-out test data: the CoLA out-of-domain dev set.

### 6A. Loading Dataset
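The cell below relies on `encode_plus` to pad or truncate every sentence to `max_length` and to build a matching attention mask. As a rough illustration of what those two steps produce, here is a plain-Python sketch. The helper `pad_and_mask` and the token IDs are hypothetical, and the truncation is simplified: a real tokenizer keeps the closing `[SEP]` token when it truncates, which this sketch does not.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Pad or truncate `token_ids` to `max_length` and build the attention mask.

    The mask holds 1 for real tokens and 0 for padding positions.
    """
    truncated = token_ids[:max_length]
    n_pad = max_length - len(truncated)
    input_ids = truncated + [pad_id] * n_pad
    attention_mask = [1] * len(truncated) + [0] * n_pad
    return input_ids, attention_mask


# Hypothetical token IDs; 101 and 102 stand in for [CLS] and [SEP].
ids, mask = pad_and_mask([101, 2023, 2003, 102], max_length=6)
print(ids)   # [101, 2023, 2003, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```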


In [None]:
import pandas as pd

# Load the dataset into a pandas dataframe.
df = pd.read_csv("./cola_public/raw/out_of_domain_dev.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

# Report the number of sentences.
print('Number of test sentences: {:,}\n'.format(df.shape[0]))

# Create sentence and label lists
sentences = df.sentence.values
labels = df.label.values

# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
attention_masks = []

# For every sentence...
for sent in sentences:
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad or truncate the sentence to `max_length`
    #   (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
                        sent,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        max_length = 64,           # Pad & truncate all sentences.
                        padding = 'max_length',    # Replaces the deprecated `pad_to_max_length`.
                        truncation = True,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])
    
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)

# Set the batch size.  
batch_size = 32  

# Create the DataLoader.
prediction_data = TensorDataset(input_ids, attention_masks, labels)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)

Number of test sentences: 516





### 6B. Evaluating on test dataset

In [None]:
# Prediction on test set

print('Predicting labels for {:,} test sentences...'.format(len(input_ids)))

# Put model in evaluation mode
model.eval()

# Tracking variables 
predictions , true_labels = [], []

# Predict 
for batch in prediction_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch
    
    # Telling the model not to compute or store gradients, saving memory and 
    # speeding up prediction
    with torch.no_grad():
        # Forward pass, calculate logit predictions.
        result = model(b_input_ids, 
                        token_type_ids=None, 
                        attention_mask=b_input_mask,
                        return_dict=True)

    logits = result.logits

    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()
    
    # Store predictions and true labels
    predictions.append(logits)
    true_labels.append(label_ids)

correct = 0
total = 0
for px, tx in zip(predictions, true_labels):
    for p, t in zip(px, tx):
        if np.argmax(p) == t:
            correct += 1
        total += 1

print('epochs:', epochs)
print('beta:', beta)
print('temp:', temperature)
print('weight:', weight)
# print('augmentation:', augmentation)

print("Accuracy:", correct / total)

print('    DONE.')

Predicting labels for 516 test sentences...
epochs: 15
beta: 0.9
temp: 0.4
weight: 0.5
Accuracy: 0.7577519379844961
    DONE.
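The accuracy above is computed with a nested loop over batches and examples. The same quantity can be obtained by concatenating the batches and comparing arg-max predictions to labels in one step. The sketch below uses a hypothetical helper name, `accuracy_from_batches`, and made-up logits purely for illustration.

```python
import numpy as np


def accuracy_from_batches(predictions, true_labels):
    """Compute accuracy from per-batch logit arrays and label arrays."""
    logits = np.concatenate(predictions, axis=0)
    labels = np.concatenate(true_labels, axis=0)
    # The predicted class is the column with the highest logit.
    return float(np.mean(np.argmax(logits, axis=1) == labels))


# Hypothetical logits for two batches (sizes 2 and 1) and their labels.
example_predictions = [np.array([[2.0, -1.0], [0.1, 0.3]]), np.array([[-0.5, 0.5]])]
example_labels = [np.array([0, 0]), np.array([1])]

# Two of the three examples are classified correctly.
print(accuracy_from_batches(example_predictions, example_labels))
```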


In [None]:
print('Positive samples: %d of %d (%.2f%%)' % (df.label.sum(), len(df.label), (df.label.sum() / len(df.label) * 100.0)))

In [None]:
from sklearn.metrics import matthews_corrcoef

matthews_set = []

# Evaluate each test batch using the Matthews correlation coefficient
print('Calculating Matthews Corr. Coef. for each batch...')

# For each input batch...
for i in range(len(true_labels)):
  
  # The predictions for this batch are a 2-column ndarray (one column for "0" 
  # and one column for "1"). Pick the label with the highest value and turn this
  # into an array of 0s and 1s.
  pred_labels_i = np.argmax(predictions[i], axis=1).flatten()
  
  # Calculate and store the coef for this batch.  
  matthews = matthews_corrcoef(true_labels[i], pred_labels_i)                
  matthews_set.append(matthews)

In [None]:
# Create a barplot showing the MCC score for each batch of test samples.
ax = sns.barplot(x=list(range(len(matthews_set))), y=matthews_set, ci=None)

plt.title('MCC Score per Batch')
plt.ylabel('MCC Score (-1 to +1)')
plt.xlabel('Batch #')

plt.show()

In [None]:
# Combine the results across all batches. 
flat_predictions = np.concatenate(predictions, axis=0)

# For each sample, pick the label (0 or 1) with the higher score.
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()

# Combine the correct labels for each batch into a single list.
flat_true_labels = np.concatenate(true_labels, axis=0)

# Calculate the MCC
mcc = matthews_corrcoef(flat_true_labels, flat_predictions)

print('Total MCC: %.3f' % mcc)
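For binary labels, the Matthews correlation coefficient computed above can also be written directly in terms of the confusion counts: MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The following is a minimal NumPy sketch of that formula. The helper name `mcc_binary` and the example labels are assumptions for illustration, and returning 0 when the denominator is zero is a convention adopted here.

```python
import numpy as np


def mcc_binary(y_true, y_pred):
    """Matthews correlation coefficient for 0/1 labels from confusion counts."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    if denom == 0:
        # Degenerate case, e.g. every prediction is the same class.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical labels: mostly positive, as in an imbalanced dataset.
example_true = [1, 1, 1, 0, 0, 1]
print(mcc_binary(example_true, [1, 1, 1, 1, 0, 1]))  # one false positive
print(mcc_binary(example_true, [1, 1, 1, 1, 1, 1]))  # always predicts class 1
```

The second call shows why MCC is informative on imbalanced data: predicting the majority class for every sentence is correct on four of the six examples, yet its MCC is 0.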