
**David Hardage, EE 6363 Spring 2019 Final Project**

# Introduction
To quote numerous bloggers, writers, and natural language processing (NLP) enthusiasts, last year marked NLP’s “ImageNet moment”. In 2018, the introduction of large pretrained models enabled practitioners to use transfer learning in language modeling tasks. The three best known of these models, ELMo, GPT, and BERT, set the bar on performance for many NLP benchmark tasks. Of these three, BERT (Bidirectional Encoder Representations from Transformers) ended the year as “king” of the NLP hill. In this write-up, I will cover the inner workings of BERT and look at BERT’s use in a question answering task using the Stanford Question Answering Dataset (SQuAD) 2.0.

To start, let’s briefly look at BERT’s predecessors: the Allen Institute’s ELMo and OpenAI’s GPT.

![alt text](https://1.bp.blogspot.com/-RLAbr6kPNUo/W9is5FwUXmI/AAAAAAAADeU/5y9466Zoyoc96vqLjbruLK8i_t8qEdHnQCLcBGAs/s640/image3.png)

Image: Google AI Blog [1]

ELMo’s architecture is based on bidirectional LSTMs, so this language model has a sense of both the preceding and the following words. In addition, ELMo’s embeddings are based on the entire sequence, so the term “bank” will have a different representation in the sentences “I am going to the bank” and “You fell off the river bank” [2]. While ELMo is usable pretrained, it must be incorporated into more complex task-specific architectures to reach state-of-the-art performance. Following ELMo, OpenAI’s GPT uses transformers, which allow the pretrained model to reach state-of-the-art performance with less architectural overhead [4]. However, GPT is not bidirectional like ELMo. In a way, BERT incorporates the best of ELMo and GPT into its architecture. Like GPT, BERT’s architecture consists of transformers. Unlike GPT, BERT is bidirectional. Before diving further into BERT, I will provide a quick recap of attention and describe the transformers mentioned in this paragraph.

# Transformers and Attention 

No, it is not a Decepticon or Autobot. A transformer is a new neural network architecture which uses attention without any recurrent cells. The absence of recurrence increases the ability of practitioners to distribute computation when using transformer based architectures [5]. In order to better understand how a transformer functions, I will provide a short recap of attention in sequence to sequence networks. 

![alt text](https://drive.google.com/uc?export=view&id=1fnj4Ue19_bSfD92fdzdGFAieZ5e5YkvK)

Image: NUS CS6101 Deep Learning for NLP S8 [6]

In the diagram above, the encoder has hidden states (in red) for each time step in the input sequence. These hidden states contain some information for that particular token at that time step. The dot product of the decoder RNN hidden state (in green) and each hidden state in the encoder RNN creates the attention scores. These attention scores go through a softmax to produce the attention distribution. The resulting attention distribution is combined with the encoder’s hidden states to produce a weighted sum vector with the same dimension as the hidden states. This attention vector contains more information from the encoder hidden states which received the most “attention” [6].
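The recap above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the function names (`softmax`, `attend`) and the toy two-dimensional hidden states are made up for the example and do not come from the lecture or from any library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (attention distribution, attention vector) for one decoder step."""
    # Attention scores: dot product of the decoder state with each encoder state.
    scores = [sum(d * h for d, h in zip(decoder_state, state))
              for state in encoder_states]
    # Attention distribution: softmax over the scores.
    weights = softmax(scores)
    # Attention vector: weighted sum of the encoder hidden states.
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


# Toy example: three encoder time steps with 2-dimensional hidden states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
print(weights)  # the first and third states receive the most attention
print(context)  # same dimension as a hidden state
```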

Transformers operate on three vector types: queries (Q), keys (K), and values (V). In the context of the diagram above, the keys and values are the encoder’s hidden states and the query is the decoder hidden state. The dot product of the query with each key goes through a softmax, and the resulting weights are combined with the value vectors to obtain the attention output. In a full transformer architecture, this process takes place in encoder and decoder cells. In the diagram below, the encoder is represented on the left and the decoder on the right.

![alt text](https://drive.google.com/uc?export=view&id=1nTMbuHOEfdH7pabqXwai_7yVK8WR03GV)

Image: Attention Is All You Need [5]

Since there are no recurrent cells in this architecture, all inputs are summed with a positional encoding (PE) based on the token’s location in the sequence. In the paper, this PE is generated with sine and cosine functions. The authors hypothesize this will allow the model to better handle inputs longer than those seen in training. BERT does not use this type of PE; instead it uses a learned positional embedding.
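As a rough sketch of the sine and cosine encoding from the transformer paper [5]: even dimensions use a sine and odd dimensions use a cosine, with wavelengths that grow geometrically with the dimension index. The function name and the tiny sizes below are my own choices for illustration. BERT replaces this fixed table with a learned embedding, which is the `position_embeddings` layer in the appendix code.

```python
import math


def sinusoidal_positional_encoding(max_len, d_model):
    """Build a max_len x d_model table of fixed positional encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            # Each sine/cosine pair (dimensions 2k and 2k + 1) shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = sinusoidal_positional_encoding(max_len=4, d_model=4)
for position, row in enumerate(pe):
    print(position, [round(value, 4) for value in row])
```

Each row of this table is added element-wise to the token embedding at that position.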

Once the PE is added to the input, Q, K, and V move into the multi-head attention portion (black circle). First, these vectors are passed through linear projections, one set per attention head. The result is multiple projected versions of Q, K, and V, each produced by its own randomly initialized weight matrix. Since their initializations are not the same, each of these representation subspaces will look at the relationships between the input tokens in each sequence differently.

Now that we have Q, K, and V in multiple representation subspaces, each moves into scaled dot product attention (purple circle). The attention is the same as described in the above recap of attention, so let’s focus on the one difference: the scale. The dot product of Q and K is divided by the square root of the dimension of the key vector. This keeps the softmax from saturating on large dot products, which helps to ensure a more stable gradient during training.
 
 ![alt text](https://cdn-images-1.medium.com/max/1800/1*lH5NKkkZjsGvjQmlis3_uw.png)
 
Image: Attention Is All You Need [5]
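A minimal sketch of scaled dot product attention for a single head, written with plain Python lists so that every step is visible. The function name and the toy inputs are assumptions made for this example; the appendix code performs the same computation on batched tensors in `BertSelfAttention.forward`.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = len(keys[0])
    outputs = []
    for query in queries:
        # Dot product of the query with every key, divided by sqrt(d_k).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
                  for key in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * value[j] for w, value in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


# Two identical keys receive equal weight, so the output is the mean of the values.
queries = [[3.0, 1.0]]
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[0.0, 2.0], [4.0, 6.0]]
print(scaled_dot_product_attention(queries, keys, values))
```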

To put all of the attention subspaces back together, the heads are concatenated and multiplied by an output weight matrix which is trained jointly with the model. In short, this step takes the outputs from the different attention heads, combines their information, and puts them into the correct dimensions to move forward into the model.

The next step in the encoder cell is a residual connection, where the attention output is added element-wise to the original input and the sum is normalized. In essence, this just adds learned information to the original input and normalizes to keep it from growing too big. After this residual connection, the tokens are passed into a feed forward network. This network is considered position-wise because each vector passes through the exact same network separately. Overall, this network acts like a convolutional layer with a kernel size of one. Last, the output of the feed forward network goes through another residual connection before leaving this particular encoding cell.
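The "add, then normalize" step can be sketched for a single token vector as follows. This is a simplification I am assuming for illustration: the real layer normalization also applies a learned scale and bias per dimension, which are left out here, and the function names are hypothetical.

```python
import math


def layer_norm(vector, eps=1e-12):
    """Normalize one vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]


def add_and_norm(sublayer_input, sublayer_output):
    """Residual connection followed by layer normalization."""
    summed = [a + b for a, b in zip(sublayer_input, sublayer_output)]
    return layer_norm(summed)


token_vector = [1.0, 2.0, 3.0]        # input to the attention sub-layer
attention_output = [0.5, 0.0, -0.5]   # what the sub-layer produced for this token
print(add_and_norm(token_vector, attention_output))
```

The same pattern is applied a second time around the feed forward network.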

As for the decoder, it essentially goes through the same process, but with the input shifted and masking. The masking is done to prevent the decoder from “cheating” by being able to “see” the correct token. Since BERT does not utilize decoders, I will stop here, but I have included some helpful visual representations of the decoder below:

![alt text](https://i1.wp.com/mlexplained.com/wp-content/uploads/2019/02/Screen-Shot-2019-02-10-at-5.58.36-PM.png?resize=1024%2C539)

Image: Paper Dissected: “Attention is All You Need” Explained [7]

![alt text](https://i0.wp.com/mlexplained.com/wp-content/uploads/2019/02/Screen-Shot-2019-02-10-at-6.05.02-PM.png?resize=1024%2C470)
Image: Paper Dissected: “Attention is All You Need” Explained [7]
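The decoder masking described above can be sketched as an additive mask: positions the decoder is not allowed to see get a very large negative score, so the softmax assigns them a weight of zero. The names and the `-1e9` constant are my own choices for this illustration.

```python
import math

NEG_INF = -1e9  # large negative value standing in for minus infinity


def causal_mask(length):
    """Row i allows attention to positions 0..i and blocks everything after i."""
    return [[0.0 if key <= query else NEG_INF for key in range(length)]
            for query in range(length)]


def masked_softmax(scores, mask_row):
    """Add the mask to the raw scores, then apply a stable softmax."""
    shifted = [s + m for s, m in zip(scores, mask_row)]
    peak = max(shifted)
    exps = [math.exp(v - peak) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(3)
uniform_scores = [0.0, 0.0, 0.0]
for row in mask:
    print(masked_softmax(uniform_scores, row))
```

With uniform scores, each position spreads its attention evenly over itself and the positions before it, and gives zero weight to everything after it. The appendix code adds its attention mask to the scores in the same additive way, although BERT uses that mask to hide padding tokens, not future tokens.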

# BERT

As mentioned above, BERT only uses the encoder cells from the transformer architecture. These encoders are stacked on top of each other and learn the relationships between words and sentences from two “pre-training” tasks:
*   "Masked Language Model": 15% of the tokens in an input sequence are selected for prediction. Of these, 80% are replaced with a [MASK] token, 10% are replaced with a random word, and 10% are left unchanged. The model then needs to predict the id of each selected token.
*   "Next Sentence Prediction": Takes input sentence pairs, replaces 50% of the second sentences with a random sentence, and trains the model to predict whether the second sentence actually follows the first. This is helpful for tasks like question answering.
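To make the masked language model task concrete, here is a small sketch of the corruption step, assuming the 80/10/10 split from the BERT paper. It is a simplification: each token is selected independently with probability 15%, which does not guarantee that exactly 15% of the tokens are picked. The function name, the toy vocabulary, and the use of `None` for "no prediction needed" are all assumptions made for this example.

```python
import random

SPECIAL_TOKENS = ("[CLS]", "[SEP]")


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted tokens, labels); a label is None where no prediction is needed."""
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        # Never select the special tokens; select others with probability mask_prob.
        if token in SPECIAL_TOKENS or rng.random() >= mask_prob:
            continue
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: leave the token unchanged
    return corrupted, labels


rng = random.Random(0)
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
corrupted, labels = mask_tokens(tokens, vocab=["dog", "ran", "tree"], rng=rng)
print(corrupted)
print(labels)
```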

These tasks are performed using word pieces from BookCorpus (800M words) and English Wikipedia (2,500M words) with two architectures:
*   BERT-Base: 12-layer, 768-hidden, 12-heads, 110M parameters
*   BERT-Large: 24-layer, 1024-hidden, 16-heads, 340M parameters

There are cased and uncased versions of both Base and Large available, depending on whether case is important for the task the pretrained network is being used to solve. For my implementation I use BERT-Base, because BERT-Large requires too much memory with the standard implementation and the Adam optimizer on a GPU with 12-15 GB of memory. To use BERT-Large, a Cloud TPU is recommended [8].

Let's review the architecture used and how it functions in a question answering task.

![alt text](https://drive.google.com/uc?export=view&id=1BW6o4k6ZCWAGierc6DPcZqAhtBdzyhWb)
 
Starting from the bottom, I input questions followed by context paragraphs from SQuAD 2.0. This dataset contains 100,000 question and context paragraph pairs in which the context contains the answer, and 50,000 pairs in which the question is unanswerable with the provided context. The intent of SQuAD 2.0 is to create more robust reading comprehension systems by challenging researchers to create systems which know when they cannot answer a question [9].

For the input sequence pairs, the first sequence starts with [CLS] and each sequence ends with [SEP]. The words are tokenized into word pieces and assigned three embeddings, which are summed and passed forward into the model:
*    Token Embedding: a 768-dimensional vector representation for each word piece.
*    Sentence Embedding: an embedding that distinguishes between the sequences in a paired input. Tokens in the first sequence receive 0 as the input id and tokens in the second receive 1.
*    Positional Embedding: a learned positional embedding which supports sequence lengths up to the max input size of 512 tokens.  
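The three inputs behind these embeddings can be sketched for one question and context pair as shown below. The function name is hypothetical, and the tokens are assumed to already be word pieces; a real pipeline would produce them with the WordPiece tokenizer and map them to vocabulary ids.

```python
def build_inputs(question_tokens, context_tokens, max_len=512):
    """Pack a question/context pair the way BERT expects paired inputs."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    if len(tokens) > max_len:
        raise ValueError("sequence exceeds the maximum input size")
    # Segment 0 covers [CLS], the question, and the first [SEP];
    # segment 1 covers the context and the final [SEP].
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    # One position id per token, looked up in the learned positional embedding.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_inputs(
    ["who", "wrote", "it", "?"],
    ["the", "book", "was", "written", "by", "ann"],
)
for row in zip(tokens, segment_ids, position_ids):
    print(row)
```

Each of the three sequences indexes its own embedding table, and the three resulting vectors are summed per token, as in `BertEmbeddings.forward` in the appendix.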

Once summed together, these pass up into the first encoder cell, which goes through the process explained earlier in the transformer section with one alteration: in BERT the feed forward network uses GELU instead of ReLU. GELU is used because of the performance improvements it has over ReLU, which the GELU authors attribute to its curvature and non-monotonicity (its output can be negative) [10].
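A quick scalar comparison of the two activations, using the same erf-based formula as the `gelu` function in the appendix. The sample inputs are arbitrary values chosen for illustration.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    return max(0.0, x)


for x in (-3.0, -1.0, -0.5, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):+.4f}")
```

For negative inputs ReLU is flat at zero, while GELU dips below zero and then returns towards it, so it is not monotonic.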

 ![alt text](https://drive.google.com/uc?export=view&id=1Fs3onc3l4exOYHbYK-CKrYaHi8Jc87Ss)
 
After the sequences pass through BERT, they hit a linear layer which outputs a start logit and an end logit for every token; the highest-scoring positions give the first and last token of the answer span. These predictions are refined during training using cross entropy loss.

```python
loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
start_loss = loss_fct(start_logits, start_positions)
end_loss = loss_fct(end_logits, end_positions)
total_loss = (start_loss + end_loss) / 2
```

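In order to account for null answers, a threshold is set (default of 0). During prediction, the start and end logits are stored for every candidate answer span, along with the start and end logits of the no-answer option at the [CLS] position. The no-answer score is the minimum, over the chunks of a context, of the sum of those two [CLS] logits. If the no-answer score minus the start and end logits of the best candidate span surpasses the threshold, the model predicts “” (the response for no answer); otherwise it predicts the best span.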
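The null-answer decision can be sketched as below. This reflects my reading of the prediction step in `run_squad.py`, not a copy of it: the function name, argument layout, and toy logits are assumptions made for illustration.

```python
def choose_answer(null_scores, span_candidates, threshold=0.0):
    """Pick the best span, or "" when the no-answer score wins by more than threshold.

    null_scores: one (start logit + end logit) value per context chunk,
        taken at the [CLS] position.
    span_candidates: (text, start_logit, end_logit) tuples for real spans.
    """
    # Keep the minimum no-answer score over all chunks of the context.
    score_null = min(null_scores)
    # Best real span: highest sum of start and end logits.
    best_text, best_start, best_end = max(span_candidates,
                                          key=lambda c: c[1] + c[2])
    score_diff = score_null - best_start - best_end
    return "" if score_diff > threshold else best_text


candidates = [("in 1998", 2.0, 2.5), ("1998", 1.0, 3.0)]
print(repr(choose_answer([1.0, 3.0], candidates)))  # a real span wins
print(repr(choose_answer([6.0, 7.0], candidates)))  # no-answer wins
```

Raising the threshold makes the model less willing to abstain, and lowering it makes the model abstain more often.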

I trained this model for approximately 4 hours on one Tesla P100 using a batch size of 15 (reduced from the default of 32 due to memory constraints), and I was able to achieve fairly impressive performance at detecting when the model was unable to answer a question:
*    True Negative (No Answer) - 0.811
*    False Negative - 0.231
*    False Omission Rate - 0.222

To conclude, BERT is a powerful tool for transfer learning. Even with a decreased batch size, the model was able to correctly identify 81% of the questions it could not answer. In the future, I plan to use multiple GPUs for larger batch sizes and sequence lengths to further fine-tune the model for question answering.

For this project, I used Hugging Face’s PyTorch implementation of BERT [11]. This implementation breaks down the different BERT components into easy-to-follow torch nn modules. This code is included in the appendix.


# Sources 
[1] https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html

[2] https://jalammar.github.io/illustrated-bert/

[3] https://allennlp.org/elmo

[4] https://openai.com/blog/language-unsupervised/

[5] Attention Is All You Need https://arxiv.org/abs/1706.03762

[6] NUS CS6101 Deep Learning for NLP S8 https://www.youtube.com/watch?v=yCdl2afW88k

[7] http://mlexplained.com/2017/12/29/attention-is-all-you-need-explained/

[8] https://github.com/google-research/bert

[9] https://rajpurkar.github.io/SQuAD-explorer/

[10] https://arxiv.org/pdf/1606.08415.pdf

[11] https://github.com/huggingface/pytorch-pretrained-BERT


# Appendix: BERT PyTorch Implementation

In [0]:
%%capture
pip install pytorch-pretrained-bert

In [0]:
%%bash
git clone https://github.com/huggingface/pytorch-pretrained-BERT.git

Cloning into 'pytorch-pretrained-BERT'...


In [0]:
%%bash
cp pytorch-pretrained-BERT/examples/run_squad.py .

In [0]:
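"""
Importing run_squad fails with the pip version of pytorch-pretrained-bert
because run_squad.py imports names from file_utils that cause an error there
(WEIGHTS_NAME and CONFIG_NAME are defined manually in a later cell).
This cell deletes that import line, which sits at zero-based index 36 of the file.

deleted line:
from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE, 
WEIGHTS_NAME, CONFIG_NAME
"""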
with open("run_squad.py", "r") as infile:
    lines = infile.readlines()
with open("run_squad.py", "w") as outfile:
    for pos, line in enumerate(lines):
        if pos != 36:
            outfile.write(line)

In [0]:
import torch
import copy
import json
import logging
import math
import os
import shutil
import tarfile
import tempfile
import sys
from io import open
from torch import nn
from torch.nn import CrossEntropyLoss
from pytorch_pretrained_bert import modeling
from pytorch_pretrained_bert import BertTokenizer, BertModel
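# BertLayerNorm and BertPooler are referenced by the classes below
from pytorch_pretrained_bert.modeling import load_tf_weights_in_bert, BertConfig, BertLayerNorm, BertPooler

# cached_path is used by BertPreTrainedModel.from_pretrained below
from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE, cached_path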
from run_squad import *


# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
# (logging is already imported above)
logging.basicConfig(level=logging.INFO)

In [0]:
# manually define config name and weights name args due to issue with version of pytorch-pretrained-bert from pip
CONFIG_NAME = "config.json"
WEIGHTS_NAME = "pytorch_model.bin"
PRETRAINED_MODEL_ARCHIVE_MAP = {
    'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz",
    'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased.tar.gz",
    'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz",
    'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased.tar.gz",
    'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased.tar.gz",
    'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased.tar.gz",
    'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz",
}
BERT_CONFIG_NAME = 'bert_config.json'
TF_WEIGHTS_NAME = 'model.ckpt'

In [0]:
def gelu(x):
    """Implementation of the gelu activation function.
        For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
        0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
        Also see https://arxiv.org/abs/1606.08415
    """
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))


def swish(x):
    return x * torch.sigmoid(x)


ACT2FN = {"gelu": gelu, "relu": torch.nn.functional.relu, "swish": swish}

In [0]:
class BertEmbeddings(nn.Module):
    """Construct the embeddings from word, position and token_type embeddings.
    """
    def __init__(self, config):
        super(BertEmbeddings, self).__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=0)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids, token_type_ids=None):
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
        position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)

        words_embeddings = self.word_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        embeddings = words_embeddings + position_embeddings + token_type_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

In [0]:
class BertSelfAttention(nn.Module):
    def __init__(self, config):
        super(BertSelfAttention, self).__init__()
        if config.hidden_size % config.num_attention_heads != 0:
            raise ValueError(
                "The hidden size (%d) is not a multiple of the number of attention "
                "heads (%d)" % (config.hidden_size, config.num_attention_heads))
        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)

    def transpose_for_scores(self, x):
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(self, hidden_states, attention_mask):
        mixed_query_layer = self.query(hidden_states)
        mixed_key_layer = self.key(hidden_states)
        mixed_value_layer = self.value(hidden_states)

        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        # Apply the attention mask (precomputed for all layers in the BertModel forward() function)
        attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.Softmax(dim=-1)(attention_scores)

        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        attention_probs = self.dropout(attention_probs)

        context_layer = torch.matmul(attention_probs, value_layer)
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)
        return context_layer

class BertSelfOutput(nn.Module):
    def __init__(self, config):
        super(BertSelfOutput, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states

class BertAttention(nn.Module):
    def __init__(self, config):
        super(BertAttention, self).__init__()
        self.self = BertSelfAttention(config)
        self.output = BertSelfOutput(config)

    def forward(self, input_tensor, attention_mask):
        self_output = self.self(input_tensor, attention_mask)
        attention_output = self.output(self_output, input_tensor)
        return attention_output

In [0]:
class BertIntermediate(nn.Module):
    def __init__(self, config):
        super(BertIntermediate, self).__init__()
        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
        if isinstance(config.hidden_act, str) or (sys.version_info[0] == 2 and isinstance(config.hidden_act, unicode)):
            self.intermediate_act_fn = ACT2FN[config.hidden_act]
        else:
            self.intermediate_act_fn = config.hidden_act

    def forward(self, hidden_states):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.intermediate_act_fn(hidden_states)
        return hidden_states

In [0]:
class BertOutput(nn.Module):
    def __init__(self, config):
        super(BertOutput, self).__init__()
        self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states

In [0]:
class BertLayer(nn.Module):
    def __init__(self, config):
        super(BertLayer, self).__init__()
        self.attention = BertAttention(config)
        self.intermediate = BertIntermediate(config)
        self.output = BertOutput(config)

    def forward(self, hidden_states, attention_mask):
        attention_output = self.attention(hidden_states, attention_mask)
        intermediate_output = self.intermediate(attention_output)
        layer_output = self.output(intermediate_output, attention_output)
        return layer_output

In [0]:
class BertPreTrainedModel(nn.Module):
    """ An abstract class to handle weights initialization and
        a simple interface for downloading and loading pretrained models.
    """
    def __init__(self, config, *inputs, **kwargs):
        super(BertPreTrainedModel, self).__init__()
        if not isinstance(config, BertConfig):
            raise ValueError(
                "Parameter config in `{}(config)` should be an instance of class `BertConfig`. "
                "To create a model from a Google pretrained model use "
                "`model = {}.from_pretrained(PRETRAINED_MODEL_NAME)`".format(
                    self.__class__.__name__, self.__class__.__name__
                ))
        self.config = config

    def init_bert_weights(self, module):
        """ Initialize the weights.
        """
        if isinstance(module, (nn.Linear, nn.Embedding)):
            # Slightly different from the TF version which uses truncated_normal for initialization
            # cf https://github.com/pytorch/pytorch/pull/5617
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        elif isinstance(module, BertLayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.data.zero_()

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, state_dict=None, cache_dir=None,
                        from_tf=False, *inputs, **kwargs):
        """
        Instantiate a BertPreTrainedModel from a pre-trained model file or a pytorch state dict.
        Download and cache the pre-trained model file if needed.
        Params:
            pretrained_model_name_or_path: either:
                - a str with the name of a pre-trained model to load selected in the list of:
                    . `bert-base-uncased`
                    . `bert-large-uncased`
                    . `bert-base-cased`
                    . `bert-large-cased`
                    . `bert-base-multilingual-uncased`
                    . `bert-base-multilingual-cased`
                    . `bert-base-chinese`
                - a path or url to a pretrained model archive containing:
                    . `bert_config.json` a configuration file for the model
                    . `pytorch_model.bin` a PyTorch dump of a BertForPreTraining instance
                - a path or url to a pretrained model archive containing:
                    . `bert_config.json` a configuration file for the model
                    . `model.ckpt` a TensorFlow checkpoint
            from_tf: should we load the weights from a locally saved TensorFlow checkpoint
            cache_dir: an optional path to a folder in which the pre-trained models will be cached.
            state_dict: an optional state dictionary (collections.OrderedDict object) to use instead of Google pre-trained models
            *inputs, **kwargs: additional input for the specific Bert class
                (ex: num_labels for BertForSequenceClassification)
        """
        if pretrained_model_name_or_path in PRETRAINED_MODEL_ARCHIVE_MAP:
            archive_file = PRETRAINED_MODEL_ARCHIVE_MAP[pretrained_model_name_or_path]
        else:
            archive_file = pretrained_model_name_or_path
        # redirect to the cache, if necessary
        try:
            resolved_archive_file = cached_path(archive_file, cache_dir=cache_dir)
        except EnvironmentError:
            logger.error(
                "Model name '{}' was not found in model name list ({}). "
                "We assumed '{}' was a path or url but couldn't find any file "
                "associated to this path or url.".format(
                    pretrained_model_name_or_path,
                    ', '.join(PRETRAINED_MODEL_ARCHIVE_MAP.keys()),
                    archive_file))
            return None
        if resolved_archive_file == archive_file:
            logger.info("loading archive file {}".format(archive_file))
        else:
            logger.info("loading archive file {} from cache at {}".format(
                archive_file, resolved_archive_file))
        tempdir = None
        if os.path.isdir(resolved_archive_file) or from_tf:
            serialization_dir = resolved_archive_file
        else:
            # Extract archive to temp dir
            tempdir = tempfile.mkdtemp()
            logger.info("extracting archive file {} to temp dir {}".format(
                resolved_archive_file, tempdir))
            with tarfile.open(resolved_archive_file, 'r:gz') as archive:
                archive.extractall(tempdir)
            serialization_dir = tempdir
        # Load config
        config_file = os.path.join(serialization_dir, CONFIG_NAME)
        if not os.path.exists(config_file):
            # Backward compatibility with old naming format
            config_file = os.path.join(serialization_dir, BERT_CONFIG_NAME)
        config = BertConfig.from_json_file(config_file)
        logger.info("Model config {}".format(config))
        # Instantiate model.
        model = cls(config, *inputs, **kwargs)
        if state_dict is None and not from_tf:
            weights_path = os.path.join(serialization_dir, WEIGHTS_NAME)
            state_dict = torch.load(weights_path, map_location='cpu')
        if tempdir:
            # Clean up temp dir
            shutil.rmtree(tempdir)
        if from_tf:
            # Directly load from a TensorFlow checkpoint
            weights_path = os.path.join(serialization_dir, TF_WEIGHTS_NAME)
            return load_tf_weights_in_bert(model, weights_path)
        # Load from a PyTorch state_dict
        old_keys = []
        new_keys = []
        for key in state_dict.keys():
            new_key = None
            if 'gamma' in key:
                new_key = key.replace('gamma', 'weight')
            if 'beta' in key:
                new_key = key.replace('beta', 'bias')
            if new_key:
                old_keys.append(key)
                new_keys.append(new_key)
        for old_key, new_key in zip(old_keys, new_keys):
            state_dict[new_key] = state_dict.pop(old_key)

        missing_keys = []
        unexpected_keys = []
        error_msgs = []
        # copy state_dict so _load_from_state_dict can modify it
        metadata = getattr(state_dict, '_metadata', None)
        state_dict = state_dict.copy()
        if metadata is not None:
            state_dict._metadata = metadata

        def load(module, prefix=''):
            local_metadata = {} if metadata is None else metadata.get(prefix[:-1], {})
            module._load_from_state_dict(
                state_dict, prefix, local_metadata, True, missing_keys, unexpected_keys, error_msgs)
            for name, child in module._modules.items():
                if child is not None:
                    load(child, prefix + name + '.')
        start_prefix = ''
        if not hasattr(model, 'bert') and any(s.startswith('bert.') for s in state_dict.keys()):
            start_prefix = 'bert.'
        load(model, prefix=start_prefix)
        if len(missing_keys) > 0:
            logger.info("Weights of {} not initialized from pretrained model: {}".format(
                model.__class__.__name__, missing_keys))
        if len(unexpected_keys) > 0:
            logger.info("Weights from pretrained model not used in {}: {}".format(
                model.__class__.__name__, unexpected_keys))
        if len(error_msgs) > 0:
            raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
                               model.__class__.__name__, "\n\t".join(error_msgs)))
        return model

In [0]:
class BertEncoder(nn.Module):
    def __init__(self, config):
        super(BertEncoder, self).__init__()
        layer = BertLayer(config)
        self.layer = nn.ModuleList([copy.deepcopy(layer) for _ in range(config.num_hidden_layers)])

    def forward(self, hidden_states, attention_mask, output_all_encoded_layers=True):
        all_encoder_layers = []
        for layer_module in self.layer:
            hidden_states = layer_module(hidden_states, attention_mask)
            if output_all_encoded_layers:
                all_encoder_layers.append(hidden_states)
        if not output_all_encoded_layers:
            all_encoder_layers.append(hidden_states)
        return all_encoder_layers

In [0]:
class BertModel(BertPreTrainedModel):
    """BERT model ("Bidirectional Encoder Representations from Transformers").
    Params:
        config: a BertConfig class instance with the configuration to build a new model
    Inputs:
        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
            with the word token indices in the vocabulary(see the tokens preprocessing logic in the scripts
            `extract_features.py`, `run_classifier.py` and `run_squad.py`)
        `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
            types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
            a `sentence B` token (see BERT paper for more details).
        `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
            selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
            input sequence length in the current batch. It's the mask that we typically use for attention when
            a batch has varying length sentences.
        `output_all_encoded_layers`: boolean which controls the content of the `encoded_layers` output as described below. Default: `True`.
    Outputs: Tuple of (encoded_layers, pooled_output)
        `encoded_layers`: controlled by the `output_all_encoded_layers` argument:
            - `output_all_encoded_layers=True`: outputs a list of the full sequences of encoded-hidden-states at the end
                of each attention block (i.e. 12 full sequences for BERT-base, 24 for BERT-large), each
                encoded-hidden-state is a torch.FloatTensor of size [batch_size, sequence_length, hidden_size],
            - `output_all_encoded_layers=False`: outputs only the full sequence of hidden-states corresponding
                to the last attention block of shape [batch_size, sequence_length, hidden_size],
        `pooled_output`: a torch.FloatTensor of size [batch_size, hidden_size] which is the output of a
            classifier pretrained on top of the hidden state associated with the first token of the
            input (`[CLS]`) to train on the Next-Sentence task (see BERT's paper).
    Example usage:
    ```python
    # Inputs have already been converted into WordPiece token ids
    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
    config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
    model = BertModel(config=config)
    all_encoder_layers, pooled_output = model(input_ids, token_type_ids, input_mask)
    ```
    """
    def __init__(self, config):
        super(BertModel, self).__init__(config)
        self.embeddings = BertEmbeddings(config)
        self.encoder = BertEncoder(config)
        self.pooler = BertPooler(config)
        self.apply(self.init_bert_weights)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, output_all_encoded_layers=True):
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)
        if token_type_ids is None:
            token_type_ids = torch.zeros_like(input_ids)

        # We create a broadcastable 4D attention mask from the 2D tensor mask.
        # Sizes are [batch_size, 1, 1, to_seq_length],
        # so it broadcasts to [batch_size, num_heads, from_seq_length, to_seq_length].
        # This attention mask is simpler than the triangular masking of causal attention
        # used in OpenAI GPT; we just need to prepare the broadcast dimensions here.
        extended_attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)

        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
        # masked positions, this operation will create a tensor which is 0.0 for
        # positions we want to attend and -10000.0 for masked positions.
        # Since we are adding it to the raw scores before the softmax, this is
        # effectively the same as removing these entirely.
        extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype) # fp16 compatibility
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0

        embedding_output = self.embeddings(input_ids, token_type_ids)
        encoded_layers = self.encoder(embedding_output,
                                      extended_attention_mask,
                                      output_all_encoded_layers=output_all_encoded_layers)
        sequence_output = encoded_layers[-1]
        pooled_output = self.pooler(sequence_output)
        if not output_all_encoded_layers:
            encoded_layers = encoded_layers[-1]
        return encoded_layers, pooled_output
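
The additive masking trick in `BertModel.forward` can be checked on a toy batch. The snippet below is a minimal sketch, assuming random stand-in attention scores and made-up sizes (2 sequences, 3 heads, 4 tokens); only the mask construction mirrors the code above.

```python
import torch

# Two sequences of length 4; the second one ends with a padding position.
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [1, 1, 1, 0]])

# [batch, seq] -> [batch, 1, 1, seq] so the mask broadcasts over heads and query positions.
extended = attention_mask.unsqueeze(1).unsqueeze(2).float()
additive_mask = (1.0 - extended) * -10000.0  # 0 where we attend, -10000 where masked

# Stand-in raw attention scores: [batch, num_heads, from_seq, to_seq] (illustrative values).
torch.manual_seed(0)
scores = torch.randn(2, 3, 4, 4)

# Adding the mask before the softmax pushes masked positions to (numerically) zero weight.
probs = torch.softmax(scores + additive_mask, dim=-1)

print(additive_mask.shape)
print(probs[1, 0])  # the last column, the padded token, receives no attention
```

Because the mask is added rather than multiplied, the remaining weights in each row still sum to one after the softmax.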

In [0]:
class BertForQuestionAnswering(BertPreTrainedModel):
    """BERT model for Question Answering (span extraction).
    This module is composed of the BERT model with a linear layer on top of
    the sequence output that computes start_logits and end_logits
    Params:
        `config`: a BertConfig class instance with the configuration to build a new model.
    Inputs:
        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
            with the word token indices in the vocabulary (see the tokens preprocessing logic in the scripts
            `extract_features.py`, `run_classifier.py` and `run_squad.py`)
        `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
            types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
            a `sentence B` token (see BERT paper for more details).
        `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
            selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
            input sequence length in the current batch. It's the mask that we typically use for attention when
            a batch has varying length sentences.
        `start_positions`: position of the first token for the labeled span: torch.LongTensor of shape [batch_size].
            Positions are clamped to the length of the sequence, and positions outside of the sequence are not taken
            into account for computing the loss.
        `end_positions`: position of the last token for the labeled span: torch.LongTensor of shape [batch_size].
            Positions are clamped to the length of the sequence, and positions outside of the sequence are not taken
            into account for computing the loss.
    Outputs:
        if `start_positions` and `end_positions` are not `None`:
            Outputs the total_loss, which is the average of the CrossEntropy losses for the start and end token positions.
        if `start_positions` or `end_positions` is `None`:
            Outputs a tuple of start_logits, end_logits which are the logits respectively for the start and end
            position tokens of shape [batch_size, sequence_length].
    Example usage:
    ```python
    # Inputs have already been converted into WordPiece token ids
    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
    config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
    model = BertForQuestionAnswering(config)
    start_logits, end_logits = model(input_ids, token_type_ids, input_mask)
    ```
    """
    def __init__(self, config):
        super(BertForQuestionAnswering, self).__init__(config)
        self.bert = BertModel(config)
        # TODO check with Google if it's normal there is no dropout on the token classifier of SQuAD in the TF version
        # self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.qa_outputs = nn.Linear(config.hidden_size, 2)
        self.apply(self.init_bert_weights)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, start_positions=None, end_positions=None):
        sequence_output, _ = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)

        if start_positions is not None and end_positions is not None:
            # On multi-GPU, splitting the batch adds a dimension, so squeeze it away
            if len(start_positions.size()) > 1:
                start_positions = start_positions.squeeze(-1)
            if len(end_positions.size()) > 1:
                end_positions = end_positions.squeeze(-1)
            # Sometimes the start/end positions fall outside our model inputs; we ignore these terms
            ignored_index = start_logits.size(1)
            start_positions.clamp_(0, ignored_index)
            end_positions.clamp_(0, ignored_index)

            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2
            return total_loss
        else:
            return start_logits, end_logits
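
To make the span-extraction head concrete, here is a minimal sketch that reproduces the logit split and the clamped, ignore-index loss on random data. The tensor sizes and label values are illustrative assumptions, and `sequence_output` is a random stand-in for the encoder output rather than real BERT activations.

```python
import torch
from torch import nn
from torch.nn import CrossEntropyLoss

torch.manual_seed(0)
batch_size, seq_len, hidden_size = 2, 6, 8  # toy sizes, not BERT's

# Random stand-in for the last encoder layer: [batch, seq, hidden].
sequence_output = torch.randn(batch_size, seq_len, hidden_size)

# One linear layer produces two scores per token: "is start" and "is end".
qa_outputs = nn.Linear(hidden_size, 2)
logits = qa_outputs(sequence_output)                # [batch, seq, 2]
start_logits, end_logits = logits.split(1, dim=-1)  # two [batch, seq, 1] tensors
start_logits = start_logits.squeeze(-1)             # [batch, seq]
end_logits = end_logits.squeeze(-1)

# The second example's labels lie outside the window (hypothetical values).
start_positions = torch.tensor([1, 9])
end_positions = torch.tensor([3, 11])

# Clamp out-of-range labels to seq_len, then tell the loss to skip that index.
ignored_index = start_logits.size(1)
start_positions = start_positions.clamp(0, ignored_index)
end_positions = end_positions.clamp(0, ignored_index)

loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
total_loss = (loss_fct(start_logits, start_positions) +
              loss_fct(end_logits, end_positions)) / 2
print(start_positions.tolist(), total_loss.item())
```

Only the first example contributes to the loss here, since the second example's clamped labels equal `ignored_index`.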


In [0]:
# Note: `bert_model` is assigned in the configuration cell further down; run that cell first.
model = BertForQuestionAnswering.from_pretrained(bert_model, cache_dir=None)



---



In [0]:
torch.cuda.set_device(0)
device = torch.device("cuda", 0)
n_gpu = 1

In [0]:
train_file = 'train-v2.0.json'
predict_file = 'dev-v2.0.json' 
output_dir = 'squad2_log'
bert_model = 'bert-base-cased'
train_batch_size = 32
num_train_epochs = 3
learning_rate = 5e-5
gradient_accumulation_steps = 1  # Number of batches to accumulate gradients over before each optimizer update.
warmup_proportion = 0.1
doc_stride = 128
max_query_length = 64
max_seq_length = 384

predict_batch_size = 8
n_best_size = 20
max_answer_length = 30

seed = 42

tokenizer = BertTokenizer.from_pretrained(bert_model, do_lower_case=False)

INFO:pytorch_pretrained_bert.file_utils:https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt not found in cache, downloading to /tmp/tmpf6ohop58
100%|██████████| 213450/213450 [00:00<00:00, 2465797.03B/s]
INFO:pytorch_pretrained_bert.file_utils:copying /tmp/tmpf6ohop58 to cache at /root/.pytorch_pretrained_bert/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
INFO:pytorch_pretrained_bert.file_utils:creating metadata file for /root/.pytorch_pretrained_bert/5e8a2b4893d13790ed4150ca1906be5f7a03d6c4ddf62296c383f6db42814db2.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
INFO:pytorch_pretrained_bert.file_utils:removing temp file /tmp/tmpf6ohop58
INFO:pytorch_pretrained_bert.tokenization:loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt from cache at /root/.pytorch_pretrained_bert/5e8a2b4893d13790ed4150ca1906be5f7a03d6c
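
The settings `max_seq_length`, `max_query_length` and `doc_stride` control how a long paragraph is cut into overlapping windows before it reaches the model. The following is a minimal sketch of that sliding-window idea, not the exact preprocessing code; `split_into_spans` and `DocSpan` are hypothetical names, and the three reserved positions are assumed to hold `[CLS]` and the two `[SEP]` tokens.

```python
from collections import namedtuple

DocSpan = namedtuple("DocSpan", ["start", "length"])

def split_into_spans(num_doc_tokens, num_query_tokens,
                     max_seq_length=384, doc_stride=128, max_query_length=64):
    """Return overlapping (start, length) windows over a tokenized document."""
    # The question is truncated to at most max_query_length tokens.
    num_query_tokens = min(num_query_tokens, max_query_length)
    # Reserve room for the question plus [CLS], [SEP], [SEP].
    max_tokens_for_doc = max_seq_length - num_query_tokens - 3

    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append(DocSpan(start, length))
        if start + length == num_doc_tokens:
            break  # this window already reaches the end of the document
        start += min(length, doc_stride)
    return spans

# A 600-token paragraph with a 10-token question (illustrative numbers).
spans = split_into_spans(num_doc_tokens=600, num_query_tokens=10)
for span in spans:
    print(span)
```

With these numbers each window holds at most 371 paragraph tokens and consecutive windows start 128 tokens apart, so an answer near a window boundary still appears with context in a neighbouring window.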

In [0]:
if os.path.exists(output_dir) and os.listdir(output_dir):
    raise ValueError("Output directory ({}) already exists and is not empty.".format(output_dir))
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

In [0]:
model = BertForQuestionAnswering.from_pretrained(bert_model, cache_dir=None)

In [0]:
train_examples = read_squad_examples(input_file=train_file, is_training=True, version_2_with_negative=True)
num_train_optimization_steps = int(len(train_examples)/train_batch_size / gradient_accumulation_steps) * num_train_epochs

# Prepare model
model = BertForQuestionAnswering.from_pretrained(bert_model, cache_dir=None)
model.to(device)

INFO:pytorch_pretrained_bert.file_utils:https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz not found in cache, downloading to /tmp/tmplmxrjei1
100%|██████████| 404400730/404400730 [00:07<00:00, 55536672.01B/s]
INFO:pytorch_pretrained_bert.file_utils:copying /tmp/tmplmxrjei1 to cache at /root/.pytorch_pretrained_bert/a803ce83ca27fecf74c355673c434e51c265fb8a3e0e57ac62a80e38ba98d384.681017f415dfb33ec8d0e04fe51a619f3f01532ecea04edbfd48c5d160550d9c
INFO:pytorch_pretrained_bert.file_utils:creating metadata file for /root/.pytorch_pretrained_bert/a803ce83ca27fecf74c355673c434e51c265fb8a3e0e57ac62a80e38ba98d384.681017f415dfb33ec8d0e04fe51a619f3f01532ecea04edbfd48c5d160550d9c
INFO:pytorch_pretrained_bert.file_utils:removing temp file /tmp/tmplmxrjei1
INFO:pytorch_pretrained_bert.modeling:loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz from cache at /root/.pytorch_pretrained_bert/a803ce83ca27fecf74c355673c434e51c265fb8a3e0e5

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1)
            )
          )
          (intermediate): BertIntermediate
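
The step count computed above feeds the optimizer's warmup schedule, so it helps to see the arithmetic. This is a minimal sketch: the example count is a hypothetical round number, and `lr_scale` shows one common linear warmup and decay shape, which is an assumption and not necessarily the exact schedule `BertAdam` applies.

```python
def count_optimization_steps(num_examples, batch_size, accumulation_steps, epochs):
    """Same formula as the notebook: optimizer updates per epoch times epochs."""
    return int(num_examples / batch_size / accumulation_steps) * epochs

def lr_scale(step, total_steps, warmup_proportion):
    """Multiplier on the base learning rate: linear ramp up, then linear decay to zero."""
    progress = step / total_steps
    if progress < warmup_proportion:
        return progress / warmup_proportion
    return max(0.0, (1.0 - progress) / (1.0 - warmup_proportion))

num_examples = 100_000  # hypothetical; the notebook uses len(train_examples)
total_steps = count_optimization_steps(num_examples, batch_size=32,
                                       accumulation_steps=1, epochs=3)
warmup_steps = int(total_steps * 0.1)  # warmup_proportion = 0.1

print(total_steps, warmup_steps)
for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, round(lr_scale(step, total_steps, 0.1), 3))
```

Note that the formula counts original examples, while the data loader built later iterates over the split features, so the training loop may perform somewhat more updates than this estimate.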

In [0]:
# Prepare optimizer
param_optimizer = list(model.named_parameters())

# Hack to remove the pooler, which is not used here;
# it would otherwise produce None gradients that break apex.
param_optimizer = [n for n in param_optimizer if 'pooler' not in n[0]]

no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]

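# Skipping lines 911-922 of the source script: they handle more than one GPU.
# Skipping lines 937-952 of the source script: they handle 16-bit float (fp16) training, which is not used here.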

optimizer = BertAdam(optimizer_grouped_parameters,lr=learning_rate,
                warmup=warmup_proportion,
                t_total=num_train_optimization_steps)
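
The `no_decay` filter works by substring matching on parameter names. The sketch below shows which parameters land in each group for a tiny stand-in module; `TinyBlock` is hypothetical, and `torch.optim.AdamW` is used only as a readily available substitute for `BertAdam`.

```python
import torch
from torch import nn

class TinyBlock(nn.Module):
    """Hypothetical stand-in with the same naming pattern as a BERT sub-layer."""
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(4, 4)
        self.LayerNorm = nn.LayerNorm(4)

model = TinyBlock()
named = list(model.named_parameters())

# A parameter skips weight decay if any of these substrings occurs in its name.
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
decay_names = [n for n, _ in named if not any(nd in n for nd in no_decay)]
no_decay_names = [n for n, _ in named if any(nd in n for nd in no_decay)]
print("decayed:    ", decay_names)
print("not decayed:", no_decay_names)

grouped = [
    {'params': [p for n, p in named if n in decay_names], 'weight_decay': 0.01},
    {'params': [p for n, p in named if n in no_decay_names], 'weight_decay': 0.0},
]
# AdamW stands in for BertAdam here; each group keeps its own weight_decay value.
optimizer = torch.optim.AdamW(grouped, lr=5e-5)
```

Only the dense weight matrix is regularized; biases and layer normalization parameters are left alone.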


In [0]:
global_step = 0

train_features = convert_examples_to_features(
        examples=train_examples,
        tokenizer=tokenizer,
        max_seq_length=max_seq_length,
        doc_stride=doc_stride,
        max_query_length=max_query_length,
        is_training=True)
logger.info("***** Running training *****")
logger.info("  Num orig examples = %d", len(train_examples))
logger.info("  Num split examples = %d", len(train_features))
logger.info("  Batch size = %d", train_batch_size)
logger.info("  Num steps = %d", num_train_optimization_steps)
all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)
all_segment_ids = torch.tensor([f.segment_ids for f in train_features], dtype=torch.long)
all_start_positions = torch.tensor([f.start_position for f in train_features], dtype=torch.long)
all_end_positions = torch.tensor([f.end_position for f in train_features], dtype=torch.long)
train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids,
                           all_start_positions, all_end_positions)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=train_batch_size)

INFO:run_squad:*** Example ***
INFO:run_squad:unique_id: 1000000000
INFO:run_squad:example_index: 0
INFO:run_squad:doc_span_index: 0
INFO:run_squad:tokens: [CLS] When did Bey ##on ##ce start becoming popular ? [SEP] Beyoncé G ##iselle Knowles - Carter ( / [UNK] / bee - Y ##ON - say ) ( born September 4 , 1981 ) is an American singer , songwriter , record producer and actress . Born and raised in Houston , Texas , she performed in various singing and dancing competitions as a child , and rose to fame in the late 1990s as lead singer of R & B girl - group Destiny ' s Child . Man ##aged by her father , Math ##ew Knowles , the group became one of the world ' s best - selling girl groups of all time . Their hiatus saw the release of Beyoncé ' s debut album , Dangerous ##ly in Love ( 2003 ) , which established her as a solo artist worldwide , earned five Grammy Awards and featured the Billboard Hot 100 number - one singles " Crazy in Love " and " Baby Boy " . [SEP]
INFO:run_squad:token_to_or
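
The training loop in the next cell divides the loss by `gradient_accumulation_steps` and only calls `optimizer.step()` every few batches. A minimal sketch of why that works, assuming a toy linear model and random data in place of BERT and SQuAD: accumulating scaled gradients over small batches reproduces the gradient of one large batch.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(3, 1)   # toy model standing in for BERT
loss_fn = nn.MSELoss()    # mean-reduced loss, like CrossEntropyLoss by default
x, y = torch.randn(8, 3), torch.randn(8, 1)

# Reference: gradient from one batch of 8 examples.
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Same data as two micro-batches of 4, each loss scaled by 1 / accumulation_steps.
accumulation_steps = 2
model.zero_grad()
for xb, yb in zip(x.chunk(accumulation_steps), y.chunk(accumulation_steps)):
    loss = loss_fn(model(xb), yb) / accumulation_steps
    loss.backward()  # gradients add up in .grad across backward calls
accumulated_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accumulated_grad, atol=1e-6))
```

This lets a large effective batch size fit in limited GPU memory, at the cost of more forward and backward passes per update.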

In [0]:
model.train()
for _ in trange(int(num_train_epochs), desc="Epoch"):
    for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration")):
        batch = tuple(t.to(device) for t in batch)
        input_ids, input_mask, segment_ids, start_positions, end_positions = batch
        loss = model(input_ids, segment_ids, input_mask, start_positions, end_positions)
        if gradient_accumulation_steps > 1:
            loss = loss / gradient_accumulation_steps
        loss.backward()
        if (step + 1) % gradient_accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            global_step += 1

# Save a trained model, configuration and tokenizer
model_to_save = model.module if hasattr(model, 'module') else model  # Only save the model itself
# If we save using the predefined names, we can load using `from_pretrained`
output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
output_config_file = os.path.join(output_dir, CONFIG_NAME)