# Modify Hugging Face BERT
Before we move on it may be worth reviewing the BERT paper: https://arxiv.org/abs/1810.04805

Please ensure that you are running this notebook with reference to https://github.com/aws/aws-neuron-sdk/tree/master/src/examples/pytorch/bert_tutorial/README.md, or you may miss key installation steps and fail to produce required artifacts in prior steps.

In particular, we can think of BERT as having three stages: embeddings, encoding and output processing.  The Neuron hardware provides the most acceleration where the work is mostly linear algebra.

The embedding stage (which is basically a set of lookup tables at runtime) benefits little from Neuron hardware, but the encoding stage benefits a lot.  So we want to make minimally invasive changes that override just the encoder stage to run on Neuron.

To do this we derive child classes from BertEncoder, BertModel and BertForSequenceClassification.  This allows for a modified encoder implementation (NeuronForwardBertEncoder) with the same top level semantics.

We'll do the following:

* Create a child class for BertForSequenceClassification which *just* changes the construction
* Create a child class for BertModel.  
  * Modify the constructor to use our derived BertEncoder
  * Modify the forward method by copying the original, but adding a torch.neuron trace to the encoder
* Create a child class of BertEncoder (NeuronForwardBertEncoder)
  * We need to do this since Hugging Face BERT uses some None variables and lists of None which the PyTorch JIT compiler (torch.jit.trace) does not like.  Since torch.neuron.trace builds on PyTorch JIT trace, we need to remove the arguments we aren't using as we compile our MRPC runtime.
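
The plan above can be sketched without PyTorch. This is a toy illustration only: the classes `Encoder`, `Model`, `Classifier` and their `Neuron*` children are hypothetical stand-ins, not the Hugging Face classes. Each child constructor swaps in a derived component and leaves the rest of its parent untouched.

```python
# Toy stand-ins for BertEncoder, BertModel and BertForSequenceClassification.
# All class names here are hypothetical and exist only for illustration.

class Encoder:
    def forward(self, hidden, mask=None, head_mask=None):
        return [h * 2 for h in hidden]


class Model:
    def __init__(self):
        self.encoder = Encoder()

    def forward(self, inputs):
        return self.encoder.forward(inputs)


class Classifier:
    def __init__(self):
        self.model = Model()

    def predict(self, inputs):
        return sum(self.model.forward(inputs)) > 0


class NeuronEncoder(Encoder):
    # Narrower signature: the optional argument is fixed inside the body.
    def forward(self, hidden, mask):
        return super().forward(hidden, mask, head_mask=None)


class NeuronModel(Model):
    def __init__(self):
        super().__init__()
        self.encoder = NeuronEncoder()  # swap in the derived encoder

    def forward(self, inputs):
        mask = [1] * len(inputs)
        return self.encoder.forward(inputs, mask)


class NeuronClassifier(Classifier):
    def __init__(self):
        super().__init__()
        self.model = NeuronModel()  # only the construction changes


baseline = Classifier().predict([1, 2, 3])
modified = NeuronClassifier().predict([1, 2, 3])
```

Both classifiers give the same answer for the same input, which is the property the real child classes are meant to preserve.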

## Before you start!

**Check you have run aws configure and set up your user credentials - otherwise these steps will fail**

### Check your region

In [None]:
## If you do not want to train and operate in us-east-1 change this region
## Make sure that all steps of the tutorial use the same region
REGION="us-east-1"

FOLDER="bert_tutorial"
CANONICAL_LOCATION="s3://aws-neuron-public-tutorial-content-us-east-1/frameworks/pytorch/bert/bert_tutorial/bert-large-uncased-mrpc.tar.gz"

### Import some required modules

In [None]:
from urllib.parse import urlsplit
from urllib.parse import urlparse
from botocore.exceptions import ClientError
from datetime import date

import boto3
import botocore
import os
import time

import torch
import torch.neuron

from transformers import BertModel, BertForSequenceClassification, BertTokenizer
from transformers.modeling_bert import BertEncoder

## Pre-step, check for adapted model and copy if needed

### Routine to upload files to S3

In [None]:
def upload_and_check_file( s3_location, filename ):
    
    try:
        boto3_sess = boto3.session.Session()
    except botocore.exceptions.NoCredentialsError:
        print("No credentials:  Use 'aws configure' to set up credentials or configure isengard (Amazon internal)")
        raise

    o = urlsplit(s3_location, allow_fragments = True)
    mod_path = os.path.dirname( o.path )
    mod_path = mod_path.lstrip('/')

    print()
    print("Copy model to: s3://" + o.netloc + "/" + mod_path + "/")

    assert( os.path.exists(filename) )

    try:
        s3_client = boto3_sess.client('s3')
        print("Uploading ...")
        response = s3_client.upload_file(filename, o.netloc, mod_path + "/" + filename )
        if response is None:
            print(" ... no errors")
        else:
            print("Response: {}".format(response))
    except ClientError as e:
        print(e)
        raise
    except:
        raise

    print()
    print("Check the file uploaded OK ...")
    s3_resource = boto3_sess.resource('s3')
    bucket = s3_resource.Bucket(o.netloc)
    key = mod_path + "/" + filename
    full_name = "s3://" + bucket.name + "/" + key

    objs = list(bucket.objects.filter(Prefix=key))

    print()
    if len(objs) > 0 and objs[0].key == key:
        print("{} exists!".format(full_name))
    else:
        print("{} doesn't exist".format(full_name))


### Routine to check S3 locations exist

In [None]:
def test_s3_location( bucket_name, key ):
    print("Check the S3 files exist ...")
    s3_resource = boto3_sess.resource('s3')
    bucket = s3_resource.Bucket(bucket_name)

    try:
        objs = list(bucket.objects.filter(Prefix=key))
    except s3_resource.meta.client.exceptions.NoSuchBucket:
        return False
    
    if len(objs) > 0 and objs[0].key == key:
        return True

    return False

### Check that an S3 bucket and known location exist for this account
If you didn't run Stage 1 (optional), and don't have the adapted MRPC model this code will:
* Download one prepared earlier
* Create a new S3 bucket in your account
* Upload the adapted MRPC model

If the expected file already exists, this step will proceed without any additional downloads or uploads.
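
As a rough sketch of how the locations in this step are derived, the snippet below splits an S3 URI into bucket and key, and builds the per-account location the cell checks for. The helper name `split_s3_uri` and the account ID `123456789012` are placeholders for illustration; the real account ID comes from STS at runtime.

```python
from urllib.parse import urlsplit

CANONICAL_LOCATION = (
    "s3://aws-neuron-public-tutorial-content-us-east-1/"
    "frameworks/pytorch/bert/bert_tutorial/bert-large-uncased-mrpc.tar.gz"
)


def split_s3_uri(uri):
    """Return (bucket, key) for an s3://bucket/key URI."""
    parts = urlsplit(uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError("Not an S3 URI: {}".format(uri))
    return parts.netloc, parts.path.lstrip("/")


bucket, key = split_s3_uri(CANONICAL_LOCATION)

# Placeholder account ID, used only for illustration.
account = "123456789012"
filename = key.rsplit("/", 1)[-1]
own_location = "s3://inferentia-test-{}/bert_tutorial/{}".format(account, filename)
```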

In [None]:
from __future__ import print_function
import sys

def eprint(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)

if REGION is None or FOLDER is None or CANONICAL_LOCATION is None:
    eprint("The following variables need to be set:")
    eprint("REGION = {}".format(REGION))
    eprint("FOLDER = {}".format(FOLDER))
    eprint("CANONICAL_LOCATION = {}".format(CANONICAL_LOCATION))
    eprint()
    eprint("Did you forget to execute the top cell?")

    # A bare `raise` outside an except block fails with an unrelated RuntimeError
    raise ValueError("REGION, FOLDER and CANONICAL_LOCATION must all be set")

bucket_prefix="inferentia-test-"

try:
    boto3_sess = boto3.session.Session()
except botocore.exceptions.NoCredentialsError:
    print("No credentials:  Use 'aws configure' to set up credentials or configure isengard (Amazon internal)")
    raise

try:
    sts_client = boto3.client('sts')
    response = sts_client.get_caller_identity()
    #if response == None:
    #    print(" ... no errors")
    #else:
    #    print("Response: {}".format(response))
except ClientError as e:
    print(e)
    raise
except:
    raise

ACCOUNT=response['Account']
#TIMESTAMP=date.today().strftime("%Y-%m-%d")

bucket_name=bucket_prefix + ACCOUNT
bucket_path=FOLDER
filename="bert-large-uncased-mrpc.tar.gz"
key = bucket_path + "/" + filename
s3_location = "s3://" + bucket_name + "/" + key

if test_s3_location( bucket_name, key ):
    print("{} exists!  using your adapted model".format(s3_location))
else:
    print("{} doesn't exist! Copying the AWS Neuron default model, and upload!".format(s3_location))
    
    ## Create a bucket in your account!
    print("Using region '{}'".format(REGION))
    s3_client = boto3.client('s3', region_name=REGION)
    
    response = None
    if REGION != "us-east-1":
        location = {'LocationConstraint': REGION}   
        response = s3_client.create_bucket(Bucket=bucket_name,
                                CreateBucketConfiguration=location)
    else:
        response = s3_client.create_bucket(Bucket=bucket_name)
    
    print("Created bucket at {} response = {}".format(bucket_name, response))
    
    ## Download the model
    s3_resource = boto3_sess.resource('s3')
    parsed = urlparse(CANONICAL_LOCATION)
    path = parsed.path.lstrip('/')
    adapted_model_file = os.path.basename(path)
    s3_resource.Bucket(parsed.netloc).download_file(path, adapted_model_file)
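    # Fail early if the download did not produce the expected file
    assert os.path.exists(adapted_model_file), "Download failed: {}".format(adapted_model_file)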
    
    print("Downloaded file")
    
    ## Upload the adapted model to your S3 (for later use)
    upload_and_check_file( s3_location=s3_location, filename=adapted_model_file)
    
assert(test_s3_location(bucket_name, key))

print("S3 location = {} confirmed!".format(s3_location))


### Modify BertEncoder

This is a modified Encoder class.  If you compare it with the original code you can see that we have stripped the unused arguments (head_mask, encoder_hidden_states and encoder_attention_mask) from the forward method.  Instead we initialize them in the body of the forward method.  This change was required to allow torch.jit.trace to trace the model.
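
The signature change can be illustrated with a torch-free toy. Here `toy_trace` is a hypothetical stand-in for a tracer that rejects None inputs, in the way the text describes for torch.jit.trace; it is not the real tracer, and the function bodies are made up.

```python
# Toy illustration only: `toy_trace` is a hypothetical stand-in for a tracer
# that refuses None (or lists containing None) as example inputs.

def toy_trace(fn, example_inputs):
    for value in example_inputs:
        if value is None or (isinstance(value, list) and None in value):
            raise TypeError("tracer cannot handle None inputs")
    fn(*example_inputs)  # run once to check that the call works
    return fn


def original_forward(hidden, mask, head_mask, encoder_hidden, encoder_mask):
    return [h + m for h, m in zip(hidden, mask)]


def narrowed_forward(hidden, mask):
    # The unused arguments become locals with fixed values.
    head_mask = [None] * 2
    encoder_hidden = None
    encoder_mask = None
    return original_forward(hidden, mask, head_mask, encoder_hidden, encoder_mask)


hidden, mask = [1.0, 2.0], [0.0, -10000.0]

try:
    toy_trace(original_forward, [hidden, mask, [None, None], None, None])
    original_traced = True
except TypeError:
    original_traced = False

traced = toy_trace(narrowed_forward, [hidden, mask])
result = traced(hidden, mask)
```

The narrowed function computes the same values as the original; only the set of inputs the tracer has to see is smaller.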

In [None]:
class NeuronForwardBertEncoder(BertEncoder):
    def __init__(self, config):
        super().__init__(config)
        self.opt_encoder = None
        
    def forward(
        self,
        hidden_states,
        attention_mask
    ):
        ## Changes to allow torch.jit.trace to run
        head_mask=[None] * len(self.layer)
        encoder_hidden_states=None
        encoder_attention_mask=None
        
        all_hidden_states = ()
        all_attentions = ()
        for i, layer_module in enumerate(self.layer):
            if self.output_hidden_states:
                all_hidden_states = all_hidden_states + (hidden_states,)

            layer_outputs = layer_module(
                hidden_states, attention_mask, head_mask[i], encoder_hidden_states, encoder_attention_mask
            )
            hidden_states = layer_outputs[0]

            if self.output_attentions:
                all_attentions = all_attentions + (layer_outputs[1],)

        # Add last layer
        if self.output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)

        outputs = (hidden_states,)
        if self.output_hidden_states:
            outputs = outputs + (all_hidden_states,)
        if self.output_attentions:
            outputs = outputs + (all_attentions,)
        return outputs  # last-layer hidden state, (all hidden states), (all attentions)



### Modify BertModel

Here we override the BERT model class.  This is a straight copy of the original code with three simple modifications.

1. We are using the modified BertEncoder forward function
1. We are invoking neuron trace in the forward path
1. We are passing some additional arguments we'll use for compilation

Additionally, there is some commented-out code which will output the TorchScript sub-graph (and a human-readable version of it) and an example input to the sub-graph.

When we construct and invoke our top level class it will create a compiled module.  We'll see below how we can then save that into a Neuron-optimized PyTorch file.
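
One detail of the copied forward method that is easy to miss is how the 0/1 padding mask becomes an additive mask before it reaches the encoder. A minimal plain-Python sketch of that arithmetic follows; the scores are made up and the helper names are illustrative, not part of the notebook.

```python
import math


def additive_mask(attention_mask):
    # 1 (attend) maps to 0.0 and 0 (padding) maps to -10000.0,
    # the same formula the forward method applies to the mask tensor.
    return [(1.0 - m) * -10000.0 for m in attention_mask]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


attention_mask = [1, 1, 0]      # the last position is padding
raw_scores = [1.0, 2.0, 3.0]    # made-up attention scores

mask = additive_mask(attention_mask)
probs = softmax([s + m for s, m in zip(raw_scores, mask)])
```

After the softmax the padded position carries no weight, which is why adding -10000.0 is effectively the same as removing it.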

In [None]:
class NeuronForwardBertModel(BertModel):
    
    def __init__(self, config, compiler_options, optimization=None, use_cached_compiler_output=False ):
        super().__init__(config)
        assert(self.config.torchscript == True)
        self.compiler_options = compiler_options
        self.optimization=optimization
        self.use_cached_compiler_output=use_cached_compiler_output
        self.encoder = NeuronForwardBertEncoder(config)

        self.init_weights()
    
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
    ):
        
        ## Now use copied code from BertModel.forward
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is not None:
            input_shape = input_ids.size()
        elif inputs_embeds is not None:
            input_shape = inputs_embeds.size()[:-1]
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        device = input_ids.device if input_ids is not None else inputs_embeds.device

        if attention_mask is None:
            attention_mask = torch.ones(input_shape, device=device)
        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)

        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
        # ourselves in which case we just need to make it broadcastable to all heads.
        if attention_mask.dim() == 3:
            extended_attention_mask = attention_mask[:, None, :, :]
        elif attention_mask.dim() == 2:
            # Provided a padding mask of dimensions [batch_size, seq_length]
            # - if the model is a decoder, apply a causal mask in addition to the padding mask
            # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length]
            if self.config.is_decoder:
                batch_size, seq_length = input_shape
                seq_ids = torch.arange(seq_length, device=device)
                causal_mask = seq_ids[None, None, :].repeat(batch_size, seq_length, 1) <= seq_ids[None, :, None]
                causal_mask = causal_mask.to(
                    attention_mask.dtype
                )  # causal and attention masks must have same type with pytorch version < 1.3
                extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]
            else:
                extended_attention_mask = attention_mask[:, None, None, :]
        else:
            raise ValueError(
                "Wrong shape for input_ids (shape {}) or attention_mask (shape {})".format(
                    input_shape, attention_mask.shape
                )
            )

        # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
        # masked positions, this operation will create a tensor which is 0.0 for
        # positions we want to attend and -10000.0 for masked positions.
        # Since we are adding it to the raw scores before the softmax, this is
        # effectively the same as removing these entirely.
        extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0

        # If a 2D or 3D attention mask is provided for the cross-attention
        # we need to make it broadcastable to [batch_size, num_heads, seq_length, seq_length]
        if self.config.is_decoder and encoder_hidden_states is not None:
            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
            if encoder_attention_mask is None:
                encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)

            if encoder_attention_mask.dim() == 3:
                encoder_extended_attention_mask = encoder_attention_mask[:, None, :, :]
            elif encoder_attention_mask.dim() == 2:
                encoder_extended_attention_mask = encoder_attention_mask[:, None, None, :]
            else:
                raise ValueError(
                    "Wrong shape for encoder_hidden_shape (shape {}) or encoder_attention_mask (shape {})".format(
                        encoder_hidden_shape, encoder_attention_mask.shape
                    )
                )

            encoder_extended_attention_mask = encoder_extended_attention_mask.to(
                dtype=next(self.parameters()).dtype
            )  # fp16 compatibility
            encoder_extended_attention_mask = (1.0 - encoder_extended_attention_mask) * -10000.0
        else:
            encoder_extended_attention_mask = None

        # Prepare head mask if needed
        # 1.0 in head_mask indicate we keep the head
        # attention_probs has shape bsz x n_heads x N x N
        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
        if head_mask is not None:
            if head_mask.dim() == 1:
                head_mask = head_mask.unsqueeze(0).unsqueeze(0).unsqueeze(-1).unsqueeze(-1)
                head_mask = head_mask.expand(self.config.num_hidden_layers, -1, -1, -1, -1)
            elif head_mask.dim() == 2:
                head_mask = (
                    head_mask.unsqueeze(1).unsqueeze(-1).unsqueeze(-1)
                )  # We can specify head_mask for each layer
            head_mask = head_mask.to(
                dtype=next(self.parameters()).dtype
            )  # switch to float if needed + fp16 compatibility
        else:
            head_mask = [None] * self.config.num_hidden_layers

        embedding_output = self.embeddings(
            input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
        )
        
        #############################
        ## Neuron changes start here

        ## Set to eval - set up the inputs for tracing
        self.encoder.eval()        
        example_inputs=[embedding_output,extended_attention_mask]

        ## We remove these args from the example inputs since they break jit.trace
        #,head_mask,encoder_hidden_states,encoder_extended_attention_mask]

        ### Use this code to output the torch script graph for inspection, as well as the inputs for a sample
        ### to allow for work directly using torchscript
        """
        encoder = torch.jit.trace( self.encoder, example_inputs=example_inputs )  
        print("Save JIT trace to test.pt")
        torch.jit.save( encoder, "./test.pt")
        torch.save( example_inputs, "example_tensor.pt")
        with open("./test_graph.txt", "w") as f:
            f.write(str(encoder.graph))
        exit(1)
        """

        ## Compile the Neuron code into the encoder - once we trace the whole model and save it, we have what we need
        self.encoder = torch.neuron.trace( 
            self.encoder, example_inputs=example_inputs, 
            fallback=False, 
            compiler_workdir="./compile", 
            compiler_args=self.compiler_options, 
            optimize=self.optimization, 
            use_cached_compiler_output=self.use_cached_compiler_output )
        
        """
        # Original code
        encoder_outputs = self.encoder(
            embedding_output,
            attention_mask=extended_attention_mask,
            head_mask=head_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_extended_attention_mask, 
        )
        """

        # NOTE: Using ordered args rather than KW args
        #encoder_outputs = self.encoder(
        #    embedding_output,
        #    attention_mask=extended_attention_mask
        #)
        encoder_outputs = self.encoder(
            embedding_output,
            extended_attention_mask
        )
        
        ## END Neuron changes
        #############################
        
        sequence_output = encoder_outputs[0]
        pooled_output = self.pooler(sequence_output)

        outputs = (sequence_output, pooled_output,) + encoder_outputs[
            1:
        ]  # add hidden_states and attentions if they are here
        
        return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)


### Modify BertForSequenceClassification

The top level wrapper class.  This is just a constructor override that replaces the original BertModel with our updated one and passes through some additional arguments for torch.neuron.trace.

In [None]:
## Simple wrapper class to pass through some additional variables        
class NeuronForwardBertForSequenceClassification(BertForSequenceClassification):
    
    def __init__(self, config, compiler_options, optimization=None, use_cached_compiler_output=False ):

        super().__init__(config)
        
        self.bert = NeuronForwardBertModel(config,compiler_options, optimization=optimization, use_cached_compiler_output=use_cached_compiler_output)
        
        self.init_weights()

### Download Adapted MRPC Model

This code will fetch the model we saved earlier.  

### Download and decompress the S3 location in the cell above
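
The cell below reports what the extraction added by comparing directory listings before and after. A self-contained sketch of that technique, using a temporary directory and a tiny made-up archive (`toy-model.tar.gz` is a stand-in, not the real model tarball):

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    # Build a tiny archive to stand in for the downloaded model tarball.
    src = os.path.join(workdir, "src")
    os.makedirs(os.path.join(src, "toy-model"))
    with open(os.path.join(src, "toy-model", "config.json"), "w") as f:
        f.write("{}")
    archive = os.path.join(workdir, "toy-model.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(os.path.join(src, "toy-model"), arcname="toy-model")

    # Snapshot the target directory, extract, then diff the two snapshots.
    target = os.path.join(workdir, "out")
    os.makedirs(target)
    before = set(os.listdir(target))
    with tarfile.open(archive) as tar:
        tar.extractall(target)
    changed = sorted(set(os.listdir(target)) - before)
    extracted_config = os.path.exists(
        os.path.join(target, "toy-model", "config.json")
    )
```

The set difference only lists new top-level entries, so a re-extraction over existing files would report no changes.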

In [None]:
import boto3
import botocore
import tarfile
import os
from urllib.parse import urlparse

try:
    boto3_sess = boto3.session.Session()
except botocore.exceptions.NoCredentialsError:
    print("No credentials:  Use 'aws configure' to set up credentials or configure isengard (Amazon internal)")
    raise
except:
    raise

s3 = boto3_sess.resource('s3')
parsed = urlparse(s3_location)    

path = parsed.path.lstrip('/')
saved_model_tgz = os.path.basename(path)
    
if not os.path.exists(saved_model_tgz):
    print("Downloading file")
    s3.Bucket(parsed.netloc).download_file(path, saved_model_tgz)
else:
    print("File already downloaded")

directory = "./"
previous_directory_state = set(os.listdir(directory))
print("Decompressing file")
# Use a context manager so the archive is closed after extraction
with tarfile.open(saved_model_tgz) as t:
    t.extractall()

current_directory_state = set(os.listdir(directory))
changed_filenames = current_directory_state - previous_directory_state

print("Changes:")

for file in changed_filenames:
    print(file)

print("-- Pretrained model downloaded and decompressed")

### List the local files

You can visually confirm that the adapted model was downloaded and decompressed.

In [None]:
!ls -al 

### Set up a sanity test, neuron trace and save the model

This code will set up some input values, run a CPU sanity test, compile the encoder and generate several loadable PyTorch models.

Detail:

1. We use the tokenizer that belongs to the adapted model to create the input tokens, with a maximum length of 128 to encode the sentence pairs.  It also produces a mask that says which tokens belong to which sentence, and a mask that shows which elements are padding
1. We run the inference on CPU using our adapted model to sanity check our results (note it takes 80 - 90 seconds to run two BERT large inferences on CPU)
1. We compile the sub-graph with the modified network by running a forward pass (this takes ~30-40 minutes)
1. We do a torch.jit.trace on the whole thing in memory, and not just the sub-graph we have Neuron optimized (this creates TorchScript for the entire model, which we can save and load)
1. We test saving the model - then test loading it
1. We repeat this process until we have four models, one for each of the four Neuron cores of a single inf1, skipping the costly part (the compilation) on the repeats
1. We upload the set of four models to S3
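
To make the first step concrete, here is a plain-Python sketch of the three tensors the tokenizer returns for a sentence pair. It is not the Hugging Face tokenizer: the token ids for the words are made up, the helper `encode_pair` is hypothetical, and the padding id of 0 is an assumption about the vocabulary. The ids 101 and 102 match the [CLS] and [SEP] values printed by the notebook.

```python
CLS, SEP, PAD = 101, 102, 0  # PAD = 0 is assumed, not taken from the notebook


def encode_pair(ids_a, ids_b, max_length=128):
    """Build input_ids, token_type_ids and attention_mask for a sentence pair."""
    input_ids = [CLS] + ids_a + [SEP] + ids_b + [SEP]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    if len(input_ids) > max_length:
        raise ValueError("pair is longer than max_length")
    attention_mask = [1] * len(input_ids)
    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD] * padding,
        "token_type_ids": token_type_ids + [0] * padding,
        "attention_mask": attention_mask + [0] * padding,
    }


# Made-up token ids; a real tokenizer would produce these from text.
encoded = encode_pair([2001, 2002, 2003], [3001, 3002])
```

Every output has the constant width of 128, which is what lets the compiled encoder use fixed input shapes.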

In [None]:
def compile_bert():
    print("-- Loading MRPC Adapted BERT")
    tokenizer = BertTokenizer.from_pretrained("./bert-large-uncased-mrpc/")
 
    ## Run four inferences per neuron core
    batch_size = 4
    
    # Maximum input length for the two combined sentence input is 128
    max_length = 128
    
    # Let's make four models to load four neuron cores (one inf1.xlarge)
    num_models = 4
    
    ## Example sentences
    sentence_0 = "Federal agents said yesterday they are investigating the theft."
    sentence_1 = "The agents indicated they were looking for stolen property."
    sentence_2 = "A Cuban architect was sentenced to 20 years in prison Friday."
    inputs_1 = tokenizer.encode_plus(sentence_0, sentence_1, add_special_tokens=True, return_tensors='pt', max_length= max_length, pad_to_max_length=True)
    inputs_2 = tokenizer.encode_plus(sentence_0, sentence_2, add_special_tokens=True, return_tensors='pt', max_length= max_length ,pad_to_max_length=True)

    tokens_tensor = inputs_1['input_ids']
    segments_tensors = inputs_1['token_type_ids']
    attention_mask = inputs_1['attention_mask']

    print("=== Confirm that BERT is doing something sane ===")
    print("Sentence 0 = '{}'".format(sentence_0))
    print("Sentence 1 = '{}'".format(sentence_1))
    print("Sentence 2 = '{}'".format(sentence_2))
    print()
    
    print("==Tokens (word ids) tensor==")
    print(tokens_tensor)
    print(tokens_tensor.size())
    print()
    print("NB: 101 = [CLS] == start, 102 = [SEP] == separator")

    print("==Segment tensor (which words are in each sentence to compare)==")
    print(segments_tensors)
    print(segments_tensors.size())
    print()

    print("==Attention tensor (which positions are real tokens)==")
    print("==(words vs padding to 128 constant width)==")
    print(attention_mask)
    print(attention_mask.size())
    print()
    
    # Batch 2
    if batch_size == 2:
        tokens_tensor = torch.cat( [ inputs_1['input_ids'], inputs_2['input_ids'] ] )
        segments_tensors = torch.cat( [ inputs_1['token_type_ids'], inputs_2['token_type_ids'] ] )
        attention_mask = torch.cat( [ inputs_1['attention_mask'], inputs_2['attention_mask'] ] )

    # Batch 4
    if batch_size == 4:
        tokens_tensor = torch.cat( [ inputs_1['input_ids'], inputs_2['input_ids'], inputs_1['input_ids'], inputs_2['input_ids'] ] )
        segments_tensors = torch.cat( [ inputs_1['token_type_ids'], inputs_2['token_type_ids'], inputs_1['token_type_ids'], inputs_2['token_type_ids'] ] )
        attention_mask = torch.cat( [ inputs_1['attention_mask'], inputs_2['attention_mask'], inputs_1['attention_mask'], inputs_2['attention_mask'] ] )
    
    dummy_input_1 = [tokens_tensor, attention_mask, segments_tensors]
    
    print("Trace input sizes")
    print("===")
    for inp in dummy_input_1:
        print( inp.size() )
    
    bert_pretrained_model_dir="bert-large-uncased-mrpc/"
    
    ## These compiler options are not user tunable for now, please reach out to the neuron team through
    ## github if you feel that the default compiler options do not work correctly for your model
    ## these are tuned for BERT large
    compiler_options="--verbose=1 -O2"
    optimization="aggressive"
    neuron_model_output_names = []
    
    print()
    print("-- Generating {} model(s)".format(num_models))
    for i in range(num_models):
        name = "bert_large_mrpc_pytorch_batch" + str(batch_size) + '_' + str(i) + ".pt"
        print(" - {}".format(name) )
        neuron_model_output_names.append(name)
    
    print()
    print("Load pretrained model")
    
    neuron_model = NeuronForwardBertForSequenceClassification.from_pretrained( 
        bert_pretrained_model_dir, 
        torchscript=True, 
        compiler_options=compiler_options, 
        optimization=optimization )
    neuron_model.eval()
    
    ## Now compile for neuron test
    print()
    print("Partially compile Neuron Test BERT for PyTorch '{}'".format(neuron_model_output_names[0]) )
    start = time.time()    
    
    output = neuron_model(*dummy_input_1)
    print(output)
    delta = time.time() - start
    print("Compile time is {} seconds".format(delta))
    
    start = time.time()  
    
    ## What is this for?  To save the whole model we need to do a regular jit trace (not a neuron trace)
    ## over the *rest* of the model.  The already traced neuron part will be skipped. Then we have something 
    ## we can save and load into pytorch
    neuron_model = torch.jit.trace( neuron_model, example_inputs=dummy_input_1 )
    delta = time.time() - start
    
    print("Retrace time (to save) is {} seconds".format(delta))
    torch.jit.save( neuron_model, neuron_model_output_names.pop(0))
    
    num_models_left = num_models - 1

    for i in range(1, num_models):  # i is the index of the model being generated
        print()
        print("Load pretrained model")
        
        ## We are telling the code to skip the time consuming part - compiling the model    
        neuron_model = NeuronForwardBertForSequenceClassification.from_pretrained( 
            bert_pretrained_model_dir, 
            torchscript=True, 
            compiler_options=compiler_options, 
            optimization=optimization, 
            use_cached_compiler_output=True )
        
        neuron_model.eval()
        
        ## Now compile for neuron test
        print()
        print("Partially compile Neuron Test BERT for PyTorch '{}'".format(neuron_model_output_names[0]) )
        print("  -> compiler_args={}".format(compiler_options))
        start = time.time()
        output = neuron_model( *dummy_input_1 )
        delta = time.time() - start
        print("Compile time is {} seconds".format(delta))
        print(output)
        
        print("JIT trace the whole graph now so we can save it in torchscript #{} (retrace)".format(i))
        start = time.time()    
        neuron_model = torch.jit.trace( neuron_model, example_inputs=dummy_input_1 )
        delta = time.time() - start
        print("Retrace time (to save) is {} seconds".format(delta))
        
        #print("Attempt to save and reload")
        print("Save file")
        torch.jit.save( neuron_model, neuron_model_output_names.pop(0))        

## Now we can run it!
The compilation will take some time, but it produces a very fast BERT large model for inference.

A number of compiler messages will appear in your Jupyter notebook log.  This is normal, since the compiler runs as a sub-process.  You will also see messages like:

```
[E neuron_runtime.cpp:85] grpc server unix:/run/neuron.sock is unavailable. Is neuron-rtd running? Is socket /run/neuron.sock writable?
[E neuron_op_impl.cpp:52] Warning: Neuron runtime cannot be initialized; falling back to CPU execution
[E neuron_op_impl.cpp:53] Warning: Tensor output are ** NOT CALCULATED ** during CPU execution and only indicate tensor shape
```

These are normal on CPU and not a cause for concern.

In [None]:
compile_bert()

### Upload the created files to S3 (same bucket we fetched from)
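
As a sketch of where the files end up, the snippet below reproduces the key that upload_and_check_file builds for each model file: the directory part of s3_location plus the file name. The helper `destination_key` and the account ID in the bucket name are placeholders for illustration, and `posixpath` is used here so the sketch behaves the same on any operating system.

```python
import posixpath
from urllib.parse import urlsplit


def destination_key(s3_location, filename):
    """Key under which `filename` is stored next to the object at s3_location."""
    parts = urlsplit(s3_location)
    prefix = posixpath.dirname(parts.path).lstrip("/")
    return prefix + "/" + filename


# Placeholder account ID in the bucket name, for illustration only.
s3_location = (
    "s3://inferentia-test-123456789012/bert_tutorial/bert-large-uncased-mrpc.tar.gz"
)
batch_size, num_models = 4, 4
names = [
    "bert_large_mrpc_pytorch_batch{}_{}.pt".format(batch_size, i)
    for i in range(num_models)
]
keys = [destination_key(s3_location, name) for name in names]
```

So the four compiled models sit in the same folder as the adapted model tarball.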

In [None]:
import boto3
import botocore
from urllib.parse import urlsplit
from botocore.exceptions import ClientError
import os

        
assert s3_location is not None

batch_size=4
num_models=4
neuron_model_output_names=[]
for i in range(num_models):
    name = "bert_large_mrpc_pytorch_batch" + str(batch_size) + '_' + str(i) + ".pt"
    print(" - {}".format(name) )
    neuron_model_output_names.append(name)

assert( neuron_model_output_names != None )
assert( len(neuron_model_output_names) != 0 )

for filename in neuron_model_output_names:
    upload_and_check_file( s3_location, filename )
