## Set up your development environment

All the setup for your development work can be accomplished in a Python notebook.  Setup includes:

* Importing Python packages
* Connecting to a workspace to enable communication between your local computer and remote resources
* Creating an experiment to track all your runs
* Creating a remote compute target to use for training

### Import packages

Import Python packages you need in this session. Also display the Azure Machine Learning SDK version.

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

import azureml.core
from azureml.core import Workspace

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

### Connect to workspace

Create a workspace object from the existing workspace. `Workspace.from_config()` reads the file **config.json** and loads the details into an object named `ws`.

In [None]:
# load workspace configuration from the config.json file in the current folder.
ws = Workspace.from_config()
print(ws.name, ws.location, ws.resource_group, sep='\t')
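
For reference, **config.json** is a small JSON file with three keys: `subscription_id`, `resource_group`, and `workspace_name`. The sketch below writes a placeholder file to a temporary folder and checks that the keys are present. The helper name `missing_config_keys` and all values are illustrative assumptions, not part of the SDK.

```python
import json
import os
import tempfile

# Keys that Workspace.from_config() expects to find in config.json
REQUIRED_KEYS = {"subscription_id", "resource_group", "workspace_name"}

def missing_config_keys(path):
    """Return the required workspace keys that are absent from a config.json file."""
    with open(path) as f:
        config = json.load(f)
    return sorted(REQUIRED_KEYS - set(config))

# Placeholder values for illustration only; substitute your own workspace details.
example_config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w") as f:
    json.dump(example_config, f, indent=2)

print("Missing keys:", missing_config_keys(config_path))
```

A check like this can help when `Workspace.from_config()` fails because the file was edited by hand.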

### Create experiment

Create an experiment to track the runs in your workspace. A workspace can have multiple experiments.

In [None]:
experiment_name = 'bert-pretrain'

from azureml.core import Experiment
exp = Experiment(workspace=ws, name=experiment_name)

### Create or Attach existing compute resource
By using Azure Machine Learning Compute, a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. Examples include VMs with GPU support. In this tutorial, you create Azure Machine Learning Compute as your training environment. The code below creates the compute clusters for you if they don't already exist in your workspace.

**Creation of compute takes approximately 5 minutes.** If the AmlCompute with that name is already in your workspace the code will skip the creation process.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
cluster_name = "gpu-cluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', 
                                                           max_nodes=4)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it uses the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

# use get_status() to get a detailed status for the current cluster. 
print(compute_target.get_status().serialize())

# Sort out workbook VM stuff

Sometimes the environment variables inside the workbook VM are set so that even `pip` cannot be found. The cells below inspect and reset them. You can also add a local folder to `PYTHONPATH`.

NOTE: None of this is relevant for the compute VM; it is only for testing things locally.
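
As a minimal sketch of the two checks described above, the snippet below adds a folder to the module search path and reports which tools are visible on `PATH`. The folder name `bert_code` and the helper `add_to_python_path` are hypothetical; point the folder at wherever your local modules live.

```python
import os
import shutil
import sys

def add_to_python_path(folder):
    """Put a folder at the front of sys.path (once) so its modules can be imported."""
    folder = os.path.abspath(folder)
    if folder not in sys.path:
        sys.path.insert(0, folder)
    return folder

# Hypothetical folder name; replace it with the directory holding your local modules.
local_code_dir = add_to_python_path("bert_code")

# shutil.which returns None when the executable is not on PATH.
for tool in ("python", "pip", "conda"):
    print(tool, "->", shutil.which(tool))
```

Changing `sys.path` only affects the current kernel session, which suits local testing.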

In [None]:
!printenv

In [None]:
import sys

print(sys.executable)

In [None]:
import os
os.getcwd()

In [None]:
%env CONDA_PYTHON_EXE=/anaconda/bin/python
%env CONDA_DEFAULT_ENV=azureml_py36
%env PATH=/home/azureuser/bin:/home/azureuser/.local/bin:/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/bin:/anaconda/envs/azureml_py36/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin:/usr/local/cuda/bin:/dsvm/tools/cntk/cntk/bin

In [None]:
!conda --version

In [None]:
!python --version

In [None]:
!which python

In [None]:
!pip --version

In [None]:
!pip install tensorflow-gpu==1.13.1
!pip install nltk
!pip install regex

# Local test of assets

In [None]:
import os
import sys
import numpy as np
import json
import nltk
import pandas as pd
import csv
import random
import logging
import tensorflow as tf
from collections import Counter
import pathlib
import pickle

import modeling, optimization, tokenization
from run_pretraining import input_fn_builder, model_fn_builder

from text_preprocessing import tokenizer_word
from language_model_processing import read_raw_data_preprocess_and_save, create_vocab_df
from bpe import create_token_vocabulary, get_stats, merge_vocab, Encoder

In [None]:
language_maps_dir = "/mnt/azmnt/code/Users/Peter.Usherwood/BERT Pretrain/configandvocab"

def save_obj(obj, directory, name):
    # os.path.join accepts both str and pathlib.Path directories
    with open(os.path.join(directory, "{}.pkl".format(name)), 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_obj(name, directory):
    with open(os.path.join(directory, name + '.pkl'), 'rb') as f:
        return pickle.load(f)
      
vocab_to_id = load_obj('vocab_to_id', str(language_maps_dir))
len(vocab_to_id)

In [None]:
import modeling, optimization, tokenization

testcase = "Olá isso é mais uma BAGUNCA 😂😂😂"
bert_tokenizer = tokenization.FullTokenizer(language_maps_dir)
print(testcase)
print(bert_tokenizer.tokenize(testcase))

In [None]:
import json

bert_base_config = {
  "attention_probs_dropout_prob": 0.1, 
  "directionality": "bidi", 
  "hidden_act": "gelu", 
  "hidden_dropout_prob": 0.1, 
  "hidden_size": 768, 
  "initializer_range": 0.02, 
  "intermediate_size": 3072, 
  "max_position_embeddings": 512, 
  "num_attention_heads": 12, 
  "num_hidden_layers": 12, 
  "pooler_fc_size": 768, 
  "pooler_num_attention_heads": 12, 
  "pooler_num_fc_layers": 3, 
  "pooler_size_per_head": 128, 
  "pooler_type": "first_token_transform", 
  "type_vocab_size": 2, 
  "vocab_size": len(vocab_to_id)
}

with open(os.path.join(language_maps_dir, 'bert_config.json'), 'w') as f:
    json.dump(bert_base_config, f)
    
print(bert_base_config)
####################################load_vocab
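
Before training with a custom configuration, it can help to check that the sizes are consistent. The reference BERT code requires the hidden size to be a multiple of the number of attention heads. The sketch below checks that for the values used above; the helper name `attention_head_size` is an assumption made for illustration.

```python
def attention_head_size(config):
    """Return the per-head size, or raise if hidden_size is not divisible by the head count."""
    hidden = config["hidden_size"]
    heads = config["num_attention_heads"]
    if hidden % heads != 0:
        raise ValueError(
            "hidden_size ({}) is not a multiple of num_attention_heads ({})".format(hidden, heads)
        )
    return hidden // heads

# Sizes copied from the base configuration above
base_sizes = {"hidden_size": 768, "num_attention_heads": 12}
print("Attention head size:", attention_head_size(base_sizes))
```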

# Create a Datastore from blob

(https://aka.ms/azureml/howto/createdatasets)

In [None]:
import os
from azureml.core import Datastore

# Register the datastore with the workspace.
# The storage account key is read from an environment variable instead of being
# hardcoded in the notebook. The variable name AZURE_STORAGE_ACCOUNT_KEY is an
# assumption; set it to your own key before running this cell.
ds = Datastore.register_azure_blob_container(workspace=ws, 
                                             datastore_name='BERT_Preprocessed_Data',
                                             container_name='bertpretraining',
                                             account_name='ktbrdsdevstorage',
                                             account_key=os.environ['AZURE_STORAGE_ACCOUNT_KEY']
                                            )

# Help from: https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data

# Print the datastore attributes
print('Datastore name: ' + ds.name, 
      'Container name: ' + ds.container_name, 
      'Datastore type: ' + ds.datastore_type, 
      'Workspace name: ' + ds.workspace.name, sep = '\n')

# Make training Script

### Azure ML concepts  
Please note the following three things in the code below:
1. The script accepts arguments using the `argparse` package. In this case there is one argument, `--data-folder`, which specifies the file system folder in which the script can find the training data.
```
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-folder', type=str, dest='data_folder')
```
2. A script can access the Azure ML `Run` object by executing `run = Run.get_context()` and then use `run` to report metrics, such as the training and validation accuracy, as training progresses. The script below imports `Run` but does not log any metrics yet.
```
    run.log('training_acc', float(acc_train))
    run.log('validation_acc', float(acc_val))
```
3. When running the script on Azure ML, you can write files to a folder `./outputs`, relative to the root directory. Azure ML tracks this folder specially: any files written to it during script execution on the remote target are picked up by Run History and are available as artifacts in the run history record.
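
Points 1 and 3 above can be sketched without the Azure ML SDK. The snippet parses `--data-folder` from an explicit argument list and defines a helper that writes a small JSON file into an outputs folder. The helper name `write_summary`, the file name `summary.json`, and the sample path are assumptions for illustration.

```python
import argparse
import json
import os

def parse_args(argv=None):
    """Parse the single --data-folder argument; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-folder', type=str, dest='data_folder',
                        help='data folder mounting point')
    return parser.parse_args(argv)

def write_summary(summary, output_dir='./outputs'):
    """Write a JSON file into output_dir, creating the folder if needed."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, 'summary.json')
    with open(path, 'w') as f:
        json.dump(summary, f)
    return path

# Simulate how the cluster would invoke the script; the path is a placeholder.
args = parse_args(['--data-folder', '/tmp/data'])
print('Data folder:', args.data_folder)
```

Inside a remote run, calling `write_summary(...)` with the default `output_dir` would place the file under `./outputs`, where it is collected as a run artifact.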

In [None]:
%%writefile tf_bert.py
# Write script

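import numpy as np
import argparse
import os
import tensorflow as tf
import glob

from azureml.core import Run

# Local BERT modules; they must sit next to this script in the source directory
import modeling
from run_pretraining import input_fn_builder, model_fn_builder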

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
args = parser.parse_args()

data_folder = args.data_folder
print('Data folder:', data_folder)

# Input data pipeline config
TRAIN_BATCH_SIZE = 64 #@param {type:"integer"}
MAX_PREDICTIONS = 20 #@param {type:"integer"}
MAX_SEQ_LENGTH = 128 #@param {type:"integer"}
MASKED_LM_PROB = 0.15 #@param

# Training procedure config
EVAL_BATCH_SIZE = 64
LEARNING_RATE = 2e-5
TRAIN_STEPS = 1000000 #@param {type:"integer"}
SAVE_CHECKPOINTS_STEPS = 250 #@param {type:"integer"}


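model_weights_dir = './outputs/model'
pretraining_data_dir = '/mnt/azmnt/code/Users/Peter.Usherwood/BERT Pretrain/pretrainingbasedata'
# Folder holding the vocabulary and bert_config.json (same path as in the local test)
language_maps_dir = '/mnt/azmnt/code/Users/Peter.Usherwood/BERT Pretrain/configandvocab'

VOCAB_FILE = language_maps_dir + '/vocab_file.csv'
CONFIG_FILE = language_maps_dir + '/bert_config.json'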

INIT_CHECKPOINT = tf.train.latest_checkpoint(model_weights_dir)

bert_config = modeling.BertConfig.from_json_file(CONFIG_FILE)
input_files = tf.gfile.Glob(os.path.join(pretraining_data_dir,'*tfrecord'))

USE_TPU = False

#Model
model_fn = model_fn_builder(
      bert_config=bert_config,
      init_checkpoint=INIT_CHECKPOINT,
      learning_rate=LEARNING_RATE,
      num_train_steps=TRAIN_STEPS,
      num_warmup_steps=10,
      use_tpu=USE_TPU,
      use_one_hot_embeddings=True)

run_config = tf.contrib.tpu.RunConfig(
    model_dir=model_weights_dir,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    keep_checkpoint_max=5,
    keep_checkpoint_every_n_hours=1,
    log_step_count_steps=100)

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=USE_TPU,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE)
  
train_input_fn = input_fn_builder(
        input_files=input_files,
        max_seq_length=MAX_SEQ_LENGTH,
        max_predictions_per_seq=MAX_PREDICTIONS,
        is_training=True)

#Train
estimator.train(input_fn=train_input_fn, max_steps=TRAIN_STEPS)
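
The input pipeline constants in the script above are related. The BERT guidance is to set the maximum number of masked predictions to roughly the sequence length times the masking probability. The short calculation below checks that for the values used here and counts how many checkpoint saves the step settings imply. The variable names on the left are assumptions for this sketch.

```python
import math

# Values copied from the training script above
MAX_SEQ_LENGTH = 128
MASKED_LM_PROB = 0.15
MAX_PREDICTIONS = 20
TRAIN_STEPS = 1000000
SAVE_CHECKPOINTS_STEPS = 250

# Expected number of masked tokens per sequence, rounded up
suggested_max_predictions = math.ceil(MAX_SEQ_LENGTH * MASKED_LM_PROB)

# Number of checkpoint saves triggered over the whole run
checkpoint_saves = TRAIN_STEPS // SAVE_CHECKPOINTS_STEPS

print('Suggested max predictions:', suggested_max_predictions)
print('Configured max predictions:', MAX_PREDICTIONS)
print('Checkpoint saves over the run:', checkpoint_saves)
```

With `keep_checkpoint_max=5`, only the most recent checkpoints stay on disk, apart from those retained by `keep_checkpoint_every_n_hours`.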

Notice how the script gets data and saves models:

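+ The training script reads an argument to find the directory containing the data.  When you submit the job later, you point to the dataset for this argument:
`parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')`

+ The training script saves its checkpoints under `./outputs/model`. Anything written to `./outputs` is automatically uploaded to your workspace, so the checkpoints are available from the run record after training.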

# Train on a remote cluster

For this task, submit the job to the remote training cluster you set up earlier.  To submit a job you:
* Create a directory
* Create a training script
* Create an estimator object
* Submit the job 

### Create a directory

Create a directory to deliver the necessary code from your computer to the remote resource.

In [None]:
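import os
import shutil
script_folder = os.path.join(os.getcwd(), "virtual_assistant")
os.makedirs(script_folder, exist_ok=True)
# copy the training script written above into the folder that is uploaded to the cluster
shutil.copy('tf_bert.py', script_folder)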

### Create an estimator

An estimator object is used to submit the run. Azure Machine Learning has pre-configured estimators for common machine learning frameworks, as well as a generic `Estimator`. Create a `TensorFlow` estimator for the BERT pretraining script by specifying:

* The name of the estimator object, `est`
* The directory that contains your scripts. All the files in this directory are uploaded into the cluster nodes for execution.
* The compute target. In this case you use the AmlCompute cluster you created
* The training script name, passed as `entry_script`
* Parameters required by the training script, passed as `script_params`

In this tutorial, the target is AmlCompute. The `--data-folder` argument of the training script can be supplied through `script_params`, which the code below leaves empty.

In [None]:
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies

# set up environment
env = Environment('my_env')
# ensure latest azureml-dataprep and other required packages installed in the environment
cd = CondaDependencies.create(pip_packages=['keras',
                                            'azureml-sdk',
                                            'tensorflow==1.13.1',
                                            'matplotlib',
                                            'tensorflow-hub',
                                            'bokeh',
                                            'tf-sentencepiece',
                                            'simpleneighbors',
                                            'tqdm',
                                            'scikit-learn',
                                            'azureml-dataprep[pandas,fuse]>=1.1.14'])

env.python.conda_dependencies = cd

In [None]:
from azureml.train.dnn import TensorFlow

script_params = {}

est = TensorFlow(source_directory=script_folder,
                 script_params=script_params,
                 compute_target=compute_target,
                 entry_script='tf_bert.py', 
                 framework_version='1.13',
                 environment_definition=env)

### Submit the job to the cluster

Run the experiment by submitting the estimator object. You can then navigate to the Azure portal to monitor the run.

In [None]:
run = exp.submit(config=est)
run

Since the call is asynchronous, it returns a **Preparing** or **Running** state as soon as the job is started.

## Monitor a remote run

In total, the first run takes **approximately 10 minutes**. For subsequent runs, as long as the dependencies (the `env` environment passed to the estimator above) don't change, the same image is reused, so container start-up is much faster.

Here is what's happening while you wait:

- **Image creation**: A Docker image is created matching the Python environment specified by the estimator. The image is built and stored in the ACR (Azure Container Registry) associated with your workspace. Image creation and uploading takes **about 5 minutes**. 

  This stage happens once for each Python environment since the container is cached for subsequent runs.  During image creation, logs are streamed to the run history. You can monitor the image creation progress using these logs.

- **Scaling**: If the remote cluster requires more nodes to execute the run than currently available, additional nodes are added automatically. Scaling typically takes **about 5 minutes.**

- **Running**: In this stage, the necessary scripts and files are sent to the compute target, then data stores are mounted/copied, then the entry_script is run. While the job is running, stdout and the files in the ./logs directory are streamed to the run history. You can monitor the run's progress using these logs.

- **Post-Processing**: The ./outputs directory of the run is copied over to the run history in your workspace so you can access these results.


You can check the progress of a running job in multiple ways. This tutorial uses a Jupyter widget as well as a `wait_for_completion` method. 

### Jupyter widget

Watch the progress of the run with a Jupyter widget.  Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

In [None]:
from azureml.widgets import RunDetails
RunDetails(run).show()

In [None]:
# specify show_output to True for a verbose log
run.wait_for_completion(show_output=True) 
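
If you prefer to poll the status yourself, the loop can be sketched as below. It takes any callable that returns a status string, so it runs here against a stand-in sequence; with a real run you would pass `run.get_status`. The function name `poll_until_done`, the polling interval, and the poll limit are assumptions for illustration.

```python
import time

# Run states after which no further progress is expected
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}

def poll_until_done(get_status, interval_seconds=15, max_polls=240, sleep=time.sleep):
    """Call get_status() until a terminal state appears; return the distinct states seen in order."""
    seen = []
    for _ in range(max_polls):
        status = get_status()
        if not seen or seen[-1] != status:
            seen.append(status)
        if status in TERMINAL_STATES:
            return seen
        sleep(interval_seconds)
    raise TimeoutError("run did not reach a terminal state within the poll limit")

# Stand-in for run.get_status, so the sketch runs without a workspace
fake_statuses = iter(["Preparing", "Running", "Running", "Completed"])
history = poll_until_done(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(history)
```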

In the next tutorial you will explore this model in more detail.

# Register model

The training script wrote its checkpoints under `outputs/model` on the VM of the cluster where the job is executed. `outputs` is a special directory: all content in it is automatically uploaded to your workspace. This content appears in the run record in the experiment under your workspace, so the model files are now also available in your workspace.

You can see files associated with that run.

In [None]:
run.get_file_names()

In [None]:
# create a model folder in the current directory
os.makedirs('./model', exist_ok=True)

for f in run.get_file_names():
    if f.startswith('outputs/model'):
        output_file_path = os.path.join('./model', f.split('/')[-1])
        print('Downloading from {} to {} ...'.format(f, output_file_path))
        run.download_file(name=f, output_file_path=output_file_path)
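
The loop above keeps only the last path component, so files in different subfolders with the same name would overwrite each other. A possible variant, sketched below, keeps the relative folder structure under `./model`. The helper name `plan_downloads` and the sample file names are hypothetical.

```python
import os

def plan_downloads(file_names, prefix='outputs/model/', local_root='./model'):
    """Map remote file names under prefix to local paths that keep their subfolders."""
    plan = []
    for name in file_names:
        if name.startswith(prefix):
            relative_parts = name[len(prefix):].split('/')
            plan.append((name, os.path.join(local_root, *relative_parts)))
    return plan

# Hypothetical file names, standing in for run.get_file_names()
sample_names = [
    'outputs/model/checkpoint',
    'outputs/model/model.ckpt-250.index',
    'outputs/model/eval/events.out',
    'azureml-logs/70_driver_log.txt',
]

for remote_name, local_path in plan_downloads(sample_names):
    print(remote_name, '->', local_path)
```

With a real run, you would create each parent folder with `os.makedirs(os.path.dirname(local_path), exist_ok=True)` and then call `run.download_file(name=remote_name, output_file_path=local_path)` for each pair.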