# Intel CPU/DL Boost vs. NVIDIA GPU/CUDA on BERT in TensorFlow 2.x

Prepared by Forrest Sheng Bao

For NLP class at Iowa State University

Adapted from https://www.tensorflow.org/official_models/fine_tuning_bert

# Introduction
To gain ground in the deep learning (DL) revolution, Intel released [DL Boost](https://www.intel.com/content/www/us/en/artificial-intelligence/deep-learning-boost.html), a set of new instructions and software tools, so that its CPUs won't look too slow in DL training. Intel also maintains [a tailored version of TensorFlow](https://software.intel.com/content/www/us/en/develop/articles/intel-optimization-for-tensorflow-installation-guide.html) for its CPUs, including those with DL Boost. But how does it really turn out?

To compare the DL power of Intel's CPUs with DL Boost against NVIDIA's consumer-level GPUs, I will fine-tune the BERT base model on the MRPC task. 
The Python code and the machine are the same; only the (co)processor that runs the code differs. 


**Configurations**: 
* Intel CPU: i9-10980XE, 18 cores, 165 W, $1000 retail price
* NVIDIA GPU: RTX 3090, 350 W, $1500 retail price (although it could be $3000)
* RAM: 64 GB

All packages are binary installed via `pip`. 

# Conclusion (for those who cannot wait)

Intel lost the game completely! 

**Training speed**:
* 3 seconds per step with Intel-optimized TensorFlow on the Intel CPU (10980XE) with DL Boost
* 140 milliseconds per step with generic TensorFlow on the RTX 3090

The Intel CPU draws about half the power of the RTX 3090 (165 W vs. 350 W) but delivers only about 5% of its training speed. 
In other words, for the same amount of energy consumed, the Intel CPU gets only about 1/10 of the work done. 
Even after factoring in the price tag (speed per watt per dollar), the Intel CPU offers only about 1/7 of the value of the RTX 3090 at the $1500 retail price, or about 1/3 at a $3000 street price. 
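
The ratios above can be re-derived from the figures quoted in this write-up. The sketch below assumes that the nominal wattage of each device equals its actual draw during training, which was not measured here; the dictionary and function names are made up for illustration.

```python
# Figures quoted in this write-up (nominal power and prices, not measurements).
cpu = {"sec_per_step": 3.0, "watts": 165, "usd": 1000}
gpu = {"sec_per_step": 0.140, "watts": 350, "usd": 1500}

def steps_per_sec(dev):
    """Training throughput in steps per second."""
    return 1.0 / dev["sec_per_step"]

def steps_per_joule(dev):
    """Work done per unit of energy, assuming the device draws its nominal power."""
    return steps_per_sec(dev) / dev["watts"]

def steps_per_joule_per_usd(dev, usd=None):
    """Energy efficiency normalized by price; `usd` overrides the listed price."""
    return steps_per_joule(dev) / (dev["usd"] if usd is None else usd)

speed_ratio = steps_per_sec(cpu) / steps_per_sec(gpu)      # CPU speed relative to GPU
power_ratio = cpu["watts"] / gpu["watts"]                  # CPU power relative to GPU
energy_ratio = steps_per_joule(cpu) / steps_per_joule(gpu) # work per joule, CPU vs. GPU
value_ratio_retail = steps_per_joule_per_usd(cpu) / steps_per_joule_per_usd(gpu)
value_ratio_street = steps_per_joule_per_usd(cpu) / steps_per_joule_per_usd(gpu, usd=3000)

print(f"speed  {speed_ratio:.3f}")
print(f"power  {power_ratio:.3f}")
print(f"energy {energy_ratio:.3f}")
print(f"value  {value_ratio_retail:.3f} (retail), {value_ratio_street:.3f} (street)")
```

With the retail prices the value ratio comes out near 0.15; it approaches 0.3 only if the GPU is priced at $3000.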

**Takeaway**: If you are a DL researcher, do not waste your time and money on Intel CPUs. Follow the crowd. Use GPUs. 


# Setting up Intel-optimized Tensorflow 

Create a virtual environment if you do not want Intel's TensorFlow to interfere with the official TensorFlow that uses CUDA. 
When it does not detect a GPU, the official/generic TensorFlow also runs on the CPU cores, but with generic code that cannot fully unleash the power of Intel DL Boost. 

First,

```bash
pip install  tensorflow_hub tensorflow_datasets  tf-models-official
```

Due to dependencies, this will install the official TensorFlow, which has no 
special optimization for Intel's CPUs. 

Second, uninstall the official TensorFlow:
```bash
pip uninstall tensorflow
```

Third, install the Intel-optimized TensorFlow:
```bash
pip install intel-tensorflow-avx512
```

Now, please select the proper virtual environment, kernel, and/or TensorFlow 
version to run the experiment. 
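
Before running the experiment, it can help to confirm which TensorFlow build `pip` actually left in the active environment. The following is a minimal sketch using only the standard library; the helper name is made up for illustration, and `intel-tensorflow` is listed only as an additional distribution name that might be present.

```python
from importlib import metadata

# Distribution names to look for. "tensorflow" and "intel-tensorflow-avx512"
# come from the steps above; "intel-tensorflow" is an extra candidate.
CANDIDATES = ("tensorflow", "intel-tensorflow", "intel-tensorflow-avx512")

def installed_tensorflow_builds(names=CANDIDATES):
    """Return {distribution name: version} for every candidate that is installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # this build is not installed in the active environment
    return found

builds = installed_tensorflow_builds()
print(builds or "No TensorFlow build found in this environment")
```

If both `tensorflow` and an Intel build show up, the uninstall step probably did not take effect in this environment.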

In [14]:
import os,json

import numpy as np
# import matplotlib.pyplot as plt

import tensorflow as tf

import tensorflow_hub as hub
import tensorflow_datasets as tfds
tfds.disable_progress_bar()

# Load BERT related 
# !pip install tensorflow_hub tensorflow_datasets
# !pip3 install -q tf-models-official==2.4.0
from official.modeling import tf_utils
from official import nlp
from official.nlp import bert
import official.nlp.optimization
import official.nlp.bert.bert_models
import official.nlp.bert.configs
import official.nlp.bert.run_classifier
import official.nlp.bert.tokenization
import official.nlp.data.classifier_data_lib
import official.nlp.modeling.losses
import official.nlp.modeling.models
import official.nlp.modeling.networks

In [15]:
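# Settings for Intel CPUs. They should have no impact if you use a GPU.
# Time (in ms) a thread waits after a parallel region before going to sleep.
os.environ["KMP_BLOCKTIME"] = "1"

# Print the OpenMP runtime settings at startup.
os.environ["KMP_SETTINGS"] = "1"

# Bind OpenMP threads to cores, packing them as closely as possible.
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

# Number of OpenMP threads: 18 cores x 2 hardware threads each.
os.environ["OMP_NUM_THREADS"] = "36"

# Enable the oneDNN-optimized kernels.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"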


# Loading the task data 
Here we use the MRPC task as an example. 

In [16]:
glue, info = tfds.load('glue/mrpc', with_info=True,
                       # It's small, load the whole dataset
                       batch_size=-1)
print ("The splits are: \n\t", list(glue.keys()))
print ("Each sample is organized as: \n", info.features)
# print ("Each sample has two labels in MRPC: \n", info.features['label'].names)

The splits are: 
	 ['train', 'validation', 'test']
Each sample is organized as: 
 FeaturesDict({
    'idx': tf.int32,
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'sentence1': Text(shape=(), dtype=tf.string),
    'sentence2': Text(shape=(), dtype=tf.string),
})


# Preparing the data

## Specifying the location of BERT model files
The files are read from a Google Cloud location. 

Here we use the base BERT model. 

In [17]:
gs_folder_bert = "gs://cloud-tpu-checkpoints/bert/v3/uncased_L-12_H-768_A-12"
tf.io.gfile.listdir(gs_folder_bert)

['bert_config.json',
 'bert_model.ckpt.data-00000-of-00001',
 'bert_model.ckpt.index',
 'vocab.txt']

## Setting up and testing Tokenizer for BERT 

In [18]:
# Set up tokenizer to generate Tensorflow dataset
tokenizer = bert.tokenization.FullTokenizer(
    vocab_file=os.path.join(gs_folder_bert, "vocab.txt"),
     do_lower_case=True)

# Test 
print("Vocab size:", len(tokenizer.vocab))
tokens = tokenizer.tokenize("The multilaterial relationship is important between geopolitical organizations and multinational conglomerates")
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

Vocab size: 30522
['the', 'multi', '##late', '##rial', 'relationship', 'is', 'important', 'between', 'geo', '##pol', '##itical', 'organizations', 'and', 'multinational', 'conglomerate', '##s']
[1996, 4800, 13806, 14482, 3276, 2003, 2590, 2090, 20248, 18155, 26116, 4411, 1998, 20584, 22453, 2015]
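
The output above shows a misspelled word being split into `multi`, `##late`, and `##rial`. WordPiece tokenization does this with a greedy longest-match-first search over the vocabulary, marking non-initial pieces with `##`. The sketch below illustrates that search on a tiny hypothetical vocabulary; it is not the `FullTokenizer` implementation, and `wordpiece` and `toy_vocab` are names invented for this example.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into sub-tokens by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary; the real vocab.txt has 30522 entries.
toy_vocab = {"multi", "##late", "##rial", "conglomerate", "##s", "relationship"}

print(wordpiece("multilaterial", toy_vocab))
print(wordpiece("conglomerates", toy_vocab))
print(wordpiece("relationship", toy_vocab))
```

Because the search is greedy, a word that is in the vocabulary as a whole, such as `relationship` here, is never split.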


## Generating token IDs, input masks, and input types

In [19]:
def encode_sentence(s):
   tokens = list(tokenizer.tokenize(s))
#    tokens.append('[SEP]')
   return tokenizer.convert_tokens_to_ids(tokens)

# TODO: how does this compare with TF1.x version and how we can parallelize to
#       multiple CPU cores 
def bert_encode(glue, split):
#   num_examples = len(glue[split]["sentence1"])

  sentence1 = tf.ragged.constant([
      encode_sentence(s)
      for s in np.array(glue[split]["sentence1"])])
  sentence2 = tf.ragged.constant([
      encode_sentence(s)
       for s in np.array(glue[split]["sentence2"])])

  cls_column = [tokenizer.convert_tokens_to_ids(['[CLS]'])]*sentence1.shape[0]
  sep_column = [tokenizer.convert_tokens_to_ids(['[SEP]'])]*sentence1.shape[0]
  input_word_ids = tf.concat([cls_column, sentence1, sep_column, sentence2, sep_column], axis=-1)

  input_mask = tf.ones_like(input_word_ids).to_tensor()

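  type_cls = tf.zeros_like(cls_column)
  # The [SEP] that closes sentence 1 belongs to segment 0;
  # the [SEP] that closes sentence 2 belongs to segment 1.
  type_sep1 = tf.zeros_like(sep_column)
  type_sep2 = tf.ones_like(sep_column)
  type_s1 = tf.zeros_like(sentence1)
  type_s2 = tf.ones_like(sentence2)
  input_type_ids = tf.concat(
      [type_cls, type_s1, type_sep1, type_s2, type_sep2], axis=-1).to_tensor()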

  inputs = {
      'input_word_ids': input_word_ids.to_tensor(),
      'input_mask': input_mask,
      'input_type_ids': input_type_ids}

  return inputs

task_data = {
    split: {'inputs':bert_encode(glue, split), 'labels':glue[split]['label']} 
            for split in glue.keys()
}
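
To see what the three arrays look like for a single sentence pair, here is a plain-Python sketch of the same packing, without ragged tensors. It follows the usual BERT convention in which the final `[SEP]` counts as part of the second segment. The function name `pack_pair`, the `max_len` parameter, and the hard-coded special-token ids (101, 102, and 0 for `[CLS]`, `[SEP]`, and padding) are assumptions made for this illustration.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed ids of [CLS], [SEP], [PAD]

def pack_pair(ids1, ids2, max_len):
    """Pack two token-id lists as [CLS] s1 [SEP] s2 [SEP], padded to max_len."""
    word_ids = [CLS_ID] + ids1 + [SEP_ID] + ids2 + [SEP_ID]
    # Segment 0 covers [CLS], sentence 1 and its [SEP]; segment 1 covers the rest.
    type_ids = [0] * (len(ids1) + 2) + [1] * (len(ids2) + 1)
    mask = [1] * len(word_ids)  # 1 marks a real token, 0 marks padding
    pad = max_len - len(word_ids)
    if pad < 0:
        raise ValueError("pair is longer than max_len")
    return {
        "input_word_ids": word_ids + [PAD_ID] * pad,
        "input_mask": mask + [0] * pad,
        "input_type_ids": type_ids + [0] * pad,
    }

example = pack_pair([7, 8], [9], max_len=8)
for name, values in example.items():
    print(f"{name:15s} {values}")
```

In `bert_encode`, the padding length is not fixed in advance: `to_tensor()` pads every row to the longest pair in the split.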


# Loading the BERT model
`gs_folder_bert` was defined earlier when loading the tokenizer.

In [20]:
bert_config_file = os.path.join(gs_folder_bert, "bert_config.json")
config_dict = json.loads(tf.io.gfile.GFile(bert_config_file).read())
bert_config = bert.configs.BertConfig.from_dict(config_dict)

bert_classifier, bert_encoder = bert.bert_models.classifier_model(
    bert_config, num_labels=2)


checkpoint = tf.train.Checkpoint(encoder=bert_encoder)
checkpoint.read(
    os.path.join(gs_folder_bert, 'bert_model.ckpt')).assert_consumed()

2021-10-02 18:33:57.583654: E tensorflow/core/platform/cloud/curl_http_request.cc:614] The transmission  of request 0x1415b1a0 (URI: https://storage.googleapis.com/cloud-tpu-checkpoints/bert%2Fv3%2Funcased_L-12_H-768_A-12%2Fbert_config.json) has been stuck at 0 of 0 bytes for 61 seconds and will be aborted. CURL timing information: lookup time: 0.000424 (No error), connect time: 0.014014 (No error), pre-transfer time: 0 (No error), start-transfer time: 0 (No error)
2021-10-02 18:36:06.138441: E tensorflow/core/platform/cloud/curl_http_request.cc:614] The transmission  of request 0x14ec1150 (URI: https://storage.googleapis.com/cloud-tpu-checkpoints/bert%2Fv3%2Funcased_L-12_H-768_A-12%2Fbert_model.ckpt.data-00000-of-00001) has been stuck at 0 of 0 bytes for 61 seconds and will be aborted. CURL timing information: lookup time: 0.000378 (No error), connect time: 0.015309 (No error), pre-transfer time: 0 (No error), start-transfer time: 0 (No error)


<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f0cc9b3c8e0>

# Fine-tuning of the model 

In [21]:
# Set up epochs and steps
epochs = 3
batch_size = 32
eval_batch_size = 32

train_data_size = len(task_data['train']['labels'])
steps_per_epoch = int(train_data_size / batch_size)
num_train_steps = steps_per_epoch * epochs
warmup_steps = int(epochs * train_data_size * 0.1 / batch_size)

# creates an optimizer with learning rate schedule
optimizer = nlp.optimization.create_optimizer(
    2e-5, num_train_steps=num_train_steps, num_warmup_steps=warmup_steps)


metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy', dtype=tf.float32)]
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

bert_classifier.compile(
    optimizer=optimizer,
    loss=loss,
    metrics=metrics)

bert_classifier.fit(
      task_data['train']['inputs'], task_data['train']['labels'], 
      validation_data=(task_data['validation']['inputs'], task_data['validation']['labels']),
      batch_size=batch_size,
      epochs=epochs)


Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f0cc9b2c610>
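
The step counts and warm-up length computed in the fine-tuning cell can be checked by hand. The sketch below assumes an MRPC training set of 3668 examples (a figure not printed in this notebook) and models the schedule as a linear warm-up followed by a linear decay to zero. That shape is an assumption about what `create_optimizer` sets up, not a copy of its implementation, and `learning_rate` is a name invented for this example.

```python
train_data_size = 3668  # assumed size of the MRPC training split
epochs, batch_size, peak_lr = 3, 32, 2e-5

# Same arithmetic as in the fine-tuning cell.
steps_per_epoch = int(train_data_size / batch_size)
num_train_steps = steps_per_epoch * epochs
warmup_steps = int(epochs * train_data_size * 0.1 / batch_size)

def learning_rate(step):
    """Assumed schedule: linear warm-up to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, 1.0 - step / num_train_steps)

print("steps per epoch:", steps_per_epoch)
print("total steps:    ", num_train_steps)
print("warm-up steps:  ", warmup_steps)
for step in (0, warmup_steps // 2, warmup_steps, num_train_steps // 2, num_train_steps):
    print(f"step {step:3d}: lr = {learning_rate(step):.2e}")
```

Under these assumptions, roughly the first 10% of the steps ramp the learning rate up, and the remaining steps ramp it back down.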