# Fine-Tune a BERT Model and Create a Text Classifier

We have already performed feature engineering to create BERT embeddings from the `review_body` text using the pre-trained BERT model, and split the dataset into train, validation, and test files. To optimize for TensorFlow training, we saved the files in TFRecord format.

Now, let’s fine-tune the BERT model on our Customer Reviews Dataset and add a new classification layer to predict the `star_rating` for a given `review_body`.
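To make the role of the classification layer concrete, here is a minimal sketch in plain Python of how its raw output scores (logits) could be turned into a predicted `star_rating`. It assumes five classes, one per star rating from 1 to 5; the example logits are made up for illustration and do not come from the trained model.

```python
import math

# Assumption: star ratings run from 1 to 5, so the classifier has five output classes.
STAR_RATINGS = [1, 2, 3, 4, 5]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def predict_star_rating(logits):
    """Return the star rating whose class has the highest probability."""
    probs = softmax(logits)
    best_class = max(range(len(probs)), key=probs.__getitem__)
    return STAR_RATINGS[best_class]


# Hypothetical logits for one review; the highest score sits at index 4.
example_logits = [-1.2, -0.3, 0.1, 0.8, 2.5]
print(predict_star_rating(example_logits))  # 5
```

The class index is zero-based while star ratings start at 1, which is why the lookup goes through `STAR_RATINGS` instead of using the index directly.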

![BERT Training](img/bert_training.png)

As mentioned earlier, BERT is built on the Transformer architecture, whose core is the attention mechanism. This is, not coincidentally, the name of the popular BERT Python library, “Transformers,” maintained by a company called [Hugging Face](https://github.com/huggingface/transformers). We will use a variant of BERT called [DistilBERT](https://arxiv.org/pdf/1910.01108.pdf), which requires less memory and compute but maintains very good accuracy on our dataset.

# DEMO 2

# Run Model Training on Amazon Elastic Kubernetes Service (Amazon EKS)

Amazon EKS is a managed service that makes it easy for you to run Kubernetes on AWS without needing to install and operate your own Kubernetes control plane or worker nodes.

## Amazon FSx for Lustre

Amazon FSx for Lustre is a fully managed service that provides cost-effective, high-performance storage for compute workloads. Many workloads such as machine learning, high performance computing (HPC), video rendering, and financial simulations depend on compute instances accessing the same set of data through high-performance shared storage.

Powered by Lustre, the world's most popular high-performance file system, FSx for Lustre offers sub-millisecond latencies, up to hundreds of gigabytes per second of throughput, and millions of IOPS. It provides multiple deployment options and storage types to optimize cost and performance for your workload requirements.

FSx for Lustre file systems can also be linked to Amazon S3 buckets, allowing you to access and process data concurrently from both a high-performance file system and from the S3 API.
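As a rough illustration of that link, the sketch below maps an S3 URI to the path where the same object would appear on the mounted file system. It assumes the whole `fsx-container-demo` bucket is linked at the root of the file system and that the file system is mounted at `/opt/ml` inside the container. The mount point is our assumption, chosen to match the `/opt/ml/code/train.py` path used by the training command in this notebook.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

BUCKET = "fsx-container-demo"  # bucket name used in this notebook
MOUNT_POINT = "/opt/ml"        # assumption: where the file system is mounted in the container


def s3_uri_to_fsx_path(s3_uri, bucket=BUCKET, mount_point=MOUNT_POINT):
    """Translate s3://<bucket>/<key> into <mount_point>/<key>."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or parsed.netloc != bucket:
        raise ValueError(f"{s3_uri!r} is not inside s3://{bucket}/")
    return str(PurePosixPath(mount_point) / parsed.path.lstrip("/"))


print(s3_uri_to_fsx_path("s3://fsx-container-demo/code/train.py"))
# /opt/ml/code/train.py
```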

## Using the Amazon FSx for Lustre Container Storage Interface (CSI) Driver

The Amazon FSx for Lustre CSI driver allows Amazon EKS clusters to manage the lifecycle of Amazon FSx for Lustre file systems.

* https://docs.aws.amazon.com/eks/latest/userguide/fsx-csi.html
* https://github.com/kubernetes-sigs/aws-fsx-csi-driver
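Once the driver has provisioned a file system and a PersistentVolumeClaim exists, a pod consumes it through a `volumes` entry and a matching `volumeMounts` entry. The sketch below builds such a Pod manifest as a Python dictionary. The claim name `fsx-claim` matches the one used in this notebook's `train.yaml`; the pod name, container name, image, volume name, and mount path are placeholders we chose for illustration, not values taken from the notebook.

```python
import json


def pod_with_fsx_volume(pod_name, image, claim_name, mount_path):
    """Build a minimal Pod manifest that mounts a PersistentVolumeClaim."""
    volume_name = "fsx-volume"  # hypothetical; must be identical in volumes and volumeMounts
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "volumes": [
                {
                    "name": volume_name,
                    "persistentVolumeClaim": {"claimName": claim_name},
                }
            ],
            "containers": [
                {
                    "name": "main",  # hypothetical container name
                    "image": image,
                    "volumeMounts": [
                        {"name": volume_name, "mountPath": mount_path}
                    ],
                }
            ],
        },
    }


# Placeholder pod name, image, and mount path for illustration only.
manifest = pod_with_fsx_volume("fsx-demo", "python:3.8", "fsx-claim", "/opt/ml")
print(json.dumps(manifest, indent=2))
```

Because `kubectl` accepts JSON as well as YAML, the printed manifest could be saved to a file and passed to `kubectl create -f`.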


```
code/
	train.py

input/
	data/
		test/
			*.tfrecord
		train/
			*.tfrecord
		validation/
			*.tfrecord

```
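With this layout, a training script can discover its input files by split. The following is a minimal sketch, not the code from `train.py`: the helper name `list_tfrecords` is ours, and the data directory is passed in by the caller.

```python
from pathlib import Path

SPLITS = ("train", "validation", "test")


def list_tfrecords(data_dir):
    """Return {split: sorted list of .tfrecord paths} for each split directory."""
    data_dir = Path(data_dir)
    found = {}
    for split in SPLITS:
        split_dir = data_dir / split
        # A missing split directory simply yields an empty list.
        paths = split_dir.glob("*.tfrecord") if split_dir.is_dir() else []
        found[split] = sorted(str(p) for p in paths)
    return found
```

For example, `list_tfrecords("input/data")["train"]` would return the paths of the training files under `input/data/train/`.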

## List FSx Files

In [1]:
!aws s3 ls --recursive s3://fsx-container-demo/

2020-11-22 17:20:35          0 code/
2020-11-22 17:20:54      17519 code/train.py
2020-11-22 17:21:24        615 input/data/test/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-11-22 17:21:24        632 input/data/test/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
2020-11-22 17:21:24      10728 input/data/train/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-11-22 17:21:24      11812 input/data/train/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
2020-11-22 17:21:24        679 input/data/validation/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-11-22 17:21:24        642 input/data/validation/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
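To get a quick sense of how much data sits in each split, the listing can be summarized in a few lines of Python. This sketch assumes the four-column format shown above (date, time, size in bytes, key) and keys of the form `input/data/<split>/<file>.tfrecord`; the `SAMPLE` string reuses a few lines from the output above.

```python
from collections import defaultdict

SAMPLE = """
2020-11-22 17:20:35          0 code/
2020-11-22 17:20:54      17519 code/train.py
2020-11-22 17:21:24        615 input/data/test/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-11-22 17:21:24      10728 input/data/train/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-11-22 17:21:24      11812 input/data/train/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
"""


def bytes_per_split(listing):
    """Sum TFRecord sizes per split from `aws s3 ls --recursive` output."""
    totals = defaultdict(int)
    for line in listing.strip().splitlines():
        # Columns: date, time, size in bytes, object key.
        _date, _time, size, key = line.split(None, 3)
        parts = key.split("/")
        is_tfrecord = key.endswith(".tfrecord")
        if is_tfrecord and len(parts) == 4 and parts[:2] == ["input", "data"]:
            totals[parts[2]] += int(size)
    return dict(totals)


print(bytes_per_split(SAMPLE))  # {'test': 615, 'train': 22540}
```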


## Model Training Code `train.py`

In [2]:
!pygmentize code/train.py

import time
import random
import pandas as pd
from glob import glob
import pprint
import argparse
import json
import subprocess
import sys
import os
import tensorflow as tf

subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'transformers==2.8.0'])
subprocess.check_call([sys.executable, '

## Write `train.yaml`

In [3]:
!pygmentize ./train.yaml

---
apiVersion: v1
kind: Pod
metadata:
  name: bert-model-training
spec:
  volumes:
  - name: fsx-opt-ml
    persistentVolumeClaim:
      claimName: fsx-claim
  containers:
    - name: bert
      command:
        - python
        - /opt/ml/code/train.py
        - --epochs=3
        - --learning_rate=0.00001
        - --epsilon=0.00000001
        - --train_batch_size=128
        - --validation_batch_size=64
        - --test_batch_size=64
        - --train_steps_per_epoch=100
        - --validation_steps=10
        - --test_steps=10
        - --use_xla=True
        - --use_amp=False
        - --max_seq_length=64
        - --freeze_bert_layer=True
        - --run_validation=True
        - --run_test=True
        - --run_sample_predictions=True
      image: 7631

## Create Kubernetes Training Job

In [5]:
!kubectl get nodes

NAME                                           STATUS   ROLES    AGE   VERSION
ip-192-168-12-220.us-west-2.compute.internal   Ready    <none>   58m   v1.18.9-eks-d1db3c
ip-192-168-27-75.us-west-2.compute.internal    Ready    <none>   58m   v1.18.9-eks-d1db3c
ip-192-168-32-76.us-west-2.compute.internal    Ready    <none>   58m   v1.18.9-eks-d1db3c
ip-192-168-54-75.us-west-2.compute.internal    Ready    <none>   58m   v1.18.9-eks-d1db3c


In [13]:
#!kubectl delete -f train.yaml

In [7]:
!kubectl create -f train.yaml

pod/bert-model-training created


## Describe Training Job

In [14]:
!kubectl get pods

NAME                  READY   STATUS    RESTARTS   AGE
bert-model-training   1/1     Running   0          72s


In [15]:
!kubectl get pod bert-model-training

NAME                  READY   STATUS    RESTARTS   AGE
bert-model-training   1/1     Running   0          75s
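Instead of re-running `kubectl get pod` by hand, the pod's phase can be polled from Python until it finishes. This is a minimal sketch: it shells out to `kubectl` with a `jsonpath` output expression, and the helper names, timeout, and polling interval are our own choices. The `run` and `sleep` parameters exist only so the functions can be exercised without a cluster.

```python
import subprocess
import time


def get_pod_phase(pod_name, run=subprocess.run):
    """Return the pod phase, e.g. Pending, Running, Succeeded, or Failed."""
    result = run(
        ["kubectl", "get", "pod", pod_name, "-o", "jsonpath={.status.phase}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def wait_for_pod(pod_name, timeout_s=600, poll_s=5,
                 run=subprocess.run, sleep=time.sleep):
    """Poll until the pod reaches a terminal phase or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        phase = get_pod_phase(pod_name, run=run)
        if phase in ("Succeeded", "Failed"):
            return phase
        sleep(poll_s)
    raise TimeoutError(f"pod {pod_name} did not finish within {timeout_s}s")
```

Calling `wait_for_pod("bert-model-training")` would block until the training pod either succeeds or fails.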


In [16]:
!kubectl describe pod bert-model-training

Name:         bert-model-training
Namespace:    default
Priority:     0
Node:         ip-192-168-54-75.us-west-2.compute.internal/192.168.54.75
Start Time:   Sun, 22 Nov 2020 17:48:20 +0000
Labels:       <none>
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Running
IP:           192.168.58.13
IPs:
  IP:  192.168.58.13
Containers:
  bert:
    Container ID:  docker://a3f1f9c6f14b84f7023ea07d2f1ff167f1eff8cbcead11b3a10254ab98cdba3a
    Image:         763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.1.0-cpu-py36-ubuntu18.04
    Image ID:      docker-pullable://763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training@sha256:4911ac31a130c68a2f92b72dd81d22bd02b542cc549c5652f22c1f24e702eaf5
    Port:          <none>
    Host Port:     <none>
    Command:
      python
      /opt/ml/code/train.py
      --epochs=3
      --learning_rate=0.00001
      --epsilon=0.00000001
      --train_batch_size=128
      --validation_batch_size=64
      --test_batch_size=6
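The flags in the `Command` section are passed to `train.py` as command-line arguments. The script imports `argparse`, but its parsing code is not shown in this notebook, so the following is only a hypothetical sketch of how a subset of these flags could be parsed. One detail worth showing is the boolean flags: `argparse` with `type=bool` would treat the string `"False"` as true, so the sketch uses a small helper of our own, `str_to_bool`.

```python
import argparse


def str_to_bool(value):
    """Parse 'True'/'False' style strings into real booleans."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_args(argv):
    """Hypothetical parser covering a subset of the flags in train.yaml."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--learning_rate", type=float, default=0.00001)
    parser.add_argument("--train_batch_size", type=int, default=128)
    parser.add_argument("--train_steps_per_epoch", type=int, default=100)
    parser.add_argument("--use_xla", type=str_to_bool, default=False)
    parser.add_argument("--use_amp", type=str_to_bool, default=False)
    parser.add_argument("--freeze_bert_layer", type=str_to_bool, default=False)
    # Flags this sketch does not define are ignored rather than rejected.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args([
    "--epochs=3",
    "--learning_rate=0.00001",
    "--train_batch_size=128",
    "--train_steps_per_epoch=100",
    "--use_xla=True",
    "--use_amp=False",
    "--freeze_bert_layer=True",
    "--max_seq_length=64",  # not defined above, so it is ignored
])

# Examples processed over the whole run: batch size x steps per epoch x epochs.
print(args.train_batch_size * args.train_steps_per_epoch * args.epochs)  # 38400
```

With the values from `train.yaml`, each epoch processes 128 × 100 = 12,800 examples, for 38,400 over three epochs.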

## Review Training Job Logs

In [17]:
%%time

!kubectl logs -f bert-model-training

Collecting transformers==2.8.0
  Downloading transformers-2.8.0-py3-none-any.whl (563 kB)
Collecting regex!=2019.12.17
  Downloading regex-2020.11.13-cp36-cp36m-manylinux2014_x86_64.whl (723 kB)
Collecting filelock
  Downloading filelock-3.0.12-py3-none-any.whl (7.6 kB)
Collecting tqdm>=4.27
  Downloading tqdm-4.53.0-py2.py3-none-any.whl (70 kB)
Collecting dataclasses; python_version < "3.7"
  Downloading dataclasses-0.8-py3-none-any.whl (19 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.94-cp36-cp36m-manylinux2014_x86_64.whl (1.1 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.43.tar.gz (883 kB)
Collecting tokenizers==0.5.2
  Downloading tokenizers-0.5.2-cp36-cp36m-manylinux1_x86_64.whl (3.7 MB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py): started
  Building wheel for sacremoses (setup.py): finished with status 'done'
  Created wheel for sacremoses: filename=sacremoses-0.0.43-py3-none-any.whl size=893259 sha256=cb7