# Run Model Training on Amazon Elastic Kubernetes Service (Amazon EKS)

Amazon EKS is a managed service that makes it easy for you to run Kubernetes on AWS without needing to install and operate your own Kubernetes control plane or worker nodes.

## Amazon FSx for Lustre

Amazon FSx for Lustre is a fully managed service that provides cost-effective, high-performance storage for compute workloads. Many workloads, such as machine learning, high-performance computing (HPC), video rendering, and financial simulations, depend on compute instances accessing the same set of data through high-performance shared storage.

Powered by Lustre, the world's most popular high-performance file system, FSx for Lustre offers sub-millisecond latencies, up to hundreds of gigabytes per second of throughput, and millions of IOPS. It provides multiple deployment options and storage types to optimize cost and performance for your workload requirements.

FSx for Lustre file systems can also be linked to Amazon S3 buckets, allowing you to access and process data concurrently from both a high-performance file system and from the S3 API.

## Using the Amazon FSx for Lustre Container Storage Interface (CSI) Driver

The Amazon FSx for Lustre Container Storage Interface (CSI) driver allows Amazon EKS clusters to manage the lifecycle of Amazon FSx for Lustre file systems.

* https://docs.aws.amazon.com/eks/latest/userguide/fsx-csi.html
* https://github.com/kubernetes-sigs/aws-fsx-csi-driver


The file system used in this example has the following layout:

```
code/
    train.py
    test_data/
        *.tsv.gz

input/
    data/
        test/
            *.tfrecord
        train/
            *.tfrecord
        validation/
            *.tfrecord

model/
```
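
Before submitting a job, it can be useful to confirm that the mounted file system really has this layout. The following is a minimal sketch, assuming the file system is mounted at `/opt/ml` as in the Pod spec; `EXPECTED_DIRS` and `missing_paths` are hypothetical names introduced here for illustration and are not part of `train.py`.

```python
from pathlib import Path

# Directories the layout above expects, relative to the mount point.
EXPECTED_DIRS = [
    "code/test_data",
    "input/data/test",
    "input/data/train",
    "input/data/validation",
    "model",
]


def missing_paths(root):
    """Return the expected entries that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    # The training script itself must also be present.
    if not (root / "code" / "train.py").is_file():
        missing.append("code/train.py")
    return missing
```

Calling `missing_paths("/opt/ml")` inside the container returns an empty list when everything is in place.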

## List FSx Files

In [3]:
!aws s3 ls --recursive s3://fsx-antje/

2020-11-21 16:02:48          0 code/test_data/
2020-11-21 16:02:48   18997559 code/test_data/amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2020-10-30 18:14:13      24767 code/train.py
2020-10-30 18:14:13        615 input/data/test/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-10-30 18:14:13        632 input/data/test/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
2020-10-30 18:14:13      10728 input/data/train/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-10-30 18:14:13      11812 input/data/train/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
2020-10-30 18:14:13        679 input/data/validation/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-10-30 18:14:13        642 input/data/validation/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
2020-10-30 18:14:43          0 model/
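
To get a quick sense of how much data sits under each top-level prefix, the listing can be summarized in a few lines of Python. This is a sketch that assumes the four-column layout shown above (date, time, size in bytes, key); `summarize_listing` is a hypothetical helper name.

```python
from collections import defaultdict


def summarize_listing(text):
    """Sum object sizes (bytes) per top-level prefix from `aws s3 ls --recursive` output."""
    totals = defaultdict(int)
    for line in text.splitlines():
        parts = line.split(None, 3)  # date, time, size, key
        if len(parts) != 4:
            continue  # skip blank or unexpected lines
        size, key = int(parts[2]), parts[3]
        # The top-level prefix is everything before the first "/".
        totals[key.split("/", 1)[0]] += size
    return dict(totals)
```

The captured output of the `aws s3 ls` command can be passed to the function as a single string.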


## Model Training Code `train.py`

In [4]:
!pygmentize code/train.py

import time
import random
import pandas as pd
from glob import glob
import pprint
import argparse
import json
import subprocess
import sys
import os
import tensorflow as tf

subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'transformers==2.8.0'])
subprocess.check_call([sys.executable, '

## Write `train.yaml`

In [5]:
!pygmentize ./train.yaml

---
apiVersion: v1
kind: Pod
metadata:
  name: bert-model-training
spec:
  volumes:
  - name: fsx-opt-ml
    persistentVolumeClaim:
      claimName: fsx-claim
  containers:
    - name: bert
      command:
        - python
        - /opt/ml/code/train.py
        - --train_steps_per_epoch=100
        - --epochs=10
        - --learning_rate=0.00001
        - --epsilon=0.00000001
        - --train_batch_size=128
        - --validation_batch_size=64
        - --test_batch_size=64
        - --validation_steps=10
        - --test_steps=10
        - --use_xla=True
        - --use_amp=False
        - --max_seq_length=64
        - --freeze_bert_layer=True
        - --run_validation=True
        - --run_test=True
        - --run_sample_predictions=True
      image: 763
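
The command passes boolean options as strings such as `--use_xla=True` and `--use_amp=False`. With `argparse`, declaring such a flag as `type=bool` would treat the string `"False"` as true, so the script has to convert these values explicitly. The following sketch shows one way to do that. It is not the actual `train.py` (which is truncated above); `str_to_bool`, `parse_args`, and the default values are assumptions made for illustration.

```python
import argparse


def str_to_bool(value):
    """Convert 'True'/'False' style strings to bool (hypothetical helper)."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_args(argv=None):
    """Parse a subset of the flags used in train.yaml; defaults are assumed."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--learning_rate", type=float, default=1e-5)
    parser.add_argument("--max_seq_length", type=int, default=64)
    parser.add_argument("--use_xla", type=str_to_bool, default=False)
    parser.add_argument("--use_amp", type=str_to_bool, default=False)
    parser.add_argument("--freeze_bert_layer", type=str_to_bool, default=False)
    # parse_known_args tolerates flags that this sketch does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

For example, `parse_args(["--use_amp=False"]).use_amp` evaluates to `False` rather than to a truthy string.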

## Create Kubernetes Training Job

First delete any earlier pod with the same name. A `NotFound` error at this step only means there was nothing to delete.

In [19]:
!kubectl delete -f train.yaml

Error from server (NotFound): error when deleting "train.yaml": pods "bert-model-training" not found


In [20]:
!kubectl create -f train.yaml

pod/bert-model-training created


## Describe Training Job

In [37]:
!kubectl get pods

NAME                  READY   STATUS      RESTARTS   AGE
bert-csi-fsx          0/1     Completed   0          21d
bert-model-training   0/1     Pending     0          9m40s


In [38]:
!kubectl get pod bert-model-training

NAME                  READY   STATUS    RESTARTS   AGE
bert-model-training   0/1     Pending   0          9m44s
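
Because the pod can stay in `Pending` for several minutes, polling its phase from Python is sometimes more convenient than rerunning `kubectl get pod` by hand. This is a minimal sketch, assuming `kubectl` is on the `PATH` and configured for the cluster; `get_pod_phase` and `wait_while_pending` are hypothetical helpers, and the timeout and interval values are arbitrary.

```python
import subprocess
import time


def get_pod_phase(name, run=subprocess.run):
    """Return the pod's phase (for example 'Pending' or 'Running') via kubectl."""
    result = run(
        ["kubectl", "get", "pod", name, "-o", "jsonpath={.status.phase}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def wait_while_pending(name, timeout_s=600, interval_s=10,
                       run=subprocess.run, sleep=time.sleep):
    """Poll until the pod leaves 'Pending' or the timeout elapses; return the last phase."""
    waited = 0
    phase = get_pod_phase(name, run)
    while phase == "Pending" and waited < timeout_s:
        sleep(interval_s)
        waited += interval_s
        phase = get_pod_phase(name, run)
    return phase
```

If `wait_while_pending("bert-model-training")` still returns `Pending` after the timeout, the `Events` section of `kubectl describe pod` usually states why the pod has not been scheduled.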


In [39]:
!kubectl describe pod bert-model-training

Name:         bert-model-training
Namespace:    default
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Pending
IP:           
IPs:          <none>
Containers:
  bert:
    Image:      763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.1.0-cpu-py36-ubuntu18.04
    Port:       <none>
    Host Port:  <none>
    Command:
      python
      /opt/ml/code/train.py
      --train_steps_per_epoch=100
      --epochs=10
      --learning_rate=0.00001
      --epsilon=0.00000001
      --train_batch_size=128
      --validation_batch_size=64
      --test_batch_size=64
      --validation_steps=10
      --test_steps=10
      --use_xla=True
      --use_amp=False
      --max_seq_length=64
      --freeze_bert_layer=True
      --run_validation=True
      --run_test=True
      --run_sample_predictions=True
    Environment:  <none>
    Mounts:
      /opt/ml/ from fsx-opt-ml (rw)
      /var/run/secrets/kubernetes.io/servicea

## Review Training Job Logs

In [40]:
!kubectl logs -f bert-model-training