# TensorFlow CPU training

Create a pod file for your cluster. The pod file tells the cluster what to run. This one downloads Keras and runs a Keras example on the TensorFlow framework. Open vi or vim, paste in the following content, and save the file as tf.yaml. The file works with either TensorFlow or TensorFlow 2; to use TensorFlow 2, change the Docker image to a TensorFlow 2 image.



## Clone Deep Learning Containers Repo

In [None]:
#!git clone https://github.com/aws/deep-learning-containers.git

# Set Up Environment Variables

In [1]:
import boto3

aws_region_as_slist=!curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/\(.*\)[a-z]/\1/'
region = aws_region_as_slist.s
print('Region: {}'.format(region))

account_id=boto3.client('sts').get_caller_identity().get('Account')
print('Account ID: {}'.format(account_id))

bucket='sagemaker-{}-{}'.format(region, account_id)
print('S3 Bucket: {}'.format(bucket))

role='arn:aws:iam::{}:role/TeamRole'.format(account_id)
print('SageMaker Role ARN: {}'.format(role))

docker_repo='dlc-demo'
print('Docker Repo Name: {}'.format(docker_repo))

Region: us-west-2
Account ID: 231218423789
S3 Bucket: sagemaker-us-west-2-231218423789
SageMaker Role ARN: arn:aws:iam::231218423789:role/TeamRole
Docker Repo Name: dlc-demo
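
The cell above derives the region by stripping the zone letter from the instance's availability zone, then builds the bucket and role names from the region and account ID. A minimal Python sketch of the same string handling, with no metadata or AWS calls, is shown here; the helper names `region_from_az` and `derive_names` are illustrative and not part of the notebook.

```python
import re


def region_from_az(availability_zone: str) -> str:
    """Drop the trailing zone letter, e.g. 'us-west-2a' -> 'us-west-2'."""
    return re.sub(r"[a-z]$", "", availability_zone.strip())


def derive_names(region: str, account_id: str) -> dict:
    """Build the bucket name and role ARN the same way the notebook cell does."""
    return {
        "bucket": f"sagemaker-{region}-{account_id}",
        "role": f"arn:aws:iam::{account_id}:role/TeamRole",
    }


# Values taken from the cell output above.
names = derive_names(region_from_az("us-west-2a"), "231218423789")
print(names["bucket"])
print(names["role"])
```

This only mirrors the `sed` expression for standard zone names that end in a single letter.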


# Log In to ECR

In [2]:
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
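
The `aws ecr get-login` command used above exists only in AWS CLI version 1; it was removed in version 2. As a hedged alternative, the sketch below fetches the same credentials through boto3's `get_authorization_token` and decodes them. The function names are illustrative, and passing the result to `docker login` is left to the reader.

```python
import base64


def decode_ecr_token(authorization_token: str) -> tuple:
    """Split a base64-encoded 'user:password' ECR token into its two parts."""
    decoded = base64.b64decode(authorization_token).decode("utf-8")
    user, _, password = decoded.partition(":")
    return user, password


def ecr_docker_credentials(region: str, registry_id: str) -> dict:
    """Return username, password, and registry endpoint for `docker login`."""
    import boto3  # imported lazily so decode_ecr_token works without AWS access

    ecr = boto3.client("ecr", region_name=region)
    response = ecr.get_authorization_token(registryIds=[registry_id])
    auth = response["authorizationData"][0]
    user, password = decode_ecr_token(auth["authorizationToken"])
    return {"username": user, "password": password, "registry": auth["proxyEndpoint"]}


# Example call (requires AWS credentials), using the variables defined earlier:
# creds = ecr_docker_credentials(region, account_id)
```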


## Create Custom ECR Repo `dlc-demo`

In [18]:
!aws ecr describe-repositories --repository-names $docker_repo || aws ecr create-repository --repository-name $docker_repo


An error occurred (RepositoryNotFoundException) when calling the DescribeRepositories operation: The repository with name 'dlc-demo' does not exist in the registry with id '231218423789'
{
    "repository": {
        "repositoryArn": "arn:aws:ecr:us-west-2:231218423789:repository/dlc-demo",
        "registryId": "231218423789",
        "repositoryName": "dlc-demo",
        "repositoryUri": "231218423789.dkr.ecr.us-west-2.amazonaws.com/dlc-demo",
        "createdAt": 1603915788.0,
        "imageTagMutability": "MUTABLE",
        "imageScanningConfiguration": {
            "scanOnPush": false
        },
        "encryptionConfiguration": {
            "encryptionType": "AES256"
        }
    }
}
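
The shell one-liner above relies on `describe-repositories` failing before it creates the repository, which is why an error message precedes the JSON. The same describe-then-create pattern can be sketched with boto3 as follows; `ensure_repository` is a hypothetical helper name, and the client is passed in so the function can be exercised without AWS access.

```python
def ensure_repository(ecr_client, name: str) -> str:
    """Return the repository URI, creating the repository if it does not exist."""
    try:
        response = ecr_client.describe_repositories(repositoryNames=[name])
        return response["repositories"][0]["repositoryUri"]
    except ecr_client.exceptions.RepositoryNotFoundException:
        # Same fallback as the `||` in the shell command above.
        response = ecr_client.create_repository(repositoryName=name)
        return response["repository"]["repositoryUri"]


# Example call (requires AWS credentials):
# import boto3
# uri = ensure_repository(boto3.client("ecr", region_name=region), docker_repo)
```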


# Pull the Deep Learning Container for TensorFlow 2.1 Training

Available Deep Learning Container Images:  
https://github.com/aws/deep-learning-containers/blob/master/available_images.md


In [None]:
dlc_repo_account_id='763104351884'
print(dlc_repo_account_id)

In [None]:
train_image='763104351884.dkr.ecr.{}.amazonaws.com/tensorflow-training:2.1.0-cpu-py36-ubuntu18.04'.format(region)
print(train_image)

In [None]:
dlc_repo='763104351884.dkr.ecr.{}.amazonaws.com'.format(region)
print(dlc_repo)
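
The three cells above repeat the same registry pattern, `<account>.dkr.ecr.<region>.amazonaws.com`. A small sketch that builds both the registry host and the full image URI from their parts is given here; the helper names are illustrative and the values come from the cells above.

```python
DLC_ACCOUNT_ID = "763104351884"  # account that hosts the Deep Learning Containers


def ecr_registry(account_id: str, region: str) -> str:
    """Host name of the ECR registry for an account in a region."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Full image reference in the form registry/repository:tag."""
    return f"{ecr_registry(account_id, region)}/{repository}:{tag}"


train_image = image_uri(
    DLC_ACCOUNT_ID, "us-west-2", "tensorflow-training", "2.1.0-cpu-py36-ubuntu18.04"
)
print(train_image)
```

The same helper would also produce the custom image URI by passing the notebook's own account ID, `dlc-demo`, and the chosen tag.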

## Log In to the Official DLC Repo

In [None]:
!$(aws ecr get-login --region $region --registry-ids $dlc_repo_account_id --no-include-email)

## Pull DLC 

In [15]:
!docker system prune -a -f

Deleted Images:
untagged: 231218423789.dkr.ecr.us-west-2.amazonaws.com/dlc-demo@sha256:5ceb1ee9893b8d7d637b50187edc6ee47a35a165fea38908bf214177fdfe6722
deleted: sha256:b5e47ed796a4ee080ba916dd2ab286d0e505762bc3b5124af4ad2c3843c57dc3
deleted: sha256:571c790b73cca3511760c8f87a69f17cd90c00d3e9a54276803e950db0e077ee
deleted: sha256:2a307fb1280445026e389df311f34a1cf0fc8c84f33a9a1da66e12923f0c375c
deleted: sha256:005a1bc90034be3c8b824675571c04e8a5fa0829621cdcccc6abafdf30662c5c
deleted: sha256:0a0071ed0e01c014ef0f19b720b04f09bc7a6a4735bb784d7a4aa5d2d52dc9ca
deleted: sha256:8557c29cf27c79e5c6a08a88b8d45e1a0093472a7c0070711169798ff8bde6c2
untagged: 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.1.0-cpu-py36-ubuntu18.04
untagged: 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training@sha256:4911ac31a130c68a2f92b72dd81d22bd02b542cc549c5652f22c1f24e702eaf5
untagged: 231218423789.dkr.ecr.us-west-2.amazonaws.com/dlc-demo:bert
untagged: 231218423789.dkr.ecr.us-west-2.ama

In [19]:
!docker images

REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE


In [None]:
!docker pull $train_image

In [None]:
!docker images

# Extend DLC to your needs

## Create Dockerfile

In [None]:
# %%bash

# cat <<EoF > ./docker/Dockerfile
# # Use the Deep Learning Container as a base Image
# FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.1.0-cpu-py36-ubuntu18.04

# # Add any script or repo as required
# #ADD https://github.com/data-science-on-aws/workshop/blob/07ef8914a159012058bf4ad08493dc8da808b57f/12_kubeflow/wip/dlc/code/train_orig.py /opt/ml/code/train_orig.py
# ADD ../code/train_orig.py /opt/ml/code/train_orig.py

# RUN chmod +x /opt/ml/code/train_orig.py

# #WORKDIR "/usr/local/bin"
# #ENTRYPOINT ["python3", "train_eks.py"]
# WORKDIR "/"
# CMD ["bin/bash"]

# EoF

In [17]:
!docker images

REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE


## Build Container

In [5]:
!pygmentize ./Dockerfile

# Use the Deep Learning Container as a base Image
FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.1.0-cpu-py36-ubuntu18.04

# Add any script or repo as required
#ADD https://github.com/data-science-on-aws/workshop/blob/07ef8914a159012058bf4ad08493dc8da808b57f/12_kubeflow/wip/dlc/code/train_orig.py /opt/ml/code/train.py
ADD ./code/train.py /opt/ml/code/train.py
ADD ./data-tfrecord/ /opt/ml/input/data/

RUN chmod +x /opt/ml/code/train.py

#WORKDIR "/usr/local/bin"
#ENTRYPOINT ["python3", "train_eks.py"]
WORKDIR "/"
CMD ["bin/bash"]


In [20]:
docker_repo = 'dlc-demo'
docker_tag = 'bert'

bert_image_uri = f'{account_id}.dkr.ecr.{region}.amazonaws.com/{docker_repo}:{docker_tag}'
print(bert_image_uri)

231218423789.dkr.ecr.us-west-2.amazonaws.com/dlc-demo:bert


In [21]:
!docker images

REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE


In [22]:
!docker build --pull --no-cache -t $docker_repo:$docker_tag -f ./Dockerfile .

Sending build context to Docker daemon  286.5MB
Step 1/6 : FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.1.0-cpu-py36-ubuntu18.04
2.1.0-cpu-py36-ubuntu18.04: Pulling from tensorflow-training

26d33875: Pulling fs layer
29a9c730: Pulling fs layer
da195c84: Pulling fs layer
9a5ad49e: Pulling fs layer
d17a5040: Pulling fs layer
d18b5ac8: Pulling fs layer
c888bdfc: Pulling fs layer
2069664f: Pulling fs layer
caa11e65: Pulling fs layer
1ddf75e4: Pulling fs layer
ff9b9705: Pulling fs layer
ad67f82b: Pulling fs layer
4e9f6b89: Pulling fs layer
9a5ad49e: Waiting
d17a5040: Waiting
0bdfac1c: Pull complete

In [23]:
!docker inspect $docker_repo:$docker_tag

[
    {
        "Id": "sha256:701a64aed18a46ef729899c11cb9f3630b0d6a8c29e31966c2672b262369aaca",
        "RepoTags": [
            "dlc-demo:bert"
        ],
        "RepoDigests": [],
        "Parent": "sha256:710689675e01b8ee476960a43e076d3960c4c05c2a1d4ba239b7f9ce5792d420",
        "Comment": "",
        "Created": "2020-10-28T20:10:48.597746483Z",
        "Container": "5a50919692ea5e4228f683205ba0016c759fdda67999c3caba4a1b1130f1d742",
        "ContainerConfig": {
            "Hostname": "5a50919692ea",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/openmpi/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "DEBIAN_FRONTEND=noninteractive",
                "DEBCONF_NONINTERACTIVE_SEEN=true",
        

In [24]:
!docker images

REPOSITORY                                                         TAG                          IMAGE ID            CREATED             SIZE
dlc-demo                                                           bert                         701a64aed18a        2 seconds ago       2.04GB
763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training   2.1.0-cpu-py36-ubuntu18.04   88ccf0c00d27        6 months ago        2.04GB


# Push Container To ECR

In [25]:
!docker tag $docker_repo:$docker_tag $bert_image_uri

In [26]:
!docker images

REPOSITORY                                                         TAG                          IMAGE ID            CREATED             SIZE
231218423789.dkr.ecr.us-west-2.amazonaws.com/dlc-demo              bert                         701a64aed18a        7 seconds ago       2.04GB
dlc-demo                                                           bert                         701a64aed18a        7 seconds ago       2.04GB
763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training   2.1.0-cpu-py36-ubuntu18.04   88ccf0c00d27        6 months ago        2.04GB


In [27]:
!docker push $bert_image_uri

The push refers to repository [231218423789.dkr.ecr.us-west-2.amazonaws.com/dlc-demo]

bfb5eb2b: Preparing
13565381: Preparing
70df5756: Preparing
8c803ce7: Preparing
0b197b08: Preparing
bcd9e4f8: Preparing
d27bdf23: Preparing
47a735ee: Preparing
265b4a07: Preparing
af664231: Preparing
9474ac8d: Preparing
6c789bd1: Preparing
65d5ffd1: Preparing
e37f97cc: Preparing
027e9b6c: Preparing
2a8fc3be: Preparing
265b4a07: Waiting
83d4e999: Waiting
d27bdf23: Pushed

In [28]:
!aws ecr list-images --repository-name $docker_repo

{
    "imageIds": [
        {
            "imageDigest": "sha256:4bf49dbd7bdb5ab8a4c76086622ccfaa893690047c744d68a93d9ee577f1720f",
            "imageTag": "bert"
        }
    ]
}
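
The `list-images` response above can also be checked programmatically, for example to confirm that the pushed tag is present before a training pod is created. The sketch below parses a response with the same shape as the output above; `tags_in_repository` is an illustrative helper name.

```python
def tags_in_repository(list_images_response: dict) -> set:
    """Collect the image tags from an ECR list-images response."""
    return {
        image["imageTag"]
        for image in list_images_response.get("imageIds", [])
        if "imageTag" in image  # untagged images carry only a digest
    }


# Response copied from the cell output above.
response = {
    "imageIds": [
        {
            "imageDigest": "sha256:4bf49dbd7bdb5ab8a4c76086622ccfaa893690047c744d68a93d9ee577f1720f",
            "imageTag": "bert",
        }
    ]
}
print("bert" in tags_in_repository(response))
```

With boto3, the same dictionary would come from `boto3.client("ecr").list_images(repositoryName=docker_repo)`.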


# Define Training Job

## Create YAML File (Training Pod)

In [None]:
# %%bash
# cat <<EoF > test8.yaml
# --- 
# apiVersion: v1
# kind: Pod
# metadata: 
#   name: bert
# spec: 
#   containers: 
#     - 
#       command: 
#         - python
#         - /opt/ml/code/train_orig.py
#         - --train_data=s3://sagemaker-us-west-2-231218423789/training-pipeline-2020-09-05-16-19-31/processing/output/bert-train
#         - --validation_data=s3://sagemaker-us-west-2-231218423789/training-pipeline-2020-09-05-16-19-31/processing/output/bert-validation
#         - --test_data=s3://sagemaker-us-west-2-231218423789/training-pipeline-2020-09-05-16-19-31/processing/output/bert-test
#       image: "231218423789.dkr.ecr.us-west-2.amazonaws.com/dlc-demo:bert"
#       imagePullPolicy: Always
#       name: bert
#   restartPolicy: Never
# EoF

# Create Training Job

In [29]:
!pygmentize test-final.yaml

---
apiVersion: v1
kind: Pod
metadata:
  name: bert
spec:
  containers:
    - name: bert
      command:
        - python
        - /opt/ml/code/train.py
#        - --train_data=s3://sagemaker-us-west-2-231218423789/training-pipeline-2020-09-05-16-19-31/processing/output/bert-train
#        - --validation_data=s3://sagemaker-us-west-2-231218423789/training-pipeline-2020-09-05-16-19-31/processing/output/bert-validation
#        - --test_data=s3://sagemaker-us-west-2-231218423789/training-pipeline-2020-09-05-16-19-31/processing/output/bert-test
      image: "231218423789.dkr.ecr.us-west-2.amazonaws.com/dlc-demo:bert"
      imagePullPolicy: Always
      env:
        - name: SM_TR

In [35]:
!kubectl delete -f test-final.yaml

pod "bert" deleted


In [36]:
!kubectl get pods bert

Error from server (NotFound): pods "bert" not found


In [37]:
!kubectl create -f test-final.yaml

pod/bert created


In [38]:
!kubectl get pods bert

NAME   READY   STATUS    RESTARTS   AGE
bert   1/1     Running   0          4s
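
A pod that reports `Running` a few seconds after creation can still fail later, so it can help to inspect the status fields rather than the table output. The sketch below summarizes the JSON printed by `kubectl get pod bert -o json`; the sample document is a trimmed, hypothetical example, and `summarize_pod` is an illustrative helper name.

```python
import json


def summarize_pod(pod_json: str) -> dict:
    """Extract phase, reason, and message from `kubectl get pod -o json` output."""
    status = json.loads(pod_json).get("status", {})
    return {
        "phase": status.get("phase"),
        "reason": status.get("reason"),
        "message": status.get("message"),
    }


def is_finished(summary: dict) -> bool:
    """A pod in the Succeeded or Failed phase will not run any further."""
    return summary["phase"] in ("Succeeded", "Failed")


# Hypothetical, trimmed sample of the JSON for a pod that is still running.
sample = '{"metadata": {"name": "bert"}, "status": {"phase": "Running"}}'
summary = summarize_pod(sample)
print(summary["phase"], is_finished(summary))
```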


In [42]:
!kubectl describe pods bert

Name:         bert
Namespace:    kubeflow
Priority:     0
Node:         ip-192-168-67-206.us-west-2.compute.internal/
Start Time:   Wed, 28 Oct 2020 20:24:42 +0000
Labels:       <none>
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Failed
Reason:       Evicted
Message:      The node was low on resource: memory. Container bert was using 2587156Ki, which exceeds its request of 0. 
IP:           
Containers:
  bert:
    Image:      231218423789.dkr.ecr.us-west-2.amazonaws.com/dlc-demo:bert
    Port:       <none>
    Host Port:  <none>
    Command:
      python
      /opt/ml/code/train.py
      --train_steps_per_epoch=1
    Environment:
      SM_TRAINING_ENV:        {"is_master":true}
      SAGEMAKER_JOB_NAME:     tf-bert-training-eks
      SM_CURRENT_HOST:        localhost
      SM_NUM_GPUS:            0
      SM_HOSTS:               {"hosts":"localhost"}
      SM_MODEL_DIR:           /opt/ml/model/
      SM_OUTPUT_DIR:          /opt/ml/output/
      SM_OUTPUT_DATA_DIR:    

In [41]:
!kubectl logs -f bert

Error from server (BadRequest): container "bert" in pod "bert" is not available
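
The `describe` output above shows the pod was evicted because the node ran low on memory while the container used 2587156Ki (roughly 2.5 GiB) against a memory request of 0. One possible mitigation is to declare a memory request and limit on the container, so the scheduler places the pod on a node with enough free memory. The sketch below builds such a manifest as a Python dictionary and prints it as JSON, which `kubectl create -f` also accepts. The `4Gi` values are assumptions to tune for the actual workload, `restartPolicy: Never` is taken from the earlier draft manifest, and the environment variables from `test-final.yaml` are omitted for brevity.

```python
import json


def training_pod(image: str, memory_request: str = "4Gi", memory_limit: str = "4Gi") -> dict:
    """Pod manifest for the training container with explicit memory settings."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "bert"},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "bert",
                    "image": image,
                    "imagePullPolicy": "Always",
                    "command": [
                        "python",
                        "/opt/ml/code/train.py",
                        "--train_steps_per_epoch=1",
                    ],
                    # Assumed sizes; the evicted run used about 2.5 GiB.
                    "resources": {
                        "requests": {"memory": memory_request},
                        "limits": {"memory": memory_limit},
                    },
                }
            ],
        },
    }


manifest = training_pod("231218423789.dkr.ecr.us-west-2.amazonaws.com/dlc-demo:bert")
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and passing it to `kubectl create -f` would create the pod in the same way as the YAML file; the nodes still need enough allocatable memory to satisfy the request.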
