# EKS CSI FSX Lustre Setup

Amazon FSx for Lustre is a high-performance file system optimized for compute-intensive workloads such as deep learning. When linked to an S3 bucket, FSx provides POSIX-compliant file system access to the S3 data for multiple readers and writers simultaneously.

The Amazon FSx for Lustre Container Storage Interface (CSI) driver provides a CSI interface that allows Amazon EKS clusters to manage the lifecycle of Amazon FSx for Lustre file systems.

https://docs.aws.amazon.com/eks/latest/userguide/fsx-csi.html

In [97]:
import boto3
import json
from botocore.exceptions import ClientError

iam = boto3.client('iam')
sts = boto3.client('sts')
cfn = boto3.client('cloudformation')
eks = boto3.client('eks')

region = boto3.Session().region_name
cluster_name = 'demo'

# 1. Install the FSx CSI Driver for Kubernetes

## Create IAM Policy

Create an IAM policy and service account that allows the driver to make calls to AWS APIs on your behalf.

In [47]:
!pygmentize fsx/fsx-csi-driver.json

{
    "Version":"2012-10-17",
    "Statement":[
        {
            "Effect":"Allow",
            "Action":[
                "iam:CreateServiceLinkedRole",
                "iam:AttachRolePolicy",
                "iam:PutRolePolicy"
            ],
            "Resource":"arn:aws:iam::*:role/aws-service-role/s3.data-source.lustre.fsx.amazonaws.com/*"
        },
        {
            "Action":"iam:CreateServiceLinkedRole",
            "Effect":"Allow",
            "Resource":"*",
            "Condition":{
                "StringLike":{
                    "iam:AWSServiceName":[
                        "fsx.amazonaws.com"
                    ]
              

In [48]:
# !aws iam create-policy \
#     --policy-name Amazon_FSx_Lustre_CSI_Driver \
#     --policy-document file://fsx/fsx-csi-driver.json

In [None]:
with open('fsx/fsx-csi-driver.json') as json_file:
    data = json.load(json_file)
    policy = json.dumps(data)

try:
    response = iam.create_policy(
        PolicyName='Amazon_FSx_Lustre_CSI_Driver',
        PolicyDocument=policy
    )
    print("[OK] Policy created.")

except ClientError as e:
    if e.response['Error']['Code'] == 'EntityAlreadyExists':
        print("[OK] Policy already exists.")
    else:
        print("Error: %s" % e)

In [70]:
account_id = sts.get_caller_identity()['Account']
csi_policy_arn = 'arn:aws:iam::{}:policy/Amazon_FSx_Lustre_CSI_Driver'.format(account_id)
print(csi_policy_arn)

arn:aws:iam::231218423789:policy/Amazon_FSx_Lustre_CSI_Driver
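
The ARN above is assembled by string formatting with the `aws` partition hardcoded. As a minimal sketch, the same construction can be wrapped in a small helper; `build_policy_arn` and its `partition` parameter are hypothetical names introduced here for illustration, not part of boto3, and the account ID below is a placeholder.

```python
def build_policy_arn(account_id, policy_name, partition="aws"):
    """Return the ARN of a customer-managed IAM policy.

    `partition` is an assumption added for illustration: "aws" matches the
    notebook, while other partitions (for example "aws-cn") use a different
    prefix.
    """
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit string")
    return "arn:{}:iam::{}:policy/{}".format(partition, account_id, policy_name)


# Placeholder account ID, not a real account.
print(build_policy_arn("123456789012", "Amazon_FSx_Lustre_CSI_Driver"))
```

In the notebook, `account_id` would come from `sts.get_caller_identity()['Account']` as in the cell above.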


## Create Kubernetes IAM Service Account

Create a Kubernetes service account for the driver and attach the policy to the service account. Replace the ARN of the policy with the ARN returned in the previous step.

In [61]:
!eksctl create iamserviceaccount \
     --region $region \
     --name fsx-csi-controller-sa \
     --namespace kube-system \
     --cluster $cluster_name \
     --attach-policy-arn $csi_policy_arn \
     --approve

[ℹ]  eksctl version 0.30.0
[ℹ]  using region us-west-2
[!]  retryable error (RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) from ec2metadata/GetToken - will retry after delay of 36.025941ms
[!]  retryable error (RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) from ec2metadata/GetToken - will retry after delay of 117.25337ms
[!]  retryable error (RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) from ec2metadata/GetToken - will retry after delay of 205.097108ms
[!]  retryable error (RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context dead

In [88]:
cf_stack_name = 'eksctl-{}-addon-iamserviceaccount-kube-system-fsx-csi-controller-sa'.format(cluster_name)
print(cf_stack_name)

eksctl-demo-addon-iamserviceaccount-kube-system-fsx-csi-controller-sa


In [89]:
response = cfn.list_stack_resources(
    StackName=cf_stack_name
)
print(response)

In [95]:
iam_role_name = response['StackResourceSummaries'][0]['PhysicalResourceId']
print(iam_role_name)

eksctl-demo-addon-iamserviceaccount-kube-sys-Role1-F6F0V336BM0B
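
Indexing `StackResourceSummaries` at position `0` works only while the IAM role is the first resource in the stack. A slightly more defensive sketch filters on the resource type instead; `find_iam_role_name` is a hypothetical helper and the sample response is hand-written for illustration, shaped like the output of `list_stack_resources`.

```python
def find_iam_role_name(stack_resources):
    """Return the physical ID of the first AWS::IAM::Role resource in a
    list_stack_resources response, or raise if the stack has none."""
    for summary in stack_resources.get("StackResourceSummaries", []):
        if summary.get("ResourceType") == "AWS::IAM::Role":
            return summary["PhysicalResourceId"]
    raise LookupError("no AWS::IAM::Role resource found in stack")


# Hand-written sample (not real output) with the role in second position.
sample = {
    "StackResourceSummaries": [
        {
            "LogicalResourceId": "Other",
            "ResourceType": "AWS::SNS::Topic",
            "PhysicalResourceId": "example-topic",
        },
        {
            "LogicalResourceId": "Role1",
            "ResourceType": "AWS::IAM::Role",
            "PhysicalResourceId": "example-role",
        },
    ]
}
print(find_iam_role_name(sample))
```

With the real response from the previous cell, `find_iam_role_name(response)` would replace the `[0]` lookup.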


In [96]:
iam_role_arn = iam.get_role(RoleName=iam_role_name)['Role']['Arn']
print(iam_role_arn)

arn:aws:iam::231218423789:role/eksctl-demo-addon-iamserviceaccount-kube-sys-Role1-F6F0V336BM0B


# Deploy CSI Driver

In [99]:
!kubectl apply -k "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"


serviceaccount/fsx-csi-controller-sa configured
clusterrole.rbac.authorization.k8s.io/fsx-csi-external-provisioner-role created
clusterrolebinding.rbac.authorization.k8s.io/fsx-csi-external-provisioner-binding created
deployment.apps/fsx-csi-controller created
daemonset.apps/fsx-csi-node created
csidriver.storage.k8s.io/fsx.csi.aws.com created


Annotate the driver's service account with the IAM role that you just created, replacing the ARN with the correct role ARN.

In [100]:
!kubectl annotate serviceaccount -n kube-system fsx-csi-controller-sa \
 eks.amazonaws.com/role-arn=$iam_role_arn --overwrite=true

serviceaccount/fsx-csi-controller-sa annotated
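
If the annotation needs to be applied from Python rather than through the `!` shell escape, one possible sketch is to build the argument list explicitly so the role ARN is never interpolated by a shell. `annotate_command` is a hypothetical helper and the role ARN below is a placeholder.

```python
import shlex


def annotate_command(role_arn,
                     service_account="fsx-csi-controller-sa",
                     namespace="kube-system"):
    """Build the kubectl argument list that annotates the service account
    with the IAM role ARN (same flags as the cell above)."""
    return [
        "kubectl", "annotate", "serviceaccount",
        "-n", namespace, service_account,
        "eks.amazonaws.com/role-arn={}".format(role_arn),
        "--overwrite=true",
    ]


# Placeholder ARN for illustration only.
cmd = annotate_command("arn:aws:iam::123456789012:role/example-role")
print(" ".join(shlex.quote(part) for part in cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where `kubectl` is configured for the cluster.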


# Check S3 Bucket For FSX

In [106]:
bucket = 's3://fsx-antje'

In [None]:
#!aws s3 mb $bucket

In [118]:
!aws s3 ls $bucket

                           PRE code/
                           PRE input/
                           PRE model/


In [119]:
!aws s3 ls $bucket --recursive

2020-10-30 18:14:13      24767 code/train.py
2020-10-30 18:14:13        615 input/data/test/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-10-30 18:14:13        632 input/data/test/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
2020-10-30 18:14:13      10728 input/data/train/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-10-30 18:14:13      11812 input/data/train/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
2020-10-30 18:14:13        679 input/data/validation/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-10-30 18:14:13        642 input/data/validation/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
2020-10-30 18:14:43          0 model/
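
The `bucket` variable holds an `s3://` URI, which suits the AWS CLI, whereas boto3 S3 calls such as `list_objects_v2` take the bare bucket name and a key prefix. A minimal sketch of the conversion follows; `split_s3_uri` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket[/prefix] URI into (bucket_name, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("expected an s3://bucket[/prefix] URI, got {!r}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


print(split_s3_uri("s3://fsx-antje"))
print(split_s3_uri("s3://fsx-antje/input/data/train"))
```

The resulting pair could then be used as `s3.list_objects_v2(Bucket=name, Prefix=prefix)` with a boto3 S3 client.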


# Download Storage Class Manifest

In [126]:
!curl -o fsx/storageclass.yaml https://raw.githubusercontent.com/kubernetes-sigs/aws-fsx-csi-driver/master/examples/kubernetes/dynamic_provisioning_s3/specs/storageclass.yaml


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   336  100   336    0     0   8400      0 --:--:-- --:--:-- --:--:--  8400


## Get VPC ID and Subnet ID

In [146]:
%%bash

source ~/.bash_profile

#### Get VPC ID
export VPC_ID=$(aws ec2 describe-vpcs --filters "Name=tag:Name,Values=eksctl-${AWS_CLUSTER_NAME}-cluster/VPC" --query "Vpcs[0].VpcId" --output text)
echo "export VPC_ID=${VPC_ID}" | tee -a ~/.bash_profile

#### Get Subnet ID
export SUBNET_ID=$(aws ec2 describe-subnets --filters "Name=vpc-id,Values=${VPC_ID}" --query "Subnets[0].SubnetId" --output text)
echo "export SUBNET_ID=${SUBNET_ID}" | tee -a ~/.bash_profile

export VPC_ID=vpc-00befc57aa9827938
export SUBNET_ID=subnet-0b7feca8e1a89866e


## Create Security Group

In [153]:
%%bash

source ~/.bash_profile

export SEC_GROUP_ID=$(aws ec2 create-security-group --group-name eks-fsx-security-group --vpc-id ${VPC_ID} --description "FSx for Lustre Security Group" --query "GroupId" --output text)
echo "export SEC_GROUP_ID=${SEC_GROUP_ID}" | tee -a ~/.bash_profile

export SEC_GROUP_ID=sg-0068354d692c6b51d


## Add an ingress rule that opens up port 988 from the 192.168.0.0/16 CIDR range

In [154]:
%%bash

source ~/.bash_profile

aws ec2 authorize-security-group-ingress --group-id ${SEC_GROUP_ID} --protocol tcp --port 988 --cidr 192.168.0.0/16
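
For reference, the same rule can be expressed as the `IpPermissions` structure that boto3's `authorize_security_group_ingress` accepts. This is a sketch under the assumption that only TCP port 988 from one CIDR range is needed, as in the command above; `lustre_ingress_permission` is a hypothetical helper.

```python
import ipaddress


def lustre_ingress_permission(cidr, port=988):
    """Build one IpPermissions entry allowing TCP `port` from `cidr`."""
    # Raises ValueError if the CIDR string is malformed.
    ipaddress.ip_network(cidr)
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr}],
    }


permission = lustre_ingress_permission("192.168.0.0/16")
print(permission)
```

With an EC2 client, the call would be `ec2.authorize_security_group_ingress(GroupId=sec_group_id, IpPermissions=[permission])`, where `sec_group_id` is the ID returned when the security group was created.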

## Attach Security Group to Nodes

## Update the environment variables in the `storageclass.yaml` file

In [155]:
!pygmentize fsx/storageclass.yaml

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: subnet-0b7feca8e1a89866e
  securityGroupIds: sg-0068354d692c6b51d
  s3ImportPath: s3://fsx-antje
  s3ExportPath: s3://fsx-antje
  autoImportPolicy: NEW_CHANGED
  deploymentType: SCRATCH_2
mountOptions:
  - flock
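
Rather than editing the manifest by hand, the values can be filled in from Python. The sketch below uses `string.Template` from the standard library; the template text mirrors the manifest shown above, and the placeholder names (`subnet_id`, `security_group_id`, `s3_bucket`) and the `render_storage_class` helper are assumptions introduced for illustration.

```python
from string import Template

# Mirrors the storage class above, with placeholders for environment-specific values.
STORAGE_CLASS_TEMPLATE = Template("""\
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: $subnet_id
  securityGroupIds: $security_group_id
  s3ImportPath: $s3_bucket
  s3ExportPath: $s3_bucket
  autoImportPolicy: NEW_CHANGED
  deploymentType: SCRATCH_2
mountOptions:
  - flock
""")


def render_storage_class(subnet_id, security_group_id, s3_bucket):
    """Return the manifest text; substitute() raises KeyError if a
    placeholder is left without a value."""
    return STORAGE_CLASS_TEMPLATE.substitute(
        subnet_id=subnet_id,
        security_group_id=security_group_id,
        s3_bucket=s3_bucket,
    )


manifest = render_storage_class(
    "subnet-0b7feca8e1a89866e", "sg-0068354d692c6b51d", "s3://fsx-antje"
)
print(manifest)
```

The rendered text could be written to `fsx/storageclass.yaml` before running `kubectl create -f`.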


In [None]:
# %%bash

# source ~/.bash_profile

# # Populate SUBNET_ID, SECURITY_GROUP_ID, S3_BUCKET

# cd

# sed "s@SUBNET_ID@$SUBNET_ID@" fsx/fsx-s3-sc.yaml.template > fsx/fsx-s3-sc.yaml

# sed -i.bak -e "s@SECURITY_GROUP_ID@$SEC_GROUP_ID@" fsx/fsx-s3-sc.yaml

# sed -i.bak -e "s@S3_BUCKET@$S3_BUCKET@" fsx/fsx-s3-sc.yaml

# Create FSX Storage Class

In [156]:
!kubectl delete -f fsx/storageclass.yaml

Error from server (NotFound): error when deleting "code/storageclass.yaml": storageclasses.storage.k8s.io "fsx-sc" not found


In [157]:
!kubectl create -f fsx/storageclass.yaml

storageclass.storage.k8s.io/fsx-sc created


In [158]:
!kubectl get sc

NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
fsx-sc          fsx.csi.aws.com         Delete          Immediate              false                  4s
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  7h48m


# Create Claim

In [159]:
!curl -o fsx/claim.yaml https://raw.githubusercontent.com/kubernetes-sigs/aws-fsx-csi-driver/master/examples/kubernetes/dynamic_provisioning_s3/specs/claim.yaml

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   188  100   188    0     0    761      0 --:--:-- --:--:-- --:--:--   758


In [160]:
!pygmentize fsx/claim.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fsx-sc
  resources:
    requests:
      storage: 1200Gi


In [161]:
!kubectl delete -f fsx/claim.yaml

Error from server (NotFound): error when deleting "code/claim.yaml": persistentvolumeclaims "fsx-claim" not found


In [162]:
!kubectl apply -f fsx/claim.yaml

persistentvolumeclaim/fsx-claim created


In [177]:
!kubectl get pvc fsx-claim

NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
fsx-claim   Bound    pvc-15d2fbd7-a0ee-44a6-8ccb-c570d15f4690   1200Gi     RWX            fsx-sc         11m


In [178]:
!kubectl describe pvc fsx-claim

Name:          fsx-claim
Namespace:     default
StorageClass:  fsx-sc
Status:        Bound
Volume:        pvc-15d2fbd7-a0ee-44a6-8ccb-c570d15f4690
Labels:        <none>
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: fsx.csi.aws.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      1200Gi
Access Modes:  RWX
VolumeMode:    Filesystem
Mounted By:    <none>
Events:
  Type     Reason                Age                   From                                                                                      Message
  ----     ------                ----                  ----                                                                                      -------
  Normal   Provisioning          6m39s (x2 over 11m)   fsx.csi.aws.com_fsx-csi-controller-56b56d8d7c-pzvll_265ac4c3-1adb-11eb-bd19-52b9ba508e4f  External provisioner is provisioning volume for cl

## _Wait for status == Bound_
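
Provisioning took several minutes in the run above, so re-running `kubectl get pvc` by hand can be replaced with a polling loop. The following is a minimal sketch, not a definitive implementation: `wait_for_phase` and `pvc_phase` are hypothetical helpers, the timeout and interval defaults are arbitrary assumptions, and the demonstration uses a fake phase sequence so it runs without a cluster.

```python
import subprocess
import time


def pvc_phase(name="fsx-claim"):
    """Ask kubectl for the claim's current phase (Pending, Bound, ...)."""
    return subprocess.check_output(
        ["kubectl", "get", "pvc", name, "-o", "jsonpath={.status.phase}"],
        text=True,
    ).strip()


def wait_for_phase(get_phase, wanted="Bound", timeout=900.0, interval=10.0):
    """Call get_phase() every `interval` seconds until it returns `wanted`;
    raise TimeoutError once `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        phase = get_phase()
        if phase == wanted:
            return phase
        if time.monotonic() >= deadline:
            raise TimeoutError("still {!r} after {}s".format(phase, timeout))
        time.sleep(interval)


# Offline demonstration with a fake sequence of phases.
fake_phases = iter(["Pending", "Pending", "Bound"])
print(wait_for_phase(lambda: next(fake_phases), interval=0))
```

Against a real cluster, the call would be `wait_for_phase(pvc_phase)`.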

## Update FSX to `autoImportPolicy: NEW_CHANGED`

In [164]:
fsx = boto3.client('fsx')

In [175]:
response = fsx.describe_file_systems()
fsx_id = response['FileSystems'][0]['FileSystemId']
print(fsx_id)

fs-074bb60625bdd4a9a
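
Taking `FileSystems[0]` is only safe when the account has a single file system in the region. The CSI driver tags the file system it creates with `CSIVolumeName` set to the volume name (visible in the `update_file_system` response in this notebook), so one possible sketch selects by that tag instead. `find_file_system_id` is a hypothetical helper, and the second entry in the sample list is a made-up placeholder.

```python
def find_file_system_id(file_systems, volume_name):
    """Return the ID of the file system whose CSIVolumeName tag equals
    `volume_name`, or raise if none matches."""
    for fs in file_systems:
        tags = {tag["Key"]: tag["Value"] for tag in fs.get("Tags", [])}
        if tags.get("CSIVolumeName") == volume_name:
            return fs["FileSystemId"]
    raise LookupError("no file system tagged for volume {!r}".format(volume_name))


# Sample shaped like describe_file_systems()['FileSystems']; the first entry is a placeholder.
sample_file_systems = [
    {"FileSystemId": "fs-00000000000000000", "Tags": []},
    {
        "FileSystemId": "fs-074bb60625bdd4a9a",
        "Tags": [{"Key": "CSIVolumeName",
                  "Value": "pvc-15d2fbd7-a0ee-44a6-8ccb-c570d15f4690"}],
    },
]
print(find_file_system_id(sample_file_systems,
                          "pvc-15d2fbd7-a0ee-44a6-8ccb-c570d15f4690"))
```

In the notebook, the list would be `fsx.describe_file_systems()['FileSystems']` and the volume name is the `VOLUME` column of `kubectl get pvc fsx-claim`.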


In [176]:
response = fsx.update_file_system(
    FileSystemId=fsx_id,
    LustreConfiguration={
        'AutoImportPolicy': 'NEW_CHANGED'
    }
)
print(response)

{'FileSystem': {'OwnerId': '231218423789', 'CreationTime': datetime.datetime(2020, 10, 30, 19, 12, 15, 581000, tzinfo=tzlocal()), 'FileSystemId': 'fs-074bb60625bdd4a9a', 'FileSystemType': 'LUSTRE', 'Lifecycle': 'AVAILABLE', 'StorageCapacity': 1200, 'StorageType': 'SSD', 'VpcId': 'vpc-00befc57aa9827938', 'SubnetIds': ['subnet-0b7feca8e1a89866e'], 'NetworkInterfaceIds': ['eni-0eebeb5c214c2ac23', 'eni-0fba861f463339909'], 'DNSName': 'fs-074bb60625bdd4a9a.fsx.us-west-2.amazonaws.com', 'ResourceARN': 'arn:aws:fsx:us-west-2:231218423789:file-system/fs-074bb60625bdd4a9a', 'Tags': [{'Key': 'CSIVolumeName', 'Value': 'pvc-15d2fbd7-a0ee-44a6-8ccb-c570d15f4690'}], 'LustreConfiguration': {'WeeklyMaintenanceStartTime': '3:13:00', 'DataRepositoryConfiguration': {'Lifecycle': 'UPDATING', 'ImportPath': 's3://fsx-antje', 'ExportPath': 's3://fsx-antje', 'ImportedFileChunkSize': 1024, 'AutoImportPolicy': 'NEW_CHANGED'}, 'DeploymentType': 'SCRATCH_2', 'MountName': 'zvexvbmv', 'CopyTagsToBackups': False}}, 

## kubectl version

In [214]:
!kubectl version

Client Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.10-eks-bac369", GitCommit:"bac3690554985327ae4d13e42169e8b1c2f37226", GitTreeState:"clean", BuildDate:"2020-02-21T23:37:18Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.9-eks-4c6976", GitCommit:"4c6976793196d70bc5cd29d56ce5440c9473648e", GitTreeState:"clean", BuildDate:"2020-07-17T18:46:04Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
