# EKS CSI FSx for Lustre Setup

Amazon FSx for Lustre is a high-performance file system well suited to deep learning workloads. When linked to an S3 bucket, it exposes the bucket's objects through a POSIX-compliant file system that multiple readers and writers can access simultaneously.
  
The Amazon FSx for Lustre Container Storage Interface (CSI) driver provides a CSI interface that allows Amazon EKS clusters to manage the lifecycle of Amazon FSx for Lustre file systems.  

* https://docs.aws.amazon.com/eks/latest/userguide/fsx-csi.html
* https://github.com/kubernetes-sigs/aws-fsx-csi-driver

In [1]:
import boto3
import json
from botocore.exceptions import ClientError

iam = boto3.client('iam')
sts = boto3.client('sts')
cfn = boto3.client('cloudformation')
eks = boto3.client('eks')

region = boto3.Session().region_name
cluster_name = 'workshop'

# 1. Install the FSx CSI Driver for Kubernetes

## Create IAM Policy

Create an IAM policy and a service account that allow the driver to call AWS APIs on your behalf.

In [2]:
!pygmentize fsx/fsx-csi-driver.json

{
    "Version":"2012-10-17",
    "Statement":[
        {
            "Effect":"Allow",
            "Action":[
                "iam:CreateServiceLinkedRole",
                "iam:AttachRolePolicy",
                "iam:PutRolePolicy"
            ],
            "Resource":"arn:aws:iam::*:role/aws-service-role/s3.data-source.lustre.fsx.amazonaws.com/*"
        },
        {
            "Action":"iam:CreateServiceLinkedRole",
            "Effect":"Allow",
            "Resource":"*",
            "Condition":{
                "StringLike":{
                    "iam:AWSServiceName":[
                        "fsx.amazonaws.com"
                    ]
              

In [None]:
# !aws iam create-policy \
#     --policy-name Amazon_FSx_Lustre_CSI_Driver \
#     --policy-document file://fsx/fsx-csi-driver.json

In [3]:
with open('fsx/fsx-csi-driver.json') as json_file:
    data = json.load(json_file)
    policy = json.dumps(data)

try:
    response = iam.create_policy(
        PolicyName='Amazon_FSx_Lustre_CSI_Driver',
        PolicyDocument=policy
    )
    print("[OK] Policy created.")

except ClientError as e:
    if e.response['Error']['Code'] == 'EntityAlreadyExists':
        print("[OK] Policy already exists.")
    else:
        print("Error: %s" % e)

[OK] Policy already exists.


In [4]:
account_id = sts.get_caller_identity()['Account']
csi_policy_arn = 'arn:aws:iam::{}:policy/Amazon_FSx_Lustre_CSI_Driver'.format(account_id)
print(csi_policy_arn)

arn:aws:iam::231218423789:policy/Amazon_FSx_Lustre_CSI_Driver
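
The ARN above is assembled by string formatting, which assumes the standard `aws` partition and a policy created without a path. As a minimal sketch, the ARN could instead be looked up by name. `find_policy_arn` is a hypothetical helper, not part of the notebook, and it expects a boto3 IAM client such as the `iam` object created in the first cell.

```python
def find_policy_arn(iam_client, policy_name):
    """Return the ARN of a customer-managed policy by name, or None if absent.

    iam_client is assumed to be a boto3 IAM client (hypothetical helper).
    """
    # list_policies is paginated; Scope='Local' restricts the listing
    # to customer-managed policies in this account.
    paginator = iam_client.get_paginator('list_policies')
    for page in paginator.paginate(Scope='Local'):
        for policy in page['Policies']:
            if policy['PolicyName'] == policy_name:
                return policy['Arn']
    return None


# Possible usage in the notebook:
# csi_policy_arn = find_policy_arn(iam, 'Amazon_FSx_Lustre_CSI_Driver')
```

Returning `None` rather than raising lets the caller decide whether a missing policy means "create it" or "stop".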


## Create Kubernetes IAM Service Account

Create a Kubernetes service account for the driver and attach the policy to it, using the policy ARN printed in the previous step.

## _The next cell runs for about 10 minutes. Please be patient._

In [6]:
!eksctl create iamserviceaccount \
     --region $region \
     --name fsx-csi-controller-sa \
     --namespace kube-system \
     --cluster $cluster_name \
     --attach-policy-arn $csi_policy_arn \
     --approve

[ℹ]  eksctl version 0.32.0
[ℹ]  using region us-west-2
[!]  retryable error (RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) from ec2metadata/GetToken - will retry after delay of 37.102353ms
[!]  retryable error (RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) from ec2metadata/GetToken - will retry after delay of 82.788892ms
[!]  retryable error (RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)) from ec2metadata/GetToken - will retry after delay of 222.797624ms
[!]  retryable error (RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context dead

In [8]:
cf_stack_name = 'eksctl-{}-addon-iamserviceaccount-kube-system-fsx-csi-controller-sa'.format(cluster_name)
print(cf_stack_name)

eksctl-workshop-addon-iamserviceaccount-kube-system-fsx-csi-controller-sa
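
The role for the service account is created through the CloudFormation stack named above, so it can help to confirm that the stack has finished before reading its resources. The following is a small sketch under that assumption. `stack_status` and `stack_is_complete` are hypothetical helpers, and `cfn_client` is expected to be a boto3 CloudFormation client such as `cfn`.

```python
# Stack states treated as "finished successfully" in this sketch.
COMPLETE_STATES = {'CREATE_COMPLETE', 'UPDATE_COMPLETE'}


def stack_status(cfn_client, stack_name):
    """Return the StackStatus string of a single CloudFormation stack."""
    stacks = cfn_client.describe_stacks(StackName=stack_name)['Stacks']
    return stacks[0]['StackStatus']


def stack_is_complete(cfn_client, stack_name):
    """True when the stack reached one of the successful terminal states."""
    return stack_status(cfn_client, stack_name) in COMPLETE_STATES


# Possible usage in the notebook:
# assert stack_is_complete(cfn, cf_stack_name)
```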


In [9]:
response = cfn.list_stack_resources(
    StackName=cf_stack_name
)
print(response)

{'StackResourceSummaries': [{'LogicalResourceId': 'Role1', 'PhysicalResourceId': 'eksctl-workshop-addon-iamserviceaccount-kube-Role1-YQ8E1QVLFGWV', 'ResourceType': 'AWS::IAM::Role', 'LastUpdatedTimestamp': datetime.datetime(2020, 11, 21, 18, 10, 47, 955000, tzinfo=tzlocal()), 'ResourceStatus': 'CREATE_COMPLETE', 'DriftInformation': {'StackResourceDriftStatus': 'NOT_CHECKED'}}], 'ResponseMetadata': {'RequestId': '77ea8ffb-86c5-4354-92e7-b6ca0d0b8db7', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '77ea8ffb-86c5-4354-92e7-b6ca0d0b8db7', 'content-type': 'text/xml', 'content-length': '858', 'date': 'Sat, 21 Nov 2020 18:12:39 GMT'}, 'RetryAttempts': 0}}


In [10]:
iam_role_name = response['StackResourceSummaries'][0]['PhysicalResourceId']
print(iam_role_name)

eksctl-workshop-addon-iamserviceaccount-kube-Role1-YQ8E1QVLFGWV
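
Indexing `StackResourceSummaries` with `[0]` works here because the stack contains a single resource. A slightly more defensive sketch selects the resource by its type instead. `find_iam_role_name` is a hypothetical helper that operates on the dictionary returned by `list_stack_resources`.

```python
def find_iam_role_name(stack_resources):
    """Return the physical ID of the first AWS::IAM::Role in the response."""
    for resource in stack_resources['StackResourceSummaries']:
        if resource['ResourceType'] == 'AWS::IAM::Role':
            return resource['PhysicalResourceId']
    raise LookupError('no AWS::IAM::Role resource found in the stack')


# Possible usage in the notebook:
# iam_role_name = find_iam_role_name(response)
```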


In [11]:
iam_role_arn = iam.get_role(RoleName=iam_role_name)['Role']['Arn']
print(iam_role_arn)

arn:aws:iam::231218423789:role/eksctl-workshop-addon-iamserviceaccount-kube-Role1-YQ8E1QVLFGWV


# Deploy CSI Driver

In [12]:
!kubectl apply -k "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"


serviceaccount/fsx-csi-controller-sa configured
clusterrole.rbac.authorization.k8s.io/fsx-csi-external-provisioner-role created
clusterrolebinding.rbac.authorization.k8s.io/fsx-csi-external-provisioner-binding created
deployment.apps/fsx-csi-controller created
daemonset.apps/fsx-csi-node created
csidriver.storage.k8s.io/fsx.csi.aws.com created


Annotate the driver's service account with the ARN of the IAM role created above so that the controller pods can assume that role.

In [13]:
!kubectl annotate serviceaccount -n kube-system fsx-csi-controller-sa \
 eks.amazonaws.com/role-arn=$iam_role_arn --overwrite=true

serviceaccount/fsx-csi-controller-sa annotated


# Check S3 Bucket For FSx

In [14]:
bucket = 's3://fsx-antje'

In [15]:
#!aws s3 mb $bucket

In [16]:
!aws s3 ls $bucket

                           PRE code/
                           PRE input/
                           PRE model/


In [17]:
!aws s3 ls $bucket --recursive

2020-11-21 16:02:48          0 code/test_data/
2020-11-21 16:02:48   18997559 code/test_data/amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2020-11-21 16:21:10      21708 code/train.py
2020-10-30 18:14:13        615 input/data/test/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-10-30 18:14:13        632 input/data/test/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
2020-10-30 18:14:13      10728 input/data/train/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-10-30 18:14:13      11812 input/data/train/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
2020-10-30 18:14:13        679 input/data/validation/part-algo-1-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-10-30 18:14:13        642 input/data/validation/part-algo-2-amazon_reviews_us_Digital_Video_Games_v1_00.tfrecord
2020-10-30 18:14:43          0 model/
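
The same listing can be produced from Python. boto3's S3 calls take the bucket name and key prefix as separate arguments rather than an `s3://` URI, so the sketch below first splits the URI. `split_s3_uri` and `list_keys` are hypothetical helpers, and `s3_client` is assumed to be a boto3 S3 client.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/prefix' into ('bucket', 'prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError('not an S3 URI: {!r}'.format(uri))
    return parsed.netloc, parsed.path.lstrip('/')


def list_keys(s3_client, uri):
    """Yield (key, size) pairs for every object under an s3:// URI."""
    bucket_name, prefix = split_s3_uri(uri)
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # 'Contents' is absent from a page when nothing matches.
        for obj in page.get('Contents', []):
            yield obj['Key'], obj['Size']


# Possible usage in the notebook:
# for key, size in list_keys(boto3.client('s3'), bucket):
#     print(size, key)
```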


# Download Storage Class Manifest

In [None]:
!curl -o storageclass.yaml https://raw.githubusercontent.com/kubernetes-sigs/aws-fsx-csi-driver/master/examples/kubernetes/dynamic_provisioning_s3/specs/storageclass.yaml
    

## Get VPC ID and Subnet ID

In [18]:
%%bash

source ~/.bash_profile

#### Get VPC ID
export VPC_ID=$(aws ec2 describe-vpcs --filters "Name=tag:Name,Values=eksctl-${AWS_CLUSTER_NAME}-cluster/VPC" --query "Vpcs[0].VpcId" --output text)
echo "export VPC_ID=${VPC_ID}" | tee -a ~/.bash_profile

#### Get Subnet ID
export SUBNET_ID=$(aws ec2 describe-subnets --filters "Name=vpc-id,Values=${VPC_ID}" --query "Subnets[0].SubnetId" --output text)
echo "export SUBNET_ID=${SUBNET_ID}" | tee -a ~/.bash_profile

export VPC_ID=vpc-02e8cba0a081bf4ad
export SUBNET_ID=subnet-0a25103d821733ac0
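
For readers who prefer to stay in Python, the two AWS CLI lookups above can be sketched with the boto3 EC2 client. `find_vpc_and_subnet` is a hypothetical helper. It assumes, as the bash cell does, that the VPC carries the `eksctl-<cluster>-cluster/VPC` name tag and that the first subnet returned is acceptable.

```python
def find_vpc_and_subnet(ec2_client, cluster_name):
    """Return (vpc_id, subnet_id) for an eksctl-created cluster VPC.

    ec2_client is assumed to be a boto3 EC2 client (hypothetical helper).
    """
    vpc_name = 'eksctl-{}-cluster/VPC'.format(cluster_name)
    vpcs = ec2_client.describe_vpcs(
        Filters=[{'Name': 'tag:Name', 'Values': [vpc_name]}]
    )['Vpcs']
    if not vpcs:
        raise LookupError('no VPC tagged {!r}'.format(vpc_name))
    vpc_id = vpcs[0]['VpcId']

    subnets = ec2_client.describe_subnets(
        Filters=[{'Name': 'vpc-id', 'Values': [vpc_id]}]
    )['Subnets']
    if not subnets:
        raise LookupError('no subnets in VPC {}'.format(vpc_id))
    # Like the CLI query Subnets[0], this takes the first subnet returned.
    return vpc_id, subnets[0]['SubnetId']


# Possible usage in the notebook:
# vpc_id, subnet_id = find_vpc_and_subnet(boto3.client('ec2'), cluster_name)
```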


## Create Security Group

In [19]:
%%bash

source ~/.bash_profile

export SEC_GROUP_ID=$(aws ec2 create-security-group --group-name eks-fsx-security-group --vpc-id ${VPC_ID} --description "FSx for Lustre Security Group" --query "GroupId" --output text)
echo "export SEC_GROUP_ID=${SEC_GROUP_ID}" | tee -a ~/.bash_profile

export SEC_GROUP_ID=sg-0ee7f41b8573a5d2b


## Add an ingress rule that opens up port 988 from the 192.168.0.0/16 CIDR range

In [20]:
%%bash

source ~/.bash_profile

aws ec2 authorize-security-group-ingress --group-id ${SEC_GROUP_ID} --protocol tcp --port 988 --cidr 192.168.0.0/16
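
The rule only helps if the worker nodes' addresses actually fall inside the allowed range. The sketch below builds the same rule as a boto3-style `IpPermissions` entry and checks an address against the CIDR with the standard `ipaddress` module. The helper names and the sample addresses in the comments are illustrative assumptions.

```python
import ipaddress

LUSTRE_PORT = 988                 # port opened by the CLI call above
CLUSTER_CIDR = '192.168.0.0/16'   # range allowed by the CLI call above


def lustre_ingress_permission(cidr=CLUSTER_CIDR, port=LUSTRE_PORT):
    """Build one IpPermissions entry for authorize_security_group_ingress."""
    ipaddress.ip_network(cidr)  # raises ValueError on a malformed CIDR
    return {
        'IpProtocol': 'tcp',
        'FromPort': port,
        'ToPort': port,
        'IpRanges': [{'CidrIp': cidr}],
    }


def node_is_covered(node_ip, cidr=CLUSTER_CIDR):
    """True if node_ip lies inside the CIDR range the rule allows."""
    return ipaddress.ip_address(node_ip) in ipaddress.ip_network(cidr)


# Possible usage with a boto3 EC2 client (group ID taken from the notebook):
# ec2.authorize_security_group_ingress(
#     GroupId=sec_group_id, IpPermissions=[lustre_ingress_permission()])
```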

## Update the parameters in the `storageclass.yaml` file

In [21]:
!pygmentize fsx/storageclass.yaml

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: subnet-0a25103d821733ac0
  securityGroupIds: sg-0ee7f41b8573a5d2b
  s3ImportPath: s3://fsx-antje
  s3ExportPath: s3://fsx-antje
  autoImportPolicy: NEW_CHANGED
  deploymentType: SCRATCH_2
mountOptions:
  - flock
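
Rather than editing the manifest by hand, the subnet, security group, and bucket values gathered earlier could be substituted programmatically. This is a minimal sketch using plain string formatting. `render_storage_class` is a hypothetical helper, and the template simply mirrors the fields shown above.

```python
STORAGE_CLASS_TEMPLATE = """\
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: {subnet_id}
  securityGroupIds: {security_group_id}
  s3ImportPath: {s3_path}
  s3ExportPath: {s3_path}
  autoImportPolicy: NEW_CHANGED
  deploymentType: SCRATCH_2
mountOptions:
  - flock
"""


def render_storage_class(subnet_id, security_group_id, s3_path):
    """Fill the StorageClass template with environment-specific values."""
    return STORAGE_CLASS_TEMPLATE.format(
        subnet_id=subnet_id,
        security_group_id=security_group_id,
        s3_path=s3_path,
    )


# Possible usage in the notebook:
# with open('fsx/storageclass.yaml', 'w') as f:
#     f.write(render_storage_class(subnet_id, sec_group_id, bucket))
```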


# Create FSx Storage Class

In [None]:
#!kubectl delete -f fsx/storageclass.yaml

In [22]:
!kubectl create -f fsx/storageclass.yaml

storageclass.storage.k8s.io/fsx-sc created


In [23]:
!kubectl get sc

NAME            PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
fsx-sc          fsx.csi.aws.com         Delete          Immediate              false                  3s
gp2 (default)   kubernetes.io/aws-ebs   Delete          WaitForFirstConsumer   false                  36m


# Create Claim

In [None]:
#!curl -o claim.yaml https://raw.githubusercontent.com/kubernetes-sigs/aws-fsx-csi-driver/master/examples/kubernetes/dynamic_provisioning_s3/specs/claim.yaml

In [24]:
!pygmentize fsx/claim.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fsx-sc
  resources:
    requests:
      storage: 1200Gi


In [None]:
#!kubectl delete -f fsx/claim.yaml

In [25]:
!kubectl apply -f fsx/claim.yaml

persistentvolumeclaim/fsx-claim created
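
The claim requests `1200Gi`, which is not an arbitrary number. As far as the FSx `CreateFileSystem` API reference describes it, `SCRATCH_2` file systems accept 1200 GiB, 2400 GiB, and further increments of 2400 GiB. Treat that rule as an assumption to verify against the current AWS documentation. The sketch below, with hypothetical helper names, checks a requested size against it.

```python
def parse_gi(quantity):
    """Convert a Kubernetes quantity such as '1200Gi' to an integer GiB count."""
    if not quantity.endswith('Gi'):
        raise ValueError('expected a Gi quantity, got {!r}'.format(quantity))
    return int(quantity[:-2])


def is_valid_scratch2_capacity(gib):
    """Assumed SCRATCH_2 rule: 1200 GiB, or a positive multiple of 2400 GiB."""
    return gib == 1200 or (gib >= 2400 and gib % 2400 == 0)


# Possible usage before applying the claim:
# assert is_valid_scratch2_capacity(parse_gi('1200Gi'))
```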


In [30]:
!kubectl get pvc fsx-claim

NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
fsx-claim   Bound    pvc-a7b6809e-e622-45d6-a754-d8b9d0ee6e25   1200Gi     RWX            fsx-sc         21m


In [31]:
!kubectl describe pvc fsx-claim

Name:          fsx-claim
Namespace:     default
StorageClass:  fsx-sc
Status:        Bound
Volume:        pvc-a7b6809e-e622-45d6-a754-d8b9d0ee6e25
Labels:        <none>
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: fsx.csi.aws.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      1200Gi
Access Modes:  RWX
VolumeMode:    Filesystem
Mounted By:    <none>
Events:
  Type     Reason                Age                 From                                                                                      Message
  ----     ------                ----                ----                                                                                      -------
  Normal   Provisioning          11m (x3 over 21m)   fsx.csi.aws.com_fsx-csi-controller-55bcb55d5d-stdbn_34284e3c-2c25-11eb-8987-228ffe6b94e2  External provisioner is provisioning volume for claim "d

## _Wait for status == Bound_
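
Provisioning took more than 20 minutes in the run shown above, so polling may be more convenient than re-running the cell by hand. The following sketch polls until the claim reports `Bound`. `pvc_phase` and `wait_for_phase` are hypothetical helpers, and the timeout and interval defaults are arbitrary assumptions.

```python
import subprocess
import time


def pvc_phase(name, namespace='default'):
    """Read .status.phase of a PersistentVolumeClaim via kubectl."""
    result = subprocess.run(
        ['kubectl', 'get', 'pvc', name, '-n', namespace,
         '-o', 'jsonpath={.status.phase}'],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_phase(get_phase, wanted='Bound', timeout_s=1800,
                   interval_s=30, sleep=time.sleep):
    """Call get_phase() repeatedly until it returns `wanted` or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        phase = get_phase()
        if phase == wanted:
            return phase
        if time.monotonic() >= deadline:
            raise TimeoutError(
                'phase is {!r}, expected {!r}'.format(phase, wanted))
        sleep(interval_s)


# Possible usage in the notebook:
# wait_for_phase(lambda: pvc_phase('fsx-claim'))
```

Passing the phase reader in as a callable keeps the polling loop independent of `kubectl`, so it can be exercised without a cluster.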

## Update FSx to `autoImportPolicy: NEW_CHANGED`

In [None]:
fsx = boto3.client('fsx')

In [None]:
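# Takes the first file system returned. This assumes the account and region
# contain only the file system provisioned by the claim above.
response = fsx.describe_file_systems()
fsx_id = response['FileSystems'][0]['FileSystemId']
print(fsx_id)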
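
When the account holds more than one file system, index `[0]` may pick the wrong one. One way to narrow the choice, sketched here with a hypothetical helper, is to match on the S3 import path recorded in each file system's Lustre configuration. It inspects only the response page it is given, and the exact form of the stored `ImportPath` (for example a trailing slash) is an assumption the comparison tries to tolerate.

```python
def find_file_system_id(file_systems_response, s3_import_path):
    """Return the ID of the file system linked to the given S3 import path."""
    wanted = s3_import_path.rstrip('/')
    for fs in file_systems_response['FileSystems']:
        # Non-Lustre file systems have no LustreConfiguration key.
        repo = fs.get('LustreConfiguration', {}).get(
            'DataRepositoryConfiguration', {})
        if repo.get('ImportPath', '').rstrip('/') == wanted:
            return fs['FileSystemId']
    raise LookupError('no file system imports from {!r}'.format(s3_import_path))


# Possible usage in the notebook:
# fsx_id = find_file_system_id(fsx.describe_file_systems(), 's3://fsx-antje')
```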

In [None]:
response = fsx.update_file_system(
    FileSystemId=fsx_id,
    LustreConfiguration={
        'AutoImportPolicy': 'NEW_CHANGED'
    }
)
print(response)