title | description | published |
---|---|---|
AWS Cloud Deployment IAM Requirements | Deploying Chalk to your AWS account. | true |
import { ServiceDiagramSwitcher } from '@/components/ArchitectureDiagram'
Chalk's feature platform, with its best-in-class developer experience, lets machine learning teams focus on building the unique products and models that make their business stand out. Chalk provides a feature store so that you can deploy production machine learning pipelines for real-time data in minutes.
Chalk is both a framework and a platform: developers write code using familiar Python packages and deploy their feature and data pipeline definitions to Chalk's platform. In the Customer Cloud deployment, Chalk runs and administers its platform in the customer's cloud account. Chalk's managed infrastructure then executes the customer-defined pipelines to compute feature data for machine learning applications. Chalk then serves this data back to customer applications for online inference and to customer data teams for training set generation.
<ServiceDiagramSwitcher default_option={"AWS"} hideSwitch />
In order to manage infrastructure in your cloud account, Chalk requires certain IAM permissions. At a high level, Chalk needs the ability to provision the key components of your infrastructure:
- Storage resources (buckets for dataset storage and bulk insertion into offline storage)
- Networking resources (LBs, VPCs, etc.)
- IAM resources (e.g. service accounts for workload identity)
- Kubernetes resources (EKS)
- Online storage (RDS, ElastiCache, or DynamoDB)
- Offline storage (typically Snowflake or Redshift)
We typically recommend the following steps for enabling Chalk to manage AWS infrastructure in your cloud account:
- Create a new account in your AWS organization.
- Create a new role in your root AWS organization that enables Chalk's server to perform actions in this account. Ensure that the role's trust policy allows Chalk to assume it (sts:AssumeRole).
- Create an IAM policy using the JSON document below and attach this IAM policy to the created role.
{
"Statement": [
{
"Resource": "*",
"Action": [
"sqs:*",
"sns:*",
"acm:*",
"secretsmanager:*",
"s3:*",
"redshift:*",
"redshift-data:*",
"redshift-serverless:*",
"rds:*",
"logs:*",
"kms:*",
"kafkaconnect:*",
"kafka:*",
"kafka-cluster:*",
"iam:UploadServerCertificate",
"iam:UploadSSHPublicKey",
"iam:UpdateServerCertificate",
"iam:UpdateRoleDescription",
"iam:UpdateRole",
"iam:UpdateOpenIDConnectProviderThumbprint",
"iam:UpdateAssumeRolePolicy",
"iam:RemoveRoleFromInstanceProfile",
"iam:RemoveClientIDFromOpenIDConnectProvider",
"iam:PutRolePolicy",
"iam:PassRole",
"iam:ListSSHPublicKeys",
"iam:ListRoles",
"iam:ListRolePolicies",
"iam:ListPolicyVersions",
"iam:ListPolicyTags",
"iam:ListPolicies",
"iam:ListOpenIDConnectProviders",
"iam:ListOpenIDConnectProviderTags",
"iam:ListAttachedRolePolicies",
"iam:GetServerCertificate",
"iam:GetSSHPublicKey",
"iam:GetRolePolicy",
"iam:GetRole",
"iam:GetPolicyVersion",
"iam:GetPolicy",
"iam:GetOpenIDConnectProvider",
"iam:GetInstanceProfile",
"iam:DetachRolePolicy",
"iam:DeleteServiceLinkedRole",
"iam:DeleteServerCertificate",
"iam:DeleteRolePolicy",
"iam:DeleteRole",
"iam:DeleteOpenIDConnectProvider",
"iam:DeleteInstanceProfile",
"iam:CreateServiceLinkedRole",
"iam:CreateRole",
"iam:CreatePolicyVersion",
"iam:CreatePolicy",
"iam:CreateOpenIDConnectProvider",
"iam:CreateInstanceProfile",
"iam:AttachRolePolicy",
"iam:AddRoleToInstanceProfile",
"iam:AddClientIDToOpenIDConnectProvider",
"iam:ListInstanceProfilesForRole",
"iam:DeletePolicy",
"elasticloadbalancing:*",
"eks:*",
"ecr:*",
"ec2:*",
"cloudwatch:*",
"autoscaling:*",
"application-autoscaling:*",
"dynamodb:*",
"dax:*"
],
"Effect": "Allow"
}
],
"Version": "2012-10-17"
}
cloudwatch
: The metrics viewer in the web UI
logs
: Logs for the web UI
acm
: Provisioning SSL certs for client<>server encryption
rds
: RDS online store
redshift
: Redshift offline store
redshift-data
: Redshift offline store
redshift-serverless
: Redshift offline store
kafka
: Asynchronous persistence queues for metrics & feature storage
kafka-cluster
: Asynchronous persistence queues for metrics & feature storage
dynamodb
: DynamoDB online store
dax
: DynamoDB online store
application-autoscaling
: DynamoDB online store, if auto-scaling is required
ec2
: ALB and EKS node pool management
ecr
: ECR image management for deployments
eks
: EKS cluster management for running feature engineering workloads
kms
: Encryption keys for secrets
secretsmanager
: Encrypting datasource secrets from the web UI
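The permissions policy above controls what the role may do. Who may assume the role is controlled separately by the role's trust policy. As a minimal sketch of what such a trust policy might look like, the Python below builds one as a JSON document. The principal ARN and external ID are placeholders invented for illustration (they are not Chalk's real values), and `build_trust_policy` is a hypothetical helper name; use the values that Chalk's support team gives you.

```python
import json
from typing import Optional


def build_trust_policy(principal_arn: str, external_id: Optional[str] = None) -> dict:
    """Return an IAM trust policy that lets `principal_arn` assume the role."""
    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": principal_arn},
        "Action": "sts:AssumeRole",
    }
    if external_id is not None:
        # Optional: require a shared external ID on every AssumeRole call,
        # a common guard for cross-account access.
        statement["Condition"] = {"StringEquals": {"sts:ExternalId": external_id}}
    return {"Version": "2012-10-17", "Statement": [statement]}


if __name__ == "__main__":
    # Placeholder values for illustration only (assumptions, not real Chalk values).
    policy = build_trust_policy(
        "arn:aws:iam::111111111111:role/example-chalk-management",
        external_id="example-external-id",
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the role's trust relationship when creating the role.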
Chalk's support team will work with you to scope permissions down to what is needed for ongoing maintenance. The precise details depend on the level of ongoing support that your team needs and your compliance requirements. In principle, Chalk does not require ongoing access to data or the ability to edit IAM permissions, but Chalk does require ongoing access to update the software deployed in your environment.
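When reviewing a policy before scoping it down, it can help to see which services are granted as full wildcards and which are limited to named actions. The following is a small sketch of such a review aid, not a Chalk tool: `summarize_actions` is a hypothetical helper, and the excerpt is a shortened version of the policy above.

```python
import json
from collections import defaultdict


def summarize_actions(policy_json: str) -> dict:
    """Group a policy's allowed actions by service prefix.

    Services granted with a full wildcard map to "*"; all others map to a
    sorted list of the specific action names.
    """
    policy = json.loads(policy_json)
    statements = policy["Statement"]
    if isinstance(statements, dict):  # IAM also accepts a single statement object
        statements = [statements]

    by_service = defaultdict(set)
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        for action in actions:
            service, _, name = action.partition(":")
            by_service[service].add(name)

    return {
        service: "*" if "*" in names else sorted(names)
        for service, names in sorted(by_service.items())
    }


if __name__ == "__main__":
    # Shortened excerpt of the policy document above.
    excerpt = """
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Resource": "*",
          "Action": ["s3:*", "eks:*", "iam:PassRole", "iam:CreateRole"]
        }
      ]
    }
    """
    for service, actions in summarize_actions(excerpt).items():
        print(service, actions)
```

Services that show up as `*` are the natural starting points for a scoping conversation.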
Chalk requires the following Kubernetes roles to manage the resources in your cluster:
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: chalk-cluster-management-role
rules:
  - apiGroups:
      - ""
    resources:
      - nodes # required to track usage
    verbs:
      - get
      - list
      - watch
  # For allowing the web UI to manage cluster scaling
  - apiGroups:
      - "karpenter.sh"
    resources:
      - nodepools
    verbs:
      - get
      - list
      - create
      - update
      - patch
      - delete
  - apiGroups:
      - "karpenter.k8s.aws"
    resources:
      - ec2nodeclasses
    verbs:
      - get
      - list
      - create
      - update
      - patch
      - delete
  # For read/list access to Karpenter NodeClaims
  - apiGroups:
      - "karpenter.sh"
    resources:
      - nodeclaims
    verbs:
      - get
      - list
  - apiGroups:
      - ""
    resources:
      - persistentvolumes
    verbs:
      - get
      - list
      - create
      - update
      - patch
      - watch
      - delete
  - apiGroups:
      - "storage.k8s.io"
    resources:
      - storageclasses
    verbs:
      - get
      - list
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: chalk-management-role
  namespace: <namespace> # replace <namespace> with your actual namespace
rules:
  - apiGroups:
      - ""
    resources:
      - configmaps
      - pods
      - services
    verbs:
      - get
      - list
      - create
      - update
      - patch
      - watch
  # For support with debugging & rendering logs in the dashboard.
  - apiGroups:
      - ""
    resources:
      - pods/log
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "extensions"
      - "networking.k8s.io"
    resources:
      - ingresses
      - ingresses/status
      - ingressclasses
    verbs:
      - get
      - list
      - watch
      - delete
      - update
      - create
  - apiGroups:
      - "extensions"
      - "networking.k8s.io"
    resources:
      - ingresses # can be self-managed if necessary
    verbs:
      - get
      - create
      - delete
      - update
      - list
      - patch
  - apiGroups:
      - ""
    resources:
      - secrets
    verbs:
      - get
      - list
      - watch
      - create
      - patch
      - update
      - delete
  - apiGroups:
      - ""
    resources:
      - events
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - "apps"
    resources:
      - replicasets
      - statefulsets
      - deployments
      - daemonsets
    verbs:
      - get
      - list
      - update
      - patch
      - create
      - delete
  - apiGroups:
      - "batch"
    resources:
      - cronjobs
    verbs:
      - get
      - list
      - create
      - update
      - patch
      - watch
      - delete
  # For managing Jobs
  - apiGroups:
      - "batch"
    resources:
      - jobs
    verbs:
      - get
      - list
      - create
      - update
      - patch
      - watch
      - delete
  # For durable storage management.
  - apiGroups:
      - ""
    resources:
      - persistentvolumeclaims
    verbs:
      - get
      - list
      - create
      - update
      - patch
      - watch
      - delete
  # For managing KEDA objects for autoscaling.
  - apiGroups:
      - keda.sh
    resources:
      - scaledobjects
      - scaledjobs
    verbs:
      - get
      - list
      - create
      - update
      - patch
      - watch
      - delete
  # OPTIONAL: For showing thread dumps & profiling for batch backfills + some support use-cases.
  - apiGroups:
      - ""
    resources:
      - pods/exec
    verbs:
      - create
      - get
  # OPTIONAL: For support with debugging k8s RBAC, grant read-only access to this namespace's RBAC configuration.
  - apiGroups: ["", "rbac.authorization.k8s.io"]
    resources: ["roles", "serviceaccounts", "rolebindings"]
    verbs: ["get", "list"]
Additionally, Chalk workload service accounts require the following role to be able to manage batch workloads:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chalk-job-reader
  namespace: <namespace> # replace <namespace> with your actual namespace
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch"]
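To reason about what a role like this does and does not permit, the sketch below checks a request (API group, resource, verb) against a list of RBAC rules. It is a deliberately simplified model: it handles only `apiGroups`, `resources`, and `verbs` with the `*` wildcard, and ignores `resourceNames` and non-resource URLs. The function names are hypothetical; for an authoritative answer against a live cluster, `kubectl auth can-i` is the better tool.

```python
from typing import Iterable, List


def rule_allows(rule: dict, api_group: str, resource: str, verb: str) -> bool:
    """True if a single RBAC rule covers the request; "*" matches anything."""

    def matches(allowed: Iterable[str], wanted: str) -> bool:
        allowed = list(allowed)
        return "*" in allowed or wanted in allowed

    return (
        matches(rule.get("apiGroups", []), api_group)
        and matches(rule.get("resources", []), resource)
        and matches(rule.get("verbs", []), verb)
    )


def role_allows(rules: List[dict], api_group: str, resource: str, verb: str) -> bool:
    """RBAC rules are purely additive: any matching rule grants access."""
    return any(rule_allows(rule, api_group, resource, verb) for rule in rules)


# The rules of the chalk-job-reader role above, as Python data.
JOB_READER_RULES = [
    {"apiGroups": ["batch"], "resources": ["jobs"], "verbs": ["get", "list", "watch"]},
]

if __name__ == "__main__":
    print(role_allows(JOB_READER_RULES, "batch", "jobs", "watch"))   # True
    print(role_allows(JOB_READER_RULES, "batch", "jobs", "delete"))  # False
```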
If you use in-cluster Docker image building powered by kaniko and Argo, Chalk requires this role in the namespace where the Docker image building happens:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: argo-workflows-role
  namespace: <namespace> # replace <namespace> with your actual namespace
rules:
  - apiGroups: ["argoproj.io"]
    resources: ["workflows"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
This role must be granted both to the management service account that Chalk uses to manage the cluster and to the workload service account that runs the Docker image building.
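Granting a role to a service account is done with a RoleBinding. As a rough sketch, the Python below builds a RoleBinding manifest that binds the `argo-workflows-role` to two service accounts and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The service account names, binding name, and namespace are placeholders assumed for illustration; substitute the names used in your cluster.

```python
import json
from typing import List


def role_binding(name: str, namespace: str, role_name: str, service_accounts: List[str]) -> dict:
    """Build a RoleBinding that grants `role_name` to each service account."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": name, "namespace": namespace},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": role_name,
        },
        "subjects": [
            {"kind": "ServiceAccount", "name": sa, "namespace": namespace}
            for sa in service_accounts
        ],
    }


if __name__ == "__main__":
    # Placeholder names for illustration only (assumptions, not Chalk defaults).
    binding = role_binding(
        name="argo-workflows-role-binding",
        namespace="example-namespace",
        role_name="argo-workflows-role",
        service_accounts=["example-management-sa", "example-workload-sa"],
    )
    print(json.dumps(binding, indent=2))
```

This sketch assumes both service accounts live in the same namespace as the role; a subject in another namespace would need its own `namespace` value.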