This repository has been archived by the owner on Jul 21, 2023. It is now read-only.


Exafunction AWS Kubernetes Module


This repository provides a Terraform module to set up the necessary Kubernetes resources for an ExaDeploy system in an AWS EKS cluster. This includes deploying:

  • Kubernetes secrets for the ExaDeploy system. This includes the Exafunction API key and optionally S3 access credentials and RDS credentials needed for the persistent module repository backend.
  • ExaDeploy Helm chart responsible for bringing up ExaDeploy Kubernetes resources including the scheduler and module repository. See more details about the Helm chart here.
  • Prometheus Helm chart to create a Prometheus instance for ExaDeploy monitoring and a Grafana instance for visualizing the metrics. See more details about the Helm chart here.
  • NVIDIA Device Plugin Helm chart to enable GPU scheduling on the cluster. See more details about the Helm chart here.
  • EKS Cluster Autoscaler used to autoscale nodes in the cluster.

Because this module uses the Kubernetes Terraform provider to create Kubernetes resources, it should not be used in the same Terraform root module that creates the EKS cluster these resources are deployed in. In simpler terms, EKS cluster creation and deployment of Kubernetes resources in that cluster should be managed with separate terraform apply operations. See this section of the official Kubernetes provider docs or this article describing the issue for more information.
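A minimal sketch of this split, assuming the EKS cluster already exists and is looked up by name (the cluster name below is illustrative):

```hcl
# Separate root module from the one that created the EKS cluster.
# The Kubernetes provider is configured from EKS data sources so that
# cluster credentials are resolved at plan/apply time rather than
# depending on resources created in the same run.

data "aws_eks_cluster" "this" {
  name = "cluster-abcd1234" # illustrative
}

data "aws_eks_cluster_auth" "this" {
  name = "cluster-abcd1234" # illustrative
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.this.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.this.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.this.token
}
```

The key point is that the terraform apply which creates the cluster completes before this root module is planned, so the provider configuration never depends on resources created in the same operation.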

System Diagram


This diagram shows a typical setup of ExaDeploy in EKS (and adjacent AWS resources). It consists of:

  • An RDS database and S3 bucket used as a persistent backend for the ExaDeploy module repository. These are not strictly required, as the module repository also supports a local backend (which is not persistent between module repository restarts), but a persistent backend is recommended in production. This Terraform module is not responsible for creating these resources but must be configured to use them by providing the appropriate access credentials and addressing information (see Configuration - Module Repository Backend). To create these resources, see Exafunction/terraform-aws-exafunction-cloud/modules/module_repo_backend.
  • An EKS cluster used to run the ExaDeploy system. This Terraform module is not responsible for creating the cluster. To create it, see Exafunction/terraform-aws-exafunction-cloud/modules/cluster.
  • The ExaDeploy Kubernetes resources, specifically the module repository and scheduler (which is responsible for managing runners). This Terraform module is responsible for creating these resources.
  • A Prometheus instance for monitoring the ExaDeploy system (along with a Grafana server used to visualize these metrics, not pictured). This Terraform module is responsible for creating these resources.

Usage

```hcl
module "exafunction_kube" {
  # Set the module source and version to use this module.
  source  = "Exafunction/exafunction-kube/aws"
  version = "x.y.z"

  # Set the cluster variables.
  cluster_name              = "cluster-abcd1234"
  cluster_oidc_issuer_url   = "https://oidc.eks.us-west-1.amazonaws.com/id/ABCD1234"
  cluster_oidc_provider_arn = "arn:aws:iam::123456781234:oidc-provider/oidc.eks.us-west-1.amazonaws.com/id/ABCD1234"

  # Set the Exafunction API Key.
  exafunction_api_key = "12345678-1234-5678-1234-567812345678"

  # Set the ExaDeploy component images.
  scheduler_image         = "123456781234.dkr.ecr.us-west-1.amazonaws.com/exafunction/scheduler:prod_abcd1234_1234567812"
  module_repository_image = "123456781234.dkr.ecr.us-west-1.amazonaws.com/exafunction/module_repository:prod_abcd1234_1234567812"
  runner_image            = "123456781234.dkr.ecr.us-west-1.amazonaws.com/exafunction/runner@sha256:abcdef0123456789abcdef0123456789abcdef0123456789abcdef0123456789"

  # Set the module repository backend.
  module_repository_backend = "local"

  # ...
}
```

See the configuration sections below as well as the Inputs section and variables.tf file for a full list of configuration options.

See examples/setup_exadeploy for a working example of how to use this module.

Cluster Configuration

While this module is not responsible for creating an EKS cluster, it does require information about an existing cluster to deploy resources to, including the name of the cluster and OpenID Connect (OIDC) information associated with the cluster. For clusters created using Exafunction/terraform-aws-exafunction-cloud/modules/cluster, this information can be fetched using the cluster_name, cluster_oidc_issuer_url, and oidc_provider_arn module outputs.
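For example, the outputs of a cluster root module can be wired in via remote state (the backend configuration below is illustrative):

```hcl
data "terraform_remote_state" "cluster" {
  backend = "s3"
  config = {
    bucket = "my-terraform-state" # illustrative
    key    = "cluster/terraform.tfstate"
    region = "us-west-1"
  }
}

module "exafunction_kube" {
  source  = "Exafunction/exafunction-kube/aws"
  version = "x.y.z"

  cluster_name              = data.terraform_remote_state.cluster.outputs.cluster_name
  cluster_oidc_issuer_url   = data.terraform_remote_state.cluster.outputs.cluster_oidc_issuer_url
  cluster_oidc_provider_arn = data.terraform_remote_state.cluster.outputs.oidc_provider_arn

  # ... remaining required variables ...
}
```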

ExaDeploy Helm Chart Configuration

This Terraform module installs the ExaDeploy Helm chart in the EKS cluster (version set by exadeploy_helm_chart_version). Helm is a package manager for Kubernetes that allows for easy installation of applications in a Kubernetes cluster. Helm charts can be configured by passing in values at install time, either through a values YAML file or through command line arguments.

The configuration of the ExaDeploy Helm chart in this Terraform module is split between required values which are managed through dedicated Terraform variables and optional values which can be supplied using the exadeploy_helm_values_path variable (see Optional Configuration). These are all detailed in the sections below.

As a note, "required" in this sense means the values must be set in order to install the Helm chart at all. Other "optional" values may still be necessary to support specific infrastructure configurations, toggle features, or maximize performance.

Exafunction API Key

The Exafunction API key is a unique key used to identify the ExaDeploy system to Exafunction. The API key itself should be provided by Exafunction.

To set, see the exafunction_api_key and exafunction_api_key_secret_name variables.

ExaDeploy Component Images

The Exafunction component images are the Docker images used to run the ExaDeploy system. These should be provided by Exafunction.

To set, see the scheduler_image, module_repository_image, and runner_image variables.

Module Repository Backend

The module repository can be configured to use either a local backend on disk or a remote backend backed by an RDS database and S3 bucket. The remote backend allows for persistence between module repository restarts and is recommended in production. For remote module backends created using Exafunction/terraform-aws-exafunction-cloud/modules/module_repo_backend, this information can be fetched from the module outputs.

To set, see the module_repository_backend, rds_*, and s3_* variables. If module_repository_backend is local then the other variables do not need to be specified. If module_repository_backend is remote then the other variables must all be non-null.
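A sketch of a remote backend configuration follows; all values are illustrative and would in practice come from the module_repo_backend module outputs:

```hcl
module "exafunction_kube" {
  # ... source, version, cluster, API key, and image variables as above ...

  module_repository_backend = "remote"

  # RDS connection information and credentials.
  rds_address  = "exafunction.abcd1234.us-west-1.rds.amazonaws.com"
  rds_port     = "5432"
  rds_username = "postgres"
  rds_password = "example-password"

  # S3 bucket and IAM user access credentials.
  s3_bucket_id           = "exafunction-module-repository"
  s3_iam_user_access_key = "AKIAEXAMPLEKEY"
  s3_iam_user_secret_key = "example-secret-key"
}
```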

Optional Configuration

While the above sections cover all the required values for the ExaDeploy Helm chart configuration, there are many additional values that can be set to customize the deployment. These values are specified in a yaml file (in Helm values file format) and passed to the Helm chart installation through the optional exadeploy_helm_values_path variable.

To see all the available values, see the ExaDeploy Helm chart. Note that Helm chart values corresponding to the required values above should not be set through this method, as they will be overridden (multiple Helm values specifications are automatically merged).
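For example, optional overrides can be kept in a values file alongside the root module (the file name below is illustrative; the available keys are defined by the ExaDeploy Helm chart):

```hcl
module "exafunction_kube" {
  # ... required variables as above ...

  # Optional Helm values overrides, merged with the module-managed values.
  exadeploy_helm_values_path = "${path.module}/exadeploy-values.yaml"
}
```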

Prometheus Helm Chart Configuration

This Terraform module also installs the Prometheus Helm chart in the EKS cluster. This chart provides a Prometheus server and Grafana dashboard for monitoring the ExaDeploy system.

By default, the Grafana dashboard is made available on a private address within the VPC. The enable_grafana_public_address variable can be used to instead expose the Grafana dashboard on a public address for ease of access.

Prometheus Remote Write

By default, this module also enables Prometheus remote write in order to send ExaDeploy system metrics to a remote Prometheus server owned by Exafunction. The remote receiving endpoint is secured by TLS, and clients are authenticated using basic auth with a username and password that Exafunction should provide. These metrics help Exafunction better understand your usage of ExaDeploy and more rapidly troubleshoot any issues that may occur.

To enable / disable this feature, see enable_prom_remote_write. When enabled (default), the prom_remote_write_* variables must be specified. In particular, prom_remote_write_username and prom_remote_write_password should be the username and password provided by Exafunction.
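A sketch of the remote write configuration (credentials illustrative; the target URL shown is the module default):

```hcl
module "exafunction_kube" {
  # ... required variables as above ...

  enable_prom_remote_write     = true # default; set to false to disable
  prom_remote_write_username   = "example-company"  # provided by Exafunction
  prom_remote_write_password   = "example-password" # provided by Exafunction
  prom_remote_write_target_url = "https://prometheus.exafunction.com/api/v1/write" # default
}
```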

Additional Resources

To learn more about Exafunction, visit the Exafunction website.

For technical support or questions, check out our community Slack.

For additional documentation about Exafunction including system concepts, setup guides, tutorials, API reference, and more, check out the Exafunction documentation.

For an equivalent repository used to set up ExaDeploy in a Kubernetes cluster on Google Cloud Platform (GCP) instead of AWS, visit Exafunction/terraform-gcp-exafunction-kube.

Resources

| Name | Type |
|------|------|
| helm_release.exadeploy | resource |
| helm_release.kube_prometheus_stack | resource |
| helm_release.nvidia_device_plugin | resource |
| kubernetes_cluster_role_binding.cluster_admin | resource |
| kubernetes_namespace.prometheus | resource |
| kubernetes_secret.exafunction_api_key | resource |
| kubernetes_secret.prom_remote_write_basic_auth | resource |
| kubernetes_secret.rds_password | resource |
| kubernetes_secret.s3_access | resource |
| random_string.irsa_role_name_prefix | resource |
| aws_region.current | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| cluster_name | Name of the Exafunction EKS cluster. | string | n/a | yes |
| cluster_oidc_issuer_url | The URL for the OpenID Connect identity provider associated with the EKS cluster. | string | n/a | yes |
| cluster_oidc_provider_arn | The ARN of the OpenID Connect provider for the EKS cluster. | string | n/a | yes |
| enable_grafana_public_address | Whether the Grafana service will be accessible via a public address. If false, the Grafana service will only be accessible via a private address within the VPC. | bool | false | no |
| enable_prom_remote_write | Whether to enable remote writing Prometheus metrics to the Exafunction receiving endpoint. If true, all prom_remote_* variables must be set. | bool | true | no |
| exadeploy_helm_chart_version | The version of the ExaDeploy Helm chart to use. | string | "1.0.0" | no |
| exadeploy_helm_values_path | ExaDeploy Helm chart values YAML file path. | string | null | no |
| exafunction_api_key | Exafunction API key used to identify the ExaDeploy system to Exafunction. | string | n/a | yes |
| exafunction_api_key_secret_name | Exafunction API key Kubernetes secret name. This secret will be created by this module. | string | "exafunction-api-key" | no |
| module_repository_backend | The backend to use for the ExaDeploy module repository. One of [local, remote]. If remote, s3_* and rds_* variables must be set. | string | "local" | no |
| module_repository_image | Path to ExaDeploy module repository image. | string | n/a | yes |
| prom_remote_write_basic_auth_secret_name | Prometheus remote write basic auth Kubernetes secret name. This secret will be created by this module. | string | "prom-remote-write-basic-auth-secret" | no |
| prom_remote_write_password | Prometheus remote write basic auth password. This will be stored in the Prometheus remote write basic auth Kubernetes secret. | string | null | no |
| prom_remote_write_target_url | Prometheus remote write target URL. | string | "https://prometheus.exafunction.com/api/v1/write" | no |
| prom_remote_write_username | Username (e.g. company name) for Prometheus remote write which will be used as a Prometheus label and basic auth username. This will be stored in the Prometheus remote write basic auth Kubernetes secret. | string | null | no |
| rds_address | Address of the RDS database. | string | null | no |
| rds_password | Password for RDS instance. This will be stored in the RDS password Kubernetes secret. | string | null | no |
| rds_password_secret_name | RDS password Kubernetes secret name. This secret will be created by this module. | string | "rds-password" | no |
| rds_port | Port of the RDS database. | string | null | no |
| rds_username | Username for the RDS database. | string | null | no |
| runner_image | Path to ExaDeploy runner image. | string | n/a | yes |
| s3_access_key_secret_name | S3 access key Kubernetes secret name. This secret will be created by this module. | string | "s3-access-key" | no |
| s3_bucket_id | ID for S3 bucket. | string | null | no |
| s3_iam_user_access_key | Access key ID for the S3 IAM user. This will be stored in the S3 access key Kubernetes secret. | string | null | no |
| s3_iam_user_secret_key | Secret access key for the S3 IAM user. This will be stored in the S3 access key Kubernetes secret. | string | null | no |
| scheduler_image | Path to ExaDeploy scheduler image. | string | n/a | yes |

Outputs

| Name | Description |
|------|-------------|
| cluster_autoscaler_helm_release_metadata | Cluster autoscaler Helm release attributes. |
| exadeploy_helm_release_metadata | ExaDeploy Helm release attributes. |
| nvidia_device_plugin_helm_release_metadata | NVIDIA device plugin Helm release attributes. |
| prometheus_helm_release_metadata | Prometheus Helm release attributes. |