This Git repository contains several Terraform configurations.
- `shared_state` creates Terraform state storage in either Azure or AWS, which is a prerequisite for the Terraform configurations in `aws` or `azure`.
- `shared_state/aws` creates an AWS S3 bucket and DynamoDB table that are a prerequisite for the Terraform configuration in `aws`.
- `shared_state/azure` creates an Azure resource group and storage account that are a prerequisite for the Terraform configuration in `azure`.
- `aws` creates the following AWS resources:
  - Creates one or more EC2 nodes for running the different components. Currently, the configuration uses the m5.2xlarge instance type, which provides 8 vCPUs, 32GB RAM, and an EBS-backed root volume.
  - Runs commands on the EC2 nodes after they are started (5 minutes according to the docs) to install software and configure them.
  - Creates DNS A records for the EC2 nodes.
- `azure` creates the following Azure resources:
  - Creates a resource group to hold all of the created resources.
  - Creates networking resources (vnet, subnet, network security group).
  - Creates two or more Azure VMs (along with associated NICs and public IP addresses) for running the different components. The default configuration creates D8s v4 VMs, providing 8 vCPUs and 32GiB RAM with an Azure-storage-backed OS drive.
  - Runs commands on the VMs after cloud-init provisioning is complete in order to install and configure Hadoop, ZooKeeper, Accumulo, and the Accumulo Testing repository.
You will need to download and install the correct Terraform CLI for your platform. Put the `terraform` binary on your PATH. You can optionally install Terraform Docs if you want to be able to generate documentation or an example variables file for the shared state, `aws`, or `azure` configurations.
The `shared_state` directory contains Terraform configurations for creating either an AWS S3 bucket and DynamoDB table, or an Azure resource group, storage account, and container. These objects only need to be created once and are used for sharing the Terraform state with a team. To read more about this, see remote state. The AWS shared state instructions are based on this article.
To generate the storage, run `terraform init` followed by `terraform apply`.
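For example, to create the shared state storage for AWS (the Azure flow is the same from the `shared_state/azure` directory):

```shell
cd shared_state/aws
terraform init
terraform apply
```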
The default AWS configuration generates the S3 bucket name when `terraform apply` is run. This ensures that a globally unique S3 bucket name is used. It is not required to set any variables for the shared state. However, if you wish to override any variable values, this can be done by creating an `aws.auto.tfvars` file in the `shared_state/aws` directory. For example:
```shell
cd shared_state/aws
cat > aws.auto.tfvars << EOF
bucket_force_destroy = true
EOF
```
Assuming the bucket variable is not overridden, the generated S3 bucket name will appear in the `terraform apply` output, like the following example:

```
Outputs:

bucket_name = "terraform-20220209131315353700000001"
```

This value should be supplied to `terraform init` in the `aws` directory as described below. Using the example above, the init command for the `aws` directory would be:

```shell
terraform init -backend-config=bucket=terraform-20220209131315353700000001
```
If you change any of the backend storage configuration parameters from their defaults, you will need to override them when you initialize Terraform for the `aws` or `azure` configuration below. For example, if you change the region where the S3 bucket is deployed from `us-east-1` to `us-west-2`, then you would need to run `terraform init` in the `aws` directory (not the shared_state initialization, but the main `aws` directory initialization) with:

```shell
terraform init -backend-config=region=us-west-2
```
The following backend configuration can be overridden with `-backend-config=<name>=<value>` options to `terraform init`. This avoids the need to modify the backend sections in `aws/main.tf` or `azure/main.tf`.

For AWS:

- `-backend-config=bucket=<bucket_name>`: Override the S3 bucket name
- `-backend-config=key=<key_name>`: Override the key in the S3 bucket
- `-backend-config=region=<region>`: Override the AWS region
- `-backend-config=dynamodb_table=<dynamodb_table_name>`: Override the DynamoDB table name

For Azure:

- `-backend-config=resource_group_name=<resource_group_name>`: Override the resource group where the storage account is located
- `-backend-config=storage_account_name=<storage_account_name>`: Override the name of the Azure storage account holding Terraform state
- `-backend-config=container_name=<container_name>`: Override the name of the container within the storage account that is holding Terraform state
- `-backend-config=key=<blob_name>`: Override the name of the blob within the container that will be used to hold Terraform state
The `aws` and `azure` directories contain Terraform configurations for creating an Accumulo cluster on AWS or Azure, respectively. The `aws` and `azure` directories contain the following Terraform configuration items:

- `main.tf` - The Terraform configuration file
- `variables.tf` - The declaration and default values for Terraform variables

These configurations both use shared Terraform modules and configuration files that can be found in the following directories/files:

- `modules/` - This contains several shared Terraform modules that are used by the `aws` and `azure` Terraform configurations:
  - `cloud-init-config` - contains templates to generate a cloud-init configuration to configure AWS instances or Azure VMs with necessary Linux packages, user accounts, etc.
  - `config-files` - contains template configuration files for various components of the cluster (e.g., HDFS, Accumulo, Grafana, etc.) as well as helper scripts to install the software components that cannot be installed via cloud-init.
  - `upload-software` - if pre-built binaries for downloaded software components (Hadoop, Accumulo, ZooKeeper, Maven) are included, this module uploads them to the cluster.
  - `configure-nodes` - this module is responsible for executing scripts on the cluster to install and configure software, initialize the HDFS filesystem and Accumulo cluster, and start them.
- `conf/` - a non-git-tracked directory that contains rendered template files with variables replaced by selected runtime configuration. These files are uploaded to the cluster.
The table below lists the variables and their default values that are used in the `aws` configuration.

| Name | Description | Type | Default | Required |
|---|---|---|---|---|
| accumulo_branch_name | The name of the branch to build and install | string | "main" | no |
| accumulo_dir | The Accumulo directory on each EC2 node | string | "/data/accumulo" | no |
| accumulo_instance_name | The Accumulo instance name | string | "accumulo-testing" | no |
| accumulo_repo | URL of the Accumulo git repo | string | "https://github.com/apache/accumulo.git" | no |
| accumulo_root_password | The password for the Accumulo root user. A randomly generated password will be used if none is specified here. | string | null | no |
| accumulo_testing_branch_name | The name of the branch to build and install | string | "main" | no |
| accumulo_testing_repo | URL of the Accumulo Testing git repo | string | "https://github.com/apache/accumulo-testing.git" | no |
| accumulo_version | The version of Accumulo to download and install | string | "2.1.0-SNAPSHOT" | no |
| ami_name_pattern | The pattern of the name of the AMI to use | any | n/a | yes |
| ami_owner | The id of the AMI owner | any | n/a | yes |
| authorized_ssh_key_files | List of SSH public key files for the developers that will log into the cluster | list(string) | [] | no |
| authorized_ssh_keys | List of SSH keys for the developers that will log into the cluster | list(string) | n/a | yes |
| cloudinit_merge_type | Describes the merge behavior for overlapping config blocks in cloud-init | string | null | no |
| create_route53_records | Indicates whether or not route53 records will be created | bool | false | no |
| hadoop_dir | The Hadoop directory on each EC2 node | string | "/data/hadoop" | no |
| hadoop_version | The version of Hadoop to download and install | string | "3.3.1" | no |
| instance_count | The number of EC2 instances to create | string | "2" | no |
| instance_type | The type of EC2 instances to create | string | "m5.2xlarge" | no |
| local_sources_dir | Directory on local machine that contains Maven, ZooKeeper or Hadoop binary distributions or Accumulo source tarball | string | "" | no |
| maven_version | The version of Maven to download and install | string | "3.8.4" | no |
| optional_cloudinit_config | An optional config block for the cloud-init script. If you set this, you should consider setting cloudinit_merge_type to handle merging with the default script as you need. | string | null | no |
| private_network | Indicates whether or not the user is on a private network and access to hosts should be through the private IP addresses rather than public ones | bool | false | no |
| root_volume_gb | The size, in GB, of the EC2 instance root volume | string | "300" | no |
| route53_zone | The name of the Route53 zone in which to create DNS addresses | any | n/a | yes |
| security_group | The Security Group to use when creating AWS objects | any | n/a | yes |
| software_root | The full directory root where software will be installed | string | "/opt/accumulo-testing" | no |
| us_east_1b_subnet | The AWS subnet id for the us-east-1b subnet | any | n/a | yes |
| us_east_1e_subnet | The AWS subnet id for the us-east-1e subnet | any | n/a | yes |
| zookeeper_dir | The ZooKeeper directory on each EC2 node | string | "/data/zookeeper" | no |
| zookeeper_version | The version of ZooKeeper to download and install | string | "3.5.9" | no |
The following outputs are returned by the `aws` Terraform configuration.

| Name | Description |
|---|---|
| accumulo_root_password | The supplied, or automatically generated, Accumulo root user password. |
| manager_ip | The IP address of the manager instance. |
| worker_ips | The IP addresses of the worker instances. |
The table below lists the variables and their default values that are used in the `azure` configuration.

| Name | Description | Type | Default | Required |
|---|---|---|---|---|
| accumulo_branch_name | The name of the branch to build and install | string | "main" | no |
| accumulo_dir | The Accumulo directory on each node | string | "/data/accumulo" | no |
| accumulo_instance_name | The Accumulo instance name | string | "accumulo-testing" | no |
| accumulo_repo | URL of the Accumulo git repo | string | "https://github.com/apache/accumulo.git" | no |
| accumulo_root_password | The password for the Accumulo root user. A randomly generated password will be used if none is specified here. | string | null | no |
| accumulo_testing_branch_name | The name of the branch to build and install | string | "main" | no |
| accumulo_testing_repo | URL of the Accumulo Testing git repo | string | "https://github.com/apache/accumulo-testing.git" | no |
| accumulo_version | The version of Accumulo to download and install | string | "2.1.0-SNAPSHOT" | no |
| admin_username | The username of the admin user, which can be authenticated with the first public ssh key | string | "azureuser" | no |
| authorized_ssh_key_files | List of SSH public key files for the developers that will log into the cluster | list(string) | [] | no |
| authorized_ssh_keys | List of SSH keys for the developers that will log into the cluster | list(string) | n/a | yes |
| cloudinit_merge_type | Describes the merge behavior for overlapping config blocks in cloud-init | string | null | no |
| create_resource_group | Indicates whether resource_group_name should be created or refers to an existing resource group | bool | true | no |
| hadoop_dir | The Hadoop directory on each node | string | "/data/hadoop" | no |
| hadoop_version | The version of Hadoop to download and install | string | "3.3.1" | no |
| local_sources_dir | Directory on local machine that contains Maven, ZooKeeper or Hadoop binary distributions or Accumulo source tarball | string | "" | no |
| location | The Azure region where resources are to be created. If an existing resource group is specified, this value is ignored and the resource group's location is used. | string | n/a | yes |
| maven_version | The version of Maven to download and install | string | "3.8.4" | no |
| network_address_space | The network address space to use for the virtual network | list(string) | [ | no |
| optional_cloudinit_config | An optional config block for the cloud-init script. If you set this, you should consider setting cloudinit_merge_type to handle merging with the default script as you need. | string | null | no |
| os_disk_caching | The type of caching to use for the OS disk. Possible values are None, ReadOnly, and ReadWrite. | string | "ReadOnly" | no |
| os_disk_size_gb | The size, in GB, of the OS disk | number | 300 | no |
| os_disk_type | The disk type to use for OS disks. Possible values are Standard_LRS, StandardSSD_LRS, and Premium_LRS. | string | "Standard_LRS" | no |
| resource_group_name | The name of the resource group to create or reuse. If not specified, the name is generated based on resource_name_prefix. | string | "" | no |
| resource_name_prefix | A prefix applied to all resource names created by this template | string | "accumulo-testing" | no |
| software_root | The full directory root where software will be installed | string | "/opt/accumulo-testing" | no |
| subnet_address_prefixes | The subnet address prefixes to use for the accumulo testing subnet | list(string) | [ | no |
| vm_image | n/a | object({ | { | no |
| vm_sku | The SKU of Azure VMs to create | string | "Standard_D8s_v4" | no |
| worker_count | The number of worker VMs to create | number | 1 | no |
| zookeeper_dir | The ZooKeeper directory on each node | string | "/data/zookeeper" | no |
| zookeeper_version | The version of ZooKeeper to download and install | string | "3.5.9" | no |
The following outputs are returned by the `azure` Terraform configuration.

| Name | Description |
|---|---|
| accumulo_root_password | The user-supplied or automatically generated Accumulo root user password. |
| manager_ip | The public IP address of the manager VM. |
| worker_ips | The public IP addresses of the worker VMs. |
When using either the `aws` or `azure` configuration, you will need to supply values for required variables that have no default value. There are several ways to do this. If you installed Terraform Docs, it can generate the file for you. You can then edit the generated file to configure values as desired:

```shell
CLOUD=<enter either aws or azure>
cd $CLOUD
terraform-docs tfvars hcl . > ${CLOUD}.auto.tfvars
# If you prefer JSON over HCL, then the command would be
# terraform-docs tfvars json . > ${CLOUD}.auto.tfvars.json
```
Note that these generated variable files will include values for all variables, where those with
defaults will be set to their default value. You can also refer to the tables above and simply
add the values that are required (and have no default, or a default that you wish to change).
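For example, a minimal HCL variables file for `aws` supplying only the required variables could be written like this (all values are placeholders taken from the JSON example below; substitute your own):

```shell
# Write a minimal aws.auto.tfvars containing only the required variables.
# Every value here is a placeholder -- replace with your own resources.
cat > aws.auto.tfvars << 'EOF'
security_group      = "sg-ABCDEF001"
route53_zone        = "some.domain.com"
us_east_1b_subnet   = "subnet-ABCDEF123"
us_east_1e_subnet   = "subnet-ABCDEF124"
ami_owner           = "000000000001"
ami_name_pattern    = "MY_AMI_*"
authorized_ssh_keys = ["ssh-rsa dev_key_1"]
EOF
```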
Below is an example JSON file containing configuration for `aws`. This content can be customized and placed in the `aws` directory in a file whose name ends with `.auto.tfvars.json`. Any variable files whose names end in `.auto.tfvars` or `.auto.tfvars.json` are automatically included when `terraform` commands are executed.
```json
{
  "security_group": "sg-ABCDEF001",
  "route53_zone": "some.domain.com",
  "us_east_1b_subnet": "subnet-ABCDEF123",
  "us_east_1e_subnet": "subnet-ABCDEF124",
  "ami_owner": "000000000001",
  "ami_name_pattern": "MY_AMI_*",
  "authorized_ssh_keys": [
    "ssh-rsa dev_key_1",
    "ssh-rsa dev_key_2"
  ]
}
```
The cloud-init template can be found in `cloud-init.tftpl`. If you need to customize this configuration, one method is to use the Terraform variable `optional_cloudinit_config` to supply your own additional configuration. For example, some CentOS 7 images are out of date and will need software packages to be updated before the rest of the software download/install will work. This can be accomplished by adding the following to your `.auto.tfvars` file:
```hcl
optional_cloudinit_config = <<-EOT
  package_upgrade: true
EOT
```
You can add any other cloud-init configuration that you wish here. One factor to consider is the cloud-init merging behavior with sections in the default template. The merging behavior can be controlled by setting the `cloudinit_merge_type` variable to your desired merge algorithm. The default is set to `dict(recurse_array,no_replace)+list(append)`, which attempts to keep all lists from the default configuration rather than having new ones overwrite them.

Another factor to consider is the size of the generated cloud-init template. Cloud providers place a limit on the size of this file: AWS limits this content to 16KB before Base64 encoding, and Azure limits it to 64KB after Base64 encoding.
This Terraform configuration creates:

- `${instance_count}` EC2 nodes of `${instance_type}` with the latest AMI matching `${ami_name_pattern}` from `${ami_owner}`. Each EC2 node will have a `${root_volume_gb}` GB root volume. The EFS filesystem is NFS-mounted to each node at `${software_root}`.
- DNS entries in Route53 for each EC2 node.
This Terraform configuration:

- Downloads, if necessary, the Apache Maven `${maven_version}` binary tarball to `${software_root}/sources`, then untars it to `${software_root}/apache-maven/apache-maven-${maven_version}`
- Downloads, if necessary, the Apache ZooKeeper `${zookeeper_version}` binary tarball to `${software_root}/sources`, then untars it to `${software_root}/zookeeper/apache-zookeeper-${zookeeper_version}-bin`
- Downloads, if necessary, the Apache Hadoop `${hadoop_version}` binary tarball to `${software_root}/sources`, then untars it to `${software_root}/hadoop/hadoop-${hadoop_version}`
- Clones, if necessary, the Apache Accumulo Git repo from `${accumulo_repo}` into `${software_root}/sources/accumulo-repo`. It switches to the `${accumulo_branch_name}` branch and builds the software using Maven, then untars the binary tarball to `${software_root}/accumulo/accumulo-${accumulo_version}`
- Downloads the OpenTelemetry Java Agent jar file and copies it to `${software_root}/accumulo/accumulo-${accumulo_version}/lib/opentelemetry-javaagent-1.7.1.jar`
- Copies the Accumulo `test` jar to `${software_root}/accumulo/accumulo-${accumulo_version}/lib` so that `org.apache.accumulo.test.metrics.TestStatsDRegistryFactory` is on the classpath
- Downloads the Micrometer StatsD Registry jar file and copies it to `${software_root}/accumulo/accumulo-${accumulo_version}/lib/micrometer-registry-statsd-1.7.4.jar`
- Clones, if necessary, the Apache Accumulo Testing Git repo from `${accumulo_testing_repo}` into `${software_root}/sources/accumulo-testing-repo`. It switches to the `${accumulo_testing_branch_name}` branch and builds the software using Maven.
If you want to supply your own Apache Maven, Apache ZooKeeper, Apache Hadoop, Apache Accumulo, or Apache Accumulo Testing binary tar files, then you can put them into a directory on your local machine and set the `local_sources_dir` variable to the full path of that directory. These files will be uploaded to `${software_root}/sources`, and the installation script will use them instead of downloading them. If the version of a supplied binary tarball is different from the default version, then you will also need to override that property. Supplying your own binary tarballs does speed up the deployment. However, if you provide the Apache Accumulo binary tarball, then it will be harder to update the software on the cluster.
NOTE: If you supply your own binary tarball of Accumulo, then you will need to copy the `accumulo-test-${accumulo_version}.jar` file to the `lib` directory manually, as it's not part of the binary tarball.
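A sketch of that manual copy, assuming the default `software_root` and `accumulo_version` values from the table above and that you placed the test jar in the uploaded sources directory (both assumptions; adjust paths to match your setup):

```shell
# Assumed defaults -- change these to match your tfvars values.
SOFTWARE_ROOT=/opt/accumulo-testing
ACCUMULO_VERSION=2.1.0-SNAPSHOT

# Assumes the test jar was placed alongside your uploaded tarballs.
cp "${SOFTWARE_ROOT}/sources/accumulo-test-${ACCUMULO_VERSION}.jar" \
   "${SOFTWARE_ROOT}/accumulo/accumulo-${ACCUMULO_VERSION}/lib/"
```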
If you did not provide a binary tarball, then you can update the software running on the cluster by doing the following and then restarting Accumulo:

```shell
cd ${software_root}/sources/accumulo-repo
git pull
mvn -s ${software_root}/apache-maven/settings.xml clean package -DskipTests -DskipITs
tar zxf assemble/target/accumulo-${accumulo_version}-bin.tar.gz -C ${software_root}/accumulo
# Sync the Accumulo changes with the worker nodes
pdsh -R exec -g worker rsync -az ${software_root}/accumulo/ %h:${software_root}/accumulo/
```
If you did not provide a binary tarball, then you can update the software running on the cluster by doing the following:

```shell
cd ${software_root}/sources/accumulo-testing-repo
git pull
mvn -s ${software_root}/apache-maven/settings.xml clean package -DskipTests -DskipITs
```
The first node that is created is called the `manager`; the others are `worker` nodes. The following components will run on the `manager` node:
- Apache ZooKeeper
- Apache Hadoop NameNode
- Apache Hadoop Yarn ResourceManager
- Apache Accumulo Manager
- Apache Accumulo Monitor
- Apache Accumulo GarbageCollector
- Apache Accumulo CompactionCoordinator
- Docker
- Jaeger Tracing Docker Container
- Telegraf/InfluxDB/Grafana Docker Container
The following components will run on the `worker` nodes:
- Apache Hadoop DataNode
- Apache Hadoop Yarn NodeManager
- Apache Accumulo TabletServer
- Apache Accumulo Compactor(s)
The logs for each service (ZooKeeper, Hadoop, Accumulo) are located in their respective local directory on each node (`/data/${service}/logs` unless you changed the properties).
The `aws` Terraform configuration creates DNS entries of the following form:

```
<node_name>-<branch_name>-<workspace_name>.${route53_zone}
```
For example:
- manager-main-default.${route53_zone}
- worker#-main-default.${route53_zone} (where # is 0, 1, 2, ...)
The `azure` configuration does not currently create public DNS entries for the nodes, and it is recommended that the public IP addresses be used instead.
1. Once you have created a `.auto.tfvars.json` file, or set the properties some other way, run `terraform init`. If you have modified the shared_state backend configuration from the defaults, you can override the values here. For example, the following command updates the `resource_group_name` and `storage_account_name` for the `azurerm` backend:

   ```shell
   terraform init -backend-config=resource_group_name=my-tfstate-resource-group -backend-config=storage_account_name=mystorageaccountname
   ```

   Once values are supplied to `terraform init`, they are stored in the local state, and it is not necessary to supply these overrides to the `terraform apply` or `terraform destroy` commands.
2. Run `terraform apply` to create the AWS/Azure resources.
3. Run `terraform destroy` to tear down the AWS/Azure resources.
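Putting the pieces together, a first-time AWS deployment might look like the following sketch (the bucket name is an example output; use the `bucket_name` value printed by your own shared_state apply):

```shell
# One-time: create the shared state storage and note the bucket_name output
cd shared_state/aws
terraform init
terraform apply

# Main configuration: point the backend at the shared state bucket
cd ../../aws
terraform init -backend-config=bucket=terraform-20220209131315353700000001
terraform apply

# When you are done with the cluster
terraform destroy
```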
NOTE: If you are working with `aws` and get an Access Denied error, then try setting the AWS Short Term access keys in your environment.
For an `aws` cluster, you can access the software configuration/management web pages here:
- Hadoop NameNode: http://manager-main-default.${route53_zone}:9870
- Yarn ResourceManager: http://manager-main-default.${route53_zone}:8088
- Hadoop DataNode: http://worker#-main-default.${route53_zone}:9864
- Yarn NodeManager: http://worker#-main-default.${route53_zone}:8042
- Accumulo Monitor: http://manager-main-default.${route53_zone}:9995
- Jaeger Tracing UI: http://manager-main-default.${route53_zone}:16686
- Grafana: http://manager-main-default.${route53_zone}:3003
The `azure` cluster creates a network security group that limits public access to port 22 (SSH). Therefore, to access configuration/management web pages, you should create a SOCKS proxy and use a browser plugin such as FoxyProxy Standard to point the browser at the SOCKS proxy. Create the proxy with:

```shell
ssh -C2qTnNf -D 9876 hadoop@<manager-public-ip-address>
```
Configure FoxyProxy (or your browser directly) to connect to the proxy on localhost port 9876 (change the port specified in the `-D` option above to use a different proxy port). If you configure FoxyProxy with a SOCKS 5 proxy to match the URL regex patterns `https?://manager:.*` and `https?://worker[0-9]+:.*`, then you can leave FoxyProxy set to "Use proxies based on their pre-defined patterns and priorities" and access the web pages through the proxy while other web pages will not use the proxy.
- Hadoop NameNode: http://manager:9870
- Yarn ResourceManager: http://manager:8088
- Hadoop DataNode: http://worker#:9864
- Yarn NodeManager: http://worker#:8042
- Accumulo Monitor: http://manager:9995
- Jaeger Tracing UI: http://manager:16686
- Grafana: http://manager:3003
The cloud-init configuration applied to each AWS instance or Azure VM creates a `hadoop` user. Any public SSH keys specified in the Terraform configuration variable `authorized_ssh_keys` (or public key files named in `authorized_ssh_key_files`) will be included in the cloud-init template as authorized keys for the `hadoop` user.
If you wish to use your default SSH key, typically stored in `~/.ssh/id_rsa.pub`, you would add the following to your HCL `.auto.tfvars` file:

```hcl
authorized_ssh_key_files = [ "~/.ssh/id_rsa.pub" ]
```

Then, when the cluster is created, you can log in to a node with `ssh hadoop@<node-public-ip-address>`.
The `/etc/hosts` file on each node has been updated with the names (manager, worker0, worker1, etc.) and IP addresses of the nodes. `pdsh` has been installed and `/etc/genders` has been configured. You should be able to `ssh` to any node as the `hadoop` user without a password. Likewise, you should be able to run `pdsh` commands against groups of nodes as the `hadoop` user. The pdsh genders group `manager` specifies the manager node, and the `worker` group specifies all worker nodes.
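For example, once logged in as the `hadoop` user, the genders groups can be used to fan a command out across the cluster (the log path assumes the default `zookeeper_dir`/`hadoop_dir`/`accumulo_dir` values):

```shell
# Run a command on every worker node
pdsh -g worker uptime

# List the Accumulo logs on each worker (default data directory assumed)
pdsh -g worker 'ls /data/accumulo/logs'
```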
Once the cluster is created, you can simply stop or start the nodes from the AWS console or Azure portal. Terraform is just for creating, updating, or destroying the resources. ZooKeeper and Hadoop are set up to use systemd service files, but Accumulo is not. You could log into the manager node and run `accumulo-cluster stop` before stopping the nodes. Or, you could just shut them down and force Accumulo to recover (which might be good for testing). When restarting the nodes from the AWS Console/Azure Portal, ZooKeeper and Hadoop should start on their own. For Accumulo, you should only need to run `accumulo-cluster start` on the manager node.
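A possible stop/start sequence, run from the manager node as the `hadoop` user (a sketch of the workflow described above):

```shell
# Before stopping the instances in the AWS console / Azure portal:
accumulo-cluster stop

# ...stop and later restart the nodes from the console/portal...

# After the instances are running again (ZooKeeper and Hadoop
# restart on their own via systemd):
accumulo-cluster start
```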