Infrastructure for the AWS Analytical Environment
The long-running EMR cluster is currently deployed directly by Terraform. The cluster is restarted every night by the taint-emr Concourse job to apply any outstanding user changes (see Authorisation and RBAC).
The user batch cluster is deployed by the emr-launcher (GitHub) lambda with the configurations in the batch_cluster_config directory. The cluster is launched on demand by Azkaban using the custom DataWorks EMR Jobtype or DataWorks EMR Azkaban plugin. Clusters that have been inactive for a period are automatically shut down by scheduled Concourse jobs (<env>-stop-waiting).
As part of the EMR Launcher Lambda, when a batch EMR cluster is deployed, a new security configuration is created by copying the previous one and is associated with the new cluster. As per (DW-6602) and (DW-6602), these security configurations are copied by the EMR Launcher Lambda for the batch EMR clusters only; the reasoning is described in the tickets. A side effect is that security configurations accumulate: if the number of EMR security configurations reaches the maximum of 600, we will be unable to launch any more EMR clusters. Unless old configurations are periodically cleaned up, this can lead to outages of the user-facing aws-analytical-env EMR cluster and the batch clusters.
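The selection logic for such a clean-up can be sketched as follows. This is a minimal, hypothetical sketch (the function and field names are assumptions, and the real Concourse job would call the EMR API to list configurations and clusters): a configuration is only a deletion candidate if it is not attached to any running cluster and is older than some retention window.

```python
from datetime import datetime, timedelta, timezone

MAX_SECURITY_CONFIGS = 600  # AWS limit mentioned above


def configs_to_delete(configs, in_use, max_age_days=7, now=None):
    """Return names of EMR security configurations that are safe to delete.

    configs: list of {"Name": str, "CreationDateTime": datetime} dicts,
             as returned (in shape) by ListSecurityConfigurations.
    in_use:  set of configuration names attached to active clusters.

    A configuration is selected only if it is not in use and is older
    than max_age_days, so freshly copied configs are never removed.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        c["Name"]
        for c in configs
        if c["Name"] not in in_use and c["CreationDateTime"] < cutoff
    ]
```

The age check matters because the launcher copies a configuration immediately before attaching it; deleting purely by count could race with a cluster launch.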
The following (Concourse Job) is responsible for ensuring security configurations are periodically cleaned up.
Both clusters output their logs to the CloudWatch log group /app/analytical_batch/step_logs. The logs from user-submitted steps via Azkaban output to the CloudWatch log group /aws/emr/azkaban.
Authentication is mostly handled by Cognito. There are 2 different authentication mechanisms:
- Direct Cognito login with username and password - uses a custom auth flow in Cognito
- Federated login using DWP ADFS - bypasses the custom auth flow
The custom authentication flow (AWS docs) is used to implement additional security checks on top of the default Cognito ones; it is not needed for federated users. It uses the dataworks-analytical-custom-auth-flow lambdas, triggered by Cognito hooks (aws-analytical-env repo).
Authentication and authorisation checking happens at multiple points throughout the Analytical Environment:
- Dataworks Analytical Frontend Service - facilitates the login flow and stores JWT tokens (valid for 12 hours) in the browser's storage. Without valid credentials, the user will not be able to access the application
- Orchestration Service - Any request (provision and deprovision environments) made to the orchestration-service needs to include a valid JWT token
- Guacamole - Once the environment is fully provisioned, the JWT token is verified before establishing the remote desktop from the user's browser to the analytical workspace
- EMR - The analytical tooling environments have 2 distinct ways of interacting with the EMR cluster: Apache Livy for Spark sessions, and ODBC for Hive sessions:
- Apache Livy - JWT verification is performed by an Nginx proxy, which sits in front of the Livy server that runs on EMR - livy-proxy GitHub Repo
- ODBC/Hive - JWT verification is performed directly by Hive using Pluggable Authentication - analytical-env-hive-custom-auth repo builds the custom authentication JAR
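At each of these points, the check boils down to validating a JWT before allowing the request through. The snippet below sketches only the expiry part of that check (a simplification with an assumed function name; real verification, as done by the Nginx proxy and the Hive custom-auth JAR, must also validate the token's signature against the Cognito JWKS, its issuer, and its audience):

```python
import base64
import json
import time


def jwt_is_expired(token, now=None):
    """Return True if the JWT's exp claim is in the past.

    Decodes the payload segment only; signature verification is
    deliberately omitted here and must be done in any real deployment.
    """
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64url padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["exp"] < (now if now is not None else time.time())
```

This is why a session that has been open longer than 12 hours starts failing at every layer at once: the frontend, the orchestration service, Guacamole, Livy, and Hive all reject the same expired token.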
The RBAC system uses EMR security configurations to assign a unique IAM role for each user for S3 EMRFS requests. At the moment there is no RBAC at the Hive metastore level, so users can see all database and table metadata. RBAC is performed when users try to access data in S3 based on the corresponding IAM role specified in the security configuration.
Security configurations match a local Linux PAM user to an IAM role, therefore all users must exist as Linux users to be able to access data. All users are set up using a custom EMR step which only runs when the EMR cluster is started. The EMR cluster is restarted every night by the taint-emr job to ensure all users exist on the cluster.
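The effect of that user-setup step can be illustrated roughly as below (illustrative only; the real EMR step script, its flags, and how it obtains the user list all differ): for each expected user, create the local account only if it does not already exist, so the step is idempotent across nightly restarts.

```python
def useradd_commands(usernames):
    """Generate idempotent shell commands to ensure each analytical user
    exists as a local Linux (PAM) user on the EMR cluster.

    `id -u` succeeds silently if the user exists; otherwise useradd -m
    creates the account with a home directory.
    """
    return [f"id -u {u} >/dev/null 2>&1 || useradd -m {u}" for u in usernames]
```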
The users and permissions are stored in a MySQL database. The database assigns RBAC policies at the user group level, and each user can be assigned to a group to inherit the group's permissions. Currently permissions cannot be attached directly to a user.
The RBAC sync lambda (#TODO: add link) synchronises the users from the Cognito User Pool to the MySQL database. The lambda is invoked daily at 23:00 UTC by Concourse (admin-sync-and-munge/sync-cognito-users-<env>).
The RBAC 'munge' lambda takes all the access policies for a given user and combines them into the smallest number of AWS IAM policies, taking into account the resource limits imposed by AWS. The lambda is invoked by Concourse (admin-sync-and-munge/create-roles-and-munged-policies-<env>) after the sync job succeeds.
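The packing the munge lambda performs can be sketched as a greedy bin-fill over a user's resource ARNs. This is a simplification under stated assumptions: the function name is hypothetical, the real lambda handles more than a single action, and the character limit shown is the documented size cap for IAM managed policy documents.

```python
import json

MANAGED_POLICY_CHAR_LIMIT = 6144  # AWS cap on a managed policy document


def munge_statements(resources, char_limit=MANAGED_POLICY_CHAR_LIMIT):
    """Pack S3 resource ARNs into as few policy documents as possible,
    greedily filling each document up to the character limit."""

    def doc(res):
        # Render one policy document for the given resource list.
        return json.dumps({
            "Version": "2012-10-17",
            "Statement": [
                {"Effect": "Allow", "Action": "s3:GetObject", "Resource": res}
            ],
        })

    policies, current = [], []
    for arn in sorted(set(resources)):
        if current and len(doc(current + [arn])) > char_limit:
            policies.append(doc(current))  # current doc is full; start a new one
            current = [arn]
        else:
            current.append(arn)
    if current:
        policies.append(doc(current))
    return policies
```

Minimising the policy count matters because IAM also limits how many managed policies can be attached to a single role, so a user with many datasets must have their grants consolidated rather than attached one per dataset.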
There is a requirement for our data products to start using Hive 3 instead of Hive 2. Hive 3 comes bundled with EMR 6.2.0, along with other upgrades including Spark. Below is the list of steps taken to upgrade analytical-env and batch to EMR 6.2.0:
- Make sure you are using an AL2 AMI.
- Point the analytical-env clusters at the new metastore (hive_metastore_v2 in internal-compute) instead of the old one in configurations.yml. The values below should resolve to the new metastore, the details of which are an output of internal-compute:

      "javax.jdo.option.ConnectionURL": "jdbc:mysql://${hive_metastore_endpoint}:3306/${hive_metastore_database_name}?createDatabaseIfNotExist=true"
      "javax.jdo.option.ConnectionUserName": "${hive_metastore_username}"
      "javax.jdo.option.ConnectionPassword": "${hive_metastore_pwd}"
- Alter the security group deployment to use the new security group for hive-metastore-v2:

      hive_metastore_sg_id = data.terraform_remote_state.internal_compute.outputs.hive_metastore_v2.security_group.id
- Rotate the analytical-env user from the internal-compute pipeline so that when analytical-env or batch starts up it can log in to the metastore.
- Make sure to fetch the new secret, as the secret name has changed:

      data "aws_secretsmanager_secret_version" "hive_metastore_password_secret" {
        provider  = aws
        secret_id = "metadata-store-v2-analytical-env"
      }
- Bump the version of sparklyR from 2.4 to 3.0-2.12.
- Bump the EMR version to 6.2.0 and launch the cluster.
Make sure that the first time anything uses the metastore it is initialised with Hive 3; otherwise it will have to be rebuilt.
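To check which Hive major version initialised the metastore schema, you can query the metastore database's VERSION table (SELECT SCHEMA_VERSION FROM VERSION) and inspect the major version. A trivial helper for that check (the function name is an assumption):

```python
def is_hive3_schema(schema_version):
    """True if a Hive metastore SCHEMA_VERSION string (e.g. "3.1.0")
    indicates the schema was initialised by Hive 3 or later."""
    return int(schema_version.split(".")[0]) >= 3
```

Running this check before pointing any EMR 6.2.0 cluster at the metastore avoids discovering a Hive 2 schema only after jobs start failing.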