chore: Update Spark operator example to use new bootstrap flag for instance store volume RAID0 configuration #237

Merged Jul 14, 2023 (7 commits)

bryantbiggs (Contributor):

What does this PR do?

  • Update Spark operator example to use new bootstrap flag for instance store volume RAID0 configuration
  • Remove addons not used by pattern (nginx-ingress, aws load balancer controller)
  • Clean up VPC CNI configuration
  • Update EKS blueprints addons version; use default policy provided for AWS for FluentBit

Motivation

More

  • Yes, I have tested the PR using my local account setup (Provide any test evidence report under Additional Notes)
  • Mandatory for new blueprints. Yes, I have added an example to support my blueprint PR
  • Mandatory for new blueprints. Yes, I have updated the website/docs or website/blog section for this feature
  • Yes, I ran pre-commit run -a with this PR. Link for installing pre-commit locally

For Moderators

  • E2E Test successfully complete before merge?

Additional Notes

k get pods -A
NAMESPACE              NAME                                                              READY   STATUS    RESTARTS   AGE
amazon-cloudwatch      aws-cloudwatch-metrics-7bq9k                                      1/1     Running   0          47m
amazon-cloudwatch      aws-cloudwatch-metrics-9g7wb                                      1/1     Running   0          12m
amazon-cloudwatch      aws-cloudwatch-metrics-cg4r6                                      1/1     Running   0          12m
amazon-cloudwatch      aws-cloudwatch-metrics-rn9z2                                      1/1     Running   0          47m
amazon-cloudwatch      aws-cloudwatch-metrics-zflrs                                      1/1     Running   0          45m
grafana                grafana-8466845568-gl9td                                          1/1     Running   0          41m
karpenter              karpenter-c8d5488d-58qf9                                          1/1     Running   0          38m
karpenter              karpenter-c8d5488d-mwwvs                                          1/1     Running   0          44m
kube-system            aws-for-fluent-bit-9n5vv                                          1/1     Running   0          12m
kube-system            aws-for-fluent-bit-r42rr                                          1/1     Running   0          47m
kube-system            aws-for-fluent-bit-r544b                                          1/1     Running   0          47m
kube-system            aws-for-fluent-bit-rwgws                                          1/1     Running   0          12m
kube-system            aws-for-fluent-bit-tzml7                                          1/1     Running   0          45m
kube-system            aws-node-cjs9p                                                    1/1     Running   0          47m
kube-system            aws-node-cq6h4                                                    1/1     Running   0          45m
kube-system            aws-node-d6jfw                                                    1/1     Running   0          47m
kube-system            aws-node-k9t5q                                                    1/1     Running   0          12m
kube-system            aws-node-lnjcz                                                    1/1     Running   0          12m
kube-system            cluster-autoscaler-aws-cluster-autoscaler-84764c55b6-hldpc        1/1     Running   0          44m
kube-system            cluster-proportional-autoscaler-kube-dns-autoscaler-758f884tgx7   1/1     Running   0          44m
kube-system            coredns-f76998b44-277n8                                           1/1     Running   0          38m
kube-system            coredns-f76998b44-mh8hw                                           1/1     Running   0          41m
kube-system            ebs-csi-controller-6bd74c9df-cmqsr                                6/6     Running   0          38m
kube-system            ebs-csi-controller-6bd74c9df-pdk8n                                6/6     Running   0          41m
kube-system            ebs-csi-node-294gk                                                3/3     Running   0          45m
kube-system            ebs-csi-node-2lmh5                                                3/3     Running   0          12m
kube-system            ebs-csi-node-bxrzf                                                3/3     Running   0          12m
kube-system            ebs-csi-node-k6gs6                                                3/3     Running   0          47m
kube-system            ebs-csi-node-v5zr7                                                3/3     Running   0          47m
kube-system            kube-proxy-6bdsj                                                  1/1     Running   0          45m
kube-system            kube-proxy-bjzjp                                                  1/1     Running   0          47m
kube-system            kube-proxy-h9bd6                                                  1/1     Running   0          12m
kube-system            kube-proxy-jzn27                                                  1/1     Running   0          12m
kube-system            kube-proxy-s7bln                                                  1/1     Running   0          47m
kube-system            metrics-server-644f9cbbcf-m8bzf                                   1/1     Running   0          38m
kube-system            metrics-server-644f9cbbcf-zk7md                                   1/1     Running   0          41m
kubecost               kubecost-cost-analyzer-58c6674d57-zwkqn                           2/2     Running   0          44m
kubecost               kubecost-prometheus-server-7bc77dcc7f-cncwn                       2/2     Running   0          44m
prometheus             prometheus-kube-state-metrics-79775c6c-trkz9                      1/1     Running   0          44m
prometheus             prometheus-prometheus-node-exporter-5knpr                         1/1     Running   0          47m
prometheus             prometheus-prometheus-node-exporter-jmwqv                         1/1     Running   0          12m
prometheus             prometheus-prometheus-node-exporter-jsr85                         1/1     Running   0          12m
prometheus             prometheus-prometheus-node-exporter-q4qp2                         1/1     Running   0          45m
prometheus             prometheus-prometheus-node-exporter-sbzzr                         1/1     Running   0          47m
prometheus             prometheus-server-6c5b74d9d9-4x2h4                                2/2     Running   0          41m
spark-history-server   spark-history-server-7ddddbcd86-mqb44                             1/1     Running   0          44m
spark-operator         spark-operator-5fd887888d-bccx5                                   1/1     Running   0          41m
yunikorn               yunikorn-admission-controller-865dcdd5bb-xlbjg                    1/1     Running   0          44m
yunikorn               yunikorn-scheduler-7695f8b55d-9fxv5                               2/2     Running   0          38m

bryantbiggs temporarily deployed to DoEKS Test on July 6, 2023 00:46 with GitHub Actions (inactive)

vara-bonthu (Contributor) left a comment:

Thanks for updating the blueprint, @bryantbiggs. I left a few comments.

kube-proxy = {
coredns = {}
kube-proxy = {}
vpc-cni = {

vara-bonthu (Contributor):

Do we not need the VPC CNI policies to be added to this add-on? Is it a default now?

bryantbiggs (Contributor, Author):

no - ref #244

aws_for_fluentbit_cw_log_group = {
  create          = true
  use_name_prefix = false
  name            = "/${local.name}/aws-fluentbit-logs" # Add-on creates this log group

vara-bonthu (Contributor):

This name is added without a prefix so that we can use the same name in the cloudwatch_log_group variable below to write the logs. Any better ideas?

bryantbiggs (Contributor, Author):

re-added

Comment on lines 166 to 148
enable_aws_load_balancer_controller = true
aws_load_balancer_controller = {
  version = "1.4.7"
  timeout = "300"
}

enable_ingress_nginx = true
ingress_nginx = {
  version = "4.5.2"
  timeout = "300"
  values  = [templatefile("${path.module}/helm-values/nginx-values.yaml", {})]
}

vara-bonthu (Contributor):

These two add-ons are used to set up the Spark Live UI with path-based routing for each Spark job. Please keep them, but add a comment noting that they are there to build the Spark Live UI with the Spark Operator config.

Can we get the NLB DNS name as an output from this add-on and set it here?

#ingressUrlFormat: '<ENTER_NLB_DNS_NAME/CUSTOM_DOMAIN_NAME>/{{$appName}}'

bryantbiggs (Contributor, Author):

re-added

Comment on lines -70 to -90
data "aws_iam_policy_document" "fluent_bit" {
  statement {
    sid       = ""
    effect    = "Allow"
    resources = ["arn:${data.aws_partition.current.partition}:s3:::${module.s3_bucket.s3_bucket_id}/*"]

    actions = [
      "s3:ListBucket",
      "s3:PutObject",
      "s3:PutObjectAcl",
      "s3:GetObject",
      "s3:GetObjectAcl",
      "s3:DeleteObject",
      "s3:DeleteObjectVersion"
    ]
  }

  statement {
    sid       = ""
    effect    = "Allow"
    resources = ["arn:${data.aws_partition.current.partition}:logs:${data.aws_region.current.id}:${data.aws_caller_identity.current.account_id}:log-group:*"]

    actions = [
      "logs:CreateLogGroup",
      "logs:CreateLogStream",
      "logs:DescribeLogGroups",
      "logs:DescribeLogStreams",
      "logs:PutLogEvents",
      "logs:PutRetentionPolicy",
    ]

vara-bonthu (Contributor):

Are these policies now defaults in our FluentBit add-on? If so, we can remove this.

bryantbiggs (Contributor, Author):

yes, we had the cloudwatch permissions and I just added the S3 permissions in aws-ia/terraform-aws-eks-blueprints-addons#203

bryantbiggs (Contributor, Author):

Validation of logs writing to S3
image

@@ -81,7 +81,7 @@ spec:
   executor:
     volumeMounts:
       - name: spark-local-dir-1
-        mountPath: /data1
+        mountPath: /mnt/k8s-disks

vara-bonthu (Contributor):

Did you test any of these examples after the new RAID0 config? The pod mountPath (e.g., /data1) is a new directory, and it's different from the mountPoint (/mnt/k8s-disks).
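
To illustrate the distinction raised in this comment, here is a minimal sketch of a SparkApplication fragment. It assumes the RAID0 array is exposed on the node at /mnt/k8s-disks (the path used in this thread) and keeps /data1 as the in-container path. The volume name matches the diff above; the hostPath stanza and its `type` are assumptions about how the example wires the volume, not something shown in this PR.

```yaml
# Fragment of a SparkApplication spec (not a complete manifest).
# Assumption: the node's instance store RAID0 array is mounted on the host
# at /mnt/k8s-disks; adjust the path if the AMI mounts it elsewhere.
spec:
  volumes:
    - name: spark-local-dir-1
      hostPath:
        path: /mnt/k8s-disks   # host-side mount point (node filesystem)
        type: Directory        # assumption: the directory already exists on the node
  executor:
    volumeMounts:
      - name: spark-local-dir-1
        mountPath: /data1      # path seen inside the executor container
        readOnly: false
```

Changing only `mountPath` changes where the executor sees the directory; changing `hostPath.path` changes which host directory backs it.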

bryantbiggs (Contributor, Author):

reverted the /data1 changes. I did test the pyspark-pi-job.yaml:

k get pods -n spark-team-a -w
NAME                          READY   STATUS    RESTARTS   AGE
pyspark-pi-karpenter-driver   0/1     Pending   0          0s
pyspark-pi-karpenter-driver   0/1     Pending   0          3s
pyspark-pi-karpenter-driver   0/1     Pending   0          47s
pyspark-pi-karpenter-driver   0/1     ContainerCreating   0          48s
pyspark-pi-karpenter-driver   1/1     Running             0          3m11s
pythonpi-8fbf89894a5269c5-exec-1   0/1     Pending             0          0s
pythonpi-8fbf89894a5269c5-exec-2   0/1     Pending             0          0s
pythonpi-8fbf89894a5269c5-exec-1   0/1     Pending             0          1s
pythonpi-8fbf89894a5269c5-exec-2   0/1     Pending             0          1s
pythonpi-8fbf89894a5269c5-exec-1   0/1     Pending             0          41s
pythonpi-8fbf89894a5269c5-exec-2   0/1     Pending             0          41s
pythonpi-8fbf89894a5269c5-exec-1   0/1     ContainerCreating   0          42s
pythonpi-8fbf89894a5269c5-exec-2   0/1     ContainerCreating   0          42s
pythonpi-8fbf89894a5269c5-exec-1   1/1     Running             0          112s
pythonpi-8fbf89894a5269c5-exec-2   1/1     Running             0          112s
pythonpi-8fbf89894a5269c5-exec-1   1/1     Terminating         0          118s
pythonpi-8fbf89894a5269c5-exec-2   1/1     Terminating         0          118s
pyspark-pi-karpenter-driver        0/1     Completed           0          5m20s
pyspark-pi-karpenter-driver        0/1     Completed           0          5m22s
pythonpi-8fbf89894a5269c5-exec-2   0/1     Terminating         0          2m5s
pythonpi-8fbf89894a5269c5-exec-2   0/1     Terminating         0          2m5s
pythonpi-8fbf89894a5269c5-exec-2   0/1     Terminating         0          2m5s
pythonpi-8fbf89894a5269c5-exec-1   0/1     Terminating         0          2m5s
pythonpi-8fbf89894a5269c5-exec-1   0/1     Terminating         0          2m5s
pythonpi-8fbf89894a5269c5-exec-1   0/1     Terminating         0          2m5s
pyspark-pi-karpenter-driver        0/1     Terminating         0          8m6s
pyspark-pi-karpenter-driver        0/1     Terminating         0          8m6s

bryantbiggs (Contributor, Author):

And the ./taxi-trip-execute.sh example:

k get pods -n spark-team-a -w
NAME        READY   STATUS    RESTARTS   AGE
taxi-trip   0/1     Pending   0          7s
taxi-trip   0/1     Pending   0          44s
taxi-trip   0/1     Init:0/1   0          45s
taxi-trip   0/1     Init:0/1   0          55s
taxi-trip   0/1     PodInitializing   0          57s
taxi-trip   1/1     Running           0          84s
taxi-trip-exec-1   0/1     Pending           0          0s
taxi-trip-exec-2   0/1     Pending           0          0s
taxi-trip-exec-3   0/1     Pending           0          0s
taxi-trip-exec-1   0/1     Pending           0          0s
taxi-trip-exec-4   0/1     Pending           0          0s
taxi-trip-exec-2   0/1     Pending           0          1s
taxi-trip-exec-3   0/1     Pending           0          1s
taxi-trip-exec-4   0/1     Pending           0          1s
taxi-trip-exec-3   0/1     Pending           0          46s
taxi-trip-exec-2   0/1     Pending           0          46s
taxi-trip-exec-1   0/1     Pending           0          46s
taxi-trip-exec-3   0/1     Init:0/1          0          47s
taxi-trip-exec-1   0/1     Init:0/1          0          47s
taxi-trip-exec-2   0/1     Init:0/1          0          47s
taxi-trip-exec-4   0/1     Pending           0          52s
taxi-trip-exec-4   0/1     Init:0/1          0          52s
taxi-trip-exec-4   0/1     Init:0/1          0          53s
taxi-trip-exec-4   0/1     PodInitializing   0          54s
taxi-trip-exec-1   0/1     Init:0/1          0          56s
taxi-trip-exec-2   0/1     Init:0/1          0          56s
taxi-trip-exec-3   0/1     Init:0/1          0          57s
taxi-trip-exec-1   0/1     PodInitializing   0          57s
taxi-trip-exec-2   0/1     PodInitializing   0          57s
taxi-trip-exec-3   0/1     PodInitializing   0          58s
taxi-trip-exec-1   1/1     Running           0          77s
taxi-trip-exec-2   1/1     Running           0          77s
taxi-trip-exec-3   1/1     Running           0          77s
taxi-trip-exec-4   1/1     Running           0          81s
taxi-trip-exec-1   1/1     Terminating       0          11m
taxi-trip-exec-2   1/1     Terminating       0          11m
taxi-trip-exec-3   1/1     Terminating       0          11m
taxi-trip-exec-4   1/1     Terminating       0          11m
taxi-trip-exec-4   0/1     Terminating       0          11m
taxi-trip-exec-4   0/1     Terminating       0          11m
taxi-trip-exec-4   0/1     Terminating       0          11m
taxi-trip-exec-2   0/1     Terminating       0          11m
taxi-trip-exec-2   0/1     Terminating       0          11m
taxi-trip-exec-2   0/1     Terminating       0          11m
taxi-trip-exec-3   0/1     Terminating       0          11m
taxi-trip-exec-3   0/1     Terminating       0          11m
taxi-trip-exec-3   0/1     Terminating       0          11m
taxi-trip-exec-1   0/1     Terminating       0          11m
taxi-trip-exec-1   0/1     Terminating       0          11m
taxi-trip-exec-1   0/1     Terminating       0          11m
taxi-trip          0/1     Completed         0          13m
taxi-trip          0/1     Completed         0          13m
taxi-trip          0/1     Terminating       0          16m
taxi-trip          0/1     Terminating       0          16m

bryantbiggs temporarily deployed to DoEKS Test on July 12, 2023 14:40 with GitHub Actions (inactive)

vara-bonthu (Contributor) left a comment:

@bryantbiggs Thanks for the update 👍🏼
We may have to look at the Website doc for this blueprint and replace /local1 with /mnt/k8s-disks before merging this PR.

bryantbiggs temporarily deployed to DoEKS Test on July 14, 2023 11:32 with GitHub Actions (inactive)
vara-bonthu merged commit 3d01253 into awslabs:main on Jul 14, 2023
47 checks passed