
[ISSUE] databricks_mount fails to mount AWS S3 bucket with Unable to execute HTTP request: Remote host terminated the handshake #1500

Closed
mvrangeme opened this issue Jul 26, 2022 · 6 comments
Labels
invalid This issue is not relevant to this provider or works as designed

Comments

@mvrangeme

Configuration

resource "aws_s3_bucket" "this" {
  bucket = "test_databricks_data"
}

resource "aws_s3_bucket_acl" "this" {
  bucket = aws_s3_bucket.this.id
  acl    = "private"
}

resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.bucket
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "this" {
  bucket                  = aws_s3_bucket.this.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_versioning" "this" {
  bucket = aws_s3_bucket.this.id
  versioning_configuration {
    status = "Enabled"
  }
}

# This is the resource which is failing / timing out
resource "databricks_mount" "data" {
  name       = "data"
  cluster_id = var.general_purpose_cluster_id
  s3 {
    instance_profile = var.databricks_instance_profile_id
    bucket_name      = aws_s3_bucket.this.id
  }
}

Note that

  • I've tested this with and without a bucket policy. It doesn't seem to make a difference, so I've not included the bucket policy in the configuration I'm sharing
  • The S3 bucket is in the same region as the general purpose cluster instances
  • The instance profile role attached to the general purpose cluster instances is as follows:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowList",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::test_databricks_data"
        },
        {
            "Sid": "AllowObjectActions",
            "Effect": "Allow",
            "Action": [
                "s3:PutObjectAcl",
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::test_databricks_data/*"
        }
    ]
}

Expected Behavior

What should have happened?

  • The databricks_mount resource should successfully reach the "created" status
  • The /mnt/data mount should be accessible on Databricks

Actual Behavior

What actually happened?

  • The databricks_mount resource is in "creating" status for 10 minutes and eventually times out
  • The /mnt/data mount actually is created, but the terraform apply command still fails because the databricks_mount resource times out in the "creating" status
  • The Log4j logs on the general purpose cluster shows quite a few instances of the following error:
22/07/19 04:11:06 ERROR DatabricksS3LoggingUtils$:V3: S3 request failed with com.amazonaws.SdkClientException: Unable to execute HTTP request: Remote host terminated the handshake; Request ID: null, Extended Request ID: null, Cloud Provider: AWS, Instance ID: i-0c463b179dd21c571
com.amazonaws.SdkClientException: Unable to execute HTTP request: Remote host terminated the handshake
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1216)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1162)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:713)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:695)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:559)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:539)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5453)
	at com.amazonaws.services.s3.AmazonS3Client.getBucketRegionViaHeadRequest(AmazonS3Client.java:6428)
	at com.amazonaws.services.s3.AmazonS3Client.fetchRegionFromCache(AmazonS3Client.java:6401)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5438)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5400)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5394)
	at com.amazonaws.services.s3.AmazonS3Client.listObjectsV2(AmazonS3Client.java:971)
	at shaded.databricks.org.apache.hadoop.fs.s3a.EnforcingDatabricksS3Client.listObjectsV2(EnforcingDatabricksS3Client.scala:214)
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$listObjects$5(S3AFileSystem.java:1852)
	at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:333)
	at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:294)
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.listObjects(S3AFileSystem.java:1843)
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3388)
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3348)
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:3287)
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.innerListStatus(S3AFileSystem.java:2809)
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$listStatus$12(S3AFileSystem.java:2788)
	at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:118)
	at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:112)
	at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:2788)
	at com.databricks.backend.daemon.data.client.DBFSV2.$anonfun$listStatus$2(DatabricksFileSystemV2.scala:97)
	at com.databricks.s3a.S3AExceptionUtils$.convertAWSExceptionToJavaIOException(DatabricksStreamUtils.scala:66)
	at com.databricks.backend.daemon.data.client.DBFSV2.$anonfun$listStatus$1(DatabricksFileSystemV2.scala:94)
	at com.databricks.logging.UsageLogging.$anonfun$recordOperation$1(UsageLogging.scala:413)
	at com.databricks.logging.UsageLogging.executeThunkAndCaptureResultTags$1(UsageLogging.scala:507)
	at com.databricks.logging.UsageLogging.$anonfun$recordOperationWithResultTags$4(UsageLogging.scala:528)
	at com.databricks.logging.Log4jUsageLoggingShim$.$anonfun$withAttributionContext$1(Log4jUsageLoggingShim.scala:29)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
	at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:94)
	at com.databricks.logging.Log4jUsageLoggingShim$.withAttributionContext(Log4jUsageLoggingShim.scala:27)
	at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:283)
	at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:282)
	at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.withAttributionContext(DatabricksFileSystemV2.scala:512)
	at com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:318)
	at com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:303)
	at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.withAttributionTags(DatabricksFileSystemV2.scala:512)
	at com.databricks.logging.UsageLogging.recordOperationWithResultTags(UsageLogging.scala:502)
	at com.databricks.logging.UsageLogging.recordOperationWithResultTags$(UsageLogging.scala:422)
	at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.recordOperationWithResultTags(DatabricksFileSystemV2.scala:512)
	at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:413)
	at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:385)
	at com.databricks.backend.daemon.data.client.DatabricksFileSystemV2.recordOperation(DatabricksFileSystemV2.scala:512)
	at com.databricks.backend.daemon.data.client.DBFSV2.listStatus(DatabricksFileSystemV2.scala:94)
	at com.databricks.backend.daemon.data.client.DatabricksFileSystem.listStatus(DatabricksFileSystem.scala:164)
	at com.databricks.backend.daemon.dbutils.FSUtils$.$anonfun$ls$1(DBUtilsCore.scala:157)
	at com.databricks.backend.daemon.dbutils.FSUtils$.withFsSafetyCheck(DBUtilsCore.scala:91)
	at com.databricks.backend.daemon.dbutils.FSUtils$.ls(DBUtilsCore.scala:155)
	at com.databricks.backend.daemon.dbutils.FSUtils.ls(DBUtilsCore.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:306)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
	at java.lang.Thread.run(Thread.java:748)
Caused by: javax.net.ssl.SSLHandshakeException: Remote host terminated the handshake
	at sun.security.ssl.SSLSocketImpl.handleEOF(SSLSocketImpl.java:1596)
	at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1426)
	at sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1324)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:439)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:410)
	at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:436)
	at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
	at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.connectSocket(SdkTLSSocketFactory.java:142)
	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
	at com.amazonaws.http.conn.$Proxy55.connect(Unknown Source)
	at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
	at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1343)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154)
	... 67 more
Caused by: java.io.EOFException: SSL peer shut down incorrectly
	at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:481)
	at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:470)
	at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:160)
	at sun.security.ssl.SSLTransport.decode(SSLTransport.java:110)
	at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1418)
	... 90 more

Steps to Reproduce

Please list the steps required to reproduce the issue, for example:

  1. terraform apply

Terraform and provider versions

Please paste the output of terraform version. If your version of the databricks provider is not the latest (https://github.com/databricks/terraform-provider-databricks/releases), please make sure to upgrade to the latest one.

Terraform v1.2.4
on darwin_amd64
+ provider registry.terraform.io/databricks/databricks v1.0.0
+ provider registry.terraform.io/hashicorp/aws v4.21.0

Debug Output

Please turn on logging, e.g. TF_LOG=DEBUG terraform apply, run the command again, paste the output to a gist, and provide the link to the gist. If you'd rather paste log output inline, make sure you provide only the relevant log lines with requests.

It would be more readable if you piped the log through | grep databricks | sed -E 's/^.* plugin[^:]+: (.*)$/\1/', e.g.:

TF_LOG=DEBUG terraform plan 2>&1 | grep databricks | sed -E 's/^.* plugin[^:]+: (.*)$/\1/'

Relevant output as follows:

2022-07-26T16:18:45.300+1000 [DEBUG] ProviderTransformer: "module.storage.databricks_mount.data" (*terraform.NodeValidatableResource) needs provider["registry.terraform.io/databricks/databricks"].mgmt

2022-07-26T16:18:45.305+1000 [DEBUG] ReferenceTransformer: "module.storage.databricks_mount.data" references: [module.storage.var.gp_cluster_id (expand) module.storage.var.databricks_instance_profile_id (expand) module.storage.local.data_science_pipeline_bucket_id (expand)]

If Terraform produced a panic, please provide a link to a GitHub Gist containing the output of the crash.log.

Important Factoids

Is there anything atypical about your accounts that we should know?

@nfx
Contributor

nfx commented Jul 26, 2022

relevant note to DBFS team:

22/07/19 04:11:06 ERROR DatabricksS3LoggingUtils$:V3: S3 request failed with com.amazonaws.SdkClientException: Unable to execute HTTP request: Remote host terminated the handshake; Request ID: null, Extended Request ID: null, Cloud Provider: AWS, Instance ID: i-0c463b179dd21c571
...
	at com.databricks.backend.daemon.data.client.DBFSV2.$anonfun$listStatus$2(DatabricksFileSystemV2.scala:97)
...
Caused by: javax.net.ssl.SSLHandshakeException: Remote host terminated the handshake
Caused by: java.io.EOFException: SSL peer shut down incorrectly

@stormwindy

Are there any firewall/VPC settings possibly not whitelisting S3 endpoints, including regional ones? Off the top of my head, regional endpoint whitelisting should be done for the region where the bucket exists.
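One quick way to test that hypothesis from a cluster node is a direct probe of the regional endpoint. This is a sketch, not something from the thread: the region is a placeholder, and the curl call is left commented out since it needs network access from inside the VPC.

```shell
# Hypothetical reachability check: derive the regional S3 endpoint for the
# bucket's region and (optionally) attempt a TLS handshake against it.
REGION="us-west-1"                      # placeholder; use your bucket's region
ENDPOINT="s3.${REGION}.amazonaws.com"
echo "regional S3 endpoint: ${ENDPOINT}"
# From a cluster node, a firewall that drops the Client Hello reproduces the
# "Remote host terminated the handshake" error; uncomment to probe:
# curl -sS --connect-timeout 5 "https://${ENDPOINT}/" -o /dev/null \
#   || echo "TLS handshake to ${ENDPOINT} failed (firewall?)"
```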

@thaiphv
Contributor

thaiphv commented Jul 27, 2022

This is the same issue that I ran into. In my case I was using an S3 gateway endpoint. For some reason, newly created buckets can't be queried via the global endpoint s3.amazonaws.com. However, they can be queried via the regional endpoint of the region where they were created, for example s3.ap-southeast-2.amazonaws.com in my case.

I lodged issue #1492 hoping that I wouldn't have to create a dedicated cluster just for mounting S3 buckets in my workspace. Instead, the Terraform provider could spin up a new cluster with a given instance profile, and I could pass the fs.s3a.endpoint setting to the dbutils.fs.mount function via the extra_configs attribute.

However, since that issue was rejected, I had to work around it by doing something like the following:

resource "aws_s3_bucket" "workspace" {
  bucket = local.s3_bucket_name
}

resource "databricks_mount" "workspace" {
  name       = local.workspace_name
  cluster_id = var.generic_cluster_id
  uri        = "s3a://${aws_s3_bucket.workspace.id}"
  extra_configs = {
    "fs.s3a.endpoint" = "s3.${var.aws_region}.amazonaws.com"
  }
}

@mvrangeme
Author

Thanks for the feedback so far. I'm attempting @thaiphv's workaround now.

There was some truth to @stormwindy's comment too. We had a firewall that was not whitelisting a regional endpoint.

I'll update this issue after testing today.

@mvrangeme
Author

Ok so it's definitely related to our AWS Network Firewall. Here are some relevant packets that I captured:

[screenshot: packet capture showing the TLS Client Hello going unanswered]

So you can see that the Client Hello is not acknowledged.

I'm still working out exactly what's wrong with our Network Firewall rules. I'll update this issue when I have a resolution.

@nfx nfx changed the title [ISSUE] Provider issue [ISSUE] databricks_mount fails to mount AWS S3 bucket with Unable to execute HTTP request: Remote host terminated the handshake Jul 28, 2022
@nfx nfx added the invalid This issue is not relevant to this provider or works as designed label Jul 28, 2022
@mvrangeme
Author

mvrangeme commented Jul 29, 2022

Ok fixed 😓 .

Issue was that our Databricks cluster is deployed to the us-west-2 region and we were trying to mount an S3 bucket in the us-west-1 region.

Traffic to us-west-2 S3 buckets goes via an S3 VPC endpoint and therefore bypasses our Network Firewall, whereas traffic to the us-west-1 bucket hits the firewall.

The fix was to punch a hole through our Network Firewall allowing access to the s3.us-west-1.amazonaws.com endpoint.
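For anyone hitting the same wall, that kind of hole can be sketched in Terraform as an AWS Network Firewall domain-list rule group. This is a hypothetical sketch, not the configuration actually used in this thread; the resource name, capacity, and target domains are illustrative, and attribute names should be verified against the AWS provider documentation for your version.

```hcl
# Hypothetical stateful rule group allowing traffic to the regional S3
# endpoint (and bucket subdomains) through AWS Network Firewall.
resource "aws_networkfirewall_rule_group" "allow_s3_regional" {
  name     = "allow-s3-us-west-1"
  type     = "STATEFUL"
  capacity = 10

  rule_group {
    rules_source {
      rules_source_list {
        generated_rules_type = "ALLOWLIST"
        target_types         = ["TLS_SNI", "HTTP_HOST"]
        targets = [
          "s3.us-west-1.amazonaws.com",
          ".s3.us-west-1.amazonaws.com", # leading dot matches bucket subdomains
        ]
      }
    }
  }
}
```

The rule group would then be referenced from the firewall policy via a stateful_rule_group_reference block.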

Thanks to everyone for their comments!

4 participants