
ECONNRESET exceptions when running in Lambda environment #1196

Closed
monken opened this issue May 20, 2020 · 35 comments · Fixed by serverless-nextjs/serverless-next.js#720 or #1693

@monken

monken commented May 20, 2020

Describe the bug

import { S3 } from '@aws-sdk/client-s3';
import { Handler, Context, S3Event } from 'aws-lambda';

const s3 = new S3({})

export const handler: Handler = async (event: S3Event, context: Context) => {
  await s3.getObject({
    Bucket: event.Records[0].s3.bucket.name,
    Key: event.Records[0].s3.object.key,
  });
}

We have this very basic Lambda function that reads a file from S3 when a new file is uploaded (we actually consume the Body stream too, but left that out for brevity). The function is called intermittently, meaning that sometimes we get a fresh Lambda container (i.e. a cold start) and sometimes the container is reused. When the container is reused, we sometimes see an ECONNRESET exception such as this one:

2020-05-20T16:50:28.107Z	d7a43394-afad-4267-a4a4-5ad3633a1db8	ERROR	Error: socket hang up
    at connResetException (internal/errors.js:608:14)
    at TLSSocket.socketOnEnd (_http_client.js:460:23)
    at TLSSocket.emit (events.js:322:22)
    at endReadableNT (_stream_readable.js:1187:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21) {
  code: 'ECONNRESET',
  '$metadata': { retries: 0, totalRetryDelay: 0 }
}

I'm pretty confident that this is due to the keep-alive nature of the HTTPS connection. Lambda processes are frozen after they execute, and their host seems to terminate open sockets after ~10 minutes. The next time the S3 client tries to reuse the socket, the exception is thrown.
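
One way to sidestep the stale-socket reuse (a sketch only, and not what we want long-term since it gives up keep-alive entirely; it assumes the client runs on the default Node.js handler from @aws-sdk/node-http-handler) is to hand the client an agent with keep-alive disabled:

import https from 'https';
import { S3 } from '@aws-sdk/client-s3';
import { NodeHttpHandler } from '@aws-sdk/node-http-handler';

// Sketch: disable keep-alive so every request opens a fresh socket.
// This avoids reusing a socket the Lambda host may have torn down while
// the container was frozen, at the cost of a TLS handshake per call.
const s3NoKeepAlive = new S3({
  requestHandler: new NodeHttpHandler({
    httpsAgent: new https.Agent({ keepAlive: false }),
  }),
});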

We are running into similar issues with connections to our Aurora database, which are also terminated intermittently with the same error message (see brianc/node-postgres#2112). It's an error we can easily recover from by reopening the socket, but aws-sdk-v3 seems to prefer to throw instead.

Is the issue in the browser/Node.js?
Node.js 12.x on AWS Lambda

@PoBuchi

PoBuchi commented May 26, 2020

Having a similar issue using the DynamoDB client.

Using:
Node.js 12.x on AWS Lambda
"@aws-sdk/client-dynamodb": "^1.0.0-beta.5"

Error: write ECONNRESET
    at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:92:16)
    at writevGeneric (internal/stream_base_commons.js:132:26)
    at TLSSocket.Socket._writeGeneric (net.js:782:11)
    at TLSSocket.Socket._writev (net.js:791:8)
    at doWrite (_stream_writable.js:440:12)
    at clearBuffer (_stream_writable.js:558:5)
    at TLSSocket.Writable.uncork (_stream_writable.js:337:7)
    at ClientRequest._flushOutput (_http_outgoing.js:861:10)
    at ClientRequest._flush (_http_outgoing.js:830:22)
    at _http_client.js:315:47 {
  errno: 'ECONNRESET',
  code: 'ECONNRESET',
  syscall: 'write',
  '$metadata': { retries: 0, totalRetryDelay: 0 }
}

@samirda

samirda commented May 26, 2020

Info
AWS Lambda Node.js 12.x
"@aws-sdk/client-dynamodb": "^1.0.0-gamma.1"

Lambda

import { DynamoDBClient, DescribeTableCommand } from "@aws-sdk/client-dynamodb"

const dynamo = new DynamoDBClient({})

export const tempDebug = async (): Promise<object> => {
  const res = await dynamo.send(new DescribeTableCommand({
    TableName: '<TableName>'
  }))

  return Promise.resolve(res.Table)
}

Local

import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda"

declare const TextDecoder
const lambda = new LambdaClient({})

;(async () => {
  let counter = 0
  // eslint-disable-next-line no-constant-condition
  while (true) {
    console.log(counter)
    counter++
    const res = await lambda.send(new InvokeCommand({
      FunctionName: '<FunctionName>'
    }))
    
    const obj = JSON.parse(new TextDecoder("utf-8").decode(res.Payload))
    if (obj.errorType === 'Error') {
      console.log(obj)
      break
    }

    //await new Promise(resolve => setTimeout(resolve, 5 * 60 * 1000))
    await new Promise(resolve => setTimeout(resolve, 90 * 1000))
  }
})()

Produces the following errors consistently when run with 90-second intervals. The first call works; the second call, 90 seconds later, produces one of the following two errors.
Error logs are from CloudWatch.

{
    "errorType": "Error",
    "errorMessage": "write EPIPE",
    "code": "EPIPE",
    "errno": "EPIPE",
    "syscall": "write",
    "$metadata": {
        "retries": 0,
        "totalRetryDelay": 0
    },
    "stack": [
        "Error: write EPIPE",
        "    at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:92:16)",
        "    at writevGeneric (internal/stream_base_commons.js:132:26)",
        "    at TLSSocket.Socket._writeGeneric (net.js:782:11)",
        "    at TLSSocket.Socket._writev (net.js:791:8)",
        "    at doWrite (_stream_writable.js:401:12)",
        "    at clearBuffer (_stream_writable.js:519:5)",
        "    at TLSSocket.Writable.uncork (_stream_writable.js:338:7)",
        "    at ClientRequest._flushOutput (_http_outgoing.js:862:10)",
        "    at ClientRequest._flush (_http_outgoing.js:831:22)",
        "    at _http_client.js:315:47"
    ]
}
{
    "errorType": "Error",
    "errorMessage": "socket hang up",
    "code": "ECONNRESET",
    "$metadata": {
        "retries": 0,
        "totalRetryDelay": 0
    },
    "stack": [
        "Error: socket hang up",
        "    at connResetException (internal/errors.js:608:14)",
        "    at TLSSocket.socketOnEnd (_http_client.js:453:23)",
        "    at TLSSocket.emit (events.js:322:22)",
        "    at endReadableNT (_stream_readable.js:1187:12)",
        "    at processTicksAndRejections (internal/process/task_queues.js:84:21)"
    ]
}

Works as expected when run with 1 minute intervals.

@samirda

samirda commented May 28, 2020

Edit: Better workaround #1196 (comment)

I also changed err.code === 'ECONNRESET' -> (err.code === 'ECONNRESET' || err.code === 'EPIPE')


I think I found a workaround: the client needs to call the destroy method.

dynamo.destroy()

import { DynamoDBClient, DescribeTableCommand } from "@aws-sdk/client-dynamodb"

const dynamo = new DynamoDBClient({})

export const tempDebug = async (): Promise<object> => {
  const res = await dynamo.send(new DescribeTableCommand({
    TableName: '<TableName>'
  }))

  dynamo.destroy()

  return Promise.resolve(res.Table)
}

I have not been able to reproduce the issue when running with ts-node v8.8.1 or with a Docker image (node 12.17.0).
I have only had the issue on AWS Lambda Node.js 12.x.

@monken
Author

monken commented May 28, 2020

@samirda thanks for that. However, for highly trafficked lambdas this will eliminate the benefits of keep-alive.

Either way, ECONNRESET is a retry-able error and should be handled that way.

@samirda

samirda commented May 28, 2020

@monken I agree, but the workaround should be safe enough against these errors until a proper fix is in place.

@studds

studds commented Jun 1, 2020

I've managed to work around this using this configuration (updated for gamma):

import {
    StandardRetryStrategy,
    defaultRetryDecider,
} from '@aws-sdk/middleware-retry';
import { SdkError } from '@aws-sdk/smithy-client';

const retryDecider = (err: SdkError & { code?: string }) => {
    if (
        'code' in err &&
        (err.code === 'ECONNRESET' ||
            err.code === 'EPIPE' ||
            err.code === 'ETIMEDOUT')
    ) {
        return true;
    } else {
        return defaultRetryDecider(err);
    }
};
// eslint-disable-next-line @typescript-eslint/require-await
const retryStrategy = new StandardRetryStrategy(async () => 3, {
    retryDecider,
});
export const defaultClientConfig = {
    maxRetries: 3,
    retryStrategy,
};
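
For anyone else applying this: the shared config can then be spread into any v3 client, e.g. (a quick sketch):

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';

// Any client constructed with this config picks up the custom retry decider.
const dynamo = new DynamoDBClient({ ...defaultClientConfig });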

It would be nice if this was built into defaultRetryDecider. Although, is there an argument for this being built into the NodeHttpHandler, as this is a Node-specific error, and one where the handler should probably "just work"?

@abierbaum

@studds Thanks for the elegant solution. This is working perfectly for me now.

@polovi

polovi commented Jul 4, 2020

Is there anything new, or any plan to fix this? Under peak load, every second or third invocation of a Lambda using the DynamoDB client ends with one of these errors:

Error: write ETIMEDOUT
    at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:92:16)
    at writevGeneric (internal/stream_base_commons.js:132:26)
    at TLSSocket.Socket._writeGeneric (net.js:782:11)
    at TLSSocket.Socket._writev (net.js:791:8)
    at doWrite (_stream_writable.js:401:12)
    at clearBuffer (_stream_writable.js:519:5)
    at TLSSocket.Writable.uncork (_stream_writable.js:338:7)
    at ClientRequest._flushOutput (_http_outgoing.js:862:10)
    at ClientRequest._flush (_http_outgoing.js:831:22)
    at _http_client.js:315:47 {
  errno: 'ETIMEDOUT',
  code: 'ETIMEDOUT',
  syscall: 'write',
  '$metadata': { retries: 0, totalRetryDelay: 0 }
}
Error: write EPIPE
    at WriteWrap.onWriteComplete [as oncomplete] (internal/stream_base_commons.js:92:16)
    at writevGeneric (internal/stream_base_commons.js:132:26)
    at TLSSocket.Socket._writeGeneric (net.js:782:11)
    at TLSSocket.Socket._writev (net.js:791:8)
    at doWrite (_stream_writable.js:401:12)
    at clearBuffer (_stream_writable.js:519:5)
    at TLSSocket.Writable.uncork (_stream_writable.js:338:7)
    at ClientRequest._flushOutput (_http_outgoing.js:862:10)
    at ClientRequest._flush (_http_outgoing.js:831:22)
    at _http_client.js:315:47 {
  errno: 'EPIPE',
  code: 'EPIPE',
  syscall: 'write',
  '$metadata': { retries: 0, totalRetryDelay: 0 }
}
Error: socket hang up
    at connResetException (internal/errors.js:608:14)
    at TLSSocket.socketOnEnd (_http_client.js:453:23)
    at TLSSocket.emit (events.js:322:22)
    at endReadableNT (_stream_readable.js:1187:12)
    at processTicksAndRejections (internal/process/task_queues.js:84:21) {
  code: 'ECONNRESET',
  '$metadata': { retries: 0, totalRetryDelay: 0 }
}

So I'm unable to use SDK v3 as a replacement for the old SDK.

@trivikr
Member

trivikr commented Jul 9, 2020

clients in 1.0.0-gamma.3 now retry in case of Transient Errors

It doesn't check for ECONNRESET, ETIMEDOUT or EPIPE though

@maxgr0

maxgr0 commented Jul 27, 2020

Got the same problem using 1.0.0-gamma.5 with @aws-sdk/client-dynamodb and @aws-sdk/client-s3.
Don't know where the error comes from, as it is extremely random (around every 10th-15th request). With the "old" aws-sdk this never happens.

@polovi

polovi commented Jul 30, 2020

I confirm, it still happens with 1.0.0-gamma.5

@samirda

samirda commented Aug 11, 2020

I think this has been resolved in 1.0.0-gamma.6

@monken
Author

monken commented Aug 11, 2020

There is no mention of a fix in the release notes: https://github.com/aws/aws-sdk-js-v3/releases/tag/v1.0.0-gamma.6

@samirda

samirda commented Aug 11, 2020

I know but I have not been able to reproduce the errors.

@maxgr0

maxgr0 commented Aug 12, 2020

Issues are still happening in 1.0.0-gamma.6 😕

@polovi

polovi commented Aug 12, 2020

Me too, still seeing errors with 1.0.0-gamma.6.

@samirda

samirda commented Aug 17, 2020

I was too quick; errors are still happening in 1.0.0-gamma.6.

@maxgr0

maxgr0 commented Aug 25, 2020

Anyone made some progress here? @samirda @polovi @monken

@polovi

polovi commented Aug 31, 2020

There's 1.0.0-gamma.7, but I haven't had time to test it yet.

@polovi

polovi commented Oct 4, 2020

I'm testing 1.0.0-gamma.10 in production with logging via a custom retry strategy.

@polovi

polovi commented Oct 9, 2020

1.0.0-gamma.10 still produces the same errors.

@rraziel

rraziel commented Oct 21, 2020

Has anyone from AWS or a maintainer even commented on this issue?
This seems like it should be a priority, given that it happens in most use cases unless you rarely call your Lambdas.

@monken
Author

monken commented Oct 21, 2020

@AllanFly120 mind chiming in?

@martinjuhasz

I'm also seeing this regularly. I increased maxAttempts without any significant improvement.

const dynamodbClient = new DynamoDB({ maxAttempts: 10 })

@monken
Author

monken commented Nov 6, 2020

So we are at release candidate 4 and this problem has not even been acknowledged 😞

@martinjuhasz

martinjuhasz commented Nov 10, 2020

Using the fix from serverless-nextjs fixed it for me. This is not a permanent solution at all, as it simply keeps retrying whenever one of the matched error codes is returned.

TS implementation:

import type { SdkError } from '@aws-sdk/smithy-client'
import {
	defaultRetryDecider,
	StandardRetryStrategy,
} from '@aws-sdk/middleware-retry'

// fix error in SDK release candidate
// see: https://github.com/aws/aws-sdk-js-v3/issues/1196
// see: https://github.com/serverless-nextjs/serverless-next.js/pull/720/files
export const retryStrategy = new StandardRetryStrategy(async () => 3, {
	retryDecider: (err: SdkError & { code?: string }) => {
		if (
			'code' in err &&
			(err.code === 'ECONNRESET' ||
				err.code === 'EPIPE' ||
				err.code === 'ETIMEDOUT')
		) {
			return true
		} else {
			return defaultRetryDecider(err)
		}
	},
})

import { DynamoDB } from '@aws-sdk/client-dynamodb'
const dynamodbClient = new DynamoDB({ retryStrategy })

the retryStrategy prop is available in all clients AFAIK.
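
For example, the same strategy can be passed to any other client (a sketch, assuming @aws-sdk/client-s3 is installed):

import { S3Client } from '@aws-sdk/client-s3'

// Any v3 client accepts retryStrategy in its constructor config.
const s3Client = new S3Client({ retryStrategy })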

Hoping for an actual fix in the next RC

@trivikr trivikr self-assigned this Nov 11, 2020
@trivikr
Member

trivikr commented Nov 12, 2020

Documentation of the errors discussed in this thread:

  • ECONNRESET (Connection reset by peer): A connection was forcibly closed by a peer. This normally results from a loss of the connection on the remote socket due to a timeout or reboot. Commonly encountered via the http and net modules.
  • EPIPE (Broken pipe): A write on a pipe, socket, or FIFO for which there is no process to read the data. Commonly encountered at the net and http layers, indicative that the remote side of the stream being written to has been closed.
  • ETIMEDOUT (Operation timed out): A connect or send request failed because the connected party did not properly respond after a period of time. Usually encountered by http or net. Often a sign that a socket.end() was not properly called.

Node.js docs: https://nodejs.org/api/errors.html#errors_common_system_errors

@trivikr
Member

trivikr commented Nov 16, 2020

Findings with the example code given in #1196 (comment):

  • The return value from describeTableCommand is deserialized in deserializeAws_json1_0DescribeTableCommand

    export const deserializeAws_json1_0DescribeTableCommand = async (
      output: __HttpResponse,
      context: __SerdeContext
    ): Promise<DescribeTableCommandOutput> => {
      if (output.statusCode >= 300) {
        return deserializeAws_json1_0DescribeTableCommandError(output, context);
      }
      const data: any = await parseBody(output.body, context);
      let contents: any = {};
      contents = deserializeAws_json1_0DescribeTableOutput(data, context);
      const response: DescribeTableCommandOutput = {
        $metadata: deserializeMetadata(output),
        ...contents,
      };
      return Promise.resolve(response);
    };
  • However, when the error is re-thrown by the handler, the deserializer is not called.

The smithy-client doesn't re-throw the error:

} else {
  return handler(command).then((result) => result.output);
}

Question to answer: When requestHandler throws an error, why is the deserialize step not called?

@trivikr
Member

trivikr commented Nov 16, 2020

Question to answer: When requestHandler throws an error, why is the deserialize step not called?

The deserializer is not called because the next function in deserializerMiddleware throws:

const { logger, outputFilterSensitiveLog } = context;
const { response } = await next(args);
const parsed = await deserializer(response, options);

As explained in the hack in #1196 (comment), this can be caught and retried in retryMiddleware, as it has a try/catch:

} catch (err) {
  attempts++;
  if (this.shouldRetry(err as SdkError, attempts, maxAttempts)) {
    retryTokenAmount = this.retryQuota.retrieveRetryTokens(err);
    const delay = this.delayDecider(
      isThrottlingError(err) ? THROTTLING_RETRY_DELAY_BASE : DEFAULT_RETRY_DELAY_BASE,
      attempts
    );
    totalDelay += delay;
    await new Promise((resolve) => setTimeout(resolve, delay));
    continue;
  }
  if (!err.$metadata) {
    err.$metadata = {};
  }
  err.$metadata.attempts = attempts;
  err.$metadata.totalRetryDelay = totalDelay;
  throw err;
}
}

Question: Should the deserializer be called when the next function throws an error?
Answer: No, because the deserializer doesn't have a valid response to process if the next function throws. This should be handled in retryMiddleware.

@rraziel

rraziel commented Nov 16, 2020

Just wondering (but nice to see the issue is being looked at): are you saying this is "just" an error that's not properly handled?

Because it seems to me like this is sweeping the issue under the rug, retrying when there is a failure after a (long) timeout: there is no reason for the request to fail, as there is no throttling or anything else happening.

I can easily reproduce the bug by calling the code once every 10-ish minutes: the lambda process will still be around, there is no other load or anything in parallel, but the previous request was not properly "cleaned up".

Again - this has been mentioned before - simply swapping the new v3 request with the old aws-sdk one works right away without any issue.

We migrated a ton of services to use the v3 SDK and can't merge that branch to production because of this.

@trivikr
Member

trivikr commented Nov 16, 2020

Hi @rraziel, I'm currently looking into how JS SDK v2 handles this and will provide a fix in v3 accordingly.

are you saying this is "just" an error that's not properly handled?

The current behavior is undesirable, and the SDK should retry the error instead of asking the user to do it.

@trivikr
Member

trivikr commented Nov 16, 2020

The fix needs to be in node-http-handler, where it should detect ECONNRESET/EPIPE/ETIMEDOUT errors and reject them with the name "TimeoutError" so that they can be retried.

Example TimeoutError for socket timeout:
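
For reference, here is a paraphrased sketch (not the exact SDK source; the helper name is approximate) of how a socket timeout is already surfaced as a retryable TimeoutError; the fix applies the same pattern to the error codes above:

import { ClientRequest } from 'http';

// Sketch: on socket timeout, destroy the request and reject the pending
// promise with an Error named "TimeoutError", which the retry strategy
// treats as retryable.
const setSocketTimeout = (request: ClientRequest, reject: (err: Error) => void, timeoutInMs = 0) => {
  request.setTimeout(timeoutInMs, () => {
    request.destroy();
    reject(Object.assign(new Error(`Connection timed out after ${timeoutInMs} ms`), { name: 'TimeoutError' }));
  });
};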

@trivikr
Member

trivikr commented Nov 18, 2020

This issue is fixed in #1693, and will be published in rc.7 on Thursday 11/19

@trivikr
Member

trivikr commented Nov 20, 2020

Update: the v1.0.0-rc.7 release is complete: https://github.com/aws/aws-sdk-js-v3/releases/tag/v1.0.0-rc.7

Verified that this issue is fixed using example code given in #1196 (comment)

@github-actions

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 16, 2021