Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Randomly getting AEROSPIKE_ERR_TIMEOUT #144

Closed
akyoto opened this issue Aug 14, 2016 · 22 comments
Closed

Randomly getting AEROSPIKE_ERR_TIMEOUT #144

akyoto opened this issue Aug 14, 2016 · 22 comments

Comments

@akyoto
Copy link

akyoto commented Aug 14, 2016

Unhandled rejection AerospikeError: AEROSPIKE_ERR_TIMEOUT
    at Function.AerospikeError.fromASError (/home/eduard/node_modules/aerospike/lib/aerospike_error.js:113:10)
    at Client.DefaultCallbackHandler [as callbackHandler] (/home/eduard/node_modules/aerospike/lib/client.js:71:72)
    at putCb (/home/eduard/node_modules/aerospike/lib/client.js:1247:10)
Unhandled rejection AerospikeError: AEROSPIKE_ERR_TIMEOUT
    at Function.AerospikeError.fromASError (/home/eduard/node_modules/aerospike/lib/aerospike_error.js:113:10)
    at Client.DefaultCallbackHandler [as callbackHandler] (/home/eduard/node_modules/aerospike/lib/client.js:71:72)
    at putCb (/home/eduard/node_modules/aerospike/lib/client.js:1247:10)
Unhandled rejection AerospikeError: AEROSPIKE_ERR_TIMEOUT
    at Function.AerospikeError.fromASError (/home/eduard/node_modules/aerospike/lib/aerospike_error.js:113:10)
    at Client.DefaultCallbackHandler [as callbackHandler] (/home/eduard/node_modules/aerospike/lib/client.js:71:72)
    at /home/eduard/node_modules/aerospike/lib/client.js:836:10

I get about 560 of the first error message at once, sometimes. Here's my config:

Aerospike.client({
    hosts: [{
        addr: '127.0.0.1',
        port: 3000
    }],
    log: {
        level: Aerospike.log.WARN
    },
    policies: {
        timeout: 5000
    },
    maxConnsPerNode: 1024
})
namespace live {
    replication-factor 2
    memory-size 500M
    storage-engine device {
        file /home/eduard/db/db.dat
        filesize 2G
        data-in-memory true
        post-write-queue 1
    }
}

DB and client are on the same machine (server 3.9.0.3, node client 2.3.1).
I can provide more information if needed.
Is there anything I can do to fix these timeouts?

@jhecking
Copy link
Contributor

What's the load level of the application, in terms of requests/second to the Aerospike server?
When you get 560 timeout errors at once, is that more or less all the requests that are currently in progress or just a (small?) percentage?

@akyoto
Copy link
Author

akyoto commented Aug 20, 2016

I don't know which part of my application causes it because the stack traces only show the Aerospike related stack. All I know is that it's a background process (not the web server itself) and it's doing lots of different DB updates. Enough to bring 1 of 2 CPU cores on DigitalOcean to >90% usage. It might be possible that I'm issuing 560 requests of one type at once and they all fail.

I apologize for being a bit in the dark here.
There are many components and I don't know which component causes it yet.

@jhecking
Copy link
Contributor

The stack traces for errors raised in async operation callbacks are unfortunately not very useful. I'd like to add support for better stack traces at least in debug/development mode similar to how e.g. the redis client does it. But you can't do that in production since it comes with a pretty significant performance penalty. What is clear from the stack traces is that the timeouts happen in both put and get operations.

@akyoto
Copy link
Author

akyoto commented Sep 5, 2016

I would love to have improved stack traces, even if they come with a heavy performance penalty.
I'd use it to debug my application on the live server for 1 day, collect all the error reports, then switch it off.
And it would improve bug reporting for the Aerospike team as well.

@GavinDmello
Copy link

Was this problem solved ? I'm getting some weird behaviour where my callback for put function is executed twice. Once with a AEROSPIKE_ERR_TIMEOUT error and once with a success. Not sure what to do here.

@rahulbhanushali
Copy link

Upgraded to 2.4.3 from 1.0.56 and started receiving AEROSPIKE_ERR_TIMEOUT when trying to use the put method. Not sure why would we receive the timeout error randomly.

@bobiblazeski
Copy link

@rahulbhanushali I'm having the same issue, recently upgraded from node 0.10 to node 6.9.1, and tried to upgrade aerospike client from version 1.0.58 to version 2.4.2, had to return back to 1.0.58 due to timeouts.

@jhecking
Copy link
Contributor

@rahulbhanushali, @bobiblazeski, @GavinDmello, are you just doing single key read/write operations when you get the timeouts or other operations like scan/query?

@GavinDmello
Copy link

@jhecking It's mostly during single key writes in my case.

@akyoto
Copy link
Author

akyoto commented Nov 23, 2016

@GavinDmello

Was this problem solved ?

I'm in the process of rewriting everything in Go so I haven't checked it lately, however I suspect it's still existent.

In my case it's maybe just high load that's causing it (multiple scans + gets + sets at the same time).

@GavinDmello
Copy link

@blitzprog I had many set operations happening one after the other, where I have no other operations happening. The double callback(success and response callbacks getting called for the same call) issue is where I have many operations(gets, sets and queries)

@bobiblazeski
Copy link

@jhecking Single key get & put only, nothing else.

@rahulbhanushali
Copy link

@jhecking, facing this issue on single key write operations.

@tejas3190
Copy link

Any update over this?

@jhecking
Copy link
Contributor

jhecking commented Apr 20, 2017

I would love to have improved stack traces, even if they come with a heavy performance penalty.
I'd use it to debug my application on the live server for 1 day, collect all the error reports, then switch it off.
And it would improve bug reporting for the Aerospike team as well.

I know this comes too late for you, @blitzprog, but I have implemented better stack traces for the just released v2.5.2. More details in PR #190.

@dohse
Copy link
Contributor

dohse commented Apr 24, 2017

Is there any update on this issue? I am also getting AEROSPIKE_ERR_TIMEOUT errors after upgrading from 1.0.58 to 2.5.2.

@GavinDmello
Copy link

@dohse Try increasing the operation timeout property which is part of the policy object and check if it helps

@dohse
Copy link
Contributor

dohse commented Apr 24, 2017

I activated long stack traces to find out where these error occurs, but it seems to occur for all aerospike calls in our application.

@GavinDmello I am not setting any timeouts. Did the default timeout change between 1.0.58 and 2.5.2?

@dohse
Copy link
Contributor

dohse commented Apr 24, 2017

I need to correct myself. I am setting a timeout of 1000 ms in the client policies. I increased this to 10000 ms now.

@jhecking
Copy link
Contributor

The default timeout has not changed between v1 and v2. It is 1,000 ms. What has changed is the I/O model and that might impact the frequency of timeouts.

In v1, Aerospike DB commands are executed using synchronous I/O. The commands are queued up and are getting executed on libuv's worker threads. By default, there are only 4 worker threads, but this can be configured through the UV_THREADPOOL_SIZE environment variable. (Please refer to the libuv docs for more info.) The timeout only starts once the DB command gets executed on the worker thread. And it's using synchronous I/O so at that point it is no longer competing with any other I/O tasks handled by libuv/Node.js.

In contrast, v2 uses libuv's async TCP API to execute DB commands. The timeout starts as soon as the DB command is issued. If the Node.js server is under heavy I/O workload (not just from Aerospike DB commands but other I/O as well), then it's possible that you are seeing timeouts even if the server responds within the timeout window.

I'm not saying that's necessarily what's happening, but it is a significant difference between v1 and v2 that might be at least part of the cause for increased number of timeouts some people are observing.

@dohse
Copy link
Contributor

dohse commented Apr 25, 2017

@jhecking Thank you for the detailed reasoning. It could be indeed be the case in our application that operations were waiting very long for a thread from the pool. While they did not timeout in that case they might now timeout when waiting to send or receive the data from aerospike.

I resolved the issue by retrying timed out operations up to five times. This should be safe as the operations are idempotent.

@jhecking
Copy link
Contributor

Closing this issue.

While not a direct fix for this issue, client release v3 has improved timeout & retry handling. The new BasePolicy class has two separate socketTimeout and totalTimeout policy values; while the former is the socket idle timeout value for the network connection to the server, the latter is the total timeout value for the entire database operation. If totalTimeout > socketTimeout the client will automatically retry the operation up to maxRetries times.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants