Aerospike scan returns error "command execution timed out on client: See Policy.Timeout" #396

Open
chengc-sa opened this issue Jan 25, 2023 · 4 comments

Comments

@chengc-sa

chengc-sa commented Jan 25, 2023

Which parameter should I configure to make the scan less likely to time out? I don't see a Timeout field in ScanPolicy. I tried increasing ScanPolicy.SocketTimeout to 5 minutes, but the error still occurs.

PS: I set MaxRetries to 10000.

@khaf
Collaborator

khaf commented Jan 25, 2023

Can you elaborate a bit on your use case? How big is the namespace/set you are scanning? Do you have a particularly unstable network connection to the cluster? Can't you resume your scan by passing the same PartitionFilter? (A PartitionFilter is basically a cursor. If you pass it to the Scan command again, the scan resumes where it left off if it had not completed.)
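
The cursor idea described above can be illustrated with a small, standard-library-only sketch. It does not use the Aerospike client: `cursor`, `scan`, and `failAt` are hypothetical stand-ins, and the failure is simulated. It only shows why passing the same cursor back in lets a second attempt skip the partitions that already finished.

```go
package main

import (
	"errors"
	"fmt"
)

// cursor is a hypothetical stand-in for a partition filter: it remembers
// which partitions have already been fully scanned.
type cursor struct {
	done []bool
}

func newCursor(partitions int) *cursor {
	return &cursor{done: make([]bool, partitions)}
}

// complete reports whether every partition has been scanned.
func (c *cursor) complete() bool {
	for _, d := range c.done {
		if !d {
			return false
		}
	}
	return true
}

var errTimeout = errors.New("simulated timeout")

// scan visits every partition not yet marked done. failAt simulates a
// timeout just before the given partition is scanned (-1 means no failure).
// It returns the partitions visited during this call only.
func scan(c *cursor, failAt int) ([]int, error) {
	var visited []int
	for p := range c.done {
		if c.done[p] {
			continue // finished in an earlier attempt, skip it
		}
		if p == failAt {
			return visited, errTimeout
		}
		visited = append(visited, p)
		c.done[p] = true
	}
	return visited, nil
}

func main() {
	c := newCursor(6)

	first, err := scan(c, 3) // fails before partition 3
	fmt.Println(first, err, c.complete())

	second, err := scan(c, -1) // same cursor: resumes at partition 3
	fmt.Println(second, err, c.complete())
}
```

The caller keeps the cursor across attempts, so `c.complete()` is also a direct answer to "has everything been read yet".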

@chengc-sa
Author

chengc-sa commented Jan 26, 2023

@khaf The namespace is about 110 GB. The sets that often time out are about 900 MB / 7 million records, 200 MB / 2 million records, 500 MB / 3.5 million records, and 500 MB / 2 million records, respectively. The network connection should be quite stable, since both clients and servers are hosted on AWS and connected within the same VPC. The errors occur in the middle of the scan and arrive on the <-chan *aerospike.Result, in the Err field of *aerospike.Result. I am also seeing EOF errors from the same channel. Are they related?

@un000
Contributor

un000 commented Feb 14, 2023

@khaf I see something similar after updating from 6.4 to 6.10.

aerospike version: Aerospike Community Edition build 5.6.0.5

error running query: error iterating over records: ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB94D643559A1A8 10.10.2.231:3000: network error. Checked the wrapped error for detail
ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB9F1653559A1A8 10.10.2.232:3000: network error. Checked the wrapped error for detail
ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB98E623559A1A8 10.10.2.233:3000: network error. Checked the wrapped error for detail
ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB98E623559A1A8 10.10.2.233:3000: network error. Checked the wrapped error for detail
ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout

I scan using a Query with a FilterExpression over millions of records. Processing one record takes 200 ms to 3000 ms, but I process records in multiple goroutines.

	// aero is assumed to be the import alias for the Aerospike Go client package.
	cp := aero.NewClientPolicy()
	cp.Timeout = 5 * time.Second
	cp.IdleTimeout = 30 * time.Second
	cp.ConnectionQueueSize = 1024
	cp.MinConnectionsPerNode = 512

	qp := aero.NewQueryPolicy()
	qp.IncludeBinData = true
	qp.RecordQueueSize = 16 * 1024
	// Select only the records whose digest falls into this shard.
	qp.FilterExpression = aero.ExpEq(aero.ExpDigestModulo(shardCount), aero.ExpIntVal(shardID))

	statement := aero.NewStatement(r.namespace, r.set)

	rs, err := c.client.Query(qp, statement)
	if err != nil {
		return fmt.Errorf("error executing Query: %w", err)
	}

	var closeErr error
	closeOnce := sync.Once{}
	errGr, ctx := errgroup.WithContext(ctx)
	for i := 0; i < 768; i++ {
		errGr.Go(func() error {
			defer closeOnce.Do(func() { closeErr = rs.Close() })
			for result := range rs.Results() {
				if result.Err != nil {
					return result.Err
				}

				if ctx.Err() != nil {
					return nil
				}

				if err := processFunc(result); err != nil {
					return fmt.Errorf("process func returned an error: %w", err)
				}
			}

			return nil
		})
	}

	if err := errGr.Wait(); err != nil {
		return fmt.Errorf("error iterating over records: %w", err)
	}

	if closeErr != nil {
		// Wrap closeErr here; err is nil at this point.
		return fmt.Errorf("error closing records chan: %w", closeErr)
	}

It looks like the error is returned from this check:

	if result.Err != nil {
		return result.Err
	}

@un000
Contributor

un000 commented Feb 14, 2023

Also, I see the following change: f0d2818

So what will the behaviour be when we run out of retries? How can we check whether the whole set has been read?

3 participants