Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ERROR in SendProcessor with limited number of CPUs #43

Closed
qjing666 opened this issue Mar 22, 2019 · 1 comment
Closed

ERROR in SendProcessor with limited number of CPUs #43

qjing666 opened this issue Mar 22, 2019 · 1 comment
Assignees
Labels
arch Architecture related bug Something isn't working

Comments

@qjing666
Copy link

We rent multiple machines on google cloud for running the FATE project. In the beginning, we used machines with 2 vCPUs. While running the cluster code for hetero_logistic_regression, we get an Error from arbiter when sending public-keys to guest and host(console.log in federation):

[ERROR] 2019-03-18T07:11:15,361 [transferJobSchedulerExecutor-2] [SendProcessor:94] - java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:657)
at java.util.ArrayList.get(ArrayList.java:433)
at com.webank.ai.fate.driver.federation.transfer.service.impl.DefaultProxySelectionService.select(DefaultProxySelectionService.java:81)
at com.webank.ai.fate.driver.federation.transfer.communication.processor.SendProcessor.run(SendProcessor.java:70)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

[ERROR] 2019-03-18T07:11:24,396 [transferJobSchedulerExecutor-2] [GrpcChannelFactory:119] - [COMMON][CHANNEL][ERROR] Error getting ManagedChannel after retries
[ERROR] 2019-03-18T07:11:24,397 [transferJobSchedulerExecutor-2] [TransferJobScheduler:127] - [FEDERATION][SCHEDULER] processor failed: transferMetaId: cxz-HeteroLRTransferVariable.paillier_pubkey-HeteroLRTransferVariable.paillier_pubkey.0-2-arbiter-1-guest, exception: java.lang.RuntimeException: should never get here
at com.webank.ai.fate.core.factory.GrpcStubFactory.createGrpcStub(GrpcStubFactory.java:47)
at com.webank.ai.fate.core.factory.GrpcStubFactory.createGrpcStub(GrpcStubFactory.java:56)
at com.webank.ai.fate.core.api.grpc.client.GrpcAsyncClientContext.createStub(GrpcAsyncClientContext.java:207)
at com.webank.ai.fate.core.api.grpc.client.GrpcStreamingClientTemplate.calleeStreamingRpc(GrpcStreamingClientTemplate.java:106)
at com.webank.ai.fate.core.api.grpc.client.GrpcStreamingClientTemplate.calleeStreamingRpcWithImmediateDelayedResult(GrpcStreamingClientTemplate.java:149)
at com.webank.ai.fate.driver.federation.transfer.api.grpc.client.ProxyClient.unaryCall(ProxyClient.java:98)
at com.webank.ai.fate.driver.federation.transfer.api.grpc.client.ProxyClient.requestSendEnd(ProxyClient.java:121)
at com.webank.ai.fate.driver.federation.transfer.communication.processor.SendProcessor.run(SendProcessor.java:98)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

and we tried the same code using machines with 8 vCPUs, the ERROR cannot be reproduced.

Here are the configurations of the two groups of machines:
The former ones: Google Cloud n1-standard-2 machines with 2 vCPUs and 7.5GB RAM

The latter ones: Google Cloud n1-standard-8 machines with 8 vCPUs and 30GB RAM

@maxwong maxwong added bug Something isn't working arch Architecture related labels Mar 22, 2019
@maxwong maxwong self-assigned this Mar 22, 2019
@maxwong
Copy link

maxwong commented Mar 22, 2019

Thanks for your feedback. We are looking into it.

jat001 pushed a commit that referenced this issue Nov 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch Architecture related bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants