ERROR in SendProcessor with limited number of CPUs #43

qjing666 · 2019-03-22T03:54:10Z

We rent multiple machines on Google Cloud to run the FATE project. Initially we used machines with 2 vCPUs. While running the cluster code for hetero_logistic_regression, we got an error from the arbiter when it sent public keys to the guest and host (from console.log in federation):

[ERROR] 2019-03-18T07:11:15,361 [transferJobSchedulerExecutor-2] [SendProcessor:94] - java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:657)
at java.util.ArrayList.get(ArrayList.java:433)
at com.webank.ai.fate.driver.federation.transfer.service.impl.DefaultProxySelectionService.select(DefaultProxySelectionService.java:81)
at com.webank.ai.fate.driver.federation.transfer.communication.processor.SendProcessor.run(SendProcessor.java:70)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

[ERROR] 2019-03-18T07:11:24,396 [transferJobSchedulerExecutor-2] [GrpcChannelFactory:119] - [COMMON][CHANNEL][ERROR] Error getting ManagedChannel after retries
[ERROR] 2019-03-18T07:11:24,397 [transferJobSchedulerExecutor-2] [TransferJobScheduler:127] - [FEDERATION][SCHEDULER] processor failed: transferMetaId: cxz-HeteroLRTransferVariable.paillier_pubkey-HeteroLRTransferVariable.paillier_pubkey.0-2-arbiter-1-guest, exception: java.lang.RuntimeException: should never get here
at com.webank.ai.fate.core.factory.GrpcStubFactory.createGrpcStub(GrpcStubFactory.java:47)
at com.webank.ai.fate.core.factory.GrpcStubFactory.createGrpcStub(GrpcStubFactory.java:56)
at com.webank.ai.fate.core.api.grpc.client.GrpcAsyncClientContext.createStub(GrpcAsyncClientContext.java:207)
at com.webank.ai.fate.core.api.grpc.client.GrpcStreamingClientTemplate.calleeStreamingRpc(GrpcStreamingClientTemplate.java:106)
at com.webank.ai.fate.core.api.grpc.client.GrpcStreamingClientTemplate.calleeStreamingRpcWithImmediateDelayedResult(GrpcStreamingClientTemplate.java:149)
at com.webank.ai.fate.driver.federation.transfer.api.grpc.client.ProxyClient.unaryCall(ProxyClient.java:98)
at com.webank.ai.fate.driver.federation.transfer.api.grpc.client.ProxyClient.requestSendEnd(ProxyClient.java:121)
at com.webank.ai.fate.driver.federation.transfer.communication.processor.SendProcessor.run(SendProcessor.java:98)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

We then ran the same code on machines with 8 vCPUs, and the error could not be reproduced.

Here are the configurations of the two groups of machines:
First group (error occurs): Google Cloud n1-standard-2 machines with 2 vCPUs and 7.5 GB RAM

Second group (no error): Google Cloud n1-standard-8 machines with 8 vCPUs and 30 GB RAM
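
For reference, the message in the first trace (Index: 0, Size: 0, raised from ArrayList.get) is what Java 8 reports when get(0) is called on an empty ArrayList. The sketch below only illustrates that failure mode and one possible guard. The class and method names are hypothetical and are not FATE's actual DefaultProxySelectionService code, and the sketch does not explain why the list would be empty on the 2-vCPU machines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only; not taken from the FATE code base.
public class EmptySelectionSketch {

    // Takes the first candidate without checking whether any exist.
    // Throws IndexOutOfBoundsException when the list is empty.
    static String selectUnchecked(List<String> candidates) {
        return candidates.get(0);
    }

    // Defensive variant: returns an empty Optional instead of throwing
    // when there is nothing to select.
    static Optional<String> selectChecked(List<String> candidates) {
        if (candidates == null || candidates.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(candidates.get(0));
    }

    public static void main(String[] args) {
        List<String> proxies = new ArrayList<>(); // assumed: no candidates registered yet

        try {
            selectUnchecked(proxies);
        } catch (IndexOutOfBoundsException e) {
            // The exact message text depends on the Java version.
            System.out.println("unchecked select failed: " + e.getMessage());
        }

        System.out.println("checked select found a candidate: "
                + selectChecked(proxies).isPresent());
    }
}
```

With a guard like selectChecked, the caller could log a clear "no proxy available" message or retry, rather than surfacing a bare IndexOutOfBoundsException.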

maxwong · 2019-03-22T10:01:40Z

Thanks for your feedback. We are looking into it.

maxwong added the bug (Something isn't working) and arch (Architecture related) labels Mar 22, 2019

maxwong self-assigned this Mar 22, 2019

dylan-fan closed this as completed Nov 5, 2019

jat001 pushed a commit that referenced this issue Nov 12, 2021

Merge pull request #43 from FederatedAI/feature-1.7.0-provider

fb2edb2

Feature 1.7.0 provider
