Running processes are suspended for an unknown reason #69

Closed
iuserea opened this issue Nov 25, 2020 · 4 comments

Comments

@iuserea

iuserea commented Nov 25, 2020

When I run the FedGKT algorithm with the following command:
sh run_FedGKT.sh 8 cifar10 homo 10 20 1 Adam 0.001 1 0 resnet56 fedml_resnet56_homo_cifar10 "./../../../data/cifar10" 64

The processes are often suspended for some reason. I have only managed to get a result successfully once.

One cause I figured out is a connection error between the process and wandb.
After fixing the connection problem, the processes still get suspended, so there must be other causes.
How can I track them down?
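A minimal sketch of one way to take the wandb connection out of the picture while debugging, assuming each process calls wandb.init() after this code runs: setting the WANDB_MODE environment variable to offline makes wandb save runs to local files instead of contacting the server. The helper name configure_wandb_mode is hypothetical and not part of FedML.

```python
import os


def configure_wandb_mode(offline: bool) -> None:
    """Hypothetical helper: choose wandb's logging mode through the WANDB_MODE
    environment variable. Must be called before wandb.init() in each process."""
    if offline:
        # Runs are saved to local files; no network connection is needed.
        os.environ["WANDB_MODE"] = "offline"
    else:
        # Remove the override so wandb falls back to its default (online) mode.
        os.environ.pop("WANDB_MODE", None)
```

Offline runs can be uploaded afterwards with `wandb sync <run directory>`. If the processes still get suspended with wandb offline, the network is probably not the cause.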

@chaoyanghe
Member

You can press Ctrl+C to see what the error is.
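If pressing Ctrl+C is awkward, for example because the processes run in the background, Python's standard faulthandler module can print where every thread is blocked. This is a minimal sketch, not part of FedML: the helper names enable_hang_diagnostics and disable_hang_diagnostics are hypothetical, and the SIGUSR1 handler is only available on POSIX systems.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(timeout_s: float, log_file=sys.stderr) -> None:
    """Dump every thread's Python stack if the process is still running after
    timeout_s seconds, and also on demand via SIGUSR1 (POSIX only).
    log_file must be a real file with a file descriptor."""
    # One-shot watchdog: a background thread writes the tracebacks after timeout_s.
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=log_file)
    if hasattr(signal, "SIGUSR1"):
        # `kill -USR1 <pid>` then prints the stacks without stopping the process.
        faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)


def disable_hang_diagnostics() -> None:
    """Cancel the pending watchdog and the SIGUSR1 handler."""
    faulthandler.cancel_dump_traceback_later()
    if hasattr(signal, "SIGUSR1"):
        faulthandler.unregister(signal.SIGUSR1)
```

Calling enable_hang_diagnostics() near the start of the training script, with a timeout longer than a normal run, leaves a stack dump in the log if a process stalls, which shows the blocking call without having to interrupt the job.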

iuserea closed this as completed on Nov 26, 2020
@chaoyanghe
Member

Have you figured out the problem?

@hosytuyen

hosytuyen commented Dec 16, 2021

Hi, @chaoyanghe @iuserea
I also ran into this problem after the first epochs of training. Did you solve it?

Thank you


@hosytuyen

Thank you, I have fixed the problem.

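The bug is in fedml_api/distributed/fedgkt/GKTServerTrainer.py, at line 117. The line

epochs_server = self.args.self.args.epochs_server

should be

epochs_server = self.args.epochs_server

self.args has no attribute named self, so the original line raises an AttributeError when it is reached.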
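A sketch of why a typo like this can look like a suspended process rather than a crash, under the assumption (not confirmed in this thread) that the processes exchange messages over MPI: the server process dies with an AttributeError while the client processes keep waiting for a reply that never arrives. ToyServerTrainer is a hypothetical stand-in for GKTServerTrainer, and run_rank_fail_fast is a hypothetical wrapper that reports the exception and aborts the whole job; in an mpi4py program the abort callable could be MPI.COMM_WORLD.Abort.

```python
import argparse
import sys
import traceback
from typing import Callable


class ToyServerTrainer:
    """Hypothetical stand-in for GKTServerTrainer; only self.args matters here."""

    def __init__(self, args: argparse.Namespace):
        self.args = args

    def epochs_buggy(self) -> int:
        # The typo from the thread: the Namespace has no attribute "self",
        # so this raises AttributeError.
        return self.args.self.args.epochs_server

    def epochs_fixed(self) -> int:
        return self.args.epochs_server


def run_rank_fail_fast(step: Callable[[], object], abort: Callable[[int], None]) -> None:
    """Run one process's work; on an uncaught exception, print the traceback and
    call abort(1) so peer processes are not left waiting forever."""
    try:
        step()
    except Exception:
        traceback.print_exc(file=sys.stderr)
        abort(1)  # e.g. MPI.COMM_WORLD.Abort with mpi4py (assumption)


if __name__ == "__main__":
    trainer = ToyServerTrainer(argparse.Namespace(epochs_server=20))
    print(trainer.epochs_fixed())  # 20
    # Stand-in abort that only reports the exit code instead of killing the job.
    run_rank_fail_fast(trainer.epochs_buggy, lambda code: print("abort", code))
```

Failing fast this way turns a silent hang into a visible traceback, which is the information the Ctrl+C suggestion above is trying to surface.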
