grpc server gets stuck under stress test #112

Closed · zcc35357949 opened this issue Jan 17, 2022 · 29 comments

@zcc35357949

Issue description

CPU: Intel 5318Y
Kernel: 5.12.19
OS: Debian 10
EGo: v0.4.0 (ecc1a70)

A gRPC server started with ego run becomes stuck and cannot handle any requests, even after a light stress test.
strace result: [screenshot]
htop result: [screenshot]

I tried using numactl to bind the server process to a few CPUs, and the problem could no longer be reproduced.
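For reference, the binding was of roughly this form (an illustrative command, not the exact one used; the core range and binary name are placeholders):

numactl --physcpubind=0-23 ego run ./server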

To reproduce

sample code:

...

func PKCS7Padding(ciphertext []byte, blockSize int) []byte {
    padding := blockSize - len(ciphertext)%blockSize
    padtext := bytes.Repeat([]byte{byte(padding)}, padding)
    return append(ciphertext, padtext...)
}

func PKCS7UnPadding(origData []byte) []byte {
    length := len(origData)
    unpadding := int(origData[length-1])
    return origData[:(length - unpadding)]
}

func AesEncrypt(plaintext []byte, key, iv []byte) ([]byte, error) {
    block, err := aes.NewCipher(key)
    if err != nil {
        return nil, err
    }
    blockSize := block.BlockSize()
    plaintext = PKCS7Padding(plaintext, blockSize)
    blockMode := cipher.NewCBCEncrypter(block, iv)
    crypted := make([]byte, len(plaintext))
    blockMode.CryptBlocks(crypted, plaintext)
    return crypted, nil
}

func AesDecrypt(ciphertext []byte, key, iv []byte) ([]byte, error) {
    block, err := aes.NewCipher(key)
    if err != nil {
        return nil, err
    }
    blockSize := block.BlockSize()
    blockMode := cipher.NewCBCDecrypter(block, iv[:blockSize])
    origData := make([]byte, len(ciphertext))
    blockMode.CryptBlocks(origData, ciphertext)
    origData = PKCS7UnPadding(origData)
    return origData, nil
}

func testAes() {
    key, _ := hex.DecodeString("6368616e676520746869732070617373")
    plaintext := []byte("hello ming")

    c := make([]byte, aes.BlockSize+len(plaintext))
    iv := c[:aes.BlockSize]

    ciphertext, err := AesEncrypt(plaintext, key, iv)
    if err != nil {
        panic(err)
    }
    fmt.Println(base64.StdEncoding.EncodeToString(ciphertext))

    plaintext, err = AesDecrypt(ciphertext, key, iv)
    if err != nil {
        panic(err)
    }
    fmt.Println(string(plaintext))
}

const (
	port = ":50051"
)

// server is used to implement helloworld.GreeterServer.
type server struct {
	pb.UnimplementedGreeterServer
}

// SayHello implements helloworld.GreeterServer
func (s *server) SayHello(ctx context.Context, in *pb.HelloRequest) (*pb.HelloReply, error) {
	log.Printf("Received: %v", in.GetName())
	testAes()
	return &pb.HelloReply{Message: "Hello " + in.GetName()}, nil
}

func main() {
	lis, err := net.Listen("tcp", port)
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}
	s := grpc.NewServer()
	pb.RegisterGreeterServer(s, &server{})
	log.Printf("server listening at %v", lis.Addr())
	if err := s.Serve(lis); err != nil {
		log.Fatalf("failed to serve: %v", err)
	}
}

Steps to reproduce the behavior:

  1. Run a simple gRPC server through ego run and perform AES encryption in its handler.
  2. Use ghz to load-test the server:
    ghz -c 5000 -n 50000 --insecure --proto helloworld.proto --import-paths=/mnt/storage09/ego/samples/test/platform/vendor/googleapis,/mnt/storage09/ego/samples/test/platform/vendor/ --call helloworld.Greeter/SayHello -d '{"name":"xxx"}' 127.0.0.1:50051
  3. Repeat step 2 several times; the server gets stuck and the client times out.

Expected behavior

Additional info / screenshot

@thomasten (Member)

Thanks for reporting. I've not yet been able to reproduce this.

Just to make sure I got everything right:

  • Your system has two 5318Y CPUs?
  • If you bind the server pid to only one of the two CPUs, it works correctly?
  • So I probably need a system with two CPUs to reproduce this?
  • Or is it about the number of cores the pid is bound to? If so, how many cores do I need to reproduce this?

@zcc35357949 (Author)

Thanks for reporting. I've not yet been able to reproduce this.

Just to make sure I got everything right:

  • Your system has two 5318Y CPUs?
  • If you bind the server pid to only one of the two CPUs, it works correctly?
  • So I probably need a system with two CPUs to reproduce this?
  • Or is it about the number of cores the pid is bound to? If so, how many cores do I need to reproduce this?

1. Yes, the 5318Y CPU model: 2 physical CPUs, 48 cores, 96 logical processors.
2. It works correctly when I bind the process to any core.

Additionally, I found that the problem cannot be reproduced after I comment out all stdout printing in this test example, even without core binding. I'm confused by this behavior.

@thomasten (Member)

Can you try to replace "env": null, with "env": [{ "name": "GOMAXPROCS", "value": "26" }], in enclave.json?
If this runs stably, increase the number; if it gets stuck, decrease it.
After each change you must run ego sign again.
Please tell me the largest value that remains stable. (I suspect it might be around 26, so I'd start with that.)
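For illustration (not taken from this thread), the relevant part of enclave.json would then look roughly like this; the exe, key, and heapSize values are placeholders:

{
    "exe": "server",
    "key": "private.pem",
    "heapSize": 512,
    "env": [
        { "name": "GOMAXPROCS", "value": "26" }
    ]
}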

@zcc35357949 (Author)

Can you try to replace "env": null, with "env": [{ "name": "GOMAXPROCS", "value": "26" }], in enclave.json? If this runs stable, increase the number. If it gets stuck, decrease it. After each change you must run ego sign again. Please tell me the biggest number that will be stable. (I suspect it might be around 26, thus, I'd start with that.)

Indeed, with GOMAXPROCS set to 26 or 27 I cannot reproduce it.
However, it occurs again when increasing it to 28.
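(Side note, not from the original report: a minimal, hypothetical sketch of how the effective value could be confirmed from inside the enclave by logging it at startup; logGoMaxProcs is an assumed helper name.)

package main

import (
	"log"
	"runtime"
)

// logGoMaxProcs logs the scheduler limit the Go runtime actually sees, so the
// GOMAXPROCS entry from enclave.json can be verified from inside the enclave.
// runtime.GOMAXPROCS(0) only queries the current value without changing it.
func logGoMaxProcs() {
	log.Printf("GOMAXPROCS=%d NumCPU=%d", runtime.GOMAXPROCS(0), runtime.NumCPU())
}

func main() {
	logGoMaxProcs()
}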

@thomasten (Member)

Thanks. Meanwhile I've been able to reproduce the bug. A first fix will be included in the next EGo release.

@Glenrun commented Jan 20, 2022

@thomasten I also encountered the same issue with the same test case as @zcc35357949, also without using numactl to do the binding. In case we are hitting the same problem, I have uploaded the stack trace files generated by pstack; hope they help.

@Glenrun commented Jan 20, 2022

@thomasten Two stack trace files, as their names indicate: one for the ego-host process and one for the ego run process.
egohost_stack.txt
egorun_stack.txt

@Glenrun commented Jan 20, 2022

@thomasten gdb stacks for ego run: [screenshot]
gdb stacks for ego host: [screenshot]

@thomasten (Member)

Thanks @Glenrun. How many cores does your machine have?
Does #112 (comment) mitigate the problem on your machine, too?

@Glenrun commented Jan 20, 2022

@thomasten I have 56 physical cores, 112 logical cores:
lscpu

...
NUMA node0 CPU(s):               0-27,56-83
NUMA node1 CPU(s):               28-55,84-111
...

It is not easy to reproduce in my environment with "env": null; however, with GOMAXPROCS set to 30 or 28 it is easy to reproduce. After I set it to 26 it rarely hangs; it is still working correctly after running for 18 hours now, so it seems stable.
BTW: I'm using EGo v0.4. I filed an issue two months ago, "ego built program does not scale well on multi-cores server". Does this hang have something to do with that fix?

@Glenrun commented Jan 21, 2022

@thomasten what is the next release date?

@thomasten (Member)

@Glenrun

After I set it to 26, it is not easy to hang, it is still working correctly after running 18 hours now, seems it is stable

Thanks for doing this, it's helpful.

I filed an issue two months ago, ego built program does not scale well on multi-cores server , does this hang have something to do with the fix?

Kind of. Previously, GOMAXPROCS was limited to 2, so the problem couldn't occur.

what is the next release date?

Maybe next week

@Glenrun commented Jan 21, 2022

@thomasten Nice, thanks for the reply.

@Glenrun commented Feb 7, 2022

@thomasten We are waiting for the new release; is there any update?

@thomasten (Member)

EGo v0.4.1 is out including the fix. Sorry for the delay.

@zcc35357949 (Author) commented Feb 11, 2022

EGo v0.4.1 is out including the fix. Sorry for the delay.

EGo v0.4.1 (3a78efd)
The problem still exists in v0.4.1.

ego run process: [screenshot]

ego-host process: [screenshot]

@thomasten (Member)

@zcc35357949 sorry to hear that. Can you archive the directory of your sample project and attach it to this issue?

@zcc35357949 (Author)

OK, this is my sample code: helloworld.zip

@zcc35357949 (Author)

@zcc35357949 sorry to hear that. Can you archive the directory of your sample project and attach it to this issue?

ghz -c 4000 -n 500000 --insecure --proto helloworld.proto --import-paths=/ego/samples/test/platform/vendor/googleapis,/ego/samples/test/platform/vendor/ --call helloworld.Greeter/SayHello -d '{"name":"abc"}' 127.0.0.1:50051

I ran the above benchmark command on the same test machine. The problem occurs.

@thomasten (Member)

Thanks. Does it work if you remove GOMAXPROCS from enclave.json?

@zcc35357949 (Author)

Thanks. Does it work if you remove GOMAXPROCS from enclave.json?

After removing GOMAXPROCS, the problem occurs less often, but it still exists.

@thomasten (Member)

ok, we'll investigate

@Glenrun commented Feb 17, 2022

@thomasten I also reproduced this issue on v0.4.1; attached are the pstack traces from when it occurred.
egohost_v041_pstack.txt
egorun_v041_pstack.txt

@Glenrun commented Feb 22, 2022

ok, we'll investigate

Hi @thomasten, is there any progress?

@thomasten (Member)

Yes, I think I found the root cause and have a fix that seems to work regardless of the GOMAXPROCS value. But this needs some more testing.

@Glenrun commented Feb 24, 2022

Yes, I think I found the root cause and have a fix that seems to work regardless of the GOMAXPROCS value. But this needs some more testing.

@thomasten Thanks for the update; waiting for the new release.

@thomasten (Member)

v0.4.2 contains the fix. Please try it out.

@Glenrun commented Mar 9, 2022

v0.4.2 contains the fix. Please try it out.

@thomasten It seems the issue is fixed now, thanks!

@thomasten (Member)

We consider this to be fixed. Feel free to reopen if there are still problems.
