Can't run v0.8.1 on both AWS and VMWare - results in RTE_HASH PANIC #649

Closed
guesslin opened this issue Sep 10, 2019 · 17 comments

@guesslin
Contributor

Hi, we just upgraded nff-go to 0.8.1, but it fails on both EC2 and VMware with this panic message:

EAL: RTE_HASH tailq is already registered
PANIC in tailqinitfn_rte_hash_tailq():
Cannot initialize tailq: RTE_HASH
6: [/opt/glasnostic/bin/router(_start+0x2a) [0x43ed5a]]
5: [/lib64/libc.so.6(__libc_start_main+0x85) [0x7f6995e42425]]
4: [/opt/glasnostic/bin/router(__libc_csu_init+0x4d) [0x183890d]]
3: [/opt/glasnostic/bin/router() [0x43e19c]]
2: [/opt/glasnostic/bin/router(__rte_panic+0xba) [0x43130e]]
1: [/opt/glasnostic/bin/router(rte_dump_stack+0x18) [0x148d2b8]]
Aborted

We also tried downgrading to the v0.8.0 tag, and everything works fine with it.

@gshimansky
Contributor

gshimansky commented Sep 10, 2019

Can you clarify which VM image you are using? I tried v0.8.1 on m5a.4xlarge AWS Ubuntu 18.04 with kernel 4.15.0-1032-aws and ENA NICs and everything works as expected.

@gshimansky
Contributor

I also updated kernel and tested on 4.15.0-1048-aws.

Did you update DPDK when you switched between NFF-Go versions?

@guesslin
Contributor Author

guesslin commented Sep 11, 2019

@gshimansky we are trying to run it on m5.xlarge with our customized AMI, kernel 4.4.0-142-generic, and ENA NICs. Could it be a kernel version problem?

@gshimansky
Contributor

It could be the kernel version, although grepping the DPDK sources for this error doesn't turn up any kernel module sources. I am quite sure that the key difference between 0.8.0 and 0.8.1 is the DPDK version, and that something inside DPDK 19.08 stopped working in your environment. NFF-Go 0.8.0 used DPDK 19.04.

@guesslin
Contributor Author

@gshimansky I just updated the kernel to 4.9.184-0409184-generic, the same kernel we build the binary on, but it still fails with the same panic message:

EAL: RTE_HASH tailq is already registered
PANIC in tailqinitfn_rte_hash_tailq():
Cannot initialize tailq: RTE_HASH
6: [/opt/glasnostic/bin/router(_start+0x2a) [0x43ed5a]]
5: [/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x7f) [0x7fd2cf7007bf]]
4: [/opt/glasnostic/bin/router(__libc_csu_init+0x4d) [0x18388ed]]
3: [/opt/glasnostic/bin/router() [0x43e19c]]
2: [/opt/glasnostic/bin/router(__rte_panic+0xba) [0x43130e]]
1: [/opt/glasnostic/bin/router(rte_dump_stack+0x18) [0x148d298]]

@gshimansky
Contributor

Can you also try a different gcc version? Looking at the DPDK sources, I suppose it could be a compiler bug.

@guesslin
Contributor Author

guesslin commented Oct 4, 2019

@gshimansky hi, we tried compiling the binary in the following environment:

compiler: gcc 5.4.0
nff-go: 0.9.1
DPDK: 19.08
go: go1.10.8 linux/amd64
kernel: 4.9

but our binary still fails with the RTE_HASH problem:

Oct 04 03:04:23 ip-10-1-218-18 router[5703]: EAL: RTE_HASH tailq is already registered
Oct 04 03:04:23 ip-10-1-218-18 router[5703]: PANIC in tailqinitfn_rte_hash_tailq():
Oct 04 03:04:23 ip-10-1-218-18 router[5703]: Cannot initialize tailq: RTE_HASH
Oct 04 03:04:23 ip-10-1-218-18 router[5703]: 6: [/opt/glasnostic/bin/router(_start+0x2a) [0x43dc9a]]
Oct 04 03:04:23 ip-10-1-218-18 router[5703]: 5: [/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x7a) [0x7f70d0171afa]]
Oct 04 03:04:23 ip-10-1-218-18 router[5703]: 4: [/opt/glasnostic/bin/router(__libc_csu_init+0x4d) [0x182005d]]
Oct 04 03:04:23 ip-10-1-218-18 router[5703]: 3: [/opt/glasnostic/bin/router() [0x43d002]]
Oct 04 03:04:23 ip-10-1-218-18 router[5703]: 2: [/opt/glasnostic/bin/router(__rte_panic+0xb8) [0x430b7f]]
Oct 04 03:04:23 ip-10-1-218-18 router[5703]: 1: [/opt/glasnostic/bin/router(rte_dump_stack+0x18) [0x1488948]]

@gshimansky
Contributor

Is there a reason to use such an old version of gcc? It was released on June 3, 2016, which is more than 3 years ago.
Other than that, I still think this is some problem with DPDK conflicting with your setup. However, I could google only a few errors like this, with no guide on how to fix them.

@marcusschiesser
Contributor

We had actually also tried GCC (Debian 8.3.0-6) 8.3.0 before, with the same error.

Then we checked and found that DPDK 19.08 uses gcc 5.4.0 in its CI, see:
https://github.com/DPDK/dpdk/blob/v19.08/.travis.yml
https://docs.travis-ci.com/user/reference/xenial/#compilers-and-build-toolchain

So I guess it's not a GCC related problem.

@gshimansky
Contributor

Do you experience this problem only on your customized AMI? I need to find some configuration where I could reproduce this problem.

@gshimansky
Contributor

I tried Amazon Linux 2 AMI (HVM), SSD Volume Type - ami-00c03f7f7f2ec15c3 (64-bit x86) with kernel 4.14.146-119.123.amzn2.x86_64 and gcc Red Hat 7.3.1-6, but DPDK initializes correctly.

@guesslin
Contributor Author

Yes, we have this problem on our customized AMI, which is based on ami-1ee65166. The kernel is:
Linux ip-10-1-218-18 4.4.0-142-generic #168-Ubuntu SMP Wed Jan 16 21:00:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
config-4.4.0-142-generic

@aregm
Owner

aregm commented Oct 14, 2019

If I am not mistaken, some required functionality is missing from kernel 4.4 in principle. Have you tried customizing on top of 4.14.146-119.123.amzn2.x86_64?

@gshimansky
Contributor

I believe that the problem happens in userspace DPDK code. I cannot say whether it is related to the kernel version, but in my understanding it could be related only through the kernel headers, not through kernel code directly.
I still need some way to reproduce it in order to debug it.

@guesslin
Contributor Author

@gshimansky While updating our application to nff-go v0.9.2, we found that this problem is caused by the following patch in our code:

index f13d151b7..944269b1e 100644
--- a/gateway/router/driver/nff/runner_linux.go
+++ b/gateway/router/driver/nff/runner_linux.go
@@ -2,6 +2,10 @@

 package nff

+/*
+#include <rte_ethdev.h>
+*/
+import "C"
 import (
        "fmt"
        "net"
@@ -16,7 +20,6 @@ import (

        "github.com/intel-go/nff-go/devices"
        "github.com/intel-go/nff-go/flow"
-       "github.com/intel-go/nff-go/low"
        libpacket "github.com/intel-go/nff-go/packet"
 )

 func getEthPort(hwaddr net.HardwareAddr) portType {
-       for p := 0; p < low.GetPortsNumber(); p++ {
-               portMACAddress := low.GetPortMACAddress(portType(p))
+       for p := 0; p < int(C.rte_eth_dev_count()); p++ {
+               portMACAddress := flow.GetPortMACAddress(portType(p))

Without this patch it's working, so the problem is caused by including rte_ethdev.h.

As you can see, we want to get the number of device ports by calling rte_eth_dev_count(). In version 0.8.0 we didn't have to do this, because the value was exported in the low package which was moved to internal/low, so we can't access it anymore.

How about creating a flow.GetPortsNumber() function, so we can get the correct number of device ports again?

guesslin added a commit to glasnostic/nff-go that referenced this issue Dec 27, 2019
	For user who want to get the number of device ports on system,
	better to get it from rte_eth_dev_count function in rte_ethdev.h
	But while a Go application include this rte_ethdev.h header and
	nff-go library will cause a RTR_HASH error.
	See aregm#649 for detail
	information about the error message.
@guesslin
Contributor Author

@gshimansky I created #680 for this, please have a look :)

@gshimansky
Contributor

It is great that you found the cause of this bug! I merged your PR, hope it will allow you to use the latest version of the framework.
