
test failed in server_failure_test&multi_process_test #21

Closed
CCrainys opened this issue Feb 25, 2019 · 5 comments

@CCrainys

Hi Anuj,
My cluster node has four Mellanox ConnectX-4 NICs: ib0 and ib1 are InfiniBand NICs, while p6p1 and p6p2 are Ethernet NICs.

  mlx5_0 port 1 ==> ib0 (Up)
  mlx5_1 port 1 ==> ib1 (Down)
  mlx5_2 port 1 ==> p6p1 (Up)
  mlx5_3 port 1 ==> p6p2 (Up)

The OFED version is:

  MLNX_OFED_LINUX-4.4-2.0.7.0

The operating system is:

 CentOS Linux release 7.5.1804

I have two questions:

  1. When I run ctest and hello-world, the process output shows that eRPC automatically chooses device mlx5_3/p6p2.
     What should I do to select another NIC? (Changing the IP doesn't work.)

  2. I compile with the command "cmake . -DPERF=OFF -DTRANSPORT=raw", then run ctest.
     However, server_failure_test and multi_process_test fail; the error output is:

Total Test time (real) =  56.93 sec

The following tests FAILED:
  8 - server_failure_test (OTHER_FAULT)
  9 - multi_process_test (OTHER_FAULT)
 Errors while running CTest

When I run build/server_failure_test and build/multi_process_test directly, the error output is:

server_failure_test:

server_failure_test: /root/eRPC/tests/client_tests/server_failure_test.cc:93:
void generic_test_func(erpc::Nexus*, size_t): Assertion `c.num_rpc_resps == config_num_rpcs' failed.
Aborted

multi_process_test:

6:070851 WARNG: Installing flow rule for Rpc 0. NUMA node = 0. Flow RX UDP port = 36454.
6:071238 WARNG: RawTransport created for Rpc ID 0. Device mlx5_3/p6p2, port 1. IPv4 
63.63.63.86, MAC ec:d:9a:c5:ba:bd. Datapath UDP port 36454.
......
multi_process_test: /root/eRPC/tests/client_tests/multi_process_test.cc:60: void 
process_proxy_thread_func(size_t, size_t): Assertion `c.num_rpc_resps == num_processes - 1' failed.
Aborted

Looking forward to your reply, and thanks in advance.

Best regards
Thomas

@anujkaliaiitd
Collaborator

The NIC port can be changed using the last argument to the Rpc constructor. See docs.
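For reference, a minimal sketch of what passing the physical port index as the last argument to the Rpc constructor might look like, following the hello_world-style setup from the docs (the empty `sm_handler` stub and the `kPhyPort` value are hypothetical example choices, not taken from this thread):

```cpp
#include "rpc.h"  // eRPC

// Session-management handler stub, required by the Rpc constructor.
void sm_handler(int, erpc::SmEventType, erpc::SmErrType, void *) {}

void create_rpc_on_alternate_nic(erpc::Nexus *nexus) {
  constexpr uint8_t kRpcId = 0;
  // The last constructor argument is the physical port index. eRPC
  // numbers its usable ports 0, 1, 2, ... in device order, so a value
  // other than the default 0 selects a different NIC.
  constexpr uint8_t kPhyPort = 2;  // hypothetical: e.g. mlx5_2/p6p1
  erpc::Rpc<erpc::CTransport> rpc(nexus, nullptr, kRpcId, sm_handler, kPhyPort);
  rpc.run_event_loop(100);  // run briefly to verify the chosen device
}
```

The startup log (`RawTransport created ... Device mlx5_X/...`) shows which device was actually selected.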

The server failure handling feature isn't ready yet. I've disabled its test for now.

Can you try multi_process_test with a clean initial state (i.e., without server_failure_test having first failed)?

@CCrainys
Author

CCrainys commented Feb 25, 2019

Hi Anuj,
Thanks for your quick reply.
First, I pulled your latest commit (which disables the server failure test) and ran ctest again. I found that multi_process_test sometimes passes and sometimes fails:

  100% tests passed, 0 tests failed out of 16
  Total Test time (real) =  44.86 sec

or

  Total Test time (real) =  66.34 sec
  The following tests FAILED:
    8 - multi_process_test (OTHER_FAULT)
  Errors while running CTest

When multi_process_test fails, I run build/multi_process_test alone; the error output is:

    Process 4: All sessions connected
    Process 13: All sessions connected
    Process 22: All sessions connected
    multi_process_test: /root/eRPC/tests/client_tests/multi_process_test.cc:60: void
    process_proxy_thread_func(size_t, size_t): Assertion `c.num_rpc_resps == num_processes - 1' failed.
    Aborted

@CCrainys
Author

Hi Anuj,

I read the code for multi_process_test. I think the problem is caused by the constants kMaxNumERpcProcesses and kTestMaxEventLoopMs: after increasing kTestMaxEventLoopMs or decreasing kMaxNumERpcProcesses, multi_process_test passes consistently.

My machine has 28 cores, 16 of which were in use by another task. I suspect this is a multi-process scheduling problem that appears when the number of free CPU cores on the system is smaller than kMaxNumERpcProcesses.

This is only my guess; do you agree? Looking forward to your reply.

Best regards
Thomas

@anujkaliaiitd
Collaborator

Sounds right. The hardcoded limit on allowed test time (kTestMaxEventLoopMs) isn't good; I'll move to more flexible timing in the future.

@CCrainys
Author

CCrainys commented Feb 26, 2019 via email
