Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression: crash on startup #74

Open
bluecmd opened this issue Sep 27, 2018 · 13 comments
Open

Regression: crash on startup #74

bluecmd opened this issue Sep 27, 2018 · 13 comments

Comments

@bluecmd
Copy link

bluecmd commented Sep 27, 2018

Hi,

We just updated from e440fcd0e1abfbcea5a08b18494d54bb78c561d6 to current master. With the same configuration we now get:

$ sudo  /usr/local/bin/wedge_agent -mode=wedge -mgmt_if=ma1 -config=/etc/wedge.json -alsologtostderr -v=10000
[ ... ]
xLatency_: 0us notificationQueueSize: 0 nothingHandledYet(): 0
I0927 20:07:37.147341  1304 EventBase.cpp:396] EventBase 0x7ffcc88d4570 loop time: 1001
I0927 20:07:38.148428  1304 EventBase.cpp:481] EventBase 0x7ffcc88d4570 virtual void folly::EventBase::bumpHandlingTime() (loop) latest 18446744073709551589 next 18446744073709551590
I0927 20:07:38.148479  1304 EventBase.cpp:690] latest 18446744073709551589 next 18446744073709551590
I0927 20:07:38.148499  1304 EventBase.cpp:488] EventBase 0x7ffcc88d4570 virtual void folly::EventBase::bumpHandlingTime() (loop) startWork_ 37729045998319
I0927 20:07:38.148531  1304 EventBase.cpp:481] EventBase 0x7ffcc88d4570 virtual void folly::EventBase::bumpHandlingTime() (loop) latest 18446744073709551590 next 18446744073709551590
I0927 20:07:38.148555  1304 EventBase.cpp:690] latest 18446744073709551590 next 18446744073709551590
I0927 20:07:38.148582  1304 EventBase.cpp:690] latest 18446744073709551590 next 18446744073709551590
I0927 20:07:38.148607  1304 EventBase.cpp:356] EventBase 0x7ffcc88d4570 did not timeout  loop time guess: 1001350 idle time: 1001272 busy time: 78 avgLoopTime: 80.0261 maxLatencyLoopTime: 80.0261 maxLatency_: 0us notificationQueueSize: 0 nothingHandledYet(): 0
I0927 20:07:38.148648  1304 EventBase.cpp:396] EventBase 0x7ffcc88d4570 loop time: 1001
Common SDK init completed
I0927 20:07:38.959753  1325 BcmSwitch.cpp:588]  Performing cold boot
I0927 20:07:38.959548  1325 BcmSwitch.cpp:559] Initializing BcmSwitch for unit 0
I0927 20:07:39.149735  1304 EventBase.cpp:481] EventBase 0x7ffcc88d4570 virtual void folly::EventBase::bumpHandlingTime() (loop) latest 18446744073709551590 next 18446744073709551591
I0927 20:07:39.149780  1304 EventBase.cpp:690] latest 18446744073709551590 next 18446744073709551591
I0927 20:07:39.149821  1304 EventBase.cpp:488] EventBase 0x7ffcc88d4570 virtual void folly::EventBase::bumpHandlingTime() (loop) startWork_ 37730047318753
I0927 20:07:39.149868  1304 EventBase.cpp:481] EventBase 0x7ffcc88d4570 virtual void folly::EventBase::bumpHandlingTime() (loop) latest 18446744073709551591 next 18446744073709551591
I0927 20:07:39.149914  1304 EventBase.cpp:690] latest 18446744073709551591 next 18446744073709551591
I0927 20:07:39.149950  1304 EventBase.cpp:690] latest 18446744073709551591 next 18446744073709551591
I0927 20:07:39.149984  1304 EventBase.cpp:356] EventBase 0x7ffcc88d4570 did not timeout  loop time guess: 1001366 idle time: 1001241 busy time: 125 avgLoopTime: 102.76 maxLatencyLoopTime: 102.76 maxLatency_: 0us notificationQueueSize: 0 nothingHandledYet(): 0
I0927 20:07:39.150044  1304 EventBase.cpp:396] EventBase 0x7ffcc88d4570 loop time: 1001
*** Aborted at 1538078859 (unix time) try "date -d @1538078859" if you are using GNU date ***
PC: @     0x563ea0887664 std::unique_ptr<>::get()
*** SIGSEGV (@0x10) received by PID 1304 (TID 0x7fe0448f3700) from PID 16; stack trace: ***
    @     0x7fe05c0ae0c0 (unknown)
    @     0x563ea0887664 std::unique_ptr<>::get()
    @     0x563ea0887682 std::unique_ptr<>::operator->()
    @     0x563ea0882a38 facebook::fboss::BcmControlPlane::getMulticastQueueSettings()
    @     0x563ea0870a9a facebook::fboss::BcmSwitch::getColdBootSwitchState()
    @     0x563ea087305c facebook::fboss::BcmSwitch::init()
    @     0x563ea09f9fde facebook::fboss::SwSwitch::init()
    @     0x563ea092e57d facebook::fboss::Initializer::initImpl()
    @     0x563ea092de9c facebook::fboss::Initializer::initThread()
    @     0x563ea09366ac _ZSt13__invoke_implIvRKMN8facebook5fboss11InitializerEFvvEPS2_JEET_St21__invoke_memfun_derefOT0_OT1_DpOT2_
    @     0x563ea0936639 _ZSt8__invokeIRKMN8facebook5fboss11InitializerEFvvEJPS2_EENSt9result_ofIFOT_DpOT0_EE4typeESA_SD_
    @     0x563ea09365ea _ZNKSt12_Mem_fn_baseIMN8facebook5fboss11InitializerEFvvELb1EEclIJPS2_EEEDTcl8__invokedtdefpT6_M_pmfspcl7forwardIT_Efp_EEEDpOS8_
    @     0x563ea0936591 _ZNSt12_Bind_simpleIFSt7_Mem_fnIMN8facebook5fboss11InitializerEFvvEEPS3_EE9_M_invokeIJLm0EEEEvSt12_Index_tupleIJXspT_EEE
    @     0x563ea09363cd std::_Bind_simple<>::operator()()
    @     0x563ea0936194 std::thread::_State_impl<>::_M_run()
    @     0x7fe055195e6f (unknown)
    @     0x7fe05c0a4494 start_thread
    @     0x7fe05490aacf clone
    @                0x0 (unknown)
Segmentation fault
dhtech@wedge1:~$ sudo  /usr/local

To me it looks like queueManager_ is not initialized by the time that de9ad1f introduced the call to getMulticastQueueSettings.

We will try to debug this, will post any findings.

@bluecmd
Copy link
Author

bluecmd commented Sep 27, 2018

@bluecmd
Copy link
Author

bluecmd commented Sep 27, 2018

My working thesis is that https://github.com/facebook/fboss/blob/master/fboss/agent/hw/bcm/oss/BcmControlPlane.cpp is missing an init of queueManager_

EDIT:
We've tried initing queueManager_ like this: dhtech@837bf7f without success. Current thesis is that controlPlane_ in BcmSwitch might not be correctly set. initTables for example seem to set it, but I cannot find any call for the function.

@capveg
Copy link

capveg commented Sep 27, 2018

Thanks for the report - can you confirm before the update everything was working correctly? (sorry if it's a dumb question - I'm trying not to sound too surprised :-)

@capveg
Copy link

capveg commented Sep 27, 2018

Also, unfortunately it's not trivial for me to test the OSS internally, so if you could compile with debugging options (similar to what's described here: https://stackoverflow.com/questions/7724569/debug-vs-release-in-cmake), we can get a stack trace with line numbers and can make a better guess as to what broke. Thanks for the interest!

@bluecmd
Copy link
Author

bluecmd commented Sep 28, 2018

We were running at the stated commit, or at least one around that commit, in June and it certainly worked. The same switch was unboxed from storage and the only thing we did was:

  1. Modify the config to set port speed
  2. Realize we have to patch fboss to support that transceiver we want to use
  3. Compile a new fboss from head
  4. Install libsodium23 from Debian stretch backports as that seems to be a new dependency
  5. Start fboss - boom

We can certainly compile a version at the previously stated commit and run with it to verify the bisection window, and we will change the config back to the one we ran successfully in June.

@bluecmd
Copy link
Author

bluecmd commented Sep 28, 2018

And we will get back to you with the result of the cmake -DCMAKE_BUILD_TYPE=Debug run

@bluecmd
Copy link
Author

bluecmd commented Sep 28, 2018

This is the log for the startup based on:

dhtech@r1a0:~/fboss$ git status
HEAD detached at e440fcd
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   fboss/agent/ApplyThriftConfig.cpp
	modified:   getdeps.sh

no changes added to commit (use "git add" and/or "git commit -a")
dhtech@r1a0:~/fboss$ git diff
diff --git a/fboss/agent/ApplyThriftConfig.cpp b/fboss/agent/ApplyThriftConfig.cpp
index cc087e7..79f2bd8 100644
--- a/fboss/agent/ApplyThriftConfig.cpp
+++ b/fboss/agent/ApplyThriftConfig.cpp
@@ -328,7 +328,7 @@ shared_ptr<SwitchState> ThriftConfigApplier::run() {
     // Make sure there is a one-to-one map between vlan and interface
     // Remove this sanity check if multiple interfaces are allowed per vlans
     auto& entry = vlanInterfaces_[vlanInfo.first];
-    if (entry.interfaces.size() != 1) {
+    if (entry.interfaces.size() > 1) {
       auto cpu_vlan = newState->getDefaultVlan();
       if (vlanInfo.first != cpu_vlan) {
         throw FbossError("Vlan ", vlanInfo.first, " refers to ",
diff --git a/getdeps.sh b/getdeps.sh
index ec60709..19f8fc2 100755
--- a/getdeps.sh
+++ b/getdeps.sh
@@ -117,9 +117,9 @@ NPROC=$(grep -c processor /proc/cpuinfo)
     fi
     # iproute2 v4.4.0
     update https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git 7ca63aef7d1b0c808da0040c6b366ef7a61f38c1
-    update https://github.com/facebook/folly.git
-    update https://github.com/facebook/wangle.git
-    update https://github.com/facebook/fbthrift.git
+    update https://github.com/facebook/folly.git v2018.06.04.00
+    update https://github.com/facebook/wangle.git v2018.06.04.00
+    update https://github.com/facebook/fbthrift.git v2018.06.04.00
     update https://github.com/no1msd/mstch.git
     update https://github.com/facebook/zstd.git
     update https://github.com/google/googletest.git release-1.8.0

That should be the state where we ran it in June.
This is the startup log:

dhtech@wedge1:~$ sudo /usr/local/bin/wedge_agent -mode=wedge -mgmt_if=ma1 -config=/etc/wedge.json
E0928 17:48:54.393641 23401 WedgeProductInfo.cpp:136] json parse error on line 0: expected json value
E0928 17:48:54.393896 23401 WedgeProductInfo.cpp:67] json parse error on line 0: expected json value
DMA pool size: 16777216
PCI unit 0: Dev 0xb850, Rev 0x03, Chip BCM56850_A2, Driver BCM56850_A0

Initializing platform
Device Configuration - SUCCESS!
SOC unit 0 attached to PCI device BCM56850_A2
Boot flags: Cold boot
rc: unit 0 device BCM56850_A2
open /dev/linux-bcm-knet: : No such file or directory
rc: MMU initialized
rc: L2 Table shadowing enabled
rc: Port modes initialized
Common SDK init completed
E0928 17:49:08.767557 23439 QsfpCache.cpp:166] Exception talking to qsfp_service: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0928 17:49:08.795991 23439 WedgePort.cpp:85] Error retrieving info for transceiver 0 Exception: St13runtime_error: Transceiver 0 not in cache
E0928 17:49:08.901234 23439 WedgePort.cpp:85] Error retrieving info for transceiver 0 Exception: St13runtime_error: Transceiver 0 not in cache

The transceiver error is due to I'm not having qsfp_service running, I currently have a weird QSFP in which makes qsfp_service crash, but I don't believe it's relevant here (and I'm remotely present so I cannot unplug it).

Building from master:

# docker run --name fboss_build_phase2 -v `pwd`:/tmp/code fboss/tmp /bin/sh -c "cd /tmp/code && mkdir -p build && cd build && cmake -DCMAKE_BUILD_TYPE=Debug .. && make -j`grep -c processor /proc/cpuinfo`"
dhtech@wedge1:~$ sudo /usr/local/bin/wedge_agent -mode=wedge -mgmt_if=ma1 -config=/etc/wedge.json
E0928 18:20:48.462908 23573 WedgeProductInfo.cpp:140] json parse error on line 0: expected json value
E0928 18:20:48.463168 23573 WedgeProductInfo.cpp:67] json parse error on line 0: expected json value
I0928 18:20:48.507074 23573 Main.cpp:338] serving on localhost on port 5909
DMA pool size: 16777216
PCI unit 0: Dev 0xb850, Rev 0x03, Chip BCM56850_A2, Driver BCM56850_A0

Initializing platform
Device Configuration - SUCCESS!
SOC unit 0 attached to PCI device BCM56850_A2
Boot flags: Cold boot
rc: unit 0 device BCM56850_A2
open /dev/linux-bcm-knet: : No such file or directory
rc: MMU initialized
rc: L2 Table shadowing enabled
rc: Port modes initialized
Common SDK init completed
I0928 18:21:02.347016 23593 BcmSwitch.cpp:559] Initializing BcmSwitch for unit 0
*** Aborted at 1538158862 (unix time) try "date -d @1538158862" if you are using GNU date ***
PC: @     0x564a3693b2d4 std::unique_ptr<>::get()
*** SIGSEGV (@0x10) received by PID 23573 (TID 0x7f70048f3700) from PID 16; stack trace: ***
    @     0x7f701be5e0c0 (unknown)
    @     0x564a3693b2d4 std::unique_ptr<>::get()
    @     0x564a3693b2f2 std::unique_ptr<>::operator->()
    @     0x564a369366a8 facebook::fboss::BcmControlPlane::getMulticastQueueSettings()
    @     0x564a3692470a facebook::fboss::BcmSwitch::getColdBootSwitchState()
    @     0x564a36926ccc facebook::fboss::BcmSwitch::init()
    @     0x564a36aae464 facebook::fboss::SwSwitch::init()
    @     0x564a369e2a03 facebook::fboss::Initializer::initImpl()
    @     0x564a369e2322 facebook::fboss::Initializer::initThread()
    @     0x564a369eab32 _ZSt13__invoke_implIvRKMN8facebook5fboss11InitializerEFvvEPS2_JEET_St21__invoke_memfun_derefOT0_OT1_DpOT2_
    @     0x564a369eaabf _ZSt8__invokeIRKMN8facebook5fboss11InitializerEFvvEJPS2_EENSt9result_ofIFOT_DpOT0_EE4typeESA_SD_
    @     0x564a369eaa70 _ZNKSt12_Mem_fn_baseIMN8facebook5fboss11InitializerEFvvELb1EEclIJPS2_EEEDTcl8__invokedtdefpT6_M_pmfspcl7forwardIT_Efp_EEEDpOS8_
    @     0x564a369eaa17 _ZNSt12_Bind_simpleIFSt7_Mem_fnIMN8facebook5fboss11InitializerEFvvEEPS3_EE9_M_invokeIJLm0EEEEvSt12_Index_tupleIJXspT_EEE
    @     0x564a369ea853 std::_Bind_simple<>::operator()()
    @     0x564a369ea61a std::thread::_State_impl<>::_M_run()
    @     0x7f7014f45e6f (unknown)
    @     0x7f701be54494 start_thread
    @     0x7f70146baacf clone
    @                0x0 (unknown)
Segmentation fault

Not sure why the debug didn't take, do I need to do anything more than passing that to cmake?

@capveg
Copy link

capveg commented Oct 9, 2018

Sorry for the slow reply to this. I'm also surprised that the debug symbols didn't pop up for this - let me figure that out in parallel. Looking at the code and the error, I think the patch below may help. I'm trying to test it locally, but our ability to emulate a non-facebook setup inside facebook is limited :-(

If you get a chance to test this before I do, please let me know if it fixes the problem.

diff --git a/fbcode/fboss/agent/hw/bcm/oss/BcmControlPlaneQueueManager.cpp b/fbcode/fboss/agent/hw/bcm/oss/BcmControlPlaneQueueManager.cpp
--- a/fbcode/fboss/agent/hw/bcm/oss/BcmControlPlaneQueueManager.cpp
+++ b/fbcode/fboss/agent/hw/bcm/oss/BcmControlPlaneQueueManager.cpp
@@ -20,8 +20,9 @@

 std::shared_ptr<PortQueue> BcmControlPlaneQueueManager::getCurrentQueueSettings(
     cfg::StreamType /*streamType*/,
-    opennsl_cos_queue_t /*cosQ*/) const {
-  return std::shared_ptr<PortQueue>{};
+    opennsl_cos_queue_t cosQ) const {
+  // stub implementation - depends on newer OpenNSL
+  return std::make_shared<PortQueue>(cosQ);
 }

@capveg
Copy link

capveg commented Oct 9, 2018

also, if you could please create a new issue for the qsfp_service segfault. Even if it is because you have a funky optic, it still shouldn't segfault. I won't promise I can fix it promptly, but still good to track.

Thank you again!

@bluecmd
Copy link
Author

bluecmd commented Oct 9, 2018

Done, #77 . I'm compiling with your patch right now - I had to pull my OpenNSL 3.5.0.1 changes as well, but I guess they should be orthogonal to this issue.

@bluecmd
Copy link
Author

bluecmd commented Oct 9, 2018

@capveg The patch seems to not work sadly, https://gist.github.com/bluecmd/bd16185170dff642de197e34349aa14c

I wish the stack trace could be more useful. Granted, that particular build I did not run with the cmake debug settings, I can try to re-do that and spend some time to see if I can get line numbers if you feel it would be the next logical step.

@cubic1271
Copy link

Quick note to say that I'll echo this behavior (and this trace) on a Wedge 100 and using the master current as of around a month ago. Stack trace is similarly not very useful. I did try to apply the patch described above with no useful results.

In our case, we also don't have a functional FBOSS version to revert to, 'cause the infrastructure for a Wedge 100 FBOSS on Open Network Linux seems to be pretty broken at the moment (for quite a few other reasons). I will try a build of e440fcd and seeing if that gets us going, though.

@arnauddorgans
Copy link

Still have crash on startup on EventBase::bumpHandlingTime in 2021 facebook/flipper#2577

arajeev-ARISTA pushed a commit to arajeev-ARISTA/fboss that referenced this issue Sep 26, 2023
Xcvr Asic trace info fixes and script fixes to generate the right lane and polarity swaps
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants