Force Watchdog Feeder #5276

Closed
6 tasks done
darkain opened this issue Oct 25, 2018 · 7 comments
Labels
waiting for feedback: Waiting on additional info. If it's not received, the issue may be closed.

Comments

darkain commented Oct 25, 2018

Basic Infos

  • This issue complies with the issue POLICY doc.
  • I have read the documentation at readthedocs and the issue is not addressed there.
  • I have tested that the issue is present in current master branch (aka latest git).
  • I have searched the issue tracker for a similar issue.
  • If there is a stack dump, I have decoded it.
  • I have filled out all fields below.

Platform

  • Hardware: [ESP-12]
  • Core Version: [2.4.2] (really, any, I've tried HEAD, old versions, new version, etc)
  • Development Env: [Arduino IDE]
  • Operating System: [Windows]

Settings in IDE

  • Module: [Nodemcu]
  • Flash Mode: [?] (whatever the automatic option is for Nodemcu)
  • Flash Size: [4MB/3MB]
  • lwip Variant: [v2 Lower Memory] (tested on all variants)
  • Reset Method: [nodemcu]
  • Flash Frequency: [40MHz]
  • CPU Frequency: [80MHz] (tested at 160MHz too)
  • Upload Using: [SERIAL]
  • Upload Speed: [115200]

Problem Description

Is there an actual way to FORCE the hardware watchdog timer to be fed? I keep getting resets as listed below. Resets happen anywhere between ~1 minute and ~3 hours, with no consistency whatsoever. The actual sketch code is too complex to show here. I can say with absolute certainty, though, through performance profiling, that the main loop() function completes in under 20ms. Inside this loop, in my main inner loop, I've also tried calling ESP.wdtFeed(), which means this function is being called about every 0.1ms. Even with this, the WDT will still randomly reset as mentioned.

Before getting into the boards and power supplies debate:

  1. I have an entire array of boards. EVERY single one does this.

  2. I have several different power supplies. I've used a variety of USB battery banks and wall outlets (all supply a minimum of 5V 2A very stable power). I've also used a 300W ATX power supply with nothing else attached. I have 1000uF caps on both the 5V line and 3.3V line. I've used the controller's own 3.3V regulators. I've used my own 3.3V buck-boost converters.

  3. Some of these boards are in breadboards for testing. Some of them are soldered directly into custom motherboards with 5000uF on the 5V line for power stability and 60W max power.

The code in question is running just shy of 1MHz on multiple GPIO channels simultaneously, plus pushing nearly full bandwidth of 115200 serial, and wifi in station mode. During a small portion of the loop (~10ms) interrupts are also disabled. I'm wondering if maybe there is an actual design flaw inside these CPUs where one of these may be causing voltage leaks/sags into the hardware register for the WDT?

ets Jan  8 2013,rst cause:4, boot mode:(3,6)

wdt reset
load 0x4010f000, len 1384, room 16 
tail 8
chksum 0x2d
csum 0x2d
vbb28d4a3
~ld

devyte (Collaborator) commented Oct 25, 2018

I could be wrong, but I understand that feeding the sw WD also feeds the hw WD.

Are you disabling the sw WD manually? Depending on your answer:
If your loop() runs are as short as you say, your problem is likely not a lack of feeding or whatever, but rather something going wrong and execution going off the deep end (e.g.: ending up in an infinite loop).
Per your description, that would happen only sometimes. From that, and because the same thing happens regardless of the core version, my first suspicion would be some kind of mem corruption, either heap or stack (remember that the stack in the ESP is tiny).
My second suspicion would be some kind of starvation. The high rate of interrupts means a high rate of ISRs getting called, which I suppose could starve normal non-interrupt code that feeds the wdt.

Some debugging techniques as suggestions:

  • remove components from your code (if you have cross-dependencies, emulate), and figure out which is the problematic one. You can also print out tiny msgs to serial at key points to figure out where execution was before the wdt fired.
  • try slightly lower rates at the GPIO channels. If the wdt goes away, it hints at starvation.
  • investigate if it's possible to

I don't see anything actionable here within the repo, so I'm closing for now. If you do track this to a problem in the core, feel free to open a new issue and add in the details.
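
As an illustration of the "tiny msgs at key points" suggestion above, here is a minimal sketch of a checkpoint trail, written as plain C++ so it can be tried on a host before porting. The class name `CheckpointTrail` and the numeric checkpoint IDs are hypothetical and not part of the ESP8266 core. The idea is to keep only the last few checkpoint IDs in a ring buffer, which costs far less than printing a string at every checkpoint, and to dump them when something looks wrong.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not an ESP8266 core API): remembers the N most
// recent checkpoint IDs so they can be inspected or printed later.
template <std::size_t N>
class CheckpointTrail {
    static_assert(N > 0, "trail needs at least one slot");

public:
    // Record that execution passed checkpoint `id`.
    void mark(std::uint16_t id) {
        ids_[head_] = id;
        head_ = (head_ + 1) % N;
        if (count_ < N) {
            ++count_;
        }
    }

    // Number of checkpoints currently stored (at most N).
    std::size_t size() const { return count_; }

    // Checkpoint recorded `age` marks ago; age 0 is the newest.
    std::uint16_t recent(std::size_t age) const {
        assert(age < count_);
        return ids_[(head_ + N - 1 - age) % N];
    }

private:
    std::array<std::uint16_t, N> ids_{};
    std::size_t head_ = 0;   // next slot to overwrite
    std::size_t count_ = 0;  // valid entries stored so far
};
```

In a sketch, `mark()` would be called at the key points of `loop()`. Because a hardware WDT reset clears RAM, the trail would have to be flushed to serial periodically, or kept somewhere that survives a reset, for it to help after the fact; that part is left out here.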

devyte closed this as completed Oct 25, 2018

darkain (Author) commented Oct 25, 2018

As a developer, I must admit it's extremely discouraging to just see "oh, we're going to close this outright, put it out of sight, out of mind" without any actual investigation. It's far too common to just throw out the idea "it's user error, never us" - as an FYI: this is code that is working perfectly on ARM and AVR without issue, but I wanted to switch to a more powerful platform (the ESP series of chips).

GPIO is write-only, so no interrupts. I never disable software WDT. No infinite loops are happening; the application enters a static execution state and just repeats this particular state indefinitely every ~20ms (as mentioned in the initial post). It simply sends the exact same data through multiple GPIO channels over and over again. Serial is output only as well. The only input is wifi, which would just be random broadcast packets on the LAN; nothing is targeting the ESP directly itself.

The serial code mentioned in my initial post is exactly what you suggested, outputting a log of checkpoints at several points within the code. Yes, by default it consistently failed at the exact same point in code, always. Removing that piece of code just shifted where it would die. If any sort of minor modification happened in the code, then it would change where the WDT reset would occur.

If you want more notes: setting the device to Soft AP mode instead of Station, no other code change whatsoever, the WDT resets happen significantly faster. I've asked about this particular issue before in IRC and here on another thread dealing with the web server by itself, with no reply on that particular topic at all. Once I started seeing it happen in Station mode as well, I realized the issue is more global, not just in AP, so decided to create a dedicated issue.

If you want to see my notes on resource exhaustion, please see my other issue: #4823 (which you've personally commented on, so should already be familiar with!?)

Yes, I'm more than aware of the limitations of this particular board. I've already put significant work into this library itself to reduce its memory consumption. But a lot of the wifi code is hidden, so I cannot get in and debug/fix that myself. I've already had to do this with a Bluetooth stack on ARM with only 8KiB RAM.

devyte (Collaborator) commented Oct 25, 2018

@darkain I'm sorry that you feel discouraged, but we currently have over 400 issues to deal with, so expecting an investigation from someone here for your particular use case is unrealistic.
About closing this, it means that there is no action to be taken by current maintainers, especially given that there is no MCVE sketch provided. BTW, why didn't you provide one? Feel free to discuss further, and even post results of your own investigation; a peer might be willing to help you out. However, you should keep in mind that this is an issue tracker for the code hosted on this repo, and not a forum for discussion, and so far I just see an issue, but nothing points to our core. That means keep investigating on your end, and if you eventually prove that there is a core issue I can reopen. The problems with the hw Serial turned out like that: many vague reports but nothing actionable, until one provided example showed random crashes at higher speeds, and we were able to pursue that to figure out the problem.
About ARM and AVR, the ESP works in a VERY different way than other uCs, both due to the fact that here we have a wifi and TCP/IP stack to maintain with painful timing requirements, and because we have to deal with closed source binary blobs from the manufacturer. So a comparison of behavior is not apples to apples.
About starvation, I said cpu starvation, not mem exhaustion. There is another possibility, if you're using bitbanging, e.g. softwareserial: starvation due to shutting down of interrupts too often or for too long total time, due to writing out of the bitbang sequence. Anyways, hard to tell without a sketch. Even if that doesn't cause starvation, it could cause mem corruption due to lack of servicing of the wifi stack. What happens if you shut down wifi entirely?
My suggestion at this point, and here I'm assuming that you've debugged your own code and any 3rd party libs you're using thoroughly: monitor stack watermark and heap usage, monitor time spent in your app code vs. time spent outside of your app code, monitor variable contents right before crashing, and in general keep debugging.
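
One way the "time spent in your app code vs. time spent outside of it" measurement could be approached is sketched below. The class `LoopTimer` and its method names are hypothetical; timestamps are passed in as 32-bit microsecond counts, so that on an Arduino target they could come from `micros()`, while on a host they can be fed by hand. Unsigned subtraction keeps the durations correct across a single counter wraparound.

```cpp
#include <cstdint>

// Hypothetical helper (not an ESP8266 core API): tracks how long the
// application spends inside loop() versus between loop() calls.
class LoopTimer {
public:
    // Call at the top of loop() with the current time in microseconds.
    void enter(std::uint32_t nowUs) {
        if (hasLeft_) {
            const std::uint32_t gap = nowUs - lastLeaveUs_;  // wrap-safe
            totalOutsideUs_ += gap;
            if (gap > maxOutsideUs_) {
                maxOutsideUs_ = gap;
            }
        }
        lastEnterUs_ = nowUs;
        inside_ = true;
    }

    // Call at the bottom of loop() with the current time in microseconds.
    void leave(std::uint32_t nowUs) {
        if (!inside_) {
            return;  // leave() without a matching enter(): ignore
        }
        const std::uint32_t spent = nowUs - lastEnterUs_;  // wrap-safe
        totalInsideUs_ += spent;
        if (spent > maxInsideUs_) {
            maxInsideUs_ = spent;
        }
        lastLeaveUs_ = nowUs;
        hasLeft_ = true;
        inside_ = false;
    }

    std::uint32_t maxInsideUs() const { return maxInsideUs_; }
    std::uint32_t maxOutsideUs() const { return maxOutsideUs_; }
    std::uint64_t totalInsideUs() const { return totalInsideUs_; }
    std::uint64_t totalOutsideUs() const { return totalOutsideUs_; }

private:
    std::uint32_t lastEnterUs_ = 0;
    std::uint32_t lastLeaveUs_ = 0;
    std::uint32_t maxInsideUs_ = 0;
    std::uint32_t maxOutsideUs_ = 0;
    std::uint64_t totalInsideUs_ = 0;
    std::uint64_t totalOutsideUs_ = 0;
    bool inside_ = false;
    bool hasLeft_ = false;
};
```

A large `maxInsideUs()` would point at the application hogging the CPU, while a large `maxOutsideUs()` would suggest the time is going to code that runs between `loop()` calls.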

d-a-v (Collaborator) commented Oct 25, 2018

monitor stack watermark and heap usage

Can you enable all debug options, instrument your code with ESP.getFreeHeap() and related functions, watch for :oom messages, and report what's happening right before the crash?
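
A minimal sketch of what such instrumentation could look like, assuming free-heap readings are sampled regularly (on the ESP8266 they could come from `ESP.getFreeHeap()`). The class `HeapWatermark` and the alarm threshold are hypothetical, chosen only for illustration; the code is plain C++ so the logic can be checked on a host.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not an ESP8266 core API): remembers the lowest
// free-heap reading seen so far and flags readings below a threshold.
class HeapWatermark {
public:
    explicit HeapWatermark(std::uint32_t alarmBytes)
        : alarmBytes_(alarmBytes) {}

    // Feed one free-heap reading in bytes.
    // Returns true when this reading is below the alarm threshold.
    bool sample(std::uint32_t freeBytes) {
        if (freeBytes < lowest_) {
            lowest_ = freeBytes;
        }
        return freeBytes < alarmBytes_;
    }

    // Lowest reading seen so far (max uint32_t until the first sample).
    std::uint32_t lowest() const { return lowest_; }

private:
    std::uint32_t alarmBytes_;
    std::uint32_t lowest_ = std::numeric_limits<std::uint32_t>::max();
};
```

Sampling at a few checkpoints per `loop()` pass and printing `lowest()` only when it changes keeps the serial traffic small while still showing how close the heap got to exhaustion before a crash.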

d-a-v (Collaborator) commented Jan 17, 2019

This thread is closed, but according to your answer to this request above, it can be reopened.
Your initial description cannot help us help you. We need more input. See this and this.

d-a-v added the "waiting for feedback" label Jan 18, 2019

everslick (Contributor) commented

I'd also like to point out that latest master improved stability for my application without any code changes on my side, so, @darkain, I would give it another shot. My firmware is also pretty complex (~30k loc) and I can understand why providing an MCVE can be a big PITA. Nevertheless, there is only so much the devs can do here without one.

darkain (Author) commented Apr 5, 2019

I upgraded to release 2.5.0, and all seems to be stable now. I've run a pair of tests so far that ran stable for over 12 hours each without a single error. It appears that whatever within the core ESP8266 for Arduino library was causing this issue (possibly the race condition crash listed in the bug fix list) is resolved at this point!

Also, the code in question I was initially having an issue with is a fully scriptable wifi enabled LED controller. It was recently approved to be released publicly: https://github.com/cosplaylighting/spider2
