Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MagTag Vaccination tracker crash in magtag.network.connect() between 7alpha3 and 7alpha4 #5021

Closed
dglaude opened this issue Jul 19, 2021 · 19 comments · Fixed by #5245
Closed
Assignees
Labels
bug espressif applies to multiple Espressif chips
Milestone

Comments

@dglaude
Copy link

dglaude commented Jul 19, 2021

CircuitPython version

Adafruit CircuitPython 7.0.0-alpha.4 on 2021-07-08; Adafruit MagTag with ESP32S2

Code/REPL

Demo code can be fetched from here: https://learn.adafruit.com/adafruit-magtag-covid-vaccination-percent-tracker/code-the-vaccination-tracker

Behavior

MagTag Vaccination tracker crash in magtag.network.connect() between 7alpha3 and 7alpha4

I tested MagTag Covid percentage tracker with 7.0.0-alpha.3 and 7.0.0-alpha.4.

Somewhere in between the two it crash in loop in magtag.network.connect()

With Adafruit CircuitPython 7.0.0-alpha.3 on 2021-06-03; Adafruit MagTag with ESP32S2

It kind of work, it does not display the properly, but the progress bar are there and the network fetch work:

Auto-reload is on. Simply save files over USB to run them or enter REPL to disable.
code.py output:
Connecting to AP NEO
Retrieving data...Reply is OK!
Response is ['United States', '2021-07-18', '"Johnson&Johnson', ' Moderna', ' Pfizer/BioNTech"', 'https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data', '337740358', '186038501', '161232483']
Sleeping for 86400 seconds

Code done running.

Press any key to enter the REPL. Use CTRL-D to reload.
Pretending to deep sleep until alarm, CTRL-C or file write.

With Adafruit CircuitPython 7.0.0-alpha.4 on 2021-07-08; Adafruit MagTag with ESP32S2 it bootloop in

It reset/boot in loop somewhere inside magtag.network.connect():

soft reboot

Auto-reload is on. Simply save files over USB to run them or enter REPL to disable.
code.py output:
Connecting to AP NEO

^^^ And here bootlook reseting the serial ^^^

So somewhere between 7.0.0-alpha.3 and 7.0.0-alpha.4 the network call of

I guess I need to bisec and find where it stated to behave like that.

Demo code can be fetched from here:
https://learn.adafruit.com/adafruit-magtag-covid-vaccination-percent-tracker/code-the-vaccination-tracker

Description

No response

Additional information

This test is done with the learn guide library... except for manual update to the latest version from GIT of:

  • adafruit_portalbase
  • adafruit_magtag

Using the py (not mpy) version of those library (in an attempt to debug that further).

This is also mean using adafruit/Adafruit_CircuitPython_MagTag#63 that I was trying to validate.

@dglaude dglaude added the bug label Jul 19, 2021
@dhalbert
Copy link
Collaborator

Could you test this with a build that includes #5017? That fixed a bug in OnDiskBitmap, which is used to provide the background image for this project. The bug was a memory smash which causes unpredictable symptoms.

@dhalbert dhalbert added this to the 7.0.0 milestone Jul 19, 2021
@dglaude
Copy link
Author

dglaude commented Jul 19, 2021

I tested with adafruit-circuitpython-adafruit_magtag_2.9_grayscale-en_US-20210719-1b17b08.uf2 an artefact of #5017 and it reset loop in the same call as alpha-4.

Please notice that it crash very early in the program... before doing the background part.
Actually there might be no need to reduce the code to a minimum since I know where it crash:

from adafruit_magtag.magtag import MagTag
from adafruit_progressbar.progressbar import ProgressBar

# Set up where we'll be fetching data from
DATA_SOURCE = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/country_data/United%20States.csv"
# Find data for other countries/states here:
# https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations

magtag = MagTag(url=DATA_SOURCE)
magtag.network.connect()
# ^^^ it never goes beyond this point in the code

(having something from the learn guide make it easy to reproduce, just so far no one confirmed I am not alone with this)

@dglaude
Copy link
Author

dglaude commented Jul 19, 2021

Maybe I shoud rename the title to:

"from adafruit_progressbar import ProgressBar" make "magtag.network.connect()" crash.

Now I have a very small piece of code and adding one like make it crash...

from adafruit_magtag.magtag import MagTag

# If I put the following line, it make "magtag.network.connect()" crash.
#from adafruit_progressbar import ProgressBar

DATA_SOURCE = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/country_data/United%20States.csv"

magtag = MagTag(url=DATA_SOURCE)

print("Before magtag.network.connect()")
magtag.network.connect()
print("After magtag.network.connect()")

@dhalbert
Copy link
Collaborator

If you have the time and the willingness, try diving down in to the library to narrow the problem further, emulating what the library is doing, and not using the library itself. A lot is going on in the constructor MagTag() and the connect().

@jepler
Copy link
Member

jepler commented Aug 13, 2021

I re-tested with beta.6 loaded via uf2 with

from adafruit_magtag.magtag import MagTag
from adafruit_progressbar import ProgressBar

DATA_SOURCE = "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/country_data/United%20States.csv"

magtag = MagTag(url=DATA_SOURCE)

print("Before magtag.network.connect()")
magtag.network.connect()
print("After magtag.network.connect()")

and the crash still occurs. Intially I thought it was fixed, because the code sample a few comments above commented out the problematic line by default.

@jepler jepler self-assigned this Aug 13, 2021
@tannewt tannewt added the espressif applies to multiple Espressif chips label Aug 13, 2021
@jepler
Copy link
Member

jepler commented Aug 14, 2021

debug uart messages seem to indicate the watchdog is biting...

I (48) boot: ESP-IDF v4.3-dev-1197-g8bc19ba89 2nd stage bootloader
I (48) boot: compile time 12:53:14
I (48) boot: chip revision: 0
I (52) qio_mode: Enabling default flash chip QIO
I (57) boot.esp32s2: SPI Speed      : 80MHz
I (62) boot.esp32s2: SPI Mode       : QIO
I (66) boot.esp32s2: SPI Flash Size : 4MB
W (71) boot.esp32s2: PRO CPU has been reset by WDT.
W (77) boot.esp32s2: WDT reset info: PRO CPU PC=0x4004c8b6

@jepler jepler removed their assignment Aug 18, 2021
@jepler
Copy link
Member

jepler commented Aug 18, 2021

I didn't get any traction on this bug, so giving it back for someone else to try.

@jfurcean
Copy link

jfurcean commented Aug 24, 2021

Commenting out the gc.collect() line in network.py of adafruit_circuitpython_portalbase allows the shortened example to work. The longer example still doesn't work. I think there is some interaction with gc.collect() and the network connection.

I tested this with the simple magtag example in adafruit_circuitpython_magtag and added magtag.network.connnect(). It would crash unless I also commented out the gc.collect() in add_text and set_text in __init__.py in adafruit_circuitpython_portalbase.

Adafruit CircuitPython 7.0.0-beta.0 on 2021-08-24; Adafruit MagTag with ESP32S2
# SPDX-FileCopyrightText: 2017 Scott Shawcroft, written for Adafruit Industries
#
# SPDX-License-Identifier: Unlicense
import time
import terminalio
from adafruit_magtag.magtag import MagTag

magtag = MagTag()

# added this line to the example
magtag.network.connect()

magtag.add_text(
    text_font=terminalio.FONT,
    text_position=(
        50,
        (magtag.graphics.display.height // 2) - 1,
    ),
    text_scale=3,
)

magtag.set_text("Hello World")

buttons = magtag.peripherals.buttons
button_colors = ((255, 0, 0), (255, 150, 0), (0, 255, 255), (180, 0, 255))
button_tones = (1047, 1318, 1568, 2093)
timestamp = time.monotonic()

while True:
    for i, b in enumerate(buttons):
        if not b.value:
            print("Button %c pressed" % chr((ord("A") + i)))
            magtag.peripherals.neopixel_disable = False
            magtag.peripherals.neopixels.fill(button_colors[i])
            magtag.peripherals.play_tone(button_tones[i], 0.25)
            break
    else:
        magtag.peripherals.neopixel_disable = True
    time.sleep(0.01)

@tannewt tannewt self-assigned this Aug 24, 2021
@KeithTheEE
Copy link

I think this bug extends to the Metro Express ESP32-S2 however I can no longer reproduce it.

Similarities:

  • Metro ESP32-S2 is was using network connections
  • As the gc.mem_free() ram ran out instead of freeing memory, the board would reset.

I was using the .bin nightly build from August 21st.

This morning I moved to the 7.0 beta .bin, and experienced the same issue. In an effort to find the minimum amount of code needed to reproduce the issue, I was using a large string, putting it into a variable and that variable inside of an empty list, and just multiplying it to take up space so in around 10 steps the board would need to free up its memory. After the variables were populated, I'd reset them to empty strings and lists with the exception of the base, memory_hog_str in an effort to make sure the previously occupied memory can be freed.

Part way through while pruning the program I accidentally caused a hard crash, and the board booted into safemode. I think this was because I added to the memory hog string a few unescaped %'s and it freaked out.

I deleted those, hit reset on the board to exit safemode.

After that point, I can no longer cause a reset to occur with it frees up the memory. It now free's the memory without issue and without resetting, performing as one would expect.

To try and get the bug to occur again, I rolled back to the nightly mentioned above, and could not get it to occur, even using the full original code.

@jepler
Copy link
Member

jepler commented Aug 25, 2021

Unfortunately, #5220 did NOT fix this

@jepler
Copy link
Member

jepler commented Aug 25, 2021

It must be some kind of object that is moved but can't be moved safely, making this change causes the crash to go away for me:

--- a/py/gc.c
+++ b/py/gc.c
@@ -780,6 +780,7 @@ bool gc_has_finaliser(const void *ptr) {
 }
 
 void *gc_make_long_lived(void *old_ptr) {
+    return old_ptr;
     // If its already in the long lived section then don't bother moving it.
     if (old_ptr >= MP_STATE_MEM(gc_lowest_long_lived_ptr)) {
         return old_ptr;

@jfurcean
Copy link

Here is the minimum code that still caused the crash

from adafruit_portalbase.wifi_esp32s2 import WiFi
from secrets import secrets
import gc

wifi = WiFi()
wifi.connect(secrets["ssid"], secrets["password"])

print("wifi connected")

# crashes here
gc.collect()

print("gc.collect() finished")

However, this code doesn't crash

from secrets import secrets

import gc
import wifi
import ssl
import socketpool
import adafruit_requests

wifi.radio.connect(secrets["ssid"], secrets["password"])
pool = socketpool.SocketPool(wifi.radio)
requests = adafruit_requests.Session(pool, ssl.create_default_context())

print("wifi connected")

gc.collect()

print("gc.collect() finished")

Maybe an issue with native wifi that is moved?

@jepler
Copy link
Member

jepler commented Aug 26, 2021

diff --git a/shared-bindings/socketpool/SocketPool.c b/shared-bindings/socketpool/SocketPool.c
index bd6a2b44f..8389ca24a 100644
--- a/shared-bindings/socketpool/SocketPool.c
+++ b/shared-bindings/socketpool/SocketPool.c
@@ -47,7 +47,7 @@
 STATIC mp_obj_t socketpool_socketpool_make_new(const mp_obj_type_t *type, size_t n_args, const mp_obj_t *args, mp_map_t *kw_args) {
     mp_arg_check_num(n_args, kw_args, 1, 1, false);
 
-    socketpool_socketpool_obj_t *s = m_new_obj_with_finaliser(socketpool_socketpool_obj_t);
+    socketpool_socketpool_obj_t *s = m_new_ll_obj_with_finaliser(socketpool_socketpool_obj_t);
     s->base.type = &socketpool_socketpool_type;
     mp_obj_t radio = args[0];
 

This change makes little sense, but it makes the crash go away with me. I tested with 7098d4c plus only this change, using the script from #5021 (comment) saved to crash.py and imported at the repl with import crash.

The reason I say it makes little sense is that the esp32-s2 implementation of socketpool has NOTHING in it. The use of with_finaliser seems a bit doubtful to me, as the object doesn't define a __del__ method.

jepler added a commit to jepler/circuitpython that referenced this issue Aug 26, 2021
I cannot give a satisfactory account of _why_, but the crash in adafruit#5021
goes away when this object is initially allocated in the long-lived
portion of the heap.

Closes: adafruit#5021
@dhalbert
Copy link
Collaborator

I can think of two reasons why allocating this as long-lived accidentally fixes a problem:

  1. The object is not being protected against gc.
  2. Something is trying to write past the end of the object and is smashing something just above it that then causes a crash.

@jepler
Copy link
Member

jepler commented Aug 26, 2021

Scott & I think we figured it out. It's a super interesting bug:

scott will put together a patch for us soon, but it was way past lunchtime for him when we finished chatting & diagnosing. 🌮

@ladyada
Copy link
Member

ladyada commented Aug 26, 2021

🌮🌮🌮

@tannewt
Copy link
Member

tannewt commented Aug 26, 2021

@jfurcean Thanks for the minimal example! We used it to debug.

@jepler
Copy link
Member

jepler commented Aug 26, 2021

def gc_layout(size_bytes, block_size=16, bits_per_byte=8):
    pool_blocks = size_bytes * bits_per_byte // (block_size * bits_per_byte + 3)
    print("max # blocks", pool_blocks)
    bits_per_block = block_size * bits_per_byte + 3
    atb_bytes = pool_blocks // 4 + 1
    ftb_bytes = (pool_blocks + 7) // 8
    allocation = atb_bytes + ftb_bytes + pool_blocks * block_size
    
    print("total allocation", allocation)
    if allocation > size_bytes: 
       pool_blocks -= 1 

    return (pool_blocks, atb_bytes, ftb_bytes)

def round_up(a, n):
    return (a + n - 1) // n * n

def describe_gc_layout(base_addr, pool_blocks, atb_bytes, ftb_bytes, block_size=16):
    atb_base = base_addr
    atb_end = ftb_base = base_addr + atb_bytes
    ftb_end = ftb_base + ftb_bytes
    pool_base = round_up(ftb_end, 16)
    pool_end = pool_base + block_size * pool_blocks
    
    assert (4 * atb_bytes) >= (pool_blocks + 1)
    assert (8 * ftb_bytes) >= pool_blocks
    
    print("GC layout:")
    print(f"    alloc table at     {hex(atb_base)}, length {atb_bytes:9} bytes, {atb_bytes*4:6} blocks")
    print(f"    finaliser table at {hex(ftb_base)}, length {ftb_bytes:9} bytes, {ftb_bytes*8:6} blocks")
    print(f"    pool at            {hex(pool_base)}, length {block_size * pool_blocks:9} bytes, {pool_blocks:6} blocks")
    
    print()
    
    print(f"alloc table     {hex(atb_base)} .. {hex(atb_end)}")
    print(f"finaliser table {hex(ftb_base)} .. {hex(ftb_end)}")
    print(f"pool table      {hex(pool_base)} .. {hex(pool_end)}")
layout = gc_layout(2096088)
describe_gc_layout(0x3fd80428, *layout)

I wrote a program in Python to try to lay out the gc memory. Weirdly, ... it gets one more block than micropython, even though I think it is correct and should fix the bug!

max # blocks 128005
total allocation 2096083
GC layout:
    alloc table at     0x3fd80428, length     32002 bytes, 128008 blocks
    finaliser table at 0x3fd8812a, length     16001 bytes, 128008 blocks
    pool at            0x3fd8bfb0, length   2048080 bytes, 128005 blocks

alloc table     0x3fd80428 .. 0x3fd8812a
finaliser table 0x3fd8812a .. 0x3fd8bfab
pool table      0x3fd8bfb0 .. 0x3ff80000

compared to the current outcome

E (3011) gc: GC layout:
E (3011) gc:   alloc table at 0x3fd80428, length 32001 bytes, 128004 blocks
E (3011) gc:   finaliser table at 0x3fd88129, length 16001 bytes, 128008 blocks
E (3021) gc:   pool at 0x3fd8bfc0, length 2048064 bytes, 128004 blocks

notice how allocating one more byte to "alloc_table" means that we can call ATB_GET_KIND(128005) safely (not running into the ftb), but this doesn't push the end of the ftb into a higher 16-byte-aligned block. For unknown reasons, the original code doesn't use the first 16 byte block after the ftb, so we end up with 128005 instead of 128004 blocks.

@jepler
Copy link
Member

jepler commented Aug 27, 2021

Oh it also turns out the number of blocks has to be a multiple of BLOCKS_PER_ATB.

jepler added a commit to jepler/circuitpython that referenced this issue Aug 27, 2021
.. or, for !MICROPY_ENABLE_FINALISER, before the first block of the pool.

Closes: adafruit#5021
Closes: micropython#7116
Signed-off-by: Jeff Epler <jepler@gmail.com>
dpgeorge pushed a commit to dpgeorge/micropython that referenced this issue Nov 25, 2022
Closes: adafruit#5021
Closes: micropython#7116

Signed-off-by: Jeff Epler <jepler@gmail.com>
dpgeorge pushed a commit to dpgeorge/micropython that referenced this issue Dec 8, 2022
Closes: adafruit#5021
Closes: micropython#7116

Signed-off-by: Jeff Epler <jepler@gmail.com>
dpgeorge pushed a commit to dpgeorge/micropython that referenced this issue Dec 8, 2022
Prior to this fix the follow crash occurred.  With a GC layout of:

    GC layout:
      alloc table at 0x3fd80428, length 32001 bytes, 128004 blocks
      finaliser table at 0x3fd88129, length 16001 bytes, 128008 blocks
      pool at 0x3fd8bfc0, length 2048064 bytes, 128004 blocks

Block 128003 is an AT_HEAD and eventually is passed to gc_mark_subtree.
This causes gc_mark_subtree to call ATB_GET_KIND(128004).  When block 1 is
created with a finaliser, the first byte of the finaliser table becomes
0x2, but ATB_GET_KIND(128004) reads these bits as AT_TAIL, and then
gc_mark_subtree references past the end of the heap, which happened to be
past the end of PSRAM on the esp32-s2.

The fix in this commit is to ensure there is a one-byte gap after the ATB
filled permanently with AT_FREE.

Fixes issue micropython#7116.

See also adafruit#5021

Signed-off-by: Jeff Epler <jepler@gmail.com>
Signed-off-by: Damien George <damien@micropython.org>
jcernato pushed a commit to jcernato/micropython that referenced this issue Feb 10, 2023
Prior to this fix the follow crash occurred.  With a GC layout of:

    GC layout:
      alloc table at 0x3fd80428, length 32001 bytes, 128004 blocks
      finaliser table at 0x3fd88129, length 16001 bytes, 128008 blocks
      pool at 0x3fd8bfc0, length 2048064 bytes, 128004 blocks

Block 128003 is an AT_HEAD and eventually is passed to gc_mark_subtree.
This causes gc_mark_subtree to call ATB_GET_KIND(128004).  When block 1 is
created with a finaliser, the first byte of the finaliser table becomes
0x2, but ATB_GET_KIND(128004) reads these bits as AT_TAIL, and then
gc_mark_subtree references past the end of the heap, which happened to be
past the end of PSRAM on the esp32-s2.

The fix in this commit is to ensure there is a one-byte gap after the ATB
filled permanently with AT_FREE.

Fixes issue micropython#7116.

See also adafruit#5021

Signed-off-by: Jeff Epler <jepler@gmail.com>
Signed-off-by: Damien George <damien@micropython.org>
karfas pushed a commit to karfas/micropython that referenced this issue Apr 23, 2023
Prior to this fix the follow crash occurred.  With a GC layout of:

    GC layout:
      alloc table at 0x3fd80428, length 32001 bytes, 128004 blocks
      finaliser table at 0x3fd88129, length 16001 bytes, 128008 blocks
      pool at 0x3fd8bfc0, length 2048064 bytes, 128004 blocks

Block 128003 is an AT_HEAD and eventually is passed to gc_mark_subtree.
This causes gc_mark_subtree to call ATB_GET_KIND(128004).  When block 1 is
created with a finaliser, the first byte of the finaliser table becomes
0x2, but ATB_GET_KIND(128004) reads these bits as AT_TAIL, and then
gc_mark_subtree references past the end of the heap, which happened to be
past the end of PSRAM on the esp32-s2.

The fix in this commit is to ensure there is a one-byte gap after the ATB
filled permanently with AT_FREE.

Fixes issue micropython#7116.

See also adafruit#5021

Signed-off-by: Jeff Epler <jepler@gmail.com>
Signed-off-by: Damien George <damien@micropython.org>
alphonse82 pushed a commit to alphonse82/micropython-wch-ch32v307 that referenced this issue May 8, 2023
Prior to this fix the follow crash occurred.  With a GC layout of:

    GC layout:
      alloc table at 0x3fd80428, length 32001 bytes, 128004 blocks
      finaliser table at 0x3fd88129, length 16001 bytes, 128008 blocks
      pool at 0x3fd8bfc0, length 2048064 bytes, 128004 blocks

Block 128003 is an AT_HEAD and eventually is passed to gc_mark_subtree.
This causes gc_mark_subtree to call ATB_GET_KIND(128004).  When block 1 is
created with a finaliser, the first byte of the finaliser table becomes
0x2, but ATB_GET_KIND(128004) reads these bits as AT_TAIL, and then
gc_mark_subtree references past the end of the heap, which happened to be
past the end of PSRAM on the esp32-s2.

The fix in this commit is to ensure there is a one-byte gap after the ATB
filled permanently with AT_FREE.

Fixes issue micropython#7116.

See also adafruit#5021

Signed-off-by: Jeff Epler <jepler@gmail.com>
Signed-off-by: Damien George <damien@micropython.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment