Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raspberrypi 8.0.0 socket exceptions when web workflow is enabled #7333

Closed
anecdata opened this issue Dec 12, 2022 · 23 comments · Fixed by #7589
Closed

raspberrypi 8.0.0 socket exceptions when web workflow is enabled #7333

anecdata opened this issue Dec 12, 2022 · 23 comments · Fixed by #7589
Assignees
Labels
bug rp2040 Raspberry Pi RP2040
Milestone

Comments

@anecdata
Copy link
Member

anecdata commented Dec 12, 2022

CircuitPython version

Adafruit CircuitPython 8.0.0-beta.5 on 2022-12-08; Raspberry Pi Pico W with rp2040

Code/REPL

import time
import traceback
import wifi
import socketpool
import ssl
import adafruit_requests
from secrets import secrets

TEXT_URL = "http://wifitest.adafruit.com/testwifi/index.html"

print("Connecting...")
wifi.radio.connect(secrets["ssid"], secrets["password"])
print("My IP address is", wifi.radio.ipv4_address)

pool = socketpool.SocketPool(wifi.radio)
requests = adafruit_requests.Session(pool, ssl.create_default_context())

while True:
    try:
        print("Fetching text from", TEXT_URL)
        response = requests.get(TEXT_URL)
        print("-" * 40)
        print(response.text)
        print("-" * 40)
    except Exception as e:
        traceback.print_exception(e, e, e.__traceback__)
    time.sleep(5)

Behavior

The above code, slightly modified boilerplate internet connect code, does not work on raspberrypi 8.0.0-beta.5 when web workflow is enabled (wifi credentials in .env file).

Consistently gets exception:

Traceback (most recent call last):
  File "code.py", line 139, in get_http
  File "adafruit_requests.py", line 718, in post
  File "adafruit_requests.py", line 679, in request
OutOfRetries: Repeated socket failures

It seems that the timeout in requests is kicking in within the library. No data is received. Occasionally ETIMEDOUT.

Underlying exceptions to the OutOfRetries occur (discovered through some print-debugging in adafruit_requests) if a manual timeout is supplied. Exceptions are initially OSError: [Errno 116] ETIMEDOUT in recv_into, waiting for the H of HTTP to kick off the receive. Then OSError: 32 in _send takes over on later requests. But again, sometimes there is no error, just no data received.

I suspect some low-level socket shenanigans. There should be 8 sockets available, web workflow uses at least one for the TCP listen, probably something for mDNS, and more with client(s) accessing web workflow features?

Interestingly If requests alternate between two URLs, after several failures for both, the second will succeed once. Rinse, repeat.

No issue with the above code on 8.0.0-beta.4, or on 8.0.0-beta.5 with web workflow disabled (no .env file).

Description

No response

Additional information

No response

@anecdata anecdata added bug rp2040 Raspberry Pi RP2040 labels Dec 12, 2022
@anecdata anecdata changed the title raspberrypi 8.0.0-beta.5 socket exceptions when workflow is enabled raspberrypi 8.0.0-beta.5 socket exceptions when web workflow is enabled Dec 12, 2022
@tannewt tannewt added this to the 8.0.0 milestone Dec 13, 2022
@tannewt
Copy link
Member

tannewt commented Dec 13, 2022

Web workflow uses 1 socket for mdns, 1 socket to accept connections with a buffer of 1, 1 "active" socket for web requests and 1 socket for the websocket connection. So, 4 or 5 depending on how the pending connection is handled.

@anecdata
Copy link
Member Author

anecdata commented Dec 13, 2022

On some URLs, with web workflow off, things appear to work. But with the added print-debugs to Requests to expose the pass'd and retried exceptions, it looks like on _send_request there is either OSError: [Errno 9] EBADF (in my case, for an mDNS address) or OSError: 32 [EPIPE?] (in my case, for a numeric local IPv4) on every other try, but the retry gets it, so nothing is reported with the standard library.

This happens in beta.4 and beta.5. So something has been lurking there that may be related to the web workflow beta.5 case.

@anecdata
Copy link
Member Author

Forgot to mention that with web workflow on raspberrypi beta.5, when using the above code, I've tried several URLs. With the Adafruit URL, I don't know what the server sees, but none are successful. When I hit a WAN Apache server I can access (using domain name), the server shows HTTP 200 status codes and occasionally some are successful, but usually error as described above. But a LAN Apache server (using IP address) shows HTTP 408 status codes (server timed out). So it looks like some level of connection and data transfer is made, but the reads tend to fail in most circumstances.

@anecdata
Copy link
Member Author

anecdata commented Dec 14, 2022

Did some testing on espressif for comparison: none of the exceptions occur with beta.4 or beta.5 on QT Py ESP32-S2 or QT Py ESP32-S3, including none of the "hidden" exceptions pass'd through Requests' retries.

However, with some URLs, the first try in the _send_request Requests loop gets an empty response (status code 200 on the servers in all observable cases) but the retry is successful.

Addendum: In another espressif code context (QT Py ESP32-S2 running beta.5, requests timeout set to 5 seconds), the first request in _send_request gets OSError: Unhandled ESP TLS error 78 0 8018 -78 every time, but the internal Requests retry is successful.

It does seem that there are several issues at play here, but the most immediate is raspberrypi not playing well with beta.5 and web workflow.

@jepler
Copy link
Member

jepler commented Dec 14, 2022

I'm testing with

Adafruit CircuitPython 8.0.0-beta.5-11-gfc13fba6e5-dirty on 2022-12-14; Raspberry Pi Pico W with rp2040

this contains some fixes to make a wider range of socket name lookup failures throw the "gaierror" exception as intended, but should be functionally identical to main.

wifi workflow is enabled. However, I have not connected to the web workflow interface or messed with resolving the device's mdns name from other computers.

To exclude as many factors as possible, I did not use adafruit_requests, but just wrote low level tests of getaddrinfo and socket.connect. all code was run by import from code.py, not sure why it would make a difference to have the import, it's just the way I quickly switch between pieces of test code.

import socketpool
import wifi


if wifi.radio.ipv4_address is None:
    print("connecting to wifi")
    wifi.radio.connect(os.getenv("WIFI_SSID"), os.getenv('WIFI_PASSWORD'))
print(f"local address {wifi.radio.ipv4_address}")

socket = socketpool.SocketPool(wifi.radio)

host_names = ['eric.local', 'rat.local', 'jo.local', 'example.com', 'unpythonic.net', 'github.com']

def cycle(*args):
    while True:
        for a in args:
            yield from a

print("Name resolution test")
success = failure = 0
for i, h in zip(range(100), cycle(host_names)):
    try:
        info = socket.getaddrinfo(h, 80)
        print(end='.')
        success += 1
    except Exception as e:
        print(f"{type(e)}: {e} {getattr(e, 'errno')}")
        failure += 1

print(f"Over {success+failure} attempts, {success} success {failure} failure")


print()
print("Causing a failure, should give gaierror")
try:
    print(socket.getaddrinfo('this-should.fail', 80))
except Exception as e:
    print(f"{type(e)}: {e} {getattr(e, 'errno')}")

print("TCP connection test")
target = ('rat.u', 443) # a local computer running a web server

success = failure = 0
for i in range(100):
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.connect(target)
            print(end='.')
        success += 1
    except Exception as e:
        print(f"{type(e)}: {e} {getattr(e, 'errno', None)}")
        failure += 1

print(f"Over {success+failure} attempts, {success} success {failure} failure")

print()
print("SSL connection test")
time.sleep(.1)
ctx = ssl.create_default_context()
success = failure = 0
for i in range(10): 
    try:
        with ctx.wrap_socket(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
            print(f"{s=}")
            time.sleep(.1)
            s.connect(target)
            s.send('HEAD / HTTP/1.1\r\nHost: zaphod.unpythonic.net\r\n\r\n')
            read_response(s)
            print(end='.')
        success += 1
    except Exception as e:
        print(f"{type(e)}: {e} {getattr(e, 'errno', None)}")
        failure += 1
        
print()
print(f"Over {success+failure} attempts, {success} success {failure} failure")

Unfortunately, this doesn't reproduce any problems for me. Typical output:

Auto-reload is on. Simply save files over USB to run them or enter REPL to disable.
code.py output:
local address 10.0.2.94
Name resolution test
....................................................................................................
Over 100 attempts, 100 success 0 failure

Causing a failure, should give gaierror
socket_resolve_host() returned -2
<class 'gaierror'>: (-2, 'Name or service not known') -2
TCP connection test
....................................................................................................
Over 100 attempts, 100 success 0 failure

SSL connection test
s=<SSLSocket>
b'HTTP/1.1 200 OK\r\nDate: Wed, 14 Dec 2022 21:24:14'
b' GMT\r\nServer: Apache/2.4.54 (Debian)\r\nLast-Modif'
b'ied: Sat, 08 Jul 2017 15:52:04 GMT\r\nETag: "29cd-'
b'553d054d154ae"\r\nAccept-Ranges: bytes\r\nContent-Le'
b'ngth: 10701\r\nVary: Accept-Encoding\r\nContent-Type'
b': text/html\r\n\r\n'
…repetitions snipped…
Over 10 attempts, 10 success 0 failure

Code done running.

@RetiredWizard
Copy link

RetiredWizard commented Dec 14, 2022

I tried running the original test script on the Pico W running the 12/8 beta.5 bits from s3 and after about 5 minutes I got the "OutOfRetries: Repeated socket failures" message.

I then flashed the latest bits from github and replaced my .env file with a settings.toml file and ran the test script which results in the success message being displayed repeatedly. I've run it now for several minutes without any failures.

@RetiredWizard
Copy link

RetiredWizard commented Dec 14, 2022

This is probably expected behavior but I hadn't noticed it before. Prior to running this script the Pico W shows up in the web workflow list of "CircuitPython devices on your network" from another idle device, but while the script is running the device doesn't show up in the list.....

@jepler
Copy link
Member

jepler commented Dec 16, 2022

This is probably expected behavior but I hadn't noticed it before. Prior to running this script the Pico W shows up in the web workflow list of "CircuitPython devices on your network" from another idle device, but while the script is running the device doesn't show up in the list.....

No, I think it would be expected to still show up. @tannewt can you say for sure?

@tannewt
Copy link
Member

tannewt commented Dec 16, 2022

I'd expect it to show up. However, mdns does use udp so it may be missed occasionally due to that.

@RetiredWizard
Copy link

I did check it out on a Feather ESP32-S3 and didn't see the same behavior so I opened an issue up on it #7346

@anecdata
Copy link
Member Author

anecdata commented Dec 21, 2022

I just tested this with latest S3
Adafruit CircuitPython 8.0.0-beta.5-22-ge9f032f46 on 2022-12-20; Raspberry Pi Pico W with rp2040
and .env converted to settings.toml.

This works now with web workflow enabled. (Requests is still often getting an OSError: 32 on the first, hidden, try, but the internal retry is successful.)

Does anyone know what was broken in beta.5, and what fixed it?

@anecdata
Copy link
Member Author

anecdata commented Feb 4, 2023

Adafruit CircuitPython 8.0.0-rc.2 on 2023-02-02; Raspberry Pi Pico W with rp2040

Re-opening based on:
https://discord.com/channels/327254708534116352/537365702651150357/1071240927936512092
adafruit/Adafruit_CircuitPython_Requests#126
https://discord.com/channels/327254708534116352/537365702651150357/1071516251693777009
It's very odd that when the issue occurs, it persists across microcontroller.reset() and power cycles. Either something is affecting flash(?), or perhaps some non-volatile state in the wifi module(??).

@anecdata anecdata reopened this Feb 4, 2023
@anecdata
Copy link
Member Author

anecdata commented Feb 4, 2023

I suspect there may be some marginal timing (perhaps variable) where (TLS?) sockets aren't ready at code start, and the perception of persistence across power / reset is just the code not working. But once it doesn't work, it will continue to not work until some delay is added and the device is reset.

This works after a reset and ongoing, but if STARTUP_WAIT = 0 it will fail with the OSError: (-29312, 'MBEDTLS_ERR_SSL_CONN_EOF').

import wifi
import time
import socketpool
import ssl

STARTUP_WAIT = 5
ITERATIONS = 10
HOST = "example.com"
PATH = "/"
PORT = 443
MAXBUF = 4096

time.sleep(STARTUP_WAIT)  # wait for serial (and maybe wait for sockets)

print(f"{'='*25}")
print("Web Workflow enabled - already connected to AP")

pool = socketpool.SocketPool(wifi.radio)

for _ in range(ITERATIONS):
    print(f"{'-'*25}\nCreate TCP Client Socket")
    with pool.socket(pool.AF_INET, pool.SOCK_STREAM) as sock:
        s = ssl.create_default_context().wrap_socket(sock, server_hostname=HOST)

        print("Connecting")
        s.connect((HOST, PORT))

        size = s.send(f"HEAD {PATH} HTTP/1.1\r\nHost: {HOST}:{PORT}\r\n\r\n".encode())
        print("Sent", size, "bytes")

        # just get the first hunk and call it a day
        buf = bytearray(MAXBUF)
        size = s.recv_into(buf)
        print('Received', size, "bytes", buf[:size])

Indeed, even @jepler's code behaves the same way if the TLS block is attempted first:

EXPAND...
import time
import socketpool
import wifi
import ssl


time.sleep(0)  # wait for serial (and maybe wait for sockets)
print(f"{'-'*25}")

if wifi.radio.ipv4_address is None:
    print("connecting to wifi")
    wifi.radio.connect(os.getenv("WIFI_SSID"), os.getenv('WIFI_PASSWORD'))
print(f"local address {wifi.radio.ipv4_address}")

socket = socketpool.SocketPool(wifi.radio)

print()
print("SSL connection test")
time.sleep(.1)
ctx = ssl.create_default_context()
success = failure = 0
for i in range(10): 
    try:
        with ctx.wrap_socket(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
            print(f"{s=}")
            time.sleep(.1)
            s.connect(("example.com", 443))
            s.send('HEAD / HTTP/1.1\r\nHost: example.com\r\n\r\n')
            buf = bytearray(4096)
            size = s.recv_into(buf)
            print('Received', size, "bytes", buf[:size])
        success += 1
    except Exception as e:
        print(f"{type(e)}: {e} {getattr(e, 'errno', None)}")
        failure += 1
        
print()
print(f"Over {success+failure} attempts, {success} success {failure} failure")

Addendum: The issue does not appear to be TLS-specific. The original example fails as well after a reset without a delay:

EXPAND...
import time
import traceback
import wifi
import socketpool
import ssl
import adafruit_requests


TEXT_URL = "http://wifitest.adafruit.com/testwifi/index.html"
STARTUP_WAIT = 0


time.sleep(STARTUP_WAIT)  # wait for serial (and maybe wait for sockets)

print(f"{'='*25}")
print("Web Workflow enabled - already connected to AP")

pool = socketpool.SocketPool(wifi.radio)
requests = adafruit_requests.Session(pool, ssl.create_default_context())

while True:
    try:
        print("Fetching text from", TEXT_URL)
        response = requests.get(TEXT_URL)
        print("-" * 40)
        print(response.text)
        print("-" * 40)
    except Exception as e:
        traceback.print_exception(e, e, e.__traceback__)
    time.sleep(5)

Perhaps related to #7313 (thanks, @Neradoc) - one second may not be enough.

@dhalbert
Copy link
Collaborator

dhalbert commented Feb 6, 2023

@jepler Is there something we need to wait for when starting up the co-processor?

Sounds like same problem reported here: https://forums.adafruit.com/viewtopic.php?t=198486 (noted by mgmt). I haven't caught up on discord this weekend, so maybe this is all already known and linked.

@anecdata
Copy link
Member Author

anecdata commented Feb 6, 2023

hyde00001 confirmed on Discord just now that a delay at the start of the code works as above, once the board is then reset.

@jepler
Copy link
Member

jepler commented Feb 6, 2023

I have not seen a documented need for such a delay.

I do not understand how to check the latest reproducer script to see the resulting exception, since without a delay I won't be able to connect and see the printed exception.

What exception does the non-tls version produce?

@anecdata
Copy link
Member Author

anecdata commented Feb 6, 2023

No exception, it (non-tls) just times out with zero bytes received. Try your TLS loop but with no TLS right at the top of code, no delay, right after a reset. You'll miss the first prints, but subsequent loops will print.

This works with 5 seconds delay after reset, but not with 0 seconds delay after reset:

EXPAND...
import time
import socketpool
import wifi
import ssl
import traceback


time.sleep(5)  # wait for serial (and maybe wait for sockets)
print(f"{'='*25}")
print("Web Workflow enabled - already connected to AP")

socket = socketpool.SocketPool(wifi.radio)

print()
print("non-SSL connection test")
time.sleep(.1)
# ctx = ssl.create_default_context()
success = failure = 0
for i in range(10): 
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            print(f"local address {wifi.radio.ipv4_address}")
            print(f"{s=}")
            time.sleep(.1)
            s.connect(("wifitest.adafruit.com", 80))
            # "http://wifitest.adafruit.com/testwifi/index.html"
            s.send('HEAD /testwifi/index.html HTTP/1.1\r\nHost: wifitest.adafruit.com\r\n\r\n')
            buf = bytearray(1024)
            size = s.recv_into(buf)
            print('Received', size, "bytes", buf[:size])
        success += 1
    except Exception as e:
        traceback.print_exception(e, e, e.__traceback__)
        print(f"{type(e)}: {e} {getattr(e, 'errno', None)}")
        failure += 1
        
print()
print(f"Over {success+failure} attempts, {success} success {failure} failure")

@anecdata
Copy link
Member Author

anecdata commented Feb 6, 2023

It may not be possible to reproduce it in MicroPython since it's really more like the non-web-workflow case in CircuitPython (where it's likely that the time to set up and connect to the AP is enough). This MicroPython code without a delay (rp2-pico-w-20230203-unstable-v1.19.1-852-g9ea64a36a.uf2) works across resets:

EXPAND...
import time

ITERATIONS = 10

# Connect to network
import network

wlan = network.WLAN(network.STA_IF)
wlan.active(True)
wlan.connect('ssid', 'password')

# Should be connected and have an IP address
wlan.status() # 3 == success
wlan.ifconfig()

# Get IP address for destination
import socket
ai = socket.getaddrinfo("wifitest.adafruit.com", 80)
addr = ai[0][-1]

for _ in range(ITERATIONS):
    # Create a socket and make a HTTP request
    s = socket.socket()
    s.connect(addr)
    # s.send(b"GET / HTTP/1.0\r\n\r\n")
    s.send('HEAD /testwifi/index.html HTTP/1.1\r\nHost: wifitest.adafruit.com\r\n\r\n')
    # Print the response
    print(s.recv(512))
    s.close()
    time.sleep(1)

time.sleep(10)
import machine
machine.reset()

@dhalbert
Copy link
Collaborator

dhalbert commented Feb 6, 2023

I have not seen a documented need for such a delay.

Sorry - I missed that is with the web workflow only.

@dhalbert dhalbert removed this from the 8.0.0 milestone Feb 6, 2023
@dhalbert dhalbert added this to the 8.1.0 milestone Feb 6, 2023
@dhalbert
Copy link
Collaborator

dhalbert commented Feb 7, 2023

import wifi has the side effect of calling common_hal_wifi_init(). If you move the delay to before the import wifi, does it affect what happens?

@anecdata
Copy link
Member Author

anecdata commented Feb 7, 2023

Interesting. I didn't expect that to behave the same, but it did. The issue appears more fundamental than wifi init.

This code works with a 5-second delay, but not with 0 seconds (again non-SSL version times out with 0 bytes received ; SSL version gets the (-29312, 'MBEDTLS_ERR_SSL_CONN_EOF') exception):

EXPAND...
import time

time.sleep(5)  # wait for serial and cyw43
print(f"{'='*25}")
print("Web Workflow enabled - already connected to AP")

import wifi
import socketpool
# import ssl
import traceback

socket = socketpool.SocketPool(wifi.radio)

print()
print("non-SSL connection test")
time.sleep(.1)
# ctx = ssl.create_default_context()
success = failure = 0
for i in range(10): 
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            print(f"local address {wifi.radio.ipv4_address}")
            print(f"{s=}")
            time.sleep(.1)
            s.connect(("wifitest.adafruit.com", 80))
            # "http://wifitest.adafruit.com/testwifi/index.html"
            s.send('HEAD /testwifi/index.html HTTP/1.1\r\nHost: wifitest.adafruit.com\r\n\r\n')
            buf = bytearray(1024)
            size = s.recv_into(buf)
            print('Received', size, "bytes", buf[:size])
        success += 1
    except Exception as e:
        traceback.print_exception(e, e, e.__traceback__)
        print(f"{type(e)}: {e} {getattr(e, 'errno', None)}")
        failure += 1
        
print()
print(f"Over {success+failure} attempts, {success} success {failure} failure")

I guess that makes sense though, even microcontroller.reset() (which is how I typically test this) presumably uses some common hal or driver function to reset the rp2040, which also presumably causes some reset of the cyw43 module.

@DavePutz
Copy link
Collaborator

Clearly some timing issue here, though I cannot see where. Putting a "mp_hal_delay_ms(1500);" at the top of common_hal_socketpool_socket() in ports/raspberrypi/common-hal/socketpool/Socket.c makes everything work, but that's quite a hack.

@anecdata anecdata changed the title raspberrypi 8.0.0-beta.5 socket exceptions when web workflow is enabled raspberrypi 8.0.0 socket exceptions when web workflow is enabled Feb 14, 2023
@dhalbert dhalbert linked a pull request Feb 14, 2023 that will close this issue
tannewt added a commit that referenced this issue Feb 16, 2023
Increase number of LWIP timers for MDNS (fixes #7333)
@anecdata
Copy link
Member Author

Tested. Fixed by #7589. Thanks, @gneverov!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug rp2040 Raspberry Pi RP2040
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants