Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NVS/Preferences-lib getString() locks up intermittently #740

Closed
Curclamas opened this issue Oct 17, 2017 · 17 comments
Closed

NVS/Preferences-lib getString() locks up intermittently #740

Curclamas opened this issue Oct 17, 2017 · 17 comments

Comments

@Curclamas
Copy link
Contributor

Hardware:

Board: Custom Board
Core Installation/update date: 12.10.2017
IDE name: IDF component
Flash Frequency: 80Mhz
Upload Speed: 115200

Description:

I use Preferences lib to store some information once and read it periodically. Some reads occur every two minutes. After a few hours up to a few days the application will lock itself upon calling preferences.getString().

Sketch:

#include <Arduino.h>
#include <Preferences.h>

void someTask() { // 0 priority
        while(1){
           Preferences * preferences = new Preferences;
           preferences->begin("Logger", true);
           String facilityId = preferences->getString("facilityId", "unknown");
           preferences->end();
           delete preferences;

           // use facilityId...

           delay(1000 * 60 * 2);
        }
}

Debug Messages:

D (4673172) nvs: nvs_open_from_partition Logger 0
D (4673172) nvs: nvs_get_str_or_blob facilityId
D (4673178) nvs: nvs_get_str_or_blob facilityId
D (4673186) nvs: nvs_close 82

In case of the lockup, the last line is missing, implying that .getString() locks somehow internally. The task WDT does not always fire upon this lockup, causing the device to freeze until manual reset.

@me-no-dev
Copy link
Member

Can you try changing this line to:

char buf[len+1];
buf[len] = 0;

I doubt this is the issue, but still :)

@Curclamas
Copy link
Contributor Author

Hey @me-no-dev I've tried it, but sadly it did not work.
On the good side: I have a minimum code example now. It crashes after let's say a few secs to a minute and on my custom board as well as on a ESP32 DevBoard. It only happens, when you have more than one task, with heavy CPU usage on the other task:

#include <WiFi.h>
const char* ssid     = "***";
const char* password = "***";

#define USE_SERIAL Serial

#include <Preferences.h>
Preferences preferences;

#include <HTTPClient.h>
const char*  server = "www.howsmyssl.com";  // Server URL

// www.howsmyssl.com root certificate authority, to verify the server
// change it to your server root CA
// SHA1 fingerprint is broken now!

const char* test_root_ca= \
     "-----BEGIN CERTIFICATE-----\n" \
     "MIIEkjCCA3qgAwIBAgIQCgFBQgAAAVOFc2oLheynCDANBgkqhkiG9w0BAQsFADA/\n" \
     "MSQwIgYDVQQKExtEaWdpdGFsIFNpZ25hdHVyZSBUcnVzdCBDby4xFzAVBgNVBAMT\n" \
     "DkRTVCBSb290IENBIFgzMB4XDTE2MDMxNzE2NDA0NloXDTIxMDMxNzE2NDA0Nlow\n" \
     "SjELMAkGA1UEBhMCVVMxFjAUBgNVBAoTDUxldCdzIEVuY3J5cHQxIzAhBgNVBAMT\n" \
     "GkxldCdzIEVuY3J5cHQgQXV0aG9yaXR5IFgzMIIBIjANBgkqhkiG9w0BAQEFAAOC\n" \
     "AQ8AMIIBCgKCAQEAnNMM8FrlLke3cl03g7NoYzDq1zUmGSXhvb418XCSL7e4S0EF\n" \
     "q6meNQhY7LEqxGiHC6PjdeTm86dicbp5gWAf15Gan/PQeGdxyGkOlZHP/uaZ6WA8\n" \
     "SMx+yk13EiSdRxta67nsHjcAHJyse6cF6s5K671B5TaYucv9bTyWaN8jKkKQDIZ0\n" \
     "Z8h/pZq4UmEUEz9l6YKHy9v6Dlb2honzhT+Xhq+w3Brvaw2VFn3EK6BlspkENnWA\n" \
     "a6xK8xuQSXgvopZPKiAlKQTGdMDQMc2PMTiVFrqoM7hD8bEfwzB/onkxEz0tNvjj\n" \
     "/PIzark5McWvxI0NHWQWM6r6hCm21AvA2H3DkwIDAQABo4IBfTCCAXkwEgYDVR0T\n" \
     "AQH/BAgwBgEB/wIBADAOBgNVHQ8BAf8EBAMCAYYwfwYIKwYBBQUHAQEEczBxMDIG\n" \
     "CCsGAQUFBzABhiZodHRwOi8vaXNyZy50cnVzdGlkLm9jc3AuaWRlbnRydXN0LmNv\n" \
     "bTA7BggrBgEFBQcwAoYvaHR0cDovL2FwcHMuaWRlbnRydXN0LmNvbS9yb290cy9k\n" \
     "c3Ryb290Y2F4My5wN2MwHwYDVR0jBBgwFoAUxKexpHsscfrb4UuQdf/EFWCFiRAw\n" \
     "VAYDVR0gBE0wSzAIBgZngQwBAgEwPwYLKwYBBAGC3xMBAQEwMDAuBggrBgEFBQcC\n" \
     "ARYiaHR0cDovL2Nwcy5yb290LXgxLmxldHNlbmNyeXB0Lm9yZzA8BgNVHR8ENTAz\n" \
     "MDGgL6AthitodHRwOi8vY3JsLmlkZW50cnVzdC5jb20vRFNUUk9PVENBWDNDUkwu\n" \
     "Y3JsMB0GA1UdDgQWBBSoSmpjBH3duubRObemRWXv86jsoTANBgkqhkiG9w0BAQsF\n" \
     "AAOCAQEA3TPXEfNjWDjdGBX7CVW+dla5cEilaUcne8IkCJLxWh9KEik3JHRRHGJo\n" \
     "uM2VcGfl96S8TihRzZvoroed6ti6WqEBmtzw3Wodatg+VyOeph4EYpr/1wXKtx8/\n" \
     "wApIvJSwtmVi4MFU5aMqrSDE6ea73Mj2tcMyo5jMd6jmeWUHK8so/joWUoHOUgwu\n" \
     "X4Po1QYz+3dszkDqMp4fklxBwXRsW10KXzPMTZ+sOPAveyxindmjkW8lGy+QsRlG\n" \
     "PfZ+G6Z6h7mjem0Y+iWlkYcV4PIWL1iwBi8saCbGS5jN2p8M+X+Q7UNKEkROb3N6\n" \
     "KOqkqm57TH2H3eDJAkSnh6/DNFu0Qg==\n" \
     "-----END CERTIFICATE-----\n";

void coreTask(void * pvParameters) {
  while(1) {
    HTTPClient http;
  
    USE_SERIAL.print("[HTTP] begin...\n");
    http.begin("https://www.howsmyssl.com/a/check", test_root_ca); //HTTPS
  
    USE_SERIAL.print("[HTTP] GET...\n");
    // start connection and send HTTP header
    int httpCode = http.GET();
  
    // httpCode will be negative on error
    if(httpCode > 0) {
        // HTTP header has been send and Server response header has been handled
        USE_SERIAL.printf("[HTTP] GET... code: %d\n", httpCode);
  
        // file found at server
        if(httpCode == HTTP_CODE_OK) {
            String payload = http.getString();
            USE_SERIAL.println(payload);
        }
    } else {
        USE_SERIAL.printf("[HTTP] GET... failed, error: %s\n", http.errorToString(httpCode).c_str());
    }
  
    http.end();
    delay(1 * 1000);
  }
}

void setup() {
  preferences.begin("Logger", false);
  preferences.putString("facilityId", "qwertz123");
  preferences.end();
  
  Serial.begin(115200);  
  WiFi.begin(ssid, password);

  while (WiFi.status() != WL_CONNECTED) {
      delay(500);
      Serial.print(".");
  }

  xTaskCreatePinnedToCore(
                    coreTask,   /* Function to implement the task */
                    "coreTask", /* Name of the task */
                    10000,      /* Stack size in words */
                    NULL,       /* Task input parameter */
                    10,          /* Priority of the task */
                    NULL,       /* Task handle. */
                    1);  /* Core where the task should run */
}

void loop() {
  Serial.println("preferences.begin:");
  preferences.begin("Logger", true);
  Serial.println("preferences.getString:");
  String facilityId = preferences.getString("facilityId", "unknown");
  Serial.println("preferences.end:");
  preferences.end();
  Serial.print("facilityId: ");
  Serial.println(facilityId);
  Serial.print("Uptime: ");
  Serial.print(millis()/1000);
  Serial.println(" s");
  delay(100);
}

@me-no-dev
Copy link
Member

OK, I can confirm this. It seems to break on while mbedtls_ssl_handshake is running.
Please file in an Issue over at espressif/esp-idf and link to this issue :)

@copercini
Copy link
Contributor

Strange, using coreTask with priority 1 it works fine here

@me-no-dev
Copy link
Member

@copercini it should work regardless :)

@Curclamas
Copy link
Contributor Author

@me-no-dev I've opened a ticket in esp-idf as well: espressif/esp-idf#1157

@me-no-dev
Copy link
Member

@Curclamas can you please try the latest code :) just merged IDF last night

@Curclamas
Copy link
Contributor Author

@me-no-dev sadly, after the update my Arduino-IDE throws:
java.io.IOException: Cannot run program "{tools.esptool.path}.exe": CreateProcess error=2
"Cannot get file" or something like that.

@Curclamas
Copy link
Contributor Author

@me-no-dev after checking out 55289a4 (skipping 4f55293 and 84b8582, those seam to break my build process) getString still freezes.

@igrr
Copy link
Member

igrr commented Oct 30, 2017

I've looked briefly into this, and it seems that the following is happening:

  1. spi_flash operation is started from low-priority task on CPU1
  2. It uses IPC to wake up high-priority IPC0 task on CPU0, preventing all other tasks on CPU0 from running. This is needed to safely disable the cache.
  3. While the task which started spi_flash operation (loop_task) is waiting for IPC0 task to acknowledge that CPU0 is not using cache anymore, it is preempted by a higher priority task doing TLS connection (coreTask).
  4. TLS task starts reading from socket. It seems that it never blocks on the socket though, repeatedly calling net_would_block (and taking close to 100% CPU time while doing that).
  5. TCP/IP task is on CPU0, and it does not run because IPC0 task on CPU0 is running, waiting for the flash operation to start.

In this situation we have a deadlock, because TLS task never receives any more data, and it doesn't yield to lower-priority task which is doing flash operation. There are a few more possibly interesting things happening in this test, but that's the main part for now.

There are two issues to be addressed, one in IDF and another in Arduino.

  1. spi_flash operations should prevent task switching on the core where they are started, while waiting for the other core to disable the cache. This should prevent deadlocks similar to the one described above. I will provide a temporary patch for this for testing purposes in IDF issue linked above.

  2. WiFiClientSecure implementation should be changed to avoid busy waits on incoming network data. I hope @copercini can help with this part.

@copercini
Copy link
Contributor

copercini commented Oct 30, 2017

@igrr Thanks for your explanation and nice found!

As you said, there is no reason for checking if handshake was done too fast causing this high CPU utilization and power waste, so I PRed a very simple code to slow down the calling of net_would_block during TLS handshake =)

It seems there is the same problem in the mbedtls example of IDF here, but unfortunately I don't have a debugger to see deeply what is happening, how the tasks are scheduled and so on (waiting for 11/11 promotional prices of wrover kit hehe)

@Curclamas
Copy link
Contributor Author

@copercini I've applied the code from you PR and it made the lockup less often. Now it still locks up sometimes after [V][ssl_client.cpp:240] send_ssl_data(): Writing HTTP request... .
I looked up the code and see there is a similar while()-loop. However adding the vTaskDelay() here did not help.

@copercini
Copy link
Contributor

@Curclamas Try change this delay(0) by delay(10)

@Curclamas
Copy link
Contributor Author

@copercini This seams indeed to fix it (@iggr 's patch from espressif/esp-idf#1157 also does). Probably both of these precautions should be implemented. Does it make sense to also have the delay in send_ssl_data?

@Curclamas
Copy link
Contributor Author

@copercini I see 14dd44a is now in master. Maybe we should also add a fix for the HTTPclient (https://github.com/espressif/arduino-esp32/blob/master/libraries/HTTPClient/src/HTTPClient.cpp#L997)?

copercini added a commit that referenced this issue Nov 6, 2017
fix for #740
@copercini
Copy link
Contributor

copercini commented Nov 6, 2017

@Curclamas yes! PRed too, thanks for remember me!

Does it make sense to also have the delay in send_ssl_data?

It's a complicated thing, for low datarates it's ok, but for faster transfers maybe it's not adequate have a delay in a transmission function...

@Curclamas
Copy link
Contributor Author

Curclamas commented Nov 6, 2017 via email

me-no-dev pushed a commit that referenced this issue Nov 6, 2017
catalinio pushed a commit to catalinio/pycom-esp-idf that referenced this issue Jun 28, 2019
The fix is for the situation when cache disabling mechanism causes
a deadlock with user tasks. Situation is as follows:

1. spi_flash operation is started from low-priority task on CPU0
2. It uses IPC to wake up high-priority IPC1 task on CPU1, preventing
   all other tasks on CPU1 from running. This is needed to safely
   disable the cache.
3. While the task which started spi_flash operation is waiting for IPC1
   task to acknowledge that CPU1 is not using cache anymore, it is
   preempted by a higher priority application task ("app0").
4. Task app0 busy-waits for some operation on CPU1 to complete. But
   since application tasks are blocked out by IPC1 task, this never
   happens. Since app0 is busy-waiting, the task doing spi flash
   operation never runs.

The more or less logical soltion to the problem would be to also do
cache disabling on CPU0 and the SPI flash operation itself from IPC0
task. However IPC0 task stack would need to be increased to allow doing
SPI flash operation (and IPC1 stack as well). This would waste some
memory. An alternative approach adopted in this fix is to call FreeRTOS
functions to temporary increase the priority of SPI flash operation task
to the same level as the IPC task.

Fixes espressif/arduino-esp32#740
Fixes espressif/esp-idf#1157
turmary pushed a commit to Seeed-Studio/Seeed_Arduino_atHTTPClient that referenced this issue Jan 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants