-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
esp_modem "NO CARRIER" message not correctly handled (IDFGH-10017) #287
Comments
But you receive the
If not, please make sure that you've registered on the specific event (or on all events from
|
No, I receive no event in this situation |
Thanks for sharing the log. It seems like the device is just sending I'm afraid that in this case, there's no way to tell that the connection failed and not able to post any event. The solution is to use the LCP sub-protocol as noted above and check for the
ESP_ERROR_CHECK(esp_event_handler_register(NETIF_PPP_STATUS, ESP_EVENT_ANY_ID, &on_ppp_changed, NULL));
static void on_ppp_changed(void *arg, esp_event_base_t base, int32_t event_id, void *data)
{
if (event_id == NETIF_PPP_ERRORPEERDEAD) {
ESP_LOGE(TAG, "Connection lost!");
}
} |
David, many thanks for your support. Your suggestions are very useful.
|
LCP is a link protocol within the PPP session, so shouldn't generate anything above IP = no data traffic from the modem to cellular provider.
I would expect entering a disconnection state of the PPP session (phases of "terminate" or "dead"). As you can see the
No, I can easily reproduce your issue just by disconnecting the UART wires to my device. Client's side has no knowledge on the server, cannot tell if it's been disconnecting without the LCP keepalive mechanism.
TBH, I'm not surprised. After providing support for this library I found a few deviations in some devices, usually related to switching modes and CMUX. This issue is related to both 😄 |
Thanks David. E (146511) TRANSPORT_BASE: poll_read select error 113, errno = Software caused connection abort, fd = 54 The Connection lost event has triggered 12 secs after the first Transport Base fail. It is normal? |
Yes, 12 seconds sound about right (4 seconds to one 1 echo timeout, another 4 and a half to treat it as a failure). But, the weird part is the first error message you shared:
Are you getting this immediately after receiving the Could you please enable these options in lwip component config?
and share the log? Also, is it possible for you to easily switch to IDF |
The messages indicating trouble on connection appears few seconds after NO CARRIER (4-5 secs). pppos_write[1]: len=12 My application receive only the mqtt disconnect event. Because I don't know the cause, I retry to re-connect many times. Only after 12 seconds I realize that the PPP was lost. If it is normal, I try to workaround I have problem to switch now to IDF 5.0, sorry. I tried to build my application with 5.0 some months ago, but without success.... I should try it again, but I don't have much time right now... |
David, have you had the opportunity to look at the log? |
This looks weird, could you please try to modify your handler: static void on_ppp_changed(void *arg, esp_event_base_t base, int32_t event_id, void *data)
{
if (event_id == NETIF_PPP_ERRORPEERDEAD || event_id == NETIF_PPP_ERRORCONNECT) {
ESP_LOGE(TAG, "Connection lost!");
}
} to see if you get the connection-lost event (earlier)? I can see in the log that the phase got updated correctly (after those 4-5 seconds), but for some reason it's not propagated as the connection error event. |
I have added the new event on the handler as you suggested, but unfortunately the result don't change. |
I have also discover now that, after enabled the LCP echo, the OTA process does not work anymore ☹ I (71897) HTTP_CLIENT: Body received in fetch header state, 0x3fcc5cab, 205 |
I'm sorry. the phase to look for early disconnection is Terminate, so the event-id should have been: It should be also possible to enable netif_set_status_callback(lwip_netif, netif_status_callback); that should be called when the link gets down, somewhere around these lines in the log:
As for the OTA. This is a problem as the intensive flash operation might stall the CPU for longer than LCP timeout. Maybe its possible to adjust the timeouts to enable both OTA and timely responsiveness. Another option is to implement something like a two phase OTA, which would pause the network before flashing and resume it after flashing. |
Sorry for the delaied response. I tried both the terminate_event and the status_callback without significant improvements. Instead, what do you pratical mean with "two phase OTA, which would pause the network before flashing and resume it after flashing"? It is not possbile to run-time change the LCP timeout parameters? |
This is weird, it must work even on
I meant using native OTA interface directly (instead of https-ota component), similarly as you can see https://github.com/espressif/esp-idf/tree/master/examples/system/ota/native_ota_example
|
Thanks David! Ok, now i remember what the problem was... perhaps a small example would be help me |
I'm sorry, was suggesting direct interfacing lwIP API, which is not always very simple. Here's a quick demo of the two methods:
static void on_ppp_changed(void *arg, esp_event_base_t base, int32_t event_id, void *data)
{
ESP_LOGI(TAG, "on_ppp_changed: Event %x", event_id);
}
....
// in main (after creating `esp_netif`, before starting the DCE
esp_netif_ppp_config_t ppp_config = { true, true };
esp_netif_ppp_set_params(ppp_netif, &ppp_config);
ESP_ERROR_CHECK(esp_event_handler_register(NETIF_PPP_STATUS, ESP_EVENT_ANY_ID, &on_ppp_changed, NULL)); you should get notified of all the PPP events. When the interface goes down, you should get:
#include "lwip/netif.h"
...
void netif_status_callback(struct netif *netif)
{
ESP_LOGI(TAG, "netif(%p) is %d", netif, netif_is_up(netif));
}
...
static void on_ip_event(void *arg, esp_event_base_t event_base,
int32_t event_id, void *event_data)
{
ESP_LOGD(TAG, "IP event! %d", event_id);
if (event_id == IP_EVENT_PPP_GOT_IP) {
... here we're sure that the lwip netif has been created
struct netif *lwip_netif = NULL;
NETIF_FOREACH(lwip_netif) {
if (lwip_netif->name[0] == 'p' && lwip_netif->name[1] == 'p') {
break;
}
}
if (lwip_netif != NULL) {
netif_set_status_callback(lwip_netif, netif_status_callback);
}
}
} |
David, it works! I (90598) Gest4G: PPP state changed event 267 but if I disable the LCP Echo no events are triggered. |
Does it also fail for you if you have the UART ISR in
If so, we should probably add more functions to IRAM, the reception part up to (alternatively, you can implement this two phase OTA as mentioned previously using native OTA API -- I can help to put together a simple demo) |
David, I had already activated this option when I started working with the ota via modem, and this solve the problem. |
David, I update the situation. When I use the LAN interface (the other interface of my board) seems to works. However, I notice that it is slower than the previous implementation: 37 vs 23 seconds. But when I try the fw update over PPP the result is this:
Edit: I use the uart flow control mode +IFC=2,0 - I will check if it works effectively. But I don't know if this could be the problem. |
@FabITA1 The IDF example with native-ota example follows the same workflow as the default https-ota example.
here's a quick patch that modifies the IDF example accordingly: --- a/examples/system/ota/native_ota_example/main/native_ota_example.c
+++ b/examples/system/ota/native_ota_example/main/native_ota_example.c
@@ -139,6 +140,7 @@ static void ota_example_task(void *pvParameter)
int binary_file_length = 0;
/*deal with all receive packet*/
bool image_header_was_checked = false;
+ bool partition_erased = false;
while (1) {
int data_read = esp_http_client_read(client, ota_write_data, BUFFSIZE);
if (data_read < 0) {
@@ -182,16 +184,32 @@ static void ota_example_task(void *pvParameter)
}
#endif
- image_header_was_checked = true;
-
- err = esp_ota_begin(update_partition, OTA_WITH_SEQUENTIAL_WRITES, &update_handle);
- if (err != ESP_OK) {
- ESP_LOGE(TAG, "esp_ota_begin failed (%s)", esp_err_to_name(err));
- http_cleanup(client);
- esp_ota_abort(update_handle);
- task_fatal_error();
+ if (!partition_erased) {
+ partition_erased = true;
+ // now we disconnect
+ ESP_LOGW(TAG, "Disconnecting before bulk erase");
+ esp_http_client_close(client);
+ err = esp_ota_begin(update_partition, OTA_SIZE_UNKNOWN, &update_handle);
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "esp_ota_begin failed (%s)", esp_err_to_name(err));
+ http_cleanup(client);
+ esp_ota_abort(update_handle);
+ task_fatal_error();
+ }
+ ESP_LOGI(TAG, "esp_ota_begin succeeded..");
+ // and reconnect
+ ESP_LOGW(TAG, "Resuming network operations");
+ err = esp_http_client_open(client, 0);
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "Failed to open HTTP connection: %s", esp_err_to_name(err));
+ esp_http_client_cleanup(client);
+ task_fatal_error();
+ }
+ esp_http_client_fetch_headers(client);
+ continue;
+ } else {
+ image_header_was_checked = true;
}
- ESP_LOGI(TAG, "esp_ota_begin succeeded");
} else {
ESP_LOGE(TAG, "received package is not fit len");
http_cleanup(client); (note that you can place CMUX-exit and CMUX-enter at the placeholders -- warning logs about disconnection and reconnection , but it would probably also work without it, it's just expected to disconnect from the PPP server) |
@david-cermak thanks!! E (1004186) TRANSPORT_BASE: esp_tls_conn_read error, errno=No more processes In all other cases, the process stucks almost immediately. Here an example of what happens: I (66352) FwUpg: Running partition type 0 subtype 0 (offset 0x00010000) Maybe I didn't understand what to do? [Off topic: but is not possible to simply deactivate/activate the LCP Echo option while the application is running?] Thank you |
This is weird, I don't see any terminate event, or other ppp transitions caused by LCP timeouts.
These are compile time configuration, but I think we can provide an lwip hook to let users update it runtime. I'll make an example. |
Yes, I confirm David, the Uart ISR in IRAM flag is set. |
But it doesn't fail due to LCP timeout (at least cannot see the issue in your log). Does this native OTA work when you disable LCP keepalive?
This is not possible, unfortunately. We'd have to update IDF to support that (no need to update lwip tho). Could you please check if after applying this patch you can complete the OTA (after calling diff --git a/components/esp_netif/include/esp_netif_ppp.h b/components/esp_netif/include/esp_netif_ppp.h
index 159388e80e..a78934ac7f 100644
--- a/components/esp_netif/include/esp_netif_ppp.h
+++ b/components/esp_netif/include/esp_netif_ppp.h
@@ -120,6 +120,8 @@ esp_err_t esp_netif_ppp_set_params(esp_netif_t *netif, const esp_netif_ppp_confi
*/
esp_err_t esp_netif_ppp_get_params(esp_netif_t *netif, esp_netif_ppp_config_t *config);
+esp_err_t esp_netif_disable_lcp(esp_netif_t *esp_netif);
+
#ifdef __cplusplus
}
#endif
diff --git a/components/esp_netif/lwip/esp_netif_lwip_ppp.c b/components/esp_netif/lwip/esp_netif_lwip_ppp.c
index 4bf697bc27..46dd95a10f 100644
--- a/components/esp_netif/lwip/esp_netif_lwip_ppp.c
+++ b/components/esp_netif/lwip/esp_netif_lwip_ppp.c
@@ -269,6 +269,16 @@ esp_err_t esp_netif_start_ppp(esp_netif_t *esp_netif)
return ESP_OK;
}
+esp_err_t esp_netif_disable_lcp(esp_netif_t *esp_netif)
+{
+ netif_related_data_t *netif_related = esp_netif->related_data;
+ lwip_peer2peer_ctx_t *ppp_ctx = (lwip_peer2peer_ctx_t *)netif_related;
+ assert(ppp_ctx->base.netif_type == PPP_LWIP_NETIF);
+ ppp_ctx->ppp->settings.lcp_echo_interval = 0;
+ ppp_ctx->ppp->settings.lcp_echo_fails = 0;
+ return ESP_OK;
+}
+
esp_netif_recv_ret_t esp_netif_lwip_ppp_input(void *ppp_ctx, void *buffer, size_t len, void *eb)
{
struct lwip_peer2peer_ctx * obj = ppp_ctx; |
In the meantime, I'll answer the first question: And now the second question: |
If LCP keepalive mechanism is enabled in menuconfig, it's statically configured on creation of an interface and cannot be changed runtime. In some cases it's useful to relax LCP criteria during runtime operation, for example before initiating OTA. This config option allows for disabling already enabled LCP echo (this setting becomes effective after reconnecting, i.e. initializing a new session) Closes espressif/esp-protocols#287
Answers checklist.
General issue report
IDF version: v4.4.3
Esp_modem: v1.0.0
Esp module: ESP32-S3-WROOM-1
LTE module: Quectel EG915 (very similar to BG96)
Data mode: ESP_MODEM_MODE_CMUX
The communication between cpu and LTE module works fine.
But sometimes the data connection with the network drops and the modem generate a "NO CARRIER" message. Seems that this condition was not correctly handle (I make some mistake...) . I thought it generated a IP_EVENT_PPP_LOST_IP event that inform me about the new communication state. Instead, I see other errors caused by the drop of connection, like:
W (1773301) TRANSPORT_BASE: Poll timeout or error, errno=Connection already in progress, fd=57, timeout_ms=10000
E (1773301) MQTT_CLIENT: Writing didn't complete in specified timeout: errno=119
I (1773301) myMqttModule: MQTT_EVENT_ERROR
E (1773301) MQTT_CLIENT: Error to resend data
There is a way to correctly handle the "NO CARRIER" message?
Thanks
The text was updated successfully, but these errors were encountered: