Implement two-tier MQTT reconnection strategy with unlimited retries by eman · Pull Request #28 · eman/nwp500-python

eman · 2025-10-27T17:01:18Z

Problem

The MQTT client was experiencing permanent connection failures after network interruptions (AWS_ERROR_MQTT_UNEXPECTED_HANGUP). The client would:

Give up permanently after 10 failed reconnection attempts
Never recover, leaving users without data for extended periods
Log duplicate "Connection interrupted" warnings
Not handle AWS credential expiration during reconnection

Solution

Implemented a robust two-tier reconnection strategy that never gives up:

1. Quick Reconnection (attempts 1-9)

Fast reconnection using existing connection setup
Creates new connection manager with fresh credentials
Best for transient network issues
Exponential backoff: 1s, 2s, 4s, 8s, 16s, 32s, 64s, 120s, 120s
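
The delays listed above are consistent with doubling the wait on each attempt and capping it at the configured maximum. A minimal sketch of that calculation, assuming a 1-second initial delay and a multiplier of 2 (neither value is stated explicitly in this PR, and `backoff_delay` is a hypothetical helper rather than a function from the library):

```python
def backoff_delay(
    attempt: int,
    initial_delay: float = 1.0,  # assumed starting delay
    multiplier: float = 2.0,     # assumed growth factor
    max_delay: float = 120.0,    # max_reconnect_delay default from this PR
) -> float:
    """Return the wait in seconds before reconnection attempt `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Grow geometrically, then clamp to the configured ceiling.
    return min(initial_delay * multiplier ** (attempt - 1), max_delay)


# Delays for the nine quick attempts that precede the first deep reconnect.
schedule = [backoff_delay(n) for n in range(1, 10)]
print(schedule)
```

With these assumed defaults the eighth and ninth delays are the capped ones, which matches the two trailing 120s entries in the list.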

2. Deep Reconnection (every 10th attempt: 10, 20, 30...)

Full teardown and rebuild of entire connection
Explicitly refreshes AWS credentials from auth API
Falls back to full re-authentication if refresh fails
Recreates connection manager from scratch
Re-establishes all subscriptions automatically
Handles complex issues like expired credentials, corrupted state
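
The deep reconnection steps can be sketched as one coroutine that tears down, refreshes credentials, falls back to a full sign-in, rebuilds, and resubscribes. This is an illustrative outline only: `StubAuth`, `deep_reconnect`, and the callables passed to it are hypothetical stand-ins, and the real `_deep_reconnect()` in `mqtt_client.py` may be structured differently.

```python
import asyncio
from typing import Awaitable, Callable, List


class TokenRefreshError(Exception):
    """Stand-in for the auth module's refresh failure."""


class StubAuth:
    """Hypothetical auth client that records which calls were made."""

    def __init__(self, refresh_ok: bool) -> None:
        self.refresh_ok = refresh_ok
        self.calls: List[str] = []

    async def refresh(self) -> None:
        self.calls.append("refresh")
        if not self.refresh_ok:
            raise TokenRefreshError("refresh token rejected")

    async def re_authenticate(self) -> None:
        self.calls.append("re_authenticate")


async def deep_reconnect(
    auth: StubAuth,
    teardown: Callable[[], Awaitable[None]],
    rebuild: Callable[[], Awaitable[None]],
    resubscribe_all: Callable[[], Awaitable[None]],
) -> None:
    await teardown()                  # 1. drop the old connection
    try:
        await auth.refresh()          # 2. refresh credentials
    except TokenRefreshError:
        await auth.re_authenticate()  # 3. fall back to a full sign-in
    await rebuild()                   # 4. create a new connection manager
    await resubscribe_all()           # 5. restore subscriptions


async def demo() -> List[str]:
    steps: List[str] = []

    def step(name: str) -> Callable[[], Awaitable[None]]:
        async def run() -> None:
            steps.append(name)
        return run

    auth = StubAuth(refresh_ok=False)  # force the fallback path
    await deep_reconnect(auth, step("teardown"), step("rebuild"), step("resubscribe"))
    return auth.calls + steps
```

Passing the teardown, rebuild, and resubscribe steps in as callables keeps the ordering logic testable without a broker.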

Changes

Core Improvements

Unlimited retries: Changed max_reconnect_attempts default from 10 to -1
Two-tier strategy: Quick reconnects for speed, deep reconnects for resilience
Subscription recovery: New resubscribe_all() method restores all subscriptions after deep reconnection
Credential refresh: Deep reconnection explicitly refreshes AWS IoT credentials
Clean logging: Removed duplicate "Connection interrupted" warnings

Modified Files

src/nwp500/mqtt_client.py (+102 lines)
- Added _deep_reconnect() method for full rebuild
- Removed duplicate connection interrupted logging
- Pass deep_reconnect_func to reconnection handler
src/nwp500/mqtt_reconnection.py (+72 lines, -23 lines)
- Implement two-tier reconnection logic
- Support unlimited retries when max_reconnect_attempts = -1
- Switch between quick and deep reconnection based on threshold
src/nwp500/mqtt_subscriptions.py (+60 lines)
- New resubscribe_all() method to restore subscriptions
- Tracks subscription state for recovery
src/nwp500/mqtt_utils.py (+8 lines, -1 line)
- Add deep_reconnect_threshold config parameter (default: 10)
- Change max_reconnect_attempts default to -1 (unlimited)
- Update documentation

Testing

✅ All linting checks pass
✅ Type checking passes (mypy)
✅ Existing tests pass
✅ No breaking changes to public API

Behavior

Before:

Connection fails after 10 attempts (~7 minutes)
Client gives up permanently
Users lose data indefinitely
Must manually restart application

After:

Connection attempts continue indefinitely
Quick reconnections settle at one attempt every 120 seconds once backoff reaches max_reconnect_delay
Deep reconnections every 10 attempts refresh credentials
Automatic recovery when network/AWS IoT becomes available
Handles credential expiration gracefully

Migration

No breaking changes. Existing code continues to work with improved resilience.

Users can customize behavior via MqttConnectionConfig:

config = MqttConnectionConfig(
    max_reconnect_attempts=-1,  # -1 = unlimited (default)
    deep_reconnect_threshold=10,  # Deep reconnect every 10 attempts (default)
    max_reconnect_delay=120.0,  # Max delay between attempts (default)
)
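
How the `-1` sentinel might be interpreted by a retry loop can be sketched as follows. `ReconnectSettings` is a hypothetical mirror of the three options above, not the library's `MqttConnectionConfig`, and treating every negative value as unlimited is an assumption (the PR only mentions `-1`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReconnectSettings:
    """Hypothetical mirror of the three options shown above."""

    max_reconnect_attempts: int = -1
    deep_reconnect_threshold: int = 10
    max_reconnect_delay: float = 120.0

    def should_retry(self, attempts_so_far: int) -> bool:
        """True while another attempt is allowed."""
        # Assumption: any negative limit means "never give up".
        if self.max_reconnect_attempts < 0:
            return True
        return attempts_so_far < self.max_reconnect_attempts


unlimited = ReconnectSettings()                      # new default behavior
legacy = ReconnectSettings(max_reconnect_attempts=10)  # previous default limit
```

A caller who prefers the old bounded behavior would pass an explicit positive limit, as `legacy` does here.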

Closes: Connection recovery issue after AWS_ERROR_MQTT_UNEXPECTED_HANGUP

- Remove duplicate 'Connection interrupted' logging
- Change max_reconnect_attempts default from 10 to -1 (unlimited)
- Implement two-tier reconnection strategy:
  * Quick reconnection (attempts 1-9): Fast reconnection for transient issues
  * Deep reconnection (every 10th attempt): Full rebuild with token refresh
- Add deep_reconnect_threshold config parameter (default: 10)
- Add _deep_reconnect() method that:
  * Cleans up old connection
  * Refreshes AWS credentials via token refresh API
  * Falls back to full re-authentication if needed
  * Recreates connection manager from scratch
  * Re-establishes all subscriptions
- Add resubscribe_all() to MqttSubscriptionManager for subscription recovery
- Fix reconnection to never give up permanently
- Improve resilience to extended network outages and credential expiration

Fixes connection recovery after AWS_ERROR_MQTT_UNEXPECTED_HANGUP errors. Previously, the client would give up after 10 attempts and never reconnect. Now it continues indefinitely with intelligent recovery strategies.

Copilot

Pull Request Overview

This PR implements a robust two-tier MQTT reconnection strategy that prevents permanent connection failures after network interruptions. The key improvement is changing from a limited retry approach (10 attempts) to unlimited retries with intelligent escalation between quick and deep reconnection methods.

Key Changes:

Changed default max_reconnect_attempts from 10 to -1 (unlimited retries)
Implemented two-tier reconnection: quick reconnects (1-9) for transient issues, deep reconnects (every 10th attempt) for complex problems like credential expiration
Added subscription recovery mechanism via resubscribe_all() method to restore all subscriptions after deep reconnection

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `src/nwp500/mqtt_utils.py` | Added `deep_reconnect_threshold` config parameter and changed `max_reconnect_attempts` default to -1 |
| `src/nwp500/mqtt_reconnection.py` | Implemented two-tier reconnection logic with support for unlimited retries and deep reconnection triggers |
| `src/nwp500/mqtt_subscriptions.py` | Added `resubscribe_all()` method to restore all subscriptions after connection rebuild |
| `src/nwp500/mqtt_client.py` | Added `_deep_reconnect()` method for full connection rebuild and removed duplicate connection interrupted logging |


Use a set instead of list to track failed subscriptions, preventing duplicate entries when a topic has multiple handlers. This ensures the failure count in warning messages is accurate.

Co-authored-by: GitHub Copilot <noreply@github.com>

eman · 2025-10-27T17:15:18Z

Fixed Copilot AI Feedback

Good catch on the subscription failure tracking!

Issue: When a topic had multiple handlers, it would be added to failed_subscriptions list multiple times, causing incorrect failure counts.

Fix: Changed to use a set() instead of list and track per-topic failure status with a topic_failed flag. Now each unique failed topic is only counted once, regardless of how many handlers it has.

Example:

Topic "device/123/status" has 3 handlers
All 3 fail to resubscribe
Before: Would report "Failed to restore 3 subscription(s)" ❌
After: Correctly reports "Failed to restore 1 subscription(s)" ✅
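
The counting difference in this example can be reproduced in a few lines. The sketch below is illustrative only: handlers are represented by plain strings, and every resubscribe attempt is assumed to fail.

```python
from typing import Dict, List, Set

# One topic with three handlers, as in the example above.
handlers_by_topic: Dict[str, List[str]] = {
    "device/123/status": ["handler_a", "handler_b", "handler_c"],
}

failed_list: List[str] = []   # old behavior: one entry per failing handler
failed_set: Set[str] = set()  # new behavior: one entry per failing topic

for topic, handlers in handlers_by_topic.items():
    for _handler in handlers:
        # Pretend every resubscribe attempt for this topic fails.
        failed_list.append(topic)
        failed_set.add(topic)

before_msg = f"Failed to restore {len(failed_list)} subscription(s)"
after_msg = f"Failed to restore {len(failed_set)} subscription(s)"
```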

Commit: 5ef8708

Instead of accessing private _user_id and _password attributes directly, add proper public API to NavienAuthClient:
- Add has_stored_credentials property to check if credentials are available
- Add re_authenticate() method to perform fresh sign-in with stored credentials
- Update mqtt_client to use new public API methods

This improves encapsulation and provides a cleaner interface for credential management and re-authentication scenarios.

Co-authored-by: GitHub Copilot <noreply@github.com>

eman · 2025-10-27T17:23:05Z

Fixed Copilot AI Encapsulation Feedback

Good catch on accessing private attributes directly!

Issue: The deep reconnection logic was accessing _user_id and _password private attributes directly, violating encapsulation principles.

Solution: Added proper public API to NavienAuthClient:

New Public API

has_stored_credentials property
- Checks if credentials are available for re-authentication
- Returns True if both user_id and password are stored
- Clean way to check credential availability without accessing private attributes
re_authenticate() method
- Performs fresh sign-in using stored credentials
- Convenience method for reconnection scenarios
- Raises ValueError if credentials aren't available
- Returns AuthenticationResponse with fresh tokens
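
A rough sketch of how such a property and method could sit on an auth client follows. `SketchAuthClient` is hypothetical, its `sign_in()` performs no network call, and a plain dict stands in for `AuthenticationResponse`; only the names `has_stored_credentials`, `re_authenticate`, and `sign_in` come from this PR.

```python
import asyncio
from typing import Dict, Optional


class SketchAuthClient:
    """Hypothetical stand-in for NavienAuthClient."""

    def __init__(self, user_id: Optional[str] = None, password: Optional[str] = None) -> None:
        self._user_id = user_id
        self._password = password

    @property
    def has_stored_credentials(self) -> bool:
        # True only when both pieces needed for a fresh sign-in are present.
        return bool(self._user_id and self._password)

    async def sign_in(self, user_id: str, password: str) -> Dict[str, str]:
        # Placeholder for the real network sign-in; remembers the credentials.
        self._user_id, self._password = user_id, password
        return {"user_id": user_id, "token": "fake-token"}

    async def re_authenticate(self) -> Dict[str, str]:
        if not self.has_stored_credentials:
            raise ValueError("no stored credentials available for re-authentication")
        assert self._user_id is not None and self._password is not None
        return await self.sign_in(self._user_id, self._password)
```

Callers then never need to read the underscore-prefixed attributes themselves.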

Usage in MQTT Client

Before (accessing private attributes):

if self._auth_client._user_id and self._auth_client._password:
    await self._auth_client.sign_in(
        self._auth_client._user_id,
        self._auth_client._password,
    )

After (using public API):

if self._auth_client.has_stored_credentials:
    await self._auth_client.re_authenticate()

This improves code maintainability and follows proper encapsulation practices while providing a cleaner, more intuitive API for credential management.

Commit: 59e9dec

Replace generic 'except Exception' handlers with specific exception types to improve error handling and debugging:
- Import AuthenticationError and TokenRefreshError from auth module
- Schedule coroutine: catch RuntimeError (event loop closed)
- Connection callbacks: catch TypeError, AttributeError
- Active reconnection: catch AwsCrtError, AuthenticationError, RuntimeError
- Deep reconnection cleanup: catch AwsCrtError, RuntimeError
- Token refresh: catch TokenRefreshError, ValueError, AuthenticationError
- Deep reconnection: catch AwsCrtError, AuthenticationError, RuntimeError, ValueError
- Connect: catch AwsCrtError, AuthenticationError, RuntimeError, ValueError
- Disconnect: catch AwsCrtError, RuntimeError
- Message processing: catch AttributeError, KeyError, TypeError
- Publish: catch AwsCrtError specifically, add exception chaining

This makes error handling more explicit and prevents masking unexpected exceptions that should propagate. Improves debuggability by catching only expected exception types.

eman · 2025-10-27T17:29:21Z

Improved Exception Handling

Replaced all catch-all except Exception handlers with specific exception types to improve error handling and debugging.

Changes

Before: 10 generic except Exception handlers that could mask unexpected bugs

After: Specific exception types for each context:

| Context | Specific Exceptions |
| --- | --- |
| Schedule coroutine | `RuntimeError` (event loop closed) |
| Connection callbacks | `TypeError`, `AttributeError` |
| Active reconnection | `AwsCrtError`, `AuthenticationError`, `RuntimeError` |
| Deep reconnection cleanup | `AwsCrtError`, `RuntimeError` |
| Token refresh | `TokenRefreshError`, `ValueError`, `AuthenticationError` |
| Deep reconnection | `AwsCrtError`, `AuthenticationError`, `RuntimeError`, `ValueError` |
| Connect | `AwsCrtError`, `AuthenticationError`, `RuntimeError`, `ValueError` |
| Disconnect | `AwsCrtError`, `RuntimeError` |
| Message processing | `AttributeError`, `KeyError`, `TypeError` |
| Publish | `AwsCrtError` with proper exception chaining |

Benefits

✅ Better debugging: Unexpected exceptions now propagate instead of being hidden
✅ Explicit error handling: Each handler only catches expected errors
✅ Exception chaining: Using raise ... from e preserves stack traces
✅ Type safety: Specific exception types improve IDE support and documentation
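
The exception-chaining point can be illustrated with a short sketch. `TransportError` stands in for a low-level error such as `AwsCrtError`, and `PublishError`, `low_level_publish`, and `publish` are hypothetical names used only for this example.

```python
class TransportError(Exception):
    """Stand-in for a low-level error such as AwsCrtError."""


class PublishError(Exception):
    """Hypothetical higher-level error surfaced to callers."""


def low_level_publish(topic: str, payload: bytes) -> None:
    # Simulate the transport failing.
    raise TransportError("connection dropped")


def publish(topic: str, payload: bytes) -> None:
    try:
        low_level_publish(topic, payload)
    except TransportError as e:
        # Only the expected failure is translated; `from e` keeps the
        # original exception attached as __cause__ for the traceback.
        raise PublishError(f"publish to {topic} failed") from e
```

Any other exception type raised inside `low_level_publish` would pass through `publish` untouched, which is the propagation behavior the first benefit describes.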

Commit: 2807e1a

Systematically replaced generic 'except Exception' handlers throughout the MQTT codebase with specific exception types:

**mqtt_reconnection.py:**
- Import AwsCrtError
- Quick/deep reconnect: catch AwsCrtError, RuntimeError, ValueError
- Reconnection loop: catch AwsCrtError, RuntimeError
- Event emission: catch TypeError, RuntimeError

**mqtt_connection.py:**
- Import AwsCrtError
- Connect: catch AwsCrtError, RuntimeError, ValueError
- Disconnect: catch AwsCrtError, RuntimeError

**mqtt_subscriptions.py:**
- Import AwsCrtError
- Message handlers: catch TypeError, AttributeError, KeyError
- Message processing: catch AttributeError, KeyError, TypeError
- Subscribe/unsubscribe: catch AwsCrtError, RuntimeError
- Resubscribe: catch AwsCrtError, RuntimeError
- Status parsing: catch TypeError, AttributeError (KeyError already caught)
- State change detection: catch TypeError, AttributeError, RuntimeError
- Feature parsing: catch TypeError, AttributeError (KeyError already caught)
- Energy usage: catch TypeError, ValueError, AttributeError

**mqtt_periodic.py:**
- Periodic requests: catch AwsCrtError, RuntimeError with specific handling for clean session cancellation

**events.py:**
- Event handlers: catch TypeError, RuntimeError, AttributeError

**Copilot instructions:**
- Added exception handling guidelines with common exception types
- Emphasize catching only expected exceptions

This improves debuggability by allowing unexpected exceptions to propagate instead of being silently caught. Makes error handling explicit and type-safe.

eman · 2025-10-27T17:37:32Z

Complete Exception Handling Overhaul

Fixed all catch-all exception handlers throughout the MQTT codebase.

Files Fixed

✅ mqtt_client.py (10 handlers) - Previous commit
✅ mqtt_reconnection.py (4 handlers)
✅ mqtt_connection.py (2 handlers)
✅ mqtt_subscriptions.py (7 handlers)
✅ mqtt_periodic.py (1 handler)
✅ events.py (1 handler)

Exception Type Mapping

All except Exception handlers replaced with specific types appropriate for each context:

| Module | Exception Types Used |
| --- | --- |
| mqtt_client.py | `AwsCrtError`, `AuthenticationError`, `TokenRefreshError`, `RuntimeError`, `ValueError`, `TypeError`, `AttributeError` |
| mqtt_reconnection.py | `AwsCrtError`, `RuntimeError`, `ValueError`, `TypeError` |
| mqtt_connection.py | `AwsCrtError`, `RuntimeError`, `ValueError` |
| mqtt_subscriptions.py | `AwsCrtError`, `RuntimeError`, `TypeError`, `AttributeError`, `KeyError`, `ValueError` |
| mqtt_periodic.py | `AwsCrtError`, `RuntimeError` |
| events.py | `TypeError`, `RuntimeError`, `AttributeError` |

Updated Documentation

Added exception handling guidelines to .github/copilot-instructions.md:

List of common exception types
When to use each type
Emphasis on catching only expected exceptions

Benefits

✅ Debugging: Unexpected exceptions now propagate with full stack traces
✅ Explicit: Each handler documents what errors are expected
✅ Type-safe: IDEs can provide better error handling suggestions
✅ Maintainable: Future developers know exactly what can go wrong

Total handlers fixed: 25 catch-all exception handlers → specific exception types

Commit: 5f24c5b

Restore catch-all exception handler in EventEmitter.emit() because this is a legitimate use case where resilience is required:
- User-provided callbacks can raise any exception type
- One bad callback should not break other callbacks
- Event emitter pattern requires fault isolation
- Added detailed comment explaining why catch-all is appropriate here

This is one of the rare cases where 'except Exception' is the correct approach, as we need to protect the event system from arbitrary user code.

Fixes test_event_error_handling test that validates this behavior.
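
The fault-isolation argument can be seen in a minimal emitter sketch. `MiniEmitter` is hypothetical and much smaller than the project's `EventEmitter`; it only shows why a catch-all around each user callback keeps the remaining callbacks running.

```python
import logging
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

_logger = logging.getLogger(__name__)


class MiniEmitter:
    """Hypothetical emitter illustrating per-callback fault isolation."""

    def __init__(self) -> None:
        self._listeners: DefaultDict[str, List[Callable[..., Any]]] = defaultdict(list)

    def on(self, event: str, callback: Callable[..., Any]) -> None:
        self._listeners[event].append(callback)

    def emit(self, event: str, *args: Any) -> int:
        """Call every listener; return how many completed without raising."""
        succeeded = 0
        for callback in list(self._listeners[event]):
            try:
                callback(*args)
                succeeded += 1
            except Exception:  # deliberate catch-all: user callbacks can raise anything
                _logger.exception("listener for %r failed", event)
        return succeeded
```

For example, registering a callback that divides by zero ahead of a well-behaved one still lets the second callback receive the event.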

Copilot

Pull Request Overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Copilot · 2025-10-27T17:47:01Z

src/nwp500/mqtt_subscriptions.py

+                        f"Failed to re-subscribe to "
+                        f"'{redact_topic(topic)}': {e}"
+                    )
+                    topic_failed = True


The topic_failed flag doesn't prevent subsequent handlers from attempting to subscribe to the same failed topic. Consider breaking out of the loop after the first failure or logging once per topic rather than once per handler.

Suggested change

-                    topic_failed = True
+                    topic_failed = True
+                    break

Copilot · 2025-10-27T17:47:01Z

src/nwp500/mqtt_client.py

-                isinstance(e, AwsCrtError)
-                and e.name == "AWS_ERROR_MQTT_CANCELLED_FOR_CLEAN_SESSION"
-            ):
+            if e.name == "AWS_ERROR_MQTT_CANCELLED_FOR_CLEAN_SESSION":


The AwsCrtError.name attribute may not always exist or could be None. Add a check using hasattr(e, 'name') or handle AttributeError to prevent potential crashes when checking the error name.

Suggested change

-            if e.name == "AWS_ERROR_MQTT_CANCELLED_FOR_CLEAN_SESSION":
+            if getattr(e, "name", None) == "AWS_ERROR_MQTT_CANCELLED_FOR_CLEAN_SESSION":

Copilot · 2025-10-27T17:47:01Z

src/nwp500/mqtt_client.py

+                if (
+                    self._auth_client.current_tokens
+                    and self._auth_client.current_tokens.refresh_token
+                ):
+                    await self._auth_client.refresh_token(
+                        self._auth_client.current_tokens.refresh_token
+                    )


[nitpick] This nested condition could be simplified for better readability. Consider storing self._auth_client.current_tokens in a local variable or restructuring the check to make the logic clearer.

Suggested change

-                if (
-                    self._auth_client.current_tokens
-                    and self._auth_client.current_tokens.refresh_token
-                ):
-                    await self._auth_client.refresh_token(
-                        self._auth_client.current_tokens.refresh_token
-                    )
+                tokens = self._auth_client.current_tokens
+                if tokens and tokens.refresh_token:
+                    await self._auth_client.refresh_token(tokens.refresh_token)

Copilot · 2025-10-27T17:47:01Z

src/nwp500/mqtt_reconnection.py

+            use_deep_reconnect = (
+                self._deep_reconnect_func is not None
+                and self._reconnect_attempts
+                >= self.config.deep_reconnect_threshold
+                and self._reconnect_attempts
+                % self.config.deep_reconnect_threshold
+                == 0


[nitpick] The multi-line boolean expression could be more readable with intermediate variables. Consider extracting is_at_threshold and is_threshold_multiple for clarity.

Suggested change

-            use_deep_reconnect = (
-                self._deep_reconnect_func is not None
-                and self._reconnect_attempts
-                >= self.config.deep_reconnect_threshold
-                and self._reconnect_attempts
-                % self.config.deep_reconnect_threshold
-                == 0
+            is_at_threshold = self._reconnect_attempts >= self.config.deep_reconnect_threshold
+            is_threshold_multiple = (
+                self._reconnect_attempts % self.config.deep_reconnect_threshold == 0
+            )
+            use_deep_reconnect = (
+                self._deep_reconnect_func is not None
+                and is_at_threshold
+                and is_threshold_multiple

When a topic subscription fails, subsequent handlers for that same topic will also fail for the same reason (network/connection issue).

Changes:
- Break out of handler loop immediately after first subscription failure
- Avoids redundant error logs for the same topic
- More efficient error handling during reconnection
- Cleaner logs with one error per failed topic instead of one per handler

Co-authored-by: GitHub Copilot <noreply@github.com>

eman · 2025-10-27T17:48:59Z

Optimized Subscription Recovery

Fixed inefficiency in resubscribe_all() method.

Issue

When a topic subscription failed, the code continued trying to subscribe remaining handlers for the same topic. Since they all fail for the same reason (connection/network issue), this created redundant error logs.

Fix

Break out of the handler loop after the first subscription failure for a topic:

Before:

for handler in handlers:
    try:
        await self.subscribe(topic, handler, qos)
    except (AwsCrtError, RuntimeError) as e:
        _logger.error(...)
        topic_failed = True
# Continues to next handler even after failure

After:

for handler in handlers:
    try:
        await self.subscribe(topic, handler, qos)
    except (AwsCrtError, RuntimeError) as e:
        _logger.error(...)
        failed_subscriptions.add(topic)
        break  # Skip remaining handlers for this topic

Benefits

✅ Cleaner logs: One error per failed topic instead of one per handler
✅ More efficient: Stops trying after first failure
✅ Same behavior: Correctly tracks and reports failed topics
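
A runnable approximation of the break-on-first-failure loop is sketched below. It is not the library's `resubscribe_all()`: handlers are plain strings, the QoS argument is dropped, and `flaky_subscribe` is a hypothetical stub that fails for one topic.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Set

Subscribe = Callable[[str, str], Awaitable[None]]


async def resubscribe_all_sketch(
    handlers_by_topic: Dict[str, List[str]],
    subscribe: Subscribe,
) -> Set[str]:
    """Return the set of topics that could not be restored."""
    failed: Set[str] = set()
    for topic, handlers in handlers_by_topic.items():
        for handler in handlers:
            try:
                await subscribe(topic, handler)
            except RuntimeError:
                failed.add(topic)
                break  # remaining handlers for this topic would fail the same way
    return failed


attempts: List[str] = []


async def flaky_subscribe(topic: str, handler: str) -> None:
    # Record every attempt, then fail for the status topic only.
    attempts.append(f"{topic}:{handler}")
    if topic == "device/123/status":
        raise RuntimeError("not connected")


failed = asyncio.run(
    resubscribe_all_sketch(
        {"device/123/status": ["a", "b", "c"], "device/123/feature": ["d"]},
        flaky_subscribe,
    )
)
```

Only one attempt is made for the failing topic, and the loop still moves on to the next topic.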

Commit: 7 commits total now

The AwsCrtError.name attribute may not always exist or could be None. Added hasattr() checks before accessing e.name to prevent AttributeError.

Changes:
- mqtt_client.py: Added hasattr(e, 'name') check before comparing name
- mqtt_periodic.py: Added hasattr(e, 'name') check in exception handling

This prevents potential crashes when the name attribute is missing or None, making error handling more robust.

Co-authored-by: GitHub Copilot <noreply@github.com>

eman · 2025-10-27T17:51:52Z

Added Safety Checks for Exception Attributes

Fixed potential AttributeError when accessing AwsCrtError.name.

Issue

The code assumed AwsCrtError.name attribute always exists, but it may be missing or None, which could cause crashes.

Fix

Added hasattr() checks before accessing the name attribute:

Before:

except AwsCrtError as e:
    if e.name == "AWS_ERROR_MQTT_CANCELLED_FOR_CLEAN_SESSION":
        # Handle it

After:

except AwsCrtError as e:
    if hasattr(e, "name") and e.name == "AWS_ERROR_MQTT_CANCELLED_FOR_CLEAN_SESSION":
        # Handle it

Files Updated

mqtt_client.py: Added safety check in publish error handling
mqtt_periodic.py: Added safety check in periodic request error handling

Benefits

✅ Crash prevention: No AttributeError if name attribute is missing
✅ Defensive coding: Handles edge cases gracefully
✅ Same functionality: Works correctly when name attribute exists
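
The same guard can also be written with `getattr` and a default, as the review suggestion proposed. A small sketch, where `NamedError` is a hypothetical stand-in for an exception type that carries a `name` attribute and `is_clean_session_cancel` is an illustrative helper:

```python
from typing import Optional

CLEAN_SESSION = "AWS_ERROR_MQTT_CANCELLED_FOR_CLEAN_SESSION"


class NamedError(Exception):
    """Stand-in for an exception that carries a `name` attribute."""

    def __init__(self, name: Optional[str]) -> None:
        super().__init__(name)
        self.name = name


def is_clean_session_cancel(exc: BaseException) -> bool:
    # getattr with a default covers both a missing attribute and a None value.
    return getattr(exc, "name", None) == CLEAN_SESSION
```

Both spellings behave the same for the comparison; `getattr` simply folds the existence check and the lookup into one expression.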

Commit: 4fb0bf0

Extracted self._auth_client.current_tokens to a local variable to avoid nested attribute access and improve readability.

Before:

if (self._auth_client.current_tokens and self._auth_client.current_tokens.refresh_token):

After:

current_tokens = self._auth_client.current_tokens
if current_tokens and current_tokens.refresh_token:

This makes the code clearer and easier to understand.

Co-authored-by: GitHub Copilot <noreply@github.com>

…variables

Extracted complex boolean expression into clear, self-documenting variables:
- has_deep_reconnect: Check if deep reconnect function exists
- is_at_threshold: Check if attempts reached threshold
- is_threshold_multiple: Check if at threshold interval (10, 20, 30...)

This makes the logic much clearer and easier to understand at a glance.

Co-authored-by: GitHub Copilot <noreply@github.com>
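
The three named conditions can be exercised in isolation to confirm which attempts escalate. The function below is a standalone sketch (`should_deep_reconnect` is a hypothetical name, and `has_deep_reconnect` is passed as a boolean instead of checking a stored callable); it assumes a positive threshold.

```python
def should_deep_reconnect(
    attempts: int,
    threshold: int,
    has_deep_reconnect: bool = True,
) -> bool:
    """Decide whether this attempt should use the deep reconnection path."""
    # Guards against attempt 0, which would otherwise pass the modulo test.
    is_at_threshold = attempts >= threshold
    # True on every multiple of the threshold (10, 20, 30, ... for threshold=10).
    is_threshold_multiple = attempts % threshold == 0
    return has_deep_reconnect and is_at_threshold and is_threshold_multiple


# Which of the first 30 attempts escalate with the default threshold of 10?
deep_attempts = [n for n in range(1, 31) if should_deep_reconnect(n, threshold=10)]
print(deep_attempts)
```

All other attempts fall through to the quick reconnection path.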

eman requested a review from Copilot October 27, 2025 17:06

Copilot AI reviewed Oct 27, 2025


eman requested a review from Copilot October 27, 2025 17:45

Copilot AI reviewed Oct 27, 2025


eman and others added 2 commits October 27, 2025 10:53

eman merged commit 9f158d1 into main Oct 27, 2025
10 checks passed

eman deleted the fix/mqtt-reconnection-improvements branch October 27, 2025 17:57
