Implement two-tier MQTT reconnection strategy with unlimited retries #28

Merged
eman merged 10 commits into main from fix/mqtt-reconnection-improvements
Oct 27, 2025

eman (Owner) commented Oct 27, 2025

Problem

The MQTT client was experiencing permanent connection failures after network interruptions (AWS_ERROR_MQTT_UNEXPECTED_HANGUP). The client would:

  • Give up permanently after 10 failed reconnection attempts
  • Never recover, leaving users without data for extended periods
  • Log duplicate "Connection interrupted" warnings
  • Not handle AWS credential expiration during reconnection

Solution

Implemented a robust two-tier reconnection strategy that never gives up:

1. Quick Reconnection (attempts 1-9)

  • Fast reconnection using existing connection setup
  • Creates new connection manager with fresh credentials
  • Best for transient network issues
  • Exponential backoff: 1s, 2s, 4s, 8s, 16s, 32s, 64s, 120s, 120s
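
The delay schedule above is consistent with a doubling backoff capped at 120 seconds. A minimal sketch, assuming an initial delay of 1 second and a multiplier of 2 (the function name and the `initial_delay` and `multiplier` parameters are hypothetical; only the 120-second cap appears in this PR's config):

```python
def reconnect_delay(
    attempt: int,
    initial_delay: float = 1.0,   # assumed starting delay
    multiplier: float = 2.0,      # assumed growth factor
    max_delay: float = 120.0,     # cap, matching max_reconnect_delay in this PR
) -> float:
    """Return the wait in seconds before the given 1-based attempt."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(initial_delay * multiplier ** (attempt - 1), max_delay)


# Delays for quick reconnection attempts 1 through 9.
schedule = [reconnect_delay(n) for n in range(1, 10)]
print(schedule)
```

Once the cap is reached, every later attempt waits the same 120 seconds.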

2. Deep Reconnection (every 10th attempt: 10, 20, 30...)

  • Full teardown and rebuild of entire connection
  • Explicitly refreshes AWS credentials from auth API
  • Falls back to full re-authentication if refresh fails
  • Recreates connection manager from scratch
  • Re-establishes all subscriptions automatically
  • Handles complex issues like expired credentials, corrupted state
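
The choice between the two tiers can be sketched as a small predicate. This is an illustrative standalone version of the condition reviewed later in this PR, not the code in mqtt_reconnection.py; the function name and the guard against a non-positive threshold are assumptions:

```python
from typing import Callable, Optional


def use_deep_reconnect(
    attempt: int,
    threshold: int,
    deep_reconnect_func: Optional[Callable[[], object]],
) -> bool:
    """Return True when this attempt should be a deep reconnection."""
    if threshold <= 0:
        # Assumption: a non-positive threshold disables deep reconnection.
        return False
    has_deep_reconnect = deep_reconnect_func is not None
    is_at_threshold = attempt >= threshold
    is_threshold_multiple = attempt % threshold == 0
    return has_deep_reconnect and is_at_threshold and is_threshold_multiple


# With the default threshold of 10, find the deep attempts among the first 30.
deep_attempts = [n for n in range(1, 31) if use_deep_reconnect(n, 10, lambda: None)]
print(deep_attempts)
```

All other attempts fall through to the quick path, and so does every attempt when no deep reconnect function was supplied.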

Changes

Core Improvements

  • Unlimited retries: Changed max_reconnect_attempts default from 10 to -1
  • Two-tier strategy: Quick reconnects for speed, deep reconnects for resilience
  • Subscription recovery: New resubscribe_all() method restores all subscriptions after deep reconnection
  • Credential refresh: Deep reconnection explicitly refreshes AWS IoT credentials
  • Clean logging: Removed duplicate "Connection interrupted" warnings

Modified Files

  • src/nwp500/mqtt_client.py (+102 lines)

    • Added _deep_reconnect() method for full rebuild
    • Removed duplicate connection interrupted logging
    • Pass deep_reconnect_func to reconnection handler
  • src/nwp500/mqtt_reconnection.py (+72 lines, -23 lines)

    • Implement two-tier reconnection logic
    • Support unlimited retries when max_reconnect_attempts = -1
    • Switch between quick and deep reconnection based on threshold
  • src/nwp500/mqtt_subscriptions.py (+60 lines)

    • New resubscribe_all() method to restore subscriptions
    • Tracks subscription state for recovery
  • src/nwp500/mqtt_utils.py (+8 lines, -1 line)

    • Add deep_reconnect_threshold config parameter (default: 10)
    • Change max_reconnect_attempts default to -1 (unlimited)
    • Update documentation

Testing

✅ All linting checks pass
✅ Type checking passes (mypy)
✅ Existing tests pass
✅ No breaking changes to public API

Behavior

Before:

  • Connection fails after 10 attempts (~7 minutes)
  • Client gives up permanently
  • Users lose data indefinitely
  • Must manually restart application

After:

  • Connection attempts continue indefinitely
  • Quick reconnections back off exponentially, then repeat every 120 seconds once the delay cap is reached
  • Deep reconnections every 10 attempts refresh credentials
  • Automatic recovery when network/AWS IoT becomes available
  • Handles credential expiration gracefully

Migration

No breaking changes. Existing code continues to work with improved resilience.

Users can customize behavior via MqttConnectionConfig:

config = MqttConnectionConfig(
    max_reconnect_attempts=-1,  # -1 = unlimited (default)
    deep_reconnect_threshold=10,  # Deep reconnect every 10 attempts (default)
    max_reconnect_delay=120.0,  # Max delay between attempts (default)
)
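
For illustration, the -1 sentinel could be interpreted by a retry check like the one below. `ReconnectPolicy` and `should_retry` are hypothetical names used only for this sketch; the real MqttConnectionConfig may expose these fields differently:

```python
from dataclasses import dataclass


@dataclass
class ReconnectPolicy:
    """Hypothetical stand-in for the retry fields of MqttConnectionConfig."""

    max_reconnect_attempts: int = -1      # -1 = unlimited, as in this PR
    deep_reconnect_threshold: int = 10
    max_reconnect_delay: float = 120.0

    def should_retry(self, attempts_so_far: int) -> bool:
        """Return True if another reconnection attempt is allowed."""
        if self.max_reconnect_attempts == -1:
            return True
        return attempts_so_far < self.max_reconnect_attempts


unlimited = ReconnectPolicy()                         # new default behavior
legacy = ReconnectPolicy(max_reconnect_attempts=10)   # previous default limit
```

Setting max_reconnect_attempts back to 10 would reproduce the old give-up behavior for callers who want a bounded retry budget.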

Closes: Connection recovery issue after AWS_ERROR_MQTT_UNEXPECTED_HANGUP

- Remove duplicate 'Connection interrupted' logging
- Change max_reconnect_attempts default from 10 to -1 (unlimited)
- Implement two-tier reconnection strategy:
  * Quick reconnection (attempts 1-9): Fast reconnection for transient issues
  * Deep reconnection (every 10th attempt): Full rebuild with token refresh
- Add deep_reconnect_threshold config parameter (default: 10)
- Add _deep_reconnect() method that:
  * Cleans up old connection
  * Refreshes AWS credentials via token refresh API
  * Falls back to full re-authentication if needed
  * Recreates connection manager from scratch
  * Re-establishes all subscriptions
- Add resubscribe_all() to MqttSubscriptionManager for subscription recovery
- Fix reconnection to never give up permanently
- Improve resilience to extended network outages and credential expiration

Fixes connection recovery after AWS_ERROR_MQTT_UNEXPECTED_HANGUP errors.
Previously, the client would give up after 10 attempts and never reconnect.
Now it continues indefinitely with intelligent recovery strategies.

eman requested a review from Copilot October 27, 2025 17:06

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR implements a robust two-tier MQTT reconnection strategy that prevents permanent connection failures after network interruptions. The key improvement is changing from a limited retry approach (10 attempts) to unlimited retries with intelligent escalation between quick and deep reconnection methods.

Key Changes:

  • Changed default max_reconnect_attempts from 10 to -1 (unlimited retries)
  • Implemented two-tier reconnection: quick reconnects (1-9) for transient issues, deep reconnects (every 10th attempt) for complex problems like credential expiration
  • Added subscription recovery mechanism via resubscribe_all() method to restore all subscriptions after deep reconnection

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| src/nwp500/mqtt_utils.py | Added deep_reconnect_threshold config parameter and changed max_reconnect_attempts default to -1 |
| src/nwp500/mqtt_reconnection.py | Implemented two-tier reconnection logic with support for unlimited retries and deep reconnection triggers |
| src/nwp500/mqtt_subscriptions.py | Added resubscribe_all() method to restore all subscriptions after connection rebuild |
| src/nwp500/mqtt_client.py | Added _deep_reconnect() method for full connection rebuild and removed duplicate connection interrupted logging |

Use a set instead of list to track failed subscriptions, preventing
duplicate entries when a topic has multiple handlers. This ensures
the failure count in warning messages is accurate.

Co-authored-by: GitHub Copilot <noreply@github.com>

eman commented Oct 27, 2025

Fixed Copilot AI Feedback

Good catch on the subscription failure tracking!

Issue: When a topic had multiple handlers, it was added to the failed_subscriptions list once per handler, which inflated the failure count.

Fix: Changed failed_subscriptions to a set and tracked per-topic failure status with a topic_failed flag. Each unique failed topic is now counted once, regardless of how many handlers it has.

Example:

  • Topic "device/123/status" has 3 handlers
  • All 3 fail to resubscribe
  • Before: Would report "Failed to restore 3 subscription(s)" ❌
  • After: Correctly reports "Failed to restore 1 subscription(s)" ✅
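
The counting difference can be reproduced in isolation. The sketch below is not the resubscribe_all() implementation; it assumes every resubscribe attempt fails and only compares how a list and a set record the same failures (both helper names are made up for this example):

```python
from typing import Dict, List, Set


def failed_topics_as_list(subscriptions: Dict[str, List[str]]) -> List[str]:
    """Record one entry per failing handler (the old behavior)."""
    failed: List[str] = []
    for topic, handlers in subscriptions.items():
        for _handler in handlers:
            failed.append(topic)  # duplicates accumulate per handler
    return failed


def failed_topics_as_set(subscriptions: Dict[str, List[str]]) -> Set[str]:
    """Record each failing topic once (the new behavior)."""
    failed: Set[str] = set()
    for topic, handlers in subscriptions.items():
        for _handler in handlers:
            failed.add(topic)  # adding an existing topic is a no-op
    return failed


subscriptions = {"device/123/status": ["handler_a", "handler_b", "handler_c"]}
print(len(failed_topics_as_list(subscriptions)))
print(len(failed_topics_as_set(subscriptions)))
```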

Commit: 5ef8708

Instead of accessing private _user_id and _password attributes directly,
add proper public API to NavienAuthClient:

- Add has_stored_credentials property to check if credentials are available
- Add re_authenticate() method to perform fresh sign-in with stored credentials
- Update mqtt_client to use new public API methods

This improves encapsulation and provides a cleaner interface for
credential management and re-authentication scenarios.

Co-authored-by: GitHub Copilot <noreply@github.com>

eman commented Oct 27, 2025

Fixed Copilot AI Encapsulation Feedback

Good catch on accessing private attributes directly!

Issue: The deep reconnection logic was accessing _user_id and _password private attributes directly, violating encapsulation principles.

Solution: Added proper public API to NavienAuthClient:

New Public API

  1. has_stored_credentials property

    • Checks if credentials are available for re-authentication
    • Returns True if both user_id and password are stored
    • Clean way to check credential availability without accessing private attributes
  2. re_authenticate() method

    • Performs fresh sign-in using stored credentials
    • Convenience method for reconnection scenarios
    • Raises ValueError if credentials aren't available
    • Returns AuthenticationResponse with fresh tokens
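
A rough sketch of how these two members might fit together is shown below. `SketchAuthClient` and the simplified `AuthenticationResponse` are stand-ins written for this example, and `sign_in` here is a placeholder that performs no network call; the real NavienAuthClient will differ:

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuthenticationResponse:
    """Simplified stand-in for the library's response type."""

    access_token: str


class SketchAuthClient:
    """Hypothetical auth client illustrating the public API described above."""

    def __init__(self, user_id: Optional[str] = None, password: Optional[str] = None) -> None:
        self._user_id = user_id
        self._password = password

    @property
    def has_stored_credentials(self) -> bool:
        """True only if both user_id and password are stored."""
        return bool(self._user_id and self._password)

    async def sign_in(self, user_id: str, password: str) -> AuthenticationResponse:
        # Placeholder: a real client would call the auth API here.
        return AuthenticationResponse(access_token=f"token-for-{user_id}")

    async def re_authenticate(self) -> AuthenticationResponse:
        """Sign in again with the stored credentials."""
        if not self.has_stored_credentials:
            raise ValueError("No stored credentials available for re-authentication")
        return await self.sign_in(self._user_id, self._password)


client = SketchAuthClient("user@example.com", "secret")
response = asyncio.run(client.re_authenticate())
print(response.access_token)
```

Callers check the property first, so the ValueError only surfaces when re_authenticate() is invoked without that guard.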

Usage in MQTT Client

Before (accessing private attributes):

if self._auth_client._user_id and self._auth_client._password:
    await self._auth_client.sign_in(
        self._auth_client._user_id,
        self._auth_client._password,
    )

After (using public API):

if self._auth_client.has_stored_credentials:
    await self._auth_client.re_authenticate()

This improves code maintainability and follows proper encapsulation practices while providing a cleaner, more intuitive API for credential management.

Commit: 59e9dec

Replace generic 'except Exception' handlers with specific exception types
to improve error handling and debugging:

- Import AuthenticationError and TokenRefreshError from auth module
- Schedule coroutine: catch RuntimeError (event loop closed)
- Connection callbacks: catch TypeError, AttributeError
- Active reconnection: catch AwsCrtError, AuthenticationError, RuntimeError
- Deep reconnection cleanup: catch AwsCrtError, RuntimeError
- Token refresh: catch TokenRefreshError, ValueError, AuthenticationError
- Deep reconnection: catch AwsCrtError, AuthenticationError, RuntimeError, ValueError
- Connect: catch AwsCrtError, AuthenticationError, RuntimeError, ValueError
- Disconnect: catch AwsCrtError, RuntimeError
- Message processing: catch AttributeError, KeyError, TypeError
- Publish: catch AwsCrtError specifically, add exception chaining

This makes error handling more explicit and prevents masking unexpected
exceptions that should propagate. Improves debuggability by catching only
expected exception types.

eman commented Oct 27, 2025

Improved Exception Handling

Replaced all catch-all except Exception handlers with specific exception types to improve error handling and debugging.

Changes

Before: 10 generic except Exception handlers that could mask unexpected bugs

After: Specific exception types for each context:

| Context | Specific Exceptions |
|---|---|
| Schedule coroutine | RuntimeError (event loop closed) |
| Connection callbacks | TypeError, AttributeError |
| Active reconnection | AwsCrtError, AuthenticationError, RuntimeError |
| Deep reconnection cleanup | AwsCrtError, RuntimeError |
| Token refresh | TokenRefreshError, ValueError, AuthenticationError |
| Deep reconnection | AwsCrtError, AuthenticationError, RuntimeError, ValueError |
| Connect | AwsCrtError, AuthenticationError, RuntimeError, ValueError |
| Disconnect | AwsCrtError, RuntimeError |
| Message processing | AttributeError, KeyError, TypeError |
| Publish | AwsCrtError with proper exception chaining |

Benefits

  • Better debugging: Unexpected exceptions now propagate instead of being hidden
  • Explicit error handling: Each handler only catches expected errors
  • Exception chaining: Using raise ... from e preserves stack traces
  • Type safety: Specific exception types improve IDE support and documentation
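
To illustrate the chaining pattern on its own: the sketch below catches one expected error type and re-raises with `raise ... from e`. `TransportError` stands in for AwsCrtError and `PublishError` is a hypothetical wrapper, so neither name should be read as the library's API:

```python
class TransportError(Exception):
    """Stand-in for AwsCrtError in this sketch."""


class PublishError(RuntimeError):
    """Hypothetical higher-level error raised to callers."""


def low_level_publish(topic: str, payload: bytes) -> None:
    # Simulate the transport failing mid-publish.
    raise TransportError("AWS_ERROR_MQTT_UNEXPECTED_HANGUP")


def publish(topic: str, payload: bytes) -> None:
    try:
        low_level_publish(topic, payload)
    except TransportError as e:
        # Only the expected error type is caught; anything else propagates.
        raise PublishError(f"Publish to {topic!r} failed") from e


try:
    publish("device/123/status", b"{}")
except PublishError as err:
    caught = err
    print(repr(caught.__cause__))
```

The original exception stays reachable through `__cause__`, so the traceback shows both the wrapper and the underlying failure.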

Commit: 2807e1a

Systematically replaced generic 'except Exception' handlers throughout
the MQTT codebase with specific exception types:

**mqtt_reconnection.py:**
- Import AwsCrtError
- Quick/deep reconnect: catch AwsCrtError, RuntimeError, ValueError
- Reconnection loop: catch AwsCrtError, RuntimeError
- Event emission: catch TypeError, RuntimeError

**mqtt_connection.py:**
- Import AwsCrtError
- Connect: catch AwsCrtError, RuntimeError, ValueError
- Disconnect: catch AwsCrtError, RuntimeError

**mqtt_subscriptions.py:**
- Import AwsCrtError
- Message handlers: catch TypeError, AttributeError, KeyError
- Message processing: catch AttributeError, KeyError, TypeError
- Subscribe/unsubscribe: catch AwsCrtError, RuntimeError
- Resubscribe: catch AwsCrtError, RuntimeError
- Status parsing: catch TypeError, AttributeError (KeyError already caught)
- State change detection: catch TypeError, AttributeError, RuntimeError
- Feature parsing: catch TypeError, AttributeError (KeyError already caught)
- Energy usage: catch TypeError, ValueError, AttributeError

**mqtt_periodic.py:**
- Periodic requests: catch AwsCrtError, RuntimeError with specific
  handling for clean session cancellation

**events.py:**
- Event handlers: catch TypeError, RuntimeError, AttributeError

**Copilot instructions:**
- Added exception handling guidelines with common exception types
- Emphasize catching only expected exceptions

This improves debuggability by allowing unexpected exceptions to propagate
instead of being silently caught. Makes error handling explicit and
type-safe.

eman commented Oct 27, 2025

Complete Exception Handling Overhaul

Fixed all catch-all exception handlers throughout the MQTT codebase.

Files Fixed

  • mqtt_client.py (10 handlers) - Previous commit
  • mqtt_reconnection.py (4 handlers)
  • mqtt_connection.py (2 handlers)
  • mqtt_subscriptions.py (7 handlers)
  • mqtt_periodic.py (1 handler)
  • events.py (1 handler)

Exception Type Mapping

All except Exception handlers replaced with specific types appropriate for each context:

| Module | Exception Types Used |
|---|---|
| mqtt_client.py | AwsCrtError, AuthenticationError, TokenRefreshError, RuntimeError, ValueError, TypeError, AttributeError |
| mqtt_reconnection.py | AwsCrtError, RuntimeError, ValueError, TypeError |
| mqtt_connection.py | AwsCrtError, RuntimeError, ValueError |
| mqtt_subscriptions.py | AwsCrtError, RuntimeError, TypeError, AttributeError, KeyError, ValueError |
| mqtt_periodic.py | AwsCrtError, RuntimeError |
| events.py | TypeError, RuntimeError, AttributeError |

Updated Documentation

Added exception handling guidelines to .github/copilot-instructions.md:

  • List of common exception types
  • When to use each type
  • Emphasis on catching only expected exceptions

Benefits

  • Debugging: Unexpected exceptions now propagate with full stack traces
  • Explicit: Each handler documents what errors are expected
  • Type-safe: IDEs can provide better error handling suggestions
  • Maintainable: Future developers know exactly what can go wrong

Total handlers fixed: 25 catch-all exception handlers → specific exception types

Commit: 5f24c5b

Restore catch-all exception handler in EventEmitter.emit() because this
is a legitimate use case where resilience is required:

- User-provided callbacks can raise any exception type
- One bad callback should not break other callbacks
- Event emitter pattern requires fault isolation
- Added detailed comment explaining why catch-all is appropriate here

This is one of the rare cases where 'except Exception' is the correct
approach, as we need to protect the event system from arbitrary
user code.

Fixes test_event_error_handling test that validates this behavior.
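
The fault-isolation argument can be shown with a small emitter. `SketchEventEmitter` is a simplified illustration written for this comment, not the EventEmitter in events.py, and its return value (the number of callbacks that completed) is an assumption made for the demo:

```python
import logging
from typing import Any, Callable, Dict, List

_logger = logging.getLogger(__name__)


class SketchEventEmitter:
    """Minimal synchronous emitter that isolates failing callbacks."""

    def __init__(self) -> None:
        self._listeners: Dict[str, List[Callable[..., Any]]] = {}

    def on(self, event: str, callback: Callable[..., Any]) -> None:
        self._listeners.setdefault(event, []).append(callback)

    def emit(self, event: str, *args: Any) -> int:
        """Call every listener; return how many completed without raising."""
        succeeded = 0
        for callback in list(self._listeners.get(event, [])):
            try:
                callback(*args)
                succeeded += 1
            except Exception:
                # Deliberate catch-all: user callbacks may raise anything,
                # and one bad callback must not stop the others.
                _logger.exception("Listener for %r raised", event)
        return succeeded


def bad_listener(value: int) -> None:
    raise KeyError(value)


seen: List[int] = []
emitter = SketchEventEmitter()
emitter.on("status", bad_listener)
emitter.on("status", seen.append)
succeeded = emitter.emit("status", 42)
```

Narrowing the except clause here would let the KeyError from the first listener prevent the second one from ever running.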

eman requested a review from Copilot October 27, 2025 17:45

Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

f"Failed to re-subscribe to "
f"'{redact_topic(topic)}': {e}"
)
topic_failed = True

Copilot AI Oct 27, 2025

The topic_failed flag doesn't prevent subsequent handlers from attempting to subscribe to the same failed topic. Consider breaking out of the loop after the first failure or logging once per topic rather than once per handler.

Suggested change
- topic_failed = True
+ topic_failed = True
+ break

isinstance(e, AwsCrtError)
and e.name == "AWS_ERROR_MQTT_CANCELLED_FOR_CLEAN_SESSION"
):
if e.name == "AWS_ERROR_MQTT_CANCELLED_FOR_CLEAN_SESSION":

Copilot AI Oct 27, 2025

The AwsCrtError.name attribute may not always exist or could be None. Add a check using hasattr(e, 'name') or handle AttributeError to prevent potential crashes when checking the error name.

Suggested change
- if e.name == "AWS_ERROR_MQTT_CANCELLED_FOR_CLEAN_SESSION":
+ if getattr(e, "name", None) == "AWS_ERROR_MQTT_CANCELLED_FOR_CLEAN_SESSION":

Comment on lines +390 to +396
if (
self._auth_client.current_tokens
and self._auth_client.current_tokens.refresh_token
):
await self._auth_client.refresh_token(
self._auth_client.current_tokens.refresh_token
)

Copilot AI Oct 27, 2025

[nitpick] This nested condition could be simplified for better readability. Consider storing self._auth_client.current_tokens in a local variable or restructuring the check to make the logic clearer.

Suggested change
- if (
-     self._auth_client.current_tokens
-     and self._auth_client.current_tokens.refresh_token
- ):
-     await self._auth_client.refresh_token(
-         self._auth_client.current_tokens.refresh_token
-     )
+ tokens = self._auth_client.current_tokens
+ if tokens and tokens.refresh_token:
+     await self._auth_client.refresh_token(tokens.refresh_token)

Comment on lines +164 to +170
use_deep_reconnect = (
self._deep_reconnect_func is not None
and self._reconnect_attempts
>= self.config.deep_reconnect_threshold
and self._reconnect_attempts
% self.config.deep_reconnect_threshold
== 0

Copilot AI Oct 27, 2025

[nitpick] The multi-line boolean expression could be more readable with intermediate variables. Consider extracting is_at_threshold and is_threshold_multiple for clarity.

Suggested change
- use_deep_reconnect = (
-     self._deep_reconnect_func is not None
-     and self._reconnect_attempts
-     >= self.config.deep_reconnect_threshold
-     and self._reconnect_attempts
-     % self.config.deep_reconnect_threshold
-     == 0
+ is_at_threshold = self._reconnect_attempts >= self.config.deep_reconnect_threshold
+ is_threshold_multiple = (
+     self._reconnect_attempts % self.config.deep_reconnect_threshold == 0
+ )
+ use_deep_reconnect = (
+     self._deep_reconnect_func is not None
+     and is_at_threshold
+     and is_threshold_multiple

When a topic subscription fails, subsequent handlers for that same topic
will also fail for the same reason (network/connection issue).

Changes:
- Break out of handler loop immediately after first subscription failure
- Avoids redundant error logs for the same topic
- More efficient error handling during reconnection
- Cleaner logs with one error per failed topic instead of one per handler

Co-authored-by: GitHub Copilot <noreply@github.com>

eman commented Oct 27, 2025

Optimized Subscription Recovery

Fixed inefficiency in resubscribe_all() method.

Issue

When a topic subscription failed, the code continued trying to subscribe remaining handlers for the same topic. Since they all fail for the same reason (connection/network issue), this created redundant error logs.

Fix

Break out of the handler loop after the first subscription failure for a topic:

Before:

for handler in handlers:
    try:
        await self.subscribe(topic, handler, qos)
    except (AwsCrtError, RuntimeError) as e:
        _logger.error(...)
        topic_failed = True
# Continues to next handler even after failure

After:

for handler in handlers:
    try:
        await self.subscribe(topic, handler, qos)
    except (AwsCrtError, RuntimeError) as e:
        _logger.error(...)
        failed_subscriptions.add(topic)
        break  # Skip remaining handlers for this topic

Benefits

  • Cleaner logs: One error per failed topic instead of one per handler
  • More efficient: Stops trying after first failure
  • Same behavior: Correctly tracks and reports failed topics
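
A runnable approximation of the loop with the early break is sketched below. It is simplified relative to the real method: `subscribe` is passed in as a callable, qos is omitted, handlers are plain strings, and `SubscribeError` stands in for AwsCrtError:

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Set, Tuple


class SubscribeError(Exception):
    """Stand-in for AwsCrtError in this sketch."""


async def resubscribe_all(
    subscriptions: Dict[str, List[str]],
    subscribe: Callable[[str, str], Awaitable[None]],
) -> Set[str]:
    """Resubscribe every handler; return the set of topics that failed."""
    failed_subscriptions: Set[str] = set()
    for topic, handlers in subscriptions.items():
        for handler in handlers:
            try:
                await subscribe(topic, handler)
            except (SubscribeError, RuntimeError):
                failed_subscriptions.add(topic)
                break  # skip remaining handlers for this topic
    return failed_subscriptions


calls: List[Tuple[str, str]] = []


async def fake_subscribe(topic: str, handler: str) -> None:
    # Record each attempt and fail for one specific topic.
    calls.append((topic, handler))
    if topic == "device/123/status":
        raise SubscribeError("connection unavailable")


failed = asyncio.run(
    resubscribe_all(
        {"device/123/status": ["a", "b", "c"], "device/123/info": ["d"]},
        fake_subscribe,
    )
)
print(failed, len(calls))
```

In this demo the failing topic is attempted once rather than three times, while the healthy topic is still processed.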

Commit: 7 commits total now

The AwsCrtError.name attribute may not always exist or could be None.
Added hasattr() checks before accessing e.name to prevent AttributeError.

Changes:
- mqtt_client.py: Added hasattr(e, 'name') check before comparing name
- mqtt_periodic.py: Added hasattr(e, 'name') check in exception handling

This prevents potential crashes when the name attribute is missing or None,
making error handling more robust.

Co-authored-by: GitHub Copilot <noreply@github.com>

eman commented Oct 27, 2025

Added Safety Checks for Exception Attributes

Fixed potential AttributeError when accessing AwsCrtError.name.

Issue

The code assumed that the AwsCrtError.name attribute always exists. If the attribute is missing, accessing it raises AttributeError; a None value simply fails the string comparison.

Fix

Added hasattr() checks before accessing the name attribute:

Before:

except AwsCrtError as e:
    if e.name == "AWS_ERROR_MQTT_CANCELLED_FOR_CLEAN_SESSION":
        # Handle it

After:

except AwsCrtError as e:
    if hasattr(e, "name") and e.name == "AWS_ERROR_MQTT_CANCELLED_FOR_CLEAN_SESSION":
        # Handle it

Files Updated

  • mqtt_client.py: Added safety check in publish error handling
  • mqtt_periodic.py: Added safety check in periodic request error handling

Benefits

  • Crash prevention: No AttributeError if name attribute is missing
  • Defensive coding: Handles edge cases gracefully
  • Same functionality: Works correctly when name attribute exists
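
For comparison, the getattr() form suggested in the review gives the same result as the hasattr() check in a single expression. The sketch uses `FakeCrtError` as a stand-in so it runs without awscrt installed, and the helper name is hypothetical:

```python
from typing import Optional

CLEAN_SESSION_CANCELLED = "AWS_ERROR_MQTT_CANCELLED_FOR_CLEAN_SESSION"


class FakeCrtError(Exception):
    """Stand-in for AwsCrtError that carries an optional name."""

    def __init__(self, name: Optional[str] = None) -> None:
        super().__init__(name)
        self.name = name


def is_clean_session_cancel(error: Exception) -> bool:
    """True if the error's name marks a clean-session cancellation."""
    # getattr with a default covers a missing attribute; a None name
    # simply compares unequal to the expected string.
    return getattr(error, "name", None) == CLEAN_SESSION_CANCELLED


print(is_clean_session_cancel(FakeCrtError(CLEAN_SESSION_CANCELLED)))
print(is_clean_session_cancel(FakeCrtError(None)))
print(is_clean_session_cancel(Exception("no name attribute")))
```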

Commit: 4fb0bf0

eman and others added 2 commits October 27, 2025 10:53
Extracted self._auth_client.current_tokens to a local variable to avoid
nested attribute access and improve readability.

Before:
  if (self._auth_client.current_tokens
      and self._auth_client.current_tokens.refresh_token):

After:
  current_tokens = self._auth_client.current_tokens
  if current_tokens and current_tokens.refresh_token:

This makes the code clearer and easier to understand.

Co-authored-by: GitHub Copilot <noreply@github.com>
…variables

Extracted complex boolean expression into clear, self-documenting variables:
- has_deep_reconnect: Check if deep reconnect function exists
- is_at_threshold: Check if attempts reached threshold
- is_threshold_multiple: Check if at threshold interval (10, 20, 30...)

This makes the logic much clearer and easier to understand at a glance.

Co-authored-by: GitHub Copilot <noreply@github.com>

eman merged commit 9f158d1 into main Oct 27, 2025
10 checks passed
eman deleted the fix/mqtt-reconnection-improvements branch October 27, 2025 17:57