feat: (Airbyte Python CDK) - extend the `error_message` info for the `ResponseAction.FAIL` request scenario #87

bazarnov · 2024-11-25T16:05:57Z

What

Resolving:

https://github.com/airbytehq/airbyte-internal-issues/issues/10680

How

extended the HttpClient._handle_error_resolution > error_message with response info for debugging purposes
apply filter_secrets to omit the user-provided secret values once appear suddenly
covered the change with test_error_codes_http_stream_error_resolution_with_response_secrets_filtered unit test

User Impact

No impact is expected, this is not a breaking change.

Summary by CodeRabbit

New Features
- Added a section in the README.md for code formatting instructions using the ruff tool.
- Introduced a method to extract and sanitize HTTP response bodies in the HttpClient class.
Bug Fixes
- Enhanced error handling to filter sensitive information from error messages in the HttpClient class.
Tests
- Implemented a new test to validate error handling for HTTP responses containing sensitive information.

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

airbyte_cdk/sources/streams/http/http_client.py (2)
339-361: Great addition! A few suggestions to make it even better, wdyt?

The method is well-documented and handles different response formats gracefully. However, I have a few suggestions:

Consider returning the raw JSON object instead of converting it to string immediately:
-            return str(response.json())
+            return response.json()
Maybe we could be more specific about the exception type on line 359?

Would it be helpful to log when both parsing attempts fail?

390-393: Consider response size impact on performance?

The enhanced error message with filtered response details is a great addition! However, for very large response bodies, including the entire response might impact performance. Would it make sense to add a size limit or truncation for large responses, wdyt?

Example approach:
-                filtered_response_message = filter_secrets(
-                    f"Request (body): '{str(request.body)}'. Response (body): '{self._get_response_body(response)}'. Response (headers): '{response.headers}'."
-                )
+                response_body = self._get_response_body(response)
+                if response_body and len(response_body) > 1000:  # configurable limit
+                    response_body = f"{response_body[:1000]}... (truncated)"
+                filtered_response_message = filter_secrets(
+                    f"Request (body): '{str(request.body)}'. Response (body): '{response_body}'. Response (headers): '{response.headers}'."
+                )

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ed5f654 and 310c5da.

📒 Files selected for processing (1)

airbyte_cdk/sources/streams/http/http_client.py (4 hunks)

🔇 Additional comments (1)

airbyte_cdk/sources/streams/http/http_client.py (1)

5-5: LGTM! Clean import additions.

The new imports for filter_secrets and JsonType are well-organized and align with the PR's objective of enhancing error messages while protecting sensitive data.

Also applies to: 54-54

maxi297

Can you should me the output of the console to confirm that we can see there error message? I fear that it would not work when running this in the platform

airbyte_cdk/sources/streams/http/http_client.py

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

airbyte_cdk/sources/streams/http/http_client.py (2)

338-360: Consider enhancing the response body handling implementation.

The implementation is solid, but here are some suggestions for improvement:

The return type hint JsonType might not be the most accurate since we're returning string representations. What do you think about using str instead?
The response.json() already returns a parsed object, but we're converting it back to string. Maybe we could optimize this? Here's a suggestion, wdyt?

-    def _get_response_body(self, response: requests.Response) -> Optional[JsonType]:
+    def _get_response_body(self, response: requests.Response) -> str:
     """
     Extracts and returns the body of an HTTP response.
 
     This method attempts to parse the response body as JSON. If the response
     body is not valid JSON, it falls back to decoding the response content
     as a UTF-8 string. If both attempts fail, it returns None.
 
     Args:
         response (requests.Response): The HTTP response object.
 
     Returns:
-        Optional[JsonType]: The parsed JSON object as a string, the decoded
-        response content as a string, or None if both parsing attempts fail.
+        str: A string representation of the response body or an error message
+        if decoding fails.
     """
     try:
-        return str(response.json())
+        return orjson.dumps(response.json()).decode('utf-8')
     except requests.exceptions.JSONDecodeError:
         try:
             return response.content.decode("utf-8")
         except Exception:
             return "The Content of the Response couldn't be decoded."

389-407: LGTM! Good improvements to error handling.

The changes effectively enhance error messages while ensuring sensitive data is filtered. The addition of emit_message() before raising the exception is a nice touch for visibility.

One tiny suggestion: Would it make sense to extract the error message construction into a separate private method to improve readability? Something like _construct_error_message(request, response, filtered_response_message), wdyt?

+    def _construct_error_message(
+        self,
+        request: requests.PreparedRequest,
+        response: requests.Response,
+        filtered_response_message: str
+    ) -> str:
+        return (
+            f"'{request.method}' request to '{request.url}' failed with "
+            f"status code '{response.status_code}' and error message: "
+            f"'{self._error_message_parser.parse_response_error_message(response)}'. "
+            f"{filtered_response_message}"
+        )

     def _handle_error_resolution(self, ...):
         if error_resolution.response_action == ResponseAction.FAIL:
             if response is not None:
                 filtered_response_message = filter_secrets(
                     f"Request (body): '{str(request.body)}'. Response (body): '{self._get_response_body(response)}'. Response (headers): '{response.headers}'."
                 )
-                error_message = f"'{request.method}' request to '{request.url}' failed with status code '{response.status_code}' and error message: '{self._error_message_parser.parse_response_error_message(response)}'. {filtered_response_message}"
+                error_message = self._construct_error_message(
+                    request, response, filtered_response_message
+                )

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 310c5da and 3199add.

📒 Files selected for processing (1)

airbyte_cdk/sources/streams/http/http_client.py (4 hunks)

🔇 Additional comments (1)

airbyte_cdk/sources/streams/http/http_client.py (1)

5-5: LGTM! Clean import additions.

The new imports are well-organized and necessary for the enhanced error handling functionality.

Also applies to: 54-54

airbyte_cdk/sources/streams/http/http_client.py

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

airbyte_cdk/sources/streams/http/http_client.py (1)
339-361: Consider improving JSON response handling, wdyt?

The method is well-documented and handles errors gracefully. However, str(response.json()) might not be the best way to handle JSON responses as it could:

Lose the JSON structure

Not handle complex objects consistently

What do you think about this alternative approach?
-            return str(response.json())
+            json_response = response.json()
+            return orjson.dumps(json_response).decode() if json_response else None
This would:

Preserve JSON structure

Handle None values gracefully

Use the same serialization method (orjson) that's already imported

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 3199add and f64ad36.

📒 Files selected for processing (1)

airbyte_cdk/sources/streams/http/http_client.py (4 hunks)

🔇 Additional comments (3)

airbyte_cdk/sources/streams/http/http_client.py (3)

5-6: LGTM! Clean import additions.

The new imports are well-organized and necessary for the enhanced error handling functionality.

Also applies to: 54-54

390-393: Great enhancement to error handling! 🎉

The changes effectively:

Filter sensitive information from responses
Provide more detailed error messages for debugging
Ensure error messages are visible in logs through emit_message()

This addresses the visibility concerns raised in previous reviews while maintaining security through secret filtering.

Also applies to: 399-409

390-392: Verify comprehensive secret filtering

Let's ensure the secret filtering catches all potential sensitive data in responses.

✅ Verification successful

Secret filtering looks good, but should we add a note about maximal matching? 🤔

The secret filtering implementation properly handles sensitive data in headers and responses by replacing them with "****". The filter_secrets function is used consistently across the codebase, including for response headers. However, there's a TODO comment about maximal matching that might be worth addressing:

The current implementation might have edge cases with overlapping secrets (e.g., "x" and "xk")
This could potentially lead to partial secret exposure in some cases

What do you think about adding a note to track this as a future improvement? The current implementation is safe for the common case, but being explicit about the edge case might help future maintainers. wdyt?

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential sensitive data patterns in error handling
# Look for patterns that might indicate sensitive data in tests or examples

# Check for sensitive patterns in test files
rg -i '(password|secret|token|key|auth).*response' test_http.py -A 5

# Check for examples of response headers containing sensitive data
rg -i 'authorization.*header' -g '!*.pyc'

Length of output: 3912

Script:

#!/bin/bash
# Let's check the implementation of the filter_secrets function to ensure it handles headers properly
ast-grep --pattern 'def filter_secrets'

# Also check for any test cases related to secret filtering
rg -i 'filter.*secret' -g '*.py' -A 5

# Look for any other sensitive data patterns in headers
rg -i '(api[-_]?key|token|secret|password|auth).*header' -g '*.py' -A 2

Length of output: 51046

Script:

#!/bin/bash
# Let's check the implementation of filter_secrets to understand how it handles sensitive data
ast-grep --pattern 'def filter_secrets'

# Also check for any global secrets or configuration that might be used in filtering
rg -i '__SECRETS_FROM_CONFIG' -g '*.py' -A 5

# Look for any tests that verify secret filtering behavior
rg -i 'test.*filter.*secret' -g '*test*.py' -A 5

Length of output: 2454

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

airbyte_cdk/sources/streams/http/http_client.py (2)
339-361: Consider optimizing the response parsing logic

The implementation is solid, but I have a couple of suggestions that might improve it further. What do you think about:

Returning the raw JSON instead of converting it to string? str(response.json()) might lose precision for numbers and booleans.

Adding debug logging when parsing fails to help with troubleshooting?

Here's a possible improvement:
 def _get_response_body(self, response: requests.Response) -> Optional[JsonType]:
     try:
-        return str(response.json())
+        return response.json()
     except requests.exceptions.JSONDecodeError:
+        self._logger.debug("Failed to parse response as JSON, attempting UTF-8 decode")
         try:
             return response.content.decode("utf-8")
         except Exception:
+            self._logger.debug("Failed to decode response content as UTF-8")
             return "The Content of the Response couldn't be decoded."
390-401: Great improvements to error handling! One tiny suggestion

The error handling looks solid! I particularly like how you've addressed the platform logging visibility issue and added comprehensive secret filtering.

Quick thought: Would it be clearer to structure the error message with newlines for better readability in logs? wdyt?

Here's an optional suggestion to improve log readability:
-                filtered_response_message = filter_secrets(f"Request (body): '{str(request.body)}'. Response (body): '{self._get_response_body(response)}'. Response (headers): '{response.headers}'.")
+                filtered_response_message = filter_secrets(
+                    "Details:\n"
+                    f"  Request body: '{str(request.body)}'\n"
+                    f"  Response body: '{self._get_response_body(response)}'\n"
+                    f"  Response headers: '{response.headers}'"
+                )

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between f64ad36 and af80a4a.

📒 Files selected for processing (1)

airbyte_cdk/sources/streams/http/http_client.py (4 hunks)

🔇 Additional comments (1)

airbyte_cdk/sources/streams/http/http_client.py (1)

5-5: LGTM! Clean import addition

The import of filter_secrets is well-placed and aligns with the PR objective of filtering sensitive information.

maxi297

LGTM! Thanks Baz

…ith-more-info

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

airbyte_cdk/sources/streams/http/http_client.py (2)
55-55: Duplicate import detected

AuthBase is already imported earlier in the file. Would you mind removing this duplicate import? wdyt?

341-363: Consider enhancing error handling and type annotations

A few suggestions to make this more robust:

The return type annotation suggests JsonType but we're returning str. Should we update the type to str?

Would it be helpful to log the actual decoding error for debugging purposes?

The hardcoded message could be more descriptive, maybe include the encoding that failed?

Here's a potential improvement, wdyt?
-    def _get_response_body(self, response: requests.Response) -> Optional[JsonType]:
+    def _get_response_body(self, response: requests.Response) -> str:
         try:
             return str(response.json())
         except requests.exceptions.JSONDecodeError:
             try:
                 return response.content.decode("utf-8")
-            except Exception:
-                return "The Content of the Response couldn't be decoded."
+            except Exception as e:
+                self._logger.debug(f"Failed to decode response content: {str(e)}")
+                return "Failed to decode response content (both JSON and UTF-8 attempts failed)"

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between af80a4a and 32f3a02.

📒 Files selected for processing (2)

airbyte_cdk/sources/streams/http/http_client.py (4 hunks)
unit_tests/sources/streams/http/test_http.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

unit_tests/sources/streams/http/test_http.py

🔇 Additional comments (1)

airbyte_cdk/sources/streams/http/http_client.py (1)

392-403: LGTM! Just a quick verification needed

The implementation looks solid! The error message now includes comprehensive debug information while filtering secrets.

One small thing: shall we verify that all potential sensitive data is being filtered? For instance, would you like me to help check if there are any other fields in the request/response that might contain sensitive information?

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

airbyte_cdk/sources/streams/http/http_client.py (2)

339-361: Consider enhancing the response body handling, wdyt?

A few suggestions to make this more robust:

The JSON response could be returned as-is instead of converting to string, preserving the original structure.
We might want to catch specific exceptions for better error handling.
The docstring could be more precise about the actual return types.

Here's a potential improvement:

    def _get_response_body(self, response: requests.Response) -> Optional[JsonType]:
        """
        Extracts and returns the body of an HTTP response.

        Args:
            response (requests.Response): The HTTP response object.

        Returns:
            Optional[JsonType]: Either:
            - A dict/list for JSON responses
            - A string for plain text responses
            - A default message if decoding fails
        """
        try:
-           return str(response.json())
+           return response.json()
        except requests.exceptions.JSONDecodeError:
            try:
                return response.content.decode("utf-8")
-           except Exception:
+           except UnicodeDecodeError:
                return "The Content of the Response couldn't be decoded."

390-393: Consider optimizing the error message construction, what do you think?

The filtered message includes both request and response bodies, which could be quite large. Should we consider truncating these for readability? Also, the structure could be more readable with line breaks.

Here's a suggestion:

-                filtered_response_message = filter_secrets(
-                    f"Request (body): '{str(request.body)}'. Response (body): '{self._get_response_body(response)}'. Response (headers): '{response.headers}'."
-                )
+                filtered_response_message = filter_secrets("\n".join([
+                    "Request body: " + str(request.body)[:1000] + ("..." if len(str(request.body)) > 1000 else ""),
+                    "Response body: " + str(self._get_response_body(response))[:1000] + ("..." if len(str(self._get_response_body(response))) > 1000 else ""),
+                    "Response headers: " + str(response.headers)
+                ]))

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 32f3a02 and 4a29cea.

📒 Files selected for processing (2)

airbyte_cdk/sources/streams/http/http_client.py (3 hunks)
unit_tests/sources/streams/http/test_http.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

unit_tests/sources/streams/http/test_http.py

🔇 Additional comments (2)

airbyte_cdk/sources/streams/http/http_client.py (2)

48-49: LGTM! Clean import addition.

The import of filter_secrets is well-placed and aligns with the security requirements.

399-401: Consider the duplicate logging behavior

The error message is logged here and will also be logged when the exception is printed. Should we consider removing this direct logging since the exception will handle it? Or is this intentional for immediate visibility?

Let's check how the exception logging is handled:

✅ Verification successful

Let me check how the exception handling and logging is implemented in this class.

The double logging is actually intentional here - what do you think about adding a quick comment to explain why? 🤔

Looking at the code, the explicit logging before raising the exception serves a specific purpose - it ensures the error is immediately visible in logs before any exception handling occurs. This is particularly useful in distributed systems where exception handling might happen at a different layer or service. The comment "ensure the exception message is emitted before raised" hints at this, but we could make it more explicit.

Maybe something like:

# Log error immediately for visibility in distributed systems, before exception handling occurs
self._logger.error(error_message)

wdyt about making this design choice more clear in the code? 🙂

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if MessageRepresentationAirbyteTracedErrors logging is handled elsewhere
rg -A 5 "MessageRepresentationAirbyteTracedErrors" --type py

Length of output: 4508

Script:

#!/bin/bash
# Check the implementation of the error handling in HttpClient
ast-grep --pattern 'class HttpClient {
  $$$
  def _handle_error_response($$$) {
    $$$
  }
  $$$
}'

Length of output: 103

Script:

#!/bin/bash
# Let's check the context around the error logging
rg -B 10 -A 5 "_logger.error\(error_message\)" --type py

Length of output: 1839

add more info to the error_message

5f1d8ce

bazarnov self-assigned this Nov 25, 2024

bazarnov requested a review from maxi297 November 25, 2024 16:06

github-actions bot added the enhancement New feature or request label Nov 25, 2024

This comment was marked as outdated.

Sign in to view

fix formatting

ed5f654

This comment was marked as outdated.

Sign in to view

fixed mypy bugs

310c5da

coderabbitai bot reviewed Nov 25, 2024

View reviewed changes

coderabbitai bot approved these changes Nov 25, 2024

View reviewed changes

maxi297 reviewed Nov 25, 2024

View reviewed changes

airbyte_cdk/sources/streams/http/http_client.py Outdated Show resolved Hide resolved

airbyte_cdk/sources/streams/http/http_client.py Show resolved Hide resolved

updated after the review

3199add

bazarnov requested a review from maxi297 November 26, 2024 13:55

coderabbitai bot reviewed Nov 26, 2024

View reviewed changes

maxi297 reviewed Nov 26, 2024

View reviewed changes

airbyte_cdk/sources/streams/http/http_client.py Outdated Show resolved Hide resolved

formatted

f64ad36

coderabbitai bot reviewed Nov 27, 2024

View reviewed changes

updated after the review

af80a4a

bazarnov requested a review from maxi297 November 27, 2024 23:18

coderabbitai bot reviewed Nov 27, 2024

View reviewed changes

maxi297 approved these changes Nov 28, 2024

View reviewed changes

Merge remote-tracking branch 'origin/main' into baz/cdk/extend-logs-w…

32f3a02

…ith-more-info

coderabbitai bot reviewed Nov 29, 2024

View reviewed changes

formatted

4a29cea

coderabbitai bot reviewed Nov 29, 2024

View reviewed changes

bazarnov merged commit 65ed26f into main Nov 29, 2024
22 of 23 checks passed

bazarnov deleted the baz/cdk/extend-logs-with-more-info branch November 29, 2024 12:23

feat: (Airbyte Python CDK) - extend the error_message info for the ResponseAction.FAIL request scenario #87

feat: (Airbyte Python CDK) - extend the error_message info for the ResponseAction.FAIL request scenario #87

Uh oh!

Conversation

bazarnov commented Nov 25, 2024 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

How

User Impact

Summary by CodeRabbit

Uh oh!

This comment was marked as outdated.

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

maxi297 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

maxi297 left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat: (Airbyte Python CDK) - extend the `error_message` info for the `ResponseAction.FAIL` request scenario #87

feat: (Airbyte Python CDK) - extend the `error_message` info for the `ResponseAction.FAIL` request scenario #87

bazarnov commented Nov 25, 2024 •

edited by coderabbitai bot

Loading