Skip to content

Conversation

bazarnov
Copy link
Contributor

@bazarnov bazarnov commented Nov 25, 2024

What

Resolving:

How

  • extended the HttpClient._handle_error_resolution > error_message with response info for debugging purposes
  • apply filter_secrets to omit the user-provided secret values once appear suddenly
  • covered the change with test_error_codes_http_stream_error_resolution_with_response_secrets_filtered unit test

User Impact

No impact is expected, this is not a breaking change.

Summary by CodeRabbit

  • New Features

    • Added a section in the README.md for code formatting instructions using the ruff tool.
    • Introduced a method to extract and sanitize HTTP response bodies in the HttpClient class.
  • Bug Fixes

    • Enhanced error handling to filter sensitive information from error messages in the HttpClient class.
  • Tests

    • Implemented a new test to validate error handling for HTTP responses containing sensitive information.

@bazarnov bazarnov self-assigned this Nov 25, 2024
@bazarnov bazarnov requested a review from maxi297 November 25, 2024 16:06
@github-actions github-actions bot added the enhancement New feature or request label Nov 25, 2024

This comment was marked as outdated.

coderabbitai[bot]

This comment was marked as outdated.

coderabbitai[bot]

This comment was marked as outdated.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
airbyte_cdk/sources/streams/http/http_client.py (2)

339-361: Great addition! A few suggestions to make it even better, wdyt?

The method is well-documented and handles different response formats gracefully. However, I have a few suggestions:

  1. Consider returning the raw JSON object instead of converting it to string immediately:
-            return str(response.json())
+            return response.json()
  1. Maybe we could be more specific about the exception type on line 359?
  2. Would it be helpful to log when both parsing attempts fail?

390-393: Consider response size impact on performance?

The enhanced error message with filtered response details is a great addition! However, for very large response bodies, including the entire response might impact performance. Would it make sense to add a size limit or truncation for large responses, wdyt?

Example approach:

-                filtered_response_message = filter_secrets(
-                    f"Request (body): '{str(request.body)}'. Response (body): '{self._get_response_body(response)}'. Response (headers): '{response.headers}'."
-                )
+                response_body = self._get_response_body(response)
+                if response_body and len(response_body) > 1000:  # configurable limit
+                    response_body = f"{response_body[:1000]}... (truncated)"
+                filtered_response_message = filter_secrets(
+                    f"Request (body): '{str(request.body)}'. Response (body): '{response_body}'. Response (headers): '{response.headers}'."
+                )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ed5f654 and 310c5da.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/streams/http/http_client.py (4 hunks)
🔇 Additional comments (1)
airbyte_cdk/sources/streams/http/http_client.py (1)

5-5: LGTM! Clean import additions.

The new imports for filter_secrets and JsonType are well-organized and align with the PR's objective of enhancing error messages while protecting sensitive data.

Also applies to: 54-54

Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you should me the output of the console to confirm that we can see there error message? I fear that it would not work when running this in the platform

@bazarnov bazarnov requested a review from maxi297 November 26, 2024 13:55
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
airbyte_cdk/sources/streams/http/http_client.py (2)

338-360: Consider enhancing the response body handling implementation.

The implementation is solid, but here are some suggestions for improvement:

  1. The return type hint JsonType might not be the most accurate since we're returning string representations. What do you think about using str instead?

  2. The response.json() already returns a parsed object, but we're converting it back to string. Maybe we could optimize this? Here's a suggestion, wdyt?

-    def _get_response_body(self, response: requests.Response) -> Optional[JsonType]:
+    def _get_response_body(self, response: requests.Response) -> str:
     """
     Extracts and returns the body of an HTTP response.
 
     This method attempts to parse the response body as JSON. If the response
     body is not valid JSON, it falls back to decoding the response content
     as a UTF-8 string. If both attempts fail, it returns None.
 
     Args:
         response (requests.Response): The HTTP response object.
 
     Returns:
-        Optional[JsonType]: The parsed JSON object as a string, the decoded
-        response content as a string, or None if both parsing attempts fail.
+        str: A string representation of the response body or an error message
+        if decoding fails.
     """
     try:
-        return str(response.json())
+        return orjson.dumps(response.json()).decode('utf-8')
     except requests.exceptions.JSONDecodeError:
         try:
             return response.content.decode("utf-8")
         except Exception:
             return "The Content of the Response couldn't be decoded."

389-407: LGTM! Good improvements to error handling.

The changes effectively enhance error messages while ensuring sensitive data is filtered. The addition of emit_message() before raising the exception is a nice touch for visibility.

One tiny suggestion: Would it make sense to extract the error message construction into a separate private method to improve readability? Something like _construct_error_message(request, response, filtered_response_message), wdyt?

+    def _construct_error_message(
+        self,
+        request: requests.PreparedRequest,
+        response: requests.Response,
+        filtered_response_message: str
+    ) -> str:
+        return (
+            f"'{request.method}' request to '{request.url}' failed with "
+            f"status code '{response.status_code}' and error message: "
+            f"'{self._error_message_parser.parse_response_error_message(response)}'. "
+            f"{filtered_response_message}"
+        )

     def _handle_error_resolution(self, ...):
         if error_resolution.response_action == ResponseAction.FAIL:
             if response is not None:
                 filtered_response_message = filter_secrets(
                     f"Request (body): '{str(request.body)}'. Response (body): '{self._get_response_body(response)}'. Response (headers): '{response.headers}'."
                 )
-                error_message = f"'{request.method}' request to '{request.url}' failed with status code '{response.status_code}' and error message: '{self._error_message_parser.parse_response_error_message(response)}'. {filtered_response_message}"
+                error_message = self._construct_error_message(
+                    request, response, filtered_response_message
+                )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 310c5da and 3199add.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/streams/http/http_client.py (4 hunks)
🔇 Additional comments (1)
airbyte_cdk/sources/streams/http/http_client.py (1)

5-5: LGTM! Clean import additions.

The new imports are well-organized and necessary for the enhanced error handling functionality.

Also applies to: 54-54

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
airbyte_cdk/sources/streams/http/http_client.py (1)

339-361: Consider improving JSON response handling, wdyt?

The method is well-documented and handles errors gracefully. However, str(response.json()) might not be the best way to handle JSON responses as it could:

  1. Lose the JSON structure
  2. Not handle complex objects consistently

What do you think about this alternative approach?

-            return str(response.json())
+            json_response = response.json()
+            return orjson.dumps(json_response).decode() if json_response else None

This would:

  • Preserve JSON structure
  • Handle None values gracefully
  • Use the same serialization method (orjson) that's already imported
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 3199add and f64ad36.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/streams/http/http_client.py (4 hunks)
🔇 Additional comments (3)
airbyte_cdk/sources/streams/http/http_client.py (3)

5-6: LGTM! Clean import additions.

The new imports are well-organized and necessary for the enhanced error handling functionality.

Also applies to: 54-54


390-393: Great enhancement to error handling! 🎉

The changes effectively:

  • Filter sensitive information from responses
  • Provide more detailed error messages for debugging
  • Ensure error messages are visible in logs through emit_message()

This addresses the visibility concerns raised in previous reviews while maintaining security through secret filtering.

Also applies to: 399-409


390-392: Verify comprehensive secret filtering

Let's ensure the secret filtering catches all potential sensitive data in responses.

✅ Verification successful

Secret filtering looks good, but should we add a note about maximal matching? 🤔

The secret filtering implementation properly handles sensitive data in headers and responses by replacing them with "****". The filter_secrets function is used consistently across the codebase, including for response headers. However, there's a TODO comment about maximal matching that might be worth addressing:

  • The current implementation might have edge cases with overlapping secrets (e.g., "x" and "xk")
  • This could potentially lead to partial secret exposure in some cases

What do you think about adding a note to track this as a future improvement? The current implementation is safe for the common case, but being explicit about the edge case might help future maintainers. wdyt?

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential sensitive data patterns in error handling
# Look for patterns that might indicate sensitive data in tests or examples

# Check for sensitive patterns in test files
rg -i '(password|secret|token|key|auth).*response' test_http.py -A 5

# Check for examples of response headers containing sensitive data
rg -i 'authorization.*header' -g '!*.pyc'

Length of output: 3912


Script:

#!/bin/bash
# Let's check the implementation of the filter_secrets function to ensure it handles headers properly
ast-grep --pattern 'def filter_secrets'

# Also check for any test cases related to secret filtering
rg -i 'filter.*secret' -g '*.py' -A 5

# Look for any other sensitive data patterns in headers
rg -i '(api[-_]?key|token|secret|password|auth).*header' -g '*.py' -A 2

Length of output: 51046


Script:

#!/bin/bash
# Let's check the implementation of filter_secrets to understand how it handles sensitive data
ast-grep --pattern 'def filter_secrets'

# Also check for any global secrets or configuration that might be used in filtering
rg -i '__SECRETS_FROM_CONFIG' -g '*.py' -A 5

# Look for any tests that verify secret filtering behavior
rg -i 'test.*filter.*secret' -g '*test*.py' -A 5

Length of output: 2454

@bazarnov bazarnov requested a review from maxi297 November 27, 2024 23:18
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
airbyte_cdk/sources/streams/http/http_client.py (2)

339-361: Consider optimizing the response parsing logic

The implementation is solid, but I have a couple of suggestions that might improve it further. What do you think about:

  1. Returning the raw JSON instead of converting it to string? str(response.json()) might lose precision for numbers and booleans.
  2. Adding debug logging when parsing fails to help with troubleshooting?

Here's a possible improvement:

 def _get_response_body(self, response: requests.Response) -> Optional[JsonType]:
     try:
-        return str(response.json())
+        return response.json()
     except requests.exceptions.JSONDecodeError:
+        self._logger.debug("Failed to parse response as JSON, attempting UTF-8 decode")
         try:
             return response.content.decode("utf-8")
         except Exception:
+            self._logger.debug("Failed to decode response content as UTF-8")
             return "The Content of the Response couldn't be decoded."

390-401: Great improvements to error handling! One tiny suggestion

The error handling looks solid! I particularly like how you've addressed the platform logging visibility issue and added comprehensive secret filtering.

Quick thought: Would it be clearer to structure the error message with newlines for better readability in logs? wdyt?

Here's an optional suggestion to improve log readability:

-                filtered_response_message = filter_secrets(f"Request (body): '{str(request.body)}'. Response (body): '{self._get_response_body(response)}'. Response (headers): '{response.headers}'.")
+                filtered_response_message = filter_secrets(
+                    "Details:\n"
+                    f"  Request body: '{str(request.body)}'\n"
+                    f"  Response body: '{self._get_response_body(response)}'\n"
+                    f"  Response headers: '{response.headers}'"
+                )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between f64ad36 and af80a4a.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/streams/http/http_client.py (4 hunks)
🔇 Additional comments (1)
airbyte_cdk/sources/streams/http/http_client.py (1)

5-5: LGTM! Clean import addition

The import of filter_secrets is well-placed and aligns with the PR objective of filtering sensitive information.

Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks Baz

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
airbyte_cdk/sources/streams/http/http_client.py (2)

55-55: Duplicate import detected

AuthBase is already imported earlier in the file. Would you mind removing this duplicate import? wdyt?


341-363: Consider enhancing error handling and type annotations

A few suggestions to make this more robust:

  1. The return type annotation suggests JsonType but we're returning str. Should we update the type to str?
  2. Would it be helpful to log the actual decoding error for debugging purposes?
  3. The hardcoded message could be more descriptive, maybe include the encoding that failed?

Here's a potential improvement, wdyt?

-    def _get_response_body(self, response: requests.Response) -> Optional[JsonType]:
+    def _get_response_body(self, response: requests.Response) -> str:
         try:
             return str(response.json())
         except requests.exceptions.JSONDecodeError:
             try:
                 return response.content.decode("utf-8")
-            except Exception:
-                return "The Content of the Response couldn't be decoded."
+            except Exception as e:
+                self._logger.debug(f"Failed to decode response content: {str(e)}")
+                return "Failed to decode response content (both JSON and UTF-8 attempts failed)"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between af80a4a and 32f3a02.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/streams/http/http_client.py (4 hunks)
  • unit_tests/sources/streams/http/test_http.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • unit_tests/sources/streams/http/test_http.py
🔇 Additional comments (1)
airbyte_cdk/sources/streams/http/http_client.py (1)

392-403: LGTM! Just a quick verification needed

The implementation looks solid! The error message now includes comprehensive debug information while filtering secrets.

One small thing: shall we verify that all potential sensitive data is being filtered? For instance, would you like me to help check if there are any other fields in the request/response that might contain sensitive information?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
airbyte_cdk/sources/streams/http/http_client.py (2)

339-361: Consider enhancing the response body handling, wdyt?

A few suggestions to make this more robust:

  1. The JSON response could be returned as-is instead of converting to string, preserving the original structure.
  2. We might want to catch specific exceptions for better error handling.
  3. The docstring could be more precise about the actual return types.

Here's a potential improvement:

    def _get_response_body(self, response: requests.Response) -> Optional[JsonType]:
        """
        Extracts and returns the body of an HTTP response.

        Args:
            response (requests.Response): The HTTP response object.

        Returns:
            Optional[JsonType]: Either:
            - A dict/list for JSON responses
            - A string for plain text responses
            - A default message if decoding fails
        """
        try:
-           return str(response.json())
+           return response.json()
        except requests.exceptions.JSONDecodeError:
            try:
                return response.content.decode("utf-8")
-           except Exception:
+           except UnicodeDecodeError:
                return "The Content of the Response couldn't be decoded."

390-393: Consider optimizing the error message construction, what do you think?

The filtered message includes both request and response bodies, which could be quite large. Should we consider truncating these for readability? Also, the structure could be more readable with line breaks.

Here's a suggestion:

-                filtered_response_message = filter_secrets(
-                    f"Request (body): '{str(request.body)}'. Response (body): '{self._get_response_body(response)}'. Response (headers): '{response.headers}'."
-                )
+                filtered_response_message = filter_secrets("\n".join([
+                    "Request body: " + str(request.body)[:1000] + ("..." if len(str(request.body)) > 1000 else ""),
+                    "Response body: " + str(self._get_response_body(response))[:1000] + ("..." if len(str(self._get_response_body(response))) > 1000 else ""),
+                    "Response headers: " + str(response.headers)
+                ]))
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 32f3a02 and 4a29cea.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/streams/http/http_client.py (3 hunks)
  • unit_tests/sources/streams/http/test_http.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • unit_tests/sources/streams/http/test_http.py
🔇 Additional comments (2)
airbyte_cdk/sources/streams/http/http_client.py (2)

48-49: LGTM! Clean import addition.

The import of filter_secrets is well-placed and aligns with the security requirements.


399-401: Consider the duplicate logging behavior

The error message is logged here and will also be logged when the exception is printed. Should we consider removing this direct logging since the exception will handle it? Or is this intentional for immediate visibility?

Let's check how the exception logging is handled:

✅ Verification successful

Let me check how the exception handling and logging is implemented in this class.


The double logging is actually intentional here - what do you think about adding a quick comment to explain why? 🤔

Looking at the code, the explicit logging before raising the exception serves a specific purpose - it ensures the error is immediately visible in logs before any exception handling occurs. This is particularly useful in distributed systems where exception handling might happen at a different layer or service. The comment "ensure the exception message is emitted before raised" hints at this, but we could make it more explicit.

Maybe something like:

# Log error immediately for visibility in distributed systems, before exception handling occurs
self._logger.error(error_message)

wdyt about making this design choice more clear in the code? 🙂

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if MessageRepresentationAirbyteTracedErrors logging is handled elsewhere
rg -A 5 "MessageRepresentationAirbyteTracedErrors" --type py

Length of output: 4508


Script:

#!/bin/bash
# Check the implementation of the error handling in HttpClient
ast-grep --pattern 'class HttpClient {
  $$$
  def _handle_error_response($$$) {
    $$$
  }
  $$$
}'

Length of output: 103


Script:

#!/bin/bash
# Let's check the context around the error logging
rg -B 10 -A 5 "_logger.error\(error_message\)" --type py

Length of output: 1839

@bazarnov bazarnov merged commit 65ed26f into main Nov 29, 2024
22 of 23 checks passed
@bazarnov bazarnov deleted the baz/cdk/extend-logs-with-more-info branch November 29, 2024 12:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants