
Is Error Handling example up to date? #170

Closed
ankurchavda opened this issue Oct 11, 2021 · 16 comments

@ankurchavda commented Oct 11, 2021

is_retryable does not seem to be available for use. I can see is_transient. Should we use that, or am I missing something?
Also, the exception caught is couchbase.exceptions.CouchbaseError; hasn't this changed to couchbase.exceptions.CouchbaseException?

@osfameron self-assigned this Oct 11, 2021
@osfameron (Collaborator)

Thanks @ankurchavda, can you confirm which docs URL you're following, and which SDK/Server versions you're on, and we'll take a look!

@ankurchavda (Author) commented Oct 11, 2021

@osfameron The one on the official website - https://docs.couchbase.com/python-sdk/current/howtos/error-handling.html.

And the GitHub code for this is:

# This an example of error handling for idempotent operations (such as the full-doc op seen here).
def change_email(collection,  # type: Collection
                 maxRetries   # type: int
                 ):
    try:
        result = collection.get("doc_id")  # type: GetResult
        if not result:
            raise couchbase.exceptions.KeyNotFoundException()
        else:
            content = result.content
            content["email"]="john.smith@couchbase.com"
            collection.replace("doc_id", content);
    except couchbase.exceptions.CouchbaseError as err:
        # isRetryable will be true for transient errors, such as a CAS mismatch (indicating
        # another agent concurrently modified the document), or a temporary failure (indicating
        # the server is temporarily unavailable or overloaded). The operation may or may not
        # have been written, but since it is idempotent we can simply retry it.
        if err.is_retryable:
            if maxRetries > 0:
                logging.info("Retrying operation on retryable err " + err)
                change_email(collection, maxRetries - 1);
            else:
                # Errors can be transient but still exceed our SLA.
                logging.error("Too many attempts, aborting on err " + err)
                raise
        # If the err is not isRetryable, there is perhaps a more permanent or serious error,
        # such as a network failure.
        else:
            logging.error("Aborting operation on err " + err);
            raise

MAX_RETRIES=5

try:
    change_email(collection, MAX_RETRIES)
except RuntimeError as err:
    # What to do here is highly application dependent. Options could include:
    # - Returning a "please try again later" error back to the end-user (if any)
    # - Logging it for manual human review, and possible follow-up with the end-user (if any)
    logging.error("Failed to change email")

@osfameron (Collaborator)

OK, that page definitely needs an update, thanks for spotting it. I've opened https://issues.couchbase.com/browse/DOC-9281 to track this internally, and have added it to this sprint.

In the meantime:

  • CouchbaseException is correct
  • is_transient is the original method; is_retryable was indeed added as an alias of it, and later removed

So it looks like you're good to go! Let me know if you have any other questions.
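
For illustration, a hedged sketch of what the quoted snippet might look like with the updated exception name. This is not the official docs code: the is_transient check is read defensively via getattr, since its availability in the 3.x SDK is questioned later in this thread.

import logging

from couchbase.exceptions import CouchbaseException

def change_email(collection, max_retries):
    try:
        result = collection.get("doc_id")
        content = result.content_as[dict]
        content["email"] = "john.smith@couchbase.com"
        collection.replace("doc_id", content)
    except CouchbaseException as err:
        # Read is_transient defensively; not every 3.x exception may expose it.
        if getattr(err, "is_transient", False) and max_retries > 0:
            logging.info("Retrying operation on transient error: %s", err)
            change_email(collection, max_retries - 1)
        else:
            logging.error("Aborting operation on error: %s", err)
            raise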

@ankurchavda (Author)

Thanks for the quick action @osfameron. I am trying to add proper error handling to my code. Could you please check whether the code below looks good?

def couchbase_connection(username, password, host, bucket='bucketname', max_retries=5, back_off=0.1):
    """
    Create couchbase connection

        Parameters:
            username: user id
            password: password
            host: connection string
            bucket (string): Name of the couchbase bucket
            max_retries (int): An integer
            back_off (int): An integer

        Returns:
            collection (couchbase.collections): A couchbase collection object

    """
    try:
        log.info("Creating couchbase connection.")
        # Setting timeout options for the cluster object. kv_timeout is for key-value operations
        # like get() and other mutation commands. Similarly, query_timeout is for N1QL queries.
        options = ClusterTimeoutOptions(kv_timeout=timedelta(
            seconds=60), query_timeout=timedelta(seconds=120))

        cluster = Cluster('couchbase://' + host, ClusterOptions(
            PasswordAuthenticator(username, password), timeout_options=options))

        bucket = cluster.bucket('matrixDB')
        collection = bucket.default_collection()
        log.info("Connection created.")
        return collection

    # Transient errors are ones that resolve automatically; is_retryable is provided
    # for such errors, so a retry mechanism is added when the error is retryable.
    except couchbase_exceptions.CouchbaseTransientException as err:
        log.error(f"Encountered a transient error: {err}")
        if max_retries > 0:
            log.info(f"Retries left: {max_retries}")
            log.info(f"Backing Off: {back_off} seconds")
            sleep(back_off)
            couchbase_connection(username, password, bucket,
                                 max_retries=max_retries-1, back_off=back_off*2)
        else:
            log.error(
                f"Too many attempts to create connection with couchbase: {err}")
            raise
    # catch generic couchbase exception
    except couchbase_exceptions.CouchbaseException as err:
        log.error(f"Unexpected error {err}")

So I have written the above function to create a couchbase connection. I catch transient errors there and retry the connection creation. Should I also add an is_transient check under CouchbaseException? Sometimes connection creation fails with LCB_ERR_AUTHENTICATION_FAILURE (206), and that is not caught under CouchbaseTransientException.

@osfameron (Collaborator)

Hopefully I'll speak to our Python SDK specialist soon and get them to look over it (which will also help me improve and update the documentation!)

@ankurchavda (Author)

Sure that would be great, thank you!

@thejcfactor (Contributor)

Hi @ankurchavda - The is_retryable and is_transient properties are not actually available within the 3.x SDK. The doc strings need to be updated to reflect what is actually available (I will look to create a ticket to track these changes accordingly). When it comes to retrying, you can definitely retry on CouchbaseTransientException. The authentication exception is trickier: since the connection process is relatively expensive, it will depend on how you want the application to behave in terms of retrying or not (i.e. if you expect incorrect credentials fairly often, it is probably not worth retrying; let the user know there was a problem connecting).

Something I would note with your current solution is that you may want to make a retry function that can behave as a decorator; that way you can reuse the retry mechanism for other operations. Also, the couchbase_connection() call in the exception path does not return anything. Wrapping the connection via a decorator would also help solve this problem. I will try to provide a sample of what I mean by tomorrow.

@ankurchavda (Author) commented Oct 13, 2021

Hi @thejcfactor - The authentication exception came in despite the credentials being correct. It was resolved on a retry, but I had to specifically catch AuthenticationException and retry there. On that retry, another exception was thrown, LCB_ERROR_UNKNOWN_HOST (1049) (image attached below). The issue is that neither of them is caught under CouchbaseTransientException, which causes the program to fail. Is there a more robust way to handle these kinds of errors that can be resolved via retries?

Also, it would be great if you could share the sample you mentioned.

[screenshot of the error traceback attached]

@thejcfactor (Contributor) commented Oct 14, 2021

Hmmm, there seems to be more going on in terms of the overall cluster setup if you are seeing intermittent connection issues. I would suggest trying out SDK doctor to help diagnose the connection.

Below is some sample code that demonstrates what I mean. You can apply the allow_retries decorator to any method that you would like to retry. You can also change the decorator signature to add more functionality (for instance, pass in a dict that maps exception types to a function for how you want to handle each type... lots of possibilities :)).

import functools
import time
from datetime import timedelta
from typing import Optional, Tuple, Callable

from couchbase.cluster import Cluster, ClusterOptions, ClusterTimeoutOptions
from couchbase.auth import PasswordAuthenticator
from couchbase.exceptions import AuthenticationException, CouchbaseTransientException


def allow_retries(retry_limit=3,                # type: int
                  backoff=1.0,                  # type: float
                  exponential_backoff=False,    # type: bool
                  allowed_exceptions=None       # type: Optional[Tuple]
                  ) -> Callable:
    def handle_retries(func):
        @functools.wraps(func)
        def func_wrapper(*args, **kwargs):
            for i in range(retry_limit):
                try:
                    return func(*args, **kwargs)
                except Exception as ex:
                    if allowed_exceptions is None or not isinstance(ex, allowed_exceptions):
                        raise

                    if (retry_limit-1)-i == 0:
                        raise

                    delay = backoff
                    if exponential_backoff is True:
                        delay *= i
                    print(f"Retries left: {(retry_limit-1) - i}")
                    print(f"Backing Off: {delay} seconds")
                    time.sleep(delay)
        return func_wrapper
    return handle_retries


@allow_retries(retry_limit=5,
               backoff=0.5,
               allowed_exceptions=(CouchbaseTransientException, AuthenticationException))
def couchbase_connect(host,         # type: str
                      username,     # type: str
                      pw,           # type: str
                      bucket_name  # type: str
                      ) -> Tuple:
    options = ClusterTimeoutOptions(kv_timeout=timedelta(seconds=60),
                                    query_timeout=timedelta(seconds=120))

    cluster = Cluster(host, ClusterOptions(
        PasswordAuthenticator(username, pw), timeout_options=options))

    bucket = cluster.bucket(bucket_name)
    collection = bucket.default_collection()
    return cluster, bucket, collection


def run_sample_code():
    try:
        _, _, collection = couchbase_connect("couchbase://localhost",
                                             "Administrator", "password", "default")
        _ = collection.upsert("testKey", {"info": "is a test doc"})
        res = collection.get("testKey")
        print(f"Got: {res.content_as[dict]}")

    except Exception as ex:
        import traceback
        traceback.print_exc()


if __name__ == '__main__':
    run_sample_code()

@ankurchavda (Author)

Hi @thejcfactor, sadly I can't use SDK Doctor for the time being. But the code that you shared is very helpful; I have added this decorator to my code for retrying after exceptions. I will let you know how the retry robustness changes after using it, but the code has definitely become cleaner. Thanks.

@ankurchavda (Author)

Hi @thejcfactor - I used the base exception CouchbaseException for the couchbase_connect function:

@allow_retries(backoff=0.5, exponential_backoff=True, allowed_exceptions=(CouchbaseException))

The allow_retries decorator did not seem to catch AuthenticationException or TimeoutException.

I was expecting that, since CouchbaseException is the base class, if allowed_exceptions is None or not isinstance(err, allowed_exceptions): would consider AuthenticationException an instance of CouchbaseException. My understanding might be incorrect, and sorry for digressing and asking a Python-related question, but will I have to catch and retry the exact exceptions using this decorator?

@thejcfactor (Contributor)

Hi @ankurchavda - I apologize for the delayed response (something is up with my notifications). I imagine you have figured out how to alter the helper method already. Just to close the loop: you would either need to pass in the specific exceptions you want to allow when using isinstance(), or you could use issubclass(). I would lean toward being more explicit and passing in a tuple of all the errors you want to allow; IMHO, that makes the code clearer.
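
For illustration, a minimal sketch of the explicit-tuple approach, reusing the allow_retries decorator from the sample above. The exception names come from the 3.x SDK; the function body is a placeholder, not a working implementation.

from couchbase.exceptions import (AuthenticationException,
                                  CouchbaseTransientException,
                                  TimeoutException)

# List every concrete exception type you are willing to retry, rather than a single base class.
RETRYABLE_EXCEPTIONS = (CouchbaseTransientException,
                        AuthenticationException,
                        TimeoutException)

@allow_retries(retry_limit=5,
               backoff=0.5,
               exponential_backoff=True,
               allowed_exceptions=RETRYABLE_EXCEPTIONS)
def couchbase_connect(host, username, pw, bucket_name):
    # same connection logic as in the earlier sample
    ...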

@ankurchavda (Author)

Hi @thejcfactor, no problem at all. Yes, I did figure it out eventually. Just as you suggested, I explicitly passed the errors I wanted to handle. The program is working fine now. Thanks for all the help :D.

By the way, any update on when the official documentation will be updated?

@thejcfactor (Contributor)

Hi @ankurchavda - Good to hear!

I am planning to make updates to the docs within the coming weeks (prior to the US holidays).

@ankurchavda (Author) commented Dec 21, 2021

Hi @thejcfactor ,

Hope you are having a nice time :D

Apologies for asking a different question under this issue; I don't see an option to open an issue in the Python SDK repo.

I wanted to understand a little more about the query_timeout option.

I set the option to 120 seconds in ClusterTimeoutOptions(query_timeout=timedelta(seconds=120)) as a default. However, I change it in QueryOptions(timeout=timedelta(seconds=240)) if my query is going to take longer than 120. What I see happening is that it still times out after 120 seconds. I checked the API code and this is what is mentioned:

Timeouts may also be adjusted on a per-query basis by setting the
:attr:timeout property in the options to the n1ql_query method.
The effective timeout is either the per-query timeout or the global timeout,
whichever is lower.

Why "whichever is lower"? Shouldn't preference ideally be given to the lowest-level setting, i.e. the timeout in the query options? Otherwise I might never be able to specify a different timeout for a particular query if it exceeds the global timeout.

Also, I set QueryOptions(timeout=timedelta(seconds=120)) and ClusterTimeoutOptions(query_timeout=timedelta(seconds=240)). The query executed with 240 as the timeout value, even though the API says it will take the lower of the two. Shouldn't it then have used 120 seconds as the timeout and timed out? It seems preference is given to the global setting in all scenarios.

I am using couchbase==3.1.2
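
For reference, a minimal sketch of the two settings being compared here, using a hypothetical host, credentials, bucket, and query; imports follow the 3.x SDK layout used earlier in this thread.

from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster, ClusterOptions, ClusterTimeoutOptions, QueryOptions

# Cluster-wide default: N1QL queries time out after 120 seconds.
timeouts = ClusterTimeoutOptions(query_timeout=timedelta(seconds=120))
cluster = Cluster("couchbase://localhost",
                  ClusterOptions(PasswordAuthenticator("Administrator", "password"),
                                 timeout_options=timeouts))

# Per-query override: request a 240-second timeout for one long-running query.
result = cluster.query("SELECT * FROM `default` LIMIT 1",
                       QueryOptions(timeout=timedelta(seconds=240)))
for row in result:
    print(row)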

Also, do let me know if there is a better place to raise this issue, will be happy to do it :D

@osfameron (Collaborator)

The initial error is fixed and merged; it should appear on our docs site at the next build.

@ankurchavda sorry, I just noticed your last question - is that still an open issue?
I'm going to close this ticket, as the original problem is resolved.

If you still have an issue, you could open a new ticket, or, as this looks like more of a discussion point, https://forums.couchbase.com/ might be more appropriate.
