Catch UnicodeDecodeError when opening corrupt beat-schedule.db #8806

andyzickler · 2024-01-19T00:56:19Z

There is existing code to detect if celerybeat-schedule.db is corrupted and recreate it, however sometimes a UnicodeDecodeError is thrown in the process of throwing the KeyError. This catches that error and allows Beat to use the existing code to recreate the database.

(Fixes #2907) This issue was closed because the workaround posted helped the user, but the underlying issue was not resolved.

Example of a stack trace this fixes:

[2024-01-19 00:54:01,637: CRITICAL/MainProcess] beat raised exception <class 'UnicodeDecodeError'>: UnicodeDecodeError('ascii', b'\x07\xe4\x02\x0c\x13\x11\x00\t\x11)', 1, 2, 'ordinal not in range(128)')
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/shelve.py", line 111, in __getitem__
    value = self.cache[key]
            ~~~~~~~~~~^^^^^
KeyError: 'entries'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/andy/src/django/venv3.11/lib/python3.11/site-packages/celery/apps/beat.py", line 113, in start_scheduler
    service.start()
  File "/Users/andy/src/django/venv3.11/lib/python3.11/site-packages/celery/beat.py", line 634, in start
    humanize_seconds(self.scheduler.max_interval))
                     ^^^^^^^^^^^^^^
  File "/Users/andy/src/django/venv3.11/lib/python3.11/site-packages/kombu/utils/objects.py", line 40, in __get__
    return super().__get__(instance, owner)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/Users/andy/src/django/venv3.11/lib/python3.11/site-packages/celery/beat.py", line 677, in scheduler
    return self.get_scheduler()
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/andy/src/django/venv3.11/lib/python3.11/site-packages/celery/beat.py", line 668, in get_scheduler
    return symbol_by_name(self.scheduler_cls, aliases=aliases)(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/andy/src/django/venv3.11/lib/python3.11/site-packages/celery/beat.py", line 513, in __init__
    super().__init__(*args, **kwargs)
  File "/Users/andy/src/django/venv3.11/lib/python3.11/site-packages/celery/beat.py", line 264, in __init__
    self.setup_schedule()
  File "/Users/andy/src/django/venv3.11/lib/python3.11/site-packages/celery/beat.py", line 541, in setup_schedule
    self._create_schedule()
  File "/Users/andy/src/django/venv3.11/lib/python3.11/site-packages/celery/beat.py", line 570, in _create_schedule
    self._store['entries']
    ~~~~~~~~~~~^^^^^^^^^^^
  File "/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/shelve.py", line 114, in __getitem__
    value = Unpickler(f).load()
            ^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 1: ordinal not in range(128)

auvipy

interesting catch! can we add some unit test for it? although we can go without it as well for this case

codecov · 2024-01-19T09:59:18Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (d7700e2) 81.19% compared to head (2cebf34) 81.22%.
Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8806      +/-   ##
==========================================
+ Coverage   81.19%   81.22%   +0.03%     
==========================================
  Files         148      148              
  Lines       18523    18523              
  Branches     3165     3165              
==========================================
+ Hits        15039    15045       +6     
+ Misses       3196     3191       -5     
+ Partials      288      287       -1

Flag	Coverage Δ
unittests	`81.20% <100.00%> (+0.03%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Nusnus

LGTM, On the same page as @auvipy

andyzickler · 2024-01-19T14:01:07Z

Thanks for checking. I'll see if i can make a test. It doesn't look like there is one for PersistentScheduler._create_schedule, but there is one for setup_schedule that I think I can use as a reference.

There is existing code to detect if celerybeat-schedule.db is corrupted and recreate it, however sometimes a UnicodeDecodeError or TypeError is thrown in the process of throwing the KeyError. This catches that error and allows Beat to use the existing code to recreate the database. (Fixes celery#2907)

Nusnus · 2024-01-19T23:44:55Z

CI approved to run - it will take time though.
Should start in about 1-2h.

andyzickler · 2024-01-19T23:46:42Z

@Nusnus @auvipy I've added two tests that exercise how _create_schedule handles errors. I also added a catch for TypeError as I found that exception would sometimes happen when opening a corrupt database. Here's the trace for that

KeyError: 'entries'
  File "shelve.py", line 111, in __getitem__
    value = self.cache[key]
TypeError: cannot unpack non-iterable ellipsis object
  File "celery/apps/beat.py", line 113, in start_scheduler
    service.start()
  File "celery/beat.py", line 634, in start
    humanize_seconds(self.scheduler.max_interval))
  File "kombu/utils/objects.py", line 31, in __get__
    return super().__get__(instance, owner)
  File "functools.py", line 1001, in __get__
    val = self.func(instance)
  File "celery/beat.py", line 677, in scheduler
    return self.get_scheduler()
  File "celery/beat.py", line 668, in get_scheduler
    return symbol_by_name(self.scheduler_cls, aliases=aliases)(
  File "celery/beat.py", line 513, in __init__
    super().__init__(*args, **kwargs)
  File "celery/beat.py", line 264, in __init__
    self.setup_schedule()
  File "celery/beat.py", line 541, in setup_schedule
    self._create_schedule()
  File "celery/beat.py", line 570, in _create_schedule
    self._store['entries']
  File "shelve.py", line 113, in __getitem__
    f = BytesIO(self.dict[key.encode(self.keyencoding)])
  File "dbm/dumb.py", line 148, in __getitem__
    pos, siz = self._index[key]     # may raise KeyError

It seems a little smelly that we're catching a few different exception types, but I'm not sure of a better method. It feels like shelve's parsing code could be a little more robust.

andyzickler · 2024-01-19T23:58:53Z

Hmm, i see that setup_schedule catches any exception and then destroys the database.

celery/celery/beat.py

Lines 529 to 539 in 8f38999

    
           def setup_schedule(self): 
        
               try: 
        
                   self._store = self._open_schedule() 
        
                   # In some cases there may be different errors from a storage 
        
                   # backend for corrupted files.  Example - DBPageNotFoundError 
        
                   # exception from bsddb.  In such case the file will be 
        
                   # successfully opened but the error will be raised on first key 
        
                   # retrieving. 
        
                   self._store.keys() 
        
               except Exception as exc:  # pylint: disable=broad-except 
        
                   self._store = self._destroy_open_corrupted_schedule(exc)

Should I change _create_schedule to work the same? What do you think? I don't really like blanket catching, but there might be some other parsing error that we don't know about yet.

auvipy · 2024-01-23T08:45:02Z

Should I change _create_schedule to work the same? What do you think? I don't really like blanket catching, but there might be some other parsing error that we don't know about yet.

would love to see how that works. but not fully sure now. may be in another PR? or you want that to try here?

andyzickler · 2024-01-23T15:58:20Z

Let's go with what I've got here for now. I think i have everything covered and we can add more catches later if they appear. I don't want to catch an exception that can't actually be resolved by recreating the database, like an IO error or something.

With that question resolved, I think this is good to merge. 🎉

Nusnus · 2024-01-23T20:30:16Z

Let's go with what I've got here for now. I think i have everything covered and we can add more catches later if they appear. I don't want to catch an exception that can't actually be resolved by recreating the database, like an IO error or something.

With that question resolved, I think this is good to merge. 🎉

LGTM, your call on the merge @auvipy

auvipy · 2024-01-23T21:08:48Z

thanks guys!

auvipy self-requested a review January 19, 2024 09:33

auvipy reviewed Jan 19, 2024

View reviewed changes

auvipy added this to the 5.4 milestone Jan 19, 2024

Nusnus approved these changes Jan 19, 2024

View reviewed changes

andyzickler force-pushed the fix/celerybeat-schedule-unicodedecodeerror branch from fb1951e to 2cebf34 Compare January 19, 2024 23:42

auvipy approved these changes Jan 23, 2024

View reviewed changes

auvipy merged commit 3e98049 into celery:main Jan 23, 2024
82 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Catch UnicodeDecodeError when opening corrupt beat-schedule.db #8806

Catch UnicodeDecodeError when opening corrupt beat-schedule.db #8806

andyzickler commented Jan 19, 2024

auvipy left a comment

codecov bot commented Jan 19, 2024 •

edited

Nusnus left a comment

andyzickler commented Jan 19, 2024

Nusnus commented Jan 19, 2024

andyzickler commented Jan 19, 2024

andyzickler commented Jan 19, 2024

auvipy commented Jan 23, 2024

andyzickler commented Jan 23, 2024

Nusnus commented Jan 23, 2024

auvipy commented Jan 23, 2024

Catch UnicodeDecodeError when opening corrupt beat-schedule.db #8806

Catch UnicodeDecodeError when opening corrupt beat-schedule.db #8806

Conversation

andyzickler commented Jan 19, 2024

auvipy left a comment

Choose a reason for hiding this comment

codecov bot commented Jan 19, 2024 • edited

Codecov Report

Nusnus left a comment

Choose a reason for hiding this comment

andyzickler commented Jan 19, 2024

Nusnus commented Jan 19, 2024

andyzickler commented Jan 19, 2024

andyzickler commented Jan 19, 2024

auvipy commented Jan 23, 2024

andyzickler commented Jan 23, 2024

Nusnus commented Jan 23, 2024

auvipy commented Jan 23, 2024

codecov bot commented Jan 19, 2024 •

edited