Cache gets corrupted #12

Open
CourchesneA opened this issue May 24, 2022 · 7 comments

Comments

@CourchesneA

Hello,

I can't find exactly what is causing it or when it happens, but about once a week the cache file gets corrupted and I have to manually delete it. Any idea what could be happening ?

Traceback (most recent call last):
  File "telemetry_server.py", line 289, in upload_cached_records
    cache_object = self.cache_queue.get(block=False)
  File "/usr/lib/python3.8/queue.py", line 180, in get
    item = self._get()
  File "/usr/local/lib/python3.8/dist-packages/pqueue/pqueue.py", line 87, in _get
    data = pickle.load(self.tailf)
_pickle.UnpicklingError: invalid load key, '\x00'.
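
For context, this is roughly the shape of the surrounding code; only the queue handling mirrors the real thing, the path and the upload step are placeholders:

    # Rough sketch of the drain loop from the traceback (illustrative only).
    import queue

    from pqueue import Queue

    cache_queue = Queue("/var/cache/telemetry")   # illustrative path

    def upload(record):
        print("uploading", record)                # placeholder for the real upload

    def upload_cached_records():
        while True:
            try:
                cache_object = cache_queue.get(block=False)
            except queue.Empty:
                return                            # nothing left to upload
            # The UnpicklingError above is raised from inside get(), so it
            # escapes this loop and the cache file has to be deleted by hand.
            upload(cache_object)
            cache_queue.task_done()               # persist the consumed position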
@CourchesneA
Author

Just a follow-up: I have been running into additional pickle errors related to cache corruption on pqueue.get / pickle.load:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 1: invalid start byte
EOFError: Ran out of input

@zpqrtbnk

Confirmed / reproduced.
The queue, in a pretty standard usage scenario, does get corrupted at times.
Will try and investigate.

@balena
Owner

balena commented Jan 28, 2024

Please provide more information to best diagnose the problem.

As described in the README, pqueue relies on atomic operations provided by the filesystem. That means the info does not get updated "in place"; instead, pqueue writes the new file in the tempdir (please check the constructor) and then moves it over. But this is atomic if and only if the origin and destination are on the same filesystem.

So, say your /tmp resides in memory and your 'info' directory is mounted on some volume. In this case, file operations aren't atomic and you may experience corruption when your application doesn't behave properly (crashes or gets abruptly interrupted in the middle of execution).

This would be the case if you use pqueue in Kubernetes and mount a persistent volume for the 'info' directory. If you don't shut down the app gracefully, your Pod may be rescheduled, abruptly interrupting the queue processing. If your Kubernetes cluster is scheduled for maintenance once a week, that may explain the problem.
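
To illustrate the point (this is not pqueue's actual code, just the general write-then-rename pattern): the rename is only atomic when the temporary file and the destination share a filesystem.

    # Generic write-to-temp-then-rename pattern; atomic only if tmp_path and
    # path live on the same filesystem (otherwise it degrades to copy+delete).
    import os
    import pickle
    import tempfile

    def atomic_write(path, obj, tempdir=None):
        fd, tmp_path = tempfile.mkstemp(dir=tempdir or os.path.dirname(path) or ".")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())     # make sure the bytes are on disk first
        os.rename(tmp_path, path)    # atomic on the same filesystem only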

@zpqrtbnk

zpqrtbnk commented Feb 5, 2024

Interesting. This is on a Raspberry Pi Zero. No tempdir passed in the constructor, so it's creating temp files in /tmp, which really looks like it's on the same filesystem as the queue files. Trying to gather more info. Also, I cannot 100% rule out a machine reboot (this is a small IoT appliance), and... it looks like a process crash could corrupt the queue (data written out but info not updated, etc.). I'd say that one has to stop the process cleanly, otherwise things may get corrupted. And then, the problem is that due to the structure of the files, there's little we can do to skip the bad entry.
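
For reference, what I plan to try next is passing a tempdir that sits right next to the queue directory, so the rename always stays on one filesystem. Paths below are just my setup, and I'm assuming the tempdir keyword from the constructor:

    # Keep the temp files on the same partition as the queue data.
    import os
    from pqueue import Queue

    queue_dir = "/home/pi/appliance/queue"
    temp_dir = "/home/pi/appliance/queue-tmp"   # same partition as queue_dir
    os.makedirs(queue_dir, exist_ok=True)
    os.makedirs(temp_dir, exist_ok=True)

    q = Queue(queue_dir, tempdir=temp_dir)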

@zpqrtbnk

zpqrtbnk commented Feb 5, 2024

Notes (mostly for myself): what's weird is that when the queue is corrupted, the first chunk file starts in the middle of a record. I get that non-atomic operations could lead to a corrupted info file, but a data file should always begin with a new record, as far as I can tell from the code. Data files are not truncated, they are just removed once entirely processed, right?
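
In case it helps, here is a rough salvage sketch I've been toying with. It assumes a chunk file is nothing more than back-to-back pickles (which is what the pickle.load in _get suggests): scan forward until the rest of the file unpickles cleanly and keep those records. Brute force and slow on big files, but it's only meant for post-mortem inspection.

    # Find the first offset from which the remainder of a chunk unpickles
    # cleanly, and return the salvaged records from that point on.
    import io
    import pickle

    def salvage(chunk_path):
        with open(chunk_path, "rb") as f:
            data = f.read()
        for offset in range(len(data)):
            buf = io.BytesIO(data[offset:])
            records = []
            try:
                while buf.tell() < len(data) - offset:
                    records.append(pickle.load(buf))
            except Exception:
                continue  # still lands mid-record at this offset; try the next
            return offset, records
        return None, []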

@balena
Owner

balena commented Feb 5, 2024

So, when these cases happen, is it possible to collect and provide a sample (without infringing any copyright of your work)? The other question is: how do you know it has been corrupted? Do you get any error? If so, would you mind providing a stack trace?

Data files are not truncated, they are just removed once entirely processed, right?

Yes, that's right. Filesystem space can be reclaimed as we do ftruncate. Which filesystem do you use?

Another possibility is that the filesystem in use does not play nicely when there isn't enough chunk space available, so it attempts to relocate the file (this isn't usually a problem, as filesystems can deal with fragmented files properly).

@zpqrtbnk

zpqrtbnk commented Feb 6, 2024

FS is /dev/mmcblk0p2 on / type ext4 (rw,noatime)... on a Raspberry Pi it lives on the SD card (could that mean something?). I would be happy to share the corrupted chunk, but silly me hasn't kept it around - I'll make sure to keep it if it happens again.

The error (stack trace) I don't have anymore either, but it's basically data = pickle.load(self.tailf) failing to deserialize what comes at the beginning of the chunk. Say I serialize records of { name: "foo", value: 1234 }; then, by looking at a binary dump of the chunk file, I can see that it starts somewhere in the middle of a record, then continues with normal records.

TBH, looking at the code, I fail to see how a chunk file could end up corrupted, as that would mean pickle.dump(item, self.headf) is not writing correctly... I don't see anything else messing with the file's content. I am pretty sure something can happen - I have seen it happen and apparently others have - but so far I cannot figure it out.

And of course, I cannot reproduce "on demand". I am going to try more things (such as cutting power, etc.).
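
If it helps, this is the kind of crude crash-test harness I have in mind (all names and paths are made up): a child process keeps putting records, the parent SIGKILLs it at a random moment to mimic a crash or power cut, then the queue is reopened and drained.

    # Crash-test harness: kill a writer mid-flight, then try to read back.
    import os
    import queue
    import random
    import time
    from multiprocessing import Process

    from pqueue import Queue

    QUEUE_DIR = "./crashtest-queue"
    os.makedirs(QUEUE_DIR, exist_ok=True)

    def writer():
        q = Queue(QUEUE_DIR)
        i = 0
        while True:
            q.put({"name": "foo", "value": i})
            i += 1

    if __name__ == "__main__":
        p = Process(target=writer)
        p.start()
        time.sleep(random.uniform(0.1, 2.0))
        p.kill()                           # simulate an abrupt crash / power cut
        p.join()

        q = Queue(QUEUE_DIR)               # reopen, as the app would after a reboot
        try:
            while True:
                q.get(block=False)
                q.task_done()
        except queue.Empty:
            print("queue read back cleanly")
        except Exception as exc:           # UnpicklingError, EOFError, ...
            print("queue corrupted:", exc)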
