Cache gets corrupted #12

Open
CourchesneA opened this issue May 24, 2022 · 7 comments

Comments

@CourchesneA

Hello,

I can't find exactly what is causing it or when it happens, but about once a week the cache file gets corrupted and I have to manually delete it. Any idea what could be happening ?

Traceback (most recent call last):
  File "telemetry_server.py", line 289, in upload_cached_records
    cache_object = self.cache_queue.get(block=False)
  File "/usr/lib/python3.8/queue.py", line 180, in get
    item = self._get()
  File "/usr/local/lib/python3.8/dist-packages/pqueue/pqueue.py", line 87, in _get
    data = pickle.load(self.tailf)
_pickle.UnpicklingError: invalid load key, '\x00'.
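
For context, this is roughly the shape of the surrounding code; only the queue handling mirrors the real thing, the path and the upload step are placeholders:

    # Rough sketch of the drain loop from the traceback (illustrative only).
    import queue

    from pqueue import Queue

    cache_queue = Queue("/var/cache/telemetry")   # illustrative path

    def upload(record):
        print("uploading", record)                # placeholder for the real upload

    def upload_cached_records():
        while True:
            try:
                cache_object = cache_queue.get(block=False)
            except queue.Empty:
                return                            # nothing left to upload
            # The UnpicklingError above is raised from inside get(), so it
            # escapes this loop and the cache file has to be deleted by hand.
            upload(cache_object)
            cache_queue.task_done()               # persist the consumed position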
@CourchesneA
Author

Just a follow-up: I have been running into additional pickle errors related to cache corruption on pqueue.get / pickle.load:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x95 in position 1: invalid start byte
EOFError: Ran out of input

@zpqrtbnk

Confirmed / reproduced.
The queue, in a pretty standard usage scenario, does get corrupted at times.
Will try and investigate.

@balena
Owner

balena commented Jan 28, 2024

Please provide more information to best diagnose the problem.

As described in the README, pqueue relies on atomic operations provided by the filesystem. That means the info does not get updated "in place"; instead, pqueue writes the new file in the tempdir (please check the constructor) and then moves it over. But this is atomic if and only if the origin and destination are on the same filesystem.

So, say your /tmp resides in memory and your 'info' directory is mounted on some volume. In this case, file operations aren't atomic and you may experience corruption when your application doesn't behave properly (crashes or gets abruptly interrupted in the middle of execution).

This would be the case if you use pqueue in Kubernetes and mount a persistent volume for the 'info' directory. If you don't shut down the app gracefully, your Pod may be rescheduled, abruptly interrupting the queue processing. If your Kubernetes cluster is scheduled for maintenance once a week, that may explain the problem.
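
To illustrate the point (this is not pqueue's actual code, just the general write-then-rename pattern): the rename is only atomic when the temporary file and the destination share a filesystem.

    # Generic write-to-temp-then-rename pattern; atomic only if tmp_path and
    # path live on the same filesystem (otherwise it degrades to copy+delete).
    import os
    import pickle
    import tempfile

    def atomic_write(path, obj, tempdir=None):
        fd, tmp_path = tempfile.mkstemp(dir=tempdir or os.path.dirname(path) or ".")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())     # make sure the bytes are on disk first
        os.rename(tmp_path, path)    # atomic on the same filesystem only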

@zpqrtbnk

zpqrtbnk commented Feb 5, 2024

Interesting. This is on a Raspberry Pi Zero. No tempdir passed in the constructor, so it's creating temp files in /tmp, which really looks like it's on the same filesystem as the queue files. Trying to gather more info. Also, I cannot 100% rule out a machine reboot (this is a small IoT appliance), and... it looks like a process crash could corrupt the queue (data written out but info not updated, etc.). I'd say that one has to stop the process cleanly, otherwise things may get corrupted. And then, the problem is that due to the structure of the files, there's little we can do to skip the bad entry.
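
For reference, what I plan to try next is passing a tempdir that sits right next to the queue directory, so the rename always stays on one filesystem. Paths below are just my setup, and I'm assuming the tempdir keyword from the constructor:

    # Keep the temp files on the same partition as the queue data.
    import os
    from pqueue import Queue

    queue_dir = "/home/pi/appliance/queue"
    temp_dir = "/home/pi/appliance/queue-tmp"   # same partition as queue_dir
    os.makedirs(queue_dir, exist_ok=True)
    os.makedirs(temp_dir, exist_ok=True)

    q = Queue(queue_dir, tempdir=temp_dir)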

@zpqrtbnk

zpqrtbnk commented Feb 5, 2024

Notes (mostly for myself): what's weird is that when the queue is corrupted, the first chunk file starts in the middle of a record. I get that non-atomic operations could lead to a corrupted info file, but a data file should always begin with a new record, as far as I can tell from the code. Data files are not truncated, they are just removed once entirely processed, right?
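
In case it helps, here is a rough salvage sketch I've been toying with. It assumes a chunk file is nothing more than back-to-back pickles (which is what the pickle.load in _get suggests): scan forward until the rest of the file unpickles cleanly and keep those records. Brute force and slow on big files, but it's only meant for post-mortem inspection.

    # Find the first offset from which the remainder of a chunk unpickles
    # cleanly, and return the salvaged records from that point on.
    import io
    import pickle

    def salvage(chunk_path):
        with open(chunk_path, "rb") as f:
            data = f.read()
        for offset in range(len(data)):
            buf = io.BytesIO(data[offset:])
            records = []
            try:
                while buf.tell() < len(data) - offset:
                    records.append(pickle.load(buf))
            except Exception:
                continue  # still lands mid-record at this offset; try the next
            return offset, records
        return None, []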

@balena
Owner

balena commented Feb 5, 2024

So, when these cases happen, is it possible to collect and provide a sample (without infringing any copyright of your work)? The other question is: how do you know it has been corrupted? Do you get any error? If so, would you mind providing a stack trace?

Data files are not truncated, they are just removed once entirely processed, right?

Yes, that's right. Filesystem space can be reclaimed as we do ftruncate. Which filesystem do you use?

Another possibility is that the filesystem in use does not play nicely when there isn't enough chunk space available, so it attempts to relocate the file (this isn't usually a problem, as filesystems can deal with fragmented files properly).

@zpqrtbnk

zpqrtbnk commented Feb 6, 2024

FS is /dev/mmcblk0p2 on / type ext4 (rw,noatime)... on a Raspberry Pi it lives on the SD card (could that mean something?). I would be happy to share the corrupted chunk, but silly me hasn't kept it around - I'll make sure to keep it if it happens again.

The error (stack trace) I don't have anymore either, but it's basically data = pickle.load(self.tailf) failing to deserialize what comes at the beginning of the chunk. Say I serialize records of { name: "foo", value: 1234 }; then, by looking at a binary dump of the chunk file, I can see that it starts somewhere in the middle of a record, then continues with normal records.

TBH, looking at the code, I fail to see how a chunk file could end up corrupted, as that would mean pickle.dump(item, self.headf) is not writing correctly... I don't see anything else messing with the file's content. I am pretty sure something can happen - I have seen it happen and apparently others have - but so far I cannot figure it out.

And of course, I cannot reproduce "on demand". I am going to try more things (such as cutting power, etc.).
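
If it helps, this is the kind of crude crash-test harness I have in mind (all names and paths are made up): a child process keeps putting records, the parent SIGKILLs it at a random moment to mimic a crash or power cut, then the queue is reopened and drained.

    # Crash-test harness: kill a writer mid-flight, then try to read back.
    import os
    import queue
    import random
    import time
    from multiprocessing import Process

    from pqueue import Queue

    QUEUE_DIR = "./crashtest-queue"
    os.makedirs(QUEUE_DIR, exist_ok=True)

    def writer():
        q = Queue(QUEUE_DIR)
        i = 0
        while True:
            q.put({"name": "foo", "value": i})
            i += 1

    if __name__ == "__main__":
        p = Process(target=writer)
        p.start()
        time.sleep(random.uniform(0.1, 2.0))
        p.kill()                           # simulate an abrupt crash / power cut
        p.join()

        q = Queue(QUEUE_DIR)               # reopen, as the app would after a reboot
        try:
            while True:
                q.get(block=False)
                q.task_done()
        except queue.Empty:
            print("queue read back cleanly")
        except Exception as exc:           # UnpicklingError, EOFError, ...
            print("queue corrupted:", exc)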
