Error when using with pd.read_parquet with threading on #213
Comments
Please post/link to pyarrow too. s3fs does not do anything particular with threads, the S3FileObjects are independent instances, but S3FileSystem is a singleton (for given parameters) with shared state. |
I don't think it's a problem with pyarrow per se, as it's reading regions that do not have overlap. And reading a local file always succeeds, threading turned on or not. Since I think there is only one instance of S3FileObjects, I think the issue is cache not being thread safe (and apologies if I misunderstood anything): If you read the lines starting with
However
which seems to suggest that |
I currently work around it by turning off threading in pyarrow, but I think the best way is to turn off caching. What is the best way to turn off fsspec’s caching? @martindurant |
Yeah, I didn't at first realise that the same file instance was being shared across threads - that's not how I normally think about things from a Dask perspective.
Yes, that's the conclusion I was coming to also. You can use |
Should we allow setting default cache_type in the |
Both of those sound reasonable. Note on a threadsafe cache: I would still want the file instance to be pickleable, so presumably the cache in instances reconstituted by pickle would always have empty caches (unless they happen to be in the same thread?). |
I decided to implement default cache_type in s3fs (PR #214). However, when working on the cache I found that MMapCache is currently not pickleable. I guess it needs to be fixed as well?
Good point. fsspec/filesystem_spec#99 does this, but cloudpickle only. Would be good not to have to use it, you might have a better idea. |
If you don’t mind losing the cache content, you can try implementing __getstate__ and __setstate__
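The suggestion above can be sketched roughly as follows. This is a minimal illustration only, not the fsspec implementation: `DroppableCache` and its attributes are hypothetical names, and the assumption is that cached bytes are cheap enough to refetch that they can be discarded whenever the object is pickled.

```python
import pickle


class DroppableCache:
    """Hypothetical read cache whose contents are discarded on pickling."""

    def __init__(self, blocksize):
        self.blocksize = blocksize
        self.start = 0
        self.data = b""  # cached bytes; refetchable, so not worth pickling

    def __getstate__(self):
        # Copy the instance dict, then blank out the cached region so the
        # pickled object carries configuration only.
        state = self.__dict__.copy()
        state["data"] = b""
        state["start"] = 0
        return state

    def __setstate__(self, state):
        # The restored instance starts with an empty cache.
        self.__dict__.update(state)


cache = DroppableCache(blocksize=1024)
cache.start, cache.data = 100, b"some cached bytes"
restored = pickle.loads(pickle.dumps(cache))
```

With this approach the original object keeps its cached bytes, while the unpickled copy keeps only the settings. An unpicklable member could presumably be dropped and rebuilt in the same two methods.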
I think I'll park it there and see what people think. If you have a nice implementation, would be happy to see it, and indeed one for a threadsafe bytes cache. I don't know when I would get to this. |
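For illustration, one way a threadsafe bytes cache might look is sketched below. It is an assumption-laden sketch, not code from s3fs or fsspec: `LockedReadAheadCache`, its `fetcher(start, end)` callable, and the read-ahead policy are all hypothetical. The only point is that the cached window and the slice taken from it are read and updated under a single lock.

```python
import threading


class LockedReadAheadCache:
    """Hypothetical read-ahead cache guarded by a lock."""

    def __init__(self, blocksize, fetcher, size):
        self.blocksize = blocksize
        self.fetcher = fetcher  # callable(start, end) -> bytes (assumed)
        self.size = size        # total length of the underlying file
        self.start = 0
        self.end = 0
        self.data = b""
        self._lock = threading.Lock()

    def fetch(self, start, end):
        end = min(end, self.size)
        if start >= end:
            return b""
        with self._lock:
            # Refill the window if the request is not fully inside it.
            if not (self.start <= start and end <= self.end):
                new_end = min(self.size, end + self.blocksize)
                self.data = self.fetcher(start, new_end)
                self.start, self.end = start, new_end
            # Slice while still holding the lock, so another thread
            # cannot replace the window between the check and the slice.
            offset = start - self.start
            return self.data[offset:offset + (end - start)]


payload = bytes(range(256)) * 64
cache = LockedReadAheadCache(512, lambda s, e: payload[s:e], len(payload))
results = {}


def worker(i):
    start = (i * 997) % (len(payload) - 100)
    results[i] = (start, cache.fetch(start, start + 100))


threads = [threading.Thread(target=worker, args=(i,)) for i in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A single lock serialises all reads, so threads making scattered random accesses would mostly evict each other's window. Note also that a `threading.Lock` cannot be pickled, so a class like this would need to drop and recreate the lock to stay pickleable.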
You know, I may even back away from saying that files should be pickleable, since we do have OpenFile after all, specifically for passing around things that are pickleable, but can be made into files with |
Hello, do you have any updates on this issue?
Apparently boto3 sessions are not thread-safe, but we have not come across this before, so maybe previously they accidentally were. Certainly, s3fs instances get shared, but this is not new behaviour. I do not have a good idea of how to produce thread-local boto sessions. |
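As a generic illustration of per-thread state, the standard library's `threading.local` can hold one object per thread. The sketch below is hypothetical throughout: `PerThread` and `make_session` are invented names, and `make_session` returns a plain placeholder object, not a real boto session. Whether this pattern would fit how s3fs shares its filesystem instance is not established in this thread.

```python
import threading


class PerThread:
    """Hypothetical helper: lazily builds one object per thread."""

    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()

    def get(self):
        # Each thread sees its own attributes on the threading.local object.
        try:
            return self._local.obj
        except AttributeError:
            self._local.obj = self._factory()
            return self._local.obj


def make_session():
    # Placeholder stand-in; real code might construct a session object here.
    return object()


sessions = PerThread(make_session)
seen = {}


def worker(name):
    first = sessions.get()
    second = sessions.get()
    seen[name] = (first, first is second)


threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread gets the same object back on repeated calls, while different threads get different objects.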
Since updating to the latest version, I randomly get errors like this:
I'm pretty sure that the file is not corrupted because sometimes it can be read successfully, and it can also be read when the file is local.
It is also very hard to reproduce consistently, but when I read the file in a loop, I notice a pattern:
This will succeed 30 times in a row without a problem
This will fail sometimes on the first iteration, sometimes on the second, and has never reached iteration number 5:
I suspect that it has something to do with random access with multiple threads, and the cache is not handling it correctly.
Finally, this is the debug log, which I hope will help: