
open(cloud_path) acts on cached local file instead of cloud file #128

Open
jayqi opened this issue Mar 5, 2021 · 14 comments
Labels
bug Something isn't working

Comments

@jayqi
Member

jayqi commented Mar 5, 2021

Unfortunately the following code does not behave as expected:

with open(cloud_path, "w") as fp:
    fp.write("foo")

(as opposed to using cloud_path.open("w") which does work as expected).

What happens is that this only modifies the local cached copy of the file, without uploading it to the cloud. Further attempts to interact with the cloud path then correctly raise OverwriteNewerLocal exceptions. Because cloud paths have an __fspath__ method (#72) that returns the local file path, the built-in open function only ever interacts with that local file.

Users can still get the correct behavior by using the open method. However, I've seen a lot of people who use regular pathlib, especially those still new to it, call open(path, "w") on their Path objects, so I expect this may be a common pattern for cloudpathlib users as well.
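For contrast, here is the pattern that does work today (a minimal sketch; the bucket path is hypothetical):

from cloudpathlib import CloudPath

cloud_path = CloudPath("s3://bucket/file.txt")  # hypothetical path

# CloudPath.open returns a handle that uploads the local cache file to
# the cloud when it is closed, so the write round-trips correctly.
with cloud_path.open("w") as fp:
    fp.write("foo")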

@jayqi jayqi added the bug Something isn't working label Mar 5, 2021
@jayqi
Member Author

jayqi commented Mar 7, 2021

The gist of the problem here is that our os.PathLike support fails in write situations.

Because CloudPath implements __fspath__ and therefore the os.PathLike protocol, it can be accepted by anything that supports an os.PathLike input, which will call __fspath__ and do whatever it wants with the returned filepath. This is fine and dandy for read operations, because the local cache file's path is returned, the caller reads that file, and we have the ability to refresh the cache before that happens.

However, the problem happens when the caller writes. We're able to refresh the cache within __fspath__ before returning the local cache's path, but after we return, we're out of that method's scope and can't do anything else. The caller will write to the local cache, but we won't be able to sync the cache back up to the cloud.

This may be impossible to fix. Once __fspath__ returns, we can't do anything else. We'd really like __fspath__ to behave like a context manager so we could run some code afterwards, but we can't, because the os.PathLike protocol requires __fspath__ to return bytes or str. We'd need the Python standard library to implement a protocol that supports a context manager, and then to convince all of the other libraries to support that new interface. os.PathLike has been around since Python 3.6, and there are still lots of places that don't support it and only accept filepaths as strings.
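For reference, CloudPath.__fspath__ is essentially the following (the same code shows up in a traceback later in this thread); once the string is returned, there is no hook left to sync the cache back up:

def __fspath__(self):
    # Refresh the local cache before handing out its path...
    if self.is_file():
        self._refresh_cache(force_overwrite_from_cloud=False)
    # ...but after this return we never hear from the caller again,
    # so a write to the cache file can't trigger an upload.
    return str(self._local)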


If it's the case that this is impossible to fix, then we have two options:

1. Keep our os.PathLike support but try to detect when it's being used in a write case so we can raise an exception.

@pjbull suggested that we could maybe use the inspect module to trace the call stack. We could potentially try to detect whether __fspath__ is being called by the built-in open function and, if so, check whether mode is set to one of the writable ones (see the sketch below these two options).

It's not clear to me that we can come up with a solution guaranteed to catch all cases, though. If a user calls __fspath__, stores the returned path as a string, and then writes to that file later, that would be much harder to reliably detect. I'm also not sure whether the built-in open is the only way to write to files in Python.

2. Remove __fspath__ and don't implement the os.PathLike interface.

That means users can't pass cloud paths into functions that expect os.PathLike inputs, even for reading.
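Regarding (1), here is a rough, heavily hedged sketch of the stack-inspection heuristic. _looks_like_write_open is a hypothetical helper, not part of cloudpathlib. Note that the built-in open is implemented in C and never shows up as a Python frame, so the best we can do is scan the calling frames' source lines, which misses aliased calls, mode= keyword arguments, and multi-line invocations:

import inspect
import re

def _looks_like_write_open() -> bool:
    # Scan each calling frame's source line for open(..., "w"/"a"/"x").
    for frame_info in inspect.stack()[1:]:
        for line in frame_info.code_context or []:
            if re.search(r"""\bopen\([^)]*["'][wax]""", line):
                return True
    return False

__fspath__ could then raise a dedicated exception when this returns True, instead of silently handing out the cache path.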

@pjbull
Member

pjbull commented Mar 13, 2021

I think (1) is probably worth doing, and we may also want to add a section to the docs on writing to the cloud so that we can be very explicit about what we recommend, when you might see errors, and how to fix them. Probably along with a library-level setting for making this a warning or an error.

Here are two other ideas while we're brainstorming:

3. Patch open

I would not do this by default, but I could see having a CLOUDPATHLIB_OVERRIDE_OPEN env var that is documented and could be set. If it is set at import time, we override the open builtin (and maybe also os.open?) with our own version that does the write-mode check (it only adds the one check for read scenarios, so it shouldn't be a huge perf hit). If we are in write mode, we supply our own object that uploads on __exit__/.close(). A rough sketch follows below.

I guess the primary use case here is if you have large/complex code bases that you want to pass CloudPath objects into, but don't want to change all of the open calls in the existing code.
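A minimal sketch of what that override might look like, assuming the env var was already checked at import time (_cloudpath_aware_open is a hypothetical name, not part of cloudpathlib):

import builtins

from cloudpathlib import CloudPath

_original_open = builtins.open

def _cloudpath_aware_open(file, mode="r", *args, **kwargs):
    # Route writable opens on CloudPath objects through CloudPath.open,
    # which uploads the cache file to the cloud when the handle closes.
    if isinstance(file, CloudPath) and any(c in mode for c in "wax+"):
        return file.open(mode, *args, **kwargs)
    # Everything else (including CloudPath reads) falls through unchanged.
    return _original_open(file, mode, *args, **kwargs)

builtins.open = _cloudpath_aware_open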

4. Warn/raise on __del__ if the cache file has changed.

We track the modified time of the cache file at the object level. In __del__, if the on-disk modified time is greater than the one stored on the object, we can warn or raise (sketched below).

I could also see an env var here that would enable uploading to the cloud in __del__ if the local cache is newer, but that seems high risk for foot-shooting.
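A rough sketch of (4) as a hypothetical mixin; _local is cloudpathlib's local cache path (a pathlib.Path, as seen in the traceback further down this thread), and the attribute name _fspath_mtime is made up here:

import os
import warnings

class WarnOnDirtyCacheMixin:
    def __fspath__(self):
        # Remember the cache file's mtime at the moment we hand it out.
        path = super().__fspath__()
        self._fspath_mtime = os.stat(path).st_mtime
        return path

    def __del__(self):
        mtime = getattr(self, "_fspath_mtime", None)
        if (
            mtime is not None
            and self._local.exists()
            and self._local.stat().st_mtime > mtime
        ):
            warnings.warn(
                f"Local cache for {self} was modified after __fspath__ was "
                "called; those changes were never uploaded to the cloud."
            )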

@grisaitis

Hi! I'm encountering this issue in a context similar to #240, with pandas.

This issue causes a problem with pandas.DataFrame.to_*(path), where you can't write to the same file more than once :/

Example:

import pandas as pd
import cloudpathlib

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
cloud_path = cloudpathlib.GSPath('gs://bucket/df.parquet')
df.to_parquet(cloud_path)  # works fine
df.to_parquet(cloud_path)  # raises OverwriteNewerLocalError
---------------------------------------------------------------------------
OverwriteNewerLocalError                  Traceback (most recent call last)
Input In [22], in <cell line: 1>()
----> 1 df.to_parquet(cloud_path)

File /opt/conda/envs/deconv/lib/python3.10/site-packages/pandas/util/_decorators.py:207, in deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
    205     else:
    206         kwargs[new_arg_name] = new_arg_value
--> 207 return func(*args, **kwargs)

File /opt/conda/envs/deconv/lib/python3.10/site-packages/pandas/core/frame.py:2835, in DataFrame.to_parquet(self, path, engine, compression, index, partition_cols, storage_options, **kwargs)
   2749 """
   2750 Write a DataFrame to the binary parquet format.
   2751 
   (...)
   2831 >>> content = f.read()
   2832 """
   2833 from pandas.io.parquet import to_parquet
-> 2835 return to_parquet(
   2836     self,
   2837     path,
   2838     engine,
   2839     compression=compression,
   2840     index=index,
   2841     partition_cols=partition_cols,
   2842     storage_options=storage_options,
   2843     **kwargs,
   2844 )

File /opt/conda/envs/deconv/lib/python3.10/site-packages/pandas/io/parquet.py:420, in to_parquet(df, path, engine, compression, index, storage_options, partition_cols, **kwargs)
    416 impl = get_engine(engine)
    418 path_or_buf: FilePath | WriteBuffer[bytes] = io.BytesIO() if path is None else path
--> 420 impl.write(
    421     df,
    422     path_or_buf,
    423     compression=compression,
    424     index=index,
    425     partition_cols=partition_cols,
    426     storage_options=storage_options,
    427     **kwargs,
    428 )
    430 if path is None:
    431     assert isinstance(path_or_buf, io.BytesIO)

File /opt/conda/envs/deconv/lib/python3.10/site-packages/pandas/io/parquet.py:176, in PyArrowImpl.write(self, df, path, compression, index, storage_options, partition_cols, **kwargs)
    172     from_pandas_kwargs["preserve_index"] = index
    174 table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
--> 176 path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(
    177     path,
    178     kwargs.pop("filesystem", None),
    179     storage_options=storage_options,
    180     mode="wb",
    181     is_dir=partition_cols is not None,
    182 )
    183 try:
    184     if partition_cols is not None:
    185         # writes to multiple files under the given path

File /opt/conda/envs/deconv/lib/python3.10/site-packages/pandas/io/parquet.py:80, in _get_path_or_handle(path, fs, storage_options, mode, is_dir)
     70 def _get_path_or_handle(
     71     path: FilePath | ReadBuffer[bytes] | WriteBuffer[bytes],
     72     fs: Any,
   (...)
     77     FilePath | ReadBuffer[bytes] | WriteBuffer[bytes], IOHandles[bytes] | None, Any
     78 ]:
     79     """File handling for PyArrow."""
---> 80     path_or_handle = stringify_path(path)
     81     if is_fsspec_url(path_or_handle) and fs is None:
     82         fsspec = import_optional_dependency("fsspec")

File /opt/conda/envs/deconv/lib/python3.10/site-packages/pandas/io/common.py:225, in stringify_path(filepath_or_buffer, convert_file_like)
    222     return cast(BaseBufferT, filepath_or_buffer)
    224 if isinstance(filepath_or_buffer, os.PathLike):
--> 225     filepath_or_buffer = filepath_or_buffer.__fspath__()
    226 return _expand_user(filepath_or_buffer)

File /opt/conda/envs/deconv/lib/python3.10/site-packages/cloudpathlib/cloudpath.py:233, in CloudPath.__fspath__(self)
    231 def __fspath__(self):
    232     if self.is_file():
--> 233         self._refresh_cache(force_overwrite_from_cloud=False)
    234     return str(self._local)

File /opt/conda/envs/deconv/lib/python3.10/site-packages/cloudpathlib/cloudpath.py:844, in CloudPath._refresh_cache(self, force_overwrite_from_cloud)
    841 # if local newer but not dirty, it was updated
    842 # by a separate process; do not overwrite unless forced to
    843 if self._local.stat().st_mtime > stats.st_mtime:
--> 844     raise OverwriteNewerLocalError(
    845         f"Local file ({self._local}) for cloud path ({self}) is newer on disk, but "
    846         f"is being requested for download from cloud. Either (1) push your changes to the cloud, "
    847         f"(2) remove the local file, or (3) pass `force_overwrite_from_cloud=True` to "
    848         f"overwrite."
    849     )

OverwriteNewerLocalError: Local file (/tmp/tmp1bto4kma/<hidden>/df.parquet) for cloud path (gs://<hidden>/df.parquet) is newer on disk, but is being requested for download from cloud. Either (1) push your changes to the cloud, (2) remove the local file, or (3) pass `force_overwrite_from_cloud=True` to overwrite.

@jayqi
Member Author

jayqi commented Sep 15, 2022

Hi @grisaitis. This is indeed another case of the problem documented by this issue: cloud paths do not support being passed directly to third-party functions (like pd.to_parquet(cloud_path)) that perform writes generically. Unfortunately, we are not close to being able to support that. Most of the discussion on this issue is about giving users a clearer error to indicate that this is not supported.

@grisaitis

@jayqi thank you so much for the reply.

Should I be calling str(cloud_path) or cloud_path.to_uri()?

@jayqi
Member Author

jayqi commented Sep 15, 2022

@grisaitis there are three directions of workarounds you can take:

Update: See best option (0) in the following comment

(1): Introduce explicit local file paths and then use cloudpathlib to write/upload to the cloud. Some example pseudo-code:

cloud_path = CloudPath(...)
local_path = Path(...)
df.to_parquet(local_path)
cloud_path.upload_from(local_path)

This introduces local paths that you need to define and manage. It is also not the most efficient, because it introduces another copy of the data that uses up disk space, as well as associated extra reads/writes.

(2): Use as_uri() to leverage other cloud filesystem libraries, like fsspec. This only works if the function you're passing cloud_path.as_uri() into has third-party support for URIs and if you have those dependencies installed. It works for pandas because pandas supports fsspec. In this case you're using cloudpathlib to manipulate paths (with a pathlib-like syntax) but an fsspec backend (s3fs, gcsfs, etc.), an entirely independent library, for the actual write to the cloud (sketched below).
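For illustration, a sketch of (2) with pandas (the bucket name is hypothetical; this assumes gcsfs and fsspec are installed so pandas can handle the gs:// URI itself):

import pandas as pd
from cloudpathlib import GSPath

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
cloud_path = GSPath("gs://bucket/df.parquet")  # hypothetical bucket

# as_uri() yields the plain "gs://..." string, so pandas dispatches the
# write to fsspec/gcsfs instead of calling __fspath__ on the CloudPath.
df.to_parquet(cloud_path.as_uri())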

@jayqi
Member Author

jayqi commented Sep 15, 2022

Good news! @pjbull pointed out to me another option, which is probably the best for cases that support it.

(0): For write functions that support getting file objects/handles as inputs, you can use cloud_path.open("w"). pandas does indeed support this. The syntax here is a little clunky, but it has the advantage over the other two options of not involving an extra copy of the file while relying purely on cloudpathlib on the backend.

See example:

from cloudpathlib import CloudPath
import pandas as pd


cloud_path = CloudPath("s3://cloudpathlib-test-bucket/test.csv")
cloud_path.exists()
#> False

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

with cloud_path.open("w") as fp:
    df.to_csv(fp)
    
cloud_path.exists()
#> True
cloud_path.read_text()
#> ',x,y\n0,1,4\n1,2,5\n2,3,6\n'

Created at 2022-09-15 16:53:55 EDT by reprexlite v0.5.0

@grisaitis

Thanks!

Since I want to avoid having any extra clutter around just calling df.to_*(path), I might try out https://github.com/fsspec/universal_pathlib as well

@EpicWink

You could start an asynchronous file watcher (e.g., in a thread) that detects changes whenever __fspath__ is called. It would have to know when the changes come from cloudpathlib vs. externally.
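A very rough sketch of this idea using the third-party watchdog package (an assumption; cloudpathlib does not do anything like this today, and _local is its internal cache-path attribute):

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class UploadOnChange(FileSystemEventHandler):
    def __init__(self, cloud_path):
        self.cloud_path = cloud_path

    def on_modified(self, event):
        # Would still need to tell cloudpathlib's own cache refreshes
        # apart from external writes, as noted above.
        if event.src_path == str(self.cloud_path._local):
            self.cloud_path.upload_from(event.src_path)

def watch_cache(cloud_path):
    # Watch the directory containing the cache file for modifications.
    observer = Observer()
    observer.schedule(UploadOnChange(cloud_path), str(cloud_path._local.parent))
    observer.start()
    return observer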

@barneygale

Related issue in the standard library: python/cpython#99818

@FullMetalMeowchemist

FullMetalMeowchemist commented Nov 3, 2023

Another workaround I've found is to use the context manager to open the CloudPath and do the write operation as normal.

I'm working with pyarrow.parquet, and the write_table method only accepts a str or pyarrow.NativeFile. This is the example I've ended up with:

import pyarrow as pa
from pyarrow import parquet
from cloudpathlib import CloudPath

table = pa.table({"x": [1, 2, 3]})  # stand-in table, not in the original
target_filepath = CloudPath("gs://some_bucket/target_file.parquet")
target_filepath.exists()  # False

with target_filepath.open("w"):
    parquet.write_table(table, target_filepath)

target_filepath.exists()  # True

@jayqi
Member Author

jayqi commented Nov 5, 2023

Whoa, I don't think I've considered that usage pattern. Thanks @FullMetalMeowchemist! It's really interesting that that works, given that we didn't (I don't think, anyways) do it intentionally. I think it makes sense, given that CloudPath.open patches the returned context manager to upload the local file to the cloud.

@pjbull I'm wondering if we could do something like: on calls to __fspath__, check whether we have a buffer context manager active, and error if not? Then we could potentially enforce the pattern of

with cloud_path.open():
    whatever_write(cloud_path)

and error if __fspath__ is called without being inside the context manager (rough sketch at the end of this comment).

I guess it may still be undesirably cumbersome for read cases though to require a context manager, and this doesn't help with detecting the difference between read and write usage.
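A rough sketch of that guard as a hypothetical mixin (it glosses over the fact that open() would then only be usable as a context manager, which is exactly the cumbersomeness mentioned above):

import contextlib

class FspathGuardMixin:
    # Hypothetical sketch: only hand out the cache path while an open()
    # context manager is active.
    _buffer_active = False

    @contextlib.contextmanager
    def open(self, *args, **kwargs):
        self._buffer_active = True
        try:
            with super().open(*args, **kwargs) as buffer:
                yield buffer
        finally:
            self._buffer_active = False

    def __fspath__(self):
        if not self._buffer_active:
            raise RuntimeError(
                "__fspath__ used outside of an open() context manager; "
                "writes to the returned path would never reach the cloud."
            )
        return super().__fspath__()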

@ollie-bell

ollie-bell commented Jul 1, 2024

> Whoa, I don't think I've considered that usage pattern. Thanks @FullMetalMeowchemist! It's really interesting that that works, given that we didn't (I don't think, anyways) do it intentionally. I think it makes sense, given that CloudPath.open patches the returned context manager to upload the local file to the cloud.

@jayqi it sounds then like there isn't much difference between (continuing the previous example, for objects whose write methods accept only a file name rather than a file handle):

with target_filepath.open("w"):
    parquet.write_table(table, target_filepath)    

and something like

import tempfile

with tempfile.NamedTemporaryFile() as fp:
    parquet.write_table(table, fp.name)
    fp.flush()
    target_filepath.upload_from(fp.name)

... would there be any technical reasons to prefer one over the other? Thanks

@jayqi
Member Author

jayqi commented Jul 1, 2024

> @jayqi it sounds then like there isn't much difference between (continuing the previous example, for objects whose write methods accept only a file name rather than a file handle):
>
> with target_filepath.open("w"):
>     parquet.write_table(table, target_filepath)
>
> and something like
>
> import tempfile
>
> with tempfile.NamedTemporaryFile() as fp:
>     parquet.write_table(table, fp.name)
>     fp.flush()
>     target_filepath.upload_from(fp.name)
>
> ... would there be any technical reasons to prefer one over the other? Thanks

There aren't any huge differences in this single operation.

However, if you later need to read the file you just uploaded, you may be able to save a download with the first approach. That's because the first approach creates a cloudpathlib-managed local cache file, whereas the second approach explicitly manages its own temporary file. A later read operation compares the cloudpathlib local cache file to the cloud file and skips the download if it determines the cache is up to date. The second approach leaves no local cache file behind, so a later read would always download the file to create one.
