Trouble saving CSR matrix to S3 using scipy.sparse.save_npz #190

Closed
datavisoryangzhou opened this issue Jul 1, 2019 · 11 comments

Comments

@datavisoryangzhou

datavisoryangzhou commented Jul 1, 2019

I'm running into an error when trying to save a CSR matrix to AWS S3 with the test code below:

import numpy as np
import pandas as pd
from scipy import sparse
import s3fs

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
csr = sparse.csr_matrix(df.values)

s3 = s3fs.S3FileSystem(anon=False)
s3_path = "<an_aws_s3_path>"
f = s3.open(s3_path, 'wb')
sparse.save_npz(f, csr)

If this is not currently supported, could you suggest a good way to achieve this?

Thanks.

Error trace:

...
  File "/Users/yangzhou/code/sml/core/util/csr_matrix_wrapper.py", line 37, in save
    sparse.save_npz(f, self.csr)
  File "/usr/local/lib/python3.7/site-packages/scipy/sparse/_matrix_io.py", line 78, in save_npz
    np.savez_compressed(file, **arrays_dict)
  File "/usr/local/lib/python3.7/site-packages/numpy/lib/npyio.py", line 667, in savez_compressed
    _savez(file, args, kwds, True)
  File "/usr/local/lib/python3.7/site-packages/numpy/lib/npyio.py", line 695, in _savez
    zipf = zipfile_factory(file, mode="w", compression=compression)
  File "/usr/local/lib/python3.7/site-packages/numpy/lib/npyio.py", line 112, in zipfile_factory
    return zipfile.ZipFile(file, *args, **kwargs)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py", line 1214, in __init__
    self.fp.seek(self.start_dir)
  File "/usr/local/lib/python3.7/site-packages/s3fs/core.py", line 1278, in seek
    raise ValueError('Seek only available in read mode')
ValueError: Seek only available in read mode
@datavisoryangzhou
Author

datavisoryangzhou commented Jul 1, 2019

I bypassed this error by adding ValueError to the except clause in zipfile.py:

# Some file-like objects can provide tell() but not seek()
try:
    self.fp.seek(self.start_dir)
except (AttributeError, OSError, ValueError):  # ValueError added
    self._seekable = False

Can we modify the exception-raising logic in seek() in core.py?

def seek(self, loc, whence=0):
    """ Set current file location

    Parameters
    ----------
    loc : int
        byte location
    whence : {0, 1, 2}
        from start of file, current location or end of file, resp.
    """
    if not self.readable():
        raise ValueError('Seek only available in read mode')  # the check in question
    if whence == 0:
        nloc = loc
    elif whence == 1:
        nloc = self.loc + loc
    elif whence == 2:
        nloc = self.size + loc
    else:
        raise ValueError(
            "invalid whence (%s, should be 0, 1 or 2)" % whence)
    if nloc < 0:
        raise ValueError('Seek before start of file')
    self.loc = nloc
    return self.loc

@martindurant
Member

That certainly sounds reasonable to me. Good detective work!

@datavisoryangzhou
Author

@martindurant Thanks for your reply. I'm not very familiar with the IO code above; could you help with fixing this issue?

@martindurant
Member

I suppose you'd prefer

    if not self.readable():
        raise OSError('Seek only available in read mode')

and then you don't have to edit zipfile
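
A minimal sketch of why the exception type matters, assuming an unpatched standard-library zipfile. WriteOnlyFile is a hypothetical stand-in for a write-mode remote file (it is not s3fs code); its seek() raises whichever exception type it is given.

```python
import io
import zipfile


class WriteOnlyFile:
    """Hypothetical stand-in for a write-mode remote file; not s3fs code."""

    def __init__(self, seek_error):
        self.seek_error = seek_error  # exception type that seek() raises
        self.buffer = bytearray()

    def write(self, data):
        self.buffer += data
        return len(data)

    def tell(self):
        return len(self.buffer)

    def seek(self, loc, whence=0):
        raise self.seek_error("Seek only available in read mode")

    def flush(self):
        pass


def write_archive(fileobj):
    """Write a one-member zip archive to fileobj."""
    with zipfile.ZipFile(fileobj, mode="w") as zf:
        zf.writestr("a.txt", b"hello")


# OSError from seek(): ZipFile treats the target as non-seekable and streams.
good = WriteOnlyFile(OSError)
write_archive(good)
roundtrip = zipfile.ZipFile(io.BytesIO(bytes(good.buffer))).read("a.txt")

# ValueError from seek(): an unpatched ZipFile does not catch it, so it propagates.
bad = WriteOnlyFile(ValueError)
try:
    write_archive(bad)
    raised = False
except ValueError:
    raised = True
```

With OSError the archive is written in streaming mode and reads back intact; with ValueError the constructor fails in the same way as the traceback at the top of this issue.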

@angelcervera

Are you thinking about fixing it? I'm using s3fs to stream zip files into s3 and this is blocking. Any workaround in the meantime?
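
One possible stopgap, sketched under assumptions rather than taken from this thread: build the archive in a seekable in-memory buffer and write the finished bytes out in one call, so the target file only ever needs write(). The names save_via_buffer, NoSeekFile and make_zip are hypothetical, and the whole archive is held in memory, which may not suit very large files.

```python
import io
import zipfile


def save_via_buffer(save, fileobj):
    """Run save(buffer) against an in-memory buffer, then write the bytes out.

    save is any callable that writes to a seekable binary file object.
    """
    buffer = io.BytesIO()
    save(buffer)
    fileobj.write(buffer.getvalue())


class NoSeekFile:
    """Hypothetical write-only target whose seek() raises ValueError."""

    def __init__(self):
        self.data = b""

    def write(self, chunk):
        self.data += chunk
        return len(chunk)

    def seek(self, loc, whence=0):
        raise ValueError("Seek only available in read mode")


def make_zip(buffer):
    """Write a one-member zip archive to a seekable buffer."""
    with zipfile.ZipFile(buffer, mode="w") as zf:
        zf.writestr("a.txt", b"hello")


target = NoSeekFile()
save_via_buffer(make_zip, target)

# With scipy and s3fs this would presumably look like (untested here):
#     with s3.open(s3_path, "wb") as f:
#         save_via_buffer(lambda buf: sparse.save_npz(buf, csr), f)
```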

@martindurant
Member

I would love for someone to submit a PR

@xuchen

xuchen commented Feb 9, 2020

I had exactly the same error today using both scipy.sparse.save_npz and np.savez_compressed. However, pickle.dump raised no errors. I suppose this is on the s3fs side?

@martindurant
Member

I believe the fix for zipfile is above. You could try it, and if it works, please do submit the change as a PR. Different code calling file functions may make different assumptions about the capabilities of the files.
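
To illustrate the point about differing assumptions, here is a small sketch that records which file methods each writer calls. RecordingFile is a hypothetical wrapper around io.BytesIO, written only for this illustration.

```python
import io
import pickle
import zipfile


class RecordingFile:
    """Hypothetical wrapper around BytesIO that records which methods are called."""

    def __init__(self):
        self.raw = io.BytesIO()
        self.calls = set()

    def write(self, data):
        self.calls.add("write")
        return self.raw.write(data)

    def tell(self):
        self.calls.add("tell")
        return self.raw.tell()

    def seek(self, loc, whence=0):
        self.calls.add("seek")
        return self.raw.seek(loc, whence)

    def flush(self):
        self.calls.add("flush")
        self.raw.flush()


# pickle only needs a write() method on its target.
pickle_target = RecordingFile()
pickle.dump({"a": 1}, pickle_target)

# zipfile probes tell() and seek() when opened for writing.
zip_target = RecordingFile()
with zipfile.ZipFile(zip_target, mode="w") as zf:
    zf.writestr("a.txt", b"hello")

print(sorted(pickle_target.calls))
print(sorted(zip_target.calls))
```

This would explain why pickle.dump succeeds on a write-mode file while anything built on zipfile (np.savez_compressed, and therefore save_npz) hits the seek() check.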

@xuchen

xuchen commented Feb 10, 2020

Did you mean the following line? I couldn't figure out which file it should go in...

if not self.readable():
    raise OSError('Seek only available in read mode')

@martindurant
Member

Yes, that line. It lives in fsspec.spec.AbstractBufferedFile.seek: change ValueError to OSError there, since that is apparently what ZipFile expects. Some tests may rely on the current exception type and would need to be amended to reflect the change.
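
A rough sketch of the change and of the kind of check a test might make. SketchFile is a hypothetical stand-in that reproduces only the seek() logic quoted earlier in this issue; it is not the fsspec class, and its mode, size and loc attributes are assumptions made for the example.

```python
class SketchFile:
    """Hypothetical stand-in reproducing only the seek() logic quoted above."""

    def __init__(self, mode, size=0):
        self.mode = mode
        self.size = size
        self.loc = 0

    def readable(self):
        return self.mode == "rb"

    def seek(self, loc, whence=0):
        if not self.readable():
            # OSError instead of ValueError, as proposed in this thread
            raise OSError("Seek only available in read mode")
        if whence == 0:
            nloc = loc
        elif whence == 1:
            nloc = self.loc + loc
        elif whence == 2:
            nloc = self.size + loc
        else:
            raise ValueError("invalid whence (%s, should be 0, 1 or 2)" % whence)
        if nloc < 0:
            raise ValueError("Seek before start of file")
        self.loc = nloc
        return self.loc


def seek_error_type(fileobj):
    """Return the exception type raised by seek(0), or None if it succeeds."""
    try:
        fileobj.seek(0)
    except Exception as exc:
        return type(exc)
    return None
```

A test that previously expected ValueError for a write-mode seek would instead check for OSError, while the whence and negative-offset checks keep raising ValueError.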

@xuchen

xuchen commented Feb 10, 2020

There you go, PR incoming: fsspec/filesystem_spec#238

martindurant added a commit to fsspec/filesystem_spec that referenced this issue Feb 10, 2020