Fixing s3 support #364
Conversation
smart_open using (also) s3fs for CSVMatrixStore
This is a limitation of pytables, not of s3fs: H5 doesn't support buffers, so it can't be used with the context manager from s3fs, and if you try to force it, a unicode exception is thrown.
if scheme in ('', 'file'):  # Local file
    matrix = pd.read_hdf(self.matrix_path)
else:
    raise ValueError(f"""
What is the problem with s3 and HDF? I know Eddie ran into some problems but seemed to work past them. Although it definitely seems like you can't stream the data from S3, reading the buffer into memory and applying that workaround seemed to do the trick.
requirement/main.txt (Outdated)

@@ -17,3 +17,4 @@ retrying
 Dickens==1.0.1
 signalled-timeout==1.0.0
 smart_open==1.5.3
+s3fs
I didn't know about this library, very cool! From the dask organization - I am wondering if it would be advantageous for us to support Dask dataframes for lower-memory environments, at least for non-training operations like matrix building and prediction. It's probably not a good fit but it's a possibility.
if not scheme or scheme == 'file':  # Local file
    with open(os.path.join(project_path, name + ".yaml"), "wb") as f:
        yaml.dump(df, f, encoding='utf-8')
elif scheme == 's3':
All of these if statements are definitely crying out for some refactoring, but we can do that in a future commit. The direction I'm thinking is that S3Store and FSStore are currently too specific to model pickles, but could be repurposed into something generic enough to use here. Then MatrixStore takes a storage class, and there we go.

Also, the list of supported schemes is probably in too many places. A class-level SUPPORTED_SCHEMES should work as not too invasive of a change.
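To make the suggestion concrete, here is a minimal sketch of what a class-level SUPPORTED_SCHEMES check might look like; the class and attribute names beyond SUPPORTED_SCHEMES itself are illustrative, not the actual triage implementation:

```python
from urllib.parse import urlparse

class MatrixStore:
    # Hypothetical class-level registry of the schemes this store can
    # handle, replacing scheme checks scattered through individual methods.
    SUPPORTED_SCHEMES = ('', 'file', 's3')

    def __init__(self, matrix_path):
        scheme = urlparse(matrix_path).scheme
        if scheme not in self.SUPPORTED_SCHEMES:
            raise ValueError(
                f"URL scheme not supported: {scheme} (from {matrix_path})"
            )
        self.matrix_path = matrix_path
        self.scheme = scheme

class CSVMatrixStore(MatrixStore):
    # A subclass can narrow or extend the schemes it accepts by
    # overriding the class attribute.
    SUPPORTED_SCHEMES = ('', 'file', 's3')
```

With this shape, a store constructed with an unsupported scheme fails fast in one place instead of deep inside a read or write method.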
That's a good idea, for the matrix store to simply take a storage backend.
Nothing blocking that I see here, except you should fix flake8 for Travis. As we talked about, I do have some related refactoring in another branch I was working on last week: https://github.com/dssg/triage/compare/s3_hdf
Nice
requirement/main.txt (Outdated)

@@ -17,3 +17,4 @@ retrying
 Dickens==1.0.1
 signalled-timeout==1.0.0
 smart_open==1.5.3
+s3fs
Let's pin this requirement's version, as we did for smart_open, to avoid future issues (as we had before we pinned smart_open, IIRC).
Ok
@@ -17,3 +17,4 @@ retrying
 Dickens==1.0.1
 signalled-timeout==1.0.0
 smart_open==1.5.3
Do we still need smart_open, or can we remove it?
We can remove it
s3_conn = boto3.resource('s3')
s3_conn.create_bucket(Bucket='a-bucket')
store = S3Store(s3_conn.Object('a-bucket', 'a-path'))
import boto3
Not a big deal, just curious if boto3 really has to be imported dynamically in the test, rather than at the module level.
I agree. I believe I introduced this pattern a while ago to attempt to fix something with mocking but am unsure if it is needed.
assert hdf.metadata == matrix_store_list[0].metadata
assert hdf.matrix.to_dict() == matrix_store_list[0].matrix.to_dict()
assert csv.metadata == matrix_store_list[0].metadata
assert csv.matrix.to_dict() == matrix_store_list[0].matrix.to_dict()
So have we removed testing of non-csv stores? Or is it that you've added assertions which can only apply to csv? (And we can't apply these to, say, HDF?)
I am not testing HDF. We should move to Arrow or Parquet, or at least compressed CSV.
We still have HDF testing. This is for S3 storage specifically. The code now disallows storage of HDF on S3 by throwing an error upfront. So I think this testing is fine for the current state of the code.
@@ -13,11 +13,21 @@
    download_object,
)

import s3fs

from urllib.parse import urlparse  # Python3
Comments are 🆒 but I don't think we need that one
Agree
raise ValueError(f"""
    URL scheme not supported:
    {scheme} (from {os.path.join(project_path, name + '.yaml')})
""")
Certainly not a big deal, but that's going to be quite the error message to read out of the logs. If it's important to organize it in the code this way, you might use textwrap.dedent:

raise ValueError(textwrap.dedent(f"""\
    …
    …"""))

But I think you can cleanly write:

raise ValueError("URL scheme not supported: "
                 f"{scheme} (from {path})")

(And, rather than repeat your path construction throughout the method, do it once at the top.)
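A quick runnable illustration of the textwrap.dedent suggestion; the scheme and path values below are made up for demonstration:

```python
import textwrap

# Hypothetical values standing in for the real scheme and constructed path.
scheme = 'ftp'
path = '/project/matrices/example.yaml'

message = textwrap.dedent(f"""\
    URL scheme not supported:
    {scheme} (from {path})""")
# dedent strips the common leading whitespace that the triple-quoted
# literal picks up from the surrounding code indentation, so the logged
# message has no stray indentation in it.
```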
self._metadata = None
self._head_of_matrix = None
self.metadata = None
self.head_of_matrix = None
I'm not sure that we want to clear the object's cache, so much as omit its cached members from the pickle. It might not make a difference due to how it's used, of course. But it seems simple enough to say:

nopickle = frozenset(['metadata', 'matrix', 'head_of_matrix'])
return {key: value for (key, value) in self.__dict__.items() if key not in nopickle}

For that matter, you might be able to get really fancy (and DRY):

nopickle = {name for (name, item) in self.__class__.__dict__.items()
            if isinstance(item, cachedproperty)}
…
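A self-contained sketch of that first suggestion, using __getstate__ to omit cached members from the pickle; the attribute names come from the review comment, while the class and its values are illustrative only:

```python
import pickle

class CachedStore:
    # Hypothetical stand-in for a MatrixStore-like object with lazily
    # computed, cacheable members.
    def __init__(self):
        self.matrix_path = 's3://a-bucket/a-path.csv'  # hypothetical path
        self.metadata = {'indices': ['entity_id']}     # cached member
        self.matrix = [[1, 2], [3, 4]]                 # cached member

    def __getstate__(self):
        # Omit cached members from the pickled state instead of clearing
        # them on the live object; the live instance keeps its cache.
        nopickle = frozenset(['metadata', 'matrix', 'head_of_matrix'])
        return {key: value for (key, value) in self.__dict__.items()
                if key not in nopickle}
```

Unpickling restores only the surviving attributes, so a lazily-computed property would simply recompute on first access.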
class CSVMatrixStore(MatrixStore):

    def __init__(self, matrix_path=None, metadata_path=None):
        super().__init__(matrix_path, metadata_path)
superfluous overriding
self._head_of_matrix = None

except FileNotFoundError as fnfe:
    logging.error(f"Matrix isn't there: {fnfe}")
Unless you really want to suppress the traceback (rather than merely capture the exception instance), you can simply use logging.exception:

except FileNotFoundError:
    logging.exception("Matrix isn't there")
…
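A runnable demonstration of the difference; log output is routed to a buffer here purely so the effect is visible, and the path is made up:

```python
import io
import logging

# Capture root-logger output in a buffer (test scaffolding only).
buffer = io.StringIO()
logging.basicConfig(stream=buffer, level=logging.ERROR, force=True)

try:
    open('/no/such/path/matrix.csv')
except FileNotFoundError:
    # logging.exception logs at ERROR level and appends the current
    # traceback automatically; no need to capture the instance by hand.
    logging.exception("Matrix isn't there")

output = buffer.getvalue()
```

The buffer ends up containing both the message and the full FileNotFoundError traceback, which logging.error(f"... {fnfe}") would have discarded.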
I didn't know this
[testenv:flake8]
deps = -r{toxinidir}/requirement/include/lint.txt
commands = flake8 src/triage

[testenv:py35]
[testenv:py36]
👍
Ah, but you'll have to update .travis.yml to upgrade the Python version.
stops the creation of the matrices, but it doesn't stop the training, so it thinks that the matrices are empty...
Nothing pressing, just comments on logs.
self.build_matrix(**task_arguments)
logging.debug(f"Matrix {matrix_uuid} built")
I really don't have a firm opinion on this, now that we have the wonderful f-string; however, logging advises lazy %s interpolation because, particularly in the case of debug messages, these messages might simply be discarded, and there's no reason to waste cycles eagerly casting these objects to str and interpolating them:

logging.debug("Matrix %s built", matrix_uuid)

or:

logging.debug("Matrix %(matrix_uuid)s built", {'matrix_uuid': matrix_uuid})

(For that matter, %s interpolation simply isn't going away, no matter how much we might like it to; and lazy %s interpolation is alive and strong, thanks to logging and database drivers.)

All that said, particularly for higher-level log messages, it's hard to argue with the elegance and simplicity of f-strings.
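The laziness is observable: with a stand-in object that counts how often its str() runs, the %s form below a logger's level never formats at all, while the f-string always does. The Expensive class is purely illustrative:

```python
import logging

class Expensive:
    # Stand-in for an object (e.g. a large matrix) whose string
    # conversion we'd rather skip when the record will be discarded.
    calls = 0

    def __str__(self):
        Expensive.calls += 1
        return 'some-matrix-uuid'

# Root logger at INFO, so DEBUG records are discarded.
logging.basicConfig(level=logging.INFO, force=True)
matrix_uuid = Expensive()

# Lazy form: the record is below the level, so %s is never applied.
logging.debug('Matrix %s built', matrix_uuid)

# f-string form: interpolation happens eagerly, emitted or not.
logging.debug(f'Matrix {matrix_uuid} built')
```

After running this, str() has been invoked exactly once, by the f-string line.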
if scheme in ('', 'file'):
    if not self.replace and os.path.exists(matrix_filename):
        logging.info('Skipping %s because matrix already exists', matrix_filename)
All the additional logging is great. I don't have the hands-on experience with debugging an experiment, so I can't say how "spammy" these might seem outside of the DEBUG context; I'm just curious whether they might be.
We did decide a while ago that we would err on the side of being too spammy, at least for now.
maxtasksperchild (see https://docs.python.org/3.6/library/multiprocessing.html#module-multiprocessing.pool). More logging
It looks like there are still some problems with the build. Many of them look to be pandas related, and on tests/code that don't seem to be touched here, so it is probably one of the more global changes here. I don't think it's the pandas version, as this is the same version as recent builds on other branches that did pass. So it might be python 3.6 related?
I took another look at the failures. Many of them actually just seem to be a result of the InMemoryMatrixStore changing its behavior to set the indices upon instantiation.

This could make sense as a change. It wasn't originally how I envisioned this subclass: it was more to just allow people using this class with its original companion, the ModelTrainer, to use their existing dataframes with these storage abstractions, to allow swapping with disk-backed dataframes. The assumption is that if you were already using these dataframes, they probably already were indexed and so there was no need to do that here. Many tests are also written with this assumption, and those are now failing.

But the above approach probably isn't totally right. Maybe the dataframe being passed in is indexed, maybe it's not. The storage container here should ensure that it is, but be fine either way. I'm a little bit surprised pandas doesn't handle this better, but we can enable this behavior with something like this:

if self.matrix.index.names != self.metadata['indices']:

Any of those tests that still breaks deserves to break and be fixed. I tried making this change. It fixed most of them, but a couple of the tests were actually built incorrectly, with the metadata and matrix having different indices! Those have been fixed now.

Most of the other tests that broke were related to either the S3ModelStorage interface change (it doesn't take an s3_conn anymore), or the pickle->joblib change (the ModelTrainer test actually loads the pickle and makes predictions with it, so it needed to be changed if the pickle was in a different format). I fixed the tests and pushed the change. Feel free to revert this if you have been fixing things yourself.

There is one more problem: with the S3Store now using joblib, it and FSStore actually save in different formats, and they should agree on that format.
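A small sketch of the index check described above, pulled out as a standalone helper (the function name and pandas usage are mine, not the triage code):

```python
import pandas as pd

def ensure_indexed(matrix, indices):
    # If the dataframe is not already indexed the way the metadata says,
    # set the index; if it is, leave it alone. Callers can then pass
    # either a pre-indexed or a flat dataframe.
    if matrix.index.names != indices:
        matrix = matrix.set_index(indices)
    return matrix
```

Calling it twice with the same indices is a no-op the second time, which is exactly the "be fine either way" behavior the comment asks for.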
I'm happy to use whichever format is best, though I think we may have to be more deliberate about switching pickle formats for deployed projects if the new code will be unable to load the old pickles.
Codecov Report
@@ Coverage Diff @@
## master #364 +/- ##
==========================================
- Coverage 91.2% 90.83% -0.38%
==========================================
Files 59 61 +2
Lines 3435 3795 +360
==========================================
+ Hits 3133 3447 +314
- Misses 302 348 +46
Continue to review full report at Codecov.
Closed, but waiting on @nanounanue to delete when he switches SFPD to master
Catwalk

- Removed smart_open, now using s3fs
- Reason: smart_open uses boto instead of boto3, and boto lacks several features (like using AWS sts)
- S3Store is working, and also S3ModelStoreEngine
- CSVMatrixStore supports storing to s3 using s3fs

Metta

- Added support to store matrices in s3 (only for csv)

NOTES:

- python 3.6
- sfpd_eis
- architect, metta and catwalk