
Fixing s3 support #364

Merged: 29 commits merged into master on Feb 1, 2018
Conversation

nanounanue (Contributor)

Catwalk

Removed smart_open; now using s3fs.
Reason: smart_open uses boto instead of boto3, and boto lacks several features (such as support for AWS STS; see the sketch below).

  • S3Store is working, and so is S3ModelStoreEngine
  • CSVMatrixStore supports storing to s3 using s3fs (a minimal usage sketch follows the notes below)
  • HDFMatrixStore no longer supports storing to s3 (not that important, since triage only supports CSVMatrixStore)
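For context, a minimal sketch of the kind of thing boto3 (and therefore s3fs) enables but boto made awkward: obtaining temporary credentials via AWS STS and handing them to s3fs. The role ARN and session name below are hypothetical, and this is not the exact code in the PR.

import boto3
import s3fs

# Assume a role via STS to obtain temporary credentials (hypothetical ARN)
sts = boto3.client('sts')
creds = sts.assume_role(
    RoleArn='arn:aws:iam::123456789012:role/example-role',
    RoleSessionName='triage-example',
)['Credentials']

# Hand the temporary credentials to s3fs
fs = s3fs.S3FileSystem(
    key=creds['AccessKeyId'],
    secret=creds['SecretAccessKey'],
    token=creds['SessionToken'],
)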

Metta

Added support for storing matrices in s3 (CSV only)

NOTES:

  • This code uses Python 3.6
  • Tests pass (but a lot of test coverage was missing)
  • Tested on a real project (sfpd_eis)
  • We need to unify how we handle storage; right now it is spread across architect, metta and catwalk
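A minimal sketch of the s3fs-based pattern used for CSV storage (the bucket, key and DataFrame below are hypothetical; the actual CSVMatrixStore interface may differ):

import pandas as pd
import s3fs

df = pd.DataFrame({'entity_id': [1, 2], 'label': [0, 1]})
fs = s3fs.S3FileSystem()  # credentials come from the environment / IAM role

# Write the matrix as CSV directly to S3
with fs.open('some-bucket/matrices/example.csv', 'w') as f:
    df.to_csv(f)

# Read it back
with fs.open('some-bucket/matrices/example.csv', 'r') as f:
    df_again = pd.read_csv(f)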

if scheme in ('', 'file'):  # Local file
    matrix = pd.read_hdf(self.matrix_path)
else:
    raise ValueError(f"""
@thcrock (Contributor) commented Jan 16, 2018

What is the problem with s3 and HDF? I know Eddie ran into some problems but seemed to work past them, although it definitely seems like you can't stream the data from S3; reading the buffer into memory and applying that workaround seemed to do the trick.
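For reference, one way such a workaround might look (a sketch only; the bucket and key are hypothetical, and this is not necessarily the exact approach used) is to download the object up front rather than streaming it, then read the HDF locally:

import tempfile

import boto3
import pandas as pd

s3 = boto3.client('s3')
with tempfile.NamedTemporaryFile(suffix='.h5') as tmp:
    # Pull the whole object down first; HDF5 can't be streamed from S3
    s3.download_fileobj('some-bucket', 'matrices/example.h5', tmp)
    tmp.flush()
    matrix = pd.read_hdf(tmp.name)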

@@ -17,3 +17,4 @@ retrying
Dickens==1.0.1
signalled-timeout==1.0.0
smart_open==1.5.3
s3fs
Contributor

I didn't know about this library, very cool! From the dask organization - I am wondering if it would be advantageous for us to support Dask dataframes for lower-memory environments, at least for non-training operations like matrix building and prediction. It's probably not a good fit but it's a possibility.

if not scheme or scheme == 'file':  # Local file
    with open(os.path.join(project_path, name + ".yaml"), "wb") as f:
        yaml.dump(df, f, encoding='utf-8')
elif scheme == 's3':
Contributor

All of these if statements are definitely crying out for some refactoring, but we can do that in a future commit.

The direction I'm thinking is that S3Store and FSStore are too specific to model pickles and could be repurposed into something generic enough to use here. Then MatrixStore takes a storage class, and there we go (see the rough sketch below).

Also, the list of supported schemes is probably in too many places. A class-level SUPPORTED_SCHEMES should work as not too invasive of a change.
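A rough sketch of that direction (all names and signatures here are hypothetical, not the actual triage API):

class Store:
    """Generic storage backend: reads/writes bytes at a path."""
    SUPPORTED_SCHEMES = ()

    def write(self, path, data): ...
    def read(self, path): ...

class FSStore(Store):
    SUPPORTED_SCHEMES = ('', 'file')

class S3Store(Store):
    SUPPORTED_SCHEMES = ('s3',)

class MatrixStore:
    def __init__(self, storage, matrix_path, metadata_path):
        # The storage backend is injected, so per-scheme if statements
        # live in one place instead of being repeated here
        self.storage = storage
        self.matrix_path = matrix_path
        self.metadata_path = metadata_path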

Member

That's a good idea, for the matrix store to simply take a storage backend.

@thcrock (Contributor) commented Jan 17, 2018

I see nothing blocking here, except that you should fix flake8 for Travis.

As we talked about, I do have some related refactoring in another branch I was working on last week: https://github.com/dssg/triage/compare/s3_hdf
That covers some of what's missing here, like enabling HDF support in Triage Experiments by unifying all matrix access under the MatrixStore umbrella and having the user pass a class to the Experiment to pick how they want matrices stored. Once we merge your changes, I'll redo that branch to work with them.

@jesteria (Member) left a comment

Nice

@@ -17,3 +17,4 @@ retrying
Dickens==1.0.1
signalled-timeout==1.0.0
smart_open==1.5.3
s3fs
Member

Let's pin this requirement's version, as we did for smart_open, to avoid future issues (as we had before we pinned smart_open, IIRC).

Contributor Author

Ok

@@ -17,3 +17,4 @@ retrying
Dickens==1.0.1
signalled-timeout==1.0.0
smart_open==1.5.3
Member

Do we still need smart_open or can we remove it?

Contributor Author

We can remove it

s3_conn = boto3.resource('s3')
s3_conn.create_bucket(Bucket='a-bucket')
store = S3Store(s3_conn.Object('a-bucket', 'a-path'))
import boto3
Member

Not a big deal, just curious if boto3 really has to be imported dynamically in the test, rather than at the module-level.

Contributor

I agree. I believe I introduced this pattern a while ago to attempt to fix something with mocking but am unsure if it is needed.

assert hdf.metadata == matrix_store_list[0].metadata
assert hdf.matrix.to_dict() == matrix_store_list[0].matrix.to_dict()
assert csv.metadata == matrix_store_list[0].metadata
assert csv.matrix.to_dict() == matrix_store_list[0].matrix.to_dict()
Member

So have we removed testing of non-csv stores? Or is it that you've added assertions which can only apply to csv? (And we can't apply these to, say, HDF?)

Contributor Author

I am not testing HDF

We should move to Arrow or Parquet or at least compressed CSV

Contributor

We still have HDF testing. This is for S3 storage specifically. The code now disallows storage of HDF on S3 by throwing an error upfront. So I think this testing is fine for the current state of the code.

@@ -13,11 +13,21 @@
download_object,
)

import s3fs

from urllib.parse import urlparse # Python3
Member

Comments are 🆒 but I don't think we need that one

Contributor Author

Agree

raise ValueError(f"""
URL scheme not supported:
{scheme} (from {os.path.join(project_path, name + '.yaml')})
""")
Member

Certainly not a big deal, but that's going to be quite the error message to read out of the logs. If it's important to organize it in the code this way, you might use textwrap.dedent:

raise ValueError(textwrap.dedent(f"""\
    …
    …"""
))

But I think you can cleanly write –

raise ValueError("URL scheme not supported: "
                 f"{scheme} (from {path})")

(And, rather than repeat your path construction throughout the method, do it once at the top.)

self._metadata = None
self._head_of_matrix = None
self.metadata = None
self.head_of_matrix = None
Member

I'm not sure that we want to clear the object's cache, so much as omit its cached members from the pickle. It might not make a difference due to how it's used, of course. But it seems simple enough to say –

nopickle = frozenset(['metadata', 'matrix', 'head_of_matrix'])
return {key: value for (key, value) in self.__dict__.items() if key not in nopickle}

For that matter, you might be able to get really fancy (and DRY) –

nopickle = {name for (name, item) in self.__class__.__dict__.items()
            if isinstance(item, cachedproperty)}
…
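Put together, a minimal sketch of how that might look on the class (a sketch only; it assumes the cached attributes are named as above, and the real MatrixStore may differ):

class MatrixStore:
    ...

    def __getstate__(self):
        # Omit cached members from the pickle instead of clearing them
        nopickle = frozenset(['metadata', 'matrix', 'head_of_matrix'])
        return {key: value for (key, value) in self.__dict__.items()
                if key not in nopickle}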



class CSVMatrixStore(MatrixStore):

    def __init__(self, matrix_path=None, metadata_path=None):
        super().__init__(matrix_path, metadata_path)
Member

superfluous overriding

self._head_of_matrix = None

except FileNotFoundError as fnfe:
    logging.error(f"Matrix isn't there: {fnfe}")
Member

unless you really want to suppress the traceback, rather than capture the exception instance, you can simply use logging.exception

except FileNotFoundError:
    logging.exception("Matrix isn't there")
    …

Contributor Author

I didn't know about this


[testenv:flake8]
deps = -r{toxinidir}/requirement/include/lint.txt
commands = flake8 src/triage

[testenv:py35]
[testenv:py36]
Member

👍

Member

Ah, but you'll have to update .travis.yml to upgrade the Python version.

@jesteria (Member) left a comment

Nothing pressing, just comments on logs.

self.build_matrix(**task_arguments)
logging.debug(f"Matrix {matrix_uuid} built")
Member

I really don't have a firm opinion on this, now that we have the wonderful f string; however, logging advises lazy %s interpolation because – particularly in the case of debug messages – these messages might simply be discarded, and there's no reason to waste cycles eagerly casting these objects to str and interpolating them:

logging.debug("Matrix %s built", matrix_uuid)

or:

logging.debug("Matrix %(matrix_uuid)s built", {'matrix_uuid': matrix_uuid})

(For that matter, %s interpolation simply isn't going away, no matter how much we might like it to; and, lazy %s interpolation is alive and strong, thanks to logging and database drivers.)

All that said, particularly for higher-level log messages, it's hard to argue with the elegance and simplicity of f strings.


if scheme in ('', 'file'):
    if not self.replace and os.path.exists(matrix_filename):
        logging.info('Skipping %s because matrix already exists', matrix_filename)
Member

All the additional logging is great. I don't have hands-on experience with debugging an experiment, so I can't say how "spammy" these might seem outside of the DEBUG context; I'm just curious whether they might be.

Contributor

We did decide a while ago that we would err on the side of being too spammy, at least for now.

@thcrock (Contributor) commented Feb 1, 2018

It looks like there are still some problems with the build. Many of them look to be pandas-related, and on tests/code that don't seem to be touched here, so it is probably one of the more global changes in this PR. I don't think it's the pandas version, as this is the same version as recent builds on other branches that did pass. So it might be Python 3.6 related?

@thcrock (Contributor) commented Feb 1, 2018

I took another look at the failures. Many of them actually just seem to be a result of the InMemoryMatrixStore changing its behavior to set the indices upon instantiation. This could make sense as a change. It wasn't originally how I envisioned this subclass: it was more to allow people using this class with its original companion, the ModelTrainer, to use their existing dataframes with these storage abstractions, so they could swap in disk-backed dataframes. The assumption was that if you were already using these dataframes, they were probably already indexed, so there was no need to do that here. Many tests are also written with this assumption, and those are now failing.

But the above approach probably isn't totally right. Maybe the dataframe being passed in is indexed, maybe it's not. The storage container here should ensure that it is, but be fine either way. I'm a little bit surprised pandas doesn't handle this better, but we can enable this behavior with something like this:

if self.matrix.index.names != self.metadata['indices']:
    self.matrix.set_index(...)
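A filled-in version of that check might look like this (a sketch only; it assumes in-place modification of the cached dataframe is acceptable):

if self.matrix.index.names != self.metadata['indices']:
    self.matrix.set_index(self.metadata['indices'], inplace=True)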

Any of those tests that still breaks deserves to break and be fixed.

I tried making this change. It fixed most of them, but a couple of the tests were actually built incorrectly, with the metadata and matrix having different indices! Those have been fixed now.

Most of the other tests that broke were related to the S3ModelStorage interface change (it doesn't take an s3_conn anymore), and one was related to the pickle->joblib change (the ModelTrainer test actually loads the pickle and makes predictions with it, so it needed to change if the pickle was in a different format).

I fixed the tests and pushed the change. Feel free to revert this if you have been fixing things yourself.

There is one more problem: with S3Store now using joblib, it and FSStore actually save in different formats, and they should agree on that format.

I'm happy to use whichever format is best, though I think we may have to be more deliberate about switching pickle formats for deployed projects if the new code will be unable to load the old pickles.

@codecov-io

Codecov Report

Merging #364 into master will decrease coverage by 0.37%.
The diff coverage is 73.56%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #364      +/-   ##
==========================================
- Coverage    91.2%   90.83%   -0.38%     
==========================================
  Files          59       61       +2     
  Lines        3435     3795     +360     
==========================================
+ Hits         3133     3447     +314     
- Misses        302      348      +46
Impacted Files Coverage Δ
src/triage/component/catwalk/utils.py 98.46% <ø> (+4.58%) ⬆️
src/triage/experiments/multicore.py 83.73% <100%> (+18.69%) ⬆️
src/triage/component/architect/planner.py 96.82% <100%> (ø) ⬆️
src/triage/component/catwalk/model_trainers.py 93.61% <100%> (ø) ⬆️
src/triage/component/metta/metta_io.py 72.51% <57.89%> (-16.15%) ⬇️
src/triage/component/architect/builders.py 93.23% <76.19%> (-3.35%) ⬇️
src/triage/component/catwalk/storage.py 85.97% <78.47%> (-7.52%) ⬇️
src/triage/component/audition/__init__.py 87.87% <0%> (-2.9%) ⬇️
...e/component/audition/selection_rule_performance.py 95.23% <0%> (-1.06%) ⬇️
src/triage/component/audition/regrets.py 100% <0%> (ø) ⬆️
... and 5 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c03db45...6b490a3.

@thcrock thcrock merged commit 7bfaf85 into master Feb 1, 2018
@thcrock (Contributor) commented Feb 1, 2018

Closed, but waiting on @nanounanue to delete the branch when he switches SFPD to master

@thcrock thcrock deleted the fixing_s3 branch June 26, 2018 15:15