
Fixing s3 support #364

Merged: 29 commits merged into master on Feb 1, 2018
Conversation

nanounanue (Contributor)

Catwalk

Removed smart_open; now using s3fs.
Reason: smart_open uses boto instead of boto3, and boto lacks several features (such as support for AWS STS; see the sketch below).

  • S3Store is working, and so is S3ModelStoreEngine
  • CSVMatrixStore supports storing to s3 using s3fs (a minimal usage sketch follows the notes below)
  • HDFMatrixStore no longer supports storing to s3 (not that important, since triage only supports CSVMatrixStore)
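For context, a minimal sketch of the kind of thing boto3 (and therefore s3fs) enables but boto made awkward: obtaining temporary credentials via AWS STS and handing them to s3fs. The role ARN and session name below are hypothetical, and this is not the exact code in the PR.

import boto3
import s3fs

# Assume a role via STS to obtain temporary credentials (hypothetical ARN)
sts = boto3.client('sts')
creds = sts.assume_role(
    RoleArn='arn:aws:iam::123456789012:role/example-role',
    RoleSessionName='triage-example',
)['Credentials']

# Hand the temporary credentials to s3fs
fs = s3fs.S3FileSystem(
    key=creds['AccessKeyId'],
    secret=creds['SecretAccessKey'],
    token=creds['SessionToken'],
)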

Metta

Added support for storing matrices in s3 (CSV only)

NOTES:

  • This code uses Python 3.6
  • Tests pass (but a lot of test coverage was missing)
  • Tested on a real project (sfpd_eis)
  • We need to unify how we handle storage; right now it is spread across architect, metta and catwalk
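A minimal sketch of the s3fs-based pattern used for CSV storage (the bucket, key and DataFrame below are hypothetical; the actual CSVMatrixStore interface may differ):

import pandas as pd
import s3fs

df = pd.DataFrame({'entity_id': [1, 2], 'label': [0, 1]})
fs = s3fs.S3FileSystem()  # credentials come from the environment / IAM role

# Write the matrix as CSV directly to S3
with fs.open('some-bucket/matrices/example.csv', 'w') as f:
    df.to_csv(f)

# Read it back
with fs.open('some-bucket/matrices/example.csv', 'r') as f:
    df_again = pd.read_csv(f)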

if scheme in ('', 'file'):  # Local file
    matrix = pd.read_hdf(self.matrix_path)
else:
    raise ValueError(f"""
@thcrock (Contributor) commented Jan 16, 2018

What is the problem with s3 and HDF? I know Eddie ran into some problems but seemed to work past them, although it definitely seems like you can't stream the data from S3; reading the buffer into memory and applying that workaround seemed to do the trick.
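For reference, one way such a workaround might look (a sketch only; the bucket and key are hypothetical, and this is not necessarily the exact approach used) is to download the object up front rather than streaming it, then read the HDF locally:

import tempfile

import boto3
import pandas as pd

s3 = boto3.client('s3')
with tempfile.NamedTemporaryFile(suffix='.h5') as tmp:
    # Pull the whole object down first; HDF5 can't be streamed from S3
    s3.download_fileobj('some-bucket', 'matrices/example.h5', tmp)
    tmp.flush()
    matrix = pd.read_hdf(tmp.name)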

@@ -17,3 +17,4 @@ retrying
Dickens==1.0.1
signalled-timeout==1.0.0
smart_open==1.5.3
s3fs
Contributor

I didn't know about this library, very cool! From the dask organization - I am wondering if it would be advantageous for us to support Dask dataframes for lower-memory environments, at least for non-training operations like matrix building and prediction. It's probably not a good fit but it's a possibility.

if not scheme or scheme == 'file':  # Local file
    with open(os.path.join(project_path, name + ".yaml"), "wb") as f:
        yaml.dump(df, f, encoding='utf-8')
elif scheme == 's3':
Contributor

All of these if statements are definitely crying out for some refactoring, but we can do that in a future commit.

The direction I'm thinking is that S3Store and FSStore are too specific to model pickles and could be repurposed into something generic enough to use here. Then MatrixStore takes a storage class, and there we go (see the rough sketch below).

Also, the list of supported schemes is probably in too many places. A class-level SUPPORTED_SCHEMES should work as not too invasive of a change.
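A rough sketch of that direction (all names and signatures here are hypothetical, not the actual triage API):

class Store:
    """Generic storage backend: reads/writes bytes at a path."""
    SUPPORTED_SCHEMES = ()

    def write(self, path, data): ...
    def read(self, path): ...

class FSStore(Store):
    SUPPORTED_SCHEMES = ('', 'file')

class S3Store(Store):
    SUPPORTED_SCHEMES = ('s3',)

class MatrixStore:
    def __init__(self, storage, matrix_path, metadata_path):
        # The storage backend is injected, so per-scheme if statements
        # live in one place instead of being repeated here
        self.storage = storage
        self.matrix_path = matrix_path
        self.metadata_path = metadata_path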

Member

That's a good idea, for the matrix store to simply take a storage backend.

@thcrock (Contributor) commented Jan 17, 2018

I see nothing blocking here, except that you should fix flake8 for Travis.

As we talked about, I do have some related refactoring in another branch I was working on last week: https://github.com/dssg/triage/compare/s3_hdf
That covers some of what's missing here, like enabling HDF support in Triage Experiments by unifying all matrix access under the MatrixStore umbrella and having the user pass a class to the Experiment to pick how they want matrices stored. Once we merge your changes, I'll redo that branch to work with them.

@jesteria (Member) left a comment

Nice

@@ -17,3 +17,4 @@ retrying
Dickens==1.0.1
signalled-timeout==1.0.0
smart_open==1.5.3
s3fs
Member

Let's pin this requirement's version, as we did for smart_open, to avoid future issues (as we had before we pinned smart_open, IIRC).

Contributor Author

Ok

@@ -17,3 +17,4 @@ retrying
Dickens==1.0.1
signalled-timeout==1.0.0
smart_open==1.5.3
Member

Do we still need smart_open or can we remove it?

Contributor Author

We can remove it

s3_conn = boto3.resource('s3')
s3_conn.create_bucket(Bucket='a-bucket')
store = S3Store(s3_conn.Object('a-bucket', 'a-path'))
import boto3
Member

Not a big deal, just curious if boto3 really has to be imported dynamically in the test, rather than at the module-level.

Contributor

I agree. I believe I introduced this pattern a while ago to attempt to fix something with mocking but am unsure if it is needed.

assert hdf.metadata == matrix_store_list[0].metadata
assert hdf.matrix.to_dict() == matrix_store_list[0].matrix.to_dict()
assert csv.metadata == matrix_store_list[0].metadata
assert csv.matrix.to_dict() == matrix_store_list[0].matrix.to_dict()
Member

So have we removed testing of non-csv stores? Or is it that you've added assertions which can only apply to csv? (And we can't apply these to, say, HDF?)

Contributor Author

I am not testing HDF

We should move to Arrow or Parquet or at least compressed CSV

Contributor

We still have HDF testing. This is for S3 storage specifically. The code now disallows storage of HDF on S3 by throwing an error upfront. So I think this testing is fine for the current state of the code.

@@ -13,11 +13,21 @@
download_object,
)

import s3fs

from urllib.parse import urlparse # Python3
Member

Comments are 🆒 but I don't think we need that one

Contributor Author

Agree

raise ValueError(f"""
URL scheme not supported:
{scheme} (from {os.path.join(project_path, name + '.yaml')})
""")
Member

Certainly not a big deal, but that's going to be quite the error message to read out of the logs. If it's important to organize it in the code this way, you might use textwrap.dedent:

raise ValueError(textwrap.dedent(f"""\
    …
    …"""
))

But I think you can cleanly write –

raise ValueError("URL scheme not supported: "
                 f"{scheme} (from {path})")

(And, rather than repeat your path construction throughout the method, do it once at the top.)

self._metadata = None
self._head_of_matrix = None
self.metadata = None
self.head_of_matrix = None
Member

I'm not sure that we want to clear the object's cache, so much as omit its cached members from the pickle. It might not make a difference due to how it's used, of course. But it seems simple enough to say –

nopickle = frozenset(['metadata', 'matrix', 'head_of_matrix'])
return {key: value for (key, value) in self.__dict__.items() if key not in nopickle}

For that matter, you might be able to get really fancy (and DRY) –

nopickle = {name for (name, item) in self.__class__.__dict__.items()
            if isinstance(item, cachedproperty)}
…
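Put together, a minimal sketch of how that might look on the class (a sketch only; it assumes the cached attributes are named as above, and the real MatrixStore may differ):

class MatrixStore:
    ...

    def __getstate__(self):
        # Omit cached members from the pickle instead of clearing them
        nopickle = frozenset(['metadata', 'matrix', 'head_of_matrix'])
        return {key: value for (key, value) in self.__dict__.items()
                if key not in nopickle}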



class CSVMatrixStore(MatrixStore):

    def __init__(self, matrix_path=None, metadata_path=None):
        super().__init__(matrix_path, metadata_path)
Member

superfluous overriding

self._head_of_matrix = None

except FileNotFoundError as fnfe:
    logging.error(f"Matrix isn't there: {fnfe}")
Member

unless you really want to suppress the traceback, rather than capture the exception instance, you can simply use logging.exception

except FileNotFoundError:
    logging.exception("Matrix isn't there")
    …

Contributor Author

I didn't know about this


[testenv:flake8]
deps = -r{toxinidir}/requirement/include/lint.txt
commands = flake8 src/triage

[testenv:py35]
[testenv:py36]
Member

👍

Member

Ah, but you'll have to update .travis.yml to upgrade the Python version.

@jesteria (Member) left a comment

Nothing pressing, just comments on logs.

self.build_matrix(**task_arguments)
logging.debug(f"Matrix {matrix_uuid} built")
Member

I really don't have a firm opinion on this, now that we have the wonderful f string; however, logging advises lazy %s interpolation because – particularly in the case of debug messages – these messages might simply be discarded, and there's no reason to waste cycles eagerly casting these objects to str and interpolating them:

logging.debug("Matrix %s built", matrix_uuid)

or:

logging.debug("Matrix %(matrix_uuid)s built", {'matrix_uuid': matrix_uuid})

(For that matter, %s interpolation simply isn't going away, no matter how much we might like it to; and, lazy %s interpolation is alive and strong, thanks to logging and database drivers.)

All that said, particularly for higher-level log messages, it's hard to argue with the elegance and simplicity of f strings.


if scheme in ('', 'file'):
    if not self.replace and os.path.exists(matrix_filename):
        logging.info('Skipping %s because matrix already exists', matrix_filename)
Member

All the additional logging is great. I don't have hands-on experience with debugging an experiment, so I can't say how "spammy" these might seem outside of the DEBUG context; I'm just curious whether they might be.

Contributor

We did decide a while ago that we would err on the side of being too spammy, at least for now.

@thcrock (Contributor) commented Feb 1, 2018

It looks like there are still some problems with the build. Many of them look to be pandas-related, and on tests/code that don't seem to be touched here, so it is probably one of the more global changes in this PR. I don't think it's the pandas version, as this is the same version as recent builds on other branches that did pass. So it might be Python 3.6 related?

@thcrock (Contributor) commented Feb 1, 2018

I took another look at the failures. Many of them actually just seem to be a result of the InMemoryMatrixStore changing its behavior to set the indices upon instantiation. This could make sense as a change. It wasn't originally how I envisioned this subclass: it was more to allow people using this class with its original companion, the ModelTrainer, to use their existing dataframes with these storage abstractions, so they could swap in disk-backed dataframes. The assumption was that if you were already using these dataframes, they were probably already indexed, so there was no need to do that here. Many tests are also written with this assumption, and those are now failing.

But the above approach probably isn't totally right. Maybe the dataframe being passed in is indexed, maybe it's not. The storage container here should ensure that it is, but be fine either way. I'm a little bit surprised pandas doesn't handle this better, but we can enable this behavior with something like this:

if self.matrix.index.names != self.metadata['indices']:
    self.matrix.set_index(...)
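A filled-in version of that check might look like this (a sketch only; it assumes in-place modification of the cached dataframe is acceptable):

if self.matrix.index.names != self.metadata['indices']:
    self.matrix.set_index(self.metadata['indices'], inplace=True)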

Any of those tests that still breaks deserves to break and be fixed.

I tried making this change. It fixed most of them, but a couple of the tests were actually built incorrectly, with the metadata and matrix having different indices! Those have been fixed now.

Most of the other tests that broke were related to the S3ModelStorage interface change (it doesn't take an s3_conn anymore), and one was related to the pickle->joblib change (the ModelTrainer test actually loads the pickle and makes predictions with it, so it needed to change if the pickle was in a different format).

I fixed the tests and pushed the change. Feel free to revert this if you have been fixing things yourself.

There is one more problem: with S3Store now using joblib, it and FSStore actually save in different formats, and they should agree on that format.

I'm happy to use whichever format is best, though I think we may have to be more deliberate about switching pickle formats for deployed projects if the new code will be unable to load the old pickles.

@codecov-io

Codecov Report

Merging #364 into master will decrease coverage by 0.37%.
The diff coverage is 73.56%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #364      +/-   ##
==========================================
- Coverage    91.2%   90.83%   -0.38%     
==========================================
  Files          59       61       +2     
  Lines        3435     3795     +360     
==========================================
+ Hits         3133     3447     +314     
- Misses        302      348      +46
Impacted Files Coverage Δ
src/triage/component/catwalk/utils.py 98.46% <ø> (+4.58%) ⬆️
src/triage/experiments/multicore.py 83.73% <100%> (+18.69%) ⬆️
src/triage/component/architect/planner.py 96.82% <100%> (ø) ⬆️
src/triage/component/catwalk/model_trainers.py 93.61% <100%> (ø) ⬆️
src/triage/component/metta/metta_io.py 72.51% <57.89%> (-16.15%) ⬇️
src/triage/component/architect/builders.py 93.23% <76.19%> (-3.35%) ⬇️
src/triage/component/catwalk/storage.py 85.97% <78.47%> (-7.52%) ⬇️
src/triage/component/audition/__init__.py 87.87% <0%> (-2.9%) ⬇️
...e/component/audition/selection_rule_performance.py 95.23% <0%> (-1.06%) ⬇️
src/triage/component/audition/regrets.py 100% <0%> (ø) ⬆️
... and 5 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c03db45...6b490a3.

@thcrock thcrock merged commit 7bfaf85 into master Feb 1, 2018
@thcrock (Contributor) commented Feb 1, 2018

Closed, but waiting on @nanounanue to delete the branch when he switches SFPD to master

@thcrock thcrock deleted the fixing_s3 branch June 26, 2018 15:15