Added support for label_size=1 as an optional parameter for csv. #34

aws-patlin · 2019-09-13T21:28:52Z

Issue #, if available:
https://issues.amazon.com/issues/AWSMLC-77

Description of changes:
Added handling in data_utils.py for the 'text/csv' content type with the optional label_size parameter. If label_size is present and set to 1, the content type is accepted; otherwise a UserError is raised.
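A minimal sketch of the check described above, assuming a plain split on ';'. `UserError` is a local stand-in for the container's `exc.UserError`, and `check_csv_content_type` is a hypothetical name, not the function in data_utils.py.

```python
class UserError(Exception):
    """Local stand-in for the container's exc.UserError."""


CSV = 'csv'


def check_csv_content_type(content_type):
    """Return CSV for 'text/csv', optionally followed by label_size=1."""
    media_type, _, params = content_type.partition(';')
    if media_type.strip().lower() != 'text/csv':
        raise UserError("{} is not an accepted ContentType".format(content_type))
    name, _, value = params.partition('=')
    # Any parameter other than label_size=1 is rejected in this first version.
    if params.strip() and (name.strip() != 'label_size' or value.strip() != '1'):
        raise UserError("Optional parameter label_size must be equal to 1")
    return CSV
```

With this sketch, 'text/csv' and 'text/csv; label_size=1' are accepted, while 'text/csv; label_size=5' raises.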

Testing:
Added two assert statements in the test_get_content_type unit test. Tests were run with tox and passed successfully.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

aws-patlin · 2019-09-17T21:54:31Z

Is it not a concern that users may include other optional parameters, such as 'text/csv; charset=utf-8'? If this happens, I think it makes sense for us to allow this, even if it is not officially supported.

ericangelokim · 2019-09-17T21:58:24Z

I'm not saying we shouldn't do it, but we should wait until we have the larger discussion about supporting this across all containers.

aws-patlin · 2019-09-17T22:15:36Z

Okay, in that case, is this bare-bones fix the right approach for now?

edwardjkim · 2019-09-19T04:38:07Z

Do we allow text/csv;label_size=1 (with no space after the semicolon)? This RCF documentation has text/csv;label_size=1 and text/csv;label_size=0 as examples. What about two (or more) spaces?
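One way to make the spacing question moot is to strip whitespace around every piece after splitting. The sketch below is only an illustration; `split_content_type` is a hypothetical helper, not code from this PR.

```python
def split_content_type(content_type):
    """Split a content type into (media_type, [(name, value), ...])."""
    media_type, *raw_params = content_type.split(';')
    params = []
    for raw in raw_params:
        name, _, value = raw.partition('=')
        # Stripping here makes zero, one, or many spaces equivalent.
        params.append((name.strip().lower(), value.strip()))
    return media_type.strip().lower(), params


for value in ('text/csv;label_size=1',
              'text/csv; label_size=1',
              'text/csv;   label_size=1'):
    print(split_content_type(value))
```

All three inputs produce the same tuple, so the check that follows never sees the spacing.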

aws-patlin · 2019-09-24T19:58:49Z

Yury notes that we should be as permissive as possible with optional parameters and formatting. Reverted to a more general style of checking for optional parameters.

iyerr3 · 2019-09-24T20:47:53Z

src/sagemaker_xgboost_container/data_utils.py

@@ -60,6 +60,16 @@ def get_content_type(content_type_cfg_val):
        return LIBSVM
    elif content_type_cfg_val.lower() in [CSV, _content_types.CSV]:
        return CSV
+    elif _content_types.CSV in content_type_cfg_val.lower():
+        items = content_type_cfg_val.split(';')
+        label_size = [item for item in items if 'label_size' in item]


I would suggest renaming the variable to something similar to all_label_sizes to indicate that this is an iterable.

Do we expect more than one label_size? Is that a valid case?

I feel like that should raise an exception, if multiple label_size values are given.
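A rough sketch of both suggestions together: the list is named `all_label_sizes`, and more than one entry is treated as an error. `ValueError` stands in for the container's `exc.UserError`, and `get_single_label_size` is a hypothetical name.

```python
def get_single_label_size(content_type):
    """Return the label_size value, or None if the parameter is absent."""
    items = content_type.split(';')[1:]
    all_label_sizes = [item for item in items
                       if item.split('=')[0].strip() == 'label_size']
    if len(all_label_sizes) > 1:
        # ValueError is a stand-in for exc.UserError in this sketch.
        raise ValueError("label_size given {} times in '{}'".format(
            len(all_label_sizes), content_type))
    if not all_label_sizes:
        return None
    return all_label_sizes[0].partition('=')[2].strip()
```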

iyerr3 · 2019-09-24T20:52:49Z

src/sagemaker_xgboost_container/data_utils.py

+        items = content_type_cfg_val.split(';')
+        label_size = [item for item in items if 'label_size' in item]
+        if label_size:
+            label_size = label_size[0].split('=')[1]


Why is this only for the first label size? If we're making a generic solution, then might as well make it any(label_size != 1 ... )
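The `any(...)` idea could look roughly like this, assuming every label_size occurrence is collected first. `has_invalid_label_size` is a hypothetical helper name.

```python
def has_invalid_label_size(content_type):
    """Return True if any label_size parameter is not exactly '1'."""
    items = [item.strip() for item in content_type.split(';')[1:]]
    all_label_sizes = [item.partition('=')[2].strip() for item in items
                       if item.partition('=')[0].strip() == 'label_size']
    # A bare 'label_size' with no value yields '' and so counts as invalid.
    return any(size != '1' for size in all_label_sizes)
```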

iyerr3 · 2019-09-24T20:55:00Z

test/unit/test_data_utils.py


        with self.assertRaises(exc.UserError):
            data_utils.get_content_type('incorrect_format')
+        with self.assertRaises(exc.UserError):
+            data_utils.get_content_type('text/csv; label_size=5')


How about 'text/csv; label_size', 'text/csv; label_size=1=1', and 'text/csv; label_size=1; label_size=2'?

Good point, I wasn't thinking of these kinds of edge cases.
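These edge cases could be covered with a table-driven test along the following lines. The `get_content_type` below is a hypothetical strict validator written only so the test runs on its own, and `UserError` is a local stand-in for `exc.UserError`; whether each case should actually raise was still open at this point in the review.

```python
import unittest


class UserError(Exception):
    """Local stand-in for the container's exc.UserError."""


def get_content_type(content_type):
    """Hypothetical strict validator used only to exercise the tests."""
    media_type, *params = [part.strip() for part in content_type.split(';')]
    if media_type.lower() != 'text/csv':
        raise UserError(content_type)
    if len(params) > 1 or (params and params[0] != 'label_size=1'):
        raise UserError(content_type)
    return 'csv'


class TestLabelSizeEdgeCases(unittest.TestCase):
    def test_malformed_label_size_is_rejected(self):
        bad_values = [
            'text/csv; label_size',
            'text/csv; label_size=1=1',
            'text/csv; label_size=1; label_size=2',
        ]
        for value in bad_values:
            # subTest reports each failing value separately.
            with self.subTest(value=value):
                with self.assertRaises(UserError):
                    get_content_type(value)

    def test_valid_label_size_is_accepted(self):
        self.assertEqual(get_content_type('text/csv; label_size=1'), 'csv')
```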

rizwangilani · 2019-09-24T21:08:01Z

src/sagemaker_xgboost_container/data_utils.py

@@ -60,6 +60,16 @@ def get_content_type(content_type_cfg_val):
        return LIBSVM
    elif content_type_cfg_val.lower() in [CSV, _content_types.CSV]:
        return CSV
+    elif _content_types.CSV in content_type_cfg_val.lower():


From a readability perspective, I find this a little confusing, as there is an explicit CSV condition right above this (on line 61).
Would it make more sense to handle this in a single if condition (i.e. if it's a CSV) and work on label_size from there?

Good point - I think the lines 61-62 need to be removed.

rizwangilani · 2019-09-24T21:22:34Z

src/sagemaker_xgboost_container/data_utils.py

+        label_size = [item for item in items if 'label_size' in item]
+        if label_size:
+            label_size = label_size[0].split('=')[1]
+            if label_size.strip() != '1':


How about making 1 a parameter?

What would be the benefit in this case? By definition, label size for XGBoost must be 1, and this value is only referenced twice.

rizwangilani · 2019-09-24T21:23:40Z

src/sagemaker_xgboost_container/data_utils.py

+            label_size = label_size[0].split('=')[1]
+            if label_size.strip() != '1':
+                msg = "{} is not an accepted csv ContentType. "\
+                      "Optional parameter label_size must be equal to 1".format(content_type_cfg_val)


can reference the same parameter here
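For illustration, a named constant shared by the check and the message might look like this. `REQUIRED_LABEL_SIZE` and `label_size_error` are hypothetical names, not identifiers from the PR.

```python
REQUIRED_LABEL_SIZE = '1'  # hypothetical constant; XGBoost expects one label column


def label_size_error(content_type, label_size):
    """Return an error message if label_size is invalid, else None."""
    if label_size.strip() == REQUIRED_LABEL_SIZE:
        return None
    # The same constant feeds the message, so the two cannot drift apart.
    return ("{} is not an accepted csv ContentType. "
            "Optional parameter label_size must be equal to {}".format(
                content_type, REQUIRED_LABEL_SIZE))
```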

ericangelokim · 2019-10-01T15:36:17Z

src/sagemaker_xgboost_container/data_utils.py

@@ -43,13 +43,49 @@ def _get_invalid_csv_error_msg(line_snippet, file_name):
    return INVALID_CONTENT_FORMAT_ERROR.format(line_snippet=line_snippet, file_name=file_name, content_type='CSV')


+def _get_csv_content_type(content_type_cfg_val):
+    """
+    Returns CSV if content_type_cfg_val is


Minor: per PEP 257, phrase the docstring as a command rather than a description, i.e. "Return CSV if content_type ..."

See examples here: https://www.python.org/dev/peps/pep-0257/

ericangelokim · 2019-10-01T15:40:17Z

src/sagemaker_xgboost_container/data_utils.py

@@ -25,8 +25,8 @@
 LIBSVM = 'libsvm'


-INVALID_CONTENT_TYPE_ERROR = "{invalid_content_type} is not an accepted ContentType:" \
-                             " 'csv', 'libsvm', 'text/csv', 'text/libsvm', 'text/x-libsvm'."
+INVALID_CONTENT_TYPE_ERROR = "{invalid_content_type} is not an accepted ContentType: " \


Minor. I would've expected text/csv;label_size=1 without space. Did you check that this is the conventional pattern?

The public sagemaker documentation shows a space: https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html

ericangelokim · 2019-10-01T15:43:41Z

src/sagemaker_xgboost_container/data_utils.py

+    if content_type_cfg_val in [CSV, _content_types.CSV]:
+        # Allow 'csv' and 'text/csv'
+        return CSV
+    elif _content_types.CSV in content_type_cfg_val:


This seems like a really ugly way to deal with a general problem. Have you looked into existing tools that can parse request header content types? such as: https://pypi.org/project/python-mimeparse/

It just feels like we are trying to reinvent the wheel here.

Good point, I hadn't looked into existing tools.

Looks like Python has a standard library module called 'cgi' with a parse_header() function which we can leverage. My only concern is that it handles duplicate parameters by keeping the last one (i.e. 'text/csv; label_size=1; label_size=2' produces a parameter dictionary with {'label_size': '2'}). Is this acceptable behavior?

It also returns an empty dictionary if the parameter is only a name without a value (i.e. 'text/csv; label_size' -> ('text/csv', {})), which would pass instead of erroring.
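The two behaviors described above can be reproduced with a simplified stand-in. This is not the real `cgi.parse_header()` (it skips quoted-string handling, for one), and `parse_header_sketch` is a hypothetical name; a stand-in is used here because the `cgi` module was removed from the standard library in Python 3.13.

```python
def parse_header_sketch(line):
    """Simplified stand-in for cgi.parse_header (no quoted-string handling)."""
    parts = [part.strip() for part in line.split(';')]
    main_type = parts[0]
    params = {}
    for part in parts[1:]:
        name, sep, value = part.partition('=')
        if not sep:
            # A bare name without '=' is silently dropped.
            continue
        # Later duplicates overwrite earlier ones.
        params[name.strip().lower()] = value.strip()
    return main_type, params


print(parse_header_sketch('text/csv; label_size=1; label_size=2'))
# ('text/csv', {'label_size': '2'})
print(parse_header_sketch('text/csv; label_size'))
# ('text/csv', {})
```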

Does the standard for MIME/HTTP headers mention rules for options in content-type?
If not then IMO it's better to be permissive than restrictive for optional parameters.

The protocol doesn't have any guidelines regarding these edge cases: https://www.w3.org/Protocols/rfc1341/4_Content-Type.html
I'd say it should be fine to just go with the default behavior provided by cgi.parse_header()

ericangelokim

Looks much better, thanks for looking into header parsing tools.

ericangelokim · 2019-10-02T15:41:00Z

src/sagemaker_xgboost_container/data_utils.py

@@ -25,8 +26,8 @@
 LIBSVM = 'libsvm'


-INVALID_CONTENT_TYPE_ERROR = "{invalid_content_type} is not an accepted ContentType:" \
-                             " 'csv', 'libsvm', 'text/csv', 'text/libsvm', 'text/x-libsvm'."


Minor: These accepted ContentTypes should be put in a list so that we aren't copy/pasting raw string values. No need to submit a review for this change.

Do you mean just for this error message, or for all references to these content types? I went ahead and created a list just for the error message, but it might be worthwhile to migrate everything into a dictionary (though this would be a separate issue to deal with).

There should be a list that can be referenced throughout the codebase so that we aren't checking against a copy pasted list repeatedly. This isn't just for the error messages but throughout the repo.

That would mean changing the way we're referencing these content types on a large scale. We should definitely do this, but I'm not sure this PR is the right place to make this change. What do you think?

That's fair. Can we make the list for the error messages for now, since they refer to the same exact list? No need to resubmit for approval.
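A sketch of building the error message from a single list, using the content types from the original message. `ACCEPTED_CONTENT_TYPES` is a hypothetical name for that list.

```python
ACCEPTED_CONTENT_TYPES = ['csv', 'libsvm', 'text/csv', 'text/libsvm', 'text/x-libsvm']

# The quoted, comma-separated list is generated instead of being typed by hand.
INVALID_CONTENT_TYPE_ERROR = (
    "{invalid_content_type} is not an accepted ContentType: "
    + ", ".join("'{}'".format(content_type) for content_type in ACCEPTED_CONTENT_TYPES)
    + "."
)

print(INVALID_CONTENT_TYPE_ERROR.format(invalid_content_type='json'))
```

Adding a new accepted type then means editing the list only, and the message follows.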

aws-patlin requested review from iyerr3 and ericangelokim September 13, 2019 21:28

aws-patlin force-pushed the dev branch from b99803b to 917aee1 Compare September 13, 2019 21:30

aws-patlin force-pushed the dev branch from c28a7f8 to 573a570 Compare September 24, 2019 19:56

iyerr3 reviewed Sep 24, 2019

rizwangilani reviewed Sep 24, 2019

aws-patlin force-pushed the dev branch from 573a570 to 7ef8817 Compare September 25, 2019 00:04

ericangelokim reviewed Oct 1, 2019

aws-patlin force-pushed the dev branch 2 times, most recently from 851a2d7 to cad7a90 Compare October 1, 2019 21:09

aws-patlin requested review from ericangelokim and iyerr3 October 1, 2019 21:13

aws-patlin force-pushed the dev branch from cad7a90 to c303804 Compare October 1, 2019 21:15

ericangelokim approved these changes Oct 2, 2019

Added support for label_size=1 and general handling of CSV optional p…

f13061d

…arameters. Moved logic into helper function, improved edge case handling, added more test cases. Using python cgi library to parse mime type. Added list of content types for error message.

aws-patlin force-pushed the dev branch from c303804 to f13061d Compare October 2, 2019 18:43

aws-patlin merged commit f880272 into master Oct 2, 2019

aws-patlin deleted the dev branch October 25, 2019 18:11
