Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for label_size=1 as an optional parameter for csv. #34

Merged
merged 1 commit into from
Oct 2, 2019

Conversation

aws-patlin
Copy link
Contributor

Issue #, if available:
https://issues.amazon.com/issues/AWSMLC-77

Description of changes:
Added additional handling in data_utils.py to check for 'text/csv' content type with the label_size optional parameter. If the label_size is present and is set to 1, the content type is accepted, otherwise a UserError is raised.

Testing:
Added two assert statements in the test_get_content_type unit test. Tests were run with tox and passed successfully.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@aws-patlin
Copy link
Contributor Author

aws-patlin commented Sep 17, 2019

Is it not a concern that users may include other optional parameters, such as 'text/csv; charset=utf-8'? If this happens, I think it makes sense for us to allow this, even if it is not officially supported.

@ericangelokim
Copy link
Contributor

I'm not saying we shouldn't do it, but we should wait to have the larger discussion with support for this across all containers.

@aws-patlin
Copy link
Contributor Author

Okay, in that case, is this bare-bones fix the right approach for now?

@edwardjkim
Copy link
Contributor

edwardjkim commented Sep 19, 2019

Do we allow text/csv;label_size=1 (with no spaces after the semicolon)?. This RCF documentation has text/csv;label_size=1 and text/csv;label_size=0 as examples. What about two (or more) spaces?

@aws-patlin
Copy link
Contributor Author

Yury notes that we should be as generally permissive of optional parameters and formatting. Reverted to a more general style of checking for optional parameters.

@@ -60,6 +60,16 @@ def get_content_type(content_type_cfg_val):
return LIBSVM
elif content_type_cfg_val.lower() in [CSV, _content_types.CSV]:
return CSV
elif _content_types.CSV in content_type_cfg_val.lower():
items = content_type_cfg_val.split(';')
label_size = [item for item in items if 'label_size' in item]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest renaming the variable to something similar to all_label_sizes to indicate that this is an iterable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we expect more than 1 label_size? is that a valid case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like that should raise an exception, if multiple label_size values are given.

items = content_type_cfg_val.split(';')
label_size = [item for item in items if 'label_size' in item]
if label_size:
label_size = label_size[0].split('=')[1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this only for the first label size? If we're making a generic solution, then might as well make it any(label_size != 1 ... )


with self.assertRaises(exc.UserError):
data_utils.get_content_type('incorrect_format')
with self.assertRaises(exc.UserError):
data_utils.get_content_type('text/csv; label_size=5')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about text/csv; label_size, text/csv; label_size=1=1 and text/csv; label_size=1; label_size=2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I wasn't thinking of these kinds of edge cases.

@@ -60,6 +60,16 @@ def get_content_type(content_type_cfg_val):
return LIBSVM
elif content_type_cfg_val.lower() in [CSV, _content_types.CSV]:
return CSV
elif _content_types.CSV in content_type_cfg_val.lower():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a readability perspective, I find this a little confusing as there is an explicit csv condition right above this (on line 61).
Would it make more sense to handle this in a single if condition (i.e. if its a csv) and work on label_size from there?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - I think the lines 61-62 need to be removed.

label_size = [item for item in items if 'label_size' in item]
if label_size:
label_size = label_size[0].split('=')[1]
if label_size.strip() != '1':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about making 1 as a parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the benefit in this case? By definition, label size for XGBoost must be 1, and this value is only referenced twice.

label_size = label_size[0].split('=')[1]
if label_size.strip() != '1':
msg = "{} is not an accepted csv ContentType. "\
"Optional parameter label_size must be equal to 1".format(content_type_cfg_val)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can reference the same parameter here

@@ -43,13 +43,49 @@ def _get_invalid_csv_error_msg(line_snippet, file_name):
return INVALID_CONTENT_FORMAT_ERROR.format(line_snippet=line_snippet, file_name=file_name, content_type='CSV')


def _get_csv_content_type(content_type_cfg_val):
"""
Returns CSV if content_type_cfg_val is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: PEP8, make this a command rather than desc ie Return CSV if content_type ...

See examples here: https://www.python.org/dev/peps/pep-0257/

@@ -25,8 +25,8 @@
LIBSVM = 'libsvm'


INVALID_CONTENT_TYPE_ERROR = "{invalid_content_type} is not an accepted ContentType:" \
" 'csv', 'libsvm', 'text/csv', 'text/libsvm', 'text/x-libsvm'."
INVALID_CONTENT_TYPE_ERROR = "{invalid_content_type} is not an accepted ContentType: " \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor. I would've expected text/csv;label_size=1 without space. Did you check that this is the conventional pattern?

Copy link
Contributor Author

@aws-patlin aws-patlin Oct 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The public sagemaker documentation shows a space: https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html

if content_type_cfg_val in [CSV, _content_types.CSV]:
# Allow 'csv' and 'text/csv'
return CSV
elif _content_types.CSV in content_type_cfg_val:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a really ugly way to deal with a general problem. Have you looked into existing tools that can parse request header content types? such as: https://pypi.org/project/python-mimeparse/

It just feels like we are trying to reinvent the wheel here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I hadn't looked into existing tools.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like python has a library called 'cgi' with a parse_header() function which we can leverage. My only concern is that it handles duplicate parameters by keeping the last one (i.e. 'text/csv; label_size=1; label_size=2' produces a parameter dictionary with {'label_size': '2'}). Is this acceptable behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also returns an empty dictionary if the parameter is only a name without a value (i.e. 'text/csv; label_size' -> ('text/csv', {})), which would pass instead of erroring.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the standard for MIME/HTTP headers mention rules for options in content-type?
If not then IMO it's better to be permissive than restrictive for optional parameters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The protocol doesn't have any guidelines regarding these edge cases: https://www.w3.org/Protocols/rfc1341/4_Content-Type.html
I'd say it should be fine to just go with the default behavior provided by cgi.parse_header()

Copy link
Contributor

@ericangelokim ericangelokim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks much better, thanks for looking into header parsing tools.

@@ -25,8 +26,8 @@
LIBSVM = 'libsvm'


INVALID_CONTENT_TYPE_ERROR = "{invalid_content_type} is not an accepted ContentType:" \
" 'csv', 'libsvm', 'text/csv', 'text/libsvm', 'text/x-libsvm'."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: These accepted ContentTypes should be put in a list so that we aren't copy/pasting raw string values. No need to submit a review for this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean just for this error message, or for all references to these content types? I went ahead and created a list just for the error message, but it might be worthwhile to migrate everything into a dictionary (though this would be a separate issue to deal with).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a list that can be referenced throughout the codebase so that we aren't checking against a copy pasted list repeatedly. This isn't just for the error messages but throughout the repo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would mean changing the way we're referencing these content types on a large scale. We should definitely do this, but I'm not sure this PR is the right place to make this change. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. Can we make the list for the error messages for now, since they refer to the same exact list? No need to resubmit for approval.

…arameters.

Moved logic into helper function, improved edge case handling, added more test cases.

Using python cgi library to parse mime type.

Added list of content types for error message.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants