-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added support for label_size=1 as an optional parameter for csv. #34
Conversation
Is it not a concern that users may include other optional parameters, such as 'text/csv; charset=utf-8'? If this happens, I think it makes sense for us to allow this, even if it is not officially supported. |
I'm not saying we shouldn't do it, but we should wait to have the larger discussion with support for this across all containers. |
Okay, in that case, is this bare-bones fix the right approach for now? |
Do we allow |
Yury notes that we should be as generally permissive of optional parameters and formatting. Reverted to a more general style of checking for optional parameters. |
@@ -60,6 +60,16 @@ def get_content_type(content_type_cfg_val): | |||
return LIBSVM | |||
elif content_type_cfg_val.lower() in [CSV, _content_types.CSV]: | |||
return CSV | |||
elif _content_types.CSV in content_type_cfg_val.lower(): | |||
items = content_type_cfg_val.split(';') | |||
label_size = [item for item in items if 'label_size' in item] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest renaming the variable to something similar to all_label_sizes
to indicate that this is an iterable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we expect more than 1 label_size? is that a valid case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like that should raise an exception, if multiple label_size values are given.
items = content_type_cfg_val.split(';') | ||
label_size = [item for item in items if 'label_size' in item] | ||
if label_size: | ||
label_size = label_size[0].split('=')[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this only for the first label size? If we're making a generic solution, then might as well make it any(label_size != 1 ... )
|
||
with self.assertRaises(exc.UserError): | ||
data_utils.get_content_type('incorrect_format') | ||
with self.assertRaises(exc.UserError): | ||
data_utils.get_content_type('text/csv; label_size=5') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about text/csv; label_size
, text/csv; label_size=1=1
and text/csv; label_size=1; label_size=2
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I wasn't thinking of these kinds of edge cases.
@@ -60,6 +60,16 @@ def get_content_type(content_type_cfg_val): | |||
return LIBSVM | |||
elif content_type_cfg_val.lower() in [CSV, _content_types.CSV]: | |||
return CSV | |||
elif _content_types.CSV in content_type_cfg_val.lower(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a readability perspective, I find this a little confusing as there is an explicit csv condition right above this (on line 61).
Would it make more sense to handle this in a single if condition (i.e. if its a csv) and work on label_size from there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point - I think the lines 61-62 need to be removed.
label_size = [item for item in items if 'label_size' in item] | ||
if label_size: | ||
label_size = label_size[0].split('=')[1] | ||
if label_size.strip() != '1': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about making 1 as a parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be the benefit in this case? By definition, label size for XGBoost must be 1, and this value is only referenced twice.
label_size = label_size[0].split('=')[1] | ||
if label_size.strip() != '1': | ||
msg = "{} is not an accepted csv ContentType. "\ | ||
"Optional parameter label_size must be equal to 1".format(content_type_cfg_val) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can reference the same parameter here
@@ -43,13 +43,49 @@ def _get_invalid_csv_error_msg(line_snippet, file_name): | |||
return INVALID_CONTENT_FORMAT_ERROR.format(line_snippet=line_snippet, file_name=file_name, content_type='CSV') | |||
|
|||
|
|||
def _get_csv_content_type(content_type_cfg_val): | |||
""" | |||
Returns CSV if content_type_cfg_val is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: PEP8, make this a command rather than desc ie Return CSV if content_type ...
See examples here: https://www.python.org/dev/peps/pep-0257/
@@ -25,8 +25,8 @@ | |||
LIBSVM = 'libsvm' | |||
|
|||
|
|||
INVALID_CONTENT_TYPE_ERROR = "{invalid_content_type} is not an accepted ContentType:" \ | |||
" 'csv', 'libsvm', 'text/csv', 'text/libsvm', 'text/x-libsvm'." | |||
INVALID_CONTENT_TYPE_ERROR = "{invalid_content_type} is not an accepted ContentType: " \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor. I would've expected text/csv;label_size=1
without space. Did you check that this is the conventional pattern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The public sagemaker documentation shows a space: https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html
if content_type_cfg_val in [CSV, _content_types.CSV]: | ||
# Allow 'csv' and 'text/csv' | ||
return CSV | ||
elif _content_types.CSV in content_type_cfg_val: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a really ugly way to deal with a general problem. Have you looked into existing tools that can parse request header content types? such as: https://pypi.org/project/python-mimeparse/
It just feels like we are trying to reinvent the wheel here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I hadn't looked into existing tools.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like python has a library called 'cgi' with a parse_header() function which we can leverage. My only concern is that it handles duplicate parameters by keeping the last one (i.e. 'text/csv; label_size=1; label_size=2' produces a parameter dictionary with {'label_size': '2'}). Is this acceptable behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also returns an empty dictionary if the parameter is only a name without a value (i.e. 'text/csv; label_size' -> ('text/csv', {})
), which would pass instead of erroring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the standard for MIME/HTTP headers mention rules for options in content-type?
If not then IMO it's better to be permissive than restrictive for optional parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The protocol doesn't have any guidelines regarding these edge cases: https://www.w3.org/Protocols/rfc1341/4_Content-Type.html
I'd say it should be fine to just go with the default behavior provided by cgi.parse_header()
851a2d7
to
cad7a90
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks much better, thanks for looking into header parsing tools.
@@ -25,8 +26,8 @@ | |||
LIBSVM = 'libsvm' | |||
|
|||
|
|||
INVALID_CONTENT_TYPE_ERROR = "{invalid_content_type} is not an accepted ContentType:" \ | |||
" 'csv', 'libsvm', 'text/csv', 'text/libsvm', 'text/x-libsvm'." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: These accepted ContentTypes should be put in a list so that we aren't copy/pasting raw string values. No need to submit a review for this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean just for this error message, or for all references to these content types? I went ahead and created a list just for the error message, but it might be worthwhile to migrate everything into a dictionary (though this would be a separate issue to deal with).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be a list that can be referenced throughout the codebase so that we aren't checking against a copy pasted list repeatedly. This isn't just for the error messages but throughout the repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would mean changing the way we're referencing these content types on a large scale. We should definitely do this, but I'm not sure this PR is the right place to make this change. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair. Can we make the list for the error messages for now, since they refer to the same exact list? No need to resubmit for approval.
…arameters. Moved logic into helper function, improved edge case handling, added more test cases. Using python cgi library to parse mime type. Added list of content types for error message.
Issue #, if available:
https://issues.amazon.com/issues/AWSMLC-77
Description of changes:
Added additional handling in data_utils.py to check for 'text/csv' content type with the
label_size
optional parameter. If thelabel_size
is present and is set to 1, the content type is accepted, otherwise aUserError
is raised.Testing:
Added two assert statements in the
test_get_content_type
unit test. Tests were run with tox and passed successfully.By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.