Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix non-alg mode content-type handling #223

Merged
merged 2 commits into from
Oct 19, 2021
Merged

Conversation

awsbmillare
Copy link
Contributor

@awsbmillare awsbmillare commented Oct 19, 2021

Issue #, if available:

Internal high escalations

Description of changes:

Content-type handling of default_input_fn (for non-algorithm mode) currently has a brittle implementation that does not accept more sophisticated content-type formats. Content formats are well defined https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type and can have parameters after a semicolon. Currently, this container can support loading of csv files, where it will convert csv files into an xgboost dmatrix as long as the supplied content-type is 'text/csv'. However, it will throw an UnsupportedFormatError if passed 'text/csv; charset=UTF-8', which is incorrect as this should be supported. This change correctly accepts this case and opens the door to parsing/handling those additional parameters if needed in the future. Note that in the case of csv and UTF-8, python3 already handles the UTF-8 case with no other changes needed.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@mabunday
Copy link
Contributor

What's the motivation for this change? SageMaker already uses semicolons to allow specification of additional content type parameters (e.g. text/csv;type=single for single dense 2D tensor), but in general we don't subscribe to the common web specifications.

The risk of making this change is that by not being prescriptive about what parameters are and aren't allowed (and if they are allowed, whether they're actually used) it could actually cause more customer confusion.

In this particular case, if the change is motivated by customers commonly passing text/csv; charset=UTF-8 thinking that specifying the characterset is necessary, then perhaps the error message thrown by UnsupportedFormatError (and public documentation in general) can be updated or improved instead.

I think it's an interesting idea and I agree that the current implementation is brittle and poorly documented, but given that this is an API change I think we have to be careful about which doors we open and when.

@awsbmillare
Copy link
Contributor Author

awsbmillare commented Oct 19, 2021

I can't link internal tickets and integ tests here but this is likely causing breakage right now. Also for non-alg we are NOT handling semicolons https://github.com/aws/sagemaker-xgboost-container/blob/master/src/sagemaker_xgboost_container/handler_service.py#L61 . For alg-mode we are

dtest, content_type = serve_utils.parse_content_data(input_data, input_content_type)
, there are two different implementations here.

Currently my understanding is some platform/reverse-proxy is now calling the server with charset=utf-8 supplied, which was not happening in the past. This is currently the relief fix once this is confirmed to actually address the situation (but note I don't technically need the approval right away, I just need a codebuild to build a test image, so I'm happy (and already planning) to do further research).

Just to compare to non-alg mode, that accepts content-type with multiple variants and ignores irrelevant params (and also accepts specific non-media-type valid forms as well). In these proposed changes, we simply ignore additional params. This increases what we accept on and in the future, if we care about throwing errors for other invalid params combinations, this would be a "breaking" change, but I don't see this being much an issue when it comes to "breaking" changes for clearly unspecified/unused data.

@yusharon yusharon merged commit b03ac7b into master Oct 19, 2021
@mabunday mabunday deleted the inference_content_type_fix branch August 4, 2022 20:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants