-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Utility function to check file encoding #87
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 questions about signature and automatic validation for the user
src/watchful/client.py
Outdated
confidence lower than the given threshold, otherwise 0 if the encoding is | ||
not utf-8 or 1 if it is. This function may need some tweaking for a very | ||
large dataset, but should work with the ``is_fast`` argument set to True by | ||
default. This function is suitable to be invoked prior to invoking |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a call to this function as a mandatory step instead of leaving it to the user to know and implement the call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very valid question, I had thought about it but I did not go for it because:
- although it is improbable for a utf8 encoded file to be detected as non-utf8, it is possible
- although it is improbable for a non-utf8 encoded file to be detected as utf8, it is possible
If this call is made mandatory, and if case 1 happens (rarely), then users will not be able to upload their datasets. However, we can make it an intentional loose check (i.e. a warning
) that does not prevent uploading even if the checker detects non-utf8 when the user is super sure that the dataset file is utf8 encoded.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of emitting a warning
. In general, my worry is a silent failure that isn't obvious to the user. Whatever we can do to help them debug problems on their own would be my choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be a point to start thinking about logging in the SDK, which can help us and users themselves to supporting user issues. For now, I would cater to this with a "warning" message printed to screen; I'll also add a "force" argument so that a user can upload even if the checker detects non-utf8 when the user is super sure that the dataset file is utf8 encoded.
This pull request has been linked to Shortcut Story #9843: Utility to check file encoding. |
9546f87
to
31c15d8
Compare
31c15d8
to
40e9c53
Compare
…f8 before loading it to project
Users currently mistakenly import csv datasets with encodings other than utf-8, which would result in errors when other features like enrichment are used. Such a validation is currently not built into the backend. To minimize this source of error from SDK users, the utility function
is_utf8
is created to enable users to detect if the encoding of their csv datasets is utf-8 prior to importing into their projects via the SDK.