Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Utility function to check file encoding #87

Merged
merged 9 commits into from
Jan 11, 2023

Conversation

ryttry
Copy link
Collaborator

@ryttry ryttry commented Jan 5, 2023

Users currently mistakenly import csv datasets with encodings other than utf-8, which would result in errors when other features like enrichment are used. Such a validation is currently not built into the backend. To minimize this source of error from SDK users, the utility function is_utf8 is created to enable users to detect if the encoding of their csv datasets is utf-8 prior to importing into their projects via the SDK.

Copy link
Contributor

@davidstanley01 davidstanley01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 questions about signature and automatic validation for the user

src/watchful/client.py Outdated Show resolved Hide resolved
confidence lower than the given threshold, otherwise 0 if the encoding is
not utf-8 or 1 if it is. This function may need some tweaking for a very
large dataset, but should work with the ``is_fast`` argument set to True by
default. This function is suitable to be invoked prior to invoking
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a call to this function as a mandatory step instead of leaving it to the user to know and implement the call?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very valid question, I had thought about it but I did not go for it because:

  1. although it is improbable for a utf8 encoded file to be detected as non-utf8, it is possible
  2. although it is improbable for a non-utf8 encoded file to be detected as utf8, it is possible

If this call is made mandatory, and if case 1 happens (rarely), then users will not be able to upload their datasets. However, we can make it an intentional loose check (i.e. a warning) that does not prevent uploading even if the checker detects non-utf8 when the user is super sure that the dataset file is utf8 encoded.

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of emitting a warning. In general, my worry is a silent failure that isn't obvious to the user. Whatever we can do to help them debug problems on their own would be my choice.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be a point to start thinking about logging in the SDK, which can help us and users themselves to supporting user issues. For now, I would cater to this with a "warning" message printed to screen; I'll also add a "force" argument so that a user can upload even if the checker detects non-utf8 when the user is super sure that the dataset file is utf8 encoded.

@shortcut-integration
Copy link

This pull request has been linked to Shortcut Story #9843: Utility to check file encoding.

@ryttry ryttry force-pushed the runyan/sc-9843/utility-to-check-file-encoding branch 4 times, most recently from 9546f87 to 31c15d8 Compare January 9, 2023 10:25
@ryttry ryttry force-pushed the runyan/sc-9843/utility-to-check-file-encoding branch from 31c15d8 to 40e9c53 Compare January 11, 2023 12:43
@ryttry ryttry merged commit adc2eae into main Jan 11, 2023
@ryttry ryttry deleted the runyan/sc-9843/utility-to-check-file-encoding branch January 11, 2023 16:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants