csv header sniffer fails for utf-8 encoding #309

Closed
hdevalke opened this issue Sep 5, 2015 · 0 comments

Because infer_header reads exactly nbytes bytes from the file (10000 by default), decoding can fail when the read ends in the middle of a multi-byte UTF-8 character. The decoding of those first nbytes bytes should tolerate such errors instead of raising; see the error handlers described in the codecs documentation.
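
A minimal sketch of the failure, assuming Python 3. The sample data and the cut position are made up for illustration and stand in for a fixed-size read that happens to end inside a character:

```python
# Hypothetical sample data; any CSV text with a multi-byte character would do.
raw_full = "name,city\nJosé,Málaga\n".encode("utf-8")

# Simulate a fixed-size read that stops after the first byte of "é"
# (which is two bytes in UTF-8), leaving an incomplete sequence at the end.
cut = raw_full.index("é".encode("utf-8")) + 1
raw = raw_full[:cut]

try:
    raw.decode("utf-8")  # strict error handling is the default
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict decode failed:", exc.reason)
```

With the default strict handler, the dangling lead byte is enough to make the whole decode raise, even though everything before it is valid.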

A possible solution is to replace invalid bytes with the Unicode replacement character (U+FFFD):

--- a/odo/backends/csv.py
+++ b/odo/backends/csv.py
@@ -93,7 +93,7 @@ def infer_header(path, nbytes=10000, encoding='utf-8', **kwargs):
     if not raw:
         return True
     sniffer = PipeSniffer()
-    decoded = raw if PY2 else raw.decode(encoding)
+    decoded = raw if PY2 else raw.decode(encoding, 'replace')
     try:
         return sniffer.has_header(decoded)
     except csv.Error:
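
The effect of the patch can be sketched outside odo as follows, assuming Python 3. The byte string is invented for illustration. The incremental-decoder variant is not part of the proposed patch; it is shown only as another option from the codecs module that drops the incomplete tail instead of replacing it:

```python
import codecs

# Hypothetical truncated read: valid text followed by the first byte of "é".
raw = "name,city\nJos".encode("utf-8") + b"\xc3"

# Approach from the patch: the incomplete sequence becomes U+FFFD.
replaced = raw.decode("utf-8", "replace")

# Alternative (an assumption, not from the patch): an incremental decoder
# called with final=False buffers the incomplete trailing bytes
# rather than raising or substituting a character.
decoder = codecs.getincrementaldecoder("utf-8")()
dropped = decoder.decode(raw, final=False)

print(repr(replaced))
print(repr(dropped))
```

Either way the sniffer receives a valid str. The 'replace' handler is the smaller change, at the cost of possibly adding one replacement character at the end of the sample.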