csv header sniffer fails for utf-8 encoding #309
The method that infers the header reads a fixed number of bytes from the file (nbytes, 10000 by default), so it can fail if the last byte read is part of a multi-byte UTF-8 character. Decoding those first bytes should tolerate such errors; see the error-handler options in the codecs documentation.
A possible solution is to replace invalid bytes with the Unicode replacement character (U+FFFD):
--- a/odo/backends/csv.py
+++ b/odo/backends/csv.py
@@ -93,7 +93,7 @@ def infer_header(path, nbytes=10000, encoding='utf-8', **kwargs):
     if not raw:
         return True
     sniffer = PipeSniffer()
-    decoded = raw if PY2 else raw.decode(encoding)
+    decoded = raw if PY2 else raw.decode(encoding, 'replace')
     try:
         return sniffer.has_header(decoded)
     except csv.Error:
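To illustrate the failure outside of odo, here is a minimal sketch that truncates a UTF-8 buffer in the middle of a two-byte character and compares strict decoding with the 'replace' handler. The sample data and the helper names strict_decode_fails and lenient_decode are made up for this illustration and are not part of odo.

```python
# Hypothetical sample data, not taken from odo or its test suite.
data = "name,city\nZoë,Köln\n".encode("utf-8")

# Simulate reading a fixed number of bytes that ends in the middle of
# the two-byte UTF-8 sequence for "ë" (only its first byte is kept).
cut = data.index("ë".encode("utf-8")) + 1
raw = data[:cut]


def strict_decode_fails(chunk, encoding="utf-8"):
    """Return True if decoding chunk with the default 'strict' handler raises."""
    try:
        chunk.decode(encoding)
    except UnicodeDecodeError:
        return True
    return False


def lenient_decode(chunk, encoding="utf-8"):
    """Decode chunk, substituting U+FFFD for any undecodable bytes."""
    return chunk.decode(encoding, "replace")


if __name__ == "__main__":
    print(strict_decode_fails(raw))    # True: the buffer ends mid-character
    print(repr(lenient_decode(raw)))   # the trailing partial byte becomes U+FFFD
```

Since only the tail of the sample can be affected by the cut, the replacement character should not change what the sniffer sees in the header row, as long as the sample is longer than the first line.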