helpers.detect_encoding detects 'iso-8859-1' instead of UTF-8 #265
I just tried character detection using the filemagic module on this test case, and it detects […]
I checked both […]. I think it makes sense to create an issue there if it's still relevant. Anyway, encoding detection can't be 100% precise. If it's not, an encoding should be provided by the client. Please re-open if there are any ideas of how we can fix it on the […]
I don't really understand why you closed this issue, because it is the issue you propose to create :)
If I understand you correctly, the caller should do its own encoding detection then provide the detected encoding to the tabulator library.
I would say either use a better encoding-detection library (but there will always be invalid corner cases...), or keep […]. So the second option calls for keeping this issue closed :) About frictionless-py, I'm interested to know more about the design choices around encoding detection.
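As a sketch of the "caller does its own detection" option: since UTF-8 is self-validating, a client can try a strict UTF-8 decode first and only fall back to another encoding when that fails. This is a minimal illustration, not tabulator's or goodtables' actual API, and the choice of `iso-8859-1` as the fallback is an assumption:

```python
def detect_encoding(sample: bytes, fallback: str = "iso-8859-1") -> str:
    """Return "utf-8" when the sample decodes cleanly as strict UTF-8.

    A clean UTF-8 decode is strong evidence, because random single-byte
    text rarely forms valid multi-byte UTF-8 sequences. The fallback
    encoding is a caller-chosen assumption, not something prescribed
    by tabulator.
    """
    try:
        sample.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback

print(detect_encoding("numéro".encode("utf-8")))    # utf-8
print(detect_encoding("numéro".encode("latin-1")))  # iso-8859-1
```

The caller would then pass the returned encoding to the library explicitly instead of relying on chardet's guess.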
Hi @cbenz, Sorry if I closed the issue by mistake; I had a massive review of the whole backlog (like 20 repos) and might have closed something too early. Frictionless solves it by providing a hook to override encoding detection. This is how it looks with the default implementation:

```python
import chardet
# assuming frictionless exposes its config module with the DEFAULT_* constants
from frictionless import Table, Control, config

# Gets a byte sample / returns an encoding string
def detect_encoding(sample):
    result = chardet.detect(sample)
    confidence = result["confidence"] or 0
    encoding = result["encoding"] or config.DEFAULT_ENCODING
    if confidence < config.DEFAULT_INFER_ENCODING_CONFIDENCE:
        encoding = config.DEFAULT_ENCODING
    if encoding == "ascii":
        encoding = config.DEFAULT_ENCODING
    return encoding

control = Control(detect_encoding=detect_encoding)

with Table("capital-3.csv", control=control) as table:
    print(table.source)
    print(table.encoding)
```

Also, we stick to […]
Hi,
I encountered a problem validating a UTF-8 CSV file where one of the columns has a name containing an accent. But as the CSV file's detected encoding is wrong, goodtables reports a `non-matching-header` error. You can try it online.
It seems that the problem comes from the `helpers.detect_encoding` function (based on Python's chardet), which detects 'iso-8859-1' instead of 'UTF-8'.

schema_and_csv_samples.zip
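The symptom can be reproduced without goodtables at all: decoding the UTF-8 bytes of an accented header with the wrongly detected encoding silently changes the header text, which is exactly what makes the header stop matching the schema. The column name `numéro` below is a hypothetical example, not taken from the attached sample:

```python
header = "numéro"                    # hypothetical accented column name
raw = header.encode("utf-8")         # the file really is UTF-8
misread = raw.decode("iso-8859-1")   # what the wrong detection produces
print(misread)                       # numÃ©ro - no longer matches the schema
```

Because every byte is valid in iso-8859-1, the decode never raises; the corruption only surfaces later as a header mismatch.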