helpers.detect_encoding detects 'iso-8859-1' instead of UTF-8 #265

Closed

pierredittgen opened this issue Jul 2, 2019 · 4 comments

@pierredittgen
Contributor

Hi,

I've encountered a problem validating a UTF-8 CSV file in which one of the column names contains an accented character. Because the detected encoding of the CSV file is wrong, goodtables reports a non-matching-header error.

You can try it online.

It seems that the problem comes from the helpers.detect_encoding function (based on Python's chardet), which detects 'iso-8859-1' instead of 'UTF-8'.

schema_and_csv_samples.zip
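
For illustration, a minimal sketch of the kind of misdetection involved (the header below is hypothetical; any short UTF-8 sample with an accented column name should behave similarly):

import chardet

# Hypothetical UTF-8 sample with an accented column name
sample = "id,libellé,valeur\n1,Paris,75\n".encode("utf-8")
print(chardet.detect(sample))
# On a short sample like this, chardet may report ISO-8859-1
# (often with low confidence) even though the bytes are valid UTF-8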

@cbenz
Contributor

cbenz commented Jul 2, 2019

I just tried character detection using the filemagic module, on this test case, and it detects utf-8 correctly.
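
For reference, a minimal sketch of that check, assuming the filemagic package (which wraps libmagic) is installed; the file name is a placeholder:

import magic  # pip install filemagic (requires libmagic)

# Ask libmagic for the MIME encoding of a byte sample
with open("sample.csv", "rb") as f:
    sample = f.read()

with magic.Magic(flags=magic.MAGIC_MIME_ENCODING) as m:
    print(m.id_buffer(sample))  # e.g. 'utf-8'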

@roll
Member

roll commented Mar 25, 2020

I checked, and both chardet and cchardet detect it incorrectly.

I think it makes sense to create an issue there if it's still relevant. In any case, encoding detection can't be 100% precise. When it fails, the encoding should be provided by the client.

Please re-open if there are any ideas on how we can fix it at the tabulator level.

@cbenz
Contributor

cbenz commented Nov 23, 2020

@roll

I think it makes sense to create an issue there if it's still relevant. In any case, encoding detection can't be 100% precise.

I don't really understand why you closed this issue, because it is the very issue you propose to create :)

When it fails, the encoding should be provided by the client.

If I understand you correctly, the caller should do its own encoding detection and then provide the detected encoding to the tabulator library.
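
For example (a sketch; the file name is a placeholder), tabulator's Stream accepts an explicit encoding argument, which bypasses detection entirely:

from tabulator import Stream

# The caller passes the encoding it detected (or knows) itself,
# so tabulator's chardet-based detection is never invoked
with Stream("data.csv", headers=1, encoding="utf-8") as stream:
    print(stream.headers)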

Please re-open if there are any ideas on how we can fix it at the tabulator level.

I would say either use a better encoding-detection library (but there will always be corner cases that fail...), or keep chardet in tabulator as a good-enough encoding-detection library and let the caller do the job.

So the second option calls for keeping this issue closed :)

About frictionless-py, I'd be interested to know more about its design choices around encoding detection.

@roll
Member

roll commented Nov 24, 2020

Hi @cbenz,

Sorry if I closed the issue by mistake; I did a massive review of the whole backlog (around 20 repos) and might have closed some things too early.

Frictionless solves it by providing a hook to override encoding detection. Here is how it looks with the default implementation:

import chardet
from frictionless import Table, Control, config

# Receives a byte sample, returns an encoding name
def detect_encoding(sample):
    result = chardet.detect(sample)
    confidence = result["confidence"] or 0
    encoding = result["encoding"] or config.DEFAULT_ENCODING
    # Fall back to the default encoding on low confidence or plain ASCII
    if confidence < config.DEFAULT_INFER_ENCODING_CONFIDENCE:
        encoding = config.DEFAULT_ENCODING
    if encoding == "ascii":
        encoding = config.DEFAULT_ENCODING
    return encoding

control = Control(detect_encoding=detect_encoding)
with Table("capital-3.csv", control=control) as table:
    print(table.source)
    print(table.encoding)
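
A caller that already knows the encoding can also use the same hook to short-circuit detection entirely, for example:

# Hypothetical override: always report UTF-8, skipping detection
control = Control(detect_encoding=lambda sample: "utf-8")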

Also, we stick to chardet because non-pure-Python detectors were consistently breaking builds on Anaconda Cloud for one reason or another.
