helpers.detect_encoding detects 'iso-8859-1' instead of UTF-8 #265

Closed

pierredittgen opened this issue Jul 2, 2019 · 4 comments

@pierredittgen
Contributor

Hi,

I've encountered a problem validating a UTF-8 CSV file in which one of the column names contains an accented character. Because the detected encoding of the CSV file is wrong, goodtables reports a non-matching-header error.

You can try it online.

It seems that the problem comes from the helpers.detect_encoding function (based on Python's chardet), which detects 'iso-8859-1' instead of 'UTF-8'.

schema_and_csv_samples.zip
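
For illustration, a minimal sketch of the kind of misdetection involved (the header below is hypothetical; any short UTF-8 sample with an accented column name should behave similarly):

import chardet

# Hypothetical UTF-8 sample with an accented column name
sample = "id,libellé,valeur\n1,Paris,75\n".encode("utf-8")
print(chardet.detect(sample))
# On a short sample like this, chardet may report ISO-8859-1
# (often with low confidence) even though the bytes are valid UTF-8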

@cbenz
Contributor

cbenz commented Jul 2, 2019

I just tried character detection using the filemagic module, on this test case, and it detects utf-8 correctly.
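
For reference, a minimal sketch of that check, assuming the filemagic package (which wraps libmagic) is installed; the file name is a placeholder:

import magic  # pip install filemagic (requires libmagic)

# Ask libmagic for the MIME encoding of a byte sample
with open("sample.csv", "rb") as f:
    sample = f.read()

with magic.Magic(flags=magic.MAGIC_MIME_ENCODING) as m:
    print(m.id_buffer(sample))  # e.g. 'utf-8'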

@roll
Member

roll commented Mar 25, 2020

I checked, and both chardet and cchardet detect it incorrectly.

I think it makes sense to create an issue there if it's still relevant. In any case, encoding detection can't be 100% precise. When it fails, the encoding should be provided by the client.

Please re-open if there are any ideas on how we can fix it at the tabulator level.

@cbenz
Contributor

cbenz commented Nov 23, 2020

@roll

I think it makes sense to create an issue there if it's still relevant. In any case, encoding detection can't be 100% precise.

I don't really understand why you closed this issue, because it is the very issue you propose to create :)

When it fails, the encoding should be provided by the client.

If I understand you correctly, the caller should do its own encoding detection and then provide the detected encoding to the tabulator library.
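
For example (a sketch; the file name is a placeholder), tabulator's Stream accepts an explicit encoding argument, which bypasses detection entirely:

from tabulator import Stream

# The caller passes the encoding it detected (or knows) itself,
# so tabulator's chardet-based detection is never invoked
with Stream("data.csv", headers=1, encoding="utf-8") as stream:
    print(stream.headers)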

Please re-open if there are any ideas on how we can fix it at the tabulator level.

I would say either use a better encoding-detection library (but there will always be corner cases that fail...), or keep chardet in tabulator as a good-enough encoding-detection library and let the caller do the job.

So the second option calls for keeping this issue closed :)

About frictionless-py, I'd be interested to know more about its design choices around encoding detection.

@roll
Member

roll commented Nov 24, 2020

Hi @cbenz,

Sorry if I closed the issue by mistake; I did a massive review of the whole backlog (around 20 repos) and might have closed some things too early.

Frictionless solves it by providing a hook to override encoding detection. Here is how it looks with the default implementation:

import chardet
from frictionless import Table, Control, config

# Receives a byte sample, returns an encoding name
def detect_encoding(sample):
    result = chardet.detect(sample)
    confidence = result["confidence"] or 0
    encoding = result["encoding"] or config.DEFAULT_ENCODING
    # Fall back to the default encoding on low confidence or plain ASCII
    if confidence < config.DEFAULT_INFER_ENCODING_CONFIDENCE:
        encoding = config.DEFAULT_ENCODING
    if encoding == "ascii":
        encoding = config.DEFAULT_ENCODING
    return encoding

control = Control(detect_encoding=detect_encoding)
with Table("capital-3.csv", control=control) as table:
    print(table.source)
    print(table.encoding)
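
A caller that already knows the encoding can also use the same hook to short-circuit detection entirely, for example:

# Hypothetical override: always report UTF-8, skipping detection
control = Control(detect_encoding=lambda sample: "utf-8")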

Also, we stick to chardet because non-pure-Python detectors were consistently breaking builds on Anaconda Cloud for one reason or another.
