chardet 3.0.0 breaks the behavior of 2.3.0 without a deprecation notice. #113

Closed
luzfcb opened this issue Apr 12, 2017 · 6 comments

luzfcb commented Apr 12, 2017

Hello, I was trying to solve an issue in binaryornot: cookiecutter/cookiecutter-django#1118 (comment)

The chardet.detect() function is expected to return a dictionary with encoding and confidence keys. However, None is being returned in some cases.

The cases in question occur when I try to detect the encoding of the first 1024 bytes of image files. (I do not know whether there are other cases like this.)

The question is: is this the expected result, a bug, or a documentation issue?

From this snippet and this sample_images.zip:

import chardet

print('chardet version:', chardet.__version__)

def detect_from_file(filename, length=1024):
    # Read only the first `length` bytes; the file is closed before detection runs.
    with open(filename, 'rb') as f:
        first_chunk_of_file = f.read(length)
    return chardet.detect(first_chunk_of_file)


images = ['bmp_format.bmp', 'gif_format.gif', 'jpg_format.jpg', 'png_format.png', 'tga_format.tga']

for i in images:
    c = detect_from_file(i)
    print('____\nfile: {}'.format(i))
    print(c, '\n')

chardet 3.0.1 returns:
chardet version: 3.0.1
____
file: bmp_format.bmp
{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''} 

____
file: gif_format.gif
None 

____
file: jpg_format.jpg
None 

____
file: png_format.png
None 

____
file: tga_format.tga
{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
but the expected output, as produced by chardet 2.3.0, is:
chardet version: 2.3.0
____
file: bmp_format.bmp
{'encoding': 'windows-1252', 'confidence': 0.73} 

____
file: gif_format.gif
{'encoding': None, 'confidence': 0.0} 

____
file: jpg_format.jpg
{'encoding': 'ISO-8859-2', 'confidence': 0.2766991521036145} 

____
file: png_format.png
{'encoding': 'IBM866', 'confidence': 0.345640381912402} 

____
file: tga_format.tga
{'encoding': 'windows-1252', 'confidence': 0.73}
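
For callers that have to cope with both return shapes shown above, one possible workaround is to normalize the result before reading its keys. This is only a sketch: `normalize_detection` is a hypothetical helper name, not part of chardet, and the fallback values simply mirror what 2.3.0 printed for the GIF sample.

```python
def normalize_detection(result):
    """Return a dict with 'encoding' and 'confidence' keys for any detect() result.

    `result` is whatever chardet.detect() returned: a dict, or None as
    reported above for chardet 3.0.1. (Hypothetical helper, not chardet API.)
    """
    if result is None:
        # Same shape that chardet 2.3.0 printed for the GIF sample.
        return {'encoding': None, 'confidence': 0.0}
    # Copy so the caller's dict is left untouched, then fill in missing keys.
    normalized = dict(result)
    normalized.setdefault('encoding', None)
    normalized.setdefault('confidence', 0.0)
    return normalized


# The two shapes that appear in the 3.0.1 output above.
print(normalize_detection(None))
print(normalize_detection({'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}))
```

In the snippet from this report, it would wrap the call as `normalize_detection(chardet.detect(first_chunk_of_file))`, so that `c['encoding']` never fails on a None result.
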
@dan-blanchard
Member

This was a completely accidental change on my part. I will push out a bugfix release ASAP.

pydanny pushed a commit to binaryornot/binaryornot that referenced this issue Apr 12, 2017

pydanny commented Apr 12, 2017

Thanks @dan-blanchard! 😄

In the meantime, we've pushed out a new version of binaryornot that forces the user to the older version of chardet. Not a perfect solution, but it will mitigate the problem temporarily.

Member

dan-blanchard commented Apr 12, 2017

@pydanny I just released 3.0.2 with this fix.


pydanny commented Apr 12, 2017

Hooray! Testing it now...


pydanny commented Apr 12, 2017

Tests ran great. Just pushed up 0.4.3 of binaryornot.

Again, @dan-blanchard, thank you for the quick response. 😄 🍪

Author

luzfcb commented Apr 12, 2017

@dan-blanchard thank you for the hard work on this library 👍

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Oct 21, 2019
0.4.4:
* Notify users for file i/o issues.

0.4.3:
* Restricted chardet to anything 3.0.2 or higher due to chardet/chardet#113.

0.4.2:
* Restricted chardet to anything under 3.0 due to chardet/chardet#113
* Added pyup badge
* Added utilities for pushing new versions up
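
The two restrictions in the changelog above (under 3.0, then 3.0.2 or higher) together exclude the chardet releases from 3.0 up to, but not including, 3.0.2. As a rough illustration, that range can be checked like this. Both function names are hypothetical, and the parser assumes plain dotted numeric versions; in a real project the same rule would more likely be written as a dependency specifier such as `chardet>=3.0.2`.

```python
def parse_version(version):
    """Turn '3.0.1' into (3, 0, 1). Assumes purely numeric dotted versions."""
    return tuple(int(part) for part in version.split('.'))


def chardet_version_allowed(version):
    """True for versions outside the range 3.0 <= version < 3.0.2."""
    parsed = parse_version(version)
    # Tuples compare element by element, so (3, 0, 1) sorts between
    # (3, 0) and (3, 0, 2).
    return parsed < (3, 0) or parsed >= (3, 0, 2)


for candidate in ['2.3.0', '3.0.0', '3.0.1', '3.0.2']:
    print(candidate, chardet_version_allowed(candidate))
```
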