
Release 4.0.0 #140

Merged
merged 34 commits into from Dec 10, 2020

Conversation

@dan-blanchard
Member

@dan-blanchard dan-blanchard commented Oct 11, 2017

The optimizations in master really could use a release, even though my model retraining work has stalled out lately (because I've been too busy with work and life).

This will have to be a major version change since the model format has changed entirely (even though 99% of users never mess with that).

dan-blanchard and others added 3 commits Jun 8, 2017
…models (#121)

* Convert single byte charset modules to use dicts of dicts for language modules

- Also provide conversion script

* Fix debug logging check

* Keep Hungarian commented out until we retrain
* Add API option to get all the encodings confidence #96

* Make the code more straightforward by treating self.done = True as the real finish point of the analysis

* use detect_all instead of detect(.., all=True)

* fix corner case of when there is no good prober
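The new detect_all API added in the commits above can be exercised like this (a minimal sketch; the sample bytes are made up, and it assumes chardet with detect_all is importable):

```python
import chardet

# windows-1252 sample with accented characters so detection is non-trivial
data = "Ceci est un café très français.".encode("windows-1252")

# detect() returns only the single best guess
print(chardet.detect(data))

# detect_all() (new in this release) returns every plausible match,
# ordered from most to least confident
for guess in chardet.detect_all(data):
    print(guess["encoding"], guess["confidence"])
```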
@dan-blanchard dan-blanchard changed the title Release 3.1.0 Release 4.0.0 Oct 11, 2017
@dan-blanchard dan-blanchard requested a review from sigmavirus24 Oct 11, 2017
Member

@sigmavirus24 sigmavirus24 left a comment

I only had time to review this visually, but it all looks fine to me.


@dan-blanchard
Member Author

@dan-blanchard dan-blanchard commented Oct 19, 2017

Ugh. I put together a little benchmark script to show the performance improvements that switching to dicts would have made. Unfortunately, it showed exactly the opposite. Turns out the microbenchmarks I was running to justify the change didn't hold up when everything was fully integrated.

Before:

Benchmarking chardet 3.0.4 on CPython 3.6.3 | packaged by conda-forge | (default, Oct  5 2017, 19:18:17)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)]
--------------------------------------------------------------------------------
Calls per second for each encoding:
ascii: 31184.416356877322
big5: 5.070345385292817
cp932: 3.2069605829343444
cp949: 2.083250883504551
euc-jp: 3.320824171408068
euc-kr: 4.498561131545299
euc-tw: 48.99756198139536
gb2312: 4.55051526320596
ibm855: 18.0173338760511
ibm866: 27.157370866963888
iso-2022-jp: 2384.34654084475
iso-2022-kr: 10246.253817026995
iso-8859-1: 89.04007574438816
iso-8859-5: 23.843149290107355
iso-8859-7: 45.85586100711808
koi8-r: 19.71956986432471
maccyrillic: 22.258476294246357
shift_jis: 3.498450722935723
tis-620: 10.728576291002154
utf-16: 64527.75384615385
utf-32: 118316.05077574048
utf-8: 16.722890545863986
utf-8-sig: 97997.75700934579
windows-1251: 23.389860495067502
windows-1252: 147.1726925668089
windows-1255: 9.893936751289512

Total time: 459.3649871349335s (7.815136330678761 calls per second)

After:

Benchmarking chardet 4.0.0 on CPython 3.6.3 | packaged by conda-forge | (default, Oct  5 2017, 19:18:17)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)]
--------------------------------------------------------------------------------
Calls per second for each encoding:
ascii: 2271.8771820750844
big5: -0.18189224524472447
cp932: 0.046362371132514735
cp949: -0.19279474147731213
euc-jp: -0.17010130600427997
euc-kr: 0.0074659949420379235
euc-tw: 12.422974186586565
gb2312: 0.009971666683429525
ibm855: 0.8959192996468062
ibm866: -2.029760099831517
iso-2022-jp: -5.679269251714231
iso-2022-kr: 8226.787363329757
iso-8859-1: -0.5427543139328606
iso-8859-5: -0.7698640100054561
iso-8859-7: -1.893714346652203
koi8-r: -0.26980212358143163
maccyrillic: 0.5434692105018506
shift_jis: -0.06122787025867593
tis-620: -0.46658667115513275
utf-16: 56870.335878882324
utf-32: -44211.73982167688
utf-8: -0.5287044522123061
utf-8-sig: -6518.934762889956
windows-1251: -0.16414366325980012
windows-1252: -23.830037157210995
windows-1255: -0.2958933200025804

Total time: 469.64176321029663s (7.644124269230431 calls per second)

Significant Differences (ignoring everything where difference is less than 1 call per second):

Calls per second deltas (positive = good, negative = bad):

ascii: 2271.8771820750844
euc-tw: 12.422974186586565
ibm866: -2.029760099831517
iso-2022-jp: -5.679269251714231
iso-2022-kr: 8226.787363329757
iso-8859-7: -1.893714346652203
utf-16: 56870.335878882324
utf-32: -44211.73982167688
utf-8-sig: -6518.934762889956
windows-1252: -23.830037157210995

I don't quite know what to do with these results. It looks like we're much faster at detecting ASCII, ISO-2022-KR, and UTF-16 now, but much worse at detecting UTF-32 and UTF-8-SIG. The most confusing part is that I didn't actually change any of the ASCII detection code, and that runs before we even reach the SBCS probers; the probers I actually modified appear to have been mostly a wash.

Also, if you want to see what a true speedup actually looks like, try running it with cChardet:

Benchmarking cchardet 2.1.1 on CPython 3.6.3 | packaged by conda-forge | (default, Oct  5 2017, 19:18:17)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)]
--------------------------------------------------------------------------------
Calls per second for each encoding:
ascii: 265462.2784810127
big5: 2479.9403278330633
cp932: 2300.0972471026944
cp949: 3912.597014925373
euc-jp: 7759.846378605286
euc-kr: 2617.9834437344443
euc-tw: 2398.5269057013784
gb2312: 2580.697242287388
ibm855: 321.94328917658646
ibm866: 442.73316839260195
iso-2022-jp: 158875.15151515152
iso-2022-kr: 268865.641025641
iso-8859-1: 7852.786220239024
iso-8859-5: 520.6036793643053
iso-8859-7: 3327.2546981968762
koi8-r: 220.32986674266112
maccyrillic: 442.66703419435385
shift_jis: 5229.5663955513255
tis-620: 320.0622679735819
utf-16: 603496.9784172662
utf-32: 612307.1532846715
utf-8: 27442.071625344353
utf-8-sig: 453438.2702702703
windows-1251: 468.51856225429214
windows-1252: 28493.91304347826
windows-1255: 377.21009418033856

Total time: 4.861895561218262s (738.3951289773163 calls per second)
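The benchmark script itself isn't shown in this thread; a rough sketch of the per-encoding loop that would produce tables like the ones above might look like this (the `detect` callable, sample layout, and time budget are assumptions, not the actual script):

```python
import time

def bench(detect, samples_by_encoding, budget=1.0):
    """Call `detect` on each encoding's sample blobs for roughly
    `budget` seconds and report calls per second per encoding."""
    results = {}
    for encoding, blobs in samples_by_encoding.items():
        calls = 0
        start = time.perf_counter()
        while time.perf_counter() - start < budget:
            for blob in blobs:
                detect(blob)
                calls += 1
        results[encoding] = calls / (time.perf_counter() - start)
    return results

# Plug in chardet.detect (or cchardet.detect) as `detect` to compare.
```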


jdufresne and others added 6 commits Oct 21, 2017
The wheel package format supports including the license file. This is
done using the [metadata] section in the setup.cfg file. For additional
information on this feature, see:

https://wheel.readthedocs.io/en/stable/index.html#including-the-license-in-the-generated-wheel-file
Include license file in the generated wheel package
Helps pip decide what version of the library to install.

https://packaging.python.org/tutorials/distributing-packages/#python-requires

> If your project only runs on certain Python versions, setting the
> python_requires argument to the appropriate PEP 440 version specifier
> string will prevent pip from installing the project on other Python
> versions.

https://setuptools.readthedocs.io/en/latest/setuptools.html#new-and-changed-setup-keywords

> python_requires
>
> A string corresponding to a version specifier (as defined in PEP 440)
> for the Python version, used to specify the Requires-Python defined in
> PEP 345.
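Taken together, the two packaging commits above amount to something like this in setup.cfg (a sketch; the exact key names and the version specifier shown are illustrative and depend on the setuptools/wheel versions in use):

```ini
[metadata]
; Ship the license text inside the built wheel
license_file = LICENSE

[options]
; PEP 440 specifier; lets pip skip releases on unsupported interpreters
python_requires = >=3.5
```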
@elcolumbio

@elcolumbio elcolumbio commented Jun 2, 2018

OK, now I understand. If only very few of your rows contain non-ASCII characters, you can do something like the second snippet, which gives you a massive speedup.

Maybe remove all the ASCII-only content in a first parsing pass?

```python
from chardet.universaldetector import UniversalDetector

def guess_encoding():
    # fp_r.original.content holds the raw bytes of the file being sniffed
    u = fp_r.original.content
    detector = UniversalDetector()
    detector.feed(u)
    detector.close()
    return detector.result

%t guess_encoding()
```

4.71 s ± 63.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

```python
def guess_encoding2():
    # Run the detector line by line, resetting it between lines
    bytelist = fp_r.original.content.splitlines()
    guess = []
    detector = UniversalDetector()
    for line in bytelist:
        detector.reset()
        detector.feed(line)
        detector.close()
        guess.append(detector.result)
    return guess

%t guess_encoding2()
```

39 ms ± 842 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
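The "remove ASCII first" idea could be sketched like this (hypothetical helper names; `bytes.isascii()` needs Python 3.7+, and the detector import is deferred so the filter works without chardet installed):

```python
def non_ascii_lines(content: bytes):
    # Pure-ASCII lines carry no signal for the detector, so drop them
    return [line for line in content.splitlines() if not line.isascii()]

def guess_encoding_filtered(content: bytes):
    # Only feed the detector the lines that can actually disambiguate
    from chardet.universaldetector import UniversalDetector
    detector = UniversalDetector()
    for line in non_ascii_lines(content):
        detector.feed(line)
        if detector.done:
            break
    detector.close()
    return detector.result
```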


hrnciar and others added 8 commits Dec 8, 2020
When packaging chardet and pip (which bundles it) in Fedora, we realized that
there is a non-executable file with a shebang line.

It seems that the primary purpose of this file is to be imported from Python
code or to be executed via python chardet/cli/chardetect.py or
python -m chardet.cli.chardetect and hence the shebang appears to be unnecessary.

Shebangs are hard to handle when doing downstream packaging, because it makes
sense for upstream to use #!/usr/bin/env python while in the RPM package, we
need to avoid that and use a more specific interpreter. Since the shebang was
unused, I propose to remove it to avoid the problems.
Since setuptools v41.5.0 (27 Oct 2019), the 'test' command is formally
deprecated and should not be used.

The pytest-runner package also lists itself as deprecated:
https://github.com/pytest-dev/pytest-runner

> Deprecation Notice
>
> pytest-runner depends on deprecated features of setuptools and relies
> on features that break security mechanisms in pip. For example
> 'setup_requires' and 'tests_require' bypass pip --require-hashes. See
> also pypa/setuptools#1684.
The CLI entry point is installed by setuptools through the
console_scripts option. This setuptools feature automatically constructs
a file with a shebang and sets the executable bit. The imported file
chardet.cli.chardetect doesn't also require this bit.
Throughout the rest of the chardet code we assume that FOUND_IT means we
can stop looking. Previously the CharsetGroupProber did not set its
state appropriately when a child prober returned FOUND_IT. This change
substantially speeds up chardet for most encodings.

Fixes #202
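A minimal sketch of the state-propagation behavior described above, using simplified stand-ins rather than the actual chardet classes:

```python
from enum import Enum

class ProbingState(Enum):
    DETECTING = 0
    FOUND_IT = 1
    NOT_ME = 2

class GroupProber:
    """Toy group prober: adopts FOUND_IT as soon as any child prober
    reports it, so callers know they can stop feeding data."""

    def __init__(self, probers):
        self.probers = probers
        self.state = ProbingState.DETECTING

    def feed(self, data):
        for prober in self.probers:
            if prober.feed(data) is ProbingState.FOUND_IT:
                # Propagate the child's FOUND_IT to the group state
                self.state = ProbingState.FOUND_IT
                break
        return self.state
```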
@dan-blanchard
Member Author

@dan-blanchard dan-blanchard commented Dec 10, 2020

Now that we've got an actual performance improvement in #203, I'm going to push this absurdly long-lived release out. Retrained models will come in next release.

