Properly set CharsetGroupProber.state to FOUND_IT #203

Merged
merged 1 commit into from Dec 10, 2020

Conversation

@dan-blanchard dan-blanchard commented Dec 10, 2020

Throughout the rest of the chardet code we assume that FOUND_IT means we can stop looking. Previously, the CharsetGroupProber did not set its own state to FOUND_IT when a child prober returned FOUND_IT, so callers kept feeding it data. Setting the state properly substantially speeds up chardet for most encodings.

Fixes #202
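
For readers unfamiliar with the prober hierarchy, here is a minimal sketch of the idea, not the actual chardet implementation. `ProbingState`, `ChildProber`, `GroupProber`, and `detect` are hypothetical stand-ins written for this illustration; only the FOUND_IT concept and the group/child relationship come from the description above.

```python
from enum import Enum


class ProbingState(Enum):
    # Stand-in enum for this sketch; not imported from chardet.
    DETECTING = 0
    FOUND_IT = 1
    NOT_ME = 2


class ChildProber:
    """Hypothetical child prober: reports FOUND_IT once it sees its marker."""

    def __init__(self, name, marker):
        self.name = name
        self.marker = marker
        self.state = ProbingState.DETECTING

    def feed(self, data):
        if self.marker in data:
            self.state = ProbingState.FOUND_IT
        return self.state


class GroupProber:
    """Hypothetical group prober that forwards data to its children."""

    def __init__(self, probers):
        self.probers = probers
        self.state = ProbingState.DETECTING
        self.best = None

    def feed(self, data):
        for prober in self.probers:
            if prober.feed(data) == ProbingState.FOUND_IT:
                self.best = prober
                # The point of the fix: record FOUND_IT on the group itself,
                # so callers that check the group's state can stop early.
                self.state = ProbingState.FOUND_IT
                return self.state
        return self.state


def detect(group, chunks):
    """Feed chunks until the group reports FOUND_IT; return chunks consumed."""
    consumed = 0
    for chunk in chunks:
        consumed += 1
        if group.feed(chunk) == ProbingState.FOUND_IT:
            break
    return consumed


group = GroupProber([ChildProber("a", b"MARK-A"), ChildProber("b", b"MARK-B")])
chunks = [b"plain", b"has MARK-B here", b"never read", b"never read"]
consumed = detect(group, chunks)
print(consumed, group.state, group.best.name)
```

If the group left its state at DETECTING after a child succeeded, the loop in `detect` would have no reason to stop and would consume every remaining chunk, which is the kind of wasted work the benchmarks below measure.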

Before change:

Benchmarking chardet 4.0.0 on CPython 3.7.5 (default, Sep  8 2020, 12:19:42)
[Clang 11.0.3 (clang-1103.0.32.62)]
--------------------------------------------------------------------------------
.......................................................................................................................................................................................................................................................................................................................................................................
Calls per second for each encoding:
ascii: 32887.903815995815
big5: 3.906468756080019
cp932: 2.980933994770643
cp949: 1.7553914946606737
euc-jp: 2.7375734716114426
euc-kr: 2.948884082793895
euc-tw: 67.4964556469068
gb2312: 4.909248428000002
ibm855: 15.664702233398641
ibm866: 35.52791448770514
iso-2022-jp: 1916.0822293284605
iso-2022-kr: 16663.901470003973
iso-8859-1: 65.22453585573267
iso-8859-5: 24.604965867422464
iso-8859-7: 39.346294718222524
koi8-r: 28.07679900564193
maccyrillic: 21.481341737898905
shift_jis: 4.096676458365973
tis-620: 9.342617557947577
utf-16: 93518.48383500558
utf-32: 82727.88954635109
utf-8: 23.69032943598593
utf-8-sig: 94042.69058295964
windows-1251: 25.231271611701636
windows-1252: 212.92666644668208
windows-1255: 13.980774444448262

Total time: 503.4055278301239s (7.131427450696289 calls per second)

After change:

Benchmarking chardet 4.0.0 on CPython 3.7.5 (default, Sep  8 2020, 12:19:42)
[Clang 11.0.3 (clang-1103.0.32.62)]
--------------------------------------------------------------------------------
.......................................................................................................................................................................................................................................................................................................................................................................
Calls per second for each encoding:
ascii: 38176.31067961165
big5: 12.86915132656389
cp932: 4.656400877065864
cp949: 7.282976434315926
euc-jp: 4.329381447610525
euc-kr: 8.16386823884839
euc-tw: 90.230745070368
gb2312: 14.248865889128146
ibm855: 33.30225548069821
ibm866: 44.181691968506
iso-2022-jp: 3024.2295767539117
iso-2022-kr: 25055.57945041816
iso-8859-1: 59.25262902122995
iso-8859-5: 39.7069713674529
iso-8859-7: 61.008422013862194
koi8-r: 41.21560517643845
maccyrillic: 31.402474369805002
shift_jis: 4.9091652743515155
tis-620: 14.408875278821073
utf-16: 177349.00634249471
utf-32: 186413.51111111112
utf-8: 108.62174360115105
utf-8-sig: 181965.46637744035
windows-1251: 43.16933400329809
windows-1252: 211.27653358317968
windows-1255: 16.15113643694104

Total time: 268.0230791568756s (13.394368915143872 calls per second)

Furthermore, this finally makes chardet 4.x faster than 3.x. Being slower than 3.x was one of the main things holding up the 4.x release.

3.0.4 benchmarks for reference:

Benchmarking chardet 3.0.4 on CPython 3.7.5 (default, Sep  8 2020, 12:19:42)
[Clang 11.0.3 (clang-1103.0.32.62)]
--------------------------------------------------------------------------------
.......................................................................................................................................................................................................................................................................................................................................................................
Calls per second for each encoding:
ascii: 25559.439366240098
big5: 7.187002209518091 X
cp932: 4.71090956645177 X
cp949: 2.937256786994428 X
euc-jp: 4.870580412090848 X
euc-kr: 6.6910755971933416 X
euc-tw: 87.71098043480079 X
gb2312: 6.614302607154443 X
ibm855: 27.595893549680685 X
ibm866: 29.93483661732791
iso-2022-jp: 3379.5052775763434 X
iso-2022-kr: 26181.67290886392 X
iso-8859-1: 120.63424740403983 X
iso-8859-5: 32.65106262196898 X
iso-8859-7: 62.480089080556084 X
koi8-r: 13.72481001727257
maccyrillic: 33.018537255804496 X
shift_jis: 4.996013583677438 X
tis-620: 14.323112928341818 X
utf-16: 166771.53081510935 X
utf-32: 198782.18009478672 X
utf-8: 13.966236809766901 X
utf-8-sig: 193732.28637413395 X
windows-1251: 23.038910006925768
windows-1252: 99.48409117053738
windows-1255: 6.336261495718825

Total time: 357.05358052253723s (10.054513372323958 calls per second)
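
To put the three runs side by side, the short script below computes speedup ratios from the "Total time" lines and the utf-8 rows of the logs above. The variable names are made up for this illustration; the numbers are copied from the benchmark output.

```python
# Total wall-clock times in seconds, copied from the "Total time" lines.
total_before = 503.4055278301239   # chardet 4.0.0 before the change
total_after = 268.0230791568756    # chardet 4.0.0 after the change
total_304 = 357.05358052253723     # chardet 3.0.4

# utf-8 calls per second, copied from the per-encoding rows.
utf8_before = 23.69032943598593
utf8_after = 108.62174360115105

# A ratio above 1.0 means the patched 4.0.0 run is faster.
speedup_vs_before = total_before / total_after
speedup_vs_304 = total_304 / total_after
utf8_speedup = utf8_after / utf8_before

print(f"overall vs. unpatched 4.0.0: {speedup_vs_before:.2f}x")
print(f"overall vs. 3.0.4:           {speedup_vs_304:.2f}x")
print(f"utf-8 vs. unpatched 4.0.0:   {utf8_speedup:.2f}x")
```

On these figures the patched build finishes the whole suite in a little over half the time of the unpatched build and about three quarters of the time of 3.0.4, with utf-8 detection among the largest individual gains.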