slight increase in performance #252

deedy5 · 2022-05-11T13:18:27Z

Pytest

pytest

Master: 367 passed, 5 xfailed, 1 xpassed in 15.24s
Commit: 367 passed, 5 xfailed, 1 xpassed in 14.73s

bench.py

python3 bench.py

Master:

Benchmarking chardet 5.0.0dev0 on CPython 3.10.4 (main, Mar 24 2022, 13:07:27) [GCC 11.2.0]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Calls per second for each encoding:
ascii: 58908.76404494382
big5: 21.534473374712764
cp932: 8.165127375060932
cp949: 12.310296784615652
euc-jp: 7.129459713562277
euc-kr: 20.82451066846787
euc-tw: 145.64819862835316
gb2312: 24.251619432823837
ibm855: 56.16913800549732
ibm866: 69.08196781313022
iso-2022-jp: 5885.916362615773
iso-2022-kr: 45964.975342465754
iso-8859-1: 204.59753369061866
iso-8859-2-croatian: 60.762269278743005
iso-8859-2-czech: 144.05223146286218
iso-8859-2-polish: 84.63527142263315
iso-8859-2-slovak: 140.93952875710696
iso-8859-2-slovene: 73.71334371408184
iso-8859-5: 64.61028679712285
iso-8859-5-russian: 73.5648856709805
iso-8859-7: 113.83742441504452
iso-8859-9: 140.97884798311335
johab: 28.186055700945715
koi8-r: 66.42928707884049
maccyrillic: 61.512972449106535
shift_jis: 7.7978653245510605
tis-620: 23.24070785378144
utf-16: 264624.858044164
utf-32: 310689.18518518517
utf-8: 204.82674033308214
utf-8-sig: 353949.7046413502
windows-1250-croatian: 61.03033980405937
windows-1250-czech: 136.30194835265004
windows-1250-polish: 85.01284018379603
windows-1250-romanian: 100.35252430273928
windows-1250-slovak: 161.34606877291245
windows-1250-slovene: 136.3132442410691
windows-1251: 80.60570973381769
windows-1251-russian: 99.22585387864957
windows-1252: 132.90642009675736
windows-1255: 27.411148392784067

Total time: 155.6653971672058s (24.732535746943018 calls per second)

Commit:

Benchmarking chardet 5.0.0dev0 on CPython 3.10.4 (main, Mar 24 2022, 13:07:27) [GCC 11.2.0]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Calls per second for each encoding:
ascii: 59158.02538787024
big5: 21.87834983185563
cp932: 8.563019000126646
cp949: 12.235340242658744
euc-jp: 7.25511220010562
euc-kr: 20.948329544558874
euc-tw: 145.46685441189732
gb2312: 24.446441737041507
ibm855: 55.74363390076592
ibm866: 69.21712534861845
iso-2022-jp: 5875.198207031797
iso-2022-kr: 43873.47280334728
iso-8859-1: 216.7746470259389
iso-8859-2-croatian: 65.63958747398237
iso-8859-2-czech: 148.38925830206736
iso-8859-2-polish: 89.8694052636428
iso-8859-2-slovak: 146.3461984672097
iso-8859-2-slovene: 74.40073082688095
iso-8859-5: 66.97330179430043
iso-8859-5-russian: 72.73432412039581
iso-8859-7: 120.94023205923739
iso-8859-9: 141.92913532371642
johab: 28.300857106099606
koi8-r: 65.96703271469738
maccyrillic: 61.18715133776373
shift_jis: 8.021083913612316
tis-620: 24.002499649965074
utf-16: 259709.22600619195
utf-32: 314180.07490636705
utf-8: 202.57007372837302
utf-8-sig: 369542.2026431718
windows-1250-croatian: 66.69747398048835
windows-1250-czech: 141.04806518544424
windows-1250-polish: 88.87668353378933
windows-1250-romanian: 107.88235142816724
windows-1250-slovak: 168.99228429259253
windows-1250-slovene: 148.99942450745652
windows-1251: 79.77231163293987
windows-1251-russian: 101.00520819866695
windows-1252: 137.4552950371879
windows-1255: 28.375280134181445

Total time: 152.8467354774475s (25.188630872447167 calls per second)
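To put the two bench.py totals on a common scale, the relative change can be computed directly. This is only a small sketch: `percent_faster` is a hypothetical helper, not part of chardet or bench.py, and the two constants are copied from the runs above.

```python
def percent_faster(before_s: float, after_s: float) -> float:
    """Return the reduction in wall-clock time as a percentage of the baseline."""
    return (before_s - after_s) / before_s * 100.0


# Total times copied from the two bench.py runs above (seconds).
master_total_s = 155.6653971672058
commit_total_s = 152.8467354774475

reduction = percent_faster(master_total_s, commit_total_s)
print(f"bench.py total time reduced by {reduction:.2f}%")
```

That works out to a bit under 2% overall. These are single runs, and a few encodings (for example iso-2022-kr and utf-16) came out slightly slower on the commit, so differences of this size may be partly run-to-run noise.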

line_profiler

kernprof -lv performance_kernprof.py

performance_kernprof.py

#!/bin/python
"""Time chardet.detect() on every test file; run under kernprof for line profiles."""
import argparse
import sys
from glob import glob
from time import monotonic

from chardet import detect


def performance_compare(arguments):
    parser = argparse.ArgumentParser()
    parser.add_argument('-s', '--size-increase', action="store", default=1, type=int, dest='size_coeff',
                        help="Apply artificial size increase to challenge the detection mechanism further")

    args = parser.parse_args(arguments)

    chardet_results = []
    # Without recursive=True, "**" behaves like "*", so this matches ./tests/<dir>/<file>.
    datapaths = sorted(glob("./tests/**/*.*"))
    for i, tbt_path in enumerate(datapaths, start=1):
        with open(tbt_path, "rb") as fp:
            content = fp.read() * args.size_coeff

        # Time only the detect() call, not the file read.
        t0 = monotonic()
        detect(content)
        chardet_results.append(round(monotonic() - t0, 5))
        print(f"{i}/{len(datapaths)}\t{chardet_results[-1]}\t{tbt_path}")

    return 0


if __name__ == "__main__":
    sys.exit(performance_compare(sys.argv[1:]))

Total time: 8.22394 s
File: /home/user/Downloads/chardet-master/chardet/charsetprober.py
Function: remove_xml_tags at line 102

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   102                                               @staticmethod
   103                                               @profile
   104                                               def remove_xml_tags(buf):
   105                                                   """
   106                                                   Returns a copy of ``buf`` that retains only the sequences of English
   107                                                   alphabet and high byte characters that are not between <> characters.
   108                                           
   109                                                   This filter can be applied to all scripts which contain both English
   110                                                   characters and extended ASCII characters, but is currently only used by
   111                                                   ``Latin1Prober``.
   112                                                   """
   113       482        489.0      1.0      0.0          filtered = bytearray()
   114       482        195.0      0.4      0.0          in_tag = False
   115       482        133.0      0.3      0.0          prev = 0
   116                                           
   117   5958119    1669798.0      0.3     20.3          for curr in range(len(buf)):
   118                                                       # Slice here to get bytes instead of an int with Python 3
   119   5957637    2266384.0      0.4     27.6              buf_char = buf[curr : curr + 1]
   120                                                       # Check if we're coming out of or entering an XML tag
   121   5957637    1978452.0      0.3     24.1              if buf_char == b">":
   122    118029      36315.0      0.3      0.4                  prev = curr + 1
   123    118029      33817.0      0.3      0.4                  in_tag = False
   124   5839608    2076011.0      0.4     25.2              elif buf_char == b"<":
   125    117993      38223.0      0.3      0.5                  if curr > prev and not in_tag:
   126                                                               # Keep everything after last non-extended-ASCII,
   127                                                               # non-alphabetic character
   128    100417      50089.0      0.5      0.6                      filtered.extend(buf[prev:curr])
   129                                                               # Output a space to delimit stretch we kept
   130    100417      37360.0      0.4      0.5                      filtered.extend(b" ")
   131    117993      36052.0      0.3      0.4                  in_tag = True
   132                                           
   133                                                   # If we're not in a tag...
   134       482        137.0      0.3      0.0          if not in_tag:
   135                                                       # Keep everything after last non-extended-ASCII, non-alphabetic
   136                                                       # character
   137       476        353.0      0.7      0.0              filtered.extend(buf[prev:])
   138                                           
   139       482        131.0      0.3      0.0          return filtered

Commit:

Total time: 6.10093 s
File: /home/user/Downloads/chardet-test/chardet/charsetprober.py
Function: remove_xml_tags at line 102

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   102                                               @staticmethod
   103                                               @profile
   104                                               def remove_xml_tags(buf):
   105                                                   """
   106                                                   Returns a copy of ``buf`` that retains only the sequences of English
   107                                                   alphabet and high byte characters that are not between <> characters.
   108                                           
   109                                                   This filter can be applied to all scripts which contain both English
   110                                                   characters and extended ASCII characters, but is currently only used by
   111                                                   ``Latin1Prober``.
   112                                                   """
   113       482        458.0      1.0      0.0          filtered = bytearray()
   114       482        199.0      0.4      0.0          in_tag = False
   115       482        148.0      0.3      0.0          prev = 0
   116       482        938.0      1.9      0.0          buf = memoryview(buf).cast('c')
   117                                                   
   118   5958119    1907358.0      0.3     31.3          for curr, buf_char in enumerate(buf):
   119                                                       # Check if we're coming out of or entering an XML tag
   120   5957637    1921628.0      0.3     31.5              if buf_char == b">":
   121    118029      36919.0      0.3      0.6                  prev = curr + 1
   122    118029      34301.0      0.3      0.6                  in_tag = False
   123   5839608    2032802.0      0.3     33.3              elif buf_char == b"<":
   124    117993      39770.0      0.3      0.7                  if curr > prev and not in_tag:
   125                                                               # Keep everything after last non-extended-ASCII,
   126                                                               # non-alphabetic character
   127    100417      50635.0      0.5      0.8                      filtered.extend(buf[prev:curr])
   128                                                               # Output a space to delimit stretch we kept
   129    100417      38459.0      0.4      0.6                      filtered.extend(b" ")
   130    117993      36687.0      0.3      0.6                  in_tag = True
   131                                           
   132                                                   # If we're not in a tag...
   133       482        157.0      0.3      0.0          if not in_tag:
   134                                                       # Keep everything after last non-extended-ASCII, non-alphabetic
   135                                                       # character
   136       476        318.0      0.7      0.0              filtered.extend(buf[prev:])
   137                                                   
   138       482        150.0      0.3      0.0          return filtered
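The two profiles differ only in how the loop obtains each one-byte `bytes` value. A standalone sketch of both variants follows, reconstructed from the line_profiler listings above. The function names `remove_xml_tags_slicing` and `remove_xml_tags_memoryview` are made up for this comparison; in chardet the code is a static method named `remove_xml_tags`.

```python
def remove_xml_tags_slicing(buf: bytes) -> bytearray:
    """Master variant: slice one byte at a time to get bytes instead of int."""
    filtered = bytearray()
    in_tag = False
    prev = 0
    for curr in range(len(buf)):
        buf_char = buf[curr : curr + 1]
        if buf_char == b">":
            prev = curr + 1
            in_tag = False
        elif buf_char == b"<":
            if curr > prev and not in_tag:
                # Keep the text between tags, followed by a delimiting space.
                filtered.extend(buf[prev:curr])
                filtered.extend(b" ")
            in_tag = True
    if not in_tag:
        filtered.extend(buf[prev:])
    return filtered


def remove_xml_tags_memoryview(buf: bytes) -> bytearray:
    """Commit variant: a memoryview cast to format 'c' yields length-1 bytes."""
    filtered = bytearray()
    in_tag = False
    prev = 0
    view = memoryview(buf).cast("c")
    for curr, buf_char in enumerate(view):
        if buf_char == b">":
            prev = curr + 1
            in_tag = False
        elif buf_char == b"<":
            if curr > prev and not in_tag:
                # Slicing the view avoids copying before extend() consumes it.
                filtered.extend(view[prev:curr])
                filtered.extend(b" ")
            in_tag = True
    if not in_tag:
        filtered.extend(view[prev:])
    return filtered


sample = b"<p>hello</p> world"
print(remove_xml_tags_slicing(sample))
print(remove_xml_tags_memoryview(sample))
```

Both variants should return the same `bytearray` for the same input. The saving in the second comes from dropping the per-iteration slice (line 119 in the master profile, 27.6% of the time there).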

deedy5 · 2022-05-26T06:14:46Z

lint error:

ImportError: cannot import name '_unicodefun' from 'click'

Solution: update black to 22.3.0, see psf/black#2964 (comment)
Commit: 62fb1f1

deedy5 added 2 commits May 11, 2022 16:16

slight increase in performance

8fd120b

update black version to 22.3.0

62fb1f1

deedy5 added 2 commits May 27, 2022 18:51

reformat code

918f207

reformat code

9ff0892

dan-blanchard merged commit f1f9d42 into chardet:master May 27, 2022

This was referenced Jun 25, 2022

Release 5.0.0 #254

Merged

Add type annotations to the project and run mypy on CI #261

Merged
