slight increase in performance #252

Merged
dan-blanchard merged 4 commits into chardet:master on May 27, 2022

deedy5 commented May 11, 2022

Pytest

pytest

Master: 367 passed, 5 xfailed, 1 xpassed in 15.24s
Commit: 367 passed, 5 xfailed, 1 xpassed in 14.73s


bench.py

python3 bench.py

Master:

Benchmarking chardet 5.0.0dev0 on CPython 3.10.4 (main, Mar 24 2022, 13:07:27) [GCC 11.2.0]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Calls per second for each encoding:
ascii: 58908.76404494382
big5: 21.534473374712764
cp932: 8.165127375060932
cp949: 12.310296784615652
euc-jp: 7.129459713562277
euc-kr: 20.82451066846787
euc-tw: 145.64819862835316
gb2312: 24.251619432823837
ibm855: 56.16913800549732
ibm866: 69.08196781313022
iso-2022-jp: 5885.916362615773
iso-2022-kr: 45964.975342465754
iso-8859-1: 204.59753369061866
iso-8859-2-croatian: 60.762269278743005
iso-8859-2-czech: 144.05223146286218
iso-8859-2-polish: 84.63527142263315
iso-8859-2-slovak: 140.93952875710696
iso-8859-2-slovene: 73.71334371408184
iso-8859-5: 64.61028679712285
iso-8859-5-russian: 73.5648856709805
iso-8859-7: 113.83742441504452
iso-8859-9: 140.97884798311335
johab: 28.186055700945715
koi8-r: 66.42928707884049
maccyrillic: 61.512972449106535
shift_jis: 7.7978653245510605
tis-620: 23.24070785378144
utf-16: 264624.858044164
utf-32: 310689.18518518517
utf-8: 204.82674033308214
utf-8-sig: 353949.7046413502
windows-1250-croatian: 61.03033980405937
windows-1250-czech: 136.30194835265004
windows-1250-polish: 85.01284018379603
windows-1250-romanian: 100.35252430273928
windows-1250-slovak: 161.34606877291245
windows-1250-slovene: 136.3132442410691
windows-1251: 80.60570973381769
windows-1251-russian: 99.22585387864957
windows-1252: 132.90642009675736
windows-1255: 27.411148392784067

Total time: 155.6653971672058s (24.732535746943018 calls per second)

Commit:

Benchmarking chardet 5.0.0dev0 on CPython 3.10.4 (main, Mar 24 2022, 13:07:27) [GCC 11.2.0]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Calls per second for each encoding:
ascii: 59158.02538787024
big5: 21.87834983185563
cp932: 8.563019000126646
cp949: 12.235340242658744
euc-jp: 7.25511220010562
euc-kr: 20.948329544558874
euc-tw: 145.46685441189732
gb2312: 24.446441737041507
ibm855: 55.74363390076592
ibm866: 69.21712534861845
iso-2022-jp: 5875.198207031797
iso-2022-kr: 43873.47280334728
iso-8859-1: 216.7746470259389
iso-8859-2-croatian: 65.63958747398237
iso-8859-2-czech: 148.38925830206736
iso-8859-2-polish: 89.8694052636428
iso-8859-2-slovak: 146.3461984672097
iso-8859-2-slovene: 74.40073082688095
iso-8859-5: 66.97330179430043
iso-8859-5-russian: 72.73432412039581
iso-8859-7: 120.94023205923739
iso-8859-9: 141.92913532371642
johab: 28.300857106099606
koi8-r: 65.96703271469738
maccyrillic: 61.18715133776373
shift_jis: 8.021083913612316
tis-620: 24.002499649965074
utf-16: 259709.22600619195
utf-32: 314180.07490636705
utf-8: 202.57007372837302
utf-8-sig: 369542.2026431718
windows-1250-croatian: 66.69747398048835
windows-1250-czech: 141.04806518544424
windows-1250-polish: 88.87668353378933
windows-1250-romanian: 107.88235142816724
windows-1250-slovak: 168.99228429259253
windows-1250-slovene: 148.99942450745652
windows-1251: 79.77231163293987
windows-1251-russian: 101.00520819866695
windows-1252: 137.4552950371879
windows-1255: 28.375280134181445

Total time: 152.8467354774475s (25.188630872447167 calls per second)
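
To put the two bench.py totals side by side, the relative change can be computed directly from the figures printed above. This is a minimal sketch; `percent_change` is a hypothetical helper, not part of chardet or bench.py, and the constants are copied from the two "Total time" lines.

```python
def percent_change(before, after):
    """Return the relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Totals copied from the "Total time" lines of the two bench.py runs above.
master_total_s = 155.6653971672058
commit_total_s = 152.8467354774475
master_calls_per_s = 24.732535746943018
commit_calls_per_s = 25.188630872447167

# A negative value for time and a positive value for calls/s both mean "faster".
print(f"total time: {percent_change(master_total_s, commit_total_s):+.2f}%")
print(f"calls/s:    {percent_change(master_calls_per_s, commit_calls_per_s):+.2f}%")
```

Both numbers come out just under 2%, which matches the "slight increase" in the title. Since each total comes from a single run, some of the per-encoding differences may be within run-to-run noise.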

line_profiler

kernprof -lv performance_kernprof.py
performance_kernprof.py
#!/bin/python
"""Time chardet.detect() on every test file (run under kernprof for line profiling)."""
import argparse
from glob import glob
from sys import argv
from time import monotonic

from chardet import detect


def performance_compare(arguments):
    parser = argparse.ArgumentParser()
    parser.add_argument('-s', '--size-increase', action="store", default=1, type=int, dest='size_coeff',
                        help="Apply artificial size increase to challenge the detection mechanism further")

    args = parser.parse_args(arguments)

    chardet_results = []
    # Without recursive=True, "**" behaves like "*": this matches ./tests/<dir>/<file>
    datapaths = sorted(glob("./tests/**/*.*"))
    for i, tbt_path in enumerate(datapaths, start=1):
        with open(tbt_path, "rb") as fp:
            # Repeat the file contents to enlarge the input artificially
            content = fp.read() * args.size_coeff

        # Time a single detect() call on this file
        t0 = monotonic()
        detect(content)
        chardet_results.append(round((monotonic() - t0), 5))
        print(f"{i}/{len(datapaths)}\t{chardet_results[-1]}\t{tbt_path}")


if __name__ == "__main__":
    exit(performance_compare(argv[1:]))

master

Total time: 8.22394 s
File: /home/user/Downloads/chardet-master/chardet/charsetprober.py
Function: remove_xml_tags at line 102

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   102                                               @staticmethod
   103                                               @profile
   104                                               def remove_xml_tags(buf):
   105                                                   """
   106                                                   Returns a copy of ``buf`` that retains only the sequences of English
   107                                                   alphabet and high byte characters that are not between <> characters.
   108                                           
   109                                                   This filter can be applied to all scripts which contain both English
   110                                                   characters and extended ASCII characters, but is currently only used by
   111                                                   ``Latin1Prober``.
   112                                                   """
   113       482        489.0      1.0      0.0          filtered = bytearray()
   114       482        195.0      0.4      0.0          in_tag = False
   115       482        133.0      0.3      0.0          prev = 0
   116                                           
   117   5958119    1669798.0      0.3     20.3          for curr in range(len(buf)):
   118                                                       # Slice here to get bytes instead of an int with Python 3
   119   5957637    2266384.0      0.4     27.6              buf_char = buf[curr : curr + 1]
   120                                                       # Check if we're coming out of or entering an XML tag
   121   5957637    1978452.0      0.3     24.1              if buf_char == b">":
   122    118029      36315.0      0.3      0.4                  prev = curr + 1
   123    118029      33817.0      0.3      0.4                  in_tag = False
   124   5839608    2076011.0      0.4     25.2              elif buf_char == b"<":
   125    117993      38223.0      0.3      0.5                  if curr > prev and not in_tag:
   126                                                               # Keep everything after last non-extended-ASCII,
   127                                                               # non-alphabetic character
   128    100417      50089.0      0.5      0.6                      filtered.extend(buf[prev:curr])
   129                                                               # Output a space to delimit stretch we kept
   130    100417      37360.0      0.4      0.5                      filtered.extend(b" ")
   131    117993      36052.0      0.3      0.4                  in_tag = True
   132                                           
   133                                                   # If we're not in a tag...
   134       482        137.0      0.3      0.0          if not in_tag:
   135                                                       # Keep everything after last non-extended-ASCII, non-alphabetic
   136                                                       # character
   137       476        353.0      0.7      0.0              filtered.extend(buf[prev:])
   138                                           
   139       482        131.0      0.3      0.0          return filtered

commit

Total time: 6.10093 s
File: /home/user/Downloads/chardet-test/chardet/charsetprober.py
Function: remove_xml_tags at line 102

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   102                                               @staticmethod
   103                                               @profile
   104                                               def remove_xml_tags(buf):
   105                                                   """
   106                                                   Returns a copy of ``buf`` that retains only the sequences of English
   107                                                   alphabet and high byte characters that are not between <> characters.
   108                                           
   109                                                   This filter can be applied to all scripts which contain both English
   110                                                   characters and extended ASCII characters, but is currently only used by
   111                                                   ``Latin1Prober``.
   112                                                   """
   113       482        458.0      1.0      0.0          filtered = bytearray()
   114       482        199.0      0.4      0.0          in_tag = False
   115       482        148.0      0.3      0.0          prev = 0
   116       482        938.0      1.9      0.0          buf = memoryview(buf).cast('c')
   117                                                   
   118   5958119    1907358.0      0.3     31.3          for curr, buf_char in enumerate(buf):
   119                                                       # Check if we're coming out of or entering an XML tag
   120   5957637    1921628.0      0.3     31.5              if buf_char == b">":
   121    118029      36919.0      0.3      0.6                  prev = curr + 1
   122    118029      34301.0      0.3      0.6                  in_tag = False
   123   5839608    2032802.0      0.3     33.3              elif buf_char == b"<":
   124    117993      39770.0      0.3      0.7                  if curr > prev and not in_tag:
   125                                                               # Keep everything after last non-extended-ASCII,
   126                                                               # non-alphabetic character
   127    100417      50635.0      0.5      0.8                      filtered.extend(buf[prev:curr])
   128                                                               # Output a space to delimit stretch we kept
   129    100417      38459.0      0.4      0.6                      filtered.extend(b" ")
   130    117993      36687.0      0.3      0.6                  in_tag = True
   131                                           
   132                                                   # If we're not in a tag...
   133       482        157.0      0.3      0.0          if not in_tag:
   134                                                       # Keep everything after last non-extended-ASCII, non-alphabetic
   135                                                       # character
   136       476        318.0      0.7      0.0              filtered.extend(buf[prev:])
   137                                                   
   138       482        150.0      0.3      0.0          return filtered
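
The difference between the two profiled versions is how each byte is obtained: master slices `buf[curr : curr + 1]` on every iteration, while the commit wraps the buffer once in `memoryview(buf).cast('c')` and iterates it, which yields length-1 `bytes` objects directly. The sketch below places both loops side by side outside of chardet so they can be compared on the same input. The function names `remove_xml_tags_slicing` and `remove_xml_tags_memoryview` are hypothetical, chosen here for illustration; the loop bodies follow the two listings above.

```python
def remove_xml_tags_slicing(buf):
    """Master version: slice out each byte to get bytes instead of an int."""
    filtered = bytearray()
    in_tag = False
    prev = 0
    for curr in range(len(buf)):
        buf_char = buf[curr : curr + 1]
        if buf_char == b">":
            prev = curr + 1
            in_tag = False
        elif buf_char == b"<":
            if curr > prev and not in_tag:
                # Keep the text before the tag, followed by a delimiting space
                filtered.extend(buf[prev:curr])
                filtered.extend(b" ")
            in_tag = True
    if not in_tag:
        filtered.extend(buf[prev:])
    return filtered


def remove_xml_tags_memoryview(buf):
    """Commit version: iterate a 'c'-format memoryview, which yields 1-byte bytes."""
    filtered = bytearray()
    in_tag = False
    prev = 0
    view = memoryview(buf).cast("c")
    for curr, buf_char in enumerate(view):
        if buf_char == b">":
            prev = curr + 1
            in_tag = False
        elif buf_char == b"<":
            if curr > prev and not in_tag:
                # Slicing a memoryview does not copy; extend() reads it via the buffer protocol
                filtered.extend(view[prev:curr])
                filtered.extend(b" ")
            in_tag = True
    if not in_tag:
        filtered.extend(view[prev:])
    return filtered


sample = b"<p>caf\xe9 au lait</p> plain <b>bold</b> tail"
assert remove_xml_tags_slicing(sample) == remove_xml_tags_memoryview(sample)
print(remove_xml_tags_memoryview(sample))
```

The memoryview variant removes the per-iteration slice (line 119 in the master listing, 27.6% of the profiled time), which is consistent with the drop in total time from 8.22394 s to 6.10093 s between the two profiles. Those totals include line_profiler overhead, so the saving outside the profiler is likely smaller.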

deedy5 commented May 26, 2022

lint error:

ImportError: cannot import name '_unicodefun' from 'click'

Solution: psf/black#2964 (comment)
Commit: 62fb1f1

@dan-blanchard dan-blanchard merged commit f1f9d42 into chardet:master May 27, 2022