Release 4.0.0 #140
The optimizations in master really could use a release, even though my model retraining work has stalled out lately (because I've been too busy with work and life).
This will have to be a major version change since the model format has changed entirely (even though 99% of users never mess with that).
…models (#121)
* Convert single byte charset modules to use dicts of dicts for language modules
  - Also provide conversion script
* Fix debug logging check
* Keep Hungarian commented out until we retrain
* Add API option to get all the encodings confidence #96
* make code more straightforward by treating the self.done = True as a real finish point of the analysis
* use detect_all instead of detect(.., all=True)
* fix corner case of when there is no good prober
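The difference between returning one best guess and returning every candidate, including the "no good prober" corner case, can be sketched roughly as follows. This is a toy model, not chardet's implementation: the function names, the candidate dictionaries, and the `MINIMUM_THRESHOLD` value are all assumptions made for illustration.

```python
# Toy sketch only; names and the threshold value are hypothetical.
MINIMUM_THRESHOLD = 0.2  # assumed confidence cut-off


def detect_all_sketch(candidates, threshold=MINIMUM_THRESHOLD):
    """Return every candidate above the threshold, most confident first."""
    good = [c for c in candidates if c["confidence"] > threshold]
    if not good:
        # Corner case: no prober is confident enough, so report "unknown"
        # instead of returning an empty list.
        return [{"encoding": None, "confidence": 0.0}]
    return sorted(good, key=lambda c: c["confidence"], reverse=True)


def detect_sketch(candidates, threshold=MINIMUM_THRESHOLD):
    """Return only the single best candidate."""
    return detect_all_sketch(candidates, threshold)[0]
```

Keeping the two as separate functions, rather than a flag on one function, means each always returns the same type: a list for the "all" variant and a single dict for the "best" variant.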
Ugh. I put together a little benchmark script to show the performance improvements that switching to dicts would have made. Unfortunately, it showed exactly the opposite. Turns out the microbenchmarks I was running to justify the change didn't hold up when everything was fully integrated.
Significant Differences (ignoring everything where difference is less than 1 call per second):
I don't quite know what to do with these results. It looks like we're much faster at detecting ASCII, ISO-2022-KR, and UTF-16 now, but much worse at detecting UTF-32 and UTF-8-SIG. The most confusing part is that I didn't actually change any of the ASCII detection code, and that happens before we even use the SBCS probers. The ones I actually modified appear to have been mostly a wash.
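For reference, a lookup microbenchmark of the kind described above might look like the sketch below. It is not the benchmark script from this thread. The flat-tuple layout, the table size, and all names are assumptions chosen only to contrast a flat indexed lookup with a dict-of-dicts lookup.

```python
import timeit

SIZE = 64  # hypothetical number of character-order classes

# Flat layout: one tuple indexed by a * SIZE + b (assumed "before" layout).
flat = tuple((a * b) % 4 for a in range(SIZE) for b in range(SIZE))
# Nested layout: dict of dicts indexed as nested[a][b].
nested = {a: {b: flat[a * SIZE + b] for b in range(SIZE)} for a in range(SIZE)}
# Fixed sequence of (previous, current) class pairs to look up.
pairs = [(i % SIZE, (i * 7) % SIZE) for i in range(10000)]


def lookup_flat():
    total = 0
    for a, b in pairs:
        total += flat[a * SIZE + b]
    return total


def lookup_nested():
    total = 0
    for a, b in pairs:
        total += nested[a][b]
    return total


def benchmark(repeat=5, number=20):
    """Return the best wall-clock time in seconds for each layout."""
    results = {}
    for name, func in (("flat", lookup_flat), ("nested", lookup_nested)):
        results[name] = min(timeit.repeat(func, repeat=repeat, number=number))
    return results


if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print(f"{name}: {seconds:.4f} s")
```

A benchmark like this times only the table lookup, so its result may not carry over to end-to-end detection speed, which is the mismatch reported above.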
Also, if you want to see what a true speed up actually is, try running it with
The wheel package format supports including the license file. This is done using the [metadata] section in the setup.cfg file. For additional information on this feature, see: https://wheel.readthedocs.io/en/stable/index.html#including-the-license-in-the-generated-wheel-file
Helps pip decide what version of the library to install.
https://packaging.python.org/tutorials/distributing-packages/#python-requires
> If your project only runs on certain Python versions, setting the
> python_requires argument to the appropriate PEP 440 version specifier
> string will prevent pip from installing the project on other Python
> versions.
https://setuptools.readthedocs.io/en/latest/setuptools.html#new-and-changed-setup-keywords
> python_requires
>
> A string corresponding to a version specifier (as defined in PEP 440)
> for the Python version, used to specify the Requires-Python defined in
> PEP 345.
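As a rough illustration, the keyword is passed to `setup()` alongside the other package metadata. The fragment below is hypothetical: the package name, version, and the exact specifier string are placeholders, not the values used in this project.

```python
# Hypothetical setup.py fragment; the name, version, and bounds are illustrative.
SETUP_KWARGS = {
    "name": "example-package",
    "version": "0.1.0",
    # PEP 440 version specifier, published as Requires-Python metadata.
    "python_requires": ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*",
}

if __name__ == "__main__":
    # A real setup.py would call setuptools.setup(**SETUP_KWARGS) here.
    print(SETUP_KWARGS["python_requires"])
```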
On the first pass, remove all ASCII characters?
4.71 s ± 63.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
39 ms ± 842 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
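The idea of dropping ASCII bytes before detection could be sketched like this. It is a minimal illustration of the question above, not something chardet is known to do, and `strip_ascii` is a hypothetical helper name.

```python
# Bytes 0x00-0x7F; passing them as the `delete` argument removes them.
ASCII_BYTES = bytes(range(128))


def strip_ascii(data):
    """Return only the bytes of `data` that have the high bit set."""
    return data.translate(None, ASCII_BYTES)


if __name__ == "__main__":
    sample = "naïve café".encode("utf-8")
    print(strip_ascii(sample))  # b'\xc3\xaf\xc3\xa9'
```

One caveat with this approach: escape-based encodings such as ISO-2022-KR consist entirely of 7-bit bytes, and some multi-byte encodings (Shift_JIS, for example) use trail bytes in the ASCII range, so a stripped buffer would probably not be enough to detect those.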
For details on the new PyPI, see the blog post: https://pythoninsider.blogspot.ca/2018/04/new-pypi-launched-legacy-pypi-shutting.html
Python 3.4 went EOL on 2019-03-18. For additional details, see: https://devguide.python.org/devcycle/#end-of-life-branches
Starting with wheel 0.32.0 (2018-09-29), the "license_file" option is deprecated. https://wheel.readthedocs.io/en/stable/news.html The wheel will continue to include LICENSE, which is now included automatically: https://wheel.readthedocs.io/en/stable/user_guide.html#including-license-files-in-the-generated-wheel-file
When packaging chardet and pip (which bundles it) in Fedora, we realized that there is a non-executable file with a shebang line. The primary purpose of this file seems to be to be imported from Python code or executed via python chardet/cli/chardetect.py or python -m chardet.cli.chardetect, so the shebang appears to be unnecessary. Shebangs are hard to handle in downstream packaging: it makes sense for upstream to use #!/usr/bin/env python, while in the RPM package we need to avoid that and use a more specific interpreter. Since the shebang is unused, I propose removing it to avoid these problems.
Since setuptools v41.5.0 (27 Oct 2019), the 'test' command is formally deprecated and should not be used. The pytest-runner package also lists itself as deprecated:
https://github.com/pytest-dev/pytest-runner
> Deprecation Notice
>
> pytest-runner depends on deprecated features of setuptools and relies
> on features that break security mechanisms in pip. For example
> 'setup_requires' and 'tests_require' bypass pip --require-hashes. See
> also pypa/setuptools#1684.
Has been unnecessary since Python 2.7.
The CLI entry point is installed by setuptools through the console_scripts option. This setuptools feature automatically generates a file with a shebang and sets the executable bit, so the imported file chardet.cli.chardetect does not need the executable bit itself.
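For readers unfamiliar with the option, a console_scripts declaration generally has the shape sketched below. The package name is a placeholder, and the `main` callable is an assumption; only the module path comes from the message above.

```python
# Hypothetical setup.py fragment showing the console_scripts syntax.
ENTRY_POINT_KWARGS = {
    "name": "example-package",
    "entry_points": {
        "console_scripts": [
            # Format: "<command> = <module>:<callable>".
            # `main` is an assumed function name inside the module.
            "chardetect = chardet.cli.chardetect:main",
        ]
    },
}

if __name__ == "__main__":
    # A real setup.py would call setuptools.setup(**ENTRY_POINT_KWARGS) here.
    print(ENTRY_POINT_KWARGS["entry_points"]["console_scripts"][0])
```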
Throughout the rest of the chardet code we assume that FOUND_IT means we can stop looking. Previously the CharsetGroupProber did not set its state appropriately when a child prober returned FOUND_IT. This substantially speeds up chardet for most encodings. Fixes #202
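The behaviour described in this fix can be illustrated with a small stand-alone sketch. It is not the actual CharsetGroupProber code: the `ProbingState` enum here is a local stand-in, and `FixedProber` and `GroupProberSketch` are hypothetical classes written only to show the state propagation.

```python
from enum import Enum


class ProbingState(Enum):
    # Local stand-in for the prober states mentioned above.
    DETECTING = 0
    FOUND_IT = 1
    NOT_ME = 2


class FixedProber:
    """Hypothetical child prober that always reports one preset state."""

    def __init__(self, name, state):
        self.name = name
        self._state = state
        self.calls = 0

    def feed(self, data):
        self.calls += 1
        return self._state


class GroupProberSketch:
    """Feeds data to child probers and stops once one is certain."""

    def __init__(self, probers):
        self.probers = list(probers)
        self.state = ProbingState.DETECTING
        self.best = None

    def feed(self, data):
        # Once a child has reported FOUND_IT, do no further work.
        if self.state is not ProbingState.DETECTING:
            return self.state
        for prober in self.probers:
            if prober.feed(data) is ProbingState.FOUND_IT:
                self.best = prober
                # Record the result on the group itself so callers can stop.
                self.state = ProbingState.FOUND_IT
                break
        return self.state
```

In this sketch the speed-up comes from two places: the loop exits as soon as one child is certain, and later calls to `feed` return immediately because the group remembers its state.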