Release 4.0.0 #140
Conversation
…models (#121)
* Convert single byte charset modules to use dicts of dicts for language modules
  - Also provide conversion script
* Fix debug logging check
* Keep Hungarian commented out until we retrain
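Purely for illustration (the real layout is whatever the conversion script above generates; the names and numbers below are hypothetical), a "dict of dicts" single-byte language model can be pictured as a character-bigram frequency table:

# Hypothetical sketch of a "dict of dicts" language model layout; the actual
# chardet modules are produced by the conversion script mentioned above.
HYPOTHETICAL_RUSSIAN_MODEL = {
    "а": {"б": 412, "в": 1287, "л": 960},  # counts of characters following "а"
    "б": {"а": 530, "о": 701},
    # ... one inner dict per character seen in the training corpus
}

def bigram_score(text, model):
    """Sum the frequencies of adjacent character pairs found in the model."""
    return sum(
        model.get(first, {}).get(second, 0)
        for first, second in zip(text, text[1:])
    )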
* Add API option to get all the encodings confidence #96
* Make the code more straightforward by treating self.done = True as a real finish point of the analysis
* Use detect_all instead of detect(.., all=True)
* Fix the corner case when there is no good prober
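A quick illustration of the detect_all API added in #96 (the sample bytes are made up for demonstration):

import chardet

data = "Привет, мир".encode("windows-1251")

# detect() still returns only the single best guess.
print(chardet.detect(data))

# detect_all() returns every prober's guess, ordered by confidence,
# as a list of dicts with "encoding", "confidence", and "language" keys.
for result in chardet.detect_all(data):
    print(result)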
I only had time to review this visually, but it all looks fine to me.
Ugh. I put together a little benchmark script to show the performance improvements that switching to dicts would have made. Unfortunately, it showed exactly the opposite. Turns out the microbenchmarks I was running to justify the change didn't hold up when everything was fully integrated. Before:
After:
Significant Differences (ignoring everything where difference is less than 1 call per second):
I don't quite know what to do with these results. It looks like we're much faster at detecting ASCII, ISO-2022-KR, and UTF-16 now, but much worse at detecting UTF-32 and UTF-8-SIG. The most confusing part is that I didn't actually change any of the ASCII detection code, and that happens before we even use the SBCS probers. The ones I actually modified appear to have been mostly a wash. Also, if you want to see what a true speed up actually is, try running it with
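For context, a per-encoding throughput benchmark of the kind described here can be sketched as follows (the tests/<encoding>/ directory layout and the calls-per-second metric are assumptions, not the actual script from this comment):

import glob
import os
import time

import chardet

# Assumed layout: one directory of sample files per encoding, e.g. tests/utf-8/.
for directory in sorted(glob.glob("tests/*/")):
    samples = [
        open(path, "rb").read()
        for path in glob.glob(directory + "*")
        if os.path.isfile(path)
    ]
    if not samples:
        continue
    start = time.time()
    for sample in samples:
        chardet.detect(sample)
    elapsed = max(time.time() - start, 1e-9)
    print("%s: %.1f calls per second" % (directory, len(samples) / elapsed))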
The wheel package format supports including the license file. This is done using the [metadata] section in the setup.cfg file. For additional information on this feature, see: https://wheel.readthedocs.io/en/stable/index.html#including-the-license-in-the-generated-wheel-file
Include license file in the generated wheel package
Python 2.6 and 3.3 are EOL and are no longer receiving bug fixes, including security fixes. https://snarky.ca/stop-using-python-2-6/ http://www.curiousefficiency.org/posts/2015/04/stop-supporting-python26.html Fixes #133
Last use removed in 1f1cf1c.
No longer considered "beta".
Helps pip decide what version of the library to install. https://packaging.python.org/tutorials/distributing-packages/#python-requires

> If your project only runs on certain Python versions, setting the python_requires argument to the appropriate PEP 440 version specifier string will prevent pip from installing the project on other Python versions.

https://setuptools.readthedocs.io/en/latest/setuptools.html#new-and-changed-setup-keywords

> python_requires
>
> A string corresponding to a version specifier (as defined in PEP 440) for the Python version, used to specify the Requires-Python defined in PEP 345.
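For illustration, the keyword goes into the setup() call in setup.py; the exact specifier chardet settled on isn't quoted here, so the value below is an assumption:

from setuptools import setup

setup(
    name="chardet",
    # Assumed specifier for illustration: require Python 2.7 or 3.5+.
    python_requires=">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*, !=3.4.*",
    # ... other arguments unchanged
)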
On the first pass, could we remove all ASCII characters?

def guess_encoding():
    ...

%t guess_encoding()
4.71 s ± 63.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

def guess_encoding2():
    ...

%t guess_encoding2()
39 ms ± 842 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
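A minimal sketch of the approach being asked about (stripping ASCII bytes before calling the detector); the function bodies above were not captured here, so this is an assumption about the idea rather than the commenter's actual code:

import chardet

def guess_encoding_stripped(data: bytes) -> dict:
    # Keep only non-ASCII bytes; for mostly-ASCII input this shrinks the data
    # dramatically, at the cost of discarding context some probers rely on
    # (e.g. the escape sequences used by ISO-2022 encodings).
    non_ascii = bytes(b for b in data if b >= 0x80)
    return chardet.detect(non_ascii or data)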
For details on the new PyPI, see the blog post: https://pythoninsider.blogspot.ca/2018/04/new-pypi-launched-legacy-pypi-shutting.html
Bulgairan -> Bulgarian
Document that PyPy is also supported.
Python 3.4 went EOL on 2019-03-18. For additional details, see: https://devguide.python.org/devcycle/#end-of-life-branches
Starting with wheel 0.32.0 (2018-09-29), the "license_file" option is deprecated. https://wheel.readthedocs.io/en/stable/news.html The wheel will continue to include LICENSE; it is now included automatically: https://wheel.readthedocs.io/en/stable/user_guide.html#including-license-files-in-the-generated-wheel-file
When packaging chardet and pip (which bundles it) in Fedora, we have realized that there is a nonexecutable file with a shebang line. It seems that the primary purpose of this file is to be imported from Python code or to be executed via python chardet/cli/chardetect.py or python -m chardet.cli.chardetect, and hence the shebang appears to be unnecessary. Shebangs are hard to handle when doing downstream packaging: it makes sense for upstream to use #!/usr/bin/env python, while in the RPM package we need to avoid that and use a more specific interpreter. Since the shebang was unused, I propose removing it to avoid these problems.
Since setuptools v41.5.0 (27 Oct 2019), the 'test' command is formally deprecated and should not be used. The pytest-runner package also lists itself as deprecated: https://github.com/pytest-dev/pytest-runner

> Deprecation Notice
>
> pytest-runner depends on deprecated features of setuptools and relies on features that break security mechanisms in pip. For example 'setup_requires' and 'tests_require' bypass pip --require-hashes. See also pypa/setuptools#1684.
Has been unnecessary since Python 2.7.
The CLI entry point is installed by setuptools through the console_scripts option. This setuptools feature automatically constructs a wrapper file with a shebang and sets the executable bit. The imported module chardet.cli.chardetect does not need that bit itself.
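For reference, the console_scripts mechanism looks roughly like this in setup.py (a sketch; the exact entry in chardet's setup.py may differ):

from setuptools import setup

setup(
    name="chardet",
    entry_points={
        "console_scripts": [
            # setuptools generates an executable "chardetect" wrapper script
            # with its own shebang, so the imported module needs neither.
            "chardetect = chardet.cli.chardetect:main",
        ],
    },
)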
Throughout the rest of the chardet code we assume that FOUND_IT means we can stop looking. Previously the CharsetGroupProber did not set its state appropriately when a child prober returned FOUND_IT. Fixing that substantially speeds up chardet for most encodings. Fixes #202
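A self-contained toy sketch of the state handling described (not the actual chardet diff; the class and names below are simplified stand-ins):

from enum import Enum

class ProbingState(Enum):
    DETECTING = 0
    FOUND_IT = 1
    NOT_ME = 2

class GroupProberSketch:
    """Toy group prober showing the FOUND_IT propagation described above."""

    def __init__(self, probers):
        self.probers = probers
        self.state = ProbingState.DETECTING
        self.best_guess_prober = None

    def feed(self, byte_str):
        for prober in self.probers:
            child_state = prober.feed(byte_str)
            if child_state == ProbingState.FOUND_IT:
                self.best_guess_prober = prober
                # The fix: mark the group itself as FOUND_IT so callers that
                # check the group's state stop feeding further data.
                self.state = ProbingState.FOUND_IT
                return self.state
        return self.state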
Now that we've got an actual performance improvement in #203, I'm going to push this absurdly long-lived release out. Retrained models will come in the next release.
9a754c9 to 53854fb
The optimizations in master really could use a release, even though my model retraining work has stalled out lately (because I've been too busy with work and life).
This will have to be a major version change since the model format has changed entirely (even though 99% of users never mess with that).