Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add upstream changes and clean up where possible #42

Merged
merged 38 commits into from Jan 9, 2015

Conversation

dan-blanchard
Copy link
Member

This is very much a work in progress at the moment, and I'm just creating the PR to make it easier for me to keep track of Travis results.

I have a few goals for this branch:

  1. Pull in changes from Mozilla's upstream code. There aren't as many as I had initially expected but there are some.
  2. Improve PEP8 compliance all over the place. The previous maintainers tried to keep variable names identical to the C code, presumably to ease the comparison with the Mozilla code, but we're going to be diverging from upstream after pulling in the changes mentioned in 1. Basically, Mozilla seems very likely to abandon their character encoding detector in the near future and switch to using ICU, but ICU doesn't support all of the codecs we currently do, because it is more web-focused. If our goal here is to be a truly universal character encoding detector, we'll need to go our own way in the future in that respect.
  3. Make the unit tests pass, or at the very least make it obvious that the tests are actually failing (instead of ignoring the failures like our current Travis build does).

So far, I've done a little bit of point 1 and updated the Travis testing setup to use nose and report test coverage via Coveralls.

Before we were cleaning up the known encoding string for each file of
that encoding.  This was completely unnecessary.
Much like "filterWithoutEnglishLetters", it doesn't actually strip out
non-English characters. Instead, it's a filter that is used by
Latin1Prober to inaccurately remove HTML tags.  In reality, it removes
a lot of other stuff too, but I'm not sure if that was intentional yet.

Also cleaned up some of the pylint warnings for charsetprober.py.
-  Made all classes use super() instead of explicitly listing calling
   parent __init__
-  Moved latin1 test file that's actually ascii into the correct dir
-  Removed some commented out lines of code that have been there forever
-  Made sure all attributes are defined in __init__ instead of at will
-  Other pylint warning fixes
…uff.

Clean up stuff includes:

-  Stop importing chardet from setup.py now that we have a dependency
   (enum34 on everything other than Python 3.4).
-  Add a lot of notes to NOTES.rst about how chardet actually works.
-  Removed sections of frequency rank tables that we do not actually use.
   It was just wasting memory.
-  Removed "m" prefix from attributes all over.  Will fix snake case and
   things like that in a later commit.
-  Added a lot of comments to UniversalDetector.
-  Added the ability to ignore certain encodings when running unit tests.
   This was necessary because we don't actually support some of the
   encodings we were being tested on!
-  Removed constants.py now that we have enums.py
-  Switched to using logging instead of printing to sys.stderr.  This
   actually should help a lot for debugging failed unit tests.
-  Made a CLI sub-package for chardetect.  In the future, we'll
   reorganize more things into sub-packages.
I'm sure there are plenty of line length and alignment issues after all
the find-and-replace stuff I've done.  Will fix those next.
@dan-blanchard
Copy link
Member Author

@sigmavirus24 I know this is a total bear of a PR at this point, but most of the changes are purely cosmetic. The main substantive changes are:

  1. PEP8 compliance as much as possible (especially with variable names).
  2. Implemented filter_english_letters (filter_with_english_letters #46)
  3. Switched to using enums (via enum34 and Python 3.4 standard library) instead of constants in a few places.
  4. Added a bunch of comments to UniversalDetector (I'll try to add more to others later).
  5. Added the same language filtering support present in the upstream Mozilla code (which appears to be the only substantive change they made other than commenting out the Hungarian detectors).

I'd like to propose a more substantial refactoring of things to make it simpler to retrain some of the models, which were never updated after they were established ~15 years ago, but I'll file a separate issue about that.

If you get a chance to look this over, I'd appreciate your thoughts.

@dan-blanchard
Copy link
Member Author

Further, if we're going to care about style, we should add it as a Travis CI job to at least catch regressions. Or we could add landscape.io. If we want to do the former, I'll send a PR to this branch with a set-up I use on pretty much every other project that makes it significantly easier.

Wow, I hadn't seen landscape.io before. That seems pretty nice. I'm fine with using that if you are.

@dan-blanchard
Copy link
Member Author

So I have a few thoughts that I've left as comments. I noticed you selectively made one attribute of the probers public. Given the defined API for the library, why not make everything public? Or leave everything private as it was? I'm not sure I understand the point of it. Seems like unnecessary churn.

That was a mishap of all the massive find-and-replacing I was doing. That said, I'd honestly prefer to make everything public that doesn't currently need/use a property.

@dan-blanchard dan-blanchard added this to the 3.0 milestone Jan 6, 2015
@dan-blanchard
Copy link
Member Author

@sigmavirus24 I believe I've addressed all of your comments at this point.

With the Hungarian probers commented out, and the missing encodings list
updated to include ISO-8859-6 (since we don't detect that), only one
unit test failure remains, and it's a mix up for a document where the
difference between Windows-1253 and ISO-8859-7 would lead to a single
character being improperly decoded.
@dan-blanchard
Copy link
Member Author

I've gone ahead and disabled our Hungarian probers (like the Mozilla authors did). This fixes all but one of our unit test failures. The remaining failure is for a Greek XML file where we misidentify ISO-8859-7 as Windows-1253, which leads to a one character difference in the decoded output. Without retraining our models, I'm not sure how we could actually fix that one.

The difference is:

-  δραστήριου τετραπληγικού. Η εργασία είναι στην Αθήνα, στον Άγιο Νικόλαο (Αχαρνών). Τα καθήκοντά αφορούν στην υποστήριξη των υλικών καθημερινών αναγκών, την 
?                                                             ^

+  δραστήριου τετραπληγικού. Η εργασία είναι στην Αθήνα, στον ¶γιο Νικόλαο (Αχαρνών). Τα καθήκοντά αφορούν στην υποστήριξη των υλικών καθημερινών αναγκών, την 
?                                                             ^

@dan-blanchard
Copy link
Member Author

I'm going to merge this because I believe I've addressed all of the objections @sigmavirus24 and we have much bigger future direction things to think about at the moment.

dan-blanchard added a commit that referenced this pull request Jan 9, 2015
…haul

Add upstream changes and clean up where possible
@dan-blanchard dan-blanchard merged commit c8c752a into master Jan 9, 2015
@dan-blanchard dan-blanchard deleted the feature/upstream-changes-and-overhaul branch January 9, 2015 19:01
@sigmavirus24
Copy link
Member

So I just realized, @dan-blanchard that you added a reliance on enum34 which is going to pose problems for requests. Any chance we can revert that work since it isn't necessary or a public API change?

@dan-blanchard
Copy link
Member Author

@sigmavirus24 why is the enum34 reliance an issue for requests? Since requests is probably the main way that chardet gets used, we can change things for it if necessary, but IMHO the code is a lot cleaner using enums.

@sigmavirus24
Copy link
Member

Because Kenneth insists on vendoring requests' dependencies. Which means dependencies with dependencies need to either vendor them or they're axed. Maybe once we can convince Kenneth to stop vendoring it won't be a problem, but this will prevent requests users from getting any of the bug fixes introduced in this mega pull request.

Disclaimer: I think Enums are superior too.

@dan-blanchard
Copy link
Member Author

@sigmavirus24 I just made our enums inherit from object instead of IntEnum because we weren't actually using any of the fancier features. That should eliminate the problem. 02edb35

pypingou pushed a commit to Pagure/pagure that referenced this pull request Feb 13, 2023
Attribute '_mCharSetProbers' removed from 'UniversalDetector' in 'cchardet' v3.0.0, replaced by '_charset_probers'.
Version check adjusted to cover also version 5 in addition to 3 and 4.

References:
chardet/chardet@be09612
chardet/chardet#42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants