
[WIP] Retrain SBCS Models and some refactoring #99

wants to merge 15 commits into
base: master

@dan-blanchard dan-blanchard commented Apr 10, 2017

This branch is not ready to go yet because I still have several test failures, but I opened this WIP PR so that people can see what I'm working on (instead of me reiterating it in comments on issues).

The main changes are:

  • Cleans up abandoned PR #52
  • Adds SBCS language model training script that can train from text files or wikipedia data
  • Adds support for several languages we were missing (will enumerate them all when the WIP tag is removed from this PR)
  • Tests all languages and encodings that we have data for, since we now have models for them.
  • Retrains all SBCS models, and even adds support for an English language model that we might be able to use to get rid of the latin-1 specific prober (more testing is needed here).
  • Fixes a bug in the XML tag filter where parts of the XML tags themselves would be retained.
  • Adds language to UniversalDetector output.
  • Eliminates wrap_ord usage, which provides a nice speedup.
  • All SBCS models are now stored as dicts of dicts, because that is much faster than storing them as giant lists. The model files are much longer (and a bit harder to read), but no one really needs to read through them manually except when retraining them anyway.
  • Adds a languages metadata module that contains the information necessary for training all of the SBCS models (language name, supported encodings, alphabet, does it use ASCII, etc.).
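The dict-of-dicts model layout mentioned above can be sketched roughly like this (a minimal illustration, not chardet's actual internals; all names and values here are made up):

```python
# A minimal sketch of the dict-of-dicts model layout: only byte pairs
# seen in training are stored, and lookups for unseen pairs fall back
# to a default instead of indexing into a giant flat 256x256 list.
POSITIVE, LIKELY, UNLIKELY, NEGATIVE = 3, 2, 1, 0  # illustrative categories

# byte value -> frequency order (rank) for this hypothetical language
char_to_order = {0x61: 1, 0x62: 21, 0x63: 15}

# order of previous char -> {order of current char -> likelihood}
lang_model = {
    1: {1: POSITIVE, 21: LIKELY},
    21: {1: UNLIKELY},
}

def pair_likelihood(prev_order, cur_order):
    # A missing entry means the pair never occurred in the training data.
    return lang_model.get(prev_order, {}).get(cur_order, NEGATIVE)

print(pair_likelihood(1, 21))   # pair present in the model -> LIKELY (2)
print(pair_likelihood(15, 1))   # unseen pair -> NEGATIVE fallback (0)
```

Because `dict.get` only touches keys that exist, sparse models pay nothing for the byte pairs they never saw, which is where the speedup over dense list indexing comes from.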

I am well aware that this monstrosity is very hard to review given its size, so I may try to pull some parts out of it into separate PRs where possible. For example, the change that recapitalizes all the enum attributes (since they're class attributes and we're not using the Python-3-style enums because of the extra dependency that would require us to add) could certainly be pulled out.

- unlikely = occurred at least 3 times in training data
- negative = did not occur at least 3 times in training data

We should probably allow tweaking these thresholds when training models, as 64
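A rough sketch of how a tweakable count threshold might look when building a model (function and category names are assumptions, not the actual training script):

```python
# Hypothetical sketch of mapping raw pair counts to likelihood
# categories with a tweakable threshold, as the note above suggests.
UNLIKELY, NEGATIVE = 1, 0  # illustrative category codes

def categorize_pair(count, min_count=3):
    # Per the notes above: a pair seen at least `min_count` times is
    # "unlikely"; one seen fewer times is "negative". Exposing
    # `min_count` is the tweak being proposed.
    return UNLIKELY if count >= min_count else NEGATIVE

# Illustrative raw bigram counts -> category table
counts = {(1, 2): 5, (2, 3): 1}
model = {pair: categorize_pair(n) for pair, n in counts.items()}
```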

@dan-blanchard dan-blanchard Apr 10, 2017


These notes are slightly out of date, as we now use the length of the alphabet for the language instead of 64 here.
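A hedged illustration of the change described in this comment, assuming the value in question caps how many characters get their own frequency order (the function name and fallback behavior are made up):

```python
def sampled_alphabet_size(alphabet, fallback=64):
    # Hypothetical: use the language's own alphabet length where 64 was
    # previously hard-coded, falling back when no alphabet is known.
    return len(alphabet) if alphabet else fallback

print(sampled_alphabet_size("abcdefgh"))  # alphabet known -> its length
print(sampled_alphabet_size(""))          # no alphabet -> old constant
```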

