[WIP] Retrain SBCS Models and some refactoring #99

Open: wants to merge 15 commits into base: main


@dan-blanchard (Member) commented Apr 10, 2017

This branch is not ready to go yet because I still have several test failures, but I opened this WIP PR so that people can see what I'm working on (instead of me reiterating it in comments on issues).

The main changes are:

  • Cleans up abandoned PR #52 ("New language models added; old inacurate models was rebuilded. Hungarian test files changed. Script for language model building added").
  • Adds SBCS language model training script that can train from text files or wikipedia data
  • Adds support for several languages we were missing (will enumerate them all when the WIP tag is removed from this)
  • Makes test.py test all languages and encodings that we have data for, since we now have models for them.
  • Retrains all SBCS models, and even adds support for an English language model that we might be able to use to get rid of the latin-1 specific prober (more testing is needed here).
  • Fixes a bug in the XML tag filter where parts of the XML tags themselves would be retained.
  • Adds language to UniversalDetector output.
  • Eliminates wrap_ord usage, which provides a nice speedup.
  • All SBCS models are now stored as dicts of dicts, because that is way faster than storing them as giant lists. The model files are much longer (and a bit harder to read), but no one really needs to look through them manually except when you're retraining them anyway.
  • Adds a languages metadata module that contains the information necessary for training all of the SBCS models (language name, supported encodings, alphabet, does it use ASCII, etc.).
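
For reviewers who have not looked at the model files yet, here is a minimal sketch of the storage change mentioned in the list: the same toy bigram table as a flat list and as a dict of dicts. Every name and value here (`NUM_ORDERS`, `TOY_MODEL_LIST`, `TOY_MODEL_DICT`, the category numbers) is hypothetical and not copied from this branch; it only shows the shape of the lookup.

```python
# Hypothetical toy model: 3 character "orders" (frequency ranks) and
# made-up likelihood categories 0..3. Not the real chardet data.
NUM_ORDERS = 3

# Flat-list layout: the entry for (first, second) lives at
# index first * NUM_ORDERS + second.
TOY_MODEL_LIST = [
    3, 2, 0,
    1, 3, 0,
    0, 0, 2,
]

# Dict-of-dicts layout: the outer key is the first character's order,
# the inner key is the second character's order.
TOY_MODEL_DICT = {
    first: {
        second: TOY_MODEL_LIST[first * NUM_ORDERS + second]
        for second in range(NUM_ORDERS)
    }
    for first in range(NUM_ORDERS)
}


def likelihood_from_list(first, second):
    """Look up a bigram category by computing a flat index."""
    return TOY_MODEL_LIST[first * NUM_ORDERS + second]


def likelihood_from_dict(first, second):
    """Look up a bigram category with two nested dict lookups."""
    return TOY_MODEL_DICT[first][second]
```

Both functions return the same category for any pair of orders; the nested form just avoids the index arithmetic at lookup time, at the cost of a much longer model file.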

I am well aware that this monstrosity is very hard to review given its size, so I may try to pull some parts out of it into separate PRs where possible. For example, the change that recapitalizes all the enum attributes could certainly be pulled out (they're class attributes, and we're not using the Python-3-style enums because of the extra dependency that would require us to add).

- unlikely = occurred at least 3 times in training data
- negative = did not occur at least 3 times in training data

We should probably allow tweaking these thresholds when training models, as 64
Member Author

These notes are slightly out of date, as we now use the length of the alphabet for the language instead of 64 here.
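
To make the idea of tweakable thresholds concrete, here is a rough sketch of the two categories quoted in the notes above, with the count threshold exposed as a parameter. The function name `categorize_bigrams` and the `min_count` argument are assumptions for illustration, not the training script's actual API, and the higher categories (which the truncated note does not spell out) are left out entirely.

```python
from collections import Counter


def categorize_bigrams(text, min_count=3):
    """Label each observed character bigram by its raw count.

    Hypothetical helper: only the two categories quoted in the notes
    are modelled. A bigram seen at least `min_count` times is labelled
    "unlikely"; anything seen less often is labelled "negative".
    """
    counts = Counter(zip(text, text[1:]))
    return {
        "".join(pair): ("unlikely" if n >= min_count else "negative")
        for pair, n in counts.items()
    }


# Toy usage: "ab" occurs 3 times, "ba" twice, "bx" once.
labels = categorize_bigrams("abababx")
stricter = categorize_bigrams("abababx", min_count=4)
```

Passing a different `min_count` is all it takes to move a bigram between the two categories, which is the kind of knob the note is asking for.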

@dan-blanchard
Member Author

State update on this PR: I got a bit discouraged by my initial work on this PR not panning out, because it turns out that the retrained models cause nearly all the unit tests to fail (meaning we fail to detect most encodings). In picking this up again about a month ago, I figured out that there were some bugs in the training code that were including some bad characters in the training data. I retrained them again with that fixed and... all the tests still failed. Obviously, there's something I'm missing here, but I haven't been able to figure it out quite yet.

So, if anyone wants to help with this PR, looking into the test failures and proposing hypotheses for what's wrong with the new models would go a long way. The fact that I only speak English has also hindered some of my progress here, as it's hard for me to look at a language model for a foreign language and immediately recognize problems. For example, if the English model said that "qm" was a highly likely character bigram, I would know that was wrong, but I don't have that same innate knowledge of phonotactic patterns for other languages.
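
If you speak one of the affected languages and want to help, one way to sanity-check a model is to dump the bigrams it rates most likely and see whether any look wrong, the same way "qm" would for English. The sketch below assumes the dict-of-dicts layout from the PR description plus a hypothetical `order_to_char` mapping and category value; none of these names come from the branch itself.

```python
def most_likely_bigrams(model, order_to_char, top_category, limit=20):
    """Return up to `limit` bigrams that the model marks as `top_category`.

    `model` is assumed to be a dict of dicts keyed by character order,
    and `order_to_char` a hypothetical mapping from order to character.
    """
    hits = []
    for first, row in model.items():
        for second, category in row.items():
            if category == top_category:
                hits.append(order_to_char[first] + order_to_char[second])
    return sorted(hits)[:limit]


# Toy data: a deliberately suspicious model that rates "qm" and "mq"
# as top-category bigrams.
toy_model = {0: {0: 0, 1: 3}, 1: {0: 3, 1: 0}}
toy_chars = {0: "q", 1: "m"}
suspects = most_likely_bigrams(toy_model, toy_chars, top_category=3)
```

A native speaker skimming that list can flag implausible pairs much faster than by reading the raw model file.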
