[WIP] Retrain SBCS Models and some refactoring #99
base: main
Conversation
- unlikely = occurred at least 3 times in training data
- negative = did not occur at least 3 times in training data

We should probably allow tweaking these thresholds when training models, as 64 …
These notes are slightly out of date, as we now use the length of the alphabet for the language instead of 64 here.
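To make the thresholds concrete, here is a minimal sketch of the kind of count bucketing these notes describe. The function and the two upper cutoffs are hypothetical illustrations of what "allow tweaking these thresholds when training models" could mean, not the actual training code; only the "at least 3 times" rule is quoted from the notes above.

```python
# Hypothetical sketch of count-based bucketing; category names come from the
# notes above, and the two upper cutoffs are placeholder values that a
# training script could expose as tweakable parameters.
NEGATIVE, UNLIKELY, LIKELY, POSITIVE = 0, 1, 2, 3

def categorize_count(count, likely_cutoff=100, positive_cutoff=1000):
    """Map a raw sequence count from the training data to a likelihood category."""
    if count < 3:                 # did not occur at least 3 times
        return NEGATIVE
    if count < likely_cutoff:     # occurred, but rarely
        return UNLIKELY
    if count < positive_cutoff:
        return LIKELY
    return POSITIVE
```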
State update on this PR: I got a bit discouraged by my initial work on this PR not panning out, because it turns out that the retrained models cause nearly all the unit tests to fail (meaning we fail to detect most encodings). In picking this up again about a month ago, I figured out that there were some bugs in the training code that were including some bad characters in the training data. I retrained the models again with that fixed and... all the tests still failed. Obviously, there's something I'm missing here, but I haven't been able to figure it out quite yet. So, if anyone wants to help with this PR, looking into the test failures and proposing hypotheses for what's wrong with the new models would go a long way.

The fact that I only speak English has also hindered some of my progress here, as it's hard for me to look at a language model for a foreign language and immediately recognize problems. For example, if the English model said that "qm" was a highly likely character bigram, I would know that that was wrong, but I don't have that same innate knowledge of phonotactic patterns for other languages.
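For anyone picking this up, one language-agnostic sanity check is to compare a retrained model's top-rated bigrams against bigram counts taken directly from the training text; if the two lists barely overlap, the training pipeline is suspect rather than the detector. The helper below is a hypothetical sketch of that idea, not part of this PR, and `model_positive_bigrams` is a made-up stand-in for whatever the retrained model rates highest.

```python
from collections import Counter

def top_bigrams_from_text(text, n=20):
    """Count character bigrams straight from decoded training text.

    This gives a ground truth to eyeball against a retrained model's
    highest-rated sequences, even without knowing the language.
    """
    counts = Counter(zip(text, text[1:]))
    return [a + b for (a, b), _ in counts.most_common(n)]

# Example usage (model_positive_bigrams is hypothetical):
# overlap = set(top_bigrams_from_text(sample_text)) & set(model_positive_bigrams)
# print(f"{len(overlap)} of the top 20 text bigrams are rated positive by the model")
```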
…ge_model This is in case we encounter some really crazy article with millions of links, but it's also nice for debugging.
…pedia text so we do not have to download it a bunch of times
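A minimal sketch of the kind of caching that commit describes, assuming a simple on-disk cache keyed by URL; the cache directory and helper name are hypothetical, not the actual training script:

```python
import hashlib
import os
import urllib.request

CACHE_DIR = "wiki_cache"  # hypothetical cache location

def fetch_cached(url):
    """Download `url` once and reuse the local copy on later runs."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, hashlib.sha1(url.encode("utf-8")).hexdigest())
    if not os.path.exists(path):
        with urllib.request.urlopen(url) as response, open(path, "wb") as cache_file:
            cache_file.write(response.read())
    with open(path, "rb") as cache_file:
        return cache_file.read()
```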
This branch is not ready to go yet because I still have several test failures, but I opened this WIP PR so that people can see what I'm working on (instead of me reiterating it in comments on issues).
The main changes are:

- `test.py` now tests all languages and encodings that we have data for, since we now have models for them.
- Changes to the `UniversalDetector` output.
- Reduced `wrap_ord` usage, which provides a nice speedup.
- A `languages` metadata module that contains the information necessary for training all of the SBCS models (language name, supported encodings, alphabet, does it use ASCII, etc.); a rough sketch follows at the end of this description.

I am well aware that this monstrosity is very hard to review given its size, so I may try to pull some parts out of it into separate PRs where possible. For example, the change that recapitalizes all the enum attributes (since they're class attributes, and we're not using the Python-3-style enums because of the extra dependency they would require us to add) could certainly be pulled out.
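As a rough illustration of what one entry in such a `languages` metadata module might hold, based only on the fields named in the bullet above (the structure, field names, and example values are guesses, not the actual module):

```python
from collections import namedtuple

# Hypothetical shape for one language's metadata; real field names may differ.
Language = namedtuple("Language", ["name", "charsets", "alphabet", "use_ascii"])

RUSSIAN = Language(
    name="Russian",
    # Single-byte encodings a Russian model would need to cover.
    charsets=["ISO-8859-5", "KOI8-R", "windows-1251", "MacCyrillic", "IBM866", "IBM855"],
    alphabet="абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
             "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ",
    use_ascii=False,
)
```

Keeping this information in one place would let the training scripts and the detector agree on which encodings each SBCS model covers.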