New models and model loading, depending on sklearn version #38
- Added sklearn_path to compat, model dirs TBD
- Caught exception in model training for incorrectly block-corrected html files
- Tweaked `train_models` output dir and file names
- Models now loaded from separate directories, depending on env’s version of sklearn
- Removed tests for other models that are (currently) no longer in use
- Model training now saves models in gzip compressed form
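For readers unfamiliar with the "gzip compressed form" mentioned in the commit above, here is a minimal sketch of saving and loading a pickled model through `gzip`. The names `save_model`, `load_model` and `demo_path` are hypothetical and are not taken from this PR; the actual training code may serialize models differently.

```python
import gzip
import os
import pickle
import tempfile


def save_model(model, path):
    """Pickle `model` to `path` with gzip compression (hypothetical helper)."""
    with gzip.open(path, "wb") as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Load a gzip-compressed pickle written by `save_model`."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# Round-trip a stand-in object; a real run would pass a trained extractor.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_model.pickle.gz")
save_model({"name": "demo", "weights": [0.1, 0.2]}, demo_path)
restored = load_model(demo_path)
```

As with any pickle, such files should only be loaded from a trusted source.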
Looks like the test I modified isn't passing in Travis, but it definitely passes locally!
Nice! This looks great and is a huge improvement for maintaining compatibility with different sklearn versions. What version of sklearn do you have locally? I wonder if the issue with the Travis build is due to different sklearn versions. The different models will extract slightly different content, so we may need one test fixture to support each version of sklearn.
Hm, that's a possibility... I used sklearn 0.17.1 and 0.18.1 for the "old" and "new" models, respectively. I didn't use, say, 0.16.1 because the API was compatible with 0.17.1 ... I realize now that I didn't update the version requirements in |
Ah I see. So your As for test cases, we could compute the precision, recall, f1 for the predicted version vs the gold standard and make sure they exceed some high threshold like 0.99. |
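The threshold idea in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: scoring is done on whitespace-separated tokens, and `token_prf` and `assert_close_to_gold` are hypothetical names, not functions from this project.

```python
from collections import Counter


def token_prf(predicted_text, gold_text):
    """Token-level precision, recall and F1 of predicted vs gold content."""
    predicted = Counter(predicted_text.split())
    gold = Counter(gold_text.split())
    # Multiset intersection: each token counts as often as it appears in both.
    overlap = sum((predicted & gold).values())
    precision = overlap / sum(predicted.values()) if predicted else 0.0
    recall = overlap / sum(gold.values()) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def assert_close_to_gold(predicted_text, gold_text, threshold=0.99):
    """Fail if any of the three scores falls below `threshold`."""
    scores = token_prf(predicted_text, gold_text)
    for name, score in zip(("precision", "recall", "f1"), scores):
        assert score >= threshold, "%s %.4f is below %.2f" % (name, score, threshold)
```

A test written this way tolerates small differences between models trained on different sklearn versions, instead of requiring the extracted text to match a single fixture exactly.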
Ah, maybe I wasn't clear... sklearn's API seemed to be compatible from 0.15.2 to 0.17.1, then changed at 0.18.0+. So we should only need the two models (for now).
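A rough sketch of the two-way split described above, mapping an sklearn version string to one of two model directories. The function name and the directory names are hypothetical (the commit notes say the model dirs were still TBD), and rejecting versions below 0.15 is an assumption, since the thread does not say how older versions behave.

```python
def sklearn_model_dir(sklearn_version):
    """Return a hypothetical model subdirectory name for an sklearn version string."""
    # Compare only (major, minor), so "0.18.1" maps the same way as "0.18.0".
    major, minor = (int(part) for part in sklearn_version.split(".")[:2])
    if (major, minor) < (0, 15):
        # Assumption: versions older than the tested range are rejected.
        raise ValueError("unsupported sklearn version: " + sklearn_version)
    if (major, minor) <= (0, 17):
        return "sklearn_0.15.2_0.17.1"  # "old" API, hypothetical directory name
    return "sklearn_0.18.0"  # "new" API, hypothetical directory name
```

In practice the version string would come from `sklearn.__version__`; it is passed in as an argument here so the sketch runs without sklearn installed.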
Hey @matt-peters , you were right: differences in |
print('len(blocks) =', len(blocks))
print('len(this_labels) =', len(this_labels))
print('len(this_weight) =', len(this_weight))
# raise ValueError("Number of features, labels and weights do not match!")
Let's continue to raise this as an error. By trapping it, we could mask another, more significant problem and create a much smaller or corrupted data set. Someone who is retraining without being familiar with the code details is likely not to notice it, so let's make it explicit when it is ignored.
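One way to read the suggestion above, as a hedged sketch: raise on any length mismatch, and if a caller does choose to skip a bad file, make that skip visible. The helper names and the `(source, blocks, labels, weights)` tuple layout are assumptions for illustration, not code from this PR.

```python
import warnings


def check_training_lengths(blocks, labels, weights, source="<unknown file>"):
    """Raise ValueError if features, labels and weights are misaligned."""
    if not (len(blocks) == len(labels) == len(weights)):
        raise ValueError(
            "Number of features, labels and weights do not match for %s: "
            "len(blocks)=%d, len(labels)=%d, len(weights)=%d"
            % (source, len(blocks), len(labels), len(weights))
        )


def usable_examples(examples):
    """Yield aligned examples; skip misaligned ones with an explicit warning."""
    for source, blocks, labels, weights in examples:
        try:
            check_training_lengths(blocks, labels, weights, source)
        except ValueError as err:
            # The skip is reported rather than silent.
            warnings.warn("skipping training file: %s" % err)
            continue
        yield source, blocks, labels, weights
```

Calling `check_training_lengths` directly keeps the failure loud by default; `usable_examples` is the opt-in path for someone who knowingly wants to drop faulty files.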
content_comments_extractor = None
content_and_content_comments_extractor = None

# weninger_model = Weninger()
You can remove these lines too.
kohlschuetter_css_model,
kohlschuetter_css_weninger_model,
content_extractor,
# models = [kohlschuetter_model,
These commented-out lines can go.
Would you mind also adding your training script to train/evaluate end-to-end? It will make it easier for someone else to add support for a new sklearn version. Otherwise it looks good, except for the few minor comments above.
- re-introduced valueerror in training for faulty input file
- cleared out some commented-out code
Okay, that last commit does everything you mentioned in the comments. Two things:
|
I don't see any reason to keep them so let's just delete them.
👍 Thanks for all your work on this!
My pleasure! I don't suppose you could push this as a new release to pypi? I'd love to use it in production. :)
The installation is still failing in Travis like this:
|
@sistawendy The travis build looks fine to me: https://travis-ci.org/seomoz/dragnet ? Where do you see the failure?
In a Freshscape build. It's here, if you have access:
https://travis-ci.com/seomoz/freshscape/builds/37887192
On Wed, Jan 11, 2017 at 1:12 PM, Matthew Peters wrote:
> @sistawendy The travis build looks fine to me: https://travis-ci.org/seomoz/dragnet ? Where do you see the failure?
This is fixed in #39. I will push a new release to pypi tomorrow.
Hey @matt-peters, did you push a new release? The latest I see on pypi is 1.0.2, which I believe came before this set of changes.
Just pushed 1.1.0 to pypi.
Changes:
- Added `sklearn_path` to `compat` module, use it for saving and loading different content extraction models for "old" and "new" versions of `scikit-learn`
- `ContentExtractionModel.analyze()`
Performance in terms of accuracy/precision/recall/F1 of the new models was on par with or slightly better than previous benchmarks. See Issue #33 for details.