
New models and model loading, depending on sklearn version #38

Merged
9 commits merged into dragnet-org:master from the new-models branch on Jan 11, 2017

Conversation

bdewilde (Collaborator) commented Jan 7, 2017

Changes:

  • Added sklearn_path to the compat module and used it for saving and loading different content extraction models for "old" and "new" versions of scikit-learn
  • Wrapped default content extraction model loading in a try/except; on failure, a warning message is displayed and the models are set to None, so devs can use other parts of the package even when the default models can't be loaded (a minimal sketch of this pattern follows this list)
  • Added newly-trained models for content and content+comments extraction
  • Commented out loading and testing of non-default models, but I can add these back in (or remove them entirely) if you'd prefer
  • Updated the docstring on the main API method, ContentExtractionModel.analyze()
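
For illustration, a minimal sketch of the version-dependent loading pattern described above. The directory names follow those mentioned later in this thread, but the model file name, the gzip + pickle serialization, and the variable names are assumptions, not the actual dragnet code:

```python
import gzip
import logging
import os
import pickle

import pkg_resources
import sklearn

# choose a model directory based on the installed scikit-learn version
if pkg_resources.parse_version(sklearn.__version__) >= pkg_resources.parse_version('0.18.0'):
    sklearn_path = os.path.join('pickled_models', 'sklearn_0.18.0')
else:
    sklearn_path = os.path.join('pickled_models', 'sklearn_0.15.2_0.17.1')

try:
    # 'content_model.pkl.gz' is a placeholder file name, not the real one
    with gzip.open(os.path.join(sklearn_path, 'content_model.pkl.gz'), 'rb') as f:
        content_extractor = pickle.load(f)
except Exception:
    # fall back to None so the rest of the package stays importable/usable
    logging.warning('default content extraction model could not be loaded; '
                    'other parts of the package remain usable')
    content_extractor = None
```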

Performance of the new models in terms of accuracy/precision/recall/F1 was on par with or slightly better than previous benchmarks. See Issue #33 for details.

Burton DeWilde added 4 commits on January 5, 2017 at 10:20:

- Added sklearn_path to compat, model dirs TBD
- Caught exception in model training for incorrectly block-corrected html files
- Tweaked `train_models` output dir and file names
- Models now loaded from separate directories, depending on env’s version of sklearn
- Removed tests for other models that are (currently) no longer in use
- Model training now saves models in gzip compressed form
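
A minimal sketch of what saving models "in gzip compressed form" could look like, assuming plain gzip + pickle; the actual `train_models` script may serialize differently (e.g. via joblib):

```python
import gzip
import pickle

def save_model_gz(model, path):
    # pickle the trained model and write it gzip-compressed to disk
    with gzip.open(path, 'wb') as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_model_gz(path):
    # read back a gzip-compressed pickled model
    with gzip.open(path, 'rb') as f:
        return pickle.load(f)
```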
bdewilde (Collaborator, Author) commented Jan 7, 2017

Looks like the test I modified isn't passing in Travis, but it definitely passes locally!

matt-peters (Collaborator) commented:

Nice! This looks great and is a huge improvement for maintaining compatibility with different sklearn versions. What version of sklearn do you have locally? I wonder if the issue with the Travis build is due to different sklearn versions. The different models will extract slightly different content, so we may need one test fixture per supported sklearn version.

bdewilde (Collaborator, Author) commented Jan 9, 2017

Hm, that's a possibility... I used sklearn 0.17.1 and 0.18.1 for the "old" and "new" models, respectively. I didn't train against, say, 0.16.1 because its API is compatible with 0.17.1's... I realize now that I didn't update the version requirements in setup.py, which should probably be 'scikit-learn>=0.15.2,<=0.19'. Maintaining separate tests for (all?) different sklearn versions could get ugly; I wonder if there's a way to do approximate equality checks?

matt-peters (Collaborator) commented:

Ah, I see. So your sklearn_0.15.2_0.17.1 pickled model was trained with version 0.17.1, and there is another API change between 0.16.x and 0.17? In that case it isn't compatible with 0.15-0.16, and we need three different pickled models to support 0.16-0.18. Ugh.

As for test cases, we could compute the precision, recall, and F1 of the predicted content vs the gold standard and make sure they exceed some high threshold like 0.99.

bdewilde (Collaborator, Author) commented Jan 9, 2017

Ah, maybe I wasn't clear... sklearn's API seemed to be compatible from 0.15.2 through 0.17.1, then changed at 0.18.0+. So we should only need the two models (for now).

bdewilde (Collaborator, Author) commented Jan 9, 2017

Hey @matt-peters, you were right: differences in scikit-learn versions account for the failed test. Travis uses 0.16.1 (despite changes to setup.py) to test the model, while I used 0.17.1 to train it; although the ExtraTreesClassifier API didn't change across versions, the model results changed slightly. I updated the model tests to use the bag-of-words F1 score between predicted and actual content, with a minimum acceptable value of 0.98. Tests pass on all versions: 0.16.1, 0.17.1, and 0.18.1.
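
For reference, a bag-of-words F1 check along these lines might look like the sketch below; the tokenization, helper names, and fixture handling are illustrative rather than the actual test_models.py code:

```python
from __future__ import division  # Python 2/3 compatible division

from collections import Counter

def bag_of_words_f1(predicted, expected):
    # token-level F1 between two strings, treating each as a bag of words
    pred_counts = Counter(predicted.split())
    exp_counts = Counter(expected.split())
    overlap = sum((pred_counts & exp_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(exp_counts.values())
    return 2 * precision * recall / (precision + recall)

def check_extraction(model, html, expected_content, threshold=0.98):
    # approximate-equality check: tolerate small differences across sklearn versions
    predicted = model.analyze(html)
    assert bag_of_words_f1(predicted, expected_content) >= threshold
```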

print('len(blocks) =', len(blocks))
print('len(this_labels) =', len(this_labels))
print('len(this_weight) =', len(this_weight))
# raise ValueError("Number of features, labels and weights do not match!")
matt-peters (Collaborator) commented on the lines above:

Let's continue to raise this as an error. By trapping it, we could mask another, larger problem and end up with a much smaller or corrupted data set. Someone who is retraining without being familiar with the code details is likely not to notice, so let's make the failure explicit rather than silently ignoring it.
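
A minimal sketch of re-introducing the explicit failure; the variable names come from the snippet above, but the exact message and placement in the training code may differ:

```python
# fail loudly when features, labels, and weights are out of sync
if not (len(blocks) == len(this_labels) == len(this_weight)):
    raise ValueError(
        'Number of features ({}), labels ({}) and weights ({}) do not match!'.format(
            len(blocks), len(this_labels), len(this_weight)))
```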

content_comments_extractor = None
content_and_content_comments_extractor = None

# weninger_model = Weninger()
matt-peters (Collaborator) commented on the lines above:

You can remove these lines too.

kohlschuetter_css_model,
kohlschuetter_css_weninger_model,
content_extractor,
# models = [kohlschuetter_model,
matt-peters (Collaborator) commented on the lines above:

These commented-out lines can go.

matt-peters (Collaborator) commented:

Would you mind also adding your training script to train/evaluate end-to-end? It will make it easier for someone else to add support for a new sklearn version.

Otherwise it looks good except for the few minor comments above.

- Re-introduced ValueError in training for faulty input file
- Cleared out some commented-out code
bdewilde (Collaborator, Author) commented Jan 10, 2017

Okay, that last commit does everything you mentioned in the comments. Two things:

  1. A test in test_models.py failed on Travis (ugh), but I could only replicate this behavior intermittently. I believe it can be attributed to nondeterminism in one of the features, so I'm going to bump the number of trials back up from 5 to 10.
  2. What should be done about the models in the pickled_models directory (not the new ones I've added in sklearn-specific folders)?

matt-peters (Collaborator) commented:

> What should be done about the models in the pickled_models directory (not the new ones I've added in sklearn-specific folders)?

I don't see any reason to keep them so let's just delete them.

matt-peters (Collaborator) commented:

👍 Thanks for all your work on this!

matt-peters merged commit cbec553 into dragnet-org:master on Jan 11, 2017
matt-peters mentioned this pull request on Jan 11, 2017
bdewilde (Collaborator, Author) commented:

My pleasure! I don't suppose you could push this as a new release to pypi? I'd love to use it in production. :)

sistawendy (Contributor) commented:

The installation is still failing in Travis like this:

creating build/lib.linux-x86_64-2.7/dragnet/pickled_models
error: can't copy 'dragnet/pickled_models/sklearn_0.18.0': doesn't exist or not a regular file

bdewilde deleted the new-models branch on January 11, 2017 at 19:57
matt-peters (Collaborator) commented:

@sistawendy The Travis build looks fine to me (https://travis-ci.org/seomoz/dragnet). Where do you see the failure?

sistawendy (Contributor) commented Jan 11, 2017 via email

matt-peters (Collaborator) commented:

This is fixed in #39

I will push a new release to pypi tomorrow.

bdewilde (Collaborator, Author) commented:

Hey @matt-peters, did you push a new release? The latest I see on pypi is 1.0.2, which I believe came before this set of changes.

matt-peters (Collaborator) commented Jan 17, 2017

Just pushed 1.1.0 to pypi.
