
New models and model loading, depending on sklearn version #38

Merged
9 commits merged into dragnet-org:master from the new-models branch on Jan 11, 2017

Conversation

bdewilde (Collaborator) commented Jan 7, 2017

Changes:

  • Added sklearn_path to the compat module and used it for saving and loading different content extraction models for "old" and "new" versions of scikit-learn
  • Wrapped default content extraction model loading in a try/except; on failure, a warning message is displayed and the models are set to None, so devs can use other parts of the package even when the default models can't be loaded (a minimal sketch of this pattern follows this list)
  • Added newly-trained models for content and content+comments extraction
  • Commented out loading and testing of non-default models, but I can add these back in (or remove them entirely) if you'd prefer
  • Updated the docstring on the main API method, ContentExtractionModel.analyze()
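
For illustration, a minimal sketch of the version-dependent loading pattern described above. The directory names follow those mentioned later in this thread, but the model file name, the gzip + pickle serialization, and the variable names are assumptions, not the actual dragnet code:

```python
import gzip
import logging
import os
import pickle

import pkg_resources
import sklearn

# choose a model directory based on the installed scikit-learn version
if pkg_resources.parse_version(sklearn.__version__) >= pkg_resources.parse_version('0.18.0'):
    sklearn_path = os.path.join('pickled_models', 'sklearn_0.18.0')
else:
    sklearn_path = os.path.join('pickled_models', 'sklearn_0.15.2_0.17.1')

try:
    # 'content_model.pkl.gz' is a placeholder file name, not the real one
    with gzip.open(os.path.join(sklearn_path, 'content_model.pkl.gz'), 'rb') as f:
        content_extractor = pickle.load(f)
except Exception:
    # fall back to None so the rest of the package stays importable/usable
    logging.warning('default content extraction model could not be loaded; '
                    'other parts of the package remain usable')
    content_extractor = None
```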

Performance of the new models in terms of accuracy/precision/recall/F1 was on par with or slightly better than previous benchmarks. See Issue #33 for details.

Burton DeWilde added 4 commits on January 5, 2017 at 10:20:

- Added sklearn_path to compat, model dirs TBD
- Caught exception in model training for incorrectly block-corrected html files
- Tweaked `train_models` output dir and file names
- Models now loaded from separate directories, depending on env’s version of sklearn
- Removed tests for other models that are (currently) no longer in use
- Model training now saves models in gzip compressed form
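
A minimal sketch of what saving models "in gzip compressed form" could look like, assuming plain gzip + pickle; the actual `train_models` script may serialize differently (e.g. via joblib):

```python
import gzip
import pickle

def save_model_gz(model, path):
    # pickle the trained model and write it gzip-compressed to disk
    with gzip.open(path, 'wb') as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_model_gz(path):
    # read back a gzip-compressed pickled model
    with gzip.open(path, 'rb') as f:
        return pickle.load(f)
```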
bdewilde (Collaborator, Author) commented Jan 7, 2017

Looks like the test I modified isn't passing in Travis, but it definitely passes locally!

matt-peters (Collaborator) commented:

Nice! This looks great and is a huge improvement for maintaining compatibility with different sklearn versions. What version of sklearn do you have locally? I wonder if the issue with the Travis build is due to different sklearn versions. The different models will extract slightly different content, so we may need one test fixture per supported sklearn version.

bdewilde (Collaborator, Author) commented Jan 9, 2017

Hm, that's a possibility... I used sklearn 0.17.1 and 0.18.1 for the "old" and "new" models, respectively. I didn't train against, say, 0.16.1 because its API is compatible with 0.17.1's... I realize now that I didn't update the version requirements in setup.py, which should probably be 'scikit-learn>=0.15.2,<=0.19'. Maintaining separate tests for (all?) different sklearn versions could get ugly; I wonder if there's a way to do approximate equality checks?

matt-peters (Collaborator) commented:

Ah, I see. So your sklearn_0.15.2_0.17.1 pickled model was trained with version 0.17.1, and there is another API change between 0.16.x and 0.17? In that case it isn't compatible with 0.15-0.16, and we need three different pickled models to support 0.16-0.18. Ugh.

As for test cases, we could compute the precision, recall, and F1 of the predicted content vs the gold standard and make sure they exceed some high threshold like 0.99.

bdewilde (Collaborator, Author) commented Jan 9, 2017

Ah, maybe I wasn't clear... sklearn's API seemed to be compatible from 0.15.2 through 0.17.1, then changed at 0.18.0+. So we should only need the two models (for now).

bdewilde (Collaborator, Author) commented Jan 9, 2017

Hey @matt-peters, you were right: differences in scikit-learn versions account for the failed test. Travis uses 0.16.1 (despite changes to setup.py) to test the model, while I used 0.17.1 to train it; although the ExtraTreesClassifier API didn't change across versions, the model results changed slightly. I updated the model tests to use the bag-of-words F1 score between predicted and actual content, with a minimum acceptable value of 0.98. Tests pass on all versions: 0.16.1, 0.17.1, and 0.18.1.
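
For reference, a bag-of-words F1 check along these lines might look like the sketch below; the tokenization, helper names, and fixture handling are illustrative rather than the actual test_models.py code:

```python
from __future__ import division  # Python 2/3 compatible division

from collections import Counter

def bag_of_words_f1(predicted, expected):
    # token-level F1 between two strings, treating each as a bag of words
    pred_counts = Counter(predicted.split())
    exp_counts = Counter(expected.split())
    overlap = sum((pred_counts & exp_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(exp_counts.values())
    return 2 * precision * recall / (precision + recall)

def check_extraction(model, html, expected_content, threshold=0.98):
    # approximate-equality check: tolerate small differences across sklearn versions
    predicted = model.analyze(html)
    assert bag_of_words_f1(predicted, expected_content) >= threshold
```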

print('len(blocks) =', len(blocks))
print('len(this_labels) =', len(this_labels))
print('len(this_weight) =', len(this_weight))
# raise ValueError("Number of features, labels and weights do not match!")
matt-peters (Collaborator) commented on the lines above:

Let's continue to raise this as an error. By trapping it, we could mask another, larger problem and end up with a much smaller or corrupted data set. Someone who is retraining without being familiar with the code details is likely not to notice, so let's make the failure explicit rather than silently ignoring it.
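
A minimal sketch of re-introducing the explicit failure; the variable names come from the snippet above, but the exact message and placement in the training code may differ:

```python
# fail loudly when features, labels, and weights are out of sync
if not (len(blocks) == len(this_labels) == len(this_weight)):
    raise ValueError(
        'Number of features ({}), labels ({}) and weights ({}) do not match!'.format(
            len(blocks), len(this_labels), len(this_weight)))
```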

content_comments_extractor = None
content_and_content_comments_extractor = None

# weninger_model = Weninger()
matt-peters (Collaborator) commented on the lines above:

You can remove these lines too.

kohlschuetter_css_model,
kohlschuetter_css_weninger_model,
content_extractor,
# models = [kohlschuetter_model,
matt-peters (Collaborator) commented on the lines above:

These commented-out lines can go.

matt-peters (Collaborator) commented:

Would you mind also adding your training script to train/evaluate end-to-end? It will make it easier for someone else to add support for a new sklearn version.

Otherwise it looks good except for the few minor comments above.

- Re-introduced ValueError in training for faulty input file
- Cleared out some commented-out code
bdewilde (Collaborator, Author) commented Jan 10, 2017

Okay, that last commit does everything you mentioned in the comments. Two things:

  1. A test in test_models.py failed on Travis (ugh), but I could only replicate this behavior intermittently. I believe it can be attributed to nondeterminism in one of the features, so I'm going to bump the number of trials back up from 5 to 10.
  2. What should be done about the models in the pickled_models directory (not the new ones I've added in sklearn-specific folders)?

matt-peters (Collaborator) commented:

> What should be done about the models in the pickled_models directory (not the new ones I've added in sklearn-specific folders)?

I don't see any reason to keep them so let's just delete them.

matt-peters (Collaborator) commented:

👍 Thanks for all your work on this!

matt-peters merged commit cbec553 into dragnet-org:master on Jan 11, 2017
matt-peters mentioned this pull request on Jan 11, 2017
bdewilde (Collaborator, Author) commented:

My pleasure! I don't suppose you could push this as a new release to pypi? I'd love to use it in production. :)

sistawendy (Contributor) commented:

The installation is still failing in Travis like this:

creating build/lib.linux-x86_64-2.7/dragnet/pickled_models
error: can't copy 'dragnet/pickled_models/sklearn_0.18.0': doesn't exist or not a regular file

bdewilde deleted the new-models branch on January 11, 2017 at 19:57
matt-peters (Collaborator) commented:

@sistawendy The Travis build looks fine to me (https://travis-ci.org/seomoz/dragnet). Where do you see the failure?

sistawendy (Contributor) commented Jan 11, 2017 via email

matt-peters (Collaborator) commented:

This is fixed in #39

I will push a new release to pypi tomorrow.

bdewilde (Collaborator, Author) commented:

Hey @matt-peters, did you push a new release? The latest I see on pypi is 1.0.2, which I believe came before this set of changes.

matt-peters (Collaborator) commented Jan 17, 2017

Just pushed 1.1.0 to pypi.
