Update models #63
Conversation
First of all: good catch on the weninger init! :) And just out of curiosity... Where do the new models fall in terms of accuracy / F-score?

Build is failing for 2.7.
(force-pushed from b2dfd6b to ada759c)
Ok, so I re-trained a couple of the models that had issues (poor performance on content-only extraction, which failed the Travis build, and a couple of others). Also, I added in the files recording the performance of each model. @bdewilde The content-only models are still only getting about a 0.73 F1 score, but that's all I was able to get from master as well.

@bdewilde To give a bit more info on the models' performance:

I've been playing with this a bit more over the last couple of days, and have a couple of notes on that:
Adding the kmeans step back and changing the scaling gives me the ability to load the old model, and I get a much better F1 score by just loading in the pre-trained
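
For what it's worth, here is a minimal sketch of the kind of arrangement described above: a k-means step (its distances to cluster centers feeding the classifier) plus rescaled inputs. The data, feature count, and classifier here are placeholders, not dragnet's actual internals.

```python
# Illustrative only (not dragnet's actual feature code): a scikit-learn
# pipeline that keeps a k-means step, whose distances to cluster centers
# become the classifier's inputs, with the features rescaled up front.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: one row of block-level features per block,
# with 1 = content block, 0 = boilerplate.
rng = np.random.RandomState(0)
X = rng.rand(200, 6)
y = (X[:, 0] > 0.5).astype(int)

model = Pipeline([
    ("scale", StandardScaler()),                       # the "changing the scaling" part
    ("kmeans", KMeans(n_clusters=3, random_state=0)),  # the k-means step
    ("clf", LogisticRegression()),
])
model.fit(X, y)
print(model.predict(X[:5]))
```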
@pakelley I did train the model on the

@matt-peters yeah, I'm using the same hyperparameters as the saved model. Not sure why it's doing so much worse :/

Another update: I was able to train a model with a content-only F1 score of 0.85, but it was using a modified version of the old training code.

It's been an entire year since I looked at it, so... not really, no. 😅 I recall having trouble replicating the original model's performance, but was never able to track down the root cause. The newer training code uses

@bdewilde No worries, thanks for all of your help!

Hey all, I finally got some decent models trained for content-only extraction! 🎉
👏 👏 👏 |
Instead of concatenating all blocks before training/predicting, calculate features on each HTML document individually.
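To illustrate the difference (hypothetical helper names, not the library's real API), here is a sketch where block features are computed for each HTML document on its own and only stacked afterwards, rather than concatenating all blocks first:

```python
# Illustrative sketch: compute block features per HTML document, then stack
# the per-document results for training. `extract_block_features` and
# `block_labels` are hypothetical stand-ins for the real feature/label code.
import numpy as np

def build_training_data(html_docs, extract_block_features, block_labels):
    """Return (X, y) built by processing each document individually."""
    X_parts, y_parts = [], []
    for html in html_docs:
        X_parts.append(extract_block_features(html))  # features for this doc's blocks only
        y_parts.append(block_labels(html))            # labels for this doc's blocks only
    return np.vstack(X_parts), np.concatenate(y_parts)
```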
Updating the API to be able to extract features for documents individually needed a bit more change than I'd hoped, and there are a couple of sizeable changes here. Let me know if any of that seems unreasonable. Aside from that, the new F1 scores are roughly 0.87 for content-only extraction, 0.81 for comment-only extraction, and 0.92 for both. :)

Closes #61. It looks like I just had a couple of issues with my data and my comparison of the two versions of the library. Included in this PR are newly trained models, as well as a minor bugfix for passing along the `sigma` parameter in the `WeningerFeatures` class's `__init__`.
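
For context on the bugfix, something along these lines (an illustrative class, not dragnet's actual `WeningerFeatures` code) shows what passing `sigma` through `__init__` buys you: the smoothing step then actually uses the value the caller asked for instead of a hard-coded default.

```python
# Illustrative sketch of the kind of fix described above: store the `sigma`
# passed to __init__ so the smoothing step uses it. Not the real implementation.
import numpy as np
from scipy.ndimage import gaussian_filter1d

class WeningerLikeFeatures:
    def __init__(self, sigma=1.0):
        self.sigma = sigma  # the fix: keep the caller's value around

    def smooth(self, block_lengths):
        """Gaussian-smooth a sequence of per-block lengths with self.sigma."""
        x = np.asarray(block_lengths, dtype=float)
        return gaussian_filter1d(x, sigma=self.sigma)
```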