Conversation

@pakelley (Collaborator) commented Mar 1, 2018

Closes #61 (poor performance for content-only extraction). It looks like I just had a couple of issues with my data and with my comparison of the two versions of the library. Included in this PR are newly trained models, as well as a minor bugfix for passing the sigma parameter along in the WeningerFeatures class's __init__.
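
The fix is essentially "store the constructor argument so it actually reaches the smoothing step." A minimal sketch of the pattern, for context — the class body, attribute names, and the gaussian_filter1d call are illustrative assumptions, not the actual dragnet implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d


class WeningerFeaturesSketch:
    """Illustrative sketch of passing sigma through __init__."""

    def __init__(self, sigma=1.0):
        # Bug pattern being fixed: accepting sigma but never storing/forwarding
        # it means the smoothing step silently uses a hard-coded default.
        self.sigma = sigma

    def transform(self, block_lengths):
        # Smooth the per-block text-length signal with the configured sigma.
        signal = np.asarray(block_lengths, dtype=float)
        return gaussian_filter1d(signal, sigma=self.sigma)
```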

@bdewilde (Collaborator) commented Mar 1, 2018

First of all: good catch on the weninger init! :) And just out of curiosity... Where do the new models fall in terms of accuracy / F-score?

@b4hand (Contributor) commented Mar 1, 2018

The build is failing for Python 2.7.

@pakelley force-pushed the update-models branch 2 times, most recently from b2dfd6b to ada759c on March 2, 2018 17:15
@pakelley (Collaborator, Author) commented Mar 2, 2018

OK, so I re-trained a couple of models that had issues (poor performance on content-only extraction, which failed the Travis build, and a couple of sklearn<0.18 models that were having trouble being loaded).

Also, I added the files recording the performance of each model. @bdewilde The content-only models are still only getting an F1 score of about 0.73, but that's all I was able to get from master as well.

@pakelley (Collaborator, Author) commented Mar 2, 2018

@bdewilde To give a bit more info on the models' performance:
  • Content: F1 ~= 0.72-0.73, accuracy ~= 0.81
  • Comments: F1 ~= 0.6, accuracy ~= 0.83
  • Both: F1 ~= 0.93, accuracy ~= 0.92

@pakelley (Collaborator, Author) commented Mar 9, 2018

I've been playing with this a bit more over the last couple of days, and I have a couple of notes:

  • One is that the feature output is slightly different now, since the kmeans step was removed from the Weninger features. From my training, this doesn't actually seem to make much difference, though.
  • Another note is that the output of sklearn's StandardScaler is different from the old manual implementation of feature scaling, but this also doesn't make a difference in my tests, which I guess makes sense for a random forest (see the sketch after this list).
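
To illustrate the second point — this sketch doesn't reproduce the old dragnet scaling code, it just shows why a tree ensemble shouldn't care much about the scaler choice:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(100, 3)

# StandardScaler centers to zero mean and scales to unit variance (ddof=0).
X_scaled = StandardScaler().fit_transform(X)

# A typical manual standardization; np.std also defaults to ddof=0, so the two
# agree here, but a manual version using ddof=1 (or skipping the centering)
# would produce slightly different values.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(X_scaled, X_manual))  # True

# Either way, a random forest / extra-trees model splits on feature thresholds,
# and monotonic per-feature rescaling preserves the ordering of values, so the
# learned trees (and scores) should be essentially unchanged.
```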

Adding the kmeans step back and changing the scaling makes it possible to load the old model, and I get a much better F1 score by just loading the pre-trained ExtraTreesClassifier into the modified extractor (about 0.85, as advertised) for content-only extraction.
That said, I still can't train a model with nearly that accuracy using either the new or the old version of the library.
@matt-peters how did you train that model? Was it trained on the dragnet_data dataset?

@matt-peters (Collaborator)

@pakelley I did train the model on the dragnet_data dataset. Unfortunately it's been a while and I can't remember the details. I do remember the model was quite sensitive to the hyperparameters used to train it; are you re-using the same hyperparameters as the original model? You should be able to recover them from the serialized model.
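
A serialized sklearn estimator keeps its constructor hyperparameters, so something like this should pull them back out (the file path is a placeholder, and joblib is an assumption about how the model was pickled):

```python
import joblib  # older sklearn versions expose this as sklearn.externals.joblib

# Placeholder path to the pickled ExtraTreesClassifier shipped with the models.
model = joblib.load("path/to/content_extraction_model.pkl.gz")

# get_params() returns the constructor hyperparameters (n_estimators,
# max_features, min_samples_leaf, ...), which can be reused for retraining.
print(model.get_params())
```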

@pakelley (Collaborator, Author)

@matt-peters yeah, I'm using the same hyperparameters as the saved model. Not sure why it's doing so much worse :/

@pakelley (Collaborator, Author)

Another update: I was able to train a model with a content-only F1 score of 0.85, but it was using a modified version of the old training code.
@bdewilde Do you have any idea on what might be causing that? I'm trying to investigate what's going on there, but you're definitely more familiar with that code than I am :)

@bdewilde (Collaborator)

It's been an entire year since I looked at it, so... not really, no. 😅 I recall having trouble replicating the original model's performance, but was never able to track down the root cause.

The newer training code uses sklearn-standard k-fold cross-validation and grid search, right? Seems like that ought to find a good set of model params without doing anything dumb, but I can't rule out a bug on my part...
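
For reference, the sklearn-standard version of that looks roughly like the following — the parameter grid and the feature matrix here are placeholders, not dragnet's actual training configuration:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Placeholder block-level feature matrix and content/not-content labels.
rng = np.random.RandomState(0)
X = rng.rand(500, 10)
y = rng.randint(0, 2, size=500)

param_grid = {
    "n_estimators": [10, 50, 100],
    "max_features": ["sqrt", "log2", None],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    ExtraTreesClassifier(random_state=0),
    param_grid,
    scoring="f1",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```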

@pakelley (Collaborator, Author) commented Mar 14, 2018

@bdewilde No worries, thanks for all of your help!

Hey all, I finally got some decent models trained for content-only extraction! 🎉
It looks like concatenating all of the blocks before training/predicting was the source of the issue. I'm currently working on getting the model_training utilities to do training/evaluation without concatenating the blocks, which I'll just add to this PR when I'm finished.
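
To sketch the idea (the function names here are illustrative, not the actual model_training utilities): compute features one document at a time, so that context-dependent features like the smoothed Weninger signal never bleed across document boundaries, and only stack the per-document results afterwards for the classifier.

```python
import numpy as np


def featurize_document(html):
    """Illustrative stand-in: extract blocks from a single HTML document and
    return an (n_blocks, n_features) matrix plus per-block labels."""
    raise NotImplementedError


def featurize_corpus(html_docs):
    # Features are computed per document, so smoothing / neighbor-based
    # features only ever see blocks from the same page.
    feature_mats, label_vecs = [], []
    for html in html_docs:
        X_doc, y_doc = featurize_document(html)
        feature_mats.append(X_doc)
        label_vecs.append(y_doc)
    # Only the finished per-document matrices get stacked for the classifier.
    return np.vstack(feature_mats), np.concatenate(label_vecs)
```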

@bdewilde (Collaborator)

👏 👏 👏

Commit: Instead of concatenating all blocks before training/predicting, calculate features on each HTML document individually.
@pakelley (Collaborator, Author)

Updating the API to extract features for documents individually required more change than I'd hoped, so there are a couple of sizeable changes here. Let me know if any of that seems unreasonable.

Aside from that, the new F1 scores are roughly 0.87 for content-only extraction, 0.81 for comment-only extraction, and 0.92 for both. :)

@b4hand merged commit 569d32e into dragnet-org:master on Mar 19, 2018
@nehalecky deleted the update-models branch on April 24, 2018