Conversation

@pakelley (Collaborator) commented Mar 1, 2018

Closes #61 (poor performance for content-only extraction). It looks like I just had a couple of issues with my data and with my comparison of the two versions of the library. Included in this PR are newly trained models, as well as a minor bugfix for passing the sigma parameter along in the WeningerFeatures class's __init__.
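
The fix is essentially "store the constructor argument so it actually reaches the smoothing step." A minimal sketch of the pattern, for context — the class body, attribute names, and the gaussian_filter1d call are illustrative assumptions, not the actual dragnet implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d


class WeningerFeaturesSketch:
    """Illustrative sketch of passing sigma through __init__."""

    def __init__(self, sigma=1.0):
        # Bug pattern being fixed: accepting sigma but never storing/forwarding
        # it means the smoothing step silently uses a hard-coded default.
        self.sigma = sigma

    def transform(self, block_lengths):
        # Smooth the per-block text-length signal with the configured sigma.
        signal = np.asarray(block_lengths, dtype=float)
        return gaussian_filter1d(signal, sigma=self.sigma)
```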

@bdewilde (Collaborator) commented Mar 1, 2018

First of all: good catch on the weninger init! :) And just out of curiosity... Where do the new models fall in terms of accuracy / F-score?

@b4hand (Contributor) commented Mar 1, 2018

The build is failing for Python 2.7.

@pakelley force-pushed the update-models branch 2 times, most recently from b2dfd6b to ada759c on March 2, 2018 17:15
@pakelley (Collaborator, Author) commented Mar 2, 2018

OK, so I re-trained a couple of models that had issues (poor performance on content-only extraction, which failed the Travis build, and a couple of sklearn<0.18 models that were having trouble being loaded).

Also, I added the files recording the performance of each model. @bdewilde The content-only models are still only getting an F1 score of about 0.73, but that's all I was able to get from master as well.

@pakelley (Collaborator, Author) commented Mar 2, 2018

@bdewilde To give a bit more info on the models' performance:
  • Content: F1 ~= 0.72-0.73, accuracy ~= 0.81
  • Comments: F1 ~= 0.6, accuracy ~= 0.83
  • Both: F1 ~= 0.93, accuracy ~= 0.92

@pakelley (Collaborator, Author) commented Mar 9, 2018

I've been playing with this a bit more over the last couple of days, and I have a couple of notes:

  • One is that the feature output is slightly different now, since the kmeans step was removed from the Weninger features. From my training, this doesn't actually seem to make much difference, though.
  • Another note is that the output of sklearn's StandardScaler is different from the old manual implementation of feature scaling, but this also doesn't make a difference in my tests, which I guess makes sense for a random forest (see the sketch after this list).
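
To illustrate the second point — this sketch doesn't reproduce the old dragnet scaling code, it just shows why a tree ensemble shouldn't care much about the scaler choice:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(100, 3)

# StandardScaler centers to zero mean and scales to unit variance (ddof=0).
X_scaled = StandardScaler().fit_transform(X)

# A typical manual standardization; np.std also defaults to ddof=0, so the two
# agree here, but a manual version using ddof=1 (or skipping the centering)
# would produce slightly different values.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(X_scaled, X_manual))  # True

# Either way, a random forest / extra-trees model splits on feature thresholds,
# and monotonic per-feature rescaling preserves the ordering of values, so the
# learned trees (and scores) should be essentially unchanged.
```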

Adding the kmeans step back and changing the scaling makes it possible to load the old model, and I get a much better F1 score by just loading the pre-trained ExtraTreesClassifier into the modified extractor (about 0.85, as advertised) for content-only extraction.
That said, I still can't train a model with nearly that accuracy using either the new or the old version of the library.
@matt-peters how did you train that model? Was it trained on the dragnet_data dataset?

@matt-peters (Collaborator)

@pakelley I did train the model on the dragnet_data dataset. Unfortunately it's been a while and I can't remember the details. I do remember the model was quite sensitive to the hyperparameters used to train it; are you re-using the same hyperparameters as the original model? You should be able to recover them from the serialized model.
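
A serialized sklearn estimator keeps its constructor hyperparameters, so something like this should pull them back out (the file path is a placeholder, and joblib is an assumption about how the model was pickled):

```python
import joblib  # older sklearn versions expose this as sklearn.externals.joblib

# Placeholder path to the pickled ExtraTreesClassifier shipped with the models.
model = joblib.load("path/to/content_extraction_model.pkl.gz")

# get_params() returns the constructor hyperparameters (n_estimators,
# max_features, min_samples_leaf, ...), which can be reused for retraining.
print(model.get_params())
```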

@pakelley (Collaborator, Author)

@matt-peters yeah, I'm using the same hyperparameters as the saved model. Not sure why it's doing so much worse :/

@pakelley (Collaborator, Author)

Another update: I was able to train a model with a content-only F1 score of 0.85, but it was using a modified version of the old training code.
@bdewilde Do you have any idea on what might be causing that? I'm trying to investigate what's going on there, but you're definitely more familiar with that code than I am :)

@bdewilde (Collaborator)

It's been an entire year since I looked at it, so... not really, no. 😅 I recall having trouble replicating the original model's performance, but was never able to track down the root cause.

The newer training code uses sklearn-standard k-fold cross-validation and grid search, right? Seems like that ought to find a good set of model params without doing anything dumb, but I can't rule out a bug on my part...
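
For reference, the sklearn-standard version of that looks roughly like the following — the parameter grid and the feature matrix here are placeholders, not dragnet's actual training configuration:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Placeholder block-level feature matrix and content/not-content labels.
rng = np.random.RandomState(0)
X = rng.rand(500, 10)
y = rng.randint(0, 2, size=500)

param_grid = {
    "n_estimators": [10, 50, 100],
    "max_features": ["sqrt", "log2", None],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    ExtraTreesClassifier(random_state=0),
    param_grid,
    scoring="f1",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```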

@pakelley (Collaborator, Author) commented Mar 14, 2018

@bdewilde No worries, thanks for all of your help!

Hey all, I finally got some decent models trained for content-only extraction! 🎉
It looks like concatenating all of the blocks before training/predicting was the source of the issue. I'm currently working on getting the model_training utilities to do training/evaluation without concatenating the blocks, which I'll just add to this PR when I'm finished.
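
To sketch the idea (the function names here are illustrative, not the actual model_training utilities): compute features one document at a time, so that context-dependent features like the smoothed Weninger signal never bleed across document boundaries, and only stack the per-document results afterwards for the classifier.

```python
import numpy as np


def featurize_document(html):
    """Illustrative stand-in: extract blocks from a single HTML document and
    return an (n_blocks, n_features) matrix plus per-block labels."""
    raise NotImplementedError


def featurize_corpus(html_docs):
    # Features are computed per document, so smoothing / neighbor-based
    # features only ever see blocks from the same page.
    feature_mats, label_vecs = [], []
    for html in html_docs:
        X_doc, y_doc = featurize_document(html)
        feature_mats.append(X_doc)
        label_vecs.append(y_doc)
    # Only the finished per-document matrices get stacked for the classifier.
    return np.vstack(feature_mats), np.concatenate(label_vecs)
```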

@bdewilde (Collaborator)

👏 👏 👏

Commit: Instead of concatenating all blocks before training/predicting, calculate features on each HTML document individually.
@pakelley (Collaborator, Author)

Updating the API to extract features for documents individually required more change than I'd hoped, so there are a couple of sizeable changes here. Let me know if any of that seems unreasonable.

Aside from that, the new F1 scores are roughly 0.87 for content-only extraction, 0.81 for comment-only extraction, and 0.92 for both. :)

@b4hand merged commit 569d32e into dragnet-org:master on Mar 19, 2018
@nehalecky deleted the update-models branch on April 24, 2018