updating / moving the training dataset? #85
New data would be amazing; as you said, the web has changed substantially in the last few years. I'm up for moving the dragnet_data repo over to this GitHub org. I'd recommend adding the new data to the existing data instead of replacing it completely, since this will almost certainly make any model trained on the datasets more robust across different types of markup.
FYI, I can't do the org transfer anymore, but we can just clone/fork the repo to the new org. I think it would make sense as well.
Hi folks! I'm finally ready to move forward on this task. First things first: I'm not able to create a new repo in this org, so I'll need permissions for that. Next big question: do we actually want to keep the same "content + comments" setup as before? I had some back-and-forth with @matt-peters a couple(!) of years ago — seomoz/dragnet_data#2 — and my needs are the same as then: comments aren't useful (plus, these days they're usually generated via JavaScript, so they don't show up in the raw HTML), and content could be split into "body text" and "metadata" (byline, pubdate, maybe even image captions, etc.). What do y'all think?
Currently looking into using (a small subset of) Common Crawl data to build a new training dataset. It should be possible to write code that pulls down a sample of crawled web pages' HTML and text content; manually cleaning up the latter shouldn't be too hard. Since new pages are crawled regularly, we could have a basically endless supply of training data. :) Will keep y'all posted.
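To make that concrete, here's a rough sketch (not code from this repo or from Common Crawl's own tooling) of one way it could work: query the public Common Crawl index API for captures matching a URL pattern, then pull each capture's raw bytes with an HTTP range request. The crawl ID, URL pattern, and helper names below are placeholder assumptions.

```python
# Hedged sketch: sampling page HTML from Common Crawl via the public index API.
# The crawl ID and URL pattern are placeholders; swap in whatever is current.
import gzip
import json

import requests

CRAWL_ID = "CC-MAIN-2020-16"  # example crawl; any recent one works
INDEX_URL = f"https://index.commoncrawl.org/{CRAWL_ID}-index"
DATA_URL = "https://data.commoncrawl.org/"

def sample_records(url_pattern, limit=5):
    """Query the index for captures matching a URL pattern (newline-delimited JSON)."""
    resp = requests.get(
        INDEX_URL,
        params={"url": url_pattern, "output": "json", "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    return [json.loads(line) for line in resp.text.splitlines()]

def fetch_html(record):
    """Fetch one capture's bytes via an HTTP range request and return the HTML body."""
    offset, length = int(record["offset"]), int(record["length"])
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    resp = requests.get(DATA_URL + record["filename"], headers=headers, timeout=60)
    resp.raise_for_status()
    # Each record is an individually gzipped WARC entry: WARC headers,
    # then HTTP headers, then the HTML payload.
    warc = gzip.decompress(resp.content).decode("utf-8", errors="replace")
    return warc.split("\r\n\r\n", 2)[-1]

if __name__ == "__main__":
    for rec in sample_records("nytimes.com/*", limit=3):
        print(rec["url"], len(fetch_html(rec)))
```

The gold-standard text side of each pair would still need hand-cleaning, but the HTML side comes straight from the crawl.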
@bdewilde Thanks for the update; this would be awesome if implemented. A refresh of the training data would probably improve the model significantly for newer web pages. If you have the bandwidth to update the training and model code to handle multiple types of content (e.g. body text and metadata), then I'm 100% supportive. If not, including additional types of content (author, publication date) in the "content" label makes sense, but it would be incompatible with the old annotation, so you'd probably need to annotate at least the same number of pages (~1000) to match the existing model's performance for that type of content. In any case, thanks for your continued work on this project 💯
Hi @matt-peters, happy to help! For simplicity's sake, I'm leaning towards lumping everything — title, byline, captions, and main article text — into "content", and skipping comments altogether. There's a case for splitting the metadata out, but it's definitely secondary, and I think it can wait. The only thing I need now (besides time to pull a new training dataset together 😉) is GitHub permissions to create a new repo in this org.
Update: I've manually compiled a training dataset of 200 (html, text) pairs on modern, news-y web pages from a variety of sites and in a variety of languages. The gold-standard, content-only text extractions include the elements discussed above (titles, bylines, image captions, and main article text) and do not include comments or other non-content boilerplate.
Current block-level classification performance is F1 ~0.92. If I combine this dataset with CleanEval (which includes 680 examples), I get up to F1 ~0.95, but I'm not convinced it does a better job on the sort of modern, news-y web pages dominating my dataset. HTML really has changed in the past 10 years! I'd like to get to ~500 examples, but this is a slow, not-fun process. Will keep y'all posted.
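To spell out what's being measured here: each text block in a page gets a binary label (main content vs. not), and F1 is computed over the per-block predictions. A toy illustration with made-up labels, using scikit-learn:

```python
# Toy illustration of block-level F1: one binary label per text block
# (1 = main content, 0 = boilerplate). The labels here are made up.
from sklearn.metrics import f1_score

gold = [1, 0, 0, 1, 1, 0, 1]       # hand-labeled blocks for one page
predicted = [1, 0, 1, 1, 0, 0, 1]  # blocks the extractor kept
print(f"block-level F1: {f1_score(gold, predicted):.2f}")
```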
FYI, @bdewilde: Since I added you to the org, you should be able to create repos in it now, so feel free to create the new dragnet-data repo.
Any news on this?
hi @ovidiususan, i've recently begun another attempt at this — pandemic lockdown has given me some extra free time 🙃 — using a different, more scalable method for pulling together high-quality training data. will post an update here when i have news to share, or just create a new dragnet data repo as discussed above. appreciate your patience, i left this on my backburner much longer than planned.
hi folks, i've made some decent progress on this task, and in fact have set up a work-in-progress repo w/ an initial iteration of the data and data-generating code: https://github.com/bdewilde/dragnet_data
i want to finish a few key to-dos — documentation, tests, and actually cleaning / filling in most of the gold-standard texts — then will see about transferring or duplicating the code and data over to this org. will keep y'all posted.
hey @bdewilde! hope all goes well. how's this effort looking?
Hi @nehalecky, very sorry about the delay on this. I've built up a training dataset of ~300 (html, text) pairs out of a total of ~1400, but progress is slow, and I keep getting detoured by other side projects. 🤦♂️ I'll try to push other projects aside so I can focus on finishing this one over the next few weeks. Will let you know here how it goes... 🙏
Hi @bdewilde! Wow, thanks for the quick reply, and totally sympathize: data labeling is hard work. 😓
Thanks much, appreciated!
Oh gosh, thank you tons for the offer to help! I set the code up in such a way that it's locked me into the original ~1400 pages — a large fraction of which are about the early days of the covid-19 pandemic, so both repetitive and bleak — but I've been meaning to restructure so that I or multiple people could extract gold-standard texts in more manageable chunks. Will try to implement a good method for this ASAP. As for your questions:
Nice to see work on this front. FWIW, I'm all for changing the license on the data.
Hi @nehalecky, I've made some changes to the code so that it's easier to do and track gold-standard extractions in batches, which should be handy for me working alone but, with a couple of minor git additions (branching, rebasing, and pull requests!), should also work for multiple annotators. I think. Here is the process — does it make sense?
Hi all, I've finished compiling 500 (HTML, gold-standard text extraction) example pairs, have added permissive licenses for both the associated code and data, and have transferred ownership of the project to this group: https://github.com/dragnet-org/dragnet_data
The project readme includes thorough instructions on how to add additional examples to the dataset, if one were so inclined, but I think 500 is a fine enough start. There are no tests, for which I hope you'll forgive me. Given all this, I'm going to close this issue out. Mission accomplished!
As a user of this library, I greatly appreciate the hard work done here - anything to improve the quality is always a bonus, and I definitely sympathise with the pain of manually building the gold standard! Just wondering: since there was already a model bundled with this library, are there any plans to push a newly trained version of the model (either just to this repo or to PyPI as well), or is the recommendation now to do this ourselves? It would be nice to be able to just consume one already done, but equally I don't mind doing it myself, since the instructions to train it aren't too difficult.
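(For context, consuming the bundled model looks roughly like the minimal sketch below, which assumes dragnet's documented extract_content helper; the URL is just a placeholder.)

```python
# Minimal sketch of using the model bundled with dragnet; the URL is a placeholder.
import requests
from dragnet import extract_content

html = requests.get("https://example.com/some-article").text
content = extract_content(html)  # main article text, per the bundled model
print(content[:500])
```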
@shakeelsoogun Thanks for asking :) I've hacked a bit on a major revamp of
Is there any interest in moving the dragnet_data repository from seomoz to dragnet-org (this) GitHub account? It would be nice to have the two repos together and under the same administrative control.
On a related note, is there any interest in updating the training data (and retraining the various models)? The HTML in the current data is quite old at this point, so the trained models have never seen, say, HTML5's newer syntactic features. I'm sure content extraction performance on newer webpages suffers. I don't know what the legal issues (if any) are around compiling a new dataset, but if somebody could advise, I would be interested in taking on some of the work.
Lastly, if we opted to compile a new training dataset from scratch, we wouldn't have to move the old repository and could instead just make a new one alongside this one.