
updating / moving the training dataset? #85 (Closed)

bdewilde opened this issue Apr 2, 2019 · 20 comments

@bdewilde (Collaborator) commented Apr 2, 2019

Is there any interest in moving the dragnet_data repository from seomoz to dragnet-org (this) GitHub account? It would be nice to have the two repos together and under the same administrative control.

On a related note, is there any interest in updating the training data (and retraining the various models)? The HTML in the current data is quite old at this point, so the trained models don't know how to learn from, say, HTML5's new syntactic features. I'm sure content extraction performance on newer webpages suffers. I don't know what the legal issues are (if any) of compiling a new dataset, but if somebody could advise, I would be interested in taking on some of the work.

Lastly, if we opted to compile a new training dataset from scratch, we wouldn't have to move the old repository at all; we could instead just make a new one alongside this repo.

@matt-peters (Collaborator)

New data would be amazing; as you said, the web has changed substantially in the last few years. I'm up for moving the dragnet_data repo over to this GitHub org. I'd recommend adding the new data to the existing data instead of replacing it completely; this will almost certainly make any model trained on the dataset more robust across different types of markup.

@b4hand (Contributor) commented Apr 4, 2019

FYI, I can't do the org transfer anymore, but we can just clone/fork the repo to the new org. I think it would make sense as well.

@bdewilde (Collaborator, Author) commented Jul 1, 2019

Hi folks! I'm finally ready to move forward on this task. First things first: I'm not able to create a new dragnet-data repository under dragnet-org. Would somebody (want to) give me those permissions?

Next big question is, do we actually want to keep the same "content + comments" setup as before? I had some back-and-forth with @matt-peters a couple(!) years ago — seomoz/dragnet_data#2 — and my needs are the same as then: comments aren't useful (plus, these days they're usually generated via javascript, so don't show up in the raw html), and content could be split into "body text" and "metadata" (byline, pubdate, maybe even image captions, etc.). What do y'all think?

@bdewilde (Collaborator, Author) commented Jul 3, 2019

Currently looking into using (a small subset of) Common Crawl data to build a new training dataset. It should be possible to write code that pulls down a sample of crawled web pages' HTML and text content; manually cleaning up the latter shouldn't be too hard. Since new pages are crawled regularly, we could have a basically endless supply of training data. :) Will keep y'all posted.
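Roughly, the retrieval flow I have in mind looks like the sketch below. Untested code; the crawl ID, bucket URL, and the `fetch_crawled_html` helper are all placeholder assumptions to pin down later:

```python
import gzip
import json

import requests

# placeholder crawl ID; each monthly crawl has its own CDX index endpoint
CDX_API = "https://index.commoncrawl.org/CC-MAIN-2019-26-index"
DATA_URL = "https://commoncrawl.s3.amazonaws.com/{}"

def fetch_crawled_html(url):
    """Pull one page's raw HTML out of the Common Crawl archives."""
    # ask the index where this URL's WARC record lives
    resp = requests.get(CDX_API, params={"url": url, "output": "json"})
    resp.raise_for_status()
    record = json.loads(resp.text.splitlines()[0])
    # fetch only that record's byte range from the archive file
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    chunk = requests.get(
        DATA_URL.format(record["filename"]),
        headers={"Range": f"bytes={start}-{end}"},
    )
    chunk.raise_for_status()
    # each record is its own gzip member: WARC headers, then HTTP headers, then body
    warc = gzip.decompress(chunk.content).decode("utf-8", errors="replace")
    return warc.split("\r\n\r\n", 2)[-1]
```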

@matt-peters (Collaborator)

@bdewilde Thanks for the update; this would be awesome. A refresh of the training data would likely improve the model significantly on newer web pages. If you have the bandwidth to update the training and model code to handle multiple types of content (e.g. body text and metadata), then I'm 100% supportive. If not, folding additional types of content (author, publication date) into the "content" label makes sense, but it would be incompatible with the old annotations, so you'd probably need to annotate at least as many pages (~1000) to match the existing model's performance for that type of content. In any case, thanks for your continued work on this project 💯

@bdewilde (Collaborator, Author) commented Jul 8, 2019

Hi @matt-peters , happy to help! For simplicity's sake, I'm leaning towards lumping everything — title, byline, captions, and main article text — into "content", and skipping comments altogether. There's a case for splitting the metadata out, but it's definitely secondary, and I think it can wait.

The only thing I need now (besides time to pull a new training dataset together 😉) is GitHub permissions to create a new dragnet-data repo under the dragnet-org account, or someone else to do it for me and add me as an author. Could you do that, or point me to someone who can? Thanks a ton!

@bdewilde (Collaborator, Author)

Update: I've manually compiled a training dataset of 200 (html, text) pairs drawn from modern, news-y web pages across a variety of sites and in a variety of languages. The gold-standard, content-only text extractions include

  • title
  • visible byline / publish date info
  • visible image captions
  • main article text
  • visible text within embedded social media posts (e.g. tweets)

and do not include

  • image captions buried in photo galleries
  • photo credits when not appended directly to a caption
  • section/content tags, before or after the main article
  • urls of images displayed in embedded social media posts

Current block-level classification performance is F1 ~0.92. If I combine this dataset with CleanEval (which includes 680 examples), I get up to F1 ~0.95, but I'm not convinced the combined model does a better job on the sort of modern, news-y web pages dominating my dataset. HTML really has changed in the past 10 years!
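(For context, "block-level" here means each page is blockified, every block gets a binary content/boilerplate label against the gold-standard text, and F1 is computed over all blocks pooled across pages; e.g., with scikit-learn, using toy labels for illustration:)

```python
from sklearn.metrics import f1_score

# toy labels for illustration: 1 = content block, 0 = boilerplate block,
# concatenated across all pages in the test split
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]
print(f"block-level F1: {f1_score(y_true, y_pred):.2f}")  # 0.86
```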

I'd like to get to ~500 examples, but this is a slow, not-fun process. Will keep y'all posted.

@b4hand (Contributor) commented Jul 25, 2019

FYI, @bdewilde: Since I added you to the org, you should be able to create repos in it now, so feel free to create the new dragnet-data repo.

@ovidiususan

Any news on this?

@bdewilde (Collaborator, Author) commented May 4, 2020

hi @ovidiususan , i've recently begun another attempt at this — pandemic lockdown has given me some extra free time 🙃 — using a different, more scalable method for pulling together high-quality training data. will post an update here when i have news to share, or just create a new dragnet data repo as discussed above. appreciate your patience; i left this on the back burner much longer than planned.

@bdewilde (Collaborator, Author)

hi folks, i've made some decent progress on this task, and in fact have set up a work-in-progress repo w/ an initial iteration of the data and data-generating code: https://github.com/bdewilde/dragnet_data

i want to finish a few key to-do's — documentation, tests, and actually cleaning / filling in most of the gold-standard texts — then will see about transferring or duplicating the code and data over to this org. will keep y'all posted.

@nehalecky (Contributor)

hey @bdewilde! hope all goes well, how's this effort looking?

@bdewilde (Collaborator, Author)

Hi @nehalecky , very sorry about the delay on this. I've built up a training dataset of ~300 (html, text) pairs out of a total of ~1400, but progress is slow, and I keep getting detoured by other side projects. 🤦‍♂️ I'll try to push other projects aside so I can focus on finishing this project over the next few weeks. Will let you know here how it goes... 🙏

@nehalecky (Contributor)

Hi @bdewilde! Wow, thanks for the quick reply, and totally sympathize: data labeling is hard work. 😓
This quarter, we're (at https://github.com/bomboradata) putting effort into enhancing our content extraction performance, and we'd like to contribute labeled data to this effort. A few questions:

  1. Appreciate your description of the workflows here (https://github.com/bdewilde/dragnet_data#methodology-and-data), but we're curious how they might be enhanced by a data labeling and annotation tool such as https://github.com/heartexlabs/label-studio or https://prodi.gy/?
  2. There is still no license on https://github.com/bdewilde/dragnet_data. In particular, we wanted to know how you and https://github.com/dragnet-org would view granting the labeled data a permissive license that allows commercial use (e.g., https://creativecommons.org/publicdomain/zero/1.0/)? That would make it much easier for us to contribute.
  3. Finally, do you or the community know of any other efforts to advance state-of-the-art open-source software around content extraction use cases?

Thanks much, appreciated!

@bdewilde (Collaborator, Author)

Oh gosh, thank you tons for the offer to help! I set the code up in such a way that it's locked me into the original ~1400 pages — a large fraction of which are about the early days of the covid-19 pandemic, so both repetitive and bleak — but I've been meaning to restructure so that I or multiple people could extract gold-standard texts in more manageable chunks. Will try to implement a good method for this ASAP.

As for your questions:

  1. I've looked at both of those options in particular, but neither seems to have a good built-in solution for this task. I found one tool that's almost in the same ballpark, but I couldn't figure out how to adapt it. I tried to automate part of the task by programmatically extracting page metadata/text from the HTML when possible, but too large a fraction of those extractions are noisy and not of "gold-standard" quality. The manual method I wrote up and follow is slower but much safer. If you know of a good, managed workflow tool, please point me to it!
  2. To be totally honest, I've always been confused by code/data licensing, but I'm inclined toward a very permissive license. I don't know if that'll be CC0 1.0, or MIT, or something else. Input and expertise would be appreciated!
  3. I've scouted broadly but not deeply into more recent / "state-of-the-art" methods for the HTML content extraction task, and nothing has really struck me as a huge improvement over what dragnet does. Excluding the methods that fully render a page (JavaScript and all!) and then use computer vision to identify the main body text, most methods seem to "blockify" text and perform binary classification on the blocks based on a mix of structural and content-based features (see the sketch below). That's not to say we can't and shouldn't improve on the existing algo, just that we may not have to reinvent the wheel here. :) I have some ideas, but I've been waiting on a fresh training dataset (again, sorry about the delay...) before committing to any new methods.
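A minimal sketch of that "blockify then classify" pattern, with made-up features that are illustrative only (not dragnet's actual feature set):

```python
from lxml import html as lxml_html

BLOCK_TAGS = {"p", "li", "h1", "h2", "h3", "blockquote"}

def blockify(raw_html):
    """Split a page into text blocks with simple structural features."""
    tree = lxml_html.fromstring(raw_html)
    blocks = []
    for el in tree.iter():
        if el.tag not in BLOCK_TAGS:
            continue
        text = " ".join(el.text_content().split())
        if not text:
            continue
        link_chars = sum(len(a.text_content()) for a in el.findall(".//a"))
        blocks.append({
            "text": text,
            "n_chars": len(text),
            # high link density is a classic boilerplate signal
            "link_density": link_chars / len(text),
        })
    return blocks

# From here, a binary classifier (logistic regression, gradient-boosted
# trees, etc.) is fit on (block features, is-content label) pairs derived
# from the gold-standard text extractions.
```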

@matt-peters (Collaborator) commented Jul 16, 2020

Nice to see work on this front. FWIW, I'm all for changing the license on dragnet_data to something that allows permissive re-use. EDIT: I see the original dragnet_data is still owned by the seomoz org and wasn't moved over to dragnet-org. In that case, we'd need to work with someone there to change the license, as I'm no longer a member.

@bdewilde (Collaborator, Author)

Hi @nehalecky , I've made some changes to the code so that it's easier to do and track gold-standard extractions in batches, which should be nice for me working alone but, with a couple minor git additions (branching, rebasing, and pull requests!), should also work for multiple annotators. I think. Here is the process — does it make sense?

@bdewilde (Collaborator, Author) commented Aug 5, 2020

Hi all, I've finished compiling 500 (HTML, gold-standard text extraction) example pairs, have added permissive licenses for both the associated code and data, and have transferred ownership of the project to this group: https://github.com/dragnet-org/dragnet_data

The project readme includes thorough instructions on how to add additional examples to the dataset, if one were so inclined, but I think 500 is a fine enough start. There are no tests, for which I hope you'll forgive me.

Given all this, I'm going to close this issue out. Mission accomplished!

bdewilde closed this as completed Aug 5, 2020
@shakeelsoogun

As a user of this library, I greatly appreciate the hard work done here - anything that improves the quality is always a bonus, and I definitely sympathise with the pain of manually building the gold standard! Just wondering: since there was already a model bundled with this library, are there any plans to push a newly trained version of the model (either just to this repo, or to PyPI as well), or is the recommendation now to do this ourselves? It would be nice to be able to just consume one already trained, but I equally don't mind doing it myself, since the instructions to train it aren't too difficult.
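For reference, "consuming one already done" currently looks like this, per dragnet's README (the article URL below is just a placeholder):

```python
import requests
from dragnet import extract_content  # dragnet's documented entry point

resp = requests.get("https://example.com/some-article")  # placeholder URL
print(extract_content(resp.content))  # main article text, sans boilerplate
```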

@bdewilde (Collaborator, Author)

@shakeelsoogun Thanks for asking :) I've hacked a bit on a major revamp of dragnet's underlying models / methodology, since there's been some progress on this task over the past few years, but honestly I haven't had the bandwidth to do much. It would be a lower lift to adapt the current setup to the new data, but that would still require changes to dragnet's code base, owing to changes in the structure and content of the new training dataset. This is on my to-do list, but I don't have any guesses on timelines. Don't let me deter you from training your own!
