ignore undateable domains more intentionally #34

Open
rahulbot opened this issue Aug 2, 2021 · 7 comments
Labels
question Further information is requested

Comments

@rahulbot
Contributor

rahulbot commented Aug 2, 2021

In our testing, the current code produces unreliable results on Wikipedia articles: sometimes it returns a date, sometimes it doesn't. Wikipedia articles are constantly updated, so @coreydockser and I would like to propose changing it so that it returns no date if the URL is a wikipedia.org one. In our broader experience with Media Cloud this produces more useful results (for our open web news analysis context).

In terms of implementation, we could just copy the filter_url_for_undateable function from date_guesser and use it as is, so that the other checks it does for undateable domains are included too. We'd call it early on in guess_date.
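For illustration, here is a minimal sketch of what such an early check might look like. It is not the date_guesser implementation: the names `UNDATEABLE_DOMAINS`, `is_undateable_url` and `guess_date_with_filter` are hypothetical, the domain list only contains the one domain discussed here, and `extractor` stands in for whatever date extraction function is actually used.

```python
from typing import Callable, Optional
from urllib.parse import urlparse

# Hypothetical list for illustration only; date_guesser keeps its own checks.
UNDATEABLE_DOMAINS = {"wikipedia.org"}


def is_undateable_url(url: str) -> bool:
    """Return True if the URL's host is an undateable domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in UNDATEABLE_DOMAINS)


def guess_date_with_filter(
    url: str,
    html: str,
    extractor: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Skip extraction entirely for undateable domains, otherwise delegate."""
    if is_undateable_url(url):
        return None
    return extractor(html)
```

Matching on the parsed hostname rather than on the raw URL string avoids false positives such as a path or query string that merely contains "wikipedia.org".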

adbar added the enhancement (New feature or request) label on Aug 2, 2021
@adbar
Owner

adbar commented Aug 2, 2021

Hi @rahulbot, it would be OK but I'd prefer to get the chance to tackle the problem first.
There is certainly a field in the HTML from which the date can be extracted. Would you mind giving examples of pages where the result wasn't as expected?

@rahulbot
Contributor Author

@coreydockser can you please provide an example of a Wikipedia page that does return a publication date, and one that does not?

@coreydockser
Contributor

Sorry for the delay, I ran into some odd issues of my own making. Anyways, here's a sample of four articles with different results.

https://en.wikipedia.org/wiki/Among_Us – returns None (this is the behavior we want)

https://en.wikipedia.org/wiki/January_1969 – returns 2018-06-19; this date appears as datePublished in the HTML

https://en.wikipedia.org/wiki/F-scale_(personality_test) – returns 2005-07-05. The datePublished on this page is 2005-07-25, though, so I'm unsure where the returned date came from.

https://en.wikipedia.org/wiki/2021_United_States_Capitol_attack – returns 2021-01-06; this is the date of the event, but it's also the datePublished.
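To check which dates a page declares, one option is to read the datePublished and dateModified fields from its embedded JSON-LD metadata. The sketch below assumes the page carries a `<script type="application/ld+json">` block; the function name `jsonld_dates` is hypothetical, the sample HTML and its dates are made up, and this is not how htmldate itself works internally.

```python
import json
import re
from typing import Dict, Optional

# Matches the body of <script type="application/ld+json"> ... </script> blocks.
JSONLD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def jsonld_dates(html: str) -> Dict[str, Optional[str]]:
    """Return the first datePublished and dateModified strings found in JSON-LD."""
    found: Dict[str, Optional[str]] = {"datePublished": None, "dateModified": None}
    for block in JSONLD_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed metadata instead of failing
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            for key in found:
                if found[key] is None and isinstance(item.get(key), str):
                    found[key] = item[key]
    return found


# Made-up sample page; the dates are illustrative, not taken from Wikipedia.
sample = """<html><head><script type="application/ld+json">
{"@type": "Article",
 "datePublished": "2001-01-01T00:00:00Z",
 "dateModified": "2020-02-02T00:00:00Z"}
</script></head><body></body></html>"""

print(jsonld_dates(sample))
```

Comparing these two fields with the date returned by the library shows whether a result corresponds to the page creation date, the last edit, or neither.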

@adbar
Owner

adbar commented Aug 24, 2021

@coreydockser Thanks, I'll look at it and see if I can find a solution.

@adbar
Owner

adbar commented Sep 14, 2021

Hi @coreydockser, I checked the cases and I don't agree with you at all:

  • A few results were different (maybe you didn't try the latest version).
  • Besides, None cannot possibly be the expected behavior since there is information to be found in the page.
  • Most importantly, htmldate extracts both modified and original dates correctly; for these pages, those are the last edit and page creation dates.

So I fail to grasp where the problem lies. Could you please be more specific and/or provide further examples for other websites?

adbar added the question (Further information is requested) label and removed the enhancement (New feature or request) label on Sep 14, 2021
@rahulbot
Contributor Author

rahulbot commented Oct 7, 2021

The library version issue could explain some of those specific results. However, the second piece is more a question of your intentions. In our projects, "publication date" means the date a news article was listed as being published online. That is rooted in ideas from the historical news industry (despite edits and iterations of online stories becoming more commonplace). Wikipedia articles are meant to be living documents, so for us they don't have a "publication date" in that sense. This is important for our time-series based analysis of news attention.

So I guess one way to state the question is this: for this library, do you intend "publication date" to have a technology-informed definition such as the date of last edit? Or do you want a more "news-ish" definition like the one we use?

It sounds like it is more the former, in which case there are no "undateable" domains. If that is what you intend, then we can close this issue as won't-fix and we can handle the idea of "undateable" domains based on our project definition in our own code before we pass content into htmldate.

Thanks for any clarifications and your great work on this library!

@adbar
Owner

adbar commented Oct 15, 2021

Thanks for the explanations, I get your point. Indeed, htmldate mostly provides a technology-informed concept of dating. It hopefully intersects with the news-ish definition in most cases; however, the two may differ.

I guess it would be possible to focus on a "news-ish" understanding of publication date by setting an additional parameter prior to the extraction. What would be the formal requirements for it to happen?

I'm leaving this thread open to see if we can address the issue.


3 participants