Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added published date to the extractor+article #63

Merged
merged 2 commits into from Jul 28, 2014

Conversation

parhammmm
Copy link
Contributor

Nothing fancy just checks the meta tags and then if nothing is there checks the URL for the published date.

@codelucas
Copy link
Owner

Thanks for this PR, will read & test it tomorrow.

@adamlwgriffiths
Copy link

You could also parse the article HTML too using dateutil if no date is in the meta tags.
http://stackoverflow.com/a/3276459/1591957

Not sure if those tags would be trimmed by newspaper's extractor though.

@parhammmm
Copy link
Contributor Author

It would be hard to identify the correct date, there's usually more than one date on a page. After digging around I found only in a small number of cases the date was neither in the meta or the URL.

@codelucas
Copy link
Owner

@SkinnyP Thanks again for this PR. After reading it over more and testing it locally my thoughts are:

1.) Reference: https://github.com/skinnyp/newspaper/blob/master/newspaper/article.py#L466
L466-L467 are not needed because the set_published_date method will only be called internally and the returned value from extractors.py is always of datetime.datetime. Also, remove the import datetime from articles.py.

2.) Reference: https://github.com/skinnyp/newspaper/blob/master/newspaper/extractors.py#L226
I see your strategy for trying to guess out the date from the article's URL. Smart stuff, this exact same technique is used by newspaper to guess if a URL is a valid news URL or not.

However, for something as important as publishing date, we can't afford to give any incorrect value. Having a URL guesser like this can result in bogus values which are hard to debug. Please remove lines 226-232 of your extractors.py.

Otherwise nice work! After fixing the two nits above, add 1-2 unit tests of this working and i'll be ready to accept this pull request. Thanks again!

codelucas added a commit that referenced this pull request Jul 28, 2014
Added published date to the extractor+article
@codelucas codelucas merged commit 9179db5 into codelucas:master Jul 28, 2014
@codelucas
Copy link
Owner

^ Oops! I just accidentally merged this commit. I then reverted that commit in a seperate commit and now this entire issue can't be re-opened.

Oh well, @SkinnyP, note my above comment and just submit a new pull request with the new changes!
Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants