Getting older articles #245

Open
tehnar opened this issue May 10, 2016 · 9 comments

tehnar commented May 10, 2016

Hello, is there any way to get more articles from a particular website? I only get the latest news (how many, and how far back, depends on the site), not all of them. Caching is disabled, but that doesn't help.

yprez commented May 10, 2016

@tehnar can you provide a specific example?

tehnar commented May 10, 2016

@yprez
For example, http://blog.jetbrains.com/ruby/
Newspaper thinks there are only 94 articles, while the real number is much larger. The latest article downloaded by newspaper is dated autumn 2015.

There is also another problem with this site: if I then try some other blog (for example, http://blog.jetbrains.com/pycharm/), I get 0 articles. I managed to fix that by manually deleting the contents of ~/.newspaper/feed_category_cache, but that's a strange hack.
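For reference, the manual cleanup amounts to something like this (a minimal sketch; the cache path is just what I see locally and may differ between newspaper versions):

```python
import os
import shutil

# Manual workaround: wipe newspaper's cached feed/category data so the
# next build starts fresh. The path below is the one observed locally
# and may differ between newspaper versions.
cache_dir = os.path.expanduser('~/.newspaper/feed_category_cache')
shutil.rmtree(cache_dir, ignore_errors=True)
```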

yprez commented May 11, 2016

Did you try disabling the cache?
e.g. newspaper.build('http://blog.jetbrains.com/ruby/', memoize_articles=False)

I'm getting 127 articles from http://blog.jetbrains.com/ruby/, not sure whether that's all of them or not.
I think you're getting 0 on the second build because the articles are cached per domain...
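In full, that's roughly this (just a minimal sketch; the exact count can vary between runs):

```python
import newspaper

# Build the source with the article cache disabled and count what's found.
paper = newspaper.build('http://blog.jetbrains.com/ruby/',
                        memoize_articles=False)
print(paper.size())           # 127 for me just now
for article in paper.articles[:5]:
    print(article.url)
```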

yprez added the question label May 11, 2016
tehnar commented May 11, 2016

@yprez
I'm still getting only 95 articles (I disabled caching and removed ~/.newspaper_scraper). There are 32 pages (the last one is http://blog.jetbrains.com/ruby/page/32/) with 10 articles on each, so about 320 articles in total. The first article was published in 2010, while I'm only getting articles published in 2015-2016.

tehnar commented May 13, 2016

Also, publish dates are not extracted properly for all articles.
For example, the publish date for this article http://blog.jetbrains.com/ruby/2016/05/rubymine-2016-1-1-security-update/ is not recognized, while for this article http://blog.jetbrains.com/ruby/2015/12/20-years-of-ruby/ it is extracted properly. It looks like a bug.
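Here is roughly how I'm checking it (a minimal sketch using newspaper's Article API):

```python
from newspaper import Article

# Compare publish_date extraction for the two articles mentioned above.
urls = [
    'http://blog.jetbrains.com/ruby/2016/05/rubymine-2016-1-1-security-update/',
    'http://blog.jetbrains.com/ruby/2015/12/20-years-of-ruby/',
]
for url in urls:
    article = Article(url)
    article.download()
    article.parse()
    print(url, '->', article.publish_date)  # None when no date is recognized
```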

yprez commented May 13, 2016

Funny, but the bug with parsing the date is actually in the 2nd article, the one where it succeeds...

It parses the date from the URL (I couldn't find any meta date attributes in these articles), so the 2015/12/20 part is parsed as a date, and that result is wrong too (20/05/2015 instead of the 22nd).
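To illustrate the ambiguity, a simplified sketch of date-from-URL matching (not newspaper's actual extraction code):

```python
import re
from datetime import datetime

# Simplified illustration only: a /YYYY/MM/DD pattern in the URL path
# happily matches the "20" from "20-years-of-ruby" as the day.
url = 'http://blog.jetbrains.com/ruby/2015/12/20-years-of-ruby/'
match = re.search(r'/(\d{4})/(\d{1,2})/(\d{1,2})', url)
if match:
    year, month, day = map(int, match.groups())
    print(datetime(year, month, day))  # 2015-12-20, not the real publish date
```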

tehnar commented May 13, 2016

@yprez
Oh, really funny bug :)
How about adding some more regexps for different date formats? I suppose not many sites write the date as year/month/day. Actually, I think most websites put the year as the last part of the publish date (but I might be wrong).
And what about the low number of downloaded articles? Is that some kind of bug, or what?

yprez commented May 14, 2016

There was a ticket about trying to find dates with a regex within the article text - #168, and a closed pull request somewhere. The date can't really be taken from the URL if it only contains the year and month.

Regarding the number of articles: newspaper combines several strategies to get a list of articles, e.g. links from the page, categories, RSS feeds, etc. It doesn't go through the pages of paginated results...
The first page http://blog.jetbrains.com/ruby/ only has articles from January 2016, and the same goes for the RSS feed at http://blog.jetbrains.com/ruby/feed/.

To get all the articles you would need to paginate over all the result pages yourself; I don't think that's currently possible with newspaper...
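If you really need the full archive, a rough workaround outside of newspaper itself would be to fetch the archive pages yourself and hand the individual article URLs to Article, something like:

```python
import re

import requests
from newspaper import Article

# Rough workaround sketch (not built into newspaper): fetch each archive
# page, pull out post URLs matching the blog's /YYYY/MM/slug/ pattern,
# then parse each one individually. The page count (32) comes from the
# numbers mentioned earlier in this thread.
article_urls = set()
for page in range(1, 33):
    html = requests.get('http://blog.jetbrains.com/ruby/page/%d/' % page).text
    article_urls.update(
        re.findall(r'http://blog\.jetbrains\.com/ruby/\d{4}/\d{2}/[^"\'#]+/', html))

for url in sorted(article_urls):
    article = Article(url)
    article.download()
    article.parse()
    print(article.publish_date, article.title)
```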

tehnar commented May 14, 2016

OK, I've got it. Thanks for the clarification.
