#updated? not working #218

Closed
RedFred7 opened this Issue Apr 9, 2014 · 11 comments

RedFred7 commented Apr 9, 2014

require 'feedjira'


feed = Feedjira::Feed.fetch_and_parse 'http://feeds.bbci.co.uk/news/rss.xml'
puts feed.title
puts feed.entries.length

@flag = true

while @flag
  sleep(20)
  updated_feed = Feedjira::Feed.update(feed)

  if updated_feed.updated?
    puts updated_feed.new_entries.length
    @flag = false
  end
end

I'm running the above code (Ruby 1.9.3). I'm monitoring the URL on my feed reader's window alongside my console. The updated_feed.updated? method seems to be returning true or false totally randomly. It may return true when the feed hasn't been updated and it may return false when the feed has been updated.

updated_feed.new_entries.length will always be 0.

any thoughts?

jonallured Apr 11, 2014

Member

Hi @RedFred7, sorry this isn't working as you'd expect - to be totally honest, I don't think very highly of the update features of Feedjira. What I recommend is that users stick to the fetch_and_parse method.

Can you tell me a little more about what you're doing with the library - maybe I can help point you in the right direction.

Thanks!
Jon

RedFred7 Apr 11, 2014

Hi Jon and thanks for the quick reply.

I'm working on a collaboration app that requires RSS integration, that
is the ability to monitor certain feeds and let the user know when
there are updates and what the updates are. So, for example, if the
user wants to monitor BBC news, the app needs to check the feed every
x minutes and -if it's updated- let the user know what the update is.

The trouble is, as I said in the issue description, that #updated?
seems to return true or false randomly, even when there are no updates.
When there are updates and #updated? is true, there are no new entries:
updated_feed.new_entries.length is 0.

I've been getting round the issue by keeping track of the feed entries
pre- and post-update and comparing them, but I keep bumping into some
other problems that I can't get into without bloating this email reply.

Couple of questions if I may:

  1. Is the latest entry always pushed to the top of the array, i.e. is
    feed.entries.first always the latest item?

  2. Will sorting the entries array by the published date help me
    determine the latest entry?

Any guidance on what I need to do in order to know when the feed's
updated and what the updates are would be greatly appreciated.

thanks!

Fred Heath

jonallured Apr 12, 2014

Member

Hey @RedFred7, thanks for taking the time to write that out - hopefully I can provide some insight!

I think you're trying to decide if you can trust latest_entry and this should help:

# lib/feedjira/feed_utilities.rb
def find_new_entries_for(feed)
  return feed.entries if self.entries.length == 0
  latest_entry = self.entries.first
  found_new_entries = []
  feed.entries.each do |entry|
    if entry.entry_id.nil? && latest_entry.entry_id.nil?
      break if entry.url == latest_entry.url
    else
      break if entry.entry_id == latest_entry.entry_id || entry.url == latest_entry.url
    end
    found_new_entries << entry
  end
  found_new_entries
end

Here you can see that latest_entry is being set to entries.first. The find_new_entries_for method is called by update_from_feed in the same file, which is what Feedjira::Feed.update ends up calling.

But more broadly, I wanted to talk strategy for a sec. Like I said in my comment yesterday, I don't feel very good about the update parts of Feedjira. Even if they worked a little more consistently, I think the approach is pretty naive.

If you take a look at that find_new_entries_for method, I think you'll see what I mean. Once we've got the latest_entry set, we loop through the entries that were just found and for each one, try to determine if it's "new". The criterion for an entry being "new" is all about entry_id and url.

The reality is that feed authors do all sorts of wacky things and break this implementation. What if an article in the feed has a typo in the url and they "fix" it after you've already seen the article? Should that be a new entry or not? Currently, it would be added to the list. What if the content of the post has been updated, but entry_id and url don't change? Should that be a new entry? Because with this implementation it wouldn't be.

What if the order of the posts changes?

There are all kinds of business rules here that a Feedjira user will need to decide for themselves and thus, I believe the update stuff is of no value. I think users of Feedjira should stick to fetch_and_parse and define what they want to happen in their code.

If you want to see an implementation of updating your Feeds in a Rails context, I'd recommend you take a look at Stringer. It's a user of Feedjira and I like how the updating works; see app/tasks/fetch_feed.rb and app/commands/feeds/find_new_stories.rb.

The business logic here is still a little naive, but that was the choice they made - they didn't try to rely on Feedjira deciding which articles were new, they just fetch them all each time the job is run and then use their own code to find the ones that are new. Simple and completely in their control.
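
As a rough illustration of that "fetch everything, then decide for yourself" approach, here is a minimal sketch. It is not Stringer's code and not part of Feedjira: the Entry struct is a hypothetical stand-in for parsed entries (it only assumes they respond to entry_id and url), and new_entries is an invented helper name. Unlike find_new_entries_for, it checks every entry instead of breaking at the first known one.

```ruby
require "set"

# Hypothetical stand-in for a parsed feed entry; the sketch only relies
# on entry_id and url being available.
Entry = Struct.new(:entry_id, :url, :title)

# Returns the entries whose identity key has not been seen before and
# records those keys in `seen`. Every entry is checked, so a change in
# feed ordering does not hide new items.
def new_entries(entries, seen)
  entries.select do |entry|
    key = entry.entry_id || entry.url # fall back to the URL when there is no id
    seen.add?(key)                    # Set#add? returns nil if the key was already present
  end
end

seen = Set.new
first_fetch  = [Entry.new("id-1", "http://example.com/a", "A")]
second_fetch = [Entry.new("id-2", "http://example.com/b", "B"),
                Entry.new("id-1", "http://example.com/a", "A")]

puts new_entries(first_fetch, seen).map(&:title).inspect
puts new_entries(second_fetch, seen).map(&:title).inspect
```

In a real app the entries would come from a fresh fetch_and_parse call on each run, and what counts as the identity key is exactly the kind of business rule you would pick yourself.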

Sorry for the novel, I've been sketching what Feedjira 2.0 might look like recently and so this stuff has been on my mind. I hope that helps and I'd be really interested in any feedback you might have.

Thanks!
Jon

swanson Apr 13, 2014

Contributor

Will chime in (with another novel - sorry!) and say that things aren't all sunshine and rainbows for Stringer either :)

I think a simple but largely effective approach is to use the URL as a unique constraint. Trying to break out early is dangerous - most feeds are ordered by a timestamp (which could be when it was created, updated, published, etc) but there is nothing stopping someone from publishing a draft that was written two weeks ago (and timestamped in the past) at a later date.

You can't really trust the last_modified field at the root of some feed elements either; we've encountered a mismatch in some rather popular feeds where the last modified date is not correct.

Even if you completely distrust the feed's timestamps and use a unique URL scheme as I mentioned above - there are some feeds that just update posts with new information, but keep the same URL. One particular example I remember was an auction site that would constantly be updated as items were sold/posted for sale. You can't really know if the author was fixing a typo (probably not necessary to alert an end user) or a larger content update (probably does warrant an alert).
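
If same-URL updates matter for your app, one possible way to at least notice them is to keep a digest of each entry's content next to its URL. This is only a sketch under assumptions: Item is a hypothetical stand-in for a parsed entry, classify is an invented helper, and hashing title plus content is just one choice of what "changed" means. It cannot tell a typo fix from a substantial update.

```ruby
require "digest"

# Hypothetical stand-in for a parsed feed entry.
Item = Struct.new(:url, :title, :content)

# Groups entries into :new, :changed and :unchanged by comparing a digest
# of title + content against the digest last stored for the same URL.
# `digests` is a Hash of url => hex digest and is updated in place.
def classify(entries, digests)
  entries.group_by do |entry|
    digest = Digest::SHA256.hexdigest("#{entry.title}\n#{entry.content}")
    previous = digests[entry.url]
    digests[entry.url] = digest

    if previous.nil?
      :new
    elsif previous != digest
      :changed
    else
      :unchanged
    end
  end
end

digests = {}
classify([Item.new("http://example.com/lot/1", "Lot 1", "For sale")], digests)
result = classify([Item.new("http://example.com/lot/1", "Lot 1", "Sold")], digests)
puts result.keys.inspect
```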

You will still find some cases where a blogger moves platforms or domains and all of the URLs change - now you've got 100 "new posts" that are actually old content!

For your particular use case: I would recommend you store the URLs of every entry in the feed in a database (or some other persistent source) and detect "new entries" by checking for the existence of the URL. I think you'll find that this gets you 90% of the way there with much less headache than trying to compare timestamps.
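
A minimal sketch of that idea, assuming a plain text file as the "persistent source" (one URL per line). SeenUrlStore and record_new are hypothetical names made up for this example; a database table with a unique index on the URL would do the same job.

```ruby
require "set"

# Hypothetical file-backed store of URLs that have already been seen.
class SeenUrlStore
  def initialize(path)
    @path = path
    @urls = File.exist?(path) ? Set.new(File.readlines(path, chomp: true)) : Set.new
  end

  # Returns the URLs that were not seen before, and persists them so the
  # next run (or process) treats them as known.
  def record_new(urls)
    fresh = urls.uniq.reject { |url| @urls.include?(url) }
    unless fresh.empty?
      File.open(@path, "a") { |file| fresh.each { |url| file.puts(url) } }
      @urls.merge(fresh)
    end
    fresh
  end
end
```

Each polling run would then pass in the URLs of all entries from the latest fetch and notify users only about whatever record_new returns.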

RedFred7 Apr 14, 2014

Hi guys,

thanks very much for both your replies. I see what you're getting at,
there is no standard protocol for updating RSS so I need to do it the
'manual' way and use what makes sense in my domain. No silver bullet
for RSS, so to speak :)

My only concern is the extra overhead it takes to keep track of all
'current' items' URLs for each feed. We're currently building RSS
integration for http://Honbu.io and potentially we'll be keeping track
of thousands of feeds (all key-value stores in-memory) so we can notify
our users when something changes. Having to keep hold of a potentially
large feed-item array for each feed increases the memory requirements
considerably. I suppose I'll have to deal with it with some kind of
overflow logic in the code for now, but c'est la vie, as they say.
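
One way such overflow logic might look, as a sketch only: keep a short fixed-size digest per URL instead of the URL itself, and cap how many are remembered per feed. BoundedSeenSet is a hypothetical name, and both the 8-byte digest prefix and the oldest-first eviction are assumptions made for illustration.

```ruby
require "digest"

# Hypothetical bounded set of seen entry URLs for a single feed.
class BoundedSeenSet
  def initialize(limit)
    @limit = limit
    @keys = {} # Ruby hashes preserve insertion order, so the first key is the oldest
  end

  # Returns true if the URL was not already remembered, false otherwise.
  def add?(url)
    key = Digest::SHA256.digest(url)[0, 8] # keep 8 bytes per URL (assumed sufficient here)
    return false if @keys.key?(key)

    @keys.shift if @keys.size >= @limit # evict the oldest remembered key
    @keys[key] = true
    true
  end

  def size
    @keys.size
  end
end
```

The trade-off is that an evicted URL will be reported as new if it shows up again, so the limit would need to be comfortably larger than the number of items the feed carries at once.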

Ideally, what I'd like from an RSS gem or service would be an
asynchronous callback every time the feed is updated. Then I could
decide if and when to notify my users and wouldn't have to
fetch-parse-compare every x minutes. From a design perspective I try
to stick to the single-responsibility principle so I kind of resent if
my app starts acquiring knowledge of things that are outside its
immediate scope. At the same time I have to be pragmatic about such
things, so fetch-parse-compare it is!

If that's agreeable with Jon, maybe I could fork Feedjira and try to
add some functionality towards that end? Naturally certain assumptions
would have to be made, as you point out in your replies, such as new
url = new content, ignore timestamps, and so on.

sorry about the length of this email, got myself in the 'novel' mood. :D

thanks,

Fred

swanson Apr 14, 2014

Contributor

@RedFred7 you may want to have a look at https://superfeedr.com/ then - they handle all the feed parsing and hit a callback when new items are added. I don't know how robust their "new post detection" algorithms are, but I assume they have at least the same functionality as feedjira (if not better).

Maybe @julien51 can weigh in :)

RedFred7 Apr 14, 2014

very interesting! I'll have a play with their API and see how it goes,

thanks Matt

julien51 Apr 14, 2014

Contributor

Thanks @swanson for the mention! @RedFred7 I'd be happy to help directly should you have any question/problem. Feel free to post to https://github.com/superfeedr/documentation/issues?state=open

jonallured Apr 20, 2014

Member

Hi @RedFred7, I'm going to go ahead and close this issue - thanks for bringing this up and hopefully you got what you needed from this discussion. Let me know if you need anything else!

Jon

@jonallured jonallured closed this Apr 20, 2014

RedFred7 Apr 21, 2014

Hi Jon,

yes please do close this. Also, thanks for the insights during our
brief discussion.

Fred
