Issue: Duplicated Feed Entries #23

Closed
ghost opened this Issue Mar 27, 2013 · 30 comments

@ghost

ghost commented Mar 27, 2013

This issue started happening earlier today...

I have cleared out my unread list several times this afternoon, and the duplicates come back every time there are new, unread entries. I'm not sure it's always the same duplicates - but I do know that the ones pictured below do recur often...

Also, the last time I read through all unread entries, I clicked "Mark all as read" to see if that would help. It apparently did not.

feedbin-duplicates

@benubois


Member

benubois commented Mar 28, 2013

Thanks, I'm seeing the duplicate entry issue on some feeds. Looking into it...

@benubois


Member

benubois commented Mar 28, 2013

Fixed.

There will be many duplicates from high-volume feeds today, but no more going forward.

@benubois benubois closed this Mar 28, 2013

@recurser


recurser commented May 3, 2013

Seeing quite a few doubled-up posts pointing at the same URL:

example1

example2

@benubois


Member

benubois commented May 3, 2013

Yeah, this is a major problem on Hacker News. An entry on HN looks like

<item>
  <title>Portal Released For Steam On Linux</title>
  <link>http://www.phoronix.com/scan.php?page=news_item&amp;px=MTM2Mzk</link>
  <comments>https://news.ycombinator.com/item?id=5647914</comments>
  <description><![CDATA[<a href="https://news.ycombinator.com/item?id=5647914">Comments</a>]]></description>
</item>

So there isn't much to uniquely identify items by. In cases where a publisher does not provide an <id>, <guid>, or <published>, Feedbin uses a combination of the link and title to attempt to uniquely identify items.

In your example, a period was added to the headline later on, which makes it look like a duplicate to a human but unique to the ID generator.

@recurser


recurser commented May 3, 2013

Aha, I see what you mean... I didn't notice that the titles are slightly different. Curious, since it's the official HN feed and they have the same HN post ID in the link, so it's the same canonical 'article', so to speak. I guess these are cases of the title being edited by HN mods after Feedbin has already picked them up?

So there isn't much to uniquely identify items by. In cases where a publisher does not provide
an <id>, <guid>, or <published>, Feedbin uses a combination of the link and title to attempt
to uniquely identify items.

In the case of Hacker News, item?id=5647914 uniquely identifies it, though I realise it's a slippery slope once you start customizing things on a feed-by-feed basis.

Thanks for the explanation!

@benubois


Member

benubois commented May 3, 2013

In the case of Hacker News, item?id=5647914 uniquely identifies it, though I realise it's a slippery slope once you start customizing things on a feed-by-feed basis.

Hehe, exactly. As I was posting the example I noticed that the <description> on Hacker News would make for a great ID, but that would be totally unique to them.

If I do start customizing the strategy for certain feeds, this is first on my list.

@nissimk


nissimk commented May 3, 2013

Use a hash of the item:

http://swik.net/RSS/RSS+Item+Uniqueness
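For reference, the approach described at that link boils down to something like this (a minimal sketch, not Feedbin code; item_xml stands for the raw <item> markup of a single entry):

require 'digest/sha1'

# Hash the entire raw <item> markup to get a "somewhat unique" id.
# Any change inside the item (e.g. an edited title) produces a new hash.
def item_fingerprint(item_xml)
  Digest::SHA1.hexdigest(item_xml)
end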

@recurser


recurser commented May 3, 2013

@nissimk

Hash – most common method, simply hashing the entire item results in a somewhat unique id. This is however vulnerable to repeated feed items.

The problem is that the titles of articles change over time, which makes hashing pretty difficult in the absence of a canonical id (?)

@benubois

Member

benubois commented May 3, 2013

Here's the id strategy for Feedbin, definitely open to suggestions, although any changes would have to maintain backward compatibility so duplicates of old entries are not created:

def build_public_id(entry, feedzirra, saved_feed_url = nil)
  if saved_feed_url
    id_string = saved_feed_url.dup
  else      
    id_string = feedzirra.feed_url.dup
  end

  if entry.entry_id
    id_string << entry.entry_id.dup
  else
    if entry.url
      id_string << entry.url.dup
    end
    if entry.published
      id_string << entry.published.iso8601
    end
    if entry.title
      id_string << entry.title.dup
    end
  end
  Digest::SHA1.hexdigest(id_string)
end
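To make the failure mode concrete, here is a small illustration (not part of the Feedbin codebase) of how the link + title fallback behaves when a headline is edited; the link comes from the HN example above, while the feed URL and the second title are made up:

require 'digest/sha1'

feed_url = 'https://news.ycombinator.com/rss'
link     = 'http://www.phoronix.com/scan.php?page=news_item&px=MTM2Mzk'

['Portal Released For Steam On Linux',
 'Portal Released For Steam On Linux.'].each do |title|
  # No <guid>, <id> or <published> in the item, so the id is built from link + title
  id_string = feed_url + link + title
  puts Digest::SHA1.hexdigest(id_string)
end
# Two different digests, so the edited headline is imported again as a "new" entry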
@nissimk


nissimk commented May 3, 2013

What about just excluding the title field from the item before hashing?

@benubois


Member

benubois commented May 3, 2013

What about just excluding the title field from the item before hashing?

Definitely something I considered.

There are feeds that link to the same story multiple times, so the link is not necessarily unique.

At the time I was thinking it would be better to create a duplicate than not import the item at all and potentially have missing unique items. I'm not sure which is the more common case.

Certainly with HN this is a bigger problem than other feeds.
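For context, a quick illustration (again not Feedbin code; the link and titles are hypothetical) of why dropping the title from the id can silently merge distinct items that share a link:

require 'digest/sha1'

link   = 'https://example.com/live-blog'   # a feed that reuses one link for several items
titles = ['Update 1', 'Update 2']

ids_with_title    = titles.map { |title| Digest::SHA1.hexdigest(link + title) }
ids_without_title = titles.map { |_title| Digest::SHA1.hexdigest(link) }

puts ids_with_title.uniq.length     # => 2, both items are imported
puts ids_without_title.uniq.length  # => 1, the second item would be skipped as a duplicate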

@ghost


ghost commented May 3, 2013

Note: This has also been happening to me pretty often on Engadget's feed, if you need another test case...

@Zegnat


Contributor

Zegnat commented May 4, 2013

@roomanitarian that’s weird, are you subscribed to http://www.engadget.com/rss.xml? That one includes guid elements for Feedbin to use, so there should not be any duplicates. If there are any duplicates, it’s either because Feedbin is broken or because Engadget is changing their own unique IDs (which would be completely silly and probably means a broken CMS on their part).

@ghost


ghost commented May 4, 2013

@Zegnat

Actually, I just removed that feed last week, due mainly to lack of interest on my part...

However, I just added it back in to see if I could find any duplicates... I didn't have to look very hard to find these 2 sets:

engadget-dupes

@recurser


recurser commented May 4, 2013

It might not be ideal, but would it be possible to make 'ignore duplicate URLs' or similar a global option in feedbin settings?

@Zegnat


Contributor

Zegnat commented May 4, 2013

That’s really odd @recurser.

@benubois are you sure you are using guid tags? Looking at the Aspire example, I believe Engadget did not change the guid value, so it should not have been duplicated.

@benubois


Member

benubois commented May 4, 2013

I see what the problem is.

Feedbin uses Feedzirra for XML parsing.

In almost all cases, <guid> and <id> are normalized into entry_id.

The exception here, and the source of this particular problem is the Feedzirra::Parser::ITunesRSSItem strategy.

In this case the <guid> is NOT being normalized to entry_id so Feedbin falls back to not including the entry_id at all and instead uses link + title.

A fix for this is tricky. If the problem were fixed upstream or in the Feedbin fork of Feedzirra, duplicates would be created for every entry of every feed that uses Feedzirra::Parser::ITunesRSSItem, so that's no good.

One workaround would be to do something like:

if entry.published > DATE_OF_ITUNES_BUG_FIX
  if entry.entry_id
    entry.entry_id  = entry.entry_id.strip
  elsif entry.guid
    entry.entry_id  = entry.guid
  else
    entry.entry_id = nil
  end
else
  entry.entry_id = entry.entry_id ? entry.entry_id.strip : nil
end

The other alternative would be to generate two ids, one for the entry before the fix and one after. I think this is more work long term because then every item needs to be checked for dupes twice forever.

Does anyone see any potential issues with fix 1?

@benubois benubois reopened this May 4, 2013

@recurser


recurser commented May 8, 2013

Looks good to me 👍

@andypearson


andypearson commented Jun 20, 2013

+1 to say I care about this issue :)

@nickel715


nickel715 commented Jul 19, 2013

+1, the problem is still present, in my case in a feed from DokuWiki.

@Zegnat


Contributor

Zegnat commented Jul 19, 2013

@nickel715, do you have an exact URL?

@nickel715

@Zegnat


Contributor

Zegnat commented Jul 19, 2013

Hmm, that’s unrelated to this then. That feed does not seem to be an iTunes feed, so the duplicates are there for a separate reason. It might be related to the Pinboard feed issue, as they both seem to use RSS 1.0 (RDF) syntax.

@joshhinman


joshhinman commented Sep 3, 2013

I'm still seeing this with several of my feeds, most notably Macworld (http://rss.macworld.com/macworld/feeds/main) and the LA Times (http://feeds2.feedburner.com/lanowblog).

@benubois


Member

benubois commented Sep 3, 2013

@joshhinman Both the Macworld and LA Times feeds don't include an <id> or <guid>, so Feedbin makes one up based on the title and link. The issue with this is that if the title or link changes at all, the ID changes too, so the entry looks like a duplicate.

@recurser


recurser commented Apr 1, 2014

@benubois


Member

benubois commented Apr 1, 2014

@recurser,

Thanks, I added a comment.

@fma16


fma16 commented Apr 16, 2014

Hi everyone,
I don't know if it's the same error, but the PCInpact private feed (they give one to their premium subscribers, like me) seems to suffer from this duplicate bug.
The feed (see https://gist.github.com/Zegnat/e0524aa33fb2b286f778)

feedbin 244 2014-04-16 12-59-48

@Zegnat


Contributor

Zegnat commented Apr 16, 2014

Here is a dump of the feed. Feel free to remove the link to your private feed. I don’t see a problem there, though; the items have <guid> elements et al.

@benubois


Member

benubois commented Apr 17, 2014

Looks like they may have just switched domains nextinpact.com -> pcinpact.com.

This can cause duplicates when the domain is part of the guid.

Here's an example of a duplicated item with two distinct guids: "Les Google Glass se sont bien vendues aux États-Unis et passent à Android 4.4"

http://www.nextinpact.com/news/87085-les-google-glass-se-sont-bien-vendues-aux-etats-unis-et-passent-a-android-4-4.htm
http://www.pcinpact.com/news/87085-les-google-glass-se-sont-bien-vendues-aux-etats-unis-et-passent-a-android-4-4.htm

The way around this is to not use the domain in the guid, but a lot of blogging software does this automatically.
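As a hedged sketch (not Feedbin code) of what working around that on the consumer side could look like, normalizing URL-style guids by dropping the host collapses the two guids above into one identifier:

require 'uri'

guids = [
  'http://www.nextinpact.com/news/87085-les-google-glass-se-sont-bien-vendues-aux-etats-unis-et-passent-a-android-4-4.htm',
  'http://www.pcinpact.com/news/87085-les-google-glass-se-sont-bien-vendues-aux-etats-unis-et-passent-a-android-4-4.htm'
]

# Keep only the path so a domain switch no longer changes the identifier.
# This assumes the path alone is unique, which is not guaranteed for every feed.
puts guids.map { |guid| URI.parse(guid).path }.uniq.length
# => 1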

@fma16


fma16 commented Apr 17, 2014

Yep, they made the change pcinpact.com -> nextinpact.com a couple of weeks ago, but nextinpact.com still redirects to pcinpact.com for now.
Anyway, it looks like the bug doesn't happen anymore, so I'll consider it fixed for now.

Thanks for the help! :-D

@svraka


svraka commented Apr 21, 2014

Recently, duplicate items started popping up in some wordpress.com feeds like http://formerf1doc.wordpress.com/feed/ and http://britishisms.wordpress.com/feed/.
