libearth 0.3.0

@dahlia released this Jul 12, 2014 · 129 commits to master since this release

  • The MergeableDocumentElement.__merge_entities__() methods of root MergeableDocumentElements are no longer ignored. Responsibility for merging two documents has moved from the Session.merge() method to the __merge_entities__() method.
  • crawl() now returns a set of CrawlResult objects instead of tuples.
  • feeds parameter of crawl() function was renamed to feed_urls.
  • Added feed_uri parameter and corresponding feed_uri attribute to CrawlError exception.
  • Timeout option was added to crawler.
    • Added optional timeout parameter to crawl().
    • Added optional timeout parameter to get_feed().
    • Added DEFAULT_TIMEOUT constant which is 10 seconds.
  • Added LinkList.favicon property. [#49]
  • Link.relation attribute, which had been optional, is now required.
  • AutoDiscovery.find_feed_url() method (that returned feed links) was removed. Instead, AutoDiscovery.find() method (that returns a pair of feed links and favicon links) was introduced. [#49]
  • Subscription.icon_uri attribute was introduced. [#49]
  • Added an optional icon_uri parameter to SubscriptionSet.subscribe() method. [#49]
  • Added normalize_xml_encoding() function to work around the xml.etree.ElementTree module's encoding detection bug. [#41]
  • Added guess_tzinfo_by_locale() function. [#41]
  • Added microseconds option to Rfc822 codec.
  • Fixed incorrect merge of subscription/category deletion.
    • Subscriptions are now archived rather than deleted.
    • Outline (which is a common superclass of Subscription and Category) now has deleted_at attribute and deleted property.
  • Fixed several rss2 parser bugs.
    • Now the parser accepts several malformed <pubDate> and <lastBuildDate> elements.
    • It now guesses the time zone from the feed's <language> and the ccTLD (if applicable) when the date time doesn't give an explicit time zone (which is also malformed). [#41]
    • It had ignored <category> elements other than the last one; now it accepts as many as there are.
    • It had ignored <comments> links altogether; now these are parsed into Link objects with relation='discussion'.
    • Some RSS 2 feeds put a URI into <generator>, so the parser now treats it as uri rather than value in such cases.
    • <enclosure> links had been parsed as Link objects without a relation attribute; the parser now properly sets the attribute to 'enclosure'.
    • Mixed <link> elements with the Atom namespace are now parsed correctly as well.
  • Fixed several atom parser bugs.
    • Now it accepts the obsolete PURL Atom namespace.
    • Since some broken Atom feeds (e.g. Naver Blog) provide date time in RFC 822 format, which is incorrect according to RFC 4287 (section 3.3), the parser now accepts RFC 822 format as well.
    • Some broken Atom feeds (e.g. Naver Blog) use the nonstandard <modified> instead of the standard <updated>, so the parser now treats <modified> as equivalent to <updated>.
    • <content> and <summary> can have text/plain and text/html in addition to text and html.
    • <author>/<contributor> is now ignored if it has none of <name>, <uri>, or <email>.
    • Fixed a parser bug that didn't interpret omission of the link[rel] attribute as 'alternate'.
  • Fixed the parser to work well even if there are any file separator characters (FS, '\x1c').
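
To make three of the Atom leniencies above concrete, here is a minimal standard-library sketch. It is not libearth's implementation: the helpers parse_lenient_datetime(), entry_updated(), and link_relation() are hypothetical names invented for this illustration, and the sample entry assumes <modified> appears in the standard Atom namespace. It only shows the general idea of accepting RFC 822 dates, treating <modified> like <updated>, and defaulting an omitted rel to 'alternate'.

```python
import datetime
import email.utils
import xml.etree.ElementTree as ET

ATOM_NS = '{http://www.w3.org/2005/Atom}'


def parse_lenient_datetime(text):
    """Parse an RFC 3339 timestamp, falling back to RFC 822 (hypothetical helper)."""
    text = text.strip()
    # datetime.fromisoformat() rejects a trailing 'Z' before Python 3.11,
    # so rewrite it as an explicit UTC offset first.
    iso_text = text[:-1] + '+00:00' if text.endswith('Z') else text
    try:
        return datetime.datetime.fromisoformat(iso_text)
    except ValueError:
        # Not ISO-like: try the RFC 822 form that some broken feeds emit.
        return email.utils.parsedate_to_datetime(text)


def entry_updated(entry):
    """Return the update time, treating <modified> like <updated> (hypothetical helper)."""
    for tag in ('updated', 'modified'):
        element = entry.find(ATOM_NS + tag)
        if element is not None and element.text:
            return parse_lenient_datetime(element.text)
    return None


def link_relation(link):
    """Return a link's rel, defaulting to 'alternate' when omitted (hypothetical helper)."""
    return link.get('rel', 'alternate')


# A made-up entry with an RFC 822 date in <modified> and a link without rel.
SAMPLE = '''<entry xmlns="http://www.w3.org/2005/Atom">
  <link href="http://example.com/post"/>
  <modified>Sat, 12 Jul 2014 09:00:00 +0900</modified>
</entry>'''

entry = ET.fromstring(SAMPLE)
link = entry.find(ATOM_NS + 'link')
print(link_relation(link))
print(entry_updated(entry).isoformat())
```

The fallback order (standard format first, RFC 822 second) keeps well-formed feeds on the strict path and only relaxes parsing when the strict parse fails.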