Make unified object to return Curl and FeedNormalizer results to user

1 parent b876b62 · commit e5c819d09c562f569e923590c1071816bc0af6a2 · @jsl committed May 26, 2009
174 README.rdoc
@@ -1,108 +1,140 @@
= Description
-Myzofeedtosis is a library that helps you to efficiently process syndicated web resources. It helps by automatically using
-conditional HTTP GET requests, as well as by pointing out which entries are new in any given feed.
+Myzofeedtosis is a library that helps you to efficiently process syndicated web
+resources. It helps by automatically using conditional HTTP GET requests, as
+well as by pointing out which entries are new in any given feed.
== Name
-The name "myzofeedtosis" is based on the form of cellular digestion "myzocytosis". According to Wikipedia [1], myzocytosis
-is described as the process where "one cell pierces another using a feeding tube, and sucks out cytoplasm". Myzofeedtosis is
-kind of like that, except it works with RSS/Atom feeds instead of cytoplasm.
+The name "myzofeedtosis" is based on the form of cellular digestion
+"myzocytosis". According to Wikipedia [1], myzocytosis is described as the
+process where "one cell pierces another using a feeding tube, and sucks out
+cytoplasm". Myzofeedtosis is kind of like that, except it works with RSS/Atom
+feeds instead of cytoplasm.
1 - http://en.wikipedia.org/wiki/List_of_vores
== Philosophy
-Myzofeedtosis is designed to help you with book-keeping about feed fetching details. This is usually something that is
-mundane and not fundamentally related to the business logic of applications that deal with the consumption of syndicated
-content on the web. Myzofeedtosis keeps track of these mundane details so you can just keep grabbing new content without
-wasting bandwidth in making unnecessary requests and programmer time in implementing algorithms to figure out which feed
-entries are new.
-
-Myzofeedtosis fits into other frameworks to do the heavy lifting, including the Curb library which does HTTP requests through
-curl, and FeedNormalizer which abstracts the differences between syndication formats. In the sense that it fits into these
-existing, robust programs, Myzofeedtosis is a modular middleware piece that efficiently glues together disparate parts to create
-a helpful feed reader with a minimal (< 200 LOC), test-covered codebase.
+Myzofeedtosis is designed to help you with book-keeping about feed fetching
+details. This is usually something that is mundane and not fundamentally related
+to the business logic of applications that deal with the consumption of
+syndicated content on the web. Myzofeedtosis keeps track of these mundane
+details so you can just keep grabbing new content without wasting bandwidth in
+making unnecessary requests and programmer time in implementing algorithms to
+figure out which feed entries are new.
+
+Myzofeedtosis fits into other frameworks to do the heavy lifting, including the
+Curb library which does HTTP requests through curl, and FeedNormalizer which
+abstracts the differences between syndication formats. In the sense that it fits
+into these existing, robust programs, Myzofeedtosis is a modular middleware
+piece that efficiently glues together disparate parts to create a helpful feed
+reader with a minimal (< 200 LOC), test-covered codebase.
== Installation
-Assuming that you've followed the directions on gems.github.com to allow your computer to install gems from GitHub, the
-following command will install the Myzofeedtosis library:
+Assuming that you've followed the directions on gems.github.com to allow your
+computer to install gems from GitHub, the following command will install the
+Myzofeedtosis library:
- sudo gem install jsl-myzofeedtosis
+ sudo gem install jsl-myzofeedtosis
== Usage
-Myzofeedtosis is easy to use. Just create a client object, and invoke the "fetch" method:
-
- require 'myzofeedtosis'
- client = Myzofeedtosis::Client.new('http://feeds.feedburner.com/wooster')
- result = client.fetch
-
-+result+ will be a FeedNormalizer::Feed object, which responds to the method +entries+. In this first case, Myzofeedtosis
-isn't much more useful than just using FeedNormalizer by itself. On subsequent requests, though, it helps significantly.
-Invoke client.fetch again. Instead of getting back an object that responds to +entries+, you'll probably get back a Curl::Easy
-object with a response code of 304, indicating that the resource hasn't changed since the last retrieval. If you invoke
-fetch again, assuming one entry was added, you would get back another FeedNormalizer::Feed object. You'll also notice that
-all of the FeedNormalizer::Feed objects respond not only to +entries+, but +new_entries+, which is a selection of the entries
-that haven't been seen before.
-
-Myzofeedtosis is designed for these situations where you have one or maybe thousands of feeds that you have to update on a
-regular basis. If the results that you receive respond to the method new_entries, iterate over the new entries, processing them
-according to your business logic.
-
-You will most likely want to allow Myzofeedtosis to remember details about the last retrieval of a feed after the client is
-removed from memory. Myzofeedtosis uses Moneta, a unified interface to key-value storage systems to remember "summaries" of
-feeds that it has seen in the past. See the document section on Customization for more details on how to configure this system.
+Myzofeedtosis is easy to use. Just create a client object, and invoke the
+"fetch" method:
+
+ require 'myzofeedtosis'
+ client = Myzofeedtosis::Client.new('http://feeds.feedburner.com/wooster')
+ result = client.fetch
+
++result+ will be a Myzofeedtosis::Result object, which delegates methods to
+the FeedNormalizer::Feed object as well as to the Curl::Easy object used to
+fetch the feed. Useful methods on this object include +entries+, +new_entries+
+and +response_code+, among many others (basically all of the methods that
+FeedNormalizer::Feed and Curl::Easy objects respond to should be available).
+
+Note that since Myzofeedtosis uses HTTP conditional GET, it may not actually
+have received a full XML response from the server suitable for being parsed
+into entries. In this case, methods such as +entries+ on the Myzofeedtosis::Result
+will return +nil+. Depending on your application logic, you may want to inspect
+the methods that are delegated to the Curl::Easy object, such as +response_code+,
+for more information on what happened in these cases.
+
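+For example, a minimal check might look like the following sketch (it relies
+only on the +entries+ and +response_code+ methods described above):
+
+  result = client.fetch
+
+  if result.entries.nil?
+    # No parsed feed; inspect the HTTP status to decide what to do.
+    if result.response_code == 304
+      # Resource not modified since the last fetch -- nothing new to process.
+    else
+      # Something else happened; log result.response_code, retry later, etc.
+    end
+  end
+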
+On subsequent requests of a particular resource, Myzofeedtosis will update
++new_entries+ to contain the feed entries that we haven't seen yet. In most
+applications, your program will probably fetch the same batch of URLs
+repeatedly, processing the elements in +new_entries+ each time.
+
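+A simple polling loop over a batch of feeds might look roughly like this
+(a sketch only: +feed_urls+ is assumed to be your list of URLs and +process+
+your own handler for a single entry):
+
+  clients = feed_urls.map { |url| Myzofeedtosis::Client.new(url) }
+
+  loop do
+    clients.each do |client|
+      result = client.fetch
+      (result.new_entries || [ ]).each { |entry| process(entry) }
+    end
+    sleep 900 # wait a while before hitting the same batch again
+  end
+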
+You will most likely want to allow Myzofeedtosis to remember details about the
+last retrieval of a feed after the client is removed from memory. Myzofeedtosis
+uses Moneta, a unified interface to key-value storage systems, to remember
+"summaries" of feeds that it has seen in the past. See the Customization
+section below for more details on how to configure this system.
== Customization
-Myzofeedtosis stores summaries of feeds in a key-value storage system. If no options are included when creating a new
-Myzofeedtosis::Client object, the default is to use a "memory" storage system. The memory system is just a basic ruby Hash, so it
-won't keep track of feeds after a particular Client is removed from memory. To configure a different backend, pass an options hash
-to the Myzofeedtosis client initialization:
+Myzofeedtosis stores summaries of feeds in a key-value storage system. If no
+options are included when creating a new Myzofeedtosis::Client object, the
+default is to use a "memory" storage system. The memory backend is just a plain
+Ruby Hash, so it won't keep track of feeds after a particular Client is removed
+from memory. To configure a different backend, pass an options hash to the
+Myzofeedtosis client initialization:
- url = "http://newsrss.bbc.co.uk/rss/newsonline_world_edition/south_asia/rss.xml"
- mf = Myzofeedtosis::Client.new(url, :backend => {:moneta_klass => 'Moneta::Memcache', :server => 'localhost:1978'})
- res = mf.fetch
-
-This example sets up a Memcache backend, which in this case points to Tokyo Tyrant on port 1978. Note that Moneta::Memcache
-can be given as a string, in which case you don't have to manually require Moneta::Memcache before initializing the client.
+ url = "http://newsrss.bbc.co.uk/rss/newsonline_world_edition/south_asia/rss.xml"
+ mf = Myzofeedtosis::Client.new(url, :backend => {:moneta_klass => 'Moneta::Memcache', :server => 'localhost:1978'})
+ res = mf.fetch
-Generally, Myzofeedtosis supports all systems supported by Moneta, and any one of the supported systems can be given to the
-+moneta_klass+ parameter. Other options following +backend+ are passed directly to Moneta for configuration.
+This example sets up a Memcache backend, which in this case points to Tokyo
+Tyrant on port 1978. Note that Moneta::Memcache can be given as a string, in
+which case you don't have to manually require Moneta::Memcache before
+initializing the client.
-== Implementation
+Generally, Myzofeedtosis supports all systems supported by Moneta, and any one
+of the supported systems can be given to the +moneta_klass+ parameter. Other
+options following +backend+ are passed directly to Moneta for configuration.
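+
+The general shape of the +backend+ option is sketched below; here it points
+Moneta::Memcache at a plain memcached instance on the default port 11211, but
+any Moneta-supported class and its own options can be substituted:
+
+  mf = Myzofeedtosis::Client.new(url, :backend => {
+    :moneta_klass => 'Moneta::Memcache', # any Moneta-supported class name
+    :server       => 'localhost:11211'   # remaining keys go straight to Moneta
+  })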
-Myzofeedtosis helps to identify new feed entries and to figure out when conditional GET can be used in retrieving resources. In
-order to accomplish this without having to require that the user store information such as etags and dates of the last retrieved entry,
-Myzofeedtosis stores a summary structure in the configured key-value store (backed by Moneta). In order to do conditional GET
-requests, Myzofeedtosis stores the Last-Modified date, as well as the ETag of the last request in the summary structure, which is
-put in a namespaced element consisting of the term 'Myzofeedtosis' (bet you won't have to worry about name collisions on that one!)
-and the MD5 of the URL retrieved.
+== Implementation
-It can also be a bit tricky to decipher which feed entries are new since many feed sources don't include unique ids with their
-feeds. Myzofeedtosis reliably keeps track of which entries in a feed are new by storing (in the summary hash mentioned above) an
-MD5 signature of each entry in a feed. It takes elements such as the published-at date, title and content and generates the MD5
-of these elements. This allows Myzofeedtosis to cheaply compute (both in terms of computation and storage) which feed entries
-should be presented to the user as "new". Below is an example of a summary structure:
+Myzofeedtosis helps to identify new feed entries and to figure out when
+conditional GET can be used in retrieving resources. To accomplish this without
+requiring the user to store information such as ETags and the date of the last
+retrieved entry, Myzofeedtosis stores a summary structure in the configured
+key-value store (backed by Moneta). In order to do conditional GET requests,
+Myzofeedtosis stores the Last-Modified date, as well as the ETag of the last
+request, in the summary structure, which is put in a namespaced element
+consisting of the term 'Myzofeedtosis' (bet you won't have to worry about name
+collisions on that one!) and the MD5 of the URL retrieved.
+
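+A rough illustration of the URL digest that, together with the 'Myzofeedtosis'
+namespace, identifies a feed's summary in the store (mirroring the library's
+key_for_cached method):
+
+  require 'digest/md5'
+
+  key = Digest::MD5.hexdigest('http://feeds.feedburner.com/wooster')
+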
+It can also be a bit tricky to decipher which feed entries are new, since many
+feed sources don't include unique ids in their feeds. Myzofeedtosis reliably
+keeps track of which entries in a feed are new by storing (in the summary hash
+mentioned above) an MD5 signature of each entry in a feed. It takes elements
+such as the published-at date, title and content and generates the MD5 of these
+elements. This allows Myzofeedtosis to cheaply compute (both in terms of
+computation and storage) which feed entries should be presented to the user as
+"new". Below is an example of a summary structure:
{
- :etag=>"4c8f-46ac09fbbe940", :last_modified=>"Mon, 25 May 2009 18:17:33 GMT",
- :digests=>["f2993783ded928637ce5f2dc2d837f10", "da64efa6dd9ce34e5699b9efe73a37a7"]
+ :etag => "4c8f-46ac09fbbe940",
+ :last_modified => "Mon, 25 May 2009 18:17:33 GMT",
+ :digests => ["f2993783ded928637ce5f2dc2d837f10", "da64efa6dd9ce34e5699b9efe73a37a7"]
}
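+
+Each digest in +digests+ is computed along these lines, mirroring the library's
+digest_for method (+entry+ here stands for a FeedNormalizer::Entry):
+
+  require 'digest/md5'
+
+  Digest::MD5.hexdigest(
+    [ entry.date_published, entry.url, entry.title, entry.content ].join
+  )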
-
-The data stored by Myzofeedtosis in the summary structure allows it to be helpful to the user without storing lots of
-data that are unnecessary for efficient functioning.
+
+The data stored by Myzofeedtosis in the summary structure allows it to be
+helpful to the user without storing lots of data that are unnecessary for
+efficient functioning.
== HTML cleaning/sanitizing
-Myzofeedtosis doesn't do anything about feed sanitizing, as other libraries have been built for this purpose. FeedNormalizer
-has methods for escaping entries, but to strip HTML I suggest that you look at the Ruby gem "sanitize".
+Myzofeedtosis doesn't do anything about feed sanitizing, as other libraries have
+been built for this purpose. FeedNormalizer has methods for escaping entries,
+but to strip HTML I suggest that you look at the Ruby gem "sanitize".
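+
+For example, stripping markup from entry bodies might look like the sketch
+below (it assumes the "sanitize" gem, whose default configuration removes all
+tags; check that gem's documentation for the exact API of your version):
+
+  require 'sanitize'
+
+  result.new_entries.each do |entry|
+    plain_text = Sanitize.clean(entry.content)
+    # ... store or display plain_text ...
+  end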
== Feedback
-Please let me know if you have any problems with or questions about Myzofeedtosis.
+Please let me know if you have any problems with or questions about
+Myzofeedtosis.
= Author
4 lib/myzofeedtosis.rb
@@ -10,5 +10,7 @@
end
lib_dirs.each do |d|
- Dir[File.join(d, "**", "*.rb")].each {|file| require file }
+ Dir[File.join(d, "**", "*.rb")].each do |file|
+ require file
+ end
end
67 lib/myzofeedtosis/client.rb
@@ -1,9 +1,11 @@
module Myzofeedtosis
- # Myzofeedtosis::Client is the primary interface to the feed reader. Call it with a url that was previously fetched while
- # connected to the configured backend, and it will 1) only do a retrieval if deemed necessary based on the etag and modified-at
- # of the last etag and 2) mark all entries retrieved as either new or not new. Entries retrieved are normalized using
- # the feed-normalizer gem.
+ # Myzofeedtosis::Client is the primary interface to the feed reader. Call it
+ # with a url that was previously fetched while connected to the configured
+ # backend, and it will 1) only do a retrieval if deemed necessary based on the
+ # etag and modified-at of the last retrieval and 2) mark all entries retrieved
+ # as either new or not new. Entries retrieved are normalized using the
+ # feed-normalizer gem.
class Client
attr_reader :options, :url
@@ -22,31 +24,34 @@ def initialize(url, options = { })
@options[:backend].except(:moneta_klass) )
end
- # Retrieves the latest entries from this feed. Returns a Feednormalizer::Feed object with
- # method +new_entries+ if the response was successful, otherwise returns the Curl::Easy object
- # that we used to retrieve this resource. Note that we also return this object if the request
- # resulted in a 304 (Not Modified) response code since we don't have any new results. Depending
- # on your business logic, you may want to do something in this case, such as putting this
- # resource in a lower-priority queue if it is not frequently updated.
+ # Retrieves the latest entries from this feed. Returns a
+ # Myzofeedtosis::Result object which delegates both to the
+ # FeedNormalizer::Feed parsed from the response (when the response was
+ # successful) and to the Curl::Easy object that we used to retrieve this
+ # resource. If the request resulted in a 304 (Not Modified) response code,
+ # there are no new results to parse and feed-related methods on the result
+ # return nil. Depending on your business logic, you may want to do something
+ # in this case, such as putting this resource in a lower-priority queue if it
+ # is not frequently updated.
def fetch
curl = build_curl_easy
curl.perform
- process_curl_response(curl)
+ feed = process_curl_response(curl)
+ Myzofeedtosis::Result.new(curl, feed)
end
private
- # Marks entries as either seen or not seen based on the unique signature of the entry, which
- # is calculated by taking the MD5 of common attributes.
+ # Marks entries as either seen or not seen based on the unique signature of
+ # the entry, which is calculated by taking the MD5 of common attributes.
def mark_new_entries(response)
digests = if summary_for_feed.nil? || summary_for_feed[:digests].nil?
[ ]
else
summary_for_feed[:digests]
end
- # For each entry in the responses object, mark @_seen as false if the digest of this entry
- # doesn't exist in the cached object.
+ # For each entry in the responses object, mark @_seen as false if the
+ # digest of this entry doesn't exist in the cached object.
response.entries.each do |e|
seen = digests.include?(digest_for(e))
e.instance_variable_set(:@_seen, seen)
@@ -63,17 +68,16 @@ def process_curl_response(curl)
response = mark_new_entries(response)
store_summary_to_backend(response, curl)
response
- else
- curl
end
end
- # Sets options for the Curl::Easy object, including parameters for HTTP conditional GET.
+ # Sets options for the Curl::Easy object, including parameters for HTTP
+ # conditional GET.
def build_curl_easy
curl = new_curl_easy(url)
- # Many feeds have a 302 redirect to another URL. For more recent versions of Curl,
- # we need to specify this.
+ # Many feeds have a 302 redirect to another URL. For more recent versions
+ # of Curl, we need to specify this.
curl.follow_location = true
set_header_options(curl)
@@ -93,8 +97,8 @@ def set_header_options(curl)
summary = summary_for_feed
unless summary.nil?
- # We should only try to populate the headers for a conditional GET if we know both
- # of these values.
+ # We should only try to populate the headers for a conditional GET if
+ # we know both of these values.
if summary[:etag] && summary[:last_modified]
curl.headers['If-None-Match'] = summary[:etag]
curl.headers['If-Modified-Since'] = summary[:last_modified]
@@ -108,9 +112,10 @@ def key_for_cached
MD5.hexdigest(@url)
end
- # Stores information about the retrieval, including ETag, Last-Modified, and MD5 digests of all
- # entries to the backend store. This enables conditional GET usage on subsequent requests and
- # marking of entries as either new or seen.
+ # Stores information about the retrieval, including ETag, Last-Modified,
+ # and MD5 digests of all entries to the backend store. This enables
+ # conditional GET usage on subsequent requests and marking of entries as
+ # either new or seen.
def store_summary_to_backend(feed, curl)
headers = HttpHeaders.new(curl.header_str)
@@ -120,7 +125,8 @@ def store_summary_to_backend(feed, curl)
summary.merge!(:etag => headers.etag) unless headers.etag.nil?
summary.merge!(:last_modified => headers.last_modified) unless headers.last_modified.nil?
- # Store digest for each feed entry so we can detect new feeds on the next retrieval
+ # Store a digest for each feed entry so we can detect new entries on the
+ # next retrieval
digests = feed.entries.map do |e|
digest_for(e)
end
@@ -133,11 +139,12 @@ def set_summary(summary)
@backend[key_for_cached] = summary
end
- # Computes a unique signature for the FeedNormalizer::Entry object given. This signature
- # will be the MD5 of enough fields to have a reasonable probability of determining if the
- # entry is unique or not.
+ # Computes a unique signature for the FeedNormalizer::Entry object given.
+ # This signature will be the MD5 of enough fields to have a reasonable
+ # probability of determining if the entry is unique or not.
def digest_for(entry)
- MD5.hexdigest([entry.date_published, entry.url, entry.title, entry.content].join)
+ MD5.hexdigest( [ entry.date_published, entry.url, entry.title,
+ entry.content ].join)
end
def parser_for_xml(xml)
43 lib/myzofeedtosis/result.rb
@@ -0,0 +1,43 @@
+module Myzofeedtosis
+
+ # Makes the response components from both the Curl::Easy object and the
+ # FeedNormalizer::Feed object available to the user by delegating appropriate
+ # method calls to the correct object. If FeedNormalizer wasn't able to process
+ # the response, calls that would be delegated to the FeedNormalizer::Feed
+ # object return nil. In these cases, depending on your business logic, you may
+ # want to inspect the state of the Curl::Easy object.
+ class Result
+
+ # Methods which should be delegated to the FeedNormalizer::Feed object.
+ FEED_METHODS = [ :title, :description, :last_updated, :copyright, :authors,
+ :author, :urls, :url, :image, :generator, :items, :entries, :new_items,
+ :new_entries, :channel, :ttl, :skip_hours, :skip_days
+ ] unless defined?(FEED_METHODS)
+
+ def initialize(curl, feed)
+ @curl = curl
+ @feed = feed
+
+ raise ArgumentError, "Curl object must not be nil" if curl.nil?
+
+ # See what the Curl::Easy object responds to, and send any appropriate
+ # messages its way.
+ @curl.public_methods(false).each do |meth|
+ (class << self; self; end).class_eval do
+ define_method meth do |*args|
+ @curl.send(meth, *args)
+ end
+ end
+ end
+ end
+
+ # Send methods through to the feed object unless it is nil. If feed
+ # object is nil, return nil in response to method call.
+ FEED_METHODS.each do |meth|
+ define_method meth do |*args|
+ @feed.nil? ? nil : @feed.__send__(meth, *args)
+ end
+ end
+
+ end
+end
19 myzofeedtosis.gemspec
@@ -1,6 +1,6 @@
Gem::Specification.new do |s|
s.name = %q{myzofeedtosis}
- s.version = "0.0.1.2"
+ s.version = "0.0.1.4"
s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
s.authors = ["Justin Leitgeb"]
@@ -9,18 +9,25 @@ Gem::Specification.new do |s|
s.email = %q{justin@phq.org}
s.files = ["lib/extensions/core/array.rb", "lib/extensions/core/hash.rb",
- "lib/extensions/feed_normalizer/feed_instance_methods.rb", "lib/myzofeedtosis/client.rb", "lib/myzofeedtosis.rb",
- "LICENSE", "myzofeedtosis.gemspec", "Rakefile", "README.rdoc", "spec/extensions/feed_normalizer/feed_instance_methods_spec.rb",
- "spec/fixtures/http_headers/wooster.txt", "spec/fixtures/xml/older_wooster.xml", "spec/fixtures/xml/wooster.xml",
- "spec/myzofeedtosis/client_spec.rb", "spec/myzofeedtosis_spec.rb", "spec/spec_helper.rb"]
+ "lib/extensions/feed_normalizer/feed_instance_methods.rb",
+ "lib/myzofeedtosis/result.rb",
+ "lib/myzofeedtosis/client.rb", "lib/myzofeedtosis.rb", "LICENSE",
+ "myzofeedtosis.gemspec", "Rakefile", "README.rdoc",
+ "spec/extensions/feed_normalizer/feed_instance_methods_spec.rb",
+ "spec/fixtures/http_headers/wooster.txt",
+ "spec/fixtures/xml/older_wooster.xml", "spec/fixtures/xml/wooster.xml",
+ "spec/myzofeedtosis/client_spec.rb", "spec/myzofeedtosis_spec.rb",
+ "spec/myzofeedtosis/result_spec.rb",
+ "spec/spec_helper.rb"]
s.has_rdoc = true
s.homepage = %q{http://github.com/jsl/myzofeedtosis}
s.rdoc_options = ["--charset=UTF-8"]
s.require_paths = ["lib"]
s.rubygems_version = %q{1.3.1}
s.summary = %q{Retrieves feeds using conditional GET and marks entries that you haven't seen before}
- s.test_files = ["spec/myzofeedtosis_spec.rb", "spec/spec_helper.rb", "spec/myzofeedtosis/client_spec.rb"]
+ s.test_files = ["spec/myzofeedtosis_spec.rb", "spec/spec_helper.rb", "spec/myzofeedtosis/client_spec.rb",
+ "spec/myzofeedtosis/result_spec.rb" ]
s.extra_rdoc_files = [ "README.rdoc" ]
18 spec/myzofeedtosis/client_spec.rb
@@ -2,9 +2,9 @@
describe Myzofeedtosis::Client do
before do
- @url = "http://www.example.com/feed.rss"
+ @url = "http://www.example.com/feed.rss"
@opts = { :user_agent => "My Cool Application" }
- @fr = Myzofeedtosis::Client.new(@url, @opts)
+ @fr = Myzofeedtosis::Client.new(@url, @opts)
end
describe "initialization" do
@@ -46,10 +46,18 @@
end
describe "when the response code is not 200" do
- it "should return the Curl::Easy object" do
+ it "should return nil for feed methods such as #title and #author" do
curl = mock('curl', :perform => true, :response_code => 304)
@fr.expects(:build_curl_easy).returns(curl)
- @fr.fetch.should == curl
+ res = @fr.fetch
+ res.title.should be_nil
+ res.author.should be_nil
+ end
+
+ it "should return a Myzofeedtosis::Result object" do
+ curl = mock('curl', :perform => true, :response_code => 304)
+ @fr.expects(:build_curl_easy).returns(curl)
+ @fr.fetch.should be_a(Myzofeedtosis::Result)
end
end
@@ -59,7 +67,7 @@
curl = mock('curl', :perform => true, :response_code => 200,
:body_str => xml_fixture('wooster'), :header_str => http_header('wooster'))
@fr.expects(:build_curl_easy).returns(curl)
- @fr.fetch
+ @fr.fetch
end
it "should have an empty array for new_entries" do
30 spec/myzofeedtosis/result_spec.rb
@@ -0,0 +1,30 @@
+require File.join(File.dirname(__FILE__), %w[.. spec_helper])
+
+describe Myzofeedtosis::Result do
+ before do
+ @c = mock('curl')
+ @f = mock('feed')
+ @r = Myzofeedtosis::Result.new(@c, @f)
+ end
+
+ it "should raise an ArgumentError if the Curl object is nil" do
+ lambda {
+ Myzofeedtosis::Result.new(nil, nil)
+ }.should raise_error(ArgumentError)
+ end
+
+ it "should send author to the Feed object" do
+ @f.expects(:author)
+ @r.author
+ end
+
+ it "should send body_str to the curl object" do
+ @c.expects(:body_str)
+ @r.body_str
+ end
+
+ it "should return nil for author if the Feed is nil" do
+ r = Myzofeedtosis::Result.new(@c, nil)
+    r.author.should be_nil
+ end
+end
