Commit
Added the code_tests tasks. Refactored scraper to accommodate the suggestions, broke each object into its own rb file in lib.
1 parent 942377b
commit e73e9d9
Showing 9 changed files with 765 additions and 623 deletions.
@@ -0,0 +1,54 @@
0.8.0 TODO:
* It'd be nice to let the yamls not need a full path to the .db, just use dir(__FILE__) as cwd for that
* We should have a listings.next_page which returns the next page - that would clean up our while loop a bit
* Change craigwatch's default non-regex search to be case insensitive
* Reduce memory consumption in craigwatch.
* I think we need to update the package to include the new rake tasks for flogging
* Add some rdoc text to the top of all the new lib files...

Post-0.7:
* A debug_craigwatch, which shows the current progress... (pages fetched, objects in caches..)
* Some pages are legitimate 404's and we just can't parse them no matter how hard we try - what to do about this?
* Break the scraper objects into separate files...
* Maybe we should make an instance out of CraigScrape.new('us/fl/south florida') kind of thing..
* Finish testing out that geo location todo list
* Test out that array-parameter to the GeoListings constructor, make sure it actually works
* Integrate it better into craigscrape
* It'd be nice to tell craigscrape 'us/ca' or 'us/ca/losangeles' as the scrape location
* and maybe have "search text" and "search section" type stuff where everything ends up scraping from there..
* We should really cache pages if we're going to do this - and I'd say to cache the geolisting pages first...
* Stats in the email: bytes transferred, generation time, urls scraped, posts scraped

* It'd also be nice to run an erb over the yaml file? No, we should take some steps to DRY out the code though.
* Particularly with respect to the searches which use the same regex for multiple searches.
* and particularly with those searches which are using the same listings urls to search for different things (IE 'cta' searches)

Recheck in a week (5.11.09 was last tried):

* This thread:
  http://sfbay.craigslist.org/forums/?ID=29345737
  Title: craigwatch does this - if you're a little handy
  Message:
  craigwatch and libcraigscrape are a tightly-coupled, ruby solution for (largely) unix-based systems.
  <br>
  <br>
  Check it out here:
  <br>
  <a target="_top" href="http://www.derosetechnologies.com/community/libcraigscrape">http://www.derosetechnologies.com/community/libcraigscrape</a>
* http://www.craigslistwatch.com/
* Did this actually post?: http://digg.com/tech_news/Stop_wasting_money_use_Craigslist_Watch

email:

http://www.dostuffright.com/Craigwatch
http://wareseeker.com/Network-Internet/Craigslist-All-City-Search-Tool-1.2.zip/8036652
http://www.killerstartups.com/Search/craigslittlebuddy-com-multiple-city-craigslist-search

Script aggregators:
bigwebmaster.com
http://www.scripts.com/
http://www.scriptarchive.com/
http://www.needscripts.com/
http://www.scriptsearch.com/
http://www.sitescripts.com/PHP/
http://www.scriptsbank.com/
@@ -0,0 +1,46 @@
# TODO: file rdoc

require 'scraper'

class CraigScrape
  # GeoListings represents a parsed Craigslist geo listing page. (i.e. {'http://geo.craigslist.org/iso/us'}[http://geo.craigslist.org/iso/us])
  # These list all the craigslist sites in a given region.
  class GeoListings < Scraper
    LOCATION_NAME = /[ ]*\>[ ](.+)[ ]*/
    GEOLISTING_BASE_URL = %{http://geo.craigslist.org/iso/}

    # The geolisting constructor works like all other Scraper objects, in that it accepts a string 'url'.
    # In addition though, here we'll accept an array like %w(us fl) which gets converted to
    # {'http://geo.craigslist.org/iso/us/fl'}[http://geo.craigslist.org/iso/us/fl]
    def initialize(init_via = nil)
      super init_via.kind_of?(Array) ? "#{GEOLISTING_BASE_URL}#{init_via.join '/'}" : init_via

      # Validate that required fields are present, at least - if we've downloaded it from a url
      parse_error! unless location
    end

    # Returns the GeoLocation's full name
    def location
      unless @name
        cursor = html % 'h3 > b > a:first-of-type'
        cursor = cursor.next_node if cursor
        @name = $1 if cursor and LOCATION_NAME.match he_decode(cursor.to_s)
      end

      @name
    end

    # Returns a hash of site name to urls in the current listing
    def sites
      unless @sites
        @sites = {}
        (html / 'div#list > a').each do |el_a|
          site_name = he_decode strip_html(el_a.inner_html)
          @sites[site_name] = el_a[:href]
        end
      end

      @sites
    end
  end
end
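The array form of the constructor boils down to a one-line string conversion before any scraping happens. As a standalone sketch of just that logic (the helper name `geolisting_url` is hypothetical; the real class passes the result to `Scraper#initialize` and then fetches the page, which is omitted here):

```ruby
# Mirrors the expression in GeoListings#initialize: an Array of path
# segments is joined onto the geo listing base url; anything else
# (e.g. an already-complete url String) is passed through unchanged.
GEOLISTING_BASE_URL = %{http://geo.craigslist.org/iso/}

def geolisting_url(init_via)
  init_via.kind_of?(Array) ? "#{GEOLISTING_BASE_URL}#{init_via.join '/'}" : init_via
end

puts geolisting_url(%w(us fl))                      # => http://geo.craigslist.org/iso/us/fl
puts geolisting_url('http://geo.craigslist.org/iso/us')
```

This is why `GeoListings.new %w(us fl)` and `GeoListings.new 'http://geo.craigslist.org/iso/us/fl'` are equivalent.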