Added the code_tests tasks. Refactored scraper to accommodate the suggestions, broke each object into its own rb file in lib
brighton36 committed Jul 5, 2009
1 parent 942377b commit e73e9d9
Showing 9 changed files with 765 additions and 623 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG
@@ -1,5 +1,8 @@
== Change Log

=== Release 0.8.0 (TODO, 2009)
- Added :code_tests to the rakefile

=== Release 0.7.0 (Jul 5, 2009)
- A good bit of refactoring
- Eager-loading in the Post object without the need of the full_post method
43 changes: 42 additions & 1 deletion Rakefile
@@ -11,7 +11,7 @@ include FileUtils
RbConfig = Config unless defined? RbConfig

NAME = "libcraigscrape"
-VERS = ENV['VERSION'] || "0.7.0"
+VERS = ENV['VERSION'] || "0.8.0"
PKG = "#{NAME}-#{VERS}"

RDOC_OPTS = ['--quiet', '--title', 'The libcraigscrape Reference', '--main', 'README', '--inline-source']
@@ -77,3 +77,44 @@ task :uninstall => [:clean] do
sh %{sudo gem uninstall #{NAME}}
end

require 'roodi'
require 'roodi_task'

namespace :code_tests do
  desc "Analyze for code complexity"
  task :flog do
    require 'flog'

    flog = Flog.new
    flog.flog_files ['lib']
    threshold = 105

    bad_methods = flog.totals.select do |name, score|
      score > threshold
    end

    bad_methods.sort { |a,b| a[1] <=> b[1] }.each do |name, score|
      puts "%8.1f: %s" % [score, name]
    end

    puts "WARNING : #{bad_methods.size} methods have a flog complexity > #{threshold}" unless bad_methods.empty?
  end

  desc "Analyze for code duplication"
  task :flay do
    require 'flay'

    threshold = 25
    flay = Flay.new({:fuzzy => false, :verbose => false, :mass => threshold})
    flay.process(*Flay.expand_dirs_to_files(['lib']))

    flay.report

    raise "#{flay.masses.size} chunks of code have a duplicate mass > #{threshold}" unless flay.masses.empty?
  end

  RoodiTask.new 'roodi', ['lib/*.rb'], 'roodi.yml'
end

desc "Run all code tests"
task :code_tests => %w(code_tests:flog code_tests:flay code_tests:roodi)
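
For context, here's a minimal sketch of driving the suite above programmatically rather than from a shell; it assumes the Rakefile sits in the current working directory and that the flog, flay, and roodi gems are installed. It's equivalent to running `rake code_tests`:

# A minimal sketch, assuming the Rakefile above is in the current working
# directory and the flog/flay/roodi gems are installed.
require 'rake'

load 'Rakefile'

# :code_tests declares code_tests:flog, code_tests:flay, and code_tests:roodi
# as prerequisites, so invoking it runs all three in order.
Rake::Task['code_tests'].invoke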

54 changes: 54 additions & 0 deletions TODO.txt
@@ -0,0 +1,54 @@
0.8.0 TODO:
* It'd be nice to let the yamls not need a full path to the .db, just use File.dirname(__FILE__) as the cwd for that
* We should have a listings.next_page which returns the next page - that would clean up our while loop a bit
* Change craigwatch's default non-regex search to be case insensitive
* Reduce memory consumption in craigwatch.
* I think we need to update the package to include the new rake tasks for flogging
* Add some rdoc text to the top of all the new lib files...

Post-0.7:
* A debug_craigwatch, which shows the current progress... (pages fetched, objects in caches..)
* Some pages are legitimate 404's and we just can't parse them no matter how hard we try - what to do about this?
* Break the scraper objects into separate files...
* Maybe we should make an instance out of CraigScrape.new('us/fl/south florida') kind of thing..
* Finish testing out that geo location todo list
* Test out that array-parameter to the GeoListings constructor, make sure it actually works
* integrate it better into craigscrape
* It'd be nice to tell craigscrape 'us/ca' or 'us/ca/losangeles' as the scrape location
* and maybe have 'search text' and 'search section' type stuff where everything ends up scraping from there..
* We should really cache pages if we're going to do this - and I'd say to cache the geolisting pages first...
* Stats in the email: bytes transferred, generation time, urls scraped, posts scraped

* It'd also be nice to run an erb over the yaml file? No, we should take some steps to DRY out the code though.
* Particularly with respect to the searches which use the same regex for multiple searches.
* and particularly with those searches which are using the same listings urls to search for different things (i.e. 'cta' searches)

Recheck in a week (5.11.09 was last tried)

* This thread:
http://sfbay.craigslist.org/forums/?ID=29345737
Title: craigwatch does this - if you're a little handy
Message:
craigwatch and libcraigscrape are a tightly-coupled, ruby solution for (largely) unix-based systems.
<br>
<br>
Check it out here:
<br>
<a target="_top" href="http://www.derosetechnologies.com/community/libcraigscrape">http://www.derosetechnologies.com/community/libcraigscrape</a>
* http://www.craigslistwatch.com/
* Did this actually post?: http://digg.com/tech_news/Stop_wasting_money_use_Craigslist_Watch

email:

http://www.dostuffright.com/Craigwatch
http://wareseeker.com/Network-Internet/Craigslist-All-City-Search-Tool-1.2.zip/8036652
http://www.killerstartups.com/Search/craigslittlebuddy-com-multiple-city-craigslist-search

Scripts aggregators:
bigwebmaster.com
http://www.scripts.com/
http://www.scriptarchive.com/
http://www.needscripts.com/
http://www.scriptsearch.com/
http://www.sitescripts.com/PHP/
http://www.scriptsbank.com/
46 changes: 46 additions & 0 deletions lib/geo_listings.rb
@@ -0,0 +1,46 @@
# TODO: file rdoc

require 'scraper'

class CraigScrape
  # GeoListings represents a parsed Craigslist geo listing page. (i.e. {'http://geo.craigslist.org/iso/us'}[http://geo.craigslist.org/iso/us])
  # These list all the craigslist sites in a given region.
  class GeoListings < Scraper
    LOCATION_NAME = /[ ]*\>[ ](.+)[ ]*/
    GEOLISTING_BASE_URL = %{http://geo.craigslist.org/iso/}

    # The geolisting constructor works like all other Scraper objects, in that it accepts a string 'url'.
    # In addition though, here we'll accept an array like %w(us fl) which gets converted to
    # {'http://geo.craigslist.org/iso/us/fl'}[http://geo.craigslist.org/iso/us/fl]
    def initialize(init_via = nil)
      super init_via.kind_of?(Array) ? "#{GEOLISTING_BASE_URL}#{init_via.join '/'}" : init_via

      # Validate that required fields are present, at least - if we've downloaded it from a url
      parse_error! unless location
    end

    # Returns the GeoLocation's full name
    def location
      unless @name
        cursor = html % 'h3 > b > a:first-of-type'
        cursor = cursor.next_node if cursor
        @name = $1 if cursor and LOCATION_NAME.match he_decode(cursor.to_s)
      end

      @name
    end

    # Returns a hash of site name to urls in the current listing
    def sites
      unless @sites
        @sites = {}
        (html / 'div#list > a').each do |el_a|
          site_name = he_decode strip_html(el_a.inner_html)
          @sites[site_name] = el_a[:href]
        end
      end

      @sites
    end
  end
end
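
A quick usage sketch for the class above, assuming lib/ is on the load path; the us/fl region is just an illustrative choice:

require 'geo_listings'

# The array form is expanded by the constructor to
# http://geo.craigslist.org/iso/us/fl
geo = CraigScrape::GeoListings.new %w(us fl)

puts geo.location   # the region's full name, parsed from the page header

# sites maps each craigslist site name in the region to its url
geo.sites.each do |name, href|
  puts "#{name}: #{href}"
end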
