Commit
Added the code_tests tasks. Refactored scraper to accommodate the suggestions, broke each object into its own rb file in lib.
1 parent 942377b
commit e73e9d9
Showing 9 changed files with 765 additions and 623 deletions.
@@ -0,0 +1,54 @@
0.8.0 TODO:
* It'd be nice to let the yamls not need a full path to the .db, just use dir(__FILE__) as cwd for that
* We should have a listings.next_page which returns the next page - that would clean up our while loop a bit
* Change craigwatch's default non-regex search to be case insensitive
* Reduce memory consumption in craigwatch.
* I think we need to update the package to include the new rake tasks for flogging
* Add some rdoc text to the top of all the new lib files...

Post-0.7:
* A debug_craigwatch, which shows the current progress... (pages fetched, objects in caches..)
* Some pages are legitimate 404's and we just can't parse them no matter how hard we try - what to do about this?
* Break the scraper objects into separate files...
* Maybe we should make an instance out of CraigScrape.new('us/fl/south florida') kind of thing..
* Finish testing out that geo location todo list
* Test out that array-parameter to the GeoListings constructor, make sure it actually works
* Integrate it better into craigscrape
* It'd be nice to tell craigscrape 'us/ca' or 'us/ca/losangeles' as the scrape location
* and maybe have "search text" and "search section" type stuff where everything ends up scraping from there..
* We should really cache pages if we're going to do this - and I'd say to cache the geolisting pages first...
* Stats in the email: bytes transferred, generation time, urls scraped, posts scraped

* It'd also be nice to run an erb over the yaml file? No, we should take some steps to DRY out the code though.
* Particularly with respect to the searches which use the same regex for multiple searches.
* and particularly with those searches which are using the same listings urls to search for different things (IE 'cta' searches)

Recheck in a week (5.11.09 was last tried):

* This thread:
  http://sfbay.craigslist.org/forums/?ID=29345737
  Title: craigwatch does this - if you're a little handy
  Message:
  craigwatch and libcraigscrape are a tightly-coupled, ruby solution for (largely) unix-based systems.
  <br>
  <br>
  Check it out here:
  <br>
  <a target="_top" href="http://www.derosetechnologies.com/community/libcraigscrape">http://www.derosetechnologies.com/community/libcraigscrape</a>
* http://www.craigslistwatch.com/
* Did this actually post?: http://digg.com/tech_news/Stop_wasting_money_use_Craigslist_Watch

email:

http://www.dostuffright.com/Craigwatch
http://wareseeker.com/Network-Internet/Craigslist-All-City-Search-Tool-1.2.zip/8036652
http://www.killerstartups.com/Search/craigslittlebuddy-com-multiple-city-craigslist-search

Script aggregators:
bigwebmaster.com
http://www.scripts.com/
http://www.scriptarchive.com/
http://www.needscripts.com/
http://www.scriptsearch.com/
http://www.sitescripts.com/PHP/
http://www.scriptsbank.com/
@@ -0,0 +1,46 @@
# TODO: file rdoc

require 'scraper'

class CraigScrape
  # GeoListings represents a parsed Craigslist geo listing page. (i.e. {'http://geo.craigslist.org/iso/us'}[http://geo.craigslist.org/iso/us])
  # These list all the craigslist sites in a given region.
  class GeoListings < Scraper
    LOCATION_NAME = /[ ]*\>[ ](.+)[ ]*/
    GEOLISTING_BASE_URL = %{http://geo.craigslist.org/iso/}

    # The geolisting constructor works like all other Scraper objects, in that it accepts a string 'url'.
    # In addition though, here we'll accept an array like %w(us fl) which gets converted to
    # {'http://geo.craigslist.org/iso/us/fl'}[http://geo.craigslist.org/iso/us/fl]
    def initialize(init_via = nil)
      super init_via.kind_of?(Array) ? "#{GEOLISTING_BASE_URL}#{init_via.join '/'}" : init_via

      # Validate that required fields are present, at least - if we've downloaded it from a url
      parse_error! unless location
    end

    # Returns the GeoLocation's full name
    def location
      unless @name
        cursor = html % 'h3 > b > a:first-of-type'
        cursor = cursor.next_node if cursor
        @name = $1 if cursor and LOCATION_NAME.match he_decode(cursor.to_s)
      end

      @name
    end

    # Returns a hash of site name to urls in the current listing
    def sites
      unless @sites
        @sites = {}
        (html / 'div#list > a').each do |el_a|
          site_name = he_decode strip_html(el_a.inner_html)
          @sites[site_name] = el_a[:href]
        end
      end

      @sites
    end
  end
end
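The array form of the constructor boils down to a one-line string conversion before any scraping happens. As a standalone sketch of just that logic (the helper name `geolisting_url` is hypothetical; the real class passes the result to `Scraper#initialize` and then fetches the page, which is omitted here):

```ruby
# Mirrors the expression in GeoListings#initialize: an Array of path
# segments is joined onto the geo listing base url; anything else
# (e.g. an already-complete url String) is passed through unchanged.
GEOLISTING_BASE_URL = %{http://geo.craigslist.org/iso/}

def geolisting_url(init_via)
  init_via.kind_of?(Array) ? "#{GEOLISTING_BASE_URL}#{init_via.join '/'}" : init_via
end

puts geolisting_url(%w(us fl))                      # => http://geo.craigslist.org/iso/us/fl
puts geolisting_url('http://geo.craigslist.org/iso/us')
```

This is why `GeoListings.new %w(us fl)` and `GeoListings.new 'http://geo.craigslist.org/iso/us/fl'` are equivalent.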