A Google crawler to "read" page results for a query and output them to a CSV file.
Repository contents:

archived
lib
results
scraped
tmp
utils/stop_words
.gitignore
Gemfile
LICENSE
README.md
aggregate.sh
archiver.sh
crawler.rb
executer.sh
get_smaller_dataset.sh
mapreducer.sh
run.example.sh


A little script to parse Google search pages and retrieve all the results.

The results are saved in CSV format inside the scraped/ folder. The CSV headers are:

url => the URL of the result (i.e. the website URL)
title => the title of the result (i.e. the website page title)
google_page => the Google page number the result was found on
google_position => the absolute position of the result within the Google results
keywords => the 10 most frequent words on the page (obtained by parsing the whole <body> of the page and applying a word frequency algorithm; see the sketch below)
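
The keyword extraction code isn't shown in this README, but a minimal sketch of that step might look like the following, assuming Nokogiri is available for HTML parsing (the references below cover it) and that the utils/stop_words list is loaded into an array; the top_keywords helper is hypothetical:

```ruby
require 'nokogiri'

# Hypothetical helper: return the `limit` most frequent words in a page's
# <body>, skipping stop words and very short tokens.
def top_keywords(html, stop_words, limit = 10)
  text = Nokogiri::HTML(html).css('body').text.downcase
  counts = Hash.new(0)
  text.scan(/[a-z']+/) do |word|
    counts[word] += 1 unless word.length < 3 || stop_words.include?(word)
  end
  counts.sort_by { |_word, count| -count }.first(limit).map(&:first)
end

# Usage sketch:
# stop_words = File.readlines('utils/stop_words').map(&:strip)
# top_keywords(page_html, stop_words) #=> ["ruby", "crawler", ...]
```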

Tested with Ruby 1.9.3.

References

Some references used to build the project:

http://ruby.bastardsbook.com/chapters/html-parsing/
http://ruby.bastardsbook.com/chapters/web-crawling/
http://ruby.bastardsbook.com/chapters/mechanize/

Tutorial: Writing a Word Frequencies Script in Ruby
http://semantichumanities.wordpress.com/2006/02/21/word-frequencies-in-ruby-tutorial/

Known issues

Does not handle redirects.
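
If one wanted to address this, a minimal sketch using Ruby's standard Net::HTTP could look like the following; the follow_redirects helper is hypothetical and not part of the project:

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: fetch a URL, following up to `limit` HTTP redirects.
# Assumes the Location header holds an absolute URL.
def follow_redirects(url, limit = 5)
  raise 'Too many redirects' if limit.zero?
  response = Net::HTTP.get_response(URI.parse(url))
  case response
  when Net::HTTPRedirection
    follow_redirects(response['location'], limit - 1)
  else
    response
  end
end
```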

TODO

find all items present in more than one result list (grouped by search tag) and aggregate the tags (in separate files?); see the sketch below
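
As a rough idea of what that aggregation could look like, here is a hypothetical sketch using Ruby's standard CSV library; it assumes one CSV file per search tag inside scraped/, named after the tag, which may not match the project's actual file layout:

```ruby
require 'csv'

# Hypothetical sketch: collect the search tags each URL appears under,
# then keep only the URLs found in more than one tag's result list.
tags_by_url = Hash.new { |hash, key| hash[key] = [] }

Dir.glob('scraped/*.csv') do |file|
  tag = File.basename(file, '.csv')
  CSV.foreach(file, headers: true) do |row|
    tags_by_url[row['url']] << tag
  end
end

tags_by_url.select { |_url, tags| tags.size > 1 }.each do |url, tags|
  puts "#{url} => #{tags.join(', ')}"
end
```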