Citation

DESCRIPTION: This is code from a few summers ago, when a Finance professor hired me for his project to change academic citation ranking software. My job was to crawl nearly a million pages on online journal sites to retrieve a complete list of articles and their authors. I focused on large sites such as ScienceDirect, SpringerLink, and Blackwell because they house the great majority of the journals. I ran about 50 workers on Rackspace cloud servers (at the time it was better for our purposes than AWS), each of which accepted crawling tasks from the main server. The workers then wrote the data they collected in batches to the TokyoTyrant instance.
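The worker flow described above (take tasks, crawl, write results out in batches) can be sketched roughly as follows. This is a minimal illustration, not the code in worker.rb: the `BatchingWorker` class, its `batch_size` parameter, and the record shape are all hypothetical, and a plain Hash stands in for the TokyoTyrant connection.

```ruby
# Hypothetical sketch of a crawl worker that buffers results and writes
# them to the store in batches. The store is any object responding to
# []= ; a Hash stands in for a TokyoTyrant connection here.
class BatchingWorker
  attr_reader :store, :flush_count

  def initialize(store, batch_size: 3)
    @store = store
    @batch_size = batch_size
    @buffer = []
    @flush_count = 0
  end

  # Runs the given block on each task (the "crawl" step) and buffers
  # the returned record, flushing whenever the buffer fills up.
  def run(tasks)
    tasks.each do |task|
      @buffer << yield(task)
      flush if @buffer.size >= @batch_size
    end
    flush # write out whatever is left over
  end

  # Writes all buffered records to the store in one pass.
  def flush
    return if @buffer.empty?

    @buffer.each { |record| @store[record[:url]] = record }
    @flush_count += 1
    @buffer.clear
  end
end

# Usage with made-up URLs; the block stands in for the real scraping step.
store = {}
worker = BatchingWorker.new(store, batch_size: 2)
urls = (1..5).map { |i| "http://example.com/article/#{i}" }
worker.run(urls) { |url| { url: url, title: "Article at #{url}" } }
```

Batching keeps the number of round trips to the storage server well below the number of pages crawled, which matters when about 50 workers write to a single instance.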

HIGHLIGHTS:
dependencies_tokyo/tokyo_record.rb - an ActiveRecord-like interface for TokyoTyrant
dependencies_tokyo/indexers_tokyo/* - scraping code for each site, encapsulated behind a common interface
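To give a feel for the two ideas in the highlights, here is a small sketch of an ActiveRecord-like wrapper over a key-value store, plus per-site indexers that share one interface. It is not the contents of tokyo_record.rb or the indexers: the names `KeyValueRecord`, `Article`, `BaseIndexer`, and `ExampleSiteIndexer` are hypothetical, the HTML format is invented, and a Hash stands in for a TokyoTyrant table.

```ruby
# Hypothetical ActiveRecord-like base class over a key-value store.
# The store only needs [] and []= ; a Hash stands in for TokyoTyrant.
class KeyValueRecord
  class << self
    attr_accessor :store

    # Declares reader and writer methods for each named attribute.
    def attributes(*names)
      names.each do |name|
        define_method(name) { @attrs[name] }
        define_method("#{name}=") { |value| @attrs[name] = value }
      end
    end

    # Returns a record for the key, or nil if nothing is stored there.
    def find(key)
      attrs = store[key]
      attrs && new(key, attrs)
    end
  end

  attr_reader :key

  def initialize(key, attrs = {})
    @key = key
    @attrs = attrs.dup
  end

  def save
    self.class.store[key] = @attrs.dup
    self
  end
end

class Article < KeyValueRecord
  self.store = {}
  attributes :title, :authors
end

# Common interface: each site-specific indexer implements #articles only.
class BaseIndexer
  def articles(_html)
    raise NotImplementedError, "#{self.class} must implement #articles"
  end

  # Shared step: turn scraped attribute hashes into saved records.
  def index(html)
    articles(html).map { |attrs| Article.new(attrs[:title], attrs).save }
  end
end

# Indexer for an invented page layout (assumption, not a real site).
class ExampleSiteIndexer < BaseIndexer
  PATTERN = %r{<li data-authors="([^"]*)">([^<]*)</li>}

  def articles(html)
    html.scan(PATTERN).map do |authors, title|
      { title: title.strip, authors: authors.split(";").map(&:strip) }
    end
  end
end

html = '<ul><li data-authors="A. Smith; B. Jones">Asset Pricing</li>' \
       '<li data-authors="C. Lee">Market Microstructure</li></ul>'
saved = ExampleSiteIndexer.new.index(html)
found = Article.find("Asset Pricing")
```

With this split, supporting another journal site means writing one new subclass with its own `articles` method, while storage and lookup stay in the shared base classes.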

Folders and files:
dependencies_tokyo
indexers_tokyo
models_tokyo
.DS_Store
README.md
README.txt
require_all_tokyo.rb
worker.rb


bvishny/Citation
