Crawler for large scientific journal sites such as ScienceDirect

bvishny/Citation

Citation

DESCRIPTION: This is code from a few summers ago, when a finance professor hired me for his project to change academic citation ranking software. My job was to crawl nearly a million pages on online journal sites to retrieve a complete list of articles and their authors. I focused on large sites such as ScienceDirect, SpringerLink, and Blackwell because they host the great majority of the journals. I ran about 50 workers on Rackspace cloud servers (at the time they suited our purposes better than AWS), each of which accepted crawling tasks from the main server. The workers then wrote the data they collected, in batches, to the TokyoTyrant instance.
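The worker flow described above (take a task, crawl, buffer the result, write in batches) might look roughly like the sketch below. It is only an illustration: `BatchingWorker`, its methods, the batch size, and the example URLs are all hypothetical, and a plain Hash stands in for the TokyoTyrant connection.

```ruby
# Hypothetical sketch: none of these names come from the repository.
class BatchingWorker
  def initialize(store, batch_size: 2)
    @store = store            # stand-in for the TokyoTyrant connection
    @batch_size = batch_size
    @buffer = []
  end

  # Process one crawl task (a URL) with the given block and buffer the record.
  # The buffer is written out once it reaches the batch size.
  def handle(task)
    @buffer << yield(task)
    flush if @buffer.size >= @batch_size
  end

  # Write every buffered record to the store, then empty the buffer.
  def flush
    @buffer.each { |record| @store[record[:url]] = record }
    @buffer.clear
  end
end

store = {} # in-memory Hash standing in for the remote store
worker = BatchingWorker.new(store, batch_size: 2)
tasks = %w[http://example.org/a http://example.org/b http://example.org/c]

while (task = tasks.shift)
  # A real worker would fetch and parse the page here; this block is a stub.
  worker.handle(task) { |url| { url: url, title: "stub for #{url}" } }
end
worker.flush # write any records left over from an incomplete batch
```

Batching trades a little latency for far fewer round trips to the store, which matters when about 50 workers share one instance.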

HIGHLIGHTS:

- dependencies_tokyo/tokyo_record.rb: an ActiveRecord-like interface for TokyoTyrant
- dependencies_tokyo/indexers_tokyo/*: scraping code for each site, encapsulated behind a common interface
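To give a feel for the two ideas in the highlights, here is a minimal sketch of an ActiveRecord-style record class over a key-value store, plus a base indexer that site-specific scrapers subclass. It is not the code in `tokyo_record.rb` or `indexers_tokyo/`: every class and method name (`KVRecord`, `Article`, `BaseIndexer`, `ExampleSiteIndexer`) is hypothetical, a Hash replaces TokyoTyrant, and the "Title | Author; Author" input format is made up for the example.

```ruby
# Hypothetical sketch; names are illustrative, not the repository's.

# A tiny ActiveRecord-style base class over a key-value store.
class KVRecord
  class << self
    attr_writer :store

    # Each subclass gets its own store; a Hash stands in for TokyoTyrant.
    def store
      @store ||= {}
    end

    def create(key, attrs)
      new(key, attrs).save
    end

    # Returns a record, or nil when the key is absent.
    def find(key)
      attrs = store[key]
      attrs && new(key, attrs)
    end
  end

  attr_reader :key, :attrs

  def initialize(key, attrs)
    @key = key
    @attrs = attrs
  end

  def save
    self.class.store[key] = attrs
    self
  end
end

class Article < KVRecord; end

# Common interface: each site-specific indexer implements #parse and
# returns an array of { title:, authors: } hashes.
class BaseIndexer
  def parse(_page)
    raise NotImplementedError, "#{self.class} must implement #parse"
  end

  # Shared behaviour: parse a page and persist each article (keyed by title
  # here purely for simplicity).
  def index(page)
    parse(page).map { |a| Article.create(a[:title], a) }
  end
end

class ExampleSiteIndexer < BaseIndexer
  # Assumes a made-up line format: "Title | Author One; Author Two"
  def parse(page)
    page.each_line.map do |line|
      title, authors = line.strip.split(" | ", 2)
      { title: title, authors: authors.split("; ") }
    end
  end
end

page = "Asset Pricing | A. Smith; B. Jones\nMarket Microstructure | C. Lee"
ExampleSiteIndexer.new.index(page)
found = Article.find("Asset Pricing")
```

With this split, supporting another journal site only means writing a new `parse` method; storage and lookup stay in one place.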
