europeana-crawler is a simple proof-of-concept script for extracting RDFa metadata from Europeana record pages, using the sitemap that Europeana makes available for search engine crawlers. The triples for each resource are persisted as a file on the filesystem, using a pairtree to distribute the files evenly across subdirectories.

To run the crawler you'll need to install a few dependencies. You can do this in a virtualenv or globally on your system. The instructions here use a virtualenv:

1. virtualenv --no-site-packages ENV
2. source ENV/bin/activate
3. pip install -r requirements.pip
4. ./crawl.py
5. tail -f crawl.log
6. ./aggregate.py > europeana.nt

Questions, comments: Ed Summers <email@example.com>
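For readers unfamiliar with pairtree, here is a minimal sketch of how an identifier might be mapped to a nested directory path. It is not taken from crawl.py: the function names `clean_id` and `pairtree_path` and the `pairtree_root` default are hypothetical, and the character rules follow the Pairtree specification as commonly described, so the crawler's actual layout may differ.

```python
import os

# Characters hex-encoded as ^hh during identifier cleaning
# (assumed from the Pairtree specification).
_HEX_ENCODE = set('"*+,<=>?\\^|')
# Single-character substitutions applied to the remaining characters.
_SUBSTITUTE = {"/": "=", ":": "+", ".": ","}


def clean_id(identifier):
    """Make an identifier safe to use as a sequence of directory names."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in _HEX_ENCODE:
            # Non-visible ASCII, non-ASCII bytes, and reserved characters.
            out.append("^%02x" % byte)
        else:
            out.append(_SUBSTITUTE.get(ch, ch))
    return "".join(out)


def pairtree_path(identifier, root="pairtree_root"):
    """Split the cleaned identifier into two-character directory names."""
    cleaned = clean_id(identifier)
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return os.path.join(root, *pairs)


if __name__ == "__main__":
    # Hypothetical identifier, for illustration only.
    print(pairtree_path("ark:/13030/xt12t3"))
```

Because each directory level is named by only two characters of the identifier, no single directory has to hold every record file.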