Repository with tools to spider journal websites for links to articles
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
journal-links
.gitignore
README.md
get_links.py
journal_list.csv
run_all.py
spiderer.py

README.md

journal-spider: facilitating the spidering of journal articles to scrape


The ContentMine facilitates scraping journals, via both getpapers, quickscrape, and journal-scrapers, but finding the links to input into quickscrape remains a tedious job if done manually. This repository provides a way of spidering journals which requires only minimal user adjustment.

The main workhorse, get_links.py was written by Laszlo Szathmary in 2011. This file returns all links on a webpage, which is all we really need. Link extraction currently works for SAGE journals and Springer journals. In order to run this:

  1. Import spiderer as module into python (make sure to have installed the BeautifulSoup module! pip install BeautifulSoup to do this)
  2. Run spiderer.sage(journal = '') or spiderer.springer(journal = '') to download all links for that specific journal. For the spiderer.sage() you only need the first three letters of the web url (e.g., pss for Psychological Science); for springer() you require the unique journal identifier (e.g., 13428 for Behavior Research Methods); for elsevier() you require the unique journal identifier (e.g., 2212683X for Biologically Inspired Cognitive Architectures).

If you want to collect the links for all journals available in journal_list.csv, you only need to use the command python run_all.py in the commandline of your choosing.

To-do

  • Incorporate some form of selection mechanism into the journal_list
  • Incorporate a date checker to prevent re-spidering of recently spidered journals (what is a reasonable timeframe for this?)
  • Incorporate Elsevier
  • Incorporate Taylor & Francis
  • Incorporate Wiley