# journal-spider: facilitating the spidering of journal articles to scrape
The ContentMine project facilitates scraping journals via both `quickscrape` and `journal-scrapers`, but finding the links to feed into `quickscrape` remains a tedious job if done manually. This repository provides a way of spidering journals that requires only minimal user adjustment.
The main workhorse, `get_links.py`, was written by Laszlo Szathmary in 2011. It returns all the links on a webpage, which is all we really need. Link extraction currently works for SAGE and Springer journals. To run it:
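The actual `get_links.py` uses BeautifulSoup, but the core idea of returning every link on a page can be sketched with only the standard library (the class and function names below are illustrative, not taken from the repository):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def get_links(html):
    """Return all links found in an HTML document, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

page = ('<html><body>'
        '<a href="/doi/abs/10.1177/001">Article 1</a>'
        '<a href="/doi/abs/10.1177/002">Article 2</a>'
        '</body></html>')
print(get_links(page))  # -> ['/doi/abs/10.1177/001', '/doi/abs/10.1177/002']
```

The extracted links can then be handed to `quickscrape` one by one.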
1. Import `spiderer` as a module into Python (make sure to have installed BeautifulSoup; `pip install BeautifulSoup` does this).
2. Call `spiderer.sage(journal = '')` or `spiderer.springer(journal = '')` to download all links for that specific journal.

For `spiderer.sage()` you only need the first three letters of the web URL (e.g., `pss` for Psychological Science); for `springer()` you need the unique journal identifier (e.g., 13428 for Behavior Research Methods); for `elsevier()` you need the unique journal identifier (e.g., 2212683X for Biologically Inspired Cognitive Architectures).
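To see why those short identifiers are enough, note that each publisher's journal pages follow a predictable URL pattern. The patterns below are assumptions for illustration; `spiderer` may build its URLs differently:

```python
def sage_url(abbrev):
    # Assumed pattern: SAGE journals have lived at <abbrev>.sagepub.com,
    # so the three-letter abbreviation is all that is needed.
    return "http://%s.sagepub.com" % abbrev

def springer_url(journal_id):
    # Assumed pattern: Springer journals are reachable at
    # link.springer.com/journal/<numeric id>.
    return "http://link.springer.com/journal/%s" % journal_id

print(sage_url("pss"))        # Psychological Science
print(springer_url("13428"))  # Behavior Research Methods
```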
If you want to collect the links for all journals available in `journal_list.csv`, simply run `python run_all.py` from the command line of your choosing.
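Conceptually, `run_all.py` just loops over the journal list and dispatches each entry to the right spider. A minimal sketch, assuming a `publisher,journal` column layout (the real `journal_list.csv` may be laid out differently) and using stub functions in place of `spiderer.sage()` / `spiderer.springer()`:

```python
import csv
import tempfile

def run_all(csv_path, spiders):
    """Spider every journal listed in a CSV file.

    Assumes a header row with 'publisher' and 'journal' columns and a
    mapping from publisher name to spider function.
    """
    collected = []
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            spider = spiders.get(row["publisher"])
            if spider is not None:  # skip publishers we cannot handle yet
                collected.append(spider(row["journal"]))
    return collected

# Demo with a throwaway CSV and stub spiders.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as fh:
    fh.write("publisher,journal\nsage,pss\nspringer,13428\n")
    demo_csv = fh.name

stubs = {"sage": lambda j: ("sage", j), "springer": lambda j: ("springer", j)}
print(run_all(demo_csv, stubs))  # -> [('sage', 'pss'), ('springer', '13428')]
```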
## To do

- Incorporate some form of selection mechanism into the journal list
- Incorporate a date checker to prevent re-spidering of recently spidered journals (what is a reasonable timeframe for this?)
- Incorporate Elsevier
- Incorporate Taylor & Francis
- Incorporate Wiley