The tool consists of two applications: a web scraper (a collection of Scrapy spiders) that collects information from Google Scholar, and a Flask web application to display and analyze the scraped data. The scraping process can be invoked interactively from the web application.
Details on GRESPA can be found in the following publication:
P. Meschenmoser, N. Meuschke, M. Hotz, and B. Gipp, “Scraping Scientific Web Repositories: Challenges and Solutions for Automated Content Extraction,” D-Lib Magazine, vol. 22, iss. 9/10, 2016.
Citation:

```bibtex
@Article{Meschenmoser2016a,
  author  = {{M}eschenmoser, {P}hilipp and {M}euschke, {N}orman and {H}otz, {M}anuel and {G}ipp, {B}ela},
  title   = {{S}craping {S}cientific {W}eb {R}epositories: {C}hallenges and {S}olutions for {A}utomated {C}ontent {E}xtraction},
  year    = {2016},
  volume  = {22},
  number  = {9/10},
  doi     = {10.1045/september2016-meschenmoser},
  url     = {https://dx.doi.org/10.1045/september2016-meschenmoser},
  journal = {D-Lib Magazine},
}
```
Annotated directory structure and useful files:
```
.
├── README.md              -- current readme
├── database.env-sample    -- sample database variables
├── proxy.env-sample       -- sample proxy variables
├── webapp.env-sample      -- sample webapp variables
├── create-db.sh           -- script for creation of the database
├── gscholar_scraper       -- scraper project root
│   ├── README.md          -- more information on the scraper
│   ├── gscholar_scraper   -- scraper implementation
│   │   ├── models.py
│   │   ├── ...
│   │   ├── settings.py
│   │   └── spiders
│   │       └── ...
│   ├── main.py            -- programmatic access to the spiders
│   ├── prepare-db.py      -- script for creation of tables
│   ├── names.txt          -- list of the 1000 most frequent English names
│   ├── requirements.txt
│   └── scrapy.cfg         -- config for Scrapy
└── webapp                 -- webapp project root
    ├── __init__.py        -- python file containing the small webapp (view & controller logic)
    ├── config.py
    ├── queries
    │   └── ...
    ├── requirements.txt
    ├── static             -- static files, like CSS and JS
    │   └── ...
    └── templates          -- page templates
        └── ...
```
The scraper consists of several Scrapy spiders, notably:

`author_complete`
: Crawls the profile page of a single given author (via the `start_authors` param) and their colleagues until reaching the configured link depth (see `settings.py`).

`author_labels`
: Searches for the names in `SEED_NAME_LIST` (see `settings.py`) and scrapes the labels from the authors' profiles.

`author_general`
: Searches for all labels in the database and scrapes general author information.

`author_detail`
: Complements existing author information by requesting the profile pages of specific authors.

`author_co`
: Scrapes co-authorship information of specified authors.
A typical scraping workflow using the above spiders is to first scrape label information using the popular names, then fetch the authors for these labels, and finally augment the general author information with detail information on scientific metrics or co-authorship, as sketched below. Alternatively, you can issue a Multi Search from the webapp to start crawling a list of authors.
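As a rough sketch only (assuming the standard `scrapy crawl` command and Scrapy's generic `-a` spider arguments; the exact invocations and parameters are documented in `gscholar_scraper/README.md`), this workflow could look like:

```sh
# Run from the scraper project root, inside the grespa-scraper environment.
cd gscholar_scraper

scrapy crawl author_labels    # scrape labels starting from the popular names in names.txt
scrapy crawl author_general   # fetch general author information for the collected labels
scrapy crawl author_detail    # augment existing authors with detail information
scrapy crawl author_co        # add co-authorship information

# Or crawl outward from a single author instead (AUTHOR_ID is a placeholder):
scrapy crawl author_complete -a start_authors=AUTHOR_ID
```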
To ease installation, the project uses Anaconda environments (managed via conda-env).
The basic software requirements are:
- PostgreSQL > 9.4
- Python 2.7
- Optional: Anaconda/Miniconda, conda-env
If you are new to Anaconda, we recommend Miniconda; its documentation provides installation instructions for Windows, OS X, and Linux, as well as a quick-start guide.
First, you have to set up a PostgreSQL instance, although technically you can use any database for which compatible SQLAlchemy drivers exist. Put the credentials you want to use into the file `database.env-sample` and remove the `-sample` suffix. `.env` files are used to store secrets, because these files are exempt from version control.
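The variable names below are purely illustrative placeholders; use the keys defined in the provided `database.env-sample`. A sketch of what such a file might contain:

```sh
# database.env -- placeholder keys and values, not the project's actual variable names
DB_HOST=localhost
DB_PORT=5432
DB_USER=grespa
DB_PASSWORD=change-me
DB_NAME=grespa
```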
To create the database and tables, you can use the provided scripts from the project's root directory:

```sh
export $(cat *.env | xargs)               # set the environment variables from all secret files
sh ./create-db.sh                         # create the database
python ./gscholar_scraper/prepare-db.py   # create the tables
```
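To check that the tables were created, you can list them with `psql` (host, user, and database name below are placeholders; substitute the values from your `database.env`):

```sh
psql -h localhost -U grespa -d grespa -c '\dt'   # placeholder connection parameters
```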
To install all necessary dependencies, invoke `conda env` on the provided environment specifications:

```sh
conda env create --file webapp/environment.yml
conda env create --file gscholar_scraper/environment.yml
```

Now you should have two conda environments (`conda env list`):

```
grespa-scraper     ~/miniconda3/envs/grespa-scraper
grespa-webapp      ~/miniconda3/envs/grespa-webapp
root            *  ~/miniconda3
```
Activate the environment you want to use with:

```sh
source activate ENV_YOU_WANT_TO_ACTIVATE_HERE
```

If everything went smoothly, you should not need `pyenv`, `virtualenv`, `virtualenvwrapper`, or other tools.
Further information on the individual components can be found in their readmes:

`gscholar_scraper`
: Readme located at `gscholar_scraper/README.md`

`webapp`
: Readme located at `webapp/README.md`
**Note:** Do not forget to activate the scraper or webapp environment!
If you use the fish shell, export the environment variables in the following way, because fish's command substitution syntax differs from bash:

```fish
export (cat ../*.env);
```
If your conda installation did not provide a working fish config, use `conda.fish` from conda/conda: include the conda directory in your `PATH` and source the `conda.fish` file, as sketched below.
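A minimal sketch for `~/.config/fish/config.fish`, assuming a default Miniconda installation in `~/miniconda3` (the exact location of `conda.fish` varies between conda versions, so adjust the paths to your setup):

```fish
# ~/.config/fish/config.fish -- paths assume a default Miniconda install
set -gx PATH $HOME/miniconda3/bin $PATH
source $HOME/miniconda3/etc/fish/conf.d/conda.fish
```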
To activate an environment, you should then use:

```fish
conda activate ENV_YOU_WANT_TO_ACTIVATE_HERE
```