
ReviewBuilder

A collection of tools for automating parts of a systematic review of scientific literature.

Currently supports one use case: building a BibTeX file from the results of a Google Scholar search and augmenting each result's metadata by retrieving its abstract and finding Open Access versions of the paper on the web, including preprints.

  • All results are cached locally in a SQLite database, which makes iterating over queries when gathering papers for a review much less painful.
  • All data ingestion is polite: rate limiting is enforced locally, both from each service's known requirements and by parsing the X-Rate-Limit-Limit and X-Rate-Limit-Interval headers where a response provides them (see the sketch after this list).
  • Implemented: Google Scholar, Crossref, SemanticScholar (metadata), PubMed, arXiv, Unpaywall.
  • Not yet implemented: Microsoft Academic, Semantic Scholar (search), Web of Science
  • Coming very soon:
    • locally filtering results (i.e. "selecting articles for inclusion") based on keywords and the detected language of each paper
    • automatic downloading of PDFs
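
The header-aware rate limiting mentioned above can be pictured with a small standalone sketch. This is illustrative, not the repo's actual implementation; the class name and default interval are invented for the example:

```python
import time
import requests

class PoliteSession:
    """Minimal sketch of a polite HTTP client: enforces a per-service delay
    and tightens it when X-Rate-Limit-* headers appear in a response."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval   # seconds between requests
        self._last_request = 0.0

    def get(self, url, **kwargs):
        # Sleep until at least min_interval has passed since the last call
        wait = self.min_interval - (time.monotonic() - self._last_request)
        if wait > 0:
            time.sleep(wait)
        response = requests.get(url, **kwargs)
        self._last_request = time.monotonic()

        # Honour the limits the service advertises, if any.
        # Crossref, for example, sends limit="50" and interval="1s".
        limit = response.headers.get("X-Rate-Limit-Limit")
        interval = response.headers.get("X-Rate-Limit-Interval")
        if limit and interval:
            seconds = float(interval.rstrip("s"))
            self.min_interval = seconds / float(limit)
        return response
```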

Installation

Tested on Python 3.7 only. May work with earlier versions of Python 3, but not 2.

pip install -r requirements.txt

Example usage

python search_to_file.py -q "OR "natural language" OR "radiology reports" OR lstm OR rnn OR bert OR elmo OR word2vec" -m 100 -f test.bib -ys 2015

This will send the supplied query to Google Scholar, set the minimum year (--year-start) to 2015, retrieve a maximum of 100 results, and save them to test.bib.

Alternatively, we can save the query in a text file and pass that as a parameter:

python search_to_file.py -qf query1.txt -m 100 -f test.bib -ys 2015
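
For example, query1.txt could contain the same query string as above:

```
"natural language" OR "radiology reports" OR lstm OR rnn OR bert OR elmo OR word2vec
```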

BibTeX does not store everything we are interested in, so by default, extra data from Scholar, such as the link to "related articles", the number of citations and other tidbits, is saved directly to the local SQLite cache (see below).
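
Since the cache is plain SQLite, it can be inspected directly. The schema isn't documented here, so this sketch only lists the tables rather than assuming column names:

```python
import sqlite3

# Open the local cache (created at db/papers.sqlite, see below)
# and list the tables it contains.
conn = sqlite3.connect("db/papers.sqlite")
for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(name)
conn.close()
```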

Google Scholar offers perhaps the best coverage (recall) across all fields of science and does a great job of surfacing relevant articles. What it does not do, however, is make itself easy to scrape or connect its results to anything else useful. It provides no useful identifier for the results (DOI, PMID, etc.), no abstract, and a lot of information in the results is mangled, including authors' names. To get high-quality data, we need to use other services.

Once we have the list of results, we can collect extra data, such as the abstract of the paper and locations on the web where we may find it in open access, whether in HTML or PDF.

python gather_metadata.py -i test.bib -o test_plus.bib --max 20

This will process a maximum of 20 entries from the test.bib file and output an "enriched" version to test_plus.bib. For each entry, it will try to:

  1. match it with an entry in the local cache; if it can't be found, go to step 2
  2. attempt to match the paper with its DOI via the Crossref API
  3. once we have a DOI, check SemanticScholar for the paper's metadata and abstract
  4. if we still don't have a DOI or abstract, search PubMed for its PubMed ID (PMID) and retrieve the abstract from there, if available
  5. search arXiv for a preprint of the paper
  6. search Unpaywall for available open access versions of the paper if we are still missing a PDF link
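
As a rough illustration of steps 2 and 3, the Crossref and SemanticScholar lookups can be reproduced with plain requests against their public APIs. This is a standalone sketch rather than the repo's code, and the example title is arbitrary:

```python
import requests

def find_doi(title):
    """Step 2 (sketch): look up a DOI for a title via the public Crossref API."""
    r = requests.get("https://api.crossref.org/works",
                     params={"query.bibliographic": title, "rows": 1})
    items = r.json()["message"]["items"]
    return items[0].get("DOI") if items else None

def fetch_metadata(doi):
    """Step 3 (sketch): fetch title and abstract from Semantic Scholar by DOI."""
    r = requests.get("https://api.semanticscholar.org/v1/paper/" + doi)
    return r.json() if r.ok else None

doi = find_doi("Attention is all you need")
if doi:
    meta = fetch_metadata(doi) or {}
    print(doi, meta.get("abstract"))
```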

Many of these steps require approximate matching, both against the local cache and against results from the remote APIs. Often a preprint version of a paper has a slightly different title or is missing an author or two. This repo implements several heuristics for dealing with this.
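
The exact heuristics live in the repo; as a minimal illustration of the idea, titles can be normalised and compared with a similarity ratio from the standard library (the 0.9 threshold here is an arbitrary choice):

```python
import re
from difflib import SequenceMatcher

def normalise(title):
    """Lowercase and strip punctuation so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def titles_match(a, b, threshold=0.9):
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio() >= threshold

# A preprint title with slightly different punctuation still matches
print(titles_match("BERT: Pre-training of Deep Bidirectional Transformers",
                   "BERT Pretraining of deep bidirectional transformers"))
```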

A SQLite database cache is automatically created as papers.sqlite in the db/ directory.
