Extracting relevant information from Cochrane reviews

This repository contains an updated version of the implementation of rules for automatically extracting relevant information from Cochrane reviews.

Initial implementation by Rabia Bashir: https://github.com/Rabia-Bashir/rules_data_ext/

Environment

The code was built and tested on:

Python 2.7.16 (Anaconda)
scikit-learn: 0.20.3
SciPy: 1.2.1
NumPy: 1.16.5
macOS Catalina 10.15.6

Note

Newer versions of scikit-learn may use different default parameter values, or expose different parameters, than the version listed above. Running the code with a newer version could therefore give different results.

Usage

There are 3 Python scripts:

crawler.py
extractor.py
classifiers.py

crawler.py
Running this script downloads the HTML files to a folder of your choice. To run:
python crawler.py
You will be presented with a prompt:
> Enter your folder name:

Enter the name of your folder, e.g., Your_folder

This code reads the list of DOIs in 'Datasets/DOI.csv' and downloads the reviews with .pub2 (original version) and .pub3 (updated version) from the Cochrane Library. The downloaded HTML files are saved to the HTML_SystematicReviews folder in Your_folder (see above).

Specifically, the crawler will download:

http://cochranelibrary.com/cdsr/doi/{}/full
http://cochranelibrary.com/cdsr/doi/{}/references
http://cochranelibrary.com/cdsr/doi/{}/information
where {} is the DOI.
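
As an illustration of the URL pattern above, the following minimal sketch builds the three page URLs for one DOI. The names BASE_URL, PAGES, and build_review_urls are hypothetical and not taken from crawler.py, and the example DOI is a placeholder in the usual Cochrane format rather than an entry from 'Datasets/DOI.csv'.

```python
# Hypothetical helper: names here are illustrative, not from crawler.py.
BASE_URL = "http://cochranelibrary.com/cdsr/doi/{}/{}"
PAGES = ("full", "references", "information")


def build_review_urls(doi):
    """Return a dict mapping each page name to its URL for one DOI."""
    return {page: BASE_URL.format(doi, page) for page in PAGES}


# Placeholder DOI (assumed format, not read from DOI.csv).
urls = build_review_urls("10.1002/14651858.CD000000.pub2")
for page in PAGES:
    print(urls[page])
```

Each review version (.pub2 and .pub3) would yield its own set of three URLs.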

extractor.py
This code extracts the following information from the HTML files:

Search date: taken from Abstract > Search methods, in the HTML file downloaded from http://cochranelibrary.com/cdsr/doi/{}/full
Number of trials and number of participants in each trial: taken from Characteristics of studies > Characteristics of included studies, in the HTML file downloaded from http://cochranelibrary.com/cdsr/doi/{}/references
Conclusion: taken from What's New and History, in the HTML file downloaded from http://cochranelibrary.com/cdsr/doi/{}/information

In each URL, {} is the DOI.
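
To give a feel for what a rule for the search date might look like, here is a minimal sketch that finds dates written as "12 March 2015" or "March 2015" in a sentence. It is not the rule set used by extractor.py or rules.py: the pattern, the function name find_search_dates, and the sample sentence are all assumptions made up for illustration.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Matches e.g. "12 March 2015" or "March 2015"; the day is optional.
DATE_PATTERN = re.compile(r"(?:(\d{1,2})\s+)?(" + MONTHS + r")\s+(\d{4})")


def find_search_dates(text):
    """Return (day, month, year) tuples found in text; day may be ''."""
    return DATE_PATTERN.findall(text)


# Made-up sentence in the style of a "Search methods" paragraph.
sample = "We searched CENTRAL, MEDLINE and EMBASE to 12 March 2015."
print(find_search_dates(sample))
```

A real rule would also need to handle abbreviated months, numeric dates, and paragraphs that mention several dates.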

To run:
python extractor.py

You will be presented with a prompt:
> Enter your folder name:

The code reads the HTML files in the HTML_SystematicReviews folder in Your_folder and writes 'extracted_info.txt' to the Results folder, also in Your_folder. A pre-generated 'extracted_info.txt' is also provided in the repository's 'Results/' folder.

classifiers.py
Type python classifiers.py on the console, or run classifiers.py from an IDE. A menu will appear:

[1] Load previous trained classifiers
[2] Train the classifiers on your dataset

[1] Load previous trained classifiers
This choice will:

Read the features in 'Results/features.txt'
Split the data into an 80% training set (not used) and a 20% test set
Load the previously trained classifiers from the 'Results/cpickle/' folder and evaluate them on the 20% test set
Reproduce the reported results

[2] Train the classifiers on your dataset
This choice will:

Prompt you to enter the name of Your_folder
Read 'extracted_info.txt' in the 'Your_folder/Results/' folder
Split the data into an 80% training set and a 20% test set
Train the classifiers on the 80% training set and test them on the 20% test set
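
The 80/20 split described above can be sketched with scikit-learn's train_test_split. The toy data, the variable names, and the seed of 42 are assumptions for illustration; classifiers.py may split the data differently.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in data: 10 samples with 2 features each and binary labels.
X = [[i, i % 3] for i in range(10)]
y = [0, 1] * 5

# Hold out 20% of the samples as the test set; the seed is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))
```

Fixing random_state makes the split repeatable, which matters when comparing against previously trained classifiers.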

The code contains 3 classifiers: logistic regression, decision tree, and random forest. All classifiers were trained using grid search to find the best combination of parameters. The parameter combinations tested for each classifier were:

Logistic regression

    parameters = {'penalty': ['l1', 'l2'], 'class_weight': ['balanced'], 'solver': ['liblinear'], 'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0], 'n_jobs': [-1], 'random_state': [42]}

Decision tree

    parameters = {"criterion": ["entropy", "gini"],'class_weight': ['balanced'],'max_features': ['auto', 'sqrt', 'log2'],'max_depth': [2, 3, 4],'random_state':[42]}

Random forest

    parameters = {'n_estimators': range(5, 105, 5), 'criterion': ['entropy', 'gini'], 'class_weight': ['balanced'], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [2, 3, 4], 'random_state': [42]}

These parameter grids are defined in classifiers.py (run_gridsearchcv_RFClassifier(), run_gridsearchcv_DTClassifier(), run_gridsearchcv_LogisticRegression()).
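
For readers unfamiliar with grid search, the sketch below runs scikit-learn's GridSearchCV over a reduced version of the logistic regression grid. It is not the code in classifiers.py: the synthetic data, the 5-fold cross-validation, and the default scoring are assumptions, and 'penalty' and 'n_jobs' are left out of the grid so the sketch also runs on recent scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the real features are not used here.
X, y = make_classification(n_samples=100, n_features=5, random_state=42)

# Reduced grid (assumption): a subset of the logistic regression grid.
parameters = {
    "C": [0.01, 1.0, 100.0],
    "class_weight": ["balanced"],
    "solver": ["liblinear"],
    "random_state": [42],
}

# cv=5 is an assumption; every combination is scored by cross-validation.
search = GridSearchCV(LogisticRegression(), parameters, cv=5)
search.fit(X, y)

print(search.best_params_)
```

After fitting, search.best_estimator_ holds the classifier refitted on all of the data with the best-scoring combination.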

Reference

A rule-based approach for automatically extracting data from systematic reviews and their updates to model the risk of conclusion change. Rabia Bashir, Adam G. Dunn, Didi Surian. Research Synthesis Methods, 2021:1-10
