This repository contains an updated version of the implementation of rules for automatically extracting relevant information from Cochrane reviews.
Initial implementation by Rabia Bashir: https://github.com/Rabia-Bashir/rules_data_ext/
The code was built and tested on:
- python 2.7.16 (Anaconda)
- scikit-learn: 0.20.3
- SciPy: 1.2.1
- NumPy: 1.16.5
- macOS Catalina 10.15.6
Please note that newer versions of scikit-learn may have different default parameter values, or different available parameters, which could give different results.
There are 3 python scripts:
- crawler.py
- extractor.py
- classifiers.py
You can create your own folder for the results, but it must be in the same directory as the Python scripts and have the same structure, i.e.:
Main folder
|- crawler.py
|- extractor.py
|- classifiers.py
|- Datasets
| |- DOI.csv
|- Results
| |- cpickle
|- Your_folder
| |- Results
| |- HTML_SystematicReviews
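The layout above can be prepared before running the scripts. The following is a minimal sketch, assuming the scripts expect the Results and HTML_SystematicReviews subfolders to already exist inside Your_folder. The helper name ensure_layout is hypothetical and is not part of the repository, and the demo uses a temporary directory in place of the real main folder.

```python
import os
import tempfile


def ensure_layout(main_dir, folder_name):
    """Create <main_dir>/<folder_name>/Results and
    <main_dir>/<folder_name>/HTML_SystematicReviews if they are missing.

    Hypothetical helper for illustration; not part of the repository.
    """
    paths = []
    for sub in ("Results", "HTML_SystematicReviews"):
        path = os.path.join(main_dir, folder_name, sub)
        if not os.path.isdir(path):
            os.makedirs(path)
        paths.append(path)
    return paths


# Demo in a temporary directory standing in for the main folder.
demo_root = tempfile.mkdtemp()
paths = ensure_layout(demo_root, "Your_folder")
for p in paths:
    print(p)
```

In practice, main_dir would be the directory that holds crawler.py, extractor.py and classifiers.py.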
crawler.py
Re-running the script will download HTML files to your own folder. To run:
python crawler.py
You will be presented with a prompt:
> Enter your folder name:
Enter the name of your folder, e.g., Your_folder
This code reads the list of DOIs in 'Datasets/DOI.csv' and downloads the reviews with .pub2 (original version) and .pub3 (updated version) from the Cochrane Library. The downloaded HTML files are saved to the HTML_SystematicReviews folder in Your_folder (see above).
Specifically, the crawler will download:
http://cochranelibrary.com/cdsr/doi/{}/full
http://cochranelibrary.com/cdsr/doi/{}/references
http://cochranelibrary.com/cdsr/doi/{}/information
where {} is the DOI.
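As an illustration of how the three URLs are formed from one DOI, here is a minimal sketch. The names URL_TEMPLATES and build_urls are hypothetical and do not come from crawler.py, and the DOI shown is a made-up placeholder, not a real review.

```python
# The three URL patterns listed above, keyed by page type.
URL_TEMPLATES = {
    "full": "http://cochranelibrary.com/cdsr/doi/{}/full",
    "references": "http://cochranelibrary.com/cdsr/doi/{}/references",
    "information": "http://cochranelibrary.com/cdsr/doi/{}/information",
}


def build_urls(doi):
    """Return a dict mapping page type to the URL for the given DOI."""
    return {name: template.format(doi) for name, template in URL_TEMPLATES.items()}


# Placeholder DOI for illustration only (assumption, not a real review).
EXAMPLE_DOI = "10.1002/14651858.CD000000.pub2"

urls = build_urls(EXAMPLE_DOI)
for name in sorted(urls):
    print("{}: {}".format(name, urls[name]))
```

The sketch only builds the URL strings; downloading and saving the HTML is left to crawler.py.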
extractor.py
This code will extract relevant information from the HTML files:
- Search date
Abstract > Search methods, in the HTML file downloaded from http://cochranelibrary.com/cdsr/doi/{}/full, where {} is the DOI
- Number of trials, number of participants in each trial
Characteristics of studies > Characteristics of included studies, in the HTML file downloaded from http://cochranelibrary.com/cdsr/doi/{}/references, where {} is the DOI
- Conclusion
What's New and History, in the HTML file downloaded from http://cochranelibrary.com/cdsr/doi/{}/information, where {} is the DOI
To run:
python extractor.py
You will be presented with a prompt:
> Enter your folder name:
The code reads the HTML files in the HTML_SystematicReviews folder in Your_folder and produces 'extracted_info.txt' in the Results folder, also in Your_folder. A copy of 'extracted_info.txt' is also provided in the 'Results/' folder.
classifiers.py
Type python classifiers.py in the console, or run classifiers.py from an IDE. A menu will appear:
[1] Load previous trained classifiers
[2] Train the classifiers on your dataset
[1] Load previous trained classifiers
This choice will:
- Read features in 'Results/features.txt'
- Split into 80% for training set (not used), and 20% as test set
- Load the previously trained classifiers from the 'Results/cpickle/' folder and produce the results on the 20% test set
- Reproduce the reported results
[2] Train the classifiers on your dataset
This choice will:
- Prompt you to enter Your_folder name
- Read 'extracted_info.txt' in 'Your_folder/Results/' folder
- Split into 80% for training set, and 20% as test set.
- Train classifiers using the 80% training set and test on the 20% test set.
The code contains 3 classifiers: logistic regression, decision tree, and random forest. All classifiers were trained using a grid search to find the best combination of parameters. Specifically, the parameter combinations tested for each classifier were:
Logistic regression
parameters = {'penalty': ['l1', 'l2'], 'class_weight': ['balanced'], 'solver': ['liblinear'], 'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0], 'n_jobs': [-1], 'random_state': [42]}
Decision tree
parameters = {"criterion": ["entropy", "gini"],'class_weight': ['balanced'],'max_features': ['auto', 'sqrt', 'log2'],'max_depth': [2, 3, 4],'random_state':[42]}
Random forest
parameters = {'n_estimators': range(5,105,5), 'criterion': ['entropy', 'gini'], 'class_weight': ['balanced'], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [2, 3, 4], 'random_state': [42]}
The combinations of parameters can be found in classifiers.py (run_gridsearchcv_RFClassifier(), run_gridsearchcv_DTClassifier(), run_gridsearchcv_LogisticRegression())
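To make the 80/20 split and the grid search concrete, here is a minimal sketch for the decision tree grid. It is not the code in classifiers.py: the data are synthetic (make_classification stands in for the extracted features), and the split seed and cv=5 are assumptions. The 'auto' option for max_features is left out because recent scikit-learn releases no longer accept it for decision trees; with the versions listed above it could be kept.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); the real features come from the
# extracted review information.
X, y = make_classification(n_samples=200, n_features=6, random_state=42)

# 80% training set, 20% test set. The seed here is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Decision tree grid from the text, minus 'auto' for max_features
# (not accepted by recent scikit-learn releases).
parameters = {
    "criterion": ["entropy", "gini"],
    "class_weight": ["balanced"],
    "max_features": ["sqrt", "log2"],
    "max_depth": [2, 3, 4],
    "random_state": [42],
}

# cv=5 is an assumption; the source does not state the number of folds.
search = GridSearchCV(DecisionTreeClassifier(), parameters, cv=5)
search.fit(X_train, y_train)

# Best combination found on the training set, then accuracy on the test set.
print(search.best_params_)
test_accuracy = search.score(X_test, y_test)
print(test_accuracy)
```

The same pattern would apply to the logistic regression and random forest grids by swapping the estimator and the parameters dictionary.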
- A rule-based approach for automatically extracting data from systematic reviews and their updates to model the risk of conclusion change. Rabia Bashir, Adam G. Dunn, Didi Surian. Research Synthesis Methods, 2021:1-10