Extracting relevant information from Cochrane reviews

This repository contains an updated implementation of the rules for automatically extracting relevant information from Cochrane reviews.

Initial implementation by Rabia Bashir: https://github.com/Rabia-Bashir/rules_data_ext/

Environment


The code was built and tested on:

  • python 2.7.16 (Anaconda)
  • scikit-learn: 0.20.3
  • SciPy: 1.2.1
  • NumPy: 1.16.5
  • macOS Catalina 10.15.6
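
To check that a local environment matches these versions, a quick check from any Python interpreter:

    import sys
    import numpy, scipy, sklearn

    # Print the interpreter and library versions to compare against the list above
    print("Python: %s" % sys.version.split()[0])
    print("scikit-learn: %s" % sklearn.__version__)
    print("SciPy: %s" % scipy.__version__)
    print("NumPy: %s" % numpy.__version__)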

Note


Please note that newer versions of the libraries used in this code (e.g., scikit-learn) might have different default parameter values, or different available parameters, which could give different results.

Usage


There are 3 Python scripts:

  1. crawler.py
  2. extractor.py
  3. classifiers.py

You can create your own Results folder, but the folder should be in the same directory as the Python scripts and have the same structure (a small sketch for creating the sub-folders is shown after the tree), i.e.:
Main folder
      |- crawler.py
      |- extractor.py
      |- classifiers.py
      |- Datasets
      |        |- DOI.csv
      |- Results
      |        |- cpickle
      |- Your_folder
      |        |- Results
      |        |- HTML_SystematicReviews
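
As a minimal sketch, the two expected sub-folders can be created like this (the name Your_folder is just a placeholder for your own folder name):

    import os

    folder = "Your_folder"  # replace with your own folder name
    # Create the two sub-folders the scripts expect, next to the Python scripts
    for sub in ["Results", "HTML_SystematicReviews"]:
        path = os.path.join(folder, sub)
        if not os.path.exists(path):
            os.makedirs(path)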

crawler.py
Re-running the script will download HTML files to your own folder. To run:
     python crawler.py
You will be presented with a prompt:
    > Enter your folder name:

You need to enter your folder name, e.g., Your_folder

This code will read the list of DOIs in 'Datasets/DOI.csv' and download the reviews with .pub2 (original version) and .pub3 (updated version) suffixes from the Cochrane Library. The downloaded HTML files are saved to the HTML_SystematicReviews folder in Your_folder (see above).

Specifically, the crawler will download the following pages (a sketch of how the URLs are built from the DOI list follows):

  • http://cochranelibrary.com/cdsr/doi/{}/full
  • http://cochranelibrary.com/cdsr/doi/{}/references
  • http://cochranelibrary.com/cdsr/doi/{}/information
    where {} is the DOI.
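
A minimal sketch of how these URLs can be built from 'Datasets/DOI.csv' (not the crawler itself; it assumes one DOI per row in the first column and does no downloading):

    import csv

    BASE = "http://cochranelibrary.com/cdsr/doi/{}/{}"

    with open("Datasets/DOI.csv") as f:
        for row in csv.reader(f):
            doi = row[0].strip()  # assumes the DOI is in the first column
            for page in ("full", "references", "information"):
                print(BASE.format(doi, page))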

extractor.py
This code will extract relevant information from the HTML files (an illustrative parsing sketch follows the list):

  • Search date
        Abstract > Search methods, in the HTML file downloaded from http://cochranelibrary.com/cdsr/doi/{}/full, where {} is the DOI
  • Number of trials, number of participants in each trial
        Characteristics of studies > Characteristics of included studies, in the HTML file downloaded from http://cochranelibrary.com/cdsr/doi/{}/references, where {} is the DOI
  • Conclusion
        What's New and History, in the HTML file downloaded from http://cochranelibrary.com/cdsr/doi/{}/information
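
As a purely illustrative sketch (not the extractor itself), one way to pull the text under a heading such as 'Search methods' from a downloaded HTML file, assuming BeautifulSoup is available and the heading text appears in the page; the file name below is hypothetical and the real selectors in extractor.py may differ:

    from bs4 import BeautifulSoup

    # Hypothetical file name; use any HTML file downloaded by crawler.py
    with open("Your_folder/HTML_SystematicReviews/example_full.html") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    # Find a heading whose text mentions 'Search methods' and take the text that follows it
    heading = soup.find(lambda tag: tag.name in ("h2", "h3", "h4")
                        and "Search methods" in tag.get_text())
    if heading is not None:
        section = heading.find_next(["p", "div"])
        print(section.get_text(strip=True) if section is not None else "")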

To run:
     python extractor.py

You will be presented with a prompt:
     > Enter your folder name:

The code will read the HTML files in the HTML_SystematicReviews folder in Your_folder and produce 'extracted_info.txt' in the Results folder, also in Your_folder. A pre-generated 'extracted_info.txt' is also provided in the top-level 'Results/' folder.

classifiers.py
Type python classifiers.py in the console, or run classifiers.py from an IDE. A menu will appear:

[1] Load previous trained classifiers
[2] Train the classifiers on your dataset

[1] Load previous trained classifiers
This choice will:

  • Read the features in 'Results/features.txt'
  • Split the data into an 80% training set (not used) and a 20% test set
  • Load the previously trained classifiers from the 'Results/cpickle/' folder and produce results on the 20% test set
  • Reproduce the reported results (a minimal load-and-evaluate sketch follows)
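
A minimal sketch of that load-and-evaluate step, assuming a classifier was pickled to 'Results/cpickle/'; the pickle file name, placeholder data, and split seed below are illustrative, not necessarily the values used by classifiers.py:

    import pickle
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # Placeholder data; the real features come from 'Results/features.txt'
    X, y = make_classification(n_samples=200, n_features=10, random_state=42)

    # 80/20 split; the seed used by classifiers.py may differ
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Load a previously trained classifier (file name is hypothetical) and evaluate it
    with open("Results/cpickle/logistic_regression.pkl", "rb") as f:
        clf = pickle.load(f)
    print(classification_report(y_test, clf.predict(X_test)))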

[2] Train the classifiers on your dataset
This choice will:

  • Prompt you to enter Your_folder name
  • Read 'extracted_info.txt' in the 'Your_folder/Results/' folder
  • Split the data into an 80% training set and a 20% test set
  • Train the classifiers on the 80% training set and test on the 20% test set

The code contains 3 classifiers: logistic regression, decision tree, and random forest. All classifiers were trained using grid search (GridSearchCV) to find the best combination of parameters. Specifically, the tested parameter combinations for each classifier were (a GridSearchCV sketch follows these grids):

Logistic regression

    parameters = {'penalty': ['l1', 'l2'], 'class_weight': ['balanced'], 'solver': ['liblinear'], 'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0], 'n_jobs': [-1], 'random_state': [42]}

Decision tree

    parameters = {"criterion": ["entropy", "gini"],'class_weight': ['balanced'],'max_features': ['auto', 'sqrt', 'log2'],'max_depth': [2, 3, 4],'random_state':[42]}

Random forest

    parameters = {'n_estimators': range(5, 105, 5), 'criterion': ['entropy', 'gini'], 'class_weight': ['balanced'], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [2, 3, 4], 'random_state': [42]}

The combinations of parameters can be found in classifiers.py (run_gridsearchcv_RFClassifier(), run_gridsearchcv_DTClassifier(), run_gridsearchcv_LogisticRegression())
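
For illustration, a minimal GridSearchCV sketch using the logistic regression grid above; the placeholder data, number of cross-validation folds, and default scoring here are assumptions, not necessarily what classifiers.py uses:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Placeholder data standing in for the extracted features
    X, y = make_classification(n_samples=200, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    parameters = {'penalty': ['l1', 'l2'], 'class_weight': ['balanced'], 'solver': ['liblinear'],
                  'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0], 'random_state': [42]}

    # cv=5 and the default scoring are assumptions
    grid = GridSearchCV(LogisticRegression(), parameters, cv=5)
    grid.fit(X_train, y_train)
    print(grid.best_params_)
    print(grid.score(X_test, y_test))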

Reference


  1. Bashir R, Dunn AG, Surian D. A rule-based approach for automatically extracting data from systematic reviews and their updates to model the risk of conclusion change. Research Synthesis Methods. 2021:1-10.
