dokato/c2s2

Medical Cohort Selection

Python scripts implementing our competition solution for the 2018 n2c2 Shared Tasks, Track 1 (Cohort Selection for Clinical Trials).

Data to train the models is available upon request at https://n2c2.dbmi.hms.harvard.edu/.

Setup

You might need the following to run these scripts:

- Python 2.7
- numpy
- scikit-learn (imported as sklearn)
- xgboost
- conreader (optional)

You might also need the following data structure:

clitri/
test_gold/
models/
runsequence.sh
README.md 
testoutput/
output/
preproc/

where:

- clitri/ contains the scripts for training and testing the models;
- test_gold/ contains the annotations for the test files;
- models/ stores the trained ML models and vectorisers;
- output/ holds the output on the training data (e.g. for cross-validation);
- testoutput/ holds the predictions on the test set.

Note: all file paths can be changed in clitri/utils.py.
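
Before running anything, it can help to confirm that the layout listed above is in place. The following is a minimal sketch, not part of the repository: the helper name missing_entries and the two constants are hypothetical, and the check only applies if you kept the default paths rather than changing them in clitri/utils.py.

```python
import os

# Entries taken from the directory listing above.
EXPECTED_DIRS = ["clitri", "test_gold", "models", "testoutput", "output", "preproc"]
EXPECTED_FILES = ["runsequence.sh", "README.md"]


def missing_entries(root="."):
    """Return the expected folders and files that are absent under root."""
    missing = []
    for name in EXPECTED_DIRS:
        if not os.path.isdir(os.path.join(root, name)):
            missing.append(name + "/")
    for name in EXPECTED_FILES:
        if not os.path.isfile(os.path.join(root, name)):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Run from the repository root; prints nothing if the layout is complete.
    for entry in missing_entries():
        print("missing: " + entry)
```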

Running the preprocessing (feature extraction)

If you followed the directory structure above, navigate into the preproc/ folder and run:

python preprocessing.py

You might also want to change the in_dir variable in preprocessing.py to the path containing the raw data:

# --- input files
in_dir = './00_input/'

All the lexicons used for data cleaning and filtering are available in the preproc/lexicon/ directory.

Running the classification script

This script is intended to be used via command line:

$ ./runsequence.sh

It performs model training with the clitri/classifiers.py script, prediction with clitri/discovry.py, and evaluation with track1_eval.py.

Output for Track 1: Cohort Selection for Clinical Trials

Our best score:

                      ------------ met -------------    ------ not met -------    -- overall ---
                      Prec.   Rec.    Speci.  F(b=1)    Prec.   Rec.    F(b=1)    F(b=1)  AUC   
           Abdominal  0.6486  0.8000  0.7679  0.7164    0.8776  0.7679  0.8190    0.7677  0.7839
        Advanced-cad  0.8302  0.9778  0.7805  0.8980    0.9697  0.7805  0.8649    0.8814  0.8791
       Alcohol-abuse  0.2222  0.6667  0.9157  0.3333    0.9870  0.9157  0.9500    0.6417  0.7912
          Asp-for-mi  0.8767  0.9412  0.5000  0.9078    0.6923  0.5000  0.5806    0.7442  0.7206
          Creatinine  0.8000  0.8333  0.9194  0.8163    0.9344  0.9194  0.9268    0.8716  0.8763
       Dietsupp-2mos  0.7885  0.9318  0.7381  0.8542    0.9118  0.7381  0.8158    0.8350  0.8350
          Drug-abuse  0.4000  0.6667  0.9639  0.5000    0.9877  0.9639  0.9756    0.7378  0.8153
             English  0.9125  1.0000  0.4615  0.9542    1.0000  0.4615  0.6316    0.7929  0.7308
               Hba1c  1.0000  0.8286  1.0000  0.9062    0.8947  1.0000  0.9444    0.9253  0.9143
            Keto-1yr  0.0000  0.0000  1.0000  0.0000    1.0000  1.0000  1.0000    0.5000  0.5000
      Major-diabetes  0.8500  0.7907  0.8605  0.8193    0.8043  0.8605  0.8315    0.8254  0.8256
     Makes-decisions  0.9762  0.9880  0.3333  0.9820    0.5000  0.3333  0.4000    0.6910  0.6606
             Mi-6mos  0.3333  0.5000  0.8974  0.4000    0.9459  0.8974  0.9211    0.6605  0.6987
                      ------------------------------    ----------------------    --------------
     Overall (micro)  0.8397  0.9129  0.8786  0.8747    0.9354  0.8786  0.9061    0.8904  0.8957
     Overall (macro)  0.6645  0.7634  0.7799  0.6991    0.8850  0.7799  0.8201    0.7596  0.7716

The official ranking measure is Overall (micro) F(b=1) (0.8904).
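
The columns of the table are related by simple arithmetic. As a minimal sketch (this is not the code of track1_eval.py), the functions below reproduce the relationships that the numbers above are consistent with: F(b=1) is the harmonic mean of precision and recall, the overall F(b=1) is the mean of the met and not met F(b=1) values, and the AUC column equals the mean of the met recall and specificity. The function names are hypothetical, and because the table values are rounded to four decimals, recomputed values may differ in the last digit.

```python
def f1(precision, recall):
    """F(b=1): harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def overall_f1(f_met, f_not_met):
    """Overall F(b=1): mean of the 'met' and 'not met' F(b=1) values."""
    return (f_met + f_not_met) / 2.0


def auc(recall_met, specificity_met):
    """AUC column: mean of 'met' recall and specificity."""
    return (recall_met + specificity_met) / 2.0


if __name__ == "__main__":
    # Abdominal row: compare the printed values with the table.
    print("met F(b=1):     %.4f" % f1(0.6486, 0.8000))
    print("overall F(b=1): %.4f" % overall_f1(0.7164, 0.8190))
    print("AUC:            %.4f" % auc(0.8000, 0.7679))
    # Overall (micro) row: the official ranking measure.
    print("micro F(b=1):   %.4f" % overall_f1(0.8747, 0.9061))
```

The guard in f1 covers rows such as Keto-1yr, where the met precision and recall are both zero.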

Criteria

In this study, we used the textual history of each patient to predict whether the patient meets each of the following medical criteria:

  • Abdominal
  • Advanced-cad
  • Alcohol-abuse
  • Asp-for-mi
  • Creatinine
  • Dietsupp-2mos
  • Drug-abuse
  • English
  • Hba1c
  • Keto-1yr
  • Major-diabetes
  • Makes-decisions
  • Mi-6mos

Details

For details of this approach, please refer to this article: https://medinform.jmir.org/2019/4/e15980/.

If you use the lexicons or any part of our code, please cite the article as:

Spasic I, Krzeminski D, Corcoran P, Balinsky A
Cohort Selection for Clinical Trials From Longitudinal Patient Records: Text Mining Approach
JMIR Med Inform 2019;7(4):e15980
URL: https://medinform.jmir.org/2019/4/e15980
DOI: 10.2196/15980
PMID: 31674914
