TEPAPA: Text-based Exploratory Pattern Analyser for Prognostic and Associative factor discovery (v0.99 - 7 September 2020)
Home URL: http://tepapadiscoverer.org/
TEPAPA is a feature learning pipeline that identifies differentiating patterns of text associated with an outcome of interest from a list of examples (i.e. supervised). It is purposely built to mine knowledge from electronic medical records (EMR) for generating hypotheses to drive scientific enquiries of the underlying biological/clinical process.
Specifically, TEPAPA prioritises a list of candidate prognostic factors (e.g. factors associated with survival or adverse events of a disease) or disease-associated factors (e.g. risk factors or exposures) for a known clinical outcome, describing these variables as fragments of text (n-grams) or regular expressions.
The TEPAPA pipeline combines the following methods to discover patterns from EMR:
- Semantic-free (i.e. shallow) Natural Language Processing (NLP) methods.
- Sequence-based feature search methods:
  - Simple n-gram search
  - Combinatorial (deep) search
- Statistical association testing.
- Predictive regular expression induction methods
- Feature filtering tools
- Feature ranking tools
TEPAPA is designed to be extensible. All pre-processing modules can be readily replaced by dedicated and/or more sophisticated tools for processing EMR text.
Please see the associated publications for detailed discussions.
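As a rough illustration of the "simple n-gram search" step listed above, the sketch below enumerates word n-grams and records which documents mention each one. It is not TEPAPA's implementation: the whitespace tokeniser stands in for prep-TOKENISE.pl, and the function names, the `max_n` parameter and the three example sentences are all assumptions made for this example.

```python
from collections import defaultdict


def tokenise(text):
    """Lower-case and split on whitespace (a stand-in for the real tokeniser)."""
    return text.lower().split()


def ngram_document_sets(documents, max_n=3):
    """Map each n-gram (a tuple of tokens) to the set of document indices containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        tokens = tokenise(text)
        for n in range(1, max_n + 1):
            for start in range(len(tokens) - n + 1):
                index[tuple(tokens[start:start + n])].add(doc_id)
    return index


# Hypothetical mini-corpus, made up for illustration only.
docs = [
    "tumour at base of tongue",
    "lesion of the left tonsil",
    "biopsy of base of tongue",
]
index = ngram_document_sets(docs)
print(sorted(index[("base", "of", "tongue")]))  # [0, 2]
```

The per-pattern document sets are the kind of input that a downstream association test would compare against the outcome of each case.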
Hardware requirements:
- A UNIX-like system (e.g. GNU/Linux).
- Multiple processors are used in the pattern search and regular expression induction.
- Depending on the size of the corpus and the depth of search, a minimum of 4GB RAM is recommended (>16GB is recommended for general use).
  - Note that computation time is linear in the size of the corpus, and is highly dependent on the depth of search.
Run-time requirements:
- Perl5 is used by the default text cleaner (prep-CLEAN.pl) and tokenizer (prep-TOKENISE.pl).
- Lingua::Stem::En is required for word stemming / tokenization by prep-TOKENISE.pl (the default tokeniser)
The tarball contains the following source tree:
src/                             - Source code directory
src/Makefile                     - Makefile of the TEPAPA binary
bin/                             - Output directory for the binary
scripts/
scripts/prep-*                   - Scripts for preprocessing
scripts/tepapa-pl-simple.tpp     - Example script to run TEPAPA
To compile TEPAPA, you will need the following tools:
- GNU make
- gcc (version 5 or above) - must support C++11
- libc++ (with C++11 support)
- The following external libraries are needed:
  - libgsl - GNU scientific library
  - libgslcblas - GNU scientific library, BLAS (Basic Linear Algebra Subprograms)
  - libpthread - GNU POSIX threading library
- Perl5, with the following Perl5 module (downloadable from CPAN) required at compile time:
  - ExtUtils::Embed - for embedding Perl in C/C++ applications
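Before compiling, it can be convenient to confirm that the command-line tools listed above are on the PATH. The following is a minimal sketch, not part of TEPAPA: the command names (in particular `g++` as the C++ front end of gcc) are assumptions about a typical GNU/Linux installation, and the check covers neither the libraries nor the Perl modules.

```python
import shutil

# Assumed command names for the tools listed above.
REQUIRED_TOOLS = ["make", "gcc", "g++", "perl"]


def find_tools(tools=REQUIRED_TOOLS):
    """Return a {tool: path or None} mapping based on the current PATH."""
    return {tool: shutil.which(tool) for tool in tools}


for tool, path in find_tools().items():
    status = f"found at {path}" if path else "NOT FOUND"
    print(f"{tool:5s} {status}")
```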
Then, after extracting the source code, build the binary:
# cd src; make
To run TEPAPA, you will need:
- A list of cases with known variables of interest (e.g. binary outcomes or survival data), as when preparing a case-control study.
- The EMR text associated with each case (i.e. the corpus).
- A case file (described below).
- A pipeline script. An example script (scripts/tepapa-pl-simple.tpp) is provided; it is a pipeline for extracting both binary and numeric features from the EMR corpus.
Synopsis
# /scripts/tepapa-pl-simple.tpp
Example usage:
- To identify prognostic and associative factors from the EMR text listed in the case definition file ``case.txt'' (see below for the format), and:
  - limit the output to patterns with p<0.001 (filtering threshold),
  - reduce text patterns by occurrence profile ('b'), and
  - sort the output by direction of association ('d'), p-value ('p'), statistical estimate ('e'), and number of cases containing the given pattern ('n')
# /scripts/tepapa-pl-simple.tpp case.txt 0.001 b dpen
- As above, but with less verbose (-q) or more verbose (-v [level]) output:
# /scripts/tepapa-pl-simple.tpp case.txt 0.001 b dpen -q     # quiet output
# /scripts/tepapa-pl-simple.tpp case.txt 0.001 b dpen -v 7   # verbose output with debug level 7
The case definition file ``case.txt'' states the outcome variable(s) and the corpus for feature mining. The format is:
score raw_text_file
score raw_text_file
score raw_text_file
...
where ``score'' is a numeric value indicating the outcome variable associated with the raw_text_file listed on the same line.
For binary outcome variables (e.g. disease vs no disease), use 0 to indicate FALSE and 1 to indicate TRUE, such that:
0 cases/case0001.txt
0 cases/case0002.txt
1 cases/case0003.txt
1 cases/case0004.txt
1 cases/case0005.txt
1 cases/case0006.txt
...
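A case file of this shape can be generated from an existing table of outcomes. The sketch below is one way to do so under stated assumptions: the helper name, the cohort dictionary and the use of a single space as the separator are illustrative choices, not something prescribed by TEPAPA.

```python
from pathlib import Path
import tempfile


def write_binary_case_file(outcomes, destination):
    """Write one '<0|1> <path>' line per case and return the text written."""
    lines = [f"{1 if positive else 0} {path}" for path, positive in outcomes.items()]
    text = "\n".join(lines) + "\n"
    Path(destination).write_text(text)
    return text


# Hypothetical cohort: paths and outcomes are made up for illustration.
outcomes = {
    "cases/case0001.txt": False,
    "cases/case0002.txt": False,
    "cases/case0003.txt": True,
}

with tempfile.TemporaryDirectory() as tmp:
    print(write_binary_case_file(outcomes, Path(tmp) / "case.txt"), end="")
```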
For numeric variables, set the scores to reflect the clinical attribute of interest:
64  patients/case0001.txt   # case 1 score 64
132 patients/case0002.txt   # case 2 score 132
56  patients/case0003.txt   # case 3 score 56
245 patients/case0004.txt   # ... etc
11  patients/case0005.txt
15  patients/case0006.txt
...
where each ``case000x.txt'' contains the raw EMR in plain text format.
For survival analysis, the number indicates the duration of survival, and whether an event has occurred is indicated by a ">" or "=" prefix (e.g. '>': alive, '=': deceased). TEPAPA automatically detects the format of the case file.
=64  patients/case0001.txt   # case 1 survived 64 days
>132 patients/case0002.txt   # case 2 survived at least 132 days
=56  patients/case0003.txt   # case 3 survived 56 days
>245 patients/case0004.txt   # ... etc
=11  patients/case0005.txt
=15  patients/case0006.txt
...
where each ``case000x.txt'' contains the raw EMR in plain text format.
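To make the case-file layouts described above concrete, here is a minimal sketch of how a single line could be interpreted. It is not TEPAPA's own parser: the function name and the returned dictionary are hypothetical, and treating '#' as the start of a comment is an assumption based only on the annotated examples.

```python
def parse_case_line(line):
    """Parse one case-file line; return None for blank or comment-only lines."""
    line = line.split("#", 1)[0].strip()  # assumption: '#' starts a comment
    if not line:
        return None
    score, path = line.split(None, 1)
    if score[0] in "=>":
        # '=' marks an observed event (e.g. deceased); '>' marks a case that
        # was still event-free at that time (e.g. alive).
        return {"path": path, "time": float(score[1:]), "event": score[0] == "="}
    return {"path": path, "score": float(score)}


print(parse_case_line("=64 patients/case0001.txt   # case 1 survived 64 days"))
print(parse_case_line(">132 patients/case0002.txt"))
print(parse_case_line("1 cases/case0003.txt"))
```

Returning the event flag separately from the duration keeps the survival information in the form that a log-rank style comparison would typically need.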
Sample output:
----------------------------------------------------------------------------------------------------------------------
Group  Dir  Method   Estimate    P-value      N   Type   Occurrence profile    Pattern
----------------------------------------------------------------------------------------------------------------------
G0010  POS  LOR.FET   2.2609404  5.25597e-05  31  NGRAM  5BB326B8BB5C0000E008  base of tongue
G0014  POS  LOR.FET   3.1388331  5.68851e-05  21  REGEX  CAA00E8AF50C00000800  of the? (right|left)? (tonsil|tongue)
...
G0001  NEG  LOR.FET  -3.5156652  0.000374902   8  REGEX  00000000000018C09120  (of the) left? (oral tongue)
...
----------------------------------------------------------------------------------------------------------------------
- Group - Internal grouping number specified after feature grouping with the <reduction_grouping> assignment
- Dir - Direction of association (either POSitive or NEGative)
- Method - Method for evaluating the statistical association:
  - LOR.FET: Log odds ratio, with P-value evaluated by Fisher's exact test.
  - AUC and AUROC2: Area under the receiver operating characteristic (ROC) curve.
  - SPEARMAN: Spearman's ρ
- Estimate - The statistical estimate specified by the method above
- P-value
- N - Number of cases with mentions of Pattern in the corpus
- Type - Type of text pattern evaluated:
  - NGRAM: n-gram
  - REGEX: regular expression
- Occurrence profile - The hexadecimal ``heatmap'' indicating cases where the pattern is mentioned.
- Pattern - The pattern of text, of the type given in the Type column
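Two of the columns above can be illustrated with a short sketch. First, in the sample rows the number of set bits in the hexadecimal occurrence profile equals N, so the profile can be read as a bitmap with one bit per case; which bit corresponds to which case is not stated here, so the sketch only counts bits. Second, the LOR.FET method can be reproduced in outline from a 2x2 table of pattern presence against a binary outcome. The function names, the use of the natural logarithm and the example table are assumptions, and this is not TEPAPA's own code.

```python
from math import comb, log


def cases_in_profile(hex_profile):
    """Count the set bits in a hexadecimal occurrence profile."""
    return bin(int(hex_profile, 16)).count("1")


def log_odds_ratio(a, b, c, d):
    """Natural-log odds ratio of the table [[a, b], [c, d]]; assumes no zero cell."""
    return log((a * d) / (b * c))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(col1, x) * comb(total - col1, row1 - x) / comb(total, row1)

    observed = prob(a)
    low, high = max(0, row1 + col1 - total), min(row1, col1)
    # Sum over all tables that are no more likely than the observed one.
    return sum(p for p in map(prob, range(low, high + 1)) if p <= observed * (1 + 1e-9))


# Profiles taken from the sample output above.
print(cases_in_profile("5BB326B8BB5C0000E008"))  # 31, matching N for G0010
print(cases_in_profile("CAA00E8AF50C00000800"))  # 21, matching N for G0014

# Hypothetical table: rows = pattern present/absent, columns = outcome 1/0.
print(log_odds_ratio(3, 1, 1, 3), fisher_exact_two_sided(3, 1, 1, 3))
```

A positive log odds ratio corresponds to a POS direction and a negative one to NEG, in line with the signs of the estimates in the sample rows.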
The text patterns discovered by TEPAPA are only mentions of text that correlate with the variables of interest. While these patterns may provide informative ``leads'' for research, they should NOT be regarded as definitive clinical evidence without domain expert interpretation and careful validation.
Rigorous confirmatory studies downstream of these exploratory analyses are ALWAYS REQUIRED before reaching a scientific conclusion for healthcare decision making. Misinterpretation and bias are well-known issues in EMR-based research.
The full documentation with different usages will accompany the v1.0 release.
Version 0.99 - 7 September, 2020
- Added survival analysis (logrank)
- Added option to specify minimum support required (-s) in exhaustive search
- Bug fixes
Version 0.9 - 29 July, 2017
- Initial release - 02 Nov 2016
Lin, F., Pokorny, A., Teng, C., Epstein, R.J. TEPAPA: a novel in silico feature learning pipeline for mining prognostic and associative factors from text-based electronic medical records. Sci Rep 2017; 7: 6918. doi:10.1038/s41598-017-07111-0