Frank Lin edited this page Aug 4, 2017 · 12 revisions

TEPAPA: Text-based Exploratory Pattern Analyser for Prognostic and Associative factor discovery (v0.9 - 02 Nov 2016)

Home URL:


TEPAPA is a feature learning pipeline that identifies differentiating patterns of text associated with an outcome of interest from a list of examples (i.e. supervised). It is purpose-built to mine knowledge from electronic medical records (EMR), generating hypotheses to drive scientific enquiries into the underlying biological/clinical processes.

Specifically, TEPAPA prioritises a list of candidate prognostic factors (e.g. factors associated with survival or adverse events of diseases) or disease-associated factors (e.g. risk factors or exposures) associated with a known clinical outcome, describing these variables as fragments of text (n-grams) or regular expressions.

The TEPAPA pipeline combines the following methods to discover patterns from EMR:

  • Semantic-free (i.e. shallow) Natural Language Processing (NLP) methods.
  • Sequence-based feature search methods:
      • Simple n-gram search
      • Combinatorial (deep) search
  • Statistical association testing.
  • Predictive regular expression induction methods
  • Feature filtering tools
  • Feature ranking tools
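
As a rough illustration of how n-gram search and statistical association testing fit together, the following Python sketch scores word n-grams against a binary outcome. It is not TEPAPA's implementation: the function names (`ngrams`, `fisher_exact_two_sided`, `associate`), the whitespace tokenisation, the natural-log odds ratio with a 0.5 continuity correction, and the toy corpus are all assumptions made for this example.

```python
from collections import defaultdict
from math import comb, log


def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) present in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, total = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability that x of the row-1 cases contain the pattern.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))


def associate(cases, n):
    """Score every n-gram against a 0/1 outcome; rows are sorted by p-value."""
    present = defaultdict(lambda: [0, 0])  # n-gram -> [cases with outcome 0, outcome 1]
    totals = [0, 0]
    for outcome, text in cases:
        totals[outcome] += 1
        for gram in ngrams(text.lower().split(), n):
            present[gram][outcome] += 1
    rows = []
    for gram, (neg, pos) in present.items():
        a, b = pos, totals[1] - pos  # outcome 1: with / without the pattern
        c, d = neg, totals[0] - neg  # outcome 0: with / without the pattern
        # Assumption of this sketch: a 0.5 correction avoids division by zero.
        lor = log(((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5)))
        rows.append((" ".join(gram), lor, fisher_exact_two_sided(a, b, c, d), a + c))
    return sorted(rows, key=lambda r: (r[2], r[0]))


# Invented toy corpus: (outcome, text) pairs, not real EMR data.
cases = [
    (1, "tumour at base of tongue"),
    (1, "lesion at base of tongue"),
    (1, "mass at base of tongue"),
    (0, "lesion of left oral tongue"),
    (0, "tumour of left oral tongue"),
    (0, "ulcer of the lip"),
]

for pattern, lor, p, n_cases in associate(cases, 3)[:5]:
    print(f"{pattern:20s} LOR={lor:+.3f} p={p:.3f} N={n_cases}")
```

Each output row carries the same kinds of fields as the real pipeline's report (pattern, estimate, p-value, number of cases), but the numbers from such a small corpus are only illustrative.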

TEPAPA is designed to be extensible. All pre-processing modules can be readily replaced by dedicated and/or more sophisticated tools for processing EMR text.

Please see the associated publications for detailed discussions.

System requirements

Hardware requirements:

  • A UNIX-like system (e.g. GNU/Linux).
  • Multiple processors will be used in the pattern search and regular expression induction.
  • A minimum of 4GB RAM is recommended, depending on the size of the corpus and the depth of search.
      • More than 16GB is recommended for general use.
      • Note that computation time is linear in the size of the corpus and highly dependent on the depth of search.

Run-time requirements:

  • Perl5 is used by the default text cleaner ( and tokenizer (
  • Lingua::Stem::En is required for word stemming / tokenization by (the default tokeniser)

Building TEPAPA

The tarball contains the following source tree:

  src/                           - Source code directory
  src/Makefile                   - Makefile of the TEPAPA binary
  bin/                           - output directory for the binary
  scripts/prep-*                 - Scripts for preprocessing
  scripts/tepapa-pl-simple.tpp   - Example script to run TEPAPA

To compile TEPAPA, you will need the following tools:

  • GNU make
  • gcc (version 5 or above) - must support C++11
  • libc++ (with C++11 support)
  • The following external libraries:
      • libgsl - GNU Scientific Library
      • libgslcblas - GNU Scientific Library, BLAS (Basic Linear Algebra Subprograms)
      • libpthread - GNU POSIX threading library
  • Perl5, with the following Perl5 module (downloadable from CPAN) required at compile time:
      • ExtUtils::Embed - for embedding Perl in C/C++ applications

Extract the source code:

# tar zxvf TEPAPA-0.9.tar.gz

Then run:

# cd TEPAPA-0.9/src; make

Running TEPAPA - Quick Start Guide

To run TEPAPA, you will need to:

  • Compile a list of cases with known variables of interest (e.g. binary outcomes or survival data), as in preparing for a case-control study.
  • Collect the EMR text associated with each case (i.e. the corpus).
  • Prepare a case file (below).
  • Prepare a pipeline script. The example script provided (scripts/tepapa-pl-simple.tpp) is a pipeline for extracting both binary and numeric features from the EMR corpus.


 # /scripts/tepapa-pl-simple.tpp    

Example usage:

  1. To identify prognostic and associative factors from the EMR text listed in the case definition file ``case.txt'' (see below for the format), and:
       • limit output to patterns with p<0.001 (filtering threshold),
       • reduce text patterns by occurrence profiles ('b'), and
       • sort output by direction of association ('d'), p-value ('p'), statistical estimates ('e'), and number of cases containing the given pattern ('n'):
# /scripts/tepapa-pl-simple.tpp case.txt 0.001 b dpen
  2. The same as above, but with less (-q) or more (-v [level]) verbose output:
# /scripts/tepapa-pl-simple.tpp case.txt 0.001 b dpen -q     # quiet output 
# /scripts/tepapa-pl-simple.tpp case.txt 0.001 b dpen -v 7   # verbose output with debug level 7

Format of case definition file

The case definition file ``case.txt'' states the outcome variable(s) and the corpus for feature mining. The format is:

score  raw_text_file 
score  raw_text_file 
score  raw_text_file 

where ``score'' is a numeric value indicating the outcome variable associated with each raw_text_file listed above.

For binary outcome variables (e.g. disease vs no disease), use 0 to indicate FALSE and 1 to indicate TRUE, such that:

0    cases/case0001.txt
0    cases/case0002.txt
1    cases/case0003.txt
1    cases/case0004.txt
1    cases/case0005.txt
1    cases/case0006.txt

For numeric outcome variables (e.g. survival), set the scores to reflect the clinical attribute of interest:

64   patients/case0001.txt      # case 1 survived 64 days 
132  patients/case0002.txt      # case 2 survived 132 days 
56   patients/case0003.txt      # case 3 survived 56 days 
245  patients/case0004.txt      # ... etc
11   patients/case0005.txt
15   patients/case0006.txt

where each ``case000x.txt'' file contains the raw EMR in plain text format.
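
The two case file layouts above can be read with a few lines of Python, which may be handy for checking a file before running the pipeline. This is a minimal sketch, not part of TEPAPA: `parse_case_file` and `is_binary` are hypothetical helper names, fields are assumed to be whitespace-separated, and treating `#` as the start of a comment simply mirrors the annotated example above (whether TEPAPA itself accepts such comments is not stated here).

```python
def parse_case_file(text):
    """Parse case definition lines into (score, raw_text_file) pairs."""
    cases = []
    for line in text.splitlines():
        # Assumption: anything after '#' is an annotation, as in the example above.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue  # skip blank or comment-only lines
        score, path = line.split(None, 1)
        cases.append((float(score), path.strip()))
    return cases


def is_binary(cases):
    """True when every score is 0 or 1, i.e. a binary outcome variable."""
    return all(score in (0.0, 1.0) for score, _ in cases)


example = """\
64   patients/case0001.txt      # case 1 survived 64 days
132  patients/case0002.txt      # case 2 survived 132 days
11   patients/case0005.txt
"""

parsed = parse_case_file(example)
print(parsed)
print("binary outcome" if is_binary(parsed) else "numeric outcome")
```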

(Note that survival censoring is not yet supported.)

Sample output

Group  Dir  Method  Estimate    P-value      N   Type   Occurrence profile    Pattern
G0010  POS  LOR.FET 2.2609404   5.25597e-05  31  NGRAM  5BB326B8BB5C0000E008  base  of  tongue
G0014  POS  LOR.FET 3.1388331   5.68851e-05  21  REGEX  CAA00E8AF50C00000800  of  the?  (right|left)?  (tonsil|tongue)
G0001  NEG  LOR.FET -3.5156652  0.000374902  8   REGEX  00000000000018C09120  (of  the)  left?  (oral  tongue)
  • Group - Internal grouping number specified after feature grouping with the <reduction_grouping> assignment
  • Dir - Direction of association (either POSitive or NEGative)
  • Method - Method for evaluating statistical associations:
      • LOR.FET: Log odds ratio, with P-value evaluated by Fisher's exact test.
      • AUC and AUROC2: Area under the receiver operating characteristic (ROC) curve.
      • SPEARMAN: Spearman's ρ
  • Estimate - The statistical estimate specified by the method above
  • P-value
  • N - Number of cases with mentions of Pattern in the corpus
  • Type - Type of text pattern evaluated:
      • NGRAM: n-gram
      • REGEX: regular expression
  • Occurrence profile - The hexadecimal ``heatmap'' indicating cases where the pattern is mentioned.
  • Pattern - The pattern of text, of the type described by Type
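
One way to read the occurrence profile is as a bitmap with one bit per case. That reading is an assumption rather than documented behaviour, but it is consistent with the sample output: in each of the three rows, the number of set bits equals N. The Python sketch below checks this and counts the cases shared by two patterns. The function names are hypothetical, and because the mapping from bit position to case is not specified here, the sketch only counts bits.

```python
def cases_in_profile(profile_hex):
    """Count the set bits in a hexadecimal occurrence profile."""
    return bin(int(profile_hex, 16)).count("1")


def shared_cases(profile_a, profile_b):
    """Count the bits set in both profiles (cases mentioning both patterns)."""
    return bin(int(profile_a, 16) & int(profile_b, 16)).count("1")


# (occurrence profile, N) pairs copied from the sample output above.
sample = [
    ("5BB326B8BB5C0000E008", 31),
    ("CAA00E8AF50C00000800", 21),
    ("00000000000018C09120", 8),
]

for profile, n in sample:
    print(profile, "set bits:", cases_in_profile(profile), "reported N:", n)

print("shared by first two patterns:", shared_cases(sample[0][0], sample[1][0]))
```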


The text patterns discovered by TEPAPA are only mentions of text correlating with the variables of interest. While these patterns may provide informative ``leads'' for research, they should NOT be regarded as definitive clinical evidence without domain expert interpretation and careful validation.

Rigorous confirmatory studies downstream of these exploratory analyses are ALWAYS REQUIRED before reaching a scientific conclusion for healthcare decision making. Wrong interpretations and biases are well-known issues in EMR-based research.

The full documentation with different usages will accompany the v1.0 release.


Lin, F., Pokorny, A., Teng, C., Epstein, R.J. TEPAPA: a novel in silico feature learning pipeline for mining prognostic and associative factors from text-based electronic medical records. Sci Rep 2017; 7: 6918. doi:10.1038/s41598-017-07111-0
