# Phase 2 - Utility annotations and endpoint values

In this notebook, we define and create some annotations that could be useful for the intended task.

First, we fill some wordlists/dictionaries with indicator words like "ORR", "survival rate" and "PFS". If the output contains no rows, then every sentence with an endpoint also contains at least one indicator, which is good enough for a start.

In [None]:
%inputDir data-nlp
%displayMode CSV
%csvConfig MissingIndSentence

TYPESYSTEM TrialsTypeSystem;
TYPESYSTEM DKProCoreTypeSystem;

DECLARE EndpointInd;
DECLARE EndpointInd ORRInd, OSInd, PFSInd;
WORDLIST orrIndList = "orr_ind.txt";
WORDLIST osIndList = "os_ind.txt";
WORDLIST pfsIndList = "pfs_ind.txt";

MARKFAST(ORRInd, orrIndList, true);
MARKFAST(OSInd, osIndList, true);
MARKFAST(PFSInd, pfsIndList, true);

// hotfix: merge sentences wrongly split at a broken character that shows up as a question mark
s1:Sentence{ENDSWITH(QUESTION)} s2:@Sentence{->UNMARK(s1),s2.begin=s1.begin};

DECLARE MissingIndSentence;
Sentence{CONTAINS(TrialsEntity),-CONTAINS(EndpointInd)-> MissingIndSentence};

COLOR(ORR, "#F07C62");
COLOR(OSMean, "#65FF5B");
COLOR(OSTime, "#51C849");
COLOR(OSRate, "#23C318");
COLOR(PFSMean, "#788AFF");
COLOR(PFSTime, "#707BC3");
COLOR(PFSRate, "#0020F5");
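
The reporting logic of this cell can be checked outside the Ruta kernel with a minimal Python sketch. It is not a translation of the rules. The term lists, the function names and the `(text, has_entity)` pairs are assumptions made for illustration: the real terms live in `orr_ind.txt`, `os_ind.txt` and `pfs_ind.txt`, and `has_entity` stands in for `CONTAINS(TrialsEntity)`.

```python
import re

# Hypothetical indicator terms; the notebook loads the real ones from
# orr_ind.txt, os_ind.txt and pfs_ind.txt.
INDICATORS = {
    "ORRInd": ["ORR", "overall response rate"],
    "OSInd": ["OS", "survival rate"],
    "PFSInd": ["PFS", "progression-free survival"],
}


def find_indicators(sentence):
    """Return the indicator kinds whose terms occur in the sentence.

    Matching ignores case, like the third argument of MARKFAST above.
    """
    found = set()
    for kind, terms in INDICATORS.items():
        for term in terms:
            if re.search(rf"\b{re.escape(term)}\b", sentence, re.IGNORECASE):
                found.add(kind)
    return found


def missing_indicator_sentences(sentences):
    """Keep sentences that carry an endpoint entity but no indicator term.

    Each item is a (text, has_entity) pair.
    """
    return [text for text, has_entity in sentences
            if has_entity and not find_indicators(text)]
```

An empty result from `missing_indicator_sentences` corresponds to an empty CSV output of the cell above.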


Next, we try to detect numeric values with our own rules and check their coverage against the POS_NUM annotations created by CoreNLP.

In [None]:
%resetCas
%inputDir data-nlp
%outputDir ./temp/num-out
%displayMode CSV
%csvConfig MissingNumSentence

TYPESYSTEM TrialsTypeSystem;
TYPESYSTEM DKProCoreTypeSystem;

TYPE RutaNUM = org.apache.uima.ruta.type.NUM;
DOUBLE num;

DECLARE NumericValue (DOUBLE value, DOUBLE min, DOUBLE max);

// annotate words that represent numbers
WORDTABLE NumberTable = "numbers.csv";
MARKTABLE(NumericValue, 2, NumberTable, true, 2, "", 2, "value" = 1);


ADDRETAINTYPE(WS);
// normal numbers like 1,000.95
(RutaNUM{-PARTOF(NumericValue)} (COMMA RutaNUM{REGEXP("...")}) 
    (PERIOD RutaNUM)?){PARSE(num, "en")-> nv:NumericValue, nv.value=num};
(RutaNUM{-PARTOF(NumericValue)} (PERIOD RutaNUM)?){PARSE(num, "en")-> nv:NumericValue, nv.value=num};
(PERIOD{-PARTOF(NumericValue)} RutaNUM){PARSE(num, "en")-> nv:NumericValue, nv.value=num};

// combined numbers like twenty-two
(nv1:NumericValue{PARTOF(W)-> UNMARK(nv1)} 
    SPECIAL.ct=="-" 
    nv2:NumericValue{PARTOF(W)-> UNMARK(nv2)}){-> nv:NumericValue, nv.value = (nv1.value+nv2.value)};
// intervals like 39-54
(nv1:NumericValue{-> UNMARK(nv1)} SPECIAL?
    SPECIAL.ct=="-" 
    nv2:@NumericValue{-> UNMARK(nv2)}){-> new:NumericValue, new.min=nv1.value, new.max=nv2.value};
REMOVERETAINTYPE(WS);

DECLARE MissingNumSentence, MissingNum;
POS_NUM{-IS(NumericValue)->MissingNum};

// we ignore some in the reporting
// no roman numbers
m:MissingNum{IS(CAP) -> UNMARK(m)};
m:MissingNum{IS(CW), REGEXP(".") -> UNMARK(m)};
// no 1990s
m:MissingNum{STARTSWITH(RutaNUM),ENDSWITH(SW) -> UNMARK(m)};
// no slashes or fractions ... for now
m:MissingNum{ -> UNMARK(m)}<-{SPECIAL{REGEXP("[+/]")};COLON;};
// no negative numbers
m:MissingNum{STARTSWITH(SPECIAL) -> UNMARK(m)}<-{SPECIAL{REGEXP("[-]")};};

Sentence{CONTAINS(MissingNum)-> MissingNumSentence};

COLOR(NumericValue, "lightgreen");
COLOR(MissingNum, "pink");
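
To sanity-check the intended behaviour of the number rules outside the Ruta kernel, the following is a minimal Python sketch, not a translation of the rules. It covers three cases named in the comments: grouped numbers such as 1,000.95, hyphenated number words such as twenty-two, and intervals such as 39-54. The function names, the returned dictionary and the small `NUMBER_WORDS` table are assumptions made for illustration; the real word values come from `numbers.csv`.

```python
import re

# Hypothetical excerpt; the notebook reads these values from numbers.csv.
NUMBER_WORDS = {"two": 2.0, "three": 3.0, "twenty": 20.0, "thirty": 30.0}

# Digits with optional thousands groups and an optional decimal part,
# or a bare decimal fraction such as ".5".
NUMBER = r"(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?|\.\d+"


def to_float(token):
    """Convert a digit string such as '1,000.95' to a float."""
    return float(token.replace(",", ""))


def parse_numeric(text):
    """Return value/min/max for one numeric mention, or None."""
    text = text.strip()
    # Interval like 39-54: keep both bounds and leave value unset
    # (an assumption of this sketch).
    m = re.fullmatch(rf"({NUMBER})\s*-\s*({NUMBER})", text)
    if m:
        return {"value": None,
                "min": to_float(m.group(1)),
                "max": to_float(m.group(2))}
    # Plain number like 1,000.95 or .5
    if re.fullmatch(NUMBER, text):
        return {"value": to_float(text), "min": None, "max": None}
    # Number words, optionally combined like twenty-two (values are summed).
    parts = text.lower().split("-")
    if 1 <= len(parts) <= 2 and all(p in NUMBER_WORDS for p in parts):
        return {"value": sum(NUMBER_WORDS[p] for p in parts),
                "min": None, "max": None}
    return None
```

Mentions that none of the patterns cover, such as "1990s", return `None`, which roughly corresponds to a `MissingNum` candidate before the exclusions are applied.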

We inspect the values of some NumericValue annotations. The previous cell must be evaluated first so that the type systems are available.

In [None]:
%loadCas temp/num-out/18720480.txt.xmi
%displayMode DYNAMIC_HTML

The numeric values are not perfect, but good enough for now as far as the endpoints are concerned. Next, we try to detect percentages and durations, which are the values of the endpoints. If the output contains no rows, then we were able to annotate a value for every endpoint.

In [None]:
%resetCas
%inputDir data-nlp
%outputDir ./temp/value-out
%displayMode CSV
%csvConfig MissingValueSentence
//%saveTypeSystem ./TypeSystem.xml

TYPESYSTEM TrialsTypeSystem;
TYPESYSTEM DKProCoreTypeSystem;

DECLARE TimeInd (STRING kind);
DECLARE NumericValue (DOUBLE value, DOUBLE min, DOUBLE max, DOUBLE var);
DECLARE Unit (STRING kind);
DECLARE Value (NumericValue value, Unit unit);

// hotfix sentences
s1:Sentence{ENDSWITH(QUESTION)} s2:@Sentence{->UNMARK(s1),s2.begin=s1.begin};

// copied rules from above
TYPE RutaNUM = org.apache.uima.ruta.type.NUM;
DOUBLE num;
WORDTABLE NumberTable = "numbers.csv";
MARKTABLE(NumericValue, 2, NumberTable, true, 2, "", 2, "value" = 1);

BLOCK(NumericValues) Document{}{
    // normal numbers like 1,000.95
    ADDRETAINTYPE(WS);
    (RutaNUM{-PARTOF(NumericValue)} (COMMA RutaNUM{REGEXP("...")}) 
        (PERIOD RutaNUM)?){PARSE(num, "en")-> nv:NumericValue, nv.value=num};
    (RutaNUM{-PARTOF(NumericValue)} (PERIOD RutaNUM)?){PARSE(num, "en")-> nv:NumericValue, nv.value=num};
    (PERIOD{-PARTOF(NumericValue)} RutaNUM){PARSE(num, "en")-> nv:NumericValue, nv.value=num};

    // like twenty-two
    (nv1:NumericValue{PARTOF(W)-> UNMARK(nv1)} 
        SPECIAL.ct=="-" 
        nv2:NumericValue{PARTOF(W)-> UNMARK(nv2)}){-> nv:NumericValue, nv.value = (nv1.value+nv2.value)};
    // intervals like 39-54
    (nv1:NumericValue{-> UNMARK(nv1)} SPECIAL?
        SPECIAL.ct=="-" 
        nv2:@NumericValue{-> UNMARK(nv2)}){-> new:NumericValue, new.min=nv1.value, new.max=nv2.value};
    
    // NEW: we also need to detect variance like 3+/-0.4
    (nv1:@NumericValue{-> nv1.var=nv2.value,nv1.end=nv2.end} "+/-" nv2:NumericValue{-> UNMARK(nv2)});
    
    REMOVERETAINTYPE(WS);
}

// now to the actual rules

// indicators for durations like months
WORDTABLE TimeIndTable = "time_ind.csv";
MARKTABLE(TimeInd, 1, TimeIndTable, "kind"=2);

// something that could hint at an arm
DECLARE ArmInd;
// we should probably refactor this to a dictionary
(W{REGEXP("arm", true)} W{REGEXP("[abc]", true)} RutaNUM? COLON?){-> ArmInd};

// ignore text in brackets
DECLARE InBrackets;
(SPECIAL.ct=="(" ANY.ct!=")"[1,25] SPECIAL.ct==")"){-> InBrackets};
ADDFILTERTYPE(InBrackets);

// annotate the actual Value
// 10%
(nv:NumericValue SPECIAL.ct=="%"{-> u:Unit,u.kind="percent"}){-> v:Value, v.value=nv, v.unit=u};
// 12 months
(nv:NumericValue SPECIAL.ct=="-"? ti:TimeInd{-> u:Unit,u.kind=ti.kind}){-> v:Value, v.value=nv, v.unit=u};

// chunks that could be an arm indicator
Value (POS_ADP{-REGEXP("in")} W[1,2]{-PARTOF(TimeInd),-PARTOF(POS_CONJ),-PARTOF(NumericValue)}){-> ArmInd};
(POS_ADP{-REGEXP("in")} W[1,2]{-PARTOF(TimeInd),-PARTOF(POS_CONJ),-PARTOF(NumericValue)}){-> ArmInd} POS_CONJ @Value;

// now some additional logic for combined mentions
DECLARE Enum;
DECLARE VSInd;
// we could add a wordlist dictionary, but for now we simply classify the words
(W{REGEXP("v|vs|versus")} PERIOD?){-> VSInd};

// 25 vs. 8%
(nv1:NumericValue{-PARTOF(Value)-> v:Value, v.value=nv1, v.unit=v2.unit}
    VSInd v2:Value){-> Enum};
// 2, 3, and 4 months
((NumericValue{-PARTOF(Value) -> v:Value, Value.value=NumericValue, Value.unit=v2.unit} COMMA?)+ 
    W{REGEXP("and")} v2:@Value){->Enum};
// 2- and 3 months
((nv1:NumericValue{-PARTOF(Value)} SPECIAL.ct=="-"?){-> v:Value, v.value=nv1, v.unit=v2.unit}
    W{REGEXP("and")} v2:Value){->Enum};

// no unit? like "was 0.89"
W{REGEXP("was")} nv:@NumericValue{-PARTOF(Value), nv.value > 0, nv.value < 1 -> u:Unit, u.kind="percent", v:Value, v.value=nv, v.unit=u};

// even more distant combinations
ADDFILTERTYPE(ArmInd,COMMA,POS_CONJ);
v:Value nv:NumericValue{-PARTOF(Value)-> new:Value, new.value=nv, new.unit=v.unit};
nv:NumericValue{-PARTOF(Value)-> new:Value, new.value=nv, new.unit=v.unit} v:@Value ;

// some clean up of false positives
DECLARE NoValueContextInd;
W{REGEXP("patients?", true)->NoValueContextInd};
v:Value{-> UNMARK(v)} NoValueContextInd;

DECLARE MissingValueSentence, MissingValue;
TrialsEntity{-IS(Value)-> MissingValue};
Sentence{CONTAINS(MissingValue)-> MissingValueSentence};

//COLOR(NumericValue, "#F0F050");
COLOR(ArmInd, "#F0F0A0");
COLOR(InBrackets, "#F0F0F0");
COLOR(MissingValue, "red");
COLOR(Value, "lightgreen");
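
The unit propagation for combined mentions ("25 vs. 8%", "2, 3, and 4 months", "2- and 3 months") can be illustrated with a minimal Python sketch. It only mirrors the idea that every number in a group inherits the unit of the last one. It ignores intervals, arms, brackets and the unit-less "was 0.89" case. The `TIME_KINDS` table and the function name are assumptions for illustration; the notebook reads the time indicators and their kinds from `time_ind.csv`.

```python
import re

# Hypothetical excerpt; the real surface forms and kinds come from time_ind.csv.
TIME_KINDS = {"month": "month", "months": "month",
              "week": "week", "weeks": "week",
              "year": "year", "years": "year"}

NUM = r"\d+(?:\.\d+)?"
WORDS = "|".join(sorted(TIME_KINDS, key=len, reverse=True))
UNIT = rf"%|(?:{WORDS})\b"
# Separators between numbers: comma, "and", or a versus marker,
# optionally preceded by a hyphen as in "2- and 3 months".
SEP = r"\s*-?\s*(?:,\s*and|,|and|versus|vs\.?|v\.?)\s*"
# One or more numbers followed by a single unit, e.g. "25 vs. 8%".
ENUM = re.compile(rf"({NUM}(?:{SEP}{NUM})*)\s*-?\s*({UNIT})", re.IGNORECASE)


def extract_values(text):
    """Return (number, unit kind) pairs; grouped numbers share the last unit."""
    values = []
    for m in ENUM.finditer(text):
        unit = m.group(2).lower()
        kind = "percent" if unit == "%" else TIME_KINDS[unit]
        for number in re.findall(NUM, m.group(1)):
            values.append((float(number), kind))
    return values
```

For example, `extract_values("ORR was 25 vs. 8%")` yields two percent values, which is the effect the `Enum` rules aim for.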



We have found a value for every endpoint span, and potentially also some false positives. Now, we simply list the Values to get an overview of what we actually extract, and can fine-tune some of the rules above if necessary.

In [None]:
%inputDir ./temp/value-out
%outputDir ./temp/trash
%displayMode CSV
%csvConfig Value unit.kind value.value value.min value.max value.var   