# Date detection

This notebook is the first of three notebooks that are used to segment scientific references.

This notebook detects `Date` annotations as an initial step. We start with dates because they are probably the easiest annotations to detect, and they can also help with the detection of the other annotations.

The first cell contains the rules for the detection of `Date`s. The input is set to the training documents. These files (XMIs) contain complete reference sections as well as gold annotations for Author, Date, Title, Venue and Reference. We configure the cell to evaluate `Date` annotations and to display the evaluation results. Additionally, we store the content of the cell as a Ruta script, together with the corresponding type system.

We encourage you to remove some rules and investigate the effect...
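
To get a feel for what a single rule contributes before removing it, the year-candidate idea behind `YearInd` can be tried outside the Ruta kernel. The following is a minimal Python sketch, not Ruta code: the function `year_candidates` and the token list are hypothetical, and it assumes that the pattern `19..|20..` has to match the whole number token and that an optional following letter `a`, `b` or `c` is merged into the candidate.

```python
import re

# Same pattern as in the YearInd rule: "19" or "20" followed by any two characters.
# Assumption: the pattern must match the complete token text.
YEAR = re.compile(r"19..|20..")


def year_candidates(tokens):
    """Return (index, text) pairs for tokens that look like a year.

    A single letter a, b or c directly after the number is merged
    into the candidate, as in "1999a".
    """
    found = []
    for i, tok in enumerate(tokens):
        if tok.isdigit() and YEAR.fullmatch(tok):
            text = tok
            if i + 1 < len(tokens) and re.fullmatch(r"[abc]", tokens[i + 1]):
                text += tokens[i + 1]
            found.append((i, text))
    return found


# Hypothetical, pre-tokenized fragment of a reference.
tokens = ["Smith", ",", "J", ".", "(", "1999", "a", ")", ".",
          "pp", ".", "2041", "-", "2050"]
print(year_candidates(tokens))
# prints [(5, '1999a'), (11, '2041'), (13, '2050')]
```

The page range `2041-2050` also produces candidates, which illustrates why the rule set needs additional steps that remove false positives based on the surrounding context.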

#### Let's first inspect a single training example with gold standard information

In [None]:
%displayMode RUTA_COLORING
%loadCas data/train/A00-1042.txt.xmi
%typeSystemDir typesystems/
TYPESYSTEM ReferencesTypeSystem;
COLOR(Date,"lightgreen");

#### Let's now write some rules to detect dates

We use the `EVALUATION` display mode, which compares the generated `Date` annotations with the gold standard information.
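
As a rough illustration of what such a comparison computes, here is a minimal Python sketch (not part of the Ruta kernel). It assumes that a predicted annotation counts as correct only if its begin and end offsets are identical to those of a gold annotation; the matching criterion actually used by the kernel may differ. The function `evaluate` and the offsets are hypothetical.

```python
def evaluate(predicted, gold):
    """Compare predicted and gold spans given as (begin, end) offsets.

    Assumption: only exact offset matches count as true positives.
    """
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)   # spans found in both sets
    fp = len(predicted - gold)   # predicted, but not in the gold standard
    fn = len(gold - predicted)   # in the gold standard, but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical offsets for one document.
gold = [(10, 16), (52, 57), (90, 95)]
predicted = [(10, 16), (52, 56), (120, 124)]
result = evaluate(predicted, gold)
print(result["tp"], result["fp"], result["fn"])
# prints 1 2 2
```

Under this strict criterion, a date whose boundary is off by one character, such as `(52, 56)` versus `(52, 57)`, counts as both a false positive and a false negative. This is one reason why several rules in the next cell only adjust the begin or end of an existing `Date`.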

In [None]:
%inputDir data/train
%outputDir temp/out_date_train

// Evaluation mode for Date annotations
%displayMode EVALUATION
%evalTypes Date

// Writing this script and the TypeSystem
%writescript temp/Date.ruta
%saveTypeSystem typesystems/DateTypeSystem.xml

TYPESYSTEM ReferencesTypeSystem;

DECLARE MonthInd, YearInd;
WORDLIST MonthList = "resources/months.txt";
MARKFAST(MonthInd, MonthList);

DECLARE NoDatePrefix;
WORDLIST NoDatePrefixList = "resources/no_date_prefix.txt";
MARKFAST(NoDatePrefix, NoDatePrefixList);

BLOCK(Date) Reference{}{
    
    // some simple candidates for dates
    NUM{REGEXP("19..|20..")-> YearInd};
    // some dates also have an additional char
    y:@YearInd{-> y.end=sw.end} sw:SW{REGEXP("[abc]")};
    
    // create dates using YearInd
    (SPECIAL.ct=="(" @YearInd SPECIAL.ct==")" COMMA?){-> Date};
    (@YearInd{-PARTOF(Date)} PERIOD){-> Date};
    (@YearInd{-PARTOF(Date)} COMMA[0,2]){-> Date};
    
    // expand Dates based on context
    m:MonthInd (NUM SPECIAL)? NUM COMMA? d:@Date{-> d.begin = m.begin};
    s:NUM SPECIAL NUM MonthInd d:@Date{-> d.begin = s.begin};
    m:MonthInd d:@Date{-> d.begin = m.begin};
    w1:W{INLIST(MonthList, ""+w1.ct+w2.ct)} SPECIAL w2:W d:@Date{-> d.begin = w1.begin};
    
    s1:SPECIAL.ct=="(" d:@Date{-> d.begin=s1.begin,d.end=s2.end} s2:SPECIAL.ct==")";
    d:@Date{-> d.end=e.end} e:PERIOD;
    b:NUM{REGEXP(".|[12].|3[01]")} d:@Date{STARTSWITH(YearInd)-> d.begin=b.begin};
    
    // remove some false positive dates
    NoDatePrefix SPECIAL? d:@Date{-> UNMARK(d)};
    d1:Date{-> UNMARK(d1)} SPECIAL NUM;
    NUM SPECIAL d2:@Date{-> UNMARK(d2)};
}

// if there are several dates in one reference, try to remove some potential false positives
Reference{CONTAINS(Date,2,100)} -> {
    CAP SPECIAL? d:@Date{-> UNMARK(d)};
};
Reference{CONTAINS(Date,2,100)} -> {
    d:@Date{-> UNMARK(d)} SPECIAL? CAP;
};


In the second cell, we use the results of the previous cell, i.e. the CAS files in the `temp/out_date_train` folder, to investigate the remaining problems. To do so, we set the display mode to `CSV` and summarize `BadReference` annotations, i.e. references that contain at least one evaluation error (`FalsePositive` or `FalseNegative`). The evaluation annotations are highlighted using different colors.

In [None]:
%inputDir temp/out_date_train
%outputDir temp/unused
%displayMode CSV
%csvConfig BadReference

DECLARE BadReference;
Reference{OR(CONTAINS(FalsePositive),CONTAINS(FalseNegative)),-PARTOF(BadReference)-> BadReference};

COLOR(TruePositive, "lightgreen");
COLOR(FalsePositive, "lightblue");
COLOR(FalseNegative, "pink");

#### Evaluation on validation data set

In this cell, we load the rules of the first cell (`Date.ruta`) and apply them to the validation documents. This lets us investigate how well the rules generalize to other examples. The evaluation results are displayed as before, and the resulting annotated CAS files are stored in the folder `temp/out_date_validation`.

In [None]:
%inputDir data/validation
%outputDir temp/out_date_validation
%displayMode EVALUATION
%evalTypes Date
%scriptDir temp/

SCRIPT Date;
CALL(Date);

Now, we investigate the falsely annotated Dates in the validation set.

In [None]:
%inputDir temp/out_date_validation
%outputDir temp/unused
%displayMode CSV
%csvConfig BadReference

DECLARE BadReference;
Reference{OR(CONTAINS(FalsePositive),CONTAINS(FalseNegative)),-PARTOF(BadReference)-> BadReference};

COLOR(TruePositive, "lightgreen");
COLOR(FalsePositive, "lightblue");
COLOR(FalseNegative, "pink");

#### Results on test data set

Finally, we also evaluate the rules on the test set but without investigating the resulting annotations.

In [None]:
%inputDir data/test
%displayMode EVALUATION
%evalTypes Date

SCRIPT Date;
CALL(Date);