# Title

In this notebook, we annotate the title of each reference using the information extracted in the previous exercises (dates and authors). We apply the same paradigm as before: the start of the title is detected from the other entities, and the end of the title is detected from characters such as periods, together with additional context.

#### Let's first inspect a single training example with gold standard information

In [None]:
%displayMode RUTA_COLORING
%loadCas data/train/A00-1042.txt.xmi
%typeSystemDir typesystems/
TYPESYSTEM ReferencesTypeSystem;
COLOR(Title,"lightgreen");

#### Let's now write some rules to detect titles

In [None]:
%inputDir data/train
%outputDir temp/out_title_train

// Evaluation mode for Title annotations
%displayMode EVALUATION
%evalTypes Title

// Paths for resources. The scripts for detecting Date and Author annotations
// were written to temp/ when executing exercise 1 and exercise 2.
%scriptDir temp/
%typeSystemDir typesystems/

// Write this script and its type system to disk
%writescript temp/Title.ruta
%saveTypeSystem typesystems/TitleTypeSystem.xml

TYPESYSTEM ReferencesTypeSystem;
SCRIPT Date;
SCRIPT Author;

// only try to find dates and authors if there aren't any yet
Document{-CONTAINS(Date)-> CALL(Date)};
Document{-CONTAINS(Author)-> CALL(Author)};

INT maxLength = 50;
DECLARE TitleStart, TitleStop, Quote, Quoted;

BLOCK(utils) Reference{}{
    
    // detect quoted text as it specifies the title in some formats
    SPECIAL{REGEXP("\"")->Quote};
    (Quote ANY[1,maxLength]{-PARTOF(Quote)} Quote){-> Quoted};
    
    // the title normally starts after the Author or Date
    (Author Date?){-> MARKLAST(TitleStart)};
    Reference{-CONTAINS(TitleStart), -CONTAINS(Author)} ->{
        // in case there was no author but an editor or something else 
        (AuthorStop Date?){-> MARKLAST(TitleStart)};
    };
    
    // the title normally stops with a period
    PERIOD{-PARTOF(Author),-PARTOF(Title)-> TitleStop};
    
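    // some specific periods are excluded:
    // a period between two numbers
    NUM ts:@TitleStop{-> UNMARK(ts)} NUM;
    // the period that ends the reference
    ts:TitleStop{ENDSWITH(Reference)-> UNMARK(ts)};
    // any period located before the start of the title
    ts:TitleStop{-> UNMARK(ts)} # TitleStart;
    // the period after the abbreviation "pp" (case-insensitive match)
    W{REGEXP("pp", true)} ts:@TitleStop{-> UNMARK(ts)};
    // a period directly between two lowercase words, without any whitespace
    sw1:SW ts:@TitleStop{sw1.end==ts.begin,ts.end==sw2.begin-> UNMARK(ts)} sw2:SW;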
    
    // if we did not find any, maybe it ends with a comma
    Reference{-CONTAINS(TitleStop)} ->{
        TitleStart # COMMA{->TitleStop};
    };
}

BLOCK(Title) Reference{}{
    // annotate titles either based on quoted text or based on start and stop (boundaries)
    TitleStart Quoted{-PARTOF(Title)-> Title};
    TitleStart (ANY[1,maxLength]{-PARTOF(TitleStop),-PARTOF(Title)} TitleStop){-> Title};
    
    // extend the span to include a directly following period or comma
    t:Title{->t.end=e.end} e:ANY{PARTOF({PERIOD,COMMA})};
    
    // remove duplicates just in case
    t1:Title{CONTAINS(Title,2,100)}->{t2:Title{t2!=t1-> UNMARK(t2)};};
}



#### Error analysis on training data set

In [None]:
%inputDir temp/out_title_train
%outputDir temp/unused
%displayMode CSV
%csvConfig BadReference

DECLARE BadReference;
Reference{OR(CONTAINS(FalsePositive),CONTAINS(FalseNegative)),-PARTOF(BadReference)-> BadReference};

COLOR(TitleStart, "green");
COLOR(TitleStop, "red");
COLOR(TruePositive, "lightgreen");
COLOR(FalsePositive, "lightblue");
COLOR(FalseNegative, "pink");

#### Results and error analysis on validation data set

In [None]:
%inputDir data/validation
%outputDir temp/out_title_validation
%displayMode EVALUATION
%evalTypes Title

SCRIPT Title;
CALL(Title);

In [None]:
%inputDir temp/out_title_validation
%outputDir temp/unused
%displayMode CSV
%csvConfig BadReference

DECLARE BadReference;
Reference{OR(CONTAINS(FalsePositive),CONTAINS(FalseNegative)),-PARTOF(BadReference)-> BadReference};

COLOR(TitleStart, "green");
COLOR(TitleStop, "red");
COLOR(TruePositive, "lightgreen");
COLOR(FalsePositive, "lightblue");
COLOR(FalseNegative, "pink");

#### Results on test data set

In [None]:
%inputDir data/test
%displayMode EVALUATION
%evalTypes Title

SCRIPT Title;
CALL(Title);