# Exercise 3: Sectioning

The goal of this exercise is to create a simple rule script for segmenting documents by their headlines. The resulting annotation named "Section" should contain two features, "headline" and "content". The "headline" feature should contain an annotation of the type "Headline", which specifies the headline of the section. The "content" feature should contain an annotation of the type "Content", which specifies the content of the section. The offsets of the "Section" annotation should cover the headline as well as the content.

We declare the required types and specify rules that process all input documents of the project as correctly as possible. Subsections, listings, tables, figures, headers, and footers are not considered.

In [None]:
%inputDir input_sectioning
%displayMode CSV
%csvConfig DocumentAnnotation

ENGINE org.apache.uima.ruta.engine.PlainTextAnnotator;
TYPESYSTEM org.apache.uima.ruta.engine.PlainTextTypeSystem;

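// Run the PlainTextAnnotator to create Line and EmptyLine annotations.
EXEC(PlainTextAnnotator, {Line, EmptyLine});
DECLARE FreeLine, LineFree;
// Make whitespace visible so that line breaks and empty lines can be matched.
RETAINTYPE(WS);
// FreeLine: a line directly preceded by an empty line.
EmptyLine Line{-> FreeLine};
// LineFree: a line followed by an empty line.
Line{-> LineFree} BREAK[1,2] @EmptyLine;
// Strip leading and trailing whitespace from the line annotations.
Line{-> TRIM(WS)};
FreeLine{-> TRIM(WS)};
LineFree{-> TRIM(WS)};
// Restore the default filtering.
RETAINTYPE;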

DECLARE Headline, Content, HeadlineCandidate;
DECLARE Section (Headline headline, Content content);

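// Collect headline candidates line by line.
FOREACH(line) Line {} {
    // Lines that start and end with an all-uppercase word.
    line{STARTSWITH(CAP), ENDSWITH(CAP) -> HeadlineCandidate};
    // Lines that begin with a number followed by words (numbered headlines).
    line{->HeadlineCandidate}<-{NUM{STARTSWITH(Line)} W+;};
}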

// Mark the first token of the document.
DECLARE OverallFirst;
MARKFIRST(OverallFirst);

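// Discard candidates that contain two or more special characters or numbers,
// or that end with a punctuation mark other than a colon.
HeadlineCandidate{CONTAINS(SPECIAL,2,100) -> UNMARK(HeadlineCandidate)};
HeadlineCandidate{CONTAINS(NUM,2,100) -> UNMARK(HeadlineCandidate)};
HeadlineCandidate{ENDSWITH(PM), -ENDSWITH(COLON) -> UNMARK(HeadlineCandidate)};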

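// Promote candidates to headlines. Inside the block, Document refers to the current candidate.
BLOCK(eachLine) HeadlineCandidate {} {
    // The candidate consists entirely of all-uppercase words.
    Document{-PARTOF(Headline), CONTAINS(CAP,100,100,true)-> Headline};
    // The candidate has an empty line before and after it.
    Document{IS(FreeLine), IS(LineFree) -> Headline};
    // The candidate is the first line of the document and is followed by an empty line.
    Document{STARTSWITH(OverallFirst), IS(LineFree) -> Headline};
}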

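BOOLEAN hlNum;

// Fallback for documents in which no headline has been found so far.
BLOCK(failedYet) Document{-CONTAINS(Headline)} {
    // A numbered "Introduction" candidate indicates numbered headlines.
    HeadlineCandidate{REGEXP("\\d\\s*Introduction") -> hlNum = true};
    BLOCK(failedYet) Document{hlNum} {
        // Keep only the candidates that start with a number and promote them.
        HeadlineCandidate{-STARTSWITH(NUM) -> UNMARK(HeadlineCandidate)};
        HeadlineCandidate{-> Headline};
    }
}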


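// Create a section from each headline and the text up to the next headline.
(Headline #{-> Content})
    {-> CREATE(Section,
        "headline" = Headline,
        "content" = Content)} Headline;

// Create the last section from the remaining headline and the rest of the document.
(Headline{-PARTOF(Section)} #{-> Content})
    {-> CREATE(Section,
        "headline" = Headline,
        "content" = Content)};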

COLOR(Headline, "lightgreen");