# Exercise 3: Sectioning

The goal of this exercise is to create a simple rule script that segments documents by their headlines. The resulting annotation, named `Section`, should have two features, `headline` and `content`. The `headline` feature should hold an annotation of type `Headline` covering the section's headline, and the `content` feature should hold an annotation of type `Content` covering the section's content. The offsets of the `Section` annotation should span both the headline and the content.

We declare the required types and specify rules that process all input documents of the project as correctly as possible. Subsections, listings, tables, figures, headers and footers are not considered.

In [None]:
%loadDocument data/ex3_sectioning/example1.txt
// We encourage you to also try out example2.txt and example3.txt

// Loading and applying the external PlainTextAnnotator for creating "Line" and "EmptyLine" annotations
ENGINE org.apache.uima.ruta.engine.PlainTextAnnotator;
TYPESYSTEM org.apache.uima.ruta.engine.PlainTextTypeSystem;
EXEC(PlainTextAnnotator, {Line, EmptyLine});

// Auxiliary types:
//   FreeLine ("Free and then a Line") is a Line after an empty (free) Line
//   LineFree ("Line and then a Free") is a Line before an empty (free) Line
DECLARE FreeLine, LineFree;
RETAINTYPE(WS);
EmptyLine Line{-> FreeLine};
Line{-> LineFree} BREAK[1,2] @EmptyLine;
Line{-> TRIM(WS)};
FreeLine{-> TRIM(WS)};
LineFree{-> TRIM(WS)};
RETAINTYPE;
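
// Illustration only: a hypothetical six-line document (not one of the example files),
// labeled by applying the two definitions above. "<empty>" stands for an EmptyLine.
// The actual result depends on how PlainTextAnnotator splits the input.
//   1: TITLE        -> LineFree             (followed by an empty line)
//   2: <empty>
//   3: METHODS      -> FreeLine, LineFree   (empty line before and after)
//   4: <empty>
//   5: Some text.   -> FreeLine             (preceded by an empty line)
//   6: More text.   -> neither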

DECLARE Headline, Content, HeadlineCandidate;
DECLARE Section (Headline headline, Content content);

// Candidates for Headlines are lines which start and end with fully capitalized words 
// or that start with a number
FOREACH(line) Line {} {
    line{STARTSWITH(CAP), ENDSWITH(CAP) -> HeadlineCandidate};
    line{->HeadlineCandidate}<-{NUM{STARTSWITH(Line)} W+;};
}

// We discard Headline candidates that contain two or more special characters
// or two or more numbers, or that end with a punctuation mark other than a colon
HeadlineCandidate{CONTAINS(SPECIAL,2,100) -> UNMARK(HeadlineCandidate)};
HeadlineCandidate{CONTAINS(NUM,2,100) -> UNMARK(HeadlineCandidate)};
HeadlineCandidate{ENDSWITH(PM), -ENDSWITH(COLON) -> UNMARK(HeadlineCandidate)};
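// Illustration only: hypothetical candidate lines (assumed, not taken from the example
// files) and how the three filters above are intended to treat them:
//   "1 Introduction"             kept:      one NUM, no SPECIAL, no trailing punctuation
//   "2 Related Work:"            kept:      ends with a COLON, which is explicitly allowed
//   "METHODS 1 AND 2 COMPARED"   discarded: contains two NUM tokens
//   "INPUT - OUTPUT - SUMMARY"   discarded: contains two SPECIAL tokens (the hyphens)
//   "3 apples were eaten."       discarded: ends with a PERIOD, a PM that is not a COLON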

// OverallFirst is the first token in the complete document - it requires special treatment
DECLARE OverallFirst;
MARKFIRST(OverallFirst);

// We convert HeadlineCandidates to Headline if one of the three conditions is satisfied
BLOCK(eachLine) HeadlineCandidate {} {
    // 1) The candidate is not already a Headline and contains ONLY fully capitalized words
    Document{-PARTOF(Headline), CONTAINS(CAP,100,100,true)-> Headline};
    
    // 2) The candidate is preceded and followed by an empty Line
    Document{IS(FreeLine), IS(LineFree) -> Headline};
    
    // 3) It is the first line in the document and followed by an empty Line
    Document{STARTSWITH(OverallFirst), IS(LineFree) -> Headline};
}

// Fallback rule:
// This rule targets example3.txt, where the rules above do not annotate any Headline.
// If the document contains no Headline yet but has a candidate such as "1 Introduction",
// we generate Headlines from the numbering instead: every remaining candidate that starts
// with a number, e.g. "1 Introduction" or "2 Related Work", is annotated as a Headline
BOOLEAN hlNum;
BLOCK(failedYet) Document{-CONTAINS(Headline)} {
    HeadlineCandidate{REGEXP("\\d\\s*Introduction") -> hlNum = true};
    BLOCK(failedYet) Document{hlNum} {
        HeadlineCandidate{-STARTSWITH(NUM) -> UNMARK(HeadlineCandidate)};
        HeadlineCandidate{-> Headline};
    }
}

// Finally we generate Sections from one Headline to the next
(Headline #{-> Content})
    {-> CREATE(Section,
        "headline" = Headline, 
        "content" = Content)} Headline;

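// The last Headline has no following Headline and is therefore not yet part of a Section:
// its Content extends to the end of the document
(Headline{-PARTOF(Section)} #{-> Content})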
    {-> CREATE(Section,
        "headline" = Headline, 
        "content" = Content)};

COLOR(Headline,"lightgreen");