# Exercise 2: Sequential Patterns

This exercise introduces how annotations are combined in sequential patterns.

#### Setup

First, we define some input text for the following examples.

In [None]:
%%documentText
The dog barked at the cat.
Dogs, cats and mice are mammals.
There are 12 tuna swimming in the sea.
This text was created approx. on 13.04.2021.

Then, we provide some initial Animal annotations using a dictionary.

In [None]:
DECLARE Animal;
WORDLIST AnimalList = 'animals.txt';
MARKFAST(Animal, AnimalList, true);

This annotates all animals in the text (dog, cat, dogs, cats, mice, tuna). The annotations are not visible as we did not execute the statement `COLOR(Animal, "lightgreen");`.

### Sequential Patterns

#### Simple pattern: An animal preceded by a number

In Exercise 1, you saw how to annotate a word based on its covered text. Often, an annotation should only be created if the word is part of a certain sequence. This is easy to express in Ruta.

Suppose that we want to create annotations of the type `MultipleAnimals`, if an Animal is preceded by a number. Then we can do the following.

In [None]:
DECLARE MultipleAnimals;             // Declaring the new type MultipleAnimals
(NUM Animal){-> MultipleAnimals};    // If a number is followed by an Animal, create a new annotation of type MultipleAnimals
COLOR(MultipleAnimals,"lightblue");  // Show "MultipleAnimals" in blue
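
As a variation on the rule above, the action can be attached to a single rule element instead of the parenthesized sequence, so that only this element is annotated. The following is a minimal sketch; the type name `CountedAnimal` is hypothetical and only introduced for illustration.

```ruta
DECLARE CountedAnimal;           // hypothetical type, not part of the exercise
NUM Animal{-> CountedAnimal};    // the number must still precede the Animal, but only the Animal is annotated
```

For "12 tuna", this would annotate only "tuna", whereas `MultipleAnimals` covers "12 tuna".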

#### Annotating an enumeration of animals

Next, we want to annotate enumerations of Animals, i.e. "*Dogs, cats and mice*" in this text example. Such an enumeration should be annotated with a new type called `AnimalEnum`.

As a preliminary step, we annotate all `Conjunction` elements with the rule below, using the Condition-Action structure that we have already seen in Exercise 1. The rule in line 2 goes through all tokens of the document (`ANY`) and matches if the token is a `COMMA` or one of the words "and" or "or". In these cases, it creates a new `Conjunction` annotation.

In [None]:
DECLARE Conjunction;
ANY{OR(IS(COMMA), REGEXP("and|or")) -> Conjunction};
COLOR(Conjunction,"pink");
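
The same annotations could also be produced with one rule per case instead of `OR`. This is a sketch of an alternative formulation, not an additional step: running it on top of the cell above would create duplicate `Conjunction` annotations. It assumes the basic Ruta token type `W` (any word) and the optional second argument of `REGEXP`, which makes the match case-insensitive.

```ruta
COMMA{-> Conjunction};                       // every comma is a conjunction
W{REGEXP("and|or", true) -> Conjunction};    // the words "and" / "or", ignoring case
```

Starting the second rule with `W` instead of `ANY` limits the regular expression check to word tokens.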

Now, we declare a new annotation type `AnimalEnum` and annotate enumerations of Animals using composed rule elements and a quantifier. The general structure of these enumerations should be: `Animal Conjunction Animal ... Conjunction Animal`. Similar to regular expressions, the plus quantifier `+` can be used to model this behavior. It matches if there is at least one occurrence of the part `(Conjunction Animal)`.

Other important quantifiers are `?` (matches zero or one time), `*` (matches zero or more times) and the notation `[2,5]` (matches two to five repetitions). These quantifiers are greedy and always match the longest valid sequence.

In [None]:
DECLARE AnimalEnum;
(Animal (Conjunction Animal)+) {-> AnimalEnum};
COLOR(AnimalEnum, "lightgreen");
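
The other quantifiers mentioned above are used in the same way. As a sketch, assume we also want to accept a comma before the final "and" (as in "dogs, cats, and mice") and to limit the enumeration to at most four animals. The type name `ShortAnimalEnum` is hypothetical.

```ruta
DECLARE ShortAnimalEnum;   // hypothetical type, for illustration only
// Conjunction? allows a second, optional conjunction (e.g. ", and");
// [1,3] limits the repeated part to one to three occurrences.
(Animal (Conjunction Conjunction? Animal)[1,3]){-> ShortAnimalEnum};
```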

#### Annotating sentences

Let's reset the Common Analysis Structure (`CAS`), i.e. the document together with all its annotations, and try to annotate sentences.

In [None]:
%resetCas

In [None]:
%%documentText
The dog barked at the cat.
Dogs, cats and mice are mammals.
Zander and tuna are fishes.
This text was created approx. on 13.04.2021.

We declare a new annotation type `Sentence` and create a Sentence annotation for each sentence in the input document. To do so, we use the wildcard `#`, which uses the next rule element to determine where its match ends and always takes the shortest possible sequence. For instance, the rule `(# PERIOD){-> Sentence};` creates a Sentence annotation on everything up to and including the first `PERIOD` token. The second rule `PERIOD (# PERIOD){-> Sentence};` creates a Sentence annotation on everything between two consecutive `PERIOD` tokens, including the closing one.

In [None]:
// Let's also switch to a different output display mode
// We list the detected sentences in a table
%displayMode CSV
%csvConfig Sentence
DECLARE Sentence;

(# PERIOD){-> Sentence};
PERIOD (# PERIOD){-> Sentence};

This first attempt to annotate sentences has several problems: the periods in `approx.` and within the date are wrongly treated as sentence boundaries.

In [None]:
// To start over, we can remove the initial faulty Sentence annotations.
// For each Sentence, we apply UNMARK() which removes annotations.
s:Sentence{-> UNMARK(s)};
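
`UNMARK` also accepts a type instead of an annotation label. The following sketch should be equivalent to the rule above for this document.

```ruta
Sentence{-> UNMARK(Sentence)};   // removes the Sentence annotation at each matched position
```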

Let's introduce a "helper" annotation for sentence ends and improve the resulting annotations.

In [None]:
DECLARE SentenceEnd;
// A period is a "SentenceEnd" only if it is followed by any token (_) that is
// neither a number (NUM) nor a lowercase word (SW).
//
// The "_" is a special rule element: it is also fulfilled if nothing is left to match.
// This is necessary for the period that ends the last sentence.
PERIOD{-> SentenceEnd} _{-PARTOF(NUM), -PARTOF(SW)};

(# SentenceEnd){-> Sentence};             // Matches the first sentence.
SentenceEnd (# SentenceEnd){-> Sentence}; // Matches other sentences.

Looks good! Of course, there are still problems, e.g. exclamation marks and question marks should also be considered a `SentenceEnd` ...
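
A minimal sketch of one way to address this, assuming the basic Ruta token types `EXCLAMATION` and `QUESTION`: the two rules below would go right after the `PERIOD` rule and before the two `Sentence` rules. They deliberately skip the look-ahead check, on the assumption that these symbols rarely occur inside abbreviations or numbers.

```ruta
EXCLAMATION{-> SentenceEnd};   // "!" ends a sentence
QUESTION{-> SentenceEnd};      // "?" ends a sentence
```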

### Annotating a simple Date pattern

In the next cell, we declare four new annotation types: `Day`, `Month`, `Year` and `Date`.
We create a single rule for detecting dates of the form `DD.MM.YYYY`. 
This single rule should create four annotations:
1. A "Date" annotation for the complete date mention.
2. A "Day" annotation for the two digits of the day.
3. A "Month" annotation for the two digits of the month.
4. A "Year" annotation for the four digits of the year.

In [None]:
DECLARE Date, Day, Month, Year;

// We restrict the length of each number using a regex (2, 2 and 4 characters)
(NUM{REGEXP("..")-> Day} PERIOD 
    NUM{REGEXP("..")-> Month} PERIOD 
    NUM{REGEXP(".{4}")-> Year}){-> Date};

COLOR(Day, "pink");
COLOR(Month, "lightgreen");
COLOR(Year, "lightblue");
COLOR(Date, "lightgrey");

It is often useful to specify more than one action in a rule. This way, a single sequential pattern can detect multiple entities, and each annotation is only created if the complete sequence matches.
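
The pattern `..` accepts any two-digit number, so a string like `99.99.2021` would also be annotated as a date. As a sketch of a stricter variant that could replace the rule above, the regular expressions below restrict the day to 01-31 and the month to 01-12. As in the original rule, each expression is assumed to be matched against the complete covered text of the number. The sketch does not check whether the day exists in the given month.

```ruta
(NUM{REGEXP("0[1-9]|[12][0-9]|3[01]") -> Day} PERIOD
    NUM{REGEXP("0[1-9]|1[0-2]") -> Month} PERIOD
    NUM{REGEXP("[0-9]{4}") -> Year}){-> Date};
```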