## Temporal expression (TIMEX) tagger

### Introduction

Natural language texts often convey semantics related to time. Frequently, time is expressed in calendrical terms (or in terms closely related to calendrical time-keeping). We can use _temporal expressions_ to answer time-related questions, such as:
  
  *  _when did it (some event) happen_?
      * _24. aprillil_ (on 24th of April), _neljapäeva pärastlõunal_ (on Thursday afternoon), _eile kell 3 päeval_ (yesterday at 3 p.m.)
      
      
  * _how long did it last_?
      * _kaks aastat_ (two years), _kolmveerand tundi_ (three-fourths of an hour), _pool päeva_ (half a day)
      
      
  * _how often does it reoccur_? 
      * _kolmapäeviti_ (on every Wednesday), _igal aastal_ (annually)

Temporal expressions (like in the examples above) can be automatically detected and semantically analysed by a temporal expression tagger. This tool identifies temporal expression phrases (in short: _timexes_) in text and normalizes these expressions, providing corresponding calendrical dates, times and durations. Example:

In [1]:
from estnltk import Text
from estnltk.taggers import TimexTagger

# Create new timex tagger
timexTagger = TimexTagger()

# Create new text object and add prerequisite layers
text = Text('Potsataja ütles eile, et vaatavad nüüd Genaga viie aasta plaanid uuesti üle.')
text.analyse('morphology')

# Mark creation time of the document
text.meta['document_creation_time'] = '2014-12-03'

# Annotate temporal expressions
timexTagger.tag(text)

# Browse results
text.timexes

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| timexes | tid, type, value, temporal_function, anchor_ti... | | words | False | 3 |

| text | tid | type | value | temporal_function | anchor_time_id | mod | quant | freq | begin_point | end_point | part_of_interval |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ['eile'] | t1 | DATE | 2014-12-02 | True | t0 | | | | | | |
| ['nüüd'] | t2 | DATE | PRESENT_REF | True | t0 | | | | | | |
| ['viie', 'aasta'] | t3 | DURATION | P5Y | False | | | | | | | |


**Technical note**: EstNLTK's temporal expression tagger uses a Java-based implementation of the tool. Before using the tagger, make sure that:
  * Java SE Runtime Environment (version >= 1.8) is installed on the system;
  * `java` is in the [PATH environment variable](https://docs.oracle.com/javase/tutorial/essential/environment/paths.html).

Source code of the Java-based temporal expression tagger is available [here](https://github.com/soras/Ajavt).

### TIMEX attributes: a brief overview

The attributes of a temporal expression are based on the attributes of the TIMEX3 tag in [TimeML](http://www.timeml.org/). For robust practical analysis, the most important normalization information is conveyed by the attributes `type`, `value` and `temporal_function`, so you may want to focus on these first:

In [2]:
# Browse texts, types, values and temporal functions
text.timexes[['text','type', 'value', 'temporal_function']]

| | text | type | value | temporal_function |
|---|---|---|---|---|
| 0 | ['eile'] | DATE | 2014-12-02 | True |
| 1 | ['nüüd'] | DATE | PRESENT_REF | True |
| 2 | ['viie', 'aasta'] | DURATION | P5Y | False |


Attributes explained in more detail:

   * **`type`** -- type of the temporal expression. Can be one of the following:
       * `DATE` -- occurrence dates, such as _24. aprillil_ (on 24th of April), or _eelmisel aastal_ (last year);
       * `TIME` -- occurrence times which have a granularity smaller than day, 
         such as _neljapäeva pärastlõunal_ (on Thursday afternoon),  _eile kell 3 päeval_ (yesterday at 3 p.m.);
       * `DURATION` -- duration specifications, such as _kaks aastat_ (two years), or _pool päeva_ (half a day);
       * `SET` -- recurrence specifications, such as _kolmapäeviti_ (on every Wednesday) or _igal aastal_ (annually);


   * **`value`** -- semantics of the expression (mostly calendrical). Examples:
   
       * Most date and time expressions will be normalized based on the ISO datetime format `yyyy-mm-ddThh:mm`. For instance, _'24. aprillil 2009'_ (on 24th of April, 2009) will be normalized as `value=2009-04-24`;
       * Duration expressions will be normalized based on the ISO duration format `P[n]Y[n]M[n]DT[n]H[n]M[n]S`. For instance, _'kaks aastat'_ (two years) will be normalized as `value=P2Y`;
       * For common non-calendrical time expressions, special labels will be used in the value part. For instance, _'nüüd'_ (now) will be normalised as a reference to the present time (`value=PRESENT_REF`);
       
       See the section on the attribute `value` below for details about the possible formats of the `value`;
       
     
   * **`temporal_function`** -- boolean indicating whether the semantics of the expression are relative to the context ( that is: have been calculated/need to be calculated by some function, hence the name `temporal_function` );
   
      * For `DATE` and `TIME` expressions:
      
           * `temporal_function=true` indicates that the expression is relative. For instance, _'eile'_ (yesterday) has `temporal_function=true` because its value will be calculated relative to the context (the creation time of the document);
           
           * `temporal_function=false` indicates that the expression is absolute. For instance, _'2009. aastal'_ (in 2009) has `temporal_function=false` because its value will be copied from the textual part of the expression (no need for calculations);
           
     * For `DURATION` expressions, `temporal_function` is mostly `false`, except for vague durations;
     * For `SET` expressions, `temporal_function` is always `true`;
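
Duration values such as `P5Y` follow the ISO duration pattern mentioned above, so they can be decoded with a small amount of code. The following is only a minimal sketch, not part of EstNLTK: the function name `parse_duration` is hypothetical, and the sketch assumes that every count is a plain integer (values with fractional or unspecified counts are not handled).

```python
import re

# Pattern for ISO-style durations such as P5Y, P2Y3M or PT45M.
# The M before 'T' stands for months, the M after 'T' for minutes.
_DURATION_RE = re.compile(
    r'^P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?'
    r'(?:(?P<weeks>\d+)W)?(?:(?P<days>\d+)D)?'
    r'(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$'
)

def parse_duration(value):
    """Return a dict of the duration components present in `value`,
    or None if `value` is not an integer-only ISO-style duration."""
    match = _DURATION_RE.match(value)
    if match is None:
        return None
    parts = {unit: int(number)
             for unit, number in match.groupdict().items()
             if number is not None}
    # A bare 'P' or 'PT' matches the pattern but carries no information
    return parts or None

print(parse_duration('P5Y'))
print(parse_duration('PT45M'))
```

Non-duration values, such as `PRESENT_REF` or `2014-12-02`, simply yield `None`, so the helper can be applied to a whole column of values without checking `type` first.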

### Document creation date

In order to get the semantics of relative date and time expressions correct, you need to provide the _document creation time_ (_DCT_ in short), which is then used to calculate semantics of expressions such as _'eile'_ (yesterday) and _'järgmisel neljapäeval'_ (on next Thursday).

`TimexTagger` looks for the _document creation time_ under the text metadata, searching for keys named `'document_creation_time'`, `'creation_time'` or `'dct'`. Normally, it is expected that values under these keys are in the ISO datetime format (`YYYY-mm-ddTHH:MM`) or in ISO date format (`YYYY-mm-dd`):

In [3]:
# Create new text object and add prerequisite layers
text = Text('Tulid eile meile, selle asemel et tulla täna?')
text.analyse('morphology')

# Mark creation time of the document in the metadata
text.meta['document_creation_time'] = '2010-04-26'

# Annotate temporal expressions
timexTagger.tag(text)

# Browse results
text.timexes[['text', 'type', 'value', 'temporal_function']]

| | text | type | value | temporal_function |
|---|---|---|---|---|
| 0 | ['eile'] | DATE | 2010-04-25 | True |
| 1 | ['täna'] | DATE | 2010-04-26 | True |


Optionally, you can also use Python's `datetime` object to specify the creation date:

In [4]:
# Create new text object and add prerequisite layers
text = Text('Tulid eile meile, selle asemel et tulla täna või homme?')
text.analyse('morphology')

# Mark creation time of the document in the metadata
import datetime
text.meta['document_creation_time'] = datetime.datetime(1986, 12, 21)

# Annotate temporal expressions
timexTagger.tag(text)

# Browse results
text.timexes[['text', 'type', 'value', 'temporal_function']]

| | text | type | value | temporal_function |
|---|---|---|---|---|
| 0 | ['eile'] | DATE | 1986-12-20 | True |
| 1 | ['täna'] | DATE | 1986-12-21 | True |
| 2 | ['homme'] | DATE | 1986-12-22 | True |


 * **Note:** if there is no _document creation time_ specified in the metadata of the `Text` object, then `TimexTagger` assumes that the execution time of the tagger is the DCT. As a result of tagging, the execution time of the tagger will also be stored in the metadata, under the key `'document_creation_time'`;
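
To make the role of the DCT more concrete, the lookup and the date arithmetic described above can be imitated in plain Python. This is an illustrative sketch only and says nothing about how `TimexTagger` is implemented internally: the helper names `find_dct` and `resolve`, the order in which the metadata keys are tried, and the three-word offset table are all assumptions made for the example.

```python
import datetime

# Metadata keys named in the text; the lookup order is an assumption
DCT_KEYS = ('document_creation_time', 'creation_time', 'dct')

# Day offsets for a few relative expressions (illustrative subset only)
DAY_OFFSETS = {'eile': -1, 'täna': 0, 'homme': 1}

def find_dct(meta, now=None):
    """Return the DCT as a date. If no key is set, fall back to `now`
    (or to today's date), mirroring the fallback described in the note."""
    for key in DCT_KEYS:
        if key in meta:
            raw = meta[key]
            # datetime is a subclass of date, so it must be checked first
            if isinstance(raw, datetime.datetime):
                return raw.date()
            if isinstance(raw, datetime.date):
                return raw
            # Accepts 'YYYY-mm-dd' and 'YYYY-mm-ddTHH:MM' strings
            return datetime.datetime.fromisoformat(raw).date()
    return now if now is not None else datetime.date.today()

def resolve(expression, meta):
    """Resolve a relative day expression against the DCT found in `meta`."""
    dct = find_dct(meta)
    offset = datetime.timedelta(days=DAY_OFFSETS[expression])
    return (dct + offset).isoformat()

print(resolve('eile', {'document_creation_time': '2010-04-26'}))
```

Note that this sketch does not handle DCT strings containing gaps (`'X'` symbols); those would need separate treatment.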

#### Gaps in document creation date

There can be situations when the exact document creation date cannot be specified. 
For instance, it may be that only year or month when the document was created is known, and there is no information about the exact date. 
In such cases, the string-based document creation date can have gaps: unknown granularities can be replaced by `'X'` symbols. For instance, if we only know the year of writing (_2009_), we can use `'2009-XX-XX'` as the document creation date:

In [5]:
# Create new text object and add prerequisite layers
text = Text('Homme või järgmisel aastal?')
text.analyse('morphology')

# Mark creation time of the document
text.meta['document_creation_time'] = '2009-XX-XX'

# Annotate temporal expressions
timexTagger.tag(text)

# Browse results
text.timexes[['text', 'type', 'value', 'temporal_function']]

| | text | type | value | temporal_function |
|---|---|---|---|---|
| 0 | ['Homme'] | DATE | XXXX-XX-XX | True |
| 1 | ['järgmisel', 'aastal'] | DATE | 2010 | True |


Note that using gaps in the DCT also affects how relative date and time expressions are normalized. If a relative expression depends on a granularity that the DCT leaves unspecified (such as _homme_ (tomorrow) in the previous example, which needs the _day_ and _month_ that cannot be resolved from the given DCT), then its value is also filled with `'X'` symbols, indicating that there is not enough information to find the exact value.

 * What to keep in mind when using gaps in DCT:

    * You should start marking `'X'` symbols from the right side of DCT, and the markings should be continuous. Discontinuous gaps (such as `'2009-0X-X1'`) and gaps that cover a granularity only partially (such as `'2009-1X-XX'`) do not work -- they may actually lead to unexpected processing errors. However, DCT formats `XXXX-XX-XX`, `yyyy-XX-XX` and `yyyy-mm-XX` should be safe for usage;

    * Marking gaps in DCT is an _experimental feature_. Do not expect it to always work automagically; rather, test it yourself and see whether the results fit your purpose;
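
Because malformed gaps may lead to processing errors, it can be useful to check a DCT string before storing it in the metadata. The check below is a minimal sketch and not part of EstNLTK; the function name `is_safe_gapped_dct` is hypothetical. It accepts only a full date and the three gap shapes listed above as safe, and it does not verify that the month and day numbers are within a valid range.

```python
import re

# Accepts yyyy-mm-dd, yyyy-mm-XX, yyyy-XX-XX and XXXX-XX-XX:
# gaps must start from the right and cover whole granularities.
_GAPPED_DCT_RE = re.compile(
    r'(?:\d{4}-\d{2}-\d{2}|\d{4}-\d{2}-XX|\d{4}-XX-XX|XXXX-XX-XX)'
)

def is_safe_gapped_dct(dct):
    """Return True if `dct` has one of the gap shapes considered safe."""
    return _GAPPED_DCT_RE.fullmatch(dct) is not None

for candidate in ('2009-XX-XX', '2009-0X-X1', '2009-1X-XX'):
    print(candidate, is_safe_gapped_dct(candidate))
```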


### Details of the annotation format

In this section, we will introduce more details about the TIMEX attributes used by the `TimexTagger`.

#### The attribute `tid`

The attribute `tid` provides a unique identifier for each temporal expression. The identifier is a string with the prefix `t`, followed by the number of the timex. Numbering of the timexes starts from 1. Example:

In [6]:
# Create new text object and add prerequisite layers
text = Text('Kaks aastat põrandaalust aktiivset tegevust, 14 aastat vanglat ning 30 aastat pideva '+
            'nuhkimise all elamist.')
text.analyse('morphology')

# Annotate temporal expressions
timexTagger.tag(text)

# See results
text.timexes[['text', 'tid', 'type', 'value']]

| | text | tid | type | value |
|---|---|---|---|---|
| 0 | ['Kaks', 'aastat'] | t1 | DURATION | P2Y |
| 1 | ['14', 'aastat'] | t2 | DURATION | P14Y |
| 2 | ['30', 'aastat'] | t3 | DURATION | P30Y |


 * **Note:** the identifier **`t0`** refers to the _document creation time_. It is never used as a `tid` of a timex, but it can be used in the attributes `anchor_time_id`, `begin_point` and `end_point`, whenever the calculation of semantics involves the _document creation time_;

#### The attribute `type`

... indicates the type of the temporal expression. Can be one of the following:
   * `DATE` -- occurrence dates, such as _24. aprillil_ (on 24th of April), or _eelmisel aastal_ (last year);
   * `TIME` -- occurrence times which have a granularity smaller than day, such as _neljapäeva pärastlõunal_ (on Thursday afternoon),  _eile kell 3 päeval_ (yesterday at 3 p.m.);
   * `DURATION` -- duration specifications, such as _kaks aastat_ (two years), or _pool päeva_ (half a day);
   * `SET` -- recurrence specifications, such as _kolmapäeviti_ (on every Wednesday) or _igal aastal_ (annually);


#### The attribute `value`

The attribute `value` conveys the most important part of the semantics of the temporal expression -- the semantics of a date, time or duration based on the ISO datetime format. There are five possible formats:

   * I. Date-based format: `yyyy-mm-ddThh:mm`

           yyyy - year (4 digits)
           mm - month (01-12)
           dd - day (01-31)
             
   * II. Weekday-based: `yyyy-Wnn-wdThh:mm`

           nn - the week of the year (01-53)
           wd - day of the week (1-7, where 1 denotes Monday).

   * III. Time-based: `Thh:mm`

           hh - hour of day (00-23)
           mm - minute of hour (00-59)
       
   * IV. Time span: `Pn1Yn2Mn3Wn4DTn5Hn6M`

           where ni denotes a value and Y (year), M (month), 
           W (week), D (day), H (hours), M (minutes) denotes 
           respective time granularity.
       
   * V. Special labels, such as `PRESENT_REF` and `PAST_REF`
             
           in some cases, special labels are used to 
           express the date & time semantics. See the 
           annotation guidelines below for more details 
           
           
Formats I and II are used with DATE, TIME and SET types. Format I is always preferred if both I and II can be used. Format III is used in cases where it is impossible to extract the date. Format IV is used in time span expressions, and in some recurrence expressions.
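
When post-processing results, it can help to tell the five formats apart by the shape of the `value` string. The following is a rough sketch, not part of EstNLTK: the function `classify_value` and its rules are assumptions derived only from the format descriptions above. In particular, it treats any upper-case string ending in `_REF` as a special label, and it returns `'unknown'` for anything it does not recognise.

```python
import re

# Rules are tried in order; the weekday rule must precede the date rule,
# because both formats start with a four-character year.
_RULES = [
    ('special label', re.compile(r'^[A-Z]+_REF$')),        # V
    ('time span',     re.compile(r'^PT?\d')),              # IV
    ('time-based',    re.compile(r'^T')),                  # III
    ('weekday-based', re.compile(r'^[\dX]{4}-W[\dX]{2}')), # II
    ('date-based',    re.compile(r'^[\dX]{4}(?:$|[-T])')), # I
]

def classify_value(value):
    """Guess which of the five value formats `value` belongs to."""
    for name, pattern in _RULES:
        if pattern.search(value):
            return name
    return 'unknown'

for example in ('2014-12-02', 'PRESENT_REF', 'P5Y', 'XXXX-XX-XX'):
    print(example, '->', classify_value(example))
```

The rules are deliberately loose: they look only at the beginning of the string, so they do not validate the rest of the value.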

Parts of the ISO datetime format can be replaced by labels conveying special semantics of commonly used temporal expressions:

   * `hh:mm` (_hours and minutes_) can be replaced by a label referring to a _time of the day_:
         
         MO - morning - hommik
         AF - afternoon - pärastlõuna
         EV - evening - õhtu
         NI - night - öö
         DT - daytime - päevane aeg

   * `wd` (_weekday_) can be replaced by a general label referring to a group of weekdays:
   
         WD - workday - tööpäev
         WE - weekend - nädalalõpp

   * `mm` (_month_) can be replaced by a general label referring to a season:
         
         SP - spring - kevad
         SU - summer - suvi
         FA - fall - sügis
         WI - winter - talv
         
   * `mm` (_month_) can also be replaced by a general label referring to a quarter (of year):
   
         Q1, Q2, Q3, Q4
         QX - unknown/unspecified quarter
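
The label tables above translate directly into lookup dictionaries. The sketch below is not part of EstNLTK; the function `describe_labels` is hypothetical, and it assumes that a label is substituted literally into its slot of the value string (for example, a season label in place of the month, or a time-of-day label after the `T` separator).

```python
TIME_OF_DAY = {'MO': 'morning', 'AF': 'afternoon', 'EV': 'evening',
               'NI': 'night', 'DT': 'daytime'}
WEEKDAY_GROUPS = {'WD': 'workday', 'WE': 'weekend'}
SEASONS = {'SP': 'spring', 'SU': 'summer', 'FA': 'fall', 'WI': 'winter'}
QUARTERS = {'Q1': 'first quarter', 'Q2': 'second quarter',
            'Q3': 'third quarter', 'Q4': 'fourth quarter',
            'QX': 'unspecified quarter'}

def describe_labels(value):
    """Return descriptions of the special labels found in `value`."""
    # None of the date-side labels contains 'T', so the first 'T'
    # separates the date part from the time part.
    date_part, _, time_part = value.partition('T')
    found = []
    for token in date_part.split('-')[1:]:  # skip the year
        for table in (SEASONS, QUARTERS, WEEKDAY_GROUPS):
            if token in table:
                found.append(table[token])
    if time_part in TIME_OF_DAY:
        found.append(TIME_OF_DAY[time_part])
    return found

print(describe_labels('2009-SU'))
print(describe_labels('2014-12-02TAF'))
```

Values without any label, such as `2009-04-24`, produce an empty list.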
         

If the expression refers to date/time before the Common Era, then its `value` will have prefix `BC`:

In [7]:
# Create new text object and add prerequisite layers
text = Text('Lülle laevkalmed, rajatud umbes 8. sajandil e.m.a., on Eestis ainulaadsed .')
text.analyse('morphology')

# Annotate temporal expressions
timexTagger.tag(text)

# See results
text.timexes[['text', 'type', 'value']]

| | text | type | value |
|---|---|---|---|
| 0 | ['umbes', '8.', 'sajandil', 'e.m.a.'] | DATE | BC07 |


A more detailed description of the `value` can be found in the annotation guidelines [here](https://github.com/soras/Ajavt/blob/master/doc/margendusformaat_et.pdf?raw=true) (currently only in Estonian).

#### The attribute `temporal_function`

... is a boolean value indicating whether the semantics of the expression are relative to the context (that is: have been calculated/need to be calculated by some function, hence the name `temporal_function`);
   
   * For `DATE` and `TIME` expressions:
      
       * `temporal_function=true` indicates that the expression is relative. For instance, _'eile'_ (yesterday) has `temporal_function=true` because its value will be calculated relative to the context (the creation time of the document);
           
       * `temporal_function=false` indicates that the expression is absolute. For instance, _'2009. aastal'_ (in 2009) has `temporal_function=false` because its value will be copied from the textual part of the expression (no need for calculations);
           
           
   * For `DURATION` expressions, `temporal_function` is mostly `false`, except for vague durations;
   * For `SET` expressions, `temporal_function` is always `true`;

#### The attribute `mod`

... refers to a modifier of the semantics part in the `value`. It is used in special cases when the semantics cannot be expressed completely by the attribute `value` and an elaboration is needed. For instance, the expression _'2009. aasta alguses'_ (in the beginning of 2009) will have `value=2009` and `mod=START`.
      
Another example:

In [8]:
# Create new text object and add prerequisite layers
text = Text('Tavaliselt võtab paariks kasvamine aega umbes kaks aastat, mil toimub teineteise nurkade '+
            'mahalihvimine ning ühiste reeglite ja maailmapildi kujunemine.')
text.analyse('morphology')

# Annotate temporal expressions
timexTagger.tag(text)

# See results
text.timexes[['text', 'type', 'value', 'mod']]

| | text | type | value | mod |
|---|---|---|---|---|
| 0 | ['umbes', 'kaks', 'aastat'] | DURATION | P2Y | APPROX |


The attribute `mod` can have the following string values: `START`, `MID`, `END`, `FIRST_HALF`, `SECOND_HALF`, `APPROX`, `LESS_THAN`, `MORE_THAN`, `EQUAL_OR_LESS` or `EQUAL_OR_MORE`.

---

**T O D O**

---