Event Extraction
Baleen 2.6 introduces Event Extraction to the Baleen annotators. An event in this context is considered to be a relationship that contains both a location and a time or date.
A new consumer, MongoEvents, has been created to output events from Baleen.
A basic method of extracting events has been implemented in the events.SimpleEventExtractor annotator, which can be configured to detect events in sentences or paragraphs that contain a temporal annotation, a location annotation, and at least one entity.
language.OpenNLP must also be included in the pipeline so that sentences and paragraphs can be parsed for events.
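As a rough illustration, a pipeline using the simple extractor might be laid out as below. This is only a sketch: the annotator is shown without configuration options because this page does not name them, and the annotators that produce the entity, temporal and location annotations are assumed to be configured where the comment indicates.

```yaml
collectionreader:
  # A collection reader

annotators:
  # Provides the sentence and paragraph structure needed by the extractor
  - class: language.OpenNLP
  # Annotators producing entities, temporal and location annotations go here (assumed)
  - class: events.SimpleEventExtractor

consumers:
  - class: MongoEvents
```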
Odin has been integrated into Baleen to provide a rule-based system for extracting events. A full description of the rule language can be found in Description of the Odin Event Extraction Framework and Rule Language.
Two simple examples are given below:
    - name: token_example
      label: LivesIn
      priority: 2
      type: token
      pattern: |
        (?<resident>Oscar) (?<trigger>[lemma=live]) on [tag=DT]? (?<location>[tag=/^N/]+)
Each rule has a name and a label. The label (or list of labels) is assigned to mentions found by this rule. The priority determines the order in which the rule is run. Odin runs all the rules on each sentence until no new mentions are produced, so the priority is not required, but assigning priorities can improve performance when there are dependencies between the rules. The type is either ‘token’ or ‘dependency’ and describes the rule syntax used. In this example, we have a token rule. To satisfy a rule of this type, a sequence of tokens must be found that matches each token rule in the pattern in turn. The token rules are separated by spaces. For example, the simplest token rule in the above sequence is ‘on’, which matches the word ‘on’ exactly. We now take each in turn:
- (?<resident>Oscar) matches the word Oscar and assigns that match to the role resident in the mention.
- (?<trigger>[lemma=live]) matches any token that has the lemma live and assigns it the role trigger. Trigger is a special role in an event, and every event must have a trigger.
- on matches the word ‘on’.
- [tag=DT]? matches a token that has the part-of-speech tag DT, meaning it is a determiner. The ‘?’ at the end makes this token optional, so the pattern also matches when no determiner is present.
- (?<location>[tag=/^N/]+) matches at least one token whose part-of-speech tag matches the regular expression ^N, meaning it is a noun, and assigns the match to the role location. If Location entities exist, or a rule that defines Locations, then this could have been replaced with @location:Location.
This rule would match the following sentences:
- Oscar lives on Sesame Street.
- Oscar lives on the Sesame Street in a trash can.
- Back then, Oscar lived on a Sesame Street.
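The token rule above can also be written so that the location role is filled by an existing mention instead of a run of nouns. The sketch below assumes that Location mentions are already available, either from Baleen's entity extraction or from a rule with a lower priority number. The rule name is hypothetical.

```yaml
- name: token_example_location   # hypothetical rule name
  label: LivesIn
  priority: 2
  type: token
  pattern: |
    # @location:Location captures an existing Location mention as the role location
    (?<resident>Oscar) (?<trigger>[lemma=live]) on [tag=DT]? @location:Location
```

This variant only fires when a Location mention follows the optional determiner, so it is stricter than the noun-based version.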
This next rule uses a dependency pattern. We omit type: dependency as this is the default.
    - name: dependency_example
      label: Jump
      priority: 2
      pattern: |
        trigger = [lemma=live]
        resident:Person = nsubj
        location:Location = prep pobj
In this rule pattern we start with a trigger, here any word with the lemma live. Each role in the mention is then filled by traversing the dependency graph from the trigger along the specified path. We assume that Person and Location entities have already been extracted, which allows these types to be matched at the end of a path. The role resident is filled by the ‘nsubj’ of the verb, and the role location by following a ‘prep’ edge and then a ‘pobj’ edge of the dependency graph. As well as the above examples, this rule can match more complex sentence structures, such as:
- Back then Oscar lived in a small grey trash can on Sesame Street.
- Oscar, the green bodied Grouch, lives, with Big Bird, on Sesame Street.

These examples just scratch the surface of what can be expressed with these rule languages. Complex domain knowledge can be captured and used to extract a rich set of events from a corpus.
Odin can be used with a predefined rules file within a Baleen pipeline as follows:
    collectionreader:
      # A collection reader

    annotators:
      # Usual extraction pipeline including either:
      #- language.OdinParser
      # or
      #- language.OpenNLP
      #- language.MaltParser
      - class: events.Odin
        rules: ./rulesFile.yml
        # types:
        #   - Event
        #   - Entity

    consumers:
      - class: MongoEvents
      #- class: print.Events
A simple Odin rules file might be:
    rules:
      - name: event
        label: Event
        priority: 2
        pattern: |
          trigger = [tag=/^V/]
          location: Location = >>
          time: Temporal = >>
          involved: Entity* = >> [!entity=/Location|Temporal/]
whereas a more complex rules file could include:
    taxonomy:
      - List
      - Meeting
      - Communitation
      - Actor
      - Number
      - Quote
      - Group
      - Effect:
        - Killing

    rules:
      - name: numbers
        label: Number
        priority: 1
        type: token
        pattern: |
          [tag=CD]

      - name: single-quotes
        label: Quote
        priority: 1
        type: token
        pattern: /[']/ /[^']+/+ /[']/

      - name: double-quotes
        label: Quote
        priority: 1
        type: token
        pattern: /["]/ /[^"]+/+ /["]/

      - name: group
        label: Group
        priority: 2
        keep: false
        pattern: |
          trigger = [lemma=/people|civilian/]
          number: Number = /num/

      - name: actor
        label: Actor
        priority: 2
        type: token
        pattern: |
          [entity=/Person|Organisation/]

      - name: list
        label: List
        priority: 2
        type: token
        pattern: |
          @item:Entity ("," @item:Entity)+ (and @item:Entity)?

      - name: said
        label: Communitation
        priority: 3
        pattern: |
          trigger = [tag=/^V/ & lemma=/say|declare/ & !outgoing=neg]
          subject: Actor = nsubj
          quote: Quote = >> | <<

      - name: killed
        label: Killing
        priority: 3
        pattern: |
          trigger = [lemma=/kill/]
          subject: Actor? = /nsubj/
          target: Group = dobj

      - name: event
        label: Event
        priority: 2
        pattern: |
          trigger = [tag=/^V/]
          time: Temporal+ = >>
          location: Location+ = >>
          involved: Entity* = >> [!entity=/Location|Temporal/]

      - name: meeting
        label: Meeting
        priority: 3
        pattern: |
          trigger = [tag=/^V/ & lemma=/met|meet|gather|host|assemble/ & !mention=Meeting]
          subject: Actor = nsubj prep? pobj?
          object: Actor? = dobj
          participant: Actor* = /nsubj|dobj/ prep? pobj? /appos|conj/
          location: Location? = xcomp? prep pobj
          time: Temporal? = xcomp? prep pobj

      - name: meeting-of
        label: Meeting
        priority: 3
        pattern: |
          trigger = [lemma=/met|meet|gather|host|assemble/ & !mention=Meeting]
          subject: Actor = (prep pobj nn) | (prep obj) | (/nmod/ compound|amon)

      - name: meeting-located
        label: Meeting
        priority: 3
        pattern: |
          trigger = [tag=/^V/ & lemma=/take|took/]
          subject: Meeting = nsubj
          object: Actor? = dobj
          participant: Actor* = /nsubj|dobj/ prep? pobj? /appos|conj/
          location: Location = xcomp? prep pobj
          time: Temporal? = xcomp? prep pobj

      - name: communicating
        label: Communitation
        priority: 3
        pattern: |
          trigger = [tag=/^V/ & lemma=/talk|speak|argue|chat|communicate|tell|converse|say|shout|utter|discus/]
          participant: Actor{2,} = (/nsubj|dobj/) | (/nsubj|dobj/ (prep pobj)? /appos|conj/)
          location: Location? = xcomp? prep pobj
          time: Temporal? = xcomp? prep pobj

      - name: born
        label: Event
        priority: 3
        pattern: |
          trigger = [tag=/^V/ & lemma=/bear|born/]
          subject: Person = /nsubj/
          location: Location? = xcomp? prep? pobj
          time: Temporal? = xcomp? prep? pobj

      - name: attack
        label: Event
        priority: 4
        pattern: |
          trigger = [tag=/^V/ & lemma=/attack/]
          subject: Actor = /nsubj/
          effect: Effect? = >>
          location: Location? = xcomp? prep? pobj
          time: Temporal? = xcomp? prep? pobj