Event extraction using the Structured Perceptron algorithm

Preprocessing

Before applying the structured perceptron, we run a few common-sense preprocessing steps.

Non-alphanumeric tokens

Events are almost always words, not numbers or symbols. We therefore ignore any annotated event whose token contains digits or symbols, whether the annotators labelled it by mistake or because of an unusually broad notion of what counts as an event. In practice,

all tokens that contain anything other than letters, or no letters at all, are labelled as non-events.
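
The rule above can be sketched in a few lines. This is only an illustration, not the code from the repository: the function name `mask_non_alpha` and the non-event tag `"O"` are assumptions made for the example.

```python
NON_EVENT = "O"  # assumed tag for non-events; the real tag set may differ


def mask_non_alpha(tokens, labels, non_event=NON_EVENT):
    """Relabel every token that is not purely alphabetic as a non-event.

    str.isalpha() is False for empty strings and for any string that
    contains a digit, punctuation mark or other non-letter character.
    """
    return [
        label if token.isalpha() else non_event
        for token, label in zip(tokens, labels)
    ]


tokens = ["attack", "9/11", "--", "co-op"]
labels = ["I-Conflict_Attack", "I-Conflict_Attack", "O", "I-Contact_Meet"]
print(mask_non_alpha(tokens, labels))
```

Note that hyphenated words such as "co-op" are also treated as non-events under this rule, because the hyphen is not a letter.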

Tokens with multiple labels

A number of tokens carry multiple labels. Specifically, we found the following in the training set:

I-Justice_Sentence,I-Justice_Sentence sentence
I-Conflict_Attack,I-Life_Die slaying, attempt, kill, massacre, death, the, blow, killer, genocide, killing, murder, slaughter, manslaughter
I-Contact_Meet,I-Movement_Transport-Person visit
I-Personnel_End-Position,I-Personnel_End-Position former
I-Contact_Meet,I-Justice_Trial-Hearing tell, plead, testimony
I-Justice_Execute,I-Life_Die penalty, execute, lethal, death, punishment, to, put, capital, execution, hang
I-Personnel_End-Position,I-Personnel_End-Position,I-Personnel_End-Position work
I-Movement_Transport-Artifact,I-Transaction_Transfer-Money,I-Transaction_Transfer-Ownership trafficking, smuggling
I-Conflict_Attack,I-Transaction_Transaction robbery
I-Justice_Execute,I-Justice_Sentence penalty
I-Transaction_Transfer-Money,I-Transaction_Transfer-Ownership sale, purchase, run, buyer, buy, donate, sell
I-Contact_Correspondence,I-Transaction_Transfer-Ownership receive
I-Movement_Transport-Artifact,I-Transaction_Transfer-Ownership smuggle, ship, transfer, smuggling, trafficking, pick, receive, smugglee, supply, smuggler
I-Transaction_Transfer-Money,I-Transaction_Transfer-Money pay
I-Conflict_Attack,I-Life_Injure over, hurt, abuse, attack, knee, assault, run, rape, shot, injure, wound, injured, cap
I-Contact_Meet,I-Justice_Arrest-Jail apprehension
I-Contact_Broadcast,I-Justice_Trial-Hearing rule
I-Conflict_Attack,I-Transaction_Transfer-Ownership hijacking, seize, rob, robbery, burglary
I-Justice_Extradite,I-Movement_Transport-Person deport, extradition, extradite
I-Justice_Charge-Indict,I-Justice_Charge-Indict charge
I-Conflict_Attack,I-Transaction_Transfer-Money robbery
I-Justice_Fine,I-Transaction_Transfer-Money fine
I-Movement_Transport-Artifact,I-Transaction_Transaction supply

What to do about these labels

If the same label is duplicated, as in I-Justice_Sentence,I-Justice_Sentence, we simply aim to label the corresponding token once and consider the predicted label correct if it matches a single instance of the duplicated label.
We treat the relatively widespread label combinations as separate labels in their own right. Specifically, I-Justice_Execute,I-Life_Die means something like "killed by the justice system", and I-Conflict_Attack,I-Transaction_Transfer-Ownership means an assault that involves taking someone’s property. Likewise, I-Conflict_Attack,I-Life_Die is clearly a violent death, and I-Conflict_Attack,I-Life_Injure is an assault that results in injuries.
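
A minimal sketch of these two rules, assuming the raw annotation is a comma-separated string as in the list above. The helper names `normalise_label` and `is_correct` are hypothetical. The text does not say how rare combinations are handled, so this sketch treats every combination as its own label.

```python
def normalise_label(raw):
    """Collapse duplicates and return one canonical label string.

    "A,A"  -> "A"     (a duplicated label counts once)
    "B,A"  -> "A,B"   (a combination becomes a single joint label;
                       sorting makes the order of the parts irrelevant)
    """
    parts = {part.strip() for part in raw.split(",") if part.strip()}
    return ",".join(sorted(parts))


def is_correct(predicted, gold_raw):
    """A prediction is correct if it equals the normalised gold label."""
    return predicted == normalise_label(gold_raw)


print(normalise_label("I-Justice_Sentence,I-Justice_Sentence"))
print(normalise_label("I-Life_Die,I-Conflict_Attack"))
```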

Training and testing data

We have both the training and testing datasets as JSON files in the data directory. The main script first loads the training data and does some preprocessing.

Preprocessing

Find all names in the text and replace them with the token NAME. To find the names, we use the US baby names dataset from Kaggle as well as the name gazetteers from GATE.
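
As a rough illustration of this step, the sketch below replaces any token found in a set of known names. The function `replace_names` and the tiny `names` set are made up for the example; in practice the set would be built from the Kaggle and GATE name lists.

```python
def replace_names(tokens, names, placeholder="NAME"):
    """Replace every token whose lowercase form is a known name.

    `names` is assumed to be a set of lowercased first names.
    """
    return [placeholder if token.lower() in names else token for token in tokens]


names = {"john", "mary"}  # stand-in for the real name lists
print(replace_names(["John", "met", "Mary", "today"], names))
```

A plain lookup like this also replaces common words that happen to be names (for example "Will" or "Mark"), so a real implementation may need an extra check such as capitalisation or position in the sentence.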

Structured Perceptron

Suppose we have the sentence "Cat eats a cake" and we labelled it as E E NE NE, where E stands for "event" and NE for "non-event". This is almost correct: only the label for the word "cat" is wrong, since it is supposed to be NE.
For each word in this sentence, we take all features related to that word and increase their weights by 1 if we labelled the word correctly. If we labelled the word incorrectly, the weights of all features related to that word are decreased by 1.
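
The update described above can be sketched as follows. This is a simplified illustration, not the repository's implementation: the feature strings are invented, and keying each weight by a (feature, predicted label) pair is an assumption about how the weights are stored.

```python
from collections import defaultdict


def update_weights(weights, features_per_token, predicted, gold):
    """Apply the +1 / -1 update to every token of one sentence.

    weights            -- dict mapping (feature, label) to a number (assumed layout)
    features_per_token -- one list of feature strings per token
    predicted, gold    -- label sequences of the same length
    """
    for features, pred, true in zip(features_per_token, predicted, gold):
        delta = 1 if pred == true else -1
        for feature in features:
            weights[(feature, pred)] += delta
    return weights


weights = defaultdict(int)
features = [["word=cat"], ["word=eats"], ["word=a"], ["word=cake"]]  # invented features
predicted = ["E", "E", "NE", "NE"]
gold = ["NE", "E", "NE", "NE"]
update_weights(weights, features, predicted, gold)
print(dict(weights))
```

In this example only the features of "cat" are penalised, because "cat" is the only mislabelled word.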

Viterbi algorithm

A very good description of this algorithm can be found in Jurafsky, Daniel, and James H. Martin. 2008. Speech and Language Processing. Second Edition. Prentice Hall, or, even better, in the unpublished third edition.
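
For readers who want a concrete picture, here is a generic first-order Viterbi decoder. It is a textbook-style sketch and not the contents of viterbi.py: the `score` callback, the toy `emit` table and the penalty for two consecutive events are all assumptions made for the example.

```python
def viterbi(n_tokens, labels, score):
    """Return the highest-scoring label sequence.

    score(i, prev_label, label) gives the score of assigning `label`
    to token i when token i-1 has `prev_label` (None for i == 0).
    """
    if n_tokens == 0:
        return []
    # best[i][label]: best score of any path ending in `label` at token i
    best = [{label: score(0, None, label) for label in labels}]
    back = [{}]  # back[i][label]: previous label on that best path
    for i in range(1, n_tokens):
        best.append({})
        back.append({})
        for label in labels:
            prev = max(labels, key=lambda p: best[i - 1][p] + score(i, p, label))
            best[i][label] = best[i - 1][prev] + score(i, prev, label)
            back[i][label] = prev
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda label: best[-1][label])]
    for i in range(n_tokens - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path


# Toy scores for "Cat eats a cake": only "eats" looks like an event.
emit = [{"E": 0, "NE": 1}, {"E": 2, "NE": 0}, {"E": 0, "NE": 1}, {"E": 0, "NE": 1}]


def toy_score(i, prev, label):
    penalty = -1 if prev == "E" and label == "E" else 0
    return emit[i][label] + penalty


print(viterbi(4, ["E", "NE"], toy_score))  # ['NE', 'E', 'NE', 'NE']
```

In a structured perceptron, `score` would typically be the sum of the current weights of the features that fire for that token and label pair.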
