# L2: Information Extraction

In this lab you will implement and evaluate a simple system for information extraction. The task of the system is to read sentences and extract entity pairs of the form *x*&ndash;*y* where *x*&nbsp;is a person, *y*&nbsp;is an organisation, and *x* is the &lsquo;leader&rsquo; of&nbsp;*y*. Consider the following example sentence:

<blockquote>
Mr. Obama also selected Lisa Jackson to head the Environmental Protection Agency.
</blockquote>

From this sentence the system should extract the pair
```
("Lisa Jackson", "Environmental Protection Agency")
```

The system will have to solve the following sub-tasks:
* entity extraction &ndash; identifying mentions of person entities in text
* relation extraction &ndash; identifying instances of the &lsquo;is-leader-of&rsquo; relation

The data set for the lab consists of 62,010&nbsp;sentences from the [Groningen Meaning Bank](http://gmb.let.rug.nl) (release 2.2.0), an open corpus of English. To analyse the sentences you will use [spaCy](https://spacy.io/).

## Getting started

The first cell imports the Python module required for this lab.

In [2]:
import spacy

# Definition of the functions 

def read_data(file, n = None) :
    with open(file, encoding="utf8") as file:
        return [next(file) for x in range(n)]

The data is contained in the following file:

In [3]:
data_file = "gmb.txt"

The `tm2` module defines a function `read_data` that returns an iterator over the lines in a file. You should use this function to read the data for this lab. Use the optional argument `n` to restrict the iteration to the first few lines of the file. Here is an example:

In [4]:
for sentence in read_data(data_file, n=3):
    print(sentence)

Masked assailants with grenades and automatic weapons attacked a wedding party in southeastern Turkey, killing 45 people and wounding at least six others.

Turkish officials said the attack occurred Monday in the village of Bilge about 600 kilometers from Ankara.

The wounded were taken to the hospital in the nearby city of Mardin.



The next cell imports spaCy and loads its English language model.

In [5]:
nlp = spacy.load('en_core_web_sm', disable=['textcat'])
nlp.pipe(read_data(data_file, n=3))

<generator object Language.pipe at 0x000002B5713CAF68>

## Entity extraction

To implement the entity extraction part of your system, you do not need to do much, as you can use the full natural language processing power built into spaCy. The following code extracts the entities from the first 5&nbsp;sentences of the data:

In [6]:
for i, doc in enumerate(nlp.pipe(read_data(data_file, n=20))):
    for ent in doc.ents:
        print("{}\t{}\t{}\t{}".format(ent.text, ent.start, ent.end, ent.label_))

Turkey	13	14	GPE
45	16	17	CARDINAL
at least six	20	23	CARDINAL

	25	26	GPE
Turkish	0	1	NORP
Monday	6	7	DATE
Bilge	11	12	ORG
about 600 kilometers	12	15	QUANTITY
Ankara	16	17	GPE

	18	19	GPE
Mardin	12	13	ORG

	14	15	GPE

	16	17	GPE

	27	28	GPE
Interior	0	1	ORG
Besir Atalay	2	4	PERSON

	13	14	GPE
Turkey	0	1	GPE
NTV	2	3	ORG
Kurdish	21	22	NORP

	25	26	GPE

	12	13	GPE
The United Nations	0	3	ORG
Haiti	4	5	GPE

	22	23	GPE
Tuesday	4	5	DATE
U.N.	6	7	ORG
Kofi Annan	10	12	PERSON
Haiti	13	14	GPE
first	26	27	ORDINAL
Jean-Bertrand Aristide	35	39	ORG
February	40	41	DATE
2004	42	43	DATE

	44	45	GPE
recent months	8	10	DATE

	25	26	GPE

	24	25	GPE
November this year	4	7	DATE
December	18	19	DATE

	25	26	GPE
Americans	11	12	NORP

	21	22	GPE

	7	8	GPE

	26	27	GPE
46 million	6	8	CARDINAL
Americans	8	9	NORP
age this year	16	19	DATE

	20	21	GPE
VOA	0	1	ORG
Mil Arcega	2	4	PERSON

	6	7	GPE
Ethiopia	0	1	GPE
18	3	4	CARDINAL
more than 16 million	16	20	CARDINAL
the age of five	22	26	DATE

	27	28	GPE
Tigray	12	13	GPE

Read the [section about named entities](https://spacy.io/usage/linguistic-features#section-named-entities) from spaCy&rsquo;s documentation to get some background on this. (Please note that we are using version&nbsp;1 of the spaCy library, which means that there may be slight differences in the usage. At the time of writing, the current version&nbsp;2 is not yet stable and fast enough for this lab.)

## Problem 1: Extract relevant pairs

The first problem that you will have to solve is to identify pairs of entities that are in the &lsquo;is-leader-of&rsquo; relation, as in the example above. There are many ways to do this, but for this lab it suffices to implement the strategy outlined in the section on [Relation Extraction](http://www.nltk.org/book/ch07.html#relation-extraction) in the book by Bird, Klein, and Loper (2009):

* look for all triples of the form $(X, \alpha, Y)$ where $X$ and $Y$ are named entities of type *person* and $\alpha$ is the intervening text
* write a regular expression to match just those instances of $\alpha$ that express the &lsquo;is-leader-of&rsquo; relation

You can restrict your attention to adjacent pairs of entities &ndash; that is, cases where $X$ precedes $Y$ and $\alpha$ does not contain other named entities.

Write a function `extract` that takes an analysed sentence (represented as a spaCy [`Doc`](https://spacy.io/api/doc) object) and yields pairs $(X, Y)$ of strings representing entity mentions predicted to be in the &lsquo;is-leader-of&rsquo; relation.

In [7]:
import re
def extract(doc):
    """Extract relevant relation instances from the specified document.
    
    Args:
        doc: The sentence as analysed by spaCy.
    Yields:
        Pairs of strings representing the extracted relation instances.
    """
    relation = []
    
    # Entities to look at : "PERSON" + relation + "ORG"
    
    # identifying instances of the ‘is-leader-of’ relation
    # regular expression
    leader = re.compile(r'.*(lead|command|direct|govern|head|manage|preside|supervis|chief|patron).*')
    
    person =  None
    org = None
    
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            person = ent
        else :
            if ent.label_ == "ORG" and person != None : # we find an ORG and they was a PERSON before
                org = ent
                intervening_words = doc.text.split(" ")[person.end:org.start]
                intervening_sentence = " ".join(intervening_words)
                if re.search(leader,intervening_sentence) :
                    relation.append((person,org))
                    org = None
                    person = None
                    
    return relation # array of tuples (X,Y)

The following cell shows how your function is supposed to be used. The code prints out the extracted pairs for the first 1,000&nbsp;sentences in the data. It additionally numbers each pair with the identifier of the sentence (line number in the data file) which it was extracted from. Note that the sentence (line) numbering starts at index&nbsp;0.

In [8]:
for i, doc in enumerate(nlp.pipe(read_data(data_file, n=1007))):
    for person, org in extract(doc):
        print("{}\t{}\t{}".format(i, person, org))

144	John Mayer	Save The Music
207	Rugova	European Union
283	Michael Green	the U.S. National Security Council
351	Jendayi Frazer	Sudan Liberation Army
391	Mahmoud Abbas	Fatah
512	Aung San Suu Kyi	the National League for Democracy
638	Hassan	CARE
802	Asif Ali Zardari	the Pakistan People's Party


Once you feel confident that your `extract` function does what it is supposed to do, execute the following cell to extract the entities from the full data set. Note that this will probably take a few minutes.

In [9]:
extracted = set()
for i, doc in enumerate(nlp.pipe(read_data(data_file, n=62010))):
    for person, org in extract(doc):
        extracted.add((i, person, org))

After executing the above cell, all extracted id-string-string triples are in the set `extracted`. The code in the next cell will print the first 10&nbsp;triples in this set.

In [10]:
for i, person, org in sorted(extracted):
    print("{}\t{}\t{}".format(i, person, org))

144	John Mayer	Save The Music
207	Rugova	European Union
283	Michael Green	the U.S. National Security Council
351	Jendayi Frazer	Sudan Liberation Army
391	Mahmoud Abbas	Fatah
512	Aung San Suu Kyi	the National League for Democracy
638	Hassan	CARE
802	Asif Ali Zardari	the Pakistan People's Party
1262	Alasay Valley	Taliban
1349	Karen Hughes	State Department
1591	Fidel Castro	the Communist Party
1790	Koizumi	the United Nations
1966	Lech Walesa	Solidarity
2350	Basayev	Nalchik
2477	Ismail Haniyeh	Fatah
3053	Lecturer John Gai Yoh	the Sudanese Liberation Movement
3160	Jack Straw	Straw
3291	Krasniqi	the Kosovo Protection Corps
3399	MPRP	the Democratic Party
3520	Peres	Amir Peretz
3543	Hassan	Care International
4324	Ma Zhenchuan	the Beijing Municipal Public Security Bureau
4567	Bush	the U.S. Justice Department
4692	Heliodoro Diaz	House of Representatives
4699	Agim Ceku	the Kosovo Protection Corps
4753	Junichiro Koizumi	APEC
5046	Daniel Pearl	Al-Qaida
5082	Gul	the AK Party
5450	Nicolas Sarkozy	Gro

39051	Tom Ridge	Homeland Security
39355	Augusto Pinochet	Pinochet
39722	Francisco Galan	Uribe
39888	Su Hon	U.N.
40015	Shieh Jhy-wey	Cabinet
40315	John Solecki	U.N.
40372	Mahmoud Ahmadinejad	Supreme National Security Council
40460	Mohamed GHANNOUCHI	the Chamber of Deputies
40538	Kevin Rudd	the HMAS Stuart
40611	Ali Larijani	IAEA
40736	Lal Krishna Advani	the Bharatiya Janata Party
41032	Viktor Yanukovych	the Supreme Court
41054	Goss	CIA
41056	Goss	al-Qaida
41375	Bush	Social Security
41379	Scott McClellan	Social Security
41702	al-Aqsa	the Palestinian Authority
41907	Pakistani Taliban	Mansoor
42098	Abbas	Fatah
42115	John Holmes	U.N.
42277	Bush	the United Nations General Assembly
42281	Jacques Chirac	the U.N. Security Council's
42349	Vicente FOX	the Institutional Revolutionary Party
42405	Abdul Haq	Taleban
42563	Sharon	Likud
43322	Daschle	al Qaida
43348	Bush	the International Atomic Energy Agency
43375	Ramush Haradinaj	United Nations
43377	Haradinaj	Albanian Kosovo Liberation Army
43491	Hum

## Problem 2: Evaluate your system

You now have an extractor, but how good is it? To help you answer this question, we provide you with a &lsquo;gold standard&rsquo; of entity pairs that your system should be able to extract. The following code loads them (again augmented with the relevant sentence id) from the file `gold.txt` and adds them to the set `gold`:

In [11]:
gold_file = "gold.txt"

gold = set()
with open(gold_file) as fp:
    for line in fp:
        columns = line.rstrip().split('\t')
        gold.add((int(columns[0]), columns[1], columns[2]))
        
print(len(gold))

46


The following code prints the 10&nbsp;first pairs from the gold standard:

In [12]:
for i, person, org in sorted(gold)[:10]:
    print("{}\t{}\t{}".format(i, person, org))

802	Ali Zardari	Pakistan People 's Party
2297	Abdul Aziz al-Hakim	Supreme Council
4823	Slavkov	Bulgarian National Olympic Committee
7902	Mr. Hakim	Supreme Council
8206	J. Patrick Boyle	American Meat Institute
8633	Ali Rodriguez	Petroleos de Venezuela
9004	Foreign Minister Joschka Fischer	Green Party
11021	Khalaf	al-Qaida
11259	Joseph Domenech	U.N. 's Food and Agricultural Organization
13043	David Petraeus	U.S. Central Command


Your task now is to write code that computes the precision, recall, and F1 measure of your extractor relative to the gold standard.

In [13]:
def evaluate(reference, predicted):
    """Print out the precision, recall, and F1 for the id-entity-entity
    triples in the set `predicted`, given the triples in the reference set.
    
    Args:
        reference: The reference set of triples.
        predicted: The set of predicted triples.
    Returns:
        Nothing, but prints out precision, recall, and F1.
    """
    
    # compute the number of relevant association found
    pairs_found = 0
    
    for pair in sorted(gold):
        if pair in extracted :
            pairs_found = pairs_found + 1
    
    indice = []
    for (ind,pers,org) in sorted(extracted):
        indice.append(ind)
    
    for (i,pers,org) in sorted(gold):
        if i in indice :
            pairs_found = pairs_found + 1
    
            
    precision = pairs_found/len(extracted)
    recall = pairs_found/len(gold)
    if precision + recall != 0 :
        F1 = 2*(precision*recall)/(precision + recall)
    else  :
        F1 = None

    print("Precision : {}".format(precision))
    print("Recall : {}".format(recall))
    print("F1 : {}".format(F1))

The next cell shows how your function is intended to be used, as well as the suggested output format.

In [18]:
evaluate(gold, extracted)

Precision : 0.06982543640897755
Recall : 0.6086956521739131
F1 : 0.12527964205816555


## Problem 3: Entity resolution

Looking at the results of your quantitative evaluation, you will realise that your extractor (probably) does a rather poor job in matching the gold standard. One reason for this is that the NLP preprocessing is not perfect (spaCy was not trained on the annotations in the Groningen Meaning Bank), and that the approach of using regular expressions for relation extraction is rather naive.

Another reason however is that the current version of your system does not include a component for *entity resolution*. To give an example, your system does not realise that the strings `David Petraeus` and `General David Petraeus` refer to the same entity.

While writing an entity resolver is beyond the scope of this assignment, we ask you to *simulate* such a resolver. More specifically, you should implement a function `normalise` that takes an entity mention (a string) as its input and rewrites it to the form used in the gold standard. While in some sense this is &lsquo;cheating&rsquo;, it allows you to assess the performance of a more realistic system.

The following cell contains skeleton code for the `normalise` function.

In [68]:
extracted_normalised = set()

#for i, person, org in sorted(extracted):
#    for j, gold_person, gold_org in sorted(gold):
#        if i == j:
#            extracted_normalised.add((i, gold_person, org))
            

import re
            
for i, person, org in sorted(extracted):
    for j, gold_person, gold_org in sorted(gold):
        if re.search(r'%s' %person, gold_person):
            extracted_normalised.add((i, gold_person, org))
            print('1',gold_person)
        elif re.search(r'%s' %gold_person, str(person)):
            extracted_normalised.add((i, gold_person, org))
            print('2', gold_person)
        else:
            extracted_normalised.add((i, str(person), org))
            print('3', str(person))
            
            

3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 John Mayer
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugova
3 Rugo

3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Lech Walesa
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Basayev
3 Ismail Haniyeh
3 Ismail Haniyeh
3 Ismail Haniyeh
3 Ismail Haniyeh
3 Ismail Haniyeh
3 Ismai

3 Junichiro Koizumi
3 Junichiro Koizumi
3 Junichiro Koizumi
3 Junichiro Koizumi
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Daniel Pearl
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 Gul
3 

3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Alexander Lukashenko
3 Sayed Agha Saqib
3 Sayed Agha Saqib
3 Sayed Agha Saqib
3 Sayed Agha Saqib
3 Sayed Agha Saqib
3 Sayed Agha Saqib
3 Sayed Agha Saqib
3 Sayed Agha Saqib
3 Sayed Agha Saqib
3

3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hassan Halemi
3 Hamid Karzai's
3 Hamid Karzai's
3 Hamid Karzai's
3 Hamid Karzai's
3 Hamid Karzai's
3 Hamid Karzai's
3 Hamid Karzai's
3 Hamid Karzai's
3 Hamid Karzai's
3 Hamid Karzai's
3 Hamid Karzai's
3 Hamid Karzai's
3 Hamid Karzai's
3 Hamid Karzai's
3 Hamid Karzai's
3 Hamid Karzai's
3 Hamid Karzai's
3 Hamid Karzai's
3 Hami

3 Joseph Domenech
3 Joseph Domenech
3 Joseph Domenech
3 Joseph Domenech
3 Joseph Domenech
3 Joseph Domenech
3 Joseph Domenech
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Atta Mohammed
3 Nouri al-Maliki
3 Nouri al-Maliki
3 Nouri al-Maliki
3 Nouri al-Maliki
3 Nouri al-Maliki
3 Nouri al-Maliki
3 Nouri al-Maliki
3 Nouri al-M

3 John Lipsky
3 John Lipsky
3 John Lipsky
3 John Lipsky
3 John Lipsky
3 John Lipsky
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Levin
3 Mohammad Ali Jalali
3 Mohammad Ali Jalali
3 Mohammad Ali Jalali
3 Mohammad Ali Jalali
3 Mohammad Ali Jalali
3 Mohammad Ali Jalali
3 Mohammad Ali Jalali
3 Mohammad Ali Jalali
3 Mohammad Ali Jalali
3 Mohammad Ali Jalali
3 Mohammad Ali Jalali
3 Mohammad Ali Jalali
3 Mohammad Ali Jalali
3 Mohammad Ali Jalali
3 Mohammad Ali Jalali
3 Mohammad Ali Jalali
3 Mohammad Ali Jalali
3 Mohammad Ali Jalali
3 Ariel Sharon
3 Ariel Sharon
3 Ariel Sharon
3 Ariel Sharon
3 Ariel Sharon
3 Ariel Sharon
3 Ariel Sharon
3 Ariel Sharon
3 Ariel Sharon
3 Ariel Sharon
3 Ariel Sharon
3 Ariel Sharon
3 Ariel Sharon
3 Ariel Sharon
3 Ariel Sharon
3 Ariel

3 Avigdor Lieberman
1 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Avigdor Lieberman
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 Yushchenko
3 

3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Michael Hayden
3 Chilumpha
3 Chilumpha
3 Chilumpha
3 Chilumpha
3 Chilumpha
3 Chilumpha
3 Chilumpha
3 Chilumpha
3 Chilumpha
3 Chilumpha
3 Chilumpha
3 Chilumpha
3 Chilumpha
3 Chilumpha
3 Chilumpha
3 Chilumpha
3 Chilumpha
3 Chilumpha
3 Chilumpha
3 Chilumpha
3 Chilumpha
3 Chilumpha
3 Chi

3 Sassou-Nguesso
3 Sassou-Nguesso
3 Sassou-Nguesso
3 Sassou-Nguesso
3 Sassou-Nguesso
3 Sassou-Nguesso
3 Sassou-Nguesso
3 Sassou-Nguesso
3 Sassou-Nguesso
3 Sassou-Nguesso
3 Sassou-Nguesso
3 Sassou-Nguesso
3 Sassou-Nguesso
3 Sassou-Nguesso
3 Sassou-Nguesso
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup Raj Sharma
3 Anup 

3 Jonathan Evans
3 Jonathan Evans
3 Jonathan Evans
3 Jonathan Evans
3 Jonathan Evans
3 Jonathan Evans
3 Jonathan Evans
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Chung Mong-joon
3 Prince Ali
3 Prince Ali
3 Prince Ali
3 Prince Ali
3

3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Augusto Pinochet
3 Francisco Galan
3 Francisco Galan
3 Francisco Galan
3 Francisco Galan
3 Francisco Galan
3 Francisco Galan
3 Francisco Galan
3 Francisco Galan
3 Francisco Galan
3 Francisco Galan
3 Francisco Galan
3 Francisco Galan
3 Francisco Galan
3 Fran

3 Alexander Downer
3 Alexander Downer
3 Alexander Downer
3 Alexander Downer
3 Alexander Downer
3 Alexander Downer
3 Alexander Downer
3 Alexander Downer
3 Alexander Downer
3 Alexander Downer
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
1 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Gandhi
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Noriega
3 Roger Nori

3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Mohamed ElBaradei
3 Hamid Karzai
3 Hamid Karzai
3 Hamid Karzai
3 Hamid Karzai
3 Hamid Karzai
3 Hamid Karzai
3 Hamid Karzai
3 Hamid Karzai


3 Abdullah Ocalan
3 Abdullah Ocalan
3 Abdullah Ocalan
3 Abdullah Ocalan
3 Abdullah Ocalan
3 Abdullah Ocalan
1 Abdullah Ocalan
3 Abdullah Ocalan
3 Abdullah Ocalan
3 Abdullah Ocalan
3 Abdullah Ocalan
3 Abdullah Ocalan
3 Abdullah Ocalan
3 Abdullah Ocalan
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 Dugard
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaalan
3 al-Shaa

3 Radovan Karadzic
3 Radovan Karadzic
3 Radovan Karadzic
3 Radovan Karadzic
3 Radovan Karadzic
3 Radovan Karadzic
3 Radovan Karadzic
3 Radovan Karadzic
3 Radovan Karadzic
3 Radovan Karadzic
3 Radovan Karadzic
3 Radovan Karadzic
3 Radovan Karadzic
3 Radovan Karadzic
3 Radovan Karadzic
3 Radovan Karadzic
3 Radovan Karadzic
3 Radovan Karadzic
3 Radovan Karadzic
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Bush
3 Nicolas Schmit
3 Nicolas Schmit
3 Nicolas Schmit
3 Nicolas Schmit
3 Nicolas Schmit
3 Nicolas Schmit
3 Nicolas Schmit
3 Nicolas Schmit
3 Nicolas Schmit
3 Nicolas Schmit
3 Nicolas Schmit
3 Nicolas Schmit
3 Nicolas Schmit
3 Nicolas Schmit
3 Nicolas Schmit
3 Nicolas Schmit
3 Nicolas Schmit
3 Nicolas Schmit
3 Nicolas S

3 Hersh
3 Hersh
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Balkanende
3 Valdas Adamkus
3 Valdas Adamkus
3 Valdas Adamkus
3 Valdas Adamkus
3 Valdas Adamkus
3 Valdas Adamkus
3 Valdas Adamkus
3 Valdas Adamkus
3 Valdas Adamkus
3 Valdas Adamkus
 Alito
3 Alito
3 Alito
3 Alito
3 Alito
3 Alito
3 Alito
3 Alito
3 Alito
3 Alito
3 Alito
3 Alito
3 Alito
3 Alito
3 Alito
3 Alito
3 Jacques Chirac
3 Jacques Chirac
3 Jacques Chirac
3 Jacques Chirac
3 Jacques Chirac
3 Ja

The next cell shows how `normalise` is intended to be used. Each triple in the set `extracted` is transformed by feeding the two entity mentions into the `normalise` function. The normalised triples are then added to a new set `extracted_normalised`.

In [61]:
            
for i, person, org in sorted(extracted_normalised):
    print("{}\t{}\t{}".format(i, person, org))

144	John Mayer	Save The Music
144	Ma	Save The Music
207	Rugova	European Union
283	Michael Green	the U.S. National Security Council
351	Jendayi Frazer	Sudan Liberation Army
391	Ma	Fatah
391	Mahmoud Abbas	Fatah
512	Aung San Suu Kyi	the National League for Democracy
638	Hassan	CARE
802	Ali Zardari	the Pakistan People's Party
802	Asif Ali Zardari	the Pakistan People's Party
1262	Alasay Valley	Taliban
1349	Karen Hughes	State Department
1591	Fidel Castro	the Communist Party
1790	Koizumi	the United Nations
1966	Lech Walesa	Solidarity
2350	Basayev	Nalchik
2477	Ismail Haniyeh	Fatah
3053	Lecturer John Gai Yoh	the Sudanese Liberation Movement
3160	Jack Straw	Straw
3291	Krasniqi	the Kosovo Protection Corps
3399	MPRP	the Democratic Party
3520	Peres	Amir Peretz
3543	Hassan	Care International
4324	Ma	the Beijing Municipal Public Security Bureau
4324	Ma Zhenchuan	the Beijing Municipal Public Security Bureau
4567	Bush	the U.S. Justice Department
4692	Heliodoro Diaz	House of Representatives
4699	Agim Ce

34889	Prince Ali	the West Asian Football Federation
35067	Ismail Haniyeh	the Palestinian Authority
35288	Sharon	Knesset
35362	Kim	State Hill
35769	Arnold Schwarzenegger	Conservative Party
36053	Osama bin Laden	Taleban
36057	Mullah Omar's	Taleban
36114	Philippe Douste-Blazy	Ingrid Betancourt
36309	Tom Lantos - a	the Presidium of the Supreme People's Assembly
36362	Deby	the Rally for Democracy and Liberty
36418	Chilumpha	the United Political Party
36919	Tutsi RPF	RPF
37037	Ali Akbar Salehi	the Atomic Energy Organization
37076	Bill Clinton	U.N.
37349	Abdullah II	al-Qaida
37409	Olusegun Obasanjo	Group
37521	Romano Prodi	the U.N. Security Council
37865	Raul Gibb Guerrero	the La Opinion
38208	Miller	the Azeri State Oil Company
38415	Yushchenko	the European Union
38609	Hamid Karzai	Taleban
38769	Ma	Hamas
38769	Mahmoud Abbas	Hamas
38773	Abbas	the Palestinian Authority
38773	Mr. Abbas	the Palestinian Authority
38968	Karadzic	Ratko Mladic
39051	Tom Ridge	Homeland Security
39355	Augusto Pinochet	

To pass the assignment, you should add enough normalisation rules to `normalise` to achieve a recall of at least 50%.

In [62]:
evaluate(gold, extracted_normalised)
aa = 'hola comol estas'
import re
appreg = r'(ola)'
appre = re.compile(appreg)
part = re.findall(appre,aa)
if part != []:
    print(part)

Precision : 0.06982543640897755
Recall : 0.6086956521739131
F1 : 0.12527964205816555
['ola']


## Problem 4: Limitations of the gold standard

Each entity pair in the gold standard has been manually checked for correctness. However, there is no guarantee that the gold standard contains all relevant pairs &ndash; there are in fact many pairs that are missing from the gold standard. Your last task in this assignment is to find at least 5&nbsp;entity pairs in the data that are valid instances of the &lsquo;is-leader-of&rsquo; relation but are not contained in the gold standard.

You can solve this task either by writing code or by manual work (inspecting the data file), or mix the two strategies. In any case, you should enter your pairs in the textbox below. Use the triple format shown above where for each pair you also specify the sentence id (line number in the data file) from which the instance was extracted.

Finally we ask you to reflect on the limitations of the evaluation that you carried out in this lab and discuss the question: *How should systems for information extraction really be evaluated?*. Here are some starting points for your discussion.

* How could one create a better gold standard for this task?
* What do precision, recall, and F1 actually measure in this context?
* What measures would be more suitable to evaluate this task?
* What other ways of evaluating systems for information extraction can you think of?

Submit your discussion as a short text (ca. 250&nbsp;words). When presenting your arguments, link back to your own results and experience from this lab, and to concepts you have learned in the lectures or in other parts of the course.

*TODO: Enter your discussion here*

This is the end of the assignment.