<a href="https://colab.research.google.com/github/aolieman/semantic-corpus-exploration/blob/master/notebooks/entity_linking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part I: Entity Linking with DBpedia Spotlight

![Spotlight Logo](https://cleverdon.hum.uva.nl/spotlight/demo/dbpedia_spotlight_logo.jpg)

In [0]:
from IPython.display import IFrame
from urllib.parse import quote
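
Most cells below repeat the same two steps: percent-encode a text, then embed the demo widget with that text in its query string. As a minimal sketch, the steps could be wrapped in a small helper. The names `DEMO_URL` and `demo_url` are hypothetical and not part of the notebook or of Spotlight.

```python
from urllib.parse import quote

# Base URL of the demo widget, as used in the cells of this notebook.
DEMO_URL = "https://cleverdon.hum.uva.nl/spotlight/demo/"


def demo_url(text, base=DEMO_URL):
    """Return the widget URL with `text` percent-encoded into the query string."""
    return f"{base}?nologo=1&text={quote(text)}"


url = demo_url("She named her horse 'Buck.'")
print(url)
```

The result can be passed straight to the widget, for example `IFrame(demo_url(text), 900, 450)`.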

## Confidence parameter

### The horse named Buck

> "She named her horse 'Buck.'"

Before proceeding to click the "annotate" button in the next cell, please decide for yourself which words in this sentence should be linked. Assume no further context is available.

The Spotlight annotation widget below shows the example as part of a metadata record: a blog post with a title and (plain-text) tags. Now please click the "annotate" button.

In [18]:
IFrame('https://cleverdon.hum.uva.nl/spotlight/demo/?nologo=1&text=Title:%20%22She%20named%20her%20horse%20%27Buck.%27%22%0ATags:%20saddle,%20crupper,%20stirrups', 900, 450)

By default, Spotlight links the tags, but no terms from the title.

Did you notice the confidence slider at the top-left of the widget? This can be used to control how many annotations Spotlight will generate:

- ask it to be more confident about its entity links, and it will generate fewer of them;
- relax this constraint, and Spotlight will dare to guess, even if the input is not at all similar to the training data it has seen.

Please try different values of the confidence parameter on the previous input.
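
One simplified way to picture the slider is as a cut-off applied to a score per candidate link. The sketch below is a toy model only: the surface forms come from the example above, but the scores are made up and the function `keep_confident` is hypothetical. How Spotlight actually arrives at its scores is more involved.

```python
# Hypothetical candidate links with made-up scores between 0 and 1.
candidates = [
    {"surface_form": "saddle", "score": 0.92},
    {"surface_form": "stirrups", "score": 0.81},
    {"surface_form": "horse", "score": 0.46},
    {"surface_form": "Buck", "score": 0.18},
]


def keep_confident(candidates, confidence):
    """Keep only the surface forms whose score reaches the threshold."""
    return [c["surface_form"] for c in candidates if c["score"] >= confidence]


# A higher threshold leaves fewer annotations; a lower one lets guesses through.
for confidence in (0.1, 0.5, 0.9):
    print(confidence, keep_confident(candidates, confidence))
```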

#### Questions

What exactly does the confidence parameter control?

Can you relate this to the spotting and disambiguation probabilities that were previously discussed?

*Hint:* by selecting the "n-best candidates" checkbox, more information will be provided about each linked surface form.

### Darwin's Origin of Species

With the default confidence value of 0.5, Spotlight succeeds in identifying the most salient topics of the following paragraph:

- Domestic pigeons
- Breed (i.e. specific group of domestic animals)
- Animal fancy

Please click "annotate" below to validate this statement.

There are, however, additional surface forms that are annotated. Whether they should have been linked at all depends on the goal for which entity linking is employed. When a document collection is to be systematically annotated, the goal of this endeavor should be made explicit in annotation guidelines. Ideally, such guidelines make it possible to objectively establish whether an annotation can be considered "correct" or not.

#### Questions

Can you give an example of a rule that would be applicable to annotate all published works by Charles Darwin?
- e.g. "The names of cities should be linked, but the names of countries should not."
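
To make the idea of a checkable rule concrete, the example rule above can be written as a small predicate over annotations. This is only a sketch: the annotation records and their type labels are hypothetical stand-ins, not actual Spotlight output, and `is_allowed` is an illustrative name.

```python
# Hypothetical annotations; the type labels are illustrative only.
annotations = [
    {"surface_form": "London", "types": {"City", "Place"}},
    {"surface_form": "India", "types": {"Country", "Place"}},
    {"surface_form": "Persia", "types": {"Country", "Place"}},
]


def is_allowed(annotation):
    """Example rule: links to countries are not allowed, other links are."""
    return "Country" not in annotation["types"]


allowed = [a["surface_form"] for a in annotations if is_allowed(a)]
print(allowed)
```

A rule in this form judges every annotation the same way, which is what makes it possible to call a link "correct" or "incorrect" without discussion.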

Is it possible to formulate rules according to which all annotations that Spotlight generates in this example are correct?

Is there a confidence value at which it becomes easier to formulate rules that consider all the annotations that Spotlight does (and does not) generate as correct?

In [19]:
# source: http://darwin-online.org.uk/content/frameset?itemID=F373&viewtype=side&pageseq=35
url_text = quote("""
Believing that it is always best to study some special group, I have, after deliberation, taken up domestic pigeons. I have kept every breed which I could purchase or obtain, and have been most kindly favoured with skins from several quarters of the world, more especially by the Hon. W. Elliot from India, and by the Hon. C. Murray from Persia. Many treatises in different languages have been published on pigeons, and some of them are very important, as being of considerable antiquity. I have associated with several eminent fanciers, and have been permitted to join two of the London Pigeon Clubs.
""")
IFrame(f'https://cleverdon.hum.uva.nl/spotlight/demo/?nologo=1&text={url_text}', 900, 450)

## Toponyms

Location names can be notoriously ambiguous. In the following examples we'll see how Spotlight fares with toponyms.

### The MP for Thornhill

Please set the confidence to `0.4` before pressing "annotate," to get the most out of this example.

In [20]:
# source: http://resolver.politicalmashup.nl/ca.proc.d.20080130-377?view=html#ca.proc.d.20080130-377.4.2.11
url_text = quote("""
Mr. Speaker, today I am pleased to present two petitions in the House. The first is from residents in my riding of Thornhill who are eager for federal investments in mass public transit. Today I am presenting a petition calling on the Prime Minister to commit to providing federal funding for the Yonge Street subway extension, which is critical to the quality of life of residents.
""")
IFrame(f'https://cleverdon.hum.uva.nl/spotlight/demo/?nologo=1&text={url_text}', 900, 450)

Spotlight gets both Thornhill and Yonge Street right, and is able to link the adjective “federal” and the Prime Minister to their Canadian counterparts.

Its confidence in Thornhill is fairly low, though, partly because the name is highly ambiguous, but also because in this case it refers to a specific administrative subdivision: the riding.

If we increase the confidence threshold to `0.5` (please do so), Thornhill is no longer linked in the current context. But if “riding” is replaced with “district”, for instance, it is linked to the suburban community of Thornhill instead of the riding with the same name. “District” is now also linked, as a separate concept. One way to think about this is that the modified example is easier for Spotlight to deal with because it more closely resembles the Wikipedia articles it was trained on.

*Hint:* the input text is not editable when the button is labeled "back to text." After clicking on this button, the text becomes editable again.

To find out how many ways there are to incorrectly disambiguate Thornhill, see [this overview](https://www.geonames.org/search.html?q=thornhill&country=).


### Belgian Congo

This example features toponyms that have changed since the time of writing.

In [21]:
url_text = quote("""
In eight days after leaving London one can now be in the Belgian Congo, and the same applies to travellers from Belgium [...] Motor transport takes them to Stanleyville [...] Passengers who fly north from Capetown can change at Broken Hill to a feeder air service to Elisabethsville. Here there is a train link to Port Francqui, at which point connections are established with the Congo airways, which run from Luluabourg to Leopoldville.
""")
IFrame(f'https://cleverdon.hum.uva.nl/spotlight/demo/?nologo=1&text={url_text}', 900, 450)

### Order, please.

In this example, we have redacted the text to obscure the country in which the following exchange took place.

Can you tell in which parliament these words were spoken?

Please annotate the text with a confidence value of `0.4`.

In [22]:
# source (redacted): http://resolver.politicalmashup.nl/xx.proc.d.20080130-377?view=html#xx.proc.d.20080130-377.4.2.11
url_text = quote("""
A: Mr. Speaker, the [...] obviously think [...] are foolish. They think [...] cannot understand that it is not a proper use of parliamentary funds to run a party office in a [...] where they do not even have a single parliamentarian. That is one of the many reasons why [...] understand one does not have to have ever been in power to be a hypocrite that big.

B: Order, please. The Prime Minister knows that word is an unparliamentary word.

C: Mr. Speaker, can the Prime Minister please provide an update to the House on [...] work with our allies in response to the situation in Ukraine?
""")
IFrame(f'https://cleverdon.hum.uva.nl/spotlight/demo/?nologo=1&text={url_text}', 900, 450)

To see how well Spotlight guessed which country it is, please check the link target of "Prime Minister."

That answer doesn't seem right.

Can you guess which word threw Spotlight off course?

---

Let’s see what happens to the link target when we remove “hypocrite” from the input text.

---

It may seem strange to us that the use of the word "hypocrite" would reveal much about the nationality of person A (the Prime Minister).

#### Question

What *should* an entity linker pay attention to instead, if it was trained on these parliamentary corpora?

The parliamentary language used in Commonwealth parliaments overlaps considerably. Which words are most informative for distinguishing between Canadian and British parliamentary language, for instance?

*Hint:* compare the number of search results for the following queries:
- Canadian parliament: [“order”](http://search.politicalmashup.nl/?query=%20%7B%22page%22:1,%22debug%22:false,%22useRegexQuery%22:false,%22regexQuery%22:%22%22,%22query%22:%22%5C%22order%5C%22%22,%22downloadAmount%22:1000,%22selectedCollection%22:%22Canada%22,%22selectedDocType%22:%22Speech%22,%22selectedOrder%22:%22Relevance%22,%22excludedPartiesTags%22:%5B%5D,%22excludedSpeakersTags%22:%5B%5D,%22selectedSpeakersTags%22:%5B%5D,%22selectedPartiesTags%22:%5B%5D,%22partyFacets%22:%7B%7D,%22speakerFacets%22:%7B%7D,%22houseFacets%22:%7B%7D,%22categoryFacets%22:%7B%7D,%22dossierFacets%22:%7B%7D,%22roleFacets%22:%7B%7D,%22excludedParties%22:%7B%7D,%22excludedSpeakers%22:%7B%7D,%22sliderYearMin%22:1800,%22sliderYearMax%22:2018,%22dateStart%22:%221800-01-01T00:00:00.000Z%22,%22dateEnd%22:%222018-01-01T00:00:00.000Z%22,%22docType%22:%22speech%22,%22searchTopicTitleOnly%22:false,%22searchClicked%22:false,%22advancedSearchOpened%22:false,%22graphsOpened%22:false%7D) vs [“order please”](http://search.politicalmashup.nl/?query=%20%7B%22page%22:1,%22debug%22:false,%22useRegexQuery%22:false,%22regexQuery%22:%22%22,%22query%22:%22%5C%22order%20please%5C%22%22,%22downloadAmount%22:1000,%22selectedCollection%22:%22Canada%22,%22selectedDocType%22:%22Speech%22,%22selectedOrder%22:%22Relevance%22,%22excludedPartiesTags%22:%5B%5D,%22excludedSpeakersTags%22:%5B%5D,%22selectedSpeakersTags%22:%5B%5D,%22selectedPartiesTags%22:%5B%5D,%22partyFacets%22:%7B%7D,%22speakerFacets%22:%7B%7D,%22houseFacets%22:%7B%7D,%22categoryFacets%22:%7B%7D,%22dossierFacets%22:%7B%7D,%22roleFacets%22:%7B%7D,%22excludedParties%22:%7B%7D,%22excludedSpeakers%22:%7B%7D,%22sliderYearMin%22:1800,%22sliderYearMax%22:2018,%22dateStart%22:%221800-01-01T00:00:00.000Z%22,%22dateEnd%22:%222018-01-01T00:00:00.000Z%22,%22docType%22:%22speech%22,%22searchTopicTitleOnly%22:false,%22searchClicked%22:true,%22advancedSearchOpened%22:false,%22graphsOpened%22:false,%22yearFacets%22:%7B%7D%7D)
- UK parliament: [“order”](http://search.politicalmashup.nl/?query=%20%7B%22page%22:1,%22debug%22:false,%22useRegexQuery%22:false,%22regexQuery%22:%22%22,%22query%22:%22%5C%22order%5C%22%22,%22downloadAmount%22:1000,%22selectedCollection%22:%22United%20Kingdom%22,%22selectedDocType%22:%22Speech%22,%22selectedOrder%22:%22Relevance%22,%22excludedPartiesTags%22:%5B%5D,%22excludedSpeakersTags%22:%5B%5D,%22selectedSpeakersTags%22:%5B%5D,%22selectedPartiesTags%22:%5B%5D,%22partyFacets%22:%7B%7D,%22speakerFacets%22:%7B%7D,%22houseFacets%22:%7B%7D,%22categoryFacets%22:%7B%7D,%22dossierFacets%22:%7B%7D,%22roleFacets%22:%7B%7D,%22excludedParties%22:%7B%7D,%22excludedSpeakers%22:%7B%7D,%22sliderYearMin%22:1800,%22sliderYearMax%22:2018,%22dateStart%22:%221800-01-01T00:00:00.000Z%22,%22dateEnd%22:%222018-01-01T00:00:00.000Z%22,%22docType%22:%22speech%22,%22searchTopicTitleOnly%22:false,%22searchClicked%22:true,%22advancedSearchOpened%22:false,%22graphsOpened%22:false,%22yearFacets%22:%7B%7D%7D) vs [“order please”](http://search.politicalmashup.nl/?query=%20%7B%22page%22:1,%22debug%22:false,%22useRegexQuery%22:false,%22regexQuery%22:%22%22,%22query%22:%22%5C%22order%20please%5C%22%22,%22downloadAmount%22:1000,%22selectedCollection%22:%22United%20Kingdom%22,%22selectedDocType%22:%22Speech%22,%22selectedOrder%22:%22Relevance%22,%22excludedPartiesTags%22:%5B%5D,%22excludedSpeakersTags%22:%5B%5D,%22selectedSpeakersTags%22:%5B%5D,%22selectedPartiesTags%22:%5B%5D,%22partyFacets%22:%7B%7D,%22speakerFacets%22:%7B%7D,%22houseFacets%22:%7B%7D,%22categoryFacets%22:%7B%7D,%22dossierFacets%22:%7B%7D,%22roleFacets%22:%7B%7D,%22excludedParties%22:%7B%7D,%22excludedSpeakers%22:%7B%7D,%22sliderYearMin%22:1800,%22sliderYearMax%22:2018,%22dateStart%22:%221800-01-01T00:00:00.000Z%22,%22dateEnd%22:%222018-01-01T00:00:00.000Z%22,%22docType%22:%22speech%22,%22searchTopicTitleOnly%22:false,%22searchClicked%22:true,%22advancedSearchOpened%22:false,%22graphsOpened%22:false,%22yearFacets%22:%7B%7D%7D)

### Another London

In this example, Spotlight's contextual score cannot overcome the "commonness" prior.

In [23]:
url_text = quote("""
Two CNR trains running between London and Toronto and passing through St. Mary's at 8:05 a.m. and 8.20 p.m., and which did not stop at the depot here, have been advised to do so.
""")
IFrame(f'https://cleverdon.hum.uva.nl/spotlight/demo/?nologo=1&text={url_text}', 900, 450)
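
The tension between the "commonness" prior and the contextual score can be illustrated with a toy calculation. All numbers below are made up, and multiplying the two scores is an assumption chosen for illustration, not a description of Spotlight's actual model. The point is only that a strong prior can outweigh context that clearly favors another candidate.

```python
# Made-up scores for two candidate entities for the surface form "London".
candidates = {
    "London (England)": {"prior": 0.90, "context": 0.20},
    "London, Ontario": {"prior": 0.03, "context": 0.90},
}


def combined_score(scores):
    """Toy combination: multiply the commonness prior by the contextual score."""
    return scores["prior"] * scores["context"]


# The candidate with the highest combined score wins the link.
best = max(candidates, key=lambda name: combined_score(candidates[name]))
print(best)
```

Even though the context score of the Ontario candidate is much higher in this toy setup, its low prior keeps the combined score below that of the better-known city.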

## Literature

### Alice in Wonderland



In [24]:
url_text = quote("""
Alice was not a bit hurt, and she jumped up on to her feet in a moment: she looked up, but it was all dark overhead; before her was another long passage, and the White Rabbit was still in sight, hurrying down it.
""")
IFrame(f'https://cleverdon.hum.uva.nl/spotlight/demo/?nologo=1&text={url_text}', 900, 450)

### Adam without Eve

Try a lower confidence value to reunite this famous couple.

In [25]:
# source: https://gist.github.com/phillipj/4944029
url_text = quote("""
And Adam called his wife's name Eve; because she was the mother of all living.
""")
IFrame(f'https://cleverdon.hum.uva.nl/spotlight/demo/?nologo=1&text={url_text}', 900, 450)

### The Gospel According to Saint Matthew

Which confidence value yields the most correct links, without introducing any mistakes?



In [26]:
url_text = quote("""
Now the birth of Jesus Christ was on this wise: When as his mother Mary was espoused to Joseph, before they came together, she was found with child of the Holy Ghost.
""")
IFrame(f'https://cleverdon.hum.uva.nl/spotlight/demo/?nologo=1&text={url_text}', 900, 450)

## Miscellaneous

### Livingstone: famous, but not for his writing

- https://en.wikipedia.org/wiki/Ken_Livingstone

In [27]:
# source: http://resolver.politicalmashup.nl/uk.proc.d.1990-07-03?view=html&q=nietzsche#uk.proc.d.1990-07-03.7.2.7
url_text = quote("""
The entry in the Register of Members' Interests for the hon. Member for Brent, East (Mr. Livingstone), who is a director of a publishing company called Localaction Ltd., says: It is a company formed to cover the publication of my book and any other major writing. What an interesting thought. I see a seamless stream of writers, Nietzsche, Marx, Lenin, Socrates, Livingstone — that has a certain ring about it.
""")
IFrame(f'https://cleverdon.hum.uva.nl/spotlight/demo/?nologo=1&text={url_text}', 900, 450)

### Medical Officers of Health Report

In [28]:
# source: https://wellcomelibrary.org/moh/report/b18248342/1#?c=0&m=0&s=0&cv=4&z=-0.0458%2C1.1963%2C1.1517%2C0.4496
url_text = quote("""
The Board has given much attention to the question of Public Urinals and Conveniences, and has caused to be erected in the Broadway and on Knightsbridge Green, three patent cast-iron Urinals, of a design which it was hoped would be so far unobjectionable as to admit of their being placed in the most public thoroughfares. This hope has not altogether been realized; and the Board has since, in conjunction with the Burial Board of St. Margaret and St. John, caused to be erected in Great Chapel Street, adjoining the Broadway Burying Ground, another form of Urinal, upon the principle of adapting the design to the immediate requirements of the locality,—a principle which the Board believes will, in this instance at all events, be successful.
""")
IFrame(f'https://cleverdon.hum.uva.nl/spotlight/demo/?nologo=1&text={url_text}', 900, 450)

## Your own documents

To interactively annotate any document of your choosing, you may use the following UIs:

- https://cleverdon.hum.uva.nl/spotlight/demo/ (en & nl)
- https://www.dbpedia-spotlight.org/demo/ (en, de, & pt)
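
For larger collections, clicking through a widget does not scale, and a programmatic request may be more practical. The sketch below only prepares such a request without sending it. It assumes the public DBpedia Spotlight endpoint at `https://api.dbpedia-spotlight.org/en/annotate` with `text` and `confidence` query parameters; please verify this against the service you intend to use. The function name `build_annotate_request` is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed public endpoint; swap in the URL of your own Spotlight server if needed.
API_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_annotate_request(text, confidence=0.5, api_url=API_URL):
    """Prepare (but do not send) a GET request that asks for JSON annotations."""
    query = urlencode({"text": text, "confidence": confidence})
    return Request(f"{api_url}?{query}", headers={"Accept": "application/json"})


request = build_annotate_request("She named her horse 'Buck.'", confidence=0.4)
print(request.full_url)
```

If the endpoint is reachable, the prepared request could be sent with `urllib.request.urlopen(request)` and the response parsed with `json.load`.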