<img src="data/images/lecture-notebook-header.png" />

# Entity Linking

Entity linking is a natural language processing (NLP) technique that identifies named entities in a given text and links them to their entries in a knowledge base or database of entities. Named entities are real-world objects that are referred to by name, such as people, organizations, places, and products. Since the same name can refer to different entities, entity linking has to disambiguate each mention and determine which specific entity it refers to.

For example, consider the sentence "Steve Jobs was the CEO of Apple." In this sentence, "Steve Jobs" and "Apple" are named entities. Entity linking would involve identifying that "Steve Jobs" refers to the person Steve Jobs and that "Apple" refers to the company Apple Inc. Entity linking is important in many NLP tasks, such as information retrieval, text classification, and question answering. By linking entities to a knowledge base, it enables machines to better understand the meaning of text and answer questions more accurately.

## Setting up the Notebook

In this notebook, we use [DBpedia Spotlight](https://www.dbpedia-spotlight.org/), an open-source tool for entity linking. DBpedia Spotlight analyzes text, identifies mentions of entities such as people, organizations, and locations, and links each mention to its entry in DBpedia, a structured knowledge base extracted from Wikipedia. This allows the entities to be identified and contextualized.

DBpedia Spotlight can be used in a variety of NLP applications, such as information retrieval, text classification, and question answering. It is available as a RESTful web service and can be accessed through various programming languages, including Java, Python, and Ruby. It is designed to be highly scalable and efficient, making it suitable for large-scale applications. It also allows for customization and configuration, enabling users to adapt the tool to their specific needs.
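To make the web-service idea concrete, here is a minimal sketch of how a direct call to the REST interface could look, using only the Python standard library. It assumes the public demo endpoint `https://api.dbpedia-spotlight.org/en/annotate` is reachable and that the JSON response lists its matches under `Resources` with the keys `@surfaceForm` and `@URI`; the function names are hypothetical helpers introduced for this illustration only.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public demo endpoint; a self-hosted server would use its own base URL.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_spotlight_request(text, confidence=0.5):
    """Prepare (but do not send) an annotate request for the given text."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(
        f"{SPOTLIGHT_URL}?{query}",
        headers={"Accept": "application/json"},  # ask for JSON output
    )


def extract_links(response_json):
    """Return (surface form, DBpedia URI) pairs from a parsed annotate response."""
    return [(r["@surfaceForm"], r["@URI"]) for r in response_json.get("Resources", [])]


def annotate(text, confidence=0.5, timeout=10):
    """Send the request and return the linked entities (needs network access)."""
    request = build_spotlight_request(text, confidence)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_links(json.load(response))
```

Calling `annotate("Steve Jobs was the CEO of Apple.")` would then return the list of mentions together with their DBpedia URIs, provided the service is available. The `confidence` parameter controls how strict the service is when deciding whether to link a mention.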

Conveniently, [DBpedia Spotlight for spaCy](https://spacy.io/universe/project/spacy-dbpedia-spotlight) extends the spaCy analysis pipeline with DBpedia Spotlight.


In [1]:
import spacy
from spacy import displacy
import spacy_dbpedia_spotlight
import requests, json

nlp = spacy.load("en_core_web_trf")
#nlp = spacy.load("en_core_web_lg")
nlp.add_pipe('dbpedia_spotlight')

<spacy_dbpedia_spotlight.entity_linker.EntityLinker at 0x7f95e0221b70>

## Running some Examples

The code cell below contains some of the example sentences that we saw throughout the lecture. Since DBpedia Spotlight is now integrated into the spaCy pipeline, we can just analyze the sentences as usual.

In [2]:
# The second assignment overrides the first; comment it out to try the full name "Elon Musk"
text = "Elon Musk bought Twitter, headquartered in San Francisco, in October 2022 for the amount of $44 Billion to avoid trial."
text = "Musk bought Twitter, headquartered in San Francisco, in October 2022 for the amount of $44 Billion to avoid trial."
#text = "Washington was born into slavery on a farm of James Burroughs."
#text = "Bob arrived in Washington for what may well be his last state visit."
#text = "The Washington had proved to be a leaky ship."
#text = "Leonhard Euler was born in Basel."

doc = nlp(text)

Two things you should notice:

* The analysis now takes noticeably longer. This is simply because the pipeline needs to call the DBpedia Spotlight API to access the DBpedia knowledge graph.

* The results are not perfect. For example, the mention "Elon Musk" is linked to the correct entity, but the linking fails if the text only says "Musk".

We can also visualize the results using the [spaCy visualizers](https://spacy.io/usage/visualizers). In fact, we use the same approach as for NER, but note how the output is now different and includes the links to the corresponding DBpedia pages (i.e., the unique URLs to the concepts in the knowledge base).


In [3]:
displacy.render(doc, style="ent")
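Besides the visualization, the links can also be read programmatically. The following is a minimal sketch that assumes the DBpedia Spotlight extension stores the DBpedia URI of each linked mention in the entity's `kb_id_` attribute; `linked_entities()` is a hypothetical helper, not part of spaCy or the extension.

```python
def linked_entities(doc):
    """Return (mention text, entity label, DBpedia URI) for every linked entity.

    Entities without a knowledge base ID (empty `kb_id_`) are skipped.
    """
    return [(ent.text, ent.label_, ent.kb_id_) for ent in doc.ents if ent.kb_id_]
```

With the `doc` object from above, `linked_entities(doc)` yields one tuple per linked mention, which is convenient for storing the links or comparing the results across the example sentences.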

## Doing it "Manually"

While using spaCy with the DBpedia extension is convenient, it also hides all of the logic. This means we have to rely on how well DBpedia Spotlight works. Alternatively, we can directly use knowledge base APIs to search for candidates and then rank, filter, and select the best candidate ourselves. To illustrate this, we can use Wikidata, a free and open knowledge graph maintained by the Wikimedia Foundation.

Wikidata provides a structured data model that enables the creation, editing, and linking of data across a wide range of domains and disciplines, such as history, science, culture, and geography. It contains data about entities such as people, places, organizations, events, and concepts, and it is designed to be machine-readable and interlinked with other knowledge resources on the web. Wikidata's data is contributed and maintained by a community of editors and volunteers from around the world, and it is used by a wide range of applications, such as search engines, recommender systems, and natural language processing tools. The structured data model used by Wikidata is based on Semantic Web technologies such as RDF and OWL, and it is designed to support the integration and interoperability of data from different sources and domains.

The function `call_wikidata_api()` implements a very basic search API call that takes a search term as input and returns the top matching results as a list of JSON objects (at most `topk` entries).


In [None]:
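def call_wikidata_api(search, topk=5):
    # Let requests URL-encode the parameters so search terms with spaces or symbols work
    params = {"action": "wbsearchentities", "search": search, "language": "en", "format": "json"}
    try:
        response = requests.get("https://www.wikidata.org/w/api.php", params=params, timeout=10)
        response.raise_for_status()
        return response.json().get("search", [])[:topk]
    except (requests.RequestException, ValueError):
        # Network problems, HTTP errors, or invalid JSON: report no candidates
        return []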

Let's call the function with the search term "Euler":

In [None]:
for candidate in call_wikidata_api("Euler"):
    # Not every entry has a description, so fall back to a placeholder
    print("[{}] {} ({})".format(candidate['id'], candidate['label'], candidate.get('description', 'no description')))

Presumably, the API sorts the results based on some criteria reflecting the entity's popularity in the knowledge base. However, since we generally get a whole list of candidates, we can perform any kind of additional candidate selection to find the best-matching entity for our mention. A full treatment of candidate selection is beyond the scope of this notebook though.
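Just to give a flavor of what such a candidate selection could look like, here is a deliberately naive sketch that re-ranks candidates by the number of words their description shares with the sentence containing the mention. This is an illustrative heuristic, not the method used by DBpedia Spotlight or Wikidata; the function names are hypothetical, and the candidates are assumed to be dictionaries with an optional `description` key, as returned by the search call above.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def rerank_by_context(candidates, context):
    """Sort candidates by how many words their description shares with the context.

    The sort is stable, so candidates with the same overlap keep the API's order.
    """
    context_tokens = tokenize(context)

    def overlap(candidate):
        # Candidates without a description get an overlap of 0
        return len(tokenize(candidate.get("description", "")) & context_tokens)

    return sorted(candidates, key=overlap, reverse=True)
```

For example, `rerank_by_context(call_wikidata_api("Euler"), "Leonhard Euler was born in Basel.")` would move candidates whose description mentions words from the sentence towards the top. Note that this simple overlap also counts common words such as "in"; a practical approach would at least remove stop words or use more robust similarity measures.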

---

## Summary

Entity linking is a process in natural language processing that connects textual mentions of entities, such as names of people, places, or things, to their corresponding entries in a knowledge base or database. The goal is to resolve ambiguous references in text by identifying the specific entity being referred to. It involves recognizing the entity in the text and then linking it to a unique identifier in a knowledge base, enabling systems to better understand and process information.

Practically, entity linking has numerous applications across various fields. In information retrieval and web search, it improves search accuracy by providing relevant information about entities mentioned in queries or documents. In content recommendation systems, it enhances personalization by understanding user interests through linked entities. Moreover, in data integration and knowledge graph construction, entity linking helps in connecting disparate datasets and building comprehensive knowledge graphs that facilitate better data analysis and knowledge representation.

Furthermore, entity linking plays a crucial role in natural language understanding tasks like question answering, information extraction, and sentiment analysis. By linking entities to rich sources of information in knowledge bases like DBpedia, Wikidata, or Freebase, systems can augment their understanding of text, enabling more sophisticated and context-aware analysis and decision-making.