<img src="data/images/lecture-notebook-header.png" />

# Named Entity Recognition

Named Entity Recognition (NER) is a natural language processing (NLP) technique used to identify and classify named entities within text into predefined categories such as names of persons, organizations, locations, dates, quantities, and more. The primary goal of NER is to extract and categorize specific entities mentioned in text. NER is useful for several reasons:

* **Information Extraction:** It helps extract structured information from unstructured text, enabling systems to understand the meaning and context of text data better.

* **Entity Classification:** NER categorizes entities into predefined types, aiding in information retrieval, analysis, and organization. For example, identifying names of people, organizations, or locations within news articles can facilitate better indexing and search functionalities.

* **Enhancing Search and Retrieval:** By identifying and categorizing named entities, NER improves the accuracy and relevance of search results. It helps search engines retrieve documents or information more precisely based on the types of entities a user is interested in.

* **Relation Extraction:** NER serves as a foundation for more complex tasks like relation extraction, which analyzes the relationships between different entities. For instance, relation extraction can identify which person is the CEO of which company based on news articles.

* **Information Summarization:** NER assists in summarizing text by highlighting or focusing on important entities. It aids in generating concise summaries by identifying key persons, locations, or events within a document.

* **Domain-specific Applications:** In various domains like healthcare, finance, legal, and more, NER is essential for extracting specific entities crucial for domain-specific analysis or compliance requirements.

Overall, Named Entity Recognition is a fundamental NLP task that plays a crucial role in extracting, categorizing, and understanding named entities within text data. Its applications span across diverse fields, contributing significantly to information retrieval, analysis, and knowledge extraction from textual content.
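
To make the idea of extracting structured information concrete, the following minimal sketch represents entities as labeled character spans and groups them by type. The annotations are written by hand for illustration (they are not the output of any model), and `EntitySpan`, `annotate`, and `group_by_label` are hypothetical helper names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    text: str
    label: str
    start: int  # character offset of the first character
    end: int    # character offset one past the last character


def annotate(sentence, labeled_phrases):
    """Turn hand-written (phrase, label) pairs into character-offset spans."""
    spans = []
    for phrase, label in labeled_phrases:
        start = sentence.find(phrase)
        if start == -1:
            raise ValueError(f"{phrase!r} does not occur in the sentence")
        spans.append(EntitySpan(phrase, label, start, start + len(phrase)))
    return spans


def group_by_label(spans):
    """Index entity texts by their type, e.g. for type-aware search."""
    index = defaultdict(list)
    for span in spans:
        index[span.label].append(span.text)
    return dict(index)


sentence = "Musk bought Twitter in October 2022 for $44 Billion."
# Hand-written annotations for illustration only (not model output).
spans = annotate(sentence, [
    ("Musk", "PERSON"),
    ("Twitter", "ORG"),
    ("October 2022", "DATE"),
    ("$44 Billion", "MONEY"),
])

for span in spans:
    print(span.label, span.text, span.start, span.end)
print(group_by_label(spans))
```

An NER system automates the annotation step; the grouping step shows how the extracted entities can then support indexing and search by entity type.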

## Setting up the Notebook

### spaCy packages & Language Models

In [1]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
#nlp = spacy.load("en_core_web_lg")
#nlp = spacy.load("en_core_web_trf")

### NLTK Packages

In [2]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
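
The NLTK functions imported above rely on data packages (a tokenizer, a POS tagger, an NE chunker, and a word list) that are downloaded separately from the library itself. The following sketch tries to fetch them. The resource names are an assumption based on common NLTK releases (newer releases appear to expect the `_tab` and `_eng` variants), and `download_ner_resources` is a hypothetical helper name.

```python
import nltk

# Assumed resource names: the first four are used by older NLTK releases,
# the last three are the variants expected by newer releases.
NER_RESOURCES = [
    "punkt",
    "averaged_perceptron_tagger",
    "maxent_ne_chunker",
    "words",
    "punkt_tab",
    "averaged_perceptron_tagger_eng",
    "maxent_ne_chunker_tab",
]


def download_ner_resources(resources=NER_RESOURCES):
    """Try to download each resource and report which downloads succeeded."""
    return {name: bool(nltk.download(name, quiet=True)) for name in resources}
```

Calling `download_ner_resources()` once per environment should be enough; names that the installed NLTK version does not know should simply be reported as `False`.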

### Example Sentences

In [None]:
text = "Musk bought Twitter, headquartered in San Francisco, in October 2022 for the amount of $44 Billion to avoid trial."
#text = "Alice is watching The Dark Knight"
#text = "Washington was born into slavery on a farm of James Burroughs."
#text = "Washington went up 2 games to 1 in the four-game series."
#text = "Bob arrived in Washington for what may well be his last state visit."
#text = "The Washington had proved to be a leaky ship."
#text = "Musk bought Twitter in October 2022 for $44 Billion."
#text = "Microsoft was founded by Bill Gates and Paul Allen in 1975."
#text = "Bob flew to New York."
#text = "Leonhard Euler was born on 15 April 1707, in Basel, Switzerland."
#text = "Not long ago, I met with Alice in a restaurant to discuss the upcoming business travel to Australia after the semester"

## NER with spaCy

spaCy is a popular Python library for natural language processing, including named entity recognition (NER). To perform NER with spaCy, we first parse and analyze our text. By default, named entities are identified as part of this analysis.

**Side note:** If NER is not required, the NER component can be switched off to speed up spaCy's analysis of a document.
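
As a minimal sketch of this side note, the NER component can be skipped either temporarily for a single call or for every call by disabling it at load time. `parse_without_ner` is a hypothetical helper; it assumes the pipeline contains a component named `"ner"`, which should be the case for the `en_core_web_*` models.

```python
import spacy


def parse_without_ner(nlp, text):
    """Process `text` with the "ner" component temporarily disabled."""
    # select_pipes restores the disabled component when the block exits.
    with nlp.select_pipes(disable=["ner"]):
        return nlp(text)


# Alternative: skip NER for every call by disabling it at load time, e.g.
# nlp_no_ner = spacy.load("en_core_web_sm", disable=["ner"])
```

With the `nlp` object loaded earlier, `parse_without_ner(nlp, text).ents` should be empty, since no component assigns entities.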


In [None]:
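# Run the spaCy pipeline (including NER) on the text
doc = nlp(text)

# Print each token together with a flag indicating whether it is a stop word
for t in doc:
    print(t, t.is_stop)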

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)
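
The label strings printed by this loop (e.g., `GPE`) are abbreviations. As a small sketch, `spacy.explain` looks up a short description for a label in spaCy's glossary; the list of labels below is only an assumed sample of labels commonly produced by the English models.

```python
import spacy

# Assumed sample of entity labels used by spaCy's English models.
labels = ["PERSON", "ORG", "GPE", "DATE", "MONEY"]

# spacy.explain returns a description string, or None for unknown terms.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:8} {description}")
```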

spaCy comes with a series of [visualizers](https://spacy.io/usage/visualizers), including one for highlighting named entities. Let's give it a go:

In [None]:
displacy.render(doc, style="ent")

As you can see, spaCy makes Named Entity Recognition very easy to use since it is a built-in feature. Apart from using the available NER model, spaCy also allows you to train your own models on custom data. This can be useful if your named entities are very domain-specific and the default model performs poorly on that data. However, training a custom model with spaCy is beyond the scope of this notebook.
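
Short of training a model, one lightweight way to handle domain-specific entities is a rule-based component. The following is a minimal sketch, assuming a handful of known entity names: it uses spaCy's `entity_ruler` on a blank English pipeline, and the patterns and labels are made up for illustration.

```python
import spacy

# A blank pipeline only tokenizes; no pre-trained model is required.
nlp_rules = spacy.blank("en")

# The entity ruler assigns labels based on token or phrase patterns.
ruler = nlp_rules.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Twitter"},
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
])

doc_rules = nlp_rules("Musk bought Twitter, headquartered in San Francisco.")
entities = [(ent.text, ent.label_) for ent in doc_rules.ents]
print(entities)
```

Rules like these only find what they explicitly describe, so they complement rather than replace a trained model.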

## NER with NLTK

NLTK (Natural Language Toolkit) is a popular Python library for natural language processing, which includes tools for named entity recognition (NER). Compared to spaCy, NLTK requires a few more individual steps to perform NER out of the box. First, we need to tokenize our input text and perform POS tagging.


In [None]:
# Split the text into tokens, then assign a part-of-speech tag to each token
raw_words = word_tokenize(text)
tags = pos_tag(raw_words)

With the POS tags, we can identify the named entity chunks, which is essentially what NER does. Note that setting `binary=True` will only identify the named entity chunks but not their types. If we want to distinguish between different types (e.g., PERSON, GPE, etc.), we need to set `binary=False` (which is also the default value).

In [None]:
ne_chunks = nltk.ne_chunk(tags, binary=False)
print(ne_chunks)

We can now loop through the identified chunks to return the named entities:

In [None]:
for chunk in ne_chunks:
    # Only print chunks that are named entities
    if hasattr(chunk, 'label'):
        print(chunk.label(), ' '.join(c[0] for c in chunk))
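
The loop above can also be wrapped in a small helper that returns the entities as data rather than printing them. The sketch below assumes the input is an `nltk.Tree` like the one returned by `ne_chunk`. To stay independent of the downloaded models, it is demonstrated on a hand-built tree (not actual model output), and `extract_entities` is a hypothetical helper name.

```python
from nltk.tree import Tree


def extract_entities(tree):
    """Return (label, text) pairs for every named entity chunk in `tree`."""
    entities = []
    for node in tree:
        # Named entity chunks are subtrees; plain tokens are (word, tag) tuples.
        if isinstance(node, Tree):
            words = [word for word, _tag in node.leaves()]
            entities.append((node.label(), " ".join(words)))
    return entities


# Hand-built tree for illustration only (not the output of ne_chunk).
example_tree = Tree("S", [
    Tree("PERSON", [("Bob", "NNP")]),
    ("flew", "VBD"),
    ("to", "TO"),
    Tree("GPE", [("New", "NNP"), ("York", "NNP")]),
    (".", "."),
])

print(extract_entities(example_tree))
```

With `binary=True`, the same helper would return the generic label `NE` for every chunk.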

The results are arguably not as good as spaCy's. Firstly, NLTK misses some of the named entities that spaCy found. Secondly, NLTK labels "Twitter" as a PERSON. Clearly, the model that ships with NLTK is less sophisticated than spaCy's. Note, however, that NLTK focuses on providing a toolbox for users to train their own models, while spaCy aims to offer state-of-the-art performance out of the box.

---

## Summary

Both spaCy and NLTK are popular Python libraries used for natural language processing tasks, including named entity recognition (NER). However, there are some differences in their approaches to NER.

* spaCy is a modern and efficient library that relies on neural network models to perform NER. It is designed to be fast and scalable, and provides pre-trained models for multiple languages. spaCy also allows users to train their own custom NER models using labeled data. spaCy's NER is generally considered accurate and is used in a wide range of industries and applications.

* NLTK, on the other hand, is an older library. Its built-in `ne_chunk` function relies on a pre-trained statistical classifier (a maximum entropy chunker) that operates on POS-tagged tokens rather than on a neural model. The bundled chunker targets English, and NLTK additionally offers tools for building custom chunkers, including rule-based ones. NLTK's NER may not be as accurate as spaCy's, but it can still be useful for certain tasks and applications.

In summary, spaCy is a powerful and modern library for NER that relies on neural network models, while NLTK is an older toolkit whose default NER uses a simpler statistical classifier. Both libraries have their strengths and weaknesses, and the choice between them depends on the specific requirements and constraints of a project.
