# 1. Named Entity Recognition (NER)

Named entity recognition (NER) aims to identify real-world entities, such as the names of people, organizations, and locations, within historical documents. The term *named entity (NE)*, widely used in Information Extraction (IE) and other Natural Language Processing (NLP) applications, originated in the Message Understanding Conferences (MUC), which shaped IE research between 1988 and 1996. Since 1999, the yearly Conference on Natural Language Learning (CoNLL) has covered a broad range of NLP topics, mostly through machine learning approaches.

### (a) What are named entities?

Named entities are generally proper nouns that refer to a specific entity, which can be a person, organization, location, date, etc. Consider this example: *Mount Everest is the tallest mountain above sea level*. Here, *Mount Everest* is a named entity of type location because it refers to one specific place.

The table below lists further examples of named entities.


| Entity type | Examples |
|-----|---|
|  ORGANIZATION   | United Nations Organization, UNICEF, Microsoft |
|  PERSON   | Novak Djokovic, Beyoncé, Scarlett Johansson |
|  LOCATION   |  Mount Everest, River Nile, Machu Picchu Archaeological Park  |
|  DATE   |  3rd April 1988, 7 June  |
|  TIME   | 8:45 A.M., one-thirty am |
|  GPE   |  France, Liechtenstein, Democratic Republic of Congo |
|  MONEY   |  7 million dollars, 73.01 INR |

What counts as a named entity (NE) in a text is open to discussion and depends on the kind of information one wants to extract. However, the most widely used set of named entity classes contains three fundamental entity types, person (PER), location (LOC), and organization (ORG), collectively referred to as *enamex* since the MUC-6 competition ([Grishman and Sundheim, 1996](https://aclanthology.org/C96-1079.pdf)).

### (b) Why are named entities important? (case studies)

The detection of entities can be considered as a first step in the exploration of data collections. 



#### Classifying content for news providers

News publishers generate large amounts of content every day, and managing it well is essential to getting the most out of each article. NER can automatically scan an entire collection of articles and reveal the major people, organizations, and places discussed in them. Knowing this information helps to categorize the articles automatically into defined hierarchies and enables smooth content discovery. It can also save a lot of time and boost the efficiency of editorial teams.
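As a minimal sketch of this idea, assume an NER system has already returned `(text, label)` pairs for each article. The `articles` data and the `top_entities` helper below are hypothetical and serve only as an illustration; counting mentions across the collection shows which entities dominate it.

```python
from collections import Counter

# Hypothetical NER output: one list of (entity text, entity type) pairs per article.
articles = [
    [("United Nations", "ORGANIZATION"), ("France", "GPE")],
    [("Novak Djokovic", "PERSON"), ("France", "GPE")],
    [("United Nations", "ORGANIZATION"), ("France", "GPE"), ("Microsoft", "ORGANIZATION")],
]

def top_entities(articles, label, n=3):
    """Return the n most frequently mentioned entities of a given type."""
    counts = Counter(
        text
        for entities in articles
        for text, ent_label in entities
        if ent_label == label
    )
    return counts.most_common(n)

print(top_entities(articles, "GPE"))           # [('France', 3)]
print(top_entities(articles, "ORGANIZATION"))  # [('United Nations', 2), ('Microsoft', 1)]
```

The resulting counts could then feed a categorization step, for example by assigning each article to the section of its most frequent entity.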

#### Automating customer support

There are a number of ways to make customer feedback handling smoother, and NER could be one of them. For example, the customer support department of an electronics store with branches worldwide has to go through a large number of mentions in customer feedback comments. NER could extract entities such as locations and products, which can then be used to categorize each complaint and assign it to the department within the organization that should handle it.
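The routing step could look like the sketch below. It assumes entities have already been extracted as `(text, label)` pairs; the `PRODUCT` label, the department names, and the `route_complaint` helper are all hypothetical.

```python
# Hypothetical mapping from a product entity to the department that handles it.
PRODUCT_DEPARTMENTS = {
    "laptop": "computing",
    "headphones": "audio",
}

def route_complaint(entities, default="general support"):
    """Pick a department and a branch location from the entities in one complaint."""
    department = default
    location = None
    for text, label in entities:
        if label == "PRODUCT" and text.lower() in PRODUCT_DEPARTMENTS:
            department = PRODUCT_DEPARTMENTS[text.lower()]
        elif label in ("GPE", "LOCATION") and location is None:
            location = text  # keep the first place mentioned
    return department, location

# Entities as an NER system might return them for
# "The laptop I bought in Paris stopped working after a week."
entities = [("laptop", "PRODUCT"), ("Paris", "GPE")]
print(route_complaint(entities))  # ('computing', 'Paris')
```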


#### Exploring historical documents

Historical newspapers are increasingly regarded as an important source of historical knowledge. As the amount of digitized data grows, tools are needed to harvest it and gather information. Tools like NER can be extremely valuable to researchers, historians, and librarians for adding structure to large volumes of unstructured data and for improving access to digitized historical collections. For example, a simple keyword search can already give a historian a sense of whether a collection contains material relevant to their research, saving many hours of visiting archives and skimming through pages. NER can be used to detect person names and locations, both of which have a significant presence in the news domain, where people are often at the core of the events reported in articles.

#### Extracting valuable information from medical documents

Electronic health records are a valuable source of routinely collected health data that can be used for secondary purposes, including clinical and epidemiological research. They typically contain information on consultations, admissions, symptoms, clinical examinations, test results, diagnoses, treatments, and outcomes. Applying NER to free-text sources such as clinic letters or discharge summaries can ease the extraction of prescription information from them. In this case, the NER task can involve extracting different types of entities: drug, strength, duration, route, form, dosage, frequency, reason for administration, etc. NER can also recognize and match demographic factors that could give analysts and doctors deeper insights.

Similar to MUC, another well-known competition, initiated in 2004 by Informatics for Integrating Biology and the Bedside ([i2b2](https://www.i2b2.org/)), was designed to encourage the development of NLP techniques for extracting medication-related information from narrative patient records, in order to accelerate the translation of clinical findings into novel diagnostics and prognostics.


#### Aiding risk assessment for financial institutions

Risk assessment is a crucial activity for financial institutions because it helps them determine the amount of capital they should hold to ensure their stability. Manual extraction of relevant information from text-based financial documents is expensive and time-consuming. NER can extract credit risk attributes from a large volume of *live* financial documents, which can number in the millions for a large bank. In the financial domain, example named entity types are lender, borrower, amount, date, etc.

#### Easing the research process

An online journal or conference publication site can hold millions of research papers and scholarly articles. There can be hundreds of papers on a single topic with only slight variations, and organizing all this data in a well-structured manner can be complex. Grouping the papers by the relevant entities they contain saves readers from wading through everything written on the subject. For instance, if the articles carry different types of entities in their metadata (for example, NER can detect fields of study such as *Named Entity Recognition* and *Information Extraction*), one can quickly find the articles that discuss the use of *named entity recognition in historical documents*.

#Focusing on named entities
#What is a named entity, and why does it matter ?
#Automatic named entity recognition (NER)
#General principles (ML, training data, etc)
#State of the Art examples (list of SotA papers)
#Entity linking
#Use-case (mapping locations on a map)

### i. NER using [NLTK](https://www.nltk.org/)



In [4]:
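import nltk

# Requires NLTK's data for sentence/word tokenization, POS tagging, and NE chunking
# (installable with nltk.download).
def ie_preprocess(document):
    """Return one named-entity chunk tree per sentence of the input text."""
    sentences = nltk.sent_tokenize(document)                      # split the text into sentences
    sentences = [nltk.word_tokenize(sent) for sent in sentences]  # split each sentence into tokens
    sentences = [nltk.pos_tag(sent) for sent in sentences]        # add a part-of-speech tag to each token
    sentences = [nltk.ne_chunk(sent) for sent in sentences]       # group tagged tokens into entity chunks
    return sentences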
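
`nltk.ne_chunk` returns each sentence as a tree in which recognized entities are subtrees labelled with their type (for example `PERSON` or `GPE`), while all other tokens remain plain `(word, tag)` pairs. A minimal sketch for pulling the entities out of one such tree follows; `extract_entities` is a hypothetical helper, not part of NLTK, and it assumes the flat chunk structure that `ne_chunk` produces by default.

```python
def extract_entities(tree):
    """Collect (entity text, entity type) pairs from one ne_chunk sentence tree."""
    entities = []
    for node in tree:
        # Entity chunks are subtrees with a label; ordinary tokens are (word, tag) tuples.
        if hasattr(node, "label"):
            text = " ".join(word for word, _tag in node)
            entities.append((text, node.label()))
    return entities

# Usage with ie_preprocess from above (requires the NLTK data packages):
# for tree in ie_preprocess("John lives in New York."):
#     print(extract_entities(tree))
```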

NER can be framed as a tagging problem: the task is to assign each token in a given sentence an appropriate tag, such as Person or Location. For example:

| Token | John | lives | in | New | York |
|-----|---|---|---|---|---|
| Tag | B-PER | O | O | B-LOC | I-LOC |
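
In this tagging scheme, `B-` marks the first token of an entity, `I-` a token that continues it, and `O` a token outside any entity. Turning a tagged sentence back into entity spans can be sketched as follows; `bio_to_spans` is a hypothetical helper written for this example, and treating a stray `I-` tag as the start of a new entity is one possible convention among several.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, ent_type = tag.partition("-")
        if prefix == "I" and ent_type == current_type:
            current_tokens.append(token)  # continue the open entity
            continue
        # Any other tag closes the entity that was open, if there was one.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))
        if prefix in ("B", "I"):
            # B- starts a new entity; a stray I- is treated as a start as well.
            current_tokens, current_type = [token], ent_type
        else:
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans

tokens = ["John", "lives", "in", "New", "York"]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))  # [('John', 'PER'), ('New York', 'LOC')]
```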
