# 3 - Named-entity recognition 

This chapter will introduce a slightly more advanced topic: named-entity recognition. You'll learn how to identify the who, what, and where of your texts using pre-trained models on English and non-English text. You'll also learn how to use some new libraries, polyglot and spaCy, to add to your NLP toolbox.



## Named Entity Recognition

Named Entity Recognition, or NER for short, is a natural language processing task that identifies important named entities in a text, such as people, places, and organizations. Depending on the libraries and notation you use, entities can also be dates, states, works of art, and other categories. NER can be used alongside topic identification, or on its own to determine important items in a text or to answer basic natural language understanding questions such as who, what, when, and where.
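To make the "who, what, when, and where" idea concrete, here is a minimal sketch that groups already-tagged entities by the question they help answer. The label names (`PERSON`, `ORGANIZATION`, `LOCATION`, `DATE`), the mapping, and the sample pairs are assumptions written by hand for illustration; real label sets differ between libraries.

```python
from collections import defaultdict

# Assumed mapping from an entity label to the question it helps answer.
# Label names vary by library; these are illustrative only.
QUESTION_FOR_LABEL = {
    "PERSON": "who",
    "ORGANIZATION": "who",
    "LOCATION": "where",
    "DATE": "when",
}


def answers_by_question(tagged_entities):
    """Group (text, label) pairs under 'who', 'where', 'when', skipping repeats."""
    answers = defaultdict(list)
    for text, label in tagged_entities:
        question = QUESTION_FOR_LABEL.get(label)
        if question is not None and text not in answers[question]:
            answers[question].append(text)
    return dict(answers)


# Hand-written sample in the shape a tagger might return; not real model output.
tagged = [
    ("Albert Einstein", "PERSON"),
    ("Germany", "LOCATION"),
    ("United States", "LOCATION"),
    ("1933", "DATE"),
    ("Albert Einstein", "PERSON"),
]

answers = answers_by_question(tagged)
print(answers)
```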

For example, take this piece of text from the English Wikipedia article on Albert Einstein. The text has been highlighted for the different types of named entities found using the Stanford NER library. You can see the dates, locations, persons, and organizations found, and extract information about the text based on these named entities. You can use NER to solve problems like fact extraction, as well as to work out which entities are related using computational language models. For example, in this text we can see that Einstein has something to do with the United States, Adolf Hitler, and Germany. We can also see by token proximity that Bertrand Russell and Einstein created the Russell-Einstein Manifesto, all from simple entity highlighting.

![title](./img/NER.png)
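The token-proximity idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `cooccurring_entities`, the `window` parameter, and the token positions in the sample are all hypothetical, and the entities are assumed to have been found by a tagger already.

```python
from collections import Counter
from itertools import combinations


def cooccurring_entities(entities, window=10):
    """Count pairs of distinct entities that start within `window` tokens of each other.

    `entities` is a list of (entity text, index of its first token) tuples.
    """
    pairs = Counter()
    for (name_a, pos_a), (name_b, pos_b) in combinations(entities, 2):
        if name_a != name_b and abs(pos_a - pos_b) <= window:
            # Sort the names so (A, B) and (B, A) count as the same pair.
            pairs[tuple(sorted((name_a, name_b)))] += 1
    return pairs


# Hypothetical tagger output; the token positions are invented for illustration.
entities = [
    ("Einstein", 0),
    ("United States", 12),
    ("Adolf Hitler", 18),
    ("Germany", 25),
    ("Bertrand Russell", 60),
    ("Einstein", 63),
]

pairs = cooccurring_entities(entities, window=10)
for pair, count in pairs.items():
    print(pair, count)
```

A fixed token window is a crude proxy for "related"; it is used here only because it needs nothing beyond entity positions.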

NLTK allows you to interact with named entity recognition via its own model, but also via the aforementioned Stanford library. The Stanford library integration requires you to perform a few steps before you can use it, including installing the required Java files and setting system environment variables. You can also use the Stanford library on its own without integrating it with NLTK, or operate it as an API server. The **Stanford CoreNLP** library has great support for named entity recognition as well as some related NLP tasks, such as coreference (linking pronouns and entities together) and dependency trees, which help with parsing meaning and relationships among words or phrases in a sentence.



### NER with NLTK

You're now going to have some fun with named-entity recognition! A scraped news article has been pre-loaded into your workspace. Your task is to use nltk to find the named entities in this article.

What might the article be about, given the names you found?

Along with nltk, sent_tokenize and word_tokenize from nltk.tokenize have been pre-imported.

In [4]:
article = 'The taxi-hailing company Uber brings into very sharp focus the question of whether corporations can be said to have a moral character. If any human being were to behave with the single-minded and ruthless greed of the company, we would consider them sociopathic. Uber wanted to know as much as possible about the people who use its service, and those who don’t. It has an arrangement with unroll.me, a company which offered a free service for unsubscribing from junk mail, to buy the contacts unroll.me customers had had with rival taxi companies. Even if their email was notionally anonymised, this use of it was not something the users had bargained for. Beyond that, it keeps track of the phones that have been used to summon its services even after the original owner has sold them, attempting this with Apple’s phones even thought it is forbidden by the company.\r\n\r\n\r\nUber has also tweaked its software so that regulatory agencies that the company regarded as hostile would, when they tried to hire a driver, be given false reports about the location of its cars. Uber management booked and then cancelled rides with a rival taxi-hailing company which took their vehicles out of circulation. Uber deny this was the intention. The punishment for this behaviour was negligible. Uber promised not to use this “greyball” software against law enforcement – one wonders what would happen to someone carrying a knife who promised never to stab a policeman with it. Travis Kalanick of Uber got a personal dressing down from Tim Cook, who runs Apple, but the company did not prohibit the use of the app. Too much money was at stake for that.\r\n\r\n\r\nMillions of people around the world value the cheapness and convenience of Uber’s rides too much to care about the lack of drivers’ rights or pay. Many of the users themselves are not much richer than the drivers. The “sharing economy” encourages the insecure and exploited to exploit others equally insecure to the profit of a tiny clique of billionaires. Silicon Valley’s culture seems hostile to humane and democratic values. The outgoing CEO of Yahoo, Marissa Mayer, who is widely judged to have been a failure, is likely to get a $186m payout. This may not be a cause for panic, any more than the previous hero worship should have been a cause for euphoria. Yet there’s an urgent political task to tame these companies, to ensure they are punished when they break the law, that they pay their taxes fairly and that they behave responsibly.'

In [3]:
# Requires the NLTK tokenizer, POS tagger, and NE chunker data (see nltk.download)
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Tokenize the article into sentences: sentences
sentences = sent_tokenize(article)

# Tokenize each sentence into words: token_sentences
token_sentences = [word_tokenize(sent) for sent in sentences]

# Tag each tokenized sentence into parts of speech: pos_sentences
pos_sentences = [nltk.pos_tag(sent) for sent in token_sentences]

# Create the named entity chunks: chunked_sentences
# binary=True labels every entity "NE" instead of a specific type such as PERSON
chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=True)

# Test for stems of the tree with 'NE' tags
for sent in chunked_sentences:
    for chunk in sent:
        # Plain (token, tag) tuples have no label; only chunk subtrees do
        if hasattr(chunk, "label") and chunk.label() == "NE":
            print(chunk)
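To answer "what might the article be about?", one option is to count how often each entity name appears. The sketch below assumes each named-entity chunk exposes its tokens as a list of `(token, tag)` pairs, which is the shape `chunk.leaves()` returns for an NLTK tree. The helper `entity_name` and the sample data are hypothetical and written by hand; they are not the actual output of the exercise.

```python
from collections import Counter


def entity_name(leaves):
    """Join the tokens of one chunk's (token, tag) leaves into a single name."""
    return " ".join(token for token, _tag in leaves)


# Hand-written sample in the shape of `chunk.leaves()`; not real model output.
sample_leaves = [
    [("Uber", "NNP")],
    [("Travis", "NNP"), ("Kalanick", "NNP")],
    [("Uber", "NNP")],
    [("Tim", "NNP"), ("Cook", "NNP")],
    [("Apple", "NNP")],
    [("Uber", "NNP")],
]

# The most frequent names hint at the article's main subject.
name_counts = Counter(entity_name(leaves) for leaves in sample_leaves)
print(name_counts.most_common(1))
```

In the exercise itself, the same counter could be fed with `entity_name(chunk.leaves())` for every chunk that passes the `"NE"` label check.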

## Introduction to spaCy

## Multilingual NER with polyglot