## Hello everyone,

#### Welcome.

In this notebook I'll show you a really simple example of Named Entity Recognition (NER).

Using NER it's possible to detect and classify named entities in a sentence.

Wiki: https://en.wikipedia.org/wiki/Named-entity_recognition

I'll compare the NER chunker built into NLTK (Natural Language Toolkit) with StanfordNERTagger. Both are available through the nltk package in Python.

NLTK Documentation: http://www.nltk.org/

NLTK Book: http://www.nltk.org/book/

Stanford NER Documentation: https://nlp.stanford.edu/software/CRF-NER.shtml

###### If you want to run this code, you have to download the Stanford NER component from the web page linked above.

#### Let's start.

In [1]:
import nltk
from nltk.tag import StanfordNERTagger

In [2]:
sentences = [
    'The Facebook website was launched on February 4, 2004, by Mark Zuckerberg, along with fellow Harvard College students and roommates, Eduardo Saverin, Andrew McCollum, Dustin Moskovitz, and Chris Hughes.',
    'Apple Inc. is an American multinational technology company headquartered in Cupertino, California.',
    'Microsoft was founded by Paul Allen and Bill Gates on April 4, 1975, to develop and sell BASIC interpreters for the Altair 8800.',
    'Saul Hudson better known by his stage name Slash, is a British-American musician and songwriter. He is best known as the lead guitarist of the American rock band Guns N\' Roses.',
    'Mario Draghi is an Italian economist who has served as President of the European Central Bank since November 2011.',
    'Italy is a unitary parliamentary republic in Europe. Italy shares open land borders with France, Switzerland, Austria, Slovenia, San Marino and Vatican City.',
    'Tesla is an American automaker, energy storage company, and solar panel manufacturer based in Palo Alto.',
    'Elon Reeve Musk is a South African-born Canadian-American business magnate, investor, engineer, and inventor.',
    'Frank Underwood is the main character of House of Cards',
]

In [3]:
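import os
# Placeholder: replace with the directory that contains java.exe
path = 'your_path_here'
# os.path.join adds the path separator if the directory does not end with one
java_path = os.path.join(path, "java.exe")
os.environ['JAVAHOME'] = java_path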

In [4]:
stanfordtagger = StanfordNERTagger(
    './stanford/classifiers/english.all.3class.distsim.crf.ser.gz',
    './stanford/stanford-ner.jar',
    encoding='latin1')

In [5]:
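def split2(x, by=2):
    """Group a flat list into consecutive tuples of length `by`."""
    return [tuple(x[i:i + by]) for i in range(0, len(x), by)]

def from_sentence_to_ne(x, method='nltk'):
    """Return a list of (token, entity tag) pairs for the sentence x."""
    token = nltk.word_tokenize(x)
    if method == 'nltk':
        tag = nltk.pos_tag(token)
        ne = nltk.ne_chunk(tag)
        # tree2conllstr yields one "word POS IOB-tag" line per token
        out = nltk.chunk.tree2conllstr(ne)

        # Flatten to [word, POS, tag, word, POS, tag, ...]
        out = out.split()
        # Drop the POS column (indices 1, 4, 7, ...), keeping word and tag
        out = [field for i, field in enumerate(out) if i % 3 != 1]
        out = split2(out)
    else:
        # Relies on the global stanfordtagger created in the previous cell
        out = stanfordtagger.tag(token)
    return out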
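With `method='nltk'`, the function returns one `(token, IOB tag)` pair per token, so a multi-word entity such as "Mark Zuckerberg" is spread over several pairs. As a minimal sketch (the helper name `group_iob` is my own and is not part of NLTK), the pairs could be merged into whole entities like this. It assumes tags of the form `B-LABEL`, `I-LABEL` and `O`, which is what the NLTK branch produces; the Stanford output has no `B-`/`I-` prefix and would need a different grouping rule.

```python
def group_iob(tagged):
    """Merge (token, IOB tag) pairs into (entity text, label) spans."""
    entities = []
    current_tokens, current_label = [], None
    for token, tag in tagged:
        if tag == 'O':
            prefix, label = 'O', None
        else:
            # 'B-PERSON' -> prefix 'B', label 'PERSON'
            prefix, _, label = tag.partition('-')
        # A new entity starts on B-, or on I- with a different label
        starts_new = prefix == 'B' or (prefix == 'I' and label != current_label)
        if current_tokens and (prefix == 'O' or starts_new):
            entities.append((' '.join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if prefix in ('B', 'I'):
            current_tokens.append(token)
            current_label = label
    # Flush an entity that runs to the end of the sentence
    if current_tokens:
        entities.append((' '.join(current_tokens), current_label))
    return entities


sample = [
    ('Facebook', 'B-ORGANIZATION'), ('website', 'O'),
    ('Mark', 'B-PERSON'), ('Zuckerberg', 'I-PERSON'),
    ('Harvard', 'B-ORGANIZATION'), ('College', 'I-ORGANIZATION'),
]
print(group_iob(sample))
# [('Facebook', 'ORGANIZATION'), ('Mark Zuckerberg', 'PERSON'), ('Harvard College', 'ORGANIZATION')]
```

In the notebook this could be called as `group_iob(from_sentence_to_ne(s, method='nltk'))`.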

In [6]:
%%time
method = 'nltk'
for s in sentences:
    app = from_sentence_to_ne(s, method=method)
    print(s)
    print('\n')
    for i in app:
        if i[1]!='O':
            print(i)
    print("\n\n")

The Facebook website was launched on February 4, 2004, by Mark Zuckerberg, along with fellow Harvard College students and roommates, Eduardo Saverin, Andrew McCollum, Dustin Moskovitz, and Chris Hughes.


('Facebook', 'B-ORGANIZATION')
('Mark', 'B-PERSON')
('Zuckerberg', 'I-PERSON')
('Harvard', 'B-ORGANIZATION')
('College', 'I-ORGANIZATION')
('Eduardo', 'B-PERSON')
('Saverin', 'I-PERSON')
('Andrew', 'B-PERSON')
('McCollum', 'I-PERSON')
('Dustin', 'B-PERSON')
('Moskovitz', 'I-PERSON')
('Chris', 'B-PERSON')
('Hughes', 'I-PERSON')



Apple Inc. is an American multinational technology company headquartered in Cupertino, California.


('Apple', 'B-PERSON')
('Inc.', 'B-ORGANIZATION')
('American', 'B-GPE')
('Cupertino', 'B-GPE')
('California', 'B-GPE')



Microsoft was founded by Paul Allen and Bill Gates on April 4, 1975, to develop and sell BASIC interpreters for the Altair 8800.


('Microsoft', 'B-PERSON')
('Paul', 'B-PERSON')
('Allen', 'I-PERSON')
('Bill', 'B-PERSON')
('Gates', 'I-PERSON')

In [7]:
%%time
method = 'stanford'
for s in sentences:
    app = from_sentence_to_ne(s, method=method)
    print(s)
    print('\n')
    for i in app:
        if i[1]!='O':
            print(i)
    print("\n\n")

The Facebook website was launched on February 4, 2004, by Mark Zuckerberg, along with fellow Harvard College students and roommates, Eduardo Saverin, Andrew McCollum, Dustin Moskovitz, and Chris Hughes.


('Facebook', 'ORGANIZATION')
('Mark', 'PERSON')
('Zuckerberg', 'PERSON')
('Harvard', 'ORGANIZATION')
('College', 'ORGANIZATION')
('Eduardo', 'PERSON')
('Saverin', 'PERSON')
('Andrew', 'PERSON')
('McCollum', 'PERSON')
('Dustin', 'PERSON')
('Moskovitz', 'PERSON')
('Chris', 'PERSON')
('Hughes', 'PERSON')



Apple Inc. is an American multinational technology company headquartered in Cupertino, California.


('Apple', 'ORGANIZATION')
('Inc.', 'ORGANIZATION')
('Cupertino', 'LOCATION')
('California', 'LOCATION')



Microsoft was founded by Paul Allen and Bill Gates on April 4, 1975, to develop and sell BASIC interpreters for the Altair 8800.


('Microsoft', 'ORGANIZATION')
('Paul', 'PERSON')
('Allen', 'PERSON')
('Bill', 'PERSON')
('Gates', 'PERSON')



Saul Hudson better known by his stage n

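StanfordNERTagger looks noticeably better than the NLTK tagger: for example, it labels Apple and Microsoft as organizations, while NLTK tags both as persons. Unfortunately, prediction with this model is expensive in terms of time.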
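To put a number on the time cost instead of reading it off the `%%time` output, a small timing helper can wrap any tagging function. This is only a sketch: `time_tagging` and `dummy_tagger` are hypothetical names, and the dummy tagger is a stand-in so the snippet runs without NLTK or Java.

```python
import time


def time_tagging(tag_fn, sentences):
    """Run tag_fn on every sentence; return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [tag_fn(s) for s in sentences]
    elapsed = time.perf_counter() - start
    return results, elapsed


# Stand-in tagger (assumption): labels every whitespace-separated token 'O'
def dummy_tagger(sentence):
    return [(tok, 'O') for tok in sentence.split()]


results, seconds = time_tagging(dummy_tagger, ['Apple Inc. is a company.'])
print(f'{len(results)} sentence(s) tagged in {seconds:.4f} s')
```

In the notebook, `tag_fn` could be `lambda s: from_sentence_to_ne(s, method='stanford')` or the same with `method='nltk'`, so the two elapsed times can be compared directly.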

Now I'm going to "perturb" the sentences by converting them to all lowercase and then to all uppercase.

In [8]:
lower_case = [s.lower() for s in sentences]

In [9]:
%%time
method = 'nltk'
for s in lower_case:
    app = from_sentence_to_ne(s, method=method)
    print(s)
    print('\n')
    for i in app:
        if i[1]!='O':
            print(i)
    print("\n\n")

the facebook website was launched on february 4, 2004, by mark zuckerberg, along with fellow harvard college students and roommates, eduardo saverin, andrew mccollum, dustin moskovitz, and chris hughes.





apple inc. is an american multinational technology company headquartered in cupertino, california.





microsoft was founded by paul allen and bill gates on april 4, 1975, to develop and sell basic interpreters for the altair 8800.





saul hudson better known by his stage name slash, is a british-american musician and songwriter. he is best known as the lead guitarist of the american rock band guns n' roses.





mario draghi is an italian economist who has served as president of the european central bank since november 2011.





italy is a unitary parliamentary republic in europe. italy shares open land borders with france, switzerland, austria, slovenia, san marino and vatican city.





tesla is an american automaker, energy storage company, and solar panel manufacturer base

In [10]:
%%time
method = 'stanford'
for s in lower_case:
    app = from_sentence_to_ne(s, method=method)
    print(s)
    print('\n')
    for i in app:
        if i[1]!='O':
            print(i)
    print("\n\n")

the facebook website was launched on february 4, 2004, by mark zuckerberg, along with fellow harvard college students and roommates, eduardo saverin, andrew mccollum, dustin moskovitz, and chris hughes.


('mark', 'PERSON')
('zuckerberg', 'PERSON')
('andrew', 'PERSON')
('mccollum', 'PERSON')
('dustin', 'PERSON')
('moskovitz', 'PERSON')



apple inc. is an american multinational technology company headquartered in cupertino, california.


('american', 'LOCATION')
('california', 'LOCATION')



microsoft was founded by paul allen and bill gates on april 4, 1975, to develop and sell basic interpreters for the altair 8800.


('microsoft', 'ORGANIZATION')
('paul', 'PERSON')
('allen', 'PERSON')



saul hudson better known by his stage name slash, is a british-american musician and songwriter. he is best known as the lead guitarist of the american rock band guns n' roses.


('american', 'LOCATION')



mario draghi is an italian economist who has served as president of the european central bank

In [11]:
upper_case = [s.upper() for s in sentences]

In [12]:
%%time
method = 'stanford'
for s in upper_case:
    app = from_sentence_to_ne(s, method=method)
    print(s)
    print('\n')
    for i in app:
        if i[1]!='O':
            print(i)
    print("\n\n")

THE FACEBOOK WEBSITE WAS LAUNCHED ON FEBRUARY 4, 2004, BY MARK ZUCKERBERG, ALONG WITH FELLOW HARVARD COLLEGE STUDENTS AND ROOMMATES, EDUARDO SAVERIN, ANDREW MCCOLLUM, DUSTIN MOSKOVITZ, AND CHRIS HUGHES.


('MARK', 'PERSON')
('ZUCKERBERG', 'PERSON')
('EDUARDO', 'PERSON')
('SAVERIN', 'PERSON')
('ANDREW', 'PERSON')
('MCCOLLUM', 'PERSON')
('DUSTIN', 'PERSON')
('MOSKOVITZ', 'PERSON')
('CHRIS', 'PERSON')
('HUGHES', 'PERSON')



APPLE INC. IS AN AMERICAN MULTINATIONAL TECHNOLOGY COMPANY HEADQUARTERED IN CUPERTINO, CALIFORNIA.


('APPLE', 'ORGANIZATION')
('INC.', 'ORGANIZATION')
('CUPERTINO', 'LOCATION')
('CALIFORNIA', 'LOCATION')



MICROSOFT WAS FOUNDED BY PAUL ALLEN AND BILL GATES ON APRIL 4, 1975, TO DEVELOP AND SELL BASIC INTERPRETERS FOR THE ALTAIR 8800.


('PAUL', 'PERSON')
('ALLEN', 'PERSON')



SAUL HUDSON BETTER KNOWN BY HIS STAGE NAME SLASH, IS A BRITISH-AMERICAN MUSICIAN AND SONGWRITER. HE IS BEST KNOWN AS THE LEAD GUITARIST OF THE AMERICAN ROCK BAND GUNS N' ROSES.


('SAUL', 'PERS

In [13]:
%%time
method = 'nltk'
for s in upper_case:
    app = from_sentence_to_ne(s, method=method)
    print(s)
    print('\n')
    for i in app:
        if i[1]!='O':
            print(i)
    print("\n\n")

THE FACEBOOK WEBSITE WAS LAUNCHED ON FEBRUARY 4, 2004, BY MARK ZUCKERBERG, ALONG WITH FELLOW HARVARD COLLEGE STUDENTS AND ROOMMATES, EDUARDO SAVERIN, ANDREW MCCOLLUM, DUSTIN MOSKOVITZ, AND CHRIS HUGHES.


('FACEBOOK', 'B-ORGANIZATION')
('BY', 'B-ORGANIZATION')
('MARK', 'B-ORGANIZATION')
('ALONG', 'B-ORGANIZATION')
('FELLOW', 'B-ORGANIZATION')
('HARVARD', 'B-ORGANIZATION')
('STUDENTS', 'B-ORGANIZATION')
('EDUARDO', 'B-ORGANIZATION')
('ANDREW', 'B-ORGANIZATION')
('DUSTIN', 'B-ORGANIZATION')
('AND', 'B-ORGANIZATION')



APPLE INC. IS AN AMERICAN MULTINATIONAL TECHNOLOGY COMPANY HEADQUARTERED IN CUPERTINO, CALIFORNIA.


('APPLE', 'B-ORGANIZATION')
('AMERICAN', 'B-ORGANIZATION')
('TECHNOLOGY', 'B-ORGANIZATION')
('CALIFORNIA', 'B-GPE')



MICROSOFT WAS FOUNDED BY PAUL ALLEN AND BILL GATES ON APRIL 4, 1975, TO DEVELOP AND SELL BASIC INTERPRETERS FOR THE ALTAIR 8800.


('MICROSOFT', 'B-ORGANIZATION')
('WAS', 'B-ORGANIZATION')
('PAUL', 'B-ORGANIZATION')
('BILL', 'B-PERSON')
('GATES', 'I-PERSON')

### Conclusion

StanfordNERTagger looks more accurate and also more robust to case changes.

However, both taggers perform poorly on non-standard input such as all-lowercase or all-uppercase text.
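The robustness observation can be made a little more concrete by counting how many of the originally tagged tokens keep their label after the case change. The sketch below uses the StanfordNERTagger outputs printed earlier for the Facebook sentence; the function name `case_retention` is hypothetical, and the measure is a rough one, since it compares sets of tokens and so ignores repeated tokens and any newly introduced false positives.

```python
def case_retention(original, perturbed):
    """Fraction of originally tagged tokens still tagged with the same label."""
    def entity_set(tagged):
        # Compare tokens case-insensitively; skip non-entities
        return {(tok.lower(), label) for tok, label in tagged if label != 'O'}

    base = entity_set(original)
    if not base:
        return 0.0
    return len(base & entity_set(perturbed)) / len(base)


# Stanford output for the Facebook sentence, copied from the cells above
original = [
    ('Facebook', 'ORGANIZATION'), ('Mark', 'PERSON'), ('Zuckerberg', 'PERSON'),
    ('Harvard', 'ORGANIZATION'), ('College', 'ORGANIZATION'),
    ('Eduardo', 'PERSON'), ('Saverin', 'PERSON'), ('Andrew', 'PERSON'),
    ('McCollum', 'PERSON'), ('Dustin', 'PERSON'), ('Moskovitz', 'PERSON'),
    ('Chris', 'PERSON'), ('Hughes', 'PERSON'),
]
lower = [
    ('mark', 'PERSON'), ('zuckerberg', 'PERSON'), ('andrew', 'PERSON'),
    ('mccollum', 'PERSON'), ('dustin', 'PERSON'), ('moskovitz', 'PERSON'),
]
upper = [
    ('MARK', 'PERSON'), ('ZUCKERBERG', 'PERSON'), ('EDUARDO', 'PERSON'),
    ('SAVERIN', 'PERSON'), ('ANDREW', 'PERSON'), ('MCCOLLUM', 'PERSON'),
    ('DUSTIN', 'PERSON'), ('MOSKOVITZ', 'PERSON'), ('CHRIS', 'PERSON'),
    ('HUGHES', 'PERSON'),
]
print(f'lowercase: {case_retention(original, lower):.2f}')  # lowercase: 0.46
print(f'uppercase: {case_retention(original, upper):.2f}')  # uppercase: 0.77
```

For this one sentence, StanfordNERTagger keeps 6 of 13 tagged tokens in lowercase and 10 of 13 in uppercase.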

My questions are:

    1) Is it possible to train a NER model that is insensitive to lowercase/uppercase?

    2) If so, how much data would it require?
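One idea that may be worth trying for the first question, offered only as an untested sketch and not something done in this notebook, is to augment the training data with lowercased and uppercased copies of each labeled sentence, keeping the tags unchanged, so that a model trained on it cannot rely on capitalization alone. The function name `augment_case` and the `(token, tag)` training format are assumptions; how much data this needs is still the open second question.

```python
def augment_case(tagged_sentences):
    """Return each sentence plus lower- and upper-cased copies, tags unchanged."""
    augmented = []
    for sent in tagged_sentences:
        augmented.append(list(sent))
        augmented.append([(tok.lower(), tag) for tok, tag in sent])
        augmented.append([(tok.upper(), tag) for tok, tag in sent])
    return augmented


# Tiny hand-labeled example in the assumed (token, tag) format
train = [[('Bill', 'PERSON'), ('Gates', 'PERSON'), ('founded', 'O'),
          ('Microsoft', 'ORGANIZATION')]]
for sent in augment_case(train):
    print(sent)
```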

###### I hope you appreciate these examples. See you around,
###### Federico