# Activity 2: Named Entity Recognition

**Named-entity recognition (NER)** (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Most research on NER systems has been structured as taking an unannotated block of text, such as this one:

> Jim bought 300 shares of Acme Corp. in 2006.

And producing an annotated block of text that highlights the names of entities:

> [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.

In this example, a person name consisting of one token, a two-token company name and a temporal expression have been detected and classified.
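
To make the annotation format concrete, here is a minimal sketch that pulls the `(text, label)` pairs out of the bracketed example. The bracket notation is only the one used in this illustration, and the helper name `extract_annotations` is hypothetical.

```python
import re

# Matches "[entity text]Label", the bracket notation used in the example sentence.
ANNOTATION = re.compile(r"\[([^\]]+)\](\w+)")


def extract_annotations(annotated_text):
    """Return a list of (entity_text, label) pairs found in annotated_text."""
    return ANNOTATION.findall(annotated_text)


annotated = "[Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time."
pairs = extract_annotations(annotated)
print(pairs)
```

For the example sentence this yields three pairs: Jim as a Person, Acme Corp. as an Organization and 2006 as a Time.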

State-of-the-art NER systems for English produce near-human performance. For example, the best system entering MUC-7 scored an F-measure of 93.39%, while human annotators scored 97.60% and 96.95%.
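
As a reminder of what the F-measure summarizes, the sketch below computes precision, recall and F1 over sets of `(text, label)` pairs. It assumes plain exact-match scoring, which may differ in its details from the official MUC scoring, and the gold/predicted sets are made up for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Compute exact-match precision, recall and F1 over two collections of entities."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / float(len(predicted)) if predicted else 0.0
    recall = true_positives / float(len(gold)) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: the system truncates "Acme Corp." to "Acme".
gold = {("Jim", "Person"), ("Acme Corp.", "Organization"), ("2006", "Time")}
predicted = {("Jim", "Person"), ("Acme", "Organization"), ("2006", "Time")}
precision, recall, f1 = precision_recall_f1(gold, predicted)
```

Here two of the three predictions match the gold annotations exactly, so precision, recall and F1 all come out at 2/3.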

(https://en.wikipedia.org/wiki/Named-entity_recognition)

This task aims to extract the names of organizations/brands from corpus.txt. There are two basic approaches:

* Grammar based
* Machine learning with a corpus of sequentially labeled examples

The result should show the candidates for organization/brands retrieved from corpus.txt.
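
The machine-learning approach needs sequentially labeled examples, meaning every token carries its own label. One common encoding is IOB (`B-` begins an entity, `I-` continues it, `O` is outside any entity). The sketch below shows a hand-labeled toy sentence and a hypothetical helper, `iob_to_entities`, that groups the labels back into entities. The labels are an assumption for illustration and are not taken from corpus.txt.

```python
# Hand-labeled toy example (an assumption for illustration).
tagged = [
    ("Jim", "B-PER"), ("bought", "O"), ("300", "O"), ("shares", "O"),
    ("of", "O"), ("Acme", "B-ORG"), ("Corp.", "I-ORG"), ("in", "O"),
    ("2006", "O"), (".", "O"),
]


def iob_to_entities(tagged_tokens):
    """Group IOB-labeled tokens into (entity_text, entity_type) pairs."""
    entities, current, current_type = [], [], None
    for token, label in tagged_tokens:
        prefix, _, entity_type = label.partition("-")
        continues = bool(prefix == "I" and current and entity_type == current_type)
        if not continues and current:
            # The running entity ends before this token.
            entities.append((" ".join(current), current_type))
            current, current_type = [], None
        if prefix == "B" or (prefix == "I" and not continues):
            # Start a new entity (a stray "I-" label is treated like "B-").
            current, current_type = [token], entity_type
        elif continues:
            current.append(token)
    if current:
        entities.append((" ".join(current), current_type))
    return entities


org_names = [text for text, kind in iob_to_entities(tagged) if kind == "ORG"]
```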

In [2]:
from __future__ import unicode_literals
import codecs

# For a huge corpus this could be read iteratively for better performance
with codecs.open('corpus.txt', encoding='utf8') as fp:
    corpus = fp.read()

# The corpus has 55 million characters; use only the first 10k to speed things up
corpus = corpus[:10000]
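
The comment about reading iteratively can be sketched as follows. This is one possible approach, not what the notebook actually ran: `read_head` and `read_in_chunks` are hypothetical helper names, and `io.open` is used because its text mode counts characters rather than bytes.

```python
import io


def read_head(path, n_chars=10000, encoding="utf8"):
    """Return only the first n_chars characters without loading the whole file."""
    with io.open(path, encoding=encoding) as fp:
        return fp.read(n_chars)


def read_in_chunks(path, chunk_size=10000, encoding="utf8"):
    """Yield successive pieces of at most chunk_size characters from a text file."""
    with io.open(path, encoding=encoding) as fp:
        while True:
            chunk = fp.read(chunk_size)  # in text mode, read() counts characters
            if not chunk:
                break
            yield chunk
```

With these helpers, `corpus = read_head('corpus.txt')` would give the same 10k-character prefix without holding all 55 million characters in memory. Note that fixed-size chunks can cut a sentence in half, so a tagger applied chunk by chunk would need some care at the boundaries.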

---
## 1st solution: the easiest one, let a toolkit do it for us

In [3]:
import polyglot
polyglot.data_path = '/usr/share/'

In [4]:
from polyglot.text import Text
text = Text("A presidenta do Brasil é Dilma Roussef.")

In [5]:
text.entities

[I-ORG([u'Brasil']), I-PER([u'Dilma', u'Roussef'])]

In [6]:
from polyglot.text import Text
text = Text(corpus[:2000])
# TODO: check with more than 2k characters. Polyglot has a bug with unicode text

In [7]:
text.entities

[I-PER([u'Kit']),
 I-PER([u'Aro']),
 I-ORG([u'Pirelli']),
 I-ORG([u'Pirelli']),
 I-ORG([u'Pirelli']),
 I-PER([u'Safra']),
 I-LOC([u'Pa\xeds']),
 I-LOC([u'Fran\xe7a']),
 I-LOC([u'Vanilia'])]
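
Since the task only asks for organizations/brands, the output can be narrowed to the `I-ORG` entries. The sketch below uses plain `(tag, tokens)` tuples copied from the output above as a stand-in, so it runs without polyglot. With a real polyglot `Text`, the label should be available as `entity.tag`, but treat that as an assumption to verify against the installed version.

```python
from collections import Counter

# Stand-in for text.entities: (tag, tokens) pairs copied from the output above.
entities = [
    ("I-PER", ["Kit"]), ("I-PER", ["Aro"]),
    ("I-ORG", ["Pirelli"]), ("I-ORG", ["Pirelli"]), ("I-ORG", ["Pirelli"]),
    ("I-PER", ["Safra"]), ("I-LOC", ["Pa\xeds"]), ("I-LOC", ["Fran\xe7a"]),
    ("I-LOC", ["Vanilia"]),
]

# Keep only organizations and count how often each candidate appears.
org_counts = Counter(" ".join(tokens) for tag, tokens in entities if tag == "I-ORG")
print(org_counts.most_common())
```

On this sample the only organization candidate is Pirelli, seen three times. The output also shows the toolkit's misses: "Kit" and "Aro" are tagged as persons, and "Vanilia" as a location.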

## 2nd solution: using a grammar

In [8]:
# First we need to POS-tag the corpus, since the grammar works on the tags.
import nlpnet
nlpnet.set_data_dir(str('/usr/share/nlpnet_data/'))
tagger = nlpnet.POSTagger()
sentences = tagger.tag(corpus)

In [9]:
sentences[0]

[(u'Kit', u'N'),
 (u'com', u'PREP'),
 (u'4', u'NUM'),
 (u'Pneus', u'NPROP'),
 (u'de', u'NPROP'),
 (u'Alta', u'NPROP'),
 (u'Performance', u'NPROP'),
 (u'Pirelli', u'NPROP'),
 (u'Aro', u'NPROP'),
 (u'16', u'NPROP'),
 (u'205/55R16', u'NPROP'),
 (u'Phantom', u'NPROP'),
 (u'Chegou', u'V'),
 (u'o', u'ART'),
 (u'kit', u'N'),
 (u'que', u'PRO-KS'),
 (u'junta', u'V'),
 (u'resist\xeancia', u'N'),
 (u'e', u'KC'),
 (u'conforto', u'N'),
 (u',', u'PU'),
 (u'al\xe9m', u'PREP'),
 (u'de', u'PREP'),
 (u'n\xedveis', u'N'),
 (u'm\xe1ximos', u'ADJ'),
 (u'de', u'PREP'),
 (u'seguran\xe7a', u'N'),
 (u'.', u'PU')]

In [10]:
import nltk
grammar = "NE: {<NPROP>+}"
cp = nltk.RegexpParser(grammar)

In [11]:
sentence = sentences[0]

In [13]:
print(cp.parse(sentence))

(S
  Kit/N
  com/PREP
  4/NUM
  (NE
    Pneus/NPROP
    de/NPROP
    Alta/NPROP
    Performance/NPROP
    Pirelli/NPROP
    Aro/NPROP
    16/NPROP
    205/55R16/NPROP
    Phantom/NPROP)
  Chegou/V
  o/ART
  kit/N
  que/PRO-KS
  junta/V
  resistencia/N
  e/KC
  conforto/N
  ,/PU
  alem/PREP
  de/PREP
  niveis/N
  maximos/ADJ
  de/PREP
  seguranca/N
  ./PU)
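
The grammar `NE: {<NPROP>+}` groups each maximal run of consecutive `NPROP` tokens into one chunk. The sketch below reproduces roughly the same idea with only the standard library, which makes the behaviour easier to see. `chunk_proper_nouns` is a hypothetical helper, not part of NLTK.

```python
from itertools import groupby


def chunk_proper_nouns(tagged_sentence, target_tag="NPROP"):
    """Join each maximal run of consecutive target_tag tokens into one candidate."""
    candidates = []
    runs = groupby(tagged_sentence, key=lambda pair: pair[1] == target_tag)
    for is_target, group in runs:
        if is_target:
            candidates.append(" ".join(word for word, _ in group))
    return candidates


# First tokens of sentences[0], copied from the tagger output above.
sample = [
    ("Kit", "N"), ("com", "PREP"), ("4", "NUM"), ("Pneus", "NPROP"),
    ("de", "NPROP"), ("Alta", "NPROP"), ("Performance", "NPROP"),
    ("Pirelli", "NPROP"), ("Aro", "NPROP"), ("16", "NPROP"),
    ("205/55R16", "NPROP"), ("Phantom", "NPROP"), ("Chegou", "V"),
    ("o", "ART"), ("kit", "N"),
]
candidates = chunk_proper_nouns(sample)
print(candidates)
```

Because the tagger labeled every token of the product title as `NPROP`, the whole title collapses into a single candidate. This is why the grammar approach depends heavily on the quality of the POS tags.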


In [14]:
from pprint import pprint

# Collect the text of every NE chunk found in the tagged sentences
entities = set()
for tree in cp.parse_sents(sentences):
    for subtree in tree.subtrees():
        if subtree.label() == 'NE':
            entity = ' '.join(word for word, tag in subtree.leaves())
            entities.add(entity)

pprint(entities)

set([u'-Captura',
     u'-Compatibilidade',
     u'-Entrada',
     u'-Entrada auxiliar Plug',
     u'-Grava',
     u'-Sistema 5.1 Canais Prologic',
     u'-Tipo',
     u'2 Vodkas Sueca Absolut Vanilia',
     u'ANVISA',
     u"Assassin 's Creed",
     u'Assassinos',
     u'BCAA',
     u'BCAA 2400 - 200 C\xe1psulas',
     u'BCAAs',
     u'Baunilha',
     u'Brasil',
     u'CDR-W',
     u'Chandon Brut Ros\xe9',
     u'Colorir',
     u'Divirta-se',
     u'Evie Frye',
     u'Fabricante/Fornecedor',
     u'Floresta Encantada',
     u'Fran\xe7a -Embalagem',
     u'Henry Green',
     u'Home Theater Com DVD Lenoxx',
     u'Home Theater DVD Player Lenoxx HT723 270W 5.1 Canais USB Fun\xe7\xe3o Karaok\xea Obtenha',
     u'Home Theater Lenoxx',
     u'Imagens Meramente Ilustrativas BCAA 2400 - 100 C\xe1psulas',
     u'Inglaterra Vitoriana',
     u'Isoleucina',
     u'Jacob',
     u'Jardim Secreto',
     u'Jardim Secreto + Floresta Encantada + Reino Animal Entretenimento',
     u'Johanna Basford',
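
The candidate set is noisy: it contains bullet-like leading dashes, whole product titles and model numbers. A possible clean-up pass is sketched below. The heuristics and the `max_tokens` threshold are assumptions chosen for illustration, not something tuned on corpus.txt, and `clean_candidates` is a hypothetical helper.

```python
import re


def clean_candidates(candidates, max_tokens=4):
    """Apply simple, hypothetical heuristics to prune noisy NE candidates."""
    cleaned = set()
    for candidate in candidates:
        candidate = candidate.lstrip("-").strip()  # drop bullet-like leading dashes
        if not candidate:
            continue
        if len(candidate.split()) > max_tokens:  # long runs tend to be product titles
            continue
        if re.search(r"\d", candidate):  # model numbers, quantities, sizes
            continue
        cleaned.add(candidate)
    return cleaned


# A few candidates copied from the output above.
raw = {
    "-Captura", "ANVISA", "Home Theater Lenoxx",
    "2 Vodkas Sueca Absolut Vanilia", "BCAA 2400 - 200 C\xe1psulas",
    "Johanna Basford",
}
kept = sorted(clean_candidates(raw))
print(kept)
```

This only removes formatting noise. It cannot tell an organization from a person (Johanna Basford survives), so a further step such as a brand gazetteer or a trained classifier would still be needed.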
  

## 3rd solution: using CRF

There is no corpus available for Portuguese NER training in the Python toolkits. However, the CoNLL corpus is available for English.

For more information, please follow the tutorial at (http://nbviewer.ipython.org/github/tpeng/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb)

For Portuguese, a dataset is available for download at (http://www.linguateca.pt/aval_conjunta/HAREM/harem_ing.html).
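
A CRF tagger is trained on one feature dict per token, paired with one label per token. The sketch below shows what such a feature function could look like for sentences in the `(word, POS tag)` format produced earlier. It is loosely modeled on the kind of features used in the linked tutorial, and the specific feature names are assumptions, not a fixed API.

```python
def word2features(sentence, i):
    """Build a feature dict for token i of a [(word, pos_tag), ...] sentence."""
    word, postag = sentence[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.suffix3": word[-3:],
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "postag": postag,
    }
    if i > 0:
        prev_word, prev_tag = sentence[i - 1]
        features["-1:word.lower"] = prev_word.lower()
        features["-1:postag"] = prev_tag
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:
        next_word, next_tag = sentence[i + 1]
        features["+1:word.lower"] = next_word.lower()
        features["+1:postag"] = next_tag
    else:
        features["EOS"] = True  # end of sentence
    return features


def sent2features(sentence):
    """Return the list of feature dicts for a whole sentence."""
    return [word2features(sentence, i) for i in range(len(sentence))]


sample = [("Pneus", "NPROP"), ("Pirelli", "NPROP"), ("Chegou", "V")]
features = sent2features(sample)
```

Feature sequences like these, together with a label sequence per sentence (for example from the HAREM data once converted to a token-per-label format), are what a CRF trainer such as python-crfsuite would consume.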