# LAB3 Named-Entity Recognition and Classification

This notebook dives deeper into the task of Named Entity Recognition and Classification (NERC). We will revisit the NERC modules in NLTK and spaCy, and also look at the data that is available to train such modules and the formats of this data. Machine learning approaches to NERC typically use many different features, so it is important to understand how these features are represented in the data and used for training.

We will again use the Lab1-apple-samsung text to revisit these modules.

In [4]:
from pprint import pprint

In [5]:
# Read the example text; the with-block closes the file handle automatically.
with open('/Users/piek/Desktop/TextMiningFEW-2019/text-mining-ba-git/notebooks/Lab1-apple-samsung-example.txt', 'r') as f:
    example_text = f.read()
pprint(example_text)

('https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html\n'
 '\n'
 'Documents filed to the San Jose federal court in California on November 23 '
 'list six Samsung products running the "Jelly Bean" and "Ice Cream Sandwich" '
 'operating systems, which Apple claims infringe its patents.\n'
 'The six phones and tablets affected are the Galaxy S III, running the new '
 'Jelly Bean system, the Galaxy Tab 8.9 Wifi tablet, the Galaxy Tab 2 10.1, '
 'Galaxy Rugby Pro and Galaxy S III mini.\n'
 'Apple stated it had “acted quickly and diligently" in order to "determine '
 'that these newly released products do infringe many of the same claims '
 'already asserted by Apple."\n'
 'In August, Samsung lost a US patent case to Apple and was ordered to pay its '
 'rival $1.05bn (£0.66bn) in damages for copying features of the iPad and '
 "iPhone in its Galaxy range of devices. Samsung, which is the world's top "
 'mobile phone maker, is appeal

## NERC by NLTK

NLTK's NERC module requires part-of-speech tagging as a preprocessing step.

In [23]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

def preprocess(sent):
    # Tokenize the sentence, then assign a part-of-speech tag to each token.
    tokens = word_tokenize(sent)
    return pos_tag(tokens)

sents = nltk.sent_tokenize(example_text)
print("The first sentence from the document:", sents[0])
tagged_text = preprocess(sents[0])
print('##########')
print('The first sentence preprocessed by NLTK with POS tags')
print('##########')
print(tagged_text)

The first sentence from the document: https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html

Documents filed to the San Jose federal court in California on November 23 list six Samsung products running the "Jelly Bean" and "Ice Cream Sandwich" operating systems, which Apple claims infringe its patents.
##########
The first sentence preprocessed by NLTK with POS tags
##########
[('https', 'NN'), (':', ':'), ('//www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html', 'JJ'), ('Documents', 'NNS'), ('filed', 'VBN'), ('to', 'TO'), ('the', 'DT'), ('San', 'NNP'), ('Jose', 'NNP'), ('federal', 'JJ'), ('court', 'NN'), ('in', 'IN'), ('California', 'NNP'), ('on', 'IN'), ('November', 'NNP'), ('23', 'CD'), ('list', 'NN'), ('six', 'CD'), ('Samsung', 'NNP'), ('products', 'NNS'), ('running', 'VBG'), ('the', 'DT'), ('``', '``'), ('Jelly', 'RB'), ('Bean', 'NNP'), ("''", "''"), ('and', 'CC'), ('``'

In [24]:
ne_tree = nltk.ne_chunk(tagged_text)
print(ne_tree)

(S
  https/NN
  :/:
  //www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html/JJ
  Documents/NNS
  filed/VBN
  to/TO
  the/DT
  (ORGANIZATION San/NNP Jose/NNP)
  federal/JJ
  court/NN
  in/IN
  (GPE California/NNP)
  on/IN
  November/NNP
  23/CD
  list/NN
  six/CD
  (ORGANIZATION Samsung/NNP)
  products/NNS
  running/VBG
  the/DT
  ``/``
  Jelly/RB
  (GPE Bean/NNP)
  ''/''
  and/CC
  ``/``
  Ice/NNP
  Cream/NNP
  Sandwich/NNP
  ''/''
  operating/VBG
  systems/NNS
  ,/,
  which/WDT
  (PERSON Apple/NNP)
  claims/VBZ
  infringe/VB
  its/PRP$
  patents/NNS
  ./.)


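The above representation is a flat tree structure that lists all tagged tokens and wraps each named entity in a subtree labelled with its entity type. NLP modules are often trained and tested on data in the CoNLL format, which commonly marks entity boundaries with IOB tags (B = begin, I = inside, O = outside). NLTK can convert the above tree structure to this IOB representation.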

In [27]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint
iob_tagged_chunks = tree2conlltags(ne_tree)
pprint(iob_tagged_chunks)

[('https', 'NN', 'O'),
 (':', ':', 'O'),
 ('//www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html',
  'JJ',
  'O'),
 ('Documents', 'NNS', 'O'),
 ('filed', 'VBN', 'O'),
 ('to', 'TO', 'O'),
 ('the', 'DT', 'O'),
 ('San', 'NNP', 'B-ORGANIZATION'),
 ('Jose', 'NNP', 'I-ORGANIZATION'),
 ('federal', 'JJ', 'O'),
 ('court', 'NN', 'O'),
 ('in', 'IN', 'O'),
 ('California', 'NNP', 'B-GPE'),
 ('on', 'IN', 'O'),
 ('November', 'NNP', 'O'),
 ('23', 'CD', 'O'),
 ('list', 'NN', 'O'),
 ('six', 'CD', 'O'),
 ('Samsung', 'NNP', 'B-ORGANIZATION'),
 ('products', 'NNS', 'O'),
 ('running', 'VBG', 'O'),
 ('the', 'DT', 'O'),
 ('``', '``', 'O'),
 ('Jelly', 'RB', 'O'),
 ('Bean', 'NNP', 'B-GPE'),
 ("''", "''", 'O'),
 ('and', 'CC', 'O'),
 ('``', '``', 'O'),
 ('Ice', 'NNP', 'O'),
 ('Cream', 'NNP', 'O'),
 ('Sandwich', 'NNP', 'O'),
 ("''", "''", 'O'),
 ('operating', 'VBG', 'O'),
 ('systems', 'NNS', 'O'),
 (',', ',', 'O'),
 ('which', 'WDT', 'O'),
 ('Apple', 'NNP', 'B-PE
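As a rough illustration of what the IOB triples encode, the sketch below writes them in a CoNLL-style column layout (one token per line, tab-separated) and groups the B-/I- tags back into entity spans. The function names `to_conll_lines` and `iob_to_spans` and the tab-separated layout are assumptions made for this example and are not part of NLTK; the sample triples are copied from the output above.

```python
def to_conll_lines(triples):
    """Format (token, pos, iob) triples as tab-separated lines, one token per line."""
    return ["\t".join(triple) for triple in triples]


def iob_to_spans(triples):
    """Group B-/I- tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, _pos, tag in triples:
        prefix, _, label = tag.partition("-")
        # An I- tag continues the open entity only if the type matches.
        if prefix == "I" and current_tokens and label == current_type:
            current_tokens.append(token)
            continue
        # Any other tag closes the entity that was open, if there was one.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        # B- starts a new entity (a stray I- is treated as a start as well).
        if prefix in ("B", "I"):
            current_tokens, current_type = [token], label
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hypothetical sample: a few rows taken from the IOB output above.
sample = [
    ('the', 'DT', 'O'),
    ('San', 'NNP', 'B-ORGANIZATION'),
    ('Jose', 'NNP', 'I-ORGANIZATION'),
    ('federal', 'JJ', 'O'),
    ('court', 'NN', 'O'),
    ('in', 'IN', 'O'),
    ('California', 'NNP', 'B-GPE'),
]

print("\n".join(to_conll_lines(sample)))
print(iob_to_spans(sample))
```

Grouping the tags back into spans makes tagging errors easy to spot: here "San Jose" comes out as a single ORGANIZATION span, exactly as NLTK labelled it.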

## NERC by spaCy

spaCy (https://spacy.io) can be installed with conda: `conda install -c conda-forge spacy`

Get the English NLP package (https://spacy.io/models/en): `python -m spacy download en_core_web_sm`

In [28]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

In [29]:
doc = nlp(example_text)
sentences = [x for x in doc.sents]
firstsentence=sentences[0]

In [30]:
pprint([(X.text, X.label_) for X in firstsentence.ents])

[('San Jose', 'GPE'),
 ('California', 'GPE'),
 ('November 23', 'DATE'),
 ('six', 'CARDINAL'),
 ('Samsung', 'ORG'),
 ('Jelly Bean', 'PERSON'),
 ('Ice Cream Sandwich', 'WORK_OF_ART'),
 ('Apple', 'ORG'),
 ('\n', 'GPE')]


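spaCy describes entity boundaries with the BILUO tagging scheme: Begin, In, Last (final token of an entity), Unit (single-token entity), and Out (non-entity token). The `ent_iob_` attribute used below exposes the simpler IOB variant, so tokens are marked only as B, I, or O.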

In [31]:
pprint([(X, X.ent_iob_, X.ent_type_) for X in firstsentence])

[(https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html,
  'O',
  ''),
 (

, 'O', ''),
 (Documents, 'O', ''),
 (filed, 'O', ''),
 (to, 'O', ''),
 (the, 'O', ''),
 (San, 'B', 'GPE'),
 (Jose, 'I', 'GPE'),
 (federal, 'O', ''),
 (court, 'O', ''),
 (in, 'O', ''),
 (California, 'B', 'GPE'),
 (on, 'O', ''),
 (November, 'B', 'DATE'),
 (23, 'I', 'DATE'),
 (list, 'O', ''),
 (six, 'B', 'CARDINAL'),
 (Samsung, 'B', 'ORG'),
 (products, 'O', ''),
 (running, 'O', ''),
 (the, 'O', ''),
 (", 'O', ''),
 (Jelly, 'B', 'PERSON'),
 (Bean, 'I', 'PERSON'),
 (", 'O', ''),
 (and, 'O', ''),
 (", 'O', ''),
 (Ice, 'B', 'WORK_OF_ART'),
 (Cream, 'I', 'WORK_OF_ART'),
 (Sandwich, 'I', 'WORK_OF_ART'),
 (", 'O', ''),
 (operating, 'O', ''),
 (systems, 'O', ''),
 (,, 'O', ''),
 (which, 'O', ''),
 (Apple, 'B', 'ORG'),
 (claims, 'O', ''),
 (infringe, 'O', ''),
 (its, 'O', ''),
 (patents, 'O', ''),
 (., 'O', ''),
 (
, 'B', 'GPE')]
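The `ent_iob_` values printed above are plain IOB tags. To make the relation with BILUO concrete, the sketch below converts a sequence of (IOB tag, entity type) pairs into BILUO tags. The helper `iob_to_biluo` is a hypothetical name introduced for this example, and the input pairs are a selection of rows from the output above; this is one possible way to do the conversion, not spaCy's own implementation.

```python
def iob_to_biluo(tags):
    """Convert (iob, ent_type) pairs, e.g. ('B', 'GPE'), into BILUO tag strings."""
    biluo = []
    for i, (iob, ent_type) in enumerate(tags):
        # Anything that is not B or I (including an empty tag) counts as Out.
        if iob not in ("B", "I"):
            biluo.append("O")
            continue
        # The entity continues only if the next token is an I of the same type.
        nxt = tags[i + 1] if i + 1 < len(tags) else ("O", "")
        continues = nxt[0] == "I" and nxt[1] == ent_type
        if iob == "B":
            prefix = "B" if continues else "U"  # single-token entity -> Unit
        else:
            prefix = "I" if continues else "L"  # final token of entity -> Last
        biluo.append(f"{prefix}-{ent_type}")
    return biluo


# Assumed sample: tags for "the San Jose federal California on Ice Cream Sandwich".
pairs = [
    ("O", ""), ("B", "GPE"), ("I", "GPE"), ("O", ""),
    ("B", "GPE"), ("O", ""),
    ("B", "WORK_OF_ART"), ("I", "WORK_OF_ART"), ("I", "WORK_OF_ART"),
]
print(iob_to_biluo(pairs))
```

The conversion only needs one token of look-ahead, which is why the two schemes carry the same information: BILUO simply makes the end of each entity explicit.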


In [32]:
labels = [x.label_ for x in firstsentence.ents]
Counter(labels)

Counter({'GPE': 3,
         'DATE': 1,
         'CARDINAL': 1,
         'ORG': 2,
         'PERSON': 1,
         'WORK_OF_ART': 1})

In [33]:
items = [x.text for x in firstsentence.ents]
Counter(items).most_common(10)

[('San Jose', 1),
 ('California', 1),
 ('November 23', 1),
 ('six', 1),
 ('Samsung', 1),
 ('Jelly Bean', 1),
 ('Ice Cream Sandwich', 1),
 ('Apple', 1),
 ('\n', 1)]

In [35]:
print([(x, x.ent_iob_, x.ent_type_) for x in firstsentence])

[(https://www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html, 'O', ''), (

, 'O', ''), (Documents, 'O', ''), (filed, 'O', ''), (to, 'O', ''), (the, 'O', ''), (San, 'B', 'GPE'), (Jose, 'I', 'GPE'), (federal, 'O', ''), (court, 'O', ''), (in, 'O', ''), (California, 'B', 'GPE'), (on, 'O', ''), (November, 'B', 'DATE'), (23, 'I', 'DATE'), (list, 'O', ''), (six, 'B', 'CARDINAL'), (Samsung, 'B', 'ORG'), (products, 'O', ''), (running, 'O', ''), (the, 'O', ''), (", 'O', ''), (Jelly, 'B', 'PERSON'), (Bean, 'I', 'PERSON'), (", 'O', ''), (and, 'O', ''), (", 'O', ''), (Ice, 'B', 'WORK_OF_ART'), (Cream, 'I', 'WORK_OF_ART'), (Sandwich, 'I', 'WORK_OF_ART'), (", 'O', ''), (operating, 'O', ''), (systems, 'O', ''), (,, 'O', ''), (which, 'O', ''), (Apple, 'B', 'ORG'), (claims, 'O', ''), (infringe, 'O', ''), (its, 'O', ''), (patents, 'O', ''), (., 'O', ''), (
, 'B', 'GPE')]


In [36]:
displacy.render(firstsentence, jupyter=True, style='ent')