<a href="https://colab.research.google.com/github/bteinstein/POS-Tagging-NER/blob/master/POS_tagging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Attempt 1 - Using NLTK POS-tagging

In [0]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

In [2]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Resources required by ne_chunk for NER
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

**Function to tokenize a sentence and tag each token's part of speech**

In [0]:
def preprocess(sent):
    """Tokenize a sentence and return a list of (word, POS tag) pairs."""
    tokens = nltk.word_tokenize(sent)
    return nltk.pos_tag(tokens)

In [0]:
samp_text = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'

#### Text: **European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices**

In [5]:
prs_text = preprocess(samp_text)
prs_text

[('European', 'JJ'),
 ('authorities', 'NNS'),
 ('fined', 'VBD'),
 ('Google', 'NNP'),
 ('a', 'DT'),
 ('record', 'NN'),
 ('$', '$'),
 ('5.1', 'CD'),
 ('billion', 'CD'),
 ('on', 'IN'),
 ('Wednesday', 'NNP'),
 ('for', 'IN'),
 ('abusing', 'VBG'),
 ('its', 'PRP$'),
 ('power', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('mobile', 'JJ'),
 ('phone', 'NN'),
 ('market', 'NN'),
 ('and', 'CC'),
 ('ordered', 'VBD'),
 ('the', 'DT'),
 ('company', 'NN'),
 ('to', 'TO'),
 ('alter', 'VB'),
 ('its', 'PRP$'),
 ('practices', 'NNS')]

### Forming Noun Phrase Chunks
Rule - **an optional determiner (DT), followed by any number of adjectives (JJ), then a noun (NN)**

In [0]:
pattn = 'NP: {<DT>?<JJ>*<NN>}' 

In [7]:
rp = nltk.RegexpParser(pattn)
npprs = rp.parse(prs_text)
print(npprs)

(S
  European/JJ
  authorities/NNS
  fined/VBD
  Google/NNP
  (NP a/DT record/NN)
  $/$
  5.1/CD
  billion/CD
  on/IN
  Wednesday/NNP
  for/IN
  abusing/VBG
  its/PRP$
  (NP power/NN)
  in/IN
  (NP the/DT mobile/JJ phone/NN)
  (NP market/NN)
  and/CC
  ordered/VBD
  (NP the/DT company/NN)
  to/TO
  alter/VB
  its/PRP$
  practices/NNS)
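
The rule above only matches singular common nouns (`NN`), so `authorities/NNS` and `Google/NNP` stay outside any chunk. A minimal sketch of a broader rule follows, assuming we want any noun tag to close a phrase and consecutive nouns to be grouped. The names `broad_pattn` and `broad_rp` are hypothetical and not part of the original notebook, and the tagged tokens are the first six entries of the output shown earlier.

```python
import nltk

# First six (word, tag) pairs copied from the tagger output above
tagged = [('European', 'JJ'), ('authorities', 'NNS'), ('fined', 'VBD'),
          ('Google', 'NNP'), ('a', 'DT'), ('record', 'NN')]

# Hypothetical broader rule: optional determiner, any number of adjectives,
# then one or more nouns of any kind (NN, NNS, NNP, NNPS)
broad_pattn = 'NP: {<DT>?<JJ>*<NN.*>+}'
broad_rp = nltk.RegexpParser(broad_pattn)
tree = broad_rp.parse(tagged)

# Collect the text of every NP chunk, left to right
noun_phrases = [' '.join(word for word, tag in st.leaves())
                for st in tree.subtrees() if st.label() == 'NP']
print(noun_phrases)
```

On these six tokens the broader rule should group `European authorities` into one chunk and pick up `Google` as well. Whether that grouping is desirable depends on the downstream task.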


### Using nltk.ne_chunk for NER

In [8]:
ne_tree = nltk.ne_chunk(prs_text)
print(ne_tree)

(S
  (GPE European/JJ)
  authorities/NNS
  fined/VBD
  (PERSON Google/NNP)
  a/DT
  record/NN
  $/$
  5.1/CD
  billion/CD
  on/IN
  Wednesday/NNP
  for/IN
  abusing/VBG
  its/PRP$
  power/NN
  in/IN
  the/DT
  mobile/JJ
  phone/NN
  market/NN
  and/CC
  ordered/VBD
  the/DT
  company/NN
  to/TO
  alter/VB
  its/PRP$
  practices/NNS)
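
To compare this result with other NER tools, it can help to flatten the tree into `(label, text)` pairs. The sketch below is one way to do that. The helper name `extract_entities` is hypothetical, and the tree is rebuilt by hand from the first four tokens of the printed output so the example runs without the chunker models.

```python
from nltk import Tree

def extract_entities(tree):
    """Return (label, text) pairs for each labelled subtree directly under the root."""
    return [(node.label(), ' '.join(word for word, tag in node.leaves()))
            for node in tree if isinstance(node, Tree)]

# Hand-built excerpt of the ne_chunk output above (assumption: same structure
# as the tree returned by nltk.ne_chunk)
excerpt = Tree('S', [
    Tree('GPE', [('European', 'JJ')]),
    ('authorities', 'NNS'),
    ('fined', 'VBD'),
    Tree('PERSON', [('Google', 'NNP')]),
])

entities = extract_entities(excerpt)
print(entities)
```

In the notebook itself, the same helper could presumably be called as `extract_entities(ne_tree)`.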


### Note: Not a good result yet (`European` is labelled GPE, `Google` is labelled PERSON, and the amount and date are left untagged), but definitely a good starting point

## Attempt 2 - spaCy for NER

In [0]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp_model = en_core_web_sm.load()


from pprint import pprint

In [10]:
prs_doc = nlp_model(samp_text)
pprint([(p.label_, p.text) for p in prs_doc.ents])

[('NORP', 'European'),
 ('ORG', 'Google'),
 ('MONEY', '$5.1 billion'),
 ('DATE', 'Wednesday')]


In [0]:
samp_text2 = 'Shortly after his inauguration, Buhari on June 3 and 4, 2015 travelled to the Republic of Niger and Chad Republic for consultations on how to tackle terrorism in the country and the region.'

In [14]:
prs_doc2 = nlp_model(samp_text2)
pprint([(p.label_, p.text) for p in prs_doc2.ents])

[('PERSON', 'Buhari'),
 ('DATE', 'June 3 and 4, 2015'),
 ('GPE', 'the Republic of Niger'),
 ('GPE', 'Chad Republic')]
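
The `Counter` imported earlier is never used in the cells above. A small sketch of how it might summarize entity labels follows. The list `ents` is copied from the printed output rather than recomputed, and the name `label_counts` is hypothetical.

```python
from collections import Counter

# (label, text) pairs copied from the spaCy output above
ents = [('PERSON', 'Buhari'),
        ('DATE', 'June 3 and 4, 2015'),
        ('GPE', 'the Republic of Niger'),
        ('GPE', 'Chad Republic')]

# Count how often each entity label occurs
label_counts = Counter(label for label, text in ents)
print(label_counts.most_common())
```

With a live document, the same count could be taken over `p.label_ for p in prs_doc2.ents` instead of the copied list.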
