# Semantic Analysis
* Exploring WordNet
    1. Understanding Synsets
    2. Analyzing Lexical Semantic Relationships
* Word Sense Disambiguation
* Named Entity Recognition
* Building an NER Tagger from Scratch
* Building an End-to-End NER Tagger with Our Trained NER Model
* Analyzing Semantic Representations
    1. Propositional Logic
    2. First Order Logic

## Exploring WordNet 

In [1]:
## Understanding Synsets
from nltk.corpus import wordnet as wn
import pandas as pd

term = 'fruit'
synsets = wn.synsets(term)
# display total synsets
print('Total Synsets:', len(synsets))

Total Synsets: 5


In [2]:
pd.options.display.max_colwidth = 200
fruit_df = pd.DataFrame([{'Synset': synset, 
                          'Part of Speech': synset.lexname(), 
                          'Definition': synset.definition(), 
                          'Lemmas': synset.lemma_names(), 
                          'Examples': synset.examples()}
                                for synset in synsets])
fruit_df = fruit_df[['Synset', 'Part of Speech', 'Definition', 'Lemmas', 'Examples']]
fruit_df

| | Synset | Part of Speech | Definition | Lemmas | Examples |
|---|---|---|---|---|---|
| 0 | Synset('fruit.n.01') | noun.plant | the ripened reproductive body of a seed plant | [fruit] | [] |
| 1 | Synset('yield.n.03') | noun.artifact | an amount of a product | [yield, fruit] | [] |
| 2 | Synset('fruit.n.03') | noun.event | the consequence of some effort or action | [fruit] | [he lived long enough to see the fruit of his policies] |
| 3 | Synset('fruit.v.01') | verb.creation | cause to bear fruit | [fruit] | [] |
| 4 | Synset('fruit.v.02') | verb.creation | bear fruit | [fruit] | [the trees fruited early this year] |


In [3]:
## Analyzing Lexical Semantic Relationships
### Entailments
for action in ['walk', 'eat', 'digest']:
    action_syn = wn.synsets(action, pos='v')[0]
    print(action_syn, '-- entails -->', action_syn.entailments())

Synset('walk.v.01') -- entails --> [Synset('step.v.01')]
Synset('eat.v.01') -- entails --> [Synset('chew.v.01'), Synset('swallow.v.01')]
Synset('digest.v.01') -- entails --> [Synset('consume.v.02')]


In [4]:
### Homonyms and Homographs
for synset in wn.synsets('bank'):
    print(synset.name(), '-', synset.definition())

bank.n.01 - sloping land (especially the slope beside a body of water)
depository_financial_institution.n.01 - a financial institution that accepts deposits and channels the money into lending activities
bank.n.03 - a long ridge or pile
bank.n.04 - an arrangement of similar objects in a row or in tiers
bank.n.05 - a supply or stock held in reserve for future use (especially in emergencies)
bank.n.06 - the funds held by a gambling house or the dealer in some gambling games
bank.n.07 - a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
savings_bank.n.02 - a container (usually with a slot in the top) for keeping money at home
bank.n.09 - a building in which the business of banking transacted
bank.n.10 - a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)
bank.v.01 - tip laterally
bank.v.02 - enclose with a bank
bank.v.03 - do business with a bank or keep an account at 

In [5]:
### Synonyms and Antonyms
term = 'large'
synsets = wn.synsets(term)
adj_large = synsets[1]
adj_large = adj_large.lemmas()[0]
adj_large_synonym = adj_large.synset()
adj_large_antonym = adj_large.antonyms()[0].synset()

print('Synonym:', adj_large_synonym.name())
print('Definition:', adj_large_synonym.definition())
print('Antonym:', adj_large_antonym.name())
print('Definition:', adj_large_antonym.definition())

Synonym: large.a.01
Definition: above average in size or number or quantity or magnitude or extent
Antonym: small.a.01
Definition: limited or below average in number or quantity or magnitude or extent


In [6]:
term = 'rich'
synsets = wn.synsets(term)[:3]

for synset in synsets:
    rich = synset.lemmas()[0]
    rich_synonym = rich.synset()
    rich_antonym = rich.antonyms()[0].synset()

    print('Synonym:', rich_synonym.name())
    print('Definition:', rich_synonym.definition())
    print('Antonym:', rich_antonym.name())
    print('Definition:', rich_antonym.definition())
    print()

Synonym: rich_people.n.01
Definition: people who have possessions and wealth (considered as a group)
Antonym: poor_people.n.01
Definition: people without possessions or wealth (considered as a group)

Synonym: rich.a.01
Definition: possessing material wealth
Antonym: poor.a.02
Definition: having little money or few possessions

Synonym: rich.a.02
Definition: having an abundant supply of desirable qualities or substances (especially natural resources)
Antonym: poor.a.04
Definition: lacking in specific resources, qualities or substances



In [7]:
### Hyponyms and Hypernyms
term = 'tree'
synsets = wn.synsets(term)
tree = synsets[0]

print('Name:', tree.name())
print('Definition:', tree.definition())

Name: tree.n.01
Definition: a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms


In [8]:
hyponyms = tree.hyponyms()
print('Total Hyponyms:', len(hyponyms))
print('Sample Hyponyms')
for hyponym in hyponyms[:10]:
    print(hyponym.name(), '-', hyponym.definition())
    print()

Total Hyponyms: 180
Sample Hyponyms
aalii.n.01 - a small Hawaiian tree with hard dark wood

acacia.n.01 - any of various spiny trees or shrubs of the genus Acacia

african_walnut.n.01 - tropical African timber tree with wood that resembles mahogany

albizzia.n.01 - any of numerous trees of the genus Albizia

alder.n.02 - north temperate shrubs or trees having toothed leaves and conelike fruit; bark is used in tanning and dyeing and the wood is rot-resistant

angelim.n.01 - any of several tropical American trees of the genus Andira

angiospermous_tree.n.01 - any tree having seeds and ovules contained in the ovary

anise_tree.n.01 - any of several evergreen shrubs and small trees of the genus Illicium

arbor.n.01 - tree (as opposed to shrub)

aroeira_blanca.n.01 - small resinous tree or shrub of Brazil



In [9]:
hypernyms = tree.hypernyms()
print(hypernyms)

[Synset('woody_plant.n.01')]


In [10]:
# get total hierarchy pathways for 'tree'
hypernym_paths = tree.hypernym_paths()
print('Total Hypernym paths:', len(hypernym_paths))

Total Hypernym paths: 1


In [11]:
# print the entire hypernym hierarchy
print('Hypernym Hierarchy')
print(' -> '.join(synset.name() for synset in hypernym_paths[0]))

Hypernym Hierarchy
entity.n.01 -> physical_entity.n.01 -> object.n.01 -> whole.n.02 -> living_thing.n.01 -> organism.n.01 -> plant.n.02 -> vascular_plant.n.01 -> woody_plant.n.01 -> tree.n.01
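
WordNet reports a single root-to-`tree` chain for this sense. When a synset has more than one hypernym, there is one path per route to the root. The following is a minimal pure-Python sketch of that enumeration. It assumes a hand-written `HYPERNYMS` mapping (a hypothetical toy graph, not WordNet data), and the helper name `hypernym_paths` is chosen only to echo the NLTK method used above.

```python
# Toy hypernym graph (hypothetical, hand-written): child -> list of parents.
# The 'tree' chain is abbreviated and 'dog' is given two parents on purpose.
HYPERNYMS = {
    'tree': ['woody_plant'],
    'woody_plant': ['vascular_plant'],
    'vascular_plant': ['plant'],
    'plant': ['organism'],
    'dog': ['canine', 'domestic_animal'],
    'canine': ['carnivore'],
    'carnivore': ['animal'],
    'domestic_animal': ['animal'],
    'animal': ['organism'],
    'organism': ['entity'],
}

def hypernym_paths(node, graph):
    """Return every path from a root down to `node`, root first."""
    parents = graph.get(node, [])
    if not parents:  # a root has no hypernyms
        return [[node]]
    paths = []
    for parent in parents:
        for path in hypernym_paths(parent, graph):
            paths.append(path + [node])
    return paths

for name in ('tree', 'dog'):
    paths = hypernym_paths(name, HYPERNYMS)
    print(name, 'has', len(paths), 'path(s)')
    for path in paths:
        print('   ', ' -> '.join(path))
```

In this toy graph `tree` has one path and `dog` has two, because `dog` reaches the root through both of its parents.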


In [12]:
### Holonyms and Meronyms
member_holonyms = tree.member_holonyms()
print('Total Member Holonyms:', len(member_holonyms))
print('Member Holonyms for [tree]:-')
for holonym in member_holonyms:
    print(holonym.name(), '-', holonym.definition())
    print()

Total Member Holonyms: 1
Member Holonyms for [tree]:-
forest.n.01 - the trees and other plants in a large densely wooded area



In [13]:
part_meronyms = tree.part_meronyms()
print('Total Part Meronyms:', len(part_meronyms))
print('Part Meronyms for [tree]:-')
for meronym in part_meronyms:
    print(meronym.name(), '-', meronym.definition())
    print()

Total Part Meronyms: 5
Part Meronyms for [tree]:-
burl.n.02 - a large rounded outgrowth on the trunk or branch of a tree

crown.n.07 - the upper branches and leaves of a tree or other plant

limb.n.02 - any of the main branches arising from the trunk or a bough of a tree

stump.n.01 - the base part of a tree that remains standing after the tree has been felled

trunk.n.01 - the main stem of a tree; usually covered with bark; the bole is usually the part that is commercially useful for lumber



In [14]:
# substance based meronyms for tree
substance_meronyms = tree.substance_meronyms()
print('Total Substance Meronyms:', len(substance_meronyms))
print('Substance Meronyms for [tree]:-')
for meronym in substance_meronyms:
    print(meronym.name(), '-', meronym.definition())
    print()

Total Substance Meronyms: 2
Substance Meronyms for [tree]:-
heartwood.n.01 - the older inactive central wood of a tree or woody plant; usually darker and denser than the surrounding sapwood

sapwood.n.01 - newly formed outer wood lying between the cambium and the heartwood of a tree or woody plant; usually light colored; active in water conduction



In [15]:
### Semantic Relationships and Similarity
tree = wn.synset('tree.n.01')
lion = wn.synset('lion.n.01')
tiger = wn.synset('tiger.n.02')
cat = wn.synset('cat.n.01')
dog = wn.synset('dog.n.01')

# create entities and extract names and definitions
entities = [tree, lion, tiger, cat, dog]
entity_names = [entity.name().split('.')[0] for entity in entities]
entity_definitions = [entity.definition() for entity in entities]

# print entities and their definitions
for entity, definition in zip(entity_names, entity_definitions):
    print(entity, '-', definition)
    print()

tree - a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms

lion - large gregarious predatory feline of Africa and India having a tawny coat with a shaggy mane in the male

tiger - large feline of forests in most of Asia having a tawny coat with black stripes; endangered

cat - feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats

dog - a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds



In [16]:
common_hypernyms = []
for entity in entities:
    # get pairwise lowest common hypernyms
    common_hypernyms.append([entity.lowest_common_hypernyms(compared_entity)[0]
                                            .name().split('.')[0]
                             for compared_entity in entities])

# build pairwise lowest common hypernym matrix
common_hypernym_frame = pd.DataFrame(common_hypernyms,
                                     index=entity_names, 
                                     columns=entity_names)
common_hypernym_frame

| | tree | lion | tiger | cat | dog |
|---|---|---|---|---|---|
| tree | tree | organism | organism | organism | organism |
| lion | organism | lion | big_cat | feline | carnivore |
| tiger | organism | big_cat | tiger | feline | carnivore |
| cat | organism | feline | feline | cat | carnivore |
| dog | organism | carnivore | carnivore | carnivore | dog |


In [17]:
similarities = []
for entity in entities:
    # get pairwise similarities
    similarities.append([round(entity.path_similarity(compared_entity), 2) for compared_entity in entities])

# build pairwise similarity matrix
similarity_frame = pd.DataFrame(similarities, index=entity_names, columns=entity_names)
similarity_frame

| | tree | lion | tiger | cat | dog |
|---|---|---|---|---|---|
| tree | 1.0 | 0.07 | 0.07 | 0.08 | 0.12 |
| lion | 0.07 | 1.0 | 0.33 | 0.25 | 0.17 |
| tiger | 0.07 | 0.33 | 1.0 | 0.25 | 0.17 |
| cat | 0.08 | 0.25 | 0.25 | 1.0 | 0.2 |
| dog | 0.12 | 0.17 | 0.17 | 0.2 | 1.0 |
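
Path similarity is defined as `1 / (1 + d)`, where `d` is the number of edges on the shortest path between two synsets in the hypernym hierarchy. The sketch below works that arithmetic through on a hand-built `PARENT` mapping. The mapping is an assumption: it is meant to mirror only the carnivore fragment of WordNet relevant here, and the real hierarchy has more levels and multiple parents.

```python
# Hand-built fragment of a hypernym tree (an assumption for illustration):
# each name maps to its single parent.
PARENT = {
    'lion': 'big_cat', 'tiger': 'big_cat',
    'big_cat': 'feline', 'cat': 'feline',
    'feline': 'carnivore',
    'dog': 'canine', 'canine': 'carnivore',
}

def ancestors(node):
    """Return [node, parent, grandparent, ...] up to the root of the fragment."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lowest_common_hypernym(a, b):
    """First ancestor of `a` (walking upward) that is also an ancestor of `b`."""
    b_chain = set(ancestors(b))
    return next(n for n in ancestors(a) if n in b_chain)

def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest path between a and b)."""
    lch = lowest_common_hypernym(a, b)
    distance = ancestors(a).index(lch) + ancestors(b).index(lch)
    return 1 / (1 + distance)

for a, b in [('lion', 'tiger'), ('lion', 'cat'), ('cat', 'dog'), ('lion', 'dog')]:
    print(a, b, lowest_common_hypernym(a, b), round(path_similarity(a, b), 2))
```

Under this assumed fragment, the rounded values for the lion, tiger, cat and dog pairs agree with the corresponding entries in the similarity matrix, and the common ancestors agree with the hypernym matrix.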


## Word Sense Disambiguation

In [18]:
from nltk.wsd import lesk
from nltk import word_tokenize

# sample text and word to disambiguate
samples = [('The fruits on that plant have ripened', 'n'), 
            ('He finally reaped the fruit of his hard work as he won the race', 'n')]

# perform word sense disambiguation
word = 'fruit'
for sentence, pos_tag in samples:
    word_syn = lesk(word_tokenize(sentence.lower()), word, pos_tag)
    print('Sentence:', sentence)
    print('Word synset:', word_syn)
    print('Corresponding definition:', word_syn.definition())
    print()

Sentence: The fruits on that plant have ripened
Word synset: Synset('fruit.n.01')
Corresponding definition: the ripened reproductive body of a seed plant

Sentence: He finally reaped the fruit of his hard work as he won the race
Word synset: Synset('fruit.n.03')
Corresponding definition: the consequence of some effort or action
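
The idea behind Lesk-style disambiguation is to pick the sense whose dictionary text shares the most words with the sentence. The sketch below illustrates it without NLTK, using the three noun glosses from the `fruit` synset table earlier in this notebook. It is a simplified variant that also counts example sentences, so it is not a reimplementation of `nltk.wsd.lesk`. The names `SENSES` and `simplified_lesk` are illustrative.

```python
# Glosses and examples copied from the 'fruit' synset table (noun senses only).
SENSES = {
    'fruit.n.01': ('the ripened reproductive body of a seed plant', []),
    'yield.n.03': ('an amount of a product', []),
    'fruit.n.03': ('the consequence of some effort or action',
                   ['he lived long enough to see the fruit of his policies']),
}

def simplified_lesk(context_sentence, senses):
    """Pick the sense whose gloss plus examples overlap most with the context."""
    context = set(context_sentence.lower().split())

    def overlap(item):
        _, (gloss, examples) = item
        signature = set(' '.join([gloss] + examples).lower().split())
        return len(context & signature)

    # ties are resolved in favour of the sense listed first
    best_sense, _ = max(senses.items(), key=overlap)
    return best_sense

for sentence in ['The fruits on that plant have ripened',
                 'He finally reaped the fruit of his hard work as he won the race']:
    print(sentence, '=>', simplified_lesk(sentence, SENSES))
```

With whitespace tokenization and no stop-word removal, function words such as "the" and "of" contribute to the overlap. That is one reason overlap-based disambiguation can be brittle on short sentences.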



In [19]:
# sample text and word to disambiguate
samples = [('Lead is a very soft, malleable metal', 'n'), 
            ('John is the actor who plays the lead in that movie', 'n'), 
            ('This road leads to nowhere', 'v')]
word = 'lead'

# perform word sense disambiguation
for sentence, pos_tag in samples:
    word_syn = lesk(word_tokenize(sentence.lower()), word, pos_tag)
    print('Sentence:', sentence)
    print('Word synset:', word_syn)
    print('Corresponding definition:', word_syn.definition())
    print()

Sentence: Lead is a very soft, malleable metal
Word synset: Synset('lead.n.02')
Corresponding definition: a soft heavy toxic malleable metallic element; bluish white when freshly cut but tarnishes readily to dull grey

Sentence: John is the actor who plays the lead in that movie
Word synset: Synset('star.n.04')
Corresponding definition: an actor who plays a principal role

Sentence: This road leads to nowhere
Word synset: Synset('run.v.23')
Corresponding definition: cause something to pass or lead somewhere



## Named Entity Recognition

In [20]:
text = """Three more countries have joined an “international grand committee” of parliaments, adding to calls for 
Facebook’s boss, Mark Zuckerberg, to give evidence on misinformation to the coalition. Brazil, Latvia and Singapore 
bring the total to eight different parliaments across the world, with plans to send representatives to London on 27 
November with the intention of hearing from Zuckerberg. Since the Cambridge Analytica scandal broke, the Facebook chief 
has only appeared in front of two legislatures: the American Senate and House of Representatives, and the European parliament. 
Facebook has consistently rebuffed attempts from others, including the UK and Canadian parliaments, to hear from Zuckerberg. 
He added that an article in the New York Times on Thursday, in which the paper alleged a pattern of behaviour from Facebook 
to “delay, deny and deflect” negative news stories, “raises further questions about how recent data breaches were allegedly 
dealt with within Facebook.”
"""

In [21]:
import re

text = re.sub(r'\n', '', text) # remove extra newlines

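# load the small English pipeline; the model package is named 'en_core_web_sm'
# (spacy.load('web_en_core_sm') fails because that name does not exist)
# equivalent alternative:
#import spacy
#nlp = spacy.load('en_core_web_sm')
import en_core_web_sm
nlp = en_core_web_sm.load()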

text_nlp = nlp(text)
# print named entities in article
ner_tagged = [(word.text, word.ent_type_) for word in text_nlp]
print(ner_tagged)

[('Three', 'CARDINAL'), ('more', ''), ('countries', ''), ('have', ''), ('joined', ''), ('an', ''), ('“', ''), ('international', ''), ('grand', ''), ('committee', ''), ('”', ''), ('of', ''), ('parliaments', ''), (',', ''), ('adding', ''), ('to', ''), ('calls', ''), ('for', ''), ('Facebook', 'ORG'), ('’s', 'ORG'), ('boss', ''), (',', ''), ('Mark', 'PERSON'), ('Zuckerberg', 'PERSON'), (',', ''), ('to', ''), ('give', ''), ('evidence', ''), ('on', ''), ('misinformation', ''), ('to', ''), ('the', ''), ('coalition', ''), ('.', ''), ('Brazil', 'GPE'), (',', ''), ('Latvia', 'GPE'), ('and', ''), ('Singapore', 'GPE'), ('bring', ''), ('the', ''), ('total', ''), ('to', ''), ('eight', 'CARDINAL'), ('different', ''), ('parliaments', ''), ('across', ''), ('the', ''), ('world', ''), (',', ''), ('with', ''), ('plans', ''), ('to', ''), ('send', ''), ('representatives', ''), ('to', ''), ('London', 'GPE'), ('on', ''), ('27', 'DATE'), ('November', 'DATE'), ('with', ''), ('the', ''), ('intention', ''), ('of'

In [22]:
from spacy import displacy

# visualize named entities
displacy.render(text_nlp, style='ent', jupyter=True)

In [23]:
# extract named entities
named_entities = []
temp_entity_name = ''
temp_named_entity = None
for term, tag in ner_tagged:
    if tag:
        temp_entity_name = ' '.join([temp_entity_name, term]).strip()
        temp_named_entity = (temp_entity_name, tag)
    else:
        if temp_named_entity:
            named_entities.append(temp_named_entity)
            temp_entity_name = ''
            temp_named_entity = None
print(named_entities)

[('Three', 'CARDINAL'), ('Facebook ’s', 'ORG'), ('Mark Zuckerberg', 'PERSON'), ('Brazil', 'GPE'), ('Latvia', 'GPE'), ('Singapore', 'GPE'), ('eight', 'CARDINAL'), ('London', 'GPE'), ('27 November', 'DATE'), ('Zuckerberg', 'GPE'), ('Cambridge Analytica', 'LOC'), ('Facebook', 'ORG'), ('two', 'CARDINAL'), ('the American Senate', 'ORG'), ('House of Representatives', 'ORG'), ('European', 'NORP'), ('Facebook', 'ORG'), ('UK', 'GPE'), ('Canadian', 'NORP'), ('Zuckerberg', 'GPE'), ('the New York Times', 'ORG'), ('Thursday', 'DATE'), ('Facebook', 'ORG'), ('Facebook', 'ORG')]
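
The extraction loop above merges any run of tagged tokens into one entity and only emits an entity when it later meets an untagged token. Two adjacent entities of different types would therefore be fused, and an entity at the very end of the text would be dropped. One way to avoid both, sketched here with `itertools.groupby`, is to group consecutive tokens by tag. The helper name `group_entities` and the toy input are illustrative, not part of the notebook.

```python
from itertools import groupby

def group_entities(tagged_tokens, outside_tags=('', 'O')):
    """Merge consecutive tokens that share a tag into (entity_text, tag) pairs."""
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag in outside_tags:  # skip tokens that are not part of any entity
            continue
        entities.append((' '.join(token for token, _ in group), tag))
    return entities

# Toy input in the same (token, tag) shape as ner_tagged
toy_tagged = [('Mark', 'PERSON'), ('Zuckerberg', 'PERSON'), ('visited', ''),
              ('UK', 'GPE'), ('Thursday', 'DATE'), ('in', ''), ('London', 'GPE')]
print(group_entities(toy_tagged))
```

Here `UK` and `Thursday` stay separate even though they are adjacent, and the trailing `London` is kept. Two distinct entities of the same type that sit next to each other would still be merged, since plain tags carry no boundary information.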


In [24]:
# viewing the top entity types
from collections import Counter
c = Counter([item[1] for item in named_entities])
c.most_common()

[('ORG', 8),
 ('GPE', 7),
 ('CARDINAL', 3),
 ('DATE', 2),
 ('NORP', 2),
 ('PERSON', 1),
 ('LOC', 1)]

In [29]:
import os
from nltk.tag import StanfordNERTagger

STANFORD_CLASSIFIER_PATH = r'/Users/beliciarodriguez/Downloads/stanford-ner-2014-08-27/classifiers/english.all.3class.distsim.crf.ser.gz'
STANFORD_NER_JAR_PATH = r'/Users/beliciarodriguez/Downloads/stanford-ner-2014-08-27/stanford-ner-3.4.1.jar'

sn = StanfordNERTagger(STANFORD_CLASSIFIER_PATH, path_to_jar=STANFORD_NER_JAR_PATH)

In [30]:
# perform NER tagging & extract relevant entities
text_enc = text.encode('ascii', errors='ignore').decode('utf-8')
ner_tagged = sn.tag(text_enc.split())

named_entities = []
temp_entity_name = ''
temp_named_entity = None
for term, tag in ner_tagged:
    if tag != 'O':
        temp_entity_name = ' '.join([temp_entity_name, term]).strip()
        temp_named_entity = (temp_entity_name, tag)
    else:
        if temp_named_entity:
            named_entities.append(temp_named_entity)
            temp_entity_name = ''
            temp_named_entity = None
print(named_entities)

[('Facebooks', 'ORGANIZATION'), ('Latvia', 'LOCATION'), ('Singapore', 'LOCATION'), ('London', 'LOCATION'), ('Cambridge Analytica', 'ORGANIZATION'), ('Facebook', 'ORGANIZATION'), ('Senate', 'ORGANIZATION'), ('Facebook', 'ORGANIZATION'), ('UK', 'LOCATION'), ('New York Times', 'ORGANIZATION'), ('Facebook', 'ORGANIZATION')]


In [31]:
# get the most frequent entity types
c = Counter([item[1] for item in named_entities])
c.most_common()

[('ORGANIZATION', 7), ('LOCATION', 4)]

In [34]:
# using Stanford's Core NLP (connected to server)
from nltk.parse import CoreNLPParser
import nltk

# NER Tagging
ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
tags = list(ner_tagger.raw_tag_sents(nltk.sent_tokenize(text)))
tags = [sublist[0] for sublist in tags]
tags = [word_tag for sublist in tags for word_tag in sublist]

# Extract Named Entities
named_entities = []
temp_entity_name = ''
temp_named_entity = None
for term, tag in tags:
    if tag != 'O':
        temp_entity_name = ' '.join([temp_entity_name, term]).strip()
        temp_named_entity = (temp_entity_name, tag)
    else:
        if temp_named_entity:
            named_entities.append(temp_named_entity)
            temp_entity_name = ''
            temp_named_entity = None
print(named_entities)

[('Three', 'NUMBER'), ('Facebook', 'ORGANIZATION'), ('boss', 'TITLE'), ('Mark Zuckerberg', 'PERSON'), ('Brazil', 'COUNTRY'), ('Latvia', 'COUNTRY'), ('Singapore', 'COUNTRY'), ('eight', 'NUMBER'), ('London', 'CITY'), ('27 November', 'DATE'), ('Zuckerberg', 'PERSON'), ('Cambridge Analytica', 'ORGANIZATION'), ('Facebook', 'ORGANIZATION'), ('two', 'NUMBER'), ('American Senate', 'ORGANIZATION'), ('House of Representatives', 'ORGANIZATION'), ('European', 'NATIONALITY'), ('Facebook', 'ORGANIZATION'), ('UK', 'COUNTRY'), ('Canadian', 'NATIONALITY'), ('Zuckerberg', 'PERSON'), ('New York Times', 'ORGANIZATION'), ('Thursday', 'DATE'), ('Facebook', 'ORGANIZATION'), ('Facebook', 'ORGANIZATION')]


In [35]:
# find out top named entity types
c = Counter([item[1] for item in named_entities])
c.most_common()

[('ORGANIZATION', 9),
 ('COUNTRY', 4),
 ('NUMBER', 3),
 ('PERSON', 3),
 ('DATE', 2),
 ('NATIONALITY', 2),
 ('TITLE', 1),
 ('CITY', 1)]

## Building an NER Tagger from Scratch

In [38]:
dataset_path = '/Users/beliciarodriguez/Downloads/ner_dataset.csv.gz'

import pandas as pd

df = pd.read_csv(dataset_path, compression='gzip', encoding='ISO-8859-1')
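# forward-fill the sentence id, which is only present on the first token of each sentence
# (fillna(method='ffill') is deprecated in recent pandas; ffill() is equivalent)
df = df.ffill()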
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 4 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   Sentence #  1048575 non-null  object
 1   Word        1048575 non-null  object
 2   POS         1048575 non-null  object
 3   Tag         1048575 non-null  object
dtypes: object(4)
memory usage: 32.0+ MB


In [39]:
df.T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1048565 | 1048566 | 1048567 | 1048568 | 1048569 | 1048570 | 1048571 | 1048572 | 1048573 | 1048574 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sentence # | Sentence: 1 | Sentence: 1 | Sentence: 1 | Sentence: 1 | Sentence: 1 | Sentence: 1 | Sentence: 1 | Sentence: 1 | Sentence: 1 | Sentence: 1 | ... | Sentence: 47958 | Sentence: 47958 | Sentence: 47959 | Sentence: 47959 | Sentence: 47959 | Sentence: 47959 | Sentence: 47959 | Sentence: 47959 | Sentence: 47959 | Sentence: 47959 |
| Word | Thousands | of | demonstrators | have | marched | through | London | to | protest | the | ... | impact | . | Indian | forces | said | they | responded | to | the | attack |
| POS | NNS | IN | NNS | VBP | VBN | IN | NNP | TO | VB | DT | ... | NN | . | JJ | NNS | VBD | PRP | VBD | TO | DT | NN |
| Tag | O | O | O | O | O | O | B-geo | O | O | O | ... | O | O | B-gpe | O | O | O | O | O | O | O |


In [40]:
df['Sentence #'].nunique(), df.Word.nunique(), df.POS.nunique(), df.Tag.nunique()

(47959, 35178, 42, 17)

In [41]:
df.Tag.value_counts()

O        887908
B-geo     37644
B-tim     20333
B-org     20143
I-per     17251
B-per     16990
I-org     16784
B-gpe     15870
I-geo      7414
I-tim      6528
B-art       402
B-eve       308
I-art       297
I-eve       253
B-nat       201
I-gpe       198
I-nat        51
Name: Tag, dtype: int64

In [42]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features

# convert input sentence into features
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

# get corresponding outcome NER tag label for input sentence
def sent2labels(sent):
    return [label for token, postag, label in sent]

In [53]:
agg_func = lambda s: [(w, p, t) for w, p, t in zip(s['Word'].values.tolist(),
                                                   s['POS'].values.tolist(), 
                                                   s['Tag'].values.tolist())]

In [54]:
grouped_df = df.groupby('Sentence #').apply(agg_func)

sentences = [s for s in grouped_df]

In [55]:
# view sample annotated sentence from our dataset
sentences[0]

[('Thousands', 'NNS', 'O'),
 ('of', 'IN', 'O'),
 ('demonstrators', 'NNS', 'O'),
 ('have', 'VBP', 'O'),
 ('marched', 'VBN', 'O'),
 ('through', 'IN', 'O'),
 ('London', 'NNP', 'B-geo'),
 ('to', 'TO', 'O'),
 ('protest', 'VB', 'O'),
 ('the', 'DT', 'O'),
 ('war', 'NN', 'O'),
 ('in', 'IN', 'O'),
 ('Iraq', 'NNP', 'B-geo'),
 ('and', 'CC', 'O'),
 ('demand', 'VB', 'O'),
 ('the', 'DT', 'O'),
 ('withdrawal', 'NN', 'O'),
 ('of', 'IN', 'O'),
 ('British', 'JJ', 'B-gpe'),
 ('troops', 'NNS', 'O'),
 ('from', 'IN', 'O'),
 ('that', 'DT', 'O'),
 ('country', 'NN', 'O'),
 ('.', '.', 'O')]

In [56]:
# view how a slice of an annotated, tokenized sentence is turned into features by sent2features
sent2features(sentences[0][5:7])

[{'bias': 1.0,
  'word.lower()': 'through',
  'word[-3:]': 'ugh',
  'word[-2:]': 'gh',
  'word.isupper()': False,
  'word.istitle()': False,
  'word.isdigit()': False,
  'postag': 'IN',
  'postag[:2]': 'IN',
  'BOS': True,
  '+1:word.lower()': 'london',
  '+1:word.istitle()': True,
  '+1:word.isupper()': False,
  '+1:postag': 'NNP',
  '+1:postag[:2]': 'NN'},
 {'bias': 1.0,
  'word.lower()': 'london',
  'word[-3:]': 'don',
  'word[-2:]': 'on',
  'word.isupper()': False,
  'word.istitle()': True,
  'word.isdigit()': False,
  'postag': 'NNP',
  'postag[:2]': 'NN',
  '-1:word.lower()': 'through',
  '-1:word.istitle()': False,
  '-1:word.isupper()': False,
  '-1:postag': 'IN',
  '-1:postag[:2]': 'IN',
  'EOS': True}]

In [57]:
sent2labels(sentences[0][5:7])

['O', 'B-geo']

In [58]:
# prepare train and test datasets by engineering features from the input sentences
# and collecting the corresponding NER tag labels
from sklearn.model_selection import train_test_split
import numpy as np

# sentences differ in length, so store them as 1-D object arrays
# (recent NumPy versions reject ragged input without dtype=object)
X = np.array([sent2features(s) for s in sentences], dtype=object)
y = np.array([sent2labels(s) for s in sentences], dtype=object)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train.shape, X_test.shape

((35969,), (11990,))

In [59]:
import sklearn_crfsuite

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', 
                           c1=0.1, 
                           c2=0.1, 
                           max_iterations=100, 
                           all_possible_transitions=True, 
                           verbose=True)
crf.fit(X_train, y_train)

loading training data to CRFsuite: 100%|██████████| 35969/35969 [00:34<00:00, 1034.76it/s]

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 133629
Seconds required: 4.796

L-BFGS optimization
c1: 0.100000
c2: 0.100000
num_memories: 6
max_iterations: 100
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=4.38  loss=1264028.26 active=132637 feature_norm=1.00
Iter 2   time=4.34  loss=994059.01 active=131294 feature_norm=4.42
Iter 3   time=2.21  loss=776413.87 active=125970 feature_norm=3.87
Iter 4   time=11.54 loss=422143.40 active=127018 feature_norm=3.24
Iter 5   time=2.15  loss=355775.44 active=129029 feature_norm=4.04
Iter 6   time=2.11  loss=264125.22 active=124046 feature_norm=6.10
Iter 7   time=2.08  loss=222304.71 active=117183 feature_norm=7.69
Iter 8   time=2.07  loss=197827.17 ac

CRF(algorithm='lbfgs', all_possible_states=None, all_possible_transitions=True,
    averaging=None, c=None, c1=0.1, c2=0.1, calibration_candidates=None,
    calibration_eta=None, calibration_max_trials=None, calibration_rate=None,
    calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
    gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
    max_linesearch=None, min_freq=None, model_filename=None, num_memories=None,
    pa_type=None, period=None, trainer_cls=None, variance=None, verbose=True)

In [60]:
# save the trained model to disk
# (sklearn.externals.joblib has been removed from scikit-learn; import joblib directly)
import joblib
joblib.dump(crf, 'ner_model.pkl')

# to load
# crf = joblib.load('ner_model.pkl')

['ner_model.pkl']

In [61]:
# evaluate model performance for NER tagging on test data
# show sample prediction and actual labels
y_pred = crf.predict(X_test)
print(y_pred[0])

['O', 'O', 'O', 'O', 'B-per', 'I-per', 'O', 'B-org', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [62]:
print(y_test[0])

['O', 'O', 'O', 'O', 'B-per', 'I-per', 'O', 'B-org', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [63]:
# evaluate model performance on entire test dataset
# get key classification model performance metrics
from sklearn_crfsuite import metrics as crf_metrics

labels = list(crf.classes_)
labels.remove('O')
print(crf_metrics.flat_classification_report(y_test, y_pred, labels=labels))

              precision    recall  f1-score   support

       B-org       0.81      0.73      0.77      5116
       B-per       0.85      0.84      0.84      4239
       I-per       0.85      0.90      0.88      4273
       B-geo       0.86      0.91      0.89      9403
       I-geo       0.81      0.80      0.81      1826
       B-tim       0.93      0.89      0.91      5095
       I-org       0.82      0.79      0.80      4195
       B-gpe       0.97      0.94      0.96      3961
       I-tim       0.84      0.81      0.82      1604
       B-nat       0.50      0.24      0.32        55
       B-eve       0.51      0.33      0.40        80
       B-art       0.36      0.14      0.20       102
       I-art       0.24      0.07      0.10        90
       I-eve       0.45      0.19      0.27        74
       I-gpe       0.86      0.53      0.66        36
       I-nat       0.57      0.22      0.32        18

   micro avg       0.86      0.85      0.86     40167
   macro avg       0.70      0.58      0
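
The report above is computed per token: the nested label sequences are flattened and every token counts as one prediction. As a rough illustration of what each row means, the sketch below computes precision, recall and F1 for a single label over nested sequences. The helper `flat_prf` and the toy sequences are made up for this example and are not part of `sklearn_crfsuite`.

```python
from collections import Counter

def flat_prf(y_true, y_pred, label):
    """Token-level precision, recall and F1 for one label over nested sequences."""
    counts = Counter()
    for true_seq, pred_seq in zip(y_true, y_pred):
        for true_tag, pred_tag in zip(true_seq, pred_seq):
            if pred_tag == label and true_tag == label:
                counts['tp'] += 1   # predicted the label and it was right
            elif pred_tag == label:
                counts['fp'] += 1   # predicted the label but it was wrong
            elif true_tag == label:
                counts['fn'] += 1   # missed a token that carries the label
    predicted = counts['tp'] + counts['fp']
    actual = counts['tp'] + counts['fn']
    precision = counts['tp'] / predicted if predicted else 0.0
    recall = counts['tp'] / actual if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy sequences (invented for illustration)
y_true_toy = [['O', 'B-geo', 'O'], ['B-geo', 'O', 'B-per']]
y_pred_toy = [['O', 'B-geo', 'B-geo'], ['O', 'O', 'B-per']]
for label in ('B-geo', 'B-per'):
    print(label, flat_prf(y_true_toy, y_pred_toy, label))
```

Because the scores are token-level, a multi-token entity with one wrong token still earns partial credit. Entity-level evaluation would be stricter.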

## Building an End-to-End NER Tagger with Our Trained NER Model

In [64]:
# tokenize our text and perform POS tagging
import nltk

text_tokens = nltk.word_tokenize(text)
text_pos = nltk.pos_tag(text_tokens)
text_pos[:10]

[('Three', 'CD'),
 ('more', 'JJR'),
 ('countries', 'NNS'),
 ('have', 'VBP'),
 ('joined', 'VBN'),
 ('an', 'DT'),
 ('“', 'NNP'),
 ('international', 'JJ'),
 ('grand', 'JJ'),
 ('committee', 'NN')]

In [65]:
# extract features from POS tagged text document
features = [sent2features(text_pos)]
features[0][0]

{'bias': 1.0,
 'word.lower()': 'three',
 'word[-3:]': 'ree',
 'word[-2:]': 'ee',
 'word.isupper()': False,
 'word.istitle()': True,
 'word.isdigit()': False,
 'postag': 'CD',
 'postag[:2]': 'CD',
 'BOS': True,
 '+1:word.lower()': 'more',
 '+1:word.istitle()': False,
 '+1:word.isupper()': False,
 '+1:postag': 'JJR',
 '+1:postag[:2]': 'JJ'}

In [66]:
# use CRF model just trained to predict features we engineered from sample doc
labels = crf.predict(features)
doc_labels = labels[0]
doc_labels[10:20]

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-art', 'I-art']

In [67]:
# combine the actual text tokens with their corresponding NER tags
# retrieve relevant named entities from NER tags
text_ner = [(token, tag) for token, tag in zip(text_tokens, doc_labels)]
print(text_ner)

[('Three', 'O'), ('more', 'O'), ('countries', 'O'), ('have', 'O'), ('joined', 'O'), ('an', 'O'), ('“', 'O'), ('international', 'O'), ('grand', 'O'), ('committee', 'O'), ('”', 'O'), ('of', 'O'), ('parliaments', 'O'), (',', 'O'), ('adding', 'O'), ('to', 'O'), ('calls', 'O'), ('for', 'O'), ('Facebook', 'B-art'), ('’', 'I-art'), ('s', 'O'), ('boss', 'O'), (',', 'O'), ('Mark', 'B-per'), ('Zuckerberg', 'I-per'), (',', 'O'), ('to', 'O'), ('give', 'O'), ('evidence', 'O'), ('on', 'O'), ('misinformation', 'O'), ('to', 'O'), ('the', 'O'), ('coalition', 'O'), ('.', 'O'), ('Brazil', 'B-geo'), (',', 'O'), ('Latvia', 'B-org'), ('and', 'I-org'), ('Singapore', 'I-org'), ('bring', 'O'), ('the', 'O'), ('total', 'O'), ('to', 'O'), ('eight', 'O'), ('different', 'O'), ('parliaments', 'O'), ('across', 'O'), ('the', 'O'), ('world', 'O'), (',', 'O'), ('with', 'O'), ('plans', 'O'), ('to', 'O'), ('send', 'O'), ('representatives', 'O'), ('to', 'O'), ('London', 'B-geo'), ('on', 'O'), ('27', 'B-tim'), ('November', 

In [68]:
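# extract and display all named entities
# consecutive non-'O' tokens are joined into one entity; the stored tag is
# the tag of the entity's last token (e.g. I-per for 'Mark Zuckerberg')
named_entities = []
temp_entity_name = ''
temp_named_entity = None
for term, tag in text_ner:
    if tag != 'O':
        # extend the entity currently being built
        temp_entity_name = ' '.join([temp_entity_name, term]).strip()
        temp_named_entity = (temp_entity_name, tag)
    else:
        # an 'O' tag closes any entity in progress
        if temp_named_entity:
            named_entities.append(temp_named_entity)
            temp_entity_name = ''
            temp_named_entity = None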

import pandas as pd
pd.DataFrame(named_entities, columns=['Entity', 'Tag'])

Unnamed: 0,Entity,Tag
0,Facebook ’,I-art
1,Mark Zuckerberg,I-per
2,Brazil,B-geo
3,Latvia and Singapore,I-org
4,London,B-geo
5,27 November,I-tim
6,Zuckerberg,B-geo
7,Cambridge Analytica,I-org
8,Facebook,B-org
9,American Senate and House of Representatives,I-org
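
The loop above records the tag of each entity's last token (for example `I-per` for `Mark Zuckerberg`) and would join two entities if no `O` token separated them. A minimal sketch of a BIO-aware alternative is shown below. The function name `extract_entities` and the `sample` list are hypothetical and not part of the original notebook; the sketch assumes tags follow the `B-xxx` / `I-xxx` / `O` scheme seen in the output.

```python
def extract_entities(tagged_tokens):
    """Group (token, BIO-tag) pairs into (entity_text, entity_type) tuples."""
    entities = []
    tokens, ent_type = [], None
    for token, tag in tagged_tokens:
        # 'B-per' -> ('B', '-', 'per'); 'O' -> ('O', '', '')
        prefix, _, label = tag.partition('-')
        if prefix == 'I' and tokens and label == ent_type:
            # continuation of the entity currently being built
            tokens.append(token)
            continue
        # any other tag closes the entity in progress
        if tokens:
            entities.append((' '.join(tokens), ent_type))
        if prefix in ('B', 'I'):
            # start a new entity (a stray I- tag is treated like B-)
            tokens, ent_type = [token], label
        else:
            tokens, ent_type = [], None
    # flush an entity that runs to the end of the document
    if tokens:
        entities.append((' '.join(tokens), ent_type))
    return entities


# hypothetical sample in the same (token, tag) shape as text_ner
sample = [('Mark', 'B-per'), ('Zuckerberg', 'I-per'), (',', 'O'),
          ('Brazil', 'B-geo'), ('Latvia', 'B-org'), ('London', 'B-geo')]
entities = extract_entities(sample)
print(entities)
```

With this grouping the reported label is the entity type (`per`, `geo`, ...) rather than the positional tag, and adjacent entities of different types stay separate.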


## Analyzing Semantic Representations

In [69]:
import nltk
import pandas as pd
import os

# assign symbols and propositions
symbol_P = 'P'
symbol_Q = 'Q'

proposition_P = 'He is hungry'
proposition_Q = 'He will eat a sandwich'

In [70]:
# assign various truth values to the propositions
p_statuses = [False, False, True, True]
q_statuses = [False, True, False, True]

# assign the various expressions combining the logical operators
conjunction = '(P & Q)'
disjunction = '(P | Q)'
implication = '(P -> Q)'
equivalence = '(P <-> Q)'
expressions = [conjunction, disjunction, implication, equivalence]
expressions

['(P & Q)', '(P | Q)', '(P -> Q)', '(P <-> Q)']

In [71]:
# evaluate each expression using propositional logic
results = []
for status_p, status_q in zip(p_statuses, q_statuses):
    # propositional symbols need no individuals, so the domain is empty
    dom = set()
    val = nltk.Valuation([(symbol_P, status_p), 
                          (symbol_Q, status_q)])
    assignments = nltk.Assignment(dom)
    model = nltk.Model(dom, val)
    row = [status_p, status_q]
    for expression in expressions:
        # evaluate each expression based on proposition truth values
        result = model.evaluate(expression, assignments)
        row.append(result)
    results.append(row)

# build the result table
columns = [symbol_P, symbol_Q, conjunction, 
           disjunction, implication, equivalence]
result_frame = pd.DataFrame(results, columns=columns)

# display results
print('P:', proposition_P)
print('Q:', proposition_Q)
print()
print('Expression Outcomes:-')
print(result_frame)

P: He is hungry
Q: He will eat a sandwich

Expression Outcomes:-
       P      Q  (P & Q)  (P | Q)  (P -> Q)  (P <-> Q)
0  False  False    False    False      True       True
1  False   True    False     True      True      False
2   True  False    False     True     False      False
3   True   True     True     True      True       True
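
As a cross-check on the table above, the same four connectives can be evaluated with plain Python booleans. This is only a sketch: the `connectives` dictionary and `truth_table` list are names introduced here for illustration, and the encoding of `->` as `(not p) or q` and `<->` as `p == q` is the standard truth-functional reading rather than anything taken from NLTK.

```python
from itertools import product

# each connective as a function of the truth values of P and Q
connectives = {
    '(P & Q)': lambda p, q: p and q,
    '(P | Q)': lambda p, q: p or q,
    '(P -> Q)': lambda p, q: (not p) or q,   # false only when P is true and Q is false
    '(P <-> Q)': lambda p, q: p == q,        # true when both sides agree
}

# one row per truth assignment, in the same order as the table above
truth_table = [
    (p, q) + tuple(fn(p, q) for fn in connectives.values())
    for p, q in product([False, True], repeat=2)
]

print(('P', 'Q') + tuple(connectives))
for row in truth_table:
    print(row)
```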


In [95]:
## First Order Logic
import nltk
import os

# for reading FOL expressions
read_expr = nltk.sem.Expression.fromstring

# initialize the theorem prover (PROVER9 must point to your local Prover9 bin directory)
os.environ['PROVER9'] = r'E:/Users/beliciarodriguez/prover9/LADR-2009-11A/bin'
prover = nltk.Prover9()

In [102]:
# set the rule expressions
rule = read_expr('all x. all y. (jumps_over(x,y) -> -jumps_over(y, x))')

# set the event that occurred
event = read_expr('jumps_over(fox, dog)')

# set the outcome we want to evaluate -- the goal
test_outcome = read_expr('jumps_over(dog, fox)')

# get the result (pass verbose=True to also print the prover's output)
prover.prove(goal=test_outcome, assumptions=[event,rule])

False

In [104]:
# set the rule expression
rule = read_expr('all x. (studies(x, exam) -> pass(x, exam))')

# set the events and outcomes we want to determine
event1 = read_expr('-studies(John, exam)')
test_outcome1 = read_expr('pass(John, exam)')

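# get results: False means no proof was found from these assumptions
# (not studying does not imply failing), not that the goal is false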
prover.prove(goal=test_outcome1, assumptions=[event1, rule])

False

In [105]:
# set the events and outcomes we want to determine
event2 = read_expr('studies(Pierre, exam)')
test_outcome2 = read_expr('pass(Pierre, exam)')

# get results
prover.prove(goal=test_outcome2, assumptions=[event2, rule])

True
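
The two results above differ because the rule only says what follows from studying. A rough way to see this is to shrink the problem to one person and two propositional symbols and enumerate every truth assignment. The helper `entails` and the lambda names below are hypothetical, and this brute-force check is a simplification for illustration, not how Prover9 searches for a proof.

```python
from itertools import product

def entails(premises, conclusion):
    """True if the conclusion holds in every truth assignment that satisfies all premises."""
    return all(
        conclusion(studies, passes)
        for studies, passes in product([False, True], repeat=2)
        if all(premise(studies, passes) for premise in premises)
    )

# studies(x, exam) -> pass(x, exam), restricted to a single individual
rule = lambda studies, passes: (not studies) or passes
did_not_study = lambda studies, passes: not studies
did_study = lambda studies, passes: studies
will_pass = lambda studies, passes: passes
will_fail = lambda studies, passes: not passes

john_passes = entails([rule, did_not_study], will_pass)    # the John query
john_fails = entails([rule, did_not_study], will_fail)     # the opposite goal
pierre_passes = entails([rule, did_study], will_pass)      # the Pierre query

print(john_passes, john_fails, pierre_passes)
```

In this reduced setting neither passing nor failing follows for someone who did not study, which matches reading the prover's `False` as "not provable" rather than "false".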

In [106]:
# define symbols (entities/functions) and their values
rules = """
    rover => r
    felix => f
    garfield => g
    alex => a
    dog => {r, a}
    cat => {g}
    fox => {f}
    runs => {a, f}
    sleeps => {r, g}
    jumps_over => {(f, g), (a, g), (f, r), (a, r)}
    """

val = nltk.Valuation.fromstring(rules)

# view the valuation object of symbols and their assigned values (dictionary)
val

{'rover': 'r',
 'felix': 'f',
 'garfield': 'g',
 'alex': 'a',
 'dog': {('a',), ('r',)},
 'cat': {('g',)},
 'fox': {('f',)},
 'runs': {('a',), ('f',)},
 'sleeps': {('g',), ('r',)},
 'jumps_over': {('a', 'g'), ('a', 'r'), ('f', 'g'), ('f', 'r')}}

In [108]:
# define domain and build FOL based model
dom = {'r', 'f', 'g', 'a'}
m = nltk.Model(dom, val)

# evaluate various expressions
m.evaluate('jumps_over(felix, rover) & dog(rover) & runs(rover)', None)

False

In [109]:
m.evaluate('jumps_over(felix, rover) & dog(rover) & -runs(rover)', None)

True

In [110]:
m.evaluate('jumps_over(alex, garfield) & dog(alex) & cat(garfield) & sleeps(garfield)', None)

True

In [113]:
# assign rover to x and felix to y in the domain
g = nltk.Assignment(dom, [('x', 'r'), ('y', 'f')])

# evaluate more expressions based on above assigned symbols
m.evaluate('runs(y) & jumps_over(y, x) & sleeps(x)', g)

True

In [114]:
m.evaluate('exists y. (fox(y) & runs(y))', g)

True

In [115]:
# which animals run?
formula = read_expr('runs(x)')
m.satisfiers(formula, 'x', g)

{'a', 'f'}

In [116]:
# which animals run and are also foxes?
formula = read_expr('runs(x) & fox(x)')
m.satisfiers(formula, 'x', g)

{'f'}
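
The satisfier queries above amount to filtering the domain by a condition. A minimal sketch with plain Python sets is given below; the helper `satisfiers` and the variable names are introduced here for illustration and only mirror the unary predicates from the valuation, they are not NLTK's implementation.

```python
# unary predicates as sets of domain elements, mirroring the valuation above
domain = {'r', 'f', 'g', 'a'}
runs = {'a', 'f'}
fox = {'f'}

def satisfiers(domain, condition):
    """Return the domain elements for which the condition holds."""
    return {x for x in domain if condition(x)}

# which animals run?
runners = satisfiers(domain, lambda x: x in runs)

# which animals run and are also foxes?
running_foxes = satisfiers(domain, lambda x: x in runs and x in fox)

print(runners, running_foxes)
```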