# **In Class Assignment: Semantic Analysis P1**

## Name: KEY
## *IS 5150*

In this in-class assignment, we will apply several methods of semantic analysis covered in the lecture. We will begin by exploring the WordNet corpus and its various synset types, then use synsets to measure semantic similarity between entities.

Next, we will utilize the Lesk algorithm to perform word sense disambiguation.

In P2 of this notebook, we will examine named entity recognition.

As always, we begin by loading our dependencies.

In [None]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
from nltk.corpus import wordnet as wn
import pandas as pd

## Let's begin by examining some synsets in WordNet...

Select a term, then print a dataframe of its synsets that includes each synset, its POS tag, definition, lemmas, and examples.

In [None]:
term = 'pool'
synsets = wn.synsets(term)
print('Total Synsets:', len(synsets))

Total Synsets: 11


In [None]:

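pd.options.display.max_colwidth = 200   # show long definitions without truncation

# one row per synset; lexname() returns the lexicographer file name,
# e.g. 'noun.artifact', whose prefix is the part of speech
synset_df = pd.DataFrame([{'Synset': synset,
                           'Part of Speech': synset.lexname(),
                           'Definition': synset.definition(),
                           'Lemmas': synset.lemma_names(),
                           'Examples': synset.examples()}
                          for synset in synsets])
synset_df = synset_df[['Synset', 'Part of Speech', 'Definition', 'Lemmas', 'Examples']]
synset_df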

| | Synset | Part of Speech | Definition | Lemmas | Examples |
|---|---|---|---|---|---|
| 0 | Synset('pool.n.01') | noun.artifact | an excavation that is (usually) filled with water | [pool] | [] |
| 1 | Synset('pond.n.01') | noun.object | a small lake | [pond, pool] | [the pond was too small for sailing] |
| 2 | Synset('pool.n.03') | noun.group | an organization of people or resources that can be shared | [pool] | [a car pool, a secretarial pool, when he was first hired he was assigned to the pool] |
| 3 | Synset('consortium.n.01') | noun.group | an association of companies for some definite purpose | [consortium, pool, syndicate] | [] |
| 4 | Synset('pool.n.05') | noun.possession | any communal combination of funds | [pool] | [everyone contributed to the pool] |
| 5 | Synset('pool.n.06') | noun.object | a small body of standing water (rainwater) or other liquid | [pool, puddle] | [there were puddles of muddy water in the road after the rain, the body lay in a pool of blood] |
| 6 | Synset('pool.n.07') | noun.possession | the combined stakes of the betters | [pool, kitty] | [] |
| 7 | Synset('pool.n.08') | noun.location | something resembling a pool of liquid | [pool, puddle] | [he stood in a pool of light, his chair sat in a puddle of books and magazines] |
| 8 | Synset('pool.n.09') | noun.act | any of various games played on a pool table having 6 pockets | [pool, pocket_billiards] | [] |
| 9 | Synset('pool.v.01') | verb.possession | combine into a common fund | [pool] | [We pooled resources] |
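
The `Part of Speech` column above holds WordNet lexnames of the form `pos.category`, so the coarse part of speech is the prefix before the dot. The following is a minimal sketch of how the spread across POS types could be tallied. It assumes the lexnames are copied by hand from the rows displayed above rather than queried from the corpus, and the names `lexnames` and `pos_counts` are illustrative only.

```python
from collections import Counter

# Lexnames copied by hand from the ten rows displayed above
# (illustrative data, not queried from WordNet here)
lexnames = ['noun.artifact', 'noun.object', 'noun.group', 'noun.group',
            'noun.possession', 'noun.object', 'noun.possession',
            'noun.location', 'noun.act', 'verb.possession']

# The coarse part of speech is the part of the lexname before the dot
pos_counts = Counter(name.split('.')[0] for name in lexnames)
print(pos_counts)
```

With the live corpus, `synset.pos()` returns a single-letter tag (such as `'n'` or `'v'`) that could be counted the same way.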


**How many synsets did your word of choice have? Are they spread across different POS types?**

## Next, let's explore some of the different semantic relationships amongst synsets, including:

* Entailments
* Homonyms \& Homographs
* Synonyms \& Antonyms
* Hyponyms \& Hypernyms
* Holonyms \& Meronyms

In [None]:
for action in ['walked', 'eat', 'digest']:                                                        # select some action words
    action_syn = wn.synsets(action, pos='v')[0]                                                   # find their synsets
    print(action_syn, '-- entails -->', action_syn.entailments())                                 # print action synset and its entailment

Synset('walk.v.01') -- entails --> [Synset('step.v.01')]
Synset('eat.v.01') -- entails --> [Synset('chew.v.01'), Synset('swallow.v.01')]
Synset('digest.v.01') -- entails --> [Synset('consume.v.02')]


In [None]:
for synset in wn.synsets('table'):
    print(synset.name(),'-',synset.definition())

table.n.01 - a set of data arranged in rows and columns
table.n.02 - a piece of furniture having a smooth flat top that is usually supported by one or more vertical legs
table.n.03 - a piece of furniture with tableware for a meal laid out on it
mesa.n.01 - flat tableland with steep edges
table.n.05 - a company of people assembled at a table for a meal or game
board.n.04 - food or meals in general
postpone.v.01 - hold back to a later time
table.v.02 - arrange or enter in tabular form


In [None]:
term = 'rich'
synsets = wn.synsets(term)[:3]

for synset in synsets:
    rich = synset.lemmas()[0]
    rich_synonym = rich.synset()
    rich_antonym = rich.antonyms()[0].synset()
    
    print('Synonym:', rich_synonym.name())
    print('Definition:', rich_synonym.definition())
    print('Antonym:', rich_antonym.name())
    print('Definition:', rich_antonym.definition())
    print()

Synonym: rich_people.n.01
Definition: people who have possessions and wealth (considered as a group)
Antonym: poor_people.n.01
Definition: people without possessions or wealth (considered as a group)

Synonym: rich.a.01
Definition: possessing material wealth
Antonym: poor.a.02
Definition: having little money or few possessions

Synonym: rich.a.02
Definition: having an abundant supply of desirable qualities or substances (especially natural resources)
Antonym: poor.a.04
Definition: lacking in specific resources, qualities or substances



In [None]:
term = 'tree'
synsets = wn.synsets(term)
tree = synsets[0]

print('Name:', tree.name())
print('Definition:', tree.definition())

Name: tree.n.01
Definition: a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms


In [None]:
hyponyms = tree.hyponyms()
print('Total Hyponyms:', len(hyponyms))
print('Sample Hyponyms')
for hyponym in hyponyms[:10]:
    print(hyponym.name(), '-', hyponym.definition())
    print()

Total Hyponyms: 180
Sample Hyponyms
aalii.n.01 - a small Hawaiian tree with hard dark wood

acacia.n.01 - any of various spiny trees or shrubs of the genus Acacia

african_walnut.n.01 - tropical African timber tree with wood that resembles mahogany

albizzia.n.01 - any of numerous trees of the genus Albizia

alder.n.02 - north temperate shrubs or trees having toothed leaves and conelike fruit; bark is used in tanning and dyeing and the wood is rot-resistant

angelim.n.01 - any of several tropical American trees of the genus Andira

angiospermous_tree.n.01 - any tree having seeds and ovules contained in the ovary

anise_tree.n.01 - any of several evergreen shrubs and small trees of the genus Illicium

arbor.n.01 - tree (as opposed to shrub)

aroeira_blanca.n.01 - small resinous tree or shrub of Brazil



In [None]:
hypernyms = tree.hypernyms()
print(hypernyms)

[Synset('woody_plant.n.01')]


In [None]:
hypernym_paths = tree.hypernym_paths()
print(' -> '.join(synset.name() for synset in hypernym_paths[0]))

entity.n.01 -> physical_entity.n.01 -> object.n.01 -> whole.n.02 -> living_thing.n.01 -> organism.n.01 -> plant.n.02 -> vascular_plant.n.01 -> woody_plant.n.01 -> tree.n.01


In [None]:
member_holonyms = tree.member_holonyms()    
print('Total Member Holonyms:', len(member_holonyms))
print('Member Holonyms for [tree]:-')
for holonym in member_holonyms:
    print(holonym.name(), '-', holonym.definition())
    print()

Total Member Holonyms: 1
Member Holonyms for [tree]:-
forest.n.01 - the trees and other plants in a large densely wooded area



In [None]:
part_meronyms = tree.part_meronyms()
print('Total Part Meronyms:', len(part_meronyms))
print('Part Meronyms for [tree]:-')
for meronym in part_meronyms:
    print(meronym.name(), '-', meronym.definition())
    print()

Total Part Meronyms: 5
Part Meronyms for [tree]:-
burl.n.02 - a large rounded outgrowth on the trunk or branch of a tree

crown.n.07 - the upper branches and leaves of a tree or other plant

limb.n.02 - any of the main branches arising from the trunk or a bough of a tree

stump.n.01 - the base part of a tree that remains standing after the tree has been felled

trunk.n.01 - the main stem of a tree; usually covered with bark; the bole is usually the part that is commercially useful for lumber



**How do these different forms of semantic relationships help us understand the meaning within text, as well as decrease semantic ambiguity of terms with multiple meanings?**

**Describe an example situation where semantic ambiguity could be problematic in NLP, NLU or text mining in general.**

## Semantic Relationships \& Similarities 

In [None]:
tree = wn.synset('tree.n.01')
lion = wn.synset('lion.n.01')
tiger = wn.synset('tiger.n.02')
cat = wn.synset('cat.n.01')
dog = wn.synset('dog.n.01')

# create entities and extract names and definitions
entities = [tree, lion, tiger, cat, dog]
entity_names = [entity.name().split('.')[0] for entity in entities]
entity_definitions = [entity.definition() for entity in entities]

# print entities and their definitions
for entity, definition in zip(entity_names, entity_definitions):
    print(entity, '-', definition)
    print()

tree - a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms

lion - large gregarious predatory feline of Africa and India having a tawny coat with a shaggy mane in the male

tiger - large feline of forests in most of Asia having a tawny coat with black stripes; endangered

cat - feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats

dog - a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds



In [None]:
common_hypernyms = []
for entity in entities:
    # get pairwise lowest common hypernyms
    common_hypernyms.append([entity.lowest_common_hypernyms(compared_entity)[0]
                                            .name().split('.')[0]
                             for compared_entity in entities])
    

# build pairwise lowest common hypernym matrix
common_hypernym_frame = pd.DataFrame(common_hypernyms,
                                     index=entity_names, 
                                     columns=entity_names)
common_hypernym_frame

| | tree | lion | tiger | cat | dog |
|---|---|---|---|---|---|
| tree | tree | organism | organism | organism | organism |
| lion | organism | lion | big_cat | feline | carnivore |
| tiger | organism | big_cat | tiger | feline | carnivore |
| cat | organism | feline | feline | cat | carnivore |
| dog | organism | carnivore | carnivore | carnivore | dog |


In [None]:
similarities = []
for entity in entities:
    # get pairwise similarities
    similarities.append([round(entity.path_similarity(compared_entity), 2)
                         for compared_entity in entities])
    

# build pairwise similarity matrix                             
similarity_frame = pd.DataFrame(similarities,
                                index=entity_names, 
                                columns=entity_names)
similarity_frame

| | tree | lion | tiger | cat | dog |
|---|---|---|---|---|---|
| tree | 1.0 | 0.07 | 0.07 | 0.08 | 0.12 |
| lion | 0.07 | 1.0 | 0.33 | 0.25 | 0.17 |
| tiger | 0.07 | 0.33 | 1.0 | 0.25 | 0.17 |
| cat | 0.08 | 0.25 | 0.25 | 1.0 | 0.2 |
| dog | 0.12 | 0.17 | 0.17 | 0.2 | 1.0 |
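
Path similarity scores a pair of synsets as 1 / (1 + d), where d is the number of edges on the shortest path between them in the hypernym hierarchy. The sketch below illustrates that arithmetic on a small hand-written tree. The `PARENT` links are a simplified assumption for illustration (WordNet itself is a graph in which a synset can have several hypernyms), and the function names are hypothetical helpers, not NLTK calls.

```python
# Hand-written child -> parent links: a simplified slice of the noun
# hierarchy (an assumption for illustration, not queried from WordNet)
PARENT = {
    'lion': 'big_cat',
    'tiger': 'big_cat',
    'big_cat': 'feline',
    'cat': 'feline',
    'feline': 'carnivore',
    'dog': 'canine',
    'canine': 'carnivore',
}

def path_to_root(node):
    """Return [node, parent, grandparent, ...] up to the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def lowest_common_hypernym(a, b):
    """First node on a's path to the root that also lies on b's path."""
    ancestors_b = set(path_to_root(b))
    return next(n for n in path_to_root(a) if n in ancestors_b)

def toy_path_similarity(a, b):
    """1 / (1 + number of edges on the shortest path between a and b)."""
    lch = lowest_common_hypernym(a, b)
    distance = path_to_root(a).index(lch) + path_to_root(b).index(lch)
    return 1 / (1 + distance)

for pair in [('lion', 'tiger'), ('lion', 'cat'), ('cat', 'dog'), ('lion', 'dog')]:
    print(pair, lowest_common_hypernym(*pair), round(toy_path_similarity(*pair), 2))
```

On this toy tree the four pairs score 0.33, 0.25, 0.2, and 0.17, which agrees with the corresponding cells of the matrix above: the deeper the lowest common hypernym sits, the shorter the path and the higher the similarity.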


**Here we determine word similarity based on overlap in hypernyms, or classes. Are there other ways we could have grouped these words based on their meanings or associations? If so, how?**

## Word Sense Disambiguation

In [None]:
from nltk.wsd import lesk
from nltk import word_tokenize

# sample text and word to disambiguate
samples = [('The fruits on that plant have ripened', 'n'),
           ('He finally reaped the fruit of his hard work as he won the race', 'n')]

# perform word sense disambiguation
word = 'fruit'
for sentence, pos_tag in samples:
    word_syn = lesk(word_tokenize(sentence.lower()), word, pos_tag)
    print('Sentence:', sentence)
    print('Word synset:', word_syn)
    print('Corresponding definition:', word_syn.definition())
    print()

Sentence: The fruits on that plant have ripened
Word synset: Synset('fruit.n.01')
Corresponding definition: the ripened reproductive body of a seed plant

Sentence: He finally reaped the fruit of his hard work as he won the race
Word synset: Synset('fruit.n.03')
Corresponding definition: the consequence of some effort or action



In [None]:
# sample text and word to disambiguate
samples = [('Lead is a very soft, malleable metal', 'n'),
           ('John is the actor who plays the lead in that movie', 'n'),
           ('This road leads to nowhere', 'v')]

word = 'lead'

# perform word sense disambiguation
for sentence, pos_tag in samples:
    word_syn = lesk(word_tokenize(sentence.lower()), word, pos_tag)
    print('Sentence:', sentence)
    print('Word synset:', word_syn)
    print('Corresponding definition:', word_syn.definition())
    print()

Sentence: Lead is a very soft, malleable metal
Word synset: Synset('lead.n.02')
Corresponding definition: a soft heavy toxic malleable metallic element; bluish white when freshly cut but tarnishes readily to dull grey

Sentence: John is the actor who plays the lead in that movie
Word synset: Synset('star.n.04')
Corresponding definition: an actor who plays a principal role

Sentence: This road leads to nowhere
Word synset: Synset('run.v.23')
Corresponding definition: cause something to pass or lead somewhere
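
The core idea behind Lesk is gloss overlap: choose the sense whose dictionary definition shares the most words with the sentence. The sketch below is a simplified illustration of that idea, not NLTK's implementation. It assumes only two candidate senses, with glosses copied from the definitions printed above, and `GLOSSES`, `tokens`, and `toy_lesk` are hypothetical names.

```python
import re

# Glosses copied from the definitions printed above; a real run would
# compare every candidate synset, not just these two (assumption)
GLOSSES = {
    'lead.n.02': 'a soft heavy toxic malleable metallic element; bluish white '
                 'when freshly cut but tarnishes readily to dull grey',
    'star.n.04': 'an actor who plays a principal role',
}

def tokens(text):
    """Lowercase a string and return its set of alphabetic tokens."""
    return set(re.findall(r'[a-z]+', text.lower()))

def toy_lesk(sentence, glosses):
    """Return the sense whose gloss shares the most words with the sentence."""
    context = tokens(sentence)
    return max(glosses, key=lambda sense: len(context & tokens(glosses[sense])))

print(toy_lesk('Lead is a very soft, malleable metal', GLOSSES))
print(toy_lesk('John is the actor who plays the lead in that movie', GLOSSES))
```

The first sentence matches `lead.n.02` on "a", "soft", and "malleable", and the second matches `star.n.04` on "actor", "who", and "plays". Note that a stop word such as "a" counts as much as a content word here, and short glosses offer little to overlap with, which hints at why overlap-based disambiguation can be fragile.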



**How does the Lesk algorithm work, and what are its shortcomings? What other approaches to word sense disambiguation can you think of?**