<a href="https://colab.research.google.com/github/cmb170230/NLP-Portfolio/blob/main/CS_4395_WordNet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [21]:
!pip install nltk
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

##WordNet, Briefly

WordNet is a lexical database of English that groups nouns, verbs, adjectives, and adverbs into sets of synonyms called 'synsets' and links them through semantic and lexical relations, so that distance in the resulting hierarchy reflects how closely related two concepts are. Synsets are connected via higher/lower order relations (hypernym/hyponym), part/whole relations (meronym/holonym), and, for verbs, manner-of-action relations (troponym).

In [22]:
from nltk.corpus import wordnet as wn
bechSyn = wn.synsets('bechamel')
print(bechSyn)

print(bechSyn[0].definition())
print(bechSyn[0].examples())
print(bechSyn[0].lemmas(), "\n")

bechHyp = bechSyn[0].hypernyms()[0]
entity = wn.synset('entity.n.01')
while bechHyp:
  print(bechHyp)
  # stop at the noun root, or when there is nothing further up (avoids an endless loop)
  if bechHyp == entity or not bechHyp.hypernyms():
    break
  bechHyp = bechHyp.hypernyms()[0]



[Synset('white_sauce.n.01')]
milk thickened with a butter and flour roux
[]
[Lemma('white_sauce.n.01.white_sauce'), Lemma('white_sauce.n.01.bechamel_sauce'), Lemma('white_sauce.n.01.bechamel')] 

Synset('sauce.n.01')
Synset('condiment.n.01')
Synset('flavorer.n.01')
Synset('ingredient.n.03')
Synset('foodstuff.n.02')
Synset('food.n.01')
Synset('substance.n.07')
Synset('matter.n.03')
Synset('physical_entity.n.01')
Synset('entity.n.01')


Regarding the hypernym structure, everything from 'entity' down to 'food' follows a fairly straightforward logical path, but below that the ordering of some categories is less clear. With 'foodstuff' as a hyponym of 'food', what could be food without counting as a foodstuff, and if nothing qualifies, why are the levels distinct? The condiment-sauce and ingredient-flavorer hypernym/hyponym relationships also stand out as ambiguous: one could argue for either direction, and it is easy to imagine words with even more ambiguous relations, so how are these structures finally decided? Overall, though, the increase in specificity along the path displayed here seems like a reasonable balance: it is sufficiently compartmentalized without creating an overly complex network that would be too computationally expensive to parse.

In [23]:
print("Hypernyms:")
print(bechSyn[0].hypernyms())
print("Hyponyms:")
print(bechSyn[0].hyponyms())
print("Meronyms:")
print(bechSyn[0].member_meronyms())
print("Holonyms:")
print(bechSyn[0].member_holonyms())
print("Antonyms")
print(bechSyn[0].lemmas()[0].antonyms())


Hypernyms:
[Synset('sauce.n.01')]
Hyponyms:
[Synset('blanc.n.01'), Synset('cheese_sauce.n.01'), Synset('cream_sauce.n.01')]
Meronyms:
[]
Holonyms:
[]
Antonyms
[]


In [24]:
flaSyn = wn.synsets('flambe', pos=wn.VERB)
print(flaSyn)

print(flaSyn[0].definition())
print(flaSyn[0].examples())
print(flaSyn[0].lemmas(), "\n")

flaHyp = flaSyn[0].hypernyms()[0]
root = wn.synset('make.v.03')  # top of this verb's hypernym chain
while flaHyp:
  print(flaHyp)
  # stop at the chosen root, or when there is nothing further up
  if flaHyp == root or not flaHyp.hypernyms():
    break
  flaHyp = flaHyp.hypernyms()[0]

[Synset('flambe.v.01')]
pour liquor over and ignite (a dish)
[]
[Lemma('flambe.v.01.flambe')] 

Synset('cook.v.02')
Synset('create_from_raw_material.v.01')
Synset('make.v.03')


The structure of verbs in WordNet appears to be more diverse, with several different options for the highest-level hypernym rather than a single root. Before this word, I also tried 'juggle' to see whether a more general verb would produce a longer chain of hypernyms; its chain was longer than the one for 'flambe', but not nearly as long as the chain for the noun form of 'juggle'. Based on these experiments, the verb hierarchy in WordNet appears to be shallower and less rigorously organized than the noun hierarchy. For the selected example in particular, I would have guessed that there would be at least one more hypernym between 'flambe' and 'cook', but I assume that uncommon words are deprioritized to keep the model's complexity down.

In [25]:
print(wn.morphy('flambed', pos=wn.VERB))
print(wn.morphy('flambe', pos=wn.ADJ))
print(wn.morphy('flambe', pos=wn.NOUN))

flambe
None
None


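For such a specific word, morphy can only recover the base form from an inflected verb form: 'flambed' maps back to 'flambe', while the adjective and noun lookups return None because WordNet has no adjective or noun entry for 'flambe'.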
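To make the morphy behavior more concrete, here is a minimal sketch of suffix-based lemmatization. It is not NLTK's implementation: the rule list is an assumed subset modeled on WordNet's verb detachment rules, `toy_morphy` is a hypothetical helper, and the three-word lexicon is a stand-in for WordNet's verb index. The real morphy also consults an exception list for irregular forms.

```python
# Suffix rules modeled on WordNet's verb detachment rules (assumed subset).
VERB_RULES = [
    ("ies", "y"), ("es", "e"), ("es", ""),
    ("ed", "e"), ("ed", ""),
    ("ing", "e"), ("ing", ""),
    ("s", ""),
]

# Hypothetical stand-in for the set of verb lemmas WordNet knows about.
TOY_VERB_LEXICON = {"flambe", "cook", "juggle"}


def toy_morphy(form, lexicon=TOY_VERB_LEXICON, rules=VERB_RULES):
    """Return a base form that exists in the lexicon, or None."""
    if form in lexicon:
        return form
    for suffix, replacement in rules:
        if form.endswith(suffix):
            # Strip the suffix, attach the replacement, and check the lexicon.
            candidate = form[: len(form) - len(suffix)] + replacement
            if candidate in lexicon:
                return candidate
    return None


print(toy_morphy("flambed"))   # flambe
print(toy_morphy("juggling"))  # juggle
print(toy_morphy("flambeau"))  # None
```

A candidate is only accepted if it exists in the lexicon for the requested part of speech, which is consistent with the adjective and noun lookups above returning None.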

In [26]:
def traceHypernyms(syn):
  # Print the chain of first hypernyms from syn up to the root of its hierarchy.
  # The loop ends once a synset has no hypernyms, so it cannot run forever.
  hyp = syn
  while hyp.hypernyms():
    hyp = hyp.hypernyms()[0]
    print(hyp)

In [27]:
from nltk.wsd import lesk
word1 = 'pilot'
word2 = 'captain'

syn1 = wn.synsets(word1, pos=wn.NOUN)
syn2 = wn.synsets(word2, pos=wn.NOUN)
similar1 = syn1[0]
similar2 = syn2[3]

print(similar1.definition())
print(similar1.lemmas())
traceHypernyms(similar1)
print()
print(similar2.definition())
print(similar2.lemmas())
traceHypernyms(similar2)

print("\nWu-Palmer Similarity: ", wn.wup_similarity(similar1, similar2), "\n")
sent = ['The', 'pilot', 'had', 'to', 'get', 'the', 'cargo', 'shipment', 'to', 'the', 'captain', 'on', 'time', '.']
sent2 = ['The', 'airplane', 'pilot', 'had', 'to', 'get', 'the', 'cargo', 'shipment', 'to', 'the', 'captain', 'on', 'time', '.']
sent3 = ['The', 'airplane', 'pilot', 'had', 'to', 'get', 'the', 'cargo', 'shipment', 'to', 'the', 'airport', 'for', 'the', 'captain', 'to', 'leave', 'the', 'dock', 'on', 'time', '.']

lesk1 = lesk(sent3, 'pilot', 'n')
lesk2 = lesk(sent3, 'captain', 'n')
print(lesk1)
print(lesk1.definition())
print(lesk2)
print(lesk2.definition())


someone who is licensed to operate an aircraft in flight
[Lemma('pilot.n.01.pilot'), Lemma('pilot.n.01.airplane_pilot')]
Synset('aviator.n.01')
Synset('skilled_worker.n.01')
Synset('worker.n.01')
Synset('person.n.01')
Synset('causal_agent.n.01')
Synset('physical_entity.n.01')
Synset('entity.n.01')

an officer who is licensed to command a merchant ship
[Lemma('master.n.07.master'), Lemma('master.n.07.captain'), Lemma('master.n.07.sea_captain'), Lemma('master.n.07.skipper')]
Synset('officer.n.04')
Synset('mariner.n.01')
Synset('sailor.n.01')
Synset('skilled_worker.n.01')
Synset('worker.n.01')
Synset('person.n.01')
Synset('causal_agent.n.01')
Synset('physical_entity.n.01')
Synset('entity.n.01')

Wu-Palmer Similarity:  0.5 

Synset('fender.n.02')
an inclined metal frame at the front of a locomotive to clear the track
Synset('captain.n.06')
the pilot in charge of an airship


The Wu-Palmer similarity score makes sense for the selected synsets: the manually printed paths show that they share most of their upper hypernym chain, from 'skilled_worker' up to 'entity'. At first glance I would have expected a higher score, but I suspect the extra length of captain's hypernym path is what lowers it. The results of the Lesk algorithm are more interesting, as I chose the context sentence to be slightly ambiguous without being enigmatic. The first context sentence produced a definition for pilot that is not entirely nonsensical, although it is quite a stretch, especially considering the sense chosen for captain. Curious whether it would respond to additional help, I added a word to the sentence hoping that it would find the correct definition for pilot, but the result was the same, which I find baffling. The third sentence tries to give it as much help as possible, and yet the results do not change. The airship definition for captain still makes sense in all three contexts given the suggestion of aerial transport, but I can't think of anything that indicates railway travel, especially in the last two sentences.
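One possible explanation for the unchanged pilot result is the way a simplified Lesk scores senses: it counts how many context tokens also appear in each sense's definition, with no stop-word filtering. The sketch below is a minimal version of that overlap count, not NLTK's `lesk` itself, which may consider more candidate senses and break ties differently. It uses only the two definitions printed above; `gloss_overlap` and `FUNCTION_WORDS` are hypothetical names introduced for illustration.

```python
def gloss_overlap(context_tokens, gloss, ignore=frozenset()):
    """Count distinct context tokens that also occur in the gloss."""
    context = set(context_tokens) - ignore
    return len(context & set(gloss.split()))


sent3 = ['The', 'airplane', 'pilot', 'had', 'to', 'get', 'the', 'cargo',
         'shipment', 'to', 'the', 'airport', 'for', 'the', 'captain', 'to',
         'leave', 'the', 'dock', 'on', 'time', '.']

# Definitions copied from the notebook output above.
glosses = {
    'pilot.n.01': 'someone who is licensed to operate an aircraft in flight',
    'fender.n.02': 'an inclined metal frame at the front of a locomotive '
                   'to clear the track',
}

# Raw overlap: function words such as 'the' and 'to' count as evidence.
raw = {name: gloss_overlap(sent3, g) for name, g in glosses.items()}
print(raw)

# Same count with a small, hand-picked set of function words removed.
FUNCTION_WORDS = frozenset({'The', 'the', 'to', 'for', 'on', 'had', '.'})
filtered = {name: gloss_overlap(sent3, g, FUNCTION_WORDS)
            for name, g in glosses.items()}
print(filtered)
```

Under these assumptions the locomotive sense wins only because its definition contains both 'the' and 'to', while the aircraft sense shares only 'to'. Adding 'airplane' or 'airport' to the sentence does not help, because the aircraft definition uses the word 'aircraft' instead, so once function words are ignored neither sense has any overlap at all.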

In [28]:
nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn

[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!


##SentiWordNet, Briefly

SentiWordNet is a sentiment analysis resource built on top of WordNet that uses WordNet's synset structure to assign each synset scores for positivity, negativity, and objectivity. It seems ideal for cases where one is already using WordNet to analyze a corpus and wants a quick idea of the tone of the sentences, where the ease of use is a reasonable tradeoff against other, more powerful tools.

In [29]:
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [30]:
chargeWord = 'disdain'
neg = pos = 0
chargeSyn = list(swn.senti_synsets(chargeWord))
for syn in chargeSyn:
  print(syn)
  print("Positive score = ", syn.pos_score())
  print("Negative score = ", syn.neg_score())
  print("Objective score = ", syn.obj_score(), "\n")

<contempt.n.01: PosScore=0.0 NegScore=0.625>
Positive score =  0.0
Negative score =  0.625
Objective score =  0.375 

<condescension.n.02: PosScore=0.375 NegScore=0.0>
Positive score =  0.375
Negative score =  0.0
Objective score =  0.625 

<contemn.v.01: PosScore=0.0 NegScore=0.375>
Positive score =  0.0
Negative score =  0.375
Objective score =  0.625 

<reject.v.04: PosScore=0.0 NegScore=0.25>
Positive score =  0.0
Negative score =  0.25
Objective score =  0.75 



Regarding the analysis of a single word, the high objectivity scores for the sentisynsets of 'disdain', which to me is a highly negatively charged word, are surprising: even the most negatively rated sense, contempt.n.01, only reaches a negative score of 0.625. This makes me wonder whether these values are better interpreted as the probability of seeing the word in a given context than as a scaled measure of sentiment intensity. That seems doubtful given the score for condescension, however: in what possible context could it be used positively? Despite the seemingly dubious scoring of any given individual word, I imagine that averaged over an entire document the scores could give at least a good enough idea of the tone of the text. I can also imagine some kind of sentiment analysis helping to guide a chatbot such as ChatGPT, allowing it to modify its output if the user seems to be becoming frustrated with it.
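It may help to note that the objectivity score does not appear to be an independent measurement: in the printed output it is always the remainder after the positive and negative scores are subtracted from 1. The short check below verifies that against the four senses printed above; `objectivity` is a hypothetical helper, and the numbers are copied from the output rather than looked up again.

```python
# (sense, positive, negative, objective) copied from the output above.
printed = [
    ("contempt.n.01", 0.0, 0.625, 0.375),
    ("condescension.n.02", 0.375, 0.0, 0.625),
    ("contemn.v.01", 0.0, 0.375, 0.625),
    ("reject.v.04", 0.0, 0.25, 0.75),
]


def objectivity(pos, neg):
    """Objectivity as whatever is left after positive and negative mass."""
    return 1.0 - (pos + neg)


for name, pos, neg, obj in printed:
    # Each line shows the derived value next to the printed one.
    print(name, objectivity(pos, neg), obj)
```

Read this way, a high objectivity score means only that little polarity was assigned to the sense, which is one reason a strongly negative word can still look mostly "objective".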

In [31]:
sentiSent = "The crabby and overworked advisor tore the poor graduate student's thesis to shreds."
sentiTok = word_tokenize(sentiSent)
overPos = overNeg = overObj = count = 0
for tok in sentiTok:
  syn = list(swn.senti_synsets(tok))
  if syn:
    count +=1
    syn = syn[0]
    print(syn)
    print("Positive score = ", syn.pos_score())
    overPos += syn.pos_score()
    print("Negative score = ", syn.neg_score())
    overNeg += syn.neg_score()
    print("Objective score = ", syn.obj_score(), "\n")
    overObj += syn.obj_score()
print("Overall Rating of the Sentence: \nPositive:", overPos/count,"Negative:", overNeg/count, "Objective:", overObj/count)

<crabbed.s.01: PosScore=0.0 NegScore=0.625>
Positive score =  0.0
Negative score =  0.625
Objective score =  0.375 

<overwork.v.01: PosScore=0.0 NegScore=0.0>
Positive score =  0.0
Negative score =  0.0
Objective score =  1.0 

<adviser.n.01: PosScore=0.0 NegScore=0.0>
Positive score =  0.0
Negative score =  0.0
Objective score =  1.0 

<torus.n.02: PosScore=0.0 NegScore=0.0>
Positive score =  0.0
Negative score =  0.0
Objective score =  1.0 

<poor_people.n.01: PosScore=0.0 NegScore=0.0>
Positive score =  0.0
Negative score =  0.0
Objective score =  1.0 

<alumnus.n.01: PosScore=0.0 NegScore=0.0>
Positive score =  0.0
Negative score =  0.0
Objective score =  1.0 

<student.n.01: PosScore=0.0 NegScore=0.0>
Positive score =  0.0
Negative score =  0.0
Objective score =  1.0 

<thesis.n.01: PosScore=0.0 NegScore=0.0>
Positive score =  0.0
Negative score =  0.0
Objective score =  1.0 

<shred.n.01: PosScore=0.125 NegScore=0.0>
Positive score =  0.125
Negative score =  0.0
Objective score 

Regarding the performance on an entire sentence, the most immediately noticeable aspect is the mismatch between some words and the sentisynsets associated with them. Because the code judges sentiment token by token and simply takes the first sentisynset returned for each token, it misses the collocation 'graduate student' and selects alumnus for 'graduate'. The most surprising mismatch is between 'tore', the past tense of 'tear', and the geometrical shape of the torus; I would expect some measure of word likelihood to resolve an ambiguous case like this, especially as 'tore' in the sense of a torus is an archaic term. As for the sentiment ratings, the approach was at least able to detect that the sentence was more negative than positive, but it could not pick up on the figurative language in 'tore ... to shreds', assigning those words a positive score on average. The sentence overall was ruled to be very highly objective, which I doubt any human reader would agree with, highlighting the limitations of this simplistic approach.
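To see how the averaging dilutes the one strongly negative token, the per-token scores printed above can be averaged by hand. This is only a sketch under two assumptions: that the nine senses shown are the only tokens that received a sentisynset, and that each objectivity score equals 1 - (positive + negative), since the last objectivity value is cut off in the output.

```python
# (first sentisynset per token, positive, negative) copied from the output above.
first_senses = [
    ("crabbed.s.01", 0.0, 0.625),
    ("overwork.v.01", 0.0, 0.0),
    ("adviser.n.01", 0.0, 0.0),
    ("torus.n.02", 0.0, 0.0),
    ("poor_people.n.01", 0.0, 0.0),
    ("alumnus.n.01", 0.0, 0.0),
    ("student.n.01", 0.0, 0.0),
    ("thesis.n.01", 0.0, 0.0),
    ("shred.n.01", 0.125, 0.0),
]

count = len(first_senses)
avg_pos = sum(pos for _, pos, _ in first_senses) / count
avg_neg = sum(neg for _, _, neg in first_senses) / count
# Assumption: objectivity is the remainder, so the averages also sum to 1.
avg_obj = 1.0 - (avg_pos + avg_neg)

print(round(avg_pos, 3), round(avg_neg, 3), round(avg_obj, 3))
```

Seven of the nine senses carry no polarity at all, so a single score of 0.625 is spread thin across the sentence.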

##Collocations, Briefly

Collocations are sets of two or more words that carry a specific meaning when used together, distinct from the meaning of the words used independently, and that cannot be reproduced by substituting synonyms. Collocations can be uncovered by searching for pairs of words, or bigrams, that co-occur more often than chance would predict, which can be measured with point-wise mutual information (PMI).
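As a minimal sketch of the measure, PMI for an adjacent pair can be written as log2(P(w1 w2) / (P(w1) * P(w2))), with each probability estimated from token counts. The function name `pmi` and the toy sentence below are made up for illustration, and normalizing the bigram count by the number of adjacent pairs is one common choice rather than the only one.

```python
import math
from collections import Counter


def pmi(tokens, w1, w2):
    """PMI of the adjacent pair (w1, w2), estimated from token counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    p_xy = bigrams[(w1, w2)] / (n - 1)  # there are n - 1 adjacent pairs
    p_x = unigrams[w1] / n
    p_y = unigrams[w2] / n
    if p_xy == 0 or p_x == 0 or p_y == 0:
        return float("-inf")  # the pair or one of its words never occurs
    return math.log2(p_xy / (p_x * p_y))


# Toy corpus, invented for this example.
toy = ("the united states and the people of the united states "
       "thank the people").split()

print(pmi(toy, "united", "states"))
print(pmi(toy, "the", "people"))
```

Both pairs occur twice, but 'united states' scores higher because 'united' and 'states' rarely appear apart, whereas 'the' occurs in many other contexts.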

In [43]:
from nltk.collocations import *
nltk.download('gutenberg')
nltk.download('genesis')
nltk.download('inaugural')
nltk.download('nps_chat')
nltk.download('webtext')
nltk.download('treebank')
from nltk.book import text4
nltk.download('stopwords')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package genesis to /root/nltk_data...
[nltk_data]   Package genesis is already up-to-date!
[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!
[nltk_data] Downloading package nps_chat to /root/nltk_data...
[nltk_data]   Package nps_chat is already up-to-date!
[nltk_data] Downloading package webtext to /root/nltk_data...
[nltk_data]   Package webtext is already up-to-date!
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [38]:
text4.collocations()  # prints the collocations directly and returns None

United States; fellow citizens; years ago; four years; Federal
Government; General Government; American people; Vice President; God
bless; Chief Justice; one another; fellow Americans; Old World;
Almighty God; Fellow citizens; Chief Magistrate; every citizen; Indian
tribes; public debt; foreign nations


In [39]:
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def getPotentialColloc(text, scoreFunct = BigramAssocMeasures.chi_sq, n=200):
  # Score every bigram in the text and keep the n highest-scoring ones.
  finder = BigramCollocationFinder.from_words(text)
  bigrams = finder.nbest(scoreFunct, n)
  return {bigram: True for bigram in bigrams}

In [41]:
text4colloc = getPotentialColloc(text4)
print(text4colloc)

{('/', '11'): True, ('25', 'straight'): True, ('Amelia', 'Island'): True, ('Apollo', 'astronauts'): True, ('Archibald', 'MacLeish'): True, ('BUSINESS', 'COOPERATION'): True, ('Barbary', 'Powers'): True, ('Belleau', 'Wood'): True, ('Boston', 'lawyer'): True, ('Britannic', 'Majesty'): True, ('COOPERATION', 'BY'): True, ('CRIMINAL', 'JUSTICE'): True, ('Calvin', 'Coolidge'): True, ('Cape', 'Horn'): True, ('Cardinal', 'Bernardin'): True, ('Chop', 'Hill'): True, ('Chosin', 'Reservoir'): True, ('Christmas', 'Eve'): True, ('Colonel', 'Goethals'): True, ('Dark', 'pictures'): True, ('Domestic', 'Product'): True, ('EIGHTEENTH', 'AMENDMENT'): True, ('Emancipation', 'Proclamation'): True, ('English', 'writer'): True, ('Fort', 'Sumter'): True, ('Gatun', 'dam'): True, ('Golden', 'Rule'): True, ('Gross', 'Domestic'): True, ('Growing', 'connections'): True, ('Hague', 'Tribunal'): True, ('Herein', 'flows'): True, ('Holy', 'Writ'): True, ('Hope', 'maketh'): True, ('Information', 'Age'): True, ('Iwo', 'Ji

In [60]:
import math
t4citer = list(text4colloc)
t4colloc = list()

# Note: the probabilities below are normalized by vocabulary size, and str.count
# matches substrings (a word also matches inside longer words), so the scores
# are rough estimates rather than exact PMI values.
vocab = len(set(text4))

text4t = ' '.join(text4.tokens)

for bigram in t4citer:
  # keep only bigrams made of two alphabetic tokens
  if(bigram[0].isalpha() and bigram[1].isalpha()):
    w1 = bigram[0]
    w2 = bigram[1]
    bigr = text4t.count(w1 + ' ' + w2)/vocab
    w1c = text4t.count(w1)/vocab
    w2c = text4t.count(w2)/vocab
    if(bigr != 0 and w1c != 0 and w2c != 0):
      # PMI = log2(P(w1 w2) / (P(w1) * P(w2)))
      pmi = math.log2(bigr/(w1c*w2c))
      if(pmi > 0):
        t4colloc.append([w1 + ' ' + w2, pmi])

for colloc in t4colloc:
  print(colloc)

['Amelia Island', 11.706352115508489]
['Apollo astronauts', 13.291314616229645]
['Archibald MacLeish', 13.291314616229645]
['BUSINESS COOPERATION', 13.291314616229645]
['Barbary Powers', 13.291314616229645]
['Belleau Wood', 13.291314616229645]
['Boston lawyer', 13.291314616229645]
['Britannic Majesty', 13.291314616229645]
['COOPERATION BY', 13.291314616229645]
['CRIMINAL JUSTICE', 13.291314616229645]
['Calvin Coolidge', 13.291314616229645]
['Cape Horn', 13.291314616229645]
['Cardinal Bernardin', 13.291314616229645]
['Chop Hill', 13.291314616229645]
['Chosin Reservoir', 13.291314616229645]
['Christmas Eve', 8.384424020621127]
['Colonel Goethals', 13.291314616229645]
['Dark pictures', 13.291314616229645]
['Domestic Product', 13.291314616229645]
['EIGHTEENTH AMENDMENT', 13.291314616229645]
['Emancipation Proclamation', 13.291314616229645]
['English writer', 11.706352115508489]
['Fort Sumter', 10.483959694172041]
['Gatun dam', 9.04338710278606]
['Golden Rule', 11.706352115508489]
['Gross D

From skimming the output of the collocation finder, it appears to cast a wide net, grabbing many phrases that do not fit the definition of a collocation: many are names, and others are phrases whose words could easily be substituted with synonyms. I suspect that the somewhat restricted range of language that comes with the formal nature of an inaugural address may violate an assumption made when deriving the point-wise mutual information formula, but the result could also simply be due to the relatively simplistic nature of the technique.
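One pattern in the output points to a further possible cause: many bigrams share exactly the same score, 13.291314616229645. A plausible reading, sketched below, is that each of those bigrams and both of its words were counted exactly once, in which case the score collapses to log2 of the normalizing constant no matter what the words are. The helper `pmi_from_counts` and the counts in the last two calls are hypothetical; only the repeated score is taken from the output.

```python
import math


def pmi_from_counts(pair_count, w1_count, w2_count, norm):
    """PMI when every count is divided by the same constant `norm`."""
    p_pair = pair_count / norm
    return math.log2(p_pair / ((w1_count / norm) * (w2_count / norm)))


shared_score = 13.291314616229645  # value repeated in the output above
# If all three counts are 1, PMI = log2(norm), so norm = 2 ** score.
implied_norm = 2 ** shared_score
print(round(implied_norm))

# A pair seen once, built from words seen once, gets the maximum score ...
print(pmi_from_counts(1, 1, 1, 10_000))
# ... while a frequent pair of frequent words scores much lower.
print(pmi_from_counts(50, 100, 100, 10_000))
```

If this reading is right, the measure rewards rarity: names and one-off phrases that appear a single time rank at the top, while a well-attested collocation made of common words ranks lower. A minimum frequency threshold on the bigrams is one common way to counter this.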
