Farishah Nahrin

CS 6301.M02 - Special Topics in Computer Science

NLP

# Portfolio Assignment: WordNet

Code was inspired by Dr. Karen Mazidi's GitHub repository: https://github.com/kjmazidi/NLP/tree/master/Part_2-Words/Chapter_07_WordNet

## What is WordNet?

WordNet is a "lexical database of nouns, verbs, adjectives and adverbs that provides short definitions called *glosses*, and use examples." WordNet started as a project at Princeton University, organized by George Miller. The primary premise of the project was "to support theories of human semantic memory, which suggested that people organize concepts mentally in some kind of hierarchy." In NLTK, WordNet is exposed as a corpus reader and is useful for NLP tasks such as machine translation, text similarity, word sense disambiguation, and thesaurus-style lookup. Words are looked up with `wn.synsets()`, which returns the synsets (synonym sets) a word belongs to; its optional `pos` argument restricts the results to one part of speech. Each synset contains one or more lemmas, each representing a specific sense of a word, so a synset's lemma names give a set of synonyms for that sense.

Source: https://www.nltk.org/howto/wordnet.html, and Chapter 7 - Exploring NLP with Python by Dr. Karen Mazidi

#### Import NLTK libraries for this Project

In [None]:
from nltk.corpus import wordnet as wn
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('gutenberg')
nltk.download('genesis')
nltk.download('inaugural')
nltk.download('nps_chat')
nltk.download('webtext')
nltk.download('treebank')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package genesis to /root/nltk_data...
[nltk_data]   Package genesis is already up-to-date!
[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!
[nltk_data] Downloading package nps_chat to /root/nltk_data...
[nltk_data]   Package nps_chat is already up-to-date!
[nltk_data] Downloading package webtext to /root/nltk_data...
[nltk_data]   Package webtext is already up-to-date!
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]

True

## Select a Noun

Output all synsets. Select one synset from the list of synsets. Extract its definition, usage examples, and lemmas. From your selected synset, traverse up the WordNet hierarchy as far as you can, outputting the synsets as you go. 

In [None]:
def noun_details(noun: str = None):
    #Stop early if the input is missing or not a string
    if not isinstance(noun, str):
        print("The noun must be of 'str' type!")
        return
    #Look up all synsets of the word (no POS filter, so verb senses are listed too)
    noun_synsets = wn.synsets(noun)
    if not noun_synsets:
        print(f"No synsets found for '{noun}'.")
        return
    print("\nSynsets are:")
    print(*noun_synsets, sep='\n')
    #Select the last synset in the list (for 'joy' this is gladden.v.01)
    noun_synset = noun_synsets[-1]
    print(f"Selected Synset is: {noun_synset.name()}")
    #Print synset definition
    print(f'\n"{noun_synset.name()}" is {noun_synset.definition()}')
    #Collect usage examples from every synset of the word, not only the selected one
    usages = []
    for synset in noun_synsets:
        usages.extend(synset.examples())
    print(f'\nThe usage examples are :', *usages, sep='\n')
    #Print lemmas of the selected synset
    print("\nLemmas are:")
    print(*noun_synset.lemma_names(), sep='\n')
    #Print the first hypernym path, from the selected synset up to the root
    print("\nThe hypernyms are:")
    print(*noun_synset.hypernym_paths()[0][::-1], sep='\n')
    #Print synset hyponyms
    print(f"\nThe hyponyms of {noun} are:")
    print(*noun_synset.hyponyms(), sep='\n')
    #Print meronyms (converted to str so join also works on non-empty lists)
    print(f"\nThe meronyms of {noun} are:")
    print('\n'.join(map(str, noun_synset.part_meronyms())) or '[]')
    #Print holonyms
    print(f"\nThe holonyms of {noun} are:")
    print('\n'.join(map(str, noun_synset.part_holonyms())) or '[]')
    #Print the first antonym of the synset's first lemma, if any
    print(f"\nThe antonym of {noun} is:")
    ant = noun_synset.lemmas()[0].antonyms()
    print(ant[0] if ant else '[]')

In [None]:
noun_details('joy')


Synsets are:
Synset('joy.n.01')
Synset('joy.n.02')
Synset('rejoice.v.01')
Synset('gladden.v.01')
Selected Synset is: gladden.v.01

"gladden.v.01" is make glad or happy

The usage examples are :
a joy to behold
the pleasure of his company
the new car is a delight

Lemmas are:
gladden
joy

The hypernyms are:
Synset('gladden.v.01')

The hyponyms of joy are:
Synset('overjoy.v.01')

The meronyms of joy are:
[]

The holonyms of joy are:
[]

The antonym of joy is:
Lemma('sadden.v.01.sadden')


WordNet organizes nouns into a hierarchy of concepts, with each concept represented by a synset. A synset is a set of words that are interchangeable in some context, and each synset has a unique identifier such as `joy.n.01`. Each synset contains lemmas, the individual word forms that share that meaning. WordNet links synsets through semantic relations. For example, a hypernym is a more general concept that encompasses a specific one (e.g., "emotion" is a hypernym of "joy"), while a hyponym is a more specific concept under a general one (e.g., "emotion" is a hyponym of "feeling"). Note that the synset selected in the run above, `gladden.v.01`, is a verb sense of "joy", and its hypernym path contains only the synset itself, so there is no noun hierarchy to climb from it. Following these relations shows how words relate to one another, which is especially helpful for natural language processing tasks such as sentiment analysis and text analysis.
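The hypernym and hyponym relations described above can be sketched without WordNet at all. The block below is a minimal illustration that uses a hand-written toy table containing only the joy, emotion, and feeling examples from the paragraph; the names `HYPERNYM`, `hypernym_chain`, and `hyponyms` are hypothetical and are not part of NLTK.

```python
# Toy hypernym table, hand-written for illustration (not read from WordNet).
# Each key maps to its more general concept; None marks the top of the toy tree.
HYPERNYM = {
    "joy": "emotion",
    "emotion": "feeling",
    "feeling": None,
}


def hypernym_chain(word):
    """Return the word followed by each successively more general concept."""
    chain = []
    while word is not None:
        chain.append(word)
        word = HYPERNYM.get(word)
    return chain


def hyponyms(word):
    """Return the concepts whose direct hypernym is `word`."""
    return sorted(child for child, parent in HYPERNYM.items() if parent == word)


print(hypernym_chain("joy"))   # walks upward: joy, emotion, feeling
print(hyponyms("feeling"))     # direct children of feeling in the toy table
```

In NLTK the same upward walk is what `hypernym_paths()` returns for a synset, with the real hierarchy in place of the toy table.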

## Select a Verb

Output all synsets. Select one synset from the list of synsets. Extract its definition, usage examples, and lemmas. From your selected synset, traverse up the WordNet hierarchy as far as you can, outputting the synsets as you go.

In [None]:
#Look up every synset of 'work' (no POS filter is passed, so noun senses are included)
verb_synsets = wn.synsets('work')
#Selected synset: index 3, which resolves to study.n.02 in this run
verb_synset = verb_synsets[3]
print(f"Selected Synset is: {verb_synset.name()}")
#Show the definition
verb_synset.definition()

Selected Synset is: study.n.02


'applying the mind to learning and understanding a subject (especially by reading)'

In [None]:
#Printing usage examples
verb_synset.examples()

['mastering a second language requires a lot of work',
 'no schools offer graduate study in interior design']

In [None]:
#Printing lemmas
verb_synset.lemma_names()

['study', 'work']

In [None]:
#Traversing
verb_synset.hypernym_paths()[0][::-1]

[Synset('study.n.02'),
 Synset('learning.n.01'),
 Synset('basic_cognitive_process.n.01'),
 Synset('process.n.02'),
 Synset('cognition.n.01'),
 Synset('psychological_feature.n.01'),
 Synset('abstraction.n.06'),
 Synset('entity.n.01')]

WordNet also organizes verbs into hierarchies based on their semantic relationships. Each verb synset represents one concept and groups the verbs that can express it; a verb with several senses appears in several synsets. The organization mirrors that of nouns: each synset contains lemmas that share a meaning, and synsets are linked by concept. Note that the synset selected above, `study.n.02`, is a noun sense of "work", because `wn.synsets('work')` was called without a part-of-speech filter; its hypernym path therefore climbs the noun hierarchy up to `entity.n.01`.

In [None]:
#Use morphy to find the base form of each lemma
for word in verb_synset.lemma_names():
    print(wn.morphy(word))

study
work


## Wu-Palmer Similarity Metric and Lesk Algorithm

In [None]:
#Select two words and take the first synset of each
elaborate = wn.synsets('elaborate')[0]
explain = wn.synsets('explain')[0]
#Compute the Wu-Palmer similarity between the two synsets
wn.wup_similarity(elaborate,explain)

0.8333333333333334
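The score above follows the Wu-Palmer idea: twice the depth of the most specific shared ancestor, divided by the sum of the depths of the two synsets. The sketch below applies that formula to a small hand-written tree, assuming depth is counted in nodes from the root (so the root has depth 1). The tree and all function names are hypothetical, and NLTK's own implementation handles details such as multiple hypernym paths that this sketch ignores.

```python
# Toy single-parent hierarchy, hand-written for illustration (not WordNet data).
PARENT = {
    "entity": None,
    "feeling": "entity",
    "emotion": "feeling",
    "joy": "emotion",
    "sadness": "emotion",
}


def path_to_root(node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def lowest_common_subsumer(a, b):
    """Return the deepest node that is an ancestor of (or equal to) both."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward, so the first hit is the deepest
        if node in ancestors_a:
            return node
    return None


def wu_palmer(a, b):
    """2 * depth(LCS) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    if lcs is None:
        return None
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wu_palmer("joy", "sadness"))  # siblings under "emotion"
print(wu_palmer("joy", "joy"))      # identical nodes score 1.0
```

Siblings deep in the tree score higher than siblings near the root, because their shared ancestor is deeper.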

In [None]:
from nltk.wsd import lesk
#Since Lesk is used with contextual sentences, create a sentence
context_sentence = "I went to the teacher to get some details about the topic".split()
print(lesk(context_sentence, 'elaborate'))

Synset('elaborate.v.01')


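The two techniques answer different questions. Wu-Palmer is a similarity metric: it scores two synsets by how deep their most specific shared ancestor sits in the hypernym hierarchy relative to the depths of the synsets themselves, so it depends only on the WordNet structure. Lesk is a word sense disambiguation algorithm: it picks the sense of a word whose gloss shares the most words with the surrounding context. Lesk therefore tends to work better when the context shares vocabulary with the dictionary definitions, while Wu-Palmer is useful for comparing two senses once they have been chosen.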
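The gloss-overlap idea behind Lesk can be shown in a few lines. This is a simplified sketch, not NLTK's `lesk` implementation: the sense labels and glosses for "bank" are invented for illustration, and the function name `simplified_lesk` is hypothetical.

```python
def simplified_lesk(context_tokens, sense_glosses):
    """Return the sense whose gloss shares the most words with the context.

    `sense_glosses` maps a sense label to its gloss string. Ties keep the
    first sense encountered.
    """
    context = {token.lower() for token in context_tokens}
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Invented glosses for two senses of "bank" (illustration only, not WordNet text)
BANK_GLOSSES = {
    "bank.finance": "a financial institution that accepts deposits of money",
    "bank.river": "sloping land beside a body of water",
}

context = "I sat on the land beside the water".split()
print(simplified_lesk(context, BANK_GLOSSES))
```

Because the choice depends entirely on shared words, a context with no overlap (as can happen with short sentences) falls back to whichever sense is listed first.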

In [None]:
from nltk.corpus import sentiwordnet as swn
import nltk
nltk.download('sentiwordnet')

[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!


True

## SentiWordNet

In [None]:
from nltk.corpus import sentiwordnet as swn
#Creating synsets for loathe
senti_synsets = list(swn.senti_synsets('loathe'))
#Printing score of all senti_synsets
for synset in senti_synsets:
    print(synset.synset.name(), synset.pos_score(), synset.neg_score(), synset.obj_score())

abhor.v.01 0.0 0.25 0.75


In [None]:
from nltk.tokenize import word_tokenize

#Creating sentence with the word, loathe
sentence = "I loathe the taste of carrots."
tokens = word_tokenize(sentence)
#Outputting polarity for each word in sentence
for token in tokens:
    senti_synsets = list(swn.senti_synsets(token))
    if len(senti_synsets) > 0:
        print(token, senti_synsets[0].pos_score(), senti_synsets[0].neg_score(), senti_synsets[0].obj_score())
    

I 0.0 0.0 1.0
loathe 0.0 0.25 0.75
taste 0.0 0.0 1.0
carrots 0.0 0.0 1.0


### Observations:

- The word "loathe" has a negative polarity (negative score 0.25, positive score 0.0), which matches our intuition that it is a negative word, although most of its weight (0.75) is still assigned to the objective score.
- The other words in the sentence are neutral.
- Knowing the polarity scores of words can be very useful in NLP applications such as sentiment analysis, where the overall sentiment of a piece of text is determined by aggregating the polarity scores of individual words. This can help in understanding the sentiment of customer reviews, social media posts, and other text data, for example to filter reviews into positive, neutral, or negative. However, the accuracy of sentiment analysis depends on the quality of the sentiment lexicon and on the context in which the words are used; satire, sarcasm, and bias can make word-level polarity misleading.
- SentiWordNet is a widely used lexicon for opinion mining. It assigns each WordNet synset three numerical sentiment scores (positive, negative, and objective), which are calculated by a set of classifiers.
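The aggregation step mentioned in the observations can be sketched with the scores printed above. The rule used here, summing positive minus negative over all tokens and labeling by sign, is one simple assumption among many possible aggregation schemes; `TOKEN_SCORES`, `sentence_polarity`, and `label` are hypothetical names.

```python
# (positive, negative, objective) scores copied from the per-token output above
TOKEN_SCORES = {
    "I": (0.0, 0.0, 1.0),
    "loathe": (0.0, 0.25, 0.75),
    "taste": (0.0, 0.0, 1.0),
    "carrots": (0.0, 0.0, 1.0),
}


def sentence_polarity(scores):
    """Sum (positive - negative) over all tokens; the objective score is ignored."""
    return sum(pos - neg for pos, neg, _obj in scores.values())


def label(polarity):
    """Map a numeric polarity to a coarse sentiment label by its sign."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


polarity = sentence_polarity(TOKEN_SCORES)
print(polarity, label(polarity))
```

With this rule the single negative word is enough to label the whole sentence negative, which also shows how sensitive word-level aggregation is to individual lexicon entries.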

## Collocations

In [None]:
from nltk.book import text4
#Print collocations for text4
text4.collocation_list()

[('United', 'States'),
 ('fellow', 'citizens'),
 ('years', 'ago'),
 ('four', 'years'),
 ('Federal', 'Government'),
 ('General', 'Government'),
 ('American', 'people'),
 ('Vice', 'President'),
 ('God', 'bless'),
 ('Chief', 'Justice'),
 ('one', 'another'),
 ('fellow', 'Americans'),
 ('Old', 'World'),
 ('Almighty', 'God'),
 ('Fellow', 'citizens'),
 ('Chief', 'Magistrate'),
 ('every', 'citizen'),
 ('Indian', 'tribes'),
 ('public', 'debt'),
 ('foreign', 'nations')]

### Mutual Information
$$\mathrm{MI}(x, y) = \log_2 \frac{p(x, y)}{p(x)\, p(y)}$$

In [None]:
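#This code is used from https://github.com/kjmazidi/NLP/blob/master/Part_2-Words/Chapter_07_WordNet/7.5_collocations.ipynb
import math
import random
random.seed(123)

#Selecting one collocation
choice = random.choice(text4.collocation_list())
print('Choice = '+' '.join(choice))
#Join the tokens into one string so the word pair can be counted as a substring
text = ' '.join(text4.tokens)
#Number of unique tokens; used as the denominator for all three estimates
vocab = len(set(text4))
#Estimate p(x,y): str.count matches substrings, so the counts are approximate
xy = text.count(f'{choice[0]} {choice[1]}')/vocab
print(f'p({choice[0]} {choice[1]}) =',xy )
#Estimate p(x)
x = text.count(choice[0])/vocab
print(f"p({choice[0]}) = ", x)
#Estimate p(y)
y = text.count(f'{choice[1]}')/vocab
print(f'p({choice[1]}) = ', y)
#Mutual information in bits (log base 2)
pmi = math.log2(xy / (x * y))
print('pmi = ', pmi)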

Choice = fellow citizens
p(fellow citizens) = 0.006084788029925187
p(fellow) =  0.013665835411471322
p(citizens) =  0.026932668329177057
pmi =  4.0472042737811735


### Observations:
Collocations are two or more words that tend to appear together more often than chance would predict, for example "United States". Mutual information measures the association between two words: the higher the MI score, the stronger the association. In this case, the MI score for **"fellow citizens"** is **4.05**, which is a relatively high score. This indicates that the words "fellow" and "citizens" are strongly associated with each other in the text, which makes sense given the political context of the Inaugural corpus.

This measure of association can be useful in various NLP tasks, such as identifying important collocations, detecting sentiment expressions, and classifying documents based on topic. However, MI is affected by the frequency of the individual words; in particular, it tends to give inflated scores to pairs of rare words. Therefore, it is important to use MI in conjunction with other measures of association and to interpret the results carefully.
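As a worked example of the MI formula, the sketch below computes the score from token-level counts on a tiny hand-made token list. It assumes all three probabilities are estimated by dividing by the total number of tokens, which differs from the vocabulary-size denominator used in the cell above; the token list and the function name `pmi` are hypothetical.

```python
import math


def pmi(tokens, x, y):
    """Pointwise mutual information (base 2) of the adjacent pair (x, y).

    Assumption: p(x), p(y), and p(x, y) are all estimated as count / len(tokens).
    """
    n = len(tokens)
    count_x = tokens.count(x)
    count_y = tokens.count(y)
    # Count positions where x is immediately followed by y
    count_xy = sum(1 for a, b in zip(tokens, tokens[1:]) if (a, b) == (x, y))
    if count_xy == 0 or count_x == 0 or count_y == 0:
        return float("-inf")  # the pair never occurs, so the log is undefined
    p_x, p_y, p_xy = count_x / n, count_y / n, count_xy / n
    return math.log2(p_xy / (p_x * p_y))


# Toy text: "new" and "york" always appear together
tokens = ["new", "york", "is", "big", "new", "york", "is", "old"]
print(pmi(tokens, "new", "york"))
```

Here p(new, york) = 2/8 and p(new) = p(york) = 2/8, so the ratio is 0.25 / 0.0625 = 4 and the score is log2(4) = 2 bits.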