In [None]:
# -----------------------------------------------------------------------------------
# 1. Introduction to NLP and WordNet
# -----------------------------------------------------------------------------------
print("Welcome to the NLP and WordNet Introduction Lab!")

In [None]:
# First, let's install the necessary libraries
!pip install nltk gensim

# Import the necessary libraries
import nltk
from nltk.corpus import wordnet as wn
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
import numpy as np
from scipy.spatial import distance
from sklearn.metrics import jaccard_score
from sklearn.feature_extraction.text import CountVectorizer

# Download necessary datasets from NLTK
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')  # for tokenization
nltk.download('punkt_tab')  # tokenizer data required by newer NLTK releases

**Note:** The table below lists the attributes of a WordNet synset and their definitions. [ref1](https://wordnet.princeton.edu) [ref2](https://opensource.com/article/20/8/nlp-python-nltk)

| Attribute   | Definition                                                                                             | Example                                                                                                                                                                   | Code Example                                                                                          |
|-------------|--------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
| Name        | Name of the synset                                                                                     | The word "code" has five synsets with names code.n.01, code.n.02, code.n.03, code.v.01, code.v.02                                                                         | `synset.name()`                                                                                      |
| POS         | Part of speech of the word for this synset                                                             | The word "code" has three synsets in noun form and two in verb form                                                                                                       | `synset.pos()`                                                                                       |
| Definition  | Definition of the word (in POS)                                                                        | One of the definitions of "code" in verb form is: "(computer science) the symbolic..."                             | `synset.definition()`                                                                                |
| Examples    | Examples of the word's use                                                                             | One of the examples of "code": "We should encode the message for security reasons"                                                                                        | `synset.examples()`                                                                                  |
| Lemmas      | Other words this word+POS is related to (not strictly synonyms, but can be considered so);...          | Lemmas of code.v.02 (as in "convert ordinary language into code") are code.v.02.encipher, code.v.02.cipher,... | `synset.lemmas()`                                                                                    |
| Antonyms    | Opposites                                                                                              | Antonym of lemma encode.v.01.encode is decode.v.01.decode                                                                                                                 | `[lemma.antonyms() for lemma in synset.lemmas() if len(lemma.antonyms()) > 0]`                        |
| Hypernym    | A broad category that other words fall under                                                           | A hypernym of code.v.01 (as in "Code the pieces with numbers so that you can identify them later") is tag.v.01                                                           | `synset.hypernyms()`                                                                                 |
| Meronym     | A word that names a part of a larger whole                                                             | A meronym of "computer" is "chip"                                                                                                                                         | `synset.part_meronyms()`                                                                             |
| Holonym     | A word that names the whole to which a part belongs                                                    | A holonym of "window" is "computer screen"                                                                                                                                | `synset.part_holonyms()`                                                                             |


**Note:** A "synset" is a single group of synonyms representing one meaning of a word or phrase, while "synsets" refer to the full set of these groups that a word can belong to, representing all the possible meanings of the word as organized in WordNet.

In [None]:
# -----------------------------------------------------------------------------------
# 2. Exploring WordNet with NLTK
# -----------------------------------------------------------------------------------


In [None]:
## Let's find out what synsets the word 'dog' has
## Accessing synsets
dog_synsets = wn.synsets('dog')
print("\nSynsets for 'dog':", dog_synsets)
## Each synset has an id (e.g. 'dog.n.01') that refers to one specific sense of the word. A word can have several senses, and therefore several synsets.

In [None]:
synset = wn.synset('dog.n.01')

In [None]:
## Let's build a function to print the Attribute table information for a synset.
#TODO: fill the "fill in" parts
def synset_info(synset):
    print("Name:", synset.name())
    print("POS:", synset.pos())
    print("Definition:", "to fill in .......")
    print("Examples:", "to fill in .......")
    print("Lemmas:", "to fill in .......")
    print("Antonyms:", "to fill in .......")
    print("Hypernyms:", "to fill in .......")
    print("Part Holonyms:", "to fill in .......")
    print("Part Meronyms:", "to fill in .......")

In [None]:
synset_info(synset)

In [None]:
## We found the synset ids of the word 'dog'. Now we can look up the synset that belongs to each id.

In [None]:
## TODO: Find synsets for 'cat'
# Uncomment the lines below and fill in the blank
# cat_synsets = wn.synsets('____')
# print("\nSynsets for 'cat':", cat_synsets)

## TODO: Call synset_info() on one synset id selected from the 'cat' synsets


In [None]:
## Exploring word hierarchies (Hypernyms and Hyponyms)
# Build functions that loop over the synsets of a word (e.g. 'dog') and print their hypernyms and hyponyms

def print_hypernyms(input_word):
  word_synsets = wn.synsets(input_word)
  for synset in word_synsets:
    # Hint: print("\nHypernyms of", synset.name(), ":", synset.hypernyms())
    # to be filled ...
    pass

def print_hyponyms(input_word):
  word_synsets = wn.synsets(input_word)
  # to be filled ...
  pass


## TODO: Explore hypernyms and hyponyms for 'cat'
# Call the functions above with 'cat' instead of 'dog' and compare the results

In [None]:
##EXTRA PART
##The WordNet corpus reader gives access to the Open Multilingual WordNet,
##using ISO-639 language codes. These languages are not loaded by default, but only lazily, when needed.
wn.synsets('犬', lang='jpn')  # '犬' is Japanese for 'dog'

In [None]:
## Let's see the lemma names of 'spy' in Japanese
wn.synset('spy.n.01').lemma_names('jpn')

In [None]:
## We can print to see which languages the wordnet object has
sorted(wn.langs())

In [None]:
## Let's see the lemma names of the first sense of 'dog' in Italian
wn.synset('dog.n.01').lemma_names('ita')

In [None]:
## Now we can search for the lemmas of 'cane' in Italian
wn.lemmas('cane', lang='ita')

In [None]:
## We can also search for the synonyms of a word, in English or in another language
## Synonyms of 'car' in English
en_synonyms = wn.synonyms('car')
print("Synonyms of the word 'car' in English:", en_synonyms)

## Synonyms of 'coche' in Spanish
es_synonyms = wn.synonyms('coche', lang='spa')
print("Synonyms of the word 'coche' in Spanish:", es_synonyms)

In [None]:
## TODO: Try it with antonyms. Note that antonyms are defined on lemmas, e.g. lemma.antonyms()
## Write your code here

In [None]:


# -----------------------------------------------------------------------------------
# 3. Computing Semantic Similarity Between Words
# -----------------------------------------------------------------------------------


In [None]:
## Path similarity
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print("\nPath similarity between 'dog' and 'cat':", dog.path_similarity(cat))

## TODO: Compute Wu-Palmer similarity between 'dog' and 'cat'
# Uncomment the line below and run it
# print("Wu-Palmer similarity between 'dog' and 'cat':", dog.wup_similarity(cat))
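
To make the two scores concrete: path similarity is 1 / (1 + d), where d is the number of edges on the shortest path between two synsets in the hypernym graph, and Wu-Palmer similarity is 2 * depth(LCS) / (depth(s1) + depth(s2)), where LCS is the deepest ancestor the two synsets share. The sketch below applies both formulas to a tiny hand-made taxonomy. The `PARENT` tree and all helper names are invented for illustration and are much simpler than WordNet itself: every node has a single parent, and depth counts nodes from the root, so the root has depth 1.

```python
# Toy hypernym tree: child -> parent (invented for illustration, not WordNet data)
PARENT = {
    "animal": None,
    "carnivore": "animal",
    "canine": "carnivore",
    "feline": "carnivore",
    "dog": "canine",
    "cat": "feline",
}

def ancestors(node):
    """Return the chain [node, parent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain

def depth(node):
    """Number of nodes from the root down to node (the root has depth 1)."""
    return len(ancestors(node))

def lowest_common_ancestor(a, b):
    """Deepest node that is an ancestor of both a and b."""
    seen = set(ancestors(a))
    return next(n for n in ancestors(b) if n in seen)

def toy_path_similarity(a, b):
    """1 / (1 + number of edges on the path between a and b)."""
    lca = lowest_common_ancestor(a, b)
    edges = (depth(a) - depth(lca)) + (depth(b) - depth(lca))
    return 1 / (1 + edges)

def toy_wup_similarity(a, b):
    """2 * depth(LCA) / (depth(a) + depth(b))."""
    lca = lowest_common_ancestor(a, b)
    return 2 * depth(lca) / (depth(a) + depth(b))

print(toy_path_similarity("dog", "cat"))  # 0.2: four edges via 'carnivore'
print(toy_wup_similarity("dog", "cat"))   # 0.5: 2 * 2 / (4 + 4)
```

Path similarity only looks at the distance between the two nodes, while Wu-Palmer also rewards pairs whose common ancestor sits deep in the hierarchy.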

In [None]:
## Measuring semantic similarity tells us how close two words or text segments are in meaning.
## The technique is widely applied in search engines, chatbots, and other systems
## that need to match a user's wording against stored content.

syn1 = wn.synsets('football')
syn2 = wn.synsets('soccer')

# A word may have multiple synsets, so we compare every synset of word1 with every synset of word2
for s1 in syn1:
    for s2 in syn2:
        print("Path similarity of: ")
        print(s1, '(', s1.pos(), ')', '[', s1.definition(), ']')
        print(s2, '(', s2.pos(), ')', '[', s2.definition(), ']')
        print("   is", s1.path_similarity(s2))
        print()

## TODO: Write a function that receives two words and computes the similarities between their synsets. Try it with 'king'/'queen', 'car'/'engine', 'good'/'bad' and 'black'/'white'.

## ..... write it here below (hint: you can adapt the code above to build this function!)

In [None]:
print("\n-- Document Similarity Based on WordNet Synsets --")

## Introduction to computing document similarity using WordNet
## Example and Practice: Define a function to compute the similarity between two documents
def document_similarity(doc1, doc2):
    """
    Compute document similarity using WordNet synset similarities for pairs of words.

    For every synset found in doc1, take its best path similarity against the
    synsets of doc2, then average those best scores. Returns 0.0 when there is
    nothing to compare. Note that the score depends on the argument order.
    """
    # Look up the synsets of every token and flatten them into one list per document
    synsets1 = [ss for word in nltk.word_tokenize(doc1) for ss in wn.synsets(word)]
    synsets2 = [ss for word in nltk.word_tokenize(doc2) for ss in wn.synsets(word)]

    # max() fails on an empty sequence, so stop early if either side has no synsets
    if not synsets1 or not synsets2:
        return 0.0

    score, count = 0.0, 0

    # For each synset of doc1, find its maximum similarity to any synset of doc2
    for synset1 in synsets1:
        max_score = max(synset1.path_similarity(synset2) or 0 for synset2 in synsets2)
        if max_score > 0:
            score += max_score
            count += 1

    # Average the best scores, guarding against division by zero
    return score / count if count else 0.0
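
Because the loop averages only over the synsets of the first document, swapping the arguments can change the result. The sketch below isolates that aggregation step and shows one way to make it symmetric, by averaging the two directions. It is deliberately independent of WordNet: `toy_similarity` is a stand-in invented for this example, and the function names are hypothetical.

```python
def directional_score(items1, items2, similarity):
    """Average, over items1, of the best positive similarity to any item in items2."""
    best_scores = []
    for a in items1:
        # default=0 covers the case where items2 is empty
        best = max((similarity(a, b) or 0 for b in items2), default=0)
        if best > 0:
            best_scores.append(best)
    return sum(best_scores) / len(best_scores) if best_scores else 0.0

def symmetric_score(items1, items2, similarity):
    """Mean of the two directional scores, so argument order no longer matters."""
    forward = directional_score(items1, items2, similarity)
    backward = directional_score(items2, items1, similarity)
    return (forward + backward) / 2

def toy_similarity(a, b):
    """Stand-in score: 1.0 for identical strings, 0.5 for a shared first letter, else None."""
    if a == b:
        return 1.0
    return 0.5 if a[0] == b[0] else None

doc_a = ["dog", "door"]
doc_b = ["dog", "cat", "car"]
print(directional_score(doc_a, doc_b, toy_similarity))  # 0.75
print(directional_score(doc_b, doc_a, toy_similarity))  # 1.0
print(symmetric_score(doc_a, doc_b, toy_similarity))    # 0.875
```

With WordNet, `items1` and `items2` would be the flattened synset lists of the two documents, and `similarity` could be `lambda s1, s2: s1.path_similarity(s2)`.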




In [None]:
## Test the function with two simple documents
doc1 = "Dogs are awesome."
doc2 = "Cats are amazing."
print("Document similarity:", document_similarity(doc1, doc2))

# -----------------------------------------------------------------------------------
# Wrap-up and Q&A
# -----------------------------------------------------------------------------------
print("\nThank you for participating in the lab! Feel free to experiment with the code and explore further. If you have any questions, now is a great time to ask.")


In [None]:
## TODO: Try to find the similarities of more documents. Does it work for a text with multiple sentences?