# Resnik Similarity

__Anish Sachdeva (DTU/2K16/MC/013)__

__Natural Language Processing (Dr. Seba Susan)__

The Resnik similarity metric is built on the lowest common subsumer (__LCS__) of two words $w_1$ and $w_2$: the most specific concept in the WordNet hierarchy that is an ancestor of both. We estimate the probability $P(LCS(w_1, w_2))$ of encountering that subsumer in a corpus and define the similarity score as its information content, $-\log P(LCS(w_1, w_2))$. Below we show how to find the closest pair of synsets for two given words under the Resnik similarity, and we then apply the metric to our resume to see which document most closely matches the 6th document.
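
To make the definition concrete, here is a minimal sketch on a toy taxonomy. The concepts, the counts and the helper names (`PARENT`, `COUNTS`, `ancestors`, `probability`, `lowest_common_subsumer`, `resnik`) are all hypothetical and exist only for illustration. The sketch also assumes every concept has a single parent, which is simpler than WordNet, where a synset can have several hypernyms.

```python
import math

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "coin": "entity",
    "metal": "entity",
    "nickel_coin": "coin",
    "dime": "coin",
    "gold": "metal",
}

# Hypothetical corpus counts of each concept's own occurrences.
COUNTS = {"entity": 0, "coin": 2, "metal": 4, "nickel_coin": 3, "dime": 3, "gold": 8}


def ancestors(concept):
    """Return the path from a concept up to the root, concept included."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def probability(concept):
    """Share of all counts that fall on the concept or any of its descendants."""
    total = sum(COUNTS.values())
    subsumed = sum(n for c, n in COUNTS.items() if concept in ancestors(c))
    return subsumed / total


def lowest_common_subsumer(a, b):
    """First ancestor of `a`, walking upward, that also subsumes `b`."""
    b_ancestors = set(ancestors(b))
    return next(c for c in ancestors(a) if c in b_ancestors)


def resnik(a, b):
    """Resnik similarity: information content of the lowest common subsumer."""
    return -math.log(probability(lowest_common_subsumer(a, b)))


print(lowest_common_subsumer("nickel_coin", "dime"), resnik("nickel_coin", "dime"))
print(lowest_common_subsumer("nickel_coin", "gold"), resnik("nickel_coin", "gold"))
```

In this toy setting the two coins meet at `coin`, which covers only part of the counts and therefore carries positive information content. The coin and `gold` meet only at the root, whose probability is 1, so their similarity is zero.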

### Importing Required Packages

In [1]:
import nltk
from nltk.corpus import wordnet, wordnet_ic
# nltk.download('wordnet')
# nltk.download('wordnet_ic')
import numpy as np
import pickle
import pprint
import pandas as pd
from scipy import stats

In [2]:
# Defining Infinity
infinity = float('inf')

We will now load the information content file computed from the Brown Corpus, which supplies the probabilities needed to score the Lowest Common Subsumer (__LCS__).

In [3]:
# Importing the Brown Corpus
brown_ic = wordnet_ic.ic('ic-brown.dat')

### Defining the `closest_synsets` Function 
This function searches all synsets of two words and returns the pair of synsets (one per word) with the highest Resnik similarity, along with that similarity. Only synsets with the same part of speech are compared.

In [4]:
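def closest_synsets(word_1: str, word_2: str):
    """Return the pair of synsets (one per word) with the highest Resnik similarity."""
    synsets_1 = wordnet.synsets(word_1)
    synsets_2 = wordnet.synsets(word_2)
    max_similarity = -infinity

    # If either word is missing from WordNet there is nothing to compare
    if not synsets_1 or not synsets_2:
        return None, None, -infinity

    # Default to the first sense of each word in case no pair is comparable
    synset_1_closest = synsets_1[0]
    synset_2_closest = synsets_2[0]

    for synset_1 in synsets_1:
        for synset_2 in synsets_2:
            # Resnik similarity is only defined for synsets with the same part of speech
            if synset_1.pos() != synset_2.pos():
                continue
            similarity = synset_1.res_similarity(synset_2, ic=brown_ic)
            if similarity > max_similarity:
                max_similarity = similarity
                synset_1_closest = synset_1
                synset_2_closest = synset_2

    return synset_1_closest, synset_2_closest, max_similarity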
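
The search inside `closest_synsets` is an exhaustive maximum over all pairs of senses. The sketch below isolates that pattern in a hypothetical helper, `best_pair`, which takes the scoring function and the comparability test as arguments so that it can be tried without WordNet. The senses and the `toy_scores` table are made up for illustration. Unlike `closest_synsets`, this sketch returns `None` for both senses when no pair is comparable.

```python
from itertools import product


def best_pair(senses_a, senses_b, score, comparable=lambda a, b: True):
    """Return (a, b, s) with the highest score(a, b) over comparable pairs.

    Falls back to (None, None, -inf) when no pair is comparable.
    """
    best = (None, None, float("-inf"))
    for a, b in product(senses_a, senses_b):
        if not comparable(a, b):
            continue
        s = score(a, b)
        if s > best[2]:
            best = (a, b, s)
    return best


# Hypothetical senses as (name, part_of_speech) tuples and a toy score table.
java = [("java_island", "n"), ("java_coffee", "n"), ("java_language", "n")]
language = [("language_speech", "n"), ("speak", "v")]
toy_scores = {
    ("java_language", "language_speech"): 5.8,
    ("java_island", "language_speech"): 0.6,
}

result = best_pair(
    java,
    language,
    score=lambda a, b: toy_scores.get((a[0], b[0]), 0.0),
    comparable=lambda a, b: a[1] == b[1],
)
print(result)
```

With NLTK loaded, the same helper could presumably be driven by `wordnet.synsets(...)` for the two sense lists, `a.res_similarity(b, ic=brown_ic)` as the score and `a.pos() == b.pos()` as the comparability test.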

Now, let us test our function with a few sample words. 

In [5]:
word_1 = 'java'
word_2 = 'island'
word_1_synset, word_2_synset, similarity = closest_synsets(word_1, word_2)

print(word_1.capitalize() + ' Definition:', word_1_synset.definition())
print(word_2.capitalize() + ' Definition:', word_2_synset.definition())
print('similarity:', similarity)

Java Definition: an island in Indonesia to the south of Borneo; one of the world's most densely populated regions
Island Definition: a land mass (smaller than a continent) that is surrounded by water
similarity: 6.688645509739946


In [6]:
word_1 = 'java'
word_2 = 'language'
word_1_synset, word_2_synset, similarity = closest_synsets(word_1, word_2)

print(word_1.capitalize() + ' Definition:', word_1_synset.definition())
print(word_2.capitalize() + ' Definition:', word_2_synset.definition())
print('similarity:', similarity)

Java Definition: a platform-independent object-oriented programming language
Language Definition: a systematic means of communicating by the use of sounds or conventional symbols
similarity: 5.792086967391197


In [7]:
word_1 = 'nickel'
word_2 = 'dime'
word_1_synset, word_2_synset, similarity = closest_synsets(word_1, word_2)

print(word_1.capitalize() + ' Definition:', word_1_synset.definition())
print(word_2.capitalize() + ' Definition:', word_2_synset.definition())
print('similarity:', similarity)

Nickel Definition: a United States coin worth one twentieth of a dollar
Dime Definition: a United States coin worth one tenth of a dollar
similarity: 7.455288045755159


In [9]:
word_1 = 'nickel'
word_2 = 'gold'
word_1_synset, word_2_synset, similarity = closest_synsets(word_1, word_2)

print(word_1.capitalize() + ' Definition:', word_1_synset.definition())
print(word_2.capitalize() + ' Definition:', word_2_synset.definition())
print('similarity:', similarity)

Nickel Definition: a hard malleable ductile silvery metallic element that is resistant to corrosion; used in alloys; occurs in pentlandite and smaltite and garnierite and millerite
Gold Definition: a soft yellow malleable ductile (trivalent and univalent) metallic element; occurs mainly as nuggets in rocks and alluvial deposits; does not react with most chemicals but is attacked by chlorine and aqua regia
similarity: 5.442191710437843


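The examples above show the function choosing senses to suit the word it is paired with: __java__ resolves to the island when compared with __island__ and to the programming language when compared with __language__, while __nickel__ resolves to the coin next to __dime__ and to the metallic element next to __gold__. In each case the Resnik similarity metric selects the pair of senses that is closest in meaning.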

We will now test this metric on our resume.

### Loading The Documents from our Resume
Our resume was divided into 6 documents, each represented by the 6 keywords that occur most frequently in it.

In [10]:
# Use a context manager so the file is closed after loading
with open('../assets/documents.p', 'rb') as documents_file:
    documents = pickle.load(documents_file)
print('The documents are:')
pprint.pprint(documents)

The documents are:
[['python', 'data', 'structures', 'students', 'com', 'delhi'],
 ['java', 'auckland', 'geometry', 'mathematics', 'theory', 'batch'],
 ['cern', 'applications', 'worked', 'research', 'group', 'core'],
 ['worked', 'also', 'requests', 'participated', 'many', 'teaching'],
 ['structures', 'computer', 'algorithms', 'java', 'university', 'mathematics'],
 ['trinity', 'college', 'london', 'plectrum', 'guitar', 'grade']]


### Viewing the Documents in Tabular Format

In [11]:
documents_table = pd.DataFrame(documents)
print('\nDocuments:')
print(documents_table)


Documents:
            0             1           2             3           4            5
0      python          data  structures      students         com        delhi
1        java      auckland    geometry   mathematics      theory        batch
2        cern  applications      worked      research       group         core
3      worked          also    requests  participated        many     teaching
4  structures      computer  algorithms          java  university  mathematics
5     trinity       college      london      plectrum      guitar        grade


### Finding Similarity Between 6th & Other Documents

In [12]:
# One row per document being compared (documents 1-5), one column per keyword of the 6th document
similarity_mat = np.zeros((len(documents) - 1, len(documents[0])))

# Compare each keyword of the 6th document with the keyword in the same position of every other document
for column, keyword in enumerate(documents[-1]):
    for row in range(len(documents) - 1):
        similarity_mat[row][column] = closest_synsets(keyword, documents[row][column])[2]

print('\nThe similarity coefficients are:\n')
similarity = pd.DataFrame(similarity_mat, columns=documents[-1])
print(similarity.to_string())


The similarity coefficients are:

    trinity   college    london  plectrum    guitar     grade
0  5.738632  2.855294  1.531834  1.531834      -inf  1.290026
1  0.596229  1.290026 -0.000000 -0.000000 -0.000000  7.054047
2      -inf  0.801759      -inf -0.000000  0.801759  3.335576
3      -inf      -inf -0.000000      -inf      -inf  2.644521
4  2.855294  2.305849 -0.000000  1.290026  2.305849  2.644521


### Saving The Similarity Coefficient Matrix in a File
We save the matrix so that the results can be viewed later.

In [13]:
# Use a context manager so the file is closed even if writing fails
with open('../assets/resnik_similarity_matrix.txt', 'w') as results:
    results.write(similarity.to_string())

### Selecting document with Maximum/Minimum Similarity with the 6th Document.
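Reading down the first column (word: __trinity__), the maximum similarity is 5.738 with __python__ and the minimum is $-\infty$ with __cern__ and __worked__. A score of $-\infty$ means that no pair of synsets for the two words shared a part of speech (or that a word was missing from WordNet), so no similarity could be computed.

For the word __college__, maximum is 2.855 with __data__ and minimum is $- \infty$ with __also__.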

For the word __london__, maximum is 1.531 with __structures__ and minimum is $- \infty$ with __worked__.

For the word __plectrum__, maximum is 1.531 with __students__ and minimum is $- \infty$ with __participated__.

For the word __guitar__, maximum is 2.305 with __university__ and minimum is $- \infty$ with __com__ and __many__.

For the word __grade__, maximum is 7.054 with __batch__ and minimum is 1.290 with __delhi__.
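
The column-wise maxima and minima listed above can also be computed rather than read off by eye. The following is a minimal sketch using only the standard library. The values are copied from the printed similarity matrix (with `-0.000000` written as `0.0`), and the helper `rows_with` is a hypothetical name introduced here. It keeps every tied row instead of picking one.

```python
import math

keywords = ["trinity", "college", "london", "plectrum", "guitar", "grade"]
NEG_INF = -math.inf

# Rows are documents 1-5 (indices 0-4); values copied from the printed matrix.
matrix = [
    [5.738632, 2.855294, 1.531834, 1.531834, NEG_INF, 1.290026],
    [0.596229, 1.290026, 0.0, 0.0, 0.0, 7.054047],
    [NEG_INF, 0.801759, NEG_INF, 0.0, 0.801759, 3.335576],
    [NEG_INF, NEG_INF, 0.0, NEG_INF, NEG_INF, 2.644521],
    [2.855294, 2.305849, 0.0, 1.290026, 2.305849, 2.644521],
]


def rows_with(column_values, target):
    """All row indices whose value equals `target` (ties are kept)."""
    return [row for row, value in enumerate(column_values) if value == target]


max_rows, min_rows = [], []
for col, keyword in enumerate(keywords):
    column = [row[col] for row in matrix]
    max_rows.append(rows_with(column, max(column)))
    min_rows.append(rows_with(column, min(column)))
    print(f"{keyword:>8}: max rows {max_rows[-1]}, min rows {min_rows[-1]}")
```

Keeping ties matters here because several columns contain more than one $-\infty$ entry, so a single "minimum row" per column is not always well defined.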

In [14]:
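# From the above data we can create vectors of the row indices holding the maximum and minimum of each column
# (where several rows tie for the minimum, one of the tied rows is listed)
max_indices = [0, 0, 0, 0, 4, 1]
min_indices = [3, 3, 2, 3, 3, 0]

In [15]:
# document with least/maximum similarity
# np.atleast_1d copes with SciPy versions that return the mode as a scalar as well as those that return an array
document_min_similarity = np.atleast_1d(stats.mode(min_indices).mode)[0]
document_max_similarity = np.atleast_1d(stats.mode(max_indices).mode)[0]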

print('\nDocument with Minimum Similarity to 6th document:', documents[document_min_similarity])
print('Document with Maximum Similarity to 6th document:', documents[document_max_similarity])


Document with Minimum Similarity to 6th document: ['worked', 'also', 'requests', 'participated', 'many', 'teaching']
Document with Maximum Similarity to 6th document: ['python', 'data', 'structures', 'students', 'com', 'delhi']
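
The selection above is effectively a vote: each keyword column nominates the document where it scored highest (or lowest), and the most frequently nominated document wins. Below is a minimal sketch of that vote using `collections.Counter`. It assumes tied rows each receive a vote, which is a choice made for this illustration rather than something the analysis above specifies. The row lists are read off the similarity matrix, and `vote` is a hypothetical helper name.

```python
from collections import Counter

# Rows (0-based document indices) holding each column's minimum and maximum,
# read off the similarity matrix; ties are listed in full.
min_rows = [[2, 3], [3], [2], [3], [0, 3], [0]]
max_rows = [[0], [0], [0], [0], [4], [1]]


def vote(rows_per_column):
    """Give one vote to every listed row and return (winner, tally)."""
    tally = Counter(row for rows in rows_per_column for row in rows)
    winner, _ = tally.most_common(1)[0]
    return winner, tally


least_similar, min_tally = vote(min_rows)
most_similar, max_tally = vote(max_rows)
print("least similar document index:", least_similar, dict(min_tally))
print("most similar document index:", most_similar, dict(max_tally))
```

Under this tie handling the vote selects the same two documents as the output above. Note that voting uses only the position of each extreme and ignores how large the similarity values are.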
