## Use Sense Vectors to Identify Word Sense

#### Library Imports

In [1]:
import os
import sys
import time
from gensim.models import KeyedVectors

Append the GitHub repo to the system path, and import modules from it.

In [2]:
sys.path.append("sensegram_package/")
import sensegram
from wsd import WSD

---------

#### Load data

In [3]:
data_directory = os.path.join(os.getcwd(), "data")
corpus_fpath = os.path.join(data_directory, "corpus.txt")
sense_vectors_fpath = os.path.join(data_directory, "model", "wiki.txt.clusters.minsize5-1000-sum-score-20.sense_vectors")
word_vectors_fpath = os.path.join(data_directory, "model", "wiki.txt.word_vectors")

Load the sense and word vector files. This may take some time because the vector files are large.

In [4]:
s = time.time()
if os.path.exists(sense_vectors_fpath) and os.path.exists(word_vectors_fpath):
    sense_vectors = sensegram.SenseGram.load_word2vec_format(sense_vectors_fpath, binary=False)
    word_vectors = KeyedVectors.load_word2vec_format(word_vectors_fpath, binary=False, unicode_errors="ignore")
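    print(f"Took {time.time() - s:.1f} seconds to load vector files")
else:
    # Stop here: the cells below need both sense_vectors and word_vectors.
    raise FileNotFoundError("Could not find vector files. Check the file paths and ensure the right files exist.")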
del s

print("Reading the corpus now!")
with open(corpus_fpath, "r") as f:
    corpus_data = f.read()


#### Get all senses of a word

Using the sense vectors, load all possible senses for the given word.  
The output prints each sense as *Word#&lt;sense-number&gt;*, followed by its most similar senses and their similarity scores. This table can help us assign logical names to the different sense groups. For example, running the code for the word "**table**" gives the following output -  
```
Probabilities of the senses:
[('Table#1', 1.0), ('Table#2', 1.0), ('Table#3', 1.0), ('Table#4', 1.0), ('table#1', 1.0), ('table#2', 1.0), ('table#3', 1.0), ('table#4', 1.0)]


Table#1
====================
table#1 0.996316
TABLE#1 0.993647
PAGE#2 0.989991
page#2 0.989991
WINDOW#2 0.989900
Window#3 0.989900
window#2 0.989900
Scale#2 0.989745
scale#2 0.989745
SCALE#2 0.989745


Table#2
====================
TABLE#2 1.000000
Row#3 0.869726
row#3 0.869726
ROW#3 0.856643
Stack#3 0.829349
Box#3 0.826571
BOX#2 0.826571
stack#3 0.825068
STACK#3 0.824239
BOWL#3 0.813412


Table#3
====================
TABLE#3 0.939938
table#3 0.934190
Boundary_Markers#5 0.845184
Catchment_Basins#2 0.826906
contents#2 0.825448
CONTENTS#2 0.825448
Contents#2 0.824324
tables#1 0.806271
NUMBERS#3 0.804637
Tables#1 0.796628
....

```
A few things we can see from the output - 
* Since we have set *ignore_case=True*, the output shows 4 senses for *Table*, and 4 for *table*.
* Looking at the related words for each sense, we can attribute the following logical groups to a few of the senses - 
    - Table#2 - Data table.  
    - Table#3 - Table of contents.
    - table#4 - Hotel/Furniture.
    


In [1]:
test_word = "table"

print("Probabilities of the senses:\n{}\n\n".format(sense_vectors.get_senses(test_word, ignore_case=True)))

for sense_id, prob in sense_vectors.get_senses(test_word, ignore_case=True):
    print(sense_id)
    print("="*20)
    for rsense_id, sim in sense_vectors.wv.most_similar(sense_id):
        print("{} {:f}".format(rsense_id, sim))
    print("\n")


#### Get disambiguated sense of the word, using corpus as context

##### Input
To understand the word's sense in a given context, we use the *WSD* class from the sensegram library.  
The WSD model takes the following key parameters to decide word sense based on corpus context - 
* vectors - Both sense and word vector models loaded earlier.  
  
  
* method - To calculate the sense of the word, the library averages the sense scores of all the surrounding context words and compares the result with the different senses of the target word. For this comparison, there are two available metrics - 
 - sim: Uses cosine similarity
 - prob: Uses log probability score  
  
  
* window - The number of words (±) on either side of the target word that the model uses as context.   
For example, if our target word is *table*,   
with the context of *"we load our data into a data-frame table object and count the number of rows/columns using the .shape method"*  
 1. a window of 3 would consider the following 6 words (3 on the left and 3 on the right) around our target word to find its sense - *into, a, data-frame, object, and, count*  
 2. a window of 5 would use the following context - *our, data, into, a, data-frame, object, and, count, the, number*  
  
  
* verbose - Prints intermediate output while running the disambiguation code.
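
The *window* and *sim* ideas above can be sketched in plain Python. This is a simplified illustration and not the sensegram implementation: the helper names (`context_window`, `mean_vector`, `cosine`) and the 2-dimensional toy vectors are hypothetical, and it assumes the context is represented by the average of its word vectors.

```python
import math

def context_window(tokens, target, window):
    """Return up to `window` tokens on each side of the first `target`."""
    idx = tokens.index(target)
    left = tokens[max(0, idx - window):idx]
    right = tokens[idx + 1:idx + 1 + window]
    return left + right

def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

sentence = ("we load our data into a data-frame table object and count "
            "the number of rows/columns using the .shape method")
tokens = sentence.split()
window_3 = context_window(tokens, "table", 3)
window_5 = context_window(tokens, "table", 5)

# Toy vectors, invented purely for illustration.
toy_word_vectors = {"data-frame": [0.9, 0.1], "object": [0.8, 0.2], "count": [0.7, 0.3]}
toy_sense_vectors = {"table#1": [0.1, 0.9], "table#2": [1.0, 0.0]}

# Average the known context words, then score every sense against that average.
context_vec = mean_vector([toy_word_vectors[w] for w in window_3 if w in toy_word_vectors])
scores = {sense: cosine(context_vec, vec) for sense, vec in toy_sense_vectors.items()}
best_sense = max(scores, key=scores.get)
print(window_3, best_sense)
```

With these toy vectors the averaged context points in nearly the same direction as *table#2*, so that sense gets the highest cosine similarity.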

<hr>  
     
Some food-for-thought regarding the usage of WSD module - 
 - Do note that while stopwords like *and* and *the* are counted within the window around the target word, they are dropped while disambiguating the sense of our target.
 - While it may seem ideal to choose a high value of window for getting the sense of the target word, a wider window may give a less accurate output, as it averages across all possible senses.
 - The library considers, and disambiguates, only the first occurrence of the target word in the context. For a large corpus, it would be ideal to first split the corpus and generate contexts using an external helper function, and then iteratively get the sense for the target word across all occurrences in the corpus.
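
One possible shape for such an external helper is sketched below. The function name `iter_contexts`, the whitespace tokenization, and the punctuation stripping are assumptions made for illustration and are not part of sensegram. Each snippet starts after the previous occurrence of the target, so the occurrence of interest is always the first one in the snippet.

```python
PUNCTUATION = ".,;:!?\"'()[]"

def iter_contexts(text, target, window=15):
    """Yield one context snippet per occurrence of `target` in `text`."""
    tokens = text.split()  # naive whitespace tokenization
    target = target.lower()
    previous = -1  # index of the previous occurrence of the target
    for i, token in enumerate(tokens):
        if token.strip(PUNCTUATION).lower() != target:
            continue
        # Never reach back past the previous occurrence, so this occurrence
        # is the first one a disambiguator sees in the snippet.
        start = max(previous + 1, i - window)
        yield " ".join(tokens[start:i + window + 1])
        previous = i

sample = "the table is set . put the data in a table and sort the table"
contexts = list(iter_contexts(sample, "table", window=3))
for context in contexts:
    print(context)

# Hypothetical usage with the objects from this notebook:
# for context in iter_contexts(corpus_data, test_word):
#     print(wsd_model.disambiguate(context, test_word))
```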

In [20]:
wsd_model = WSD(sense_vectors, word_vectors, window=15, method='prob', verbose=True)

In [21]:
print(wsd_model.disambiguate(corpus_data, test_word))

Extracted context words:
['leftover', 'today', 'pizza', 'two', 'listening', 'english', 'saturday', 'just', 'ot', 'good', 'hangover', 'really', 'little', 'making', 'stop', 'slices']
Senses of a target word:
[('table#1', 1.0), ('table#2', 1.0), ('table#3', 1.0), ('table#4', 1.0)]
Significance scores of context words:
[0.37694023295858664, 0.5826593097827415, 0.7275173661671048, 0.3462219279577813, 0.32113965964726915, 0.6235405044059925, 0.314476217064492, 0.6094166324483048, 0.19088531161143885, 0.47508827794669944, 0.3897201415017436, 0.6215423778084821, 0.6857576958729588, 0.575916585269904, 0.7006624524726446, 0.7680371969757536]
Context words:
slices	0.768
pizza	0.728
stop	0.701
('table#2', [0.2706353009709064, 0.9591583572384959, 0.40617065436041355, 0.6940131864117054])


##### Output

Running the sense disambiguation code prints the following lines of output and returns a result -  
1. Prints the context words extracted from the corpus.
2. Prints the possible senses of the word, with their respective probabilities (without considering the context).
3. Prints the significance score of each context word.
4. Prints the most significant context words.
5. **Returns** a tuple of the sense of the word as derived from the context, and the match scores (log-probability or cosine-similarity, depending on the *method* chosen) of the various senses of the target word.  
For instance, the output *('table#2', [0.2706353009709064, 0.9591583572384959, 0.40617065436041355, 0.6940131864117054])* indicates the following about our target word -  
    - The closest sense of our target word is *table#2*, with a match score of 0.959 (second in the list).
    - The match scores for all senses can be read as follows - 
        - table#1 - 0.2706
        - table#2 - 0.959
        - table#3 - 0.406
        - table#4 - 0.694
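
To read the returned tuple programmatically, the scores can be paired with the sense ids. The sketch below copies the values from the output above and assumes, as the reading above does, that the scores are listed in the same order as the printed senses. The variable names are illustrative only.

```python
# Values copied from the output shown above.
senses = [("table#1", 1.0), ("table#2", 1.0), ("table#3", 1.0), ("table#4", 1.0)]
result = ("table#2", [0.2706353009709064, 0.9591583572384959,
                      0.40617065436041355, 0.6940131864117054])

best_sense, match_scores = result

# Pair each sense id with its match score, assuming both lists share the same order.
score_by_sense = {sense_id: score for (sense_id, _), score in zip(senses, match_scores)}

# Rank the senses from best to worst match.
ranked = sorted(score_by_sense.items(), key=lambda pair: pair[1], reverse=True)
for sense_id, score in ranked:
    print(f"{sense_id}\t{score:.3f}")
```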

Since we did not set the *ignore_case* argument while initializing the WSD model, it falls back to the default of False, and the output returns scores for the 4 senses of the word *table*.  
If we chose to ignore case, the output would have match scores for 8 senses (4 for *Table*, 4 for *table*).
<hr> 