## Use Sense Vectors to identify Word sense

#### Library Imports

In [1]:
import os
import sys
import time
from gensim.models import KeyedVectors
import spacy
from tqdm import tqdm
import pandas as pd
from spacy_wordnet.wordnet_annotator import WordnetAnnotator 
from collections import defaultdict
import ast

Append the GitHub repo to the system path and import modules from it.

In [2]:
sys.path.append("sensegram_package/")
import sensegram
from wsd import WSD

---------

#### Load data

In [3]:
data_directory = os.path.join(os.getcwd(), "data")
corpus_fpath = os.path.join(data_directory, "corpus.txt")
sense_vectors_fpath = os.path.join(data_directory, "model", "wiki.txt.clusters.minsize5-1000-sum-score-20.sense_vectors")
word_vectors_fpath = os.path.join(data_directory, "model", "wiki.txt.word_vectors")

Load the sense and word vector files. This may take some time because the vector files are large (about 21 minutes in the run shown below).

In [4]:
s = time.time()
if os.path.exists(sense_vectors_fpath) and os.path.exists(word_vectors_fpath):
    sense_vectors = sensegram.SenseGram.load_word2vec_format(sense_vectors_fpath, binary=False)
    word_vectors = KeyedVectors.load_word2vec_format(word_vectors_fpath, binary=False, unicode_errors="ignore")
    print(f"Took {time.time()-s}seconds to load vector files")
else:
    print("Could not find vector files. Check the file paths and ensure the right files exist")
del s

print("Reading the corpus now!")
with open(corpus_fpath, "r") as f:
    corpus_data = f.read()

Took 1278.5607526302338seconds to load vector files
Reading the corpus now!


#### Get all senses of a word

Using the sense vectors, look up all available senses of a given word.  
The output first lists the senses with their probabilities, then prints each sense as *Word#&lt;sense-number&gt;*, followed by its nearest neighbouring senses and their similarity scores. These neighbours help us assign a logical name to each sense group. For example, running the code for the word "**table**" gives the following output:  
```
Probabilities of the senses:
[('Table#1', 1.0), ('Table#2', 1.0), ('Table#3', 1.0), ('Table#4', 1.0), ('table#1', 1.0), ('table#2', 1.0), ('table#3', 1.0), ('table#4', 1.0)]


Table#1
====================
table#1 0.996316
TABLE#1 0.993647
PAGE#2 0.989991
page#2 0.989991
WINDOW#2 0.989900
Window#3 0.989900
window#2 0.989900
Scale#2 0.989745
scale#2 0.989745
SCALE#2 0.989745


Table#2
====================
TABLE#2 1.000000
Row#3 0.869726
row#3 0.869726
ROW#3 0.856643
Stack#3 0.829349
Box#3 0.826571
BOX#2 0.826571
stack#3 0.825068
STACK#3 0.824239
BOWL#3 0.813412


Table#3
====================
TABLE#3 0.939938
table#3 0.934190
Boundary_Markers#5 0.845184
Catchment_Basins#2 0.826906
contents#2 0.825448
CONTENTS#2 0.825448
Contents#2 0.824324
tables#1 0.806271
NUMBERS#3 0.804637
Tables#1 0.796628
....

```
A few things we can see from the output:
* Since we have set *ignore_case=True*, the output shows 4 senses for *Table* and 4 for *table*.
* Looking at the related words for each sense, we can attribute the following logical groups to a few of the senses:
    - Table#2 - Data table.  
    - Table#3 - Table of contents.
    - table#4 - Hotel/Furniture.
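
Because case variants such as *Table#2* and *table#2* are reported as separate sense IDs, it can help to group them before naming the sense groups. The sketch below is only an illustration: `group_senses_by_word` is a hypothetical helper (not part of sensegram), and the IDs are copied from the example output above.

```python
from collections import defaultdict

def group_senses_by_word(sense_ids):
    """Group sense IDs such as 'Table#2' under their lowercased word."""
    groups = defaultdict(list)
    for sense_id in sense_ids:
        # Split on the last '#' so the sense number is always the final part.
        word, _, number = sense_id.rpartition("#")
        groups[word.lower()].append((sense_id, int(number)))
    return dict(groups)

# Sense IDs taken from the example output for "table".
example_ids = ["Table#1", "Table#2", "table#1", "table#2", "TABLE#3"]
grouped = group_senses_by_word(example_ids)
print(grouped["table"])
```

Note that the same sense number does not necessarily denote the same meaning across case variants, so the grouped IDs still need to be inspected by hand.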
    


In [5]:
test_word = "disk"

print("Probabilities of the senses:\n{}\n\n".format(sense_vectors.get_senses(test_word, ignore_case=False)))

for sense_id, prob in sense_vectors.get_senses(test_word, ignore_case=True):
    print(sense_id)
    print("="*20)
    for rsense_id, sim in sense_vectors.wv.most_similar(sense_id):
        print("{} {:f}".format(rsense_id, sim))
    print("\n")

Probabilities of the senses:
[('disk#1', 1.0), ('disk#2', 1.0), ('disk#3', 1.0), ('disk#4', 1.0), ('disk#5', 1.0)]


Disk#1
Flexible_Disk#2 0.924806
Rigid_Disk#2 0.916348
Basic_Input#1 0.899887
RDOS#1 0.893633
SASI#1 0.891424
Sasi#1 0.891424
corrector#4 0.889009
customizer#2 0.888149
Fargo#3 0.886290
fargo#2 0.886290


Disk#2
DISK#2 0.991812
BOOT#4 0.958993
Bit#4 0.934779
disk#2 0.931051
module#3 0.923883
Module#3 0.921607
MODULE#3 0.916528
Disks#3 0.909954
rom#1 0.903984
hard_disk#1 0.903680


Disk#3
disk#3 1.000000
DISK#3 1.000000
urn#4 0.960472
URN#4 0.960472
Urn#4 0.960472
EXEC#4 0.958112
Coco#4 0.957627
CoCo#2 0.957627
coco#3 0.957627
COCO#2 0.957627


Disk#4
DISK#4 1.000000
disk#4 0.991975
Envelope#1 0.931414
Plate#1 0.922727
Stack#3 0.906151
STACK#3 0.904850
Shelf#4 0.900630
stack#3 0.897358
Bubble#4 0.896194
CHAIN#4 0.895439


disk#1
DISK#1 0.995354
Modem#2 0.961432
install#3 0.960586
SYNC#1 0.960527
sync#1 0.960527
Install#1 0.959036
Handheld#2 0.958145
laptop#1 0.956479
Lapto

#### Get disambiguated sense of the word, using corpus as context

##### Input
To understand the word's sense in a given context, we use the *WSD* class from the sensegram library.  
The WSD model takes the following key parameters to decide word sense based on corpus context - 
* vectors - Both sense and word vector models loaded earlier.  
  
  
* method - To decide the sense of the word, the library averages over all the surrounding context words and compares the result with each sense of the target word. Two metrics are available for this comparison - 
 - sim: Uses cosine similarity
 - prob: Uses a log-probability score  
  
  
* window - The number of words on each side (±) of the target word that the model uses as context.   
For example, if our target word is *table*,   
with the context of *"we load our data into a data-frame table object and count the number of rows/columns using the .shape method"*  
 1. a window of 3 would consider the following 6 words (3 on the left and 3 on the right) around the target word to find its sense - *into, a, data-frame, object, and, count*  
 2. a window of 5 would use the following 10 words as context - *our, data, into, a, data-frame, object, and, count, the, number*  
  
  
* verbose - Prints intermediate outputs while running the disambiguation code.
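
The effect of the *window* parameter can be sketched in a few lines. This is a simplified illustration, not the library's implementation: `context_window` is a hypothetical helper, and the whitespace tokenisation is an assumption made only to keep the example short.

```python
def context_window(tokens, target, window):
    """Return up to `window` tokens on each side of the first `target`."""
    i = tokens.index(target)  # first occurrence only
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left, right

sentence = ("we load our data into a data-frame table object and count "
            "the number of rows/columns using the .shape method")
tokens = sentence.split()  # naive whitespace tokenisation, for illustration only

left3, right3 = context_window(tokens, "table", 3)
print(left3 + right3)  # 3 words on each side of "table"

left5, right5 = context_window(tokens, "table", 5)
print(left5 + right5)  # 5 words on each side of "table"
```

The target word itself is excluded from both lists, so a window of `n` yields at most `2n` context words.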

<hr>  
     
Some food for thought regarding the usage of the WSD module - 
 - Stopwords like *and* and *the* count towards the context window of the target word, but they are dropped when disambiguating its sense.
 - While it may seem ideal to choose a large window, a wider window can give a less accurate result, since it averages across more context words that may point to different senses.
 - The library considers, and disambiguates, only the first occurrence of the target word in the context. For a large corpus, it is better to first split the corpus and generate contexts using an external helper function, and then iteratively get the sense of the target word for each occurrence in the corpus.
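
One way to work around the first-occurrence limitation is to cut one context snippet per occurrence and disambiguate each snippet separately. The following is a minimal sketch under stated assumptions: `contexts_for_all_occurrences` is a hypothetical helper, the regex tokenisation is a simplification, and the sample text is invented for illustration.

```python
import re

def contexts_for_all_occurrences(text, target, window=5):
    """Yield one context string per occurrence of `target` in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    for i, tok in enumerate(tokens):
        if tok == target:
            start = max(0, i - window)
            # Keep `window` tokens on each side, plus the target itself.
            yield " ".join(tokens[start:i + window + 1])

text = "The table shows sales. We booked a table for dinner at the hotel."
snippets = list(contexts_for_all_occurrences(text, "table", window=3))
for snippet in snippets:
    print(snippet)
    # Each snippet could then be passed on, e.g.
    # wsd_model.disambiguate(snippet, "table")
```

Each snippet contains the target word exactly where it occurred, so every occurrence gets its own disambiguation call.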

In [6]:
wsd_model = WSD(sense_vectors, word_vectors, window=15, method='prob', verbose=True)

In [7]:
print(wsd_model.disambiguate(corpus_data, "pizza"))

Extracted context words:
['know', 'hare', 'street', 'hope', 'overexcited', 'dont', 'er', 'richard', 'still', 'aaw', 'say', 'welcome', 'open', 'poppet', 'closed', 'gif', 'tortoise', 'youre', 'got']
Senses of a target word:
[('pizza#1', 1.0), ('pizza#2', 1.0)]
Significance scores of context words:
[0.29148820525520214, 0.12975690243744742, 0.02722264833952448, 0.11942657176932528, 0.009472310267542472, 0.16188559353799392, 0.1438541919856201, 0.13226385889523357, 0.33827481693934414, 0.01728681152264855, 0.3557976824709924, 0.06579240374490092, 0.11683631808681638, 0.07629630608727012, 0.4041725965133941, 0.025504092740760764, 0.16010984464621203, 0.04979557551486591, 0.34149093466603153]
Context words:
closed	0.404
say	0.356
got	0.341
('pizza#2', [0.7334760438909177, 0.7729272994918818])


<h5> Output </h5>  

Running the sense disambiguation code with *verbose=True* generates the following output -  
1. Prints the context words extracted from the corpus.
2. Prints the possible senses of the word with their respective probabilities (without considering the context).
3. Prints the significance score of each context word.
4. Prints the most significant context words.
5. **Returns** a tuple of the sense of the word as derived from the context, and the match scores (log-probability or cosine-similarity, depending on the *method* chosen) of the various senses of the target word.  
For instance, the output *('table#2', [0.2706353009709064, 0.9591583572384959, 0.40617065436041355, 0.6940131864117054])* indicates the following about our target word -  
    - The closest sense of our target word is *table#2*, with a match score of 0.959 (second in the list).
    - The match scores for all the senses read as follows - 
        - table#1 - 0.2706
        - table#2 - 0.959
        - table#3 - 0.406
        - table#4 - 0.694

Since we did not pass the *ignore_case* argument while initializing the WSD model, it falls back to its default, and the output returns scores only for the 4 lowercase senses of the word *table*.  
If we chose to ignore case, the output would have match scores for 8 senses (4 for *Table* and 4 for *table*).
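
To make the returned tuple easier to read, the scores can be paired with their sense IDs. The sketch below assumes, as in the example above, that the scores are listed in the same order as the senses; `label_scores` is a hypothetical helper and the values are copied from the example.

```python
def label_scores(senses, result):
    """Pair each sense ID with its match score and return the best sense."""
    best_sense, scores = result
    labelled = {sense_id: score for (sense_id, _), score in zip(senses, scores)}
    return best_sense, labelled

# Values copied from the example in the text above.
senses = [("table#1", 1.0), ("table#2", 1.0), ("table#3", 1.0), ("table#4", 1.0)]
result = ("table#2", [0.2706353009709064, 0.9591583572384959,
                      0.40617065436041355, 0.6940131864117054])

best, labelled = label_scores(senses, result)
for sense_id, score in labelled.items():
    marker = " <- best" if sense_id == best else ""
    print(f"{sense_id}: {score:.3f}{marker}")
```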
<hr> 

In [8]:
corpus_data.find("pizza")

24929

#### Generate sense embeddings from corpus

In [17]:
def prepare_corpus_data_with_context(corpus_filepath, window=5, force_update=False):
    nlp = spacy.load('en')
    # Use the function argument rather than the global corpus_fpath.
    with open(corpus_filepath, "r") as f:
        corpus_data = f.readlines()
    out_file = os.path.join(os.path.dirname(corpus_filepath), f"window_{window}_context_corpus.txt")
    contextual_data = []
    if os.path.exists(out_file) and (not force_update):
        print("Found preexisting file. Loading context data from file")
        with open(out_file, "r") as f:
            # Strip trailing newlines so cached rows match freshly generated ones.
            contextual_data = [line.rstrip("\n") for line in f]
    else:
        print("No pre created corpus found, or force update flag is true. Generating contextual data from corpus")
        with open(out_file, "w") as f:
            for txt_line in corpus_data:
                nlp_data = nlp(txt_line.replace("\n", ""))
                max_tokens = len(nlp_data)
                for i, tok in enumerate(nlp_data):
                    if "NN" in tok.tag_:
                        start = max(0, i-window)
                        # +1 so the slice [i+1:end] holds `window` tokens, matching the left side.
                        end = min(i + window + 1, max_tokens)
                        left_context = [t.text for t in nlp_data[start:i]] + [f"<{tok.text}>"]
                        right_context = [t.text for t in nlp_data[i+1:end]]
                        noun_in_context = f"<{tok.text}> - {' '.join(left_context + right_context)}"
                        contextual_data.append(noun_in_context)
                        f.write(noun_in_context + "\n")
    return contextual_data


In [25]:
def get_sense_group_from_corpus(context_data, wsd_model, sense_vectors, output_file):
    output = {
        "word": [],
        "context": [],
        "sense_id": [],
        "sense_group_name": [],
        "sense_group_num": [],
        "sense_probability": [],
        "related_senses": []
    }
    
    for row in tqdm(context_data, total=len(context_data)):
        # Split only on the first ' - ' so a context containing ' - ' does not break unpacking.
        word, ctx = row.split(' - ', 1)
        sense_id, sense_probs = wsd_model.disambiguate(ctx, word)
        sense_probability = max(sense_probs)
        sense_group_name, sense_group_num = sense_id.split("#")
        try:
            related_senses = [r_senseid for r_senseid,_ in sense_vectors.wv.most_similar(sense_id)]
            related_senses_l2 = [r_senseid for related_sense in related_senses for r_senseid,_ in sense_vectors.wv.most_similar(related_sense)]
        except KeyError as e:
            print(f"Could not get related senses for {sense_id}")
            related_senses = []
            related_senses_l2 = []
        output["word"].append(word)
        output["context"].append(ctx)
        output["sense_id"].append(sense_id)
        output["sense_group_name"].append(sense_group_name)
        output["sense_group_num"].append(sense_group_num)
        output["sense_probability"].append(sense_probability)
        output["related_senses"].append(related_senses+related_senses_l2)
    
    output_df = pd.DataFrame(output)
    output_df.to_csv(output_file, index=False)
    
    return output_df

In [12]:
window = 15
corpus_data_sense_mappings_file = os.path.join(data_directory, "corpus_sense_mapping.csv")

In [13]:
if not os.path.exists(corpus_data_sense_mappings_file):
    wsd_model2 = WSD(sense_vectors, word_vectors, window=window, method='prob', verbose=False)
    contextual_data = prepare_corpus_data_with_context(corpus_fpath, window=window, force_update=False)
    sense_group_dataframe = get_sense_group_from_corpus(contextual_data, wsd_model2, sense_vectors, corpus_data_sense_mappings_file)
else:
    sense_group_dataframe = pd.read_csv(corpus_data_sense_mappings_file)

In [14]:
sense_group_dataframe.head()

Unnamed: 0,word,context,sense_id,sense_group_name,sense_group_num,sense_probability,related_senses
0,<english>,can not stop listening to what have the <engli...,english#6,english,6,0.962645,"['English#6', 'ENGLISH#6', 'bengali#1', 'Sinha..."
1,<saturday>,can not stop listening to what have the englis...,saturday#1,saturday,1,0.985751,"['Saturday#2', 'SATURDAY#2', 'SUNDAY#3', 'sund..."
2,<bit>,not stop listening to what have the english by...,bit#6,bit,6,0.98774,"['Bit#5', 'BIT#5', 'stuff#3', 'Stuff#2', 'Real..."
3,<one>,the <one> there is amazing had such a laugh wh...,one#3,one,3,0.969355,"['One#3', 'ONE#3', 'List#10', 'Top#8', 'TOP#9'..."
4,<laugh>,the one there is amazing had such a <laugh> wh...,laugh#1,laugh,1,0.914474,"['LAUGH#1', 'Laugh#2', 'giggle#1', 'Giggle#1',..."


-----

#### Hypernymy extraction

In [15]:
with open(corpus_fpath, "r") as f:
    corpus_data = f.read()

In [16]:
weights = {
    "direct_match": 8,
    "direct_nomatch": 5,
    "l1_match": 3,
    "l1_nomatch": 2,
    "l2_match": 1,
    "l2_nomatch": 0.5,
}

nlp = spacy.load("en")
nlp.add_pipe(WordnetAnnotator(nlp.lang), after='tagger')

In [17]:
def generate_hypernymy(word, idx, hypernymy_dict, corpus, level="l1"):
    token = nlp(str(word))[0]
    synsets = token._.wordnet.synsets()
    for syn in synsets:
        # idx may arrive as a string (e.g. from sense_id.split("#")), so cast it before comparing.
        if syn.name().split(".n.")[0] == word and int(syn.name().split(".n.")[1]) == int(idx):
            match_str = "match"
        else:
            match_str = "nomatch"

        hypernym_syn = syn.hypernyms()
        hyponym_syn = syn.hyponyms()
        meronym_syn = syn.part_meronyms()
        if len(hypernym_syn) >0:
            hypernym = hypernym_syn[0].name().split('.')[0]
            rev_map = {
                f"{word}#{idx}":syn,
                "hyponyms": [hyp.name().split(".")[0] for hyp in hyponym_syn if hyp.name().split(".")[0]],
                "meronyms": [mero.name().split(".")[0] for mero in meronym_syn if mero.name().split(".")[0]]
            }
            if level in hypernymy_dict[hypernym]["rev_map"]:
                hypernymy_dict[hypernym]["rev_map"][level].append(rev_map) 
            else:
                hypernymy_dict[hypernym]["rev_map"][level] = [rev_map]

            if "sum_weight" in hypernymy_dict[hypernym]:
                hypernymy_dict[hypernym]["sum_weight"] += weights[f"{level}_{match_str}"]
            else:
                hypernymy_dict[hypernym]["sum_weight"] = weights[f"{level}_{match_str}"]
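
The weighting scheme can be illustrated without WordNet. In the sketch below, each "vote" stands for one synset whose first hypernym was found at a given level; the votes themselves are invented for illustration, and `rank_hypernyms` is a hypothetical helper that mirrors only the `sum_weight` accumulation and the later sort, not the full function above.

```python
from collections import defaultdict

# Same values as the weights cell; renamed so the notebook's `weights` is untouched.
level_weights = {
    "direct_match": 8, "direct_nomatch": 5,
    "l1_match": 3, "l1_nomatch": 2,
    "l2_match": 1, "l2_nomatch": 0.5,
}

def rank_hypernyms(votes):
    """Sum the weight of every vote per hypernym and sort by total, highest first."""
    totals = defaultdict(float)
    for hypernym, level, matched in votes:
        totals[hypernym] += level_weights[f"{level}_{'match' if matched else 'nomatch'}"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Invented votes: (hypernym, level, whether the synset matched the sense number).
votes = [
    ("dish", "direct", False),
    ("sandwich", "l1", True),
    ("sandwich", "l1", False),
    ("sandwich", "l1", False),
]
ranked = rank_hypernyms(votes)
print(ranked)
```

Here several weaker votes from related senses outweigh a single direct vote, which is how a hypernym suggested mostly by related senses can end up ranked first.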

In [18]:
def extract_hyponyms_meronym(hypernymy_dict, hypernym, kind="hyponyms"):
    nym = []
    reverse_map = hypernymy_dict[hypernym]['rev_map']
    extract_from = []
    if "direct" in reverse_map:
        extract_from.append("direct")
    if "l1" in reverse_map:
        extract_from.append("l1")
    
    for relation in extract_from:
        num_relations = len(reverse_map[relation])
        for i in range(num_relations):
            nym += reverse_map[relation][i][kind]
            
    return list(set(nym))
            

In [24]:
all_hypernyms = []
all_hhm_map = []
all_hyponyms = []
all_meronyms = []

all_hypernyms_from_corpus = []
all_hyponyms_from_corpus = []
all_meronyms_from_corpus = []
num_words = len(sense_group_dataframe)
for _, row in tqdm(sense_group_dataframe[["sense_group_name", "sense_group_num", "related_senses"]].iterrows(), total=num_words):
    word, idx, r_senses = row
    
    hypernymy_dict = defaultdict(lambda: defaultdict(dict))
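    generate_hypernymy(word, idx, hypernymy_dict, corpus_data, level="direct")
    # related_senses is a string when loaded from CSV and a list when freshly computed.
    r_senses_list = ast.literal_eval(r_senses) if isinstance(r_senses, str) else r_senses
    for w in r_senses_list:
        related_word, related_idx = str(w).split("#")
        if related_word in corpus_data:
            # Argument order must match the signature: (word, idx, hypernymy_dict, corpus, level).
            generate_hypernymy(related_word, related_idx, hypernymy_dict, corpus_data, level="l1")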
    if len(hypernymy_dict)>0:
        hypernymy_dict_sorted = {k: v for k, v in sorted(hypernymy_dict.items(), key=lambda item: item[1]['sum_weight'], reverse=True)}
        
        possible_hypernyms = list(hypernymy_dict_sorted.keys())
        hypernym_name = possible_hypernyms[0]
        hyponyms = extract_hyponyms_meronym(hypernymy_dict_sorted, hypernym_name, kind="hyponyms")
        meronyms = extract_hyponyms_meronym(hypernymy_dict_sorted, hypernym_name, kind="meronyms")
        
        all_hypernyms.append(hypernym_name)
        all_hyponyms.append(hyponyms)
        all_meronyms.append(meronyms)
        
        
        hypernym_from_corpus = None
        hyponyms_from_corpus = []
        meronyms_from_corpus = []
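        # possible_hypernyms is sorted by descending weight, so keep the
        # highest-weighted hypernym that also appears in the corpus text.
        for hypernym in possible_hypernyms:
            if hypernym in corpus_data:
                hypernym_from_corpus = hypernym
                break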
        if hypernym_from_corpus:
            hyponyms += extract_hyponyms_meronym(hypernymy_dict_sorted, hypernym_from_corpus, kind="hyponyms")
            meronyms += extract_hyponyms_meronym(hypernymy_dict_sorted, hypernym_from_corpus, kind="meronyms")
            
        for hypo in hyponyms:
            if hypo in corpus_data:
                hyponyms_from_corpus.append(hypo)
                
        for mero in meronyms:
            if mero in corpus_data:
                meronyms_from_corpus.append(mero)
        
        
        all_hypernyms_from_corpus.append(hypernym_from_corpus)
        all_hyponyms_from_corpus.append(hyponyms_from_corpus)
        all_meronyms_from_corpus.append(meronyms_from_corpus)
        all_hhm_map.append(hypernymy_dict_sorted)
    else:
        all_hypernyms.append(None)
        all_hyponyms.append([])
        all_meronyms.append([])
        # No hypernym found: use None, as in the branch above, rather than a list.
        all_hypernyms_from_corpus.append(None)
        all_hyponyms_from_corpus.append([])
        all_meronyms_from_corpus.append([])
        all_hhm_map.append(hypernymy_dict)

100%|██████████| 1351/1351 [21:46<00:00,  1.03it/s]


In [25]:
sense_group_dataframe["hhm_map"] = all_hhm_map
sense_group_dataframe["hypernym"] = all_hypernyms
sense_group_dataframe["hyponym"] = all_hyponyms
sense_group_dataframe["meronym"] = all_meronyms
sense_group_dataframe["hypernym_from_corpus"] = all_hypernyms_from_corpus
sense_group_dataframe["hyponym_from_corpus"] = all_hyponyms_from_corpus
sense_group_dataframe["meronym_from_corpus"] = all_meronyms_from_corpus

In [26]:
hhm_mappings_file = os.path.join(data_directory, "hhm_mappings.csv")
sense_group_dataframe.to_csv(hhm_mappings_file)

In [27]:
sense_group_dataframe

Unnamed: 0,word,context,sense_id,sense_group_name,sense_group_num,sense_probability,related_senses,hhm_map,hypernym,hyponym,meronym,hypernym_from_corpus,hyponym_from_corpus,meronym_from_corpus
0,<english>,can not stop listening to what have the <engli...,english#6,english,6,0.962645,"['English#6', 'ENGLISH#6', 'bengali#1', 'Sinha...",{'sanskrit': {'rev_map': {'l1': [{'Sinhala#3':...,sanskrit,[hindustani],[],indic,[],[]
1,<saturday>,can not stop listening to what have the englis...,saturday#1,saturday,1,0.985751,"['Saturday#2', 'SATURDAY#2', 'SUNDAY#3', 'sund...",{'weekday': {'rev_map': {'direct': [{'saturday...,weekday,"[whit-tuesday, whitmonday]",[],,[],[]
2,<bit>,not stop listening to what have the english by...,bit#6,bit,6,0.987740,"['Bit#5', 'BIT#5', 'stuff#3', 'Stuff#2', 'Real...",{'fragment': {'rev_map': {'direct': [{'bit#6':...,fragment,"[matchwood, scale, scurf, splinter, admit]",[],spend,[admit],[]
3,<one>,the <one> there is amazing had such a laugh wh...,one#3,one,3,0.969355,"['One#3', 'ONE#3', 'List#10', 'Top#8', 'TOP#9'...",{'be': {'rev_map': {'l1': [{'remaining#2': Syn...,be,"[keep_out, sit_tight, stick_together, bide, ou...",[],name,"[be, stand, make]",[]
4,<laugh>,the one there is amazing had such a <laugh> wh...,laugh#1,laugh,1,0.914474,"['LAUGH#1', 'Laugh#2', 'giggle#1', 'Giggle#1',...",{'laugh': {'rev_map': {'l1': [{'giggle#1': Syn...,laugh,[],[],moment,[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1346,<duck>,you lucky <duck>\n,duck#2,duck,2,0.869252,"['toucan#2', 'Toucan#2', 'frog#2', 'FROG#2', '...",{'have': {'rev_map': {'l1': [{'bear#1': Synset...,have,[carry],[],persuade,[],[]
1347,<slices>,not really I just had two <slices> of leftover...,slices#1,slices,1,0.918132,"['Slices#1', 'bite_sized#1', 'fillets#1', 'Jul...",{'cook': {'rev_map': {'l1': [{'fry#5': Synset(...,cook,"[deep-fat-fry, pan-fry, frizzle, saute, stir_f...",[],walk,[],[]
1348,<leftover>,not really I just had two slices of <leftover>...,leftover#1,leftover,1,0.936025,"['Leftover#1', 'Scraps#1', 'scraps#1', 'scaven...",{'waste': {'rev_map': {'l1': [{'Scraps#1': Syn...,waste,"[litter, debris, scrap_metal]",[],mass,[],[]
1349,<pizza>,not really I just had two slices of leftover h...,pizza#2,pizza,2,0.885774,"['hamburger#1', 'Chicken_Sandwich#1', 'chicken...",{'sandwich': {'rev_map': {'l1': [{'hamburger#1...,sandwich,"[chili_dog, cheeseburger]","[ground_beef, frankfurter_bun, frank]",potato,[],[]
