## Use sense vectors to identify word sense

#### Library Imports

In [27]:
import os
import sys
import time
from gensim.models import KeyedVectors
import spacy
from tqdm import tqdm
import pandas as pd

Append the GitHub repo to the system path, and import modules from it.

In [2]:
sys.path.append("sensegram_package/")
import sensegram
from wsd import WSD

---------

#### Load data

In [3]:
data_directory = os.path.join(os.getcwd(), "data")
corpus_fpath = os.path.join(data_directory, "corpus.txt")
sense_vectors_fpath = os.path.join(data_directory, "model", "wiki.txt.clusters.minsize5-1000-sum-score-20.sense_vectors")
word_vectors_fpath = os.path.join(data_directory, "model", "wiki.txt.word_vectors")

Load the sense and word vector files. This may take some time because the vector files are large.

In [4]:
s = time.time()
if os.path.exists(sense_vectors_fpath) and os.path.exists(word_vectors_fpath):
    sense_vectors = sensegram.SenseGram.load_word2vec_format(sense_vectors_fpath, binary=False)
    word_vectors = KeyedVectors.load_word2vec_format(word_vectors_fpath, binary=False, unicode_errors="ignore")
    print(f"Took {time.time()-s}seconds to load vector files")
else:
    print("Could not find vector files. Check the file paths and ensure the right files exist")
del s

print("Reading the corpus now!")
with open(corpus_fpath, "r") as f:
    corpus_data = f.read()

Took 1278.5607526302338seconds to load vector files
Reading the corpus now!


#### Get all senses of a word

Using the sense vectors, look up all available senses of a given word.  
The output first prints every sense of the word with its probability, and then, for each sense `Word#<sense-number>`, the most similar senses together with their similarity scores. These neighbours help us give logical names to the different sense groups. For example, running the code for the word "**table**" gives the following output -  
```
Probabilities of the senses:
[('Table#1', 1.0), ('Table#2', 1.0), ('Table#3', 1.0), ('Table#4', 1.0), ('table#1', 1.0), ('table#2', 1.0), ('table#3', 1.0), ('table#4', 1.0)]


Table#1
====================
table#1 0.996316
TABLE#1 0.993647
PAGE#2 0.989991
page#2 0.989991
WINDOW#2 0.989900
Window#3 0.989900
window#2 0.989900
Scale#2 0.989745
scale#2 0.989745
SCALE#2 0.989745


Table#2
====================
TABLE#2 1.000000
Row#3 0.869726
row#3 0.869726
ROW#3 0.856643
Stack#3 0.829349
Box#3 0.826571
BOX#2 0.826571
stack#3 0.825068
STACK#3 0.824239
BOWL#3 0.813412


Table#3
====================
TABLE#3 0.939938
table#3 0.934190
Boundary_Markers#5 0.845184
Catchment_Basins#2 0.826906
contents#2 0.825448
CONTENTS#2 0.825448
Contents#2 0.824324
tables#1 0.806271
NUMBERS#3 0.804637
Tables#1 0.796628
....

```
A few things we can see from the output - 
* Since we have set *ignore_case=True*, the output shows 4 senses for *Table* and 4 for *table*.
* Looking at the related words for each sense, we can assign the following logical groups to a few of the senses - 
    - Table#2 - Data table.  
    - Table#3 - Table of contents.
    - table#4 - Hotel/Furniture.
    


In [5]:
test_word = "disk"

print("Probabilities of the senses:\n{}\n\n".format(sense_vectors.get_senses(test_word, ignore_case=False)))

for sense_id, prob in sense_vectors.get_senses(test_word, ignore_case=True):
    print(sense_id)
    print("="*20)
    for rsense_id, sim in sense_vectors.wv.most_similar(sense_id):
        print("{} {:f}".format(rsense_id, sim))
    print("\n")

Probabilities of the senses:
[('disk#1', 1.0), ('disk#2', 1.0), ('disk#3', 1.0), ('disk#4', 1.0), ('disk#5', 1.0)]


Disk#1
Flexible_Disk#2 0.924806
Rigid_Disk#2 0.916348
Basic_Input#1 0.899887
RDOS#1 0.893633
SASI#1 0.891424
Sasi#1 0.891424
corrector#4 0.889009
customizer#2 0.888149
Fargo#3 0.886290
fargo#2 0.886290


Disk#2
DISK#2 0.991812
BOOT#4 0.958993
Bit#4 0.934779
disk#2 0.931051
module#3 0.923883
Module#3 0.921607
MODULE#3 0.916528
Disks#3 0.909954
rom#1 0.903984
hard_disk#1 0.903680


Disk#3
disk#3 1.000000
DISK#3 1.000000
urn#4 0.960472
URN#4 0.960472
Urn#4 0.960472
EXEC#4 0.958112
Coco#4 0.957627
CoCo#2 0.957627
coco#3 0.957627
COCO#2 0.957627


Disk#4
DISK#4 1.000000
disk#4 0.991975
Envelope#1 0.931414
Plate#1 0.922727
Stack#3 0.906151
STACK#3 0.904850
Shelf#4 0.900630
stack#3 0.897358
Bubble#4 0.896194
CHAIN#4 0.895439


disk#1
DISK#1 0.995354
Modem#2 0.961432
install#3 0.960586
SYNC#1 0.960527
sync#1 0.960527
Install#1 0.959036
Handheld#2 0.958145
laptop#1 0.956479
Lapto

#### Get disambiguated sense of the word, using corpus as context

##### Input
To determine a word's sense in a given context, we use the *WSD* class from the sensegram library.  
The WSD model takes the following key parameters to decide the word sense based on the corpus context - 
* vectors - Both the sense and word vector models loaded earlier.  
  
  
* method - To pick the sense of the word, the library averages the scores of the surrounding context words and compares the result with the different senses of the target word. Two metrics are available for this comparison - 
 - sim: Uses cosine similarity
 - prob: Uses a log-probability score  
  
  
* window - The number of words (±) on each side of the target word that the model uses as context.   
For example, if our target word is *table*,   
with the context *"we load our data into a data-frame table object and count the number of rows/columns using the .shape method"*  
 1. a window of 3 would use the following 6 words (3 on the left and 3 on the right) around the target word - *into, a, data-frame, object, and, count*  
 2. a window of 5 would use the following 10 words - *our, data, into, a, data-frame, object, and, count, the, number*  
  
  
* verbose - Prints intermediate outputs while running the disambiguation code
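
To make the *sim* idea concrete, here is a minimal sketch with made-up 3-dimensional vectors. It is not sensegram's implementation: the helper names (`cosine_similarity`, `pick_sense_by_similarity`), the toy vectors, and the two sense ids are assumptions chosen for illustration only. It averages the context word vectors and picks the sense whose vector has the highest cosine similarity to that average.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def pick_sense_by_similarity(context_vectors, sense_vectors_toy):
    """Average the context vectors, then score every sense against that average."""
    context_mean = np.mean(context_vectors, axis=0)
    scores = {
        sense_id: cosine_similarity(context_mean, vec)
        for sense_id, vec in sense_vectors_toy.items()
    }
    best_sense = max(scores, key=scores.get)
    return best_sense, scores


# Hypothetical vectors: axis 0 ~ "data", axis 1 ~ "furniture", axis 2 ~ noise.
context_vectors = np.array([
    [0.9, 0.1, 0.0],   # e.g. "data-frame"
    [0.8, 0.0, 0.1],   # e.g. "rows"
    [0.7, 0.2, 0.1],   # e.g. "columns"
])
sense_vectors_toy = {
    "table#2": np.array([1.0, 0.1, 0.0]),   # data-table-like sense
    "table#4": np.array([0.1, 1.0, 0.0]),   # furniture-like sense
}

best_sense, scores = pick_sense_by_similarity(context_vectors, sense_vectors_toy)
print(best_sense)
for sense_id, score in scores.items():
    print(f"{sense_id}\t{score:.3f}")
```

With these toy numbers the data-like context lands closest to the data-table sense. The *prob* method would replace the cosine score with a log-probability score, which this sketch does not attempt to reproduce.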

<hr>  
     
Some food for thought regarding the usage of the WSD module - 
 - Note that while stopwords like *and* and *the* count towards the context window of the target word, they are dropped when disambiguating the sense of the target.
 - While a large window may seem ideal for capturing the sense of the target word, a wider window can produce a less accurate result, as it averages across a broader context that may mix several senses.
 - The library considers, and disambiguates, only the first occurrence of the target word in the context. For a large corpus, it is better to first split the corpus into per-occurrence contexts using an external helper function, and then get the sense of the target word for each occurrence in turn.
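
The last point can be sketched as follows. This is an illustrative helper, not part of sensegram: the function name `snippets_for_all_occurrences`, the whitespace tokenization, and the sample text are all assumptions. It produces one text snippet per occurrence of the target word, each holding the target plus up to `window` tokens on either side.

```python
def snippets_for_all_occurrences(text, target, window):
    """Yield one snippet (target plus up to `window` tokens per side) per occurrence."""
    tokens = text.split()  # naive whitespace tokenization, for illustration only
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            yield " ".join(tokens[start:end])


text = ("we load our data into a data-frame table object and count "
        "the number of rows/columns . the waiter led us to a table near the window")
snippets = list(snippets_for_all_occurrences(text, "table", window=3))
for snippet in snippets:
    print(snippet)
```

Each snippet could then be passed, together with the target word, to `wsd_model.disambiguate` in turn. If two occurrences fall inside the same window, the snippet contains the target twice, so such cases would need extra handling.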

In [6]:
wsd_model = WSD(sense_vectors, word_vectors, window=15, method='prob', verbose=True)

In [7]:
print(wsd_model.disambiguate(corpus_data, "pizza"))

Extracted context words:
['know', 'hare', 'street', 'hope', 'overexcited', 'dont', 'er', 'richard', 'still', 'aaw', 'say', 'welcome', 'open', 'poppet', 'closed', 'gif', 'tortoise', 'youre', 'got']
Senses of a target word:
[('pizza#1', 1.0), ('pizza#2', 1.0)]
Significance scores of context words:
[0.29148820525520214, 0.12975690243744742, 0.02722264833952448, 0.11942657176932528, 0.009472310267542472, 0.16188559353799392, 0.1438541919856201, 0.13226385889523357, 0.33827481693934414, 0.01728681152264855, 0.3557976824709924, 0.06579240374490092, 0.11683631808681638, 0.07629630608727012, 0.4041725965133941, 0.025504092740760764, 0.16010984464621203, 0.04979557551486591, 0.34149093466603153]
Context words:
closed	0.404
say	0.356
got	0.341
('pizza#2', [0.7334760438909177, 0.7729272994918818])


##### Output

Running the sense disambiguation code produces the following output -  
1. Prints the context words extracted from the corpus.
2. Prints the possible senses of the word, with their respective probabilities (without considering the context).
3. Prints the significance score of each context word.
4. Prints the most significant context words.
5. **Returns** a tuple containing the sense of the word as derived from the context, and the match scores (log-probability or cosine similarity, depending on the *method* chosen) of all senses of the target word.  
For instance, the output *('table#2', [0.2706353009709064, 0.9591583572384959, 0.40617065436041355, 0.6940131864117054])* indicates the following about our target word -  
    - The closest sense of our target word is *table#2*, with a match score of 0.959 (second in the list).
    - The match scores of all senses, in order, can be read as follows - 
        - table#1 - 0.2706
        - table#2 - 0.959
        - table#3 - 0.406
        - table#4 - 0.694
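
The returned scores can be paired with their sense ids and ranked. The snippet below is a minimal sketch that reuses the numbers from the example above; the helper `rank_senses` is hypothetical (not part of sensegram), and it assumes the scores are listed in the same order as the senses of the target word.

```python
def rank_senses(sense_ids, scores):
    """Pair each sense id with its score and sort from best to worst match."""
    return sorted(zip(sense_ids, scores), key=lambda pair: pair[1], reverse=True)


# Values copied from the example output; the sense order is an assumption.
sense_ids = ["table#1", "table#2", "table#3", "table#4"]
scores = [0.2706353009709064, 0.9591583572384959,
          0.40617065436041355, 0.6940131864117054]

ranked = rank_senses(sense_ids, scores)
for sense_id, score in ranked:
    print(f"{sense_id}\t{score:.3f}")
```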

Since we did not pass the *ignore_case* argument while initializing the WSD model, it falls back to its default of False, and the output returns scores for the 4 senses of the word *table*.  
If we chose to ignore case, the output would contain match scores for 8 senses (4 for *Table* and 4 for *table*).
<hr> 

In [8]:
corpus_data.find("pizza")

24929

#### Generate sense embeddings from corpus

In [17]:
def prepare_corpus_data_with_context(corpus_filepath, window=5, force_update=False):
    out_file = os.path.join(os.path.dirname(corpus_filepath), f"window_{window}_context_corpus.txt")
    contextual_data = []
    if os.path.exists(out_file) and (not force_update):
        print("Found preexisting file. Loading context data from file")
        with open(out_file, "r") as f:
            # Strip trailing newlines so cached rows match freshly generated ones
            contextual_data = [line.rstrip("\n") for line in f]
    else:
        print("No pre-created corpus found, or force update flag is true. Generating contextual data from corpus")
        # Load spaCy and the corpus only when contexts actually have to be generated.
        # 'en' is the legacy shortcut name; on spaCy 3+ use a full model name such as 'en_core_web_sm'.
        nlp = spacy.load('en')
        with open(corpus_filepath, "r") as f:
            corpus_data = f.readlines()
        with open(out_file, "w") as f:
            for txt_line in corpus_data:
                nlp_data = nlp(txt_line.replace("\n", ""))
                max_tokens = len(nlp_data)
                for i, tok in enumerate(nlp_data):
                    # Penn Treebank noun tags: NN, NNS, NNP, NNPS
                    if "NN" in tok.tag_:
                        start = max(0, i-window)
                        # +1 so that `window` tokens are kept to the right of the noun as well
                        end = min(i+window+1, max_tokens)
                        left_context = [t.text for t in nlp_data[start:i]] + [f"<{tok.text}>"]
                        right_context = [t.text for t in nlp_data[i+1:end]]
                        noun_in_context = f"<{tok.text}> - {' '.join(left_context + right_context)}"
                        contextual_data.append(noun_in_context)
                        f.write(noun_in_context + "\n")
    return contextual_data


In [25]:
def get_sense_group_from_corpus(context_data, wsd_model, sense_vectors, output_file):
    output = {
        "word": [],
        "context": [],
        "sense_id": [],
        "sense_group_name": [],
        "sense_group_num": [],
        "sense_probability": [],
        "related_senses": []
    }
    
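    for row in tqdm(context_data, total=len(context_data)):
        # Each row looks like "<word> - left context <word> right context".
        # Split only on the first separator, since the context itself may contain " - ".
        word, ctx = row.rstrip("\n").split(' - ', 1)
        sense_id, sense_probs = wsd_model.disambiguate(ctx, word)
        sense_probability = max(sense_probs)
        sense_group_name, sense_group_num = sense_id.split("#")
        try:
            # Nearest neighbours of the chosen sense, then the neighbours of those neighbours
            related_senses = [r_senseid for r_senseid,_ in sense_vectors.wv.most_similar(sense_id)]
            related_senses_l2 = [r_senseid for related_sense in related_senses for r_senseid,_ in sense_vectors.wv.most_similar(related_sense)]
        except KeyError:
            # Raised when the sense id is missing from the sense vector vocabulary
            print(f"Could not get related senses for {sense_id}")
            related_senses = []
            related_senses_l2 = []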
        output["word"].append(word)
        output["context"].append(ctx)
        output["sense_id"].append(sense_id)
        output["sense_group_name"].append(sense_group_name)
        output["sense_group_num"].append(sense_group_num)
        output["sense_probability"].append(sense_probability)
        output["related_senses"].append(related_senses+related_senses_l2)
    
    output_df = pd.DataFrame(output)
    output_df.to_csv(output_file, index=False)
    
    return output_df

In [24]:
window = 15
corpus_data_sense_mappings_file = os.path.join(data_directory, "corpus_sense_mapping.csv")
wsd_model2 = WSD(sense_vectors, word_vectors, window=window, method='prob', verbose=False)

In [28]:
contextual_data = prepare_corpus_data_with_context(corpus_fpath, window=window, force_update=False)
sense_group_dataframe = get_sense_group_from_corpus(contextual_data, wsd_model2, sense_vectors, corpus_data_sense_mappings_file)

Found preexisting file. Loading context data from file
Could not get related senses for ohgod#0
Could not get related senses for myselfhehe#0
Could not get related senses for winnerrr#0
Could not get related senses for clements#0
Could not get related senses for derren#0
Could not get related senses for aaaah#0
Could not get related senses for ladsgirlie#0
Could not get related senses for daaaaaaaay#0
Could not get related senses for hallo#0
Could not get related senses for itches#0
Could not get related senses for boffin#0
Could not get related senses for cisco#0
Could not get related senses for derren#0
Could not get related senses for kew#0
Could not get related senses for brentford#0
Could not get related senses for babenothing#0
Could not get related senses for croque#0
Could not get related senses for dauphinoise#0
Could not get related senses for monmouth#0
Could not get related senses for teeny#0
Could not get related senses for selleck#0
Could not get related senses for naaaturaaalwomaaaaan#0
Could not get related senses for fairmy#0
Could not get related senses for nowim#0
Could not get related senses for miserables#0
Could not get related senses for poon#0
Could not get related senses for errrfluffy#0
Could not get related senses for woop#0
Could not get related senses for woop#0
Could not get related senses for byeeee#0
Could not get related senses for robb#0
Could not get related senses for mackie#0
Could not get related senses for twusband#0
Could not get related senses for failwhale#0
Could not get related senses for parmo#0
Could not get related senses for roomie#0
Could not get related senses for mackie#0
Could not get related senses for alicia#0
Could not get related senses for bellybutton#0
Could not get related senses for somethingin#0
Could not get related senses for erggh#0
Could not get related senses for jackety#0
Could not get related senses for cosette#0
Could not get related senses for marius#0
Could not get related senses for heymy#0
Could not get related senses for twitfolk#0
Could not get related senses for hmmm#0
Could not get related senses for sugartits#0
Could not get related senses for ealing#0
Could not get related senses for chalkley#0
Could not get related senses for mondayergh#0
Could not get related senses for aghost#0
Could not get related senses for chalkley#0
Could not get related senses for gooood#0
Could not get related senses for robb#0
Could not get related senses for tshirt#0
Could not get related senses for wooah#0
Could not get related senses for robb#0
Could not get related senses for oooh#0
Could not get related senses for myselfany#0
Could not get related senses for teadrinker#0
Could not get related senses for indicatives#0
Could not get related senses for alist#0
Could not get related senses for mannington#0
Could not get related senses for bowes#0
Could not get related senses for beckhams#0
Could not get related senses for colmans#0
Could not get related senses for dane#0
Could not get related senses for darl#0
Could not get related senses for twusband#0
Could not get related senses for techdress#0
Could not get related senses for headshimmyingim#0
Could not get related senses for everoh#0
Could not get related senses for heartcould#0
Could not get related senses for meand#0
Could not get related senses for mackie#0
Could not get related senses for jermaine#0
Could not get related senses for deffo#0
Could not get related senses for scooch#0
Could not get related senses for hehehehehehehe#0
Could not get related senses for critter#0
Could not get related senses for postshower#0
Could not get related senses for annieboob#0
Could not get related senses for swhat#0
Could not get related senses for aparo#0
Could not get related senses for yumyums#0
Could not get related senses for dickheads#0
Could not get related senses for nonnnnje#0
Could not get related senses for regrette#0
Could not get related senses for regrette#0
Could not get related senses for shhhhhhhhhhhh#0
Could not get related senses for queeeeeeeeeeeeen#0
Could not get related senses for hehe#0
Could not get related senses for lovee#0
Could not get related senses for youu#0
Could not get related senses for seb#0
Could not get related senses for valentino#0
Could not get related senses for whilex#0
Could not get related senses for poon#0
Could not get related senses for woohoo#0
Could not get related senses for chortleclassic#0
Could not get related senses for londonvery#0
Could not get related senses for doris#0
Could not get related senses for twitfolk#0
Could not get related senses for shurrup#0
Could not get related senses for wheeeeeeeeeee#0
Could not get related senses for hommosexual#0
Could not get related senses for lifeorganisation#0
Could not get related senses for hoover#0
Could not get related senses for hoover#0
Could not get related senses for halfasleep#0
Could not get related senses for tesco#0
Could not get related senses for welshman#0
Could not get related senses for joni#0
Could not get related senses for shhhhhhhhhhhhhhhhhhhhhhhhh#0
Could not get related senses for hahaha#0
Could not get related senses for debbie#0
Could not get related senses for hmmm#0
Could not get related senses for questors#0
Could not get related senses for minchin#0
Could not get related senses for megashift#0
Could not get related senses for ohgod#0
Could not get related senses for wless#0
Could not get related senses for mcdonalds#0
Could not get related senses for carluccios#0
Could not get related senses for heathrow#0
Could not get related senses for theeeres#0
Could not get related senses for theeeres#0
Could not get related senses for ooonnllyy#0
Could not get related senses for yayayayayay#0
Could not get related senses for onlygirlinthehouse#0
Could not get related senses for tweeniesi#0
Could not get related senses for erdr#0
Could not get related senses for lawnhow#0
Could not get related senses for remarkableand#0
Could not get related senses for bluuuueeee#0
Could not get related senses for tweeps#0
Could not get related senses for stumbleupon#0
Could not get related senses for covent#0
Could not get related senses for markfrancis#0
Could not get related senses for monmouth#0
Could not get related senses for owch#0
Could not get related senses for verr#0
Could not get related senses for thth#0
Could not get related senses for mackie#0
Could not get related senses for annie#0
Could not get related senses for mcfly#0
Could not get related senses for dline#0
Could not get related senses for momentboo#0
Could not get related senses for shwmae#0
Could not get related senses for shhhit#0
Could not get related senses for twitfolk#0
Could not get related senses for mcfly#0
Could not get related senses for yeaaah#0
Could not get related senses for juno#0
Could not get related senses for jacky#0
Could not get related senses for covent#0
Could not get related senses for thth#0
Could not get related senses for aww#0
Could not get related senses for amazeballs#0
Could not get related senses for annie#0
Could not get related senses for tesco#0
Could not get related senses for deliverys#0
Could not get related senses for expresscarluccios#0
Could not get related senses for carluccios#0
Could not get related senses for yetwhats#0
Could not get related senses for awww#0
Could not get related senses for showermuscles#0
Could not get related senses for twifey#0
Could not get related senses for meee#0
Could not get related senses for spektor#0
Could not get related senses for pina#0
Could not get related senses for coladas#0
Could not get related senses for kittycat#0
Could not get related senses for rebecca#0
Could not get related senses for hahaha#0
Could not get related senses for workpretty#0
Could not get related senses for milgram#0
Could not get related senses for jedward#0
100%|██████████| 1351/1351 [13:21<00:00,  1.69it/s]


In [29]:
sense_group_dataframe.head()

|   | word | context | sense_id | sense_group_name | sense_group_num | sense_probability | related_senses |
|---|------|---------|----------|------------------|-----------------|-------------------|----------------|
| 0 | `<english>` | `can not stop listening to what have the <engli...` | english#6 | english | 6 | 0.962645 | `[English#6, ENGLISH#6, bengali#1, Sinhala#3, k...` |
| 1 | `<saturday>` | `can not stop listening to what have the englis...` | saturday#1 | saturday | 1 | 0.985751 | `[Saturday#2, SATURDAY#2, SUNDAY#3, sunday#2, S...` |
| 2 | `<bit>` | `not stop listening to what have the english by...` | bit#6 | bit | 6 | 0.98774 | `[Bit#5, BIT#5, stuff#3, Stuff#2, Really#2, REA...` |
| 3 | `<one>` | `the <one> there is amazing had such a laugh wh...` | one#3 | one | 3 | 0.969355 | `[One#3, ONE#3, List#10, Top#8, TOP#9, three#2,...` |
| 4 | `<laugh>` | `the one there is amazing had such a <laugh> wh...` | laugh#1 | laugh | 1 | 0.914474 | `[LAUGH#1, Laugh#2, giggle#1, Giggle#1, Flinch#...` |
