Lorenzo Tosi.
This disambiguation method is based on the Adaptive Lesk Algorithm as
explained in this paper: http://www.d.umn.edu/~tpederse/Pubs/cicling2002-b.pdf.
Every WordNet definition of a word is augmented with the definitions of its hyponyms, hypernyms, meronyms, holonyms and domains, to compensate for the lack of context and the domain bias inherent in the Lesk algorithm.
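As a rough illustration of why augmenting the definitions can help, here is a minimal sketch on invented data. The sense names, glosses and related glosses in TOY_SENSES are made up for the example and are not taken from WordNet; the functions augmented_bag and best_sense are likewise hypothetical helpers, not part of the notebook.

```python
# Toy illustration only: sense names and glosses are invented, not from WordNet.
TOY_SENSES = {
    "break.pause": {
        "gloss": "a pause from work or activity",
        "related": ["a short rest for coffee", "time away from school"],
    },
    "break.fracture": {
        "gloss": "a crack in a hard object",
        "related": ["a fracture in a bone", "an injury to the body"],
    },
}


def augmented_bag(sense):
    """Union of the words in a sense's own gloss and in its related glosses."""
    words = set(sense["gloss"].split())
    for gloss in sense["related"]:
        words |= set(gloss.split())
    return words


def best_sense(context_words, senses, augment=True):
    """Return the sense with the largest word overlap, plus all the scores."""
    context = set(context_words)
    scores = {}
    for name, sense in senses.items():
        bag = augmented_bag(sense) if augment else set(sense["gloss"].split())
        scores[name] = len(bag & context)
    return max(scores, key=scores.get), scores


context = ["she", "took", "coffee", "during", "school"]
print(best_sense(context, TOY_SENSES, augment=False))  # own glosses only
print(best_sense(context, TOY_SENSES, augment=True))   # with related glosses
```

With the plain glosses neither sense shares a word with the context, so the choice is arbitrary; with the related glosses added, one sense gains evidence.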

In [91]:
import random
import re
from nltk.corpus import wordnet as wn
from string import punctuation
from nltk.corpus import stopwords
from pymorphy2 import MorphAnalyzer
import pandas as pd
import nltk
pd.set_option("display.max_colwidth", 1000) 

In [37]:
with open(r"compling_nlp_hse_course/data/corpus_eng.txt", "r", encoding="utf-8") as r:
    corpus = []
    for line in r.readlines():
        corpus.append(line)

In the next two cells we collect the corpus sentences that contain "break" and randomly select 10 of them for the task.

In [38]:
breakcorpus = []
for i, sentence in enumerate(corpus):
    if re.search(" break ", sentence) is not None:
        breakcorpus.append(corpus[i])

In [39]:
breakchoice = []
random.shuffle(breakcorpus)
for i in range(0,10):
    breakchoice.append(breakcorpus[i])
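The selection step can also be written as one reproducible helper. This is only a sketch under assumptions: pick_sentences and its seed parameter are hypothetical, and the word-boundary pattern is a variation on the " break " search above (it also matches "break." and "Break"), so it would not select exactly the same sentences.

```python
import random
import re


def pick_sentences(sentences, word, k=10, seed=0):
    """Return up to k sentences that contain `word` as a whole token."""
    pattern = re.compile(r"\b{}\b".format(re.escape(word)), re.IGNORECASE)
    matches = [s for s in sentences if pattern.search(s)]
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return rng.sample(matches, min(k, len(matches)))


toy_corpus = [
    "Take a break.",
    "Breakfast is ready.",
    "They break down often",
    "No match here",
]
print(pick_sentences(toy_corpus, "break"))
```

Fixing the seed makes the 10 chosen sentences stable between runs, which helps when comparing the variants of the algorithm on the same sample.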

Here we define our tokenization and Lesk functions as we did in the seminar.

In [107]:
morph = MorphAnalyzer()
punct = punctuation+'«»—…“”*№–'
stops = set(stopwords.words('english'))

def normalize(text):
    
    words = [word.strip(punct) for word in text.lower().split() if word and word not in stops]
    words = [morph.parse(word)[0].normal_form for word in words if word]

    return words

def tokenize(text):
    
    words = [word.strip(punct) for word in text.lower().split() if word and word not in stops]
    words = [word for word in words if word]

    return words


def get_words_in_context(words, window=3):
    words_in_context = []
    for i in range(len(words)):
        left = words[max(0,i-window):i] 
        right = words[i+1:i+window+1]
        target = words[i]
        words_in_context.append((target, left+right))
    return words_in_context
    
    
def lesk(word, sentence):
    bestsense = 0
    maxoverlap = 0
    
    for i, synset in enumerate(wn.synsets(word)):
        definition = normalize(synset.definition())
        definition = set(definition)
        sentence = set(sentence)
        overlap = len(definition & sentence)
        if overlap > maxoverlap:
            maxoverlap = overlap
            bestsense = i
            
    return bestsense
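To make the effect of the window parameter concrete, here is a small self-contained sketch that mirrors the slicing used in get_words_in_context. The token list is invented and context_window is a hypothetical helper for a single position.

```python
def context_window(words, i, window=3):
    """Tokens within `window` positions of index i, excluding the target itself."""
    return words[max(0, i - window):i] + words[i + 1:i + window + 1]


tokens = ["we", "took", "a", "short", "break", "from", "the", "long", "meeting"]
i = tokens.index("break")

narrow = context_window(tokens, i, window=1)
wide = context_window(tokens, i, window=3)
print(narrow)
print(wide)
```

The context never contains the target word itself, and near the start or end of a sentence it is simply shorter than 2 * window.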

Here we define the function that builds our augmented network. For each sense of a word we collect its hyponyms, hypernyms, etc. and normalize the definition of every one of them. The entry at each sense's index is extended incrementally with all of these definitions. Definitions are tokenized and stop words are removed automatically.


In [None]:
def wordnetworkextractor(word):
    sensesnetwork = []
    for i, synset in enumerate(wn.synsets(word)):
        senseslist = []
        for hyponym in wn.synsets(word)[i].hyponyms():
            senseslist.append(normalize(hyponym.definition()))
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork.append(" ".join(templist))
        senseslist = []
        for hypernym in wn.synsets(word)[i].hypernyms():
            senseslist.append(normalize(hypernym.definition()))
        senseslist.append(normalize(wn.synsets(word)[i].definition())) #adds definition of break(n)
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork[i] += " " + (" ".join(templist))
        senseslist = []
        for member_holonym in wn.synsets(word)[i].member_holonyms():
            senseslist.append(normalize(member_holonym.definition()))
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork[i] += " " + (" ".join(templist))
        senseslist = []
        for substance_holonym in wn.synsets(word)[i].substance_holonyms():
            senseslist.append(normalize(substance_holonym.definition()))
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork[i] += " " + (" ".join(templist))
        senseslist = []
        for part_holonym in wn.synsets(word)[i].part_holonyms():
            senseslist.append(normalize(part_holonym.definition()))
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork[i] += " " + (" ".join(templist))
        senseslist = []
        for part_meronym in wn.synsets(word)[i].part_meronyms():
            senseslist.append(normalize(part_meronym.definition()))
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork[i] += " " + (" ".join(templist))
        senseslist = []
        for member_meronym in wn.synsets(word)[i].member_meronyms():
            senseslist.append(normalize(member_meronym.definition()))
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork[i] += " " + (" ".join(templist))
        senseslist = []
        for substance_meronym in wn.synsets(word)[i].substance_meronyms():
            senseslist.append(normalize(substance_meronym.definition()))
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork[i] += " " + (" ".join(templist))
        senseslist = []
        for topic_domain in wn.synsets(word)[i].topic_domains():
            senseslist.append(normalize(topic_domain.definition()))
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork[i] += " " + (" ".join(templist))
        senseslist = []
        for region_domain in wn.synsets(word)[i].region_domains():
            senseslist.append(normalize(region_domain.definition()))
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork[i] += " " + (" ".join(templist))
        senseslist = []  # reset, otherwise the region-domain definitions are added twice
        for usage_domain in wn.synsets(word)[i].usage_domains():
            senseslist.append(normalize(usage_domain.definition()))
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork[i] += " " + (" ".join(templist))
        sensesnetwork[i] = sensesnetwork[i].lstrip()  # lstrip returns a new string
    return sensesnetwork
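The function above repeats the same four steps for every relation. A more compact sketch, assuming only that each synset exposes definition() and the relation methods already used above, loops over the method names instead. The names RELATIONS, augmented_gloss and word_network are hypothetical, and the sketch returns sets of tokens rather than joined strings, which should be equivalent for overlap counting because the strings are later split into sets anyway.

```python
# Relation method names, all of which are called on synsets in the function above.
RELATIONS = (
    "hyponyms", "hypernyms",
    "member_holonyms", "substance_holonyms", "part_holonyms",
    "part_meronyms", "member_meronyms", "substance_meronyms",
    "topic_domains", "region_domains", "usage_domains",
)


def augmented_gloss(synset, normalize, relations=RELATIONS):
    """Tokens of the synset's own gloss plus the glosses of all related synsets."""
    tokens = set(normalize(synset.definition()))
    for relation in relations:
        for related in getattr(synset, relation)():
            tokens |= set(normalize(related.definition()))
    return tokens


def word_network(synsets, normalize, relations=RELATIONS):
    """One augmented token set per synset, in the same order as `synsets`."""
    return [augmented_gloss(synset, normalize, relations) for synset in synsets]


# Intended use with the notebook's own objects (not executed here):
# network = word_network(wn.synsets("break"), normalize)
```

Passing the synset list in explicitly also avoids calling wn.synsets(word) again for every relation.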

We normalize our 10 randomly selected sentences in the same way and test how well the basic Lesk algorithm disambiguates them. We create a DataFrame to make the results easier to interpret and debug (the word index refers to the sentence with stop words removed, not the original one).

In [42]:
breakchoicetoken = [normalize(line) for line in breakchoice]

In [53]:
breaklist = []
for i, text in enumerate(breakchoicetoken):
    dis_text = []
    words_in_context = get_words_in_context(text)
    for j, (word, context) in enumerate(words_in_context):
        nsense = lesk(word, context)
        if re.search("^break$", word) is not None:
            findind = breakchoice[i].find(" break ")
            breaklist.append((word, j, i, nsense, wn.synsets(word)[nsense].definition(), breakchoice[i][max(0,findind-100):findind+100]))
labels = ["words", "wordind", "sentind", "defid", "definition", "context"]
FinalData = pd.DataFrame.from_records(breaklist, columns=labels)
print(FinalData)

   words  wordind  sentind  defid  \
0  break       15        0      0   
1  break       58        1      0   
2  break       42        2      0   
3  break        1        3     43   
4  break       49        4      0   
5  break       15        5      0   
6  break        0        6      0   
7  break        2        7     26   
8  break        8        8      0   
9  break       14        9      0   

                                                                                                                       definition  \
0                                                                      some abrupt occurrence that interrupts an ongoing activity   
1                                                                      some abrupt occurrence that interrupts an ongoing activity   
2                                                                      some abrupt occurrence that interrupts an ongoing activity   
3                                                           

Here 8/10 senses are the same, probably because the context window is very limited. Only 2/10 senses could be considered acceptable.
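One plausible reason why sense 0 dominates is the control flow of lesk itself: bestsense starts at 0 and is only replaced by a strictly larger overlap, so a context that shares no word with any definition falls back to sense 0. The sketch below reproduces that control flow on invented gloss tokens; lesk_toy and the glosses are hypothetical.

```python
def lesk_toy(context, glosses):
    """Same control flow as lesk above, applied to plain lists of gloss tokens."""
    best, max_overlap = 0, 0
    for i, gloss in enumerate(glosses):
        overlap = len(set(gloss) & set(context))
        if overlap > max_overlap:
            best, max_overlap = i, overlap
    return best, max_overlap


# Invented glosses, one token list per sense.
glosses = [
    ["abrupt", "interruption"],
    ["pause", "work"],
    ["fracture", "bone"],
]

print(lesk_toy(["coffee", "meeting"], glosses))  # no shared word at all
print(lesk_toy(["bone", "x-ray"], glosses))      # one shared word with sense 2
```

Returning the overlap alongside the index makes it possible to tell "sense 0 had the most evidence" apart from "there was no evidence".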

In [109]:
breaklist = []
for i, text in enumerate(breakchoicetoken):
    dis_text = []
    words_in_context = get_words_in_context(text, window= 10)
    for j, (word, context) in enumerate(words_in_context):
        nsense = lesk(word, context)
        if re.search("^break$", word) is not None:
            findind = breakchoice[i].find(" break ")
            breaklist.append((word, j, i, nsense, wn.synsets(word)[nsense].definition(), breakchoice[i][max(0,findind-100):findind+100]))
labels = ["words", "wordind", "sentind", "defid", "definition", "context"]
FinalData2 = pd.DataFrame.from_records(breaklist, columns=labels)
print(FinalData2)

   words  wordind  sentind  defid  \
0  break       32        0     28   
1  break      105        1     20   
2  break       68        2     20   
3  break        2        3     43   
4  break       79        4      0   
5  break       26        5      3   
6  break        2        6     51   
7  break        3        7     26   
8  break       11        8      0   
9  break       21        9     20   

                                                                                                                       definition  \
0                                                                 fail to agree with; be in violation of; as of rules or patterns   
1                                          destroy the integrity of; usually by force; cause to separate into pieces or fragments   
2                                          destroy the integrity of; usually by force; cause to separate into pieces or fragments   
3                                                           

In [108]:
breaklist = []
for i, text in enumerate(breakchoicetoken):
    dis_text = []
    words_in_context = get_words_in_context(text, window= 50)
    for j, (word, context) in enumerate(words_in_context):
        nsense = lesk(word, context)
        if re.search("^break$", word) is not None:
            findind = breakchoice[i].find(" break ")
            breaklist.append((word, j, i, nsense, wn.synsets(word)[nsense].definition(), breakchoice[i][max(0,findind-100):findind+100]))
labels = ["words", "wordind", "sentind", "defid", "definition", "context"]
FinalData22 = pd.DataFrame.from_records(breaklist, columns=labels)
print(FinalData22)

   words  wordind  sentind  defid  \
0  break       32        0     28   
1  break      105        1      3   
2  break       68        2     28   
3  break        2        3     43   
4  break       79        4      3   
5  break       26        5     28   
6  break        2        6     28   
7  break        3        7     26   
8  break       11        8      0   
9  break       21        9      2   

                                                                                                                       definition  \
0                                                                 fail to agree with; be in violation of; as of rules or patterns   
1                                                                  a personal or social separation (as between opposing factions)   
2                                                                 fail to agree with; be in violation of; as of rules or patterns   
3                                                           

Here we try windows of 10 and 50 and compare the results. We get a wider variety of senses. With window = 10 the definitions are far more varied than with window = 3, but they still seem inaccurate: only sentences 6 and 9 (indexed from 0 to 9) could be considered accurate. With window = 50, the overlap between set(sentence) and set(definition) seems to make definition 28 prevail over the others in general, and accuracy remains very low. Let's now try the Adaptive Lesk Algorithm with the same windows.

In [114]:
def adaptivelesk(word, sentence):
    bestsense = 0
    maxoverlap = 0
    wordnetwork = wordnetworkextractor(word)
    
    for i, synset in enumerate(wn.synsets(word)):
        definition = set(wordnetwork[i].split())
        sentence = set(sentence)
        overlap = len(definition & sentence)
        if overlap > maxoverlap:
            maxoverlap = overlap
            bestsense = i
            
    return bestsense


In [115]:
breaklist = []
for i, text in enumerate(breakchoicetoken):
    dis_text = []
    words_in_context = get_words_in_context(text, window= 3)
    for j, (word, context) in enumerate(words_in_context):
        nsense = adaptivelesk(word, context)
        if re.search("^break$", word) is not None:
            findind = breakchoice[i].find(" break ")
            breaklist.append((word, j, i, nsense, wn.synsets(word)[nsense].definition(), breakchoice[i][max(0,findind-100):findind+100]))
labels = ["words", "wordind", "sentind", "defid", "definition", "context"]
FinalData3 = pd.DataFrame.from_records(breaklist, columns=labels)
print(FinalData3)

   words  wordind  sentind  defid  \
0  break       32        0     13   
1  break      105        1      0   
2  break       68        2      0   
3  break        2        3     43   
4  break       79        4     47   
5  break       26        5      3   
6  break        2        6      4   
7  break        3        7     26   
8  break       11        8     60   
9  break       21        9      4   

                                                                                                                       definition  \
0                                                                                                                   a sudden dash   
1                                                                      some abrupt occurrence that interrupts an ongoing activity   
2                                                                      some abrupt occurrence that interrupts an ongoing activity   
3                                                           

In [116]:
breaklist = []
for i, text in enumerate(breakchoicetoken):
    dis_text = []
    words_in_context = get_words_in_context(text, window= 10)
    for j, (word, context) in enumerate(words_in_context):
        nsense = adaptivelesk(word, context)
        if re.search("^break$", word) is not None:
            findind = breakchoice[i].find(" break ")
            breaklist.append((word, j, i, nsense, wn.synsets(word)[nsense].definition(), breakchoice[i][max(0,findind-100):findind+100]))
labels = ["words", "wordind", "sentind", "defid", "definition", "context"]
FinalData4 = pd.DataFrame.from_records(breaklist, columns=labels)
print(FinalData4)

   words  wordind  sentind  defid  \
0  break       32        0     28   
1  break      105        1     20   
2  break       68        2      7   
3  break        2        3      6   
4  break       79        4     47   
5  break       26        5      3   
6  break        2        6      4   
7  break        3        7     26   
8  break       11        8     60   
9  break       21        9      4   

                                                                                                                       definition  \
0                                                                 fail to agree with; be in violation of; as of rules or patterns   
1                                          destroy the integrity of; usually by force; cause to separate into pieces or fragments   
2                                                                                            breaking of hard tissue such as bone   
3                                                        a t

In [117]:
breaklist = []
for i, text in enumerate(breakchoicetoken):
    dis_text = []
    words_in_context = get_words_in_context(text, window= 50)
    for j, (word, context) in enumerate(words_in_context):
        nsense = adaptivelesk(word, context)
        if re.search("^break$", word) is not None:
            findind = breakchoice[i].find(" break ")
            breaklist.append((word, j, i, nsense, wn.synsets(word)[nsense].definition(), breakchoice[i][max(0,findind-100):findind+100]))
labels = ["words", "wordind", "sentind", "defid", "definition", "context"]
FinalData5 = pd.DataFrame.from_records(breaklist, columns=labels)
print(FinalData5)

   words  wordind  sentind  defid  \
0  break       32        0      6   
1  break      105        1     47   
2  break       68        2     10   
3  break        2        3      6   
4  break       79        4      3   
5  break       26        5      9   
6  break        2        6     28   
7  break        3        7     21   
8  break       11        8     60   
9  break       21        9      0   

                                                                                definition  \
0                 a time interval during which there is a temporary cessation of something   
1                                               assign to a lower position; reduce in rank   
2                            the opening shot that scatters the balls in billiards or pool   
3                 a time interval during which there is a temporary cessation of something   
4                           a personal or social separation (as between opposing factions)   
5  an abrupt change in the t

Here we can see that with window = 3 the definitions are far more varied (7 distinct senses as opposed to 3 in the basic algorithm). Accuracy still does not rise, although different sentences are now assigned the right meanings (1/10). With window = 10 accuracy is 2/10 and there are some small changes in meaning. With window = 50 we get 3/10, with some meanings (as in sentence 9) not shifting compared to the basic Lesk. This suggests that this algorithm is less sensitive to the bias introduced by changing the window (such as the probability of catching random high-frequency words), so extending the context could be a good decision. Let's try a new scoring strategy.

In [120]:
def adaptivelesk(word, sentence):
    bestsense = 0
    maxoverlap = 0
    wordnetwork = wordnetworkextractor(word)
    
    for i, synset in enumerate(wn.synsets(word)):
        definition = set(wordnetwork[i].split())
        sentence = set(sentence)
        if len(definition) > 0:
            overlap = len(definition & sentence)/len(definition)
        else:
            overlap = 0
        if overlap > maxoverlap:
            maxoverlap = overlap
            bestsense = i
            
    return bestsense


breaklist = []
for i, text in enumerate(breakchoicetoken):
    dis_text = []
    words_in_context = get_words_in_context(text, window= 50)
    for j, (word, context) in enumerate(words_in_context):
        nsense = adaptivelesk(word, context)
        if re.search("^break$", word) is not None:
            findind = breakchoice[i].find(" break ")
            breaklist.append((word, j, i, nsense, wn.synsets(word)[nsense].definition(), breakchoice[i][max(0,findind-100):findind+100]))
labels = ["words", "wordind", "sentind", "defid", "definition", "context"]
FinalData6 = pd.DataFrame.from_records(breaklist, columns=labels)
print(FinalData6)

   words  wordind  sentind  defid  \
0  break       32        0     28   
1  break      105        1     23   
2  break       68        2     28   
3  break        2        3     43   
4  break       79        4     28   
5  break       26        5     28   
6  break        2        6     28   
7  break        3        7     66   
8  break       11        8     60   
9  break       21        9     28   

                                                        definition  \
0  fail to agree with; be in violation of; as of rules or patterns   
1                                                  scatter or part   
2  fail to agree with; be in violation of; as of rules or patterns   
3                                             happen or take place   
4  fail to agree with; be in violation of; as of rules or patterns   
5  fail to agree with; be in violation of; as of rules or patterns   
6  fail to agree with; be in violation of; as of rules or patterns   
7                               

Here we change how the Lesk score is calculated, to len(definition & sentence)/len(definition), a ratio that depends on the length of each definition, to see whether it is a good scoring strategy. The results show that, instead of reducing the bias between long and short definitions, it favours short ones too heavily (short relative to the extended network).
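The bias towards short definitions can be reproduced on invented sets. In the sketch below, raw_overlap and ratio_overlap correspond to the two scores used so far; log_scaled_overlap is a hypothetical compromise that was not evaluated in this notebook and is shown only to illustrate that gentler length normalizations are possible.

```python
import math


def raw_overlap(definition, context):
    return len(definition & context)


def ratio_overlap(definition, context):
    return len(definition & context) / len(definition) if definition else 0.0


def log_scaled_overlap(definition, context):
    """Hypothetical alternative: divide by a slowly growing function of the length."""
    return len(definition & context) / math.log(len(definition) + 2)


context = {"rule", "law", "court", "agree"}
short_def = {"fail", "agree"}                                        # 1 hit in 2 words
long_def = {"rule", "law", "court"} | {f"w{i}" for i in range(27)}   # 3 hits in 30 words

for score in (raw_overlap, ratio_overlap, log_scaled_overlap):
    print(score.__name__, score(short_def, context), score(long_def, context))
```

In this toy case the ratio lets a single hit in a two-word definition beat three hits in a thirty-word one, which matches the behaviour observed above.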

This algorithm still scores poorly on sentences containing, for example, phrasal verbs ("break down", "break through"). Let's modify it so that it can catch bigrams. We will write one function that gets the index of the word in the sentence and one that gets its left and right neighbours. We will then check whether a network exists for the bigrams "n-1, n" and "n, n+1" (the order is not arbitrary: bigrams like "tax break" are less frequent and more specific in meaning than "break down", so we check first if they are valid). If a network exists, the bigram becomes the new word and we build the extended network for the bigram.

As words like "down" are treated as stop words by the set we use, we will not remove stop words from the sentence (stop words are still removed from the extended network inside the function, so they will not count in the scoring, but keeping them in the sentence lets us catch bigrams).

We will also make the program faster by disambiguating only the target word rather than every word in the sentence, and make the code usable for other words.
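The bigram lookup described above can be sketched independently of WordNet. In this sketch find_collocation is a hypothetical helper and has_entry is any callable that says whether a candidate such as "break_down" exists in the lexicon; with NLTK one could pass lambda w: bool(wn.synsets(w)). The sketch checks the "n-1, n" bigram first, following the order given in the text, and guards against the target being the first or last token.

```python
def find_collocation(words, i, has_entry):
    """Return a two-word collocation around words[i], or the word itself."""
    target = words[i]
    if i > 0:  # "n-1, n", e.g. "tax_break"
        before = words[i - 1] + "_" + target
        if has_entry(before):
            return before
    if i + 1 < len(words):  # "n, n+1", e.g. "break_down"
        after = target + "_" + words[i + 1]
        if has_entry(after):
            return after
    return target


# Toy lexicon standing in for WordNet (an assumption made for this example).
toy_lexicon = {"tax_break", "break_down", "break"}

print(find_collocation(["a", "tax", "break", "down"], 2, toy_lexicon.__contains__))
print(find_collocation(["machines", "break", "down"], 1, toy_lexicon.__contains__))
print(find_collocation(["break"], 0, toy_lexicon.__contains__))
```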

In [122]:
def adaptivelesk(word, sentence, collocation):
    bestsense = 0
    maxoverlap = 0
    wordclean = word
    testcollocationaft = collocation[0] + "_" + collocation[1]
    testcollocationbef = collocation[2] + "_" + collocation[0]
    if wn.synsets(testcollocationaft):  # a non-empty list means the bigram is in WordNet
        word = testcollocationaft
        wordclean = collocation[0] + " " + collocation[1]
    elif wn.synsets(testcollocationbef):
        word = testcollocationbef
        wordclean = collocation[2] + " " + collocation[0]
    wordnetwork = wordnetworkextractor(word)
    for i, synset in enumerate(wn.synsets(word)):
        definition = set(wordnetwork[i].split())
        sentence = set(sentence)
        overlap = len(definition & sentence)
        if overlap > maxoverlap:
            maxoverlap = overlap
            bestsense = i
            
    return word, bestsense, wordclean


def getindex(words, word):
    for i, singleword in enumerate(words):
        if singleword == word:
            return i


def getcollocation(words, word):
    for i, singleword in enumerate(words):
        if singleword == word:
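            # Guard against the target being the last token of the sentence
            right = words[i+1] if i+1 < len(words) else ""
            return [word, right, words[max(0,i-1)]]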


def tokenizewithstops(text):
    
    words = [word.strip(punct) for word in text.lower().split() if word]
    words = [word for word in words if word]

    return words
    
def normalize(text):
    
    words = [word.strip(punct) for word in text.lower().split() if word and word not in stops]
    words = [morph.parse(word)[0].normal_form for word in words if word]

    return words

def normalizewithstops(text):
    
    words = [word.strip(punct) for word in text.lower().split() if word]
    words = [morph.parse(word)[0].normal_form for word in words if word]

    return words
    
def wordnetworkextractor(word):
    sensesnetwork = []
    for i, synset in enumerate(wn.synsets(word)):
        senseslist = []
        for hyponym in wn.synsets(word)[i].hyponyms():
            senseslist.append(normalize(hyponym.definition()))
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork.append(" ".join(templist))
        senseslist = []
        for hypernym in wn.synsets(word)[i].hypernyms():
            senseslist.append(normalize(hypernym.definition()))
        senseslist.append(normalize(wn.synsets(word)[i].definition())) #adds definition of break(n)
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork[i] += " " + (" ".join(templist))
        senseslist = []
        for member_holonym in wn.synsets(word)[i].member_holonyms():
            senseslist.append(normalize(member_holonym.definition()))
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork[i] += " " + (" ".join(templist))
        senseslist = []
        for substance_holonym in wn.synsets(word)[i].substance_holonyms():
            senseslist.append(normalize(substance_holonym.definition()))
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork[i] += " " + (" ".join(templist))
        senseslist = []
        for part_holonym in wn.synsets(word)[i].part_holonyms():
            senseslist.append(normalize(part_holonym.definition()))
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork[i] += " " + (" ".join(templist))
        senseslist = []
        for part_meronym in wn.synsets(word)[i].part_meronyms():
            senseslist.append(normalize(part_meronym.definition()))
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork[i] += " " + (" ".join(templist))
        senseslist = []
        for member_meronym in wn.synsets(word)[i].member_meronyms():
            senseslist.append(normalize(member_meronym.definition()))
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork[i] += " " + (" ".join(templist))
        senseslist = []
        for substance_meronym in wn.synsets(word)[i].substance_meronyms():
            senseslist.append(normalize(substance_meronym.definition()))
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork[i] += " " + (" ".join(templist))
        senseslist = []
        for topic_domain in wn.synsets(word)[i].topic_domains():
            senseslist.append(normalize(topic_domain.definition()))
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork[i] += " " + (" ".join(templist))
        senseslist = []
        for region_domain in wn.synsets(word)[i].region_domains():
            senseslist.append(normalize(region_domain.definition()))
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork[i] += " " + (" ".join(templist))
        senseslist = []  # reset, otherwise the region-domain definitions are added twice
        for usage_domain in wn.synsets(word)[i].usage_domains():
            senseslist.append(normalize(usage_domain.definition()))
        templist = []
        for listing in senseslist:
            templist.append(" ".join(listing))
        templist = set(templist)
        sensesnetwork[i] += " " + (" ".join(templist))
        sensesnetwork[i] = sensesnetwork[i].lstrip()  # lstrip returns a new string
    return sensesnetwork
            
breakchoicetoken = [normalizewithstops(line) for line in breakchoice]

In [124]:
breaklist = []
for i, text in enumerate(breakchoicetoken):
    dis_text = []
    words_in_context = get_words_in_context(text, window= 50)
    index = getindex(text, "break")
    collocation = getcollocation(text, "break")
    word, nsense, wordclean = adaptivelesk(words_in_context[index][0], words_in_context[index][1], collocation)
    findind = breakchoice[i].find(" {} ".format(wordclean))
    breaklist.append((wordclean, index, i, nsense, wn.synsets(word)[nsense].definition(), breakchoice[i][max(0,findind-100):findind+100]))
labels = ["words", "wordind", "sentind", "defid", "definition", "context"]
FinalData7 = pd.DataFrame.from_records(breaklist, columns=labels)
print(FinalData7)

           words  wordind  sentind  defid  \
0          break       32        0      6   
1          break      105        1     11   
2          break       68        2     10   
3          break        2        3      6   
4          break       79        4      3   
5     break down       26        5      1   
6     break down        2        6      1   
7          break        3        7     21   
8  break through       11        8      0   
9          break       21        9      0   

                                                                                                    definition  \
0                                     a time interval during which there is a temporary cessation of something   
1                                 (tennis) a score consisting of winning a game when your opponent was serving   
2                                                the opening shot that scatters the balls in billiards or pool   
3                                     a time int

In [125]:
breaklist = []
for i, text in enumerate(breakchoicetoken):
    dis_text = []
    words_in_context = get_words_in_context(text, window= 100)
    index = getindex(text, "break")
    collocation = getcollocation(text, "break")
    word, nsense, wordclean = adaptivelesk(words_in_context[index][0], words_in_context[index][1], collocation)
    findind = breakchoice[i].find(" {} ".format(wordclean))
    breaklist.append((wordclean, index, i, nsense, wn.synsets(word)[nsense].definition(), breakchoice[i][max(0,findind-100):findind+100]))
labels = ["words", "wordind", "sentind", "defid", "definition", "context"]
FinalData8 = pd.DataFrame.from_records(breaklist, columns=labels)
print(FinalData8)

           words  wordind  sentind  defid  \
0          break       32        0      6   
1          break      105        1     11   
2          break       68        2     10   
3          break        2        3      6   
4          break       79        4      3   
5     break down       26        5      1   
6     break down        2        6      1   
7          break        3        7     21   
8  break through       11        8      0   
9          break       21        9      0   

                                                                                                    definition  \
0                                     a time interval during which there is a temporary cessation of something   
1                                 (tennis) a score consisting of winning a game when your opponent was serving   
2                                                the opening shot that scatters the balls in billiards or pool   
3                                     a time int

We tried with window = 50 and window = 100, and results are the same for both windows. The quality of the results improves! We see that the algorithm is able to catch our phrasal verbs (3) and to attribute the right meaning to 2 of them (sentence 6 and 8). We have also the ideal definition (not only acceptable) in sentence 0 and 3.
Let's add the last feature: POS tagging, and disambiguation of the network by POS.
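As a rough illustration of the tag translation this step needs, the sketch below maps Penn Treebank tags (the tag set nltk.pos_tag returns by default for English) to the WordNet POS letters n, v, a and r by looking at the first two characters of the tag. The helper name penn_to_wordnet is hypothetical and is not part of the notebook; it is only one compact way to express the same table.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if there is no match."""
    prefixes = {"NN": "n", "VB": "v", "JJ": "a", "RB": "r"}
    # Every noun tag starts with NN, every verb tag with VB, and so on.
    return prefixes.get(tag[:2])


for tag in ["NN", "NNS", "VBD", "VBZ", "JJR", "RBS", "MD", "IN"]:
    print(tag, penn_to_wordnet(tag))
```

Returning None for tags outside the four classes is a deliberate choice in this sketch: wn.synsets accepts pos=None and then returns the synsets for every part of speech, so an unmapped tag falls back to the unfiltered behaviour instead of raising an error.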

In [126]:
def morphanalyze(words):
    # Penn Treebank tags mapped to the WordNet POS letters (n, a, v, r)
    tagtranslate = {"NN" : "n", "NNS" : "n", "NNP" : "n", "NNPS" : "n", "JJ" : "a", "JJR" : "a", "JJS" : "a", "VB" : "v", "VBD" : "v", "VBG" : "v", "VBN" : "v", "VBP" : "v", "VBZ" : "v", "RB" : "r", "RBR" : "r", "RBS" : "r"}
    tagwords = nltk.pos_tag(words)
    for token, tag in tagwords:
        if token == "break":
            # .get() returns None for a tag outside the table instead of raising KeyError
            return tagtranslate.get(tag)
    # "break" was not found among the tokens
    return None
            
            
def adaptivelesk(word, sentence, collocation, pos):
    bestsense = 0
    maxoverlap = 0
    wordclean = word
    # WordNet joins multiword lemmas with underscores, e.g. break_down
    testcollocationaft = collocation[0] + "_" + collocation[1]
    testcollocationbef = collocation[2] + "_" + collocation[0]
    if len(wn.synsets(testcollocationaft)) != 0:
        word = testcollocationaft
        wordclean = collocation[0] + " " + collocation[1]
    elif len(wn.synsets(testcollocationbef)) != 0:
        word = testcollocationbef
        wordclean = collocation[2] + " " + collocation[0]
    wordnetwork = wordnetworkextractor(word)
    # Build the context set once instead of on every iteration
    sentence = set(sentence)
    # bestsense is an index into the POS-filtered synset list
    for i, synset in enumerate(wn.synsets(word, pos=pos)):
        definition = set(wordnetwork[i].split())
        overlap = len(definition & sentence)
        if overlap > maxoverlap:
            maxoverlap = overlap
            bestsense = i
            
    return word, bestsense, wordclean
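One thing to keep in mind with the loop in adaptivelesk: it enumerates wn.synsets(word, pos=pos), so the returned index is a position in the POS-filtered list. A later lookup in the unfiltered list, such as wn.synsets(word)[nsense], may then point at a different sense. The sketch below shows the effect on invented data; the sense labels and the helper senses_for are hypothetical stand-ins, not WordNet output.

```python
# Hypothetical stand-in for the synsets of one word: (label, pos) pairs.
all_senses = [
    ("noun_sense_0", "n"),
    ("noun_sense_1", "n"),
    ("verb_sense_0", "v"),
    ("verb_sense_1", "v"),
]


def senses_for(pos=None):
    """Return every sense, or only those with the given POS (mimics a pos= filter)."""
    return [sense for sense in all_senses if pos is None or sense[1] == pos]


best = 1  # pretend the overlap loop picked index 1 among the verb senses

print(senses_for("v")[best])  # ('verb_sense_1', 'v'): the sense the loop scored
print(senses_for()[best])     # ('noun_sense_1', 'n'): same index, unfiltered list
```

Looking the definition up in the same filtered list that produced the index keeps the two consistent.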

In [127]:
breaklist = []
for i, text in enumerate(breakchoicetoken):
    breakchoicetokenpos = [tokenizewithstops(line) for line in breakchoice[i]]
    pos = morphanalyze(breakchoicetokenpos)
    dis_text = []
    words_in_context = get_words_in_context(text, window= 50)
    index = getindex(text, "break")
    collocation = getcollocation(text, "break")
    word, nsense, wordclean = adaptivelesk(words_in_context[index][0], words_in_context[index][1], collocation, pos)
    findind = breakchoice[i].find(" {} ".format(wordclean))
    breaklist.append((wordclean, index, i, nsense, pos, wn.synsets(word)[nsense].definition(), breakchoice[i][max(0,findind-100):findind+100]))
labels = ["words", "wordind", "sentind", "defid", "POS", "definition", "context"]
FinalData8 = pd.DataFrame.from_records(breaklist, columns=labels)
print(FinalData8)

           words  wordind  sentind  defid POS  \
0          break       32        0      6   n   
1          break      105        1     11   v   
2          break       68        2     10   v   
3          break        2        3      6   n   
4          break       79        4      3   n   
5     break down       26        5      1   v   
6     break down        2        6      1   v   
7          break        3        7      0   n   
8  break through       11        8      0   v   
9          break       21        9      0   v   

                                                                                                    definition  \
0                                     a time interval during which there is a temporary cessation of something   
1                                 (tennis) a score consisting of winning a game when your opponent was serving   
2                                                the opening shot that scatters the balls in billiards or pool   
3   

With POS tagging we reach 5/10 accuracy, with sentence 7 now getting a related meaning. In sentence 2, the context (many sports terms) and the distance between "break" and "down" make it hard to catch the phrasal verb and assign the right meaning; in sentence 1 the meaning is also read as sports-related (world, record, accomplish, feat, try). The words "close" and "personal" make the definition of break incorrect, but overall it properly captures the contextual meaning (the sentence is about a personal separation). "Break down" is assigned the same meaning in both cases, probably because the extended network of the phrasal verb contains few meanings. Sentence 9 is literally correct (breaking the siege of a city = an abrupt occurrence that interrupts an ongoing activity), but there are more appropriate meanings; the one chosen here is one of the most general in the network. If we count it as correct, the score is 6/10, more than the 40% predicted in the paper for this kind of network.