# Word Sense Disambiguation using Supervised Learning

## The Senseval dataset

The Senseval 2 corpus is a word sense disambiguation corpus. Each item in the corpus corresponds to a single ambiguous word. For each of these words, the corpus contains a list of instances, corresponding to occurrences of that word. Each instance provides the word; a list of word senses that apply to the word occurrence; and the word’s context.
Source: https://www.nltk.org/howto/corpus.html#senseval

The creation of the dataset is described in detail in this publication: https://aclanthology.org/S01-1001/

In [1]:
import nltk
nltk.download('senseval')
from nltk.corpus import senseval



[nltk_data] Downloading package senseval to /root/nltk_data...
[nltk_data]   Package senseval is already up-to-date!


In [2]:
inst = senseval.instances('interest.pos')
inst[0]

SensevalInstance(word='interest-n', position=18, context=[('yields', 'NNS'), ('on', 'IN'), ('money-market', 'JJ'), ('mutual', 'JJ'), ('funds', 'NNS'), ('continued', 'VBD'), ('to', 'TO'), ('slide', 'VB'), (',', ','), ('amid', 'IN'), ('signs', 'VBZ'), ('that', 'IN'), ('portfolio', 'NN'), ('managers', 'NNS'), ('expect', 'VBP'), ('further', 'JJ'), ('declines', 'NNS'), ('in', 'IN'), ('interest', 'NN'), ('rates', 'NNS'), ('.', '.')], senses=('interest_6',))

In [3]:
len(inst)

2368

In [4]:
for inst in senseval.instances('interest.pos')[:100]:
  p = inst.position
  left = ' '.join(w for (w,t) in inst.context[p-3:p])
  word = ' '.join(w for (w,t) in inst.context[p:p+1])
  right = ' '.join(w for (w,t) in inst.context[p+1:p+4])
  senses = ' '.join(inst.senses)
  print('%24s |%10s | %-15s -> %s' % (left, word, right, senses))

     further declines in |  interest | rates .         -> interest_6
   to indicate declining |  interest | rates because they -> interest_6
     rises in short-term |  interest | rates .         -> interest_6
                   . 4 % |  interest | in this energy-services -> interest_5
    holding company with | interests | in the mechanical -> interest_5
         refunded , plus |  interest | .               -> interest_6
           curry set the |  interest | rate on the     -> interest_6
          country 's own |  interest | , prompted the  -> interest_4
        of principal and |  interest | is the only     -> interest_6
         to increase its |  interest | to 70 %         -> interest_5
         show the strong |  interest | of japanese investors -> interest_1
        retired early if |  interest | rates decline , -> interest_6
             the drop in |  interest | rates since the -> interest_6
             the drop in |  interest | rates eventually will -> interest_6
        p

In [5]:
senseval.fileids()

['hard.pos', 'interest.pos', 'line.pos', 'serve.pos']

In [6]:
def senses(word):
    """
    This takes the fileid of a target word from the senseval corpus (find out what
    the possible fileids are by running senseval.fileids()), and it returns the list
    of possible senses for the word.
    """
    return list(set(i.senses[0] for i in senseval.instances(word)))

senses('interest.pos')

['interest_1',
 'interest_5',
 'interest_3',
 'interest_2',
 'interest_6',
 'interest_4']

In [7]:
[i for i in senseval.instances('interest.pos') if i.senses[0]=='interest_1']

[SensevalInstance(word='interest-n', position=5, context=[('the', 'DT'), ('purchases', 'NNS'), ('show', 'VBP'), ('the', 'DT'), ('strong', 'JJ'), ('interest', 'NN'), ('of', 'IN'), ('japanese', 'NN'), ('investors', 'NNS'), ('in', 'IN'), ('u', 'PRP'), ('.', '.'), ('s', 'PRP'), ('.', '.'), ('mortgage-based', 'JJ'), ('instruments', 'NNS'), (',', ','), ('fannie', 'NN'), ('mae', 'NN'), ("'s", 'POS'), ('chairman', 'NN'), (',', ','), ('david', 'JJ'), ('o', 'IN'), ('.', '.'), ('maxwell', 'NN'), (',', ','), ('said', 'VBD'), ('at', 'IN'), ('a', 'DT'), ('news', 'NN'), ('conference', 'NN'), ('.', '.')], senses=('interest_1',)),
 SensevalInstance(word='interest-n', position=8, context=[('the', 'DT'), ('fire', 'NN'), ('is', 'VBZ'), ('also', 'RB'), ('fueled', 'VBN'), ('by', 'IN'), ('growing', 'VBG'), ('international', 'JJ'), ('interest', 'NN'), ('in', 'IN'), ('japanese', 'NN'), ('behavior', 'NN'), ('.', '.')], senses=('interest_1',)),
 SensevalInstance(word='interest-n', position=5, context=[('the', 'D

In [8]:
senses('line.pos')

['formation', 'cord', 'text', 'division', 'product', 'phone']

In [9]:
senses('serve.pos')

['SERVE6', 'SERVE2', 'SERVE10', 'SERVE12']

In [10]:
def print_kwic(instances, width=5):
    """Print each instance as: left context | target | right context -> senses."""
    for inst in instances:
        p = inst.position
        # Note: when p < width the start index p-width is negative, so this slice
        # is usually empty; max(0, p-width) would keep the available words instead.
        left = ' '.join(w for (w, t) in inst.context[p-width:p])
        word = ' '.join(w for (w, t) in inst.context[p:p+1])
        right = ' '.join(w for (w, t) in inst.context[p+1:p+width+1])
        # sense_str avoids shadowing the senses() function defined earlier
        sense_str = ' '.join(inst.senses)
        print('%20s |%10s | %-15s -> %s' % (left, word, right, sense_str))

hard_instances = senseval.instances('hard.pos')
print_kwic(hard_instances[0:30])   # first 30 instances
print_kwic(hard_instances[-30:])   # last 30 instances

defeat him and that 's |      hard | to do . ''      -> HARD1
doctors '' are having a |      hard | time helping president bush explain -> HARD1
                     |      hard | to believe that the sacramento -> HARD1
with another person , the |      hard | part in correcting the mistake -> HARD1
bookkeeper ; 'our life is |    harder | now , yes , but -> HARD1
shows , which have become |      hard | to sell in the rerun -> HARD1
 we have to face the |      hard | facts of life . -> HARD1
                     |      hard | to make the columns even -> HARD1
                     |      hard | just to finish these remarks -> HARD1
    has died , is it |      hard | portraying matt dillon without miss -> HARD1
predecessors , and it 's |    harder | to drum up big crowds -> HARD1
                     |      hard | to put into words , -> HARD1
                     |      hard | to believe i 'm seeing -> HARD1
a vehicle to attract the |      hard | to-reach 12-34-year-old demographic . '' ->

In [11]:
senseval.instances('hard.pos')

[SensevalInstance(word='hard-a', position=20, context=[('``', '``'), ('he', 'PRP'), ('may', 'MD'), ('lose', 'VB'), ('all', 'DT'), ('popular', 'JJ'), ('support', 'NN'), (',', ','), ('but', 'CC'), ('someone', 'NN'), ('has', 'VBZ'), ('to', 'TO'), ('kill', 'VB'), ('him', 'PRP'), ('to', 'TO'), ('defeat', 'VB'), ('him', 'PRP'), ('and', 'CC'), ('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'), ('to', 'TO'), ('do', 'VB'), ('.', '.'), ("''", "''")], senses=('HARD1',)), SensevalInstance(word='hard-a', position=10, context=[('clever', 'NNP'), ('white', 'NNP'), ('house', 'NNP'), ('``', '``'), ('spin', 'VB'), ('doctors', 'NNS'), ("''", "''"), ('are', 'VBP'), ('having', 'VBG'), ('a', 'DT'), ('hard', 'JJ'), ('time', 'NN'), ('helping', 'VBG'), ('president', 'NNP'), ('bush', 'NNP'), ('explain', 'VB'), ('away', 'RB'), ('the', 'DT'), ('economic', 'JJ'), ('bashing', 'NN'), ('that', 'IN'), ('low-and', 'JJ'), ('middle-income', 'JJ'), ('workers', 'NNS'), ('are', 'VBP'), ('taking', 'VBG'), ('these', 'DT'), ('days

## The Naive Bayes model

We use a Naive Bayes classifier to label (classify) each occurrence of the target word (the word whose sense we are looking for) with a certain WordNet sense. For this we need a context window surrounding the target word. The context window should contain only "content words": words that carry meaning and bring information, such as nouns and verbs.

We write `P(s|c)` for the probability of sense s given the context c. This probability is computed for each sense of the target word, and we choose the sense with the highest probability.

In order to compute the probability `P(s|c)`, we use the formula: 

`P(s|c)=P(c|s)*P(s)/P(c)`. 

`P(s)` is the prior probability of a sense, without any context. To estimate `P(c|s)` we need a training set (texts that contain the target word, already labeled with its correct sense). `P(c)` is the same for every sense, so it can be ignored when comparing senses.
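
The "naive" part of the model is the assumption that the pieces of the context are conditionally independent given the sense. As a sketch, assuming the context c is written as a sequence of words $w_1, \dots, w_n$ (a simplification of the feature dictionaries used later in this lab), the decision rule becomes:

$$
\hat{s} = \arg\max_{s} P(s \mid c) = \arg\max_{s} \frac{P(c \mid s)\,P(s)}{P(c)} = \arg\max_{s} P(s) \prod_{i=1}^{n} P(w_i \mid s)
$$

The denominator drops out because it does not depend on s, and the product replaces `P(c|s)` under the independence assumption.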

NLTK already implements this classifier. In this lab we will use the NLTK NaiveBayesClassifier: https://www.nltk.org/_modules/nltk/classify/naivebayes.html

The Naive Bayes classifier first computes the prior probability of each sense (or, generally speaking, of each class label); this is determined by the label's frequency in the training set. The features are then used to estimate the likelihood of each label in a given context.
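
To make these two steps concrete, here is a minimal from-scratch sketch in plain Python. The toy training set, the sense names `money` and `attention`, and the helper names are all made up for illustration. It uses add-one smoothing, which is an assumption of this sketch and not necessarily the smoothing NLTK applies.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (context words, sense label)
train = [
    (["rates", "decline"], "money"),
    (["rates", "rise"], "money"),
    (["loan", "rates"], "money"),
    (["strong", "hobby"], "attention"),
    (["growing", "hobby"], "attention"),
]

# Step 1: priors come from label frequencies in the training set
label_counts = Counter(label for _, label in train)

# Step 2: per-label word counts are used to estimate P(word | label)
word_counts = defaultdict(Counter)
for words, label in train:
    word_counts[label].update(words)
vocab = {w for words, _ in train for w in words}

def log_posterior(words, label):
    """Unnormalised log P(s|c) = log P(s) + sum of log P(w|s), add-one smoothed."""
    logp = math.log(label_counts[label] / len(train))
    total = sum(word_counts[label].values())
    for w in words:
        logp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return logp

def classify(words):
    """Return the label with the highest unnormalised log posterior."""
    return max(label_counts, key=lambda label: log_posterior(words, label))

prior_money = label_counts["money"] / len(train)
print(prior_money, classify(["rates"]), classify(["hobby"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.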

In [12]:
import nltk
import random
from nltk.classify import accuracy, NaiveBayesClassifier, MaxentClassifier
from collections import defaultdict

In [13]:
# NaiveBayesClassifier.train(train_set)

Here `train_set` is a list of tuples with two elements each, one tuple per training instance. The first element is a dictionary with the features (the name and value of each feature). The second element is the class label.
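
As an illustration of that input format, the sketch below trains on a tiny hand-made `toy_train_set`. The feature names and the labels `money` and `attention` are invented for this example and are not taken from Senseval.

```python
from nltk.classify import NaiveBayesClassifier

# Each item is (feature dictionary, class label); all values here are made up
toy_train_set = [
    ({"rates": True}, "money"),
    ({"rates": True, "bank": True}, "money"),
    ({"hobby": True}, "attention"),
    ({"hobby": True, "strong": True}, "attention"),
]

toy_classifier = NaiveBayesClassifier.train(toy_train_set)

# Classify a new feature dictionary that uses the same feature names
predicted = toy_classifier.classify({"rates": True})
print(predicted)
print(sorted(toy_classifier.labels()))
```

The real training sets used below have the same shape, with the feature dictionaries produced by a feature extraction function.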


In [14]:
def sense_instances(instances, sense):
    """
    This returns the list of instances in instances that have the sense `sense`
    """
    return [instance for instance in instances if instance.senses[0]==sense]

In [15]:
sense2 = sense_instances(senseval.instances('hard.pos'), 'HARD2')

In [16]:
sense2[:5]

[SensevalInstance(word='hard-a', position=15, context=[('keep', 'VB'), ('this', 'DT'), ('one', 'CD'), ('in', 'IN'), ('your', 'PRP$'), ('drawer', 'NN'), ('for', 'IN'), ('the', 'DT'), ('next', 'JJ'), ('time', 'NN'), ('the', 'DT'), ('boss', 'NN'), ('gives', 'VBZ'), ('you', 'PRP'), ('a', 'DT'), ('hard', 'JJ'), ('time', 'NN'), ('.', '.')], senses=('HARD2',)),
 SensevalInstance(word='hard-a', position=11, context=[('she', 'PRP'), ('recommends', 'VBZ'), ('continuing', 'VBG'), ('education', 'NN'), ('courses', 'NNS'), (',', ','), ('developing', 'VBG'), ('effective', 'JJ'), ('people', 'NNS'), ('skills', 'NNS'), ('and', 'CC'), ('hard', 'JJ'), ('work', 'NN'), ('.', '.')], senses=('HARD2',)),
 SensevalInstance(word='hard-a', position=10, context=[('the', 'DT'), ('phrase', 'NN'), ('``', '``'), ('consent', 'NN'), ('of', 'IN'), ('the', 'DT'), ('governed', 'VBN'), ("''", "''"), ('needs', 'VBZ'), ('a', 'DT'), ('hard', 'JJ'), ('look', 'NN'), ('.', '.')], senses=('HARD2',)),
 SensevalInstance(word='hard-a

In [17]:
nltk.download('stopwords')
STOPWORDS_SET = set(nltk.corpus.stopwords.words('english'))  # a set, for fast membership tests

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [18]:
# Some helper functions we'll need to train our model

def extract_vocab_frequency(instances, stopwords=STOPWORDS_SET, n=300):
    """
    Given a list of senseval instances, return the most frequent words that appear in
    their contexts (i.e., the sentences containing the target word), excluding the
    target word itself and the stopwords. The output is a list of (word, count) pairs
    in decreasing order of frequency, where count is the number of instances whose
    context contains the word.
    """
    fd = nltk.FreqDist()
    for i in instances:
        (target, suffix) = i.word.split('-')
        words = (c[0] for c in i.context if not c[0] == target)
        for word in set(words) - set(stopwords):
            fd[word] += 1
    # Note: the slice [:n+1] keeps n+1 entries, i.e. one more than n.
    return fd.most_common()[:n+1]

In [19]:
def extract_vocab(instances, stopwords=STOPWORDS_SET, n=300):
    return [w for w,f in extract_vocab_frequency(instances,stopwords,n)]

In [20]:
extract_vocab(senseval.instances('interest.pos'), stopwords=STOPWORDS_SET, n=1000)[:20]

['.',
 ',',
 'rates',
 "'s",
 'said',
 '%',
 'interests',
 '``',
 "''",
 '$',
 'million',
 'n',
 "'t",
 'mr',
 'company',
 'u',
 'rate',
 'would',
 'market',
 'bonds']

In [21]:
# Feature extraction

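def wsd_context_features(instance, vocab, dist=3):
    """
    Positional features: the words up to `dist` positions to the left and right
    of the target word. `vocab` is unused; it is accepted so that this function
    has the same signature as wsd_word_features.
    """
    features = {}
    ind = instance.position
    con = instance.context
    # Words to the left of the target; j is the distance from the target
    for i in range(max(0, ind-dist), ind):
        j = ind-i
        features['left-context-word-%s(%s)' % (j, con[i][0])] = True

    # Words to the right of the target
    for i in range(ind+1, min(ind+dist+1, len(con))):
        j = i-ind
        features['right-context-word-%s(%s)' % (j, con[i][0])] = True

    features['word'] = instance.word
    # Note: con[1][1] is the POS tag of the second token of the context, not of
    # the target word; the target's own tag would be con[ind][1].
    features['pos'] = con[1][1]
    return features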


This feature set represents the context of a word w as the m words that occur immediately before w and the m words that occur immediately after w, each combined with its distance from w. As we'll see shortly, you can specify the value of m through the `dist` parameter (e.g., `dist=1` means the context consists of just the immediately preceding and immediately following words); otherwise, it defaults to 3. The function also adds a `word` feature and a `pos` feature.
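
To see the effect of the window size in isolation, here is a small self-contained sketch. `window_features` is a hypothetical helper that works on a plain list of tokens instead of a `SensevalInstance`, and the token list is made up.

```python
def window_features(tokens, position, dist=3):
    """Hypothetical helper: positional features for the words around tokens[position]."""
    features = {}
    # Up to `dist` words to the left, keyed by their distance from the target
    for i in range(max(0, position - dist), position):
        features['left-context-word-%s(%s)' % (position - i, tokens[i])] = True
    # Up to `dist` words to the right
    for i in range(position + 1, min(position + dist + 1, len(tokens))):
        features['right-context-word-%s(%s)' % (i - position, tokens[i])] = True
    return features

tokens = ['declines', 'in', 'interest', 'rates', '.']
narrow = window_features(tokens, position=2, dist=1)  # one word on each side
wide = window_features(tokens, position=2, dist=3)    # as many as available, up to three
print(sorted(narrow))
print(len(wide))
```

A wider window gives the classifier more evidence, but each individual feature is seen less often in training.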

In [22]:
senseval.instances('interest.pos')[4]

SensevalInstance(word='interest-n', position=8, context=[('finmeccanica', 'NN'), ('is', 'VBZ'), ('an', 'DT'), ('italian', 'NN'), ('state-owned', 'JJ'), ('holding', 'NN'), ('company', 'NN'), ('with', 'IN'), ('interests', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('mechanical', 'JJ'), ('engineering', 'NN'), ('industry', 'NN'), ('.', '.')], senses=('interest_5',))

In [23]:
vocab_interest = extract_vocab(senseval.instances('interest.pos'), stopwords=[], n=300)
wsd_context_features(senseval.instances('interest.pos')[4], vocab=vocab_interest)

{'left-context-word-1(with)': True,
 'left-context-word-2(company)': True,
 'left-context-word-3(holding)': True,
 'pos': 'VBZ',
 'right-context-word-1(in)': True,
 'right-context-word-2(the)': True,
 'right-context-word-3(mechanical)': True,
 'word': 'interest-n'}

In [24]:
def wsd_word_features(instance, vocab, dist=3):
    """
    Create a feature set where every key returns False unless it occurs in the
    instance's context. `dist` is unused; it is accepted so that this function
    has the same signature as wsd_context_features.
    """
    features = defaultdict(lambda: False)
    features['alwayson'] = True
    try:
        for (w, pos) in instance.context:
            if w in vocab:
                features[w] = True
    except ValueError:
        # Raised if a context item cannot be unpacked into a (word, tag) pair
        pass
    return features

This feature set is based on the set S of the n most frequent words that occur in the same sentence as the target word w across the entire training corpus (as you'll see later, you can specify the value of n; if you don't, it defaults to 300). For each occurrence of w, `wsd_word_features` represents its context as the subset of the words from S that occur in w's sentence.

In [25]:
senseval.instances('interest.pos')[4]

SensevalInstance(word='interest-n', position=8, context=[('finmeccanica', 'NN'), ('is', 'VBZ'), ('an', 'DT'), ('italian', 'NN'), ('state-owned', 'JJ'), ('holding', 'NN'), ('company', 'NN'), ('with', 'IN'), ('interests', 'NNS'), ('in', 'IN'), ('the', 'DT'), ('mechanical', 'JJ'), ('engineering', 'NN'), ('industry', 'NN'), ('.', '.')], senses=('interest_5',))

In [26]:
wsd_word_features(senseval.instances('interest.pos')[4], vocab=vocab_interest)

defaultdict(<function __main__.wsd_word_features.<locals>.<lambda>>,
            {'.': True,
             'alwayson': True,
             'an': True,
             'company': True,
             'holding': True,
             'in': True,
             'industry': True,
             'interests': True,
             'is': True,
             'the': True,
             'with': True})

In [27]:
_inst_cache = {}


In [28]:
def wsd_classifier(trainer, word, features, stopwords_list = STOPWORDS_SET, number=300, distance=3, confusion_matrix=False):
    """
    This function takes as arguments:
        a trainer (e.g., NaiveBayesClassifier.train);
        the fileid of a target word in the senseval corpus (e.g., 'hard.pos');
        a feature set (this can be wsd_context_features or wsd_word_features);
        a stopword list, which is excluded when the vocabulary for wsd_word_features is built;
        a number (defaults to 300), which determines for wsd_word_features the number of
            most frequent words in the contexts of the target word that you use to classify examples;
        a distance (defaults to 3) which determines the size of the window for wsd_context_features
            (if distance=3, then wsd_context_features uses 3 words to the left and 3 words to
            the right of the target word);
        confusion_matrix (defaults to False), which if set to True prints a confusion matrix.

    Calling this function splits the senseval data for the word into a training set and a test set (the way it does
    this is the same for each call of this function, because the argument to random.seed is specified,
    but removing this argument would make the training and testing sets different each time you build a classifier).

    It then trains the trainer on the training set to create a classifier that performs WSD on the word,
    using features (with number or distance where relevant).

    It then tests the classifier on the test set, and prints its accuracy on that set.


    If confusion_matrix==True, then calling this function prints out a confusion matrix, where each cell [i,j]
    indicates how often label j was predicted when the correct label was i (so the diagonal entries indicate labels
    that were correctly predicted).
    """
    print("Reading data...")
    global _inst_cache
    if word not in _inst_cache:
        _inst_cache[word] = [(i, i.senses[0]) for i in senseval.instances(word)]
    events = _inst_cache[word][:]
    senses = list(set(l for (i, l) in events))
    instances = [i for (i, l) in events]
    vocab = extract_vocab(instances, stopwords=stopwords_list, n=number)
    print(' Senses: ' + ' '.join(senses))

    # Split the instances into a training and test set,
    #if n > len(events): n = len(events)
    n = len(events)
    random.seed(334)
    random.shuffle(events)
    training_data = events[:int(0.8 * n)]
    test_data = events[int(0.8 * n):n]

    # Train classifier
    print('Training classifier...')
    classifier = trainer([(features(i, vocab, distance), label) for (i, label) in training_data])
    # Test classifier
    print('Testing classifier...')
    acc = accuracy(classifier, [(features(i, vocab, distance), label) for (i, label) in test_data] )
    print('Accuracy: %6.4f' % acc)
    
    if confusion_matrix:
        gold = [label for (i, label) in test_data]
        # Pass distance here as well, so the features match the ones used for training
        derived = [classifier.classify(features(i, vocab, distance)) for (i, label) in test_data]
        cm = nltk.ConfusionMatrix(gold, derived)
        print(cm)

    return classifier
        

In [29]:
# Training the classifier:
# NB, with features based on the 300 most frequent context words
wsd_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_word_features)

# Pseudocode, general training steps:
# featuresets = [(extract_features(i), i.senses[0]) for i in instances]
# classifier = NaiveBayesClassifier.train(featuresets)


Reading data...
 Senses: HARD2 HARD1 HARD3
Training classifier...
Testing classifier...
Accuracy: 0.8547


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7c96bcc10>

In [30]:
# NB, with features based on words + POS in a 6-word window (3 on each side)
wsd_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_context_features)

Reading data...
 Senses: HARD2 HARD1 HARD3
Training classifier...
Testing classifier...
Accuracy: 0.8927


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7c8b063d0>

In [31]:
wsd_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_context_features, confusion_matrix=True) # chance level with 3 senses: 1/3 ~ 0.33

Reading data...
 Senses: HARD2 HARD1 HARD3
Training classifier...
Testing classifier...
Accuracy: 0.8927
      |   H   H   H |
      |   A   A   A |
      |   R   R   R |
      |   D   D   D |
      |   1   2   3 |
------+-------------+
HARD1 |<650> 35  16 |
HARD2 |  12 <76>  5 |
HARD3 |   7  18 <48>|
------+-------------+
(row = reference; col = test)



<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7c81a2cd0>

In [32]:
wsd_classifier(NaiveBayesClassifier.train, 'interest.pos', wsd_context_features, confusion_matrix=True) # chance level with 6 senses: 1/6 ~ 0.17

Reading data...
 Senses: interest_1 interest_5 interest_3 interest_2 interest_6 interest_4
Training classifier...
Testing classifier...
Accuracy: 0.4219
           |   i   i   i   i   i   i |
           |   n   n   n   n   n   n |
           |   t   t   t   t   t   t |
           |   e   e   e   e   e   e |
           |   r   r   r   r   r   r |
           |   e   e   e   e   e   e |
           |   s   s   s   s   s   s |
           |   t   t   t   t   t   t |
           |   _   _   _   _   _   _ |
           |   1   2   3   4   5   6 |
-----------+-------------------------+
interest_1 | <24> 33   4   4   5   1 |
interest_2 |   .  <3>  .   1   .   . |
interest_3 |   2   2  <4>  .   .   . |
interest_4 |   1  22   2 <14>  .   2 |
interest_5 |   5  45   1   6 <61>  2 |
interest_6 |   . 133   2   1   . <94>|
-----------+-------------------------+
(row = reference; col = test)



<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7c65c2dd0>

In [33]:
wsd_classifier(trainer=NaiveBayesClassifier.train, 
               word='interest.pos', 
               features=wsd_context_features) # chance level with 6 senses: 1/6 ~ 0.17

Reading data...
 Senses: interest_1 interest_5 interest_3 interest_2 interest_6 interest_4
Training classifier...
Testing classifier...
Accuracy: 0.4219


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7c57aca10>

Why is the accuracy lower for "interest"...?
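
One way to start investigating is to compute per-sense precision and recall from the confusion matrix printed above. The sketch below copies the counts of the `interest.pos` matrix by hand (rows are reference senses, columns are predicted senses); the variable names are chosen for this example.

```python
labels = ['interest_1', 'interest_2', 'interest_3',
          'interest_4', 'interest_5', 'interest_6']

# Rows = reference (gold) sense, columns = predicted sense; '.' cells are 0
cm = [
    [24, 33, 4, 4, 5, 1],
    [0, 3, 0, 1, 0, 0],
    [2, 2, 4, 0, 0, 0],
    [1, 22, 2, 14, 0, 2],
    [5, 45, 1, 6, 61, 2],
    [0, 133, 2, 1, 0, 94],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(labels)))
accuracy = correct / total

# Column sums: how often each sense was predicted
predicted_counts = [sum(row[j] for row in cm) for j in range(len(labels))]

for i, label in enumerate(labels):
    recall = cm[i][i] / sum(cm[i])             # share of gold instances found
    precision = cm[i][i] / predicted_counts[i]  # share of predictions that are right
    print('%s  precision=%.3f  recall=%.3f' % (label, precision, recall))
print('accuracy=%.4f' % accuracy)
```

In this matrix, 238 of the 474 test instances are labeled `interest_2`, although only 4 of them actually have that sense, so most of the errors come from a single over-predicted label.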

**Baseline**: How could we guess the sense of a word without any additional information?

In [34]:
# Frequency Baseline
hard_sense_fd = nltk.FreqDist([i.senses[0] for i in senseval.instances('hard.pos')])
most_frequent_hard_sense = list(hard_sense_fd.keys())[0]
frequency_hard_sense_baseline = hard_sense_fd.freq(list(hard_sense_fd.keys())[0])


In [35]:
frequency_hard_sense_baseline

0.797369028386799

In [36]:
interest_sense_fd = nltk.FreqDist([i.senses[0] for i in senseval.instances('interest.pos')])
most_frequent_interest_sense = interest_sense_fd.max()  # the most frequent sense, not merely the first key
frequency_interest_sense_baseline = interest_sense_fd.freq(most_frequent_interest_sense)

In [37]:
frequency_interest_sense_baseline

0.5287162162162162
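
The same baseline can be written as a small reusable function. This is a sketch using only the standard library; `most_frequent_sense_baseline` and the toy label list are made up for illustration.

```python
from collections import Counter

def most_frequent_sense_baseline(sense_labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(sense_labels)
    sense, count = counts.most_common(1)[0]
    return sense, count / len(sense_labels)

# Hypothetical gold labels for six occurrences of some ambiguous word
toy_labels = ["s1", "s2", "s1", "s3", "s1", "s2"]
baseline_sense, baseline_accuracy = most_frequent_sense_baseline(toy_labels)
print(baseline_sense, baseline_accuracy)
```

Compared with the runs above, the Naive Bayes classifier for `hard` (0.8927) beats its baseline (0.7974), while the one for `interest` (0.4219) falls below its baseline (0.5287).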

You can also use the Naive Bayes classifier from scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html).
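
A minimal sketch of how this could look, assuming the features are kept as dictionaries like the NLTK ones and converted to a matrix with `DictVectorizer`. The toy feature dictionaries and their labels are invented for this example and are not taken from the corpus.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical bag-of-words features (1 = word present in the context)
train_features = [
    {"rates": 1, "in": 1},
    {"rates": 1, "declining": 1},
    {"in": 1, "company": 1},
    {"its": 1, "company": 1},
]
train_labels = ["interest_6", "interest_6", "interest_5", "interest_5"]

# DictVectorizer maps feature names to columns; MultinomialNB is the classifier
model = make_pipeline(DictVectorizer(), MultinomialNB())
model.fit(train_features, train_labels)

predicted = model.predict([{"rates": 1}])
print(predicted[0])
```

With real data, the predictions on a held-out test set could then be passed to `classification_report` together with the gold labels.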

In [38]:
from sklearn.metrics import classification_report

In [39]:
classification_report??

# Exercises (1p)

Download a sample of the iWeb corpus, available here: https://www.corpusdata.org/iweb/samples/text0.zip . Unzip the archive and choose one of the text files in the archive at random. You will use it in the next exercises.

In [40]:
!wget https://www.corpusdata.org/iweb/samples/text0.zip

--2022-05-16 13:20:10--  https://www.corpusdata.org/iweb/samples/text0.zip
Resolving www.corpusdata.org (www.corpusdata.org)... 209.90.108.238
Connecting to www.corpusdata.org (www.corpusdata.org)|209.90.108.238|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 37647973 (36M) [application/x-zip-compressed]
Saving to: ‘text0.zip.1’


2022-05-16 13:20:49 (932 KB/s) - ‘text0.zip.1’ saved [37647973/37647973]



In [41]:
!unzip -x 'text0.zip'

Archive:  text0.zip
replace 103053.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: 103053.txt              
  inflating: 116053.txt              
  inflating: 12053.txt               
  inflating: 128053.txt              
  inflating: 138053.txt              
  inflating: 140053.txt              
  inflating: 152053.txt              
  inflating: 161053.txt              
  inflating: 167053.txt              
  inflating: 17053.txt               
  inflating: 179053.txt              
  inflating: 181053.txt              
  inflating: 183053.txt              
  inflating: 185053.txt              
  inflating: 19053.txt               
  inflating: 206053.txt              
  inflating: 228053.txt              
  inflating: 241053.txt              
  inflating: 246053.txt              
  inflating: 247053.txt              
  inflating: 253053.txt              
  inflating: 259053.txt              
  inflating: 27053.txt               
  inflating: 273053.txt              
  inflat

In [42]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [94]:
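import os

data_dir = '/content/'  # renamed from dir, which shadows a Python builtin
os.chdir(data_dir)

def read_text_file(file_path):
    """Return the full contents of the text file at file_path."""
    with open(file_path, 'r') as f:
        return f.read()

# Use the first .txt file found after skipping the first 10 directory entries
input_text = ''
for file_name in os.listdir()[10:]:
    if file_name.endswith(".txt"):
        input_text = read_text_file(os.path.join(data_dir, file_name))
        break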

In [95]:
input_text

'\n@@90398177 @5898177/ <h> Types of Squash Grips <p> Your racket will come with a stock factory grip . You can put an overgrip on top of that , to help with sweat absorption , or to increase the size of the grip and make the racket feel more comfortable in your freakishly large hands . <p> You can also buy a replacement grip . In this case you remove the factory grip and put on the replacement grip instead . <p> I generally only use overgrips . My favorite has always been Tourna Grip . here \'s a video review I did of Tourna Grip , along with a brief tutorial on how to grip a squash racket . <p> I do n\'t  use replacement grips too often but I \'d have to guess the #1 brand of replacement squash grips is Karakal PU Super Grip . Tons of pros use Karakal . <p> Picking a grip is a very personal matter . Obviously , you need your squash racket to feel comfortable in your hands . So experiment with different squash grips and see what works best for you and your freakishly @ @ @ @ @ @ @ @ @

1. Load the text in the selected file and disambiguate every instance of the words `hard` and `line`. Try different approaches, using both knowledge-based and corpus-based methods:

- use the trained Naive Bayes classifier above
- use the Lesk algorithm implementation in NLTK (see previous lab)
- use Banerjee & Pedersen's extended Lesk algorithm (see previous lab, you can use the implementation there)


**Naive Bayes classifier**

In [96]:
import re
import nltk
nltk.download("punkt")

from nltk import tokenize

sentences = tokenize.sent_tokenize(input_text)
text = [re.sub(r"[a-z]*\d+\/*|\\n|<h>|<p>|@|&amp;| -- ", " ", sentence) for sentence in sentences]
text

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['\n         Types of Squash Grips   Your racket will come with a stock factory grip .',
 'You can put an overgrip on top of that , to help with sweat absorption , or to increase the size of the grip and make the racket feel more comfortable in your freakishly large hands .',
 '  You can also buy a replacement grip .',
 'In this case you remove the factory grip and put on the replacement grip instead .',
 '  I generally only use overgrips .',
 'My favorite has always been Tourna Grip .',
 "here 's a video review I did of Tourna Grip , along with a brief tutorial on how to grip a squash racket .",
 "  I do n't  use replacement grips too often but I 'd have to guess the #  brand of replacement squash grips is Karakal PU Super Grip .",
 'Tons of pros use Karakal .',
 '  Picking a grip is a very personal matter .',
 'Obviously , you need your squash racket to feel comfortable in your hands .',
 'So experiment with different squash grips and see what works best for you and your freakishly  

In [97]:
sentences = [tokenize.word_tokenize(sentence) for sentence in text]
sentences = [[word.lower() for word in sentence] for sentence in sentences]

In [98]:
# trained Naive Bayes classifier for the target word 'hard'
classifier = wsd_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_context_features)

Reading data...
 Senses: HARD2 HARD1 HARD3
Training classifier...
Testing classifier...
Accuracy: 0.8927


In [99]:
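def extract_vocabulary_frequency(sentences, target_word, stopwords=STOPWORDS_SET, n=300):
  fd = nltk.FreqDist()
  for sentence in sentences:
    # {target_word} removes the word itself; set(target_word) would remove its letters
    for word in set(sentence) - set(stopwords) - {target_word}:
      fd[word] += 1

  # Note: the slice [:n+1] keeps n+1 entries, i.e. one more than n.
  return fd.most_common()[:n+1]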


def extract_vocabulary(sentences, target_word="", stopwords=STOPWORDS_SET, n=300):
  return [w for w, f in extract_vocabulary_frequency(sentences, target_word, stopwords, n)]

In [101]:
vocabulary_hard = extract_vocabulary(sentences, "hard")
vocabulary_hard

['.',
 ',',
 'racket',
 'squash',
 ')',
 '(',
 ':',
 'like',
 "'s",
 "n't",
 'power',
 'one',
 '!',
 'shoes',
 'head',
 'good',
 'rackets',
 'string',
 'read',
 '?',
 'would',
 '...',
 'control',
 'new',
 'also',
 'used',
 "'m",
 'feel',
 'strings',
 'dunlop',
 'asics',
 'shoe',
 'great',
 'prince',
 'get',
 'version',
 'think',
 'using',
 "'ve",
 'pro',
 'much',
 'ball',
 'really',
 '``',
 'play',
 'court',
 'gel',
 'comments',
 'racquet',
 'grip',
 'bit',
 'use',
 'even',
 'players',
 'time',
 'tecnifibre',
 'playing',
 'see',
 'balance',
 'well',
 'adidas',
 'light',
 'weight',
 'better',
 'black',
 'years',
 'played',
 'first',
 'back',
 'try',
 'frame',
 'stabil',
 '-',
 'around',
 'model',
 'tried',
 'game',
 'find',
 'best',
 'grams',
 'know',
 'little',
 'two',
 'different',
 "'d",
 'hi',
 'say',
 'got',
 'felt',
 'quite',
 'sure',
 'could',
 'bought',
 'eye',
 'tension',
 'though',
 'hit',
 'still',
 'last',
 'thanks',
 'pretty',
 'links',
 'carboflex',
 'strung',
 'since',
 '

In [106]:
def wsd_condext_features_set(sentence, target_word, vocab, dist=3):
  features = {}
  try:
    ind = sentence.index(target_word)
  except ValueError:
    return {}

  for i in range(max(0, ind - dist), ind):
    j = ind - i
    features['left-context-word-%s(%s)' % (j, sentence[i])] = True

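  for i in range(ind + 1, min(ind + dist + 1, len(sentence))):
    # Note: ind - i is negative here, so these keys look like
    # 'right-context-word--1(...)' and do not match the positive offsets (i - ind)
    # that wsd_context_features produces for the training data.
    j = ind - i
    features['right-context-word-%s(%s)' % (j, sentence[i])] = True

  # Note: the training instances use the senseval form (e.g. 'hard-a') as the 'word' value.
  features['word'] = target_word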

  return features

In [107]:
res_hard = [classifier.classify(wsd_condext_features_set(sentence, 'hard', vocabulary_hard)) for sentence in sentences if 'hard' in sentence]
res_hard

['HARD1',
 'HARD2',
 'HARD2',
 'HARD3',
 'HARD3',
 'HARD3',
 'HARD3',
 'HARD1',
 'HARD1',
 'HARD2',
 'HARD3',
 'HARD2',
 'HARD1',
 'HARD2',
 'HARD2',
 'HARD1',
 'HARD2',
 'HARD2',
 'HARD2',
 'HARD1',
 'HARD1',
 'HARD3',
 'HARD3',
 'HARD3',
 'HARD1',
 'HARD1',
 'HARD2',
 'HARD3',
 'HARD3',
 'HARD1',
 'HARD3',
 'HARD2',
 'HARD2',
 'HARD3',
 'HARD1',
 'HARD3',
 'HARD3',
 'HARD1',
 'HARD3',
 'HARD2',
 'HARD3',
 'HARD1',
 'HARD3',
 'HARD1',
 'HARD1',
 'HARD1',
 'HARD1',
 'HARD3',
 'HARD1',
 'HARD2',
 'HARD3',
 'HARD3',
 'HARD2',
 'HARD3',
 'HARD3',
 'HARD1',
 'HARD2',
 'HARD1',
 'HARD1',
 'HARD1',
 'HARD2',
 'HARD2',
 'HARD2',
 'HARD2',
 'HARD2',
 'HARD2',
 'HARD3',
 'HARD1',
 'HARD1',
 'HARD1',
 'HARD1',
 'HARD3']

In [53]:
# trained Naive Bayes classifier for the target word 'line'
classifier2 = wsd_classifier(NaiveBayesClassifier.train, 'line.pos', wsd_context_features)

Reading data...
 Senses: formation cord text division product phone
Training classifier...
Testing classifier...
Accuracy: 0.7470


In [108]:
vocabulary_line = extract_vocabulary(sentences, "line")
res_line = [classifier2.classify(wsd_condext_features_set(sentence, 'line', vocabulary_line)) for sentence in sentences if 'line' in sentence]
set(res_line)

{'cord', 'division', 'formation', 'phone', 'product', 'text'}

In [109]:
nltk.download('wordnet')

from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [110]:
for ss in wn.synsets('hard'):
  print(ss, ss.definition())

Synset('difficult.a.01') not easy; requiring great physical or mental effort to accomplish or comprehend or endure
Synset('hard.a.02') dispassionate; 
Synset('hard.a.03') resisting weight or pressure
Synset('hard.s.04') very strong or vigorous
Synset('arduous.s.01') characterized by effort to the point of exhaustion; especially physical effort
Synset('unvoiced.a.01') produced without vibration of the vocal cords
Synset('hard.a.07') (of light) transmitted directly from a pointed light source
Synset('hard.a.08') (of speech sounds); produced with the back of the tongue raised toward or touching the velum
Synset('intemperate.s.03') given to excessive indulgence of bodily appetites especially for intoxicating liquors
Synset('hard.s.10') being distilled rather than fermented; having a high alcoholic content
Synset('hard.s.11') unfortunate or hard to bear
Synset('hard.s.12') dried out
Synset('hard.r.01') with effort or force or vigor
Synset('hard.r.02') with firmness
Synset('hard.r.03') earne

In [111]:
for ss in wn.synsets('line'):
  print(ss, ss.definition())

Synset('line.n.01') a formation of people or things one beside another
Synset('line.n.02') a mark that is long relative to its width
Synset('line.n.03') a formation of people or things one behind another
Synset('line.n.04') a length (straight or curved) without breadth or thickness; the trace of a moving point
Synset('line.n.05') text consisting of a row of words written across a page or computer screen
Synset('line.n.06') a single frequency (or very narrow band) of radiation in a spectrum
Synset('line.n.07') a fortified position (especially one marking the most forward position of troops)
Synset('argumentation.n.02') a course of reasoning aimed at demonstrating a truth or falsehood; the methodical process of logical reasoning
Synset('cable.n.02') a conductor for transmitting electrical or optical signals or electric power
Synset('course.n.02') a connected series of events or actions or developments
Synset('line.n.11') a spatial location defined by a real or imaginary unidimensional ex

**Lesk algorithm**
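As a reminder of the idea behind Lesk, here is a toy sketch: pick the sense whose gloss shares the most distinct words with the context. This is not NLTK's `lesk` implementation; the function `simplified_lesk` and the sense names `line.text` / `line.cable` are made up for illustration, and the two glosses are copied from the WordNet output above.

```python
def simplified_lesk(context_tokens, sense_glosses):
    """Return (sense, overlap) for the gloss sharing most distinct words with the context.

    sense_glosses is a dict from a sense name to its gloss string (toy data here).
    """
    context = set(tok.lower() for tok in context_tokens)
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap


# Hypothetical two-sense inventory; glosses taken from the WordNet listing above.
toy_glosses = {
    "line.text": "text consisting of a row of words written across a page or computer screen",
    "line.cable": "a conductor for transmitting electrical or optical signals or electric power",
}
sentence = "he read the first line of words on the page".split()
lesk_demo = simplified_lesk(sentence, toy_glosses)
print(lesk_demo)  # ('line.text', 3): the shared words are 'of', 'words', 'page'
```

Note that one of the three shared words is the function word 'of'; uninformative overlaps like this are one reason plain Lesk can pick unlikely senses.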

In [130]:
lesk_res_hard = [(i, lesk(sent, 'hard')) for i, sent in enumerate(sentences) if 'hard' in sent]
lesk_res_line = [(i, lesk(sent, 'line')) for i, sent in enumerate(sentences) if 'line' in sent]

In [131]:
lesk_res_hard

[(62, Synset('hard.s.11')),
 (277, Synset('unvoiced.a.01')),
 (577, Synset('intemperate.s.03')),
 (598, Synset('hard.r.10')),
 (652, Synset('hard.s.11')),
 (731, Synset('hard.s.11')),
 (841, Synset('hard.a.08')),
 (921, Synset('hard.s.11')),
 (1119, Synset('hard.s.11')),
 (1355, Synset('hard.s.11')),
 (1381, Synset('hard.a.08')),
 (1393, Synset('hard.s.11')),
 (1480, Synset('hard.r.10')),
 (1487, Synset('hard.s.11')),
 (1531, Synset('hard.a.08')),
 (1573, Synset('hard.r.10')),
 (1580, Synset('hard.s.11')),
 (1710, Synset('hard.s.11')),
 (1714, Synset('hard.a.08')),
 (1828, Synset('hard.s.11')),
 (1913, Synset('arduous.s.01')),
 (2103, Synset('unvoiced.a.01')),
 (2286, Synset('unvoiced.a.01')),
 (2336, Synset('hard.s.11')),
 (2389, Synset('hard.s.11')),
 (2442, Synset('hard.s.11')),
 (2467, Synset('unvoiced.a.01')),
 (2551, Synset('hard.s.11')),
 (2625, Synset('hard.s.11')),
 (2643, Synset('hard.s.11')),
 (2689, Synset('hard.a.08')),
 (2690, Synset('difficult.a.01')),
 (2703, Synset('in

In [132]:
lesk_res_line

[(147, Synset('agate_line.n.01')),
 (254, Synset('line.n.20')),
 (409, Synset('agate_line.n.01')),
 (501, Synset('production_line.n.01')),
 (679, Synset('agate_line.n.01')),
 (801, Synset('agate_line.n.01')),
 (832, Synset('wrinkle.n.01')),
 (969, Synset('agate_line.n.01')),
 (1076, Synset('agate_line.n.01')),
 (1272, Synset('agate_line.n.01')),
 (1295, Synset('agate_line.n.01')),
 (1382, Synset('agate_line.n.01')),
 (1395, Synset('production_line.n.01')),
 (1418, Synset('wrinkle.n.01')),
 (1512, Synset('production_line.n.01')),
 (1605, Synset('production_line.n.01')),
 (1646, Synset('production_line.n.01')),
 (1669, Synset('wrinkle.n.01')),
 (2761, Synset('production_line.n.01')),
 (2762, Synset('wrinkle.n.01')),
 (2943, Synset('production_line.n.01')),
 (2944, Synset('wrinkle.n.01')),
 (3557, Synset('wrinkle.n.01')),
 (3616, Synset('wrinkle.n.01')),
 (3972, Synset('wrinkle.n.01')),
 (3977, Synset('wrinkle.n.01')),
 (4067, Synset('wrinkle.n.01')),
 (4072, Synset('wrinkle.n.01')),
 (42

**Banerjee & Pedersen's extended Lesk algorithm**

In [115]:
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
import string
nltk.download('stopwords')
import re

gloss_rel = lambda x: x.definition()
example_rel = lambda x: " ".join(x.examples())
hyponym_rel = lambda x: " ".join(w.definition() for w in x.hyponyms())
meronym_rel = lambda x: " ".join(w.definition() for w in x.member_meronyms() + \
                                 x.part_meronyms() + x.substance_meronyms())
also_rel = lambda x: " ".join(w.definition() for w in x.also_sees())
attr_rel = lambda x: " ".join(w.definition() for w in x.attributes())
hypernym_rel = lambda x: " ".join(w.definition() for w in x.hypernyms())

relpairs = {wn.NOUN: [(hyponym_rel, meronym_rel), (meronym_rel, hyponym_rel),
                      (hyponym_rel, hyponym_rel),
                      (gloss_rel, meronym_rel), (meronym_rel, gloss_rel),
                      (example_rel, meronym_rel), (meronym_rel, example_rel),
                      (gloss_rel, gloss_rel)],
            wn.ADJ: [(also_rel, gloss_rel), (gloss_rel, also_rel),
                     (attr_rel, gloss_rel), (gloss_rel, attr_rel),
                     (gloss_rel, gloss_rel),
                     (example_rel, gloss_rel), (gloss_rel, example_rel),
                     (gloss_rel, hypernym_rel), (hypernym_rel, gloss_rel)],
            wn.VERB:[(example_rel, example_rel),
                     (example_rel, hypernym_rel), (hypernym_rel, example_rel),
                     (hyponym_rel, hyponym_rel),
                     (gloss_rel, hyponym_rel), (hyponym_rel, gloss_rel),
                     (example_rel, gloss_rel), (gloss_rel, example_rel)]}

def preprocess(text):
    """
    Helper function to preprocess text (lowercase, remove punctuation etc.)
    """
    words = nltk.word_tokenize(text)
    punctuation = string.punctuation
    words = [word.lower() for word in words if word not in punctuation]
    # Note: removing every stopword deviates from the original algorithm, which
    # only drops stopwords at the edges of an overlapping subsequence.
    english_stopwords = set(stopwords.words('english'))  # build the set once
    words = [word for word in words if word not in english_stopwords]
    return words

def lcs(S1, S2):
    """
    Helper function to compute length and offsets of longest common substring of
    S1 and S2. Uses the classical dynamic programming algorithm.
    """
    M = [[0]*(1+len(S2)) for i in range(1+len(S1))]
    longest, x_longest, y_longest = 0, 0, 0
    for x in range(1,1+len(S1)):
        for y in range(1,1+len(S2)):
            if S1[x-1] == S2[y-1]:
                M[x][y] = M[x-1][y-1] + 1
                if M[x][y]>longest:
                    longest = M[x][y]
                    x_longest = x
                    y_longest = y
            else:
                M[x][y] = 0
    return longest, x_longest - longest, y_longest - longest

def score(gloss1, gloss2, normalized=False):
    """
    Compute score between two glosses based on length of common substrings.
    """
    gloss1 = preprocess(gloss1)
    gloss2 = preprocess(gloss2)
    # Record the combined length before overlaps are deleted below; dividing by
    # the leftover length could hit zero when the glosses overlap completely.
    total_len = len(gloss1) + len(gloss2)
    curr_score = 0
    longest, start1, start2 = lcs(gloss1, gloss2)
    while longest > 0:
        gloss1[start1 : start1 + longest] = []
        gloss2[start2 : start2 + longest] = []
        curr_score += longest ** 2
        longest, start1, start2 = lcs(gloss1, gloss2)
    if normalized and curr_score:
        return curr_score / total_len
    return curr_score

def relatedness(sense1, sense2, relpairs, normalized=False):
    """
    Compute the relatedness of two senses (synsets) using the list of pairs of
    relations in relpairs.
    """
    return sum(score(pair[0](sense1), pair[1](sense2), normalized=normalized) # Note: normalization not explicitly part of original algorithm!
    for pair in relpairs)

def wsd(context, target, winsize, pos_tag, verbose=False, normalized=False):
    """
    Find the best sense for a word in a given context.
    Arguments:
    context - sentence(s) we are analyzing; expected as list of strings
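    target  - string representing the word whose senses we're trying to
              disambiguate. Target is assumed to occur once in sentence. In case
              of multiple occurrences, the first one is considered. If the
              target cannot be located in the context (or has no synsets for
              pos_tag), (None, 0.) is returned
    winsize - size of window used for disambiguating. The algorithm will only
              look at winsize words of the appropriate part-of-speech around the
              target word
    pos_tag - part of speech of target word
    verbose - if True, print the score of every candidate sense
    normalized - passed on to score() to normalize overlap scores by gloss length
    Returns a (best synset, best score) tuple.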
    """
    context = list(filter(None, [wn.synsets(word, pos=pos_tag) for word in context]))
    target_synsets = wn.synsets(target, pos=pos_tag)
    try:
      pos = context.index(target_synsets)
    except ValueError:
      return None, 0.

    window = context[max(pos - winsize, 0) : pos] + \
             context[pos + 1 : min(pos + winsize + 1, len(context))]
    sense_scores = [sum(sum(relatedness(sense, other_sense, relpairs[pos_tag], normalized=normalized)
                              for other_sense in senses)
                   for senses in window) for sense in target_synsets]
    if verbose:
      print("All scores:")
      for i, s in enumerate(target_synsets):
        print(sense_scores[i], s, s.definition())
    best_score = max(sense_scores)
    best_index = sense_scores.index(best_score)
    return target_synsets[best_index], best_score


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
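To make the overlap scoring in `score` concrete, here is a self-contained sketch that works on token lists that are already preprocessed, so it needs no NLTK data. The names `longest_common_run` and `overlap_score` are illustrative stand-ins for `lcs` and `score`, and the two token lists are hand-made approximations of the glosses of `line.n.01` and `line.n.03`.

```python
def longest_common_run(a, b):
    """Return (length, start_in_a, start_in_b) of the longest common contiguous run."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, end_a, end_b = 0, 0, 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
                if table[i][j] > best:
                    best, end_a, end_b = table[i][j], i, j
    return best, end_a - best, end_b - best


def overlap_score(tokens1, tokens2):
    """Sum of squared lengths of common runs, removing each run once it is counted."""
    a, b = list(tokens1), list(tokens2)  # copy so the caller's lists stay intact
    total = 0
    length, start_a, start_b = longest_common_run(a, b)
    while length > 0:
        total += length ** 2
        del a[start_a:start_a + length]
        del b[start_b:start_b + length]
        length, start_a, start_b = longest_common_run(a, b)
    return total


gloss_beside = ["formation", "people", "things", "one", "beside", "another"]
gloss_behind = ["formation", "people", "things", "one", "behind", "another"]
demo_score = overlap_score(gloss_beside, gloss_behind)
print(demo_score)  # 17 = 4**2 (four-word run) + 1**2 ('another')
```

Squaring the run length is what makes one four-word overlap (16) count far more than four isolated one-word overlaps (4).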


In [116]:
sentences_preprocessed = [preprocess(' '.join(sentence)) for sentence in sentences]
sentences_preprocessed

[['types', 'squash', 'grips', 'racket', 'come', 'stock', 'factory', 'grip'],
 ['put',
  'overgrip',
  'top',
  'help',
  'sweat',
  'absorption',
  'increase',
  'size',
  'grip',
  'make',
  'racket',
  'feel',
  'comfortable',
  'freakishly',
  'large',
  'hands'],
 ['also', 'buy', 'replacement', 'grip'],
 ['case',
  'remove',
  'factory',
  'grip',
  'put',
  'replacement',
  'grip',
  'instead'],
 ['generally', 'use', 'overgrips'],
 ['favorite', 'always', 'tourna', 'grip'],
 ["'s",
  'video',
  'review',
  'tourna',
  'grip',
  'along',
  'brief',
  'tutorial',
  'grip',
  'squash',
  'racket'],
 ["n't",
  'use',
  'replacement',
  'grips',
  'often',
  "'d",
  'guess',
  'brand',
  'replacement',
  'squash',
  'grips',
  'karakal',
  'pu',
  'super',
  'grip'],
 ['tons', 'pros', 'use', 'karakal'],
 ['picking', 'grip', 'personal', 'matter'],
 ['obviously', 'need', 'squash', 'racket', 'feel', 'comfortable', 'hands'],
 ['experiment',
  'different',
  'squash',
  'grips',
  'see',
  '

In [117]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [118]:
wordnet_pos_mapping = {
    "N": wn.NOUN,
    "V": wn.VERB,
    "J": wn.ADJ,
    "R": wn.ADV
}

def pos_tag_wordnet(tokens):
  """
    Tag text with wordnet format
  """
  pos_tagged_text = nltk.pos_tag(tokens)

  # Map Penn Treebank tags to WordNet POS constants via their first letter.
  # Adverbs ("R") and all unmapped tags fall back to wn.NOUN, since relpairs
  # defines no relation pairs for wn.ADV.
  pos_tagged_text = [(word, wordnet_pos_mapping[tag[0]]) if tag[0] in ("N", "V", "J")
                     else (word, wn.NOUN)
                     for (word, tag) in pos_tagged_text]
  
  return pos_tagged_text

In [119]:
bp_results_hard = [wsd(context=sent, target="hard", winsize=5, pos_tag=pos_tag_wordnet(sent)[sent.index("hard")][1])
                    for sent in sentences_preprocessed if 'hard' in sent]
bp_results_hard

[(Synset('arduous.s.01'), 2),
 (Synset('difficult.a.01'), 2),
 (Synset('difficult.a.01'), 24),
 (None, 0.0),
 (Synset('arduous.s.01'), 3),
 (Synset('arduous.s.01'), 3),
 (Synset('difficult.a.01'), 0),
 (Synset('difficult.a.01'), 0),
 (Synset('difficult.a.01'), 0),
 (Synset('arduous.s.01'), 4),
 (Synset('difficult.a.01'), 4),
 (Synset('difficult.a.01'), 0),
 (Synset('difficult.a.01'), 194),
 (Synset('difficult.a.01'), 12),
 (Synset('difficult.a.01'), 3),
 (Synset('difficult.a.01'), 194),
 (Synset('difficult.a.01'), 12),
 (Synset('difficult.a.01'), 0),
 (Synset('difficult.a.01'), 3),
 (None, 0.0),
 (Synset('hard.a.02'), 5),
 (Synset('hard.a.02'), 4),
 (Synset('hard.a.02'), 4),
 (None, 0.0),
 (None, 0.0),
 (Synset('difficult.a.01'), 22),
 (Synset('hard.a.08'), 7),
 (Synset('difficult.a.01'), 5),
 (Synset('difficult.a.01'), 5),
 (Synset('difficult.a.01'), 22),
 (Synset('difficult.a.01'), 0),
 (Synset('difficult.a.01'), 12),
 (Synset('difficult.a.01'), 16),
 (Synset('difficult.a.01'), 3),
 

In [120]:
bp_results_line = [wsd(context=sent, target="line", winsize=5, pos_tag=pos_tag_wordnet(sent)[sent.index("line")][1])
                    for sent in sentences_preprocessed if 'line' in sent] 
bp_results_line

[(Synset('line.n.18'), 23),
 (Synset('line.n.18'), 78),
 (Synset('line.n.18'), 44),
 (Synset('line.n.11'), 42670),
 (Synset('line.n.18'), 43),
 (Synset('line.n.18'), 43),
 (Synset('line.n.18'), 21),
 (Synset('line.n.18'), 19),
 (Synset('line.n.18'), 67),
 (Synset('line.n.18'), 19),
 (Synset('line.n.18'), 67),
 (Synset('line.n.11'), 32),
 (Synset('line.n.11'), 56),
 (Synset('line.n.18'), 42),
 (Synset('line.n.11'), 11),
 (Synset('line.n.11'), 11),
 (Synset('line.n.11'), 56),
 (Synset('line.n.18'), 42),
 (Synset('line.n.18'), 71),
 (Synset('line.n.11'), 36),
 (Synset('line.n.18'), 71),
 (Synset('line.n.11'), 36),
 (Synset('line.n.18'), 70),
 (Synset('line.n.18'), 38),
 (Synset('line.n.11'), 25),
 (Synset('line.n.11'), 72),
 (Synset('line.n.11'), 25),
 (Synset('line.n.11'), 72),
 (Synset('line.n.18'), 57),
 (Synset('line.n.11'), 72),
 (Synset('line.n.18'), 49),
 (Synset('line.n.11'), 113),
 (Synset('line.n.11'), 10),
 (Synset('line.n.20'), 57),
 (Synset('line.n.11'), 42663),
 (Synset('lin

2. Compute the proportion of word occurrences which were disambiguated identically by the three algorithms, separately for `line` and `hard` (#identical outputs / #total occurrences of word). For which of the two words is there higher agreement between the methods? (This requires a mapping between the Senseval sense labels and WordNet synsets.)

In [146]:
# manual mapping between SENSEVAL 2 sense labels and WordNet synset names

senseval_wn_mapping = {
    "HARD1": ["difficult.a.01",
              "arduous.s.01",
              "hard.s.11",
              "hard.r.01",
              "hard.r.09"],    # needing much effort or skill to accomplish, deal with, or understand
    "HARD2": ["hard.a.02",          # dispassionate - not influenced by strong emotion, and so able to be rational and impartial
              "difficult.a.01",
              "hard.s.04",
              "intemperate.s.03",
              "unvoiced.a.01"],
    "HARD3": ["hard.a.03", 
              "hard.r.07", 
              "heavily.r.07"],         # resisting weight or pressure / solid, firm, and rigid; not easily broken, bent, or pierced
    "cord": ["line.n.18", "cable.n.02"],
    "division": ["line.n.29"],
    "phone": ["telephone_line.n.02"],
    "product": ["line.n.22"],
    "text": ["line.n.05", "note.n.02", "agate_line.n.01"]
}

In [150]:
count = 0
for it in zip(res_hard, lesk_res_hard, bp_results_hard):
  if it[2][0] is None:
    continue
  if it[1][1] != it[2][0]: # skip if the two knowledge-based approaches disagree
    continue
  else:
    for syn_map in senseval_wn_mapping[it[0]]:
      if syn_map == it[1][1].name():
        count += 1
        break # one match is enough; stop scanning the mapped synsets

print('Coverage of all three algorithms for word `hard` :', count / len(bp_results_hard))

Coverage of all three algorithms for word `hard` : 0.013888888888888888


In [151]:
count = 0
for it in zip(res_line, lesk_res_line, bp_results_line):
  if it[2][0] is None:
    continue
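  if it[1][1] != it[2][0]: # skip if the two knowledge-based approaches disagree
    continue
  else:
    # 'formation' has no entry in the mapping, so fall back to an empty list
    for syn_map in senseval_wn_mapping.get(it[0], []):
      if syn_map == it[1][1].name():
        count += 1
        break # one match is enough; stop scanning the mapped synsets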

print('Coverage of all three algorithms for word `line` :', count / len(bp_results_line))

Coverage of all three algorithms for word `line` : 0.0
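The two cells above repeat the same loop, so the computation can be sketched as one helper. This is a minimal sketch, assuming the three prediction lists are aligned and equally long; `agreement_rate` and `toy_mapping` are illustrative names, and synsets are passed as name strings (with `None` where no sense was found) so that the example runs without WordNet.

```python
def agreement_rate(nb_labels, lesk_names, bp_names, mapping):
    """Fraction of occurrences on which all three methods agree.

    nb_labels  - Senseval labels predicted by Naive Bayes, e.g. "HARD1"
    lesk_names - synset names from Lesk (None where no sense was found)
    bp_names   - synset names from extended Lesk (None where no sense was found)
    mapping    - dict from a Senseval label to a list of equivalent synset names
    """
    triples = list(zip(nb_labels, lesk_names, bp_names))
    if not triples:
        return 0.0
    agree = sum(
        1
        for nb, lesk_name, bp_name in triples
        if lesk_name is not None
        and lesk_name == bp_name
        and lesk_name in mapping.get(nb, [])
    )
    return agree / len(triples)


# Toy data: four occurrences, two of which are labelled consistently by all methods.
toy_mapping = {"HARD1": ["difficult.a.01", "arduous.s.01"], "HARD3": ["hard.a.03"]}
rate = agreement_rate(
    ["HARD1", "HARD1", "HARD3", "HARD3"],
    ["difficult.a.01", "hard.s.11", "hard.a.03", None],
    ["difficult.a.01", "difficult.a.01", "hard.a.03", None],
    toy_mapping,
)
print(rate)  # 0.5
```

Using `mapping.get(nb, [])` means that a label without a mapping entry counts as a disagreement instead of raising a `KeyError`.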


3. Pick one of the knowledge-based algorithms above and print the instances where it disagreed with the Naive Bayes method (i.e., they returned different predictions): show the context where the word occurred, and the outputs for each of the methods.

In [152]:
for it in zip(lesk_res_hard, res_hard):
  disagreement = True

  for syn_map in senseval_wn_mapping[it[1]]:
    if syn_map == it[0][1].name():
      disagreement = False

  if disagreement:
    print('Sentence:', ' '.join(sentences[it[0][0]]))
    print('Output:')
    print('\t Naive Bayes:', senseval_wn_mapping[it[1]])
    print('\t Lesk Algorithm:', it[0][1].name(), ' - ', it[0][1].definition())
    print()

Sentence: the bad small sweet spot , and the string type it came with is not to my liking at all ( feels hard , and was not helped by manufacturer string tension ) .
Output:
	 Naive Bayes: ['hard.a.03', 'hard.r.07', 'heavily.r.07']
	 Lesk Algorithm: hard.r.10  -  to the full extent possible; all the way

Sentence: i have to pull the laces tighter on the so that my foot does n't move within the shoe on hard lunges .
Output:
	 Naive Bayes: ['hard.a.03', 'hard.r.07', 'heavily.r.07']
	 Lesk Algorithm: hard.s.11  -  unfortunate or hard to bear

Sentence: i have to pull the laces tighter on the so that my foot does n't move within the shoe on hard lunges .
Output:
	 Naive Bayes: ['hard.a.03', 'hard.r.07', 'heavily.r.07']
	 Lesk Algorithm: hard.s.11  -  unfortunate or hard to bear

Sentence: i also got hit hard with a racquet on a follow through and once again i came out of it unscathed and so did the i-mask .
Output:
	 Naive Bayes: ['hard.a.03', 'hard.r.07', 'heavily.r.07']
	 Lesk Algorithm:

Q: For which of the two words is there higher agreement between the methods?
A: Agreement between the three methods is higher for the word 'hard' (about 1.4% of occurrences, versus 0% for 'line'), although it is very low for both words. This may be because the word *'line'* has more senses than the word *'hard'*.

4. Train several NaiveBayes models for 'hard.pos'/'interest.pos', including at least the following: 
- for the wsd_word_features version, vary number between 100, 200 and 300, 
- and vary the stopwords_list between [] (i.e., the null list) and STOPWORDS; 



In [153]:
# NB, with features based on 100 most frequent context words
wsd_classifier(trainer=NaiveBayesClassifier.train, word='hard.pos', features=wsd_word_features, number=100)

Reading data...
 Senses: HARD2 HARD1 HARD3
Training classifier...
Testing classifier...
Accuracy: 0.8431


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7c4594550>

In [154]:
# NB, with features based on 200 most frequent context words
wsd_classifier(trainer=NaiveBayesClassifier.train, word='hard.pos', features=wsd_word_features, number=200)

Reading data...
 Senses: HARD2 HARD1 HARD3
Training classifier...
Testing classifier...
Accuracy: 0.8547


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7acb85bd0>

In [155]:
# NB, with features based on 300 most frequent context words
wsd_classifier(trainer=NaiveBayesClassifier.train, word='hard.pos', features=wsd_word_features, number=300)

Reading data...
 Senses: HARD2 HARD1 HARD3
Training classifier...
Testing classifier...
Accuracy: 0.8547


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7acab8450>

In [156]:
# NB, with features based on 300 most frequent context words and an empty stopword list
wsd_classifier(trainer=NaiveBayesClassifier.train, word='hard.pos', features=wsd_word_features, number=300, stopwords_list=[])

Reading data...
 Senses: HARD2 HARD1 HARD3
Training classifier...
Testing classifier...
Accuracy: 0.8604


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7aca5fc10>

In [171]:
# NB, with features based on 100 most frequent context words
wsd_classifier(trainer=NaiveBayesClassifier.train, word='line.pos', features=wsd_word_features, number=100)

Reading data...
 Senses: formation cord text division product phone
Training classifier...
Testing classifier...
Accuracy: 0.6169


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7a5a04f50>

In [172]:
# NB, with features based on 200 most frequent context words
wsd_classifier(trainer=NaiveBayesClassifier.train, word='line.pos', features=wsd_word_features, number=200)

Reading data...
 Senses: formation cord text division product phone
Training classifier...
Testing classifier...
Accuracy: 0.6663


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7a591c390>

In [174]:
# NB, with features based on 300 most frequent context words
wsd_classifier(trainer=NaiveBayesClassifier.train, word='line.pos', features=wsd_word_features, number=300)

Reading data...
 Senses: formation cord text division product phone
Training classifier...
Testing classifier...
Accuracy: 0.6928


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7a5805990>

In [175]:
# NB, with features based on 300 most frequent context words and an empty stopword list
wsd_classifier(trainer=NaiveBayesClassifier.train, word='line.pos', features=wsd_word_features, number=300, stopwords_list=[])

Reading data...
 Senses: formation cord text division product phone
Training classifier...
Testing classifier...
Accuracy: 0.6843


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7a56e3790>

- for the wsd_context_features version, vary the distance between 1, 2 and 3, 
- and vary the stopwords_list between [] and STOPWORDS.
- try to only keep the POS of the words in the context (remove the word itself from the features, and use the POSs instead)
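The POS-only variant from the last bullet could look roughly like the sketch below. It is an assumption-laden illustration: `wsd_context_pos_features`, its `distance` parameter, the `PAD`/`UNK` placeholders and the `Instance` stand-in are all hypothetical, and the function would still have to be adapted to whatever calling convention `wsd_classifier` uses for its `features` argument.

```python
from collections import namedtuple

# Stand-in for NLTK's SensevalInstance, which exposes the same fields.
Instance = namedtuple("Instance", ["word", "position", "context"])


def wsd_context_pos_features(instance, distance=3):
    """Hypothetical extractor: POS tags of the neighbours within `distance` tokens."""
    features = {}
    for offset in range(-distance, distance + 1):
        if offset == 0:
            continue  # skip the target word itself
        idx = instance.position + offset
        if 0 <= idx < len(instance.context):
            item = instance.context[idx]
            # Contexts hold (word, tag) pairs; keep only the tag and guard
            # against any item that is not such a pair.
            tag = item[1] if isinstance(item, tuple) else "UNK"
        else:
            tag = "PAD"  # position falls outside the sentence
        features["pos(%+d)" % offset] = tag
    return features


example = Instance(
    word="interest-n",
    position=3,
    context=[("further", "JJ"), ("declines", "NNS"), ("in", "IN"),
             ("interest", "NN"), ("rates", "NNS"), (".", ".")],
)
pos_features = wsd_context_pos_features(example, distance=2)
print(pos_features)
# {'pos(-2)': 'NNS', 'pos(-1)': 'IN', 'pos(+1)': 'NNS', 'pos(+2)': '.'}
```

Because the words themselves are dropped, the stopword list has no effect on this feature set; only `distance` remains to vary.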

In [157]:
wsd_classifier(trainer=NaiveBayesClassifier.train, 
               word='hard.pos', 
               features=wsd_context_features,
               distance=1) 

Reading data...
 Senses: HARD2 HARD1 HARD3
Training classifier...
Testing classifier...
Accuracy: 0.9135


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7ac86e3d0>

In [160]:
wsd_classifier(trainer=NaiveBayesClassifier.train, 
               word='hard.pos', 
               features=wsd_context_features,
               stopwords_list=[],
               distance=1)

Reading data...
 Senses: HARD2 HARD1 HARD3
Training classifier...
Testing classifier...
Accuracy: 0.9135


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7ac143310>

In [158]:
wsd_classifier(trainer=NaiveBayesClassifier.train, 
               word='hard.pos', 
               features=wsd_context_features,
               distance=2) 

Reading data...
 Senses: HARD2 HARD1 HARD3
Training classifier...
Testing classifier...
Accuracy: 0.9066


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7ac207e10>

In [162]:
wsd_classifier(trainer=NaiveBayesClassifier.train, 
               word='hard.pos', 
               features=wsd_context_features,
               stopwords_list=[],
               distance=2) 

Reading data...
 Senses: HARD2 HARD1 HARD3
Training classifier...
Testing classifier...
Accuracy: 0.9066


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7ab249450>

In [159]:
wsd_classifier(trainer=NaiveBayesClassifier.train, 
               word='hard.pos', 
               features=wsd_context_features,
               distance=3)

Reading data...
 Senses: HARD2 HARD1 HARD3
Training classifier...
Testing classifier...
Accuracy: 0.8927


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7ab757410>

In [164]:
wsd_classifier(trainer=NaiveBayesClassifier.train, 
               word='hard.pos', 
               features=wsd_context_features,
               stopwords_list=[],
               distance=3)

Reading data...
 Senses: HARD2 HARD1 HARD3
Training classifier...
Testing classifier...
Accuracy: 0.8927


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7aa79c150>

In [170]:
wsd_classifier(trainer=NaiveBayesClassifier.train, 
               word='line.pos', 
               features=wsd_context_features,
               distance=1)

Reading data...
 Senses: formation cord text division product phone
Training classifier...
Testing classifier...
Accuracy: 0.7470


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7a5abdd10>

In [169]:
wsd_classifier(trainer=NaiveBayesClassifier.train, 
               word='line.pos', 
               features=wsd_context_features,
               distance=2)

Reading data...
 Senses: formation cord text division product phone
Training classifier...
Testing classifier...
Accuracy: 0.7470


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7a5e4b710>

In [166]:
wsd_classifier(trainer=NaiveBayesClassifier.train, 
               word='line.pos', 
               features=wsd_context_features,
               distance=3)

Reading data...
 Senses: formation cord text division product phone
Training classifier...
Testing classifier...
Accuracy: 0.7470


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7a71d9110>

In [165]:
wsd_classifier(trainer=NaiveBayesClassifier.train, 
               word='line.pos', 
               features=wsd_context_features,
               stopwords_list=[],
               distance=3)

Reading data...
 Senses: formation cord text division product phone
Training classifier...
Testing classifier...
Accuracy: 0.7470


<nltk.classify.naivebayes.NaiveBayesClassifier at 0x7fa7a8c52410>