## CUNY MSDA DATA 620

### Homework 12: Word sense disambiguation
By Dmitriy Vecheruk  

### Assignment  
*The Senseval 2 Corpus contains data intended to train word-sense disambiguation classifiers. It contains data for four words: hard, interest, line, and serve. Choose one of these four words, and load the corresponding data. Using this dataset, build a classifier that predicts the correct sense tag for a given instance.* 

----
### Solution

From the four words available in the corpus, I have chosen the word **"hard"**. In order to build a word sense classifier, the following steps were taken:  
  
1) Inspect the dataset and clean if necessary, calculate the classifier accuracy baseline   
2) Split the data into training and holdout (test) set  
3) Extract the word context and part of speech features  
4) Build a Naive Bayes classifier and test it on the dev-test set  
5) Test the classifier on the holdout set

### 1. Load, inspect and clean the data 

In [42]:
import nltk
import random
import string
from nltk.corpus import senseval, stopwords
from nltk.classify import accuracy, NaiveBayesClassifier, apply_features
from random import seed,shuffle

# nltk.download() # use NLTK Corpus downloader to get the senseval and stopwords corpora

In [43]:
# Loading the data
instances = senseval.instances('hard.pos')

In [44]:
len(instances)

4333

Overall, there are 4333 instances in the dataset. Each instance represents a sentence with POS tags, the position of the word "hard" in that sentence, and a label for the word sense used:

In [45]:
instances[5]

SensevalInstance(word=u'hard-a', position=33, context=[('producers', 'NNS'), ('of', 'IN'), ('action', 'NN'), ('shows', 'VBZ'), (',', ','), ('like', 'IN'), ('cannell', 'NNP'), (',', ','), ('are', 'VBP'), ('willing', 'JJ'), ('to', 'TO'), ('make', 'VB'), ('them', 'PRP'), ('at', 'IN'), ('a', 'DT'), ('bargain', 'NN'), ('price', 'NN'), ('to', 'TO'), ('help', 'VB'), ('cbs', 'NNP'), ('open', 'JJ'), ('up', 'IN'), ('a', 'DT'), ('new', 'JJ'), ('market', 'NN'), ('for', 'IN'), ('one-hour', 'JJ'), ('action', 'NN'), ('shows', 'VBZ'), (',', ','), ('which', 'WDT'), ('have', 'VBP'), ('become', 'VBN'), ('hard', 'JJ'), ('to', 'TO'), ('sell', 'VB'), ('in', 'IN'), ('the', 'DT'), ('rerun', 'NN'), ('market', 'NN'), ('.', '.')], senses=('HARD1',))

In [46]:
# Get senses per instance
senses = [item.senses for item in instances]

In [47]:
nltk.FreqDist(senses)

FreqDist({('HARD1',): 3455, ('HARD2',): 502, ('HARD3',): 376})

The sense `HARD1` occurs far more often than the other two senses. Every key in the `FreqDist` output is a single-sense tuple, so no instance carries an unclear sense label (multiple senses in a single sentence), which means that all of the input instances can be used.  
The **baseline classification accuracy** is the share of the most frequent class in the data (in this case `HARD1`): 3455/4333 = **79.74%**
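
The baseline arithmetic can be reproduced with a few lines. This is a minimal sketch that hard-codes the counts reported by `FreqDist` above; the helper name `majority_baseline` is only illustrative and is not part of NLTK.

```python
from collections import Counter

# Sense counts copied from the FreqDist output above.
sense_counts = Counter({"HARD1": 3455, "HARD2": 502, "HARD3": 376})

def majority_baseline(counts):
    """Return the most frequent label and its share of all instances."""
    label, n = counts.most_common(1)[0]
    return label, float(n) / sum(counts.values())

label, baseline = majority_baseline(sense_counts)
print("{}: {:.2f}%".format(label, baseline * 100))  # HARD1: 79.74%
```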

### 2. Split the data into training and holdout (test) set  

In [48]:
# Generate training, dev-test, and holdout sets

seed(42) # Initialize the random seed

# Note: the slices below follow the corpus order; no shuffling is applied here
size = int(len(instances) * 0.1)
train_set, holdout_set = instances[size:], instances[:size]

def split_train_test(x, n_test):
    """Randomly splits a list into two lists with n_test records in one, 
    and the remainder in the other one."""
    
    random.shuffle(x)
    
    return x[:n_test],x[n_test:]
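
The cell above slices `instances` in corpus order and does not call `split_train_test`. As a sketch of how a seeded shuffle could be applied before slicing, the hypothetical helper `shuffled_split` below copies the input into a list first (an assumption: the corpus view can be materialised this way) and uses its own random generator. It is shown on toy labels, not on the Senseval data, and it is not the split used for the results in this notebook.

```python
import random

def shuffled_split(items, n_test, seed=42):
    """Return (test, rest) after shuffling a copy of items with a seeded RNG."""
    pool = list(items)                 # copy, so the caller's sequence is untouched
    random.Random(seed).shuffle(pool)  # private generator: reproducible shuffle
    return pool[:n_test], pool[n_test:]

# Toy stand-in for a corpus that is sorted by label.
labels = ["HARD1"] * 80 + ["HARD2"] * 12 + ["HARD3"] * 8
test, rest = shuffled_split(labels, n_test=10)
print("{} test items, {} remaining".format(len(test), len(rest)))
```

Because the shuffle happens before the slice, the test part is no longer guaranteed to come from the first block of the corpus.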

### 3. Extract the word context and part of speech features  
In this part, I set up functions to extract the words and part of speech (POS) tags around the target word from each instance. The most frequent context words per sense are then inspected, and the word and POS context is used to build the classifier features.

In [49]:
# Prepare the set of stopwords
stop_words = list(stopwords.words('english'))

In [50]:
def flatten(l, ltypes=(list, tuple)):
    """Source: http://rightfootin.blogspot.de/2006/09/more-on-python-flatten.html"""
    ltype = type(l)
    l = list(l)
    i = 0
    while i < len(l):
        while isinstance(l[i], ltypes):
            if not l[i]:
                l.pop(i)
                i -= 1
                break
            else:
                l[i:i + 1] = l[i]
        i += 1
    return ltype(l)

def extract_before_after_features(instance,k_before=4,k_after=4,remove_punct=True,stop_words=None, 
                                  min_word_length = None, return_lists = True):
    """Parses a SensevalInstance and returns a dictionary with k words and POS tags 
    before and k words and POS tags after the target word at position i applying the 
    filters: remove punctuation, remove stop words from a provided list, remove all 
    short words whose length does not exceed min_word_length"""
    
    i=instance.position
    puncts = [item for item in string.punctuation] + [item*2 for item in string.punctuation]

    # Extract before and after parts

    sent_words = [item[0] for item in instance.context]
    sent_pos = [item[1] for item in instance.context]
    before = zip(sent_words[:i],sent_pos[:i])
    after = zip(sent_words[i+1:],sent_pos[i+1:])

    # Apply cleaning

    if stop_words is not None:
        before = [(word,pos) for (word,pos) in before if word not in stop_words]
        after = [(word,pos) for (word,pos) in after if word not in stop_words]
    if remove_punct is True:
        before = [(word,pos) for (word,pos) in before if word not in puncts]
        after = [(word,pos) for (word,pos) in after if word not in puncts]
    if min_word_length is not None:
        before = [(word,pos) for (word,pos) in before if len(word) > min_word_length]
        after = [(word,pos) for (word,pos) in after if len(word) > min_word_length]
        
    output = dict(
    words_before = [word for (word,pos) in before[-k_before:]],
    pos_before = [pos for (word,pos) in before[-k_before:]],
    words_after = [word for (word,pos) in after[:k_after]],
    pos_after = [pos for (word,pos) in after[:k_after]]
    )
    
    # Make sure that the start/end of the sentence is tagged as empty
    for k, v in output.iteritems():
        if len(v) == 0:
            output[k] = ['EMPTY']

    if return_lists == True:
        return output
    
    else:
        feature_dict = dict()

        for key in output.keys():
            for idx, item in enumerate(output[key]):
                feature_dict[key+'_'+str(idx)] = item
        
        return feature_dict

In [51]:
instance = train_set[0]
print instance

SensevalInstance(word=u'hard-a', position=19, context=[('david', 'NNP'), ('ogden', 'NNP'), ('stiers', 'NNP'), ('makes', 'VBZ'), ('a', 'DT'), ('valiant', 'JJ'), ('effort', 'NN'), ('to', 'TO'), ('bring', 'VB'), ('the', 'DT'), ('town', 'NN'), ("'s", 'POS'), ('mayor', 'NN'), ('to', 'TO'), ('life', 'NN'), (',', ','), ('but', 'CC'), ('often', 'RB'), ('is', 'VBZ'), ('hard', 'JJ'), ('to', 'TO'), ('decipher', 'VB'), ('because', 'IN'), ('of', 'IN'), ('an', 'DT'), ('aggressive', 'JJ'), ('southern', 'NNP'), ('accent', 'NN'), ('and', 'CC'), ('blustery', 'JJ'), ('tone', 'NN'), ('.', '.')], senses=('HARD1',))


In [52]:
extract_before_after_features(instance,k_before=5,k_after=5,remove_punct=True,
                              stop_words=stop_words,min_word_length=None,return_lists = True)

{'pos_after': ['VB', 'JJ', 'NNP', 'NN', 'JJ'],
 'pos_before': ['NN', 'POS', 'NN', 'NN', 'RB'],
 'words_after': ['decipher', 'aggressive', 'southern', 'accent', 'blustery'],
 'words_before': ['town', "'s", 'mayor', 'life', 'often']}
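
To make the windowing idea easier to follow, here is a stripped-down sketch that works on a plain list of `(word, tag)` pairs. It is a simplification, not the function above: it applies no stop-word, punctuation, or length filtering, and the name `window_features` is hypothetical. It returns positional keys directly, in the same naming style as the `return_lists=False` branch.

```python
def window_features(tagged, position, k=3):
    """Map the k (word, tag) pairs on each side of tagged[position] to positional keys."""
    before = tagged[max(0, position - k):position]
    after = tagged[position + 1:position + 1 + k]
    features = {}
    for side, pairs in (("before", before), ("after", after)):
        if not pairs:
            # Target word at the sentence boundary: mark the missing context
            pairs = [("EMPTY", "EMPTY")]
        for idx, (word, tag) in enumerate(pairs):
            features["words_{}_{}".format(side, idx)] = word
            features["pos_{}_{}".format(side, idx)] = tag
    return features

# Fragment of the example sentence shown earlier; "hard" is at index 3.
sentence = [("shows", "VBZ"), ("have", "VBP"), ("become", "VBN"),
            ("hard", "JJ"), ("to", "TO"), ("sell", "VB")]
feats = window_features(sentence, position=3, k=2)
print(sorted(feats.items()))
```

Note that index 0 on the "before" side is the word farthest from the target inside the window, which mirrors the `before[-k_before:]` slicing used above.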

Now we can inspect the most frequent contexts per sense to understand the differences between them.

In [53]:
senses = ['HARD1','HARD2','HARD3']

for sense in senses:
    sense_instances = [item for item in train_set if item.senses[0] == sense]

    before = []
    after = []
    for instance in sense_instances:
        features = extract_before_after_features(instance,k_before=5,k_after=5,
                                                 remove_punct=True,stop_words=stop_words,min_word_length=2,return_lists = True)
        before.append(features['words_before'])
        after.append(features['words_after'])
    
    before = flatten(before)
    after = flatten(after)
    print 'Sense: '+sense+'\n'
    print 'Most common words before:,\n',nltk.FreqDist(before).most_common(15),'\n'
    print 'Most common words after:,\n',nltk.FreqDist(after).most_common(15),'\n'

Sense: HARD1

Most common words before:,
[('EMPTY', 714), ('would', 119), ('said', 112), ('make', 92), ('even', 65), ('much', 63), ('may', 59), ('going', 59), ('find', 59), ('makes', 47), ('people', 46), ('really', 46), ('one', 44), ('made', 36), ('like', 36)] 

Most common words after:,
[('said', 238), ('time', 232), ('get', 144), ('believe', 130), ('find', 117), ('imagine', 92), ('EMPTY', 84), ('say', 83), ('see', 78), ('people', 68), ('way', 62), ('tell', 59), ('part', 58), ('one', 57), ('come', 57)] 

Sense: HARD2

Most common words before:,
[('take', 48), ('EMPTY', 44), ('said', 24), ('long', 17), ('taking', 15), ('years', 12), ('little', 12), ('get', 11), ('one', 10), ('good', 9), ('people', 9), ("'re", 9), ('took', 8), ('lot', 8), ('also', 8)] 

Most common words after:,
[('work', 156), ('look', 81), ('feelings', 32), ('said', 22), ('line', 17), ('time', 14), ('EMPTY', 11), ('way', 10), ('day', 8), ('people', 8), ('business', 7), ('freedom', 7), ('fast', 7), ('evidence', 7), ('s

We can see from the distributions above that `HARD3` obviously stands for a physical property of withstanding pressure, but the difference between `HARD1` and `HARD2` is more subtle. It seems as if the first sense is more related to "difficulty" (hard time, hard to get, hard to believe), whereas the second is more about "a lack of kindness" (hard look, hard feelings, hard line).

To prepare the features, we construct a feature extractor function that records the words and POS tags found at each of the positions directly before and after the target word in a given instance context. 

### 4. Build a Naive Bayes classifier and test it on the dev-test set
In this part, the word and POS context features are used to train a Naive Bayes classifier.

In [54]:
# Split into training and dev-test (slices in corpus order, no shuffling)
size = int(len(train_set) * 0.2)
training, dev_test = train_set[size:], train_set[:size]
print len(training), len(dev_test)

3120 780


In [55]:
# Check label distributions in the training and dev_test
nltk.FreqDist([item.senses for item in training])

FreqDist({('HARD1',): 2242, ('HARD2',): 502, ('HARD3',): 376})

In [56]:
nltk.FreqDist([item.senses for item in dev_test])

FreqDist({('HARD1',): 780})

As we can see, slicing the data in corpus order returns an extremely biased dev-test set with just one class, because the `HARD1` instances come first in the corpus. To avoid this, we take half of the instances of each of the two less frequent senses, plus an equal number of `HARD1` instances (matched to the `HARD2` count), for the dev-test set and keep the rest for training.

In [57]:
sense1_train_set = [item for item in train_set if item.senses[0]=='HARD1']
sense2_train_set = [item for item in train_set if item.senses[0]=='HARD2']
sense3_train_set = [item for item in train_set if item.senses[0]=='HARD3']

samp_size_3 = int(len(sense3_train_set) * 0.5)
training_3, dev_test_3 = sense3_train_set[samp_size_3:], sense3_train_set[:samp_size_3]

samp_size_2 = int(len(sense2_train_set) * 0.5)
training_2, dev_test_2 = sense2_train_set[samp_size_2:], sense2_train_set[:samp_size_2]

# Note the samp_size_2 used below
training_1, dev_test_1 = sense1_train_set[samp_size_2:], sense1_train_set[:samp_size_2]

training = flatten([training_1,training_2,training_3])
dev_test = flatten([dev_test_1,dev_test_2,dev_test_3])

In [58]:
nltk.FreqDist([item.senses for item in dev_test])

FreqDist({('HARD1',): 251, ('HARD2',): 251, ('HARD3',): 188})

Now we have a more balanced dev_test set, while the training set keeps the remaining instances and is still dominated by `HARD1`.
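
The manual per-sense slicing above balances the dev-test set by hand. A different and more common technique is a stratified split, which keeps the label proportions in both parts instead of equalising them. The sketch below is only an illustration on toy data; `stratified_split` and its parameters are hypothetical names, not NLTK functions, and it was not used for the results reported here.

```python
import random
from collections import Counter, defaultdict

def stratified_split(items, label_of, test_fraction=0.2, seed=42):
    """Split items into (train, test), sampling test_fraction of every label."""
    by_label = defaultdict(list)
    for item in items:
        by_label[label_of(item)].append(item)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):       # fixed order keeps the split reproducible
        group = by_label[label]
        rng.shuffle(group)
        n_test = int(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Toy (id, sense) pairs standing in for SensevalInstance objects.
toy = ([(i, "HARD1") for i in range(80)] +
       [(i, "HARD2") for i in range(80, 92)] +
       [(i, "HARD3") for i in range(92, 100)])
train, test = stratified_split(toy, label_of=lambda pair: pair[1])
test_counts = Counter(label for _, label in test)
print(sorted(test_counts.items()))  # [('HARD1', 16), ('HARD2', 2), ('HARD3', 1)]
```

With the real data, `label_of` could be the `labeler` function defined in the next cell.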

In [59]:
# Set up feature extraction

def featurizer_0(instance):
    return (extract_before_after_features(instance,k_before=3,k_after=3,
                                                     remove_punct=True,stop_words=None,min_word_length=2,
                                                     return_lists = False) )

def labeler(instance):
    return instance.senses[0]

training_ft_0 = [ (featurizer_0(instance), labeler(instance)) for instance in training]

In [60]:
training_ft_0[1]

({'pos_after_0': 'NN',
  'pos_after_1': 'DT',
  'pos_after_2': 'NN',
  'pos_before_0': 'DT',
  'words_after_0': 'thing',
  'words_after_1': 'the',
  'words_after_2': 'world',
  'words_before_0': 'the'},
 'HARD1')

In [61]:
classifier_0 = nltk.NaiveBayesClassifier.train(training_ft_0)

In [62]:
classifier_0.show_most_informative_features()

Most Informative Features
           words_after_0 = 'look'          HARD2 : HARD1  =     68.9 : 1.0
          words_before_0 = 'EMPTY'         HARD1 : HARD3  =     58.8 : 1.0
           words_after_0 = 'work'          HARD2 : HARD3  =     48.9 : 1.0
          words_before_1 = 'rock'          HARD3 : HARD1  =     41.7 : 1.0
           words_after_0 = 'for'           HARD1 : HARD2  =     31.3 : 1.0
             pos_after_0 = 'VB'            HARD1 : HARD3  =     31.0 : 1.0
           words_after_0 = 'cover'         HARD3 : HARD1  =     30.0 : 1.0
             pos_after_0 = 'VBN'           HARD3 : HARD1  =     29.2 : 1.0
          words_before_1 = 'take'          HARD2 : HARD1  =     24.1 : 1.0
          words_before_1 = None            HARD1 : HARD3  =     19.4 : 1.0


The most informative features reflect some of the sense distinctions discussed above: the combination "[HARD] look" points to the second sense (lack of compassion), and "rock [HARD]" is predictive of the third sense (physical property).
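
Each ratio in the table compares how likely a feature value is under one sense versus another. The sketch below shows that idea with add-0.5 smoothing on made-up counts. Both the counts and the helper names are hypothetical, and the smoothing details inside NLTK may differ, so the numbers are not expected to match the table.

```python
def smoothed_prob(count, total, n_values, gamma=0.5):
    """Smoothed estimate of P(feature value | sense)."""
    return (count + gamma) / (total + gamma * n_values)

def likelihood_ratio(counts_a, counts_b, value, n_values):
    """How many times more likely `value` is under sense A than under sense B."""
    p_a = smoothed_prob(counts_a.get(value, 0), sum(counts_a.values()), n_values)
    p_b = smoothed_prob(counts_b.get(value, 0), sum(counts_b.values()), n_values)
    return p_a / p_b

# Hypothetical counts of the first word after "hard" for two senses.
after_hard2 = {"look": 40, "work": 50, "other": 10}
after_hard1 = {"look": 1, "work": 4, "other": 195}

ratio = likelihood_ratio(after_hard2, after_hard1, "look", n_values=3)
print("words_after_0 = 'look'   HARD2 : HARD1 = {:.1f} : 1.0".format(ratio))
```

A large ratio means the feature value is rare under one sense and common under the other, which is why such values are the most useful for telling the senses apart.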

In [63]:
def evaluate_model(dev_test, featurizer, classifier):
    """
    Returns a dict with the classifier predictions, the true labels, the accuracy,
    and the misclassified instances for an NLTK classifier evaluated on dev_test.
    Based on the code from: Natural Language Processing with Python
    by Steven Bird, Ewan Klein, and Edward Loper. 2009, O'Reilly Media
    """
    # Generate test features
    dev_test_ft = [ (featurizer(instance), labeler(instance)) for instance in dev_test]

    # Score the model on the test features
    model_out = [classifier.classify(item[0]) for item in dev_test_ft]

    # Evaluate the accuracy
    accr = nltk.classify.accuracy(classifier, dev_test_ft)
    

    # True labels, used later to build the confusion matrix
    true_label = [item[1] for item in dev_test_ft]
    
    
    # Errors: Compare the true label with the classifier output
    errors = []
    for item in zip(dev_test_ft,model_out):
        if item[0][1] != item[1]:
            errors.append( (item[0][1], item[1], item[0][0]))
    
    return dict(model_out = model_out,true_label=true_label,accr=accr,errors=errors)

def print_eval_results(classifier, eval_output,err_cnt):
    "Prints model evaluation output"
    
    
    print ("Accuracy on the test set: {}% \n".format(eval_output["accr"]*100))
    # Main features 
    classifier.show_most_informative_features(10)
    
    print "\n Confusion Matrix: \n"
    print nltk.ConfusionMatrix(eval_output["true_label"],eval_output["model_out"])
    print "\n Errors: \n", 
    for (tag, guess, name) in eval_output["errors"][:err_cnt]: 
        print 'correct=%-8s guess=%-8s name=%-30s' %(tag, guess, name)

In [64]:
# Evaluate the features and classifier
eval_0 = evaluate_model(dev_test=dev_test, featurizer=featurizer_0,classifier=classifier_0)
print_eval_results(classifier_0,eval_0,1)

Accuracy on the test set: 76.8115942029% 

Most Informative Features
           words_after_0 = 'look'          HARD2 : HARD1  =     68.9 : 1.0
          words_before_0 = 'EMPTY'         HARD1 : HARD3  =     58.8 : 1.0
           words_after_0 = 'work'          HARD2 : HARD3  =     48.9 : 1.0
          words_before_1 = 'rock'          HARD3 : HARD1  =     41.7 : 1.0
           words_after_0 = 'for'           HARD1 : HARD2  =     31.3 : 1.0
             pos_after_0 = 'VB'            HARD1 : HARD3  =     31.0 : 1.0
           words_after_0 = 'cover'         HARD3 : HARD1  =     30.0 : 1.0
             pos_after_0 = 'VBN'           HARD3 : HARD1  =     29.2 : 1.0
          words_before_1 = 'take'          HARD2 : HARD1  =     24.1 : 1.0
          words_before_1 = None            HARD1 : HARD3  =     19.4 : 1.0

 Confusion Matrix: 

      |   H   H   H |
      |   A   A   A |
      |   R   R   R |
      |   D   D   D |
      |   1   2   3 |
------+-------------+
HARD1 |<225> 13  13 |
HARD2

The classifier using words and POS tags performs substantially better than a random assignment to one of the three classes (in the dev-test set the classes are almost balanced, so the baseline accuracy would be below 40%: always predicting the largest class gives 251/690 = 36.4%).

### 5. Test the classifier on the holdout set

In [65]:
# Train NB classifier on all training data

train_ft_final = [ (featurizer_0(instance), labeler(instance)) for instance in train_set]
classifier_final = nltk.NaiveBayesClassifier.train(train_ft_final)

# Evaluate the features and the trained classifier
eval_test = evaluate_model(dev_test=holdout_set, featurizer=featurizer_0,classifier=classifier_final)
print_eval_results(classifier_final,eval_test,0)

Accuracy on the test set: 88.4526558891% 

Most Informative Features
             pos_after_0 = 'VB'            HARD1 : HARD2  =    185.7 : 1.0
           words_after_0 = 'work'          HARD2 : HARD3  =     88.7 : 1.0
           words_after_0 = 'look'          HARD2 : HARD1  =     79.9 : 1.0
           words_after_0 = 'for'           HARD1 : HARD2  =     48.1 : 1.0
          words_before_1 = 'rock'          HARD3 : HARD1  =     46.8 : 1.0
          words_before_2 = 'long'          HARD2 : HARD1  =     44.9 : 1.0
          words_before_1 = 'take'          HARD2 : HARD1  =     35.3 : 1.0
          words_before_0 = 'EMPTY'         HARD1 : HARD3  =     34.3 : 1.0
           words_after_0 = 'cover'         HARD3 : HARD1  =     29.5 : 1.0
             pos_after_0 = 'VBN'           HARD3 : HARD1  =     28.6 : 1.0

 Confusion Matrix: 

      |   H   H   H |
      |   A   A   A |
      |   R   R   R |
      |   D   D   D |
      |   1   2   3 |
------+-------------+
HARD1 |<383> 27  23 |
HARD2

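The classifier achieves a higher accuracy on the holdout set because the holdout set contains only instances of the dominant sense `HARD1`: it is the first 10% of the corpus, which starts with the `HARD1` instances. The 88.45% is therefore the classifier's recall on `HARD1` rather than an estimate of its overall accuracy, and on this holdout set a majority-class guess would score 100%.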
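
The holdout figure can be checked directly from the `HARD1` row of the confusion matrix above. This sketch assumes the remaining rows are empty, which is implied by the row summing to the full holdout size of `int(4333 * 0.1)` = 433 instances.

```python
# HARD1 row of the holdout confusion matrix: predictions for true HARD1 instances.
hard1_row = {"HARD1": 383, "HARD2": 27, "HARD3": 23}

n_holdout = sum(hard1_row.values())
hard1_recall = float(hard1_row["HARD1"]) / n_holdout
print("{} instances, HARD1 recall: {:.2f}%".format(n_holdout, hard1_recall * 100))
# 433 instances, HARD1 recall: 88.45%
```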

### References 
1) [Chapter 6: Learning to Classify Text](http://www.nltk.org/book/ch06.html) from Natural Language Processing with Python, by Steven Bird, Ewan Klein and Edward Loper, Copyright © 2014 the authors   
2) University of Edinburgh, [FNLP 2017: Lab Session 5: Word Sense Disambiguation](https://www.inf.ed.ac.uk/teaching/courses/fnlp/Tutorials/7_WSD/tutorial.html), Henry S. Thompson, based on original by Alex Lascarides, 2017  
3) Stack Overflow: [Making a flat list out of list of lists in Python](https://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python)  
4) Python 2 documentation: [dict.iteritems](https://docs.python.org/2/library/stdtypes.html#dict.iteritems)