## CUNY MSDA DATA 620

### Homework 12: Word sense disambiguation
By Dmitriy Vecheruk  

### Assignment  
*The Senseval 2 Corpus contains data intended to train word-sense disambiguation classifiers. It contains data for four words: hard, interest, line, and serve. Choose one of these four words, and load the corresponding data. Using this dataset, build a classifier that predicts the correct sense tag for a given instance.* 

----
### Solution

From the four words available in the corpus, I have chosen the word **"hard"**. In order to build a word sense classifier, the following steps were taken:  
  
1) Inspect the dataset and clean if necessary, calculate the classifier accuracy baseline   
2) Split the data into training and holdout (test) set  
3) Extract the word context and part of speech features  
4) Build a Naive Bayes classifier and test it using cross-validation  
5) Test the classifier on the holdout set

### 1. Load, inspect and clean the data 

In [426]:
import nltk
import random
import string
from nltk.corpus import senseval, stopwords
from nltk.classify import accuracy, NaiveBayesClassifier, apply_features
from random import seed,shuffle

# nltk.download() # use NLTK Corpus downloader to get the senseval and stopwords corpora

In [29]:
# Loading the data
instances = senseval.instances('hard.pos')

In [33]:
len(instances)

4333

Overall, there are 4333 instances in the dataset. Each instance represents a sentence with POS tags, the position of the word "hard" in that sentence, and a label for the word sense used:

In [36]:
instances[5]

SensevalInstance(word=u'hard-a', position=33, context=[('producers', 'NNS'), ('of', 'IN'), ('action', 'NN'), ('shows', 'VBZ'), (',', ','), ('like', 'IN'), ('cannell', 'NNP'), (',', ','), ('are', 'VBP'), ('willing', 'JJ'), ('to', 'TO'), ('make', 'VB'), ('them', 'PRP'), ('at', 'IN'), ('a', 'DT'), ('bargain', 'NN'), ('price', 'NN'), ('to', 'TO'), ('help', 'VB'), ('cbs', 'NNP'), ('open', 'JJ'), ('up', 'IN'), ('a', 'DT'), ('new', 'JJ'), ('market', 'NN'), ('for', 'IN'), ('one-hour', 'JJ'), ('action', 'NN'), ('shows', 'VBZ'), (',', ','), ('which', 'WDT'), ('have', 'VBP'), ('become', 'VBN'), ('hard', 'JJ'), ('to', 'TO'), ('sell', 'VB'), ('in', 'IN'), ('the', 'DT'), ('rerun', 'NN'), ('market', 'NN'), ('.', '.')], senses=('HARD1',))

In [39]:
# Get senses per instance
senses = [item.senses for item in instances]

In [40]:
nltk.FreqDist(senses)

FreqDist({('HARD1',): 3455, ('HARD2',): 502, ('HARD3',): 376})

We see that the sense 'HARD1' occurs far more often than the other two senses. Also, every key in the `FreqDist` output is a tuple with exactly one sense, so no instance carries an ambiguous label (multiple senses for a single sentence), which means that all of the input instances can be used.  
The **baseline classification accuracy** is the share of the most frequent class in the data (in this case "HARD1"): 3455/4333 = **79.74%**
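
As a quick check, the baseline can be recomputed from the sense counts reported above. The snippet below is a minimal sketch that hard-codes the counts from the `FreqDist` output instead of reloading the corpus; the variable names are illustrative.

```python
# Sense counts copied from the FreqDist output above
sense_counts = {'HARD1': 3455, 'HARD2': 502, 'HARD3': 376}

total = sum(sense_counts.values())
majority_sense = max(sense_counts, key=sense_counts.get)

# A classifier that always predicts the majority sense is right this often
baseline_accuracy = sense_counts[majority_sense] / float(total)

print('{}: {:.2f}%'.format(majority_sense, 100 * baseline_accuracy))
```

Any classifier built later has to beat this figure to be useful.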

### 2. Split the data into training and holdout (test) set  

In [298]:
# Generate the training and holdout sets

seed(42) # Initialize the random seed (used later by random.shuffle in split_train_test)

# Note: the split below slices the instances in corpus order, without shuffling
size = int(len(instances) * 0.1)
train_set, holdout_set = instances[size:], instances[:size]

def split_train_test(x, n_test):
    """Shuffles the list x in place and splits it into two lists with n_test 
    records in one, and the remainder in the other one."""
    
    random.shuffle(x)
    
    return x[:n_test],x[n_test:]
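
The slices above take the first 10% of the corpus as the holdout set without shuffling, so the class mix of each part depends on the order of the corpus file. If the instances happen to be grouped by sense, a shuffled and stratified split keeps the sense proportions similar in both parts. The following is a minimal sketch under that assumption; `stratified_split` and its parameters are hypothetical helpers, not part of NLTK, and the toy data only imitates a corpus sorted by label.

```python
import random
from collections import defaultdict

def stratified_split(items, get_label, test_share=0.1, seed=42):
    """Shuffle the items within each label group and move
    test_share of every group into the test list."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in items:
        by_label[get_label(item)].append(item)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = int(len(group) * test_share)
        test.extend(group[:n_test])
        train.extend(group[n_test:])

    # Mix the label groups so that later slices are not ordered by label
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Toy data ordered by label, imitating a corpus that is sorted by sense
toy = [('x%d' % i, 'HARD1') for i in range(80)] + [('y%d' % i, 'HARD2') for i in range(20)]
train, test = stratified_split(toy, get_label=lambda pair: pair[1])
```

With the Senseval data, the same helper could be called as `stratified_split(list(instances), get_label=lambda inst: inst.senses[0])`.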

### 3. Extract the word context and part of speech features  
In this part, I set up functions to extract words and part of speech (POS) tags around the target word from each instance. Then, the most frequent items from word and POS context per sense will be used as classifier features. 
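
The core idea of a context window can be sketched in a few lines before adding any filtering. The helper below is a simplified illustration, not the function used in this notebook: `context_window` is a hypothetical name, and it takes the window from the raw sentence, whereas the function defined below first filters the words and then takes the window.

```python
def context_window(tagged_sentence, position, k_before=2, k_after=2):
    """Return the (word, POS) pairs directly before and after the target position."""
    before = tagged_sentence[max(0, position - k_before):position]
    after = tagged_sentence[position + 1:position + 1 + k_after]
    return before, after

# Fragment of a tagged sentence; the target word "hard" is at index 2
sentence = [('often', 'RB'), ('is', 'VBZ'), ('hard', 'JJ'), ('to', 'TO'), ('decipher', 'VB')]
before, after = context_window(sentence, position=2)
```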

In [299]:
# Prepare the set of stopwords
stop_words = list(stopwords.words('english'))

In [421]:
def flatten(l, ltypes=(list, tuple)):
    """Source: http://rightfootin.blogspot.de/2006/09/more-on-python-flatten.html"""
    ltype = type(l)
    l = list(l)
    i = 0
    while i < len(l):
        while isinstance(l[i], ltypes):
            if not l[i]:
                l.pop(i)
                i -= 1
                break
            else:
                l[i:i + 1] = l[i]
        i += 1
    return ltype(l)

def extract_before_after_features(instance,i=None,k_before=4,k_after=4,remove_punct=True,stop_words=None, 
                                  min_word_length = None, return_lists = True):
    """Parses a SensevalInstance and returns a dictionary with k words and POS tags 
    before and k words and POS tags after the target word at position i (defaults to 
    instance.position) applying the filters: remove punctuation, remove stop words from 
    a provided list, remove all words with a length of min_word_length or less"""

    # A default of i=instance.position in the signature would be evaluated only once, when
    # the function is defined, against whatever global `instance` exists at that moment
    if i is None:
        i = instance.position

    puncts = [item for item in string.punctuation] + [item*2 for item in string.punctuation]

    # Extract before and after parts

    sent_words = [item[0] for item in instance.context]
    sent_pos = [item[1] for item in instance.context]
    before = zip(sent_words[:i],sent_pos[:i])
    after = zip(sent_words[i+1:],sent_pos[i+1:])

    # Apply cleaning

    if stop_words is not None:
        before = [(word,pos) for (word,pos) in before if word not in stop_words]
        after = [(word,pos) for (word,pos) in after if word not in stop_words]
    if remove_punct is True:
        before = [(word,pos) for (word,pos) in before if word not in puncts]
        after = [(word,pos) for (word,pos) in after if word not in puncts]
    if min_word_length is not None:
        before = [(word,pos) for (word,pos) in before if len(word) > min_word_length]
        after = [(word,pos) for (word,pos) in after if len(word) > min_word_length]
        
    output = dict(
    words_before = [word for (word,pos) in before[-k_before:]],
    pos_before = [pos for (word,pos) in before[-k_before:]],
    words_after = [word for (word,pos) in after[:k_after]],
    pos_after = [pos for (word,pos) in after[:k_after]]
    )
    
#   Make sure that start/end of the sentence are tagged as empty 
    for k, v in output.iteritems():
        if len(v) == 0:
            output[k] = ['EMPTY']

    if return_lists == True:
        return output
    
    else:
        feature_dict = dict()

        for key in output.keys():
            for idx, item in enumerate(output[key]):
                feature_dict[key+'_'+str(idx)] = item
        
        return feature_dict

In [418]:
instance = train_set[0]
print instance

SensevalInstance(word=u'hard-a', position=19, context=[('david', 'NNP'), ('ogden', 'NNP'), ('stiers', 'NNP'), ('makes', 'VBZ'), ('a', 'DT'), ('valiant', 'JJ'), ('effort', 'NN'), ('to', 'TO'), ('bring', 'VB'), ('the', 'DT'), ('town', 'NN'), ("'s", 'POS'), ('mayor', 'NN'), ('to', 'TO'), ('life', 'NN'), (',', ','), ('but', 'CC'), ('often', 'RB'), ('is', 'VBZ'), ('hard', 'JJ'), ('to', 'TO'), ('decipher', 'VB'), ('because', 'IN'), ('of', 'IN'), ('an', 'DT'), ('aggressive', 'JJ'), ('southern', 'NNP'), ('accent', 'NN'), ('and', 'CC'), ('blustery', 'JJ'), ('tone', 'NN'), ('.', '.')], senses=('HARD1',))


In [423]:
extract_before_after_features(instance,i=instance.position,k_before=5,k_after=5,remove_punct=True,
                              stop_words=stop_words,min_word_length=None,return_lists = True)

{'pos_after': ['VB', 'JJ', 'NNP', 'NN', 'JJ'],
 'pos_before': ['NN', 'POS', 'NN', 'NN', 'RB'],
 'words_after': ['decipher', 'aggressive', 'southern', 'accent', 'blustery'],
 'words_before': ['town', "'s", 'mayor', 'life', 'often']}

Now we can inspect the most frequent contexts per sense to understand the differences between them.

In [424]:
senses = ['HARD1','HARD2','HARD3']

for sense in senses:
    sense_instances = [item for item in train_set if item.senses[0] == sense]

    before = []
    after = []
    for instance in sense_instances:
        features = extract_before_after_features(instance,i=instance.position,k_before=5,k_after=5,
                                                 remove_punct=True,stop_words=stop_words,min_word_length=2,return_lists = True)
        before.append(features['words_before'])
        after.append(features['words_after'])
    
    before = flatten(before)
    after = flatten(after)
    print 'Sense: '+sense+'\n'
    print 'Most common words before:,\n',nltk.FreqDist(before).most_common(15),'\n'
    print 'Most common words after:,\n',nltk.FreqDist(after).most_common(15),'\n'

Sense: HARD1

Most common words before:,
[('EMPTY', 714), ('would', 119), ('said', 112), ('make', 92), ('even', 65), ('much', 63), ('may', 59), ('going', 59), ('find', 59), ('makes', 47), ('people', 46), ('really', 46), ('one', 44), ('made', 36), ('like', 36)] 

Most common words after:,
[('said', 238), ('time', 232), ('get', 144), ('believe', 130), ('find', 117), ('imagine', 92), ('EMPTY', 84), ('say', 83), ('see', 78), ('people', 68), ('way', 62), ('tell', 59), ('part', 58), ('one', 57), ('come', 57)] 

Sense: HARD2

Most common words before:,
[('take', 48), ('EMPTY', 44), ('said', 24), ('long', 17), ('taking', 15), ('years', 12), ('little', 12), ('get', 11), ('one', 10), ('good', 9), ('people', 9), ("'re", 9), ('took', 8), ('lot', 8), ('also', 8)] 

Most common words after:,
[('work', 156), ('look', 81), ('feelings', 32), ('said', 22), ('line', 17), ('time', 14), ('EMPTY', 11), ('way', 10), ('day', 8), ('people', 8), ('business', 7), ('freedom', 7), ('fast', 7), ('evidence', 7), ('s

The distributions above suggest that `HARD3` stands for the physical property of withstanding pressure, while the difference between `HARD1` and `HARD2` is more subtle. The first sense relates to "difficulty" (hard time, hard to get, hard to believe), whereas the second is more about "a lack of kindness" (hard look, hard feelings, hard line).

### 4. Build a Naive Bayes classifier and test it using cross-validation  
In this part, the word and POS features around the target word are used to train a Naive Bayes classifier, which is first evaluated on a dev-test split of the training data.
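
To make the mechanics concrete, here is a minimal Naive Bayes sketch in plain Python. It is an illustration under stated assumptions, not NLTK's implementation: `train_nb` and `classify_nb` are hypothetical helpers, the smoothing is simple add-one smoothing (NLTK uses its own probability estimator), and the training examples are toy data.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_featuresets):
    """Count labels and, per (label, feature name), the observed feature values."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)
    values_seen = defaultdict(set)
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
            values_seen[name].add(value)
    return label_counts, value_counts, values_seen

def classify_nb(model, features):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    label_counts, value_counts, values_seen = model
    total = float(sum(label_counts.values()))
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)
        for name, value in features.items():
            if name not in values_seen:
                continue  # skip feature names never seen in training
            counts = value_counts[(label, name)]
            n_bins = len(values_seen[name]) + 1  # one extra bin for unseen values
            score += math.log((counts[value] + 1.0) / (sum(counts.values()) + n_bins))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data in the (featureset, label) format
toy_train = [
    ({'words_after_0': 'believe', 'pos_after_0': 'VB'}, 'HARD1'),
    ({'words_after_0': 'time', 'pos_after_0': 'NN'}, 'HARD1'),
    ({'words_after_0': 'find', 'pos_after_0': 'VB'}, 'HARD1'),
    ({'words_after_0': 'work', 'pos_after_0': 'NN'}, 'HARD2'),
    ({'words_after_0': 'look', 'pos_after_0': 'NN'}, 'HARD2'),
]
model = train_nb(toy_train)
guess_look = classify_nb(model, {'words_after_0': 'look', 'pos_after_0': 'NN'})
guess_verb = classify_nb(model, {'words_after_0': 'imagine', 'pos_after_0': 'VB'})
```

Each feature contributes independently to the score, which is why a single strong cue such as the word after the target can dominate the decision.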

In [500]:
# Split into training and dev-test sets (the first 20% of train_set becomes dev-test)
size = int(len(train_set) * 0.2)
training, dev_test = train_set[size:], train_set[:size]
print len(training), len(dev_test)

3120 780


In [502]:
# Extract features
# The classifier expects a list of labeled featuresets, i.e., a list of tuples ``(featureset, label)``.

def featurizer_1(instance):
    """Nearest remaining word and POS tag on each side of the target word."""
    return extract_before_after_features(instance,i=instance.position,k_before=1,k_after=1,
                                         remove_punct=True,stop_words=None,min_word_length=2,
                                         return_lists = False)

def labeler(instance):
    return instance.senses[0]

training_ft = [ (featurizer_1(instance), labeler(instance)) for instance in training]

In [503]:
training_ft[0]

({'pos_after_0': 'VB',
  'pos_before_0': 'NN',
  'words_after_0': 'describe',
  'words_before_0': 'kind'},
 'HARD1')
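
The step from list-valued features to a flat featureset like the one above can be sketched separately. This is an illustrative helper (`to_feature_dict` is a hypothetical name) that mirrors the `return_lists = False` branch of the extraction function; the example input is hand-made.

```python
def to_feature_dict(list_features):
    """Turn {'words_before': [w0, w1]} into {'words_before_0': w0, 'words_before_1': w1}."""
    feature_dict = {}
    for key, values in list_features.items():
        for idx, value in enumerate(values):
            feature_dict['{}_{}'.format(key, idx)] = value
    return feature_dict

example = {'words_before': ['life', 'often'], 'words_after': ['decipher']}
flat = to_feature_dict(example)
```

One consequence of this naming is worth keeping in mind: index 0 of the "before" lists is the word farthest from the target within the window, so `words_before_0` refers to a different relative position whenever `k_before` changes.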

In [504]:
classifier_0 = nltk.NaiveBayesClassifier.train(training_ft)

In [505]:
classifier_0.show_most_informative_features()

Most Informative Features
             pos_after_0 = 'VB'            HARD1 : HARD2  =    183.3 : 1.0
           words_after_0 = 'look'          HARD2 : HARD1  =     89.5 : 1.0
           words_after_0 = 'work'          HARD2 : HARD3  =     88.1 : 1.0
           words_after_0 = 'for'           HARD1 : HARD2  =     41.6 : 1.0
             pos_after_0 = 'VBN'           HARD3 : HARD1  =     27.4 : 1.0
          words_before_0 = 'EMPTY'         HARD1 : HARD3  =     24.5 : 1.0
           words_after_0 = 'cover'         HARD3 : HARD1  =     23.8 : 1.0
             pos_after_0 = 'IN'            HARD1 : HARD2  =     22.0 : 1.0
          words_before_0 = 'long'          HARD2 : HARD1  =     20.8 : 1.0
           words_after_0 = 'believe'       HARD1 : HARD2  =     19.7 : 1.0


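The strongest signals come from the word directly after "hard". Because the featurizer drops words of two characters or fewer, "to" is removed, so a base-form verb right after the target (`pos_after_0 = 'VB'`, as in "hard to believe") points to `HARD1` with odds of about 183:1 over `HARD2`. The words "look" and "work" after the target point to `HARD2`, while "cover" and a past participle (`VBN`) after the target point to `HARD3`.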

In [506]:
def evaluate_model(dev_test, featurizer, classifier):
    """
    Returns a dictionary with the classifier output, the true labels, the accuracy,
    and the list of misclassified instances for an NLTK classifier.
    Based on the code from: Natural Language Processing with Python
    by Steven Bird, Ewan Klein, and Edward Loper. 2009, O'Reilly Media
    """
    # Generate test features
    dev_test_ft = [ (featurizer(instance), labeler(instance)) for instance in dev_test]

    # Score the model on the test features
    model_out = [classifier.classify(item[0]) for item in dev_test_ft]

    # Evaluate the accuracy
    accr = nltk.classify.accuracy(classifier, dev_test_ft)
    

    # True labels (used later to build the confusion matrix)
    true_label = [item[1] for item in dev_test_ft]
    
    
    # Errors: Compare the true label with the classifier output
    errors = []
    for item in zip(dev_test_ft,model_out):
        if item[0][1] != item[1]:
            errors.append( (item[0][1], item[1], item[0][0]))
    
    return dict(model_out = model_out,true_label=true_label,accr=accr,errors=errors)

def print_eval_results(classifier, eval_output,err_cnt):
    "Prints model evaluation output"

    print "Accuracy on the test set: {}% \n".format(eval_output["accr"]*100)
    # Main features 
    classifier.show_most_informative_features(10)
    
    print "\n Confusion Matrix: \n"
    print nltk.ConfusionMatrix(eval_output["true_label"],eval_output["model_out"])
    print "\n Errors: \n"
    # Each error is a tuple of (true label, predicted label, featureset)
    for (tag, guess, features) in eval_output["errors"][:err_cnt]: 
        print 'correct=%-8s guess=%-8s features=%-30s' %(tag, guess, features)

In [507]:
# Evaluate the features and classifier
eval_0 = evaluate_model(dev_test=dev_test, featurizer=featurizer_1,classifier=classifier_0)
print_eval_results(classifier_0,eval_0,10)

Accuracy on the test set: 92.9487179487% 

Most Informative Features
             pos_after_0 = 'VB'            HARD1 : HARD2  =    183.3 : 1.0
           words_after_0 = 'look'          HARD2 : HARD1  =     89.5 : 1.0
           words_after_0 = 'work'          HARD2 : HARD3  =     88.1 : 1.0
           words_after_0 = 'for'           HARD1 : HARD2  =     41.6 : 1.0
             pos_after_0 = 'VBN'           HARD3 : HARD1  =     27.4 : 1.0
          words_before_0 = 'EMPTY'         HARD1 : HARD3  =     24.5 : 1.0
           words_after_0 = 'cover'         HARD3 : HARD1  =     23.8 : 1.0
             pos_after_0 = 'IN'            HARD1 : HARD2  =     22.0 : 1.0
          words_before_0 = 'long'          HARD2 : HARD1  =     20.8 : 1.0
           words_after_0 = 'believe'       HARD1 : HARD2  =     19.7 : 1.0

 Confusion Matrix: 

      |   H   H   H |
      |   A   A   A |
      |   R   R   R |
      |   D   D   D |
      |   1   2   3 |
------+-------------+
HARD1 |<725> 42  13 |
HARD2
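
The visible part of this output allows a quick consistency check. The sketch below only uses numbers printed above (the `HARD1` row of the confusion matrix, whose rows are the true senses, and the dev-test size of 780); the variable names are illustrative.

```python
# HARD1 row of the confusion matrix printed above (rows are the true senses)
hard1_row = {'HARD1': 725, 'HARD2': 42, 'HARD3': 13}
dev_test_size = 780  # printed by the training / dev-test split cell

hard1_total = sum(hard1_row.values())
hard1_share_correct = hard1_row['HARD1'] / float(hard1_total)

print('HARD1 instances: {} of {}'.format(hard1_total, dev_test_size))
print('Correct among HARD1: {:.4f}%'.format(100 * hard1_share_correct))
```

The `HARD1` row alone sums to 780, the full size of the dev-test set, and 725/780 reproduces the reported accuracy. This suggests that the unshuffled dev-test slice may contain only `HARD1` instances; the rest of the matrix is cut off here, so this is a reading of the visible numbers and not a confirmed result. If it holds, the 92.95% is not directly comparable to the 79.74% baseline.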

In [414]:
# Extract the features of the dev-test set with a wider context window (4 words on each side)
# Note: classifier_0 was trained with k_before=1 and k_after=1, so it only knows the *_0 keys,
# and with k_before=4 the key words_before_0 holds the farthest word rather than the nearest one
dev_test_data = []

for instance in dev_test:

    feature_dict = extract_before_after_features(instance,i=instance.position,k_before=4,k_after=4,
                                                 remove_punct=True,stop_words=stop_words,min_word_length=2,
                                                 return_lists = False)

    dev_test_data.append(feature_dict)

dev_test_data[0]

{'pos_after_0': 'VB',
 'pos_after_1': 'JJ',
 'pos_after_2': 'NNP',
 'pos_after_3': 'NN',
 'pos_before_0': 'NN',
 'pos_before_1': 'NN',
 'pos_before_2': 'NN',
 'pos_before_3': 'RB',
 'words_after_0': 'decipher',
 'words_after_1': 'aggressive',
 'words_after_2': 'southern',
 'words_after_3': 'accent',
 'words_before_0': 'town',
 'words_before_1': 'mayor',
 'words_before_2': 'life',
 'words_before_3': 'often'}

In [415]:
classifier_0.classify(dev_test_data[0])

'HARD1'

In [None]:
def iterate_nb_validation(data, test_size, featurizer,n_iter):
    """Repeats a random train / dev-test split n_iter times and returns 
    the list of dev-test accuracies."""
    
    accuracy_out = []
    
    for item in range(0,n_iter):
    
        # Split a copy of the dataset (split_train_test shuffles its argument in place)
        dev_test, train_set = split_train_test(list(data), test_size) 

        # Generate labeled training features: a list of (featureset, label) tuples
        train_ft = [(featurizer(instance), labeler(instance)) for instance in train_set]
    
        # Apply Naive Bayes classifier
        classifier = nltk.NaiveBayesClassifier.train(train_ft)
        eval_result = evaluate_model(dev_test=dev_test, featurizer=featurizer,classifier=classifier)
        
        accuracy_out.append(eval_result["accr"])
        print ".",
    
    return accuracy_out
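
Repeated random splits, as in `iterate_nb_validation`, can place the same instance in several dev-test sets and leave others out entirely. A k-fold scheme is one alternative in which every instance is tested exactly once. The generator below is a minimal sketch with a hypothetical name (`k_fold_splits`); it assumes the items have already been shuffled.

```python
def k_fold_splits(items, k=5):
    """Yield (train, test) pairs; each item lands in exactly one test fold."""
    items = list(items)
    for fold in range(k):
        test = items[fold::k]
        train = [x for i, x in enumerate(items) if i % k != fold]
        yield train, test

# Ten dummy items split into five folds of two test items each
folds = list(k_fold_splits(range(10), k=5))
```

The accuracies collected per fold can then be averaged in the same way as the list returned by `iterate_nb_validation`.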

In [None]:
# Split a copy of the training data into dev-test (500 instances) and training
dev_test, training = split_train_test(list(train_set), 500) 

# Generate labeled training features: a list of (featureset, label) tuples
train_ft = [(featurizer_1(instance), labeler(instance)) for instance in training]

# Apply Naive Bayes classifier
classifier_0 = nltk.NaiveBayesClassifier.train(train_ft)

# Evaluate the features and classifier
eval_0 = evaluate_model(dev_test=dev_test, featurizer=featurizer_1,classifier=classifier_0)
print_eval_results(classifier_0,eval_0,10)

### Reference 
1) [Chapter 6: Learning to Classify Text](http://www.nltk.org/book/ch06.html) from Natural Language Processing with Python, by Steven Bird, Ewan Klein and Edward Loper, Copyright © 2014 the authors   
2) [FNLP 2017: Lab Session 5: Word Sense Disambiguation](https://www.inf.ed.ac.uk/teaching/courses/fnlp/Tutorials/7_WSD/tutorial.html), University of Edinburgh, by Henry S. Thompson, based on the original by Alex Lascarides, 2017  
3) https://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python  
4) https://docs.python.org/2/library/stdtypes.html#dict.iteritems