# Word Sense Disambiguation Project


- Worked on automatic word sense disambiguation (WSD) using the context around an ambigious word
- Applied the semantic knowledge in WordNet

## Getting started
Downloaded the following required NLTK corpora/lexicons:

In [1]:
import ssl

import pandas as pd

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

import nltk

nltk.download("senseval")
nltk.download("wordnet")
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

from collections import defaultdict

from nltk import pos_tag, word_tokenize
from nltk.corpus import senseval, stopwords
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
from nltk.wsd import lesk
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

STOPWORDS = stopwords.words("english")
OPEN_CLASS_POS = {"n", "v", "j", "r"}

[nltk_data] Downloading package senseval to
[nltk_data]     /Users/varadrajrameshpoojary/nltk_data...
[nltk_data]   Package senseval is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/varadrajrameshpoojary/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/varadrajrameshpoojary/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/varadrajrameshpoojary/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/varadrajrameshpoojary/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


### Setting up the WSD task

We use the Senseval corpus included in NLTK, which has sense-tagged data for a small set of word types. We only look at the ambiguity of the word *line*. Note that this corpus is arranged in a way that is **NOT** typical for NLTK corpora. It is stored in a list of *instances*, where each instance has the sense and the context around it. We iterate over instances of the word *line* using this code: `for instance in senseval.instances('line.pos')`.

We reorganize the information into two Python dictionaries: *train* and *test*.
Each dictionary will contain senses as the keys, while the values are lists of POS-tagged sentences (if an *instance* in the semeval corpus has the given sense, it is included in this list).


In [2]:
train_dict = defaultdict(list)
test_dict = defaultdict(list)
# Your code here
for instance in senseval.instances("line.pos"):
    sense = instance.senses[0]
    context = instance.context
    if len(test_dict[sense]) < 200:
        test_dict[sense].append(context)
    else:
        train_dict[sense].append(context)

In [3]:
assert len(test_dict.keys()) == 6
assert len(train_dict.keys()) == 6
assert "product" in test_dict.keys()
assert "product" in train_dict.keys()
assert len(test_dict["product"]) == 200
assert len(train_dict["product"]) == 2017
assert len(train_dict["product"][0]) == 49
print("Success")

Success


### Creating and testing features for WSD

We will be extracting features from the semeval data stored in Part 1.  We will be extracting several different types of features to eventually present to a classifier to do word-sense disambiguation.

#### Concreteness feature

One typical distinction between senses of a word is that some senses are more concrete (involving the physical world) whereas others are more abstract.  For example, "house" is very concrete - it is a thing that exists in the world, while "happiness" is abstract - there are many different definitions, and you can't point to something and say "that's happiness". A list of words with human-assigned concreteness ratings can be found on the webpage [here](https://raw.githubusercontent.com/ArtsEngine/concreteness/master/Concreteness_ratings_Brysbaert_et_al_BRM.txt); the relevant column is *Conc.M*. Note that they are floating-point numbers (0 means no value was assigned).We extract this information into a Python dict (key is word, value is concreteness) and then write a function which calculates an average concreteness score for all words in a context (that is, given a list of context words, your function calculates the average concreteness of all of them).We lemmatize and lowercase the words in the context before you look them up in the dictionary.  If a word occurs more than once, it should be counted more than once. If a word has no concreteness score (ie, Conc.M == 0) it should be left out of the calculation (both numerator and denominator).

For example, in the sentence "This is a test", we get:

this = 2.14 <br/>
is = 1.59 <br/>
a = 1.46 <br/>
test = 3.93 <br/>

So the concreteness score should be (2.14 + 1.59 + 1.46 + 3.93) / 4 = 2.28


Then use this function to show that the "cord" sense of *line* appears in more concrete contexts, on average, than the "division" sense. We use the function you've built, averaging the result across all the contexts for each of those two senses (using the training data from part 1). 

In [4]:
lemmatizer = WordNetLemmatizer()

In [5]:
df = pd.read_csv(
    "https://raw.githubusercontent.com/ArtsEngine/concreteness/master/Concreteness_ratings_Brysbaert_et_al_BRM.txt",
    delimiter="\t",
    index_col=0,
)
conc_dict = df["Conc.M"].to_dict()


def get_conc_score(context):
    """calculate the average concreteness score for all words in a given context"""
    # Your code here
    total = 0
    count = 0
    for element in context:
        # Skip ill-formatted word-pos pairs
        if type(element) != tuple or len(element) != 2:
            continue
        word, _ = element
        lemma = lemmatizer.lemmatize(word.lower())
        if lemma in conc_dict and conc_dict[lemma] != 0:
            total += conc_dict[lemma]
            count += 1
    score = total / count
    return score
    # Your code here

In [6]:
test_context = pos_tag(word_tokenize("I have a cat"))
assert get_conc_score(test_context) == ((2.18 + 1.46 + 4.86) / 3)

In [7]:
cord_context_conc = 0
div_context_conc = 0

'''
Calculate the average concreteness of all contexts of the sense "cord" and "division".  Show that
"cord" is higher.
'''
#Your code here
def get_sense_avg_conc_score(sense):
    '''
    Calculate the average concreteness of all contexts of the sense in the training data
    '''
    context_list = train_dict[sense]
    total = 0
    for context in context_list:
        total += get_conc_score(context)
    return total/len(context_list)
    
cord_context_conc = get_sense_avg_conc_score('cord')
div_context_conc = get_sense_avg_conc_score('division')

#Your code here

print("The concreteness score for 'cord' is: " + str(round(cord_context_conc, 3)))
print("The concreteness score for 'division' is: " + str(round(div_context_conc, 3)))

The concreteness score for 'cord' is: 2.826
The concreteness score for 'division' is: 2.532


#### Gloss overlap features (Lesk)

In this part you're going to apply the Lesk approach to WSD, looking for word overlap between the gloss of the sense and the context. However, you're not going to be able to use the version included in WordNet, for two reasons:

1. We will be using a restricted set of senses, not all possible senses for *line* included in WordNet
2. Rather than a single feature indicating which sense was chosen, we are going to calculate an overlap score for each possible sense

To apply Lesk, you will first need to associate each sense in the Senseval dataset with a synset in WordNet. I've attached the most-likely synset to each sense in the Senseval dataset.  You are free to look at the definitions of the senses, and see how I arrived at those definitions.


Write a function which takes a sentence context, and calculates the number of tokens that overlap between the context and the gloss of each sense in WordNet (HINT: use set intersection - we are only interested in *type* overlap). Your overlap calculation should exclude English stopwords (see COLX 521 Lecture 2). Your function should return a dictionary where the keys are senses and the values are overlap counts.

Then, show that the average overlap of the "product" gloss from WordNet is higher with "product" contexts than "division" contexts.  Again, use the training data from Part 1.  At this point, you'll have a synset_dictionary (with the glosses from WordNet), and a list of contexts for each sense.  You can use these to calculate the average overlap of each sense in your context dictionary.

For example, if your context dictionary has "This line is busy" and "Hold the line" for the sense 'phone', then you can calculate the overlap of "This line is busy" with the synset gloss of "phone", the same thing for the line "Hold the line", and then average them.

In [8]:
print(train_dict.keys())
line_synsets = wn.synsets("line")
synset_lookup = {}
synset_lookup['cord'] = wn.synset('line.n.18')
synset_lookup['division'] = wn.synset('line.n.29')
synset_lookup['formation'] = wn.synset('line.n.01')
synset_lookup['phone'] = wn.synset('telephone_line.n.02')
synset_lookup['product'] = wn.synset('line.n.22')
synset_lookup['text'] = wn.synset('line.n.05')

dict_keys(['cord', 'division', 'formation', 'phone', 'product', 'text'])


In [9]:
for synset in line_synsets:
    print(synset.name())
    print(synset.definition())
print("SIZE: ", len(line_synsets))

line.n.01
a formation of people or things one beside another
line.n.02
a mark that is long relative to its width
line.n.03
a formation of people or things one behind another
line.n.04
a length (straight or curved) without breadth or thickness; the trace of a moving point
line.n.05
text consisting of a row of words written across a page or computer screen
line.n.06
a single frequency (or very narrow band) of radiation in a spectrum
line.n.07
a fortified position (especially one marking the most forward position of troops)
argumentation.n.02
a course of reasoning aimed at demonstrating a truth or falsehood; the methodical process of logical reasoning
cable.n.02
a conductor for transmitting electrical or optical signals or electric power
course.n.02
a connected series of events or actions or developments
line.n.11
a spatial location defined by a real or imaginary unidimensional extent
wrinkle.n.01
a slight depression in the smoothness of a surface
pipeline.n.02
a pipe used to transport li

In [10]:
def count_overlap(context):
    '''Calculate the number of tokens that overlap between the context 
    and the gloss of each sense in WordNet
    '''
    # Create a dictionary to store overlap
    overlap_dict = defaultdict(int)
    # Create a set to store all possible value in the word list
    token_set = set()
    for element in context:
        # Skip ill-formatted word-pos pairs
        if type(element) != tuple or len(element) != 2:
            continue
        word, _ = element
        token_set.add(word.lower())
    # Delete the English stop word from the set
    token_set_clean = token_set - set(STOPWORDS)

    # Then we need to get the gloss for each sense
    for sense, synset in synset_lookup.items():
        gloss_words_raw = synset.definition().split(" ")
        gloss_words = set()
        for word in gloss_words_raw:
            # Get rid of bracket
            word = word.replace("(", "")
            word = word.replace(")", "")
            gloss_words.add(word.lower())
        overlap_dict[sense] = len(token_set_clean.intersection(set(gloss_words)))
    return overlap_dict


In [11]:
test_sent = "I was holding a flexible line"
test_context = pos_tag(word_tokenize(test_sent))
assert count_overlap(test_context)['cord'] == 1

In [12]:
# 'Product' gloss overlap with product contexts
avg_overlap_dict = {}
avg_overlap_dict['product'] = 0
avg_overlap_dict['division'] = 0

def get_sense_avg_overlap(sense):
    '''
    Calculate the average overlap of all contexts of the sense in the training data
    '''
    context_list = train_dict[sense]
    total = 0
    for context in context_list:
        total += count_overlap(context)[sense]
    return total/len(context_list)

#Your code here
avg_overlap_dict['product'] = get_sense_avg_overlap('product')
#Your code here
avg_overlap_dict['division'] = get_sense_avg_overlap('division')

print(f"The average overlap of the 'product' gloss with the product contexts is: {avg_overlap_dict['product']:0.3f}")
print(f"The average overlap of the 'product' gloss with the division contexts is: {avg_overlap_dict['division']:0.3f}")
print(avg_overlap_dict)

The average overlap of the 'product' gloss with the product contexts is: 0.059
The average overlap of the 'product' gloss with the division contexts is: 0.017
{'product': 0.059494298463063956, 'division': 0.017241379310344827}


#### 2.3 : WordNet distance features
rubric={accuracy:5,efficiency:1,quality:1}

This feature involves calculating the WordNet (Wupalmer) distances from the synsets of relevant senses of *line* to the synsets of mostly non-ambiguous context words. For this, you will need the Senseval -> WordNet sense mapping from 2.2.  If you can't remember how to get the Wupalmer (ie, wup) value, check the lecture slides.

The biggest challenge in this problem is identifying "mostly" non-ambiguous words. We could exclude any word type that has any polysemy (i.e. associated with more than one synset), but that seems too extreme (almost all words have some rare instances of strange sense uses). Instead, we are going to consider a word mostly non-ambiguous if it appears as one particular sense 75% of the time, based on the corpus counts provided in WordNet. You should write a general function, `get_dominant_sense`, which takes a word and a POS (a single letter, same as the input to the WordNet lemmatizer), and returns the dominant (75% of instances) synset if it exists, or `None` if it doesn't. The POS will be useful because, in order to do this properly, you will have to correctly lemmatize the word, so as to match it with the lemmas of each of its synset, so you can get the right count.

So this function should take a word and pos as input, and then: <br/>
1. Lemmatize the word <br/>
2. Get all the senses from the WordNet synsets for the word <br/>
3. Keep track of the counts for each sense that match the lemma <br/>
4. If the highest count is greater than 0.75 * total count, then return that synset.  Otherwise, return None. <br/>

Once you have this function, you should create another function which will, for a particular instance,

1. Use `get_dominant_sense` to get a list of synsets appearing in the context (one for each mostly non-ambiguous word). You will need to do this again in 2.4, so a separate function might be a good idea!  The function will take a context as input, and return a list of synsets.
2. For each sense of *line* in Senseval (ie, the senses in synset_lookup), calculate the average distance between that sense and all the synsets in the context. You should use the built-in function for calculating Wu-Palmer distance between a synset pair, don't implement your own.  
3. Return a dictionary mapping the (Senseval) sense to the average distance to the context synset.  That is, return a dictionary where the keys are the six senses in synset_lookup, and the values are the average distance from the context to that sense.

Then use the output of this function to show that the synsets associated with contexts around "phone" sense of line are on average closer to the "phone" synset than synsets from "division" contexts are.

That is, for each context in your training dictionary with the sense "phone", calculate the average distance between the context and the "phone" sense in synset_lookup.  Then, do the same for each context in your training dictionary with the sense "division".  Show that the "phone" sense is closer for "phone" contexts than "division" contexts ("closer" means the number will be smaller - this is a distance)

In [13]:
dominant_sense_ratio = 0.75

lemmatizer = WordNetLemmatizer()


def get_dominant_sense(word, pos="n"):
    """return the dominant (75% of instances) synset of the word if it exists, or None if it doesn't
    word -- an English word
    pos -- a single letter that represents part of speech of the input word, noun by default"""

    # Your code here
    # The return value
    return_dominant_sense = None
    if pos =='n' or pos =='v' or pos == 'a' or pos == 'r':
        word = lemmatizer.lemmatize(word.lower(), pos)
    else:
        word = lemmatizer.lemmatize(word.lower())
    goal_synsets = wn.synsets(word)

    # Create a dictionary to store counts for each sense that match the lemma
    synset_dict = {}
    for synset in goal_synsets:
        for lemma in synset.lemmas():
            if (lemma.name() == word):
                synset_dict[synset.name()] = lemma.count()

    total_count = sum(synset_dict.values())
    if len(synset_dict) != 0:
        max_item = max(synset_dict.items(), key=lambda x: x[1])
        highest_count = max_item[1]
    if (total_count != 0) and (highest_count / total_count > dominant_sense_ratio):
        return_dominant_sense = wn.synset(max_item[0])

    return return_dominant_sense

In [14]:
assert get_dominant_sense('word', 'n').name() == 'word.n.01'

In [15]:
def convert_into_wordnet_pos_code(tag):
    """return the wordnet pos code, by mapping from the NTLK POS tags"""
    if tag.startswith("NN"):
        return wn.NOUN
    elif tag.startswith("VB"):
        return wn.VERB
    elif tag.startswith("JJ"):
        return wn.ADJ
    elif tag.startswith("RB"):
        return wn.ADV
    else:
        return ""


def get_dominant_sense_context(context):
    """return a list of dominant synsets appearing in the context,
    one for each mostly non-ambiguous word"""

    # Your code here
    valid_context = []
    dominant_synsets = []
    for element in context:
        # Skip ill-formatted word-pos pairs
        if type(element) != tuple or len(element) != 2:
            continue
        word, tag = element
        if word.lower() in set(STOPWORDS):
            continue
        if not tag.isalpha():
            continue
        else:
            valid_context.append(element)

    for valid_word, valid_tag in valid_context:
        wordnet_pos = convert_into_wordnet_pos_code(valid_tag)
        dominant_sense_synset = get_dominant_sense(valid_word, wordnet_pos)
        # print(valid_word, valid_tag, wordnet_pos, dominant_sense_synset)
        if dominant_sense_synset is not None:
            dominant_synsets.append(dominant_sense_synset)

    return dominant_synsets


get_dominant_sense_context(train_dict["product"][0])

[Synset('ideally.r.01'),
 Synset('state.v.01'),
 Synset('buy.v.01'),
 Synset('strong.a.01'),
 Synset('cash.n.01'),
 Synset('compatible.a.01')]

In [16]:
def get_average_distance(context):
    """calculate average distance between senses of the word "line" and a context"""

    # Your code here
    # Create a dictionary to store differences between senses and a context
    distance_diff_dict = defaultdict(int)
    dominant_synsets = get_dominant_sense_context(context)
    if len(dominant_synsets) == 0:
        return {}
    for key, value in synset_lookup.items():
        for synset in dominant_synsets:
            distance_diff_dict[key] += 1 - value.wup_similarity(synset)

    total_iters = len(dominant_synsets)
    return_avg_distance = {k: v / total_iters for k, v in distance_diff_dict.items()}
    return return_avg_distance


get_average_distance(train_dict["product"][0])

{'cord': 0.7984126984126984,
 'division': 0.8261689291101054,
 'formation': 0.7746031746031746,
 'phone': 0.8327020202020202,
 'product': 0.8327020202020202,
 'text': 0.8118686868686869}

In [17]:
print(get_average_distance([('hooks', 'NNP'), ('fly', 'VB'), ('in', 'IN'), ('every', 'DT'), ('direction', 'NN'), ('.', '.'), ('lines', 'NNS'), ('become', 'VBP'), ('tangled', 'JJ'), ('.', '.')]))

{}


In [18]:
avg_phone_distance = 0
count = 0

# Your code here
phone_contexts = train_dict["phone"]
for phone_context in phone_contexts:
    temp_dict = get_average_distance(phone_context)
    if len(temp_dict) != 0:
        avg_phone_distance += temp_dict["phone"]
        count += 1

print(
    "The distance between the synsets associated with contexts around 'phone' sense of line and 'phone' synset: ",
    round(avg_phone_distance / count, 3),
)

The distance between the synsets associated with contexts around 'phone' sense of line and 'phone' synset:  0.761


In [19]:
avg_division_distance = 0
count = 0

# Your code here
phone_contexts = train_dict["division"]
for phone_context in phone_contexts:
    temp_dict = get_average_distance(phone_context)
    if len(temp_dict) != 0:
        avg_division_distance += temp_dict["phone"]
        count += 1

print(
    "The distance between the synsets associated with contexts around 'division' sense of line and 'phone' synset: ",
    round(avg_division_distance / count, 3),
)

The distance between the synsets associated with contexts around 'division' sense of line and 'phone' synset:  0.787


#### WordNet Hypernyms

Now, we will consider the count of WordNet synsets in the context directly as features. However, limiting ourselves to the synsets corresponding directly to words might result in sparsity, and provide little more information than raw words would. Instead, we are going to also include all the hypernyms of words appearing in the context as potential features for doing WSD.

First, we write a recursive function `get_all_hypernyms` which collects the names (e.g. `synset.name()`) of a provided WordNet synset and all of its hypernyms.  The base case can just be when an item no longer has any hypernyms.

Then, applying this function to the synsets found in the context (step 1 of the distance function in 2.3), write a function that counts all the hypernyms of all the (again mostly non-ambiguous) synsets in the context, normalizing by the total count to get a proportion for each synset.

The function will take a context as input.  It will calculate the dominant synsets from this context.  Then, for each of these synsets, it will get all of their hypernyms, and keep track of their counts.
The returned dictionary will have the hypernym names as keys, and the percentage of all the hypernyms found using this method.  For example, if you count all the hypernyms, and have 20, and 5 of them are "animal", then "animal" will have a value of "0.25"

Then we show that the average proportion of the 'object.n.01' synset is higher in contexts involving the "cord" sense of *line* than the "division" sense. (This should be true for the same reason as in 2.1)

In [20]:
def get_all_hypernyms(synset, names=[]):
    '''return a list of the names of a synset and all its hypernyms'''
    #Your code here
    names.append(str(synset.name()))
    synset_hypernyms= synset.hypernyms()
    #Base Case
    if(len(synset_hypernyms)==0):
        return names
    else:
        for hypernym in synset_hypernyms:
            get_all_hypernyms(hypernym, names)
    return names

In [21]:
cat_synset = wn.synset("cat.n.01")
assert get_all_hypernyms(cat_synset, []) == [
    "cat.n.01",
    "feline.n.01",
    "carnivore.n.01",
    "placental.n.01",
    "mammal.n.01",
    "vertebrate.n.01",
    "chordate.n.01",
    "animal.n.01",
    "organism.n.01",
    "living_thing.n.01",
    "whole.n.02",
    "object.n.01",
    "physical_entity.n.01",
    "entity.n.01",
]

In [22]:
def get_all_of_all_hypernyms(context):
    '''get all the hypernyms of all the dominent synsets in a given context,
       return normalized counts'''
    hypernyms = {}
    #Your code here
    total_count =0
    for synset in get_dominant_sense_context(context):
        for hypernym in get_all_hypernyms(synset):
            hypernyms[hypernym]= hypernyms.get(hypernym, 0) + 1
            total_count += 1
    for hypernym, count in hypernyms.items():
        hypernyms[hypernym]= count/total_count
    return hypernyms

In [23]:
get_all_of_all_hypernyms(train_dict['cord'][16])

{'hear.v.01': 0.05952380952380952,
 'perceive.v.01': 0.05952380952380952,
 'rebuff.n.01': 0.047619047619047616,
 'discourtesy.n.03': 0.047619047619047616,
 'behavior.n.01': 0.047619047619047616,
 'activity.n.01': 0.047619047619047616,
 'act.n.02': 0.047619047619047616,
 'event.n.01': 0.047619047619047616,
 'psychological_feature.n.01': 0.047619047619047616,
 'abstraction.n.06': 0.07142857142857142,
 'entity.n.01': 0.11904761904761904,
 'beach.n.01': 0.03571428571428571,
 'geological_formation.n.01': 0.03571428571428571,
 'object.n.01': 0.047619047619047616,
 'physical_entity.n.01': 0.047619047619047616,
 'inch.n.01': 0.023809523809523808,
 'linear_unit.n.01': 0.023809523809523808,
 'unit_of_measurement.n.01': 0.023809523809523808,
 'definite_quantity.n.01': 0.023809523809523808,
 'measure.n.02': 0.023809523809523808,
 'bucket.n.01': 0.011904761904761904,
 'vessel.n.03': 0.011904761904761904,
 'container.n.01': 0.011904761904761904,
 'instrumentality.n.03': 0.011904761904761904,
 'artif

In [24]:
sum(get_all_of_all_hypernyms(train_dict['cord'][16]).values())

1.0

- On average, are the synsets associated with contexts around "phone" sense of `line` closer to the "phone" synset than synsets from "division" contexts?

In [25]:
# average proportion of the "object.n.01" synset in "cord" contexts

total_prop = 0
count = 0
for context in train_dict['cord']:
    hypernyms = get_all_of_all_hypernyms(context)
    if "object.n.01" in get_all_of_all_hypernyms(context).keys():
        total_prop += hypernyms["object.n.01"]
        count += 1

print("avg proportion of the 'object.n.01' synset in 'cord' contexts: ", round(total_prop / count, 3))

avg proportion of the 'object.n.01' synset in 'cord' contexts:  0.055


In [26]:
# average proportion of the "object.n.01" synset in "division" contexts

total_prop = 0
count = 0
for context in train_dict['division']:
    hypernyms = get_all_of_all_hypernyms(context)
    if "object.n.01" in hypernyms.keys():
        total_prop += hypernyms["object.n.01"]
        count += 1

print("avg proportion of the 'object.n.01' synset in 'division' contexts: ", round(total_prop / count, 3))

avg proportion of the 'object.n.01' synset in 'division' contexts:  0.045


### Building a classifier

Now that we have a collection of features which show some promise for the task, we build a classifier for WSD of *line* which uses all the features above. We combine all the individual outputs of the functions into a single feature dictionary for each context sentence. 

In [27]:
def get_feature_dict(context):
    '''Extract a feature dictionary for an input text'''
    feature_dict = {}
    # Your code here
    # Calculate the correctness
    feature_dict['concreteness'] = get_conc_score(context)
    # Calculate the overlap and avg distance
    overlaps = count_overlap(context)
    avg_distances = get_average_distance(context)
    # Add the dict to the feature dict
    for sense in list(synset_lookup.keys()):
        feature_dict['_'.join(['overlap', sense])] = overlaps[sense]
        if len(avg_distances) != 0:
            feature_dict['_'.join(['avg_dist', sense])] = avg_distances[sense]
    # Add Hypernyms
    hypernyms = get_all_of_all_hypernyms(context)
    for hypernym, prop in hypernyms.items():
        feature_dict['_'.join(['avg_prop_hypernym', hypernym])] = prop
    return feature_dict

In [28]:
train_feat_dicts = []
train_classification = []
test_feat_dicts = []
test_classification = []

for key in test_dict.keys():
    for context in range(len(test_dict[key])):
        test_feat_dicts.append(get_feature_dict(test_dict[key][context]))
        test_classification.append(key)

try:
    for key in train_dict.keys():
        for context in range(len(train_dict[key])):
            train_feat_dicts.append(get_feature_dict(train_dict[key][context]))
            train_classification.append(key)
except IndexError:
    print(IndexError)
    print(train_dict[key][context])

In [29]:
def vectorize(train_dict, test_dict):
    '''vectorize given lists of feature dictionaries, return X_train and X_test'''
    vectorizer = DictVectorizer(sparse=False, dtype=float)
    X_train = vectorizer.fit_transform(train_dict)
    X_test = vectorizer.transform(test_dict)
    
    return X_train, X_test

In [30]:
X_train, X_test = vectorize(train_feat_dicts, test_feat_dicts)
y_train, y_test = train_classification, test_classification

In [31]:
# test_feat_dicts
# Comment out this line because the output is too large to push to GitHub

In [32]:
tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=5)

In [33]:
print("Final score: ", tree.score(X_test, y_test))

Final score:  0.16833333333333333


#### Experiment: Taking out `average_distance` features (WordNet)

In [34]:
def remove_average_distance_feature(feat_dicts):
    '''remove average distance feature from data set'''
    new_feat_dicts = []

    #Your code here
    for feat_dict in feat_dicts:
        new_feat_dict = {}
        for feature, value in feat_dict.items():
            if feature[:8] != 'avg_dist':
                new_feat_dict[feature] = value
        new_feat_dicts.append(new_feat_dict)
    
    return new_feat_dicts


In [35]:
new_train_feat_dicts, new_test_feat_dicts = remove_average_distance_feature(train_feat_dicts), remove_average_distance_feature(test_feat_dicts)

In [36]:
# new_test_feat_dicts
# Comment out this line because the output is too large to push to GitHub

In [37]:
X_new_train, X_new_test = vectorize(new_train_feat_dicts, new_test_feat_dicts)

In [38]:
tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_new_train, y_train)

DecisionTreeClassifier(max_depth=5)

In [39]:
print("Score with new training set: ", tree.score(X_new_test, y_test))

Score with new training set:  0.16833333333333333


We tried to take out the average distance feature from the data sets because it was the feature that showed the least significant distinction between different senses of contexts. As a result, we got a slightly higher score compared to the final score with all features. Given that feature extraction takes a significant amount of time, it would be better to take out the feature from the data sets and try to implement other features.