# Homework 2: Word Similarity

Student Name:

Student ID:

Python version used:

## General info

<b>Due date</b>: 1pm, Sunday April 1st

<b>Submission method</b>: see LMS

<b>Submission materials</b>: completed copy of this iPython notebook

<b>Late submissions</b>: -20% per day

<b>Marks</b>: 5% of mark for class

<b>Overview</b>: In this homework, you'll be quantifying the similarity between pairs of words using the structure of WordNet and word co-occurrence in the Brown corpus, using PMI, LSA, and word2vec. You will quantify how well these methods work by comparing to a carefully filtered human annotated gold-standard.

<b>Materials</b>: See the main class LMS page for information on the basic setup required for this class, including an iPython notebook viewer and the Python packages NLTK, Numpy, Scipy, Matplotlib, Scikit-Learn, and Gensim. In particular, if you are not using a lab computer which already has it installed, we recommend installing all the data for NLTK, since you will need various parts of it to complete this assignment. You can also use any Python built-in packages, but do not use any other 3rd party packages; if your iPython notebook doesn't run on the marker's machine, you will lose marks. <b>It is recommended to use Python 2 but we accept Python 3 solutions</b>. Make sure you state which version you used at the beginning of this notebook.

<b>Evaluation</b>: Your iPython notebook should run end-to-end without any errors in a reasonable amount of time, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don't ask for). You should leave the output from running your code in the iPython notebook you submit, to assist with marking. The amount each section is worth is given in parentheses after the instructions. You will be marked not only on the correctness of your methods, but also the quality and efficiency of your code: in particular, you should be careful to use Python built-in functions and operators when appropriate and pick descriptive variable names that adhere to <a href="https://www.python.org/dev/peps/pep-0008/">Python style requirements</a>. If you think it might be unclear what you are doing, you should comment your code to help the marker make sense of it.

<b>Extra credit</b>: Each homework has a task which is optional with respect to getting full marks on the assignment, but that can be used to offset any points lost on this or any other homework assignment (but not the final project or the exam). We recommend you skip over this step on your first pass, and come back if you have time: the amount of effort required to receive full marks (1 point) on an extra credit question will be substantially more than earning the same amount of credit on other parts of the homework.

<b>Updates</b>: Any major changes to the assignment will be announced via LMS. Minor changes and clarifications will be announced in the forum on LMS; we recommend you check the forum regularly.

<b>Academic Misconduct</b>: For most people, collaboration will form a natural part of the undertaking of this homework, and we encourage you to discuss it in general terms with other students. However, this ultimately is still an individual task, and so reuse of code or other instances of clear influence will be considered cheating. We will be checking submissions for originality and will invoke the University’s <a href="http://academichonesty.unimelb.edu.au/policy.html">Academic Misconduct policy</a> where inappropriate levels of collusion or plagiarism are deemed to have taken place.


<b>Instructions</b>: For this homework we will be comparing our methods against a popular dataset of word similarities called Similarity-353. You need to first obtain this data set, which can be downloaded <a href="http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.zip">here</a>. The file we will be using is called *combined.tab*. Except for the header (which should be stripped out), the file is tab-separated, with the first two columns corresponding to two words, and the third column representing a human-annotated similarity between the two words.

Assume the file *combined.tab* is located <b>in the same folder as this notebook</b>. You should load this file into a Python dictionary (NOTE: in Python, tuples of strings, i.e. ("tiger","cat") can serve as the keys of dictionaries). This dataset contains many rare words: we need to filter this dataset in order for it to be better suited to the resources we will use in this assignment. So your first goal is to filter this dataset to generate a smaller test set where you will evaluate your word similarity methods.

The first filtering is based on document frequencies in the Brown corpus, in order to remove rare words. In this assignment, we will be treating the <i>paragraphs</i> of the Brown corpus as our "documents"; you can iterate over them by using the `paras` method of the corpus reader. You should start by creating a Python list where each element of the list is a set containing the word <b>types</b> from a different paragraph of the Brown corpus: the words should be lower-cased and lemmatized before they are added to the set (keep it around, because you will need this list again later on). Then, using the information in this corpus, calculate document frequencies and remove from your test set any word pairs where at least one of the two words has a document frequency of less than 10 in this corpus.
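As a minimal sketch of the document-frequency filter described above, assuming a toy list of paragraph sets that are already lower-cased and lemmatized: the names `document_frequencies`, `toy_paragraphs`, and `toy_pairs`, the made-up scores, and the threshold of 2 are all illustrative and not part of the assignment.

```python
from collections import Counter


def document_frequencies(paragraph_sets):
    """Count, for each word type, how many paragraphs contain it."""
    df = Counter()
    for types in paragraph_sets:
        # each paragraph is a set, so a type is counted at most once per paragraph
        df.update(types)
    return df


# hypothetical toy data standing in for the Brown corpus paragraphs
toy_paragraphs = [{"the", "tiger", "cat"}, {"the", "cat"}, {"the", "dog"}]
df = document_frequencies(toy_paragraphs)

# made-up word pairs and scores; keep a pair only if both words reach the
# threshold (2 in this toy example, 10 in the assignment)
toy_pairs = {("tiger", "cat"): 7.0, ("cat", "the"): 1.0}
kept = {pair: score for pair, score in toy_pairs.items()
        if all(df[word] >= 2 for word in pair)}
```

Building the counts once and looking them up per word avoids rescanning every paragraph for every test word.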

The second filtering is based on words with highly ambiguous senses and involves using the NLTK interface to WordNet. Here, you should remove any words which do not have a *single primary sense*. We define single primary sense here as either having only one sense (i.e. only one synset), or where the count (as provided by the WordNet `count()` method for the lemmas associated with a synset) of the most common sense is at least five and at least five times larger than the next most common sense. Also, you should remove any words where the primary sense is not a noun (this information is also in the synset). Store the synset corresponding to this primary sense in a dictionary for use in the next section. Given this definition, remove any word pairs from the test set where at least one of the words does not contain a single primary sense or if the single primary sense is not a noun.
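The count rule above can be sketched on plain numbers, without touching WordNet. This is only one possible reading of the rule ("at least five times larger" is taken to mean `top >= 5 * runner_up`), and the helper name `has_single_primary_sense` and its parameters are hypothetical. The noun check is a separate step and is not shown.

```python
def has_single_primary_sense(sense_counts, min_count=5, ratio=5):
    """Apply the single-primary-sense rule to one word's per-sense counts.

    sense_counts holds one count per synset of the word (assumed input).
    """
    if not sense_counts:
        return False  # the word is not in WordNet at all
    if len(sense_counts) == 1:
        return True  # only one sense, so it is the primary sense
    ranked = sorted(sense_counts, reverse=True)
    top, runner_up = ranked[0], ranked[1]
    # the most common sense must be frequent enough and dominate the next one
    return top >= min_count and top >= ratio * runner_up
```

For example, counts of `[12, 2, 0]` pass the rule, while `[12, 3]` (not five times larger), `[4, 0]` (top count below five), and `[10, 10]` (a tie for first place) do not.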

When you have applied these two filtering steps, print out all the pairs in your filtered test set (if you have done this correctly, the total should be more than 10, but less than 50).

(1.5 marks)

In [30]:
import re
import nltk
from nltk.corpus import brown
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn

# combined.tab is assumed to sit in the same folder as this notebook
dataset_path = "combined.tab"



def con_dic(dataset_path):
    """construct dictionary as requirement
    para: dataset_path: combined.tab file path
    return: dictionary format combined.tab
    """
    testset = {}
    # the with-block closes the file once it has been read
    with open(dataset_path, "r") as dataset:
        next(dataset)  # skip the header line
        for line in dataset:
            # each row is: word1 <TAB> word2 <TAB> human similarity
            word1, word2, score = line.rstrip('\n').split('\t')
            # key = (word1,word2) word pairs in combined.tab
            testset[(word1, word2)] = float(score)
    return testset

def doc_setup(dictionary):
    """parse brown corpus by paragraphs
    para: dictionary: testset
    return: parsed brown corpus as documents
    """
    ds = []
    paras = brown.paras()
    wordnet_lemmatizer = WordNetLemmatizer()
    for doc in paras:
        d = []
        for sent in doc:
            for word in sent:
                w = wordnet_lemmatizer.lemmatize(word.lower())
                d.append(w)
        ds.append(set(d))  
    return ds

def doc_frequency(w, docs):
    """count document frequency for individual word
    paras: w: word
    paras: docs: a list of documents
    return: document frequency count
    """
    df = 0
    for p in docs:
        if w in p:
            df += 1
    return df

def first_filter(dictionary, remove_degree):
    """delete word pairs from dictionary if
    one of their document frequency is less than
    remove_degree
    para: dictionary: dictionary format of word-pairs
    para: remove_degree: default to be 10
    return: filtered dictionary, a list of documents
    """
    docs = doc_setup(dictionary)
    for (k1,k2) in list(dictionary):
        if doc_frequency(k1,docs) < remove_degree or \
        doc_frequency(k2,docs) < remove_degree:
            del dictionary[(k1,k2)]            
    print("testset length after 1st filter:")
    print(len(dictionary))
    return dictionary, docs
        

def second_max(a):
    """find second max in a list
    para: a: list of number
    return: second max number
    """
    hi = mid = 0
    for x in a:
        if x > hi:
            mid = hi
            hi = x
        elif x < hi and x > mid:
            mid = x
    return mid

def primary_filter(word, filter_degree):
    """check primary synset
    paras: word that need to be checked
    paras: filter_degree: lemma times bigger
    return None or suitable primary synset
    """
    synsets = wn.synsets(word)
    # only one synset
    if len(synsets) == 1:
        if synsets[0].pos() != 'n':
            return None
        else:
            return synsets[0].name()
    #more than one synset
    elif len(synsets) > 1:
        return lemma_filter(word, filter_degree)


def lemma_filter(word, filter_degree):
    """check primary synset based on lemma count
    paras: word (must have at least two synsets)
    paras: filter_degree: lemma count times
    return None or suitable primary synset
    """
    synsets = wn.synsets(word)
    # one count per synset: how often `word` was tagged with that sense
    counts = []
    for s in synsets:
        c = 0
        for lemma in s.lemmas():
            if lemma.name() == word:
                c = lemma.count()
        counts.append(c)
    # rank the senses by count; every count is kept (even tied ones), so a
    # tie for first place is not mistaken for a single primary sense
    ranked = sorted(zip(counts, synsets), key=lambda pair: pair[0], reverse=True)
    (top_count, top_synset), (next_count, _) = ranked[0], ranked[1]

    # the most common sense must be a noun, occur at least filter_degree
    # times, and be at least filter_degree times as frequent as the runner-up
    if top_synset.pos() == 'n' and top_count >= filter_degree \
            and top_count >= filter_degree * next_count:
        return top_synset.name()
    return None


        

def second_filter(dictionary, filter_degree, add_synset):
    """filter out not noun words without suitable primary synset
    para: dictionary: dictionary format of word-pairs
    para: filter_degree: control lemma count default 5
    para: add_synset: create (wordpair):synset dictionary or not
    return: filtered dictionary    
    """
    for (k1,k2) in list(dictionary):
        f1 = primary_filter(k1,filter_degree)
        f2 = primary_filter(k2,filter_degree)
        if f1 and f2:
            if add_synset:
                dictionary[(k1,k2)] = [f1,f2]
        else:
            del dictionary[(k1,k2)]
    if add_synset is True:
        print("second processed dataset length:")
        print(len(dictionary))
        print(dictionary)
    return dictionary



testset_dic = con_dic(dataset_path)
testset_dic, docs = first_filter(testset_dic, 10)

#primary_filter("eat")
testset_dic = second_filter(testset_dic, 5, True)

testset length after 1st filter:
222
second processed dataset length:
29
{('professor', 'doctor'): ['professor.n.01', 'doctor.n.01'], ('stock', 'egg'): ['stock.n.01', 'egg.n.01'], ('baby', 'mother'): ['baby.n.01', 'mother.n.01'], ('car', 'automobile'): ['car.n.01', 'car.n.01'], ('journey', 'voyage'): ['journey.n.01', 'ocean_trip.n.01'], ('coast', 'shore'): ['seashore.n.01', 'shore.n.01'], ('brother', 'monk'): ['brother.n.01', 'monk.n.01'], ('journey', 'car'): ['journey.n.01', 'car.n.01'], ('coast', 'hill'): ['seashore.n.01', 'hill.n.01'], ('monk', 'slave'): ['monk.n.01', 'slave.n.01'], ('coast', 'forest'): ['seashore.n.01', 'forest.n.01'], ('psychology', 'doctor'): ['psychology.n.01', 'doctor.n.01'], ('psychology', 'mind'): ['psychology.n.01', 'mind.n.01'], ('psychology', 'health'): ['psychology.n.01', 'health.n.01'], ('psychology', 'science'): ['psychology.n.01', 'science.n.01'], ('secretary', 'senate'): ['secretary.n.02', 'senate.n.01'], ('computer', 'laboratory'): ['computer.n.01', 

<b>Instructions</b>: Now you will create several dictionaries with similarity scores for pairs of words in your test set derived using the techniques discussed in class. The first of these holds the Wu-Palmer scores derived from the hypernym relationships in WordNet, which you should calculate using the primary sense for each word derived above. You can use the built-in method included in the NLTK interface; you don't have to implement your own. When you're done, print out the Python dictionary of word pair/similarity mappings.
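For reference, the Wu-Palmer score is usually written as follows, where $s_1$ and $s_2$ are the two synsets, $\mathrm{LCS}(s_1, s_2)$ is their lowest common subsumer (the deepest hypernym they share), and $\mathrm{depth}$ is the depth of a synset in the hypernym hierarchy. How NLTK counts depth in detail is not covered here, so treat this as the general form and not as a description of the library internals.

$$
\mathrm{sim}_{\mathrm{WP}}(s_1, s_2) = \frac{2 \cdot \mathrm{depth}\big(\mathrm{LCS}(s_1, s_2)\big)}{\mathrm{depth}(s_1) + \mathrm{depth}(s_2)}
$$

When both words resolve to the same synset, the LCS is that synset itself, so the numerator equals the denominator and the score is 1.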

(0.5 marks)

In [31]:
def WP_scores(dictionary):
    """compare similarity of word pairs based on Wu-Palmer similarity
    paras: dictionary: dictionary format of word-pairs
    return: a dictionary (wordpair):similarity score
    """
    similarity_dict = {}    
    for (k1,k2) in list(dictionary):
        synsets = dictionary.get((k1,k2))
        
        w1 = wn.synset(synsets[0])
        w2 = wn.synset(synsets[1])
        
        score=w1.wup_similarity(w2)
        #print(score)
        
        similarity_dict[(k1,k2)] = score
    print(similarity_dict)
    return similarity_dict
    
        
    
WP_sim = WP_scores(testset_dic)
print(len(testset_dic))
        
        
    
    

{('professor', 'doctor'): 0.5, ('stock', 'egg'): 0.11764705882352941, ('baby', 'mother'): 0.5, ('car', 'automobile'): 1.0, ('journey', 'voyage'): 0.8571428571428571, ('coast', 'shore'): 0.9090909090909091, ('brother', 'monk'): 0.5714285714285714, ('journey', 'car'): 0.09523809523809523, ('coast', 'hill'): 0.6666666666666666, ('monk', 'slave'): 0.6666666666666666, ('coast', 'forest'): 0.16666666666666666, ('psychology', 'doctor'): 0.1111111111111111, ('psychology', 'mind'): 0.5714285714285714, ('psychology', 'health'): 0.21052631578947367, ('psychology', 'science'): 0.9411764705882353, ('secretary', 'senate'): 0.13333333333333333, ('computer', 'laboratory'): 0.35294117647058826, ('canyon', 'landscape'): 0.3333333333333333, ('century', 'year'): 0.8333333333333334, ('doctor', 'personnel'): 0.13333333333333333, ('school', 'center'): 0.13333333333333333, ('word', 'similarity'): 0.3333333333333333, ('hotel', 'reservation'): 0.375, ('type', 'kind'): 0.9473684210526315, ('equipment', 'maker'):

**Instructions:** Next, you will calculate Positive PMI (PPMI) for your word pairs using statistics derived from the Brown corpus: you should use the same setup as you did to calculate document frequency above: paragraphs as documents, lemmatized, lower-cased, and with term frequency information removed by conversion to Python sets. You need to use the basic method for calculating PPMI introduced in class (and also in the reading) which is appropriate for any possible definition of co-occurrence (here, appearing in the same paragraph), but you should only calculate PPMI for the words in your test set. You must avoid building the entire co-occurrence matrix; instead you should keep track of the sums you need for the probabilities as you go along. When you have calculated PPMI for all the pairs, your code should print out the Python dictionary of word-pair/PPMI-similarity mappings.
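A minimal sketch of PPMI over paragraph sets follows. It assumes probabilities are estimated as fractions of paragraphs, which is one reasonable choice of denominator and not necessarily the one used in class. The function name `ppmi` and the toy data are hypothetical.

```python
import math


def ppmi(word1, word2, paragraph_sets):
    """Positive PMI of two words, with co-occurrence = same paragraph."""
    total = float(len(paragraph_sets))  # assumption: one event per paragraph
    count1 = sum(1 for p in paragraph_sets if word1 in p)
    count2 = sum(1 for p in paragraph_sets if word2 in p)
    both = sum(1 for p in paragraph_sets if word1 in p and word2 in p)
    if both == 0:
        return 0.0  # PMI would be log(0); PPMI treats this as 0
    pmi = math.log((both / total) / ((count1 / total) * (count2 / total)), 2)
    return max(0.0, pmi)  # clip negative PMI to 0


# hypothetical toy paragraphs
toy_docs = [
    {"coast", "shore"},
    {"coast", "shore"},
    {"coast", "car"},
    {"car"},
    {"car", "hill"},
]
coast_shore = ppmi("coast", "shore", toy_docs)  # co-occur more than chance
coast_car = ppmi("coast", "car", toy_docs)      # co-occur less than chance
coast_hill = ppmi("coast", "hill", toy_docs)    # never co-occur
```

Only three counts per pair and one total are needed, so the full co-occurrence matrix is never built.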

(1 mark)

In [32]:
import math
def PPMI(word1, word2, documents):
    """window size will be one paragraph in brown corpus
    compare similarity of a word pair based on PPMI
    paras: word1, word2: word pair to be compared
    paras: documents: a list of documents (sets of word types)
    return: PPMI value
    """
    w1_count = 0
    w2_count = 0
    total_count = 0
    both_count = 0
    for doc in documents:
        # each document is a set, so membership tests replace the inner loops
        # total_count sums the number of word types over all paragraphs
        total_count += len(doc)
        if word1 in doc:
            w1_count += 1
            if word2 in doc:
                both_count += 1
        if word2 in doc:
            w2_count += 1

    # no co-occurrence at all: PMI is undefined (log of 0), PPMI is 0
    if both_count == 0:
        return 0.0
    total_count = float(total_count)  # true division under Python 2 as well
    PMI = math.log((both_count/total_count)/((w1_count/total_count)*(w2_count/total_count)), 2)
    # positive PMI: negative values are clipped to 0
    return max(0.0, PMI)
    
def dic_PPMI(dictionary, docs):
    """compare similarity of word pairs based on PPMI
    paras: dictionary: dictionary format of word-pairs
    paras: docs: a list of documents
    return: a dictionary (wordpair):PPMI 
    """
    pmi_dic = {}
    for (k1,k2) in list(dictionary):
        pmi = PPMI(k1,k2,docs)
        pmi_dic[(k1,k2)] = pmi
    print(pmi_dic)
    return pmi_dic
            
PPMI_sim = dic_PPMI(testset_dic, docs)
    

{('professor', 'doctor'): 0.0, ('stock', 'egg'): 7.432658852069735, ('baby', 'mother'): 8.722036679055751, ('car', 'automobile'): 8.900113284110693, ('journey', 'voyage'): 0.0, ('coast', 'shore'): 10.245932998315858, ('brother', 'monk'): 8.514452943233382, ('journey', 'car'): 0.0, ('coast', 'hill'): 6.828245920623065, ('monk', 'slave'): 0.0, ('coast', 'forest'): 8.665692907837105, ('psychology', 'doctor'): 9.177761495674279, ('psychology', 'mind'): 8.394859617341213, ('psychology', 'health'): 0.0, ('psychology', 'science'): 10.693682351965784, ('secretary', 'senate'): 9.611782098320695, ('computer', 'laboratory'): 0.0, ('canyon', 'landscape'): 0.0, ('century', 'year'): 6.470397157835755, ('doctor', 'personnel'): 7.833807094456917, ('school', 'center'): 6.359230800285396, ('word', 'similarity'): 0.0, ('hotel', 'reservation'): 8.506232436428414, ('type', 'kind'): 6.265260462553219, ('equipment', 'maker'): 9.898498628048598, ('luxury', 'car'): 7.887513247331061, ('soap', 'opera'): 9.83638

**Instructions:** Next, you will derive similarity scores using the LSA method, i.e. apply SVD and truncate to get a dense vector and then use cosine similarity between the two vectors for each word pair. You can use the Distributed Semantics notebook as a starting point, but note that since you are interested here in word semantics, you will be constructing a matrix where the (non-sparse) rows correspond to words in the vocabulary, and the (sparse) columns correspond to the texts where they appear (this is the opposite of the notebook). Again, use the Brown corpus, in the same format as with PMI and document frequency. After you have a matrix in the correct format, use `TruncatedSVD` in scikit-learn to produce dense vectors of length 500, and then use cosine similarity to produce similarities for your word pairs. Print out the corresponding Python dictionary.
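To make the matrix orientation and the cosine step concrete, here is a small dependency-free sketch. It deliberately skips the SVD: in the assignment the cosine is taken between the 500-dimensional rows that come out of `TruncatedSVD`, whereas this toy compares the raw word-by-document rows directly. The function name `cosine_similarity` and the toy rows are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# hypothetical word-by-document rows: one row per word, one column per paragraph
word_rows = {
    "coast": [1, 1, 1, 0],
    "shore": [1, 1, 0, 0],
    "car":   [0, 0, 0, 1],
}
coast_shore_cos = cosine_similarity(word_rows["coast"], word_rows["shore"])
coast_car_cos = cosine_similarity(word_rows["coast"], word_rows["car"])
```

Words that appear in overlapping paragraphs get a cosine near 1, and words that never share a paragraph get 0 in the raw space. After truncation the values can also be slightly negative.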

(1 mark)

In [33]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from scipy.spatial.distance import cosine as cos_distance




#code below is refered from WSTA_N4_distributional_semantics

def get_BOW(text):
    """bag-of-words feature selection
    para: text: one piece of document
    return: dictionary format of bag-of-words
    """
    BOW = {}
    for word in text:
        BOW[word.lower()] = BOW.get(word.lower(),0) + 1
    return BOW

#end of reference



def cons_feature_matrix(docs):
    """create truncatedSVD matrix for word-document feature
    para: docs: a list of documents
    return: created feature_matrix, dictionary_vectorizer
    """
    texts = [get_BOW(doc) for doc in docs]

    vectorizer = DictVectorizer()
    svd = TruncatedSVD(n_components=500)

    # the vectorizer gives a document-word matrix; transpose it so that
    # rows are words and columns are documents before truncating
    feature_matrix = svd.fit_transform(vectorizer.fit_transform(texts).T)
    return feature_matrix,vectorizer


def lsa_cos_sim(testset,docs,feature_matrix, vectorizer):
    """compare pairs of word similarity based on cosine distance
       and TruncatedSVD
    paras: testset: dictionary format of word pairs
    paras: a list of documents
    paras: feature_matrix: truncated word-doc matrix
    paras: vectorizer: a BOW vectorizer for getting word index
    return: dictionary format as (wordpairs):cosine similarity
    """
    # DictVectorizer.vocabulary_ maps each word to its row index
    look_up = vectorizer.vocabulary_
    cos_dic = {}

    for (k1,k2) in list(testset):
        v1 = look_up.get(k1)
        v2 = look_up.get(k2)
        cos = cos_distance(feature_matrix[v1],feature_matrix[v2])
        # cosine_similarity = 1-cosine_distance
        cos_dic[(k1,k2)]=1-cos
    print(cos_dic)
    return cos_dic


feature_matrix,vectorizer = cons_feature_matrix(docs)
# the function has its own name, so storing the result does not overwrite it
cos_sim = lsa_cos_sim(testset_dic,docs,feature_matrix, vectorizer)


    

{('professor', 'doctor'): 0.07962727332873365, ('stock', 'egg'): 0.1410014992605253, ('baby', 'mother'): 0.32308500443762656, ('car', 'automobile'): 0.36675608829519946, ('journey', 'voyage'): 0.12580095162843796, ('coast', 'shore'): 0.3838696011131211, ('brother', 'monk'): 0.06057832286592235, ('journey', 'car'): 0.013095584487878176, ('coast', 'hill'): 0.19187191501604195, ('monk', 'slave'): -0.04944490856392281, ('coast', 'forest'): 0.1020773221215131, ('psychology', 'doctor'): 0.11453583496975728, ('psychology', 'mind'): 0.11621736949500316, ('psychology', 'health'): 0.023813308889068074, ('psychology', 'science'): 0.28523511157689485, ('secretary', 'senate'): 0.39805789096150535, ('computer', 'laboratory'): 0.13323797796412695, ('canyon', 'landscape'): 0.11497281137165394, ('century', 'year'): 0.06781879547079739, ('doctor', 'personnel'): 0.023514115940761693, ('school', 'center'): 0.046829530650918216, ('word', 'similarity'): -0.003605270742967459, ('hotel', 'reservation'): 0.074

**Instructions:** Next, you will derive a similarity score from word2vec vectors, using the Gensim interface. Check the Gensim word2vec tutorial for details on the API: https://radimrehurek.com/gensim/models/word2vec.html. Again, you should use the Brown corpus for this, but for word2vec you don't need to worry about paragraphs: feel free to train your model at the sentence level instead. Your vectors should have the same number of dimensions as LSA (500), and you need to run for 50 iterations. This may take a while (several minutes), but that's okay; you won't be marked based on the speed of this. You should extract the similarities you need directly from the Gensim model, put them in a Python dictionary, and print them out.

(0.5 mark)

In [34]:
from gensim.models import Word2Vec

def train_w2v():
    """train word2vec model at sentence level
    return:trained model
    """
    sentences = brown.sents()
    # size= and iter= are the gensim 3.x names; gensim 4 renamed them
    # to vector_size= and epochs=
    model = Word2Vec(sentences, size=500, iter=50)
    return model

def w2v_sim_w(word1, word2, model):
    """compare similarity of two words
    para: word1,word2
    para: model
    return: similarity score
    """
    similarity = model.wv.similarity(word1, word2)
    #print(similarity)
    return similarity
    
    
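def w2v_sim(dataset, model):
    """compare similarity of word pairs based on word2vec
    para: dataset: dictionary format of word-pairs
    para: model: trained word2vec model
    return: a dictionary (wordpair):similarity score
    """
    sim_dic = {}
    for (k1,k2) in list(dataset):
        sim_dic[(k1,k2)] = w2v_sim_w(k1,k2,model)
    print(sim_dic)
    return sim_dic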
        
    
model = train_w2v()    
W2V = w2v_sim(testset_dic, model)


{('professor', 'doctor'): 0.10521473964870218, ('stock', 'egg'): 0.1485685681150803, ('baby', 'mother'): 0.23632185915564013, ('car', 'automobile'): 0.17277914140820544, ('journey', 'voyage'): 0.47617062637383206, ('coast', 'shore'): 0.4051329414318604, ('brother', 'monk'): 0.04042875805562679, ('journey', 'car'): 0.19801697248395853, ('coast', 'hill'): 0.44527485444604537, ('monk', 'slave'): 0.013772762772213785, ('coast', 'forest'): 0.2879094782531423, ('psychology', 'doctor'): -0.03753880586464503, ('psychology', 'mind'): 0.046343553734534054, ('psychology', 'health'): 0.16924622535064146, ('psychology', 'science'): 0.30491753025731005, ('secretary', 'senate'): 0.09820516122337106, ('computer', 'laboratory'): 0.18315515677738242, ('canyon', 'landscape'): 0.16869738104739962, ('century', 'year'): 0.30330282933877273, ('doctor', 'personnel'): -0.05721843137511085, ('school', 'center'): -0.04422633297388947, ('word', 'similarity'): 0.0383974812192422, ('hotel', 'reservation'): 0.053186


**Instructions:** Finally, you should compare all the similarities you've created to the gold standard you loaded and filtered in the first step. For this, you can use the Pearson correlation coefficient (`pearsonr`), which is included in scipy (`scipy.stats`). Be careful converting your dictionaries to lists for this purpose: the data for the two datasets needs to be in the same order for correct comparison using correlation. Write a general function, then apply it to each of the similarity score dictionaries, and print out the result for each (be sure to label them!). Hint: All of the methods used here should be markedly above 0, but also far from 1 (perfect correlation); if you're not getting reasonable results, go back and check your code for bugs!
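The ordering issue can be sketched as follows: both lists are built from one shared, sorted list of keys, so position `i` always refers to the same word pair. The hand-written `pearson` below is only there to keep the sketch free of dependencies. In the assignment `scipy.stats.pearsonr` does this job and also returns a p-value. All names and toy values are illustrative.

```python
import math


def aligned_scores(gold, predicted):
    """Return two lists of scores that follow the same word-pair order."""
    pairs = sorted(gold)  # one fixed key order drives both lists
    return [gold[p] for p in pairs], [predicted[p] for p in pairs]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# made-up scores; the two dictionaries deliberately use different key orders
toy_gold = {("a", "b"): 1.0, ("c", "d"): 2.0, ("e", "f"): 3.0}
toy_pred = {("e", "f"): 7.0, ("a", "b"): 3.0, ("c", "d"): 5.0}

gold_list, pred_list = aligned_scores(toy_gold, toy_pred)
r = pearson(gold_list, pred_list)
```

Taking `list(d.values())` from each dictionary independently would pair up scores for different word pairs whenever the insertion orders differ, and the correlation would be meaningless.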

(0.5 mark)


In [35]:
from scipy import stats

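# testset_dic now maps pairs to synsets, so reload the gold-standard scores
# and re-apply both filters (add_synset=False keeps the human scores as values)
ori_testset = con_dic(dataset_path)
ori_testset, docs = first_filter(ori_testset, 10)
testset_tem = second_filter(ori_testset,5,False)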



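def dic2list(dic1, dic2):
    """convert two dictionaries into lists that share the same key order
    para: dic1: gold-standard dictionary, its keys drive the order
    para: dic2: similarity dictionary with the same keys
    return: list of dic1 values, list of dic2 values
    """
    l1 = []
    l2 = []
    for (k1,k2) in list(dic1):
        l1.append(dic1.get((k1,k2)))
        l2.append(dic2.get((k1,k2)))
    return l1,l2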



olist, WPlist = dic2list(ori_testset, WP_sim)
#print(olist)
#print(WPlist)
olist,PPMIlist = dic2list(ori_testset,PPMI_sim)
olist,COSlist = dic2list(ori_testset,cos_sim)
olist,W2Vlist = dic2list(ori_testset, W2V)
print("wu-Palmer:")
print(stats.pearsonr(olist, WPlist))
print("PPMI:")
print(stats.pearsonr(olist, PPMIlist))
print("Cosine:")
print(stats.pearsonr(olist,COSlist))
print("word2vector:")
print(stats.pearsonr(olist,W2Vlist))

    



testset length after 1st filter:
222
wu-Palmer:
(0.4591586058919407, 0.012225751338546089)
PPMI:
(0.10537972921158244, 0.5864114263344311)
Cosine:
(0.2866665326664514, 0.1316342766703332)
word2vector:
(0.3203661168617215, 0.09020361755065244)


## A final word

Normally, we would not use a corpus as small as the Brown for the purposes of building distributional word vectors. Also, note that filtering our test set to just words we are likely to do well on would typically be considered cheating.