<h1> Ranking </h1>
Task: create dataset based on the unique pairs of 11 movies from the imdb dataset, annotate each row with preference vote, compare the result of Bradley–Terry model output with the original dataset<br><br>

<h2>SimLex-999 data</h2>
    This dataset consists of 999 word pairs, each described by the 10 columns listed below  <br>
    <hr>
    <p style="background-color: rgb(255, 127, 0);color:black;font-weight:bold">word1: The first concept in the pair <span style="position:absolute;right:10em">string</span> <br></p>
    <p style="background-color: rgb(0, 255, 127);color:black;font-weight:bold">word2: The second concept in the pair<span style="position:absolute;right:10em">string</span> <br></p>
    <p style="background-color: rgb(127, 0, 255);color:black;font-weight:bold">POS: The majority part-of-speech of the concept words, as determined<br> by occurrence in the POS-tagged British National Corpus. Only pairs of <br>matching POS are included in SimLex-999<span style="position:absolute;right:10em;">string</span> <br></p>
    <p style="background-color: rgb(127, 127, 0);color:black;font-weight:bold">SimLex999: The SimLex999 similarity rating. Note that average annotator <br>scores have been (linearly) mapped from the range [0,6] to the range [0,10]<br> to match other datasets such as WordSim-353 <span style="position:absolute;right:10em">float</span> <br></p>
    <p style="background-color: rgb(0, 127, 127);color:black;font-weight:bold">conc(w1): The concreteness rating of word1 on a scale of 1-7. <br>Taken from the University of South Florida Free Association Norms database<span style="position:absolute;right:10em">float</span> <br></p>
    <p style="background-color: rgb(127, 0, 127);color:black;font-weight:bold">conc(w2): The concreteness rating of word2 on a scale of 1-7. <br>Taken from the University of South Florida Free Association Norms database<span style="position:absolute;right:10em">float</span> <br></p>
    <p style="background-color: rgb(127, 127, 127);color:black;font-weight:bold">concQ: The quartile the pair occupies based on the two concreteness ratings. <br>Used for some analyses in the above paper<span style="position:absolute;right:10em">number</span> <br></p>
    <p style="background-color: rgb(255, 127, 255);color:black;font-weight:bold">Assoc(USF): The strength of free association from word1 to word2. <br>Values are taken from the University of South Florida Free Association Dataset<span style="position:absolute;right:10em">float</span> <br></p>
    <p style="background-color: rgb(255, 255, 127);color:black;font-weight:bold">SimAssoc333: Binary indicator of whether the pair is one of the 333 <br>most associated in the dataset (according to Assoc(USF)). <br>This subset of SimLex999 is often the hardest for computational models<br> to capture because the noise from high association can confound the <br>similarity rating. See the paper for more details <span style="position:absolute;right:10em">number</span> <br></p>
    <p style="background-color: rgb(127, 255, 255);color:black;font-weight:bold">SD(SimLex): The standard deviation of annotator scores when rating this pair. <br>Low values indicate good agreement between the 15+ annotators on the similarity value SimLex999. Higher scores indicate less certainty<span style="position:absolute;right:10em">float</span> <br></p>
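
<p>The SimLex999 column description mentions a linear mapping from [0,6] to [0,10]. A minimal sketch of such a mapping is shown below, assuming it is a plain proportional rescaling that sends 0 to 0 and 6 to 10; the function name <code>rescale_linear</code> is made up for this illustration and is not part of the dataset tooling.</p>

```python
def rescale_linear(score, old_max=6.0, new_max=10.0):
    """Map a score from [0, old_max] onto [0, new_max] by proportional scaling."""
    if not 0.0 <= score <= old_max:
        raise ValueError(f"score {score} is outside [0, {old_max}]")
    return score * new_max / old_max


# A raw annotator average of 3 (the midpoint of [0, 6]) lands on the midpoint of [0, 10].
print(rescale_linear(3.0))  # 5.0
```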

<h2>Add libraries</h2>

In [87]:
import pandas as pd

import matplotlib.pyplot as plt

import fasttext.util
from scipy.spatial.distance import cosine
from scipy.stats import kendalltau

from nltk.corpus import wordnet as wn
from gensim.models import KeyedVectors
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api

<h2>Add dataset</h2>

In [88]:
# The file starts with a header row: let pandas consume it so the numeric columns
# keep numeric dtypes, then replace the original names with shorter ones.
simLex_data = pd.read_csv('SimLex-999.txt', sep='\t', header=0)
simLex_data.columns = ['word1', 'word2', 'pos','simLex999','conc1','conc2','concQ','assoc','simAssoc','sd']
simLex_data.head()

Unnamed: 0,word1,word2,pos,simLex999,conc1,conc2,concQ,assoc,simAssoc,sd
0,old,new,A,1.58,2.72,2.81,2,7.25,1,0.41
1,smart,intelligent,A,9.2,1.75,2.46,1,7.11,1,0.67
2,hard,difficult,A,8.77,3.76,2.21,2,5.94,1,1.19
3,happy,cheerful,A,9.55,2.56,2.34,1,5.85,1,2.18
4,hard,easy,A,0.95,3.76,2.07,2,5.82,1,0.93


<h2>Calculate Wordnet similarity</h2>

In [89]:
def normalizePOS(df):
    # Lower-case the POS tags in place so they match WordNet's 'a', 'n', 'v' codes.
    df['pos'] = df['pos'].str.lower()

In [90]:
def findWordnetSimilarity(df):
    # For every pair, keep the best path similarity over all synset combinations.
    path_sim = []
    for w1, w2, pos in zip(df['word1'], df['word2'], df['pos']):
        scores = [
            wn.path_similarity(synset_w1, synset_w2)
            for synset_w1 in wn.synsets(w1, pos)
            for synset_w2 in wn.synsets(w2, pos)
        ]
        # path_similarity may return None; drop those before taking the maximum.
        scores = [s for s in scores if s is not None]
        # None (a null in the column) marks pairs without any comparable synsets.
        path_sim.append(max(scores) if scores else None)
    df['wordnetSimilarity'] = path_sim

In [91]:
normalizePOS(simLex_data)
findWordnetSimilarity(simLex_data)
simLex_data.head()

Unnamed: 0,word1,word2,pos,simLex999,conc1,conc2,concQ,assoc,simAssoc,sd,wordnetSimilarity
0,old,new,a,1.58,2.72,2.81,2,7.25,1,0.41,0.333333
1,smart,intelligent,a,9.2,1.75,2.46,1,7.11,1,0.67,0.333333
2,hard,difficult,a,8.77,3.76,2.21,2,5.94,1,1.19,1.0
3,happy,cheerful,a,9.55,2.56,2.34,1,5.85,1,2.18,0.333333
4,hard,easy,a,0.95,3.76,2.07,2,5.82,1,0.93,0.333333
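
<p>NLTK documents <code>path_similarity</code> as 1 / (shortest path distance + 1) between two senses. The sketch below reproduces that formula on a made-up toy graph (the node names and the graph itself are assumptions for illustration, not WordNet data). Under this formula a score of 1.0 corresponds to distance 0, and 0.333333 is what a distance of 2 would give, for example two senses connected only through a shared root.</p>

```python
from collections import deque


def shortest_path_length(graph, start, goal):
    """Breadth-first search over an undirected graph given as {node: [neighbours]}."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two nodes are not connected


def path_similarity(graph, a, b):
    """1 / (1 + number of edges on the shortest path), or None if unreachable."""
    dist = shortest_path_length(graph, a, b)
    return None if dist is None else 1.0 / (1.0 + dist)


# Hypothetical toy taxonomy: two leaves hanging directly under a shared root.
toy = {
    "root": ["old", "new"],
    "old": ["root"],
    "new": ["root"],
}

print(path_similarity(toy, "old", "old"))            # 1.0 (same node, distance 0)
print(round(path_similarity(toy, "old", "new"), 6))  # 0.333333 (distance 2)
```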


<h2>Missing words</h2>

In [92]:
print("Do all words exist in WordNet?", "no" if simLex_data['wordnetSimilarity'].isnull().any() else 'yes')

Do all words exist in WordNet? yes


<p>At first glance all the words seem to exist in WordNet. However, after a closer look I found that some of the words do not exist in WordNet with the POS given in the dataset. For example, "weird" with POS "a" is stored under a different POS in WordNet. It looks like the synset lookup that feeds path_similarity automatically covers both the a and s POS tags, which would explain why no pair ends up without a score.</p>
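
<p>One way to check this observation is to list the POS tags of the synsets a lookup returns. The helper below is a hypothetical sketch: <code>pos_tags_found</code>, <code>FakeSynset</code> and <code>fake_lookup</code> are invented names, and the fake data only stands in for WordNet so that the snippet runs without the corpus installed.</p>

```python
def pos_tags_found(word, pos, lookup):
    """Collect the POS tag of every synset that lookup(word, pos) returns."""
    return {synset.pos() for synset in lookup(word, pos)}


class FakeSynset:
    """Stand-in for an NLTK Synset so the sketch runs without the WordNet corpus."""

    def __init__(self, tag):
        self._tag = tag

    def pos(self):
        return self._tag


def fake_lookup(word, pos):
    # Hypothetical data: pretend the word is only stored as a satellite adjective.
    return [FakeSynset("s")]


print(pos_tags_found("weird", "a", fake_lookup))  # {'s'}
```

<p>With NLTK and the WordNet corpus available, passing <code>wn.synsets</code> as the <code>lookup</code> argument should show which tags are actually stored for a given word.</p>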

<h2>Fast Text</h2>

In [93]:
def printMissing(missing, msg):
    print("missing words are for "+ msg+":")
    for key in missing:
        if("word1" in missing[key]):
            print(missing[key]['word1'])
        if("word2" in missing[key]):
            print(missing[key]['word2'])

In [94]:
def findFastTextSimilarity(df):
    cos_sims = []
    missing = {}
    ft = fasttext.load_model('cc.en.300.bin')

    words = ft.get_words()
    print('Total number of words in the vocabulary:', len(words))
    # Set membership is O(1); testing against the raw list scans the whole vocabulary.
    all_words = set(words)

    for ix in df.index:
        # Reset for every row; a single flag shared across rows would cause all
        # pairs after the first missing word to be skipped.
        appended = False
        if df['word1'][ix] not in all_words:
            missing.setdefault(ix, {})["word1"] = df['word1'][ix]
            appended = True
        if df['word2'][ix] not in all_words:
            missing.setdefault(ix, {})["word2"] = df['word2'][ix]
            appended = True
        if appended:
            cos_sims.append(-1)  # sentinel: at least one word has no vocabulary entry
            continue
        word1_vec = ft.get_word_vector(df['word1'][ix])
        word2_vec = ft.get_word_vector(df['word2'][ix])
        # scipy's cosine() is a distance, so 1 - distance gives the similarity.
        cos_sims.append(1 - cosine(word1_vec, word2_vec))
    printMissing(missing, 'fastTextSimilarity')

    df['fastTextSimilarity'] = cos_sims
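
<p>The expression <code>1 - cosine(word1_vec, word2_vec)</code> turns SciPy's cosine distance back into a cosine similarity. For reference, a dependency-free sketch of that similarity is given below; <code>cosine_similarity</code> is an illustrative name, and the zero-vector check is a choice made for this sketch, not the behaviour of SciPy.</p>

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal vectors)
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # approximately 1.0 (parallel vectors)
```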

In [95]:
def findWupSimilarity(df):
    wupS = []
    missing = {}

    for ix in df.index:
        wn_pos = None
        if df['pos'][ix] == 'n':
            wn_pos = wn.NOUN
        elif df['pos'][ix] == 'v':
            wn_pos = wn.VERB
        else:
            wn_pos = wn.ADJ
        word1_pos_synsets = [s for s in wn.synsets(df['word1'][ix], pos=wn_pos)]
        word2_pos_synsets = [s for s in wn.synsets(df['word2'][ix], pos=wn_pos)]
        appended = False

        if(len(word1_pos_synsets) == 0):
            if(ix not in missing):
                missing[ix] = {}
            missing[ix]["word1"] = df['word1'][ix]
            wupS.append(-1)
            appended = True
        if (len(word2_pos_synsets) == 0):
            if(ix not in missing):
                missing[ix] = {}
            missing[ix]["word2"] = df['word2'][ix]
            if(not appended):
                wupS.append(-1)
            appended = True
        if appended: 
            continue

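        combinations = []
        for synset_w1 in word1_pos_synsets:
            for synset_w2 in word2_pos_synsets:
                score = wn.wup_similarity(synset_w1, synset_w2)
                # wup_similarity may return None when no common ancestor is found
                if score is not None:
                    combinations.append(score)
        # Fall back to the -1 sentinel if no synset combination could be scored
        wupS.append(max(combinations) if combinations else -1)
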
    printMissing(missing, 'wupSimilarity')

    df['wupSimilarity'] = wupS
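
<p>The Wu-Palmer score relates the depth of the deepest common ancestor of two senses to the depths of the senses themselves: 2 * depth(lcs) / (depth(a) + depth(b)). The sketch below applies that formula to an invented toy taxonomy, counting the root as depth 1; the exact depth conventions inside NLTK may differ in detail, so treat this as an illustration of the idea only.</p>

```python
def ancestors(parent, node):
    """Return the chain [node, parent, ..., root] using a child -> parent map."""
    chain = [node]
    while parent.get(chain[-1]) is not None:
        chain.append(parent[chain[-1]])
    return chain


def depth(parent, node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(parent, node))


def wu_palmer(parent, a, b):
    """2 * depth(lcs) / (depth(a) + depth(b)), lcs being the deepest shared ancestor."""
    chain_b = set(ancestors(parent, b))
    # Walking up from a, the first node that is also above b is the deepest common ancestor.
    lcs = next((n for n in ancestors(parent, a) if n in chain_b), None)
    if lcs is None:
        return None
    return 2.0 * depth(parent, lcs) / (depth(parent, a) + depth(parent, b))


# Hypothetical toy taxonomy (child -> parent); the root maps to None.
toy_parent = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "rock": "entity",
}

print(wu_palmer(toy_parent, "dog", "cat"))      # 2*2 / (3+3)
print(wu_palmer(toy_parent, "dog", "rock"))     # 2*1 / (3+2) = 0.4
print(wu_palmer(toy_parent, "animal", "rock"))  # 2*1 / (2+2) = 0.5
```

<p>The last case, two depth-2 nodes that only share the root, gives 0.5, which is the value that appears for most adjective pairs in the output further down.</p>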

In [96]:
def findGensimSimilarity(df):
    # Train a Word2Vec model with default settings on the text8 corpus.
    corpus = api.load('text8')
    model = Word2Vec(corpus)
    gensimS = []
    missing = {}
    for ix in df.index:
        w1, w2 = df['word1'][ix], df['word2'][ix]
        # Record every word of the pair that is absent from the model vocabulary.
        for col, word in (('word1', w1), ('word2', w2)):
            if word not in model.wv:
                missing.setdefault(ix, {})[col] = word
        if ix in missing:
            gensimS.append(-1)  # sentinel for pairs that cannot be scored
            continue
        # scipy's cosine() is a distance, so 1 - distance gives the similarity.
        gensimS.append(1 - cosine(model.wv[w1], model.wv[w2]))

    printMissing(missing, 'gensimSimilarity')
    df['gensimSimilarity'] = gensimS

In [97]:
findWupSimilarity(simLex_data)
findGensimSimilarity(simLex_data)
findFastTextSimilarity(simLex_data)
simLex_data.head()

missing words are for wupSimilarity:
missing words are for gensimSimilarity:
hallway
suds
orthodontist
orthodontist
hallway
hallway
disorganize




Total number of words in the vocabulary: 2000000
missing words are for fastTextSimilarity:


Unnamed: 0,word1,word2,pos,simLex999,conc1,conc2,concQ,assoc,simAssoc,sd,wordnetSimilarity,wupSimilarity,gensimSimilarity,fastTextSimilarity
0,old,new,a,1.58,2.72,2.81,2,7.25,1,0.41,0.333333,0.5,0.382462,0.441964
1,smart,intelligent,a,9.2,1.75,2.46,1,7.11,1,0.67,0.333333,0.5,0.311402,0.704955
2,hard,difficult,a,8.77,3.76,2.21,2,5.94,1,1.19,1.0,1.0,0.609987,0.63138
3,happy,cheerful,a,9.55,2.56,2.34,1,5.85,1,2.18,0.333333,0.5,0.476049,0.545871
4,hard,easy,a,0.95,3.76,2.07,2,5.82,1,0.93,0.333333,0.5,0.685552,0.486345


In [98]:
def findKendall(df):
    # Rows carrying the -1 "missing word" sentinel are kept, so those pairs are
    # ranked as the least similar ones. To exclude them, filter first, e.g.:
    # cols = ["wordnetSimilarity", "fastTextSimilarity", "wupSimilarity", "gensimSimilarity"]
    # df = df[(df[cols] != -1).all(axis=1)]

    kendall_wordnet, _ = kendalltau(df["wordnetSimilarity"], df["simLex999"])
    kendall_ft, _ = kendalltau(df["fastTextSimilarity"], df["simLex999"])
    kendall_wu_palmer, _ = kendalltau(df["wupSimilarity"], df["simLex999"])
    kendall_gensim, _ = kendalltau(df["gensimSimilarity"], df["simLex999"])

    print("Kendall's Tau for Wordnet:", kendall_wordnet)
    print("Kendall's Tau for FastText:", kendall_ft)
    print("Kendall's Tau for Wu-Palmer:", kendall_wu_palmer)
    print("Kendall's Tau for Gensim:", kendall_gensim)

In [99]:
findKendall(simLex_data)

Kendall's Tau for Wordnet: 0.35344887126870356
Kendall's Tau for FastText: 0.3301400933912036
Kendall's Tau for Wu-Palmer: 0.32114437927102696
Kendall's Tau for Gensim: 0.16782692636713945
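
<p>Kendall's tau measures how often two rankings order a pair of items the same way. The sketch below computes the simplest variant (tau-a) on made-up scores. Note the assumption: <code>scipy.stats.kendalltau</code> defaults to the tau-b variant, which additionally corrects for ties, so the two only coincide when there are no tied values. The name <code>kendall_tau_a</code> and the sample data are illustrative.</p>

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """(concordant - discordant) / number of pairs; tied pairs count as neither."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores: the model agrees with the human ranking except for one swap.
human = [1.0, 2.0, 3.0, 4.0]
model = [0.1, 0.3, 0.2, 0.9]
print(kendall_tau_a(human, model))  # (5 concordant - 1 discordant) / 6 pairs
```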


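<p>The results show that the WordNet path similarity has the highest Kendall's tau coefficient (0.353), which makes it the best of the four similarity measures compared here, ahead of fastText (0.330), Wu-Palmer (0.321) and the gensim Word2Vec model (0.168). However, I believe that fastText could do better, but due to my network connection I did not download all the files.</p>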