## word embedding evaluation in English

### Reference
* evaluation methods for unsupervised word emeddings, Tobias Schnabel et al.

## word similarity test

### list of datasets
* wordsim353
* MEN: MEN1과 MEN2는 동일한데 MEN2가 lemmatization을 한 것 같다. 따라서 MEN2를 사용한다.
* turk
* rare_words
* simlex-999

In [74]:
import pandas as pd
import numpy as np

In [42]:
wordsim_rel = pd.read_csv('./wordsim353/wordsim_relatedness_goldstandard.txt', sep='\t', header=None)
wordsim_sim = pd.read_csv('./wordsim353/wordsim_similarity_goldstandard.txt', sep='\t', header=None)
MEN1 = pd.read_csv('./MEN/elias-men-ratings.csv')
MEN2 = pd.read_csv('./MEN/marcos-men-ratings.csv')
turk = pd.read_csv('./Mtruk.csv', header=None)
rare_words = pd.read_csv('./rare_words/rw.txt', sep='\t', header=None)
rare_words = rare_words.iloc[:,[0,1,2]]
simlex = pd.read_csv('./SimLex-999/SimLex-999.txt', sep='\t')
simlex = simlex[['word1','word2','SimLex999']]

In [82]:
sim_df_list = [wordsim_rel, wordsim_sim, MEN2, turk, rare_words, simlex]
sim_df_list_name = ['wordsim_rel', 'wordsim_sim', 'MEN2', 'turk', 'rare_words', 'simlex']

for df in sim_df_list:
    df.columns = ['word1','word2','score']

## load google's pretrained word2vec model
* 300 차원 벡터

In [63]:
import os
os.getcwd()

'C:\\Users\\sbh0613\\Desktop\\NLP\\embedding part\\embedding evaluation'

In [64]:
import gensim
file_name = 'GoogleNews-vectors-negative300.bin.gz'
model = gensim.models.KeyedVectors.load_word2vec_format(fname = file_name, binary=True)

In [65]:
# get word vectors
word_vectors = model.wv

# get vocabulary
vocabs = word_vectors.vocab.keys()

  


In [94]:
def compute_corr(df, model):
    '''
    input
    1. df: name of dataframe in which there is human-score
    The column of df should be word1, word2, score
    
    2. word_set: tuple of two words
    
    output: correlation between cosine similarity between two word vectors and human-score
    '''
    
    n = df.shape[0]
    
    eval_word_list = []
    for tup in zip(df['word1'], df['word2']):
        eval_word_list.append(set(tup))
    
    word_vectors = model.wv
    vocabs_w2v = list(word_vectors.vocab.keys())
    
    cosine_sim = []
    
    word_set_idx = 0
    word_set_idx_list = []
    for tup in zip(df['word1'], df['word2']):
        if tup[0] in vocabs_w2v and tup[1] in vocabs_w2v:
            cosine_sim.append(word_vectors.similarity(w1 = tup[0], w2 = tup[1]))
            word_set_idx_list.append(word_set_idx)
            word_set_idx += 1
            
    score = [j for idx, j in enumerate(df['score']) if idx in word_set_idx_list]
    
    r = np.corrcoef(score, cosine_sim)
    
    print('전체 {0}개 중에 {1}개가 평가로 사용됨'.format(n, len(word_set_idx_list)))
    
    return r[0,1]

In [95]:
compute_corr(wordsim_rel, model)

전체 252개 중에 252개가 평가로 사용됨




0.5920509855347379

In [98]:
a = 0
for df in sim_df_list:
    r = compute_corr(df, model)
    print('dataset: {0}'.format(sim_df_list_name[0]))
    print('상관계수: {0}'.format(r))
    print('----------------------------------------------------------------')



전체 252개 중에 252개가 평가로 사용됨
dataset: wordsim_rel
상관계수: 0.5920509855347379
----------------------------------------------------------------
전체 203개 중에 203개가 평가로 사용됨
dataset: wordsim_rel
상관계수: 0.7645224545856311
----------------------------------------------------------------
전체 3000개 중에 2946개가 평가로 사용됨
dataset: wordsim_rel
상관계수: 0.04935611822850241
----------------------------------------------------------------
전체 287개 중에 275개가 평가로 사용됨
dataset: wordsim_rel
상관계수: 0.03407367309624036
----------------------------------------------------------------
전체 2034개 중에 1825개가 평가로 사용됨
dataset: wordsim_rel
상관계수: -0.012723297635601458
----------------------------------------------------------------
전체 999개 중에 999개가 평가로 사용됨
dataset: wordsim_rel
상관계수: 0.45392820971322645
----------------------------------------------------------------


## word analogy test
* dataset link: https://aclweb.org/aclwiki/Analogy_(State_of_the_art)

### list of datasets
* MSR dataset (찾치 못함)
* Google analogy dataset

### method
* 3CosAdd
* 3CosMul

In [107]:
a, b, c, d = zip(*((map(str, line.split()) for line in open('google_analogy.txt'))))

In [115]:
analogy_dataset = []
for tup in zip(a,b,c,d):
    analogy_dataset.append(tup)

In [116]:
analogy_dataset[:10]

[('Athens', 'Greece', 'Baghdad', 'Iraq'),
 ('Athens', 'Greece', 'Bangkok', 'Thailand'),
 ('Athens', 'Greece', 'Beijing', 'China'),
 ('Athens', 'Greece', 'Berlin', 'Germany'),
 ('Athens', 'Greece', 'Bern', 'Switzerland'),
 ('Athens', 'Greece', 'Cairo', 'Egypt'),
 ('Athens', 'Greece', 'Canberra', 'Australia'),
 ('Athens', 'Greece', 'Hanoi', 'Vietnam'),
 ('Athens', 'Greece', 'Havana', 'Cuba'),
 ('Athens', 'Greece', 'Helsinki', 'Finland')]

In [142]:
class word_analogy_test:
    def __init__(self, dataset, model):
        self.n = df.shape[0]
        self.data = dataset
        self.word_vectors = model.wv
        self.vocab_w2v = list(self.word_vectors.vocab.keys())
        
    def Cos_Add_Mul(self, epsilon = 0.001):
        pred_add = []
        pred_mul = []
        act = []
        n = len(self.data)
        a = 1
        
        for tup in self.data:
            cosine_add = []
            cosine_mul = []
            if all([True for txt in tup if txt in self.vocab_w2v]):
                candi_b_star = [voc for voc in self.vocab_w2v if voc not in tup[:3]]
                print('start')
                
                for b_star in candi_b_star:
                    first = self.word_vectors.similarity(w1=b_star, w2=tup[1])
                    second = self.word_vectors.similarity(w1=b_star, w2=tup[0])
                    third = self.word_vectors.similarity(w1=b_star, w2=tup[2])
                    cosine_add.append(first - second + third)
                    cosine_mul.append((first*third)/(second+epsilon))
                    
                max_idx_add = cosine_add.index(max(cosine_add))
                max_idx_mul = cosine_mul.index(max(cosine_mul))
                
                pred_add.append( candi_b_star[max_idx_add] )
                pred_mul.append( candi_b_star[max_idx_mul] )
                act.append(tup[3])
            
            else: pass
            
            if a % 100 == 0:
                print('전체 {0}개 중 {1}개 했음'.format(n,a))
            
            a+=1
    
            
        acc_add = [True if i == j else False for i,j in zip(pred_add,act)]
        acc_mul = [True if i == j else False for i,j in zip(pred_mul,act)]
        
        return sum(acc_add)/len(acc_add), sum(acc_mul)/len(acc_mul)