# Distributional semantics

Mehdi Ghanimifard, Adam Ek, Wafia Adouane and Simon Dobnik

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Before starting, please read the instructions on how to work on group assignments.

Write all your answers and the code in the appropriate boxes below.

---

In this lab we will look at how to build distributional semantic models from corpora and use the semantic similarity captured by these models to do some simple semantic tasks. We are going to use the code that we discussed in class last time.

The following command simply imports all the methods from that code.

In [1]:
from dist_erk import *

## 1. Loading a corpus

To train a distributional model, we first need a sufficiently large collection of text in which different words occur frequently enough in different contexts. Here we will use a section of the Wikipedia corpus, which you can download from [here](https://gubox.app.box.com/s/7yt07j0cgpnolvbb07n02s0303ski8wr/folder/75208243314?fbclid=IwAR1muU7cL8ZuxRNvXE7VnEuXl5wh3LdmuttdtkrmiyZ9au-fOtLtvR-Ua_c) (Linux and Mac) or [here](https://gubox.app.box.com/s/7yt07j0cgpnolvbb07n02s0303ski8wr/folder/75208243314?fbclid=IwAR1muU7cL8ZuxRNvXE7VnEuXl5wh3LdmuttdtkrmiyZ9au-fOtLtvR-Ua_c) (Windows). (This file has been borrowed from another lab by [Richard Johansson](http://www.cse.chalmers.se/~richajo/).) When unpacked, the file is 151 MB, so if you are using the lab computers you should store it in a temporary folder outside your home directory and adjust the `corpus_dir` path below.
<It may already exist in `/opt/mlt/courses/cl2015/a5`.>


In [2]:
corpus_dir = "../wikipedia"

## 2.1 Building a count-based model

Now you are ready to build a count-based model. The functions for building word spaces can be found in `dist_erk.py`. We will build a model that creates count-based vectors for 1000 words. Using the methods from the code imported above, build three word matrices with 1000 dimensions as follows: (i) with raw counts (saved to a variable `space_1k`); (ii) with PPMI (`ppmispace_1k`); and (iii) with dimensions reduced by SVD (`svdspace_1k`). For the latter use `svddim=5`. **[5 marks]**


In [3]:
word_to_keep = 1000
num_dims = 1000
svddim = 5

#word_to_keep = 10000
#num_dims = 1000
#svddim = 2


# which words to use as targets and context words?
ktw = do_word_count(corpus_dir, num_dims)

wi = make_word_index(ktw) # word index
words_in_order = sorted(wi.keys(), key=lambda w:wi[w])  # sorted words # won't be used in the beginning

print('create count matrices')
space_1k = make_space(corpus_dir, wi, num_dims)
print('ppmi transform')
ppmispace_1k = ppmi_transform(space_1k, wi)
print('svd transform')
svdspace_1k = svd_transform(ppmispace_1k, num_dims, svddim)
print('done.')

reading file wikipedia.txt
create count matrices
reading file wikipedia.txt
ppmi transform
svd transform
done.
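
To make the PPMI and SVD steps more concrete, here is a minimal NumPy sketch on a toy count matrix. It is not the implementation in `dist_erk.py`, whose details (for example the logarithm base or how the SVD output is scaled) may differ; the function names `ppmi` and `truncated_svd` and the toy counts are assumptions made for illustration only.

```python
import numpy as np

def ppmi(counts):
    """Positive PMI for a (targets x contexts) co-occurrence count matrix."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_wc = counts / total                      # joint probabilities P(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)      # row marginals P(w)
    p_c = p_wc.sum(axis=0, keepdims=True)      # column marginals P(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0               # zero counts give -inf or nan
    return np.maximum(pmi, 0.0)                # keep only positive associations

def truncated_svd(matrix, k):
    """Project each row onto the top-k singular directions."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

# Made-up counts: 3 target words x 3 context words
toy_counts = np.array([[10, 0, 2],
                       [0, 8, 1],
                       [3, 1, 6]])
toy_ppmi = ppmi(toy_counts)
toy_svd = truncated_svd(toy_ppmi, k=2)
print(toy_ppmi.round(2))
print(toy_svd.shape)
```

The raw counts are dominated by frequent context words, whereas PPMI rewards only co-occurrences that are more frequent than chance, and the SVD step compresses each row into a few dense dimensions.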


## 2.2 Building a word2vec model

We will also build a continuous-bag-of-words (CBOW) word2vec model using gensim (https://radimrehurek.com/gensim/index.html). Build a CBOW word2vec model in which each word has 300 dimensions, and limit the vocabulary size to the 1000 most common words. **[5 marks]**

Documentation for the Word2Vec class can be found here: https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec

In [4]:
from gensim import utils
from gensim.models import Word2Vec


# gensim requires an iterable class to process the corpus
class CorpusReader():
    def __init__(self, corpus_path):
        self.corpus_path = corpus_path

    def __iter__(self):
        # stream the corpus line by line; the file is closed once iteration ends
        with open(self.corpus_path) as corpus_file:
            for line in corpus_file:
                sentence = utils.simple_preprocess(line)
                if sentence:
                    yield sentence

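corpus = CorpusReader(corpus_dir+'/wikipedia.txt')
# CBOW is gensim's default architecture (sg=0).
# Note: gensim 4.x renamed the `size` parameter to `vector_size`.
w2v_model = Word2Vec(sentences=corpus, size=300, max_final_vocab=1000)
# a second model with fewer dimensions and a larger vocabulary, used in section 3
w2v_model_150_2000 = Word2Vec(sentences=corpus, size=150, max_final_vocab=2000)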
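
As a rough illustration of what CBOW is trained on, the sketch below lists the (context, target) pairs for one tokenised sentence: the model learns to predict each target word from the words around it. The helper `cbow_pairs` and the fixed window of 2 are assumptions for illustration; gensim's actual training procedure involves further details that are not shown here.

```python
def cbow_pairs(sentence, window=2):
    """Yield (context_words, target_word) pairs for one tokenised sentence."""
    for i, target in enumerate(sentence):
        left = sentence[max(0, i - window):i]      # up to `window` words before
        right = sentence[i + 1:i + 1 + window]     # up to `window` words after
        context = left + right
        if context:
            yield context, target

pairs = list(cbow_pairs(["the", "cat", "sat", "on", "the", "mat"], window=2))
for context, target in pairs[:3]:
    print(context, "->", target)
```

A larger window gives each target more context words, which is one of the settings you are asked to vary later in the lab.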

In [5]:
print('house:', space_1k['house'])
print('house:', w2v_model['house'])

house: [2554 3774 3105  567  962  631  443  185  311  189  131   28   93  169
   81  125  151  408  194   90   79   29  217  184   62   15   31   70
   10    1   41   21    1   31   37    1   30    5   25    7    3   20
   11    1   32   36    2    5   66    4    0   46    8   18   28    0
   20    7    8   16   10   40    0  175   10    2    7   19    1  174
   11    3    1    6    0    0    0   10    9   11    7   24    4    4
   14   23   58    7    0   10    2    3   10    6   18    6   13    3
   22    0    3    5    3    7   14    3   40   20   19   15    6    8
   24    4    5    1   19    0    3    1    0   14    0   14   53    7
    7   11    6    5    5    4   12    6   53    1    1  433    4    0
    5    7    7   12    1    1    3    4   17    8   16    1    2   31
    1   12   14    1   44    6   14    9   38    7    2    6    8    1
   10    6   10    1    9    7    9    4    3   10    0   11    3    2
    0    2   11   37    2    0    2    1    5    9   10   16   88    6

  


The Oxford Advanced Dictionary has 185,000 words, so a vocabulary of 1,000 words is not representative. We therefore trained a model with 10,000 words, reduced to 50 dimensions with truncated SVD. This took 40 minutes on a laptop.

Additionally, we trained a word2vec model on the same data. The vocabulary size is also 10,000 with 300 dimensions for the words, and truncated to 50 dimensions in SVD. It took about 15 minutes on a desktop.

We saved all five matrices [here](https://gubox.app.box.com/folder/75208243314) ([alternative/old link](https://linux.dobnik.net/oc/index.php/s/9NTlpOJfPWGS56t/download?path=%2Flab4-distributional-data&files=pretrained.zip)) which you can load as follows:

In [6]:
import numpy as np

print('Please wait..')
ktw_10k       = np.load('../pretrained/ktw_wikipediaktw.npy', allow_pickle=True)
space_10k     = np.load('../pretrained/raw_wikipediaktw.npy', allow_pickle=True).all()
ppmispace_10k = np.load('../pretrained/ppmi_wikipediaktw.npy', allow_pickle=True).all()
svdspace_10k  = np.load('../pretrained/svd50_wikipedia10k.npy', allow_pickle=True).all()
w2v_space     = np.load('../pretrained/w2v.npy', allow_pickle=True)
w2v_svd_space = np.load('../pretrained/w2v_svd.npy', allow_pickle=True)
print('Done.')


Please wait..
Done.


Each vector space can be queried as a dictionary `{word_form: vector, ...}`:

In [7]:
print('house:', space_10k['house'])
print('house:', w2v_space['house'])

house: [2554 3774 3105 ...    0    0    0]
house: [-0.22926676  0.84213424  0.41197702  0.8781869   1.2811967  -1.4847856
  1.4102424   0.7267851  -0.5590538   0.04928471  1.8261132  -0.4911551
  2.6236389  -0.62284136 -1.4621106   1.1592358   1.0392265  -0.07465155
  1.0108253   1.1842203  -1.5743443  -1.3098637  -0.04264146 -0.1076067
  0.5574365   0.7599903   0.11031609  0.16449381 -0.40311787 -0.68341875
  0.48706874 -0.73431605 -0.2089108  -0.10828558 -0.6296254   1.3785347
 -0.2206072  -1.0867819  -0.2650222  -0.18507054 -1.6295078  -1.0952461
  1.2633797   0.29369423 -0.10325834  1.2930017   0.83000755 -0.14103375
  1.786327    0.49764258 -2.0428705   0.64002794 -0.3000837   0.03268864
 -0.0933575   0.76802623 -0.1682042   1.8946133  -0.10339233  0.78187567
 -0.28241557 -1.0668939   2.4631667   1.0492538   0.10093345  0.5764743
 -0.24940039 -0.27094615 -0.5501715  -0.07181013  0.830345    0.06051366
 -0.75200856  0.03423605  0.12481829  0.35145602 -0.5419142   0.62099475
 -0.059

## 3. Operations on similarities

We can perform mathematical operations on word vectors to derive meaning predictions. For example, we can subtract the normalised vector for `king` from the one for `queen`, add the resulting vector to `man`, and hope to get a vector close to the one for `woman`. Why? **[3 marks]**

**Your answer should go here:**

In [8]:
# The vector that connects king to man should be similar to the vector that connects queen to woman. This can be 
# illustrated by re-organizing the formula queen - king + man = woman as: queen - king = woman - man.
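
The parallelogram intuition behind such analogies can be checked on a toy example. The two-dimensional vectors below are hand-made (one axis loosely standing for "royalty", the other for "gender") and are not taken from any trained model; real embeddings only satisfy these equalities approximately, if at all.

```python
import numpy as np

# Hand-made vectors, purely illustrative: axis 0 ~ "royalty", axis 1 ~ "gender"
toy_space = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

offset_royal = toy_space["queen"] - toy_space["king"]    # king -> queen
offset_common = toy_space["woman"] - toy_space["man"]    # man -> woman

# If the two offsets are equal, both analogy formulas hold exactly
to_woman = toy_space["queen"] - toy_space["king"] + toy_space["man"]
to_queen = toy_space["king"] - toy_space["man"] + toy_space["woman"]

print(np.allclose(offset_royal, offset_common))
print(np.allclose(to_woman, toy_space["woman"]))
print(np.allclose(to_queen, toy_space["queen"]))
```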

Here is some helpful code that allows us to calculate such comparisons.

In [9]:
from scipy.spatial import distance

def normalize(vec):
    return vec / veclen(vec)

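def make_w2v_dict(wtv):
    # copy a trained gensim model into a plain {word: vector} dictionary
    # (`wv.vocab` is the gensim 3.x API; gensim 4.x uses `wv.key_to_index`)
    wtv_dict = dict()
    for item in wtv.wv.vocab:
        wtv_dict[item] = wtv.wv[item]
    return wtv_dict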

w2v_model_dict = make_w2v_dict(w2v_model)
w2v_model_150_2000_dict = make_w2v_dict(w2v_model_150_2000)

def find_similar_to(vec1, space):
    # vector similarity functions
    #sim_fn = lambda a, b: 1-distance.euclidean(normalize(a), normalize(b))
    #sim_fn = lambda a, b: 1-distance.correlation(a, b)
    #sim_fn = lambda a, b: 1-distance.cityblock(normalize(a), normalize(b))
    #sim_fn = lambda a, b: 1-distance.chebyshev(normalize(a), normalize(b))
    #sim_fn = lambda a, b: np.dot(normalize(a), normalize(b))
    sim_fn = lambda a, b: 1-distance.cosine(a, b)

    sims = [
        (word2, sim_fn(vec1, space[word2]))
        for word2 in space.keys()
    ]
    return sorted(sims, key = lambda p:p[1], reverse=True)
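
Several of the alternative similarity functions commented out above are closely related once the vectors are normalised. The sketch below, using two made-up vectors and a hypothetical helper `unit`, checks that the dot product of unit vectors equals the cosine similarity, and that the squared Euclidean distance between unit vectors is $2 - 2\cos$, so it ranks neighbours in the same order as cosine.

```python
import numpy as np

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

a = np.array([2.0, 1.0, 0.0])
b = np.array([1.0, 3.0, 1.0])

cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_units = np.dot(unit(a), unit(b))
sq_euclid_of_units = np.sum((unit(a) - unit(b)) ** 2)

print(np.isclose(cos_sim, dot_of_units))                 # same quantity
print(np.isclose(sq_euclid_of_units, 2 - 2 * cos_sim))   # ||a^ - b^||^2 = 2 - 2 cos
```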



Here is how you apply this code. Compare the count-based method with the word2vec method and comment on the results you get. **[4 marks]**

In [31]:
#space_10k:

king = normalize(space_10k['king'])
queen = normalize(space_10k['queen'])
man = normalize(space_10k['man'])
woman = normalize(space_10k['woman'])

#france = normalize(space_10k['france'])
#paris = normalize(space_10k['paris'])
#tokyo = normalize(space_10k['tokyo'])
#japan = normalize(space_10k['japan'])

#coffee_space_10k = normalize(space_10k['coffee'])
#black_space_10k = normalize(space_10k['black'])
#green_space_10k = normalize(space_10k['green'])
#tea_space_10k = normalize(space_10k['tea'])

#print("space_10k:")
#print(find_similar_to(king - man + woman, space_10k)[:10])
#print(find_similar_to(france - paris  + tokyo, space_10k)[:10])
#find_similar_to(japan_space_10k - tokyo_space_10k + paris_space_10k, space_10k)[:10]
#print(find_similar_to(coffee_space_10k - black_space_10k  + green_space_10k, space_10k)[:10])

#ppmispace_10k:

king = normalize(ppmispace_10k['king'])
queen = normalize(ppmispace_10k['queen'])
man = normalize(ppmispace_10k['man'])
woman = normalize(ppmispace_10k['woman'])

coffee = normalize(ppmispace_10k['coffee'])
black = normalize(ppmispace_10k['black'])
green = normalize(ppmispace_10k['green'])
tea = normalize(ppmispace_10k['tea'])

france = normalize(ppmispace_10k['france'])
paris = normalize(ppmispace_10k['paris'])
tokyo = normalize(ppmispace_10k['tokyo'])
japan = normalize(ppmispace_10k['japan'])

#print("ppmispace_10k:")
#print(find_similar_to(king_ppmispace_10k - man_ppmispace_10k + woman_ppmispace_10k, ppmispace_10k)[:10])
#print(find_similar_to(france - paris  + tokyo, ppmispace_10k)[:10])
#find_similar_to(japan_ppmispace_10k - tokyo_ppmispace_10k + paris_ppmispace_10k, ppmispace_10k)[:10]
#print(find_similar_to(coffee_ppmispace_10k - black_ppmispace_10k  + green_ppmispace_10k, ppmispace_10k)[:10])

#svdspace_10k:

king = normalize(svdspace_10k['king'])
queen = normalize(svdspace_10k['queen'])
man = normalize(svdspace_10k['man'])
woman = normalize(svdspace_10k['woman'])

coffee = normalize(svdspace_10k['coffee'])
black = normalize(svdspace_10k['black'])
green = normalize(svdspace_10k['green'])
tea = normalize(svdspace_10k['tea'])

france = normalize(svdspace_10k['france'])
paris = normalize(svdspace_10k['paris'])
tokyo = normalize(svdspace_10k['tokyo'])
japan = normalize(svdspace_10k['japan'])

#print("svdspace_10k:")
#print(find_similar_to(king_svdspace_10k - man_svdspace_10k + woman_svdspace_10k, svdspace_10k)[:10])
#find_similar_to(france - paris  + tokyo, svdspace_10k)[:10]
#find_similar_to(japan_svdspace_10k - tokyo_svdspace_10k + paris_svdspace_10k, svdspace_10k)[:10]
#print(find_similar_to(coffee_svdspace_10k - black_svdspace_10k  + green_svdspace_10k, svdspace_10k)[:10])

#w2v_space:
king = normalize(w2v_space['king'])
queen = normalize(w2v_space['queen'])
man = normalize(w2v_space['man'])
woman = normalize(w2v_space['woman'])

france = normalize(w2v_space['france'])
paris = normalize(w2v_space['paris'])
tokyo = normalize(w2v_space['tokyo'])  
japan = normalize(w2v_space['japan'])

coffee = normalize(w2v_space['coffee'])
black = normalize(w2v_space['black'])
green = normalize(w2v_space['green'])
tea = normalize(w2v_space['tea'])

print("w2v_space:")
#print(find_similar_to(king - man + woman, w2v_space)[:10])
find_similar_to(france - paris  + tokyo, w2v_space)[:10]
#find_similar_to(japan - tokyo + paris, w2v_space)[:10]
find_similar_to(coffee - black  + green, w2v_space)[:10]
#print(find_similar_to(tea - green + black, w2v_space)[:10])


#w2v_svd_space:
king = normalize(w2v_svd_space['king'])
queen = normalize(w2v_svd_space['queen'])
man = normalize(w2v_svd_space['man'])
woman = normalize(w2v_svd_space['woman'])

france = normalize(w2v_svd_space['france'])
paris = normalize(w2v_svd_space['paris'])
tokyo = normalize(w2v_svd_space['tokyo'])  
japan = normalize(w2v_svd_space['japan'])

coffee = normalize(w2v_svd_space['coffee'])
black = normalize(w2v_svd_space['black'])
green = normalize(w2v_svd_space['green'])
tea = normalize(w2v_svd_space['tea'])

print("w2v_svd_space:")
#print(find_similar_to(king_w2v_svd_space - man_w2v_svd_space + woman_w2v_svd_space, w2v_svd_space)[:10])
find_similar_to(france - paris  + tokyo, w2v_svd_space)[:10]
#find_similar_to(japan_w2v_svd_space - tokyo_w2v_svd_space+ paris_w2v_svd_space, w2v_svd_space)[:10]
print(find_similar_to(coffee - black  + green, w2v_svd_space)[:10])
#print(find_similar_to(tea_w2v_svd_space - green_w2v_svd_space + black_w2v_svd_space, w2v_svd_space)[:10])

###HOME-MADE MODELS FROM 2.1 AND 2.2###

#w2v_model (as dictionary):

#king = normalize(w2v_model_dict['king'])
#queen = normalize(w2v_model_dict['queen'])
#man = normalize(w2v_model_dict['man'])
#woman = normalize(w2v_model_dict['woman'])

#coffee = normalize(w2v_model_dict['coffee'])
#black = normalize(w2v_model_dict['black'])
#green = normalize(w2v_model_dict['green'])
#tea = normalize(w2v_model_dict['tea'])


#print("w2v_model (as dictionary):")
#print(find_similar_to(king - man + woman, w2v_model_dict)[:10])
#print(find_similar_to(france - paris  + tokyo, w2v_model_dict)[:10])
#print(find_similar_to(coffee - black + green, w2v_model_dict)[:10])

#w2v_model_150_2000 (as dictionary):

king = normalize(w2v_model_150_2000_dict['king'])
queen = normalize(w2v_model_150_2000_dict['queen'])
man = normalize(w2v_model_150_2000_dict['man'])
woman = normalize(w2v_model_150_2000_dict['woman'])

#france = normalize(w2v_model_150_2000_dict['france'])
#paris = normalize(w2v_model_150_2000_dict['paris'])
#tokyo = normalize(w2v_model_150_2000_dict['tokyo'])  
#japan = normalize(w2v_model_150_2000_dict['japan'])

#coffee = normalize(w2v_model_150_2000_dict['coffee'])
#black = normalize(w2v_model_150_2000_dict['black'])
#green = normalize(w2v_model_150_2000_dict['green'])
#tea = normalize(w2v_model_150_2000_dict['tea'])

#print("w2v_model_150_2000:")
#print(find_similar_to(king - man + woman, w2v_model_150_2000_dict)[:10])
#print(find_similar_to(france - paris  + tokyo, w2v_model_150_2000_dict)[:10])
#print(find_similar_to(coffee - black + green, w2v_model_150_2000_dict)[:10])

#space_1k (doesn't have a big enough vocabulary):

#king = normalize(space_1k['king'])
#queen = normalize(space_1k['queen'])
#man = normalize(space_1k['man'])
#woman = normalize(space_1k['woman'])

#coffee_space_1k = normalize(space_1k['coffee'])
#black_space_1k = normalize(space_1k['black'])
#green_space_1k = normalize(space_1k['green'])
#tea_space_1k = normalize(space_1k['tea'])


#print("space_1k:")
#print(find_similar_to(king - man + woman, space_1k)[:10])
#print(find_similar_to(coffee_space_1k - black_space_1k  + green_space_1k, space_1k)[:10])


w2v_space:
w2v_svd_space:
[('coffee', 0.824350657374129), ('rice', 0.6517502165361037), ('breakfast', 0.6425617646579574), ('sugar', 0.6424276619034837), ('maple', 0.621559712616331), ('flour', 0.6161205360200773), ('tea', 0.6115164157690608), ('chocolate', 0.6062069039110909), ('butter', 0.5933827772008674), ('wheat', 0.5921006396962648)]


**Your answer should go here:**

Find 2 similar pairs of pairs of words and test them using your own models and the pretrained models. Does the resulting vector similarity confirm your expectations? But remember you can only do this if the words are contained in our vector space. **[2 marks]**

**Answer:**<br>
It turns out that the closest vector to *king - man + woman* using w2v is actually ... king. The pretrained models perform similarly, all of them giving 'king' as the top candidate and 'queen' as the second or third candidate. For the formula *France - Paris + Tokyo*, the likeliest candidate predicted by the model is also one of the input words: Tokyo (Japan only shows up as the third likeliest candidate, just like queen in the previous example). The same is true for *coffee - black + green*, where you might expect tea to be the closest vector, but once again find it in third place, preceded by the input words coffee and green.
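
Because the query vector is built from the input words, those words tend to rank highest. A common workaround is to drop the input words from the candidate list before ranking. The sketch below shows the idea; `find_similar_excluding` is a hypothetical helper (not part of `dist_erk.py`), and the toy vectors are made up rather than taken from a trained model.

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_similar_excluding(vec, space, exclude):
    """Rank words in `space` by cosine similarity to `vec`, skipping `exclude`."""
    sims = [(w, cosine_sim(vec, v)) for w, v in space.items() if w not in exclude]
    return sorted(sims, key=lambda p: p[1], reverse=True)

# Toy vectors, invented for illustration
toy = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}
query = toy["king"] - toy["man"] + toy["woman"]
ranked = find_similar_excluding(query, toy, exclude={"king", "man", "woman"})
print(ranked[0][0])
```

With a real space, the same call would take `exclude={'king', 'man', 'woman'}` and one of the dictionaries loaded above.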

Try changing the number of dimensions and the window size in the models you built in (2.1) and (2.2). Test them on the examples you found in the previous cell and comment on the new results you get in comparison to the first results you obtained. **[4 marks]**

**Your answer should go here:**<br>
Having half as many dimensions (150) and twice as many words (2000) didn't help much in the w2v model, although it did somewhat improve the prediction scores and the third-ranked candidate (from prince to princess). I could not run my secondary examples on the count-based models without getting error messages (ValueError: "operands could not be broadcast together with shapes (50,) (300,)"). `space_1k` was too small to work even on the king example, and the same held when increasing the vocabulary to 10,000.
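
The broadcast error quoted above is what NumPy raises when vectors of different lengths are combined, for example a 50-dimensional vector with a 300-dimensional one. One plausible cause in a long cell that reuses names such as `tokyo` for several spaces is that a variable still holds a vector from a different space. A small helper that looks up all words in a single space avoids this; `analogy_vector` is a hypothetical name and the random toy spaces are for illustration only.

```python
import numpy as np

def analogy_vector(space, a, b, c):
    """Return unit(a) - unit(b) + unit(c), with all three looked up in `space`."""
    unit = lambda v: v / np.linalg.norm(v)
    return unit(space[a]) - unit(space[b]) + unit(space[c])

# Two toy spaces with different dimensionalities (random values)
rng = np.random.default_rng(0)
words = ["france", "paris", "tokyo"]
space_50 = {w: rng.normal(size=50) for w in words}
space_300 = {w: rng.normal(size=300) for w in words}

print(analogy_vector(space_50, "france", "paris", "tokyo").shape)

# Mixing vectors from the two spaces triggers the same kind of error
try:
    space_50["france"] - space_300["paris"]
except ValueError as err:
    print("ValueError:", err)
```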


## 4. Testing semantic similarity

The file `similarity_judgements.txt` (a copy is included with this notebook, or you can download it from [here](https://linux.dobnik.net/oc/index.php/s/9NTlpOJfPWGS56t/download?path=%2Flab4-distributional-data&files=similarity_judgements.txt.zip)) contains 7,576 pairs of words and their lexical and visual similarities (based on pictures), collected through online crowd-sourcing on Mechanical Turk as described in [1]. The scores range from 1 (highly dissimilar) to 5 (highly similar).

The following code will import them into the Python lists below:

In [11]:
word_pairs = [] # test suite word pairs
semantic_similarity = [] 
visual_similarity = []
test_vocab = set()

for index, line in enumerate(open('similarity_judgements.txt')):
    data = line.strip().split('\t')
    if index > 0 and len(data) == 3:
        w1, w2 = tuple(data[0].split('#'))
        # it will check if both words from each pair exist in the word matrix.
        if w1 in space_1k and w2 in space_1k:
            word_pairs.append((w1, w2))
            test_vocab.update([w1, w2])
            semantic_similarity.append(float(data[1]))
            visual_similarity.append(float(data[2]))
         
print("number of available words to test:", len(test_vocab & set(ktw)))
print("number of available word pairs to test:", len(word_pairs))
#list(zip(word_pairs, visual_similarity, semantic_similarity))

number of available words to test: 10
number of available word pairs to test: 13


Now we can test how the cosine similarity between vectors of each of the five spaces compares with the human judgements on the words collected in the previous step. Which of the five spaces best approximates human judgements?

To compare several sets of scores we can use the [Spearman correlation coefficient](https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient), which is implemented in `scipy.stats.spearmanr` [here](https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.spearmanr.html). The values of the Spearman correlation coefficient range from -1 to 1, where 0 indicates no correlation, 1 a perfect positive correlation and -1 a perfect negative correlation. Hence, the greater the number the better. The $p$-value tells us whether the coefficient is statistically significant. For this to be the case, it must be below $0.05$.
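
Spearman's coefficient compares the rank orderings of two lists rather than their raw values. As a minimal sketch, assuming there are no tied values, it can be computed as $\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$, where $d_i$ is the difference between the two ranks of item $i$. The helper names and the example scores below are made up; `scipy.stats.spearmanr` also handles ties and reports a $p$-value, so it should be preferred in the lab itself.

```python
def ranks(values):
    """Rank values from 1 (smallest) upwards; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_no_ties(xs, ys):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid only without ties."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

human = [1.0, 2.5, 3.0, 4.5, 5.0]        # made-up similarity judgements
model = [0.10, 0.40, 0.35, 0.80, 0.90]   # made-up cosine scores
print(spearman_no_ties(human, model))
```

Only the second and third items swap ranks between the two lists, so the coefficient stays close to 1.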

Here is how you can calculate Spearman's correlation coefficient between the scores of visual similarity and semantic similarity of the available words in the test suite:

In [12]:
from scipy import stats

rho, pval = stats.spearmanr(semantic_similarity, visual_similarity)
print("""Visual Similarity vs. Semantic Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))


Visual Similarity vs. Semantic Similarity:
rho     = 0.8136
p-value = 0.0007


Let's now calculate the cosine similarity scores of all word pairs in an ordered list using all five spaces. **[2 marks]**

In [13]:
raw_similarities  = [cosine(w1, w2, space_1k) for w1, w2 in word_pairs]
print("raw similiarities:", raw_similarities)
ppmi_similarities = [cosine(w1, w2, ppmispace_1k) for w1, w2 in word_pairs]
print("PPMI similiarities:", ppmi_similarities)
svd_similarities  = [cosine(w1, w2, svdspace_1k) for w1, w2 in word_pairs]
print("SVD similiarities:", svd_similarities)
w2v_similarities = [cosine(w1, w2, w2v_model) for w1, w2 in word_pairs]
print("w2v_similiarities:", w2v_similarities)
w2v_svd_similarities = [cosine(w1, w2, w2v_svd_space) for w1, w2 in word_pairs]
print("w2v_svd_similiarities:", w2v_svd_similarities)

raw similiarities: [0.6053463144588419, 0.7554570233618889, 0.9711167809919755, 0.9315826357027838, 0.9754208978693972, 0.868098303658973, 0.8808572066298728, 0.9702039222697523, 0.975056028371703, 0.9647852216476699, 0.8815865237242774, 0.9644596411599679, 0.9360582834159354]
PPMI similiarities: [0.015787171731331283, 0.1523941240586544, 0.21848779165946428, 0.29048023105113163, 0.20372332657803263, 0.0800747896953036, 0.24604982900042865, 0.2896844116975526, 0.3551417142599678, 0.1711784986033382, 0.18422229124535672, 0.2062600304036366, 0.1457266300462318]
SVD similiarities: [0.30258693983005214, 0.6894464221766998, 0.6510528826328721, 0.9158170261550581, 0.9014647733164831, 0.8875770686708406, 0.9730764354280005, 0.9442058336875112, 0.8599389619220836, 0.7538711515095899, 0.7247901720974329, 0.9672977601866538, 0.6744171266958809]
w2v_similiarities: [-0.06559332713395827, -0.1368602392923372, 0.06293609992224168, 0.32245057111325975, 0.25063532635701413, -0.09251503236998528, 0.115

Now, calculate correlation coefficients between the lists of similarity scores and the real semantic similarity scores from the experiment. Which model's scores correlate best with them? Is this expected? **[4 marks + 2 marks]**

In [14]:
rho, pval = stats.spearmanr(raw_similarities, semantic_similarity)  
print("""Raw Similarities vs. Semantic Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))

rho, pval = stats.spearmanr(ppmi_similarities, semantic_similarity)  
print("""PPMI similarities vs. Semantic Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))

rho, pval = stats.spearmanr(svd_similarities, semantic_similarity)  
print("""svd similarities vs. Semantic Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))

rho, pval = stats.spearmanr(w2v_similarities, semantic_similarity)  
print("""w2v similarities vs. Semantic Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))

rho, pval = stats.spearmanr(w2v_svd_similarities, semantic_similarity)  
print("""w2v svd similarities vs. Semantic Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))

Raw Similarities vs. Semantic Similarity:
rho     = 0.3024
p-value = 0.3153
PPMI similarities vs. Semantic Similarity:
rho     = 0.6920
p-value = 0.0088
svd similarities vs. Semantic Similarity:
rho     = 0.5204
p-value = 0.0683
w2v similarities vs. Semantic Similarity:
rho     = 0.8024
p-value = 0.0010
w2v svd similarities vs. Semantic Similarity:
rho     = 0.6716
p-value = 0.0119


**Your answer should go here:**<br>
The similarities of the w2v model correlate best with the real semantic similarity scores (rho = 0.80), and the result is statistically significant (p = 0.0010). I think this is expected, given the performance of the count-based models in Q3.


We can also calculate correlation coefficients between the lists of cosine similarity scores and the real visual similarity scores from the experiment. Which similarity model best correlates with them? How do the correlation coefficients compare with those from the previous comparison, and can you speculate why we get such results? **[2 marks + 6 marks]**

In [15]:
rho, pval = stats.spearmanr(raw_similarities, visual_similarity)  
print("""Raw Similarities vs. Visual Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))

rho, pval = stats.spearmanr(ppmi_similarities, visual_similarity)  
print("""PPMI similarities vs. Visual Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))

rho, pval = stats.spearmanr(svd_similarities, visual_similarity)  
print("""svd similarities vs. Visual Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))

rho, pval = stats.spearmanr(w2v_similarities, visual_similarity)  
print("""w2v similarities vs. Visual Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))

rho, pval = stats.spearmanr(w2v_svd_similarities, visual_similarity)  
print("""w2v svd similarities vs. Visual Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))

Raw Similarities vs. Visual Similarity:
rho     = 0.5416
p-value = 0.0559
PPMI similarities vs. Visual Similarity:
rho     = 0.7920
p-value = 0.0012
svd similarities vs. Visual Similarity:
rho     = 0.6814
p-value = 0.0103
w2v similarities vs. Visual Similarity:
rho     = 0.7105
p-value = 0.0065
w2v svd similarities vs. Visual Similarity:
rho     = 0.6989
p-value = 0.0079


**Your answer should go here:**<br>
The similarities of the positive pointwise mutual information model give the best correlation score here. I have no idea why. :/

## 5. Discussion

What are the limitations of our approach in this lab? Suggest three ways in which the results could be improved. **[6 marks]**

**Your answer should go here:**<br>

1) Increasing the size of the training corpus might improve the scores, but training time would also increase.<br>
2) Introducing some hard-coded logic into the models might help improve the results in Q3.<br>
3) On the practical side, it would have been nice to have a complete lecture on this subject, in order to better understand the different models and comparisons before starting.<br>

# Literature

[1] C. Silberer and M. Lapata. Learning grounded meaning representations with autoencoders. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 721–732, Baltimore, Maryland, USA, June 23–25, 2014. Association for Computational Linguistics.

[2] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.

[3] O. Levy, Y. Goldberg, and I. Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225, 2015.
