# A2: Vector Semantics

Nikolai Ilinykh, Mehdi Ghanimifard, Wafia Adouane and Simon Dobnik

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Before starting, please read [the following instructions](https://github.com/sdobnik/computational-semantics/blob/master/README.md) on how to work on group assignments.

Write all your answers and the code in the appropriate boxes below.

In this lab we will look at how to build distributional semantic models from corpora and use semantic similarity captured by these models to do semantic tasks. We are also going to examine how different vector composition functions for phrases affect both the model and the learned information about similarities.  

Note that this lab uses code from `dist_erk.py`, which contains functions that closely resemble those shown during the lecture. You can use either set of functions (from the lecture or from the file) to solve the tasks.

In [1]:
# the following command simply imports all the methods from that code.
from dist_erk import *

## 1. Loading a corpus

To train a distributional model, we first need a sufficiently large collection of texts which contain different words used frequently enough in different contexts. Here we will use a section of the Wikipedia corpus which you can download from [here](https://linux.dobnik.net/cloud/index.php/s/isMBj49jt5renYt?path=%2Fresources%2Fa2-distributional-representations) (wikipedia.txt.zip). (This file has been borrowed from another lab by [Richard Johansson](http://www.cse.chalmers.se/~richajo/)).  
When unpacked, the file is 151 MB, so if you are using the MLT servers you should store it in a temporary folder outside your home directory and adjust the `corpus_dir` path below.  
<!-- <It may already exist in `/opt/mlt/courses/cl2015/a5`.> -->


In [2]:
import os
corpus_dir = os.path.abspath("filename/")


## 2. Building a model

Now you are ready to build the model.  
Using the methods from the code imported above, build three word matrices with 1000 dimensions as follows:  

(i) with raw counts (saved to a variable `space_1k`);  
(ii) with PPMI (`ppmispace_1k`);  
(iii) with dimensions reduced by SVD (`svdspace_1k`).  
For the latter use `svddim=5`. **[5 marks]**

Your task is to replace `...` with function calls. Functions are imported from `dist_erk.py` earlier, and they largely resemble functions shown during the lecture.

In [3]:
numdims = 1000
svddim = 5

# which words to use as targets and context words?
# we need to count the words and keep only the N most frequent ones
# which function would you use here with which variable?
ktw = do_word_count(corpus_dir, numdims)

wi = make_word_index(ktw)

words_in_order = sorted(wi.keys(), key=lambda w:wi[w])

make_space(corpus_dir, wi, numdims)

# create different spaces (the original matrix space, the ppmi space, the svd space)
# which functions with which arguments would you use here?
print('create count matrices')
space_1k = make_space(corpus_dir, wi, numdims)
print('ppmi transform')
ppmispace_1k = ppmi_transform(space_1k, wi)
print('svd transform')
svdspace_1k = svd_transform(space_1k, numdims, svddim)
print('done.')

reading file wikipedia123.txt
reading file wikipedia123.txt
create count matrices
reading file wikipedia123.txt
ppmi transform
svd transform
done.
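
To make the PPMI step less of a black box, here is a minimal sketch of the transform on a toy count matrix. It is not the `ppmi_transform` implementation from `dist_erk.py`: the function name `ppmi_sketch` and the toy counts are assumptions made for illustration, and the sketch assumes base-2 logarithms.

```python
import numpy as np

def ppmi_sketch(counts):
    """Positive PMI of a word-by-context count matrix (illustrative only)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row_totals = counts.sum(axis=1, keepdims=True)   # how often each target word occurs
    col_totals = counts.sum(axis=0, keepdims=True)   # how often each context word occurs
    expected = row_totals * col_totals / total       # counts expected under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(counts / expected)
    pmi[~np.isfinite(pmi)] = 0.0                     # zero counts give -inf or nan
    return np.maximum(pmi, 0.0)                      # keep only positive associations

# hypothetical counts: two target words, two context words
toy_counts = np.array([[4, 0],
                       [0, 4]])
toy_ppmi = ppmi_sketch(toy_counts)
print(toy_ppmi)
```

In the toy matrix each word co-occurs with its own context twice as often as chance would predict, so those cells get a PPMI of 1 and the unobserved cells get 0.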


The Oxford Advanced Dictionary has 185,000 words, so a vocabulary of 1,000 words is not representative. We therefore trained a model with 10,000 words and reduced it to 50 dimensions with truncated SVD, which took 40 minutes on a laptop. We saved all three matrices [here](https://linux.dobnik.net/cloud/index.php/s/isMBj49jt5renYt?path=%2Fresources%2Fa2-distributional-representations) (pretrained.zip). Download and unpack them into a `pretrained` folder, which should be a subfolder of the folder containing this notebook:

In [24]:
import numpy as np

numdims = 10000
svddim = 50

print('Please wait...')
ktw_10k       = np.load('./pretrained/ktw_wikipediaktw.npy', allow_pickle=True)
# .item() unwraps the object stored in each loaded single-element array
space_10k     = np.load('./pretrained/raw_wikipediaktw.npy', allow_pickle=True).item()
ppmispace_10k = np.load('./pretrained/ppmi_wikipediaktw.npy', allow_pickle=True).item()
svdspace_10k  = np.load('./pretrained/svd50_wikipedia10k.npy', allow_pickle=True).item()
print('Done.')


Please wait...
Done.
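
The truncated SVD behind the 50-dimensional space can be sketched with plain NumPy. This is an illustration under assumptions, not the `svd_transform` function from `dist_erk.py`: the name `truncated_svd_sketch` and the toy matrix are made up, and the sketch keeps the rows of $U_k \Sigma_k$ as the reduced word vectors, which is one common convention.

```python
import numpy as np

def truncated_svd_sketch(matrix, keep_dims):
    """Reduce each row of `matrix` to `keep_dims` dimensions (illustrative only)."""
    matrix = np.asarray(matrix, dtype=float)
    # singular values in s are returned in descending order
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # keep the strongest `keep_dims` directions, scaled by their singular values
    return u[:, :keep_dims] * s[:keep_dims]

# hypothetical 4 words x 3 contexts matrix
toy_matrix = np.array([[2.0, 0.0, 1.0],
                       [1.0, 1.0, 0.0],
                       [0.0, 3.0, 1.0],
                       [1.0, 0.0, 2.0]])

reduced = truncated_svd_sketch(toy_matrix, 2)
full = truncated_svd_sketch(toy_matrix, 3)
print(reduced.shape)
```

When all dimensions are kept, the dot products between word vectors are unchanged; dropping the weakest dimensions only approximates them, which is the trade-off made when going from 10,000 to 50 dimensions.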


## 3. Testing semantic similarity

The file `similarity_judgements.txt` (a copy is included with this notebook) contains 7,576 pairs of words and their lexical and visual similarities (based on the pictures), collected through crowd-sourcing using Mechanical Turk as described in [1]. The scores range from 1 (highly dissimilar) to 5 (highly similar). Note: this is a different dataset from the phrase similarity dataset we discussed during the lecture (the one from [2]). For more information, please read the papers.

The following code will transform similarity scores into a Python-friendly format:

In [25]:
word_pairs = [] # test suite word pairs
semantic_similarity = [] 
visual_similarity = []
test_vocab = set()


for index, line in enumerate(open('similarity_judgements.txt')):
    data = line.strip().split('\t')
    if index > 0 and len(data) == 3:
        w1, w2 = tuple(data[0].split('#'))
        # it will check if both words from each pair exist in the word matrix.
        if w1 in ktw_10k and w2 in ktw_10k:
            word_pairs.append((w1, w2))
            test_vocab.update([w1, w2])
            semantic_similarity.append(float(data[1]))
            visual_similarity.append(float(data[2]))
        
print('number of available words to test:', len(test_vocab-(test_vocab-set(ktw))))
print('number of available word pairs to test:', len(word_pairs))
#list(zip(word_pairs, visual_similarity, semantic_similarity))


number of available words to test: 155
number of available word pairs to test: 774


Now we are going to test how the cosine similarity between vectors of each of the three spaces (normal space, ppmi, svd) compares with the human similarity judgements for the words in the similarity dataset. Which of the three spaces best approximates human judgements?

To compare several scores, we can use the [Spearman correlation coefficient](https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient), which is implemented in `scipy.stats.spearmanr` [here](https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.spearmanr.html). The values of the Spearman correlation coefficient range from -1 to 1, where 0 indicates no correlation, 1 a perfect positive correlation and -1 a perfect negative correlation. Hence, the greater the number, the better the similarity scores align. The p-value tells us whether the coefficient is statistically significant. For this to be the case, it must be less than $0.05$.

Here is how you can calculate Spearman's correlation coefficient between the scores of visual similarity and semantic similarity of the available words in the test suite:

In [26]:
from scipy import stats

rho, pval = stats.spearmanr(semantic_similarity, visual_similarity)
print("""Visual Similarity vs. Semantic Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))


Visual Similarity vs. Semantic Similarity:
rho     = 0.7122
p-value = 0.0000
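
To see what the coefficient measures, the sketch below computes Spearman's rho as Pearson's correlation applied to ranks. The helper `spearman_no_ties` and the toy lists are hypothetical, and the sketch assumes there are no tied values. The similarity judgements do contain ties, so `stats.spearmanr`, which handles them, remains the right tool for the actual task.

```python
import numpy as np

def spearman_no_ties(x, y):
    """Spearman's rho for lists without tied values (illustrative only)."""
    # argsort of argsort turns values into ranks (0 = smallest) when there are no ties
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    # Pearson correlation of the two rank vectors
    return np.corrcoef(rank_x, rank_y)[0, 1]

scores = [1.0, 2.0, 3.0, 4.0, 5.0]
cubed = [v ** 3 for v in scores]           # non-linear, but the order is preserved

rho_monotonic = spearman_no_ties(scores, cubed)
rho_reversed = spearman_no_ties(scores, cubed[::-1])
print(rho_monotonic, rho_reversed)
```

Because only the ordering matters, the cubed scores still correlate perfectly with the originals, and reversing the order gives a perfect negative correlation. This is why rho suits a comparison of cosine similarities (between -1 and 1) with human ratings (1 to 5), which live on different scales.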


Let's now calculate the cosine similarity scores of all word pairs in an ordered list using all three matrices. **[6 marks]**

In [29]:
def cosine(word1, word2, space):
    # cosine similarity between the vectors of two words in the given space
    vec1 = space[word1]
    vec2 = space[word2]

    veclen1 = veclen(vec1)
    veclen2 = veclen(vec2)

    if veclen1 == 0.0 or veclen2 == 0.0:
        # one of the vectors is empty. make the cosine zero.
        return 0.0

    else:
        # we could also simply do:
        # dotproduct = np.dot(vec1, vec2)
        dotproduct = np.sum(vec1 * vec2)

        return dotproduct / (veclen1 * veclen2)

raw_similarities  = [cosine(w1, w2, space_10k) for w1, w2 in word_pairs]
ppmi_similarities = [cosine(w1, w2, ppmispace_10k) for w1, w2 in word_pairs]
svd_similarities  = [cosine(w1, w2, svdspace_10k) for w1, w2 in word_pairs]


Now, calculate correlation coefficients between the lists of cosine similarity scores and the real semantic similarity scores from the experiment. Which model's scores correlate best with them? Is this expected? **[6 marks]**

In [30]:

rho, pval = stats.spearmanr(semantic_similarity, raw_similarities)
print("""Semantic Similarity vs Raw Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))

rho, pval = stats.spearmanr(semantic_similarity, ppmi_similarities)
print("""Semantic Similarity vs PPMI Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))

rho, pval = stats.spearmanr(semantic_similarity, svd_similarities)
print("""Semantic Similarity vs SVD Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))


Semantic Similarity vs Raw Similarity:
rho     = 0.1522
p-value = 0.0000
Semantic Similarity vs PPMI Similarity:
rho     = 0.4547
p-value = 0.0000
Semantic Similarity vs SVD Similarity:
rho     = 0.4232
p-value = 0.0000


The PPMI model's similarity scores correlate best with the human semantic similarity scores from the crowd-sourced experiment, although the difference between the PPMI and SVD correlations is small (0.4547 vs. 0.4232). This outcome is expected because PPMI takes into account how distinctive a word is in a given context, i.e. it reduces the importance of words that occur frequently across all contexts. Additionally, SVD reduces noise in the data, which helps to expose the underlying signal.

We can also calculate correlation coefficients between lists of cosine similarity scores and the real visual similarity scores from the experiment. Which similarity model best correlates with them? How do the correlation coefficients compare with those from the previous comparison, and can you speculate why we get such results? **[7 marks]**

In [31]:
rho, pval = stats.spearmanr(visual_similarity, raw_similarities)
print("""Visual Similarity vs Raw Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))

rho, pval = stats.spearmanr(visual_similarity, ppmi_similarities)
print("""Visual Similarity vs. PPMI Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))

rho, pval = stats.spearmanr(visual_similarity, svd_similarities)
print("""Visual Similarity vs. SVD Similarity:
rho     = {:.4f}
p-value = {:.4f}""".format(rho, pval))

Visual Similarity vs Raw Similarity:
rho     = 0.1212
p-value = 0.0007
Visual Similarity vs. PPMI Similarity:
rho     = 0.3838
p-value = 0.0000
Visual Similarity vs. SVD Similarity:
rho     = 0.3097
p-value = 0.0000


Compared against the visual similarity scores, the three matrices rank in the same order as for semantic similarity above, but all of them correlate less well. The correlations for the PPMI and SVD matrices drop more sharply than that of the raw matrix, though they still outperform it by a considerable margin. It is unsurprising that the overall performance is lower - the matrices were built from text and never took visual similarity into account. It is a bit surprising that the raw matrix's performance does not drop more than it does. This may indicate that its performance was never particularly good to begin with.

## 4. Operations on similarities

We can perform mathematical operations on vectors to derive meaning predictions. For example, we can subtract the normalised vectors for `king` minus `queen` and add the resulting vector to `man` and we hope to get the vector for `woman`. Why? **[3 marks]**

The above example is incorrect as stated: `king` minus `queen` points from queen to king, so adding it to `man` moves away from `woman`. We can still apply the principle, however. We know in advance that the primary difference between a king and a queen (in terms of real entities) is gender. We can attempt to capture this semantic information in the vector space by taking the difference between the vectors of the two entities.

Thus we hope that (queen-king) gives us the mapping from king to queen in the vector space, where our key assumption is that the mapping IS the gender difference.

If this difference vector is a semantic representation of gender that maps between similar entities/nouns which differ only in gender, then ((queen-king)+man) should approximate (woman), provided that the difference between king and queen and the difference between man and woman are both their gender.
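
The reasoning can be checked on a hand-made example. The two-dimensional vectors below are entirely hypothetical (dimension 0 stands for "royalty", dimension 1 for "gender") and are chosen so that the gender assumption holds exactly; real corpus-derived vectors are much noisier.

```python
import numpy as np

# Hypothetical 2-d vectors: dimension 0 = "royalty", dimension 1 = "gender"
toy_space = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

def cosine_sim(a, b):
    """Cosine similarity of two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# (queen - king) isolates the gender offset; adding it to man should land on woman
predicted = toy_space["queen"] - toy_space["king"] + toy_space["man"]

ranking = sorted(toy_space,
                 key=lambda w: cosine_sim(predicted, toy_space[w]),
                 reverse=True)
print(ranking[0])
```

With the operands in the order given in the question (king - queen + man), the same toy space yields a vector pointing away from `woman`, which is why the direction of the subtraction matters.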

Here is some helpful code that allows us to calculate such comparisons.

In [32]:
from scipy.spatial import distance

def normalize(vec):
    return vec / veclen(vec)

def find_similar_to(vec1, space):
    # vector similarity function
    #sim_fn = lambda a, b: 1-distance.euclidean(normalize(a), normalize(b))
    #sim_fn = lambda a, b: 1-distance.correlation(a, b)
    #sim_fn = lambda a, b: 1-distance.cityblock(normalize(a), normalize(b))
    #sim_fn = lambda a, b: 1-distance.chebyshev(normalize(a), normalize(b))
    #sim_fn = lambda a, b: np.dot(normalize(a), normalize(b))
    sim_fn = lambda a, b: 1-distance.cosine(a, b)

    sims = [
        (word2, sim_fn(vec1, space[word2]))
        for word2 in space.keys()
    ]
    return sorted(sims, key = lambda p:p[1], reverse=True)

Here is how you apply this code. Comment on the results you get. **[3 marks]**

In [33]:
short = normalize(svdspace_10k['short'])
light = normalize(svdspace_10k['light'])
long = normalize(svdspace_10k['long'])
heavy = normalize(svdspace_10k['heavy'])

find_similar_to(light - (heavy - long), svdspace_10k)[:10]

[('long', 0.8733111261346901),
 ('above', 0.8259671977311955),
 ('around', 0.8030776291120685),
 ('sun', 0.7692439111243973),
 ('just', 0.7678481974778111),
 ('wide', 0.767257431992253),
 ('each', 0.7665960260861158),
 ('circle', 0.7647746702909336),
 ('length', 0.7601066921319761),
 ('almost', 0.7542351860536628)]

The vector for `short` is not among the 10 most similar vectors. Perhaps the relation between light and heavy is not similar enough to the relation between short and long for this to work.

Find 5 similar pairs of pairs of words and test them. Hint: Google for `word analogies examples`. You can also construct analogies that are less lexical but more grammatical, e.g. `see, saw, leave, ?` or analogies that are based on world knowledge as in the [Google analogy dataset](http://download.tensorflow.org/data/questions-words.txt) from [3]. Does the resulting vector similarity confirm your expectations? But remember you can only do this if the words are contained in our vector space with 10,000 dimensions. **[10 marks]**

In [34]:

stockholm = normalize(svdspace_10k['stockholm'])
sweden = normalize(svdspace_10k['sweden'])
copenhagen = normalize(svdspace_10k['copenhagen'])
denmark = normalize(svdspace_10k['denmark'])


south = normalize(svdspace_10k['south'])
north = normalize(svdspace_10k['north'])
east = normalize(svdspace_10k['east'])
west = normalize(svdspace_10k['west'])



week = normalize(svdspace_10k['week'])
work = normalize(svdspace_10k['work'])
fun = normalize(svdspace_10k['fun'])
weekend = normalize(svdspace_10k['weekend'])


up = normalize(svdspace_10k['up'])
high = normalize(svdspace_10k['high'])
down = normalize(svdspace_10k['down'])
low = normalize(svdspace_10k['low'])


man = normalize(svdspace_10k['man'])
woman = normalize(svdspace_10k['woman'])
suit = normalize(svdspace_10k['suit'])
dress = normalize(svdspace_10k['dress'])


In [35]:
find_similar_to(sweden - (stockholm - copenhagen), svdspace_10k)[:10]

[('sweden', 0.9182356045804703),
 ('denmark', 0.8580303489267215),
 ('france', 0.8440013978711486),
 ('spain', 0.8301392747312987),
 ('portugal', 0.8213020089711681),
 ('austria', 0.8207164058301429),
 ('ireland', 0.8185169771255273),
 ('russia', 0.8167576409081696),
 ('belgium', 0.813768120248223),
 ('scotland', 0.8091028834377446)]

In [36]:
find_similar_to(north - (south - east), svdspace_10k)[:10]

[('north', 0.9771493003083315),
 ('west', 0.9261105860453677),
 ('94', 0.8684855057550237),
 ('59', 0.8567001838313205),
 ('81', 0.8508666429149574),
 ('47', 0.8498260606055098),
 ('51', 0.8485466875663064),
 ('83', 0.8476736758957539),
 ('57', 0.8472095348639865),
 ('97', 0.846783792744607)]

In [37]:
find_similar_to(((week - work) + fun), svdspace_10k)[:10]

[('midnight', 0.7593186401972436),
 ('week', 0.7410206061707413),
 ('tonight', 0.73755040038101),
 ('weekend', 0.7337554900094316),
 ('lucky', 0.7174816015710194),
 ('lonely', 0.7124418012899021),
 ('afternoon', 0.7107110404742536),
 ('dirty', 0.7092583029180727),
 ('fun', 0.7088023835686386),
 ('tomorrow', 0.7016167908607219)]

In [38]:
find_similar_to(((down - low) + high), svdspace_10k)[:10]

[('down', 0.9228344781988292),
 ('back', 0.8866341446782393),
 ('off', 0.8755815493034719),
 ('away', 0.8669123137105386),
 ('up', 0.8385302599781487),
 ('just', 0.8143869669353988),
 ('behind', 0.8114415466351721),
 ('front', 0.8048615445972288),
 ('left', 0.7870710896231837),
 ('home', 0.7829769169141294)]

In [39]:
find_similar_to(woman-man + suit, svdspace_10k)[:10]

[('suit', 0.9416501644222036),
 ('suits', 0.8511995979510889),
 ('trademark', 0.7799054556696589),
 ('costume', 0.7650391997980102),
 ('holder', 0.7605134204463706),
 ('owners', 0.7565212650600517),
 ('stamp', 0.749830644173649),
 ('contract', 0.7466969582149255),
 ('ballot', 0.7447614655933226),
 ('dress', 0.7425159418166167)]

In each case our expected word did occur, but never as the most similar word to the resulting vector, although it typically occurred in the top half of the list with a fairly high similarity. The exception is (woman-man+suit), where `dress` occurs at the bottom of the list. However, its cosine similarity of 0.742 is not particularly low: in the example involving week, work and fun, even the most similar vector scores only about 0.02 higher than `dress` does here. Being at the bottom of the list does not mean that `dress` is a poor match, only that 9 other vectors are closer in this case. This, together with the other results, raises the question of how to select the vector that we actually want, since it is never the top one.

The two similar elements contribute less to the resulting vector than the dissimilar element. For example, in (woman - man + suit), `man` and `woman` are similar vectors that largely cancel out, so the resulting vector stays closer to `suit` than to `man` or `woman`.
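
One possible answer to the selection question is to leave the query words out of the candidate list, a convention often used when evaluating analogies. The sketch below illustrates this on a made-up space; `rank_candidates`, the `exclude` parameter and all vectors are hypothetical and not part of `find_similar_to` above.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity of two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_candidates(target, space, exclude=()):
    """Rank the words in `space` by similarity to `target`, skipping `exclude`."""
    sims = [(word, cosine_sim(target, vec))
            for word, vec in space.items() if word not in exclude]
    return sorted(sims, key=lambda p: p[1], reverse=True)

# Hypothetical vectors: man and woman are close to each other,
# so (woman - man) is a small offset compared with the suit vector
toy_space = {
    "man":      np.array([0.0,  0.2, 1.0]),
    "woman":    np.array([0.0, -0.2, 1.0]),
    "suit":     np.array([1.0,  0.5, 0.0]),
    "dress":    np.array([1.0, -0.5, 0.0]),
    "contract": np.array([0.8,  0.6, 0.6]),
}

target = toy_space["woman"] - toy_space["man"] + toy_space["suit"]

unfiltered = rank_candidates(target, toy_space)
filtered = rank_candidates(target, toy_space, exclude=("man", "woman", "suit"))
print(unfiltered[0][0], filtered[0][0])
```

In this toy space the unfiltered ranking is dominated by `suit`, as in the results above, while the filtered ranking puts `dress` first. Whether the same holds for the 10,000-word space would have to be tested.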


## 5. Semantic composition and phrase similarity **[20 marks]**

In this task, we are going to look at how different semantic composition models, introduced in [2], correlate with human judgements. The file with the dataset is `mitchell_lapata_acl08.txt`, included with this notebook. Your task is to do the following:  

(i) process the dataset, extract pairs of `reference - landmark high` and `reference - landmark low`; you can use the code from the lecture as something to start with. Note that there are 2 landmarks for each reference: one landmark exhibits high similarity with the reference, while another one has low similarity with the reference. A single human participant could have evaluated both of these pairs. For more details, we refer you to the paper.  

(ii) build models of semantic phrase composition: in the lecture we introduced simple additive, simple multiplicative and combined models (details are in [2]). Your task is to take a single pair (a reference or a high similarity landmark or a low similarity landmark) and compute the composition of its vectors using each of these functions. Thus, you will have three compositional models that take a `noun - verb` pair and output a single vector, representing the meaning of this pair. As your semantic space, you can use pretrained spaces (standard space, ppmi or svd) introduced above. It is up to you which space you use, but for someone who runs your code, it should be pretty straightforward to switch between them.

(iii) calculate Spearman correlation between each model's predictions and human judgements; you should have something similar to the scores that are shown in the paper [2]:  

![title](./res.png)
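
The three composition functions from step (ii) can be sketched in vectorised form as follows. The function name `compose`, its parameters and the toy vectors are assumptions made for illustration; the default weights are the ones used in the combined model further down in this notebook (0.0 for the noun, 0.95 for the verb, 0.05 for the product).

```python
import numpy as np

def compose(noun_vec, verb_vec, model="additive",
            alpha=0.0, beta=0.95, gamma=0.05):
    """Compose a noun vector and a verb vector into one phrase vector."""
    u = np.asarray(noun_vec, dtype=float)
    v = np.asarray(verb_vec, dtype=float)
    if model == "additive":
        return u + v
    if model == "multiplicative":
        return u * v                          # element-wise product
    if model == "combined":
        return alpha * u + beta * v + gamma * (u * v)
    raise ValueError("unknown composition model: " + model)

# hypothetical 3-dimensional count vectors
noun = np.array([6.0, 0.0, 2.0])
verb = np.array([1.0, 3.0, 2.0])

additive = compose(noun, verb, "additive")
multiplicative = compose(noun, verb, "multiplicative")
combined = compose(noun, verb, "combined")
print(additive, multiplicative, combined)
```

Passing the model name as an argument keeps the rest of the pipeline unchanged when switching between composition functions. Note how the multiplicative model zeroes out any dimension in which either word has a zero count.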

The paper states that they calculated correlations between each individual participant's judgements and each model's predictions.  

Let's say we have 3 models: simple additive (A), simple multiplicative (M), combined (C).  
From our task dataset, we also know that we have 20 participants.  
Now, for each of the 20 participants, we get all `verb - noun` pairs that this participant evaluated.  
For example:

In [7]:
participant_judgements_example = [
 'participant50 chatter child gabble 6 high',
 'participant50 chatter tooth click 2 high',
 'participant50 reel head whirl 5 high',
 'participant50 reel mind stagger 4 low',
 'participant50 reel industry stagger 5 high',
 'participant50 reel man whirl 3 low',
 'participant50 glow fire beam 7 low',
 'participant50 glow face burn 3 low',
 'participant50 glow cigar burn 5 high',
 'participant50 glow skin beam 7 high'
]

In [8]:
participant_judgements_example

['participant50 chatter child gabble 6 high',
 'participant50 chatter tooth click 2 high',
 'participant50 reel head whirl 5 high',
 'participant50 reel mind stagger 4 low',
 'participant50 reel industry stagger 5 high',
 'participant50 reel man whirl 3 low',
 'participant50 glow fire beam 7 low',
 'participant50 glow face burn 3 low',
 'participant50 glow cigar burn 5 high',
 'participant50 glow skin beam 7 high']

Let's look at the first pair that participant50 evaluated: the reference `child chatter` and the high-similarity landmark (as the last word in the row indicates) `child gabble`. The human gave a similarity score of 6 (very similar). Thus, human similarity judgment = [6].  

Our A model's output:  
cosine(p1, p2) = 0.88, where p1 is the result of adding the word vectors in the reference phrase `child chatter`, and p2 is the result of adding the word vectors in the high-similarity landmark phrase `child gabble`.  

Therefore, we have the human rating vector [6] and the model A output [0.88]. The next step is to compute the correlation between these two vectors.

To get an overall score, simply average your correlation scores over all participants, since you are calculating correlation scores per participant.

Of course, your human rating vectors will be longer (e.g., [6, 7, 3, 4, 5]), where each element is a participant's judgement of a specific pair. Each of your models (A, M, C) will produce a single vector of cosine similarities for these same pairs (e.g., [0.89, 0.98, 0.23, 0.65, 0.55]). The goal is to compare each model's cosine similarity vector with the human rating vectors and identify the model whose output is closest to the way humans rate similarity between the phrases.
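
The per-participant procedure described above can be sketched as follows. Everything in the example is an assumption made for illustration: the function `mean_participant_rho`, the two participants with their ratings, and the lookup table `toy_model_scores`, which stands in for the cosine similarities a real composition model would produce. Only the column order follows the dataset.

```python
from collections import defaultdict
from scipy import stats

# Made-up judgements in the dataset's column order:
# participant verb noun landmark input hilo
toy_lines = [
    "p1 glow fire burn 1 high",
    "p1 glow face beam 2 high",
    "p1 boom gun thunder 3 high",
    "p1 boom export prosper 4 high",
    "p2 glow fire burn 1 high",
    "p2 glow face beam 2 high",
    "p2 boom gun thunder 4 high",
    "p2 boom export prosper 3 high",
]

# Made-up model output: cosine(reference phrase, landmark phrase) per item
toy_model_scores = {
    ("glow", "fire", "burn"): 0.1,
    ("glow", "face", "beam"): 0.2,
    ("boom", "gun", "thunder"): 0.3,
    ("boom", "export", "prosper"): 0.4,
}

def mean_participant_rho(lines, model_scores):
    """Average Spearman's rho between each participant's ratings and the model."""
    ratings = defaultdict(list)
    predictions = defaultdict(list)
    for line in lines:
        participant, verb, noun, landmark, rating, _hilo = line.split()
        ratings[participant].append(float(rating))
        predictions[participant].append(model_scores[(verb, noun, landmark)])
    rhos = []
    for participant in ratings:
        rho, _pval = stats.spearmanr(ratings[participant], predictions[participant])
        rhos.append(rho)
    return sum(rhos) / len(rhos)

mean_rho = mean_participant_rho(toy_lines, toy_model_scores)
print(round(mean_rho, 4))
```

Here the first participant agrees perfectly with the model (rho = 1.0) and the second swaps two items (rho = 0.8), so the average is 0.9. Running the same function once per composition model gives the figures to compare.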

The minimum to do in this task: compute correlations for 3 models mentioned above and human rating for AT LEAST one participant. Elaborate on how different the resulting correlation scores are depending on the model's composition function (additive, multiplicative, combined). For examples on how to interpret the results, look at Section 5 Results of the original paper.

In [9]:
import string
from nltk.corpus import stopwords
ktw = do_word_count(corpus_dir, 20000)

# build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def preprocess(s):
    # s is a list of words: drop stopwords and punctuation tokens, lowercase the rest
    return [x.lower() for x in s if x not in stop_words and x not in string.punctuation]

ktw = preprocess(ktw)

reading file wikipedia123.txt


In [10]:
wi2 = make_word_index(ktw)
space_20k = make_space(corpus_dir, wi2, 19865)

reading file wikipedia123.txt


In [11]:
# load the task dataset
with open('./mitchell_lapata_acl08.txt', 'r') as f:
    phrase_dataset = f.read().splitlines()

for line in phrase_dataset[:10]:
    print(line)
    
# get all unique words
words = []
for line in phrase_dataset[1:]:
    _, verb, noun, landmark, _, _ = line.split()
    if verb not in words:
        words.append(verb)
    if noun not in words:
        words.append(noun)
    if landmark not in words:
        words.append(landmark)



#this is our corpus, if words not in corpus remove
our_dataset = ktw

to_remove = []
for w in words:
    if w not in our_dataset:
        to_remove.append(w)
        

cleaned_phrase_dataset = []
for line in phrase_dataset:
    _, verb, noun, landmark, _, _ = line.split()
    if verb in to_remove or noun in to_remove or landmark in to_remove:
        continue
    cleaned_phrase_dataset.append(line)

target_words = []
for line in cleaned_phrase_dataset[1:]:
    _, verb, noun, landmark, _, _ = line.split()
    if verb not in target_words:
        target_words.append(verb)
    if noun not in target_words:
        target_words.append(noun)
    if landmark not in target_words:
        target_words.append(landmark)
        



participant verb noun landmark input hilo
participant20 stray thought roam 7 low
participant20 stray discussion digress 6 high
participant20 stray eye roam 7 high
participant20 stray child digress 1 low
participant20 throb body pulse 5 high
participant20 throb head shudder 2 low
participant20 throb voice shudder 3 low
participant20 throb vein pulse 6 high
participant20 chatter machine click 4 high


In [12]:
print(cleaned_phrase_dataset)

['participant verb noun landmark input hilo', 'participant20 bow butler submit 3 low', 'participant20 bow company submit 5 high', 'participant20 boom noise prosper 3 low', 'participant20 boom export prosper 7 high', 'participant20 boom sale thunder 3 low', 'participant20 boom gun thunder 6 high', 'participant20 glow fire burn 5 high', 'participant20 glow face beam 7 high', 'participant20 glow skin burn 1 low', 'participant21 bow butler submit 2 low', 'participant21 bow company submit 6 high', 'participant21 boom noise prosper 1 low', 'participant21 boom export prosper 6 high', 'participant21 boom gun thunder 6 high', 'participant21 boom sale thunder 5 low', 'participant21 glow fire burn 5 high', 'participant21 glow face beam 7 high', 'participant21 glow skin burn 2 low', 'participant22 bow butler submit 2 low', 'participant22 bow company submit 6 high', 'participant22 boom noise prosper 1 low', 'participant22 boom export prosper 6 high', 'participant22 boom gun thunder 7 high', 'partic

In [13]:

target_words

['bow',
 'butler',
 'submit',
 'company',
 'boom',
 'noise',
 'prosper',
 'export',
 'sale',
 'thunder',
 'gun',
 'glow',
 'fire',
 'burn',
 'face',
 'beam',
 'skin',
 'head',
 'government']

In [14]:

our_space = space_20k

In [15]:
import numpy as np

def build_phrase_space(phrase, x_names):
    # phrase is a [noun, verb] pair; x_names lists the dimensions of the space

    # first we get the vectors for the noun (subject) and the verb
    subject_space = our_space[phrase[0]]
    verb_space = our_space[phrase[1]]

    representation = np.zeros(len(x_names))

    # compose the two vectors element by element
    for i in range(len(x_names)):
        # take the i-th element of each vector
        subject_value = subject_space[i]
        verb_value = verb_space[i]

        # simple additive model:
        #out = subject_value + verb_value
        # simple multiplicative model:
        #out = subject_value * verb_value
        # e.g. for the values 6 and 0, addition gives 6 but multiplication gives 0

        # weighted additive model:
        #out = subject_value * 0.2 + verb_value * 0.8
        # combined model (weighted addition plus multiplication):
        out = subject_value * 0.0 + verb_value * 0.95 + (0.05 * subject_value * verb_value)

        representation[i] = out

    return representation



In [16]:

import math

# length (Euclidean norm) of a vector
def veclen(vector):
    return math.sqrt(np.sum(np.square(vector)))

# cosine similarity between two (composed phrase) vectors
def cosine(vector1, vector2):
    veclen1 = veclen(vector1)
    veclen2 = veclen(vector2)
    if veclen1 == 0.0 or veclen2 == 0.0:
        # one of the vectors is empty, the cosine is 0
        return 0.0
    else:
        dotproduct = np.dot(vector1, vector2)
        return dotproduct / (veclen1 * veclen2)
    

In [17]:

#get high and low values with words present in semantic space
lows = []
highs = []
for item in cleaned_phrase_dataset[1:]:
    item = item.split()
    participant = item[0]
    verb = item[1]
    noun = item[2]
    landmark = item[3]
    inp = item[4]
    hilo = item[5]
    if hilo == "high":
        highs.append(item)
    elif hilo == "low":
        lows.append(item)
    

def average_cosine(list_x):
    values = 0
    for item in list_x:
        verb = item[1]
        noun = item[2]
        landmark = item[3]
        
        reference = [noun, verb]
        landmark = [noun, landmark]
        
        ref = build_phrase_space(reference, ktw)
        X = build_phrase_space(landmark, ktw)
        cosine_value = cosine(ref, X)
        values += cosine_value
        
    return values / len(list_x)
    




In [18]:

reference = ['face', 'glow']
landmark_high = ['face', 'beam']
landmark_low = ['face', 'burn']

ref = build_phrase_space(reference, ktw)

lhigh = build_phrase_space(landmark_high, ktw)

llow = build_phrase_space(landmark_low, ktw)




cosine(ref, llow)

0.7250612423113745

In [21]:

cosine(ref, lhigh)

0.5889442468718049

In [19]:
average_cosine(highs)

0.5431125524977768

In [20]:
average_cosine(lows)

0.4933942062382572

**Any comments/thoughts should go here:**

# Literature

  - [1] C. Silberer and M. Lapata. Learning grounded meaning representations with autoencoders. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 721–732, Baltimore, Maryland, USA, June 23–25, 2014. Association for Computational Linguistics.  

  - [2] Mitchell, J., & Lapata, M. (2008). Vector-based Models of Semantic Composition. In Proceedings of ACL-08: HLT (pp. 236–244). Association for Computational Linguistics.
  
  - [3] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

## Marks

This assignment has a total of 60 marks.