# Using the trained model

The main use of the syntactic rand-walk model is to create composite embeddings for syntactically related pairs of words. Below, we demonstrate how to do this using a model trained specifically for adjective-noun pairs.

After training the syntactic rand-walk model with `train_from_triplecounts.py`, you should have a NumPy archive (`.npz`) file containing the learned parameters. You should also have a text file containing the word embeddings used in training, where line i gives the embedding of word i as a space-separated list of floating-point numbers. Finally, you should have a vocabulary file for the embeddings, where line i contains word i.
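Because the vocabulary and embedding files are aligned only by line number, a quick consistency check before loading can catch a mismatch early. The following is a minimal sketch that assumes the file layout described above. The helper name `check_alignment` and the toy files are hypothetical and not part of the repository.

```python
import os
import tempfile

import numpy as np


def check_alignment(vocab_path, embedding_path):
    """Return (n_words, dim) if both files have the same number of lines.

    Hypothetical helper: raises ValueError when the counts differ.
    """
    with open(vocab_path, "r") as f:
        words = [line.rstrip("\n") for line in f]
    # ndmin=2 keeps a 2-D array even if the file holds a single embedding
    vecs = np.loadtxt(embedding_path, ndmin=2)
    if len(words) != vecs.shape[0]:
        raise ValueError(
            "vocab has %d words but embeddings have %d rows"
            % (len(words), vecs.shape[0])
        )
    return len(words), vecs.shape[1]


# Toy files standing in for the real vocabulary and embedding files
tmpdir = tempfile.mkdtemp()
toy_vocab_file = os.path.join(tmpdir, "vocab.txt")
toy_embedding_file = os.path.join(tmpdir, "vectors.txt")
with open(toy_vocab_file, "w") as f:
    f.write("red\ncar\n")
np.savetxt(toy_embedding_file, np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]))

n_words, dim = check_alignment(toy_vocab_file, toy_embedding_file)
print(n_words, dim)
```

With the real files, the same call would take `vocab_file` and `embedding_file` as defined in the cells below.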

In [3]:
import numpy as np
import pandas as pd
import tensor_operations as to
from collections import defaultdict
from scipy import linalg as la

In [6]:
# set paths to important files
vocab_file = "../../datasets/rw_vocab_no_stopwords.txt"
embedding_file = "../../datasets/rw_vectors.txt"
# machine-specific path; point this at the .npz file written by your own training run
param_file = "/usr/xtmp/abef/learned_params_dep_an_rw.npz"

In [7]:
# load in the vocab, create mapping from word to index
vocab = []
with open(vocab_file,"r") as f:
    for line in f:
        vocab.append(line.strip("\n"))
vocab_dict = defaultdict(lambda : -1) # this will return index -1 if key not found
for i, w in enumerate(vocab):
    vocab_dict[w] = i
    
# load in the word embeddings, compute norm of each embedding
vectors = np.loadtxt(embedding_file)
norms = la.norm(vectors,axis=1)

# load in the learned composition tensor
params = np.load(param_file)
T = params["arr_0"]
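The key `"arr_0"` is the name `np.savez` gives to the first array passed positionally, which suggests (but does not confirm) that the training script saved the tensor that way. If your archive uses different names, the `files` attribute of the loaded object lists them. A small sketch with a toy archive follows; the file name and array shape are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Toy archive: an array passed positionally is stored under the name "arr_0"
toy_path = os.path.join(tempfile.mkdtemp(), "toy_params.npz")
np.savez(toy_path, np.zeros((2, 3)))

# The loaded archive lists its array names in .files
with np.load(toy_path) as toy_params:
    names = list(toy_params.files)
    toy_T = toy_params["arr_0"]

print(names, toy_T.shape)
```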

In [8]:
# suppose word 0 and word 1 form an adjective-noun pair
av = vectors[0] # adjective vector
nv = vectors[1] # noun vector
composite_embedding = av + nv + to.bilinear_lowrank_batch_np(T,av,nv).flatten()
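To make the composition concrete, the sketch below computes the same kind of sum with a dense three-way tensor and `np.einsum`. It assumes the tensor term is the contraction T(a, n, ·), whose entry k is the sum over i and j of `T[i, j, k] * a[i] * n[j]`. The index order and the dense shape are assumptions made for illustration. Judging by its name, `to.bilinear_lowrank_batch_np` works with a low-rank representation of T, and its exact layout is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy embedding dimension

# Assumption: a dense d x d x d tensor standing in for the learned one
T_dense = rng.normal(size=(d, d, d))
a = rng.normal(size=d)  # toy adjective vector
n = rng.normal(size=d)  # toy noun vector

# Contract the first two modes of the tensor with a and n
tensor_term = np.einsum("ijk,i,j->k", T_dense, a, n)

# Composite embedding: additive part plus the tensor correction
composite = a + n + tensor_term

# Same contraction written slice by slice, as a sanity check
check = np.array([a @ T_dense[:, :, k] @ n for k in range(d)])
print(np.allclose(tensor_term, check))
```

When the tensor term is zero, the composite reduces to the plain additive composition `a + n`, which is the baseline used in the comparison below.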

In [9]:
# find the nearest words to a given set of adjective-noun pairs
# compare the standard additive composition with our tensor composition
# use cosine similarity
N = 10
phrases = ["national park","artificial intelligence","european union"]
res_ind = pd.MultiIndex.from_product([phrases, ["additive","tensor"]])
res = pd.DataFrame(data = np.zeros((N,len(phrases)*2)),columns=res_ind)

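for p in phrases:
    a, n = p.split(" ")
    # vocab_dict yields -1 for unknown words, and vectors[-1] would silently
    # return the last embedding, so fail loudly instead
    if vocab_dict[a] < 0 or vocab_dict[n] < 0:
        raise KeyError("word not in vocabulary: " + p)
    av = vectors[vocab_dict[a]]
    nv = vectors[vocab_dict[n]]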
    
    c1 = av + nv
    c2 = av + nv + to.bilinear_lowrank_batch_np(T,av,nv).flatten()
    
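    # rank the vocabulary by cosine similarity; dividing by the norm of c1 or c2
    # is omitted because it does not change the ordering
    unit_vectors = vectors / norms[:, None]
    topwords = [vocab[i] for i in np.argsort(np.dot(unit_vectors, c1))[::-1][:N]]
    res[p, "additive"] = topwords
    topwords = [vocab[i] for i in np.argsort(np.dot(unit_vectors, c2))[::-1][:N]]
    res[p, "tensor"] = topwords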
    
res

| | national park (additive) | national park (tensor) | artificial intelligence (additive) | artificial intelligence (tensor) | european union (additive) | european union (tensor) |
|---|---|---|---|---|---|---|
| 0 | national | park | intelligence | intelligence | european | annexes |
| 1 | park | pisgah | artificial | artificial | union | eec |
| 2 | parks | adjoins | researchers | gchq | europe | stabilisation |
| 3 | historic | exmoor | human | computational | federation | cooperation |
| 4 | forest | geopark | perception | researchers | nations | precondition |
| 5 | recreation | cays | computational | robotics | countries | iea |
| 6 | monument | gunung | knowledge | intelligent | cooperation | saarc |
| 7 | wildlife | wildlife | communication | cybernetics | unions | eurozone |
| 8 | preserve | otway | intelligent | cognition | soviet | cpsu |
| 9 | reserve | badlands | cognition | minds | socialist | insofar |
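
The nearest-neighbour lookup used above can be wrapped in a small reusable function. The sketch below is one possible version, not code from the repository: the name `nearest_words` and the toy data are hypothetical. Unlike the loop above, it also divides by the norm of the query, so the intermediate scores are true cosine similarities; the ranking is the same either way.

```python
import numpy as np


def nearest_words(query, vectors, vocab, top_n=10):
    """Return the top_n vocabulary words closest to query by cosine similarity.

    Hypothetical helper; assumes no row of `vectors` (and not `query`) is all zeros.
    """
    norms = np.linalg.norm(vectors, axis=1)
    sims = vectors @ query / (norms * np.linalg.norm(query))
    order = np.argsort(sims)[::-1][:top_n]  # indices sorted by descending similarity
    return [vocab[i] for i in order]


# Toy 2-D embeddings to show the call pattern
toy_vocab = ["north", "east", "northeast", "south"]
toy_vectors = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, -1.0]])

result = nearest_words(np.array([0.1, 1.0]), toy_vectors, toy_vocab, top_n=2)
print(result)  # ['north', 'northeast']
```

With the real data, the call would look like `nearest_words(composite_embedding, vectors, vocab, N)`.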
