## Training the Poincaré embedding

In [1]:
%load_ext autoreload   
%autoreload 2

import os
import logging
import numpy as np

from gensim.models.poincare import PoincareModel, PoincareKeyedVectors, PoincareRelations

logging.basicConfig(level=logging.INFO)

In [2]:
# poincare_directory = os.path.join(os.getcwd(), 'ML-stuff','poincare-embeddings')
data_directory = os.path.join(os.getcwd(), 'data')
wordnet_mammal_file = os.path.join(data_directory, 'wordnet_mammal_hypernyms.tsv')

The model can be initialized with an iterable of relations, where a relation is simply a pair of nodes:

In [3]:
model = PoincareModel(train_data=[('node.1', 'node.2'), ('node.2', 'node.3')])

INFO:gensim.models.poincare:loading relations from train data..
INFO:gensim.models.poincare:loaded 2 relations from train data, 3 nodes


The model can also be initialized from a CSV-like file containing one relation per line. The module provides a convenience class, `PoincareRelations`, for this purpose.

In [7]:
relations = PoincareRelations(file_path='wordnet_mammal_hypernyms.tsv', delimiter='\t')
model = PoincareModel(train_data=relations)

INFO:gensim.models.poincare:loading relations from train data..
INFO:gensim.models.poincare:loaded 7724 relations from train data, 1182 nodes
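
For reference, a relations file in this layout can be produced and read back with the standard library alone. The sketch below assumes a tab-separated file with one pair per line and no header row, which is what the `delimiter='\t'` argument above suggests. The file name `toy_relations.tsv` and the pairs themselves are made up for illustration, and `stream_relations` is a hypothetical stand-in, not `PoincareRelations` itself.

```python
import csv
import os
import tempfile

# Hypothetical toy pairs, one relation per line once written to disk.
toy_relations = [
    ('dog.n.01', 'canine.n.02'),
    ('canine.n.02', 'carnivore.n.01'),
    ('carnivore.n.01', 'mammal.n.01'),
]

# Write the pairs as tab-separated lines with no header row.
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_relations.tsv')
with open(toy_path, 'w', newline='', encoding='utf8') as f:
    csv.writer(f, delimiter='\t').writerows(toy_relations)


def stream_relations(path, delimiter='\t'):
    """Yield one (node_1, node_2) tuple per line of a delimited text file."""
    with open(path, newline='', encoding='utf8') as f:
        for row in csv.reader(f, delimiter=delimiter):
            yield tuple(row)


loaded = list(stream_relations(toy_path))
print(loaded)
```

Because the reader is a generator, the file is streamed line by line rather than loaded into memory at once, which is presumably the motivation for a streaming helper on larger relation files.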


Note that the above only initializes the model and does not begin training. To train the model: 

In [9]:
model = PoincareModel(train_data=relations, size=2, burn_in=0)
model.train(epochs=1, print_every=500)

INFO:gensim.models.poincare:loading relations from train data..
INFO:gensim.models.poincare:loaded 7724 relations from train data, 1182 nodes
INFO:gensim.models.poincare:training model of size 2 with 1 workers on 7724 relations for 1 epochs and 0 burn-in epochs, using lr=0.10000 burn-in lr=0.01000 negative=10
INFO:gensim.models.poincare:starting training (1 epochs)----------------------------------------
INFO:gensim.models.poincare:training on epoch 1, examples #4990-#5000, loss: 23.57
INFO:gensim.models.poincare:time taken for 5000 examples: 0.59 s, 8459.29 examples / s
INFO:gensim.models.poincare:training finished


The same model can be trained for additional epochs if it has not yet converged.

In [10]:
model.train(epochs=1, print_every=500)

INFO:gensim.models.poincare:training model of size 2 with 1 workers on 7724 relations for 1 epochs and 0 burn-in epochs, using lr=0.10000 burn-in lr=0.01000 negative=10
INFO:gensim.models.poincare:starting training (1 epochs)----------------------------------------
INFO:gensim.models.poincare:training on epoch 1, examples #4990-#5000, loss: 22.37
INFO:gensim.models.poincare:time taken for 5000 examples: 0.59 s, 8546.03 examples / s
INFO:gensim.models.poincare:training finished


The model can be saved and loaded in two different ways:

In [13]:
# Saves the entire PoincareModel instance; the loaded model can be trained further
os.makedirs('models', exist_ok=True)  # saving raises FileNotFoundError if the directory is missing
model.save(os.path.join('models', 'test_model'))
PoincareModel.load(os.path.join('models', 'test_model'))

INFO:gensim.utils:saving PoincareModel object under models/test_model, separately None
INFO:gensim.utils:not storing attribute _node_counts_cumsum
INFO:gensim.utils:not storing attribute _node_probabilities

In [14]:
# Saves only the vectors from the PoincareModel instance, in the commonly used word2vec format
os.makedirs('models', exist_ok=True)  # saving raises FileNotFoundError if the directory is missing
model.kv.save_word2vec_format(os.path.join('models', 'test_vectors'))
PoincareKeyedVectors.load_word2vec_format(os.path.join('models', 'test_vectors'))

INFO:gensim.models.utils_any2vec:storing 1182x2 projection weights into models/test_vectors

## Train Poincare Model on WordNet data

**Parameters:**	
**train_data (iterable of (str, str))** – Iterable of relations, e.g. a list of tuples, or a PoincareRelations instance streaming from a file. Note that the relations are treated as ordered pairs, i.e. a relation (a, b) does not imply the opposite relation (b, a). In case the relations are symmetric, the data should contain both relations (a, b) and (b, a).
**size (int, optional)** – Number of dimensions of the trained model.  
**alpha (float, optional)** – Learning rate for training.  
**negative (int, optional)** – Number of negative samples to use.  
**workers (int, optional)** – Number of threads to use for training the model.  
**epsilon (float, optional)** – Constant used for clipping embeddings below a norm of one.  
**regularization_coeff (float, optional)** – Coefficient used for l2-regularization while training (0 effectively disables regularization).  
**burn_in (int, optional)** – Number of epochs to use for burn-in initialization (0 means no burn-in).  
**burn_in_alpha (float, optional)** – Learning rate for burn-in initialization, ignored if burn_in is 0.  
**init_range (2-tuple (float, float))** – Range within which the vectors are randomly initialized.  
**dtype (numpy.dtype)** – The numpy dtype to use for the vectors in the model (numpy.float64, numpy.float32 etc). Using lower precision floats may be useful in increasing training speed and reducing memory usage.  
**seed (int, optional)** – Seed for random to ensure reproducibility.  
  
gensim.models.poincare.PoincareModel *(train_data, size=50, alpha=0.1, negative=10, workers=1, epsilon=1e-05, regularization_coeff=1.0, burn_in=10, burn_in_alpha=0.01, init_range=(-0.001, 0.001), dtype=<type 'numpy.float64'>, seed=0)*
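
Since relations are treated as ordered pairs, symmetric data has to list both directions explicitly. A minimal sketch of how that could be prepared is shown below; `with_reverse_relations` is a hypothetical helper written for this example, not part of gensim, and the node names are placeholders.

```python
def with_reverse_relations(relations):
    """Return the relations plus the reverse of each pair, without duplicates."""
    seen = set()
    result = []
    for a, b in relations:
        for pair in ((a, b), (b, a)):
            if pair not in seen:
                seen.add(pair)
                result.append(pair)
    return result


# Placeholder pairs standing in for a symmetric relation.
one_directional = [('node.1', 'node.2'), ('node.2', 'node.3')]
symmetric_train_data = with_reverse_relations(one_directional)
print(symmetric_train_data)
```

The resulting list is a plain iterable of tuples, so it has the shape that the `train_data` parameter described above expects.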

In [15]:
relations = PoincareRelations(file_path=wordnet_mammal_file, delimiter='\t')
size = 50
burn_in = 0
workers = 1  # multi-threaded training is not implemented yet
negative = 15
epochs = 100
print_every = 500
batch_size = 10
model = PoincareModel(train_data=relations, size=size, burn_in=burn_in, workers=workers, negative=negative)
model.train(epochs=epochs, print_every=print_every, batch_size=batch_size)

INFO:gensim.models.poincare:loading relations from train data..


FileNotFoundError: ignored

## Save Poincare model

In [None]:
# Make sure the output directory exists before saving
os.makedirs('models', exist_ok=True)
model_path = os.path.join('models', 'gensim_model_batch_size_10_burn_in_0_epochs_100_neg_15_dim_50')

# Saves the entire PoincareModel instance; the loaded model can be trained further
model.save(model_path)

# Saves only the vectors from the PoincareModel instance, in the commonly used word2vec format
model.kv.save_word2vec_format(model_path + '_vectors')
PoincareKeyedVectors.load_word2vec_format(model_path + '_vectors')

## What the embedding can be used for

In [None]:
# Load an example model
test_model_path = os.path.join("models", 'gensim_model_batch_size_10_burn_in_0_epochs_100_neg_15_dim_50')
model = PoincareModel.load(test_model_path)

The learnt representations can be used to perform various useful operations. This section is split into two parts: simple operations that are mentioned directly in the paper, and experimental operations that are only hinted at and might require more work to refine.

The models that are used in this section have been trained on the transitive closure of the WordNet hypernym graph. The transitive closure is the list of all the direct and indirect hypernyms in the WordNet graph. An example of a direct hypernym is `(seat.n.03, furniture.n.01)` while an example of an indirect hypernym is `(seat.n.03, physical_entity.n.01)`.
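
To make the notion of a transitive closure concrete, the sketch below expands a list of direct relations into all direct and indirect ones. The three-link chain is a simplified toy hierarchy that assumes `artifact.n.01` as the only intermediate level between the two endpoints named above; the real WordNet path is longer. `transitive_closure` is a hypothetical helper, not a gensim function.

```python
from collections import defaultdict


def transitive_closure(direct_relations):
    """Return all (node, ancestor) pairs reachable through direct relations."""
    parents = defaultdict(set)
    for child, parent in direct_relations:
        parents[child].add(parent)

    closure = set()
    for start in list(parents):
        # Depth-first walk up the hierarchy from each node that has a parent.
        stack = list(parents[start])
        seen = set()
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            closure.add((start, ancestor))
            stack.extend(parents.get(ancestor, ()))
    return closure


# Simplified toy chain (assumption: intermediate WordNet levels are omitted).
direct = [
    ('seat.n.03', 'furniture.n.01'),
    ('furniture.n.01', 'artifact.n.01'),
    ('artifact.n.01', 'physical_entity.n.01'),
]
closure = transitive_closure(direct)
print(sorted(closure))
```

The closure contains the three direct pairs plus the indirect ones, including `('seat.n.03', 'physical_entity.n.01')`.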



### Simple operations

All the following operations are based simply on the notion of distance between two nodes in hyperbolic space.
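
As a point of reference, the distance in the Poincaré ball model between two points `u` and `v` with Euclidean norm below 1 is `arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))`. The pure-Python sketch below implements that formula for illustration only; the function name and the sample points are assumptions of this example, and the library's own implementation may differ in details such as clipping and vectorization.

```python
import math


def poincare_distance(u, v):
    """Poincaré distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1 or sq_norm_v >= 1:
        raise ValueError('both points must have Euclidean norm below 1')
    sq_dist = sum((x - y) ** 2 for x, y in zip(u, v))
    return math.acosh(1 + 2 * sq_dist / ((1 - sq_norm_u) * (1 - sq_norm_v)))


origin = (0.0, 0.0)
near_center = (0.5, 0.0)
near_boundary = (0.9, 0.0)

# Euclidean steps of 0.5 and 0.4, but the second one is hyperbolically longer.
print(poincare_distance(origin, near_center))
print(poincare_distance(near_center, near_boundary))
```

The second step is shorter in Euclidean terms yet longer in hyperbolic terms, because distances grow rapidly towards the boundary of the ball. This is what leaves room for the many leaf nodes of a hierarchy near the boundary.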

In [None]:
# Distance between any two nodes
model.kv.distance('leopard.n.02', 'mammal.n.01')

In [None]:
model.kv.distance('big_cat.n.01', 'carnivore.n.01')

In [None]:
model.kv.distance('leopard.n.02', 'carnivore.n.01')

In [None]:
# Nodes closest to a given input node, ranked by Poincaré distance
model.kv.most_similar('carnivore.n.01')

In [None]:
model.kv.most_similar('placental.n.01')

In [None]:
model.kv.most_similar('water_buffalo.n.01')

In [None]:
# Rank of distance of node 2 from node 1 in relation to distances of all nodes from node 1
model.kv.rank('dog.n.01', 'carnivore.n.01')

In [None]:
# Rank of distance of node 2 from node 1 in relation to distances of all nodes from node 1
model.kv.rank('big_cat.n.01', 'carnivore.n.01')
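
The idea behind such a rank can be sketched without the library: count how many other nodes lie closer to node 1 than node 2 does. The toy coordinates and the `rank` helper below are assumptions made for illustration. In particular, excluding node 1 itself from the count is a choice of this sketch and may not match how the library counts.

```python
import math


def poincare_distance(u, v):
    """Poincaré distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_dist = sum((x - y) ** 2 for x, y in zip(u, v))
    return math.acosh(1 + 2 * sq_dist / ((1 - sq_norm_u) * (1 - sq_norm_v)))


def rank(vectors, node_1, node_2):
    """1-based rank of node_2 among the other nodes, ordered by distance from node_1."""
    target = poincare_distance(vectors[node_1], vectors[node_2])
    closer = sum(
        1
        for name, vec in vectors.items()
        if name not in (node_1, node_2)
        and poincare_distance(vectors[node_1], vec) < target
    )
    return 1 + closer


# Hypothetical 2D embedding of four made-up nodes.
toy_vectors = {
    'a': (0.0, 0.0),
    'b': (0.1, 0.0),
    'c': (0.5, 0.0),
    'd': (0.0, 0.8),
}
print(rank(toy_vectors, 'a', 'b'), rank(toy_vectors, 'a', 'c'), rank(toy_vectors, 'a', 'd'))
```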

In [None]:
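# Finding Poincare distance between input vectors
# Points must lie inside the unit ball, so sample small coordinates (norm is at most 0.5 here)
vector_1 = np.random.uniform(-0.05, 0.05, size=(100,))  # vector with 100 dim
vector_2 = np.random.uniform(-0.05, 0.05, size=(100,))
vectors_multiple = np.random.uniform(-0.05, 0.05, size=(5, 100))  # 5 vectors of 100 dim each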

# Distance between vector_1 and vector_2
print(PoincareKeyedVectors.vector_distance(vector_1, vector_2))
# Distance between vector_1 and each vector in vectors_multiple
print(PoincareKeyedVectors.vector_distance_batch(vector_1, vectors_multiple))

### Experimental operations

These operations are based on the notion that the norm of a vector represents its hierarchical position. Leaf nodes typically tend to have the highest norms, and as we move up the hierarchy, the norm decreases, with the root node being close to the center (or origin).
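
A small sketch of this idea, using made-up 2D coordinates: the Euclidean norm of each vector orders the nodes from root to leaf, and the difference of two norms gives a relative position. The node names, the coordinates, and both helper functions are assumptions of this example. The sign convention (positive when the first node is higher in the hierarchy) is chosen to match the description used later in this section.

```python
import math

# Hypothetical embedding: the root sits near the origin, the leaf near the boundary.
toy_vectors = {
    'root': (0.01, 0.0),
    'mid': (0.4, 0.3),
    'leaf': (0.54, 0.72),
}


def norm(name):
    """Euclidean norm of a node's vector, read as its depth in the hierarchy."""
    return math.hypot(*toy_vectors[name])


def difference_in_hierarchy(node_1, node_2):
    """Positive when node_1 has the smaller norm, i.e. is higher in the hierarchy."""
    return norm(node_2) - norm(node_1)


by_depth = sorted(toy_vectors, key=norm)
print(by_depth)
print(difference_in_hierarchy('root', 'leaf'))
```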

In [None]:
# Closest child node
model.kv.closest_child('virginia_deer.n.01')

In [None]:
# Closest child node
model.kv.closest_child('mammal.n.01')

In [None]:
# Closest child node
model.kv.closest_child('carnivore.n.01')

In [None]:
# Closest child node
model.kv.closest_child('dog.n.01')

In [None]:
# Closest parent node
model.kv.closest_parent('canine.n.02')

In [None]:
# Closest parent node
model.kv.closest_parent('virginia_deer.n.01')

In [None]:
# Position in hierarchy: a lower norm means the node is higher in the hierarchy
print(model.kv.norm('virginia_deer.n.01'))
print(model.kv.norm('sheep.n.01'))
print(model.kv.norm('dog.n.01'))
print(model.kv.norm('placental.n.01'))
print(model.kv.norm('mammal.n.01'))

In [None]:
# Difference in hierarchy between the first node and the second node
# Positive values indicate the first node is higher in the hierarchy
print(model.kv.difference_in_hierarchy('dog.n.01', 'sheep.n.01'))

In [None]:
# One possible descendant chain
model.kv.descendants('mammal.n.01')

In [None]:
# One possible ancestor chain
model.kv.ancestors('dog.n.01')

In [None]:
model.kv.ancestors('sheep.n.01')

## Visualization

In [None]:
from gensim.viz.poincare import poincare_2d_visualization, poincare_distance_heatmap
import plotly.plotly as py
py.sign_in('Harman','<PLOTLY KEY>')

In [None]:
all_relations = list(set(relations))

In [None]:
# show_node_labels = ['mammal.n.01', 'placental.n.01', 'tiger.n.02', 'homo_sapiens.n.01']
show_node_labels = ['mammal.n.01', 'placental.n.01', 'ungulate.n.01', 'carnivore.n.01', 'rodent.n.01',
'canine.n.02', 'even-toed_ungulate.n.01', 'odd-toed_ungulate.n.01', 'elephant.n.01',
'rhinoceros.n.01', 'german_shepherd.n.01', 'feline.n.01', 'tiger.n.02', 'homo_sapiens.n.01']
filtered_set = set()
for relation in all_relations:
    if relation[0] in show_node_labels and relation[1] in show_node_labels:
        filtered_set.add(relation)

In [None]:
filtered_set

### Since the plotly visualization only works on 2D vectors, we demonstrate it on a model with size=2

In [None]:
size = 2
burn_in = 0
workers = 1
negative = 10
model_with_dim_2 = PoincareModel(train_data=relations, size=size, burn_in=burn_in, workers=workers, negative=negative)
model_with_dim_2.train(epochs=100, print_every=400, batch_size=10)

In [None]:
fig = poincare_2d_visualization(model_with_dim_2, filtered_set, "Poincare Hierarchy", show_node_labels=show_node_labels)

In [None]:
py.image.ishow(fig,width=1000,height=1000)

In [None]:
py.image.save_as(fig, filename='poincare_viz.png')

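Note that the descendant and ancestor chains from the experimental operations are not symmetric: descending recursively to the closest child from a node, and then ascending recursively to the closest parent from the end of that chain, need not visit the same nodes.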

This is despite the fact that Poincaré distance is symmetric (like any distance in a metric space). The asymmetry stems from the fact that even if node `Y` is the closest node to node `X` amongst all nodes with a higher norm (lower in the hierarchy) than `X`, node `X` may not be the closest node to node `Y` amongst all the nodes with a lower norm (higher in the hierarchy) than `Y`.
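
The asymmetry can be reproduced with a handful of hand-picked points. In the sketch below, the coordinates, the node names (`X`, `Y`, `Z`, `W`, `root`) and both helper functions are assumptions made for illustration. The helpers follow the description above (closest node among those with a higher or lower norm) and are not the library's code.

```python
import math


def poincare_distance(u, v):
    """Poincaré distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_dist = sum((x - y) ** 2 for x, y in zip(u, v))
    return math.acosh(1 + 2 * sq_dist / ((1 - sq_norm_u) * (1 - sq_norm_v)))


# Hypothetical 2D embedding chosen so that the asymmetry shows up.
toy_vectors = {
    'root': (0.0, 0.0),
    'Z': (0.0, 0.45),
    'X': (0.5, 0.0),
    'Y': (0.35, 0.45),
    'W': (-0.6, -0.3),
}


def norm(name):
    return math.hypot(*toy_vectors[name])


def closest(node, candidates):
    """Candidate with the smallest Poincaré distance to node, or None if there is none."""
    return min(
        candidates,
        key=lambda n: poincare_distance(toy_vectors[node], toy_vectors[n]),
        default=None,
    )


def closest_child(node):
    # Children are searched among nodes with a higher norm (lower in the hierarchy).
    return closest(node, [n for n in toy_vectors if norm(n) > norm(node)])


def closest_parent(node):
    # Parents are searched among nodes with a lower norm (higher in the hierarchy).
    return closest(node, [n for n in toy_vectors if norm(n) < norm(node)])


print(closest_child('X'))
print(closest_parent('Y'))
```

Here `Y` is the closest child of `X`, but the closest parent of `Y` is `Z`: `Z` has a lower norm than `X`, so it is never a child candidate for `X`, yet it lies closer to `Y` than `X` does.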