# ComplEx: Graph Embeddings Project for CSCI 548
## Demonstration Jupyter Notebook

In [1]:
import os
import sys

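# Add the directory three levels up to the import path.
# Joining the '..' components separately works on every OS and avoids
# the invalid '\.' escape sequences in the original Windows-only string.
module_path = os.path.abspath(os.path.join('..', '..', '..'))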
    
if module_path not in sys.path:
    sys.path.append(module_path)
    
from complex import ComplEx

INPUT_FILE_PATH = "D:\\USC\\CS548\\groupdat\\FB15k\\"
MODEL_FILE_PATH = INPUT_FILE_PATH

TRAIN_FILE_NAME = "train.txt"
VALIDATION_FILE_NAME = "valid.txt"
WHOLE_FILE_NAME = "whole.txt"
TEST_FILE_NAME = "test.txt"

RELATIONS_FILE_NAME = "relation2id.txt"
ENTITIES_FILE_NAME = "entity2id.txt"

MODEL_FILE_NAME = "complex.mod"

### Read Data Set - This must be run before executing learn_embeddings
The validation and whole text files are optional.

In [2]:
algorithm = ComplEx()

train_file_names = {"train": INPUT_FILE_PATH + TRAIN_FILE_NAME,
                    # Optional: "valid": INPUT_FILE_PATH + VALIDATION_FILE_NAME,
                    # Optional: "whole": INPUT_FILE_PATH + WHOLE_FILE_NAME,
                    "relations": INPUT_FILE_PATH + RELATIONS_FILE_NAME,
                    "entities": INPUT_FILE_PATH + ENTITIES_FILE_NAME}

algorithm.read_dataset(train_file_names)


INFO:root:Input Files ...
INFO:root:  entities -----> D:\USC\CS548\groupdat\FB15k\entity2id.txt
INFO:root: relations -----> D:\USC\CS548\groupdat\FB15k\relation2id.txt
INFO:root:     train -----> D:\USC\CS548\groupdat\FB15k\train.txt
INFO:root:Preparing data...
INFO:root:   Done loading data...


### Learn Embeddings - Data Set Must be Read First
This should take several minutes per epoch when using the time-minimizing hyper-parameters below, with filtered evaluation and validation turned off.

In [4]:
parameters = {"mode": 'single',
              "epoch": 5,
              "batch": 128,
              "lr": 0.05,
              "dim": 40,             # reduced from 200 to save processing time
              "negative": 1,         # reduced from 5 to save processing time
              "opt": 'adagrad',
              "l2_reg": 0.001,
              "gradclip": 5,
              "filtered": False}     # turned filtered output off to save processing time
 
algorithm.learn_embeddings(parameters)


INFO:root:Learning Embeddings...
INFO:root:Arguments...
INFO:root:     batch -----> 128
INFO:root:  cp_ratio -----> 0.5
INFO:root:       dim -----> 40
INFO:root:     epoch -----> 5
INFO:root:  filtered -----> False
INFO:root:  gradclip -----> 5
INFO:root:    l2_reg -----> 0.001
INFO:root:        lr -----> 0.05
INFO:root:    margin -----> 1
INFO:root:      mode -----> single
INFO:root:     nbest -----> 10
INFO:root:  negative -----> 1
INFO:root:       opt -----> adagrad
INFO:root: save_step -----> 30
INFO:root:setup trainer...
INFO:root:start 1 epoch
INFO:root:training loss in 1 epoch: 2441.497601469053
INFO:root:training time in 1 epoch: 43.67363953590393
INFO:root:start 2 epoch
INFO:root:training loss in 2 epoch: 1836.073705813659
INFO:root:training time in 2 epoch: 44.57002019882202
INFO:root:start 3 epoch
INFO:root:training loss in 3 epoch: 1624.363700853205
INFO:root:training time in 3 epoch: 44.05403232574463
INFO:root:start 4 epoch
INFO:root:training loss in 4 epoch: 1468.7192094

### Save Model - Optional

In [5]:
algorithm.save_model(MODEL_FILE_PATH + MODEL_FILE_NAME)


Saving model: D:\USC\CS548\groupdat\FB15k\complex.mod
  Model saved:


### Load Model - Optional
This loads a previously saved model for evaluation or other purposes. It is not required if the current instance of ComplEx has already loaded a dataset and learned the embeddings for that dataset.

In [6]:
new_alg = ComplEx()
new_alg.load_model(MODEL_FILE_PATH + MODEL_FILE_NAME)


Loading model: D:\USC\CS548\groupdat\FB15k\complex.mod
   Model loaded


### Add Some Test Data

In [7]:
test_subs = ['/m/08mbj32', '/m/08mbj5d', '/m/08mg_b']
test_rels = ['/location/statistical_region/religions./location/religion_percentage/religion',
             '/military/military_conflict/combatants./military/military_combatant_group/combatants',
             '/award/award_category/winners./award/award_honor/ceremony']
test_obs = ['/m/0631_', '/m/0d060g', '/m/01bx35']


### Print Embeddings for Subjects, Relations, and Objects
Each entity and relation has 3 embeddings. Two are used for the ComplEx part of the Hybrid score measure: one vector holds the real parts and the other holds the imaginary parts of the complex embedding. The third is used for the DistMult part of the Hybrid score measure, which is real-valued only.  

Each entity/relation retrieval function returns a tuple of three arrays, each holding one embedding per item in the input list, in this order: real part of the complex embedding (ComplEx), imaginary part of the complex embedding (ComplEx), real-valued embedding (DistMult).  
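
To make the roles of the three embedding vectors concrete, here is a minimal sketch of the standard ComplEx and DistMult scoring functions in NumPy. It uses random vectors rather than the trained model. The function names `complex_score` and `distmult_score` are illustrative and not part of this project's API, and how the Hybrid measure combines the two parts is not shown here.

```python
import numpy as np

def complex_score(s_re, s_im, r_re, r_im, o_re, o_im):
    """Standard ComplEx score Re(sum(r * s * conj(o))), written in real arithmetic."""
    return float(np.sum(r_re * s_re * o_re
                        + r_re * s_im * o_im
                        + r_im * s_re * o_im
                        - r_im * s_im * o_re))

def distmult_score(s, r, o):
    """Standard DistMult score: a trilinear product of real-valued vectors."""
    return float(np.sum(s * r * o))

# Hypothetical toy embeddings (not taken from the trained model).
rng = np.random.default_rng(0)
k = 4
s_re, s_im, r_re, r_im, o_re, o_im = rng.normal(size=(6, k))

# Cross-check the expanded formula against NumPy's complex arithmetic.
s = s_re + 1j * s_im
r = r_re + 1j * r_im
o = o_re + 1j * o_im
reference = float(np.real(np.sum(r * s * np.conj(o))))

print(complex_score(s_re, s_im, r_re, r_im, o_re, o_im), reference)
print(distmult_score(s_re, r_re, o_re))
```

The two printed ComplEx values should agree up to floating-point error, which is why the real and imaginary parts can be stored as two separate real-valued arrays.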

The length of each embedding vector will be equal to the number of dimensions used as a hyper-parameter to the training model divided by 2 (since complex

In [9]:
subs = new_alg.retrieve_entity_embeddings(test_subs)
print("Length of embedding vector: {}, should equal number of dimensions of model assumption/2".format(len(subs[0][0])))
print("Subjects:")
print(subs)

rels = new_alg.retrieve_relations_embeddings(test_rels)
print("Relations:")
print(rels)

objs = new_alg.retrieve_entity_embeddings(test_obs)
print("Objects:")
print(objs)

Length of embedding vector: 40, should equal number of dimensions of model assumption/2
Subjects:
(array([[-5.61960751e-01, -4.56937641e-01,  1.15988693e-01,
        -3.56570515e-02,  3.05946142e-04, -3.66727256e-01,
        -3.17947532e-01,  1.90198339e-01, -4.63260874e-01,
        -4.45461772e-01,  1.60249664e-02,  1.73026954e-01,
        -3.22635070e-01, -8.95369631e-02,  9.25154912e-01,
         5.15317902e-01,  2.54222181e-03, -2.54843103e-01,
        -4.43163382e-01,  7.12120524e-02,  3.99725172e-01,
         1.49784161e-01,  8.42841459e-01, -9.55976185e-02,
         2.23417043e-01, -6.44552194e-01, -1.59867750e-02,
        -2.53060540e-01, -1.39797760e-01,  4.69377152e-01,
         3.17748920e-01,  3.81624172e-01, -1.66862103e-01,
         3.80903106e-01,  1.96301286e-01,  2.39373668e-01,
         9.31574215e-01,  5.31629073e-01,  1.98028362e-01,
         3.38972616e-01],
       [-7.71631637e-01, -3.99835138e-02, -1.24020999e-01,
        -4.99828783e-02,  1.40958311e-01, -1.5374

### Scoring Matrix for Test Data
The scoring matrix contains one scoring vector per test triplet (3 for the test data above). The length of each scoring vector equals the number of entities in the vocabulary; for FB15k this is 14,951.


In [10]:
sm = algorithm.retrieve_scoring_matrix(test_subs, test_rels)
print(sm)
print("Number of test triplets in data: {}".format(len(sm)))
print("Length of a scoring vector: {}".format(len(sm[0])))

[[-0.01857784  0.71207095 -0.03385617 ...  0.7641884   0.52066307
  -0.12732459]
 [ 1.45199499  0.0968879   0.21870539 ... -0.16897282  0.07704875
   0.42111322]
 [-0.46635306 -0.02038394  1.06053692 ...  0.58442756 -0.34375057
   0.42061662]]
Number of test triplets in data: 3
Length of a scoring vector: 14951
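
A scoring matrix of this shape can be turned into ranked candidate lists. The sketch below uses a small made-up matrix and assumes that a higher score means a more plausible object entity and that column `i` corresponds to entity index `i`; both are assumptions about this model, and `top_k_entities` is a hypothetical helper rather than part of the project.

```python
import numpy as np

def top_k_entities(score_matrix, k=3):
    """Return the column indices of the k highest scores in each row, best first."""
    # argsort sorts ascending, so negate the scores to get descending order.
    return np.argsort(-np.asarray(score_matrix), axis=1)[:, :k]

# Hypothetical toy scores: 2 queries over a vocabulary of 4 entities.
scores = np.array([[0.1, 0.9, -0.3, 0.5],
                   [1.2, 0.0, 0.7, -0.1]])

top = top_k_entities(scores, k=2)
print(top)
# [[1 3]
#  [0 2]]
```

Applied to `sm`, the same call would return one row of candidate entity indices per test triplet.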


### Evaluation of Trained Model
Produces evaluation metrics for the model. The whole text file is optional.

In [11]:
evaluate_file_names = {"test": INPUT_FILE_PATH + TEST_FILE_NAME,
                       "whole": INPUT_FILE_PATH + WHOLE_FILE_NAME}  #  Optional
all_res = new_alg.evaluate(evaluate_file_names)
for metric in sorted(all_res.keys()):
    print('{:20s}: {}'.format(metric, all_res[metric]))

Running model evaluation
loading whole graph...
Hits@1              : 0.06701257808400062
Hits@1(filter)      : 0.11696940969341978
Hits@10             : 0.257207428348936
Hits@10(filter)     : 0.34541483976909143
Hits@3              : 0.13523556398232636
Hits@3(filter)      : 0.21364967581385114
MRR                 : 0.12965997678300362
MRR(filter)         : 0.19303451291192836
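
For readers unfamiliar with the metrics above, the following sketch shows the usual definitions of MRR and Hits@k, and the difference between raw and filtered ranks. It is not this project's evaluation code: the helper names and the toy scores are hypothetical, and it assumes higher scores rank first.

```python
import numpy as np

def rank_of_true(scores, true_idx, known_true=()):
    """1-based rank of the true entity; other known-true entities are filtered out."""
    scores = np.asarray(scores, dtype=float).copy()
    for idx in known_true:
        if idx != true_idx:
            scores[idx] = -np.inf  # filtered setting: ignore other correct answers
    return int(np.sum(scores > scores[true_idx])) + 1

def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all test triplets."""
    return float(np.mean(1.0 / np.asarray(ranks, dtype=float)))

def hits_at(ranks, k):
    """Fraction of test triplets whose true entity is ranked in the top k."""
    return float(np.mean(np.asarray(ranks) <= k))

# Hypothetical scores over 4 entities; the true object is entity 2.
scores = [0.2, 0.9, 0.5, 0.7]
raw_rank = rank_of_true(scores, true_idx=2)                        # 3
filtered_rank = rank_of_true(scores, true_idx=2, known_true=[1])   # 2

ranks = [1, 2, 4]
print(raw_rank, filtered_rank)
print(mean_reciprocal_rank(ranks), hits_at(ranks, 3))
```

Filtering can only improve a rank, which is consistent with every `(filter)` metric above being higher than its raw counterpart.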
