# Documentation

## Holographic embeddings data structure

The original dataset is serialized to disk with pickle. When loaded from Python, it is a dictionary with five entries:

* `entities`
* `relations`
* And three different subsets
    * `test_subs`
    * `valid_subs`
    * `train_subs`

The **`entities`** and **`relations`** lists contain only identifiers. The three subsets contain the triples that link entities through relations: `train_subs`, which is the largest subset, plus `test_subs` and `valid_subs`. All three subsets share the same structure: [(`<id_entity>`, `<id_entity>`, `<id_relation>`), ...]. Each id is exactly the position the item occupies in the `entities` or `relations` list, respectively.
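
To make the layout concrete, here is a minimal sketch that builds a toy dictionary with the same five entries and resolves each triple back to identifiers. The identifiers and the `resolve` helper are made up for illustration; they are not part of the original dataset or code.

```python
# Toy data mirroring the layout described above (all values are invented).
data = {
    "entities": ["Q1", "Q2", "Q3"],
    "relations": ["P31", "P279"],
    "train_subs": [(0, 1, 0), (1, 2, 1)],
    "valid_subs": [(0, 2, 1)],
    "test_subs": [(2, 0, 0)],
}


def resolve(triple, data):
    """Map an (entity_id, entity_id, relation_id) tuple to its identifiers."""
    first_id, second_id, relation_id = triple
    return (
        data["entities"][first_id],     # position in the entities list
        data["entities"][second_id],    # position in the entities list
        data["relations"][relation_id]  # position in the relations list
    )


resolved = [resolve(t, data) for t in data["train_subs"]]
print(resolved)
```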

In [16]:
import pickle
with open("/home/jovyan/holographic-embeddings/data/wn18.bin", 'rb') as fin:
    data = pickle.load(fin)
print ([(k, len(data[k])) for k in data.keys()])


[('train_subs', 7997), ('relations', 930), ('entities', 8363), ('valid_subs', 1000), ('test_subs', 999)]


As an example, the internal structure of `valid_subs` is shown below. It is a list of tuples.

In [22]:
[tupla for tupla in data['valid_subs'][0:10]]

[(7224, 456, 243),
 (3710, 128, 38),
 (1760, 227, 109),
 (4747, 1097, 265),
 (1500, 245, 332),
 (2846, 147, 37),
 (1710, 245, 359),
 (3579, 226, 261),
 (5937, 802, 261),
 (7911, 806, 382)]

## Dataset python class

The main goal is a Dataset class that can be filled with external data and then saved to disk with the same structure used by https://github.com/mnick/holographic-embeddings to train a network.
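
As a rough illustration of that goal, the sketch below writes a dictionary with the five expected entries using pickle and reads it back. The function names `save_to_binary_sketch` and `load_from_binary_sketch` are hypothetical stand-ins; the real `save_to_binary` and `load_from_binary` methods may work differently.

```python
import os
import pickle
import tempfile


def save_to_binary_sketch(path, entities, relations, train, valid, test):
    """Hypothetical helper: pickle the five entries into one file."""
    payload = {
        "entities": entities,
        "relations": relations,
        "train_subs": train,
        "valid_subs": valid,
        "test_subs": test,
    }
    with open(path, "wb") as fout:
        pickle.dump(payload, fout)


def load_from_binary_sketch(path):
    """Hypothetical helper: read the dictionary back from disk."""
    with open(path, "rb") as fin:
        return pickle.load(fin)


# Round trip through a temporary file with invented toy data.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.bin")
    save_to_binary_sketch(path, ["Q1", "Q2"], ["P31"], [(0, 1, 0)], [], [])
    restored = load_from_binary_sketch(path)

print(sorted(restored.keys()))
```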

Right now, the class has several methods:
* `load_dataset_from_json`
* `load_dataset_from_query`
* `load_dataset_from_nlevels`
* `load_entire_dataset`
* `save_to_binary`
* `load_from_binary`
* `train_split`
* `show`
* And other private methods.

The `load_entire_dataset` method first generates a count query internally, in order to know how many tuples must be retrieved from the server.
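
The idea of counting first and then retrieving can be sketched as follows. This is only an assumption about the general pattern: the query text, the page size and all three helper names are hypothetical, and the actual query built by `load_entire_dataset` is not shown in this document.

```python
def build_count_query(where_clause):
    """Hypothetical: SPARQL query that counts the matching tuples."""
    return "SELECT (COUNT(*) AS ?count) WHERE {{ {0} }}".format(where_clause)


def page_offsets(total, page_size):
    """Offsets needed to fetch `total` tuples in pages of `page_size`."""
    return list(range(0, total, page_size))


def build_page_query(where_clause, offset, page_size):
    """Hypothetical: SPARQL query that fetches one page of tuples."""
    return "SELECT ?s ?p ?o WHERE {{ {0} }} LIMIT {1} OFFSET {2}".format(
        where_clause, page_size, offset
    )


where = "?s ?p ?o"
print(build_count_query(where))

# Pretend the server answered the count query with 2500 tuples.
for offset in page_offsets(2500, 1000):
    print(build_page_query(where, offset, 1000))
```

Knowing the total up front lets the loader decide how many requests to issue and report progress while they run.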

To create a dataset:

In [1]:
import importlib
import kgeserver.dataset as dataset
import kgeserver.wikidata_dataset as wikidata_dataset
import pickle
importlib.reload(dataset)
importlib.reload(wikidata_dataset)
from datetime import datetime

dtset = wikidata_dataset.WikidataDataset()

sv = dtset.get_seed_vector(verbose=2)
#dataset.load_entire_dataset(1)
dtset.load_dataset_recurrently(2, sv, verbose=2)
dtset.show()

Found 32871 entities
Scanning level 1 with 33009 elements
Enter S to show status: s
Elapsed time: 5s. Depth 1 of 2. Entities scanned: 1.35% (446 of 33009) Active threads: 38
Enter S to show status: s
Elapsed time: 52s. Depth 1 of 2. Entities scanned: 9.81% (3238 of 33009) Active threads: 29
Waiting all threads to end
Scanning level 2 with 366492 elements
Waiting all threads to end
220656 entities, 516 relations, 873012 tripletas
Enter S to show status: s
Elapsed time: 1538s. Depth 2 of 2. Entities scanned: 100.00% (366482 of 366492) Active threads: 6


### Notes

Using a dict as key-value storage is much faster than storing long strings in arrays.
Searching a dict is also faster than searching a list, and the lookup gets even faster as the strings
used as keys get shorter.

In [6]:
dict_ = {}
di2 = {}
di = {"http://www.wikidata.org/prop/direct/P/prop/direct/P/prop/direct/P/a": 4,
      "http://www.wikidata.org/prop/direct/P/prop/direct/P/prop/direct/P/b": 3}

st = "P{0}"
strin = "http://www.wikidata.org/prop/direct/P{0}"


for i in range(0, 1000000):
    s = strin.format(i)
    dict_[s] = i
    di2[st.format(i)] = i
    
lis = [strin.format(i) for i in range(0, 1000000)]

In [8]:
%timeit dict_["http://www.wikidata.org/prop/direct/P999994"]
%timeit di["http://www.wikidata.org/prop/direct/P/prop/direct/P/prop/direct/P/b"]
%timeit di2["P999994"]
%timeit lis.index("http://www.wikidata.org/prop/direct/P999994")

The slowest run took 38.21 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 106 ns per loop
10000000 loops, best of 3: 119 ns per loop
The slowest run took 13.65 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 77.2 ns per loop
10 loops, best of 3: 26.1 ms per loop
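
One way to apply these observations is to keep a list for id-to-string lookups and a dict for string-to-id lookups, using the short trailing part of each URI as the key. The sketch below is an illustration only; the `IdIndex` class and the `short_key` helper are hypothetical and do not come from the Dataset class.

```python
def short_key(uri):
    """Keep only the text after the last slash, e.g. 'P31'."""
    return uri.rsplit("/", 1)[-1]


class IdIndex:
    """Hypothetical two-way index: list for id -> key, dict for key -> id."""

    def __init__(self):
        self.items = []      # position in this list is the id
        self.positions = {}  # short key -> id, constant-time lookup

    def add(self, uri):
        key = short_key(uri)
        if key not in self.positions:
            self.positions[key] = len(self.items)
            self.items.append(key)
        return self.positions[key]

    def get_id(self, uri):
        return self.positions[short_key(uri)]

    def get_item(self, idx):
        return self.items[idx]


index = IdIndex()
first = index.add("http://www.wikidata.org/prop/direct/P31")
again = index.add("http://www.wikidata.org/prop/direct/P31")
second = index.add("http://www.wikidata.org/prop/direct/P279")
print(first, again, second, index.get_item(second))
```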


# Usage for Algorithm class

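The following cells load a dataset from a binary file, train a TransE model with `ModelTrainer`, and build a search index from the trained model.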

In [3]:
import pickle
import importlib
import kgeserver.dataset as dataset
import kgeserver.algorithm as algorithm
import kgeserver.experiment as experiment
import skge
importlib.reload(experiment)
importlib.reload(dataset)
importlib.reload(algorithm)
dtset = dataset.Dataset()
# dataset.load_from_binary("holographic-embeddings/data/wn18.bin")
dtset.load_from_binary("wikidata_25k.bin")

#alg = algorithm.Algorithm(dtset, thread_limiter=5)
model = algorithm.ModelTrainer(dtset, model_type=skge.TransE, margin=0.2, ncomp=100, test_all=-1, train_all=True)
#models = alg.find_best(ncomps=[100], model_types=[skge.TransE], test_all=-1, train_all=True, margins = [0.2])
modelo = model.run()

{'no_pairwise': False, 'ne': 1, 'trainer_type': <class 'skge.base.PairwiseStochasticTrainer'>, 'margin': 0.2, 'model_type': <class 'skge.transe.TransE'>, 'dataset': <kgeserver.dataset.Dataset object at 0x7f3d66f84b00>, 'scores': [], 'neval': -1, 'mode': 'rank', 'me': 500, 'fout': None, 'init': 'nunif', 'best_valid_score': -1.0, 'train_all': True, 'afs': 'sigmoid', 'sampler': 'random-mode', 'th_num': 0, 'ncomp': 100, 'best_epoch': None, 'exectimes': [], 'evaluator': <class 'kgeserver.algorithm.TransEEval'>, 'lr': 0.1, 'nb': 100, 'test_all': -1, 'violations': []}
Fitting model TransE with trainer PairwiseStochasticTrainer
[0][  1] time = 0s, violations = 18598
[0][  2] time = 0s, violations = 6013
[0][  3] time = 0s, violations = 3172
[0][  4] time = 0s, violations = 2117
[0][  5] time = 0s, violations = 1566
[0][  6] time = 0s, violations = 1261
[0][  7] time = 0s, violations = 1100
[0][  8] time = 0s, violations = 918
[0][  9] time = 0s, violations = 794
[0][ 10] time = 0s, violations 

In [6]:
import kgeserver.server as server
si = server.SearchIndex()
si.build_from_trained_model(modelo, 1000)


In [7]:
si.save_to_binary("wikidata_25k.annoy.bin")

True

# Usage for Server Class

First, we need to create a SearchIndex. We can either build a new one from a trained model or load one that was already built.

The Dataset class is loaded as well because it is useful for working with entity strings and ids.

In [3]:
import kgeserver.server as server
import kgeserver.dataset as dataset
import pickle

si = server.SearchIndex()

# tm = pickle.load(open("modeloentrenado100k.bin", "rb"))
# si.build_from_trained_model(tm, 1000)

# si.save_to_binary("annoyIndex100k.bin")
si.load_from_file("annoyIndex100k.bin", 120)

dt = dataset.Dataset()
dt.load_from_binary("100k.bin")

True

Using the Server class is as simple as instantiating a Server object with the SearchIndex.

The example gets a list of entities similar to a given id and prints the complete URI of each one.

In [5]:
id1 = dt.get_entity_id("Q5682")

s = server.Server(si)
similares = s.similarity_by_id(556,10)

for ent in similares:
    b_ent = "https://www.wikidata.org/wiki/"
    print("["+str(ent)+"]\t "+b_ent+dt.get_entity(ent))

    

[556]	 https://www.wikidata.org/wiki/Q37853
[1701]	 https://www.wikidata.org/wiki/Q131808
[1267]	 https://www.wikidata.org/wiki/Q288928
[1536]	 https://www.wikidata.org/wiki/Q37068
[1654]	 https://www.wikidata.org/wiki/Q748
[1702]	 https://www.wikidata.org/wiki/Q1474884
[28327]	 https://www.wikidata.org/wiki/Q624895
[43197]	 https://www.wikidata.org/wiki/Q5098
[1462]	 https://www.wikidata.org/wiki/Q19588
[565]	 https://www.wikidata.org/wiki/Q11344
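
To give an intuition for what a similarity query returns, here is a minimal brute-force sketch over a handful of invented vectors. It is a stand-in, not the approach used by `SearchIndex`: the Euclidean distance, the toy vectors and the function name `similarity_by_id_sketch` are all assumptions made for illustration.

```python
import math


def similarity_by_id_sketch(vectors, entity_id, k):
    """Return the ids of the k vectors closest to vectors[entity_id]."""
    query = vectors[entity_id]

    def distance(other):
        # Euclidean distance; the metric of the real index is not assumed.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(query, other)))

    ranked = sorted(range(len(vectors)), key=lambda i: distance(vectors[i]))
    return ranked[:k]


# Invented two-dimensional embeddings; the position in the list is the id.
vectors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.3]]
neighbours = similarity_by_id_sketch(vectors, 0, 3)
print(neighbours)
```

As in the output above, the queried id comes back first because its distance to itself is zero.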
