# Knowledge Graph Embedding using the Wikimedia knowledge graph data

In this notebook, we use the Wikimedia knowledge graph data to show how you can compute embeddings for your knowledge graph and use them for downstream tasks such as node classification and link prediction.





## Data Preparation
First, download the data from Kaggle [here](https://www.kaggle.com/kenshoresearch/kensho-derived-wikimedia-data).

The zip file is 24 GB in total. For our example, we only need property.csv, item.csv and statements.csv.

Once you have downloaded the file, unzip it into the ./data/wikimedia/ folder.

In [None]:
!unzip file.zip -d ./data/wikimedia
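
Since only three of the files are needed, it may be worth extracting just those instead of the whole 24 GB archive. The following is a minimal sketch using Python's standard `zipfile` module. The helper name `extract_needed` is hypothetical, and it assumes the three CSV files can be recognised by their base name wherever they sit inside the archive.

```python
import shutil
import zipfile
from pathlib import Path

# File names come from the text above; their location inside the archive is an assumption.
NEEDED = {"property.csv", "item.csv", "statements.csv"}

def extract_needed(zip_path, out_dir, needed=NEEDED):
    """Extract only members whose base name is in `needed`; return the extracted names."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            name = Path(member.filename).name
            if member.is_dir() or name not in needed:
                continue
            # Write the file flat into out_dir, ignoring any folder prefix in the archive.
            with archive.open(member) as src, open(out_dir / name, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(name)
    return sorted(extracted)
```

A call such as `extract_needed("file.zip", "./data/wikimedia")` would then leave only the three CSV files in the target folder.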

In [None]:
!conda install -c dglteam dgl-cuda10.1 -y
!pip install rdflib
!pip install dglke

In [None]:
!python3 -m spacy download en_core_web_lg

In [11]:
%load_ext autoreload
%autoreload 2

%matplotlib notebook
import os
import json

import boto3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('ticks')

#### Let's load the relation dictionary file.

In [12]:
dfRelations = pd.read_csv("./data/wikimedia/property.csv")
dfRelations.head()

Unnamed: 0,property_id,en_label,en_description
0,6,head of government,"head of the executive power of this town, city..."
1,10,video,"relevant video. For images, use the property P..."
2,14,traffic sign,"graphic symbol describing the item, used at th..."
3,15,route map,image of route map at Wikimedia Commons
4,16,highway system,system (or specific country specific road type...


In [13]:
#Total Counts
dfRelations.count()

property_id       6985
en_label          6985
en_description    6905
dtype: int64

Let's check for null values. It looks like some properties do not have descriptions.

In [14]:
dfRelations.isnull().sum()

property_id        0
en_label           0
en_description    80
dtype: int64

### Let's load the entity file.

In [25]:
dfEntities = pd.read_csv("./data/wikimedia/item.csv")
dfEntities.head()

Unnamed: 0,item_id,en_label,en_description
0,1,Universe,totality of space and all contents
1,2,Earth,third planet from the Sun in the Solar System
2,3,life,matter capable of extracting energy from the e...
3,4,death,permanent cessation of vital functions
4,5,human,"common name of Homo sapiens, unique extant spe..."


In [26]:
#Total Counts
dfEntities.count()

item_id           51450316
en_label          43232907
en_description    38507973
dtype: int64

In [27]:
#Null Counts
dfEntities.isnull().sum()

item_id                  0
en_label           8217409
en_description    12942343
dtype: int64

In [31]:
#Drop description column.
dfEntitiesClean = dfEntities.drop(columns='en_description')

In [32]:
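#Drop any rows with null values. Note that this starts from dfEntities again,
#so en_description is kept and rows missing a label or a description are dropped.
dfEntitiesClean = dfEntities.dropna()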

In [33]:
dfEntitiesClean.count()

item_id           34373175
en_label          34373175
en_description    34373175
dtype: int64

### Let's load the statements.

In [34]:
dfTriple = pd.read_csv("./data/wikimedia/statements.csv")
dfTriple.head()

Unnamed: 0,source_item_id,edge_property_id,target_item_id
0,1,31,36906466
1,1,279,3695190
2,1,398,497745
3,1,398,1133705
4,1,398,1139177


#### Let's join the properties and the statements file.

In [36]:
dfResults=pd.merge(dfTriple, dfRelations, left_on='edge_property_id', right_on='property_id')



In [37]:
dfResults.head()

Unnamed: 0,source_item_id,edge_property_id,target_item_id,property_id,en_label,en_description
0,1,31,36906466,31,instance of,that class of which this subject is a particul...
1,2,31,3504248,31,instance of,that class of which this subject is a particul...
2,3,31,937228,31,instance of,that class of which this subject is a particul...
3,4,31,2996394,31,instance of,that class of which this subject is a particul...
4,5,31,55983715,31,instance of,that class of which this subject is a particul...


#### Let's join the entity labels onto the head (source) items.

In [38]:
dfResultsHR=pd.merge(dfResults, dfEntities, left_on='source_item_id', right_on='item_id')

In [39]:
dfResultsHR.head()

Unnamed: 0,source_item_id,edge_property_id,target_item_id,property_id,en_label_x,en_description_x,item_id,en_label_y,en_description_y
0,1,31,36906466,31,instance of,that class of which this subject is a particul...,1,Universe,totality of space and all contents
1,1,279,3695190,279,subclass of,all instances of these items are instances of ...,1,Universe,totality of space and all contents
2,1,398,497745,398,child astronomical body,minor body that belongs to the item,1,Universe,totality of space and all contents
3,1,398,1133705,398,child astronomical body,minor body that belongs to the item,1,Universe,totality of space and all contents
4,1,398,1139177,398,child astronomical body,minor body that belongs to the item,1,Universe,totality of space and all contents


#### Let's join the entity labels onto the tail (target) items.

In [40]:
dfResultsHRT=pd.merge(dfResultsHR, dfEntities,how='left', left_on='target_item_id', right_on='item_id')

In [41]:
dfResultsHRT.head()

Unnamed: 0,source_item_id,edge_property_id,target_item_id,property_id,en_label_x,en_description_x,item_id_x,en_label_y,en_description_y,item_id_y,en_label,en_description
0,1,31,36906466,31,instance of,that class of which this subject is a particul...,1,Universe,totality of space and all contents,36906466.0,universe,"class of which Our Universe is an instance, an..."
1,1,279,3695190,279,subclass of,all instances of these items are instances of ...,1,Universe,totality of space and all contents,3695190.0,cosmology,discipline directed to the philosophical conte...
2,1,398,497745,398,child astronomical body,minor body that belongs to the item,1,Universe,totality of space and all contents,497745.0,Sloan Great Wall,one of the largest known structures in the uni...
3,1,398,1133705,398,child astronomical body,minor body that belongs to the item,1,Universe,totality of space and all contents,1133705.0,galaxy filament,thread-like structures that form the boundarie...
4,1,398,1139177,398,child astronomical body,minor body that belongs to the item,1,Universe,totality of space and all contents,1139177.0,CfA2 Great Wall,one of the largest known superstructures in th...


#### Let's select and rename the head, relation and tail to [h,r,t] columns.

In [43]:
dfResultsFinal = dfResultsHRT[['en_label_y','en_label_x', 'en_label']].rename(columns={"en_label_y": "head", "en_label_x": "relation", "en_label":"tail"})

In [44]:
dfResultsFinal.head()

Unnamed: 0,head,relation,tail
0,Universe,instance of,universe
1,Universe,subclass of,cosmology
2,Universe,child astronomical body,Sloan Great Wall
3,Universe,child astronomical body,galaxy filament
4,Universe,child astronomical body,CfA2 Great Wall
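
To make the three joins easier to follow, here is a small self-contained sketch of the same pipeline on toy data. The frames and values are made up for illustration. Renaming the label columns before merging is a design choice of this sketch, not what the notebook does. It avoids the `_x`/`_y` suffixes and shows where the null heads and tails counted in the next cell can come from.

```python
import pandas as pd

# Toy stand-ins for statements.csv, property.csv and item.csv (values are made up).
triples = pd.DataFrame({
    "source_item_id": [1, 1, 2],
    "edge_property_id": [31, 279, 31],
    "target_item_id": [10, 11, 99],   # 99 has no entry in `items`
})
relations = pd.DataFrame({"property_id": [31, 279],
                          "en_label": ["instance of", "subclass of"]})
items = pd.DataFrame({"item_id": [1, 2, 10, 11],
                      "en_label": ["Universe", None, "universe", "cosmology"]})

# Rename up front so the three en_label columns never collide.
rel = relations.rename(columns={"en_label": "relation"})
heads = items.rename(columns={"item_id": "source_item_id", "en_label": "head"})
tails = items.rename(columns={"item_id": "target_item_id", "en_label": "tail"})

hrt = (triples
       .merge(rel, left_on="edge_property_id", right_on="property_id")
       .merge(heads, on="source_item_id")              # inner join, as in the notebook
       .merge(tails, on="target_item_id", how="left")  # left join keeps unmatched tails as NaN
       [["head", "relation", "tail"]])

# Item 2 has no label (null head) and item 99 is unknown (null tail); both rows are dropped.
clean = hrt.dropna()
print(clean)
```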


In [45]:
#Count null values
dfResultsFinal.isnull().sum()

head        20788004
relation           0
tail          446828
dtype: int64

In [46]:
#Drop null records.
dfTrain = dfResultsFinal.dropna()

#### Our final dataset contains about 120 million triples (statements).

In [47]:
dfTrain.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 119976844 entries, 0 to 141206852
Data columns (total 3 columns):
head        object
relation    object
tail        object
dtypes: object(3)
memory usage: 3.6+ GB


#### Let's split our data into training, validation and test datasets.

In [50]:
from sklearn.model_selection import train_test_split

trainDF, testDF = train_test_split(dfTrain, test_size=0.1)

trainDF, validDF = train_test_split(trainDF, test_size=0.15)

In [51]:
trainDF.shape, validDF.shape, testDF.shape

((91782285, 3), (16196874, 3), (11997685, 3))
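
Because the second split is applied to the 90% that remains after the first one, the validation set ends up at roughly 13.5% of all triples rather than 15%. The sketch below reproduces the row counts with plain arithmetic. The helper name is hypothetical, and it assumes the held-out part of each split is rounded up, which is consistent with the shapes printed above.

```python
import math

def two_stage_split_sizes(n_rows, test_size, valid_size):
    """Return (train, valid, test) row counts for two successive fractional splits.

    Assumes the held-out part of each split is rounded up.
    """
    n_test = math.ceil(test_size * n_rows)
    remaining = n_rows - n_test
    n_valid = math.ceil(valid_size * remaining)
    return remaining - n_valid, n_valid, n_test

total = 119_976_844  # number of triples reported by dfTrain.info()
train, valid, test = two_stage_split_sizes(total, 0.10, 0.15)
print(train, valid, test)       # 91782285 16196874 11997685
print(round(valid / total, 3))  # 0.135
```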

#### Let's save our files as tab-separated files.

In [52]:
trainDF.to_csv("./data/wikimedia/train.txt", header = None, index = None, sep = "\t")

In [53]:
validDF.to_csv("./data/wikimedia/valid.txt", header = None, index = None, sep = "\t")

In [54]:
testDF.to_csv("./data/wikimedia/test.txt", header = None, index = None, sep = "\t")
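
The labels are free text, so it can be worth checking that every written line really consists of three tab-separated fields before starting a long training run. The following is a minimal sketch using only the standard library. The function name `check_triple_file` is hypothetical, and treating any line that does not split into exactly three non-empty fields as suspicious is an assumption of this sketch.

```python
def check_triple_file(path, sep="\t", max_report=5):
    """Return up to `max_report` (line_number, line) pairs that are not three non-empty fields."""
    bad = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            text = line.rstrip("\n")
            fields = text.split(sep)
            if len(fields) != 3 or any(not field.strip() for field in fields):
                bad.append((line_number, text))
                if len(bad) >= max_report:
                    break
    return bad
```

For example, `check_triple_file("./data/wikimedia/train.txt")` returns an empty list when every inspected line is well formed.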

## Knowledge Graph Embeddings 

Let's run the command line to generate our embeddings. In our case, we will mix CPU and GPU training, as our KG is quite large.

In [None]:
%%time
!DGLBACKEND=pytorch dglke_train --dataset wikimedia --model_name TransE_l2 --batch_size 1000 \
--neg_sample_size 200 --hidden_dim 400 --gamma 19.9 --lr 0.25 --max_step 24000 --log_interval 100 \
--batch_size_eval 16 -adv --regularization_coef 1.00E-09 --test --gpu 0 1 2 3 --mix_cpu_gpu  --save_path ./wikimedia \
--data_path ./data/wikimedia/ --format raw_udd_hrt --data_files train.txt valid.txt test.txt --neg_sample_size_eval 10000 
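
For orientation, a TransE model with the L2 norm scores a triple by how closely the head embedding plus the relation embedding lands on the tail embedding, offset by the margin `gamma`. The sketch below is a plain-Python illustration of that scoring rule on tiny vectors. It is not DGL-KE's implementation, and the function name is hypothetical.

```python
import math

def transe_l2_score(head, relation, tail, gamma):
    """Return gamma - ||head + relation - tail||_2 for equal-length lists of floats."""
    distance = math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))
    return gamma - distance

# A triple whose tail equals head + relation gets the maximum score, gamma.
print(transe_l2_score([1.0, 2.0], [1.0, 1.0], [2.0, 3.0], gamma=19.9))
# The further the tail is from head + relation, the lower the score.
print(transe_l2_score([0.0, 0.0], [0.0, 0.0], [3.0, 4.0], gamma=19.9))
```

In the training command, `--hidden_dim 400` sets the length of these vectors and `--gamma 19.9` sets the margin.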

Let's evaluate our model.

In [1]:
%%time
!DGLBACKEND=pytorch dglke_eval --dataset wikimedia --model_name TransE_l2 \
--neg_sample_size 200 --hidden_dim 400 --gamma 19.9 \
--batch_size_eval 16 --gpu 0 1 2 3  --model_path ./wikimedia/TransE_l2_wikimedia_0/ \
--data_path ./data/wikimedia/ --format raw_udd_hrt --data_files train.txt valid.txt test.txt --neg_sample_size_eval 10000 --no_eval_filter

Using backend: pytorch
Reading train triples....
Finished. Read 91782285 train triples.
Reading valid triples....
Finished. Read 16196874 valid triples.
Reading test triples....
Finished. Read 11997685 test triples.
Logs are being recorded at: ./wikimedia/TransE_l2_wikimedia_0/eval.log
|valid|: 16196874
|test|: 11997685
-------------- Test result --------------
Test average MRR: 0.4159753346227368
Test average MR: 1001.1689418833716
Test average HITS@1: 0.3540242971873324
Test average HITS@3: 0.45541123141672746
Test average HITS@10: 0.5213350742247359
-----------------------------------------
Test takes 6781.685 seconds
CPU times: user 2min 7s, sys: 30 s, total: 2min 37s
Wall time: 2h 9min 59s
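
The metrics in the log above are all derived from the rank that the true entity receives among the candidates scored for each test triple. As a hedged illustration of the definitions (not of how DGL-KE computes them internally), the sketch below derives MR, MRR and HITS@k from a list of 1-based ranks. The function name and the example ranks are made up.

```python
def ranking_metrics(ranks, ks=(1, 3, 10)):
    """Compute mean rank, mean reciprocal rank and HITS@k from 1-based ranks."""
    n = len(ranks)
    metrics = {
        "MR": sum(ranks) / n,                    # lower is better
        "MRR": sum(1.0 / r for r in ranks) / n,  # higher is better, at most 1.0
    }
    for k in ks:
        # Fraction of test cases where the true entity is ranked within the top k.
        metrics[f"HITS@{k}"] = sum(r <= k for r in ranks) / n
    return metrics

print(ranking_metrics([1, 2, 10]))
```

A few very badly ranked triples can inflate MR while leaving MRR and HITS@k almost unchanged, which is one reason to read the metrics together.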


Let's run the inference command line. In our case, we want to find the top 5 most similar nodes for the entries in our "head.list" file, which contains the following entries:

Jeff Bezos
Barack Obama
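
A file like this can be written by hand or from Python. The sketch below writes one entity name per line. The helper name is hypothetical, and placing the file under `./data/wikimedia/` is an assumption based on the `--data_path` used in the command that follows.

```python
from pathlib import Path

def write_entity_list(path, names):
    """Write one entity name per line and return the resulting path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(names) + "\n", encoding="utf-8")
    return path
```

For example: `write_entity_list("./data/wikimedia/head.list", ["Jeff Bezos", "Barack Obama"])`.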

In [2]:
# Using PyTorch Backend
!DGLBACKEND=pytorch dglke_score --data_path ./data/wikimedia/ --model_path ./wikimedia/TransE_l2_wikimedia_0/ \
--format 'h_*_*' --data_files head.list \
--score_func none --topK 5

Traceback (most recent call last):
  File "/usr/local/bin/dglke_score", line 11, in <module>
    load_entry_point('dglke==0.1.0.dev0', 'console_scripts', 'dglke_score')()
  File "/usr/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 490, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 2854, in load_entry_point
    return ep.load()
  File "/usr/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 2445, in load
    return self.resolve()
  File "/usr/lib/python3.6/dist-packages/pkg_resources/__init__.py", line 2451, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/local/lib/python3.6/site-packages/dglke-0.1.0.dev0-py3.6.egg/dglke/infer_score.py", line 25, in <module>
  File "/usr/local/lib/python3.6/site-packages/dglke-0.1.0.dev0-py3.6.egg/dglke/models/__init__.py", line 20, in <module>
  File "/usr/local/lib/pyt