# Knowledge Graph Embedding using the Wikimedia knowledge graph.

This notebook uses the Wikimedia knowledge graph data to show how to compute embeddings for a knowledge graph and apply them to downstream tasks such as node classification and link prediction.





## DGL-KE Installation
The dglke_score function is not yet part of the dglke package on PyPI, so we need to follow the instructions in the [documentation](https://aws-dglke.readthedocs.io/en/latest/install.html) to build dgl-ke from source.


In [None]:
!sudo pip3 install dgl-cu101
!git clone  https://github.com/awslabs/dgl-ke.git
!pushd dgl-ke;cd python;sudo python3 setup.py install;
!pip3 uninstall torch -y
!sudo pip3 install torch==1.5.0+cu101 torchvision==0.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

## Data Preparation
First, download the data from Kaggle [here](https://www.kaggle.com/kenshoresearch/kensho-derived-wikimedia-data).

The zip file is 24GB in total. For our example, we only need the property.csv, item.csv and statements.csv.


Once you have downloaded the zip file, unzip it into the ./data/wikimedia/ folder.
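
Since only three of the files are needed, it can save disk space to extract just those. The following is a minimal sketch, assuming the three CSV files appear somewhere inside the archive under their plain names; the helper `extract_needed` and the archive name in the usage comment are hypothetical and not part of the original notebook.

```python
import shutil
import zipfile
from pathlib import Path

# The three tables this notebook reads; every other archive member is skipped.
NEEDED = {"property.csv", "item.csv", "statements.csv"}

def extract_needed(zip_path, out_dir="./data/wikimedia"):
    """Copy only the needed CSV files from the archive into out_dir (flat layout)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            name = Path(info.filename).name
            if name in NEEDED:
                # Stream the member to disk under its base name,
                # ignoring any folder structure inside the zip.
                with zf.open(info) as src, open(out / name, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                extracted.append(name)
    return sorted(extracted)

# Usage (hypothetical archive name):
# extract_needed("kensho-derived-wikimedia-data.zip")
```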

In [5]:
%load_ext autoreload
%autoreload 2

%matplotlib notebook
import pandas as pd
import numpy as np
import pylab as plt
#import seaborn as sns
#sns.set_style('ticks')
import os
import json
import boto3

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


#### Let's load the relation dictionary file.

In [7]:
dfRelations = pd.read_csv("./data/wikimedia/property.csv")
dfRelations.head()

Unnamed: 0,property_id,en_label,en_description
0,6,head of government,"head of the executive power of this town, city..."
1,10,video,"relevant video. For images, use the property P..."
2,14,traffic sign,"graphic symbol describing the item, used at th..."
3,15,route map,image of route map at Wikimedia Commons
4,16,highway system,system (or specific country specific road type...


In [8]:
# Non-null counts per column
dfRelations.count()

property_id       6985
en_label          6985
en_description    6905
dtype: int64

Let's check for missing values. Some properties do not have a description.

In [9]:
dfRelations.isnull().sum()

property_id        0
en_label           0
en_description    80
dtype: int64

### Let's load the entity file.

In [10]:
dfEntities = pd.read_csv("./data/wikimedia/item.csv")
dfEntities.head()

Unnamed: 0,item_id,en_label,en_description
0,1,Universe,totality of space and all contents
1,2,Earth,third planet from the Sun in the Solar System
2,3,life,matter capable of extracting energy from the e...
3,4,death,permanent cessation of vital functions
4,5,human,"common name of Homo sapiens, unique extant spe..."


In [11]:
# Non-null counts per column
dfEntities.count()

item_id           51450316
en_label          43232907
en_description    38507973
dtype: int64

In [12]:
#Null Counts
dfEntities.isnull().sum()

item_id                  0
en_label           8217409
en_description    12942343
dtype: int64

In [13]:
# Drop the description column (keyword form; a positional axis argument is rejected by recent pandas versions).
dfEntitiesClean = dfEntities.drop(columns='en_description')

In [14]:
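# Drop rows with any null value.
# Note: this starts again from dfEntities, so en_description is still present in the result.
dfEntitiesClean = dfEntities.dropna()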

In [15]:
dfEntitiesClean.count()

item_id           34373175
en_label          34373175
en_description    34373175
dtype: int64
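
The count above still lists en_description because the dropna call starts from dfEntities rather than from the frame with the column removed, so rows missing either a label or a description are dropped. The toy sketch below contrasts that behaviour with chaining the two steps; `toy_entities` is made-up data used only for illustration.

```python
import pandas as pd

# Toy stand-in for item.csv; the values are illustrative only.
toy_entities = pd.DataFrame({
    "item_id": [1, 2, 3, 4],
    "en_label": ["Universe", "Earth", None, "death"],
    "en_description": ["totality of space", None, "matter", "permanent cessation"],
})

# Variant A: dropna on the full frame removes rows missing a label OR a description.
strict = toy_entities.dropna()

# Variant B: drop the description column first, then remove rows without a label.
labels_only = toy_entities.drop(columns="en_description").dropna()

print(len(strict), len(labels_only))
```

Variant B keeps entities that have a label but no description, which may matter if only labels are used later on.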

### Let's load the statements.

In [16]:
dfTriple = pd.read_csv("./data/wikimedia/statements.csv")
dfTriple.head()

Unnamed: 0,source_item_id,edge_property_id,target_item_id
0,1,31,36906466
1,1,279,3695190
2,1,398,497745
3,1,398,1133705
4,1,398,1139177


#### Let's join the properties and the statements file.

In [17]:
dfResults=pd.merge(dfTriple, dfRelations, left_on='edge_property_id', right_on='property_id')



In [18]:
dfResults.head()

Unnamed: 0,source_item_id,edge_property_id,target_item_id,property_id,en_label,en_description
0,1,31,36906466,31,instance of,that class of which this subject is a particul...
1,2,31,3504248,31,instance of,that class of which this subject is a particul...
2,3,31,937228,31,instance of,that class of which this subject is a particul...
3,4,31,2996394,31,instance of,that class of which this subject is a particul...
4,5,31,55983715,31,instance of,that class of which this subject is a particul...


#### Let's join the entities onto the head (source) column.

In [19]:
dfResultsHR=pd.merge(dfResults, dfEntities, left_on='source_item_id', right_on='item_id')

In [20]:
dfResultsHR.head()

Unnamed: 0,source_item_id,edge_property_id,target_item_id,property_id,en_label_x,en_description_x,item_id,en_label_y,en_description_y
0,1,31,36906466,31,instance of,that class of which this subject is a particul...,1,Universe,totality of space and all contents
1,1,279,3695190,279,subclass of,all instances of these items are instances of ...,1,Universe,totality of space and all contents
2,1,398,497745,398,child astronomical body,minor body that belongs to the item,1,Universe,totality of space and all contents
3,1,398,1133705,398,child astronomical body,minor body that belongs to the item,1,Universe,totality of space and all contents
4,1,398,1139177,398,child astronomical body,minor body that belongs to the item,1,Universe,totality of space and all contents


#### Let's join the entities onto the tail (target) column.

In [21]:
dfResultsHRT=pd.merge(dfResultsHR, dfEntities,how='left', left_on='target_item_id', right_on='item_id')

In [22]:
dfResultsHRT.head()

Unnamed: 0,source_item_id,edge_property_id,target_item_id,property_id,en_label_x,en_description_x,item_id_x,en_label_y,en_description_y,item_id_y,en_label,en_description
0,1,31,36906466,31,instance of,that class of which this subject is a particul...,1,Universe,totality of space and all contents,36906466.0,universe,"class of which Our Universe is an instance, an..."
1,1,279,3695190,279,subclass of,all instances of these items are instances of ...,1,Universe,totality of space and all contents,3695190.0,cosmology,discipline directed to the philosophical conte...
2,1,398,497745,398,child astronomical body,minor body that belongs to the item,1,Universe,totality of space and all contents,497745.0,Sloan Great Wall,one of the largest known structures in the uni...
3,1,398,1133705,398,child astronomical body,minor body that belongs to the item,1,Universe,totality of space and all contents,1133705.0,galaxy filament,thread-like structures that form the boundarie...
4,1,398,1139177,398,child astronomical body,minor body that belongs to the item,1,Universe,totality of space and all contents,1139177.0,CfA2 Great Wall,one of the largest known superstructures in th...


#### Let's select the head, relation and tail labels and rename them to [head, relation, tail] columns.

In [23]:
dfResultsFinal = dfResultsHRT[['en_label_y','en_label_x', 'en_label']].rename(columns={"en_label_y": "head", "en_label_x": "relation", "en_label":"tail"})

In [24]:
dfResultsFinal.head()

Unnamed: 0,head,relation,tail
0,Universe,instance of,universe
1,Universe,subclass of,cosmology
2,Universe,child astronomical body,Sloan Great Wall
3,Universe,child astronomical body,galaxy filament
4,Universe,child astronomical body,CfA2 Great Wall
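
The three joins and the final rename can be hard to follow through the `_x`/`_y` suffixes. The sketch below shows the same idea on toy data, renaming the label columns before each merge so that no suffixes appear. It is a variant for illustration, not the notebook's own code; the helper `to_hrt` and the toy frames are assumptions.

```python
import pandas as pd

# Toy stand-ins for property.csv, item.csv and statements.csv.
relations = pd.DataFrame({"property_id": [31, 279],
                          "en_label": ["instance of", "subclass of"]})
entities = pd.DataFrame({"item_id": [1, 10, 20],
                         "en_label": ["Universe", "universe", "cosmology"]})
triples = pd.DataFrame({
    "source_item_id": [1, 1, 1],
    "edge_property_id": [31, 279, 279],
    "target_item_id": [10, 20, 99],  # 99 has no row in `entities`
})

def to_hrt(triples, relations, entities):
    """Resolve ids to labels and return a frame with head, relation, tail columns."""
    rel = relations.rename(columns={"en_label": "relation"})
    heads = entities.rename(columns={"item_id": "source_item_id", "en_label": "head"})
    tails = entities.rename(columns={"item_id": "target_item_id", "en_label": "tail"})
    out = (triples
           .merge(rel, left_on="edge_property_id", right_on="property_id")
           .merge(heads, on="source_item_id")                # inner join on the head
           .merge(tails, on="target_item_id", how="left"))   # left join on the tail
    return out[["head", "relation", "tail"]]

hrt = to_hrt(triples, relations, entities)
print(hrt)
```

The triple whose target id is missing from `entities` ends up with a null tail, which is the kind of row a later dropna would remove.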


In [25]:
#Count null values
dfResultsFinal.isnull().sum()

head        20788004
relation           0
tail          446828
dtype: int64

In [26]:
#Drop null records.
dfTrain = dfResultsFinal.dropna()

#### Our final dataset contains about 120 million triples (statements).

In [27]:
dfTrain.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 119976844 entries, 0 to 141206852
Data columns (total 3 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   head      object
 1   relation  object
 2   tail      object
dtypes: object(3)
memory usage: 3.6+ GB


#### Let's split our data into training, validation and test datasets.

In [28]:
from sklearn.model_selection import train_test_split

trainDF, testDF = train_test_split(dfTrain, test_size=0.1)

trainDF, validDF = train_test_split(trainDF, test_size=0.15)

In [29]:
trainDF.shape, validDF.shape, testDF.shape

((91782285, 3), (16196874, 3), (11997685, 3))
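
Because the second split is applied to the remaining 90% of the data, the validation set is 15% of 90%, or 13.5% of all triples, and the training set is 76.5%. The arithmetic below reproduces the shapes printed above, assuming the held-out size of each split is rounded up; `split_sizes` is a hypothetical helper written for this check.

```python
import math

def split_sizes(n, test_size=0.1, valid_size=0.15):
    """Return (train, valid, test) row counts for two successive splits."""
    n_test = math.ceil(n * test_size)         # first split: hold out the test set
    n_rest = n - n_test
    n_valid = math.ceil(n_rest * valid_size)  # second split: hold out validation from the rest
    n_train = n_rest - n_valid
    return n_train, n_valid, n_test

print(split_sizes(119976844))  # (91782285, 16196874, 11997685)
```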

#### Let's save our files as tab-separated files.

In [30]:
trainDF.to_csv("./data/wikimedia/train.txt", header = None, index = None, sep = "\t")

In [31]:
validDF.to_csv("./data/wikimedia/valid.txt", header = None, index = None, sep = "\t")

In [32]:
testDF.to_csv("./data/wikimedia/test.txt", header = None, index = None, sep = "\t")
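
Labels that themselves contain a tab or a line break could be a problem if the downstream reader simply splits each line on tabs; that reader behaviour is an assumption here and is worth checking against the dgl-ke documentation. A minimal sketch for finding such rows before writing the files (`unsafe_rows` is a hypothetical helper):

```python
import pandas as pd

def unsafe_rows(df):
    """Boolean mask of rows where any field contains a tab, carriage return or newline."""
    mask = pd.Series(False, index=df.index)
    for col in df.columns:
        mask |= df[col].astype(str).str.contains(r"[\t\r\n]", regex=True)
    return mask

# Toy example: the second row has a tab inside the tail label.
toy = pd.DataFrame({
    "head": ["Universe", "Earth"],
    "relation": ["instance of", "instance of"],
    "tail": ["universe", "inner\tplanet"],
})
print(toy[unsafe_rows(toy)])
```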

## Knowledge Graph Embeddings 

Let's run the command line to generate our embeddings. Because our knowledge graph is quite large, we use mixed CPU-GPU training (--mix_cpu_gpu).

In [None]:
%%time
!DGLBACKEND=pytorch dglke_train --dataset wikimedia --model_name TransE_l2 --batch_size 1000 \
--neg_sample_size 200 --hidden_dim 400 --gamma 19.9 --lr 0.25 --max_step 24000 --log_interval 100 \
--batch_size_eval 16 -adv --regularization_coef 1.00E-09 --test --gpu 0 1 2 3 --mix_cpu_gpu  --save_path ./wikimedia \
--data_path ./data/wikimedia/ --format raw_udd_hrt --data_files train.txt valid.txt test.txt --neg_sample_size_eval 10000 
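
For intuition about the `TransE_l2` model and the `--gamma` value in the command, TransE with an L2 distance is commonly defined to score a triple as gamma minus the Euclidean norm of h + r - t, so that plausible triples score higher. The sketch below assumes that definition and uses tiny made-up vectors; it is not dgl-ke's implementation.

```python
import numpy as np

def transe_l2_score(h, r, t, gamma=19.9):
    """TransE score with L2 distance: gamma - ||h + r - t||_2 (higher is more plausible)."""
    return gamma - np.linalg.norm(h + r - t, ord=2, axis=-1)

# Toy 2-dimensional embeddings (illustrative values only).
h = np.array([0.0, 0.0])
r = np.array([1.0, 0.0])
t_good = np.array([1.0, 0.0])  # h + r lands exactly on t
t_bad = np.array([4.0, 4.0])   # h + r is at distance 5 from t

print(transe_l2_score(h, r, t_good), transe_l2_score(h, r, t_bad))
```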

Let's evaluate our model.

In [None]:
%%time
!DGLBACKEND=pytorch dglke_eval --dataset wikimedia --model_name TransE_l2 \
--neg_sample_size 200 --hidden_dim 400 --gamma 19.9 \
--batch_size_eval 16 --gpu 0 1 2 3  --model_path ./wikimedia/TransE_l2_wikimedia_0/ \
--data_path ./data/wikimedia/ --format raw_udd_hrt --data_files train.txt valid.txt test.txt --neg_sample_size_eval 10000 --no_eval_filter
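
Link prediction evaluation is usually summarized with ranking metrics such as mean rank (MR), mean reciprocal rank (MRR) and HITS@k. As a sketch of how these are computed once the rank of each true triple among its negative samples is known (the ranks below are made up, and `ranking_metrics` is a hypothetical helper):

```python
import numpy as np

def ranking_metrics(ranks, ks=(1, 3, 10)):
    """Compute MR, MRR and HITS@k from 1-based ranks of the true triples."""
    ranks = np.asarray(ranks, dtype=float)
    metrics = {"MR": float(ranks.mean()), "MRR": float((1.0 / ranks).mean())}
    for k in ks:
        # Fraction of test triples ranked within the top k.
        metrics[f"HITS@{k}"] = float((ranks <= k).mean())
    return metrics

print(ranking_metrics([1, 2, 5, 20]))
```

With `--neg_sample_size_eval 10000`, each rank is taken among the sampled negatives rather than among all entities, so the numbers are not directly comparable to a full-ranking evaluation.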

Let's run the inference command line. In our case, we want to find the top 10 most similar nodes (--topK 10) for each entry in our "head.list" file. Our file contains the following entries:

Jeff Bezos<br>
Barack Obama
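
The file can be created by hand or with a few lines of Python. The sketch below assumes the expected layout is one entity label per line, spelled exactly as it appears in the training files; `write_entity_list` is a hypothetical helper.

```python
from pathlib import Path

def write_entity_list(names, path="head.list"):
    """Write one entity label per line and return the path written."""
    target = Path(path)
    target.write_text("\n".join(names) + "\n", encoding="utf-8")
    return target

# Usage:
# write_entity_list(["Jeff Bezos", "Barack Obama"])
```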

In [None]:
# Using PyTorch Backend
!DGLBACKEND=pytorch dglke_emb_sim --model_path ./wikimedia/TransE_l2_wikimedia_0/ \
--format 'h_*_*' --data_files head.list \
--score_func none --topK 10 --gpu 0

Let's inspect our results.

In [10]:
!cat result.tsv

left	right	score
Jeff Bezos	Jeff Bezos	1.0
Jeff Bezos	Aga Khan IV	0.8602205514907837
Jeff Bezos	Alisher Usmanov	0.8584005236625671
Jeff Bezos	Klaus Tschira	0.8512368202209473
Jeff Bezos	Bill Gates	0.8441287875175476
Jeff Bezos	Matthew McDowell	0.8408565521240234
Jeff Bezos	Al-Waleed bin Talal	0.8397267460823059
Jeff Bezos	Andreas Heldal-Lund	0.8363775014877319
Jeff Bezos	Rinat Akhmetov	0.8358486294746399
Jeff Bezos	Oleg Deripaska	0.831737220287323
Barack Obama	Barack Obama	1.0
Barack Obama	Donald Trump	0.9529082179069519
Barack Obama	George W. Bush	0.9426612854003906
Barack Obama	Harry S. Truman	0.9414601922035217
Barack Obama	Ronald Reagan	0.9393566250801086
Barack Obama	Bill Clinton	0.9360300898551941
Barack Obama	Franklin Delano Roosevelt	0.9323158264160156
Barack Obama	George H. W. Bush	0.9265057444572449
Barack Obama	Dwight D. Eisenhower	0.9219849109649658
Barack Obama	Gerald Ford	0.9207168221473694
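
Each query entity appears as its own best match with a score of 1.0, so it is convenient to filter those rows out when reading the results. A minimal sketch with pandas, using a few rows copied from the output above as sample input (`top_neighbors` is a hypothetical helper; pass "result.tsv" to read the real file):

```python
import io
import pandas as pd

# A few rows copied from the result above, used as sample input.
sample = (
    "left\tright\tscore\n"
    "Jeff Bezos\tJeff Bezos\t1.0\n"
    "Jeff Bezos\tAga Khan IV\t0.8602205514907837\n"
    "Jeff Bezos\tAlisher Usmanov\t0.8584005236625671\n"
    "Barack Obama\tBarack Obama\t1.0\n"
    "Barack Obama\tDonald Trump\t0.9529082179069519\n"
    "Barack Obama\tGeorge W. Bush\t0.9426612854003906\n"
)

def top_neighbors(tsv, k=3):
    """Return the k highest-scoring neighbours per query, excluding self-matches."""
    df = pd.read_csv(tsv, sep="\t")
    df = df[df["left"] != df["right"]]
    df = df.sort_values(["left", "score"], ascending=[True, False])
    return df.groupby("left", sort=False).head(k)

print(top_neighbors(io.StringIO(sample), k=1))
```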
