<a href="https://colab.research.google.com/github/arangodb/interactive_tutorials/blob/master/notebooks/Graph_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
!git clone -b oasis_connector --single-branch https://github.com/arangodb/interactive_tutorials.git
!git clone -b imdb_with_ratings https://github.com/arangodb/interactive_tutorials imdb_data
!rsync -av interactive_tutorials/ ./ --exclude=.git
!chmod -R 755 ./tools
!pip3 install pyarango
!pip3 install "python-arango>=5.0"
!pip3 install networkx
!pip3 install matplotlib
!pip3 install adbnx-adapter==0.0.0.2.5.3.post1
!pip install node2vec

Scope:
1. What are graph embeddings?
2. What ideas motivate the development of a graph embedding?
3. What are the approaches to developing a graph embedding?
4. What are the characteristics of each approach?
5. What are the strengths and limitations of each approach?

## Graph Embeddings
Mathematically, a graph is a structure that captures a set of objects and the relations between them. This representation has proven remarkably versatile for abstracting a wide variety of scientific and business problems. Road networks, water pipeline networks, traffic networks, oil/chemical networks, supply chains, computer networks, and recommender systems in electronic marketplaces are all examples of graphs. Since graph-structured data are pervasive, problems on graphs have been researched for a long time, and the resulting solutions power useful applications in many domains. Finding the best delivery routes, identifying influential writers in content networks, and detecting fraud in financial transactions are all examples. Many of these features are available as part of the ArangoDB analytics ecosystem. With the current explosion of interest in machine learning, there is a surge of interest in developing machine learning applications on graphs. In this post we will discuss one approach to performing machine learning on graphs. The material for this post comes from [this paper](https://www-cs.stanford.edu/people/jure/pubs/graphrepresentation-ieee17.pdf) and [these slides/notes](http://snap.stanford.edu/proj/embeddings-www/files/nrltutorial-part2-gnns.pdf).
The key idea is to convert data represented as a graph into a representation commonly used in machine learning, called the Euclidean representation. The crucial property of this representation is that a notion of distance between two nodes exists. Nodes in the graph are transformed into points in a Euclidean space.
A basic template to learn this Euclidean representation is:
1. Define, for each node, the set of nodes that we deem similar to it. This is the node's neighborhood.
2. Posit that nodes that are similar in the graph are similar in the Euclidean representation. In other words, the similarity within the set of nodes (i.e., the neighborhood) is preserved as we go from the graph to the Euclidean representation.
3. Develop a neural network that takes the sets of similar nodes as input and learns the Euclidean representation. The neural network represents an _encoder_ that uses a _similarity_ measure to determine a Euclidean representation. The parameters of the network are learned so that similarity in the graph representation is preserved in the Euclidean representation.
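
The template above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the method used by any particular library: the encoder is a simple lookup table, the similarity measure is a dot product, and training uses a logistic loss with one randomly sampled dissimilar node per similar pair. The function name `train_shallow_embedding`, the toy node names, and all hyperparameter values are made up for the example.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    # Clamp the input so math.exp never overflows.
    x = max(-30.0, min(30.0, x))
    return 1.0 / (1.0 + math.exp(-x))

def train_shallow_embedding(nodes, similar_pairs, dim=2, epochs=200, lr=0.1, seed=0):
    """Learn one vector per node so that similar nodes get a high dot product."""
    rng = random.Random(seed)
    # Encoder: a plain lookup table with one trainable vector per node.
    emb = {n: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for n in nodes}
    positives = set(similar_pairs) | {(v, u) for u, v in similar_pairs}

    def step(u, v, label):
        # One gradient step on the logistic loss for the pair (u, v).
        err = sigmoid(dot(emb[u], emb[v])) - label
        zu, zv = emb[u][:], emb[v][:]
        emb[u] = [a - lr * err * b for a, b in zip(zu, zv)]
        emb[v] = [b - lr * err * a for a, b in zip(zu, zv)]

    for _ in range(epochs):
        for u, v in similar_pairs:
            step(u, v, 1.0)  # pull a similar pair together
            # Push u away from one randomly chosen node that is not similar to it.
            candidates = [w for w in nodes if w != u and (u, w) not in positives]
            if candidates:
                step(u, rng.choice(candidates), 0.0)
    return emb

# Two small groups of mutually similar nodes (illustrative data).
nodes = ["a", "b", "c", "x", "y", "z"]
similar_pairs = [("a", "b"), ("b", "c"), ("a", "c"),
                 ("x", "y"), ("y", "z"), ("x", "z")]
emb = train_shallow_embedding(nodes, similar_pairs)
print("same group :", dot(emb["a"], emb["b"]))
print("cross group:", dot(emb["a"], emb["x"]))
```

After training, nodes from the same group should score higher against each other than against nodes from the other group, which is the sense in which similarity in the graph is preserved in the Euclidean representation.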

As discussed in [the slides/notes](http://snap.stanford.edu/proj/embeddings-www/files/nrltutorial-part2-gnns.pdf), we can consider the following as the set of nodes for which similarity in the graph is evaluated:
1. Nodes that are adjacent to a node
2. Nodes that are reachable by $k$ hops from the node.
3. Nodes that manifest in a random walk of a specified length starting at the node.
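
The three neighborhood definitions can be compared on a toy graph. The sketch below uses a hand-written adjacency list rather than the IMDB data, interprets "reachable by $k$ hops" as "within at most $k$ hops" (an assumption), counts walk length in visited nodes, and uses a uniform random walk, which is a simplification of the biased walks that some methods use. All function names are illustrative.

```python
import random
from collections import deque

# Toy undirected graph as an adjacency list (illustrative data).
graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}

def adjacent_nodes(graph, node):
    """Strategy 1: the direct neighbors of `node`."""
    return set(graph[node])

def k_hop_nodes(graph, node, k):
    """Strategy 2: nodes within at most `k` hops of `node` (breadth-first search)."""
    distance = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        if distance[current] == k:
            continue  # do not expand beyond k hops
        for neighbor in graph[current]:
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    return set(distance) - {node}

def random_walk(graph, start, length, rng):
    """Strategy 3: a walk visiting `length` nodes, moving to a uniformly random neighbor at each step."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:
            break  # dead end: stop early
        walk.append(rng.choice(neighbors))
    return walk

rng = random.Random(42)
print(adjacent_nodes(graph, "A"))
print(k_hop_nodes(graph, "A", 2))
print(random_walk(graph, "A", 4, rng))
```

Note that the first two strategies return a fixed set for a given node, while the random walk returns a different sample each time it is run, so many walks per node are usually generated.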

Each of these strategies has strengths and weaknesses and is particularly suited to certain applications, as discussed in [Goyal and Ferrara, 2017](https://usc-isi-i2.github.io/papers/goyal18-knosys.pdf). The third approach is what we will discuss in this post. This approach of capturing the neighborhood of the node is called a _random walk_. The Euclidean representations of the nodes obtained from the neural network are called _embeddings_. To obtain these embeddings, random walks of a particular length are initiated from each node of the graph. The nodes visited by the walk are captured sequentially until a walk of the desired length is completed. The generated walks are fed as input to the neural network, and the parameters of the neural network are learned as part of training the network. As a result of training, we obtain the Euclidean representation of the nodes of the graph. These node representations can be used to perform machine learning tasks such as _node classification_ and _link prediction_.

A schematic illustrating the basic elements of an approach to obtaining embeddings from a graph is shown below. This illustration depicts using a _random walk_ of length 4 from each node of the graph. A sample of _random walks_ from node A and node B is shown. These _random walks_ are then used as input to a neural network. The embeddings are obtained as an output of the neural network. To determine the parameters (the weights and biases) of the neural network, we utilize the fact that nodes that are similar in the graph, such as those encountered in a _random walk_, are similar in the embedding.

The output from the neural network is a vector. This is a mathematical object with coordinates or dimensions. The number of dimensions of the embedding is a parameter (actually, a _hyperparameter_, since it is not learned by the neural network) that we need to specify when the neural network is developed. Machine learning methods, for example [_t-distributed Stochastic Neighbor Embedding (t-SNE)_](https://distill.pub/2016/misread-tsne/), can be used to visualize a two-dimensional version of the embedding. Instead of using a neural network, it is also possible to use methods such as _graph factorization_ to obtain embeddings from the graph.

![](https://github.com/arangodb/interactive_tutorials/blob/master/notebooks/img/embedding_schematic.png?raw=1)
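
To make "nodes encountered in a random walk are similar" concrete, the walks can be turned into pairs of nodes that should end up close together in the embedding. The sketch below assumes a skip-gram style setup in which two nodes count as similar when they occur within a fixed window of each other in a walk; the function name `context_pairs` and the example walks are illustrative.

```python
def context_pairs(walks, window):
    """Return (center, context) pairs for nodes at most `window` steps apart in a walk."""
    pairs = []
    for walk in walks:
        for i, center in enumerate(walk):
            lo = max(0, i - window)
            hi = min(len(walk), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, walk[j]))
    return pairs

# Two hand-written walks of length 4 (illustrative data).
walks = [["A", "B", "C", "D"], ["B", "A", "C", "B"]]
for center, context in context_pairs(walks, window=1):
    print(center, "->", context)
```

A larger window treats nodes that are further apart in a walk as similar, so the window size controls how wide a neighborhood the embedding tries to preserve.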





The approach using just an _encoder_ function and a _similarity_ metric is an example of a _shallow_ approach to learning an embedding. Some of the drawbacks of using this approach are:
1. Learning an embedding requires determining a large number of parameters, on the order of the number of nodes in the graph ($\mathbf{O}(\left|V\right|)$, where $V$ is the set of nodes in the graph and $\left|V\right|$ is the number of nodes).
2. The learning is _transductive_. The embeddings that can be determined are limited to the nodes seen during training. We will not be able to determine an embedding for nodes that were not seen as part of the training process.
3. Node features cannot be incorporated when determining the embedding.
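
The first two drawbacks follow directly from the encoder being a lookup table, which a small sketch makes visible. The helper names below are hypothetical and the vectors are random placeholders rather than trained embeddings.

```python
import random

def init_embedding_table(nodes, dim, seed=0):
    """Create one vector per node; in a real setting these would be learned."""
    rng = random.Random(seed)
    return {n: [rng.gauss(0.0, 0.1) for _ in range(dim)] for n in nodes}

def count_parameters(table):
    """Total number of trainable values: number of nodes times embedding dimension."""
    return sum(len(vec) for vec in table.values())

def encode(table, node):
    """Shallow encoder: a pure lookup, so unseen nodes have no embedding."""
    if node not in table:
        raise KeyError(f"{node!r} was not seen during training")
    return table[node]

table = init_embedding_table(["A", "B", "C", "D", "E"], dim=3)
print(count_parameters(table))  # grows linearly with the number of nodes

try:
    encode(table, "F")  # a node added after training
except KeyError as err:
    print("cannot embed:", err)
```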

Instead of using just a simple _encoder_ and _similarity_ measure to determine a Euclidean representation, we can take a _deeper_ approach using _Graph Neural Networks (GNNs)_. In a _GNN_, nodes aggregate information from their neighbors using a _neural network_. This approach eliminates some of the drawbacks observed with the shallow approach. A schematic of the computational approach for _GNNs_ is shown below.

![](https://github.com/arangodb/interactive_tutorials/blob/master/notebooks/img/gnn_schematic.png?raw=1)
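
The neighbor aggregation idea can be sketched as a single layer in plain Python. This is only an illustration under assumptions: the aggregation is a mean over neighbor features, the node's own features and the aggregated features are each multiplied by a weight matrix and summed, and a ReLU is applied. The weights here are fixed identity matrices rather than learned, and the graph and features are toy data.

```python
def mean_vector(vectors, dim):
    """Element-wise mean of a list of vectors (zeros if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def gnn_layer(graph, features, w_self, w_neigh):
    """One round of neighbor aggregation: ReLU(w_self @ h_v + w_neigh @ mean(h_neighbors))."""
    dim = len(next(iter(features.values())))
    updated = {}
    for node, neighbors in graph.items():
        aggregated = mean_vector([features[n] for n in neighbors], dim)
        own = matvec(w_self, features[node])
        from_neighbors = matvec(w_neigh, aggregated)
        updated[node] = [max(0.0, a + b) for a, b in zip(own, from_neighbors)]
    return updated

identity = [[1.0, 0.0], [0.0, 1.0]]
graph = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
features = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.0, 3.0]}
print(gnn_layer(graph, features, identity, identity))

# The same weights can be applied to a node that was not present before,
# because the layer depends on features and neighbors, not on a per-node table.
graph_with_d = {"A": ["B", "C", "D"], "B": ["A"], "C": ["A"], "D": ["A"]}
features_with_d = dict(features, D=[2.0, 2.0])
print(gnn_layer(graph_with_d, features_with_d, identity, identity)["D"])
```

Because the weight matrices are shared across all nodes, the number of parameters does not grow with the size of the graph, and node features enter the computation directly.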


This notebook illustrates the shallow approach to embedding using _Node2vec_. An illustration of the deeper approach will follow shortly.

In [None]:
import pprint
import oasis



In [None]:
pp = pprint.PrettyPrinter()

## Retrieve tmp credentials from ArangoDB Tutorial Service
login = oasis.getTempCredentials(credentialProvider='https://tutorials.arangodb.cloud:8529/_db/_system/tutorialDB/tutorialDB', tutorialName="GraphEmbeddings")

## Connect to the temp database
test_db = oasis.connect_python_arango(login)

Requesting new temp credentials.
Temp database ready to use.


In [None]:
pp.pprint(login)

{'dbName': 'TUT11u9u12g91hwqq4ydbqtc',
 'hostname': 'tutorials.arangodb.cloud',
 'password': 'TUT5tnc2m3x7n9arxuzmyvl4h',
 'port': 8529,
 'username': 'TUTy9qypqnm8xgrlrpof2yr0c'}


In [None]:
if not test_db.has_graph("imdb_graph"):
    test_db.create_graph('imdb_graph', smart=True)

if not test_db.has_graph("user_user_graph"):
    test_db.create_graph("user_user_graph", smart=True)
    
imdb_graph = test_db.graph("imdb_graph")
user_user_graph = test_db.graph("user_user_graph")

if not test_db.has_collection("Users"):
    test_db.create_collection("Users", replication_factor=3)
if not imdb_graph.has_vertex_collection("Users"):
    imdb_graph.vertex_collection("Users")

if not test_db.has_collection("Movies"):
    test_db.create_collection("Movies", replication_factor=3)
if not imdb_graph.has_vertex_collection("Movies"):
    imdb_graph.vertex_collection("Movies")
    
if not test_db.has_collection("Ratings"):
    test_db.create_collection("Ratings", edge=True, replication_factor=3)
if not imdb_graph.has_edge_definition("Ratings"):
    ratings = imdb_graph.create_edge_definition(
        edge_collection='Ratings',
        from_vertex_collections=['Users'],
        to_vertex_collections=['Movies']
    )
if not test_db.has_collection("User_User_Sim"):
    test_db.create_collection("User_User_Sim", edge=True, replication_factor=3)
    
if not user_user_graph.has_edge_definition("User_User_Sim"):
    user_user_sim = user_user_graph.create_edge_definition(
        edge_collection='User_User_Sim',
        from_vertex_collections=['Users'],
        to_vertex_collections=['Users']
    )

In [None]:
! ./tools/arangorestore -c none --create-collection true --server.endpoint http+ssl://{login["hostname"]}:{login["port"]} --server.username {login["username"]} --server.database {login["dbName"]} --server.password {login["password"]} --default-replication-factor 3  --input-directory "./imdb_data/data/imdb_with_ratings"

2021-05-17T15:06:00Z [469] INFO [05c30] {restore} Connected to ArangoDB 'http+ssl://tutorials.arangodb.cloud:8529'
2021-05-17T15:06:00Z [469] INFO [abeb4] {restore} Database name in source dump is 'TUTdit9ohpgz1ntnbetsjstwi'
2021-05-17T15:06:00Z [469] INFO [9b414] {restore} # Re-creating document collection 'Movies'...
2021-05-17T15:06:01Z [469] INFO [9b414] {restore} # Re-creating document collection 'Users'...
2021-05-17T15:06:02Z [469] INFO [9b414] {restore} # Re-creating edge collection 'Ratings'...
2021-05-17T15:06:03Z [469] INFO [6d69f] {restore} # Dispatched 3 job(s), using 2 worker(s)
2021-05-17T15:06:03Z [469] INFO [94913] {restore} # Loading data into document collection 'Movies', data size: 68107 byte(s)
2021-05-17T15:06:03Z [469] INFO [94913] {restore} # Loading data into document collection 'Users', data size: 16717 byte(s)
2021-05-17T15:06:03Z [469] INFO [6ae09] {restore} # Successfully restored document 

In [None]:
con = {"username": login["username"], "password": login["password"],\
       "dbName": login["dbName"], "hostname": login["hostname"],\
       "port": login["port"], "protocol": "https"}

In [None]:
imdb_attributes = {'vertexCollections': {'Users': {},
                                         'Movies': {}},
                   'edgeCollections': {'Ratings': {'_from', '_to', 'ratings'}}}

In [None]:
from adbnx_adapter.imdb_arangoDB_networkx_adapter import IMDBArangoDB_Networkx_Adapter
ma = IMDBArangoDB_Networkx_Adapter(conn=con)
g = ma.create_networkx_graph(
    graph_name='IMDBGraph',  graph_attributes=imdb_attributes)

In [None]:
user_nodes = [n for n in g.nodes() if n.startswith("Users")]
movie_nodes = [n for n in g.nodes() if n.startswith("Movies")]

In [None]:
print("Number of Users are %d" % (len(user_nodes)))
print("Number of Movies are %d" % (len(movie_nodes)))
print("Number of Ratings are %d" % (len(list(g.edges()))))

Number of Users are 943
Number of Movies are 1682
Number of Ratings are 65499


In [None]:
import networkx as nx
B = nx.Graph()
B.add_nodes_from(user_nodes, bipartite=0)
B.add_nodes_from(movie_nodes, bipartite=1)
B.add_edges_from(list(g.edges()))

In [None]:
from networkx.algorithms import bipartite
cr = bipartite.clustering(B)
cu = {}
cm = {}
for k, v in sorted(cr.items(),reverse=True, key=lambda item: item[1]):
    if k.startswith("Users"):
        cu[k] = v
    else:
        cm[k] = v

del cr

In [None]:
t10cu = list(cu.keys())[:10]
proj_user = nx.bipartite.projected_graph(B, t10cu)

In [None]:
import json
%time
index = 0

batch = []
BATCH_SIZE = 500
batch_idx = 1
collection = test_db["User_User_Sim"]
# Store each undirected edge of the projection as two directed edges.
for e in proj_user.edges():
    batch.append({"_from": e[0], "_to": e[1]})
    batch.append({"_from": e[1], "_to": e[0]})
    index += 2
    if index % BATCH_SIZE == 0:
        print("Inserting batch %d" % (batch_idx))
        batch_idx += 1
        collection.import_bulk(batch)
        batch = []

# Flush the remainder after the loop, so the final partial batch is
# neither skipped nor inserted twice.
if len(batch) > 0:
    print("Inserting batch the last batch!")
    collection.import_bulk(batch)


CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.48 µs
Inserting batch 1
Inserting batch 2
Inserting batch 3
Inserting batch 4
Inserting batch 5
Inserting batch 6
Inserting batch 7
Inserting batch 8
Inserting batch 9
Inserting batch 10
Inserting batch 11
Inserting batch 12
Inserting batch 13
Inserting batch 14
Inserting batch 15
Inserting batch 16
Inserting batch 17
Inserting batch the last batch!
Inserting batch 18
Inserting batch 19
Inserting batch 20
Inserting batch 21
Inserting batch 22
Inserting batch 23
Inserting batch 24
Inserting batch 25
Inserting batch 26
Inserting batch 27
Inserting batch 28
Inserting batch 29
Inserting batch 30
Inserting batch 31
Inserting batch 32
Inserting batch 33
Inserting batch 34
Inserting batch 35


In [None]:
from node2vec import Node2Vec
node2vec = Node2Vec(proj_user, dimensions=64, walk_length=10, num_walks=10, workers=4)
model = node2vec.fit(window=10, min_count=1, batch_words=4)

HBox(children=(FloatProgress(value=0.0, description='Computing transition probabilities', max=914.0, style=Pro…




In [None]:
t10cu_emb = { n: list(map(float,model.wv.get_vector(n))) for n in proj_user.nodes()}

In [None]:
import json
%time
index = 0
collection = test_db["Users"]
batch = []
BATCH_SIZE = 500
batch_idx = 1
for u, e in t10cu_emb.items():
    # `u` is the full document id; it identifies the document to update.
    update_doc = {}
    update_doc['_id'] = u
    update_doc['n2v_emb'] = e
    batch.append(update_doc)
    index += 1
    if index % BATCH_SIZE == 0:
        print("Inserting batch %d" % (batch_idx))
        batch_idx += 1
        collection.update_many(batch)
        batch = []

# Flush the remainder after the loop so the last embedding is not dropped.
if len(batch) > 0:
    print("Inserting batch the last batch!")
    collection.update_many(batch)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.87 µs
Inserting batch 1
Inserting batch the last batch!
