# Using Node2Vec to Embed Viewers of a Movie using the ArangoDB IMDB NetworkX Adapter

This notebook shows how to use the ArangoDB IMDB NetworkX adapter to build a _node2vec_ embedding of the viewers of a movie from the _IMDB_ database.

## Install Required Libraries 

<a href="https://colab.research.google.com/github/arangoml/networkx-adapter/blob/dgl_updates/examples/IMDB_Networkx_Adapter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
%%capture
!git clone -b dgl_updates https://github.com/arangoml/networkx-adapter.git
!rsync -av networkx-adapter/examples/ ./ --exclude=.git
!pip3 install networkx
!pip3 install matplotlib
!pip3 install --index-url https://test.pypi.org/simple/ adbnx-adapter==0.0.0.2.5
!pip3 install pyarango
!pip3 install python-arango
!pip install node2vec

## Get an Oasis Connection

__Oasis__, the managed database service offering from ArangoDB, will be used for this exercise. This eliminates the need for setting up and configuring an instance of a database.

In [1]:
from adbnx_adapter.imdb_arangoDB_networkx_adapter import IMDBArangoDB_Networkx_Adapter
import oasis
con = oasis.getTempCredentials()

print()
print("https://{}:{}".format(con["hostname"], con["port"]))
print("Username: " + con["username"])
print("Password: " + con["password"])
print("Database: " + con["dbName"])


ma = IMDBArangoDB_Networkx_Adapter(conn=con)

Reusing cached credentials.

https://5904e8d8a65f.arangodb.cloud:8529
Username: TUTcsbu6z4dykg5cfcaulaq5
Password: TUTwutyp2bej9r6l08ey6nsm3
Database: TUTko876refvycdccp1b2ryv6


## Create the Collections for the Database 

In [2]:
import csv
import json
import requests
import sys
import oasis


from pyArango.connection import *
from pyArango.collection import Collection, Edges, Field
from pyArango.graph import Graph, EdgeDefinition
from pyArango.collection import BulkOperation

In [3]:
# Connect to the temp database
conn = oasis.connect_pyarango(con)
db = conn[con["dbName"]]

In [4]:
from pyArango.collection import Collection, Field
from pyArango.graph import Graph, EdgeDefinition


class Users(Collection):
    _fields = {
        "user_id": Field(),
        #         "age": Field(),
        #         "gender": Field()
    }


class Movies(Collection):
    _fields = {
        "movie_id": Field(),
        #         "movie_title": Field(),
        #         "release_date": Field()
    }


class Ratings(Edges):
    _fields = {
        # user_id and item_id are encoded by _from, _to
        # The field name matches the "ratings" key written when the data is loaded
        "ratings": Field(),
        #         "timestamp": Field()
    }


class IMDBGraph(Graph):
    _edgeDefinitions = [
        EdgeDefinition("Ratings", fromCollections=["Users"],
                       toCollections=["Movies"])
    ]
    _orphanedCollections = []


db.createCollection("Users")
db.createCollection("Movies")
db.createCollection("Ratings")
iMDBGraph = db.createGraph("IMDBGraph")

print("Collection/Graph Setup done.")

Collection/Graph Setup done.


## Load the Data 

In [5]:
collection = db["Users"]
with BulkOperation(collection, batchSize=100) as col:
    with open('data/users.csv', newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='|')
        # Skip header
        next(reader)
        for row in reader:
            # zip_code avoids shadowing the built-in zip()
            user_id, age, gender, occupation, zip_code = row
            doc = col.createDocument()
            doc["_key"] = user_id
#             doc["age"] = age
#             doc["gender"] = gender
            doc.save()

collection = db["Movies"]
with BulkOperation(collection, batchSize=100) as col:
    with open('data/movies.csv', newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='|')
        # Skip header
        next(reader)
        for row in reader:
            # Only movie_id is stored; the remaining columns are unpacked for reference
            (movie_id, movie_title, release_date, video_release_date, url,
             unknown, action, adventure, animation, childrens, comedy, crime,
             documentary, drama, fantasy, noir, horror, musical, mystery,
             romance, scifi, thriller, war, western) = row
            doc = col.createDocument()
            doc["_key"] = movie_id
#             doc["movie_title"] = movie_title
#             doc["release_date"] = release_date
            doc.save()

collection = db["Ratings"]
with BulkOperation(collection, batchSize=1000) as col:
    with open('data/ratings.csv', newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='|')
        # Skip header
        next(reader)
        for row in reader:
            user_id, movie_id, rating, timestamp = tuple(row)
            doc = col.createDocument()
            doc["_from"] = "Users/"+user_id
            doc["_to"] = "Movies/"+movie_id
            doc["ratings"] = rating
#             doc["timestamp"] = timestamp
            doc.save()

print("Import Done")

Import Done
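
The edge documents built in the ratings loop can be checked without a database. The following is a minimal sketch, assuming the column order used above (`user_id, movie_id, rating, timestamp`); the sample rows, the header names, and the helper name `ratings_to_edges` are hypothetical and only serve the illustration.

```python
import csv
import io

# Hypothetical sample rows in the same column order as data/ratings.csv
SAMPLE_RATINGS_CSV = """user_id,movie_id,rating,timestamp
1,4,3,881250949
2,4,5,891717742
"""


def ratings_to_edges(csv_text):
    """Turn ratings CSV text into ArangoDB-style edge dictionaries."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=',', quotechar='|')
    next(reader)  # skip the header row
    edges = []
    for user_id, movie_id, rating, _timestamp in reader:
        edges.append({
            "_from": "Users/" + user_id,    # document id of the user vertex
            "_to": "Movies/" + movie_id,    # document id of the movie vertex
            "ratings": rating,              # kept as a string, as csv yields it
        })
    return edges


edges = ratings_to_edges(SAMPLE_RATINGS_CSV)
print(edges[0])
```

Each edge points from a `Users/<id>` document to a `Movies/<id>` document, which is what makes the resulting graph bipartite.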


## Specify the Graph Structure 

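To use the IMDB NetworkX adapter, we need to specify the structure of the graph that we want to create. This is done with a dictionary that names the vertex collections and edge collections to load, along with the attributes to read from each. The dictionary and the call that creates the _NetworkX_ graph are shown below.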

In [6]:
imdb_attributes = {'vertexCollections': {'Users': {},
                                         'Movies': {}},
                   'edgeCollections': {'Ratings': {'_from', '_to', 'ratings'}}}

In [7]:
g = ma.create_networkx_graph(
    graph_name='IMDBGraph', graph_attributes=imdb_attributes)

## Inspect the 'Users' and 'Movies' Nodes

In [8]:
g.nodes['Users/2']

{'attr_dict': {'_id': 'Users/2'}, 'bipartite': 0}

In [9]:
g.nodes['Movies/4']

{'attr_dict': {'_id': 'Movies/4'}, 'bipartite': 1}

## Who are the viewers of 'Movies/4' ('Get Shorty')?

In [10]:
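# Ratings edges run from a user to a movie, so the source of every
# incoming edge of 'Movies/4' is a viewer of that movie
m4v = [t[0] for t in g.in_edges('Movies/4')]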

## How similar are viewers of the movie 'Get Shorty'?

The __Jaccard__ similarity is used for this purpose. We first form all pairs of users who have seen the movie and then compute the __Jaccard__ similarity for each pair: the number of movies both users rated divided by the number of movies either user rated. The details are shown below.

In [11]:
from itertools import combinations
m4vucmb = list(combinations(m4v, 2))

In [12]:
import networkx as nx
gp = g.to_undirected()
jcp = nx.jaccard_coefficient(gp, m4vucmb)
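
To make the similarity measure concrete, here is a minimal pure-Python sketch of the quantity that `nx.jaccard_coefficient` computes for a pair of nodes: the size of the intersection of their neighbor sets divided by the size of the union. In the undirected graph, a user's neighbors are the movies that user rated. The toy data and the helper name `jaccard` are hypothetical.

```python
from itertools import combinations


def jaccard(neighbors, u, v):
    """Return |N(u) & N(v)| / |N(u) | N(v)|, or 0.0 if both sets are empty."""
    union = neighbors[u] | neighbors[v]
    if not union:
        return 0.0
    return len(neighbors[u] & neighbors[v]) / len(union)


# Hypothetical toy data: each user maps to the set of movies they rated
rated = {
    "Users/1": {"Movies/4", "Movies/7", "Movies/9"},
    "Users/2": {"Movies/4", "Movies/7"},
    "Users/3": {"Movies/4"},
}

# Score every unordered pair of viewers, as the notebook does for 'Movies/4'
scores = {
    (u, v): jaccard(rated, u, v)
    for u, v in combinations(sorted(rated), 2)
}
for pair, score in scores.items():
    print(pair, round(score, 3))
```

Because every viewer in the notebook has rated 'Movies/4', each pair shares at least one movie, so every score is strictly positive.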

## Create a complete sub-graph for the viewers of 'Get Shorty' using the Jaccard Similarity for the edge weights

In [13]:
gs = nx.DiGraph()
# jcp yields one (u, v, score) triple per viewer pair; each pair becomes
# a single weighted edge u -> v
for u, v, p in jcp:
    gs.add_edge(u, v, weight=p)

## How many edges does the complete sub-graph have?

In [14]:
gs.number_of_edges()

9453
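
The edge count can be cross-checked with a little arithmetic: a complete graph with one edge per unordered pair of n nodes has n(n-1)/2 edges. The sketch below assumes n = 138, the node count that appears in the Node2Vec progress output later in this notebook; the placeholder node IDs are hypothetical.

```python
from itertools import combinations
from math import comb

n_viewers = 138  # node count taken from the Node2Vec progress output

# Hypothetical placeholder IDs; only the number of viewers matters here
placeholder_viewers = [f"Users/{i}" for i in range(n_viewers)]
pairs = list(combinations(placeholder_viewers, 2))

# Both expressions count the unordered pairs: n * (n - 1) / 2
print(len(pairs), comb(n_viewers, 2))
```

Both values come out to 138 * 137 / 2 = 9453, which matches the reported number of edges.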

## Embed the sub-graph using Node2vec 

In [15]:
from node2vec import Node2Vec
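# 32-dimensional embeddings, learned from 30 random walks of up to
# 100 nodes started at every node
node2vec = Node2Vec(gs, dimensions=32, walk_length=100,
                    num_walks=30, workers=4)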

Computing transition probabilities: 100%|██████████| 138/138 [00:01<00:00, 112.89it/s]
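
For intuition about what the walks look like, here is a simplified pure-Python sketch of a weighted random walk. It assumes the return and in-out parameters are left at p = q = 1 (as in the call above, which does not set them), in which case the node2vec step reduces to picking the next node with probability proportional to the edge weight. This is not the library's implementation; the adjacency data and the helper name `weighted_walk` are hypothetical.

```python
import random


def weighted_walk(adj, start, walk_length, rng):
    """Walk up to walk_length nodes, choosing neighbors by edge weight."""
    walk = [start]
    while len(walk) < walk_length:
        nbrs = adj.get(walk[-1])
        if not nbrs:
            break  # dead end: the current node has no outgoing edges
        nodes = list(nbrs)
        weights = [nbrs[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


# Hypothetical similarity-weighted adjacency between three viewers
adj = {
    "Users/1": {"Users/2": 0.67, "Users/3": 0.33},
    "Users/2": {"Users/1": 0.67, "Users/3": 0.50},
    "Users/3": {"Users/1": 0.33, "Users/2": 0.50},
}

rng = random.Random(0)  # seeded so repeated runs produce the same walks
walks = [weighted_walk(adj, node, 5, rng) for node in adj for _ in range(2)]
for walk in walks:
    print(" -> ".join(walk))
```

The resulting walks play the role of sentences: the embedding model is trained on them so that nodes appearing in similar walk contexts end up with similar vectors.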


In [16]:
# Train the Word2Vec (skip-gram) model on the generated walks
model = node2vec.fit(window=10, min_count=1, batch_words=4)

In [17]:
# Viewers whose embeddings are closest to that of the sixth viewer in m4v
model.wv.most_similar(m4v[5])

[('Users/532', 0.9992837905883789),
 ('Users/339', 0.9991508722305298),
 ('Users/301', 0.9991448521614075),
 ('Users/109', 0.9991329908370972),
 ('Users/327', 0.9991060495376587),
 ('Users/271', 0.9991012215614319),
 ('Users/457', 0.9990988373756409),
 ('Users/406', 0.9990825653076172),
 ('Users/144', 0.9990816116333008),
 ('Users/771', 0.9990763664245605)]
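
The scores returned by `most_similar` are cosine similarities between embedding vectors. The following is a minimal sketch of that ranking in plain Python, using hypothetical 3-dimensional vectors in place of the 32-dimensional embeddings learned above; the helper names `cosine_similarity` and `most_similar` are illustrative and not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(embeddings, key, topn=10):
    """Rank all other keys by cosine similarity to embeddings[key]."""
    scores = [
        (other, cosine_similarity(embeddings[key], vec))
        for other, vec in embeddings.items()
        if other != key
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


# Hypothetical embeddings for three viewers
embeddings = {
    "Users/1": [1.0, 0.0, 0.0],
    "Users/2": [0.9, 0.1, 0.0],
    "Users/3": [0.0, 1.0, 0.0],
}

ranked = most_similar(embeddings, "Users/1", topn=2)
for name, score in ranked:
    print(name, round(score, 4))
```

A score close to 1 means the two vectors point in nearly the same direction, which is why the top results in the notebook output all sit just below 1.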