# Node embeddings with the `graphdatascience` driver

In this notebook, we try to recommend movies based on the ratings that users gave them.

In [1]:
import numpy as np
from numpy.linalg import norm
from graphdatascience import GraphDataScience

### Connection

Here we connect to our database.

In [2]:
host = "bolt://127.0.0.1:7687"
user = "neo4j"
password = "1234"

In [3]:
gds = GraphDataScience(host, auth=(user, password))

### Projection

Here we define our projection. We only need the nodes labeled `Movie` and `User`. For the relationships, we are only interested in `RATED` and its `grade` property, which holds the grade that the user gave to the movie.

Before projecting, we estimate how much memory the operation will require. We will do this before every expensive step, to avoid starting a computation that our machine cannot handle.

In [4]:
nodes = ["User", "Movie"]
relationships = {"RATED": {"orientation": "UNDIRECTED", "properties": "grade"}}

result = gds.graph.project.estimate(nodes, relationships)

print("Required memory : " + result['requiredMemory'])


Required memory : [5843 KiB ... 5907 KiB]


Because the required memory is relatively low, we go ahead and compute the projection.
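The "is it low enough?" decision can also be made in code rather than by eye. The following is a minimal sketch, assuming the estimate result exposes a numeric `bytesMax` field next to `requiredMemory` (worth checking against the output of your GDS version). The helper `fits_in_budget`, the stand-in estimate and the 1 GiB budget are hypothetical and only serve as illustration:

```python
def fits_in_budget(estimate, budget_bytes):
    """Return True if the estimated upper bound fits in the given budget.

    `estimate` can be anything indexable by key (a dict or a pandas Series).
    Assumption: it carries a numeric `bytesMax` entry.
    """
    return int(estimate["bytesMax"]) <= budget_bytes


# Stand-in for an estimate result; the value is derived from the
# 5907 KiB upper bound printed above, not read from a real call.
example_estimate = {
    "requiredMemory": "[5843 KiB ... 5907 KiB]",
    "bytesMax": 5907 * 1024,
}

one_gib = 1024 ** 3  # hypothetical memory budget
print(fits_in_budget(example_estimate, one_gib))
```

With a real result, one would pass `result` from the estimate call instead of the stand-in dictionary.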

In [5]:
G, result = gds.graph.project("ratings", nodes, relationships)

# We can use convenience methods on `G` to check if the projection looks correct
print(G.node_count(), "nodes created")

10352 nodes created


### FastRP

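Now we want to compute an embedding for each movie. We start with the FastRP algorithm because it matches our idea of closeness between movies: FastRP gives similar embeddings to nodes with similar graph neighborhoods, which here means movies that were rated by many of the same users.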
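To make this notion of closeness more concrete, here is a heavily simplified NumPy sketch of the idea behind FastRP: start from sparse random vectors and repeatedly average them over weighted neighbours. It is an illustration under stated assumptions, not the GDS implementation. The function `toy_fastrp`, the toy graph and the iteration weights are all made up for this example:

```python
import numpy as np


def toy_fastrp(adjacency, dim, iteration_weights, seed=42):
    """Simplified FastRP-style embedding (illustration only, not the GDS code)."""
    rng = np.random.default_rng(seed)
    n_nodes = adjacency.shape[0]

    # Very sparse random projection: each entry is -sqrt(3), 0 or +sqrt(3)
    values = np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])
    random_vectors = rng.choice(values, size=(n_nodes, dim), p=[1 / 6, 2 / 3, 1 / 6])

    # Row-normalise so each step is a weighted average over a node's neighbours
    degrees = adjacency.sum(axis=1, keepdims=True)
    transition = adjacency / np.maximum(degrees, 1e-12)

    embedding = np.zeros((n_nodes, dim))
    current = random_vectors
    for weight in iteration_weights:
        current = transition @ current  # average the neighbours' vectors
        norms = np.linalg.norm(current, axis=1, keepdims=True)
        embedding += weight * current / np.maximum(norms, 1e-12)
    return embedding


# Toy graph: nodes 0-1 are users, nodes 2-4 are movies; entries are grades
adjacency = np.zeros((5, 5))
ratings = [(0, 2, 5.0), (1, 2, 4.0), (0, 3, 5.0), (1, 3, 4.0), (0, 4, 1.0)]
for user, movie, grade in ratings:
    adjacency[user, movie] = adjacency[movie, user] = grade  # undirected

toy_embedding = toy_fastrp(adjacency, dim=4, iteration_weights=[1.0, 1.0])

# Movies 2 and 3 received the same grades from the same users
print(np.allclose(toy_embedding[2], toy_embedding[3]))
```

In this toy graph, the two movies that were rated identically by both users end up with the same vector, while the third movie, rated by a single user, is free to differ.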

In [6]:
result = gds.fastRP.write.estimate(
    G,
    writeProperty="embedding",
    embeddingDimension=10,
    relationshipWeightProperty="grade",
)

print(result['requiredMemory'])

2183 KiB


Here we run the algorithm in `write` mode, which writes each embedding back to the database as a node property (`embedding`). We need that so we can fetch the vectors with a Cypher query in this notebook.

In [7]:
result = gds.fastRP.write(
    G,
    writeProperty="embedding",
    randomSeed=42,
    embeddingDimension=10,
    relationshipWeightProperty="grade"
)

# Let's make sure we got an embedding for each node
print("Number of embedding vectors :", result['nodePropertiesWritten'])

Number of embedding vectors : 10352


Here we load the vectors into a pandas `DataFrame`.

In [8]:
vectors = gds.run_cypher("""
MATCH (m:Movie)
RETURN m.title AS movie, m.embedding AS embedding""")

In [9]:
vectors.head(10)

| | movie | embedding |
|---|---|---|
| 0 | Toy Story (1995) | [0.39723098278045654, -0.5194790363311768, 0.3... |
| 1 | Jumanji (1995) | [0.3533506989479065, -0.6616520285606384, 0.29... |
| 2 | Grumpier Old Men (1995) | [0.1305207908153534, -0.5875227451324463, 0.28... |
| 3 | Waiting to Exhale (1995) | [0.470040887594223, -0.3561713695526123, 0.127... |
| 4 | Father of the Bride Part II (1995) | [0.17649927735328674, -0.4792723059654236, 0.5... |
| 5 | Heat (1995) | [0.37821003794670105, -0.4955470860004425, 0.3... |
| 6 | Sabrina (1995) | [0.33683887124061584, -0.4300536811351776, 0.3... |
| 7 | Tom and Huck (1995) | [0.1818910837173462, -0.3456726372241974, 0.26... |
| 8 | Sudden Death (1995) | [-0.07396924495697021, -0.41880151629447937, 0... |
| 9 | GoldenEye (1995) | [0.1819116771221161, -0.6005917191505432, 0.38... |


### Recommending movies (Finally ;) 

Here we'll try to find the movies most similar to Toy Story.

First we build an embedding matrix: we stack all the embedding vectors into one matrix of size $n_{\text{movies}} \times \dim(\text{embedding})$, with one row per movie and one column per embedding dimension (10 here).

In [11]:
embedding_matrix = np.array(vectors["embedding"].tolist()) 

In [12]:
embedding_matrix[:1]

array([[ 0.39723098, -0.51947904,  0.34273216, -0.1783711 , -0.01867387,
        -0.01733829,  0.3574723 ,  0.26418304, -0.87213004, -0.80410951]])

In [13]:
toy_story_embedding = embedding_matrix[0]

Now we compute the cosine similarity between the Toy Story vector and every other vector: $\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|}$

In [14]:
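# Cosine similarity between Toy Story and every movie; the small constant avoids division by zero
cos_similarites = np.array([
    np.dot(toy_story_embedding, movie_embedding) / (norm(toy_story_embedding) * norm(movie_embedding) + 0.0000000000001)
    for movie_embedding in embedding_matrix])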
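The same computation can also be written without a Python loop, which tends to matter once the number of movies grows. Below is a minimal sketch on made-up data. The helper `cosine_to_all`, its `epsilon` default and `toy_matrix` are hypothetical names introduced for this example only:

```python
import numpy as np
from numpy.linalg import norm


def cosine_to_all(query, matrix, epsilon=1e-13):
    """Cosine similarity between `query` and every row of `matrix`."""
    dot_products = matrix @ query      # one dot product per row
    row_norms = norm(matrix, axis=1)   # one norm per row
    # epsilon keeps the division defined for all-zero rows
    return dot_products / (row_norms * norm(query) + epsilon)


# Four made-up 2-dimensional "embeddings"; the last one is all zeros
toy_matrix = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [0.0, 0.0]])
similarities = cosine_to_all(toy_matrix[0], toy_matrix)
print(np.round(similarities, 3))
```

For the first row as the query, the similarities are about 1 (itself), 0 (orthogonal vector), 0.707 (45 degree angle) and 0 (zero vector, handled by the small constant).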

We sort the similarities in decreasing order and keep only the 10 most similar movies. Toy Story itself comes first, since its similarity with itself is 1.

In [16]:
most_similar = np.argsort(-cos_similarites)[:10]

In [17]:
vectors.iloc[most_similar]

| | movie | embedding |
|---|---|---|
| 0 | Toy Story (1995) | [0.39723098278045654, -0.5194790363311768, 0.3... |
| 322 | Lion King, The (1994) | [0.3312600553035736, -0.6245747804641724, 0.33... |
| 615 | Independence Day (a.k.a. ID4) (1996) | [0.3732811510562897, -0.5146379470825195, 0.31... |
| 512 | Beauty and the Beast (1991) | [0.319600373506546, -0.6198705434799194, 0.335... |
| 4208 | Last Unicorn, The (1982) | [0.26970577239990234, -0.46319907903671265, 0.... |
| 43 | Seven (a.k.a. Se7en) (1995) | [0.3310205042362213, -0.5267654061317444, 0.27... |
| 618 | Hunchback of Notre Dame, The (1996) | [0.3476608991622925, -0.58438640832901, 0.4309... |
| 18 | Ace Ventura: When Nature Calls (1995) | [0.23379650712013245, -0.628512442111969, 0.34... |
| 506 | Aladdin (1992) | [0.30380168557167053, -0.5830289125442505, 0.3... |
| 546 | Mission: Impossible (1996) | [0.34676864743232727, -0.49239689111709595, 0.... |
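Because the query movie always ranks first against itself, it takes up one of the ten slots. One possible way to leave it out is sketched below on made-up similarity values. `top_k_excluding` is a hypothetical helper, and it assumes that `k` is smaller than the number of movies:

```python
import numpy as np


def top_k_excluding(similarities, k, exclude_index):
    """Indices of the k highest similarities, skipping `exclude_index`.

    Assumption: k is smaller than len(similarities).
    """
    scores = np.asarray(similarities, dtype=float).copy()
    scores[exclude_index] = -np.inf  # the query can never be selected
    return np.argsort(-scores)[:k]


# Made-up similarities; index 0 plays the role of the query movie
toy_similarities = np.array([1.0, 0.2, 0.9, 0.5])
top_two = top_k_excluding(toy_similarities, k=2, exclude_index=0)
print(top_two)
```

Here the two returned indices are 2 and 3, the highest-scoring entries once the query at index 0 is ignored.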


Conclusion: for such a simple model, the results look reasonable. Apart from Toy Story itself, the closest movie is The Lion King, which is a sensible match, and several other animated films appear in the top 10 (Beauty and the Beast, The Last Unicorn, The Hunchback of Notre Dame, Aladdin). Nonetheless, some of the returned movies have little in common with Toy Story, Independence Day (ID4) for example.

In [16]:
G.drop()

graphName                                                      ratings
database                                                         neo4j
memoryUsage                                                           
sizeInBytes                                                         -1
nodeCount                                                        10352
relationshipCount                                               201672
configuration        {'relationshipProjection': {'RATED': {'orienta...
density                                                       0.001882
creationTime                       2023-01-16T22:01:09.903689000+01:00
modificationTime                   2023-01-16T22:01:10.059144000+01:00
schema               {'graphProperties': {}, 'relationships': {'RAT...
Name: 0, dtype: object