# Introduction to Graph Databases

The first part of this assignment gives you hands-on experience with graph databases. You will start by setting up an in-memory graph database, for which the support code is already written. Once the database is running, you will execute queries of increasing complexity and explore how nodes and the edges that connect them are stored and retrieved. Along the way, you will gain practical insight into graph database concepts such as connectivity and traversal, and into querying with a graph-specific language.

In [1]:
import os
import sys

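# Add the parent directory to the import path so that `utils` can be found.
# Notebooks do not define __file__, so the path is resolved relative to the working directory.
sys.path.append(os.path.abspath('..'))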
from utils import setup_database, download_sample_data

In [2]:
# Download the MovieLens sample data for the Kuzu example
data_dir = '../data'
download_sample_data(data_dir, urls=[
    "https://kuzudb.com/data/movie-lens/movies.csv",
    "https://kuzudb.com/data/movie-lens/users.csv",
    "https://kuzudb.com/data/movie-lens/ratings.csv",
    "https://kuzudb.com/data/movie-lens/tags.csv"
])

# Set up the Kuzu database connection (any existing database at this path is removed)
connection = setup_database('../tmp', delete_existing=True)

# Create schema
connection.execute('CREATE NODE TABLE Movie (movieId INT64, year INT64, title STRING, genres STRING, PRIMARY KEY (movieId))')
connection.execute('CREATE NODE TABLE User (userId INT64, PRIMARY KEY (userId))')
connection.execute('CREATE REL TABLE Rating (FROM User TO Movie, rating DOUBLE, timestamp INT64)')
connection.execute('CREATE REL TABLE Tags (FROM User TO Movie, tag STRING, timestamp INT64)')

# Insert data
connection.execute(f'COPY Movie FROM "{data_dir}/movies.csv" (HEADER=TRUE)')
connection.execute(f'COPY User FROM "{data_dir}/users.csv" (HEADER=TRUE)')
connection.execute(f'COPY Rating FROM "{data_dir}/ratings.csv" (HEADER=TRUE)')
connection.execute(f'COPY Tags FROM "{data_dir}/tags.csv" (HEADER=TRUE)')


Downloading sample data
Downloading https://kuzudb.com/data/movie-lens/movies.csv...
Saved https://kuzudb.com/data/movie-lens/movies.csv to ../data/movies.csv
Downloading https://kuzudb.com/data/movie-lens/users.csv...
Saved https://kuzudb.com/data/movie-lens/users.csv to ../data/users.csv
Downloading https://kuzudb.com/data/movie-lens/ratings.csv...
Saved https://kuzudb.com/data/movie-lens/ratings.csv to ../data/ratings.csv
Downloading https://kuzudb.com/data/movie-lens/tags.csv...
Saved https://kuzudb.com/data/movie-lens/tags.csv to ../data/tags.csv
Sample data downloaded successfully
Loading graph database
Removing existing database at ../tmp


## Running Queries

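Now that your graph database is set up, you can begin querying it. This section contains seven queries of increasing complexity. In each cell, replace the empty string passed to `connection.execute` with your query; the table shown under each cell is the result your query should produce.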

In [None]:
# Query 1: Match all nodes with the label 'Movie' and return them. Limit your results to 25.
result = connection.execute("")

df = result.get_as_df()
df.head()

Unnamed: 0,m
0,"{'_id': {'offset': 0, 'table': 0}, '_label': '..."
1,"{'_id': {'offset': 1, 'table': 0}, '_label': '..."
2,"{'_id': {'offset': 2, 'table': 0}, '_label': '..."
3,"{'_id': {'offset': 3, 'table': 0}, '_label': '..."
4,"{'_id': {'offset': 4, 'table': 0}, '_label': '..."


In [None]:
# Query 2: Match all nodes with the label 'Movie' and return the nodes connected to them. Limit your results to 50.
result = connection.execute("")

df = result.get_as_df()
df.head()

Unnamed: 0,p
0,"{'_nodes': [{'_id': {'offset': 4210, 'table': ..."
1,"{'_nodes': [{'_id': {'offset': 4212, 'table': ..."
2,"{'_nodes': [{'_id': {'offset': 4215, 'table': ..."
3,"{'_nodes': [{'_id': {'offset': 4253, 'table': ..."
4,"{'_nodes': [{'_id': {'offset': 4256, 'table': ..."


In [None]:
# Query 3: Count the total number of nodes in the database
# Hint: Use the `COUNT` function to count the number of nodes
result = connection.execute("")

df = result.get_as_df()
df.head()

Unnamed: 0,TotalNodes
0,10352


In [None]:
# Query 4: Match all nodes with the label 'User'. Count the degree of each user (the number of movies the user rated), keep only the users who rated more than 3 movies, and return each user together with its degree.
# Hint: First find all users and their ratings, then count the degree, and finally filter the results to only include users with more than 3 ratings
result = connection.execute("")

df = result.get_as_df()
df.head()

Unnamed: 0,u,degree
0,"{'_id': {'offset': 591, 'table': 1}, '_label':...",94
1,"{'_id': {'offset': 607, 'table': 1}, '_label':...",831
2,"{'_id': {'offset': 181, 'table': 1}, '_label':...",977
3,"{'_id': {'offset': 601, 'table': 1}, '_label':...",135
4,"{'_id': {'offset': 169, 'table': 1}, '_label':...",50
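To make the idea behind Query 4 concrete, here is a minimal plain-Python sketch of "count the degree, then filter". It does not use the database: the `ratings` list is a hypothetical toy edge list invented for illustration, where each `(user_id, movie_id)` pair stands in for one `Rating` edge.

```python
from collections import Counter

# Hypothetical toy data: each (user_id, movie_id) pair stands in for one Rating edge.
ratings = [
    (1, 10), (1, 11), (1, 12), (1, 13),
    (2, 10), (2, 11),
    (3, 10), (3, 11), (3, 12), (3, 13), (3, 14),
]

# A user's degree here is the number of outgoing Rating edges.
degree = Counter(user_id for user_id, _ in ratings)

# Keep only the users with more than 3 ratings.
active_users = {user_id: d for user_id, d in degree.items() if d > 3}
print(active_users)
```

In a graph query the same two steps appear as an aggregation followed by a filter on the aggregated value, which is why the hint suggests counting first and filtering afterwards.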


In [None]:
# Query 5: Match all nodes with the label 'Movie'. Each node has a 'genres' attribute. Count the number of movie nodes for each distinct 'genres' value.
# Hint: Use the `WITH` clause to group by genres and count the number of movies
result = connection.execute("")

df = result.get_as_df()
df.head()

Unnamed: 0,genres,MovieCount
0,Drama,1058
1,Comedy,950
2,Comedy|Drama,435
3,Comedy|Romance,363
4,Drama|Romance,349
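The output above contains rows such as `Comedy|Drama`, which suggests that the count is taken over the whole `genres` string rather than over individual genres. The following plain-Python sketch contrasts the two groupings; the movie list is hypothetical toy data, not taken from the dataset.

```python
from collections import Counter

# Hypothetical toy data: (title, genres) with genres stored as a pipe-delimited string.
movies = [
    ("Movie A", "Drama"),
    ("Movie B", "Comedy|Drama"),
    ("Movie C", "Drama"),
    ("Movie D", "Comedy"),
]

# Grouping on the whole string: "Comedy|Drama" is its own group.
by_genres_string = Counter(genres for _, genres in movies)

# Grouping on individual genres: split the string first (shown only for contrast).
by_single_genre = Counter(g for _, genres in movies for g in genres.split("|"))

print(by_genres_string)
print(by_single_genre)
```

Query 5 asks for the first grouping; the second would need the string to be split before counting.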


In [None]:
# Query 6: Match the 'User' and 'Movie' nodes and the 'Rating' edges from users to movies. Each 'Rating' edge has a rating. Find the top 10 movies by average rating score.
# Hint: Use the `AVG` function to calculate an average. Use the `ORDER BY` clause to sort the movies by average rating in descending order
result = connection.execute("")

df = result.get_as_df()
df.head()

Unnamed: 0,title,avg_rating
0,Cherish (2002),5.0
1,Madame Sousatzka (1988),5.0
2,Jane Eyre (1944),5.0
3,Return to Treasure Island (1988),5.0
4,Vampire in Venice (Nosferatu a Venezia) (Nosfe...,5.0
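As a rough illustration of the aggregation in Query 6, the sketch below computes an average rating per movie and keeps the highest-ranked titles. It is plain Python over hypothetical toy data, not the query itself.

```python
from collections import defaultdict

# Hypothetical toy data: (title, rating) pairs standing in for Rating edges.
ratings = [
    ("Movie A", 5.0),
    ("Movie B", 4.0), ("Movie B", 5.0), ("Movie B", 4.5),
    ("Movie C", 3.0), ("Movie C", 2.0),
]

# Collect all ratings per movie.
scores = defaultdict(list)
for title, rating in ratings:
    scores[title].append(rating)

# Average per movie, then sort in descending order and keep the top 2.
avg_rating = {title: sum(values) / len(values) for title, values in scores.items()}
top = sorted(avg_rating.items(), key=lambda item: item[1], reverse=True)[:2]
print(top)
```

Note that "Movie A" ranks first on the strength of a single rating. A plain average does not account for how many ratings a movie received, which may explain why every row in the output above has an average of 5.0.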


In [None]:
# Query 7: Match the 'User' and 'Movie' nodes and the 'Rating' edges from users to movies. Find the pairs of movies most often rated by the same users, together with the number of users they have in common.
result = connection.execute("")

df = result.get_as_df()
df.head()

Unnamed: 0,m1.title,m2.title,CommonUsers
0,"Shawshank Redemption, The (1994)",Forrest Gump (1994),231
1,Pulp Fiction (1994),Forrest Gump (1994),230
2,Pulp Fiction (1994),"Shawshank Redemption, The (1994)",222
3,Pulp Fiction (1994),"Silence of the Lambs, The (1991)",207
4,Forrest Gump (1994),"Silence of the Lambs, The (1991)",199
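Query 7 is a co-occurrence count: two movies form a pair whenever one user rated both. The sketch below shows that logic in plain Python over hypothetical toy data; it is an illustration of the counting, not a solution in the query language.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy data: (user_id, title) pairs standing in for Rating edges.
ratings = [
    (1, "Movie A"), (1, "Movie B"), (1, "Movie C"),
    (2, "Movie A"), (2, "Movie B"),
    (3, "Movie B"), (3, "Movie C"),
]

# Group the rated movies by user.
movies_by_user = defaultdict(set)
for user_id, title in ratings:
    movies_by_user[user_id].add(title)

# For each user, count every unordered pair of movies once.
# Sorting the titles gives each pair a fixed order, so (A, B) and (B, A) are not counted separately.
pair_counts = Counter()
for titles in movies_by_user.values():
    for pair in combinations(sorted(titles), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common())
```

The same concern applies to a graph query: without some ordering condition on the two movies, each pair would appear twice, and a movie could be paired with itself.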
