![arangodb](https://github.com/arangodb/interactive_tutorials/blob/master/notebooks/img/ArangoDB_logo.png?raw=1)

# Analyzing Movie Popularity Through Graph Algorithms on ArangoDB

---



<a href="https://colab.research.google.com/github/arangodb/interactive_tutorials/blob/master/notebooks/Pregel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Movie Influence and Popularity Analysis Using Graphs and ArangoDB**

---




This project demonstrates how to integrate the MovieLens dataset with the multi-model database ArangoDB to perform graph data analysis. Using data on movies, users, ratings, and tags, we build a directed graph in which users are connected to movies by edges weighted with their ratings.

The pipeline begins with downloading and extracting the MovieLens CSV files, followed by batch inserting vertex collections (movies and users) and edge collections (ratings and tags) into ArangoDB. After loading the data, a graph is created in ArangoDB defining the relationships between users and movies.

Using the NetworkX library, the graph is reconstructed in Python from ArangoDB data, with edges weighted by rating values. We then apply the PageRank algorithm to measure the relative influence of movies in the graph, considering both connection structure and rating weights.

To enhance the analysis, we combine PageRank with additional metrics computed directly in ArangoDB, such as total number of ratings and average rating per movie. After normalizing these indicators, we generate a weighted final score ranking movies by popularity and relevance within the network.
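As a small worked illustration of the normalization and weighting step, the sketch below applies min-max scaling and the 50% / 30% / 20% weights used later in this notebook to three made-up movies. The titles and metric values are hypothetical and only serve to show the arithmetic.

```python
def min_max(values):
    """Scale a list of numbers to the range [0, 1]; constant lists map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical per-movie metrics (not taken from the dataset)
titles = ["Movie A", "Movie B", "Movie C"]
pagerank = [0.004, 0.002, 0.001]
rating_count = [300, 100, 20]
avg_rating = [4.0, 4.5, 3.0]

pr_norm = min_max(pagerank)
count_norm = min_max(rating_count)
avg_norm = min_max(avg_rating)

# Weighted sum: 50% PageRank, 30% rating count, 20% average rating
scores = [0.5 * p + 0.3 * c + 0.2 * a
          for p, c, a in zip(pr_norm, count_norm, avg_norm)]

ranking = sorted(zip(titles, scores), key=lambda item: item[1], reverse=True)
for title, score in ranking:
    print(f"{title}: {score:.3f}")
```

Because every metric is scaled to [0, 1] first, the weights express relative importance directly. Note that Movie B has the best average rating but still ranks second, since average rating carries the smallest weight.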

The final output is a list of the top 10 most influential movies, taking into account both structural influence in the graph (PageRank) and the volume and quality of ratings received. This approach shows how ArangoDB's graph storage and AQL aggregations can be combined with Python graph algorithms to extract insights from a connected dataset of about 100,000 ratings.

# Setup

This project is designed to run in a Python environment, such as Jupyter Notebook or Google Colab, and requires several libraries and dependencies for data processing, graph construction, and database interaction.

The setup cell upgrades the Python package manager (pip) and installs the libraries used below:

- python-arango and pyArango (ArangoDB client libraries)

- pandas

- networkx

- numpy (installed as a dependency of pandas)

In [1]:
%%capture
# Update
!pip3 install --upgrade pip

# Install  pandas, networkx, arango client libs
!pip3 install --upgrade python-arango pyarango pandas networkx


# Clone repo
!git clone https://github.com/arangodb/interactive_tutorials.git -b oasis_connector --single-branch
!rsync -av interactive_tutorials/ ./ --exclude=.git


In [2]:
import oasis
from pyArango.connection import Connection
import pandas as pd
import zipfile
import requests
import os
from tqdm import tqdm
from arango import ArangoClient
import networkx as nx
import numpy as np


Create the temporary database:

In [3]:
# Retrieve tmp credentials from ArangoDB Tutorial Service
login = oasis.getTempCredentials(tutorialName="Movielens", credentialProvider='https://tutorials.arangodb.cloud:8529/_db/_system/tutorialDB/tutorialDB')

# Connect to the temp database
db = oasis.connect_python_arango(login)

Requesting new temp credentials.
Temp database ready to use.


In [4]:
print("https://{}:{}".format(login["hostname"], login["port"]))
print("Username: " + login["username"])
print("Password: " + login["password"])
print("Database: " + login["dbName"])

https://tutorials.arangodb.cloud:8529
Username: TUTpgayvplrllcuf8n46zhz3i
Password: TUT5mpnyqh8fn880lgt1ihq2v
Database: TUTg7c78306nb9p0eeovgcyu


Feel free to use the above URL to check out the UI!

## Import Data

Import the MovieLens dataset:

In [5]:
# Small dataset URL
url = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"

# Zip file path
zip_path = "/content/ml-latest-small.zip"
if not os.path.exists(zip_path):
    print("Downloading dataset...")
    r = requests.get(url, timeout=60)
    r.raise_for_status()  # stop early if the download failed
    with open(zip_path, "wb") as f:
        f.write(r.content)
else:
    print("Dataset already downloaded.")

# Extract files (ml-latest-small/)
print("Extracting files...")
with zipfile.ZipFile(zip_path, "r") as zip_ref:
    zip_ref.extractall("/content")

# Correct paths inside the subfolder
extract_folder = "/content/ml-latest-small"
movies_path = os.path.join(extract_folder, "movies.csv")
ratings_path = os.path.join(extract_folder, "ratings.csv")
tags_path = os.path.join(extract_folder, "tags.csv")

# Read complete datasets
print("Reading CSV files...")
movies_df = pd.read_csv(movies_path)
ratings_df = pd.read_csv(ratings_path)
tags_df = pd.read_csv(tags_path)

print(f"Total movies: {movies_df.shape[0]}")
print(f"Total ratings: {ratings_df.shape[0]}")


Downloading dataset...
Extracting files...
Reading CSV files...
Total movies: 9742
Total ratings: 100836


Insert the dataset into ArangoDB:

In [6]:
# Connection data taken from the temporary credentials retrieved above
# (replace with your own values if you are not using the tutorial service)
HOST = "https://{}:{}".format(login["hostname"], login["port"])
DB_NAME = login["dbName"]
USERNAME = login["username"]
PASSWORD = login["password"]

# Set up connection
client = ArangoClient(hosts=HOST)
db = client.db(DB_NAME, username=USERNAME, password=PASSWORD)

# Create collections if they don't exist
if not db.has_collection('movies'):
    db.create_collection('movies')
if not db.has_collection('users'):
    db.create_collection('users')
if not db.has_collection('ratings'):
    db.create_collection('ratings', edge=True)
if not db.has_collection('tags'):
    db.create_collection('tags', edge=True)

movies_collection = db.collection('movies')
users_collection = db.collection('users')
ratings_edge_collection = db.collection('ratings')
tags_edge_collection = db.collection('tags')

# Function to split lists into chunks
def chunks(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

# Insert movies in batches
print("Inserting movies in batches...")
movies = [{
    "_key": str(row['movieId']),
    "title": row['title'],
    "genres": row['genres']
} for _, row in movies_df.iterrows()]

for batch in tqdm(list(chunks(movies, 1000))):
    db.aql.execute(
        """
        FOR doc IN @batch
            INSERT doc INTO movies OPTIONS { ignoreErrors: true }
        """,
        bind_vars={"batch": batch}
    )

# Insert users in batches
print("Inserting users in batches...")
user_ids = list(set(ratings_df['userId'].astype(str)))
users = [{'_key': uid} for uid in user_ids]

for batch in tqdm(list(chunks(users, 1000))):
    db.aql.execute(
        """
        FOR doc IN @batch
            INSERT doc INTO users OPTIONS { ignoreErrors: true }
        """,
        bind_vars={"batch": batch}
    )

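# Insert ratings (edges) in batches
print("Inserting ratings in batches...")
# iterrows() upcasts the integer ID columns to float because `rating` is a float,
# so cast back to int to avoid handles such as "users/1.0" that match no vertex
edges_ratings = [{
    "_from": f"users/{int(row['userId'])}",
    "_to": f"movies/{int(row['movieId'])}",
    "rating": float(row['rating']),
    "timestamp": int(row['timestamp'])
} for _, row in ratings_df.iterrows()]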

for batch in tqdm(list(chunks(edges_ratings, 1000))):
    db.aql.execute(
        """
        FOR doc IN @batch
            INSERT doc INTO ratings OPTIONS { ignoreErrors: true }
        """,
        bind_vars={"batch": batch}
    )

# Insert tags (edges) in batches
print("Inserting tags in batches...")
edges_tags = [{
    "_from": f"users/{row['userId']}",
    "_to": f"movies/{row['movieId']}",
    "tag": row['tag'],
    "timestamp": int(row['timestamp'])
} for _, row in tags_df.iterrows()]

for batch in tqdm(list(chunks(edges_tags, 1000))):
    db.aql.execute(
        """
        FOR doc IN @batch
            INSERT doc INTO tags OPTIONS { ignoreErrors: true }
        """,
        bind_vars={"batch": batch}
    )

print("Insertion complete!")


Inserting movies in batches...


100%|██████████| 10/10 [00:01<00:00,  9.59it/s]


Inserting users in batches...


100%|██████████| 1/1 [00:00<00:00, 10.91it/s]


Inserting ratings in batches...


100%|██████████| 101/101 [00:10<00:00,  9.45it/s]


Inserting tags in batches...


100%|██████████| 4/4 [00:00<00:00,  9.51it/s]

Insertion complete!
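The batch counts in the progress bars above follow directly from the chunk size. As a quick check, the sketch below reuses the `chunks` helper with the row totals printed earlier (9742 movies, 100836 ratings) and placeholder rows instead of real documents.

```python
import math

def chunks(lst, n):
    """Yield successive slices of at most n items from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

batch_size = 1000
batch_counts = {}
for name, total in [("movies", 9742), ("ratings", 100836)]:
    rows = list(range(total))  # placeholder rows standing in for documents
    batches = list(chunks(rows, batch_size))
    batch_counts[name] = len(batches)
    # The last batch holds whatever is left after the full batches
    print(f"{name}: {len(batches)} batches, last batch has {len(batches[-1])} rows")
    assert len(batches) == math.ceil(total / batch_size)
```

A larger batch size means fewer round trips to the server but larger request bodies, so the value of 1000 is a trade-off rather than a requirement.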





## Create the graph

In [7]:
graph_name = 'movies_graph'

# If the graph exists, delete it (without dropping the collections)
if db.has_graph(graph_name):
    db.delete_graph(graph_name, drop_collections=False)

# Create the graph
graph = db.create_graph(graph_name)

# Define the ratings edge (user -> movie)
graph.create_edge_definition(
    edge_collection='ratings',
    from_vertex_collections=['users'],
    to_vertex_collections=['movies']
)

# Define the tags edge (user -> movie)
graph.create_edge_definition(
    edge_collection='tags',
    from_vertex_collections=['users'],
    to_vertex_collections=['movies']
)

print(f"Graph '{graph_name}' created with vertex and edge collections.")


Graph 'movies_graph' created with vertex and edge collections.


# PageRank

This code reads the rating edges back from ArangoDB, builds a directed NetworkX graph of users and movies, and runs the weighted PageRank algorithm to rank movies based on user ratings. It then combines the PageRank scores with rating counts and average ratings to identify the top movies.
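To build intuition for what weighted PageRank does on a user-to-movie graph, here is a minimal pure-Python power-iteration sketch on a few hypothetical edges. It is an illustration under simple assumptions (uniform teleportation, rank of nodes without outgoing edges spread evenly over all nodes) and is not the NetworkX implementation used in this notebook.

```python
def weighted_pagerank(edges, alpha=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank over (source, target, weight) edges."""
    nodes = sorted({node for s, t, _ in edges for node in (s, t)})
    n = len(nodes)

    # Total outgoing weight per node; nodes with none are "dangling"
    out_weight = {node: 0.0 for node in nodes}
    for s, _, w in edges:
        out_weight[s] += w

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by dangling nodes is redistributed uniformly
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0.0)
        base = (1.0 - alpha) / n + alpha * dangling / n
        new_rank = {node: base for node in nodes}
        # Each source passes rank along its edges in proportion to edge weight
        for s, t, w in edges:
            new_rank[t] += alpha * rank[s] * w / out_weight[s]
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank

# Hypothetical rating edges: (user, movie, rating)
toy_edges = [
    ("users/1", "movies/10", 5.0),
    ("users/1", "movies/20", 1.0),
    ("users/2", "movies/10", 4.0),
    ("users/2", "movies/20", 2.0),
    ("users/3", "movies/10", 3.0),
]

ranks = weighted_pagerank(toy_edges)
for node, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{node}: {score:.4f}")
```

Since edges only point from users to movies, users receive just the baseline share, while a movie's score grows with the number of users rating it and with how large its rating is relative to each user's other ratings.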

In [8]:
import networkx as nx
from pyArango.connection import Connection
import numpy as np

# ArangoDB settings, taken from the temporary credentials retrieved above
# (replace with your own values if you are not using the tutorial service)
DB_NAME = login["dbName"]
USERNAME = login["username"]
PASSWORD = login["password"]
HOST = "https://{}:{}".format(login["hostname"], login["port"])

# Strip a trailing ".0" from a key (present when IDs were formatted from floats at insert time)
def clean_key(s):
    if s.endswith('.0'):
        return s[:-2]
    return s

# Connect to ArangoDB
conn = Connection(username=USERNAME, password=PASSWORD, arangoURL=HOST)
db = conn[DB_NAME]

# Fetch ratings (edges) to build the graph
query = """
FOR r IN ratings
  RETURN {user: r._from, movie: r._to, weight: r.rating}
"""
edges = db.AQLQuery(query, rawResults=True)

# Create directed graph
G = nx.DiGraph()

# Add edges with weights
for e in edges:
    G.add_edge(e['user'], e['movie'], weight=e['weight'])

print(f"Graph created with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges.")

# Run PageRank
pr = nx.pagerank(G, alpha=0.85, weight='weight')

# Filter only movies in the PageRank result
movie_ranks = {k: v for k, v in pr.items() if k.startswith("movies/")}

# Extract movie _keys applying cleaning
movie_keys = [clean_key(m.split('/')[1]) for m in movie_ranks.keys()]

# Fetch movie titles
query_movies = """
FOR m IN movies
  FILTER m._key IN @keys
  RETURN {key: m._key, title: m.title}
"""
movies_data = db.AQLQuery(query_movies, bindVars={"keys": movie_keys}, rawResults=True)
id_to_title = {m['key']: m['title'] for m in movies_data}

# Fetch stats: count and average rating per movie
query_stats = """
FOR r IN ratings
  COLLECT movie = r._to INTO group = r
  LET count = LENGTH(group)
  LET avg_rating = AVERAGE(group[*].rating)
  RETURN {movie: movie, count: count, avg_rating: avg_rating}
"""
stats = db.AQLQuery(query_stats, rawResults=True)

# Create dictionaries for quick access
movie_counts = {s['movie']: s['count'] for s in stats}
movie_avg_ratings = {s['movie']: s['avg_rating'] for s in stats}

# Prepare arrays for normalization and final score calculation
pr_values = np.array([movie_ranks.get(k, 0) for k in movie_ranks.keys()])
count_values = np.array([movie_counts.get(k, 0) for k in movie_ranks.keys()])
avg_rating_values = np.array([movie_avg_ratings.get(k, 0) for k in movie_ranks.keys()])

# Simple min-max normalization function
def min_max_normalize(arr):
    if arr.max() == arr.min():
        return np.zeros_like(arr)
    return (arr - arr.min()) / (arr.max() - arr.min())

pr_norm = min_max_normalize(pr_values)
count_norm = min_max_normalize(count_values)
avg_norm = min_max_normalize(avg_rating_values)

# Combine scores with weights: 50% PageRank, 30% count, 20% average rating
final_score = 0.5 * pr_norm + 0.3 * count_norm + 0.2 * avg_norm

# Create a dict with final scores
combined_scores = dict(zip(movie_ranks.keys(), final_score))

# Sort top 10 movies by combined score
top_10_combined = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)[:10]

print("\nTop 10 movies combining PageRank, count, and average rating:\n")
for movie_id_str, score in top_10_combined:
    movie_key = clean_key(movie_id_str.split('/')[1])
    title = id_to_title.get(movie_key, "Title not found")
    print(title)


Graph created with 10334 nodes and 100836 edges.

Top 10 movies combining PageRank, count, and average rating:

Shawshank Redemption, The (1994)
Forrest Gump (1994)
Pulp Fiction (1994)
Silence of the Lambs, The (1991)
Matrix, The (1999)
Braveheart (1995)
Star Wars: Episode IV - A New Hope (1977)
Schindler's List (1993)
Jurassic Park (1993)
Terminator 2: Judgment Day (1991)
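For readers less familiar with AQL, the `query_stats` aggregation above groups the rating edges by target movie and returns a count and an average per group. The sketch below reproduces that grouping in plain Python over a few hypothetical edge documents; the helper name `rating_stats` is made up for this illustration.

```python
from collections import defaultdict

def rating_stats(edges):
    """Group rating edges by target movie and compute count and average rating."""
    grouped = defaultdict(list)
    for edge in edges:
        grouped[edge["_to"]].append(edge["rating"])
    return {
        movie: {"count": len(values), "avg_rating": sum(values) / len(values)}
        for movie, values in grouped.items()
    }

# Hypothetical edge documents shaped like those in the `ratings` collection
toy_ratings = [
    {"_from": "users/1", "_to": "movies/10", "rating": 5.0},
    {"_from": "users/2", "_to": "movies/10", "rating": 4.0},
    {"_from": "users/3", "_to": "movies/10", "rating": 3.0},
    {"_from": "users/1", "_to": "movies/20", "rating": 2.0},
]

stats_by_movie = rating_stats(toy_ratings)
for movie, s in stats_by_movie.items():
    print(movie, s["count"], s["avg_rating"])
```

Running the aggregation inside the database, as the notebook does, avoids transferring every rating to the client just to compute two numbers per movie.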


# Next Steps

Be sure to check out the community detection tutorial to explore more graph analytics applications using ArangoDB.

To keep experimenting and working with ArangoDB beyond this temporary setup, you can:

- Get a 2-week free trial on ArangoDB Cloud

- Take the free Graph Course

- Download and install ArangoDB locally

Keep learning at https://www.arangodb.com/arangodb-training-center/

Useful resources:
https://www.arangodb.com/docs/stable/aql/tutorial.html

