## Movie recommendations with personalized PageRank

Let's load the graph, rebuild it as a simple undirected graph, and keep all nodes with their attributes.

In [18]:
import networkx as nx
from typing import Iterable, Dict, List, Tuple


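# read_pajek returns a multigraph, so parallel edges may be present
G_raw = nx.read_pajek("data/movies_graph.net")

# Rebuild as a simple undirected graph: keep every node with its attributes
# and collapse any parallel edges into a single edge
G = nx.Graph()
G.add_nodes_from(G_raw.nodes(data=True))
G.add_edges_from(G_raw.edges())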

print(f"Loaded graph with {G.number_of_nodes():,} nodes and {G.number_of_edges():,} edges")


Loaded graph with 6,577 nodes and 16,842 edges


Let's add tiny helpers to read labels cleanly and to tell movies apart from mode nodes. In this dataset, mode nodes (genres, people, languages, countries and so on) have labels that start with `m-`. Everything else is a movie.

In [21]:
# Return a clean string label for any node
def get_label(n) -> str:
    """Return the node's label as a string, without surrounding double quotes."""
    lbl = G.nodes[n].get("label", str(n))
    return str(lbl).strip('"')

# Check whether a node is a mode node. Mode node labels start with 'm-'
def is_mode(n) -> bool:
    return get_label(n).lower().startswith("m-")

# Check whether a node represents a movie title
def is_movie(n) -> bool:
    return not is_mode(n)

# Sanity check
sample_labels = [get_label(n) for n in list(G.nodes())[:30]]
n_movies = sum(1 for n in G.nodes() if is_movie(n))
n_modes  = G.number_of_nodes() - n_movies
print("Sample labels:", sample_labels[:20])
print("Movies:", n_movies, "Modes:", n_modes)


Sample labels: ["'71", 'm-Action', 'm-Crime', 'm-Drama', 'm-Thriller', 'm-War', "m-Jack O'Connell", 'm-Jack Lowden', 'm-Paul Popplewell', 'm-Adam Nagaitis', 'm-Yann Demange', 'm-English', 'm-UK', "m-'71", '12 Rounds 2 - Reloaded', 'm-Adventure', 'm-Randy Orton', 'm-Tom Stevens', 'm-Brian Markinson', 'm-Venus Terzo']
Movies: 1337 Modes: 5240


Let's now add a simple search helper to grab nodes by text, and a personalized PageRank function with a clean seed distribution. I also add two combiners: the geometric mean for "AND" and the simple average for "OR". Finally, I define a function that returns the top k movie nodes by score.
I will use these to build the five queries.

In [None]:
# Step 3. Search, PPR, and score combiners

from typing import Iterable, Dict, List, Tuple
import numpy as np
import networkx as nx

# Function to find nodes whose label contains text (case-insensitive), returning at most `limit` matches.
def find_nodes(text: str, *, include_modes=True, include_movies=True, limit=10) -> List[Tuple[str, str]]:
    q = text.lower()
    out = []
    for n in G.nodes():
        lbl = get_label(n)
        ok_type = (include_modes and is_mode(n)) or (include_movies and is_movie(n))
        if ok_type and q in lbl.lower():
            out.append((n, lbl))
            if len(out) >= limit:
                break
    return out

# Function for personalized PageRank with teleport mass spread uniformly over the given seeds.
def ppr(seeds: Iterable, alpha: float = 0.85) -> Dict:  # alpha is the chance to follow an edge; 1 - alpha is the chance to jump back to a seed
    seeds = list(seeds)
    if not seeds:
        return {}
    w = 1.0 / len(seeds)
    personalization = {n: 0.0 for n in G.nodes()}
    for s in seeds:
        personalization[s] = w
    return nx.pagerank(G, alpha=alpha, personalization=personalization)

# Function for the geometric mean. It rewards nodes that are strong for all conditions
def combine_and(scores_list: List[Dict], eps: float = 1e-15) -> Dict:
    if not scores_list:
        return {}
    keys = set().union(*[d.keys() for d in scores_list])
    out = {}
    for k in keys:
        vals = [max(d.get(k, 0.0), eps) for d in scores_list]
        out[k] = float(np.prod(vals) ** (1.0 / len(vals)))
    return out

# Function for the simple mean. It rewards nodes that are strong for any condition
def combine_or(scores_list: List[Dict]) -> Dict:
    if not scores_list:
        return {}
    keys = set().union(*[d.keys() for d in scores_list])
    out = {}
    for k in keys:
        vals = [d.get(k, 0.0) for d in scores_list]
        out[k] = float(np.mean(vals))
    return out

# Function to return top k movie nodes by score. It skips any in exclude
def top_movies(score_dict: Dict, *, exclude: Iterable = (), k: int = 15) -> List[Tuple[str, float]]:
    ex = set(exclude)
    items = []
    for n, s in score_dict.items():
        if is_movie(n) and n not in ex:
            items.append((get_label(n), float(s)))
    items.sort(key=lambda t: t[1], reverse=True)
    return items[:k]

# Quick check to show a few candidates for common terms
print("Moana candidates:", [lbl for _, lbl in find_nodes("Moana", include_movies=True, include_modes=False)])
print("Tom Hanks candidates:", [lbl for _, lbl in find_nodes("Tom Hanks", include_modes=True, include_movies=False)])
print("Drama candidates:", [lbl for _, lbl in find_nodes("Drama", include_modes=True, include_movies=False)][:5])
print("Action candidates:", [lbl for _, lbl in find_nodes("Action", include_modes=True, include_movies=False)][:5])
print("Adventure candidates:", [lbl for _, lbl in find_nodes("Adventure", include_modes=True, include_movies=False)][:5])
print("Brad Pitt candidates:", [lbl for _, lbl in find_nodes("Brad Pitt", include_modes=True, include_movies=False)])
print("George Clooney candidates:", [lbl for _, lbl in find_nodes("George Clooney", include_modes=True, include_movies=False)])


Moana candidates: ['Moana']
Tom Hanks candidates: ['m-Tom Hanks']
Drama candidates: ['m-Drama']
Action candidates: ['m-Action', 'm-Extraction']
Adventure candidates: ['m-Adventure']
Brad Pitt candidates: ['m-Brad Pitt']
George Clooney candidates: ['m-George Clooney']
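
To make the difference between the two combiners concrete, here is a small standalone sketch with made-up scores (the numbers and the helper names `geometric_mean` and `arithmetic_mean` are illustrative, not part of the notebook). It computes the geometric mean through logarithms, which is an alternative to the `np.prod` form used in `combine_and` and should give the same value up to rounding.

```python
import math
from typing import Sequence

def geometric_mean(vals: Sequence[float], eps: float = 1e-15) -> float:
    # Clip at eps so a zero score does not break the logarithm,
    # mirroring the eps floor used in combine_and
    clipped = [max(v, eps) for v in vals]
    return math.exp(sum(math.log(v) for v in clipped) / len(clipped))

def arithmetic_mean(vals: Sequence[float]) -> float:
    return sum(vals) / len(vals)

# Hypothetical per-condition scores for two movies
balanced = [0.04, 0.04]   # moderately strong for both conditions
lopsided = [0.08, 0.0]    # strong for one condition, absent for the other

for name, vals in [("balanced", balanced), ("lopsided", lopsided)]:
    print(f"{name:9s} AND={geometric_mean(vals):.3g}  OR={arithmetic_mean(vals):.3g}")
```

Both movies get the same "OR" score, but the lopsided one collapses to nearly zero under "AND", which is the behavior the queries below rely on.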


Let's now build the queries and print the top results. The logic is the same each time: grab the seed nodes for the query, run personalized PageRank from each seed, then combine the scores with the geometric mean (AND) or the simple mean (OR). Finally, I rank only movie nodes, and I skip the seed movie itself when that makes sense.

First, a helper that returns the node whose label matches exactly, or None if there is no match.

In [24]:
def pick_exact(label_text: str):
    for n in G.nodes():
        if get_label(n) == label_text:
            return n
    return None

Query 1: Finding top 15 movies similar to the animated movie Moana.

Idea: I think of the movie graph as a map, and Moana is a place on that map. From Moana you can walk to its actors, directors and genres, and then on to other movies that share those things. I release a random walker on this map. At each step the walker either follows a link to a neighbor or snaps back to Moana. Because the walker keeps snapping back, it spends most of its time in Moana's neighborhood. Any movie that is close by through many short paths gets visited a lot. The visit rate becomes that movie’s score, and a higher score means more similar to Moana. At the end I keep only movie nodes, drop Moana itself, sort by score and show the top results.
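
The walker idea can be checked by hand on a toy graph. The sketch below is a plain power iteration over a made-up seven-node graph (the edges, `build_adjacency` and `random_walk_with_restart` are illustrative and not part of the notebook; the real queries use `nx.pagerank`). It assumes every node has at least one neighbor.

```python
from typing import Dict, Iterable, List, Tuple

def build_adjacency(edges: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    # Undirected adjacency lists
    adj: Dict[str, List[str]] = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def random_walk_with_restart(adj, seeds, alpha=0.85, iters=200):
    seeds = list(seeds)
    # Restart distribution: uniform over the seeds, zero elsewhere
    restart = {n: 0.0 for n in adj}
    for s in seeds:
        restart[s] = 1.0 / len(seeds)
    score = dict(restart)
    for _ in range(iters):
        # With probability 1 - alpha the walker snaps back to a seed
        nxt = {n: (1.0 - alpha) * restart[n] for n in adj}
        # With probability alpha it moves to a uniformly chosen neighbor
        for n, nbrs in adj.items():
            share = alpha * score[n] / len(nbrs)
            for m in nbrs:
                nxt[m] += share
        score = nxt
    return score

# Hypothetical toy data: Frozen shares two modes with Moana, Skyscraper only one
TOY_EDGES = [
    ("Moana", "m-Animation"), ("Moana", "m-Musical"), ("Moana", "m-Adventure"),
    ("Frozen", "m-Animation"), ("Frozen", "m-Musical"),
    ("Skyscraper", "m-Adventure"), ("Skyscraper", "m-Action"),
]
toy_adj = build_adjacency(TOY_EDGES)
toy_scores = random_walk_with_restart(toy_adj, ["Moana"])

# Keep movie nodes only and drop the seed, as in the real query
toy_ranking = sorted(
    ((n, s) for n, s in toy_scores.items() if not n.startswith("m-") and n != "Moana"),
    key=lambda t: t[1],
    reverse=True,
)
for title, sc in toy_ranking:
    print(f"{title:12s} score={sc:.4f}")
```

In this toy graph Frozen ranks above Skyscraper because two short paths lead to it from Moana instead of one.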

In [26]:
moana = pick_exact("Moana")
if moana:
    s = ppr([moana])
    out1 = top_movies(s, exclude=[moana], k=15)
    for i, (title, sc) in enumerate(out1, 1):
        print(f"{i:2d}. {title:45s} score={sc:.6g}")
else:
    print("Moana not found")

 1. Saving Santa                                  score=0.00162228
 2. Frozen                                        score=0.00151347
 3. Smallfoot                                     score=0.0014918
 4. Lion King                                     score=0.00144753
 5. Hoodwinked                                    score=0.00144243
 6. Rio                                           score=0.00142845
 7. Book of Life                                  score=0.00133951
 8. Muppets Most Wanted                           score=0.00128614
 9. Aladdin                                       score=0.00123583
10. Into the Woods                                score=0.0011949
11. Greatest Showman                              score=0.00113187
12. Planet 51                                     score=0.00112896
13. Sweeney Todd - The Demon Barber of Fleet Street score=0.00103229
14. Jumanji 2 - Welcome to the Jungle             score=0.000954246
15. Skyscraper                                    score=0.000

Query 2: Finding dramas starring Tom Hanks

In [27]:
tom = pick_exact("m-Tom Hanks")
drama = pick_exact("m-Drama")
if tom and drama:
    s_actor = ppr([tom])
    s_genre = ppr([drama])
    s_and = combine_and([s_actor, s_genre])
    out2 = top_movies(s_and, k=15)
    for i, (title, sc) in enumerate(out2, 1):
        print(f"{i:2d}. {title:45s} score={sc:.6g}")
else:
    print("Missing Tom Hanks or Drama seed")

 1. Forrest Gump                                  score=0.0029402
 2. Captain Phillips                              score=0.00293454
 3. Sully                                         score=0.00291289
 4. Saving Private Ryan                           score=0.00288604
 5. Inferno                                       score=0.00284561
 6. Saving Mr. Banks                              score=0.00276516
 7. Da Vinci Code                                 score=0.00200921
 8. Toy Story 3                                   score=0.00173738
 9. Toy Story That Time Forgot                    score=0.00172135
10. Angels & Demons                               score=0.00167005
11. Polar Express                                 score=0.00166542
12. Toy Story of Terror                           score=0.00153506
13. Kraftidioten                                  score=0.000774082
14. Blade Runner 2049                             score=0.000768155
15. Blind Side                                    score=0.000

Query 3: Finding action and adventure movies featuring Johnny Depp

In [28]:
depp = pick_exact("m-Johnny Depp")
g_action = pick_exact("m-Action")
g_adventure = pick_exact("m-Adventure")
if depp and g_action and g_adventure:
    s_depp = ppr([depp])
    s_a = ppr([g_action])
    s_adv = ppr([g_adventure])
    s_and = combine_and([s_depp, s_a, s_adv])
    out3 = top_movies(s_and, k=15)
    for i, (title, sc) in enumerate(out3, 1):
        print(f"{i:2d}. {title:45s} score={sc:.6g}")
else:
    print("Missing Depp or one of the genres")

 1. Pirates of the Caribbean 5 - Dead Men Tell No Tales score=0.00148203
 2. Pirates of the Caribbean 2 - Dead Man's Chest score=0.00147494
 3. Pirates of the Caribbean - The Curse of the Black Pearl score=0.00131665
 4. Pirates of the Caribbean 3 - At World's End   score=0.00131665
 5. Pirates of the Caribbean 4 - On Stranger Tides score=0.00129739
 6. Lone Ranger                                   score=0.00129248
 7. Tourist                                       score=0.0011604
 8. Fantastic Beasts - The Crimes of Grindelwald  score=0.00109679
 9. Transcendence                                 score=0.00102427
10. Alice in Wonderland                           score=0.00102065
11. Alice Through the Looking Glass               score=0.000982117
12. Fear and Loathing in Las Vegas                score=0.00095554
13. Mortdecai                                     score=0.000882605
14. Sweeney Todd - The Demon Barber of Fleet Street score=0.000834799
15. Dark Shadows                         

Query 4: Finding movies co-starring Brad Pitt AND George Clooney

In [29]:
pitt = pick_exact("m-Brad Pitt")
clooney = pick_exact("m-George Clooney")
if pitt and clooney:
    s_pitt = ppr([pitt])
    s_clooney = ppr([clooney])
    s_and = combine_and([s_pitt, s_clooney])
    out4 = top_movies(s_and, k=15)
    for i, (title, sc) in enumerate(out4, 1):
        print(f"{i:2d}. {title:45s} score={sc:.6g}")
else:
    print("Missing one of the actor seeds")

 1. Ocean's 12                                    score=0.0154755
 2. Burn After Reading                            score=0.01493
 3. Ocean's 13                                    score=0.014748
 4. Ocean's 11                                    score=0.00542633
 5. Hail, Caesar!                                 score=0.00464708
 6. Monuments Men                                 score=0.00376517
 7. Leatherheads                                  score=0.00359884
 8. Babel                                         score=0.00344456
 9. Gravity                                       score=0.00343713
10. Tomorrowland                                  score=0.00318805
11. Interview with the Vampire                    score=0.00292219
12. Meet Joe Black                                score=0.00276865
13. Se7en                                         score=0.00276577
14. Fight Club                                    score=0.00271225
15. World War Z                                   score=0.00264899


Query 5: Finding movies featuring either Brad Pitt or George Clooney, or both.

In [30]:
if pitt or clooney:
    parts = []
    if pitt: parts.append(ppr([pitt]))
    if clooney: parts.append(ppr([clooney]))
    s_or = combine_or(parts)
    out5 = top_movies(s_or, k=15)
    for i, (title, sc) in enumerate(out5, 1):
        print(f"{i:2d}. {title:45s} score={sc:.6g}")
else:
    print("No actor seeds found")


 1. Ocean's 12                                    score=0.0158939
 2. Burn After Reading                            score=0.0153116
 3. Ocean's 13                                    score=0.0151484
 4. Ocean's 11                                    score=0.0115839
 5. Leatherheads                                  score=0.0107926
 6. Hail, Caesar!                                 score=0.0105646
 7. Tomorrowland                                  score=0.0104201
 8. Gravity                                       score=0.00990664
 9. Monuments Men                                 score=0.00907123
10. Interview with the Vampire                    score=0.00773485
11. Fight Club                                    score=0.00700363
12. Se7en                                         score=0.00697938
13. Babel                                         score=0.00688665
14. Allied                                        score=0.00673952
15. Meet Joe Black                                score=0.00673069
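
A side note on the OR combiner: personalized PageRank is linear in the restart distribution, so averaging two single-seed runs should match one run that restarts uniformly over both seeds, up to solver tolerance. The sketch below checks this on a made-up five-node graph with a plain power iteration (`ppr_toy` and the node names are hypothetical; it assumes every node has at least one neighbor).

```python
from typing import Dict, List

def ppr_toy(adj: Dict[str, List[str]], seeds: List[str],
            alpha: float = 0.85, iters: int = 200) -> Dict[str, float]:
    # Power iteration for a random walk that restarts uniformly over the seeds
    restart = {n: 0.0 for n in adj}
    for s in seeds:
        restart[s] = 1.0 / len(seeds)
    score = dict(restart)
    for _ in range(iters):
        nxt = {n: (1.0 - alpha) * restart[n] for n in adj}
        for n, nbrs in adj.items():
            share = alpha * score[n] / len(nbrs)
            for m in nbrs:
                nxt[m] += share
        score = nxt
    return score

# Hypothetical toy graph: two actors, one shared movie and one solo movie each
adj = {
    "m-A": ["Both", "OnlyA"],
    "m-B": ["Both", "OnlyB"],
    "Both": ["m-A", "m-B"],
    "OnlyA": ["m-A"],
    "OnlyB": ["m-B"],
}

pa = ppr_toy(adj, ["m-A"])
pb = ppr_toy(adj, ["m-B"])
averaged = {n: 0.5 * (pa[n] + pb[n]) for n in adj}   # what the OR combiner computes
joint = ppr_toy(adj, ["m-A", "m-B"])                 # a single run with both seeds

max_gap = max(abs(averaged[n] - joint[n]) for n in adj)
print(f"largest difference between the two approaches: {max_gap:.2e}")
```

If the same holds on the full graph, `ppr([pitt, clooney])` would be a cheaper way to get the OR scores, though I have not verified that here.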


In summary, I used personalized PageRank over the provided movie knowledge graph. For each query I built a seed set that represents the intent, for example the movie node for Moana, or the actor and genre nodes for Tom Hanks and Drama. Personalized PageRank simulates a random walk that keeps restarting at the seeds, so nodes that are closer and better connected to the seeds get a higher steady-state visit probability. For AND queries I ran one PageRank per condition and combined the scores with the geometric mean, so a movie must align with all conditions to rank highly. For the OR query I averaged the scores, so either branch helps. I ranked only movie nodes and returned the top fifteen for each query.