# Creating Networks from JSON Data

This notebook reads movie data from the file `../data/imdb_movies_1985to2022.json` and constructs a graph of actors. The dataset contains a sample of movies released between 2000 and 2022, with their titles, genres, release years, ratings, and top-billed actors.

Using this dataset, we build a graph and perform some rudimentary graph analysis, extracting centrality metrics from it.

In [1]:
%matplotlib inline

In [2]:
import json
import random

import numpy as np
import pandas as pd
import networkx as nx


## Exercise 1: Build Graph of Actors, Finding Most Prolific Actor

The dataset contains a list of movies, one JSON record per line. We want to convert that list into a network of actors, where each node represents an actor and an edge between two actors indicates that they have co-starred in at least one movie. Each edge carries a `weight` attribute that counts how many movies the pair has appeared in together.

From there, we want to rank the actors by the number of distinct actors to whom they are connected (their degree), and print the top 10.
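
To make the target structure concrete before loading the real file, here is a minimal sketch on two made-up movie records. Only the `actors` key (a list of `[id, name]` pairs) mirrors what the loading code in this notebook reads; the titles, IDs, and names are invented for illustration, and `itertools.combinations` is used here as one possible way to enumerate co-star pairs.

```python
import itertools
import networkx as nx

# Hypothetical records: only the 'actors' key (a list of [id, name] pairs)
# mirrors the real file; titles, IDs, and names are invented.
toy_movies = [
    {"title": "Movie A", "actors": [["a1", "Ann"], ["a2", "Bob"], ["a3", "Cy"]]},
    {"title": "Movie B", "actors": [["a1", "Ann"], ["a2", "Bob"]]},
]

toy_g = nx.Graph()
for movie in toy_movies:
    # One node per actor, keyed by ID, with the name stored as an attribute
    for actor_id, actor_name in movie["actors"]:
        toy_g.add_node(actor_id, name=actor_name)

    # Every unordered pair of cast members shares this movie
    for (left_id, _), (right_id, _) in itertools.combinations(movie["actors"], 2):
        if toy_g.has_edge(left_id, right_id):
            toy_g[left_id][right_id]["weight"] += 1
        else:
            toy_g.add_edge(left_id, right_id, weight=1)

print(toy_g["a1"]["a2"]["weight"])  # Ann and Bob share both movies -> 2
print(toy_g.degree("a3"))           # Cy is connected to Ann and Bob -> 2
```

Note that the edge weight (shared movies) and the degree (distinct co-stars) measure different things; the ranking in this exercise uses the degree.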

In [4]:
g = nx.Graph() # Build the graph

In [5]:
with open("../data/imdb_movies_1985to2022.json", "r") as in_file:
    for line in in_file:
        
        # Load the movie from this line
        this_movie = json.loads(line)
            
        # Create a node for every actor
        for actor_id,actor_name in this_movie['actors']:
            g.add_node(actor_id, name=actor_name)
            
        # Iterate through the list of actors, generating all pairs
        #. Starting with the first actor in the list, generate pairs with all subsequent actors
        #. then continue to second actor in the list and repeat
        for i, (left_actor_id, left_actor_name) in enumerate(this_movie['actors']):
            for right_actor_id, right_actor_name in this_movie['actors'][i+1:]:
                # Get the current weight, or 0 if the edge does not exist yet
                current_weight = g.get_edge_data(left_actor_id, right_actor_id, default={"weight":0})["weight"]
                
                # Add (or update) the edge, counting one more shared movie
                g.add_edge(left_actor_id, right_actor_id, weight=current_weight+1)

In [6]:
print("Nodes:", len(g.nodes))

Nodes: 34360


In [7]:
# If you want to explore this graph in Gephi or some other
#. graph analysis tool, NetworkX makes it easy to export data.
#. Here, we use the GraphML format, which Gephi can read 
#. natively, to keep node attributes like Actor Name
nx.write_graphml(g, "actors.graphml")
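
As a quick sanity check on the export, the sketch below writes a tiny graph to GraphML and reads it back with `nx.read_graphml`, confirming that node and edge attributes survive the round trip. The node IDs, names, and the temporary file name are made up for illustration; the real notebook writes `actors.graphml` instead.

```python
import os
import tempfile
import networkx as nx

# Tiny stand-in graph with made-up IDs and names
small_g = nx.Graph()
small_g.add_node("nm_x", name="Actor X")
small_g.add_node("nm_y", name="Actor Y")
small_g.add_edge("nm_x", "nm_y", weight=3)

# Write to a temporary file, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.graphml")
    nx.write_graphml(small_g, path)
    reloaded = nx.read_graphml(path)

print(reloaded.nodes["nm_x"]["name"])        # node attribute is preserved
print(reloaded["nm_x"]["nm_y"]["weight"])    # edge attribute is preserved
```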

In [10]:
top_k = 10 # how many of the most central nodes to print

In [11]:
# Calculate degree centrality for all nodes
centrality_degree = nx.degree_centrality(g)

# sort node-centrality dictionary by metric, and reverse to get top elements first
for u in sorted(centrality_degree, key=centrality_degree.get, reverse=True)[:top_k]:
    print(u, g.nodes[u]['name'], centrality_degree[u])

nm0000616 Eric Roberts 0.002391219855268272
nm0000514 Michael Madsen 0.0012807462382706226
nm0001744 Tom Sizemore 0.0011400862467842536
nm0261724 Joe Estevez 0.001121578353167626
nm0001803 Danny Trejo 0.001025337306361163
nm0000115 Nicolas Cage 0.0009587088893413041
nm0442207 Lloyd Kaufman 0.0009476041531713276
nm0004193 Debbie Rochon 0.0008994836297680961
nm0000246 Bruce Willis 0.0008920804723214451
nm0000448 Lance Henriksen 0.0008772741574281432
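
The scores above are small fractions because `nx.degree_centrality` normalizes each node's degree by `n - 1`, the largest degree possible in a simple graph with `n` nodes. Sorting by degree centrality therefore gives the same order as sorting by the raw number of neighbors. It also ignores the `weight` attribute, so repeated collaborations with the same co-star count once. A minimal sketch on a small star graph (a toy example, not the actor graph) illustrates the normalization:

```python
import networkx as nx

# Star graph: node 0 is linked to nodes 1..4, so n = 5
star = nx.star_graph(4)
n = star.number_of_nodes()

centrality = nx.degree_centrality(star)

# Recompute by hand: degree divided by the maximum possible degree (n - 1)
manual = {node: star.degree(node) / (n - 1) for node in star.nodes}

print(centrality[0], manual[0])  # hub: 4 neighbors out of 4 possible
print(centrality[1], manual[1])  # leaf: 1 neighbor out of 4 possible
```

To recover a raw neighbor count for an actor in the full graph, `g.degree(u)` can be used directly.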
