---
title: "Graph Attributes"
format:
  html:
    embed-resources: true
---


### Imports

In [15]:
import networkx as nx
import numpy as np
import pandas as pd
import ast


### Network Summary

In [16]:
# Load GML graph
G = nx.read_graphml("../data/networks/crew_collaboration_network.graphml")

# Inspect basic stats
def network_summary(G):

    def centrality_stats(x):
        x1=dict(x)
        x2=np.array(list(x1.values())); #print(x2)
        print("	min:" ,min(x2))
        print("	mean:" ,np.mean(x2))
        print("	median:" ,np.median(x2))
        # print("	mode:" ,stats.mode(x2)[0][0])
        print("	max:" ,max(x2))
        x=dict(x)
        sort_dict=dict(sorted(x1.items(), key=lambda item: item[1],reverse=True))
        print("	top nodes:",list(sort_dict)[0:6])
        print("	          ",list(sort_dict.values())[0:6])

    try: 
        print("GENERAL")
        print("	number of nodes:",len(list(G.nodes)))
        print("	number of edges:",len(list(G.edges)))

        print("	is_directed:", nx.is_directed(G))
        print("	is_weighted:" ,nx.is_weighted(G))


        if(nx.is_directed(G)):
            print("IN-DEGREE (NORMALIZED)")
            centrality_stats(nx.in_degree_centrality(G))
            print("OUT-DEGREE (NORMALIZED)")
            centrality_stats(nx.out_degree_centrality(G))
        else:
            print("	number_connected_components: ", nx.number_connected_components(G))
            print("	number of triangle: ",len(nx.triangles(G).keys()))
            print("	density:" ,nx.density(G))
            print("	average_clustering coefficient: ", nx.average_clustering(G))
            print("	degree_assortativity_coefficient: ", nx.degree_assortativity_coefficient(G))
            print("	is_tree:" ,nx.is_tree(G))

            if(nx.is_connected(G)):
                print("	diameter:" ,nx.diameter(G))
                print("	radius:" ,nx.radius(G))
                print("	average_shortest_path_length: ", nx.average_shortest_path_length(G))

            #CENTRALITY 
            print("DEGREE (NORMALIZED)")
            centrality_stats(nx.degree_centrality(G))

            print("CLOSENESS CENTRALITY")
            centrality_stats(nx.closeness_centrality(G))

            print("BETWEEN CENTRALITY")
            centrality_stats(nx.betweenness_centrality(G))
    except Exception as e:
        print(f"unable to run: {e}")

network_summary(G)

GENERAL
	number of nodes: 1302
	number of edges: 9682
	is_directed: False
	is_weighted: True
	number_connected_components:  5
	number of triangle:  1302
	density: 0.01143159403554633
	average_clustering coefficient:  0.835207502960012
	degree_assortativity_coefficient:  0.034358075495094315
	is_tree: False
DEGREE (NORMALIZED)
	min: 0.006149116064565719
	mean: 0.01143159403554633
	median: 0.007686395080707148
	max: 0.06917755572636433
	top nodes: ['Samuel L. Jackson', 'Tom Cruise', 'Benedict Cumberbatch', 'Dwayne Johnson', 'Will Smith', 'Andy Serkis']
	           [0.06917755572636433, 0.05995388162951576, 0.04996156802459646, 0.04996156802459646, 0.04996156802459646, 0.04765564950038432]
CLOSENESS CENTRALITY
	min: 0.006917755572636433
	mean: 0.2638778085112068
	median: 0.2718867265199872
	max: 0.37695110457093983
	top nodes: ['Benedict Cumberbatch', 'Idris Elba', 'Chris Hemsworth', 'Samuel L. Jackson', 'Robert Downey Jr.', 'Tom Cruise']
	           [0.37695110457093983, 0.36300245134300

## Collaboration Network Summary

The collaboration graph provides insight into the structure of the film industry through connections between actors, directors, and other crew members.

### General Graph Properties
- **Nodes:** 1302  
  Unique individuals in the network.
- **Edges:** 9682  
  Total collaborations (shared projects).
- **Graph Type:** Undirected, Weighted  
  - *Undirected* → Collaborations are mutual.  
  - *Weighted* → Edge weights represent the number of shared collaborations.
- **Connected Components:** 5  
  Indicates the network is not fully connected — there are distinct clusters of collaborators.  
- **Triangles:** 1302  
  Suggests a large number of tightly knit teams (cliques).  
- **Density:** 0.0114 (1.14%)  
  The network is sparse overall, as expected in real-world social graphs.  
- **Average Clustering Coefficient:** 0.835  
  Very high — if two people work with the same person, they are highly likely to have worked together too.  
- **Degree Assortativity:** 0.034  
  Nearly zero → highly connected individuals collaborate with both well-connected and less connected people.  
- **Is Tree:** False  
  Collaboration networks are cyclical with overlapping groups, not tree-like.

### Centrality Measures

**Degree Centrality** — *Collaboration Volume*  
- **Range:** 0.006 → 0.069  
- **Top Nodes:** Samuel L. Jackson, Tom Cruise, Benedict Cumberbatch, Dwayne Johnson, Will Smith, Andy Serkis  
- **Interpretation:** These individuals are the most prolific collaborators, appearing in many different projects.

**Closeness Centrality** — *Reachability in the Network*  
- **Range:** 0.007 → 0.377  
- **Top Nodes:** Benedict Cumberbatch, Idris Elba, Chris Hemsworth, Samuel L. Jackson, Robert Downey Jr., Tom Cruise  
- **Interpretation:** These actors can reach others in the network through very few steps, giving them high structural influence.

**Betweenness Centrality** — *Brokerage / Gatekeeping*  
- **Range:** 0.0 → 0.066  
- **Top Nodes:** Samuel L. Jackson, Frank Grillo, Tom Cruise, Dwayne Johnson, Wu Jing, Andy Serkis  
- **Interpretation:** These individuals bridge otherwise disconnected parts of the network, acting as important connectors.

### Key Takeaways
- The network is **highly clustered** but **sparse overall**, with multiple connected components.
- A handful of stars dominate the collaboration structure, especially actors linked to major franchises (e.g., Marvel, Avatar, Star Wars).
- **High clustering** suggests strong “cliquish” collaboration groups.
- **Central actors** (high degree, closeness, or betweenness) are likely to be important predictors of box office performance.


### Compute Network Attributes

In [17]:
# compute centraility measures
deg_cent = nx.degree_centrality(G)
close_cent = nx.closeness_centrality(G)
betw_cent = nx.betweenness_centrality(G)

# turn into a DataFrame
centrality_df = pd.DataFrame({
    'Person': list(deg_cent.keys()),
    'DegreeCentrality': list(deg_cent.values()),
    'ClosenessCentrality': list(close_cent.values()),
    'BetweennessCentrality': list(betw_cent.values())
})

centrality_df.head()

Unnamed: 0,Person,DegreeCentrality,ClosenessCentrality,BetweennessCentrality
0,Sam Worthington,0.018447,0.295678,0.004802
1,Zoe Saldaña,0.030746,0.330639,0.013308
2,Sigourney Weaver,0.010761,0.272618,7.6e-05
3,Stephen Lang,0.010761,0.272618,7.6e-05
4,Michelle Rodriguez,0.025365,0.301916,0.005079


### Add Centrality Attributes to Data

In [18]:
import pandas as pd
import ast

# function to add centrality measures to provided df
def add_centrality_features(movies_df, centrality_df):

    # Ensure columns are lists instead of strings
    for col in ['Actors', 'Directors', 'Producers']:
        movies_df[col] = movies_df[col].apply(
            lambda x: ast.literal_eval(x) if isinstance(x, str) else (x if isinstance(x, list) else [])
        )

    # Flatten into a movie-person mapping
    movie_people = []
    for _, row in movies_df.iterrows():
        people = row['Actors'] + row['Directors'] + row['Producers']
        for person in people:
            movie_people.append({'IMDB_ID': row['IMDB_ID'], 'Person': person})

    movie_people_df = pd.DataFrame(movie_people)

    # Merge each person with their centralities
    movie_people_df = movie_people_df.merge(centrality_df, on="Person", how="left")

    # Aggregate centrality measures by movie
    movie_features = (
        movie_people_df
        .groupby("IMDB_ID")
        .agg({
            'DegreeCentrality': ['mean', 'max'],
            'ClosenessCentrality': ['mean', 'max'],
            'BetweennessCentrality': ['mean', 'max']
        })
    )

    # Flatten column names
    movie_features.columns = ["_".join(col) for col in movie_features.columns]
    movie_features.reset_index(inplace=True)

    # Merge back into the main movie DataFrame
    movies_centrality_df = movies_df.merge(movie_features, on="IMDB_ID", how="left")

    # fill rows with NA, this means they do not have scores
    movies_centrality_df = movies_centrality_df.fillna(0)

    return movies_centrality_df


Now we want to aggregate these centrality measures to create features for our regression model. The following attributes are some of the key features derived:

* `DegreeCentrality_mean` → average collaboration volume of cast/crew
* `DegreeCentrality_max` → star power of the most connected person
* `ClosenessCentrality_mean` → overall network reachability
* `BetweennessCentrality_max` → presence of a bridging star

In [19]:
# read in cleaned movies data
movies_reg_df = pd.read_csv("../data/processed/movies_reg.csv")
movies_class_df = pd.read_csv("../data/processed/movies_class.csv")

# apply function
movies_reg_centrality_df = add_centrality_features(movies_reg_df, centrality_df)
movies_class_centrality_df = add_centrality_features(movies_class_df, centrality_df)

# write to csv
movies_reg_centrality_df.to_csv('../data/processed/movies_reg_centrality.csv', index=False)
movies_class_centrality_df.to_csv('../data/processed/movies_class_centrality.csv', index=False)

In [22]:
movies_class_centrality_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4510 entries, 0 to 4509
Data columns (total 34 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   IMDB_ID                     4510 non-null   object 
 1   Title                       4510 non-null   object 
 2   Year                        4510 non-null   float64
 3   Release_Month               4510 non-null   int64  
 4   Age_Rating                  4510 non-null   object 
 5   Genre                       4510 non-null   object 
 6   Directors                   4510 non-null   object 
 7   Actors                      4510 non-null   object 
 8   Producers                   4510 non-null   object 
 9   Writers                     4510 non-null   object 
 10  Composers                   4510 non-null   object 
 11  Runtime                     4510 non-null   float64
 12  Cinematographers            4510 non-null   object 
 13  Production_Companies        4510 