# **Algorithmic Methods of Data Mining - Fall 2022**

## **Homework 5: The Marvel Universe!**

**Packages that are used throughout the notebook:**

In [4]:
# For data analysis and manipulation
import pandas as pd
import numpy as np

# For graph representation
import networkx as nx

# Utils
from tqdm import tqdm
import sys

## 1. Data

### Matching hero names
Here we examine and prepare the data so that the hero names match between the two datasets. 

In [19]:
# Load data and strip forward slashes and whitespace
nodes = pd.read_csv('data/nodes.csv').applymap(lambda x: x.rstrip('/').strip())
edges = pd.read_csv('data/edges.csv').applymap(lambda x: x.rstrip('/').strip())
network = pd.read_csv('data/hero-network.csv').applymap(lambda x: x.rstrip('/').strip())

In [20]:
# Unique heroes in each dataframe
node_heroes = set(nodes[nodes['type'] == 'hero']['node'])
edge_heroes = set(edges['hero'])
network_heroes = set(pd.concat([network['hero1'], network['hero2']]))

# Number of unique heroes in each dataframe
len(node_heroes), len(edge_heroes), len(network_heroes)

(6439, 6439, 6421)

We would like to have each hero in edges and network represented in the nodes dataframe. So let's see which ones do not have a match in the nodes dataframe.

In [21]:
print('Heroes in edges but not in nodes:', edge_heroes - node_heroes)
print('Heroes in nodes but not in edges:', node_heroes - edge_heroes)

Heroes in edges but not in nodes: {'SPIDER-MAN/PETER PARKER'}
Heroes in nodes but not in edges: {'SPIDER-MAN/PETER PARKERKER'}


In [22]:
# Fix Spider-Man's name in nodes
nodes.loc[nodes['node'] == 'SPIDER-MAN/PETER PARKERKER', 'node'] = 'SPIDER-MAN/PETER PARKER'

# Update the hero set for nodes
node_heroes = set(nodes[nodes['type'] == 'hero']['node'])

# Now let's check the network dataframe
print('Heroes in network but not in nodes:', network_heroes - node_heroes)

Heroes in network but not in nodes: {'SPIDER-MAN/PETER PAR'}
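
The mismatch above looks like a truncation: `'SPIDER-MAN/PETER PAR'` is a prefix of the full name. Below is a minimal sketch of how such candidates could be listed automatically, assuming that unmatched names are prefixes of names in the reference set. The helper `prefix_candidates` and the toy sets are hypothetical and not part of the notebook.

```python
def prefix_candidates(unmatched, reference):
    """Map each unmatched name to the reference names that start with it."""
    return {
        name: sorted(ref for ref in reference if ref.startswith(name))
        for name in unmatched
    }

# Toy data (hypothetical), mimicking the truncated name found above
reference_names = {'SPIDER-MAN/PETER PARKER', 'CAPTAIN AMERICA', 'VISION'}
network_names = {'SPIDER-MAN/PETER PAR', 'CAPTAIN AMERICA'}

unmatched = network_names - reference_names
candidates = prefix_candidates(unmatched, reference_names)
print(candidates)
```

A name with exactly one candidate can be renamed safely; a name with several candidates would still need a manual check.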


In [23]:
# Fix Spider-Man's name in network
network = network.applymap(lambda x: 'SPIDER-MAN/PETER PARKER' if x == 'SPIDER-MAN/PETER PAR' else x)
network_heroes = set(pd.concat([network['hero1'], network['hero2']]))

# Now we have every hero in the nodes dataframe
print('Number of heroes in nodes but not in network:', len(node_heroes - network_heroes))

Number of heroes in nodes but not in network: 18


All heroes in the edges and network dataframes now exist in the nodes dataframe. Next, we drop the rows of the network dataframe in which a hero is paired with itself (`hero1 == hero2`).

In [24]:
# Drop rows where a hero is paired with itself (self-loops)
network = network[network['hero1'] != network['hero2']].reset_index(drop=True)

# Are the comics in the nodes and edges matching?
nodes_comics = set(nodes[nodes['type'] == 'comic']['node'])
edges_comics = set(edges['comic'])
len(nodes_comics), len(edges_comics), len(nodes_comics ^ edges_comics) # The comics are matching

(12651, 12651, 0)

Also, we have discovered that there are some heroes and comics that share the same name. We will add "_c" to the comic name so that it will have its own unique node for the upcoming analysis.

In [25]:
iffy_comics = nodes.loc[nodes['node'].duplicated(keep=False) & (nodes['type'] == 'comic'), 'node']
iffy_comics

2078     BLADE
13362    REBEL
13704    SABRE
Name: node, dtype: object

In [26]:
nodes.loc[nodes['node'].isin(iffy_comics) & (nodes['type'] == 'comic'), 'node'] += '_c'
edges.loc[edges['comic'].isin(iffy_comics), 'comic'] += '_c'
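
The renaming step can be checked on a toy table. This is only a sketch: `toy_nodes` and its values are made up for illustration, while the `duplicated(keep=False)` mask and the `'_c'` suffix mirror the cells above.

```python
import pandas as pd

# Toy node table (hypothetical): 'BLADE' is both a hero and a comic
toy_nodes = pd.DataFrame({
    'node': ['BLADE', 'BLADE', 'VISION', 'A 1'],
    'type': ['hero', 'comic', 'hero', 'comic'],
})

# keep=False flags every occurrence of a repeated name, not only the later ones
is_clash = toy_nodes['node'].duplicated(keep=False) & (toy_nodes['type'] == 'comic')

# Suffix only the comic rows so that the hero keeps its original name
toy_nodes.loc[is_clash, 'node'] += '_c'
print(toy_nodes)
```

After the update every name in the `node` column is unique, which is what a graph with one node per name requires.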

In [50]:
from itertools import combinations

# Here we create our own network dataframe: one row per hero pair per comic
my_network = []

for comic, heroes in tqdm(edges.groupby('comic')['hero']):
    # Sort so that each pair is always stored in the same (alphabetical) order
    heroes = sorted(heroes)
    for combo in combinations(heroes, 2):
        my_network.append({'hero1': combo[0], 'hero2': combo[1], 'comic': comic})

my_network = pd.DataFrame(my_network)
my_network.shape, network.shape

100%|██████████| 12651/12651 [00:01<00:00, 9073.14it/s] 


((579171, 3), (572235, 2))
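
To make the pair construction concrete, here is a small sketch on made-up data (the `appearances` dictionary is hypothetical). A comic with k heroes contributes k(k-1)/2 pairs, and sorting the cast first ensures that a pair is counted under a single key regardless of the order in which the heroes are listed.

```python
from collections import Counter
from itertools import combinations

# Toy hero-comic appearances (hypothetical)
appearances = {
    'COMIC 1': ['THOR', 'HULK', 'VISION'],
    'COMIC 2': ['HULK', 'THOR'],
}

pair_counts = Counter()
for comic, cast in appearances.items():
    # Sorting makes (HULK, THOR) and (THOR, HULK) the same key
    for pair in combinations(sorted(cast), 2):
        pair_counts[pair] += 1

print(pair_counts)
```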

Lastly, we delete the temporary variables that are no longer needed.

In [42]:
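# Delete temporary variables; pop() skips names that are not defined,
# so the cell can be re-run in any order without raising a NameError
for _name in ['combo', 'comic', 'edge_heroes', 'edges_comics', 'heroes',
              'iffy_comics', 'network_heroes', 'node_heroes', 'nodes_comics']:
    globals().pop(_name, None)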

## 2. Backend Implementation
We start by creating a list of heroes and their number of appearances so that we can get the top N heroes when needed.

In [53]:
heroes = edges.groupby('hero').size().sort_values(ascending=False)
heroes.head(5)

hero
SPIDER-MAN/PETER PARKER    1577
CAPTAIN AMERICA            1334
IRON MAN/TONY STARK        1150
THING/BENJAMIN J. GR        963
THOR/DR. DONALD BLAK        956
dtype: int64

#### Creating the **First Graph**

We adjust the edge weights with a **small twist**. You have asked us to have a lower weight/cost for heroes that appear together more frequently. In addition, we scale each weight by the total number of appearances of the two heroes. We use the following formula to calculate the weight of an edge:
$$
w_{ij} = \frac{|\{c \in C : h_i \in c\}| + |\{c \in C : h_j \in c\}|}{|\{c \in C : h_i \in c \text{ and } h_j \in c\}|}
$$
where $C$ is the set of all comics, and $h_i$ and $h_j$ are two heroes in the Marvel Universe that share at least one comic. In other words, the weight $w_{ij}$ is the inverse of the ratio between the number of comics the two heroes share and their total number of appearances, so pairs that mostly appear together get a low weight.
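
A short worked example of the formula, using made-up counts. The function `edge_weight` is a hypothetical helper, not part of the notebook. Assuming each hero-comic appearance is counted once, the number of shared comics cannot exceed either hero's own count, so the weight is never below 2.

```python
def edge_weight(n_i: int, n_j: int, n_ij: int) -> float:
    """Weight of the edge between two heroes.

    n_i, n_j: number of comics each hero appears in.
    n_ij: number of comics in which both appear (must be positive).
    """
    if n_ij <= 0:
        raise ValueError('heroes that never appear together have no edge')
    return (n_i + n_j) / n_ij

# Toy counts (hypothetical)
always_together = edge_weight(10, 10, 10)   # every appearance is shared
rarely_together = edge_weight(100, 50, 2)   # only 2 shared comics
print(always_together, rarely_together)
```

The inseparable pair reaches the minimum weight of 2.0, while the pair with only two shared comics gets a weight of 75.0.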

In [54]:
edges_weighted = my_network.groupby(['hero1', 'hero2']).size().reset_index().sort_values(by=0, ascending=False)
edges_weighted.columns = ['hero1', 'hero2', 'colab']

edges_weighted['tot'] = edges_weighted['hero1'].map(heroes) + edges_weighted['hero2'].map(heroes)
edges_weighted['weight'] = edges_weighted['tot'] / edges_weighted['colab']
edges_weighted.head(5)

                       hero1                 hero2  colab   tot    weight
106165  HUMAN TORCH/JOHNNY S  THING/BENJAMIN J. GR    724  1849  2.553867
105742  HUMAN TORCH/JOHNNY S  MR. FANTASTIC/REED R    694  1740  2.507205
141154  MR. FANTASTIC/REED R  THING/BENJAMIN J. GR    690  1817  2.633333
109418   INVISIBLE WOMAN/SUE  MR. FANTASTIC/REED R    682  1616  2.369501
105475  HUMAN TORCH/JOHNNY S   INVISIBLE WOMAN/SUE    675  1648  2.441481


In [55]:
G1 = nx.from_pandas_edgelist(edges_weighted, 'hero1', 'hero2', 'weight')
G1.number_of_nodes(), G1.number_of_edges(), pd.concat([edges_weighted['hero1'], edges_weighted['hero2']]).nunique(), edges_weighted.shape[0]

(6421, 171644, 6421, 171644)
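
Because the weights act as costs, weighted shortest paths prefer chains of frequent collaborators. The sketch below illustrates this on a toy edge list; the node names and weights are hypothetical.

```python
import networkx as nx
import pandas as pd

# Toy weighted edge list (hypothetical values)
toy_edges = pd.DataFrame({
    'hero1': ['A', 'A', 'B'],
    'hero2': ['B', 'C', 'C'],
    'weight': [2.0, 10.0, 3.0],
})
toy_G = nx.from_pandas_edgelist(toy_edges, 'hero1', 'hero2', 'weight')

# The direct edge A-C costs 10, the detour through B costs 2 + 3 = 5
path = nx.shortest_path(toy_G, 'A', 'C', weight='weight')
cost = nx.shortest_path_length(toy_G, 'A', 'C', weight='weight')
print(path, cost)
```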

#### Creating the **Second Graph**

In [56]:
G2 = nx.from_pandas_edgelist(edges, 'hero', 'comic')
G2.number_of_nodes(), G2.number_of_edges(), nodes.shape[0], edges.shape[0]

(19090, 96104, 19090, 96104)

In [63]:
G2.add_nodes_from(nodes[nodes['type'] == 'hero']['node'], type='hero')
G2.add_nodes_from(nodes[nodes['type'] == 'comic']['node'], type='comic')
G2.number_of_nodes(), G2.number_of_edges(), nodes.shape[0], edges.shape[0]

(19090, 96104, 19090, 96104)

### Functionality 1 - Extract the graph's features

In [120]:
def graph_summary(G: nx.Graph, type: int, N: int = 10):
    """Prints some basic features of the graph restricted to the top N heroes.
    Args:
        G (nx.Graph): The graph.
        type (int): The type of graph. Can be 1 or 2.
        N (int): Denotes the top N heroes to consider.
    """
    if type not in [1, 2]:
        raise ValueError('type must be 1 or 2')

    # Top N heroes by number of appearances (uses the global `heroes` series)
    subnodes = set(heroes.head(N).index)
    if type == 2:
        # Assumption: in the hero-comic graph we also keep the comics of the top
        # heroes, otherwise the hero-only subgraph would contain no edges
        subnodes |= {comic for hero in subnodes if hero in G for comic in G[hero]}
    subg = G.subgraph(subnodes)

    print('Number of nodes:', subg.number_of_nodes())

    E = subg.number_of_edges()
    if type == 1:
        print('Number of collaborations:', E)
    else:
        print('Number of hero-comic appearances:', E)

    density = nx.density(subg)
    print(f'Network density: {density:.4f}')

    degrees = dict(subg.degree())
    ave_degree = np.mean(list(degrees.values()))
    print(f'Average degree: {ave_degree:.4f}')

    # Hubs: nodes whose degree is at or above the 99th percentile
    q99 = np.quantile(list(degrees.values()), 0.99)
    print('Hub nodes:', '\t\t'.join([node for node, degree in degrees.items() if degree >= q99]))
    # Threshold used here: a density below 0.1 counts as sparse
    print('The network is sparse:', density < 0.1)

    print('Network degree distribution:')
    print(pd.DataFrame(degrees.values()).describe().T)

graph_summary(G1, 1, 3000)
# graph_summary(G2, 2, 3000)

Number of nodes: 3000
Number of collaborations: 124977
Network density: 0.0278
Average degree: 83.3180
Hub nodes: SHE-HULK/JENNIFER WA		FURY, COL. NICHOLAS		PROFESSOR X/CHARLES		WOLVERINE/LOGAN		DR. STRANGE/STEPHEN		CYCLOPS/SCOTT SUMMER		IRON MAN/TONY STARK		MARVEL GIRL/JEAN GRE		MR. FANTASTIC/REED R		SUB-MARINER/NAMOR MA		THOR/DR. DONALD BLAK		ICEMAN/ROBERT BOBBY		WONDER MAN/SIMON WIL		THING/BENJAMIN J. GR		BEAST/HENRY &HANK& P		INVISIBLE WOMAN/SUE		SCARLET WITCH/WANDA		JARVIS, EDWIN		HAWK		COLOSSUS II/PETER RA		HULK/DR. ROBERT BRUC		ANGEL/WARREN KENNETH		SPIDER-MAN/PETER PARKER		WASP/JANET VAN DYNE		HERCULES [GREEK GOD]		HUMAN TORCH/JOHNNY S		STORM/ORORO MUNROE S		ANT-MAN/DR. HENRY J.		CAPTAIN AMERICA		VISION
The network is sparse: True
Network degree distribution:
    count    mean         std  min   25%   50%   75%     max
0  3000.0  83.318  129.603245  1.0  23.0  42.0  86.0  1452.0
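
The density and average degree printed above follow directly from the node and edge counts of an undirected graph without self-loops. A quick check, using only the figures reported in the output:

```python
# Figures reported in the output above
n_nodes = 3000
n_edges = 124977

# Each edge contributes to the degree of two nodes
average_degree = 2 * n_edges / n_nodes
# Fraction of the N * (N - 1) / 2 possible edges that are present
density = 2 * n_edges / (n_nodes * (n_nodes - 1))

print(f'{average_degree:.4f} {density:.4f}')
```

Both values agree with the printed summary (83.3180 and 0.0278).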


### Functionality 2 - Find top superheroes!

In [161]:
def top_heroes(G: nx.Graph, node: str, metric: int, N: int = 10):
    """Prints the value of the chosen centrality metric for the given node.
    Args:
        G (nx.Graph): The graph.
        node (str): The node (hero or comic).
        metric (int): Integer denoting the metric. Can be 1, 2, 3 or 4, that corresponds to:
            1: Betweenness
            2: PageRank
            3: ClosenessCentrality
            4: DegreeCentrality
        N (int): Denotes the top N heroes to consider.
    """
    if metric not in [1, 2, 3, 4]:
        raise ValueError('metric must be 1, 2, 3 or 4')

    measure = ['Betweenness', 'PageRank', 'ClosenessCentrality', 'DegreeCentrality']

    if metric == 1:
        # Only betweenness is restricted to the subgraph of the top N heroes;
        # the other metrics below are computed on the whole graph
        subg = G.subgraph(heroes.index[:N])
        res = nx.betweenness_centrality(subg, normalized=True, weight='weight')[node]
    elif metric == 2:
        res = nx.pagerank(G, weight='weight')[node]
    elif metric == 3:
        # No `distance` argument is passed, so closeness counts hops, not weights
        res = nx.closeness_centrality(G, u=node)
    else:
        res = nx.degree_centrality(G)[node]

    print(f"{node}'s {measure[metric-1]}: {res:.4f}")

top_heroes(G1, heroes.index[1], 1, 50)
top_heroes(G1, heroes.index[1], 2, 50)
top_heroes(G1, heroes.index[1], 3, 50)
top_heroes(G1, heroes.index[1], 4, 50)

CAPTAIN AMERICA's Betweenness: 0.1794
CAPTAIN AMERICA's PageRank: 0.0224
CAPTAIN AMERICA's ClosenessCentrality: 0.5845
CAPTAIN AMERICA's DegreeCentrality: 0.2989
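
As a sanity check of what these metrics measure, here is a sketch on a three-node path graph where the values can be verified by hand. The graph and its weights are hypothetical. It also shows that `nx.closeness_centrality` only takes edge weights into account when a `distance` attribute is given.

```python
import networkx as nx

# Toy path graph A - B - C with hypothetical weights
toy_G = nx.Graph()
toy_G.add_weighted_edges_from([('A', 'B', 2.0), ('B', 'C', 3.0)])

# B is linked to both other nodes: degree centrality 2 / (3 - 1)
degree = nx.degree_centrality(toy_G)
# B lies on the only shortest path between A and C
betweenness = nx.betweenness_centrality(toy_G, normalized=True, weight='weight')
# Closeness of A in hops: 2 / (1 + 2); with weights: 2 / (2 + 5)
closeness_hops = nx.closeness_centrality(toy_G, u='A')
closeness_weighted = nx.closeness_centrality(toy_G, u='A', distance='weight')

print(degree['B'], betweenness['B'], closeness_hops, closeness_weighted)
```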


### Functionality 3 - Shortest ordered Route

### Functionality 4 - Disconnecting Graphs

### Functionality 5 - Extracting Communities

## 3. Frontend Implementation

### Visualization 1 - Visualize some features of the network

### Visualization 2 - Visualize centrality measure

### Visualization 3 - Visualize the shortest-ordered route

### Visualization 4 - Visualize the disconnected graph

### Visualization 5 - Visualize the communities

## 4. Command Line Questions

## 5. Bonus - PageRank on MapReduce

## 6. Algorithmic Question