# 1. Network Metrics

1. **Basic Graph Stats**

   size (number of nodes, number of edges)

   density (ratio of actual to possible edges, i.e. the edge count relative to a fully connected graph)
   
2. **Degree Metrics**

   average degree (number of neighbours of a node, i.e. how many other clusters each cluster is connected to on average)

   average weighted degree (how strong/frequent those connections are)

3. **Connectedness** (Local & Global)

   clustering coefficient (average clustering) (how likely the neighbours of a cluster are also connected to each other)

   connected components (cc) (how many disconnected groups exist | unified vs. fragmented network)

   * reachability: within a connected component, every node is reachable from every other node

   diameter (of the largest component) (the longest shortest path, i.e. how far apart the two most distant clusters are)

4. **Modularity** (Community Structure)

   Louvain algorithm to detect communities (groups of clusters)

   modularity score to measure how well separated those communities are

5. **Centrality Measures**

   degree centrality (importance by number of connections)

   betweenness centrality (importance as a bridge in network flow)

6. **Top Nodes by Centrality**

   top 5 nodes by degree centrality

   top 5 nodes by betweenness centrality

7. **Top Strongest Edges**

   most frequently co-occurring cluster pairs, found by sorting the original list of edges by weight

Pairs: e.g. for female profiles the "Depression" cluster is connected to the "Reproductive Health" cluster; what is the corresponding pair for male profiles?
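
The basic statistics, degree metrics and diameter from the outline above can be checked by hand on a small graph. The sketch below uses a hypothetical four-node edge list (the names `A` to `D` and the weights are made up for illustration) in the same `source`/`target`/`weight` shape as the input JSON.

```python
import networkx as nx

# Hypothetical toy edge list, same shape as the co-occurrence JSON files
edges = [
    {"source": "A", "target": "B", "weight": 3},
    {"source": "A", "target": "C", "weight": 1},
    {"source": "B", "target": "C", "weight": 2},
    {"source": "C", "target": "D", "weight": 5},
]

G = nx.Graph()
for e in edges:
    G.add_edge(e["source"], e["target"], weight=e["weight"])

n = G.number_of_nodes()  # 4 nodes
m = G.number_of_edges()  # 4 edges

# density = 2m / (n(n-1)) = 8 / 12
density = nx.density(G)

# average degree = 2m / n = 8 / 4
avg_degree = sum(d for _, d in G.degree()) / n

# average weighted degree = 2 * (sum of weights) / n = 22 / 4
avg_weighted_degree = sum(d for _, d in G.degree(weight="weight")) / n

# one connected component; the longest shortest path (A-C-D or B-C-D) has 2 edges
num_components = nx.number_connected_components(G)
diameter = nx.diameter(G)

print(density, avg_degree, avg_weighted_degree, num_components, diameter)
```

Note that the diameter counts edges (hops) and ignores the weights, which is also how `nx.diameter` is called in the analysis below.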

In [32]:
import json
import os
import pandas as pd
import networkx as nx
import numpy as np

import community.community_louvain as community_louvain # import Louvain algo for community detection (calculate modularity)
import matplotlib.pyplot as plt
from collections import Counter

base_dir = os.path.abspath("..")
file_path = os.path.join(base_dir, "data", "bipartite_network", "community-specific_patterns", "cluster_co-occurrence_edges_by_comm")
output_path = os.path.join(base_dir, "data", "bipartite_network", "community-specific_patterns", "network_analysis", "network_stats.json")

# uk_file = os.path.join(file_path, "cluster_co-occurrence_United Kingdom.json")
# us_file = os.path.join(file_path, "cluster_co-occurrence_United States.json")

def load_graph(json_path):
    # Load the edge list; each entry looks like
    # {"source": "Hematology & Blood Disorders", "target": "Procedures, Surgeries & Medical Devices", "weight": 2}
    with open(json_path, "r", encoding="utf-8") as f:
        edges = json.load(f)
        
    G = nx.Graph()

    # Load the edge list into a weighted (undirected) graph
    for entry in edges:
        u = entry["source"]  # source node (cluster)
        v = entry["target"]  # target node (cluster)
        w = entry["weight"]
        G.add_edge(u, v, weight = w)
        
    return G

def analyze_graph(G, label):
    analysis = {}  # store graph's stats

    # 1. Basic stats
    analysis["label"] = label
    analysis["nodes"] = G.number_of_nodes()  # number of nodes (clusters) in a graph
    analysis["edges"] = G.number_of_edges()
    analysis["density"] = nx.density(G)  # [0-1] number of edges compared to fully connected graph (ratio of actual to all possible connections)
    # 0 -> sparse; 1 -> fully connected

    # 2. Degree
    degrees = dict(G.degree())  # number of neighbours of each node (cluster)
    weighted_degrees = dict(G.degree(weight="weight"))  # sum of edge weights per node
    analysis["average_degree"] = np.mean(list(degrees.values()))  # on average, how many other clusters each cluster is connected to
    analysis["average_weighted_degree"] = np.mean(list(weighted_degrees.values()))  # how frequent (strong) these connections between clusters are

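    # 3. Connectedness of a graph: local (Clustering Coefficient) and global (Connected Components)

    # 3.1 Clustering Coefficient (how often the neighbours of a node (cluster) are connected to each other)
    # If cluster A is connected to B and C, how likely is it that B and C are connected too?
    # Note: with weight="weight", networkx computes the weighted variant (geometric mean of each triangle's
    # edge weights, normalized by the maximum weight in the graph), so even a fully connected triangle
    # scores below 1 when its weights are unequal.
    clustering = nx.clustering(G, weight="weight")
    analysis["average_clustering"] = np.mean(list(clustering.values()))  # average across all nodes (clusters)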

    # 3.2 Connected Components (cc) (see how many disconnected groups of clusters a graph has: a graph is unified or fragmented)
    # reachability: in connected components, all the nodes are always reachable from each other
    components = list(nx.connected_components(G))
    largest_cc = max(components, key=len)  # largest cc
    analysis["num_components"] = len(components) # total number of cc ( 1 if everything is connected)
    analysis["largest_component_size"] = len(largest_cc) # how many nodes (clusters) are in the largest group
    
    # 3.3 Diameter (the longest shortest path between any 2 nodes (clusters), counted in edges) of the largest cc
    # The subgraph induced by a connected component is connected by construction, so nx.diameter is always defined here
    G_largest_cc = G.subgraph(largest_cc)
    analysis["diameter"] = nx.diameter(G_largest_cc)
    

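    # 4. Modularity (how well a graph can be divided into communities, i.e. groups of closely connected nodes)
    # close to 1: the detected communities are well separated
    # close to 0 (or slightly negative): no more community structure than expected by chance
    # Louvain is randomized, so the partition and the score can vary slightly between runs
    partition = community_louvain.best_partition(G)  # Louvain algorithm to detect communities (uses edge weights by default)
    analysis["modularity"] = community_louvain.modularity(partition, G)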
    
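    # 5. Centrality Measures (nodes (clusters) of importance)
    degree_centrality = nx.degree_centrality(G)  # share of the other nodes each node is connected to: degree / (n - 1)
    # Which clusters serve as bridges between others (nodes that lie on many shortest paths).
    # Caveat: networkx interprets "weight" as a distance here, so heavily co-occurring pairs count as far apart.
    betweenness_centrality = nx.betweenness_centrality(G, weight="weight")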
    
    # 6. Top (n=5) nodes for each centrality measure
    analysis["top5_degree_centrality"] = sorted(degree_centrality.items(), key=lambda x: x[1], reverse=True)[:5]  # list of (node, score) tuples; descending order
    analysis["top5_betweenness_centrality"] = sorted(betweenness_centrality.items(), key=lambda x: x[1], reverse=True)[:5]  # list of (node, score) tuples; descending order

    # 7. Top 5 strongest edges (cluster pairs that co-occur most frequently across profiles)
    # Instead of asking which cluster has the highest degree, we ask which pairs are most strongly connected:
    # sort the edges by weight and keep the top (n=5) cluster PAIRS (the pairs are the main point)
    strongest_edges = sorted(G.edges(data=True), key=lambda x: x[2]["weight"], reverse=True)[:5]  # list of (u, v, attrs) tuples; descending order
    analysis["top5_cluster_pairs_by_weight"] = [(u, v, data["weight"]) for u, v, data in strongest_edges]  # u - source, v - target cluster

    return analysis
    
# Load the contents of the network_stats.json file (if exists)
if os.path.exists(output_path):
    with open(output_path, "r", encoding="utf-8") as f:
        results = json.load(f)
else:
    results = {}
    
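# Extract the group label from the directory name (here: "cluster_co-occurrence_edges_by_comm")
group_label = os.path.basename(os.path.normpath(file_path))

# Start from an empty list so that re-running the cell replaces this group's stats instead of appending duplicates
results[group_label] = []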

for file in sorted(os.listdir(file_path)):  # sorted for a reproducible row order
    if file.endswith(".json") and not file.startswith("pms"):
        label = os.path.splitext(file)[0]  # file name without the ".json" extension
        path_current = os.path.join(file_path, file)
        
        G = load_graph(path_current)  # load JSON files (edges) as graphs
        stats = analyze_graph(G, label)  # analyze graphs
        
        results[group_label].append(stats)

os.makedirs(os.path.dirname(output_path), exist_ok=True)  # make sure the output directory exists
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2)

print(f"\nNetwork stats saved to: {output_path}")

summary_df = pd.DataFrame(results[group_label])
summary_df



Network stats saved to: C:\Users\NASTYA\code\tum-thesis\data\bipartite_network\community-specific_patterns\network_analysis\network_stats.json


Unnamed: 0,label,nodes,edges,density,average_degree,average_weighted_degree,average_clustering,num_components,largest_component_size,diameter,modularity,top5_degree_centrality,top5_betweenness_centrality,top5_cluster_pairs_by_weight
0,abda_co-occurrence,19,112,0.654971,11.789474,36.210526,0.194076,1,19,3,0.014991,"[(Pain & Musculoskeletal Conditions, 0.9444444...","[(Allergy & Immunology, 0.22510893246187366), ...","[(Pain & Musculoskeletal Conditions, Autoimmun..."
1,above-beyond_co-occurrence,19,135,0.789474,14.210526,32.842105,0.113857,1,19,2,0.095085,"[(Pain & Musculoskeletal Conditions, 1.0), (Li...","[(Pain & Musculoskeletal Conditions, 0.1237996...","[(Lifestyle, Diet & Supplements, Mental Health..."
2,acoustic-neuroma-support_co-occurrence,17,69,0.507353,8.117647,21.176471,0.106666,1,17,2,0.052670,"[(Neurological Disorders, 1.0), (Reproductive ...","[(Lifestyle, Diet & Supplements, 0.15138888888...","[(Lifestyle, Diet & Supplements, Neurological ..."
3,actiononpain_co-occurrence,22,162,0.701299,14.727273,42.909091,0.126613,1,22,2,0.038351,"[(Neurological Disorders, 1.0), (Pain & Muscul...","[(Procedures, Surgeries & Medical Devices, 0.1...","[(Mental Health & Emotional Wellbeing, Pain & ..."
4,actionradiotherapy_co-occurrence,16,61,0.508333,7.625000,14.500000,0.204973,1,16,3,0.059119,"[(Lifestyle, Diet & Supplements, 0.93333333333...",[(Reproductive & Sexual Health (Women's Health...,"[(Lifestyle, Diet & Supplements, Cancer & Onco..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
320,womenshealth_co-occurrence,19,113,0.660819,11.894737,32.421053,0.131900,1,19,2,0.024071,"[(Lifestyle, Diet & Supplements, 1.0), (Reprod...","[(Dermatology & Skin-Related, 0.12305477746654...","[(Lifestyle, Diet & Supplements, Reproductive ..."
321,worldaccordingtolupus_co-occurrence,18,145,0.947712,16.111111,37.555556,0.228395,1,18,2,0.034645,"[(Mental Health & Emotional Wellbeing, 1.0), (...","[(Respiratory Conditions & Treatments, 0.13876...","[(Metabolic & Endocrine Disorders, Autoimmune ..."
322,youngadult-stress_co-occurrence,12,32,0.484848,5.333333,14.333333,0.131724,1,12,3,0.028326,"[(Pain & Musculoskeletal Conditions, 0.8181818...","[(Respiratory Conditions & Treatments, 0.33636...","[(Mental Health & Emotional Wellbeing, Lifesty..."
323,youngadultswithmelanoma_co-occurrence,3,3,1.000000,2.000000,3.333333,0.793701,1,3,1,0.000000,"[(Procedures, Surgeries & Medical Devices, 1.0...","[(Procedures, Surgeries & Medical Devices, 0.0...","[(Procedures, Surgeries & Medical Devices, Can..."
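
The modularity column depends on a randomized algorithm, so repeated runs may give slightly different values. The sketch below shows one way to pin the result with a seed and to check the score by hand. It is not the notebook's method: it swaps in networkx's built-in `louvain_communities` (available in networkx 2.8 and later) so that it runs without the `python-louvain` package, and the six-node graph is a made-up example. Recent versions of `python-louvain` appear to offer a comparable `random_state` argument on `best_partition`.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

# Hypothetical toy graph: two triangles joined by a single bridge edge (C-D)
G = nx.Graph()
G.add_weighted_edges_from([
    ("A", "B", 1), ("B", "C", 1), ("A", "C", 1),
    ("D", "E", 1), ("E", "F", 1), ("D", "F", 1),
    ("C", "D", 1),
])

# A fixed seed makes the detected partition reproducible between runs
communities = louvain_communities(G, weight="weight", seed=0)

# Modularity of that partition. For the two triangles:
# Q = 2 * (3/7 - (7/14)**2) = 5/14
score = modularity(G, communities, weight="weight")

print(sorted(sorted(c) for c in communities), round(score, 4))
```

Scores near zero, as in most rows of the table above, suggest that the detected communities are barely denser inside than a random graph with the same degrees would be.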


In [10]:
summary_df["top5_degree_centrality"].to_list()

[[('Mental Health & Emotional Wellbeing', 1.0),
  ("Reproductive & Sexual Health (Women's Health & Men's Health)", 1.0),
  ('Digestive Health (Gastrointestinal Conditions)', 1.0),
  ('Respiratory Conditions & Treatments', 1.0),
  ('Neurological Disorders', 1.0)],
 [('Mental Health & Emotional Wellbeing', 1.0),
  ("Reproductive & Sexual Health (Women's Health & Men's Health)", 1.0),
  ('Digestive Health (Gastrointestinal Conditions)', 1.0),
  ('Infectious & Communicable Diseases', 1.0),
  ('Neurological Disorders', 1.0)]]
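
A degree centrality of 1.0 means the cluster is connected to every other cluster in its graph, so in dense graphs like these the top 5 is largely a set of ties. One possible way to separate tied nodes is to use the weighted degree as a secondary sort key. This is a sketch on a hypothetical four-node graph, not part of the original analysis.

```python
import networkx as nx

# Hypothetical toy graph: A and B are both connected to all other nodes
G = nx.Graph()
G.add_weighted_edges_from([
    ("A", "B", 5), ("A", "C", 1), ("A", "D", 1),
    ("B", "C", 2), ("B", "D", 4),
])

# degree centrality = degree / (n - 1); A and B both reach 3 of 3 other nodes
dc = nx.degree_centrality(G)

# weighted degree (sum of edge weights) as a tie-breaker
strength = dict(G.degree(weight="weight"))

# sort by centrality first, then by weighted degree, both descending
ranked = sorted(G.nodes, key=lambda node: (dc[node], strength[node]), reverse=True)

print(ranked)
```

Here `A` and `B` tie on degree centrality, and `B` ranks first because its connections carry more weight (11 vs. 7).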

In [11]:
summary_df["top5_betweenness_centrality"].to_list()

[[('Allergy & Immunology', 0.4310966810966811),
  ('Rare Diseases & Genetic Disorders', 0.21392496392496393),
  ('Preventive Health & Screening', 0.17200577200577197),
  ('Infectious & Communicable Diseases', 0.08145743145743145),
  ('Procedures, Surgeries & Medical Devices', 0.07056277056277055)],
 [('Allergy & Immunology', 0.33604989286807463),
  ('Eye & Vision Health', 0.2725370589006953),
  ('Hematology & Blood Disorders', 0.15837159473523113),
  ('Patient Support & Education', 0.14994206130569768),
  ('Procedures, Surgeries & Medical Devices', 0.09969937469937469)]]
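
When reading these scores, keep in mind that `nx.betweenness_centrality` treats the `weight` attribute as a path length, so a high co-occurrence count makes two clusters look far apart. If strong co-occurrence should instead count as closeness, one common workaround is to derive a distance such as `1 / weight` first. The sketch below illustrates the difference on a made-up three-node graph; the `distance` attribute name is an assumption introduced here.

```python
import networkx as nx

# Hypothetical toy graph: A-B and B-C co-occur often, A-C only rarely
G = nx.Graph()
G.add_weighted_edges_from([
    ("A", "B", 10),
    ("B", "C", 10),
    ("A", "C", 1),
])

# Derive a distance so that strong co-occurrence means a short path
for u, v, data in G.edges(data=True):
    data["distance"] = 1.0 / data["weight"]

# Raw weights as path lengths: every pair is best reached directly, so no node is a bridge
bc_weight = nx.betweenness_centrality(G, weight="weight")

# Inverse weights as path lengths: the shortest A-C route runs through B
bc_distance = nx.betweenness_centrality(G, weight="distance")

print(bc_weight["B"], bc_distance["B"])
```

With raw weights `B` scores 0, with inverse weights it scores 1, so the choice changes which clusters appear as bridges.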

In [12]:
summary_df["top5_cluster_pairs_by_weight"].to_list()

[[("Reproductive & Sexual Health (Women's Health & Men's Health)",
   'Lifestyle, Diet & Supplements',
   359),
  ('Mental Health & Emotional Wellbeing',
   'Lifestyle, Diet & Supplements',
   219),
  ('Mental Health & Emotional Wellbeing',
   "Reproductive & Sexual Health (Women's Health & Men's Health)",
   177),
  ('Metabolic & Endocrine Disorders', 'Lifestyle, Diet & Supplements', 159),
  ('Cardiovascular Diseases', 'Lifestyle, Diet & Supplements', 146)],
 [("Reproductive & Sexual Health (Women's Health & Men's Health)",
   'Lifestyle, Diet & Supplements',
   394),
  ('Mental Health & Emotional Wellbeing',
   'Lifestyle, Diet & Supplements',
   322),
  ('Mental Health & Emotional Wellbeing',
   "Reproductive & Sexual Health (Women's Health & Men's Health)",
   261),
  ('Cancer & Oncology', 'Lifestyle, Diet & Supplements', 207),
  ("Reproductive & Sexual Health (Women's Health & Men's Health)",
   'Cancer & Oncology',
   165)]]
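
To compare the strongest pairs between two groups (the question raised in the outline), the two lists can be reduced to sets of unordered pairs. The sketch below copies the two top-5 lists from the output above; which group each list belongs to is not stated there, so the names `group_a` and `group_b` are placeholders.

```python
REPRO = "Reproductive & Sexual Health (Women's Health & Men's Health)"
LIFESTYLE = "Lifestyle, Diet & Supplements"
MENTAL = "Mental Health & Emotional Wellbeing"
METABOLIC = "Metabolic & Endocrine Disorders"
CARDIO = "Cardiovascular Diseases"
CANCER = "Cancer & Oncology"

# Top-5 pairs copied from the output above (placeholder group names)
group_a = [
    (REPRO, LIFESTYLE, 359), (MENTAL, LIFESTYLE, 219), (MENTAL, REPRO, 177),
    (METABOLIC, LIFESTYLE, 159), (CARDIO, LIFESTYLE, 146),
]
group_b = [
    (REPRO, LIFESTYLE, 394), (MENTAL, LIFESTYLE, 322), (MENTAL, REPRO, 261),
    (CANCER, LIFESTYLE, 207), (REPRO, CANCER, 165),
]

def pair_weights(pairs):
    """Map each undirected pair to its weight, ignoring source/target order."""
    return {frozenset((u, v)): w for u, v, w in pairs}

a, b = pair_weights(group_a), pair_weights(group_b)

shared = a.keys() & b.keys()  # pairs in both top-5 lists
only_a = a.keys() - b.keys()  # pairs specific to the first group
only_b = b.keys() - a.keys()  # pairs specific to the second group

for pair in only_a:
    print("only in group_a:", sorted(pair), a[pair])
for pair in only_b:
    print("only in group_b:", sorted(pair), b[pair])
```

Using `frozenset` keys makes the comparison independent of which cluster happens to be stored as source and which as target.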