# Assignment 2
### 02467 Computational Social Science Group 6

## Part 1: Properties of the real-world network of Computational Social Scientists

1. Random Network

In [None]:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import ast
import numpy as np

# 1. Load data
papers_df = pd.read_csv("papers.csv")
author_df = pd.read_csv("author_data.csv")

# 2. Convert stringified lists into actual Python lists
papers_df["author_ids"] = papers_df["author_ids"].apply(ast.literal_eval)

# 3. Create the real co-authorship network (G_real)
G_real = nx.Graph()
for row in papers_df.itertuples():
    authors = row.author_ids
    for i in range(len(authors)):
        for j in range(i + 1, len(authors)):
            G_real.add_edge(authors[i], authors[j])

# 4. Compute real network stats
N = G_real.number_of_nodes()
L = G_real.number_of_edges()
p = L / (N * (N - 1) / 2)
avg_k = 2 * L / N

print(f"Real network: N = {N}, L = {L}, p = {p:.6f}, avg_k = {avg_k:.2f}")

# 5. Visualize the real network 
degrees = dict(G_real.degree())
node_sizes_real = [max(10, degrees[n] * 2) for n in G_real.nodes()]
pos_real = nx.spring_layout(G_real, seed=42, iterations=20)

plt.figure(figsize=(12, 12))
nx.draw_networkx_nodes(G_real, pos_real, node_size=node_sizes_real, node_color='orange', alpha=0.8)
nx.draw_networkx_edges(G_real, pos_real, edge_color='gray', width=0.8, alpha=0.3)
plt.title("Real Co-authorship Network", fontsize=16)
plt.axis('off')
plt.tight_layout()
plt.show()


# 6. Generate random network using np.random.uniform
def generate_random_network_with_uniform(N, p, seed=None):
    if seed is not None:
        np.random.seed(seed)

    G = nx.Graph()
    G.add_nodes_from(range(N))

    for i in range(N):
        for j in range(i + 1, N):
            if np.random.uniform(0, 1) < p:
                G.add_edge(i, j)
    return G

G_random = generate_random_network_with_uniform(N, p, seed=42)

# 7. Visualize the random network
degrees_rand = dict(G_random.degree())
node_sizes_rand = [max(10, degrees_rand[n] * 2) for n in G_random.nodes()]
pos_rand = nx.spring_layout(G_random, seed=42, iterations=20)

plt.figure(figsize=(12, 12))
nx.draw_networkx_nodes(G_random, pos_rand, node_size=node_sizes_rand, node_color='lightblue', alpha=0.8)
nx.draw_networkx_edges(G_random, pos_rand, edge_color='gray', width=0.8, alpha=0.3)
plt.title("Random Network", fontsize=16)
plt.axis('off')
plt.tight_layout()
plt.show()

### 1. What regime does your random network fall into? Is it above or below the critical threshold? 
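The random network is above the critical threshold. A giant component appears once the average degree $\langle k \rangle = p(N-1)$ exceeds 1, i.e. at roughly $p_c = \frac{1}{N}$, which in our case is about $0.0000658$. Our $p \approx 0.000419$ is about six times larger ($\langle k \rangle \approx 6.4$), so a giant component is expected to exist. At the same time, $\langle k \rangle$ is below $\ln N \approx 9.6$, so the network is in the supercritical regime and not yet in the fully connected one.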
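The comparison above can be written as a small check. This is a minimal sketch, not part of the original analysis: `classify_regime` is a hypothetical helper, and `N_example` is back-computed from the quoted $p_c \approx 0.0000658$, so it only approximates the real `N` printed by the notebook.

```python
import math

def classify_regime(N, p):
    """Classify an Erdos-Renyi network G(N, p) by its average degree <k> = p(N-1)."""
    avg_k = p * (N - 1)
    if avg_k < 1:
        regime = "subcritical"
    elif avg_k == 1:
        regime = "critical"
    elif avg_k <= math.log(N):
        regime = "supercritical"
    else:
        regime = "connected"
    return avg_k, regime

# Illustrative values (assumptions): N recovered from p_c = 1/N, p as quoted in the text
N_example = round(1 / 0.0000658)
p_example = 0.000419

avg_k, regime = classify_regime(N_example, p_example)
print(f"N = {N_example}, <k> = {avg_k:.2f}, ln N = {math.log(N_example):.2f}, regime: {regime}")
```

In the notebook itself, the same function could be called with the measured `N` and `p` instead of the illustrative values.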

### 2. According to the textbook, what does the network’s structure resemble in this regime?
In this regime, the network consists of one giant component that coexists with many small isolated components and single nodes. The degree distribution is approximately Poisson, meaning most nodes have degrees close to the average and high-degree nodes are very rare. Overall, the structure looks homogeneous, without hubs or communities.
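To make "high-degree nodes are rare" concrete, the Poisson probabilities can be evaluated directly. This is a rough sketch: `avg_k_example = 6.4` is an assumed value, roughly $p / p_c$ for the numbers quoted in the first answer, and `poisson_pmf` is a helper introduced here for illustration.

```python
import math

def poisson_pmf(k, avg_k):
    """Poisson probability P(k) = exp(-<k>) * <k>^k / k!"""
    return math.exp(-avg_k) * avg_k**k / math.factorial(k)

avg_k_example = 6.4  # assumption: approximate average degree of the random network

# Probability of each degree from 0 to 30
probs = {k: poisson_pmf(k, avg_k_example) for k in range(31)}

for k in (0, 3, 6, 10, 20, 30):
    print(f"P(k = {k:2d}) = {probs[k]:.2e}")
```

The probabilities peak near the average degree and fall off faster than exponentially, which is why a random network of this size is not expected to contain hubs.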


### 3. Based on your visualizations, identify the key differences between the actual and the random networks. Explain whether these differences are consistent with theoretical expectations.
The real co-authorship network is much more clustered and uneven: it has hubs, visible communities, and a lot of local structure. The random network, by contrast, looks more uniform and spread out, without clear groupings. These differences line up with what theory predicts: real social networks often show high clustering and modularity, while random networks, whose links are placed independently of each other, do not capture those social patterns.
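The clustering difference can be quantified with the average local clustering coefficient. In a random network the expected value is simply the link probability $p$, so a real network with clustering far above its density is more clustered than chance. The sketch below uses a hypothetical toy graph (two triangles joined by one link), not the assignment data, and a hand-written `average_clustering` helper that averages over all nodes and counts nodes with fewer than two neighbours as 0.

```python
from itertools import combinations

def average_clustering(adj):
    """Average local clustering coefficient of an undirected graph.

    adj maps each node to the set of its neighbours.
    """
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # fewer than two neighbours: local clustering counted as 0
        # Number of links among the neighbours of this node
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

# Toy graph (assumption): triangles 0-1-2 and 3-4-5 joined by the link 2-3
toy = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}

n_nodes = len(toy)
n_links = sum(len(nbrs) for nbrs in toy.values()) // 2
density = n_links / (n_nodes * (n_nodes - 1) / 2)  # plays the role of p
clustering = average_clustering(toy)

print(f"density p = {density:.3f}, average clustering = {clustering:.3f}")
```

On the notebook's graphs, the same comparison could be made between the average clustering of `G_real` and the value of `p` computed earlier.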

In [None]:
# 1. Get giant component from real network
components_real = list(nx.connected_components(G_real))
largest_cc_real = max(components_real, key=len)
G_real_giant = G_real.subgraph(largest_cc_real).copy()

# 2. Get giant component from random network
components_rand = list(nx.connected_components(G_random))
largest_cc_rand = max(components_rand, key=len)
G_rand_giant = G_random.subgraph(largest_cc_rand).copy()

# 3. Calculate average shortest path lengths
avg_path_real = nx.average_shortest_path_length(G_real_giant)
avg_path_rand = nx.average_shortest_path_length(G_rand_giant)

print(f"Avg shortest path (Real Network): {avg_path_real:.4f}")
print(f"Avg shortest path (Random Network): {avg_path_rand:.4f}")

### 1. Why do we consider the giant component only?

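Because the average shortest path length is only defined for pairs of nodes that are connected. If we include the other components, some pairs are unreachable (infinite distance), so the average is undefined; `nx.average_shortest_path_length` raises an error on a disconnected graph. Restricting the calculation to the giant component ensures that every pair of nodes is reachable.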
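A small breadth-first search illustrates the problem. This is a sketch on a hypothetical five-node graph, not the assignment data; `bfs_distances` and `average_shortest_path` are helpers written for this example, and returning `None` for unreachable pairs is a choice made here for illustration.

```python
from collections import deque

def bfs_distances(adj, source):
    """Unweighted shortest-path distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_shortest_path(adj, nodes):
    """Mean distance over all ordered pairs in nodes, or None if a pair is unreachable."""
    nodes = set(nodes)
    total, pairs = 0, 0
    for s in nodes:
        dist = bfs_distances(adj, s)
        for t in nodes:
            if t == s:
                continue
            if t not in dist:
                return None  # at least one pair has no path: the average is undefined
            total += dist[t]
            pairs += 1
    return total / pairs if pairs else 0.0

# Toy graph (assumption): path 0-1-2 plus a separate link 3-4
toy = {0: {1}, 1: {0, 2}, 2: {1}, 3: {4}, 4: {3}}

whole = average_shortest_path(toy, toy.keys())  # includes unreachable pairs
giant = average_shortest_path(toy, {0, 1, 2})   # largest component only

print("whole graph:", whole)
print("giant component:", giant)
```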

### 2. Why do we consider unweighted edges?

The goal is to examine the basic topological structure of the network (small-world phenomenon), not edge weights or strengths. Unweighted paths help us measure pure connectivity.

### 3. Does the Computational Social Scientists network exhibit the small-world phenomenon?

Likely yes. A small-world network has an average shortest path length comparable to that of a random graph with the same N and L, while its clustering is much higher. If the average shortest path of the real giant component printed above is similar to, or only slightly higher than, that of the random network, the first criterion is met. The second criterion can be checked by comparing the average clustering coefficients of the two networks (for example with `nx.average_clustering`).
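As a reference point, the textbook small-world estimate for a random network is $\langle d \rangle \approx \ln N / \ln \langle k \rangle$. The sketch below evaluates it for assumed values: `N_example` and `avg_k_example` are approximations derived from the $p$ and $p_c$ quoted in Part 1, not the measured statistics, and `expected_random_distance` is a helper named here for illustration.

```python
import math

def expected_random_distance(N, avg_k):
    """Small-world estimate <d> = ln N / ln <k> for a random network (needs <k> > 1)."""
    if avg_k <= 1:
        raise ValueError("the estimate requires an average degree above 1")
    return math.log(N) / math.log(avg_k)

# Illustrative values (assumptions), not the measured network statistics
N_example = 15198
avg_k_example = 6.4

d_estimate = expected_random_distance(N_example, avg_k_example)
print(f"Expected average distance in the random network: {d_estimate:.2f}")
```

If the measured `avg_path_real` is of the same order as this estimate, the path-length part of the small-world criterion holds.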