# Assignment 2
### 02467 Computational Social Science Group 6

## Part 1: Properties of the real-world network of Computational Social Scientists

1. Random Network

In [None]:
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import ast
import numpy as np

# 1. Load data
papers_df = pd.read_csv("papers.csv")
author_df = pd.read_csv("author_data.csv")  # CSV file, so read_csv rather than read_excel

# 2. Convert stringified lists into actual Python lists
papers_df["author_ids"] = papers_df["author_ids"].apply(ast.literal_eval)

# 3. Create the real co-authorship network (G_real)
G_real = nx.Graph()
for row in papers_df.itertuples():
    authors = row.author_ids
    for i in range(len(authors)):
        for j in range(i + 1, len(authors)):
            G_real.add_edge(authors[i], authors[j])

# 4. Compute real network stats
N = G_real.number_of_nodes()
L = G_real.number_of_edges()
p = L / (N * (N - 1) / 2)
avg_k = 2 * L / N

print(f"Real network: N = {N}, L = {L}, p = {p:.6f}, avg_k = {avg_k:.2f}")

# 5. Visualize the real network 
degrees = dict(G_real.degree())
node_sizes_real = [max(10, degrees[n] * 2) for n in G_real.nodes()]
pos_real = nx.spring_layout(G_real, seed=42, iterations=20)

plt.figure(figsize=(12, 12))
nx.draw_networkx_nodes(G_real, pos_real, node_size=node_sizes_real, node_color='orange', alpha=0.8)
nx.draw_networkx_edges(G_real, pos_real, edge_color='gray', width=0.8, alpha=0.3)
plt.title("Real Co-authorship Network", fontsize=16)
plt.axis('off')
plt.tight_layout()
plt.show()


# 6. Generate random network using np.random.uniform
def generate_random_network_with_uniform(N, p, seed=None):
    if seed is not None:
        np.random.seed(seed)

    G = nx.Graph()
    G.add_nodes_from(range(N))

    for i in range(N):
        for j in range(i + 1, N):
            if np.random.uniform(0, 1) < p:
                G.add_edge(i, j)
    return G

G_random = generate_random_network_with_uniform(N, p, seed=42)

# 7. Visualize the random network
degrees_rand = dict(G_random.degree())
node_sizes_rand = [max(10, degrees_rand[n] * 2) for n in G_random.nodes()]
pos_rand = nx.spring_layout(G_random, seed=42, iterations=20)

plt.figure(figsize=(12, 12))
nx.draw_networkx_nodes(G_random, pos_rand, node_size=node_sizes_rand, node_color='lightblue', alpha=0.8)
nx.draw_networkx_edges(G_random, pos_rand, edge_color='gray', width=0.8, alpha=0.3)
plt.title("Random Network", fontsize=16)
plt.axis('off')
plt.tight_layout()
plt.show()

### 1. What regime does your random network fall into? Is it above or below the critical threshold? 
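The random network is above the critical threshold. The critical value for a giant component to appear is roughly $p_c = \frac{1}{N}$, which in our case is about $0.0000658$. Since our $p \approx 0.000419$, it is well above that, so a giant component is expected to exist. At the same time, $p$ is still below the connectivity threshold $\frac{\ln N}{N} \approx 0.00063$, which places the network in the supercritical regime rather than the fully connected one.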
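
As a rough check of the regime, the thresholds can be computed directly from $N$ and $p$. The sketch below is only illustrative: the helper name `classify_regime` is ours, and the example value of `N` is an assumption back-computed from the quoted $1/N \approx 0.0000658$ (in the notebook, the variables `N` and `p` from the cell above would be used instead).

```python
import math

def classify_regime(n_nodes: int, p: float) -> str:
    """Classify a G(N, p) random network by its average degree <k> = p * (N - 1)."""
    avg_k = p * (n_nodes - 1)
    if avg_k < 1:
        return "subcritical"      # no giant component
    if avg_k == 1:
        return "critical"         # giant component is about to emerge
    if avg_k < math.log(n_nodes):
        return "supercritical"    # giant component plus smaller components
    return "connected"            # giant component absorbs (almost) all nodes

# Illustrative values only: N is inferred from 1/N ≈ 0.0000658, p from the text.
N_example = 15198
p_example = 0.000419
print(classify_regime(N_example, p_example))
```

The comparison is done on the average degree because $\langle k \rangle = 1$ and $\langle k \rangle = \ln N$ are equivalent to the thresholds $p \approx 1/N$ and $p \approx \ln N / N$.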

### 2. According to the textbook, what does the network’s structure resemble in this regime?
In this regime, the network typically consists of one giant component alongside smaller components and isolated nodes. The degree distribution tends to follow a Poisson pattern, meaning most nodes have degrees close to the average, and high-degree nodes are rare. Overall, the structure looks fairly uniform and lacks distinct features such as hubs or communities.
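
The Poisson claim can be checked numerically. The following is a minimal sketch on a small synthetic network (not the assignment's data); the function names and the values of `n` and `p` are chosen here purely for illustration.

```python
import math
import random
from collections import Counter

def poisson_pmf(k: int, lam: float) -> float:
    """P(degree = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def gnp_degrees(n: int, p: float, seed: int) -> list:
    """Degrees of one G(n, p) sample: each pair is linked independently with probability p."""
    rng = random.Random(seed)
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                degree[i] += 1
                degree[j] += 1
    return degree

# Small illustrative network: N = 1000 with an expected average degree near 6.
n, p = 1000, 0.006
degrees = gnp_degrees(n, p, seed=42)
avg_k = sum(degrees) / n
observed = Counter(degrees)

# Compare the observed share of nodes with degree k to the Poisson prediction.
for k in range(15):
    print(f"k={k:2d}  observed={observed[k] / n:.3f}  poisson={poisson_pmf(k, avg_k):.3f}")
```

In a run like this the two columns should track each other closely, with almost no nodes far above the mean, which is exactly what the real co-authorship network does not look like.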


### 3. Based on your visualizations, identify the key differences between the actual and the random networks. Explain whether these differences are consistent with theoretical expectations.
The real co-authorship network is much more clustered and uneven—it has hubs, visible communities, and a lot of local structure. The random network, by contrast, looks more uniform and spread out, without clear groupings. These differences line up with what theory predicts: real social networks often show high clustering and modularity, while random networks don’t capture those social patterns.

In [None]:
# 1. Get giant component from real network
components_real = list(nx.connected_components(G_real))
largest_cc_real = max(components_real, key=len)
G_real_giant = G_real.subgraph(largest_cc_real).copy()

# 2. Get giant component from random network
components_rand = list(nx.connected_components(G_random))
largest_cc_rand = max(components_rand, key=len)
G_rand_giant = G_random.subgraph(largest_cc_rand).copy()

# 3. Calculate average shortest path lengths
avg_path_real = nx.average_shortest_path_length(G_real_giant)
avg_path_rand = nx.average_shortest_path_length(G_rand_giant)

print(f"Avg shortest path (Real Network): {avg_path_real:.4f}")
print(f"Avg shortest path (Random Network): {avg_path_rand:.4f}")

### 1. Why do we consider the giant component only?

Because average shortest path length is only defined between connected node pairs. If we include isolated components, the metric becomes meaningless (infinite distance). The giant component ensures that all node pairs are reachable.
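
A small toy example (not the assignment's data) illustrates the point: `networkx` refuses to average over a disconnected graph, and restricting to the largest component makes the metric well defined.

```python
import networkx as nx

# Toy graph with two components: a path 0-1-2 and a separate edge 3-4.
G = nx.Graph([(0, 1), (1, 2), (3, 4)])

try:
    nx.average_shortest_path_length(G)
    raised = False
except nx.NetworkXError:
    raised = True  # some node pairs have no path, so the average is undefined

# Restrict to the largest connected component before averaging.
giant_nodes = max(nx.connected_components(G), key=len)
giant = G.subgraph(giant_nodes).copy()
avg_path = nx.average_shortest_path_length(giant)

print(raised, avg_path)
```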

### 2. Why do we consider unweighted edges?

The goal is to examine the basic topological structure of the network (small-world phenomenon), not edge weights or strengths. Unweighted paths help us measure pure connectivity.

### 3. Does the Computational Social Scientists network exhibit the small-world phenomenon?

Likely yes. A network is small-world if its average shortest path length is similar to (or only slightly higher than) that of a random graph with the same N and p, while its clustering is much higher. The two values printed above give the path-length comparison; the clustering coefficient was not computed here, so the second criterion still needs to be checked to confirm the conclusion.
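
Both criteria can be checked together. The sketch below is one possible way to do it, shown on a toy Watts-Strogatz graph rather than on the assignment's data; the helper name `small_world_summary` is ours, and the baselines $\ln N / \ln \langle k \rangle$ and $\langle k \rangle / N$ are the usual random-network approximations for path length and clustering.

```python
import math
import networkx as nx

def small_world_summary(G: nx.Graph) -> dict:
    """Path length and clustering of G's giant component, with random-graph baselines."""
    giant = G.subgraph(max(nx.connected_components(G), key=len)).copy()
    n = giant.number_of_nodes()
    avg_k = 2 * giant.number_of_edges() / n  # must exceed 1 for the log baseline
    return {
        "avg_path": nx.average_shortest_path_length(giant),
        "clustering": nx.average_clustering(giant),
        "random_path_estimate": math.log(n) / math.log(avg_k),
        "random_clustering_estimate": avg_k / n,
    }

# Toy stand-in for a clustered network (assumption: illustrative parameters only).
G_toy = nx.connected_watts_strogatz_graph(n=200, k=6, p=0.1, seed=42)
summary = small_world_summary(G_toy)

for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Applied to `G_real`, a small-world network would show `avg_path` close to `random_path_estimate` and `clustering` far above `random_clustering_estimate`.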

## Part 3 - Words that characterize Computational Social Science communities

### 1.1 What does TF stand for?
#### Answer: TF stands for term frequency, which measures how often a specific term appears in a text document. A higher frequency suggests the word is more relevant to the document's content.

### 1.2 What does IDF stand for?
#### Answer: IDF stands for inverse document frequency, which reduces the weight of common words and increases the weight of 'rarer' words. A word that appears in fewer documents is treated as more meaningful.
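
To make the two definitions concrete, here is a minimal worked example on three made-up token lists (hypothetical data, one "document" per community). It uses the plain textbook form $\mathrm{IDF}(t) = \ln(N / n_t)$, which differs slightly from the smoothed variant used by some libraries.

```python
import math
from collections import Counter

# Three tiny hypothetical documents, one per community.
docs = {
    "A": ["data", "network", "network", "user"],
    "B": ["data", "reddit", "user", "user"],
    "C": ["data", "survey", "participant", "data"],
}

def term_frequency(tokens):
    """Share of the document's tokens taken up by each term."""
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}

def inverse_document_frequency(documents):
    """ln(N / n_t), where n_t is the number of documents containing term t."""
    n_docs = len(documents)
    doc_freq = Counter(term for tokens in documents.values() for term in set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

idf = inverse_document_frequency(docs)
tf_A = term_frequency(docs["A"])
tfidf_A = {term: tf * idf[term] for term, tf in tf_A.items()}

print(tf_A)
print(tfidf_A)
```

Here "data" appears in every document, so its IDF is zero and it drops out of the TF-IDF ranking, while "network" (frequent in A, absent elsewhere) rises to the top.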

### 2.

In [None]:
import pandas as pd
import requests
from collections import Counter
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor
import json
from sklearn.feature_extraction.text import TfidfVectorizer 
from wordcloud import WordCloud
import matplotlib.pyplot as plt

In [None]:
import ast  # for safely parsing stringified lists

MAX_WORKERS = 8
API_DELAY = 0.1

authors_df = pd.read_csv('data/authors.csv')
author_comm = pd.read_csv('data/author_communities.csv')
abstracts = pd.read_csv('data/abstracts_with_collocations.csv')

# Preprocess Tokens: parse stringified lists without executing arbitrary code
abstracts['tokens'] = abstracts['collocation_tokens'].map(
    lambda x: ast.literal_eval(x) if isinstance(x, str) else x,
    na_action='ignore'
)
work_to_tokens = {
    work_id.split('/')[-1]: tokens  # Extract 'W...' from URLs
    for work_id, tokens in zip(abstracts['id'], abstracts['tokens'])
}

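# Parallel API Fetching
def fetch_author_works(author_url):
    """Return (author_key, list of short work IDs), or (author_key, None) on failure."""
    author_key = author_url.split('/')[-1]  # e.g. 'works?filter=author.id:A...'
    try:
        # Pass per-page via params so it is merged into any existing query string.
        # Only the first page (up to 200 works) is fetched.
        response = requests.get(author_url, params={"per-page": 200}, timeout=10)
        if response.status_code == 200:
            works = [w['id'].split('/')[-1] for w in response.json().get('results', [])]
            return (author_key, works)
        return (author_key, None)
    except Exception:
        return (author_key, None)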

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    results = list(tqdm(
        executor.map(fetch_author_works, authors_df['Works API URL']),
        total=len(authors_df)
    ))

author_to_works = {k: v for k, v in results if v is not None}
failed_authors = [k for k, v in results if v is None]

In [None]:
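# Community Processing
valid_works = set(work_to_tokens)  # short 'W...' IDs, matching the values in author_to_works
community_tokens = {}
community_token_arrays = {} 
community_stats = {}

for comm, group in author_comm.groupby('community'):
    works = [
        work 
        for author in group['author_id'] 
        # author_to_works is keyed by the last URL segment of the Works API URL
        for work in author_to_works.get(f"works?filter=author.id:{author.split('/')[-1]}", [])
        if work in valid_works
    ]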
    
    comm_tokens = [
        tok 
        for work in works 
        for tok in work_to_tokens.get(work, [])
    ]
    
    community_tokens[comm] = comm_tokens

# Save
with open('community_tokens.json', 'w') as f:
    json.dump(community_tokens, f)

for community, group in tqdm(author_comm.groupby('community'), desc="Processing communities"):
    token_array = []  # store ALL tokens as a list of strings
    stats = {
        'authors_processed': 0,
        'total_works': 0,
        'works_with_tokens': 0
    }
    
    for author_id in group['author_id']:
        author_key = f"works?filter=author.id:{author_id.split('/')[-1]}"
        
        if author_key not in author_to_works: continue
            
        stats['authors_processed'] += 1
        
        for work_id in author_to_works[author_key]:
            stats['total_works'] += 1
            if work_id in work_to_tokens:
                tokens = work_to_tokens[work_id]
                if tokens:  # Only add non-empty
                    token_array.extend(tokens)
                    stats['works_with_tokens'] += 1
    
    community_token_arrays[community] = token_array
    community_stats[community] = stats

# Save
with open('community_token_arrays.json', 'w') as f: 
    json.dump(community_token_arrays, f)

print("\nArray created and saved")

### 3. Calculate TF for each word and find the top 5 terms within the top 5 communities

In [None]:
# Reload inputs; a missing file raises FileNotFoundError
author_comm = pd.read_csv('data/author_communities.csv')
with open('community_token_arrays.json') as f:
    community_token_arrays = json.load(f)

community_sizes = author_comm['community'].value_counts().head(5)
top_communities = community_sizes.index.astype(str).tolist()
print("Top 5 communities by author count:", top_communities)

top_community_tf = {}
for comm in top_communities:
    tokens = community_token_arrays.get(comm, []) 
    if not tokens: continue
        
    tf = Counter(tokens)
    total_terms = sum(tf.values())
    top_terms = [
        (term, count/total_terms) 
        for term, count in tf.most_common(5)
    ]
    top_community_tf[comm] = top_terms

print("\nTop Terms:")
for comm in top_communities:
    if str(comm) not in top_community_tf: continue
    term_line = ", ".join([f"{term} ({freq:.3f}, {(freq*100):.1f}%)" for term, freq in top_community_tf[str(comm)]])
    
    print(f"\nCommunity {comm} (Authors: {author_comm[author_comm['community']==int(comm)].shape[0]}): {term_line}")

In [None]:
## Next, we calculate IDF for every word.

author_comm = pd.read_csv('data/author_communities.csv')
with open('community_token_arrays.json') as f: community_token_arrays = json.load(f)

top_communities = author_comm['community'].value_counts().head(9).index.astype(str).tolist()

documents = [" ".join(community_token_arrays.get(comm, [])) for comm in top_communities]
tf_results = {}
for i, comm in enumerate(top_communities):
    counter = Counter(documents[i].split())
    total = sum(counter.values())
    tf_results[comm] = [(word, count/total) for word, count in counter.most_common(10)]

vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()

tfidf_results = {}
for i, comm in enumerate(top_communities):
    feature_index = tfidf_matrix[i,:].nonzero()[1]
    tfidf_scores = zip(feature_index, [tfidf_matrix[i, x] for x in feature_index])
    tfidf_results[comm] = [(feature_names[idx], score) for idx, score in 
                         sorted(tfidf_scores, key=lambda x: x[1], reverse=True)[:10]]
    
top_authors = {}
for comm in top_communities:
    top_authors[comm] = (author_comm[author_comm['community'] == int(comm)]
                        .sort_values('degree', ascending=False)
                        .head(3)[['author_name', 'degree', 'author_id']]
                        .values.tolist())

for comm in top_communities:
    print(f"\nCommunity {comm} ({author_comm[author_comm['community']==int(comm)].shape[0]} authors)")
    
    # TF terms
    tf_str = " | ".join([f"{word}:{score:.3f}" for word, score in tf_results[comm]])
    print(f"Top 10 TF: {tf_str}")
    
    # TF-IDF terms
    tfidf_str = " | ".join([f"{word}:{score:.3f}" for word, score in tfidf_results[comm]])
    print(f"Top 10 TF-IDF: {tfidf_str}")
    
    # Top authors in one line
    authors_str = " | ".join([
        f"{name if pd.notna(name) else f'[{author_id}]'}:{degree}" 
        for name, degree, author_id in top_authors[comm]
    ])
    print(f"Top 3 Authors: {authors_str}")

### Describe similarities and differences between the communities.
#### Similarities: Most of the communities share words like "data", "information", and "users". There are also recurring themes related to human-computer interaction.
#### Differences: The focus of each community appears to be quite different. E.g., Community 6 focuses on networks, while Community 17 focuses on social media.

### Why are the TFs not necessarily a good description of the communities?
#### Answer: TF does not account for how distinctive a term is to a particular community. A word can be common in other communities as well: "data" could appear in all communities while referring to different focuses and subject matters. Moreover, rare but significant words can get drowned out by higher-frequency terms. A word like "uncertainty" would be more informative than "data", but "data" appears much more often.

### What base logarithm did you use? Is that important?
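#### Answer: The logarithm is base e (the natural logarithm), which is what scikit-learn's `TfidfVectorizer` uses. The base is not that important as long as it is consistent across calculations: changing it only rescales every logarithm by the same constant factor, so the ranking of terms stays the same.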
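
A short check of this claim, using made-up document frequencies (hypothetical numbers) and the plain form $\mathrm{IDF}(t) = \log(N / n_t)$:

```python
import math

n_docs = 9  # number of community documents
doc_freq = {"data": 9, "network": 3, "reddit": 1}  # hypothetical document frequencies

idf_e = {term: math.log(n_docs / df) for term, df in doc_freq.items()}   # base e
idf_2 = {term: math.log2(n_docs / df) for term, df in doc_freq.items()}  # base 2

ranking_e = sorted(idf_e, key=idf_e.get, reverse=True)
ranking_2 = sorted(idf_2, key=idf_2.get, reverse=True)

# Every base-2 value is the base-e value multiplied by the same constant 1 / ln(2).
ratio = idf_2["reddit"] / idf_e["reddit"]

print(ranking_e, ranking_2, ratio)
```

Both rankings are identical; only the scale of the scores differs.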

### Are these 10 words more descriptive of the community? If yes, what is it about IDF that makes the words more informative?
#### Answer: Yes. IDF makes 'rarer' words stand out more, such as 'reddit' (Community 20) and 'superspreaders' (Community 26), which give more context as to what each community's tokens are about. The TF lists, by comparison, are dominated by generic words like 'people', 'humans' and 'information', which give little context.

In [None]:
author_comm = pd.read_csv('data/author_communities.csv')
with open('community_token_arrays.json') as f:
    community_token_arrays = json.load(f)

top_communities = author_comm['community'].value_counts().head(9).index.astype(str).tolist()
top_authors = {}
for comm in top_communities:
    top_authors[comm] = (author_comm[author_comm['community'] == int(comm)]
                        .sort_values('degree', ascending=False)
                        .head(3)[['author_name', 'degree', 'author_id']]
                        .values.tolist())

for comm in top_communities:
    # Prepare text data
    text = " ".join(community_token_arrays.get(comm, []))
    
    wordcloud = WordCloud(
        width=800, 
        height=400,
        background_color='white',
        colormap='viridis',
        max_words=50,
        contour_width=3,
        contour_color='steelblue'
    ).generate(text)
    
    authors_info = "\n".join([
        f"{i+1}. {name if pd.notna(name) else f'[{author_id}]'} (degree: {degree})"
        for i, (name, degree, author_id) in enumerate(top_authors[comm])
    ])
    
    plt.figure(figsize=(12, 8))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.title(f"Community {comm} Word Cloud", fontsize=16, pad=20)
    plt.figtext(
        0.5, 0.05, 
        f"Top Authors:\n{authors_info}",
        ha="center",
        fontsize=12,
        bbox={"facecolor":"white", "alpha":0.8, "pad":5}
    )
    
    plt.tight_layout()
    plt.show()

### Comment on your results. What can you conclude on the different sub-communities in Computational Social Science?
#### Answer: The different sub-communities may all have different focuses but they are all related to people in some way, with the presence of words like "human", "user", "individual" and "people".

### Look up online the top author in each community. In light of your search, do your results make sense?
#### Yes, the top authors in each community appear to be credible professors at well-established universities, which is consistent with their high degree in the community. Furthermore, the words in each word cloud seem to agree with their field of study. For example, in the Community 8 word cloud, the top author Stephan Lewandowsky is a psychologist and the words present appear to be related, such as 'psychological', 'participant', 'research'.

### Go back to Week 1, Exercise 1. Revise what you wrote on the topics in Computational Social Science. In light of your data-driven analysis, has your understanding of the field changed? How? (max 150 words)
#### My understanding has changed slightly. Based on my results, I believe Computational Social Science is highly relevant to humans. It is also not a single field of specific study, but rather a tapestry of different niches. Although there is some overlap (TF), the IDF terms show the different focuses of each community. I also learnt that it is a highly interdisciplinary field, as shown by the variety of IDF terms: 'advertising', 'social media', 'cultural' and 'superspreaders', to name a few.