# Graph visualisation - social graph of Wikipedia editors

This notebook presents the full pipeline used to construct and analyse a user–user interaction graph of Wikipedia editors.

* **Nodes** represent individual users (editors).
* **Edges** connect pairs of users who have edited at least one common article.
* **Edge weights** correspond to the number of distinct articles co-edited, capturing the intensity of shared editing activity
(a normalized alternative based on Jaccard similarity can be introduced at a later stage).
* **Node attributes** include editor metadata such as user type, total number of edits, and average weaponising ratio.
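
The Jaccard-normalised edge weight mentioned in the list above could be computed along the following lines. This is a minimal sketch on toy data, not part of the pipeline: the `articles_by_user` mapping and the `jaccard_weights` helper are hypothetical names introduced for illustration.

```python
from itertools import combinations


def jaccard_weights(articles_by_user):
    """Return {(u1, u2): |A1 & A2| / |A1 | A2|} for pairs sharing an article."""
    weights = {}
    for u1, u2 in combinations(sorted(articles_by_user), 2):
        a1, a2 = articles_by_user[u1], articles_by_user[u2]
        shared = len(a1 & a2)
        if shared:  # keep only pairs that co-edited at least one article
            weights[(u1, u2)] = shared / len(a1 | a2)
    return weights


# Toy data (hypothetical users and articles)
articles_by_user = {
    "alice": {"A", "B", "C"},
    "bob": {"B", "C"},
    "carol": {"D"},
}

weights = jaccard_weights(articles_by_user)
print(weights)
```

Unlike the raw co-edit count, this weight is bounded between 0 and 1 and does not grow simply because both editors are very prolific.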

Community detection is performed using the **Leiden algorithm**, and the resulting clusters are visualized with the **Distributed Recursive Layout (DRL)**, which is well suited for large graphs.

The detected clusters correspond to communities of editors with overlapping article portfolios, highlighting patterns of shared editorial focus and potential coordination.

In [107]:
import pandas as pd
from itertools import combinations
from collections import Counter
import igraph as ig
import leidenalg
import numpy as np

In [108]:
# import the dataset. We will first test on the small database of 2336 unique users.
df = pd.read_csv('../datas/final/small_db_preprocess.csv')
df

Unnamed: 0,article,user,date,comment,llm_output,weaponised,year,user_type
0,COVID-19 pandemic in Ukraine,Agathoclea,2020-03-11T20:56:06Z,removed [[Category:2019–20 coronavirus outbrea...,"Changed the category from ""2019–20 coronavirus...",Not Weaponised,2020,Registered
1,History of Ukraine,Icey,2006-05-21T14:09:22Z,/* Further reading */ Disambiguation link repa...,Changed the reference format for Andrew Wilson...,Not Weaponised,2006,Registered
2,History of Ukraine,Irpen,2006-06-06T21:00:08Z,"this whole section doesn't belong here, speara...","Removed a section titled ""Ukraine and Nuclear ...",Not Weaponised,2006,Registered
3,History of Ukraine,193.60.161.100,2006-05-23T11:39:26Z,,"Changed ""beyond"" to ""gayniss"" in the context o...",Not Weaponised,2006,Anonymous (IP)
4,History of Ukraine,Irpen,2006-06-14T17:49:44Z,revert to myself,Removed a POV (point of view) section regardin...,Not Weaponised,2006,Registered
...,...,...,...,...,...,...,...,...
6917,COVID-19 pandemic in Ukraine,LSGH,2021-05-06T05:57:13Z,Updating number of cases in infobox,"Changed confirmed cases, recovery cases, death...",Not Weaponised,2021,Registered
6918,2014 pro-Russian unrest in Ukraine,Garik 11,2014-04-08T17:05:13Z,/* Latvian citizen arrested */ more detail abo...,Changed the description of a Latvian citizen b...,Not Weaponised,2014,Registered
6919,2014 pro-Russian unrest in Ukraine,Aleksandr Grigoryev,2014-04-19T15:24:13Z,/* Kidnapping of Ukrainian officials */ update,The change made in this revision is the additi...,Weaponised,2014,Registered
6920,Censuses in Ukraine,Aleksandr Grigoryev,2012-11-17T17:14:35Z,/* External links */ update,"Added a template for ""Ukraine topics"" and a ca...",Not Weaponised,2012,Registered


In [109]:
df["is_weaponised"] = (df["weaponised"].str.lower().str.strip() == "weaponised").astype(int)

# Group by (user, article) pair and compute the fraction of weaponised edits for each pair:
#   0.0           -> neutral behaviour on that article
#   1.0           -> fully weaponised on that article
#   intermediate  -> mixed behaviour
weaponising_ratio_df = (
    df.groupby(["user", "article", "user_type"])
      .agg(
          total_edits=("is_weaponised", "count"),
          weaponised_edits=("is_weaponised", "sum")
      )
      .reset_index()
)

weaponising_ratio_df["weaponising_ratio"] = (
    weaponising_ratio_df["weaponised_edits"] / weaponising_ratio_df["total_edits"]
)

# Encode each article name as an integer ID using cat.codes
weaponising_ratio_df["article_id"] = (
    weaponising_ratio_df["article"].astype("category").cat.codes
)

df_graph = weaponising_ratio_df.rename(
    columns={"total_edits": "n_edits"}
)

# Drop bot accounts.
df_graph = df_graph[df_graph["user_type"] != "Bot"]
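
To make the ratio concrete, here is a small worked example on invented rows (the users, articles and labels are made up for illustration). It applies the same count/sum aggregation and shows the three cases described in the comments: neutral, mixed and fully weaponised.

```python
import pandas as pd

# Invented edits: one row per revision
toy = pd.DataFrame({
    "user":          ["u1", "u1", "u1", "u2", "u2", "u3"],
    "article":       ["A",  "A",  "A",  "A",  "A",  "B"],
    "is_weaponised": [0,    0,    0,    1,    0,    1],
})

# Count all edits and weaponised edits per (user, article) pair
ratio = (
    toy.groupby(["user", "article"])["is_weaponised"]
       .agg(total_edits="count", weaponised_edits="sum")
       .reset_index()
)
ratio["weaponising_ratio"] = ratio["weaponised_edits"] / ratio["total_edits"]
print(ratio)
```

Here `u1` ends up at 0.0, `u2` at 0.5 and `u3` at 1.0 on their respective articles.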


In [110]:
pairs = []  # one (user1, user2) tuple per article that both users edited

# For each article, collect the set of users who edited it, then emit every unordered pair of those users.
for article, group in df_graph.groupby("article_id"):
    users = group["user"].unique()
    for u1, u2 in combinations(sorted(users), 2):
        pairs.append((u1, u2))

coedit_counts = Counter(pairs)

edges_df = pd.DataFrame(
    [(u1, u2, w) for (u1, u2), w in coedit_counts.items()],
    columns=["user1", "user2", "coedit_count"]
)
# Because each pair is appended once per article, the count is the number of distinct articles the pair co-edited. This count becomes the edge weight.

# Remove weak connections: keep only pairs that co-edited at least 3 articles.
# At this threshold (coedit_count >= 3), no IP users remain in the graph.
edges_df = edges_df[edges_df["coedit_count"] >= 3].reset_index(drop=True)

user_stats = (
    df_graph.groupby("user")
    .agg(
        total_edits=("n_edits", "sum"),
        mean_weaponising_ratio=("weaponising_ratio", "mean")
    )
    .reset_index()
)

# Graph from tuple list (handles string names properly)
g = ig.Graph.TupleList(
    edges_df.itertuples(index=False, name=None),  # ensures tuple of (user1, user2, weight)
    weights=True,
    directed=False
)

# Add node attributes
user_dict = user_stats.set_index("user").to_dict("index")

g.vs["total_edits"] = [user_dict.get(v["name"], {}).get("total_edits", 0) for v in g.vs]
g.vs["weaponising_ratio"] = [user_dict.get(v["name"], {}).get("mean_weaponising_ratio", 0) for v in g.vs]

partition = leidenalg.find_partition(
    g, leidenalg.ModularityVertexPartition, weights=g.es["weight"]
)
g.vs["cluster"] = partition.membership

print(f"Graph built: {g.vcount()} nodes, {g.ecount()} edges, {len(set(partition.membership))} clusters.")

Graph built: 51 nodes, 140 edges, 5 clusters.
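
The co-edit counting step can be checked on a toy input. In this sketch the `editors_by_article` mapping is hypothetical; sorting the users before calling `combinations` ensures each unordered pair is always emitted in the same order, so the `Counter` adds exactly one count per shared article.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: article -> set of users who edited it
editors_by_article = {
    "A": {"bob", "alice", "carol"},
    "B": {"alice", "bob"},
    "C": {"carol"},
}

coedits = Counter()
for users in editors_by_article.values():
    # sorted() makes ("alice", "bob") and ("bob", "alice") the same key
    coedits.update(combinations(sorted(users), 2))

print(coedits)
```

In this toy case only the (alice, bob) pair reaches a count of 2; an article with a single editor contributes no pairs at all.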


In [111]:
# Save to GraphML format (recommended for Gephi)
wanna_save = False
if wanna_save:
    g.write_graphml("../plots/user_network_min3.graphml")
    print("Graph exported")

## Cluster summaries & quantitative analysis

| Question                               | How                                      |
| -------------------------------------- | ---------------------------------------- |
| How big is each cluster?               | node counts                              |
| How dense is it?                       | internal edge density                    |
| Are clusters isolated?                 | inter-cluster edge weight                |
| Is weaponisation unevenly distributed? | mean / distribution of weaponising ratio |
| Do clusters differ by user type?       | registered vs IP                         |


In [126]:
nodes_df = pd.DataFrame({
    "user": g.vs["name"],
    "cluster": g.vs["cluster"],
    "total_edits": g.vs["total_edits"],
    "weaponising_ratio": g.vs["weaponising_ratio"]
})

nodes_df[nodes_df['cluster'] == 2]

Unnamed: 0,user,cluster,total_edits,weaponising_ratio
5,Nickst,2,9,0.0
16,Ali-al-Bakuvi,2,20,0.333333
18,RGloucester,2,69,0.22902
19,Tobby72,2,14,0.085714
20,Volunteer Marek,2,32,0.372619
38,Iryna Harpy,2,19,0.229167
39,Ahnoneemoos,2,61,0.113459
40,Dbachmann,2,111,0.201312
41,Soffredo,2,6,0.0
45,Ezhiki,2,31,0.172414


In [113]:
# cluster size + activity volume

cluster_size_df = (
    nodes_df
    .groupby("cluster")
    .agg(
        n_users=("user", "count"),
        total_edits=("total_edits", "sum"),
        mean_edits_per_user=("total_edits", "mean")
    )
    .reset_index()
    .sort_values("n_users", ascending=False)
)

cluster_size_df

Unnamed: 0,cluster,n_users,total_edits,mean_edits_per_user
0,0,19,613,32.263158
1,1,13,395,30.384615
2,2,10,372,37.2
3,3,6,89,14.833333
4,4,3,17,5.666667


In [114]:
# Weaponisation profile per cluster

cluster_weapon_df = (
    nodes_df
    .groupby("cluster")
    .agg(
        mean_weaponising_ratio=("weaponising_ratio", "mean"),
        median_weaponising_ratio=("weaponising_ratio", "median"),
        
        # number of users with strong weaponisation behaviour
        high_weaponisers=("weaponising_ratio", lambda x: (x >= 0.5).sum())
    )
    .reset_index()
)

cluster_weapon_df


Unnamed: 0,cluster,mean_weaponising_ratio,median_weaponising_ratio,high_weaponisers
0,0,0.220646,0.213542,2
1,1,0.141372,0.125,0
2,2,0.173704,0.186863,0
3,3,0.173976,0.180262,0
4,4,0.083333,0.083333,0
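
Note that each user's `weaponising_ratio` is the unweighted mean of that user's per-article ratios, so an article with a single edit counts as much as an article with many. The sketch below, on invented numbers, contrasts that mean with a pooled ratio (total weaponised edits over total edits); the pooled variant is shown for comparison only and is not what the notebook computes.

```python
import pandas as pd

# Invented per-(user, article) rows, shaped like the notebook's df_graph
toy = pd.DataFrame({
    "user": ["u1", "u1"],
    "article": ["A", "B"],
    "n_edits": [1, 9],
    "weaponised_edits": [1, 0],
})
toy["weaponising_ratio"] = toy["weaponised_edits"] / toy["n_edits"]

per_user = toy.groupby("user").agg(
    mean_of_ratios=("weaponising_ratio", "mean"),  # what the notebook uses
    weaponised=("weaponised_edits", "sum"),
    edits=("n_edits", "sum"),
)
per_user["pooled_ratio"] = per_user["weaponised"] / per_user["edits"]
print(per_user[["mean_of_ratios", "pooled_ratio"]])
```

For this invented user the mean of ratios is 0.5 while the pooled ratio is 0.1, which shows how a single weaponised edit on a rarely edited article can dominate the per-user average.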


In [115]:
# each row describes the distribution of weaponising behaviour among users in a cluster.
cluster_weapon_dist = (
    nodes_df
    .groupby("cluster")["weaponising_ratio"]
    .describe()
    .reset_index()
)

cluster_weapon_dist


Unnamed: 0,cluster,count,mean,std,min,25%,50%,75%,max
0,0,19.0,0.220646,0.178646,0.0,0.07275,0.213542,0.333333,0.560976
1,1,13.0,0.141372,0.140731,0.0,0.0,0.125,0.208739,0.375
2,2,10.0,0.173704,0.126344,0.0,0.092651,0.186863,0.22913,0.372619
3,3,6.0,0.173976,0.120914,0.0,0.10625,0.180262,0.246381,0.333333
4,4,3.0,0.083333,0.083333,0.0,0.041667,0.083333,0.125,0.166667


In [116]:
# High internal / low external : cohesive, well-separated community
# High external : bridge-like or diffuse cluster

cluster_ids = sorted(set(g.vs["cluster"]))

rows = []

for c in cluster_ids:
    nodes_in_c = [v.index for v in g.vs if v["cluster"] == c]
    nodes_not_c = [v.index for v in g.vs if v["cluster"] != c]

    internal_edges = g.es.select(_within=nodes_in_c)
    external_edges = g.es.select(_between=(nodes_in_c, nodes_not_c))

    rows.append({
        "cluster": c,
        "internal_edge_weight": sum(internal_edges["weight"]),
        "external_edge_weight": sum(external_edges["weight"]),
        "n_nodes": len(nodes_in_c)
    })

cluster_structure_df = pd.DataFrame(rows)
cluster_structure_df

Unnamed: 0,cluster,internal_edge_weight,external_edge_weight,n_nodes
0,0,122,73,19
1,1,61,73,13
2,2,87,91,10
3,3,22,69,6
4,4,9,6,3
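
One way to compare clusters of different sizes is to express the internal weight as a share of all edge weight touching the cluster. The sketch below applies this to the values reported in the table above; `internal_share` is a name introduced here for illustration.

```python
# (internal_edge_weight, external_edge_weight) per cluster, copied from the table above
weights = {0: (122, 73), 1: (61, 73), 2: (87, 91), 3: (22, 69), 4: (9, 6)}

# Share of a cluster's total incident weight that stays inside the cluster
internal_share = {
    c: internal / (internal + external)
    for c, (internal, external) in weights.items()
}

for c, share in internal_share.items():
    print(f"cluster {c}: {share:.2f}")
```

On these numbers only clusters 0 and 4 keep more than half of their weight internally (about 0.63 and 0.60), while cluster 3 is the most outward-facing at about 0.24.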


## Deep analysis of clusters

From the quantitative metrics alone, we cannot identify any clear echo chambers; the structure looks more like a core-periphery relationship. To go further, we retrieve the articles and the specific user names behind each cluster in order to detect:
* thematic specialization
* temporal editing
* editorial roles
* ...

In [117]:
user_cluster_df = pd.DataFrame({
    "user": g.vs["name"],
    "cluster": g.vs["cluster"]
})

df_clustered = df_graph.merge(
    user_cluster_df,
    on="user",
    how="inner"
)

core_articles = (
    df_clustered[df_clustered["cluster"] == 0]
    .groupby("article")
    .agg(
        n_users=("user", "nunique"),
        total_edits=("n_edits", "sum")
    )
    .reset_index()
    .sort_values(["n_users", "total_edits"], ascending=False)
)

core_articles.head(15)

Unnamed: 0,article,n_users,total_edits
20,History of Ukraine,17,190
5,Bessarabia,9,20
19,History of Christianity in Ukraine,8,95
10,Crimea,6,11
0,2004 Ukrainian presidential election,5,49
6,Bukovina,5,18
8,Catherine the Great,5,11
3,Alexander II of Russia,5,10
14,Eastern Front (World War II),5,8
21,Russian annexation of Crimea,4,18


In [124]:
peripheral_cluster = 4

peripheral_articles = (
    df_clustered[df_clustered["cluster"] == peripheral_cluster]
    .groupby("article")
    .agg(
        n_users=("user", "nunique"),
        total_edits=("n_edits", "sum")
    )
    .reset_index()
    .sort_values(["n_users", "total_edits"], ascending=False)
)

core_article_set = set(core_articles["article"])
peripheral_articles["also_in_core"] = (
    peripheral_articles["article"].isin(core_article_set)
)

peripheral_articles

Unnamed: 0,article,n_users,total_edits,also_in_core
4,History of Ukraine,3,9,True
0,2004 Ukrainian presidential election,3,3,True
3,Football in Ukraine,3,3,False
1,Dissolution of the Soviet Union,1,1,False
2,Economy of Ukraine,1,1,True
