# Clustering analysis (based on indirect interactions)

In [12]:
from scipy.sparse import load_npz
import numpy as np
from sklearn.cluster import MiniBatchKMeans
import polars as pl
from pathlib import Path
import Clustering
import importlib as imp
import pandas as pd

from scipy.stats import mannwhitneyu
import itertools as itt
from collections import defaultdict


path = "../../data/users/"
adj_matrix_path = path + 'adj_matrix-indirects.npz'
DB_PATH = path + 'users.sqlite.db'

## Indirect user interactions

Here, we explore indirect user interactions. An indirect interaction occurs when two users are active in the same thread.
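
To make the definition concrete, here is a minimal sketch of how a co-activity matrix of this kind could be built from user/thread activity. The toy `activity` pairs and all variable names are hypothetical, and this is not necessarily how `adj_matrix-indirects.npz` was produced.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy data: (user index, thread index) pairs meaning
# "this user was active in this thread".
activity = [(0, 0), (1, 0), (1, 1), (2, 1), (0, 2), (1, 2), (2, 2)]
n_users, n_threads = 3, 3

rows, cols = zip(*activity)
# Binary user-by-thread incidence matrix.
incidence = csr_matrix(
    (np.ones(len(activity), dtype=np.int64), (rows, cols)),
    shape=(n_users, n_threads),
)

# Entry (i, j) of incidence @ incidence.T counts the threads shared by users i and j.
co_activity = (incidence @ incidence.T).tolil()
co_activity.setdiag(0)  # drop self-interactions, as is done for the loaded matrix
toy_adj = co_activity.toarray()
print(toy_adj)
```

The result is symmetric by construction, which is what spectral methods on undirected graphs typically assume.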

In [4]:
#load the adjacency matrix
adj_matrix = load_npz(adj_matrix_path).tolil()
adj_matrix.setdiag(0) #set diagonals to zero to remove any "self-interactions"
A = adj_matrix.toarray()

In [16]:
#load users
conn_string = "sqlite://" + str(Path(DB_PATH).absolute())
selected_users = pl.read_sql("SELECT * FROM users WHERE is_selected ORDER BY matrix_id ASC", conn_string)

## Clustering

We run spectral clustering and explore how well it performs with up to 30 clusters. 
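
The `Clustering` module is not shown in this notebook. As a rough illustration of the general technique only, the sketch below embeds nodes with eigenvectors of the symmetric normalized Laplacian and scores k-means partitions with the silhouette coefficient. The function names, the choice of Laplacian, and the toy graph are all assumptions, not the module's actual implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def laplacian_eigenvectors(adj, n_vectors):
    """Eigenpairs of the symmetric normalized Laplacian, smallest eigenvalues first."""
    adj = np.asarray(adj, dtype=float)
    degree = adj.sum(axis=1)
    # D^{-1/2}, leaving isolated nodes (degree 0) at zero to avoid dividing by zero.
    inv_sqrt = np.zeros_like(degree)
    inv_sqrt[degree > 0] = 1.0 / np.sqrt(degree[degree > 0])
    laplacian = np.eye(len(adj)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # eigh returns ascending eigenvalues
    return eigvals[:n_vectors], eigvecs[:, :n_vectors]


def scan_cluster_counts(adj, max_clusters, seed=0):
    """Silhouette score of k-means on the first k eigenvectors, for k = 2..max_clusters."""
    eigvals, vecs = laplacian_eigenvectors(adj, max_clusters)
    scores = {}
    for k in range(2, max_clusters + 1):
        embedding = vecs[:, :k]
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embedding)
        scores[k] = silhouette_score(embedding, labels)
    return eigvals, scores


# Toy graph: two groups of five users, dense inside each group and weakly linked across.
rng = np.random.default_rng(0)
toy = np.full((10, 10), 0.01)
toy[:5, :5] = 1.0
toy[5:, 5:] = 1.0
noise = rng.uniform(0.0, 0.05, size=(10, 10))
toy = toy + (noise + noise.T) / 2  # keep the matrix symmetric
np.fill_diagonal(toy, 0.0)

toy_eigvals, toy_scores = scan_cluster_counts(toy, max_clusters=4)
print(toy_eigvals, toy_scores)
```

On a graph with real community structure, the silhouette score is high at the true number of groups and the smallest eigenvalues show a clear gap after it.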

In [6]:
selected_vecs, results, sizes, eigvalues = Clustering.spectral_clustering(A, max_clusters=30)

Computing eigenvalues..
Running K-means..


In [7]:
Clustering.plot_cluster_diagnostics(results, sizes, eigvalues)

Based on the above outputs, there are no strong clusters. Where silhouette scores are high, most users simply fall into a single cluster. We use 12 clusters for the analysis, as this is the largest number of clusters that still has a good silhouette score.
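
The pattern described above, a high silhouette score produced by one dominant cluster, is easy to reproduce on synthetic data. The sketch below uses made-up points (one blob plus two far-away outliers) purely for illustration; none of these values come from the user data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
blob = rng.normal(0.0, 1.0, size=(98, 2))           # one homogeneous group
outliers = np.array([[50.0, 50.0], [50.0, 51.0]])   # two distant points
points = np.vstack([blob, outliers])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
score = silhouette_score(points, labels)

# Share of all points that ended up in the largest cluster.
largest_share = np.bincount(labels).max() / len(labels)
print(f"silhouette={score:.2f}, largest cluster share={largest_share:.0%}")
```

Checking the largest cluster's share next to the silhouette score helps separate genuine structure from a partition that merely isolates a few outliers.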

In [17]:
optimal_clusters = 12
vecs = selected_vecs[:,:optimal_clusters]
cluster_model = MiniBatchKMeans(n_clusters=optimal_clusters, max_iter=1000, random_state=42).fit(vecs)

In [18]:
selected_users['cluster'] = cluster_model.labels_
imp.reload(Clustering)
Clustering.get_basic_stats(selected_users)

| Statistic | Cluster 8 | Cluster 7 | Cluster 4 | Cluster 10 | Cluster 9 | Cluster 1 |
|---|---|---|---|---|---|---|
| Total number of users | 475 | 295 | 1026 | 4995 | 792 | 174 |
| Median posts per user | 1 | 1 | 1 | 1 | 1 | 0 |
| Median comments per user | 224 | 187 | 145.5 | 159 | 132 | 128 |
| Median activity window (days) | 123 | 137 | 119 | 128 | 122 | 107.5 |
| Median average post karma | 6 | 5 | 2 | 3 | 0 | 0 |
| Median average comment karma | 13.170616 | 9.1875 | 11.238073 | 11.369318 | 10.627778 | 11.434482 |


The attributes describing the clusters vary considerably. Are the clusters substantially different, though? We use the Mann-Whitney U test for pairwise comparison of clusters across multiple attributes.
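
As a small illustration of the test itself, the sketch below applies `scipy.stats.mannwhitneyu` to two synthetic, skewed samples. The exponential distributions and sample sizes are arbitrary assumptions chosen to mimic heavy-tailed activity counts; they are not derived from the user data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Two hypothetical groups of per-user comment counts with different typical sizes.
group_a = rng.exponential(scale=100, size=300)
group_b = rng.exponential(scale=200, size=300)

# The test is rank-based, so it does not assume normally distributed values.
res = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(res.statistic, res.pvalue)

# With 1-D inputs the p-value is a scalar; with (n, 1) column arrays it is a length-1 array.
res_2d = mannwhitneyu(group_a[:, None], group_b[:, None])
print(res_2d.pvalue.shape)
```

Because the test compares ranks rather than means, it suits skewed quantities such as post and comment counts.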

In [29]:
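clusters = [1, 4, 7, 8, 9, 10]
# named `variables` to avoid shadowing the built-in vars()
variables = ["avg_post_karma", "no_posts", "no_comments", "avg_comment_karma", "activity_window"]

# Run a Mann-Whitney U test for every pair of clusters on every variable.
results = []
for v in variables:
    for c1, c2 in itt.combinations(clusters, 2):
        # select(v).to_numpy() returns an (n, 1) array, so the p-value comes back as a length-1 array
        vals1 = selected_users.filter(pl.col("cluster") == c1).select(v).to_numpy()
        vals2 = selected_users.filter(pl.col("cluster") == c2).select(v).to_numpy()
        test = mannwhitneyu(vals1, vals2, nan_policy='omit')
        results.append((v, c1, c2, test.pvalue[0]))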

df = pd.DataFrame(results, columns = ['variable', 'clusterX', 'clusterY', 'p-value'])
df['significant'] = df['p-value'] < 0.01

differences = defaultdict(lambda: defaultdict(int))

for t in df.itertuples():
    if t.significant:
        differences[t.variable][t.clusterX] += 1
        differences[t.variable][t.clusterY] += 1


In [30]:
df_diffs = pd.DataFrame(dict(differences))

df_diffs['total_diffs'] = df_diffs.sum(axis=1)
df_diffs.sort_values(by ="total_diffs",ascending = False)

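| cluster | avg_post_karma | no_posts | no_comments | avg_comment_karma | activity_window | total_diffs |
|---|---|---|---|---|---|---|
| 1 | 1 | 3 | 4 | NaN | 4 | 12 |
| 8 | 1 | 2 | 5 | 3 | 1 | 12 |
| 7 | NaN | 2 | 5 | 2 | 1 | 10 |
| 10 | NaN | 2 | 5 | 2 | 1 | 10 |
| 9 | NaN | 3 | 4 | 1 | NaN | 8 |
| 4 | NaN | NaN | 5 | NaN | 1 | 6 |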


Using a significance threshold of `0.01`, we find that the clusters are not substantially different: out of the 25 comparisons each cluster takes part in (5 other clusters × 5 variables), the most different clusters (#1 and #8) differ significantly in only 12. Further, the clusters differ mainly in the number of comments and posts made, and less in average karma scores or activity windows.
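
The figure of 25 comparisons per cluster follows from simple counting, sketched below with the same cluster ids and variable names as in the analysis (the names `cluster_ids` and `variable_names` are used here only for illustration).

```python
import itertools as itt

cluster_ids = [1, 4, 7, 8, 9, 10]
variable_names = ["avg_post_karma", "no_posts", "no_comments",
                  "avg_comment_karma", "activity_window"]

pairs = list(itt.combinations(cluster_ids, 2))
total_tests = len(pairs) * len(variable_names)

# Each cluster appears in one pair per other cluster, for every variable.
tests_per_cluster = sum(1 for pair in pairs if 1 in pair) * len(variable_names)

print(len(pairs), total_tests, tests_per_cluster)  # 15 75 25
```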

## Clustering based on normalized user interactions

What if the same analysis were normalized by total user activity? We adjust the adjacency matrix by normalizing for the total activity each user has within the top-user network, so that edges represent the relative weight of interactions among users rather than absolute interaction counts.
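
One detail worth spelling out is how NumPy broadcasting decides which total each entry is divided by. The sketch below uses a made-up 4-user matrix (including one user with no interactions) and compares explicit row normalization with division by a 1-D vector of totals. Which orientation is intended for the analysis is an assumption left open here.

```python
import numpy as np

# Hypothetical symmetric interaction counts; the last user has no interactions.
toy_A = np.array([
    [0.0, 3.0, 1.0, 0.0],
    [3.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])

# Row normalization: keepdims=True gives shape (4, 1), so row i is divided by user i's total.
row_totals = toy_A.sum(axis=1, keepdims=True)
row_share = np.divide(toy_A, row_totals, out=np.zeros_like(toy_A), where=row_totals > 0) * 100

# Dividing by a 1-D vector of totals broadcasts along columns instead:
# entry (i, j) is divided by user j's total.
with np.errstate(divide="ignore", invalid="ignore"):
    col_share = np.nan_to_num(toy_A / toy_A.sum(axis=1), nan=0.0) * 100

print(row_share)
print(col_share)
```

For a symmetric matrix the two results are transposes of each other, and neither is symmetric in general, which matters for methods that expect an undirected graph.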

In [31]:
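# Express each interaction as a percentage of a user's total interactions.
# np.sum(A, axis=1) has shape (n,), so it broadcasts along columns:
# entry (i, j) is divided by user j's total.
# nan=0.0 replaces the NaNs from 0/0 divisions (users with no interactions);
# the second positional argument of nan_to_num is `copy`, not the fill value.
norm_A = np.nan_to_num(A / np.sum(A, axis=1), nan=0.0) * 100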


In [14]:
selected_vecs, results, sizes, eigvalues = Clustering.spectral_clustering(norm_A, max_clusters=30)

Computing eigenvalues..
Running K-means..


In [15]:
Clustering.plot_cluster_diagnostics(results, sizes, eigvalues)

Normalizing by total activity does not help: the diagnostics show no meaningful clusters at all.