# User interaction analysis (based on direct interactions)

This notebook analyzes direct user interactions, focusing on:
 - Identification of user clusters within r/antiwork using spectral clustering

A direct interaction is defined as an interaction between a parent post/comment and a child comment.
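
To make the definition concrete, here is a minimal sketch of how parent/child reply pairs could be turned into an adjacency matrix like the one loaded below. The `pairs` array, `n_users`, and the symmetrization step are assumptions for illustration; the notebook does not show how `adj_matrix-directs.npz` was actually built.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy data: each pair is (parent author id, child author id),
# i.e. the child comment replies directly to the parent's post/comment.
pairs = np.array([
    [0, 1],  # user 1 replied to user 0
    [0, 1],  # ... and did so a second time
    [1, 2],  # user 2 replied to user 1
    [2, 2],  # a self-reply, removed again further down
])
n_users = 3

rows, cols = pairs[:, 0], pairs[:, 1]
weights = np.ones(len(pairs))

# Duplicate (row, col) entries are summed when converting COO to CSR.
directed = coo_matrix((weights, (rows, cols)), shape=(n_users, n_users)).tocsr()

# Assumption: edges are undirected, so we count interactions in either direction.
toy_adj = (directed + directed.T).tolil()
toy_adj.setdiag(0)  # drop self-interactions, as done for the real matrix
toy_A = toy_adj.toarray()
```

In this toy example users 0 and 1 share an edge of weight 2 and users 1 and 2 an edge of weight 1, while the self-reply leaves no trace.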

In [14]:
from scipy.sparse import load_npz
import numpy as np
from sklearn.cluster import MiniBatchKMeans
import polars as pl
from pathlib import Path
import Clustering
import importlib as imp

path = "../../data/users/"
adj_matrix_path = path + 'adj_matrix-directs.npz'
DB_PATH = path + 'users.sqlite.db'

In [4]:
#load the adjacency matrix
adj_matrix = load_npz(adj_matrix_path).tolil()
adj_matrix.setdiag(0) #set diagonals to zero to remove any "self-interactions"
A = adj_matrix.toarray()

In [28]:
#load users
conn_string = "sqlite://" + str(Path(DB_PATH).absolute())
selected_users = pl.read_sql("SELECT * FROM users WHERE is_selected ORDER BY matrix_id ASC", conn_string)

## Clustering

We run spectral clustering and explore how well it performs with up to 30 clusters. 
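
The internals of `Clustering.spectral_clustering` are not shown in this notebook. As a rough sketch of the general technique, assuming a symmetric normalized Laplacian followed by K-means on its leading eigenvectors, the steps could look like this. The function `spectral_embedding_sketch` and the toy graph are hypothetical and only meant to illustrate the idea.

```python
import numpy as np
from sklearn.cluster import KMeans


def spectral_embedding_sketch(adj, n_vectors):
    """Eigenvectors of the symmetric normalized Laplacian with the smallest eigenvalues."""
    degrees = adj.sum(axis=1)
    # Isolated nodes (degree 0) get a scaling factor of 0 instead of a division by zero.
    inv_sqrt = np.zeros_like(degrees, dtype=float)
    inv_sqrt[degrees > 0] = 1.0 / np.sqrt(degrees[degrees > 0])
    laplacian = np.eye(len(adj)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    return eigvals[:n_vectors], eigvecs[:, :n_vectors]


# Toy graph: two tightly connected triangles joined by one weak edge (2-3).
toy_A = np.zeros((6, 6))
for group in ([0, 1, 2], [3, 4, 5]):
    for i in group:
        for j in group:
            if i != j:
                toy_A[i, j] = 5.0
toy_A[2, 3] = toy_A[3, 2] = 1.0

toy_eigvals, toy_vecs = spectral_embedding_sketch(toy_A, n_vectors=2)
toy_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(toy_vecs)
```

On this toy graph the two triangles end up in separate clusters. A graph without such community structure gives eigenvectors that do not separate users cleanly, which is what weak silhouette scores indicate.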

In [5]:
selected_vecs, results, sizes, eigvalues = Clustering.spectral_clustering(A, max_clusters=30)

In [15]:
Clustering.plot_cluster_diagnostics(results, sizes, eigvalues)

The diagnostics above show no strong cluster structure. Where silhouette scores are high, nearly all users simply belong to a single cluster.

### Clustering based on normalized user interactions

What if the same analysis were instead normalized by total user activity? We adjust the adjacency matrix by dividing by the total activity each user has within the top-user network, so that the edges represent the relative weight of interactions among users (as a percentage) rather than the absolute interaction counts.
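
To see what this normalization does, here is a small worked example on a hypothetical 3-user matrix (`toy_A` is made up for illustration). It also shows a NumPy broadcasting detail worth keeping in mind: dividing by `sum(axis=1)` without `keepdims=True` divides entry (i, j) by user j's total rather than user i's. For a symmetric matrix the two variants are transposes of each other.

```python
import numpy as np

toy_A = np.array([
    [0.0, 3.0, 1.0],
    [3.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
])

# Row-wise normalization: keepdims=True keeps the totals as a (3, 1) column,
# so every row is divided by that user's own total activity.
row_totals = toy_A.sum(axis=1, keepdims=True)
row_share = np.divide(
    toy_A, row_totals, out=np.zeros_like(toy_A), where=row_totals > 0
) * 100

# Without keepdims the totals have shape (3,) and broadcast across columns,
# so entry (i, j) is divided by user j's total instead.
col_share = toy_A / toy_A.sum(axis=1) * 100
```

Here user 0 spends 75% of their interactions on user 1 and 25% on user 2, while users 1 and 2 interact only with user 0. Using `np.divide` with `where=` leaves users without any activity at 0 and avoids the division warning.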

In [24]:
# Scale by per-user totals (as percentages); 0/0 from users without activity becomes 0
norm_A = np.nan_to_num(A / np.sum(A, axis=1), nan=0.0) * 100

In [25]:
selected_vecs, results, sizes, eigvalues = Clustering.spectral_clustering(norm_A, max_clusters=30)

Computing eigenvalues..
Running K-means..


In [13]:
Clustering.plot_cluster_diagnostics(results, sizes, eigvalues)

Since the silhouette scores are somewhat higher here, we pick one of the configurations with the highest score (11 clusters) and examine the clusters it produces.
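
As a rough illustration of this selection step, the sketch below scores several cluster counts with scikit-learn's `silhouette_score` and picks the best one. The data (`toy_vecs`) is a synthetic stand-in for the spectral embedding, and clustering the same vectors for every `k` is a simplification; the real diagnostics come from the `Clustering` module, whose implementation is not shown here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Hypothetical stand-in for the embedding: three well-separated blobs of 30 points.
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy_vecs = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Mean silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_vecs)
    scores[k] = silhouette_score(toy_vecs, labels)

best_k = max(scores, key=scores.get)

# Check cluster sizes as well: a high score can hide one dominant cluster.
best_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(toy_vecs)
cluster_sizes = np.bincount(best_labels)
```

Looking at `cluster_sizes` alongside the score matters for this dataset, since a configuration can score well even when almost all users sit in one cluster.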

In [29]:
optimal_clusters = 11
vecs = selected_vecs[:,:optimal_clusters]
cluster_model = MiniBatchKMeans(n_clusters=optimal_clusters, max_iter=1000, random_state=42).fit(vecs)
selected_users = selected_users.with_columns(pl.Series("cluster", cluster_model.labels_))
imp.reload(Clustering)
Clustering.get_basic_stats(selected_users).to_pandas().round(1)

| | Statistic | column_0 | column_1 | column_2 | column_3 | column_4 | column_5 | column_6 | column_7 | column_8 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cluster | 8.0 | 0.0 | 6.0 | 5.0 | 4.0 | 3.0 | 2.0 | 10.0 | 9.0 |
| 1 | Total number of users | 24.0 | 52.0 | 13.0 | 10.0 | 4.0 | 421.0 | 16.0 | 57.0 | 7208.0 |
| 2 | Median posts per user | 0.5 | 1.0 | 1.0 | 3.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 3 | Median comments per user | 176.0 | 145.5 | 158.0 | 131.0 | 121.5 | 140.0 | 144.5 | 127.0 | 156.0 |
| 4 | Median activity window (days) | 116.0 | 119.5 | 164.0 | 177.5 | 144.5 | 122.0 | 163.5 | 105.0 | 125.0 |
| 5 | Median average post karma | 3.1 | 1.8 | 1.0 | 14.5 | 16.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 6 | Median average comment karma | 16.5 | 11.4 | 10.0 | 7.3 | 12.5 | 9.6 | 6.6 | 8.7 | 11.4 |


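The results are quite similar to the unnormalized analysis: the vast majority of users (7,208) fall into a single cluster. Most of the remaining clusters are very small (fewer than 60 users each) and likely arise from users interacting on a single post thread.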