## Influencers based on PageRank analysis
In this notebook, we explore whether the user interaction network reveals any important players, as indicated by the PageRank algorithm. We also test, via regression analysis, whether connectedness is related to other user attributes.

In [1]:
from scipy.sparse import load_npz
import numpy as np
import polars as pl
import networkx as nx
import statsmodels.api as sm
import statsmodels.formula.api as smf

path = "../../../data/users/summaries/combined/"
adj_matrix_path = path + 'adj_matrix-directs-min-100.npz'
user_stats_path = path + 'user_stats.csv'

In [2]:
#load the adjacency matrix
adj_matrix = load_npz(adj_matrix_path).tolil()
adj_matrix.setdiag(0) #set diagonals to zero to remove any "self-interactions"
A = adj_matrix.toarray()[3:,3:] #exclude skip user and the two bots
norm_A = np.nan_to_num(A / np.sum(A, axis=1), nan=0.0) * 100 #normalize; NaNs from 0/0 become 0


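A note on the normalization step: `A / np.sum(A, axis=1)` divides by a 1-D vector, which NumPy broadcasts along the last axis, so column `j` is divided by the sum of row `j`. If the goal is row-wise percentages (an assumption, since the cell only says "normalize"), a minimal sketch could look like the following. The 3-user matrix and the name `row_pct` are hypothetical.

```python
import numpy as np

# Hypothetical interaction counts: row i holds user i's outgoing interactions.
A = np.array([[0., 2., 2.],
              [1., 0., 3.],
              [0., 0., 0.]])  # user 2 has no outgoing interactions

# keepdims=True gives shape (3, 1), so each row is divided by its own sum.
row_sums = A.sum(axis=1, keepdims=True)

# Divide only where the row sum is nonzero; empty rows stay at zero,
# which avoids the 0/0 warning without needing nan_to_num.
row_pct = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums != 0) * 100

print(row_pct.sum(axis=1))  # each non-empty row sums to 100
```

Using `where=` together with `out=` skips the division for empty rows instead of patching NaNs afterwards.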
In [3]:
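# from_numpy_array replaces from_numpy_matrix (removed in NetworkX 3.0);
# with the default create_using, the result is an undirected weighted graph
G = nx.from_numpy_array(norm_A)
pageranks = nx.pagerank(G, max_iter=100)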
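To make the ranking step less of a black box, here is a minimal NumPy sketch of the damped power iteration that PageRank is based on. It is not NetworkX's implementation: the function name `pagerank_power`, the toy star graph, and the uniform treatment of dangling nodes are assumptions for illustration. The damping factor 0.85 matches the default `alpha` of `nx.pagerank`.

```python
import numpy as np

def pagerank_power(W, alpha=0.85, max_iter=100, tol=1e-6):
    """Damped power iteration; W[i, j] is the weight of the edge i -> j."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    out = W.sum(axis=1, keepdims=True)
    # Row-stochastic transition matrix; rows without out-weight stay zero.
    P = np.divide(W, out, out=np.zeros_like(W), where=out != 0)
    dangling = out.ravel() == 0
    r = np.full(n, 1.0 / n)  # start from a uniform distribution
    for _ in range(max_iter):
        # Rank held by dangling nodes is spread uniformly (an assumption).
        r_new = alpha * (r @ P + r[dangling].sum() / n) + (1 - alpha) / n
        if np.abs(r_new - r).sum() < n * tol:
            return r_new
        r = r_new
    return r

# Hypothetical star graph: node 0 interacts with nodes 1, 2 and 3.
W = np.zeros((4, 4))
W[0, 1:] = 1
W[1:, 0] = 1
ranks = pagerank_power(W)
print(ranks.argmax())  # the hub, node 0, receives the largest score
```

The scores sum to one, so each value reads as the share of attention a node receives under this random-walk model.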

In [4]:
selected_users = (
    pl.read_csv(user_stats_path)
    # drop the skip placeholder and the two bots
    .filter(~pl.col("user_name").is_in(["__SKIP__", "AutoModerator", "MAGIC_EYE_BOT"]))
    .with_columns([
        (pl.col("post_karma") / pl.col("no_posts")).alias("avg_post_karma"),
        (pl.col("comment_karma") / pl.col("no_comments")).alias("avg_comment_karma"),
        # days between first and last activity (timestamps are in seconds)
        ((pl.col("last_date") - pl.col("first_date")) / 3600 / 24).alias("activity_window"),
    ])
    .filter(pl.col("total_activity") >= 100)
)

In [5]:
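# Polars DataFrames do not support column assignment by item, so both columns are added via with_columns.
# PageRank values come back in node order, which matches the row order of the filtered users.
selected_users = selected_users.with_columns([
    pl.Series("pg_rank", list(pageranks.values())),
    # days between first activity and the reference timestamp 1641790800 (2022-01-10 05:00 UTC)
    ((1641790800 - pl.col("first_date")) / 3600 / 24).alias("longevity"),
])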
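The longevity arithmetic can be checked by hand with the standard library. The sketch below converts the reference constant from the cell above into a calendar date and recomputes one value. The helper name `days_before_reference` is hypothetical, and the sample `first_date` of 1624914000 is taken from the user statistics.

```python
from datetime import datetime, timezone

REFERENCE_TS = 1641790800      # constant used in the longevity expression
SECONDS_PER_DAY = 3600 * 24

def days_before_reference(first_ts, reference_ts=REFERENCE_TS):
    """Days elapsed between a user's first activity and the reference time."""
    return (reference_ts - first_ts) / SECONDS_PER_DAY

# The reference timestamp as a UTC date.
reference_date = datetime.fromtimestamp(REFERENCE_TS, tz=timezone.utc)
print(reference_date.isoformat())  # 2022-01-10T05:00:00+00:00

# Longevity for a user whose first activity was at 1624914000.
print(round(days_before_reference(1624914000), 6))  # 195.333333
```

A user whose first activity falls after the reference time gets a negative longevity, which matters once the value is passed to a logarithm.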



Looking at the 10 best-connected users, we can see that they have very high average post and comment karma scores. The majority of them are also relatively new: six of the ten have a longevity under 100 days.

In [6]:
selected_users.sort("pg_rank", descending=True).head(10)  # `reverse=` in older Polars releases

| user_name | no_posts | no_comments | post_karma | comment_karma | first_date | last_date | total_activity | avg_post_karma | avg_comment_karma | activity_window | pg_rank | longevity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | i64 | i64 | i64 | i64 | i64 | i64 | i64 | f64 | f64 | f64 | f64 | f64 |
| "nryan1985" | 73 | 55 | 176793 | 442 | 1624914000 | 1641506400 | 128 | 2421.821918 | 8.036364 | 192.041667 | 0.000986 | 195.333333 |
| "Fragrant-Asparagus-2" | 2 | 100 | 28361 | 14304 | 1638835200 | 1643241600 | 102 | 14180.5 | 143.04 | 51.0 | 0.000921 | 34.208333 |
| "ceanothourus" | 1 | 103 | 73294 | 19270 | 1632960000 | 1636927200 | 104 | 73294.0 | 187.087379 | 45.916667 | 0.000817 | 102.208333 |
| "wexlers" | 2 | 100 | 11352 | 7052 | 1636581600 | 1642197600 | 102 | 5676.0 | 70.52 | 65.0 | 0.000639 | 60.291667 |
| "Paratrooperkid" | 5 | 122 | 13186 | 2833 | 1634601600 | 1642550400 | 127 | 2637.2 | 23.221311 | 92.0 | 0.000587 | 83.208333 |
| "joevinci" | 4 | 96 | 25704 | 2658 | 1637272800 | 1640131200 | 100 | 6426.0 | 27.6875 | 33.083333 | 0.000556 | 52.291667 |
| "TruthToPower77" | 102 | 89 | 223761 | 1064 | 1610402400 | 1644796800 | 191 | 2193.735294 | 11.955056 | 398.083333 | 0.000541 | 363.291667 |
| "caligalus" | 124 | 31 | 95417 | 458 | 1633824000 | 1637452800 | 155 | 769.491935 | 14.774194 | 42.0 | 0.000513 | 92.208333 |
| "poisonivy47" | 89 | 157 | 288418 | 1598 | 1634342400 | 1643587200 | 246 | 3240.651685 | 10.178344 | 107.0 | 0.000509 | 86.208333 |
| "jayzee312" | 5 | 131 | 47178 | 5064 | 1620421200 | 1645056000 | 136 | 9435.6 | 38.656489 | 285.125 | 0.000506 | 247.333333 |


This raises an interesting question: is posting quality associated with how well connected a user is? To some degree this is expected, as more popular posts and comments attract more attention. But given that we are looking only at direct interactions, the relationship is not an obvious one.

In [7]:
formula = """
np.log(pg_rank) ~ 
np.log(no_posts + 0.001) + np.log(no_comments + 0.001) + 
np.log(avg_post_karma + 0.001) +  np.log(avg_comment_karma + 0.001) + 
np.log(longevity + 0.0001) + np.log(activity_window + 0.001)"""

results = smf.ols(formula, data=selected_users.to_pandas()).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:        np.log(pg_rank)   R-squared:                       0.386
Model:                            OLS   Adj. R-squared:                  0.385
Method:                 Least Squares   F-statistic:                     434.6
Date:                Tue, 26 Apr 2022   Prob (F-statistic):               0.00
Time:                        17:43:37   Log-Likelihood:                -287.63
No. Observations:                4149   AIC:                             589.3
Df Residuals:                    4142   BIC:                             633.6
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                                        coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------
Interc


We find that connectedness is indeed related to content quality. Higher average post and comment karma is associated with higher PageRank, even when controlling for the total number of posts and comments made. Interestingly, longevity on the subreddit does not matter, while users who participated over a shorter time window tend to have a higher PageRank, on average. This may indicate that highly connected users are largely "one-off" wonders who contribute a few items of popular content over a short period and remain inactive afterwards.
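Because both the outcome and the predictors are log-transformed, each coefficient in the regression above reads as an elasticity: a 1% increase in a predictor is associated with roughly a `coef`% change in PageRank, holding the others fixed. The sketch below illustrates this on synthetic data with plain NumPy least squares. The data, the true elasticity of 0.3, and all variable names are made up for illustration and say nothing about the actual estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictor and outcome following y = 2 * x^0.3 with multiplicative noise.
x = rng.uniform(1, 1000, size=500)
true_elasticity = 0.3
y = 2.0 * x ** true_elasticity * np.exp(rng.normal(0, 0.05, size=500))

# Log-log regression: log(y) = intercept + slope * log(x).
X = np.column_stack([np.ones_like(x), np.log(x)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
intercept, slope = coef

# The slope recovers the elasticity; exp(intercept) recovers the scale factor.
print(round(slope, 2), round(np.exp(intercept), 1))
```

The small constants added inside the logs in the formula (for example `no_posts + 0.001`) keep zero counts finite, at the cost of making results somewhat sensitive to the chosen offset.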