## Influencers as based on page-rank analysis
In this notebook, we explore if the user interaction network reveals any important players as indicated by page rank algorithm. We also test if connectedness is related to other user attributes via regression analysis.

In [27]:
from scipy.sparse import load_npz
import numpy as np
import polars as pl
import networkx as nx
import statsmodels.api as sm
import statsmodels.formula.api as smf
from pathlib import Path
import sqlite3 as sq

path = "../../data/users/"
adj_matrix_path = path + 'adj_matrix-directs-latest.npz'
DB_PATH = path + 'users.sqlite.db'

In [11]:
#load the adjacency matrix
adj_matrix = load_npz(adj_matrix_path).tolil()
adj_matrix.setdiag(0) #set diagonals to zero to remove any "self-interactions"
A = adj_matrix.toarray()
norm_A = np.nan_to_num(A / np.sum(A, axis=1), 0) * 100 #normalize as connectedness doesn't make sense otherwise

  norm_A = np.nan_to_num(A / np.sum(A, axis=1), 0) * 100 #normalize as connectedness doesn't make sense otherwise


In [12]:
G = nx.from_numpy_matrix(norm_A)
pageranks = nx.pagerank(G, max_iter=100)

In [19]:
#load users
conn_string = "sqlite://" + str(Path(DB_PATH).absolute())
sql = "SELECT user_name, no_posts, no_comments, avg_post_karma, avg_comment_karma, activity_window, longevity FROM users WHERE is_selected ORDER BY matrix_id ASC"
selected_users = pl.read_sql(sql, conn_string)

In [22]:
selected_users['direct_pg_rank'] = np.array(list(pageranks.values()))

In [31]:
#save to the database
with sq.connect(DB_PATH) as conn:
    cur = conn.cursor()        
    try:
        cur.execute("ALTER TABLE users ADD COLUMN direct_pg real")        
    except sq.OperationalError:
        print("columns already exist")
    
    cur.executemany("UPDATE users SET direct_pg = ? WHERE user_name = ?", selected_users[['direct_pg_rank', 'user_name']].rows())

Looking at the top 10 best connected users, we can see that they have really high average post and comment karma scores. Also, majority of them are relatively new, with most of them having participated for less than 100 days in the subreddit.

In [24]:
selected_users.sort(pl.col("direct_pg_rank"), reverse=True).head(10)

user_name,no_posts,no_comments,avg_post_karma,avg_comment_karma,activity_window,longevity,pg_rank,direct_pg_rank
str,i64,i64,f64,f64,f64,f64,f64,f64
"""nryan1985""",73,55,2421.821918,8.036364,192.0,195.208333,0.000976,0.000976
"""Fragrant-Asparagus-2""",2,100,14180.5,143.04,51.0,34.208333,0.000921,0.000921
"""ceanothourus""",1,103,73294.0,187.087379,46.0,102.208333,0.000816,0.000816
"""wexlers""",2,100,5676.0,70.52,65.0,60.208333,0.000643,0.000643
"""Paratrooperkid""",5,122,2637.2,23.221311,92.0,83.208333,0.000592,0.000592
"""joevinci""",4,96,6426.0,27.6875,33.0,52.208333,0.000552,0.000552
"""TruthToPower77""",102,89,2193.735294,11.955056,398.0,363.208333,0.000541,0.000541
"""caligalus""",124,31,769.491935,14.774194,42.0,92.208333,0.000513,0.000513
"""poisonivy47""",89,157,3240.651685,10.178344,107.0,86.208333,0.0005,0.0005
"""jayzee312""",5,131,9435.6,38.656489,285.0,247.208333,0.000507,0.000507


This raises an interesting question - is posting quality associated with how well connected a user is? To some degree, this is expected as more popular posts/comments attract more attention. But given that we are looking just at direct interactions, it is not an obvious relationship.

## Factors associated with connectedness

We find that connectedness is indeed related to content quality. Higher average post and comment karma leads to higher connectedness, even when controlling for total posts and comments made. Also, interestingly enough, longevity on the subreddit does not matter, while users that only have participated during a shorter time window seem to have higher page rank, on average. This may indicate that users with higher connectedness are largely "one-off" wonders who contribute a few items of popular content over a shorter period of time and remain inactive afterwards.

In [33]:
formula = """
np.log(direct_pg_rank) ~ 
np.log(no_posts + 0.001) + np.log(no_comments + 0.001) + 
np.log(avg_post_karma + 0.001) +  np.log(avg_comment_karma + 0.001) + 
np.log(longevity + 0.0001) + np.log(activity_window + 0.001)"""

results = smf.ols(formula, data=selected_users.to_pandas()).fit()
print(results.summary())

                              OLS Regression Results                              
Dep. Variable:     np.log(direct_pg_rank)   R-squared:                       0.355
Model:                                OLS   Adj. R-squared:                  0.354
Method:                     Least Squares   F-statistic:                     695.8
Date:                    Tue, 26 Apr 2022   Prob (F-statistic):               0.00
Time:                            21:59:07   Log-Likelihood:                -720.94
No. Observations:                    7603   AIC:                             1456.
Df Residuals:                        7596   BIC:                             1504.
Df Model:                               6                                         
Covariance Type:                nonrobust                                         
                                        coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------

  result = getattr(ufunc, method)(*inputs, **kwargs)
