## Influencers based on PageRank analysis
In this notebook, we explore whether the user interaction network reveals any important players, as indicated by the PageRank algorithm. We also test, via regression analysis, whether connectedness is related to other user attributes.

In [27]:
from scipy.sparse import load_npz
import numpy as np
import polars as pl
import networkx as nx
import statsmodels.api as sm
import statsmodels.formula.api as smf
from pathlib import Path
import sqlite3 as sq

path = "../../data/users/"
adj_matrix_path = path + 'adj_matrix-indirects-latest.npz'
DB_PATH = path + 'users.sqlite.db'

In [11]:
#load the adjacency matrix
adj_matrix = load_npz(adj_matrix_path).tolil()
adj_matrix.setdiag(0) #set diagonals to zero to remove any "self-interactions"
A = adj_matrix.toarray()
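# normalize edge weights, as raw interaction counts are not comparable across users
# note: np.sum(A, axis=1) has shape (n,), so broadcasting divides column j by the sum of row j
with np.errstate(divide="ignore", invalid="ignore"):  # users without interactions produce 0/0
    norm_A = np.nan_to_num(A / np.sum(A, axis=1), nan=0.0) * 100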
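
The normalization step depends on how NumPy broadcasts the vector of row sums. The sketch below uses a hypothetical 3-user matrix (values invented for illustration) to contrast dividing by a shape `(n,)` vector with dividing by a shape `(n, 1)` vector. Which of the two the analysis intends is not stated in this notebook, so this is only a way to check the behavior, not a statement of the author's method.

```python
import numpy as np

# Toy 3-user interaction matrix (hypothetical values), zero diagonal.
A = np.array([
    [0.0, 2.0, 2.0],
    [1.0, 0.0, 3.0],
    [0.0, 0.0, 0.0],  # user 2 has an all-zero row
])

row_sums = A.sum(axis=1)  # shape (3,)

with np.errstate(divide="ignore", invalid="ignore"):
    # Shape (3,) aligns with the last axis: column j is divided by row_sums[j].
    by_column = A / row_sums
    # Shape (3, 1) aligns with the first axis: row i is divided by row_sums[i].
    by_row = A / A.sum(axis=1, keepdims=True)

# The all-zero row yields 0/0 = nan; replace those entries with 0.
by_row = np.nan_to_num(by_row, nan=0.0)
```

For a symmetric matrix the shape `(n,)` division amounts to column normalization, because row and column sums coincide. For an asymmetric matrix, a zero row sum paired with a nonzero column produces `inf`, which `np.nan_to_num` turns into a very large finite number rather than 0.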

In [12]:
G = nx.from_numpy_array(norm_A)  # from_numpy_matrix was removed in networkx 3.0
pageranks = nx.pagerank(G, max_iter=100)
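
To make the ranking step more concrete, here is a minimal sketch of PageRank as a power iteration, written in plain NumPy. It is a simplified illustration, not the implementation inside `nx.pagerank`: it assumes every node has at least one edge (no dangling-node handling), and the function name `pagerank_power` and the star-shaped toy matrix are hypothetical.

```python
import numpy as np

def pagerank_power(W, alpha=0.85, tol=1e-10, max_iter=100):
    """Simplified PageRank by power iteration.

    Assumes every row of W has a positive sum (no dangling nodes).
    """
    n = W.shape[0]
    P = W / W.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    x = np.full(n, 1.0 / n)               # start from the uniform distribution
    for _ in range(max_iter):
        # Follow an edge with probability alpha, otherwise jump uniformly.
        x_next = alpha * (x @ P) + (1.0 - alpha) / n
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next
    return x

# Toy star network: user 0 interacts with everyone else.
W = np.array([
    [0.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
])

scores = pagerank_power(W)
most_central = int(np.argmax(scores))
```

In this toy network the hub (user 0) receives the highest score and the three peripheral users share the same lower score, which is the sense in which PageRank is used as a proxy for connectedness here.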

In [19]:
#load users
conn_string = "sqlite://" + str(Path(DB_PATH).absolute())
sql = "SELECT user_name, no_posts, no_comments, avg_post_karma, avg_comment_karma, activity_window, longevity FROM users WHERE is_selected ORDER BY matrix_id ASC"
selected_users = pl.read_sql(sql, conn_string)

In [22]:
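# pageranks is keyed by node index 0..n-1, which matches the matrix_id ordering of selected_users
selected_users = selected_users.with_columns([pl.Series('indirect_pg_rank', list(pageranks.values()))])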

In [31]:
#save to the database
with sq.connect(DB_PATH) as conn:
    cur = conn.cursor()        
    try:
        cur.execute("ALTER TABLE users ADD COLUMN indirect_pg real")        
    except sq.OperationalError:
        print("columns already exist")
    
    cur.executemany("UPDATE users SET indirect_pg = ? WHERE user_name = ?", selected_users[['indirect_pg_rank', 'user_name']].rows())
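
As an alternative to catching `OperationalError`, the column check can be made explicit with `PRAGMA table_info`. The sketch below runs against an in-memory database with a hypothetical two-row `users` table; the helper name `add_column_if_missing` is an assumption introduced for illustration and is not part of this project.

```python
import sqlite3 as sq

def add_column_if_missing(conn, table, column, sql_type):
    """Add `column` to `table` unless PRAGMA table_info already lists it."""
    # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {sql_type}")
    return True

# In-memory stand-in for users.sqlite.db (hypothetical rows).
conn = sq.connect(":memory:")
conn.execute("CREATE TABLE users (user_name TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

added_first = add_column_if_missing(conn, "users", "indirect_pg", "real")
added_again = add_column_if_missing(conn, "users", "indirect_pg", "real")

# Same parameter order as the UPDATE above: (value, user_name).
rows = [(0.7, "alice"), (0.3, "bob")]
conn.executemany("UPDATE users SET indirect_pg = ? WHERE user_name = ?", rows)
conn.commit()

stored = dict(conn.execute("SELECT user_name, indirect_pg FROM users"))
conn.close()
```

Note that using a `sqlite3` connection as a context manager commits or rolls back the transaction but does not close the connection, so an explicit `close()` is still needed when the connection should be released.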

## Factors associated with connectedness

We find that connectedness is indeed related to content quality. Higher average post and comment karma is associated with higher connectedness, even when controlling for the total number of posts and comments made. Interestingly, longevity on the subreddit does not matter, while users who participated only during a shorter time window tend to have a higher PageRank, on average. This may indicate that highly connected users are largely "one-off wonders" who contribute a few items of popular content over a short period of time and remain inactive afterwards.

In [32]:
formula = """
np.log(indirect_pg_rank) ~ 
np.log(no_posts + 0.001) + np.log(no_comments + 0.001) + 
np.log(avg_post_karma + 0.001) +  np.log(avg_comment_karma + 0.001) + 
np.log(longevity + 0.0001) + np.log(activity_window + 0.001)"""

results = smf.ols(formula, data=selected_users.to_pandas()).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:        np.log(pg_rank)   R-squared:                       0.355
Model:                            OLS   Adj. R-squared:                  0.354
Method:                 Least Squares   F-statistic:                     695.8
Date:                Tue, 26 Apr 2022   Prob (F-statistic):               0.00
Time:                        21:56:25   Log-Likelihood:                -720.94
No. Observations:                7603   AIC:                             1456.
Df Residuals:                    7596   BIC:                             1504.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                                        coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------
Interc

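
Because both sides of the regression formula are logged, each coefficient can be read approximately as an elasticity: the percentage change in PageRank associated with a one percent change in the predictor (only approximately here, given the small offsets added before taking logs). The sketch below illustrates that reading on simulated data. The variable names, the assumed elasticity of 0.3, and the use of `np.linalg.lstsq` in place of `statsmodels` are all assumptions made for the illustration; none of the numbers come from the actual user data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Simulated predictor and outcome (hypothetical, not the real users table).
karma = rng.lognormal(mean=2.0, sigma=1.0, size=n)
true_elasticity = 0.3  # assumed value used to generate the data
noise = rng.normal(scale=0.1, size=n)
rank = np.exp(-8.0 + true_elasticity * np.log(karma) + noise)

# Ordinary least squares of log(rank) on an intercept and log(karma).
X = np.column_stack([np.ones(n), np.log(karma)])
coef, *_ = np.linalg.lstsq(X, np.log(rank), rcond=None)
intercept, slope = coef

# A 10% increase in karma multiplies the predicted rank by 1.10 ** slope.
effect_of_10pct = 1.10 ** slope - 1.0
```

The fitted `slope` recovers the simulated elasticity up to sampling noise, and `effect_of_10pct` converts it into the relative change in the outcome for a 10% increase in the predictor.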
