# Bonus: Temporal PageRank

PageRank ranks pages by importance based on how they link to each other. It is defined as the steady state of a random walk, which implies that the underlying network must be fixed and static. In previous sections we worked around this by treating all edges observed over a timespan as one static graph. This was not ideal: short-lived edges had the same influence on the PageRank as long-lived edges, and the time-specific influence of edges on the PageRank was ignored. To address this, Polina Rozenshtein and Aristides Gionis created temporal PageRank, a generalization of PageRank to temporal networks. By highlighting the actual information flow in the network, temporal PageRank captures the network dynamics more accurately. More information about temporal PageRank can be found here: https://users.ics.aalto.fi/gionis/temporal-pagerank.pdf.

In this section we set out to run the temporal PageRank algorithm on the wiki dataset and compare its results with the static PageRank findings from previous sections.

In [44]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import operator      # used to sort PageRank scores in flowPR and below
import networkx as nx
import scipy.stats   # rank and linear correlation measures used in flowPR
import community as cm
import copy
import os.path
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
%matplotlib inline

We start by reformatting our data. Temporal PageRank only needs the time at which each edge first arrived, so we drop the time at which the edge was removed. We store the result in a .txt file so that it can be loaded by the temporal-pagerank code, which we got from: https://github.com/polinapolina/temporal-pagerank.

In [45]:
df = pd.read_table("./data/tgraph_real_wikiedithyperlinks.txt", header = None, sep = " ", names = ["src", "trg", "start", "end"])
df['start'] = pd.to_datetime(df['start'], unit = 's') #convert Unix timestamps to date time, utc = 0
df['start'] = df['start'].astype('str') 
del df['end']
df_ord = df[['start', 'src', 'trg']]
np.savetxt('./data/wike_temporal_pr.txt', df_ord.values, fmt=['%s', '%d', '%d'])
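
To make the resulting file format concrete, here is a minimal sketch using only the standard library. It assumes that each line is written as `YYYY-MM-DD HH:MM:SS src trg`, which is what the pandas conversion above is expected to produce for whole-second timestamps. The helpers `format_edge` and `parse_edge` are hypothetical names used for illustration only. Because the timestamp itself contains a space, each line splits into four items, and the first two have to be joined again before parsing.

```python
from datetime import datetime, timezone

def format_edge(start, src, trg):
    # Hypothetical helper: mimic one output line for a Unix timestamp in seconds (UTC).
    stamp = datetime.fromtimestamp(start, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')
    return '%s %d %d' % (stamp, src, trg)

def parse_edge(line):
    # Hypothetical helper: split on spaces, then rejoin date and time before parsing.
    items = line.strip().split(' ')
    tstamp = datetime.strptime(' '.join(items[0:2]), '%Y-%m-%d %H:%M:%S')
    return tstamp, (items[2], items[3])

line = format_edge(0, 1, 2)   # made-up edge 1 -> 2 arriving at Unix time 0
print(line)
print(parse_edge(line))
```

Note that the node identifiers come back as strings after parsing, not as integers.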

Next we define some utility functions copied from https://github.com/polinapolina/temporal-pagerank:

In [51]:
# From https://github.com/polinapolina/temporal-pagerank/edit/master/allutils/graph_generator.py
def getToy():
    G = nx.DiGraph()
    G.add_edges_from([(1,2,{'weight': 1.0}), (3,2, {'weight': 1.0})])
    # dict(...) works with both the networkx 1 dict and the networkx 2 degree view
    nrm = float(sum(dict(G.degree(weight = 'weight')).values()))
    for i in G.edges(data=True):
        G[i[0]][i[1]]['weight'] = i[-1]['weight']/nrm
    return G

# From https://github.com/polinapolina/temporal-pagerank/edit/master/allutils/graph_generator.py
def getSubgraph(G, N = 1000):
    Gcc = sorted(nx.connected_component_subgraphs(G.to_undirected()), key = len, reverse=True)
    nodes = set()
    i = 0

    while len(nodes) < N:
        s = np.random.choice(Gcc[i].nodes())
        i += 1
        nodes.add(s)
        for edge in nx.bfs_edges(G.to_undirected(), s):
            nodes.add(edge[1])
            if len(nodes) == N:
                break
    return G.subgraph(nodes)

# From https://github.com/polinapolina/temporal-pagerank/edit/master/allutils/graph_generator.py
def getGraph(edgesTS):
    G = nx.DiGraph()
    edges = {}
    for item in edgesTS:
        edge = item[1]
        edges[edge] = edges.get(edge, 0.0) + 1.0
    G.add_edges_from([(k[0],k[1], {'weight': v}) for k,v in edges.items()])
    return G

# From https://github.com/polinapolina/temporal-pagerank/edit/master/allutils/graph_generator.py
# (Modified to suit our needs)
def readRealGraph(filepath):
    edgesTS = []
    nodes = set()
    edges = set()
    lookup = {}
    c = 0
    with open(filepath,'r') as fd:
        for line in fd.readlines():
            line = line.strip()
            items = line.split(' ')
            tstamp = ' '.join(items[0 : 2])
            tstamp = datetime.strptime(tstamp, '%Y-%m-%d %H:%M:%S')
            t = items[2 : 4]
            if t[0] == t[1]:
                continue
            if tuple(t) in lookup.keys():
                num = lookup[tuple(t)]
            else:
                num = c
                lookup[tuple(t)] = c
                c += 1
            edgesTS.append((tstamp, tuple(t), num ))
            nodes.add(t[0])
            nodes.add(t[1])
            edges.add(tuple([t[0],t[1]]))
    # the with-block has already closed the file, so no explicit close is needed
    return edgesTS, nodes, edges

# From https://github.com/polinapolina/temporal-pagerank/edit/master/allutils/graph_generator.py
# (Modified to suit our needs)
def weighted_DiGraph(n, path, seed = 1.0, weights='random'):
    edgesTS, _, _ = readRealGraph(path)
    G = getGraph(edgesTS)
    G = nx.DiGraph(G)
    G.remove_edges_from(G.selfloop_edges())
    G = getSubgraph(G, n)
    for i in G.nodes():
        if G.out_degree(i) == 0:
            for j in G.nodes():
                if i != j:
                    G.add_edge(i, j, weight=1.0)
    if weights == 'random':
        w = np.random.uniform(1e-5, 1.0, G.number_of_edges())
        w /= sum(w)
        c = 0
        for i in G.edges():
            G[i[0]][i[1]]['weight'] = w[c]
            c += 1
    elif weights == 'uniform':
        w = 1.0/G.number_of_edges()
        for i in G.edges():
            G[i[0]][i[1]]['weight'] = w
    else:
        # dict(...) works with both the networkx 1 dict and the networkx 2 degree view
        nrm = float(sum(dict(G.out_degree(weight = 'weight')).values()))
        for i in G.edges(data=True):
            G[i[0]][i[1]]['weight'] = i[-1]['weight']/nrm
    return G

# From https://github.com/polinapolina/temporal-pagerank/edit/master/allutils/graph_generator.py
def change_weights(G):
    #w = np.random.uniform(1e-5, 1.0, G.number_of_edges())
    w = np.random.uniform(0.0, 1.0, G.number_of_edges())
    w /= sum(w)
    c = 0
    for i in G.edges():
        G[i[0]][i[1]]['weight'] = w[c]
        c += 1
    return G
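
As an illustration of what `getGraph` does with the time-stamped edge list, the following sketch reproduces the counting step without networkx. The three edges and their placeholder timestamps are made up, and the tuples follow the `(timestamp, edge, edge_id)` layout returned by `readRealGraph`. The final normalization mirrors the default branch of `weighted_DiGraph`, where every weight is divided by the total weight.

```python
from collections import Counter

# Made-up stream: the edge (a, b) occurs twice, (b, c) once.
edgesTS = [
    ("t1", ("a", "b"), 0),
    ("t2", ("b", "c"), 1),
    ("t3", ("a", "b"), 0),
]

# Count how often each edge occurs, ignoring when it occurred.
weights = Counter(edge for _, edge, _ in edgesTS)

# Normalize so that all edge weights sum to 1.
total = sum(weights.values())
normalized = {edge: count / total for edge, count in weights.items()}
print(normalized)
```

The ordering of the edges is lost in this step, which is exactly the information that the temporal variant keeps.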

The temporal PageRank algorithm (`flowPR`), copied from https://github.com/polinapolina/temporal-pagerank:

In [52]:
def flowPR(p_prime_nodes, ref_pr, stream, RS, current, iters = 1000000, alpha = 0.85, beta=0.001, gamma=0.9999, normalization = 1.0, padding = 0):
    if beta == 1.0:
        beta = 0.0
        
    tau = []
    pearson = []
    spearman = []
    error = []
    x = []
    i = 0

    rank_order = [key for (key, value) in sorted(ref_pr.items(), key=operator.itemgetter(1), reverse=True)]
    ordered_pr = np.array([ref_pr[k] for k in rank_order])

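    for e in stream:
        i += 1

        # e = (source, target). RS accumulates the temporal PageRank counts (decayed by gamma);
        # current holds the mass that is still waiting to be forwarded along later edges.
        RS[e[0]] = RS.get(e[0], 0.0) * gamma + 1.0 * (1.0 - alpha) * p_prime_nodes[e[0]] * normalization
        RS[e[1]] = RS.get(e[1], 0.0) * gamma + (current.get(e[0], 0.0) + 1.0 * (1.0 - alpha) * p_prime_nodes[e[0]]) * alpha * normalization
        # Forward a (1 - beta) share of the source's mass to the target; a beta share stays at the source.
        current[e[1]] = current.get(e[1], 0.0) + (current.get(e[0], 0.0) + 1.0 * (1.0 - alpha)* p_prime_nodes[e[0]]) * alpha * (1 - beta)
        current[e[0]] = current.get(e[0], 0.0) * beta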


        if (i % 100 == 0 or i == len(stream)) and len(RS) == len(ordered_pr):
            sorted_RS4 = np.array([RS[k] / sum(RS.values()) for k in rank_order])
            tau.append(scipy.stats.kendalltau(sorted_RS4, ordered_pr)[0])
            pearson.append(scipy.stats.pearsonr(sorted_RS4, ordered_pr)[0])
            spearman.append(scipy.stats.spearmanr(sorted_RS4, ordered_pr)[0])
            error.append(np.linalg.norm(sorted_RS4 - ordered_pr))
            x.append(i+padding)

    sorted_RS4 = np.array([RS[k] / sum(RS.values()) for k in rank_order])

    return RS, current, tau, spearman, pearson, error, x
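
To see the update rule in isolation, here is a stripped-down sketch modeled on the loop in `flowPR`. It is a simplification, not the authors' implementation: it assumes `gamma = 1` (no decay), `normalization = 1` and `p_prime_nodes[u] = 1` for every node, and it leaves out all the correlation bookkeeping. The function name `temporal_pagerank_sketch` and the three-edge stream are made up for illustration.

```python
def temporal_pagerank_sketch(stream, alpha=0.85, beta=0.0):
    # Simplified streaming update modeled on flowPR (gamma = 1, p' = 1 for all nodes).
    rank = {}     # accumulated score per node (RS in flowPR)
    active = {}   # mass still waiting to be forwarded (current in flowPR)
    for u, v in stream:
        restart = 1.0 - alpha
        flow = active.get(u, 0.0) + restart          # mass leaving u along this edge
        rank[u] = rank.get(u, 0.0) + restart
        rank[v] = rank.get(v, 0.0) + flow * alpha
        active[v] = active.get(v, 0.0) + flow * alpha * (1.0 - beta)
        active[u] = active.get(u, 0.0) * beta        # a beta share stays at u
    total = sum(rank.values())
    return {node: score / total for node, score in rank.items()}

stream = [("a", "b"), ("b", "c"), ("a", "b")]   # edges in order of arrival
scores = temporal_pagerank_sketch(stream)
print(scores)
```

In this toy stream node `b` ends up with the highest score, since it receives two edges and also forwards mass to `c` in between. Reordering the stream changes the scores, which a static PageRank on the aggregated graph cannot express.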

We calculate the PageRank with the following code, adapted from https://github.com/polinapolina/temporal-pagerank:

In [53]:
p = 0.1

iters = 100000
alpha = 0.85

beta = 0.0
gamma = 1.0

weights = 'random'

G = weighted_DiGraph(n = 100, path = './data/wike_temporal_pr.txt', seed = 1.0, weights = weights)
#nodes = G.nodes()

# basic
# weighted out-degree per node; dict(...) works for networkx 1 and 2
out_strength = dict(G.out_degree(weight='weight'))
norm = sum(out_strength.values())
sampling_edges = {e[:-1]: e[-1]['weight']/norm for e in G.edges(data=True)}
personalization = {k: v / norm for k, v in out_strength.items()}
p_prime_nodes = {i: personalization[i]/out_strength[i] for i in G.nodes()}
pr = nx.pagerank(G, alpha=alpha, personalization=personalization, weight='weight')

rank_order = [key for (key, value) in sorted(pr.items(), key=operator.itemgetter(1), reverse=True)]
ordered_pr = np.array([pr[k] for k in rank_order])

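# sample a stream of edges with probability proportional to their weight
edge_list = list(sampling_edges.keys())
edge_probs = list(sampling_edges.values())
stream = [edge_list[i] for i in np.random.choice(len(edge_list), size=iters, p=edge_probs)]

# flowPR expects the running scores (RS) and the active mass (current); both start empty
RS, current, tau, spearman, pearson, error, x = flowPR(p_prime_nodes, pr, stream, {}, {}, iters = iters,
                                                       beta = beta, gamma = gamma)

# Kendall's tau between temporal and static PageRank against the number of processed edges
plt.plot(x, tau)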

NetworkXError: SubGraph Views are readonly. Mutations not allowed

Unfortunately we could not get the code to work. It was written for Python 2 and networkx 1. We tried converting it to Python 3 and networkx 2.0, but this failed because, going from networkx 1 to networkx 2, the subgraph class has become read-only. The algorithm uses this subgraph to make changes to the graph. We have not found a networkx 2 alternative for this subgraph class. One solution would be to run the code in a Python 2 environment with networkx 1, but we did not have the time to try this.
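
For reference, the following minimal sketch shows the behaviour behind the error on a made-up three-node graph, together with one candidate workaround. It assumes networkx 2.x or later, where `G.subgraph(...)` returns a read-only view and calling `.copy()` on that view returns an independent, mutable graph. This has not been run against the full notebook, so it is only a possible starting point, not a verified fix.

```python
import networkx as nx

# Made-up toy graph, for illustration only.
G = nx.DiGraph()
G.add_edges_from([(1, 2), (2, 3), (3, 1)])

view = G.subgraph([1, 2])            # read-only view in networkx 2.x
try:
    view.add_edge(2, 1)
    view_is_mutable = True
except nx.NetworkXError:
    view_is_mutable = False
print("view is mutable:", view_is_mutable)

sub = G.subgraph([1, 2]).copy()      # independent graph that can be modified
sub.add_edge(2, 1, weight=1.0)
print(sorted(sub.edges()))
```

Edges added to the copy do not appear in the original graph. In `weighted_DiGraph` the subgraph replaces `G` and is the object that gets returned, so a copy may be sufficient there, but other networkx 1 idioms in the utility functions could still need attention.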