### Task 1: Static Network Analysis
* Degree (in-degree/out-degree)
* Diameter
* (Mean shortest path length)
* (Top-k eigenvalues)
* (Betweenness centrality)
* (Closeness centrality)</i>

A temporal network is going to be analysed which consists of 678907 vertices and 4729035 edges, where each edge has time information associated with it. Some of the edges have the same source and target vertex but are associated with different timestamps. This static network analysis, however, ignores the time information, and thus we use a graph built solely on the pairs of source and target vertices.

### <font color="darkgreen">Imports, configuration and preprocessing</font>

In [1]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
import math

In [49]:
import igraph

In [56]:
wiki = open("data/tgraph_real_wikiedithyperlinks_noTime.txt", 'rb')
g_new = igraph.read(wiki)
wiki.close()

In [68]:
g_new.successors(10)

[5,
 10,
 10,
 12,
 14,
 15,
 17,
 20,
 21,
 24,
 31,
 34,
 47,
 48,
 49,
 51,
 84,
 97,
 114,
 118,
 146,
 146,
 149,
 149,
 200,
 216,
 220,
 249,
 363,
 394,
 396,
 696,
 701,
 940,
 958,
 966,
 1173,
 1264,
 1620,
 1765,
 1769,
 1833,
 2005,
 2051,
 2053,
 2446,
 2479,
 2555,
 2728,
 2764,
 3404,
 3570,
 3722,
 4054,
 4281,
 4466,
 4475,
 4479,
 4491,
 4525,
 4644,
 4724,
 4801,
 5017,
 5020,
 5302,
 5304,
 5342,
 5446,
 5951,
 6237,
 6244,
 6446,
 6563,
 7510,
 7516,
 7540,
 7541,
 7551,
 9145,
 9731,
 15902,
 16500,
 17936,
 19074,
 24001,
 24700,
 26251,
 26254,
 29952,
 33705,
 34183,
 34183,
 34533,
 34936,
 36342,
 36360,
 37559,
 39348,
 42180,
 42584,
 47191,
 54610,
 57237,
 57318,
 65546,
 70098,
 78761,
 79861,
 80178,
 80410,
 82581,
 87426,
 87976,
 88597,
 90674,
 90686,
 90737,
 95371,
 97610,
 97614,
 97628,
 97631,
 97632,
 97632,
 97633,
 97640,
 97641,
 97647,
 97657,
 97663,
 97667,
 97671,
 97672,
 97687,
 97693,
 97694,
 97696,
 97698]

In [76]:
succ = list(nx.dfs_successors(G, '7'))

In [78]:
len(succ)

115601

#### Directed Graph with parallel edges (due to different timestamps of the edges between pairs of vertices)

In [None]:
wiki = open("data/tgraph_real_wikiedithyperlinks_noTime.txt", 'rb')
G_par = nx.read_edgelist(wiki, create_using=nx.MultiDiGraph())
wiki.close()

#### Directed Graph used for static network analysis

In [2]:
wiki = open("data/tgraph_real_wikiedithyperlinks_noTime.txt", 'rb')
G = nx.read_edgelist(wiki, create_using=nx.DiGraph())
wiki.close()

In [4]:
def create_dataframe(G_in):
    df_edge_in = pd.DataFrame(list(G_in.in_degree()), columns=['node', 'in edges'])
    df_edge_out = pd.DataFrame(list(G_in.out_degree()), columns=['node', 'out edges'])
    df_total = pd.merge(df_edge_in, df_edge_out, on='node')
    df_total.index = df_total.index + 1
    return df_total

In [None]:
wiki_df_par = create_dataframe(G_par)

In [46]:
wiki_df.loc[wiki_df.node == '7']

Unnamed: 0,node,in edges,out edges
9,7,580,76


In [42]:
G.has_node('7')

True

In [5]:
wiki_df = create_dataframe(G)
wiki_df.head()

Unnamed: 0,node,in edges,out edges
1,1,28,88
2,6,10,0
3,8,62,6
4,9,1375,477
5,3,1068,451


In [8]:
nx.dfs_successors(G, '6')

{}

In [17]:
values = list(wiki_df.loc[wiki_df['out edges'] == 0].node.values)
values = list(map(int, values))
plt.hist(values)  # hist is not a builtin; use the matplotlib.pyplot import

### <font color="darkgreen">1. Static Network Analysis:</font> Degree distribution

The first property of the static network to be analyzed is the degree distribution. While it is an easily comprehensible network measure, it can still lead to interesting findings, as we have seen in the previous assignment.

In [None]:
wiki_df['in edges'].mean() == wiki_df['out edges'].mean()

This is expected, of course, since every edge contributes exactly one outgoing and one incoming end. The check is only done to verify that the conversion of the Graph object to a dataframe went through without errors.
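The same sanity check can be sketched on a toy graph, where the totals are easy to verify by hand. The graph `G_toy` and its edges are made up for illustration and are not part of the wiki dataset.

```python
import networkx as nx

# Hypothetical toy graph, including one self-loop (c -> c).
G_toy = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "c")])

# Every edge adds one to the out-degree of its source
# and one to the in-degree of its target.
total_in = sum(degree for _, degree in G_toy.in_degree())
total_out = sum(degree for _, degree in G_toy.out_degree())

print(total_in, total_out, G_toy.number_of_edges())
```

Since both totals equal the number of edges, dividing by the same number of nodes must give identical means.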

In [None]:
wiki_df['in edges'].mean()

In [None]:
wiki_df['in edges'].max(), wiki_df['in edges'].min()

In [None]:
wiki_df['out edges'].max(), wiki_df['out edges'].min()

The mean number of incoming edges is just above 5 links. The node with the highest number of incoming edges has a little more than $10^4$ incoming edges, while the node with the highest number of outgoing edges has around $4.3 \times 10^3$ outgoing edges. These values do not imply anything by themselves, but they will be used in the elaborations further on.

Before we continue, note that for the static network analysis it does not make sense to consider parallel edges. The parallel edges are edges between the same pair of vertices at different timestamps, and they give a deceptive view in the static analysis. If a page has $y$ incoming edges from the same source vertex $x$ at different timestamps, it does not make sense to count all of these parallel edges for measures such as the degree distribution, PageRank etc. The choice is, therefore, made to continue with the network as a directed graph ($DiGraph$ in networkx) instead of a directed multigraph ($MultiDiGraph$) for the static network analysis. These edges do, however, have to be considered in part two, the Temporal Network Analysis, where the time information of the edges is of the essence and techniques such as snapshot-based analysis can, or even should, be exploited.

Just to give an impression of how much the measures can differ between these two types of networks:

In [None]:
wiki_df_par['in edges'].mean(), wiki_df_par['in edges'].max(), wiki_df_par['out edges'].max()

The multigraph counts every parallel edge between a pair of vertices separately, which results in a higher number of incoming and/or outgoing edges for many vertices.
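A minimal sketch of this effect on a made-up edge list (the node names are hypothetical): the link from `a` to `b` occurs three times, as it would if it carried three different timestamps.

```python
import networkx as nx

# Hypothetical edge list: a -> b appears three times (three timestamps).
edges = [("a", "b"), ("a", "b"), ("a", "b"), ("c", "b")]

G_multi = nx.MultiDiGraph(edges)  # keeps every parallel edge
G_simple = nx.DiGraph(edges)      # collapses parallel edges into one

multi_in = G_multi.in_degree("b")    # counts each parallel edge
simple_in = G_simple.in_degree("b")  # counts distinct source pages

print(multi_in, simple_in)
```

In the multigraph the in-degree of `b` reflects how often it was linked over time, whereas in the simple directed graph it reflects how many distinct pages link to it.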

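Let's continue with the analysis of the degree distribution now that we have this peculiarity out of the way. We are going to divide the nodes of the network into ten bins based on their degree to get a better grasp of the characteristics of the top and bottom nodes in terms of the number of incoming or outgoing edges. Note that $pd.cut$ produces ten equal-width bins over the degree range rather than true deciles, so the bins can contain very different numbers of nodes.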

In [None]:
wiki_df['decile_incoming_edges'] = pd.cut((wiki_df['in edges']), 10, labels=False)
# between() is inclusive on both ends: compare bins 1-9 against bin 0 only
wiki_df.loc[wiki_df.decile_incoming_edges.between(1, 9)].shape, wiki_df.loc[wiki_df.decile_incoming_edges.between(0, 0)].shape

In [None]:
wiki_df['decile_outgoing_edges'] = pd.cut((wiki_df['out edges']), 10, labels=False)
# between() is inclusive on both ends: compare bins 1-9 against bin 0 only
wiki_df.loc[wiki_df.decile_outgoing_edges.between(1, 9)].shape, wiki_df.loc[wiki_df.decile_outgoing_edges.between(0, 0)].shape

We see that there is a very skewed distribution among the nodes when we split them based on their number of incoming or outgoing edges. There is thus a small set of nodes (i.e. pages) with a relatively high number of incoming edges and a small set of nodes with a relatively high number of outgoing edges.
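To illustrate why equal-width bins behave this way on skewed data, here is a small sketch on made-up degree values (not taken from the wiki graph). It compares $pd.cut$, which splits the value range into equal-width intervals, with $pd.qcut$ applied to ranks, which yields groups of roughly equal size.

```python
import pandas as pd

# Hypothetical, heavily skewed degree values: one hub and many small nodes.
degrees = pd.Series([0, 0, 1, 1, 2, 2, 3, 4, 5, 100])

# Equal-width bins over the range [0, 100]: almost everything lands in bin 0.
equal_width = pd.cut(degrees, 4, labels=False)

# Equal-count bins: ranking first breaks ties so the bin edges stay unique.
equal_count = pd.qcut(degrees.rank(method="first"), 4, labels=False)

width_counts = equal_width.value_counts().sort_index().tolist()
count_counts = equal_count.value_counts().sort_index().tolist()

print(width_counts, count_counts)
```

With equal-width bins the single hub occupies a bin of its own while the remaining nodes share the lowest bin. That is the same pattern as in the shapes computed above.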

Let's instead divide the nodes into four groups that each contain the same number of vertices after sorting by degree (not exactly the same as quartiles), rather than using the equal-width bins above. This facilitates a better comparison of the groups given the very skewed distribution.

In [None]:
sorted_incoming_edges = wiki_df.sort_values(['in edges']).reset_index(drop = True)
sorted_outgoing_edges = wiki_df.sort_values(['out edges']).reset_index(drop = True)
sorted_incoming_edges.tail()

In [None]:
incoming_edges_q1 = sorted_incoming_edges.iloc[: math.floor(sorted_incoming_edges.shape[0] / 4)]
incoming_edges_q2 = sorted_incoming_edges.iloc[math.floor(sorted_incoming_edges.shape[0] / 4) : 2 * (math.floor(sorted_incoming_edges.shape[0] / 4))]
incoming_edges_q3 = sorted_incoming_edges.iloc[2 * (math.floor(sorted_incoming_edges.shape[0] / 4)) : 3 * (math.floor(sorted_incoming_edges.shape[0] / 4))]
incoming_edges_q4 = sorted_incoming_edges.iloc[3 * (math.floor(sorted_incoming_edges.shape[0] / 4)) :]

incoming_edges_bottom_1 = sorted_incoming_edges.iloc[: math.floor(sorted_incoming_edges.shape[0] / 100)]
incoming_edges_top_1 = sorted_incoming_edges.iloc[99 * math.floor(sorted_incoming_edges.shape[0] / 100):]

In [None]:
outgoing_edges_q1 = sorted_outgoing_edges.iloc[: math.floor(sorted_outgoing_edges.shape[0] / 4)]
outgoing_edges_q2 = sorted_outgoing_edges.iloc[math.floor(sorted_outgoing_edges.shape[0] / 4) : 2 * (math.floor(sorted_outgoing_edges.shape[0] / 4))]
outgoing_edges_q3 = sorted_outgoing_edges.iloc[2 * (math.floor(sorted_outgoing_edges.shape[0] / 4)) : 3 * (math.floor(sorted_outgoing_edges.shape[0] / 4))]
outgoing_edges_q4 = sorted_outgoing_edges.iloc[3 * (math.floor(sorted_outgoing_edges.shape[0] / 4)) :]

outgoing_edges_bottom_1 = sorted_outgoing_edges.iloc[: math.floor(sorted_outgoing_edges.shape[0] / 100)]
outgoing_edges_top_1 = sorted_outgoing_edges.iloc[99 * math.floor(sorted_outgoing_edges.shape[0] / 100):]

And just to double-check

In [None]:
(incoming_edges_q4.shape[0] + incoming_edges_q3.shape[0] + 
 incoming_edges_q2.shape[0] + incoming_edges_q1.shape[0]) == sorted_incoming_edges.shape[0]
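The repeated slicing in the cells above could also be factored into a small helper. The following is only a sketch: the function name `split_into_groups` and the toy dataframe are hypothetical, and the last group absorbs the remainder, just as `q4` does above.

```python
import pandas as pd

def split_into_groups(df, column, n_groups=4):
    """Sort df by column and cut it into n_groups consecutive slices."""
    ordered = df.sort_values(column).reset_index(drop=True)
    size = len(ordered) // n_groups
    # The first n_groups - 1 slices have exactly `size` rows each.
    groups = [ordered.iloc[i * size:(i + 1) * size] for i in range(n_groups - 1)]
    # The last slice takes whatever is left over.
    groups.append(ordered.iloc[(n_groups - 1) * size:])
    return groups

# Hypothetical toy data with ten nodes.
toy_df = pd.DataFrame({"node": range(10),
                       "in edges": [5, 0, 3, 0, 1, 9, 2, 0, 7, 4]})
toy_groups = split_into_groups(toy_df, "in edges")
group_sizes = [len(group) for group in toy_groups]
print(group_sizes)
```

The group sizes always add up to the number of rows, which is the same property the double-check above verifies.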

Let's see whether there are any notable things in the devised groups.

In [None]:
incoming_edges_q1['in edges'].mean(), incoming_edges_q2['in edges'].mean(), incoming_edges_q3['in edges'].mean(), incoming_edges_q4['in edges'].mean()

In [None]:
incoming_edges_q1['out edges'].mean(), incoming_edges_q2['out edges'].mean(), incoming_edges_q3['out edges'].mean(), incoming_edges_q4['out edges'].mean()

In [None]:
outgoing_edges_q1['in edges'].mean(), outgoing_edges_q2['in edges'].mean(), outgoing_edges_q3['in edges'].mean(), outgoing_edges_q4['in edges'].mean()

In [None]:
outgoing_edges_q1['out edges'].mean(), outgoing_edges_q2['out edges'].mean(), outgoing_edges_q3['out edges'].mean(), outgoing_edges_q4['out edges'].mean()

In [None]:
incoming_edges_bottom_1['in edges'].mean(), incoming_edges_top_1['in edges'].mean()

In [None]:
incoming_edges_bottom_1['out edges'].mean(), incoming_edges_top_1['out edges'].mean()

In [None]:
outgoing_edges_bottom_1['in edges'].mean(), outgoing_edges_top_1['in edges'].mean()

In [None]:
outgoing_edges_bottom_1['out edges'].mean(), outgoing_edges_top_1['out edges'].mean()

#### <i>Mean incoming and outgoing edges total network: $5.3$</i> 
#### <i>Sorted on number of incoming edges</i>
<table>
    <tr>
        <th>Group</th>
        <th>Mean #Incoming Edges</th>
        <th>Mean #Outgoing Edges</th>
    </tr>
    <tr>
        <td><i>Bottom 1%</i></td>
        <td>$0.00$</td>
        <td>$4.46$</td>
    </tr>
    <tr>
        <td>Q1 <i>(Bottom 25%)</i></td>
        <td>$0.00$</td>
        <td>$3.83$</td>
    </tr>
    <tr>
        <td>Q2</td>
        <td>$0.81$</td>
        <td>$2.36$</td>
    </tr>
    <tr>
        <td>Q3</td>
        <td>$1.82$</td>
        <td>$3.58$</td>
    </tr>
        <tr>
        <td>Q4 <i>(Top 25%)</i></td>
        <td>$18.6$</td>
        <td>$11.5$</td>
    </tr>
        <tr>
        <td><i>Top 1%</i></td>
        <td>$235$</td>
        <td>$64.0$</td>
    </tr>
</table>


#### <i>Sorted on number of outgoing edges</i>
<table>
    <tr>
        <th>Group</th>
        <th>Mean #Incoming Edges</th>
        <th>Mean #Outgoing Edges</th>
    </tr>
        <tr>
        <td><i>Bottom 1%</i></td>
        <td>$2.89$</td>
        <td>$0.00$</td>
    </tr>
    <tr>
        <td>Q1 <i>(Bottom 25%)</i></td>
        <td>$3.97$</td>
        <td>$0.00$</td>
    </tr>
    <tr>
        <td>Q2</td>
        <td>$3.67$</td>
        <td>$0.81$</td>
    </tr>
    <tr>
        <td>Q3</td>
        <td>$2.49$</td>
        <td>$2.21$</td>
    </tr>
        <tr>
        <td>Q4 <i>(Top 25%)</i></td>
        <td>$11.1$</td>
        <td>$18.2$</td>
    </tr>
        <tr>
        <td><i>Top 1%</i></td>
        <td>$112$</td>
        <td>$160$</td>
    </tr>
</table>


There are two noteworthy things in these tables. First, the nodes with a relatively high number of outgoing edges (i.e. the top $25\%$ or even the top $1\%$) also have a relatively high number of incoming edges when compared to the rest of the network. It thus seems that pages that have a lot of links to other pages are also relatively highly linked pages themselves (i.e. have many incoming edges). The same can be seen in the top groups when sorted by number of incoming edges: these groups not only have a relatively higher number of incoming edges (on average) than the rest of the network, but also a relatively higher number of outgoing edges. Second, note that there is a relatively large group of pages in the network that do not have incoming edges, but have a relatively high number of outgoing edges.

In [None]:
print(str((wiki_df.loc[wiki_df["out edges"] == 0.00].shape[0] / wiki_df["out edges"].shape[0]) * 100), "%")
print(str((wiki_df.loc[wiki_df["in edges"] == 0.00].shape[0] / wiki_df["in edges"].shape[0]) * 100), "%")

In addition, one can see that part of the skewness of the degree distribution (see bottom groups) is due to pages that have either no incoming or outgoing edges. Having no incoming edges can be explained by pages that are the "entry-page" to this network of pages (i.e. referred to by somebody googling it / typing it in their address bar) while having no outgoing edges can be explained by pages that simply have no links on them (i.e. pages showing files etc.). These possible explanations are, however, just assumptions / guesses. There is no metadata about the pages available, so we can actually not confirm these hypotheses by more in-depth evaluations.

### <font color="darkgreen">2. Static Network Analysis:</font> Diameter

The next measure which is going to be calculated and evaluated in the context of static network analysis is a graph distance measure: the diameter. It is the maximum eccentricity of any node in the network, or in other words the longest of all the shortest paths between the vertices in the network. This may sound a bit confusing, but imagine that we have the set of all the shortest paths between each pair of vertices in the network. The maximum length found in this set is the diameter of the network.

<b>A</b> --- <b>B</b> --- <b>C</b>&emsp;&emsp;&emsp; <i>Diameter:</i> $4$ (A-I and G-C) <br> 
|&emsp;&emsp;|&emsp;&emsp;| <br>
<b>D</b> --- <b>E</b> --- <b>F</b> <br>
|&emsp;&emsp;|&emsp;&emsp;| <br>
<b>G</b> --- <b>H</b> --- <b>I</b> 

<b>A</b> --- <b>B</b> --- <b>C</b>&emsp;&emsp;&emsp; <i>Diameter:</i> $5$  (A-J) <br> 
|&emsp;&emsp;|&emsp;&emsp;| <br>
<b>D</b> --- <b>E</b> --- <b>F</b> <br>
|&emsp;&emsp;|&emsp;&emsp;| <br>
<b>G</b> --- <b>H</b> --- <b>I</b> --- <b>J</b> 

This toy example should give a clear idea of what the diameter measures. Note, however, that this example uses undirected edges, while we are of course interested in this measure for a directed graph of links. We will translate the measure back to our problem and network context in a bit.
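The two toy grids can be checked with a few lines of $networkx$. This is a sketch in which the letters are replaced by the `(row, column)` tuples that `nx.grid_2d_graph` generates, so A corresponds to `(0, 0)` and I to `(2, 2)`.

```python
import networkx as nx

# 3 x 3 undirected grid; nodes are (row, col) tuples.
grid = nx.grid_2d_graph(3, 3)
d_grid = nx.diameter(grid)

# Second example: attach an extra node J to the bottom-right corner I.
extended = grid.copy()
extended.add_edge((2, 2), "J")
d_extended = nx.diameter(extended)

print(d_grid, d_extended)
```

Both graphs are connected and undirected, so the built-in diameter function works here without any workaround.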

Since we have a directed graph (i.e. network) that is not strongly connected (there are nodes $u$ which have no path to certain nodes $v$), we cannot simply use the diameter function of $networkx$ to calculate the diameter of the graph. The reason is that if a shortest path from $u$ to $v$ does not exist in the graph, its length is $\infty$ (think about Dijkstra's shortest path algorithm). Since the diameter function of $networkx$ simply calculates the eccentricity of every node in the graph and returns the maximum value found in this collection, it does not work here: the eccentricity function, and thus the diameter function, raises an error because the graph is not strongly connected.

A hack around this issue is pretty simple when one takes a look at the source code of $networkx$ (praise open-source software :))

In [None]:
nx.is_strongly_connected(G)

The problem with the built-in $networkx$ function for calculating the diameter of a graph is that it relies on the eccentricity function. Eccentricity computes the shortest paths from a node $u$ to all the other nodes and keeps the longest of them, and it does this for all the nodes in the graph. The diameter function then calls this function on the graph and takes the maximum (i.e. the longest shortest path) over the returned collection of eccentricities.
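The failure mode can be reproduced on a tiny directed graph. This is a sketch with a made-up graph `G_weak`; the exact error message may differ between $networkx$ versions.

```python
import networkx as nx

# Hypothetical three-node chain: there is no path back from c to a.
G_weak = nx.DiGraph([("a", "b"), ("b", "c")])

try:
    nx.diameter(G_weak)
    raised = False
except nx.NetworkXError as err:
    # Eccentricity finds an unreachable node and refuses to continue.
    raised = True
    print("diameter failed:", err)
```

This is the reason for working only with the finite shortest-path lengths in what follows.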

But we still want to be able to calculate the diameter of this graph, which is not strongly connected. Some code is devised below which transforms the original graph into a strongly connected graph by removing the nodes which make it 'weak', but after some evaluation we came to the conclusion that there is a risk of not finding the real diameter of the original graph. We, therefore, took a dive into the source code of $networkx$ and implemented the function ourselves to fit our problem context and graph peculiarities. The function is rather straightforward, but unfortunately not very efficient. It does essentially what the $networkx$ functions do under the hood, except that we only calculate the shortest path between a source node $u$ and a target node $v$ when a path from $u$ to $v$ actually exists. This prevents the function from returning 'infinity', which would corrupt the result.

In [None]:
'''
Make original graph strongly connected

nodes_in_strong_components = list(list(node)[0] for node in nx.strongly_connected_components(G))
gen_strong = G.nbunch_iter(nodes_in_strong_components)

edges_in_strong_components = nx.edges(G, nodes_in_strong_components)
self_loop_edges = list(nx.selfloop_edges(G))
edges_in_strong_components_no_loop = (set(edges_in_strong_components).difference(set(self_loop_edges)))
G_strong = G.edge_subgraph(edges_in_strong_components_no_loop)

entering_nodes = list(node for node in G_strong.nodes() if ((G_strong.in_degree(node) == 0) | (G_strong.out_degree(node) == 0)))
G_strong.remove_nodes_from(entering_nodes)
((len(G) - len(G_strong))/ len(G)) * 100
nx.is_strongly_connected(G_strong)

nx.diameter(G_strong, e = None, usebounds = False)

'''

Since the function $nx.shortest\_path\_length$ performs a shortest-path search (breadth-first when no edge weights are given, Dijkstra's algorithm otherwise), the choice is made to remove the self-loop edges. They can never be part of a shortest path between two distinct nodes, and removing them rules out any effect on the results or the running time that might depend on the peculiarities of the implementation.

In [None]:
self_loop_edges = list(nx.selfloop_edges(G))
edges_no_loop = list(set(G.edges()).difference(set(self_loop_edges)))
G_no_loop = G.edge_subgraph(edges_no_loop)
len(self_loop_edges) + len(edges_no_loop)== len(G.edges())

In [None]:
# edges_loops = nx.simple_cycles(G)

Calculating the longest shortest path in the network

In [None]:
def longest_shortest_path(G_in):
    # A shortest path visits each node at most once, so this is an upper bound.
    max_distance = len(G_in) - 1
    longest = 0
    for u in G_in.nodes():
        # nx.descendants returns every node reachable from u (excluding u itself),
        # including the leaves that the keys of nx.dfs_successors would miss.
        # A path to v therefore always exists and no infinite distance can occur.
        for v in nx.descendants(G_in, u):
            path_length = nx.shortest_path_length(G_in, source=u, target=v)
            if path_length > longest:
                longest = path_length
                print("current longest: ", longest)
                if longest >= max_distance:
                    return max_distance
    return longest

The function could be optimized by implementing the traversal ourselves (e.g. a depth-first search) and comparing the shortest-path lengths along the way, but that may be a bit beyond the scope of this analysis.
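One possible shortcut, sketched here as an assumption rather than as the method used in this notebook, is to run a single breadth-first search per source node with `nx.single_source_shortest_path_length`. It returns the distances to all reachable nodes at once, so unreachable pairs never show up. The function name `finite_diameter` and the toy graph are hypothetical.

```python
import networkx as nx

def finite_diameter(G_in):
    """Longest finite shortest-path length over all reachable ordered pairs."""
    longest = 0
    for u in G_in.nodes():
        # BFS distances from u; unreachable nodes are simply absent from the dict.
        distances = nx.single_source_shortest_path_length(G_in, u)
        # The dict always contains u itself with distance 0, so max() is safe.
        longest = max(longest, max(distances.values()))
    return longest

# Hypothetical toy graph that is not strongly connected.
G_toy = nx.DiGraph([("e", "a"), ("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")])
toy_diameter = finite_diameter(G_toy)
print(toy_diameter)
```

In the toy graph the longest finite shortest path runs from `e` to `c`. This variant still needs one search per node, so on a graph of this size it remains expensive, but it avoids a separate shortest-path call for every pair.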

In [None]:
# diameter = longest_shortest_path(G_no_loop)

In [None]:
wiki_df.head()

In [None]:
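def number_of_successors(row):
    node = str(row.node)
    # All nodes reachable from `node` via depth-first search (includes `node` itself).
    successors = set(nx.dfs_postorder_nodes(G, node))
    # Use {node}, not set(node): set('10') would split the label into {'1', '0'}.
    successors = list(successors.difference({node}))
    print(successors)
    return successors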

### <font color="red">To-do: Implication of the value of the distance measure in relation to our problem context / the characteristics of the network.</font>

### <font color="darkgreen">3. Static Network Analysis:</font> Mean shortest-path length

The next statistical measure in this static network analysis has a strong relationship to the previous statistical measure. While the diameter captured the maximum eccentricity (i.e. the greatest distance between any pair of nodes in the network) found among all the nodes in the graph, the mean shortest-path length is a metric that captures the average distance (i.e. shortest path) between any pair of vertices in the graph.

There are a few things to consider before we can actually continue with the evaluation of this metric. As we know by now, the graph view (ignoring the timestamps) that is taken in this static network analysis contains self-loop edges and other peculiarities that can distort the calculation of the mean shortest-path length and thus affect any implications that are going to be suggested after evaluating the result in the context of our network.

The choice is made to calculate the mean shortest-path length across all the pairs of vertices $u$ and $v$ in the graph where 1) $u \neq v$ and 2) there exists a path from $u$ to $v$ (the length would otherwise be $\infty$).
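Written out, with $P$ the set of ordered pairs $(u, v)$ satisfying both conditions and $d(u, v)$ the shortest-path length, this is $\bar{\ell} = \frac{1}{|P|} \sum_{(u,v) \in P} d(u, v)$. A minimal sketch of this definition follows, using one breadth-first search per source node. The function name `mean_shortest_path_length` and the toy graph are hypothetical, and this is not necessarily how the final analysis computes the value.

```python
import networkx as nx

def mean_shortest_path_length(G_in):
    """Mean shortest-path length over ordered pairs (u, v), u != v, v reachable from u."""
    total, pairs = 0, 0
    for u in G_in.nodes():
        # Only nodes reachable from u appear, so no infinite lengths are summed.
        distances = nx.single_source_shortest_path_length(G_in, u)
        for v, distance in distances.items():
            if v != u:
                total += distance
                pairs += 1
    # No reachable pair at all: the mean is undefined.
    return total / pairs if pairs else float("nan")

# Hypothetical toy graph with a self-loop on c.
G_toy = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "c")])
toy_mean = mean_shortest_path_length(G_toy)
print(toy_mean)
```

The toy graph has three reachable pairs (a to b, a to c, b to c) with lengths 1, 2 and 1. The self-loop on `c` does not contribute, because pairs with $u = v$ are excluded.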

### <font color="darkgreen">4. Static Network Analysis:</font> Betweenness centrality

In [48]:
katz_central = nx.katz_centrality(G)

KeyboardInterrupt: 

### <font color="darkgreen">5. Static Network Analysis:</font> Closeness centrality