# NB1. Network Statistics

Consider the following networks:
* **Facebook Northwestern University** (socfb-Northwestern25.edges.gz). Network of Facebook users at Northwestern University. Nodes represent people, and links stand for Facebook friend connections.
* **US air transportation** (openflights_usa.edges.gz). The US air transportation network using flight data from OpenFlights.org. Nodes represent airports, and links stand for connections between them.
* **Twitter USA Politics** (retweet-digraph.edges.gz). Directed, weighted retweet network on Twitter among people sharing posts about US politics. Links represent retweets of posts that used the hashtags #tcot or #p2. A link from user A to user B indicates that a message has propagated from A to B.

In [1]:
import itertools
import networkx as nx
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statistics

## Task 1
Create a table including the following characteristics for each network:
* Number of Nodes $N$.
* Number of Links $L$.
* Density $d$.
* Average Degree $\langle k\rangle $. 
* Clustering Coefficient $C_C$. 
    
Consider the following observations:
* In the case of directed networks, compute the average in-degree.
* In the case of directed networks, compute the clustering coefficient without taking into account the directions of the edges. In NetworkX it is possible to use the ``` G.to_undirected() ``` method to return an undirected copy of a graph G.
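
The two observations above can be sketched on a toy directed graph. The graph `D` and its edges are invented for illustration and are not taken from any of the datasets:

```python
import networkx as nx

# Toy directed graph, invented for illustration only.
D = nx.DiGraph([(1, 2), (2, 3), (3, 1), (3, 4)])

# Average in-degree: every link adds one in-degree, so this equals L / N.
avg_in_degree = sum(k for _, k in D.in_degree()) / D.number_of_nodes()

# Clustering coefficient ignoring edge directions: use an undirected copy.
clustering = nx.average_clustering(D.to_undirected())

print(avg_in_degree, clustering)
```

Here the average in-degree is the number of links divided by the number of nodes, which is also the average out-degree.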


In [2]:
# Pass the filename directly so NetworkX opens and closes the file itself
fb = nx.read_edgelist("socfb-Northwestern25.edgelist")

In [3]:
us = nx.read_edgelist("openflights_usa.edges")

In [4]:
# Note: this loads the retweet network as an undirected weighted graph.
# Passing create_using=nx.DiGraph() would keep the link directions.
tw = nx.read_weighted_edgelist("retweet-digraph.edges")

In [5]:
nx.is_directed(tw)

False

In [6]:
nx.is_directed(fb)

False

In [7]:
nx.is_directed(us)

False
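
All three checks above return `False`, including the one for the retweet network, because the readers build an undirected graph unless told otherwise. A minimal sketch of loading a weighted edge list as a directed graph follows. The file `toy-digraph.edges.gz` and its two edges are made up for illustration, and the real datasets are not touched:

```python
import gzip
import os
import tempfile

import networkx as nx

# Write a tiny, made-up weighted edge list ("source target weight") to a .gz file.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy-digraph.edges.gz")
with gzip.open(path, "wt") as f:
    f.write("a b 2\nb c 1\n")

# create_using=nx.DiGraph() keeps the direction of each link.
# NetworkX decompresses filenames ending in .gz when reading.
toy = nx.read_weighted_edgelist(path, create_using=nx.DiGraph())

print(nx.is_directed(toy))
print(toy.has_edge("a", "b"), toy.has_edge("b", "a"))
print(toy["a"]["b"]["weight"])
```

With a directed graph, `G.in_degree()` and the strongly connected components become available for the later tasks.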

In [8]:
# Networks to summarize, in table order
all_networks = [fb, us, tw]

# One row per network: N, L, density, average degree, average clustering
to_df_1 = []
for G in all_networks:
    num_nodes = nx.number_of_nodes(G)
    num_links = nx.number_of_edges(G)
    density = nx.density(G)
    degree_sequence = [G.degree(n) for n in G.nodes]
    avg_degree = statistics.mean(degree_sequence)
    clust_coef = nx.average_clustering(G)
    to_df_1.append([num_nodes, num_links, density, avg_degree, clust_coef])

In [9]:
df1 = pd.DataFrame(data=to_df_1, columns=['num_nodes', 'num_links', 'density', 'avg_degree', 'clust_coef'])
df1

| | num_nodes | num_links | density | avg_degree | clust_coef |
|---|---|---|---|---|---|
| 0 | 10567 | 488337 | 0.008748 | 92.4268 | 0.237991 |
| 1 | 546 | 2781 | 0.018691 | 10.186813 | 0.493045 |
| 2 | 18470 | 48053 | 0.000282 | 5.203357 | 0.026153 |
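
For an undirected network, density and average degree follow directly from the node and link counts: $d = 2L / (N(N-1))$ and $\langle k\rangle = 2L/N$. As a sanity check, the sketch below recomputes both from the $N$ and $L$ values in the table above. The helper names `density_undirected` and `average_degree_undirected` are hypothetical and not part of NetworkX:

```python
def density_undirected(n_nodes, n_links):
    """Fraction of the N(N-1)/2 possible links that are present."""
    return 2 * n_links / (n_nodes * (n_nodes - 1))


def average_degree_undirected(n_nodes, n_links):
    """Each link adds one to the degree of two nodes, hence 2L / N."""
    return 2 * n_links / n_nodes


# (N, L) pairs copied from the table above, keyed by the notebook's variable names.
counts = {"fb": (10567, 488337), "us": (546, 2781), "tw": (18470, 48053)}
for name, (n, l) in counts.items():
    d = round(density_undirected(n, l), 6)
    k = round(average_degree_undirected(n, l), 6)
    print(name, d, k)
```

The recomputed values should agree with the `density` and `avg_degree` columns up to rounding, since all three graphs were loaded as undirected.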


The average shortest-path length is a common aggregate distance measure for networks. It is obtained by averaging the shortest-path lengths over all pairs of nodes. This definition assumes that a shortest path exists between every pair of nodes; if any pair has no path, the average path length is undefined. One way around this is to measure it only on the giant component; for a directed network, one can consider directed paths within the giant strongly connected component. However, because the number of node pairs grows quadratically with the number of nodes, computing the average shortest-path length can be computationally expensive.
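
A small sketch of the point made above, on a toy undirected graph invented for illustration: the measure is undefined on a disconnected graph, but well defined on its giant component.

```python
import networkx as nx

# Toy graph with two components: a path 0-1-2-3 and a separate link 4-5.
G_toy = nx.Graph([(0, 1), (1, 2), (2, 3), (4, 5)])

try:
    nx.average_shortest_path_length(G_toy)
except nx.NetworkXError as err:
    # NetworkX refuses to average over pairs that have no path between them.
    print("undefined on the full graph:", err)

# Restrict to the giant (largest) connected component, where every pair is connected.
giant_nodes = max(nx.connected_components(G_toy), key=len)
giant = G_toy.subgraph(giant_nodes)
apl = nx.average_shortest_path_length(giant)
print(apl)
```

For a directed graph, `nx.strongly_connected_components` plays the role of `nx.connected_components` in the same pattern.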

## Task 2
Create a function ``` average_path_length_sample(G, N_sample)``` to compute the average path length of a network. The function must identify whether the network is directed or not; the method ``` G.is_directed()``` can be useful here. For directed networks, it should use the largest strongly connected component. For undirected networks, it should use the giant connected component.

To keep the computation tractable, draw a sample of ```N_sample``` randomly chosen nodes from that component and compute the average path length using the sample.

The function takes as input ```G```, a network, and ```N_sample```, the number of nodes to include in the sample, and returns the average path length.

Using ```N_sample=1000```, compute the average path length of the three given networks and add the results to the table.

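In [10]:
import random

def average_path_length_sample(G, N_sample=1000):
    # Work on the largest (strongly) connected component so every path exists.
    if G.is_directed():
        nodes = max(nx.strongly_connected_components(G), key=len)
    else:
        nodes = max(nx.connected_components(G), key=len)
    if len(nodes) < 2:
        return 0.0
    H = G.subgraph(nodes)
    # One reading of the task: sample source nodes, then average the hop
    # distances from each source to every other node in the component.
    sources = random.sample(list(nodes), min(N_sample, len(nodes)))
    total = 0
    for s in sources:
        # Unweighted (hop-count) distances; the source itself contributes 0.
        total += sum(nx.single_source_shortest_path_length(H, s).values())
    return total / (len(sources) * (len(nodes) - 1))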

In [12]:
#average_path_length_sample(fb)

In [13]:
#average_path_length_sample(tw)

## Useful NetworkX Methods

* [Reading and writing graphs](https://networkx.github.io/documentation/networkx-1.9/reference/readwrite.html). Check the ```read_edgelist``` method.
* [Components](https://networkx.github.io/documentation/stable/reference/algorithms/component.html).

## References
[1] F. Menczer, S. Fortunato, C. A. Davis (2020). A First Course in Network Science.