In [None]:
import pandas as pd #module to work with dataframes
import networkx as nx #module to work with networks
import numpy as np
import scipy as scpy
import matplotlib.pyplot as plt
from networkx.algorithms import bipartite #we load the bipartite algorithms to facilitate writing the code
from Functions import *
#%matplotlib inline

# Network Structure I: Centrality metrics (node-scale)

## Paths 

A *path* in a network is a sequence of edges connecting two nodes. Let's start with a very simple, undirected network.

In [None]:
G = nx.Graph()
G.add_nodes_from([1,2,3,4])
G.add_edges_from([(1,2),(2,3),(1,3),(1,4)])
nx.draw(G, with_labels=True)

In this simple example, we can easily see that there is indeed at least one path that connects nodes 3 and 4. We can verify this with the NetworkX function `nx.has_path(G, start node, end node)`

In [None]:
nx.has_path(G, 3, 4)

There can be more than one path between two nodes. Again considering nodes 3 and 4, there are two such "simple" paths:

In [None]:
list(nx.all_simple_paths(G, 3, 4))

We are often most interested in **shortest paths**. In an unweighted network, the shortest path is the one with the fewest edges. We can see that of the two simple paths between nodes 3 and 4, one is shorter than the other. We can get this shortest path with a single NetworkX function `nx.shortest_path(G, start node, end node)`

In [None]:
nx.shortest_path(G, 3, 4)

If you only care about the path length, there's a function for that too: `nx.shortest_path_length(G, start node, end node)`

In [None]:
nx.shortest_path_length(G, 3, 4)
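
In an unweighted graph, a shortest path can be found with a breadth-first search (BFS), which explores nodes in order of increasing distance from the start node. The following is a minimal pure-Python sketch of that idea on the same four-node graph. The helper `bfs_shortest_path` and the `adjacency` dictionary are illustrative names, not part of NetworkX, and this is not necessarily how NetworkX implements it internally.

```python
from collections import deque


def bfs_shortest_path(adjacency, source, target):
    """Return one shortest path as a list of nodes, or None if unreachable."""
    parent = {source: None}  # also serves as the set of visited nodes
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the parents to rebuild the path
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in adjacency[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


# Same edges as the small undirected example graph
edges = [(1, 2), (2, 3), (1, 3), (1, 4)]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, []).append(v)
    adjacency.setdefault(v, []).append(u)

path = bfs_shortest_path(adjacency, 3, 4)
print(path)           # [3, 1, 4]
print(len(path) - 1)  # 2 edges
```

Because BFS reaches every node by the fewest possible edges, the first time it reaches the target it has found a shortest path.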

> Note that a path length is defined here by the number of *edges* in the path, not the number of nodes, which implies that for nodes $u$ and $v$:
>
>    `nx.shortest_path_length(G, u, v) == len(nx.shortest_path(G, u, v)) - 1`
   


<div class="alert alert-block alert-success"><b>Up to you: </b>
<h4> Exercise 9</h4>
Let's work with the network of US air travel routes. The nodes in this graph are airports, represented by their IATA codes.

![title](./images/figure6.png)
    
Two nodes are connected with an edge if there is a scheduled flight directly connecting these two airports. We'll assume this graph to be undirected since a flight in one direction usually means there is a return flight.
Thus this graph has edges

[('HOM', 'ANC'), ('BGM', 'PHL'), ('BGM', 'IAD'), ...]

where ANC is Anchorage, IAD is Washington Dulles, etc.
    
Create the network of USA flights and analyze it to answer these questions:
    
- 1) Is there a direct flight between Indianapolis (IND) and Fairbanks, Alaska (FAI)? A direct flight is one with no intermediate stops.
- 2) If I wanted to fly from Indianapolis to Fairbanks, Alaska what would be an itinerary with the fewest number of flights?
</div>

In [None]:
#write your code here. The network is already loaded
G = nx.read_graphml('./data/openflights_usa.graphml.gz')

In [None]:
#SOLUTION: uncomment line below to load the solution
# %load ./snippets/ex9.py

Let's extend these ideas about paths to directed graphs.

### Directed paths

We know that in a directed graph, an edge from an arbitrary node $u$ to an arbitrary node $v$ does not imply that an edge exists from $v$ to $u$. Since paths must follow edge direction in directed graphs, the same asymmetry applies for paths. Observe that this graph has a path from 1 to 4, but not in the reverse direction.

In [None]:
D = nx.DiGraph()
D.add_edges_from([
    (1,2),
    (2,3),
    (3,2), (3,4), (3,5),
    (4,2), (4,5), (4,6),
    (5,6),
    (6,4),
])
nx.draw(D, with_labels=True)

In [None]:
nx.has_path(D, 1, 4)

In [None]:
nx.has_path(D, 4, 1)
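
When no path exists, `nx.shortest_path` raises `nx.NetworkXNoPath` instead of returning an empty result, so it can be worth guarding the call. Below is a small sketch on the same directed graph. The wrapper `safe_shortest_path` is a hypothetical helper written for this example, not a NetworkX function.

```python
import networkx as nx

# Same directed graph as above
D = nx.DiGraph()
D.add_edges_from([(1, 2), (2, 3), (3, 2), (3, 4), (3, 5),
                  (4, 2), (4, 5), (4, 6), (5, 6), (6, 4)])


def safe_shortest_path(graph, source, target):
    """Hypothetical helper: shortest path as a list, or None if unreachable."""
    try:
        return nx.shortest_path(graph, source, target)
    except nx.NetworkXNoPath:
        return None


forward = safe_shortest_path(D, 1, 4)
backward = safe_shortest_path(D, 4, 1)
print(forward)   # [1, 2, 3, 4]
print(backward)  # None: no edge points into node 1
```

Checking `nx.has_path` first is an equivalent alternative; the `try`/`except` form avoids searching the graph twice.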

The other NetworkX functions dealing with paths take this asymmetry into account as well:

In [None]:
nx.shortest_path(D, 2, 5)

In [None]:
nx.shortest_path(D, 5, 2)

> Note: Since there is no edge from 5 to 3, the shortest path from 5 to 2 cannot simply backtrack the shortest path from 2 to 5 -- it has to go a longer route through nodes 6 and 4.

<div class="alert alert-block alert-success"><b>Up to you: </b>
<h4> Exercise 10</h4>
Imagine that after an accident, the 'Suspension-feeding molluscs' have been contaminated with lead. Taking into consideration the structure of the trophic interactions in the St Marks estuary, answer the following questions:
    
- 1 Should we be worried about the welfare of the 'Tonguefish'? 
- 2 And what about the 'Spider crabs'?
- 3 Should we expect more accumulation of lead in 'Red drum' or in 'Tonguefish', according to their diets?
</div>

In [None]:
# Start by loading the network as we did before, and continue with your code
filename="./data/WoL_StMarks/st_marks_Ilist.csv"
Ilist=pd.read_csv(filename, header=None, index_col=None)
Ilist.columns=["source","target","w"]
FW=nx.from_pandas_edgelist(Ilist, edge_attr="w", create_using=nx.DiGraph)
species=list(FW.nodes())
#your code here

## Centrality metrics

Often when looking at a network, we want to find the most "important" nodes, for some definition of important. The most basic measure of centrality is the *degree*, or number of links attached to a node. Let's take a look at the network we have loaded.

In [None]:
pos = nx.kamada_kawai_layout(G)
nx.draw(G,pos,node_size=50)

### Degree centrality

Do all airports seem equally easy to access? The degree centrality tells us the **number of neighbours of each node**. In this case, it can be understood as a proxy for how well connected a given airport is. As we saw in lesson 1, we can obtain the **degree centrality** of the nodes in the network using the method `G.degree(node)`.
Usually the degree of node $i$ is denoted by $k_i$. We say that a node has a higher degree centrality if it has a higher degree (i.e. if it has many neighbours). The rationale is: **the more connections** a node has, **the more important** the node is.

In [None]:
K=pd.Series(dict(G.degree())).sort_values(ascending=False) #let's store the degrees of the nodes in a series, so we can easily access later
print(K.head(5))

We can now find the best-connected airport in the US:

In [None]:
airport=K.idxmax()
print("The most connected airport is the %s with %s direct flights to other destinations" % (G.nodes[airport]["name"],K[airport]))

#### Degree distributions
What is the biggest difference between these two networks?

![title](./images/figure7.png)

The most basic structural properties of a network are the number of nodes (**N**) and the number of links (**L**). However, how these links are distributed among the nodes (**$K_i$**) has deep implications for other network properties (having all nodes with a similar degree is not the same as having a very heterogeneous distribution). The degree distribution plays a very important role in determining other structural metrics of the network.
We can see the **degree distribution** of a network by plotting a histogram of the degree series. This tells us how many nodes in the network have a given number of neighbours.
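
To make the contrast concrete, here is a small sketch comparing two toy graphs with the same number of nodes but very different degree distributions. The graphs `star` and `ring` are made up for illustration and are not the networks in the figure. `nx.degree_histogram` returns a list whose entry at index $K$ is the number of nodes with degree $K$.

```python
import networkx as nx

star = nx.star_graph(5)   # one hub connected to 5 leaves (6 nodes in total)
ring = nx.cycle_graph(6)  # 6 nodes, each connected to its 2 neighbours

# Entry at index K = number of nodes with degree K
star_hist = nx.degree_histogram(star)
ring_hist = nx.degree_histogram(ring)

print(star_hist)  # [0, 5, 0, 0, 0, 1] -> five nodes of degree 1, one of degree 5
print(ring_hist)  # [0, 0, 6]          -> all six nodes have degree 2
```

The star is heterogeneous (one hub and many poorly connected nodes), whereas in the ring every node has the same degree.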


In [None]:
#do histogram of degree. Fixed bin width to 1.
bins = np.arange(K.min(), K.max() + 2, 1)#fix width of bin to 1
hist, bin_edges = np.histogram(K, bins=bins)

In [None]:
#let's plot the histogram to see how K is distributed
plt.plot(bin_edges[:-1],hist, 'o',color="k",alpha=0.3)
#plt.yscale('log')
#plt.xscale('log')
plt.xlabel("Degree of node (K)")
plt.ylabel("Number of nodes with degree K")
plt.show()

And we can also obtain simple statistics from it, like the mean degree, and its standard deviation

In [None]:
K_mean=K.mean()
K_std=K.std()

print("The average number of direct flights from a US airport is %.2f +- %.2f"  % (K_mean,K_std))

> Note: In these **long-tailed distributions** the average value is not representative, as the standard deviation is larger than the mean!

There has been a lot of debate regarding the form of the degree distribution ($P(K)$) in real networks. The best practice to determine which function best fits $P(K)$ is to use the **cumulative degree distribution** (here, how many nodes in the network have degree $K$ or more) because it is less noisy. Let's see how we can do this.

In [None]:
#Compute the cumulative sum, but in reverse order to count values greater than or equal
cumulative_hist = np.cumsum(hist[::-1])[::-1]
#plot
plt.plot(bin_edges[:-1], cumulative_hist, 'o',color="k",alpha=0.3)
plt.xscale('log')
plt.yscale('log')
plt.xlabel("Degree of node (K)")
plt.ylabel("Number of nodes with degree K or more") #
plt.show()
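
The counts above can also be expressed as fractions of nodes, which makes networks of different sizes comparable. The sketch below applies the same reversed cumulative sum to a small made-up degree sequence (`degrees` is toy data, not the airport network) and normalizes by the number of nodes.

```python
import numpy as np

# Toy degree sequence, for illustration only
degrees = np.array([1, 1, 1, 2, 2, 3, 5, 8])

# One bin per integer degree, as in the histogram above
bins = np.arange(degrees.min(), degrees.max() + 2)
hist, bin_edges = np.histogram(degrees, bins=bins)

# Fraction of nodes with degree K or more
ccdf = np.cumsum(hist[::-1])[::-1] / degrees.size

for k, fraction in zip(bin_edges[:-1], ccdf):
    print(f"K >= {k}: {fraction:.3f}")
```

By construction the first value is 1 (every node has at least the minimum degree) and the curve never increases, which is why it looks smoother than the raw histogram.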

#### Generalizing "neighbors" to arbitrarily-sized graphs

The concept of neighbors is simple and appealing,
but it leaves us with a slight point of dissatisfaction:
it is difficult to compare graphs of different sizes.
Is a node more important solely because it has more neighbors?
What if it were situated in an extremely large graph?
Would we not expect it to have more neighbors?

As such, we need a normalization factor.
One reasonable one, in fact, is
_the number of nodes that a given node could **possibly** be connected to._
By taking the ratio of the number of neighbors a node has
to the number of neighbors it could possibly have,
we get the **degree centrality** metric.

Formally defined, the degree centrality of a node (let's call it $d$)
is the number of neighbors that a node has (let's call it $k$, its degree)
divided by the number of neighbors it could _possibly_ have, that is, all $N$ nodes in the graph except itself:

$$d = \frac{k}{N-1}$$

NetworkX provides a function for us to calculate **degree centrality** conveniently:

In [None]:
d = pd.Series(nx.degree_centrality(G)).sort_values(ascending=False)
print(d.head())
d.hist()
plt.show()
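
As a quick sanity check, the normalization can be reproduced by hand. `nx.degree_centrality` divides each degree by $N-1$, the number of *other* nodes in the graph. The sketch below uses a small toy graph (`G_toy`, the four-node graph from the Paths section) rather than the airport network.

```python
import networkx as nx

# Small toy graph, same edges as the first example in this notebook
G_toy = nx.Graph([(1, 2), (2, 3), (1, 3), (1, 4)])
n = G_toy.number_of_nodes()

# Degree divided by the number of other nodes
manual = {node: k / (n - 1) for node, k in G_toy.degree()}
builtin = nx.degree_centrality(G_toy)

for node in G_toy.nodes:
    print(node, manual[node], builtin[node])
```

Node 1 is connected to all three other nodes, so its degree centrality is 1, the maximum possible value in a simple graph.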

### Betweenness centrality

In some cases you may be more interested in knowing the extent to which a node lies on the shortest paths between other nodes. In the airport network, for example, this identifies the airport that is most used as an intermediate stop between other destinations.
To calculate the betweenness centrality we do:

In [None]:
betweenness = pd.Series(nx.centrality.betweenness_centrality(G)).sort_values(ascending=False)

In [None]:
print(betweenness)
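
A toy example helps to see what betweenness captures. In the sketch below, two triangles are joined through a single bridge node (a "barbell" graph, chosen here only for illustration). The bridge has one of the lowest degrees in the graph, yet every shortest path between the two triangles passes through it.

```python
import networkx as nx

# Two triangles (nodes 0-2 and 4-6) joined through the bridge node 3
B = nx.barbell_graph(3, 1)

degree = dict(B.degree())
bc = nx.betweenness_centrality(B)  # normalized by default

bridge = max(bc, key=bc.get)
print("Highest betweenness:", bridge, round(bc[bridge], 3))  # node 3, 0.6
print("Degree of that node:", degree[bridge])                # 2
```

Of the 15 pairs formed by the other six nodes, the 9 pairs with one node in each triangle route through node 3, which gives 9/15 = 0.6.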

### Closeness centrality

Closeness centrality indicates how close a node is to all other nodes in the network. It is calculated as the reciprocal of the average shortest path length from the node to every other node in the network, so nodes that are fewer steps away from everyone else score higher. In the airport network, the airport with the highest closeness is the one from which the rest of the airports can be reached in the fewest jumps.

In [None]:
closenness = pd.Series(nx.centrality.closeness_centrality(G)).sort_values(ascending=False)
print(closenness)
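
The definition can be checked by hand on a small connected graph. The sketch below uses a five-node path graph (a toy example, not the airport network) and compares the reciprocal of the mean distance with the value returned by `nx.closeness_centrality`.

```python
import networkx as nx

P = nx.path_graph(5)  # 0 - 1 - 2 - 3 - 4
n = P.number_of_nodes()

# Distances from the middle node to every node (including itself, at distance 0)
distances = nx.shortest_path_length(P, source=2)

# Reciprocal of the average distance to the other n - 1 nodes
manual = (n - 1) / sum(distances.values())
builtin = nx.closeness_centrality(P, u=2)

print(manual, builtin)
```

For graphs that are not connected, NetworkX applies an additional correction by default, so this simple formula only matches on connected graphs.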

You have probably realized that the centrality metrics of nodes do **NOT** necessarily coincide; that is, an airport can have many direct flights to other destinations without being an airport where people change flights, or without being very close to all other airports. Let's see it in our Airports network:

In [None]:
Airports=list(G.nodes) # we need to keep the same order as in the graph
nx.draw(G,pos,node_color=d[Airports],node_size=50) # color nodes by degree centrality

In [None]:
nx.draw(G,pos,node_color=betweenness[Airports],node_size=50)

In [None]:
nx.draw(G,pos,node_color=closenness[Airports],node_size=50)

### PageRank

PageRank computes a ranking of the nodes in the graph G based on the structure of the incoming links. It was originally designed as an algorithm to rank web pages.
However, it can also be used to identify the species that "move" more biomass through a network, or in general, the node that is most used when transporting information through the graph. Since this is only interesting in **directed graphs**, let's use one of our directed networks. It has been used, for example, to find the nodes that point to the more "important" nodes, in order to find the species that can cause the most harm when they disappear from the network.

In [None]:
# Start by loading the network as we did before
filename="./data/WoL_StMarks/st_marks_Ilist.csv"
Ilist=pd.read_csv(filename, header=None, index_col=None)
Ilist.columns=["source","target","w"]
FW=nx.from_pandas_edgelist(Ilist, edge_attr="w", create_using=nx.DiGraph)
TL=nx.centrality.trophic_levels(FW)


Now let's compute the PageRank:

In [None]:
PR = pd.Series(nx.pagerank(FW.reverse(), alpha=0.9)) #alpha is damping parameter for PageRank, default=0.85.
PR
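
To get an intuition for what PageRank computes, and why the direction of the edges matters, here is a simplified power-iteration sketch in pure Python. It is an illustration of the idea under simple assumptions (uniform teleportation, and the rank of nodes without outgoing links spread evenly), not NetworkX's implementation. The function `pagerank_power_iteration` and the toy edge list are made up for this example.

```python
def pagerank_power_iteration(edges, alpha=0.85, iterations=100):
    """Simplified PageRank: returns a dict node -> rank (ranks sum to 1)."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for u, v in edges:
        out_links[u].append(v)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is shared among all nodes
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - alpha + alpha * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        # Every node passes a share of its rank along each outgoing link
        for u in nodes:
            for v in out_links[u]:
                new_rank[v] += alpha * rank[u] / len(out_links[u])
        rank = new_rank
    return rank


# Toy directed graph: a -> c, b -> c, c -> d
edges = [("a", "c"), ("b", "c"), ("c", "d")]
pr = pagerank_power_iteration(edges)
pr_reversed = pagerank_power_iteration([(v, u) for u, v in edges])

print(max(pr, key=pr.get))                    # d
print(max(pr_reversed, key=pr_reversed.get))  # c
```

Rank accumulates at the end of chains of links, so reversing the edges changes which node comes out on top. This is why the direction chosen for the food web (here `FW.reverse()`) changes the interpretation of the result.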

In [None]:
#FW.remove_node(base_node)
species=list(FW.nodes)
pos= nx.shell_layout(FW)
TL=nx.centrality.trophic_levels(FW)
# Modify the y-coordinate based on the trophic level
for node in pos:
    pos[node] = (pos[node][0], TL[node])  # Set the y-position as the trophic level

nx.draw(FW, pos, node_color=PR[species], with_labels=True)
plt.show()

## Centrality metrics in Bipartite networks

When we are working with bipartite networks, we should use the algorithms included in `nx.bipartite` and not those of the unipartite networks! 

Let's see some examples.

### Degree centrality in bipartite networks

In the bipartite case, the maximum possible degree of a node in a bipartite node set is the number of nodes in the opposite node set. The degree centrality of a node $u$ in the bipartite set $U$ (with $n$ nodes), which connects to nodes in the bipartite set $V$ (with $m$ nodes), is

$$d_u=\frac{k_u}{m}, \quad u\in U,$$

and for a node $v$ in set $V$ it is

$$d_v=\frac{k_v}{n}, \quad v\in V,$$

where $k_u$ and $k_v$ are the degrees of nodes $u$ and $v$.

In [None]:
#Generate a bipartite network
Bnet=bipartite.random_graph(4, 5, 0.4, seed=None, directed=False)
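
Since the random graph above changes on every run, the sketch below uses a small hand-built bipartite graph so that the values can be checked against the formulas. The node names (`u1`, `v1`, ...) and edges are made up for illustration. `bipartite.degree_centrality` takes the nodes of one of the two sets and returns the centrality of the nodes in both sets.

```python
import networkx as nx
from networkx.algorithms import bipartite

U = ["u1", "u2"]        # set U, n = 2 nodes
V = ["v1", "v2", "v3"]  # set V, m = 3 nodes

B = nx.Graph()
B.add_nodes_from(U, bipartite=0)
B.add_nodes_from(V, bipartite=1)
B.add_edges_from([("u1", "v1"), ("u1", "v2"), ("u2", "v2")])

# Pass the nodes of one set; the result covers both sets
dc = bipartite.degree_centrality(B, U)

for node in U + V:
    print(node, round(dc[node], 3))
```

Here `u1` has degree 2 out of a possible $m = 3$, so $d_{u1} = 2/3$, while `v2` is connected to both nodes of $U$, so $d_{v2} = 2/2 = 1$.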

### Non-linear maps (fitness-complexity)

## Consequences of the structure

### Random failure vs targeted attack (nodes)

### Attacking edges