### CS4423 - Networks
Angela Carnevale <br />
School of Mathematical and Statistical Sciences <br />
NUI Galway

# Week 11, lecture 1: 

# More on the Watts-Strogatz model 
# Directed Networks and the WWW

In [None]:
import networkx as nx
opts = { "with_labels": True, "node_color": 'y' }

#### 5. Small worlds

## Small worlds and the WS model

Recall the definition of an $(n,d)$-circle graph:

**Definition (Circle Graph).** For $1 < d < n/2$, an $(n, d)$-**circle graph**
is obtained from a cycle on $n$ vertices by additionally linking each node
to all nodes that are not more than $d$ steps away on the cycle.


In [None]:
def circle_graph(n, d):
    G = nx.cycle_graph(n)
    for v in G:
        for o in range(2, d+1):
            G.add_edge(v, (v+o) % n)
    return G

We have seen that in an $(n,d)$-circle graph the degree of a node and its social graph (and hence its clustering coefficient) depend only on $d$. More precisely:

* The graph clustering coefficient of an $(n, d)$-circle graph is **independent of $n$**, and can be determined as
$$
C = \frac{3d - 3}{4d - 2} \to \frac34 \text{, as } d \to \infty.
$$
In particular:
$$
\begin{array}{l|rrrrr}
d & 1 & 2 & 3 & 4 & 5 \\ \hline
C & 0 & 0.5 & 0.6 & 0.643 & 0.667
\end{array}
$$

On the other hand, the characteristic path length of an $(n, d)$-circle graph is
approximately
$$
L \approx \frac{n}{4d},
$$
growing linearly with $n$ (for fixed $d$). 

In conclusion, such regular graphs have **high clustering** but **long shortest paths**,
hence $(n, d)$-circle graphs do not exhibit the small world behaviour.
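
Both claims can be checked numerically. The following sketch repeats the `circle_graph` helper from above so that it runs on its own, and prints the measured clustering coefficient and characteristic path length next to the values predicted by the two formulas. The choice $n = 200$ is an arbitrary assumption made only to keep the computation short.

```python
import networkx as nx

def circle_graph(n, d):
    """(n, d)-circle graph: a cycle plus links to all nodes up to d steps away."""
    G = nx.cycle_graph(n)
    for v in G:
        for o in range(2, d + 1):
            G.add_edge(v, (v + o) % n)
    return G

n = 200  # arbitrary, but much larger than d
for d in range(1, 6):
    G = circle_graph(n, d)
    C = nx.average_clustering(G)
    L = nx.average_shortest_path_length(G)
    # columns: d, measured C, predicted C, measured L, predicted L
    print(d, round(C, 3), round((3*d - 3) / (4*d - 2), 3), round(L, 2), n / (4*d))
```

The measured $L$ is only approximately $n/(4d)$, as the formula suggests, whereas the clustering coefficient should agree with the formula up to rounding.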

## The Watts-Strogatz Model

The following modification of the circle graph was suggested by Duncan J. Watts and Steven Strogatz ([1998](https://en.wikipedia.org/wiki/Watts%E2%80%93Strogatz_model)).

**Definition (The WS Model).**
Let $1 < d < n/2$ and $0\leq p \leq 1$.  An $(n, d, p)$-WS graph $G = (X, E)$ is constructed from
an $(n, d)$-circle graph $G_0 = (X, E_0)$ by rewiring each of the edges in $E_0$ with probability $p$,
as follows:

1. Visit the nodes $X = \{0, \dots, n{-}1\}$ in turn ('clockwise').

2. For each node $i \in X$, consider the $d$ edges connecting $i$ to its clockwise neighbours $j = i+1, \dots, i+d$ (taken modulo $n$).

3. With probability $p$, replace the endpoint $j$ of the edge $(i, j)$
by a node $k \in X$ chosen uniformly at random, subject to
  * $k \neq i$, and
  * $(i, k)$ must not be an edge of $G$ already.

In [None]:
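import random as rd
def ws_graph(n, d, p):
    # start from the (n, d)-circle graph
    G = circle_graph(n, d)
    for v in G:
        # the d clockwise edges (v, v+1), ..., (v, v+d)
        for o in range(1, d+1):
            if rd.random() < p:
                # candidate new endpoint, uniform over all nodes
                w = rd.randint(0,n-1)
                # rewire only if w is admissible; otherwise keep the edge
                if w != v and not G.has_edge(v, w):
                    G.remove_edge(v, (v+o) % n)
                    G.add_edge(v, w)
    return G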
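
In `ws_graph`, an edge that has been selected for rewiring is left in place whenever the single random candidate happens to be inadmissible. A minimal sketch of a variant that follows the definition more literally, by choosing $k$ uniformly among the admissible nodes only, could look as follows. The name `ws_graph_strict` and the `seed` parameter are assumptions introduced for this illustration; they are not part of the lecture code.

```python
import random as rd
import networkx as nx

def ws_graph_strict(n, d, p, seed=None):
    """Hypothetical variant of ws_graph: rewires whenever an admissible k exists."""
    rng = rd.Random(seed)
    # Build the (n, d)-circle graph.
    G = nx.cycle_graph(n)
    for v in range(n):
        for o in range(2, d + 1):
            G.add_edge(v, (v + o) % n)
    # Visit the nodes clockwise and consider their d clockwise edges.
    for v in range(n):
        for o in range(1, d + 1):
            if rng.random() < p:
                # Admissible k: different from v and not yet adjacent to v.
                candidates = [k for k in G if k != v and not G.has_edge(v, k)]
                if candidates:
                    G.remove_edge(v, (v + o) % n)
                    G.add_edge(v, rng.choice(candidates))
    return G

G = ws_graph_strict(21, 3, 0.2, seed=1)
print((G.order(), G.size()))
```

Each rewiring step removes one edge and adds one new edge, so the number of edges stays at $n \cdot d$ for every value of $p$.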

In [None]:
n, d = 21, 3
G = ws_graph(n, d, 0.1)
nx.draw_circular(G, **opts)
print(nx.average_clustering(G))
print(nx.average_shortest_path_length(G))
print((G.order(), G.size()))

In [None]:
G = ws_graph(n, d, 0.2)
nx.draw_circular(G, **opts)
print(nx.average_clustering(G))
print(nx.average_shortest_path_length(G))
print((G.order(), G.size()))

In [None]:
G = ws_graph(n, d, 1)
nx.draw_circular(G, **opts)
print(nx.average_clustering(G))
print(nx.average_shortest_path_length(G))
print((G.order(), G.size()))

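A WS graph with parameters $(n, d, p)$ can also be generated with the command `nx.watts_strogatz_graph(n, 2*d, p)`. Note that the second argument is the number of nearest neighbours each node is joined to in the initial ring, that is, $2d$ rather than $d$.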

In [None]:
n, d = 21, 3 
G = nx.watts_strogatz_graph(n, 2*d, 0)
nx.draw_circular(G, **opts)
print(nx.average_clustering(G))
print(nx.average_shortest_path_length(G))

In [None]:
G = nx.watts_strogatz_graph(n, 2*d, 0.1)
nx.draw_circular(G, **opts)
print(nx.average_clustering(G))
print(nx.average_shortest_path_length(G))

In [None]:
G = nx.watts_strogatz_graph(n, 2*d, 0.2)
nx.draw_circular(G, **opts)
print(nx.average_clustering(G))
print(nx.average_shortest_path_length(G))
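
The correspondence between the parameter $d$ used here and the second argument $2d$ of `nx.watts_strogatz_graph` can be checked for $p = 0$, where no rewiring takes place. The sketch below compares the edges produced by `networkx` with the edges of the $(n, d)$-circle graph, written as unordered pairs; the variable names are chosen for this illustration only.

```python
import networkx as nx

n, d = 21, 3
G = nx.watts_strogatz_graph(n, 2 * d, 0)

# Edges of the (n, d)-circle graph: each node joined to its d clockwise neighbours.
circle_edges = {frozenset((v, (v + o) % n)) for v in range(n) for o in range(1, d + 1)}
ws_edges = {frozenset(e) for e in G.edges()}

print(ws_edges == circle_edges)
print(set(dict(G.degree()).values()))
```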

## Properties of WS-Graphs

* The small-world attributes of an $(n, d, p)$-WS graph depend on the probability $p$.
The following measurements have been taken for $n = 1000$ and $d = 5$.

<table>
    <tr>
        <th>$p$</th>
        <th>$L$</th>
        <th>$C$</th>
    </tr>
    <tr>
        <td>$0$</td>
        <td>$50.5$</td>
        <td>$0.667$</td>
    </tr>
    <tr>
        <td>$0.01$</td>
        <td>$8.94$</td>
        <td>$0.648$</td>
    </tr>
    <tr>
        <td>$0.05$</td>
        <td>$5.26$</td>
        <td>$0.576$</td>
    </tr>
    <tr>
        <td>$1$</td>
        <td>$3.27$</td>
        <td>$0.00910$</td>
    </tr>
</table>
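
Measurements of this kind can be taken along the following lines. This is only a sketch under stated assumptions: the helper `measure` is hypothetical, a smaller $n$ is used to keep the run short, and `nx.connected_watts_strogatz_graph` is used so that the resulting graph is connected and $L$ is defined. Since the graphs are random, the numbers vary from run to run (and with the seed) and will not reproduce the table exactly.

```python
import networkx as nx

def measure(n, d, p, seed=None):
    """Return (L, C) for one random WS graph with parameters (n, d, p)."""
    # connected_watts_strogatz_graph regenerates the graph (up to `tries` times)
    # until it is connected, so that the average shortest path length exists.
    G = nx.connected_watts_strogatz_graph(n, 2 * d, p, tries=100, seed=seed)
    return nx.average_shortest_path_length(G), nx.average_clustering(G)

n, d = 200, 5  # smaller than in the table, to keep the run short
for p in [0, 0.01, 0.05, 1]:
    L, C = measure(n, d, p, seed=4423)
    print(p, round(L, 2), round(C, 3))
```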

One of the tasks in Assignment 4 asks you to do this more systematically for $n=1000$, $d=5$ and $p\in [0,1]$. There, you are asked to look at $50$ values of $p$ in that interval and to comment on the corresponding values of $L$ and $C$.

* One way to do that is to divide $[0,1]$ into $50$ equal parts. In practice, you could loop over `range(50)` and proceed from $0$ in increments of $1/50$.

* You are free to choose the $50$ distinct values of $p$ in other ways too (as long as they belong to the interval $[0,1]$).
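
The two options can be written down as lists of values. The uniform grid is the one described above. The logarithmic grid is just one possible alternative, shown here as an assumption rather than as a recommendation from the assignment; it samples small values of $p$ more densely, which is the region where $L$ changes fastest in the table above.

```python
# 50 equally spaced values 0, 1/50, 2/50, ..., 49/50.
ps_uniform = [i / 50 for i in range(50)]

# Alternative (an assumption, not prescribed by the assignment):
# 50 logarithmically spaced values between 10^-4 and 1.
ps_log = [10 ** (-4 + 4 * i / 49) for i in range(50)]

print(ps_uniform[:3], ps_uniform[-1])
print(ps_log[0], ps_log[-1])
```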

#### 6. Directed Networks

## The Structure of the World Wide Web

So far, the networks that have been discussed most of the time
consisted of people or organizations, connected by links representing
opportunities for interactions.   The World Wide Web is an example
of a network of a different kind, a so-called **information network**.

###  Information Networks

Information networks connect pieces of information,
like documents, or parts of documents, through links
that represent references of some kind.  Such links,
in contrast to social relationships which are typically symmetric,
only point in one direction.
The underlying graph of an information network thus
is a **directed graph**.

Information networks already existed before the internet.  Some prominent
examples include:

* **Academic Publications.**  In the scientific literature it is customary
to give credit to sources that have been used in the form of
references to those publications that contain those sources.
This practice creates a network whose nodes are the
publications, and whose links represent the references, pointing from
the citing article back to the cited article.
A large part of this network in the mathematical literature is
captured on [MathSciNet](http://www.ams.org/mathscinet).

* **Mathematical Proofs.**  In mathematics, the proof of a particular theorem
usually relies on theorems that have already been proved.
Citing a theorem in a proof thus creates a link from the theorem
being proved back to the theorem being used, in a network of mathematical
theorems.  In a similar way, a complex computer program,
consisting of several subroutines, can be regarded
as a network of subroutines, pointing to each other through links
that arise from one subroutine calling another.

* **Technical Documentation.** The documentation of complex systems,
such as computer software, typically consists of a collection of
articles (manual pages), each describing one aspect of the system,
frequently using cross-references to each other.  Here the network
consists of the manual pages, and the links represent those cross
references.  In a similar way, an encyclopedia (or a dictionary)
organizes its content as a sequence of articles, sorted
alphabetically, with supporting cross-references.

### Hypertext

The **World Wide Web** arose out of the desire to make technical
documentation more easily accessible by using the physical infrastructure
of the rapidly growing internet.
It was conceived by [Tim Berners-Lee](https://en.wikipedia.org/wiki/Tim_Berners-Lee) around 1990
as an information management system at [CERN](http://info.cern.ch/hypertext/WWW/TheProject.html).

In this system, documents are **web pages**, that anyone can create
and store in a publicly accessible place on their computer.  Moreover,
it supplies a **web browser**, a piece of software that can retrieve
the web pages from those public spaces, allowing others to easily
access those documents.

Web pages contain **hypertext**, that is, a mixture of plain text and **hyperlinks**.  Here, a hyperlink (or just link) is a reference to another document
that the reader can follow by simply clicking on it.  Hyperlinks have a **source**
(the document they are contained in) and a **target** (the document
they reference).
This creates a network of documents as nodes
and hyperlinks as **directed edges** between them.
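
As a small illustration, such a network can be represented as a `networkx` directed graph. The page names and links below are invented for this example. Each hyperlink is stored as a pair (source, target), and the direction matters: a link from one page to another does not imply a link back.

```python
import networkx as nx

# Hypothetical pages and hyperlinks, as (source, target) pairs.
links = [
    ("index", "about"),
    ("index", "news"),
    ("news", "index"),
    ("news", "archive"),
    ("about", "contact"),
]
W = nx.DiGraph(links)

print(sorted(W.successors("index")))    # targets of the links on "index"
print(sorted(W.predecessors("index")))  # pages containing a link to "index"
print(W.has_edge("index", "about"), W.has_edge("about", "index"))
```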

There are many alternative ways to organise information: alphabetically
(like the telephone book),
hierarchically (in folders, like the files on a computer), and so on.
Certainly, the physical constraints of the environment
(like the fact that books need to be stored on shelves,
that pages in a book come in order)
have an influence on how well a particular solution works.

Hypertext originates from the works of the visionaries
[Vannevar Bush](https://en.wikipedia.org/wiki/Vannevar_Bush) (the Memex, 1945) and
[Ted Nelson](https://en.wikipedia.org/wiki/Ted_Nelson) (Xanadu Project, 1965).

It will be useful to distinguish between **navigational links**
(providing access to related pages)
and **transactional links** (which exist more
as a side effect, like ordering a book
or sending an email, than
for the sake of leading to the next page).
The distinction is not always clear,
but transactional links are of
little interest to search engines.
It's the navigational links that
form the edges of the directed graph
that turns the Web into
an information network.

As with undirected graphs, an interesting question in
directed graphs is: which nodes can be reached from
a given node?

### Reachability in Directed Graphs

Recall that a **directed graph** is a pair $G = (X, E)$
with **vertex set** $X$  and **edge set** $E \subseteq X^2 = X \times X$.
For an edge $(x, y) \in E$ we sometimes write $x \to y$.

A **path** in a directed graph  $G = (X, E)$
is a sequence of nodes $(x_0, x_1, \dots, x_l)$
with $x_{i-1} \to x_i$ for $i = 1,\dots, l$.
The number $l$ is called the **length** of the path.
We write $x \leadsto y$
if there exists a path (possibly of length $0$)
from $x$ to $y$ in $G$.
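
In `networkx`, the set of nodes reachable from $x$ can be computed with `nx.descendants`, which leaves out $x$ itself. Since a path of length $0$ is allowed here, $x$ is added back in the sketch below. The example graph and the helper name `reachable` are assumptions made for this illustration.

```python
import networkx as nx

# A small example: a directed triangle 0 -> 1 -> 2 -> 0, plus edges 2 -> 3 and 4 -> 3.
G = nx.DiGraph([(0, 1), (1, 2), (2, 0), (2, 3), (4, 3)])

def reachable(G, x):
    """All y with a path from x to y, including x itself (path of length 0)."""
    return nx.descendants(G, x) | {x}

print(reachable(G, 0))
print(reachable(G, 3))
print(nx.has_path(G, 0, 3), nx.has_path(G, 3, 0))
```

In this example, node $3$ can be reached from node $0$ but not the other way round, so reachability in a directed graph is in general not symmetric.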

A directed graph $G$ is **weakly connected** if, when
considered as an undirected graph, it is connected.
The **weakly connected components** (WCCs) of $G$ are its connected components,
when considered as an undirected graph.

A directed graph $G$ is **strongly connected** if, for
each pair of vertices $x, y \in X$, there is a path from
$x$ to $y$ in $G$, i.e., if $x \leadsto y$.

A **strongly connected component (SCC)** of a directed graph $G$
is a subset $C$ of $X$ which is (i) strongly connected,
and (ii) not part of a larger strongly connected subset of $X$.

In general, the vertex set of a directed graph is the disjoint union of its WCCs.
Each WCC in turn is a disjoint union of SCCs.
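
Both kinds of components can be computed with `networkx`. The following sketch uses a small example graph, chosen here only for illustration, and checks that every SCC lies inside exactly one WCC.

```python
import networkx as nx

G = nx.DiGraph([(0, 1), (1, 2), (2, 0), (2, 3), (4, 3), (5, 6), (6, 5)])

wccs = [set(c) for c in nx.weakly_connected_components(G)]
sccs = [set(c) for c in nx.strongly_connected_components(G)]
print(wccs)
print(sccs)

# For each SCC, count the WCCs that contain it.
print([sum(s <= w for w in wccs) for s in sccs])
```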

When a directed graph $G$ is regarded as a **relation**
on the set $X$, strongly connected components can be described as
the **equivalence classes** of an equivalence relation that is obtained
as follows.

First note that the relation ${x \leadsto y}$
is the reflexive and transitive closure of the
edge relation $x \to y$.  Thus, by construction it is reflexive and
transitive.  It might not be anti-symmetric, though,
meaning that there might be distinct vertices $x \neq y$
with $x \leadsto y$ and $y \leadsto x$.

However, the new relation $x \equiv y$,
defined as $x \leadsto y$ and $y \leadsto x$
is an equivalence relation (why?)
and its equivalence classes are the strongly connected
components of $G$.  Denote the class of $x \in X$ by $[x]$.
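
The claim about the equivalence classes can be tested on an example. The sketch below computes the class $[x]$ directly from the definition, as the set of all $y$ with $x \leadsto y$ and $y \leadsto x$, and compares the resulting classes with the strongly connected components found by `networkx`. The helpers `reach` and `eq_class` are hypothetical names used only here, and a check on one example is of course not a proof.

```python
import networkx as nx

G = nx.DiGraph([(0, 1), (1, 2), (2, 0), (2, 3), (4, 3), (5, 6), (6, 5)])

def reach(G, x):
    """The set of all y with a path from x to y (x included)."""
    return nx.descendants(G, x) | {x}

def eq_class(G, x):
    """The class [x]: all y reachable from x from which x is reachable again."""
    return frozenset(y for y in reach(G, x) if x in reach(G, y))

classes = {eq_class(G, x) for x in G}
sccs = {frozenset(c) for c in nx.strongly_connected_components(G)}
print(classes == sccs)
```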

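Moreover, there is a partial order relation
$\leq$ (a relation which is reflexive, transitive and anti-symmetric)
on the set of equivalence classes, defined by
$[x] \leq [y]$ if $x \leadsto y$.
This does not depend on the choice of the representatives $x$ and $y$, since $\leadsto$ is transitive.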
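
In `networkx`, the classes and the relation between them are captured by the **condensation** of $G$, computed by `nx.condensation`: a directed graph with one node per SCC and an edge between two classes whenever $G$ has an edge between some of their members. The sketch below uses an example graph and a helper `leq`, both assumptions made for this illustration, to decide $[x] \leq [y]$ by reachability in the condensation.

```python
import networkx as nx

G = nx.DiGraph([(0, 1), (1, 2), (2, 0), (2, 3), (4, 3), (5, 6), (6, 5)])

D = nx.condensation(G)
cls = D.graph["mapping"]  # maps each vertex x of G to the node of D representing [x]

def leq(x, y):
    """Decide [x] <= [y] via reachability in the condensation."""
    return nx.has_path(D, cls[x], cls[y])

print(nx.is_directed_acyclic_graph(D))
print(leq(0, 3), leq(3, 0))
print(leq(0, 4), leq(4, 0))
```

The condensation has no directed cycles, which reflects the anti-symmetry of $\leq$. In this example the classes $[0]$ and $[4]$ are not comparable in either direction, so the order is in general only partial.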

## Code Corner

### `random`

* `randint`: [[doc]](https://docs.python.org/2/library/random.html#random.randint) random integer

### `networkx`

* `cycle_graph`: [[doc]](https://networkx.github.io/documentation/networkx-1.9/reference/generated/networkx.generators.classic.cycle_graph.html)

* `watts_strogatz_graph`: [[doc]](https://networkx.github.io/documentation/stable/reference/generated/networkx.generators.random_graphs.watts_strogatz_graph.html)

* `draw_circular`: [[doc]](https://networkx.github.io/documentation/stable/reference/generated/networkx.drawing.nx_pylab.draw_circular.html)

##  Exercises

1. In terms of the parameters, $n$, $d$ and $p$, what is the clustering coefficient $C$ of an $(n, d, p)$-WS graph?

1. In terms of the parameters, $n$, $d$ and $p$, what is the average shortest path length $L$  of an $(n, d, p)$-WS graph?