## Graph

A **graph** is a pair of sets `(V, E)`, where `V` is the set of vertices (nodes) and `E` is the set of edges. For example, `V = {a,b,c,d,e,f,g,h,i}` and `E = {(a,b); (b,c); (c,e); (e,h); (h,i); (c,i)}`.

**A directed graph**, or digraph, is a graph in which edges have orientations.

![Directed graph](https://upload.wikimedia.org/wikipedia/commons/thumb/2/23/Directed_graph_no_background.svg/1280px-Directed_graph_no_background.svg.png)

**A weighted graph** is a graph in which a number (the weight) is assigned to each edge.

![Weighted graph](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f0/Weighted_network.svg/1920px-Weighted_network.svg.png)

**A multigraph** is a graph that is permitted to have multiple edges (also called parallel edges), that is, edges that have the same end nodes. Thus two vertices may be connected by more than one edge.

![Multigraph](https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Multi-pseudograph.svg/800px-Multi-pseudograph.svg.png)

## Connectivity 

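A **path** from $v_0$ to $v_4$ is an alternating sequence of vertices and edges $v_0 e_1 v_1 e_2 v_2 e_3 v_3 e_4 v_4$ in which each edge $e_i$ connects $v_{i-1}$ and $v_i$.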

In an undirected graph G, two vertices u and v are called **connected** if G contains a path from u to v.

A graph is said to be **connected** if every pair of vertices in the graph is connected.

A directed graph is called **weakly connected** if replacing all of its directed edges with undirected edges produces a connected (undirected) graph. 

It is **strongly connected**, or simply strong, if it contains a directed path from u to v and a directed path from v to u for every pair of vertices u, v.

A **connected component** is a maximal connected subgraph of an undirected graph. Each vertex belongs to exactly one connected component.

![Strongly connected components of a directed graph](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e1/Scc-1.svg/440px-Scc-1.svg.png)


# DFS

![Depth-first search animation](https://www.codesdope.com/staticroot/images/algorithm/dfs.gif)

Time complexity: $O(V+E)$

In [9]:
class Graph:
    def __init__(self):
        # adjacency list: vertex -> list of successors
        self.graph = {}

    # function to add a directed edge u -> v to the graph
    def addEdge(self, u, v):
        if u not in self.graph:
            self.graph[u] = []
        self.graph[u].append(v)

    def dfs(self, v, visited):
        visited.add(v)
        print(v, end=' ')
        # .get avoids a KeyError for vertices without outgoing edges
        for neighbour in self.graph.get(v, []):
            if neighbour not in visited:
                self.dfs(neighbour, visited)

    def DFSMain(self, v):
        visited = set()
        self.dfs(v, visited)
        
g = Graph()
g.addEdge(0, 1)
g.addEdge(0, 2)
g.addEdge(1, 2)
g.addEdge(2, 0)
g.addEdge(2, 3)
g.addEdge(3, 3)
g.DFSMain(2)


2 0 1 3 


The result of a depth-first search of a graph can be conveniently described in terms of a **spanning tree** of the vertices reached during the search:

![Tree edges of a depth-first search](https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Tree_edges.svg/1280px-Tree_edges.svg.png)


In [7]:
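entry, leave = {}, {}  # entry and leave time points for each node
timer = 0              # global clock, advanced on every entry and leave
def dfs(graph, u):
    global timer
    timer += 1
    entry[u] = timer          # u turns gray
    for v in graph.get(u, []):
        if v not in entry:    # v is still white
            dfs(graph, v)
    timer += 1
    leave[u] = timer          # u turns black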

### Lemma 1

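Throughout the search, a vertex is *white* if it has not been entered yet, *gray* if it has been entered but not left (it is on the recursion stack), and *black* if it has been left.

At no moment during a depth-first search is there an edge from a black vertex to a white vertex.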

**Proof**

Assume that such an edge (u,v) exists at moment *time*. leave[u] is the first moment when u is black, so

$leave[u] \leq time$

Hence v is white at moment leave[u], because it is still white at the later moment *time*. But then, at the moment of leaving u, the edge (u,v) to a white vertex is still unprocessed, whereas DFS leaves a vertex only after processing all of its edges. Contradiction.

### Lemma 2

Consider a DFS run on the graph G, and let entry[u] and leave[u] be the entry and leave time points of vertex u. Then between these two moments:

- Black and gray vertices of $G \setminus u$ do not change their color.

- White vertices of $G \setminus u$ either stay white or turn black. Moreover, exactly those vertices that are reachable from u along white paths (paths consisting only of vertices that are white at moment entry[u]) turn black.

**Proof**

A black vertex stays black.

A gray vertex stays gray because it lies on the recursion stack below u and cannot be left before u is.

A white vertex reachable along a white path turns black: otherwise, at moment leave[u], somewhere on the path to it there would be an edge from a black vertex to a white one, which contradicts Lemma 1.

Conversely, if a vertex turned black, it was entered through a chain of recursive calls starting at u, and that chain is a white path from u.

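- We can find cycles using DFS

If DFS finds an edge to a gray vertex (a back edge), then a cycle exists: the gray vertex is an ancestor on the recursion stack, so the tree path from it to the current vertex together with the back edge closes a cycle. In an undirected graph, the edge back to the DFS parent does not count as a back edge.

Conversely, if a cycle exists, DFS will definitely find it. Let v be the first vertex of cycle C entered by DFS. At time entry[v], all other vertices of C are white. Let (u,v) be the edge of C that enters v. By Lemma 2, the vertex u becomes black before we leave v, so u is processed while v is still gray. Therefore, when DFS examines the edge (u,v), it sees the gray vertex v.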
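The gray-vertex criterion can be sketched as follows for a directed graph. This is a minimal illustration, not part of the original notebook: the function name `has_cycle`, the color constants, and the dict-of-lists graph format are assumptions chosen for this example.

```python
WHITE, GRAY, BLACK = 0, 1, 2  # hypothetical color constants for this sketch

def has_cycle(graph):
    """Return True if the directed graph (dict: vertex -> successors) has a cycle."""
    # every vertex mentioned anywhere starts white
    color = {}
    for u, successors in graph.items():
        color[u] = WHITE
        for v in successors:
            color[v] = WHITE

    def dfs(u):
        color[u] = GRAY                    # u is on the recursion stack
        for v in graph.get(u, []):
            if color[v] == GRAY:           # back edge: cycle found
                return True
            if color[v] == WHITE and dfs(v):
                return True
        color[u] = BLACK                   # all edges of u are processed
        return False

    # start a DFS from every vertex that is still white
    return any(color[u] == WHITE and dfs(u) for u in list(color))

# the example graph used earlier in this section contains the cycle 0 -> 2 -> 0
print(has_cycle({0: [1, 2], 1: [2], 2: [0, 3], 3: [3]}))  # True
print(has_cycle({0: [1, 2], 1: [2], 2: []}))              # False
```

The recursion depth grows with the length of the longest DFS path, so for large graphs an explicit stack would be needed instead of recursion.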


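- We can count connected components in an undirected graph using DFS:

Iterate over all vertices and start a new DFS from every vertex that is still unvisited, incrementing the component counter at each start. Each such DFS visits exactly one connected component.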
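A minimal sketch of this counting procedure, applied to the example graph from the beginning of the notes. The function name `count_components` and its arguments (a list of vertices and a list of undirected edges) are assumptions made for this illustration.

```python
def count_components(vertices, edges):
    """Count connected components of an undirected graph."""
    # undirected adjacency list: every edge is stored in both directions
    adj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    visited = set()

    def dfs(u):
        visited.add(u)
        for w in adj[u]:
            if w not in visited:
                dfs(w)

    components = 0
    for v in vertices:
        if v not in visited:   # v belongs to a component not seen so far
            components += 1
            dfs(v)
    return components

V = list("abcdefghi")
E = [("a", "b"), ("b", "c"), ("c", "e"), ("e", "h"), ("h", "i"), ("c", "i")]
# components: {a, b, c, e, h, i}, {d}, {f}, {g}
print(count_components(V, E))  # 4
```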

## Topological sorting

We can topologically sort a directed acyclic graph $G(V,E)$. A topological sort is an ordering of the vertices V such that if $(u,v) \in E$, then u comes before v in the ordered array. A graph can have many valid topological sorts:

![Directed acyclic graph](https://upload.wikimedia.org/wikipedia/commons/thumb/0/03/Directed_acyclic_graph_2.svg/610px-Directed_acyclic_graph_2.svg.png)

- 5, 7, 3, 11, 8, 2, 9, 10 (visual top-to-bottom, left-to-right)
- 3, 5, 7, 8, 11, 2, 9, 10 (smallest-numbered available vertex first)
- 5, 7, 3, 8, 11, 10, 9, 2 (fewest edges first)
- 7, 5, 11, 3, 10, 8, 9, 2 (largest-numbered available vertex first)
- 5, 7, 11, 2, 3, 8, 9, 10 (attempting top-to-bottom, left-to-right)
- 3, 7, 8, 5, 11, 10, 2, 9 (arbitrary)


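For each vertex v, we can set $\phi(v) = |V|+1-leave[v]$ as its position in the topologically sorted array. Here leave[v] is assumed to number the vertices from 1 to |V| in the order in which DFS leaves them, so the vertex that is left last gets position 1.

We need to show that $\phi(u) < \phi(v)$ for each $(u,v) \in E$.

At moment entry[u], v is not gray: otherwise (u,v) would be a back edge and the graph would contain a cycle. Two cases remain:

1) If v is white, it will be processed during dfs(u), so

leave[v] < leave[u] => $\phi(u) < \phi(v)$

2) If v is black, it was already left before u was entered, so again leave[v] < leave[u] and $\phi(u) < \phi(v)$.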
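As an illustration, the ordering by decreasing leave time can be sketched like this. The function name `topological_sort` and the dict-of-lists input are assumptions of this example, and the edge list below is read off the figure above. The sketch does not check for cycles, so the input is assumed to be acyclic.

```python
def topological_sort(graph):
    """Return the vertices of a DAG (dict: vertex -> successors) in topological order."""
    visited = set()
    order = []  # vertices in order of increasing leave time

    def dfs(u):
        visited.add(u)
        for v in graph.get(u, []):
            if v not in visited:
                dfs(v)
        order.append(u)  # u is left only after all of its successors

    # collect vertices that appear as sources or only as targets
    vertices = set(graph) | {v for successors in graph.values() for v in successors}
    for u in sorted(vertices):
        if u not in visited:
            dfs(u)
    return order[::-1]  # decreasing leave time

dag = {5: [11], 7: [11, 8], 3: [8, 10], 11: [2, 9, 10], 8: [9]}
print(topological_sort(dag))  # [7, 5, 11, 3, 10, 8, 9, 2]
```

With start vertices tried in increasing numeric order, this run happens to reproduce one of the orderings listed above.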

## Kosaraju's algorithm

How do we find the strongly connected components of a graph $G(V,E)$?

1) Construct the transpose graph $H = G^t$, that is, G with every edge reversed.

![Transpose graph](https://upload.wikimedia.org/wikipedia/commons/d/d0/Amirali_reverse.jpg)


2) Run DFS(H) and collect $leave_H[v]$ for all vertices.

3) Run DFS(G), choosing the start vertices in descending order of their $leave_H[v]$. Each spanning tree of this DFS run contains the vertices of exactly one strongly connected component.

- First, we prove that each strongly connected component (SCC) is completely contained in one tree:

Let s and t be vertices of the same SCC. Then paths from s to t and from t to s exist. Let v be the first vertex entered by DFS among the vertices of the closed walk s -> t -> s. At moment entry[v], both s and t are reachable from v via white paths, so by Lemma 2 they are processed before we leave v and end up in the same tree as v.

![An SCC lies within a single DFS tree](img/Kosaraju_1.png)

- Second, one tree contains only one SCC

For an SCC C, define leave[C] = max(leave[v]) over all $v \in C$.

**Lemma**

Let C and C' be SCCs, and let an edge (u,v) lead from C to C' ($u \in C$, $v \in C'$). Then leave[C] > leave[C'].

a) If C was entered before C': let w be the first vertex of C entered by DFS. At that moment the entire component C' is white and reachable from w along white paths, so by Lemma 2 all of C' is left before w. Hence leave[C] >= leave[w] > leave[C'].

b) If C' was entered before C: no path from C' to C exists (otherwise C and C' would form one SCC), so the whole of C' is processed before any vertex of C is entered, and leave[C] > leave[C'].

![Edge between two SCCs](img/Kosaraju_2.png)

Now let's show that one tree T contains only one SCC.

Suppose T contains two SCCs, C and C', where C is the component entered first in DFS(G), and an edge (u,v) of G leads from C to C'.

$leave_H[C] > leave_H[C']$ (this follows from how T was constructed: its root has the largest $leave_H$ among the vertices not visited yet).

However, in H the edge is reversed and leads from C' to C, so the Lemma applied to DFS(H) gives $leave_H[C] < leave_H[C']$. Contradiction.
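The three steps can be sketched as follows. This is an illustrative, recursive version that follows the variant described here (DFS on the transpose first, then DFS on G); the function name `kosaraju_scc` and the dict-of-lists input are assumptions of this example.

```python
def kosaraju_scc(graph):
    """Return the strongly connected components of a directed graph
    (dict: vertex -> successors) as a list of vertex lists."""
    vertices = set(graph) | {v for successors in graph.values() for v in successors}

    # Step 1: build the transpose graph H (every edge reversed)
    transpose = {v: [] for v in vertices}
    for u, successors in graph.items():
        for v in successors:
            transpose[v].append(u)

    # Step 2: DFS on H, recording vertices in order of increasing leave time
    visited = set()
    finish_order = []

    def dfs_h(u):
        visited.add(u)
        for v in transpose[u]:
            if v not in visited:
                dfs_h(v)
        finish_order.append(u)

    for u in vertices:
        if u not in visited:
            dfs_h(u)

    # Step 3: DFS on G, starting from vertices in descending leave_H order;
    # every DFS tree collected here is one strongly connected component
    visited.clear()
    components = []

    def dfs_g(u, component):
        visited.add(u)
        component.append(u)
        for v in graph.get(u, []):
            if v not in visited:
                dfs_g(v, component)

    for u in reversed(finish_order):
        if u not in visited:
            component = []
            dfs_g(u, component)
            components.append(component)
    return components

g = {0: [1], 1: [2], 2: [0, 3], 3: [4], 4: [3]}
print(kosaraju_scc(g))  # two components: {0, 1, 2} and {3, 4}
```

The order of the components and of the vertices inside each component depends on the traversal order, so only the grouping itself is meaningful.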


# Breadth first search

Vertices are processed in order of increasing distance from the starting vertex. To achieve this, BFS keeps the discovered but not yet processed vertices in a queue.

In [17]:
from collections import deque
class Graph:
    def __init__(self):
        self.graph = {}
        self.vertices = set()
 
    # function to add an edge to graph
    def addEdge(self, u, v):
        if u not in self.graph:
            self.graph[u] = []
        self.graph[u].append(v)
        self.vertices.add(u)
        self.vertices.add(v)

    def bfs(self, v):
        # visited[u] becomes True as soon as u is put into the queue
        visited = {u: False for u in self.vertices}
        queue = deque([v])
        visited[v] = True
        while queue:
            current_node = queue.popleft()
            print(current_node, end=' ')
            # .get avoids a KeyError for vertices without outgoing edges
            for u in self.graph.get(current_node, []):
                if not visited[u]:
                    queue.append(u)
                    visited[u] = True
                
g = Graph()
g.addEdge(0, 1)
g.addEdge(0, 2)
g.addEdge(1, 2)
g.addEdge(2, 0)
g.addEdge(2, 3)
g.addEdge(3, 3)
g.bfs(2)

2 0 3 1 

- Black - vertex that has been extracted from the queue

- Gray - vertex that is in the queue

- White - vertex that has not yet been processed

At every moment during BFS, the queue contains vertices at some distance k from the start, followed by vertices at distance k + 1.

##  Shortest paths
Using BFS, we can find the shortest paths (measured in number of edges) from a given vertex to all other reachable vertices.

In [25]:
from collections import deque
class Graph:
    def __init__(self):
        self.graph = {}
        self.vertices = set()
 
    # function to add an edge to graph
    def addEdge(self, u, v):
        if u not in self.graph:
            self.graph[u] = []
        self.graph[u].append(v)
        self.vertices.add(u)
        self.vertices.add(v)

    def bfs(self, v):
        dist = {} # shortest distance from v to each vertex
        pi = {} # previous node on the shortest path
        for u in self.vertices:
            dist[u] = float("inf")
            pi[u] = -1
        queue = deque([v])
        dist[v] = 0
        while queue:
            current_node = queue.popleft()
            # .get avoids a KeyError for vertices without outgoing edges
            for u in self.graph.get(current_node, []):
                # true only on the first visit, while dist[u] is still infinite
                if dist[u] > dist[current_node]+1:
                    dist[u] = dist[current_node]+1
                    pi[u] = current_node
                    queue.append(u)
        print(dist)
                
g = Graph()
g.addEdge(0, 1)
g.addEdge(0, 2)
g.addEdge(1, 2)
g.addEdge(2, 0)
g.addEdge(2, 3)
g.addEdge(3, 3)
g.bfs(2)

{0: 1, 1: 2, 2: 0, 3: 1}
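The predecessor map can also be used to recover the path itself, not only its length. The sketch below is a standalone illustration: the function name `shortest_path` and the use of `None` (instead of -1) as the "no predecessor" marker are assumptions of this example.

```python
from collections import deque

def shortest_path(graph, source, target):
    """Return a shortest path (list of vertices) from source to target
    in an unweighted graph, or None if target is unreachable."""
    pi = {source: None}          # predecessor of each discovered vertex
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            break
        for u in graph.get(current, []):
            if u not in pi:      # u is discovered for the first time
                pi[u] = current
                queue.append(u)
    if target not in pi:
        return None
    # walk the predecessor links back from target to source
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = pi[node]
    return path[::-1]

g = {0: [1, 2], 1: [2], 2: [0, 3], 3: [3]}
print(shortest_path(g, 2, 1))  # [2, 0, 1]
print(shortest_path(g, 3, 0))  # None
```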


#  Eulerian walk

In an undirected graph, find a path (or cycle) that passes through every edge of the graph exactly once. Such a path or cycle is called Eulerian.

![Eulerian walk illustration](https://networkx.org/nx-guides/_images/part1.png)



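Let G be a connected undirected graph. A vertex is called odd (even) if its degree is odd (even).

An Euler path exists if and only if G has at most two odd vertices (that is, zero or two, since the number of odd vertices is always even).

An Euler cycle exists if and only if all vertices are even.


=> (necessity)

Every time the walk passes through an intermediate vertex, it uses 2 of its edges: one to enter and one to leave. Hence every vertex except possibly the first and the last one must be even, and in a cycle the start vertex is even as well.

<= (sufficiency)

An Euler cycle:
- Find some cycle C during the search and exclude its edges from G.
- Repeat for each component of the new graph G\C.
- Combine all found cycles into one.

An Euler path:

- Build a path from one odd vertex to the other.
- Subtract it from the graph.
- Repeat the same procedure as for the Euler cycle.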
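The construction above can be sketched with Hierholzer's algorithm, a stack-based way of finding cycles and splicing them together in one pass. It is named here explicitly because the notes do not prescribe a particular implementation; the function name `eulerian_walk` and the edge-list input are assumptions of this example.

```python
def eulerian_walk(edges):
    """Return an Eulerian path or cycle of an undirected multigraph as a
    list of vertices, or None if no such walk exists."""
    # adjacency list storing (neighbor, edge index) so that each edge
    # can be marked as used from both of its endpoints
    adj = {}
    for i, (u, v) in enumerate(edges):
        adj.setdefault(u, []).append((v, i))
        adj.setdefault(v, []).append((u, i))
    if not adj:
        return []

    odd = [v for v in adj if len(adj[v]) % 2 == 1]
    if len(odd) not in (0, 2):
        return None                      # degree condition is violated
    start = odd[0] if odd else next(iter(adj))

    used = [False] * len(edges)
    stack, walk = [start], []
    while stack:
        u = stack[-1]
        # drop edges that were already traversed from the other endpoint
        while adj[u] and used[adj[u][-1][1]]:
            adj[u].pop()
        if adj[u]:
            v, i = adj[u].pop()
            used[i] = True
            stack.append(v)              # extend the current trail
        else:
            walk.append(stack.pop())     # dead end: vertex is finalized

    if len(walk) != len(edges) + 1:
        return None                      # some edges are unreachable
    return walk[::-1]

print(eulerian_walk([(0, 1), (1, 2)]))          # [0, 1, 2]
print(eulerian_walk([(0, 1), (0, 2), (0, 3)]))  # None (four odd vertices)
```

When all vertices are even, the returned walk starts and ends at the same vertex, so it is an Euler cycle.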


# De novo Assembly

![Types of sequencing assembly](https://upload.wikimedia.org/wikipedia/commons/b/b6/Types_of_sequencing_assembly.png)




## Suffix-prefix match

Reads that come from the same region of the genome can overlap:

![Suffix-prefix match between two reads](img/suf_pref.png)

Overlaps may contain mismatches because of:

- sequencing errors
- polyploidy

Higher coverage yields more and longer overlaps.
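A minimal sketch of computing a suffix-prefix match between two reads. The function name `overlap` and the `min_length` parameter are assumptions of this example, and only exact matches are considered, so overlaps with mismatches are not detected.

```python
def overlap(a, b, min_length=3):
    """Return the length of the longest suffix of a that exactly matches
    a prefix of b, or 0 if it is shorter than min_length."""
    # try the longest candidate first and shrink it one base at a time
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

print(overlap("TTACGT", "CGTACCGT"))          # 3 ("CGT")
print(overlap("ACGT", "GTAC"))                # 0 (overlap "GT" is below the threshold)
print(overlap("ACGT", "GTAC", min_length=2))  # 2
```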

## Overlap graph

Vertices are reads (sequences). Two vertices are connected by a directed edge if the reads overlap: the edge goes from the read that has the common substring as a suffix to the read that has it as a prefix. Each edge is labeled with the length of the overlap.

We set a minimum overlap length as a threshold: shorter overlaps do not produce edges.

![Overlap graph](img/overlap_graph.png)



A sequence of the original genome can be built by walking a certain path in this graph.
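An overlap graph for a small set of reads can be sketched as a dictionary of labeled edges. The function name `overlap_graph`, the example reads, and the representation `{(a, b): overlap_length}` are assumptions of this illustration; matching is exact and `min_length` is assumed to be at least 1.

```python
from itertools import permutations

def overlap_graph(reads, min_length):
    """Return {(a, b): length} for every ordered pair of reads in which a
    suffix of a matches a prefix of b over at least min_length characters."""
    graph = {}
    for a, b in permutations(reads, 2):
        # longest exact suffix-prefix match, tried from long to short
        for length in range(min(len(a), len(b)), min_length - 1, -1):
            if a.endswith(b[:length]):
                graph[(a, b)] = length   # edge a -> b labeled with the overlap
                break
    return graph

reads = ["ACGGA", "GGATT", "ATTCG"]
print(overlap_graph(reads, 3))
# {('ACGGA', 'GGATT'): 3, ('GGATT', 'ATTCG'): 3}
```

Walking the path ACGGA -> GGATT -> ATTCG and merging along the overlaps gives ACGGATTCG. Comparing all ordered pairs takes quadratic time in the number of reads, which is acceptable only for small examples.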




## Shortest common superstring

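Given a set of strings S, find the shortest string that contains all strings in S as substrings.

The problem is NP-hard (its decision version is NP-complete).

This problem is similar to building an assembly from reads.

The result of concatenating strings with maximal overlaps depends on the order in which we concatenate them:

![Superstring length depends on the ordering](img/scs_not_greedy.png)


A brute-force search has to check n! different orderings.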
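A brute-force search over all orderings can be sketched as follows. The names `overlap` and `scs_brute_force` are assumptions of this example; the sketch further assumes that the input is non-empty and that no string is a substring of another, and it is practical only for a handful of strings.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def scs_brute_force(strings):
    """Try every ordering, merge neighbors along their maximal overlap,
    and keep the shortest superstring found."""
    best = None
    for ordering in permutations(strings):
        superstring = ordering[0]
        for s in ordering[1:]:
            # append only the part of s that is not covered by the overlap
            superstring += s[overlap(superstring, s):]
        if best is None or len(superstring) < len(best):
            best = superstring
    return best

result = scs_brute_force(["ABC", "BCA", "CAB"])
print(result, len(result))  # a superstring of length 5
```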


## Greedy algorithm

At each step we pick the edge that represents the longest overlap in the overlap graph and merge its two strings.

The result is not always optimal:

![Greedy shortest common superstring](img/scs_greedy.png)

**Such a greedy algorithm can collapse repeats that are present in the common string (genome) -> we want to update our model**
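A minimal sketch of the greedy merging loop. The names `overlap` and `greedy_scs` and the `min_length` parameter are assumptions of this example; ties between equally long overlaps are broken by iteration order, so other implementations may return a different superstring of the same kind.

```python
from itertools import permutations

def overlap(a, b, min_length=1):
    """Length of the longest suffix of a that is a prefix of b, or 0."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_scs(reads, min_length=1):
    """Repeatedly merge the pair of strings with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(reads, 2):
            length = overlap(a, b, min_length)
            if length > best_len:
                best_len, best_a, best_b = length, a, b
        if best_len == 0:
            break                      # no overlaps left: stop merging
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_len:])
    return "".join(reads)              # leftovers are simply concatenated

result = greedy_scs(["ABC", "BCA", "CAB"])
print(result, len(result))  # a superstring of length 5
```

Each round scans all ordered pairs again, which keeps the sketch short but is far from efficient.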

##  De Bruijn graph

Building a De Bruijn graph helps us overcome this repeat-collapsing problem.

Let's assume that our reads are k-mers from the genome.

Split each k-mer into its left and right (k-1)-mers, L and R (they are shifted by one base relative to each other), and draw an edge from L to R.

Each k-mer in the genome corresponds to one **edge** in this graph.

![De Bruijn graph](img/de_bruijn.png)

How do we reconstruct the original sequence from a De Bruijn graph? By finding an Eulerian path in the graph.

What if there is more than one Eulerian path in the graph?
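The construction described in this section can be sketched by listing one edge per k-mer. The function name `de_bruijn_edges` and the toy sequence are assumptions of this example; here the k-mers are taken directly from a known sequence, whereas in practice they would come from reads.

```python
def de_bruijn_edges(sequence, k):
    """Return the De Bruijn graph of a sequence as a list of directed
    edges (L, R), one edge per k-mer occurrence."""
    edges = []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        left, right = kmer[:-1], kmer[1:]   # left and right (k-1)-mers
        edges.append((left, right))
    return edges

print(de_bruijn_edges("AAABB", 3))
# [('AA', 'AA'), ('AA', 'AB'), ('AB', 'BB')]
```

A k-mer that occurs several times produces several parallel edges, so the result is in general a directed multigraph, which is why repeats are not collapsed.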

## Edge-connection

Vertices u and v in an undirected graph are **edge-connected** if there are two paths between them that share no edges (edge-disjoint paths).

This is an equivalence relation:

u ~ u

u ~ v => v ~ u

u ~ v, v ~ w => u ~ w


Proof sketch of transitivity. Let the cycle C be the union of two edge-disjoint paths from u to v.

Let P1 and P2 be edge-disjoint paths from w to v, and let a and b be the first vertices of P1 and P2, counting from w, that lie on C.

Then we can build 2 edge-disjoint paths between u and w:

- u -> a -> w

- u -> v -> b -> w

![Transitivity of edge-connectivity](img/edge_conn_trans.png)

Thus all vertices of the graph are divided into equivalence classes (edge-connected components):

![Bridges between edge-connected components](img/bridges.png)

A **bridge** is, equivalently:

- an edge connecting 2 different edge-connected components

- an edge whose removal splits its connected component

- an edge (u, v) that lies on every path connecting u with v.

### Finding bridges in graph

We can find bridges in a graph using DFS: a tree edge (v,to) is a bridge if there is no path from v to *to* other than the edge (v,to) itself.

During DFS for each vertex v compute lowest[v]:

$$
lowest[v] = \min \begin{cases}
entry[v] \\
entry[p] \\
lowest[to]
\end{cases}
$$

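where p ranges over the gray vertices reached from v by back edges (v,p), not counting the tree edge to the parent of v, and *to* ranges over the children of v in the DFS tree.

A tree edge (v,to) is a bridge if at the moment leave[to]

$lowest[to] > entry[v]$

This means that no vertex in the subtree of *to* has an edge leading to a vertex whose entry time is less than or equal to entry[v], so the subtree is attached to the rest of the graph only by the edge (v,to).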
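The computation of lowest[v] and the bridge test can be sketched as follows. The function name `find_bridges` and its arguments are assumptions of this example; the parent edge is skipped by its index rather than by the parent vertex, so that parallel edges are handled correctly.

```python
def find_bridges(vertices, edges):
    """Return the bridges of an undirected graph as a list of (v, to) pairs."""
    # adjacency list with edge indices, so the parent edge can be skipped
    adj = {v: [] for v in vertices}
    for i, (u, v) in enumerate(edges):
        adj[u].append((v, i))
        adj[v].append((u, i))

    entry, lowest, bridges = {}, {}, []
    timer = 0

    def dfs(v, parent_edge):
        nonlocal timer
        timer += 1
        entry[v] = lowest[v] = timer
        for to, i in adj[v]:
            if i == parent_edge:
                continue                      # do not go back along the tree edge
            if to in entry:                   # already entered: non-tree edge
                lowest[v] = min(lowest[v], entry[to])
            else:                             # white vertex: tree edge (v, to)
                dfs(to, i)
                lowest[v] = min(lowest[v], lowest[to])
                if lowest[to] > entry[v]:     # subtree of to cannot climb to v or above
                    bridges.append((v, to))

    for v in vertices:
        if v not in entry:
            dfs(v, None)
    return bridges

# a triangle 0-1-2 with a tail 2-3-4: both tail edges are bridges
print(find_bridges(range(5), [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]))
```

The bridges are reported in the order in which DFS leaves them, so the order of the output list depends on the traversal.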

## Vertex-connection

Two edges in a graph are **vertex-connected** if there are two vertex-disjoint paths that connect their ends.

This is an equivalence relation, so all graph edges are divided into equivalence classes:

(a,b) ~ (a,b)

(a,b) ~ (c,d) => (c,d) ~ (a,b)

(a,b) ~ (c,d), (c,d) ~ (e,f) => (a,b) ~ (e,f)


![Transitivity of vertex-connectivity](img/vert_conn_trans.png)

Path1: b -> y -> e

Path2: a -> x -> c -> d -> f


An **articulation point** is, equivalently:

- a vertex whose removal, together with the edges incident to it, splits its connected component

- a vertex that is incident to edges belonging to two or more vertex-connected components