# Graphs

A **graph** is a mathematical structure used to model pairwise relationships between entities. A graph consists of:

- **Nodes**: The objects
- **Edges**: The connections between objects, which can be directed (arrows) or undirected


### `__init__(self, g = {})`

Initialises the graph. The input `g` is a dictionary in the format:

```python
{
    "A": [("B", 3), ("C", 2)],
    "B": [("C", 1)]
}
```
## Basic Graph Operations

### `add_vertex(self, v)`
Adds a vertex `v` to the graph if it doesn’t exist.

### `add_edge(self, o, d, w)`
Adds an edge from origin `o` to destination `d` with weight `w`.

---

## Graph Inspection Methods

### `get_nodes(self)`
Returns a list of all nodes in the graph.

### `get_edges(self)`
Returns all edges in the graph as a list of `(origin, destination, weight)`.

### `print_graph(self)`
Prints the adjacency list: each node and its outgoing connections.

### `size(self)`
Returns a tuple: `(number of vertices, number of edges)`.

---

## Neighbourhood Functions

### `get_successors(self, v)`
Returns the list of nodes directly reachable from `v` (outgoing edges).

### `get_predecessors(self, v)`
Returns the list of nodes that point to `v` (incoming edges).

### `get_adjacents(self, v)`
Returns all nodes adjacent to `v` (successors + predecessors).

---

## Degree Calculations

### `out_degree(self, v)`
Returns the number of outgoing edges from `v`.

### `in_degree(self, v)`
Returns the number of incoming edges to `v`.

### `degree(self, v)`
Returns the total degree (in + out) of node `v`.

---

## Path & Distance

### `distance(self, s, d)`
Returns the shortest path distance from node `s` to `d` using Dijkstra’s algorithm.

### `shortest_path(self, s, d)`
Returns the actual shortest path as a list of nodes from `s` to `d`.

### `_dijkstra(self, source)`
Internal method implementing Dijkstra’s algorithm:

- Returns two dictionaries:
  - `dist[node]`: Shortest distance from source to node
  - `prev[node]`: Previous node in the shortest path

---

## Reachability

### `reachable_bfs(self, v)`
Returns all nodes reachable from `v` using Breadth-First Search(BFS).

### `reachable_dfs(self, v)`
Returns all nodes reachable from `v` using Depth-First Search(DFS).

### `reachable_with_dist(self, s)`
Returns a list of tuples `(node, distance)` representing all reachable nodes from `s` and their respective distances (using BFS-style levels).

---

## Cycle Detection

### `node_has_cycle(self, v)`
Checks whether a cycle exists starting and ending at node `v`.

### `has_cycle(self)`
Checks the entire graph for the presence of any cycle.

In [4]:
class MyGraph:
    """
    Basic directed weighted graph class providing essential methods for 
    graph manipulation and traversal, including shortest path computation, 
    reachability, and cycle detection.
    """

    def __init__(self, g = {}):
        self.graph = g

    def print_graph(self):
        for v in self.graph:
            print(v, "->", self.graph[v])

    def get_nodes(self):
        return list(self.graph.keys())

    def get_edges(self):
        edges = []
        for v in self.graph:
            for d, w in self.graph[v]:
                edges.append((v, d, w))
        return edges

    def size(self):
        return len(self.get_nodes()), len(self.get_edges())

    def add_vertex(self, v):
        if v not in self.graph:
            self.graph[v] = []

    def add_edge(self, o, d, w):
        if o not in self.graph:
            self.add_vertex(o)
        if d not in self.graph:
            self.add_vertex(d)
        self.graph[o].append((d, w))

    def get_successors(self, v):
        return [dest for dest, _ in self.graph[v]]

    def get_predecessors(self, v):
        preds = []
        for node in self.graph:
            for dest, _ in self.graph[node]:
                if dest == v:
                    preds.append(node)
        return preds

    def get_adjacents(self, v):
        return list(set(self.get_successors(v) + self.get_predecessors(v)))

    def out_degree(self, v):
        return len(self.graph[v])

    def in_degree(self, v):
        return len(self.get_predecessors(v))

    def degree(self, v):
        return self.in_degree(v) + self.out_degree(v)

    def distance(self, s, d):
        if s == d: return 0
        dist, _ = self._dijkstra(s)
        return dist.get(d, None)

    def shortest_path(self, s, d):
        if s == d: return [s]
        dist, prev = self._dijkstra(s)
        if d not in dist:
            return None
        path = []
        current = d
        while current != s:
            path.append(current)
            current = prev.get(current)
            if current is None:
                return None
        path.append(s)
        path.reverse()
        return path

    def _dijkstra(self, source):
        """
        Dijkstra's algorithm to compute shortest paths from a source node.
        """
        unvisited = {node: float('inf') for node in self.graph}
        unvisited[source] = 0
        prev = {}
        visited = {}

        while unvisited:
            u = min(unvisited, key=unvisited.get)
            current_dist = unvisited[u]
            visited[u] = current_dist
            del unvisited[u]

            for v, weight in self.graph[u]:
                if v in visited:
                    continue
                new_dist = current_dist + weight
                if new_dist < unvisited.get(v, float('inf')):
                    unvisited[v] = new_dist
                    prev[v] = u

        return visited, prev

    def reachable_bfs(self, v):
        """
        Perform BFS to find all reachable nodes from a given node.
        """
        l = [v]
        res = []
        while l:
            node = l.pop(0)
            if node != v: res.append(node)
            for elem, _ in self.graph[node]:
                if elem not in res and elem not in l and elem != node:
                    l.append(elem)
        return res

    def reachable_dfs(self, v):
        """
        Perform DFS to find all reachable nodes from a given node.
        """
        l = [v]
        res = []
        while l:
            node = l.pop(0)
            if node != v: res.append(node)
            s = 0
            for elem, _ in self.graph[node]:
                if elem not in res and elem not in l:
                    l.insert(s, elem)
                    s += 1
        return res

    def reachable_with_dist(self, s):
        """
        Perform BFS and return reachable nodes with their respective distances.
        """
        res = []
        l = [(s, 0)]
        while l:
            node, dist = l.pop(0)
            if node != s:
                res.append((node, dist))
            for elem, _ in self.graph[node]:
                if not is_in_tuple_list(l, elem) and not is_in_tuple_list(res, elem):
                    l.append((elem, dist + 1))
        return res

    def node_has_cycle(self, v):
        """
        Check if there is a cycle starting and ending at node v.
        """
        l = [v]
        visited = [v]
        while l:
            node = l.pop(0)
            for elem, _ in self.graph[node]:
                if elem == v:
                    return True
                elif elem not in visited:
                    l.append(elem)
                    visited.append(elem)
        return False

    def has_cycle(self):
        """
        Check if the graph contains any cycle.
        """
        for v in self.graph:
            if self.node_has_cycle(v):
                return True
        return False

def is_in_tuple_list(tl, val):
    """
    Helper function to check if a value is the first element of any tuple in a list.
    """
    for x, _ in tl:
        if val == x:
            return True
    return False

def test_graph():
    """
    Simple test function for the MyGraph class.
    """
    g = {
        1: [(2, 3), (3, 1)],
        2: [(3, 7), (4, 5)],
        3: [(4, 2)],
        4: []
    }
    wg = MyGraph(g)
    wg.print_graph()
    print("Nodes:", wg.get_nodes())
    print("Edges:", wg.get_edges())
    print("Shortest path 1->4:", wg.shortest_path(1, 4))
    print("Distance 1->4:", wg.distance(1, 4))
    print("Shortest path 2->4:", wg.shortest_path(2, 4))
    print("Distance 2->4:", wg.distance(2, 4))

if __name__ == "__main__":
    test_graph()

1 -> [(2, 3), (3, 1)]
2 -> [(3, 7), (4, 5)]
3 -> [(4, 2)]
4 -> []
Nodes: [1, 2, 3, 4]
Edges: [(1, 2, 3), (1, 3, 1), (2, 3, 7), (2, 4, 5), (3, 4, 2)]
Shortest path 1->4: [1, 3, 4]
Distance 1->4: 3
Shortest path 2->4: [2, 4]
Distance 2->4: 5


# Metabolic networks

Objective: building and analyzing a metabolic network derived from biochemical reaction data. The goal is to understand how metabolites interact, determine their significance within the network, and simulate how metabolic reactions unfold from a given starting point.

**Data (e-coli.txt)**

- Ecoli.txt contains biochemical reactions, each listed on a separate line.
- Reactions follow a standardized format, where substrates produce one or more products:
- A parser extracts and structures these reactions into a usable format, organizing them into dictionaries of substrates and products.

**Network Construction**

A custom graph structure (MN_Graph) is implemented to model the metabolic network.
Each metabolite is represented as a node.
Edges are created between metabolites that participate together in the same reaction.
The resulting graph captures the biochemical connectivity of the system.

**Centrality and Network Metrics**

To assess the influence of each metabolite, the network is analyzed using standard centrality measures:
Degree Centrality: Counts of direct connections a node has.
Closeness Centrality: Measures how easily a node can reach others in the network.
Betweenness Centrality: Highlights nodes that frequently occur on the shortest paths between other nodes.
These metrics help identify the most critical metabolites in terms of connectivity and influence.

**Dynamic Simulation of Metabolic Propagation**

Starting from an initial list of known metabolites, the system simulates how reactions progress:
identifies active reactions and adds newly produced products to the known set until there is no new metabolites.

**Results**

A ranked list of key metabolites based on centrality analysis.
The final set of metabolites that can be synthesized.

**Use Cases**

This type of analysis is valuable for:
Exploring and optimizing metabolic pathways.
Identifying targets for metabolic engineering.
Simulating biological behavior in synthetic biology applications.

In [2]:
from Graphs import MyGraph
import re
import heapq
from collections import deque


class MN_Graph(MyGraph):
    """
    Specialized graph class for metabolite networks, extending Graph with methods 
    for analyzing node degrees, centrality, and clustering.
    """
    def __init__(self, g = {}):
        super().__init__(g)

    def all_degrees(self, deg_type="inout"):
        """
        Return node degrees based on direction: 'in', 'out', or both.
        """
        degs = {}
        for v in self.graph:
            if deg_type in ("out", "inout"):
                degs[v] = len(self.graph[v])
            else:
                degs[v] = 0
        if deg_type in ("in", "inout"):
            for v in self.graph:
                for d in self.graph[v]:
                    if deg_type == "in" or v not in self.graph.get(d, []):
                        degs[d] = degs.get(d, 0) + 1
        return degs

    def highest_degrees(self, all_deg=None, deg_type="inout", top=10):
        """
        Return top nodes by degree.
        Parameters:
            all_deg (dict): Degree dict.
            deg_type (str): Degree type.
            top (int): Number of top nodes to return.
        """
        if all_deg is None:
            all_deg = self.all_degrees(deg_type)
        return sorted(all_deg, key=all_deg.get, reverse=True)[:top]

    def mean_degree(self, deg_type="inout"):
        """
        Calculate average degree of all nodes.
        """
        degs = self.all_degrees(deg_type)
        return sum(degs.values()) / len(degs)

    def prob_degree(self, deg_type="inout"):
        """
        Return the node degrees distribution as a probability.
        Parameters:
            deg_type (str): Degree type.

        Returns:
            dict: Degree value to probability.
        """
        degs = self.all_degrees(deg_type)
        hist = {}
        for d in degs.values():
            hist[d] = hist.get(d, 0) + 1
        return {k: v / len(degs) for k, v in hist.items()}

    def mean_distances(self):
        """
        Compute the average of shortest path length and the density of the graph.
        """
        total_dist = 0
        count = 0
        for node in self.get_nodes():
            for _, dist in self.reachable_with_dist(node):
                total_dist += dist
            count += len(self.reachable_with_dist(node))
        mean_dist = total_dist / count if count else 0
        n = len(self.get_nodes())
        density = count / (n * (n - 1)) if n > 1 else 0
        return mean_dist, density

    def closeness_centrality(self, node):
        """
        Returns closeness centrality for a given node.
        """
        dists = self.reachable_with_dist(node)
        if not dists:
            return 0.0
        return len(dists) / sum(dist for _, dist in dists)

    def highest_closeness(self, top=10):
        """
        Returns nodes with highest closeness centrality.
        """
        centrality = {n: self.closeness_centrality(n) for n in self.get_nodes()}
        return sorted(centrality, key=centrality.get, reverse=True)[:top]

    def betweenness_centrality(self, node):
        """
        Approximate betweenness centrality for a node.
        """
        total = 0
        through = 0
        for s in self.get_nodes():
            for t in self.get_nodes():
                if s != t and s != node and t != node:
                    path = self.shortest_path(s, t)
                    if path:
                        total += 1
                        if node in path:
                            through += 1
        return through / total if total > 0 else 0

    def clustering_coef(self, v):
        """
        Computes local clustering coefficient for a node.
        """
        neighbors = self.get_adjacents(v)
        if len(neighbors) <= 1:
            return 0.0
        links = 0
        for i in neighbors:
            for j in neighbors:
                if i != j and (j in self.get_successors(i) or i in self.get_successors(j)):
                    links += 1
        return links / (len(neighbors) * (len(neighbors) - 1))

    def all_clustering_coefs(self):
        """
        Returns clustering coefficients for all nodes.
        """
        return {v: self.clustering_coef(v) for v in self.get_nodes()}

    def mean_clustering_coef(self):
        """
        Average clustering coefficient across all nodes.
        """
        cc = self.all_clustering_coefs()
        return sum(cc.values()) / len(cc) if cc else 0.0

    def mean_clustering_perdegree(self, deg_type="inout"):
        """
        Average clustering coefficient grouped by degree.
        """
        degs = self.all_degrees(deg_type)
        ccs = self.all_clustering_coefs()
        grouped = {}
        for node, deg in degs.items():
            grouped.setdefault(deg, []).append(ccs[node])
        return {k: sum(v) / len(v) for k, v in grouped.items()}


class CentralityAnalyzer:
    """
    Centrality calculator using various metrics.
    """

    def __init__(self, graph):
        self.graph = graph

    def degree_centrality(self):
        """
        Return degree centrality of each node.
        """
        return {n: len(self.graph.get_successors(n)) for n in self.graph.get_nodes()}

    def closeness_centrality(self):
        """
        Return closeness centrality for all nodes.
        """
        result = {}
        for node in self.graph.get_nodes():
            dist, count = self._bfs_total_distance_and_reach_count(node)
            result[node] = (count / dist) if dist > 0 else 0.0
        return result

    def _bfs_total_distance_and_reach_count(self, start):
        """
        Breadth-first traversal for closeness computation.
        """
        visited = set()
        queue = deque([(start, 0)])
        total, count = 0, 0
        while queue:
            node, dist = queue.popleft()
            if node not in visited:
                visited.add(node)
                if node != start:
                    total += dist
                    count += 1
                queue.extend((n, dist + 1) for n in self.graph.get_successors(node) if n not in visited)
        return total, count

    def betweenness_centrality(self):
        """
        Compute node betweenness using Brandes' algorithm.
        """
        centrality = dict.fromkeys(self.graph.get_nodes(), 0.0)
        for s in self.graph.get_nodes():
            stack = []
            pred = {w: [] for w in self.graph.get_nodes()}
            sigma = dict.fromkeys(self.graph.get_nodes(), 0)
            dist = dict.fromkeys(self.graph.get_nodes(), -1)
            sigma[s], dist[s] = 1, 0
            queue = deque([s])
            while queue:
                v = queue.popleft()
                stack.append(v)
                for w in self.graph.get_successors(v):
                    if dist[w] < 0:
                        dist[w] = dist[v] + 1
                        queue.append(w)
                    if dist[w] == dist[v] + 1:
                        sigma[w] += sigma[v]
                        pred[w].append(v)
            delta = dict.fromkeys(self.graph.get_nodes(), 0)
            while stack:
                w = stack.pop()
                for v in pred[w]:
                    delta[v] += (sigma[v] / sigma[w]) * (1 + delta[w])
                if w != s:
                    centrality[w] += delta[w]
        return centrality

    def top_nodes(self, centrality_dict, top_n=5):
        """
        Return highest ranked nodes by centrality score.
        """
        return heapq.nlargest(top_n, centrality_dict.items(), key=lambda x: x[1])


def parse_reactions(file_path):
    """
    Parse reaction data into structured reaction dictionaries.
    """
    reactions = []
    with open(file_path) as f:
        for line in f:
            if ':' not in line:
                continue
            parts = re.split(r':\s*', line.strip(), maxsplit=1)
            if len(parts) != 2:
                continue
            reaction_id, formula = parts
            match = re.match(r"(.*?)\s*(<=>|=>)\s*(.*)", formula)
            if not match:
                continue
            substrates = [s.strip() for s in match.group(1).split('+')]
            products = [p.strip() for p in match.group(3).split('+')]
            reactions.append({'id': reaction_id, 'substrates': substrates, 'products': products})
    return reactions


def build_metabolite_graph(reactions):
    """
    Build a metabolite interaction graph from reaction data.
    """
    g = MN_Graph()
    for r in reactions:
        compounds = r['substrates'] + r['products']
        for i in range(len(compounds)):
            for j in range(i + 1, len(compounds)):
                g.add_edge(compounds[i], compounds[j], 1)
                g.add_edge(compounds[j], compounds[i], 1)
    return g


def get_active_reactions(metabolites_set, reactions):
    """
    Return reactions that can occur with the available substrates.
    """
    return [r for r in reactions if all(m in metabolites_set for m in r['substrates'])]

def get_produced_metabolites(active_reactions):
    """
    Extract products from active reactions.
    """
    return set(p for r in active_reactions for p in r['products'])

def compute_final_metabolites(initial_metabolites, reactions):
    """
    Iteratively expand metabolite set by applying reactions.
    """
    known = set(initial_metabolites)
    while True:
        active = get_active_reactions(known, reactions)
        new = get_produced_metabolites(active)
        if new.issubset(known):
            break
        known.update(new)
    return known

In [3]:
reactions_file = "ecoli.txt"

parsed_reactions = parse_reactions(reactions_file)
print(f"Number of reactions parsed: {len(parsed_reactions)}")

metabolite_graph = build_metabolite_graph(parsed_reactions)

centrality_analyzer = CentralityAnalyzer(metabolite_graph)

print("\nDegree Centrality:")
for metabolite, degree_val in centrality_analyzer.top_nodes(centrality_analyzer.degree_centrality()):
    print(f"{metabolite}: {degree_val}")

print("\nCloseness Centrality:")
for metabolite, closeness_val in centrality_analyzer.top_nodes(centrality_analyzer.closeness_centrality()):
    print(f"{metabolite}: {closeness_val:.4f}")

print("\nBetweenness Centrality:")
for metabolite, betweenness_val in centrality_analyzer.top_nodes(centrality_analyzer.betweenness_centrality()):
    print(f"{metabolite}: {betweenness_val:.4f}")

initial_metabolites = ["M_glc_DASH_D_c", "M_h2o_c", "M_nad_c", "M_atp_c"]

reachable_metabolites = compute_final_metabolites(initial_metabolites, parsed_reactions)

print("\nInitial Metabolites:")
print(initial_metabolites)

print("\nReachable Metabolites After Propagation:")
print(sorted(reachable_metabolites))

Number of reactions parsed: 931

Degree Centrality:
M_h_c: 2170
M_h2o_c: 1279
M_atp_c: 863
M_pi_c: 683
M_adp_c: 675

Closeness Centrality:
M_h_c: 0.8342
M_h2o_c: 0.6909
M_atp_c: 0.6056
M_pi_c: 0.5910
M_adp_c: 0.5864

Betweenness Centrality:
M_h_c: 348835.9756
M_h2o_c: 137598.9464
M_pi_c: 33431.8987
M_atp_c: 25958.4243
M_adp_c: 16836.3095

Initial Metabolites:
['M_glc_DASH_D_c', 'M_h2o_c', 'M_nad_c', 'M_atp_c']

Reachable Metabolites After Propagation:
['M_13dpg_c', 'M_23ddhb_c', 'M_23dhb_c', 'M_23dhba_c', 'M_23dhmb_c', 'M_2dda7p_c', 'M_2ddg6p_c', 'M_2me4p_c', 'M_34hpp_c', 'M_3dhq_c', 'M_3dhsk_c', 'M_3mob_c', 'M_3psme_c', 'M_4hbz_c', 'M_4per_c', 'M_6pgc_c', 'M_6pgl_c', 'M_ade_c', 'M_adn_c', 'M_adp_c', 'M_adphep_DASH_DD_c', 'M_adphep_DASH_LD_c', 'M_alac_DASH_S_c', 'M_amp_c', 'M_ara5p_c', 'M_atp_c', 'M_camp_c', 'M_cbp_c', 'M_chor_c', 'M_co2_c', 'M_db4p_c', 'M_dha_c', 'M_dhap_c', 'M_dnad_c', 'M_dxyl5p_c', 'M_e4p_c', 'M_f6p_c', 'M_fdp_c', 'M_for_c', 'M_fprica_c', 'M_g3p_c', 'M_g6p_c', 'M_gl

# Genome Assembly

### Conceptual Overview

**Purpose:**  
Genome assembly aims to reconstruct the original DNA sequence from a collection of overlapping fragments (k-mers), as produced by modern sequencing technologies.

**Main Strategies:**

1. **De Bruijn Graphs:**

   - Each k-mer is represented as a directed edge.
   - The nodes of the graph are all possible (k-1)-length prefixes and suffixes of the k-mers.
   - Assembly involves finding an **Eulerian path**—a path that traverses every edge exactly once.
   - The original sequence is reconstructed by following this path and concatenating the appropriate characters from each node.

2. **Overlap Graphs:**

   - Each k-mer is represented as a node.
   - An edge is drawn from node A to node B if the suffix of A (length k-1) matches the prefix of B.
   - The goal is to find a **Hamiltonian path**—a path that visits every node exactly once.
   - The sequence is assembled by joining the fragments along this path, using the overlaps to avoid redundancy.

**Comparison:**

- De Bruijn graphs are highly efficient for large datasets and handle repetitive regions well, making them the backbone of most modern assemblers.
- Overlap graphs are conceptually straightforward but computationally intensive for large numbers of fragments, as finding Hamiltonian paths is a hard problem.

---

### Implementation Details

**De Bruijn Graph Assembly:**

- For each k-mer, identify its prefix (first k-1 bases) and suffix (last k-1 bases).
- Add a directed edge from the prefix to the suffix.
- The resulting graph should be nearly balanced: one node with one extra outgoing edge (start) and one with one extra incoming edge (end).
- Temporarily add an edge to close the cycle, then use Hierholzer’s algorithm to find an Eulerian cycle.
- Remove the temporary edge to obtain the actual Eulerian path.
- Reconstruct the sequence by concatenating the first node and the last character of each subsequent node in the path.

**Overlap Graph Assembly:**

- Treat each fragment as a unique node (even if sequences repeat, assign unique identifiers).
- For every pair of fragments, create an edge if the suffix of one matches the prefix of another.
- Use backtracking to search for a Hamiltonian path—a path that visits each node exactly once.
- The final sequence is built by taking the full first fragment and, for each subsequent fragment in the path, appending only the last character.

**Practical Considerations:**

- De Bruijn graph methods are preferred for real-world genome assembly due to their scalability and robustness to errors and repeats.
- Overlap graphs are useful for illustrating the assembly concept but are less practical for large-scale datasets due to computational complexity.

In [5]:
def get_prefix(seq):
    """Return the prefix (all but last character) of a sequence."""
    return seq[:-1]

def get_suffix(seq):
    """Return the suffix (all but first character) of a sequence."""
    return seq[1:]

class DeBruijnGraph(MyGraph):
    """
    De Bruijn graph for genome assembly from k-mers.
    Nodes: (k-1)-mers; Edges: k-mers as transitions.
    """

    def __init__(self, kmers):
        super().__init__()
        for kmer in kmers:
            self.add_edge(get_prefix(kmer), get_suffix(kmer), 1)  # Peso 1

    def find_eulerian_path(self):
        """Find an Eulerian path if the graph is nearly balanced."""
        start, end = self._find_path_ends()
        if not start or not end:
            return None

        self.add_edge(end, start, 1) 

        path, stack = [], [start]
        local_edges = {u: list(vs) for u, vs in self.graph.items()}
        while stack:
            u = stack[-1]
            if local_edges.get(u):
                stack.append(local_edges[u].pop()[0])  
            else:
                path.append(stack.pop())
        path.reverse()

        for i in range(len(path) - 1):
            if path[i] == end and path[i + 1] == start:
                return path[i + 1:] + path[1:i + 1]
        return None

    def _find_path_ends(self):
        """Return (start, end) nodes for Eulerian path if nearly balanced."""
        start = end = None
        for node in self.graph:
            indeg = self.in_degree(node)
            outdeg = self.out_degree(node)
            if outdeg - indeg == 1:
                if start is None:
                    start = node
                else:
                    return None, None
            elif indeg - outdeg == 1:
                if end is None:
                    end = node
                else:
                    return None, None
            elif indeg != outdeg:
                return None, None
        return start, end

    def assemble_sequence(self, path):
        """Reconstruct sequence from Eulerian path."""
        if not path:
            return None
        return path[0] + ''.join(node[-1] for node in path[1:])

class OverlapGraph(MyGraph):
    """
    Overlap graph for genome assembly from k-mers.
    Nodes: uniquely labeled k-mers; Edges: overlap of k-1 between suffix and prefix.
    """

    def __init__(self, kmers):
        super().__init__()
        self.labeled = [f"{kmer}#{i}" for i, kmer in enumerate(kmers)]
        for label in self.labeled:
            self.add_vertex(label)
        for src in self.labeled:
            src_seq = src.split('#')[0]
            for dst in self.labeled:
                if src != dst and get_suffix(src_seq) == get_prefix(dst.split('#')[0]):
                    self.add_edge(src, dst, 1) 

    def find_hamiltonian_path(self):
        """Find a Hamiltonian path using DFS with pruning."""
        def dfs(path, visited):
            if len(path) == len(self.graph):
                return path
            for neighbor, _ in self.graph[path[-1]]:
                if neighbor not in visited:
                    res = dfs(path + [neighbor], visited | {neighbor})
                    if res:
                        return res
            return None

        for start in self.graph:
            res = dfs([start], {start})
            if res:
                return res
        return None

    def _extract_seq(self, label):
        return label.split('#')[0]

    def assemble_sequence(self, path):
        """Reconstruct sequence from Hamiltonian path."""
        if not path:
            return None
        return self._extract_seq(path[0]) + ''.join(self._extract_seq(n)[-1] for n in path[1:])

In [6]:
def generate_kmers(seq, k):
    """Generate all k-mers from a sequence (no sorting, preserves order)."""
    return [seq[i:i+k] for i in range(len(seq) - k + 1)]

def genome_assembly_all(seq, k):
    print("Original sequence:", seq)
    kmers = generate_kmers(seq, k)
    print("k-mers:", kmers)

    print("\n--- De Bruijn Graph Assembly ---")
    dbg = DeBruijnGraph(kmers) 
    dbg.print_graph()
    path = dbg.find_eulerian_path()
    if path:
        print("Eulerian path found:", path)
        print("Assembled sequence:", dbg.assemble_sequence(path))
    else:
        print("Eulerian path not found.")

    print("\n--- Overlap Graph Assembly ---")
    og = OverlapGraph(kmers) 
    og.print_graph()
    path = og.find_hamiltonian_path()
    if path:
        print("Hamiltonian path found:", path)
        print("Assembled sequence:", og.assemble_sequence(path))
    else:
        print("Hamiltonian path not found.")

genome_assembly_all('GATTACAGATTACAGGATCAGATTACA', 4)

Original sequence: GATTACAGATTACAGGATCAGATTACA
k-mers: ['GATT', 'ATTA', 'TTAC', 'TACA', 'ACAG', 'CAGA', 'AGAT', 'GATT', 'ATTA', 'TTAC', 'TACA', 'ACAG', 'CAGG', 'AGGA', 'GGAT', 'GATC', 'ATCA', 'TCAG', 'CAGA', 'AGAT', 'GATT', 'ATTA', 'TTAC', 'TACA']

--- De Bruijn Graph Assembly ---
GAT -> [('ATT', 1), ('ATT', 1), ('ATC', 1), ('ATT', 1)]
ATT -> [('TTA', 1), ('TTA', 1), ('TTA', 1)]
TTA -> [('TAC', 1), ('TAC', 1), ('TAC', 1)]
TAC -> [('ACA', 1), ('ACA', 1), ('ACA', 1)]
ACA -> [('CAG', 1), ('CAG', 1)]
CAG -> [('AGA', 1), ('AGG', 1), ('AGA', 1)]
AGA -> [('GAT', 1), ('GAT', 1)]
AGG -> [('GGA', 1)]
GGA -> [('GAT', 1)]
ATC -> [('TCA', 1)]
TCA -> [('CAG', 1)]
Eulerian path found: ['GAT', 'ATC', 'TCA', 'CAG', 'AGA', 'GAT', 'ATT', 'TTA', 'TAC', 'ACA', 'CAG', 'AGG', 'GGA', 'GAT', 'ATT', 'TTA', 'TAC', 'ACA', 'CAG', 'AGA', 'GAT', 'ATT', 'TTA', 'TAC', 'ACA']
Assembled sequence: GATCAGATTACAGGATTACAGATTACA

--- Overlap Graph Assembly ---
GAT -> [('ATT', 1), ('ATT', 1), ('ATC', 1), ('ATT', 1)]
ATT -> [(