Skip to content

alexandrutomescu/create-flow-graphs

Repository files navigation

Create flow de Bruijn graphs from a collection of ecoli genomes

This script creates de Bruijn graphs from the first g genomes in the directory ecoli. This directory contains 50 E. coli reference genomes from the dataset 3682 E. coli assemblies from NCBI.

A de Bruijn graph of order k built on a set of strings has a node for each k-mer, i.e. a string of length k in one of the strings (duplicated k-mers induce just a single node of the graph). We add an edge between two k-mers having a suffix-prefix overlap of (k-1) chracters if their merged (k+1)-mer appears in one of the strings. Since the genomes have repeated k-mers, this graph will contain cycles. Moreover, by construction, each of the g genomes can be "traced" as a walk from its first k-mer to its last k-mer. We also add a global source s connected to all such start k-mers, and a global sink t connected from all such ending k-mers.

To obtain flow values for the edges, we proceed as follows. First, for each of the g genomes, we assign a weigth from log-normal distribution of mean -4 and variance 4 (or mean 1 and variance 1, see parameter --distribution), as in this Bioinformatics package. Then, we initialize the flow values of all edge as 0, and for each (k+1)-mer in a genome of weight w, we add w to the flow vlue of the edge corresponding to that (k+1)-mer. Thus, each genome of weight w adds flow w to the s-t walk to which it corresponds in the graph. The resulting edge values are still a flow (statisfy flow conservation), and we can continue with the next genome.

We restrict the above procedure to non-overlapping windows of length --windowsize (default 2000) inside each of the g genomes. You can use the parameter --acyclic to keep only acyclic graphs, or the parameter --mincycles to keep only graphs containing at least a minimum number of cycles (these are computed with the NetworkX function simple_cycles).

Options:

  -h, --help            show this help message and exit
  -k KMERSIZE, --kmersize KMERSIZE [default 15]
                        The kmer size
  -w WINDOWSIZE, --windowsize WINDOWSIZE 
                        The length of the genome windows from which to build the graphs. Use 0 for whole genomes. [default 2000]. 
  -a, --acyclic
                        Keep only acyclic graphs
  -c MINCYCLES, --mincycles MINCYCLES 
                        Keep only graphs with at least this many cycles [default 0]
  -d DISTRIBUTION, --distribution DISTRIBUTION 
                        lognormal-44 or lognormal11 [default lognormal-44]
  -g NGENOMES, --ngenomes NGENOMES
                        The number of ecoli genomes from which to construct the graph
  -o OUTDIR, --outdir OUTDIR
                        outputdir
  -p, --pdf
                        Render PDF

Example

    python3 construct-flow-graph.py -o graphs-g15-w4000-acyc/ -k 15 -g 15 --acyclic --windowsize 4000 --pdf

The above command will produce a directory graphs-g15-w4000-acyc, which will contain de Bruijn graphs of order -k 15 from the first -g 15 E.coli genomes, in windows of length --windowsize 2000, only --acyclic, and also render PDFs for these --pdf. One such graph is gt15.kmer15.(4384000.4388000).V22.E42.acyc.graph which looks like:

Example E.coli Graph

Requirements:

    numpy
    networkx
    graphviz

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Languages